Friday, May 24, 2024

STAIR ONE: CLUSTERING

 CLUSTERING

Cluster analysis, or clustering, is the process of arranging data points into groups according to how similar they are to one another. It falls under unsupervised learning, which, in contrast to supervised learning, has no target variable and instead seeks to extract insights from unlabeled data.

The goal of clustering is to split a diverse dataset into homogeneous groups of data points. Similarity is first assessed using a metric such as Euclidean distance, cosine similarity, or Manhattan distance, and the points that are most similar to one another are then grouped together.
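
To make the metrics mentioned above concrete, here is a minimal sketch in plain Python. The function names and the two sample points `p` and `q` are made up for illustration and do not come from any library.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute differences along each coordinate.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors (1.0 means same direction).
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two hypothetical 2-D points used only for this illustration.
p = [1.0, 2.0]
q = [4.0, 6.0]

print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
print(cosine_similarity(p, q))   # about 0.992
```

Note that the two distances get smaller as points become more alike, while cosine similarity gets larger, so "most similar" means the smallest distance or the largest cosine value.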

For example, in the graph given below, we can clearly see three circular clusters forming on the basis of distance.

The clusters formed do not have to be circular, however. Clusters can have arbitrary shapes, and many algorithms work well at detecting arbitrarily shaped clusters.

For example, in the graph given below, we can see that the clusters formed are not circular in shape.

A) K-Means Clustering:

K-Means clustering is one of the most popular and straightforward clustering algorithms in unsupervised machine learning. It partitions a dataset into K distinct, non-overlapping subsets (clusters) based on similarity, where K is a user-defined parameter.

The K-Means algorithm works as follows:

  1. Initialization: Select K initial centroids randomly from the dataset. These centroids are the starting points for each cluster.
  2. Assignment: Assign each data point to the nearest centroid based on a distance metric (usually Euclidean distance). This forms K clusters.
  3. Update: Recalculate the centroid of each cluster by taking the mean of all data points assigned to that cluster.
  4. Convergence: Repeat the assignment and update steps until the centroids no longer change significantly, or a maximum number of iterations is reached.
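
The four steps listed above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not a production implementation: the function name `kmeans`, its parameters (`max_iters`, `tol`, `seed`), and the sample points are all hypothetical, and Euclidean distance is assumed as the metric.

```python
import math
import random

def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)

    for _ in range(max_iters):
        # 2. Assignment: give each point the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]

        # 3. Update: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    [sum(coord) / len(members) for coord in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])

        # 4. Convergence: stop once no centroid moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this toy data the two centroids should settle near (1.17, 1.5) and (8.5, 8.5), with the first three points sharing one label and the last three sharing the other. Which cluster gets label 0 depends on the random initialization. On real data, the result can also depend on the starting centroids, so it is common to run the algorithm several times with different seeds.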

Applications of Clustering in different fields:

  1. Marketing: It can be used to characterize & discover customer segments for marketing purposes.
  2. Biology: It can be used to classify different species of plants and animals.
  3. Libraries: It is used to cluster books on the basis of topics and information.
  4. Insurance: It is used to understand customers and their policies, and to identify fraud.
  5. City Planning: It is used to group houses and to study their values based on their geographical locations and other factors.
  6. Earthquake studies: By studying earthquake-affected areas, we can determine the dangerous zones.
  7. Image Processing: Clustering can be used to group similar images together, classify images based on content, and identify patterns in image data.

As I said earlier, I will take readers through a hands-on journey. I am providing a link to a freely accessible folder containing documents in which I have implemented the model at the simplest level, so that any beginner can easily understand it.
The models are implemented in Jupyter Notebook, an environment for building Python projects.


Kindly refer to the link provided below:



References:
1) https://medium.com/@pranav3nov/understanding-k-means-clustering-f5e2e84d2129


 




