Survey Paper – Clustering in Data Mining

September 23, 2018

...

[pic 2]

Figure 2 – Stages of Clustering

Clustering is usually one of the preliminary phases of analysis in data mining. It begins by identifying clusters and their relevant objects in order to build a structure of relationships and associations within the data, and the analysis grows from that point onward. Cluster modeling is therefore an ongoing process: further attributes are derived by other statistical means and can be used to determine how distinct the clusters are with respect to the desired output. For example, a new sales promotion is most useful when it targets the audience that is most likely to buy the goods.

Categorizations:

Clustering algorithms in data mining can be further classified into “partition-based algorithms”, “hierarchical-based algorithms”, “density-based algorithms” and “grid-based algorithms”.

Partition Algorithm:

Partitioning algorithms divide the data set into smaller subsets. Each subset is a cluster, although it is referred to as a partition when working with this kind of algorithm. The subsets are formed on the basis of the variables available in the data set, and the algorithm then iteratively relocates points between the partitions [3].

The partition structures built by the algorithm are further divided into k-medoids and k-means methods, which differ in how each partition is represented. In k-medoids, a partition is represented by a medoid: a data point located where the majority of the partition's points lie. Its advantages are that it places no limitation on the type of data in the cluster and that the medoid shifts toward the densest location, so it is less sensitive to points near the boundary, which mostly represent noise. In k-means, a partition is represented by its most central point, the mean, which is the average of all points in the cluster. “This works conveniently only with numerical attributes and can be negatively affected by a single outlier” [3].

[pic 3]

Figure 3 – Partitioned Clustering
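The difference between the two representatives can be illustrated with a minimal sketch. It assumes one-dimensional numeric data and absolute difference as the distance; the function names `centroid` and `medoid` and the sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def centroid(points):
    """k-means style representative: the arithmetic mean of the points."""
    return mean(points)


def medoid(points):
    """k-medoids style representative: the actual data point whose
    total distance to all other points in the cluster is smallest."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))


# Hypothetical cluster in which 100.0 is an outlier.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

print(centroid(cluster))  # 22.0, pulled toward the outlier
print(medoid(cluster))    # 3.0, stays in the dense part of the cluster
```

The mean is dragged far from the bulk of the data by a single outlier, while the medoid remains one of the central data points.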

Hierarchical Algorithm:

Hierarchical clustering [2][5][6] constructs a hierarchical breakdown of the data set based on a relationship or defined principle. The result is represented as a dendrogram, a tree-shaped structure that records how the records are progressively merged or divided according to that principle. The depth of the hierarchy can be chosen to suit the analysis, and finer granularity distinguishes the parent and child relationships between node levels. These levels provide a range of nodes, along with their relationships, for exploring the data.

Hierarchical clustering is categorized into “agglomerative” and “divisive”. Agglomerative clustering starts with single-point clusters and repeatedly merges the most similar clusters until it reaches the best possible combination of clusters. Divisive clustering works in the reverse direction: it starts from a single cluster that contains all data points and repeatedly divides it into the most appropriate clusters.

[pic 4]

Figure 4 – Representation of Agglomerative and Divisive
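The agglomerative direction can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes one-dimensional points, single-linkage distance (the smallest gap between two clusters) and a stopping rule based on a target number of clusters. The names `single_linkage`, `agglomerative` and `target_clusters` are hypothetical.

```python
def single_linkage(a, b):
    """Distance between two clusters: the smallest gap between any pair
    of points, one taken from each cluster (assumed linkage rule)."""
    return min(abs(x - y) for x in a for y in b)


def agglomerative(points, target_clusters):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]
    merges = []  # record of merges, i.e. the levels of a dendrogram
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters, merges


clusters, merges = agglomerative([1.0, 2.0, 9.0, 10.0, 25.0], 2)
print(clusters)  # [[1.0, 2.0, 9.0, 10.0], [25.0]]
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Each recorded merge corresponds to one join in the dendrogram. A divisive method would produce the same kind of tree from the top down.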

On the positive side, hierarchical clustering does not need prior information about the structure it is going to merge or divide, and it is simple and straightforward to implement. On the other hand, a merge or division cannot be reversed once the algorithm has performed it, and predicting the resulting structure is very difficult.

Density-based Algorithm:

Density-based algorithms identify clusters as the regions of the data space that have a higher density of objects. They scan the data once and try to construct arbitrarily shaped clusters while also handling noisy areas. Density is defined here as the number of objects in a particular region of the data space. Under this approach a cluster keeps growing as long as the density of the surrounding region satisfies the defined criterion.

Density Reachability - A position "a" is said to be density reachable from another position "b" if "a" is within region “x” of "b" and "b" has an adequate number of objects in its surroundings that are within region “x” [9].

Density Connectivity - Positions "a" and "b" are said to be density connected if there is another position "c" that has an adequate number of objects in its surroundings and both "a" and "b" are within region “x” of it. This is a chaining process: if "b" is in the surroundings of "c", "c" is in the surroundings of "d", and "d" is in the surroundings of "e", which in turn is in the surroundings of "a", then "b" is treated as a neighbor of "a" [9].
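The two definitions can be turned into a small DBSCAN-style sketch. It is illustrative only and makes several assumptions: points are 2-D tuples, distance is Euclidean, region “x” is the parameter `eps`, and the "adequate number of objects" is the parameter `min_pts` (counting the point itself). These names and the sample data are hypothetical.

```python
from math import dist

NOISE = -1


def neighbors(points, i, eps):
    """Indices of all points within distance eps of point i (itself included)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if no dense region reaches it."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(points, i, eps)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        labels[i] = cluster_id
        queue = list(nbrs)
        while queue:  # chain through density-reachable points
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reachable from a dense point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(points, j, eps)
            if len(j_nbrs) >= min_pts:  # j is itself dense, so keep growing
                queue.extend(j_nbrs)
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (50, 50) has too few objects around it and is not reachable from any dense point, so it is labeled as noise.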

A density-based algorithm does not require the number of clusters to be specified in advance. It can discover and handle the noisy positions that appear in the clustering plot. Its limitations are that it cannot handle neck-type data sets and that data sets with high variation in density are not handled properly, so the resulting clusters may be incorrect.

[pic 5]

Figure 5 – Density Algorithm - [9]

Grid-based Algorithm:

Grid-based clustering is better suited to handling the data space, and its analysis gives more importance to that space than to the individual data objects or data points. It uses a grid-based data structure along with density to construct clusters. It

At the beginning, it populates the source data into a predefined number of cells, which together form the grid structure, and then performs its algorithmic operations on the cells rather than on the individual objects. It is able to map an unbounded number of data objects arriving through the data flow onto a predetermined number of grid cells.
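The cell-population step can be sketched as below. The sketch makes simplifying assumptions that are not taken from the text: 2-D points, square cells of a fixed `cell_size` (instead of a predefined number of cells), a `min_count` density threshold, and clusters formed by joining dense cells that share an edge. All names and sample values are hypothetical.

```python
from collections import Counter


def grid_cells(points, cell_size):
    """Map each 2-D point to its grid cell and count the points per cell."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)


def dense_cell_clusters(points, cell_size, min_count):
    """Keep cells holding at least min_count points and join dense cells
    that share an edge into one cluster (assumed merge rule)."""
    counts = grid_cells(points, cell_size)
    dense = {cell for cell, n in counts.items() if n >= min_count}
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            cx, cy = stack.pop()
            component.append((cx, cy))
            for nxt in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nxt in dense and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


points = [(0.1, 0.2), (0.4, 0.9), (1.2, 0.3), (1.7, 0.8),
          (5.5, 5.5), (5.1, 5.9), (9.0, 0.5)]
print(dense_cell_clusters(points, cell_size=1.0, min_count=2))
# [[(0, 0), (1, 0)], [(5, 5)]]
```

After the counting pass the work depends on the number of occupied cells rather than on the number of data objects. The sparse cell containing (9.0, 0.5) falls below the threshold and is dropped.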

Grid-based algorithms stand out for their fast processing time, because the data objects are simply represented by a cell of the grid and each cell is treated as a single point. This architecture allows the algorithm to tune itself according to the amount of data arriving from the data flow. Grid density is able to manage noisy data points and

...
