Dunn index
The Dunn index (DI), introduced by J. C. Dunn in 1974, is a metric for evaluating clustering algorithms. It is part of a group of validity indices, including the Davies–Bouldin index and the Silhouette, in that it is an internal evaluation scheme, where the result of the clustering is evaluated based on the data clustered itself.
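As an illustrative sketch (not from the article), one common formulation divides the smallest inter-cluster distance by the largest intra-cluster diameter; the function name and the particular distance conventions below are assumptions, since several variants of both quantities exist:

```python
import numpy as np

def dunn_index(points, labels):
    """Sketch: smallest inter-cluster distance over largest intra-cluster diameter."""
    clusters = [points[labels == c] for c in np.unique(labels)]
    # Diameter of a cluster: the largest pairwise distance inside it.
    diameters = [np.linalg.norm(c[:, None] - c[None, :], axis=-1).max()
                 for c in clusters if len(c) > 1]
    # Separation: smallest distance between points of two different clusters.
    separations = [np.linalg.norm(a[:, None] - b[None, :], axis=-1).min()
                   for i, a in enumerate(clusters) for b in clusters[i + 1:]]
    return min(separations) / max(diameters)
```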
P4-metric
The P4 metric enables performance evaluation of a binary classifier. It is calculated from precision, recall, specificity and NPV (negative predictive value). P4 is designed in a similar way to the F1 metric, however addressing the criticisms levelled against F1, and may be perceived as its extension.
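A minimal sketch, assuming P4 is the harmonic mean of the four component measures (by analogy with F1 as the harmonic mean of precision and recall); the function name is hypothetical:

```python
def p4_score(tp, fp, tn, fn):
    """Sketch assuming P4 is the harmonic mean of the four component measures."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)             # negative predictive value
    return 4 / (1 / precision + 1 / recall + 1 / specificity + 1 / npv)
```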
Automatic clustering algorithms
Automatic clustering algorithms are algorithms that can perform clustering without prior knowledge of data sets. In contrast with other cluster analysis techniques, automatic clustering algorithms can determine the optimal number of clusters even in the presence of noise and outlier points.
MinHash
In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997), and was initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.
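A minimal sketch of the underlying idea: keep the minimum hash value of each set under many hash functions, and use the fraction of matching minima to estimate the Jaccard similarity. The salted-MD5 trick below is an illustrative stand-in for Broder's min-wise independent permutations:

```python
import hashlib

def minhash_signature(s, k=128):
    # One minimum per "hash function"; salting a single hash simulates k functions.
    return [min(int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16) for x in s)
            for i in range(k)]

def minhash_similarity(a, b, k=128):
    sa, sb = minhash_signature(a, k), minhash_signature(b, k)
    # Pr[minima agree] equals the Jaccard similarity of the two sets.
    return sum(x == y for x, y in zip(sa, sb)) / k
```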
Variation of information
In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.
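A minimal sketch of the usual formula VI(X; Y) = H(X) + H(Y) − 2I(X; Y), computed from two labelings of the same elements:

```python
import math
from collections import Counter

def variation_of_information(x, y):
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    hx = -sum(c / n * math.log(c / n) for c in px.values())   # entropy of labeling x
    hy = -sum(c / n * math.log(c / n) for c in py.values())   # entropy of labeling y
    # Mutual information from the joint distribution of the two labelings.
    mi = sum(c / n * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
             for (a, b), c in pxy.items())
    return hx + hy - 2 * mi
```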
Silhouette (clustering)
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified.
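A minimal sketch of the per-object score s(i) = (b − a) / max(a, b), where a is the mean distance to the object's own cluster and b is the smallest mean distance to any other cluster; averaging over all objects gives an overall score:

```python
import numpy as np

def mean_silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        own = labels == labels[i]
        a = d[own].sum() / max(own.sum() - 1, 1)   # mean distance to own cluster
        b = min(d[labels == c].mean()              # mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))         # singleton-cluster handling omitted
    return float(np.mean(scores))
```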
Determining the number of clusters in a data set
Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
SimHash
In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages. It was created by Moses Charikar.
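A minimal sketch of the Charikar-style construction: every feature votes on each bit of a fingerprint, and near-duplicate inputs yield fingerprints with a small Hamming distance. The 64-bit width and the use of MD5 are illustrative assumptions:

```python
import hashlib

def simhash(features, bits=64):
    counts = [0] * bits
    for f in features:
        h = int(hashlib.md5(f.encode()).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1   # each feature votes per bit
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")   # small distance suggests near-duplicates
```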
Elbow method (clustering)
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
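A minimal sketch using scikit-learn and matplotlib (assumed available): the within-cluster sum of squares is plotted against k, and the bend where the curve flattens is read off as the cluster count:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
ks = range(1, 10)
# inertia_ is the within-cluster sum of squares, a proxy for unexplained variation.
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```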
Jaccard index
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is often called the critical success index in meteorology. It was later developed independently by Paul Jaccard.
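A minimal sketch of the standard definition J(A, B) = |A ∩ B| / |A ∪ B|:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    # Size of the intersection over size of the union; 1.0 for two empty sets.
    return len(a & b) / len(a | b) if (a or b) else 1.0

print(jaccard({1, 2, 3}, {2, 3, 4}))   # 0.5: two shared elements out of four total
```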
Adjusted mutual information
In probability theory and information theory, adjusted mutual information (AMI), a variation of mutual information, may be used for comparing clusterings. It corrects the effect of agreement solely due to chance between clusterings, similar to the way the adjusted Rand index corrects the Rand index.
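The correction term, the expected mutual information under a random model, is tedious to write out, so this sketch leans on scikit-learn's implementation (assumed available) to contrast raw and adjusted scores:

```python
from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]   # same partition, different label names
# Raw MI depends on the number and sizes of clusters even for chance agreement;
# AMI subtracts the chance level and rescales, so 1.0 means identical partitions.
print(mutual_info_score(true_labels, pred_labels))
print(adjusted_mutual_info_score(true_labels, pred_labels))   # 1.0
```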
Similarity measure
In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects.
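As a hedged illustration of the "inverse of distance" idea, two common conventions for turning a distance into a similarity (both are assumptions, not definitions from the source):

```python
import math

def inverse_distance_similarity(d):
    return 1.0 / (1.0 + d)        # bounded in (0, 1], large when the distance is small

def kernel_similarity(d, gamma=1.0):
    return math.exp(-gamma * d)   # Gaussian-kernel convention, also widely used
```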
Fowlkes–Mallows index
The Fowlkes–Mallows index is an external evaluation method that is used to determine the similarity between two clusterings (clusters obtained after a clustering algorithm), and also a metric to measure confusion matrices.
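A minimal sketch via pair counting: the index is the geometric mean of pairwise precision and recall between the two labelings, FM = TP / sqrt((TP + FP)(TP + FN)):

```python
import math
from itertools import combinations

def fowlkes_mallows(labels_a, labels_b):
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        tp += same_a and same_b      # pair together in both clusterings
        fp += same_b and not same_a  # together only in the second
        fn += same_a and not same_b  # together only in the first
    return tp / math.sqrt((tp + fp) * (tp + fn))
```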
Dasgupta's objective
In the study of hierarchical clustering, Dasgupta's objective is a measure of the quality of a clustering, defined from a similarity measure on the elements to be clustered. It is named after Sanjoy Dasgupta, who formulated it in 2016.
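A minimal sketch over a binary tree encoded as nested tuples of leaf indices (an illustrative encoding, not from the source): each similar pair is charged the size of the smallest cluster in the hierarchy containing both elements, so good trees separate dissimilar pairs early:

```python
def dasgupta_cost(tree, sim):
    """tree: a leaf index or (left, right) nested tuples; sim[(i, j)] with i < j."""
    def leaves(t):
        return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])
    def cost(t):
        if not isinstance(t, tuple):
            return 0
        left, right = leaves(t[0]), leaves(t[1])
        size = len(left) + len(right)   # pairs split at this node pay |leaves(lca)|
        crossing = sum(sim.get((min(i, j), max(i, j)), 0)
                       for i in left for j in right)
        return size * crossing + cost(t[0]) + cost(t[1])
    return cost(tree)

# Example: leaves 0 and 1 merge first, then join with leaf 2.
print(dasgupta_cost(((0, 1), 2), {(0, 1): 1.0, (0, 2): 0.1, (1, 2): 0.1}))   # 2.6
```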
Simple matching coefficient
The simple matching coefficient (SMC) or Rand similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets. Given two objects, A and B, each with n binary attributes, SMC is defined as the number of matching attributes divided by the total number of attributes.
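A minimal sketch; note that SMC counts 0–0 agreements as matches, the key contrast with the Jaccard index:

```python
def simple_matching(a, b):
    # Fraction of positions where the two binary attribute vectors agree.
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(simple_matching([1, 0, 1, 1], [1, 1, 1, 0]))   # 0.5
```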
Rand index
The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements; this is the adjusted Rand index.
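A minimal sketch: the index is the fraction of element pairs on which the two clusterings agree, whether by placing the pair together in both or apart in both:

```python
from itertools import combinations

def rand_index(a, b):
    n = len(a)
    # A pair agrees if both clusterings put it together, or both put it apart.
    agreements = sum((a[i] == a[j]) == (b[i] == b[j])
                     for i, j in combinations(range(n), 2))
    return agreements / (n * (n - 1) / 2)
```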
F-score
In statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all positive results, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive.
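A minimal sketch of the traditional F1, the harmonic mean of precision and recall, computed from confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # true positives over all predicted positives
    recall = tp / (tp + fn)      # true positives over all actual positives
    # Harmonic mean: punishes imbalance between precision and recall.
    return 2 * precision * recall / (precision + recall)
```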
Davies–Bouldin index
The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. This is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset.
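A minimal sketch of the standard formula: for each cluster, take the worst-case ratio of summed within-cluster scatters to centroid separation, then average over clusters; the helper name is hypothetical:

```python
import numpy as np

def davies_bouldin(points, labels):
    ids = np.unique(labels)
    cents = np.array([points[labels == c].mean(axis=0) for c in ids])
    # Scatter s_i: mean distance of cluster members to their own centroid.
    scatter = np.array([np.linalg.norm(points[labels == c] - cents[k], axis=1).mean()
                        for k, c in enumerate(ids)])
    ratios = [max((scatter[i] + scatter[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ids)) if j != i)
              for i in range(len(ids))]
    return float(np.mean(ratios))   # lower means tighter, better-separated clusters
```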
Balanced clustering
Balanced clustering is a special case of clustering where, in the strictest sense, cluster sizes are constrained to ⌈n/k⌉ or ⌊n/k⌋, where n is the number of points and k is the number of clusters. A typical algorithm is balanced k-means, which minimizes the mean squared error (MSE).
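A minimal sketch of the strict size constraint itself, checking (rather than producing) a balanced clustering:

```python
from collections import Counter

def is_strictly_balanced(labels, k):
    n = len(labels)
    # Every cluster must contain either floor(n/k) or ceil(n/k) points.
    return all(size in (n // k, -(-n // k)) for size in Counter(labels).values())
```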
Hopkins statistic
The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set. It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test where the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed.
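A minimal sketch of one common variant of the statistic (conventions differ across sources): nearest-neighbour distances from uniform random probes over the data's bounding box are compared with those from probes sampled out of the data itself:

```python
import numpy as np

def hopkins_statistic(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)
    lo, hi = X.min(axis=0), X.max(axis=0)
    uniform_probes = rng.uniform(lo, hi, size=(m, d))   # probes over the bounding box
    sample_idx = rng.choice(n, size=m, replace=False)   # probes drawn from the data
    def nn(p, pts):
        return np.linalg.norm(pts - p, axis=1).min()
    u = sum(nn(p, X) for p in uniform_probes)
    w = sum(nn(X[i], np.delete(X, i, axis=0)) for i in sample_idx)
    return u / (u + w)   # about 0.5 for uniform data, near 1 for clustered data
```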