Intro to K-means Clustering


Posted by renjie on February 6, 2020


Clustering Techniques

Source material: IST707

Finding groups of objects (unsupervised learning):

  • Minimize the intra-cluster distance
  • Maximize the inter-cluster distance

Application: finding customer segments such that customers are similar within each segment but different across segments


  • Explore a large dataset
  • Classification without training data
  • Outlier detection.

(does not answer any question directly, but gives insights)

Two major types:

  • Partitional (flat) Clustering
    • Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
    • Typical methods: k-means, k-medoids, CLARANS, EM


  • Hierarchical clustering
    • Create a hierarchical decomposition of the set of data (or objects) using some criterion.
    • Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON



K-MEANS ALGORITHM

Centroid: Calculated as the average of all data examples in a cluster

  • For numeric variables, use the mean as the average
  • For nominal data, use the mode as the average (a small sketch follows)
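
A minimal sketch of how a centroid can be computed for mixed attribute types; the attribute names and values below are made up for illustration:

```python
import numpy as np
from collections import Counter

# Toy cluster with one numeric and one nominal attribute (made-up data).
ages = np.array([25, 30, 35])      # numeric attribute
colors = ["red", "blue", "red"]    # nominal attribute

centroid_age = ages.mean()                              # mean for numeric values -> 30.0
centroid_color = Counter(colors).most_common(1)[0][0]   # mode for nominal values -> "red"
```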

Algorithm steps

[Figure: k-means iteration steps, starting from the initial centroids]
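
A minimal from-scratch sketch of the k-means loop, assuming Euclidean distance and randomly generated toy data (function and variable names are mine, not from the course material):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random examples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every example to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned examples
        # (empty clusters are ignored here for brevity).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```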

Measuring the distance between two points

Distance between numeric values

  • Euclidean distance
  • Manhattan distance
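
A quick sketch of both numeric distance measures on two made-up points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sqrt(9 + 16 + 0) = 5.0
manhattan = np.sum(np.abs(x - y))          # 3 + 4 + 0 = 7.0
```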

Distance between categorical/nominal values

  • Simple matching

Pick one record as the reference and compare against it; the distance is the number of mismatched attributes divided by the total number of attributes.

  • Convert nominal to binary/numeric variables

Binary variables can be symmetric or asymmetric, i.e., whether 0 and 1 are equally important.

Compute the distance for each attribute, then aggregate the per-attribute distances to get the total distance (see the sketch below).
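
A minimal sketch of simple matching distance on two made-up nominal records, aggregating over all attributes:

```python
def simple_matching_distance(a, b):
    """Fraction of attributes whose values do not match."""
    mismatches = sum(1 for ai, bi in zip(a, b) if ai != bi)
    return mismatches / len(a)

rec1 = ["red", "small", "round"]
rec2 = ["red", "large", "square"]
print(simple_matching_distance(rec1, rec2))  # 2 mismatches out of 3 -> 0.67
```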

Measuring similarity

  • If a distance measure d is defined in the [0,1] range, similarity can be defined as 1 - d.
  • Cosine similarity: 0 means the two vectors are perpendicular, 1 means they point in the same direction (see the sketch after this list)
    Often used to compare document similarity (text mining, information retrieval)
    • Each word counts as one dimension
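
A sketch of cosine similarity on two made-up word-count vectors, where each word is one dimension:

```python
import numpy as np

# Word counts for two short documents; each position is one word/dimension.
doc1 = np.array([2, 1, 0, 1])
doc2 = np.array([1, 1, 0, 2])

cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
# 1.0 means the same direction, 0.0 means perpendicular (no words in common).
```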

Normalization

  • Different attributes use different units, so values need to be normalized so that the distance calculation is not dominated by a few columns (a sketch follows this list)
  • Documents of different lengths can also skew the results
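
A minimal sketch of min-max normalization (one common normalization choice; the columns and values are made up) so that no single column dominates the distance:

```python
import numpy as np

# Two attributes on very different scales: income (dollars) and age (years).
X = np.array([[50_000.0, 25.0],
              [80_000.0, 40.0],
              [120_000.0, 60.0]])

# Rescale every column to the [0, 1] range.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```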

Model evaluation criterion: SSE

SSE = Σ_i Σ_{x ∈ C_i} dist(x, m_i)², where m_i is the centroid of cluster C_i and the inner sum runs over all points x assigned to C_i.

Do not blindly increase k just to reduce the SSE; the choice of k should be guided by the actual problem.
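
A sketch of checking SSE with scikit-learn, whose inertia_ attribute is exactly this sum of squared distances; the toy data and the two k values are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

for k in (2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # SSE keeps dropping as k grows, which by itself does not make the larger k better.
    print(k, km.inertia_)
```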

K-means summary

K-means assigns points to clusters by iteratively updating the cluster centers. Different choices of initial centroids can affect the final result, which can be evaluated with SSE.

  • Strengths: efficient
  • Weaknesses: k must be chosen in advance; sensitive to noise and outliers; clusters cannot overlap (each point belongs to exactly one cluster)

K-Medoids

Because the k-means algorithm is very sensitive to outliers, medoids can be used in place of means.

K-Medoids: the medoid is the most centrally located object (data point) in a cluster.

PAM (Partitioning Around Medoids)

In each iteration, every possible pair of objects is considered, where one object in the pair is a medoid and the other is a non-medoid. For each such swap, the quality of the resulting clustering is estimated (a simplified sketch follows).
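
A simplified sketch of the PAM swap idea (brute-force cost evaluation over medoid/non-medoid pairs on made-up data; this is a sketch of the idea, not an efficient implementation):

```python
import numpy as np
from itertools import product

def total_cost(X, medoids):
    # Cost = sum of distances from every point to its nearest medoid.
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        non_medoids = [i for i in range(len(X)) if i not in medoids]
        # Try (medoid, non-medoid) swaps; accept a swap as soon as it lowers the cost.
        for m, o in product(range(k), non_medoids):
            candidate = medoids.copy()
            candidate[m] = o
            if total_cost(X, candidate) < total_cost(X, medoids):
                medoids, improved = candidate, True
                break  # rescan with the updated medoids
    return medoids

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
print(pam(X, k=2))
```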

Types of clustering

  • Hard clustering: each data point is assigned to exactly one cluster, e.g., K-Means.
    "Hard" means strict membership: a point that belongs to cluster A is in A only, and never partially in B.

  • Soft clustering: each data point is assigned to the clusters with certain probabilities, e.g., Gaussian Mixture Models (GMM) and Fuzzy c-Means. The result often looks like: sample 1 belongs to cluster A with probability 0.7 and to cluster B with probability 0.3. Soft clustering is also called fuzzy clustering (see the sketch after this list).
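
A sketch of soft assignments with scikit-learn's GaussianMixture on made-up toy data; each row of predict_proba gives one sample's membership probability in each cluster:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)
print(probs[0])  # membership probabilities of sample 0 in each of the two clusters
```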

Evaluating the quality of a clustering model

  • Cluster Cohesion: measures how closely related the objects within a cluster are
    • high intra-class similarity
    • SSE as a cohesion measure
  • Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
    • low inter-class similarity

These are the intra-cluster and inter-cluster distances mentioned at the beginning (a sketch of both measures follows).
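
A hedged sketch of computing cohesion (within-cluster SSE) and one common separation measure (weighted squared distance of each centroid to the overall mean), assuming labels produced by any clustering run:

```python
import numpy as np

def cohesion_separation(X, labels):
    overall_mean = X.mean(axis=0)
    cohesion, separation = 0.0, 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Cohesion: squared distances of members to their own centroid (the SSE).
        cohesion += np.sum((members - centroid) ** 2)
        # Separation: how far this cluster's centroid sits from the overall mean.
        separation += len(members) * np.sum((centroid - overall_mean) ** 2)
    return cohesion, separation
```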

Clustering summary

Because there is no training set, it is hard to treat clustering as a high-accuracy classification model. After a model is obtained, it takes time to interpret what each resulting cluster means, but as an unsupervised learning method it has a wide range of uses.