Intro to K-means Clustering


Posted by renjie on February 6, 2020


Clustering Techniques

Source material: IST707

Finding groups of objects (unsupervised learning):

  • Minimize the intra-cluster distance
  • Maximize the inter-cluster distance

Application: finding customer segments such that customers are similar within each segment but different across segments


  • Explore a large dataset
  • Classification without training data
  • Outlier detection.

(does not answer any question directly, but gives insights)

Two major types:

  • Partitional (flat) Clustering
    • Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
    • Typical methods: k-means, k-medoids, CLARANS, EM


  • Hierarchical clustering
    • Create a hierarchical decomposition of the set of data (or objects) using some criterion.
    • Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON



K-MEANS ALGORITHM

Centroid: Calculated as the average of all data examples in a cluster

  • For numeric variables, use the mean as the average
  • For nominal data, use the mode as the average (a small sketch follows)
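
A minimal sketch of how a centroid can be computed for mixed attribute types; the attribute names and values below are made up for illustration:

```python
import numpy as np
from collections import Counter

# Toy cluster with one numeric and one nominal attribute (made-up data).
ages = np.array([25, 30, 35])      # numeric attribute
colors = ["red", "blue", "red"]    # nominal attribute

centroid_age = ages.mean()                              # mean for numeric values -> 30.0
centroid_color = Counter(colors).most_common(1)[0][0]   # mode for nominal values -> "red"
```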

Algorithm steps

[Figure: k-means iteration steps, starting from the initial centroids]
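
A minimal from-scratch sketch of the k-means loop, assuming Euclidean distance and randomly generated toy data (function and variable names are mine, not from the course material):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random examples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every example to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned examples
        # (empty clusters are ignored here for brevity).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```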

Measuring the distance between two points

Distance between numeric values

  • Euclidean distance
  • Manhattan distance
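
A quick sketch of both numeric distance measures on two made-up points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sqrt(9 + 16 + 0) = 5.0
manhattan = np.sum(np.abs(x - y))          # 3 + 4 + 0 = 7.0
```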

Distance between categorical/nominal values

  • Simple matching

Pick one record as the reference and compare against it; the distance is the number of mismatched attributes divided by the total number of attributes.

  • Convert nominal to binary/numeric variables

Binary variables can be symmetric or asymmetric, i.e., whether 0 and 1 are equally important.

Compute the distance for each attribute, then aggregate the per-attribute distances to get the total distance (see the sketch below).
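
A minimal sketch of simple matching distance on two made-up nominal records, aggregating over all attributes:

```python
def simple_matching_distance(a, b):
    """Fraction of attributes whose values do not match."""
    mismatches = sum(1 for ai, bi in zip(a, b) if ai != bi)
    return mismatches / len(a)

rec1 = ["red", "small", "round"]
rec2 = ["red", "large", "square"]
print(simple_matching_distance(rec1, rec2))  # 2 mismatches out of 3 -> 0.67
```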

Measuring similarity

  • If a distance measure d is defined in the [0,1] range, similarity can be defined as 1 - d.
  • Cosine similarity: 0 means the two vectors are perpendicular, 1 means they point in the same direction (see the sketch after this list)
    Often used to compare document similarity (text mining, information retrieval)
    • Each word counts as one dimension
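
A sketch of cosine similarity on two made-up word-count vectors, where each word is one dimension:

```python
import numpy as np

# Word counts for two short documents; each position is one word/dimension.
doc1 = np.array([2, 1, 0, 1])
doc2 = np.array([1, 1, 0, 2])

cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
# 1.0 means the same direction, 0.0 means perpendicular (no words in common).
```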

Normalization

  • Different attributes use different units, so values need to be normalized so that the distance calculation is not dominated by a few columns (a sketch follows this list)
  • Documents of different lengths can also skew the results
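
A minimal sketch of min-max normalization (one common normalization choice; the columns and values are made up) so that no single column dominates the distance:

```python
import numpy as np

# Two attributes on very different scales: income (dollars) and age (years).
X = np.array([[50_000.0, 25.0],
              [80_000.0, 40.0],
              [120_000.0, 60.0]])

# Rescale every column to the [0, 1] range.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```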

Model evaluation criterion: SSE

SSE = Σ_i Σ_{x ∈ C_i} dist(x, m_i)², where m_i is the centroid of cluster C_i and the inner sum runs over all points x assigned to C_i.

Do not blindly increase k just to reduce the SSE; the choice of k should be guided by the actual problem.
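
A sketch of checking SSE with scikit-learn, whose inertia_ attribute is exactly this sum of squared distances; the toy data and the two k values are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

for k in (2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # SSE keeps dropping as k grows, which by itself does not make the larger k better.
    print(k, km.inertia_)
```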

K-means summary

K-means assigns points to clusters by iteratively updating the cluster centers. Different choices of initial centroids can affect the final result, which can be evaluated with SSE.

  • Strengths: efficient
  • Weaknesses: k must be chosen in advance; sensitive to noise and outliers; clusters cannot overlap (each point belongs to exactly one cluster)

K-Medoids

Because the k-means algorithm is very sensitive to outliers, medoids can be used in place of means.

K-Medoids: the medoid is the most centrally located object (data point) in a cluster.

PAM (Partitioning Around Medoids)

In each iteration, every possible pair of objects is considered, where one object in the pair is a medoid and the other is a non-medoid. For each such swap, the quality of the resulting clustering is estimated (a simplified sketch follows).
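
A simplified sketch of the PAM swap idea (brute-force cost evaluation over medoid/non-medoid pairs on made-up data; this is a sketch of the idea, not an efficient implementation):

```python
import numpy as np
from itertools import product

def total_cost(X, medoids):
    # Cost = sum of distances from every point to its nearest medoid.
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        non_medoids = [i for i in range(len(X)) if i not in medoids]
        # Try (medoid, non-medoid) swaps; accept a swap as soon as it lowers the cost.
        for m, o in product(range(k), non_medoids):
            candidate = medoids.copy()
            candidate[m] = o
            if total_cost(X, candidate) < total_cost(X, medoids):
                medoids, improved = candidate, True
                break  # rescan with the updated medoids
    return medoids

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
print(pam(X, k=2))
```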

Types of clustering

  • Hard clustering: each data point is assigned to exactly one cluster, e.g., K-Means.
    "Hard" means strict membership: a point that belongs to cluster A is in A only, and never partially in B.

  • Soft clustering: each data point is assigned to the clusters with certain probabilities, e.g., Gaussian Mixture Models (GMM) and Fuzzy c-Means. The result often looks like: sample 1 belongs to cluster A with probability 0.7 and to cluster B with probability 0.3. Soft clustering is also called fuzzy clustering (see the sketch after this list).
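
A sketch of soft assignments with scikit-learn's GaussianMixture on made-up toy data; each row of predict_proba gives one sample's membership probability in each cluster:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)
print(probs[0])  # membership probabilities of sample 0 in each of the two clusters
```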

Evaluating the quality of a clustering model

  • Cluster Cohesion: measures how closely related the objects within a cluster are
    • high intra-class similarity
    • SSE as a cohesion measure
  • Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
    • low inter-class similarity

These are the intra-cluster and inter-cluster distances mentioned at the beginning (a sketch of both measures follows).
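
A hedged sketch of computing cohesion (within-cluster SSE) and one common separation measure (weighted squared distance of each centroid to the overall mean), assuming labels produced by any clustering run:

```python
import numpy as np

def cohesion_separation(X, labels):
    overall_mean = X.mean(axis=0)
    cohesion, separation = 0.0, 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Cohesion: squared distances of members to their own centroid (the SSE).
        cohesion += np.sum((members - centroid) ** 2)
        # Separation: how far this cluster's centroid sits from the overall mean.
        separation += len(members) * np.sum((centroid - overall_mean) ** 2)
    return cohesion, separation
```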

Clustering summary

Because there is no training set, it is hard to treat clustering as a high-accuracy classification model. After a model is obtained, it takes time to interpret what each resulting cluster means, but as an unsupervised learning method it has a wide range of uses.