Clustering Techniques
Source of this material: IST707
Finding groups of objects. (unsupervised learning)
Minimize the intra-cluster distance
Maximize the inter-cluster distance
Application: finding customer segments such that customers are similar within each segment but different across segments
- Explore a large dataset
- Classification without training data
- Outlier detection.
(does not answer any question directly, but gives insights)
Two main types:
- Partitional (flat) Clustering
- Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
- Typical methods: k-means, k-medoids, CLARANS, EM
- Hierarchical clustering
- Create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
K-MEANS ALGORITHM
Centroid: Calculated as the average of all data examples in a cluster
- For numeric variables, use the mean as the average
- For nominal data, use the mode as the average
Algorithm workflow
1. Choose k initial centroids (e.g., k randomly selected examples).
2. Assign each example to its nearest centroid.
3. Recompute each centroid as the average of the examples assigned to it.
4. Repeat steps 2-3 until the assignments no longer change.
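A minimal NumPy sketch of this loop, assuming Euclidean distance and random initialization; the function and variable names are illustrative, not from the course material:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random examples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each example to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned examples
        # (assumes no cluster becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```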
Measuring the distance between two points
Distance between numeric values
- Euclidean distance
- Manhattan distance
Distance between categorical/nominal values
- Simple matching
Compare the two objects attribute by attribute; the distance is the number of mismatched attributes divided by the total number of attributes
- Convert nominal variables to binary/numeric variables
Binary variables can be symmetric or asymmetric, i.e., whether 0 and 1 are equally important
Compute the distance on each attribute and then aggregate the results to get the total distance
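A small sketch of these distance measures, assuming NumPy arrays; the example records and function names below are made up for illustration:

```python
import numpy as np

def euclidean(a, b):
    # Square root of the sum of squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute differences.
    return np.sum(np.abs(a - b))

def simple_matching_distance(a, b):
    # Fraction of nominal attributes that do NOT match.
    return np.mean(np.array(a) != np.array(b))

x = np.array([1.0, 2.0]); y = np.array([4.0, 6.0])
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
print(simple_matching_distance(["red", "S", "yes"], ["red", "M", "no"]))  # 2/3
```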
Measuring similarity
- If a distance measure d is defined in the [0,1] range, similarity can be defined as 1 - d.
- Cosine similarity measure
0 means the vectors are perpendicular; 1 means they point in the same direction
Often used to compare document similarity (text mining, information retrieval), where every word counts as one dimension
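A brief sketch of cosine similarity on word-count vectors; the two "documents" below are invented for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([3, 0, 1, 2])   # counts of 4 words in document 1
doc2 = np.array([1, 0, 0, 1])   # counts of the same 4 words in document 2
print(cosine_similarity(doc1, doc2))  # ~0.94: the vectors point in a similar direction
```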
Normalization
- Different attributes have different units, so normalization is needed to keep the distance calculation from being dominated by certain columns
- Differences in document length also affect the result
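A min-max normalization sketch, assuming a small numeric table where income would otherwise dominate the distance; the values are invented:

```python
import numpy as np

X = np.array([[25, 50000.0],    # age, income (illustrative values)
              [32, 64000.0],
              [47, 120000.0]])

# Rescale each column to [0, 1] so no single attribute dominates the distance.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_norm = (X - col_min) / (col_max - col_min)
print(X_norm)
```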
Model evaluation criterion (SSE)
Do not blindly increase k just to lower the SSE; choose k according to the actual problem
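A sketch of comparing SSE across several values of k, assuming scikit-learn is available (KMeans exposes the within-cluster SSE as inertia_); the synthetic data below has three well-separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic clusters centered at 0, 5, and 10.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # SSE drops sharply up to k=3, then flattens: more clusters always lower SSE,
    # but the gain past the "elbow" is small.
    print(k, round(km.inertia_, 1))
```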
K-means summary
K-means iteratively moves the cluster centroids to partition the data; different choices of initial centroids affect the final result, which can be evaluated with SSE.
- Strengths: efficient
- Weaknesses: k must be specified in advance; sensitive to noise and outliers; clusters cannot overlap (each point belongs to exactly one cluster)
K-Medoids
Because the k-means algorithm is very sensitive to outliers, medoids can be used instead of means
Medoid: the most centrally located object (data point) in a cluster.
PAM (Partitioning Around Medoids)
In each iteration, every possible pair of objects is examined, where one object in the pair is a medoid and the other is a non-representative object. For each such swap, the quality of the resulting clustering is estimated, and the best swap is kept.
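A simplified sketch of the PAM swap idea in NumPy; real PAM implementations use a more careful BUILD/SWAP procedure, and the data and function names here are illustrative only:

```python
import numpy as np

def total_cost(D, medoids):
    # Sum of distances from every point to its nearest medoid.
    return D[:, medoids].min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Precompute all pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))
    best = total_cost(D, medoids)
    improved = True
    while improved:
        improved = False
        # Try swapping each medoid with each non-medoid; keep any swap that lowers the cost.
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o
                cost = total_cost(D, candidate)
                if cost < best:
                    best, medoids, improved = cost, candidate, True
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids

# The last point is an outlier; medoids are less affected by it than means would be.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [20.0, 20.0]])
print(pam(X, k=2))
```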
Hard vs. soft clustering
- Hard clustering: each data point is assigned to exactly one cluster, e.g., K-Means. "Hard" means the assignment is definite: a point that belongs to cluster A belongs only to A and never partially to B.
- Soft clustering: each data point is assigned to the clusters with certain probabilities, e.g., Gaussian Mixture Models (GMM) or Fuzzy c-Means. A typical result is that sample 1 belongs to cluster A with probability 0.7 and to cluster B with probability 0.3. Soft clustering is also called fuzzy clustering; see the sketch below.
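A soft-clustering sketch using scikit-learn's GaussianMixture on two invented synthetic groups; predict_proba returns membership probabilities rather than hard labels:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic groups centered at 0 and 4.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(4, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X[:3])
# Each row gives the sample's probability of belonging to each cluster,
# e.g., roughly [0.97, 0.03]-style soft memberships instead of a single label.
print(probs)
```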
How to evaluate the quality of a clustering model
- Cluster Cohesion: Measures how closely related the objects in a cluster are
- high intra-class similarity
- SSE as a cohesion measure
- Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
- low inter-class similarity
These are the intra-cluster and inter-cluster distances mentioned at the beginning
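A sketch that computes cohesion (within-cluster SSE) and separation (between-cluster sum of squares) directly with NumPy; the data and labels are invented:

```python
import numpy as np

def cohesion_separation(X, labels):
    overall_mean = X.mean(axis=0)
    sse, bss = 0.0, 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Cohesion: within-cluster SSE (lower is better).
        sse += ((members - centroid) ** 2).sum()
        # Separation: weighted squared distance of each centroid
        # from the overall mean (higher is better).
        bss += len(members) * ((centroid - overall_mean) ** 2).sum()
    return sse, bss

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
print(cohesion_separation(X, labels))
```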
Clustering summary
Because there is no training set, clustering is hard to treat as a high-accuracy classification model. Once the model is built, time must be spent interpreting what each resulting cluster means, but as an unsupervised learning method it has a wide range of uses.