Association Rules(关联规则)入门
Frequent Pattern Analysis
Association rule mining: Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
根据不同物件出现的频率找规则
Application:
- 产品组合推荐(网购网站,亚马逊等)
- 菜单设计
- 网页设计(点击流分析)
- dna序列分析
特点:
- Actionable, 发现的规律可以及时应用
- Trivial, 有时候没什么用
- Inexplicable, 有时候难以解释规则的原因
基本概念
itemset: a collection of one or more items. k-itemset contains k items
3-itemset: {A,B,C}:0, {B, E, F}:2
Association Rule
Association rules are generated based on frequent itemsets. We can split a frequent itemsets into two subsets, put one on the LHS, the other on the RHS.
e.g. { E, F } -> { B } 表示当EF出现的时候,B大概率出现的规则
Metrics to evaluate the rule’s strength
Support P(X, Y)
- Fraction of transactions that contain both X and Y
- Support({E, F} -> {B}) = support_count({B,E,F}) / N = 2/5
how many transactions contain them
**Confidence P(Y | X)=P(X, Y)/P(X)** |
- How frequently items in Y appear in transactions that contain X
- confidence({E,F} -> {B}) = support({B,E,F}} / support({E,F})
conditional probability when people bought x, how likely it they also bought Y
###Apriori algorithm Given a set of transactions T, the goal of association rule mining is to find all rules where:
- support ≥ minsup threshold
- confidence ≥ minconf threshold
算法:
-
Brute-force: 找出所有项,筛出满足条件的项
-
Frequent Itemset Generation:如果一个项集是频繁的,那它所有的子项集也都是频繁的
衡量相关性Lift
那么我们需要怎样大小的support,confidence和lift值呢?
AR小结
作为data mining的一种相当有用的算法,关联规则可以找到一些非常具有insights的规则供我们使用。而Apriori为其中一种算法,可以在r语言中简单使用。在高confidence和lift的rules中选择我们感兴趣的关联规则。