Introduction to Association Rules

ist707_2

Posted by renjie on February 9, 2020



Frequent Pattern Analysis

Association rule mining: given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the same transaction.

In other words, we look for rules based on how frequently different items occur together.

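Applications:

  • Product bundle recommendations (online shopping sites such as Amazon)
  • Menu design
  • Web page design (clickstream analysis)
  • DNA sequence analysis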

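Characteristics of the rules found:

  • Actionable: the discovered patterns can be put to use right away
  • Trivial: sometimes a rule is of little use
  • Inexplicable: sometimes it is hard to explain why a rule holds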

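Basic Concepts

Itemset: a collection of one or more items. A k-itemset contains k items.

Examples of 3-itemsets with their support counts (the number of transactions that contain the itemset): {A, B, C}: 0, {B, E, F}: 2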
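To make the support counts concrete, here is a minimal Python sketch. The five transactions are hypothetical, chosen only so that the counts agree with the example above ({A, B, C}: 0 and {B, E, F}: 2). They are not the original dataset, and `support_count` is an illustrative helper name.

```python
# Hypothetical transactions (an assumption, not the original data),
# chosen so the counts match the example in the text.
transactions = [
    {"A", "B"},
    {"B", "E", "F"},
    {"A", "C", "E"},
    {"B", "C", "E", "F"},
    {"A", "E", "F"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"A", "B", "C"}, transactions))  # 0
print(support_count({"B", "E", "F"}, transactions))  # 2
```

The subset test `itemset <= t` is what "the transaction contains the itemset" means here: extra items in the transaction do not matter.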

Association Rule

Association rules are generated from frequent itemsets. We can split a frequent itemset into two non-empty subsets and put one on the LHS and the other on the RHS.

e.g. {E, F} -> {B} is a rule saying that when E and F appear, B is likely to appear as well.
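The splitting step can be sketched in a few lines of Python. This is only an illustration of enumerating every LHS/RHS split of one itemset. The function name `candidate_rules` is an assumption, and a real miner would also filter the resulting rules by confidence.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield (lhs, rhs) tuples for every split of `itemset` into two non-empty parts."""
    items = sorted(itemset)
    for k in range(1, len(items)):
        for lhs in combinations(items, k):
            rhs = tuple(i for i in items if i not in lhs)
            yield lhs, rhs

rules = list(candidate_rules({"B", "E", "F"}))
for lhs, rhs in rules:
    print(lhs, "->", rhs)
print(len(rules))  # 6
```

A k-itemset yields 2^k - 2 candidate rules, so the 3-itemset {B, E, F} gives 6, including {E, F} -> {B}.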

Metrics to evaluate the rule’s strength

Support P(X, Y)

  • Fraction of transactions that contain both X and Y
  • Support({E, F} -> {B}) = support_count({B,E,F}) / N = 2/5

In other words, support measures what share of all transactions contain every item in the rule.

Confidence P(Y | X) = P(X, Y) / P(X)

  • How frequently items in Y appear in transactions that contain X
  • confidence({E, F} -> {B}) = support({B, E, F}) / support({E, F})

This is a conditional probability: given that people bought X, how likely is it that they also bought Y?
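Both metrics can be computed directly from their definitions. The sketch below uses hypothetical transactions, chosen only so that support({B, E, F}) comes out to 2/5 as in the example above. The confidence value it prints follows from that made-up data and is not a figure from the original dataset.

```python
from fractions import Fraction

# Hypothetical transactions (an assumption, not the original data).
transactions = [
    {"A", "B"},
    {"B", "E", "F"},
    {"A", "C", "E"},
    {"B", "C", "E", "F"},
    {"A", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs) = support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"B", "E", "F"}, transactions))       # 2/5
print(confidence({"E", "F"}, {"B"}, transactions))  # 2/3 (for this hypothetical data)
```

Using `Fraction` keeps the results exact, which makes it easy to compare them with hand calculations.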


Apriori Algorithm

Given a set of transactions T, the goal of association rule mining is to find all rules where:

  • support ≥ minsup threshold
  • confidence ≥ minconf threshold

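Approaches

  1. Brute force: enumerate all candidates, then keep only those that satisfy the thresholds.

  2. Frequent itemset generation: if an itemset is frequent, then all of its subsets are frequent too. Conversely, any itemset that has an infrequent subset cannot be frequent, so it can be pruned without counting it.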
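The level-wise search that the second approach enables can be sketched as follows. This is a minimal, unoptimized illustration under stated assumptions: the transactions are hypothetical, the function name `apriori_frequent_itemsets` is made up for this example, and library implementations use more efficient candidate generation and counting.

```python
from itertools import combinations

# Hypothetical transactions (an assumption, not the original data).
transactions = [
    {"A", "B"},
    {"B", "E", "F"},
    {"A", "C", "E"},
    {"B", "C", "E", "F"},
    {"A", "E", "F"},
]

def apriori_frequent_itemsets(transactions, minsup):
    """Return {frozenset: support} for every itemset with support >= minsup."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that meet the support threshold.
        current = {c for c in candidates if support(c) >= minsup}
    return frequent

frequent = apriori_frequent_itemsets(transactions, minsup=0.4)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With this data and minsup = 0.4, the prune step rejects every 3-item candidate except {B, E, F} before any counting happens, because each of the others contains an infrequent pair such as {A, B}.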

Measuring correlation: Lift
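Lift is commonly defined as lift(X -> Y) = confidence(X -> Y) / support(Y) = P(X, Y) / (P(X) P(Y)). A value above 1 suggests that X and Y occur together more often than they would if they were independent, a value near 1 suggests independence, and a value below 1 suggests a negative association. The sketch below computes it on hypothetical transactions, so the printed number follows from that made-up data and is not a result from the original dataset.

```python
from fractions import Fraction

# Hypothetical transactions (an assumption, not the original data).
transactions = [
    {"A", "B"},
    {"B", "E", "F"},
    {"A", "C", "E"},
    {"B", "C", "E", "F"},
    {"A", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def lift(lhs, rhs, transactions):
    """P(lhs, rhs) / (P(lhs) * P(rhs)); 1 means the two sides look independent."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint / (support(lhs, transactions) * support(rhs, transactions))

print(lift({"E", "F"}, {"B"}, transactions))  # 10/9 (for this hypothetical data)
```

Unlike confidence, lift corrects for how common Y is on its own, so a rule whose RHS appears in almost every transaction does not look strong by default.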


So how large should the support, confidence, and lift values be?

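Summary of Association Rules

As a highly useful data mining technique, association rule mining can surface insightful rules for us to act on. Apriori is one algorithm for it, and it is easy to use in R. From the rules with high confidence and lift, we then pick the ones that interest us.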