Introduction to Data Mining

<<< Previous    Up    Next >>>

Lesson 6.6.3.2

Speeding Searching and Matching

 

Searching and Matching for candidate itemsets can be speeded by using hash counts to filter candidates. When counting candidate k-1 itemsets, get counts for the "hash-groups" of k-itemsets instead of getting counts for all k-itemsets. The procedure is explained here.

  1. Use hash function, h on the k-itemsets.

  2. For each transaction t, if a k-itemset s is a subset of t, then add one to the count of h(s).

  3. Remove all candidates, q (generated by Apriori) such that h(q)'s count is < minSupp.

The idea is quite useful for candidates of size k = 2, but often not so useful elsewhere. For sparse datasets, the step for k=2, can be the most expensive for Apriori.

 

An Hash based example

We have transactional dataset, D = [134, 235, 1235, 25].

Suppose hash function, h is: h(x,y) = ((order of x)*10 + (order of y)) mod 7.

        where order of 1 is 1, order of 2 is 2, and so on.

E.g.     h(1,4) = 0,    h(1,5) = 1,     h(2,4) = 3,    ........

             bucket0        bucket1        bucket2        bucket3        bucket4        bucket5        bucket6

                14                 15                 23               24                 25                12                13

                35                                                                                                                      34   

counts     3                    1                   2                0                    3                 1                  3

Then 2-itemsets hashed to buckets 1 and 5 cannot be frequent (e.g. 15, 12), so remove them from C2.

 

<<< Previous    Up    Next >>>