How the beer diaper correlation algorithm came to be

I. Background:

In a supermarket, big data analysis revealed a particularly interesting phenomenon: the sales curves of diapers and beer, two seemingly unrelated commodities, turned out to be surprisingly similar, so the two products were placed together on the shelves. Unexpectedly, this move led to a dramatic increase in the sales of both diapers and beer. This is not a joke, but a real-life big data example from the Walmart supermarket chain in the US that has been much talked about by businesses.

In the U.S., mothers often stay at home with their children, so they frequently ask their husbands to buy diapers on the way home from work, and the husbands pick up their favorite beer while they are at it.

This discovery has led to big profits for businesses. But how do you uncover the connection between beer and diaper sales from the vast, cluttered mass of big data? And what does it teach us?

It's all about associations!

Association is actually very simple: several things or events frequently occur together. "Beer + diapers" is a very typical pair of associated goods. Association reflects knowledge of the dependence or connection between one event and other events.

In the English literature, two words describe this notion: relevance and association. Both can describe the degree of association between events, but the former is mainly used for Internet content and documents, for example, search engine algorithms use relevance for the correlation between documents, while the latter is usually used for real-world things, for example, e-commerce sites use association for the degree of correlation between goods, and the corresponding rules are called association rules.

If there is an association between two or more attributes, then the value of one attribute can be predicted from the values of the other attributes.

In simple terms, an association rule can be written as A → B, where A is called the premise or left-hand side (LHS) and B is called the result or right-hand side (RHS). If we want to express the association rule between diapers and beer (a person who buys diapers also buys beer), we can write it as: buy diapers → buy beer.

Two concepts of association algorithms

An important concept in association algorithms is support, which is the proportion of transactions in the data set that contain a particular set of items.

For example, if beer and diapers appear together in 50 of 1,000 transactions, the association has a support of 5%.

Another closely related concept is confidence, which is the probability that B occurs given that A has already appeared in the data set. It is calculated as the probability of A and B occurring together divided by the probability of A occurring.
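A minimal sketch of these two calculations in Python might look like the following; the transaction list, item names, and counts are invented for illustration, not data from the Walmart case.

```python
from typing import FrozenSet, List

# Toy transactions; the items and counts are illustrative only.
transactions: List[FrozenSet[str]] = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "bread"}),
    frozenset({"beer", "chips"}),
    frozenset({"milk", "bread"}),
]

def support(itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs: FrozenSet[str], rhs: FrozenSet[str]) -> float:
    """P(rhs | lhs) = support(lhs and rhs together) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support(frozenset({"diapers", "beer"})))                   # 2/5 = 0.4
print(confidence(frozenset({"diapers"}), frozenset({"beer"})))   # 0.4/0.6 ≈ 0.67
```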

Data association

Data association is an important class of discoverable knowledge that exists in a database. If there is some regularity between the values of two or more variables, it is called an association. Associations can be categorized as simple associations, temporal associations, causal associations, etc.

The purpose of association analysis is to uncover the hidden network of associations in a database. The correlation between data items in the database is often unknown, or known only with uncertainty, so the rules produced by association analysis carry a confidence level.

Association rule mining discovers interesting associations or correlation links between sets of items in a large amount of data. It is an important topic in data mining and has been extensively studied by the industry in recent years.

A typical example of association rule mining is shopping basket analysis. Association rule research helps to discover the links between different goods (items) in a transaction database, and to find out the pattern of customers' purchasing behavior, such as the effect of having purchased one item on the purchase of other items. The results of the analysis can be applied to merchandise shelf layout, inventory arrangement, and categorization of users based on purchase patterns.

The discovery of association rules can be divided into the following two steps:

The first step is to iteratively identify all frequent itemsets, whose support must not be lower than the minimum value set by the user.

The second step is to construct, from the frequent itemsets, rules whose confidence is not lower than the minimum value set by the user, thereby generating the association rules. Identifying all frequent itemsets is the core of the association rule discovery algorithm and its most computationally intensive part.

Support and confidence

The support and confidence thresholds are the two most important concepts for describing association rules. The frequency with which an item group occurs is called its support, which reflects the importance of the association rule in the database. Confidence measures how trustworthy an association rule is. A rule that satisfies both the minimum support (min-support) and the minimum confidence (min-confidence) is called a strong association rule.

Association rule mining phases

The first phase must find all the high-frequency itemsets (large itemsets) in the original data. A high-frequency itemset is one whose frequency of occurrence, relative to all records, reaches a certain level. Taking a 2-itemset containing two items A and B as an example, we compute the support of the itemset {A,B}; if that support is greater than or equal to the minimum support threshold, {A,B} is called a high-frequency itemset. A k-itemset that satisfies the minimum support threshold is called a high-frequency k-itemset (frequent k-itemset), generally denoted Large-k or Frequent-k. The algorithm then tries to generate itemsets of length k+1 (Large-k+1) from the Large-k itemsets, repeating until no longer high-frequency itemset can be found.
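A minimal sketch of this level-wise search, assuming a toy transaction list and simplified Apriori-style candidate generation (no pruning beyond the support check), might look like this:

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

# Illustrative transactions; not real retail data.
transactions: List[FrozenSet[str]] = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "bread"}),
    frozenset({"beer", "chips"}),
]

def frequent_itemsets(min_support: float) -> Dict[FrozenSet[str], float]:
    """Level-wise search: build Large-(k+1) candidates from Large-k until
    no larger high-frequency itemset can be found."""
    n = len(transactions)
    # Large-1 candidates: every single item seen in the data.
    items = {i for t in transactions for i in t}
    current = {frozenset({i}) for i in items}
    result: Dict[FrozenSet[str], float] = {}
    k = 1
    while current:
        # Count the support of each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        large_k = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(large_k)
        # Candidate (k+1)-itemsets are unions of pairs of frequent k-itemsets.
        k += 1
        current = {a | b for a, b in combinations(large_k, 2) if len(a | b) == k}
    return result

print(frequent_itemsets(min_support=0.5))
# e.g. {diapers}: 0.75, {beer}: 0.75, {diapers, beer}: 0.5
```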

The second phase of association rule mining is to generate the association rules themselves. Rules are generated from the high-frequency k-itemsets found in the previous step: under the minimum confidence threshold, if the confidence obtained for a rule satisfies the minimum confidence, the rule is called an association rule.

For example, a rule generated from the high-frequency itemset {A,B} is an association rule if its confidence is greater than or equal to the minimum confidence.
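The second phase can be sketched along the same lines. The generate_rules function below and its example support values are assumptions for illustration only; in a real run, every subset of a frequent itemset would already have a recorded support from the first phase.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

def generate_rules(
    supports: Dict[FrozenSet[str], float], min_confidence: float
) -> List[Tuple[FrozenSet[str], FrozenSet[str], float]]:
    """For each frequent itemset, emit every rule LHS -> RHS whose
    confidence = support(itemset) / support(LHS) meets the threshold."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                conf = sup / supports[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, rhs, conf))
    return rules

# Supports as they might come out of the frequent-itemset step (illustrative values).
supports = {
    frozenset({"diapers"}): 0.75,
    frozenset({"beer"}): 0.75,
    frozenset({"diapers", "beer"}): 0.5,
}
for lhs, rhs, conf in generate_rules(supports, min_confidence=0.65):
    print(set(lhs), "->", set(rhs), f"confidence={conf:.2f}")
# prints diapers -> beer (and beer -> diapers), each with confidence ≈ 0.67
```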

In the case of "beer + diapers", association rule mining is applied to the records in the transaction database. First, the two thresholds, minimum support and minimum confidence, must be set; here assume min-support = 5% and min-confidence = 65%. An association rule that meets the requirements must therefore satisfy both conditions. The association rule {diapers, beer} found by mining is acceptable if it satisfies the following conditions, which can be written as:

Support(diapers, beer) ≥ 5% and Confidence(diapers, beer) ≥ 65%.

Support(diapers, beer) ≥ 5% means, in this example, that diapers and beer were purchased together in at least 5% of all transactions.

Confidence(diapers, beer) ≥ 65% means, in this example, that in at least 65% of all transactions that include diapers, beer was purchased at the same time.

So, in the future, if a consumer purchases diapers, we can recommend that they also purchase beer. This recommendation is based on the {diapers, beer} association rule, since past transactions support the pattern that "most diaper purchases are accompanied by beer purchases".

From the above, it is also clear that association rule mining is usually better suited to situations where the metrics in the records take on discrete values.

If a field in the original database takes continuous values, it should be appropriately discretized before association rule mining (in practice, each interval of values is mapped to a single value). Discretization is an important part of data preprocessing before mining, and whether the discretization is reasonable has a direct impact on the results of the association rule mining.
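As an illustration of one possible discretization, the sketch below maps a continuous "purchase amount" field into labeled intervals; the bin edges, labels, and values are assumptions chosen only for the example.

```python
from bisect import bisect_right
from typing import List

def discretize(value: float, edges: List[float], labels: List[str]) -> str:
    """Map a continuous value to the label of the interval it falls into.
    `edges` are the upper bounds of all but the last interval."""
    return labels[bisect_right(edges, value)]

# Illustrative bin edges and labels for a hypothetical "purchase amount" field.
edges = [20.0, 50.0]                  # intervals: <20, 20-50 (exclusive), >=50
labels = ["low", "medium", "high"]

amounts = [12.5, 33.0, 75.0]
print([discretize(a, edges, labels) for a in amounts])  # ['low', 'medium', 'high']
```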