07_Recommendation System Algorithms
Demographic-based recommendation with user profiling, content-based recommendation, and collaborative filtering-based recommendation.

1. The demographic-based recommendation mechanism (Demographic-based Recommendation) is one of the easiest recommendation methods to implement. It simply uses the basic information of the system's users to measure how similar users are, and then recommends items liked by similar users to the current user.

2. For user information that has no clear-cut meaning (such as login time, geography, and other contextual information), users can be assigned classification labels by means such as clustering.

3. For users carrying specific labels, the corresponding items can then be recommended based on preset rules (knowledge) or on models.
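To illustrate points 2 and 3, here is a minimal sketch (the feature names and data are made up for illustration) that clusters users on contextual features with scikit-learn's KMeans and uses the resulting cluster id as a classification label:

```python
# Minimal sketch: cluster users on contextual features, then use the
# cluster id as a label for rule-based recommendation.
# The features (login hour, region id) are hypothetical examples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

users = np.array([[8, 1], [9, 1], [22, 3], [23, 3], [7, 2]], dtype=float)

X = StandardScaler().fit_transform(users)  # put features on a common scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster ids, e.g. [0 0 1 1 0], usable as user labels
```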

4. The process of labeling user information is generally known as User Profiling.

(1) A user profile (User Profile) is the complete abstraction of a user's business picture that an enterprise builds after collecting and analyzing data on consumers' social attributes, habits, consumption behavior, and other key information; it is the basic way enterprises apply big data technology.

(2) User profiles give companies enough information to quickly find precise user groups and to gather broader feedback on user needs.

(3) As the foundation of big data applications, a user profile abstracts the full picture of a user's information, providing a sufficient data basis for further fast and accurate analysis of user behavior, consumption habits, and other important information.

1. Content-based Recommendation (CB) discovers the relevance of items from the metadata of the recommended items or content, and then recommends similar items to a user based on the user's past preference records.

2. Similarity is computed by extracting the intrinsic or extrinsic feature values of items. For example, a movie can be characterized by its director, actors, user tags (UGC), user comments, duration, style, and so on.


3. By matching a user's personal features (based on preference records or preset interest labels) against item features, we obtain the user's degree of interest in the item. This has been used successfully on some movie, music, and book social networking sites; some sites also have professionals genetically code/tag items (PGC).

4. Similarity calculation:
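The formula for this section is missing from the text; a common choice for comparing feature vectors (an assumption here, since the original does not specify) is cosine similarity:

```latex
% Cosine similarity between feature vectors a and b
\mathrm{sim}(a,b) = \frac{a \cdot b}{\lVert a \rVert\,\lVert b \rVert}
                  = \frac{\sum_k a_k b_k}{\sqrt{\sum_k a_k^2}\,\sqrt{\sum_k b_k^2}}
```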

5. Feature extraction for items - tagging

- expert tagging (PGC)

- user-defined labeling (UGC)

- dimensionality-reduction analysis of the data to extract latent labels (LFM)


(Expert tagging typically yields the highest-quality features, but it is not the only method that can be applied.)

Feature extraction for text information - keywords

- Segmentation, Semantic Processing and Sentiment Analysis (NLP)

- Latent Semantic Analysis (LSA)
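A sketch of the LSA idea above (the corpus and parameters are made up; running truncated SVD on a TF-IDF matrix is the usual way to implement LSA in scikit-learn):

```python
# Sketch: Latent Semantic Analysis = TF-IDF matrix + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "heroic war epic by a famous director",
    "romantic comedy with famous actors",
    "war documentary by a famous director",
]
tfidf = TfidfVectorizer().fit_transform(docs)  # documents x terms
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(tfidf)          # documents x latent topics
print(doc_topics.shape)                        # (3, 2)
```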

6. High-level structure of content-based recommender systems

7. Feature engineering

(1) Features (feature): information extracted from the data that is useful for predicting outcomes.

The number of features is the dimensionality of the observed data.

Feature engineering is the process of using specialized background knowledge and techniques to process data so that the features work better in machine learning algorithms.

Feature engineering generally includes feature cleaning (sampling, cleaning of anomalous samples), feature processing, and feature selection.

Features are grouped by data type, and each type has its own processing methods: numerical, categorical, temporal, and statistical.

(2) Numerical feature processing

Numerical features represent a dimension with continuous values. They are usually processed mathematically; the main techniques are normalization and discretization.

* Numerical feature processing - normalization (range scaling):

Features should be treated as equals; differences should be expressed within a feature, not by the raw magnitudes of different features.

For example, house price and floor area have different magnitudes: the price may lie between 3,000,000 and 150,000,000 (yuan) while the area lies between 40 and 300 (square meters). The two features should clearly carry equal weight, but if both are fed into the same model unscaled, the difference in magnitude skews the result, which is unreasonable. (See the sketch below.)

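A minimal min-max scaling sketch for the price/area example (the numbers are only the illustrative ranges from the text):

```python
# Sketch: min-max normalization maps features of different magnitude
# onto the same [0, 1] range, so neither dominates the model.
import numpy as np

price = np.array([3_000_000, 80_000_000, 150_000_000], dtype=float)
area = np.array([40.0, 150.0, 300.0])

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

print(min_max(price))  # approx [0.   0.52 1.  ]
print(min_max(area))   # approx [0.   0.42 1.  ]
```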

* Numerical feature processing - discretization

There are two ways to discretize: equal-width (equal step), which is simple but not necessarily effective; and equal-frequency, which cuts at quantiles: min -> 25% -> 50% -> 75% -> max.

Comparison of the two methods:

Equal-frequency discretization is very accurate, but the cut points have to be recomputed whenever the data distribution changes: the price distribution of items bought on Taobao yesterday is not necessarily the same today, so yesterday's equal-frequency cut points may no longer apply. Most importantly for online serving, the cut points are not fixed and must be recomputed on the spot, so a model trained yesterday may not be usable today.

Equal-frequency cut points are not fixed but are very accurate; equal-width cut points are fixed and very simple. Both have applications in industry.
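A small sketch of both discretization methods with pandas (the price data is made up):

```python
# Sketch: equal-width (pd.cut) vs equal-frequency (pd.qcut) discretization.
import pandas as pd

prices = pd.Series([5, 7, 8, 12, 15, 40, 80, 100, 120, 400])

equal_width = pd.cut(prices, bins=4)   # fixed-width bins: simple, stable
equal_freq = pd.qcut(prices, q=4)      # quantile bins: min->25%->50%->75%->max
print(equal_freq.value_counts())       # roughly equal counts per bin
```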

(3) Categorical feature processing

Categorical data has no inherent size relationship. It must be encoded as numbers, but the encoding must not impose a predetermined order between categories. To treat categories fairly while still distinguishing them, each category is simply given its own dimension.

One-Hot Encoding / Dummy Variables: one-hot encoding expands categorical data into parallel binary dimensions; after encoding, the feature space expands accordingly.
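A minimal one-hot encoding sketch (the "color" feature is a made-up example):

```python
# Sketch: one-hot encoding expands one categorical column into one
# binary column per category (dummy variables).
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))
# columns: color_blue, color_green, color_red -- the space has expanded
```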

(4) Temporal feature processing

Temporal features can be viewed as both continuous and discrete values.

Continuous values: duration (e.g., the length of a web browsing session); interval (the time between the last purchase/click and now).

Discrete values: time of day; day of week; month/week of year; weekday/weekend.
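A small pandas sketch deriving both kinds of temporal features (the timestamps and reference date are made up):

```python
# Sketch: discrete (hour, weekday, weekend flag) and continuous
# (days since event) temporal features.
import pandas as pd

events = pd.DataFrame(
    {"ts": pd.to_datetime(["2024-01-05 09:30", "2024-01-07 22:10"])}
)
events["hour"] = events["ts"].dt.hour           # time of day
events["weekday"] = events["ts"].dt.dayofweek   # day of week (0 = Monday)
events["is_weekend"] = events["weekday"] >= 5   # weekday/weekend
events["days_since"] = (pd.Timestamp("2024-01-10") - events["ts"]).dt.days
print(events)
```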

(5) Statistical feature processing

Deviation from the mean: how far the item's price is above the average price; how much more a user spends in a certain category than average.

Quantile: which price quantile the item falls into among sold items.

Rank: where the item falls in the popularity ranking.

Proportion: the share of good/medium/bad reviews the item has on the e-commerce site.
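A sketch of two of these statistical features (the prices are toy values assumed for illustration):

```python
# Sketch: deviation from the mean price, and the price quantile bucket.
import pandas as pd

items = pd.DataFrame({"price": [5.0, 20.0, 35.0, 50.0, 200.0]})
items["above_mean"] = items["price"] - items["price"].mean()  # +/- average
items["price_quartile"] = pd.qcut(items["price"], q=4, labels=False)
print(items)
```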

8. Common feedback data in recommender systems:

9. UGC-based recommendation

Users describe their views on items with tags, so user-generated content (UGC) tags are the link between users and items, as well as an important data source reflecting user interest.

A dataset of user tagging behavior is generally represented by a collection of triples (user, item, tag), where a record (u, i, b) indicates that user u tagged item i with tag b.

The simplest algorithm:

- Count the tags each user uses most frequently

- For each tag, count the items that have been tagged with it the most times

- For a user, first find the tags he uses, then find the most popular items with those tags, and recommend them to him

- So user u's interest in item i is p(u, i) = Σ_b n_{u,b} · n_{b,i}, where n_{u,b} is the number of times user u has used tag b, and n_{b,i} is the number of times item i has been tagged with tag b.

This simple algorithm directly multiplies the number of times a user has used a tag by the number of times the item has carried that tag, which crudely reflects the user's interest in that feature of the item.
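A runnable sketch of this simple algorithm (the (user, item, tag) records are made up):

```python
# Sketch: p(u, i) = sum_b n_ub * n_bi over tags b.
from collections import defaultdict

records = [("u1", "i1", "funny"), ("u1", "i2", "funny"),
           ("u1", "i1", "epic"), ("u2", "i2", "funny")]

n_ub = defaultdict(int)  # times user u used tag b
n_bi = defaultdict(int)  # times item i was tagged with b
tags = set()
for u, i, b in records:
    n_ub[(u, b)] += 1
    n_bi[(b, i)] += 1
    tags.add(b)

def interest(u, i):
    return sum(n_ub[(u, b)] * n_bi[(b, i)] for b in tags)

print(interest("u1", "i2"))  # 2 * 2 + 1 * 0 = 4
```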

This approach tends to give more weight to popular tags (tags anyone would apply, such as "blockbuster" or "funny") and popular items (tagged by the largest number of people). If a popular item also carries a popular tag, it will "dominate the list", and the personalization and novelty of the recommendations are reduced.

A similar problem occurs with keyword extraction in news content. For example, in the following news, which keywords should receive higher weight?

10. TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) is a common weighting technique used in information retrieval and text mining.

TF-IDF is a statistical method for evaluating the importance of a word to a document within a document set or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with its frequency across the corpus.

TF-IDF = TF × IDF

The main idea of TF-IDF is that if a word or phrase occurs frequently (high TF) in one text and rarely in other texts, it is considered to have good discriminating power between categories and is suitable for classification.

Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query.

Term Frequency (TF): the frequency with which a given term appears in a document. The count is normalized by the document's length to prevent bias toward longer documents. (The same word may have a higher raw count in a longer document than in a shorter one, regardless of its importance.)

Inverse Document Frequency (IDF): a measure of a word's general importance. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient: IDF_i = log(N / n_i), where N is the total number of documents in the set and n_i is the number of documents containing word i.
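A hand-rolled TF-IDF sketch following exactly the definitions above (toy corpus):

```python
# Sketch: TF = normalized term count in one document;
# IDF = log(total docs / docs containing the term); TF-IDF = TF * IDF.
import math

docs = [["war", "epic", "war"], ["funny", "comedy"], ["war", "comedy"]]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)     # term frequency
    n_i = sum(term in d for d in docs)  # docs containing the term
    return tf * math.log(N / n_i)       # weight

print(round(tf_idf("war", docs[0]), 2))  # (2/3) * log(3/2) ~= 0.27
```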

11. Improving UGC-based recommendation with TF-IDF: to keep popular tags and popular items from receiving too much weight, we need to penalize "popularity".

Borrowing the idea of TF-IDF, we treat all the tags of an item as a "document" and each tag as a "word". We can then compute each tag's "term frequency" (its frequency among all tags of the item) and "inverse document frequency" (how commonly the tag appears among other items' tags).

Since "all tags of item i" should have no effect on the tag weights, and "total number of all tags" N is certain for all tags, these two items can be omitted. On the basis of the simple algorithm, we directly add a penalty term for popular tags and popular items: , which records how many different users have used tag b, and records how many different users have tagged item i.

(A) Collaborative Filtering (CF)

1. Collaborative Filtering (CF)-based recommendation: content-based (CB) methods mainly use the content features of items the user has already evaluated, whereas CF can also make use of the evaluations other users have given to items.

CF can address some of the limitations of CB:

- When item content is incomplete or difficult to obtain, recommendations can still be given through feedback from other users.

- CF is based on users' evaluations of item quality, avoiding the one-sided judgments of item quality that CB may make by relying on content alone.

- CF recommendations are not limited by content: as long as other similar users show interest in different items, CF can recommend items with widely differing content (but with some intrinsic connection).

CF divides into two categories: nearest-neighbor-based and model-based.

2. Nearest-neighbor-based recommender systems: based on the same "word-of-mouth" principle. Should we recommend Titanic to Cary?

(B) Nearest-neighbor-based collaborative filtering

1. User-based (User-CF): the basic principle of user-based collaborative filtering is to use all users' preferences for items to find a group of "neighbor" users whose tastes and preferences are similar to the current user's, and then recommend the items those nearest neighbors prefer.

In general, this is done with a K-nearest-neighbors computation: based on the historical preference information of these K neighbors, recommendations are made for the current user (see the sketch below).
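A minimal User-CF sketch along these lines (the rating matrix is toy data; cosine similarity is one common choice):

```python
# Sketch: user-based CF -- similarity between user rows, then
# similarity-weighted scores for the items the user has not rated.
import numpy as np

R = np.array([[5, 3, 0, 1],     # rows = users, cols = items, 0 = unrated
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

u = 0
sims = np.array([cosine(R[u], R[v]) for v in range(len(R))])
sims[u] = 0.0                    # exclude the user themself
scores = sims @ R                # neighbor-weighted item scores
scores[R[u] > 0] = -np.inf       # mask items u already rated
print(int(np.argmax(scores)))    # index of the item to recommend
```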

User-CF and demographic-based recommendation mechanisms:

- Both compute the similarity of users and calculate recommendations based on a group of similar "neighboring" users.

- They differ in how they calculate user similarity: demographic-based mechanisms only consider the characteristics of the users themselves, whereas user-based collaborative filtering calculates user similarity on the basis of users' historical preferences, with the underlying assumption that users who like similar things are likely to have the same or similar tastes and preferences.

2. Item-based (Item-CF): the basic principle of item-based collaborative filtering is similar to the user-based one, except that it uses all users' preferences for items to discover the similarity between items, and then recommends similar items to a user based on the user's historical preferences.

Item-CF and Content-Based (CB) Recommendation

- Both predict recommendations from item similarity; they differ in how the similarity is computed: the former infers it from users' historical preferences, while the latter derives it from the items' own attribute features. A contrast sketch follows.
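For contrast, an Item-CF sketch on the same toy matrix (column-wise item similarity instead of row-wise user similarity):

```python
# Sketch: item-based CF -- item-item cosine similarity from the columns,
# then score unrated items by similarity to what the user rated.
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

norms = np.linalg.norm(R, axis=0) + 1e-9
S = (R.T @ R) / np.outer(norms, norms)  # item-item cosine similarity
np.fill_diagonal(S, 0.0)                # ignore self-similarity

u = 0
scores = S @ R[u]                       # aggregate over items u rated
scores[R[u] > 0] = -np.inf
print(int(np.argmax(scores)))
```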

Also with collaborative filtering, how do you choose between user-based and item-based strategies?

- E-commerce, movie, music sites, where the number of users is much larger than the number of items.

- News sites, where the number of items (news text) can be larger than the number of users.

3. Comparison of User-CF and Item-CF

Again with collaborative filtering, how should you choose between the two strategies, User-CF and Item-CF?

Item-CF application scenarios

- The item-based collaborative filtering (Item-CF) recommendation mechanism is a strategy Amazon introduced as an improvement on the user-based mechanism. On most websites the number of items is much smaller than the number of users, and item counts and similarities are relatively stable; the item-based mechanism also performs somewhat better than the user-based one in real time. For these reasons, Item-CF has become the mainstream recommendation strategy today.

User-CF application scenarios

- Imagine a news recommender system: the number of items, i.e., news articles, may be larger than the number of users, and news is updated very quickly, so item similarity remains unstable. In that case, User-CF may be more effective.

So the choice of recommendation strategy actually depends heavily on the specific application scenario.

4. Advantages and disadvantages of collaborative filtering-based recommendation

(1) Advantages of the collaborative filtering-based recommendation mechanism:

It does not require strict modeling of items or users, and does not require that the description of item features be machine-understandable, so this approach is also domain-independent.

The recommendations computed by this approach are open and share other users' experience, which is well suited to supporting users in discovering potential interest preferences.

(2) Problems

The core of the approach is based on historical data, so there is a "cold start" problem for new items and new users.

The effectiveness of the recommendations depends on the amount and accuracy of users' historical preference data.

In most implementations, historical user preferences are stored in a sparse matrix, and there are some obvious problems with computing on sparse matrices, including the possibility that the wrong preferences of a small number of people can have a significant impact on the accuracy of recommendations.

Users with particular tastes cannot be given good recommendations.

(C) Model-based collaborative filtering

1. The basic idea

(1) A user has certain characteristics that determine their preferences.

(2) An item has certain characteristics that influence whether a user chooses it.

(3) A user chooses an item because the user's characteristics match the item's characteristics.

Based on this idea, building the model amounts to extracting features from behavior data and "tagging" users and items at the same time; this is essentially the same as demographic-based user tagging and content-based item tagging: feature extraction and matching.

When explicit features exist (e.g., user labels, item category labels), we can match them directly to make recommendations; when they do not, we can derive hidden features from existing preference data, which requires a latent factor model (LFM).

2. Model-based collaborative filtering trains a recommendation model from sample user preference data, then predicts scores for new items from real-time user preference information and computes the recommendations.

Nearest-neighbor-based recommendation vs. model-based recommendation

- Nearest-neighbor-based recommendation uses existing user preference data directly, predicting preferences for new items from the neighbors' data (similar to classification).

- The model-based approach uses the preference data to train a model, finds the internal patterns, and then uses the model to make predictions (similar to regression).

When training the model, features can either be extracted from label content, or the model can be allowed to develop latent features of the items on its own; the latter is called a latent semantic model, also known as a Latent Factor Model (LFM).

(1) Latent Factor Model (LFM): the goals of collaborative filtering with a latent factor model:

- Reveal hidden features that explain why the corresponding predictive scores have been given

- Such features may not be directly describable in words, and in fact we don't need to know exactly what they are, somewhat like "metaphysics".

Dimensionality reduction via matrix decomposition

- Collaborative filtering algorithms rely heavily on historical data, and preference data is often sparse in general recommender systems; this requires dimensionality reduction of the raw data.

- The factor matrices after decomposition represent the hidden features of users and items

Examples of latent semantic models: probabilistic latent semantic analysis (pLSA), Latent Dirichlet Allocation (LDA), and matrix factorization models (e.g., models based on Singular Value Decomposition, SVD).

(2) LFM dimensionality reduction method - matrix factorization

(3) Further Understanding of LFM

We can assume there are intrinsic reasons why users give movies the scores they do. We can dig out the hidden factors that influence users' scores and then, based on the correlation between an unrated movie and these hidden factors, predict the score for that movie.

There should be hidden factors that affect a user's score, such as, for a movie: actors, subject matter, era, ... or even hidden factors that humans cannot directly interpret.

To find the hidden factors, you can correlate users and items (work out what makes a user like/dislike an item, and what determines whether a user likes an item), and then infer whether the user will like a certain movie they haven't seen yet.

(4) Matrix factorization

(5) Solving the model - loss function

(6) Algorithm for solving the model - ALS

Now, the problem of matrix factorization has been transformed into a standard optimization problem that requires solving P and Q to minimize the objective loss function.
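The loss function itself is not shown in the text; a common form of the objective (assumed here) is the regularized squared error over the observed ratings:

```latex
% K = set of observed (user, item) ratings; p_u, q_i = rows of P, Q
\min_{P,Q} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2
         + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
```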

The minimization is usually solved with stochastic gradient descent or alternating least squares (ALS).

The idea behind ALS: since the two matrices P and Q are both unknown and are coupled through matrix multiplication, to decouple them you can first fix Q, treat P as the variable, and minimize the loss over P, which is a classical least-squares problem; then, in turn, fix the resulting P, treat Q as the variable, and solve for Q. Alternate in this way until the error meets the stopping condition or the iteration limit is reached.
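A compact ALS sketch of this alternation (toy ratings; k and lambda are assumed hyperparameters):

```python
# Sketch: alternating least squares for R ~= P @ Q.T on observed entries.
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                       # observed ratings only
k, lam = 2, 0.1
rng = np.random.default_rng(0)
P = rng.random((R.shape[0], k))
Q = rng.random((R.shape[1], k))

for _ in range(20):
    for u in range(R.shape[0]):    # fix Q, solve each p_u (least squares)
        Qu = Q[mask[u]]
        P[u] = np.linalg.solve(Qu.T @ Qu + lam * np.eye(k),
                               Qu.T @ R[u, mask[u]])
    for i in range(R.shape[1]):    # fix P, solve each q_i
        Pi = P[mask[:, i]]
        Q[i] = np.linalg.solve(Pi.T @ Pi + lam * np.eye(k),
                               Pi.T @ R[mask[:, i], i])

print(np.round(P @ Q.T, 1))        # reconstructed rating matrix
```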

(7) Gradient descent algorithm
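For gradient descent on the same loss, the standard per-rating updates (with learning rate α; a common formulation, assumed here) are:

```latex
% For each observed r_ui, with error e_ui = r_ui - p_u^T q_i:
p_u \leftarrow p_u + \alpha\,(e_{ui}\, q_i - \lambda\, p_u), \qquad
q_i \leftarrow q_i + \alpha\,(e_{ui}\, p_u - \lambda\, q_i)
```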