Cosine similarity evaluates how similar two vectors are by calculating the cosine of the angle between them. Each vector is plotted in vector space according to its coordinates, the angle between the two vectors is found, and the cosine of that angle is used as a measure of similarity: the smaller the angle, the closer the cosine is to 1, the more closely the directions match, and the more similar the two vectors are.
Take two dimensions as an example: a and b are two vectors, and we want to calculate the angle θ between them. The law of cosines tells us that θ can be found from the vectors' coordinates with the following formula:

cos(θ) = (a · b) / (|a| × |b|) = Σ(aᵢ × bᵢ) / (√Σaᵢ² × √Σbᵢ²)
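To make the formula concrete, here is a minimal Python sketch that computes the cosine of the angle between two vectors from their coordinates (the example vectors are made up purely for illustration):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two illustrative 2-D vectors
a = [3.0, 4.0]
b = [4.0, 3.0]
print(cosine_similarity(a, b))  # 0.96 -> small angle, very similar directions
```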
To use the cosine similarity algorithm for text processing, we first need to vectorize the text, that is, represent the words as "word vectors". Word vector representation can be regarded as a core technique for bringing deep learning algorithms into the NLP field: the first step in turning natural language processing into a machine learning problem is to find a way to represent the text mathematically. The idea is as follows:
Example:
Sentence A: This boot is too big. That one is the right size.
Sentence B: This boot is not too small, that one fits better.
1. Chinese word segmentation:
After segmenting the two sentences above with the jieba tokenizer, we get two word lists, listA and listB.
2. List all the words: put the words of listA and listB into a single set, which constitutes the bag of words.
3. Using the bag of words, count how often each word appears in listA and listB respectively.
4. One-hot style encoding of listA and listB against the bag of words (counting each word's occurrences) yields the following results:
listAcode = [1, 2, 1, 2, 1, 1, 1, 1, 1, 0, 0]
listBcode = [1, 2, 1, 1, 0, 0, 1, 1, 1, 1]
5. Once we have the two sentences' word-frequency vectors, the problem becomes calculating the cosine of the angle between the two vectors; the larger the value, the higher the similarity.
6. The cosine of the two vectors is 0.805823, which is close to 1, indicating that the two sentences are very similar.
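As a rough sketch of these six steps in Python: since the example sentences are shown here in English translation, a naive regex tokenization stands in for the jieba segmentation of the original Chinese, so the resulting cosine will differ from the 0.805823 reported above.

```python
import math
import re
from collections import Counter

def freq_vector(tokens, bag):
    counts = Counter(tokens)
    return [counts[w] for w in bag]  # one entry per bag word, in bag order

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Naive tokenization stands in for jieba segmentation of the original Chinese sentences
tokens_a = re.findall(r"\w+", "This boot is too big. That one is the right size.".lower())
tokens_b = re.findall(r"\w+", "This boot is not too small, that one fits better.".lower())

bag = sorted(set(tokens_a) | set(tokens_b))   # step 2: bag of words
vec_a = freq_vector(tokens_a, bag)            # steps 3/4: word-frequency vectors
vec_b = freq_vector(tokens_b, bag)
print(cosine(vec_a, vec_b))                   # steps 5/6: about 0.61 with this naive English tokenization
```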
The steps to calculate the similarity of two sentences are as follows:
1. Split each complete sentence into an independent list of words by Chinese word segmentation;
2. Take the union of the two word sets (the bag of words);
3. Count each sentence's word frequencies over the bag of words and turn them into vectors;
4. Substitute the vectors into the cosine formula to obtain the text similarity.
Note that once the bag of words is determined, the order of its words must not be changed, otherwise the vectors will no longer be comparable. A minimal sketch of these four steps is shown below.
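Here is that sketch, assuming the jieba package is installed for the segmentation step; the word bag is frozen in a fixed order and reused for both frequency vectors, as the note above requires.

```python
import math
from collections import Counter

import jieba  # third-party segmenter; assumes `pip install jieba`

def sentence_similarity(s1, s2):
    # Step 1: segment each sentence into a list of words
    words1, words2 = jieba.lcut(s1), jieba.lcut(s2)
    # Step 2: union of the two word sets, frozen into a fixed order (the bag of words)
    bag = list(dict.fromkeys(words1 + words2))
    # Step 3: word-frequency vectors, one entry per bag word, in bag order
    c1, c2 = Counter(words1), Counter(words2)
    v1 = [c1[w] for w in bag]
    v2 = [c2[w] for w in bag]
    # Step 4: cosine of the angle between the two frequency vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))
```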
The above is the similarity calculation for two sentences. For the similarity of two articles, the steps are as follows:
1. Extract the keywords of each article and merge each article's keywords into a set of words;
2. Take the union of the two keyword sets (the bag of words);
3. Count the word frequencies over the bag of words and turn them into vectors;
4. Substitute the vectors into the cosine formula to obtain the text similarity.
Sentence similarity calculation is essentially a sub-step of article similarity calculation; the keyword extraction for the articles can be handled by other algorithms.
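A rough sketch of the article-level variant follows. It uses jieba's built-in TF-IDF keyword extractor (jieba.analyse.extract_tags) as one possible choice for step 1; any other keyword-extraction algorithm could be substituted.

```python
import math
from collections import Counter

import jieba
import jieba.analyse

def article_similarity(doc1, doc2, top_k=20):
    # Step 1: extract keywords from each article (TF-IDF based extractor here)
    kw1 = jieba.analyse.extract_tags(doc1, topK=top_k)
    kw2 = jieba.analyse.extract_tags(doc2, topK=top_k)
    # Step 2: union of the two keyword sets forms the bag of words
    bag = list(dict.fromkeys(kw1 + kw2))
    # Step 3: count how often each bag word occurs in each full article
    c1, c2 = Counter(jieba.lcut(doc1)), Counter(jieba.lcut(doc2))
    v1 = [c1[w] for w in bag]
    v2 = [c2[w] for w in bag]
    # Step 4: cosine similarity of the two frequency vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```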
Term frequency (TF) is the number of times a word appears in an article or sentence. To find the keywords of a long article, the intuition is that the more important a word is to the article, the more often it tends to appear in it, so we start from "word frequency" statistics.
But this is not absolute. For example, words such as "地", "的", and "啊" may appear many times in an article, yet they contribute nothing to its main idea; they are just part of the grammatical structure of Chinese. Such words are known as "stop words", and they should be filtered out when counting word frequencies in an article.
Does simply filtering out stop words solve the problem? Not necessarily. For example, if you analyze a government work report, the word "China" is bound to appear many times in every article, but does it help identify the main idea of each report? Compared with terms such as "anti-corruption", "artificial intelligence", "big data", and "Internet of Things", the word "China" should be of secondary importance in such an article.
The advantage of the TF algorithm is that it is simple and fast, and its results roughly match reality. The disadvantage is that measuring importance by word frequency alone is not comprehensive: it ignores factors such as part of speech and word position, and sometimes important words do not appear many times. The algorithm also carries no positional information, so a word at the beginning of a document is treated as equally important as a word at the end, which is not reasonable.
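For instance, a bare-bones term-frequency count with stop-word filtering might look like the sketch below; the stop-word list here is a tiny illustrative sample, not a real one, and jieba is again assumed for segmentation.

```python
from collections import Counter

import jieba

# Tiny illustrative stop-word list; real projects load a full stop-word file
STOP_WORDS = {"的", "地", "得", "了", "啊", "是", "在"}

def term_frequencies(text, top_n=10):
    # Segment, drop whitespace tokens and stop words, then count occurrences
    words = [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]
    return Counter(words).most_common(top_n)  # most frequent candidate keywords
```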
Building on the idea of assigning weights by importance, each word can be given a specific weight: the most common words receive a smaller weight, while correspondingly less common words receive a larger weight. This weight is called "inverse document frequency" (abbreviated as IDF), and its size is inversely proportional to how common a word is. The TF-IDF value is the product of the term frequency TF and the inverse document frequency IDF; the larger the value, the more important the word is to the article. This is the TF-IDF algorithm.
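A minimal sketch of TF-IDF over a small corpus of already-segmented documents is shown below. It uses the classic IDF formula log(total documents / documents containing the word); many libraries apply additional smoothing, so treat this as one common variant rather than the definitive formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: tf-idf score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        doc_scores = {}
        for word, count in counts.items():
            tf = count / len(doc)                    # term frequency within this document
            idf = math.log(n_docs / df[word])        # classic IDF; ubiquitous words get 0
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores
```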