Author | zhouyue65
Source | Junquan Metrics
Text Mining: the process of extracting valuable knowledge from a large amount of textual data and using that knowledge to reorganize information.
I. The Corpus
The corpus is a collection of all the documents we want to analyze.
II. Chinese Word Segmentation
2.1 Concepts:
Chinese word segmentation: slicing a sequence of Chinese characters into individual words.
e.g.: My hometown is Zhanjiang City, Guangdong Province → my / hometown / is / Guangdong Province / Zhanjiang City
Stop Words:
In data processing, certain words or characters need to be filtered out:
√ words that appear everywhere, such as "web", "website", etc.;
√ auxiliary words, adverbs, prepositions, conjunctions, etc. (in Chinese, for example, 的, 地, 得).
2.2 Installing the Jieba package:
The easiest way is to install it directly from the command line: type pip install jieba. That didn't work on my machine, though.
So I downloaded jieba 0.39 from https://pypi.org/project/jieba/#files, unzipped it, put it into Python36\Lib\site-packages, and then ran pip install jieba in cmd again; this time it installed successfully, and I am not sure why.
I then installed jieba under the Anaconda environment as well: I put jieba 0.39 into the Anaconda3\Lib directory and typed pip install jieba in the Anaconda Prompt, as follows:
2.3 Code practice:
jieba is very easy to use; its core functionality is word segmentation.
jieba's main method is the cut method:
The jieba.cut method accepts two input parameters:
1) the first parameter is the string to be segmented;
2) the cut_all parameter controls whether to use full mode.
The jieba.cut_for_search method accepts one parameter: the string to be segmented. It is suitable for search engines building inverted indexes, as its segmentation granularity is relatively fine.
Note: the string to be split can be a gbk string, utf-8 string, or unicode
The structure returned by both jieba.cut and jieba.cut_for_search is an iterable generator: use a for loop to get each word (unicode) produced by the segmentation, or convert the result with list(jieba.cut(...)). The output shown here is the word-by-word segmentation of "I love Python" and of the classic test sentence "the MIIT clerk must personally explain, to every subordinate office each month, the installation of technical devices such as 24-port switches".
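A minimal sketch of these calls (the sample sentence reuses the hometown example from 2.1):

```python
import jieba

sentence = "我的家乡是广东省湛江市"  # "My hometown is Zhanjiang City, Guangdong Province"

# Precise mode (the default, cut_all=False)
print("/ ".join(jieba.cut(sentence, cut_all=False)))

# Full mode (cut_all=True): lists every word the dictionary can find
print("/ ".join(jieba.cut(sentence, cut_all=True)))

# Search-engine mode: finer granularity, suitable for building inverted indexes
print("/ ".join(jieba.cut_for_search(sentence)))

# The return value is a generator; list() materializes it into a list of words
print(list(jieba.cut(sentence)))
```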
The result can also be converted into a list with list(jieba.cut(...)). Word segmentation for specialized domains:
There will be cases where domain-specific terms, such as the Zhenwu Seven-Section Formation and the Heavenly Gang Big Dipper Formation from Jin Yong's novels, get split into several words. To improve this, we import a custom dictionary.
However, if there are many words to import, adding them one at a time with jieba.add_word() is not efficient.
We can instead use the jieba.load_userdict(r'D:\PDM\2.2\JinYongMartialWeaponStrokes.txt') method to import an entire dictionary at once; each line of the txt file contains one term.
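A minimal sketch of both approaches (the formation names and the sample sentence are illustrative, and the dictionary path follows the one quoted above):

```python
import jieba

# Add individual terms one at a time (fine for a handful of words)
jieba.add_word("真武七截阵")   # Zhenwu Seven-Section Formation
jieba.add_word("天罡北斗阵")   # Heavenly Gang Big Dipper Formation

# Or load an entire user dictionary at once; each line of the txt file is one term
jieba.load_userdict(r"D:\PDM\2.2\JinYongMartialWeaponStrokes.txt")

# The formation name is now kept as a single word instead of being split up
print("/ ".join(jieba.cut("武当七侠摆出了真武七截阵")))
```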
2.3.1 Segmentation of a large number of articles
First build the corpus:
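A minimal sketch of building the corpus, assuming the articles are plain-text files under one folder (the folder path and encoding are placeholders):

```python
import os
import pandas as pd

file_paths = []      # which file each article came from
file_contents = []   # the text of each article

# Walk a folder of .txt articles and read each one into the corpus
for root, dirs, files in os.walk(r"D:\PDM\articles"):
    for name in files:
        path = os.path.join(root, name)
        with open(path, encoding="utf-8") as f:
            file_paths.append(path)
            file_contents.append(f.read())

corpus_df = pd.DataFrame({"filePath": file_paths, "content": file_contents})
print(corpus_df.head())
```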
After segmentation, we also need to record, for each word, which article it came from.
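Continuing the sketch above, one way to keep that information is to build a word-level DataFrame:

```python
import jieba
import pandas as pd

seg_paths = []
seg_words = []

# Segment every article and record, for each word, the file it came from
for path, content in zip(corpus_df["filePath"], corpus_df["content"]):
    for word in jieba.cut(content):
        seg_paths.append(path)
        seg_words.append(word)

segment_df = pd.DataFrame({"filePath": seg_paths, "word": seg_words})
print(segment_df.head())
```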
III. Word Frequency Statistics
3.1 Term Frequency:
The number of times a word appears in the document.
3.2 Word Frequency Statistics with Python
3.2.1 Another way to remove stop words: adding an if judgment
Some common methods used in the code (a sketch follows the list):
Group-by statistics;
checking whether the values of a DataFrame column are contained in an array;
negation ~ (for Boolean values).
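A minimal sketch of these three operations on the word-level DataFrame from section 2.3.1 (the stop-word list here is illustrative):

```python
# Group-by statistics: count how many times each word appears
word_counts = segment_df.groupby("word").size().reset_index(name="count")
word_counts = word_counts.sort_values("count", ascending=False)

# isin: check whether the values of a column are contained in an array
stop_words = ["的", "了", "是", "在"]
is_stop = word_counts["word"].isin(stop_words)

# ~ negates a Boolean Series, so this keeps only the non-stop words;
# the same filtering could also be done word by word with an if judgment
word_counts = word_counts[~is_stop]
print(word_counts.head(10))
```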
IV. Drawing the Word Cloud
Word Cloud: gives visual prominence to the words that appear most frequently in a text, forming a "keyword rendering" that filters out the bulk of the textual information, so that a viewer can grasp the main idea of the text at a glance.
4.1 Installing the Word Cloud Toolkit
At this address, https://www.lfd.uci.edu/~gohlke/pythonlibs/, you can find basically every Python library; go in and download the one matching your system and Python version.
It installed easily under plain Python, but it took some effort under Anaconda; I finally got it to install after putting the wordcloud file in the C:\Users\Administrator directory.
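A minimal sketch of drawing a word cloud from the word-frequency table built in section III (the font path is an assumption: a Chinese-capable font is needed to render Chinese words):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Turn the word-frequency table into a {word: count} dict
frequencies = dict(zip(word_counts["word"], word_counts["count"]))

wc = WordCloud(font_path=r"C:\Windows\Fonts\simhei.ttf",
               background_color="white")
wc.fit_words(frequencies)

plt.imshow(wc)
plt.axis("off")
plt.show()
```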
V. Beautifying the Word Cloud (shaping the word cloud with a picture)
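A minimal sketch, assuming a placeholder mask image: the wordcloud package can take a picture as a mask so that words are only drawn in the non-white areas of the image.

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

# Load the picture to use as the shape of the word cloud (path is a placeholder)
mask = np.array(Image.open(r"D:\PDM\mask.png"))

wc = WordCloud(font_path=r"C:\Windows\Fonts\simhei.ttf",
               background_color="white",
               mask=mask)
wc.fit_words(frequencies)   # the frequencies dict from the previous sketch

plt.imshow(wc)
plt.axis("off")
plt.show()
```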
VI. Keyword extraction
The results are as follows:
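For reference, jieba itself also ships a ready-made TF-IDF keyword extractor; a minimal sketch (not necessarily the code behind the result shown above):

```python
import jieba.analyse

# Extract the top 10 keywords of the first article, with their TF-IDF weights
text = corpus_df["content"].iloc[0]
for word, weight in jieba.analyse.extract_tags(text, topK=10, withWeight=True):
    print(word, weight)
```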
VII. Implementing Keyword Extraction
Term Frequency: refers to the number of times a given word appears in the document.
Calculation formula: TF = the number of times it appears in the document
Inverse Document Frequency (IDF): the weight of each word; its value is inversely related to how common the word is.
Calculation formula: IDF = log(total number of documents / (number of documents containing the word + 1))
TF-IDF (Term Frequency-Inverse Document Frequency): a measure of whether a term is a keyword or not; the larger the value, the more likely it is to be a keyword.
Calculation formula: TF-IDF = TF * IDF
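A quick illustrative calculation (assuming a base-10 logarithm): if a word appears 5 times in a document, the corpus contains 100 documents, and 9 of them contain the word, then TF = 5, IDF = log(100 / (9 + 1)) = 1, and TF-IDF = 5 * 1 = 5.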
7.1 Document Vectorization
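A minimal sketch of document vectorization using scikit-learn's CountVectorizer with jieba doing the tokenization (one common approach; the article's own code may differ):

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer

# Segment each document and re-join the words with spaces so that
# CountVectorizer can split on whitespace
segmented_docs = [" ".join(jieba.cut(doc)) for doc in corpus_df["content"]]

# Note: CountVectorizer's default token_pattern ignores single-character words
vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(segmented_docs)   # documents x words count matrix

print(vectorizer.vocabulary_)    # word -> column index mapping
print(term_matrix.toarray())     # each row is one document's word-count vector
```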
7.2 Code Practice
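A minimal sketch of TF-IDF based keyword extraction using scikit-learn's TfidfVectorizer (its IDF formula adds smoothing and normalization, so the numbers will differ slightly from the plain TF * IDF formula above; get_feature_names_out requires scikit-learn 1.0 or newer):

```python
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Segment each document and join the words with spaces
segmented_docs = [" ".join(jieba.cut(doc)) for doc in corpus_df["content"]]

# Counting and TF-IDF weighting in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(segmented_docs)

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=vectorizer.get_feature_names_out(),
                        index=corpus_df["filePath"])

# Take each document's top 5 words by TF-IDF weight as its keywords
for path, row in tfidf_df.iterrows():
    print(path, row.sort_values(ascending=False).head(5).index.tolist())
```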