Author | zhouyue65
Source | Junquan Metrics
Text Mining: the process of extracting valuable knowledge from a large amount of textual data and using that knowledge to reorganize information.
I. The Corpus
The corpus is a collection of all the documents we want to analyze.
II. Chinese Word Segmentation
2.1 Concepts:
Chinese word segmentation: slicing a sequence of Chinese characters into individual words.
e.g.: My hometown is Zhanjiang City, Guangdong Province → my / hometown / is / Guangdong Province / Zhanjiang City
Stop Words:
In data processing, certain words or characters need to be filtered out:
√ words that appear everywhere, such as "web", "website", etc.;
√ auxiliary words, adverbs, prepositions, conjunctions, etc. (in Chinese, for example, 的, 地, 得).
2.2 Installing the Jieba package:
The easiest way is to install it directly from the command line: type pip install jieba. That didn't work on my machine, though.
So I downloaded jieba 0.39 from https://pypi.org/project/jieba/#files, unzipped it, put it into Python36\Lib\site-packages, and then ran pip install jieba in cmd again; this time it installed successfully, and I am not sure why.
I then installed jieba under the Anaconda environment as well: I put jieba 0.39 into the Anaconda3\Lib directory and typed pip install jieba in the Anaconda Prompt, as follows:
2.3 Code practice:
jieba is very easy to use; its core functionality is word segmentation.
jieba's main method is the cut method:
The jieba.cut method accepts two input parameters:
1) the first parameter is the string to be segmented;
2) the cut_all parameter controls whether to use full mode.
The jieba.cut_for_search method accepts one parameter: the string to be segmented. It is suitable for search engines building inverted indexes, as its segmentation granularity is relatively fine.
Note: the string to be split can be a gbk string, utf-8 string, or unicode
The structure returned by both jieba.cut and jieba.cut_for_search is an iterable generator: use a for loop to get each word (unicode) produced by the segmentation, or convert the result with list(jieba.cut(...)). The output shown here is the word-by-word segmentation of "I love Python" and of the classic test sentence "the MIIT clerk must personally explain, to every subordinate office each month, the installation of technical devices such as 24-port switches".
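A minimal sketch of these calls (the sample sentence reuses the hometown example from 2.1):

```python
import jieba

sentence = "我的家乡是广东省湛江市"  # "My hometown is Zhanjiang City, Guangdong Province"

# Precise mode (the default, cut_all=False)
print("/ ".join(jieba.cut(sentence, cut_all=False)))

# Full mode (cut_all=True): lists every word the dictionary can find
print("/ ".join(jieba.cut(sentence, cut_all=True)))

# Search-engine mode: finer granularity, suitable for building inverted indexes
print("/ ".join(jieba.cut_for_search(sentence)))

# The return value is a generator; list() materializes it into a list of words
print(list(jieba.cut(sentence)))
```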
The result can also be converted into a list with list(jieba.cut(...)). Word segmentation for specialized domains:
There will be cases where domain-specific terms, such as the Zhenwu Seven-Section Formation and the Heavenly Gang Big Dipper Formation from Jin Yong's novels, get split into several words. To improve this, we import a custom dictionary.
However, if there are many words to import, adding them one at a time with jieba.add_word() is not efficient.
We can instead use the jieba.load_userdict(r'D:\PDM\2.2\JinYongMartialWeaponStrokes.txt') method to import an entire dictionary at once; each line of the txt file contains one term.
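A minimal sketch of both approaches (the formation names and the sample sentence are illustrative, and the dictionary path follows the one quoted above):

```python
import jieba

# Add individual terms one at a time (fine for a handful of words)
jieba.add_word("真武七截阵")   # Zhenwu Seven-Section Formation
jieba.add_word("天罡北斗阵")   # Heavenly Gang Big Dipper Formation

# Or load an entire user dictionary at once; each line of the txt file is one term
jieba.load_userdict(r"D:\PDM\2.2\JinYongMartialWeaponStrokes.txt")

# The formation name is now kept as a single word instead of being split up
print("/ ".join(jieba.cut("武当七侠摆出了真武七截阵")))
```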
2.3.1 Segmentation of a large number of articles
First build the corpus:
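A minimal sketch of building the corpus, assuming the articles are plain-text files under one folder (the folder path and encoding are placeholders):

```python
import os
import pandas as pd

file_paths = []      # which file each article came from
file_contents = []   # the text of each article

# Walk a folder of .txt articles and read each one into the corpus
for root, dirs, files in os.walk(r"D:\PDM\articles"):
    for name in files:
        path = os.path.join(root, name)
        with open(path, encoding="utf-8") as f:
            file_paths.append(path)
            file_contents.append(f.read())

corpus_df = pd.DataFrame({"filePath": file_paths, "content": file_contents})
print(corpus_df.head())
```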
After segmentation, we also need to record, for each word, which article it came from.
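Continuing the sketch above, one way to keep that information is to build a word-level DataFrame:

```python
import jieba
import pandas as pd

seg_paths = []
seg_words = []

# Segment every article and record, for each word, the file it came from
for path, content in zip(corpus_df["filePath"], corpus_df["content"]):
    for word in jieba.cut(content):
        seg_paths.append(path)
        seg_words.append(word)

segment_df = pd.DataFrame({"filePath": seg_paths, "word": seg_words})
print(segment_df.head())
```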
III. Word Frequency Statistics
3.1 Term Frequency:
The number of times a word appears in the document.
3.2 Word Frequency Statistics with Python
3.2.1 Another way to remove stop words: adding an if judgment
Some common methods used in the code (a sketch follows the list):
Group-by statistics;
checking whether the values of a DataFrame column are contained in an array;
negation ~ (for Boolean values).
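A minimal sketch of these three operations on the word-level DataFrame from section 2.3.1 (the stop-word list here is illustrative):

```python
# Group-by statistics: count how many times each word appears
word_counts = segment_df.groupby("word").size().reset_index(name="count")
word_counts = word_counts.sort_values("count", ascending=False)

# isin: check whether the values of a column are contained in an array
stop_words = ["的", "了", "是", "在"]
is_stop = word_counts["word"].isin(stop_words)

# ~ negates a Boolean Series, so this keeps only the non-stop words;
# the same filtering could also be done word by word with an if judgment
word_counts = word_counts[~is_stop]
print(word_counts.head(10))
```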
IV. Drawing the Word Cloud
Word Cloud: gives visual prominence to the words that appear most frequently in a text, forming a "keyword rendering" that filters out the bulk of the textual information, so that a viewer can grasp the main idea of the text at a glance.
4.1 Installing the Word Cloud Toolkit
At this address, https://www.lfd.uci.edu/~gohlke/pythonlibs/, you can find basically every Python library; go in and download the one matching your system and Python version.
It installed easily under plain Python, but it took some effort under Anaconda; I finally got it to install after putting the wordcloud file in the C:\Users\Administrator directory.
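A minimal sketch of drawing a word cloud from the word-frequency table built in section III (the font path is an assumption: a Chinese-capable font is needed to render Chinese words):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Turn the word-frequency table into a {word: count} dict
frequencies = dict(zip(word_counts["word"], word_counts["count"]))

wc = WordCloud(font_path=r"C:\Windows\Fonts\simhei.ttf",
               background_color="white")
wc.fit_words(frequencies)

plt.imshow(wc)
plt.axis("off")
plt.show()
```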
V. Beautifying the Word Cloud (shaping the word cloud with a picture)
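A minimal sketch, assuming a placeholder mask image: the wordcloud package can take a picture as a mask so that words are only drawn in the non-white areas of the image.

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

# Load the picture to use as the shape of the word cloud (path is a placeholder)
mask = np.array(Image.open(r"D:\PDM\mask.png"))

wc = WordCloud(font_path=r"C:\Windows\Fonts\simhei.ttf",
               background_color="white",
               mask=mask)
wc.fit_words(frequencies)   # the frequencies dict from the previous sketch

plt.imshow(wc)
plt.axis("off")
plt.show()
```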
VI. Keyword extraction
The results are as follows:
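For reference, jieba itself also ships a ready-made TF-IDF keyword extractor; a minimal sketch (not necessarily the code behind the result shown above):

```python
import jieba.analyse

# Extract the top 10 keywords of the first article, with their TF-IDF weights
text = corpus_df["content"].iloc[0]
for word, weight in jieba.analyse.extract_tags(text, topK=10, withWeight=True):
    print(word, weight)
```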
VII. Implementing Keyword Extraction
Term Frequency: refers to the number of times a given word appears in the document.
Calculation formula: TF = the number of times it appears in the document
Inverse Document Frequency (IDF): the weight of each word; its value is inversely related to how common the word is.
Calculation formula: IDF = log(total number of documents / (number of documents containing the word + 1))
TF-IDF (Term Frequency-Inverse Document Frequency): a measure of whether a term is a keyword or not; the larger the value, the more likely it is to be a keyword.
Calculation formula: TF-IDF = TF * IDF
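A quick illustrative calculation (assuming a base-10 logarithm): if a word appears 5 times in a document, the corpus contains 100 documents, and 9 of them contain the word, then TF = 5, IDF = log(100 / (9 + 1)) = 1, and TF-IDF = 5 * 1 = 5.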
7.1 Document Vectorization
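A minimal sketch of document vectorization using scikit-learn's CountVectorizer with jieba doing the tokenization (one common approach; the article's own code may differ):

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer

# Segment each document and re-join the words with spaces so that
# CountVectorizer can split on whitespace
segmented_docs = [" ".join(jieba.cut(doc)) for doc in corpus_df["content"]]

# Note: CountVectorizer's default token_pattern ignores single-character words
vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(segmented_docs)   # documents x words count matrix

print(vectorizer.vocabulary_)    # word -> column index mapping
print(term_matrix.toarray())     # each row is one document's word-count vector
```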
7.2 Code Practice
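A minimal sketch of TF-IDF based keyword extraction using scikit-learn's TfidfVectorizer (its IDF formula adds smoothing and normalization, so the numbers will differ slightly from the plain TF * IDF formula above; get_feature_names_out requires scikit-learn 1.0 or newer):

```python
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Segment each document and join the words with spaces
segmented_docs = [" ".join(jieba.cut(doc)) for doc in corpus_df["content"]]

# Counting and TF-IDF weighting in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(segmented_docs)

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=vectorizer.get_feature_names_out(),
                        index=corpus_df["filePath"])

# Take each document's top 5 words by TF-IDF weight as its keywords
for path, row in tfidf_df.iterrows():
    print(path, row.sort_values(ascending=False).head(5).index.tolist())
```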