What AI PM should know about natural language processing (NLP)

NLP is a subfield of artificial intelligence. As AI product managers, we should at least know what NLP is and what it can do, so that we can think in terms of problem-solving and match the problems we encounter with appropriate methods. Below, I will briefly introduce NLP from three angles: what NLP is, what it can do, and the problems it currently faces.

1. What is NLP

NLP stands for natural language processing. Simply put, it is the discipline of enabling computers to understand, analyze, and generate natural language. Its general research cycle is: build a language model that can represent linguistic ability; propose methods to continuously improve the model's capability; design application systems on top of the model; and keep improving the model based on those applications.

NLP currently has two ways of understanding natural language:

1. Rule-based understanding of natural language: a program is designed from a set of hand-crafted rules, and that program is then used to solve natural language problems. The input is the rules; the output is a program.

2. Understanding natural language with statistical machine learning: a large amount of data is used to train a model via machine learning algorithms, and that model is then used to solve natural language problems. The input is data plus the desired results; the output is a model.

Next, we will briefly introduce the common tasks or applications of NLP.

2. What NLP can do

1. Word segmentation

Chinese text can be divided into several levels: characters, words, phrases, sentences, paragraphs, and documents. A single character often cannot express a complete meaning; a word usually can, so "word" is generally taken as the basic unit, and combinations of words represent phrases, sentences, paragraphs, and documents. Whether the input to the computer is a phrase, a sentence, a paragraph, or a document depends on the specific scenario. Because Chinese words are not separated by spaces the way English words are, a computer cannot tell where one word ends and the next begins, so word segmentation is required. Two families of methods are commonly used:

(1) Rule-based: heuristics, keyword tables

(2) Based on machine learning / statistical methods: HMM (Hidden Markov Model), CRF (Conditional Random Field)

(Note: I will not cover the principles and implementations of these methods in detail here. If you are interested, you can look them up on Baidu.)

Word segmentation technology is now quite mature, and its accuracy has reached a usable level. Many third-party libraries are available, such as jieba, so in practice we usually segment with "jieba + a custom dictionary".
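As a sketch of the rule-based "keyword table" idea mentioned above, here is a toy greedy forward maximum-matching segmenter. (jieba itself uses a prefix dictionary plus an HMM for unknown words; this illustrates only the rule-based intuition, not jieba's implementation.)

```python
def max_match(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, greedily take the
    longest dictionary word; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                words.append(candidate)
                i += j
                break
    return words

# "我喜欢你" = "I like you"; with 喜欢 ("like") in the keyword table,
# the segmenter keeps it as one word instead of two separate characters.
print(max_match("我喜欢你", {"我", "喜欢", "你"}))  # ['我', '喜欢', '你']
```

Notice that the quality of the result depends entirely on the keyword table: with an empty table, the same text falls apart into single characters.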

2. Word encoding

Suppose word segmentation has split the text "I like you" into three words: "I", "like", and "you". As raw strings these words cannot be understood by the computer, so we convert them into a form the computer can work with; this is word encoding. Today it is common to represent words as word vectors, which serve as the input and representation space for machine learning. There are two kinds of representation space:

(1) Discrete representation:

A. One-hot representation

Suppose our corpus is:

I like you, do you have feelings for me?

Dictionary: {"I": 1, "like": 2, "you": 3, "for": 4, "have": 5, "feelings": 6, "?": 7} — seven dimensions in total.

So One-hot is used to express:

"I" ?: [1, 0, 0, 0, 0, 0, 0]

"Like ": [0, 1, 0, 0, 0, 0, 0]

·········

"?" ?: [0, 0, 0 , 0, 0, 0, 1]

That is, each word is represented by its own dimension.

B. Bag of words: the one-hot vectors of all the words in a document are summed to give the document's vector.

So "I like you" is expressed as: "[1, 1, 1, 0, 0, 0, 0]".

C. Bi-gram and N-gram (language models): take word order into account by using word combinations as the units of representation.

The idea behind all three methods is that different units occupy different dimensions: each "unit" (a word, a word combination, etc.) is one dimension.
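The three discrete representations above can be sketched in a few lines. This uses a small illustrative seven-word vocabulary standing in for the dictionary built from the corpus above:

```python
vocab = ["I", "like", "you", "for", "have", "feelings", "?"]  # illustrative dictionary

def one_hot(word):
    # one dimension per vocabulary entry; a single 1 marks the word
    return [1 if v == word else 0 for v in vocab]

def bag_of_words(words):
    # sum of the one-hot vectors: per-word counts, word order is lost
    vec = [0] * len(vocab)
    for w in words:
        vec[vocab.index(w)] += 1
    return vec

def bigrams(words):
    # bi-gram units preserve local word order
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

print(one_hot("I"))                        # [1, 0, 0, 0, 0, 0, 0]
print(bag_of_words(["I", "like", "you"]))  # [1, 1, 1, 0, 0, 0, 0]
print(bigrams(["I", "like", "you"]))       # [('I', 'like'), ('like', 'you')]
```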

(2) Distributed representation: word2vec, which represents each word as a dense real-valued vector. The idea behind it is that "a word can be represented by the words near it".
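word2vec itself trains a small neural network, but the underlying intuition — a word is characterized by its neighbors — can be sketched with plain co-occurrence counts (a toy illustration, not how word2vec is actually implemented):

```python
def cooccurrence_vectors(sentences, window=1):
    """Represent each word by the counts of words appearing within
    `window` positions of it -- the intuition behind distributed
    representations, minus word2vec's neural training step."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vectors[w][index[s[j]]] += 1
    return vectors

# "I" and "me" each appear only next to "like", so they end up with
# identical context vectors -- similar contexts, similar representation.
vecs = cooccurrence_vectors([["I", "like", "you"], ["you", "like", "me"]])
print(vecs["I"], vecs["me"])
```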

Discrete and distributed representation spaces each have advantages and disadvantages. Interested readers can look into the details on their own; they will not be elaborated here.

There is a problem here: the larger the corpus, the more words it contains and the higher the dimension of the word vectors, so storage and computation costs grow rapidly. Engineers therefore usually reduce the dimensionality of word vectors, which inevitably loses some information and can affect the final result. So as a product manager, when following up on project development, you also need to understand whether the engineers' dimensionality reduction is reasonable.
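That trade-off can be seen directly with a truncated SVD, one common way to reduce word-vector dimensionality (a minimal sketch assuming NumPy; the matrix here is illustrative):

```python
import numpy as np

def reduce_dims(word_vectors, k):
    """Keep only the top-k singular directions of the word-vector
    matrix; the discarded singular values are the information lost."""
    U, S, Vt = np.linalg.svd(word_vectors, full_matrices=False)
    return U[:, :k] * S[:k]

# four 7-dimensional one-hot-style word vectors squeezed into 2 dimensions
X = np.eye(4, 7)
print(reduce_dims(X, 2).shape)  # (4, 2)
```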

3. Automatic summarization

Automatic summarization refers to automatically extracting the key text or knowledge from an original document. Why do we need it? There are two main reasons: (1) information overload — we need to extract the most useful and valuable text from large volumes of text; (2) manual summarization is very expensive. There are currently two approaches: the first is extractive, which selects key sentences from the original text to form a summary; the other is abstractive, in which the computer first understands the original content and then re-expresses it in its own words. Automatic summarization is currently most widely applied in news, where it helps users absorb the most valuable stories in the shortest time in an era of information overload. In addition, extracting structured knowledge from unstructured data will be a major direction for question-answering bots.
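A minimal sketch of the extractive approach: score each sentence by the document-wide frequency of its words and keep the top-scoring ones (a toy heuristic to show the idea, not a production summarizer):

```python
from collections import Counter

def extractive_summary(sentences, n=1):
    """Score sentences by the total document frequency of their words,
    keep the top-n, and return them in their original order."""
    freq = Counter(w for s in sentences for w in s.lower().split())
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in s.lower().split()),
                    reverse=True)
    keep = set(ranked[:n])
    return [s for s in sentences if s in keep]

doc = ["the model reads text", "summaries are short", "the model writes more text"]
print(extractive_summary(doc))  # ['the model writes more text']
```

The abstractive approach, by contrast, requires genuine language understanding and generation, which is why it is far harder.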

4. Entity recognition

Entity recognition refers to identifying entities of specific categories in a text, such as person names, place names, numeric values, and proper nouns. It is widely used in information retrieval, automatic question answering, knowledge graphs, and other fields. The purpose of entity recognition is to tell the computer that a given word belongs to a certain class of entity, which helps identify the user's intent. For example, Baidu's knowledge graph:

The entity recognized in "How old is Stephen Chow" is "Stephen Chow" (a celebrity entity), and the relation is "age". The search system thus knows the user is asking for a celebrity's age; it combines the stored fact "Stephen Chow — date of birth — June 22, 1962" with the current date to compute his age, and displays the result directly to the user instead of a list of candidate links.
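The final step in that example is simple date arithmetic. The sketch below assumes a hypothetical toy entity table (the name and date mirror the example above; a real knowledge graph is vastly larger):

```python
from datetime import date

# hypothetical entity table such as a knowledge graph might hold
ENTITIES = {"Stephen Chow": {"type": "celebrity", "birth": date(1962, 6, 22)}}

def answer_age(name, today):
    """Look up the recognized entity and compute a whole-year age,
    subtracting one if this year's birthday has not yet passed."""
    birth = ENTITIES[name]["birth"]
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday

print(answer_age("Stephen Chow", date(2019, 1, 1)))  # 56
```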

In addition, common NLP tasks include topic identification, machine translation, text classification, text generation, sentiment analysis, keyword extraction, and text similarity. I will give a brief introduction to these when I have time in the future.

3. The current difficulties of NLP

1. Language is unstandardized and highly flexible

Natural language is not standardized. Although some basic rules can be found, it is too flexible: the same meaning can be expressed in many different ways, which makes it difficult either to understand natural language with rules or to learn its inherent regularities from data through machine learning.

2. Typos

When processing text, we encounter a large number of typos. Getting the computer to understand the true meaning behind these typos is also a major difficulty in NLP.

3. New words

We are in an era of rapid Internet development, and a large number of new words appear online every day. Quickly discovering these new words and making computers understand them is another difficulty for NLP.

4. There are still shortcomings in using word vectors to represent words

As mentioned above, we use word vectors to let computers understand words, but the word-vector representation space is discrete, not continuous. Take some positive words: "good", "very good", "great", "awesome". In the word-vector space there is no continuum of words running from "good" to "very good"; the space is discrete and discontinuous. The biggest problem with discontinuity is non-differentiability: computers handle differentiable functions very easily, while non-differentiable ones increase the amount of computation. There are algorithms that compute word vectors via continuous approximation, but this is inevitably accompanied by information loss. In short, word vectors are not the best way to represent words, and a better mathematical language is needed. Of course, it may be that human natural language is itself discontinuous, or that humans simply cannot create a "continuous" natural language.

Summary: Through the above content, we now have a rough idea of "what NLP is, what it can do, and the problems it currently faces". As artificial intelligence product managers, understanding NLP technology improves our technical literacy, which helps greatly in understanding industry needs and driving project development. In fact, it gives us a kind of bridging ability: connecting needs with engineers, and connecting problems with solutions. Although AI technologies such as NLP still have many shortcomings, we need to adjust our mindset: the application of artificial intelligence has only just begun and is bound to be imperfect. Don't be a critic — be a promoter of the AI era.
