First, the idea of the decision tree algorithm
Second, the essence of decision tree learning
Third, summary.
First, the idea of the decision tree algorithm:
A decision tree is a basic method for classification and regression; this article mainly discusses classification trees. The decision tree model has a tree structure, and in a classification problem it represents the process of classifying instances based on their features. It can be regarded as a set of if-then rules, or as a conditional probability distribution defined on the feature space and the class space. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature (attribute), and a leaf node represents a class. (In diagrams, an ellipse conventionally represents an internal node and a box a leaf node.)
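To make the structure concrete, here is a minimal sketch in Python (the class and function names are illustrative assumptions, not taken from any library):

from dataclasses import dataclass
from typing import Dict, Union

@dataclass
class Leaf:
    label: str                      # leaf node: stores a class

@dataclass
class Internal:
    feature: str                    # internal node: tests one feature
    children: Dict[str, Union["Internal", "Leaf"]]  # directed edges, one per feature value

def classify(node, instance):
    """Classify an instance by walking from the root down to a leaf."""
    while isinstance(node, Internal):
        node = node.children[instance[node.feature]]
    return node.label

# Toy tree: test 'outlook' first, then 'windy' on the rainy branch.
tree = Internal("outlook", {
    "sunny": Leaf("no"),
    "overcast": Leaf("yes"),
    "rainy": Internal("windy", {"true": Leaf("no"), "false": Leaf("yes")}),
})
print(classify(tree, {"outlook": "rainy", "windy": "false"}))  # -> yes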
• Relationship between a decision tree and if-then rules
A decision tree can be regarded as a collection of if-then rules. The transformation works as follows: one rule is constructed for each path from the root node to a leaf node; the features of the internal nodes along the path become the rule's conditions, and the class at the leaf node becomes the rule's conclusion. The paths of a decision tree (equivalently, its if-then rule set) have an important property: they are mutually exclusive and complete. In other words, every instance is covered by exactly one path, i.e., exactly one rule. "Covered" here means that the instance's features agree with the features tested along the path, or equivalently that the instance satisfies the rule's conditions.
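The path-to-rule transformation is easy to sketch in code (the nested-dict tree below is made up for illustration):

# Turn each root-to-leaf path of a toy tree into an if-then rule.
# The tree is a nested dict: {feature: {value: subtree_or_class_label}}.
tree = {"outlook": {
    "sunny": "no",
    "overcast": "yes",
    "rainy": {"windy": {"true": "no", "false": "yes"}},
}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):          # leaf reached: emit one rule
        yield conditions, node
        return
    (feature, branches), = node.items()     # each internal node tests one feature
    for value, child in branches.items():
        yield from extract_rules(child, conditions + ((feature, value),))

for conds, label in extract_rules(tree):
    print("IF " + " AND ".join(f"{f} = {v}" for f, v in conds) + f" THEN class = {label}")

The printed rules are mutually exclusive and complete over these two features: any instance matches exactly one of them.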
• Relationship between a decision tree and a conditional probability distribution
A decision tree also represents the conditional probability distribution of the classes given the features. This distribution is defined on a partition of the feature space: the feature space is divided into disjoint units (regions), and a class probability distribution is defined on each unit; together these form the conditional distribution. Each path of the decision tree corresponds to one unit of the partition, and the conditional probability distribution represented by the tree is composed of the class distributions conditioned on each unit.
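In code terms, a leaf (one unit of the partition) would hold the empirical class distribution of the training instances routed into it; a tiny sketch with made-up labels:

from collections import Counter

labels_in_unit = ["yes", "yes", "yes", "no"]      # labels routed to one leaf/unit
counts = Counter(labels_in_unit)
total = sum(counts.values())
print({c: n / total for c, n in counts.items()})  # {'yes': 0.75, 'no': 0.25}
# Classification assigns an instance falling in this unit to the most
# probable class, here "yes".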
• Advantages of the decision tree model
The decision tree model is highly readable and classifies quickly. During learning, a decision tree model is built from the training data according to the principle of minimizing a loss function; during prediction, new data are classified by the learned tree.
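For instance, with scikit-learn (assuming it is installed; the toy data below is made up):

# Learning and prediction with scikit-learn's decision tree.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy feature vectors
y_train = [0, 0, 1, 1]                       # toy labels (class = first feature)

clf = DecisionTreeClassifier(criterion="entropy")  # split by information gain
clf.fit(X_train, y_train)                          # learning step
print(clf.predict([[0, 1], [1, 0]]))               # prediction step -> [0 1]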
Second, the essence of decision tree learning:
Decision tree learning essentially summarizes a set of classification rules from the training data set. There may be many decision trees consistent with the training data, or none at all; what we want is a tree that contradicts the training data as little as possible while generalizing well. From another point of view, decision tree learning estimates a conditional probability model from the training data set. There are infinitely many conditional probability models of the classes based on partitions of the feature space; the model we choose should not only fit the training data well but also predict unknown data well.

Decision tree learning expresses this goal with a loss function, usually a regularized maximum likelihood function, and its learning strategy is to minimize this loss function. Once the loss function is fixed, the learning problem becomes that of choosing the optimal decision tree in the sense of the loss function. In practice this is a recursive process: select the best feature, split the training data by that feature so that each subset gets the best possible classification, and repeat. The process corresponds to three steps: feature selection, decision tree generation, and decision tree pruning.
• Feature selection: selecting the features with real classifying power for the training data improves the learning efficiency of the decision tree.
• Decision tree generation: choosing different features as the root node splits the data into different child nodes and therefore yields different decision trees.
• Decision tree selection: the feature whose split yields the largest information gain is chosen as the root node (it is the best classification feature); applying the same criterion recursively yields the final decision tree, as sketched below.
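Taken together, the three steps above suggest an ID3-style recursion; here is a compact sketch (illustrative only, not production code; entropy and information gain are defined just below):

# ID3-style sketch: recursively choose the feature with the largest
# information gain, split the data on it, and recurse on each subset.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    n = len(labels)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in subsets.values())
    return entropy(labels) - conditional   # parent entropy - weighted child entropy

def build(rows, labels, features):
    if len(set(labels)) == 1 or not features:       # pure node, or no features left
        return Counter(labels).most_common(1)[0][0] # leaf: majority class
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    rest = [f for f in features if f != best]
    node = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build([rows[i] for i in keep],
                                  [labels[i] for i in keep], rest)
    return node

rows = [{"outlook": "sunny", "windy": "true"},
        {"outlook": "sunny", "windy": "false"},
        {"outlook": "rainy", "windy": "true"}]
labels = ["no", "yes", "no"]
print(build(rows, labels, ["outlook", "windy"]))
# -> {'windy': {'true': 'no', 'false': 'yes'}} (key order may vary)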
• Information entropy: in information theory and probability statistics, entropy measures the uncertainty of a random variable. Let X be a discrete random variable taking finitely many values, with probability distribution P(X = x_i) = p_i, i = 1, 2, …, n. The entropy of X is then defined as

H(X) = −∑_{i=1}^{n} p_i log p_i,

and it satisfies 0 ≤ H(X) ≤ log n (at most 1 for a binary variable with base-2 logarithms). The greater the entropy, the greater the uncertainty of the random variable.
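For example, for a fair coin toss (p = 0.5 for each side), H = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit, the maximum for n = 2; if one side were certain (p = 1), H = 0, i.e., no uncertainty at all.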
• Conditional entropy H(Y|X): the uncertainty of the random variable Y given that the random variable X is known; it is defined as H(Y|X) = ∑_{i=1}^{n} p_i H(Y | X = x_i), where p_i = P(X = x_i).
• Information gain: indicates the degree to which knowing the information of feature X reduces the uncertainty about the class Y.
Information gain = information entropy of the parent node − conditional entropy (the weighted entropy of the child nodes); in symbols, g(D, A) = H(D) − H(D|A) for a data set D and feature A.
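As a made-up worked example: suppose a parent node holds 10 training samples, 6 positive and 4 negative, so its entropy is −(0.6 log₂ 0.6 + 0.4 log₂ 0.4) ≈ 0.971. Suppose a feature splits it into two child nodes of 5 samples each, one with 4 positive / 1 negative (entropy ≈ 0.722) and one with 2 positive / 3 negative (entropy ≈ 0.971). The weighted child entropy is 0.5 × 0.722 + 0.5 × 0.971 ≈ 0.846, so the information gain of this feature is about 0.971 − 0.846 ≈ 0.125.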
Third, summary:
Advantages
1. Highly interpretable; can handle nonlinear data; requires no data normalization; makes no assumptions about the data distribution.
2. It can be used for feature engineering and feature selection.
3. It can be transformed into a rule engine.
Disadvantages
1. The tree is generated heuristically (greedily), so it is not guaranteed to be the optimal solution.
2. Prone to overfitting.
3. Small changes in the data can change the shape of the whole tree.
4. Not friendly to data with imbalanced classes.