Random forest algorithm

Randomness is the core of random forest: by randomly selecting samples and features, the correlation between the decision trees is reduced. The randomness has two aspects: one is to draw, at random and with replacement, a training set of the same size as the original from the original training data; the other is to randomly select a subset of the features when building each decision tree. Together, these two kinds of randomness lower the correlation between the trees and further improve the accuracy of the model.

Random forest does not prune its decision trees, so how is over-fitting of the model controlled? Mainly by controlling the depth of the trees (max_depth), the minimum number of samples below which a node stops splitting (min_size) and similar parameters. Random forests can also handle missing values.

Assume the training set contains n samples, each described by d features, and that a random forest of T trees is to be trained. The algorithm flow is as follows (a minimal code sketch is given after the list):

1. For each of the T decision trees, repeat the following: a. Draw a bootstrap sample of size n (sampling with replacement) from the training set D to serve as that tree's training set; b. Randomly select m (m < d) of the d features and use them when growing the decision tree on the bootstrap sample, without pruning;

2. If it is a regression problem, the final output is the average output of each tree;

3. If it is a classification problem, determine the final category according to the voting principle.
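
The flow above can be illustrated with a minimal from-scratch sketch (not from the original article) that uses scikit-learn decision trees as base learners. Note that a real random forest re-samples the m features at every split, whereas this sketch fixes one feature subset per tree for brevity, and it assumes non-negative integer class labels.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_forest(X, y, T=100, m=None, random_state=0):
        """Grow T trees, each on a bootstrap sample and a random subset of m features."""
        rng = np.random.RandomState(random_state)
        n, d = X.shape
        m = m if m is not None else int(np.log2(d)) + 1  # empirical rule m = log2(d) + 1
        forest = []
        for _ in range(T):
            rows = rng.randint(0, n, size=n)             # a. bootstrap sample of size n (with replacement)
            cols = rng.choice(d, size=m, replace=False)  # b. random subset of m of the d features
            tree = DecisionTreeClassifier(random_state=random_state)
            tree.fit(X[rows][:, cols], y[rows])          # grow the tree without pruning
            forest.append((tree, cols))
        return forest

    def predict_forest(forest, X):
        """Classification: each tree votes and the majority class wins."""
        votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest]).astype(int)
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

For regression, the voting step would simply be replaced by averaging the tree outputs, as in step 2 above.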

The generation of each tree is random. As for the number m of randomly selected features, there are two main ways to determine it: one is cross-validation, and the other is the empirical rule m = log2(d) + 1.
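
As a quick worked example of the empirical rule, with a hypothetical feature count d:

    import math
    d = 100                      # suppose each sample has d = 100 features (hypothetical)
    m = int(math.log2(d)) + 1    # log2(100) is about 6.64, so m = 6 + 1 = 7
    print(m)                     # 7 features would be drawn for each random selection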

1. Classification margin (classification interval): for a single sample, the margin is the proportion of decision trees in the forest that classify it correctly minus the proportion that classify it incorrectly; the margin of the random forest is the average of the margins over all samples. The larger the margin, the better: a large margin means the classification is more stable and the model generalizes better.

2. Out-of-bag error: for each tree, the samples that were not drawn into its bootstrap sample are called its out-of-bag samples. The prediction error rate of the random forest on out-of-bag samples is called the out-of-bag (OOB) error. It is calculated as follows (a short scikit-learn example is given after these steps):

(1) For each sample, collect the predictions of the trees for which it is an out-of-bag sample;

(2) Determine the sample's predicted class by majority vote over those trees;

(3) Take the ratio of the number of misclassified samples to the total number of samples as the out-of-bag error of the random forest.
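
Assuming scikit-learn's implementation, the out-of-bag error can be read off a fitted forest by enabling oob_score; the data set below is synthetic and purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
    clf.fit(X, y)
    oob_error = 1.0 - clf.oob_score_   # error rate on samples left out of each tree's bootstrap sample
    print(oob_error)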

3. Variable importance: strictly speaking, variable importance is not a measure of model performance, but in some practical applications it is necessary to know which features are relatively important, and then measuring variable importance becomes essential. There are two main ways to compute it, illustrated with a short code sketch below:

(1) Compute the average information gain (impurity decrease) that each feature contributes across the trees;

(2) Measure each feature's influence on model accuracy: shuffle the values of one feature across the samples to create a new data set, feed the new data set to the trained random forest, and compute the accuracy again. Shuffling an unimportant feature has little effect on the result, whereas shuffling an important feature changes the result considerably.
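
Both measures can be sketched with scikit-learn (an assumption of this example): feature_importances_ corresponds to the impurity/information-gain approach (1), and permutation_importance implements the shuffling approach (2). The data are synthetic.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    print(rf.feature_importances_)                 # (1) mean impurity decrease per feature
    perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
    print(perm.importances_mean)                   # (2) accuracy drop when each feature is shuffled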

Advantages:

1. For most data sets, its classification performance is good.

2. It can handle high-dimensional features, is not prone to over-fitting, and trains relatively quickly, which makes it well suited to large data sets.

3. While deciding the category, it can also evaluate the importance of the variables.

4. Strong adaptability to data sets: it can handle both discrete and continuous data, and the data do not need to be standardized.

Disadvantages:

1. Random forest is prone to over-fitting, especially on small or low-dimensional data sets.

2. The calculation speed is slower than that of a single decision tree.

3. Random forests do not perform well when extrapolating, i.e., when predicting for values of the independent variables outside the range seen during training.

Classification problem
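
The original code under these two headings is not preserved, so what follows is a minimal, hedged sketch assuming scikit-learn's RandomForestClassifier and the built-in iris data.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))   # mean accuracy on the held-out split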

Regression problem
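
A matching regression sketch under the same assumptions: RandomForestRegressor averages the outputs of the individual trees, as described in step 2 of the algorithm flow.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    reg = RandomForestRegressor(n_estimators=100, random_state=0)
    reg.fit(X_train, y_train)
    print(reg.score(X_test, y_test))   # R^2 on the held-out split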

Commonly used methods (refer to /w952470866/article/details/78987265); a short usage example follows the list below.

predict_proba(X): returns the prediction as class probabilities; for each sample, the probabilities over all classes sum to 1.

predict(X): predicts the class of each sample in X; internally it calls predict_proba() and returns the class with the highest predicted probability.

predict_log_proba(X): basically the same as predict_proba, except that the result is the logarithm of the probabilities.

fit(X, y, sample_weight=None): builds the forest of decision trees from the training set (X, y), where X holds the training samples and y the target values (class labels for classification, real numbers for regression).
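
A short demonstration of the four methods above on the iris data (assuming the scikit-learn API):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y, sample_weight=None)     # build the forest from (X, y)

    proba = clf.predict_proba(X[:3])      # per-class probabilities; each row sums to 1
    print(proba, proba.sum(axis=1))
    print(clf.predict(X[:3]))             # class with the highest probability per sample
    print(clf.predict_log_proba(X[:3]))   # log of the probabilities (zero probabilities become -inf)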

Parameters

Comparison with GBDT: GBDT has many framework parameters, the most important being the maximum number of iterations, the learning rate (step size) and the subsampling rate, and tuning them is laborious. RandomForest is comparatively simple because, in the bagging framework, the weak learners are independent of one another, which lowers the difficulty of parameter tuning. In other words, to reach a comparable result, RandomForest needs less tuning time than GBDT.

Bagging framework parameters (a short example follows this group):

n_estimators: the maximum number of weak learners, i.e. the number of trees built in the random forest. Too small a value tends to under-fit, while too large a value adds computation without much further gain, so a moderate value is usually chosen. Increasing it reduces the variance of the whole model and improves accuracy, without affecting the bias or variance of the individual sub-models; because only the second term of the whole model's variance formula shrinks, the accuracy improvement has an upper limit. In practice, values between 1 and 200 are common;

n_jobs: the number of processors the engine is allowed to use. A value of 1 means only one processor is used; a value of -1 means all available processors may be used. Setting n_jobs can speed up model training;

oob_score: whether to use out-of-bag samples to evaluate the quality of the model. The default is False; setting it to True is recommended, because the out-of-bag score reflects the generalization ability of the fitted model;
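
A brief sketch of these three parameters in use (assumed scikit-learn API, synthetic data): the forest is fitted in parallel with n_jobs=-1, scored on out-of-bag samples via oob_score=True, and refitted with increasing n_estimators to show how the gains flatten out.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    for n in (25, 50, 100, 200, 400):
        rf = RandomForestClassifier(n_estimators=n, oob_score=True, n_jobs=-1, random_state=0)
        rf.fit(X, y)
        print(n, round(rf.oob_score_, 3))   # out-of-bag accuracy typically saturates as n grows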

CART decision tree parameters:

max_features: the maximum number of features considered when searching for the best split. It accepts several types of value. The default "None" means all features are considered at every split; "log2" means at most log2(N) features are considered; "sqrt" or "auto" means at most √N features are considered; an integer gives the absolute number of features; a float gives the fraction of features to consider, i.e. round(fraction × N) features, where N is the total number of features in the sample. In general, if the number of features is small (say, fewer than 50), the default "None" is fine; if the number of features is very large, the other options can be used to limit the number of features considered per split and thereby control how long it takes to grow the trees.

max_depth: the maximum depth of each decision tree. The default is None, in which case the depth of the subtrees is not limited while they are built: splitting continues until every leaf is pure (contains a single class) or holds fewer than min_samples_split samples. With few samples or features, this value can be ignored. If the model has many samples and features, it is advisable to limit the maximum depth; the specific value depends on the distribution of the data, and values between 10 and 100 are common.

min_samples_split: the minimum number of samples an internal node must contain before it can be split; the default is 2. It limits the conditions under which a subtree keeps growing: if a node holds fewer than min_samples_split samples, no further attempt is made to find the best split. If the sample size is small, this value can be left alone; if the sample size is very large, it is advisable to increase it.

min_samples_leaf: the minimum number of samples in a leaf node. If a split would leave a leaf with fewer samples than this value, the leaf is pruned together with its sibling. The default is 1. It can be given as an integer (the minimum count) or as a fraction of the total number of samples. If the sample size is small, this value can be left alone; if the sample size is very large, it is advisable to increase it.

min_weight_fraction_leaf: the minimum weighted fraction of samples required in a leaf node. It limits the minimum of the weighted sum of all samples in a leaf; if a leaf falls below this value, it is pruned together with its sibling. The default is 0, meaning sample weights are not considered. In general, if many samples have missing values, or if the class distribution of the samples in a classification tree is strongly skewed, sample weights are introduced and this value then deserves attention.

max_leaf_nodes: the maximum number of leaf nodes. Limiting it helps prevent over-fitting. The default is None, i.e. the number of leaf nodes is not limited. If a limit is set, the algorithm builds the best tree it can within that number of leaves. With few features this value can be ignored; with many features it can be restricted, and the specific value can be found by cross-validation.

min_impurity_split: the minimum impurity required for a node to be split. It limits the growth of the trees: if a node's impurity (Gini index for classification, mean squared error for regression) falls below this threshold, the node is not split further and becomes a leaf. It is generally not recommended to change the default of 1e-7.

The most important decision tree parameters mentioned above include the maximum number of features max_features, the maximum depth max_depth, the minimum number of samples min_samples_split needed to subdivide internal nodes and the minimum number of samples min_samples_leaf of leaf nodes.
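
As a hedged illustration only (the concrete values are placeholders, not recommendations), these four parameters might be set like this, assuming scikit-learn:

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=100,
        max_features="sqrt",     # number of features considered at each split
        max_depth=20,            # cap on the depth of every tree
        min_samples_split=10,    # a node needs at least this many samples to be split
        min_samples_leaf=4,      # every leaf keeps at least this many samples
        random_state=0,
    )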

Parameter tuning: tuning random forest parameters has its place in data analysis and mining, and a good tuning strategy can achieve twice the result with half the effort. For tuning, refer to /cherdw/article/details/54971771.
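
A minimal tuning sketch with GridSearchCV (assumed scikit-learn API; the grid below is illustrative and should be adapted to the data at hand):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 30],
        "min_samples_split": [2, 10],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))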