A decision tree is a model commonly used to study classification and prediction relationships. For example, four personal characteristics, such as whether one smokes, whether one drinks alcohol, one's age, and one's weight, may affect 'whether one has cancer'. The four personal characteristics are called 'features', i.e., the independent variables (influencing factors X), and 'whether one has cancer' is called the 'label', i.e., the dependent variable (influenced item Y). When building a decision tree, the model might first split on age, for example at 70 years: when age is greater than 70, a sample is more likely to be classified as 'has cancer'. It might then split on weight, for example at 50 kilograms: when weight is greater than 50 kilograms, a sample is more likely to be classified as 'has cancer'. The process continues in this way, and each logical combination of features (e.g., age greater than 70 and weight greater than 50 kilograms) corresponds to a label value for whether one has cancer.
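To make the feature/label idea concrete, here is a minimal sketch in Python using scikit-learn (SPSSAU itself requires no code; the tiny cancer data set below is made up purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: columns = smokes, drinks, age, weight (the features X)
X = np.array([
    [1, 0, 75, 80],
    [0, 1, 45, 55],
    [1, 1, 72, 52],
    [0, 0, 30, 48],
])
# Label Y: 1 = has cancer, 0 = does not (made-up values)
y = np.array([1, 0, 1, 0])

# The fitted tree learns split rules of exactly the kind described
# above, such as "age > 70" or "weight > 50"
model = DecisionTreeClassifier().fit(X, y)
```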
A decision tree is a predictive model. For it to have good predictive power, the data usually need to be divided into two groups: training data and test data. The training data are used to build the model, that is, to establish the correspondence between feature combinations and labels; once this correspondence (the model) is obtained, the test data are used to evaluate how good the current model is. Typically, the ratio of training data to test data is 9:1, 8:2, 7:3, 6:4, or 5:5 (e.g., 9:1 means that 90% of all the data is used to train the model and the remaining 10% is used to test whether the model is good). The specific ratio depends on the amount of research data; there is no fixed standard. If the data set is small, say only a few hundred rows, one can consider using 70%, 60%, or even 50% of the data for training and the rest for testing. The above covers two parts: model construction and model prediction. If the model obtained from the training data is excellent, one can consider saving it and deploying it for use (this is a computer-engineering application that SPSSAU does not provide). In addition, once the decision tree model is built, it can be used for prediction, for example, whether a new patient will develop cancer and how likely he is to develop it.
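As a sketch of the train/test division described above (scikit-learn again as a stand-in; the synthetic data and the 9:1 ratio are just one of the options mentioned):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data purely for illustration
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# A 9:1 split: 90% for training, 10% for testing; test_size=0.2
# or 0.3 would give the 8:2 / 7:3 ratios mentioned above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)
print(len(X_train), len(X_test))  # 270 30
```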
The decision tree model can also be used to assess the quality of features. For example, the four items above (whether one smokes, whether one drinks, age, and weight) can be ranked by their importance for predicting 'whether one has cancer', which helps screen out the most useful features.
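A sketch of how such a feature ranking could be inspected outside SPSSAU, assuming scikit-learn's impurity-based importances as the ranking method (the feature names are the hypothetical ones from the example above):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names matching the cancer example
features = ["smokes", "drinks", "age", "weight"]
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Rank features from most to least useful for the prediction
ranking = sorted(zip(features, tree.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```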
Constructing a decision tree model requires setting parameters, and the purpose of these settings is to build a good model (the criterion for a good model is usually that it evaluates well on both the training data and the test data). Special attention should be paid to the case where the evaluation results on the training data are very good (even 100% accuracy and similarly perfect metrics) but the evaluation results on the test data are very poor; this is called 'overfitting' and requires special attention in real data analysis. Generally, the more complex the parameter settings, the better the model evaluates on the training data but the worse it performs on the test data. Therefore, when building a decision tree, the relevant parameter settings need special attention. The following uses case data to illustrate this.
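One common way to spot the overfitting described here is to compare the model's score on the training data with its score on the test data. A minimal sketch, again using scikit-learn with a deliberately unconstrained tree and synthetic noisy data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which makes overfitting easier to see
X, y = make_classification(n_samples=300, n_features=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# An unconstrained tree can memorize the training data
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # often 1.0
print("test accuracy:", tree.score(X_test, y_test))     # usually lower
```

A large gap between the two scores is exactly the overfitting symptom described above.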
The operation of SPSSAU is as follows:
The training set ratio is left at its default of 0.8, meaning 80% of the data (150 × 0.8 = 120 samples) is used to train the decision tree model, and the remaining 20% (30 samples, the test data) is used for model validation. It should be noted that in most cases the data are first standardized; z-score (normal) standardization is generally used, and the purpose of this step is to keep the features on a consistent scale. Other treatments can also be used, such as interval scaling, normalization, and so on.
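A sketch of the 0.8 split and the z-score standardization step, assuming 150 rows of random placeholder data as in this case (scikit-learn is used here as an analogy for SPSSAU's internal processing, not as its actual implementation):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))        # 150 samples, as in this case
y = rng.integers(0, 2, size=150)

# train_size=0.8 -> 120 training samples, 30 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# z-score ("normal") standardization: fit on the training data only,
# then apply the same transform to the test data
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
# MinMaxScaler would be the interval-scaling alternative mentioned above
```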
Then the parameter settings are as follows:
The node splitting criterion defaults to the gini coefficient (this parameter only determines how split quality is calculated and usually does not need to be changed). The node division method defaults to 'best', meaning that at each split the features are considered in order of how well they separate the classes. If parameter comparison is needed, it is recommended to switch this value to 'random', which considers the features in random order, so that the model training effect under the two settings can be compared.
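In scikit-learn terms (an analogy, not SPSSAU's internal code), these two settings correspond to the criterion and splitter parameters:

```python
from sklearn.tree import DecisionTreeClassifier

# Default: gini impurity as the split criterion, best-first feature choice
tree_best = DecisionTreeClassifier(criterion="gini", splitter="best")
# Alternative for comparison: consider candidate features in random order
tree_random = DecisionTreeClassifier(criterion="gini", splitter="random",
                                     random_state=0)
```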
The minimum sample size for node splitting can be left at its default of 2, and the minimum sample size for leaf nodes at its default of 1. It should be noted that if the amount of data is large, it is recommended to set these two parameters as large as possible to reduce overfitting; however, the larger these two values, the worse the fit on the training data usually is. The fit on the test data should be the deciding factor, because the training model is prone to overfitting. The maximum tree depth parameter specifies how many layers the decision tree can have at most; the larger this value, the better the fit on the training data usually is, but it may cause overfitting. For demonstration purposes, this case sets it to 4 layers. (Another tip: the actual depth of the tree is also constrained by the minimum sample sizes for node splitting and for leaf nodes, so setting the maximum depth to 4 does not mean the tree will necessarily reach 4 layers.)
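Putting the parameters from this case together in the same scikit-learn analogy (the parameter names below are scikit-learn's, chosen to mirror the SPSSAU settings; the data are synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",       # node splitting criterion (default)
    splitter="best",        # node division method (default)
    min_samples_split=2,    # minimum sample size to split a node
    min_samples_leaf=1,     # minimum sample size of a leaf node
    max_depth=4,            # maximum tree depth, as set in this case
    random_state=0,
).fit(X_train, y_train)

# get_depth() may report less than 4: the minimum-sample settings
# can stop the tree before it reaches the maximum depth
print(tree.get_depth())
```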
Example of SPSSAU results: