How do you deal with missing values?
Missing data is a common problem. If the data set is large and the proportion of missing values is small (for example, 10% or less), you can consider simply deleting the affected cases. With a small data set, however, you cannot afford to delete them so readily.

In that situation, filling in the missing values is the recommended approach.

SPSS offers two menu entries for filling in missing values. One is Replace Missing Values under the Transform menu, and the other is Missing Value Analysis under the Analyze menu.

The former is simple and easy to use, while the latter is more specialized and complicated. In most cases, I think Replace Missing Values is enough. This article therefore focuses on the missing value replacement method, as covered in the video course "SPSS from Introduction to Practice".

Case data

The case data are 11 years of economic data in which the 2013 value of "tertiary industry value" is missing. The true value is stored in the "original" variable on the right, so the filled value can be compared with the real 2013 figure.

Replacing missing values in SPSS

Menu: Transform → Replace Missing Values, which opens the Replace Missing Values dialog box.

SPSS provides five methods for filling in missing values here, namely:

Series mean

Replaces missing values with the mean of the entire series.
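
As a rough illustration outside SPSS, the series-mean fill can be sketched in a few lines of Python. The function name `fill_series_mean`, the use of `None` to mark a missing value, and the sample numbers are assumptions made for this example; they are not the case data.

```python
from statistics import mean


def fill_series_mean(values):
    """Fill every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical series, not the article's data.
series = [2.0, 4.0, None, 8.0, 10.0]
filled = fill_series_mean(series)  # the gap receives the mean of 2, 4, 8, 10
```

Every gap in the series receives the same value, which is why this method ignores any trend in the data.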

Mean of nearby points

Replaces missing values with the mean of the valid surrounding values. The span of nearby points is the number of valid values above and below the missing value that are used to compute the mean.
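
The role of the span is easier to see in code. The sketch below is plain Python, not SPSS's own implementation; `fill_nearby_mean` is a hypothetical helper, `None` marks a missing value, and leaving a gap unfilled when either side has fewer than `span` valid values is an assumption of this sketch.

```python
from statistics import mean


def fill_nearby_mean(values, span=1):
    """Fill each gap with the mean of `span` valid values on each side of it."""
    if span < 1:
        raise ValueError("span must be at least 1")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Nearest `span` observed values before and after position i.
        before = [x for x in values[:i] if x is not None][-span:]
        after = [x for x in values[i + 1:] if x is not None][:span]
        # Assumption: skip the gap unless both sides supply `span` values.
        if len(before) == span and len(after) == span:
            filled[i] = mean(before + after)
    return filled


# Hypothetical series, not the article's data.
series = [10.0, 20.0, None, 40.0, 70.0]
span1 = fill_nearby_mean(series, span=1)  # uses 20 and 40
span2 = fill_nearby_mean(series, span=2)  # uses 10, 20, 40 and 70
```

With a span of 1 the gap is filled with 30.0, and with a span of 2 it is filled with 35.0, so a wider span pulls in values that lie further from the gap.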

Median of nearby points

Replaces missing values with the median of the valid surrounding values. The span of nearby points is the number of valid values above and below the missing value that are used to compute the median.
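
A median-based fill follows the same pattern as the mean-based one; only the summary statistic changes. The sketch below is illustrative Python with a hypothetical helper `fill_nearby_median`. As an assumption of this sketch, a gap is left unfilled when either side has fewer than `span` valid values.

```python
from statistics import median


def fill_nearby_median(values, span=1):
    """Fill each gap with the median of `span` valid values on each side of it."""
    if span < 1:
        raise ValueError("span must be at least 1")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        before = [x for x in values[:i] if x is not None][-span:]
        after = [x for x in values[i + 1:] if x is not None][:span]
        # Assumption: skip the gap unless both sides supply `span` values.
        if len(before) == span and len(after) == span:
            filled[i] = median(before + after)
    return filled


# Hypothetical series with one unusually large neighbor (100).
series = [10.0, 20.0, None, 40.0, 100.0]
filled = fill_nearby_median(series, span=2)  # median of 10, 20, 40, 100
```

Here the gap is filled with 30.0, whereas the mean of the same four neighbors would be 42.5, so the median is less affected by the single large value.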

Linear interpolation

Replaces missing values using linear interpolation. The last valid value before the missing value and the first valid value after it are used for the interpolation. If the first or last case in the series has a missing value, that value is not replaced. The calculation is similar in spirit to the second method, which averages adjacent points: for a single isolated gap, the result is simply the mean of the two neighboring values.
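
Linear interpolation as described above can be sketched as follows. This is a minimal Python illustration with a hypothetical helper `fill_linear_interpolation`; cases are assumed to be equally spaced, `None` marks a missing value, and gaps at the very start or end are left as they are.

```python
def fill_linear_interpolation(values):
    """Fill interior gaps on the straight line between the nearest valid neighbors."""
    filled = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Index of the last valid value before i and the first valid value after i.
        left = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
        right = next((j for j in range(i + 1, n) if values[j] is not None), None)
        if left is None or right is None:
            continue  # leading or trailing gap: not replaced
        rise = values[right] - values[left]
        filled[i] = values[left] + rise * (i - left) / (right - left)
    return filled


# Hypothetical series with a leading gap, a two-case gap, and a trailing gap.
series = [None, 10.0, None, None, 40.0, None]
filled = fill_linear_interpolation(series)
```

The two interior gaps are filled with 20.0 and 30.0, while the first and last cases stay missing because they have a valid neighbor on one side only.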

Linear trend at point

Replaces missing values with the linear trend at that point. The existing series is regressed on an index variable running from 1 to n, and the missing values are replaced with the predicted values. Put simply, SPSS fits a straight line to the series: the index is the independent variable, the series containing the missing value is the dependent variable, and the fitted model predicts the replacement value.
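
To make the regression step concrete, here is a minimal Python sketch of a linear-trend fill using ordinary least squares on the index 1 to n. The helper name `fill_linear_trend` and the sample series are assumptions for illustration; this is not SPSS's own code.

```python
def fill_linear_trend(values):
    """Regress observed values on their index (1..n) and predict the gaps."""
    points = [(t, y) for t, y in enumerate(values, start=1) if y is not None]
    if len(points) < 2:
        raise ValueError("need at least two observed values to fit a line")
    n = len(points)
    t_mean = sum(t for t, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    # Ordinary least squares slope and intercept.
    sxx = sum((t - t_mean) ** 2 for t, _ in points)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in points)
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [
        intercept + slope * t if y is None else y
        for t, y in enumerate(values, start=1)
    ]


# Hypothetical series that lies exactly on y = 2t + 1, with case 3 missing.
series = [3.0, 5.0, None, 9.0, 11.0]
filled = fill_linear_trend(series)
```

Because the observed points lie exactly on a line, the fitted trend reproduces it and the gap is filled with approximately 7.0. Unlike linear interpolation, the fitted line uses every observed case, not just the two neighbors of the gap.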

It is enough to know what these five methods do. There is no need to dwell on the computational details (save yourself the effort and trust SPSS). The first two methods, which are based on the mean, are the most commonly used and the easiest to understand.

Explanation

I will skip a demonstration of the first method and describe it in words instead. Apart from the missing 2013 value, "tertiary industry value" has 10 observations, and their mean is 750.06. This is the so-called series mean, so 750.06 is used to fill the missing 2013 value.

Now look at the second method, the mean of nearby (adjacent) points.

Our data are a time series, and "tertiary industry value" increases year by year, so filling the gap with the mean of the whole series is clearly a poor choice. On reflection, a mean computed from the nearest 1 or 2 years around the gap will be more effective and closer to the true value.
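
The argument can be checked on a small example. The Python sketch below uses made-up numbers for an 11-case increasing series (the real case data are not reproduced here), hides one value whose true figure is known, and compares the error of the series mean with that of the nearby-points mean for spans of 1 and 2. All names and values are assumptions for illustration.

```python
from statistics import mean

# Hypothetical, steadily increasing series of 11 cases (not the article's data).
true_values = [100.0, 120.0, 150.0, 190.0, 240.0, 300.0,
               370.0, 450.0, 540.0, 640.0, 750.0]
gap = 7  # position of the value we pretend is missing
observed = [v for i, v in enumerate(true_values) if i != gap]


def nearby_mean(values, i, span):
    """Mean of `span` values on each side of position i (interior gap assumed)."""
    return mean(values[i - span:i] + values[i + 1:i + 1 + span])


estimates = {
    "series mean": mean(observed),
    "nearby mean, span 1": nearby_mean(true_values, gap, 1),
    "nearby mean, span 2": nearby_mean(true_values, gap, 2),
}
# Absolute error of each estimate against the known true value.
errors = {name: abs(est - true_values[gap]) for name, est in estimates.items()}

for name, est in estimates.items():
    print(f"{name}: estimate {est:.1f}, error {errors[name]:.1f}")
```

On this made-up series the series mean misses the true value by 110, while the nearby-points means miss it by 5 (span 1) and 12.5 (span 2), which matches the reasoning above for data with a clear upward trend.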