In statistics, regression analysis is a method of statistical analysis that identifies interdependent quantitative relationships between two or more variables. Regression analysis is divided into univariate and multivariate regression according to the number of variables involved, into simple and multiple regression analysis according to the number of dependent variables, and into linear and nonlinear regression analysis according to the type of relationship between the independent and dependent variables.
Methods
There are various regression techniques used for prediction. These techniques are distinguished mainly by three factors: the number of independent variables, the type of dependent variable, and the shape of the regression line.
1. Linear Regression
It is one of the most familiar modeling techniques. Linear regression is usually one of the preferred techniques when people learn predictive modeling. In this technique, the dependent variable is continuous, the independent variable can be continuous or discrete, and the regression line is linear in nature.
Linear regression uses a best-fit straight line (also known as a regression line) to establish a relationship between the dependent variable (Y) and one or more independent variables (X).
Multiple linear regression can be expressed as Y = a + b1*X1 + b2*X2 + ... + bk*Xk + e, where a represents the intercept, the b coefficients represent the slopes, and e is the error term. Multiple linear regression can predict the value of the target variable based on the given predictor variable(s).
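As a minimal sketch of fitting such a best-fit line, assuming scikit-learn is available and using made-up data with two illustrative predictors:

```python
# Linear regression sketch: estimate a, b1, b2 in Y = a + b1*X1 + b2*X2 + e (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))                              # two illustrative predictors
y = 3.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, 50)    # known coefficients plus noise

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)        # estimate of a
print("slopes b1, b2:", model.coef_)           # estimates of b1, b2
print("prediction at X1=4, X2=2:", model.predict([[4.0, 2.0]]))
```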
2. Logistic Regression
Logistic regression is used to calculate the probability of "Event=Success" and "Event=Failure". Logistic regression should be used when the type of dependent variable is a binary (1 / 0, true/false, yes/no) variable. Here, the value of Y is 0 or 1 and it can be represented by the following equation.
odds = p/(1-p) = probability of event occurrence / probability of event not occurring
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
In the above equation, p expresses the probability of having a certain characteristic. You may ask, "Why use the log of the odds in the equation?"
Because the dependent variable follows a binomial distribution here, a link function that is optimal for this distribution needs to be chosen: the logit function. In the above equation, the parameters are chosen by maximizing the likelihood of the observed sample, rather than by minimizing the sum-of-squares error (as in ordinary regression).
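A brief sketch of fitting the logit model above by maximum likelihood, assuming scikit-learn and synthetic binary (0/1) data:

```python
# Logistic regression sketch: fit logit(p) = b0 + b1*X1 + b2*X2 on simulated 0/1 outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
log_odds = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]   # true log-odds, used only to simulate labels
p = 1.0 / (1.0 + np.exp(-log_odds))
y = rng.binomial(1, p)                            # binary outcome: Event = 1 / 0

clf = LogisticRegression().fit(X, y)              # parameters estimated by maximum likelihood
print("b0:", clf.intercept_[0])
print("b1, b2:", clf.coef_[0])
print("P(Event=1 | X=[1, 0]):", clf.predict_proba([[1.0, 0.0]])[0, 1])
```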
3. Polynomial Regression
For a regression equation, if the exponent of the independent variable is greater than 1, then it is a polynomial regression equation. This is shown in the following equation:
y=a+b*x^2
In this regression technique, the line of best fit is not a straight line. Rather, it is a curve used to fit the data points.
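A minimal sketch of fitting such a curve, assuming NumPy and a made-up quadratic dataset:

```python
# Polynomial regression sketch: fit a curve y = a + b*x + c*x^2 by least squares.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 1, x.size)   # quadratic signal plus noise

coeffs = np.polyfit(x, y, deg=2)     # least-squares fit of a degree-2 polynomial
fitted = np.polyval(coeffs, x)       # the curved line of best fit through the data points
print("coefficients (highest degree first):", coeffs)
```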
4. Stepwise Regression
This form of regression can be used when dealing with multiple independent variables. In this technique, the selection of independent variables is carried out by an automated process that involves no human intervention.
This is done by examining statistics such as R-square, t-statistics, and the AIC metric to identify significant variables. Stepwise regression fits a model by adding or removing covariates one at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:
The standard stepwise regression method does two things: it adds and removes predictors as needed at each step.
The forward selection method starts with the most significant predictor in the model and then adds a variable at each step.
The backward elimination method starts with all predictors in the model and then eliminates the least significant variable at each step.
The purpose of this modeling technique is to maximize predictive power using the minimum number of predictor variables. It is also one of the ways to deal with high-dimensional datasets.
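A simple forward-selection sketch using AIC as the criterion; statsmodels and pandas are assumed available, and the dataset and variable names are made up for illustration:

```python
# Forward stepwise selection sketch: at each step add the candidate variable that most
# lowers AIC, and stop when no remaining candidate improves the criterion.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 + 3.0 * X["x1"] - 1.5 * X["x3"] + rng.normal(0, 1, 100)   # only x1 and x3 matter

selected, remaining = [], list(X.columns)
best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic              # intercept-only model
while remaining:
    aics = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().aic for v in remaining}
    best_var = min(aics, key=aics.get)
    if aics[best_var] >= best_aic:
        break                                                     # no candidate lowers AIC further
    selected.append(best_var)
    remaining.remove(best_var)
    best_aic = aics[best_var]

print("selected variables:", selected)
```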
5. Ridge Regression
Ridge regression analysis is used when there is multicollinearity (high correlation among the independent variables) in the data. In the presence of multicollinearity, even though the estimates obtained by ordinary least squares (OLS) are unbiased, their variance can be large, making the observed values far from the true values. Ridge regression reduces the standard error by adding a bias to the regression estimates.
In a linear equation, prediction error can be divided into two components: one due to bias and one due to variance. The prediction error may be caused by either or both. Here, the error caused by variance is discussed.
Ridge regression addresses multicollinearity by shrinking the coefficients through a shrinkage parameter λ (lambda). Consider the following equation:
L2 = argmin ||y - Xβ||² + λ||β||²
In this equation, there are two components. The first is the least-squares term, and the second is λ times the squared norm of β, where β is the vector of regression coefficients; this penalty is added to the least-squares term so that the estimates are shrunk and attain a very low variance.
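A minimal sketch of how the shrinkage stabilizes estimates under multicollinearity, assuming scikit-learn and synthetic data with two nearly identical predictors:

```python
# Ridge regression sketch on nearly collinear predictors (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.01, 100)             # x2 is almost identical to x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, 100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)             # alpha plays the role of the shrinkage parameter λ
print("OLS coefficients:  ", ols.coef_)        # often unstable when predictors are nearly collinear
print("Ridge coefficients:", ridge.coef_)      # shrunk toward each other, lower variance
```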
6. Lasso Regression
Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also adds a penalty term on the regression coefficient vector. In addition, it reduces the degree of variability and improves the accuracy of the linear regression model. Take a look at the following equation:
L1 = argmin ||y - Xβ||² + λ||β||₁
Lasso regression differs from ridge regression in that its penalty function uses the L1 norm instead of the L2 norm. Because the penalty is the sum of the absolute values of the coefficient estimates, some of the parameter estimates are driven exactly to zero.
The larger the penalty, the more estimates are shrunk all the way to zero, so that only a subset of the given n variables is selected.
If a group of predictor variables is highly correlated, Lasso tends to select only one of them and shrink the others to zero.
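A short sketch showing this selection effect, assuming scikit-learn and made-up data in which only two of five predictors matter:

```python
# Lasso sketch: with the L1 penalty, irrelevant coefficients are driven exactly to zero,
# so the model performs variable selection (illustrative data).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, 100)   # only the first two predictors matter

lasso = Lasso(alpha=0.5).fit(X, y)             # alpha is the shrinkage parameter λ
print("Lasso coefficients:", lasso.coef_)      # the irrelevant coefficients come out as 0.0
```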
7. ElasticNet Regression
ElasticNet is a hybrid of the Lasso and Ridge regression techniques: it is trained with both the L1 and L2 penalties as the regularizer. ElasticNet is useful when there are multiple correlated features; Lasso is likely to pick one of them at random, while ElasticNet is likely to pick both.
A practical advantage of trading off between Lasso and Ridge is that it allows ElasticNet to inherit some of Ridge's stability under rotation.
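A minimal sketch of the combined penalty, assuming scikit-learn and synthetic data containing a highly correlated pair of predictors; the l1_ratio parameter mixes the L1 and L2 terms:

```python
# ElasticNet sketch: combine L1 (Lasso) and L2 (Ridge) penalties (illustrative data).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.05, 100)             # a pair of highly correlated predictors
X = np.column_stack([x1, x2, rng.normal(size=100)])
y = 3.0 * x1 + rng.normal(0, 1, 100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio=0.5 weights L1 and L2 equally
print("ElasticNet coefficients:", enet.coef_)  # tends to spread weight across the correlated pair
```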
Data exploration is an indispensable part of building a predictive model. It should be the first step in selecting an appropriate model, for example to identify the relationships among variables and their effects.
To compare the goodness of fit of different models, you can analyze different metrics such as the statistical significance of the parameters, R-square, adjusted R-square, AIC, BIC, and the error term. Another is Mallows' Cp criterion, which mainly checks for possible bias in your model by comparing it with all possible sub-models (or a careful selection of them).
Cross-validation is the best way to evaluate predictive models. Here, split your dataset into two parts (one for training and one for validation), and use a simple mean squared deviation between the observed and predicted values to measure the accuracy of your predictions.
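A brief sketch of this evaluation using k-fold cross-validation and mean squared error, assuming scikit-learn and synthetic data:

```python
# Cross-validation sketch: estimate out-of-sample mean squared error with 5-fold CV.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 3))
y = 1.0 + 2.0 * X[:, 0] - X[:, 2] + rng.normal(0, 1, 120)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("mean squared error per fold:", -scores)
print("average MSE:", -scores.mean())
```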
If your dataset has several confounding variables, you should not choose an automatic model-selection method, because you do not want to put all of them into the same model at the same time.
It will also depend on your purpose. A less powerful model may be easier to implement than one that is highly statistically significant. Regression regularization methods (Lasso, Ridge, and ElasticNet) work well when the dataset has high dimensionality and multicollinearity among the variables.
Assumptions and content
There are a number of conditional assumptions that are generally made about the data in data analysis:
Homogeneity of variance
Linear relations
Additivity of effects
Variables have no measurement error
Variables follow a multivariate normal distribution
Observations are independent
The model is complete (it does not contain variables that should not be entered, nor omit variables that should be entered)
Error terms are independent and normally distributed with mean zero.
Real data often do not fully satisfy the above assumptions. As a result, statisticians have developed numerous regression models to relax the assumptions made in the linear regression model.
The main elements of regression analysis are:
①From a set of data, determine the quantitative relationship between certain variables, i.e., establish a mathematical model and estimate its unknown parameters. The most common method of estimating the parameters is the least squares method (a brief sketch follows this list).
②Test the degree of confidence that can be placed in these relational equations.
③In a relationship in which many independent variables jointly affect a dependent variable, determine which independent variables' influence is significant and which is not; add the significant independent variables to the model and eliminate the insignificant ones, usually by stepwise regression, forward selection, or backward elimination.
④Use the resulting relational equation to predict or control a production process. The applications of regression analysis are very extensive, and statistical software packages make it very easy to compute the various regression methods.
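As referenced in item ① above, a minimal sketch of least-squares parameter estimation, assuming NumPy and made-up data for a model of the form Y = a + b1*X1 + b2*X2:

```python
# Least-squares estimation sketch: solve for the intercept and slopes directly from data.
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(80, 2))
y = 0.5 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 1, 80)

design = np.column_stack([np.ones(len(X)), X])          # column of ones for the intercept a
params, *_ = np.linalg.lstsq(design, y, rcond=None)     # ordinary least squares estimates
print("estimated a, b1, b2:", params)
```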
In regression analysis, variables are divided into two categories. One category is the dependent variables; they are usually the indicators of interest in the actual problem and are denoted by Y. The other category, the variables that affect the value of the dependent variable, are called the independent variables and are denoted by X.
The main problems studied in regression analysis are:
(1) Determining the expression of the quantitative relationship between Y and X, which is called the regression equation;
(2) Testing the credibility of the obtained regression equation;
(3) Determining the effect or lack of effect of the independent variable X on the dependent variable Y;
(4) Using the obtained regression equation for prediction and control.