Issues in Linear Model Building and Data Mining
Duration: 1.0 day
Course Description
Data mining applications that build linear models usually involve variable selection. The stepwise family of methods is the most widely used approach for selecting variables, in both linear regression and logistic regression. This seminar discusses methods of variable selection for present-day giga-bases and the problems that arise when selecting variables.
The seminar introduces the current standard of the stepwise family as well as the problems associated with it, such as redundant and suppressed variables and orthogonalization. For the specific case of logistic regression, the seminar notes how the variable search differs from the linear regression case and focuses on measures of classification and precision. For both types of linear models, issues of goodness of fit are discussed.
If time permits, the seminar concludes with a quick overview of collinearity problems in the area of variable selection.
Course Contents
Correlation, orthogonality, and significance in linear model variable selection.
Likelihood function and variable selection in logistic regression.
Stopping mechanisms.
Working issues:
- Spurious correlations.
- Correlations, redundancy, suppression effects and variable selection (ELI plots).
- Logistic regression and confounders.
- Interpretation myths.
- Critical missing value imputation issues.
- Product transformations and the marginality principle (Nelder).
- To bin or not to bin: Binning and variable selection.
- Isn't it great that we can always blame collinearity? Not so fast.
- Balanced/Unbalanced target.
- ROC, K-S, Precision, and Optimal cutoff point.

