Issues in Linear Model Building and Data Mining
Duration: 2.0 days
This course discusses methods of, and problems in, variable selection for present-day Giga-bases. Data mining applications typically involve building linear models, which usually requires variable selection. The stepwise family of methods is the most widely used for this purpose, in both linear and logistic regression.
The course introduces the current standard of the stepwise family and the problems associated with it, such as redundant and suppressed variables and orthogonalization. For the specific case of logistic regression, it notes how the variable search differs from the linear regression case and focuses on measures of classification and precision. Goodness-of-fit issues are discussed for both types of linear model.
Who should attend: Experienced predictive modelers and data miners who want to learn about issues that complicate model building and approaches to resolving them.
Prerequisites
Before attending this course, you should have
- knowledge of statistical linear models
- experience developing and assessing predictive models
Course Contents
Data Mining World
- 'mirage' models
- dummy variables and stepwise methods
- number of observations and of predictors
- spurious correlations and causality, suppression effects
- variable importance and interpretation
- collinearity
- variable selection and collinearity
- working with interactions
- removing interaction collinearity
- interactions - case studies
- principle of marginality
- logistic regression basics
- odds, probability, log-odds interpretation
- likelihood, deviance, inference and model fit
- interactions in logistic regression
- stepwise family selection
- variable selection methods - quick comparison
- entry/removal of variables in stepwise searches
- coefficient estimation
- residual analysis, marginal model plots
- balanced and unbalanced populations
- quasi-complete and complete separation
- confounding and collinearity
- variable selection and collinearity in logistic regression
- classification tables
- ROC curves and measures
- direct marketing derived measures: gains table, K-S statistic
- cutoff, high noon of decision making
- ROC and cutoff probabilities
- cutoff decisions based on event precision
- practical approach: profit
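To illustrate the classification measures listed above (ROC measures, the K-S statistic and cutoff choice), here is a minimal sketch. It is not taken from the course material. It assumes that higher scores mean a more likely event and that labels are coded 1 for event and 0 for non-event. The function name ks_statistic and the sample data are hypothetical.

```python
def ks_statistic(scores, labels):
    """Return (ks, cutoff): the largest gap between the cumulative score
    distributions of events (label 1) and non-events (label 0), and the
    score cutoff at which that gap is first reached."""
    n_events = sum(labels)
    n_non_events = len(labels) - n_events
    if n_events == 0 or n_non_events == 0:
        raise ValueError("need at least one event and one non-event")

    # Walk the distinct scores from highest to lowest, treating each one
    # as a cutoff: records with score >= cutoff are classified as events.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    best_ks, best_cutoff = 0.0, None
    tp = fp = 0
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Consume every record tied at this score before evaluating.
        while i < len(pairs) and pairs[i][0] == cutoff:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr = tp / n_events        # share of events captured (ROC y-axis)
        fpr = fp / n_non_events    # share of non-events flagged (ROC x-axis)
        gap = abs(tpr - fpr)
        if gap > best_ks:
            best_ks, best_cutoff = gap, cutoff
    return best_ks, best_cutoff


# Made-up scores and outcomes, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 1, 0, 0]
print(ks_statistic(scores, labels))  # prints (0.75, 0.7)
```

Each cutoff visited in the loop corresponds to one point on the ROC curve, so the K-S statistic is the largest vertical distance between that curve and the diagonal. A cutoff chosen this way ignores costs and revenues; a profit-based cutoff would weight the two error types differently.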

