Data Mining Techniques: Theory and Practice
Duration: 3.0 days
This course introduces a data mining methodology that is a superset of the SAS SEMMA methodology around which SAS Enterprise Miner is organized. The course also introduces a wide range of data mining algorithms and provides both theoretical knowledge and practical skills. In this class, you work through all the steps of a data mining project, beginning with problem definition and data selection, and continuing through data exploration, data transformation, sampling, partitioning, modeling, and assessment.
Learn how to
- use a data mining methodology
- build and use decision trees and neural networks for modeling and scoring
- use survival analysis and create survival curves.
Who should attend: Business analysts, their managers, and statisticians
Prerequisites
No prior knowledge of statistical or data mining tools is required.
Course Contents
Introduction to Data Mining
- what is data mining?
- directed and undirected data mining
- models
- profiling and prediction
- why have a methodology?
- how data miners can inadvertently learn things that are not true
- translating business problems into data mining problems
- the importance of model stability
- finding the right input variables
- sampling to create balanced model sets
- partitioning to create training, validation, and test sets
- data preparation
- model assessment
- developing intuition about data
- data structure
- data types
- data values
- exploring distributions
- summary statistics
- histograms
- using SAS Enterprise Miner for data exploration
- the null hypothesis
- statistical significance
- confidence bounds
- variance and standard deviation
- standardized values
- correlation
- linear regression
- logistic regression
- using SAS Enterprise Miner to build regression models
- decision trees as data exploration and classification tools
- decision trees for modeling and scoring
- decision trees for variable selection
- alternate representations of decision trees
- algorithms used to build decision trees
- splitting criteria
- recognizing instability and overfitting in decision tree models
- capturing interactions between variables
- using SAS Enterprise Miner to build decision trees
- origins of neural networks
- neural networks compared with regression
- algorithms used to train neural networks
- data preparation requirements for neural networks
- picking appropriate inputs for neural networks
- creating neural network models using SAS Enterprise Miner
- similarity and distance
- distance metrics appropriate for different kinds of data
- the role of the training set in memory-based reasoning (MBR)
- combining the votes of several neighbors
- other K-nearest neighbor techniques
- collaborative filtering
- using the SAS Enterprise Miner MBR node
- more on similarity and distance
- the k-means algorithm
- divisive clustering
- agglomerative clustering
- data preparation for clustering
- interpreting clusters
- finding clusters with SAS Enterprise Miner
- origins of survival analysis
- how business data is different from clinical data
- hazards and hazard charts
- retention curves and survival curves
- calculating survival from retention
- calculating hazards empirically
- parametric hazard models
- censoring
- competing risks
- survival-based forecasting
- using SAS code in SAS Enterprise Miner to create survival curves
- market basket analysis
- association rules
- sequential pattern analysis
- using SAS Enterprise Miner to discover associations in retail data
- background on graph theory
- sphere of influence
- using link analysis to generate derived variables
- graph-coloring algorithm
- Kleinberg's algorithm
- optimization techniques and problems (SAS/OR software)
- other algorithms
- linear programming problems
- genetic algorithms

