SAS® Machine Learning Features
Model development with modern machine learning algorithms
- Decision forests:
- Distribution of independent training runs.
- Supports intelligent hyperparameter autotuning of model parameters.
- Generation of SAS ASTOREs for production scoring.
- Gradient boosting:
- Automated iterative search for the optimal partition of the data in relation to the selected label variable.
- Automated resampling of input data with adjusted weights based on residuals.
- Automated generation of weighted average for final supervised model.
- Supports binary, nominal and interval labels.
- Ability to customize tree training with a variety of options for the number of trees to grow, the splitting criteria to apply, the depth of subtrees and compute resources.
- Automated stopping criteria based on validation data scoring to avoid overfitting.
- Generation of SAS ASTOREs for production scoring.
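The boosting idea described above (fit a tree, look at the residuals, fit the next tree to what is still unexplained, then combine all trees into one weighted model) can be sketched in a few lines. This is a minimal illustration, not the SAS implementation: it uses one-split trees ("stumps") on a single interval input, fits residuals directly rather than resampling with adjusted weights, and omits validation-based stopping. All function names and defaults are hypothetical.

```python
# Illustrative sketch only: gradient boosting for an interval label with
# one-split trees (stumps). Not the SAS algorithm; names are hypothetical.

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    candidates = sorted(set(xs))[:-1]  # splitting at the max would leave the right side empty
    if not candidates:
        raise ValueError("need at least two distinct input values")
    best = None
    for t in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit stumps sequentially, each one on the residuals of the current model."""
    base = sum(ys) / len(ys)          # start from the mean of the label
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_boosting(model, x):
    """Final model: base value plus the weighted sum of all stump outputs."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

A smaller learning rate with more trees usually gives a smoother fit; a production setup would also score a validation partition after each tree and stop when that score stops improving.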
- Neural networks:
- Automated intelligent tuning of parameter set to identify optimal model.
- Supports modeling of count data.
- Intelligent defaults for most neural network parameters.
- Ability to customize neural network architecture and weights.
- Techniques include deep feedforward neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and autoencoders.
- Ability to use an arbitrary number of hidden layers to support deep learning.
- Automatic standardization of input and target variables.
- Automatic out-of-bag validation for early stopping to avoid overfitting.
- Supports intelligent hyperparameter autotuning of model parameters.
- Generation of SAS ASTOREs for production scoring.
- Segmentation model for deep learning.
- Network development platform for mobile or IoT devices.
- Deep learning and biomedical imaging algorithms used together to quickly identify and visualize shapes.
- Load native DICOM files.
- End-to-end pipeline to process audio streams and analyze audio data directly from a microphone or from a large audio file.
- Support vector machines:
- Models binary target labels.
- Supports linear and polynomial kernels for model training.
- Ability to include continuous and categorical in/out features.
- Automated scaling of input features.
- Ability to apply the interior-point method and the active-set method.
- Supports data partition for model validation.
- Supports cross-validation for penalty selection.
- Generation of SAS ASTOREs for production scoring.
- Factorization machines:
- Supports the development of recommender systems based on sparse matrices of user IDs and item ratings.
- Ability to apply full pairwise-interaction tensor factorization.
- Includes additional categorical and numerical input features for more accurate models.
- Supercharge models with timestamps, demographic data and context information.
- Supports warm restart (update models with new transactions without full retraining).
- Generation of SAS ASTOREs for production scoring.
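For readers unfamiliar with how a factorization machine scores a sparse user/item row, the sketch below shows the standard second-order scoring equation: a bias, a linear term, and pairwise interactions modeled as dot products of learned factor vectors. It is an illustration under assumptions, not the SAS scoring code; `fm_predict`, `fm_predict_naive` and the parameter layout (`w0`, `w`, `V`) are hypothetical names. The two functions compute the same value; the first uses the well-known algebraic rewrite that avoids the double loop over feature pairs.

```python
# Illustrative sketch only: second-order factorization machine scoring.
# x is one input row, w0 the bias, w the linear weights, and V[i] the
# factor vector of feature i. All names are hypothetical.

def fm_predict(x, w0, w, V):
    """Score one row in O(k * n) using the sum-of-squares identity."""
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))            # sum of v_if * x_i
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))  # sum of (v_if * x_i)^2
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score with the explicit double loop over feature pairs i < j."""
    n, k = len(x), len(V[0])
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            score += dot * x[i] * x[j]
    return score
```

Because most entries of a one-hot encoded user/item row are zero, a practical implementation would iterate over the nonzero entries only.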
- Bayesian networks:
- Learns different Bayesian network structures, including naive, tree-augmented naive (TAN), Bayesian network-augmented naive (BAN), parent-child Bayesian networks and Markov blanket.
- Performs efficient variable selection through independence tests.
- Selects the best model automatically from specified parameters.
- Generation of SAS ASTOREs for production scoring.
- Dirichlet Gaussian mixture models (GMM):
- Can execute clustering in parallel and is highly multithreaded.
- Performs soft clustering, which provides not only the predicted cluster score but also the probability distribution over the clusters for each observation.
- Learns the best number of clusters during the clustering process.
- Uses a parallel variational Bayes (VB) method as the model inference method. This method approximates the (intractable) posterior distribution and then iteratively updates the model parameters until it reaches convergence.
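Soft clustering, as described above, means each observation receives a probability for every cluster rather than a single label. The sketch below shows only that final step for a one-dimensional Gaussian mixture whose parameters are already known; it does not implement the variational Bayes inference or the Dirichlet process prior. The function names and the example parameters are hypothetical.

```python
# Illustrative sketch only: cluster membership probabilities for one
# observation under a 1-D Gaussian mixture with fixed, known parameters.
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_assign(x, weights, means, stds):
    """Return the probability of each cluster given observation x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)  # assumes x is not so extreme that every density underflows to 0
    return [j / total for j in joint]


def hard_assign(x, weights, means, stds):
    """Predicted cluster: the index with the highest membership probability."""
    probs = soft_assign(x, weights, means, stds)
    return max(range(len(probs)), key=probs.__getitem__)
```

For example, with two equally weighted clusters centered at -1 and 1, an observation at 0 is split evenly between them, while an observation at -1 leans toward the first cluster.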
- Semisupervised learning algorithm:
- Highly distributed and multithreaded.
- Returns the predicted labels for both the unlabeled data table and the labeled data table.
- T-distributed stochastic neighbor embedding (t-SNE):
- Highly distributed and multithreaded.
- Returns low-dimensional embeddings that are based on a parallel implementation of the t-SNE algorithm.
Latest statistical algorithms
- Clustering:
- K-means, k-modes or k-prototypes clustering.
- Parallel coordinate plots to interactively evaluate cluster membership.
- Scatter plots of inputs with cluster profiles overlaid for small data sets and heat maps with cluster profiles overlaid for large data sets.
- Detailed summary statistics.
- Generate on-demand cluster ID as a new column.
- Supports holdout data (training and validation) for model assessment.
- Decision trees:
- Computes measures of variable importance.
- Supports classification and regression trees.
- Based on a modified C4.5 algorithm or cost-complexity pruning.
- Interactively grow and prune a tree. Interactively train a subtree.
- Set tree depth, max branch, leaf size, aggressiveness of tree pruning and more.
- Use tree map displays to interactively navigate the tree structure.
- Generate on-demand leaf ID, predicted values and residuals as new columns.
- Supports holdout data (training and validation) for model assessment.
- Supports pruning with holdout data.
- Supports autotuning.
- Logistic regression:
- Models for binary data with logit and probit link functions.
- Influence statistics.
- Variable selection, including iteration plot.
- Supports forward, backward, stepwise and lasso variable selection.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes overall ANOVA, model dimensions, fit statistics, model ANOVA, Type III test and parameter estimates.
- Generate on-demand predicted values and residuals as new columns.
- Supports holdout data (training and validation) for model assessment.
- Linear regression:
- Influence statistics.
- Variable selection, including iteration plot.
- Supports forward, backward, stepwise and lasso variable selection.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes overall ANOVA, model dimensions, fit statistics, model ANOVA, Type III test and parameter estimates.
- Generate on-demand predicted values and residuals as new columns.
- Supports holdout data (training and validation) for model assessment.
- Generalized linear models:
- Distributions supported include beta, normal, binary, exponential, gamma, geometric, Poisson, Tweedie, inverse Gaussian and negative binomial.
- Supports forward, backward, stepwise and lasso variable selection.
- Variable selection, including iteration plot.
- Offset variable support.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes model summary, iteration history, fit statistics, Type III test table and parameter estimates.
- Informative missing option for treatment of missing values on the predictor variable.
- Generate on-demand predicted values and residuals as new columns.
- Supports holdout data (training and validation) for model assessment.
- Generalized additive models:
- Distributions supported include normal, binary, gamma, Poisson, Tweedie, inverse Gaussian and negative binomial.
- Supports one- and two-dimensional spline effects.
- GCV, GACV and UBRE methods for selecting the smoothing effects.
- Offset variable support.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes model summary, iteration history, fit statistics and parameter estimates.
- Supports holdout data (training and validation) for model assessment.
- Nonparametric logistic regression:
- Models for binary data with logit, probit, log-log and c-log-log link functions.
- Supports one- and two-dimensional spline effects.
- GCV, GACV and UBRE methods for selecting the smoothing effects.
- Offset variable support.
- Frequency and weight variables.
- Market basket analysis:
- The new NLHS_RANGE option in the PROC MBANALYSIS statement enables you to specify the range of the number of items in the left-hand side (LHS) of a rule.
- The new NRHS_RANGE option in the PROC MBANALYSIS statement enables you to specify the range of the number of items in the right-hand side (RHS) of a rule.
- The new ANTECEDENTLIST= option in the PROC MBANALYSIS statement enables you to specify the regular expression strings to match in the antecedent (left-hand side) of a rule.
- The new CONSEQUENTLIST= option in the PROC MBANALYSIS statement enables you to specify the regular expression strings to match in the consequent (right-hand side) of a rule.
- The new SEPARATOR= option in the PROC MBANALYSIS statement enables you to specify the separator character in the antecedent (left-hand side) or consequent (right-hand side) of a rule.
- The maximum limit of the ITEMS= option is set to 1,000.
- The maximum number of rules that are generated per thread on each node is 1 million.
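To make the left-hand-side and right-hand-side terminology concrete, the sketch below mines association rules from a handful of transactions and reports support, confidence and lift for each rule. It is a brute-force illustration in Python, not PROC MBANALYSIS; the parameters `nlhs_range` and `nrhs_range` are hypothetical stand-ins that merely mirror the idea of limiting how many items may appear on each side of a rule.

```python
# Illustrative sketch only: brute-force association rules with support,
# confidence and lift. Not PROC MBANALYSIS; all names are hypothetical.
from itertools import combinations


def mine_rules(transactions, min_support=0.5, nlhs_range=(1, 1), nrhs_range=(1, 1)):
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    items = sorted(set().union(*baskets))

    def support(itemset):
        """Fraction of transactions that contain every item in itemset."""
        return sum(1 for b in baskets if itemset <= b) / n

    rules = []
    for lhs_size in range(nlhs_range[0], nlhs_range[1] + 1):
        for lhs in combinations(items, lhs_size):
            lhs_set = frozenset(lhs)
            lhs_support = support(lhs_set)
            if lhs_support == 0 or lhs_support < min_support:
                continue
            remaining = [i for i in items if i not in lhs_set]
            for rhs_size in range(nrhs_range[0], nrhs_range[1] + 1):
                for rhs in combinations(remaining, rhs_size):
                    rhs_set = frozenset(rhs)
                    both = support(lhs_set | rhs_set)
                    if both == 0 or both < min_support:
                        continue
                    confidence = both / lhs_support
                    lift = confidence / support(rhs_set)
                    rules.append({"lhs": lhs, "rhs": rhs, "support": both,
                                  "confidence": confidence, "lift": lift})
    return rules
```

A lift above 1 indicates that the right-hand-side items appear with the left-hand-side items more often than independence would predict.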
- k-NN (k-nearest neighbor):
- Highly distributed and multithreaded.
- Returns k-nearest neighbors based on a parallel implementation of the k-NN search algorithm.
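As a point of reference for what a k-NN search returns, the sketch below performs a single-threaded, brute-force search with Euclidean distance. It says nothing about how the distributed implementation partitions the work; the function name `k_nearest` is hypothetical.

```python
# Illustrative sketch only: brute-force k-nearest-neighbor search.
import heapq
import math


def k_nearest(query, points, k):
    """Return the indices of the k points closest to query, nearest first."""
    return heapq.nsmallest(
        k,
        range(len(points)),
        key=lambda i: math.dist(query, points[i]),  # Euclidean distance (Python 3.8+)
    )
```

A parallel version would typically run this search on each data partition and then merge the per-partition candidates into one final top-k list.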
Analytical data preparation
- Distributed data management routines provided via code:
- T-distributed stochastic neighbor embedding (t-SNE).
- Feature binning.
- High-performance imputation of missing values in features with user-specified values, mean, pseudo median and random value of non-missing values.
- Feature dimension reduction.
- Large-scale principal components analysis (PCA), including moving windows and robust PCA.
- Unsupervised learning with cluster analysis and mixed variable clustering.
- Large-scale data exploration and summarization.
- Large-scale data profiling of input data sources.
- Sampling: Supports random and stratified sampling, oversampling for rare events and indicator variables for sampled records.
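The sampling item above combines two ideas: drawing the same fraction from every stratum, and marking sampled records with an indicator variable instead of dropping the rest. A minimal sketch of both follows, assuming the rows are plain dictionaries; the function name, the `_sampled_` column name and the rounding rule are hypothetical choices for illustration.

```python
# Illustrative sketch only: stratified sampling that adds an indicator
# column instead of removing unsampled rows. Names are hypothetical.
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Flag roughly `fraction` of the rows within each value of `key`."""
    rng = random.Random(seed)           # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for idx, row in enumerate(rows):
        strata[row[key]].append(idx)

    selected = set()
    for indices in strata.values():
        n = max(1, round(fraction * len(indices)))  # keep at least one row per stratum
        selected.update(rng.sample(indices, n))

    # Return copies of the rows with a 0/1 indicator for sampled records.
    return [dict(row, _sampled_=int(i in selected)) for i, row in enumerate(rows)]
```

Oversampling a rare event can be approximated with the same structure by using a larger fraction for the rare stratum than for the common one.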
Integrated text analytics
- Supports 32 native languages out of the box:
- English.
- Arabic.
- Chinese.
- Croatian.
- Czech.
- Danish.
- Dutch.
- Farsi.
- Finnish.
- French.
- German.
- Greek.
- Hebrew.
- Hindi.
- Hungarian.
- Indonesian.
- Italian.
- Japanese.
- Korean.
- Norwegian.
- Polish.
- Portuguese.
- Romanian.
- Russian.
- Slovak.
- Slovenian.
- Spanish.
- Swedish.
- Tagalog.
- Turkish.
- Thai.
- Vietnamese.
- Automated parsing, tokenization, part-of-speech tagging and lemmatization.
- Predefined concepts extract common entities such as names, dates, currency values, measurements, people, places and more.
- Automated feature extraction with machine-generated topics (singular value decomposition and latent Dirichlet allocation).
- Supports machine learning and rules-based approaches within a single project.
- Automatic rule generation with BoolRule.
- Classify documents more accurately with deep learning (recurrent neural networks).
Model assessment
- Automatically calculates supervised learning model performance statistics.
- Produces output statistics for interval and categorical targets.
- Creates a lift table for interval and categorical targets.
- Creates an ROC table for categorical targets.
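An ROC table lists, for each probability cutoff, how many true events are captured and how many non-events are falsely flagged. The sketch below builds such a table for a binary target. It is an illustration of the concept only; the column names and the `roc_table` function are hypothetical and do not reflect the layout of the SAS output.

```python
# Illustrative sketch only: ROC table for a binary target (1 = event).
# Assumes y_true contains at least one event and one non-event.

def roc_table(y_true, scores, cutoffs):
    """Return one row per cutoff with sensitivity and false positive rate."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    table = []
    for cutoff in cutoffs:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y != 1)
        table.append({
            "cutoff": cutoff,
            "sensitivity": tp / positives,          # true positive rate
            "false_positive_rate": fp / negatives,  # 1 - specificity
        })
    return table
```

Plotting sensitivity against the false positive rate across all cutoffs gives the familiar ROC curve.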
Model scoring
- Automatically generates SAS DATA step code for model scoring.
- Applies scoring logic to training, holdout and new data.
SAS® procedures (PROCs) & CAS actions
- A programming interface (SAS® Studio) allows IT staff or developers to access a CAS server, load and save data directly from a CAS server, and support local and remote processing on a CAS server.
- Python programmers or IT staff can access data and perform basic data manipulation against a CAS server or execute CAS actions using PROC CAS.