SAS® Machine Learning Features

Model development with modern machine learning algorithms 

  • Decision forests:
    • Distribution of independent training runs.
    • Supports intelligent autotuning of model hyperparameters.
    • Generation of SAS ASTOREs for production scoring.
  • Gradient boosting:
    • Automated iterative search for the optimal partition of the data in relation to the selected label variable.
    • Automated resampling of input data with adjusted weights based on residuals.
    • Automated generation of a weighted average for the final supervised model.
    • Supports binary, nominal and interval labels.
    • Ability to customize tree training with a variety of options for the number of trees to grow, splitting criteria to apply, depth of subtrees and compute resources.
    • Automated stopping criteria based on validation data scoring to avoid overfitting.
    • Generation of SAS ASTOREs for production scoring.
  • Neural networks:
    • Automated intelligent tuning of the parameter set to identify the optimal model.
    • Supports modeling of count data.
    • Intelligent defaults for most neural network parameters.
    • Ability to customize neural network architecture and weights.
    • Techniques include deep feedforward neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and autoencoders.
    • Ability to use an arbitrary number of hidden layers to support deep learning.
    • Automatic standardization of input and target variables.
    • Automatic out-of-bag validation for early stopping to avoid overfitting.
    • Supports intelligent autotuning of model hyperparameters.
    • Generation of SAS ASTOREs for production scoring.
    • Segmentation model for deep learning.
    • Network development platform for mobile or IoT devices.
    • Deep learning and biomedical imaging algorithms used together to quickly identify and visualize shapes.
    • Load native DICOM files.
    • End-to-end pipeline to process audio streams and analyze audio data directly from a microphone or from large audio files.
  • Support vector machines:
    • Models binary target labels.
    • Supports linear and polynomial kernels for model training.
    • Ability to include continuous and categorical in/out features.
    • Automated scaling of input features.
    • Ability to apply the interior-point method and the active-set method.
    • Supports data partition for model validation.
    • Supports cross-validation for penalty selection.
    • Generation of SAS ASTOREs for production scoring.
  • Factorization machines:
    • Supports the development of recommender systems based on sparse matrices of user IDs and item ratings.
    • Ability to apply full pairwise-interaction tensor factorization.
    • Includes additional categorical and numerical input features for more accurate models.
    • Supercharge models with timestamps, demographic data and context information.
    • Supports warm restart (update models with new transactions without full retraining).
    • Generation of SAS ASTOREs for production scoring.
  • Bayesian networks:
    • Learns different Bayesian network structures, including naive, tree-augmented naive (TAN), Bayesian network-augmented naive (BAN), parent-child Bayesian networks and Markov blanket.
    • Performs efficient variable selection through independence tests.
    • Selects the best model automatically from specified parameters.
    • Generation of SAS ASTOREs for production scoring.
  • Dirichlet Gaussian mixture models (GMM):
    • Can execute clustering in parallel and is highly multithreaded.
    • Performs soft clustering, which provides not only the predicted cluster score but also the probability distribution over the clusters for each observation.
    • Learns the best number of clusters during the clustering process.
    • Uses a parallel variational Bayes (VB) method as the model inference method. This method approximates the (intractable) posterior distribution and then iteratively updates the model parameters until it reaches convergence.
  • Semisupervised learning algorithm:
    • Highly distributed and multithreaded.
    • Returns the predicted labels for both the unlabeled data table and the labeled data table.
  • T-distributed stochastic neighbor embedding (t-SNE):
    • Highly distributed and multithreaded.
    • Returns low-dimensional embeddings that are based on a parallel implementation of the t-SNE algorithm.
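
The gradient boosting items above (fitting to residuals, combining many small trees, and stopping early based on validation data) can be illustrated with a minimal sketch. This is textbook squared-error boosting with one-split trees on a single feature, written in plain Python. It is not the SAS implementation, and every name in it (fit_stump, boost, patience, learning_rate) is a hypothetical chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Pick the single threshold that best splits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave observations on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical, so fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(train_x, train_y, valid_x, valid_y,
          max_trees=200, learning_rate=0.1, patience=5):
    """Add stumps fitted to residuals; stop when validation error stalls."""
    base = sum(train_y) / len(train_y)
    train_pred = [base] * len(train_y)
    valid_pred = [base] * len(valid_y)
    stumps = []
    best_error, best_size = mse(valid_y, valid_pred), 0
    for _ in range(max_trees):
        residuals = [y - p for y, p in zip(train_y, train_pred)]
        stump = fit_stump(train_x, residuals)
        stumps.append(stump)
        train_pred = [p + learning_rate * predict_stump(stump, x)
                      for p, x in zip(train_pred, train_x)]
        valid_pred = [p + learning_rate * predict_stump(stump, x)
                      for p, x in zip(valid_pred, valid_x)]
        error = mse(valid_y, valid_pred)
        if error < best_error:
            best_error, best_size = error, len(stumps)
        elif len(stumps) - best_size >= patience:
            break  # no validation improvement for `patience` rounds
    # Keep only the stumps up to the best validation round.
    return base, stumps[:best_size], learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

For example, training on a step-shaped interval label (0 below x = 10, 1 from x = 10 upward) yields predictions that fall below 0.5 on the left of the step and above 0.5 on the right. The patience parameter plays the role of the validation-based stopping criterion.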

Latest statistical algorithms

  • Clustering:
    • K-means, k-modes or k-prototypes clustering.
    • Parallel coordinate plots to interactively evaluate cluster membership.
    • Scatter plots of inputs with cluster profiles overlaid for small data sets and heat maps with cluster profiles overlaid for large data sets.
    • Detailed summary statistics.
    • Generate on-demand cluster ID as a new column.
    • Supports holdout data (training and validation) for model assessment.
  • Decision trees:
    • Computes measures of variable importance.
    • Supports classification and regression trees.
    • Based on a modified C4.5 algorithm or cost-complexity pruning.
    • Interactively grow and prune a tree. Interactively train a subtree.
    • Set tree depth, max branch, leaf size, aggressiveness of tree pruning and more.
    • Use tree map displays to interactively navigate the tree structure.
    • Generate on-demand leaf ID, predicted values and residuals as new columns.
    • Supports holdout data (training and validation) for model assessment.
    • Supports pruning with holdout data.
    • Supports autotuning.
  • Logistic regression:
    • Models for binary data with logit and probit link functions.
    • Influence statistics.
    • Variable selection, including iteration plot.
    • Supports forward, backward, stepwise and lasso variable selection.
    • Frequency and weight variables.
    • Residual diagnostics.
    • Summary table includes overall ANOVA, model dimensions, fit statistics, model ANOVA, Type III test and parameter estimates.
    • Generate on-demand predicted values and residuals as new columns.
    • Supports holdout data (training and validation) for model assessment.
  • Linear regression:
    • Influence statistics.
    • Variable selection, including iteration plot.
    • Supports forward, backward, stepwise and lasso variable selection.
    • Frequency and weight variables.
    • Residual diagnostics.
    • Summary table includes overall ANOVA, model dimensions, fit statistics, model ANOVA, Type III test and parameter estimates.
    • Generate on-demand predicted values and residuals as new columns.
    • Supports holdout data (training and validation) for model assessment.
  • Generalized linear models:
    • Distributions supported include beta, normal, binary, exponential, gamma, geometric, Poisson, Tweedie, inverse Gaussian and negative binomial.
    • Supports forward, backward, stepwise and lasso variable selection.
    • Variable selection, including iteration plot.
    • Offset variable support.
    • Frequency and weight variables.
    • Residual diagnostics.
    • Summary table includes model summary, iteration history, fit statistics, Type III test table and parameter estimates.
    • Informative missing option for treatment of missing values on the predictor variable.
    • Generate on-demand predicted values and residuals as new columns.
    • Supports holdout data (training and validation) for model assessment.  
  • Generalized additive models:
    • Distributions supported include normal, binary, gamma, Poisson, Tweedie, inverse Gaussian and negative binomial.
    • Supports one- and two-dimensional spline effects.
    • GCV, GACV and UBRE methods for selecting the smoothing effects.
    • Offset variable support.
    • Frequency and weight variables.
    • Residual diagnostics.
    • Summary table includes model summary, iteration history, fit statistics and parameter estimates.
    • Supports holdout data (training and validation) for model assessment.  
  • Nonparametric logistic regression:
    • Models for binary data with logit, probit, log-log and c-log-log link functions.
    • Supports one- and two-dimensional spline effects.
    • GCV, GACV and UBRE methods for selecting the smoothing effects.
    • Offset variable support.
    • Frequency and weight variables.
  • Market basket analysis:
    • The new NLHS_RANGE option in the PROC MBANALYSIS statement enables you to specify the range of number of items in the left-hand side (LHS) of a rule.
    • The new NRHS_RANGE option in the PROC MBANALYSIS statement enables you to specify the range of number of items in the right-hand side (RHS) of a rule.
    • The new ANTECEDENTLIST= option in the PROC MBANALYSIS statement enables you to specify the regular expression strings to match in the antecedent (left-hand side) of a rule.
    • The new CONSEQUENTLIST= option in the PROC MBANALYSIS statement enables you to specify the regular expression strings to match in the consequent (right-hand side) of a rule.
    • The new SEPARATOR= option in the PROC MBANALYSIS statement enables you to specify the separator character in the antecedent (left-hand side) or consequent (right-hand side) of a rule.
    • The maximum limit of the ITEMS= option is set to 1,000.
    • The maximum number of rules that are generated per thread on each node is 1 million.
  • K-nearest neighbor (k-NN):
    • Highly distributed and multithreaded.
    • Returns k-nearest neighbors based on a parallel implementation of the k-NN search algorithm.
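
As a point of reference for the k-NN item above, a nearest-neighbor search can be sketched in a few lines of Python. This is a serial, brute-force version using Euclidean distance. The parallel implementation mentioned in the list is not described in this document, so the function name k_nearest and the choice of distance metric are assumptions made for illustration.

```python
import heapq
import math


def k_nearest(query, points, k):
    """Return the k closest points as (distance, index) pairs, nearest first."""
    # Brute force: compute the distance from the query to every point.
    distances = ((math.dist(query, p), i) for i, p in enumerate(points))
    # nsmallest keeps only the k smallest pairs without sorting everything.
    return heapq.nsmallest(k, distances)
```

Calling k_nearest((0, 0), [(0, 0), (1, 1), (5, 5), (0.5, 0)], 2) returns the points at index 0 and index 3, in that order.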

Analytical data preparation

  • Distributed data management routines provided via code:
    • T-distributed stochastic neighbor embedding (t-SNE).
    • Feature binning.
    • High-performance imputation of missing feature values using user-specified values, the mean, the pseudo median or a random value drawn from the non-missing values.
    • Feature dimension reduction.
    • Large-scale principal components analysis (PCA), including moving windows and robust PCA.
    • Unsupervised learning with cluster analysis and mixed variable clustering.
  • Large-scale data exploration and summarization.
  • Large-scale data profiling of input data sources.
  • Sampling: Supports random and stratified sampling, oversampling for rare events and indicator variables for sampled records.
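
The sampling item above combines three ideas: sampling within strata, oversampling a rare event, and flagging sampled records with an indicator variable. The sketch below shows one way these fit together in plain Python. It is an illustration only, not the product's routine; the function name stratified_sample, the sampled column and the rate arguments are all hypothetical.

```python
import random


def stratified_sample(records, stratum_key, rates, default_rate, seed=0):
    """Sample each stratum at its own rate and flag the chosen records.

    rates maps a stratum value to a sampling rate between 0 and 1; strata
    that are not listed use default_rate. Giving the rare stratum a higher
    rate oversamples it relative to the rest.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = {}
    for index, record in enumerate(records):
        strata.setdefault(record[stratum_key], []).append(index)

    selected = set()
    for value, indices in strata.items():
        rate = rates.get(value, default_rate)
        # At least one record per stratum, never more than the stratum holds.
        count = min(len(indices), max(1, round(rate * len(indices))))
        selected.update(rng.sample(indices, count))

    # Return every record with an indicator column instead of dropping rows.
    return [dict(record, sampled=int(i in selected))
            for i, record in enumerate(records)]
```

With 10 event records and 90 non-event records, a rate of 1.0 for events and a default rate of 0.1 keeps all 10 events and 9 non-events, so events make up roughly half of the sample instead of a tenth.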

Integrated text analytics

  • Supports 32 native languages out of the box:
    • English.
    • Arabic.
    • Chinese.
    • Croatian.
    • Czech.
    • Danish.
    • Dutch.
    • Farsi.
    • Finnish.
    • French.
    • German.
    • Greek.
    • Hebrew.
    • Hindi.
    • Hungarian.
    • Indonesian.
    • Italian.
    • Japanese.
    • Korean.
    • Norwegian.
    • Polish.
    • Portuguese.
    • Romanian.
    • Russian.
    • Slovak.
    • Slovenian.
    • Spanish.
    • Swedish.
    • Tagalog.
    • Turkish.
    • Thai.
    • Vietnamese.
  • Automated parsing, tokenization, part-of-speech tagging and lemmatization.
  • Predefined concepts extract common entities such as names, dates, currency values, measurements, people, places and more.
  • Automated feature extraction with machine-generated topics (singular value decomposition and latent Dirichlet allocation).
  • Supports machine learning and rules-based approaches within a single project.
  • Automatic rule generation with BoolRule.
  • Classify documents more accurately with deep learning (recurrent neural networks).

Model assessment

  • Automatically calculates supervised learning model performance statistics.
  • Produces output statistics for interval and categorical targets.
  • Creates lift tables for interval and categorical targets.
  • Creates ROC tables for categorical targets.
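
To make the lift and ROC tables above concrete, the following Python sketch computes both for a binary target from labels and predicted scores. It uses the standard textbook definitions and is not the product's assessment code; the function names, the dictionary keys and the 0/1 label coding are assumptions made for illustration.

```python
def roc_table(labels, scores, cutoffs):
    """One row per cutoff: sensitivity and false positive rate."""
    positives = sum(labels)
    negatives = len(labels) - positives
    rows = []
    for cutoff in cutoffs:
        # An observation is predicted positive when its score meets the cutoff.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
        rows.append({
            "cutoff": cutoff,
            "sensitivity": tp / positives,
            "fpr": fp / negatives,
        })
    return rows


def lift_at_depth(labels, scores, depth):
    """Event rate in the top `depth` fraction of scores over the overall rate."""
    ranked = sorted(zip(scores, labels), reverse=True)
    top = ranked[: max(1, round(depth * len(ranked)))]
    top_rate = sum(y for _, y in top) / len(top)
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate
```

A lift above 1 at a given depth means the model concentrates events in its highest-scored observations more than random selection would.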

Model scoring

  • Automatically generates SAS DATA step code for model scoring.
  • Applies scoring logic to training data, holdout data and new data.

SAS® procedures (PROCs) & CAS actions

  • A programming interface (SAS® Studio) allows IT or developers to access a CAS server, load and save data directly from a CAS server, and support local and remote processing on a CAS server.
  • Python programmers or IT staff can access data and perform basic data manipulation against a CAS server or execute CAS actions using PROC CAS.
