SAS® for Machine Learning and Deep Learning

Interactive programming in a web-based development environment

Visual interface for the entire analytical life cycle process.
Drag-and-drop interactive interface requires no coding, though coding is an option.
Supports automated code creation at each node in the pipeline.
Choose best practice templates (basic, intermediate or advanced) to get started quickly with machine learning tasks or take advantage of our automated modeling process.
Interpretability reports such as PD, LIME, ICE, and Kernel SHAP.
Share modeling insights via a PDF report.
Explore data from within Model Studio and launch directly into SAS Visual Analytics.
Edit models imported from SAS Visual Analytics in Model Studio.
View data within each node in Model Studio.
Run SAS® Enterprise Miner™ 14.3 batch code within Model Studio.
Provides a collaborative environment for easy sharing of data, code snippets, annotations and best practices among different personas.
Create, manage and share content and administer content permissions via SAS Drive.
The SAS lineage viewer visually displays the relationships between decisions, models and data.
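
As an illustration of two of the interpretability reports named in this section, the sketch below computes individual conditional expectation (ICE) curves and a partial dependence (PD) curve in plain Python. It follows the textbook definitions only; `toy_model`, the row layout and the function names are hypothetical and are not part of Model Studio or any SAS API.

```python
def ice_curves(predict, rows, feature, grid):
    """One ICE curve per row: predictions as `feature` sweeps over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)        # copy so the original row is untouched
            modified[feature] = value   # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(predict, rows, feature, grid):
    """PD curve: the pointwise average of the ICE curves."""
    curves = ice_curves(predict, rows, feature, grid)
    return [sum(c[i] for c in curves) / len(curves) for i in range(len(grid))]


def toy_model(row):
    # Hypothetical stand-in for a trained model's scoring function.
    return 2.0 * row["age"] + row["income"]


rows = [{"age": 30, "income": 10.0}, {"age": 50, "income": 20.0}]
grid = [0, 10]
ice = ice_curves(toy_model, rows, "age", grid)
pd_curve = partial_dependence(toy_model, rows, "age", grid)
```

Each ICE curve shows how one observation's prediction responds to the feature, while the PD curve summarizes the average response across observations.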

Intelligent automation with human oversight

Public API to automate many of the manual, complex modeling steps to build machine learning models – from data wrangling, to feature engineering, to algorithm selection, to deployment.
Automatic Feature Engineering node for automatically cleansing, transforming, and selecting features for models.
Automatic Modeling node for automatically selecting the best model using a set of optimization and autotuning routines across multiple techniques.
Interactively adjust the pruning and splitting of decision tree nodes.
Automated data prep suggestions from meta learning.
Automated pipeline generation with complete customization capability.

Natural language generation

View results in simple language to facilitate understanding of reports, including model assessment and interpretability.

Embedded support for Python & R languages

Embed open source code within an analysis, and call open source algorithms within Model Studio.
The Open Source Code node in Model Studio is agnostic to Python or R versions.
Manage Python models in a common repository within Model Studio.

Deep learning with Python (DLPy)

Build deep learning models for image, text, audio and time-series data using Jupyter Notebook.
High level APIs are available on GitHub for:
- Deep neural networks for tabular data.
- Image classification and regression.
- Object detection.
- RNN-based tasks – text classification, text generation and sequence labeling.
- RNN-based time-series processing and modeling.
Support for predefined network architectures, such as LeNet, VGG, ResNet, DenseNet, Darknet, Inception, ShuffleNet, MobileNet, YOLO, Tiny YOLO, Faster R-CNN and U-Net.
Import and export deep learning models in the ONNX format.
Use ONNX models to score new data sets in a variety of environments by taking advantage of Analytic Store (ASTORE).

SAS procedures (PROCs) & CAS actions

A programming interface (SAS Studio) allows IT staff or developers to access a CAS server and to load and save data directly from it, and supports local and remote processing on a CAS server.
Python, Java, R, Lua and Scala programmers or IT staff can access data and perform basic data manipulation against a CAS server, or execute CAS actions using PROC CAS.
CAS actions support for interpretability, feature engineering and modeling.
Integrate and add the power of SAS to other applications using REST APIs.

Highly scalable, distributed in-memory analytical processing

Distributed, in-memory processing of complex analytical calculations on large data sets provides low-latency answers.
Analytical tasks are chained together as a single, in-memory job without having to reload the data or write out intermediate results to disks.
Concurrent access to the same data in memory by many users improves efficiency.
Data and intermediate results are held in memory as long as required, reducing latency.
Built-in workload management ensures efficient use of compute resources.
Built-in failover management guarantees submitted jobs always finish.
Automated I/O disk spillover for improved memory management.

Model development with modern machine learning algorithms

Reinforcement learning:
- Techniques include Fitted Q-Network (FQN) and Deep Q-Network (DQN).
- FQN can train a model over precollected data points without the need to communicate with the environment.
- Uses replay memory and target network techniques to decorrelate the non-i.i.d. data points and stabilize the training process.
- Ability to specify a custom environment for state-action pairs and rewards.
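
The replay memory and target network techniques mentioned above can be sketched in plain Python as follows. This is a minimal, generic illustration of the standard DQN ingredients, not SAS code: `ReplayMemory`, `td_target`, `sync_target` and the tuple layout are hypothetical names chosen for the example.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer; uniform sampling breaks up temporally correlated data."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Bootstrapped target; next_q_values come from the frozen target network."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def sync_target(online_weights):
    """Periodically copy the online weights so the targets change slowly."""
    return dict(online_weights)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(t, 0, 1.0, t + 1, t == 4)
batch = memory.sample(2)

online = {"w": 1.0}
target = sync_target(online)
online["w"] = 2.0  # further training changes only the online copy
```

Sampling uniformly from the buffer decorrelates consecutive transitions, and computing targets from a periodically synchronized copy of the weights keeps the regression target from shifting at every update.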
Decision forests:
- Automated ensemble of decision trees to predict a single target.
- Automated distribution of independent training runs.
- Supports intelligent autotuning of model parameters.
- Automated generation of SAS code for production scoring.
Gradient boosting:
- Automated iterative search for the optimal partition of the data in relation to the selected label variable.
- Automated resampling of input data several times with adjusted weights based on residuals.
- Automated generation of weighted average for final supervised model.
- Supports binary, nominal and interval labels.
- Ability to customize tree training with a variety of options for the number of trees to grow, splitting criteria to apply, depth of subtrees and compute resources.
- Automated stopping criteria based on validation data scoring to avoid overfitting.
- Automated generation of SAS code for production scoring.
- Access LightGBM, a popular open source modeling package.
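
The residual-driven idea behind gradient boosting can be sketched in a few lines. The example below is the textbook least-squares variant with one-split trees (stumps) on a single input. It is an assumption-laden illustration, not the SAS or LightGBM implementation; it omits resampling and validation-based stopping, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split stump (threshold, left_mean, right_mean) by squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit each new stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Final prediction: base value plus the shrunken sum of all stump outputs."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)


xs = [1, 2, 3, 4]
ys = [1.0, 1.0, 3.0, 3.0]
model = boost(xs, ys)
train_mse = sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

With a learning rate below 1, each round removes only part of the remaining error, which is why the number of trees and the stopping criteria matter in practice.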
Neural networks:
- Automated intelligent tuning of parameter set to identify optimal model.
- Supports modeling of count data.
- Intelligent defaults for most neural network parameters.
- Ability to customize neural network architecture and weights.
- Techniques include deep feedforward neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and autoencoders.
- Ability to use an arbitrary number of hidden layers to support deep learning.
- Support for different types of layers, such as convolution and pooling.
- Automatic standardization of input and target variables.
- Automatic selection and use of a validation data subset.
- Automatic out-of-bag validation for early stopping to avoid overfitting.
- Supports intelligent autotuning of model parameters.
- Automated generation of SAS code for production scoring.
Support vector machines:
- Models binary target labels.
- Supports linear and polynomial kernels for model training.
- Ability to include continuous and categorical in/out features.
- Automated scaling of input features.
- Ability to apply the interior-point method and the active-set method.
- Supports data partition for model validation.
- Supports cross-validation for penalty selection.
- Automated generation of SAS code for production scoring.
Factorization machines:
- Supports the development of recommender systems based on sparse matrices of user IDs and item ratings.
- Ability to apply full pairwise-interaction tensor factorization.
- Includes additional categorical and numerical input features for more accurate models.
- Supercharge models with timestamps, demographic data and context information.
- Supports warm restart (update models with new transactions without full retraining).
- Automated generation of SAS score code for production scoring.
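
To make the pairwise-interaction idea concrete, the sketch below evaluates the standard second-order factorization machine equation: a bias, a linear term and factorized pairwise interactions. This is the generic published formulation and may differ from what SAS implements; the weights, the factor matrix `V` and the function names are made up for illustration.

```python
def fm_predict(x, w0, w, V):
    """Second-order FM score, using the O(k*n) form of the pairwise term.

    x: feature values; w: linear weights; V[i]: latent factor vector of feature i.
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_brute(x, w0, w, V):
    """Same model written as an explicit sum over feature pairs i < j."""
    total = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


# Sparse one-hot style input: features 0 and 2 are active (e.g. one user, one item).
x = [1.0, 0.0, 1.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
score = fm_predict(x, w0, w, V)
```

Because interactions are expressed through low-dimensional factor vectors rather than one weight per pair, the model can estimate interactions between users and items that never co-occur in the training data.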
Bayesian networks:
- Learns different Bayesian network structures, including naive, tree-augmented naive (TAN), Bayesian network-augmented naive (BAN), parent-child Bayesian networks and Markov blanket.
- Performs efficient variable selection through independence tests.
- Selects the best model automatically from specified parameters.
- Generates SAS code or an analytics store to score data.
- Loads data from multiple nodes and performs computations in parallel.
Dirichlet Gaussian mixture models (GMM):
- Can execute clustering in parallel and is highly multithreaded.
- Performs soft clustering, which provides not only the predicted cluster score but also the probability distribution over the clusters for each observation.
- Learns the best number of clusters during the clustering process, which is supported by the Dirichlet process.
- Uses a parallel variational Bayes (VB) method as the model inference method. This method approximates the (intractable) posterior distribution and then iteratively updates the model parameters until it reaches convergence.
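
The soft clustering output described above can be illustrated with a one-dimensional Gaussian mixture whose parameters are already known. The sketch covers only the assignment step (the probability of each cluster for one observation); it does not implement the Dirichlet process or the variational Bayes inference, and the parameter values are invented for the example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_assign(x, weights, means, stds):
    """Posterior probability of each cluster for observation x."""
    dens = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(dens)
    return [d / total for d in dens]


weights = [0.5, 0.5]   # mixing proportions (hypothetical)
means = [-1.0, 1.0]
stds = [1.0, 1.0]

probs_mid = soft_assign(0.0, weights, means, stds)    # equidistant from both clusters
probs_right = soft_assign(2.0, weights, means, stds)  # closer to the second cluster
predicted_cluster = probs_right.index(max(probs_right))
```

The predicted cluster is simply the index with the highest probability, while the full probability vector conveys how ambiguous the assignment is.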
Semisupervised learning algorithm:
- Highly distributed and multithreaded.
- Returns the predicted labels for both the unlabeled data table and the labeled data table.
T-distributed stochastic neighbor embedding (t-SNE):
- Highly distributed and multithreaded.
- Returns low-dimensional embeddings that are based on a parallel implementation of the t-SNE algorithm.
Generative adversarial networks (GANs):
- Techniques include StyleGANs for image data and GANs for tabular data.
- Generate synthetic data for deep learning models.

Analytical data preparation

Feature engineering best practice pipeline includes best transformations.
Distributed data management routines provided via a visual front end.
Large-scale data exploration and summarization.
Cardinality profiling:
- Large-scale data profiling of input data sources.
- Intelligent recommendation for variable measurement and role.
Sampling:
- Supports random and stratified sampling, oversampling for rare events and indicator variables for sampled records.
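
A minimal sketch of stratified sampling, assuming rows are dictionaries and the stratum is a single key. It shows the general technique (sample the same fraction from every stratum so rare classes stay represented), not the SAS sampling routine; `stratified_sample` and the `target` column are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by rows[key]."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for label in sorted(strata):  # assumes stratum labels are sortable
        group = strata[label]
        n = max(1, round(len(group) * fraction))  # keep at least one row per stratum
        sample.extend(rng.sample(group, n))
    return sample


# 90 non-events and 10 rare events.
rows = [{"id": i, "target": 0} for i in range(90)]
rows += [{"id": 90 + i, "target": 1} for i in range(10)]

sample = stratified_sample(rows, "target", fraction=0.2)
n_events = sum(r["target"] for r in sample)
```

A simple random sample of 20 rows could easily contain no events at all; stratifying on the target keeps the 10 percent event rate in the sample.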

Data exploration, feature engineering & dimension reduction

T-distributed stochastic neighbor embedding (t-SNE).
Feature binning.
High-performance imputation of missing values in features with user-specified values, mean, pseudo median and random value of nonmissing values.
Feature dimension reduction.
Large-scale principal components analysis (PCA), including moving windows and robust PCA.
Unsupervised learning with cluster analysis and mixed variable clustering.
Segment profiles for clustering.
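
The imputation options listed in this section can be sketched as follows for three of the named fill strategies: a user-specified value, the mean and a random nonmissing value. Missing values are represented as `None`, and the function name and `method` keywords are assumptions made for the example; the pseudo median option is left out.

```python
import random


def impute(values, method="mean", constant=None, seed=0):
    """Replace None entries using the chosen strategy."""
    present = [v for v in values if v is not None]
    if method == "mean":
        fill = sum(present) / len(present)
        return [fill if v is None else v for v in values]
    if method == "constant":
        return [constant if v is None else v for v in values]
    if method == "random":
        rng = random.Random(seed)
        # Each missing entry gets an independently drawn nonmissing value.
        return [rng.choice(present) if v is None else v for v in values]
    raise ValueError(f"unknown method: {method}")


income = [1.0, None, 3.0, None]
by_mean = impute(income, "mean")
by_constant = impute(income, "constant", constant=0.0)
by_random = impute(income, "random")
```

Mean imputation preserves the feature's average but shrinks its variance, whereas random imputation keeps the empirical distribution at the cost of added noise.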

Integrated text analytics

Supports 33 native languages out of the box:
- English
- Arabic
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- Farsi
- Finnish
- French
- German
- Greek
- Hebrew
- Hindi
- Hungarian
- Indonesian
- Italian
- Japanese
- Kazakh
- Korean
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Slovenian
- Spanish
- Swedish
- Tagalog
- Turkish
- Thai
- Vietnamese
Stop lists are automatically included and applied for all languages.
Automated parsing, tokenization, part-of-speech tagging and lemmatization.
Predefined concepts extract common entities such as names, dates, currency values, measurements, people, places and more.
Automated feature extraction with machine-generated topics (singular value decomposition and latent Dirichlet allocation).
Supports machine learning and rules-based approaches within a single project.
Automatic rule generation with BoolRule.
Classify documents more accurately with deep learning (recurrent neural networks).
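
The first steps of the text pipeline described in this section, tokenization and stop-list filtering, can be sketched in plain Python. The regular expression, the tiny English stop list and the function names are illustrative assumptions; production parsing also covers part-of-speech tagging and lemmatization, which are not shown.

```python
import re
from collections import Counter

# Hypothetical miniature stop list; real stop lists are far larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_counts(text, stop_words=STOP_WORDS):
    """Count the tokens that survive stop-list filtering."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


counts = term_counts("The model scores the data and the model learns.")
```

Term counts like these form the rows of a document-term matrix, which is the usual input for topic extraction methods such as singular value decomposition.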

Model assessment

Automatically calculates supervised learning model performance statistics.
Produces output statistics for interval and categorical targets.
Creates a lift table for interval and categorical targets.
Creates a ROC table for categorical targets.
Creates Event Classification and Nominal Classification charts for supervised learning models with a class target.
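
For a binary target, the ROC and lift tables mentioned above reduce to simple counts. The sketch below computes true and false positive rates at chosen score cutoffs and the lift in the top-scored fraction of the data. The labels, scores and function names are invented for illustration and do not reproduce the layout of the SAS assessment output.

```python
def roc_table(labels, scores, thresholds):
    """True and false positive rates when predicting an event for score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    table = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        table.append({"threshold": t, "tpr": tp / pos, "fpr": fp / neg})
    return table


def lift_at(labels, scores, fraction):
    """Event rate in the top-scored fraction divided by the overall event rate."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n]) / n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

roc = roc_table(labels, scores, thresholds=[0.5])
lift = lift_at(labels, scores, fraction=0.25)
```

A lift above 1 in the top fraction means the model concentrates events among its highest-scored observations better than random selection would.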

Model scoring

Automatically generates SAS DATA step code for model scoring.
Applies scoring logic to training, holdout and new data.

SAS Viya in-memory engine

CAS (SAS Cloud Analytic Services) performs processing in memory and distributes processing across nodes in a cluster.
User requests (expressed in a procedural language) are translated into actions with the parameters needed to process in a distributed environment. The result set and messages are passed back to the procedure for further action by the user.
Data is managed in blocks and can be loaded into memory on demand.
If tables exceed memory capacity, the server caches the blocks on disk. Data and intermediate results are held in memory as long as required, across jobs and user boundaries.
Includes highly efficient node-to-node communication. An algorithm determines the optimal number of nodes for a given job.
The communication layer supports fault tolerance and lets you add nodes to or remove nodes from a server while it is running. All components can be replicated for high availability.
Support for legacy SAS code and direct interoperability with SAS 9.4M6 clients.
Supports multitenancy deployment, allowing for a shared software stack to support isolated tenants in a secure manner.
