SAS Visual Text Analytics Features List

Data preparation & visualization

  • Ingests, cleanses and transforms data for analysis, easily accepting multiple file formats through local or remote file systems, relational databases and cloud storage.
  • Provides an intuitive user interface that accounts for important factors such as localization/internationalization and accessibility.
  • Provides the ability to visualize extracted entities, facts and relationships using network diagrams or path analysis.
  • Provides the ability to extract data from the Concepts node into a format ready for SAS Visual Analytics.
  • Term map enables you to visually identify relationships between terms.
  • The graphical user interface provides a visual programming flow.
  • Model explainability includes natural language generation (NLG) descriptions for all output.

Parsing

  • Parsing actions are provided as out-of-the-box functionality across all supported languages.
  • Text parsing supports distributed accumulation, which leads to faster processing of data by fully distributing all aspects of the accumulation process across the grid.
  • Tokenization chops character sequences into individual sentences, words or morphemes that can then be used as input for part-of-speech tagging.
  • Lemmatization associates words with their base forms.
  • Misspelling analysis associates misspelled words with a set of variants that includes the properly spelled word.
  • Part-of-speech tagging grammatically classifies words based on their definition and context.
  • Sentence boundary disambiguation determines where sentences begin and end.
  • Dependency parsing assigns syntactic relationships between the words of a sentence through the application of deep learning algorithms (a sketch of these parsing actions follows this list).
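
As a rough illustration of these parsing actions, the sketch below uses the open-source spaCy library rather than the SAS interface: it tokenizes a toy input, splits it into sentences, and prints each token's lemma, part-of-speech tag and dependency relation. The pipeline name en_core_web_sm is spaCy's small English model, not a SAS asset.

    # Illustrative only: spaCy standing in for the parsing actions above
    # (tokenization, sentence boundary disambiguation, lemmatization,
    # part-of-speech tagging and dependency parsing).
    import spacy

    nlp = spacy.load("en_core_web_sm")  # spaCy's small English pipeline
    doc = nlp("SAS parses text quickly. Each word gets a grammatical tag.")

    # Sentence boundary disambiguation
    for sent in doc.sents:
        print("sentence:", sent.text)

    # Tokenization, lemmatization, POS tagging and dependency parsing
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)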

Trend analysis

  • Automatic topic discovery uses two unsupervised machine learning methods – singular value decomposition and latent Dirichlet allocation – to group documents based on common themes (see the sketch after this list).
  • Relevance scores calculate how well each document belongs to each topic, and a binary flag shows topic membership above a given threshold.
  • Merge or split automatically generated topics (unsupervised machine learning) to create user-defined topics, applying subject matter expertise to refine the automated output.
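
A rough open-source analogue of the topic discovery described above, using scikit-learn rather than the SAS implementation: SVD topics from TF-IDF weights, LDA topics from raw term counts, and a binary membership flag above a chosen threshold. The document texts and the 0.25 threshold are illustrative assumptions.

    # Illustrative only: SVD and LDA topic models via scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

    docs = ["the team shipped the new release",
            "customers praised the release notes",
            "the survey covered pricing and support",
            "support tickets mentioned pricing tiers"]

    # SVD topics on TF-IDF weights
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)
    svd_scores = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

    # LDA topics on raw term counts
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda_scores = lda.fit_transform(counts)   # per-document relevance scores

    # Binary membership flag above a chosen threshold
    threshold = 0.25
    print((lda_scores > threshold).astype(int))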

Information extraction

  • Automatically pulls structured information out of unstructured or semistructured data to create new structured data, using tasks such as entity recognition, relationship extraction and coreference resolution.
  • Uses predefined concepts to extract common entities, such as names, organizations, locations, expressions of time, dates, quantities, percentages and more.
  • Scores text data using Named Entity Recognition (NER) models backed by machine learning, extracting information from text to speed and improve decision making.
  • Lets you create custom concepts using keywords, Boolean operators, regular expressions, predicate logic and a wide array of linguistic operators (see the regex sketch after this list).
  • Enables you to reference a predefined or custom concept in a categorization rule for extra contextual specificity or reach.
  • Automatically generates relevant concept rules and fact rules based on existing rules for a concept.
  • Lets you use the sandbox associated with each predefined and custom concept to quickly test new rules and subsets of your model against a document collection.
  • Identifies and groups languages within a set of documents containing multiple languages for faster, more accurate contextual analysis.
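
A minimal sketch of a regex-style custom concept, written with Python regular expressions rather than the product's own concept rule language; the money and percentage patterns below are illustrative assumptions, not shipped concepts.

    # Illustrative only: extracting two hypothetical concepts
    # (monetary amounts and percentages) with regular expressions.
    import re

    MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")
    PERCENT = re.compile(r"\d+(?:\.\d+)?\s?%")

    text = "Revenue rose 12.5% to $1,400,000 while costs fell 3%."
    facts = {"money": MONEY.findall(text), "percent": PERCENT.findall(text)}
    print(facts)   # {'money': ['$1,400,000'], 'percent': ['12.5%', '3%']}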

Hybrid modeling approaches

  • BERT-based classification captures the context and meaning of words in text, improving accuracy compared with traditional models. In addition to general classification, BERT-based models can be used for sentiment analysis (a sketch using an open-source BERT pipeline follows this list).
  • NLP capabilities include automated parsing, tokenization, part-of-speech tagging, lemmatization and misspelling detection.
  • Lets you apply start and stop lists.
  • Uses special tags, qualifiers and operators in linguistic rules that take advantage of parsing actions to allow for more precision or better recall/abstraction capabilities.
  • Uses rules-based linguistic methods to extract key concepts.
  • Automatic parsing can be used along with deep learning algorithms (recurrent neural networks) to classify documents and sentiment more accurately.
  • Automates topic generation with unsupervised machine learning.
  • Supervised/probabilistic machine learning models include BoolRule, Conditional Random Field and Probabilistic Semantics.
  • BoolRule enables automatic rule generation for document categorization.
  • Conditional Random Field and Probabilistic Semantics are used to label sequence data and can automate entity and relationship extraction by learning the contextual rules of a given entity. Automatic rule builders promote topics to categories with supervised machine learning.
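
To make the BERT-based classification concrete, here is a minimal sketch using the open-source Hugging Face transformers pipeline as a stand-in; the SAS models, training and deployment path differ, and the default checkpoint behind the pipeline is an assumption of the sketch.

    # Illustrative only: a BERT-family sentiment classifier via the
    # open-source transformers library, not the SAS implementation.
    from transformers import pipeline

    # The default model loaded here is an assumption; any fine-tuned
    # BERT classifier can be substituted.
    classifier = pipeline("sentiment-analysis")

    print(classifier("The new release is fast and the interface is intuitive."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]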

Sentiment analysis

  • Subjective information is identified in text and labeled as positive, negative or neutral using machine learning or a rules-based approach (a minimal rules-based sketch follows this list). That information is associated with an entity, and a visual depiction is provided through a sentiment indicator display.
  • Identifies and analyzes terms, phrases and character strings that imply sentiment.
  • Visually depicts sentiment through a sentiment indicator display at the document or topic level.
  • Provides a modern machine learning method for sentiment analysis based on the open BERT framework.
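
A minimal sketch of the rules-based side of sentiment analysis, assuming a tiny hand-built lexicon; the product's linguistic rules are far richer than this word-count heuristic.

    # Illustrative only: lexicon-based sentiment with hypothetical word lists.
    POSITIVE = {"great", "fast", "intuitive", "reliable"}
    NEGATIVE = {"slow", "buggy", "confusing", "crash"}

    def rule_based_sentiment(text: str) -> str:
        tokens = text.lower().split()
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(rule_based_sentiment("The app is fast but the installer is buggy"))
    # neutral (one positive term, one negative term)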

Corpus analysis

  • Run an action that carries out corpus analysis, creating a set of output tables containing counts and summary statistics.
  • View and understand insights about information complexity, vocabulary diversity, information density and comparison metrics against a predetermined reference corpus (a plain-Python sketch of such statistics follows this list).
  • Further analyze or visualize these statistics (using the counts) in reports created in SAS Visual Analytics.
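
A plain-Python sketch of the kinds of corpus statistics listed above, assuming whitespace tokenization; type-token ratio stands in for vocabulary diversity and average document length serves as a rough density proxy. The SAS action produces its own output tables.

    # Illustrative only: simple corpus summary statistics.
    from collections import Counter

    docs = ["the quick brown fox", "the lazy dog", "a quick dog"]
    tokens = [t for d in docs for t in d.split()]
    counts = Counter(tokens)

    vocab_diversity = len(counts) / len(tokens)   # type-token ratio
    avg_doc_length = len(tokens) / len(docs)      # rough density proxy
    print(f"vocabulary diversity: {vocab_diversity:.2f}, "
          f"average document length: {avg_doc_length:.1f} tokens")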

Flexible deployment

  • Concepts, Sentiment, Topics and Categories nodes provide the score code needed to deploy models on an external data set.
  • Score code is natively threaded for distributed processing, taking maximum advantage of computing resources to reduce latency to results, even on very large data sets.
  • Analytic store (ASTORE) is a binary file that represents the scoring logic from a specific model or algorithm. This compact asset allows for easy score code movement and integration into existing application frameworks. ASTORE support is available for the Concepts, Sentiment and Categories nodes (a hedged scoring sketch follows this list).
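
A hedged sketch of scoring new data against an ASTORE from Python via the SAS swat package. The host, port, table names and exact action parameters below are assumptions and may differ by release; check the CAS astore action set documentation for your environment.

    # Hedged sketch: scoring against an analytic store via swat.
    import swat

    conn = swat.CAS("cas-server.example.com", 5570)   # hypothetical host/port
    conn.loadactionset("astore")

    # Assumes "vta_model" is an ASTORE exported from a Concepts, Sentiment
    # or Categories node and "new_docs" is an in-memory CAS table; names
    # and parameters are placeholders, not verified against a release.
    conn.astore.score(rstore="vta_model",
                      table="new_docs",
                      casout={"name": "scored_docs", "replace": True})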

Native support for 33 languages

  • Automatically detects the languages represented in multilingual corpora (document collections); a detection sketch appears at the end of this section.
  • Out-of-the-box text analysis for 33 languages:
    • Arabic
    • Chinese
    • Croatian
    • Czech
    • Danish
    • Dutch
    • English
    • Farsi
    • Finnish
    • French
    • German
    • Greek
    • Hebrew
    • Hindi
    • Hungarian
    • Indonesian
    • Italian
    • Japanese
    • Kazakh
    • Korean
    • Norwegian
    • Polish
    • Portuguese
    • Romanian
    • Russian
    • Slovak
    • Slovene
    • Spanish
    • Swedish
    • Tagalog
    • Thai
    • Turkish
    • Vietnamese
  • Default stop list for each language the application supports.
  • Built-in lexicons that support parsing actions such as tokenization, lemmatization, misspelling analysis, part-of-speech tagging, dependency parsing and sentence boundary disambiguation.
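
A small stand-in for the automatic language detection above, using the open-source langdetect package rather than the SAS detector; the sample documents are illustrative.

    # Illustrative only: per-document language identification.
    from langdetect import detect

    docs = ["The contract renews in March.",
            "Le contrat est renouvelé en mars.",
            "El contrato se renueva en marzo."]

    for doc in docs:
        print(detect(doc), "-", doc)   # expected: en, fr, es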

Open platform

  • Seamlessly integrate with existing systems and open source technology.
  • Add the power of SAS Analytics to other applications using REST APIs.
  • Open APIs and a microservices architecture enable you to bypass the native GUI and use your own UI or build a custom search application.
  • Quickly and easily publish select text analytics models to SAS Micro Analytic Service (MAS) APIs, which you can embed in your web applications for on-demand categorization and concept extraction (a hedged REST sketch follows this list).
  • Out-of-the-box analytical programming interfaces for text summarization, text data segmentation, text parsing and mining, topic modeling, text rule development and scoring, text rule discovery, term mapping and topic term mapping, conditional random field and search.
  • Support for the entire analytics life cycle from data to discovery and deployment.
  • Code in a variety of programming languages, including SAS, Python, R, Java, Scala and Lua.
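
A hedged sketch of calling a published model over REST from Python with the requests library. The endpoint path, input name text_var, payload shape and token are hypothetical placeholders, not a documented SAS API contract.

    # Hedged sketch: invoking a published scoring endpoint over REST.
    import requests

    ENDPOINT = "https://viya.example.com/microanalyticScore/modules/vta_model/steps/score"
    TOKEN = "REPLACE_WITH_OAUTH_TOKEN"   # obtain via your site's auth flow

    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        json={"inputs": [{"name": "text_var",
                          "value": "Customer praised the support team."}]},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())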