SAS® Visual Text Analytics Features
Data preparation and visualization
- Ingests, cleanses and transforms data for analysis, easily accepting multiple file formats through local or remote file systems, relational databases and cloud storage.
- Provides an intuitive user interface that accounts for important factors such as localization/internationalization and accessibility.
- Provides the ability to visualize extracted entities, facts and relationships using network diagrams or path analysis.
- Term map enables you to visually identify relationships between terms.
- Graphical user interface provides a visual programming flow.
- Parsing actions are provided as out-of-the-box functionality across all supported languages.
- Text parsing supports distributed accumulation, speeding up processing by fully distributing the accumulation process across the grid.
- Tokenization chops character sequences into individual sentences, words or morphemes, which can then be used as input for part-of-speech tagging.
- Lemmatization associates words with their base forms.
- Misspelling analysis associates misspelled words with a set of variants that include the properly spelled word.
- Part-of-speech tagging grammatically classifies words based on their definition and context.
- Sentence boundary disambiguation determines where sentences begin and end.
- Dependency parsing assigns syntactic relationships between the words of a sentence through the application of deep learning algorithms.
- Automatic topic discovery uses two unsupervised machine learning methods – singular value decomposition and latent Dirichlet allocation – to group documents based on common themes.
- Relevance scores calculate how well each document belongs to each topic, and a binary flag shows topic membership above a given threshold.
- Merge or split automatically generated topics (unsupervised machine learning) to create user-defined topics, applying subject matter expertise to refine the automated output.
- Automatically pulls out structured information from an unstructured or semi-structured data type to create new structured data using tasks such as entity recognition, relationship extraction and coreference resolution.
- Uses predefined concepts to extract common entities, such as names, organizations, locations, expressions of time, dates, quantities, percentages and more.
- Lets you create custom concepts using keywords, Boolean operators, regular expressions, predicate logic and a wide array of linguistic operators.
- Enables you to reference a predefined or custom concept in a categorization rule for extra contextual specificity or reach.
- Automatically generates relevant concept rules and fact rules based on existing rules for a concept.
- Lets you use the sandbox associated with each predefined and custom concept to quickly test new rules and subsets of your model against a document collection.
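The concept rules described above can be sketched in miniature. The snippet below is a hypothetical, simplified stand-in (plain Python, not the SAS rule language): each concept maps to a regular expression, one of the rule types the product supports alongside keywords, Boolean operators, predicate logic and linguistic operators.

```python
import re

# Illustrative concept definitions (assumed names, not SAS predefined concepts).
CONCEPT_RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}

def extract_concepts(text):
    """Return (concept, matched_text) pairs found in the document."""
    hits = []
    for concept, pattern in CONCEPT_RULES.items():
        for match in pattern.finditer(text):
            hits.append((concept, match.group()))
    return hits

matches = extract_concepts("On 12/01/2023 revenue rose 4.5% to $1,200,000.00.")
```

In the product, rules like these are authored per concept and tested in the sandbox against a document collection before being referenced in categorization rules.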
Hybrid modeling approaches
- NLP capabilities include automated parsing, tokenization, part-of-speech tagging, lemmatization and misspelling detection.
- Lets you apply start and stop lists.
- Uses special tags, qualifiers and operators in linguistic rules that take advantage of parsing actions to allow for more precision or better recall/abstraction capabilities.
- Uses rules-based linguistic methods to extract key concepts.
- Automatic parsing can be used along with deep learning algorithms (recurrent neural networks) to classify documents and sentiment more accurately.
- Automates topic generation with unsupervised machine learning.
- Supervised/probabilistic machine learning models include BoolRule, Conditional Random Field and Probabilistic Semantics.
- BoolRule enables automatic rule generation for document categorization.
- Conditional Random Field and Probabilistic Semantics are used to label and sequence data, and can automate entity and relationship extraction by learning the contextual rules of a given entity.
- Automatic rule builders promote topics to categories with supervised machine learning.
- Identifies and analyzes terms, phrases and character strings that imply sentiment.
- Visually depicts sentiment through sentiment indicator display at a document or topic level.
- Provides ability to use recurrent neural networks for more accurate sentiment classification.
- Concepts, Sentiment, Topics and Categories nodes provide score code needed to deploy models on an external data set.
- Score code is natively threaded for distributed processing, taking maximum advantage of computing resources to reduce latency to results, even on very large data sets.
- Analytic store (ASTORE) is a binary file that represents the scoring logic from a specific model or algorithm. This compact asset allows for easy score code movement and integration into existing application frameworks. ASTORE support is available for the Concepts, Sentiment and Categories nodes.
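To make the sentiment scoring idea concrete, here is a minimal lexicon-based sketch in plain Python. The lexicon and weights are invented for illustration; the product's actual sentiment models are rule-based and/or recurrent neural networks, not a simple weighted word list.

```python
# Hypothetical polarity lexicon: term -> weight (assumed values).
SENTIMENT_LEXICON = {"excellent": 2, "good": 1, "poor": -1, "terrible": -2}

def score_sentiment(text, threshold=0):
    """Sum polarity weights over tokens, then threshold into a label."""
    tokens = text.lower().split()
    score = sum(SENTIMENT_LEXICON.get(t.strip(".,!?"), 0) for t in tokens)
    if score > threshold:
        label = "positive"
    elif score < -threshold:
        label = "negative"
    else:
        label = "neutral"
    return score, label

score, label = score_sentiment("Excellent service, but terrible wait times. Good overall!")
```

A document-level score like this is what the sentiment indicator display visualizes; in the product the same logic is exported as natively threaded score code for distributed deployment.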
Native support for 33 languages
- Out-of-the-box text analysis for 33 languages:
- Default stop list for each language the application supports.
- Built-in lexicons that support parsing actions such as tokenization, lemmatization, misspelling analysis, part-of-speech tagging, dependency parsing and sentence boundary disambiguation.
- Seamlessly integrate with existing systems and open source technology.
- Add the power of SAS analytics to other applications using REST APIs.
- Open APIs and a microservices architecture enable you to bypass the native GUI and use your own UI or build a custom search application.
- Out-of-the-box analytical programming interfaces for text summarization, text data segmentation, text parsing and mining, topic modeling, text rule development and scoring, text rule discovery, term mapping and topic term mapping, conditional random field and search.
- Support for the entire analytics life cycle, from data to discovery to deployment.
- Code in a variety of programming languages including SAS, Python, R, Java, Scala and Lua.
- Data and model lineage and governance allow you to maintain access and control of data management and analytics.
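Calling a deployed model over REST typically means assembling a URL and a JSON body. The sketch below shows the general shape in plain Python; the endpoint path, model identifier and payload fields are illustrative assumptions, not the documented SAS Viya API, so consult the product's REST API reference for the real contract.

```python
import json

def build_scoring_request(base_url, model_id, documents):
    """Assemble the URL and JSON body for a (hypothetical) text scoring call."""
    url = f"{base_url}/textAnalytics/models/{model_id}/score"
    body = json.dumps({"documents": [{"id": i, "text": t}
                                     for i, t in enumerate(documents)]})
    return url, body

url, body = build_scoring_request("https://viya.example.com", "sentiment-v1",
                                  ["Great product.", "Not what I expected."])
```

The same pattern applies whether the caller is a custom UI, a search application, or another service embedding SAS analytics via the open APIs.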