High-performance text mining procedures
Only available in the high-performance mode.
- Runs in symmetric multiprocessing (SMP) mode, taking advantage of multicore processors on an enabled SAS server to decrease processing times for compute-intensive tasks such as text parsing and singular value decomposition (SVD) generation.
- High-performance text parsing includes automated part-of-speech and noun group detection, entity and multiword term identification, stemming and synonym detection.
- Term and frequency weighting can be configured from default settings.
- High-performance SVD transformation reduces the document collection to a numeric, structured representation. SVD transformation output can be used as input into high-performance data mining nodes or any other analysis.
- High-performance, target-based weighting is available for more accurate categorical target estimates.
- Graphical and tabular output assesses terms and their distributions within a collection.
- Enables large-scale scoring of text data.
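The term and frequency weighting mentioned in this list can be illustrated with a minimal sketch. It assumes a log frequency weight combined with an inverse-document-frequency term weight, which are common choices in text mining; the source does not say which schemes the product uses. The function name `weight_terms` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def weight_terms(docs):
    """Return one {term: weight} dict per document.

    Assumed scheme (not taken from the source):
    frequency weight = log2(count + 1), term weight = log2(N / df) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weighted.append({
            term: math.log2(count + 1) * (math.log2(n_docs / doc_freq[term]) + 1)
            for term, count in counts.items()
        })
    return weighted

# Made-up sample collection for illustration only.
docs = ["engine noise engine", "engine stalls", "brake noise"]
weights = weight_terms(docs)
for doc, row in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in row.items()})
```

Under this scheme, a term that appears in few documents (such as "stalls") receives a larger weight than a term spread across the collection (such as "engine"), which is the usual motivation for moving away from raw counts.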
Term profiling
- Enables you to describe and predict a target variable based on the detailed terms.
- When the target variable is time, the Profile node illustrates the trends of terms over the selected time period, helping you visually assess emerging and declining terms.
- Clarify which terms are more meaningful to one another.
- Visually assess emerging or declining terms over time, including how terms are changing in relation to one another.
- Results and graphs are interactively linked for easy exploration.
Automatic Boolean rule generation
- Simplify taxonomy development and automatically generate Boolean rules.
- Resulting rules can be used to directly categorize documents based on rule matches.
- Rules can also be exported as Boolean rules and used as a starter rule set for SAS Enterprise Content Categorization.
- Includes graphical output to compare rules between training and validation data.
- Enables active learning by:
  - Providing automated, machine-generated suggestions of categories and topics that the user can recharacterize.
  - Letting the user modify the target assigned to the rules; when rules are regenerated from these modifications, the model is updated.
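As a rough illustration of how Boolean rules can categorize documents by rule matches, the sketch below treats each rule as a conjunction of terms that must be present and terms that must be absent, and assigns a category when any of its rules matches. The rule format, the function names `matches` and `categorize`, and the sample rules are assumptions made for this example; they do not reflect the rule syntax the product generates.

```python
def matches(rule, tokens):
    """A rule is a conjunction: every `present` term occurs and no `absent` term does."""
    return rule["present"] <= tokens and not (rule["absent"] & tokens)

def categorize(text, rules_by_category):
    """Return the sorted categories for which at least one rule matches."""
    tokens = set(text.lower().split())
    return sorted(
        category
        for category, rules in rules_by_category.items()
        if any(matches(rule, tokens) for rule in rules)
    )

# Hypothetical rule set; categories and terms are made up for illustration.
rules_by_category = {
    "billing": [
        {"present": {"invoice"}, "absent": set()},
        {"present": {"charge", "card"}, "absent": {"battery"}},
    ],
    "hardware": [
        {"present": {"battery"}, "absent": set()},
    ],
}

print(categorize("Wrong charge on my card", rules_by_category))   # ['billing']
print(categorize("Battery will not charge", rules_by_category))   # ['hardware']
```

Keeping each rule as plain sets of terms makes the rules easy to read and edit by hand, which is the property that makes a generated rule set usable as a starting point for manual refinement.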
User-friendly, flexible interface
- Merge topics to simplify similar results.
- Use topic displays to show document terms/all terms, highlighting why a document was assigned to a particular topic.
- Use view mode to illustrate just the terms in a single document or within a topic, or to sort text documents.
- Obtain document-level sentiment insights with an AFINN sentiment list, available as a sample data set with more than 2,000 terms and pre-assigned polarity weights.
- Modify, save and share process-flow diagrams of text mining analysis.
- Add tables to nodes, or replace tables from previous efforts, for more control over table imports.
- Extend text nodes further by customizing algorithms or declaring new, user-written business rules for predictive modeling, clustering, visualization and reporting – deployable as SAS score code.
- Determine which document languages to include in further processing.
- Conforms to accessibility standards for the Windows platform. Accessibility features relate to standards for electronic information technology that were adopted by the US government under Section 508 of the US Rehabilitation Act of 1973.
Integrated document filtering
- Employ sophisticated dimension reduction techniques that enable advanced filtering through weighting, integrated spell checking and transformation of qualitative data into compact formats.
- Create synonym data sets and import previously defined synonyms into the text filter node to improve reusability of existing assets.
Visual analysis of results
- Use the concept link diagram to analyze results visually and to effectively explore the relationships between terms.
- Use interactive diagrams to communicate results to key stakeholders:
  - Employ diagrams that cluster results, derive topic assessments and link associations among terms.
  - Use the success graph and the linked document rules table to explore generated Boolean rules.
Flexible options
- Define your own multiword terms (phrases such as "drag and drop").
- Choose from 18 pre-specified entity definitions for address, company, date, phone number, SSN, time and others to ensure extraction from input content.
- Create your own custom entities to be extracted from text inputs, including a list of pre-defined entities (such as defined districts or product codes) using the SAS Concept Creation for SAS Text Miner add-on.
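To make the idea of predefined and custom entity definitions concrete, here is a minimal sketch that extracts a few entity types with regular expressions. The patterns, the `PRODUCT_CODE` format and the function name `extract_entities` are hypothetical and deliberately simplistic (US-style phone numbers, ISO dates); they are not the definitions shipped with the product.

```python
import re

# Hypothetical entity definitions; real definitions would be far more robust.
ENTITY_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # A custom entity, in the spirit of a user-defined list of product codes.
    "PRODUCT_CODE": re.compile(r"\bPRD-\d{4}\b"),
}

def extract_entities(text):
    """Return (label, value) pairs in the order they appear in the text."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]

text = "Call 919-555-0100 about PRD-0042 before 2024-03-15; SSN 123-45-6789 on file."
entities = extract_entities(text)
print(entities)
```

Adding a custom entity in this sketch only requires a new dictionary entry, which mirrors the distinction between choosing a built-in definition and declaring your own.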
Interactive text importing interface
- Lets you dynamically create data sets from files contained in a directory or crawled from the web.
- Gives access to numerous forms of textual data, including PDF, Microsoft Word and other Microsoft Office formats, extended ASCII text, HTML, spreadsheets, presentations, email and database formats.
- Extracts, transforms and loads textual data into a SAS data set for mining.
- Accepts even potentially proprietary formats, converts the formats, and filters or extracts the text from the files, placing a copy in a plain file and referencing the data to SAS.
- Identifies each document's language and transcodes it to the session encoding format.
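The import steps listed above (collect files from a directory, extract their text, and load one row per document into a data set) can be sketched as follows. This is a simplified illustration under stated assumptions: it handles only plain text and HTML files, reads them as UTF-8, and omits format conversion, web crawling and language identification. The function name `load_text_files` and the row layout are hypothetical.

```python
import tempfile
from pathlib import Path

def load_text_files(directory, suffixes=(".txt", ".html")):
    """Build one row per matching file found under `directory` (recursively)."""
    rows = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            # Undecodable bytes are replaced rather than raising an error.
            text = path.read_text(encoding="utf-8", errors="replace")
            rows.append({"document": path.name, "size": len(text), "text": text})
    return rows

# Demonstration on a temporary directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first document", encoding="utf-8")
    Path(tmp, "b.html").write_text("<p>second</p>", encoding="utf-8")
    Path(tmp, "c.bin").write_bytes(b"\x00\x01")  # skipped: suffix not listed
    rows = load_text_files(tmp)

print([row["document"] for row in rows])  # ['a.txt', 'b.html']
```

Each row keeps the extracted text next to basic metadata, so later steps such as parsing or filtering can work from a single table rather than from the original files.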
Native support for multiple languages
- Supports Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish and Vietnamese. Includes dialects for Simplified and Traditional Chinese, Parisian and Canadian French, Old and New World German, Nynorsk and Bokmål Norwegian, Portuguese for Portugal and Brazil, and Spanish for South America and Spain.
- You can select which languages to include based on a pre-defined input variable.