Text Mining with SAS® Text Miner

Capitalize on the value hidden in textual information

Features

Universal data access

  • Access to numerous forms of textual data, including PDF, extended ASCII text, HTML and Microsoft Word.
  • Web crawling capabilities.
  • Ability to extract, transform and load textual data into a SAS data set for mining.
  • Support for multiple languages and encodings:
      • Total language list: Danish, Dutch, English, Finnish, French, German, Italian, Japanese, Korean, Norwegian (Bokmål), Portuguese, Spanish, Swedish, Traditional Chinese and Simplified Chinese.
      • Supported encodings: Latin-1, double-byte character sets and Unicode UTF-8.
      • European languages (Latin-1 encoding): Danish, Dutch, English, Finnish, French, German, Italian, Norwegian (Bokmål), Portuguese, Spanish and Swedish.
      • Far-Eastern languages (double-byte character support): Japanese, Korean, Simplified Chinese and Traditional Chinese.

Self-documenting interface

  • Visual process flow diagrams replace manual coding in a user-friendly interface.
  • Process flow diagrams can be modified, saved and shared with others.
  • Flexible reporting allows results to be published in a concise HTML format.

Comprehensive text preprocessing capabilities

  • Capture and distill the most important underlying information within a document collection.
  • Default or customized stop lists for each language to remove terms with little or no informational value.
  • Automated spelling correction.
  • Stemming to identify root words.
  • Part-of-speech tagging based on sentence context.
  • Noun group extraction for identifying phrase-level concepts such as "competitive intelligence."
  • User-defined multiword tokens, such as "point and click."
  • User-customized and default synonym lists.
  • Compound word splitting into distinct subterms.
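
To make the preprocessing steps above concrete, here is a minimal Python sketch of stop-list filtering, synonym mapping, multiword tokens and stemming. It is not SAS Text Miner code: the stop list, synonym list, multiword token table and the naive_stem helper are all hypothetical stand-ins chosen for illustration, and the suffix stripping is far cruder than real stemming.

```python
import re

# Hypothetical stop list, synonym list and multiword tokens (illustration only).
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "to", "in"}
SYNONYMS = {"automobile": "car", "autos": "car"}
MULTIWORD = {("point", "and", "click"): "point and click"}


def naive_stem(term):
    """Strip a few common English suffixes; a crude stand-in for real stemming."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text):
    """Tokenize, merge multiword tokens, drop stop words, map synonyms, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())

    # Merge user-defined multiword tokens before stop words are removed,
    # otherwise "and" would be dropped from "point and click".
    merged = []
    i = 0
    while i < len(tokens):
        for phrase, label in MULTIWORD.items():
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                merged.append(label)
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1

    result = []
    for tok in merged:
        if tok in STOP_WORDS:
            continue
        tok = SYNONYMS.get(tok, tok)
        if " " not in tok:  # leave multiword tokens untouched
            tok = naive_stem(tok)
        result.append(tok)
    return result
```

For example, preprocess("The point and click interface is used in autos") keeps the phrase "point and click" as one token, drops the stop words and maps "autos" to "car".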

Extensive feature extraction

  • Broad, customizable data dictionaries extract specific pieces of information such as names of people, products, organizations, URLs and addresses.
  • Extracted entities are then normalized and included in a matrix table.
  • Entity extraction is available for English, French, German and Spanish.
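
As a rough illustration of dictionary-based extraction and normalization, the Python sketch below looks up surface forms in a small dictionary and finds URLs with a regular expression. Everything in it is an assumption made for the example: the ENTITY_DICTIONARY entries, the "Acme" names, the URL pattern and the extract_entities function are hypothetical and do not reflect how SAS Text Miner implements this feature.

```python
import re

# Hypothetical dictionary: surface form -> (entity type, normalized name).
ENTITY_DICTIONARY = {
    "acme": ("ORGANIZATION", "Acme Corporation"),
    "acme corp": ("ORGANIZATION", "Acme Corporation"),
    "jane doe": ("PERSON", "Jane Doe"),
}
# Simplified URL pattern, assumed for illustration.
URL_PATTERN = re.compile(r"https?://[^\s,;]+")


def extract_entities(text):
    """Return (type, normalized value) pairs found in the text."""
    # URLs first, with any trailing sentence period trimmed.
    entities = [("URL", m.group(0).rstrip(".")) for m in URL_PATTERN.finditer(text)]
    remainder = URL_PATTERN.sub(" ", text)

    # Try longer surface forms first so "acme corp" wins over "acme".
    for surface in sorted(ENTITY_DICTIONARY, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        if pattern.search(remainder):
            entity = ENTITY_DICTIONARY[surface]
            if entity not in entities:  # normalization collapses variants
                entities.append(entity)
            remainder = pattern.sub(" ", remainder)
    return entities
```

Because both "acme" and "acme corp" normalize to the same name, a document that mentions either variant contributes a single entity to the downstream matrix.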

Dimension reduction techniques

  • Textual data is preprocessed into an information-rich matrix for application of powerful dimension reduction techniques.
  • Rollup terms automatically identify the n highest-weighted terms in a document.
  • Singular value decomposition (SVD) projects each document into an n-dimensional subspace.
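
The two reduction techniques above can be sketched in plain Python. This is an illustration under stated assumptions, not the product's implementation: raw term counts stand in for whatever term weighting is actually used, the function names are hypothetical, and the SVD part uses power iteration to recover only the leading singular triplet (a one-dimensional subspace) rather than a full n-dimensional decomposition.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Rows are documents, columns are terms, entries are raw term counts."""
    vocab = sorted({t for d in docs for t in d})
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([float(counts[t]) for t in vocab])
    return vocab, rows


def rollup_terms(vocab, row, n):
    """Return the n highest-weighted terms of one document row."""
    ranked = sorted(zip(vocab, row), key=lambda p: (-p[1], p[0]))
    return [t for t, w in ranked[:n] if w > 0]


def leading_singular_triplet(A, iterations=200):
    """Power iteration on A^T A for the largest singular value and vectors."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        u = [sum(a * b for a, b in zip(row, v)) for row in A]  # u = A v
        w = [sum(A[i][j] * u[i] for i in range(n_rows))        # w = A^T u
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(a * b for a, b in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma:
        u = [x / sigma for x in u]
    return sigma, u, v
```

With sigma, u and v in hand, sigma * u[i] is the coordinate of document i along the leading SVD dimension. Further dimensions would require deflation or a library routine, which this sketch leaves out.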

Text clustering algorithms

  • Group documents based on their content.
  • Expectation-maximization clustering groups documents using spatial clustering techniques.
  • Hierarchical clustering using Ward’s agglomerative method facilitates automatic grouping of documents into taxonomies. Documents grouped into hierarchical clusters belong to one leaf cluster as well as its parent clusters.
  • Cluster documents downstream in the process flow diagram using K-means or SOM/Kohonen clustering.
  • Profile clusters using additional structured data from original documents (age, purchase propensity, etc.).
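
Of the methods listed above, K-means is the simplest to show in a few lines. The Python sketch below is a generic textbook version, not SAS code: the kmeans function name, the deterministic seeding from the first k points and the fixed iteration count are assumptions made to keep the example short and reproducible. The input points could be, for instance, the reduced document coordinates produced by SVD.

```python
def kmeans(points, k, iterations=20):
    """Plain K-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Calling kmeans([[0, 0], [10, 10], [0, 1], [10, 11]], 2) places the first and third points in one cluster and the second and fourth in the other. Each cluster could then be profiled by joining the labels back to structured fields such as age.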
