Text Mining with SAS® Text Miner

Capitalize on the value hidden in textual information

SAS Text Miner provides a rich suite of tools for discovering and extracting knowledge from text documents. It transforms textual data into a usable, intelligible format that facilitates classifying documents, finding explicit relationships or associations between documents, and clustering documents into categories. It's the first mining solution that tightly integrates text-based information with structured data for improved analyses and decision making.

Benefits

  • Save money and resources.
  • Recognize trends and spot business opportunities.
  • Process a variety of information sources, including text and traditional databases, to deliver complete views of an organization.

Features

  • Universal data access
  • Support for multiple languages
  • Self-documenting interface
  • Comprehensive text preprocessing capabilities
  • Extensive feature extraction
  • Dimension reduction techniques
  • Text clustering algorithms
  • Interactive training
  • Document categorization

No other software delivers this depth and breadth of analytic functionality. 

—Dr. Patricia Cerrito, PhD

Professor of Mathematics

University of Louisville

Screenshots

Capitalize on the value hidden in document collections with a unified business analytics framework to improve your predictive models.


How SAS® Is Different

  • Ability to access a wide variety of document formats (e.g., PDF, ASCII, HTML, Microsoft Word and WordPerfect) in numerous languages.
  • A distinctive, integrated interface for analyzing text (unstructured data) in conjunction with multiple related database (structured) fields.
  • Sophisticated text parsing capabilities.
  • Ability to transform data into a compact, information-rich structure.
  • An interactive results browser that lets analysts explore concepts and relationships between documents and modify settings dynamically to further tailor analyses.

Benefits

  • Save money and resources. Many tasks are currently performed manually or ignored altogether. SAS Text Miner streamlines these organizational activities, resulting in immediate ROI and performance gains.
  • Recognize trends and spot business opportunities. Analysis of sources such as blogs, customer feedback and call center notes can reveal your customers' critical issues and their service and product needs. These insights help decision makers drive overall business direction.
  • Process a variety of information sources, including text and traditional databases, to deliver complete views of an organization. Combining structured data and unstructured data types enables you to automate many of the manual steps required before analysis traditionally begins. 

Features

Universal data access
  • Access to numerous forms of textual data, including PDF, extended ASCII text, HTML and Microsoft Word.
  • Web crawling capabilities.
  • Ability to extract, transform and load textual data into a SAS data set for mining.
Support for multiple languages
  • Supported languages: English, French, German, Italian, Portuguese, Spanish, Traditional Chinese and Simplified Chinese.
  • European languages (Latin-1 encoding): English, French, German, Italian, Portuguese and Spanish.
  • Far-Eastern languages (double-byte character support): Simplified Chinese and Traditional Chinese.
  • Encoding support for Unicode UTF-8.
Self-documenting interface
  • User-friendly interface replaces manual coding with visual process flow diagrams.
  • Process flow diagrams can be modified, saved and shared with others.
  • Flexible reporting allows results to be published in a concise HTML format.
Comprehensive text preprocessing capabilities
  • Capture and distill the most important underlying information within a document collection.
  • Default or customized stop lists for each language to remove terms with little or no informational value.
  • Automated spelling correction.
  • Stemming to identify root words.
  • Part-of-speech tagging based on sentence context.
  • Noun group extraction for identifying phrase-level concepts such as "competitive intelligence."
  • User-defined multiword tokens, such as "point and click."
  • User-customized and default synonym lists.
  • Compound word splitting into distinct subterms.
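
Several of the preprocessing steps listed above (multiword tokens, stop lists and synonym lists) can be illustrated with a short Python sketch. This is not SAS Text Miner code or its algorithm; the word lists and the `preprocess` function are hypothetical and exist only to show how the steps fit together.

```python
import re

# Hypothetical, tiny resources for illustration only; real stop lists and
# synonym lists are language-specific and far larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "to", "for", "on"}
MULTIWORD_TOKENS = {("point", "and", "click"): "point and click"}
SYNONYMS = {"automobile": "car", "cars": "car"}


def preprocess(text):
    """Return the list of normalized terms for one document."""
    words = re.findall(r"[a-z0-9']+", text.lower())

    # 1. Merge user-defined multiword tokens before stop-word removal;
    #    otherwise "and" would be dropped from "point and click".
    merged = []
    i = 0
    while i < len(words):
        for phrase, token in MULTIWORD_TOKENS.items():
            if tuple(words[i:i + len(phrase)]) == phrase:
                merged.append(token)
                i += len(phrase)
                break
        else:
            # No multiword token starts here; keep the single word.
            merged.append(words[i])
            i += 1

    # 2. Drop stop-list terms, then 3. map synonyms to one canonical term.
    kept = [w for w in merged if w not in STOP_WORDS]
    return [SYNONYMS.get(w, w) for w in kept]


terms = preprocess("The automobile is easy to configure: point and click.")
print(terms)  # ['car', 'easy', 'configure', 'point and click']
```

The order of the steps matters in this sketch: phrases are merged first so that stop-list removal cannot break them apart.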
Extensive feature extraction
  • Broad customizable data dictionaries can extract particular pieces of information such as names of people, products, organizations, URLs and addresses.
  • Extracted entities are then normalized and included in a matrix table.
  • Entity extraction is available for English, French, German and Spanish.
Dimension reduction techniques
  • Textual data is preprocessed into an information-rich matrix for application of powerful dimension reduction techniques.
  • Rollup terms automatically identify the n highest-weighted terms in a document.
  • Singular value decomposition (SVD) projects each document into an n-dimensional subspace.
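
To make the idea of picking the n highest-weighted terms in a document concrete, here is a minimal Python sketch. The product page does not say which weighting scheme is used, so the sketch assumes a plain TF-IDF weight (term count times the log of inverse document frequency); the sample documents and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed documents (lists of terms).
docs = {
    "d1": ["car", "engine", "engine", "repair"],
    "d2": ["car", "loan", "bank"],
    "d3": ["bank", "loan", "loan", "rate"],
}


def tfidf_weights(docs):
    """Return {doc_id: {term: weight}} using an assumed TF-IDF weighting."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()
        }
    return weights


def top_terms(doc_weights, n):
    """Return the n highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


weights = tfidf_weights(docs)
print(top_terms(weights["d1"], 2))  # ['engine', 'repair']
print(top_terms(weights["d3"], 2))  # ['rate', 'loan']
```

Terms that occur in most documents, such as "car" or "bank" here, receive low weights and fall out of the top-n list, which is what makes this a form of dimension reduction.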
Text clustering algorithms
  • Group documents based on their content.
  • Expectation-maximization clustering groups documents using spatial clustering techniques.
  • Hierarchical clustering using Ward’s agglomerative method facilitates automatic grouping of documents into taxonomies. Documents grouped into hierarchical clusters belong to one leaf cluster as well as its parent clusters.
  • Cluster documents downstream in the Process Flow Diagram using K-means or SOM/Kohonen clustering.
  • Profile clusters using additional structured data from original documents (age, purchase propensity, etc.).
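
The point that a document in a hierarchical clustering belongs to one leaf cluster and to all of that cluster's parents can be sketched as a simple tree walk. The Python example below does not implement Ward's method; it only assumes a finished hierarchy, and the cluster names are hypothetical.

```python
# Hypothetical taxonomy: each cluster maps to its parent; the root has None.
PARENT = {
    "all documents": None,
    "finance": "all documents",
    "automotive": "all documents",
    "loans": "finance",
    "rates": "finance",
}


def cluster_memberships(leaf):
    """Return the leaf cluster followed by each ancestor up to the root."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


# A document assigned to the "loans" leaf also belongs to its parents.
print(cluster_memberships("loans"))  # ['loans', 'finance', 'all documents']
```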
Interactive training
  • View a concise summary of results that includes document, term and cluster tables.
  • Sort the term table by term, term frequency, number of documents, weight and term role.
  • Expand parent terms to identify child terms and their related statistics.
  • Toggle between full and partial text view of the documents.
  • Find the n most similar items for the selected document, term or cluster.
  • Filter term(s) to show documents that contain them and clusters that contain those documents.
  • Filter document(s) to show all terms in the documents, as well as revised cluster counts.
  • Filter cluster(s) to show all documents in the filtered clusters, as well as the terms in those documents.
  • Modify the keep and drop term lists.
  • Treat selected terms as equivalent.
  • Reweight terms using a different algorithm.
  • Select the number of SVD dimensions.
  • Browse concept links.
  • Browse taxonomies.
  • View the top n most representative terms for each cluster.
  • Recluster anytime using a subset of documents or terms.
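
Finding the n most similar items for a selected document can be illustrated with a small Python sketch. The page does not name the similarity measure, so cosine similarity over weighted term vectors is assumed here; the vectors and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(selected, vectors, n):
    """Return the ids of the n documents most similar to `selected`."""
    scores = [
        (cosine(vectors[selected], vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != selected
    ]
    # Highest similarity first; ties are broken by document id.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scores[:n]]


# Hypothetical weighted term vectors for four documents.
vectors = {
    "d1": {"car": 1.0, "engine": 2.0},
    "d2": {"car": 1.0, "loan": 1.0},
    "d3": {"loan": 2.0, "rate": 1.0},
    "d4": {"engine": 1.0, "repair": 1.0},
}

print(most_similar("d1", vectors, 2))  # ['d4', 'd2']
```

The same function would work for terms or clusters, provided each item is represented as a vector in a shared space.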
Document categorization
  • Advanced techniques such as neural networks, memory-based reasoning, regression models and decision trees assign documents to predefined categories.
  • Seamlessly combine quantitative and qualitative data with text analysis to improve predictions.
  • Graphically display performance assessments of multiple models to compare and select the best one to deploy as score code for categorizing new documents.

System Requirements

Supported platforms

  • AIX: Version 5.3 and Version 6.1 on POWER architectures
  • HP-UX Itanium: HP-UX 11iv2 (11.23), 11iv3 (11.31)
  • Linux for x86 (x86-32): RHEL 4 and 5, SuSE SLES 9 and 10
  • Microsoft Windows (x86-32): Windows XP Professional, Windows Vista*, Windows Server 2003 family
  • Microsoft Windows on x64 (EM64T/AMD64): Windows XP Professional for x64, Windows Vista* for x64, Windows Server 2003 for x64
  • Solaris on SPARC: Version 9, 10

*NOTE: Supported Windows Vista editions are Enterprise, Business and Ultimate.

Supported Web browsers

  • Internet Explorer 6 on Windows XP Pro
  • Internet Explorer 7 on Windows XP Pro and Windows Vista*
  • Firefox 2.0 on Windows XP Pro, Windows Vista* and Linux x86 (SuSE and RHEL)

Middle tier required/optional software

  • SAS client and middle tier require Sun JRE 1.5

Required software

  • SAS Enterprise Miner is required and must be installed on the same machine

Ready to learn more?

Call us at 1-800-727-0025 (US and Canada) or request more information.