SAS® Enterprise Content Categorization

Have confidence in your content by using automatic, consistent categorization

SAS Enterprise Content Categorization, powered by Teragram technology, applies natural language processing (NLP) and advanced linguistic techniques to automatically categorize large volumes of multilingual content, whether that content is acquired, generated or already stored in a repository. It correctly parses and analyzes content for entities and events, which are then used to create metadata, develop taxonomies and generate the category rules and concept definitions that can be applied to large volumes of documents and that can trigger business processes.

How SAS® Is Different

  • SAS offers flexible options to suit your organization's needs as you get started with content categorization efforts. Add-ons for industry-specific taxonomy starter kits contain prebuilt rules that define concepts and their attribute values and provide detailed taxonomies. Wikipedia and DBPedia are integrated with SAS Enterprise Content Categorization – you can use no-cost downloads to automatically generate categories and subcategories and to extract entities.
  • Get the most from your investment using prebuilt, in-depth methods to understand content and to ensure that you can apply the technology to different corpora. Easily create advanced linguistic rules with prebuilt rule-development operators such as co-referencing, triple detection, stemming functions, and case-insensitive operators as well as predefined extraction concepts and automated rule generation.
  • Extend your unstructured content knowledge with SAS Enterprise Content Categorization and turn masses of content and document collections into reusable assets that span departmental silos, regardless of owner or location. For example, test documents that are stored in Microsoft Excel or use the conversion tools to standardize different document formats.
  • The solution is part of the SAS Business Analytics Framework, so you can integrate text insights into other fact-based information systems. Use machine-generated Boolean rules discovered in SAS Text Miner as inputs to your categorization projects, and bring your ontologies defined in SAS Ontology Management to the categorization server to classify your content with semantic terms. The result is faster, more efficient information insight and better relevance, to boost knowledge retention and sharing. Use the taxonomies from your categorization projects in SAS Sentiment Analysis to derive detailed opinions from the taxonomy concepts.

Benefits

  • Find the information you need, when you need it. Find the information you need regardless of whether it's been used before or whether you know its exact location. The flexible, intuitive software provides multiple ways to improve content retrieval, as it accurately defines metadata from the content itself and delivers only the most meaningful material related to your needs.
  • Improve efficiency and purge content chaos. Even in an era of inexpensive commodity storage, it isn't efficient or effective for organizations to retain all of their data. By categorizing and extracting text based on sophisticated linguistic rules and taxonomies, you keep only what is needed, and you can filter information before it is stored to reduce overhead.
  • Extend existing investments. The software transforms corporate textual data into a reusable asset, extending the value of predefined investments by building upon existing indexes and integrating with content management systems like Documentum and Microsoft SharePoint, and with search technologies like Endeca and FAST ESP.

Features

Categorization
  • Identify interrelationships between categories and concepts and adjust relevance schemes for concepts, as well as for priorities of concepts and their associated rules.
  • Automatically generate rules using features such as frequent phrase extraction and maximum entropy classifiers.
  • Refine models, using straightforward graphical reports that illustrate relevant statistics for category matches and show the precision, recall and numbers of passing and failing documents.
  • Specify XPath expressions in Boolean category rules to locate matching content in XML elements.
  • Jump-start taxonomy development efforts using no-cost downloads to automatically generate categories and subcategories from Wikipedia in different languages.
  • Use syntax checking and duplicate rule elimination for classifier concepts.
  • Automatically apply NLP/advanced linguistic technologies to classify and identify key information.
  • Use linguistic rules (unique, identifying terms) – and/or add Boolean operators to your unique terms – for added specificity in determining category membership.
  • Author simple or complex category rules and concept definitions.
  • Develop a list of unique identifying terms for each category rule.
  • Weight selective terms or the categories themselves to create more exclusive membership requirements.
  • Use test and document interfaces to validate the application of rules and definitions to batches, entire documents or content components.
  • Automatically apply rules and definitions to incoming texts using the client APIs in C#.NET, Java, Perl or Python.
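
To illustrate the idea of combining XPath expressions with Boolean logic in a category rule, here is a minimal Python sketch. It does not use the product's rule syntax or client APIs; the sample document, the `field_text` helper and the rule inside `matches_category` are all hypothetical, and the XPath support is the limited subset offered by Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """<article>
  <title>Quarterly merger update</title>
  <body>The acquisition closed in March.</body>
  <footer>Contact the merger desk for details.</footer>
</article>"""


def field_text(root, xpath):
    """Return the lower-cased text of every element matched by the XPath."""
    return " ".join((el.text or "").lower() for el in root.findall(xpath))


def matches_category(xml_string):
    """Hypothetical rule: (title contains 'merger') AND NOT (body contains 'rumor')."""
    root = ET.fromstring(xml_string)
    title = field_text(root, "./title")
    body = field_text(root, "./body")
    return "merger" in title and "rumor" not in body


print(matches_category(DOC))
```

Because the rule looks only at the `title` and `body` elements, the word "merger" in the `footer` has no effect on category membership, which is the point of restricting a match to specific XML elements.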
Entity and fact extraction
  • Use predefined extraction concepts to shorten the rule-writing process, including: address, location, date, phone number, SSN, person, time and more. New keyboard shortcuts simplify rule writing.
  • Extract entities from DBPedia with predefined plug-ins, available as no-cost downloads.
  • Detect triples, such as subject-verb-object relationships.
  • Hyperlink concept matches for extractions.
  • Use prebuilt co-reference operators and enhanced pronoun resolution rules in an intuitive GUI to help resolve pronouns more easily, addressing syntactical functions such as:
    • Linking to a matched string with its canonical form.
    • Co-referencing classifier definitions.
    • Restricting forward and preceding co-reference matches.
    • Using UNLESS and NOT operators to limit matches.
  • Limit XML fields for specified matches.
  • Use case-insensitive operators for greater rule-matching precision, including:
    • Prebuilt stemming to match on all word forms or only on noun or verb forms.
    • Paragraph operator to define the number of word, noun or verb matches within a paragraph.
    • Sentence operators to consider the maximum number of sentences and the number of beginning or ending words within a sentence where a match can occur.
  • Locate and return related pieces of data that form a fact or an event based on their context in real time (e.g., a person in relation to a company, the merger of two companies, etc.).
  • Identify unknown information and write context-specific rules that automatically extract facts and events without a requirement for precompiled dictionaries.
  • Automatically return only the facts and events with the highest priority or those with the longest match.
  • Customize matching criteria using these options:
    • Contextual markers.
    • Parts of speech.
    • Identifiers for lower- or uppercase words.
    • Boolean operators.
  • Write more than one rule to extract all of the possible permutations of the data you're seeking.
  • Disambiguate facts and events by excluding certain matches.
  • Distill vast quantities of information into simple concepts and a few easy-to-understand pieces of information.
  • Use dictionary, grammar and regular expression-based concepts to simplify the process of locating related data that's needed for subject-matter expertise.
  • Perform complex information tasks using an intuitive GUI.
  • Automate the customized classification and entity application to large volumes of multilingual content that's acquired, generated or may exist in a repository.
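
As a rough illustration of regular expression-based concepts and of returning only the longest match when matches overlap, consider the following Python sketch. The concept names and patterns are simplified assumptions made up for this example; they are not the product's predefined concepts, and the overlap handling is one plausible strategy rather than the product's documented behavior.

```python
import re

# Hypothetical concept definitions; the patterns are simplified for illustration.
CONCEPTS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "YEAR": re.compile(r"\b\d{4}\b"),
}


def extract(text):
    """Return (concept, matched_text) pairs, keeping the longest of any overlapping matches."""
    candidates = []
    for name, pattern in CONCEPTS.items():
        for m in pattern.finditer(text):
            candidates.append((m.start(), m.end(), name))
    # Longest spans first, then leftmost, so longer matches claim their text first.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    kept = []
    for start, end, name in candidates:
        # Keep a candidate only if it does not overlap a span that was already kept.
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, name))
    kept.sort()  # report matches in document order
    return [(name, text[start:end]) for start, end, name in kept]


print(extract("Call 919-555-0100 before 2012-06-30 about SSN 123-45-6789."))
```

In the sample sentence the `YEAR` pattern also matches "2012", "0100" and "6789", but each of those falls inside a longer `DATE`, `PHONE` or `SSN` match and is therefore dropped.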
Collaborative taxonomy management
  • Take advantage of an intuitive, easy-to-use graphical interface for creating and managing taxonomies in a secure, controlled and audited environment.
  • Easily identify the referenced category in the hierarchy tree view of your taxonomy with a simple double click.
  • Automatically generate synonyms.
  • Easily export testing results to CSV or tab-delimited TXT files for simplified inclusion in any SAS data set or for use with Microsoft Excel.
  • Simplify project development with reorganized taxonomy project settings and simplified server installation of a document conversion utility and categorization processor.
  • Use graphical reports to help identify taxonomy rule refinements, illustrating the statistics for category matches, as well as the precision, recall and numbers of passing and failing documents.
  • Define user-permission levels, including read, write, category rules and concept definitions.
  • Use new APIs for Python services and for testing programs to generate extraction output.
  • Add-on search and indexing modules:
    • Automatically discerns query semantics and enables superior drill-down and investigative capabilities by categorizing multifaceted information.
    • Enables virtual indexing for automatic segmentation and federation of search indexes that scale beyond the limits of a single machine. Multiple indexes are maintained in a managed fashion (Note: multitenancy required).
    • Use the new markup matcher as a point-and-click interface to simplify the extraction of fielded data from HTML or XML documents, specific to a site.
    • Includes an easy-to-use interface for search and document processing.
  • Add-on Web crawling:
    • Automatically download documents from the Internet using different types of downloading operations based on network bandwidth, crawling politeness and information coverage.
    • Use predefined crawler plug-ins for easy access to popular third-party services, including Google, Facebook, Twitter, Bing, BoardReader, Flickr, LinkedIn and Yahoo.
    • Use a point-and-click interface in markup matcher to extract fielded data from HTML or XML documents specific to a site. You can also edit and test your matches, for both XPath and regular expression rules.
    • Choose incremental recrawl of sites and/or limit traversal depth of Web crawls.
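
The idea of limiting the traversal depth of a crawl can be sketched in a few lines of Python. This is an illustration only, not the add-on crawler: the `LINKS` dictionary is a hypothetical in-memory site map standing in for fetched pages, and the `crawl` function and its parameters are assumptions made for this example. Politeness delays and incremental recrawl are not shown.

```python
from collections import deque

# Hypothetical site map standing in for fetched pages: URL -> outgoing links.
LINKS = {
    "/": ["/products", "/about"],
    "/products": ["/products/ecc", "/"],
    "/products/ecc": ["/products/ecc/specs"],
    "/about": [],
    "/products/ecc/specs": [],
}


def crawl(start, max_depth, get_links=LINKS.get):
    """Breadth-first traversal that visits no page more than max_depth links from start."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url) or []:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


print(crawl("/", 1))
```

With a depth limit of 1, the traversal visits the start page and the pages it links to directly, and stops before reaching `/products/ecc`.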
Add-on industry taxonomy starter kits
  • Use prebuilt industry-specific taxonomies when developing industry-specific classifications to get immediate ROI and jump-start document classification initiatives.

Screenshots

Wikipedia and DBPedia are integrated with SAS Enterprise Content Categorization.

Create categories and subcategories from Wikipedia – and extraction concepts from DBPedia – with no-cost downloads from support.sas.com.

Automatically generate rules

Automatic rule generation options include best method, term types and result specifications. Results detail the relative importance to the corpora.

Illustrate relationships in testing

Canonical relationships for entities and associated co-referencing are illustrated in testing to assist with model development and refinement.

Test Excel documents within the studio environment

You can test Microsoft Excel documents without leaving the SAS Content Categorization Studio environment.

Easily test model performance with validation data.

Easily test model performance with validation data to assess extraction rules for pronoun resolution and co-referencing.

Create precision and recall reports.

Explicit reports examining precision and recall help guide the model development process.
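
For readers unfamiliar with the two measures, the following Python sketch shows how precision and recall are conventionally computed for a single category. It uses the standard definitions and is not taken from the product's reports; the function name and the document IDs are hypothetical.

```python
def precision_recall(predicted, relevant):
    """Compute precision and recall for one category.

    predicted: IDs of documents the rules assigned to the category.
    relevant:  IDs of documents that truly belong to the category.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    # Precision: share of assigned documents that really belong to the category.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of documents belonging to the category that the rules found.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test results for one category.
p, r = precision_recall({"doc1", "doc2", "doc3", "doc4"}, {"doc2", "doc3", "doc5"})
print(p, r)
```

Here two of the four assigned documents are correct (precision 0.5), and two of the three documents that belong to the category were found (recall of about 0.67). Low precision suggests a rule is too broad; low recall suggests it is too narrow.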

Generate custom reports.

Virtually any standard or custom report can easily be created to address user needs, through direct integration with SAS reporting technologies.

Locate facts from documents using advanced linguistic technologies.

Predicate rules are among the advanced linguistic technologies that can be used to locate facts from documents and associate them with their given categories.

System Requirements

SAS Enterprise Content Categorization is a standalone product that requires no other SAS modules.

Client Environment
  • Microsoft Windows (x86-32 and x64): Windows XP Professional, Windows Vista*, Windows 7** family
  • Microsoft Windows (x64): Windows XP Professional for x64, Windows Vista* for x64, Windows 7** for x64
Server Environment
  • HP-UX Itanium: HP-UX 11iv3 (11.31)
  • HP-UX PA-RISC: HP-UX 11iv3 (11.31)
  • IBM AIX: Versions 6.1 and 7.1 (x64) on POWER architectures
  • Linux for x86 (x86-32): RHEL 5 and 6, SuSE SLES 10 and 11
  • Linux for x64 (EM64T/AMD64): RHEL 5 and 6, SuSE SLES 10 and 11
  • Microsoft Windows (x86-32 and x86-64): Windows XP Professional, Windows Vista*, Windows 7**, Windows Server 2003 family, Windows Server 2008 family
  • Microsoft Windows on x64 (EM64T/AMD64): Windows XP Professional for x64, Windows Vista* for x64, Windows 7** for x64, Windows Server 2003 for x64, Windows Server 2003 x86-32, Windows Server 2008 for x64
  • Solaris on SPARC: Version 10
  • Solaris on x64: Version 10

*NOTE: Windows Vista supported editions are Enterprise, Business and Ultimate.
**NOTE: Windows 7 supported editions are Professional, Enterprise and Ultimate.

Ready to learn more?

Call us at 1-800-727-0025 (US and Canada) or request more information.