SAS® Enterprise Content Categorization
Automate content categorization for faster, more relevant search and retrieval
How SAS® Is Different
- To jump-start your taxonomy development projects, the software provides prebuilt, industry-specific taxonomy starter kits and prebuilt rules for defining concepts and their attribute values. Use add-ons for search and indexing, Web crawling, text summarization, document duplication detection and more.
- Wikipedia data is integrated with SAS Enterprise Content Categorization to automatically generate categories when you create initial definitions.
- The software includes prebuilt operators: co-referencing, to resolve pronouns more easily; stemming, to match on all word forms or only on noun or verb forms; and case-insensitive operators, to improve rule-matching precision.
- With SAS Enterprise Content Categorization, you can turn masses of content and document collections into reusable assets that span departmental repositories, regardless of owner or location. The result is faster, more efficient information organization and access, more relevant results, and better knowledge retention and sharing.
Benefits
- Find the information you need, when you need it. With SAS Enterprise Content Categorization, you can find the information you need regardless of whether it's been used before or whether you know its exact location. The flexible, intuitive software provides multiple ways to retrieve content, as it accurately identifies metadata from the content itself and delivers only the most meaningful material related to your inquiry.
- Improve efficiency and purge content chaos. By applying sophisticated linguistic rules to identify and extract terms, then automatically applying the defined intelligence to the content, the software reduces the overhead of content categorization processes. Use it with large repositories to determine which documents are similar, contain only small variations or have been substantially modified.
- Extend existing investments. The software transforms corporate textual data into a reusable asset, extending the value of existing investments by building upon existing indexes and integrating with content management systems like Documentum and Microsoft SharePoint, and with search technologies like Endeca and FAST ESP.
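The document-similarity check mentioned under "Improve efficiency and purge content chaos" can be illustrated with a generic technique. The sketch below uses word shingles and Jaccard similarity; it is not the product's algorithm, and the function names, the three labels and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts are treated as a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


def compare(a, b, near=0.5):
    """Label a document pair; the 'near' threshold is an illustrative value."""
    score = jaccard(a, b)
    if score == 1.0:
        return "identical"
    if score >= near:
        return "small variations"
    return "substantially different"


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    edited = "the quick brown fox jumps over the lazy cat"
    print(compare(original, edited))
```

In practice the threshold would be tuned against a sample of documents whose relationship is already known.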
Features
- Entity, fact and event extraction
- Locate and return related pieces of data that form a fact or an event based on their context in real time (e.g., a person in relation to a company, the merger of two companies, etc.).
- Identify unknown information without a requirement for precompiled dictionaries.
- Customize matching criteria using options such as contextual markers, parts of speech tags and Boolean operators.
- Use prebuilt co-reference operators in an intuitive GUI to help resolve pronouns more easily, addressing syntactical functions such as:
- Linking to a matched string with its canonical form.
- Co-referencing classifier definitions.
- Restricting forward and preceding co-reference matches.
- Assigning a new concept name for a match on a specific term.
- Limiting XML fields for specified matches.
- Case-insensitive operators allow for greater rule-matching precision, including:
- Prebuilt stemming can match on all word forms or only on noun or verb forms.
- Sentence and paragraph operator defines the number of word, noun or verb matches within a paragraph.
- Write more than one rule to extract all of the possible permutations of the data you seek.
- Disambiguate facts and events by excluding certain matches.
- Category classification
- Control the number of terms that an automatic category rule generates for each category.
- Adjust relevance schemes (frequency-based or zone-based) for concepts used in category rules.
- Use linguistic rules, which are unique identifying terms, or add Boolean operators to your unique terms for added specificity in determining category membership.
- Develop a list of unique identifying terms for each category rule.
- Weight selective terms or the categories themselves, creating more exclusive membership requirements.
- Use the testing facilities to validate how rules and definitions apply to batches of documents, entire documents or individual content components.
- Benefit from enhanced XML handling, better syntax checking and duplicate rule elimination for classifier concepts.
- Automatically apply the rules and definitions to incoming texts using the client APIs in C, C++, C#, .NET, Java, Perl or Python.
- Collaborative and flexible taxonomy management
- Define user-permission levels, including read, write, category rules and concept definitions.
- Use an unlimited number of taxonomy nodes, and apply categories and concepts generated to large volumes of input documents.
- Develop a hierarchical taxonomy where related topics are grouped together, or create a flat taxonomy where there is no relationship between any of the nodes in the taxonomy tree.
- Automatically generate categories from Wikipedia data to jump-start your taxonomy development.
- Benefit from improved PDF conversion.
- Out-of-the-box integration
- Configure APIs to automatically tag content from Microsoft Office SharePoint, Endeca, FAST ESP, Documentum or other systems.
- Tag documents prior to indexing to speed up processing.
- Extend the capabilities of existing search tools and content management systems, increasing the relevance of retrieved materials.
- Support for more than 27 languages
- Language tools: NLP/advanced linguistic technologies that leverage:
- Part-of-speech recognition and tagging: Recognizes nouns, verbs, adjectives, etc.
- Stemming: Locates the various forms of an input noun or verb.
- Case sensitivity: Specifies uppercase and/or lowercase recognition for concepts.
- Use the following two options with Germanic and Asian languages:
- Compound recognition and compound decomposition: Break apart the recognized compound words.
- Segmentation for Asian languages.
- The product ships with English and the native language if other than English. Additional languages are licensed as add-ons.
- Add-on modules available
- Multiple, prebuilt industry-specific taxonomy starter kits are available as add-on modules for SAS Enterprise Content Categorization:
- Provide immediate ROI by classifying industry-specific content to help jump-start document classification initiatives.
- Include detailed concepts and attribute values with predefined rules, helpful for initiating taxonomy project development.
- Search and indexing add-ons automatically discern query semantics and enable superior drill-down and investigative capabilities by categorizing multifaceted information:
- Include an easy-to-use interface for search and document processing.
- Can extract entities, concepts and facts within a unified document processor.
- Support multiple document schemas with multiple instantiations, and also retain the original URL in any split documents.
- Web crawling add-ons automatically download documents from the Internet by performing several different kinds of downloading operations based on network bandwidth, crawling politeness and information coverage:
- Have an easy-to-use interface for defining and managing Web and internal file system crawls.
- Set individual quotas for a specified URL, and define project quotas.
- Other add-on modules include text summarization, document duplication detection, content alerts and additional languages.
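The category classification features listed above (unique identifying terms, Boolean operators, weighted terms and frequency-based relevance) can be pictured with a small generic sketch. It does not use the SAS rule syntax or the product's client APIs; the `CategoryRule` class, its field names and the sample rule are hypothetical and exist only to show how such a rule might decide category membership.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CategoryRule:
    """Hypothetical rule: Boolean term conditions plus a weighted score."""
    name: str
    all_of: set = field(default_factory=set)    # AND: every term must appear
    any_of: set = field(default_factory=set)    # OR: at least one must appear
    none_of: set = field(default_factory=set)   # NOT: none may appear
    weights: dict = field(default_factory=dict)  # term -> weight per occurrence
    threshold: float = 0.0                       # minimum score for membership

    def matches(self, text):
        tokens = re.findall(r"\w+", text.lower())
        present = set(tokens)
        if not self.all_of <= present:
            return False
        if self.any_of and not (self.any_of & present):
            return False
        if self.none_of & present:
            return False
        # Frequency-based relevance: each occurrence adds the term's weight.
        score = sum(w * tokens.count(t) for t, w in self.weights.items())
        return score >= self.threshold


def classify(text, rules):
    """Return the names of all categories whose rule matches the text."""
    return [rule.name for rule in rules if rule.matches(text)]


# Illustrative rule only; the terms and weights are made up.
mergers = CategoryRule(
    name="Mergers",
    all_of={"merger"},
    any_of={"company", "companies"},
    none_of={"rumor"},
    weights={"merger": 2.0, "acquisition": 1.5},
    threshold=2.0,
)

if __name__ == "__main__":
    print(classify("The merger of the two companies closed today.", [mergers]))
```

Raising a term's weight or the threshold makes membership more exclusive, which is the effect the weighting feature describes.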
Screenshots
Easily test model performance with validation data.
Easily test model performance with validation data to assess extraction rules for pronoun resolution and co-referencing.
Create precision and recall reports.
Explicit reports examining precision and recall help guide the model development process.
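For readers unfamiliar with the two measures, precision is the share of assigned labels that are correct and recall is the share of correct labels that were found. The sketch below computes both from sets of document identifiers; the function name and the sample identifiers are assumptions for illustration and do not reflect how the product's reports are generated.

```python
def precision_recall(predicted, expected):
    """Return (precision, recall) for one category.

    predicted: documents the model assigned to the category
    expected:  documents that belong to it in the validation data
    """
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up identifiers: two of four predictions are correct,
    # and two of the three expected documents were found.
    p, r = precision_recall(
        {"doc1", "doc2", "doc3", "doc4"},
        {"doc1", "doc2", "doc5"},
    )
    print(f"precision={p:.2f} recall={r:.2f}")
```

Low precision suggests a rule is too broad; low recall suggests it is too narrow.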
Generate custom reports.
Direct integration with SAS reporting technologies lets you create virtually any standard or custom report to address user needs.
System Requirements
SAS Enterprise Content Categorization is a standalone product that requires no other SAS modules.
Client Environment
- Microsoft Windows (x86-32 and x64): Windows XP Professional, Windows Vista*, Windows 7** family
- Microsoft Windows (x64): Windows XP Professional for x64, Windows Vista* for x64, Windows 7** for x64
Server Environment
- HP-UX Itanium: HP-UX 11iv3 (11.31)
- HP-UX PA-RISC: HP-UX 11iv3 (11.31)
- IBM AIX: Versions 6.1 and 7.1 (x64) on POWER architectures
- Linux for x86 (x86-32): RHEL 5 and 6, SuSE SLES 10 and 11
- Linux for x64 (EM64T/AMD64): RHEL 5 and 6, SuSE SLES 10 and 11
- Microsoft Windows (x86-32 and x86-64): Windows XP Professional, Windows Vista*, Windows 7**, Windows Server 2003 family, Windows Server 2008 family
- Microsoft Windows on x64 (EM64T/AMD64): Windows XP Professional for x64, Windows Vista* for x64, Windows 7** for x64, Windows Server 2003 for x64, Windows Server 2003 x86-32, Windows Server 2008 for x64
- Solaris on SPARC: Version 10
- Solaris on x64: Version 10
*NOTE: Windows Vista supported editions are Enterprise, Business and Ultimate.
**NOTE: Windows 7 supported editions are Professional, Enterprise and Ultimate.
Ready to learn more?
Call us at 1-800-727-0025 (US and Canada) or request more information.


