SAS® Enterprise Content Categorization Add-On Modules
Customize your solution for information organization, access and findability
SAS Enterprise Content Categorization add-on modules apply natural language processing (NLP) and advanced linguistic techniques to automate text-processing operations, so organizations quickly realize additional efficiency gains where they are needed most.
Benefits
- Enable users to find the information they need quickly.
- Drive faster, more efficient information access.
- Purge content chaos that spans multiple enterprise repositories.
Features
- SAS® Industry Taxonomy Rules can help quick-start implementation
- SAS® Document Duplication Detection recognizes which documents are similar
- SAS® Text Summarization distills documents and creates concise summaries
- SAS® Web Crawler automatically downloads documents from the Internet
- SAS® Content Alerts provides notification services through a variety of alert media
- SAS® Search and Indexing automatically discerns query semantics and provides drill-down capabilities
- SAS® Content Categorization Information Workbench combines human editorial review with automatic categorization
- SAS® Text Data Language Packs provide choices of Asian, Eastern and Western European, and Middle Eastern languages
How SAS® Is Different
- SAS extends your business processes that rely on accurate content categorization with several add-on modules, providing effective search and retrieval activities, meaningful summaries of materials, real-time alerts to new content availability and more.
- These unique technologies enable richer processing at the level of words, linguistic relations and word meanings – solving the issues associated with excessive electronic information materials and their exponential growth rate.
- SAS Enterprise Content Categorization Add-On Modules ensure that you can customize your SAS content categorization solution to meet specific organizational needs.
Benefits
- Enable users to find the information they need quickly. Effective findability retrieves content in context so users can find the information they need whether they know where it is or not. Add-on capabilities include: out-of-the-box industry taxonomies to quick-start categorization efforts; a high-performance Web crawler that automatically downloads appropriate documents from the Internet and internal file systems; a text summarization module that conveys key aspects of a document in condensed form; and a scalable, real-time alert notification service that delivers documents to millions of users at individually specified times.
- Drive faster, more efficient information access. The SAS Search and Indexing add-on module automatically discerns query semantics so users can find the information they need, without needing to know how each search term is defined to the system. Quickly homing in on what is needed and narrowing the information down to only the relevant content, this add-on applies stemming and automatic spelling correction, enabling richer preprocessing. Because these linguistic technologies are applied at the preprocessing level, searches become more accurate and meaningful.
- Purge content chaos that spans multiple enterprise repositories. Enterprise repositories often contain many documents that have been duplicated or edited and republished. Extending the categorization of similar content, the SAS Document Duplication Detection add-on helps organizations minimize their content stores, maintaining only those materials that meet the threshold standards of similarity.
Features
- SAS® Industry Taxonomy Rules can help quick-start implementation
  - Prebuilt taxonomies provide a place to begin categorization efforts, when previous taxonomies, product lists or dictionaries are not readily available.
  - Term hierarchies, yellow pages, predefined rules, attributes and attribute values provide an initial, rich source for categorization definitions.
  - Industry-relevant information can be made available in more than 30 languages, providing shorter implementation cycles in the development of complex linguistic rules.
  - Updates are included as part of the licensing agreement.
  - Platforms supported (client only): Microsoft Windows (x86-32 and x64).
- SAS® Document Duplication Detection recognizes which documents are similar
  - Designed to recognize which documents within a large set are similar up to a threshold of similarity.
  - Configurable similarity threshold allows the system to detect versions of documents that have been substantially re-edited or to focus on documents that are only small variations of others.
  - Documents are abstracted from their actual format to focus on the content.
  - Platforms supported (server only): AIX, HP-UX Itanium, HP-UX PA-RISC, Linux for x86, Linux for x64, Microsoft Windows (x86-32), Microsoft Windows on x64, Solaris on SPARC and Solaris x64.
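The idea of a configurable similarity threshold can be illustrated with a minimal sketch. The source does not say which similarity measure the module uses; the example below assumes word shingles compared with Jaccard similarity, a common generic approach, and the function names (`shingles`, `jaccard`, `similar_pairs`) are hypothetical.

```python
from __future__ import annotations

import re
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles, ignoring case, punctuation and layout."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(docs: dict[str, str], threshold: float) -> list[tuple[str, str, float]]:
    """Return every pair of document names whose similarity meets the threshold."""
    sets = {name: shingles(text) for name, text in docs.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely different text about databases and servers",
}
# A lower threshold catches re-edited versions; a higher one keeps only near-identical copies.
print(similar_pairs(docs, threshold=0.7))
print(similar_pairs(docs, threshold=0.8))
```

Lowering the threshold in this sketch widens the net to more heavily edited versions, while raising it limits matches to small variations, which is the trade-off the configurable threshold describes.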
- SAS® Text Summarization distills documents and creates concise summaries
  - Documents are summarized automatically for wide distribution of content.
  - Natural order of key sentences describes the essence of text so it is meaningful to readers.
  - The relative importance of special concepts (i.e., anchor words or word strings) can be defined to capture subject-matter expertise.
  - Existing concepts and concept taxonomies are used to define single concepts or relationships and form the basis of definitions that are sought in the identification of key sentences, including classifier concepts (authority lists), regex concepts (regular expressions) and grammar concepts (syntactic patterns).
  - Documents written in different languages can be summarized while retaining the inherent meaning within the natural language of the source content. Word tokenization is dependent on the language of the materials being summarized.
  - Platforms supported (client): Linux for x86, Microsoft Windows (x86-32 and x64).
  - Platforms supported (server): AIX, HP-UX PA-RISC, HP-UX Itanium, Linux for x86, Linux for x64, Microsoft Windows (x86-32), Microsoft Windows on x64, Solaris on SPARC and Solaris x64.
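As a rough illustration of weighted anchor words and natural sentence order, the sketch below scores each sentence by the anchor words it contains, keeps the highest-scoring ones and returns them in their original order. This is a generic extractive approach assumed for illustration, not the module's documented method; the `summarize` function, its parameters and the sample weights are hypothetical, and the sentence splitting and tokenization are deliberately naive and English-only.

```python
from __future__ import annotations

import re


def summarize(text: str, anchor_weights: dict[str, float], max_sentences: int = 2) -> list[str]:
    """Pick the highest-scoring sentences and return them in document order."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for index, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence.lower())
        score = sum(anchor_weights.get(word, 0.0) for word in words)
        scored.append((score, index))
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    # Restore the natural order of the selected sentences.
    keep = sorted(index for _, index in top)
    return [sentences[i] for i in keep]


text = (
    "The merger was announced on Monday. Weather was mild. "
    "Regulators will review the merger terms. Shares rose after the announcement."
)
weights = {"merger": 2.0, "regulators": 1.5, "shares": 1.0}
for line in summarize(text, weights):
    print(line)
```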
- SAS® Web Crawler automatically downloads documents from the Internet
  - Starting URLs and parameters for the crawler can be defined from the thin-client interface. The crawler then follows hyperlinks on the Web, repeatedly sending HTTP requests to obtain the corresponding HTML content along with any URLs found within that content.
  - High-performance crawling: Runs in multithreaded mode with a configurable number of threads.
  - Distributed crawling: Distributed running mode to optimize crawling. When multiple crawlers are running simultaneously, each crawler sends each set of links to the crawler to which it belongs.
  - Incremental crawling: Enables continuous downloads.
  - Page quality: Crawl the highest-quality pages first, when the quantity of object pages is very large. Duplicates of URLs or page contents are automatically removed.
  - Polite downloads prevent complaints or access blocking from crawled sites. Specify the minimum access interval for continuous downloads from each site, maximum parallel connections to each site or domain, or the maximum number of times to retry each failed HTTP request.
  - JavaScript parsing: Extracts URLs from JavaScript, where content is often deeply embedded.
  - Logon for cookie-supported and password-protected websites.
  - Enhanced management and configuration.
  - Entry points: Specify a list of URLs as seeds to start the crawling and define the number of pages to start from each seed.
  - Portal list: Define URLs to download without extracting new URLs.
  - Link-following restrictions: Define link-following rules with regular expressions to restrict the crawling area (e.g., to a directory, server or domain).
  - Excluded paths: Provide a list of URL paths to exclude from the crawl. Any URL that is not an entry point will not be extracted if it contains an excluded pattern.
  - Platforms supported (server only): AIX, HP-UX PA-RISC, HP-UX Itanium, Linux for x86, Linux for x64, Microsoft Windows (x86-32), Microsoft Windows on x64, Solaris on SPARC and Solaris x64.
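How entry points, link-following restrictions, excluded paths and a minimum access interval might interact can be sketched as follows. This is not the crawler's actual configuration format or API; the `CrawlPolicy` class, its method names and the example URLs are all hypothetical, and the sketch makes no network requests.

```python
import re
from urllib.parse import urlparse


class CrawlPolicy:
    """Hypothetical model of the crawl settings described above."""

    def __init__(self, entry_points, follow_patterns, excluded_paths, min_interval_seconds):
        self.entry_points = set(entry_points)
        self.follow_rules = [re.compile(p) for p in follow_patterns]
        self.excluded_paths = list(excluded_paths)
        self.min_interval = min_interval_seconds
        self._last_access = {}  # host -> time of the most recent request

    def should_follow(self, url):
        """Entry points are always crawled; other URLs must pass both filters."""
        if url in self.entry_points:
            return True
        # A non-entry-point URL is dropped if it contains an excluded pattern.
        if any(fragment in url for fragment in self.excluded_paths):
            return False
        # Link-following restriction: at least one regular expression must match.
        return any(rule.search(url) for rule in self.follow_rules)

    def seconds_to_wait(self, url, now):
        """Politeness: time left before the host of this URL may be contacted again."""
        last = self._last_access.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (now - last))

    def record_access(self, url, now):
        self._last_access[urlparse(url).netloc] = now


policy = CrawlPolicy(
    entry_points=["https://example.com/docs/private/start.html"],
    follow_patterns=[r"^https://example\.com/docs/"],  # stay inside one directory
    excluded_paths=["/private/"],
    min_interval_seconds=2.0,
)
print(policy.should_follow("https://example.com/docs/guide.html"))
print(policy.should_follow("https://example.com/docs/private/other.html"))
```

Passing the clock value in as `now` keeps the politeness check deterministic and easy to test; a real crawler would read it from a monotonic clock.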
- SAS® Content Alerts provides notification services through a variety of alert media
  - HTML, text or XML email alerts can be specified.
  - Email, SMS or other means of alerts are available.
  - Multiple alerts to the same user can be combined into a single alert.
  - All alerts are encoded in an intermediate XML format for delivery processing.
  - Users can specify the time when alerts are sent (time of day or as soon as possible).
  - Users can communicate directly through the SMTP protocol to a send-mail server. Returned emails can be checked automatically by accessing a POP server.
  - Preformatted files can be generated for use with existing email programs.
  - Highly scalable to millions of users with a constant flow of documents.
  - Platforms supported (server only): AIX, HP-UX PA-RISC, HP-UX Itanium, Linux for x86, Linux for x64, Microsoft Windows (x86-32), Microsoft Windows on x64, Solaris on SPARC and Solaris x64.
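Combining several alerts for one user into a single XML-encoded alert could look roughly like the sketch below. The source does not describe the intermediate XML schema, so the element and attribute names (`alert`, `recipient`, `document`, `url`) and the `build_digests` function are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_digests(alerts):
    """Group (user, title, url) alerts by user and encode one XML alert per user."""
    grouped = defaultdict(list)
    for user, title, url in alerts:
        grouped[user].append((title, url))

    digests = {}
    for user, items in grouped.items():
        # Hypothetical schema: one <alert> per recipient, one <document> per item.
        root = ET.Element("alert", {"recipient": user})
        for title, url in items:
            doc = ET.SubElement(root, "document", {"url": url})
            doc.text = title
        digests[user] = ET.tostring(root, encoding="unicode")
    return digests


alerts = [
    ("ann@example.com", "Q3 report", "https://example.com/q3"),
    ("bob@example.com", "Policy update", "https://example.com/policy"),
    ("ann@example.com", "Audit notes", "https://example.com/audit"),
]
for user, xml_text in build_digests(alerts).items():
    print(user, xml_text)
```

Keeping the combined alert in a neutral XML form means a later delivery step could render the same data as HTML, plain text or SMS without regrouping it.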
- SAS® Search and Indexing automatically discerns query semantics and provides drill-down capabilities
  - Linguistic techniques can be applied to search queries and documents from the thin-client interface. These techniques are applied at the preprocessing level to provide a more accurate and relevant search.
  - Advanced linguistics technologies such as stemming, concept extraction and automatic spelling correction can be used to provide richer processing at the level of words, linguistic relations and word meanings.
  - Information can be organized into an intuitive hierarchical directory, which encapsulates specific categories into more general categories, allowing for greater flexibility and faceted search.
  - Searches can be narrowed down within a category, or users can browse documents in the category of interest.
  - Platforms supported (server only): AIX, HP-UX PA-RISC, HP-UX Itanium, Linux for x86, Linux for x64, Microsoft Windows (x86-32), Microsoft Windows on x64, Solaris on SPARC and Solaris x64.
- SAS® Content Categorization Information Workbench combines human editorial review with automatic categorization
  - Workflow tool incorporating automatic abstracting, categorization and entity extraction is designed for indexers or editors.
  - Human editorial review is combined with automatic abstracting, categorization and metadata tagging.
  - Measurable business value and productivity are increased and return on investment is accelerated, while the risks of full automation are avoided.
  - A feedback loop to the taxonomy tool is provided for editing the taxonomy, based on the use of nodes in the taxonomy.
  - Platforms supported (client only): Linux for x86, Microsoft Windows (x86-32 and x64).
- SAS® Text Data Language Packs provide choices of Asian, Eastern and Western European, and Middle Eastern languages
  - SAS Enterprise Content Categorization ships with English and the native language, if not English.
  - Platforms supported (client only): Microsoft Windows (x86-32 and x64).
Screenshots
Wizard-driven interface
A wizard-driven interface helps you easily define crawling, indexing and search configurations.
System Requirements
All add-ons require a license for SAS Enterprise Content Categorization or the single-user version, SAS Content Categorization. Because supported platforms vary for each add-on, see the Features section for specific platform information.
Client environment
- Linux for x86 (x86-32): RHEL 4, SuSE SLES 9
- Microsoft Windows (x86-32 and x64): Windows XP Professional, Windows Vista*, Windows Server 2003 family
Server environment
- AIX: Versions 5.3 and 6.1 (x64) on POWER architectures
- HP-UX Itanium: HP-UX 11iv2 (11.23), 11iv3 (11.31)
- HP-UX PA-RISC: HP-UX 11iv2 (11.23), 11iv3 (11.31)
- Linux for x86 (x86-32): RHEL 4, SuSE SLES 9
- Linux for x64 (EM64T/AMD64): RHEL 4, SuSE SLES 9
- Microsoft Windows (x86-32): Windows XP Professional, Windows Server 2003, Windows Vista*
- Microsoft Windows on x64 (EM64T/AMD64): Windows XP Professional for x64, Windows Vista* for x64, Windows Server 2003 for x64
- Solaris on SPARC: Versions 9 and 10
- Solaris on x64: Version 10
Ready to learn more?
Call us at 1-800-727-0025 (US and Canada) or request more information.


