Building a more searchable library

SRA re-indexed its collection faster and more accurately, and discovered new documents along the way

When a government agency needed a better way to index, categorize and generate metadata for documents created over the last century, it chose SRA International to manage the project, and SRA used SAS® Text Analytics. With SAS®, categorizing is more accurate – and faster. The SAS® Content Categorization Studio provided the agency with the capability to re-index hundreds of thousands of documents in its public-facing collection in hours rather than months. And new documents can now be indexed in seconds rather than minutes or hours.

SRA is a leading provider of technology and strategic consulting services and solutions to government organizations and commercial clients. It works in areas such as global health, disaster response planning, and infrastructure and managed services. When one of its customers asked for assistance creating a corporate taxonomy and automating it, SRA used SAS Text Analytics for the project.

"SAS allows you to work under the hood. It's not a black box," explains Bill McKinney, the taxonomy manager for the project. "The client had developed an extensive thesaurus over many years and wanted to build a semantic foundation that could be leveraged by the text analytics tool to generate persistent metadata. Essentially, what we did was analyze the semantic problem and devise a machine solution that resembles a human solution."

The client had some unique needs. It creates hundreds of reports a year and has a digitized library collection that dates back more than 100 years. The collection was extremely challenging to index manually because of changes in how information had been organized over the years. Manually implementing even relatively small changes to global metadata was beyond the agency's available resources. Without updated metadata, it was hard for people to search the database.

"The client had a findability problem and an information organization problem," McKinney says. "And the organization wanted to develop a taxonomy to make content easier to find and make the process of generating metadata (meta-tagging) more efficient. For the client, creating a taxonomy and applying taxonomy metadata manually to the document collections was too difficult, perhaps impossible. So we needed a tool to automatically categorize and programmatically apply metadata to content."

SRA also needed a product that could be integrated into a content management or search platform. "We wanted the flexibility of having a tool that was standalone," McKinney says. Adds Troy Pomroy, the project's technical lead: "We needed something that we could integrate into other systems based on the organization's architecture."

Earning Quick Results

The organization had wanted to re-index its entire collection of published reports for more than five years. "But manual re-indexing would have required too much time," says McKinney. With SAS® Enterprise Content Categorization, re-indexing was done in hours. In the past, efforts to manually re-index even a fragment of the collection would involve bringing in additional labor for months on end. "Now, once all the rules are written, if you want to make an adjustment we can just tweak a rule, test it, look at the sample data and – if we're happy with it – publish it," explains Pomroy.
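
The tweak, test and review cycle Pomroy describes can be illustrated with a minimal sketch. This is not SAS rule syntax and not SRA's implementation: the Rule class, the regular-expression patterns, the category names and the sample documents are all hypothetical, chosen only to show how adjusting one rule and re-checking it against labeled sample data might look.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical categorization rule: a category plus a regex pattern."""
    category: str
    pattern: str

    def matches(self, text: str) -> bool:
        return re.search(self.pattern, text, flags=re.IGNORECASE) is not None


def categorize(text: str, rules: list[Rule]) -> set[str]:
    """Return every category whose rule matches the document text."""
    return {rule.category for rule in rules if rule.matches(text)}


def accuracy(samples: list[tuple[str, set[str]]], rules: list[Rule]) -> float:
    """Fraction of samples whose predicted categories equal the expected ones."""
    correct = sum(
        1 for text, expected in samples if categorize(text, rules) == expected
    )
    return correct / len(samples)


# Invented sample data, for illustration only.
SAMPLES = [
    ("Flood response planning for coastal towns", {"disaster-response"}),
    ("Annual report on vaccination programs", {"health"}),
    ("Hurricane evacuation routes and shelters", {"disaster-response"}),
    ("Bridge maintenance budget", set()),
]

# First draft of the rules: misses hurricane-related documents.
draft_rules = [
    Rule("disaster-response", r"\bflood\b"),
    Rule("health", r"\bvaccin\w*"),
]

# Tweak one rule to also cover hurricanes and evacuations, then re-test.
tweaked_rules = [
    Rule("disaster-response", r"\b(flood|hurricane|evacuation)\b"),
    Rule("health", r"\bvaccin\w*"),
]

before = accuracy(SAMPLES, draft_rules)
after = accuracy(SAMPLES, tweaked_rules)
print(f"before tweak: {before:.2f}, after tweak: {after:.2f}")
```

Because the rules are plain data, a single change applies to every document on the next run, which is the property that makes re-indexing a large collection a matter of re-running the rules rather than re-reading each document.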

Some library scientists are hesitant to use tools to automatically categorize, tag or create metadata. They are, rightfully, suspicious of the accuracy. With this project, the accuracy rate of categorization is 90 percent, versus a historic average of 75 percent when human indexers did the work. Although developing categorization rules requires a significant investment, once the rules are created the tool applies them with greater consistency than human indexers. "Essentially, the machine is categorizing to a greater degree of accuracy than the human beings were," Pomroy says. This has helped SRA earn the support of the organization's Library Director.

The project won't render indexers obsolete. Pomroy notes that people who hold that role can be freed to work as analysts to improve "rules" and help keep the system running efficiently. "We'd much rather have analysts writing rules that affect thousands of documents than indexers reading one document at a time."

The project has even led to a pleasant discovery. The library had digitized documents from the 1800s, but the old technology could not find them: the documents were there but not searchable. "Now you can go and see documents that, basically, people didn't know existed," Pomroy says.

"From my perspective, the tool has been everything that we expected it to be," says Pomroy. "We hit our target, and we continue to implement new functionality."

Challenge

Extract, analyze, categorize, index and apply metadata to a government organization's library with more than 100 years' worth of reports.

Solution

SAS® Text Analytics

Benefits

  • Re-indexed collection in hours rather than months or years.
  • Increased categorization accuracy from 75 percent to 90 percent.

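The results presented in this article depend on the specific scenarios, business models, data inputs and computing environments described in it. Each SAS customer's experience differs according to its business and technical characteristics, so the statements in this article should not be treated as generally applicable. Actual cost savings, results and performance ultimately depend on each customer's configuration and conditions. SAS does not guarantee that every customer will achieve results similar to those described here. SAS provides warranties only for SAS products and services; please refer to the warranty terms for SAS products and services. Nothing in this article should be construed as a warranty. Customers may share success stories of SAS software implementation projects under contractually agreed terms. Brand and product names belong to their respective companies.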