Building a more searchable library
SRA re-indexed its collection faster, more accurately — and discovered new documents along the way
When a government agency needed a better way to index, categorize and generate metadata for documents created over the last century, it chose SRA International to manage the project, and SRA used SAS® Text Analytics. With SAS®, categorizing is more accurate – and faster. The SAS® Content Categorization Studio provided the agency with the capability to re-index hundreds of thousands of documents in its public-facing collection in hours rather than months. And new documents can now be indexed in seconds rather than minutes or hours.
SRA is a leading provider of technology and strategic consulting services and solutions to government organizations and commercial clients. It works in areas such as global health, disaster response planning, and infrastructure and managed services. When one of its customers asked for assistance creating a corporate taxonomy and automating it, SRA used SAS Text Analytics for the project.
"SAS allows you to work under the hood. It's not a black box," explains Bill McKinney, the taxonomy manager for the project. "The client had developed an extensive thesaurus over many years and wanted to build a semantic foundation that could be leveraged by the text analytics tool to generate persistent metadata. Essentially, what we did was analyze the semantic problem and devise a machine solution that resembles a human solution."
The client had some unique needs. It creates hundreds of reports a year and has a digitized library collection that dates back more than 100 years. Because the way information is organized has changed over that time, the collection was extremely challenging to index manually. Making even relatively small changes to global metadata by hand was beyond the agency's available resources. Without updated metadata, it was hard for people to search the database.
"The client had a findability problem and an information organization problem," McKinney says. "And the organization wanted to develop a taxonomy to make content easier to find and make the process of generating metadata more efficient (meta-tagging). For the client, creating a taxonomy and applying taxonomy metadata manually to the document collections was too difficult, perhaps impossible. So we needed a tool to automatically categorize and programmatically apply metadata to content."
SRA also needed a product that could be integrated into a content management or search platform. "We wanted the flexibility of having a tool that was standalone," McKinney says. Adds Troy Pomroy, the project's technical lead: "We needed something that we could integrate into other systems based on the organization's architecture."
Earning Quick Results
The organization had wanted to re-index its entire collection of published reports for more than five years. "But manual re-indexing would have required too much time," says McKinney. With SAS® Enterprise Content Categorization, re-indexing was done in hours. In the past, efforts to manually re-index even a fragment of the collection would involve bringing in additional labor for months on end. "Now, once all the rules are written, if you want to make an adjustment we can just tweak a rule, test it, look at the sample data and – if we're happy with it – publish it," explains Pomroy.
Some library scientists are hesitant to use tools to automatically categorize, tag or create metadata. They are, rightfully, suspicious of the accuracy. On this project, the categorization accuracy rate is 90 percent, versus a historic average of 75 percent when human indexers did the work. Developing categorization rules requires a significant investment, but once the rules are created the tool applies them with greater consistency than human indexers. "Essentially, the machine is categorizing to a greater degree of accuracy than the human beings were," Pomroy says. This has helped SRA earn the support of the organization's library director.
The project won't render indexers obsolete. Pomroy notes that people who hold that role can be freed to work as analysts to improve "rules" and help keep the system running efficiently. "We'd much rather have analysts writing rules that affect thousands of documents than indexers reading one document at a time."
The project has even led to a pleasant discovery. The library had digitized documents from the 1800s, but the old technology could not find them: the documents were there but not searchable. "Now you can go and see documents that, basically, people didn't know existed," Pomroy says.
"From my perspective, the tool has been everything that we expected it to be," says Pomroy. "We hit our target, and we continue to implement new functionality."
Extract, analyze, categorize, index and apply metadata to a government organization's library with more than 100 years' worth of reports.
- Re-indexed collection in hours rather than months or years.
- Increased categorization accuracy from a historic average of 75 percent to 90 percent.