Rule 1: Text data is messy
Because text is free-form and shaped by each writer's style, it is riddled with misspellings, acronyms, clipped text (e.g. ttfn), emoticons and more. Data pre-processing is required. As with any analytic project, the results depend on the quality of the input, so the data must be explored, modified and cleansed to best support the analysis.
If you are looking to identify consistent patterns across a document collection, differences in the input text need to be addressed (by creating synonyms, for example) so that you can find insights across the collection that would not surface in any one report read in isolation. On the other hand, it may be those very differences you want to detect, as is often the case when examining claims for suspicious activity, abuse and fraud, where catching them can save millions of dollars.
So there's some initial work with the data. The resulting models, however, are embedded into operational systems, automating this pre-processing and saving substantial manual effort.
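As a rough illustration of what that automated pre-processing might look like, here is a minimal Python sketch. The acronym and synonym tables and the emoticon pattern are hypothetical stand-ins; in practice they would be built from your own document collection.

```python
import re

# Hypothetical lookup tables -- in practice, built from your own data.
ACRONYMS = {"ttfn": "ta ta for now", "fyi": "for your information"}
SYNONYMS = {"vehicle": "car", "automobile": "car"}

def normalize(text):
    """Lowercase, strip emoticons and punctuation, expand acronyms, map synonyms."""
    text = text.lower()
    text = re.sub(r"[:;]-?[)(dp]", " ", text)      # drop simple emoticons like :)
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # drop remaining punctuation
    tokens = text.split()
    tokens = [ACRONYMS.get(t, t) for t in tokens]  # expand clipped text
    tokens = [SYNONYMS.get(t, t) for t in tokens]  # collapse synonyms
    return " ".join(tokens)

print(normalize("Great automobile, FYI the dealer was slow :)"))
# -> great car for your information the dealer was slow
```

Once a pipeline like this is validated, it runs on every incoming document with no manual effort, which is the payoff the paragraph above describes.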
Rule 2: Text models change over time
Evaluating customer reviews may have sparked a new social media branding campaign, or spotting potential product issues before customers even noticed may have let you address emerging problems pro-actively.
Business analytics solves problems – and once solved, behavior changes. Your customers are now more satisfied and hopefully, their perceptions of your company have improved. The way you operate may have changed too. This means the content or words found in tweets, blogs and other forums may also change, by design.
And so must the models. New concepts emerge, neologisms develop, more involved analysis is done, and new sources of input are included. This is the benefit of text analytics: it continues to improve the business over time, refining your insights every step of the way.
You’ll need an open management environment to test and validate models – one that permits users to override, modify and extend models, rules and taxonomies. It will need to be managed and controlled with explicit administration rights and audit capabilities within the system. As you begin to deal with more languages, acquire a new company, extend to new research areas and augment your document management systems, scalable, flexible technology becomes critical.
Realize that your needs and insights will demand multiple ways to examine text in order to improve different aspects of your organizational operations over time: there's no one-size-fits-all.
Rule 3: Collect metrics from model implementation
Text models and rules need to be tested and validated to ensure that the input data has not changed significantly enough to warrant new modeling coefficients, and to confirm that classifications are still within acceptable standards. Even more important from a business standpoint, you'll need to measure the improvements gained.
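One simple way to check whether the input data has drifted is to track how much of the incoming vocabulary the model has never seen. The sketch below is a crude, assumed heuristic (the 20% threshold is illustrative, not a standard), written in plain Python:

```python
from collections import Counter

def vocab_shift(baseline_tokens, incoming_tokens):
    """Fraction of incoming token occurrences absent from the baseline vocabulary.

    A crude drift signal: if it climbs past a chosen threshold, the input
    data may have changed enough to warrant re-validating the model.
    """
    baseline_vocab = set(baseline_tokens)
    incoming = Counter(incoming_tokens)
    unseen = sum(n for tok, n in incoming.items() if tok not in baseline_vocab)
    return unseen / max(sum(incoming.values()), 1)

baseline = "the claim was filed and the payment approved".split()
incoming = "the claim was filed via the mobile app".split()

shift = vocab_shift(baseline, incoming)
if shift > 0.2:   # illustrative threshold -- tune against your own data
    print(f"Possible input drift: {shift:.0%} of tokens unseen at training time")
```

In production you would compute this on a rolling window of incoming documents and alert when the signal crosses the threshold, alongside standard classification-quality checks.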
Ongoing monitoring will inform any new actions that are required. Improvements and efficiencies gained through automated document processing can be measured in hard numbers. For example, resource cost reduction, dollars saved, decreases in the number of complaints, product defect reductions and time spent searching for relevant materials can all be quantified.
Read the full blog post: Business Analytics 101: Text Analytics