Abstracts
This page is updated weekly so be sure to check back often for the latest information.
Data augmentation by predicting spending pleasure using commercially-available external data
Philippe Baecke, Ghent University
Since customer relationship management (CRM) plays an increasingly important role in a company's marketing strategy, the database of the company can be considered a valuable asset to compete with others. Consequently, companies constantly try to augment their database through their own data collection, but also through the acquisition of commercially available external data. Until now, little research has been done on the usefulness of these commercially available external databases for CRM. This study presents a methodology for such external data vendors based on random forests predictive modeling techniques to create commercial variables that solve the shortcomings of a classic transactional database. Eventually, we predicted spending pleasure variables, a composed measure of purchase behavior and attitude, in 26 product categories for more than 3 million respondents. Enhancing a company's transactional database with these variables can significantly improve the predictive performance of existing CRM models. This has been demonstrated in a case study with a magazine publisher for which prospects needed to be identified for new customer acquisition.
Data mining in the financial services industry: change we need
Bart Baesens, Katholieke Universiteit Leuven (Belgium) & University of Southampton (United Kingdom)
In this talk, we will elaborate on some recent challenges that have emerged when applying data mining within the financial services industry for applications such as credit risk management, fraud detection, and anti-money laundering. We will first discuss issues related to data quality and master data management. Next, we discuss some important technical data mining challenges such as model interpretability, model compliance, and learning using networked data. This will be followed by an overview of model monitoring, model back testing, and model benchmarking. Finally, we also discuss how to incorporate macro-economic effects into data mining models by means of stress testing procedures. Throughout the talk, key recommendations and clear guidelines will be provided with respect to the challenges mentioned. Ideas presented in the talk can also have relevance in other data mining application fields, e.g. marketing, medicine, pharmaceuticals, and government.
Evaluation of Treatment Effect in Subgroups Generated by Survival Trees
Chakib Battioui, Eli Lilly (Co-Authors Eric Su, Ilya Lipkovich, Mathew Rotelli)
A new data mining method has been proposed to identify subgroups with respect to time to primary study outcome using available baseline covariates in survival data arising from two treatment arms of a clinical trial. The method is based on building a survival regression tree for each treatment group, with the response defined as the estimated cumulative hazard, and then constructing the Kaplan-Meier estimator stratified by treatment and computing the log-rank test for treatment difference in each terminal node. A bias-adjusted estimate of the treatment effect in subgroups, obtained by data re-sampling, is proposed and evaluated by simulations. The approach was applied to one of the failed clinical trials with 1115 patients randomized to treatments A (N=558) and B (N=557). Significant treatment differences were estimated in subjects within specific cut-offs of age and duration of disease by first building a survival tree within treatment group A and selecting the node with non-responders.
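As a rough illustration of one ingredient named above, the Kaplan-Meier estimator computed separately for each treatment arm, here is a minimal sketch in plain Python. It is not the authors' implementation: the survival tree, the log-rank test, and the bias adjustment are omitted, and the function name `kaplan_meier` and the toy follow-up data are assumptions made for this example.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time for each subject
    events -- 1 if the outcome was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)      # every subject leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # multiply by the conditional probability of surviving past t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]     # remove events and censored subjects alike
    return curve

# Hypothetical toy data for two arms inside one terminal node (not trial data).
arm_a = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
arm_b = kaplan_meier([1, 2, 4, 4, 6], [1, 1, 1, 0, 1])

for label, curve in (("A", arm_a), ("B", arm_b)):
    print(label, [(t, round(s, 3)) for t, s in curve])
```

Comparing the two curves within a node is what the stratified analysis does visually; a formal comparison would still need the log-rank test described in the abstract.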
Don't Bore Your Expert: How to Interactively Learn Classifiers from Unlabeled Data
Michael Berthold, University of Konstanz
Building classifiers based on unlabeled data becomes increasingly important as data repositories grow and the human ability to label at least subsets of this data remains constant. Especially in the life sciences, such classification problems are becoming more important every day. We will focus on one example in this talk which stems from high-throughput microscope screening cameras. These devices are able to produce hundreds of thousands of images per day. Since they generate different types of pictures for each assay, a new classifier is needed every time. Asking the user to label some dedicated images is possible, but the number of such queries needs to be kept to an absolute minimum, preferably no more than one hundred queries for an assay consisting of several million images.
In this talk we discuss an adaptive active classification scheme which establishes ties between the two opposing concepts of unsupervised clustering of the underlying data and the supervised task of classification. The learning scheme allows for an initial clustering of large datasets, then queries a few selected examples and subsequently adjusts the classification boundaries based on a small number of additional labels. The learner hence asks the human expert for labels of the most informative examples throughout the learning process and therefore keeps the costs for supervision at a manageable level. We discuss results on real world applications and show how this approach also points out artefacts and other interesting outliers in the data.
High Performance Analytics with In-Database Processing
Stephen Brobst, Teradata
This workshop will show how to use in-database processing techniques that are available with SAS 9.1.3 (and above) for deep analytics with faster turnaround and less storage space for developing analytic applications. We will focus primarily on the use of these techniques with Base SAS and Enterprise Miner, with some discussion of deployment within SAS applications as well. We will provide best practices for execution of a phased deployment strategy of these capabilities and benchmark results for performance comparison with traditional implementations.
How to Translate Data Mining Outputs into Measurable Uplifts in ROI
Mark Carmichael, Eclipse International
For the most part, direct marketers today
- repeat old direct marketing techniques instead of testing new technologies. Typically marketers use the same direct marketing channels year after year instead of regularly evaluating and reallocating budget to improve return. More than 80% of marketers struggle to respond to marketing data in order to improve results.
- hand off the planning process. Many marketers offload marketing planning to an agency and rarely question the agency's recommendations.
- focus on reach, rather than on behaviour and intent. Today, most direct marketing planning tools identify which channels will deliver the greatest reach based on historical data. But this does not help a marketer influence customer behaviour or predict a propensity to take future action.
How much to spend on which media is an age-old question. So what's different now? In a word: technology.
Technology forces are changing direct marketing in three important ways - all of which demand that marketers re-engineer the way they spend their money. Technology
- changes consumer behavior dramatically. Consumers are no longer a mass-market reachable through traditional media. Because of technology, consumers now go online for information, multitask across multiple channels, rely on peers and personal research for product information, and expect immediate, customized responses to their needs.
- alters the purchase process. The shift in consumer information consumption has also changed the impact of traditional marketing tactics - like broadcast media and print - on purchase decisions. According to DoubleClick's Touchpoints II study, 64% of consumers feel the Internet has altered their purchase process from five years ago. In fact, today, Web sites have more influence over purchases than any other media, and consumers are more self-reliant and more willing to buy even high-ticket, high-consideration products online.
- plays an increasing role in marketing processes. Technology helps marketers get smarter about customers, deliver more relevant messages, and measure impact of their campaigns. Marketing automation, data warehouses, Web analytics, personalised URLs, email, SMS, CRM software, and contact management systems are all examples of technologies playing increasing roles in the marketing process. They enable marketers to shift from mass branding to an approach that better suits unpredictable customer behaviors: cross-channel integration of customer-facing channels.
Data analysis is only as good as the approach a marketer takes to acting on it. Outdated planning processes and technology realities require a new, more scientific way to plan that focuses on customer behavior and objectively considers all direct marketing resources. Marketers must facilitate, streamline, customize and track communications better and more effectively. By enriching the information flow (by making it more relevant, timely and meaningful), the marketer also improves the quality of the communication itself.
By embracing a mix of emerging digital technologies, sophisticated analysis techniques and continual testing of changing creative and campaign variables, marketers find the optimal approach to deliver content and information to a customer in a manner far more efficient and effective than traditional direct marketing approaches.
Develop an approach to utilize data analysis outputs to
- understand customers better
- communicate with them more effectively
- measure effectiveness
- determine (through testing) which permutation of variables yields highest ROI
Panel Discussion: Teaching Data Mining on Campus and Online: Opportunities and Challenges
Session Moderator: Dr. Goutam Chakraborty, Professor (Marketing), Oklahoma State University
Panel Members:
- Dr. Goutam Chakraborty, Oklahoma State University
- Dr. Carl Lee, Central Michigan University
- Dr. Ronald Klimberg, Saint Joseph's University
- Dr. Mike Speed, Texas A&M University
- Dr. Tom Bohannon, SAS and Texas A&M University
Customer Dynamics & Data Fusion: Revealing the Evolution of Customer Relationships
Gary Class, Wells Fargo & Company
Customer Dynamics explores how customers' behavior changes over time and the impact of those changes on the customers' relationship with the Firm.
Critical to the development of predictive models of Customer Dynamics is the acquisition and alignment of disparate data regarding the customers' interactions with the firm. These include both Behavioral data (direct observations of customers' behavior) and Attitudinal data (indirect assessment of customers' motivations and beliefs via survey). Data Fusion is a methodology to connect and calibrate these disparate data to understand the "why" behind the "what".
The scope of customer interaction data can be expanded beyond the traditional sources of "Structured" transactional data (from accounting and operational systems) to include "Semi-structured" data (e.g. middleware messages & server logs) and "Unstructured" data (inbound emails, agent notes in CRM systems, etc.).
Capitalizing upon the information assets resulting from Data Fusion, the firm can observe how customer portfolios evolve over time and identify the drivers of important Customer Dynamics such as customer retention & revenue growth.
Ultimately, these insights can provide a rich Decision Support tool-set that can enable business strategies and tactics that are information-based and highly scalable.
A Method of Segmenting Customers Based on their Purchase Transaction Patterns
Randy Collica, Hewlett-Packard Co.
Segmenting customers based on their behavior or their attitudinal attributes is the mainstream of customer segmentation. Until recently, however, segmenting customers based on the pattern of their transactions has only been a nice idea. SAS/ETS™ has introduced a brand new procedure in SAS 9.2 called Proc Similarity™. This procedure allows one to measure the distance between an input sequence and a target sequence while taking into account the ordering (pattern) of the sequence. Similarity metrics can be computed while preparing the time series data, and these metrics can then be scaled or transformed and used with other data mining tasks to complete a segmentation analysis.
This presentation shows how to take typical customer purchase transactions and, using the Similarity procedure and other firmographic data, build a customer segmentation based on the purchase transactions of those customers. This presentation uses SAS Enterprise Miner™ and SAS Enterprise Guide™ as tools to accomplish this analysis technique. Final comments about the business applicability
Enterprise Value Optimization through Analytics - Innovate and Iterate Everywhere
Martin Ellingsworth, ISO Innovative Analytics & Cheryl Doninger, SAS
Don't be a solution looking for a problem -- work on changes that executives know are important until you make serious headway and don't stop until you are adding up the results of how much impact you are having on shareholder value, and then do it again and again. Continuous improvement on things that matter is the only way to stay relevant as an analyst or analytical organization (internal or external). Better analytics, better data, better decision support describes the virtuous cycle of a best in class business model.
Problem-solution innovation with analytics is the focus of this talk and several examples will be cited. Creating an infrastructure to back up this promise along with a portfolio of cycle innovations in Insurance, Healthcare, and Mortgage industries will be discussed.
The High ROI of Data Mining for Innovative Organizations
John Elder, Elder Research, Inc.
Data mining can enhance your bottom line in three basic ways: by streamlining a process, eliminating bad cases, or highlighting the good. In rare situations, a fourth way -- creating something new -- is possible. But modern organizations are so effective at their core tasks that data mining usually results in an iterative, rather than transformative, improvement. Still, the impact can be dramatic.
Dr. Elder will share the story (problem, solution, and effect) of nine projects conducted over the last decade for some of America's most innovative agencies and corporations:
Streamline:
- Cross-selling for HSBC
- Image recognition for Anheuser-Busch
- Biometric identification for Lumidigm (for Disney)
- Optimal decisioning for Peregrine Systems (now part of Hewlett-Packard)
- Quick decisions for the Social Security Administration
- Tax fraud detection for the IRS
- Warranty Fraud detection for Hewlett-Packard
- Sector trading for WestWind Foundation
- Drug efficacy discovery for Pharmacia & UpJohn (now part of Pfizer)
Understanding Latent Semantics in Textual Data
Nick Evangelopoulos, University of North Texas & Terry Woodfield, SAS
Summarization and visualization of textual data typically involves Singular Value Decomposition (SVD), a matrix operation capable of reducing the dimensionality of the term-document structure by introducing a small number of latent semantic dimensions. Traditional textual analysis approaches utilize the SVD dimensions as attributes for clustering and other visualization techniques, or as inputs in predictive models, without making any effort to interpret them. This presentation will show how the SVD dimensions can be transformed into dimensions that can be easily labeled. The demystification of the SVD dimensions opens the way to a number of text mining possibilities. On the data exploration front, these include the summarization of collections of documents. On the predictive analytics front, they include the incorporation of meaningful inputs into interpretable models, such as Decision Trees. Several applications of this technique, including classification of news stories into meaningful topics, extraction of meaningful issues from customer feedback comments, and analysis of open-ended survey questions, will be discussed. The presentation will conclude with an illustration of topic extraction from a collection of textual data.
Financial Data Mining with Algorithmic Trading
Robert Golan, DBmind Technologies, Inc.
Algorithmic Trading has changed the way Traders trade and Trade Support supports. A Brave New World is emerging as "hands on" Trading evolves into "hands off" Algo Trading. Not all trades need to be made with ultra-low latency. Future trading will rely on a broader set of data, which will be mined for relevance. An important series of XBRL Financial Reporting events is happening throughout the world, and especially in the USA. A critical mass of financial data will be ready for mining, which will be a boon for transparent "low touch" fundamental-style algorithmic trading. Also, "low touch" trading such as program trading & direct market access (DMA) will evolve into advanced Algo Trading strategies. Stock and economic indicators combined with XBRL will add value for Algo Trading. This is about a well-thought-out, strategic, high-latency trading strategy, with data mining discovering the governing rules while adding the expert rules with validation. Yes, the trader is still the key to making this all happen. Both fundamental and technical trading rules need to be combined with the expert rules, the data mined rules, and most importantly the regulatory environment rules. RegNMS in the USA and MiFID in Europe have indirectly helped the adoption of electronic trading, and it is important to integrate the GRC-related rules in an agile way. Agility is the key, and thus the rules need to be placed into a rules engine and managed by the experts for proper compliance, risk management, and governance. XBRL data from Japan, China, and the Netherlands is ready to be mined with Algo Trading now. An XBRL US survey indicates that at least 340 of the estimated 500 public companies that the SEC requires to begin filing in XBRL format in June 2009 have already converted their financial statements into XBRL. XBRL US is the non-profit XML standard setter that developed and maintains the US GAAP taxonomy used by filers to comply with the SEC mandate. Almost $7 trillion in market capitalization will be represented by this XBRL financial data, which is over 50% of the total market cap for all publicly traded companies reporting to the SEC. As this XBRL Financial Data ripens, a wonderful harvest awaits us data miners, one that will enhance the current Algo Trading strategies which use this data.
Advanced Analytics on Multi-Terabyte Datasets
David Shamlin, SAS Institute & Peter Pawlowski, Aster Data
Are you struggling to run timely analysis on large data sets? Are you being forced to simplify models or work with summarized data sets instead of being able to run large-scale analytics on very big data sets that encompass terabytes of historical and fresh data? Are you trying to run a large volume of complex queries concurrently? You’ve heard of MapReduce; are you wondering what’s behind all the MapReduce buzz?
Come to this session with David Shamlin, Sr. R&D Director for SAS Institute, and Peter Pawlowski, MTS for Data Mining at Aster Data, to learn about Aster’s Massively Parallel Data Warehouse and In-Database MapReduce and how this works with SAS solutions. We will look at how the tight coupling of SQL and MapReduce provided by Aster Data creates new ‘big data’ analytics opportunities when combined with SAS.
Impact of an Analytics Program on an Emergent Economy: The Mexican Case
Viterbo H. Berberena González & Guillermo Hijar, Universidad Anahuac
The evolution of the program in analytics offered by Anahuac University since January 2005 is presented over a time frame beginning with the creation of executive courses and culminating with a Master's Degree in Analytical Intelligence. The various difficulties encountered in the process and their solutions will be emphasized, in particular from the standpoint of market needs and the receptiveness of the entrepreneurial community to the analytical approach as a competitive advantage.
The program in analytics, beyond its intrinsic value, has served as a methodological frame to reengineer and improve other graduate programs in the Graduate School of Engineering at Anahuac University.
Presentation Structure
- Background: Description of the analytics level at the beginning of year 2005
- The creation of a diploma degree course in Data Mining
- The academic program design
- Support from SAS Institute (materials)
- Instructors' training
- Data mining course start-up (159 hours of instruction)
- High demand vs. lack of instructors
- Students' profile
- First complaints and appraisals
- Revision and improvement of the academic program (reduction to 120 hours of instruction)
- A boom in demand for a formal graduate course in Analytic Intelligence
- Design and development of a Master's Degree in Analytic Intelligence
- Research among Mexican Universities
- Brainstorming with the Director of the Institute of Advanced Analytics and a visit to North Carolina State University
- Brainstorming with the Director of SAS Global Academic Program and a visit to SAS Institute HQ's
Hidden Decision Trees to Design Predictive Scores – Application to Fraud Detection
Vincent Granville, AnalyticBridge
Hidden Decision Trees (HDT) is a new data mining technique to handle large volumes of data with nonlinear structures and strongly correlated independent variables. It was recently used to detect large botnets with a rule-driven fraud detection engine. Illustrations will be provided in this context, particularly in connection with scoring Internet transactions.
The technique is easy to implement in any programming language. It is more robust than decision trees or logistic regression, and helps detect natural final nodes and create new types of segmentations. No decision tree is actually built, but the final output of a hidden decision tree procedure consists of a few hundred nodes from multiple non-overlapping small decision trees. Interpretation is simple, each node corresponding to a particular type of fraud. Typically, HDTs are part of a hybrid scoring algorithm, where less than 20% of the transactions are scored using an alternate classifier, usually a simplified logistic regression or naive Bayes. We will also discuss more sophisticated versions of HDTs that can be used when a large number of non-binary rules are present.
Why Our Models Fail - Sometimes It's in the EDA
Dudley Gwaltney, SunTrust Bank
Everything about the model looks good: the goodness of fit, the lift, the validation results. But when implemented, the model fails. Why? The answer might lie not in the final model, but in the Exploratory Data Analysis (EDA).
As modelers, sometimes we are so anxious to try out the latest modeling techniques that we forget that the EDA is the most important step in the process. The presentation will review how a thorough EDA can lead to a successful modeling project and an incomplete EDA, a failed project.
The presentation will include:
- Is the target rate significant enough to support the project?
- Is the population size large enough?
- The importance of understanding missing values
- What are the data sources? Is the data consistent over time, are updates provided, etc.?
- How will the model be implemented?
If the modeler does their due diligence during the EDA, the chances of success are much greater. If not, the chance of success is left to chance.
Impact on Customer Churn of Intensity of Sharing of User-generated Content at an Online Service
Zainab Jamal, HP Labs
Customers continue to increasingly interact online - sharing photos, videos, songs and other user-generated content. Research on social networks has ignored to a large extent the existence of social networks within the customer base of online services that have sharing functionalities. We believe the availability of such sharing functionalities increases the stickiness of these online services as it allows their customers to easily connect and share with their social network. In the absence of such functionalities, customers are more likely to move to competitive services that allow them to not only store and manage their digital content but also to share it with their social network.
In this study, we evaluate the value of sharing by establishing a link between the sharing activity of a customer as captured by RFM metrics adapted to sharing behavior and key performance metrics like customer churn and CLV.
We use a conditional hazard model for multiple events with heterogeneity to estimate the impact of sharing activity on the probability of customer churn. We also include other factors that may impact the churn rate like recency, frequency and amount of purchases, amount of digital content uploaded, number of other customers who viewed the content as well as number of calls to the customer service center. We show that intensity of sharing has a negative impact on customer churn. Thus, the more sharing the customers engage in the more likely they are to stay with the online service.
Optimal Design of Mailing Campaigns - Insights from a Series of Studies
Manfred Krafft, University of Münster
Textbooks and practitioner-oriented publications on direct marketing contain manifold - and often conflicting - recommendations for designing successful direct mail pieces. Academic research has tried to shed more light on the fuzzy issue of successful direct mail design and to provide answers to the following questions:
- What drives consumers to open direct mail pieces?
- What induces them to respond to a direct mail offering?
- various envelope characteristics on opening behavior, and
- design characteristics of the envelope content (i.e., the letter, brochure, and response device) on consumers' sustainable attention rate (SAR) of direct mail pieces.
ApproxMAP: Intelligent Sequential Pattern Mining via Alignment
Hye-Chung Kum, University of North Carolina
The goal of sequential pattern mining is to detect patterns in a database comprised of sequences of itemsets. For example, retail stores often collect customer purchase records in sequence databases in which a sequential pattern indicates a customer's buying habit. In such a database, each purchase is represented as an itemset, that is, a set of items purchased together, and a customer sequence is a sequence of such itemsets.
Sequential pattern mining is commonly defined as finding the complete set of frequent subsequences in such a database. However, the sheer volume of the results in the traditional support based frequent sequential pattern mining methods has led to increasing interest in new intelligent mining methods to find more meaningful and compact results. One such approach is the consensus sequential pattern mining method based on sequence alignment. Consensus sequential patterns can detect general trends in a group of similar sequences, and may be more useful in finding non-trivial and interesting long patterns. It can be used to detect general trends in the sequential database for natural customer groups, which is more useful than finding all frequent subsequences in the database. Formally, consensus sequential patterns are patterns shared by many sequences in the database but not necessarily exactly contained in any one of them.
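To make the traditional, support-based definition concrete, the following minimal Python sketch checks whether a pattern is a subsequence of a customer sequence and computes its support over a small database. It illustrates only the conventional frequent-subsequence notion, not ApproxMAP or its alignment step; the function names and the toy purchase data are assumptions made for this example.

```python
def contains(sequence, pattern):
    """True if `pattern` (a list of itemsets) is a subsequence of `sequence`.

    Each pattern itemset must be a subset of some itemset in the sequence,
    and the matched itemsets must appear in the same order.
    """
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1            # greedy earliest match is sufficient
    return pos == len(pattern)

def support(database, pattern):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database) / len(database)

# Hypothetical purchase histories: one list of itemsets per customer.
db = [
    [{"bread", "milk"}, {"eggs"}, {"beer"}],
    [{"bread"}, {"eggs", "jam"}],
    [{"milk"}, {"beer"}],
    [{"bread", "jam"}, {"milk"}, {"eggs"}],
]
pattern = [{"bread"}, {"eggs"}]

print(support(db, pattern))
```

A pattern is called frequent when its support meets a chosen threshold; enumerating every such pattern is what produces the large result sets mentioned above.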
In this talk, we will describe ApproxMAP (APPROXimate Multiple Alignment Pattern mining), an effective algorithm for consensus sequential pattern mining. It has been successfully applied to many areas such as multi-database mining, temporal streaming data mining, and policy analysis. Furthermore, we will present a detailed comparison study of the alignment-based methods and support-based methods. Many of you will be familiar with the most commonly available support-based sequential pattern mining. The comparison study will illustrate how well ApproxMAP works in comparison.
Marketing Optimization for Increased Revenues
Choudur K. Lakshminarayan, HP Laboratories
We present a systematic and reliable way of assessing marketing effectiveness and optimally allocating investments among marketing instruments that HP employs to reach our customers based on a fixed budget. As marketing budgets are planned, the methods and tools we developed assist marketing managers to assess various allocation scenarios and determine the best investment strategy to maximize revenue. This program is unique in that it innovatively integrates econometrics, expert assessments of business conditions and mathematical optimization to build a unified framework for revenue/profit maximization based on a standardized platform. The methodology is flexible for adoption in other areas where optimization is involved.
Net Lift Prediction Models: How to Maximize Marketing Impact and What Data Miners Can Learn from Presidential Campaigns
Kim Larsen, Charles Schwab & Co
The true effectiveness of a marketing campaign is measured by quantifying the incremental impact, that is, the additional revenue directly attributable to the campaign.
Measuring incremental impact is typically done by creating a random control group that will not receive the offer. If the clients contacted by the campaign (a.k.a. the "test group") have better post-campaign performance than the control group, the campaign has been effective and further rollouts can be considered.
The problem is that targeting strategies are often not designed to maximize the incremental impact. Typical targeting models yield impressive conversion rates for the test group, but too often the results are equally impressive for the control group. In such cases, the incremental impact is insignificant and marketing dollars could have been spent elsewhere.
The purpose of this talk is to demonstrate how to build Net Lift Models (also referred to as Uplift Models) that optimize the incremental impact of marketing campaigns by maximizing the difference in conversion rates between the test and control groups. We will consider three different ways to build such models in SAS and discuss the pros and cons of each method.
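The quantity being optimized, the difference in conversion rates between test and control groups, can be illustrated with a few lines of Python. This is only a sketch of the measurement on made-up data, not one of the three SAS modeling approaches the talk covers; the segment names and outcomes below are hypothetical.

```python
def conversion_rate(outcomes):
    """Share of clients who converted (outcomes are 0/1 flags)."""
    return sum(outcomes) / len(outcomes)

def net_lift(test_outcomes, control_outcomes):
    """Incremental impact: test conversion rate minus control conversion rate."""
    return conversion_rate(test_outcomes) - conversion_rate(control_outcomes)

# Hypothetical segments: (outcomes for contacted clients, outcomes for random control).
segments = {
    "high_score": ([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]),
    "mid_score":  ([1, 0, 1, 0, 0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]),
}

lifts = {name: net_lift(test, control) for name, (test, control) in segments.items()}
best = max(lifts, key=lifts.get)

for name, lift in lifts.items():
    print(f"{name}: net lift = {lift:.2f}")
print("segment with the largest incremental impact:", best)
```

In this toy data the high-scoring segment converts well with or without the offer, so its net lift is zero, while the mid-scoring segment shows the larger incremental effect. That is the situation the abstract warns about when targeting is driven by conversion rate alone.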
Mining Transactional and Time Series Data
Michael Leonard & Meredith John, SAS
Web sites and transactional databases collect large amounts of time-stamped data related to their suppliers and/or customers over time. Mining this time-stamped data can help business leaders make better decisions by listening to their suppliers or customers via their transactions collected over time. A business can have many suppliers and/or customers and may have a set of transactions associated with each one. However, the size of each set of transactions may be quite large making it difficult to perform many traditional data mining tasks. This paper proposes techniques for large-scale reduction of time-stamped data using time series analysis, seasonal decomposition, and automatic time series model selection. After data reduction, traditional data mining techniques can then be applied to the reduced data along with other profile data. This paper demonstrates these techniques using SAS/ETS and Enterprise Miner.
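A common first step before any time series analysis of transactional data is to accumulate the irregular, time-stamped records into a fixed-interval series per customer and then summarize each series with a few features. The Python sketch below shows that step only, under assumed names and toy data; it does not reproduce the seasonal decomposition or automatic model selection the paper demonstrates with SAS/ETS.

```python
from collections import defaultdict
from datetime import date

def accumulate_monthly(transactions):
    """Turn (customer, date, amount) records into monthly totals per customer."""
    series = defaultdict(lambda: defaultdict(float))
    for customer, day, amount in transactions:
        series[customer][(day.year, day.month)] += amount
    # sort each customer's months chronologically
    return {c: dict(sorted(months.items())) for c, months in series.items()}

def summarize(monthly):
    """Reduce each monthly series to a small set of profile features."""
    return {
        c: {"active_months": len(m), "mean_monthly_total": sum(m.values()) / len(m)}
        for c, m in monthly.items()
    }

# Hypothetical time-stamped transactions.
transactions = [
    ("c1", date(2009, 1, 5), 20.0),
    ("c1", date(2009, 1, 20), 30.0),
    ("c1", date(2009, 2, 3), 10.0),
    ("c2", date(2009, 1, 9), 5.0),
]

monthly = accumulate_monthly(transactions)
features = summarize(monthly)
print(features)
```

The resulting one-row-per-customer features can be joined with other profile data and passed to conventional data mining techniques, which is the workflow the abstract outlines.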
Long-Term Value Modeling in the Automobile Industry
Bruce Lund, Marketing Associates LLC
Businesses often classify their customer base in terms of the customers' predicted long-term value (LTV). LTV may influence marketing strategies, particularly CRM and concern resolution. This paper describes an approach to LTV modeling in the automobile industry.
Defined: LTV provides an estimate of the time-adjusted profits to an automobile company from future new-vehicle purchases by a household.
Two statistical models will be discussed:
- A "Vehicle-Segment Choice Model" which predicts the likelihood that the next new-vehicle purchase from the Company will be within a particular vehicle segment.
- A "Time to Next Acquisition Model" which provides: (A) the probability that the household will ever make another new-vehicle purchase from the Company, and (B) assuming the household will make another new-vehicle purchase, the probability distribution for the household of the time in months to this next purchase.
Lastly, the LTV calculation will be described.
The implementation of the model for a major automobile company will be outlined and performance metrics will be discussed.
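The abstract does not give the LTV formula, so the following Python sketch is only one plausible way the two model outputs could be combined. It assumes, purely for illustration, that segment choice and purchase timing are independent, that only the next purchase is valued, and that profit per segment and the discount rate are known inputs. All names and parameters are hypothetical.

```python
def long_term_value(p_ever, months_pmf, segment_probs, segment_profit, annual_rate):
    """Expected discounted profit from a household's next new-vehicle purchase.

    p_ever         -- probability the household ever buys again (model output A)
    months_pmf     -- {months_to_purchase: probability}, given a purchase (output B)
    segment_probs  -- {segment: probability} from the segment choice model
    segment_profit -- {segment: profit}; an assumed input, not from the abstract
    annual_rate    -- assumed annual discount rate, e.g. 0.08
    """
    expected_profit = sum(
        prob * segment_profit[segment] for segment, prob in segment_probs.items()
    )
    monthly_discount = (1.0 + annual_rate) ** (-1.0 / 12.0)
    expected_discount = sum(
        prob * monthly_discount ** months for months, prob in months_pmf.items()
    )
    return p_ever * expected_discount * expected_profit


# Hypothetical household: 50% chance of ever buying again, in exactly 12 months.
example_ltv = long_term_value(
    p_ever=0.5,
    months_pmf={12: 1.0},
    segment_probs={"car": 0.5, "truck": 0.5},
    segment_profit={"car": 1000.0, "truck": 3000.0},
    annual_rate=0.0,
)
```

A fuller treatment would sum over a chain of future purchases rather than stopping at the next one, as the definition above refers to future purchases in the plural.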
Experiences on Experiment Design in Direct Marketing
Riku Mäkeläinen, TeliaSonera
TeliaSonera is the leading Nordic and Baltic telecommunications operator. After several years of good experience with predictive modeling, the Swedish Broadband marketing unit decided to further improve direct marketing by using design and analysis of experiments when selecting target customers. This presentation focuses on sharing experiences from the development work to date. The presentation includes a concrete case. Tools used include SAS QC/ADX, SAS Enterprise Miner, SAS Campaign Studio, and SAS STAT.
Rapid Analytics Prototyping to Achieve Accelerated Business Impact / Self-Sustaining Investments
Punit Mahajan, Infosys Technologies
The traditional methodology for deploying analytics is a long-lead-time (often multi-year) project, consisting of data warehousing, tool evaluation, selection & implementation, report & analytics configuration, and finally integration into business processes. Most often, by the time the project is completed, market dynamics or even internal dynamics like new LoBs or acquisitions have redefined the requirements. Rushing to meet internal timelines or external competitive pressures is equally risky. Going through this catch-22 can be frustrating, to say the least.
The session will focus on an alternative model that offers much needed agility:
The idea is to start with a flexible model right after the implementation of the data warehouse, alongside the longer term roadmap. The rapid prototyping track focuses on using the right bolt-on tools, or even native querying utilities, on top of sandbox warehouses rather than those of enterprise quality, creating report & analysis templates, and rendering them on something as simple as spreadsheets. This prototyping track can greatly accelerate time-to-market, enable functional stakeholders to validate templates before they are rolled out far and wide, and also let them consume these outputs to the extent they meet immediate priorities. In addition, the quick business impact realized through this approach funds the longer term roadmap, which progresses without being rushed and sets you on a path to sustainable long-term competitive advantage through an enterprise-scale analytics rollout.
We will also talk about how mastering this hit-the-ground-running approach one subject area at a time requires partnering with the right players – those that provide agile bolt-on tools required for prototyping (as well as have enterprise analytics solutions), and those with the right analytic resources that help in rapid prototyping (as well as have end-to-end expertise in DW, SI & Analytics laden business process management).
Why Just Count Crime When You Can Prevent it: Changing Public Safety Outcomes with Operational Analytics
Colleen McCue, MC2 Solutions, LLC
Doing more with less. Why just count crime when you can anticipate, prevent and respond more effectively? Operationally relevant and actionable analysis allows command staff and police managers to leverage advanced analytics in support of meaningful, information based tactics, strategy and policy decisions in the applied public safety and security environment. As the law enforcement community increasingly is asked to do more with less, these methods represent an opportunity to prevent crime and respond more effectively, while optimizing increasingly scarce or limited resources. Case studies will include risk-based deployment and resource allocation, or "just-in-time" policing, illegal narcotics markets, hostile surveillance, and the behavioral analysis and modeling of violent crime to include homicide, aggravated assault, robbery, and sexual assault. Ultimately, the incorporation of meaningful, operationally relevant and actionable analysis into information based police tactics, strategy and policy promises to increase public safety and change outcomes.
Active Use of Data Mining in the Customer Life Cycle Management Process of a Telecom Operator
Robert Moberg and Mattias Andersson, TRE
The challenge of a Telecom operator, besides recruiting new customers, is to reduce the attrition rate and at the same time maintain or increase the average revenue per user, all under the greater mission to run a profitable business.
This presentation will give deeper insight into how the Swedish branch of the global Telco 3 works with strategic Customer LifeCycle Management (CLM) and the critical contributions from SAS Analytics and SAS Enterprise Miner in this process. The highly individualized communication reaches its targets with high precision via a wide variety of channels; e.g., Invoice, SMS, MMS, DM, TM, and Self-service. Besides increased lift values and higher yield on the campaigns, everything is now launched with less effort and more efficiently in an automated process.
Strategic Marketing Analytics During Turbulent Times
Will Neafsey, Ford Motor Company
The past 12-18 months will long be remembered as a chaotic and trying time for business and personal endurance. During this time, we have seen the collapse and disappearance of businesses and industries we all presumed would outlast our natural lives. Marketing executives and managers in all industries are looking backward and questioning every decision made leading up to today. In this tumultuous economy, today's target customers may be the same people marketers ignored as little as a year ago.
The purpose of this talk is to discuss a variety of strategic marketing analytic techniques like segmentation and predictive modeling. These techniques will continue to be some of our best weapons throughout the economic turmoil and the recovery that lies ahead. General examples from the automotive industry will be used to illustrate the benefits and potential pitfalls of using advanced analytics in these uncertain times.
Exploiting randomness for aCRM choice modeling: Random MultiNomial Logit and Random Interaction MultiNomial Logit
Anita Prinzie, University of Manchester
Random Forests (RF) is a successful classifier exhibiting performance comparable to Adaboost, but is more robust. The exploitation of two sources of randomness, random inputs (bagging) and random features, makes RF an accurate classifier in several domains. We hypothesize that methods other than classification or regression trees could also benefit from injecting randomness. We generalize the RF framework to other multiclass classification algorithms like the well-established MultiNomial Logit (MNL). We propose Random MNL (RMNL) as a new bagged classifier combining a forest of MNLs estimated with randomly selected features. The Random MNL only includes main effects. However, given the omnipresence of interaction effects in consumer behaviour, refraining from the assessment of the predictive value of interaction effects in RU models is unacceptable.
Therefore, we present a Random Interaction MultiNomial Logit (RIMNL) model reintegrating interaction effects in the input space of a MultiNomial Logit model. In a first step, a Random MultiNomial Logit (RMNL) is estimated, building a forest of multinomial logits with randomly selected variables, of which 60% are randomly selected main effects and 40% are two-way interaction effects created from two randomly selected main effects. From the raw variable importance scores, we select a subset of best main effects as well as a subset of best interactions. In a second step, an RMNL model estimates multinomial logit models randomly selecting features from the feature space integrating the best main effects as well as the best interaction effects from step 1. The RIMNL model is applied to the same data as the RMNL model. The results prove RIMNL's ability to detect powerful interactions.
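The feature sampling in the first step can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: it assumes that the two main effects forming each interaction are drawn from the full pool of main effects and that interactions within one feature set are distinct, neither of which the abstract states. The function name and the variable names in the example are hypothetical.

```python
import random


def sample_rimnl_features(main_effects, n_features, rng, main_share=0.6):
    """Draw the feature set for one multinomial logit in the forest.

    Returns (main effects, interactions), where each interaction is a pair
    of distinct main effects. With main_share=0.6 the split is 60% / 40%.
    """
    n_main = round(n_features * main_share)
    n_inter = n_features - n_main
    max_pairs = len(main_effects) * (len(main_effects) - 1) // 2
    if n_main > len(main_effects) or n_inter > max_pairs:
        raise ValueError("not enough main effects for the requested feature count")
    mains = rng.sample(main_effects, n_main)
    interactions = set()
    while len(interactions) < n_inter:
        a, b = rng.sample(main_effects, 2)  # two distinct main effects
        interactions.add(tuple(sorted((a, b))))
    return mains, sorted(interactions)


# Hypothetical predictor names for illustration.
pool = ["age", "income", "tenure", "region", "gender", "recency", "frequency", "monetary"]
mains, interactions = sample_rimnl_features(pool, 10, random.Random(0))
```

Repeating this draw for every member of the forest, each fitted on its own bootstrap sample, gives the two sources of randomness the abstract describes.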
Improving drug safety through data mining of observational healthcare databases
Patrick Ryan, GlaxoSmithKline Research & Development
Drug safety continues to be a major public health concern in the United States, with adverse drug reactions ranking as the 4th to 6th leading cause of death, and resulting in health care costs of $3.6 billion annually. Recent media attention and public scrutiny of high-profile drug safety issues have increased visibility and skepticism of the effectiveness of the current post-approval safety surveillance processes. Recent calls have been made to establish a national active drug safety surveillance system that leverages observational data, including administrative claims and electronic health records, to monitor and evaluate potential safety issues of medicines. However, the development and evaluation of appropriate statistical methods for observational data have not yet been studied. This presentation will highlight the opportunities and challenges for applying data mining to observational healthcare databases, and discuss the potential role of exploratory analysis in better understanding the effects of medicines.
The power of the Group Processing Facility in SAS Enterprise Miner
Sascha Schubert, SAS
The group processing facility in SAS Enterprise Miner provides users with the power to create loops over analytical models in very customized ways. With this, users can automate the model building process to:
- build predictive models based on pre-defined segments, such as gender or age groups, or value-based grouping;
- apply automated model performance optimization routines, such as bagging and boosting;
- apply cross-validation techniques to test model robustness.
Managing Your Brand Using Text Analytics and Network Analysis
Saratendu Sethi, Teragram and Barry de Ville, SAS
In this presentation Saratendu Sethi (Teragram R&D Director) and Barry de Ville (SAS Analytics Consultant) illustrate the power of SAS text analytics in processing blog content. They illustrate content extraction, sentiment identification, concept clustering and social network analysis as methods of delivering business insight in a wide range of brand and business management settings.
Text Mining to Discover Influential Communications in Social Movements
Judy Spomer and Randall LaViolette, Sandia National Laboratories
Social Movement Theory (SMT) is an area of study in Sociology and Political Science that provides an analytical framework for understanding the factors involved in organized social action. A social movement develops in response to an issue about which people are mobilized in an effort to solve the problem. Much of this research has focused on understanding the framing process, whereby a Social Movement Organization issues communications intended to influence the perceptions and direct the actions of the members of a community or general population. The web has become the principal medium for these organizations to distribute framing documents, i.e., those that describe an issue, identify victims, place blame, propose solutions, and ask readers to take action on an issue. Here, models able to discover small numbers of framing documents in a large corpus are developed by combining Latent Semantic Analysis techniques with classification modeling algorithms. The models themselves provide insight into the character of framing documents. Global warming was selected as the social movement upon which to base this study. Global warming framing documents, collected from web sites, were combined with non-framing documents that also address global warming. This corpus served to train and test statistical models that not only detected framing documents, but further classified these by framing task (diagnostic, prognostic, motivational) with high accuracy. The accuracy was assessed both internally and against more straightforward approaches that were uninformed by SMT. These methods were implemented with SAS software and serve as a resource for the study of both SMT and active social movements.
Sample Space Partition Tests for Finding Associations Between Variables in Large Data Sets
Olivier Thas, Ghent University
One of the major objectives of data-mining is finding relations among a large set of variables. There are many issues involved from a statistical point of view. First there is hypothesis testing, where the focus is on finding and proving the existence of a relation. A second approach consists in the construction of classes of models that may be used for prediction purposes. The typical high dimensionality is an important concern for both hypothesis testing and prediction.
In this presentation I will focus on the search and the confirmation of relations between variables through hypothesis testing. First I will introduce a class of statistical tests that are completely nonparametric in the sense that all types of dependence can be picked up. These tests are referred to as Sample Space Partition (SSP) tests. I will also demonstrate how these tests are related to the concept of generalized correlation coefficients. The tests can be applied to continuous and to discrete variables. When they are used for testing an association between a discrete and a continuous variable, the results are interpretable in terms of moment differences between the continuous distributions in the classes defined by the discrete variable.
Throughout the presentation the methods will be illustrated on real data examples. I will also demonstrate the use of the false discovery rate (FDR) for dealing with multiple testing in this context.
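To make the multiple-testing step concrete, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure, one common way of controlling the FDR. The abstract does not say which FDR procedure the talk uses, so this choice is an assumption, and the p-values in the example are invented.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level q."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from four pairwise association tests.
decisions = benjamini_hochberg([0.02, 0.024, 0.5, 0.6], q=0.05)
```

In the example, the smallest p-value alone would fail its own threshold (0.0125), but it is still rejected because the second-smallest passes its threshold (0.025); this is the step-up behavior that distinguishes the procedure from a per-test cutoff.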
Trajectories Mining
Shusaku Tsumoto, Shimane University
This presentation shows a method for grouping trajectories of two or three temporal variables. Our method employs the following two-stage approach. First, it compares two trajectories based on their structural similarity and determines the best correspondence of partial trajectories. Then, it calculates the value-based dissimilarity for all pairs of matched segments and outputs their total sum as the dissimilarity of the two trajectories. Experimental results on several datasets demonstrate that our method can capture the structural similarity between trajectories even in the presence of noise and local differences, and can provide better proximity for discriminating objects.
The Rolling Ball: A Behavioral Customer Segmentation in Retail
Maarten Verschuere, dunnhumby
Tesco is the biggest retailer in the UK and the third largest grocery retailer in the world. The mission of Tesco is to win their customers' lifetime loyalty. Through their loyalty card, Clubcard, Tesco knows what is in each individual customer's basket for all of the 15 million households that visit the store every day. The vast amount of collected transactional data offers Tesco a challenging opportunity to understand their customers by looking at actual behavior rather than attitudinal market research. In cooperation with dunnhumby, Tesco developed a methodology called the Rolling Ball. This methodology made it possible to extract seven distinct behavioral Lifestyles from the detailed transactional retail data and to assign each individual customer to one of these lifestyles. This insight into how every customer behaves is an essential key to success in difficult times, as the Tesco Lifestyles evolve together with changing shopping behavior. Thanks to keeping a close eye on the pulse of every customer, Tesco managed to report record profit numbers in the most recent financial year. The presentation deals with how dunnhumby developed the Rolling Ball methodology and how exactly Tesco has benefited from this behavioral segmentation.
Incorporating Fuzzy Cluster Memberships within Enterprise Miner
Donald Wedding, SAS
Clustering is the process of placing data records into homogeneous groups. Members of each group are similar to one another and highly different from members outside the group. There are two types of clustering methodologies that are primarily used: hard clustering and fuzzy clustering. In hard clustering, membership is absolute and mutually exclusive. A record is completely a member of a cluster or it is not a member. In fuzzy clustering, a record is permitted to have partial membership so that it is allowed to be in more than one cluster. For example, a record might have 60% membership in CLUSTER A and 40% membership in CLUSTER B. Enterprise Miner 5.3 does not incorporate fuzzy clusters, but it is possible to generate fuzzy memberships of hard clusters. This presentation will provide a brief description of fuzzy clusters and will present a tutorial on incorporating fuzzy memberships within Enterprise Miner.
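As an illustration of generating fuzzy memberships from hard clusters, the Python sketch below applies the standard fuzzy c-means membership formula to centroids that are assumed to come from an already-fitted hard clustering. The abstract does not say how the tutorial derives its memberships, so this is an assumed approach rather than the Enterprise Miner procedure; the function name and example values are hypothetical.

```python
import math


def fuzzy_memberships(point, centroids, fuzzifier=2.0):
    """Partial membership of `point` in each cluster, summing to 1.

    Uses the fuzzy c-means membership formula with fixed centroids:
    membership is proportional to distance ** (-2 / (fuzzifier - 1)).
    """
    distances = [math.dist(point, centroid) for centroid in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (fuzzifier - 1.0)
    weights = [d ** -exponent for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical centroids of CLUSTER A and CLUSTER B from a hard clustering.
memberships = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
```

With the default fuzzifier of 2, a record twice as far from CLUSTER B as from CLUSTER A receives 80% membership in A and 20% in B; larger fuzzifier values push the memberships closer to equal.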
Using Data Mining to Identify Missed Opportunities for Cervical Cancer Screenings
Terry Whitlock, Blue Cross Blue Shield - Tennessee
Objective: To use predictive analytics to assess drivers of compliance for Medicaid women who obtained a cervical cancer screening (CCS).
Background:
In an effort to respond to consumer demand not only for cost effective care but also quality care, HEDIS has become a tool used by more than 90 percent of health plans to measure performance on dimensions of care and service. One of 71 HEDIS quality indicators is the proportion of eligible women in a population being current with a cervical cancer screening. After years of initiatives that produced no apparent improvement in cervical cancer screening rates, a more research oriented approach was applied to determine what "drivers of compliance" can be ascertained from the many historical data sources on this population. Ideally, this research would not only reveal what drives a woman to obtain a cervical cancer screening but also identify a "most likely to become compliant" segment of the non-compliant population, so that limited resources can be best utilized to increase screening rates.
Methods:
A retrospective case control study was designed where the subjects were measured on their current CCS compliance status based on HEDIS® 2008 technical specifications. Using SAS Enterprise Miner, predictive models were constructed to determine the likelihood of a subject being compliant for CCS. Model inputs included medical claims, compliance with other evidence-based guidelines, and demographic and geo-spatial variables. Varying model types were compared for performance.
Results:
Predictive analytics indicate that different age-groups are likely to be compliant for different reasons. Women age 24-41 are likely to be screened during an Ob/Gyn visit or during treatment for preventive, gynecological or obstetrical episodes of care. Women age 42-51 are likely to be screened during an Ob/Gyn visit, during treatment for preventive or gynecological episodes of care, or if compliant for breast cancer screening. Women age 52 and older are likely to be screened during an Ob/Gyn visit, during treatment for preventive episodes of care, or if compliant for breast cancer screening.
Conclusions:
A high proportion of CCS compliant women were positively associated with medical treatments not directly attributed to preventive screenings (i.e. obstetric and gynecological care). Additionally, a habit of preventive behaviors is formed in a compliant population (i.e. association with compliance in other evidence based guidelines).
Sequence Analysis Technique in Business
Katsutoshi Yada, Kansai University, Osaka, Japan
This presentation introduces and demonstrates the effectiveness of the E-BONSAI system for examining consumer purchasing behavior. The system was developed from BONSAI, which was originally designed for the character string analysis required in the field of gene analysis. We adapted E-BONSAI to analyze consumers by examining chronological purchasing patterns expressed as character strings. This process revealed useful information from massive amounts of customer purchase history data. This presentation describes a case where, based on information acquired through the above methodology, a new sales promotion strategy at Japanese supermarkets was implemented and proven to be effective.
Generate Rating Tiers Using Different GLM Modeling Approaches: A Topic of Predictive Modeling on Personal Lines Insurance Pricing
Jun Yan, Deloitte Consulting LLP
Since the late 1990s, predictive modeling has been widely used as a strategic tool for Property and Casualty (P&C) insurance companies to compete in the marketplace. Originally introduced in personal auto insurance to improve pricing precision and risk segmentation, predictive modeling has been extended to homeowner's and small commercial lines as well. Recently, predictive modeling and the use of generalized linear models (GLM) have been applied widely in most areas of the P&C insurance operations.
In predictive models for pricing, the main focus is on predicting loss cost, determining premium to charge, evaluating rating adequacy, or determining rating class plan factors. One typical result developed from a pricing model is a rating plan, which displays the rating variables, factors and loss cost relativities across the rating variables.
In this presentation, we will describe some GLM based modeling methodologies for generating rating tiers on top of an existing rating plan. Those methodologies can be applied to modify a rating plan in numerous circumstances which include involving non-traditional rating variables in a rating plan or dealing with state regulation changes. Meanwhile those methodologies also can help to reduce business disruptions, such that the adjusted rating plan can be implemented for both new business and renewal business.
For the purpose of rating tier development, we will describe and compare three modeling approaches; each of them uses different target variable(s).
- An approach of separately modeling claim frequency and claim severity;
- An approach of modeling pure premium;
- An approach of modeling loss ratio.
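The target variables behind the three approaches are related by simple arithmetic, shown in the Python sketch below using the standard actuarial definitions. The function name and the example figures are hypothetical, and the sketch only computes the targets; it does not fit a GLM.

```python
def rating_targets(claim_count, total_loss, exposure, earned_premium):
    """Compute the candidate GLM target variables for one rating cell."""
    frequency = claim_count / exposure            # claims per unit of exposure
    severity = total_loss / claim_count if claim_count else 0.0  # loss per claim
    pure_premium = total_loss / exposure          # loss per unit of exposure
    loss_ratio = total_loss / earned_premium      # loss per unit of premium
    return {
        "frequency": frequency,
        "severity": severity,
        "pure_premium": pure_premium,
        "loss_ratio": loss_ratio,
    }


# Hypothetical cell: 4 claims, 20,000 in losses, 100 car-years, 40,000 premium.
targets = rating_targets(
    claim_count=4, total_loss=20000.0, exposure=100.0, earned_premium=40000.0
)
```

Because pure premium equals frequency times severity, the first approach models the two factors separately and multiplies the predictions, while the second models their product directly. The loss ratio approach divides by the premium from the existing rating plan, which suits building tiers on top of that plan.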

