Abstracts

Keynote Speakers
Frontiers in Data Mining: Emerging Trends, Challenges and Applications
Bart Baesens, Katholieke Universiteit Leuven (Belgium) & University of Southampton (United Kingdom)
Over the past few years, data mining has grown from a relatively unknown discipline into a widespread billion-dollar business. First adopted only in the retail and banking sectors, it has since spread to application domains such as e-business, terrorism prevention, RFID, software engineering, pharmaceutics, and bioinformatics. In this presentation, we first briefly present a selection of exciting new data mining techniques (e.g. Bayesian networks, multirelational mining) that have substantial potential for improving (strategic) business processes. We then discuss some key challenges in implementing data mining models successfully in a business setting, e.g. improving data quality, model interpretability, model backtesting and model stress testing. We conclude by covering some recent and visionary application domains and explain how data mining can contribute to increased efficiency in these fields.
Mining Industrial Data using Latent Variable Methods
John MacGregor, McMaster University, Canada
Latent Variable methods based on Principal Component Analysis (PCA) and Projection to Latent structures (PLS) have been the main approaches used for mining information from databases in the process industries. This talk will look at the theoretical justification for the use of these methods and present recent applications illustrating their use on a variety of problems from diverse industries. In particular, the talk will consider the use of plant data for process analysis, monitoring, and optimization, the extraction of information from digital images for on-line process and product quality control, and the use of diverse industrial databases for the rapid development of new products.
Managing Business Complexity with Agent-Based Modeling and Simulation
Michael J. North, Argonne National Laboratory
Agent-based modeling and simulation (ABMS) is a recent approach to modeling systems composed of interacting autonomous agents. ABMS is already having far-reaching effects on the way that governments and businesses use computers to support decision-making. Computational advances have made possible a growing number of agent-based applications in a variety of fields at ever-increasing scales. Applications range from modeling supply chains and logistics systems, to predicting the spread of epidemics and the diffusion of public information, to identifying the factors in the fall of ancient civilizations and understanding contemporary urban conflict, to name a few. This tutorial, based on North and Macal's "Managing Business Complexity: Discovering Strategic Solutions with Agent-Based Modeling and Simulation" (Oxford 2007), describes the foundations of ABMS; identifies software toolkits, methods, and approaches for developing agent models, from spreadsheets to enterprise-scale computer systems; and discusses the relationship between ABMS and traditional modeling techniques, emphasizing the added value that ABMS provides, along with special challenges pertaining to data and model validation.
An Introduction to Case-Based Reasoning with Special Emphasis to Image Mining Tasks in Biomedical Applications
Petra Perner, Institute of Computer Vision and applied Computer Sciences, IBaI, Germany
Case-Based Reasoning (CBR) solves problems using already stored knowledge and captures new knowledge, making it immediately available for solving the next problem. Case-based reasoning can therefore be seen both as a method for problem solving and as a method for capturing new experiences and making them immediately available for problem solving. It can be seen as a learning and knowledge-discovery approach.

CBR can collect samples in a defined way as well as learn more generalized knowledge in the form of higher-order constructs among the samples, prototypes and structures. As such, it fills the gap between generalizing data mining methods, such as decision trees, statistical models and rule induction methods, and similarity-based data mining methods, such as nearest neighbour classifiers.

The success of CBR systems has been shown for many applications, among them signal/image processing and interpretation tasks, help-desk applications, medical applications and e-commerce product-selling systems. In this talk we will explain the case-based reasoning and case mining process. We will show what kinds of methods are necessary to provide all the functions for such a computer model. We will build a bridge between CBR and statistics. Examples will be given based on image mining tasks in biomedical applications such as high-content analysis of cellular assays, novelty detection, and meta-learning for parameter selection in signal processing.
Reinventing Customer Relationships: Using Analytics to Capitalize on Insight, Intimacy, and Loyalty to Drive Growth
Daniel Thorpe, Wachovia
Everyone talks about creating an appealing "customer experience," but what do banks have to do to go beyond buzzwords and effectively create an integrated customer experience that leads to retention, revenue growth and improved profitability? What roles do analytics and targeting play in generating customer insight, customer value and customer intimacy? Several examples and a framework using analytics to understand the links between customers and values will be discussed.
Session Speakers
A Bag of Tricks for Your Balancing Act: How to Increase Predictive Accuracy on Imbalanced Datasets
Sven F. Crone, Lancaster University Management School, UK
Data Mining methods and procedures are routinely employed in business, but often neglect the specific properties of the dataset. For many corporate applications the actual class of interest, e.g. those responding to a direct mailing or defaulting on a loan, is often an underrepresented minority, which should be either targeted or avoided to ensure profitability. But how important is the data in the majority class of lesser interest? Is it required at all, or can we discard parts of it? And if so, is there some 'golden ratio' of negative to positive examples?

A variety of simple sampling strategies are now available to under- or over-sample the existing data. This presentation will demonstrate how different approaches to data sampling can enhance or impair predictive accuracy, using case studies of database marketing and direct mailing, customer credit scoring, and predicting internet shopping adoption, which distinguishes between online shoppers, browsers and offline shoppers.
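As a rough illustration of the under-sampling idea mentioned above, the following Python sketch randomly drops majority-class records until a chosen ratio of majority to minority examples is reached. It is not taken from the talk: the function name `undersample`, the `ratio` parameter and the toy data are assumptions made for illustration, and the code assumes a binary target.

```python
import random
from collections import Counter

def undersample(rows, label_of, ratio=1.0, seed=0):
    """Randomly drop majority-class rows so that at most `ratio`
    majority rows remain per minority row (binary labels assumed)."""
    counts = Counter(label_of(r) for r in rows)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (major, _), (minor, n_minor) = counts.most_common(2)
    rng = random.Random(seed)
    majority = [r for r in rows if label_of(r) == major]
    minority = [r for r in rows if label_of(r) == minor]
    # Number of majority rows to keep, capped by how many exist.
    keep = min(len(majority), int(round(ratio * n_minor)))
    sample = rng.sample(majority, keep) + minority
    rng.shuffle(sample)
    return sample

# Hypothetical toy data: 95 non-responders (0) and 5 responders (1).
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(5)]
balanced = undersample(rows, label_of=lambda r: r[1], ratio=2.0)
print(Counter(label for _, label in balanced))
```

Varying `ratio` is one simple way to probe the question of how much majority-class data is actually needed; a model trained on such a sample generally needs its predicted probabilities corrected before they are interpreted as population rates.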
Offset Techniques in Predictive Modeling for Insurance
Matthew Flynn, ISO and Jun Yan, Deloitte Consulting
In predictive modeling, "offset" is a technique frequently used in data architecture, modeling architecture and model setups for data mining. Intuitively, offset is a simple method used to run a model against the residual of a set of given factors. In this presentation, we will first display some generic applications of offset and the corresponding SAS code using Proc GENMOD and Proc NLMIXED. The examples include modeling count instead of modeling ratio, and partially offsetting coefficients for certain variables. Next, we use personal auto pricing as an example to show several variations of "offset" in real modeling practice, where we will discuss how "offset" can be applied to exposure adjustment, sequential modeling and cross-coverage tier construction. A GLM with a Tweedie compound Poisson distribution will be used in some of the examples, where we will show how to code a Tweedie model in SAS Proc NLMIXED.
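To make the "modeling count instead of modeling ratio" idea concrete, here is a minimal Python sketch (not the SAS code from the talk) of a Poisson regression in which log(exposure) enters as an offset, i.e. a term whose coefficient is fixed at 1. The function name `fit_poisson_offset`, the single covariate and the toy claim data are all assumptions made for illustration.

```python
import math

def fit_poisson_offset(x, y, exposure, iters=25):
    """Fit log E[y_i] = log(exposure_i) + b0 + b1 * x_i by Newton's method.

    log(exposure_i) is the offset: it is added to the linear predictor
    but no coefficient is estimated for it."""
    b0, b1 = math.log(sum(y) / sum(exposure)), 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ei in zip(x, y, exposure):
            mu = ei * math.exp(b0 + b1 * xi)  # expected count
            g0 += yi - mu                     # score for b0
            g1 += (yi - mu) * xi              # score for b1
            h00 += mu                         # information matrix entries
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: claim counts, a binary rating factor, exposure in car-years.
x = [0, 0, 1, 1]
y = [2, 4, 9, 3]
exposure = [10, 20, 15, 5]
b0, b1 = fit_poisson_offset(x, y, exposure)
print(math.exp(b0), math.exp(b1))  # base claim frequency and relativity
```

With a single binary factor the fit reproduces the observed claim frequencies per group (claims divided by exposure), which is why modeling the count with an exposure offset is equivalent to modeling the ratio with exposure weights in this simple case.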
Support Vector Machines: The New Kid on the Block
Elsa Jordaan, The Dow Chemical Company
Support Vector Machines (SVMs) are one of the latest nonlinear modelling techniques to come from the computational intelligence community. They are widely used for classification and text mining problems, but they can also be applied to regression problems. By design, SVMs are able to model sparse data sets as well as rank-deficient data sets ("fat" data). This last aspect makes them particularly interesting for analyzing marketing and medical data, where the number of features is typically much higher than the number of observations. Another capability of SVMs, which is of particular interest in the financial sector, is their ability to detect outliers and anomalies.
Two Case Studies in Fraud Detection
Jin-Whan Jung, Jay King, and Sanjay Arangala, SAS
Fraud is known to be a very difficult problem to solve due to its rarity and frequent changes in methods. Though rare, damage from fraud, when committed, can be significant both financially and in terms of corporate reputation and customer relations. Thus, fraud detection and prevention is an increasingly important objective across all industries. In a predictive modeling setting, approaches based on historical behavior tend to remain stable until the fraudsters change their strategy. Because new types of fraudulent activity are constantly emerging, these models tend to decay rapidly and must also evolve. Catching these unusual and rapidly evolving behavior patterns has been especially challenging. Typical approaches with unknown historical information involve univariate and multivariate outlier detection techniques. Two examples using such techniques are introduced here with their business issues.
Data Mining at Chrysler
Thomas L. Kondrat, Chrysler LLC
Data mining technology has been used to solve a wide variety of business problems in many business domains at Chrysler. With the continuing expansion of brands and nameplates available to the automotive consumer, the automotive industry is extremely competitive and manufacturing and marketing processes have become increasingly complex. Consequently, we are relying more heavily on automated methods and analytics than ever before. Therefore, one critical key to corporate efficiency and effectiveness is the ability to both automate and optimize decision making (using analytics) to make the best use of the limited human resources available.
Building Symbolic Regression Models: An Industrial Experience
Arthur Kordon
Symbolic regression involves finding both the functional form and the numeric coefficients of a mathematical expression. It is one of the key application areas of Genetic Programming (GP) with great potential for effective empirical model development.

The presentation will focus on the industrial experience of applying symbolic regression for solving various real-world problems in the chemical industry. First, the competitive advantages of symbolic regression modeling, generated by GP, will be discussed. Second, an integrated methodology that explores the synergy between support vector machines, GP and statistics for effective symbolic regression modeling, will be presented. Third, the methodology will be illustrated with several industrial applications in The Dow Chemical Company.
Data Mining to Help Determine Which Orthodontic Patients are Appropriate to Treat and Which are Better to Refer to Specialists
Larry Lai & Eric Kuo, Align Technology, Inc.
The question of how to maximize treatment success for new customers using an innovative custom-manufactured dental product is a challenge for this medical device manufacturing company. By mining the wealth of digital orthodontics data about previously manufactured treatment devices, the company is better able to lay out effective product strategies to help guide the doctor in the decision-making process of, given his/her skill set, whether to treat a prospective patient using the device or refer the patient to a more-experienced dental specialist. The same approach can also be used to provide information to the clinical education, product sales, and product support teams to improve product training for customers. This strategy can improve treatment outcome quality, boost the doctor's clinical confidence in the dental product, and potentially improve the new customer retention rate.
Behavior-Based Predictive Models – A New Framework of Predictive Models
Wensui Liu & Jimmy Cela, ChoicePoint Precision Marketing
Modelers have traditionally used logistic regression, decision trees, or other data mining techniques to develop discriminate marketing models which classify populations of interest into segments. The aim is to calculate a probability-based score for each individual and predict his/her likelihood to respond to a marketing campaign offer, file an insurance claim, or default on a credit payment. This type of modeling strategy is focused on the choice of an individual without consideration of the consequent behaviors.

However, in the real world, what financially impacts companies the most is not the choice to respond itself but the behaviors associated with that choice. Specifically, it is the frequency and the severity of such behaviors that result in either profits or losses to companies.

In this presentation, a new framework of predictive models will be introduced which simultaneously estimates both the probability of a specific choice (response, claim, or default) and the conditional probability of consequent behaviors (number of responses, claims, or defaults). Models to be discussed include the Hurdle Model (Mullahy, 1986), the Zero-Inflated Poisson Model (Lambert, 1992), and the Latent Class Poisson Model (Wedel, 1993). A modeling strategy using SAS and various statistical tests for model selection will be illustrated to show the audience, from a marketing, insurance, and banking perspective, how to implement this new concept in data mining to predict customers' behaviors of interest.
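The difference between two of the models named above can be sketched through their probability mass functions. The Python below is an illustrative sketch rather than the presenters' SAS implementation; the parameter names `pi_zero`, `p_zero` and `lam` are assumptions, and covariates are omitted (in practice each parameter would depend on predictors).

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson count with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson: with probability pi_zero the count is a
    structural zero, otherwise it is drawn from Poisson(lam)."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base

def hurdle_pmf(k, p_zero, lam):
    """Hurdle model: P(Y = 0) = p_zero; positive counts follow a
    zero-truncated Poisson(lam)."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

# Both distributions sum to one over the non-negative integers.
print(sum(zip_pmf(k, 0.3, 2.0) for k in range(50)))
print(sum(hurdle_pmf(k, 0.3, 2.0) for k in range(50)))
```

In the zero-inflated model a zero can arise from either component, whereas in the hurdle model all zeros come from the binary "choice" part and the count part only describes individuals who crossed the hurdle.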
Marketing Impact Optimization Using PROC OPTMODEL
Randy Sherrod, CISCO
Econometric models are often used to help inform the allocation of corporate marketing budgets. By quantifying the impact of past marketing efforts on market performance (e.g., revenue), models make it possible to optimize the impact of marketing investment. Typically, scenario planning and optimization alternatives are calculated using the mean (not variance) of the estimated relationship between marketing efforts and market performance. This paper uses PROC OPTMODEL to illustrate how uncertainty can impact the optimization of marketing budgets by treating the econometric estimates of the relationship between marketing efforts and market performance as a random variable.
Mathematical Professional Science Masters (PSM) Degree Programs Are Excellent Sources for BI Staff Recruiting
Phil Tuchinsky, Tuchinsky BI, LLC and Senior Research Fellow, Central Michigan University Research Corporation
Would you like to add staff to your BI / analytics group that are A) seeking business / industrial careers in quantitative and data-driven problem solving; B) well-prepared academically as BI knowledge workers and C) proven project managers, business writers and presenters? Would these candidates be even more attractive if they come with D) experience on industrially sponsored analytics projects and E) business awareness coursework that prepares them to work with your executive project champions?

The graduates of more than a dozen professional science masters (PSM) degree programs at leading American universities have all these qualifications. They are self-starting mathematical knowledge workers and problem solvers, ready to grow into your BI corporate culture. Many will develop into leaders and managers. Starting salaries fresh from graduate school are near $60,000 a year (USD).

PSM programs are terminal degrees, usually involving two years of graduate training beyond a bachelor's degree. They are career-oriented, with established ties to business and industry. With Alfred P. Sloan Foundation support, more than 70 American universities have created more than 120 of these programs since 1998. Their content ranges over all the mathematical, physical and life sciences; many are interdisciplinary.

The dozen+ mathematical PSM programs, variously named M.S. in Applied Mathematics, Financial Mathematics, Industrial Mathematics, Mathematical Entrepreneurship, etc., are exceptional training grounds for business intelligence knowledge workers. This talk presents the philosophy, origin and structure of these programs, their real-world roots and natural relationship to BI work, and how to contact them all.
Alternative Paths Towards Improved Predictive Analytics for Customer Intelligence
Dirk Van den Poel, Ghent University, Belgium
This talk focuses on alternative ways to improve on your current predictive-analytics models for customer intelligence. It will mainly focus on two areas of application: customer churn and cross/up-sell (also known as NPTB models, i.e. Next Product to Buy). Basically, there are two ways to enhance the performance of your predictive models: 1. Include better variables, and 2. Employ better models. Firstly, this presentation demonstrates how sequential as well as textual information (using SAS Text Miner) enhances the predictive capabilities over and above other variables. Secondly, we demonstrate how ensemble methods, composed of numerous 'simple' models, increase the predictive power. All these approaches were rigorously tested through the peer-review process of international journals.
The IDeAL System: A Utility-based Methodology for Mining Massive Databases
Herna L. Viktor, University of Ottawa
Massive databases, which are omnipresent in domains such as computational biology, law enforcement and environmental impact studies, amongst others, bring new challenges to the data mining community. There is an urgent need for novel algorithms and solutions to help domain experts and data mining novices understand these vast resources. Such users require direct, transparent access to their very large-scale relational databases. Furthermore, they require the data mining exercise and its results to be evaluated using measures that are of high economic utility, in order to make informed, thought-through decisions.

This talk describes the IDeAL data mining system, designed to mine very large-scale databases directly. A scalable approach that detects patterns between multiple seemingly unrelated characteristics is described, which enables us to capture previously unavailable semantically rich information. A utility-based evaluation method is followed, in order to give decision makers access to knowledge which may otherwise have remained hidden in their massive databases. The IDeAL system is illustrated by presenting the results when exploring an Anthropometric database, as seen from the Virtual Tailoring perspective.
Predicting Loss Given Default in Retail Portfolios using SAS Enterprise Miner
Hendrik Wagner, Independent Consultant (Risk Parameters)
Predicting percentage loss rates for non-defaulted loan exposures is a required component of risk estimation in the context of Basel 2 banking regulation. Over the course of the past few years many banks have gathered the necessary historical cash flow data that serve as the basis for LGD estimation in retail portfolios. Multivariate statistical estimation methods, however, have only recently started to be employed more widely. The presentation describes some tools in SAS Enterprise Miner that are particularly suited for addressing this problem, such as two-stage modelling, regression tree modelling and generalized additive neural networks.
Practical Applications of Decision Theory in Modeling Rare Events
Doug Wielenga, SAS
Modern data mining methods enable you to develop a large number of models in a short amount of time. Implicit in this development is a decision structure which can impact all phases of the process. Using the wrong decision structure can lead to inferior model development, incorrect model selection, and inadequate model deployment. By default, most modeling methods predict each observation for a class target into the level with the highest probability. This poses a problem when modeling a rare event where it may be impossible to find a set of predictors that identify any observations where the event of interest is most likely. It is common in these situations to oversample the rare event in the training data set or to change the threshold at which the rare event is selected. SAS Enterprise Miner® addresses both of these approaches through the use of the Target Profiler. This paper discusses several strategies for modeling rare events based on the nature and the size of the imbalance, and provides several examples of how to use the Target Profiler to obtain the correct probabilities and make the right decision. By using these strategies, you will be able to better identify the best variables, pick the best model, and improve deployment results.
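The two adjustments discussed above, correcting probabilities after oversampling and choosing a decision threshold other than "most likely class", can be sketched in a few lines of Python. This is a generic illustration and not the Target Profiler's implementation; the function names, the priors and the gain/loss figures are assumptions made for the example.

```python
def adjust_for_oversampling(p_sample, sample_prior, true_prior):
    """Rescale an event probability estimated on oversampled training data
    (event share = sample_prior) to the population (event share = true_prior)."""
    num = p_sample * true_prior / sample_prior
    den = num + (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    return num / den

def should_target(p_event, gain_if_event, loss_if_nonevent):
    """Act when the expected gain from a true event exceeds the
    expected loss from acting on a non-event."""
    return p_event * gain_if_event > (1.0 - p_event) * loss_if_nonevent

# Hypothetical case: a model trained on a 50/50 sample, true event rate 2%.
p = adjust_for_oversampling(0.5, sample_prior=0.5, true_prior=0.02)
print(p)
print(should_target(p, gain_if_event=100.0, loss_if_nonevent=1.0))
```

Under these assumed payoffs an observation is worth acting on even though its corrected event probability is far below 0.5, which is the situation in which predicting the highest-probability class never selects the rare event.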