Abstracts

The riddle of suppression, enhancement and co-linearity in a linear regression model
Leonardo Auslender, SAS
It is standard practice in working with linear models, especially when the model is searched (for instance via stepwise) but also when the model is given, to provide meaningful interpretation to the resulting coefficients. It is not unusual to complain and even panic when the signs are not "correct", and one of the standard claims for this situation is co-linearity.

The aim of this presentation is to delve into the analytics of the relationships among variable coefficients. We show graphically that the correlations among those variables can sometimes induce better models even with high values of those correlations, and that co-linearity is a relative and not an absolute problem.
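
As a rough illustration of the suppression and enhancement effects described above (not taken from the presentation itself), the two-predictor case can be worked out from the three pairwise correlations alone, using the standard formulas for standardized regression coefficients. The correlation values below are hypothetical and chosen only to make the effect visible.

```python
def two_predictor_fit(r_y1, r_y2, r_12):
    """Standardized OLS coefficients and R^2 for y regressed on x1 and x2,
    computed from the three pairwise correlations."""
    denom = 1.0 - r_12 ** 2  # approaches 0 as x1 and x2 become collinear
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared


# Hypothetical values: x2 is uncorrelated with y but strongly correlated with x1.
beta1, beta2, r_squared = two_predictor_fit(r_y1=0.5, r_y2=0.0, r_12=0.7)
print(round(beta1, 3), round(beta2, 3), round(r_squared, 3))
```

With these inputs, x1 alone explains 25% of the variance, while the pair explains about 49%, and x2 receives a negative coefficient even though its correlation with y is zero. This is the kind of "incorrect" sign that is often blamed on co-linearity, yet here the correlated predictor improves the model.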
Data augmentation by predicting spending pleasure using commercially-available external data
Philippe Baecke, Ghent University
Since customer relationship management (CRM) plays an increasingly important role in a company's marketing strategy, the company's database can be considered a valuable competitive asset. Consequently, companies constantly try to augment their database, both by collecting data themselves and through the acquisition of commercially available external data. Until now, little research has been done on the usefulness of these commercially available external databases for CRM. This study presents a methodology for such external data vendors, based on random forests predictive modeling techniques, to create commercial variables that solve the shortcomings of a classic transactional database. Eventually, we predicted spending pleasure variables, a composite measure of purchase behavior and attitude, in 26 product categories for more than 3 million respondents. Enhancing a company's transactional database with these variables can significantly improve the predictive performance of existing CRM models. This has been demonstrated in a case study with a magazine publisher for which prospects needed to be identified for new customer acquisition.
Data mining in the financial services industry: change we need
Bart Baesens, Katholieke Universiteit Leuven (Belgium) & University of Southampton (United Kingdom)
In this talk, we will elaborate on some recent challenges that have emerged when applying data mining within the financial services industry for applications such as credit risk management, fraud detection, and anti-money laundering. We will first discuss issues related to data quality and master data management. Next, we discuss some important technical data mining challenges such as model interpretability, model compliance, and learning using networked data. This will be followed by an overview of model monitoring, model back testing, and model benchmarking. Finally, we also discuss how to incorporate macro-economic effects into data mining models by means of stress testing procedures. Throughout the talk, key recommendations and clear guidelines will be provided with respect to the challenges mentioned. Ideas presented in the talk can also have relevance in other data mining application fields, e.g. marketing, medicine, pharmaceuticals, and government.
Don't Bore Your Expert: How to Interactively Learn Classifiers from Unlabeled Data
Michael Berthold, University of Konstanz
Building classifiers based on unlabeled data becomes increasingly important as data repositories grow and the human ability to label at least subsets of this data remains constant. Especially in the life sciences, such classification problems are becoming more important every day. We will focus on one example in this talk which stems from high-throughput microscope screening cameras. These devices are able to produce hundreds of thousands of images per day. Since they generate different types of pictures for each assay, a new classifier is needed every time. Asking the user to label some dedicated images is possible, but the number of such queries needs to be kept to an absolute minimum, preferably no more than one hundred queries for an assay consisting of several million images.

In this talk we discuss an adaptive active classification scheme which establishes ties between the two opposing concepts of unsupervised clustering of the underlying data and the supervised task of classification. The learning scheme allows for an initial clustering of large datasets, then queries a few selected examples and subsequently adjusts the classification boundaries based on a small number of additional labels. The learner hence asks the human expert for labels of the most informative examples throughout the learning process and therefore keeps the costs for supervision at a manageable level. We discuss results on real world applications and show how this approach also points out artefacts and other interesting outliers in the data.
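
The first stage of such a scheme, clustering the unlabeled data and then spending a small query budget on representative examples, can be sketched as follows. This is a deliberately simplified illustration and not the speaker's algorithm: it uses a tiny one-dimensional k-means, assumes at least two clusters, queries one point per cluster, and omits the later step of adjusting class boundaries with additional labels. The function name, the data, and the `oracle` callable standing in for the human expert are all hypothetical.

```python
def cluster_then_query(points, oracle, k=2, iterations=10):
    """Cluster 1-D points with a tiny k-means (k >= 2), ask the oracle for the
    label of the point nearest each centroid, and spread it over the cluster."""
    # Deterministic initialisation: spread k centroids evenly between min and max.
    lo, hi = min(points), max(points)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    labels = [None] * len(points)
    queries = 0
    for c in range(k):
        member_idx = [i for i, a in enumerate(assignment) if a == c]
        if not member_idx:
            continue
        representative = min(member_idx, key=lambda i: abs(points[i] - centroids[c]))
        label = oracle(points[representative])  # the only step that costs expert time
        queries += 1
        for i in member_idx:
            labels[i] = label
    return labels, queries


# Hypothetical data and expert: two well-separated groups of measurements.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels, queries = cluster_then_query(points, oracle=lambda x: "A" if x < 3 else "B")
print(labels, queries)
```

The point of the sketch is the cost accounting: every point ends up with a label, but the expert is consulted only once per cluster.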
How to Translate Data Mining Outputs into Measurable Uplifts in ROI
Mark Carmichael, Eclipse International
For the most part, direct marketers today
  • repeat old direct marketing techniques instead of testing new technologies. Typically marketers use the same direct marketing channels year after year instead of regularly evaluating and reallocating budget to improve return. More than 80% of marketers struggle to respond to marketing data in order to improve results.
  • hand off the planning process. Many marketers offload marketing planning to an agency and rarely question the agency's recommendations.
  • focus on reach, rather than on behaviour and intent. Today, most direct marketing planning tools identify which channels will deliver the greatest reach based on historical data. But this does not help a marketer influence customer behaviour or predict a propensity to take future action.
Technology Demands A New Approach
How much to spend on which media is an age-old question. So what's different now? In a word: technology.

Technology forces are changing direct marketing in three important ways - all of which demand that marketers re-engineer the way they spend their money. Technology
  • changes consumer behavior dramatically. Consumers are no longer a mass-market reachable through traditional media. Because of technology, consumers now go online for information, multitask across multiple channels, rely on peers and personal research for product information, and expect immediate, customized responses to their needs.
  • alters the purchase process. The shift in consumer information consumption has also changed the impact of traditional marketing tactics - like broadcast media and print - on purchase decisions. According to DoubleClick's Touchpoints II study, 64% of consumers feel the Internet has altered their purchase process from five years ago. In fact, today, Web sites have more influence over purchases than any other media, and consumers are more self-reliant and more willing to buy even high-ticket, high-consideration products online.
  • plays an increasing role in marketing processes. Technology helps marketers get smarter about customers, deliver more relevant messages, and measure impact of their campaigns. Marketing automation, data warehouses, Web analytics, personalised URLs, email, SMS, CRM software, and contact management systems are all examples of technologies playing increasing roles in the marketing process. They enable marketers to shift from mass branding to an approach that better suits unpredictable customer behaviors: cross-channel integration of customer-facing channels.
Bringing it all Together: An Integrated Approach to Increased Results & ROI
Data analysis is only as good as the approach a marketer takes to acting on it. Outdated planning processes and technology realities require a new, more scientific way to plan that focuses on customer behavior and objectively considers all direct marketing resources. Marketers must facilitate, streamline, customize and track communications better and more effectively. By enriching the information flow (by making it more relevant, timely and meaningful), the marketer also improves the quality of the communication itself.

By embracing a mix of emerging digital technologies, sophisticated analysis techniques and continual testing of changing creative and campaign variables, marketers find the optimal approach to deliver content and information to a customer in a manner far more efficient and effective than traditional direct marketing approaches.

Develop an approach to utilize data analysis outputs to
  1. understand customers better
  2. communicate with them more effectively
  3. measure effectiveness
  4. determine (through testing) which permutation of variables yields highest ROI
Customer Dynamics & Data Fusion: Revealing the Evolution of Customer Relationships
Gary Class, Wells Fargo & Company
Customer Dynamics explores how customers' behavior changes over time and the impact of those changes on the customers' relationship with the Firm.

Critical to the development of predictive models of Customer Dynamics is the acquisition and alignment of disparate data regarding the customers' interactions with the firm. These include both Behavioral data (direct observations of customers' behavior) and Attitudinal data (indirect assessment of customers' motivations and beliefs via survey). Data Fusion is a methodology to connect and calibrate these disparate data to understand the "why" behind the "what".

The scope of customer interaction data can be expanded beyond the traditional sources of "Structured" transactional data (from accounting and operational systems) to include "Semi-structured" data (e.g. middleware messages & server logs) and "Unstructured" data (inbound emails, agent notes in CRM systems, etc.).

Capitalizing upon the information assets resulting from Data Fusion, the firm can observe how customer portfolios evolve over time and identify the drivers of important Customer Dynamics such as customer retention & revenue growth.

Ultimately, these insights can provide a rich Decision Support tool-set that can enable business strategies and tactics that are information-based and highly scalable.
The High ROI of Data Mining for Innovative Organizations
John Elder, Elder Research, Inc.
Data mining can enhance your bottom line in three basic ways: by streamlining a process, eliminating bad cases, or highlighting the good. In rare situations, a fourth way -- creating something new -- is possible. But modern organizations are so effective at their core tasks that data mining usually results in an iterative, rather than transformative, improvement. Still, the impact can be dramatic.

Dr. Elder will share the story (problem, solution, and effect) of nine projects conducted over the last decade for some of America's most innovative agencies and corporations:

Streamline:
  • Cross-selling for HSBC
  • Image recognition for Anheuser-Busch
  • Biometric identification for Lumidigm (for Disney)
  • Optimal decisioning for Peregrine Systems (now part of Hewlett-Packard)
  • Quick decisions for the Social Security Administration
Eliminate Bad:
  • Tax fraud detection for the IRS
  • Warranty Fraud detection for Hewlett-Packard
Highlight Good:
  • Sector trading for WestWind Foundation
  • Drug efficacy discovery for Pharmacia & UpJohn (now part of Pfizer)
All of these projects were technical successes (sometimes astonishingly so). But some were business failures. From this cross-industry review, we'll discover what characteristics are key to a project succeeding in both realms.
Understanding Latent Semantics in Textual Data
Nick Evangelopoulos, University of North Texas & Terry Woodfield, SAS
Summarization and visualization of textual data typically involves Singular Value Decomposition (SVD), a matrix operation capable of reducing the dimensionality of the term-document structure by introducing a small number of latent semantic dimensions. Traditional textual analysis approaches utilize the SVD dimensions as attributes for clustering and other visualization techniques, or as inputs in predictive models, without making any effort to interpret them. This presentation will show how the SVD dimensions can be transformed into dimensions that can be easily labeled. The demystification of the SVD dimensions opens the way to a number of text mining possibilities. On the data exploration front, these include the summarization of collections of documents. On the predictive analytics front, they include the incorporation of meaningful inputs into interpretable models, such as Decision Trees. Several applications of this technique, including classification of news stories into meaningful topics, extraction of meaningful issues from customer feedback comments, and analysis of open-ended survey questions, will be discussed. The presentation will conclude with an illustration of topic extraction from a collection of textual data.
Financial Data Mining with Algorithmic Trading
Robert Golan, DBmind Technologies, Inc.
Algorithmic Trading has changed the way Traders trade and the way Trade Support supports them. A Brave New World is emerging as "hands on" Trading evolves into "hands off" Algo Trading. Not all trades need to be made with ultra-low latency. Future trading will rely on a broader set of data, which will be mined for relevance. An important series of XBRL Financial Reporting events is happening throughout the world, and especially in the USA. A critical mass of financial data will be ready for mining, which will be a boon for transparent "low touch" fundamental-style algorithmic trading. Also, "low touch" trading such as program trading & direct market access (DMA) will evolve into advanced Algo Trading strategies. Stock and economic indicators combined with XBRL will add value for Algo Trading. This is about a well-thought-out, strategic, high-latency trading strategy in which data mining discovers the governing rules while the expert rules are added with validation. Yes, the trader is still the key to making this all happen. Both fundamental and technical trading rules need to be combined with the expert rules, the data-mined rules, and most importantly the regulatory environment rules. RegNMS in the USA and MiFID in Europe have indirectly helped the adoption of electronic trading, and it is important to integrate the GRC-related rules in an agile way. Agility is the key, and thus the rules need to be placed into a rules engine and managed by the experts for proper compliance, risk management, and governance. With regard to XBRL, Japan, China, and the Netherlands are ready to be data mined with Algo Trading now. An XBRL US survey indicates that at least 340 of the estimated 500 public companies that the SEC requires to begin filing in XBRL format in June 2009 have already converted their financial statements into XBRL.
XBRL US is the non-profit XML standard setter that developed and maintains the US GAAP taxonomy used by filers to comply with the SEC mandate. Almost $7 trillion in market capitalization will be represented by this XBRL financial data, which is over 50% of the total market cap for all publicly traded companies reporting to the SEC. As this XBRL Financial Data ripens, a wonderful harvest awaits us data miners, which will enhance the current Algo Trading strategies that use this data.
Impact of an Analytics Program on an Emergent Economy: The Mexican Case
Viterbo H. Berberena González & Guillermo Hijar, Universidad Anahuac


The evolution of the program in analytics offered by Anahuac University since January 2005 is presented over a time frame beginning with the creation of executive courses and culminating with a Master's Degree in Analytical Intelligence. The various difficulties encountered in the process and their solutions will be emphasized, in particular from the standpoint of market needs and the receptiveness of the entrepreneurial community to the analytical approach as a competitive advantage.

The program in analytics, beyond its intrinsic value, has served as a methodological frame to reengineer and improve other graduate programs in the Graduate School of Engineering at Anahuac University.

Presentation Structure
  1. Background: Description of the analytics level at the beginning of year 2005
  2. The creation of a diploma degree course in Data Mining
    • The academic program design
    • Support from SAS Institute (materials)
    • Instructors' training
  3. Data mining course start-up (159 hours of instruction)
    • High demand vs. lack of instructors
    • Students' profile
    • First complaints and appraisals
    • Revision and improvement of the academic program (reduction to 120 hours of instruction)
    • A surge in demand for a formal graduate course in Analytic Intelligence
  4. Design and development of a Master's Degree in Analytic Intelligence
    • Research among Mexican Universities
    • Brainstorming with the Director of the Institute of Advanced Analytics and a visit to North Carolina State University
    • Brainstorming with the Director of SAS Global Academic Program and a visit to SAS Institute headquarters
Why Our Models Fail - Sometimes It's in the EDA
Dudley Gwaltney, SunTrust Bank
Everything about the model looks good: the goodness of fit, the lift, and the validation results. But when implemented, the model fails. Why? The answer might lie not in the final model, but in the Exploratory Data Analysis (EDA).

As modelers, sometimes we are so anxious to try out the latest modeling techniques that we forget that the EDA is the most important step in the process. The presentation will review how a thorough EDA can lead to a successful modeling project and an incomplete EDA to a failed one.

The presentation will include:
  • Is the target rate significant enough to support the project?
  • Is the population size large enough?
  • How important is it to understand missing values?
  • What are the data sources? Is the data consistent over time, are updates provided, etc.?
  • How will the model be implemented?

If the modeler does their due diligence during the EDA, the chances of success are much greater. If not, the chance of success is left to chance.
Impact on Customer Churn of Intensity of Sharing of User-generated Content at an Online Service
Zainab Jamal, HP Labs
Customers continue to increasingly interact online - sharing photos, videos, songs and other user-generated content. Research on social networks has ignored to a large extent the existence of social networks within the customer base of online services that have sharing functionalities. We believe the availability of such sharing functionalities increases the stickiness of these online services as it allows their customers to easily connect and share with their social network. In the absence of such functionalities, customers are more likely to move to competitive services that allow them to not only store and manage their digital content but also to share it with their social network.

In this study, we evaluate the value of sharing by establishing a link between the sharing activity of a customer as captured by RFM metrics adapted to sharing behavior and key performance metrics like customer churn and CLV.
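
The abstract does not spell out how RFM metrics are adapted to sharing behavior, so the following is only one plausible reading: recency as days since the last share, frequency as the number of sharing events, and volume of items shared standing in for the monetary component. The function name, field names and event data are hypothetical.

```python
from datetime import date


def sharing_rfm(share_events, as_of):
    """share_events: list of (date, items_shared) tuples for one customer."""
    if not share_events:
        return {"recency_days": None, "frequency": 0, "volume": 0}
    last_share = max(d for d, _ in share_events)
    return {
        "recency_days": (as_of - last_share).days,    # days since the last share
        "frequency": len(share_events),               # number of sharing events
        "volume": sum(n for _, n in share_events),    # assumed stand-in for "monetary"
    }


# Hypothetical sharing history for a single customer.
events = [(date(2009, 1, 5), 12), (date(2009, 2, 20), 3), (date(2009, 3, 1), 7)]
metrics = sharing_rfm(events, as_of=date(2009, 3, 31))
print(metrics)
```

Metrics of this kind could then enter a churn model as covariates alongside purchase-based recency, frequency and amount.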

We use a conditional hazard model for multiple events with heterogeneity to estimate the impact of sharing activity on the probability of customer churn. We also include other factors that may affect the churn rate, such as recency, frequency and amount of purchases, amount of digital content uploaded, number of other customers who viewed the content, and number of calls to the customer service center. We show that intensity of sharing reduces customer churn. Thus, the more customers share, the more likely they are to stay with the online service.
Optimal Design of Mailing Campaigns - Insights from a Series of Studies
Manfred Krafft, University of Münster
Textbooks and practitioner-oriented publications on direct marketing contain manifold - and often conflicting - recommendations for designing successful direct mail pieces. Academic research has tried to shed more light on the fuzzy issue of successful direct mail design and to provide answers to the following questions:
  1. What drives consumers to open direct mail pieces?
  2. What induces them to respond to a direct mail offering?
Numerous studies have considered only subsets of mailing design characteristics and are usually limited in the sense that only single campaigns have been investigated. Nevertheless, a few effects seem to be generalizable across studies and will be described. The shortcoming of current research is that the focal variable has been limited to response rate. Given that response rates are frequently single-digit percentages or even below 1%, one wonders what happens between mails being sent (100%) and customer response (1%). Therefore, based on a representative sample of 3,000 German households, we investigate the effects of
  1. various envelope characteristics on opening behavior, and
  2. design characteristics of the envelope content (i.e., the letter, brochure, and response device) on consumers' sustainable attention rate (SAR) of direct mail pieces.
SAR serves as a surrogate measure for actual response and covers the phenomenon that recipients keep mailings after opening and prior to actual response. Campaign mailing volume is included as a covariate. Analyzing 682 direct mail campaigns from non-profit organizations and financial service providers, we find that the design characteristics, along with mailing volumes, account for a substantial percentage of the variance in opening rate and SAR. Interestingly, the effects of mailing volume per campaign on opening rate and SAR are clearly non-linear. Mailing volume decisions also involve some trade-offs between getting consumers' attention and response. Finally, opening and retention behavior are uncorrelated, implying that opening a direct mail piece is only a necessary condition for responding to the offer, but not per se a driver of direct mail response. Consequently, both stages of the response process have to be optimized independently.
ApproxMAP : Intelligent Sequential Pattern Mining via Alignment
Hye-Chung Kum, University of North Carolina
The goal of sequential pattern mining is to detect patterns in a database comprised of sequences of itemsets. For example, retail stores often collect customer purchase records in sequence databases in which a sequential pattern indicates a customer's buying habit. In such a database, each purchase is represented as a set of items purchased together (an itemset), and a customer sequence is a sequence of such itemsets.

Sequential pattern mining is commonly defined as finding the complete set of frequent subsequences in such a database. However, the sheer volume of the results in the traditional support-based frequent sequential pattern mining methods has led to increasing interest in new intelligent mining methods to find more meaningful and compact results. One such approach is the consensus sequential pattern mining method based on sequence alignment. Consensus sequential patterns can detect general trends in a group of similar sequences, and may be more useful in finding non-trivial and interesting long patterns. They can be used to detect general trends in the sequential database for natural customer groups, which is more useful than finding all frequent subsequences in the database. Formally, consensus sequential patterns are patterns shared by many sequences in the database but not necessarily exactly contained in any one of them.
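
To make the traditional support-based definition concrete, the sketch below checks whether a pattern is a subsequence of a customer sequence and computes its support over a small database. It illustrates only the baseline notion of support, not ApproxMAP or any alignment step, and the purchase data and function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence: each pattern itemset is a
    subset of a distinct sequence itemset, in the same order."""
    position = 0
    for wanted in pattern:
        # Greedily advance to the earliest itemset that covers the wanted items.
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern itemset must match a later itemset
    return True


def support(database, pattern):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


# Hypothetical purchase histories: each customer is a sequence of itemsets.
database = [
    [{"bread", "milk"}, {"eggs"}, {"bread", "butter"}],
    [{"bread"}, {"butter", "eggs"}],
    [{"milk"}, {"eggs"}],
]
pattern = [{"bread"}, {"butter"}]
print(support(database, pattern))
```

A support-based miner reports every pattern whose support clears a threshold, which is why its output grows so quickly on long sequences.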

In this talk, we will describe ApproxMAP (APPROXimate Multiple Alignment Pattern mining), an effective algorithm for consensus sequential pattern mining. It has been successfully applied to many areas such as multi-database mining, temporal streaming data mining, and policy analysis. Furthermore, we will present a detailed comparison study of the alignment-based methods and support-based methods. Many of you will be familiar with the most commonly available support-based sequential pattern mining. The comparison study will illustrate how well ApproxMAP works in comparison.
Marketing Optimization for Increased Revenues
Choudur K. Lakshminarayan, HP Laboratories
We present a systematic and reliable way of assessing marketing effectiveness and optimally allocating investments among marketing instruments that HP employs to reach our customers based on a fixed budget. As marketing budgets are planned, the methods and tools we developed assist marketing managers to assess various allocation scenarios and determine the best investment strategy to maximize revenue. This program is unique in that it innovatively integrates econometrics, expert assessments of business conditions and mathematical optimization to build a unified framework for revenue/profit maximization based on a standardized platform. The methodology is flexible for adoption in other areas where optimization is involved.
Net Lift Prediction Models: How to Maximize Marketing Impact and What Data Miners Can Learn from Presidential Campaigns
Kim Larsen, Charles Schwab & Co
The true effectiveness of a marketing campaign is measured by quantifying its incremental impact, that is, the additional revenue directly attributable to the campaign.

Measuring incremental impact is typically done by creating a random control group that will not receive the offer. If the clients contacted by the campaign (a.k.a. the "test group") have better post-campaign performance than the control group, the campaign has been effective and further rollouts can be considered.

The problem is that targeting strategies are often not designed to maximize the incremental impact. Typical targeting models yield impressive conversion rates for the test group, but too often the results are equally impressive for the control group. In such cases, the incremental impact is insignificant and marketing dollars could have been spent elsewhere.

The purpose of this talk is to demonstrate how to build Net Lift Models (also referred to as Uplift Models) that optimize the incremental impact of marketing campaigns by maximizing the difference in conversion rates between the test and control groups. We will consider three different ways to build such models in SAS and discuss the pros and cons of each method.
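
The quantity such models target can be illustrated with a small calculation of the difference in conversion rates between test and control groups, computed per score segment. This is only a sketch of the measurement in Python, not one of the three SAS modeling approaches covered in the talk. The segment names and counts are hypothetical.

```python
def conversion_rate(conversions, size):
    return conversions / size


def net_lift(test, control):
    """Difference in conversion rate between contacted (test) and held-out
    (control) clients. Each argument is a (conversions, group_size) pair."""
    return conversion_rate(*test) - conversion_rate(*control)


# Hypothetical campaign results per model-score segment: (conversions, size).
segments = {
    "high score": {"test": (300, 1000), "control": (290, 1000)},
    "mid score": {"test": (150, 1000), "control": (80, 1000)},
    "low score": {"test": (40, 1000), "control": (35, 1000)},
}

# Rank segments by incremental impact rather than by raw conversion rate.
ranked = sorted(segments, key=lambda s: net_lift(**segments[s]), reverse=True)
print(ranked)
```

In this made-up example the highest-scoring segment converts best but would have converted almost as well without the campaign, so the middle segment is where the marketing spend has the most incremental effect.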
Text Mining to Discover Influential Communications in Social Movements
Randall LaViolette and Judy Spomer, Sandia National Laboratories
Social Movement Theory (SMT) is an area of study in Sociology and Political Science that provides an analytical framework for understanding the factors involved in organized social action. A social movement develops in response to an issue about which people are mobilized in an effort to solve the problem. Much of this research has focused on understanding the framing process, whereby a Social Movement Organization issues communications intended to influence the perceptions and direct the actions of the members of a community or general population. The web has become the principal medium for these organizations to distribute framing documents, i.e., those that describe an issue, identify victims, place blame, propose solutions, and ask readers to take action on an issue. Here, models that are able to discover small numbers of framing documents in a large corpus are developed by combining Latent Semantic Analysis techniques with classification modeling algorithms. The models themselves provide insight into the character of framing documents. Global warming was selected as the social movement upon which to base this study. Global warming framing documents, collected from web sites, were combined with non-framing documents that also address global warming. This corpus served to train and test statistical models that not only detected framing documents, but further classified these by framing task (diagnostic, prognostic, motivational) with high accuracy. The accuracy was assessed both internally and against more straightforward approaches that were uninformed by SMT. These methods were implemented with SAS software and serve as a resource for the study of both SMT and active social movements.
Long-Term Value Modeling in the Automobile Industry
Bruce Lund, Marketing Associates LLC
Businesses often classify their customer base in terms of the customers' predicted long-term value (LTV). LTV may influence marketing strategies, particularly CRM and concern resolution. This paper describes an approach to LTV modeling in the automobile industry.

Defined: LTV provides an estimate of the time-adjusted profits to an automobile company from future new-vehicle purchases by a household.

Two statistical models will be discussed:
  1. A "Vehicle-Segment Choice Model" which predicts the likelihood that the next new-vehicle purchase from the Company will be within a particular vehicle segment.
  2. A "Time to Next Acquisition Model" which provides: (A) the probability that the household will ever make another new-vehicle purchase from the Company, and (B) assuming the household will make another new-vehicle purchase, the probability distribution for the household of the time in months to this next purchase.

Lastly, the LTV calculation will be described.
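
The abstract does not give the calculation, so the following is only a hypothetical sketch of how the outputs of the two models might be combined into a time-adjusted profit estimate. It covers a single next purchase, assumes that purchase timing and vehicle segment are independent, and uses made-up probabilities, per-segment profits and discount rate.

```python
def long_term_value(p_ever, months_pmf, segment_probs, segment_profit, annual_rate):
    """Expected discounted profit from a household's next new-vehicle purchase.

    p_ever         probability the household ever buys again from the Company
    months_pmf     {months until purchase: probability}, given a purchase occurs
    segment_probs  {vehicle segment: probability the purchase falls in it}
    segment_profit {vehicle segment: assumed profit per vehicle}
    annual_rate    assumed annual discount rate
    """
    expected_profit = sum(segment_probs[s] * segment_profit[s] for s in segment_probs)
    # Discount each possible purchase month back to today.
    discount = sum(p / (1 + annual_rate) ** (m / 12) for m, p in months_pmf.items())
    return p_ever * expected_profit * discount


# Hypothetical inputs for one household.
ltv = long_term_value(
    p_ever=0.6,
    months_pmf={12: 0.5, 24: 0.5},
    segment_probs={"car": 0.5, "truck": 0.5},
    segment_profit={"car": 1000.0, "truck": 3000.0},
    annual_rate=0.10,
)
print(round(ltv, 2))
```

Extending the sketch to several future purchases would mean repeating the calculation for each subsequent acquisition, which the abstract's definition of LTV ("future new-vehicle purchases") suggests the actual model does.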

The implementation of the model for a major automobile company will be outlined and performance metrics will be discussed.
Experiences on Experiment Design in Direct Marketing
Riku Mäkeläinen, TeliaSonera
TeliaSonera is the leading Nordic and Baltic telecommunications operator. After several years of good experiences with predictive modeling, the Swedish Broadband marketing unit decided to further improve direct marketing by using design and analysis of experiments when selecting target customers. This presentation focuses on sharing experiences from the development work to date. The presentation includes a concrete case. Tools used include SAS QC/ADX, SAS Enterprise Miner, SAS Campaign Studio, and SAS STAT.
Why Just Count Crime When You Can Prevent it: Changing Public Safety Outcomes with Operational Analytics
Colleen McCue, MC2 Solutions, LLC
Doing more with less. Why just count crime when you can anticipate, prevent and respond more effectively? Operationally relevant and actionable analysis allows command staff and police managers to leverage advanced analytics in support of meaningful, information based tactics, strategy and policy decisions in the applied public safety and security environment. As the law enforcement community increasingly is asked to do more with less, these methods represent an opportunity to prevent crime and respond more effectively, while optimizing increasingly scarce or limited resources. Case studies will include risk-based deployment and resource allocation, or "just-in-time" policing, illegal narcotics markets, hostile surveillance, and the behavioral analysis and modeling of violent crime to include homicide, aggravated assault, robbery, and sexual assault. Ultimately, the incorporation of meaningful, operationally relevant and actionable analysis into information based police tactics, strategy and policy promises to increase public safety and change outcomes.
Active Use of Data Mining in the Customer Life Cycle Management Process of a Telecom Operator
Robert Moberg and Mattias Andersson, TRE
The challenge of a Telecom operator, besides recruiting new customers, is to reduce the attrition rate and at the same time maintain or increase the average revenue per user, all under the greater mission to run a profitable business.

This presentation will give deeper insight into how the Swedish branch of the global Telco 3 works with strategic Customer LifeCycle Management (CLM) and the critical contributions from SAS Analytics and SAS Enterprise Miner in this process. The highly individualized communication reaches its targets with high precision via a wide variety of channels, e.g., Invoice, SMS, MMS, DM, TM, and Self-service. Besides increased lift values and higher yield on the campaigns, everything is now launched with less effort and more efficiently in an automated process.
Strategic Marketing Analytics During Turbulent Times
Will Neafsey, Ford Motor Company
The past 12-18 months will long be remembered as a chaotic and trying time for business and personal endurance. During this time, we have seen the collapse and disappearance of businesses and industries we all presumed would outlast our natural lives. Marketing executives and managers in all industries are looking backward and questioning every decision made leading up to today. In this tumultuous economy, today's target customers may be the same people marketers ignored as little as a year ago.

The purpose of this talk is to discuss a variety of strategic marketing analytic techniques like segmentation and predictive modeling. These techniques will continue to be some of our best weapons throughout the economic turmoil and the recovery that lies ahead. General examples from the automotive industry will be used to illustrate the benefits and potential pitfalls of using advanced analytics in these uncertain times.
Exploiting randomness for aCRM choice modeling: Random MultiNomial Logit and Random Interaction MultiNomial Logit
Anita Prinzie, University of Manchester
Random Forests (RF) is a successful classifier that exhibits performance comparable to Adaboost but is more robust. The exploitation of two sources of randomness, random inputs (bagging) and random features, makes RF an accurate classifier in several domains. We hypothesize that methods other than classification or regression trees could also benefit from injecting randomness. We generalize the RF framework to other multiclass classification algorithms like the well-established MultiNomial Logit (MNL). We propose Random MNL (RMNL) as a new bagged classifier combining a forest of MNLs estimated with randomly selected features. The Random MNL only includes main effects. However, given the omnipresence of interaction effects in consumer behaviour, refraining from the assessment of the predictive value of interaction effects in RU models is unacceptable.

Therefore, we present a Random Interaction MultiNomial Logit (RIMNL) model reintegrating interaction effects in the input space of a MultiNomial Logit model. In a first step, a Random MultiNomial Logit (RMNL) is estimated, building a forest of multinomial logits with randomly selected variables, of which 60% are randomly selected main effects and 40% are two-way interaction effects created from two randomly selected main effects. From the raw variable importance scores, we select a subset of best main effects as well as a subset of best interactions. In a second step, an RMNL model estimates multinomial logit models, randomly selecting features from the feature space that integrates the best main effects as well as the best interaction effects from step 1. The RIMNL model is applied to the same data as the RMNL model. The results prove RIMNL's ability to detect powerful interactions.
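The feature draw in the first step can be illustrated with a minimal Python sketch. It is not the author's implementation: the function name, the variable names x1..x10, and the choice of five features per logit are all assumptions made for illustration, and only the 60/40 split between main effects and two-way interactions comes from the abstract.

```python
import random
from itertools import combinations


def sample_rimnl_features(main_effects, n_features, interaction_share=0.4, rng=None):
    """Draw the feature set for one multinomial logit in the forest.

    Returns (mains, interactions): about 60% main effects and 40% two-way
    interactions, each interaction being a pair of distinct main effects.
    Hypothetical helper, not taken from the presentation.
    """
    rng = rng or random.Random()
    n_inter = round(n_features * interaction_share)
    n_main = n_features - n_inter
    # Random subset of main effects for this member of the forest.
    mains = rng.sample(main_effects, n_main)
    # Picking a pair uniformly is the same as picking two distinct main effects.
    all_pairs = list(combinations(main_effects, 2))
    interactions = rng.sample(all_pairs, n_inter)
    return mains, interactions


main_effects = [f"x{i}" for i in range(1, 11)]  # hypothetical variable names
mains, interactions = sample_rimnl_features(
    main_effects, n_features=5, rng=random.Random(0)
)
print(mains, interactions)
```

Repeating this draw once per logit, and fitting each logit on a bootstrap sample, would give the forest from which the importance scores are computed.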
The power of the Group Processing Facility in SAS Enterprise Miner
Sascha Schubert, SAS
The group processing facility in SAS Enterprise Miner provides users with the power to create loops over analytical models in very customized ways. With this, users can automate the model-building process to:
  • build predictive models based on pre-defined segments, such as gender or age groups, or value-based grouping;
  • apply automated model performance optimization routines, such as bagging and boosting;
  • apply cross-validation techniques to test model robustness.
This presentation will provide background on the different usage options of the group processing facility as well as examples using SAS Enterprise Miner.
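As a rough analogy in plain Python (not SAS Enterprise Miner code), looping a model over pre-defined segments might look like the sketch below. The function name, the field names, and the use of a segment mean as the "model" are assumptions chosen to keep the example small.

```python
from collections import defaultdict
from statistics import mean


def fit_by_segment(rows, segment_key, target_key):
    """Fit one stand-in model per segment and return them keyed by segment.

    The "model" here is just the segment mean of the target; a real workflow
    would fit a predictive model inside the loop instead.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_key]].append(row[target_key])
    return {segment: mean(values) for segment, values in groups.items()}


# Hypothetical customer records.
rows = [
    {"gender": "F", "spend": 120.0},
    {"gender": "M", "spend": 80.0},
    {"gender": "F", "spend": 100.0},
    {"gender": "M", "spend": 60.0},
]
models = fit_by_segment(rows, segment_key="gender", target_key="spend")
print(models)
```

The same loop structure carries over to bagging or cross-validation by replacing the segment key with a bootstrap-sample or fold index.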
Trajectories Mining
Shusaku Tsumoto, Shimane University
This presentation shows a method for grouping trajectories of two or three temporal variables. Our method employs a two-stage approach. First, it compares two trajectories based on their structural similarity and determines the best correspondence of partial trajectories. Then, it calculates the value-based dissimilarity for all pairs of matched segments and outputs their total sum as the dissimilarity of the two trajectories. Experimental results on several datasets demonstrate that our method can capture the structural similarity between trajectories even in the presence of noise and local differences, and can provide better proximity for discriminating objects.
Incorporating Fuzzy Cluster Memberships within Enterprise Miner
Donald Wedding, SAS
Clustering is the process of placing data records into homogeneous groups. Members of each group are similar to one another and highly different from members outside the group. There are two types of clustering methodologies that are primarily used: hard clustering and fuzzy clustering. In hard clustering, membership is absolute and mutually exclusive. A record is completely a member of a cluster or it is not a member. In fuzzy clustering, a record is permitted to have partial membership so that it is allowed to be in more than one cluster. For example, a record might have 60% membership in CLUSTER A and 40% membership in CLUSTER B. Enterprise Miner 5.3 does not incorporate fuzzy clusters, but it is possible to generate fuzzy memberships of hard clusters. This presentation will provide a brief description of fuzzy clusters and will present a tutorial on incorporating fuzzy memberships within Enterprise Miner.
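The abstract does not say how the memberships are derived. One common option, shown here only as an assumption, is to apply the fuzzy c-means membership formula to the distances between a record and the centroids of the hard clusters. The centroids, the record, and the fuzzifier m = 3 below are made-up values chosen so the result mirrors the 60%/40% example.

```python
import math


def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means style memberships of one record in each hard cluster.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance
    from the record to centroid i and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centroids]
    # A record sitting exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


centroids = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical centroids of CLUSTER A and B
u = fuzzy_memberships((4.0, 0.0), centroids, m=3.0)
print([round(x, 3) for x in u])
```

With m = 3 the memberships are proportional to inverse distance, so a record 4 units from A and 6 units from B is about 60% in A and 40% in B. Smaller values of m push the memberships toward the hard assignment.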
Sequence Analysis Technique in Business
Katsutoshi Yada, Kansai University, Osaka, Japan
This presentation introduces and demonstrates the effectiveness of the E-BONSAI system for examining consumer purchasing behavior. The system was developed from BONSAI, which was originally designed for the character string analysis required in the field of gene analysis. We adapted E-BONSAI to analyze consumers by examining chronological purchasing patterns expressed as character strings. This process revealed useful information from massive amounts of customer purchase history data. This presentation describes a case where, based on information acquired through the above methodology, a new sales promotion strategy at Japanese supermarkets was implemented and proven to be effective.
Generate Rating Tiers Using Different GLM Modeling Approaches: A Topic of Predictive Modeling on Personal Lines Insurance Pricing
Jun Yan, Deloitte Consulting LLP
Since the late 1990s, predictive modeling has been widely used as a strategic tool for Property and Casualty (P&C) insurance companies to compete in the marketplace. Originally introduced in personal auto insurance to improve pricing precision and risk segmentation, predictive modeling has been extended to homeowner's and small commercial lines as well. Recently, predictive modeling and the use of generalized linear models (GLM) have been applied widely in most areas of P&C insurance operations.

In predictive models for pricing, the main focus is on predicting loss cost, determining premium to charge, evaluating rating adequacy, or determining rating class plan factors. One typical result developed from a pricing model is a rating plan, which displays the rating variables, factors and loss cost relativities across the rating variables.

In this presentation, we will describe some GLM based modeling methodologies for generating rating tiers on top of an existing rating plan. Those methodologies can be applied to modify a rating plan in numerous circumstances which include involving non-traditional rating variables in a rating plan or dealing with state regulation changes. Meanwhile those methodologies also can help to reduce business disruptions, such that the adjusted rating plan can be implemented for both new business and renewal business.

For the purpose of rating tier development, we will describe and compare three modeling approaches; each of them uses different target variable(s).
  1. An approach of separately modeling claim frequency and claim severity;
  2. An approach of modeling pure premium;
  3. An approach of modeling loss ratio.
Throughout the presentation, we will use simulated personal auto insurance data and SAS code to explain various modeling techniques, such as adjusting data structure, using GLM offset options, adjusting exposure and setting up Tweedie models.
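The target variables behind the three approaches are related by simple arithmetic, which the following Python sketch illustrates using the standard actuarial definitions. The policy records are invented, and the sketch only computes the targets; it does not fit a GLM or reproduce the presenter's SAS code.

```python
# Made-up policy records: exposure in car-years, earned premium,
# claim count, and incurred loss.
policies = [
    {"exposure": 1.0, "premium": 800.0, "claims": 0, "loss": 0.0},
    {"exposure": 0.5, "premium": 450.0, "claims": 1, "loss": 1200.0},
    {"exposure": 1.0, "premium": 950.0, "claims": 2, "loss": 3000.0},
]


def rating_targets(rows):
    """Aggregate the target variables used by the three modeling approaches."""
    exposure = sum(r["exposure"] for r in rows)
    premium = sum(r["premium"] for r in rows)
    claims = sum(r["claims"] for r in rows)
    loss = sum(r["loss"] for r in rows)
    return {
        "frequency": claims / exposure,                 # claims per unit of exposure
        "severity": loss / claims if claims else 0.0,   # average loss per claim
        "pure_premium": loss / exposure,                # loss per unit of exposure
        "loss_ratio": loss / premium,                   # loss per unit of premium
    }


targets = rating_targets(policies)
print(targets)
```

Pure premium equals frequency times severity, which is why the first approach models two targets while the second models their product directly. The loss ratio divides by premium instead of exposure, so it measures performance relative to the existing rating plan.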