SAS M2004

The following abstracts have been provided:

Mihael Ankerst, Boeing
Cooperative data mining: Tightly integrating data mining with visualization
The exponential growth of data and the inability of humans to process large datasets have led to the idea of automating the data mining process. This talk highlights the danger of neglecting the human factor in the mining process and proposes mining as a cooperative effort between the user and the computer. The fields of data mining and information visualization offer various techniques that effectively complement one another in supporting the discovery of patterns in the data. Whereas traditional (algorithmic) techniques analyze the data automatically, information visualization techniques can support the data mining process from an orthogonal direction by providing a platform for understanding the data and generating hypotheses about the data based on human capabilities such as domain knowledge, perception, and creativity.

This talk discusses the benefits and challenges of tightly integrating visualizations with mining algorithms and presents examples based on first prototypes.



Jason Bargen, Hallmark Cards, Inc.
Data Mining methods used on Hallmark's Gold Crown Card Consumers
Loyalty is not something you purchase. It is something you earn. Loyal consumers can generate good profits for companies, and a way for companies to become more profitable is to attract and retain these dedicated consumers. At Hallmark we are using different data mining techniques to better understand consumers' behavior. These data mining techniques benefit both Hallmark and its consumers. We are using this information to retain our loyal consumers and increase the number of Gold Crown Card members in our database. We also use this information to do a better job of providing consumers with relevant messages/offers that meet their needs. Our overall goal is to increase the number of loyal members by 40% in the next 4 years. I will talk about the different data mining methods used to help Hallmark understand and retain its current consumers and the tests underway to engage new consumers.



Joe Bartling, H&R Block
Design and Analysis of Market Tests at H&R Block
I present and discuss the processes and methodology utilized at H&R Block to design and analyze various types of retail market level tests. The various types of testing applications discussed are appropriate for testing media type, media weight, marketing programs, promotional programs, operational changes, and price testing. I discuss how we approach the selection of test and control markets and/or retail locations to maximize learning. I also discuss how we determine the relevant test metrics (e.g. lifetime value, units, revenue, etc.) and then calculate the appropriate test advantage. In addition, as all of our testing for the tax business must occur within a four month span every year, there are always last minute test ideas. I will discuss how we prepare in advance to accommodate these last minute tests.



Eugenia Bastos, SAS (with Russ Wolfinger, David Duling and Leonardo Auslender, SAS)
Data Mining Breast Cancer Clinical and Expression Data
One in eight women in the US will develop breast cancer. Of these, one third will progress to fatal metastatic cancer (Peto et al., Lancet, 2000). Gene expression profiles have the potential to improve the accuracy of metastasis classification when compared with the classical clinical variables.

In this study, we re-analyze clinical and microarray data of primary breast cancer tumor tissues from 98 patients (van't Veer et al., Nature, 2002), with the primary objective of predicting metastasis status. The study sample includes sporadic and BRCA1-positive cases, of which 47.4% progressed to metastasis, only 18.6% had lymph node involvement, 61.9% were high grade, and 71.1% developed angioinvasion. Patients are classified into two groups according to their metastasis status: patients who were disease free for at least 5 years form the "good prognosis" group, and patients who developed metastasis within 5 years form the "poor prognosis" group.

We begin with the 78 sporadic cases, from which the original authors developed a classifier based on 70 out of ~25,000 genes determined by a filtering and cross-validation approach using correlation coefficients. They report a 16.7% misclassification rate for leave-one-out cross validation (LOOCV). We present an approach based on analysis-of-variance filtering and cross-validated stepwise discriminant analysis, which results in a classifier based on 13 genes and an LOOCV error rate of 3.8%.

Next, we extend consideration to all 98 cases and their associated clinical covariates. We consider additional mining approaches such as logistic regression, decision trees, and prediction based on profiles obtained by unsupervised k-means clustering. Due to the observational nature of the data, we advocate extensive cross-validation to help ensure generalizability. Substantial gains in predictive performance are evident.
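
For readers unfamiliar with leave-one-out cross validation (LOOCV), the error-rate calculation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis: the nearest-centroid classifier and the toy data are assumptions standing in for the stepwise discriminant analysis and the gene expression data described above.

```python
from statistics import mean

def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_error_rate(xs, ys):
    """Hold out each case once, fit on the rest, and count misclassifications."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)

# Toy, made-up data: two features per case, two prognosis groups.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 1.1], [0.9, 1.2], [1.2, 0.9]]
ys = ["good", "good", "good", "poor", "poor", "poor"]
print(loocv_error_rate(xs, ys))
```

If variables are filtered before modeling, the filtering step would need to be repeated inside each fold; otherwise the held-out case influences the choice of variables and the error estimate is optimistic.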

Eugenia Bastos, PhD, is an analytical consultant for the Life Sciences Organization at SAS Institute, San Francisco, California. E-mail: Eugenia.bastos@sas.com.

Russ Wolfinger, PhD, is Director of Genomics, SAS Institute Inc., Cary, NC.

David Duling, PhD, is Director of Enterprise Miner Development, SAS Institute Inc., Cary, NC.

Leonardo Auslender is a Statistician in Enterprise Miner Development, SAS Institute Inc., NC.



Tom Bradshaw, Bank of America
A Statistical (and yet Non-Traditional) Approach to the Design, Optimization, and Analysis of Matched Market Testing Using Linear Models and Time Series Behavioral Data
The analysis of mass media advertising and similar market level activities is often based on a pre/post comparison of two Matched Markets. Experience at the Bank of America has demonstrated that, because the market selection process has historically not been statistically and behaviorally based, the results have frequently been inconclusive or misleading. This presentation describes a statistically based approach we have developed that uses behavior-based time series data and SAS linear model procedures to select the best test and control markets from a large pool of candidates, evaluate and optimize their discriminative sensitivity prior to the test, and analyze the post-test results. The methodology is demonstrated using data from an actual matched market test. Examples of SAS code are provided.



James Cappel, Central Michigan University
A Survey of Practices and Opinions about Business Intelligence
Many companies are still struggling with how best to position business intelligence within their organization and optimize their investments in data mining and BI. This study was conducted on behalf of the Central Michigan University Research Center and sponsored by The Dow Chemical Company, Ford Motor Company, and Eli Lilly and Company. This investigation involved a web-based survey of hundreds of business professionals from large organizations. The survey addressed various issues such as: the scope, structure, resources, drivers, and evaluation of BI practices within companies, as well as opinions about the perceived role, effectiveness, and importance of BI to companies. The results provide a baseline for organizations to assess their progress along the BI continuum and they raise thoughts about how companies may potentially improve their BI practices.



Robert Chu, SAS
On-Demand SAS Predictive Model Scoring
Application domains such as call centers, front office, and credit approval are requiring on-demand predictive model scoring. This presentation will review on-demand scoring issues and walk through all the needed steps live to deploy SAS predictive models. Steps include (1) create models in SAS Enterprise Miner, (2) register models in a SAS metadata repository, (3) score in real-time with registered models, (4) score in batch with registered models, and (5) summarize and report scoring results. Software tools used in the presentation include SAS Enterprise Miner, SAS Integration Technologies, messaging queues, and relational databases.



Jay Coleman & Allen Lynch
Method to Madness: Identifying the NCAA Tournament "Dance Card"
Every March, basketball fans across America prepare for an annual phenomenon called March Madness, where 64 NCAA college teams compete in the men's and women's tournaments to determine the national champions. Along with the tournaments come the nationwide rituals of office pools, brackets ... and controversy over which teams make the cut for the tournament, a.k.a., the Big Dance. The at-large selections by the NCAA Basketball Tournament Selection Committee inevitably leave some teams and fans elated and others feeling snubbed.

But two college professors seem to have discovered a method to March Madness. Professors Jay Coleman, an operations management professor at the University of North Florida in Jacksonville, and Allen Lynch, an economics professor at Mercer University in Macon, Georgia, have predicted the NCAA tournament teams with stunning accuracy. Over the past 10 years, their analytic-powered NCAA "Dance Card" has boasted an impressive 94 percent accuracy rate.

So, how do they do it? They have tapped into predictive modeling technology from SAS. Coleman and Lynch's intriguing Dance Card model is a colorful example of the predictive power of analytics. "The same techniques used in the Dance Card equation can be directed to a plethora of business applications. The possibilities are quite exciting," Lynch says.

The professors' selection process has been featured this past year in The Wall Street Journal, the Associated Press, eWeek, DM Review and a number of other newspaper and trade publications. Coleman and Lynch are not only passionate basketball fans but fans of unique data mining techniques, and colorful speakers. They are eager to share their story among their peers in the data mining and analytics communities.



Jim Cox, SAS
Text Miner Tips and Techniques
Text Mining is an exciting field with many varied and far-ranging applications. Unfortunately, there is no roadmap explaining how to apply it to specific problems. This workshop presents many approaches, using SAS macros with the SAS Text Miner product, for dealing with common issues people have when trying to apply Text Mining to their application. Assistance is given on how to: 1) automatically create synonym lists composed of misspellings in the data, 2) determine how many dimensions to use when compressing the textual representation, 3) take care of messy data, including extraneous punctuation, and 4) visualize clusters in an interactive fashion.



Rhonda Drake, Drake Direct
Perry D. Drake, Drake Direct

Data Mining a Non-Profit File for High Value Donors and Planned Givers
A Non-profit organization needed to identify best donors in order to focus planned giving activities against the donors most likely to develop into "Planned Givers" and other "High Value Donors."

The challenges regarding the objective were:
  • highly seasonal donor campaigns
  • an acquisition strategy that favored the use of premiums
  • a retention donor strategy that leveraged RFM principles rather than treating all donors equally for a period of time
  • mixed use of premiums in the retention strategy
  • a declining acquisition and retention response due to name trading arrangements with other non profits.
The successful development of identifying "Planned Givers" and "High Value Donors" began with the creation of cohort groups with appeal and donation activity frozen in time relative to a donor's first gift.

Once the cohort groups were defined, the data were explored through factor analysis followed by segmentation and regression modeling to understand the importance of various appeal strategies and their correlation with "Planned Giver" activity. In particular, we examine how premiums with appeals, holiday appeals, and a club approach to appeals affect the likelihood of a donor becoming a "Planned Giver."

Using this combined segmentation and modeling strategy, the rate at which planned givers were identified rose from 1.1 percent to over 7 percent. This is more than a sixfold improvement in the ability to identify this highly desirable group.
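
The size of that improvement can be checked with simple arithmetic. The sketch below uses only the two rates quoted in the abstract; because the abstract says "over 7 percent", 7 percent is treated here as a lower bound, so the figures are lower bounds as well.

```python
baseline_rate = 0.011  # 1.1 percent, from the abstract
targeted_rate = 0.07   # "over 7 percent": 7 percent used as a lower bound

ratio = targeted_rate / baseline_rate   # how many times more planned givers are found
lift_index = 100 * ratio                # the same ratio as an index (100 = no gain)
percent_increase = 100 * (ratio - 1)    # relative increase over the baseline

print(f"ratio={ratio:.2f}, lift index={lift_index:.0f}, increase={percent_increase:.0f}%")
```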



David Duling, SAS
Computational Performance in Data Mining
Data Mining is often defined as the process of finding patterns in large databases, suggesting that predictive models are often built on data with large numbers of observations and/or variables. Statistical methods must be chosen and implemented carefully for scalability. Customers are increasing their needs for predictive models and statistical analysis and the amount of data acquired and stored is growing at unimaginable rates. To meet that need, hardware vendors are providing faster systems and both SMP and MPP capabilities, and main memory and disk storage capacities are growing at superlinear rates. These trends point to a future where data mining and computational performance will be codependent. This talk will review some methodologies and present some findings on distributing predictive model training and scoring processes.



William DuMouchel, AT&T Labs
Empirical Bayes Methods for Postmarketing Surveillance of Adverse Drug Reactions
Because of practical limits in characterizing the safety profiles of therapeutic products prior to marketing, manufacturers and regulatory agencies perform post-marketing surveillance based on the collection of adverse drug reaction (ADR) reports ("pharmacovigilance"). The resulting databases, while rich in real-world information, are notoriously difficult to analyze using traditional techniques. Each report may involve multiple medicines, symptoms, and demographic factors, and there is no easily linked information on drug exposure in the reporting population. Data mining techniques, such as association finding, are being used to screen for previously unknown ADRs. This presentation will discuss attacks on two problems encountered during application of empirical Bayes methods to such data:
  1. Interpreting polypharmacy effects due to frequent co-occurrence of multiple drugs in the same report, and
  2. Coping with the very fine-grained adverse event coding system, called MedDRA Preferred Terms (PT), by clustering PTs that have similar ADR profiles across several thousand drugs.
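
As background for the screening step, a common starting point is to compare the observed number of reports for each drug-event pair with the number expected if drugs and events were reported independently. The Python sketch below computes only that raw observed/expected ratio on made-up reports; the empirical Bayes shrinkage discussed in the talk is not implemented, and the function name and data are hypothetical.

```python
from collections import Counter

def reporting_ratios(reports):
    """Return observed/expected report counts for every drug-event pair seen.

    reports: list of (set_of_drugs, set_of_events), one tuple per report.
    Expected counts assume drugs and events are reported independently.
    """
    n = len(reports)
    drug_counts, event_counts, pair_counts = Counter(), Counter(), Counter()
    for drugs, events in reports:
        drug_counts.update(drugs)
        event_counts.update(events)
        pair_counts.update((d, e) for d in drugs for e in events)
    return {
        (d, e): observed / (drug_counts[d] * event_counts[e] / n)
        for (d, e), observed in pair_counts.items()
    }

# Made-up reports; the third one lists two drugs (polypharmacy).
reports = [
    ({"A"}, {"rash"}),
    ({"A"}, {"rash"}),
    ({"A", "B"}, {"nausea"}),
    ({"B"}, {"nausea"}),
    ({"B"}, {"rash"}),
    ({"C"}, {"nausea"}),
]
ratios = reporting_ratios(reports)
print(ratios[("A", "rash")], ratios[("A", "nausea")])
```

In the third report both drugs are credited with the nausea event, which is a small instance of the polypharmacy problem named in item 1: the raw ratio cannot tell which drug, if either, is responsible.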


Edward Gaffin, Walt Disney World Resort
Data Mining and Business Intelligence: The Foundation for Building Effective Marketing Models for a Highly Segmented Product Set
The challenge: How to transform resort operational data and distribute it as meaningful business information for use by all levels of marketing and financial managers. This presentation demonstrates how the CRM division of Walt Disney World has assembled a SAS intranet solution to deliver valuable guest information to users' desktops across the financial and marketing disciplines. Combining the insight of all stakeholders with the expertise of the modeling and analytics team improves our ability to create a series of predictive models that most efficiently target our resort marketing efforts.

Jim Georges, SAS
Using non-numeric data in parametric prediction
Most strategies for representing non-numeric data in parametric predictive models involve some form of recoding or transformation of the non-numeric data into numeric data. This talk illustrates the deficiencies of some commonly used approaches such as weight-of-evidence transformations and makes recommendations on improving results. Specifically, an easy-to-implement technique inspired by Bayesian statistics is introduced which may be used to smooth the weight-of-evidence transformations and better represent non-numeric inputs in predictive models. The technique may be easily extended to also aid in the analysis of proportions or profile data.
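
To make the idea concrete, here is a minimal Python sketch of one way to smooth a weight-of-evidence (WoE) transformation. It is an assumption for illustration, not necessarily the technique presented in the talk: each category's counts are padded with pseudo-counts drawn from the overall event rate, so that sparse categories are pulled toward a WoE of zero. The function name, the `prior_strength` parameter, and the counts are all hypothetical.

```python
import math

def smoothed_woe(counts, prior_strength=10.0):
    """Smoothed weight of evidence per category.

    counts maps category -> (events, non_events). prior_strength is the number
    of pseudo-observations added to each category at the overall event rate.
    Assumes the data contain at least one event and one non-event overall.
    """
    total_events = sum(e for e, _ in counts.values())
    total_non = sum(n for _, n in counts.values())
    p = total_events / (total_events + total_non)
    overall_log_odds = math.log(total_events / total_non)
    woe = {}
    for cat, (e, n) in counts.items():
        e_s = e + prior_strength * p          # padded event count
        n_s = n + prior_strength * (1 - p)    # padded non-event count
        woe[cat] = math.log(e_s / n_s) - overall_log_odds
    return woe

# Made-up counts; category "C" has no events, so its raw WoE would be undefined.
counts = {"A": (50, 450), "B": (30, 470), "C": (0, 5)}
woe = smoothed_woe(counts)
print(woe)
```

With no smoothing, category "C" would require the logarithm of zero. With the padding it receives a finite, moderately negative value, and as `prior_strength` grows every category's value approaches zero.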



Paolo Giudici, University of Pavia, Italy
Graphical models for web usage mining
In the talk we show how data mining models for clickstream data can be profitably built and used to predict the visit behavior at a website.

The aim of the analysis is to track the most important patterns of visits, where a pattern means a time-ordered sequence of pages, possibly repeated. The measure of importance adopted determines the results of the analysis. The most common measures refer to the probability of visiting a certain sequence (support), to the conditional probability of seeing a certain page having seen others in the past (confidence), or to the lift. These measures can be reinforced with an inferential statement.
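
The three measures can be illustrated with a short Python sketch. The definitions below follow one simple convention, assumed here for illustration: a rule "a then b" holds in a session if page b appears anywhere after the first visit to page a. The sessions and page names are made up.

```python
def follows(session, a, b):
    """True if page b appears somewhere after the first occurrence of page a."""
    if a not in session:
        return False
    return b in session[session.index(a) + 1:]

def rule_measures(sessions, a, b):
    """Support, confidence, and lift of the rule 'a then b'.

    Assumes pages a and b each occur in at least one session.
    """
    n = len(sessions)
    support_a = sum(a in s for s in sessions) / n
    support_b = sum(b in s for s in sessions) / n
    support_ab = sum(follows(s, a, b) for s in sessions) / n
    confidence = support_ab / support_a
    lift = confidence / support_b
    return support_ab, confidence, lift

# Made-up sessions: each list is the time-ordered pages of one visit.
sessions = [
    ["home", "catalog", "cart"],
    ["home", "catalog"],
    ["home", "help"],
    ["catalog", "cart"],
]
support, confidence, lift = rule_measures(sessions, "catalog", "cart")
print(support, confidence, lift)
```

A lift above 1 indicates that seeing the first page makes the second page more likely than its overall frequency would suggest.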

The data are assumed to be extracted from a log file that records accesses to a website. They are structured in a transactional database format. They may be further simplified into a data matrix format, but in doing so the information on the temporal order of the pages seen is lost.

The data mining models that should be employed are examples of local models, or patterns. In the talk we first compare sequence rules, based on the apriori algorithm, with results from link analysis. We shall also compare direct and indirect sequence rules. Later, we consider more traditional statistical techniques, applied to the whole dataset (global), although based on local computations: graphical models, probabilistic expert systems and Markov chain models.

In terms of model assessment and comparison, it is rather difficult to assess local models; consequently, it is also hard to compare them with global statistical models, not least because they are often based on rather different assumptions. However, certain comparisons are possible. For instance, it is possible to compare sequence rules with Markov chains. Our results show that the most probable patterns identified by the two procedures are rather similar. The choice between a local and a global method thus depends on the scope of the rules themselves.
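
For comparison with sequence rules, a first-order Markov chain can be estimated from the same kind of session data by counting page-to-page transitions. The Python sketch below is a minimal illustration under that first-order assumption, with made-up sessions; it is not the model fitted in the talk.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive page pairs."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for current, nxts in counts.items()
    }

# Made-up sessions: each list is the time-ordered pages of one visit.
sessions = [
    ["home", "catalog", "cart"],
    ["home", "catalog"],
    ["home", "help"],
    ["catalog", "cart"],
]
probs = transition_probabilities(sessions)
print(probs["home"])
```

Unlike a set of rules, the estimated transition probabilities out of each page sum to one, which is one sense in which a global model can be evaluated coherently.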

On the other hand, the results of a local pattern are rather simple to interpret. However, there may be a problem in the very large number of rules found. Statistically based models, such as Markov chains, help considerably with the selection of the most relevant rules, as they can be evaluated in a more coherent way.

Finally, we describe how Bayesian model selection can be used to score models and to determine model-averaged estimates of quantities of interest (such as odds ratios).



J. Brian Gray, The University of Alabama
A Genetic Algorithm Approach to Tree Modeling
Tree models are valuable tools for prediction and data mining. Traditional tree-growing methodologies, such as CART, suffer from "greediness," i.e., locally optimal node splits do not always lead to the best tree model. Greedy solutions are also sensitive to perturbations in the data and can vary greatly across different training sets sampled from the same data. Ensemble techniques, including bagging and random forests, have better predictive performance than CART, but lack the interpretability of single tree models. TARGET (Tree Analysis with Randomly Generated and Evolved Trees) is an alternative method of constructing tree models based on a genetic algorithm. Empirical evidence shows that the TARGET approach produces smaller trees with better predictive performance than CART. TARGET solutions are also found to be more stable than CART solutions across different training samples from the same data.



Benton Gup, Michael Hardin and Michael Conerly, University of Alabama
Business Analytics Applied to Money Laundering Detection
The Patriot Act makes it imperative for financial institutions to proactively monitor potential money laundering activities by their customers. Due to the volatile nature of these activities, the detection schemes must continually be updated. In this presentation, we present a typology of recent money laundering schemes. We will illustrate the process of developing alert mechanisms using analytical techniques.



Trevor Hastie, Stanford University
Least Angle Regression, Forward Stagewise and the Lasso
Least Angle Regression (LARS) is a new model selection algorithm. It is a useful and less greedy version of traditional forward selection methods. Three main properties of LARS are derived.

(1) A simple modification of the LARS algorithm implements the Lasso, an attractive alternative to OLS that constrains the sum of the absolute regression coefficients. The LARS modification calculates all possible Lasso estimates for a given problem in an order of magnitude less computer time than previous methods.

(2) A different LARS modification efficiently implements epsilon Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm.

(3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates.

LARS and its variants are computationally efficient. We provide R and Splus software which enables one to fit the entire coefficient path for LAR, Lasso or Forward Stagewise at the cost of a single least squares fit.

There are strong connections between the epsilon forward stagewise regression and the boosting technique popular in machine learning. These connections offer new explanations for the success of boosting.
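
The epsilon Forward Stagewise procedure mentioned above is simple enough to sketch directly. The Python below is the naive version, which takes many tiny steps; it is not the efficient LARS modification described in the abstract and is not the authors' R or Splus software. The function name, step size, and toy data are assumptions for illustration, and the predictors are assumed to be centered and on a common scale.

```python
def forward_stagewise(X, y, epsilon=0.01, steps=1000):
    """Naive epsilon forward stagewise regression.

    X: list of rows with centered, comparably scaled predictors.
    y: centered response. Returns the coefficient vector after `steps` moves.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    for _ in range(steps):
        # Inner product of each predictor with the current residual.
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-12:
            break  # residual is uncorrelated with every predictor
        # Nudge the most correlated predictor's coefficient by a small amount.
        delta = epsilon if corr[j] > 0 else -epsilon
        beta[j] += delta
        for i in range(n):
            residual[i] -= delta * X[i][j]
    return beta

# Toy data: the response depends only on the first predictor (y = 2 * x1).
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2, 2, -2, -2]
beta = forward_stagewise(X, y)
print(beta)
```

Recording `beta` after each step would trace out a coefficient path; the point made in the abstract is that the LARS-based modification obtains such a path far more cheaply than this step-by-step loop.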



Claudia Imhoff
Sarbanes-Oxley — A Legislative Mandate for Business Intelligence!
The Sarbanes-Oxley Act is the most far-reaching piece of legislation affecting Corporate America in years. Ultimately, its purpose was to restore investor confidence in publicly traded corporations in the US. While CEOs and CFOs are the main target for the Act, CIOs will be affected as well. It is technology that will give corporations the assurance that they are in compliance — specifically Business Intelligence (BI) technology. In this timely and in-depth examination of Sarbanes-Oxley and BI, Dr. Imhoff walks the attendee through the significant sections of the Act demonstrating where and how BI plays a role.

Here are the parts of her timely session that concern everyone:
  • Material changes
  • Internal controls
  • International concerns
  • Private companies
  • What's needed in your BI environment
Dr. Imhoff concludes with her own mandate to use the Sarbanes-Oxley Act to justify the driving need to standardize your corporation's IT architecture, nomenclature, technology and applications. The time could not be better for these critical initiatives. She ends her seminar with a warning about unexpected and unwelcome consequences that may result from compliance with the Act.



Roger Jones, Complexia
Data Mining For Auto and Truck Safety: Drowsy Driver Detection and Optimal Airbag Deployment
Modern automobiles have more computational capability than the Apollo spacecraft that went to the moon. Much of this capability is being used to improve the safety of the vehicles. We will discuss two such safety systems: a system that uses the ambient electric fields around the driver's head to locate the position of the head and detect when the driver is becoming drowsy, and a system that measures high frequency acoustic waves in the windshield during a crash and advises the airbag control system on how to optimally deploy the airbags. Both applications make extensive use of data mining techniques for signal processing.



Bill Kahn, Capital One
Why Data Mining is Not Used and Why Better Data Mining Won't Help
Of course, data mining is often used, but nevertheless it is not used in many places where one might think it should or could be used. This talk is about the should or could.

Often data mining is not used where one may think it should be for a simple reason: there is nothing to find. Data mining extracts the information content resident in the data, but in many data sets, oft times even quite large ones, there is (almost) no information to mine about the question of interest. Thus, one is led to the responsibility of statisticians to work on the creation of information-rich environments, i.e., Design of Experiments (DOE). Some brief examples will be presented of where this problem has been observed and designs that have been successfully executed.

Also, data mining is often not used even where the data is actually information rich. Similarly, DOE is often not used even when one might think the implementation was straightforward. We will explore ways that these failures are due both to the lack of specific skills statisticians need so as to be able to influence the behavior of organizations and also to the culture of professional statistics itself. Slightly embarrassing personal stories will be shared. Finally, suggestions will be made as to what each of us can do to have higher impact on business behavior.



Dmitri V. Kuznetsov, Sigma Marketing
Multinomial Logit Models for Retention Analysis with 3 or more choices
We developed and successfully applied a statistical predictive methodology for a typical marketing retention problem, in which we analyzed the probabilities for each contract to remain active, be cancelled, or be traded. The analysis is based on activity data sets from a leading national hi-tech office-equipment company. The methodology uses a Multinomial Logit Model with an unordered choice structure. More specifically, we applied the Generalized Logit Model, in which the response is a function of characteristics of the chooser but not of the choices. We then discuss next-step model development using the Mixed Logit Model to take advantage of both the Generalized Logit and Conditional Logit approaches. The keystone of the methodology is the SAS procedure LOGISTIC with the option LINK=GLOGIT, together with backward and stepwise selection of variables. Analysis with the LINK=GLOGIT option in LOGISTIC became possible in SAS starting with version 8.2, and for the present study it is much more convenient than the CATMOD procedure. The total computer analysis of the problem using SAS took a reasonably short time (hours) and showed no convergence problems.

The methodology provided:
  1. Simultaneous analysis of 3 customer-decision choices for different types/sizes of products employed by different-type customers. (The simultaneous analysis improves predictive accuracy due to extended statistical information and elimination of conditional grouping of products. Moreover, it dramatically decreases modeling time.)
  2. Most significant predictors for Active/Cancel/Trade choices.
  3. Estimated Active/Cancel/Trade probabilities for each contract that can be compared to one another.
  4. Extraction of contracts for Cancel/Trade Code Red from the company data sets.
  5. Information about actionable triggers for preventative strategies combined with implementation of business rules.
  6. Analysis of sales-decision cycles.
  7. Early Warning approach.
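
The generalized logit model behind item 3 turns one linear predictor per non-reference outcome into a set of probabilities that sum to one. The Python sketch below shows only that scoring step. The coefficients, the predictor values, and the choice of "Active" as the reference outcome are hypothetical, chosen for illustration; they are not results from the study or output from PROC LOGISTIC.

```python
import math

def glogit_probabilities(x, coefs, reference="Active"):
    """Outcome probabilities under a generalized logit model.

    x: predictor values for one contract.
    coefs: maps each non-reference outcome to (intercept, [slopes]).
    The reference outcome has its linear predictor fixed at zero.
    """
    linear = {reference: 0.0}
    for outcome, (intercept, slopes) in coefs.items():
        linear[outcome] = intercept + sum(b * v for b, v in zip(slopes, x))
    denom = sum(math.exp(v) for v in linear.values())
    return {outcome: math.exp(v) / denom for outcome, v in linear.items()}

# Hypothetical coefficients for two predictors (e.g. contract age, usage level).
coefs = {
    "Cancel": (-1.0, [0.8, -0.5]),
    "Trade": (-2.0, [0.3, 0.9]),
}
probs = glogit_probabilities([1.0, 2.0], coefs)
print(probs)
```

Because the three probabilities come from a single model, they can be compared directly for each contract, which is the property that item 3 relies on.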


Larry Lai, Directv, Inc.
Variable Derivation and Selection for a Customer Churn Prediction Model
In a Customer Relationship Management (CRM) environment, it is not uncommon to have a huge data warehouse containing all customer touches through different contact channels: mail, phone call, e-mail and website, inbound or outbound, you name it. Data content can either originate at the time of registration, such as credit risk, dealer channel, geographic, lifestyle, psychographic and demographic data, or be longitudinal over the customer's life, such as consumption, payment and contact data. Prior to a CART modeling analysis, it can be more productive to first conduct variable derivation and selection within a subcategory of attributes with similar context, e.g. within the billing or customer contact category, as opposed to the entire database. This presentation describes the benefits of some techniques for conducting "partial" variable derivation and selection before engaging in a CART analysis and illustrates them with a real case of a customer churn prediction model.

Daymond Ling, CIBC (Canadian Imperial Bank of Commerce)
Successful Data Mining implementation in a Financial Institution
We all know how powerful Data Mining can be. Financial Institutions world wide have been mining data for decades. So what makes for a successful implementation that drives value for a business? In this presentation, I will share some insights on how to achieve success in a business setting.

Huan Liu, Arizona State University
Active Feature Selection with Large Amounts of Data - A Selective Sampling Approach
Feature selection, as a preprocessing step to data mining, has been very effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. Traditional feature selection methods resort to random sampling in dealing with data sets with a huge number of instances. In this paper, we introduce the concept of active feature selection, and investigate a selective sampling approach to active feature selection in a filter model setting. We present a formalism of selective sampling based on data variance in comparison with some class-based approaches, and apply it to a widely used feature selection algorithm, Relief. Further, we show how it realizes active feature selection and reduces the required number of training instances to achieve time savings without performance deterioration. We design objective evaluation measures of performance, conduct extensive experiments using both synthetic and benchmark data sets, and observe consistent and significant improvement. We suggest some further work based on our study and experiments.
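
For context, the core of a Relief-style algorithm is a weight update that rewards features which differ between an instance and its nearest neighbor of the other class (the "miss") and penalizes features which differ from its nearest neighbor of the same class (the "hit"). The Python sketch below is a simplified variant assumed for illustration: it visits every instance instead of sampling, uses absolute differences, and expects numeric features scaled to the range 0 to 1. It does not implement the selective sampling proposed in the abstract.

```python
def relief_weights(xs, ys):
    """Simplified Relief-style feature weights for a two-class problem.

    xs: list of feature vectors scaled to [0, 1]; ys: class labels.
    Each class must contain at least two instances.
    """
    n, p = len(xs), len(xs[0])
    weights = [0.0] * p

    def dist(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and ys[j] == ys[i]]
        misses = [j for j in range(n) if ys[j] != ys[i]]
        hit = xs[min(hits, key=lambda j: dist(xs[i], xs[j]))]
        miss = xs[min(misses, key=lambda j: dist(xs[i], xs[j]))]
        for f in range(p):
            # Reward separation from the nearest miss, penalize it from the nearest hit.
            weights[f] += (abs(xs[i][f] - miss[f]) - abs(xs[i][f] - hit[f])) / n
    return weights

# Made-up data: feature 0 separates the classes, feature 1 is noise.
xs = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.5], [0.8, 0.3], [0.9, 0.8], [1.0, 0.4]]
ys = [0, 0, 0, 1, 1, 1]
weights = relief_weights(xs, ys)
print(weights)
```

Replacing the loop over all instances with a loop over a chosen subset is where random sampling, or a selective sampling scheme such as the one the abstract proposes, would enter.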



David Madigan, Rutgers University
Text Categorization: A Review and Some New Results
Text categorization concerns the assignment of documents to predefined categories. Traditionally librarians and human indexers have carried out such categorization tasks, sometimes on a large scale. For example, the US National Library of Medicine engages over 100 human indexers to assign medical subject headings to 400,000 medical articles a year. Applications such as e-mail filtering, pornography detection, medical coding, and news filtering are creating a growing demand for automated text categorization, especially for categorization algorithms that can learn from examples. The statistical challenges revolve around issues of scale (the number of predictor variables can run to the tens of thousands) and model structure. In recent empirical evaluations, support vector machines and boosting algorithms have overtaken more traditional probabilistic classifiers like Naive Bayes. This talk will describe a computationally efficient Bayesian logistic regression approach that yields outstanding accuracy.



Ed Malthouse, Northwestern University
Understanding Database Marketing Trigger Events using Survival Analysis and Other Data Mining Methods
This talk shows how survival analysis methods can be used to understand the effect of trigger events on certain outcomes of interest to the manager, and compares survival analysis approaches with alternative approaches. Database managers need to understand what effect, if any, certain customer-initiated actions — the trigger events — have on various outcomes. For example, after a hotel customer redeems loyalty points, is the customer more likely to return again sooner, or later? If the customer has "cashed out" his points and does not plan to return again, the hotel should plan certain contacts to earn the customer's loyalty again. If, on the other hand, participation in the program implies loyalty, different contacts are necessary. Marketing responses are discussed. A second example of a trigger event is customers calling their credit card company with a complaint. The Cox and discrete-time survival models with time-dependent covariates are reviewed. These methods are contrasted with simpler logistic regression approaches. We show how data mining methods can be used to identify customer variables that interact with the time-dependent covariates; for example, when certain customers redeem points, they are more likely to defect, while when other customers redeem, they are less likely to defect. Effects on the baseline hazard function are also discussed. All methods are illustrated using large data sets from real companies including an on-line loyalty program, a hotel chain with a loyalty program, a software company, and a continuity program. Example SAS code will be made available.



Sreelatha Meleth, University of Alabama at Birmingham
Determining approaches to develop outcome predictive models for human malignancies
The ever-increasing capabilities of molecular biology over the last decade have made a triumph in the war against a number of cancers seem more attainable than it had for several decades before. These increasing capabilities, however, have had the opposite effect of seemingly slowing down the pace of understanding of a disease process. This is particularly so when different researchers look at different markers, using different cut-offs and different outcome measures. The vast capabilities of the field of data mining offer a solution to the rapid expansion of the databases that associate molecular markers with disease prognosis. In this study, we apply a number of data mining techniques to study the combined effect of molecular markers, demographic variables, and traditional prognostic indicators such as stage on the survival of patients with colorectal cancer. We demonstrate that it is possible to use these techniques to build a predictive model. Ultimately, the purpose of a predictive model is to enable a clinician to enter a specific set of biomarker indices, demographic background, and disease stage indices into an algorithm in order to derive a prognosis specific to his/her particular patient.

Authors: Sreelatha Meleth, Ph.D.; Mike Hardin, Ph.D.; Upender Manne, Ph.D.



Roosevelt Mosley and Shawna Ackerman, Pinnacle Actuarial Resources
Use of Credit in Personal Insurance
For financial institutions, credit scores and scoring mechanisms continue to provide effective means to identify markets and assess risk. In the insurance industry, the analogous mechanism, insurance scoring, is often challenged by regulatory constraints. In this session, two property and casualty actuaries will discuss the methods, results and challenges of using credit in three distinct regulatory environments: an unconstrained market, a limited use market and a prohibited use market.



Brendan Murphy, Trinity College, Dublin
Exploring Structures in College Applications Data
Applications for third-level courses in Ireland are processed by the Central Applications Office (CAO).

The applications involve each applicant listing up to ten courses in order of preference. Places are subsequently offered to applicants on the basis of their performance in their final second-level examinations.

The college applications process has come under much scrutiny by the public and media in Ireland. Many criticisms have been made; for example, the system apparently creates artificial demand for some courses, with students choosing high-profile (or popular) courses rather than choosing courses on a vocational basis.

We explore the CAO applications from the year 2000 to establish if there are structures in the applicants' course choices.

The primary tools for these investigations are cluster analysis, mixture models and multidimensional scaling.

We establish the existence of clusters of courses where applicants tend to choose courses within these clusters. These clusters have both a vocational and geographical basis. An important difference between male and female applicants is revealed, in particular with respect to courses involving a language component.



Olivia Parr-Rud, OLIVAGroup and Sigma Marketing
Key Steps for Effective Predictive Modeling
Automated modeling software systems that streamline the predictive modeling process are enabling data miners to develop sophisticated models for marketing, risk and customer relationship management. However, the model processing, which is the main focus of the software systems, is a minor step in the whole process. The success of the model is highly dependent on the diligence in the steps leading to and following the model processing.

Determining the objective is the first and most critical step. The modeler must consider the company objective as well as the data miner's ability to implement the final model. The next step is getting relevant, accurate data for the project. After the model is built, thorough validation is key to ensuring the model's performance. And finally, implementation must be flawless to ensure the model's success. In this session, I will discuss these key steps and provide real-world examples from a variety of industries.



Will Potts, Data Miners
Modeling Recurrent Customer Outcomes
Valuable outcomes such as upgrading, downgrading, missing a payment, making an insurance claim, or ordering from a catalog recur throughout customer lifetimes. Churn, usually considered to be a terminal event, can recur after customers are won back or reactivated. Models that predict the intensity of a recurrent outcome can be used to guide customer-level interventions. Predictive intensity models are extensions of survival analysis methods for renewal or nonhomogeneous Poisson processes. These models flexibly estimate the effect of time-dependent covariates such as past customer behavior and account for individual frailty.



David Press, Greenbrier & Russell
Current State of Analytics--Journey Into Mainstream Corporate Culture
Areas of focus will be executive needs, emergence of statistician service bureaus, differentiating analytics from reporting, and emerging trends such as social networking, complex adaptive systems, and analytic level of confidence requirements.



Bruce Ratner, DM STAT-1 Consulting
A Genetic Jackknife Method: 3-in-1 Tool for Variable Selection, Data Mining and Model Building
The trinity of traditional analytical techniques - variable selection, data mining, and model building - is presented in detail along with their strengths and weaknesses. Then, I introduce a new "jackknife" method that is a 3-in-1 tool for automatically and simultaneously performing the trinity of techniques: selecting important original variables, finding patterns within the data by constructing new important variables from the original variables, and formulating a mathematical equation based on the best set of original and constructed variables. The jackknife method (GenIQ) is based on the assumption-free, nonparametric genetic paradigm inspired by Darwin's Principle of Survival of the Fittest and the biological operations of reproduction, sexual recombination and mutation. The GenIQ method offers a clear advantage over current statistical methods, whose performance is dependent upon theoretical assumptions, predefined model formulations, and data-type restrictions. A case study, together with a tutorial of the GenIQ software implementation, illustrates the potential of the new method for building database marketing models. (Note: The tutorial is NOT for selling software: the GUI provides a clarifying explanation of the theoretical aspects of the GenIQ method.)

The intended audience for the session consists of model builders of all levels of expertise and marketers who use models in the DM Space (direct/database marketing {DDBM/eDDBM}, customer relationship management {CRM/eCRM}, and knowledge discovery/data mining {KDD}). This topic is important and interesting to the DM community because the methodology is inherently new and original, as it is based on the latest machine learning paradigm of decile optimization. The benefit to participants is an alternative to the standard logistic and ordinary linear regression models.



Brett Russ, Blue Cross and Blue Shield of North Carolina
Using Data on Existing Customers to Attract and Retain More Profitable Customers
Blue Cross and Blue Shield of North Carolina is a leader in delivering innovative health care products, services and information to approximately 2.9 million members, including 500,000 served on behalf of other Blue Plans. Within the Product and Market Intelligence Department, we are constantly looking at ways to support the company's mission statement: at our core, Blue Cross and Blue Shield of North Carolina is a health care company. Our job is to deliver quality, innovative products, services and information designed to help our customers improve their health. We are always striving to come up with innovative ways to enhance profitability and improve market share while best serving our customers. In this presentation we will discuss how data mining is used in specific case studies including "What are Customers Buying," "Profitability Analysis," and "Penetration Analysis."



Melinda Satterwhite, Nextel
The use of survival analysis in telecommunications
This presentation will center on the discussion of the use of survival analysis in telecommunications and will include 1) the availability and issues of data; 2) flexible hazard versus empirical hazard; and 3) scoring of the hazards to forecast churn scores.
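For readers unfamiliar with the term, the following is a minimal sketch of an empirical (life-table style) hazard, not the presenter's method. The tenure data and function names are hypothetical. The hazard at month t is the number of subscribers who churn in month t divided by the number still active at the start of month t; censored subscribers count as at risk up to their last observed month.

```python
def empirical_hazard(tenures, churned):
    """Hazard per month: churners in month t / subscribers at risk in month t."""
    hazards = {}
    for t in range(1, max(tenures) + 1):
        at_risk = sum(1 for tenure in tenures if tenure >= t)
        events = sum(
            1 for tenure, c in zip(tenures, churned) if tenure == t and c
        )
        hazards[t] = events / at_risk
    return hazards


def survival_curve(hazards):
    """Probability of surviving past each month, from the monthly hazards."""
    survival, s = {}, 1.0
    for t in sorted(hazards):
        s *= 1.0 - hazards[t]
        survival[t] = s
    return survival


# Hypothetical subscribers: months of tenure and whether tenure ended in churn
# (False means the subscriber was still active when observation stopped).
tenures = [1, 2, 2, 3, 3, 3]
churned = [True, True, False, True, False, False]

hazards = empirical_hazard(tenures, churned)
survival = survival_curve(hazards)
print(hazards)
print(survival)
```

A flexible (model-based) hazard would replace these raw monthly ratios with a smoothed or covariate-dependent estimate; the empirical version serves as a baseline for comparison.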



Vineet Singh, HP
Data Mining to Increase Accuracy for Telecom Fraud Detection
Within the telecommunications industry, fraud worldwide costs US$35-40 billion annually, and continues to increase each year. HP's leading Fraud Management Solution (FMS) provides comprehensive fraud detection, prevention, and response for wireless and fixed line operators. It provides the framework to use rule-based techniques and data mining to increase accuracy for fraud detection. In this talk, we will discuss our experience, challenges, and successes in this application of data mining.



Robert A. Stine, Wharton School, University of Pennsylvania
Awktion Modeling of Wide Data Sets
The variety of choices presented by wide data sets having many features challenges data miners. The first obstacle is speed. The abundance of features slows even fast methods like forward stepwise selection to a crawl. Each step of finding and adding the best predictor can take hours. The second obstacle raised by the abundance of features is overfitting. Expansive searching increases the chances for adding spurious predictors, features that fit well in-sample but generate poor predictions out-of-sample. To overcome these challenges, data miners often resort to automatic procedures that close the problem to substantive knowledge. This automation presents the third obstacle: the inability to exploit domain experts.

Awktion modeling addresses all three challenges. Awktion (auctions with knowledge that inhibits overfitting noise) modeling uses an auction to blend multiple streams of features into a model. These streams of features come from substantive and universal recommenders, algorithms that generate features from the raw data. Each recommender offers features for inclusion in a predictive model. Recommenders that identify useful predictors gain the wealth needed for placing further features into an accumulating model. A recommender can be fully automatic or generate features using the knowledge of domain experts.

An example using financial data illustrates the ideas.



Steve Tanner, University of Alabama at Huntsville
The Role of Data Mining in Data Usability
Users of data mining applications often wish to access data in a wide range of formats and physical locations. Simply accessing this data can be a daunting task and can require significant time and effort by the user. This can involve both real-time and archived data from several sources, in formats ranging from character data and packed binary to "standard" scientific formats and self-describing formats. This heterogeneity results in data-application interoperability problems for scientific and mining tools.

This presentation will show several approaches to dealing with these interoperability issues. These include tools and techniques that researchers at the Information Technology and Systems Center located at the University of Alabama in Huntsville have developed. Such approaches include providing users with multiple mining environments, from large server-based systems to distributed web services within a grid computing environment to fast real-time processes running directly on board sensor systems. Some time will be spent discussing the Algorithm Development and Mining System (ADaM), a mining toolkit, and the Earth Science Markup Language (ESML), an elegant interchange technology that enables data interoperability with applications without enforcing a standard format.



Marietta Tretter, Texas A&M University
CRM, or Not, in Archaeology — Mining Rock Art
Native American rock art presents an interesting challenge to archaeologists, chemists, and statisticians. Much of the recording and analysis of rock art has been done by avocational archaeologists. Although rock art is often associated with archaeological sites, until recently, it has not been possible to date it, thus making it of less interest to professional archaeologists. Through modern chemical analysis the painted rock art can now be dated. The recorded rock art involves drawings, photographs, verbal descriptions of the art, and site maps. Many volunteer groups have spent many hours creating volumes of documentation in varied formats. All of this 'data' is mostly stored away in paper archives which few researchers have access to. This is an opportunity awaiting a data mining solution. A few projects are working on putting this documentation into digital libraries so that it can be searched and analyzed. This talk will present some of the current data mining analysis that can be done on this newly digital data. The analysis includes Text Mining and general Data Mining analysis.



John Wallace, Business Researchers
Using SAS Text Miner to Analyze Call Center Data
Nearly every organization interacts with its customers or members through the call center. Call center data serves as the platform to discuss SAS Text Miner and the process involved in using it and other analytical tools for text mining. The pre-processing of data, exploratory analysis and development of synonym and stop lists is covered. The final model is a hierarchy of Expectation-Maximization Clustering models that total over 100 clusters.



Andreas S. Weigend, formerly of Amazon.com
Online Customer Behavior
Dr. Andreas Weigend, who served as Amazon.com's Chief Scientist until early 2004, shares some insights into online customer behavior. The talk starts with a discussion of objectives and of sources of data in e-commerce, gives an analysis of clickstream and purchase data, and presents probabilistic models for customer intentions and modalities from click streams in real time. The talk discusses the importance of online experiments, as well as the need for a framework for modeling and predicting long-term customer behavior. It ends with a discussion of data mining in online advertising and online dating, and of leveraging social networks in e-business.



Cary White, University of North Carolina
Data Warehousing at the University of North Carolina — Chapel Hill — A Case Study with a Focus on Lessons Learned
Anyone attempting to develop a data warehouse in a higher education environment encounters unique challenges not always found in a corporate setting. Diverse constituencies, limited funds, a highly political culture with many power centers, aging source systems, lack of knowledge of business processes, and the need for an 'instant' enterprise warehouse are just a few of these challenges. This presentation will uncover some of these challenges and present some of the choices, lessons learned, good luck and good judgment that have occurred during the life of this project.

UNC-CH is in the third year of a major data warehouse initiative. A small permanent team has been responsible for the technical portion of the project from requirements-gathering through ETL and finally to deployment.

Within the framework of a case study of the data warehouse lifecycle in this university environment, you will learn:
  • Unique challenges of data warehousing in a university setting
  • Turning points in the project and choices made along the way
  • Lessons learned during the life of this project, most of which are relevant to data warehousing in any setting
  • Some best practices gleaned from the literature and from our experiences in the first years of this project
Audience:
  • Project managers
  • Data warehouse architects
  • Business sponsors and drivers
  • Anyone contemplating the development of an enterprise data warehouse with a small budget and limited staffing

Copyright © 2004 SAS Institute Inc. All Rights Reserved.

What participants say about the M-series:

"The educational content, exchange of ideas, and intellectual environment I found at the conference exceeded my expectations and confirmed SAS' place as the premier data mining conference in the world."

Thad Perry, Ph.D.
Senior Director of Informatics

"SAS is doing a tremendous service for the data mining community. The conference provides an excellent forum for exchanging ideas and best practices in business and a stage for sharing the latest and best academic research in the field."

Jaideep Srivastava
Professor of Computer Science and Engineering
University of Minnesota

"This was a superb environment - one of the smartest conference venues I have experienced (and I have experienced a lot). The talks went into greater depth than the talks at many such meetings. Many of the talks were particularly valuable in shedding light on different application areas of data mining."

David Hand
Professor and Head of Statistics
Imperial College, London

"This conference is definitely a must. Not only for the information, but for the opportunity it provides to exchange ideas and learn from your colleagues."

Daryl Berry
T-Mobile US

"The information I got from the presentations was great, and it was nice to talk to and exchange experiences with professionals who are pretty much doing the same thing."

Victor Alonso
Zurich Insurance Co

"What really impressed me was the sense of community that normally isn't present at conferences of this size."

Brij Masand
Data Miners

"The conference has opened a whole new world for me."

Rachel Alt-Simmons
Hartford Life Insurance