
The following abstracts have been provided:
Mihael Ankerst, Boeing
Cooperative data mining: Tightly integrating data mining with visualization
The exponential growth of data and the inability of humans to process
large datasets have led to the idea of automating the data
mining process. This talk highlights the danger of neglecting the human
factor in the mining process and proposes mining as a cooperative effort
between the user and the computer. The areas of data mining and information
visualization offer various techniques that effectively complement one
another in supporting the discovery of patterns in the data. Whereas
traditional (algorithmic) techniques analyze the data automatically,
information visualization techniques can leverage the data mining process
from an orthogonal direction by providing a platform for understanding the
data and generating hypotheses about the data based on human capabilities
such as domain knowledge, perception, and creativity.
This talk discusses the benefits and challenges of tightly integrating
visualizations with mining algorithms and presents examples based on first
prototypes.
Jason Bargen, Hallmark Cards, Inc.
Data Mining methods used on Hallmark's Gold Crown Card Consumers
Loyalty is not something you purchase. It is something you earn. Loyal
consumers can generate good profits for companies, and a way for companies
to become more profitable is to attract and retain these dedicated
consumers. At Hallmark we are using different data mining techniques to
better understand consumers' behavior. These data mining techniques
benefit both Hallmark and its consumers. We are using this information to
retain our loyal consumers and increase the number of Gold Crown Card
members in our database. We also use this information to do a better job of
providing consumers with relevant messages/offers that meet their needs.
Our overall goal is to increase the number of loyal members by 40% in the
next 4 years. I will talk about the different data mining methods used to
help Hallmark understand and retain their current consumers and the tests
underway to engage new consumers.
Joe Bartling, H&R Block
Design and Analysis of Market Tests at H&R Block
I present and discuss the processes and methodology utilized at H&R Block to design
and analyze various types of retail market level tests. The various types of
testing applications discussed are appropriate for testing media type, media
weight, marketing programs, promotional programs, operational changes, and price
testing. I discuss how we approach the selection of test and control markets and/or
retail locations to maximize learning. I also discuss how we determine the relevant
test metrics (e.g. lifetime value, units, revenue, etc.) and then calculate the
appropriate test advantage. In addition, as all of our testing for the tax business
must occur within a four month span every year, there are always last minute test
ideas. I will discuss how we prepare in advance to accommodate these last minute
tests.
Eugenia Bastos, SAS (with Russ Wolfinger, David Duling and Leonardo Auslender, SAS)
Data Mining Breast Cancer Clinical and Expression Data
One in eight women in the US will develop breast cancer. Of these, one third
will progress to fatal metastatic cancer (Peto et al., Lancet, 2000). Gene
expression profiles have the potential to improve accuracy of metastasis
classification when compared with the classical clinical variables.
In this study, we re-analyze clinical and microarray data of primary breast
cancer tumor tissues from 98 patients (van't Veer et al., Nature, 2002),
with the primary objective to predict metastasis status. The study sample
includes sporadic and BRCA1-positive cases where 47.4% progressed to
metastasis; only 18.6% had lymph node involvement; high grade accounted for
61.9% and 71.1% developed angioinvasion. Two groups of patients are
classified according to their metastasis status: non-metastasis patients
who were disease free for at least 5 years are called "good prognosis"
group and patients who developed metastasis within 5 years are classified
as "poor prognosis" group.
We begin with the 78 sporadic cases, from which the original authors
developed a classifier based on 70 out of ~25,000 genes determined by a
filtering and cross-validation approach using correlation coefficients.
They report a 16.7% misclassification rate for leave-one-out cross
validation (LOOCV). We present an approach based on analysis-of-variance
filtering and cross-validated stepwise discriminant analysis which results
in a classifier based on 13 genes and an LOOCV error rate of 3.8%.
Next, we extend consideration to all 98 cases and their associated clinical
covariates. We consider additional mining approaches such as logistic
regression, decision trees, and prediction based on profiles obtained by
unsupervised k-means clustering. Due to the observational nature of the
data, we advocate extensive cross-validation to help ensure
generalizability. Substantial gains in predictive performance are evident.
Eugenia Bastos, PhD, is an analytical consultant for the Life Sciences Organization
at SAS Institute, San Francisco, California. E-mail: Eugenia.bastos@sas.com.
Russ Wolfinger, PhD, is Director of Genomics, SAS Institute Inc., Cary, NC.
David Duling, PhD, is Director of Enterprise Miner Development, SAS
Institute Inc., Cary, NC.
Leonardo Auslender is a Statistician in Enterprise Miner Development, SAS
Institute Inc., NC.
Tom Bradshaw, Bank of America
A Statistical (and yet Non-Traditional) Approach to the Design, Optimization, and Analysis of Matched Market Testing Using Linear Models and Time Series Behavioral Data
The analysis of mass media advertising and similar market level activities
is often based on a pre/post comparison of two Matched Markets. Experience
at the Bank of America has demonstrated that because the market selection
process has historically not been statistically and behaviorally based, the
results have frequently been inconclusive or misleading. This presentation
describes a statistically based approach that we have developed that uses
behavior based time series data and SAS linear model procedures to select
the best test and control markets from a large pool of candidates; evaluate
and optimize their discriminative sensitivity prior to the test; and
analyze the post test results. The methodology is demonstrated using data
from an actual matched market test. Examples of SAS code are provided.
James Cappel, Central Michigan University
A Survey of Practices and Opinions about Business Intelligence
Many companies are still struggling with how best to position business
intelligence within their organization and optimize their investments in
data mining and BI. This study was conducted on behalf of the Central
Michigan University Research Center and sponsored by The Dow Chemical
Company, Ford Motor Company, and Eli Lilly and Company. This investigation
involved a web-based survey of hundreds of business professionals from
large organizations. The survey addressed various issues such as: the
scope, structure, resources, drivers, and evaluation of BI practices within
companies, as well as opinions about the perceived role, effectiveness, and
importance of BI to companies. The results provide a baseline for
organizations to assess their progress along the BI continuum and they
raise thoughts about how companies may potentially improve their BI
practices.
Robert Chu, SAS
On-Demand SAS Predictive Model Scoring
Application domains such as call centers, front office, and credit approval
are requiring on-demand predictive model scoring. This presentation will
review on-demand scoring issues and walk through all the needed steps live
to deploy SAS predictive models. Steps include (1) create models in SAS
Enterprise Miner, (2) register models in a SAS metadata repository, (3)
score in real-time with registered models, (4) score in batch with
registered models, and (5) summarize and report scoring results. Software
tools used in the presentation include SAS Enterprise Miner, SAS
Integration Technologies, messaging queues, and relational databases.
Jay Coleman & Allen Lynch
Method to Madness: Identifying the NCAA Tournament "Dance Card"
Every March, basketball fans across America prepare for an annual
phenomenon called March Madness, where 64 NCAA college teams compete in the
men's and women's tournaments to determine the national champions. Along
with the tournaments come the nationwide rituals of office pools,
brackets ... and controversy over which teams make the cut for the
tournament, a.k.a., the Big Dance. The at-large selections by the NCAA
Basketball Tournament Selection Committee inevitably leave some teams and
fans elated and others feeling snubbed.
But two college professors seem to have discovered a method to March
Madness. Professors Jay Coleman, an operations management professor at the
University of North Florida in Jacksonville, and Allen Lynch, an economics
professor at Mercer University in Macon, Georgia, have predicted the NCAA
tournament teams with stunning accuracy. Over the past 10 years, their
analytic-powered NCAA "Dance Card" has boasted an impressive 94 percent
accuracy rate.
So, how do they do it? They have tapped into predictive modeling
technology from SAS. Coleman and Lynch's intriguing Dance Card model is a
colorful example of the predictive power of analytics. "The same
techniques used in the Dance Card equation can be directed to a plethora of
business applications. The possibilities are quite exciting," Lynch says.
The professors' selection process has been featured this past year in The
Wall Street Journal, the Associated Press, eWeek, DM Review and a number of
other newspaper and trade publications. Coleman and Lynch are not only
passionate basketball fans but also fans of unique data mining techniques and
colorful speakers. They are eager to share their story with their peers
in the data mining and analytics communities.
Jim Cox, SAS
Text Miner Tips and Techniques
Text Mining is an exciting field with many varied and far-ranging
applications. Unfortunately, there is no roadmap explaining how to apply
it to specific problems. This is a workshop that presents many approaches,
using SAS macros with the SAS Text Miner product, to dealing with common
issues people have when trying to apply Text Mining to their application.
Assistance is given on how to: 1) automatically create synonym lists
composed of misspellings in the data, 2) determine how many dimensions to
use when compressing the textual representation, 3) take care of messy
data, including extraneous punctuation, and 4) visualize clusters in
an interactive fashion.
Rhonda Drake, Drake Direct
Perry D. Drake, Drake Direct
Data Mining a Non-Profit File for High Value Donors and Planned Givers
A non-profit organization needed to identify its best donors in order to focus
planned giving activities on the donors most likely to develop into
"Planned Givers" and other "High Value Donors."
The challenges regarding the objective were:
- highly seasonal donor campaigns
- an acquisition strategy that favored the use of premiums
- a retention donor strategy that leveraged RFM principles rather than treating all donors equally for a period of time
- mixed use of premiums in the retention strategy
- a declining acquisition and retention response due to name trading arrangements with other non-profits.
The successful identification of "Planned Givers" and "High Value
Donors" began with the creation of cohort groups with appeal and donation
activity frozen in time relative to a donor's first gift.
Once the cohort groups were defined, the data were explored through factor
analysis followed by segmentation and regression modeling to understand the
importance of various appeal strategies and their correlation to "Planned
Giver" activity. In particular, we examine how premiums with appeals,
holiday appeals, and a club approach to appeals affect the likelihood of a
donor becoming a "Planned Giver."
Using this combined segmentation and modeling strategy, the
likelihood of identifying planned givers rose from 1.1 percent to over 7
percent. This represents a gain of over 600 in the ability to identify
this highly desirable group.
David Duling, SAS
Computational Performance in Data Mining
Data Mining is often defined as the process of finding patterns in large
databases, suggesting that predictive models are often built on data with
large numbers of observations and/or variables. Statistical methods must
be chosen and implemented carefully for scalability. Customers are
increasing their needs for predictive models and statistical analysis and
the amount of data acquired and stored is growing at unimaginable rates.
To meet that need, hardware vendors are providing faster systems and both
SMP and MPP capabilities, and main memory and disk storage capacities are
growing at superlinear rates. These trends point to a future where data
mining and computational performance will be codependent. This talk will
review some methodologies and present some findings on distributing
predictive model training and scoring processes.
William DuMouchel, AT&T Labs
Empirical Bayes Methods for Postmarketing Surveillance of Adverse Drug Reactions
Because of practical limits in characterizing the safety profiles of
therapeutic products prior to marketing, manufacturers and regulatory
agencies perform post-marketing surveillance based on the collection of
adverse drug reaction (ADR) reports ("pharmacovigilance"). The resulting
databases, while rich in real-world information, are notoriously difficult
to analyze using traditional techniques. Each report may involve multiple
medicines, symptoms, and demographic factors, and there is no easily linked
information on drug exposure in the reporting population. Data mining
techniques, such as association finding, are being used to screen for
previously unknown ADRs. This presentation will discuss attacks on two
problems encountered during application of empirical Bayes methods to such
data:
- Interpreting polypharmacy effects due to frequent co-occurrence of
multiple drugs in the same report, and
- Coping with the very fine-grained adverse event coding system,
called MedDRA Preferred Terms (PT), by clustering PTs that have
similar ADR profiles across several thousand drugs
Edward Gaffin, Walt Disney World Resort
Data Mining and Business Intelligence: The Foundation for Building Effective Marketing Models for a Highly Segmented Product Set
The challenge: How to transform resort operational data and distribute it
as meaningful business information for use by all levels of marketing and
financial managers. This presentation demonstrates how the CRM division of
Walt Disney World has assembled a SAS intranet solution to deliver valuable
guest information to users' desktops across the financial and marketing
disciplines. Combining the insight of all stakeholders with the expertise
of the modeling and analytics team improves our ability to create a series
of predictive models that most efficiently target our resort marketing
efforts.
Jim Georges, SAS
Using non-numeric data in parametric prediction
Most strategies for representing non-numeric data in parametric predictive
models involve some form of recoding or transformation of the non-numeric
data into numeric data. This talk illustrates the deficiencies of some
commonly used approaches such as weight-of-evidence transformations and
makes recommendations on improving results. Specifically, an
easy-to-implement technique inspired by Bayesian statistics is introduced
which may be used to smooth the weight-of-evidence transformations and
better represent non-numeric inputs in predictive models. The technique may
be easily extended to also aid in the analysis of proportions or profile
data.
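As a rough illustration of the idea of smoothing weight-of-evidence (WOE) values, the sketch below shrinks each level's event and non-event counts toward the overall event rate by adding pseudo-counts. This is only one possible Bayesian-inspired smoothing and is not necessarily the technique presented in the talk; the function name `smoothed_woe`, the `prior_strength` parameter, and the sign convention (log odds of the level minus overall log odds) are assumptions made for the example.

```python
import math


def smoothed_woe(counts, prior_strength=10.0):
    """Return a smoothed weight-of-evidence value for each level.

    counts: dict mapping level -> (events, non_events).
    prior_strength: hypothetical number of pseudo-observations added to
        every level, split according to the overall event rate.
    Assumes the data contain at least one event and one non-event overall.
    """
    total_events = sum(e for e, _ in counts.values())
    total_non_events = sum(n for _, n in counts.values())
    overall_rate = total_events / (total_events + total_non_events)
    overall_log_odds = math.log(total_events / total_non_events)

    woe = {}
    for level, (events, non_events) in counts.items():
        # Pseudo-counts pull sparse levels toward the overall odds,
        # so a level with zero events still gets a finite value.
        adj_events = events + prior_strength * overall_rate
        adj_non_events = non_events + prior_strength * (1.0 - overall_rate)
        woe[level] = math.log(adj_events / adj_non_events) - overall_log_odds
    return woe


# Hypothetical counts: level "B" has no events, level "C" was never observed.
example = smoothed_woe({"A": (30, 70), "B": (0, 5), "C": (0, 0)})
```

Without smoothing, level "B" would have an infinite WOE because it has no events; with the pseudo-counts it receives a finite negative value, and the unobserved level "C" falls back to a value of essentially zero.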
Paolo Giudici, University of Pavia, Italy
Graphical models for web usage mining
In the talk we show how data mining models for clickstream data can be profitably built and used to predict the visit behavior at a website.
The aim of the analysis is to track the most important patterns of visits, where a pattern means a time-ordered sequence of pages, possibly repeated. The measure of importance adopted determines the results of the analysis. The most common measures refer either to the probability of visit of a certain sequence (support), to the conditional probability of seeing a certain page, having seen others in the past (confidence), or to the lift. These measures can be reinforced with an inferential statement.
Data is supposed to be extracted from a logfile that registers the access to a web site. It is structured in a transactional database format. It may be further simplified into a data matrix format, but doing so, the information on the temporal order of the seen pages is lost.
The data mining models that should be employed are examples of local models, or patterns. In the talk we first compare sequence rules, based on the apriori algorithm, with results from link analysis. We shall also compare direct and indirect sequence rules. Later, we consider more traditional statistical techniques, applied to the whole dataset (global), although based on local computations: graphical models, probabilistic expert systems and Markov chain models.
In terms of model assessment and comparison, it is rather difficult to assess local models; consequently, it is also hard to compare them with global based statistical models also because, often, they are based on rather different assumptions. However, certain comparisons are possible. For instance, it is possible to compare sequence rules with Markov chains. Our results show that the most probable patterns identified by the two procedures are rather similar. The choice between a local and a global method thus depends on the scope of the rules themselves.
On the other hand, the results of a local pattern are rather simple to interpret. However, there may be a problem in the very large number of rules found. Statistically based models, such as Markov chains, help considerably with the selection of the most relevant rules, as they can be evaluated in a more coherent way.
We finally describe how Bayesian model selection can be used and valued to score models and determine model averaged estimates of quantities of interest (such as odds ratios).
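The support, confidence, and lift measures described above can be sketched for a single indirect sequence rule ("page A is followed, at some later point in the same visit, by page B"). The session data, page names, and the function `sequence_rule_measures` are hypothetical, and the definitions follow one common convention rather than necessarily the one used in the talk.

```python
def sequence_rule_measures(sessions, first, then):
    """Compute (support, confidence, lift) for the rule first -> then.

    sessions: list of visits, each a time-ordered list of page names.
    The rule holds in a session if `then` appears anywhere after the
    first occurrence of `first` (an indirect sequence rule).
    """
    n = len(sessions)
    n_first = sum(1 for s in sessions if first in s)
    n_then = sum(1 for s in sessions if then in s)
    n_rule = 0
    for s in sessions:
        if first in s and then in s[s.index(first) + 1:]:
            n_rule += 1

    support = n_rule / n
    confidence = n_rule / n_first if n_first else 0.0
    # Lift compares the confidence with the unconditional share of
    # sessions that contain the consequent page.
    lift = confidence / (n_then / n) if n_then else 0.0
    return support, confidence, lift


# Hypothetical clickstream: four visits to a small site.
visits = [
    ["home", "catalog", "cart"],
    ["home", "cart"],
    ["catalog", "home"],
    ["home", "catalog"],
]
print(sequence_rule_measures(visits, "home", "cart"))  # (0.5, 0.5, 1.0)
```

Keeping each visit as an ordered list preserves the temporal information that would be lost if the data were flattened into a page-by-session data matrix.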
J. Brian Gray, The University of Alabama
A Genetic Algorithm Approach to Tree Modeling
Tree models are valuable tools for prediction and data mining. Traditional
tree-growing methodologies, such as CART, suffer from "greediness," i.e.,
locally optimal node splits do not always lead to the best tree model.
Greedy solutions are also sensitive to perturbations in the data and can
vary greatly across different training sets sampled from the same data.
Ensemble techniques, including bagging and random forests, have better
predictive performance than CART, but lack the interpretability of single
tree models. TARGET (Tree Analysis with Randomly Generated and Evolved
Trees) is an alternative method of constructing tree models based on a
genetic algorithm. Empirical evidence shows that the TARGET approach
produces smaller trees with better predictive performance than CART. TARGET
solutions are also found to be more stable than CART solutions across
different training samples from the same data.
Benton Gup, Michael Hardin and Michael Conerly, University of Alabama
Business Analytics Applied to Money Laundering Detection
The Patriot Act makes it imperative for financial institutions to
proactively monitor potential money laundering activities by their
customers. Due to the volatile nature of these activities, the detection
schemes must continually be updated. In this presentation, we present a
typology of recent money laundering schemes. We will illustrate the
process of developing alert mechanisms using analytical techniques.
Trevor Hastie, Stanford University
Least Angle Regression, Forward Stagewise and the Lasso
Least Angle Regression (LARS) is a new model selection algorithm. It
is a useful and less greedy version of traditional forward selection
methods. Three main properties of LARS are derived.
(1) A simple modification of the LARS algorithm implements the Lasso,
an attractive alternative to OLS that constrains the sum of the
absolute regression coefficients. The LARS modification calculates all
possible Lasso estimates for a given problem in an order of magnitude
less computer time than previous methods.
(2) A different LARS modification efficiently implements epsilon
Forward Stagewise linear regression, another promising new model
selection method; this connection explains the similar numerical
results previously observed for the Lasso and Stagewise, and helps
understand the properties of both methods, which are seen as
constrained versions of the simpler LARS algorithm.
(3) A simple approximation for the degrees of freedom of a LARS
estimate is available, from which we derive a Cp estimate of
prediction error; this allows a principled choice among the range of
possible LARS estimates.
LARS and its variants are computationally efficient. We provide R and
Splus software which enables one to fit the entire coefficient path for LAR,
Lasso or Forward Stagewise at the cost of a single least squares fit.
There are strong connections between the epsilon forward stagewise
regression and the boosting technique popular in machine learning.
These connections offer new explanations for the success of boosting.
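The epsilon Forward Stagewise procedure mentioned in point (2) can be illustrated with a minimal sketch. This is the plain incremental version (repeatedly nudge the coefficient of the predictor most correlated with the current residual), not the LARS modification and not the R/Splus software referred to in the abstract. The function name, the step size `eps`, and the toy data are assumptions for illustration; predictors and response are assumed to be centered.

```python
def forward_stagewise(X, y, eps=0.01, steps=1000):
    """Incremental (epsilon) forward stagewise linear regression.

    X: list of rows with centered predictor values.
    y: list of centered responses.
    Returns the coefficient vector after at most `steps` small updates.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    for _ in range(steps):
        # Inner product of each predictor with the current residual.
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-12:
            break  # residual is (numerically) orthogonal to all predictors
        delta = eps if corr[j] > 0 else -eps
        beta[j] += delta
        for i in range(n):
            residual[i] -= delta * X[i][j]
    return beta


# Hypothetical orthogonal design with y = 2.0 * x1 + 0.5 * x2.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.5, 1.5, -1.5, -2.5]
early = forward_stagewise(X, y, steps=100)   # only x1 has entered so far
late = forward_stagewise(X, y, steps=1000)   # close to the least squares fit
```

Stopping early gives a sparser, more constrained fit, while running long enough approaches the least squares solution; the sequence of intermediate coefficient vectors is the kind of path that the talk relates to the Lasso and LARS.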
Claudia Imhoff
Sarbanes-Oxley: A Legislative Mandate for Business Intelligence!
The Sarbanes-Oxley Act is the most far-reaching piece of legislation
affecting Corporate America in years. Ultimately, its purpose was to
restore investor confidence in publicly traded corporations in the US.
While CEOs and CFOs are the main target for the Act, CIOs will be affected
as well. It is technology that will give corporations the assurance that
they are in compliance, specifically Business Intelligence (BI)
technology. In this timely and in-depth examination of Sarbanes-Oxley and
BI, Dr. Imhoff walks the attendee through the significant sections of the
Act, demonstrating where and how BI plays a role.
Here are the parts of her timely session that concern everyone:
- Material changes
- Internal controls
- International concerns
- Private companies
- What's needed in your BI environment
Dr. Imhoff concludes with her own mandate to use the Sarbanes-Oxley Act to
justify the driving need to standardize your corporation's IT architecture,
nomenclature, technology and applications. The time could not be better for
these critical initiatives. She ends her seminar with a warning about
unexpected and unwelcome consequences that may result from compliance with
the Act.
Roger Jones, Complexia
Data Mining For Auto and Truck Safety: Drowsy Driver Detection and Optimal Airbag Deployment
Modern automobiles have more computational capability than the Apollo
spacecraft that went to the moon. Much of this capability is being used to
improve the safety of the vehicles. We will discuss two such safety
systems: a system that uses the ambient electric fields around the driver's
head to locate the position of the head and detect when the driver is
becoming drowsy, and a system that measures high frequency acoustic
waves in the windshield during a crash and advises the airbag control
system on how to optimally deploy the airbags. Both applications make
extensive use of data mining techniques for signal processing.
Bill Kahn, Capital One
Why Data Mining is Not Used and Why Better Data Mining Won't Help
Of course, data mining is often used, but nevertheless it is not used in many places where one might think it should or could be used. This talk is about the should or could.
Often data mining is not used where one may think it should be for a simple reason: there is nothing to find. Data mining extracts the information content resident in the data, but in many data sets, oft times even quite large ones, there is (almost) no information to mine about the question of interest. Thus, one is led to the responsibility of statisticians to work on the creation of information-rich environments, i.e., Design of Experiments (DOE). Some brief examples will be presented of where this problem has been observed and designs that have been successfully executed.
Also, data mining is often not used even where the data is actually information rich. Similarly, DOE is often not used even when one might think the implementation was straightforward. We will explore ways that these failures are due both to the lack of specific skills statisticians need so as to be able to influence the behavior of organizations and also to the culture of professional statistics itself. Slightly embarrassing personal stories will be shared. Finally, suggestions will be made as to what each of us can do to have higher impact on business behavior.
Dmitri V. Kuznetsov, Sigma Marketing
Multinomial Logit Models for Retention Analysis with 3 or more choices
We developed and successfully applied a statistical predictive methodology
for a typical marketing retention problem, in which we analyzed the
probabilities that each contract remains active, is cancelled, or results
in a product trade. The analysis is based on activity data sets of a leading
national hi-tech office-equipment company. The methodology uses a
Multinomial Logit Model with an unordered structure of choices. More
specifically, we applied the Generalized Logit Model, where the response is a
function of characteristics of the chooser but not of the choices. We then
discuss next-step model development using the Mixed Logit Model to take
advantage of both the Generalized Logit and Conditional Logit approaches. The
keystone of the methodology is the SAS procedure LOGISTIC with the option
LINK=GLOGIT and backward and stepwise selection of variables. Analysis
with the LINK=GLOGIT option in LOGISTIC became possible in SAS starting with
version 8.2, and for the present study it is much more convenient
than the CATMOD procedure. The total computer analysis of the problem using SAS
took a reasonably short time (hours) and did not show any
convergence problems.
The methodology provided:
- Simultaneous analysis of 3 customer-decision choices for different types/sizes of products employed by different-type customers. (The simultaneous analysis improves predictive accuracy due to extended statistical information and elimination of conditional grouping of products. Moreover, it dramatically decreases modeling time.)
- Most significant predictors for Active/Cancel/Trade choices.
- Estimated Active/Cancel/Trade probabilities for each contract that can be compared to one another.
- Extraction of contracts for Cancel/Trade Code Red from the company data sets.
- Information about actionable triggers for preventative strategies combined with implementation of business rules.
- Analysis of sales-decision cycles.
- Early Warning approach.
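To make the Generalized Logit Model concrete, the following minimal sketch turns a set of coefficients into Active/Cancel/Trade probabilities for one contract. It only shows the scoring step, not the model fitting done with the SAS LOGISTIC procedure. The function name, the choice of "Active" as the reference category, and all coefficient values are hypothetical.

```python
import math


def generalized_logit_probs(x, coefs, reference="Active"):
    """Return choice probabilities under a generalized (baseline-category) logit.

    x: list of characteristics of the chooser (here, of the contract).
    coefs: dict mapping each non-reference choice to (intercept, slopes).
    The reference choice has its linear predictor fixed at zero.
    """
    scores = {reference: 0.0}
    for choice, (intercept, slopes) in coefs.items():
        scores[choice] = intercept + sum(b * v for b, v in zip(slopes, x))
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical coefficients for two contract characteristics.
coefs = {
    "Cancel": (-1.0, [0.8, -0.2]),
    "Trade": (-0.5, [0.1, 0.6]),
}
probs = generalized_logit_probs([1.5, 0.3], coefs)
```

Because all three probabilities come from one model and sum to one, they can be compared directly for each contract, which is the property the list above highlights.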
Larry Lai, Directv, Inc.
Variable Derivation and Selection for a Customer Churn Prediction Model
In a Customer Relationship Management (CRM) environment, it is not uncommon to
have a huge data warehouse containing all customer touches through
different contact channels: mail, phone call, e-mail, and website, inbound
or outbound, you name it. Data content may either originate at the
time of registration (such as credit risk, dealer channel, geographic,
lifestyle, psychographic, and demographic data) or be longitudinal over the
customer's life (such as consumption, payment, and contact data). Prior to a
CART modeling analysis, it can be more productive to first conduct variable
derivation and selection within a subcategory of attributes with similar
context, e.g. within the billing or customer contact category as opposed to
the entire database. This presentation describes the benefits of some
techniques for conducting "partial" variable derivation and selection before
engaging in a CART analysis and illustrates them with a real case of a
customer churn prediction model.
Daymond Ling, CIBC (Canadian Imperial Bank of Commerce)
Successful Data Mining implementation in a Financial Institution
We all know how powerful Data Mining can be. Financial Institutions world
wide have been mining data for decades. So what makes for a successful
implementation that drives value for a business? In this presentation, I
will share some insights on how to achieve success in a business setting.
Huan Liu, Arizona State University
Active Feature Selection with Large Amounts of Data - A Selective Sampling Approach
Feature selection, as a preprocessing step to data mining, has been very
effective in reducing dimensionality, removing irrelevant data, increasing
learning accuracy, and improving result comprehensibility. Traditional
feature selection methods resort to random sampling in dealing with data
sets with a huge number of instances. In this paper, we introduce the
concept of active feature selection, and investigate a selective sampling
approach to active feature selection in a filter model setting. We present
a formalism of selective sampling based on data variance in comparison with
some class-based approaches, and apply it to a widely used feature
selection algorithm Relief. Further, we show how it realizes active feature
selection and reduces the required number of training instances to achieve
time savings without performance deterioration. We design objective
evaluation measures of performance, conduct extensive experiments using
both synthetic and benchmark data sets, and observe consistent and
significant improvement. We suggest some further work based on our study and
experiments.
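For readers unfamiliar with Relief, the sketch below shows a basic two-class version with numeric features scaled to [0, 1]. The instances used for the weight updates are passed in as `sample_indices`: this is the point where random sampling, or a selective sampling scheme such as the one discussed in the abstract, would plug in. The selective sampling method itself is not implemented here, and the function name and toy data are hypothetical.

```python
def relief(X, y, sample_indices):
    """Basic Relief feature weighting for a two-class problem.

    X: list of instances, each a list of feature values in [0, 1].
    y: list of class labels (each class needs at least two instances).
    sample_indices: indices of the instances used to update the weights.
    Returns one weight per feature; larger means more relevant.
    """
    n_features = len(X[0])
    weights = [0.0] * n_features
    m = len(sample_indices)

    def distance(a, b):
        # Manhattan distance between two instances.
        return sum(abs(u - v) for u, v in zip(a, b))

    for i in sample_indices:
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(n_features):
            # Reward features that differ on the nearest miss and
            # penalize features that differ on the nearest hit.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / m
    return weights


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief(X, y, sample_indices=[0, 1, 2, 3])
```

Using fewer, better-chosen entries in `sample_indices` reduces the number of nearest-neighbor searches, which is where the time savings described in the abstract would come from.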
David Madigan, Rutgers University
Text Categorization: A Review and Some New Results
Text categorization concerns the assignment of documents to predefined
categories. Traditionally librarians and human indexers have carried out
such categorization tasks, sometimes on a large scale. For example, the
US National Library of Medicine engages over 100 human indexers to
assign medical subject headings to 400,000 medical articles a year.
Applications such as e-mail filtering, pornography detection, medical
coding, and news filtering are creating a growing demand for automated
text categorization, especially for categorization algorithms that
can learn from examples. The statistical challenges revolve around
issues of scale - the number of predictor variables can run to the tens
of thousands - and model structure. In recent empirical evaluations,
support vector machines and boosting algorithms have overtaken more
traditional probabilistic classifiers like Naive Bayes. This talk will
describe a computationally efficient Bayesian logistic regression
approach that yields outstanding accuracy.
Ed Malthouse, Northwestern University
Understanding Database Marketing Trigger Events using Survival Analysis and Other Data Mining Methods
This talk shows how survival analysis methods can be used to understand the
effect of trigger events on certain outcomes of interest to the manager, and
compares survival analysis approaches with alternative approaches. Database
managers need to understand what effect, if any, certain customer-initiated
actions (the trigger events) have on various outcomes. For example,
after a hotel customer redeems loyalty points, is the customer more likely
to return again sooner, or later? If the customer has "cashed out" his
points and does not plan to return again, the hotel should plan certain
contacts to earn the customer's loyalty again. If, on the other hand,
participation in the program implies loyalty, different contacts are
necessary. Marketing responses are discussed. A second example of a
trigger event is customers calling their credit card company with a
complaint. The Cox and discrete-time survival models with time-dependent
covariates are reviewed. These methods are contrasted with simpler
logistic regression approaches. We show how data mining methods can be
used to identify customer variables that interact with the time-dependent
covariates; for example, when certain customers redeem points, they are more
likely to defect, while when other customers redeem, they are less likely to
defect. Effects on the baseline hazard function are also discussed. All
methods are illustrated using large data sets from real companies including
an on-line loyalty program, a hotel chain with a loyalty program, a
software company, and a continuity program. Example SAS code will be made
available.
Sreelatha Meleth, University of Alabama at Birmingham
Determining approaches to develop outcome predictive models for human malignancies
The ever-increasing capabilities of molecular biology over the last decade
have made the war against a number of cancers seem more winnable than it
has been for several decades before that. These increasing capabilities,
however, have the opposite effect of seemingly slowing down the pace of
understanding of a disease process. This is
particularly so, when different researchers look at different markers,
using different cut-offs and different outcome measures. The vast
capabilities of the field of data mining offer a solution to the rapid
expansion of the databases that associate molecular markers with disease
prognosis. In this study, we apply a number of data mining techniques to
study the combined effect of molecular markers, demographic variables, and
traditional prognostic indicators such as stage, on the survival of
patients with colo-rectal cancer. We demonstrate that it is possible to use
these techniques to build a predictive model. Ultimately, the purpose of a
predictive model is to enable a clinician to enter a specific set of
biomarker indices, demographic background, and disease stage indices into an
algorithm, in order to derive a prognosis specific to his/her particular
patient.
Authors: Sreelatha Meleth Ph.D, Mike Hardin Ph.D, Upender Manne Ph.D
Roosevelt Mosley and Shawna Ackerman, Pinnacle Actuarial Resources
Use of Credit in Personal Insurance
For financial institutions, credit scores and scoring mechanisms continue to
provide effective means to identify markets and assess risk. In the
insurance industry, the analogous mechanism, insurance scoring, is often
challenged by regulatory constraints. In this session two property and
casualty actuaries will discuss the methods, results and challenges of
using credit in three distinct regulatory environments: an unconstrained
market, a limited use market and a prohibited use market.
Brendan Murphy, Trinity College, Dublin
Exploring Structures in College Applications Data
Applications for third-level courses in Ireland are processed by the
Central Applications Office (CAO).
The applications involve each applicant listing up to ten courses in order
of preference. Places are subsequently offered to applicants on the basis
of their performance in their final second-level examinations.
The college applications process has come under much scrutiny by the public
and media in Ireland. Many criticisms have been made; for example, the
system apparently creates artificial demand for some courses where students
choose high profile (or popular) courses rather than choosing courses on a
vocational basis.
We explore the CAO applications from the year 2000 to establish if there
are structures in the applicants' course choices.
The primary tools for these investigations are cluster analysis, mixture
models and multidimensional scaling.
We establish the existence of clusters of courses where applicants tend to
choose courses within these clusters. These clusters have both a vocational
and geographical basis. An important difference between male and female
applicants is revealed, in particular with respect to courses involving a
language component.
Olivia Parr-Rud, OLIVAGroup and Sigma Marketing
Key Steps for Effective Predictive Modeling
Automated modeling software systems that streamline the predictive modeling
process are enabling data miners to develop sophisticated models for
marketing, risk and customer relationship management. However, the model
processing, which is the main focus of the software systems, is a minor step
in the whole process. The success of the model is highly dependent on the
diligence in the steps leading to and following the model processing.
Determining the objective is the first and most critical step. The modeler
must consider the company objective as well as the data miner's ability to
implement the final model. The next step is getting relevant, accurate data
for the project. After the model is built, thorough validation is key to
ensuring the model's performance. And finally, implementation must be
flawless to ensure the model's success. In this session, I will discuss
these key steps and provide real world examples from a variety of
industries.
Will Potts, Data Miners
Modeling Recurrent Customer Outcomes
Valuable outcomes such as upgrading, downgrading, missing a payment, making
an insurance claim, or ordering from a catalog recur throughout customer
lifetimes. Churn, usually considered to be a terminal event, can
recur after customers are won back or reactivated. Models that predict the
intensity of a recurrent outcome can be used to guide customer-level
interventions. Predictive intensity models are extensions of survival
analysis methods for renewal or non-homogeneous Poisson processes. These
models flexibly estimate the effect of time-dependent covariates such as past
customer behavior and account for individual frailty.
David Press, Greenbrier & Russell
Current State of Analytics--Journey Into Mainstream Corporate Culture
Areas of focus will be executive needs, emergence of statistician service
bureaus, differentiating analytics from reporting, and emerging trends such
as social networking, complex adaptive systems, and analytic level of
confidence requirements.
Bruce Ratner, DM STAT-1 Consulting
A Genetic Jackknife Method: 3-in-1 Tool for Variable Selection, Data Mining and Model Building
The trinity of traditional analytical techniques - variable selection, data
mining, and model building - is presented in detail along with their
strengths and weaknesses. Then, I introduce a new "jackknife" method that
is a 3-in-1 tool for automatically and simultaneously performing the
trinity of techniques: selecting important original variables, finding
patterns within the data by constructing new important variables from the
original variables, and formulating a mathematical equation based on the
best set of original and constructed variables. The jackknife method
(GenIQ) is based on the assumption-free, nonparametric genetic paradigm
inspired by Darwin's Principle of Survival of the Fittest and the
biological operations of reproduction, sexual recombination and mutation.
The GenIQ method offers a clear advantage over current statistical methods,
whose performance is dependent upon theoretical assumptions, predefined
model formulations, and data-type restrictions. A case study, presented
with a tutorial of the GenIQ implementation (software), illustrates the
potential of the new method for building database marketing models.
(Note: The tutorial is NOT for selling software: the GUI
provides a clarifying explanation of the theoretical aspects of the GenIQ
method.)
The intended audience for the session consists of model builders of all levels of expertise, and marketers who use models in the DM Space (direct/database marketing {DDBM/eDDBM}, customer relationship management {CRM/eCRM}, and knowledge discovery/data mining {KDD}). This topic is important and interesting to the DM community because the methodology is inherently new and original, as it is based on the latest machine learning paradigm of decile optimization. The benefit to participants is an alternative to the standard logistic and ordinary linear regression models.
Brett Russ, Blue Cross and Blue Shield of North Carolina
Using Data on Existing Customers to Attract and Retain More Profitable Customers
Blue Cross and Blue Shield of North Carolina is a leader in delivering
innovative health care products, services and information to approximately
2.9 million members, including 500,000 served on behalf of other Blue
Plans. Within the Product and Market Intelligence Department, we are
constantly looking at ways to support the company's mission statement: at
our core, Blue Cross and Blue Shield of North Carolina is a health care
company. Our job is to deliver quality, innovative products, services and
information designed to help our customers improve their health. We are
always striving to come up with innovative ways to enhance profitability
and improve market share while best serving our customers. In this
presentation we will discuss how data mining is used in specific case
studies including "What are Customers Buying," "Profitability Analysis,"
and "Penetration Analysis."
Melinda Satterwhite, Nextel
The use of survival analysis in telecommunications
This presentation will center on the discussion of the use of survival
analysis in telecommunications and will include 1) the availability and
issues of data; 2) flexible hazard versus empirical hazard; and 3) scoring
of the hazards to forecast churn scores.
Vineet Singh, HP
Data Mining to Increase Accuracy for Telecom Fraud Detection
Within the telecommunications industry, fraud worldwide costs US$35-40
billion annually, and continues to increase each year. HP's leading Fraud
Management Solution (FMS) provides comprehensive fraud detection,
prevention, and response for wireless and fixed line operators. It provides
the framework to use rule-based techniques and data mining to increase
accuracy for fraud detection. In this talk, we will discuss our experience,
challenges, and successes in this application of data mining.
Robert A. Stine, Wharton School, University of Pennsylvania
Awktion Modeling of Wide Data Sets
The variety of choices presented by wide data sets having many features challenges data miners. The first obstacle is speed. The abundance of features slows even fast methods like forward stepwise selection to a crawl. Each step of finding and adding the best predictor can take hours. The second obstacle raised by the abundance of features is overfitting. Expansive searching increases the chances for adding spurious predictors, features that fit well in-sample but generate poor predictions out-of-sample. To overcome these challenges, data miners often resort to automatic procedures that close the problem to substantive knowledge. This automation presents the third obstacle: the inability to exploit domain experts.
Awktion modeling addresses all three challenges. Awktion (auctions with knowledge that inhibits overfitting noise) modeling uses an auction to blend multiple streams of features into a model. These streams of features come from substantive and universal recommenders, algorithms that generate features from the raw data. Each recommender offers features for inclusion in a predictive model. Recommenders that identify useful predictors gain the wealth needed for placing further features into an accumulating model. A recommender can be fully automatic or generate features using the knowledge of domain experts.
An example using financial data illustrates the ideas.
Steve Tanner, University of Alabama at Huntsville
The Role of Data Mining in Data Usability
Users of data mining applications often wish to access data in a wide range
of formats and physical locations. Simply accessing this data can be a
daunting task and can require significant time and effort by the user.
This can involve both real-time and archived data from several sources, in
formats varying from character format and packed binary to "standard"
scientific formats and self-describing formats. This heterogeneity results
in data-application interoperability problems for scientific and mining
tools.
This presentation will show several approaches to dealing with these
interoperability issues. These include tools and techniques that
researchers at the Information Technology and Systems Center located at the
University of Alabama in Huntsville have developed. Such approaches
include providing users with multiple mining environments from large server
based systems to distributed web services within a grid computing
environment to fast real-time processes running directly on board sensor
systems. Some time will be spent discussing the Algorithm Development and
Mining System (ADaM), a mining toolkit, and the Earth Science Markup
Language (ESML), an elegant interchange technology that enables data
interoperability with applications without enforcing a standard format.
Marietta Tretter, Texas A&M University
CRM, or Not, in Archaeology: Mining Rock Art
Native American rock art presents an interesting challenge to
archaeologists, chemists, and statisticians. Much of the recording and
analysis of rock art has been done by avocational archaeologists. Although
rock art is often associated with archaeological sites, until recently, it
has not been possible to date it, thus making it of less interest to
professional archaeologists. Through modern chemical analysis the painted
rock art can now be dated. The recorded rock art involves drawings,
photographs, verbal descriptions of the art, and site maps. Many volunteer
groups have spent many hours creating volumes of documentation in varied
formats. All of this 'data' is mostly stored away in paper archives which
few researchers have access to. This is an opportunity awaiting a data
mining solution. A few projects are working on putting this documentation
into digital libraries so that it can be searched and analyzed. This talk
will present some of the current data mining analysis that can be done on
this newly digital data. The analysis includes Text Mining and general Data
Mining analysis.
John Wallace, Business Researchers
Using SAS Text Miner to Analyze Call Center Data
Nearly every organization interacts with its customers or members through
the call center. Call center data serves as the platform to discuss SAS
Text Miner and the process involved in using it and other analytical tools
for text mining. The pre-processing of data, exploratory analysis and
development of synonym and stop lists is covered. The final model is a
hierarchy of Expectation-Maximization Clustering models that total over 100
clusters.
Andreas S. Weigend, formerly of Amazon.com
Online Customer Behavior
Dr. Andreas Weigend, who served as Amazon.com's Chief Scientist until early
2004, shares some insights into online customer behavior. The talk starts
with a discussion of objectives and of sources of data in e-commerce, gives
an analysis of clickstream and purchase data, and presents probabilistic
models for customer intentions and modalities from click streams in real
time. The talk discusses the importance of online experiments, as well as
the need for a framework for modeling and predicting long-term customer
behavior. It ends with a discussion of data mining in online advertising
and online dating, and of leveraging social networks in e-business.
Cary White, University of North Carolina
Data Warehousing at the University of North Carolina Chapel Hill: A Case Study with a Focus on Lessons Learned
Anyone attempting to develop a data warehouse in a higher education
environment encounters unique challenges not always found in a corporate
setting. Diverse constituencies, limited funds, highly political culture
with many power centers, aging source systems, lack of knowledge of
business processes, and the need for an 'instant' enterprise warehouse are
just a few of these challenges. This presentation will uncover some of
these challenges and present some of the choices, lessons
learned, good luck and good judgment that have occurred during the life of
this project.
UNC-CH is in the third year of a major data warehouse initiative. A small
permanent team has been responsible for the technical portion of the
project from requirements-gathering through ETL and finally to deployment.
Within the framework of a case study of the data warehouse lifecycle in
this university environment, you will learn:
- Unique challenges of data warehousing in a university setting
- Turning points in the project and choices made along the way
- Lessons learned during the life of this project, most of which are relevant to data warehousing in any setting
- Some best practices gleaned from the literature and from our experiences in the first years of this project
Audience:
- Project managers
- Data warehouse architects
- Business sponsors and drivers
- Anyone contemplating the development of an enterprise data warehouse with a small budget and limited staffing
What participants say about the M-series:
"The educational content, exchange of ideas, and intellectual environment I found at the conference exceeded my expectations and confirmed SAS' place as the premier data mining conference in the world."
Thad Perry, Ph.D.
Senior Director of Infomatics
"SAS is doing a tremendous service for the data mining community. The conference provides an excellent forum for exchanging ideas and best practices in business and a stage for sharing the latest and best academic research in the field."
Jaideep Srivastava
Professor of Computer Science and Engineering
University of Minnesota
"This was a superb environment - one of the smartest conference venues I have experienced (and I have experienced a lot). The talks went into greater depth than the talks at many such meetings. Many of the talks were particularly valuable in shedding light on different application areas of data mining."
David Hand
Professor and Head of Statistics
Imperial College, London
"This conference is definitely a must. Not only for the information, but for the opportunity it provides to exchange ideas and learn from your colleagues."
Daryl Berry
T-Mobile US
"The information I got from the presentations was great, and it was nice to talk to and exchange experiences with professionals who are pretty much doing the same thing."
Victor Alonso
Zurich Insurance Co
"What really impressed me was the sense of community that normally isn't present at conferences of this size."
Brij Masand
Data Miners
"The conference has opened a whole new world for me."
Rachel Alt-Simmons
Hartford Life Insurance