Reflections on Big Data

A look at the human-related challenges of analytics and data science

By: Richard Boire, Partner at the Boire Filler Group

In my book, “Data Mining for Managers: How to Use Data (Big and Small) to Solve Business Challenges”, I explain what is new versus what already exists within Big Data Analytics. The reality is that Big Data has always been with us; it just wasn’t discussed as it is now. In the digital world, and more importantly with technology that puts information at our fingertips, Big Data is now on everyone’s mind: it has achieved mainstream status.

The fact remains that many of the same challenges within Big Data have existed in both the old and the new world. Volume has always been an issue, yet technologies such as Hadoop have facilitated the processing and consumption of ever-increasing volumes of data. But for data scientists or data miners (the older term), volume is not the main issue. Data mining, or data science, is for the most part about transforming data. Data miners earn their salaries through their ability to transform raw source data into meaningful data mining variables. The creation of a meaningful analytical file is therefore still the most important prerequisite in any analytics exercise.

This critical skillset matters even more now that semi-structured and unstructured data are the newer components that need to be mined. Extraction skills, supported by the right programming languages, software, apps, etc., are essential in identifying the key pieces of information. In my book I reinforce the need for skills that identify the business problem or challenge and, equally important, the ability to create a data environment that provides the information foundation for developing a given business solution.

Before the digital explosion of the internet and social media, a typical data mining exercise would involve the data miner asking for as much data as possible. The rationale was to let the data miner filter out the noise in data that was, at the time, all structured. In our Big Data world, massive volumes of semi-structured and unstructured data no longer lend themselves to this approach. The “initial” ask of the data needs to be filtered. This is where “domain” or business understanding is even more significant, as the filtering objective is to extract the information that can potentially build the business solution. The data miner certainly needs the requisite technical skills to extract the right data. But in our new Big Data world it is even more important that he or she be able to identify business problems, or broader business issues, that relate to the overall business strategy.

Historically, raw data consisted of transaction records, customer files, campaign data, and perhaps geo-demographic data. All this data was structured, but the information was meaningless in its raw state. Here is where the data miner or data scientist spends most of their time. By “working the data”, the analyst undertakes an extensive variable derivation process in which the transformation of this data into meaningful variables or fields is the real key to successful data mining. Many transformations occur, involving the following tasks:

  • Summarization of data (means, mins, maxes, standard deviations, medians)
  • Creating ordinal group or categorical type variables as well as yes/no binary type variables
  • Creating time-sensitive variables and change-type variables over time
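
To make these tasks concrete, here is a minimal sketch of variable derivation for a single customer. The transaction records, the reference date, the 90-day windows and the spend-band cut-offs are all assumptions chosen for illustration, not values from any real project.

```python
from datetime import date
from statistics import mean, median, pstdev

# Hypothetical raw transaction records for one customer: (purchase date, amount).
transactions = [
    (date(2023, 1, 15), 40.0),
    (date(2023, 2, 10), 60.0),
    (date(2023, 5, 20), 100.0),
    (date(2023, 6, 5), 120.0),
]
AS_OF = date(2023, 6, 30)  # assumed reference date for time-sensitive variables


def derive_variables(txns, as_of):
    """Turn raw transactions into a handful of derived variables."""
    amounts = [amt for _, amt in txns]
    total = sum(amounts)
    # Spend in the most recent 90 days versus the 90 days before that.
    recent = sum(a for d, a in txns if 0 <= (as_of - d).days < 90)
    prior = sum(a for d, a in txns if 90 <= (as_of - d).days < 180)
    return {
        # Summarization of data
        "mean_amt": mean(amounts),
        "min_amt": min(amounts),
        "max_amt": max(amounts),
        "median_amt": median(amounts),
        "std_amt": pstdev(amounts),
        # Ordinal group variable (cut-offs are arbitrary) and yes/no binary flag
        "spend_band": "low" if total < 100 else "medium" if total < 300 else "high",
        "bought_last_90d": int(recent > 0),
        # Time-sensitive variable and change-type variable
        "days_since_last": (as_of - max(d for d, _ in txns)).days,
        "spend_change": recent - prior,
    }


variables = derive_variables(transactions, AS_OF)
```

In practice the same function would be applied to every customer, producing one row of derived variables per customer.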

These tasks also require the ability to merge and join multiple files into one overall file (the analytical file). As you may already have surmised, a deep understanding of data complemented by deep technical programming knowledge provides unlimited capabilities in “working the data” to create the analytical file. This scenario for data miners has existed for decades and still exists within Big Data today. But let’s talk about what this means in today’s exploding digital environment.
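
Before moving on, the merge step described above can be sketched roughly as follows. The three source files, their field names and the choice to keep every customer (a left join from the customer file) are assumptions made for this example.

```python
# Hypothetical source files, already read into lists of dicts.
customers = [
    {"cust_id": 1, "region": "East"},
    {"cust_id": 2, "region": "West"},
    {"cust_id": 3, "region": "East"},
]
transactions = [
    {"cust_id": 1, "amount": 50.0},
    {"cust_id": 1, "amount": 70.0},
    {"cust_id": 2, "amount": 20.0},
]
campaigns = [{"cust_id": 2, "responded": 1}]


def build_analytical_file(customers, transactions, campaigns):
    """Join the three sources into one row per customer."""
    # Summarize transactions to the customer level first.
    spend = {}
    for t in transactions:
        s = spend.setdefault(t["cust_id"], {"txn_count": 0, "total_spend": 0.0})
        s["txn_count"] += 1
        s["total_spend"] += t["amount"]
    responded = {c["cust_id"]: c["responded"] for c in campaigns}

    rows = []
    for c in customers:
        # Customers with no transactions or campaign history get zero defaults.
        summary = spend.get(c["cust_id"], {"txn_count": 0, "total_spend": 0.0})
        rows.append({**c, **summary, "responded": responded.get(c["cust_id"], 0)})
    return rows


analytical_file = build_analytical_file(customers, transactions, campaigns)
```

Keeping customers who have no transactions is a design choice: for many models the absence of activity is itself a useful variable.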

In much of our digital world, the data arrives in either semi-structured or unstructured format. The newer challenge for data miners is to first convert this raw data into raw structured data before undertaking the variable transformations discussed above. Knowledge of ETL and conversion techniques, such as parsing the data into JSON objects, enables the analyst to create this raw structured data. But keep in mind, the data miner is far from finished. The conversion to raw structured data means nothing unless that data is then transformed into meaningful data mining variables. And all of these data mining capabilities can be meaningless unless they are applied in a more focused manner. With data permeating everywhere, the stock answer of extracting all information becomes less relevant within an exploding infrastructure of information.
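
As an illustration of the conversion step described above, the sketch below parses semi-structured JSON records into rows with a fixed set of columns. The record layout and field names (`created_at`, `text`, `user.name`) are hypothetical and do not follow any particular platform’s API.

```python
import json

# Hypothetical semi-structured input: one JSON object per line, fields vary.
raw_lines = [
    '{"id": 1, "created_at": "2023-03-01T10:15:00", "text": "Love this!", "user": {"name": "ann"}}',
    '{"id": 2, "created_at": "2023-03-02T08:00:00", "text": "Not for me", "user": {"name": "bob"}, "lang": "en"}',
    "not valid json",
]
FIELDS = ("id", "created_at", "text", "user_name")  # columns of the structured table


def to_structured(lines):
    """Parse each line and keep only the fixed set of columns."""
    rows = []
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that cannot be parsed
        rows.append({
            "id": obj.get("id"),
            "created_at": obj.get("created_at"),
            "text": obj.get("text"),
            # Flatten the nested user object into a single column.
            "user_name": obj.get("user", {}).get("name"),
        })
    return rows


structured = to_structured(raw_lines)
```

The result is still raw data, only now in structured form; the variable derivation work still lies ahead.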

The extraction process is now more critical within the data mining process. But this implies that we truly understand the business problem we are trying to solve. For example, if I am trying to understand how engagement with Coca Cola in social media changed from before a marketing promotion to after it, then I might do the following:

  • Extract all tweets with keywords related to Coca Cola that occurred in the two months before and the two months after the promotion date.
  • Convert the above data to JSON objects, extracting the date field using Java-type programming or some API.
  • Create an analytical file consisting of a structured table with only one field (the date field).
  • Create a graphical trend report, using visual analytics, that depicts tweet counts before and after the promotion.
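
The steps above can be sketched in a few lines. This is only an illustration: the promotion date, the 60-day window, the keyword list and the sample tweets are invented, and a real extraction would go through whatever API or data feed is available.

```python
from collections import Counter
from datetime import date, datetime, timedelta

PROMO_DATE = date(2023, 6, 1)   # assumed promotion date
WINDOW = timedelta(days=60)     # roughly two months on either side
KEYWORDS = ("coca cola", "coca-cola", "coke")  # assumed keyword list

# Hypothetical tweets, already parsed from JSON.
tweets = [
    {"created_at": "2023-04-20T09:00:00", "text": "Coke with lunch"},
    {"created_at": "2023-06-10T12:30:00", "text": "Trying the new Coca-Cola promo"},
    {"created_at": "2023-06-15T18:45:00", "text": "coca cola contest entry!"},
    {"created_at": "2023-06-16T07:10:00", "text": "Coffee first"},
    {"created_at": "2023-01-05T07:10:00", "text": "Coke from last winter"},
]


def extract_dates(tweets):
    """Keep the date of each keyword tweet that falls inside the window."""
    dates = []
    for t in tweets:
        text = t["text"].lower()
        day = datetime.fromisoformat(t["created_at"]).date()
        if any(k in text for k in KEYWORDS) and abs(day - PROMO_DATE) <= WINDOW:
            dates.append(day)
    return dates


analytical_file = extract_dates(tweets)  # structured table with one field: date
daily = Counter(analytical_file)         # tweet counts per day, ready for charting
counts = Counter("pre" if d < PROMO_DATE else "post" for d in analytical_file)
```

The `daily` counts are what a visual analytics tool would plot as the trend line, with the promotion date marked on the time axis.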

Now suppose we want to overlay whether the tweet refers to Coca Cola in a positive or negative manner. Using sentiment analysis tools, we would do the following:

  • Using the structured table, create an output file of tweets, which is then input into a sentiment analysis tool.
  • Based on the output of the sentiment analysis tool, codify each tweet as positive or negative.
  • Create a graphical trend report, using a tool such as Tableau, that graphs positive vs. negative sentiment over time, before and after the promotion.
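
A rough sketch of the codification step follows. The tiny word lists below are only a stand-in for a real sentiment analysis tool, and the promotion date and sample tweets are invented; the point is the shape of the output, not the scoring method. A "neutral" label is added here as an assumption for tweets the toy scorer cannot decide.

```python
from collections import Counter
from datetime import date

PROMO_DATE = date(2023, 6, 1)  # assumed promotion date

# Toy lexicon standing in for a real sentiment analysis tool.
POSITIVE = {"love", "great", "refreshing"}
NEGATIVE = {"hate", "flat", "awful"}

# Hypothetical structured table: one row per tweet.
structured_table = [
    {"date": date(2023, 5, 10), "text": "This tastes flat today"},
    {"date": date(2023, 6, 5), "text": "Love the new promo"},
    {"date": date(2023, 6, 9), "text": "Great contest, so refreshing"},
    {"date": date(2023, 6, 12), "text": "Awful ad, hate it"},
]


def toy_sentiment(text):
    """Label a tweet by counting positive versus negative lexicon words."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


# Codify each tweet, then count labels before and after the promotion.
for row in structured_table:
    row["sentiment"] = toy_sentiment(row["text"])

trend = Counter(
    ("pre" if row["date"] < PROMO_DATE else "post", row["sentiment"])
    for row in structured_table
)
```

The `trend` counts, broken out by day or week rather than just pre and post, are what would be loaded into a tool such as Tableau.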

As you can see in this case (general reporting of tweet behavior over a period of time), the extraction process needs to be much more focused on the business problem. Within social media, the old adage of “give me everything” simply consumes too many resources in trying to make sense of the data. At the same time, this focused approach sharpens the data miner’s understanding of the business problem. Although we have always indicated that identifying and understanding the business problem is one of the four key steps in the data mining process, social media data simply reinforces the importance of this stage.

Besides the ability to identify simple engagement and sentiment, there is often the need to probe more deeply into the content. Are there certain themes or topics emerging from the social media conversations? Text mining and text analytics are tools that allow this type of more exhaustive probing. But again, what is the business problem we are trying to solve? If social media engagement rose, as shown by increased tweets or retweets in the post-marketing period relative to the pre-marketing period, we might decide to better understand this scenario through more analysis. Through text mining, we might discover that certain themes or topics are more relevant in driving this engagement to higher levels as a result of the marketing campaign.
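
As a very simple stand-in for text mining, the sketch below compares word frequencies in the pre- and post-marketing tweets to surface terms that gained the most. Real text analytics tools go much further (phrases, topics, stemming); the sample tweets and the stop-word list here are invented for the example.

```python
import re
from collections import Counter

# Assumed minimal stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "and", "to", "of", "is", "my", "for", "in", "this", "with"}

# Hypothetical tweets from the pre- and post-marketing periods.
pre_tweets = ["Coke with my lunch", "Is this the old recipe"]
post_tweets = [
    "Entered the summer contest",
    "The contest prizes look great",
    "Sharing a bottle for the summer contest",
]


def term_counts(texts):
    """Count non-stop-word terms across a list of tweets."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


def emerging_terms(pre, post, top_n=3):
    """Rank post-period terms by how much their count grew over the pre-period."""
    pre_c, post_c = term_counts(pre), term_counts(post)
    gains = {term: n - pre_c[term] for term, n in post_c.items()}
    # Sort by largest gain, breaking ties alphabetically.
    return sorted(gains, key=lambda t: (-gains[t], t))[:top_n]


themes = emerging_terms(pre_tweets, post_tweets)
```

With this sample data the terms tied to the contest rise to the top, which is the kind of signal that would prompt a closer look at what is driving the engagement.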

In the above scenarios, the business problem dictated how we were going to use social media. Now suppose we want to build a customer retention model that can potentially use social media information by mining tweets and classifying them as complaints or non-complaints. The first issue is the ability to match customer records from the given company’s database against the same individuals who are engaging in social media. The number of matches is typically very small and would not necessarily be useful in the larger context of enhancing the overall retention model. The second issue is one of reliability, as some naturally question whether the comments of people in social media are truly representative of the so-called silent majority. Furthermore, the privacy issues in using this type of information may not have been addressed. If our intention is to build better retention models, we might seriously question the usefulness of appending social media data to customer records given these issues.
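
The first issue, the match rate, is easy to measure before any modeling begins. The sketch below assumes, purely for illustration, that both sources carry an email address that may legitimately be used for matching; the records are invented, and in practice the matching key and the permission to use it would both need to be established first.

```python
# Hypothetical customer file and social media profiles.
customers = [
    {"cust_id": 1, "email": "Ann@Example.com"},
    {"cust_id": 2, "email": "bob@example.com"},
    {"cust_id": 3, "email": "cara@example.com"},
    {"cust_id": 4, "email": "dev@example.com"},
    {"cust_id": 5, "email": "eli@example.com"},
]
social_profiles = [
    {"handle": "@ann", "email": "ann@example.com "},
    {"handle": "@zed", "email": "zed@example.com"},
    {"handle": "@anon", "email": None},
]


def normalize(email):
    """Trim whitespace and lower-case so trivially different emails match."""
    return email.strip().lower()


def match_customers(customers, profiles):
    """Return matched customer ids and the share of customers matched."""
    social_emails = {normalize(p["email"]) for p in profiles if p.get("email")}
    matched = [c["cust_id"] for c in customers if normalize(c["email"]) in social_emails]
    return matched, len(matched) / len(customers)


matched_ids, match_rate = match_customers(customers, social_profiles)
```

If only a small share of customers can be matched, as in this invented example, any social media variable will be missing for most of the file, which limits its value to the retention model.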

Big Data and ultimately social media data will continue to grow. As practitioners, we can no longer respond in the old fashion of “extracting everything”, as unlimited data volumes are more the norm with Hadoop-type technologies. We truly need to understand the business problem so that we can effectively extract the right social media information when building the solution. One of the underlying themes in my book is the need for these hybrid skillsets of domain business understanding complemented by the requisite data mining technical skillsets.

In our exploding world of data, this demand is just going to increase. We continue to see great developments in software and technology that provide the necessary tools for the data miner. Yet the real challenge for analytics and data science is human-related, in the sense that it is not about having more data miners and analysts but about having the right analysts who are trained and educated in the principles of data mining. As practitioners move forward in our Big Data world, this ability to understand and learn the domain knowledge of a given business and to grasp its major issues will become even more important as a data mining skillset.

Richard Boire, B.Sc. (McGill), MBA (Concordia), is the author of Data Mining for Managers: How to Use Data (Big and Small) to Solve Business Challenges and the founding partner at the Boire Filler Group, a nationally recognized expert in the database and data analytics industry.