Big data myths according to Phil Simon

Phil Simon, Technology Consultant

Phil Simon, author of Too Big to Ignore: The Business Case for Big Data, discusses two myths of collecting and analyzing data.

Myth: You can get all of the data
We are living in unprecedented times. Never before has so much data been available to us. Forget megabytes and petabytes; exabytes of data now exist. I read recently that the average person in an industrialized society today consumes more information in one day than his fifteenth-century counterpart did in a lifetime.

Despite this unfathomable amount of data, no person or organization can store and retrieve all data. And yes, that includes Google. Its software indexes the Surface Web, not the Deep Web. Some estimates put the latter at 25 times the size of the former. As a result, when you search, you are accessing anywhere from four to six percent of all information on the Internet.
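For readers who want to check the arithmetic behind that percentage, here is a minimal sketch. It assumes the Internet is simply the Surface Web plus the Deep Web, and the function name `surface_share` is made up for illustration. Under that assumption, a Deep Web 25 times the size of the Surface Web leaves the Surface Web with one part in 26, or roughly 3.8 percent, which sits near the low end of the range quoted above.

```python
def surface_share(deep_to_surface_ratio: float) -> float:
    """Return the Surface Web's share of all content, as a fraction of 1.

    Assumption (not from the article): total = surface + deep,
    where deep = deep_to_surface_ratio * surface.
    """
    if deep_to_surface_ratio < 0:
        raise ValueError("ratio must be non-negative")
    return 1.0 / (1.0 + deep_to_surface_ratio)


# The "25 times" estimate mentioned in the text.
share = surface_share(25)
print(f"Surface Web share: {share:.1%}")
```

Other estimates of the ratio would shift the figure, so treat the result as a ballpark rather than a measurement.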

Taking it down a level or thirty, individual authors like me cannot access some very valuable information, such as which specific customers are buying my books. Sites like Amazon and stores like Barnes & Noble keep that information. Nothing would make me happier than knowing my customers, but even in a Big Data world that information eludes me.

You will never get all of the data. Period. Deal with it.

Myth: You need all of the data
No doubt that more data helps, but don’t for a minute think that you need all data to make an informed business decision. Organizations that are effectively leveraging the power of Big Data realize that they will never capture all relevant information.

New sources of data spring up seemingly every day, and it’s not as if they’re all valuable. Some certainly are. E-mail messages, for instance, often contain extremely valuable insights into the state of an enterprise. Smart companies are mining individual messages to gauge employee sentiment and potentially determine who might be leaving.

This is a far cry from saying that all e-mails are equally valuable. It’s hard to make the argument that using text analytics on spam makes much sense.

You don’t need all of the data. Yes, more is better than less, but don’t waste time trying to achieve the impossible.


2 Comments

  1. Tom Deutsch
    Posted March 26, 2013 at 11:05 am | Permalink

    Hi Phil – let me provide a bit of a counterpoint. You write “This is a far cry from saying that all e-mails are equally valuable. It’s hard to make the argument that using text analytics on spam makes much sense.

    You don’t need all of the data. Yes, more is better than less, but don’t waste time trying to achieve the impossible.”

    Well, you need to process all the email to get rid of what is truly non-informative, and doing that accurately while touching all of it isn’t optional. So while it’s true that you don’t want to do analytics on the spam, you have to process all of it to get to the useful bits.

    It’s also worth noting that sampling doesn’t work well as an approach here, and that black-box, non-declarative text models are often problematic as well.

  2. Posted March 26, 2013 at 12:12 pm | Permalink

    Fair enough, Tom. My only point is that we shouldn’t give up on analyzing any of the data just because we can’t get to all of it.

