Help wanted: Data scientist
By Thomas H. Davenport
"Big data" has excited many executives with its potential to transform businesses and organizations. The concept refers to data that is either too voluminous, too unstructured, or from too many diverse sources to be managed and analyzed through traditional means. The related concept of "high-performance analytics" (HPA) involves using new technologies to dramatically accelerate the speed of large-scale analytical projects.
But prospering from big data and HPA takes more than simply employing new technologies. Firms will need to build new capabilities to deal with this new resource, even if they are already experienced users of analytics.
One of the most important of these capabilities is a staff of "data scientists" who do the day-to-day work of big data management and analysis. I interviewed more than 30 of them in a recent study, and here's what you need to know:
What is a data scientist?
Data scientists are hybrids of technologists and quantitative analysts. They work in a variety of organizations, from big data startups to large, established companies like GE, Intuit and EMC.
GE, for example, expects to hire more than 400 computer and data scientists at its new Global Software and Analytics Center in San Francisco to focus on big data for industrial products, such as locomotives, turbines, jet engines and large energy generation facilities.
Data scientists have somewhat different roles from traditional quantitative analysts. Whereas traditional analysts typically use analytics on internally generated data to support internal decisions, the focus of many data scientists is on customer-facing products and processes, where they help to generate products, features, and value-adding services. For example:
- At the business social network LinkedIn, data scientists developed the “People You May Know” and “Jobs You May Be Interested In” features of the site, among others.
- GE is already using data scientists to optimize the service contracts and maintenance intervals for industrial products.
- Google, of course, uses its roughly 600 data scientists to refine its core search and ad-serving algorithms.
- Zynga uses data scientists to target games and game-related products to customers.
- At big data firms in health care, data scientists try to discover the most effective treatments for different diseases.
Given this product-centric focus, data scientists are most likely to be in product development or marketing organizations. Some work in the reporting structure of the chief technology officer (CTO). Those who report to CTOs are likely to work on tools that make data science easier and more productive.
Data scientists who focus on HPA applications don't necessarily need to understand how to process unstructured data, but they do need to understand how analytical work can be divided across multiple parallel servers. They should be able to explore a variety of ways to use the extra time from HPA to refine their models. In addition, they need to try to accelerate decision speeds to match the much faster cycle times of data analysis.
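To make the idea of dividing analytical work across parallel servers concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any HPA product: worker threads stand in for separate servers, and the function names (`split`, `partial_stats`, `parallel_mean_std`) are hypothetical. Each worker summarizes only its own partition, and the small partial results are then combined into an overall mean and standard deviation.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def split(data, parts):
    """Cut the data into roughly equal contiguous partitions."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_stats(chunk):
    """Summarize one partition as (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def parallel_mean_std(data, workers=4):
    """Compute the mean and population standard deviation from per-partition summaries.

    Threads are only a stand-in for servers (an assumption of this sketch);
    the point is that each worker sees just its own slice of the data.
    """
    if not data:
        raise ValueError("data must not be empty")
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    # Combine step: only three numbers per partition travel back.
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, sqrt(max(variance, 0.0))
```

The design choice worth noticing is that the statistic is expressed in terms of quantities that can be added up across partitions. Models that cannot be decomposed this way are harder to spread over many machines.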
Data scientists require technical, business, analytical and relationship skills. From a technical standpoint, many have advanced computer science degrees, or advanced degrees (often PhDs) in fields such as physics, biology or social sciences that require a lot of computer work.
They're not just programmers; many refer to data scientists' computational skills as "hacking" – bending technology to do their bidding in unusual ways. The specific technologies on which data scientists focus include:
- Hadoop, MapReduce, and the related ecosystem of distributed file system tools.
- Programming and query languages such as Python, Java, Pig and Hive.
- Machine learning.
- Nontraditional database tools such as Vertica and MongoDB.
- Natural language processing.
- Statistical tools.
In addition to these technical skills, data scientists also need the attributes previously necessary for analytical professionals, including mathematical and statistical skills, business acumen, and the ability to communicate effectively with customers, product managers and decision makers.
Of course, the combination of these skills is difficult to find in one person, so some companies have created data science teams that together embody this collection of skills.
Finding data scientists
In a recent Economist Intelligence Unit survey of 600 global executives, 54 percent of North American respondents said finding the right people with the right skills is the No. 1 obstacle to launching a successful big data project. Where can an organization find data scientists?
There are few, if any, academic programs in the area, although several are being designed now. Most organizations, however, must recruit and hire individuals from other backgrounds who have skills related to data science. For example, George Roumeliotis, the head of a data science team at Intuit in Silicon Valley (and himself a PhD in astrophysics), seeks people who can develop prototypes in a mainstream programming language such as Java and who have a solid foundation in math, statistics, probability and computer science. He also looks for a feel for business issues and empathy for customers.
A variety of other approaches are in use to develop and hire data scientists. EMC has determined that the availability of data scientists will be an important gating factor in its own big data efforts and those of its customers, so it has created a "Data Science and Data Analytics" training program for its employees and customers. Some large consulting firms are beginning to offer data scientists to their clients. And a Silicon Valley program, the Insight Data Science Fellows Program, takes scientists for six weeks and teaches them the skills they need to become data scientists.
In addition to the challenges of finding and retaining data scientists, there are some other potential difficulties with the role and the profession that may deter some organizations from employing them. Of course, data scientists are expensive, with many of those in startup organizations pulling down large options packages.
Another problem is that while data scientists combine technology-intensive “data wrangling” and analytics, there is often more of the former than the latter – “big data often equals small math.” The amount of effort necessary to deal with large volumes of unstructured data sometimes means there are fewer resources and less time left over for detailed statistical analysis.
Because of the difficulties of extracting and structuring data, current data scientists also often face issues of relatively low productivity. The next generation of data scientists will undoubtedly be more productive and will use tools that make common tasks much easier.
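A small Python sketch can illustrate why "big data often equals small math." Everything in it is hypothetical: the service notes, their format, and the field names are invented for illustration and do not come from any company mentioned here. Most of the code goes to extracting structure from free text, and the analysis at the end is a simple average.

```python
import re
from statistics import mean

# Hypothetical free-text service notes; the format is an assumption for illustration.
NOTES = [
    "2012-03-01 turbine T-17 serviced, downtime 4.5 hours",
    "2012-03-04 locomotive L-02 serviced, downtime 12 hours",
    "note with no usable fields",
    "2012-03-09 turbine T-17 serviced, downtime 3.5 hours",
]

PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<kind>\w+)\s+(?P<unit>[\w-]+)\s+serviced,"
    r"\s+downtime\s+(?P<hours>\d+(?:\.\d+)?)\s+hours"
)


def to_records(lines):
    """The wrangling step: turn free-text lines into structured records."""
    records = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not fit the expected shape
        record = match.groupdict()
        record["hours"] = float(record["hours"])
        records.append(record)
    return records


def mean_downtime_by_kind(records):
    """The 'small math' step: average downtime per equipment kind."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["kind"], []).append(record["hours"])
    return {kind: mean(hours) for kind, hours in grouped.items()}
```

Calling `mean_downtime_by_kind(to_records(NOTES))` on the sample notes gives an average of 4.0 hours for turbines and 12.0 hours for locomotives. The unparseable note is silently dropped, which is itself a wrangling decision that a real project would need to revisit.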
Just as traditional quantitative analysis on "small data" didn't happen without professional and semi-professional analysts, big data can't be analyzed without data scientists.
A data scientist can not only convert unstructured data to structured data and perform quantitative analysis on it, but also help an organization think about what data sources to investigate, what customers really need in data and analysis requirements, and how best to incorporate big data-based products and services into an effective business model.
The many executives who are excited about the potential of big data and high-performance analytics for their organizations need to realize that putting big data to work requires a special breed of analyst. Even if an organization isn't quite ready to aggressively pursue big data opportunities yet, it's worth thinking now about how and when it will acquire the most scarce and valuable resource in big data – the data scientists.
Bio: Thomas H. Davenport is a Visiting Professor at Harvard Business School, co-founder and Director of Research at the International Institute for Analytics, and a Senior Advisor to Deloitte Research.
A degree in big data?
Some existing master's degree programs in analytics, such as the one SAS helped start at North Carolina State University, are including some big data training in their curricula (such as Hadoop programming and dealing with unstructured data).