Meet the data scientist: Victor Fang
By Stephanie Robertson, SAS Insights Editor
Like many in Silicon Valley, Victor Fang has been a data scientist since before the title was coined. As part of our Data Scientist Series, we interviewed Fang, who is revolutionizing video content analysis.
What’s your background and education?
Fang: I received my PhD in computer science from the University of Cincinnati, and got my BA in electrical engineering from the University of Science and Technology of China.
What skills help you most as a data scientist?
In my opinion, data scientist is a rebranding of a category of professions, and was made popular back in 2008 by DJ Patil, et al. [Patil was recently named as the US government's first chief data scientist.] A data scientist today might have held a previous title of research scientist, software engineer, quantitative analyst, statistician, etc. But being a data scientist requires more credentials and training than the titles mentioned above. For example, at Pivotal, we measure the skill set of a qualified data scientist in a high-dimensional Euclidean space and require:
- Mathematical/statistical/machine learning skills.
- Software engineering: R, Python, Java, Matlab, SAS, etc.
- Domain knowledge.
- Communication skills.
- Technical knowledge.
Passion also plays an important role in a data scientist’s life: Are you excited about getting valuable insights out of messy big data by creatively applying machine learning models and scaling your algorithms up to petabyte scale?
When did you figure out you wanted to be a data scientist? What motivated you to become one?
Like most data scientists in Silicon Valley today, I was somewhat of a data scientist even before the title was coined. I’ve always had an addiction to data. I even joked with my PhD advisor once: “I feel comfortable when I can play with data.”
I’ve been working on research and development into machine learning for almost 10 years. My journey toward being a data scientist began when I joined the National Lab of Pattern Recognition at the Chinese Academy of Sciences as a research engineer. My role was to develop real-time intelligent video analytics algorithms and software that ended up deployed in major cities in China. The challenges in computer vision reside in building robust machine learning models that generalize to unseen data. It was at that time I realized machine learning theory is the key, so I decided to pursue my PhD degree.
After I earned my degree, I continued the journey of being a scientist in the industry, solving challenging problems such as computer-aided diagnosis that was FDA approved. Then I decided to join EMC/Pivotal so I could help enterprises use data for deeper business insights and greater value.
What department do you work in and who do you report to?
I am a senior data scientist at Pivotal Data Labs (PDL), Pivotal Inc. in Palo Alto, California. I report to the head of IT/Security Analytics.
How long have you had your job and were you hired specifically to be a data scientist?
I joined EMC as a senior data scientist in 2012, and then we spun out into a pre-IPO company, Pivotal Inc.
Do you work on a team? If so, what’s the makeup of your team?
We work as a small team in a highly agile fashion during each customer engagement (Pivotal Data Science Lab). Besides data scientists like me, our team includes other specialists:
- A data/solution architect who makes sure the Pivotal platform is up and running in the customers’ enterprise environment.
- Data engineers who help load the data into our Pivotal platform.
- A project manager who coordinates the project progress and deliverables with the customer’s stakeholders.
- SMEs (subject-matter experts) who understand the domain knowledge, jargon and pain points.
What’s your job like? Is there a typical day or is each day different? Can you give us a basic idea of what you do and the kind of projects you work on?
At Pivotal, the data scientist role is customer facing. We closely communicate with our Fortune 500 clients to help them become data-driven and predictive enterprises. We have six data science verticals within PDL, and I am in the IT Operation/Security Analytics vertical. I've been working on challenging use cases in Fortune 500 enterprise IT, such as advanced persistent threat, insider threat, reliability engineering, etc. My colleagues and I have filed about 10 US patents around these strategic areas that have enriched the IP portfolio.
Besides customer engagements, I’ve also been leading our Video Analytics Data Lake (VADL) initiative, which I presented at the Strata NYC conference in Oct. 2014. VADL is a disruptive video analytics platform that will help data scientists achieve real-time streaming analytics, as well as big data analytics, in the video content analysis domain.
What’s your biggest challenge?
In each Data Science Lab engagement, we’re solving the most challenging data science problems in large enterprises. To me, the biggest challenge is identifying the many siloed structured and unstructured data sources that can contribute to downstream data science modeling, and then normalizing and cleansing each of them. For example, in large enterprise IT divisions, at least 20 disjoint data sources are collected on a daily basis, such as proxy logs, authentication logs, VPN logs, etc. Properly correlating them to form a 360-degree view of the entities for machine learning, and building meaningful features, are usually the hard parts.
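To make the correlation step concrete, here is a minimal Python sketch of joining several log sources on a shared entity key and aggregating them into one feature row per entity. It is only an illustration under stated assumptions: the three sample sources, the field names (`user`, `bytes_out`, `success`, `country`) and the chosen features are all hypothetical and are not taken from any Pivotal engagement.

```python
from collections import defaultdict

# Hypothetical, hand-made records standing in for three siloed log sources.
proxy_logs = [
    {"user": "alice", "bytes_out": 1200},
    {"user": "alice", "bytes_out": 800},
    {"user": "bob", "bytes_out": 50},
]
auth_logs = [
    {"user": "alice", "success": True},
    {"user": "bob", "success": False},
    {"user": "bob", "success": False},
]
vpn_logs = [
    {"user": "bob", "country": "US"},
    {"user": "bob", "country": "DE"},
]


def build_entity_features(proxy, auth, vpn):
    """Correlate the sources on the shared `user` key and aggregate
    them into one feature dict per user (illustrative features only)."""
    features = defaultdict(lambda: {
        "proxy_bytes_out": 0,   # total bytes sent through the proxy
        "failed_logins": 0,     # count of failed authentication events
        "vpn_countries": 0,     # number of distinct VPN source countries
    })

    for rec in proxy:
        features[rec["user"]]["proxy_bytes_out"] += rec["bytes_out"]

    for rec in auth:
        row = features[rec["user"]]  # ensures the user gets a row either way
        if not rec["success"]:
            row["failed_logins"] += 1

    countries = defaultdict(set)
    for rec in vpn:
        countries[rec["user"]].add(rec["country"])
    for user, seen in countries.items():
        features[user]["vpn_countries"] = len(seen)

    return dict(features)


entity_features = build_entity_features(proxy_logs, auth_logs, vpn_logs)
```

In a real engagement the sources rarely share a clean key, so most of the effort typically goes into normalizing identifiers (user names, IP addresses, host names) before a join like this is possible.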
What’s your biggest accomplishment thus far?
Among multiple accomplishments, such as deploying my models in Fortune 500 clients’ environments and having them operate on a daily basis, I think the biggest one is the Video Analytics Data Lake. It’s a vertical solution platform that I envisioned back in 2012. It aims to revolutionize the legacy video content analysis market by enabling multi-latency analytics, scalability, PaaS cloud-readiness, etc. (For details refer to my talk at Strata 2014.)
The efficiency of a data scientist relies on the platform and tooling. VADL aims to make the video analytics data scientist’s life easier by letting them focus on building analytics algorithms while the platform deals with the rest. The success of Hadoop MapReduce is such an example. (For details refer to my 2013 blog.)
Teaming up with the talented engineers from EMC Lab of China, we were able to construct the VADL prototype with state-of-the-art big data products such as SpringXD, in-memory databases, Hadoop, Spark, etc. It’s a rewarding journey when I imagine how data scientists in the future will benefit from this platform.
What do you enjoy doing in your spare time?
I love traveling, classical music and working out. I also enjoy all creative designs: wearable devices, robotics, and visual arts!
What’s your favorite new technology or app?
Technology in big data and data science is always evolving, like a fashion show. Five years ago I was into Hadoop, Matlab, SAS, etc. on the analytics tools side. After I joined EMC/Pivotal, I embraced more open source tools, such as R, Python, MADlib, Gephi, Spark, Cloud Foundry, etc., that have been growing and constantly improving.
One of my 2014 New Year’s resolutions was “fast data” – emerging technologies like Apache Storm, Spark Streaming, Kafka, SpringXD, etc. And I can check it off my list because I’m co-organizing, with my friends from SAS, Amazon, Yahoo, Twitter, etc., the first SIAM Data Mining Workshop on Big Data & Streaming Analytics, May 2-5 in Vancouver.