Posted on June 21, 2013
Excitement around Big Data has significantly increased the demand for data scientists. This hot position has been hailed as the "Sexiest Job of the 21st Century" by the Harvard Business Review.
When we started in the advanced analytics space we didn't use the term 'data scientist.' We were more focused on solving problems than defining a job description. Our initial teams have evolved and adapted to challenges through hands-on experience to find the appropriate methodology for success. However, as the demand for resources grows, ensuring that we recruit individuals who will quickly add value to our teams is a constant challenge.
Quality data scientists may be elusive, but they do exist. From our experience in the trenches, we've found seven core characteristics of a successful data scientist. This is a developing view, but offers some insight into the position and the types of skills we are seeking for our team.
It's rare that any one individual is an expert in all of these areas. However, a successful data scientist understands enough to intelligently collaborate with experts who can fill their knowledge and skill gaps. Below, we describe these seven core competencies and the seven key collaborators of a successful data scientist.
A good data scientist:
Many 'Big Data' projects fail because they do not set clearly defined objectives. With all the hype surrounding data analytics, there's often the false assumption that simply having a lot of data will magically produce valuable results. Frequently there is also the unrealistic expectation that a vendor's black box will easily spit out valuable answers. Such flashy products can make great tools, but without a good project blueprint these tools are just tools.
Just like a science experiment, a Big Data project requires clearly defining the problem to be addressed, and developing a targeted plan for quickly translating raw data into a solution. A data scientist also understands that the results from a complex experiment are frequently influenced by what data is studied and how it's measured – being careful to not accidentally pre-determine an outcome simply because of the way an experiment was designed. Without effective project management, the stakeholders (who are also likely holding the checkbook) will quickly become disillusioned at a lack of tangible results.
‘Big Data’ is largely a misnomer. Many companies have been managing huge volumes of data for years, and storage technologies have largely kept up with storage demand. The real challenge of Big Data is that most of it is a complete mess. Traditional quantitative analysts (aka quants), who rose to prominence during the '90s and early 2000s, are trained mostly to work with highly structured and very clean data (aka ‘dream data’). Dumping messy data on the desks of traditional analysts has proven problematic. Most of the skill and effort in Big Data comes from parsing, cleaning, de-normalizing, re-normalizing, linking, indexing, interpreting and otherwise preparing all this messy data for analysis. A data scientist thrives on tackling this work, pulling together jumbled, disorganized data in order to solve a puzzle.
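To make the clean-up work concrete, here is a minimal sketch of the kind of parsing, normalizing and de-duplicating described above. The field names, date formats and sample records are invented for illustration, not taken from any real system.

```python
# Hypothetical sketch of routine Big Data clean-up: the same customer
# and transaction arrive from different systems in different shapes.
from datetime import datetime

raw_records = [
    {"customer": "  Acme Corp ", "date": "2013-06-21", "amount": "1,200.50"},
    {"customer": "ACME CORP",    "date": "06/21/2013", "amount": "1200.50"},
    {"customer": "Widgets Inc",  "date": "21 Jun 2013", "amount": "75"},
]

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")

def parse_date(text):
    """Try each known date layout until one fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date: %r" % text)

def clean(record):
    """Normalize one raw record into a consistent shape."""
    return {
        "customer": " ".join(record["customer"].split()).title(),
        "date": parse_date(record["date"]),
        "amount": float(record["amount"].replace(",", "")),
    }

# De-duplicate on the normalized (customer, date, amount) key.
seen, cleaned = set(), []
for rec in map(clean, raw_records):
    key = (rec["customer"], rec["date"], rec["amount"])
    if key not in seen:
        seen.add(key)
        cleaned.append(rec)

print(cleaned)  # the first two raw records collapse into one
```

Note that the two 'Acme' records only reveal themselves as duplicates after normalization; that is exactly the kind of judgment call that fills a data scientist's day.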
Most of the data used in a project is already stored somewhere else—be that an e-mail server, transaction database or event logs. A data scientist will need to partner with the owners of those systems and leverage an experienced DBA and/or infrastructure expert that can coordinate access to all this information and integrate it into the project’s compute environment—either through direct feeds or a separate consolidated data store.
Changing what can be accomplished with data analytics often involves changing who can access the raw information. A data scientist that expects immediate and unfettered access to any data source will quickly find himself or herself to be a dataless data scientist.
Results from advanced analytics often create a sudden lack of plausible deniability over what’s going on within a process or workgroup. Such insight frequently creates an immediate impetus for senior managers to implement corrective action and put performance back on track.
All of the above can cause a significant political storm if those managing a Big Data project are not experienced in handling the politics of information. On all these points a senior executive sponsor is key, both to gaining timely access to required resources and to maintaining quality relations with a broad net of stakeholders. A successful Big Data project appropriately coordinates between all stakeholders and ensures that everyone has the opportunity to derive value from the project.
A data scientist must understand where the data comes from, what it means, and even more importantly what it doesn't mean. They recognize not only the power of data analytics in understanding processes and organizations, but also its limitations. Effectively understanding these connections requires a data scientist to develop an appreciation of the business they are working in. An effective data scientist will likely not understand every nuance of the business unit they are working with, but will partner with an appropriate expert to review how things work, why they work that way and how this impacts the creation, collection and interpretation of data.
Being able to connect the dots between data sources and construct quality algorithms is key. Proficiency with typical coding resources (e.g., SQL, Perl, Python, R, PHP) is required to construct appropriate algorithms. However, full-time developers can and should always be involved when transitioning from initial design into a production-level system (e.g., building a real-time reporting package for managers).
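As a small illustration of connecting the dots between sources, the sketch below joins transaction logs to account records on a shared key. The account IDs, fields and figures are hypothetical.

```python
# Minimal sketch: linking two data sources on a shared key, and
# surfacing the records that fail to link (a common real-world snag).
accounts = {
    "A-100": {"name": "Acme Corp", "region": "East"},
    "A-200": {"name": "Widgets Inc", "region": "West"},
}

transactions = [
    {"account_id": "A-100", "amount": 1200.50},
    {"account_id": "A-100", "amount": 75.00},
    {"account_id": "A-200", "amount": 310.25},
    {"account_id": "A-999", "amount": 42.00},  # orphan: no matching account
]

# Aggregate spend per region, flagging transactions that do not link.
spend_by_region, orphans = {}, []
for tx in transactions:
    account = accounts.get(tx["account_id"])
    if account is None:
        orphans.append(tx)
        continue
    region = account["region"]
    spend_by_region[region] = spend_by_region.get(region, 0.0) + tx["amount"]

print(spend_by_region)
print(len(orphans), "unlinked transactions")
```

In production this join would live in SQL or a pipeline framework built by full-time developers; the point is that the data scientist must be fluent enough to prototype it and to notice the orphaned records.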
Successful Big Data teams often incorporate expertise in this area from disciplines that are not traditionally associated with business. Those with a background in analytical sciences such as genomics can often bring a new perspective on handling the computational challenges now being faced by performance improvement initiatives.
A Big Data project is rarely successful because of statistics, but it can quickly fall apart without some basic sense-checking of the results. Most business teams deal in absolutes and the 1 + 1 = 2 world of Excel spreadsheets, but Big Data has a lot of ‘maybes’. How two ‘maybes’ relate to one another, and whether or not a bunch of ‘maybes’ brings one closer to a near-certain ‘yes’, always requires careful statistical sense-checking. Any discussion of results from Big Data should include at least one person capable of confirming the statistical validity of results. A mathematician or statistician who understands the nuances of the source data can be a highly productive partner to the data scientist, but the data scientist must understand enough math to converse with these experts.
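One basic form of this sense-checking can be sketched with a simple two-proportion z-test: given an apparent lift between two groups, is the difference larger than chance alone would produce? The conversion numbers below are invented for illustration.

```python
# Hedged sketch of a statistical sense-check: a 5.2% vs 4.8% conversion
# rate looks like a win, but is it more than noise at this sample size?
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# A 'maybe': with only 1,000 visitors per group, the lift is far from proven.
z, p = two_proportion_z(52, 1000, 48, 1000)
print("z = %.2f, p = %.3f" % (z, p))
```

Running it shows the p-value well above any conventional significance threshold; the same 5.2% vs 4.8% split over a hundred times more visitors would be convincing. That gap between 'looks better' and 'is statistically better' is exactly where the statistician earns a seat at the table.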
A data scientist must always understand why the analysis was sponsored in the first place and be able to present how their work is furthering the achievement of that goal. ‘Cool’ results are great for tweeting out a new infographic, but the senior management sponsors of a project are far more interested in results that are succinct and actionable. If the results do not fundamentally change how and why decisions are made within an organization, then they will be of little value to a commercial enterprise.
Traditionally trained strategy consultants are experienced in communicating nuanced messages to senior management and can be a valuable partner to a data scientist for bridging the gap between results and action.