31 Flavors of Data Science
In today’s environment, nearly every organization is using data analytics to take a more data-driven approach to understanding its customers, whether to inform marketing strategy or otherwise. When I speak with our Big Data team about how they generate robust, actionable analytics from immense data sets, they are quick to point out that these efforts bring clarity about how people interact with the web. This data is the basis for usable features that inform critical business strategies.
Based on the increase in the number of calls that data scientists are getting from recruiters, you would think there was a deficit in the number of people who can truly decipher all of the data that we keep producing. And you would be right… the “sexiest job title of the 21st century” is a high-demand position because of the massive explosion of data generated and retained by organizations and individuals.
The demand for data scientists is increasing so quickly that McKinsey predicts there will be a 50% gap in the supply of data scientists versus demand by 2018.
With this demand comes growing confusion about what it means to be a data scientist. Much like a SQL DBA who thinks they are a SQL Developer, a whole range of statisticians, analysts, and engineers work with Big Data, but may not be “data scientists.”
Confused? Join the club!
Add the plague of professionals who started changing their titles to “Data Scientist” without any of the necessary qualifications, and the result is confusion in the market, obfuscation in resumes, and exaggeration of skills.
I’ve been recruiting data scientists for nearly six years and the “What is a data scientist?” pandemic continues to grow worse. With the media inflating data scientists’ already-high salaries, the role has captured the imagination of job seekers who believe they can write Hadoop on their resume and get a 50% raise. Fortunately, there’s a lot more to the job than words on a resume.
What is a Data Scientist?
If you are a parent or enjoyed growing up with a sibling, you are quite familiar with what it takes to be a data scientist.
A data scientist is the adult version of my little sisters, who could not stop asking, “Why?” or, “Why not?”
Just like your younger sibling or child, they’re the person who goes into an ice cream shop and gets six very different scoops on their cone.
A sample won’t do.
No, they really need to know what each one tastes like.
They also need to understand why they taste different when eaten together, separately, or in a different order.
Maybe they need more of a different flavor.
Just like the combinations available at your local ice cream shop, data scientists encompass many flavors of work. Data scientists have mastered a piece of the responsibilities of statisticians, analysts, and engineers, along with their own specific requirements. A data scientist’s responsibilities will vary based on the company and the person, and may look more like one of those other titles, rather than a mixture of all three.
A data scientist is someone who does the following tasks:
- Data Cleaning
- Data Analysis
- Modeling / Statistics
The order of these tasks roughly reflects the life cycle of a data science project. If you’re interviewing for these types of hires, keep each of these areas in mind.
There is a tremendous amount of data in our world, but parts of it are not ready to be used. The “data cleaning” aspect of a data scientist’s responsibilities includes ensuring data is consistently formatted and conforms to a set of rules. Data cleaning is about finding hiccups, fixing them, and making sure they’ll be fixed automatically in the future. It matters because all the downstream work can only be as good as the data you’ve assembled.
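To make “conforms to a set of rules” a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the field names (`user_id`, `email`, `signup`), the sample rows, and the two date formats are assumptions, not anything from a real pipeline. The idea is simply that each rule lives in code, so the same hiccup gets fixed automatically the next time it shows up.

```python
from datetime import datetime

# Hypothetical raw records; the field names and values are illustrative only.
RAW_ROWS = [
    {"user_id": "101", "email": " Ana@Example.com ", "signup": "2016-03-01"},
    {"user_id": "102", "email": "bob@example.com", "signup": "03/02/2016"},
    {"user_id": "", "email": "no-id@example.com", "signup": "2016-03-03"},
]

# Assumed set of date formats seen in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return an ISO date string, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row):
    """Return a normalized copy of the row, or None if it breaks a rule."""
    user_id = row["user_id"].strip()
    signup = parse_date(row["signup"])
    # Rule 1: user_id must be numeric. Rule 2: signup must be a known date format.
    if not user_id.isdigit() or signup is None:
        return None
    return {
        "user_id": int(user_id),
        "email": row["email"].strip().lower(),  # normalize whitespace and case
        "signup": signup,
    }


def clean(rows):
    """Split rows into cleaned records and rejects that need a human look."""
    cleaned, rejected = [], []
    for row in rows:
        fixed = clean_row(row)
        if fixed is None:
            rejected.append(row)
        else:
            cleaned.append(fixed)
    return cleaned, rejected


CLEANED, REJECTED = clean(RAW_ROWS)
```

Keeping the rejected rows around, rather than silently dropping them, is a design choice: it is how you find the next rule you need to write.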
A data scientist typically works with data sets that are too large to open in a typical spreadsheet program. The data sets may even be too large to work with on a single computer.
Data analysis is the realm of visualization. This is where plots of data are created in an attempt to understand the information more accurately. Through this process, a data scientist is attempting to craft a story that explains the data in a way that is actionable and easy to communicate. Sometimes this can be something simple, like finding the signals that show when new users convert into long-term users, or something more complex, like figuring out when someone is slowly scamming you for lots of money.
Just like your favorite ice cream, a significant other occupies a unique place within a person’s social network. For example, data scientists at Facebook know who your romantic partner is, whether you formalize the romantic connection on your profile or not. The connection is not characterized by “embeddedness,” the standard way of measuring a connection’s proximity (roughly, how many mutual friends two people share), but by what the researchers call “dispersion.” Instead of settling for the simpler measure, they look at how many of your different social circles a friend is connected to. In other words, your significant other won’t just share many friends with you, but friends from all walks of life: your colleagues, your high school buds, your college friends, your family, and so on. Data analysts are masters of uncovering details others may easily overlook.
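If you are curious how the two measures differ, here is a toy sketch in Python. It is not Facebook’s method: the tiny friend graph is invented, and `dispersion` here is a deliberately simplified stand-in that just counts pairs of mutual friends who do not know each other. `embeddedness` is the plain count of mutual friends.

```python
from collections import defaultdict
from itertools import combinations

# Invented friendship graph: one tight work circle plus a partner who
# touches several otherwise unrelated circles.
EDGES = [
    ("you", "partner"), ("you", "colleague"),
    ("you", "work1"), ("you", "work2"), ("you", "work3"),
    ("you", "cousin"), ("you", "classmate"),
    ("colleague", "work1"), ("colleague", "work2"), ("colleague", "work3"),
    ("work1", "work2"), ("work1", "work3"), ("work2", "work3"),
    ("partner", "work1"), ("partner", "cousin"), ("partner", "classmate"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of friend pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def mutual_friends(graph, u, v):
    """Friends that u and v have in common."""
    return graph[u] & graph[v]


def embeddedness(graph, u, v):
    """Number of mutual friends."""
    return len(mutual_friends(graph, u, v))


def dispersion(graph, u, v):
    """Simplified dispersion: pairs of mutual friends who are not friends themselves."""
    shared = sorted(mutual_friends(graph, u, v))
    return sum(1 for s, t in combinations(shared, 2) if t not in graph[s])


GRAPH = build_graph(EDGES)

for friend in ("colleague", "partner"):
    print(friend, embeddedness(GRAPH, "you", friend), dispersion(GRAPH, "you", friend))
```

In this made-up graph, the colleague and the partner each share three mutual friends with you, so embeddedness cannot tell them apart. The colleague’s mutual friends all know each other, while the partner’s come from separate circles, and that is the difference the simplified dispersion score picks up.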
Modeling / Statistics
A data scientist’s background influences whether they believe they are working on modeling or statistics. Those who studied statistics consider themselves to be statisticians, while everyone else is probably going to identify as a modeler (or if they’re feeling fancy, a “machine learning expert”).
This is where deep theoretical knowledge creeps into data science. Once you secure and understand a clean data set, the next step is generally to make predictions, either about that data or about similar data you’ll encounter in the future. A data scientist spends a lot of time evaluating and tweaking models, as well as going back to the data to bring out new features that can help make better models.
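Here is a small Python sketch of what “evaluating” a model can look like in its simplest form. The numbers are invented (imagine first-week visits versus first-month purchases), and a straight-line fit is just one possible model, chosen here because it fits in a few lines. The point is the habit: hold some data back, and check whether the model beats a naive baseline on data it has not seen.

```python
# Invented data: x could be first-week visits, y first-month purchases.
TRAIN_X = [1, 2, 3, 4, 5, 6]
TRAIN_Y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
HOLDOUT_X = [7, 8]
HOLDOUT_Y = [14.2, 15.8]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def mean_squared_error(predictions, actuals):
    """Average squared gap between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)


# Baseline: always predict the training average.
baseline_value = sum(TRAIN_Y) / len(TRAIN_Y)
baseline_error = mean_squared_error([baseline_value] * len(HOLDOUT_Y), HOLDOUT_Y)

# Candidate model: a straight line fitted on the training data only.
slope, intercept = fit_line(TRAIN_X, TRAIN_Y)
model_error = mean_squared_error([slope * x + intercept for x in HOLDOUT_X], HOLDOUT_Y)

print("baseline error:", round(baseline_error, 3))
print("model error:", round(model_error, 3))
```

If the model cannot beat the baseline on held-out data, that is usually the cue to go back to the data and look for better features rather than to keep tweaking the model.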
Having clean data and a good model is only the tip of the ice cream cone. It doesn’t help anyone if a company is unable to consistently deliver data predictions to its advertising customers. This means building some sort of data product that can be used by people who aren’t data scientists.
This can take many forms: a visual representation, a metric on a dashboard, or an application.
Whether a data scientist is building an application or a proof of concept often depends on the amount of data, how fast operations need to be, and the anticipated audience.
Remember the ice cream cone I mentioned earlier? At the bottom of the cone, you have a melted collection of your little sister’s favorite flavors. Actually, there’s probably a melted mess all over the floor and you’re trying to figure out who is going to clean it up.
Or, does it get cleaned up?
The long-term life cycle of a data science project looks a lot like a melted mess.
It’s possible to go back and redo the analysis because you had a great insight.
It’s also possible that a new source of data will be provided and need to be incorporated.
Perhaps the prototype encounters more use than previously expected.
This is the best thing about data science: you do a lot of things and you do them together, and it’s a nice challenge – just like trying to eat too much ice cream.