
Data Scientist at a Startup or a Small Company

In contrast to the companion post about data science in a corporate, let's now talk about data science life at a startup or small company, that is, a company with fewer than 100 employees.

A startup has, by definition, an untested business model: the company runs on seed money and may or may not be profitable. In this setting, data science is a core differentiator that can, presumably, disrupt the market and generate infinite profits for investors. So what does a data scientist do there? Unlike in large corporates, a data scientist will work as part of (or closely with) a product team. Since the company is still small, there are few business processes to improve (they are typically non-existent), and since data science is a strategic choice, it is natural for data scientists to sit in a product team. They can sometimes also act as business analysts.

What to expect from working at a startup?

  • Flexibility (home office, flex hours).
  • Relaxed environment and work atmosphere.
  • More modern tech stack and (current) best practices.

While this sounds like a great deal, I would argue that this is not the ideal place to start a career. Startups are fantastic places for experienced professionals, but too chaotic for juniors. I recommend spending at the very least 2 years in corporate first.

Other downsides of startups include:

  • Unrealistic business model, non-existent profits: many startups are little more than thin air.
  • None of the co-founders is tech-savvy: sadly, this is common in healthcare startups.
  • No data: Usually a corollary of the previous one. Due to some unrealistic expectations of non-technical co-founders, artificial intelligence is expected to learn “on its own”, somehow, magically.
  • While this is maybe a personal thing, an unhealthy focus on “family” culture. I like my work relationships to stay where they belong, at work. Sure, sometimes your colleagues can get upgraded into friends, but that should not be the default mode.

Working as a data scientist in a corporate

Thanks to my work in universities and corporate training, I often get to chat with young people interested in working as data scientists. I decided to put together a couple of articles about my experience in both corporates and startups. Hopefully they will be useful to someone.

First of all, what do you mean by “Corporate”?

There is no clear-cut, universally accepted definition. But let's say that by "corporate" I mean a company with roughly 100+ employees. This could be a private or public company, or a government entity. If the company is big enough, there can be different business units, each with its own data science team, or one centralized "Center of Excellence" for all things data science in the organization. Examples include a retail bank, a telecommunications company, or consulting / professional services firms.

What does a data scientist do there?

The role of the data scientist is, in its most common form, business support. Data science is used as a means to optimize day-to-day operations. While there might be some space for innovation or research and development, prototype development is more or less non-existent. This tends to be outsourced to larger consulting companies, sometimes because they have more senior experts and exposure to a wider range of projects and industries, but the most likely reason is scapegoating. If one is not sure about what to do or where to focus, going to a consulting company is a good way to get wide exposure without committing to anything. Consulting companies are the running sushi at the beginning of one's career: try a little bit of everything.

Other roles in a large company include being a subject-matter expert who helps end customers, who are typically non-technical. This is more or less an internal consulting role, or sometimes, in the case of technology vendors, something more like a sales engineer.

What to expect from working at a large company?

The name of the game is processes. A large company does not mind spending money or making people wait; it is all about following structured processes to avoid errors. This can be annoying in one's early career stages, when it feels like there is so much to do that there is no time to waste.

Another feature to expect is a structured chain of command. This comes packaged with the processes: there is a long line of bosses and approvals for everything.

The technology stack is usually already defined in a large company. This is especially true if there is already some data analytics or data science capability. Having to compete with startups nowadays, many companies are discarding expensive proprietary tools in favor of the open-source tools favored by startups.

Finally, while there might be lots of data to analyze, it will very often sit in silos, scattered across the organization. This is often due to politics, and it can lead to genuinely stupid situations. On one project I had to scrape data from an internal website maintained by another department, just because the teams would not cooperate with each other.

What types of models/tasks will I do?

Since a data scientist will mainly support core business operations, models will revolve around the following:

  • Churn, CRM / customer analytics, pricing.
    • Preference for interpretable models, as they need to be used by business decision-makers (see the sketch after this list).
  • Ad-hoc analysis.
    • Employee turnover, productivity.
  • Sometimes, “AI” prototypes.
    • Most likely, not deep learning, but you can get lucky.
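
To illustrate the preference for interpretable models, here is a minimal sketch of a churn model using scikit-learn's LogisticRegression. The churn.csv file and its column names are hypothetical, only meant to show the shape of such a task.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one row per customer, with a binary "churned" label
df = pd.read_csv("churn.csv")
X = df[["tenure_months", "monthly_charges", "support_calls"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A logistic regression keeps the model interpretable:
# each coefficient tells the business how a feature moves the churn odds
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(dict(zip(X.columns, model.coef_[0])))
print("Test accuracy:", model.score(X_test, y_test))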

Sounds like the job could be boring…

It can be, but it can be avoided. The secret is to focus on an industry or problem that interests you. This will keep you engaged, regardless of the politics, processes and all the other nonsense typical of large organizations. A good team, and particularly a good mentor, can help.

What does a good mentor/team look like?

First and foremost, it should be someone smart & kind. There is no place in the modern business world for arrogant assholes. There has never been a solid justification for them to exist in the first place, and there is no need to keep them. In particular, your boss should be a nice person to be around. This is hard to describe objectively, as it is a direct function of your own personality and experiences. I would also insist that your first mentors/bosses be tech-savvy. Beware of non-tech direct managers! They can be super dangerous, since they very often have acute cases of Dunning-Kruger.

It is also nice for a team to have a diverse skillset and seniority. Teams with smart boys that look more like a college dorm than a company get boring pretty quickly.

How can I find out if I am joining a good team?

As a data scientist, you are basically a researcher, so research! Go through their social media, search the internet, read Glassdoor reviews. Ask around in communities like Reddit. If you look them up on LinkedIn, check the average tenure (how long people stay in the company) and look at the individual team members to see where they come from (background, previous experience, etc.). Last but not least, don't be afraid to ask someone out for a coffee or lunch and ask them what it is like to work at that company.

Are there any advantages of working in a corporate?

  • You will learn on one consistent stack, with consistent procedures, and, if you do your research well, with a good mentor and team.
  • Stable job, good to collect brand names in your resume early on that can be leveraged later.
  • Compensation is OK, but at this stage being a good apprentice matters more.

Any disadvantages of working in a corporate?

  • Everything might move slowly.
    • Expect to wait 2-3 weeks even for small requests.
  • Your job is 100% focused on business needs, whether you like it or not.

Calculating Basic Statistics

Whenever we are exploring a new dataset, the very first thing to do is calculate some basic statistics: number of observations, mean or average, minimum, maximum, median and standard deviation.

This helps us get an overview of our data quickly.

We will illustrate this with a dataset consisting of the heights (in cm) of a sample of 18-year-old males. In this case, the measured height of each individual is our value or observation.

Our first statistic, the number of observations or sample size, is easy to get but important: we usually require a minimum of about 30 observations before deciding which statistical test should be applied later.
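
As a minimal sketch in Python (the height values below are made up purely for illustration), we can store the sample in a NumPy array and count the observations:

import numpy as np

# Hypothetical sample of heights in cm
heights = np.array([172.5, 168.0, 181.2, 175.4, 169.9, 178.3, 183.1, 171.0])

n = len(heights)  # number of observations (sample size)
print(n)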

The mean or average is the sum of all observations divided by the number of observations. In several contexts it represents a good summary of what our data looks like.
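
Continuing with the hypothetical heights array from above:

mean_height = heights.sum() / len(heights)  # sum of all values divided by their count
print(mean_height)
print(heights.mean())  # the built-in method gives the same result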

The minimum and maximum are useful for determining the range of the data, that is, the set of possible values that we will find in our dataset. They can be calculated with the min() and max() functions in a similar way.
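
With the same hypothetical array:

min_height = heights.min()
max_height = heights.max()
print(min_height, max_height)  # the range of values in the sample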

The median is the value that is right in the middle. That is, if we order from smallest to largest, the median is the value such that 50% of the values are above it and 50% of the values are below.
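
Again on the hypothetical array, NumPy provides this directly:

median_height = np.median(heights)  # middle value of the sorted sample
print(median_height)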

Sometimes, we will find that the median and the mean are very similar. This can be an indication that there is some symmetry in our data: for every large observation there is also a small observation, in similar proportions. Whenever the median and the mean are different, this means that there is a certain skew in our data, suggesting perhaps the presence of outliers or unusual observations.
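
A quick, made-up illustration of how a single unusual observation pulls the mean away from the median:

skewed = np.array([170.0, 171.0, 172.0, 173.0, 215.0])  # one outlier at 215 cm
print(np.mean(skewed))    # 180.2, dragged upwards by the outlier
print(np.median(skewed))  # 172.0, unaffected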

The standard deviation measures the spread of the data: that is, on average, how far the values in our dataset are from the mean. The larger the standard deviation, the bigger the spread. A small standard deviation suggests that all the observations are similar to each other and to the average.
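
On the hypothetical heights array, and assuming we want the sample standard deviation (ddof=1 divides by n - 1 rather than n):

std_height = heights.std(ddof=1)
print(std_height)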

Beyond model metrics: confusion matrix and parity plots

There are many metrics for assessing the "quality" of a machine learning model, depending on whether one is dealing with a regression or a classification task. There are RMSE, MAPE and R² for regression, for instance, and ROC AUC scores for classification.

However, I find it hard to believe that one can rely only on such vague proxies. I have also faced the situation where the models look great in development, but not so great in practice. This is no surprise, of course, as simple descriptive statistics are expected to miss features of the "shape" of the data. What we observe in practice is then reminiscent of Anscombe's quartet: we might interpret a single metric as an indication that the model performs completely differently than it actually does.

A way around this is to look at the complete picture: for classification problems, one needs to look at the confusion matrix. This is a one-liner in scikit-learn.
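
For example (the labels below are hypothetical, just to show the call):

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0]  # hypothetical true class labels
y_pred = [0, 1, 0, 0, 1, 1]  # hypothetical predicted class labels
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class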

The parity plot is its continuous analog. It can be calculated by:

import matplotlib.pyplot as plt

y_true_min, y_true_max = y_true.min(), y_true.max()
y_pred_min, y_pred_max = y_pred.min(), y_pred.max()
lb = min(y_true_min, y_pred_min)
ub = max(y_true_max, y_pred_max)

fig, ax = plt.subplots()
ax.scatter(y_true, y_pred)
ax.plot([lb, ub], [lb, ub], 'r-', lw=4)
ax.set_xlabel('True values')
ax.set_ylabel('Predicted')
plt.show()

Note that the slightly odd calculation of the lower and upper bounds (lb and ub) comes from the fact that the predictions can sometimes be well off from the true values, especially during model development, so it is worth making sure the reference line spans the full range of both the true and the predicted values.