Data Scientist at a Startup or a Small Company

In contrast to the companion post about data science in a corporate, let’s now talk about data science life at a startup or small company, that is, a company with fewer than 100 employees.

A startup has, by definition, an untested business model: the company runs on seed money and may or may not be profitable. In this setting, data science is a core differentiator that can, presumably, disrupt the market and generate infinite profits for investors. So what does a data scientist do there? As opposed to large corporates, a data scientist will work as part of (or closely with) a product team. Since the company is still small, there are not many business processes to improve (typically they do not exist yet), and since data science is a strategic choice, it is natural for data scientists to be part of a product team. Sometimes they also act as business analysts.

What to expect from working at a startup?

  • Flexibility (home office, flex hours).
  • Relaxed environment and work atmosphere.
  • More modern tech stack and (current) best practices.

While this sounds like a great deal, I would argue that this is not the ideal place to start a career. Startups are fantastic places for experienced professionals, but too chaotic for juniors. I recommend spending at the very least 2 years in a corporate first.

Other downsides of startups include:

  • Unrealistic business model, non-existent profit: many industries are built on little more than thin air.
  • None of the co-founders is tech-savvy: sadly, this is the case at many healthcare startups.
  • No data: usually a corollary of the previous one. Due to the unrealistic expectations of non-technical co-founders, artificial intelligence is expected to learn “on its own”, somehow, magically.
  • An unhealthy focus on “family” culture, although this may be a personal thing. I like my work relationships to stay where they belong, at work. Sure, sometimes your colleagues can get upgraded into friends, but that should not be the default mode.

Working as a data scientist in a corporate

Thanks to my work in universities and corporate training, I often get to chat with young people interested in working as data scientists. I decided to put together a couple of articles concerning my experience in both corporates and startups. Hopefully this could be useful to someone.

First of all, what do you mean by “Corporate”?

There is no clear-cut, universally accepted definition. But let’s say that by “corporate” I mean a company with roughly 100+ employees. This could be a private or public company, or a government entity. If the company is big enough, there can be different business units, each with its own data science team, or one centralized “Center of Excellence” for all things data science in the organization. Examples include a retail bank, a telecommunications company or a consulting / professional services firm.

What does a data scientist do there?

The role of the data scientist is, in its most common form, business support. Data science is used as a means to optimize day-to-day operations. While there might be some space for innovation or research and development, prototype development is more or less non-existent. This tends to be outsourced to larger consulting companies, sometimes because they have more senior experts and exposure to a wider range of projects and industries, but the most likely reason is scapegoating. If one is not sure about what to do or where to focus, going to a consulting company is a good way to get wide exposure without committing to anything. Consulting companies are the running sushi at the beginning of one’s career: try a little bit of everything.

Other roles in a large company include being a subject-matter expert that helps end customers, who are typically non-technical. This is more or less an internal consulting role or sometimes, in the case of technology vendors, it can be something more like a sales engineer.

What to expect from working at a large company?

The name of the game is processes. A large company does not care about money, or making people wait. It is all about following structured processes to avoid errors. This can be annoying at one’s early career stages, when it feels like there is so much to do that there is no time to waste.

Another feature to expect is a structured chain of command. This comes as a package with the processes: there is a long line of bosses and approvals for everything.

The technology stack is usually already defined in a large company. This is especially true if there is already some data analytics or data science capability. Having to compete with startups nowadays, many companies are discarding expensive proprietary tools in favor of the open source tools that startups use.

Finally, while there might be lots of data to analyze, it will very often sit in silos, scattered across the organization. This is often due to politics, and it can lead to genuinely stupid cases. On one project I had to scrape data from an internal website maintained by another department, just because the teams would not cooperate with each other.

What types of models/tasks will I do?

Since a data scientist will mainly support core business operations, models will revolve around the following:

  • Churn, CRM / customer analytics, pricing.
    • Preference for interpretable models, as they need to be used by business decision-makers.
  • Ad-hoc analysis.
    • Employee turnover, productivity.
  • Sometimes, “AI” prototypes.
    • Most likely, not deep learning, but you can get lucky.

Sounds like the job could be boring…

It can be, but that can be avoided. The secret is to focus on an industry or problem that interests you. This will keep you engaged, regardless of the politics, processes and all other nonsense typical of large organizations. A good team, and particularly a good mentor, can help.

What does a good mentor/team look like?

First and foremost, it should be someone smart & kind. There is no place in the modern business world for arrogant assholes. There has never been a solid justification for them to exist in the first place, and there is no need to keep them. In particular, your boss should be a nice person to be around. This is hard to describe objectively, as it is a direct function of your own personality and experiences. I would also insist that your first mentors/bosses should be tech-savvy. Beware of non-tech direct managers! They can be super dangerous since they very often have acute cases of the Dunning-Kruger effect.

It is also nice for a team to have a diverse skillset and seniority. Teams with smart boys that look more like a college dorm than a company get boring pretty quickly.

How can I find out if I am joining a good team?

As a data scientist, you are basically a researcher, so research! Check their social media, search the internet, read the Glassdoor reviews. Ask around in communities like Reddit. If you look them up on LinkedIn, look at the average tenure (how long people stay in the company), look at the individual team members and see where they are coming from (background, previous experience, etc.). Last but not least, don’t be afraid to ask someone out for a coffee or lunch to ask them what it is like to work at that company.

Are there any advantages of working at a corporate?

  • You will learn on one consistent stack, with consistent procedures, and, if you do your research well, with a good mentor and team.
  • Stable job, good for collecting brand names on your resume early on that can be leveraged later.
  • Compensation is ok, but at this stage being a good apprentice matters more.

Any disadvantages of working at a corporate?

  • Everything might move slowly.
    • Expect to wait 2-3 weeks even for small requests.
  • Your job is 100% focused on serving business needs, whether you like it or not.

Calculating Basic Statistics

Whenever we are exploring a new dataset, the very first thing to do is calculate some basic statistics: number of observations, mean or average, minimum, maximum, median and standard deviation.

This helps us get an overview of our data quickly.

We will illustrate this on a dataset consisting of the heights of a sample of 18-year-old males (in cm). In this case, the measured height of each student is our value or observation.

Our first statistic, the number of observations or sample size, is easy to get but important: we usually require a minimum of 30 before deciding the statistical test that should be applied later.

The mean or average is the sum of all observations divided by the number of observations. In several contexts this represents a good estimate of what our data looks like.

The minimum and maximum are useful for determining the range of the data, that is, the interval in which all the values of our dataset fall. They can be calculated with the min() and max() functions.
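As a minimal sketch in Python, using only the standard library, the statistics described so far can be computed as follows. The heights are made-up values for illustration (and only ten of them, to keep the listing short), not a real dataset.

```python
import statistics

# Hypothetical heights in cm, invented for this example
heights_cm = [172.0, 168.5, 181.0, 175.5, 169.0,
              178.0, 174.0, 183.5, 170.5, 176.0]

n_obs = len(heights_cm)                    # number of observations
mean_height = statistics.mean(heights_cm)  # sum of values / number of values
min_height = min(heights_cm)               # smallest observation
max_height = max(heights_cm)               # largest observation

print(f"n={n_obs}, mean={mean_height:.1f}, "
      f"min={min_height}, max={max_height}")
```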

The median is the value that is right in the middle. That is, if we order from smallest to largest, the median is the value such that 50% of the values are above it and 50% of the values are below.

Sometimes, we will find that the median and the mean are very similar. This can be an indication that there is some symmetry in our data: for every large observation there is also a small observation, in similar proportions. Whenever the median and the mean are different, this means that there is a certain skew in our data, suggesting perhaps the presence of outliers or unusual observations.
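To see how an unusual observation pulls the mean away from the median, here is a small sketch with made-up heights. In the second list one value is assumed to have been typed in wrongly (1720 instead of 172), which is just an invented example of an outlier.

```python
import statistics

# Hypothetical heights in cm, invented for this example
clean = [172.0, 168.5, 181.0, 175.5, 169.0,
         178.0, 174.0, 183.5, 170.5, 176.0]

# Same data, but the first value has an assumed data-entry error
with_outlier = [1720.0] + clean[1:]

mean_clean = statistics.mean(clean)
median_clean = statistics.median(clean)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

# Clean data: mean and median are close to each other
print(f"clean:        mean={mean_clean:.2f}, median={median_clean:.2f}")
# One outlier: the mean jumps, the median barely moves
print(f"with outlier: mean={mean_outlier:.2f}, median={median_outlier:.2f}")
```

The median only moves by one centimeter, while the mean almost doubles, which is exactly the kind of gap that hints at outliers.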

The standard deviation measures the spread of the data. That is, roughly, how far the values in our dataset are from the mean on average. The larger the standard deviation, the bigger the spread. A small value of the standard deviation suggests that all the observations are similar to each other and to the average.
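A short sketch of this, again with invented numbers: two samples with the same mean but very different spread. It uses statistics.stdev from the Python standard library, which computes the sample standard deviation (dividing by n - 1); statistics.pstdev would be the population version.

```python
import statistics

# Two hypothetical samples (cm), both with a mean of 175
tight = [174.0, 175.0, 176.0, 175.0, 175.0]
spread = [160.0, 190.0, 175.0, 165.0, 185.0]

sd_tight = statistics.stdev(tight)    # sample standard deviation
sd_spread = statistics.stdev(spread)

print(f"tight:  mean={statistics.mean(tight)}, sd={sd_tight:.2f}")
print(f"spread: mean={statistics.mean(spread)}, sd={sd_spread:.2f}")
```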

Beyond model metrics: confusion matrix and parity plots

There are many metrics for assessing the “quality” of a machine learning model, depending on whether one is dealing with a regression or a classification task. There are RMSE, MAPE and R² for regression, for instance, and ROC curves and AUC scores for classification.

However, I find it hard to believe that one can rely only on such vague proxies. I have also faced the situation where models look great in development, but not so great in practice. This is no surprise, of course, as simple summary statistics are expected to miss features regarding the “shape” of the data. What we observe in practice is then similar to Anscombe’s quartet: we might interpret a single metric as an indication that the model performs completely differently than it actually does.
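A toy illustration of the point, with invented numbers: two sets of predictions that get exactly the same RMSE even though they fail in very different ways. The rmse helper is written out by hand here so the example has no dependencies.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [1.0, 2.0, 3.0, 4.0]
pred_biased = [2.0, 3.0, 4.0, 5.0]    # every prediction is off by +1
pred_one_miss = [1.0, 2.0, 3.0, 6.0]  # three perfect, one off by +2

rmse_biased = rmse(y_true, pred_biased)
rmse_one_miss = rmse(y_true, pred_one_miss)

# Both models score the same, but one is consistently biased
# and the other is perfect except for a single large miss.
print(rmse_biased, rmse_one_miss)
```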

A way around this is to look at the complete story: for classification problems, one needs to look at the confusion matrix. This is a one-liner in scikit-learn.
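In scikit-learn the call is sklearn.metrics.confusion_matrix(y_true, y_pred). To make explicit what is being counted, the sketch below builds the same table by hand with only the standard library, using the same layout (rows are the true classes, columns are the predicted classes). The labels are made up for the example.

```python
from collections import Counter

# Hypothetical true and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

labels = sorted(set(y_true) | set(y_pred))

# Count how often each (true, predicted) pair occurs
pair_counts = Counter(zip(y_true, y_pred))

# Rows: true class, columns: predicted class
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

for label, row in zip(labels, matrix):
    print(f"true={label}: {row}")
```

The diagonal holds the correct predictions; the off-diagonal cells show which kind of mistake the model makes, something a single accuracy figure cannot tell you.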

The parity plot is its continuous analog. It can be drawn with:

    import matplotlib.pyplot as plt

    # y_true and y_pred are assumed to be NumPy arrays (or pandas Series)
    # holding the true and the predicted values.
    # Common axis limits covering both the true and the predicted range
    lb = min(y_true.min(), y_pred.min())
    ub = max(y_true.max(), y_pred.max())

    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred)
    ax.plot([lb, ub], [lb, ub], 'r-', lw=4)  # diagonal = perfect predictions
    ax.set_xlabel('True values')
    ax.set_ylabel('Predicted')
    plt.show()

Note that the somewhat odd calculation of the lower and upper bounds (lb and ub) comes from the fact that the predictions can sometimes be well off from the true values, especially during model development, so the diagonal has to cover both the true and the predicted range.

Data Science Job market in Czechia

I have done a big part of my career as a data scientist in Prague. During this time, I worked for different companies as a full-time employee and as a contractor, both as an individual contributor and as middle management. I also helped hire, and sometimes also fire, candidates. This has given me a good overview of the local data science job market that might be useful to you.

There are roughly three tiers: international companies, companies focused on the local market and startups. Here are some notes about my experience with each of them.

International companies

There are many companies that are based in Czechia but whose primary market is other EU countries or the US. These companies tend to have higher compensation, from 40-60 thousand Czech crowns for entry level, and 80-100 thousand with 2-3 years of experience. The working language is usually English, and the environment is friendly to foreigners. As the main business unit or the final client is abroad, this higher pay sometimes comes with meetings in different time zones and traveling. It also means that the output of the work may not be visible, and one can feel detached from it, especially in consulting companies. Since these are usually big brands, they give your resume a boost.

Local market

These are companies whose main customers are in Czechia. Some of them have strong data science teams, like telecom operators (O2, T-Mobile, Vodafone), but there are many others. In these companies the pay is usually below that of international companies. They may not necessarily be English-friendly, which could be a problem for foreigners. One advantage is that the work feels (and is) closer: you can see the outputs of your models in your everyday life. They are also more involved with the local community in general, sponsoring hackathons, events, etc.

Startups

Startups overlap with the previous two, but they are an important part of the local data science job market and are a bit different. They tend to have friendlier schedules, be more open to remote work and be more relaxed in general. This makes the transition from university a bit easier for many, as many startups look more like a cool student dormitory than a traditional workplace. Some of them tackle riskier (i.e. more technically fun) projects and tend to have more modern engineering practices. Among these startups there are both product companies and companies focused on professional services. This second category is a smaller version of the larger consultancies: they have international projects and one gets to rotate from project to project. They tend to be more open to part-time arrangements as well. There is more risk in general if the business model is not resilient.

Is Deep Learning a valuable skill?

Many online courses and universities promote deep learning courses and bootcamps. This enthusiasm also spreads to business decision makers and investors. And then comes the push to incorporate deep learning / artificial intelligence / machine learning models at all costs. For someone transitioning to data science, is deep learning a valuable skill to invest time and money in?

The truth is, unless you are working on images, audio or text, deep learning is not a very valuable skill. Deep learning excels at extracting patterns from high-dimensional data that is generated in a consistent way. Only in those cases does it make sense to use deep learning. The typical data scientist who uses a mixture of SQL and classical data mining algorithms (logistic regression, decision trees, random forests) on tabular data is unlikely to benefit from it.

Even in cases where deep learning would normally work great, one should also be aware of the amount of data. If one does not have enough data, deep learning algorithms will not work. A rough estimate is 10 data points per parameter. This is a completely heuristic figure, which I cannot back up by theory. Modern neural networks have millions of parameters.
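To get a feeling for what this rule of thumb implies, here is a back-of-the-envelope sketch. The network is a hypothetical small fully connected one (784 inputs, one hidden layer of 128 units, 10 outputs), chosen only for illustration, and the factor of 10 is the heuristic above, not a theoretical bound.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

POINTS_PER_PARAMETER = 10     # the rough heuristic, not a hard rule

layer_sizes = [784, 128, 10]  # hypothetical small network
n_params = dense_parameter_count(layer_sizes)
n_points_needed = POINTS_PER_PARAMETER * n_params

print(f"{n_params:,} parameters -> about {n_points_needed:,} data points")
```

Even this tiny network has around a hundred thousand parameters, so the heuristic already asks for about a million data points.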

During my corporate training courses we often benchmark deep learning models against others. In most cases, deep learning models are way below the mark. One exception is outlier detection. In this setting, we often get better results with autoencoders. But otherwise it is hard to make the case for deep learning.

Instead of investing time and money in learning advanced algorithms, newcomers to the field should brush up on their data gathering and analysis skills. That is definitely a differentiator.

Should you go to a startup for your first job as a data scientist?

Startups are all the rage these days, and many people are attracted by some of the benefits: lots of toys in the office, possibility of remote work, flexibility and an informal office culture. This is all well and good, but it comes at a cost. Does that mean you should go to a startup for your first job as a data scientist?

Startups, by definition, do not have a tested business model. Maybe there is a need in the market, but maybe customers are not ready to pay. There will likely be a lot of iteration on the product, and that very often translates into changes in the internal processes.

This can be terrible news for someone who is starting their career as a data scientist:

  • Tasks might not be clearly defined.
  • There will be constant changes in the data models.
  • Objectives will shift, and there will likely be very little consistency in the day-to-day work.

Why is this bad for you as a beginner? Constant changes distract you from getting proficient with a single set of tools.

Ok, so should you avoid startups?

Don’t get me wrong, I think that data scientists should be proficient in many tools. After all, the work is largely the same, whether you use R, Python or a no/low-code solution like Alteryx. But the constant changes in the beginning distract you from grasping the fundamentals of real-life data science work.

Startups are fantastic places to go for your second job, once you have earned enough experience in a corporate environment or a larger company. You will really appreciate the flexibility in the work, and since you will be, at that point, much more familiar with your tools, you will thrive in the wonderful chaos a startup can be. But I would highly recommend beginning your career in a more structured environment. That does not mean you cannot succeed in a startup in your first job as a data scientist, but you should be aware of the risk.