The starting path to becoming a Data Scientist — Survey data analysis

This article is a deliverable for the final project of the Coding Dojo Data Science Bootcamp provided by the Saudi Digital Academy (SDA).

This story focuses on the answers of respondents who have 1–2 years or less of coding experience. Its purpose is to offer some insight into the data science world and to answer a few questions for those who want to take the data science road and have fun playing with data.

The data comes from the 2020 Kaggle ML & DS Survey, which Kaggle ran among its members for 3.5 weeks in October.

The dataset has 20,037 rows and 355 columns. It contains many null values, and several questions allow multiple answers.
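A quick way to gauge how sparse a survey table is, is to count the empty cells in each column. The sketch below is a minimal illustration in plain Python, not the code used for this analysis. The sample rows and the column names (Q1, Q7_Part_1, Q7_Part_2) are made up for the example, and it assumes that an empty string or None marks a missing answer.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def missing_share(rows: List[Row]) -> Dict[str, float]:
    """Return, per column, the fraction of rows with no answer."""
    if not rows:
        return {}
    columns = rows[0].keys()
    total = len(rows)
    return {
        col: sum(1 for row in rows if not row.get(col)) / total
        for col in columns
    }


# Hypothetical sample: one single-answer column and one multiple-answer
# question spread over two columns (one column per option).
sample_rows: List[Row] = [
    {"Q1": "25-29", "Q7_Part_1": "Python", "Q7_Part_2": ""},
    {"Q1": "22-24", "Q7_Part_1": "Python", "Q7_Part_2": "R"},
    {"Q1": "30-34", "Q7_Part_1": "", "Q7_Part_2": ""},
    {"Q1": "", "Q7_Part_1": "Python", "Q7_Part_2": ""},
]

shares = missing_share(sample_rows)
print(shares)
```

Rows read from a CSV file with csv.DictReader already have this dictionary shape, so the same function could be pointed at them. Columns that belong to a multiple-answer question will naturally look very empty, because each respondent ticks only a few options.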

In this article, we explore the data and look at it from different angles using exploratory data analysis (EDA).

First, let’s take a look at the demographic information in the survey.

Demographic information

Age:

Age group

We can see that more than 50% of the respondents are under 30 years old.
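A figure like this can be checked by turning the age-group counts into shares and adding up the groups below 30. This is a minimal sketch, assuming age is recorded in bins such as 18-21, 22-24 and 25-29. The sample answers are invented for illustration and are not the survey data.

```python
from collections import Counter
from typing import Dict, Iterable


def group_shares(answers: Iterable[str]) -> Dict[str, float]:
    """Map each answer to its share of all non-empty answers."""
    counts = Counter(a for a in answers if a)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {answer: n / total for answer, n in counts.items()}


# Invented sample answers, one per respondent.
sample_ages = ["18-21", "22-24", "22-24", "25-29", "30-34", "35-39", "25-29", "40-44"]

# Assumed bin labels for respondents under 30.
UNDER_30 = {"18-21", "22-24", "25-29"}

shares = group_shares(sample_ages)
under_30_share = sum(share for age, share in shares.items() if age in UNDER_30)
print(f"Share of respondents under 30: {under_30_share:.3f}")
```

Empty answers are left out of the total here, so the shares describe only those who answered the question. Whether to do that is a choice worth stating next to any percentage.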

Gender:

Other: respondents who did not specify their gender

We can see that more than 70% of the respondents (data scientists) were men.

Gender vs. Age:

We can see that the largest age group among men is 25–29, while among women it is 22–24.
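Finding the most common age group per gender comes down to counting (gender, age) pairs. The sketch below shows one way to do it in plain Python. The pairs are invented sample data, and the labels "Man" and "Woman" are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def top_age_group_by_gender(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """For each gender, return the age group with the most respondents."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for gender, age in pairs:
        counts[gender][age] += 1
    # most_common(1) returns a one-item list of (age_group, count).
    return {gender: c.most_common(1)[0][0] for gender, c in counts.items()}


# Invented sample of (gender, age group) answers.
sample_pairs = [
    ("Man", "25-29"), ("Man", "25-29"), ("Man", "22-24"),
    ("Woman", "22-24"), ("Woman", "22-24"), ("Woman", "25-29"),
]

top_groups = top_age_group_by_gender(sample_pairs)
print(top_groups)
```

When two age groups are tied, most_common keeps the one it saw first, so ties deserve a closer look before they are reported.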

Country:

Top 20 countries out of 171 countries

The largest groups of data scientists are from India, the USA, and Brazil.

Education:

More than 70% of respondents hold a bachelor’s and/or master’s degree.

Insight into data scientists with 1–2 coding years or less

Having looked at the survey’s demographic information, it is time to focus on our target group: data scientists with 1–2 coding years or less.
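Narrowing the data to this group is a simple filter on the coding-experience answer. The sketch below is an illustration only: the column name coding_years is hypothetical, and the labels "< 1 years" and "1-2 years" are assumed, so both should be adjusted to whatever the dataset actually uses.

```python
from typing import Dict, List

Row = Dict[str, str]

# Assumed labels for the target group; adjust to the labels in the data.
TARGET_EXPERIENCE = {"< 1 years", "1-2 years"}


def select_target_group(rows: List[Row], column: str = "coding_years") -> List[Row]:
    """Keep only respondents whose coding experience is in the target set."""
    return [row for row in rows if row.get(column) in TARGET_EXPERIENCE]


# Invented sample rows; "id" and "coding_years" are hypothetical columns.
sample_rows = [
    {"id": "a", "coding_years": "< 1 years"},
    {"id": "b", "coding_years": "1-2 years"},
    {"id": "c", "coding_years": "5-10 years"},
    {"id": "d", "coding_years": ""},
]

target = select_target_group(sample_rows)
print([row["id"] for row in target])
```

Respondents who left the question empty are dropped by this filter, which shrinks the group but avoids guessing their experience.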

By focusing on this group, we will try to answer some questions that may help people who are interested in the data world and want to start their learning journey know where to look and how to start!

First question: Where do I start my learning path in data science?

We can divide this question into two parts:

  • What learning platform should I use?

We can see that the most popular learning platform among data scientists with 1–2 coding years or less is Coursera.

  • What language should I learn first?

The most recommended language to learn first as a data scientist is Python, and it is also the most used language among data scientists with 1–2 coding years or less, as shown in the image below.
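Rankings like these come from questions with multiple answers, which need a slightly different count than single-answer questions. The sketch below assumes such a question is stored as one column per option, holding the option name when it was selected and an empty value otherwise. The prefix Q7_Part_ and the sample rows are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]


def count_selected_options(rows: List[Row], prefix: str) -> Counter:
    """Count how often each option was selected in columns starting with prefix."""
    counts: Counter = Counter()
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix) and value:
                counts[value.strip()] += 1
    return counts


# Invented sample: each row is one respondent, each column one option.
sample_rows = [
    {"Q7_Part_1": "Python", "Q7_Part_2": "R", "Q7_Part_3": ""},
    {"Q7_Part_1": "Python", "Q7_Part_2": "", "Q7_Part_3": "SQL"},
    {"Q7_Part_1": "Python", "Q7_Part_2": "", "Q7_Part_3": ""},
    {"Q7_Part_1": "", "Q7_Part_2": "R", "Q7_Part_3": ""},
]

language_counts = count_selected_options(sample_rows, "Q7_Part_")
print(language_counts.most_common())
```

Because one respondent can select several options, these counts can add up to more than the number of respondents. Percentages are therefore best read as "share of respondents who selected this option".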

Second question: What tools/technologies should I learn first?

  • IDE

As shown in the graph, the most popular IDE to start with is Jupyter.

  • Notebook

Colab Notebook ranks highest among the notebook products that data scientists use.

  • Big data product

MySQL appears to be the most used big data product.

  • Visualization library

Matplotlib and Seaborn are the most used visualization libraries.

Improvements and future work

This story is not finished yet. It is still a work in progress, and there are more questions to ask and answer by exploring the dataset from different angles!
