Dec 27 2024 · Piotr Płoński

How to become a Data Scientist?

In recent years, many people have been drawn to the field of data science, often believing it to be a fast track to wealth. It's true that data scientists working for large companies can earn impressive salaries. However, in smaller companies, the earnings might be closer to those of a software engineer. And that's perfectly fine. If you have a natural curiosity, enjoy research, and take pleasure in discovering new insights, data science can still be incredibly fulfilling, regardless of the paycheck.

In this post, I will share my journey to becoming a data scientist, discuss the skills you need to succeed, and offer advice for those considering this career. I'll also reflect on why communication and curiosity matter more than programming in this field.

My Journey to Becoming a Data Scientist

When I started my studies, I planned to become an electrical engineer. However, in my second year, I realized that most job opportunities were for programmers, not electrical engineers. This prompted me to start learning programming, beginning with C and C++. I wasn't very good at programming at first, but I discovered a passion for it, dedicating 8 to 10 hours a day to practice. I learned by doing university projects and solving programming competition tasks.

After my third year, I secured an internship at the Interdisciplinary Center for Mathematical Modeling (ICM) as a junior researcher. There, I worked on modeling the influenza virus and developed a strong interest in research. I read research papers extensively and continued honing my programming skills.

During this time, I discovered neural networks. I took all the available courses on the topic and chose a supervisor specializing in neural networks for my Master of Science thesis. My thesis focused on applying neural networks to high-energy physics. I also collaborated with my friend Robert, who helped me understand both neural networks and high-energy physics. Together we co-authored a research paper about applying neural networks to the parametrization of physics properties.

This experience led me to discover data mining, further deepening my interest in the field. After completing my Master's degree, I decided to pursue a PhD, where I applied machine learning in various scientific disciplines, such as building phylogenetic trees for influenza RNA, understanding children's dyslexia, and advancing high-energy physics. During my studies, I also traveled to leading physics laboratories, such as INFN in Gran Sasso, CERN, J-PARC and Fermilab, working closely with Robert and Dorota. Thank you both! :) Below is a picture of the INFN physics laboratory, beautifully located in Gran Sasso, Italy. I had an amazing time there.

LNGS lab

During my PhD, I also worked part-time as a software engineer at Netezza, which was later acquired by IBM. At IBM, I developed data mining algorithms implemented directly into data warehouses. This was an invaluable experience, blending industry application with research. I learned a lot about the internals of data mining algorithms and their efficient implementations.

After leaving IBM, I became a part-time data scientist at a large call center company, working on diverse machine learning projects for large customers (Samsung, DirecTV, T-Mobile, HP). At the same time, I spent one and a half years as an assistant professor at my university. After that, driven by a strong desire to run my own company, I founded MLJAR, where I focused on developing automated machine learning solutions.

Skills You Need to Become a Data Scientist

I have about 15 years of experience working with data. Here, I've summarized the skills that, in my opinion, are the most important for becoming a data scientist. Of course, there are many more, but I've chosen to focus on the most essential ones. These skills can be divided into two categories: hard skills and soft skills.

Hard Skills

1. Programming

Data scientists use data to provide value. The amount of data is so large that data scientists must rely on computers to perform computations. This is typically achieved using programming languages like Python, R, or SQL, which are among the most popular. However, there are examples of companies that rely heavily on data while using less common languages, such as OCaml at Jane Street Capital (a large proprietary trading firm).

Many people believe that programming is an essential skill. In my opinion, programming is just a tool. Modern no-code and low-code platforms have made this skill less critical for entering the field.

What matters most is the result - how effectively you can use available tools to solve problems.

If you are comfortable working in a no-code environment and can use it effectively, then you don't need to be a programming expert. I believe that with the advancements in Large Language Models, this skill will become much easier to acquire for those just starting out.

2. Algorithms and Mathematics

Understanding machine learning algorithms and mathematical principles is essential for correctly applying them to data problems. There is a distinction between understanding an algorithm conceptually, knowing how to use it, and being able to implement it from scratch. For those pursuing research, it is often necessary to modify or implement algorithms, which requires a deeper understanding of how they work. For practical applications, the key is knowing which algorithm is best suited to a particular problem. Having knowledge of a wide range of machine learning algorithms makes it easier to propose effective solutions.

However, in my experience, many data science problems don't require machine learning at all. Simple statistics - such as averages - combined with a visualization dashboard and, occasionally, email alerts based on hardcoded thresholds, are often more than sufficient :)
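As a minimal sketch of that idea, the Python snippet below computes an average and compares it against a hardcoded threshold. The metric (response time), the sample values, and the threshold are all hypothetical, and the "alert" is just a returned message rather than a real email.

```python
from statistics import mean

# Hypothetical hardcoded threshold: alert when the average response time exceeds 2 seconds.
RESPONSE_TIME_THRESHOLD_S = 2.0


def check_average(values, threshold=RESPONSE_TIME_THRESHOLD_S):
    """Return (average, alert_message); alert_message is None when all is fine."""
    avg = mean(values)
    if avg > threshold:
        return avg, f"ALERT: average {avg:.2f}s exceeds threshold {threshold:.2f}s"
    return avg, None


# Made-up sample data for two days.
normal_day = [1.2, 1.5, 1.1, 1.8]
slow_day = [2.5, 3.1, 1.9, 2.7]

for name, day in [("normal", normal_day), ("slow", slow_day)]:
    avg, alert = check_average(day)
    print(name, round(avg, 2), alert or "ok")
```

In a real setup the message would be sent by email or shown on a dashboard, but the logic often stays this simple.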

3. Domain Knowledge

In my opinion, this is the most important skill among the hard skill set. To apply data science effectively, you need to understand the field you are working in. Whether you are a researcher or working in a business environment, domain knowledge helps you frame problems and create relevant solutions.

In the business world especially, you need to understand where the money comes from. Your job as a data scientist is to use data to provide insights and make data-driven decisions that help the business either generate more revenue or save money.

Soft Skills

1. Communication

Communication is perhaps the most important soft skill for a data scientist. You need to present your results to your boss in terms of business outcomes rather than technical details. For example, it doesn't matter whether you used a neural network or a random forest - the focus should be on how your findings impact the business (show them the money).

In addition, building strong relationships with colleagues across teams is essential. You'll often need data or insights from others, and having good interpersonal connections makes collaboration much smoother. Participating in informal activities, such as joining colleagues for lunch or casual meetings, can help build relationships.

2. Curiosity

Curiosity drives discovery. Understanding how your company operates and generates revenue enables you to use data more effectively to improve the business. Being curious about the "why" behind your tasks will make you a better problem-solver.

There are countless stories of employees who, driven by curiosity, created a simple script in just a few hours. That quick proof-of-concept work often led to significant changes within the company later. Don't be afraid to experiment. It is never time lost. You learn and gain experience.

3. Adaptability

Change is a constant in the life of a data scientist :) Data formats evolve, and processes shift. Staying calm and flexible in the face of these changes is essential.

Imagine spending two weeks building a new ML model, only to have your manager inform you that your team has just gained access to a new data source and you need to redo the entire process. Stay calm. You can use your previous model as a solid benchmark. With the new data source, you can construct a challenger model and compare the two.

At the end of the day, what matters is the value your models bring to the business - not which version ends up in production.
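The benchmark-versus-challenger comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hold-out targets and both sets of predictions are made up, the function names are hypothetical, and mean absolute error is just one possible metric.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pick_model(y_true, benchmark_pred, challenger_pred):
    """Return the name of the model with the lower error, plus both errors."""
    benchmark_mae = mean_absolute_error(y_true, benchmark_pred)
    challenger_mae = mean_absolute_error(y_true, challenger_pred)
    winner = "challenger" if challenger_mae < benchmark_mae else "benchmark"
    return winner, benchmark_mae, challenger_mae


# Made-up hold-out targets and predictions from the two models.
y_true = [10.0, 12.0, 9.0, 15.0]
benchmark_pred = [11.0, 14.0, 8.0, 12.0]   # previous model, old data
challenger_pred = [10.5, 12.5, 9.5, 14.0]  # new model, new data source

winner, benchmark_mae, challenger_mae = pick_model(y_true, benchmark_pred, challenger_pred)
print(winner, benchmark_mae, challenger_mae)
```

Both models must be scored on the same hold-out data, otherwise the comparison says little about which one brings more value.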

How to Start Your Journey as a Data Scientist

Learn about machine learning algorithms

To begin your journey into data science, I recommend starting by learning the basics of machine learning algorithms. At this early stage, don't worry about diving too deep into the technicalities or advanced mathematics. Instead, focus on gaining a general understanding of the different types of machine learning algorithms and their typical applications. Familiarize yourself with the various tasks in machine learning. Here are some key tasks to explore:

  • Classification: Learn how machine learning can be used to categorize data into predefined classes or groups. Common examples include spam detection in emails, churn prediction, credit scoring, medical diagnosis, and emotion detection.
  • Regression: Understand how regression algorithms predict continuous values, as in forecasting house prices, predicting stock market trends, sales forecasting, weather forecasting, and demand prediction.
  • Clustering: Explore unsupervised learning, where data points are grouped based on similarities. This technique is useful in document clustering, customer segmentation or market basket segmentation.
  • Anomaly Detection: Learn how to identify outliers or unusual patterns in the data, which can be critical for predictive maintenance, fraud detection, or network security.
  • Time Series Analysis: Delve into the techniques used to analyze and forecast data that is sequential in nature, such as weather forecasting or sales predictions.
  • Dimensionality Reduction: Understand how to reduce the number of features in your dataset without losing critical information, which can be helpful in reducing computational complexity.
  • Natural Language Processing (NLP): Learn how algorithms process and analyze human language, from chatbots to sentiment analysis.
  • Speech Recognition: Explore how machines convert spoken language into text, a key area in virtual assistants like Siri or Alexa.
  • Computer Vision: Discover how algorithms interpret and process visual information, from facial recognition to object detection in images.

As you start learning, focus on understanding when each approach is appropriate and the type of problem it addresses. You don't need to master all of them right away, but it's important to get a high-level view of the landscape of machine learning and know which method is suitable for different real-world problems.
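To make the difference between two of these tasks concrete, here is a small Python sketch that contrasts classification (predicting a label) with regression (predicting a number). It uses only the standard library, the data is made up, and the helper names are hypothetical. A nearest-neighbor rule and a least-squares line stand in for the many algorithms you could choose.

```python
def nearest_neighbor_label(train_points, train_labels, x):
    """Classification: return the label of the closest training point (1-NN)."""
    distances = [abs(p - x) for p in train_points]
    return train_labels[distances.index(min(distances))]


def fit_line(xs, ys):
    """Regression: ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Classification: made-up count of suspicious words per email -> class label.
label = nearest_neighbor_label([0, 1, 8, 9], ["ham", "ham", "spam", "spam"], 7)

# Regression: made-up house sizes (square meters) -> prices (in thousands).
slope, intercept = fit_line([50, 60, 70, 80], [100, 120, 140, 160])
predicted_price = slope * 65 + intercept
print(label, predicted_price)
```

The output type is the quickest way to tell the tasks apart: a category for classification, a continuous value for regression.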

It is crucial to understand when to train your own machine learning model and when it is more practical to use pre-trained models. Training your own model from scratch can be beneficial when you have a unique dataset. However, in many cases, using pre-trained models can be an excellent choice, particularly for tasks like image classification, natural language processing, or speech recognition, where large, high-quality models are available and have been trained on massive datasets.

If you would like to gain deeper insight into machine learning and the math behind it, I highly recommend the following books:

  • The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (2009). It is math heavy and provides a detailed overview of various machine learning methods. The book is available for free as a PDF: The Elements of Statistical Learning (PDF).
  • Pattern Recognition and Machine Learning by Christopher Bishop (2006). It covers a lot of theory as well as the practical aspects of machine learning techniques. The book is available for free as a PDF: Pattern Recognition and Machine Learning (PDF).

These books will give you a strong understanding of the key concepts and techniques that form the foundation of machine learning and data science. I would recommend them to people who would like to do research in data science.

Learn programming and no-code tools

A data scientist uses computers to extract value from data. At a high level, no one really cares what tool you use to create value, as long as the results are impactful. However, you do need a tool that you're comfortable with to instruct computers effectively.

Choosing a Programming Language

The most obvious starting point is learning a programming language. Among them, Python is the most popular. Its popularity stems from the vast ecosystem of libraries available for tasks such as data visualization, data wrangling, and model building. There are also many resources for learning Python, which makes it easy to get started.

Another option is R, which is widely used in industries like pharmaceuticals and bioinformatics. I learned R before Python because it was necessary for my previous work. However, a friend introduced me to Python, and after switching, I never looked back. Python's versatility is unmatched - you can use it to build web apps, desktop applications, and even games. While I prefer Python, you might find R to be the better tool for your specific needs.

The Importance of SQL

In addition to programming, knowing how to query databases is essential. Every data scientist should have at least a basic understanding of SQL, as it's the standard language for interacting with databases. SQL allows you to retrieve and manipulate data efficiently, making it a critical skill in any data science toolkit.
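As a small illustration of retrieving and aggregating data with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The orders table, its columns, and the rows are hypothetical and only serve as an example.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 15.0), ("alice", 20.0), ("carol", 5.0)],
)

# Retrieve the total spend per customer, highest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
print(rows)

conn.close()
```

The same SELECT, GROUP BY, and ORDER BY pattern carries over to most relational databases, although dialects differ in details.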

Exploring No-Code and AI Tools

Programming is just one way to instruct a computer. If you're comfortable using no-code tools for data science, that's perfectly fine too. These tools have matured significantly and can be great for certain tasks, especially for those who are not programmers by trade.

We're also in the era of AI assistants and code-generation copilots, which allow you to interact with your data through chat. These solutions are fantastic for data scientists at any skill level. For instance, AI assistants can generate code for data analysis, which you can then inspect and refine. Well-written code should read like prose, so you can review it regardless of your coding experience.

Learn by doing

In my opinion, the best way to learn programming or no-code tools is through hands-on experience. Look for projects that genuinely interest you. If you're a student, always choose ambitious projects in your classes. Challenging yourself this way will maximize your learning.

If you're not currently studying, I highly recommend exploring Kaggle. It's a platform for machine learning competitions. Kaggle offers a range of competitions, from those with monetary prizes to those focused purely on gaining knowledge and experience. The Kaggle forums are also home to a vibrant and supportive community of data scientists. Many participants share their code, insights, and discoveries, making it a fantastic environment to learn and grow. Whether you're a beginner or an experienced practitioner, Kaggle is a great place to sharpen your skills.

Improve communication skills

Everyone likes to feel understood. To make this happen, you need to communicate clearly. For a data scientist, it's very important to explain complex ideas in simple words. Some people are naturally good at this and can tell great stories about their work. But if you're not, don't worry! Here are three tips I've used to improve my communication skills.

Write Blog Posts About Your Projects

Start a blog to share what you're working on. Write about your project goals, where your data comes from, how you did the analysis, and what you found out. If you do this often, it will help you think more clearly about your projects and explain them better.

Explain Concepts to Your Family

Try to explain what you're learning or building to your family. Imagine you're talking to a five-year-old, so you have to make it very simple. This is a great way to practice breaking down big ideas into smaller, easy-to-understand pieces.

Talk to Strangers

Practice small talk with strangers, like people at the store or a gas station. These short conversations can help you feel more comfortable talking to others. Don't be nervous - most people enjoy a friendly chat!

Conclusion: Embrace the Journey 🚀

So, there you have it - your roadmap to becoming a data scientist, sprinkled with stories, tips, and a few personal anecdotes. The path to this career isn't linear, and that's the beauty of it! Whether you're starting out as an electrical engineer like I did, diving headfirst into programming, or tinkering with no-code tools, there's room for everyone in this field.

Here's the good news: you don't need to know everything to start. The journey itself will teach you the skills you need, one step at a time. Sure, there will be days when your code won't run, your data will be messy, and you'll wonder if Excel might have been a better career choice. But trust me, those moments are just part of the adventure.

Let's recap some of the key takeaways:

  • Focus on learning by doing: Build projects, join Kaggle competitions, or just solve problems that genuinely interest you. Learning isn't about perfection; it's about progress.
  • Master the tools you need: Whether it's Python, R, SQL, or no-code platforms, choose what works for you and start experimenting. Remember, it's the results that matter, not the method.
  • Keep it simple: Many business problems don't require deep neural networks or complex algorithms. Sometimes, an average and a good chart can save the day.
  • Communicate effectively: Practice explaining your work to a five-year-old, write blogs, or even strike up small talk with strangers. The better you are at communicating, the more impact your work will have.
  • Stay curious and adaptable: Ask questions, experiment, and embrace change. These traits will keep you growing no matter where your data science journey takes you.

Finally, remember to enjoy the ride. Data science isn't just about numbers, algorithms, or big paychecks. It's about curiosity, discovery, and the excitement of turning raw data into meaningful insights. It's okay if your first model doesn't predict the next big stock trend or your initial SQL query takes down the company database (just kidding - let's hope that never happens). Mistakes are just stepping stones to mastery.

So go ahead, take that first step, and start building your future as a data scientist. The data is waiting - now it's your turn to make sense of it. Good luck, and remember to have fun along the way! 😊
