Data Science Pet Projects. FAQ
What are data science pet projects, and why are they fascinating?
This article was translated in collaboration with Klokov Alexey. The idea for this article was born after numerous questions about personal projects in the Open Data Science (ODS) community. You will find ideas for new projects and useful code in this article. So, let's go through the frequent questions and give a definition of the data science pet project:
- Why do you need pet projects?
- What stages can the pet project’s development consist of?
- How to choose a topic and collect data?
- How to find computing resources?
- How to put AI algorithms into production?
- How to do a project on GitHub?
- How and why to look for collaborators?
- When does the ODS pet project hackathon take place?
- Where can I see examples of pet projects and stories of ODS participants?
A data science pet project is an out-of-hours activity whose purpose is to solve some problem using data analysis while improving your professional skills.
Why do you need pet projects?
- Go through all the stages of DS project development by yourself, from data collection to production (like a full-stack ML engineer).
- Work with real data, which can be very different from toy datasets (a classic example of ideal data is the Titanic Kaggle competition).
- Find your first job in the absence of experience. You know the vicious circle: you need work experience to get your first job.
- Roll into a new DS area. If you are tired of tabular data, you can jump into computer vision.
- Acquire a broad ML outlook by working with different data domains. Data processing approaches often migrate from one area to another; for example, transformers came to CV from NLP.
- Prepare pipelines for future hackathons and work projects.
- Simplify your technical interview or get additional points during the qualifying rounds to receive a job offer.
- Learn to dive into a new project in a short time.
- Learn to tell other people about your initiatives competently. In your project, you are PO, CTO, CEO (+ a bit of HR).
- Join open-source activities or try your luck in startup activities. Maybe your project will be useful for humanity or be monetizable.
Stages of project development
- Topic search and data collection.
- Data annotation. You can annotate the data yourself or turn to third-party help. As far as I know, some CEOs of data-annotation companies have offered pro bono help if you do not pursue commercial goals.
- Research: data analysis, testing several hypotheses, and building ML models. Here we could talk about useful tools such as DVC, Hydra, MLflow, and WandB, but we will leave those for readers to study on their own.
- Deployment of the built models to production.
- Repository design.
- PR, collecting feedback, attracting collaborators, technical improvement of the project, and deepening /expanding the topic.
Search for the project topic and data
In ML pet projects, the topic is inextricably linked to data. These two objects (topic and data) rarely lie together “on a silver platter”. In my opinion, the following procedure must be repeated several times for a successful independent search:
First, you need to fix one of the two objects: a topic or data. Then, based on the fixed object, find the second one.
Data collection can take different forms, for example:
- unloading from some platform;
- parsing of internet pages;
- combining existing datasets found on kaggle, papers-with-code, GitHub, habr1, habr2, etc.;
- artificial generation;
- computer vision projects might benefit from this article.
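As an illustration of the parsing option above, here is a minimal sketch using requests and BeautifulSoup. The choice of <h2> tags and the helper names are illustrative assumptions, not a universal recipe; real sites also require you to respect robots.txt and rate limits:

```python
import requests                # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_titles(html):
    """Pull the text of every <h2> header out of a page."""
    soup = BeautifulSoup(html, 'html.parser')
    return [h.get_text(strip=True) for h in soup.find_all('h2')]


def collect_titles(url):
    """Download a page and extract its headers."""
    resp = requests.get(url, headers={'User-Agent': 'pet-project-bot'}, timeout=10)
    resp.raise_for_status()
    return extract_titles(resp.text)
```

Keeping the HTML parsing separate from the download makes the extraction logic easy to test offline on saved pages.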
Data-driven topic searching can involve activities such as:
- analyzing existing annotation in the data;
- testing competing products;
- communication with potential users of a future product;
- “cold calls”.
Examples 1 and 2 demonstrate fixing the topic and then searching for data. Examples 3 and 4 indicate fixing the data and then searching for a topic.
Example 1: You are passionate about an activity, for example, running (insert your own), so we fix a slightly more general topic: physical exercise (insert your own). What data accumulates as someone engages in physical activity? If the data is found or collected, you can do predictive analytics or video analysis of what's happening around you.
- there are sensors that detect steps, pulse, breathing depth, heart rate, and body temperature. Then the data will be sensor signals, tabular data, and time series. So we equip ourselves with sensors to record the data, or download an application that performs similar functions;
- someone films themselves or their surroundings during workouts. In this case, the data will be video clips. So we set up a camera to collect data and record many videos of the process;
- and, in fact, why collect the data ourselves? Maybe someone has already done it? A search shows that there is a dataset collected by doctors and patients wearing sensors: A database of physical therapy exercises with performance variability collected by wearable sensors.
Example 2: Let's try to find an unsolved problem. Some time ago, I saw a discussion that there is an open solution for Russian speech recognition (ASR/STT), but no good OCR engine (substitute your own). This needs to be corrected, so we fix the topic: an open-source pretrained ruOCR with the possibility of easy retraining on a custom dataset (substitute your own). Existing solutions, such as EasyOCR, tesseract, ABBYY, and PolyAnalyst, do not work for all data domains, are wrapped in an inconvenient service, or are not open and free at all. Searching for suitable data for the Russian language answers why this problem has not been solved yet: there is little data. But that should not stop us:
- we can pre-train the model on datasets in other languages or on synthetic data and then fine-tune on small Russian datasets;
- to generate synthetic data, you can use someone else's pet projects that render Russian words into images, for example, repo;
- to generate synthetic data, you can take OCR datasets in other languages and replace the foreign words with Russian ones using some conditional GAN (this sub-item could be a separate pet project!).
Example 3: In some random way, you got hold of a dataset: you participated in a Kaggle competition with good data, or you saw the news that some group of enthusiasts or a company had opened its high-quality dataset. For the Russian language, examples are Toloka's public datasets, the Taiga corpus of Russian text, and the corpus of Russian speech. OK, we fix the dataset: Taiga (substitute your own). Now we think about the topic of the project.
- Text classification?
- Unsupervised style transfer? How would one author paraphrase a quote from another author using their own figures of speech?
- Generation of poems on a particular theme? There are many poems in the Taiga corpus.
Example 4: The previous example can be slightly changed: don't wait for a high-quality dataset on some topic to come to you; search for a dataset on the topic of interest right away. If you want, for example, to get good at computer vision (substitute your own), you can find and record data for the task of recognizing the pose of a person/object (substitute your own). After that, come up with a topic:
- video analysis of your physical training;
- interaction with a virtual keyboard via a camera or lidar, inspired by this project;
- computer control with palm movements (2D tutorial; 3D video);
- sign language → speech\text; speech\text → sign language.
How to find computing resources?
If you are a “strong and independent data scientist” with your GPU card and 32GB RAM, then this section is not for you. But those who do not have a good computer should not despair:
- there are free (and paid) cloud computing resources such as Kaggle kernels and Google Colab;
- you can find a "strong and independent data scientist" and team up with them;
- you can do without a GPU when processing small black-and-white pictures (MNIST-like datasets);
- no GPU is required to process RGB images with some classic CV algorithms;
- for tabular data, ML algorithms (without DL), which train perfectly well without a GPU, are often sufficient;
- you should not try to cram a large tabular dataset entirely into a small RAM. In this case, split the dataset into several parts and create a model for each part, then average the models' predictions as an ensemble. There are other approaches/tools, for example, Dask;
- there are companies offering computing resources for non-commercial projects free of charge.
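The split-and-average idea from the list above can be sketched with synthetic data; Ridge regression and a chunk count of 4 are purely illustrative choices here, not recommendations:

```python
import numpy as np
from sklearn.linear_model import Ridge  # pip install scikit-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=10_000)

# Train one small model per chunk instead of holding everything in RAM at once;
# in practice each chunk would be read from disk separately
models = [Ridge().fit(X_part, y_part)
          for X_part, y_part in zip(np.array_split(X, 4), np.array_split(y, 4))]

# Average the per-chunk predictions: a simple ensemble
X_new = rng.normal(size=(3, 5))
pred = np.mean([m.predict(X_new) for m in models], axis=0)
```

Each model only ever sees a quarter of the data, so peak memory is roughly a quarter of the full-dataset approach, at the price of training several models.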
How to put AI algorithms into production?
If the models are built, you will want to use them through a user-friendly interface and share the created product with others; demonstrating your project's work is very important. It would be awesome if you could show the demo at any time, for example, to a friend in the elevator or to the technical team during interviews. Ideally, bring the rollout to "prod": an application running 24/7. Here are some examples of tools:
- Telegram bot
- Streamlit or Gradio app
- Flask on a personal or leased virtual server (VPS)
The advantage of Streamlit/Gradio is a multifunctional UI and free hosting from the authors of the library or from Hugging Face. The easiest and fastest way, in my opinion, is a Telegram bot. The telebot library provides a very convenient API for programming bot actions. Below is Python code that, upon receiving a text message from a user, sends to the chat the temperature of the GPUs or CPU cores of the computer it is running on. This template is easy to adapt to your needs, use, and pass on to others:
import subprocess
import psutil   # pip install psutil
import telebot  # pip install pyTelegramBotAPI

token = '___your___tg_bot___token___'
bot = telebot.TeleBot(token)

@bot.message_handler(content_types=['text'])
def send_temperature(message):
    if message.text.lower() == 'gpu':
        # ask the NVIDIA driver for GPU core temperatures
        temp = subprocess.run(['nvidia-settings', '-q', 'GPUCoreTemp'], capture_output=True, text=True).stdout.splitlines()
        output = '\n'.join(x.strip() for x in temp if 'gpu:' in x.lower())
    elif message.text.lower() == 'cpu':
        output = str(psutil.sensors_temperatures()['coretemp'][1:])
    else:
        output = 'Get the temperature of the GPUs or CPU cores, send: gpu or cpu'
    bot.send_message(message.chat.id, output)

bot.infinity_polling()
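For comparison, the Flask option from the list above can be just as compact; `model_predict` here is a hypothetical stand-in for your own inference code, and the route name and port are arbitrary:

```python
from flask import Flask, jsonify, request  # pip install flask

app = Flask(__name__)


def model_predict(text):
    # hypothetical placeholder: swap in your real model's inference here
    return text.upper()


@app.route('/predict', methods=['POST'])
def predict():
    """Accept {'text': ...} as JSON and return the model's prediction."""
    payload = request.get_json(force=True)
    return jsonify({'prediction': model_predict(payload['text'])})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```

Unlike a Telegram bot, such a service needs a reachable server (e.g., a VPS), but any client that speaks HTTP can call it.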
How to design a presentable project
I recommend putting the project materials that may be useful to you and other people in the future in open access (for example, on GitHub) and document all the work and results well in order to:
- save the code outside of your computer
- successfully demonstrate your project to other people
- remember where you left off (when the project is temporarily frozen)
Design the repository so that "it would not be a shame to show it off". All descriptions and instructions must be unambiguous, complete, and reproducible. I propose a list of items (some are optional) that I tried to stick to when designing my projects: example 1, example 2.
- task description;
- description of the product that solves the problem;
- description of the environment (requirements/Docker/etc.) with installation instructions;
- scripts for getting the data and a link to the data with markup;
- pipelines for experiments with reproduction instructions (working with data, training, validation, visualization of graphs/dashboards);
- product scripts with full launch instructions;
- a link to the weights of the models that are used in the production;
- demos: diagrams, pictures, gifs;
- anything you/others find useful.
How and why to look for collaborators
Undoubtedly, it is possible to work on a project alone. But the task you set your sights on can be enormous, beyond the capacity of one pair of hands. Each subtask can be a separate project, may challenge your motivation, and may take several weeks of steady work; in this case, you cannot do without enthusiasts just like you. Working on a project in a team of like-minded people is more productive, fun, and dynamic. Jointly discussing problems and approaching solutions from many angles helps overcome any difficulty. In addition, the team becomes responsible for deadlines, as a teammate's work may depend on your results. While working in a team, I try to adhere to the rule:
Figure it out yourself — explain it to your comrades
To attract collaborators, you need to be able to talk briefly about your project. In this, a well-prepared speech, a presentation, and a good repository design (see the previous section) help a lot. The approach to finding teammates can also work in the other direction: if you can't decide on your own project, why not join another interesting one? I've been able to find like-minded people:
- in slack ODS: #ods_pet_projects (for pet projects) and #_call_4_collaboration (for commercial ones);
- in ods telegram chat dedicated to pet projects;
- in telegram chats by interests, for example, NLP, RL_pet and RL;
- at the ODS pet project hackathon (see the next section).
When does the ODS pet project hackathon take place?
Every year, at the end of winter, there is an ODS hackathon dedicated to personal pet projects. Enthusiasts join teams, work actively on their projects for two weeks, and then tell each other about the results, share their best practices, find like-minded people, and have a great time. As a bonus, the organizers award top ODS merch to those who have made good progress on their project (the rules may change, please check). To avoid missing the next hackathon, join the telegram chat, remind others about the hackathon, offer your ideas, and help organize it.
And now, some personal stories about pet projects by people from ODS.ai
A message from a chat DL_in_NLP
The birth of HuggingFace
In fact, I watched with my own eyes how HF appeared (no irony). It was at EMNLP 2018 in Brussels. In the conference chat room, either Wolf or someone else, I do not remember anymore, wrote something like, "guys, we want a beer, and then let's code something interesting; who is with us?" It was inspired by Google's presentation of BERT at the same conference; they wanted to try rewriting it in PyTorch. The guys worked half the night and another half a day, and pytorch-pretrained-bert appeared. And what happened later, everyone knows.
ODS.ai nickname: yorko
Yuri first worked on measuring the tone of news about cryptocurrencies; it was the first ML project in the startup. Then it became a team pet-project as part of an ods course on MLOps. In the article, yorko talks about both the technical and organizational aspects of working on a team pet project.
ODS.ai nickname: laggg
In the summer of 2020, I talked about creating an ML agent for the game surviv.io at the intersection of CV and RL algorithms in the ods petprojects channel. At that time, I promised to share my work with the community as soon as something good came out. It seems that this day has come in the fall of 2021. I want to talk about the first release of our ML agent and let anyone run our bot and observe its behavior. What’s been released: an ML bot that knows how to move around based on an incoming picture frame and an inventory state vector. Here’s a repository with detailed instructions and a description of the project.
In the winter of 2022, I participated in the ODS pet project hackathon and found some great fellows; we researched and built a neural environment in which we trained an RL agent to approach objects in the same game. Here's the project repository with beautiful gifs.
ODS.ai nickname: copperredpony
My little pet project helped me (I think) get my first job. After failing interviews for ML positions, I decided to focus on learning computer vision through a pet project. You pick an interesting topic and learn exactly what you need to build it. As a result, this practice worked better than articles and courses alone. And at the interviews, I was asked more about the pet project (I had no other experience) and less about technical things. I happily told them how I screwed up and corrected my mistakes, and the interviewers happily listened. My pet project classified birds by photo and was wrapped in a Telegram bot. It turned out pretty fucked up, and I abandoned it as soon as I found a job. But I should go back and redo it.
ODS.ai nickname: ira_krylova
For two years, I made GIS projects as a hobby, then turned them into a blog on Medium and started showing them around. People looked at them, but I noticed that they remembered not the technical part but the thematic part, and those topics turned out to be close to some of them. This happened a few times. When I applied for my current job, I also showed this blog to everyone, and I think it helped a lot: colleagues were discussing my projects with each other and with me. It turned out that several of my projects overlap with what the company does. Description of the project in this article.
ODS.ai nickname: erqups
I’ve never been in the position of a junior looking for work, but in general, pet projects are a good way to systematize specific knowledge in your head and, at the same time, share experience; maybe they will help someone. Any result is helpful: if you get praise, everything is OK; if you get criticism, it's even better; you change your approach, refine it, and the real project will be much easier. It's a win-win from all sides. For example, when I experimented with approaches to neural network inference on the CPU, I wrote this article and got my first tips on Habr.
ODS.ai nickname: as_lyutov
In October 2021, I graduated from an online school with a degree in data science. Even then, I clearly understood that I needed a good portfolio of projects to qualify for the junior position. The ODS community had its own area of pet projects for beginners, but the ideas did not resonate.
In the fall, I participated in Raiffeisen Bank's hackathon on predicting real estate prices. I participated alone, got into the top 50, and gained good experience and a project for my portfolio. Then I looked through the bank's vacancy list and saw an attractive quantitative research analyst position, where the task was to create a recommendation service based on clients' deals. That is how the idea of the pet project was born.
I had been actively investing since I was 25, and the subject was close to my heart. To implement the project, it was necessary to collect two datasets: one of users' deals and another of the fundamental indicators of securities and features derived from them. The first dataset is synthetic: finding private data on users' deals is difficult, so I used Tinkoff Journal's article about the portrait of retail investors in Russia and generated a sample of random users who made deals on the S&P 500 and the top 10 of the SPB index once a week. The number of trades was also random but limited to an integer between 1 and 7 per week. The second dataset, of S&P 500 stocks, was much easier to collect: I have long used the finviz service for stock valuation, and it was easy to parse with a spider. I saved both datasets into MongoDB. After that, I defined the main metrics, built the main recommendation models (ALS, item-item recommender, cosine recommender, TF-IDF), and compared the metrics against baselines.
The whole job took about 1.5–2 months; I was done by about mid-February. A week later, our life changed completely, so I was affected by two hiring freezes at the companies where I had submitted my pet project. Finally, in mid-March, I was selected for a data analyst position. The pet project played an essential role at the selection stage: my future boss was pleasantly surprised that I knew Mongo, and my experience in hackathons helped me solve the test tasks.
The plans are to train MLOps skills on the pet project, enrich it with data from other sources, and redesign the synthetic dataset with users’ transactions. Right now, the project is on hold due to the high workload. Such projects help novices systematize their skills and try them out on real-world tasks. In addition, you don’t need to make up a topic to talk about at the interview; just discuss your original pet project.
ODS.ai nickname: artgor
5 years ago, I did a pet project, which helped many times at the interviews. Description of the project in the article.
ODS.ai nickname: Sergei
For two years, I’ve been working on a pet project about GAN/Deepfake; in the process, I got very good at DL, and the project description is in the article.
ODS.ai nickname: poxyu_was_here
I want to tell you a little about our startup PTF-Lab (previously known as Punch To Face; currently Path To Future lab). Our primary focus is virtual advertising and AR in sports and esports broadcasts. PTF had a lot of pivots, stalemates, ups, downs, and so on. We started the project back in 2014 with my childhood friend. Our story, very briefly:
- 2012–2013: how would it be funnier to film and showcase sporting events? © my childhood friend and partner in PTF.
- 2014: unsuccessful search for like-minded people; beginning to learn programming, designing; working on the first prototype.
- 2015: first prototype; unsuccessful search for like-minded people; the head of the robotics laboratory at the Kurchatov Institute believed in us and said, "why not?"
- 2016–2017: attempts to partner with Chechens and Dagestanis; hello, machine learning; first trip to a business angel (second prototype; third prototype).
- 2018: first time in a venture capital fund (without success); joining ODS (historical moment); deep learning went into business.
- 2019: there are a lot of people who would like to contribute and join our team; a lot of CV&DL experiments; performance at datafest in Odessa; Russian MMA organizations share data with us and let us in director’s booths during events.
- 2020: our Differentiable Mesh Renderer (a.k.a. Wunderwaffe) in PyTorch; a new prototype demo; the UFC (the largest mixed martial arts organization) writes to us, and we pitch to them.
- 2021: first small commercial projects; complete burnout (for me, for sure) and the quiet death of the team (but not of the founders: "dementia and courage"); then we find a commercial partner and register the company in Cyprus.
- 2022: assembling a team, organizing processes, buying equipment; a new demo in real-time.
- P.S. — never give up and only go ahead.