
· 11 min read
Jack Leitch

EtLT of my own Strava data using the Strava API, MySQL, Python, S3, Redshift, and Airflow

[Figure: system diagram of the pipeline]

I built an EtLT pipeline to ingest my Strava data from the Strava API and load it into a Redshift data warehouse. The pipeline runs once a week on Airflow to extract any new activity data. The end goal is to use this data warehouse to build an automatically updating dashboard in Tableau, and also to trigger automatic retraining of my Strava Kudos Prediction model.
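As a rough illustration of the orchestration piece, here is a minimal Airflow sketch of the weekly schedule, assuming Airflow 2.x; the DAG id, task names, and the two callables are hypothetical stand-ins for the real extract and load steps.

```python
# Minimal sketch of a weekly EtLT DAG (dag_id, task names, and callables are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_strava_activities():
    ...  # hypothetical: pull new activities from the Strava API and stage them in S3


def load_to_redshift():
    ...  # hypothetical: COPY the staged files from S3 into Redshift


with DAG(
    dag_id="strava_etlt",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",  # run once a week, as described above
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_strava_activities)
    load = PythonOperator(task_id="load", python_callable=load_to_redshift)
    extract >> load
```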

· 12 min read
Jack Leitch

... and deploying on Google Chrome with FastAPI and Docker

I recently finished the fantastic new Natural Language Processing with Transformers book written by a few guys on the Hugging Face team, and was inspired to put some of my newfound knowledge to use with a little NLP-based project. While searching for ideas, I came across an excellent blog post by Tezan Sahu in which he built a Microsoft Edge extension to paraphrase text highlighted on your screen. I wanted to take this a step further by:

  1. optimizing model inference with ONNX Runtime and quantization (a rough sketch of this step follows the list), and

  2. including features such as summarization, named entity recognition (NER), and keyword extraction.
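For the first point, here is a minimal sketch of dynamic quantization with ONNX Runtime, assuming the transformer has already been exported to an ONNX file; the file paths are hypothetical.

```python
# Minimal sketch: dynamic quantization with ONNX Runtime (file paths are hypothetical).
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",         # assumes the model was already exported to ONNX
    model_output="model-quant.onnx",  # weights stored as int8 for a smaller, faster model
    weight_type=QuantType.QInt8,
)

# Inference then runs through a regular ONNX Runtime session.
import onnxruntime as ort

session = ort.InferenceSession("model-quant.onnx")
```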

· 25 min read
Jack Leitch

Predicting customer churn from a telecom provider

I’ve always believed that to truly learn data science you need to practice data science, and I wanted to do this project to practice working with imbalanced classes in classification problems. This was also a perfect opportunity to start working with mlflow to help track my machine learning experiments: it allows me to track the different models I have used, the parameters I’ve trained with, and the metrics I’ve recorded.
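As a rough illustration of that experiment tracking, here is a minimal mlflow sketch; the run name, parameters, and metrics are hypothetical placeholders.

```python
# Minimal sketch of experiment tracking with mlflow (all values are hypothetical).
import mlflow

with mlflow.start_run(run_name="churn-baseline"):
    # Log the model type and the hyperparameters it was trained with...
    mlflow.log_param("model", "random_forest")
    mlflow.log_param("n_estimators", 200)
    # ...and the evaluation metrics recorded for this run.
    mlflow.log_metric("f1", 0.81)
    mlflow.log_metric("recall", 0.76)
```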

This project was aimed at predicting customer churn using the telecommunications data found on Kaggle (a publicly available synthetic dataset). That is, we want to be able to predict whether a given customer is going to leave the telecom provider based on the information we have on that customer. Now, why is this useful? Well, if we can predict which customers we think are going to leave before they actually do, then we can try to do something about it! For example, we could target them with specific offers, and maybe we could even use the model to tell us what to offer them, because we will know, or at least have an idea of, why they are leaving.

· 21 min read
Jack Leitch

An end-to-end data science project, from data collection to model deployment, aimed at predicting user interaction on Strava activities based on the given activity’s attributes.

Strava is a service for tracking human exercise that incorporates social-network-style features. It is mostly used for cycling and running, with an emphasis on using GPS data. A typical Strava post of mine is shown below, and we can see that it contains a lot of information: distance, moving time, pace, elevation, weather, GPS route, who I ran with, and so on.

· 13 min read
Jack Leitch

Using Word2Vec, Scikit-Learn, and Streamlit

First things first: if you would like to play around with the finished app, you can do so here: https://share.streamlit.io/jackmleitch/whatscooking-deployment/streamlit.py.


In a previous blog post (Building a Recipe Recommendation API using Scikit-Learn, NLTK, Docker, Flask, and Heroku) I wrote about how I built a recipe recommendation system. To summarize: I first cleaned and parsed the ingredients for each recipe (for example, “1 diced onion” becomes “onion”); next, I encoded each recipe’s ingredient list using TF-IDF. From there, I applied a similarity function to find the similarity between the ingredients of known recipes and the ingredients given by the end user. Finally, the top-recommended recipes are returned according to the similarity score.
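The core of that approach can be sketched in a few lines with scikit-learn; the toy recipes and the user’s ingredient list below are hypothetical stand-ins for the real corpus.

```python
# Minimal sketch of the TF-IDF + similarity approach (toy data, not the real corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each recipe is represented by its cleaned, parsed ingredient list.
recipes = [
    "onion garlic tomato pasta",
    "flour butter sugar egg",
    "rice soy_sauce ginger garlic",
]
user_ingredients = ["tomato pasta garlic"]

vectorizer = TfidfVectorizer()
recipe_vectors = vectorizer.fit_transform(recipes)
user_vector = vectorizer.transform(user_ingredients)

# Rank recipes by cosine similarity between the user's ingredients and each recipe.
scores = cosine_similarity(user_vector, recipe_vectors).ravel()
print(scores.argsort()[::-1])  # recipe indices, best match first
```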

· 8 min read
Jack Leitch

Automating away those unceasing repetitive tasks the easy way

The Task At Hand

Being the avid runner and data lover that I am, naturally, I like to get the best out of logging my training. The solution I settled on a few years back was to log my training on both Strava and Attackpoint. While both platforms provide a service for tracking exercise using GPS data, Strava has an emphasis on being a social network and lets you look into each activity in depth, something that Attackpoint lacks. Attackpoint, on the other hand, is more barebones, and I personally prefer it for looking at my training as a whole (on timescales of weeks/months/years).

· 5 min read
Jack Leitch

Using Python to automate the process of setting up new project directories and making the first commit to a new repository on GitHub

I find myself doing the same thing over and over again when starting a new data science project. As well as being extremely tedious (and frustrating when I can’t remember how to use git…), it's also just a waste of time. Inspired by the YouTuber Kalle Hallden, I decided to try and automate this process using Python.
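Below is a minimal sketch of that kind of automation; the directory layout and project name are hypothetical, and it assumes git is installed (pushing to GitHub would additionally need a remote configured).

```python
# Minimal sketch: scaffold a project directory and make the first commit.
# The layout and project name are hypothetical.
import subprocess
from pathlib import Path


def create_project(name: str) -> None:
    root = Path(name)
    for sub in ("data", "notebooks", "src", "models"):
        (root / sub).mkdir(parents=True, exist_ok=True)
        (root / sub / ".gitkeep").touch()  # so git tracks the otherwise-empty directories
    (root / "README.md").write_text(f"# {name}\n")

    # Initialize the repository and make the first commit.
    subprocess.run(["git", "init"], cwd=root, check=True)
    subprocess.run(["git", "add", "."], cwd=root, check=True)
    subprocess.run(["git", "commit", "-m", "Initial commit"], cwd=root, check=True)


create_project("my-new-project")
```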

· 9 min read
Jack Leitch

A guide to organizing your machine learning projects

I just want to start with a brief disclaimer. This is how I personally organize my projects and it’s what works for me; that doesn't necessarily mean it will work for you. However, there is definitely something to be said about how good organization streamlines your workflow. Building my ML framework the way I do allows me to work in a very plug-and-play way: I can train, change, and adjust my models without making too many changes to my code.

Why is it important to organize a project?

  1. Productivity. If you have a well-organized project, with everything in the same directory, you don't waste time searching for files, datasets, code, models, etc.

  2. Reproducibility. You’ll find that a lot of your data science projects have at least some repetition to them. For example, with the proper organization, you could easily go back and find/use the same script to split your data into folds (a sketch of such a script follows this list).

  3. Understandability. A well-organized project can easily be understood by other data scientists when shared on GitHub.
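As an example of the reusable fold-splitting script mentioned in point 2, here is a minimal sketch with scikit-learn; the file paths and the number of splits are hypothetical choices.

```python
# Minimal sketch of a reusable fold-splitting script (paths and settings are hypothetical).
import pandas as pd
from sklearn.model_selection import KFold


def create_folds(df: pd.DataFrame, n_splits: int = 5, seed: int = 42) -> pd.DataFrame:
    # Shuffle the rows, then assign each one to a validation fold.
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    df["kfold"] = -1
    kf = KFold(n_splits=n_splits)
    for fold, (_, valid_idx) in enumerate(kf.split(df)):
        df.loc[valid_idx, "kfold"] = fold
    return df


if __name__ == "__main__":
    df = create_folds(pd.read_csv("input/train.csv"))
    df.to_csv("input/train_folds.csv", index=False)
```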

· 12 min read
Jack Leitch

Optimizing your choice of optimizer

Gradient descent is an optimization algorithm used to minimize some cost function by iteratively moving in the direction of steepest descent, that is, the direction with the most negative gradient. In machine learning, we use gradient descent to continually tweak the parameters of our model in order to minimize a cost function. We start with some set of values for our model parameters (the weights and biases in a neural network) and improve them slowly. In this blog post, we will start by exploring some basic optimizers commonly used in classical machine learning and then move on to some of the more popular algorithms used in neural networks and deep learning.
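To make the basic update rule concrete, here is a minimal NumPy sketch of batch gradient descent on a least-squares cost; the toy data, learning rate, and iteration count are all hypothetical choices.

```python
# Minimal sketch: batch gradient descent on linear regression (toy data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # features
true_w = np.array([2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(2)   # initial parameter values
lr = 0.1          # learning rate (step size)
for _ in range(200):
    grad = (2 / len(y)) * X.T @ (X @ w - y)  # gradient of the mean squared error
    w -= lr * grad                           # step in the direction of steepest descent

print(w)  # should end up close to true_w
```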