Wrap your brain around one of Pandas’ most powerful tools for statistical analysis.

Photo by Pascal Müller on Unsplash

In this tutorial you will learn how to use the Pandas DataFrame .groupby() method and aggregator methods such as .mean() and .count() to quickly extract statistics from a large dataset (over 10 million rows). You will also be introduced to the Open University Learning Analytics dataset.
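As a rough illustration of the pattern, here is a minimal sketch of .groupby() combined with the .mean() and .count() aggregators. The tiny table and its column names (`id_student`, `sum_click`) are stand-ins chosen for this example, not the tutorial's actual code or data.

```python
import pandas as pd

# Hypothetical miniature click log; the column names are assumptions
# made for illustration only.
clicks = pd.DataFrame({
    "id_student": [1, 1, 2, 2, 2, 3],
    "sum_click": [4, 6, 1, 2, 3, 10],
})

# Group rows by student, then compute the mean and the row count
# of sum_click within each group.
stats = clicks.groupby("id_student")["sum_click"].agg(["mean", "count"])
print(stats)
```

The same call works unchanged on millions of rows, because the grouping and aggregation happen inside Pandas rather than in a Python loop.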


Pandas is the most adorable and cuddly tabular data management library for Python. Once you get the hang of it, its intuitive, object-oriented implementation and clever tricks to improve computational efficiency make for flexible and powerful data handling.

How I did not deploy my first SARIMA COVID-19 forecasting model using Dash, Plotly, and Heroku.

Photo by Ethan Hu on Unsplash

I did not deploy a SARIMA time series model using the statsmodels library that predicts future COVID-19 infection and death rates. Using Plotly to create interactive graphs of current and predicted case and death rates, allowing users to decide which statistics to include, which countries or states to predict, and how far out to predict, I did not make a publicly accessible and interactive predictive website. I worked hard and learned a lot in not deploying this model to a Heroku server.

Photo by Ilona Froehlich on Unsplash

This article will guide you through quickly transforming millions of rows of trace data into a wide format time series table using the power of mighty Pandas.
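One way to reshape long trace data into a wide time series table is `DataFrame.pivot_table`. The sketch below assumes a hypothetical log with `id_student`, `date`, and `sum_click` columns; the article itself may use a different approach.

```python
import pandas as pd

# Hypothetical long-format trace data: one row per logged interaction.
# Column names and values are assumptions for illustration.
trace = pd.DataFrame({
    "id_student": [1, 1, 1, 2, 2],
    "date": [0, 0, 1, 0, 2],
    "sum_click": [3, 2, 5, 1, 4],
})

# Wide format: one row per student, one column per day.
# Repeated (student, day) pairs are summed; days with no activity become 0.
wide = trace.pivot_table(
    index="id_student",
    columns="date",
    values="sum_click",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```

Each row of `wide` is now a per-student time series that can be fed to a model or plotted directly.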

The Data

For this demonstration, I’ll be using the OULAD dataset from Open University. The full dataset, as well as a lovely data description, is available at the link above. This is a dataset about student activity in a virtual learning environment in online classes. It spans two years and four cohorts for each of seven modules.

Photo by JESHOOTS.COM on Unsplash

The opportunities for humans to contribute to the work of the world are changing rapidly. Businesses growing to take advantage of these opportunities need workers with new skills. Employers are hiring programmers, data scientists, web developers, and leaders, but there are not enough folks with the right skills to fill the need. This is true of many industries.

Callbacks allow you to adjust settings or save your model during training.

Photo by Karsten Winegeart on Unsplash

In this article, you will learn how to use the ModelCheckpoint callback in Keras to save the best version of your model during training.

Modeling is Fun!

I love building predictive deep learning models. I love watching the training outputs, seeing the loss fall and watching for the diverging losses between training and validation sets that indicate overfitting. But sometimes a model finds a great solution…and keeps training to a solution that only works for the training set. Now, if I’m there, staring like it’s a fish tank, I can interrupt the training before too much damage is done. But, who wants to…

Using SpaCy pre-trained embedding vectors for transfer learning in a Keras deep learning model. Also, bonus, how to use TextVectorization to add a preprocessing layer to your model to tokenize, vectorize, and pad inputs before the embedding layer.

Photo by Alexandra on Unsplash

In this article you will learn how to use SpaCy embedding vectors to create a pre-trained embedding layer for natural language processing models in Keras. This reduces training time for NLP models and transfers learning about words and their relationships from larger models.


Words make up most of the world most of us live in. If you are reading this, I…

Getting started with forecasting quickly with the fbprophet library

Photo by Drew Beamer on Unsplash

Why Facebook Prophet?

The Challenges of Timeseries Forecasting

Timeseries forecasting is a complex art form. Many models are very sensitive to trends, cycles (called ‘seasons’) and changing magnitudes of fluctuations, and instead require stationary data, which lack these features.
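To make "stationary" a little more concrete, here is a minimal sketch of first differencing, a common way to remove a trend before fitting a model that expects stationary data. The synthetic series is an assumption for illustration, and differencing is a general technique rather than something the Prophet article necessarily uses.

```python
import numpy as np
import pandas as pd

# Synthetic series with a pure linear trend (illustrative values only).
t = np.arange(10)
series = pd.Series(2.0 * t + 5.0)

# First differencing: each value minus the previous one.
# The first entry has no predecessor, so it is NaN and is dropped.
diffed = series.diff().dropna()
print(diffed)
```

After differencing, the trend is gone and every value is the constant slope, which is the kind of input many classical forecasting models require.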

Can we predict whether a student will pass an online course without knowing anything about who they are?

Photo by Frank Romero on Unsplash

Learning online has been a growing trend for decades now. In 2018, 35% of college students took at least one course online and 17% took all of their classes remotely (NCES study). With COVID-19 a reality, learning online has exploded and become a necessary health and safety issue for more people than ever. While students will eventually return to school, the industry has had opportunity, funding, and impetus to improve and expand. This will undoubtedly lead to a sharper rise in the importance of internet-based learning in the post-COVID future.

Predictive analytics, human expertise, data mining, and empathy come together to improve graduation rates for tens of thousands of students, many the first in their families.

Photo by Element5 Digital on Unsplash

My first year of college was hard in so many ways. I had never lived away from home, my friends, family, and girlfriend were far away, and I didn’t know anyone. I was on my own for the first time and encountering some of the most difficult challenges I had yet faced. But, my struggles were invisible. I didn’t reach out to campus services, and they did not know I needed them. If you…

Photo by Xavi Cabrera on Unsplash

How fun is it to explore? As data scientists, we are all about discovery and interacting with data. Folium allows you and your audience to explore data with interactive maps, and it is quick and simple to set up.

Josh Johnson

I'm a data scientist with a background in education. I empower learners to become the folks they want to be.
