In this tutorial you will learn how to use the Pandas DataFrame .groupby() method and aggregator methods such as .count() to quickly extract statistics from a large dataset (over 10 million rows). You will also be introduced to the Open University Learning Analytics dataset.
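As a quick taste of the pattern, here is a minimal sketch of .groupby() with two aggregators. The toy table and its column names are stand-ins I made up for illustration, not the real dataset or its schema.

```python
import pandas as pd

# Hypothetical toy data standing in for a large click log.
# Column names and values are assumptions for illustration only.
clicks = pd.DataFrame(
    {
        "student": ["a", "a", "b", "b", "b", "c"],
        "module": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
        "sum_click": [3, 1, 4, 1, 5, 9],
    }
)

# Count how many logged rows each student has.
rows_per_student = clicks.groupby("student")["sum_click"].count()

# Add up the clicks recorded for each module.
clicks_per_module = clicks.groupby("module")["sum_click"].sum()

print(rows_per_student)
print(clicks_per_module)
```

The same two lines work unchanged on a table with millions of rows, which is where grouping really pays off.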
Pandas is the most adorable and cuddly tabular data management library for Python. Once you get the hang of it, its intuitive, object-oriented implementation and clever tricks to improve computational efficiency make for flexible and powerful data handling.
Pandas facilitates data mining, data processing, data cleaning, data visualization, and some basic statistical analysis on…
I did not deploy a SARIMA time series model using the statsmodels library that predicts future COVID-19 infection and death rates. Using Plotly to create interactive graphs of current and predicted case and death rates, allowing users to decide which statistics to include, which countries or states to predict, and how far out to predict, I did not make a publicly accessible and interactive predictive website. I worked hard and learned a lot in not deploying this model to a Heroku server.
Here is my story.
In this walkthrough you will learn to deploy a website to Heroku that…
This article will guide you through quickly transforming millions of rows of trace data into a wide-format time series table using the power of mighty Pandas.
For this demonstration, I’ll be using the OULAD dataset from Open University. The full dataset, as well as a lovely data description, is available at the link above. This is a dataset about student activity in a virtual learning environment in online classes. It spans two years and four cohorts for each of seven modules.
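To give a feel for the long-to-wide idea, here is a minimal sketch using pivot_table. The tiny trace table, its column names, and the choice of pivot_table are my assumptions for illustration; the real data is far larger and the full article may take a different route.

```python
import pandas as pd

# Hypothetical long-format trace data: one row per logged interaction.
# Column names and values are assumptions for illustration only.
trace = pd.DataFrame(
    {
        "student": ["a", "a", "a", "b", "b"],
        "day": [0, 0, 1, 1, 2],
        "clicks": [2, 3, 1, 4, 6],
    }
)

# Pivot to wide format: one row per student, one column per day.
# Multiple rows for the same student and day are summed, and days
# with no activity are filled with 0.
wide = trace.pivot_table(
    index="student",
    columns="day",
    values="clicks",
    aggfunc="sum",
    fill_value=0,
)

print(wide)
```

Each row of the result is now a small time series of one student's activity, ready for modeling or plotting.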
The opportunities for humans to contribute to the work of the world are changing rapidly. Businesses growing to take advantage of these opportunities need workers with new skills. Employers are hiring programmers, data scientists, web developers, and leaders, but there are not enough folks with the right skills to fill the need. This is true of many industries.
Education is expensive. Traditional teachers have to be multi-talented, highly educated, passionate, and hard-working. Teaching and assessing are done manually and at considerable expense of time and money. However, we live in a magical age where data-driven, accessible, personalized, and effective…
In this article, you will learn how to use the ModelCheckpoint callback in Keras to save the best version of your model during training.
I love building predictive deep learning models. I love watching the training outputs, seeing the loss fall and watching for the diverging losses between training and validation sets that indicate overfitting. But sometimes a model finds a great solution…and keeps training to a solution that only works for the training set. Now, if I’m there, staring like it’s a fish tank, I can interrupt the training before too much damage is done. But, who wants to…
Using SpaCy pre-trained embedding vectors for transfer learning in a Keras deep learning model. Also, bonus, how to use TextVectorization to add a preprocessing layer to your model to tokenize, vectorize, and pad inputs before the embedding layer.
In this article you will learn how to use SpaCy embedding vectors to create a pre-trained embedding layer for natural language processing models in Keras. This reduces training time for NLP models and transfers learning about words and their relationships from larger models.
Words make up most of the world most of us live in. If you are reading this, I…
Time series forecasting is a complex art form. Many models are very sensitive to trends, cycles (called ‘seasons’), and changing magnitudes of fluctuations, and instead require stationary data, which lack these features.
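One common way to strip a trend out of a series is differencing: replacing each value with its change from the previous one. Here is a minimal sketch with made-up numbers. It is not the project's data, and it is not necessarily the method the full article uses.

```python
import pandas as pd

# Made-up cumulative counts with an accelerating upward trend.
# These numbers are illustrative assumptions, not real case data.
cumulative = pd.Series([0, 2, 6, 12, 20, 30])

# First difference: the amount added at each step.
# The first entry has no predecessor, so it is dropped.
first_diff = cumulative.diff().dropna()

# Second difference: the change in the amount added at each step.
second_diff = first_diff.diff().dropna()

print(first_diff.tolist())
print(second_diff.tolist())
```

In this toy series the first difference still climbs, while the second difference is flat, which is the kind of trend-free shape many forecasting models want to see.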
This devastating disease has killed hundreds of thousands in the US and millions around the world and at this time continues to spread ever faster. Forecasting the rate of future infections can help hospitals and aid organizations plan and prepare for future needs.
A good example of data that is not stationary is the cumulative spread of COVID-19 in the United States over 2020. For this project I will be using…
Learning online has been a growing trend for decades now. In 2018, 35% of college students took at least one course online and 17% took all of their classes remotely (NCES study). With COVID-19 a reality, learning online has exploded and become a necessary health and safety issue for more people than ever. While students will eventually return to school, the industry has had the opportunity, funding, and impetus to improve and expand. This will undoubtedly lead to a sharper rise in the importance of internet-based learning in the post-COVID future.
Predictive analytics, human expertise, data mining, and empathy come together to improve graduation rates for tens of thousands of students, many the first in their families.
My first year of college was hard in so many ways. I had never lived away from home; my friends, family, and girlfriend were far away; and I didn’t know anyone. I was on my own for the first time and encountering some of the most difficult challenges I had yet faced. But my struggles were invisible. I didn’t reach out to campus services, and they did not know I needed them. If you…
How fun is it to explore? As data scientists, we are all about discovery and interacting with data. Folium allows you and your audience to explore data with interactive maps, and it is quick and simple to set up.
Folium is a Python library that allows you to combine the amazing data wrangling libraries of Python and the beautiful mapmaking abilities of Leaflet.js. With just a few lines of code in your IPython Jupyter Notebook, you can produce eye-catching interactive maps to help your audience explore your data in a visual and geographical way. …
I'm a data scientist with a background in education. I empower learners to become the folks they want to be.