I am a Sr. Software Developer at Oracle Cloud. The opinions expressed here are my own and not necessarily those of my employer.
Recently I spoke at RedisDay Seattle about using Redis for Data Engineering and Data Science. In this article I want to revisit these ideas.
- Python Pandas
- Worker containers
- Web container for Jupyter Notebook
Python Pandas is a popular library for data science tasks such as importing data from various sources and analyzing it.
The challenge is that often our data acquisition is much more complex than simply reading it from one file or running a single DB query. We often have to pull data from different sources. If one of our queries or API requests fails we do not want to repeat the entire process from the beginning.
In this article we will explore how to use Redis for two purposes to build a simple yet more scalable system:
- As a job queue to run multiple data acquisition tasks in parallel.
- As a DB to temporarily store our datasets.
We will be using Docker and Docker Compose to manage our environment.
We will start our environment with the `docker-compose up --build -d --scale worker=2` command. This will bring up 1 Redis, 1 Web and 2 Worker containers.
This file will contain common environment variables to be shared across containers.
This file will NOT be committed to the repo but this is where we will store the token necessary to access the GitHub APIs. More info is available here
This file contains the logic that, depending on the `CONTAINER_TYPE` env variable, starts either the Flask web server with Jupyter Notebook OR the background job worker using RQ.
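As a rough sketch of what that dispatch could look like (the actual entrypoint in the repo may differ; the `CONTAINER_TYPE` values and launched commands here are assumptions):

```python
import os
import subprocess  # used to exec the chosen command

def main_command(container_type):
    """Pick the process to run based on the CONTAINER_TYPE env variable."""
    if container_type == "web":
        # Flask web server (the container that also exposes Jupyter Notebook)
        return ["flask", "run", "--host", "0.0.0.0"]
    # anything else runs a background RQ worker on the default queue
    return ["rq", "worker"]

def main():
    cmd = main_command(os.environ.get("CONTAINER_TYPE", "worker"))
    subprocess.run(cmd, check=True)
```

With this approach a single image can serve both roles, and docker-compose only needs to vary one environment variable per service.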
To make sure the RQ worker runs properly we need to create
We will install various dependencies with
We will jump into the Python shell and start running our background jobs with `jobs.github_users.queue()`. First the code will query the https://api.github.com/users?since=0 endpoint and then it will loop through the users and hit each user URL (like https://api.github.com/users/mojombo).
We use the `?since` parameter to paginate through the `users` endpoint and queue subsequent requests, and a counter to stop after 10 `users` requests. Overall we make 310 HTTP requests so we do not want to start from the beginning in case one of them fails.
Each job will be sent via the Redis queue thanks to the `.queue(...)` method and picked up by one of the worker containers. As the jobs complete they massage the data to extract `['public_repos', 'public_gists', 'followers', 'following']` and store them in Redis Hashes. Data in Redis will look like this:
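A sketch of that storage step, assuming keys of the form `user:<login>` (the exact key layout in the repo may differ):

```python
FIELDS = ["public_repos", "public_gists", "followers", "following"]

def extract_fields(user):
    """Keep only the numeric fields we will aggregate later."""
    return {f: user[f] for f in FIELDS}

def store_user(redis_conn, user):
    """redis_conn is a redis.Redis client; each user becomes one Hash."""
    redis_conn.hset(f"user:{user['login']}", mapping=extract_fields(user))

# In redis-cli this would look roughly like:
#   HGETALL user:mojombo
#   1) "public_repos"  2) "..."
#   3) "followers"     4) "..."
```

One Hash per user keeps each record addressable by key while storing only the handful of fields we actually need.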
Web container for Jupyter Notebook
This is where we transition from data engineering to data science. We can browse to http://localhost:8888 and use the Notebooks to do data analysis.
One limitation of Redis is that we cannot query by value, so we will use a Pandas DataFrame to pull data out of Redis and hold it in Python memory. Then we can do regular Pandas aggregations.
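One way to sketch that load step, assuming the users were stored in Hashes under keys like `user:<login>` (an assumption about the key layout):

```python
import pandas as pd

def load_users(redis_conn):
    """Build a DataFrame from the user:* Hashes, indexed by login."""
    rows = {}
    for key in redis_conn.scan_iter("user:*"):
        # redis-py returns bytes by default; keys look like b"user:mojombo"
        login = key.decode().split(":", 1)[1]
        rows[login] = {k.decode(): int(v) for k, v in redis_conn.hgetall(key).items()}
    return pd.DataFrame.from_dict(rows, orient="index")

# df = load_users(Redis(host="redis"))
# df.describe()                                  # summary stats per column
# df.sort_values("followers", ascending=False)   # who has the most followers
```

`scan_iter` walks the keyspace incrementally instead of using the blocking `KEYS` command, so it stays friendly to the workers still writing to Redis.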
Overall this approach can be a good option for medium data scale. Job processing can be stopped and restarted later. This solution does require a lot of memory, so it is probably not a good choice when we need to store large amounts of data for extended periods of time. But it can be very useful as a place to temporarily store data as we are processing it, and once the aggregations are done we can flush Redis.
- Video of my presentation https://www.youtube.com/watch?v=Koh6piVaYh0
- Slides from my presentation http://bit.ly/36mQ8H2
- Code samples from my presentation https://github.com/dmitrypol/redis_data