I am a Sr. Software Developer at Oracle Cloud. The opinions expressed here are my own and not necessarily those of my employer.
Redis for Data Engineering and Data Science
Recently I spoke at RedisDay Seattle about using Redis for Data Engineering and Data Science. In this article I want to revisit these ideas.
- Python Pandas
- docker-compose.yml
- Dockerfile
- Worker containers
- Web container for Jupyter Notebook
- Links
Python Pandas
Python Pandas is a popular library for data science tasks such as importing data from various sources and analyzing it.
The challenge is that our data acquisition is often much more complex than simply reading one file or running a single DB query. We often have to pull data from different sources, and if one of our queries or API requests fails we do not want to repeat the entire process from the beginning.
In this article we will explore how to use Redis for two purposes in order to build a simple yet scalable system:
- As a job queue to run multiple data acquisition tasks in parallel.
- As a DB to temporarily store our datasets.
We will be using Docker and Docker Compose to manage our environment.
docker-compose.yml
We will start our environment with the docker-compose up --build -d --scale worker=2 command. This will bring up 1 Redis, 1 Web and 2 Worker containers.
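A minimal sketch of what docker-compose.yml might look like; the Redis image tag, env file names, CONTAINER_TYPE values and the Jupyter port are assumptions based on the rest of this article:

```yaml
version: '3'
services:
  redis:
    image: redis:5

  web:
    build: .
    env_file:
      - common.env
      - secrets.env
    environment:
      - CONTAINER_TYPE=web
    ports:
      - "8888:8888"   # Jupyter Notebook
    depends_on:
      - redis

  worker:
    build: .
    env_file:
      - common.env
      - secrets.env
    environment:
      - CONTAINER_TYPE=worker
    depends_on:
      - redis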
Environment files
common.env
This file will contain common environment variables to be shared across containers.
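For example, it could hold the Redis connection details that every container needs; the exact variable names are assumptions:

```
REDIS_HOST=redis
REDIS_PORT=6379
```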
secrets.env
This file will NOT be committed to the repo; this is where we will store the token necessary to access the GitHub API. More info is available here.
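Something along these lines, with a placeholder instead of a real token (the variable name is an assumption):

```
GITHUB_TOKEN=replace-with-your-personal-access-token
```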
Dockerfile
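A plausible sketch of the Dockerfile, assuming a Python base image, pipenv for dependencies and the entrypoint.sh script described below:

```dockerfile
FROM python:3.8

RUN pip install pipenv

WORKDIR /app

# install dependencies first so this layer is cached
COPY Pipfile Pipfile.lock ./
RUN pipenv install --system --deploy

# copy the application code and the entrypoint script
COPY . .

ENTRYPOINT ["./entrypoint.sh"]
```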
entrypoint.sh
This file contains the logic that, depending on the CONTAINER_TYPE env variable, starts either the Flask web server with Jupyter Notebook or a background job worker using the RQ library.
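A minimal sketch of that branching logic; the exact commands (and whether the worker is started via plain rq or a Flask CLI wrapper) are assumptions:

```bash
#!/bin/sh
set -e

if [ "$CONTAINER_TYPE" = "worker" ]; then
    # background job worker, reading its settings from rq_config.py
    rq worker -c rq_config
else
    # web container: Jupyter Notebook plus the Flask web server
    jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root &
    flask run --host=0.0.0.0
fi
```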
To make sure the RQ worker runs properly we need to create an rq_config.py file:
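A minimal sketch, assuming the worker only needs a Redis connection string and a queue name:

```python
import os

# Redis connection the RQ worker should use
REDIS_URL = 'redis://{}:6379/0'.format(os.environ.get('REDIS_HOST', 'redis'))

# queues this worker will listen on
QUEUES = ['default']
```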
Pipfile
We will install various dependencies with pipenv and a Pipfile:
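A sketch of a Pipfile with the libraries this article relies on; Flask-RQ2 and requests are assumptions (the former to provide the .queue() API used below, the latter for the GitHub API calls):

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
flask = "*"
flask-rq2 = "*"
rq = "*"
redis = "*"
requests = "*"
pandas = "*"
jupyter = "*"

[requires]
python_version = "3.8"
```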
Worker containers
We will jump into a Python shell and start running our background jobs with jobs.github_users.queue(). First the code will query the https://api.github.com/users?since=0 endpoint, then it will loop through the users and hit each user URL (like https://api.github.com/users/mojombo).
We use the ?since parameter to paginate through the users endpoint and queue subsequent requests, and a counter to stop after 10 requests to the users endpoint. Overall we make 310 HTTP requests, so we do not want to start from the beginning in case one of them fails.
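A rough sketch of how these jobs could be wired up. It assumes Flask-RQ2's @rq.job decorator (which is what gives a function the .queue() method used here); the module layout, key names and page size of 30 users per request are illustrative, not the exact code from the presentation:

```python
import os

import requests
from flask import Flask
from flask_rq2 import RQ
from redis import Redis

app = Flask(__name__)
app.config['RQ_REDIS_URL'] = 'redis://{}:6379/0'.format(os.environ.get('REDIS_HOST', 'redis'))
rq = RQ(app)

redis_conn = Redis(host=os.environ.get('REDIS_HOST', 'redis'), decode_responses=True)
HEADERS = {'Authorization': 'token ' + os.environ.get('GITHUB_TOKEN', '')}
FIELDS = ['public_repos', 'public_gists', 'followers', 'following']


@rq.job
def github_users(since=0, counter=0):
    """Fetch one page of the users endpoint and fan out a job per user."""
    if counter >= 10:
        return  # stop after 10 pages: 10 list requests + 10 * 30 user requests = 310 total
    users = requests.get('https://api.github.com/users',
                         params={'since': since}, headers=HEADERS).json()
    for user in users:
        github_user.queue(user['login'])
    # paginate: the last user id on this page becomes the next ?since value
    github_users.queue(since=users[-1]['id'], counter=counter + 1)


@rq.job
def github_user(login):
    """Fetch a single user and store the selected fields in a Redis Hash."""
    data = requests.get('https://api.github.com/users/' + login, headers=HEADERS).json()
    redis_conn.hset('user:' + login, mapping={field: data[field] for field in FIELDS})
```

From a Python shell inside the web container, jobs.github_users.queue() kicks off the whole pipeline.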
Each job will be sent via a Redis queue thanks to the .queue(...) method and picked up by one of the worker containers. As the jobs complete they massage the data to extract ['public_repos', 'public_gists', 'followers', 'following'] and store them in Redis Hashes. Data in Redis will look like this:
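For example, inspecting one of the hashes with redis-cli (the user:<login> key naming and the numbers are just placeholders):

```
127.0.0.1:6379> HGETALL user:mojombo
1) "public_repos"
2) "65"
3) "public_gists"
4) "62"
5) "followers"
6) "22800"
7) "following"
8) "11"
```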
Web container for Jupyter Notebook
This is where we transition from data engineering to data science. We can browse to http://localhost:8888 and use the Notebooks to do data analysis.
One limitation of Redis is that we cannot query by value, so we will use a Pandas DataFrame to pull the data out of Redis and hold it in Python memory. Then we can do regular Pandas aggregations.
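A sketch of that step, assuming the user:<login> hashes from the worker section; SCAN is used instead of KEYS so we do not block Redis while iterating:

```python
import os

import pandas as pd
from redis import Redis

redis_conn = Redis(host=os.environ.get('REDIS_HOST', 'redis'), decode_responses=True)

# pull every user hash out of Redis into a list of dicts
rows = []
for key in redis_conn.scan_iter(match='user:*'):
    row = redis_conn.hgetall(key)
    row['login'] = key.split(':', 1)[1]
    rows.append(row)

# build a DataFrame and convert the numeric columns
df = pd.DataFrame(rows)
numeric_cols = ['public_repos', 'public_gists', 'followers', 'following']
df[numeric_cols] = df[numeric_cols].astype(int)

# from here on it is regular Pandas: describe, sort, groupby, etc.
print(df[numeric_cols].describe())
print(df.sort_values('followers', ascending=False).head(10))
```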
Overall this approach can be a good option for medium data scale. Job processing can be stopped and restarted later. This solution does require a lot of memory, so it is probably not a good choice when we need to store large amounts of data for extended periods of time. But it can be very useful as a place to temporarily store data while we are processing it; once the aggregations are done we can flush Redis.
Links
- Video of my presentation https://www.youtube.com/watch?v=Koh6piVaYh0
- Slides from my presentation http://bit.ly/36mQ8H2
- Code samples from my presentation https://github.com/dmitrypol/redis_data
- https://pandas.pydata.org/
- https://palletsprojects.com/p/flask/
- https://python-rq.org/