Updated: Sep 12, 2019
Recently, and quite suddenly, a data pipeline I am building for Rolls-Royce became complicated. Starting out in Azure was great: there was little infrastructure to manage, so I could focus on the moving parts of the code that ingest, parse and merge many different formats of data into one structured data asset. That was until we had to move away from Azure temporarily due to export-control ambiguity - really not something to mess around with.
This leaves me with a challenge: development of the pipeline needs to continue, but it makes little sense to do so in a way that would make it difficult to move back to Azure once we have all the legal documentation we need.
The result is this series of articles on the particular solution I implemented to solve the problem. Clearly I'm not going to discuss the particulars of the data, but I hope that somebody in a similar predicament will find this resource useful :-)
What do we need to do?
Store documents of different formats (csv, xlsx, pdf, etc.) in a scalable, easily accessible storage technology
Run Python scripts to extract and manipulate the data and store the results
Store the JSON data asset in a high-availability NoSQL database
Provide an API to access the JSON resulting from the data processing
Starting in Azure, requirement 1 was accomplished using Blob Storage and requirement 3 using Azure Cosmos DB with its MongoDB interface. The Python scripts have so far been run from the command line and not yet migrated into Azure, and an API providing the full suite of endpoints we need has not yet been developed.
To make the solution portable we have to accept that we will manage more infrastructure and configuration than we would using Azure PaaS. We will therefore use Docker to maintain a number of images as replacements for the Azure services.
Object Storage: Minio provides an S3-compatible object store as a Docker image
Azure Cosmos DB: since we are using the MongoDB API for Cosmos, we might as well use MongoDB directly, as it is available as an official Docker image
Running Python code: essentially we wish to run some Python functions on demand, passing arguments to provide execution context. They will need access to both Minio and MongoDB. Naturally, we will build the functions into Docker images. Managing them, though, is where we want a framework, to avoid writing our own scale-out and resource-management code. Thankfully the Cloud Native Computing Foundation hosts the Kubeless project (among others), which lets us deploy serverless functions in much the same way as I would on AWS.
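The shape of such a function is pleasingly plain. Kubeless's Python runtime invokes `handler(event, context)` and delivers the request payload in `event["data"]`; the `bucket`/`object` keys below are assumptions about our own payload, not anything Kubeless mandates.

```python
# handler.py - a Kubeless-style Python function: no framework imports needed.
# The runtime calls handler(event, context); the request body is event["data"].

def handler(event, context):
    """Validate an ingestion request and report what would be processed.
    The payload shape (bucket/object keys) is an assumption for this sketch."""
    data = event.get("data") or {}
    bucket = data.get("bucket", "documents")
    obj = data.get("object")
    if obj is None:
        return {"error": "missing 'object' in request payload"}
    # A real function would fetch the object from Minio, parse it,
    # and write the result to MongoDB here.
    return {"status": "queued", "bucket": bucket, "object": obj}
```

Deployment is a one-liner along the lines of `kubeless function deploy parse-doc --runtime python3.7 --from-file handler.py --handler handler.handler`, after which the function can be invoked over HTTP.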
API - we will need an API to provide data to apps consuming the data asset, along with some endpoints to add to the data asset. Again, Docker will be used to containerize the API, which will be deployed with Kubernetes (the platform underlying Kubeless)
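As a sketch of what such a containerized API could look like, here is a minimal Flask endpoint. Flask is one reasonable choice, not necessarily the framework used in the real project; the route, the `DOCUMENTS` dict and its contents are stand-ins so the routing is the focus (the real service would query MongoDB).

```python
from flask import Flask, jsonify  # pip install flask

app = Flask(__name__)

# Stand-in for a MongoDB query; data and endpoint path are assumptions.
DOCUMENTS = {"engine-1": {"source": "reports/engine.csv", "rows": 1}}


@app.route("/documents/<doc_id>")
def get_document(doc_id):
    """Return one processed document as JSON, or 404 if unknown."""
    doc = DOCUMENTS.get(doc_id)
    if doc is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(doc)


if __name__ == "__main__":
    # Bind to all interfaces so the container's published port works.
    app.run(host="0.0.0.0", port=8080)
```

Baked into an image with a short Dockerfile, this deploys to the same Kubernetes cluster that runs the Kubeless functions.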
Apps - web apps will be built and deployed using Kubernetes; native apps will access the API.
So there we have it: a bit of thought and canvassing of the open-source community has yielded the tools required to pull this data pipeline and its apps out of Azure and deploy them anywhere that Docker and Kubernetes can run (pretty much anywhere).
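For local development, before everything lands on Kubernetes, the two storage services can be stood up together with a Docker Compose sketch like the one below. Image tags, ports and credentials are assumptions for a throwaway dev environment, not production settings.

```yaml
version: "3"
services:
  minio:
    image: minio/minio
    command: server /data
    ports:
      - "9000:9000"
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
  mongo:
    image: mongo:4
    ports:
      - "27017:27017"
```

With this running, the Python snippets above can talk to `localhost:9000` and `localhost:27017` exactly as they would to the in-cluster services.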
The diagram below shows the components of the architecture.