I’m new to this world. I’ve been thinking about it recently because DevOps often gets questions from Data Science Engineers asking how to do certain things in git, and I’m learning that Data Science has some similarities to and differences from traditional software development.

Software Engineering vs Data Engineering

In the software world, multiple developers are working on the same code base. They are all making new features or correcting issues. Then, when the code is considered “okay”, it runs through a CI/CD pipeline, is packaged together, and that package is deployed (or consumed or something).

In the Data world, multiple engineers are working on the same dataset, but they’re not really packaging up an artifact to share with a client. Instead, the engineers are finding ways to present the data in a meaningful way for their clients.

However, there are several processes that need to be shared across the data engineers too. These processes take data from multiple sources, transform it, and then dump it to a data warehouse, or into another format that is easier for the engineers to consume.
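To make that concrete, here is a minimal sketch of one such transform. The staging tables (stg_orders, stg_customers) and the warehouse table dw.customer_orders are made up for illustration:

    -- Combine raw data from two sources and load it into the warehouse.
    -- All table and column names here are hypothetical.
    INSERT INTO dw.customer_orders (customer_id, customer_name, order_total, order_date)
    SELECT c.customer_id,
           c.customer_name,
           o.total_amount,
           CAST(o.created_at AS DATE)
    FROM stg_orders AS o
    JOIN stg_customers AS c
      ON c.customer_id = o.customer_id;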

Engineers need source control so they can collaborate on pipelines that ETL (or ELT) data and share what they are working on with other engineers. But there’s no concept of a SQL module that you would package into an artifact for another engineer to consume. Sharing is purely copy & paste from another’s example.

DevOps

DevOps I understand well. Before DevOps, developers would make their changes and then tell an operations team how to run the application in production. Operations engineers were then responsible for making sure the application continued to run. This works as long as the application works (hint: it never does) and the Operations team is already set up to handle the workload. For example, an application may need a specific version of the dotnet runtime installed, and the ops team may not have a machine that supports that yet. You can read all about it yourself in “The Phoenix Project” by Gene Kim, but the answer is that everything works better when developers and operations engineers work together instead of blaming each other for failing applications.

DataOps

DataOps to me sounds a lot like the Agile manifesto. (DataOps vs Agile) As far as I know, Data Engineers and Operations engineers have never clashed. Data Engineers only need access to the tools and data from operations, and Operations is not responsible for making sure that a specific query works properly. So why choose the name “DataOps”? I am disappointed. I was hoping to find a group of individuals searching for a way to streamline a Data Engineer’s workflow. I need something to describe how branches should live in source control, what the folder structure should look like, and when to use a CI/CD pipeline. I want standards and best practices.

I will be thinking about this more over the next week, but right now I see 2 paths.

  1. Data Engineers use folders for different environments and experiments. Imagine having dev, qa, and prod folders and copying the SQL files from one to the next when you want to “deploy” to that environment. The source control flow becomes dead simple: branch from main, make your change, merge back to main, and destroy your branch.
  2. Data Engineers use git branches for environments and experiments. This, I think, makes sense for experiments. I don’t really want to put an experiment into the main branch because I don’t know if it works yet. Even if it does work, I don’t know if I want to keep it. I need this space to stay the way I left it so I can revisit it whenever the business wants to see the results again with new data. Branches for environments are fine; we’ve been doing that forever. This is where things like ETL processes and data warehouse schema changes should live, because the concerns are shared among the whole team. Deployments to an environment are then managed by that environment: the database keeps track of which schema changes have already taken place and runs the new ones, in order (see the sketch after this list).
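That last idea is the familiar migration pattern. Here is a minimal sketch of the bookkeeping, assuming a hypothetical schema_migrations table; tools like Flyway or Liquibase provide this kind of tracking for you:

    -- Hypothetical table the database uses to remember which
    -- schema changes have already been applied in this environment.
    CREATE TABLE IF NOT EXISTS schema_migrations (
        version     INTEGER PRIMARY KEY,   -- migrations run in ascending order
        description VARCHAR(200) NOT NULL,
        applied_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Each migration script records itself after it runs, so a
    -- deployment only executes the versions that are still missing.
    INSERT INTO schema_migrations (version, description)
    VALUES (42, 'add order_total to dw.customer_orders');

The version number and description are placeholders; the point is only that deployment state lives in the environment, not in git.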