I have been in data (in the fuzziest sense of the word) since about 2009, whether that means data engineering, management, analysis, strategy, or visualization. Over that time, things have changed drastically.
In my first “real” data position, I was asked to identify and organize fallout from a claims auto-adjudication engine to find ways to avoid manual processing … in MS Access. This was objectively the wrong tool for this type of analysis, but I did what I was told, and it took forever to do something that could have been solved far more simply with basically any other technology.
Our “data engineering” process involved manually pulling down an Excel sheet that had been manually created and saved on a SharePoint site on a daily basis. If you missed work one day, that day’s data would be overwritten and lost into the ether. It was not ideal. If this is how you are currently handling data engineering, I apologize, and please reach out to me, because I can help you do better.
As my career progressed, so did the sophistication of the data engineering I saw at companies. Things moved from manual processes, to on-prem databases with manual processes, to automated processes kicked off by humans, to fully automated processes with no human intervention necessary.
We have come a long way, but what makes a data engineering process really stand out? Five years ago, I probably would have said some amount of automation with the option for manual refreshes, the ability to process a full data load in a reasonable amount of time, a plan for what to do when something goes wrong, and some sort of alerting for abnormalities in the data (at one company, a quirky engineer did this by sending emails with emojis based on the data load – this was not especially helpful).
Now, however, an effective data engineering process requires much more. You need to leverage CI/CD, have tracking that identifies who is working on what and where it lives, and check your code in at least once a day. In addition, you should leverage lower environments and make sure the code review process involves someone other than the person who wrote the code. Finally, you must use some sort of data pipeline orchestration that can run processes without human intervention, on a cadence that satisfies all data users; that can detect, through statistical analysis and validation testing, when something is unexpected; and that notifies multiple people who know what to do when something goes wrong.
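To make that last point concrete, here is a minimal sketch of the kind of post-load validation step an orchestrator could run. Everything in it is illustrative rather than prescriptive: the row-count history, the z-score threshold, and the notify() stub are placeholders for whatever warehouse queries and alerting tooling your team actually uses.

```python
import statistics
from datetime import date

# Hypothetical history of row counts from previous daily loads;
# in practice this would come from your warehouse or load logs.
daily_row_counts = [10_250, 10_410, 9_980, 10_120, 10_305, 10_198, 10_366]

# Notify more than one person, so an alert never depends on a single inbox.
ON_CALL = ["data-eng@example.com", "analytics-lead@example.com"]


def notify(recipients, message):
    """Stub for whatever alerting you actually use (email, Slack, PagerDuty, ...)."""
    for recipient in recipients:
        print(f"ALERT -> {recipient}: {message}")


def validate_load(todays_count, history, z_threshold=3.0):
    """Flag today's load if its row count is a statistical outlier vs. recent history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z_score = abs(todays_count - mean) / stdev if stdev else 0.0
    if z_score > z_threshold:
        notify(
            ON_CALL,
            f"{date.today()}: row count {todays_count} is {z_score:.1f} standard deviations "
            f"from the recent mean ({mean:.0f}); investigate before downstream refreshes run.",
        )
        return False
    return True


# An orchestrator task would call this right after each automated load:
if __name__ == "__main__":
    validate_load(todays_count=14_900, history=daily_row_counts)
```

In practice this would be one task in the orchestration pipeline, running after the load and before any downstream refreshes, so a bad load gets caught by a person before it gets trusted by a dashboard.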
That’s not all, though: You also need documentation. I suggest that every company maintain a best practices doctrine – a document that outlines how the work should be done (what technology, tools, and practices to use) and what common mistakes or “gotchas” occur in the data and why they matter. One thing that has never changed, and is the most often overlooked, is the need for clear, concise technical documentation of the data and the process itself, updated on a regular basis. I know this is the very last thing that anyone in a “data” position wants to do, but that is why we invented business analysts.
All joking aside, the best data engineering teams do not operate in a vacuum – they partner with their stakeholders and take them along for the data journey, ensuring at multiple points that the data being engineered is the data that people want and that everyone understands where the data comes from and what analyses are appropriate for the data.