“Data Wrangling,” as a Data Management term, is about as good as it gets. It is action-oriented, describes what it actually does, and doesn’t make people depressed as soon as they hear it. It is what you think it is: essentially putting square pegs in round holes. This is an important but often overlooked part of the Data Management equation.
Before there is a highly refined data model, supporting architecture, and a corresponding process that helps folks know what to do with it all, we have to figure some things out. Data Wrangling is all about early-stage data preparation, where we get the data somewhere we can use it before we decide whether it is worth putting somewhere more robust. This involves identifying potential data use cases, identifying source(s) of data to support them, and mocking up prototypes to see if our hypotheses have merit.
This may sound a lot like Data Science, and that is because Data Scientists often waste a lot of their time doing Data Wrangling. Preparing data for use is generally a lower-value-added activity than performing fancy analysis, and the fancy analysis is the real reason Data Scientists make the big bucks. If we want to get the most out of our investments in Data Science, we need to teach the data to walk before the scientists teach it to dance downstream.
So Data Wrangling is about taking relatively ambiguous ideas about potential Data Value and then doing a rough cut of what might eventually evolve into a production solution. This kind of thing does not happen in old-world data shops, where everything is predetermined before anybody has a chance to learn anything. Remember, a waterfall can move in any direction you want, as long as you want everything to come crashing down!
Some analogies write themselves.
Data Wrangling is antithetical to waterfall project methodologies. It is a discovery process with a high degree of uncertainty, and it will propel learning and movement in ways that cannot be easily predicted. Sure, we could conceivably wrap all this in a waterfall project, but why would we if the project management methodology adds nothing to help the actual work succeed? Then it becomes overhead for the sake of overhead, and this is not something we data professionals should be advocating.
That said, the tools of Data Wrangling are as varied as the data itself. When I am working on new data sourcing, my preference is to read the file specs, try to view the file in a text editor (if applicable, and if the data file is small enough), or load it into a relational database. For very large files, dedicated profiling tools come in handy. Other data sources like streaming can require different tools and techniques. Python and micro-services have enabled entirely new ways of processing data. Though the tools are incredibly varied, the choice largely depends on our personal technical proclivities and the types of data we are looking at.
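As one illustration of the "load it into a relational database" route, here is a minimal sketch using Python's built-in `csv` and `sqlite3` modules. The sample data, the `staging` table name, and the `load_csv_into_sqlite` and `profile` helpers are all made up for this example; a real file would come from disk and would likely need more care with encodings and data types.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a small source file.
SAMPLE_CSV = """customer_id,region,amount
1001,East,25.50
1002,West,
1003,East,17.25
1003,East,17.25
"""


def load_csv_into_sqlite(text, table="staging"):
    """Load CSV text into an in-memory SQLite table, every column as TEXT."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    return conn, header


def profile(conn, header, table="staging"):
    """Return the row count plus blank and distinct counts per column."""
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    report = {"rows": total, "columns": {}}
    for name in header:
        # The comparison yields 0 or 1 per row, so SUM counts the blanks.
        blanks, distinct = conn.execute(
            f'SELECT SUM("{name}" = ?), COUNT(DISTINCT "{name}") FROM "{table}"',
            ("",),
        ).fetchone()
        report["columns"][name] = {"blank": blanks, "distinct": distinct}
    return report


conn, header = load_csv_into_sqlite(SAMPLE_CSV)
report = profile(conn, header)
print(report)
```

Loading everything as text is deliberate at this stage: nothing gets rejected on the way in, so we can see the data as it really is before deciding what types it deserves.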
Once we select the technologies we want to work with, we need to, for lack of a better term, play with the data. We need to see if the data reflects its specifications (documentation is typically low on developers’ priority lists). We also need to figure out how we might join it to our existing data and what path it might take to achieve the Data Value use cases we have in mind.
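To make "playing with the data" a bit more concrete, here is a rough sketch of two quick checks: does a feed match its documented columns and types, and how many of its keys would actually join to data we already have? The spec, the feed, the existing key set, and the helper names are all hypothetical, and the type check is only as strict as Python's own `int` and `float` conversions.

```python
import csv
import io

# Hypothetical spec: column name -> converter that fails on bad values.
SPEC = {"customer_id": int, "region": str, "amount": float}

# Hypothetical new feed with a few deliberate problems.
NEW_FEED = """customer_id,region,amount
1001,East,25.50
1002,West,n/a
abc,East,17.25
1004,North,8.00
"""

# Hypothetical keys already present in our existing data.
EXISTING_CUSTOMER_IDS = {"1001", "1002", "1003"}


def check_against_spec(text, spec):
    """Compare CSV text to a spec; return the findings and the parsed rows."""
    reader = csv.DictReader(io.StringIO(text))
    found = reader.fieldnames or []
    missing = [col for col in spec if col not in found]
    unexpected = [col for col in found if col not in spec]
    bad_values = {col: 0 for col in spec if col in found}
    rows = list(reader)
    for row in rows:
        for col in bad_values:
            try:
                spec[col](row[col])
            except (TypeError, ValueError):
                bad_values[col] += 1
    findings = {
        "missing": missing,
        "unexpected": unexpected,
        "bad_values": bad_values,
    }
    return findings, rows


def key_match_rate(rows, key, existing_keys):
    """Fraction of rows whose key value appears in the existing key set."""
    if not rows:
        return 0.0
    matched = sum(1 for row in rows if row[key] in existing_keys)
    return matched / len(rows)


findings, rows = check_against_spec(NEW_FEED, SPEC)
match_rate = key_match_rate(rows, "customer_id", EXISTING_CUSTOMER_IDS)
print(findings)
print(f"join match rate: {match_rate:.0%}")
```

A low match rate or a pile of bad values is not a failure here. It is exactly the kind of early finding that tells us how much work a fuller solution would take.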
Data Wrangling is equal parts design and tinkering. It does not have a lot of prescribed activities, but it should result in a better understanding of what it will take to develop a more complete solution so that it will have a positive ROI. I like to think of Data Wrangling as a loose collection of mini-proofs-of-concept that will help us figure out the costs and benefits of particular options while minimizing the costs of doing that research.
I like doing Data Wrangling myself. I find it an essential part of the design and architecture process. Just like a chef might want to touch and taste their ingredients or visit the place from which their ingredients came, data practitioners will want to get close to the sources of data.
Many data sources are internally grown, so the above statement implies that data folks will need to get close to the operations of their own organization. This means the systems, sure, but also the people and processes that are involved.
To succeed with Data Wrangling, the most important thing to do is start doing it. Be inquisitive, solve problems, and break through the walls ourselves. Even if we are starting out in the smallest or most inexperienced environment, a little Data Security, a little Data Wrangling, and a little insight into what data might help change for the better might be enough to move the needle and get our data efforts going.
And until next time, go make an impact!