Stop boiling the data ocean

What evidence do we have that this strategy is the right one?

🦗🦗🦗

Before I took a hiatus for a stint of public service, Night Train was my full-time strategy consulting business, focused on helping growing businesses overcome their organizational strategy and product management challenges.

Over time, it became clear that many of my clients struggled to gain alignment on priorities because there wasn’t a shared reality of “What do we know about our business?” to build a path forward from.

As a result, my business shifted more into data consulting over time. That shift happened to coincide with the advent of the “modern data stack” movement, in which a small team could stitch together relatively inexpensive, low-overhead SaaS tools into a solid data platform. I spent a few years building data systems and teams for a number of clients, and in reflecting on lessons learned, one thing stands out:

I don’t care how much time you’ve invested in building data pipelines, setting up a data warehouse, choosing a Business Intelligence (BI) tool, or hiring a team of data scientists. If a team can’t look at a chart / metric / etc. together and feel confident that it accurately reflects the answer to the question they’re asking, then none of the rest matters.

Data modeling is the process of transforming an organization’s “raw” data that gets generated by its applications, SaaS tools, etc. into “transformed” data that is more easily consumable by BI tools and humans alike.

dbt's data modeling diagram

There are a variety of tools that can do this, but regardless of which you choose:

  1. A team must work together to align on business term definitions: As an example, “What do we mean when we say ‘The average order size of a customer’s first order is $X’? Does that include shipping, taxes, and coupons? Is it for all customers, or just those that are currently active? What does ‘active’ even mean?”
  2. The data modeling tooling must be rigorous: Notice how much nuance is packed into the questions above. That nuance must be encoded in the data modeling tooling, and it needs to be iterated upon over time. The tooling needs to inspire confidence that it reliably produces transformed datasets from which trustworthy analyses can be built.
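As a sketch of how that nuance might get encoded, here is a hypothetical dbt model for the “average first order size” question. The table names (`stg_orders`, `stg_customers`) and columns are illustrative assumptions, not from any real schema:

```sql
-- models/first_order_sizes.sql (hypothetical dbt model)
-- Encodes the team's agreed definitions:
--   * "order size" = merchandise subtotal minus coupons,
--     excluding shipping and taxes
--   * "customer"   = currently active customers only

with first_orders as (

    select
        customer_id,
        order_id,
        subtotal_amount - coupon_amount as order_size,
        row_number() over (
            partition by customer_id
            order by ordered_at
        ) as order_rank
    from {{ ref('stg_orders') }}

)

select
    f.customer_id,
    f.order_size as first_order_size
from first_orders as f
inner join {{ ref('stg_customers') }} as c
    on f.customer_id = c.customer_id
where f.order_rank = 1
  and c.is_active  -- "active" per the team's agreed definition
```

Once a model like this exists, “average first order size” is just an aggregation over it, and every report that uses it inherits the same definition.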

For #1, my favorite book for helping teams learn and apply data modeling practices is Agile Data Warehouse Design (physical, digital), because it explains common modeling approaches while encouraging an agile data team process and culture.

For #2, my favorite transformation tool is dbt because it applies the rigor of engineering practices like version control, automated data quality tests, and documentation coverage to the data transformation process, while being free, database agnostic, and easy enough for non-engineers to learn.
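For instance, a minimal dbt schema file layers automated data quality tests and documentation directly onto a model. The model and column names below are hypothetical, matching nothing in particular:

```yaml
# models/schema.yml (hypothetical example)
version: 2

models:
  - name: first_order_sizes
    description: >
      One row per active customer, with the size of their first
      order (subtotal minus coupons; excludes shipping and taxes).
    columns:
      - name: customer_id
        description: Primary key; one row per customer.
        tests:
          - unique
          - not_null
      - name: first_order_size
        description: Dollar value of the customer's first order.
        tests:
          - not_null
```

Running `dbt test` then checks these assertions against the warehouse on every change, which is exactly the kind of rigor that builds trust in the transformed data.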

Stop boiling the data ocean

It’s wild to me how many organizations start investing in data initiatives without clarity about what questions they’re trying to answer or how they’ll ensure the quality of reporting. What’s worse is that in the mad rush to be seen as a data-driven company or a leader in AI, they often spend a tremendous amount of money on engineers building data pipelines that aren’t necessary, data systems and tools that likely won’t get used, and data scientists who will spend most of their time cleaning up crappy data.

My advice for organizations looking to level up their data platform is:

  1. Pick a single question to answer that’s at the intersection of “It’d be valuable to know the answer to this” and “We think we likely already have the right raw data to inform this”.
  2. Work as a group to agree upon data definitions for the objects / attributes you’re working with to answer the question.
  3. Invest in a rigorous data modeling layer between your database/data warehouse and analysis tools, and just focus on using it to do modeling work for that one question.
  4. Use whatever existing analysis tool you already have to create a report (chart / metric / whatever), and review it as a team. Repeat steps #2 and #3 until you feel confident that the report accurately reflects your business reality.
  5. If that process goes well with the tools you have, great. Build upon that momentum by modeling the next question. If it didn’t go well, use that as concrete evidence of the need to invest in different tools, skills, etc. ASAP.
  6. Over time, as you model more of your business domain, iteratively level-up the other parts of your data team and data stack.

In each of the data platform modernization projects I’ve been a part of, we followed this "data modeling first" approach and began seeing tangible benefits within weeks. And in most cases, within months we were able to finish modeling most of the domain, create a full suite of reports, and build a solid data science foundation.

The beginning of wisdom is to call things by their proper name – Confucius

Resources

A lot of the work artifacts I created in this space can’t be shared because they are client-specific. But here are some resources that you may find helpful:

Data Initiative Goal Setting

Data Pyramid of Needs

Data Systems in Modern Data Stack

Data Definitions

The goal should be for all employees throughout the organization to use the same terminology and have a shared understanding of how critical data points are defined. I like to call the artifact / tool a given organization uses to do so its “Lexicon”, since it defines the branch of knowledge that is their business domain.

The closer these definitions are to “where the code is”, the better, since they are less likely to get out of date. For example, dbt allows you to do both the data modeling and documentation in the same place, with a nice exploration UI layered on top:
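Assuming a working dbt project, that exploration UI is generated from the same model and schema files the team already maintains:

```shell
# Build the documentation site (model descriptions, column docs,
# and the lineage graph) from the project's .sql and .yml files
dbt docs generate

# Serve the exploration UI locally (defaults to port 8080)
dbt docs serve
```

Because the docs are compiled from the code itself, a stale definition shows up as a stale file in version control rather than a forgotten wiki page.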

dbt's data documentation explorer

Even if you don’t have a tool for doing so, at the very least, create a shared doc that is widely available and has a clear cadence for review and updates:

Data Resources

I hope you found this post useful! Feel free to reach out if I can ever be of help.

7 firmaments of heaven and hell; 8 million stories to tell…