How to Scope a Data Engineering Project: A Detailed Guide

If you’re like most data scientists, scoping projects probably isn’t your favorite part of your job. Scoping can often feel like a mix of tedious busywork meant to placate executives and wild guessing. Chances are, you’ve seen past projects miss their scoping requirements entirely, with no negative consequences. Feeling like your work is meaningless, tedious and baseless is a recipe for frustration, no matter your profession. For data scientists, who are used to measuring things to determine their efficiency, it’s excruciating.

Fortunately, it’s possible to get better at scoping your projects. Project scoping is never going to go away, and despite what it might feel like, it’s not meaningless. Project scoping helps decision-makers determine how to prioritize projects for an organization. Doing it well means both that the most important projects receive the attention they need, and also that you’re more likely to be successful when you embark on a new project. In this post about how to scope a data engineering project, we’ll walk through a detailed guide on what you need to understand.

Understand the Problem You’re Trying to Solve

The first step to scoping any data engineering project is understanding the problem you’re trying to solve. A significant percentage of organizations miss this step when they’re scoping projects. Often, by the time a project gets to data engineers, it’s expressed as a simple solution. “We need a new metric” is one way that data engineers hear about problems. Other times, the request is for a new dashboard or report.

Those aren’t problems. They’re artifacts. It’s entirely possible that a new data engineering project will produce a number of new metrics, dashboards or reports. Those are effective tools to measure whether you’re solving problems. However, they’re not enough to know that the work you’re doing is effective. Instead of hearing that someone needs a new metric, recognize that they’re trying to understand why customers aren’t purchasing as many new products. Instead of simply producing a new report, an effective data engineering project starts from the fact that the sales team needs to know how customers who stay on your service for a full year behave compared to those who drop after three months.

Once you understand the problem space, you’re far more capable of scoping a project successfully. You’ll be able to point out when requested artifacts aren’t going to serve their purpose. Your team might be able to point to an existing artifact that serves as an effective proxy for the question stakeholders are asking. The time that you spend digging in and understanding the specific problem to solve means that you’ll be much more effective when it comes to providing a timeline and cost for the project.

Vet Your Impact Hypothesis

Once you understand your problem, the next step is outlining how this project will fix the problem. While many projects miss that first step, just as many miss this step, too. In short, most projects assume that if you carry them out, some impact will come to pass. That’s not a guarantee! Data science is fundamentally about measuring things. Measuring things, as you know, does not guarantee results. Our pasts are littered with projects that measure things effectively, only for those measurements never to be used to inform even a single decision.

A way to avoid this kind of failure is to codify your impact hypothesis. When you do this, you’ll outline exactly how the data you collect and analyze will turn into positive outcomes. Sometimes, by stating the impact hypothesis, you’ll recognize that this project will never succeed. Other times, you’ll find that your impact hypothesis has paths to failure which aren’t reliant on the data science portion of the project at all. Moving from measurement to producing tangible value is the key difference between data science and data engineering.

One path to success when vetting impact hypotheses is to state them specifically as a hypothesis. “If we do this, then something will happen.” The scientist in you will instantly begin to recognize ways that your hypothesis might not be true. That recognition is key to effective project scoping. You can recognize potential points of failure and build time and budget into your scope to manage them.
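If it helps to make this concrete, here is a minimal sketch of writing an impact hypothesis down as a structured record. Everything in it is an assumption for illustration: the `ImpactHypothesis` and `FailurePoint` names, the fields, and the example values (borrowed from the sales team scenario earlier in this post) are hypothetical, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class FailurePoint:
    """One way the hypothesis could turn out to be false (hypothetical structure)."""
    description: str
    within_data_team: bool  # False when the risk depends on someone outside the team


@dataclass
class ImpactHypothesis:
    """An 'if we do this, then something will happen' statement plus its known risks."""
    action: str
    expected_outcome: str
    failure_points: list[FailurePoint] = field(default_factory=list)

    def statement(self) -> str:
        # Force the hypothesis into an explicit if/then sentence.
        return f"If we {self.action}, then {self.expected_outcome}."

    def external_risks(self) -> list[str]:
        # Risks that no amount of data work will fix; these need time and budget too.
        return [fp.description for fp in self.failure_points if not fp.within_data_team]


# Illustrative values only.
hypothesis = ImpactHypothesis(
    action="show the sales team how one-year customers behave differently from three-month customers",
    expected_outcome="sales can focus retention outreach on accounts that look likely to leave",
    failure_points=[
        FailurePoint("Usage events are too sparse to separate the two groups", True),
        FailurePoint("Sales has no capacity to act on the findings", False),
    ],
)

print(hypothesis.statement())
print(hypothesis.external_risks())
```

The point of separating out `external_risks` is that those are the failure paths that don’t rely on the data work at all, and they’re the easiest ones to leave out of a scope.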

Enumerate What New Data Sources You Need

One of the more expensive and difficult parts of any data engineering project is acquiring new data. Not every project is going to need to source new data. If you’re entering a project where you anticipate that you already warehouse all of the data you need, great! If you’re not so lucky, this step of project scoping may take considerable effort. This is why it’s so critical that you understand the problem you’re solving and how that solution will turn into real value. You need to make sure that you’re not wasting time evaluating new sources of data. As such, you need a clear picture of how the data you’ll get from this new source will impact the overall project.

It’s never possible to get every part of project planning correct from the beginning. If it were, fewer projects would fail. This is a part of the scope that you want to get right as soon as possible, though. As noted, sourcing new data is expensive and time-consuming. It often requires engineers to build new data collection mechanisms, or contracts to be signed with data providers. Often, this is the part of the project that doesn’t depend on the data engineering team. Your project will be stuck in limbo while you wait for other people to finish their work. This can feel absolutely painful, and you’ll want to minimize the time spent waiting in this state. By effectively scoping this part of the project, you ensure that work on important sources starts as soon as possible.
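One way to keep track of this during scoping is a simple inventory of data sources. The sketch below is only an illustration: the `DataSource` fields, the `kickoff_order` helper, and every source name and lead time are made up. It lists each source, who has to do the work, and a rough lead time, then orders the new sources so the slowest ones get started first.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """One data source the project depends on (hypothetical fields)."""
    name: str
    already_warehoused: bool
    owner: str            # team or vendor that has to do the work
    lead_time_weeks: int  # rough guess at how long until the data is usable
    supports: str         # the part of the impact hypothesis this source feeds


def kickoff_order(sources):
    """Return only the new sources, longest lead time first."""
    new_sources = [s for s in sources if not s.already_warehoused]
    return sorted(new_sources, key=lambda s: s.lead_time_weeks, reverse=True)


# Illustrative inventory; none of these values come from a real project.
inventory = [
    DataSource("billing_history", True, "data engineering", 0, "identify one-year customers"),
    DataSource("in_app_events", False, "product engineering", 6, "compare customer behavior"),
    DataSource("vendor_firmographics", False, "external data provider", 10, "segment accounts"),
]

for source in kickoff_order(inventory):
    print(f"{source.name}: {source.lead_time_weeks} weeks, owned by {source.owner}")
```

Sorting by lead time is a deliberately crude choice. In practice you’d also weigh how much each source matters to the impact hypothesis, which is why the `supports` field is there.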

Outline the Analysis You Need to Do

This is probably the part of the project scope you spend most of your time on, if you’re anything like me. This is the work that you like to think about. What kind of difficult problems will you solve? What information will arise out of the data that suggests something you’d never considered before?

This is also important work when scoping the project. However, it’s critical that you leave it for last. You don’t want to focus on the artifacts of your work; you want to focus on the problem you’re solving. Scoping data engineering projects is a holistic process, which means understanding the root problem before you ever start to think about the interesting, difficult work you’ll need to do. In my experience, the key here is to not shortchange yourself when estimating time for analysis. Expect that some of your models will provide insufficient data. Plan for some of your existing metrics needing updates due to new ways of looking at the data.

Building in sufficient time for analysis is key to delivering high-quality scopes. Know how fast your team works, and communicate their pace effectively during scoping to make sure you’re providing the best picture to decision-makers.
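As a rough sketch of what building in time can look like, the function below pads a list of task estimates with a rework buffer and converts the total into calendar weeks using the team’s pace. The function name, the task list, and the 30 percent default buffer are all assumptions for illustration; use whatever your team’s own history suggests.

```python
import math


def estimate_analysis_weeks(task_days, team_days_per_week, rework_fraction=0.3):
    """Estimate calendar weeks for analysis work.

    task_days: mapping of task name to estimated working days
    team_days_per_week: focused working days the whole team delivers per week
    rework_fraction: extra time for models that come up short and metrics
        that need updating (the 0.3 default is a placeholder, not a benchmark)
    """
    if team_days_per_week <= 0:
        raise ValueError("team_days_per_week must be positive")
    base_days = sum(task_days.values())
    padded_days = base_days * (1 + rework_fraction)
    # Round up: a partial week still occupies the team's calendar.
    return math.ceil(padded_days / team_days_per_week)


# Illustrative estimates only.
tasks = {
    "define customer cohorts": 3,
    "build retention model": 8,
    "update existing metrics": 4,
}

weeks = estimate_analysis_weeks(tasks, team_days_per_week=6)
print(f"Scope roughly {weeks} weeks for analysis")
```

With these example numbers, 15 days of tasks become 19.5 days after the buffer, which comes out to 4 weeks at 6 team days per week.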

A Well-Scoped Project Sets You up for Success

I’m not going to pretend that scoping projects is suddenly going to be fun. I can’t imagine that reading through these steps will suddenly make you enjoy the process of calculating the ROI on some new data source. However, if you’re following these guidelines, you will produce projects that are more accurately scoped. Those projects are set up for success from day one, instead of coming out of the gate on a path to failure.

That also doesn’t mean that every project will be successful. As we noted above, sometimes your impact hypothesis is going to have holes in it. You’re going to test that hypothesis and find it to be incorrect. That’s OK! A failed project doesn’t mean your work was a failure. It just means that you found something that didn’t work. What’s important, though, is that you gave the project the best chance to succeed. Instead of inaccurate timelines and wild guesses about which projects will provide the best return on investment, your work will mean that your employer focuses on the most important work.

If you’re still looking for resources on scoping a data engineering project, I’ve found that this example worksheet is effective at helping point out which questions you should be asking and which problems you should be thinking about.

This post was written by Eric Boersma. Eric is a software developer and development manager who’s done everything from IT security in pharmaceuticals to writing intelligence software for the US government to building international development teams for non-profits. He loves to talk about the things he’s learned along the way, and he enjoys listening to and learning from others as well.