
Data Pipelines for Non-engineers

Data pipelines manage some of the most important tasks for a web company. As data comes in from users, it needs to come right back out as improved user experiences and fresh analytics for the team as quickly as possible. I was quite proud of having built an easy-to-use pipeline framework at Houzz that made it simple to add new MapReduce jobs, Hive/Impala queries, and data transfers while integrating them seamlessly with existing jobs. The framework grew with the needs of my fellow engineers and became an integral part of their work. Then I discovered that analysts just a few seats over from me were scheduling cron jobs to run their analytics rather than using our pipeline.

Beyond the cron

Cron jobs are great for making sure something runs regularly, but they have a lot of limitations that make them poorly suited for data pipelining. They aren't guaranteed to run after their inputs are ready, and they don't automatically retry upon failure, validate outputs, or transfer data to our reporting structure. The Houzz data pipeline did all of these things and more. So why would an analyst use a cron job rather than the pipeline? The initial answer was pretty simple: they didn't know that the data pipeline existed. All of the data engineers knew about it because it's what we work on and with. It's all over our internal chat rooms, e-mail threads and group discussions. Of course everyone knows about the pipeline and how to use it, unless they aren't participating in engineering chat rooms, e-mail threads and discussions. The first step to getting non-engineers to use your tools is to let them know the tools exist.
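To make the contrast with cron concrete, here is a minimal sketch of the kind of wrapper a pipeline puts around a job that a bare cron entry does not. It is illustrative only, not the Houzz framework; the callables and the polling interval are assumptions for the example.

import time

def run_with_pipeline_semantics(job, input_ready, validate_output, max_retries=3):
    """Run a job the way a pipeline would, rather than blindly on a schedule.

    job             -- callable that does the actual work
    input_ready     -- callable returning True once upstream data has landed
    validate_output -- callable returning True if the produced data looks sane
    (Hypothetical names; sketch only.)
    """
    # Cron fires at a fixed time; a pipeline waits until the inputs actually exist.
    while not input_ready():
        time.sleep(300)  # poll every five minutes

    for attempt in range(1, max_retries + 1):
        try:
            job()
        except Exception as exc:
            # Cron gives you one shot; a pipeline retries on failure.
            print(f"attempt {attempt} failed: {exc}")
            continue
        # Cron doesn't check the result; a pipeline validates it before publishing.
        if validate_output():
            return True
    return False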

More than just letting them know the tool exists, you have to walk them through how to use it. When other engineers wanted to know how to use the pipeline, I would just send them code paths for usage examples along with the underlying framework code, and that was usually enough to get them started. This doesn't work for less technical people. Instead, sit next to them and help them use the pipeline for the first time. They'll likely take detailed notes that can become the reference you point people to for how to use the data pipeline, and those notes will often be useful to engineers as well. Encourage the person you're helping to post their notes on an internal wiki or website, or offer to do it for them. Making sure these notes get shared will save you a lot of future effort.

Users versus pipelines

Once you have non-engineers using your pipeline, you have an entirely new and more intractable problem: non-engineers are using your pipeline. They'll tend to do things in ways you never anticipated, and when the assumptions behind the design of your pipeline are broken, your pipeline is broken. Our pipeline was originally written so that a single misconfigured job would completely break the pipeline, making it obvious that the job was misconfigured and ensuring that every job was properly configured. This worked well with engineers, who either tested before merging or at least quickly noticed and responded when they broke something. Non-engineers don't have the same aversion to broken builds that we do. They will merge code without testing it, never check to see that it's working, and be slow to respond when informed that they've broken the pipeline. You need to either prevent these jobs from being merged or handle them gracefully. Our data pipeline now safely prunes misconfigured jobs and their dependents and prints a warning. Not seeing their data produced is enough to alert people that their jobs aren't working, and the warning makes the problem clear to anyone who does test their job.
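The pruning itself is just graph bookkeeping. The sketch below is illustrative rather than the actual Houzz code: given a mapping from job names to their upstream dependencies and a validation check, it drops every job that fails validation along with everything downstream of it, printing a warning for each.

from collections import defaultdict

def prune_misconfigured(jobs, is_valid):
    """Remove invalid jobs and all of their dependents.

    jobs     -- dict mapping job name -> list of upstream job names
    is_valid -- callable returning True if a job's config passes validation
    (Hypothetical names; sketch only.)
    """
    # Build the reverse edges so we can walk downstream from a bad job.
    downstream = defaultdict(set)
    for job, deps in jobs.items():
        for dep in deps:
            downstream[dep].add(job)

    # Start from every job whose configuration fails validation.
    to_prune = {job for job in jobs if not is_valid(job)}

    # Expand the set to include everything that depends on a pruned job.
    frontier = list(to_prune)
    while frontier:
        job = frontier.pop()
        for child in downstream[job]:
            if child not in to_prune:
                to_prune.add(child)
                frontier.append(child)

    for job in sorted(to_prune):
        print(f"WARNING: skipping {job} (misconfigured or depends on a misconfigured job)")

    return {job: deps for job, deps in jobs.items() if job not in to_prune}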

Complexity is another enemy of wider pipeline adoption. Reducing complexity helps every user of your pipeline, so you should see this problem as a partial blessing. The part of our pipeline that gets used by non-engineers is the reporting structure for running SQL-like queries against HDFS data using Hive or Impala. A job consists of a query template and a configuration for how and when the template should be filled and run. The configuration causes all sorts of problems for non-engineers and should be pared down as much as possible.

Consider the following configuration that used to be in production for one of our Impala queries:

user_segments:
   <<: *report_table_date
   table_name: user_segments 
   first_time: 2013-09-01
   report_name: "user_segments"
   dependencies: 
      - web_request 
   frequencies: 
      - daily
   dependencies:
      - primary_db

This is YAML, and it is quite confusing. First, we have the initial line of <<: *report_table_date. This copies entries from a mapping defined earlier in the file to save some typing, but it obscures the configuration. This is not something you want to be doing. We also have the name user_segments appearing twice inside the mapping, even though the whole mapping is already nested under the key user_segments. That makes three places where someone who has copy-pasted a configuration needs to make the exact same edit. To an engineer, this makes sense: the config identifier, storage table, and query template identifier are three different things. But if they tend to be the same, you can just make that the default, so people who don't understand or care about the distinction have an easier time. Similarly, we can remove the daily frequencies specification by making daily the default too. Pulling out all this cruft, we end up with the more streamlined

user_segments:
   db: logs
   first_time: 2013-09-01
   dependencies:
      - web_request
   dependencies:
      - primary_db

The db: logs line was originally hidden in report_table_date, but now it's clear that you must specify the storage db. A nasty bug in the configuration has also become more apparent: the dependencies key has been defined twice. Because this is YAML, the second definition overrides the first, and we are left with only primary_db as a dependency. This key specifies which tables the query depends on, so the report might now get run before the web_request table is ready for that day. This is an easy mistake to make, as updating the report config is usually an afterthought after changing the query template. Often people would neglect to update the configuration at all. In this case, the actual query never depended on web_request at all; the dependency was probably just a bad combination of copy-pastes. To avoid these unnecessary mistakes, we now automatically infer table dependencies from the query itself, resulting in a shorter config that actually gets the dependencies right:

user_segments:
   db: logs
   first_time: 2013-09-01

We could probably even get rid of the two lines in this config for most cases, and may well do so in the future, but this is enough of an improvement that misconfigurations rarely cause problems anymore.
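For the curious, neither the defaulting nor the dependency inference needs to be sophisticated. The sketch below is illustrative only, not the actual Houzz implementation; the function names are made up, and it assumes PyYAML and a dict of query templates keyed by report name. It fills in table_name and report_name from the config key, defaults frequencies to daily, and scrapes table names out of the query's FROM and JOIN clauses.

import re
import yaml  # PyYAML, assumed to be available

def load_report_config(yaml_text, query_templates):
    """Expand minimal report configs into full job specs.

    yaml_text       -- report configuration text, like the user_segments entry above
    query_templates -- dict mapping report name to its query template text
    (Hypothetical function; field names mirror the examples in this post.)
    """
    jobs = {}
    for key, conf in yaml.safe_load(yaml_text).items():
        conf = dict(conf or {})
        # The config key doubles as the storage table and template name unless overridden.
        conf.setdefault('table_name', key)
        conf.setdefault('report_name', key)
        # Daily is by far the most common frequency, so make it the default.
        conf.setdefault('frequencies', ['daily'])
        # Infer dependencies from the query itself instead of trusting a hand-written list.
        conf['dependencies'] = infer_dependencies(query_templates[conf['report_name']])
        jobs[key] = conf
    return jobs

def infer_dependencies(query):
    """Pull table names out of FROM and JOIN clauses with a simple regex."""
    tables = re.findall(r'\b(?:from|join)\s+([A-Za-z_][\w.]*)', query, re.IGNORECASE)
    return list(dict.fromkeys(tables))  # de-duplicate while preserving order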

The template language used for our queries has also been a challenge. We've kept things simple by just using Mustache templates to substitute in variable information like dates and times. This is pretty easy to use. People have no problem replacing where dt <= '2015-01-11' with where dt <= '{{dt}}' in a query to generalize it. They simply write a query template like

select photo_id, count(*) from