
Data Pipelines for Non-engineers

Data pipelines manage some of the most important tasks for a web company. As data comes in from users, it needs to flow right back out as improved user experiences and fresh analytics for the team, as quickly as possible. I was quite proud of having built an easy-to-use pipeline framework at Houzz that made it simple to add new MapReduce jobs, Hive/Impala queries and data transfers while integrating them seamlessly with existing jobs. The framework grew with the needs of my fellow engineers and became an integral part of their work. Then I discovered that analysts just a few seats over from me were scheduling cron jobs to run their analytics rather than using our pipeline.

Beyond the cron

Cron jobs are great for making sure something runs regularly, but they have a lot of limitations that make them poorly suited for data pipelining. They aren't guaranteed to run after their inputs are ready. They don't automatically retry upon failure, validate outputs, or transfer data to our reporting structure. The Houzz data pipeline did all of these things and more. So why would an analyst use a cron job rather than the pipeline? The initial answer was pretty simple: they didn't know that the data pipeline existed. All of the data engineers knew about it because it's what we work on and with. It's all over our internal chat rooms, e-mail threads and group discussions. Of course everyone knows about the pipeline and how to use it, unless they aren't participating in engineering chat rooms, e-mail threads and discussions. The first step to getting non-engineers to use your tools is to let them know the tools exist.
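For a sense of the gap, here is a minimal sketch, not our actual pipeline code, of what a pipeline run adds on top of a bare cron entry. The job interface (inputs_ready, run, validate, publish) is assumed purely for illustration:

import time

# Minimal sketch (not the actual pipeline) of what a pipeline run adds on
# top of a bare cron entry: wait for inputs, retry on failure, validate the
# output, and publish it to the reporting structure. The job interface here
# (inputs_ready, run, validate, publish) is assumed for illustration.
def run_job(job, max_retries=3, poll_seconds=60):
    while not job.inputs_ready():       # cron can't guarantee upstream data exists yet
        time.sleep(poll_seconds)
    for attempt in range(max_retries):  # cron won't retry a failed run
        try:
            output = job.run()
        except Exception:
            continue
        if job.validate(output):        # sanity-check the output before anyone sees it
            job.publish(output)         # transfer to the reporting structure
            return True
    return False                        # surface the failure instead of failing silently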

More than just letting them know the tool exists, you have to walk them through how to use it. When other engineers wanted to know how to use the pipeline, I would just send them code paths for usage examples and the underlying framework code. This was usually enough to get them started. This doesn’t work for less technical people. Instead, sit next to them and help them use the pipeline for the first time. They’ll likely take detailed notes that can become the reference you point people to for how to use the data pipeline. Make sure these notes get shared. This will likely even be useful to engineers. Encourage the person you’re helping to post their notes on an internal wiki or website or offer to do it for them. This will save you a lot of future effort.

Users versus pipelines

Once you have non-engineers using your pipeline, you have an entirely new and more intractable problem: non-engineers are using your pipeline. They'll tend to do things in ways you never anticipated, and when the assumptions behind the design of your pipeline are broken, your pipeline is broken. Our pipeline was originally written so that a single misconfigured job would break the entire pipeline, making it immediately obvious that the job was misconfigured and ensuring that all jobs were properly configured. This worked well for engineers, who either tested before merging or at least quickly noticed and responded when they broke something. Non-engineers don't have the same aversion to broken builds that we do. They will merge code without testing it, never check to see that it's working, and be slow to respond when informed that they've broken the pipeline. You need to either prevent these jobs from being merged or handle them gracefully. Our data pipeline now safely prunes misconfigured jobs and their dependents and prints a warning. Not seeing their data produced is enough to alert people that their jobs aren't working, and the warning makes the problem obvious to anyone who tests their job at all.
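As a rough illustration (not the pipeline's actual code), pruning might look something like the sketch below, where each job lists the jobs it depends on and a hypothetical is_valid check stands in for whatever configuration validation the pipeline runs:

# Hypothetical sketch of pruning misconfigured jobs and everything downstream
# of them. "jobs" maps a job name to the names of the jobs it depends on;
# "is_valid" stands in for the pipeline's configuration checks.
def prune_misconfigured(jobs, is_valid):
    pruned = {name for name in jobs if not is_valid(name)}
    # Keep removing jobs until nothing left in the graph depends on a pruned job.
    changed = True
    while changed:
        changed = False
        for name, deps in jobs.items():
            if name not in pruned and any(dep in pruned for dep in deps):
                pruned.add(name)
                changed = True
    for name in sorted(pruned):
        print("WARNING: skipping misconfigured job or dependent: %s" % name)
    return {name: deps for name, deps in jobs.items() if name not in pruned}

A misconfigured job then takes only its own dependents down with it, while every unrelated job still runs and produces data.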

Complexity is another enemy of wider pipeline adoption. Reducing complexity helps every user of your pipeline, so you should see this problem as a partial blessing. The part of our pipeline that gets used by non-engineers is the reporting structure for running SQL-like queries against HDFS data using Hive or Impala. A job consists of a query template and a configuration for how and when the template should be filled and run. The configuration causes all sorts of problems for non-engineers and should be pared down as much as possible.

Consider the following configuration that used to be in production for one of our Impala queries:

user_segments:
   <<: *report_table_date
   table_name: user_segments 
   first_time: 2013-09-01
   report_name: "user_segments"
   dependencies: 
      - web_request 
   frequencies: 
      - daily
   dependencies:
      - primary_db

This is YAML, and it is very confusing. First, there is the initial line <<: *report_table_date. This copies entries from a mapping defined earlier in the file to save some typing, but it obscures the configuration; this is not something you want to be doing. The name user_segments also appears twice inside the mapping, even though the whole mapping is already nested under the key user_segments. That makes three places where someone who has copy-pasted a configuration needs to make the exact same edit. To an engineer, this makes sense: the config identifier, storage table, and query template identifier are three different things. But if they tend to be the same, you can just make that the default, so people who don't understand or care about the distinction will have an easier time. Similarly, we can remove the daily frequencies specification by making that the default too. Pulling out all this cruft, we end up with the more streamlined:

user_segments:
   db: logs
   first_time: 2013-09-01
   dependencies:
      - web_request
   dependencies:
      - primary_db

The db: logs line was originally hidden in report_table_date, but now it's clear that you must specify the storage db. A nasty bug in the configuration has also become more apparent: the dependencies sub-map is defined twice. Because this is YAML, the first mapping is overridden and we are left with only primary_db as a dependency. This section specifies which tables the query depends on, so the report might now get run before the web_request table is ready for that day. It's an easy mistake to make, since updating the report config is usually an afterthought after changing the query template; often people would neglect to update the configuration at all. In this case, the actual query did not depend on web_request at all and never had, probably just a bad combination of copy-pastes. To avoid these unnecessary mistakes, we now automatically infer table dependencies from the query itself, resulting in the shorter config that actually gets the dependencies right:

user_segments:
   db: logs
   first_time: 2013-09-01

We could probably even get rid of the two lines in this config for most cases, and may well do so in the future, but this is enough of an improvement that misconfigurations rarely cause problems anymore.
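As a rough illustration of both changes, the sketch below (not our production code; the helper names and the example query are made up) shows how a loaded config could have its defaults filled in and its dependencies inferred from the query text:

import re

# Hypothetical sketch: default the table, report name and frequency from the
# config key, and infer dependencies by scanning the query for FROM and JOIN
# clauses. Real inference would also need to handle sub-queries and aliases.
TABLE_PATTERN = re.compile(r'\b(?:from|join)\s+([A-Za-z_][\w.]*)', re.IGNORECASE)

def infer_dependencies(query_template):
    return sorted(set(TABLE_PATTERN.findall(query_template)))

def finalize_job(name, config, query_template):
    job = dict(config)
    job.setdefault("table_name", name)        # storage table defaults to the config key
    job.setdefault("report_name", name)       # query template defaults to the config key
    job.setdefault("frequencies", ["daily"])  # daily unless stated otherwise
    # Ignore any hand-written dependency list and trust the query itself.
    job["dependencies"] = infer_dependencies(query_template)
    return job

query = "select segment, count(*) from user_segment_events where dt = '{{dt}}' group by segment"
print(finalize_job("user_segments", {"db": "logs", "first_time": "2013-09-01"}, query))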

The template language used for our queries has also been a challenge. We've kept things simple by just using Mustache templates to substitute in variable information like dates and times. This is pretty easy to use: people have no problem replacing where dt <= '2015-01-11' with where dt <= '{{dt}}' in a query to generalize it. They simply write a query template like

select photo_id, count(*) from photo_views 
where dt = {{dt}} group by photo_id

and you can easily produce daily reports on photo views. The problem is that needs grew beyond simple daily reporting, so people started writing queries like

select photo_id, count(*) from photo_views
where dt between DATE_SUB({{dt}}, 30) and {{dt}}
group by photo_id

to get a 30-day window. This contains an annoying fencepost error: BETWEEN is inclusive on both ends, so the query actually covers a 31-day window rather than the desired 30 days. It also breaks the dependency inference. When the pipeline tries to run this query, it will only ensure that photo_views is ready for the one day corresponding to {{dt}}, not for the other days in the window. I had already fixed this by allowing a template like

select photo_id, count(*) from photo_views
where {{dt_window}}
group by photo_id

We can then control the window size via configuration and everything works properly. The problem was that only the other engineers knew I had done this, because it had only been mentioned in engineering chat rooms. It's important to keep educating users about new features and requirements. I had to correct a lot of templates and make public announcements to make sure that everyone knew how to use this feature going forward. I also took the time to add clear explanations of all template features to the internal pipeline documentation.
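To make the mechanics concrete, here is a minimal sketch of how such a template might be filled. The real pipeline renders Mustache templates, so plain string replacement is only an illustration, and the inclusive window arithmetic and quoting are assumptions about how dt and dt_window behave:

import datetime

# Minimal sketch (not the pipeline's actual renderer) of filling the {{dt}}
# and {{dt_window}} placeholders for a given run date and window size.
def render_query(template, run_date, window_days):
    start = run_date - datetime.timedelta(days=window_days - 1)  # inclusive window
    substitutions = {
        "{{dt}}": "'%s'" % run_date,  # quoting here is an assumption; templates vary
        "{{dt_window}}": "dt between '%s' and '%s'" % (start, run_date),
    }
    for placeholder, value in substitutions.items():
        template = template.replace(placeholder, value)
    return template

template = """select photo_id, count(*) from photo_views
where {{dt_window}}
group by photo_id"""

print(render_query(template, datetime.date(2015, 1, 11), 30))
# ... where dt between '2014-12-13' and '2015-01-11' ...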

Fixing broken queries

The most surprising of all these challenges was the way in which a non-developer develops a query. An analyst or salesperson doesn't really care about maintainability or scalability; they only care that the query produces the desired results. This is a very pragmatic view, but it means that the queries will often break as systems change and the underlying data grows.

It's not uncommon to see a job failing, inspect it, and find a 500+ line query. Figuring out what's going wrong with such a query can be a daunting task, but most of those lines were probably not hand-written. If you look for patterns, you'll likely see where copy-paste was used and be able to refactor the query into something smaller. Often there are a few repeated sub-queries that can be put into their own tables, and if you check other jobs, you may find that these sub-queries can be factored out of many other queries as well. Not having to hold these intermediate results fully in memory can make a big difference. Another common issue is the use of UNION and WHERE clauses to achieve what GROUP BY is meant for. For example, you may find a query like the following to count the number of users in different age brackets:

select "<18" as age_bracket, count(*) from users where age < 18
union
select "18-24" as age_bracket, count(*) from users where age between 18 and 24
union
select "25-34" as age_bracket, count(*) from users where age between 25 and 34
union
select "35-44" as age_bracket, count(*) from users where age between 35 and 44
union
select "45-54" as age_bracket, count(*) from users where age between 45 and 54
union
select "55-64" as age_bracket, count(*) from users where age between 55 and 64
union
select "65+" as age_bracket, count(*) from users where age > 65

which causes a lot of unnecessary extra work and can actually break if the query runner tries to load seven copies of the users table into memory simultaneously to run the sub-queries in parallel. It's also easy for errors to creep in when the unioned sub-queries get more complicated, as every edit needs to be repeated seven times. It's fairly simple to remove all the unions by using a CASE expression for age_bracket and grouping by it:

SELECT
   CASE
      WHEN age < 18 THEN "<18"
      WHEN age BETWEEN 18 AND 24 THEN "18-24"
      WHEN age BETWEEN 25 AND 34 THEN "25-34"
      WHEN age BETWEEN 35 AND 44 THEN "35-44"
      WHEN age BETWEEN 45 AND 54 THEN "45-54"
      WHEN age BETWEEN 55 AND 64 THEN "55-64"
      ELSE "65+"
   END AS age_bracket,
   COUNT(*)
FROM users
GROUP BY age_bracket

Simplifications like this make queries easier to read and debug, and sometimes they actually fix them. It's important to keep an eye out for these issues, help fix them, and show your pipeline's users how to avoid them in the future.

Conclusions

These are just a few of the things we've had to do to keep the Houzz data pipeline accessible to non-engineers. It's a lot of hard work, but quite worth it. Over 40% of our report queries are now written by non-engineers, providing a lot of useful data that is always ready when needed. This makes everyone's jobs easier. For more technical details on how our pipeline works, stay tuned for an upcoming post on how we use Luigi for pipelining. Or apply to work at Houzz now to become a part of it.
