
Becoming a Data Scientist at Houzz: A Checklist

In this blog post, Peizhou Liao takes us through his first few months as a Houzzer and shares a checklist of what it takes to be a great data scientist at Houzz.

On my first day at Houzz, I sat in on a product review meeting. As I listened to the team debate the merits of various new features, I was struck by the sincere passion for products that solve real problems and by the data-driven culture that drives new product development. Houzz was just the place I had been looking for, and I couldn’t wait to plunge into new projects.

As my first quarter comes to an end, I’ve prepared a checklist to help other qualified data scientists get ready for the role at Houzz. These tips fall into four essential categories: data intuition, model development, programming skills, and product sense.

Data Intuition

Data intuition is partially innate but mostly learned by playing with numerous data sets. Houzz logs the activities of over 40 million monthly unique users and over two million home renovation and design professionals. That’s a massive amount of data, which enables data scientists to develop their data intuition and grow technical skills and expertise, all while contributing to Houzz’s business.

To fully utilize the data, an intuitive understanding of basic statistical concepts and of the relationships within the data is highly desirable. For instance, we predict inventory availability for advertising packages that can be offered to home professionals in a future period. Such inventory forecasting is based on historical time series data. To obtain an accurate forecast, it is crucial to understand the fundamentals of time series, including trend, seasonality, and noise, asking questions like the following (a brief decomposition sketch follows them):

– Is the increasing trend linear or exponential?

– Can the seasonality be explained by the nature of the industry?

– Is it safe to assume normally distributed noise?
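
One way to probe these questions is to decompose the series into its components. The sketch below is a minimal illustration with statsmodels on a synthetic monthly series, not our production pipeline; the data and its December bump are made up.

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with a linear trend and a December bump;
# real inputs would come from historical logs.
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
inventory = pd.Series(range(100, 148), index=idx, dtype=float)
inventory += 10 * (idx.month == 12)

# Decompose into trend, seasonal, and residual components.
# model="additive" assumes a linear trend; use model="multiplicative"
# if the trend looks exponential instead.
result = seasonal_decompose(inventory, model="additive", period=12)

print(result.trend.dropna().head())      # is the trend linear or exponential?
print(result.seasonal.head(12))          # does the seasonality make sense?
print(result.resid.dropna().describe())  # is the noise roughly normal?
```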

Another important element is identifying relevant external variables beyond the information contained in past observations of the series. At Houzz, the effects of seasonal increases in website traffic, product updates, and renovation trends in the wider housing market are often of particular interest. These effects differ considerably: seasonal increases tend to be confined to narrow timeframes, while new product launches and renovation trends have a longer-term impact. By carefully examining these relationships, we can decide whether or not to include an external variable in downstream model development, i.e., whether to employ a dynamic regression model or a regular ARIMA (AutoRegressive Integrated Moving Average) model.
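
As a rough sketch of this choice, statsmodels’ SARIMAX can express both options: without an exogenous regressor it reduces to a regular ARIMA, and with one it acts as a dynamic regression with ARIMA errors. The series below are synthetic stand-ins, not Houzz data.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=48, freq="MS")

# Synthetic stand-ins for ad inventory and an external regressor
# (site traffic); real inputs would come from historical logs.
traffic = pd.Series(100 + rng.normal(0, 5, 48).cumsum(), index=idx)
inventory = 0.8 * traffic + rng.normal(0, 3, 48)

# Regular ARIMA: uses only past observations of the series.
arima = SARIMAX(inventory, order=(1, 1, 1)).fit(disp=False)

# Dynamic regression: the same ARIMA error structure plus the
# external variable as an exogenous regressor.
dynamic = SARIMAX(inventory, exog=traffic, order=(1, 1, 1)).fit(disp=False)

# Compare fits; a lower AIC favors keeping the external variable.
print(arima.aic, dynamic.aic)
```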

To get the most information out of the data, we commonly aggregate, segment, and graph the data in different ways. Data segmentation helps to avoid Simpson’s paradox, in which a trend appears in several different groups of data but disappears or reverses when the groups are combined, and is therefore important for interpreting the information correctly. Data aggregation effectively alleviates sparsity problems when calculating summary statistics. Additionally, visualizing data from different angles gives us various perspectives on our business.
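
A toy example makes the paradox concrete; the counts below are fabricated for illustration. Variant A wins inside every segment, yet variant B looks better once the segments are pooled.

```python
import pandas as pd

# Fabricated counts chosen to exhibit Simpson's paradox.
df = pd.DataFrame({
    "segment":   ["small", "small", "large", "large"],
    "variant":   ["A", "B", "A", "B"],
    "trials":    [87, 270, 263, 80],
    "successes": [81, 234, 192, 55],
})

# Aggregated view: variant B appears better overall.
pooled = df.groupby("variant")[["trials", "successes"]].sum()
print(pooled["successes"] / pooled["trials"])

# Segmented view: variant A wins within each segment.
by_segment = df.groupby(["segment", "variant"])[["trials", "successes"]].sum()
print(by_segment["successes"] / by_segment["trials"])
```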

Model Development

A data scientist cannot tell a good story with data in the absence of appropriate modeling, which is indispensable for turning raw data into meaningful business insights. All Houzz data scientists are experts in the multifaceted process of model development, which typically includes four major components: hypothesis generation, feature engineering, model building, and performance evaluation.

Hypothesis Generation

Models are always developed for practical use cases and for improving business operations. Before diving into the data, it’s critical for us to talk with stakeholders to define the use cases, identify associations of particular interest, and establish what the model should predict and how. Such communication around creating a hypothesis ensures that the model focuses on the right problems and that the results positively influence our business operations. Our data scientists are always encouraged to take a hypothesis-centered approach in practice.

Feature Engineering

Data has to be refined into relevant information in order to train a model. Feature engineering allows us to craft features that accurately represent key patterns, which gives the model higher predictive power. To power feature engineering, our data scientists must gain industry-specific domain knowledge and develop a range of techniques, including transforming individual predictors into more contextually meaningful information and grouping data into reasonable bins. For example, field experience is indispensable for detecting and removing a “smoking gun” feature, and a log transformation is often necessary to reduce data variability.
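
For illustration, here is a minimal sketch of both techniques with pandas and NumPy; the budget and home-age features, the bin edges, and the labels are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw features: heavily right-skewed project budgets
# and a continuous home-age predictor.
df = pd.DataFrame({
    "budget": [500, 1_200, 3_000, 15_000, 250_000],
    "home_age": [2, 11, 27, 45, 80],
})

# Log transformation to reduce variability in the skewed predictor;
# log1p keeps zero budgets well defined.
df["log_budget"] = np.log1p(df["budget"])

# Group the continuous predictor into contextually meaningful bins.
df["age_bucket"] = pd.cut(
    df["home_age"],
    bins=[0, 10, 30, 60, np.inf],
    labels=["new", "modern", "mid_century", "historic"],
)
print(df)
```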

Model Building

Models are often imperfect, but some may still be useful. At Houzz, we strive to maximize model elegance and predictive power. Given our fast-paced environment, however, we often seek useful models rather than perfect ones. If the total development cost of a new model exceeds the value it can add, it is preferable to keep the existing model and revisit and improve it later. It is highly desirable for a data scientist to have extensive expertise in at least one machine learning model, because most of the time they can obtain satisfactory results by fine-tuning that model and successfully deploying it.
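
As a sketch of what fine-tuning a single well-understood model can look like, the snippet below searches a small grid over a gradient-boosted classifier with scikit-learn. The model choice, data, and parameter grid are illustrative, not our production setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data; in practice X and y would be engineered
# features and labels for the problem at hand.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A small grid over the knobs that matter most for boosting.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1],
        "max_depth": [2, 3],
    },
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```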

Performance Evaluation

Good models are typically those that predict new data with high accuracy and are easy to interpret. To assess generalizability, we perform cross-validation and receiver operating characteristic (ROC) curve analysis. For imbalanced classification problems, we use a precision-recall curve as well as the F1 score. For interpretability, we often choose a model that is interpretable rather than a black box, so that stakeholders can understand why certain decisions or predictions were made. Sometimes only a complex model can be used. In those cases, we employ explanation techniques that describe the model’s predictions (e.g., LIME, Local Interpretable Model-agnostic Explanations) to understand the reasons behind a decision.
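
A minimal evaluation sketch with scikit-learn, using a synthetic imbalanced problem in place of real data, might combine these metrics as follows:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced problem standing in for a real task.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1_000)

# Cross-validation for generalizability, scored with ROC AUC.
print(cross_val_score(model, X_tr, y_tr, cv=5, scoring="roc_auc").mean())

model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# With heavy class imbalance, the precision-recall curve and the
# F1 score are more informative than plain accuracy.
precision, recall, _ = precision_recall_curve(y_te, proba)
print("PR curve points:", len(precision))
print("ROC AUC:", roc_auc_score(y_te, proba))
print("F1:", f1_score(y_te, model.predict(X_te)))
```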

Programming Skills