This is the sixth and concluding post in our Data Visualization Spotlight series, where we showcase how different organizations use data visualization and analytics to solve their day-to-day problems.
Founded in August 2008, Airbnb is an online community marketplace for people to list, discover, and book unique accommodations around the world. With over 500,000 listings in more than 34,000 cities and 192 countries, Airbnb connects people to unique travel experiences.
To create those memorable experiences for its guests, Airbnb has to continuously come up with creative ways to help people find what they are looking for, sometimes in places they know very little about. The key to this is their search algorithm—a system that combines dozens of signals to surface the listings guests want.
Perfecting the search algorithm
In an article published on the Airbnb blog, Maxim Charkov, Riley Newman, and Jan Overgoor describe how they improved their search algorithm.
Initially, when there was not enough data to understand what guests would want, “they returned what they considered to be the highest quality set of listings within a certain radius from the center of wherever someone searched (as determined by Google).”
Fig: SF heatmap of listings returned without location relevance model. Image Source: nerds.airbnb.com
However, they soon realized that this model would not suffice in the long run. The listings that came up for a specific search query were spread randomly across the town, and sometimes even outside it. “This is a problem because the location of a listing is as significant to the experience of a trip as the quality of the listing itself. However, while the quality of a listing is fairly easy to measure, the relevance of the location is dependent upon the user’s query.”
To improve on this, they “introduced an exponential demotion function based upon the distance between the center of the search and the listing location, which they applied on top of the listing’s quality score.” The reasoning is that listings closer to the center of the search area are more relevant to the query.
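As a rough illustration of the idea, here is a minimal Python sketch of an exponential distance demotion applied on top of a quality score. The function name, the `decay_km` parameter, and the sample listings are all hypothetical; the article does not give Airbnb's actual formula or constants.

```python
import math


def demoted_score(quality: float, distance_km: float, decay_km: float = 5.0) -> float:
    """Scale a quality score by exp(-distance / decay).

    decay_km is an assumed tuning constant, not a value from the article.
    """
    return quality * math.exp(-distance_km / decay_km)


# Hypothetical listings: (name, quality score, distance from search center in km)
listings = [
    ("central studio", 0.70, 0.5),
    ("outer-neighborhood loft", 0.90, 8.0),
]

# Rank by demoted score, best first.
ranked = sorted(listings, key=lambda item: demoted_score(item[1], item[2]), reverse=True)
```

With these made-up numbers, the lower-quality central listing outranks the higher-quality distant one, which is the kind of pull toward the center described in the next paragraph.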
Fig: SF heatmap with distance demotion. Image Source: nerds.airbnb.com
This was a step in the right direction because it removed the issue of random locations, but the model overemphasized centrality, returning listings predominantly in the city center rather than in other neighborhoods where people might prefer to stay.
To improve on the algorithm further, they “tried shifting from an exponential to a sigmoid demotion curve. This had the benefit of an inflection point, which we could use to tune the demotion function in a more flexible manner.”
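A sigmoid demotion can be sketched in a few lines of Python. This is only an assumed form (a standard logistic curve); `inflection_km` and `steepness` are hypothetical parameter names, chosen to show how an inflection point gives a tunable radius inside which listings are barely demoted.

```python
import math


def sigmoid_demotion(distance_km: float, inflection_km: float = 5.0, steepness: float = 1.0) -> float:
    """Logistic demotion factor between 0 and 1.

    The factor stays close to 1 well inside the inflection distance,
    equals 0.5 exactly at it, and falls toward 0 beyond it.
    Both parameters are assumed tuning knobs, not values from the article.
    """
    return 1.0 / (1.0 + math.exp(steepness * (distance_km - inflection_km)))


# Compare the factor at a few distances from the search center.
factors = {d: sigmoid_demotion(d) for d in (0.0, 2.5, 5.0, 7.5, 10.0)}
```

Unlike the exponential curve, which starts penalizing distance immediately, this shape leaves a plateau near the center, and moving `inflection_km` shifts where the drop-off happens.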
Fig: Listing Density from City Center. Image Source: nerds.airbnb.com
However, this modification was far from perfect too. Every city required individual tweaking to accommodate its size and layout. And the city center still benefited from distance-demotion. It quickly became clear that predetermining and hardcoding the perfect logic was too tricky when thinking about every city in the world all at once.
Fig: Choropleth of probability of booking given a general query for San Francisco. Image Source: nerds.airbnb.com
To solve this riddle further, they looked towards their community data. “Using a rich dataset comprised of guest and host interactions, we built a model that estimated a conditional probability of booking in a location, given where the person searched. A search for San Francisco would thus skew towards neighborhoods where people who also search for San Francisco typically wind up booking.”
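One simple way to estimate such a conditional probability is to count, for each search query, where the resulting bookings landed. The sketch below assumes a hypothetical log of `(query, booked_neighborhood)` pairs with invented entries; the article does not describe Airbnb's actual data model or estimator.

```python
from collections import Counter, defaultdict


def booking_probabilities(events):
    """Estimate P(booked location | search query) from (query, location) pairs."""
    counts = defaultdict(Counter)
    for query, location in events:
        counts[query][location] += 1
    return {
        query: {loc: n / sum(locs.values()) for loc, n in locs.items()}
        for query, locs in counts.items()
    }


# Invented example log, for illustration only.
events = [
    ("San Francisco", "Mission"),
    ("San Francisco", "Mission"),
    ("San Francisco", "SoMa"),
    ("San Francisco", "Sunset"),
]

probs = booking_probabilities(events)
```

Here `probs["San Francisco"]["Mission"]` is 0.5, because two of the four logged bookings for that query were in the Mission.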
This solved their centrality problem, and an A/B test showed a positive lift over the previous model.
However, the change introduced two issues. First, it pulled every search toward the places with the most bookings, crowding out the less-explored but exquisite experiences on offer. Second, by tightening search results, it removed the chance for guests to discover a unique experience serendipitously. “The mushroom dome, for example, is a beloved listing for our community, but few people find it by searching for Aptos, CA. Instead, the vast majority of mushroom dome guests would discover it while searching for Santa Cruz. However by tightening up our search results for Santa Cruz to be great listings in Santa Cruz, the mushroom dome vanished.”
Fig: Change in location ranking score for Pacifica before and after normalization. Image Source: nerds.airbnb.com
To solve the first issue they “tried normalizing by the number of listings in the search area”. (Related read: Normalization)
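The article does not spell out the exact normalization, so the following Python sketch shows just one plausible reading: dividing each area's bookings by its number of listings, so that areas with few listings are not penalized simply for being small. The area names and counts are invented.

```python
def bookings_per_listing(bookings, listings):
    """Divide bookings by listing count for each area that has listings."""
    return {
        area: bookings[area] / listings[area]
        for area in bookings
        if listings.get(area, 0) > 0
    }


# Invented counts, for illustration only.
bookings = {"dense_area": 900, "sparse_area": 40}
listings = {"dense_area": 1000, "sparse_area": 30}

raw_share = {area: n / sum(bookings.values()) for area, n in bookings.items()}
normalized = bookings_per_listing(bookings, listings)
```

On raw booking share the dense area dominates, but per listing the sparse area scores higher, which is the kind of shift a normalization like this is meant to produce.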
Fig: Behavior of cities for a query for Santa Cruz. Image Source: nerds.airbnb.com
To solve the second issue, they “decided to layer in another conditional probability encoding the relationship between the city people booked in and the cities they searched to get there.”
“While all of the cities in the graph (above) have a low booking likelihood relative to Santa Cruz itself, they are also mostly small markets and we can give them some credit for depending on Santa Cruz for searches for their bookings. At the same time places like San Jose and Monterey have no clear connection to Santa Cruz, so we can consider them as completely separate markets in search. It was important that improvements to the model do not lead to regressions in other parts of the world. In this case, little changed for our bigger markets like San Francisco. But this additional signal brings back the mushroom dome and other remote but iconic properties, facilitating the unique experiences our community is looking for.”
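The second signal reverses the direction of the conditioning: given the city where a booking happened, which city did the guest search for? A minimal Python sketch, again using an invented log of `(searched_city, booked_city)` pairs and hypothetical names, could look like this.

```python
from collections import Counter, defaultdict


def search_share_given_booking(events):
    """Estimate P(searched city | booked city) from (searched, booked) pairs."""
    counts = defaultdict(Counter)
    for searched, booked in events:
        counts[booked][searched] += 1
    return {
        booked: {s: n / sum(c.values()) for s, n in c.items()}
        for booked, c in counts.items()
    }


# Invented log: most Aptos bookings start from a Santa Cruz search,
# while San Jose bookings come from San Jose searches.
events = (
    [("Santa Cruz", "Aptos")] * 3
    + [("Aptos", "Aptos")]
    + [("San Jose", "San Jose")] * 4
)

shares = search_share_given_booking(events)
aptos_credit = shares["Aptos"].get("Santa Cruz", 0.0)
san_jose_credit = shares["San Jose"].get("Santa Cruz", 0.0)
```

In this toy data Aptos depends heavily on Santa Cruz searches and would earn some credit in Santa Cruz results, while San Jose gets none and is treated as a separate market. How Airbnb actually combines this signal with the other scores is not described in the article.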
By analyzing user behavior data with the help of statistical models and data visualization, Airbnb created a search algorithm that was more location-relevant for its users. The modified algorithm allowed their community to dynamically inform future guests where they will have great experiences. It also made it possible for Airbnb to apply the same model uniformly to all places around the world where their hosts are offering places to stay.