This is the sixth and concluding post in our Data Visualization Spotlight series, where we showcase how different organizations use data visualization and analytics to solve their day-to-day problems.

Founded in August 2008, Airbnb is an online community marketplace for people to list, discover, and book unique accommodations around the world. With over 500,000 listings in more than 34,000 cities and 192 countries, Airbnb connects people to unique travel experiences.

To create those memorable experiences for its guests, Airbnb has to continuously come up with creative ways to help people find what they are looking for, sometimes in places they know very little about. The key to this is their search algorithm, a system that combines dozens of signals to surface the listings guests want.

Perfecting the search algorithm 

In an article published on the Airbnb blog, Maxim Charkov, Riley Newman and Jan Overgoor describe how they improved their search algorithm. Initially, when there was not enough data to understand what guests would want, “they returned what they considered to be the highest quality set of listings within a certain radius from the center of wherever someone searched (as determined by Google).”

Fig: SF heatmap of listings returned without location relevance model. Image Source: nerds.airbnb.com

However, they soon realized that this model would not suffice in the long run. The listings that came up for a specific search query were spread randomly across the town, sometimes even outside it. “This is a problem because the location of a listing is as significant to the experience of a trip as the quality of the listing itself. However, while the quality of a listing is fairly easy to measure, the relevance of the location is dependent upon the user’s query.”

Fig: Exponential distance demotion curve.

To improve on this, they “introduced an exponential demotion function based upon the distance between the center of the search and the listing location, which they applied on top of the listing’s quality score.” The logic is that listings closer to the center of the search area are more relevant to the query.

Fig: SF heatmap with distance demotion. Image Source: nerds.airbnb.com

This was a step in the right direction because it removed the issue of random locations, but the model overemphasized centrality, returning listings predominantly in the city center as opposed to other neighborhoods where people might prefer to stay.

Fig: Sigmoid distance demotion curve.

To improve the algorithm further, they “tried shifting from an exponential to a sigmoid demotion curve.
This had the benefit of an inflection point, which we could use to tune the demotion function in a more flexible manner.”

Fig: Listing Density from City Center. Image Source: nerds.airbnb.com

However, this modification was far from perfect too. Every city required individual tweaking to accommodate its size and layout, and the city center still benefited from distance demotion. It quickly became clear that predetermining and hardcoding the perfect logic was too tricky when thinking about every city in the world all at once.

Fig: Choropleth of probability of booking given a general query for San Francisco. Image Source: nerds.airbnb.com

To solve this riddle, they looked to their community data. “Using a rich dataset comprised of guest and host interactions, we built a model that estimated a conditional probability of booking in a location, given where the person searched. A search for San Francisco would thus skew towards neighborhoods where people who also search for San Francisco typically wind up booking.” This solved their centrality problem, and an A/B test showed a positive lift over the previous model.

However, two issues cropped up with this change. First, the model pulled every search toward where they had the most bookings, thereby excluding the unexplored but exquisite experiences they had on offer. Second, by tightening their search results, they removed any possibility of guests serendipitously discovering a unique experience. “The mushroom dome, for example, is a beloved listing for our community, but few people find it by searching for Aptos, CA. Instead, the vast majority of mushroom dome guests would discover it while searching for Santa Cruz.
However by tightening up our search results for Santa Cruz to be great listings in Santa Cruz, the mushroom dome vanished.”

Fig: Change in location ranking score for Pacifica before and after normalization. Image Source: nerds.airbnb.com

To solve the first issue, they “tried normalizing by the number of listings in the search area”. (Related read: Normalization)

Fig: Behavior of cities for a query for Santa Cruz. Image Source: nerds.airbnb.com

To solve the second issue, they “decided to layer in another conditional probability encoding the relationship between the city people booked in and the cities they searched to get there.”

“While all of the cities in the graph (above) have a low booking likelihood relative to Santa Cruz itself, they are also mostly small markets and we can give them some credit for depending on Santa Cruz for searches for their bookings. At the same time places like San Jose and Monterey have no clear connection to Santa Cruz, so we can consider them as completely separate markets in search. It was important that improvements to the model do not lead to regressions in other parts of the world. In this case, little changed for our bigger markets like San Francisco. But this additional signal brings back the mushroom dome and other remote but iconic properties, facilitating the unique experiences our community is looking for.”
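
To make the difference between the two demotion curves described earlier in this section concrete, here is a minimal Python sketch. It is not Airbnb's implementation: the function names, the kilometre units, and the parameters scale_km, inflection_km and steepness are all assumptions chosen for illustration. It only shows the general idea of multiplying a listing's quality score by a distance-based factor.

```python
import math
from typing import Callable


def exponential_demotion(distance_km: float, scale_km: float = 5.0) -> float:
    """Multiplier that decays with distance, fastest near the search center.

    scale_km is a hypothetical tuning knob: the distance at which the
    multiplier has dropped to 1/e (about 0.37).
    """
    return math.exp(-distance_km / scale_km)


def sigmoid_demotion(distance_km: float, inflection_km: float = 8.0,
                     steepness: float = 1.0) -> float:
    """Multiplier that stays near 1 before the inflection point and falls
    off after it. Both parameters are hypothetical."""
    z = steepness * (distance_km - inflection_km)
    if z > 700.0:  # math.exp would overflow; the multiplier is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def demoted_score(quality: float, distance_km: float,
                  demotion: Callable[[float], float]) -> float:
    """Apply a distance demotion on top of a listing's quality score."""
    return quality * demotion(distance_km)


if __name__ == "__main__":
    # Compare both curves for a listing with quality 1.0 at several distances.
    for d in (0.0, 2.0, 5.0, 8.0, 15.0):
        exp_score = demoted_score(1.0, d, exponential_demotion)
        sig_score = demoted_score(1.0, d, sigmoid_demotion)
        print(f"{d:5.1f} km  exponential={exp_score:.3f}  sigmoid={sig_score:.3f}")
```

With these made-up parameters, a listing 2 km from the center keeps almost all of its score under the sigmoid curve but loses roughly a third of it under the exponential one. Moving inflection_km shifts where the drop-off begins, which is the kind of per-city tuning the article says became hard to maintain.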
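
The two conditional probabilities mentioned in this section can also be illustrated with a small sketch. The booking log below is a made-up toy dataset, not Airbnb data, and the function names are hypothetical. The code simply estimates both probabilities by counting, which is one straightforward way to compute them and not necessarily how Airbnb's model works.

```python
from typing import List, Tuple

# Hypothetical toy log of (searched_city, booked_city) pairs, not Airbnb data.
BOOKINGS: List[Tuple[str, str]] = (
    [("Santa Cruz", "Santa Cruz")] * 16
    + [("Santa Cruz", "Aptos")] * 3
    + [("Santa Cruz", "San Jose")] * 1
    + [("Aptos", "Aptos")] * 1
    + [("San Jose", "San Jose")] * 19
)


def p_booked_given_searched(pairs: List[Tuple[str, str]],
                            searched: str, booked: str) -> float:
    """Estimate P(booking lands in `booked` | guest searched for `searched`)."""
    total = sum(1 for s, _ in pairs if s == searched)
    hits = sum(1 for s, b in pairs if s == searched and b == booked)
    return hits / total if total else 0.0


def p_searched_given_booked(pairs: List[Tuple[str, str]],
                            searched: str, booked: str) -> float:
    """Estimate P(guest searched for `searched` | booking landed in `booked`)."""
    total = sum(1 for _, b in pairs if b == booked)
    hits = sum(1 for s, b in pairs if s == searched and b == booked)
    return hits / total if total else 0.0


if __name__ == "__main__":
    # For each candidate city, show both probabilities for a Santa Cruz search.
    for city in ("Santa Cruz", "Aptos", "San Jose"):
        forward = p_booked_given_searched(BOOKINGS, "Santa Cruz", city)
        reverse = p_searched_given_booked(BOOKINGS, "Santa Cruz", city)
        print(f"{city:10s}  P(book|search)={forward:.2f}  P(search|book)={reverse:.2f}")
```

In this toy data, Aptos receives only a small share of the bookings that start from a Santa Cruz search, yet most Aptos bookings begin with a Santa Cruz search. San Jose scores low in both directions. That asymmetry is the kind of signal the quoted passage describes: small markets that depend on a larger city's searches get some credit, while unrelated markets stay separate.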

Final thoughts

By analyzing user behavior data with the help of statistical models and data visualization, Airbnb created a search algorithm that was more location-relevant for its users. The modified algorithm allowed their community to dynamically inform future guests where they will have great experiences. It also made it possible for Airbnb to apply the same model uniformly to all places around the world where their hosts offer places to stay.


5 responses on “How Airbnb used conditional probability models and data visualization to make its search algorithm more location relevant?”

  1. That is really attention-grabbing, You’re an overly skilled blogger.
    I’ve joined your rss feed and sit up for seeking more of your
    fantastic post. Also, I have shared your site in my social networks

  2. This is a perfect example of using a lot of data, math, payroll, and technology to solve a simple problem that common sense could have solved in five minutes. Anyone from San Francisco can tell you where in the city people want to stay. That’s “rich data.” Then you A/B it, done. This is really nonsense, to go through all this and come up with a dark blue shaded Mission District. It just goes to show that if you don’t know your own business, it is an expensive, inefficient uphill battle to make money, and if you are making money, you overestimate your own cleverness.

  3. Hi, there are, airbnb or somewhere, statistics about airbnb or similar sites that can gives me an idea about if is for me good to buy for rent trought that system…?!!? then, in simple, the price medium and days year rented by location sorted, as i can mix with costs of house and value if good or not???

    thanks for your help!