April 19, 2024
Hongwei Harvey Li
The Airbnb Tech Blog

How Airbnb leverages ML/NLP to extract useful information about listings from unstructured text data to power personalized experiences for guests.

By: Hongwei Li and Peng Wang

At Airbnb, it’s important for us to collect structured data about listings and better understand that data, so we can help Hosts provide great experiences for guests. For example, guests who work remotely need to know if a listing has a suitable workspace and reliable internet, while guests with children might need items like highchairs and cribs. However, not all listings clearly display these attributes, causing a mismatch between what Hosts’ listings have and what guests are looking for.

This is just one of many examples of how we can use the unstructured data generated on our platform, including text data from various text-based guest interactions with the platform that has undergone anonymization steps, to extract useful structured data. Instead of relying on Hosts to manually input all the possible listing attributes, which can be tedious given the vast number of attributes guests care about and inquire about, we developed a machine learning system called Listing Attribute Extraction Platform (LAEP) for extracting structured data at scale. Note that the project was originally named LATEX (Listing ATtribute EXtraction) and is cited under that name in our previous tech blog. We’ve since renamed the project to LAEP.

LAEP automatically extracts structured information, such as listing attributes, directly from the unstructured text data mentioned above. The attributes collected by LAEP are then integrated into various applications, building Airbnb’s Listing Data. LAEP powers downstream tools like the Attribute Prioritization System (APS) and the listing attribute collection system (Eve).

LAEP doesn’t just extract listing attributes; it can also detect different types of entities, such as activities, hospitalities, and points of interest (POI) like famous landmarks. This opens up possibilities for supporting a wide range of product applications. For example, hospitality data can help guests get personalized services during their stay, while activity data can help identify and create new categories that guests love.

Figure 1. An illustration of the process from LAEP to downstream applications such as the listing attribute collection system (Eve) and the Attribute Prioritization System (APS), which then feed into the Structured Data Catalog.

Prior to LAEP, Airbnb had several ways to collect structured information for listings, including the Listing Editor page for Hosts, the Supplementary Review Flow (SRF) for guests, and partnerships with third-party vendors. However, these approaches faced several challenges and limitations. For instance, Airbnb minimized the impact of SRF questions in the standard review flow to improve the guest review experience, resulting in reduced data intake from the guest side. Consequently, there was a growing need to extract listing information from unstructured text data, and LAEP was developed to address these issues by automating the data collection process.

The LAEP technology gathers and analyzes anonymous, unstructured text data, enabling many potential applications that can enhance the Airbnb experience for both Hosts and guests.

There are three main components in LAEP:

  1. Named Entity Recognition (NER): This component identifies and classifies specific phrases or entities in free text into predefined categories like amenities, places of interest, and facilities. For example, the phrase “swimming pool” can be detected in text from various sources as an entity with the type “Amenity”.
  2. Entity Mapping (EM): Once an entity is detected, EM maps it to the standard listing attributes stored in Airbnb’s attribute database (Taxonomy). This allows LAEP to create a comprehensive catalog of Airbnb listings by associating detected entities with their corresponding attributes.
  3. Entity Scoring (ES): ES determines the presence of a detected phrase within a listing. It infers whether the attribute mentioned actually exists in the associated listing and provides a confidence level.

Below is an illustration of the components within LAEP:

Figure 2. The scope of LAEP includes three main components: Named Entity Recognition, Entity Mapping and Entity Scoring.

There are many off-the-shelf pretrained NER models that can extract general entity categories, but none of them fully supports Airbnb’s use cases. Therefore, we built our own NER models to detect and extract, from free text, predefined entities that are important to Airbnb’s business.

The NER model defines five types of entities (Amenity, Facility, Hospitality, Location features, and Structural details) that are important to Airbnb. We sampled and labeled 30K example texts from six channels, then trained the NER model. For current product use cases, we apply a language detection module and keep only English text. In the future we may build a multilingual Transformer-based NER model to handle non-English content. Text is then split into tokens. The NER model locates entity spans and classifies entity labels using a convolutional neural network (CNN) framework. The output is a list of detected named entities, in the format of tuples <entity label, start index, end index>. Combining all components, the NER pipeline is shown in Figure 3.
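To make the output format concrete, here is a minimal sketch of the <entity label, start index, end index> tuples. It does not reproduce the CNN model: a small hypothetical phrase list (`TOY_LEXICON`) stands in for the trained model's predictions, and the indices are assumed to be character offsets with an exclusive end, a convention the post does not specify.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    label: str  # one of the five entity types named in the post
    start: int  # character offset where the span begins (assumed convention)
    end: int    # character offset just past the span (assumed exclusive)


# Hypothetical phrase list standing in for the trained model's predictions.
TOY_LEXICON = {
    "swimming pool": "Amenity",
    "lockbox": "Amenity",
}


def toy_ner(text: str) -> List[Entity]:
    """Return (label, start, end) tuples for lexicon phrases found in text."""
    found = []
    for phrase, label in TOY_LEXICON.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append(Entity(label, match.start(), match.end()))
    # Sort by position so the output reads left to right.
    return sorted(found, key=lambda entity: entity.start)


text = "The swimming pool is heated and keys are in the lockbox."
entities = toy_ner(text)
for entity in entities:
    print(entity.label, entity.start, entity.end, text[entity.start:entity.end])
```

Slicing the original text with `start` and `end` recovers the detected phrase, which is what lets downstream steps work with both the phrase and its surrounding context.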

Figure 3. Overview of the NER pipeline and its functional components.

Figure 4 shows an example output from the pipeline with detected entities highlighted.

Figure 4. Example output from the NER pipeline. The detected entities are highlighted and each entity category is marked with a different color.

The labeled dataset was randomly split into training and testing datasets with a 9:1 ratio. After training completes, we evaluate the model’s performance on the testing dataset across all text channels. The evaluation criterion is Strict Match, which requires correctly identifying both the boundary and the category of an entity. The model’s overall performance and each category’s performance are shown in Figure 5.
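As a rough illustration of Strict Match scoring, the sketch below counts a prediction as correct only when its label, start index, and end index all equal a gold annotation. The function name and the sample spans are hypothetical and are not taken from Airbnb's evaluation code.

```python
from typing import Dict, Set, Tuple

Span = Tuple[str, int, int]  # (entity label, start index, end index)


def strict_match_scores(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Precision, recall and F1 where only exact (label, start, end) matches count."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {("Amenity", 4, 17), ("Amenity", 48, 55)}
predicted = {
    ("Amenity", 4, 17),    # exact match: counts as a true positive
    ("Facility", 48, 55),  # right boundary, wrong category: no credit
    ("Amenity", 60, 64),   # spurious span: no credit
}
scores = strict_match_scores(gold, predicted)
print(scores)
```

Here one of three predictions is correct and one of two gold entities is recovered, so precision is 1/3, recall is 1/2, and F1 is 0.4.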

Figure 5. Example performance metrics (Precision, Recall, F1 scores) for the NER model.

There are many different ways for people to talk about the same thing. For instance, we found over twelve variations of the attribute “lockbox,” such as lock box, lock-box, box for the key, and keybox. Typos like “ket box” are also common because input often comes from error-prone mobile devices. Therefore, we need to map the different variations of named entities to the standard entity name defined by the standard taxonomy for downstream applications.

With hundreds of listing attributes but millions of detected phrases in a year, many phrases map to the same attribute (like “lockbox”) while others have no mapping at all. To handle this, we introduce confidence levels for mappings, allowing us to establish rules for cases where a mapping cannot be made. Each mapping is assigned a confidence value between 0 and 1, and if no mapping exceeds the confidence threshold, the phrase is marked as “No Mapping.”
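The thresholding rule described above might look like the following sketch. The function name, the candidate scores, and the threshold value of 0.6 are all assumptions made for illustration; the post does not state the threshold LAEP uses.

```python
from typing import Dict, Tuple

NO_MAPPING = "No Mapping"


def resolve_mapping(
    candidate_scores: Dict[str, float],
    threshold: float = 0.6,  # hypothetical value; the real threshold is not given
) -> Tuple[str, float]:
    """Pick the highest-confidence attribute, or "No Mapping" if none exceeds the threshold."""
    if not candidate_scores:
        return NO_MAPPING, 0.0
    attribute, confidence = max(candidate_scores.items(), key=lambda item: item[1])
    if confidence <= threshold:
        return NO_MAPPING, confidence
    return attribute, confidence


print(resolve_mapping({"lockbox": 0.91, "doorbell": 0.20}))
print(resolve_mapping({"lockbox": 0.30, "doorbell": 0.25}))
```

Returning the best score even in the "No Mapping" case keeps low-confidence phrases available for later inspection, for example when looking for attributes missing from the taxonomy.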

Labeling these mappings becomes challenging when dealing with numerous unique phrases and potential attributes. Typically, labeling involves comparing the semantic similarity between a phrase and each of the 800+ attribute names. To save significant labeling effort, we started with unsupervised learning methods rather than supervised ones.

Figure 6. Entity Mapping: map detected NER phrases from free text to predefined listing attributes.

In LAEP, the entity mapping technique involves the following steps:

  1. Preprocessing: Both the listing attributes and the detected phrases undergo preprocessing such as lowercasing and lemmatization to eliminate unnecessary word variations.
  2. Mapping to Word Embeddings: All standard listing attributes are mapped to the word-embedding space using a word2vec model fine-tuned with Airbnb’s text data.
  3. Finding Closest Attribute: For a preprocessed detected phrase, the closest listing attribute is determined based on cosine similarity in the word-embedding space. The similarity score serves as the confidence score for the mapping.

As in the example in the figure above, the phrase “Lock-box” is mapped to the embedding space of listing attributes and compared with each attribute. The closest match is the attribute “lockbox,” which is identified as the top mapping.
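The three steps can be sketched end to end as follows. This is a simplified stand-in, not the production approach: character-bigram counts replace the fine-tuned word2vec embeddings (so similarity here is based on spelling rather than meaning), lemmatization is omitted, and the attribute list is hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def preprocess(phrase: str) -> str:
    """Lowercase and drop non-alphanumeric characters; lemmatization is omitted."""
    return re.sub(r"[^a-z0-9]", "", phrase.lower())


def embed(phrase: str) -> Counter:
    """Toy embedding: character-bigram counts (a stand-in for word2vec vectors)."""
    text = preprocess(phrase)
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_attribute(phrase: str, attributes: List[str]) -> Tuple[str, float]:
    """Return the attribute most similar to the phrase, with its similarity score."""
    phrase_vector = embed(phrase)
    scored = [(attr, cosine(phrase_vector, embed(attr))) for attr in attributes]
    return max(scored, key=lambda pair: pair[1])


ATTRIBUTES = ["lockbox", "crib", "highchair", "pool"]  # hypothetical taxonomy subset
for phrase in ["Lock-box", "ket box"]:
    print(phrase, "->", closest_attribute(phrase, ATTRIBUTES))
```

Both “Lock-box” and the typo “ket box” land on “lockbox” in this toy setup, the former with a similarity close to 1 after preprocessing and the latter with a much lower score. A real embedding model would also be needed to handle paraphrases such as “box for the key”, which share little spelling with the attribute name.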

After mapping a detected phrase to a standard listing attribute, it’s important to infer metadata about the attribute, such as its existence, usability, and local sentiment. Among these, attribute presence is crucial for the guest experience, especially for amenities like “crib” or “highchair” for guests with infants.

The presence model in LAEP determines whether the mapped attribute exists in the listing by performing local text classification. It provides a discrete output (YES, Unknown, NO) indicating attribute presence, accompanied by a confidence score reflecting the level of confidence in the inference.

The label classes are YES, Unknown, and NO, where YES means the attribute is present, NO means it’s not present, and Unknown accounts for cases where presence is difficult to determine from the text alone (e.g., amenity not present).
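One plausible way to turn a classifier's raw scores into the discrete label plus confidence described above is a softmax followed by an argmax. The sketch below assumes that the confidence score is the probability of the winning class, which the post does not state explicitly, and the logits are made-up numbers.

```python
import math
from typing import List, Tuple

LABELS = ["YES", "Unknown", "NO"]


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def presence_decision(logits: List[float]) -> Tuple[str, float]:
    """Return the most probable presence label and its probability."""
    probabilities = softmax(logits)
    best = max(range(len(LABELS)), key=lambda index: probabilities[index])
    return LABELS[best], probabilities[best]


label, confidence = presence_decision([2.0, 0.5, -1.0])  # hypothetical logits
print(label, round(confidence, 3))
```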

Figure 7. Illustration of entity scoring for the meta information about certain entities of interest.

To build this text classification model, the ES component employs a fine-tuned BERT model. It analyzes source data, including detected phrases and their local context, to infer attribute existence. The output can then be used in the APS and Eve systems to provide recommendations to Hosts, merchandise existing home attributes, or clarify popular listing facilities.

Figure 8. Architecture of the Presence Score Model. (Revised based on courtesy from Zahera and Sherif et al.)

The model architecture (Figure 8) uses a pre-trained BERT model with text data from six different sources. The input text is truncated to a maximum length of 512 tokens. Empirical studies suggest that using 65 words around the detected phrase (32 before and 32 after) achieves the best result. The embeddings from the [CLS] token are passed through a fully connected layer, a dropout layer, and a ReLU linear projection layer to generate a probabilistic vector over the label classes.
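The local-context window can be sketched in a few lines. The example assumes whitespace-style word tokens and a phrase given by word indices with an exclusive end; the function name and the `radius` parameter are illustrative, not part of LAEP.

```python
from typing import List


def context_window(
    words: List[str], phrase_start: int, phrase_end: int, radius: int = 32
) -> List[str]:
    """Return the phrase plus up to `radius` words on each side, clipped at text edges."""
    left = max(0, phrase_start - radius)
    right = min(len(words), phrase_end + radius)
    return words[left:right]


words = [f"w{i}" for i in range(100)]  # placeholder text of 100 words
window = context_window(words, phrase_start=50, phrase_end=51)
print(len(window), window[0], window[-1])
```

For a one-word phrase in the middle of a long text, this yields 32 + 1 + 32 = 65 words, matching the window size quoted above. Near the start or end of a text the window is simply shorter.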

In this post, we introduced LAEP, an end-to-end structured information extraction system within Airbnb. It detects phrases of interest from various text data sources, maps them into standard listing attribute taxonomies, and then infers the meta information of the attributes from the contextual information in the texts, while applying privacy-by-design controls with the objective of not processing personal information. LAEP is used in downstream applications like APS, and can be leveraged to help our teams explore new categories of listings and discover new listing attributes that matter to guests. It helps us understand Airbnb’s listings better at scale and can power future applications to continue improving the experience of both our Hosts and guests.

If this type of work interests you, check out some of our related positions at Careers at Airbnb!

We would like to thank all the people who supported this project: Qianru Ma, Joy Jing, Xiao Li, Brennan Polley, Paolo Massimi, Dean Chen, Guillaume Man, Lianghao Li, Mia Zhao, Joy Zhang, Usman Abbasi, Pavan Tapadia, Jing Xia, Maggie Jarley and more. Special thanks to Ben Mendeler, Shaowei Su, Alfredo Luque and Tianxiang Chen from the ML-infra team for their generous help and support.

All product names, logos, and brands are property of their respective owners. All company, product and service names used on this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.