One needn’t be an oracle to forecast Airbnb listing prices: AI models fed customer reviews and rental features will do the trick. That’s the conclusion drawn by a team of graduate students at Stanford, who, in a paper published on arXiv.org (“Airbnb Price Prediction Using Machine Learning and Sentiment Analysis”), investigate systems that leverage machine learning and natural language processing to predict Airbnb rates.
Alongside the paper, they have made their optimized models available on GitHub.
“Pricing a rental property on Airbnb is a challenging task for the owner as it determines the number of customers for the place. On the other hand, customers have to evaluate an offered price with minimal knowledge of an optimal value for the property,” wrote the coauthors. “This paper aims to develop a reliable price prediction model using machine learning, deep learning, and natural language processing techniques to aid both the property owners and the customers with price evaluation given minimal available information about the property.”
To train their price-predicting system, the researchers tapped the public Airbnb data set for New York City, which included 50,221 entries with 96 features in total. They inspected each feature, removing those with frequent and irreparable missing fields, converting booleans to binary values, and dropping duplicate and “uninformative” features such as host picture URLs. (This reduced the number of features to 22.) The team trained on 39,980 samples of the cleaned data, reserving the remaining 9,996 samples for validation and testing.
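The cleaning steps described above can be sketched roughly as follows. This is an illustrative example, not the authors’ actual code: the column names and thresholds here are invented stand-ins for the real New York City listing features.

```python
import pandas as pd

# Toy stand-in for the NYC Airbnb listings table; the column names are
# illustrative, not the actual feature names from the paper's data set.
df = pd.DataFrame({
    "price": [120.0, 85.0, 240.0, 60.0],
    "instant_bookable": [True, False, True, True],             # boolean feature
    "host_picture_url": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],  # uninformative
    "cleaning_fee": [25.0, None, None, None],                  # mostly missing
})

# Convert boolean features to 0/1 binaries.
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)

# Drop uninformative features, then features with frequent missing fields
# (here, any column missing more than half its values).
df = df.drop(columns=["host_picture_url"])
df = df.loc[:, df.isna().mean() < 0.5]

print(sorted(df.columns))  # -> ['instant_bookable', 'price']
```

The 50%-missing threshold is an assumption for the sketch; the paper's criterion ("frequent and irreparable missing fields") is judged per feature.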
Prior to training, the team used the open source TextBlob library to analyze the sentiment of listing reviews, assigning each review a score between -1 (very negative sentiment) and 1 (very positive sentiment). The scores were averaged across all reviews associated with a given listing and included as a new feature in the training data set.
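The per-listing averaging step might look like the sketch below. The `polarity` function here is a self-contained placeholder standing in for TextBlob’s `TextBlob(text).sentiment.polarity`, which likewise returns a score in [-1, 1]; the review texts are invented examples.

```python
from collections import defaultdict

def polarity(text: str) -> float:
    """Placeholder scorer; stands in for TextBlob(text).sentiment.polarity."""
    positive = {"great", "clean", "lovely"}
    negative = {"dirty", "noisy", "bad"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, score / max(len(words), 1)))

# (listing_id, review_text) pairs; scores are averaged into one feature
# per listing, as in the paper.
reviews = [
    (1, "great location and clean rooms"),
    (1, "a bit noisy at night"),
    (2, "dirty bathroom, bad host"),
]

totals, counts = defaultdict(float), defaultdict(int)
for listing_id, text in reviews:
    totals[listing_id] += polarity(text)
    counts[listing_id] += 1

avg_sentiment = {lid: totals[lid] / counts[lid] for lid in totals}
```

With the real library, replacing the placeholder body with `return TextBlob(text).sentiment.polarity` yields the same per-listing averaging pipeline.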
The team tested several price-predicting machine learning techniques, including linear regression, tree-based models, support vector regression (SVR), and neural networks. They report that the best-performing model, SVR, achieved an R² score (a measure of how well the predictions approximate the real data points) of 0.69 on the test set.
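A minimal version of that train-and-score loop, using scikit-learn’s SVR and R² metric on synthetic data, might look like this. The features, hyperparameters, and resulting score are assumptions for illustration, not the paper’s tuned configuration or result.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned listing features: price modeled as a
# noisy function of two features (not the paper's actual features).
X = rng.uniform(0, 1, size=(500, 2))
y = 100 + 80 * X[:, 0] + 40 * X[:, 1] + rng.normal(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters chosen for the sketch; the paper tuned its own.
model = SVR(kernel="rbf", C=100.0, epsilon=1.0)
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
print(f"test R^2: {r2:.2f}")
```

An R² of 1.0 would mean perfect prediction, while 0 means the model does no better than predicting the mean price, which puts the paper’s 0.69 in context.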
“This level of accuracy is a promising outcome given the heterogeneity of the dataset and the involved hidden factors and interactive terms, including the personal characteristics of the owners, which were impossible to consider,” wrote the coauthors, who leave collecting more training examples and experimenting with novel neural network architectures to future work.