Yelp Dataset Challenge 2016
January 13, 2017 in Statistics
Goal: To understand what people are talking about in 2.4 million+ reviews, without actually reading them.

Yelp is primarily a search and review website for businesses, be it restaurants, shopping centers, or even the local barbershop. From a business perspective, Yelp reviews are a wealth of information that can be leveraged for tips on improvement. This treasure trove of data is also useful for marketing and advertising folks: sliced correctly, it gives insights into people's opinions and preferences at a city, state, or nationwide level.

The problem: nobody wants to read thousands of reviews to get "insights" into people's minds.

For this project, we used topic modelling (specifically Latent Dirichlet Allocation, LDA) together with sentiment analysis to understand the contents of the reviews. LDA is a stochastic process that discovers "topics" on its own and assigns reviews to them; for example, it learns that the words "enchilada", "taco", and "burrito" belong together in one topic, a topic we might call "Mexican food". To simplify the process, we used only 1 and 5-star user reviews, on the assumption that 1-star reviews would carry more negative tones while 5-star reviews would be glowing in praise. After the topics are created, they are ranked by importance (their proportion of the entire text). For example, across all 5-star reviews, "Mexican food" comes up as topic #2 in Phoenix, Arizona, and #8 in Pittsburgh, Pennsylvania. Right away, we can see what people tend to favor in different cities. In general, around 60% of all topics in 5-star reviews were specifically about food, and the remaining 40% about the ambiance and service.
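The project's own code is not reproduced here, but a minimal sketch of this topic-modelling step in Python, using the gensim library, could look like the following. The sample reviews, the choice of 20 topics, and the variable names are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal LDA topic-modelling sketch (gensim); the sample reviews and the
# choice of 20 topics are placeholders, not the project's real settings.
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

five_star_reviews = [
    "Best enchiladas in Phoenix, and the tacos were amazing too",
    "Great burritos, friendly staff, quick service",
]  # in practice: every 5-star review for one city

# Tokenize each review and drop common stopwords
texts = [[w for w in simple_preprocess(doc) if w not in STOPWORDS]
         for doc in five_star_reviews]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit LDA; the number of topics is a tuning choice
lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary, passes=10)

# Rank topics by their total proportion across the whole corpus
totals = [0.0] * lda.num_topics
for bow in corpus:
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        totals[topic_id] += prob

for topic_id in sorted(range(lda.num_topics), key=lambda t: -totals[t])[:5]:
    print(topic_id, lda.print_topic(topic_id, topn=8))
```

Summing each topic's share over all reviews and sorting is one simple way to get the "topic #2 in Phoenix vs. #8 in Pittsburgh" style ranking described above.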

When we repeated this process for 1-star reviews, we saw that 60% of the topics were about service, and 40% about food. A little probing revealed why: a restaurant can have "okay" food (maybe a 2 or 3-star level), but bad service prompts customers to give 1-star ratings. This trend held across all four cities we considered. Our first takeaway was that service quality and food quality are weighted differently in 1 and 5-star reviews.

Looking at which words clustered together in the 1-star topics, we could quickly paint a picture of what went wrong. For example, people in Phoenix complained about bars closing early, and users in all cities lamented bad service: a rude hostess, an inattentive waiter. A bar owner in Phoenix wanting to boost profits could try advertising later closing times; restaurant owners could try retraining their staff.

We then narrowed our focus to a single restaurant: Meat and Potatoes in Pittsburgh. Supposedly an institution of the city, it has 4 stars on Yelp. Our goal: how do we improve it to 5 stars? We quickly realized LDA alone is not sufficient, as it does not provide the context or sentiment of the text. So we ran Stanford Sentiment Analysis on all reviews for that restaurant. This showed that in low-rated (1 and 2-star) reviews, not every aspect of the review was negatively slanted. For example, one 2-star review showed that the reviewer enjoyed the entree and his drinks, but found the place crowded and his waiter sloppy. Another 3-star review said the user loved the place, but his steak was so salty he had to send it back.
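The exact sentiment pipeline isn't shown in the write-up, but a rough sketch of per-sentence sentiment scoring, using the Stanford NLP group's stanza library as a stand-in for the Stanford sentiment tool the project used, could look like this; the review text is a made-up example.

```python
# Per-sentence sentiment sketch using stanza (Stanford NLP); a stand-in for
# the Stanford sentiment tool used in the project, not the project's code.
import stanza

stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,sentiment")

review = ("The entree and the drinks were fantastic, "
          "but the place was crowded and our waiter was sloppy.")

for sentence in nlp(review).sentences:
    # sentence.sentiment: 0 = negative, 1 = neutral, 2 = positive
    print(sentence.sentiment, sentence.text)
```

Scoring each sentence separately is what lets a single 2-star review register both the positive remark about the entree and the negative remark about the service.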

Trending words with a negative sentiment were, in fact, hostess, lines, crowded, and salty. Our recommendation to the owner? Don't try expanding the restaurant right away; instead, improve service times and reduce the salt in some dishes. People would then not have to send food back, the overall time spent in the restaurant would drop, and the wait in line would shorten. Retraining the staff to be friendlier would also help.
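One simple way to surface such "trending" negative words, sketched below under the same assumptions as the snippet above, is to count word frequencies only in sentences the sentiment model marks as negative; the review list and stopword set here are placeholders.

```python
# Sketch: surface "trending" negative words by counting word frequencies
# only in sentences scored as negative. Placeholder data, not project code.
from collections import Counter
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,sentiment")

low_star_reviews = [
    "The hostess was rude and the lines were endless.",
    "My steak was far too salty and the room was crowded.",
]

stopwords = {"the", "and", "was", "were", "my", "too", "far", "a", "of"}

negative_words = Counter()
for review in low_star_reviews:
    for sentence in nlp(review).sentences:
        if sentence.sentiment == 0:  # 0 = negative
            negative_words.update(
                word.text.lower() for word in sentence.words
                if word.text.isalpha() and word.text.lower() not in stopwords
            )

print(negative_words.most_common(10))
```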

To get tips like these, an owner would normally have to pay for an exhaustive user survey, which is usually pricey. Our methods show that the same results can be obtained faster and cheaper (technically, free).

Project by Namrata Date, with teammates Beth Ross, Biswash Bhusal, and Raul Rivera.