A Foolish Attempt at Beer Tasting

I. Introduction

The following is an inane attempt to squeeze conclusions out of data collected under dubious circumstances. Specifically, from a cheap beer tasting conducted during August 2015 at Burning Man.

II. Background

Beer is a delicious thing you can drink (Fig 1). Some people enjoy it, some people do not. Some people enjoy drinking it and trying to rate it with numbers. Those people are snobs. Or valuable test subjects who we very much appreciate.

Figure 1 Beer (“Aufseß Bier” by User: Benreis at wikivoyage shared. Licensed under CC BY-SA 3.0 via Wikimedia Commons – source)

III. Objective

Give people beer. Get people to rate the beer and say which beer they think it is. Collect data. Look at data. Drink a beer (optional).

IV. Methodology

We created a “beer tasting book” so that people knew this was a serious endeavor. We included notes about aroma and color, even though the beers pretty much all look and smell the same. If the tasting had involved things other than fizzy yellow beers, this would have been more interesting. The book also gave people a place to take notes as they tasted the beers. Finally, after tasting all the beers (or as many as they felt like doing), we gave each person a data collection card, on which they recorded their rating for each beer and guessed which beer was which. If they were interested, we graded their guesses.

The number of participants was 37, but individual non-responses for each beer type varied between 7 and 11. That is, usable data points per beer type varied between 26 and 30.

A. The Beers

The aim is to test the ratings of cheap beers. Beers you can get in cans. Beers you can bring to festivals. Beers that are available in SoCal. The beers are as follows:

B. The Tasting and Data Collection

1. Bar Setup

This experiment was run at a freestanding bar with space behind/under the bar to hide the beers. We set one can of each on top of the bar so that people knew what beers were being tasted. We then put the beers in order, as listed above. The beers were set out three cans at a time behind the bar so that people could not see them. Four sets of keys were hidden behind the bar as well, for the bartenders to refer to.

2. Randomizing

When a new taster walked up, they were handed a book that contained instructions. This book also contained the place where they could take notes, rate each beer and guess what beer it was.

When each person was ready for their first or a subsequent beer, they handed their empty cup and booklet to a bartender. The bartender would then find a beer number the taster had not yet tasted, cross off that number, and pour 2-3 oz of that beer into the cup. The taster would receive their cup and book back and be told which number they were now tasting.

To make sure that no bias was created by the ordering of the beers, an attempt was made to randomize the order in which people tasted the beers. Each bartender was helping roughly six people at a time, so it was possible to give each taster a unique order. Some tasted 1-6 in order, some 6-1 in reverse. Other orders were used too: even numbers in order followed by odd numbers in order, for example. Some simply received a random order, and a few people requested their own.
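For illustration only, here is a minimal R sketch of how such orders might be generated; the helper name and beer count are assumptions, not the procedure actually used at the bar:

```r
# Hypothetical helper for generating a tasting order; six beers assumed,
# to match the 1-6 example above.
n_beers <- 6

make_order <- function(scheme, n = n_beers) {
  switch(scheme,
         forward  = seq_len(n),                    # 1, 2, ..., n
         reverse  = rev(seq_len(n)),               # n, ..., 2, 1
         even_odd = c(seq(2, n, 2), seq(1, n, 2)), # evens in order, then odds
         random   = sample(n))                     # a fresh random permutation
}

make_order("even_odd")  # 2 4 6 1 3 5
make_order("random")
```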

Three cans of each beer were pre-stocked behind the bar from a cooler. When a beer had its third can opened, two additional cans were added behind the last one. In this manner, it was possible to serve nearly all of the beer cool, but not cold, and at a roughly uniform temperature across the beers.

C. The Analysis

The primary data of interest are a) the blind rating of each beer and b) the rating people thought they were giving a beer: for example, the mean rating of everything people guessed was Miller Lite. The mean and median of each of these ratings should provide some insight.
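As a sketch of how these summaries could be computed in R (assuming the cards were transcribed into a data frame named tastings with columns taster, beer, guess, and rating; all of these names are hypothetical):

```r
# Hypothetical data frame: one row per (taster, beer) with the beer actually
# poured (beer), the beer the taster guessed (guess), and the 0-5 rating.

# Blind ratings: summarize by the beer actually poured.
blind <- aggregate(rating ~ beer, data = tastings,
                   FUN = function(x) c(mean = mean(x), median = median(x)))

# Perceived ratings: summarize by the beer the taster guessed it was.
perceived <- aggregate(rating ~ guess, data = tastings,
                       FUN = function(x) c(mean = mean(x), median = median(x)))
```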

From this data, one can also examine how well participants performed at guessing, i.e. the percentage of correct guesses for each beer.
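Continuing the same hypothetical data frame, the percentage of correct guesses is simply the share of rows where the guess matches the beer poured:

```r
# Percentage of participants who correctly identified each beer.
tastings$correct <- tastings$guess == tastings$beer
guess_pct <- aggregate(correct ~ beer, data = tastings,
                       FUN = function(x) 100 * mean(x))
```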

The next question is “Do the differences between the (blind) ratings of each beer matter?”. This will be answered in two ways. The first is through a matrix of paired t-tests that establish a 95% confidence interval (α = 0.05) for the difference in mean rating between each pair of beers. In standard statistical form:

H0 : μ1 – μ2 = 0
Ha : μ1 – μ2 ≠ 0

where

H0 = the null hypothesis
Ha = the alternative hypothesis
μ1, μ2 = the mean ratings of the two beers being compared

The downside of the t-test is that one must assume normality in the target population. As an alternative, a similar matrix of paired Wilcoxon tests will be performed. This is a non-parametric test that has the benefit of not requiring population normality. Specifically, it tests for a shift in location between two paired samples, and can therefore be used to give an estimate of the difference.
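A sketch of one cell of each matrix, assuming two vectors of ratings aligned by taster (the vector names are placeholders):

```r
# Hypothetical rating vectors for two beers, aligned by taster, with NA
# where a taster skipped one of them.
ok <- complete.cases(pbr, bud_light)

# Paired t-test: 95% confidence interval for the mean difference.
t.test(pbr[ok], bud_light[ok], paired = TRUE, conf.level = 0.95)

# Paired Wilcoxon test: confidence interval for the location shift.
wilcox.test(pbr[ok], bud_light[ok], paired = TRUE,
            conf.int = TRUE, conf.level = 0.95)
```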

Normality will be checked using the Shapiro-Wilk test as well as examination of quantile-quantile plots. Histograms may also prove useful.
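In R, with the same assumed tastings data frame, those checks might look like:

```r
# Shapiro-Wilk test applied to each beer's blind ratings.
by(tastings$rating, tastings$beer, shapiro.test)

# Quantile-quantile plot and histogram for one beer (the subset is illustrative).
x <- tastings$rating[tastings$beer == "PBR"]
qqnorm(x); qqline(x)
hist(x, breaks = seq(0, 5, by = 0.5), main = "PBR ratings", xlab = "Rating")
```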

Statistical analyses were performed using R software and associated packages (www.r-project.org).

V. Results and Discussion

The relationships between actual rating and perceived rating appear to have generally held up, with little discrepancy between the two (Figure 2). The superlatives were the same, with PBR nailing the best category at a rating of approximately 3.0, and Bud Light hitting the bottom at around 2.15 (Table 1). Of note were the discrepancies for Tecate Light and Coors Light: people rated Coors Light higher than what they thought was Coors Light, and rated Tecate Light lower than what they thought was Tecate Light. In other words, Coors Light was better than people thought it would be, and Tecate Light was worse than people thought it would be.

Figure 2 Aggregate Rating Results: Ratings and What People Thought They Rated

Table 1 Aggregate Rating Results

Regarding guesses, participants generally did not do a good job of identifying the beer they were drinking (Figure 3). Seven out of eight were under 20% correct, and half 15% or under. Steel Reserve, with its higher alcohol content, was understandably the easiest to identify among the beers.

Figure 3 Percentage of Participants Who Correctly Identified Each Beer

The ratings for each beer do not appear to be normally distributed (Figure 4). This is confirmed by a Shapiro-Wilk normality test (α = 0.05) applied to each set of ratings, as well as visual inspection of quantile-quantile plots.

Figure 4 Histograms

One could theorize that more participants would establish clearer distribution shapes and enable an uncompromised use of a t-test. However, the nature of the rating system itself may prevent this. The histograms of the actual data are “blocky”. This is because the majority of participants (90%) chose to use integers to rate their beer, though any number between 0 and 5 was allowed. One can theorize that, in an evaluation such as this, the perceived difference between a rating of (for example) 3.9 and 4.1 is negligible, so it is no surprise that the majority did not use that level of specificity.
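That share of whole-number ratings is a one-line check under the same assumed data frame:

```r
# Fraction of ratings given as whole numbers (reported above as roughly 90%).
mean(tastings$rating %% 1 == 0, na.rm = TRUE)
```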

The next question is “Do the differences between ratings matter?”. The answer is a resounding “maybe”.

For the t-test (Table 2, upper right), almost all confidence intervals include zero, so it cannot be concluded that the mean difference between the ratings is non-zero. That is, we cannot say that the mean differences matter. There are two differences we can state with confidence: Tecate and PBR are both rated better than Bud Light (orange highlights). We could probably say with 90% confidence that PBR rates higher than Tecate Light, and possibly that Steel Reserve rates higher than Bud Light (blue highlights).

The Wilcoxon test yields slightly more conclusions (Table 2, lower left). One can say with 95% confidence that Tecate, Rolling Rock, and PBR rate higher than Bud Light (orange highlights). A number of “almosts” (blue highlights) show up as well, most notably that Coors Light, Tecate, Rolling Rock, and PBR rate higher than Tecate Light. These “almosts” have interval bounds very nearly at zero, so the call between significant and not significant could come down to numerical error. That is, there is very little practical difference between, say, 95% confident and 94% confident.

Table 2 Matrix of 95% Confidence Intervals of Difference.
Upper Right: Paired t-test, Top Minus Side
Lower Left: Paired Wilcoxon Rank Sum, Side Minus Top

VI. Conclusion

Conclusions are as follows: