Collective Intelligence 2017
Debunking Three Myths About Crowd-Based Forecasting
EMILE SERVAN-SCHREIBER, Hypermind International
Ever since the publication of "The Wisdom of Crowds" (Surowiecki, 2004), the accuracy of crowd-
based forecasting has served as a prime example of the practical value of collective intelligence.
Prediction markets were long believed to be the gold standard for eliciting the highest-quality forecast
from a crowd (Arrow et al., 2008). However, over the last decade, this hypothesis has come under
increasingly heavy fire from two new approaches claiming to push the boundaries of “the art and
science of prediction”. First arose the big-data statistical models, popularized by Nate Silver’s
FiveThirtyEight and the New York Times' Upshot. Then came a new generation of "prediction polls"
(Atanasov et al., 2016) and "superforecasters" (Tetlock & Gardner, 2015), fresh from winning a multi-
year geopolitical crowd-forecasting tournament sponsored by the U.S. government’s Intelligence
Advanced Research Projects Activity (IARPA). Now confusion reigns supreme. Who really is better at
forecasting, crowds or models? And which crowd-based methods are the most reliable, polls or
markets? And which types of markets, based on real money or play money?
Using comparative data from a variety of leading prediction markets, prediction polls, and statistical
models, this study assesses their relative performance in forecasting recent major political events. The
results help debunk three popular myths about crowd-based forecasting.
COMPARISON POINTS
The forecasters included in this study are the most established and reputable of each type:
Forecaster                Type    Method   Money   Countries
Hypermind                 Crowd   Market   Play    France, US
Almanis                   Crowd   Market   Play    UK, US
Pivit                     Crowd   Market   Play    US
Iowa Electronic Markets   Crowd   Market   Real    US
PredictIt                 Crowd   Market   Real    US
Betfair                   Crowd   Market   Real    UK
Good Judgment             Crowd   Poll     Play*   UK, US
FiveThirtyEight           Data    Stats    -       -
New York Times            Data    Stats    -       -
* Although there is no betting in the Good Judgment “prediction poll”, participants do compete to make the most accurate
probability assessments. So in terms of competition and risk, the situation is similar to a play-money market.
Besides sports, only a few events are covered by enough forecasters to offer useful comparison points.
Major electoral events in the U.S. and U.K. offer common ground. Luckily, those few events are also
among the most consequential and media-covered political events of the past few years: the midterm
election of 2014 and presidential election of 2016 in the U.S., and the “Brexit” referendum of 2016 in
the U.K. So what the data lack in volume, they make up for in relevance. This analysis focuses
specifically on the outcomes that were hardest to forecast (most suspenseful), over the longest possible
time frame common to enough forecasters. In chronological order, they are:
A. Senate Control 2014: Will the Republican party control the U.S. Senate after the 2014
midterm elections? Outcomes: Yes or No. Time frame: September 3 to November 3, 2014.
B. GOP Nomination 2016: Who will be the Republican party’s presidential nominee in 2016?
Outcomes: Cruz, Rubio, Trump, or Other. Time frame: January 25 to May 3, 2016.
C. Brexit 2016: Will the U.K. vote to leave the European Union (i.e., Brexit) in 2016? Outcomes:
Yes or No. Time frame: April 1 to June 22, 2016.
D. USA President 2016: Will Donald Trump be elected President of the United States in 2016?
Outcomes: Yes or No. Time frame: July 1 to November 7, 2016.
All forecasters produced probabilities for each possible outcome, updated at least daily. Forecasts were
scored using the Brier scoring rule, a strictly proper scoring rule based on squared errors. Smaller Brier scores denote higher accuracy. A perfect prediction (probability 1 for outcomes that occur, probability 0 for outcomes that don't) yields a score of 0, while a perfectly wrong prediction earns a score of 2. To compute a forecaster's performance on a question, its daily probabilities were scored
throughout the question’s time frame, then averaged into a mean daily Brier score. The results are
plotted in Figure 1.
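For concreteness, here is a minimal Python sketch of the scoring procedure just described; the function names and the toy numbers are ours, for illustration only:

```python
import numpy as np

def daily_brier(forecast, outcome_index):
    # Multi-outcome Brier score for one day's forecast: the sum of squared
    # errors between the probability vector and the 0/1 outcome vector.
    # 0 = perfect prediction, 2 = perfectly wrong.
    f = np.asarray(forecast, dtype=float)
    o = np.zeros_like(f)
    o[outcome_index] = 1.0
    return float(np.sum((f - o) ** 2))

def mean_daily_brier(daily_forecasts, outcome_index):
    # Average the daily scores over a question's whole time frame.
    return float(np.mean([daily_brier(f, outcome_index) for f in daily_forecasts]))

# Toy example: a Yes/No question where "Yes" (index 0) occurred,
# with three days of forecasts.
forecasts = [[0.55, 0.45], [0.60, 0.40], [0.80, 0.20]]
print(mean_daily_brier(forecasts, outcome_index=0))  # ~0.268
```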
Fig. 1. Mean daily Brier scores of the forecasters on each question. Smaller Brier scores indicate better forecasts, and the
forecasters are ranked from best (top) to worst (bottom) for each question. The forecasters are also color-coded into three
classes: play-money crowd-based forecasts, real-money crowd-based forecasts, and data-driven models. See text for an analysis.
MYTH #1: DATA-DRIVEN MODELS ARE MORE RELIABLE THAN CROWD-BASED FORECASTS
The appeal of data-driven models is that they can perhaps claim to be more evidence-based than
crowd judgments. It is also easier to explain a model’s forecast than a crowd’s. However, their
forecasts are also more brittle in that they are more easily misled by bad data (e.g., bad polling
numbers) or lack of relevant data (e.g., no Trump-like precedent in U.S. politics). Human collective
intelligence can be more robust when the going gets tough. In this study, the two highest-profile, best-funded forecasting models failed to outperform the collective intelligence. On the USA Senate Control
2014 question, both models performed worse than the three prediction markets. On the USA
President 2016 question, one model bested almost all crowd-based forecasters, but the other
performed the worst. In both cases, the play-money market Hypermind performed better than both
models. This result confirms and extends those of Servan-Schreiber & Atanasov (2015) and Atanasov & Joseph (2016), which used somewhat different data sets from different forecasters and time frames.
MYTH #2: SUPERFORECASTING PREDICTION POLLS OUTPERFORM PREDICTION MARKETS
Fresh off its recent victory in the IARPA-sponsored ACE geopolitical forecasting tournament, Good
Judgment proposes two relative novelties in crowd-based forecasting. First, so-called “prediction
polls”, where participants compete to assess event probabilities and achieve the lowest Brier scores.
These inputs are then filtered, weighted, and transformed statistically to extract the most informed
compound judgment. During the IARPA tournament, it was found that teams of forecasters thus
surveyed could slightly outperform a generic play-money prediction market (Atanasov et al., 2016).
The second novelty is so-called "superforecasters", a special breed of excellent forecasters (the top 2% of all participants) whose common psychological traits and habits enable them to maintain a high
level of performance over the long term. During the IARPA tournament, teams of superforecasters
were able to beat a generic play-money prediction market by 15% to 30% (Tetlock & Gardner, 2015).
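Atanasov et al. (2016) attribute much of the polls' performance to statistical aggregation with temporal decay, performance weighting, and recalibration. The Python sketch below illustrates that kind of aggregation; the half-life, the extremization exponent, and all names are our own illustrative assumptions, not Good Judgment's actual pipeline:

```python
import numpy as np

def aggregate_poll(probs, ages_days, half_life=10.0, a=2.0):
    # Decay-weighted mean of individual probability forecasts, followed by a
    # recalibration step that pushes the aggregate toward 0 or 1.
    # probs:     individual forecasts for one outcome, each in [0, 1]
    # ages_days: age of each forecast in days (0 = most recent)
    p = np.asarray(probs, dtype=float)
    w = 0.5 ** (np.asarray(ages_days, dtype=float) / half_life)  # newer counts more
    m = np.average(p, weights=w)                                 # weighted mean
    return m**a / (m**a + (1.0 - m)**a)                          # extremize

# Five forecasters; the most recent forecasts lean "yes".
print(aggregate_poll([0.6, 0.7, 0.55, 0.8, 0.65], [0, 1, 3, 7, 14]))  # ~0.78
```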
In the present study, however, the Good Judgment polls did not outperform the markets. In direct comparisons on three of the most consequential political events in recent history (Brexit, and Trump's nomination and then victory in the U.S. presidential election), Good Judgment's prediction polls tied or underperformed the play-money market Hypermind. On the Brexit question, a poll of superforecasters scored worse than all the markets. Furthermore, the Good Judgment superforecasting teams failed to outperform Hypermind on 35 questions from the IARPA tournament itself that Hypermind was allowed to list on its market (achieving .258 and .264 average Brier scores, respectively, according to government data, a non-significant difference). It seems that beating a generic prediction market in
a controlled experimental setting is easier than beating a full-featured market in the real world.
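The paper calls the .258 vs .264 gap non-significant without naming a test; one plausible check, sketched below under that assumption, is a paired test on the two forecasters' daily Brier scores over the same questions and days:

```python
import numpy as np
from scipy import stats

def compare_forecasters(brier_a, brier_b):
    # Paired t-test on two forecasters' daily Brier scores,
    # aligned day by day on the same questions.
    a = np.asarray(brier_a, dtype=float)
    b = np.asarray(brier_b, dtype=float)
    t_stat, p_value = stats.ttest_rel(a, b)
    return float(np.mean(a - b)), float(p_value)
```

Daily scores on the same question are autocorrelated, so such a test is a rough check rather than a definitive verdict.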
MYTH #3: REAL-MONEY PREDICTION MARKETS OUTPERFORM PLAY-MONEY ONES
This is perhaps the most persistent myth about prediction markets. It continues to thrive because
pundits, economists, and the general public find it so intuitively obvious that “putting your money
where your mouth is" is what drives a market's forecasting accuracy. But the data disagree. In the sports domain, this myth was debunked long ago (Servan-Schreiber et al., 2004). Still, doubts persisted regarding financial and political predictions (Rosenbloom & Notz, 2006; Diemer, 2010). The
present study finds no real-money advantage whatsoever on the highest-stakes political questions. In fact, in three of the four questions (the hardest ones, according to overall Brier scores), the play-money market Hypermind tied or outperformed all the real-money markets. Furthermore, the deepest and largest real-money market, Betfair (based in the U.K., where betting is legal and popular), generally underperformed its two U.S.-based counterparts despite the severe regulatory constraints imposed on the latter by the U.S. government regarding how much each trader can invest (only a few hundred
dollars). Clearly, liquidity and treasure are not the main drivers of market accuracy.
REFERENCES
Arrow, K., Forsythe, R., Gorham, M., Hahn, R., Hanson, R., Ledyard, J., Levmore, S., Litan, R.,
Milgrom, P., Nelson, F., Neumann, G., Ottaviani, M., Schelling, T., Shiller, R., Smith, V.,
Snowberg, E., Sunstein, C., Tetlock, P.C., Tetlock, P.E., Varian, H., Wolfers, J. and Zitzewitz, E.
(2008). The promise of prediction markets. Science, 320:877-878.
Atanasov P., and Joseph R. (2016). Which Election Forecast Was the Most Accurate? Or Rather: The
Least Wrong? The Washington Post, November 30, 2016.
Atanasov, P., Rescober, P., Stone, E., Swift, S., Servan-Schreiber, E., Tetlock, P., Ungar, L., and Mellers, B. (2016). Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls. Management Science.
Diemer, S. (2010). Real-Money Vs. Play-Money Forecasting Accuracy in Online Prediction Markets -
Empirical Insights from Ipredict. Journal of Prediction Markets, 4(3), December 2010.
Rosenbloom, E., and Notz, W. (2006). Statistical Tests of Real-Money Versus Play-Money Prediction Markets. Electronic Markets, 16(1):63-69.
Servan-Schreiber, E., and Atanasov, P. (2015). Hypermind vs Big Data: Collective Intelligence Still Dominates Electoral Forecasting. In Proceedings of the 2015 Collective Intelligence Conference, Santa Clara.
Servan-Schreiber, E., Wolfers, J., Pennock, D., and Galebach, B. (2004). Prediction Markets: Does Money Matter? Electronic Markets, 14(3):243-251.
Surowiecki, J. (2004). The Wisdom of Crowds. Doubleday.
Tetlock, P., and Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.