Collective Intelligence 2017
Debunking Three Myths About Crowd-Based
Forecasting
EMILE SERVAN-SCHREIBER, Hypermind International
Ever since the publication of "The Wisdom of Crowds" (Surowiecki, 2004), the accuracy of crowd-
based forecasting has served as a prime example of the practical value of collective intelligence.
Prediction markets were long believed to be the gold standard for eliciting the highest-quality forecast
from a crowd (Arrow et al., 2008). However, over the last decade, this hypothesis has come under
increasingly heavy fire from two new approaches claiming to push the boundaries of "the art and
science of prediction". First arose the big-data statistical models, popularized by Nate Silver's
FiveThirtyEight and the New York Times' Upshot. Then came a new generation of "prediction polls"
(Atanasov et al., 2016) and "superforecasters" (Tetlock & Gardner, 2015), fresh from winning a multi-
year geopolitical crowd-forecasting tournament sponsored by the U.S. government's Intelligence
Advanced Research Projects Activity (IARPA). Now confusion reigns supreme. Who really is better at
forecasting, crowds or models? Which crowd-based methods are more reliable, polls or markets? And
which types of markets, those based on real money or play money?
Using comparative data from a variety of leading prediction markets, prediction polls, and statistical
models, this study assesses their relative performance in forecasting recent major political events. The
results help debunk three popular myths about crowd-based forecasting.
COMPARISON POINTS
The forecasters included in this study are the most established and reputable of each type:
Forecaster                 Input    Method    Betting    Main Crowd
-------------------------  -------  --------  ---------  ----------
Hypermind                  Crowd    Market    Play       France, US
Almanis                    Crowd    Market    Play       UK, US
Pivit                      Crowd    Market    Play       US
Iowa Electronic Markets    Crowd    Market    Real       US
PredictIt                  Crowd    Market    Real       US
Betfair                    Crowd    Market    Real       UK
Good Judgment              Crowd    Poll      Play*      UK, US
FiveThirtyEight            Data     Stats     -          -
New York Times             Data     Stats     -          -
* Although there is no betting in the Good Judgment “prediction poll”, participants do compete to make the most accurate
probability assessments. So in terms of competition and risk, the situation is similar to a play-money market.
Besides sports, only a few events are covered by enough forecasters to offer useful comparison points.
Major electoral events in the U.S. and U.K. offer common ground. Luckily, those few events are also
among the most consequential and media-covered political events of the past few years: the midterm
election of 2014 and presidential election of 2016 in the U.S., and the "Brexit" referendum of 2016 in
the U.K. So what the data lack in volume, they make up for in relevance. This analysis focuses
specifically on the outcomes that were hardest to forecast (most suspenseful), over the longest possible
time frame common to enough forecasters. In chronological order, they are:
A. Senate Control 2014: Will the Republican party control the U.S. Senate after the 2014
midterm elections? Outcomes: Yes or No. Time frame: September 3 to November 3, 2014.
B. GOP Nomination 2016: Who will be the Republican party’s presidential nominee in 2016?
Outcomes: Cruz, Rubio, Trump, or Other. Time frame: January 25 to May 3, 2016.
C. Brexit 2016: Will the U.K. vote to leave the European Union (i.e., Brexit) in 2016? Outcomes:
Yes or No. Time frame: April 1 to June 22, 2016.
D. USA President 2016: Will Donald Trump be elected President of the United States in 2016?
Outcomes: Yes or No. Time frame: July 1 to November 7, 2016.
All forecasters produced probabilities for each possible outcome, updated at least daily. Forecasts were
scored using the Brier scoring rule – a strictly proper scoring rule, based on squared errors. Smaller
Brier scores denote higher accuracy. A perfect prediction – probability 1 for outcomes that occur,
probability 0 for outcomes that don’t – yields a score of 0 while a perfectly wrong prediction earns a
score of 2. To compute a forecaster’s performance on a question, its daily probabilities were scored
throughout the question’s time frame, then averaged into a mean daily Brier score. The results are
plotted in Figure 1.
Fig. 1. Mean daily Brier scores of the forecasters on each question. Smaller Brier scores indicate better forecasts, and the
forecasters are ranked from best (top) to worst (bottom) for each question. The forecasters are also color-coded into three
classes: play-money crowd-based forecasts, real-money crowd-based forecasts, and data-driven models. See text for an analysis.
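The scoring procedure above can be sketched in a few lines of Python. This is an illustrative implementation, not the study's actual scoring code, and the function names are my own:

```python
from typing import Dict, List

def brier_score(forecast: Dict[str, float], outcome: str) -> float:
    """Multi-outcome Brier score: sum of squared errors between the
    forecast probabilities and the realized outcome (1 for the outcome
    that occurred, 0 for the rest). Ranges from 0 (perfect) to 2
    (perfectly wrong)."""
    return sum((p - (1.0 if o == outcome else 0.0)) ** 2
               for o, p in forecast.items())

def mean_daily_brier(daily_forecasts: List[Dict[str, float]],
                     outcome: str) -> float:
    """Score each day's probabilities over the question's time frame,
    then average into a mean daily Brier score."""
    return (sum(brier_score(f, outcome) for f in daily_forecasts)
            / len(daily_forecasts))
```

For example, a confident correct forecast such as `{"Yes": 0.8, "No": 0.2}` when "Yes" occurs scores 0.08, while a perfectly wrong forecast scores 2.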
MYTH #1: DATA-DRIVEN MODELS ARE MORE RELIABLE THAN CROWD-BASED FORECASTS
The appeal of data-driven models is that they can perhaps claim to be more evidence-based than
crowd judgments. It is also easier to explain a model’s forecast than a crowd’s. However, their
forecasts are also more brittle in that they are more easily misled by bad data (e.g., bad polling
numbers) or lack of relevant data (e.g., no Trump-like precedent in U.S. politics). Human collective
intelligence can be more robust when the going gets tough. In this study, the two highest-profile, best-
funded forecasting models failed to outperform the collective intelligence. On the Senate Control
2014 question, both models performed worse than the three prediction markets. On the USA
President 2016 question, one model bested almost all crowd-based forecasters, but the other
performed the worst. In both cases, the play-money market Hypermind performed better than both
models. This result confirms and extends those of Servan-Schreiber & Atanasov (2015) and Atanasov
& Joseph (2016) using somewhat different data sets from different forecasters and time frames.
MYTH #2: SUPERFORECASTING PREDICTION POLLS OUTPERFORM PREDICTION MARKETS
Fresh off its recent victory in the IARPA-sponsored ACE geopolitical forecasting tournament, Good
Judgment proposes two relative novelties in crowd-based forecasting. First, so-called “prediction
polls”, where participants compete to assess event probabilities and achieve the lowest Brier scores.
These inputs are then filtered, weighted, and transformed statistically to extract the most informed
compound judgment. During the IARPA tournament, it was found that teams of forecasters thus
surveyed could slightly outperform a generic play-money prediction market (Atanasov et al., 2016).
The second novelty is so-called “superforecasters”, a special breed of excellent forecasters – the top 2%
of all participants – whose common psychological traits and habits enable them to maintain a high
level of performance over the long term. During the IARPA tournament, teams of superforecasters
were able to beat a generic play-money prediction market by 15% to 30% (Tetlock & Gardner, 2015).
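To make the poll-aggregation idea concrete, here is a minimal sketch in Python. It assumes a recency-weighted average followed by extremization, one published family of techniques for distilling prediction polls; it is not Good Judgment's actual algorithm, and the decay rate and exponent are arbitrary illustration values:

```python
import math
from typing import List, Tuple

def aggregate_poll(forecasts: List[Tuple[float, float]],
                   a: float = 2.0) -> float:
    """Illustrative prediction-poll aggregation.
    Each forecast is (probability, age_in_days). Newer forecasts get
    exponentially more weight; the weighted mean is then extremized by
    raising its odds to a power a > 1, pushing consensus away from 0.5."""
    weights = [math.exp(-0.1 * age) for _, age in forecasts]
    p = (sum(w * prob for w, (prob, _) in zip(weights, forecasts))
         / sum(weights))
    odds = (p / (1.0 - p)) ** a   # extremize in odds space
    return odds / (1.0 + odds)
```

With `a = 1` the function reduces to a plain weighted mean; with `a > 1`, a crowd leaning 70% "Yes" is reported as more than 70%, compensating for the fact that individual forecasters each hold only part of the available evidence.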
In the present study, however, the Good Judgment polls did not outperform the markets. In direct
comparisons on three of the most consequential political events in recent history – Brexit and Trump’s
nomination, then victory in the U.S. presidential election – Good Judgment’s prediction polls tied or
underperformed the play-money market Hypermind. On the Brexit question a poll of superforecasters
scored worse than all the markets. Furthermore, the Good Judgment superforecasting teams failed to
outperform Hypermind on 35 questions from the IARPA tournament itself that Hypermind was
allowed to list on its market (achieving .258 and .264 average Brier scores, respectively, according to
government data – a non-significant difference). It seems that beating a generic prediction market in
a controlled experimental setting is easier than beating a full-featured market in the real world.
MYTH #3: REAL-MONEY PREDICTION MARKETS OUTPERFORM PLAY-MONEY ONES
This is perhaps the most persistent myth about prediction markets. It continues to thrive because
pundits, economists, and the general public find it so intuitively obvious that “putting your money
where your mouth is” is what drives a market’s forecasting accuracy. But the data disagrees. In the
sports domain, this myth has been debunked long ago (Servan-Schreiber et al., 2004). Still, doubts
persisted regarding financial and political predictions (Rosenbloom et al., 2006; Diemer, 2010). The
present study finds no real-money advantage whatsoever on the highest-stakes political questions. In
fact, in three of the four questions – the hardest ones, according to overall Brier scores – the play-
money market Hypermind tied or outperformed all the real-money markets. Furthermore, the deepest
and largest real-money market, Betfair (based in the U.K., where betting is legal and popular),
generally underperformed its two U.S.-based counterparts, even though the U.S. government severely
restricts how much each trader on those markets may invest (only a few hundred dollars). Clearly,
liquidity and treasure are not the main drivers of market accuracy.
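For readers unfamiliar with how a market outputs a probability at all: many play-money markets run an automated market maker such as Hanson's logarithmic market scoring rule (LMSR), in which the instantaneous price of a contract is the crowd's implied probability. The sketch below is a generic illustration; the source does not say which mechanism each market uses, and the liquidity parameter `b` is an arbitrary choice:

```python
import math

def lmsr_price(q_yes: float, q_no: float, b: float = 100.0) -> float:
    """Instantaneous price of the YES contract under LMSR, given the
    outstanding share quantities. The price is the implied probability."""
    e_yes, e_no = math.exp(q_yes / b), math.exp(q_no / b)
    return e_yes / (e_yes + e_no)

def lmsr_cost(q_yes: float, q_no: float, b: float = 100.0) -> float:
    """LMSR cost function; a trader buying shares pays the difference in
    cost before and after the trade, which moves the price."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))
```

Note that `b` is precisely the "liquidity" knob of such a market maker: the finding above is that accuracy does not hinge on deep liquidity or real stakes.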
REFERENCES
Arrow, K., Forsythe, R., Gorham, M., Hahn, R., Hanson, R., Ledyard, J., Levmore, S., Litan, R.,
Milgrom, P., Nelson, F., Neumann, G., Ottaviani, M., Schelling, T., Shiller, R., Smith, V.,
Snowberg, E., Sunstein, C., Tetlock, P.C., Tetlock, P.E., Varian, H., Wolfers, J. and Zitzewitz, E.
(2008). The promise of prediction markets. Science, 320:877-878.
Atanasov, P., and Joseph, R. (2016). Which Election Forecast Was the Most Accurate? Or Rather: The
Least Wrong? The Washington Post, November 30, 2016.
Atanasov, P., Rescober, P., Stone, E., Swift, S., Servan-Schreiber, E., Tetlock, P., Ungar, L., and Mellers,
B. (2016). Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls. Management
Science.
Diemer, S. (2010). Real-Money Vs. Play-Money Forecasting Accuracy in Online Prediction Markets -
Empirical Insights from Ipredict. Journal of Prediction Markets, 4(3), December 2010.
Rosenbloom, E., & Notz, W. (2006). Statistical Tests of Real-Money Versus Play-Money Prediction
Markets. Electronic Markets, 16 (1) pp. 63-69.
Servan-Schreiber, E., and Atanasov, P. (2015). Hypermind vs Big Data: Collective Intelligence Still
Dominates Electoral Forecasting. In Proceedings of the 2015 Collective Intelligence Conference,
Santa Clara.
Servan-Schreiber, E., Wolfers, J., Pennock, D., & Galebach, B. (2004). Prediction Markets: Does
Money Matter? Electronic Markets, 14 (3) pp. 243-251.
Surowiecki, J. (2004). The Wisdom of Crowds. Doubleday.
Tetlock, P., and Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.