
A Journey into Bitcoin Metadata

March 2019
Journal of Grid Computing


Authors:

Università degli studi di Cagliari




Massimo Bartoletti · Bryn Bellomy · Livio Pompianu


Abstract Besides recording transfers of currency, the Bitcoin blockchain is being used to save metadata, i.e. arbitrary pieces of data which do not affect transfers of bitcoins. This can be done by using different techniques, and for different purposes. For instance, a growing number of protocols embed metadata in the blockchain to certify and transfer the ownership of a variety of assets beyond cryptocurrency. A point of debate in the Bitcoin community is whether metadata negatively impact the effectiveness of Bitcoin with respect to its primary function. This paper is a systematic analysis of the usage of Bitcoin metadata over the years. We discuss all the known techniques to embed metadata in the Bitcoin blockchain; we then extract metadata, and analyse them from different angles.

Keywords Bitcoin, blockchain, measurements

1 Introduction

The last few years have witnessed an increasing interest in Bitcoin, the first and most widespread decentralised cryptocurrency [40,48]. Bitcoin records currency transactions in a public, append-only data structure, the so-called blockchain. The blockchain is maintained by a peer-to-peer network, following a consensus protocol which ensures that tampering with past transactions is computationally infeasible [42].

M. Bartoletti
Università degli Studi di Cagliari, Dipartimento di Matematica e Informatica, Via Ospedale, 72, 09124 Cagliari, Italy
E-mail: bart@unica.it

B. Bellomy
ConsenSys, 49 Bogart St., Brooklyn, NY 11206
E-mail: bryn.bellomy@consensys.net

L. Pompianu
Università degli Studi di Cagliari, Dipartimento di Matematica e Informatica
E-mail: livio.pompianu@unica.it

The immutability of the Bitcoin blockchain, together with its openness, has inspired the development of new applications that go beyond transfers of currency: they certify the existence of documents [16,21,25], track the ownership of assets [12,18,19], run smart contracts [14,29,36] or perform other useful tasks. These applications exploit Bitcoin transactions to “piggy-back” their metadata, i.e., pieces of data which are not inherent to currency transfers, but are needed to implement their application logic.

A debate about scalability has been taking place in the Bitcoin community over the last few years [1,22,23]. In particular, users argue over whether the blockchain should allow for storing these spurious data. Besides quantifying the impact of metadata on the effectiveness of Bitcoin, many other relevant aspects of the usage of metadata are still worth investigating.

This paper is a systematic survey on the usage of metadata in Bitcoin, based on the analysis of the first 480,000 blocks in the blockchain, i.e. ∼245M transactions collected until 2017/08/10. Our main contributions can be summarised as follows:

1. We survey the existing techniques for embedding metadata in the Bitcoin blockchain, identifying 11 distinct ones. We compare these techniques, discussing their side effects, and we compare their evolution over time, quantifying the amount of metadata embedded through them.

2. We search the blockchain for metadata, and we parse them to infer the intended usage. To this purpose, we consider metadata both as single units of information, and as aggregates of pieces scattered through the blockchain (e.g., images). Overall, we recognise 7 different types of metadata. We quantify the amount and size of metadata by type.

3. We identify 45 distinct protocols which are used by applications to embed metadata in the blockchain. We classify them according to their application domain, and we measure the amount of metadata they produced. We analyse the correlation between embedding techniques, metadata types and protocols.

4. We compare the size of the extracted metadata with the overall size of the blockchain, and we investigate peaks of metadata that occurred over the years.

5. We make available a public dataset of metadata extracted from the blockchain [35], as well as the tools we have developed for our analyses [33,38].

Structure of the paper. Section 2 is a minimalistic introduction to Bitcoin, containing all the technical background needed in the subsequent sections. In Section 3 we show and compare the techniques to embed metadata in the blockchain, presenting several statistics on their usage. In Section 4 we illustrate some techniques to parse metadata and reconstruct the original content; then, we categorize and quantify the metadata extracted from the blockchain. In Section 5 we investigate the protocols which embed metadata, we classify them, and quantify the volume of metadata produced by each protocol. Section 6 discusses how the usage of metadata impacts the Bitcoin blockchain. Finally, Section 7 discusses some related works, and Section 8 draws some conclusions.

2 Background on Bitcoin

Bitcoin [48] is a decentralized infrastructure to exchange virtual currency, the bitcoins. Users interact with Bitcoin through addresses, by publishing transactions that transfer bitcoins from one address to another. The log of all transactions is recorded on the blockchain, a public and immutable data structure maintained by the nodes of the Bitcoin network. A subset of nodes, called miners, gather the transactions sent by users, aggregate them in blocks, and try to append these blocks to the blockchain. A consensus protocol based on moderately-hard “proof-of-work” puzzles is used to resolve conflicts that may happen when different miners concurrently try to extend the blockchain, or when some miner attempts to append a block with invalid transactions. Ideally, the blockchain is globally agreed upon, and free from invalid transactions, unless the adversary controls the majority of the computational power of the network [31,42,44]. The security of the consensus protocol relies on the assumption that miners are rational, i.e. that following the protocol is more convenient than trying to attack it. To make this assumption hold, miners receive some economic incentives for performing the time-consuming computations required by the protocol. Part of these incentives is given by the fees paid by users upon each transaction.

2.1 Transactions

To illustrate how transfers of bitcoins work, we consider two transactions T0 and T1, which we represent graphically as follows:

    T0
      previous transaction: ···
      in-script: ···
      value: v0
      out-script(T, σ): ver_k(T, σ)

    T1
      previous transaction: T0
      in-script: sig_k(•)
      value: v1
      out-script: ···

The transaction T0 contains v0 Satoshis (1 bitcoin = 10^8 Satoshis). A user can redeem this amount by publishing a transaction (e.g., T1), whose previous transaction field contains the identifier of T0 (displayed just as T0 in the figure), and whose in-script field makes the out-script¹ of T0 evaluate to true. When this happens, the value of T0 is transferred to the new transaction T1, and T0 becomes unredeemable. A subsequent transaction can then redeem T1 likewise.

In the transaction T0 above, out-script checks that σ is a valid signature (made with the key k) of the redeeming transaction. We denote with ver_k(T, σ) the signature verification function, and with sig_k(•) the signature of the enclosing transaction, including all the parts of the transaction but its in-script (obviously, because it contains the signature itself).

Now, assume that T0 is redeemable on the blockchain when someone tries to append T1. This is possible if v1 ≤ v0, and the out-script of T0, applied to T1 and to the signature sig_k(•), evaluates to true.

The previous example shows the simple case of transactions with only one input and one output. In general, transactions have the form displayed in Figure 1. First, there can be multiple inputs and outputs (denoted with array notation in the figure): in-counter specifies the number of inputs, and out-counter that of outputs. Each input (resp. output) has its own in-script (resp. out-script). The two fields in-script length and out-script length denote the size of in-script and out-script, respectively. Since each output can be redeemed independently, previous transaction fields must specify which one they are redeeming (in the figure, previous out-index). A transaction with multiple inputs redeems all the (outputs of) transactions in its previous transaction fields, by providing a suitable in-script for each of them. The lock time field specifies the earliest moment in time when the transaction can appear on the blockchain. The version no field is currently set to 1. Transaction inputs also contain a 4-byte field called sequence no. Normally its value is 0xFFFFFFFF, and it is ignored unless the transaction lock time is greater than 0 [11].

    version no: k
    in-counter: n
    previous transaction[0]: T0
    previous out-index[0]: i0
    in-script length[0]: ···
    in-script[0]: ···
    sequence no[0]: ···
    out-counter: m
    value[0]: v0
    out-script length[0]: ···
    out-script[0]: ···
    lock time: s

Fig. 1: A Bitcoin transaction.

¹ The fields in-script and out-script are called, respectively, scriptSig and scriptPubKey in the Bitcoin wiki.

In order for T to be appended to the blockchain, a few conditions must be satisfied: for instance, for each k < n, the out-script of the i_k-th output of the transaction T_k must evaluate to true when fed with the value in in-script[k]; further, none of the inputs must have been redeemed yet, and the sum of the values of all the redeemed outputs must be greater than or equal to the sum of the values of all outputs in T (see [30] for a formal specification).

The Unspent Transaction Output set (in short, UTXO

set) is the set of redeemable outputs of all transactions

in the blockchain.
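The two bookkeeping conditions above (inputs not yet redeemed, redeemed value covering the new outputs) can be sketched with a toy UTXO set. This is only an illustrative model: the names `TxInput`, `Tx` and `try_append` are hypothetical, and script evaluation is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TxInput:
    prev_txid: str   # identifier of the transaction being redeemed
    prev_index: int  # which output of that transaction


@dataclass
class Tx:
    txid: str
    inputs: list
    output_values: list  # values in Satoshis, one per output


def try_append(tx, utxo):
    """Apply tx to the UTXO set (dict: (txid, index) -> value) if both checks pass."""
    keys = [(i.prev_txid, i.prev_index) for i in tx.inputs]
    # Every input must refer to a distinct output that is still unspent.
    if len(set(keys)) != len(keys) or any(k not in utxo for k in keys):
        return False
    # The redeemed value must cover the value of the new outputs.
    if sum(utxo[k] for k in keys) < sum(tx.output_values):
        return False
    for k in keys:
        del utxo[k]  # redeemed outputs leave the UTXO set
    for idx, value in enumerate(tx.output_values):
        utxo[(tx.txid, idx)] = value  # new outputs become redeemable
    return True


utxo = {("T0", 0): 100}
t1 = Tx("T1", [TxInput("T0", 0)], [90])
print(try_append(t1, utxo), utxo)
```

In this model the difference between redeemed and created value (10 Satoshis in the usage line) simply disappears from the set, which corresponds to the fee collected by the miner.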

2.2 Scripts

Bitcoin scripts [10,11] are programs in a stack-based language featuring a limited set of logic, arithmetic, and cryptographic operators (but without loops). In the rest of this section we illustrate the pairs of in-script and out-script which are considered standard by the Bitcoin network [39]. In Section 3 we will then show how these scripts are commonly used for embedding metadata.

We use the typewriter font for denoting opcodes (e.g., OP_CHECKSIG), and italic for denoting bitstrings (e.g., sig). In Bitcoin scripts, these bitstrings are always preceded by a suitable OP_PUSHDATA opcode, which pushes the bitstring onto the stack. For the sake of simplicity, hereafter we omit these OP_PUSHDATA opcodes.

2.2.1 Pay to public key (P2PK)

    # pay-to-PubKey (P2PK)
    in-script:  sig
    out-script: pubKey OP_CHECKSIG

This pair of scripts implements the signature verification function outlined in Section 2.1. The evaluation of the output script starts with sig on top of the stack, and then proceeds by also pushing pubKey. The opcode OP_CHECKSIG performs the signature verification, popping the two top elements of the stack, and evaluating to true if the verification succeeds.

2.2.2 Pay to public key hash (P2PKH)

    # pay-to-PubKeyHash (P2PKH)
    in-script:  sig pubKey
    out-script: OP_DUP OP_HASH160 pubKeyHash OP_EQUALVERIFY OP_CHECKSIG

This pair of scripts performs signature verification, similarly to the previous pair. The main difference with respect to P2PK is that the output script now contains the double hash of a public key, rather than the public key itself.

In more detail, the in-script contains a signature sig and a public key pubKey. The evaluation of the output script starts with sig and pubKey on the stack. The opcode OP_DUP duplicates the top element of the stack (i.e., pubKey), and OP_HASH160 replaces the top element with its double hash. Then, pubKeyHash (the double hash of a key) is pushed onto the stack. The opcode OP_EQUALVERIFY checks if the two top elements of the stack are equal: if so, they are popped, otherwise the script fails. Finally, OP_CHECKSIG performs the signature verification.
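The stack evolution just described can be traced with a small Python sketch. It is a toy model, not Bitcoin's interpreter: `hash160_standin` replaces the real double hash (RIPEMD160 of SHA256) with a truncated double SHA256, and `check_sig_standin` replaces ECDSA verification with a trivial string comparison. Both stand-ins, and the function name `run_p2pkh`, are assumptions made purely for illustration.

```python
import hashlib


def hash160_standin(data: bytes) -> bytes:
    # Stand-in for OP_HASH160: SHA256 applied twice, truncated to 20 bytes.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()[:20]


def check_sig_standin(sig: bytes, pub_key: bytes) -> bool:
    # Toy "signature": valid iff sig == b"signed-by:" + pub_key. Not real ECDSA.
    return sig == b"signed-by:" + pub_key


def run_p2pkh(sig: bytes, pub_key: bytes, pub_key_hash: bytes) -> bool:
    stack = [sig, pub_key]                       # in-script: sig pubKey
    stack.append(stack[-1])                      # OP_DUP
    stack.append(hash160_standin(stack.pop()))   # OP_HASH160
    stack.append(pub_key_hash)                   # push pubKeyHash
    b, a = stack.pop(), stack.pop()              # OP_EQUALVERIFY pops both
    if a != b:
        return False                             # script fails
    key, signature = stack.pop(), stack.pop()    # OP_CHECKSIG pops key and sig
    return check_sig_standin(signature, key)


key = b"\x02" * 33
print(run_p2pkh(b"signed-by:" + key, key, hash160_standin(key)))
```

Replacing `pub_key_hash` with arbitrary bytes makes the OP_EQUALVERIFY step fail for every key one can realistically produce, which is the property exploited for embedding metadata in Section 3.3.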

2.2.3 Multi-signature

    # multi-signature
    in-script:  OP_0 sig1 ... sigM
    out-script: M pubKey1 ... pubKeyN N OP_CHECKMULTISIG

This pair of scripts performs a multi-signature verification: the output can be redeemed if the in-script provides M signatures verified against N public keys, where M ≤ N. The opcode OP_CHECKMULTISIG tries to verify the last signature with the last public key. If they match, it proceeds to verify the previous signature in the sequence, otherwise it tries to verify the signature with the previous key. Notably, OP_CHECKMULTISIG uses each key only once, therefore the order of the signatures in the in-script matters.
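The matching order described above can be sketched as follows. The function `check_multisig` and the `verify` callback are hypothetical names; real signature verification is replaced by whatever predicate the caller supplies.

```python
def check_multisig(sigs, pub_keys, verify):
    """Return True if every signature matches a key, scanning both lists from the end."""
    s, k = len(sigs) - 1, len(pub_keys) - 1
    while s >= 0:
        if k < s:
            return False  # not enough keys left for the remaining signatures
        if verify(sigs[s], pub_keys[k]):
            s -= 1        # signature matched: move to the previous signature
        k -= 1            # each key is consumed, whether or not it matched
    return True


def toy_verify(sig, key):
    # Toy predicate: a "signature" is just the key name with a prefix.
    return sig == "sig_" + key


keys = ["A", "B", "C"]
print(check_multisig(["sig_A", "sig_C"], keys, toy_verify))  # same order as the keys
print(check_multisig(["sig_C", "sig_A"], keys, toy_verify))  # reversed order
```

With the reversed order the key for the last signature is reached only after the other key has already been consumed, so the check fails even though both signatures are individually valid.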

2.2.4 Pay to script hash (P2SH)

    # pay-to-ScriptHash (P2SH)
    in-script:  v1 ... vN bitstring
    out-script: OP_HASH160 hash OP_EQUAL

In this pair, the out-script is first evaluated with bitstring on top of the stack. The script checks that the hash of bitstring is equal to hash. If so, the script obtained by interpreting bitstring as a sequence of opcodes is executed (on the parameters v1 ... vN). Summing up, the output can be redeemed if the in-script of the redeeming transaction provides a script whose hash coincides with the hash contained in out-script, and whose evaluation (on the parameters v1 ... vN) yields true.

2.2.5 OP_RETURN

    # op_return
    out-script: OP_RETURN bitstring

An out-script containing OP_RETURN always evaluates to false, regardless of the value of the bitstring. Therefore the corresponding output is unspendable, and it can be safely removed from the UTXO set.

Currently, standard transactions can have only one occurrence of OP_RETURN: more precisely, if a transaction has more than one out-script with OP_RETURN, or an out-script with more than one OP_RETURN, or an OP_RETURN with more than one OP_PUSHDATA, then the transaction is not standard.

3 Embedding metadata in the blockchain

In Sections 3.1 to 3.7 we illustrate various techniques (as far as we know, all those used in practice) to embed metadata in the blockchain. In Section 3.8 we show how to split large pieces of metadata into smaller pieces that can be distributed among sets of transactions. In Section 3.9 we discuss how to commit to specific values without explicitly writing them in the blockchain. Section 3.10 gives some statistics on embedding techniques.

3.1 Value field

Transaction outputs specify the amount of Satoshis to send through the value field, which is 8 bytes long. A first way to encode a message m in the blockchain is to build a transaction with an output whose value is the number that represents m. For instance, the BitcoinTimestamp protocol (see Table 5) exploits this method for saving a SHA256 hash. The hash is first split into 16 pieces, which are then translated into amounts of Satoshis. Finally, the protocol builds a transaction containing an output for each amount (e.g. see rows 1-2 of Table 1).

Although users can easily recover the moved funds (e.g. by specifying their own address as receiver), the disadvantage of this technique is that it requires owning at least the amount of Satoshis needed to represent m.
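The splitting step can be illustrated with a short sketch. The text does not spell out how BitcoinTimestamp maps each piece to an amount, so the choice below (16 pieces of 2 bytes each, read as big-endian integers) is an assumption, and the function names are hypothetical.

```python
import hashlib


def hash_to_amounts(digest: bytes, pieces: int = 16) -> list:
    """Split a digest into equal pieces and read each one as an amount in Satoshis."""
    size = len(digest) // pieces
    return [int.from_bytes(digest[i:i + size], "big")
            for i in range(0, len(digest), size)]


def amounts_to_hash(amounts, size: int = 2) -> bytes:
    """Inverse step: rebuild the digest from the output values, in output order."""
    return b"".join(a.to_bytes(size, "big") for a in amounts)


digest = hashlib.sha256(b"some document").digest()
amounts = hash_to_amounts(digest)
print(len(amounts), sum(amounts))
```

Under this assumed encoding each output carries at most 65,535 Satoshis, so the sender needs at most 16 × 65,535 Satoshis (plus fees) to represent one hash, which makes the funding requirement mentioned above concrete.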

3.2 Input sequence

Some users exploited the 4 bytes in the sequence no field for appending their own metadata. Although this technique does not have negative effects on the Bitcoin system, as far as we know no protocols use it (as shown in Table 5). We conjecture that protocols rely on other techniques because 4 bytes are not enough for implementing any relevant use case (see Section 5 for a description of protocols and use cases).

3.3 Pay to public key / Pay to public key hash

In P2PK (Section 2.2.1) the output script specifies the recipient of some bitcoins through a public key, encoded in 65 bytes (or 33 bytes when the key is compressed). Similarly, P2PKH (Section 2.2.2) uses the hash of the key, which is encoded in 20 bytes.

Users can embed an arbitrary message m (of suitable size) in the P2PK or P2PKH script of some output of a transaction T, by writing m in place of the expected key or hash. Assuming that Bitcoin uses secure signatures and collision-resistant hash functions, it is computationally infeasible, in general, to craft a transaction which redeems such an output of T.

A downside of this approach is that these unspendable outputs are indistinguishable from the spendable ones: actually, a Bitcoin node has no way of knowing whether or not a user exists who possesses the needed hash preimage (nor does it know that the data was never intended to represent an address in the first place). As a result, the nodes of the Bitcoin network must keep these transactions in their UTXO set indefinitely. Since the UTXO set is usually stored in RAM for efficiency reasons [52], the bloating of the UTXO set negatively affects the memory consumption of nodes [32].

| # | Technique | Transaction identifier / Bitcoin address |
|---|---|---|
| 1 | Value | f6f89da0b22ca49233197e072a39554147b55755be0c7cdf139ad33cc973ec46 |
| 2 | Value | 49a130ce4255fc91061c3d1170cbc256f51ed671256df837500d59183cfdd64f |
| 3 | OP_RETURN | d84f8cf06829c7202038731e5444411adc63a6d4cbf8d4361b86698abad3a68a |
| 4 | Vanity Address | 1ponziUjuCVdB167ZmTWH48AURW1vE64q |
| 5 | Vanity Address | 1CounterpartyXXXXXXXXXXXXXXXUWLpVr |
| 6 | Vanity Address | 72162e9224dbadefb84834046ee8b4706af77f57fa4e8fd5aaf3255abf516807 |
| 7 | Vanity Address | 1WeRe3jh9XiaAyabyiE2Mz4v8bbcB52Gy |
| 8 | Vanity Address | 1FineoW99TYAAZuRSkbZLrx65iTXELqHhv |
| 9 | Vanity Address | 18chaNzLXvAbYvkad7MH2LNrQmBzeXbWLo |
| 10 | Vanity Address | 1PoStJBYu49Ezqcwh1VeMWZgRopwcYwksY |
| 11 | Vanity Address | 1FAke1neYErMQLebVPYBAToTLvafr5ZPF6 |
| 12 | Coinbase | 4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b |
| 13 | P2PKH | 5970ae129d1141663bd5e441a1555c16fb1c0586dd05f40c1db3d3e81218ee41 |
| 14 | P2SH | 1e47936f37e71b98e8bafe51ddc902d59c1318bc556329ba4ab1996981785292 |

Table 1: Some transactions and addresses containing metadata.

3.4 Pay to script hash

By using P2SH scripts (Section 2.2.4), metadata can be embedded in various ways. Similarly to the P2PKH technique (Section 3.3), one can embed metadata in the output script, in place of the hash. An alternative technique is to embed metadata in the input script, pushing them onto the stack with the OP_PUSHDATA instruction, and immediately afterwards removing them with OP_DROP. As long as the script completes its execution successfully and there is some nonzero value on the stack after completion, the transaction is valid [10]: indeed, there is no rule specifying that the data accumulated on the stack during the script execution must be cleared. Stack items that are below the topmost item at the end of execution are simply ignored. For instance, consider the following scripts:

    in-script:  OP_PUSHDATA1 8 0xaabbccddeeff0011 sig pubKey v1 ... vN bitstring
    out-script: OP_HASH160 hash OP_EQUAL

First, the in-script and the out-script are concatenated; the result is a common P2SH script preceded by an OP_PUSHDATA1. The evaluation starts by pushing 0xaabbccddeeff0011 onto the stack. This is done through OP_PUSHDATA1, where the trailing 1 indicates that the next 1 byte contains the number of bytes to be pushed onto the stack (in the snippet above, 8 bytes). At the end of the execution, the top item on the stack is true, resulting from the OP_EQUAL. Underneath this value there is the metadata 0xaabbccddeeff0011.
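How a length-prefixed push is decoded can be sketched with a minimal parser. It handles only direct pushes (opcodes 1 to 75), OP_PUSHDATA1 (0x4c) and OP_PUSHDATA2 (0x4d), and rejects every other opcode; `parse_pushes` is a hypothetical helper written for this illustration, not part of any Bitcoin library.

```python
def parse_pushes(script: bytes) -> list:
    """Return the data items pushed by a script made only of push opcodes."""
    items, i = [], 0
    while i < len(script):
        op = script[i]
        i += 1
        if 1 <= op <= 75:        # the opcode itself is the number of bytes to push
            n = op
        elif op == 0x4c:         # OP_PUSHDATA1: the next 1 byte holds the length
            n = script[i]
            i += 1
        elif op == 0x4d:         # OP_PUSHDATA2: the next 2 bytes hold the length
            n = int.from_bytes(script[i:i + 2], "little")
            i += 2
        else:
            raise ValueError(f"not a push opcode: {op:#x}")
        if i + n > len(script):
            raise ValueError("truncated push")
        items.append(script[i:i + n])
        i += n
    return items


# OP_PUSHDATA1 8 0xaabbccddeeff0011, followed by a direct 3-byte push
script = bytes([0x4c, 8]) + bytes.fromhex("aabbccddeeff0011") + bytes([3]) + b"abc"
print([item.hex() for item in parse_pushes(script)])
```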

Note that transactions that make use of ignored OP_PUSHDATA for embedding metadata do not bloat the UTXO set: indeed, their outputs can be spent by valid addresses, because they do not need to overwrite the address fields in the transaction scripts. The work [26] provides further details on the P2SH technique.

3.5 OP_RETURN

Standard OP_RETURN transactions can store up to 80 bytes of arbitrary data (see e.g. row 3 of Table 1). An out-script containing OP_RETURN always evaluates to false, hence the output is unspendable, and its transaction can be safely removed from the UTXO set. In this way, OP_RETURN overcomes the UTXO consumption issue highlighted in Section 3.3.
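As an illustration, the bytes of such an out-script can be assembled as follows. The opcode values (0x6a for OP_RETURN, 0x4c for OP_PUSHDATA1) are those of the Bitcoin script language; the helper name `op_return_script` and the constant `MAX_STANDARD_PAYLOAD` are hypothetical, and the 80-byte bound is simply the figure quoted above.

```python
OP_RETURN = 0x6a
OP_PUSHDATA1 = 0x4c
MAX_STANDARD_PAYLOAD = 80  # limit quoted in the text


def op_return_script(payload: bytes) -> bytes:
    """Build an out-script of the form: OP_RETURN <single push of payload>."""
    if len(payload) > MAX_STANDARD_PAYLOAD:
        raise ValueError("payload exceeds the 80-byte standard limit")
    if len(payload) <= 75:
        push = bytes([len(payload)])                 # direct push opcode
    else:
        push = bytes([OP_PUSHDATA1, len(payload)])   # explicit 1-byte length
    return bytes([OP_RETURN]) + push + payload


print(op_return_script(b"hello").hex())
```

The sketch emits exactly one push after OP_RETURN, in line with the standardness rule recalled in Section 2.2.5.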

3.6 Vanity address

Bitcoin addresses are the hash of ECDSA public keys. By a brute force search during the generation of the key pair, it is possible to obtain a so-called vanity address, where a few bytes are equal to a given string (see e.g. row 4 of Table 1)². One can embed these bytes as metadata in transactions, e.g. by using the vanity address in P2PK scripts.

² Open-source tools like Vanitygen generate vanity addresses with user-defined patterns.

Although, in theory, the maximum size of the metadata corresponds to the size of an address (20 bytes), in practice this technique is viable only for metadata of a few bytes. Longer metadata can be distributed among different vanity addresses; for instance, the transaction at row 6 of Table 1 transfers bitcoins from 5 vanity addresses (displayed at rows 7-11). By concatenating the first four characters (excluding the leading 1) of these addresses, we read the plain English words: “We’re fine, 8chan post fake”.
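The concatenation can be reproduced directly from the five addresses at rows 7-11 of Table 1; the helper name `vanity_message` is hypothetical.

```python
# Addresses from rows 7-11 of Table 1
addresses = [
    "1WeRe3jh9XiaAyabyiE2Mz4v8bbcB52Gy",
    "1FineoW99TYAAZuRSkbZLrx65iTXELqHhv",
    "18chaNzLXvAbYvkad7MH2LNrQmBzeXbWLo",
    "1PoStJBYu49Ezqcwh1VeMWZgRopwcYwksY",
    "1FAke1neYErMQLebVPYBAToTLvafr5ZPF6",
]


def vanity_message(addrs, n=4):
    """Concatenate the n characters that follow the leading '1' of each address."""
    return "".join(a[1:1 + n] for a in addrs)


print(vanity_message(addresses))  # WeReFine8chaPoStFAke
```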

A different use of vanity addresses is when all the bytes of the address are fixed, as e.g. at row 5 of Table 1. In this case, it is plausible that the corresponding private key is not known to anybody, so a P2PK output using such an address cannot be spent. This fact is exploited to implement “Proof-of-Burn” protocols on top of Bitcoin (as e.g. in Counterparty): in these protocols, users receive some tokens in exchange for sending some bitcoins to an unspendable address.

3.7 Coinbase transaction

Miners specify how to redeem the reward for the mined block (and the fees of its transactions) through the first transaction of the block. This transaction does not have an input script, and it contains a field called coinbase, that miners usually fill in with metadata. The coinbase data size is between 2 and 100 bytes. Nevertheless, after block 227,835 the available space is reduced, since the Bitcoin Improvement Proposal 34 (BIP 0034) [27] requires the first bytes of the coinbase field to store the block height index.

Usually, the coinbase field is used by miners for identifying the mining pool, or for voting on BIPs (for instance, when they voted to support either BIP 0016 or BIP 0017 [28]). The most famous message embedded by using this technique is included in the genesis block (see e.g. row 12 of Table 1): “The Times 03/Jan/2009 Chancellor on brink of second bailout for banks”.

3.8 Distributing metadata

The techniques discussed in Sections 3.1 to 3.7 embed metadata within a single field of a single transaction. Below we describe three techniques that make it possible to split metadata among multiple fields and transactions.

Multisignature. This technique uses the multisignature script introduced in Section 2.2.3. Given N−1 pieces of metadata v2, ..., vN, the scripts are the following:

    in-script:  OP_0 sig1
    out-script: 1 pubKey1 v2 ... vN N OP_CHECKMULTISIG

This implements a 1-of-N multisignature: namely, only one signature (sig1) needs to be verified against one of the N public keys (pubKey1). The other “public keys” in the out-script (actually, the N−1 pieces of metadata) are irrelevant for the execution of the script.

Note that this technique bloats the UTXO set only until the transaction is redeemed.

Multiple inputs / outputs. Although Bitcoin imposes a hard limit of 10K bytes on the size of a single script [9], there is no limit to the number of inputs or outputs a transaction may contain. Hence, metadata bigger than 10K bytes can be split into smaller chunks, and distributed among many inputs or outputs within a single transaction. To ensure that the original data can be reconstructed from these fragments, one needs to fix an encoding. The simplest one is to store the chunks of metadata sequentially in N output scripts. Another common encoding, used e.g. by the BIT-COMM protocol, uses the amount of bitcoins transferred by each output to order the chunks (see e.g. row 13 of Table 1). Transactions using this technique may or may not bloat the UTXO set: that is determined by the structure of the individual transaction outputs.
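The idea of ordering chunks by output amount can be sketched as below. The actual BIT-COMM encoding is not described in the text, so the scheme here (20-byte chunks, amounts increasing by one Satoshi from an arbitrary base) is purely an assumption, and both function names are hypothetical.

```python
def split_into_outputs(data: bytes, chunk_size: int = 20, base_amount: int = 1000):
    """Return (amount, chunk) pairs; the amount encodes the position of the chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [(base_amount + pos, chunk) for pos, chunk in enumerate(chunks)]


def reassemble(outputs) -> bytes:
    """Sort the outputs by amount and concatenate their chunks."""
    return b"".join(chunk for _, chunk in sorted(outputs, key=lambda o: o[0]))


message = b"metadata that does not fit into a single 20-byte hash field"
outputs = split_into_outputs(message)
outputs.reverse()  # the order of outputs in the transaction no longer matters
print(reassemble(outputs) == message)
```

The 20-byte chunk size matches the P2PKH hash field of Section 3.3; a sequential encoding would instead rely on the position of each output within the transaction.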

Transaction chains. The previous techniques store metadata within a single transaction. However, this is not always ideal, or even possible. For instance:

– If the size of metadata exceeds the maximum block size, the transaction containing the metadata will be rejected by the network.³

– Large transactions require large fees. Even though, in theory, one can send a transaction with zero fee, in practice a transaction with no fee (or a low fee) is unlikely to be mined. Depending on current fee market dynamics, it may be more cost-effective to split the metadata across multiple transactions.

– Transactions greater than a certain size, or which contain more than one OP_RETURN, are considered non-standard, with the consequence that most nodes refuse to relay them. This limit has varied over time, and is now replaced by the concept of “transaction weight”, which is similar but accounts for Segregated Witness data in a different manner.⁴

Due to these considerations, metadata are often split into sets of transactions. A common technique is to connect transactions containing related data in a chain structure. When building a transaction chain, one of the techniques from Sections 3.1 to 3.7 is chosen for encoding data into each individual transaction. Then, a spendable transaction output is added to each transaction, to be redeemed by the subsequent transaction in the chain. The output must be spendable by an address over which the user embedding the data has control.

³ See github.com/bitcoin/.../src/main.cpp#L829.

⁴ See github.com/bitcoin/bitcoin/.../src/main.h#L56, and github.com/bitcoin/bitcoin/.../src/main.cpp#L644-L648.

3.9 Embedding vs. committing metadata

Several Bitcoin-based protocols notarize documents by embedding their hash on the blockchain. There are alternative techniques, e.g. pay-to-contract [43] and sign-to-contract⁵, which allow one to commit to a hash without actually embedding it into a transaction. For instance, Btproof applies RIPEMD160 to the hash, to obtain a Bitcoin address; similarly, Originstamp aggregates hashes daily into a seed, hashes it into a private key, and then derives from it the corresponding public key and address. Both protocols pay a small fee to the generated address in order to publish it in the blockchain. ContractHashTool exploits elliptic curves for building an in-script that cryptographically commits to a given hash, without embedding the hash itself. Unless a protocol keeps a public track of its transactions (like, e.g., Originstamp), these techniques prevent external observers from inferring any metadata.

3.10 Statistics on embedding techniques

Table 2 shows the amount of metadata embedded with each technique. The leftmost column groups the techniques in three categories: the Single category contains the techniques which embed the whole piece of metadata into a single chunk; Multi contains the techniques which split an element into multiple chunks, but store all the pieces into a single transaction; finally, Chains gathers the techniques which spread pieces across multiple transactions. The third and fourth columns show the size (in bytes) of the field storing metadata, and where the field is located. The fifth column indicates which techniques bloat the UTXO set. The sixth column displays the date on which the first chunk of metadata appears, the seventh one shows the number of times a technique has been used⁶, and the last two columns show the total and average size of the metadata embedded.

⁵ See bitcointalk.org/index.php?topic=915828.msg10056796

⁶ Elements of type Single are chunks; Multi elements are transactions; Chains elements are chains of transactions.

In the first 480,000 blocks we count 4,582,661 chunks of metadata, for a total size of 101,731,894 bytes (∼97 MiB; see Table 2). Note that the chunks of the Chains type are a subset of those in the other categories, and the chunks in Multi are a subset of those in Single; hence, to avoid counting the same piece of metadata multiple times, the totals only consider the values of the Single type. This number of chunks is a good indicator of the total number of transactions with metadata, since the fraction of transactions produced by Multi techniques is negligible. We observe that ∼75% of the space used by metadata is due to OP_RETURN transactions. The average size of transaction chains is higher than that of the other methods, since this technique is used for embedding images and archives.
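The totals quoted above can be re-derived from the Single rows of Table 2 with a few lines of arithmetic. The figures are copied from the table; only the variable names are introduced here.

```python
# technique: (number of chunks, total size in bytes), Single rows of Table 2
single = {
    "Input Sequence": (1_305_372, 5_221_488),
    "P2PK-P2PKH": (66_762, 1_335_240),
    "P2SH": (1_578, 31_560),
    "OP_RETURN": (2_903_186, 76_700_965),
    "Coinbase": (305_763, 18_442_641),
}

total_items = sum(n for n, _ in single.values())
total_bytes = sum(b for _, b in single.values())
op_return_share = single["OP_RETURN"][1] / total_bytes

print(total_items)                       # 4582661
print(total_bytes)                       # 101731894
print(round(100 * op_return_share))      # 75
print(round(total_bytes / total_items))  # 22
```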

4 Analysis of Bitcoin metadata

In this section we present our techniques to parse metadata and reconstruct the original content. Then, we categorize the reconstructed items, and we measure them.

4.1 Collecting metadata

One of the most effective techniques for recognising chunks of metadata is to search strings for “suspicious” byte patterns. For example, long strings of contiguous ASCII characters are unlikely to occur in regular transaction data; similarly, the probability of finding specific bitstrings, like a Gzip header (which begins with the magic bytes 0x1f8b), is extremely low. Finding such a bitstring is a trigger for further investigation. We employed several types of these searches, which we discuss below.

Frequency analysis. The GNU strings utility⁷ takes a data source as input, and yields as output all of the ASCII plaintext characters found in that source. It provides a flag for filtering out strings of contiguous ASCII characters under a given length. It is possible to run strings directly on Bitcoin Core’s .dat files, but care must be taken when tuning the filter. Obviously, too low a threshold will yield a huge number of false positives. On the other hand, due to the way inputs and outputs are encoded in transaction data, too high a threshold eliminates plaintext that has been split across multiple transactions or transaction scripts.
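The effect of the length threshold can be reproduced with a rough Python approximation of this kind of search. It only mimics the basic behaviour of strings (runs of printable ASCII bytes of a minimum length) and is not the tool used for the analysis; `ascii_strings` is a hypothetical helper.

```python
import re


def ascii_strings(blob: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII bytes found in blob."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(blob)]


blob = b"\x00\x01hello world\xff\x02ab\x03Bitcoin\x00"
print(ascii_strings(blob, min_len=4))   # ['hello world', 'Bitcoin']
print(ascii_strings(blob, min_len=12))  # []
```

With a threshold of 4 both text fragments are reported, while the 2-byte run is discarded as noise; raising the threshold to 12 loses every fragment, which is the trade-off described above.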

While this approach is quite simple, some of the

data that we encountered — particularly, the conversa-

tions and code embedded into the blockchain by Peter

Todd (one of the Bitcoin Core developers) — mention

that they are speciﬁcally intended to be discovered and

extracted via this method. For example, Todd’s plain-

text uploader is a Python script stored in the blockchain

(see row 14 of Table 1). It describes itself as a tool that

can “publish text in the blockchain, suitably padded

for easy recovery with strings”.⁸ The tool appears to

have been used to upload its own source code to the

blockchain. Peter Todd’s tool takes a text file as input, and uses the P2SH ignored OP_PUSHDATA technique (see Section 3.4) to embed the contents of the file into the input scripts of a single transaction. The reason this encoding lends itself well to strings-based extraction is that it allows large amounts of arbitrary data to be stored with minimal interruption by non-ASCII bytes. Input scripts are stored contiguously in transaction data, meaning that the only necessary interruptions will be the minimal set of Bitcoin script opcodes required to ensure that the transaction is considered valid by the network.

⁷ man7.org/linux/man-pages/man1/strings.1.html

⁸ github.com/petertodd/python-bitcoinlib/blob/master/examples/publish-text.py#L48

Type    Technique        Field size  Hosted in  UTXO bloating  1st item    # items    Tot. size    Avg. size
Single  Value            8           Tx output  No             N/A         N/A        N/A          N/A
        Input Sequence   4           Tx input   No             2011/02/25  1,305,372  5,221,488    4
        P2PK-P2PKH       20,33,65    Script     Yes            2013/03/16  66,762     1,335,240    20
        P2SH             520         Script     Yes            2013/04/10  1,578      31,560       20
        OP_RETURN        80          Script     No             2014/03/12  2,903,186  76,700,965   26
        Vanity Address   20          Script     No             N/A         N/A        N/A          N/A
        Coinbase         2–100       Tx input   No             2009/01/03  305,763    18,442,641   60.3
        Total            -           -          -              2009/01/03  4,582,661  101,731,894  22
Multi   Multi-signature  Variable    Script     Transient      2013/04/06  15,067     2,926,590    194
        Multi-in/out     Variable    Variable   Variable       2013/03/16  529        4,437,616    8,389
Chains  Tx chains        Variable    Variable   Variable       2013/04/06  60         3,470,870    57,848

Table 2: Statistics about embedding techniques (sizes are in bytes).

Compared to the other methods, strings-based extraction offers the lowest barrier to entry. Thus, users encoding large quantities of plaintext data that are intended to be easily discoverable should take note of encoding methods that lend themselves well to this technique.

File signature. Many ﬁle formats require the inclusion

of speciﬁc bytestrings that are common to all ﬁles of a

given format. For example, many JPEG images begin

with the bytestring 0xffd8ffe000104a464946000101.

Similarly, ASCII-armored PGP messages begin with

-----BEGIN PGP MESSAGE-----. These bytestrings of-

ten occur in the header or footer of the ﬁle, although

there are formats that place them elsewhere. The prob-

ability of ﬁnding such bytestrings in Bitcoin blocks is

exceedingly low, and as such, they provide a useful in-

dicator of embedded data.
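As an illustration of signature-based detection, the following Python sketch scans a byte string for the two signatures quoted above. The function name `find_signatures` and the two-entry signature table are assumptions of this sketch; tools such as binwalk rely on far larger signature databases.

```python
# Signatures taken from the text above; the dictionary keys are illustrative labels.
SIGNATURES = {
    "JPEG (JFIF)": bytes.fromhex("ffd8ffe000104a464946000101"),
    "PGP armored message": b"-----BEGIN PGP MESSAGE-----",
}

def find_signatures(data: bytes, signatures=SIGNATURES):
    """Return a sorted list of (offset, label) for every signature occurrence."""
    hits = []
    for label, magic in signatures.items():
        start = data.find(magic)
        while start != -1:
            hits.append((start, label))
            start = data.find(magic, start + 1)
    return sorted(hits)

# Toy input: five padding bytes, a JPEG header, two filler bytes, a PGP header.
blob = (b"\x00" * 5 + SIGNATURES["JPEG (JFIF)"] + b"xx"
        + SIGNATURES["PGP armored message"])
print(find_signatures(blob))
```

A scan of this kind only reports offsets in the raw data; mapping an offset back to a block and a transaction requires a parser that understands the block format.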

We used several tools to detect ﬁle signatures present

in Bitcoin transactions:

binwalk is an extensible tool for discovering valid ﬁles

embedded into other data [41]. It provides a lan-

guage for deﬁning ﬁle signatures, as well as a large

database of pre-deﬁned signatures for common ﬁle

formats. It also has the ability to carve detected ﬁles

out of the surrounding binary data. One can pro-

duce a number of valid results simply by running

binwalk on the Bitcoin Core .dat ﬁles. However,

since the tool is unaware of the Bitcoin block for-

mat, it is only suitable for recovering ﬁles embedded

into a single transaction script.

binary-grep searches a collection of input ﬁles for a

single bytestring speciﬁed by the user [37]. It out-

puts the byte oﬀsets of any matches, and has a sim-

ple carving function.

local-blockchain-parser provides a grep command

that, unlike binwalk and binary-grep, is aware of

the Bitcoin block format [38], skipping the parts of

the transaction that cannot embed metadata. For

each match, it outputs the block and transaction

hashes, script type (input/output), and byte oﬀset.

One of the most successful workﬂows we discovered

for recovering binary ﬁles based on ﬁle signatures was

the following. (i) We ran binwalk and/or binary-grep

on a .dat ﬁle, making note of any results that appeared

to be true positives. (ii) If there were promising re-

sults, we then ran the binary-grep subcommand of

local-blockchain-parser on that .dat ﬁle, obtaining

the transaction hashes where those results were found.

(iii) For each resulting transaction, we manually in-

spected the transaction graph around it. If it appeared

to be an isolated transaction, we ran the tx-info sub-

command of local-blockchain-parser. If it appeared

to be a part of a chain, we ran the tx-chain subcom-

mand instead. (iv) We inspected the binary output from

the previous step, performing manual carving where

necessary, and we attempted to ascertain the validity

of the results by opening them with applications ap-

propriate to their ﬁle type.

Protocol Identifier. Many protocols mark their metadata by writing a specific string in the first few bytes of each chunk, but the exact number of bytes may vary from protocol to protocol. In Section 5 we take advantage of this for associating metadata to protocols. Furthermore, since protocols give a detailed description of the format of the elements produced, in Section 4.2 we distinguish different types of metadata and classify them. Hence, in order to associate metadata to protocols we: (i) search the web for known associations between identifiers and protocols; (ii) accordingly classify strings beginning with one of the identifiers obtained. In more detail, in the first step we query Google to obtain public identifier/protocol bindings. For instance, since several protocols use the OP_RETURN technique, we execute the query “Bitcoin OP_RETURN”, which returns ∼26,500 results, and we manually inspect the first few pages of them. Note that a protocol can be associated with more than one identifier (e.g., Stampery, Blockstore), or with none at all. In this way we obtain 45 protocols associated to 39 identifiers; further, we find several protocols that do not use any identifier (e.g., Diploma, Chainpoint). We also distinguish the main types of metadata produced by protocols (e.g. Text, Hash and Record). The second step is performed by our tool: it associates chunks of metadata to a protocol. The full list of protocols discovered is shown in Table 5; identifiers are listed in Table 4.
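The second step can be sketched as a prefix lookup. The Python fragment below is not the authors' tool: the function name `classify` is hypothetical, only a handful of the identifiers of Table 4 are included, and we assume that identifiers appear as ASCII bytes at the start of the chunk.

```python
# A few identifier/protocol bindings from Table 4 (ASCII encoding assumed).
IDENTIFIERS = {
    b"CNTRPRTY": "Counterparty",
    b"DOCPROOF": "Proof of Existence",
    b"Factom!!": "Factom",
    b"ASCRIBE": "Ascribe",
    b"omni": "Omni",
    b"SPK": "CoinSpark",
    b"OA": "OpenAssets",
    b"EW": "Eternity Wall",
}

def classify(chunk: bytes, identifiers=IDENTIFIERS) -> str:
    """Return the protocol whose identifier prefixes the chunk, if any."""
    if not chunk:
        return "Empty"
    # Try longer identifiers first, so that a short one cannot shadow them.
    for prefix in sorted(identifiers, key=len, reverse=True):
        if chunk.startswith(prefix):
            return identifiers[prefix]
    return "Unclassified"

print(classify(b"DOCPROOF" + bytes(32)))   # Proof of Existence
print(classify(b"\x01\x02\x03"))           # Unclassified
```

Two-byte identifiers such as OA or EW can match unrelated data by chance, so a prefix match on a short identifier is better treated as a hint to be confirmed against the protocol's documented format.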

Transaction chains. Although all spent transaction out-

puts in the Bitcoin blockchain naturally form a chain

structure, identifying chains containing embedded meta-

data is not entirely straightforward. A transaction may

have certain “giveaway” characteristics that suggest the

presence of a chain containing data, such as:

1. One or more unspendable outputs (i.e., OP_RETURN outputs), plus a single spent output. The unspendable output(s) would contain data, while the spent output would be used to continue the chain.

2. One or more unspent outputs (possibly used for a

P2PKH embedding), plus a single spent output. The

unspent output(s) would contain data, while the

spent output would be used to continue the chain.

3. The unspent outputs, if any, contain a tiny amount

of Satoshis (such outputs are also known as dust).

Except for the Bit-Comm protocol, which uses out-

put values to order the data in the output scripts,

the funds included into outputs that can never be

spent are eﬀectively “burnt”, and add no informa-

tion to the embedded data. This disincentivizes the

embedder from including any more value than is

strictly necessary to create a valid transaction.

4. The spent output contains a relatively large amount

of bitcoins, used to fund further dust outputs in sub-

sequent links in the chain.

5. Preceding or subsequent transactions share a simi-

lar structure with the transaction in question. Many

of the transaction chains we found appeared to have

been constructed with the help of software (e.g. the

Python source we extracted). The software we found

tends to create strings of transactions sharing a sim-

ilar format. While it is altogether possible to em-

bed data into chains of dissimilar transactions, they

would be diﬃcult to ﬁnd and complex to decode.

These are helpful clues, but not deﬁnitive criteria. In

fact, there are many other types of transactions which

possess the characteristics described above. For exam-

ple, payouts from mining pools and Bitcoin casinos of-

ten send small amounts of bitcoins to many users at

once. These payout transactions are often constructed

algorithmically (according to some set of “threshold”

rules intended to minimize the impact of the fee on the

payout), meaning that preceding and following trans-

actions share a similar structure.
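The clues listed above can be combined into a coarse filter. The following Python sketch is only an illustration: the `Output` record, the function `is_chain_candidate` and the dust threshold `DUST_LIMIT` are hypothetical, and the check covers clues 1 to 4 for a single transaction.

```python
from dataclasses import dataclass
from typing import List

DUST_LIMIT = 10_000  # satoshis; hypothetical threshold chosen for this sketch

@dataclass
class Output:
    value: int                  # amount in satoshis
    spent: bool                 # whether a later transaction redeems it
    is_op_return: bool = False  # provably unspendable output

def is_chain_candidate(outputs: List[Output], dust_limit: int = DUST_LIMIT) -> bool:
    """True if the outputs look like one link of a data-carrying chain."""
    spent = [o for o in outputs if o.spent]
    carriers = [o for o in outputs if not o.spent]
    # Clues 1-2: exactly one spent output, plus at least one data carrier.
    if len(spent) != 1 or not carriers:
        return False
    # Clue 3: carriers are OP_RETURN outputs or hold only dust.
    carriers_are_dust = all(o.is_op_return or o.value <= dust_limit for o in carriers)
    # Clue 4: the spent output holds the funds for the following links.
    funds_next_link = spent[0].value > max(o.value for o in carriers)
    return carriers_are_dust and funds_next_link

chain_link = [Output(546, False), Output(546, False), Output(5_000_000, True)]
pool_payout = [Output(5_000, False)] * 3 + [Output(900_000, True)]
print(is_chain_candidate(chain_link), is_chain_candidate(pool_payout))
```

Both toy transactions pass the filter, although only the first one is meant to carry data: this mirrors the remark that the clues are helpful but not definitive.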

Therefore, it is generally necessary to have some un-

derstanding of the embedded data in order to determine

whether a given chain is of interest. If a transaction con-

tains a ﬁle signature for a ﬁle type that is unlikely to ﬁt

into the data provided by that transaction, it warrants

further investigation.

Extraction of data from transaction chains is rela-

tively easy when using the local-blockchain-parser

utility. This utility has a tx-chain subcommand that

takes a single transaction hash and crawls backwards

and forwards through the transaction graph, collect-

ing data from the transaction scripts. This data is ﬁl-

tered and permuted to account for the various ways in

which transaction chains are constructed. Finally, the

data from each transaction are concatenated in the or-

der that they appear in the chain. This process yields a

collection of binary ﬁles corresponding to the diﬀerent

ways in which data can be embedded into a chain.
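The crawl-and-concatenate idea can be sketched on a toy in-memory graph. This is not the code of local-blockchain-parser: the dictionary layout (`prev`, `next`, `data`) and the function `reconstruct_chain` are assumptions made for illustration, and the sketch follows a single linear chain without the filtering and permutation steps.

```python
def reconstruct_chain(txs: dict, start: str) -> bytes:
    """Walk back to the head of the chain containing `start`,
    then walk forwards concatenating the embedded data."""
    head = start
    while txs[head]["prev"] is not None:   # crawl backwards
        head = txs[head]["prev"]
    parts = []
    current = head
    while current is not None:             # crawl forwards
        parts.append(txs[current]["data"])
        current = txs[current]["next"]
    return b"".join(parts)

# Toy chain of three transactions; the identifiers are placeholders.
toy_chain = {
    "tx_a": {"prev": None,   "next": "tx_b", "data": b"hello"},
    "tx_b": {"prev": "tx_a", "next": "tx_c", "data": b", "},
    "tx_c": {"prev": "tx_b", "next": None,   "data": b"world"},
}
print(reconstruct_chain(toy_chain, "tx_b"))
```

Starting from any transaction of the chain yields the same byte string, which is why a single transaction hash found through a signature search is enough as an entry point.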

4.2 Types of metadata

We associate the successfully reconstructed data items

to one of the following types:

Text Users have embedded a signiﬁcant amount of text,

since the very ﬁrst message by Satoshi Nakamoto.

This includes several birthday wishes, love state-

ments, prayers, greetings, developer conversations,

and magnet links. Besides user messages, miners

usually embed in coinbase transactions messages for

Bitcoin-related purposes, to identify their blocks,

vote on proposals, announce what features they sup-

port. We have also identiﬁed two pdf documents:

“Bitcoin: a peer-to-peer electronic cash system” [48],

and “The ﬁrst collision for full SHA-1” [51].

Hash Many users notarize the ownership of documents

by embedding their hash on the blockchain (embed-

ding the whole document would be too expensive,

because of the required transaction fees). Some pro-

tocols notarize several documents with a single piece

of metadata; this could be the hash of the sequence

of document hashes, or the root of the Merkle tree

of the document hashes.

Financial record A common application of the Bit-

coin blockchain is to record the ownership and ex-

change of digital or physical assets. These assets are

represented as tokens, and users are identiﬁed by

their Bitcoin addresses.

cols which act as marketplaces where artists publish

and sell their ﬁles to other users.

Script Developers have embedded in the blockchain several scripts. We have found some Python scripts (e.g., the Satoshi uploader, Satoshi downloader, and Cryptograffiti uploader), Bash scripts (e.g., Password script, and OpenSSL encoder), and also some games (e.g., LinPyro, Bong ball and Lucifer).

Image The blockchain contains some small images, usu-

ally spread across chains of transactions. We have

found various ﬁle formats (PNG, JPEG, and GIF).

Archive This type includes compressed archives, like

e.g. the WikiLeaks Cablegate gzipped archive.
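The two ways of notarizing several documents with a single piece of metadata, mentioned under the Hash type, can be compared with a short Python sketch. The helper names are hypothetical; we assume single SHA-256 hashing and, for odd-sized levels, duplication of the last node as in Bitcoin's own Merkle trees. Actual protocols may make different choices.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def hash_of_hashes(doc_hashes):
    """Hash of the concatenated sequence of document hashes."""
    return sha256(b"".join(doc_hashes))

def merkle_root(doc_hashes):
    """Root of a binary Merkle tree built over the document hashes."""
    if not doc_hashes:
        raise ValueError("at least one document hash is required")
    level = list(doc_hashes)
    while len(level) > 1:
        if len(level) % 2:             # odd level: duplicate the last node (assumption)
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

docs = [b"contract.pdf contents", b"diploma.pdf contents", b"photo.png contents"]
hashes = [sha256(d) for d in docs]
print(hash_of_hashes(hashes).hex())
print(merkle_root(hashes).hex())
```

Both values fit in 32 bytes. The practical difference is that a Merkle root allows proving the inclusion of one document by revealing a logarithmic number of sibling hashes, whereas the hash of the sequence requires all the document hashes.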

4.3 Statistics on types of metadata

Table 3 shows some statistics about the types of metadata we reconstructed. The second column indicates the day on which the first piece of metadata of the corresponding type appeared in the blockchain. Next, we show the total number of elements found, followed by their total size in bytes, and their average size.

Note that the total size of reconstructed metadata is less than the 101,731,894 bytes reported in Table 2: this is because Table 2 also includes bitstrings that we did not manage to decode. For instance, the bytes embedded with OP_RETURN are always considered in Table 2, but they appear in Table 3 only if we are also able to recognize their type (e.g., because their prefix reveals which protocol has produced them, among those in Table 4). From the rightmost column we see that the average size of scripts, images and archives exceeds the maximum size of Bitcoin scripts; hence, these metadata are embedded through “Multi-in/out” or “Tx chains” techniques. Finally, note that although the first financial record appeared only in May 2014, this type of metadata now constitutes ∼70% of the reconstructed elements, and it uses the majority of the space.

5 Analysis of Bitcoin-based protocols

In this section we focus on protocols which embed metadata on the Bitcoin blockchain. We first propose a rough taxonomy of protocols, which categorizes them according to the application domain. Then, considering the

collection of protocols reported in a previous paper [34],

we perform several analyses on their usage of metadata.

5.1 Types of protocols

Our taxonomy classiﬁes protocols in ﬁve categories:

Financial includes protocols that manage assets, e.g.

for certifying their ownership, endorsing their value,

and keeping track of trades. Metadata in these trans-

actions are used to specify the value of the asset, the

amount of the asset transferred, the new owner, etc.

Notary includes protocols that certify the ownership

and timestamp of documents. These protocols allow

users to publish the hash of a document in a transac-

tion, thus proving its existence and integrity. Since

the transaction is signed with a private key, users

can also certify the ownership of the document.

DRM includes protocols for declaring access rights

and copyrights on digital art documents, like e.g.

images or audio ﬁles.

Message groups protocols which record text messages.

Subchain gathers protocols which construct transac-

tion chains to record execution traces of third-party

smart contracts.

We now briefly comment on this taxonomy. Although

Notary and DRM protocols have the same overall

goal — certifying the ownership of documents — they

have some relevant diﬀerences. First, Notary protocols

do not usually require the original document (yet, they

ask that the document hash is provided by the owner);

further, their goal can be fulﬁlled also when their front-

end is no longer online. Conversely, DRM protocols

usually need to gather user documents, and have com-

plex front-ends to enable further interactions with users

(e.g., they often play the role of broker between media

producers and consumers). The ordering of metadata

embedded by Notary, DRM and Message protocols

is immaterial; instead, diﬀerent orderings in Financial

and Subchain protocols usually imply diﬀerent system

states. Indeed, transactions used by Financial proto-

cols are analogous to Bitcoin transactions, except that

they transfer tokens instead of bitcoins; depending on

the current balance of tokens, appending a transaction

may result in a state update, or even leave the state

unchanged (e.g., if an attacker attempts to sell assets

that she does not currently own). Subchain proto-

cols share the same mechanism, but they generalize to-

ken exchange to more complex computations, like those

arising from the execution of smart contracts (e.g., in

the RSK platform).
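The dependence of the state on the ordering of metadata can be made concrete with a toy token ledger. The sketch below is not the state machine of any specific Financial protocol: the function `replay` and the tuple format `(sender, receiver, amount)` are assumptions made for illustration.

```python
def replay(transfers, initial):
    """Apply token transfers in order; invalid transfers leave the state unchanged."""
    balances = dict(initial)
    for sender, receiver, amount in transfers:
        if amount <= 0 or balances.get(sender, 0) < amount:
            continue  # e.g. an attempt to sell tokens the sender does not own
        balances[sender] -= amount
        balances[receiver] = balances.get(receiver, 0) + amount
    return balances

initial = {"alice": 10}
t1 = ("alice", "bob", 10)
t2 = ("bob", "carol", 10)

print(replay([t1, t2], initial))  # carol ends up with the tokens
print(replay([t2, t1], initial))  # t2 is ignored, bob keeps the tokens
```

The same two transfers produce different final states depending on their order, which is why Financial and Subchain protocols rely on the ordering of transactions provided by the blockchain.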

Type              1st item    # items    Tot. size   Avg. size
Text              2009/01/03  309,894    18,811,329  61
Hash              2013/12/18  200,832    7,617,392   38
Financial record  2014/05/03  1,430,071  37,699,809  26.36
Script            2013/04/06  10         138,149     13,815
Image             2013/03/17  108        1,523,529   14,107
Archive           2013/04/06  12         2,838,760   236,563
TOTAL             2009/01/03  2,054,575  72,132,138  35

Table 3: Statistics on types of metadata (sizes are in bytes).

5.2 Statistics on Bitcoin-based protocols

Table 5 shows some detailed statistics about protocols. The first and second columns indicate, respectively, the protocol type and name. We use an additional type, called Empty, to gather the transactions which use OP_RETURN without embedding any metadata. The third and fourth columns show the type of metadata and the embedding technique. The fifth column shows when the protocol generated the first chunk of metadata; since transactions do not carry a timestamp, for this purpose we use the timestamp of the enclosing block. The next two columns count the total number of elements produced by a protocol, and the total size (in bytes) of the embedded metadata (net of script instructions and other transaction fields). The rightmost column shows the average size of the metadata.
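Reading the timestamp of the enclosing block amounts to decoding one field of the 80-byte block header. The Python sketch below uses the header of the genesis block as input; the function name `block_time` is an assumption of this sketch.

```python
import struct
from datetime import datetime, timezone

def block_time(header: bytes) -> int:
    """Return the Unix timestamp stored in an 80-byte Bitcoin block header."""
    if len(header) != 80:
        raise ValueError("a block header is exactly 80 bytes long")
    # Little-endian fields: version, previous block hash, Merkle root, time, bits, nonce.
    _version, _prev, _root, timestamp, _bits, _nonce = struct.unpack("<I32s32sIII", header)
    return timestamp

# Header of the genesis block, assembled field by field.
GENESIS_HEADER = bytes.fromhex(
    "01000000" + "00" * 32
    + "3ba3edfd7a7b12b27ac72c3e67768f617fc81bc3888a51323a9fb8aa4b1e5e4a"
    + "29ab5f49" + "ffff001d" + "1dac2b7c"
)

ts = block_time(GENESIS_HEADER)
print(datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y/%m/%d"))
```

The date obtained for the genesis block, 2009/01/03, is the one reported in the tables as the first item of the Coinbase technique and of the Text type.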

We were able to associate to protocols ∼53.7MB of metadata, which is considerably less than the total amount extracted (∼99MB). This difference has several causes. First, users often embed metadata not related to any protocol; for instance, this is the case for several images and text messages. Second, several protocols make it impossible, for an external observer, to recognize their chunks of metadata (unlike the protocols in Table 4, which mark their metadata with an identifier): indeed, we have discovered 19 protocols that embed metadata without any identifier. Finally, our list of protocols may be incomplete, so if some other protocols embed metadata with OP_RETURN, we count their items but we cannot classify them. We note a relevant component of Empty transactions (∼10% of the total OP_RETURN transactions), which use OP_RETURN without any data attached, so they are not associated to any protocol. We estimate that ∼96% of these transactions are related to the peaks discussed later on in Section 6.4. The fifth column of Table 5 suggests that, originally, the protocols were of Financial and Notary type, while the other use cases were introduced subsequently (indeed, the other types did not appear before the end of 2014).

Type       Protocol            Identifiers
Financial  Colu                CC
           CoinSpark           SPK
           OpenAssets          OA
           Omni                omni
           Openchain           OC
           Helperbit           HB
           Counterparty        CNTRPRTY
Notary     Factom              Factom!!, FACTOM00, Fa, FA
           Stampery            S1, S2, S3, S4, S5, S6
           Proof of Existence  DOCPROOF
           Blocksign           BS
           CryptoCopyright     CryptoTests-, CryptoProof-
           Stampd              STAMPD##
           BitProof            BITPROOF
           ProveBit            ProveBit
           Remembr             RMBd, RMBe
           OriginalMy          ORIGMY
           LaPreuve            LaPreuve
           Nicosia             UNicDC
           SmartBit            SB.D
           Notary              Notary
DRM        Monegraph           MG
           Blockai             0x1f00
           Ascribe             ASCRIBE
Message    Eternity Wall       EW
           BitAlias            BALI
Subchain   Blockstore          id, 0x5888, 0x5808

Table 4: Protocol identifiers. Counterparty metadata must first be deobfuscated with ARC4 encryption, using the transaction identifier of the first unspent transaction output as the encryption key.
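The deobfuscation step mentioned in the caption of Table 4 can be sketched with a plain implementation of the ARC4 stream cipher. The function names, the placeholder transaction identifier and the byte order in which the identifier is used as a key are assumptions of this sketch, not details taken from the Counterparty specification.

```python
def arc4(key: bytes, data: bytes) -> bytes:
    """ARC4 stream cipher; encryption and decryption are the same operation."""
    state = list(range(256))
    j = 0
    for i in range(256):                       # key scheduling
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    out = bytearray()
    i = j = 0
    for byte in data:                          # keystream generation
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)

def deobfuscate(payload: bytes, txid_hex: str):
    """Return the plaintext if it carries the CNTRPRTY identifier, else None."""
    plain = arc4(bytes.fromhex(txid_hex), payload)   # key byte order is an assumption
    return plain if plain.startswith(b"CNTRPRTY") else None

txid = "ab" * 32                                     # placeholder, not a real txid
obfuscated = arc4(bytes.fromhex(txid), b"CNTRPRTY" + b"\x00\x01")
print(deobfuscate(obfuscated, txid))
```

Only after this step does the CNTRPRTY identifier of Table 4 become visible, so a prefix search on the raw chunk would miss these transactions.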

From Table 5 we see that the large majority of protocols use the OP_RETURN technique. Focussing on the metadata embedded with this technique, Figure 2 displays how metadata are distributed among the protocol types, and Figure 3 shows the temporal evolution of their usage, in terms of the number of metadata items published per week. Comparing Table 5 with Figure 2 we see that although most protocols are Notary, their transactions are a fraction of those produced by Financial protocols.

Type       Protocol            Metadata           Technique              1st item    # items    Tot. size   Avg. size
Financial  Colu                Financial record   OP_RETURN              2015/07/09  244,411    4,425,702   18
           CoinSpark           Financial record   OP_RETURN              2014/07/02  28,120     960,664     34
           OpenAssets          Financial record   OP_RETURN              2014/05/03  207,132    3,255,499   16
           Omni                Financial record   OP_RETURN              2015/08/10  311,605    6,249,883   20
           Openchain           Hash               OP_RETURN              2015/10/21  2,758      115,283     42
           Helperbit           Financial record   OP_RETURN              2015/09/18  33         1,251       38
           Counterparty        Financial record   OP_RETURN              2014/06/16  636,012    22,806,810  36
                                                  P2PKH                  N/A         N/A        N/A         N/A
                                                  Multi-signature        N/A         N/A        N/A         N/A
           Total               -                  -                      2014/06/16  1,430,071  37,815,092  26
Notary     Factom              Merkle root        OP_RETURN              2014/04/11  105,188    4,207,262   40
           Stampery            Merkle root, Hash  OP_RETURN              2015/03/09  74,887     2,648,102   35
           Proof of Existence  Hash               OP_RETURN              2014/04/21  5,464      218,513     40
           Blocksign           Hash               OP_RETURN              2014/08/04  1,477      55,676      38
           CryptoCopyright     Hash               OP_RETURN              2014/08/02  46         1,840       40
           Stampd              Hash               OP_RETURN              2015/01/03  562        22,427      40
           BitProof            Hash               OP_RETURN              2015/02/25  770        30,800      40
           ProveBit            Hash               OP_RETURN              2015/04/05  57         2,280       40
           Remembr             Hash               OP_RETURN              2015/08/25  28         1,128       40
           OriginalMy          Hash               OP_RETURN              2015/07/12  126        4,788       38
           LaPreuve            Hash               OP_RETURN              2014/12/07  68         2,663       39
           Nicosia             Hash of hashes     OP_RETURN              2014/09/12  24         840         35
           SmartBit            Merkle root        OP_RETURN              2015/11/24  8,472      304,992     36
           Notary              Hash               OP_RETURN              2017/04/11  21         798         38
           Originstamp         Hash of hashes     (Commit metadata)      2013/12/18  905        0           0
           Btproof             Hash               (Commit metadata)      N/A         N/A        N/A         N/A
           BitcoinTimestamp    Hash               Value, Multi-in/out    N/A         N/A        N/A         N/A
           Blocknotary         Merkle root        OP_RETURN              N/A         N/A        N/A         N/A
           Tangible            Hash               OP_RETURN              N/A         N/A        N/A         N/A
           Chainpoint          Merkle root        OP_RETURN              N/A         N/A        N/A         N/A
           Diploma             Hash               OP_RETURN              N/A         N/A        N/A         N/A
           Apertus             Hash               P2PKH                  N/A         N/A        N/A         N/A
           Chronobit           Hash               N/A                    N/A         N/A        N/A         N/A
           Seclytics           Hash               OP_RETURN              N/A         N/A        N/A         N/A
           Total               -                  -                      2013/12/18  198,095    7,502,109   38
DRM        Monegraph           Copyright          OP_RETURN              2015/06/28  67,286     2,464,282   37
           Blockai             Copyright          OP_RETURN              2015/01/09  670        38,327      57
           Ascribe             Copyright          OP_RETURN              2014/12/19  48,450     1,000,561   21
           Verisart            Merkle root        N/A                    N/A         N/A        N/A         N/A
           Total               -                  -                      2014/12/19  116,406    3,503,170   30
Message    Eternity Wall       Text               OP_RETURN              2015/06/24  4,129      177,916     43
           Cryptograffiti      Text               P2PKH, Multi-in/out    N/A         N/A        N/A         N/A
           BIT-COMM            Text               P2PKH, Multi-in/out    N/A         N/A        N/A         N/A
           Stone               Text, File         P2PKH, Multi-in/out    N/A         N/A        N/A         N/A
           Key.run             Magnet link        OP_RETURN              N/A         N/A        N/A         N/A
           BitAlias            Secret, Hash       OP_RETURN              -           0          0           0
           Total               -                  -                      2015/06/24  4,129      177,916     43
Subchain   Keybase             Merkle root        OP_RETURN              N/A         N/A        N/A         N/A
           Uniquebits          PGP signed hash    P2PKH, P2SH            N/A         N/A        N/A         N/A
           Blockstore          Key-Value          OP_RETURN              2014/12/10  209,422    6,068,584   29
           Catena [53]         Text               OP_RETURN, Tx chains   N/A         N/A        N/A         N/A
           Total               -                  -                      2014/12/10  209,422    6,068,584   29
Empty      Total               -                  OP_RETURN              2014/03/20  296,396    0           0
TOTAL      -                   -                  -                      2009/01/03  2,254,519  55,066,871  24

Table 5: Statistics on Bitcoin-based protocols (sizes are in bytes).

6 Discussion

In this section we discuss the impact of metadata on the

Bitcoin blockchain. We start by describing the historical

evolution of metadata, highlighting how the adoption

of the embedding techniques has varied over the years.

We then evaluate the memory and storage consumption

due to metadata, and we discuss the phenomenon of

transaction peaks.

6.1 Historical perspective

The ﬁrst piece of metadata was embedded in the gen-

esis block by Satoshi Nakamoto, through the Coinbase

technique; then, since October 2011, this technique has

been used regularly by miners. In the ﬁrst 3 years of Bit-

coin, the most used technique for embedding data was

P2PKH. Later, many protocols (e.g., Counterparty) migrated from the P2PKH to the OP_RETURN technique, and we rarely find protocols still using P2PKH. The P2PKH is now used for embedding large files with the support of the Multi-in/out, Multi-signature, and Tx chains techniques. Despite the similarity with P2PKH, the P2PK technique is less used, since P2PK scripts are considered obsolete [10]. The Input Sequence and the Value techniques are not widely adopted either, probably because of the limited space they offer with respect to other techniques. The P2SH technique is not widely used either, although there are some proposals for adopting it for Counterparty⁹.

Fig. 2: Metadata transactions by protocol type. (Pie chart; legend: Financial, Notary, DRM, Message, Subchain, Empty; slice labels: 63.4%, 8.8%, 9.3%, 13.1%, 5.2%.)

Fig. 3: Temporal evolution of metadata (transactions before April 2015 are negligible). (Plot of the number of transactions, in units of 10^4, from 04.2015 to 06.2017, for the Financial, Notary, DRM, Message and Subchain types.)

Although OP_RETURN has been part of the scripting language since the first releases of Bitcoin, originally it was considered non-standard, so transactions containing this opcode were not reliably mined. OP_RETURN became standard with Bitcoin Core 0.9.0 [8], but the release notes still state that: “This change is not an endorsement of storing data in the blockchain. The OP_RETURN change creates a provably-prunable output, to avoid data storage schemes [...] that were storing arbitrary data such as images as forever-unspendable TX outputs, bloating bitcoin’s UTXO database”. The limit for storing data with OP_RETURN was originally planned to be 80 bytes, but the first official client supporting the opcode, i.e. the release 0.9.0, allowed only 40 bytes. This animated a long debate [4,5,13,15]. From release 0.10.0 [6], nodes could choose whether or not to accept OP_RETURN transactions, and set a maximum for their size. The maximum size was then set to 80 bytes by release 0.12.0 [7]. From Table 5 we see that the majority of the applications built on top of Bitcoin embed metadata through the OP_RETURN technique; this is consistent with the data in Table 2, from which we see that ∼63% of the metadata in the blockchain have been embedded with the OP_RETURN technique (which is the most adopted one since March 2014). In the last period of our experiments, ∼40,000 new OP_RETURN transactions are published each week. Overall, OP_RETURN transactions amount to ∼1.18% of the total number of transactions (∼1.37% when considering the portion of the blockchain from 2014/03/12, when the first OP_RETURN transaction appeared)¹⁰.

⁹ See counterpartytalk.org/t/cip-proposal-p2sh-data-encoding/2169.

¹⁰ Despite the 5 years of delay, this percentage is quite close to that for the whole blockchain: this is because the number of daily transactions has largely increased since July 2014.

6.2 UTXO bloating

As remarked in Section 3, the embedding techniques

P2PK, P2PKH, and P2SH (which are often used to

embed media ﬁles), produce unspendable outputs. In

this way they contribute to the “UTXO bloating” ef-

fect, that deteriorates the performance of Bitcoin nodes.

In the UTXO set we have counted ∼68K unspendable outputs which are used to embed chunks of metadata. Of all the transactions which embed metadata, only ∼1.49% contribute to the UTXO bloating effect.

The other embedding techniques, among which OP_RETURN, do not bloat the UTXO set (even though they still affect the total size of the blockchain). This, together with the possibility of embedding up to 80 bytes of metadata, is perhaps the reason for the popularity of the OP_RETURN technique. Indeed, we see from Table 5 that OP_RETURN is used by the large majority of Bitcoin-based protocols. Note that the other techniques which avoid the UTXO bloating effect are not suitable to be used by protocols, either because they have a low bandwidth (Coinbase), or because they do not allow embedding enough bytes (from Table 5 we see that, on average, protocols need to embed 24 bytes of metadata).

6.3 Space consumption

A debated topic in the Bitcoin community is whether it is acceptable or not to save arbitrary data in the blockchain. From Table 5 we can see that the net size of metadata is ∼99MB. In the same period of observation, the size of the whole blockchain is ∼125GB, so the size of metadata amounts to ∼0.077% of the total size of transactions.

For the most widespread embedding method, OP_RETURN, Figure 4a shows the average length of the metadata of each week. Generally, the average length of metadata is less than 40 bytes, despite the extension to 80 bytes introduced on 2015/07/12. The dips in the same period are related to the Empty transactions, discussed later on in Section 6.4. Figure 4b represents the number of OP_RETURN transactions with a given data length: this chart also confirms that only a small number of transactions use more than half of the available space. Note that the discussed peak also appears in this chart, at the value 0. From the last column of Table 5 we see that even the protocol which embeds the largest number of bytes (Blockai, with 57 bytes on average) requires much less than the 80 bytes available with OP_RETURN. Several Notary protocols take 40 bytes on average: 16 bytes for their identifiers, and the remaining bytes for the hash they save. Generally, Notary protocols carry longer metadata than the other protocols.
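A length distribution like the one of Figure 4b can be obtained by extracting the payload of each OP_RETURN output script and counting payload lengths. The Python sketch below is an illustration under simplifying assumptions: the function name `op_return_payload` is hypothetical, and only scripts with at most one data push (direct push or OP_PUSHDATA1) are handled.

```python
from collections import Counter

OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C

def op_return_payload(script: bytes):
    """Return the data pushed after OP_RETURN, b"" for an empty output,
    or None if the script is not a (single-push) OP_RETURN script."""
    if not script or script[0] != OP_RETURN:
        return None
    if len(script) == 1 or script[1] == 0x00:
        return b""                               # Empty: no data attached
    opcode = script[1]
    if 0x01 <= opcode <= 0x4B:                   # direct push of `opcode` bytes
        start, length = 2, opcode
    elif opcode == OP_PUSHDATA1 and len(script) >= 3:
        start, length = 3, script[2]             # one-byte length follows
    else:
        return None
    payload = script[start:start + length]
    return payload if len(payload) == length else None

scripts = [
    bytes([OP_RETURN]),                          # Empty transaction
    bytes([OP_RETURN, 2]) + b"EW",               # 2-byte payload
    bytes([OP_RETURN, OP_PUSHDATA1, 80]) + bytes(80),
]
histogram = Counter(len(p) for p in map(op_return_payload, scripts) if p is not None)
print(sorted(histogram.items()))
```

In such a histogram the Empty transactions fall in the bucket of length 0, which corresponds to the peak at 0 noted above.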

We now estimate the overall size of OP_RETURN transactions (including both the metadata and the other parts of the transaction). The size of an Empty transaction with one input and one output is 156 bytes. From Table 2 we see that OP_RETURN transactions carry 26 bytes of metadata, on average. We then approximate the average size of an OP_RETURN transaction as 182 bytes. Multiplying by the number of OP_RETURN transactions, we obtain an approximation of their space consumption as ∼503MB.
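The arithmetic behind these estimates can be replayed in a few lines of Python. The constants are those reported in the text and in Table 2; reading MB and GB as powers of two (2^20 and 2^30 bytes) is an assumption of this sketch, under which both reported figures are reproduced.

```python
EMPTY_TX_SIZE = 156            # bytes, one input and one output (from the text)
AVG_METADATA = 26              # bytes, average OP_RETURN payload (Table 2)
NUM_OP_RETURN = 2_903_186      # number of OP_RETURN items (Table 2)

avg_tx_size = EMPTY_TX_SIZE + AVG_METADATA          # 182 bytes
total_bytes = avg_tx_size * NUM_OP_RETURN
print(total_bytes, "bytes =", round(total_bytes / 2**20, 1), "MB")

# Share of metadata in the blockchain: ~99MB out of ~125GB.
metadata_share = (99 * 2**20) / (125 * 2**30)
print(round(metadata_share * 100, 3), "%")
```

With decimal units the first figure would be closer to 528MB and the second to 0.079%, so the choice of unit matters only for the last digits of these approximations.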

6.4 Transaction peaks

Figure 5 represents peaks of OP_RETURN transactions from 2014/03 (date of the first OP_RETURN transaction) to 2017/08. For each week, it shows (i) the number of Empty transactions, (ii) the number of OP_RETURN transactions which are not attributed to any protocol (among those in our collection), and (iii) the total number of OP_RETURN transactions. In the graph we note several peaks, which we explain as follows:

1. ∼100K transactions from 2015/07/08 to 2015/08/05. This peak is mainly composed of two different peaks of Empty transactions: the July peak (∼37K transactions from 2015/07/08 to 2015/07/10) and the August peak (∼29K transactions from 2015/08/01 to 2015/08/03). Both peaks coincided with stress tests and spam campaigns [32]¹¹.

2. ∼300K transactions from 2015/09/09 to 2015/09/23. This second peak is the highest and longest-lasting one. As before, it is mainly caused by Empty transactions (∼223K), although here we also observe a component of Unclassified and Blockstore transactions (∼35K each). The work [32] also detects a spike in this period, precisely around 2015/09/13, when an anonymous group performed a stress test on the network with a money drop. This involves a public release of private keys, with the aim of causing a big race and a consequent large number of double-spend transactions. More specifically, people used the private keys to transfer to themselves the bitcoins redeemable with these keys; since many people tried to perform these transfers simultaneously, the network was flooded with many transactions trying to double-spend the same outputs. The confirmed transactions caused a peak, which happened simultaneously with the peak of OP_RETURN transactions we measured.

3. ∼50K transactions from 2016/03/02 to 2016/03/09. This last peak is given by the sum of two different peaks: Unclassified (∼18K) and Stampery (∼23K) transactions. The part of the peak caused by Stampery can be explained as follows. Being a notarization protocol, Stampery receives document hashes off-chain, and subsequently embeds these hashes in transactions. Since Stampery has only a few transactions before 2016/03/02 (probably used for testing), we conjecture that the peak coincides with its bootstrap, when the protocol published on the blockchain all the transactions related to the hashes accumulated off-chain. The other part of the peak could be due to the bootstrap of other protocols.

Footnote 11: We conjecture that Empty transactions are caused by these events. To verify this conjecture we would need to compare the transaction identifiers of our Empty transactions with the identifiers of [32], which are not publicly available.

[Fig. 4: Usage and size of OP_RETURN transactions. (a) Size of metadata over time (x-axis: time interval, 03.2014 to 06.2017; y-axis: average number of bytes). (b) Number of transactions by size of metadata (x-axis: number of bytes, 0 to 80; y-axis: number of transactions, ·10^5).]

[Fig. 5: Transactions (and transaction peaks) over time (x-axis: 03.2014 to 06.2017; y-axis: number of transactions, ·10^5; series: Empty, Unclassified, All).]
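The peak windows listed above can be located mechanically from a series of daily transaction counts. The following is a minimal sketch, assuming the counts are already available as a mapping from day to number of OP_RETURN transactions; the function name `find_peaks`, the threshold, and the toy data are illustrative assumptions and are not taken from the paper or its dataset.

```python
from datetime import date, timedelta


def find_peaks(daily_counts, threshold):
    """Group consecutive days whose count is >= threshold into peak windows.

    Returns a list of (first_day, last_day, total_transactions) tuples.
    """
    peaks = []
    run = []  # days of the peak window currently being built

    def flush():
        # Close the current window (if any) and record its total.
        if run:
            peaks.append((run[0], run[-1], sum(daily_counts[d] for d in run)))

    for day in sorted(daily_counts):
        above = daily_counts[day] >= threshold
        contiguous = bool(run) and day - run[-1] == timedelta(days=1)
        if above and (not run or contiguous):
            run.append(day)
        else:
            flush()
            # A day above threshold after a gap starts a new window.
            run = [day] if above else []
    flush()
    return peaks


# Synthetic daily counts (NOT the paper's data), only to exercise the function.
toy_counts = {
    date(2015, 9, 8) + timedelta(days=i): count
    for i, count in enumerate([900, 21000, 23000, 19000, 800, 700])
}
peaks = find_peaks(toy_counts, threshold=10000)
```

A fixed threshold is the simplest possible choice; a threshold relative to a moving average would cope better with the long-term growth in transaction volume.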

Besides the peaks of OP_RETURN transactions, we can also observe other peaks: for instance, for a duration of 100 blocks starting from 2015/05/22, Bitcoin was targeted by a stress test [2], during which the network was flooded with a large number of transactions. However, the usage of OP_RETURN transactions in this period does not seem to deviate from their normal usage.

7 Related works

There is a growing literature on the analysis of the Bitcoin blockchain [32,45,47,49,50], and there are also some online services which compute statistics on Bitcoin metadata [3,17,20,24]. Below, we group the related works into three categories.

The first category includes online services related to Bitcoin metadata. The website opreturn.org shows some statistics about OP_RETURN transactions, organised by protocol, and statistics about their usage in a certain time frame. The website smartbit.com recognises some OP_RETURN protocols and shows statistics on them. Finally, the website kaiko.com sells data about OP_RETURN transactions.

The second category contains the works on embedding techniques. To the best of our knowledge, besides our work, this category includes only [26,46], which have been developed concurrently and independently from ours. Despite the common goals, the works [26,46] differ from ours in several aspects: (i) the “Tx chains” methods and the techniques for committing metadata are described only in our work; (ii) only our work and [46] extract and quantify the embedded metadata; (iii) the P2SH techniques are detailed in [26]. Further differences between our work and [46] are discussed below.

The third category includes the works which analyse the types of metadata, such as those in Section 4. Also in this case, the work [46] is the closest to ours: the main difference between the two works is that, while [46] is focussed on discussing the benefits and risks related to metadata (e.g. privacy violations, illegal contents), we develop a protocol-wise analysis, measuring how much (and when) metadata is embedded by each protocol, and studying which use cases they support. Further, we recognize a few types of metadata (hash, financial records, and copyright) which are not dealt with by [46].

8 Conclusions

Although Bitcoin does not explicitly support embedding metadata into transactions, over the years users have devised various techniques to reach this goal. After illustrating and comparing these techniques, we have extracted all the metadata embedded up to 2017/08/10 (first 480,000 blocks), measuring the data stored by each technique. By processing the bytes extracted from transaction metadata, we have often managed to reconstruct the original content. Overall, we have reconstructed ∼69MB of documents, out of the ∼99MB embedded in total. We have classified these documents, finding that the majority of them are records produced by financial protocols. We have also reconstructed 120 files of various kinds (among which, 108 images), for a total size of ∼4MB.
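One simple way to look for embedded files in a stream of extracted bytes is to scan for well-known file signatures ("magic numbers"). The sketch below illustrates that general idea only; it is not the authors' tooling, and the names `SIGNATURES` and `scan_signatures` are hypothetical. The byte signatures themselves are the standard ones for PNG, JPEG, GIF and PDF.

```python
# Standard leading bytes of a few common file formats.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "gif": b"GIF8",   # prefix shared by GIF87a and GIF89a
    "pdf": b"%PDF-",
}


def scan_signatures(blob: bytes):
    """Return a sorted list of (offset, kind) for every signature found in blob."""
    hits = []
    for kind, magic in SIGNATURES.items():
        start = 0
        while True:
            pos = blob.find(magic, start)
            if pos == -1:
                break
            hits.append((pos, kind))
            start = pos + 1  # keep searching after this hit
    return sorted(hits)


# Toy byte stream: two padding bytes, a PNG header, some data, a PDF header.
sample = b"\x00\x01" + SIGNATURES["png"] + b"data" + b"%PDF-1.4"
hits = scan_signatures(sample)
```

Short signatures such as the three-byte JPEG prefix produce false positives on random data, so a hit is only a candidate: the bytes that follow still have to be parsed as the claimed format before counting the file as reconstructed.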

We have discovered 45 protocols which embed metadata into the blockchain for developing various applications. We have identified which types of metadata they produce, and which embedding techniques they use. Usually, each protocol produces one type of metadata (depending on the protocol type), using one embedding technique (most often, OP_RETURN). Overall, ∼53.7MB of metadata are produced by the protocols in our collection. The majority of protocols are for document notarization, but ∼70% of elements are produced by financial protocols.
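To make the most common case concrete (a notarization protocol embedding a document hash via OP_RETURN), the following sketch builds the corresponding output script. It is an illustration under stated assumptions, not the format of any specific protocol in the collection: the `prefix` parameter stands for a hypothetical protocol identifier, and the 80-byte cap is an assumed relay-policy limit rather than a consensus rule. The opcode values (0x6a for OP_RETURN, 0x4c for OP_PUSHDATA1) are those of Bitcoin Script.

```python
import hashlib

OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C
MAX_PAYLOAD = 80  # assumed relay (standardness) limit, not a consensus rule


def op_return_script(payload: bytes) -> bytes:
    """Serialize an output script of the form: OP_RETURN <payload>."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds the assumed 80-byte limit")
    if len(payload) <= 75:
        # Lengths up to 75 are encoded directly as a single push opcode.
        push = bytes([len(payload)])
    else:
        # Longer payloads need OP_PUSHDATA1 followed by a one-byte length.
        push = bytes([OP_PUSHDATA1, len(payload)])
    return bytes([OP_RETURN]) + push + payload


def notarization_script(document: bytes, prefix: bytes = b"") -> bytes:
    """Embed SHA-256(document), optionally preceded by a protocol prefix."""
    digest = hashlib.sha256(document).digest()
    return op_return_script(prefix + digest)


script = notarization_script(b"hello")
```

Only the 32-byte digest reaches the blockchain, so the document itself stays off-chain; anyone holding the document can recompute the hash and compare it with the embedded one.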

Finally, we have discussed the impact of embedding metadata in the blockchain, considering various aspects, such as the space consumption, the UTXO bloating effect, and the transaction peaks.

Although the official Bitcoin documentation discourages the use of the blockchain to store arbitrary data, the trend seems to be a growth in the number of applications that embed their metadata in Bitcoin transactions. We conjecture that the perceived sense of security and persistence of the Bitcoin blockchain is the main motivation to avoid using cheaper and more efficient storage. If this trend is confirmed, the specific needs of these applications could affect the future evolution of the Bitcoin protocol.

Acknowledgements We thank Nicola Atzei for the insightful discussion on a preliminary version of this paper. This work is partially supported by Aut. Reg. of Sardinia projects “Sardcoin” and “Smart collaborative engineering”, and by COST Action IC1406 cHiPSET.

References

1. Bitcoin scalability, https://en.bitcoin.it/wiki/Scalability_FAQ. Last accessed 2018/01/01
2. Bitcoin network survives surprise stress test, http://www.coindesk.com/bitcoin-network-survives-stress-test/. Last accessed 2018/01/01
3. Bitcoin OP_RETURN wiki page, https://en.bitcoin.it/wiki/OP_RETURN. Last accessed 2018/01/01
4. Bitcoin pull request 5075, https://github.com/bitcoin/bitcoin/pull/5075. Last accessed 2018/01/01
5. Bitcoin pull request 5286, https://github.com/bitcoin/bitcoin/pull/5286. Last accessed 2018/01/01
6. Bitcoin release 0.10.0, https://bitcoin.org/en/release/v0.10.0. Last accessed 2018/01/01
7. Bitcoin release 0.12.0, https://bitcoin.org/en/release/v0.12.0. Last accessed 2018/01/01
8. Bitcoin release 0.9.0, https://bitcoin.org/en/release/v0.9.0. Last accessed 2018/01/01
9. Bitcoin script interpreter, https://github.com/bitcoin/bitcoin/blob/fcf646c9b08e7f846d6c99314f937ace50809d7a/src/script/interpreter.cpp. Last accessed 2018/01/01
10. Bitcoin wiki script, https://en.bitcoin.it/wiki/Script. Last accessed 2018/01/01
11. Bitcoin wiki transaction, https://en.bitcoin.it/wiki/Transaction. Last accessed 2018/01/01
12. Colu website, https://www.colu.com/. Last accessed 2018/01/01
13. Counterparty open letter and plea to the Bitcoin core development team, http://counterparty.io/news/an-open-letter-and-plea-to-the-bitcoin-core-development-team/. Last accessed 2018/01/01
14. Counterparty website, http://counterparty.io/. Last accessed 2018/01/01
15. Developers battle over bitcoin block chain, http://www.coindesk.com/developers-battle-bitcoin-block-chain/. Last accessed 2018/01/01
16. Factom website, https://www.factom.com/. Last accessed 2018/01/01
17. Kaiko data store, https://www.kaiko.com/. Last accessed 2018/01/01
18. Omni website, http://www.omnilayer.org/. Last accessed 2018/01/01
19. Open assets website, https://github.com/OpenAssets/. Last accessed 2018/01/01
20. opreturn.org, http://opreturn.org/. Last accessed 2018/01/01
21. Proof of existence website, https://proofofexistence.com/. Last accessed 2018/01/01


22. Scalability debate ever end, https://www.cryptocoinsnews.com/will-bitcoin-scalability-debate-ever-end/. Last accessed 2018/01/01
23. Scaling debate in Reddit, http://www.coindesk.com/viabtc-ceo-sparks-bitcoin-scaling-debate-reddit-ama/. Last accessed 2018/01/01
24. Smartbit OP_RETURN statistics, https://www.smartbit.com.au/op-returns. Last accessed 2018/01/01
25. Stampery blockchain timestamping architecture, https://s3.amazonaws.com/stampery-cdn/docs/Stampery-BTA-v6-whitepaper.pdf. Last accessed 2018/01/01
26. Data insertion in Bitcoin’s blockchain (2017), http://digitalcommons.augustana.edu/cgi/viewcontent.cgi?article=1000&context=cscfaculty. Last accessed 2018/01/01
27. Andresen, G.: Block v2, height in coinbase, BIP 034, https://github.com/bitcoin/bips/blob/master/bip-0034.mediawiki. Last accessed 2018/01/01
28. Antonopoulos, A.M.: Mastering Bitcoin: unlocking digital cryptocurrencies. O’Reilly Media, Inc. (2014)
29. Atzei, N., Bartoletti, M., Cimoli, T., Lande, S., Zunino, R.: SoK: unraveling Bitcoin smart contracts. In: Principles of Security and Trust (POST). LNCS, vol. 10804, pp. 217–242. Springer (2018)
30. Atzei, N., Bartoletti, M., Lande, S., Zunino, R.: A formal model of Bitcoin transactions. In: Financial Cryptography and Data Security (2018)
31. Badertscher, C., Maurer, U., Tschudi, D., Zikas, V.: Bitcoin as a transaction ledger: A composable treatment. In: CRYPTO. LNCS, vol. 10401, pp. 324–356. Springer (2017)
32. Baqer, K., Huang, D.Y., McCoy, D., Weaver, N.: Stressing out: Bitcoin “stress testing”. In: Financial Cryptography Workshops. LNCS, vol. 9604, pp. 3–18. Springer (2016)
33. Bartoletti, M., Bracciali, A., Lande, S., Pompianu, L.: A general framework for blockchain analytics. In: Proc. 1st Workshop on Scalable and Resilient Infrastructures for Distributed Ledgers (SERIAL@Middleware). pp. 7:1–7:6. ACM (2017), https://github.com/bitbart/blockapi
34. Bartoletti, M., Pompianu, L.: An analysis of Bitcoin OP_RETURN metadata. In: Financial Cryptography Workshops. LNCS, vol. 10323, pp. 218–230. Springer (2017)
35. Bartoletti, M., Pompianu, L., Bellomy, B.: Bitcoin metadata (2018), https://doi.org/10.7910/DVN/MOLW81
36. Bartoletti, M., Zunino, R.: BitML: a calculus for Bitcoin smart contracts. In: ACM CCS (2018)
37. Bellomy, B.: Binary grep, https://github.com/spooktheducks/binary-grep
38. Bellomy, B.: Local blockchain parser, https://github.com/spooktheducks/local-blockchain-parser
39. Bistarelli, S., Mercanti, I., Santini, F.: An analysis of non-standard Bitcoin transactions. In: Crypto Valley Conference on Blockchain Technology (2018)
40. Bonneau, J., Miller, A., Clark, J., Narayanan, A., Kroll, J.A., Felten, E.W.: SoK: Research perspectives and challenges for Bitcoin and cryptocurrencies. In: IEEE Symp. on Security and Privacy. pp. 104–121 (2015)
41. devttys0: binwalk, https://github.com/devttys0/binwalk
42. Garay, J.A., Kiayias, A., Leonardos, N.: The Bitcoin backbone protocol: Analysis and applications. In: EUROCRYPT. LNCS, vol. 9057, pp. 281–310. Springer (2015)
43. Gerhardt, I., Hanke, T.: Homomorphic payment addresses and the pay-to-contract protocol. arXiv preprint arXiv:1212.3257 (2012)
44. Kosba, A.E., Miller, A., Shi, E., Wen, Z., Papamanthou, C.: Hawk: The blockchain model of cryptography and privacy-preserving smart contracts. In: IEEE Symp. on Security and Privacy. pp. 839–858 (2016)
45. Lischke, M., Fabian, B.: Analyzing the Bitcoin network: The first four years. Future Internet 8(1), 7 (2016)
46. Matzutt, R., Hiller, J., Henze, M., Ziegeldorf, J.H., Müllmann, D., Hohlfeld, O., Wehrle, K.: A quantitative analysis of the impact of arbitrary blockchain content on Bitcoin. In: Financial Cryptography and Data Security (2018)
47. Möser, M., Böhme, R.: Trends, tips, tolls: A longitudinal study of Bitcoin transaction fees. In: Financial Cryptography and Data Security. LNCS, vol. 8976, pp. 19–33. Springer (2015)
48. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf (2008)
49. Reid, F., Harrigan, M.: An analysis of anonymity in the Bitcoin system. In: Security and privacy in social networks, pp. 197–223. Springer (2013)
50. Ron, D., Shamir, A.: Quantitative analysis of the full Bitcoin transaction graph. In: Financial Cryptography and Data Security. LNCS, vol. 7859, pp. 6–24. Springer (2013)
51. Stevens, M., Bursztein, E., Karpman, P., Albertini, A., Markov, Y.: The first collision for full SHA-1. In: CRYPTO. LNCS, vol. 10401, pp. 570–596. Springer (2017)
52. Todd, P.: Delayed TXO commitments, https://petertodd.org/2016/delayed-txo-commitments. Last accessed 2018/01/01
53. Tomescu, A., Devadas, S.: Catena: Efficient non-equivocation via Bitcoin. In: IEEE Symp. on Security and Privacy. pp. 393–409 (2017)


Content available from Journal of Grid Computing
