Capstone Thesis
Sentiment Analysis To Predict Global Cryptocurrency Trends
By: Gasia Atashian, Hrachya Khachatryan
Supervised by Arsen Mamikonyan
Submitted to the College of Science and Engineering
May 2018
Abstract
This project investigates the relationship between forum discussions and bitcoin price changes. We show that there is a correlation between the bitcoin price and posts on the bitcointalk.org forum. For that purpose we collected data from the Bitcoin Discussion board of bitcointalk.org, performed sentiment analysis along with relevant statistical analysis and topic modeling, and then constructed a neural network to predict bitcoin price values.
Contents
Abstract
1. Introduction
2. Related research
3. Data Gathering and Preprocessing
3.1 Data Collection
3.2 BitcoinTalk forums
3.3 Historical Bitcoin Price data
4. Topic Models: LDA
4.1 Data Preparation and Transformation
4.2 Constructing a document-term matrix
4.3 Applying the LDA model
4.4 Results
5. Sentiment analysis
5.1 Sentiment analysis of forum posts
5.2 Bitcoin price change peaks
5.3 Relation of sentiment analysis to price changes
6. Neural Network
6.1 Model Architecture
6.2 Model validation
6.3 Feature Summary
6.4 Model results
8. Model Evaluation and Summary
9. Code Sources
References
1. Introduction
Bitcoin is a cryptocurrency that is used to make online payments and has become very popular. Moreover, it is used not only for payments but also for investment. It has therefore become important to understand how the bitcoin price changes and which factors influence its fluctuation. The rise of cryptocurrencies has greatly changed the way economic transactions are made, and besides bitcoin, several other cryptocurrencies have come into existence. The growth of bitcoin circulation has also brought many users to social media and online forums to share their opinions and reactions about the currency. People with common interests tend to make posts about certain topics. Moreover, as bitcoin is mostly traded on the Internet, traders make their decisions about buying or selling bitcoin mainly based on information obtained from that same Internet, usually from online forums and social media. Yet the relationship between such forum discussions and bitcoin price fluctuations is not well studied today. This paper therefore aims to find such relationships by performing sentiment and topic analysis of Bitcoin Discussion posts from bitcointalk.org and constructing a predictive model.
To build our model, we downloaded data from bitcointalk.org, cleaned it, and used machine learning to examine the correlation between forum posts and the bitcoin price. We scraped the data from the mentioned source (Bitcoin Discussion), then cleaned it to keep only the useful information by removing meaningless words and HTML tags and stemming the words (more on this in section 4.1). After that we constructed a topic model with Latent Dirichlet Allocation (LDA) [1], described in section 4, and performed sentiment analysis of the data. Finally, we analyzed what impact this information has on bitcoin price fluctuations by building a neural network with multiple layers (section 6).
2. Related research
Some researchers have performed sentiment analysis of forum posts without considering the information derived from cumulative user-post data gathered over a specified period [9], while others have studied online user comments.
For that purpose, topic modeling has been intensively studied as a technique for analysing user opinions and thoughts from their textual posts [10]. Topic modeling [11] is a text-mining method that extracts a collection of prevailing topics and related keywords from a large-scale document corpus. The topics give users an instant overview of the overall corpus, eliminating the need to read through all the posts, which would be a very difficult and time-consuming process.
Lately, collaborative filtering and topic modeling have been integrated to build scientific-article recommendation systems for online communities [12]. A Temporal Latent Dirichlet Allocation (TM-LDA) system was used for deep analysis of online social media by extending the Latent Dirichlet Allocation (LDA) topic model [13]. Also, applying the LDA approach to Chinese social reviews uncovered the sentiments underlying certain social events and services [14].
3. Data Gathering and Preprocessing
3.1 Data Collection
Data is information gathered from real observations. Data collection plays a vital role in every project, as the quality and cleanliness of the data directly affect the results. For this project, the data was gathered from two sources: the BitcoinTalk forums and historical bitcoin price exchange data.
3.2 BitcoinTalk forums
Surveying different social media forums, we found bitcoin-related data mostly on Reddit, Twitter and BitcoinTalk. We tried gathering and analyzing Reddit data, but the BitcoinTalk forum was more relevant to our project, as Reddit contained more generalized posts. Although many projects have performed sentiment analysis on Twitter data, we decided to use BitcoinTalk data to add more value to the overall research.
We gathered data from the https://bitcointalk.org/ forum, which consists of topics, each containing replies. Using the Scrapy library in Python [4], we scraped the HTML of each page of the BitcoinTalk forums from April 23, 2011 6:24:16 PM until May 05, 2018 01:15:00 AM, which took almost 5 hours to complete. Then we separated each topic with its replies. We preprocessed the data by cleaning the message body and separating the quote header from the message. Initially, messages contained HTML noise, such as the entity '&#039;', which we replaced with its human-readable form, the apostrophe.
We used the html package in Python, specifically its unescape function, to clean the HTML noise, and the following mappings to further clean the rest:
map_clean = {'&#039;' : "'",
             '&quot;' : '"',
             '&nbsp;' : ' ',
             '<br />' : ' ',
             '<br/>' : ' ',
             '<br >' : '',
             '<b>' : '',
             '&lt' : '<',
             '&gt' : '>',
             '&le' : '<=',
             '&ge' : '>='}
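As a minimal sketch (the function name and the order of the two cleaning passes are our illustration; html.unescape and map_clean come from the text above), the cleaning step could look like this:

import html

def clean_message(raw):
    # First decode standard HTML entities (e.g. &#039; becomes ')
    text = html.unescape(raw)
    # Then apply the remaining manual replacements from map_clean
    for noise, replacement in map_clean.items():
        text = text.replace(noise, replacement)
    return text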
Our final data frame consists of the following columns: timestamp, topic_id, topic_title, message_number, message_author, message_text, quoteheader.
Timestamp is an integer in Unix time.
Topic_id is the unique id that BitcoinTalk assigns to each topic.
Topic_title is the title of the topic.
Message_number is the position of the message (reply) within its topic.
Message_author is the author of the message (reply).
Message_text is the extracted message (reply).
Quoteheader is the quote header of the message, if there is any.
In total we have 1046382 rows, with 25047 unique topic_title (topic_id) values.
Figure 3.1. Bitcointalk forums data.
Having the data ready, we can proceed to the next step.
3.3 Historical Bitcoin Price data
We used the GDAX Python library [5] to gather bitcoin price history. Since the API returns at most 200 points per request, we had to gather the data part by part and then merge the pieces. We collected the data from 2014 until 2018 at an interval of 5 minutes.
The data contains six features: timestamp, low, high, open, close and volume.
Timestamp is the time in Unix time, which we converted to a human-readable date-time.
Low is the lowest bitcoin price in the interval; high is the highest.
Open is the opening (initial) price of the interval; close is the closing price.
Volume is the amount of trading activity in the given interval.
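As a rough sketch, the chunked download could look like this (assuming the public client of the gdax-python library; the date-range handling and the pause between requests are our illustration):

import time
from datetime import datetime, timedelta
import gdax

client = gdax.PublicClient()
candles = []
start = datetime(2014, 1, 1)
end = datetime(2018, 5, 5)
step = timedelta(minutes=5 * 200)  # at most 200 five-minute candles per request
while start < end:
    chunk = client.get_product_historic_rates(
        'BTC-USD',
        start=start.isoformat(),
        end=min(start + step, end).isoformat(),
        granularity=300)  # 300 seconds = 5 minutes
    candles.extend(chunk)  # each candle: [timestamp, low, high, open, close, volume]
    start += step
    time.sleep(0.5)  # stay under the API rate limit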
The following figure shows our data:
Figure 3.2. Bitcoin price data.
Figure 3.3. The plot of bitcoin price historical data.
As the most interesting part is from 2016 to 2018, we filtered both the historical data and the forum posts data to that range and built our model on it.
4. Topic Models: LDA
As our main goal is to find relations between forum talks and bitcoin price changes, we also used topic modeling in order to include the topic distributions of the posts as additional features in our model. Topic modeling is a way of finding groups of words (topics) from a set of documents, which can be collections of sentences, that best capture the information in each document. It can also be considered a method of text mining: a way of finding recurring patterns of words in textual data [15]. There are many such techniques; a widely used one is Latent Dirichlet Allocation (LDA), which we used to separate our posts into topics [3]. The LDA model discovers the different topics contained in a particular document and the proportion of each topic present in it. We took the bitcoin price value per 5 minutes as the label of our model; hence, as input we took the set of all posts made during each 5-minute window. The LDA model was therefore constructed on a set of documents, each being the bag of posts from one 5-minute window. We built a model with 15 topics. Then for each document we obtained the weights of the topics and added those weights as additional features to the input of the neural network.
4.1 Data Preparation and Transformation
Data preparation is very important for building a meaningful topic model, because documents may contain many nonsensical words that interfere with generating useful topics. The following transformations were performed to prepare the data; a short sketch of the resulting pipeline follows the list.
Tokenizing: converting a document to its atomic elements. In our case, we are interested in tokenizing to words [1].
Stopping: removing meaningless words. Certain parts of English speech, like conjunctions ("for", "or") or the word "the", are meaningless to a topic model. These terms are called stop words and were removed from our token list [1].
Stemming: reducing words with equivalent meaning to the same term. For example, the words "stemming", "stemmer" and "stemmed" should be interpreted as the same, so stemming reduces them to "stem". This is very important: otherwise the model would view these words as separate entities, diluting their frequency counts and distorting the choice of topics. For this purpose the Python implementation of the Porter stemming algorithm was used; it removes unnecessary parts from words using lists of English suffixes and prefixes together with their usage rules [2].
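A minimal sketch of these three steps, assuming NLTK's tokenizer, stop-word list and Porter stemmer implementation (the function name is illustrative):

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(document):
    tokens = tokenizer.tokenize(document.lower())        # tokenizing
    tokens = [t for t in tokens if t not in stop_words]  # stopping
    return [stemmer.stem(t) for t in tokens]             # stemming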
4.2 Constructing a document-term matrix
At this stage we have a tokenized, stopped and stemmed list of words. We then constructed a list of lists, one list per original document; that is, we grouped all posts of each 5-minute window into one list [3].
As already mentioned, in order to build an LDA model we need the frequencies of the terms within the documents. To obtain them we constructed a document-term matrix, which contains the word frequencies for each document. First, we assigned a unique integer id to each unique token, then collected word counts and relevant statistics. The result is an object called a corpus: a list of vectors, one per document. Each document vector is a sequence of (term id, term frequency) tuples; for example, one of the vectors could be [(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)].
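With the gensim library, these two steps could look like the following sketch (texts is assumed to be the list of per-document token lists produced in section 4.1):

from gensim import corpora

# texts: list of token lists, one per 5-minute document
dictionary = corpora.Dictionary(texts)                 # unique integer id per token
corpus = [dictionary.doc2bow(text) for text in texts]  # (term id, term frequency) pairs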
4.3 Applying the LDA model
Given the document-term matrix, we generate an LDA model using the gensim library [3] and store it. This model provides many functions that give us the needed information. For instance, we can review the topics with the print_topic and print_topics methods, and we can get the topic distribution of each document. Hence, with this model we can understand which topics were most popular in particular periods of time and what impact that has on the performance of the neural network predictions.
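A minimal sketch of training and querying the model (the number of training passes is our assumption):

from gensim.models import LdaModel

lda = LdaModel(corpus, num_topics=15, id2word=dictionary, passes=10)
print(lda.print_topics(num_topics=15, num_words=5))  # top words per topic
doc_topics = lda.get_document_topics(corpus[0])      # (topic id, weight) pairs for one document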
4.4 Results
The 15 obtained topics are represented as arrays of words with their weights. The overall picture of the topics is the following:
Figure 4.1. LDA topics.
These topic representations show that there are indeed meaningful topics in the documents. We can notice that the forum discussions concern important subjects that could plausibly influence price fluctuations, and we can even label the topics. For instance, looking at topic number 2, based on the represented words, we can conclude that it is mainly about stock investments in the market. As another example, take topic number 5: in this case we can even form a complete meaningful sentence from the provided words, such as "It is a good time to invest money and buy coins to make a profit". Another interesting topic is the 10th, which is about a future price rise.
To sum up the results of LDA topic modeling, the topics of the forum discussions are indeed related to bitcoin. Hence, it is reasonable to hypothesize that discussions around the relevant topics may play a significant role in bitcoin price fluctuations. Having become confident about the quality of the topics, we decided to include the weights of the 15 topics for each document as auxiliary features in our neural network (see section 6).
5. Sentiment analysis
5.1 Sentiment analysis of forum posts
After dividing the documents into topics and constructing the LDA model, it is natural to ask how positive the posts of each document were. For this purpose we used the TextBlob library to evaluate the documents in our data set. TextBlob returns two values: polarity (from -1 to 1), which shows how positive the text is, and subjectivity (from 0 to 1), which shows how subjective it is [6]. These two values were also added to the input data of the predictive model as auxiliary features.
TextBlob examples:

Text                                                               Polarity         Subjectivity
look way sub even perfect price buy majority people see sub coin   1 (positive)     1 (subjective)
price go halve people literally retard                             -0.9 (negative)  1 (subjective)
argument support fact general trend price dip well                 0.05 (neutral)   0.5 (less subjective)
zero loss formula review software peter morgan muler               0 (neutral)      0 (not subjective)
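Obtaining these scores with TextBlob is straightforward; a minimal sketch:

from textblob import TextBlob

blob = TextBlob("look way sub even perfect price buy majority people see sub coin")
# polarity is in [-1, 1], subjectivity in [0, 1]
print(blob.sentiment.polarity, blob.sentiment.subjectivity)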
5.2 Bitcoin price change peaks
We followed a general algorithm for finding local minima and maxima in the bitcoin price change data. According to this algorithm, a point is considered a maximum peak if it has the maximal value seen so far and is both preceded (to the left) and followed (to the right) by values lower by delta. The plot below shows the detected maximum and minimum points on the bitcoin price change data; a code sketch of the rule follows it. In the next section we will see the relation of the sentiments to such peaks.
Figure 5.1. Bitcoin price peaks.
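A minimal sketch of this peak-detection rule (our illustration of the algorithm described above, not the exact code used):

import numpy as np

def find_peaks(prices, delta):
    maxima, minima = [], []
    mn, mx = np.inf, -np.inf
    mn_i = mx_i = 0
    looking_for_max = True
    for i, v in enumerate(prices):
        if v > mx:
            mx, mx_i = v, i
        if v < mn:
            mn, mn_i = v, i
        if looking_for_max and v < mx - delta:
            # price dropped by delta from the running maximum: record a max peak
            maxima.append((mx_i, mx))
            mn, mn_i = v, i
            looking_for_max = False
        elif not looking_for_max and v > mn + delta:
            # price rose by delta from the running minimum: record a min peak
            minima.append((mn_i, mn))
            mx, mx_i = v, i
            looking_for_max = True
    return maxima, minima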
5.3 Relation of sentiment analysis to price changes
Figure 5.2. Sentiment analysis and price change relation.
The figure above shows some correlation between sentiments and price changes. For instance, in some regions positiveness is correlated with increases in the price and negativeness with decreases. However, there are also regions where this correlation is not observed; the obtained correlation coefficient is low, about 3%. All in all, we added the sentiments to the neural network input to see whether they help make better predictions.
6. Neural Network
6.1 Model Architecture
Next, we constructed a neural network that predicts the bitcoin price from our data, which is a set of user posts with some auxiliary features. We built the network in Keras [7]. As the network requires vector inputs, we used StarSpace word embeddings to map each word to a vector in a high-dimensional space. These word embeddings carry semantic and syntactic information about the words: similar words lie close to each other in this space, while dissimilar words lie far apart [8].
Before training the model we labeled the data, trying several labeling methods. First we calculated the labels as the logarithm of the ratio of the average bitcoin price in the next 5 minutes to the average price in the previous 5 minutes:

label = log(average price of next 5 minutes / average price of previous 5 minutes)

However, with these labels we got very bad results after training: the model was not learning anything and the loss was not decreasing. We then took the labels to be the normalized bitcoin price for each 5-minute window and got better results. Moreover, we tried different ways of normalizing; the results for each method are presented in section 6.4.
The main input to the model was the collection of posts for every 5 minutes, represented as a sequence of words in the vector form we got from StarSpace. On top of this, we added auxiliary inputs to the model: polarity and subjectivity, whose correlation with the price was about 3%; the volume of transactions within each 5-minute window, whose correlation was about 12%; and finally the 15 topic weights from our LDA model. Given this set of features, we examined their different combinations in the neural network; all obtained results are presented in section 6.4. To combine the main input (the posts) with the auxiliary features, we first passed the word vectors to an LSTM layer, which transforms the sequence of vectors into a single vector containing information about the entire sequence. Then we fed the auxiliary input data into the model by concatenating it with the LSTM output. Next we stacked three densely connected layers and finally applied the output logistic regression layer with sigmoid activation, since our labels range from 0 to 1. A sketch of this architecture follows.
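A minimal sketch in the Keras functional API, with layer sizes taken from the description in section 6.3 (a 32-unit LSTM over 750 padded words of 100-dimensional StarSpace vectors, plus 18 auxiliary variables, giving 50-unit dense layers; the optimizer choice is our assumption):

from keras.layers import Input, LSTM, Dense, concatenate
from keras.models import Model

main_input = Input(shape=(750, 100), name='posts')  # 750 padded words x 100-dim StarSpace vectors
lstm_out = LSTM(32)(main_input)                     # sequence -> single 32-dim vector

aux_input = Input(shape=(18,), name='aux')          # polarity, subjectivity, volume, 15 LDA weights
x = concatenate([lstm_out, aux_input])              # 32 + 18 = 50 features
x = Dense(50, activation='sigmoid')(x)
x = Dense(50, activation='sigmoid')(x)
x = Dense(50, activation='sigmoid')(x)
output = Dense(1, activation='sigmoid', name='price')(x)  # labels range from 0 to 1

model = Model(inputs=[main_input, aux_input], outputs=output)
model.compile(optimizer='adam', loss='mse')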
6.2 Model validation
Our final filtered data contains 351666 forum replies, which we transformed into 5-minute bags, giving a total of 141981 examples, divided randomly into train, test and validation sets with ratios of 80%, 10% and 10% respectively. To measure the performance of our model we used the value of the loss function, measured as mean squared error, as well as the R squared score, which measures how close the data points are to the fitted values. Taking as a baseline the best naive model that, without training on the data, always predicts the mean value of the labels, we can compare our model to this mean model and measure how much better it does. The R squared score is calculated with the following formula:

R^2 = 1 - RSS / TSS

where RSS is the sum of squared differences between the real and predicted values, and TSS is the sum of squared differences between the real values and the mean value.
In other words, R squared is used as an indicator of the goodness of our model. The values of the indicators, as well as the plots of the losses, are presented in section 6.4, where we show the results obtained by trying different architectures and different combinations of features.
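Both metrics are straightforward to compute; a minimal sketch:

import numpy as np

def r_squared(y_true, y_pred):
    rss = np.sum((y_true - y_pred) ** 2)           # sum of squared residuals
    tss = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares around the mean
    return 1 - rss / tss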
6.3 Feature Summary
StarSpace converted our words into 100-dimensional vectors, followed by an LSTM of 32 neurons and three fully connected layers that take the output of the LSTM and the auxiliary variables as input. The input and output size of the fully connected layers equals the sum of the LSTM output size and the number of auxiliary variables.
Word padding refers to the dimensionality of the initial sentence input. For example, with a padding of 150 words, a sentence containing 100 words is extended with empty words to reach dimensionality 150, whereas a sentence with more than 150 words is cut at 150 words; a short padding sketch follows.
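A minimal padding sketch using the Keras preprocessing utility (the padding and truncating sides are our assumption):

from keras.preprocessing.sequence import pad_sequences

# sequences: list of per-document lists of 100-dimensional word vectors
padded = pad_sequences(sequences, maxlen=750, dtype='float32',
                       padding='post', truncating='post')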
The following is the histogram of the number of words in our messages; it goes up to 2000 words.
Figure 6.1. Histogram of the number of words in sentences.
Our experiments showed that using a padding of 750 words reduces the complexity of the algorithm and does not affect the loss.
Sigmoid is the activation function used in the fully connected layers as well as in the last layer.
LDA topics refers to the probability vector of each document belonging to each of the 15 trained LDA topics.
Price label: we used two types of normalization. In the Min Max case we normalized the price to fall into the range 0 to 1. In the Uniform min max case, we took the logarithm of the price and then normalized it to fall into the range 0 to 1, which gave an almost uniform distribution.
Figure 6.2. Price label histograms: price label min max vs. price label uniform min max.
6.4 Model results
The first experiment tries different word paddings with the non-uniform price labeling (min max), including one run with LDA topics.

                           Model 1   Model 2   Model 3   Model 4
Word padding               1500      750       750       150
Price label                Min Max   Min Max   Min Max   Min Max
LDA topics                 False     False     True      False
Polarity                   True      True      True      True
Subjectivity               True      True      True      True
Volume                     True      True      True      True
Epochs                     8         8         8         15
RMSE train set             0.044     0.044     0.041     0.028
RMSE test set              0.045     0.046     0.042     0.029
RMSE validation set        0.043     0.0436    0.040     0.030
R squared train set        0.163     0.161     0.224     0.470
R squared test set         0.172     0.167     0.225     0.446
R squared validation set   0.165     0.165     0.225     0.444
Hours to train on CPU      2.8       0.33      1         0.33

(The train/validation loss curves and prediction plots for each model are shown at the end of this section.)
The second experiment tries different combinations of auxiliary variables with the uniform price labeling (uniform min max).

                           Model 5          Model 6          Model 7          Model 9
Word padding               1500             750              750              750
Price label                Uniform min max  Uniform min max  Uniform min max  Uniform min max
LDA topics                 False            False            True             False
Polarity                   True             True             True             False
Subjectivity               True             True             True             False
Volume                     True             True             True             True
Epochs                     8                8                8                8
Loss train set             0.072            0.722            0.065            0.072
Loss validation set        0.072            0.071            0.653            0.072
RMSE train set             0.072            0.071            0.065            0.072
RMSE test set              0.072            0.072            0.066            0.073
RMSE validation set        0.072            0.071            0.065            0.072
R squared train set        0.208            0.212            0.283            0.204
R squared test set         0.214            0.218            0.282            0.211
R squared validation set   0.211            0.214            0.283            0.206
Hours to train on CPU      1.25             0.83             0.66             1.2
The following plots show the train and validation loss per epoch (left-hand plot) and the predictions on part of the test data set (right-hand plot) for each of Models 1 through 9.
8. Model Evaluation and Summary
From the obtained results we can conclude that the model works reasonably well, considering that the amount of data was limited and taken from only one forum. The highest R squared score was obtained with only 150 words kept per document; however, this is not a good prediction setting, because large parts of the posts are discarded in that case. Next, we can see that raising the padding from 750 to 1500 does not make much difference, because few documents are longer than 750 words. Min max price labeling gives a better R squared score and a lower loss; however, the distribution of the price is right-skewed, which means the model has a better chance of being near the real value simply by predicting values near the left peak. Although uniform min max gives a higher loss and a lower R squared, statistically it is the more sound setting for the experiment. Using uniform min max labeling with a padding of 750, we get an R squared score of about 0.22; when we also added the LDA topic weights as features, the model worked better and the R squared score rose to 0.28. This result confirms the hypothesis made in section 4.4: the LDA topic weights indeed improved the R squared score, so we can say that the topic distribution of the posts influences price fluctuations. We should also note that when LDA topics are included, the uniform normalization gives a higher score than the other one. Moreover, adding the auxiliary variables of polarity, subjectivity and volume also improved the model.
In conclusion, the loss plots of our model show that, in addition to reasonable accuracy of the value predictions, the model predicts the moments of price increases and decreases very well. As further improvement, we will test our model on larger datasets from different forums as well as social media.
9. Code Sources
All the source codes are provided in the following GitHub repository:
https://github.com/Gasia44/Capstone_Project
References:
1. Blei D. (2003). Latent Dirichlet Allocation.
Retrieved from:
http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
2. Porter M. (2006). The Porter Stemming Algorithm.
Retrieved from:
https://tartarus.org/martin/PorterStemmer/
3. Barber J. (n. d.). Latent Dirichlet Allocation (LDA) with Python.
Retrieved:
https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a248
67a.html
4. Scrapy 1.5 documentation. (n.d.). Retrieved from https://doc.scrapy.org/en/latest/
5. Paquin D. (n.d.). gdax-python. Retrieved from https://github.com/danpaquin/gdax-python
6. TextBlob Tutorial: Quickstart. (n.d.). Retrieved from http://textblob.readthedocs.io/en/dev/quickstart.html
7. Chilamkurthy, S. (2017, January 05). Keras Tutorial - Spoken Language Understanding.
Retrieved from
https://chsasank.github.io/spoken-language-understanding.html
8. Getting started with the Keras functional API. (n.d.). Retrieved from
https://keras.io/getting-started/functional-api-guide/
9. Kim, Y. B., Lee, S. H., Kang, S. J., Choi, M. J., Lee, J., & Kim, C. H. (2015). Retrieved
from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4524693/
10. Linton M, Teo EG, Bommes E, Chen CY-H, Härdle WK. Dynamic Topic Modelling for
Cryptocurrency Community Forums. 2016.
11. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. Journal of machine Learning
research. 2003;3(Jan):993–1022.
12. Wang C, Blei DM. Collaborative topic modeling for recommending scientific articles.
Retrieved from http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf
13. Wang Y, Agichtein E, Benzi M. TM-LDA: efficient online modeling of latent topic
transitions in social media. Proceedings of the 18th ACM SIGKDD international
conference on Knowledge discovery and data mining. 2012:123–31.
14. Xianghua F, Guo L, Yanyan G, Zhiqiang W. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowledge-Based Systems. 2013;37:186–95.
15. KDnuggets. (n.d.). Retrieved from
https://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html