Fig 1 - uploaded by Pulkit Mehndiratta
Variation of accuracy with different tweet grouping sizes 


Source publication
Conference Paper
Full-text available
Authorship Attribution (AA), the science of inferring an author for a given piece of text based on its characteristics, is a problem with a long history. In this paper, we study the problem of authorship attribution for forensic purposes and present machine learning techniques and stylometric features of the authors that enable authorship to be dete...

Context in source publication

Context 1
... also did a quantitative analysis, by bunching tweets in different group sizes. An overview of the results obtained is summarized and illustrated in Table 1 and Fig. 1. Accuracy (%) is the percentage of authors correctly classified. ...
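
The grouping experiment described in this context can be sketched roughly as follows. This is only an illustration: the excerpt does not say how tweets were combined into groups or how a trailing partial group was handled, so the concatenation strategy and the helper names `group_tweets` and `accuracy_percent` are assumptions, not the paper's implementation.

```python
from typing import List, Sequence


def group_tweets(tweets: Sequence[str], group_size: int) -> List[str]:
    """Join consecutive tweets into documents of `group_size` tweets each.

    Assumption: groups are formed by simple concatenation, and a shorter
    trailing group is kept rather than dropped.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        " ".join(tweets[i:i + group_size])
        for i in range(0, len(tweets), group_size)
    ]


def accuracy_percent(true_authors: Sequence[str],
                     predicted_authors: Sequence[str]) -> float:
    """Percentage of items whose predicted author matches the true author."""
    if len(true_authors) != len(predicted_authors):
        raise ValueError("label sequences must have the same length")
    if not true_authors:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return 100.0 * correct / len(true_authors)
```

Running a classifier of choice on the output of `group_tweets` for several values of `group_size`, and recording `accuracy_percent` each time, would give the kind of accuracy-versus-group-size curve the figure caption refers to.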

Similar publications

Article
Full-text available
Stylometry is a form of authorship attribution that relies on the linguistic information to attribute documents of unknown authorship based on the writing styles of a suspect set of authors. This paper focuses on the cross-domain subproblem where the known and suspect documents differ in the setting in which they were created. Three distinct domain...
Article
Full-text available
Automated social media bots have existed almost as long as the social media environments they inhabit. Their emergence has triggered numerous research efforts to develop increasingly sophisticated means to detect these accounts. These efforts have resulted in a cat and mouse cycle in which detection algorithms evolve trying to keep up with ever evo...
Conference Paper
Full-text available
We present SocialLink, a publicly available Linked Open Data dataset that matches social media accounts on Twitter to the corresponding entities in multiple language chapters of DBpedia. By effectively bridging the Twitter social media world and the Linked Open Data cloud, SocialLink enables knowledge transfer between the two: on the one hand, it s...
Conference Paper
Full-text available
Online social media websites like Twitter have become one of the most popular platforms for people to obtain or spread information. However, in the absence of any moderation and with the use of crowdsourcing, there is no guarantee that the information shared is credible. This makes online social media highly susceptible to the spread of rumors. As part o...
Conference Paper
Full-text available
In this work, we explore the possibility of detecting life events from Social Media by means of machine learning classification algorithms. One important difficulty of this kind of detection task is that, typically, Social Media posts are quite short, and there is not much context provided. This lack of context usually implies strong ambiguity lead...

Citations

... The research questions asked by the author are: (1) Does the use of multinomials by the two categories of authors show quantitatively distinct patterns? (2) How can the quantitative distinctions in point be accounted for in terms of the communicative functions pursued by the authors and what patterns emerge with regard to the construal of professional group identities? ...
... More narrowly, the research project also fits into the strand of authorship factor analyses in general language, in the context of authorship attribution studied from the highly technical perspective [e.g. 2,9] or specifically legal language addressing the domain of phraseology [29] or other aspects of language [40]. ...
Article
Full-text available
The paper explores the hypothesis that multinomials can act as authorship-based style distinguishing markers in legal communication. Specifically, the analysis focuses on identifying the quantitative distribution patterns of structural categories of multinomials as typical for two authorship categories and on their communicative function. The two authorship categories that are contrasted here are legal professionals/experts and lay people. The analysis is conducted in the corpus-based methodology with a custom-designed corpus of English, authentic texts found in the legal trade, in the domain of company registration proceedings. The findings confirm that multinomials that are conventionally considered to be a feature of professional legal communication are also cognitively salient in lay communication. Further, the texts drafted by the two categories of authors are profiled by structurally distinct multinomials. Functionally, it has been demonstrated that the structurally distinct types of multinomials that are found quantitatively salient in the two authorship categories are used predominantly for specific stylistic and/or pragmatic functions. Stylistically, multinomials contribute to conventional and ritual patterns which are used to meet the formality standards that have evolved in specific legal professions where authority is of particular importance. Pragmatic factors which account for quantitative salience of specific, structurally profiled categories of multinomials involve mainly reduplication of multinomials that embody norm-related concepts, which is required on the ground of intertextuality and ensures the materialisation of legal effect.
... To our knowledge, this is the first work incorporating stylometry to discriminate AI-generated text from human-written text. However, stylometry is a well-established tool used in author attribution and verification in many domains, including Twitter [2]. For detecting style changes within a document, different stylistic cues are leveraged in order to identify a given text's authorship and find author changes in multi-authored documents [9,28]. ...
Preprint
Full-text available
Recent advancements in pre-trained language models have enabled convenient methods for generating human-like text at a large scale. Though these generation capabilities hold great potential for breakthrough applications, they can also be a tool for an adversary to generate misinformation. In particular, social media platforms like Twitter are highly susceptible to AI-generated misinformation. A potential threat scenario is when an adversary hijacks a credible user account and incorporates a natural language generator to generate misinformation. Such threats necessitate automated detectors for AI-generated tweets in a given user's Twitter timeline. However, tweets are inherently short, thus making it difficult for current state-of-the-art pre-trained language model-based detectors to accurately detect at what point the AI starts to generate tweets in a given Twitter timeline. In this paper, we present a novel algorithm using stylometric signals to aid in detecting AI-generated tweets. We propose models corresponding to quantifying stylistic changes in human and AI tweets in two related tasks: Task 1: discriminate between human and AI-generated tweets, and Task 2: detect if and when an AI starts to generate tweets in a given Twitter timeline. Our extensive experiments demonstrate that the stylometric features are effective in augmenting the state-of-the-art AI-generated text detectors.
... Although AA can be used for good purposes such as social media and email forensics (Rocha et al., 2016; Apoorva and Sangeetha, 2021) and authorship identification and verification (Theophilo et al., 2022, 2021; Boenninghoff et al., 2019), it can also be leveraged for malicious purposes such as unmasking the authorship of private, anonymous texts. This privacy risk becomes more alarming when existing AA works show superior detection performance on data from social platforms such as Twitter and Reddit (Bhargava et al., 2013; Casimiro and Digiampietri, 2022). The AO technique (Hagen et al., 2017) aims to prevent this privacy risk. ...
... The authors in [73] achieved an accuracy of 53.2% in identifying an author from 10,000 scale microblog users using character n-gram frequency with cosine similarity to discover the most relevant stylometric features. The authors in [74] used lexical, syntactic, tweet-specific, and other useful features to extract stylometric information from a given tweet using a natural language toolkit (NLTK). The features corresponding to an anonymous author were identified with the help of the SVM method. ...
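
As a rough illustration of the first technique mentioned in this context (character n-gram frequencies compared by cosine similarity), a minimal sketch might look like the following. The n-gram length of 3 and the function names are assumptions made for this example; the cited work's actual features and parameters are not given in the excerpt.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (n = 3 is an assumed default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile shares nothing with any other profile
    return dot / (norm_a * norm_b)
```

In an attribution setting, one would typically build one profile per candidate author from known posts and attribute an unknown post to the author whose profile scores highest.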
Article
Full-text available
Post-authorship attribution is a scientific process of using stylometric features to identify the genuine writer of an online text snippet such as an email, blog, forum post, or chat log. It has useful applications in manifold domains, for instance, in a verification process to proactively detect misogynistic, misandrist, xenophobic, and abusive posts on the internet or social networks. The process assumes that texts can be characterized by sequences of words that agglutinate the functional and content lyrics of a writer. However, defining an appropriate characterization of text to capture the unique writing style of an author is a complex endeavor in the discipline of computational linguistics. Moreover, posts are typically short texts with obfuscating vocabularies that might impact the accuracy of authorship attribution. The vocabularies include idioms, onomatopoeias, homophones, phonemes, synonyms, acronyms, anaphora, and polysemy. The method of the regularized deep neural network (RDNN) is introduced in this paper to circumvent the intrinsic challenges of post-authorship attribution. It is based on a convolutional neural network, bidirectional long short-term memory encoder, and distributed highway network. The neural network was used to extract lexical stylometric features that are fed into the bidirectional encoder to extract a syntactic feature-vector representation. The feature vector was then supplied as input to the distributed highway network for regularization to minimize the network-generalization error. The regularized feature vector was ultimately passed to the bidirectional decoder to learn the writing style of an author. The feature-classification layer consists of a fully connected network and a SoftMax function to make the prediction. The RDNN method was tested against thirteen state-of-the-art methods using four benchmark experimental datasets to validate its performance. Experimental results have demonstrated the effectiveness of the method when compared to the existing state-of-the-art methods on three datasets while producing comparable results on one dataset.
... From a dataset of over 600,000 comments, we isolated 121,822 shared by CT users and 481,185 from legit ones. We performed a stylometric analysis similar to the one conducted in (Bhargava, Mehndiratta, and Asawa 2013), based on Lexical Features, Syntactical Features, and Emoji Features. ...
Preprint
Full-text available
In 2021, Influencer Marketing generated more than $13 billion. Companies and major brands advertise their products on Social Media, especially Instagram, through Influencers, i.e., people with high popularity and the ability to influence the mass. Usually, more popular and visible influencers are paid more for their collaborations. As a result, many services were born to boost profiles' popularity, engagement, or visibility, mainly through bots or fake accounts. Researchers have focused on recognizing such unnatural activities in different social networks with high success. However, real people recently started participating in such boosting activities using their real accounts for monetary rewards, generating ungenuine content that is very difficult to detect. Currently, on Instagram, no works have tried to detect this new phenomenon, known as crowdturfing (CT). In this work, we are the first to propose a CT engagement detector on Instagram. Our algorithm leverages profiles' characteristics through semi-supervised learning to spot accounts involved in CT activities. In contrast to the supervised methods employed so far to detect fake accounts, a semi-supervised approach takes advantage of the vast quantities of unlabeled data on social media to yield better results. We purchased and studied 1293 CT profiles from 11 providers to build our self-training classifier, which reached 95% accuracy. Finally, we ran our model in the wild to detect and analyze the CT engagement of 20 mega-influencers (i.e., with more than one million followers), discovering that more than 20% of their engagement was artificial. We analyzed the profiles and comments of people involved in CT engagement, showing how difficult it is to spot these activities using only the generated content.
... The undeniable conveniences offered by the Internet have also been used to commit various crimes, such as fraud, theft of banking assets, and unauthorized access to information on personal, commercial, and public devices, as well as crimes related to the sexual abuse of children and adolescents, commonly referred to by the term pedophilia. The commission of offenses using the worldwide computer network is encouraged by the difficulty of identifying their perpetrators (ABBASI; ZHENG et al., 2006; BHARGAVA; MEHNDIRATTA; ASAWA, 2013; YANG; CHOW, 2014). Crimes related to the sexual abuse of minors over the Internet are committed in several ways, including the luring and grooming of children and adolescents and the exchange, sale, and distribution of digital photos and films of pedophilic practices or child pornography (FRANCO; MAGALHÃES, 2015). ...
Article
Full-text available
Objectives: To identify the current state of the art of scientific research in the field of authorship attribution applied to investigations of sexual crimes against children and adolescents on the Internet involving written material. To propose a methodology for using authorship attribution to identify suspected authors of texts whose content encourages child and adolescent sexual abuse. Methodology: This is a qualitative study that uses a Systematic Literature Review to identify works on authorship attribution techniques, with the aim of finding scientific evidence of their application to problems similar to the one addressed in the present study. Results: The current state of the art of scientific research relating the use of authorship attribution techniques to texts on the Internet that encourage the sexual abuse of children and adolescents is presented and, based on this, a methodology is proposed for identifying the authors of texts with those characteristics. Conclusions: It is concluded that scientific research on this topic is not abundant, which suggests that the field is open to new studies. It is also concluded that authorship attribution techniques can be fully applied to identify the probable authors of texts intended to guide and encourage the sexual abuse of children and adolescents, as made explicit by the proposed methodology.
... Traditionally, methods for authorship analysis are based on the extraction of stylometric features. Stylometric features, i.e., statistical features of a document, can be divided into several categories, such as lexical features, character features, syntactic features, structural features, and semantic features [173][174][175][176]. Halvani et al. examined 12 existing author verification methods on their own self-compiled corpora, where each corpus focuses on a different aspect of applicability [177]. ...
Preprint
In this paper, we review recent work in media forensics for digital images, video, audio (specifically speech), and documents. For each data modality, we discuss synthesis and manipulation techniques that can be used to create and modify digital media. We then review technological advancements for detecting and quantifying such manipulations. Finally, we consider open issues and suggest directions for future research.
... The issue of the authorship factor has been addressed in research studied from the highly technical perspective of authorship attribution (e.g. Bhargava, Mehndiratta, Asawa, 2013; Coyotl-Morales, Villaseñor-Pineda, Montes-y-Gómez, Rosso, 2006) and it concerns various types of register. This study contributes to relevant findings in the area of legal language (e.g. ...
Article
Full-text available
The paper aims at describing the findings and conclusions formulated in the analysis of the authorship factor in legal discourse. It is hypothesised that verbal structures show systemically varied distribution across legal discourse and the relevant distinctions run through the authorship categories. When it comes to the aim of the research it draws on the tradition of sociolinguistic methodology targeting issues related to language variation which follows the basic assumptions of functional grammar. From the point of view of the material covered by the analysis it contributes to the research on legal discourse and specifically on its specialised domain referred to as corporate, company or business discourse. It provides additional empirical data pointing to the non-homogeneity of the legal style and formal distinctions originating from rich contextual background. The study is conducted on the material of a custom-designed corpus of English legal texts, classified as secondary genres. Methodologically, the study makes use of the tenets of supervised search of digitalised corpora and automatic data extraction based on discrete units, subsequent identification of recurring longer contiguous and/or non-contiguous sequences, if any, built around the axis of specific verbal structures and finally qualitative comparative analysis (characterisation) of the material. The discussion presents sample data and focuses on the most salient categories, both quantitatively and qualitatively. The inductive approach confirms the formal divergencies in the communicative situation covered by the analysis. The findings encapsulate patterns and tendencies in the quantitative distribution of verbal structures depending on the authorship category. It may be concluded that authorship is a factor delineating distinctions as regards (i) the repertoire of grammatical instruments exploited (verbal structures), which contributes to the specific stylistic profile of given authors. 
This shows that the thesis posed is verified positively and the study shows further, more detailed distinctions running through groups of subcategories distinguished within the authorship categories specified upon the start of the research.
... Stylometry traditionally is the field of research which studies the characteristics of "style" (usually of text, but also music or art) [119], usually with the aim of determining authorship, and has now found a new popularity thanks to the massive increase in the availability of text over the Internet. Stylometric analysis, as a statistical analysis of textual input, has been used, for example, for studying blogging and micro-blogging (Twitter) [21,71] or also to improve cybersecurity algorithms [197]. ...
Thesis
Full-text available
The number of higher education students facing mental health issues has reached a critical point, with major consequences both in terms of academic achievement and general health and wellbeing. Institutions recognise the issue and have put in place a number of measures to try and counteract the crisis. Monitoring students’ wellbeing can play a critical role, and Virtual Learning Environments could be instrumental in achieving this goal. This is particularly true in online-learning scenarios, where face-to-face contact is less present or not present at all, and educators need to rely on observing the online behaviour. However, the information overload to lecturers is significant, keeping track of each online discussion forum is extremely onerous, and the help from "learning analytics" is not always useful, as they often concentrate on measures of students’ performance, engagement, and presence in the virtual classroom, which are not necessarily good indicators of mental health. The work presented in this thesis proposes to bridge this gap by addressing the effectiveness of emotion or writing style profiling to identify students at risk. The work proposes a conceptual framework for a system, intended to sit alongside the virtual learning environment, and able to play the role of an "emotion observer", identifying and flagging potential issues to the educator. We propose a system where technology is supportive of educators rather than replacing them, and is not intrusive or changing the classroom dynamics. We demonstrate the validity of the approach with a series of experiments. We address the technical feasibility of such a system by investigating how established artificial intelligence techniques, and "off-the-shelf" tools implementing them, can be used to carry out the tasks that would need to be performed by a system implementing the approach, and we discuss their performances on either available datasets, or, for one of the experiments, a purpose built dataset, which is part of the contributions of the thesis. We address the admissibility of such an approach by conducting a focus group study with a group of experts in online learning. The contribution of the thesis is therefore the first complete feasibility study on the development of a novel system able to monitor students’ emotional state, both individually and as a cohort, and longitudinally over the course of their studies, which is aimed at helping online educators identify students at risk and implement strategies for intervention.
... Syntactical structures vary from simple to complicated. The latter are difficult to formalize [5][6][7]. The advantage of our research is the choice of the phonological level. ...
Article
Full-text available
A one-consonant group approach to authorship attribution has been proposed. The approach is based on determining, by the chi-square test, the consonant group in which the difference between the texts by different authors is statistically significant. The developed model determines the author-differentiating capability of each consonant group as the ratio of the number of comparisons in which the difference between the texts by two authors is statistically significant to the total number of comparisons. The determined general author-differentiating capability of the group of stop consonants, which is a statistical parameter of the authorial style, is the highest in the comparisons of texts from the publicist and belles-lettres styles. The one-consonant group approach simplifies the whole process of authorship attribution and ensures a higher level of automation. The conducted experiments on the Java programming language have proved that the chi-square test is a powerful nonparametric statistical test that can be used for author identification on the level of English consonants with a test validity of 95%.
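
The ratio described in this abstract (significant comparisons divided by all comparisons) could be sketched along the following lines. Several things here are assumptions for illustration only: the 2x2 contingency layout, reading the "95%" figure as a 0.05 significance level with one degree of freedom (critical value 3.841), the absence of a continuity correction, and the function names. The article's actual procedure may differ.

```python
from itertools import combinations
from typing import Dict, Tuple

# Chi-square critical value for 1 degree of freedom at the 0.05 level.
CHI2_CRITICAL_DF1_P05 = 3.841


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumed layout: each row is one text; the first column counts sounds
    from the consonant group under study, the second counts all others.
    No continuity correction is applied.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of a difference
    return n * (a * d - b * c) ** 2 / denom


def differentiating_capability(counts: Dict[str, Tuple[int, int]]) -> float:
    """Share of pairwise comparisons that are statistically significant.

    `counts` maps a text label to (in_group, not_in_group) counts; each
    text is assumed to come from a different author.
    """
    pairs = list(combinations(counts.values(), 2))
    if not pairs:
        raise ValueError("at least two texts are required")
    significant = sum(
        chi_square_2x2(a, b, c, d) > CHI2_CRITICAL_DF1_P05
        for (a, b), (c, d) in pairs
    )
    return significant / len(pairs)
```

Repeating `differentiating_capability` for each consonant group would then allow the groups to be ranked by how often they separate authors.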