Figure 5
Indo-European family language tree: this figure illustrates the phylogenetic-like tree constructed on the basis of more than 50 different versions of the 'Universal Declaration of Human Rights'. The tree is obtained using the Fitch–Margoliash method applied to the symmetric distance matrix based on the R distance defined in section 2.2, built from the ATC method. The tree features essentially all the main linguistic groups of the Euro-Asiatic continent (Romance, Celtic, Germanic, Ugro-Finnic, Slavic, Baltic, Altaic), as well as a few isolated languages, such as Maltese, typically considered an Afro-Asiatic language, and Basque, classified as a non-Indo-European language whose origins and relationships with other languages are uncertain. The tree is unrooted, i.e. it does not require any hypothesis about common ancestors of the languages, nor can it be used to infer information about such ancestors. For more details, see the text. The lengths of the paths between pairs of documents measured along the tree branches are not proportional to the actual distances between the documents.
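The pipeline described in the caption (a symmetric matrix of text-to-text distances fed to a distance-based tree-building method) can be sketched roughly as follows. This is only an illustration: the figure uses the Fitch–Margoliash method (available, e.g., in PHYLIP's fitch program), whereas the snippet below uses Biopython's neighbour-joining constructor as a stand-in, and the language names and distance values are invented for the example rather than taken from the paper.

    # Illustrative only: neighbour joining as a stand-in for Fitch-Margoliash;
    # the language names and distances below are made up.
    from Bio import Phylo
    from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

    names = ["italian", "french", "german", "dutch"]
    # Biopython expects a lower-triangular matrix whose rows end with the 0 diagonal.
    dm = DistanceMatrix(names, matrix=[
        [0],
        [0.12, 0],
        [0.35, 0.34, 0],
        [0.36, 0.33, 0.15, 0],
    ])

    tree = DistanceTreeConstructor().nj(dm)  # topologically unrooted tree
    Phylo.draw_ascii(tree)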


Source publication
Article
Full-text available
In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which use in a crucial way data compression techniques in order to...

Similar publications

Conference Paper
Full-text available
This paper addresses the problem of side information generation in distributed video compression (DVC) schemes. Intermediate frames constructed by motion-compensated interpolation of key frames are used as side information to decode Wyner-Ziv frames. The limitations of block-based translational motion models call for new motion models. This article...

Citations

... Here we focus on the ones that yielded the best results: Overlapping Space-Free N-Grams [41] (OSF), i.e., strings of characters of fixed length N that include spaces only as first or last characters, thus discarding words shorter than N − 2. We further explore a hybrid approach, which gave optimal results for one of the considered datasets (see SI [34]), where we exploit the structure captured by the Lempel–Ziv compression algorithm (LZ77) [42]. Namely, we consider LZ77 Sequences as tokens, i.e., repeated sequences extracted through a modified version [43] of the Lempel–Ziv algorithm already used for attribution purposes in [44]. ...
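As a rough illustration of the Overlapping Space-Free N-gram tokenization described in the excerpt above, the sketch below keeps every length-N window of a text in which a space appears only as the first or the last character. The function name and the example text are mine, not taken from the cited work.

    # Minimal OSF n-gram extraction: spaces allowed only at the window's ends.
    def osf_ngrams(text: str, n: int):
        grams = []
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            if " " not in window[1:-1]:  # no space in the interior of the window
                grams.append(window)
        return grams

    print(osf_ngrams("to be or not to be", 4))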
Preprint
In this letter, we introduce a new approach to quantify the closeness of symbolic sequences and test it in the framework of the authorship attribution problem. The method, based on a recently discovered urn representation of the Pitman-Yor process, is highly accurate compared to other state-of-the-art methods, featuring a substantial gain in computational efficiency and theoretical transparency. Our work establishes a clear connection between urn models critical in interpreting innovation processes and nonparametric Bayesian inference. It opens the way to design more efficient inference methods in the presence of complex correlation patterns and non-stationary dynamics.
... We define the cross-complexity (though here we shall refer to it as cross-entropy) of a sequence B with respect to another sequence A as the number of bits needed to specify B once A is known. We follow a refined version of the data-compression approach introduced in [17,18], which was shown to be successful in authorship attribution and corpora classification [18]. We use in particular the LZ77 compressor [19]: we scan the B sequence looking for matching sub-sequences only in A, and we code each match as in the usual LZ77 algorithm. ...
... In this section we report some details about the data-compression techniques we used to estimate the similarity between pairs of sequences. We followed in particular the approach proposed in [17,18], based on the Lempel–Ziv algorithm [19]. We adopt the notion of cross-complexity (from now on referred to as cross-entropy) between two sequences of characters. ...
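A crude sketch of the cross-entropy estimate described in these excerpts might look like the following: B is scanned greedily, each longest sub-sequence already present in A is charged an LZ77-like pointer cost, and unmatched characters are charged as literals. The cost model, the greedy search and the function name are simplifications of mine, not the exact coder used in the cited works.

    import math

    def cross_entropy_bits(a: str, b: str) -> float:
        # Approximate bits per character needed to specify b once a is known.
        bits, i = 0.0, 0
        while i < len(b):
            # Longest prefix of b[i:] that occurs somewhere in a.
            length = 0
            while length < len(b) - i and b[i:i + length + 1] in a:
                length += 1
            if length > 0:
                bits += math.log2(len(a)) + 2 * math.log2(length + 1)  # pointer + length code
                i += length
            else:
                bits += 8.0  # unmatched character, coded as a literal
                i += 1
        return bits / len(b)

    print(cross_entropy_bits("the quick brown fox", "the quick red fox"))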
Article
Full-text available
We introduce a Maximum Entropy model able to capture the statistics of melodies in music. The model can be used to generate new melodies that emulate the style of the musical corpus which was used to train it. Instead of using the $n$-body interactions of $(n-1)$-order Markov models, traditionally used in automatic music generation, we use a $k$-nearest neighbour model with pairwise interactions only. In that way, we keep the number of parameters low and avoid over-fitting problems typical of Markov models. We show that long-range musical phrases don't need to be explicitly enforced using high-order Markov interactions, but can instead emerge from multiple, competing, pairwise interactions. We validate our Maximum Entropy model by contrasting how much the generated sequences capture the style of the original corpus without plagiarizing it. To this end we use a data-compression approach to discriminate the levels of borrowing and innovation featured by the artificial sequences. The results show that our modelling scheme outperforms both fixed-order and variable-order Markov models. This shows that, despite being based only on pairwise interactions, this Maximum Entropy scheme opens the possibility to generate musically sensible alterations of the original phrases, providing a way to generate innovation.
... Compression algorithms are especially efficient when examining natural language because they contain so many redundancies (Brillouin, 2004). Scholars have demonstrated how zipping can be used in measuring language similarity between two or more texts (Baronchelli, Caglioti, & Loreto, 2005) and how compression algorithms and entropy-based approaches are useful to measure online texts (Gordon, Cao, & Swanson, 2007; Huffaker, Jorgensen, Iacobelli, Tepper, & Cassell, 2006; Nigam, Lafferty, & McCallum, 1999; Schneider, 1996). In our study, entropy represents the number of similar linguistic choices at both word and phrase levels. ...
Article
Full-text available
The purpose of this study is to examine how language affects coalition formation in multiparty negotiations. The authors relied on communication accommodation theory for theoretical framing and hypothesized that language can help coalition partners reach an agreement when it is used to increase a sense of unity. Findings of an experimental study support this hypothesis, demonstrating that greater linguistic convergence and assent increase agreements between potential coalition partners whereas the expression of negative emotion words decrease agreement. The implications for coalition formation and the study of language in negotiations are discussed.
... For a given compression scheme, and two objects (e.g. bit-strings) of the same length, the object whose compressed representation requires the fewest symbols can be considered less complex, as it contains more identifiable regularity [1]. In the experiments that follow, this idea is applied to assess the complexity of evolved neural network behaviors using a real-world compressor. ...
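A minimal sketch of this comparison, with zlib standing in for whatever real-world compressor is actually used: two equal-length strings are compressed, and the one with the shorter output is treated as the less complex. The example strings are arbitrary.

    import random
    import zlib

    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data, 9))  # length of the compressed representation

    random.seed(0)
    regular = b"01" * 64                                                   # 128 bytes, periodic
    irregular = "".join(random.choice("01") for _ in range(128)).encode()  # 128 bytes, random

    # The string with the shorter compressed form is taken to be the less complex one.
    print(compressed_size(regular), compressed_size(irregular))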
Article
Full-text available
Model complexity is a key concern for any artificial learning system due to its critical impact on generalization. However, EC research has focused only on phenotype structural complexity for static problems. For sequential decision tasks, phenotypes that are very similar in structure can produce radically different behaviors, and the trade-off between fitness and complexity in this context is not clear. In this paper, behavioral complexity is measured explicitly using compression, and used as a separate objective to be optimized (not as an additional regularization term in a scalar fitness), in order to study this trade-off directly.
... For a given compressor, and two objects (e.g. bit-strings) of equal length, the object with the shorter compressed representation can be considered less complex, as it contains more identifiable regularity [1]. In the experiments that follow, this idea is applied to assess the complexity of evolved neural network behaviors, using a real-world compressor. ...
Conference Paper
Full-text available
Model complexity is a key concern for any artificial learning system due to its critical impact on generalization. However, EC research has focused only on phenotype structural complexity for static problems. For sequential decision tasks, phenotypes that are very similar in structure can produce radically different behaviors, and the trade-off between fitness and complexity in this context is not clear. In this paper, behavioral complexity is measured explicitly using compression, and used as a separate objective to be optimized (not as an additional regularization term in a scalar fitness), in order to study this trade-off directly.
... For example, in the early days of computational biology, lossless compression was routinely used to classify and analyze DNA sequences. We refer to, e.g., Allison et al. (2000), Baronchelli et al. (2005), Farach et al. (1995), Frank et al. (2000), Gatlin (1972), Kennel (2004), Kit (1998), Loewenstern et al. (1995), Loewenstern and Yianilos (1999), Mahoney (2003, unpublished), Needham and Dowe (2001), Segen (1990), and Teahan et al. (2000), and references therein for a sampler of the rich literature existing on this subject. More recently, Benedetto et al. (2002) have shown how to use a compression-based measure to classify fifty languages. ...
Article
Full-text available
The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible. A parameter-light algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data-mining paradigm. The results are strongly connected to Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will show that this approach is competitive or superior to many of the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series/DNA/text/XML/video datasets. As further evidence of the advantages of our method, we will demonstrate its effectiveness in solving a real-world classification problem in recommending printing services and products.
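The "dozen lines of code" the abstract alludes to can be imagined along the lines of the sketch below, which computes a compression-based dissimilarity of the form CDM(x, y) = C(xy) / (C(x) + C(y)), with zlib as the off-the-shelf compressor. Treat the exact formula and the toy sequences as an assumption made for illustration, not a restatement of the paper's implementation.

    import zlib

    def c(data: bytes) -> int:
        return len(zlib.compress(data, 9))  # compressed size, in bytes

    def cdm(x: bytes, y: bytes) -> float:
        # Close to 1 for unrelated sequences, noticeably smaller for related ones.
        return c(x + y) / (c(x) + c(y))

    related = (b"ACGTACGTACGT" * 20, b"ACGTACGAACGT" * 20)
    unrelated = (b"ACGTACGTACGT" * 20, b"TTAGCCGATTAC" * 20)
    print(cdm(*related), cdm(*unrelated))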
... More recently, researchers have experimented with data compression algorithms as a measure of document complexity and similarity. This technique uses compression ratios as an approximation of a document's information entropy (Baronchelli, Caglioti, & Loreto, 2005; Benedetto, Caglioti, & Loreto, 2002). Standard Zipping algorithms have demonstrated effectiveness in a variety of document comparison and classification tasks. ...
Article
Full-text available
This paper examines language similarity in messages over time in an online community of adolescents from around the world using three computational measures: Spearman's Correlation Coefficient, Zipping and Latent Semantic Analysis. Results suggest that the participants' language diverges over a six-week period, and that divergence is not mediated by demographic variables such as leadership status or gender. This divergence may represent the introduction of more unique words over time, and is influenced by a continual change in subtopics over time, as well as community-wide historical events that introduce new vocabulary at later time periods. Our results highlight both the possibilities and shortcomings of using document similarity measures to assess convergence in language use.
... For example, in the early days of computational biology, lossless compression was routinely used to classify and analyze DNA sequences. We refer to, e.g., Allison et al. (2000), Baronchelli et al. (2005), Farach et al. (1995), Frank et al. (2000), Gatlin (1972), Kennel (2004), Kit (1998), Loewenstern et al. (1995), Loewenstern and Yianilos (1999), Mahoney (2003, unpublished), Needham and Dowe (2001), Segen (1990), and Teahan et al. (2000), and references therein for a sampler of the rich literature existing on this subject. ...
Article
Full-text available
The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process.
Article
The instrumental variables estimator first appeared explicitly in Appendix B of The Tariff on Animal and Vegetable Oils by Philip G. Wright (1928). It has been suggested that this appendix was written by Philip's son Sewall Wright, then already an important genetic statistician. To find out who wrote Appendix B, we use stylometric statistics to compare it to other texts known to have been written solely by the father and son. The sharp results are consistent with contextual and historical evidence on the authorship of Appendix B and on the origination of the idea of IV estimation.
Article
Full-text available
We define a simple, purely surface-frequency-based measure S(T, t) which quantifies the similarity of a training text T with a test text t. S can be decomposed into three factors: one depending on the training text, one depending on the test text, and one nearly constant residual factor. The slight variations of this near constant allow us to measure stylistic differences between T and t with high accuracy. The defined quantity S is unique among other stylometric measures in that it uses the full frequency information of all substrings in both texts. Its applicability for stylometric classifications was tested in a variety of experiments.