Figure 5
Indo-European family language tree: this figure illustrates the phylogenetic-like tree constructed on the basis of more than 50 different versions of the 'Universal Declaration of Human Rights'. The tree is obtained using the Fitch–Margoliash method applied to the symmetric distance matrix based on the R distance defined in section 2.2, built from the ATC method. The tree features essentially all the main linguistic groups of the Euro-Asiatic continent (Romance, Celtic, Germanic, Ugro-Finnic, Slavic, Baltic, Altaic), as well as a few isolated languages, such as Maltese, typically considered an Afro-Asiatic language, and Basque, classified as a non-Indo-European language whose origins and relationships with other languages are uncertain. The tree is unrooted, i.e. it does not require any hypothesis about common ancestors of the languages, nor can it be used to infer information about such ancestors. For more details, see the text. The lengths of the paths between pairs of documents measured along the tree branches are not proportional to the actual distances between the documents.
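The pipeline described in the caption (a symmetric matrix of text-to-text distances fed to a distance-based tree-building method) can be sketched roughly as follows. This is only an illustration: the figure uses the Fitch–Margoliash method (available, e.g., in PHYLIP's fitch program), whereas the snippet below uses Biopython's neighbour-joining constructor as a stand-in, and the language names and distance values are invented for the example rather than taken from the paper.

    # Illustrative only: neighbour joining as a stand-in for Fitch-Margoliash;
    # the language names and distances below are made up.
    from Bio import Phylo
    from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

    names = ["italian", "french", "german", "dutch"]
    # Biopython expects a lower-triangular matrix whose rows end with the 0 diagonal.
    dm = DistanceMatrix(names, matrix=[
        [0],
        [0.12, 0],
        [0.35, 0.34, 0],
        [0.36, 0.33, 0.15, 0],
    ])

    tree = DistanceTreeConstructor().nj(dm)  # topologically unrooted tree
    Phylo.draw_ascii(tree)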


Source publication
Article
Full-text available
In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which use in a crucial way data compression techniques in order to...

Similar publications

Conference Paper
Full-text available
This paper addresses the problem of side information generation in distributed video compression (DVC) schemes. Intermediate frames constructed by motion-compensated interpolation of key frames are used as side information to decode Wyner-Ziv frames. The limitations of block-based translational motion models call for new motion models. This article...

Citations

... Here we focus on the ones that yielded the best results: Overlapping Space-Free N-Grams [41] (OSF), i.e., strings of characters of fixed length N that include spaces only as first or last characters, thus discarding words shorter than N − 2. We further explore a hybrid approach, which gave optimal results for one of the considered datasets (see SI [34]), where we exploit the structure captured by the Lempel–Ziv compression algorithm (LZ77) [42]. Namely, we consider LZ77 Sequences as tokens, i.e., repeated sequences extracted through a modified version [43] of the Lempel–Ziv algorithm already used for attribution purposes in [44]. ...
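As a rough illustration of the Overlapping Space-Free N-gram tokenization described in the excerpt above, the sketch below keeps every length-N window of a text in which a space appears only as the first or the last character. The function name and the example text are mine, not taken from the cited work.

    # Minimal OSF n-gram extraction: spaces allowed only at the window's ends.
    def osf_ngrams(text: str, n: int):
        grams = []
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            if " " not in window[1:-1]:  # no space in the interior of the window
                grams.append(window)
        return grams

    print(osf_ngrams("to be or not to be", 4))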
Preprint
In this letter, we introduce a new approach to quantify the closeness of symbolic sequences and test it in the framework of the authorship attribution problem. The method, based on a recently discovered urn representation of the Pitman-Yor process, is highly accurate compared to other state-of-the-art methods, featuring a substantial gain in computational efficiency and theoretical transparency. Our work establishes a clear connection between urn models critical in interpreting innovation processes and nonparametric Bayesian inference. It opens the way to design more efficient inference methods in the presence of complex correlation patterns and non-stationary dynamics.
... We define the cross-complexity (though here we shall refer to it as cross-entropy) of a sequence B with respect to another sequence A as the number of bits needed to specify B once A is known. We follow a refined version of the data-compression approach introduced in [17,18], which was shown to be successful in authorship attribution and corpora classification [18]. We use in particular the LZ77 compressor [19]: we scan the B sequence looking for matching sub-sequences only in A, and we code each match as in the usual LZ77 algorithm. ...
... In this section we report some details about the data-compression techniques we used to estimate the similarity between pairs of sequences. We followed in particular the approach proposed in [17,18], based on the Lempel–Ziv algorithm [19]. We adopt the notion of cross-complexity (from now on referred to as cross-entropy) between two sequences of characters. ...
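A crude sketch of the cross-entropy estimate described in these excerpts might look like the following: B is scanned greedily, each longest sub-sequence already present in A is charged an LZ77-like pointer cost, and unmatched characters are charged as literals. The cost model, the greedy search and the function name are simplifications of mine, not the exact coder used in the cited works.

    import math

    def cross_entropy_bits(a: str, b: str) -> float:
        # Approximate bits per character needed to specify b once a is known.
        bits, i = 0.0, 0
        while i < len(b):
            # Longest prefix of b[i:] that occurs somewhere in a.
            length = 0
            while length < len(b) - i and b[i:i + length + 1] in a:
                length += 1
            if length > 0:
                bits += math.log2(len(a)) + 2 * math.log2(length + 1)  # pointer + length code
                i += length
            else:
                bits += 8.0  # unmatched character, coded as a literal
                i += 1
        return bits / len(b)

    print(cross_entropy_bits("the quick brown fox", "the quick red fox"))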
Article
Full-text available
We introduce a Maximum Entropy model able to capture the statistics of melodies in music. The model can be used to generate new melodies that emulate the style of the musical corpus which was used to train it. Instead of using the $n$-body interactions of $(n-1)$-order Markov models, traditionally used in automatic music generation, we use a $k$-nearest neighbour model with pairwise interactions only. In that way, we keep the number of parameters low and avoid over-fitting problems typical of Markov models. We show that long-range musical phrases don't need to be explicitly enforced using high-order Markov interactions, but can instead emerge from multiple, competing, pairwise interactions. We validate our Maximum Entropy model by contrasting how much the generated sequences capture the style of the original corpus without plagiarizing it. To this end we use a data-compression approach to discriminate the levels of borrowing and innovation featured by the artificial sequences. The results show that our modelling scheme outperforms both fixed-order and variable-order Markov models. This shows that, despite being based only on pairwise interactions, this Maximum Entropy scheme opens the possibility to generate musically sensible alterations of the original phrases, providing a way to generate innovation.
... Compression algorithms are especially efficient when examining natural language because they contain so many redundancies (Brillouin, 2004). Scholars have demonstrated how zipping can be used in measuring language similarity between two or more texts (Baronchelli, Caglioti, & Loreto, 2005) and how compression algorithms and entropy-based approaches are useful to measure online texts (Gordon, Cao, & Swanson, 2007; Huffaker, Jorgensen, Iacobelli, Tepper, & Cassell, 2006; Nigam, Lafferty, & McCallum, 1999; Schneider, 1996). In our study, entropy represents the number of similar linguistic choices at both word and phrase levels. ...
Article
Full-text available
The purpose of this study is to examine how language affects coalition formation in multiparty negotiations. The authors relied on communication accommodation theory for theoretical framing and hypothesized that language can help coalition partners reach an agreement when it is used to increase a sense of unity. Findings of an experimental study support this hypothesis, demonstrating that greater linguistic convergence and assent increase agreements between potential coalition partners whereas the expression of negative emotion words decrease agreement. The implications for coalition formation and the study of language in negotiations are discussed.
... For a given compression scheme, and two objects (e.g. bit-strings) of the same length, the object whose compressed representation requires the fewest symbols can be considered less complex, as it contains more identifiable regularity [1]. In the experiments that follow, this idea is applied to assess the complexity of evolved neural network behaviors using a real-world compressor. ...
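A minimal sketch of this comparison, with zlib standing in for whatever real-world compressor is actually used: two equal-length strings are compressed, and the one with the shorter output is treated as the less complex. The example strings are arbitrary.

    import random
    import zlib

    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data, 9))  # length of the compressed representation

    random.seed(0)
    regular = b"01" * 64                                                   # 128 bytes, periodic
    irregular = "".join(random.choice("01") for _ in range(128)).encode()  # 128 bytes, random

    # The string with the shorter compressed form is taken to be the less complex one.
    print(compressed_size(regular), compressed_size(irregular))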
Article
Full-text available
Model complexity is a key concern for any artificial learning system due to its critical impact on generalization. However, EC research has focused only on phenotype structural complexity for static problems. For sequential decision tasks, phenotypes that are very similar in structure can produce radically different behaviors, and the trade-off between fitness and complexity in this context is not clear. In this paper, behavioral complexity is measured explicitly using compression, and used as a separate objective to be optimized (not as an additional regularization term in a scalar fitness), in order to study this trade-off directly.
... For a given compressor, and two objects (e.g. bit-strings) of equal length, the object with the shorter compressed representation can be considered less complex, as it contains more identifiable regularity [1]. In the experiments that follow, this idea is applied to assess the complexity of evolved neural network behaviors, using a real-world compressor. ...
Conference Paper
Full-text available
Model complexity is a key concern for any artificial learning system due to its critical impact on generalization. However, EC research has focused only on phenotype structural complexity for static problems. For sequential decision tasks, phenotypes that are very similar in structure can produce radically different behaviors, and the trade-off between fitness and complexity in this context is not clear. In this paper, behavioral complexity is measured explicitly using compression, and used as a separate objective to be optimized (not as an additional regularization term in a scalar fitness), in order to study this trade-off directly.
... For example, in the early days of computational biology, lossless compression was routinely used to classify and analyze DNA sequences. We refer to, e.g., Allison et al. (2000), Baronchelli et al. (2005), Farach et al. (1995), Frank et al. (2000), Gatlin (1972), Kennel (2004), Kit (1998), Loewenstern et al. (1995), Loewenstern and Yianilos (1999), Mahoney (2003, unpublished), Needham and Dowe (2001), Segen (1990), and Teahan et al. (2000), and references therein for a sampler of the rich literature existing on this subject. More recently, Benedetto et al. (2002) have shown how to use a compression-based measure to classify fifty languages. ...
Article
Full-text available
The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible. A parameter-light algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data-mining paradigm. The results are strongly connected to Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will show that this approach is competitive or superior to many of the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series/DNA/text/XML/video datasets. As further evidence of the advantages of our method, we will demonstrate its effectiveness in solving a real-world classification problem in recommending printing services and products.
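The "dozen lines of code" the abstract alludes to can be imagined along the lines of the sketch below, which computes a compression-based dissimilarity of the form CDM(x, y) = C(xy) / (C(x) + C(y)), with zlib as the off-the-shelf compressor. Treat the exact formula and the toy sequences as an assumption made for illustration, not a restatement of the paper's implementation.

    import zlib

    def c(data: bytes) -> int:
        return len(zlib.compress(data, 9))  # compressed size, in bytes

    def cdm(x: bytes, y: bytes) -> float:
        # Close to 1 for unrelated sequences, noticeably smaller for related ones.
        return c(x + y) / (c(x) + c(y))

    related = (b"ACGTACGTACGT" * 20, b"ACGTACGAACGT" * 20)
    unrelated = (b"ACGTACGTACGT" * 20, b"TTAGCCGATTAC" * 20)
    print(cdm(*related), cdm(*unrelated))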
... More recently, researchers have experimented with data compression algorithms as a measure of document complexity and similarity. This technique uses compression ratios as an approximation of a document's information entropy (Baronchelli, Caglioti, & Loreto, 2005; Benedetto, Caglioti, & Loreto, 2002). Standard Zipping algorithms have demonstrated effectiveness in a variety of document comparison and classification tasks. ...
Article
Full-text available
This paper examines language similarity in messages over time in an online community of adolescents from around the world using three computational measures: Spearman's Correlation Coefficient, Zipping and Latent Semantic Analysis. Results suggest that the participants' language diverges over a six-week period, and that divergence is not mediated by demographic variables such as leadership status or gender. This divergence may represent the introduction of more unique words over time, and is influenced by a continual change in subtopics over time, as well as community-wide historical events that introduce new vocabulary at later time periods. Our results highlight both the possibilities and shortcomings of using document similarity measures to assess convergence in language use.
... For example, in the early days of computational biology, lossless compression was routinely used to classify and analyze DNA sequences. We refer to, e.g., Allison et al. (2000), Baronchelli et al. (2005), Farach et al. (1995), Frank et al. (2000), Gatlin (1972), Kennel (2004), Kit (1998), Loewenstern et al. (1995), Loewenstern and Yianilos (1999), Mahoney (2003, unpublished), Needham and Dowe (2001), Segen (1990), and Teahan et al. (2000), and references therein for a sampler of the rich literature existing on this subject. ...
Article
Full-text available
The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process.
Article
The instrumental variables estimator first appeared explicitly in Appendix B of The Tariff on Animal and Vegetable Oils by Philip G. Wright (1928). It has been suggested that this appendix was written by Philip's son Sewall Wright, then already an important genetic statistician. To find out who wrote Appendix B, we use stylometric statistics to compare it to other texts known to have been written solely by the father and son. The sharp results are consistent with contextual and historical evidence on the authorship of Appendix B and on the origination of the idea of IV estimation.
Article
Full-text available
We define a simple, purely surface-frequency-based measure S(T, t) which quantifies the similarity of a training text T with a test text t. S can be decomposed into three factors: one depending on the training text, one depending on the test text, and one nearly constant residual factor. The slight variations of this near constant allow us to measure stylistic differences between T and t with high accuracy. The defined quantity S is unique among other stylometric measures in that it uses the full frequency information of all substrings in both texts. Its applicability for stylometric classifications was tested in a variety of experiments.