Figure (Royal Society Open Science; content subject to copyright). Expected counts of parses of the training (a) and test (b) data. The ‘total’ bar shows the total expected counts of each parse type. The ‘optimal’ bar shows the expected counts of the optimal parses. The ‘uniform baseline’ bar shows the expected counts of the non-regular parses under the assumption of a uniform distribution over the logically possible parses of each data string.

Source publication
Article
A pervasive belief with regard to the differences between human language and animal vocal sequences (song) is that they belong to different classes of computational complexity, with animal song belonging to regular languages, whereas human language is superregular. This argument, however, lacks empirical evidence since superregular analyses of anim...

Citations

... Katahira et al. [9] assert that the song syntax of the Bengalese finch can be better described with a lower-order hidden Markov model [69] than with an n-gram model. Moreover, hierarchical language models used in computational linguistics (e.g., probabilistic context-free grammars) are known to allow a more compact description of human language [70] and of animal voice sequences [71] than sequential models such as HMMs. Another route to compression is to represent consecutive repetitions of the same syllable category differently from transitions between heterogeneous syllables [16,17], as in the sketch below (see also [72] for neurological evidence that heterosyllabic transitions and homosyllabic repetitions are treated differently). ...
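To make the repetition-based compression concrete, here is a minimal hypothetical sketch (not the cited models): a syllable sequence is re-encoded as (syllable, run-length) pairs, so homosyllabic repetitions collapse to a single unit while heterosyllabic transitions remain explicit.

```python
from itertools import groupby

def run_length_encode(syllables):
    """Collapse consecutive repetitions of the same syllable category into
    (syllable, count) pairs, separating homosyllabic repetition from
    heterosyllabic transition. Illustrative only; not the cited models."""
    return [(s, sum(1 for _ in group)) for s, group in groupby(syllables)]

song = ["a", "a", "a", "b", "c", "c", "b", "b", "b", "b"]
print(run_length_encode(song))
# [('a', 3), ('b', 1), ('c', 2), ('b', 4)]
```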
... We conclude the present paper by noting that the analysis of context dependency via neural language modeling is not limited to Bengalese finch and zebra finch songs. Since neural networks are universal approximators that can potentially fit any kind of data [75,76], the same analytical method is applicable to other animals' voice sequences [11,43,71], given a reasonable segmentation and classification of sequence components such as syllables. Moreover, the analysis of context dependency can in principle also be performed on other sequential behavioral data besides vocalization, including dance [77,78] and gestures [79,80]. ...
... Thus, it is convenient to represent the syllables in such a format [83,84]. Previous studies of animal vocalization often used acoustic features such as syllable duration, mean pitch, spectral entropy/shape (centroid, skewness, etc.), mean spectrum/cepstrum, and/or Mel-frequency cepstral coefficients at representative points to obtain a fixed-dimensional representation [9,30,71]. In this study, we took a non-parametric approach based on a sequence-to-sequence (seq2seq) autoencoder [85]. ...
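As one concrete illustration of such a seq2seq autoencoder, the sketch below uses a GRU encoder to compress a variable-length sequence of spectral frames into a fixed-dimensional latent vector, which a GRU decoder then tries to reconstruct. All dimensions and the single-layer architecture are assumptions for illustration, not the cited study's configuration.

```python
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    """Minimal seq2seq autoencoder sketch: the encoder GRU's final hidden
    state is the fixed-dimensional syllable representation; the decoder
    GRU reconstructs the input frames from it."""
    def __init__(self, n_freq=64, latent_dim=16):
        super().__init__()
        self.encoder = nn.GRU(n_freq, latent_dim, batch_first=True)
        self.decoder = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.readout = nn.Linear(latent_dim, n_freq)

    def forward(self, frames):                  # frames: (batch, time, n_freq)
        _, latent = self.encoder(frames)        # latent: (1, batch, latent_dim)
        steps = frames.size(1)
        # Feed the latent vector at every decoding step (one simple variant).
        dec_in = latent.transpose(0, 1).repeat(1, steps, 1)
        out, _ = self.decoder(dec_in, latent)
        return self.readout(out), latent.squeeze(0)

model = Seq2SeqAutoencoder()
syllables = torch.randn(8, 50, 64)              # 8 syllables, 50 frames each
recon, z = model(syllables)
loss = nn.functional.mse_loss(recon, syllables) # reconstruction objective
print(z.shape)                                  # torch.Size([8, 16])
```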
Article
Context dependency is a key feature of the sequential structure of human language, requiring reference between words far apart in the produced sequence. Assessing how long the past context affects the current status provides crucial information for understanding the mechanisms behind complex sequential behaviors. Birdsong serves as a representative model for studying context dependency in sequential signals produced by non-human animals, although previous estimates were upper-bounded by methodological limitations. Here, we estimated the context dependency in birdsong in a more scalable way, using a modern neural-network-based language model whose accessible context length is sufficiently long. The detected context dependency was beyond the order of traditional Markovian models of birdsong, but was consistent with previous experimental investigations. We also studied the relation between the assumed/auto-detected vocabulary size of birdsong (i.e., fine- vs. coarse-grained syllable classifications) and the context dependency. It turned out that the larger the assumed vocabulary (i.e., the more fine-grained the classification), the shorter the detected context dependency.
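The evaluation logic behind such a context-dependency estimate can be sketched as follows: score each syllable under the language model while truncating the visible context to L preceding syllables, then read off the shortest L at which the per-syllable negative log-likelihood stops improving. The scorer below is a uniform stand-in (so the NLL is flat in L); a real analysis would plug in the trained neural language model.

```python
import math

def truncated_nll(logprob, sequence, context_len):
    """Mean negative log-likelihood per syllable when the model may only
    see the last `context_len` syllables. Context dependency is read off
    as the shortest truncation at which the NLL stops improving."""
    nll = 0.0
    for t in range(1, len(sequence)):
        context = sequence[max(0, t - context_len):t]
        nll -= logprob(sequence[t], context)
    return nll / (len(sequence) - 1)

# Stand-in scorer: a uniform model over a 5-syllable vocabulary, so the
# NLL here is flat in L. A real analysis would use the trained neural LM.
uniform = lambda syllable, context: -math.log(5)

song = list("aabbcadbe")
for L in (1, 2, 4, 8):
    print(L, truncated_nll(uniform, song, L))
```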
... Kershenbaum et al. (2016) identify several classes of models and analyses for analyzing temporal sequences, including Markov chains, hidden Markov models, network-based analyses, formal-grammar analyses, and temporal models. Analysis of temporal organization in animal communication has traditionally been strongly influenced by Chomsky's hierarchy of formal grammars, with a focus on understanding which class of the Chomsky hierarchy an animal's behaviors belong to (Hauser et al., 2002; Rohrmeier et al., 2015; Jiang et al., 2018; Morita and Koda, 2019). For example, Markov models, hidden Markov models, and network models are all finite-state models in the Chomsky hierarchy. ...
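The finite-state end of that hierarchy is easy to make concrete. The sketch below fits a first-order Markov chain over syllable categories and samples from it; the toy corpus is an assumption for illustration, not data from any cited study.

```python
import random
from collections import Counter, defaultdict

def fit_markov(sequences):
    """Estimate first-order transition probabilities P(next | current)
    from example syllable sequences (a toy finite-state model)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

def generate(chain, start, length):
    """Sample a sequence by repeatedly drawing from the fitted chain."""
    seq = [start]
    for _ in range(length - 1):
        probs = chain[seq[-1]]
        seq.append(random.choices(list(probs), weights=list(probs.values()))[0])
    return seq

chain = fit_markov([list("abcabcaab"), list("abcaabc")])
print(generate(chain, "a", 10))
```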
Article
Recently developed methods in computational neuroethology have enabled increasingly detailed and comprehensive quantification of animal movements and behavioral kinematics. Vocal communication behavior is well poised for application of similar large-scale quantification methods in the service of physiological and ethological studies. This review describes emerging techniques that can be applied to acoustic and vocal communication signals with the goal of enabling study beyond a small number of model species. We review a range of modern computational methods for bioacoustics, signal processing, and brain-behavior mapping. Along with a discussion of recent advances and techniques, we include challenges and broader goals in establishing a framework for the computational neuroethology of vocal communication.
... The European starling dataset is from [3] and was acquired from [101]. The gibbon song dataset is from [102]. The marmoset dataset was received via personal correspondence and was recorded similarly to [47]. ...
Article
Animals produce vocalizations that range in complexity from a single repeated call to hundreds of unique vocal elements patterned in sequences unfolding over hours. Characterizing complex vocalizations can require considerable effort and a deep intuition about each species’ vocal behavior. Even with a great deal of experience, human characterizations of animal communication can be affected by human perceptual biases. We present a set of computational methods for projecting animal vocalizations into low-dimensional latent representational spaces that are directly learned from the spectrograms of vocal signals. We apply these methods to diverse datasets from over 20 species, including humans, bats, songbirds, mice, cetaceans, and nonhuman primates. Latent projections uncover complex features of data in visually intuitive and quantifiable ways, enabling high-powered comparative analyses of vocal acoustics. We introduce methods for analyzing vocalizations both as discrete sequences and as continuous latent variables. Each method can be used to disentangle complex spectro-temporal structure and observe long-timescale organization in communication.
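A minimal sketch of such a latent projection, assuming flattened fixed-size spectrograms and using UMAP as the embedding method (the array shapes and random placeholder data are assumptions, not the paper's pipeline):

```python
import numpy as np
import umap  # umap-learn

# Hypothetical data: one flattened, fixed-size spectrogram per syllable.
# In practice these would be real syllable spectrograms padded or resized
# to a common shape before embedding.
rng = np.random.default_rng(0)
spectrograms = rng.random((500, 32 * 32))

reducer = umap.UMAP(n_components=2, random_state=0)
embedding = reducer.fit_transform(spectrograms)   # (500, 2) latent points
print(embedding.shape)
```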
... Morita & Koda (henceforth also M&K; [1]) attempt to resurrect supra-regular analyses of animal pattern recognition. First, they state that claims about the regularity of animal song are not supported by empirical evidence, since supra-regular analyses of animal patterns are possible; second, they analyse gibbon data via probabilistic context-free grammars (PCFGs; a notoriously supra-regular formalism) and invoke the compactness of the analysis as a fundamental advantage of this approach. ...
... This is a reply to a comment by De Santo & Rawski (DS&R) [1] regarding our recent investigation of gibbon song syntax [2]. The major objectives are to clarify (i) the difference between the proposed analytical method and the formal language theory (FLT) approach advocated by DS&R, and (ii) the difficulties in applying existing FLT-based analyses to animal studies. ...
... Thus, a broader range of models must be included for comparison, given an appropriate search algorithm. In [2], the hypothesis space for animal voice sequences was expanded from the previously assumed regular domain to probabilistic context-free grammars (PCFGs). This difference between the FLT-based and proposed methods regarding search procedures has ‘pushed the scientific community towards misguidance’ with respect to idealization and the optimum-among-regulars [7]. ...
... The likelihood metric is not suitable for non-neural, rule-based superregular models such as PCFGs, which exhibit no remarkable advantage over regular models for either human language [15] or animal voice sequences [2] (§4.2). Thus, NLP did not benefit much from superregularity before the deep learning era; the previous state-of-the-art architecture was the smoothed n-gram model, which can only generate a subclass of the regular languages (termed strictly locally testable languages) [16,17]. ...
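As a reminder of what that pre-neural workhorse looks like, here is a minimal add-k smoothed bigram model (the value of k and the toy corpus are assumptions); models of this family generate only strictly locally testable, hence regular, languages.

```python
from collections import Counter

class AddKBigram:
    """Add-k smoothed bigram language model: an illustrative sketch of the
    pre-neural state of the art mentioned above, not any cited system."""
    def __init__(self, corpus, k=0.5):
        self.k = k
        self.vocab = set(w for sent in corpus for w in sent)
        self.bi = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))
        self.uni = Counter(w for s in corpus for w in s[:-1])

    def prob(self, prev, word):
        # Smoothed conditional P(word | prev): add k pseudo-counts to every
        # bigram so unseen transitions keep nonzero probability.
        return ((self.bi[(prev, word)] + self.k) /
                (self.uni[prev] + self.k * len(self.vocab)))

lm = AddKBigram([["a", "b", "c"], ["a", "a", "b"]])
print(lm.prob("a", "b"))  # smoothed P(b | a)
```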
Preprint
Animals produce vocalizations that range in complexity from a single repeated call to hundreds of unique vocal elements patterned in sequences unfolding over hours. Characterizing complex vocalizations can require considerable effort and a deep intuition about each species’ vocal behavior. Even with a great deal of experience, human characterizations of animal communication can be affected by human perceptual biases. We present here a set of computational methods centered on projecting animal vocalizations into low-dimensional latent representational spaces that are directly learned from data. We apply these methods to diverse datasets from over 20 species, including humans, bats, songbirds, mice, cetaceans, and nonhuman primates, enabling high-powered comparative analyses of unbiased acoustic features in the communicative repertoires across species. Latent projections uncover complex features of data in visually intuitive and quantifiable ways. We introduce methods for analyzing vocalizations both as discrete sequences and as continuous latent variables. Each method can be used to disentangle complex spectro-temporal structure and observe long-timescale organization in communication. Finally, we show how systematic sampling from latent representational spaces of vocalizations enables comprehensive investigation of perceptual and neural representations of complex and ecologically relevant acoustic feature spaces.
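The systematic-sampling step can be sketched in the same spirit: fit an embedding, lay a grid over the learned 2-D latent space, and map each grid point back to spectrogram space via umap-learn's inverse_transform. The placeholder data and grid resolution below are assumptions, not the paper's pipeline.

```python
import numpy as np
import umap  # umap-learn

rng = np.random.default_rng(0)
spectrograms = rng.random((500, 32 * 32))      # hypothetical stand-in data
reducer = umap.UMAP(n_components=2, random_state=0).fit(spectrograms)

# Systematic sampling: tile the extent of the learned 2-D embedding with a
# grid and invert each grid point back to (flattened) spectrogram space.
emb = reducer.embedding_
xs = np.linspace(emb[:, 0].min(), emb[:, 0].max(), 4)
ys = np.linspace(emb[:, 1].min(), emb[:, 1].max(), 4)
grid = np.array([[x, y] for x in xs for y in ys])
sampled = reducer.inverse_transform(grid)      # (16, 1024) synthetic stimuli
print(sampled.shape)
```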