Conference Paper

PiCoGen: Generate Piano Covers with a Two-stage Approach

Conference Paper
Full-text available
This article presents MidiTok, a Python package to encode MIDI files into sequences of tokens to be used with sequential Deep Learning models like Transformers or Recurrent Neural Networks. It allows researchers and developers to encode datasets with various strategies built around the idea that they share common parameters. This key idea makes it easy to: 1) optimize the size of the vocabulary and the elements it can represent w.r.t. the MIDI specifications; 2) compare tokenization methods to see which performs best in which case; 3) measure the relevance of additional information like chords or tempo changes. Code and documentation of MidiTok are on GitHub: https://github.com/Natooz/MidiTok.
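As a rough illustration of the package's intended workflow, the sketch below tokenizes a single MIDI file with the REMI strategy and a shared parameter object. Names follow recent MidiTok releases and may differ in older versions, and the file path is a placeholder.

```python
# Minimal sketch of tokenizing a MIDI file with MidiTok (names follow recent
# MidiTok releases; "song.mid" is a placeholder path).
from pathlib import Path

from miditok import REMI, TokenizerConfig

# Shared parameters (velocity bins, optional chord/tempo tokens) are declared
# once and reused across tokenization strategies.
config = TokenizerConfig(num_velocities=16, use_chords=True, use_tempos=True)
tokenizer = REMI(config)

# Calling the tokenizer on a MIDI file returns the token sequence(s).
tokens = tokenizer(Path("song.mid"))
print(tokens)
```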
Article
Full-text available
The capability of transcribing music audio into music notation is a fascinating example of human intelligence. It involves perception (analyzing complex auditory scenes), cognition (recognizing musical objects), knowledge representation (forming musical structures), and inference (testing alternative hypotheses). Automatic music transcription (AMT), i.e., the design of computational algorithms to convert acoustic music signals into some form of music notation, is a challenging task in signal processing and artificial intelligence. It comprises several subtasks, including multipitch estimation (MPE), onset and offset detection, instrument recognition, beat and rhythm tracking, interpretation of expressive timing and dynamics, and score typesetting.
Article
Full-text available
Music generation has generally been focused on either creating scores or interpreting them. We discuss differences between these two problems and propose that, in fact, it may be valuable to work in the space of direct performance generation: jointly predicting the notes and also their expressive timing and dynamics. We consider the significance and qualities of the dataset needed for this. Having identified both a problem domain and characteristics of an appropriate dataset, we show an LSTM-based recurrent network model that subjectively performs quite well on this task. Critically, we provide generated examples. We also include feedback from professional composers and musicians about some of these examples.
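A schematic sketch of such an event-based recurrent model is given below: a single LSTM predicts the next event over a shared vocabulary of note-on, note-off, time-shift, and velocity tokens, so timing and dynamics are generated jointly with the notes. The vocabulary layout and layer sizes are assumptions for illustration, not the authors' exact configuration.

```python
# Schematic event-based LSTM for performance generation (illustrative only).
import torch
import torch.nn as nn

class PerformanceLSTM(nn.Module):
    def __init__(self, vocab_size=388, embed_dim=256, hidden_dim=512, num_layers=3):
        super().__init__()
        # One shared vocabulary covering note-on, note-off, time-shift and
        # velocity events, so expressive timing and dynamics are predicted
        # jointly with the notes.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)            # (batch, time, embed_dim)
        out, state = self.lstm(x, state)  # (batch, time, hidden_dim)
        return self.head(out), state      # logits over the next event

# Usage sketch: next-event prediction with cross-entropy on dummy data.
model = PerformanceLSTM()
dummy = torch.randint(0, 388, (2, 128))
logits, _ = model(dummy[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 388), dummy[:, 1:].reshape(-1))
```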
Chapter
Full-text available
A cover version is an alternative rendition of a previously recorded song. Given that a cover may differ from the original song in timbre, tempo, structure, key, arrangement, or language of the vocals, automatically identifying cover songs in a given music collection is a rather difficult task. The music information retrieval (MIR) community has paid much attention to this task in recent years and many approaches have been proposed. This chapter comprehensively summarizes the work done in cover song identification while encompassing the background related to this area of research. The most promising strategies are reviewed and qualitatively compared under a common framework, and their evaluation methodologies are critically assessed. A discussion on the remaining open issues and future lines of research closes the chapter.
Conference Paper
Full-text available
A lead sheet is a type of music notation which summarizes the content of a song. The usual elements that are reproduced are the melody, chords, tempo, time signature, style and the lyrics, if any. In this paper we propose a system that aims at transcribing both the melody and the associated chords in a beat-synchronous framework. A beat tracker identifies the pulse positions and thus defines a beat grid on which the chord sequence and the melody notes are mapped. The harmonic changes are used to estimate the time signature and the downbeats as well as the key of the piece. The different modules perform very well on each of the different tasks, and the lead sheets that were rendered show the potential of the approaches adopted in this paper.
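A minimal sketch of the beat-grid idea is shown below, using librosa's beat tracker as a stand-in for the paper's own modules; the note onsets are hypothetical values such as a melody tracker might produce.

```python
# Sketch of a beat-synchronous grid: track beats, then snap note onsets to the
# nearest beat.  librosa's tracker stands in for the paper's beat module.
import numpy as np
import librosa

# librosa's bundled example clip; any audio file path works here.
y, sr = librosa.load(librosa.ex("trumpet"))
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

def snap_to_beats(onsets, beat_times):
    """Map each onset (in seconds) to the index of the closest beat."""
    onsets = np.atleast_1d(onsets)
    return np.abs(onsets[:, None] - beat_times[None, :]).argmin(axis=1)

# Hypothetical melody-note onsets in seconds.
note_onsets = np.array([0.52, 1.03, 1.61])
print(snap_to_beats(note_onsets, beat_times))
```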
Article
Full-text available
Musical genres are categorical labels created by humans to characterize pieces of music. A musical genre is characterized by the common characteristics shared by its members. These characteristics are typically related to the instrumentation, rhythmic structure, and harmonic content of the music. Genre hierarchies are commonly used to structure the large collections of music available on the Web. Currently, musical genre annotation is performed manually. Automatic musical genre classification can assist or replace the human user in this process and would be a valuable addition to music information retrieval systems. In addition, automatic musical genre classification provides a framework for developing and evaluating features for any type of content-based analysis of musical signals. In this paper, the automatic classification of audio signals into a hierarchy of musical genres is explored. More specifically, three feature sets for representing timbral texture, rhythmic content and pitch content are proposed. The performance and relative importance of the proposed features are investigated by training statistical pattern recognition classifiers using real-world audio collections. Both whole-file and real-time frame-based classification schemes are described. Using the proposed feature sets, a classification accuracy of 61% for ten musical genres is achieved. This result is comparable to results reported for human musical genre classification.
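The general recipe can be sketched as follows: summarize timbral texture with frame-level features aggregated over the whole file, then train a statistical classifier. The sketch below uses librosa and scikit-learn and simplifies the paper's timbral/rhythmic/pitch feature sets; `paths` and `genres` are placeholders for a labelled audio collection.

```python
# Sketch of the genre-classification recipe: whole-file statistics of timbral
# features fed to a standard classifier.  A simplification of the paper's
# full timbral/rhythmic/pitch feature sets.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def timbral_features(path):
    y, sr = librosa.load(path, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    feats = np.concatenate([mfcc, centroid, zcr], axis=0)
    # Whole-file statistics, as in the paper's whole-file classification scheme.
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

# `paths` and `genres` are placeholders for a labelled collection:
# X = np.stack([timbral_features(p) for p in paths])
# clf = make_pipeline(StandardScaler(), SVC()).fit(X, genres)
```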
Article
Full-text available
Large volumes of music are available online, represented in performance formats such as MIDI and, increasingly, in abstract notation such as SMDL. Many types of user would find it valuable to search collections of music via queries representing music fragments, but such searching requires a reliable technique for identifying whether a provided fragment occurs within a piece of music. The problem of matching fragments to music is made difficult by the psychology of music perception, because literal matching may have little relation to perceived melodic similarity, and by the interactions between the multiple parts of typical pieces of music. In this paper we analyse the properties of music, music perception, and music database users, and use the analysis to propose alternative techniques for extracting monophonic melodies from polyphonic music; we believe that such melodies can subsequently be used for matching of queries to data. We report on experiments with music listeners, which rank our proposed techniques for extracting melodies.
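One simple, widely cited baseline for this task is the "skyline" heuristic, which keeps only the highest-pitched note sounding at each onset. The sketch below applies it to a MIDI file via pretty_midi, which is used here for convenience and is not tied to the paper; the file path is a placeholder.

```python
# Skyline-style melody extraction: keep only the highest-pitched note at each
# onset.  An illustrative baseline, not the paper's specific techniques.
import pretty_midi

def skyline(midi_path):
    pm = pretty_midi.PrettyMIDI(midi_path)
    notes = [n for inst in pm.instruments if not inst.is_drum for n in inst.notes]
    # Sort by onset, then by descending pitch, so the first note seen at each
    # onset is the highest-pitched one.
    notes.sort(key=lambda n: (n.start, -n.pitch))
    melody, last_onset = [], None
    for note in notes:
        if last_onset is None or note.start > last_onset + 1e-3:
            melody.append(note)
            last_onset = note.start
    return melody

# melody = skyline("song.mid")  # "song.mid" is a placeholder path
```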
Article
Transformers and variational autoencoders (VAE) have been extensively employed for symbolic (e.g., MIDI) domain music generation. While the former boast an impressive capability in modeling long sequences, the latter allow users to willingly exert control over different parts (e.g., bars) of the music to be generated. In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths. The task is split into two steps. First, we equip Transformer decoders with the ability to accept segment-level, time-varying conditions during sequence generation. Subsequently, we combine the developed and tested in-attention decoder with a Transformer encoder, and train the resulting MuseMorphose model with the VAE objective to achieve style transfer of long pop piano pieces, in which users can specify musical attributes including rhythmic intensity and polyphony (i.e., harmonic fullness) they desire, down to the bar level. Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based baselines on numerous widely-used metrics for style transfer tasks.
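The in-attention idea can be sketched schematically as follows: the segment-level (e.g., bar-level) condition is projected and re-injected into the hidden states entering every decoder layer. Layer internals and dimensions below are illustrative stand-ins, not the released MuseMorphose implementation.

```python
# Schematic sketch of "in-attention" conditioning: a segment-level condition
# vector is projected and summed with the hidden states entering each layer.
# Sizes and layer internals are illustrative assumptions.
import torch
import torch.nn as nn

class InAttentionDecoder(nn.Module):
    def __init__(self, d_model=512, n_layers=6, n_heads=8, cond_dim=64):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.cond_proj = nn.Linear(cond_dim, d_model)

    def forward(self, x, cond, causal_mask=None):
        # x: (batch, time, d_model); cond: (batch, time, cond_dim), i.e. the
        # bar-level condition broadcast to every token of that bar.
        c = self.cond_proj(cond)
        for layer in self.layers:
            x = layer(x + c, src_mask=causal_mask)  # re-inject before each layer
        return x

dec = InAttentionDecoder()
out = dec(torch.randn(2, 32, 512), torch.randn(2, 32, 64))
```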
Article
To apply neural sequence models such as the Transformer to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite, pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset time) along the time grid. While different types of tokens may possess different properties, existing models usually treat them equally, in the same way as modeling words in natural languages. In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types. We propose a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types. With an expansion-compression trick, we convert a piece of music to a sequence of compound words by grouping neighboring tokens, greatly reducing the length of the token sequences. We show that the resulting model can be viewed as a learner over dynamic directed hypergraphs. We employ it to learn to compose expressive Pop piano music of full-song length (involving up to 10K individual tokens per song), both conditionally and unconditionally. Our experiment shows that, compared to state-of-the-art models, the proposed model converges 5 to 10 times faster at training (i.e., within a day on a single GPU with 11 GB memory), with comparable quality in the generated music.
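The type-aware output stage can be sketched as a set of feed-forward heads over a shared decoder state, one head per token type. The token types and vocabulary sizes below are illustrative, not the paper's exact vocabulary.

```python
# Sketch of type-specific output heads over a shared decoder state: each token
# type (e.g. pitch, duration, velocity) gets its own feed-forward head.
import torch
import torch.nn as nn

class CompoundHeads(nn.Module):
    def __init__(self, d_model=512, vocab_sizes=None):
        super().__init__()
        # Illustrative token types and vocabulary sizes.
        vocab_sizes = vocab_sizes or {"type": 4, "pitch": 128, "duration": 64, "velocity": 32}
        self.heads = nn.ModuleDict(
            {name: nn.Linear(d_model, size) for name, size in vocab_sizes.items()}
        )

    def forward(self, hidden):
        # hidden: (batch, time, d_model) from any Transformer decoder backbone.
        return {name: head(hidden) for name, head in self.heads.items()}

heads = CompoundHeads()
logits = heads(torch.randn(2, 16, 512))
print({name: tuple(t.shape) for name, t in logits.items()})
```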
Article
Automatic music transcription (AMT) is the task of transcribing audio recordings into symbolic representations. Recently, neural-network-based methods have been applied to AMT and have achieved state-of-the-art results. However, many previous systems only detect the onset and offset of notes frame-wise, so the transcription resolution is limited to the frame hop size. There is a lack of research on using different strategies to encode onset and offset targets for training. In addition, previous AMT systems are sensitive to misaligned onset and offset labels in audio recordings. Furthermore, there is little research on sustain-pedal transcription on large-scale datasets. In this article, we propose a high-resolution AMT system trained by regressing precise onset and offset times of piano notes. At inference, we propose an algorithm to analytically calculate the precise onset and offset times of piano notes and pedal events. We show that our AMT system is robust to misaligned onset and offset labels compared to previous systems. Our proposed system achieves an onset F1 of 96.72% on the MAESTRO dataset, outperforming the previous Onsets and Frames system's 94.80%. Our system achieves a pedal onset F1 score of 91.86%, which is the first benchmark result on the MAESTRO dataset. We have released the source code and checkpoints of our work at https://github.com/bytedance/piano_transcription.
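The regression-style onset target can be illustrated as follows: frames near a note onset receive a value that falls off linearly with their distance to the precise onset time, so sub-frame timing survives the frame grid. The constants below are assumptions in the spirit of the paper, not its exact settings.

```python
# Illustrative encoding of regression-style onset targets (constants are
# assumptions, not the paper's exact configuration).
import numpy as np

def onset_regression_targets(onset_times, n_frames, hop_seconds=0.01, J=5):
    """Frames within J hops of an onset get a value that decreases linearly
    with the distance between the frame time and the precise onset time."""
    frame_times = np.arange(n_frames) * hop_seconds
    targets = np.zeros(n_frames)
    for t in onset_times:
        g = 1.0 - np.abs(frame_times - t) / (J * hop_seconds)
        targets = np.maximum(targets, np.clip(g, 0.0, 1.0))
    return targets

# An onset at 0.123 s lands between frame centers; its exact time can still be
# recovered from the shape of the surrounding target values.
print(onset_regression_targets([0.123], n_frames=30))
```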
Article
The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at http://g.co/magenta/musicvae-colab.
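The conductor-style hierarchy can be sketched schematically: a high-level recurrent network turns the latent code into one embedding per subsequence (e.g., per bar), and a low-level network decodes each subsequence from its embedding alone. The sketch below is a non-autoregressive simplification with illustrative sizes, not the released MusicVAE.

```python
# Schematic conductor-style hierarchical decoder (illustrative simplification).
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, z_dim=256, cond_dim=512, hid_dim=512, vocab=130,
                 n_subseq=16, subseq_len=16):
        super().__init__()
        self.n_subseq, self.subseq_len = n_subseq, subseq_len
        self.conductor = nn.GRU(z_dim, cond_dim, batch_first=True)
        self.decoder = nn.GRU(cond_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, z):
        # Conductor: the latent code is fed at every step and yields one
        # embedding per subsequence (e.g. per bar).
        conductor_in = z.unsqueeze(1).expand(-1, self.n_subseq, -1)
        embeddings, _ = self.conductor(conductor_in)
        logits = []
        for i in range(self.n_subseq):
            # Each subsequence is decoded independently from its embedding,
            # which pushes the model to rely on the latent code.
            dec_in = embeddings[:, i:i + 1, :].expand(-1, self.subseq_len, -1)
            h, _ = self.decoder(dec_in)
            logits.append(self.out(h))
        return torch.cat(logits, dim=1)  # (batch, n_subseq * subseq_len, vocab)

out = HierarchicalDecoder()(torch.randn(2, 256))
```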
Conference Paper
We propose the Probabilistic YIN (PYIN) algorithm, a modification of the well-known YIN algorithm for fundamental frequency (F0) estimation. Conventional YIN is a simple yet effective method for frame-wise monophonic F0 estimation and remains one of the most popular methods in this domain. In order to eliminate short-term errors, outputs of frequency estimators are usually post-processed, resulting in a smoother pitch track. One shortcoming of YIN is that such post-processing cannot fall back on alternative interpretations of the signal because the method outputs precisely one estimate per frame. To address this problem, we modify YIN to output multiple pitch candidates with associated probabilities (PYIN Stage 1). These probabilities arise naturally from a prior distribution on the YIN threshold parameter. We use these probabilities as observations in a hidden Markov model, which is Viterbi-decoded to produce an improved pitch track (PYIN Stage 2). We demonstrate that the combination of Stages 1 and 2 raises recall and precision substantially. The additional computational complexity of PYIN over YIN is low. We make the method freely available online as an open-source C++ library for Vamp hosts.
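librosa also ships an implementation of probabilistic YIN; a usage sketch is below. The example clip is librosa's bundled trumpet recording and stands in for any monophonic audio.

```python
# Usage sketch of probabilistic YIN via librosa's implementation, which follows
# the same two-stage idea: per-frame pitch candidates, then Viterbi smoothing.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, sr=sr,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
)
# f0 is NaN in unvoiced frames; voiced_prob carries the frame-wise voicing
# probability produced by the probabilistic stage.
print(f0[:10], voiced_prob[:10])
```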
Article
This article proposes a method for the automatic transcription of the melody, bass line, and chords in polyphonic pop music. The method uses a frame-wise pitch-salience estimator as a feature extraction front-end. For the melody and bass-line transcription, this is followed by acoustic modeling of note events and musicological modeling of note transitions. The acoustic models include a model for the target notes (i.e., melody or bass notes) and a background model. The musicological model involves key estimation and note bigrams that determine probabilities for transitions between target notes. A transcription of the melody or the bass line is obtained using Viterbi search via the target and the background note models. The performance of the melody and the bass-line transcription is evaluated using approximately 8.5 hours of realistic polyphonic music. The chord transcription maps the pitch salience estimates to a pitch-class representation and uses trained chord models and chord-transition probabilities to produce a transcription consisting of major and minor triads. For chords, the evaluation material consists of the first eight Beatles albums. The method is computationally efficient and allows causal implementation, so it can process streaming audio. Transcription of music refers to the analysis of an acoustic music signal for producing a parametric representation of the signal. The representation may be a music score with a meticulous arrangement for each instrument or an approximate description of melody and chords in the piece, for example. The latter type of transcription is commonly used in commercial songbooks of pop music and is usually sufficient for musicians or music hobbyists to play the piece. On the other hand, more detailed transcriptions are often employed in classical music to preserve the exact arrangement of the composer.
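The decoding step can be illustrated with a generic Viterbi search: per-frame acoustic log-likelihoods for each note state are combined with note-transition log-probabilities from a musicological model. The matrices below are random placeholders, not the paper's trained models.

```python
# Generic Viterbi search of the kind used for the melody/bass decoding step.
# Observation and transition matrices here are random placeholders.
import numpy as np

def viterbi(log_obs, log_trans):
    """log_obs: (T, S) frame log-likelihoods; log_trans: (S, S) transition log-probs."""
    T, S = log_obs.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_obs[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # indexed (previous, next)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_obs[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(viterbi(np.log(rng.dirichlet(np.ones(5), size=8)), np.log(np.full((5, 5), 0.2))))
```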
Song2Guitar: A difficulty-aware arrangement system for generating guitar solo covers from polyphonic audio of popular music
  • Shunya Ariga
  • Satoru Fukayama
  • Masataka Goto
Melody transcription via generative pre-training
  • Chris Donahue
  • John Thickstun
  • Percy Liang
FIGARO: Generating symbolic music with fine-grained artistic control
  • Dimitri von Rütte
  • Luca Biggio
  • Yannic Kilcher
  • Thomas Hofmann
POP909: A pop-song dataset for music arrangement generation
  • Ziyu Wang
  • Ke Chen
  • Junyan Jiang
  • Yiyi Zhang
  • Maoran Xu
  • Shuqi Dai
  • Xianbin Gu
  • Gus Xia
Summarizing and comparing music data and its application on cover song identification
  • Diego Furtado Silva
  • Felipe Falcao
  • Nazareno Andrade
Groove2groove: One-shot music style transfer with supervision from synthetic data
  • Ondřej Cífka
  • Umut Şimşekli
  • Gaël Richard
MT3: Multi-task multitrack music transcription
  • Josh Gardner
  • Ian Simon
  • Ethan Manilow
  • Curtis Hawthorne
  • Jesse Engel
Da-TACOS: A dataset for cover song identification and understanding
  • Furkan Yesiler
  • Chris Tralie
  • Albin Andrew Correya
  • Diego F Silva
  • Philip Tovstogan
  • Emilia Gómez Gutiérrez
  • Xavier Serra
Sequence-to-sequence piano transcription with Transformers
  • Curtis Hawthorne
  • Ian Simon
  • Rigel Swavely
  • Ethan Manilow
  • Jesse Engel
Adoption of AI technology in music mixing workflow: an investigation
  • Soumya Sai Vanka
  • Maryam Safi
  • Jean-Baptiste Rolland
  • György Fazekas
Automatic piano transcription with hierarchical frequency-time Transformer
  • Keisuke Toyama
  • Taketo Akama
  • Yukara Ikemiya
  • Yuhta Takida
  • Wei-Hsiang Liao
  • Yuki Mitsufuji