As the pandemic drags on and industries suffer under COVID-19, technology is increasingly being deployed to cope with it, and people are looking to improve their skills (the now-famous quarantine activity) and keep themselves engaged.
Speaking of promising technologies in this tough time, last week OpenAI launched “Jukebox”. Yes, you heard correctly: AI is now stepping into the province of music creation. (Don’t be amazed; music is itself a province, an integral part of human culture that has evolved into a broad diversity of forms.)
“Provided with genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch” — OpenAI, Jukebox announcement
Hardly anyone among us isn’t enraptured by music, and of course that includes me. Let me ask: aren’t you?
So, moving on, in this blog we’ll learn:
What is “Jukebox”?
What approach was used?
How are AI and ML used to generate music? A little glimpse.
Generative models for music
Let’s begin… 1, 2, 3, and go!!!!!
A year ago, OpenAI announced MuseNet, a deep neural network capable of producing four-minute musical compositions with 10 different instruments and blending styles from country to Mozart to the Beatles.
And now, it brings a new system, “Jukebox”:
“It is a machine learning framework that creates music, including rudimentary songs, as raw audio across a range of genres and musical styles.”
More specifically, it can construct songs across widely diverse genres such as hip-hop, jazz, and rock, picking up the melody, rhythm, and long-range composition for a variety of instruments, along with the manner and tone of the singers to be generated with the music.
Here is a classic-pop sample in Frank Sinatra’s style, available on SoundCloud; the song is about Christmas time being “hot tub time”…
Jukebox’s autoencoder compresses audio with an approach known as the Vector Quantized Variational AutoEncoder (VQ-VAE). Three levels of VQ-VAE compress 44kHz raw audio by roughly 8x, 32x, and 128x.
The top-level encoding (128x) retains only essential musical information such as pitch, volume, and timbre; the bottom-level encoding (8x), on the other hand, yields the highest-quality reconstruction from its “musical codes”.
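The compression idea above can be sketched in a few lines. This is a minimal, illustrative stand-in, not Jukebox’s real model: the 64-dimensional latents and the 512-entry codebook are hypothetical, and only the hop sizes (8x, 32x, 128x over 44kHz audio) come from the post. It shows the “VQ” step: snapping each encoder output to its nearest codebook entry, so every second of audio becomes a short sequence of discrete codes.

```python
import numpy as np

SAMPLE_RATE = 44100
HOPS = {"bottom": 8, "middle": 32, "top": 128}  # compression factors per level

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = ((latents ** 2).sum(1)[:, None]
          + (codebook ** 2).sum(1)[None, :]
          - 2 * latents @ codebook.T)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))          # hypothetical codebook
for level, hop in HOPS.items():
    steps = SAMPLE_RATE // hop                 # codes per second of audio
    latents = rng.normal(size=(steps, 64))     # stand-in encoder output
    codes = quantize(latents, codebook)
    print(f"{level}: {hop}x -> {len(codes)} codes/second")
```

One second of audio shrinks from 44,100 samples to 5,512 codes at the bottom level but only 344 at the top, which is why the top level keeps just the coarse musical information.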
To tailor Jukebox to specific genres and artists, high-level transformer models were trained on the task of predicting compressed audio tokens, which lets Jukebox capture the distinctive features of any musical style.
In addition, OpenAI designed an encoder to give the framework extra lyrical context: a layer in the Jukebox music model sends queries to the lyrics encoder and receives keys and values in return, enabling Jukebox to align the appropriate sequence of lyrics with the music.
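That query/key/value exchange is ordinary cross-attention, and it can be sketched with plain NumPy. Everything here is a toy stand-in: the widths, sequence lengths, and random projection weights are hypothetical. The point is the wiring the post describes: queries come from the music side, keys and values from the lyrics encoder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d = 32                                    # hypothetical model width
music = rng.normal(size=(100, d))         # music-token states: queries come from here
lyrics = rng.normal(size=(40, d))         # lyrics-encoder states: keys and values

Wq, Wk, Wv = rng.normal(size=(3, d, d))   # hypothetical projection weights
Q, K, V = music @ Wq, lyrics @ Wk, lyrics @ Wv

weights = softmax(Q @ K.T / np.sqrt(d))   # (100, 40): each music step attends over the lyrics
out = weights @ V                         # lyric-conditioned features fed back to the music model
```

Each row of `weights` sums to 1, so every music timestep produces a weighted mixture of lyric features; that is how the generated vocals track the supplied text.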
Jukebox models demand an immense amount of computation and time to train:
The VQ-VAE, with around 2 million parameters, was trained on 256 Nvidia V100 graphics cards for three days.
The upsamplers, with more than 1 billion parameters, were trained on 128 Nvidia V100 graphics cards for two weeks.
The top-level prior, with around 5 billion parameters, was trained on 512 Nvidia V100 graphics cards for four weeks.
In this respect, Jukebox is a generational leap over OpenAI’s prior work, MuseNet, which explored generating music from a huge volume of MIDI data.
Jukebox also models raw audio directly, gaining control over the overall structure and diversity of a song while mitigating errors over long, medium, and short time scales.
Jukebox is trained on massive music datasets spanning practically every genre. Since the AI can produce songs that are, in many cases, nearly identical to the artists it was trained on, Jukebox explores how well it can emulate the patterns and genres of music. It also attempts to imitate the singing style of particular singers.
More precisely, the OpenAI team chose to model music from raw audio, and they trained Jukebox in two stages:
First, researchers used convolutional neural networks, machine learning models that are particularly good at identifying patterns in images and audio, to encode and compress raw audio from 1.2 million songs and their corresponding metadata. The metadata for each song included information such as genre, artist, album, and any associated playlist keywords.
Second, they used a transformer to generate new compressed audio, which was then converted back into raw audio by upsampling.
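The two-stage pipeline above can be sketched end to end with stub functions. This is a schematic only: the random “encoder”, the uniform-sampling “transformer”, and the repeat-based “upsampler” are placeholders for the real learned models, and the 128x hop and 2048-entry codebook are assumptions for illustration. What the sketch preserves is the flow: raw audio → discrete codes → extended code sequence → raw audio again.

```python
import numpy as np

rng = np.random.default_rng(2)
HOP = 128            # illustrative top-level compression factor
CODEBOOK = 2048      # illustrative codebook size

def encode(audio):
    """Stand-in for stage 1: the convolutional encoder compresses
    raw audio into a short sequence of discrete codes."""
    return rng.integers(0, CODEBOOK, size=len(audio) // HOP)

def generate(prompt_codes, n_new):
    """Stand-in for stage 2: the transformer extends the code sequence
    one token at a time (here sampled uniformly, not learned)."""
    codes = list(prompt_codes)
    for _ in range(n_new):
        codes.append(int(rng.integers(0, CODEBOOK)))
    return np.array(codes)

def decode(codes):
    """Stand-in for upsampling: expand each code back to HOP samples."""
    return np.repeat(codes.astype(float) / CODEBOOK, HOP)

audio = rng.normal(size=44100)     # one second of stand-in raw audio
codes = encode(audio)              # 344 top-level codes
longer = generate(codes, 100)      # "transformer" continues the sequence
new_audio = decode(longer)         # back at raw-audio resolution
```

The design point is that the transformer never sees waveforms: it works entirely in the short code sequence, which is what makes minutes-long generation tractable.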
For implementation details, you can refer to the code released with “Jukebox: A Generative Model for Music”.
Generative models have long been used to produce music. Earlier approaches such as rule-based systems, chaos and self-similarity, and constraint programming generated music symbolically, mainly as piano rolls that lay out the pitch, velocity, timing, and instrument of every single note to be performed. This symbolic approach makes the modeling task easier by working in a low-dimensional space.
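To make the piano-roll idea concrete, here is a minimal sketch of that representation. The note list and grid size are invented for illustration: a piano roll is just a pitch-by-time matrix whose entries record note velocity, which is why it is so much lower-dimensional than a waveform.

```python
import numpy as np

# Each hypothetical note: (MIDI pitch, start step, duration in steps, velocity)
notes = [(60, 0, 4, 100),   # C4
         (64, 4, 4, 90),    # E4
         (67, 8, 8, 110)]   # G4

steps, pitches = 16, 128                        # 16 time steps, full MIDI pitch range
roll = np.zeros((pitches, steps), dtype=np.uint8)
for pitch, start, dur, vel in notes:
    roll[pitch, start:start + dur] = vel        # hold the note for its duration

print(roll.shape)        # (128, 16)
print(roll[60, :4])      # [100 100 100 100]
```

Compare the sizes: this whole phrase fits in a 128x16 grid, whereas one second of 44kHz raw audio alone is 44,100 samples, which is exactly the gap that makes non-symbolic modeling so much harder.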
In parallel, researchers have used non-symbolic approaches that produce music directly as a piece of audio. Since the space of raw audio is extremely high-dimensional and presents a huge volume of information to the model, the non-symbolic approach is far more challenging.
Among recently published data-driven approaches: DeepBach and CoCoNet use Gibbs sampling to generate notes in the style of Bach chorales; MidiNet and MuseGAN use generative adversarial networks; MusicVAE and HRNN use hierarchical recurrent networks; and there are many more.
OpenAI has been active in AI-generated music for a few years, having already developed MuseNet to create full-length MIDI tracks with varied tones and compositions. The newly released Jukebox was trained in a similar spirit to MuseNet, but it goes a step further by delivering vocal parts and lyrics across a huge bandwidth of genres.
OpenAI has also built a website that catalogs all the samples produced by Jukebox so that anyone can browse them easily. Jukebox keeps track of millions of timesteps per song, compared with the roughly one thousand timesteps that OpenAI’s language generator GPT-2 tracks for a piece of writing.
Did Kanye West, Katy Perry, Lupe Fiasco and the estates of Aretha Franklin, Frank Sinatra and Elvis Presley give OpenAI permission to use their audio recordings as training material for a voice-synthesis/musical-composition/lyric-writing algorithm? My guess is no.
— Cherie Hu (@cheriehu42) April 30, 2020