Google recently released one of its most spectacular features yet: song recognition through humming. Just ask the Google Assistant 'what song is this?' and hum the tune to it, and it will identify the song, all through artificial intelligence, as is usual at Google.
We are going to explain how Google achieved this, since the company described it thoroughly on its own blog: from a hummed melody to an exact search result with the song you were thinking of. How is that possible?
Isolating the melody to get the result
Most music recognition models work by taking a sample of the sound, transforming it into a spectrogram (like the one you see above) and comparing that spectrogram with the ones in their database. The problem with hums is that their spectrogram contains less information, since it only carries the melody.
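To make that first step concrete, here is a minimal sketch of how audio becomes a spectrogram: the signal is cut into short overlapping frames and each frame is transformed into its frequency content. The frame length, hop size and sample rate below are illustrative choices, not Google's actual parameters.

```python
# Minimal spectrogram sketch using NumPy's FFT.
# Frame length, hop size and sample rate are illustrative, not Google's values.
import numpy as np

def spectrogram(audio, frame_len=1024, hop=256):
    """Short-time Fourier magnitude: rows = time frames, cols = frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# A 440 Hz tone sampled at 16 kHz concentrates its energy
# near frequency bin 440 * 1024 / 16000 ≈ 28.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)        # (time frames, frame_len // 2 + 1 frequency bins)
print(spec[0].argmax())  # peak frequency bin, ≈ 28
```

A studio recording fills many of those frequency bins at once (voice, instruments, percussion); a hum lights up little more than the melody line, which is exactly the information gap the article describes next.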
A complete spectrogram captures instruments, lyrics, rhythm and all the key elements of a song. As you can see in the image, which shows the spectrogram of 'Bella Ciao', the difference between the studio spectrogram and the humming spectrogram is quite clear. To cope with that missing information, Google focuses on the melody alone, so the other elements of the song do not matter.
In very broad terms, so as not to get too technical: Google has a database of more than 50 million spectrograms against which it can find the songs we hum, using only the song's dominant melody. All of this works even with background noise, since the model focuses solely on that melody.
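The lookup step can be sketched as a nearest-neighbor search: each song is reduced to a fixed-size melody embedding, and the hum's embedding is matched against the database. The names, dimensions and toy database below are illustrative, not Google's; at 50 million entries a real system would use an approximate nearest-neighbor index rather than this brute-force scan.

```python
# Hypothetical retrieval sketch: match a hum embedding against song embeddings
# by cosine similarity. All names and sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 128

# Toy "database": song title -> melody embedding.
database = {f"song_{i}": rng.normal(size=EMB_DIM) for i in range(1000)}

def best_match(hum_embedding, database):
    """Return the song whose embedding is most similar to the hum's."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(database, key=lambda name: cosine(hum_embedding, database[name]))

# A hum whose embedding lands near song_42's (plus noise) retrieves song_42.
hum = database["song_42"] + rng.normal(scale=0.1, size=EMB_DIM)
print(best_match(hum, database))  # song_42
```

The point of the embedding space is exactly this tolerance: background noise and an imperfect singer shift the hum's vector a little, but not enough to land closer to a different song.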
Training the model
To achieve this, Google made certain modifications to the Now Playing and Sound Search recognition models, which have been with us for a long time. For training, it used a pair system (hummed audio and studio recording), generating an embedding of each. In other words?
Google exposes its neural network to these pairs millions and millions of times, until it is able to generate humming embeddings similar to those of the reference recording. With this system, Google claims it is able to recognize 4 out of 5 songs and, in our tests, the accuracy is indeed quite high.
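The "similar embeddings" objective described above can be illustrated with a triplet-style loss: the training signal pushes a hum's embedding toward its reference recording and away from unrelated recordings. This is a common formulation for pair-based embedding training, used here as a hedged stand-in; Google's actual network and loss differ in detail.

```python
# Triplet-style loss sketch: zero once the hum is at least `margin`
# closer to its own song than to a wrong one. Numbers are stand-ins.
import numpy as np

def triplet_loss(hum_emb, positive_emb, negative_emb, margin=0.2):
    """Penalize hums that sit closer to the wrong recording than to their own."""
    d_pos = np.linalg.norm(hum_emb - positive_emb)
    d_neg = np.linalg.norm(hum_emb - negative_emb)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([1.0, 0.0])       # embedding of the hum
same_song = np.array([0.9, 0.1])    # embedding of the matching recording
other_song = np.array([-1.0, 0.0])  # embedding of an unrelated recording

print(triplet_loss(anchor, same_song, other_song))  # 0.0: well separated
print(triplet_loss(anchor, other_song, same_song))  # > 0: would be penalized
```

Repeating this over millions of pairs is what gradually reshapes the embedding space so that a hum and its studio original end up as neighbors.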
As we have said, Google needed millions and millions of hummed songs to compare with the originals, so it had to simulate hums with software called SPICE, capable of extracting the pitch of a song. To give you an idea, this is the original audio and this is the audio generated by the software. The software's output is then refined by a neural network, making it even cleaner.
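The idea of "simulating a hum" can be sketched as a two-step process: estimate the dominant pitch of each frame, then resynthesize that pitch track as a bare tone, discarding lyrics, instruments and timbre. The crude FFT-peak estimator below is only an illustrative stand-in for SPICE, which uses a trained model.

```python
# Illustrative hum simulation: per-frame pitch estimate (a crude stand-in
# for SPICE) resynthesized as a phase-continuous sine tone.
import numpy as np

SR = 16000
FRAME = 1024

def dominant_pitches(audio):
    """Dominant frequency (Hz) of each non-overlapping frame."""
    n = len(audio) // FRAME
    frames = audio[:n * FRAME].reshape(n, FRAME)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(FRAME), axis=1))
    return spectra.argmax(axis=1) * SR / FRAME

def synth_hum(pitches):
    """Render the pitch track back into audio as a plain sine 'hum'."""
    freqs = np.repeat(pitches, FRAME)
    phase = 2 * np.pi * np.cumsum(freqs) / SR
    return np.sin(phase)

# Two sung notes (~440 Hz then ~660 Hz) survive as pitch, even though
# everything else about the recording is thrown away.
t = np.arange(SR) / SR
melody = np.concatenate([np.sin(2 * np.pi * 440 * t),
                         np.sin(2 * np.pi * 660 * t)])
pitches = dominant_pitches(melody)
hum = synth_hum(pitches)
print(np.round(pitches[0]), np.round(pitches[-1]))  # close to 440 and 660
```

That pitch-only output is what lets a synthetic "hum" stand in for a human one during training: both carry the melody and nothing else.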
This explanation should also make clear that, at least in this case, Google did not use user data to build the system. New hums will no doubt serve to keep training the network and make it more precise, but the original method is the one they describe: simulating hummed songs and comparing them with the originals.