
WAV and all other forms of audio files recognition
to MIDI or MUS file

Joel Ellis Rea

Laurier Nappert

1- Is it possible?

 The short answer is: NO! For a longer explanation, read on and see the illustrations at the end of this article.

2- The task to do

I am referring here to people who presumably want to take a recording of a full musical performance (say, an entire opera, complete with orchestra and all vocals, both solos and choruses, or a rock band complete with singers). I infer this from the fact that the request is to convert MP3 and other types of audio files, which would almost always be pre-recorded, and likely copyrighted, full-fledged musical performances. The hope is to get a nice .MUS file out of it that would already have all tracks very accurately transcribed into staves, with lyrics in place for Virtual Singer and maybe even with parameters set up in VS to closely match each individual singer's voice, inflection, accent, etc. All the instruments would already be assigned and modified to match the real instruments in the recording, Digital Reverb effects already applied and tweaked to match the acoustic space of the original recording, etc. etc. This is a very popular feature request for a music program. I agree that it would indeed be cool, but it just isn't possible with today's technology.

3- Why not possible?

Asking for this would be like asking for a line-art type graphics program (Illustrator, FreeHand, CorelDRAW!, Canvas's draw features only, Xara X, etc.) to add a feature to take a scanned photograph and convert all the objects in it to line art objects, complete with Bezier curves and handles, automatically grouping related things together (the wheels of a car would be grouped with its body, leaves with a tree, facial features of a person with the rest of the person, etc.). While autotrace features and programs do exist (Adobe Streamline, for instance), none of them would be capable of handling photographs and recognizing the relationships of objects within them. Most just handle scanned monochrome or, in rare cases, colored line art, converting the scanned bitmap into real line art in .EPS or some similar resolution-independent line-art format.

That analogy applies to audio and music because MIDI (and variants such as .KAR and .ABC), .SEQ, and .MUS files (etc.) are analogous to .EPS, .EMF, QuickDraw, .AI, .FH, .CNV, .WEB, or other line-art formats, while uncompressed or losslessly-compressed .WAV or AIFF or .AU files would be analogous to uncompressed or losslessly-compressed TIFF/.TIF, .RIFF, .PSD, .BMP, .PNG, etc., and .MP3, .ASF, .WMA, .RA, or lossily-compressed .WAV, AIFF, etc. would be analogous to JPEG/.JPG or other lossily-compressed image formats.

In short, bitmap/pixelmap image file formats (no matter how they're compressed) and their analogous digital audio formats are direct, resolution-dependent digitizations of analog information. While they may appear to a viewer or listener to contain multiple independent objects or instrument tracks, actually they are just one image or sound recording (two in the case of stereo audio, and the layers of a .RIFF or .PSD would count as separate pixelmaps, but within each the same restriction applies). The computer just sees them as 1s and 0s that contain only the raw information for that particular point in the image (e.g. RGB or other color values of a single pixel) or audio (individual audio amplitude sample), and need not process them (beyond decompression if they were compressed) to display or play them in a form that the human brain, fed by input from eyes or ears, would recognize as being extremely close to the original analog image or sound source. When you view a .TIFF of the Mona Lisa, you see a beautiful woman against a background of clouds and terrain, but the computer just sees rows of pixels, each pixel having a Red, Green, and Blue (or Cyan, Magenta, and Yellow) value, each value consisting of eight or sixteen bits. When you listen to "Beethoven's Fifth Symphony" as a .WAV file, you hear strings and brass and woodwinds and percussion playing specific notes at specific rhythms, but all the computer "hears" is 8-bit or 16-bit audio samples telling the sound card how to move the speaker cones in time to reproduce the recorded sound.
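As a rough Python sketch of that point (the sample values below are made up purely for illustration), mixing simply adds the sources sample by sample. Only the sums are stored, and two completely different pairs of sources can produce exactly the same recording:

```python
# Hypothetical amplitude samples, invented for this illustration only.
flute = [3, -1, 4, -2]
violin = [1, 5, -3, 2]

# A completely different pair of sources.
oboe = [0, 2, 2, -1]
cello = [4, 2, -1, 1]


def mix(a, b):
    """Mix two mono recordings by adding them sample by sample."""
    return [x + y for x, y in zip(a, b)]


recording_1 = mix(flute, violin)
recording_2 = mix(oboe, cello)

print(recording_1)                 # [4, 4, 1, 0]
print(recording_1 == recording_2)  # True: nothing in the file says who played what
```

Real recordings are far longer and the sources far more structured than this, but the stored file has the same property: it holds the sum, not the parts.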

But image formats such as .AI and .EPS, or sound formats such as .MID and .MUS, do not contain actual image or sound information, but rather the commands needed to generate them. An .EPS of a sketch of the Mona Lisa would have the actual curves of the sketch in a format that the computer understands and can display to the user. The computer can manipulate each curve independently, even where it overlaps other curves, without disturbing them. Likewise, a MIDI file of the Fifth Symphony has tracks or channels for each of the instruments, and the notes and velocities and other commands for each of those in a way the computer understands and can manipulate. You could change the individual notes of an individual instrument without affecting the other sounds that occur during the same time. You can't do that with a .WAV or .MP3.
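To make the contrast concrete, here is a minimal Python sketch of a command-based representation. The `Note` class, the track names and the notes themselves are assumptions invented for this illustration; they are not the actual MIDI or .MUS file layout. The point is only that each instrument is a separate list of commands, so one can be edited without touching the other:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    start_beat: float
    length_beats: float
    midi_pitch: int  # 60 = middle C
    velocity: int    # 1..127


# Hypothetical two-track score.
score = {
    "violin": [Note(0, 1, 67, 90), Note(1, 1, 67, 90),
               Note(2, 1, 67, 90), Note(3, 2, 63, 100)],
    "cello": [Note(0, 5, 48, 70)],
}


def transpose(track, semitones):
    """Return a copy of the track with every note shifted in pitch."""
    return [replace(n, midi_pitch=n.midi_pitch + semitones) for n in track]


edited = dict(score)
edited["violin"] = transpose(score["violin"], 12)  # violin up one octave

print([n.midi_pitch for n in edited["violin"]])  # [79, 79, 79, 75]
print(edited["cello"] == score["cello"])         # True: the cello is untouched
```

No comparable operation exists for a list of mixed audio samples, because there is no "violin" entry to select in the first place.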

4- What exists now

Just as monochrome line-art autotrace programs exist, so too do monophonic audio-to-MIDI programs. As with autotracing, most of them aren't very good even within that mono-source limitation, though a few gems do shine above the general muck (as Streamline does among autotracers). The few that try to exceed the mono-source limitation tend to do even worse jobs than the others. Results are usable only if the .WAV (or .AIFF or .AU or .MP3 or RealAudio or Windows Media Audio or QuickTime Audio or any other audio sampling format, compressed or uncompressed) file in question is of a solo performance on one instrument that plays only one note at a time. It could also be of an individual person singing, humming, "scatting", whistling, etc. a tune, without any form of accompaniment (not even rhythm or a metronome).
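For a feel of why the single-note case is tractable, here is a deliberately crude Python sketch. It is not how Digital Ear or any other real product works; it just counts upward zero crossings of a clean test tone and converts the resulting frequency to the nearest MIDI note number (69 is the A at 440 Hz). The function names and the 44100 Hz sample rate are assumptions chosen for the example:

```python
import math

SAMPLE_RATE = 44100  # samples per second, assumed for this sketch


def sine_tone(freq_hz, seconds):
    """Generate a pure test tone, one note and nothing else."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def estimate_frequency(samples):
    """Estimate the pitch from the spacing of upward zero crossings."""
    crossings = [i for i in range(1, len(samples))
                 if samples[i - 1] < 0 <= samples[i]]
    if len(crossings) < 2:
        return None  # too short or too quiet to say anything
    cycles = len(crossings) - 1
    return cycles * SAMPLE_RATE / (crossings[-1] - crossings[0])


def frequency_to_midi_note(freq_hz):
    """Nearest equal-tempered MIDI note number (A at 440 Hz is note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


tone = sine_tone(440.0, 0.1)
print(frequency_to_midi_note(estimate_frequency(tone)))  # 69
```

Feed the same function a chord or a full band and it would still return a single number, which is exactly the problem: the method assumes there is one pitch to find.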

An inexpensive program called "Digital Ear" can do a quite decent job of translating such files into a single MIDI track. Unlike its competition, Digital Ear can track and respond to not only pitch changes, but also volume and brightness changes, translating those into appropriate MIDI events (e.g. volume to MIDI Expression [CC#11] or Breath Controller [CC#2] Continuous Controller messages, and brightness to Brightness [CC#74] or Harmonic Content [CC#71, if I remember right]). The resulting .MID file can, of course, be imported into Melody or Harmony Assistant, or any other MIDI-compatible program.
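For readers who have never looked inside such events, here is a small Python sketch of what a Continuous Controller message looks like on the wire: a status byte (0xB0 plus the channel number) followed by the controller number and a value from 0 to 127. The helper names and the 0.0 to 1.0 "level" scale are assumptions made for this example and say nothing about how Digital Ear itself is implemented:

```python
EXPRESSION = 11  # CC#11
BRIGHTNESS = 74  # CC#74


def level_to_cc_value(level):
    """Map a measured level in the range 0.0..1.0 to a 7-bit value 0..127."""
    return max(0, min(127, round(level * 127)))


def control_change(channel, controller, value):
    """Build a 3-byte MIDI Control Change message (channel is 0..15)."""
    if not (0 <= channel <= 15 and 0 <= controller <= 127 and 0 <= value <= 127):
        raise ValueError("channel must be 0..15, controller and value 0..127")
    return bytes([0xB0 | channel, controller, value])


# Full volume on channel 0, sent as Expression.
print(control_change(0, EXPRESSION, level_to_cc_value(1.0)).hex())  # b00b7f
# A dull, dark tone, sent as Brightness.
print(control_change(0, BRIGHTNESS, level_to_cc_value(0.0)).hex())  # b04a00
```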

But if you want to be able to take a .WAV of, say, a rock or jazz band performance, or the Mormon Tabernacle Choir singing the Hallelujah Chorus with full orchestra and pipe organ accompaniment, or even a Barbershop Quartet doing a traditional Tin Pan Alley song, and get all of that translated nicely into separate music tracks or notation (let alone lyrics!), then no, that just cannot be done with today's technology, nor is it likely to be possible with any technology in the foreseeable future. Some programs claim to be able to handle polyphonic audio, but in practice they can only handle source files of a single polyphonic instrument (e.g. a piano or acoustic guitar), preferably played in an anechoic chamber (to eliminate reverb that might be confused with additional notes), of mellow instruments low in harmonics (e.g. a Baldwin grand piano, not a Kawai spinet piano, or a nylon-stringed guitar, not a steel-stringed one), etc., and then only if the settings are tweaked exactly right (which is not at all an easy task), etc. A true polyphonic audio-to-MIDI converter that actually works is many years down the road, and will require CPUs many dozens of times more powerful than today's Pentium 4s or Athlon XPs or PowerPC G4s (or even Itaniums and Hammers and G5s), as well as better software technology and algorithms.
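One reason harmonics cause so much trouble can be shown with a little arithmetic. The Python sketch below assumes idealized instruments whose overtones sit at exact whole-number multiples of the fundamental, which real instruments only approximate. Under that assumption, every frequency produced by an A at 440 Hz is also produced by the A an octave below it, so a spectrum alone cannot say whether one note or two were played:

```python
def harmonics(fundamental_hz, count):
    """Frequencies of the first `count` harmonics of an idealized tone."""
    return {fundamental_hz * k for k in range(1, count + 1)}


low_a = harmonics(220, 16)   # A at 220 Hz, harmonics up to 3520 Hz
high_a = harmonics(440, 8)   # A at 440 Hz, harmonics up to 3520 Hz

# Every frequency of the upper note is already present in the lower note.
print(high_a <= low_a)                # True
# So the mix contains no frequency that the low note alone does not.
print((low_a | high_a) == low_a)      # True
```

Only the relative strengths of those shared frequencies differ, and those also depend on the instrument, the player and the room.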


5- What about the future?

Will it ever be possible to do what I described in the first paragraph? As I said, not with typical computer technology. Neural networks, though, are another matter. Most small children can hear a piece of complex music and pick out the people singing words, and hear the individual instruments (or sections of the same instrument playing the same notes) from the mix - even if they don't know the names of the instruments, they can still hear that the tones made by a flute sound very different from those made by a violin, which in turn are different from those made by an electric guitar with high distortion or fuzz effects. Furthermore, children do this in real time, and don't have to think about performing Fast Fourier Transforms and other complex math analyses on the analog audio waveforms coming into their brains via the cochlear nerves in their inner ears in response to vibrations of their eardrums.

Why then is it so hard for computers? Because computers are linear, performing tasks in sequence. Tasks like this, though, require more of a pattern-matching approach, which the human brain excels at. Neural networks work in a way similar to the brain. Another possibility is quantum computers (which are molecules - I saw a photo of a vial containing trillions of such molecules, and it looked like a tiny vial of over-colored lemon-flavored Kool-Aid, not at all like what you would expect a computer to look like!), which are also (theoretically) very good at non-linear tasks.


But those are still well into the future: it will be at least a decade or two before we have any that can process any musical audio file and spit out a fully accurate representation in some command- or object-based format (e.g. MIDI or .MUS).


A computer based on binary digital von Neumann/Babbage technology (and that's the sort of computer most people would recognize as such, whether it's a microcontroller in a VCR or microwave oven, or a massive supercomputer in the Pentagon, or anything in between such as a desktop or laptop personal computer, be it a Windows PC or a PowerMac G4, or even future generations such as a next-generation 64-bit Itanium or AMD Hammer or PowerPC G5 machine), no matter how fast, simply cannot do such a task, at least not in the same way that our brains do. Our brains are not binary digital von Neumann/Babbage machines. They don't work like binary digital von Neumann/Babbage machines, and, more importantly, the converse is also true.


The same thing applies visually: you can look at a photograph of someone you know and instantly recognize who that person is without even thinking about it, let alone performing complex edge-detection and content-extraction analyses, but even the most powerful analytical imaging software has to go through such steps to perform facial recognition, and even then doesn't get it right nearly as often or as easily as even a toddler would.


To get an idea of just what a computer would have to go through to be able to do this task, try to switch senses: our visual cortex is no more designed to process sound and extract instrument, note, lyrics, etc. information out of that than a computer is. So, have a friend digitally record three sample .WAV (or .AIFF if you use a Mac) files: a recording of a live musical concert performance with vocals and multiple instrumentals, a recording of the inside of a noisy factory, and a recording of a crowded mall on a peak Christmas shopping day. Your friend must give those files plain names that do not describe their content, such as "A.WAV", "B.WAV" and "C.WAV", assigned in random order rather than in the order listed above. Your task is simple: you are to completely mute your computer's sound system (unplug the speakers if you have to), load the files into a wave editing program that lets you see a visual representation of the waves, and try to tell which file is the music, which the noisy factory or similar non-musical noise, and which the crowded mall. If you can even do that much, I will be impressed. Now, using only your eyes, try to pick out the individual notes and instruments from the music recording, or what individual people are saying in the recording of the crowd.


Here is a graphic of two different sounds. What are they, just by looking at them? Here is a hint: you have three choices. They are both music, they are both spoken words, or one is music and the other spoken words. If one is that of a person speaking and the other one is music, can you tell them apart? What are the spoken words, if they are spoken words? What notes are being played by which instrument, if they are music?

This is the music

These are spoken words

Try the reverse experiment, too: save a picture in uncompressed bitmap format, then load it into an audio program as raw audio samples and play it back, this time with the audio on, and see if you can "hear" the graphical nature of the image in any way.
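If your audio program cannot import raw data, the reverse experiment can be sketched in a few lines of Python using the standard `wave` module. The function name, the 8000 Hz sample rate and the stand-in "pixel" data are assumptions made for this example. To try it on a real picture, you would read the bytes of your own uncompressed bitmap file instead:

```python
import io
import wave


def bytes_to_wav(raw, sample_rate=8000):
    """Wrap arbitrary bytes in a WAV container as 8-bit unsigned mono PCM."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(1)   # one byte per sample
        wav.setframerate(sample_rate)
        wav.writeframes(raw)
    return buffer.getvalue()


# Stand-in for image data: a 64x64 grayscale pattern, one byte per pixel.
pixels = bytes((x * y) % 256 for y in range(64) for x in range(64))

wav_bytes = bytes_to_wav(pixels)
print(wav_bytes[:4])                   # b'RIFF'
print(len(wav_bytes) - len(pixels))    # size of the header that was added
```

Writing `wav_bytes` to a file and playing it gives you each pixel value as one audio sample. The computer is perfectly happy to do this, which says a lot about how little it "knows" about what the numbers mean.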


So, it IS possible to do simple pitch-to-MIDI conversion of monophonic audio sources, and even simple polyphonic sources of one instrument that plays chords (piano, guitar, etc.). For now, Digital Ear remains my favorite monophonic audio-to-MIDI program, because it does more than just note-to-MIDI. It tracks and translates pitch fluctuations such as vibrato, scooping, guitar pulls, trombone and steel guitar slides, etc. into MIDI Pitch Bends and Portamentos, and even tracks volume and brightness changes, converting those into MIDI Continuous Controller events of your choice. So, you could wordlessly sing a tune into your computer microphone, using vowel sounds such as "oo" and "ah" to do brightness changes, and vary your volume to match the way you would want, say, a saxophone track to sound, and Digital Ear would translate that into MIDI.
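As a rough idea of what "translating a pitch fluctuation into a Pitch Bend" involves, here is a Python sketch. A MIDI Pitch Bend message carries a 14-bit value (0 to 16383, with 8192 meaning "no bend") split into two 7-bit bytes. The helper name is invented for this example, and the bend range of plus or minus 2 semitones is an assumption: it is a common default, but the receiving synthesizer decides what the full range means. This is not a description of Digital Ear's internals:

```python
def pitch_bend(channel, semitones, bend_range=2.0):
    """Build a 3-byte MIDI Pitch Bend message for a detune in semitones.

    bend_range is the assumed full-scale range of the receiving synth.
    """
    value = round(8192 + (semitones / bend_range) * 8192)
    value = max(0, min(16383, value))   # clamp to 14 bits
    lsb = value & 0x7F                  # low 7 bits
    msb = value >> 7                    # high 7 bits
    return bytes([0xE0 | channel, lsb, msb])


print(pitch_bend(0, 0.0).hex())    # e00040  (centered, no bend)
print(pitch_bend(0, 1.0).hex())    # e00060  (one semitone sharp)
print(pitch_bend(0, -2.0).hex())   # e00000  (full bend down)
```

A vibrato or a trombone slide then becomes a stream of such messages, one for each measured deviation from the note's nominal pitch.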


6- What about converting MIDI or MUS files to audio type files?

It is VERY EASY to go the other way, namely, to turn a MIDI file into a WAV, just as it is easy to turn an .EPS into a bitmap (TIFF, etc.) -- in fact, since the MIDI and .EPS files don't really contain sound or an image, respectively, but rather the commands used to REPRODUCE the sound or image, you can't even HEAR a MIDI file OR SEE a .EPS file UNTIL it has been converted to wave audio or a bitmap (respectively), even if only temporarily. Doing it permanently is only a matter of storing the results of the conversion that has to be done anyway for the results of the commands contained in the file to be humanly perceivable!


For instance, when using Adobe Illustrator or any similar program, you are NOT seeing the actual Bezier curves on the screen. You are seeing a rendered bitmap of them, since the screen is inherently a bitmapped device (in this case, there ARE exceptions: pen plotters, X-Y vector-scan monitors, etc. -- but in general what I said holds true). When you print out such a file on any ordinary printer (even a PostScript laser printer), you are seeing dots that were produced by a rasterizer in the printer, which converted the graphic commands into a bitmap.


When you play a MIDI file, however you do so, the MIDI device interprets the commands and generates sound waves. Once generated, they are of the same nature as sound waves generated by recorded audio files.


In both cases, the output of the conversion can be saved to a file, and the resulting file is an ordinary uncompressed or compressed bitmapped graphics file (TIFF, JPG, etc.) or an audio file (WAV, AIFF, MP3, etc.), respectively.
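To show how little is needed for this direction, here is a Python sketch that turns a short list of note commands into WAV data using plain sine waves. The note list, the 22050 Hz sample rate and the function names are assumptions made for this example. A real MIDI player uses sampled or synthesized instrument sounds rather than bare sine waves, but the principle of "follow the commands, write down the samples" is the same:

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # assumed for this sketch

# Hypothetical command list: (start in seconds, duration in seconds, MIDI note).
notes = [(0.0, 0.2, 67), (0.25, 0.2, 67), (0.5, 0.2, 67), (0.75, 0.8, 63)]


def midi_to_hz(note):
    """Equal-tempered frequency of a MIDI note number (69 is 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


def render(note_list):
    """Follow the commands and produce one list of amplitude samples."""
    total = max(start + duration for start, duration, _ in note_list)
    samples = [0.0] * int(total * SAMPLE_RATE)
    for start, duration, note in note_list:
        first = int(start * SAMPLE_RATE)
        freq = midi_to_hz(note)
        for i in range(int(duration * SAMPLE_RATE)):
            if first + i < len(samples):
                samples[first + i] += 0.5 * math.sin(
                    2 * math.pi * freq * i / SAMPLE_RATE)
    return samples


def to_wav(samples):
    """Store the samples as 16-bit mono PCM in WAV format."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)
    return buffer.getvalue()


wav_bytes = to_wav(render(notes))
print(wav_bytes[:4])  # b'RIFF'
```

Saving `wav_bytes` to disk is the "permanent" conversion described above. Going back from those samples to the four entries of `notes` is the hard direction.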


7- What does Myriad Software have to say?

In the very first version of Harmony Assistant, in 1994, we implemented a frequency (note) recognition module. It worked quite well when only one instrument played only one note at a time, as is the case for voice and all wind instruments. It also worked for most polyphonic instruments that play chords, for example a single guitar, a single piano, etc. But it did not work at all on complex orchestration, drums, etc.


Because many users tried to use this feature in a way it was not built for, then complained to our tech support that it did not work as they expected, we removed it from subsequent versions of the program. Of this original set of features, only the "Tune-up" option now remains in Harmony Assistant, and it has remained untouched for 7 years!


Because computer power keeps increasing, we took a look at what is available now in this domain. It seems that things have not improved much in recent years. Many programs can process mono-instrument, mono-pitched samples, and some of them try to recognize notes for a single polyphonic instrument, with more or less success, but *none* of them is capable of outputting a clean score from a complex source, such as an orchestral piece or even a small rock band digital track. We can really wonder whether it is actually possible. In my opinion, it is not possible at present to get good results for such tracks.


If you carefully read the usage notes of most of these pieces of software, you will notice they clearly state that the software works only for a "solo" instrument and is not intended to process complex digital tracks taken from a CD. But the first thing each user does to evaluate the software (I did this too) is to try it on pieces as complex as Beethoven's 9th Symphony, or on an excerpt of the latest Iron Maiden CD, according to taste. And of course it does not work, and users complain either on the newsgroup or to the software company's tech support ("It *didn't* do this").


It is for this reason that we do not want to implement such a feature in Harmony Assistant. We know that even if we display in big red letters the way this feature is designed to work, most users will try it beyond its limits, then bug tech support...


So we will probably never implement a recognition feature of the kind most of you expect. In any case, such a module wouldn't be capable of outputting a 30-staff score from a symphony. If we do something in this field, it will be as part of a more global module, for example a voice-oriented one, so that there can be no confusion about the limits of the process.




Analogy between graphic programs and music programs




This is your computer screen magnified many times. Each square is called a pixel and is in fact a very tiny dot that can be ON (white or lit) or OFF (black or shut).

If you type the letter "a" in a "paint" type program, here is what you see on screen, again magnified many times. Your file will be a BMP, PICT, PSD, TIFF or similar type of file. This is similar to an MP3, WAV or other type of audio file. In this case, we would have only one instrument playing one note, nothing more. If we had a word, it would be like having as many notes as there are letters.



This is what the computer sees: a series of 1s and 0s. A 1 indicates that this dot of light or pixel is ON and a 0 indicates that the pixel is OFF.
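As a small Python sketch of the same idea (the 5 by 5 glyph below is a rough stand-in invented for this example, not the actual bitmap from the illustration), a one-bit image is nothing more than rows of ON and OFF values:

```python
# Hypothetical 5x5 drawing of a lowercase "a": '#' is a lit pixel, '.' is dark.
GLYPH = [
    ".###.",
    "....#",
    ".####",
    "#...#",
    ".####",
]

# What is actually stored: one bit per pixel, row after row.
bits = ["".join("1" if pixel == "#" else "0" for pixel in row) for row in GLYPH]

for row in bits:
    print(row)
# 01110
# 00001
# 01111
# 10001
# 01111
```

Nothing in those 25 bits says "this is a letter", let alone which one. That knowledge exists only in the eye of the reader.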

If you want to convert this file to an Illustrator's EPS or similar file type automatically, you would have to autotrace the bitmap file. This is just like trying to convert a WAV file to a MIDI or MUS file automatically. The resulting file is editable at will.



Your converted file would look similar to this. This compares to the end result of your one-note, one-instrument MP3 file converted to MIDI. The blue lines and dots indicate points from which the drawing can be modified without loss of resolution or quality.

This is what you hoped to get with your automatic conversion. This is just what the original MUS or MIDI file would look like if it had been created as such in the first place. This is what you see. To see what the computer sees, click here. You will see a series of commands that tell the computer what to do and how to do it instead of a series of 1s and 0s.




So, this is what you started with, with the hope that your automatic conversion would give you...

...this MUS or MIDI file: a file that can be edited at will, but...

...this is the end result of your automatic conversion and it is the very best you can hope for with only one instrument playing one note in the best of conditions.


Now, this is similar to the recording of an opera, a jazz big band or the Rolling Stones playing and singing with many voices, instruments, chords, etc. What you want to do is to convert it automatically to something that would be exactly the same but that you could edit at will afterwards, hoping that everything that belongs to the tree on the right will be grouped in logical order, and the same for all other picture elements, so you can edit them easily (the notes played by the piano are all on the same staff with proper velocity, duration, etc., and the same for each of the other instruments). Can you imagine the task? It is impossible to do with any graphics program on the market at the present time, and we may have serious doubts that it will be possible in the near future. The same goes for music.



Written by Joel Ellis Rea aka "COMALite J.", with a comment by Myriad Software.
Compiled and processed by Laurier Nappert. Illustrations and painting by Laurier Nappert.

November 2001

(c) Myriad