Multimedia Retrieval

We use mixture models to jointly model different media for the task of information retrieval. Documents of different media are associated together (for example, the musical score and lyrics for a song, or the image and archival comments of a painting), and clustering is performed on the associated sets of media using expectation maximization (EM), or online EM.
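As a rough illustration of clustering associated media with EM, the sketch below fits a mixture model in which each document is a list of count vectors, one per medium. It assumes that every medium is modelled as a multinomial and that the media are conditionally independent given the cluster; the models in the cited papers may differ. The names `fit_em` and `responsibilities` and the `smoothing` parameter are hypothetical, chosen only for this example.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def responsibilities(doc, weights, params):
    """E-step for one document: posterior probability of each cluster.

    doc is a list of count vectors, one per medium.
    params[k][m] is the multinomial probability vector of cluster k, medium m.
    """
    log_post = []
    for k, w in enumerate(weights):
        lp = math.log(w)
        # Assumption: media are independent given the cluster, so their
        # log-likelihoods simply add.
        for m, counts in enumerate(doc):
            lp += sum(c * math.log(params[k][m][j])
                      for j, c in enumerate(counts) if c)
        log_post.append(lp)
    norm = log_sum_exp(log_post)
    return [math.exp(lp - norm) for lp in log_post]


def fit_em(docs, n_clusters, n_iters=50, smoothing=1e-2, seed=0):
    """Batch EM for a mixture of per-medium multinomials (illustrative only)."""
    rng = random.Random(seed)
    dims = [len(v) for v in docs[0]]
    weights = [1.0 / n_clusters] * n_clusters
    # Random, normalised initial parameters break the symmetry between clusters.
    params = []
    for _ in range(n_clusters):
        cluster = []
        for d in dims:
            raw = [rng.random() + 0.5 for _ in range(d)]
            total = sum(raw)
            cluster.append([r / total for r in raw])
        params.append(cluster)

    for _ in range(n_iters):
        # E-step: soft cluster assignments for every document.
        resp = [responsibilities(doc, weights, params) for doc in docs]
        # M-step: re-estimate mixing weights and multinomials from soft counts.
        for k in range(n_clusters):
            nk = sum(r[k] for r in resp)
            weights[k] = (nk + smoothing) / (len(docs) + n_clusters * smoothing)
            for m, d in enumerate(dims):
                acc = [smoothing] * d  # additive smoothing keeps probabilities > 0
                for r, doc in zip(resp, docs):
                    for j, c in enumerate(doc[m]):
                        acc[j] += r[k] * c
                total = sum(acc)
                params[k][m] = [a / total for a in acc]
    return weights, params
```

Calling `responsibilities` on a new document (or a query) with the fitted `weights` and `params` gives its mixture components without refitting the model.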

Once the documents are clustered, a query can be entered, and we return a set of ranked results. The query can be composed of any combination of media that the database accepts, for example, a few bars of a piece of music and some words of the lyrics. The query is then clustered with the database, and results are randomly sampled from the clusters, based on the mixture components of the query. For each sample, we find the best unreturned match for the query in the selected cluster and return it.
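The sampling-based ranking described above might look roughly like the following sketch. It assumes each database document has been assigned to a single cluster and that some per-document match score against the query is available; both are simplifications for illustration. The function `ranked_results` and its parameters are hypothetical names, not part of the original system.

```python
import random


def ranked_results(query_posterior, cluster_members, match_score, n_results, seed=0):
    """Return up to n_results document ids, ranked by repeated cluster sampling.

    query_posterior: list of P(cluster | query), the query's mixture components.
    cluster_members: list of lists of document ids, one list per cluster
                     (assumption: hard cluster assignments).
    match_score:     function mapping a document id to a score, where a
                     higher score means a better match to the query.
    """
    rng = random.Random(seed)
    # Sort each cluster once so the best unreturned match is always at the front.
    remaining = [sorted(members, key=match_score, reverse=True)
                 for members in cluster_members]
    results = []
    while len(results) < n_results:
        # Only clusters that still hold documents and have non-zero weight.
        candidates = [k for k, docs in enumerate(remaining)
                      if docs and query_posterior[k] > 0]
        if not candidates:
            break
        weights = [query_posterior[k] for k in candidates]
        # Sample a cluster in proportion to the query's mixture components.
        k = rng.choices(candidates, weights=weights, k=1)[0]
        results.append(remaining[k].pop(0))
    return results
```

Because clusters are drawn at random, a query that sits between two clusters receives results from both, in proportion to how strongly it belongs to each.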

Querying can be used in two ways. First, partial information (keywords, melodies) can be submitted to find documents that match the search criteria. Second, entire documents (lyric sheets, musical scores, and/or images) can be submitted to find documents in the database that are similar. For example, submitting T S Eliot's poem The Waste Land to the music-lyrics database described below will return songs with similar relative word frequencies.

This research has grown out of the work of Nando de Freitas, Kobus Barnard and others at Berkeley. The UBC SML multimedia retrieval researchers are Eric Brochu and Nando de Freitas.

Query: J S Bach -- Toccata and Fugue in D Minor (score)
Match: J S Bach -- Invention #5

Query: Nine Inch Nails -- Closer (score and lyrics)
Match: Nine Inch Nails -- I Do Not Want This

Query: T S Eliot -- The Waste Land (text poem)
Match: The Cure -- One Hundred Years

Using multimedia retrieval to find documents similar to a query. Each query, composed of text and/or music documents, is followed by its first-ranked match.

Combining media can help us find better results. In this (admittedly artificial) example, we have information about the song The Boy With a Thorn in His Side, by The Smiths, represented by the song's score and the cover of an album on which it appears (left). We wish to find other similar songs. Querying using the album cover alone identifies as the closest match Moby's Alone on the album Animal Rights (center). But by including both the image and the musical score, we find the song Bigmouth Strikes Again, on the Smiths' album, The Queen is Dead (right). We are able to combine music and images to find matches that are similar both visually and musically.

Name that song
The first implementation of our multimedia retrieval models was a MATLAB system for querying songs on musical scores and lyrics [1]. Each datum consisted of a lyric sheet, represented in a standard "bag of words" format, and the notes of the song, which were extracted from MIDI files using existing software, and represented internally as a Markov transition frequency matrix.
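To make the two representations concrete, here is a minimal sketch of a bag of words and a first-order Markov transition table built from a note sequence. It assumes whitespace tokenisation and notes given as simple strings; the MIDI extraction step is omitted. Whether the original matrix stored raw counts or row-normalised frequencies is not stated, so both are shown, and all function names are hypothetical.

```python
from collections import Counter


def bag_of_words(lyrics):
    """Word counts for a lyric sheet (assumption: lowercase, whitespace-split)."""
    return Counter(lyrics.lower().split())


def transition_counts(notes):
    """Count how often each note is immediately followed by each other note."""
    return Counter(zip(notes, notes[1:]))


def transition_frequencies(notes):
    """Row-normalised transitions: estimated P(next note | current note)."""
    counts = transition_counts(notes)
    row_totals = Counter()
    for (src, _dst), c in counts.items():
        row_totals[src] += c
    return {(src, dst): c / row_totals[src] for (src, dst), c in counts.items()}
```

For example, the note sequence `["C", "D", "C", "E"]` contains the transitions C to D, D to C, and C to E once each, so the transitions out of C each get frequency 0.5.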

A query can be a series of notes from the song and/or words from the lyrics. Currently, the notes must be entered in GUIDO notation, an ASCII music representation, which is a bit awkward. Ideally, we would implement "query-by-humming" as a more user-friendly solution.

The Sound of an album cover
Given the success of our music-text model, our somewhat obvious next step was to add graphics to the mix. In our implementation, we used the same data as in our previous effort, but with each song we associated a JPEG file -- the cover of the album on which the song appears [2]. Images are represented internally by vectors representing the intensity histogram of the image.
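An intensity histogram of the kind mentioned above can be sketched as follows. The example assumes the image has already been decoded into a flat list of integer grey-level values between 0 and 255; JPEG decoding is left out, and the bin count of 16 is an arbitrary choice for illustration rather than the value used in [2]. The function name `intensity_histogram` is hypothetical.

```python
def intensity_histogram(pixels, n_bins=16, max_value=255):
    """Count how many pixels fall into each of n_bins equal-width intensity bins.

    pixels: iterable of integer intensities in the range 0..max_value
            (assumption: the image is already decoded to grey levels).
    Returns a list of raw counts, suitable as input to a multinomial model.
    """
    hist = [0] * n_bins
    for p in pixels:
        # Map the intensity to a bin index; clamp guards the top edge.
        idx = min(p * n_bins // (max_value + 1), n_bins - 1)
        hist[idx] += 1
    return hist
```

A histogram discards where each pixel is located, so two covers with similar overall brightness distributions map to similar vectors regardless of what they depict.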

Queries on the music-text-image model may consist of words, music (still in GUIDO format), and/or JPEG files. The addition of images adds a new level of complexity. While text and music can easily be determined to match or not match a query, images are much more subjective. Furthermore, there is the question of how an image is to be represented. Our image-histogram multinomial model does quite well at matching colour schemes, but is incapable of recognizing shapes.

[1] Eric Brochu and Nando de Freitas. "Name that Song!": A Probabilistic Approach to Querying on Music and Text. NIPS 2003.

[2] Eric Brochu, Nando de Freitas and Kejie Bao. The Sound of an Album Cover: A Probabilistic Approach to Multimedia. AISTATS 2003.