The paper BPM Measurement of Digital Audio by Means of Beat Graphs & Ray Shooting describes the essence of a BPM counting algorithm. Over the years that algorithm has evolved: it became faster and more accurate. Nevertheless, the essence has always stayed the same, namely: calculate what delay one needs to find the same information back. E.g., in a 60 BPM 4/4 rhythm, it takes 4 seconds before a bar repeats.
A video I made for a talk at OHM2013 illustrates the content of the various intermediate buffers. The top panel is the delay buffer (which runs from left to right). It spans 2.6 seconds. Every time a sample comes in, the samples at the white vertical lines are subtracted from the incoming sample. That value is plotted in the second pane (difference to t0). The third pane visualizes the absolute value of this difference. That value is then added to the accumulated autodifference. The autodifference pane is plotted with (0,0) top left (hence flipped vertically). The different colors are combinations of three channels: red (low frequencies), green (middle band) and blue (high frequencies). More details can be found in the slides of the talk at Measuring Tempo, Extracting Rhythm, Finding similar Music.
The full presentation given at OHM2013 can be found below:
BpmDj provides the digital DJ with an automatic BPM counter, a fairly novel way to visualize music, and nearest neighbor detection to find similar tracks. The project was initiated in 2000 and over the years it grew into both a test environment for new algorithms and a tool to DJ with. In 2013, I rewrote the entire codebase for Android, thereby seriously gambling on the Android platform to make me rich (that didn't work out, though) and, while doing so, I solved a number of tricky outstanding issues. In particular, I improved the tempo detector and the nearest neighbor detection. In this talk I would like to focus on these technical aspects.
Before explaining the improvements, it is maybe best to first demonstrate BpmDj briefly.
This image is a beatgraph: a two-dimensional representation of the energy content of a track. From left to right all bars are visualized, while from top to bottom the energy content of one bar is shown. Bright pixels indicate a lot of energy, dark pixels less. The advantage of beatgraphs is that they directly show us where the energy is located within a song. I can for instance directly say that this area has less energy, and this area has more. I can also directly see how many beats there are per bar: in this case 4, because we have 4 horizontal bands.
Of course, in order to plot such a beatgraph we must know the length of each bar, which itself is directly related to the tempo of the track. So in order to plot a correct beatgraph we must know the tempo. Let's have a look at what happens when the measured tempo does not match the actual tempo. Assume the measured tempo is a bit too slow. In that case, each visualized bar will be a tiny bit too long.
Let's now draw the beatgraph. The first visualized bar starts with the full first actual bar, but then adds something from the second actual bar (drawn in yellow). When we then draw the second bar, we are already a bit into the yellow bar, so we draw the remainder of that one, and then add some more from the third (red) bar. At the third visualized bar, we are again already into the third actual bar, so we draw the remainder of that one and then pull in some more from the fourth actual bar (drawn in blue). So, as the beatgraph is generated, the mismatch between the actual and the visual bars becomes progressively worse, leading to a slanted beatgraph.
This is an example of a Beatgraph with a tempo that is not fully correct.
In summary, when the beatgraph is slanted, the tempo is not measured correctly, so in a sense we are now looking for a tempo that will produce a horizontal beatgraph.
So, how do we do that? The easiest approach is to look for similarity between consecutive bars. That is done by subtracting the energy of one bar from the energy of the previous bar. To integrate all these values we merely take the absolute value of the difference and add them all together. If this value is a small number, then consecutive bars are fairly similar; if the value is larger, then consecutive bars are more dissimilar. One can then repeat this process and compare bars that are 2, 4, 8 or even more bars apart. However, in what follows we will only look at the case where directly consecutive bars are compared.
Mathematically this can be written down very concisely as an 'autodifference' value: R_P = Σ_t |A[t] − A[t−P]|. In this equation, A refers to the audiostream and is indexed with time. R_P tells us how 'horizontal' the beatgraph would be if we drew it under the assumption that each bar is P samples long. P is thus the length of 1 bar.
The only thing we need to do now is scan all possible tempos and find the P for which R_p is minimal.
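As a small sketch, the offline version of this scan can be written in a few lines of Python. The function names and the toy signal are mine, not BpmDj's; the per-period normalization is an assumption to keep the comparison fair:

```python
import random

def autodifference(audio, period):
    """R_P: total |A[t] - A[t-P]| over the whole track. A small value
    means the track repeats well at bar length `period`."""
    return sum(abs(audio[t] - audio[t - period]) for t in range(period, len(audio)))

def best_period(audio, lo, hi):
    """Scan all candidate bar lengths; normalise by the number of
    compared samples so longer periods aren't unfairly favoured."""
    return min(range(lo, hi + 1),
               key=lambda p: autodifference(audio, p) / (len(audio) - p))

# toy signal: a 100-sample 'bar' repeated 40 times with a little noise
rng = random.Random(0)
bar = [rng.uniform(-1, 1) for _ in range(100)]
audio = [s + rng.gauss(0, 0.01) for s in bar * 40]
print(best_period(audio, 60, 150))  # -> 100
```

Note that this needs the entire track in memory and rescans it per candidate period; the streaming refinements below remove exactly those drawbacks.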
This plot is the autodifference plot for 'Rolling on Chrome'. As you can see at around 96 BPM the autodifference value is the smallest.
An alternative way to look at this plot is to flip it over the y-axis, and instead of talking about the autodifference value, we then talk about the autosimilarity. Instead of looking for the minimal value of the autodifference function, we will be looking for the maximal value of the autosimilarity function.
Now, this technique has its flaws, and when implemented in a straightforward manner it leads to some problems, namely that there are often multiple bar lengths that produce a horizontal beatgraph.
The beatgraph of this song, which is at 140 BPM, can be painted with 4 beats per bar. 1 2 3 4 1 2 3 4
However, it can also be drawn with 3 beats per bar. 1 2 3 1 2 3. If the algorithm then assumes that it actually saw 4 beats per bar it will tell you that the tempo is 186 BPM.
We can even draw a Beatgraph with 5 beats per bar. 1 2 3 4 5 1 2 3 4 5
In all cases, the beatgraphs were horizontal, and the algorithm will simply pick the one that happens to be just a tiny little bit better than the others.
That this is somewhat of a random process is shown in the following plot. It shows the autosimilarity of the song you just heard, and as can be seen the 3 peaks are very close to each other in height. So, which peak is chosen is merely a matter of chance.
Generally, autodifferencing selects the correct tempo harmonic in only about 53% of the cases, which is the first major problem.
A second problem is that the autosimilarity values oscillate wildly, which makes no sense: as we come closer to the correct tempo we should not see sudden jumps; we expect a smooth transition into a peak. Why would 139.98 BPM be so much worse than 140 BPM? Yet this is exactly what this plot shows: around the peak the autosimilarity function oscillates wildly.
A last problem of the algorithm as presented is that it cannot directly be used on a live audiostream, because the autodifference value is calculated across the entire song. So, we must have the entire song in memory.
Next we will explain how we create a live BPM counter that will also solve the previously mentioned problems.
But first some requirements: the tempo detector we want to create has to be fast (we don't want too many operations per sample), it must continuously report the tempo and keep converging on the actual tempo, and it must have a low latency: we don't want the user to wait 30 seconds before the first reliable estimate is produced.
To solve the problem that it should work on an audiostream (as opposed to a complete wave file in memory), we will work with a delay line. This provides us with the opportunity to compare new incoming samples against samples we saw in the past.
The following short animation will plot the content of the various buffers in the algorithm.
The top section is the delay line. New samples come in at the left and are shifted through to the right. The full length of the delay line is about 2.6 seconds.
Every time a sample comes in we compare the incoming sample against all other samples in the delay buffer. The subtraction is plotted in the second pane. In particular, this value minus the incoming value gives a value below zero, which can be seen here.
After that the absolute value of the difference is taken and plotted in the third pane.
The values from this buffer are all added to the autodifference value; or subtracted from the auto-similarity plot, which is plotted in the bottom section.
While new samples come in, we also search for the position of the peak value in the auto-similarity plot, and that position, which is a lag time, can be converted to a tempo value. At the right side are large delays, hence long bars, thus slow tempos. At the left side of the plot are short delay times, thus short bars, thus fast tempos.
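To make the buffers concrete, here is a minimal Python sketch of this naive streaming variant. It is my own simplification: one band, a plain list as delay line, and a guard that skips the smallest lags (where, as discussed later, the bias lives):

```python
import math

class StreamingAutodiff:
    """Naive streaming autodifference: every incoming sample is
    compared against all samples in the delay line."""
    def __init__(self, size):
        self.size = size
        self.delay = [0.0] * size      # delay[k]: the sample seen k+1 steps ago
        self.autodiff = [0.0] * size   # autodiff[k]: accumulated |x[t] - x[t-k-1]|
        self.seen = 0

    def feed(self, x):
        if self.seen >= self.size:     # wait until the delay line is filled
            for k in range(self.size):
                self.autodiff[k] += abs(x - self.delay[k])
        self.delay = [x] + self.delay[:-1]  # shift the line one step
        self.seen += 1

    def best_lag(self):
        """Peak of the auto-similarity = minimum of the autodifference;
        the smallest lags are skipped."""
        lo = self.size // 4
        return min(range(lo, self.size), key=lambda k: self.autodiff[k]) + 1

# a tone repeating every 20 samples is found at lag 20 (or a multiple of it)
sad = StreamingAutodiff(64)
for t in range(1500):
    sad.feed(math.sin(2 * math.pi * t / 20))
print(sad.best_lag())
```

Converting the lag to a tempo is then a matter of lag / sample_rate seconds per assumed number of beats, exactly as the plot's axis does.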
As we see this song progress through our algorithm, we can observe that every time the signal passes a specific time point, it tends to have a low autodifference value, which consequently increases the auto-similarity at that position.
Now the first thing that jumps in the eye and stabs your brain is that there is a bias in the auto-similarity plot. The autodifference for periods close to 0 is itself close to 0. This is of course to be expected, and by itself it would not matter, because infinite tempos are not very cool. Nevertheless, this bias often extends far into the usable tempo range and thereby affects peak selection.
Even worse is that the bias profile is unique for each individual song.
There are two strategies to deal with this problem.
The first strategy is to search for the peak tempo at places that are as far away as possible from the bias. That this is possible is illustrated in this plot. It shows the tempos associated with particular lag times. At the far right we have a lag time of 2.6 seconds, which is the time needed for 2 beats at 45 BPM. It is also the time necessary to play 4 beats at 90 BPM or 8 beats at 180 BPM. In the middle of the plot we have a lag time of 1.3 seconds: the time necessary to play 1 beat at 45 BPM, 2 beats at 90 BPM or 4 beats at 180 BPM. At around 0.65 seconds, the lag time corresponds to 1 beat at 90 BPM or 2 beats at 180 BPM. The fact that the same tempo (90 BPM) occurs at multiple places in the auto-similarity plot is very valuable to us, because we can look for a peak in the range 0.325-0.65 seconds, or in the range 0.65 to 1.3 seconds, or in the range 1.3 to 2.6 seconds. In each case the lag time of the peak will correspond to a multiple of the actual tempo. However, at the right side of this plot we are much less affected by the bias coming from the origin. Actually, in musical terms, what we measure there is not the repetition of the 'beats' (which would be the smallest range, from 0.325 to 0.65 seconds), but the repetition of the bars. An even better strategy would be to measure the repetition of each bar pair, but that would require a delay buffer of 5.2 seconds, which would double the time until the first sensible tempo is reported.
A second strategy to deal with the bias is to understand where it comes from and why it is unique for each individual song. The answer to that question is: noise. If we had a noiseless signal, generated by the perfect synth, producing the exact same tone at each position in time, then subtracting the signal from its copy one bar earlier would yield 0. However, as soon as noise is added to the signal (in the form of a guitar, phase shifts, airplanes flying by and other sources of deep enjoyment in music), subtracting the two signals will not produce a 0, but will merely add the variances of the noises. Practically speaking, the autodifference behaves differently towards noise than it does towards a clean signal. If we sample the autodifference more, we will measure more of the noise, and less of the clean signal. Thus, oversampling the autodifference affects the bias profile.
Luckily, we were already a bit in trouble with our auto-differencing. To calculate R_p we needed to scan the entire delay line every time a new sample came in. That meant that, if the delay line was 5000 samples long, we would need to perform 5000 operations per sample. This is expensive, and as it turns out, it merely helps us to measure more noise and less signal. So let's try not to oversample the autodifference by reducing the number of comparisons we perform per sample, and at the same time improve the execution speed of the algorithm.
To do that, we introduce a tracer in the delay buffer. With each new sample, only one particular position in the delay buffer is compared and added to the autodifference. That trace-position then rotates through the buffer as time progresses.
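A hedged sketch of the tracer idea follows; the class and parameter names are mine, and the normalization by visit counts is my own addition to keep partially-visited lags comparable:

```python
import math

class TracedAutodiff:
    """Streaming autodifference with rotating tracers: each incoming
    sample updates only n_tracers positions instead of the whole line."""
    def __init__(self, size, n_tracers=1):
        self.size = size
        self.delay = [0.0] * size
        self.autodiff = [0.0] * size
        self.counts = [0] * size                       # visits per lag, for normalisation
        self.tracers = [i * size // n_tracers for i in range(n_tracers)]
        self.seen = 0

    def feed(self, x):
        if self.seen >= self.size:
            for i, k in enumerate(self.tracers):
                self.autodiff[k] += abs(x - self.delay[k])
                self.counts[k] += 1
                self.tracers[i] = (k + 1) % self.size  # rotate through the buffer
        self.delay = [x] + self.delay[:-1]
        self.seen += 1

    def best_lag(self):
        lo = self.size // 4                            # stay away from the origin bias
        return min(range(lo, self.size),
                   key=lambda k: self.autodiff[k] / max(self.counts[k], 1)) + 1

# with 16 tracers only 16 comparisons happen per sample, yet a tone
# repeating every 20 samples is still found at lag 20 (or a multiple)
tad = TracedAutodiff(64, n_tracers=16)
for t in range(4000):
    tad.feed(math.sin(2 * math.pi * t / 20))
print(tad.best_lag())
```

The demo already uses 16 tracers, the count the talk settles on shortly; with a single tracer the same code works but needs many more passes before the accumulated plot is usable.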
The following video shows this algorithm in action.
At this point the tracer has rotated almost twice through the buffer. The autodifference values to the left of the tracer have been measured twice, while those to the right have been measured only once. As time progresses this difference becomes negligible. However, when the algorithm starts, it takes a long time before it can produce any sensible result. During the rest of the song you will see the selected peak jump back and forth.
Maybe we reduced it a bit too much: with a single tracer it takes many iterations before the first sensible auto-similarity is accumulated. That means it would require many times 2.6 seconds to produce a correct tempo, which would lead to an unacceptable delay. This is of course not a problem when we analyze full songs, but in case we ever go live with this BPM counter, it does matter. As it turns out, using 16 tracers per delay buffer is fairly optimal: it provides a good balance between under-sampling and over-sampling. I will demonstrate this later, after explaining some more refinements.
The two mentioned strategies, interpreting the autodifference plot far away from the origin of the bias and reducing the bias by not oversampling the autodifference, solve a number of problems, except the one I started this talk with: tempo harmonics.
To deal with that problem we introduce an extra interpretation stage. Instead of simply finding the peak and taking it at face value, we will also take the distances between the peaks into account.
That is possible because in a 4/4 rhythm with a beat at every 1/4th note, the auto-similarity plot has peaks at every beat length, and the distances between those peaks are also multiples of the beat period. In more complex rhythms, we will see a hierarchy of auto-differences: a strong peak with flanking subpeaks, but in almost all cases the distance between the main peak and the subpeaks will be a multiple of the beat period. Think of it as follows: the main peak tells you the repetition time for full bars, the second peak tells you the repetition time -within that bar- for beats, and the third peak tells you the repetition time for the hi-hats within the beats. Although this is a metaphor and not fully correct, it explains the idea quite well.
What our algorithm does now is find the first peak, in this case at around 140 BPM. Then it finds the second peak and calculates the distance between the 1st and the 2nd peak, which is 0.857 seconds. That time is 1 beat at 70 BPM or 2 beats at 140 BPM. Finally, a third estimate is obtained by measuring the distance between the 1st largest peak and the 3rd largest peak. That time difference is 0.428 seconds, or 1 beat at 140 BPM. Three estimates like this, all multiples of the actual beat period, allow us to choose the two that tell us the same thing and report that tempo.
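Under the assumption that each of the three measurements (the main peak's lag and the two peak distances) spans a whole number of beats, the disambiguation can be sketched as a small voting scheme. This is my own reconstruction of the idea, not BpmDj's actual code:

```python
def candidate_tempos(seconds, max_beats=4):
    """A lag of `seconds` could span 1, 2, ... beats; every assumption
    yields a different BPM candidate (rounded to integer BPM)."""
    return {round(60.0 * m / seconds) for m in range(1, max_beats + 1)}

def estimate_tempo(peak_lags):
    """peak_lags: lag times in seconds of the three largest
    auto-similarity peaks, main peak first. The main lag and the two
    distances to the subpeaks give three estimates; the tempo most
    of them agree on wins."""
    main, second, third = peak_lags
    votes = {}
    for lag in (main, abs(main - second), abs(main - third)):
        for bpm in candidate_tempos(lag):
            votes[bpm] = votes.get(bpm, 0) + 1
    return max(votes, key=votes.get)

# the talk's example: main peak at 1.714 s (one 4/4 bar at 140 BPM),
# subpeaks 0.857 s and 0.428 s closer to the origin
print(estimate_tempo([1.714, 0.857, 1.286]))  # -> 140
```

With these numbers 140 BPM collects three votes (from 1.714 s as 4 beats, 0.857 s as 2 beats and 0.428 s as 1 beat), while 70 BPM only collects two, so the harmonic ambiguity is resolved.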
Lastly, some other improvements have been added to the tempo detector. We made it frequency sensitive. Our ear listens to frequencies, not raw samples, so it makes more sense to let the content of the delay buffer represent the energy of specific frequency bands. The Bark psycho-acoustic scale is a good start: with its 24 bands it covers quite exactly the loudness profile of our hearing. Some people have done tests with 6 bands, spread across the auditory spectrum, which also turned out to be sufficient. In our case, because we wanted to run on Android and didn't expect to have much computational time left, we reduced it even further to 3 bands: the low, middle and high frequency bands, each covering 8 Bark bands. Aside from being frequency specific, it is also valuable to respect loudness. That means that we store log-values in the delay buffer (as opposed to direct RMS values) and that we apply a 150 to 200 millisecond decay to every incoming sample.
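How one band of the delay buffer could be filled is sketched below. Only the log scale and the 150-200 ms decay come from the talk; the frame size, the dB floor and the exact decay slope are my own assumptions:

```python
import math

def band_envelope(band_rms, decay_ms=180.0, frame_ms=10.0):
    """Per-frame RMS values of one frequency band -> the loudness
    values stored in the delay buffer: log scale with a slow decay,
    so a short hit leaves a trace the way our hearing hears it."""
    drop = 60.0 * frame_ms / decay_ms   # dB lost per frame: -60 dB over decay_ms
    out, env = [], 0.0
    for rms in band_rms:
        db = 20.0 * math.log10(max(rms, 1e-5))  # log loudness, floored at -100 dB
        loud = db + 100.0                        # shift the floor to 0
        env = max(loud, env - drop)              # attack instantly, decay slowly
        out.append(env)
    return out

# a single hit followed by silence decays smoothly instead of vanishing
print([round(v, 1) for v in band_envelope([1.0, 0.0, 0.0, 0.0])])
# -> [100.0, 96.7, 93.3, 90.0]
```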
The result of all these improvements is shown in the following video. The 4 panels are the same as before: the delay buffer running from left to right, the difference, the absolute difference and the accumulated auto-similarity. The three color channels (red, green and blue) represent the 3 frequency bands we have been measuring. At the bottom, the measured peak and its two subpeaks are shown, together with an estimate of the actual tempo.
In the first part of this talk I explained how using loudness measurements on multiple frequency bands, combined with appropriate sampling of the auto-similarity and appropriate interpretation of the peaks, leads to an algorithm that measures the tempo correctly in 95% of the cases.
The biggest problem we had in the past was the presence of unwanted harmonics, such as the 3rd, 5th, 6th or 7th harmonic. Our algorithm no longer reports such tempo harmonics. Nevertheless, it might still be 'off by a factor of two': it might report a tempo of 70 BPM while it should actually be 140 BPM. However, that does not matter because, if a DJ wants to play a 70 BPM song after a 140 BPM song, no sensible DJ software will double the play speed of the 70 BPM song. Instead, it is better to assume that the song should be interpreted as a 140 BPM song.
In the next part of the talk I focus on measuring rhythm patterns in music, but before doing so I would like to demonstrate why that would be useful.
You might remember the track we started with. In order to add a new track we can click on the + icon at the right. Then we end up in a song selector that suggests songs that are similar to the one that is playing.
The top song is the target. All other songs are similar in either rhythm or loudness.
To make it possible for BpmDj to find songs with a similar rhythm, it needs the ability to measure it. The 2nd part of this talk explains how that is done.
When we know the tempo of a track we can extract one bar from any position we want. Suppose the tempo is 120 BPM, then each beat is half a second and one bar, which is four beats, is 2 seconds long. If we now were to select any 2 seconds of such a song we would have extracted exactly 1 bar. And although 1 bar might already give us some information on the rhythm, we might just have selected the wrong bar: the bar with a fill-in, or the bar with an ambient passage, or the bar with a Screaming Jay Hawkins. That problem can easily be mitigated by extracting all bars from the song and calculating their average. In our 120 BPM song, we would chop the song up into sections of 2 seconds, each section being one bar. Adding them all together gives us an idea of the rhythm of the song.
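A minimal sketch of this averaging, assuming the tempo and sample rate are known (function name and toy numbers are mine):

```python
def average_bar(audio, sample_rate, bpm, beats_per_bar=4):
    """Chop the track into bars of equal length and average them;
    fills and odd passages wash out, the typical rhythm remains."""
    bar_len = int(round(sample_rate * 60.0 / bpm * beats_per_bar))
    n_bars = len(audio) // bar_len
    avg = [0.0] * bar_len
    for b in range(n_bars):
        start = b * bar_len
        for i in range(bar_len):
            avg[i] += audio[start + i]
    return [v / n_bars for v in avg]

# at 120 BPM and a (toy) 100 Hz sample rate, one bar is 200 samples
pattern = [float(i % 7) for i in range(200)]
avg = average_bar(pattern * 5, sample_rate=100, bpm=120)
print(len(avg))  # -> 200
```

Averaging a track that simply repeats the same bar returns that bar unchanged, which is exactly the property we rely on.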
The following 3 examples illustrate this. The first sample is a couple of bars from the original track. The second sample is the extracted rhythm.
Of course, BpmDj does not store the rhythm information directly in this sample based format. It is after all not normalized: each pattern has a different length because the tempo of each song is different, and the samples are way too large. With over 88000 samples per pattern we are looking for a better solution.
Luckily, one is at hand. We start from a beatgraph and then add all the bars as found in the image. Thereby we also use the different frequency bands that were used for the tempo analysis: 3 for BpmDj for Android, or 24 if we are really thorough. To normalize the length of the rhythm pattern we choose 192 ticks per pattern. That allows us to store 1/3rd and 1/4th notes at integer positions, as well as 1/12th, 1/16th and 1/32nd notes. For that one obnoxious tick, we can even store 1/96th notes.
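Normalising a bar of arbitrary sample length onto the 192-tick grid can be sketched as a simple box resampler; this is my own simplification, BpmDj's actual resampling may differ:

```python
def to_ticks(bar, ticks=192):
    """Average the samples of one bar into a fixed number of ticks.
    192 = 2^6 * 3, so binary note values (1/4 ... 1/32) and triplet
    divisions all land on integer tick positions."""
    n = len(bar)
    out = []
    for t in range(ticks):
        lo = t * n // ticks
        hi = max((t + 1) * n // ticks, lo + 1)  # at least one sample per tick
        seg = bar[lo:hi]
        out.append(sum(seg) / len(seg))
    return out

# a 2-second bar at 44100 Hz (88200 samples) becomes 192 ticks
print(len(to_ticks([0.0] * 88200)))  # -> 192
```

Whatever the song's tempo, every pattern now has the same width, so patterns from different songs become directly comparable.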
This picture is an image of a colored beatgraph, one in which each pixel contains 24 sub-bands, colored differently. The resulting summation leads to a rhythm pattern that describes, in red, the low frequency pattern and, in blue, the high frequency pattern. The colors in between represent the presence of the frequencies in between.
In order to make reading these patterns a bit easier we will rotate them 90° counterclockwise. That means that from now on a rhythm pattern should be read from left to right. The bottom of each pattern picture represents the presence of low frequencies, the top the presence of high frequencies.
In this case we see four bass drums per bar with hi-hats spaced in between them. The hi-hats have an accent every second beat.
Now, once we have such a rhythm pattern we can use it to synthesize noise in accordance with the pattern. By generating 24 noise bands and letting the brightness of each pixel row steer their respective envelopes, we can generate an auditory representation of the extracted rhythm. What you hear is, in other words, a fully synthetic rhythm.
The following is an ambient fragment.
To illustrate that rhythm information is correctly captured in each song, I will now demonstrate some unusual rhythms.
The first is Conga Fury, with the hi rhythm section played at third beats.
An even better example is Jartecken. It contains a series of triplets within a 3/4 rhythm.
The last example is Electroplasm. A song that has a 7/8 time signature.
With this I would like to end the part on rhythm extraction and dive into the improvements we made on nearest neighbor detection.
The nearest neighbor detection in BpmDj has been created over many years. Over time, we have been using various data sets to test our algorithm. From 2000 to 2010, I mainly worked with the music I had on disk, some 80000 songs. The tempo of these songs was estimated by an algorithm that was not as accurate as the one we currently have. The number of spectral bands we used was 24. We also stored spectral information and a rhythm pattern. The rhythm pattern was 96 ticks wide and 24 bands high. At that time, we also used a property that would tell us something about the composition of the music. That is, after how many bars a specific frequency would change. We did not include loudness information and did not assess the time signature of the music.
In 2010, I re-investigated the tempo algorithm and started using a different set of music. In this case we had a list of 3600 curated songs. We omitted the spectrum, echo, loudness and composition properties. The rhythm pattern was however fairly extensive: 384 ticks wide and 24 frequency bands high.
In 2011, I worked at Audiotool and had access to about 300'000 tracks with a known tempo. Starting with this data set, I introduced new properties and measured others more accurately. For instance, the spectrum of a song was measured as 120 notes, including a spectral histogram. We also included a dB-per-octave drop. We included composition, echo, loudness and measurements of the time signature. The rhythm patterns were 384 ticks wide, but only 6 frequency bands high, because I had read somewhere that those 6 bands are sufficient for people to recognize rhythms.
In 2013, I started to create BpmDj for Android and used about 400'000 test tracks, from which the tempo was largely known. The spectrum was measured solely as low, middle and high. No echo, composition nor time signature properties were retained. The rhythm patterns now covered 384 ticks and used only 3 frequency bands.
In what follows I might occasionally switch from one dataset to the other, mainly to illustrate a specific lesson the data taught us.
Given such a collection of properties, how does one go about storing and retrieving the data ? Generally, each song will be considered to be a point in a multidimensional space. Each dimension will be a value of a property. For instance, dimension 1 could be the energy of the rhythm at the lo frequency band, tick 0. The second dimension could be the energy of the rhythm at the middle frequency band, tick 0, and so on. Once all values in the rhythm are exhausted, we can shift to the next property. Dimension 1152 could be the energy of the spectrum for note 12, the next dimension the energy of the spectrum for note 13, and so on.
Depending on the number of properties, we thus have a different number of dimensions: for instance, an 8000-dimensional space or a 1563-dimensional space.
The advantage of storing track meta-data as points in a multidimensional space is that we can measure the distance between them. This is given as the sum of the absolute differences between the values along each dimension. A conventional distance measure would square the differences and then apply a square root to the summation. In our case, since I'm a total fan of absolute values, we simply don't do that; we just stick with the absolute value. An extra advantage is that we can work with integers and don't have to go through floating point, which speeds up nearest neighbor detection.
Suppose now that we have a target song. Given the distance measure, we can find all the nearest neighbors by simply scanning the vectors associated with all the songs, calculating their distances to our target point, and sorting the result set according to those distances.
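A sketch of the metric and the brute-force scan; integer vectors on purpose, and the song names are made up:

```python
def l1_distance(a, b):
    """Sum of absolute differences: no squares, no square roots,
    and it works on plain integers."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_neighbours(target, library, k=5):
    """Scan every song vector, compute its distance to the target,
    and sort the whole set by that distance."""
    ranked = sorted(library, key=lambda name: l1_distance(target, library[name]))
    return ranked[:k]

library = {
    "empty track":   [9, 0, 0],
    "hard techno":   [2, 7, 5],
    "harder techno": [2, 6, 5],
}
print(nearest_neighbours([2, 7, 4], library, k=2))
# -> ['hard techno', 'harder techno']
```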
Initially we believed that, because all the properties were mathematically and exactly measured, that should have been sufficient for good nearest neighbor detection: when the vector representations of two songs are close, they should sound the same. In the Linux version of BpmDj, however, we started to notice that certain properties were deemed more important. BpmDj had a tendency to favor certain songs: if one simply followed the nearest neighbors, one tended to end up at empty tracks with a very hard bass drum. This indicated that there was a problem.
When working with the Audiotool data set, I further figured out that too many properties degrade the efficiency of the nearest neighbor algorithm. It turned out that certain properties matter more than others when comparing songs. For instance, the presence of a 20 Hz tone in a song (a low rumble) is not that important when comparing two tracks. Such properties might nevertheless force two points to become very close, while in reality, for humans, they are not.
All of this led us to introduce weights to tune the distance metric. We want an equation that weighs each dimension by the importance humans attribute to it.
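The weighted variant is a one-line change; the toy vectors below show how a weight vector can flip a verdict, which is exactly the behavior we want to put under human control:

```python
def weighted_l1(a, b, w):
    """L1 distance with a per-dimension weight: important properties
    count for more, irrelevant ones (the 20 Hz rumble) for less."""
    return sum(wi * abs(x - y) for wi, x, y in zip(w, a, b))

A, B, C = [0, 0], [1, 3], [3, 1]
print(weighted_l1(A, B, [1, 1]), weighted_l1(A, C, [1, 1]))  # -> 4 4 (a tie)
print(weighted_l1(A, B, [1, 0]), weighted_l1(A, C, [1, 0]))  # -> 1 3 (B is nearer)
print(weighted_l1(A, C, [0, 1]), weighted_l1(A, B, [0, 1]))  # -> 1 3 (C is nearer)
```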
The problem is that there is not really a good way to guess those weights, so we set out an experiment in which we would present a listener with 3 songs and ask him or her: 'Is song B closer to A, or is song C closer to A?' By registering the answer, we would already have some information that might be useful to create a weight vector.
This is a demonstration of the 3 song test.
This is song A
This is song B
This is song C
Is song B or C closer to A? As you can hear, the test itself can be fairly difficult.
An extra problem with such a test is that it is relatively time-consuming and that we don't get much information from it: each 3-song test yields 2 distance measures, which lead to 1 comparison of distances. That is not very efficient. To make this more efficient, we modified the test and presented the user with 15 songs and 1 target song. We would then ask him (or her) to group the 15 songs into 'near' and 'far' sets. The result was that, with X near songs and Y far songs, we obtained X*Y distance comparisons. On average, this provided between 44 and 56 comparisons per test. Moreover, it was often easier to find the songs that were 'in the ballpark' and the songs that were so far removed from the ballpark that we could safely mark them as 'far away'. We did allow empty answers: not all songs had to be labeled close or far, which was also important. Sometimes tracks simply could not be compared, for instance when the track was actually two songs, or when the track was empty.
This is a demonstration of the 16 songs test. The target song is shown top left.
Now, even with this improvement, it remained a difficult task. If we used a truly random set of songs, most songs tended to be extremely far away, so far even that they were not comparable, and it was often not possible to decide which subgroup could actually be considered 'near'. To solve that problem, we bootstrapped the song selection using a weight vector of all ones. Another strategy we used was to start with a weight vector that would ensure a fixed variance for the values along each dimension. All in all, the exact bootstrapping didn't matter much, as long as the data set was not truly random.
Once we finished performing this test about 1000 times on 300 000 songs, we ended up with 44000 distance pairs. Now a new problem dawned: overfitting.
With 8000 dimensions and thus 8000 weights that we could tune, we could always find some combination of weights that would satisfy the 1000 tests we had. Actually we could find some weights that would only satisfy the 1000 tests we had. This was because the learning algorithm would cleverly make use of noise in the data set to actually make it fit.
A second problem was that the 44000 distance comparisons we had only applied to the tests we actually did. Only if we had presented the user with a distance comparison between A, B and C would we be able to use those tracks to tune weights. 44000 distance comparisons is about 0.00000048 of all potential distance comparisons, making the model not only over-fitted but also completely irrelevant.
Therefore we needed some way to generalize the data we obtained from the set. We needed the ability to say something about song A-B-D, even if D did not appear in any of the tests. To achieve that we set up a statistical model of the distance metric.
The idea is as follows: we have two parallel processes, both telling us something about the distance metric. The first process tells us what the human thinks is closer. The second process is whether the computer, given the weights it is using, thinks that it is closer. In both cases we can aggregate statistics that tell us something about how important each dimension is.
At the moment, these two processes are disjoint: what the computer thinks is good has little to do with what 'we' think is good. To match these two processes, we insert a learning algorithm. Thereby we assume that if we are able to match the statistics of both distance metrics, the resulting distance metric will be more or less to our liking. Thus: what the learning algorithm tries to do is tune the weights such that the resulting statistics come closer to each other.
The next question is of course: which statistics are we talking about?
In our case, we count how many times a specific dimension is in alignment with the distance result. That is to say, if A is closer to B than it is to C, and the value along dimension 25 for A is also closer to B than to C, then we mark dimension 25 with a hit; otherwise it is a miss. In another case, suppose that, although the distance metric says that A is closer to B, the values along dimension 78 are further apart; then we mark dimension 78 as a miss. By counting over many distance pairs, we obtain a form of correlation that tells us how often dimension X is in alignment with the distance metric.
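In Python, the hit counting could look as follows. This is my sketch; `tests` holds one tuple per human verdict 'A is closer to B than to C':

```python
def alignment(tests, n_dims):
    """For each dimension, the fraction of verdicts 'A is closer to B
    than to C' that the dimension's own values agree with: a hit when
    |A_i - B_i| < |A_i - C_i|, otherwise a miss."""
    hits = [0] * n_dims
    for a, b, c in tests:
        for i in range(n_dims):
            if abs(a[i] - b[i]) < abs(a[i] - c[i]):
                hits[i] += 1
    return [h / len(tests) for h in hits]

# dimension 0 agrees with the verdict, dimension 1 contradicts it
tests = [([0, 0], [1, 5], [5, 1])]
print(alignment(tests, 2))  # -> [1.0, 0.0]
```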
To make this clearer, look at the three dots. If I ask you which one is closer to red and you answer 'yellow', then I know that the horizontal axis is more important to you than the vertical axis. Likewise, had you answered 'green', I would know that the vertical axis is more important to you.
Because later on I will introduce a novel method to train the weight vector, I need to introduce some formal notation and become a bit mad. Dafuq ?
The set of tests we conducted is called T. It contains tuples of A, B and C, all points from an N dimensional space. A, B and C are the vector representations of each song.
The hitcount for a certain dimension tells us how many times that dimension was in alignment with the distance metric. This is written as the number of times |A_i - B_i| was smaller than |A_i - C_i|, given that the distance between A and B was also smaller than the distance between A and C.
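The hit-count statistic described above can be sketched in a few lines. This is a toy implementation, not the BpmDj code: the triple format of T and the `dist` callback are assumptions of this example.

```python
import numpy as np

def hit_counts(tests, dist):
    """Per-dimension hit counts over a test set T.

    `tests` is a list of (A, B, C) triples of N-dimensional vectors,
    where the listener judged A closer to B than to C; `dist` is the
    weighted distance function under evaluation. Dimension i scores a
    hit when |A_i - B_i| < |A_i - C_i| agrees with the metric's verdict
    dist(A, B) < dist(A, C), and a miss otherwise.
    """
    n_dims = len(tests[0][0])
    hits = np.zeros(n_dims, dtype=int)
    for A, B, C in tests:
        A, B, C = map(np.asarray, (A, B, C))
        metric_says_b = dist(A, B) < dist(A, C)
        dim_says_b = np.abs(A - B) < np.abs(A - C)
        # a hit whenever the dimension agrees with the metric
        hits += (dim_says_b == metric_says_b)
    return hits
```

Dividing `hits` by `len(tests)` yields the per-dimension hit rate used later in the talk.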
So! If we aggregate these statistics, we see the following.
In case of the spectrum properties we measured on the Audiotool dataset, we found that the 2nd octave, the 3rd octave and the band around 1 kHz were positively correlated with the distance measure. Anything above 1 kHz tended to be... well... all the same.
In case of the loudness property, we observed that the loudness of the louder passages matters, but so does the loudness of the more silent passages. Anything between transient and ambient was apparently less important. Whether this is the result of the loudness wars is not entirely clear; it could equally be the reason why compression works: it only affects the areas we are really interested in.
The statistics of the rhythm property are shown here. It looks pretty much like an EEG, but otherwise not much direct information can be read from this plot.
The same happens with the statistics of the rhythm property as measured in BpmDj for Android: not much can be discerned directly.
At this point it is very tempting to use the correlations, as we measured them, directly as the weight vector. That however does not work, because they are merely a statistic we aggregated. Instead, it is necessary to match the statistics of the human distance metric and those of the computer distance metric. That is done by randomly generating 10000 comparisons and producing the computer distance statistic. The learning algorithm then compares the wanted statistics against the actual statistics and tunes the weights. This process is repeated until the two distance metrics are sufficiently close to each other.
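The match-the-statistics loop might look roughly like this. It is a sketch only: the feature matrix `songs`, the weighted squared-distance metric and the naive multiplicative update (standing in for the learning rule discussed later in the talk) are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def computer_hit_rate(songs, weights, n_trials=10_000):
    """Estimate each dimension's hit rate under the current weighted
    metric from randomly drawn (A, B, C) triples of feature vectors."""
    n, d = songs.shape
    hits = np.zeros(d)
    for _ in range(n_trials):
        a, b, c = rng.choice(n, size=3, replace=False)
        A, B, C = songs[a], songs[b], songs[c]
        # verdict of the weighted metric versus each dimension's verdict
        metric_b = np.sum(weights * (A - B) ** 2) < np.sum(weights * (A - C) ** 2)
        dim_b = np.abs(A - B) < np.abs(A - C)
        hits += (dim_b == metric_b)
    return hits / n_trials

def tune(songs, target, weights, eta=0.1, iters=50):
    """Repeatedly compare the computer statistic against the wanted
    (human) statistic `target` and nudge the weights toward it."""
    for _ in range(iters):
        actual = computer_hit_rate(songs, weights, n_trials=1_000)
        weights = weights * np.exp(eta * (target - actual))
    return weights
```

The exponential keeps the weights positive; the loop stops here after a fixed number of iterations, whereas the talk describes iterating until the two statistics are sufficiently close.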
This is a plot showing the distance statistics for the human and computer metrics.
In order to train the weights we initially used Q-learning. Thereby we would check which dimensions were under-represented in the statistics and increase their weight in proportion to how much change we wanted.
The resulting statistic shows how the computer metric (the red line) tries to follow the human metric (the blue line). In general, however, it stays above the blue line: the Q-learning algorithm could not bring certain correlations down to their wanted level, which was especially apparent in certain areas.
What happened inside the algorithm was that it continuously tried to fix issues by increasing or decreasing weights that could not be matched. An instance of this is visible in the spectrum weights generated by the algorithm: the weight for a totally irrelevant piece of the spectrum (22000 Hz) was extremely high and, depending on how long we ran the algorithm, would keep on rising. At the same time, other weights would keep on decreasing.
A second aspect of the learning algorithm was that it placed all its eggs in one basket. Choosing 22 kHz as an 'important' weight is sensible only insofar as it indirectly indicates how far above 5 kHz the song goes.
Similar things happened to the rhythm statistics, both for the Audiotool dataset and for the Android dataset.
With respect to the loudness measurements we saw something different. Two levels of accents turned out to be important; most interesting, however, was that the 'silent levels' (found on the left of the plot) required a much higher weight, because their values tended to be so low.
So, in the end, Q-learning was certainly a good attempt and it often produced weights that were 'believable'. At other moments it just produced garbage because it could not deconvolve the dataset. That indicated to us that it was an unbalanced solution. The fact that the weight vector became more extreme as time progressed indicated that there was no convergence. In a sense the algorithm behaved like a PCA analysis and tried to find a representative dimension for each group of dimensions.
After some thinking, we figured that straightforward learning did not work because there are often dependencies between dimensions. That is to say: if dimension 46 is correlated with the final distance measure and dimension 47 is correlated with the distance measure as well, then dimensions 46 and 47 will tend to be correlated with each other too. If we then simply used the correlations of dimensions 46 and 47 as weights, we would influence the final distance too much: we would count the same virtual property twice. A good example of this is a 4/4 beat. If two songs are similar, they tend to both be in a 4/4 rhythm. That means that beats 1, 2, 3 and 4 will all correlate with the distance metric. Let us assume that the correlation is 1 in all cases. If we simply used that 1 as a weight, then every time a song in 4/4 came in we would count the same property four times, because if the first beat is present, all the others tend to be present as well. Essentially, if a dimension is cross-correlated with other dimensions, the sum of those cross-correlations determines how strongly that dimension influences the end result. This is not what we wanted. We want the weights chosen such that each dimension influences the end result in such a way that the correlation of that dimension matches what humans like. To solve this we tested many different update methods, and the results were always shaped by the equation we wrote down. In the end we came up with an equation that works like a charm.
Let's get to the meat of the solution now. First we need to know the cross-correlations between the various dimensions. H_i,j is the number of times the value of A along dimension I is closer to B when the value of A along dimension J is also closer to B. To convert this into a type of correlation, which we will call R_i,j, we bring it into the range -1 to 1 by taking the size of the dataset into account and translating a 50% hit ratio into a correlation of 0.
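Following these definitions, computing H_i,j and rescaling it into R_i,j could be sketched as follows (a toy implementation; the triple format of the test set is an assumption of this example):

```python
import numpy as np

def cross_correlation(tests):
    """Cross-correlation matrix R between dimensions.

    H[i, j] counts the tests in which dimension i says A is closer to B
    while dimension j also says A is closer to B. Rescaling by the size
    of the test set, with a 50% ratio mapped to 0, yields R in [-1, 1].
    """
    n_dims = len(tests[0][0])
    H = np.zeros((n_dims, n_dims))
    for A, B, C in tests:
        A, B, C = map(np.asarray, (A, B, C))
        closer = np.abs(A - B) < np.abs(A - C)  # per-dimension verdict
        H += np.outer(closer, closer)           # joint "both closer to B"
    return 2.0 * H / len(tests) - 1.0
```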
The following shows such correlations. If in the high frequencies a beat at 2.X is present, then a similar (but less strong) presence will be at all other beats, at least for the high frequencies. The middle and low frequencies do not correlate with the high frequencies of beat 2.X.
Next, we need to adapt the weights. The update equation contains the old weights and the new weights: W_i on the left-hand side is the new weight of dimension I; W_i on the right-hand side is the old weight assigned to dimension I. I should have added a prime to the left-hand side, but apparently it fell off the slide.
The update itself is done in an exponential fashion because, from experience, I knew that it makes a lot more sense to express weight vectors as logarithms; hence the choice of an exponential when modifying the weights. Intuitively: in high-dimensional spaces, one weight needs to counter all other weights before it can make a sizable dent in the resulting distance. A second advantage of using an exponential is that the weights remain positive. Some researchers learn distance functions that can produce negative distances; we don't want to do that because it feels like a hack. What would a 'negative distance' mean anyway?
The exponent itself is a summation of the influences and wishes of all other dimensions. To update weight I we take all other dimensions into account. The question is how the weight of dimension J should affect the outcome for dimension I. The answer lies in the cross-correlations we calculated before. Dimension J itself has a target, denoted M_j. If we need to lower the weight of dimension J (because its hit rate is too high), then M_j-H_j will be negative; similarly, if we need to increase the hit rate of dimension J, we will need to increase its weight. So the direction and strength by which we tend to modify the weight is directly related to how strongly the other dimension J wants to be modified. However, since we are modifying weight I, we must map the change wanted for J through the cross-correlation matrix to our own dimension, which is done by multiplying the wanted change for J with the cross-correlation R_i,j. A last normalization step brings the total sum back to a reasonable level by including the total influence we expect to receive from the various dimensions.
Essentially, what we do to modify weight I is calculate how much the other weights want to produce a change in the same direction as weight I, and normalize accordingly. That way, if there are many cross-correlations with other dimensions, each individual weight is affected only a little. In case one dimension goes solo and has no cross-correlations, the equation reduces to a direct change of W_i in relation to what is required. The advantage of taking the cross-correlations into account is that if a group of weights requires a change, the algorithm spreads the required change over the entire group. It will not (as happens with standard update procedures) use one particular dimension of that group to represent the entire group.
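Put together, one update step along these lines could be sketched as follows. The exact normalization term is an assumption of this sketch, chosen as the total absolute correlation mass each dimension receives.

```python
import numpy as np

def update_weights(w, H, M, R, eta=0.1):
    """One exponential weight update.

    `H` holds the measured per-dimension hit rates, `M` the target
    (human) rates and `R` the cross-correlation matrix. The wanted
    change of every dimension j is mapped onto dimension i through
    R[i, j]; the sum is normalized by the total influence dimension i
    expects to receive, and applied through an exponential so that
    the weights stay positive.
    """
    wanted = M - H                      # positive: hit rate must go up
    drive = R @ wanted                  # other dimensions' wishes, mapped to i
    norm = np.sum(np.abs(R), axis=1)    # total influence received per dimension
    return w * np.exp(eta * drive / norm)
```

With `R` equal to the identity (no cross-correlations at all), this reduces to a direct exponential change of each weight in relation to its own deficit, matching the solo case described above.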
This slide demonstrates the result of the learning process. These are the weights for the rhythm property, and as can be seen, many irregular positions are marked as important. The positions are 'irregular' in the sense that they do not correspond to a typical position used in many songs. We do not find much importance in the presence (or lack thereof) of onsets at regular positions such as every beat or every half beat. No, it appears that to decide whether a rhythm is similar or not, humans listen to the hits and transients in between the expected pattern.
To show that this strategy results in a better weight matrix, I will let you hear the nearest neighbors of a specific song.
I will only demonstrate the songs that differ between the two sets. The first demonstration is the nearest neighbors without using a weight matrix; again, the top-left song is the target. The second demonstration is the nearest neighbors using a weight matrix.
Given these advances in tempo detection, rhythm extraction and nearest neighbors, the last thing missing in our DJ software is the ability to edit transitions, which it of course can do and which I would now like to demonstrate.
The software you just saw is called BpmDj. It is currently available through the Google Play store in two versions. The free version of the software is a demo that allows you to mix up to 4 tracks. The full version has no limitations.
Lastly, I am currently looking for interesting jobs, such as anything related to signal processing and/or interesting audio applications. If you have any serious offers, please feel free to contact me.
If you would like to fund further development of BpmDj, you can also directly contact me.