Introduction
Imagine you have a 440 Hz tone playing at the same time a 500 Hz and a 490 Hz note is playing. Your microphone picks them all up as a single sound sample. How can we separate the sound sample into the individual frequencies that make it up? In simple terms: we can apply a filter.
When you want to separate chlorine from your tap water, you add a chlorine filter. The filter separates the chlorine from the rest of the solution. In the same sense, we can apply a mathematical filter to our sound sample. If we want to know if the sample contains a 440 Hz tone, we can apply a 440 Hz filter and see if the filter "catches" anything.
The mathematical filter we will be using is called the Fourier Transform. It looks complicated, but I will show you that it's actually really simple.
I'll represent the transformation like this (there's no "f-hat" character in unicode so I'll just use "g" instead):
g(ξ) = ∫(-∞, ∞, f(x)·e^(-2πixξ) dx)
Again, it looks scary, but we'll break it down and make it simpler to understand.
What this is telling us is that if we have a function, called f(x), we can "transform" it into another function, g(ξ), using this equation.
In this case, f(x) is a function to represent our audio sample that we recorded, such that f(x) gives us the amplitude of the audio at that particular point in time. The function g(ξ) is our function for filtering frequencies. We insert a frequency value (ξ is our frequency) and it returns some number if our frequency exists inside of our sample or not.
One thing we can recognize right off the bad is that we have an integral (this guy here: "∫"), which represents a continuous summation. Continuous summations are analog, but computers are not analog, they're digital. So we don't actually need a continuous summation, instead what we want is a
discrete summation.
Not only this, but we also have a range of (-∞, ∞). This makes no sense for a computer. Again, f(x) represents our audio sample, so this would imply our sample size is infinite, reaching infinitely into the past and infinitely into the future. This is unrealistic, our range should simple be [0, N-1] where N is our sample size.
This form of the Fourier transform represents conditions, but we are not working in perfect conditions, so we can make these adjustments.
g(ξ) = ∑(0, N-1, f(x)·e^(-2πixξ))
Our function no longer represents a continuous transformation but a
discrete transformation, so this form of the Fourier transform is known as the
Discrete Fourier Transform (DFT).
This still seems a bit frightening, because we have e^(-2πixξ). There's "e" in there, there's "π" in there, and there's "i" in there! That's all 3 of the major constants in math. What could these be here for? 🤔
Polar Coordinates
Imagine an X, Y, Cartesian coordinate plane. We all have used these in math class, especially back in middle/high school geometry.
Any point on that plane can be represented with two numbers: the X and Y coordinates. Such as, we can have a point located at (5, 9) where 5 is the X coordinate and 9 is the Y coordinate.
It turns out that we can also represent any point on this plane with two completely different numbers: an angle and radius, we will call θ and r.
So how can we represent our point (5, 9) where X=5 and Y=9 using θ and r?
Imagine a circle who's center is at the origin of the Cartesian coordinate plane. Imagine the circle's radius grows and grows in size until the outer rim of the circle intersects with our point (5, 9). The radius here describes the size of the circle needed intersect the point.
We can calculate this radius using Pythagorean's theorem.
Imagine a line drawn from the origin to the coordinate (5, 9). This would be equal to our circle, and we can also imagine it as a hypotenus to a right triangle, where our two sides are of length 5 and 9.
a² + b² = c²
c = √(5² + 9²) = √106
So we now know the radius of our circle is √106, or about 10.3. But this isn't enough to describe the point (5, 9). It tells us how far
out from the origin we need to go, but it doesn't tell us what
direction we need to go. This direction is our
angle, θ. It is the angle of our right triangle.
You may have learned SOHCAHTOA from geometry class. In this case, TOA would be the easiest. We know the opposite angle is our 9 and the adjacent angle is our 5, meaning the following (where atan is the inverse of tangent, pronounced "arc tangent)...
tan(θ) = 9 / 5
θ = atan(9/5)
So the Cartesian coordinates (5, 9) can also be represented by the numbers (atan(9/5), √106), these are called "Polar coordinates". A way to intuitively understand them is that our Cartesian coordinates are telling us how far "to the right" we have to walk and how far "up" we have to walk, and our Polar coordinates are instead telling us what "direction" we have to walk (our angle) and how far in that direction we have to walk (our radius).
It's important to note here that there's actually an issue with the "atan" function. To the atan function, atan(-Y/X) and atan(Y/-X) are identical, but they're obviously not identical as these points would exist in totally different quadrants. Division here causes information about where the X and Y coordinates are to be lost, so we get the same angles for those two points even though the two angles should be in opposite directions.
This is usually solved by using "atan2(Y, X)" instead. This function takes in the original Y and X rather than the division of the two, so the information isn't lost and it can tell the difference between atan2(-Y, X) and atan2(Y, -X) and report back the correct angle. So, for our example above, we can also say that θ = atan2(9, 5), or about 1.06 radians.
We can sum up the process of converting Cartesian coordinates to Polar coordinates like so:
θ = atan2(Y, X)
r = √(X² + Y²)
What if we want to go the opposite direction and convert Polar coordinates into Cartesian coordinates?
This is a much simpler process. In this case, imagine we are simply given a direction to walk in (1.06 radians) and a distance to walk in that direction (10.3). How do we convert this to X and Y coordinates, that being the distance "to the right" we need to walk in and the distance "up" we need to walk in to get to that same point?
Again, imagine a circle centered at the origin with a radius of 10.3. Move along the perimeter of that circle until you've moved 1.06 radians. Then, draw a line from the origin to that point you are at. That can be the hypotenus of our right trangle.
The height of our triangle is how far "up" we walked (our Y coordinate), and the base of our right trangle is how far "to the right" we walked (our X coordinate). The hypotenus is the distance from us to the origin, which we know is 10.3 because that's the radius of the circle we are standing on.
This means to calculate X and Y, we just need to calculate those two sides of the triangle (the base and the height respectively). We know the angle and we know the hypotenus, so our good friend SOHCAHTOA comes back to play.
SOH:
sin(1.06 radians) = Y / 10.3
Y = sin(1.06 radians) · 10.3
= 9
CAH:
cos(1.06 radians) = X / 10.3
X = cos(1.06 radians) · 10.3
= 5
So converting from Polar coordinates back to Cartesian coordinates is much simpler. We can sum up the rules like so:
X = r·cos(θ)
Y = r·sin(θ)
Waves and Circles
Let's imagine what it means to graph a function in polar coordinates.
If I were to take a function, let's say y=sin(x), and graph it in Cartesian coordinates, what we would do is for every vertical "slice" of the X-axis, we'd plug that X value into the function and get a corresponding Y value. Those X and Y values combine to make the coordinate we plug in. You continue to increase the X value as much as you want to make the graph go as far as you want.
In polar coordinates, we can do something similar. Rather than y=sin(x), we can think of our function to graph as r=sin(θ). This means that for every "slice" of the circle--by "slice" I mean, imagine cutting out a slice of pizza--we plug the corresponding angle of that slice into our function and get a corresponding radius, this being how far "out" we walk in that direction.
If we were to graph a function in polar coordinates, we would effectively be wrapping that function
around a circle. As you increase the θ value, you move in different directions aroudn the circle, and the corresponding r values tell you how far you need to move out from the circle to plot the point.
Let me ask you something: what do you think would happen if we were to wrap the function sin(x) around a circle? Or, in other terms, what were to happen if we were to graph the function r=sin(θ)?
See the first graph below.
No, your eyes aren't deceiving you. It does indeed plot a perfect circle
on top of the origin.
Cool, right? But how can we use this...
What if we were to manipulate the frequency of the sine wave? By this I mean, what if we were to multiply our angle by a constant?
Check the next two graphs. These are graphing r=sin(2θ) and r=sin(0.4θ) respectively. By manipulating the angle, we get a unique pattern. But the important thing to take away from this pattern is that they are all
symmetrical.
If I were to sum up all the X and Y values from the second and third graph, they would sum to 0, because all the positive X values will be cancelled out by the negative ones, and the same for the Y values. This is different from our first graph, where r=sin(θ). All our points were above the X axis, so the values would sum to a positive number.
Let's make a simple rule for this.
Let's rewrite this as r=sin(kθ). If we sum up our X and Y coordinates generated from this function, if k=1 then it will
diverge towards infinity. This is because as long as we continue to graph the function, we will always get positive values on the Y axis, so the values will continue to increase forever until we stop. However, if k≠1, then the sum of our coordinates will
converge towards 0, because we will get a symmetric pattern and everything will cancel out.
Identifying Sine Waves
If you know anything about sine waves, you might recognize this function:
f(x) = sin( (2πxf) / v)
This function is a function that allows us to produce a sine wave with any given frequency we want, where f is our frequency and v is our velocity (how fast the wave is moving through space). Velocity is a constant depending on whatever system you're dealing with, so we won't worry about it for now and can just throw it out. This means that f(x) = sin(2πx·440) would produce a sine wave of frequency 440 Hz.
Let's imagine we're given some generic sine wave: sin(2πxf), where the frequency (f) could be anything, we don't know what it is.
I could then wrap this sine wave around a circle by plotting r=sin(2πθf). This isn't very useful, however, because we know that if k≠1, then we will get a symmetrical pattern around the origin. This doesn't tell us any new information.
But what if we did a clever little trick here. What if we instead graphed this function around the circle:
r=sin( (2πθf) / (2πξ) )
If ξ=f, then all of our constants will cancel out to 1, and then k=1. But if ξ≠f, then all of our constants will not cancel out and thus k≠1.
If we define ξ as our "search frequency" (the frequency we are trying to "filter out"), and we were to sum up the X and Y coordinates produced by this function, then our results will diverge to infinity if ξ=f and will converge to 0 if ξ≠f...
Huh, we now know of a way of identifying frequencies! Simply wrap the sine wave around a circle using the formula above, if the X and Y coordinates diverge towards infinity, then we found the frequency we were looking for!
But how does
any of this relate to the Fourier transformation?
Piecing it All Together
Recall our discrete Fourier transform:
g(ξ) = ∑(0, N-1, f(x)·e^(-2πixξ))
Here was have a pretty frightening looking piece: e^(-2πixξ).
However, this isn't as scary as it looks. Euler's formula tells us this:
e^(ix) = cos(x) + i·sin(x)
This means that we can expand out our discrete Fourier transform to instead look like this (where x=-2πxξ)
g(ξ) = ∑(0, N-1, f(x)( cos(-2πxξ) + i·sin(-2πxξ) ))
g(ξ) = ∑(0, N-1, f(x)·cos(-2πxξ) + f(x)·sin(-2πxξ)·i)
The second part we just distributed.
Now, this should look familiar to you:
f(x)·cos(-2πxξ) + f(x)·sin(-2πxξ)·i
Separate this into two parts and what do we get:
f(x)·cos(-2πxξ)
f(x)·sin(-2πxξ)·i
Doesn't this look a lot like our rule to converting polar coordinates into Cartesian coordinates?
X = r·cos(θ)
Y = r·sin(θ)
The values on the inside also look incredibly similar to this:
f(x) = sin( (2πxf) )
And remember how I told you that we were summing all the X and Y coordinates, right?
So effectively, this is plotting our function in polar coordinates then summing the results.
The reason we include "-2πxξ" inside of our cosine and sine is because it has a similar effect to this:
r=sin( (2πθf) / (2πξ) )
By using a negative sign within "-2πxξ", it will function similarly to the division here and cancel out the "2πxf" component of our sine wave. In this case, our sine wave comes from f(x) that's being multiplied.
Why do we use a summation around all this, the "∑"? Because we aren't just summing the X and Y coordinates for one point, but over our entire function (from 0 to N-1).
So that's literally all the Fourier transform does. It wraps your function around a circle and sums the coordinates with a search frequency equal to ξ. If the summation explodes to infinity, the function f(x) must contain a sine wave of frequency ξ. But if it converges towards 0, then it must not.
If a function contains more than one frequency, all those frequencies will converge towards 0 if the search frequency is not found. However, if the search frequency is found, even if it's in a mess of other frequencies, it will diverge towards infinity. This means that we can use the Fourier transform to filter out frequencies even if we have hundreds of frequencies playing at the same time.
You can actually throw away the negative signs inside of the formula.
This:
g(ξ) = ∑(0, N-1, f(x)·cos(-2πxξ) + f(x)·sin(-2πxξ)·i)
Can become:
g(ξ) = ∑(0, N-1, f(x)·cos(2πxξ) + f(x)·sin(2πxξ)·i)
Why? Because if you plot this, you simply find that negating the inside of the cosine makes the whole thing negated, and negating the inside of the sine simply negates the whole thing. But we don't need to negate the sine in practice, actually, because if the entire thing explodes to 0, well -0 is still 0, so it makes no difference if your application is filtering out frequencies.
The final formula showed is actually one of the official representation of the Discrete Fourier Transform.
The reason they divide by N on the inside is because remember when we said we are assuming velocity is constant? Well velocity is not constant, so we need to divide by our velocity. In this case, they assume that the Fourier transform is being applied to a sample of size N and the sample rate is also N (meaning a 1-second sample), so you can just divide by N since that would effectively be your velocity. Also, you notice they keep the negative sine outside of the sin. Again, in our application this makes no difference.
Back to the Original Formula
One thing I didn't mention is the "·i" component. Why didn't I mention this? Well, it's actually unnecessary. You can throw it out.
So this:
g(ξ) = ∑(0, N-1, f(x)·cos(2πxξ) + f(x)·sin(2πxξ)·i)
Can become this:
g(ξ) = ∑(0, N-1, f(x)·cos(2πxξ) + f(x)·sin(2πxξ))
And it will still work just fine. We don't need to worry about the imaginary numbers here if all we're doing is filtering out frequencies. This is because if g(ξ) diverges towards infinity, it will do so even if the imaginary numbers are there, and if it converges towards 0, the imaginary numbers will all cancel out and it will converge towards 0 anyways. So if all we're doing is filtering out different frequencies, we don't need to worry about the imaginary number.
But... why is it there in the first place?
Imaginary numbers cannot be found anywhere on the real number line. This means if you want to plot a single imaginary number, you need
two number lines. You would plot our real number component along the real number line (usually horizontal axis) and the imaginary number component along the imaginary number line (usually the vertical axis). Such as the complex number 4+5i would be graphed 4 to the right and 5 up. This means a single number, in some sense, exists in
two dimensions.
This here:
f(x)·cos(2πxξ) + f(x)·sin(2πxξ)·i
Can be split up into this:
f(x)·cos(2πxξ)
f(x)·sin(2πxξ)·i
Which I mentioned looks a lot like this:
X = r·cos(θ)
Y = r·sin(θ)
Our third set of equations requires two dimensions: an X and a Y axis. The first equation is a complex number, where the real number part is f(x)·cos(-2πxξ) and the imaginary number part is f(x)·sin(2πxξ)·i.
This means that if we were to graph the first equation, we would require two dimensions to graph it. We would in fact graph it like so:
X = f(x)·cos(2πxξ)
Y = f(x)·sin(2πxξ)·i
Why? Because we need two dimensions to graph an imaginary number, and typically the X axis is used to represent the real number line and the Y axis is used to represent the imaginary number line (this is usually called the "complex plane").
This means that, in some weird sense, by making the right hand side of the formula "f(x)·cos(-2πxξ) + f(x)·sin(-2πxξ)·i" an imaginary number, we are representing the action of plotting this function in two dimensions... but using a single number.
Weird to wrap your head around, yes.
If we were to look back the original formula:
g(ξ) = ∫(-∞, ∞, f(x)·e^(-2πixξ) dx)
Here, f(x)·e^(-2πixξ) is a much simpler way to represent this. The formula "e^(πix)", when graphed on the complex plane has the property of drawing a circle.
Multiplying e^(πix) by a function, in this case f(x), has the property of wrapping that given function around the circle when graphed on the complex plane.
Finally, including the constant "-2πξ" has the property of dividing any constants in our function by 2πξ, so if our function f(x) contains 2πξ, then that will be cancelled out to 1. Such that this equation is true:
f(2πxξ)·e^(-2πixξ) = ±f(x)·e^(ix)
If f(x) is a sine wave, f(x) = f(2πxξ), we get this equality:
sin(2πxf)·e^(-2πixξ) = ±sin(x·(f/ξ))·e^(ix)
We know f(x)·e^(ix) has the property of plotting f(x) around a circle in the imaginary plane.
We also know that if we plot sin(x) in polar coordinates around a circle, it converges towards infinity. And this will be the case if ξ=f since they will cancel out to 1. But if this is not the case, it will diverge towards 0.
This means that the Fourier transform will explode to infinite if your function f(x) contains the search frequency, ξ, that we are looking for, but will diverge towards 0 if it doesn ot.
Implenting This In Code
First, let's set up our audio sample.
OPTION STRICT
VAR SAMPLE_RATE=32730
VAR SAMPLE_DEPTH=16
VAR SAMPLE_TIME=1
VAR SAMPLE_LENGTH=SAMPLE_RATE*SAMPLE_TIME
DIM SAMPLE[SAMPLE_LENGTH]
Microphones essentially work by reading air pressure and returning that value to the computer as a number.
Our sample rate is how many numbers it can return every second. I chose 32730 because that's what the DLC sound stuff uses (you don't need the DLC to follow this tutorial but it's nice to easily playback your sample if you want to hear it). These numbers the microphone reads are formatted in some way. In the case of what the DLC supports, it uses 16-bit signed values. This is why I set our sample depth to "16" here. I never specify "signed" in this tutorial because I will assume you are using signed whether it's 8-bit or 16-bit or 32-bit. Our sample time is how long, in seconds, our sample lasts. It takes longer to process longer samples so I just did 1 second here as an example. Our sample length is how much space the actual sample data takes up in memory.
Now we've formatted our sample, let's generate a dummy sample for testing purposes. You can try recording real samples if you want, but I'm not going to bother.
DEF GENERATE_TONE F
VAR I
VAR B=POW(2, SAMPLE_DEPTH-1)-1
FOR I=0 TO SAMPLE_LENGTH-1
VAR X=I/SAMPLE_RATE
SAMPLE[I]=SAMPLE[I]+SIN(2*PI()*X*F)*B
NEXT
END
This takes a frequency "F" and will generate a sine wave of that frequency and
add it to the sample.
The sine function produces a wave between -1 and 1, but a 16-bit signed sample should have values between -2^15 and 2^15-1. To deal with this, we multiply our sample by that value (called "B") to scale it up.
This comes from the function I mentioned earlier for producing sine waves with specific frequencies:
f(x) = sin( (2πxf) / v)
In this case, our velocity (v) is equal to our sample rate. In real life, the velocity is how fast a wave moves through space. In your computer, it's how fast the wave moves through memory. A sample rate of 32730 means that the wave is moving through memory at a speed of 32730 bytes per second.
We can listen to some tones by doing something like this:
GENERATE_TONE 440
PCMSTREAM SAMPLE
WAIT SAMPLE_TIME*60
But then again, that requires the DLC, if you don't care about hearing the tones you don't have to.
Let's actually look at our Discrete Fourier Transform.
DEF DFT(SF, THRESHOLD)
VAR SUM=0
VAR I
FOR I=0 TO SAMPLE_LENGTH-1
VAR X=I/SAMPLE_RATE
VAR C=COS(2*PI()*X*SF)
VAR S=SIN(2*PI()*X*SF)
SUM=SUM+SAMPLE[i]*(C+S)
NEXT
RETURN ABS(SUM) > THRESHOLD
END
I based it off the final formula without imaginary numbers we derived:
g(ξ) = ∑(0, N-1, f(x)·(cos(2πxξ) + sin(2πxξ)) )
We through out our velocity because we had originally assumed it was a constant, but it's not, so we actually need that velocity back.
So in practice, our real Discrete Fourier Transform looks something like this:
g(ξ) = ∑(0, N-1, f(x)·( cos( (2πxξ)/v ) + sin( (2πxξ)/v ) ) )
Remember, velocity is just the sample rate.
"SF" is our search frequency. What is the threshold? Remember how I said the Fourier transform will explode towards infinity or diverge towards 0? Well, it only reaches infinity or reaches 0 if our sample size is infinitely long. But of course, it isn't, so instead it will either
get really close to 0 or
get really big. We can draw a line between what "really big" and "close to 0" means by using a threshold. Anything above the threshold is defined as "really big" (meaning the search frequency was found), and anything below 0 would be "close to 0" (meaning it was not found).
Now let's put it to the test.
PRINT "Generating frequencies..."
GENERATE_TONE 505
GENERATE_TONE 512
GENERATE_TONE 524
GENERATE_TONE 538
GENERATE_TONE 541
PRINT "Filtering frequencies..."
VAR I
FOR I=500 TO 550
IF DFT(I, 100) THEN
PRINT "Frequency found: " + STR$(I)
ENDIF
NEXT
First, it generates 5 tones. Remember, our tone generator
adds the tone to our sample, so after those five calls, we will end up with a single sample of 5 overlapping tones, all of different frequencies. We then use the Fourier Transform to apply 51 filters to our sample, those being filters 500-550 inclusive. For each filter, if we find the frequency we're looking for, we then print it to the CLI.
The output of this will look something like this:
Generating frequencies...
Filtering frequencies...
Frequency found: 505
Frequency found: 512
Frequency found: 524
Frequency found: 538
Frequency found: 541
[DEFAULT]OK
And that's about it! Well, not quite. There is a faster version of the Fourier Transform called, well, the
Fast Fourier Transform (FFT). In fact, this is built-in to the DLC if you have it. But I will not be going over that here...
Back in 80s, you didn't have USB flash drives or massive hard drives to store your files on. Instead, people stored their data on cassette tapes. The computer would encode your program as a series of sine waves and then record that to the cassette tape. It would be very inefficient to store just a single sine wave for every interval of time on a cassette tape. Instead, you could store many many sine waves on top of each other and then pull them apart using the a Fourier transformation. This allowed for storing a ton of data on a single little cassette tape, well, at least a ton of data relative to the time period.