Hearing aids: an introduction to DSP in hearing aids -- let's start with the part before it hits the processor
Which is what we're about to do now.
I should give a disclaimer first. I'm writing this first and foremost for myself, to help me synthesize and understand ideas behind hearing aid technologies. Since I studied electrical engineering as an undergraduate, I take a lot of basic DSP concepts for granted; to some extent, I don't feel like I need to take notes on this, except for application-specific statistics and measures I want to note. However, there's a secondary audience -- it's been interesting to see what aspects of the things I take for granted are new to my audiology-PhD classmates, and what things they take for granted that I have no clue about. So as I go along in my fast clip of I'm-an-engineer dialogue, I'll try to pause and explain the terms I've noticed are less familiar for folks without that technical background. Here goes.
DNR (digital noise reduction) is a feature implemented via DSP, which stands for "digital signal processing." To start talking about either, we need to understand the "D," which stands for "digital."
The world is not digital. The world is analog -- it is continuous, with infinite resolution in both time and amplitude. But computers are not analog; they only understand numbers. You don't ask how many megapixels the Grand Canyon is -- the answer, if anything, is "infinitely many." But you do say that your photograph of the Grand Canyon is 5 megapixels, because that's the resolution at which your digital camera was able to capture its image of the Grand Canyon. (Just as the original Grand Canyon will always be cooler than your picture, analog signals will always be higher-fidelity than any digitization, because they are the original signal we're trying to replicate.)
In order to convert between the two, we use an ADC, which stands for analog to digital converter, which is what we call anything -- component, process, algorithm, whatever -- that does the job of making something analog into something digital, turning something continuous into something discrete.
The word for "making something discrete" is discretization. Discretization of time is called sampling; think about sampling cookies off a baking sheet every few minutes to check when they're done baking. ("15 minutes, undercooked. 20 minutes, almost there. 25 minutes, perfect. 30 minutes, burnt.") Discretization of amplitude is called quantization; think about how many different levels of cookie-doneness you have (is it just undercooked/perfect/burnt, or do you say raw/gloopy/slightly soft/crisp/dry/charcoal?)
One thing the cookie analogy is good for is showing how sampling and quantization are independent of each other. We could have only undercooked/perfect/burnt but be sampling cookies every 5 seconds (om nom nom.) Or we could have a really, really fine gradient of classifications with 30 levels on our "cookie doneness" scale, but only be checking the oven every 20 minutes. (Sadface.) When we say "high-resolution," we have to ask: are you tasting the cookies often (high sampling rate), or do you have a really detailed cookie-doneness scale (fine quantization)? Usually, "high-resolution" means you've got a lot of detail on both, but it's good to know which one you're actually talking about.
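For the engineers (or the curious), here's a minimal Python sketch of those two independent knobs. The function name and the specific sample rates, frequency, and bit depths are made up for illustration; the point is just that the two parameters turn separately.

```python
import numpy as np

def sample_and_quantize(duration_s, freq_hz, sample_rate_hz, num_bits):
    """Sample a sine wave, then round each sample to one of 2**num_bits levels."""
    t = np.arange(0, duration_s, 1 / sample_rate_hz)   # sampling: discretize time
    x = np.sin(2 * np.pi * freq_hz * t)                # "analog" amplitude in [-1, 1]
    levels = 2 ** num_bits                             # quantization: discretize amplitude
    codes = np.round((x + 1) / 2 * (levels - 1))       # integer codes 0 .. levels-1
    return codes / (levels - 1) * 2 - 1                # map back to [-1, 1] for comparison

# Checking the cookies constantly, but with a very coarse doneness scale (2 bits = 4 levels):
often_but_coarse = sample_and_quantize(0.01, 440, sample_rate_hz=48_000, num_bits=2)
# Checking rarely, but with a very fine doneness scale (16 bits = 65,536 levels):
rarely_but_fine  = sample_and_quantize(0.01, 440, sample_rate_hz=2_000,  num_bits=16)
```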
Let's hang out in the time domain for a while and talk about sampling first. Sampling is expensive; it takes time, energy, battery, etc, so we want to sample as infrequently as possible. How do we know how often to sample? You already intuitively know the answer to this question; it depends on how fast the thing we're sampling is liable to change. If the turkey needs to bake for 7 hours, maybe we'll check in every hour or two. If the shrimp crackers puff up after 5 seconds in hot oil and then rapidly burn, we're going to watch the wok like a hawk, not looking away for more than 2-3 seconds at a time.
If we have a signal that changes often -- with high frequency -- it is (drumroll...) a high-frequency signal. Low-frequency signals change more slowly -- that's by definition, that's what high and low frequency mean. In order to capture a higher-frequency signal (shrimp chips), we need a higher-frequency sampling rate (check wok more often). To be precise, the Nyquist Sampling Theorem (technically the "Nyquist-Shannon Sampling Theorem," but we usually forget about Shannon) says we need to sample at a rate at least twice the maximum frequency we want to capture. There's some beautiful, beautiful math behind it that I won't go into, though I highly recommend the journey through that proof for anyone keen on understanding signal processing. But the gist that you actually need is this: want to capture a 500Hz signal? You need a sample rate of at least 1kHz. (But don't go over that too much -- it's just a waste of resources, like the kid in the back of the car that keeps asking "are we there yet?" when we obviously aren't.) Another way of putting this is that the "Nyquist frequency" of a 1kHz sample rate is 500Hz -- the Nyquist frequency being the highest frequency that can be faithfully captured and reconstructed at that sample rate.
What happens if you sample at too low a rate? Aliasing. I'll let Wikipedia explain that. All you really need to know is that it makes a high signal frequency sound like a low alias frequency instead. To be precise, for a signal between the Nyquist frequency and the sampling rate, the alias frequency is the sampling frequency minus the signal frequency; to be descriptive, it sounds terrible. Take a piece of piano music and play all the high Fs as low C-sharps and you'll hear what I mean.
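If you'd rather see the numbers than take my word for it, here's a tiny Python check (my own example frequencies, not anything hearing-aid specific): a 900Hz tone sampled at 1kHz produces exactly the same samples as a 100Hz tone would.

```python
import numpy as np

fs = 1_000                            # sample rate: 1 kHz, so Nyquist = 500 Hz
t = np.arange(0, 0.02, 1 / fs)        # 20 ms worth of sample instants

high  = np.cos(2 * np.pi * 900 * t)   # 900 Hz: well above the Nyquist frequency
alias = np.cos(2 * np.pi * 100 * t)   # 1000 - 900 = 100 Hz: its alias

print(np.allclose(high, alias))       # True: the sampled values are identical
```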
Therefore, right before a signal hits the ADC (analog to digital converter, remember) we usually have an anti-aliasing filter, which is just a low-pass filter. Sometimes the microphone itself acts as the low-pass filter; if the mic can't physically capture sounds beyond the Nyquist frequency, those sounds just never get a chance to be digitized and aliased. But in case the mic does pass through signals above the Nyquist frequency, the anti-aliasing filter chucks them out (introducing a bit of distortion in the process, since there are no perfect filters).
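Here's a rough sketch of that stage in Python using scipy. The filter order, cutoff, and sample rates are illustrative choices of mine, not values from any actual hearing aid, and a real device would do this in hardware before the ADC.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs_analog = 192_000              # stand-in for the "analog" world: very finely sampled
fs_target = 32_000               # the ADC's sample rate
nyquist   = fs_target / 2        # 16 kHz: nothing above this should survive

# Low-pass Butterworth filter with its cutoff just below the Nyquist frequency.
b, a = butter(8, (0.9 * nyquist) / (fs_analog / 2), btype="low")

t = np.arange(0, 0.01, 1 / fs_analog)
signal = np.sin(2 * np.pi * 1_000 * t) + np.sin(2 * np.pi * 20_000 * t)

filtered  = lfilter(b, a, signal)                  # the 20 kHz component gets knocked down
decimated = filtered[:: fs_analog // fs_target]    # then "sampled" at 32 kHz, alias-free
```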
We have now set our bandwidth limitations. We have decided that we will not capture or work with any signals above half our sampling frequency. Higher sounds are gone forever. A system only has as much bandwidth as its narrowest-bandwidth component.
How fast do modern hearing aids sample? The typical one nowadays is about 16-32kHz. (Quick reminder for my audiology classmates: "Hz" is short for Hertz, which in this context means "samples per second" -- 32kHz is "thirty-two kilohertz," or "thirty-two thousand samples per second.") Compared to CD-quality audio at 44.1kHz, that sampling rate seems downright sluggish. But hearing aid processors are working under all sorts of time and space and power limitations -- they can only process so much data at a time, and a 16-32kHz sample rate is already handing them sixteen to thirty-two thousand samples per second to deal with.
I should backtrack, because some of my classmates wondered: what are bits?
Bits are units of digital data. Computers work in binary (base 2), so the word "bit" is short for "BInary digiT." I won't go into detail as to why and how this works because it's not important (to audiology students, anyway -- engineering students should totally look this up) and Wikipedia does it well, but basically, if you have more bits, you can represent more numbers; 1 bit lets you represent 2 numbers, 2 bits lets you represent 4, 3 bits lets you represent 8, and so on down the line: 8 bits lets you represent 256 numbers. We use bits in units of 8 so frequently that we have a unit for that: 8 bits equals one byte. That's the same "byte" as in "kilobyte" (kB, 1000 bytes) or "megabyte" or "terabyte" and... basically, the words and numbers on the packages of thumbdrives that you buy. Bytes and bits describe data size, the amount of information you have. The more bytes (or bits) you have, the more data you have.
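If you want to see that doubling in action, a couple of lines of Python will do it:

```python
# Each extra bit doubles how many distinct values you can represent.
for bits in range(1, 9):
    print(f"{bits} bit(s) can represent {2 ** bits} different numbers")
# 1 bit(s) can represent 2 ... up through 8 bit(s) can represent 256
```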
How big -- how many bits -- is an incoming digital signal? Well, that's going to depend on a few things, so let's go back to the Grand Canyon with our cameras again and take a time-lapse film. How much hard drive space will we need? This depends on how long the time-lapse is and how often we are taking pictures (sampling rate), but also how high-resolution each picture is (quantization). The more detail we are trying to capture with each sample, the more bits we're going to need.
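To put rough numbers on it for audio, here's a quick back-of-the-envelope in Python, using the 16-32kHz sample rates from above and the 16-20 bit sample sizes we'll get to in a moment (the arithmetic is mine):

```python
# One second of digitized audio: sample rate x bits per sample = bits per second.
for sample_rate_hz, bits_per_sample in [(16_000, 16), (32_000, 20)]:
    bits_per_second = sample_rate_hz * bits_per_sample
    print(f"{sample_rate_hz} Hz x {bits_per_sample} bits = "
          f"{bits_per_second / 8 / 1000:.0f} kB per second")
# 16 kHz at 16 bits comes to 32 kB/s; 32 kHz at 20 bits comes to 80 kB/s.
```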
If you remember from earlier: sampling is going from analog-to-digital in time, and quantization is going from analog-to-digital in amplitude. We need a reasonable number of "loudness levels" in order to be able to make sense out of sound; think of how terrible it would be if every sound you heard were coming out at exactly the same volume -- if the soft low hum of your air conditioner and the faint high cheeps of the cicada outside your window became exactly as loud as the phone call you were trying to listen to.
Through a bunch of (I assume) magical math that I didn't get to see and therefore don't understand, the general rule of thumb is that each bit -- each doubling of the number of different "loudness levels" you can have -- can give you another 6dB of dynamic range. A typical hearing aid will run at 16-20 bits, with a 96-120dB dynamic range, meaning that the difference in volume between the softest and loudest sounds the hearing aid will detect (and process and amplify) is 96-120dB.
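Here's at least the arithmetic behind that rule of thumb (the full derivation involves quantization noise, which I'll keep waving my hands at): N bits give you 2^N amplitude levels, and 20*log10(2^N) works out to roughly 6 dB per bit.

```python
import numpy as np

for n_bits in (16, 20):
    dynamic_range_db = 20 * np.log10(2 ** n_bits)
    print(f"{n_bits} bits -> about {dynamic_range_db:.0f} dB of dynamic range")
# 16 bits -> about 96 dB, 20 bits -> about 120 dB, matching the figures above.
```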
You really, really want to leave "headroom" at the ceiling of your dynamic range -- about 6dB, as a rule of thumb. If the loudest sound your mic can pick up is 100dB, give yourself 6dB of headroom (also called "reserve gain") in your ADC so that in case you get a really loud sound, you'll have a bit of wiggle room before the processor goes awry and everything begins to sound like crap. (Also put an output-limiting compressor protection circuit in just before your ADC to make sure it never gets a voltage signal higher than what it can handle, the same way you put the anti-aliasing filter in to make sure it never gets a frequency higher than it can handle.)
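In pseudocode-ish Python, the idea looks something like this. The numbers echo the example above, the names are mine, and a real device does this with a circuit before the ADC rather than in software.

```python
EXPECTED_MAX_DB_SPL = 100   # loudest input we plan for (the mic's ceiling in this example)
HEADROOM_DB = 6             # reserve above that, so surprise peaks have somewhere to go
ADC_CEILING_DB_SPL = EXPECTED_MAX_DB_SPL + HEADROOM_DB   # 106 dB SPL

def protect(level_db_spl):
    """Limit the input so it never exceeds what the ADC can actually handle."""
    return min(level_db_spl, ADC_CEILING_DB_SPL)

print([protect(db) for db in (65, 100, 112)])   # [65, 100, 106]
```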
Note that the softest sound does not need to be 0 dB SPL.
And once again I need to sidestep to explain a term: SPL stands for "sound pressure level," and 0 dB SPL is a reference point for the sensitivity threshold of normal human hearing -- in other words, we call "the volume of the softest possible detectable 1kHz sound" 0dB SPL, then talk about sound volumes in decibels relative to that (the same way 0 degrees Celsius is calibrated against the freezing point of water, and we talk about temperatures in degrees Celsius relative to that).
Anyway. Note that the softest sound does not need to be 0dB SPL. In fact, it shouldn't be -- things that soft are probably background noise rather than signal you'll need to pay attention to. (If they wanted you to pay attention, they'd have made it louder, right?) So we can squeeze a bit more out of our dynamic range by raising the noise floor -- if we have a 100dB dynamic range and want to be able to process volume distinctions up to 120dB SPL, we simply make the noise floor 20dB SPL. This means that the softest sound the hearing aid will even detect is 20dB SPL; anything below 20dB SPL is discarded, turned into silence, all in the name of preservation of digital space.
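As a toy sketch of that gate: the 20dB figure is the example from above, and the frame levels and function name are made up for illustration.

```python
NOISE_FLOOR_DB_SPL = 20   # the example noise floor from above

def apply_noise_floor(frame_levels_db_spl):
    """Keep frames at or above the noise floor; turn everything below it into silence."""
    return [level if level >= NOISE_FLOOR_DB_SPL else None   # None = discarded as silence
            for level in frame_levels_db_spl]

print(apply_noise_floor([12, 35, 18, 64, 22]))
# [None, 35, None, 64, 22]: anything under 20 dB SPL simply never makes it through.
```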
We're just discarding data left and right, aren't we? Yes, we are. The heavier your backpack is, the longer it's going to take you to hike across the mountain -- so you only, only, only take essentials. (In a later post, we'll talk about why latency sucks, but for right now, trust me: latency sucks.) Processors can only do so much when they're tiny and current-strapped and have to be fast. If information isn't necessary for understanding or comfort or something of that sort, out it goes; high frequencies (above the Nyquist) get slapped off by the antialiasing filter so they don't alias, low frequencies get discarded underneath the noise floor.
Even "useful" signals are squeezed as small as possible. If we've decided on a 20dB noise floor ("things softer than 20dB SPL are probably not important so we're not even going to pick them up") then maybe we can also decide that soft sounds -- say, between 20-40dB SPL -- are things we want to be aware of, but we don't need to hear them particularly clearly -- it's ok if they're low-resolution. So we encode them in less space (say, 3-4 bits) whereas louder sounds (say, 60dB SPL) might be encoded with 16 bits.
Maybe this sounds abstract, so here's a musical analogy: say you want to hear a violin concerto, and you're going to get the violin melody (loud) and the orchestral background (soft) as 2 separate mp3 tracks. One of them will be a fantastic mp3; brilliant sound quality, all that -- the other will use the lossiest encoding possible. You get to pick which track gets encoded which way. You'd give the high-quality encoding to the complex, loud violin melody, right? The soft orchestral stuff is great to hear, but if fidelity is going to get lost somewhere, you'd rather it be there than in Joshua Bell's solo.
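If you'd rather see the bookkeeping than the orchestra, here's a crude sketch of level-dependent bit allocation. The tiers and bit counts echo the example numbers above; the function itself is mine, and real hearing aids are considerably cleverer about it.

```python
def bits_for_level(level_db_spl):
    """Decide how many quantization bits a sound at a given level gets."""
    if level_db_spl < 20:      # below the noise floor: not encoded at all
        return 0
    elif level_db_spl < 40:    # soft sounds: worth hearing, low resolution is fine
        return 4
    else:                      # conversational and louder: full resolution
        return 16

print([bits_for_level(db) for db in (15, 30, 65)])   # [0, 4, 16]
```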
So yes. Squeeze, squeeze, squeeze. Don't need that frequency? Discard it. Don't need that resolution? Mangle it into fewer bits. Ultralight backpackers trim the borders off their maps and drill holes in their toothbrush handles to save extra precious ounces, and we want to do the same before the (decimated, somewhat battered) signals hit the DSP.
What happens when they do will be the topic of our next post.