Oral deaf audio MacGyver: identifying speakers
Being oral deaf is like being MacGyver with audio data, except that the constant MacGyvering is normal, since you do it for every interaction of every day. Posting because this seems interesting/useful to other people, although I'm personally still in the "wait, why are people so amused/surprised by this... does not everyone do this, is this not perfectly logical?" camp.
I was explaining how I use my residual hearing to sort-of identify speakers, using faculty meetings as an example. The very short version is that it's like constructing and doing logic grid puzzles constantly. Logic grid puzzles are ones where you get clues like...
- There are five houses.
- The Englishman lives in the red house.
- The Spaniard owns the dog.
- Coffee is drunk in the green house.
- The Ukrainian drinks tea.
- The green house is immediately to the right of the ivory house.
...and so forth, and have to figure out what's going on by making a grid and working out that, say, the Ukrainian can't possibly live in the green house, because they drink tea and the green house person drinks coffee, and so forth.
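For the curious, here's a minimal sketch of what "making a grid and eliminating" amounts to in code. It brute-forces assignments consistent with just the clues quoted above (not the whole puzzle -- the Spaniard/dog clue isn't even encoded), and the deduction about the Ukrainian falls out of pure elimination:

```python
# Toy "grid elimination" for a few zebra-puzzle clues: enumerate every
# possible assignment of people/colors/drinks to houses, throw out the
# ones that violate a clue, and see what all survivors agree on.
from itertools import permutations

HOUSES = range(5)  # house positions 0..4, left to right

worlds = []
for english, ukrainian in permutations(HOUSES, 2):
    for red, green, ivory in permutations(HOUSES, 3):
        if english != red:       # the Englishman lives in the red house
            continue
        if green != ivory + 1:   # green is immediately right of ivory
            continue
        for coffee, tea in permutations(HOUSES, 2):
            if coffee != green:   # coffee is drunk in the green house
                continue
            if ukrainian != tea:  # the Ukrainian drinks tea
                continue
            worlds.append((ukrainian, green))

# In every surviving world, the Ukrainian is outside the green house:
assert all(ukr != grn for ukr, grn in worlds)
print(f"{len(worlds)} consistent worlds remain; the Ukrainian never lives in the green house")
```

That's the whole trick: enumerate what's possible, strike out what contradicts a clue, and see what every surviving world agrees on.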
Now the long explanation, in the context of being oral deaf. Some background: I'm profoundly deaf, with some low-frequency hearing; I use hearing aids and a hybrid CI (typically the CI plus one hearing aid). Generally speaking, I can't actually hear enough to identify people through voice alone -- but I can say some things about some attributes of their voice. For instance, I can tell (to some approximation) if a singer is in-tune, in-rhythm, and in control of their voice, and I can tell the difference between a low bass and a first soprano... but I wouldn't be able to listen to a strange song and go "oh, that's Michael Bublé!" (My hearing friends assure me that his voice is quite distinctive.)
However! When I know people and have heard their voice (along with lipreading and context) for a while, I learn which perceivable attributes their voices do and don't have. And even when I'm not using my residual hearing/audio-related gadgetry to get semantic information (i.e. the words someone is saying) because I have better alternatives in that context (interpretation, captioning), I will still want audio...
...and I will pause for a short sidebar right now, because it might seem, to hearing people, that this is the only logical course of action -- that hearing more is always good for understanding more. It isn't. Extra information is only information if it's worth the mental effort tradeoff to turn it into useful data; otherwise, it's noise. It's the same reason you would probably be happy if the background noise in a loud bar went away while you were talking to your friend. That background noise is "extra data," but it's not informative to you, and it takes extra effort to filter it out.
In my case -- and the case of my deaf friends who prefer to not use residual hearing when there's another access option available -- we're patching across multiple languages/modalities on a time delay, and that triggers two competing thought streams. If you want to know what that feels like, try to fluently type a letter to one friend while speaking to another on a different topic. Physically, you can do it -- your eyeballs and hands are on the written letter, your ears and mouth are in the spoken conversation -- but your brain will struggle. Don't switch back and forth between them (which is what most people will immediately start to do) -- actually do both tasks in parallel. It's very, very hard. In our case, one stream is lossy auditory English as the speaker utters something, and the other is clear written English or clear ASL visuals some seconds behind it. (Assuming your provider is good. Sometimes this data stream is... less clear and accurate than one might like.) Merging/reconciling the two streams is one heck of a mental load... and since we *can* shut off the lossy auditory English as "noise" rather than "signal," sometimes we do.
Anyway, back to the main point. Sometimes I don't want the audio data for semantic purposes -- but I want it for some other purposes, so I'll leave my devices on. Oftentimes, this reason is "I'd like to identify who's speaking." Knowing who said what is often just as important as what's being said, and this is often not information available through that other, more accessible data stream -- for instance, a random local interpreter who shows up at your out-of-state conference will have no idea who your long-time cross-institutional colleagues are, so you'll get something like "MAN OVER THERE [is saying these things]" and then "WOMAN OVER THERE [is saying these things]" and then try to look in that direction yourself for a split-second to see which WOMAN OVER THERE is actually talking.
This is where the auditory data sometimes comes in. I can sometimes logic out some things about speaker identity using my fuzzy auditory sense along with other visually-based data, both in-the-moment and short-term-memorized.
By "fuzzy sense," I mean that auditorily -- sometimes, in good listening conditions -- I can tell things like "it's a man's voice, almost certainly... or rather, it is probably not a high soprano woman." By in-the-moment visual data, I mean things like "the person speaking is not in my line of sight right now" and "the interpreter / the few people who are in my line of sight right now are looking, generally, in this direction." By short-term-memorized visual data, I mean things like "I memorized roughly who was sitting where during the few seconds when I was walking into the room, but not in great detail because I was also waving to a colleague and grabbing coffee at the same time... nevertheless, I have a rough idea of some aspects of who might be where."
So then I think -- automatically -- something like this: "Oh, it's a man now, and not in my line of sight right now, and that leaves two possibilities, because I quasi-memorized where everyone was sitting when I walked into the room, so using the process of elimination..."
Again, the auditory part is mostly about gross differences, like bass voices vs. sopranos with no background noise. Sometimes it's not just about what I can identify about voice attributes, but about what I can't -- "I don't know if this is a man or a woman, but this person is not a high soprano... also, they are not speaking super fast, based on the rhythm I can catch. Must not be persons X or Y."
For instance, at work, I have colleagues whose patterns are...
- Slow sounds, many pauses, not a soprano
- Super fast, not a bass, no pauses, machine gun syllable patterns
- Incredibly variant prosody, probably not a woman but not obviously a bass
- Slower cadence and more rolling prosody with pauses that feel like completions of thoughts rather than mid-thought processing (clear dips and stresses at the ends of sentences)
- Almost identical to the above, but with sentences that often haven't ended: pauses occur mid-thought, and the prosodic patterns repeat, halt, and repeat
These are all distinctive fingerprints, to me -- combine them with knowing where people are sitting, and I have decently high confidence in most of my guesses. And then there are people who won't speak unless I'm actually looking at them or the interpreter or the captioning, and that's data too. ("Why is it quiet? Oh! Person A is going to talk, and is waiting for me to be ready for them to speak.")
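If you wanted to mechanize that elimination step, it might look something like this sketch -- the colleague names, attribute fields, and seating data are all invented stand-ins for the kinds of patterns above, not a real system:

```python
# Elimination over fuzzy voice fingerprints: each colleague gets a rough
# profile of perceivable attributes, and each fuzzy observation strikes
# out the candidates it contradicts. All names/fields/values invented.

colleagues = {
    "A": {"register": "bass",    "pace": "slow", "pauses": "many"},
    "B": {"register": "mid",     "pace": "fast", "pauses": "none"},
    "C": {"register": "mid-low", "pace": "slow", "pauses": "many"},
    "D": {"register": "soprano", "pace": "fast", "pauses": "few"},
}

# Seating quasi-memorized on the way in: who is outside my line of sight.
out_of_sight = {"A", "C", "D"}

def eliminate(candidates, predicate):
    """Keep only the candidates consistent with one fuzzy observation."""
    return {name for name in candidates if predicate(colleagues[name])}

# Observation 1: the speaker is not in my line of sight right now.
candidates = out_of_sight
# Observation 2: definitely not a high soprano.
candidates = eliminate(candidates, lambda v: v["register"] != "soprano")
# Observation 3: the rhythm I can catch is slow.
candidates = eliminate(candidates, lambda v: v["pace"] == "slow")

print(candidates)  # -> {'A', 'C'}: down to two possibilities
```

Two candidates left is often good enough; one split-second glance in the right direction settles it.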
There's more to this. Sometimes I'll look away and guess at what they're saying because I know their personalities, their interests, what they're likely to say and talk about, opinions they're likely to hold... I build Markov models for their sentence structures and vocabularies, and I'm pretty good at prediction... there's a lot more here, but this is a breakdown of one specific aspect of the constant logic puzzles I solve in my head as a deaf person.
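Since "Markov models" is doing real work in that sentence: the per-person prediction is roughly a next-word frequency table, something like this toy bigram sketch (the training sentences are made up):

```python
# A toy per-person bigram ("Markov") predictor: count which word tends
# to follow which for a given speaker, then guess the likeliest next
# word. The training sentences are invented examples.
from collections import defaultdict, Counter

def train_bigrams(sentences):
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, word):
    """Most likely next word for this speaker, or None if unseen."""
    options = follows.get(word.lower())
    return options.most_common(1)[0][0] if options else None

# One model per colleague, trained on the things they tend to say.
model = train_bigrams([
    "we should loop back to the budget",
    "we should table that until the budget review",
])

print(predict_next(model, "should"))  # -> "loop" (ties break first-seen)
```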
In terms of my pure-tone audiogram, I shouldn't be able to do what I do -- and it's true, I can't from in-the-moment audio alone. But combined with a lot of other things, including a tolerance of extreme cognitive fatigue? Maybe. In the "zebra puzzle," which is where the example clues at the beginning came from, there is a series of clues that goes on and on... and then the questions at the end are "who drinks water?" and "who owns the zebra?" Neither water nor a zebra is mentioned in any of the clues above, so the first response might be "what the... you never said anything about... what zebra?" But you can figure it out with logic. Lots of logic. And you have the advantage of knowing that the puzzle is a logic puzzle and that it ought to be solvable -- with enough logic, you can figure out who owns the zebra. In the real world... nobody tells you something could become a logic puzzle, and you never know whether it's solvable. But I try anyway.