the channel is lossy

on silbo gomero, and what's actually doing the work in speech

Silbo gomero is a whistled language used on La Gomera. It transposes spoken Spanish onto a single whistled tone — the second formant of each vowel becomes a fundamental frequency. Front vowels /i, e/ ride high, around 2–3 kHz. Back vowels /a, o, u/ ride low, around 1–1.5 kHz. Consonants are encoded as modulations of the vowel line: pitch level (high/low) crossed with signal envelope (interrupted/continuous). A two-by-two.

The phonological inventory of the whistle is much smaller than the phonological inventory of Spanish. Researchers report 2–4 vowels and 4–10 consonants in the whistle. Spanish has five vowels and around nineteen consonants. So a silbador whistling Spanish is running it through a compressor that drops most of the consonant contrasts and at least one vowel contrast. /p, t, k/ collapse together. /b, d, g/ collapse together. The two groups are encoded only by where the interruption sits, high or low.

what doesn't survive transmission

I went looking for clean intelligibility numbers — what percentage of sentences make it through. I couldn't find a published figure. The brain-imaging work (Carreiras 2004, 2005) is robust on a different question: silbadores process the whistle in language cortex while non-speakers process it as whistling. Beautiful result about how speakers route signals. Not about how much makes it through the channel.

What does seem agreed in the literature: the whistled signal is more ambiguous than the spoken signal. Silbadores compensate with repetition, stylized phrasing, and context. The gap between “you whistled it” and “I understood it” gets closed by the receiver, not by the channel.

the redundancy was doing other work

The interesting thing isn't that whistling can carry language. It's that you can throw away most of the phonological contrasts of Spanish and the message still arrives.

Most of what makes a phoneme distinct in spoken Spanish is doing redundant work. /p, t, k/ are contrasted three ways — lips, alveolar ridge, velum. Silbo collapses that distinction and the sentence still makes it because context already ruled out almost every word that wasn't going to fit. The consonant only had to disambiguate among the small set the listener was already entertaining.

The language model in the listener's head is doing a huge fraction of the work in normal speech, and we just don't notice because the acoustic channel is normally redundant enough to coast. Silbo strips the redundancy and you can see the model working.

A similar move shows up in compressed-speech research: intelligibility holds up under brutal acoustic degradation if the listener knows the language and the topic. Silbo is a culturally evolved version of that — a compression scheme tuned to a language whose listeners can run the decoder.

what i'm not claiming

That silbo “proves” anything about how speech recognition works in general. It's one channel-coding scheme on one language by one community. But it's a working existence proof that you can drop most of the contrasts and still land most of the message, which constrains how much of speech recognition can be acoustic-bottom-up.

— jj · sources: Silbo Gomero, Rialland 2003 · home