@jonny yeah -- how hard spectrogram reading is, and the wacky nonlinearities of speech production and perception, are at least partly down to the fact that speech production isn't a sequence of phonemes but the loosely time-coordinated movement of multiple articulators
It's possible to think of each IPA segment as a target in an articulator Hilbert space
We could be writing something like a musical score with clefs for each instrument (lips, tongue root, tongue blade, breath, larynx) instead of IPA
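Rough sketch of what that score could look like as data, just to make the idea concrete. The track names are the ones above; the `Gesture` class, the `active_at` helper, the target labels, and all the timings are made up for illustration, not measurements and not an existing notation:

```python
from dataclasses import dataclass

# One "staff" per articulator, named as in the post above.
ARTICULATORS = ("lips", "tongue root", "tongue blade", "breath", "larynx")


@dataclass(frozen=True)
class Gesture:
    """One note on one staff: an articulator holding a target for a time span."""
    articulator: str
    target: str      # free-text label, hypothetical
    start_ms: int    # inclusive
    end_ms: int      # exclusive


def active_at(score, t_ms):
    """Return {articulator: target} for every gesture sounding at time t_ms."""
    return {
        g.articulator: g.target
        for g in score
        if g.start_ms <= t_ms < g.end_ms
    }


# Invented timings for something like "pa". Note that the gestures overlap
# and do not line up at a single segment boundary.
score = [
    Gesture("lips", "closed", 0, 80),
    Gesture("larynx", "open (voiceless)", 0, 110),
    Gesture("breath", "egressive", 0, 300),
    Gesture("tongue root", "retracted (low vowel)", 40, 300),
    Gesture("larynx", "voicing", 110, 300),
]

print(active_at(score, 60))
print(active_at(score, 200))
```

Slicing the score at any instant gives you a vertical "chord" of articulator states, and nothing forces those chords to change all at once the way a string of IPA symbols implies.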