Auditory User Interface (Tom Veatch)

Subject: Auditory User Interface
From: Tom Veatch <veatch@BHASHA.STANFORD.EDU>
Date: Sun, 13 Feb 1994 23:44:47 PST

Auditory User Interface

Preface:

In this unfortunately long message I explain what I believe to be some important ideas that will make a significant positive difference in the working lives of blind people. Basically it's about a sophisticated Auditory User Interface (AUI), very much like the GUIs currently being used by the sighted.

I don't know if others are working on ideas like these, so I'm sending this message to all the people I can think of who may be able to help me do one of two things: either to confirm that others have had these ideas before and that they are being implemented in a way that will reach the blind computing community, or to get the ideas worked on by competent people, in such a way that the result becomes available.

If you're not interested, please don't read on; delete this, and accept my apology for filling up your mailbox. However, if you're competent in signal processing and/or graphical user interface programming (e.g. Object-Oriented Widget programming), please read this message. If you believe you can do something with the ideas, please get in touch with me. If you know someone or some population (or email list) of people that might be interested in developing an AUI like this, please pass this on to them, and let me know you did.

It is important that good AUIs are developed for the blind, so please help me by doing what you can to spread this message around to people who could have something to do with implementing it. By sending out this message, by the way, its contents become public domain, and anyone can do with it what they want. I just want blind people to have an AUI available to them.
Background:

Computers have given a new lease on productive life to many blind people, but the user interface that computers present to the blind is to my knowledge very limited as far as output and feedback are concerned: either text-to-speech or (rarely) printed braille are the only outputs I know of. While the Mac, Microsoft Windows, and to a far lesser extent, Unix-based windowing environments such as X-Windows, have brought the benefits of Graphical User Interfaces to the general population, these benefits have been lost on the large population of blind computer users. Nonetheless these people are often highly dependent on their computers for their work.

The problem with text-to-speech is that it is basically a slow, one-dimensional, serial output form. One word at a time is spoken; only one item can be simultaneously presented by the computer to its user. All good user interfaces, and graphical ones in particular, overcome seriality by simultaneous presentation of lots of information. This is why a screen editor is better than a line editor, and why statically presented icons in a Desktop GUI are better than mere tacit availability of remembered names of programs or files in a command-line interface. I believe that much more productive computer work is done using GUIs than using a serial interface. Since by definition the blind cannot use a GUI, what else can be used to substitute?

I spoke for a few minutes with a blind man the other day. He said he was into "sonification" of computer environments. While I told him that the auditory system is much better than the visual system at certain things, he responded that he was more interested in the things that vision can do that hearing cannot. This was a challenge to me, and I thought for a little while about how to make it clear that the blind need not be as handicapped as it may seem.
As someone on the hearing-seminar@apple.com list or some other similar list said once, the auditory system is better able to simultaneously monitor multiple sources of information in parallel, while vision is more serial, focussing on one item at a time. This is the basic insight I would like to see developed into much more sophisticated Auditory User Interfaces. It seems to me that they could be built in a way very similar to Graphical User Interfaces.

I need to present a little background before I get to the meat of the idea. There are people who make tape recordings of nature and such using microphones set inside the acoustically-correct ears of an acoustically-correct dummy head, to record exactly what it would sound like to be there listening yourself, and apparently these recordings are uncannily natural, more than just stereo. Hearing can distinguish not just left-to-right balance (stereo), but to a lesser extent distance, height, and front-versus-back. That is, there is an "auditory space" with some correspondence to physical space, in that human perception can distinguish sounds originating in different places (even simultaneously, as in the cocktail party effect). One can also sense the size and perhaps shape of a room to some extent.

I believe there must be algorithms to map a source sound and a location (relative to the head) to the appropriate pair of sounds to be played into the ears from stereo headphones, so as to create the effect that the sound emanated from that location, as if it had been recorded, for example, using the dummy head in a free-field acoustical environment. This ought to be a mechanical DSP task, though I'm probably missing a lot of details.

Auditory User Interface

With that background, here's the idea.
For each icon, button, or simple or composite widget or graphical object displayed in a GUI, locate a repeating sound at some location in auditory space (preferably corresponding in some sense to the location of the widget in the GUI's visual space). Add up all of the left-ear-mapped outputs, and likewise all of the right-ear-mapped outputs, to form composite left-ear and right-ear signals, and pipe them to stereo headphones. The effect should be a cocktail-party-like simultaneous presentation of lots of sources at once. Just as I can tell very quickly when my old car is making a new sound, an AUI user could monitor a lot of sound-sources at once.

One could play around a lot with hooking up each widget or icon to different kinds of sounds. The general rules would be 1) to minimize the obnoxiousness of having to listen to the stuff all day long, 2) to maximize the distinctness of the sounds, and 3) to maximize ease-of-learning. Candidates include nature sounds (birds and gorillas come to mind as relatively easily distinguishable), musical instruments (perhaps relatively pleasant-sounding), or human voices repeating a relevant word or phrase like "open file" or "save file" (relatively easily learned). While repeated phrases are easiest to learn, they also might be more obnoxious to live with. If non-speech is used, there should be a one-sound-at-a-time mode for learning what the meanings of the sounds are.

To optimally implement such a system, experiments would have to be done to determine the granularity of auditory space, on the basis of which one could prevent different auditory icons from being located indistinguishably close together (though the types of the paired sounds undoubtedly interact with their distinguishability).

Next, a mouse could be represented auditorily with a bee buzz or something (a bee can fly around in space, after all, which is why it seems appropriate to me), which can be moved around, at least in two dimensions, in the auditory space as the mouse moves around on the mouse pad.
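As a rough illustration of the two signal-processing steps described so far (placing one sound at a location, then adding the per-ear outputs together), here is a minimal sketch in Python. It is not a dummy-head or HRTF model: it only imitates left/right position with an interaural time delay (Woodworth's approximation) and a constant-power level difference, and ignores height, distance, and front-versus-back. The names `spatialize` and `mix`, the sample rate, and the head radius are all assumptions made for this example.

```python
import math

SAMPLE_RATE = 44100      # samples per second (assumed)
HEAD_RADIUS = 0.0875     # metres, a commonly quoted average (assumed)
SPEED_OF_SOUND = 343.0   # metres per second

def spatialize(mono, azimuth_deg):
    """Return (left, right) sample lists placing `mono` at an azimuth.

    azimuth_deg: 0 is straight ahead, +90 hard right, -90 hard left.
    Only left/right cues are modelled: a time delay at the far ear
    and a constant-power level difference between the ears.
    """
    az = math.radians(max(-90.0, min(90.0, azimuth_deg)))
    # Extra travel time to the far ear, rounded to whole samples.
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (abs(az) + math.sin(abs(az)))
    delay = int(round(itd * SAMPLE_RATE))
    # Constant-power pan: -90 degrees is all left, +90 is all right.
    pan = (az + math.pi / 2) / 2
    gain_l, gain_r = math.cos(pan), math.sin(pan)
    near = list(mono) + [0.0] * delay   # ear facing the source
    far = [0.0] * delay + list(mono)    # far ear hears it later
    if az >= 0:                         # source on the right
        left, right = far, near
    else:                               # source on the left
        left, right = near, far
    return [gain_l * s for s in left], [gain_r * s for s in right]

def mix(sources):
    """Sum spatialized sources into one composite (left, right) pair.

    sources: list of (mono_samples, azimuth_deg) tuples.
    """
    rendered = [spatialize(m, az) for m, az in sources]
    n = max((len(l) for l, _ in rendered), default=0)
    left, right = [0.0] * n, [0.0] * n
    for l, r in rendered:
        for i, s in enumerate(l):
            left[i] += s
        for i, s in enumerate(r):
            right[i] += s
    return left, right

if __name__ == "__main__":
    # Two hypothetical widget sounds: a low tone on the left, a high one on the right.
    low = [math.sin(2 * math.pi * 220 * i / SAMPLE_RATE) for i in range(1000)]
    high = [math.sin(2 * math.pi * 880 * i / SAMPLE_RATE) for i in range(1000)]
    left, right = mix([(low, -60), (high, 45)])
    print(len(left), len(right))
```

A real system would presumably replace `spatialize` with filtering through measured head-related responses; the summing step in `mix` would stay the same.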
The sources should be fairly low volume, so as to form an auditory background or murmur, from which the attentive listener can pick out what is of interest. Highlighting can be represented by increasing the volume or pitch or repetition rate, etc., of the sound associated with a widget. So when the mouse's auditory-space location gets close to the widget's auditory-space location (tracked by buzzing the mouse towards the widget's sound-source), the widget can be highlighted in this way, and a click can produce some appropriate effect on it.

Association of simple widgets as part of a complex widget (e.g., buttons in a button box) could be done by putting their collective output through some kind of filter, adding some kind of room-reflectance or low-volume pink noise or something to show that they're associated.

Even sequences of text, or computer programs, can be represented auditorily in parallel through such an AUI, by simultaneously doing text-to-speech for each, and by locating each line of the program at a different point in auditory space (within the constraints of what's distinguishable). Highlighting of selected ranges of text could be done by changing the voice-quality or pitch of the text-to-speech system's speaker.

Enhancements:

Later, more expensive versions of this kind of system might have a virtual-reality-like reorientation system, which tracks the user's head orientation and changes the sounds' apparent locations so they don't appear to move, but only the head moves in a fixed space, giving the effect that the user can "look at" different locations, thus bringing them to front/center/level. This kind of input function could substitute even more naturally for the function of a mouse. It could also be done cheaply using some appropriate magnet-on-a-hat and some kind of passive electromagnetic tracking system. Joystick applications could also be envisioned.
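Two of the mechanisms above reduce to small pieces of arithmetic, sketched here in Python under stated assumptions. The first raises a widget's volume as the pointer (the "bee") approaches it in auditory space; the second re-expresses a widget's fixed position relative to a tracked head orientation, so the scene stays put when the head turns. Only left/right angle (azimuth, in degrees) is considered, and the function names, the 15-degree highlight radius, and the gain values are hypothetical choices for illustration, not measured thresholds.

```python
import math

def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

def highlight_gain(pointer_az, widget_az, radius_deg=15.0,
                   base_gain=0.2, max_gain=1.0):
    """Volume for a widget, raised linearly as the pointer approaches it.

    Outside `radius_deg` the widget stays at the quiet background level.
    """
    d = angular_distance(pointer_az, widget_az)
    if d >= radius_deg:
        return base_gain
    closeness = 1.0 - d / radius_deg    # 0 at the edge, 1 on target
    return base_gain + (max_gain - base_gain) * closeness

def head_relative(widget_az, head_yaw):
    """Azimuth of a fixed widget relative to the listener's nose.

    Result is in (-180, 180]; 0 means the user is "looking at" it.
    """
    rel = (widget_az - head_yaw + 180.0) % 360.0 - 180.0
    return 180.0 if rel == -180.0 else rel

if __name__ == "__main__":
    # Pointer sweeping towards a widget located 30 degrees to the right.
    for pointer in (0.0, 20.0, 25.0, 30.0):
        print(pointer, round(highlight_gain(pointer, 30.0), 3))
    # Turning the head towards the widget brings it to front and centre.
    for yaw in (0.0, 15.0, 30.0):
        print(yaw, head_relative(30.0, yaw))
```

The same proximity measure could drive pitch or repetition rate instead of volume; linear interpolation is used here only because it is the simplest choice.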
In different program contexts (e.g., a word-processor vs a spreadsheet vs something else), the rhythm or room-reflectance or speed of the sounds could be changed, creating a different auditory context, in addition to the differences from simply having a different set of auditory widgets in the environment. One could do effects like virtual-reality flying, too, so that different programs could reside in different virtual locations, and the user could move around in the virtual space using a mouse or other input functions and make use of the functions available there.

Summary:

While the last two paragraphs have gotten rather extreme and visionary, I believe the basic ideas above are both sound and fairly easily implementable. Every widget in an object-oriented GUI could simply be augmented with an auditory representation (an associated digitized sound file) and an auditory-space location (consistent with what's distinguishable in auditory space), and the AUI infrastructure should take care of generating the composite auditory scene. This AUI infrastructure should include 1) the signal-processing to map a one-channel sound and a location relative to the head into two sounds that put the source at the correct apparent location in auditory space, plus 2) addition, to put all the sources in their locations at once, plus 3) event-mapping that deals with mouse location (moving it around in auditory space) and highlighting (doing the appropriate volume-control or other change to the source to signify highlighting) and other events like mouse-clicks, etc.

Appeal:

Someone who knows more about GUI implementation than I do should be able to take this and run with it. I don't know enough about acoustics or about GUI design or about signal-processing or about Object-Oriented Programming to be able to do a decent version of this myself. Also I work more than full time and don't have time to pursue it.
But as I said initially, an AUI of this kind would vastly enrich the computer interface available to the blind, and make their work much more productive and enjoyable, significantly reducing the limitations on working life produced by blindness. I want to see it happen. I think most of us want to see it happen, too. The fact that it has virtual-reality implications, as well as implications for enhancing sighted computer-users' interactions with computers, should also not be overlooked.

If you can help these ideas happen by passing them on to someone you know with the right expertise or interest, please do so, and let me know about it.

Thanks very much.

Tom Veatch

This message came from the mail archive
http://sound.media.mit.edu/dpwe-bin/mhindex.cgi/AUDITORY/postings/1994
maintained by:

Tom Veatch