Abstract
Audio-visual dialogue is an appealing means of natural interaction with computers, and lip-reading is one of its important components. This paper proposes using a self-organizing feature map (SOM) and a hierarchical SOM, the Hypercolumn model (HCM), as the module that constructs the phoneme feature space for an HMM-based lip-reading system. These SOMs alleviate many of the difficulties associated with feature space construction. On-line systems, however, require the feature extraction time to be reduced enough to keep pace with normal video camera frame rates. To achieve this, a randomization technique is introduced. Experimental results show the performance of the SOMs on Japanese lip-reading.