
Comprehensive List of Researchers "Information Knowledge"

Department of Media Science

NAKATANI, Tomohiro
Media Expression Group
Visiting Associate Professor
Dr. of Informatics
Research Field
Sound Scene Analysis / Speech Enhancement / Statistical Signal Processing

Current Research

Information Extraction from Speech Communication Scenes
A human listener with normal hearing can easily grasp everyday sound situations. For example, one can converse with others while watching a TV show, or react to a ringing telephone while disregarding fan noise. In short, people can distinguish individual sounds within complex, time-varying sound scenes and extract the relevant information from them. The goal of this research is to give a computer such a sound scene analysis capability. Such a computer would support human-machine and human-human communication more intelligently, and would let us retrieve information from sound media, such as home videos and Internet content, more flexibly. Toward this goal, we are investigating, from a statistical signal processing viewpoint, sound scene models, meeting diarization, speech enhancement and recognition, and intelligent speech interfaces.
(1) Statistical Signal Processing for Sound Scene Analysis
Suppose that a sound from a source is captured by microphones together with other sounds, after being shaped by the acoustics of the room. Predicting the resulting sound characteristics is called a "direct problem". In contrast, analyzing how the individual sounds were generated and captured, given only the captured signals, is called an "inverse problem". Sound scene analysis is an inverse problem. In general, an inverse problem may have numerous solutions, so certain constraints or empirical knowledge are needed to obtain a useful one. One systematic approach to the inverse problem is statistical signal processing. In statistical-signal-processing-based sound scene analysis, mathematical models of the captured sounds are introduced based on the physical relationships among the sources and the room acoustics, together with probabilistic models encoding empirical knowledge. The sound scene is then analyzed by estimating the model parameters that best fit the captured sounds. One important issue is how to keep the models simple enough that useful information can be extracted at a tractable computational cost.
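As a toy illustration of the direct/inverse distinction (a minimal sketch under simplified assumptions, not the group's actual models), the following snippet first simulates the direct problem with a hypothetical room impulse response, then solves the inverse problem of recovering that response from the captured signal by least-squares model fitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Direct problem: a source signal reaches the microphone after being
# filtered by a (hypothetical) room impulse response, plus noise.
source = rng.standard_normal(1000)
room_ir = np.array([1.0, 0.6, 0.3, 0.1])           # toy room acoustics
observed = np.convolve(source, room_ir)[:1000]
observed += 0.01 * rng.standard_normal(1000)       # measurement noise

# Inverse problem: given the source and the captured signal, estimate
# the room impulse response by fitting a linear model in the least-
# squares sense (a simple instance of statistical parameter estimation).
X = np.column_stack([np.concatenate([np.zeros(k), source[:1000 - k]])
                     for k in range(4)])
ir_est, *_ = np.linalg.lstsq(X, observed, rcond=None)
print(np.round(ir_est, 2))                          # close to room_ir
```

Real sound scenes are far harder: the source itself is unknown, the filters are long and time-varying, and multiple sources overlap, which is why probabilistic models and constraints become essential.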
(2) Meeting Diarization
As a concrete example of sound scene analysis, we are studying meeting diarization, whose goal is to detect "who spoke when" in a meeting (Fig. 1). First, we developed a robust speech activity detection method, called MUSCLE-VAD, which can precisely estimate speech durations in captured sounds at very low computational cost. We introduced a probabilistic speech model into the method and integrated it with a time-varying noise model using a switching Kalman filter. We have also developed a prototype meeting diarization system that runs in real time with three microphones by integrating voice activity detection, source direction-of-arrival estimation, and speaker clustering.
(3) Speech Enhancement and Recognition
As applications of sound scene analysis, speech enhancement and recognition are both very important. Speech enhancement reduces the effects of noise and reverberation in the captured sounds to recover the original quality of the speech, and is useful for hands-free speech communication and for refining sound content in studios. In addition, speech recognition can be improved even when extraneous sounds are present at the same time, provided that the appropriate relationships between the sounds are extracted. We have developed robust speech enhancement and recognition techniques based on sound scene analysis.
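One classical way to reduce additive noise, shown here purely as a hedged illustration of what "speech enhancement" means (magnitude spectral subtraction; not the specific methods developed in this research), is to estimate the noise spectrum from a noise-only excerpt and subtract it frame by frame:

```python
import numpy as np

def spectral_subtraction(noisy, noise_excerpt, frame=256):
    """Toy single-channel enhancement by magnitude spectral subtraction.
    The noise magnitude spectrum is estimated from a noise-only excerpt,
    subtracted from each frame's magnitude (floored at zero), and the
    noisy phase is reused for resynthesis. No windowing/overlap, so
    frame-boundary artifacts remain; this is only a sketch."""
    n_frames = len(noisy) // frame
    noise_mag = np.abs(np.fft.rfft(noise_excerpt[:frame]))
    out = np.zeros(n_frames * frame)
    for i in range(n_frames):
        seg = noisy[i * frame:(i + 1) * frame]
        spec = np.fft.rfft(seg)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[i * frame:(i + 1) * frame] = np.fft.irfft(
            mag * np.exp(1j * np.angle(spec)))
    return out

rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)   # 440 Hz tone
noise = 0.3 * rng.standard_normal(4096)
enhanced = spectral_subtraction(clean + noise, noise)
print(np.mean(noise ** 2), np.mean((enhanced - clean) ** 2))
```

Reverberation requires different treatment (dereverberation), since it is a convolutive rather than additive distortion, which is one reason sound-scene-level modeling matters.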
By further pursuing the above themes, we aim to confirm the applicability of sound scene analysis techniques in more general environments and to develop practical applications that are useful in everyday life and business settings.


  • Tomohiro Nakatani received his M. E. and Dr. of Informatics degrees from Kyoto University. Since 1991, he has been working with NTT Corporation.
  • From 2005 to 2006, he was a visiting scholar at the Georgia Institute of Technology, Atlanta. Since 2008, he has also been a Visiting Associate Professor in the Graduate School of Information Science, Nagoya University.

Academic Societies

  • IEEE
  • ASJ


Selected Publications

  1. Harmonic sound stream segregation using localization and its application to speech stream segregation, Speech Communication, Vol. 27 (3-4), 209-222 (1999).
  2. Robust and accurate fundamental frequency estimation based on dominant harmonic components, J. Acoustical Society of America, Vol. 116 (6), 3690-3700 (2004).
  3. Harmonicity based blind dereverberation for single channel speech signals, IEEE Trans. Audio, Speech, and Language Processing, Vol. 15 (1), 80-95 (2007).