Contents of Proceedings




Keynote Lectures



IMPACT OF VARIABILITIES ON SPEECH RECOGNITION

Mohamed Benzeghiba (1), Renato De Mori (2), Olivier Deroo (3), Stephane Dupont (4), Denis Jouvet (5), Luciano Fissore (6), Pietro Laface (7), Alfred Mertins (8), Christophe Ris (4), Richard Rose (9), Vivek Tyagi (1), Christian Wellekens (1)

(1) Institut Eurecom, FRANCE
(2) LIA, Avignon, FRANCE
(3) Acapela, Mons, BELGIUM
(4) Multitel, Mons, BELGIUM
(5) France Telecom, Lannion, FRANCE
(6) Loquendo, Torino, ITALY
(7) Politecnico, Torino, ITALY
(8) Carl von Ossietzky University, Oldenburg, GERMANY
(9) McGill University, Montreal, CANADA

Page 3

Abstract:
Major progress is being recorded regularly in both the technology and the exploitation of Automatic Speech Recognition (ASR) and spoken language systems. However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as sensitivity to the environment (background noise or channel variability) and the weak representation of grammatical and semantic knowledge. Current research is also emphasizing deficiencies in dealing with the variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes use by specific populations. There are in fact many factors affecting speech realization: regional, sociolinguistic, or related to the environment or the speaker. These create a wide range of variations that may not be modeled correctly (speaker, gender, speech rate, vocal effort, regional accent, speaking style, non-stationarity...), especially when resources for system training are scarce. This paper outlines some current advances related to variabilities in ASR.






SPEECH SYNTHESIS ON THE WAY TO EMBEDDED SYSTEMS

Ruediger Hoffmann

Technische Universität Dresden, GERMANY

Page 17

Abstract:
The historical development of speech synthesis terminals is closely connected to the progress of electronics technology, ranging from the old electronic tubes via transistors to VLSI circuits. The paradigm shift from parametric to time-domain synthesis in the 1990s split development into large (server-based) TTS systems using unrestricted computational resources on the one hand, and small (embedded) TTS systems using low computational power and restricted memory space (footprint) on the other. The latter of these two lines requires a complicated tradeoff: high-quality TTS systems should be shrunk to a very low footprint, but at the same time the quality of the synthetic speech should be preserved. In the past, the “magic border” of the footprint for embedded TTS was 1 Megabyte. In concatenative synthesis, this goal can be approached by sophisticated coding methods applied to the databases. There is now a tendency, however, to develop embedded TTS systems aiming at still lower footprints based on improved parametric approaches.






WEB-BASED SPEECH DATA COLLECTION AND ANNOTATION

Christoph Draxler

Institut für Phonetik und Sprachliche Kommunikation, Ludwig-Maximilians-Universität München, GERMANY

Page 27

Abstract:
The WWW is a ubiquitous, mature communication infrastructure for business and scientific information interchange. Since 1997, the Bavarian Archive for Speech Signals (BAS) has been developing and using web-based annotation tools for large-scale speech databases. Recently it has developed an application for recording speech via the WWW. Both the annotation and the recording tools are now integrated into a web application for the Ph@ttSessionz speech database collection project. The goals of Ph@ttSessionz are a) to demonstrate the feasibility of WWW-based speech recording and annotation, and b) to collect a database of 1000 adolescent German speakers for the development of speech-driven applications and devices. The recordings are performed in more than 35 public schools all over Germany. This paper describes the recording and annotation software and discusses the technical problems that had to be overcome in the speech database collection.





Session: Automatic Speech Recognition
Chairs: Andrey Ronzhin, SPIIRAS, Russia
Christian Wellekens, Institut Eurecom, France
Nikos Fakotakis, University of Patras, Greece



SPEECH TRANSCRIPTION SERVICES

Dimitri Kanevsky, Sara Basson, Stanley Chen, Alexander Faisman, Alex Zlatsin (1), Sarah Conrod (2), Allen McCormick (3)

(1) IBM T. J. Watson Research Center, Yorktown Heights, NY, UNITED STATES
(2) Cape Breton University, Sydney, NS, CANADA
(3) ADM solutions, Dominion, NS, CANADA

Page 37

Abstract:
This paper outlines the background development of "intelligent" technologies such as speech recognition. Despite significant progress in the development of these technologies, they still fall short in many areas, and rapid advances in dictation have actually stalled. This paper proposes semi-automatic solutions for the smart integration of human and machine efforts. One such technique involves improvements to the speech recognition editing interface, thereby reducing the perception of errors by the viewer. Other techniques described in the paper include batch enrollment, which allows users of speech recognition systems to have their voices trained by a third party, thus reducing user involvement in preliminary voice-enrollment processes. Content spotting, a tool that can be used in environments with repetitive speech input, is also described.






EVALUATION OF HMM-BASED FEATURE COMPENSATION APPLIED TO A FINITE-STATE GRAMMAR SPEECH RECOGNIZER

Akira Sasou (1), Hiroaki Kojima (1), Shuichi Itabashi (2), Kazuyo Tanaka (3)

(1) National Institute of Advanced Industrial Science and Technology (AIST), JAPAN
(2) National Institute of Informatics, Research Organization of Information and Systems, JAPAN
(3) University of Tsukuba, Institute of Library and Information Science, JAPAN

Page 44

Abstract:
In this paper, we describe a Hidden Markov Model (HMM)-based feature-compensation method. The proposed method compensates for noise-corrupted features using the output probability density functions (pdfs) of the clean acoustic HMMs provided to the recognizer in advance. In this way, the proposed method achieves model-based feature compensation without any extra parameters. In compensating for the features, the output pdfs are adaptively weighted according to the forward path probabilities. Because of this, the proposed method can minimize the degradation of feature-compensation accuracy due to temporary changes in the noise environment. We applied the proposed feature compensation to a finite-state grammar speech recognizer and evaluated it by conducting hundred-word recognition experiments in noisy environments. The experimental results indicate that, compared with the baseline performance, the proposed method achieved a 12.06% improvement in accuracy on overall average.






IMPROVING ROBUSTNESS OF A LIKELIHOOD-BASED BEAMFORMER IN A REAL ENVIRONMENT FOR AUTOMATIC SPEECH RECOGNITION

Luca Brayda (1), Christian Wellekens (1), Maurizio Omologo (2)

(1) Institut Eurecom, FRANCE
(2) ITC-irst, ITALY

Page 50

Abstract:
The performance of distant-talking speech recognizers in real noisy environments can be increased by using a microphone array. In this work we propose an N-best extension of the Limabeam algorithm, a likelihood-based adaptive filter-and-sum beamformer. We show that this algorithm can be used to optimize the noisy acoustic features using, in parallel, the N-best hypothesized transcriptions generated in a first recognition step. The parallel and independent optimizations increase the likelihood of minimal-word-error-rate hypotheses, and the resulting N-best hypothesis list is automatically re-ranked. Results show improvements over delay-and-sum beamforming and unsupervised Limabeam on a real database with a considerable amount of noise and limited reverberation.






SURVEY OF RUSSIAN SPEECH RECOGNITION SYSTEMS

Andrey Ronzhin, Rafael Yusupov, Izolda Li, Anastasia Leontieva

Saint-Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, RUSSIA

Page 54

Abstract:
The idea of natural speech interaction arose as soon as electronic computing machines were created. The command interface of the first computers did not provide acceptable speed or naturalness of interaction. Over many years of investigation, a large number of methods and software tools addressing the speech recognition problem have been developed. Significant results were achieved in speaker-dependent recognition of isolated speech, and today the attention of researchers is focused on the problems of spontaneous speech, speaker independence, and robustness in noisy conditions. In this paper the authors present a survey of the specifics of the Russian language and of the methods and applied models for Russian speech recognition developed by organizations in Russia and abroad. The survey is based on the proceedings of recent conferences and on developers' websites.






A TECHNIQUE FOR CHOOSING EFFICIENT ACOUSTIC MODELING UNITS FOR LITHUANIAN CONTINUOUS SPEECH RECOGNITION

Darius Silingas (1), Sigita Laurinciukaite (2), Laimutis Telksnys (2)

(1) Vytautas Magnus University, LITHUANIA
(2) Institute of Mathematics and Informatics, LITHUANIA

Page 61

Abstract:
This paper presents a technique and experiments on choosing a set of mixed-duration modeling units that represent language-specific phonetic details, based on an analysis of the available training data. A case study for Lithuanian speech recognition is presented. The Lithuanian language exhibits the following major phonetic features: linguistic stress, consonant softness, vowel duration, and compound phones. Syllables or words could also be chosen as modeling units. Incorporating all linguistic features into the base phone set, or using syllable models, explodes the number of triphones needed for accurate acoustic modeling. Therefore, only those phones with phonetic features that have enough training samples are defined as separate phonetic units. Algorithms for forming a dynamic base phone set and for choosing the complexity of separate models are proposed, implemented, and validated in experiments. The experimental results indicate a significant improvement in recognition accuracy.






INFORMATION RETRIEVAL BASED ALGORITHM FOR EXTRA LARGE VOCABULARY SPEECH RECOGNITION

Valeriy Pylypenko

International Research/Training Center for Information Technologies and Systems, UKRAINE

Page 67

Abstract:
This paper presents a new two-pass algorithm for Extra Large Vocabulary (more than 1M words) Speech Recognition based on Information Retrieval (ELVIRS). The principle of this approach is to decompose the recognition process into two passes, where the first pass builds the word subset used for recognition in the second pass. With this approach, high performance for large-vocabulary speech recognition can be obtained.






MULTI-WORDS IN THE CZECH TV/RADIO NEWS TRANSCRIPTION SYSTEM

Jan Kolorenc, Jan Nouza, Petr Cerva

Department of Electronics & Signal Processing, Technical University of Liberec, CZECH REPUBLIC

Page 70

Abstract:
This article explores the influence of multi-words (compound words) on the continuous speech recognition component of our Czech TV/radio news transcription system. The main aim is to support the recognition of short words, which are often misrecognized. Short words are joined with frequent longer words into multi-words. Two measures for multi-word selection are tested: the first is based on pointwise mutual information, the second on occurrence frequency. The occurrence-frequency-based measure outperformed the pointwise-mutual-information-based one. Adding multi-words increased the performance of the continuous speech recognition system and reduced the misrecognition of short words.






USING SPEECH SYNTHESIS IN KEYWORD SPOTTING

Vitali Kiselov, Andre Talanov, Ivan Tampel, Marina Tatarnikova

Speech Technology Center, RUSSIA

Page 75

Abstract:
The technology for unlimited-vocabulary automatic keyword spotting in spontaneous Russian speech is presented in this paper. We propose a novel speech database search system based on the ideas of word pattern recognition and speech synthesis. Keywords to be searched are input in text form, and the corresponding speech signals are synthesized by a text-to-speech (TTS) system. These signals are used as training material for a recognizer based on the dynamic programming approach. Evaluation of the system was performed for telephone and microphone channels. In the latter case, for a limited number of keywords searched simultaneously, we obtain an 83% hit rate without speaker adaptation (at a 9.3% false alarm rate) and a 99% hit rate with speaker adaptation (at 17.5% false alarms). The hit rate in a noisy telephone channel is 78.5%, with a false alarm rate of 60%.






FAST KEYWORD SPOTTING FROM ACOUSTIC BASEFORMS

Lubos Smidl, Josef Psutka, Ondrej Obracanik, Petr Podany, Jiri Zahradil

University of West Bohemia, CZECH REPUBLIC

Page 79

Abstract:
This paper describes the filler model used in our keyword spotting system, which is implemented as a phoneme recognizer. The filler model produces a sequence of phones corresponding to the input utterance and can thus be used as a phoneme recognizer. The dependency of accuracy and correctness on the filler model's back-loop penalty, as well as the influence of the filler model's language model, are shown. The output of the phoneme recognizer can be used for keyword spotting, and two modifications of the basic DTW algorithm are presented. The advantage of this keyword spotting approach is the possibility of two-pass detection: the first (slow) pass is done only once, while the second (fast) pass is done on request when searching for a keyword and uses only the sequence of phones generated by the first pass. All tests are performed on a telephone speech corpus.






BUILDING ACOUSTIC MODELS FOR A LARGE VOCABULARY CONTINUOUS SPEECH RECOGNIZER FOR RUSSIAN

Marina Tatarnikova, Ivan Tampel, Ilya Oparin, Yuri Khokhlov

Speech Technology Center, RUSSIA

Page 83

Abstract:
Different types of acoustic models created at Speech Technology Center are evaluated in this paper. Our main goal was to test how well those models work and to choose one model for implementation in a large vocabulary continuous speech recognition (LVCSR) system for Russian that is currently under development. Context-independent discrete and continuous models, as well as context-dependent continuous models, were built and evaluated on an isolated word recognition task. The results obtained with the context-dependent continuous model prove its consistency and show that it can be used for acoustic modelling in a large vocabulary speech recognizer.






SURVEY OF THE SPEECH RECOGNITION TECHNIQUES FOR MOBILE DEVICES

Dmitry Zaykovskiy

Department of Information Technology, University of Ulm, GERMANY

Page 88

Abstract:
This paper presents an overview of different approaches for providing automatic speech recognition (ASR) technology to mobile users. Three principal system architectures, distinguished by how they employ the wireless communication link, are analyzed: Embedded Speech Recognition, Network Speech Recognition (NSR), and Distributed Speech Recognition (DSR). An overview of the solutions that have by now become standards is given, together with a critical analysis of the latest developments in the field of speech recognition in mobile environments. Open issues and the pros and cons of the different methodologies and techniques are highlighted. Special emphasis is placed on the constraints and limitations that ASR applications encounter under the different architectures.






PRIOR OF THE LEXICAL MODEL IN THE HIDDEN VECTOR STATE PARSER

Filip Jurcicek (1), Jiri Zahradil (2), Lubos Smidl (1)

(1) Center of Applied Cybernetics, University of West Bohemia in Pilsen, CZECH REPUBLIC
(2) Department of Cybernetics, University of West Bohemia in Pilsen, CZECH REPUBLIC

Page 94

Abstract:
This paper describes an implementation of a statistical semantic parser for a closed domain with a limited amount of training data. We implemented the hidden vector state model, which we present as a structural discrimination of a flat-concept model. The model was implemented in a graphical modeling toolkit. We introduced a concept insertion penalty into the hidden vector state model as part of the pattern recognition approach. In our model, linear interpolation was used both to deal with words unseen in the training data (unobserved input events) and to smooth the probabilities of the model. We evaluated the implementation of the concept insertion penalty on a closed-domain human-human train timetable dialogue corpus and found that the concept insertion penalty was indispensable in our implementation of the hidden vector state model on this corpus. Accuracy over the baseline system increased from 33.7% to 55.4%.






ISOLATED SENTENCES RECOGNITION USING VECTOR QUANTIZATION AND NEURAL NETWORKS

Paola Tellez, Jesus Savage

University of Mexico, UNAM, MEXICO

Page 100

Abstract:
This paper shows a way to combine speech recognition techniques based on Vector Quantization (VQ) with Neural Networks (NN). Vector Quantization has proved its usefulness for isolated word recognition, but it is also useful for isolated sentence recognition. One way to improve the performance of this technique is to add an NN block that helps the performance of the VQ recognizer.






IMPROVED TRANSCRIPTION OF CZECH PARLIAMENT SPEECHES BY ACOUSTIC AND LANGUAGE MODEL ADAPTATION

Petr Cerva, Jan Nouza, Jan Kolorenc, Petr David

Technical University of Liberec, CZECH REPUBLIC

Page 103

Abstract:
The aim of this work is to improve the accuracy of our spoken broadcast transcription system on the task of recognizing Czech parliament speeches. To achieve this goal, we propose several approaches for adapting both the acoustic and the language models of our system: a new two-step unsupervised speaker adaptation strategy is presented to improve the former, while the latter is created from a text corpus properly mixed from both general data (2.6 GB of Czech newspaper texts) and domain-specific data (181 MB of parliament speeches). Our experimental results show that the combination of both adaptation approaches leads to a nearly 30% relative reduction in WER in comparison with the baseline speaker-independent (SI) system operating with a general language model.





Session: Speaker Identification
Chair: Sergey Koval, Speech Technology Center, Russia



THE WCL-1 SYSTEM IN THE 2003 NIST SPEAKER RECOGNITION EVALUATION AND 2003 NFI/TNO FORENSIC SPEAKER RECOGNITION EVALUATION

Todor Ganchev, Nikos Fakotakis, George Kokkinakis

Wire Communications Laboratory, University of Patras, GREECE

Page 109

Abstract:
In the present work we discuss the results that our speaker verification system, WCL-1, obtained in the 2003 NFI/TNO Forensic Speaker Recognition Evaluation. These results, together with the ones obtained in the 2003 NIST Speaker Recognition Evaluation, provide an opportunity for an in-depth analysis of the various aspects of real-world application of speaker recognition technology. Based on a detailed analysis of the speaker verification performance obtained in the different subtasks, we identify the virtues and disadvantages of the WCL-1 system and its potential areas of use.






ALGBICMAP–VOICED: AN ALGORITHM FOR SPEAKER CHANGE DETECTION

Petra Zochova, Vlasta Radova

Department of Cybernetics, University of West Bohemia, CZECH REPUBLIC

Page 115

Abstract:
The paper deals with the problem of automatic speaker change detection. A metric-based algorithm, called the AlgBICMap algorithm, was proposed in [1]. AlgBICMap-Voiced is a modification of that algorithm that enables us to decrease the number of false alarms. The algorithm creates a map of the BIC (Bayesian Information Criterion), which enables us to efficiently detect the speech regions of individual speakers. In comparison with a typical metric-based approach, the advantage of the proposed algorithm is its robustness, because it uses more information than that provided by adjacent windows alone.






TOWARDS A MULTILINGUAL APPROACH ON SPEAKER CLASSIFICATION

Christian Muller, Michael Feld

German Research Center for Artificial Intelligence, GERMANY

Page 120

Abstract:
In our previous work, we described the AGENDER speaker classification technology, a two-layered approach that primarily recognizes the speaker's age and gender but also incorporates novel domain-independent aspects such as emotion or cognitive load. Due to its classification accuracy, its flexible way of fusing the results of multiple classifiers, and its multi-platform architecture, the project is regarded as very successful, attracting, for example, strong interest from the telecommunications industry. Today, one of AGENDER's major drawbacks is that it has not been exhaustively investigated whether the approach is language-independent. This paper outlines our attempt to overcome this drawback. In particular, we present a framework for a multilingual speaker classification system based on an underlying language identification module.






FORMANTS MATCHING AS A ROBUST METHOD FOR FORENSIC SPEAKER IDENTIFICATION

Sergey Koval

Speech Technology Center, RUSSIA

Page 125

Abstract:
The "formants matching" method for robust speaker identification is described. It is a spectral-analysis-based method that differs from traditional approaches in that it compares articulatorily similar events in the two recordings rather than the same phonemes. Searching for coincidences and differences in the uncontrolled movements of the speech production organs, as reflected in the tracks and dynamics of the higher formants, makes this method especially robust to noisy audio, different languages, and short voice samples. The method shows high reliability when applied to forensic speaker identification. A reduced automatic realization of the method gives 1.2% EER for text-dependent voice samples of 3 seconds duration and 8% EER for text-independent, low-quality PSTN voice samples of 96 seconds.





Session: Applied and Dialogue Systems
Chair: Boris Sokolov, SPIIRAS, Russia



A DIALOG SYSTEM FOR THE DIHANA PROJECT

David Griol, Francisco Torres, Lluis Hurtado, Sergio Grau, Fernando Garcia, Emilio Sanchis, Encarna Segarra

Universidad Politecnica de Valencia, SPAIN

Page 131

Abstract:
We present in this paper a dialog system developed within the DIHANA project. The system consists of seven modules: an automatic speech recognizer, a language understanding module, a dialog manager, a database query manager, a natural language answer generator, a text-to-speech converter, and a central communication manager. For the implementation of the system, we built an architecture based on the client-server paradigm, in which the central communication manager works as the client and the other modules work as servers.






DEVELOPMENT AND EVALUATION OF A STOCHASTIC UNDERSTANDING MODULE

Francisco Torres, Emilio Sanchis, Encarna Segarra

Universidad Politecnica de Valencia, SPAIN

Page 137

Abstract:
We present a natural language understanding module for a spoken dialog system that tackles a restricted-domain task (queries about timetables, prices, and services provided by a Spanish railway information system). This understanding module is based on stochastic models very close to n-gram models; we use models of variable-length sequences that contain both words and categories. After rewriting the user sentences to substitute attribute values with category labels, the transduction is made in a single pass, generating the frames (the semantic representation of the user sentences) without using any intermediate semantic language. We report an evaluation of this new understanding module.






EXAMPLES OF LITHUANIAN VOICE DIALOGUE APPLICATIONS

Algimantas Rudzionis (1), Kastytis Ratkevicius (1), Vytautas Rudzionis (2), Rytis Maskeliunas (1)

(1) Kaunas University of Technology, LITHUANIA
(2) Vilnius University, Kaunas Faculty, LITHUANIA

Page 143

Abstract:
This paper presents several speech technology demonstrations developed to show the potential of speech technologies. All these applications must comply with emerging voice-technology standards (SALT and VoiceXML) and use software platforms such as Microsoft Speech Server or IBM WebSphere in order to achieve the necessary level of compatibility with other applications. Since these platforms do not include Lithuanian text-to-speech synthesis or speech recognition engines, proprietary speech processing modules were developed and matched to the chosen standards and platforms. These demos could serve as a tool for evaluating speech technology capabilities by telecommunication companies, other potential business customers, or representatives of governmental organizations. They could also be used as an educational resource in the learning process.






BUILDING A SLOVENIAN-ENGLISH LANGUAGE PAIR SPEECH-TO-SPEECH TRANSLATION SYSTEM

Jerneja Zganec Gros (1), France Mihelic (2), Mario Zganec (1)

(1) Alpineon RTD, SLOVENIA
(2) Faculty of Electrical Engineering, University of Ljubljana, SLOVENIA

Page 147

Abstract:
This paper presents the design phases of the VoiceTRAN Communicator, which combines speech recognition, machine translation, and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the contribution was to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slovenian-English language pair. The project represents a joint collaboration between several Slovenian research organizations that are active in human language technologies. We provide an overview of the task and describe the system architecture and the individual servers. We further describe the language resources that were used and developed within the project, and conclude the paper with plans for the evaluation of the VoiceTRAN Communicator.






AUTOMATIC GENERATION OF GENERAL SPEECH USER INTERFACE

Dongyi Song

Department of Informatics, Media Informatics Group, Ludwig-Maximilians-Universitat Munchen, GERMANY

Page 152

Abstract:
This paper describes a novel approach to generating a general speech user interface to different applications by automatically combining the applications' existing speech user interfaces. A general speech user interface enables the user to access different applications via speech simultaneously, and several multi-application dialogue systems that enable such an interface already exist. The key issue in constructing such an interface is how the different applications should be integrated. Most existing multi-domain dialogue systems integrate applications at the level of the dialogue manager, which manages the speech interaction between the user and the application; these architectures require explicit domain switching. An improved approach proposed by Bui et al. [1] brings the applications together at the level of the dialogue specification, which describes an application for a dialogue manager: single-application dialogue specifications are combined into an application hierarchy, so that transparent switching between all integrated applications is allowed. However, the interoperability problems between applications, such as task sharing and information sharing, are left as future work. The approach presented in this paper solves three problems of general speech user interfaces, namely transparent application switching, task sharing, and information sharing, by automatically merging the dialogue specifications of different applications into a unified dialogue specification that provides the information a multi-application dialogue system needs in order to offer a general speech user interface to the different applications.






QUALITY AND QUANTITY ESTIMATION AND ANALYSIS OF MULTIMODAL SYSTEMS FOR HUMAN-COMPUTER INTERACTION

S. Potryasaev, Boris Sokolov, Rafael Yusupov

St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, RUSSIA

Page 158

Abstract:
One of the most important factors of the scientific and technological revolution is the introduction of automatic and informational systems (AS and IS) in all fields of human activity [1]. Both in industrial manufacturing and in the informational sphere, the role and significance of the notion of quality are constantly growing and developing under the influence of novel technologies and market needs. In recent decades, the problems connected with testing the quality of products have become the subject of intensive investigation in a new scientific branch, quality science. One of the main branches of this science is qualimetry, which is devoted to developing the methodological foundations for the quantitative estimation of product quality.





Session: Poster & Demo Session
Chair: Adil Timofeev, SPIIRAS, Russia



AUTOMATED DETECTION OF SEMANTIC CONNECTIONS IN THE TEXT SUBJECT ORGANIZATION

Irina Nikolaeva

Moscow State Linguistic University, RUSSIA

Page 171

Abstract:
This paper reviews different semantic connections among words and terms that are used as a tool for subject organization. We see the most important task as defining the semantic connections that are used to organize a connected text. We assume that automated theme detection will increase the efficiency of processing.






DEVELOPMENT OF MAN-MACHINE INTERFACES AND VIRTUAL REALITY MEANS FOR INTEGRATED MEDICAL SYSTEMS

Adil Timofeev, Igor Gulenko, V. Andreev, Svetlana Chernakova (1), M. Litvinov (2)

(1) Saint-Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, RUSSIA
(2) Baltic State Technical University "Voenmekh", RUSSIA

Page 175

Abstract:
The paper discusses the results of research on the development and intellectualization of man-machine interfaces based on virtual reality means for integrated medical systems, which include the medical personnel, computers, and medical mechatronic systems (robots).






A NATURAL LANGUAGE INTERFACE TO A THEATER INFORMATION DATABASE

Margus Treumuth

University of Tartu, ESTONIA

Page 179

Abstract:
The development of a natural language dialogue system as an interface to a theater information database is a joint research project of the University of Tartu (Estonia) and Tallinn University of Technology (Estonia). The underlying database contains information about theater performances in a certain theater or city. The dialogue system can be used to ask for information about performances, using either spoken or typewritten natural language in Estonian. The dialogue management module was developed by the author at the University of Tartu; the speech recognition and speech generation modules were added by Tallinn University of Technology. This article discusses the development of the dialogue module, but not the speech recognition and generation modules.






TEMPORAL DATA ON CONSONANTS IN DIFFERENT TYPES OF STANDARD RUSSIAN SPEECH

Svetlana Tananaiko, Ludmila Vasilieva

Department of Phonetics, Saint-Petersburg State University, RUSSIA

Page 182

Abstract:
The article presents the results of a systematic study of consonant duration and its dependence on various phonetic factors in standard realizations of Russian spontaneous speech and reading. The data described were obtained for RFBR project No. 04-06-80111, “Spontaneous Speech as a Source of Pronunciation Standard Changes”. It may be concluded that consonant duration is not significantly determined by phonetic context or by position in an intonation unit. As for the dependence on age and gender, in almost all age groups consonants are longer in women's speech than in men's. However, consonant duration depends most strongly on the consonant's distinctive features. The durations of consonants were distributed differently among speakers, which can be explained by the speakers' individual characteristics.

   PDF File >>





A METHOD FOR STUDYING OF THE INFORMATIVE ATTRIBUTES OF SPEECH SIGNALS IN THE FREQUENCY DOMAIN

Alexander Kolokolov, Marianne Pavlova

Institute of Control Sciences, Russian Academy of Sciences, Moscow, RUSSIA

Page 188

Abstract:
A method, which is a modification of the analysis-synthesis procedure, is proposed for studying the informative attributes of speech signals. It is based on editing the dynamic spectrogram of the speech signal and subsequently restoring it in the time domain. Some preliminary results of using the proposed method are presented.

   PDF File >>





PROCESSING OF CUSTOMER’S REQUESTS: ANALYSIS OF ESTONIAN DIALOGUE CORPUS

Mare Koit (1), Maret Valdisoo (1), Tiit Hennoste (2), Olga Gerassimenko (1), Riina Kasterpalu (1), Andriela Raabis (1), Krista Strandson (1)

(1) University of Tartu, ESTONIA
(2) University of Helsinki, FINLAND

Page 193

Abstract:
This paper describes the processing of customers' requests by an information operator. The study is based on the Estonian dialogue corpus. Our further aim is to develop a dialogue system that interacts with a user in Estonian and processes the user's requests automatically. The corpus analysis demonstrates that a number of linguistic cues can be found which can be used to recognize requests automatically. A request frame will then be filled in by the dialogue system, and a semantic grammar will be used to give information to the customer or to initiate a subdialogue.

   PDF File >>





ASSISTIVE MULTIMODAL INTERFACE FOR MEDICAL APPLICATIONS

Svetlana Chernakova, Alexander Nechaev, Alexey Karpov, Andrey Ronzhin

Saint-Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, RUSSIA

Page 199

Abstract:
The paper presents the results of research on an assistive multimodal interface (MMI) for assisting doctors during surgery and in other medical applications. The novel MMI design is an “automatic assistant” for multimodal control of medical equipment during surgery, when a doctor operates with sterile hands and manipulates surgical instruments. An automatic Speech Recognition Device (SRD) combined with a Head-tracking Pointing Device (HPD) provides a reliable and natural way to control medical equipment or a computer. The ability to point at an image zone of interest and to view 3D images controlled by head movements and the operator's speech adds new capabilities to the MMI, especially for a Stereo Visualization Device (SVD) for medical images (computer tomography, endoscopy, thermography, ultrasonic, and X-ray images) using stereo displays or stereo glasses. The main goal of the MMI design is to improve the control of medical equipment, including medical computer visualization systems, for diagnostic and surgical operations and for the training of doctors and students.

   PDF File >>





AN INFORMATION THEORETIC APPROACH TO SYSTEM IDENTIFICATION VIA INPUT/OUTPUT SIGNAL PROCESSING

Kirill Chernyshov

V.A. Trapeznikov Institute of Control Sciences, RUSSIA

Page 204

Abstract:
The aim of the paper is to present a conceptual approach to the identification of nonlinear stochastic systems based on information measures of dependence. An identification problem statement using the information criterion under rather general conditions is proposed. It is based on a parameterized description of the system model under study, combined with the minimum relative entropy method, to derive the mutual information of the system's and the model's output variables. Such a problem statement finally leads to a finite-dimensional optimization problem. As a result, a constructive procedure for identifying the model parameters is derived. It does not involve unrealistic a priori assumptions that distort the essence of the initial identification problem statement, such as those presented in some of the literature sources reviewed in the present paper.

   PDF File >>





THE USE OF DYNAMIC VOCAL TRACT MODEL FOR CONSTRUCTING THE FORMANT STRUCTURE OF THE VOWELS

Vera Evdokimova

Department of Phonetics, Saint Petersburg State University, RUSSIA

Page 210

Abstract:
This paper discusses a new method of constructing a dynamic vocal tract model. The model consists of two dynamic parts: the voice source and the filter component. Each part has its own dynamic features and resonant frequencies, and their interaction leads to short-term phonetic effects. A method of obtaining the frequency characteristic of the filter component by processing real speech data is suggested. It allows the formant structure of vowels and their variations to be constructed. The formant structure obtained with the new method is demonstrated on a realization of the stressed vowel /a/.

   PDF File >>





EMBEDDING BINARY DATA TO AUDIO STREAMS BASED ON DISCRETE WAVELET TRANSFORM

Dmitry Rublev, Vladimir Fedorov, Oleg Makarevich

Taganrog State University of Radioengineering, RUSSIA

Page 215

Abstract:
This work presents a new steganographic method for hiding information in digital audio signals based on the wavelet transform. Embedding is performed in perceptually important areas, which provides high resistance to active and passive attacks while remaining flexible and adaptive to stego-containers. A method for extracting the information-hiding space based on the Mallat algorithm is described, and both basic and more advanced modulation techniques are proposed. The achieved results, including bit error rate and signal distortion, as well as future directions for improving robustness and security, are discussed.
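As a rough illustration of the general technique named in the abstract (not the authors' implementation), the sketch below performs one level of a Haar DWT via the Mallat filter bank and embeds payload bits by parity quantization of the detail coefficients; the Haar basis, the quantization `step`, and the embedding rule are assumptions made for this example only.

```python
# Illustrative sketch, NOT the paper's method: one-level Haar DWT plus
# parity-quantization embedding of bits into detail coefficients.

def haar_dwt(x):
    """One level of the Haar wavelet transform (even-length input assumed)."""
    approx = [(x[2 * i] + x[2 * i + 1]) / 2.0 for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / 2.0 for i in range(len(x) // 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave reconstructed sample pairs."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([a + d, a - d])
    return x

def embed_bits(detail, bits, step=0.01):
    """Snap each detail coefficient to an even/odd multiple of `step`
    so that its parity encodes the corresponding payload bit."""
    out = []
    for c, b in zip(detail, bits):
        q = round(c / step)
        if q % 2 != b:  # force parity to match the bit, staying near c
            q += 1 if c / step >= q else -1
        out.append(q * step)
    out.extend(detail[len(bits):])  # leave the remaining coefficients alone
    return out

def extract_bits(detail, n, step=0.01):
    """Recover n bits from the parities of the quantized coefficients."""
    return [round(c / step) % 2 for c in detail[:n]]
```

A real scheme would additionally restrict embedding to perceptually important subbands and survive re-quantization noise; this sketch only shows the transform-domain embed/extract cycle.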

   PDF File >>





DEVELOPMENT OF TRANSLATION TOOLS FROM THE NATURAL LANGUAGE ON SIGN LANGUAGE OF DEAF PEOPLE

Alexandr Voskresenskij (1), Igor Gulenko (2), G. Khakhalin (3)

(1) Boarding school No. 101 for deaf children, RUSSIA
(2) St. Petersburg Institute for Informatics and Automation, RUSSIA
(3) Scientific and Research Center NICEVT, RUSSIA

Page 221

Abstract:
The paper argues for the necessity of developing a program for translating natural-language text into the sign language used by deaf people, and describes foreign analogues. The problem of developing a specialized interface for deaf people's use of communication tools is considered, and the novelty of the approach used is described. It is argued that this approach will not only make it possible to create a text-to-sign translation program, but will also be useful for solving problems of natural verbal language processing.

   PDF File >>





A BILINGUAL TRANSLATION SYSTEM IN FOREIGN LANGUAGE TEACHING

Viatcheslav Yatsko, Maxim Kozlov

Katanov State University of Khakasia, RUSSIA

Page 226

Abstract:
The paper suggests a classification of computer systems used in foreign language education and focuses on the description of TITE – a distributed bilingual translation system designed for out-of-class activities. Detailed description of the architecture and functions of TITE is given. The paper emphasizes a need to develop curriculum–integrated educational computer systems and formulates some requirements for such systems.

   PDF File >>





SOME ASPECTS OF HARDWARE IMPLEMENTATION OF DIGITAL FILTERS

Oleg Paulin

Department of Computer Systems, Odessa National Polytechnic University, UKRAINE

Page 232

Abstract:
The need for compressors that reduce the multi-row codes (MRC) arising in digital filters to a single row is illustrated. Variants of MRC compression are considered, including the standard operational elements of Wallace and Santoro. We have also developed a three-operand adder with parallel carry using special Glasser coding (MGA). The advantage of the MGA for compressing three-row codes is demonstrated.

   PDF File >>





LINGUISTIC MODELING BY FOURIER-HOLOGRAPHY TECHNIQUE: IMPLEMENTATION OF NON-MONOTONIC SEMANTICS

Alexander Alekseev, Alexander Pavlov

St. Petersburg State University for Information Technologies, Mechanics, and Optics, RUSSIA

Page 237

Abstract:
Linguistic Modeling is an approach that allows human-computer interaction to be established using natural-like language. We use the Fourier-holography technique to implement Linguistic Modeling in the framework of the Neuro-Fuzzy approach, paying particular attention to the implementation of non-monotonic semantics. We develop a theoretical model and verify it experimentally.

   PDF File >>





OBJECTIVE METHOD OF SPEECH SIGNAL QUALITY ESTIMATION

Valentin Smirnov (1), Mikhail Gusev (2)

(1) Department of Phonetics and Foreign Languages Teaching Methodology, Saint-Petersburg State University, RUSSIA
(2) Department of Speech Technologies, LLC SPF “Bercut", RUSSIA

Page 242

Abstract:
This paper concerns a method for the objective estimation of speech signal quality. Based on an analysis of the properties of known quality-estimation methods, the necessity of developing new methods is argued. The possibility of a more comprehensive account of hearing properties and speech processing is considered. Results of applying the proposed method to the quality estimation of standard voice coders are given.

   PDF File >>





THE PROCEDURES OF THE NOISE CLIPPING IN THE SIGNAL, BASED ON FOURIER- AND WAVELET-TRANSFORM AND ON CLASSIFICATION OF SOUNDS OF SPEECH

Tatyana Yermolenko, Ujin Fedorov

Institute of Artificial Intelligence Problems, Donetsk, UKRAINE

Page 245

Abstract:
Preliminary removal of noise from a signal plays an important role in speech recognition. To solve this problem, the article proposes noise-clipping procedures based on the Fourier and wavelet transforms and on the classification of speech sounds.

   PDF File >>





BIOLOGIC FEEDBACK FORMATION BY VOCAL REHABILITATION

Lydia Balatskaya (1), Vladimir Bondarenko (2), Eugen Choynzonov (1), Anton Konev (2), Roman Mescheriakov (2)

(1) Scientific-Research Institute of Oncology of Tomsk Science Center of Siberian Branch of Russian Academy of Medical Science, RUSSIA
(2) Tomsk State University of Control Systems and Radioelectronics, RUSSIA

Page 251

Abstract:
The basic approaches and methods of forming biological feedback in vocal rehabilitation are considered in this work. The base method is a multilevel approach to speech rehabilitation that includes the process of speech restoration. The basic methods are built on models of the human auditory system, as well as on the requirements set by the attending physician and the patient.

   PDF File >>




Session: Speech Synthesis
Chairs: Ruediger Hoffmann, Technical University of Dresden, Germany
Boris Lobanov, United Institute of Informatics Problems of NASB, Belarus



MEL-LSP PARAMETERIZATION FOR HMM-BASED SPEECH SYNTHESIS

Naotoshi Nakatani, Kazumasa Yamamoto, Hiroshi Matsumoto

Faculty of Engineering, Shinshu University, JAPAN

Page 261

Abstract:
In HMM-based speech synthesis using mel-cepstral parameters, it has been observed that formant peaks tend to be flattened in the synthetic speech. To alleviate this problem, this paper investigates Mel-LSP (Line Spectral Pairs) based speech synthesis. First, using vowel spectra synthesized from four formants, it is shown that the formant flattening for the centroid of mel-LSP frequencies is much less than that for mel-cepstra. After reviewing the closed form of Mel-LPC analysis, a structure for the Mel-LSP synthesis filter is presented. On the basis of this mel-LSP parameterization, mora HMMs are trained using the mel-LSP parameters and short sentences are synthesized from them. The quality of this synthetic speech is compared with that of speech synthesized by mel-cepstrum-based HMMs. In A-B preference tests, Mel-LSP-based synthetic speech was chosen 61% of the time over the mel-cepstrum-based one.

   PDF File >>





APPLICATION OF VOWEL ALLOPHONES TRANSFORMS FOR SENTENCE INTONATION IN POLISH TTS SYSTEM

Krzysztof Popowski, Bozena Piorkowska, Edward Szpilewski

Institute of Computer Sciences, University of Bialystok, POLAND

Page 265

Abstract:
The article discusses sentence intonation for the Polish language based on speech synthesis from an allophone database. The paper presents signal-processing techniques used for the transformation of allophone speech signals. The introduced methods, combined, make it possible to obtain intonation effects in speech synthesis. The presented application and approach allow building a prosody database for further work, including automatic sentence intonation. The results of the research have been applied to text-to-speech synthesis systems for the Polish language.

   PDF File >>





SPECTRAL DISTANCE COSTS FOR MULTILINGUAL UNIT SELECTION IN SPEECH SYNTHESIS

Hamurabi Gamboa Rosales, Oliver Jokisch, Ruediger Hoffmann

Technische Universitat Dresden, GERMANY

Page 270

Abstract:
The unit-selection module in concatenative TTS systems plays an important role in corpus synthesis. Its main goal is to minimize a combination of target and concatenation costs for a given phrase. We measure the concatenation cost through perceived spectral discontinuity, based on spectral property measures such as line spectral frequencies (LSFs), multiple centroid analysis (MCA), and mel-frequency cepstral coefficients (MFCCs). To determine and evaluate the relationship between our spectral distance measures and human perception of distortion, we report a perceptual experiment measuring the correlation between human mismatch perception and the spectral distance measures of concatenation costs in the multilingual concatenative TTS system Papageno, testing the method for English, German, and Spanish.
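As a hedged sketch of how a spectral-distance concatenation cost works in general (illustrative only; the specific measures and weightings used with Papageno are not reproduced here), the cost of a join can be taken as the Euclidean distance between the feature frame ending the left unit and the frame starting the right unit:

```python
# Illustrative sketch of a generic spectral-distance concatenation cost,
# not the cost function of any particular TTS system.
import math

def spectral_distance(frame_a, frame_b):
    """Euclidean distance between two spectral feature vectors
    (e.g. MFCC or LSF frames of equal length)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))

def concatenation_cost(left_unit, right_unit):
    """Cost of joining two candidate units, each given as a list of
    feature frames: the spectral distance across the join boundary."""
    return spectral_distance(left_unit[-1], right_unit[0])
```

A unit-selection search would accumulate such costs (together with target costs) along each candidate unit sequence and choose the sequence with the minimum total.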

   PDF File >>





DEVELOPMENT OF MULTI-VOICE AND MULTI-LANGUAGE TTS SYNTHESIZER (LANGUAGES: BELARUSSIAN, POLISH, RUSSIAN)

Boris Lobanov, Liliya Tsirulnik

United Institute of Informatics Problems of National Academy of Science, Minsk, BELARUS

Page 274

Abstract:
The paper describes some results of research aimed at filling the gap in introducing and promoting computerized speech technology for Slavonic languages, in particular a technology of TTS synthesis for Belarusian, Polish, and Russian. A typological analysis of the peculiarities of the phonemic and allophonic systems of the Belarusian, Polish, and Russian languages is given. Based on the results of this study, an approach to building a unified phonetic-acoustic database for multi-language Slavonic TTS synthesis is proposed. The results of a quantitative analysis of pitch contours for some Slavonic languages are presented, along with the peculiarities of speakers' individual intonation. The general structure of the multi-language, multi-voice TTS system is described.

   PDF File >>





LOW RESOURCE TTS SYNTHESIS BASED ON CEPSTRAL FILTER WITH PHASE RANDOMIZED EXCITATION

Guntram Strecha, Matthias Eichner

Institut fur Akustik und Sprachkommunikation, Technische Universitat Dresden, GERMANY

Page 284

Abstract:
In this paper we present the acoustic synthesis of a low-resource Text-To-Speech (TTS) system based on a 7th-order cepstral filter. The excitation signal is designed in the frequency domain by a two-parameter model, which can generate the excitation for both voiced and unvoiced segments. The sets of filter coefficients represent the speech units and are stored in compressed form in the inventory of the TTS system. An inventory normally used by a concatenative synthesis system is transformed to obtain the inventory for the proposed system. The compression method consists of a lifter and an interpolation technique to describe the temporal progression of the cepstral features. Additional spectral warping is applied to favor lower frequency components in order to preserve the spectral structure in the compression step. This warping method also makes it possible to change the voice characteristics of the synthesized speech without additional computational or algorithmic effort. We integrated the proposed synthesis method into our multilingual TTS system and achieved high-quality speech synthesis at sampling rates of up to 32 kHz, with an average bit rate of 14 kbit/s and an inventory compression rate of 36:1.

   PDF File >>





AN APPROACH TO THE DECODER COMPLEXITY REDUCTION IN WAVEFORM INTERPOLATION SPEECH CODING

Kyung Jin Byun, Ik Soo Eo, Hee Bum Jeong (1), Minsoo Hahn (2)

(1) Electronics and Telecommunications Research Institute, KOREA
(2) Information and Communications University, KOREA

Page 288

Abstract:
Since current TTS synthesizers are mostly based on a technique known as synthesis by concatenation, the implementation of a high-quality TTS system requires a large amount of storage for the speech segments. To compress the database, speech coders are an efficient solution. Waveform Interpolation (WI) has been shown to be an efficient speech coding algorithm that provides high-quality speech at low bit rates; however, its applications are constrained by high computational complexity. This paper describes a complexity reduction method for a WI coder used to compress the TTS database. The proposed idea reduces complexity by removing the realignment process from the decoder. Since the realignment factor obtained in the encoder must then be transmitted to the decoder in order to realign the characteristic waveforms, the overall bit rate increases slightly. The new approach reduces the decoder complexity by 20 percent.

   PDF File >>





ADAPTATION OF THE AHOTTS TEXT TO SPEECH SYSTEM TO PDA PLATFORMS

Jon Sanchez, Iker Luengo, Eva Navas, Inma Hernaez

University of the Basque Country, SPAIN

Page 292

Abstract:
This paper presents the work carried out to adapt a Basque-language Text To Speech (TTS) system to a mobile device with limited resources. The aim is to make it possible to use the AhoTTS conversion system of the UPV/EHU Aholab group on a Personal Digital Assistant (PDA), and to test the system's performance in several respects, such as sound sample generation times. The selected PDA is a Pocket PC running the Windows CE operating system. The converter has been compiled into a library that provides an Application Programming Interface (API) to applications. Some applications that use the created API are also described.

   PDF File >>





THREE GENERATIONS OF SPEECH SYNTHESIS SYSTEMS IN SLOVAKIA

Sakhia Darjaa, Milan Rusko, Marian Trnka

Institute of Informatics of the Slovak Academy of Sciences, Bratislava, SLOVAKIA

Page 297

Abstract:
A brief survey of speech synthesis research in Slovakia is presented, covering three generations of synthesizers developed at the Institute of Informatics. The Kempelen O.1 speech synthesizer, developed in 1989, was a memory-footprint-optimized system using a unique method of signal compression that preserved transients and synthesized the stable parts of phonemes. The Kempelen 1.x engine was based on the concatenation of pre-recorded diphones, with signal post-processing to implement intonation and rhythmic contours; some interesting features were added for commercial applications. Kempelen 2.x is based on unit selection. The design of its speech synthesis database is described, as well as the experience gained from designing and testing Kempelen 2.0. Kempelen 2.1 uses pre-selection of candidate elements based on a phonetic analysis of the orthoepic transcription of the text; acoustic aspects are taken into account in a second pass of the selection process.

   PDF File >>





MODELLING THE TEMPORAL STRUCTURE OF NEWSREADERS' SPEECH ON NEURAL NETWORKS FOR ESTONIAN TEXT-TO-SPEECH SYNTHESIS

Mark Fishel (1), Meelis Mihkla (2)

(1) University of Tartu, ESTONIA
(2) Institute of Estonian Language, Tallinn, ESTONIA

Page 303

Abstract:
Generation of natural-sounding synthetic speech from text requires precise control over the temporal structure of the speech flow. The present paper describes an attempt to replace the rule-based duration model hitherto used in Estonian text-to-speech synthesis with neural networks (NN). To this end, fluent speech of radio announcers and newsreaders was analysed and its temporal structure was modelled with neural networks. Analysis of pauses in extended material revealed that if a text is read at a normal speech rate, the pauses can be classified well enough for the results to be used in speech synthesis. For sound durations, certain characteristics of the phone context as well as certain syllable-level features were found to be the relevant input for the NN algorithm. For models of pause durations and positions, however, the prevalent features were variables characterizing text structure (punctuation marks and conjunctions).

   PDF File >>





REALIZATION OF PROSODIC CONTOURS IN SPEECH SYNTHESIS

Elena Karnevskaya

Department of English Phonetics, Minsk State Linguistic University, BELARUS

Page 307

Abstract:
The study under consideration aims at bringing to light some aspects of the prosodic organization of speech, namely those associated with the degree of cohesion between the adjacent elements of a speech stretch. The issue raised in the paper concerns variation in the degree of linking reflecting intra-clausal syntactical-semantic relations, i.e. relations between the accentual units as constituents of a prosodic contour. The problem is considered in the framework of prosodic modelling for multi-language speech synthesis.

   PDF File >>





THE QUALITY EVALUATION OF ALLOPHONE DATABASE FOR ENGLISH CONCATENATIVE SPEECH SYNTHESIS

Karina Evgrafova

Department of Phonetics, Saint Petersburg State University, RUSSIA

Page 311

Abstract:
The goal of this paper is to describe the procedure of perceptual tests which were aimed at evaluating the quality of allophonic database inventory for English concatenative speech synthesis. The main criteria of evaluation were the degree of naturalness and intelligibility of the resulting synthesized speech. The results of perceptual experiments with discussion are presented.

   PDF File >>





LINGUISTIC PROCESSOR TRAINING ON SPEAKER DATA FOR UNIT SELECTION TEXT-TO-SPEECH

Tetyana Lyudovyk

International Research/Training Center for Information Technologies and Systems, Kyiv, UKRAINE

Page 315

Abstract:
This paper describes an approach to synthesizing personalized speech that preserves not only the speaker's voice but also the speaker's pronunciation peculiarities. Personalization is achieved by means of pronunciation models trained on the speaker data contained in his or her speech database. Untrained models synthesize speech in a neutral, normative style. On the segmental level, a transcription model is used; on the prosodic level, models for phrasing, intonation, pauses, and phoneme duration are used. These prosodic models are derived from a comparative acoustic-phonetic study of different speakers' data contained in several speech corpora and databases. Personalization of the pronunciation models is carried out during offline training of the linguistic processor using the speech database annotation. In the online speech synthesis mode, the personalized pronunciation models are used by the linguistic processor to generate a speaker-specific target specification of the input text.

   PDF File >>




Session: Signal Processing and Feature Extraction
Chairs: Leon Rothkrantz, Delft University of Technology, The Netherlands
Masato Akagi, Advanced Institute of Science and Technology, Japan
Pavel Skrelin, St. Petersburg State University, Russia



REFINEMENT OF AN MTF-BASED SPEECH DEREVERBERATION METHOD USING AN OPTIMAL INVERSE-MTF FILTER

Masashi Unoki, Masato Toi, Masato Akagi

School of Information Science, JAIST, JAPAN

Page 323

Abstract:
We previously proposed a speech dereverberation method based on the modulation transfer function (MTF). This method consists of power envelope restoration and carrier regeneration processes, and reduces both the loss due to degraded power envelopes and the loss of speech intelligibility. In the power envelope restoration, however, whether adaptive time-frequency division provides the best representation is still uncertain and the improvement in restoration accuracy tends to lessen as the reverberation time drastically increases. In this paper, with regard to these issues, we explain how the power envelope restoration can be improved and show that power envelope inverse filtering can be redesigned as an optimal inverse MTF.

   PDF File >>





EXTRACTING PHASE FROM VOICED SPEECH

Jamie Taylor, Donald Bitzer, Robert Rodman, David McAllister

Department of Computer Science, North Carolina State University, UNITED STATES

Page 327

Abstract:
A technique is presented to extract a useful phase signal from voiced speech. The speech is divided into glottal pulses and spectral phase is computed for each glottal pulse. The phase signal is then corrected for a variety of defects. Techniques are also presented to improve the reliability of the extracted signal. Finally, the signal is filtered to remove portions which are still unreliable.

   PDF File >>





MELODIC CONTOUR ESTIMATION WITH B-SPLINE MODELS USING A MDL CRITERION

Damien Lolive, Nelly Barbot, Olivier Boeffard

IRISA / University of Rennes 1 – ENSSAT, FRANCE

Page 333

Abstract:
This article describes a new approach to estimating F0 curves using a B-spline model characterized by a knot sequence and associated control points. The free parameters of the model are the number of knots and their locations. The free-knot placement, which is an NP-hard problem, is done using a global MLE within a simulated-annealing strategy. The optimal number of knots is estimated using the MDL methodology. Three criteria are proposed: control points are first treated as integer values, then as real coefficients with fixed precision, and finally with variable precision. Experiments are conducted in a speech-processing context on a 7000-syllable French corpus. We show that the variable-precision criterion gives better results both in terms of RMS error (0.42 Hz) and in terms of reducing the number of B-spline degrees of freedom (63% of the full model).
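To make the model family concrete, here is a minimal, hedged sketch (not the paper's estimation code) of evaluating a B-spline curve from a knot sequence and control points via the Cox-de Boor recursion; the MDL-driven knot search and the simulated annealing are not reproduced.

```python
# Illustrative sketch of B-spline evaluation only; the knot-placement
# optimization described in the abstract is out of scope here.

def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: value at t of the i-th B-spline basis
    function of order k (degree k-1) over the given knot sequence."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k - 1] != knots[i]:  # guard against repeated knots (0/0)
        left = ((t - knots[i]) / (knots[i + k - 1] - knots[i])
                * bspline_basis(i, k - 1, t, knots))
    right = 0.0
    if knots[i + k] != knots[i + 1]:
        right = ((knots[i + k] - t) / (knots[i + k] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right

def bspline_eval(t, knots, control, k=4):
    """Curve value (e.g. an F0 estimate) at time t; cubic by default.
    Expects len(control) == len(knots) - k."""
    return sum(c * bspline_basis(i, k, t, knots) for i, c in enumerate(control))
```

With a clamped knot sequence the basis functions sum to one inside the domain, so constant control points reproduce a constant curve — a quick sanity check for any implementation.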

   PDF File >>





SUBSPACE-BASED SPEECH ENHANCEMENT WITH PERCEPTUAL FILTERBANK AND SNR-AWARE TECHNIQUE

Jia-Ching Wang, Hsiao-Ping Lee, Jhing-Fa Wang, Chung- Hsien Yang

Department of Electrical Engineering, National Cheng Kung University, TAIWAN

Page 339

Abstract:
In this paper, a new subspace-based speech enhancement algorithm is presented. First, we construct a perceptual filterbank from a psycho-acoustic model and incorporate it into the subspace-based enhancement approach; this filterbank is created through a five-level wavelet packet decomposition. Next, the prior SNR of each critical band is used to decide the attenuation factor of the optimal linear estimator. Three different types of in-car noise from the TAICAR database were used in the evaluation. The experimental results demonstrate that our approach outperforms conventional subspace and spectral-subtraction methods.

   PDF File >>





AN OVERCOMPLETE WDFT-BASED PERCEPTUALLY CONSTRAINED VARIABLE BIT RATE WIDEBAND SPEECH CODER WITH EMBEDDED NOISE REDUCTION SYSTEM

Michael Livshitz (1), Alexander Petrovsky (2)

(1) Computer Engineering Department, Belarusian State University of Informatics and Radioelectronics, BELARUS
(2) Department of Real-Time Systems, Bialystok Technical University, POLAND

Page 343

Abstract:
The paper considers the application of a speech enhancement system in a perceptually constrained variable-bit-rate wideband speech coder based on a multiband CELP algorithm with a perceptually monitored codebook structure (PCVBR). Both parts of the proposed speech coding system use the same overcomplete warped discrete Fourier transform (O-WDFT). The overcomplete WDFT basis is used to minimize the reconstruction error in the high-frequency range in the synthesis block of the noise reduction system and to provide a more accurate representation of high-band frequency components. The robustness of the coder with the embedded noise reduction system in the presence of noise is discussed.

   PDF File >>





BASIS PURSUIT DECOMPOSITION: AN ANALYSIS OF SPANISH WORDS

Fabiola M. Martinez-Licona, John Goddard-Close, Alma E. Martinez-Licona (1), Hugo L. Rufiner-DiPersia (2)

(1) Universidad Autonoma Metropolitana, MEXICO
(2) Universidad Nacional Entre Rios, ARGENTINA

Page 349

Abstract:
Time-frequency (TF) representations of speech signals are commonly used to visualize their dynamic behavior. However, the choice of basis functions employed in the TF representation has an important effect on the number of non-zero coefficients needed to represent the signal. Signals with different morphologies, such as vowels or fricatives, could benefit from using a different basis for each representation. Methods such as matching pursuit and basis pursuit (BP) have been introduced to find representations of signals using combined bases. One drawback of these techniques is that their computational demands and computing times can be far greater than those of more traditional methods. The present paper analyzes the performance of BP applied to Spanish words. In particular, early stopping times and their relationship to the number of coefficients found are studied. The quality of the reconstructed signals is also considered, both quantitatively and qualitatively.

   PDF File >>





STUDY OF SPEECH SYLLABLES USING LORENZ MODEL FOR NON-LINEAL ANALYSIS

Victor H. Tellez-Arrieta, Fabiola M. Martinez-Licona, Alma E. Martinez-Licona

Universidad Autonoma Metropolitana, MEXICO

Page 355

Abstract:
Nonlinear methods for signal analysis form a research area in which techniques try to overcome the limitations of linear techniques. Speech signals have characteristics that can be considered nonlinear, for example the high-frequency, short-duration components found in occlusive and fricative sounds. A study of nonlinear techniques applied to speech signals is presented; in particular, the Lorenz model is applied to Spanish syllables and their attractors are examined.
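For readers unfamiliar with the Lorenz model, the following sketch (an illustration with the classical parameter values, not the study's code) integrates the Lorenz equations with a simple Euler step to produce the kind of trajectory whose attractor geometry such analyses examine:

```python
# Illustrative sketch only: Euler integration of the Lorenz system
# with the classical parameters sigma=10, rho=28, beta=8/3.

def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One explicit Euler step of the Lorenz equations."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return (x + dt * dx, y + dt * dy, z + dt * dz)

def lorenz_trajectory(state=(1.0, 1.0, 1.0), dt=0.005, steps=5000):
    """Return the trajectory as a list of (x, y, z) points; for these
    parameters it settles onto the well-known butterfly attractor."""
    traj = [state]
    for _ in range(steps):
        state = lorenz_step(state, dt)
        traj.append(state)
    return traj
```

Embedding a speech syllable's samples in a comparable phase space and comparing the resulting attractor with model attractors like this one is the general idea behind such nonlinear analyses.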

   PDF File >>





SONORITY MEASURE FOR AUTOMATIC SPEECH RECOGNITION

Daniil Kocharov

Department of Phonetics, Saint Petersburg State University, RUSSIA

Page 359

Abstract:
In this paper, the use of a sonority measure as an acoustic feature of the speech signal for continuous automatic speech recognition is described. The sonority of sounds is represented with the help of the spectral derivative; accordingly, the novel, articulatorily motivated acoustic feature expressing sonority is named the spectrum derivative feature. The new feature is tested in combination with the state-of-the-art Mel Frequency Cepstral Coefficient (MFCC) feature, and the effects of various warping and filtering techniques on the spectrum derivative feature are investigated. Experiments were performed on a large-vocabulary task (VerbMobil II corpus). Combining the MFCC feature with the spectrum derivative feature improved the word error rate by up to 4.5% relative to using MFCC alone, with the same overall number of parameters in the system.

   PDF File >>





BROAD PHONEMIC CLASS SEGMENTATION OF SPEECH SIGNALS IN NOISE ENVIRONMENTS

Iosif Mporas, Panagiotis Zervas, Nikos Fakotakis

Wire Communication Laboratory, University of Patras, GREECE

Page 363

Abstract:
In this paper, we evaluate the performance of an implicit approach to the automatic detection of broad phonemic class boundaries in continuous speech signals under different additive noise environments. We exploit prior knowledge of glottal pulse locations to estimate adjacent broad phonemic class boundaries. The approach was validated on the DARPA-TIMIT American-English corpus and the NOISEX-92 database. The results were very promising: the method achieved 74.9% accuracy within 25 ms in the clean environment, while performance dropped by about 5% for wideband distortion noise.

   PDF File >>





SPEECH SIGNAL CODING USING NON-LINEAR PREDICTION BASED ON VOLTERRA SERIES EXPANSION

Ghasem Alipoor, Mohammad Hasan Savoji

Electrical and Computer Engineering Faculty, Shahid Beheshti University, Tehran, IRAN

Page 367

Abstract:
Non-linear prediction can be based on a Volterra series expansion, with some benefits especially when the expansion is limited to the first and second terms for simplicity. However, these non-linear predictive filters suffer from instability triggered when quantization is used to translate the reduction in excitation signal energy into a smaller bit-rate. In this paper, the instability is studied in a forward prediction scheme that uses the Least Squares (LS) error criterion, and solutions to remedy the problem are suggested and discussed. A scheme is reported that detects and flags those frames for which, after stabilization, including the quadratic predictor is beneficial. The results show that an overall improvement of up to 2 dB in SNR can be achieved.

   PDF File >>





IMPROVEMENT APPROACHES OF ORDERED SPECTRA WARPING CRITERIA FOR NOISE ESTIMATION

Dragomir Nikolov

Technical University of Varna, BULGARIA

Page 371

Abstract:
The Ephraim and Malah method still has outstanding performance, but it needs a priori speech/noise information and complicated calculations. By contrast, noise estimation based on ordered spectra is computationally simple and needs no voice activity detection. The known methods, however, have either a large statistical error or high memory requirements, and they also lack adaptation. The proposed solution is based on a warping of the ordered spectra when speech or music is present. This warping provides dual information about the noise level and speech presence at the same time. The derived method shows very robust voice activity detection and resistance to speech artefacts such as lip pops and breathing, with reasonable memory requirements. The method can improve the word error rate by up to 18.6% compared to quantile-based noise estimation, and similar performance is achieved with real-world noises. A simple approach to suppressing musical noise is proposed as well.
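
The quantile-based baseline mentioned in the abstract rests on a simple idea: per frequency bin, a low quantile of recent frame magnitudes tracks the noise floor without a voice activity detector, because speech rarely occupies a given bin in every frame. A minimal sketch (buffer sizes and quantile are illustrative):

```python
def quantile_noise_estimate(frames, q=0.5):
    """Per-bin quantile-based noise estimate: for each frequency bin,
    take the q-quantile of that bin's magnitudes over a buffer of
    frames.  Each frame is a list of per-bin magnitudes."""
    n_bins = len(frames[0])
    estimate = []
    for b in range(n_bins):
        values = sorted(frame[b] for frame in frames)
        idx = min(int(q * len(values)), len(values) - 1)
        estimate.append(values[idx])
    return estimate

# toy buffer: bin 0 is pure noise ramp, bin 1 has one loud speech frame
frames = [[1.0, 9.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
noise = quantile_noise_estimate(frames, q=0.5)
```

The single loud value in bin 1 does not disturb the estimate, which is what makes the method robust without explicit speech/noise classification.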

   PDF File >>





COMPARISON BETWEEN GMM AND DECISION GRAPHS BASED SILENCE/SPEECH DETECTION METHOD

Jan Trmal, Jan Zelinka, Josef Psutka, Ludek Muller

Department of Cybernetics, University of West Bohemia, CZECH REPUBLIC

Page 376

Abstract:
In this paper, two contributions to the silence detection problem are described. The first is a classifier based on decision graph construction. Furthermore, we performed Independent Component Analysis on the MFCC parameterization of the speech corpus used and evaluated the influence of this method on voice activity detection accuracy. For the experiments we used the decision graphs and an HMM-based classifier.

   PDF File >>





TRANSITIONAL SPEECH SEGMENTS MODELING BY MATCHING PURSUIT WITH A DICTIONARY BASED ON THE PSYCHOACOUSTIC ADAPTIVE WP

Alexey Petrovsky, Alexander Petrovsky

Computer Engineering Department, Belarusian State University of Informatics and Radioelectronics, Minsk, BELARUS

Page 380

Abstract:
In this paper, modeling of transitional speech segments by matching pursuit is proposed. The dictionary for matching pursuit is composed of wavelet functions that implement a psychoacoustically adaptive wavelet filter bank. Psychoacoustically motivated, entropy-based cost functions greatly reduce the number of time-frequency atoms in the wavelet packet (WP) dictionary. The given transient modeling method is suitable for integration into a parametric speech coder or concatenative speech synthesis based on the three-part model of sinusoids, transients and noise.

   PDF File >>





SPEAKER CHANGE DETECTION VIA BINARY SEGMENTATION TECHNIQUE AND INFORMATIONAL APPROACH

Jindrich Zdansky

Technical University of Liberec, CZECH REPUBLIC

Page 386

Abstract:
This paper deals with the problem of speaker change detection in acoustic data. The aim is to identify the optimal number and positions of the change-points that split the signal into shorter sections belonging to individual speakers. In particular, we focus on the so-called binary segmentation technique, which is well known in mathematical statistics but has never been used in the speaker change detection task. We demonstrate its applicability to this task in simulated tests with artificially mixed utterances and also in tests on 30 hours of real broadcast news (in 9 languages). Further, we review the commonly used approach to speaker change detection via the Bayesian Information Criterion and suggest a theoretically more tenable solution.

   PDF File >>





PERCEPTUAL SPEECH ENHANCEMENT USING A HILBERT TRANSFORM BASED TIME-FREQUENCY REPRESENTATION OF SPEECH

Nima Derakhshan, Mohammad Savoji

Shahid Behesti University, IRAN

Page 390

Abstract:
A new Time-Frequency (TF) representation of the speech signal is introduced and used for speech enhancement. Both the TF representation and the speech enhancement algorithm are based on perceptual properties of the human auditory system, exploiting the concept of band analysis. The TF representation is carried out by means of an analytic decomposition of the speech signal into the hearing Critical Bands (CB), where the envelope and phase components of the analytic signals are used. For enhancement, a time-varying gain function that takes into account the threshold of hearing is applied. The signal is reconstructed from the modified envelopes and the phases of the noisy signal in the CBs. Experiments show that using a threshold of hearing that includes temporal masking can effectively eliminate musical noise without a significant decrease in intelligibility. Results using noise estimation by a Voice Activity Detector and by Speech Presence Probability are reported.
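
The abstract does not give the gain rule itself; the general idea of a threshold-aware gain can be sketched as follows (the suppression rule and the way the threshold enters are illustrative assumptions, not the paper's formula):

```python
def perceptual_gain(signal_power, noise_power, hearing_threshold):
    """Toy per-band, time-varying gain: spectral-subtraction-style
    attenuation, floored so the output is never pushed below the
    threshold of hearing.  Over-suppression below that threshold is
    inaudible anyway and mainly produces musical noise."""
    if signal_power <= 0.0:
        return 0.0
    gain = max(0.0, 1.0 - noise_power / signal_power)
    floor = min(1.0, hearing_threshold / signal_power)
    return max(gain, floor)

# strong speech band: normal attenuation applies
g_speech = perceptual_gain(1.0, 0.2, 0.1)
# noise-dominated band: gain is floored instead of zeroed
g_noise = perceptual_gain(1.0, 2.0, 0.1)
```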

   PDF File >>





KALMAN FILTER APPROACH FOR PITCH DETERMINATION OF SPEECH SIGNALS

Ozgul Salor (1), Mubeccel Demirekler (2), Umut Orguner (2)

(1) Havelsan Inc., TURKEY
(2) Middle East Technical University, TURKEY

Page 396

Abstract:
In this paper, an efficient algorithm for pitch determination of speech signals is presented. Pitch curves of speech signals do not exhibit sudden changes in time in voiced and voiced-unvoiced transition regions of speech. Based on this property, a Kalman filter has been used for pitch determination, similar to the way it is used in target tracking problems. The proposed method reduces the computational complexity of pitch determination, since the autocorrelation search is made only inside the gating volume of the Kalman filter and hence no pitch doubling check is required; it also needs no fractional pitch computations. The pitch periods obtained using this method have been compared to those determined by the MELP algorithm, and comparable results have been observed.
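
The gating idea can be shown in isolation: restrict the autocorrelation peak search to a window around the lag the tracker predicts, rather than searching all lags. This is a toy illustration of that one step, not the authors' full Kalman tracker:

```python
import math

def autocorr(x, lag):
    """Unnormalized autocorrelation of x at a given integer lag."""
    return sum(x[n] * x[n - lag] for n in range(lag, len(x)))

def gated_pitch(x, predicted_lag, gate=5):
    """Pick the autocorrelation peak only among lags inside the gate
    around the predicted pitch lag; lags far enough away to cause
    pitch doubling are never even examined."""
    lags = range(max(2, predicted_lag - gate), predicted_lag + gate + 1)
    return max(lags, key=lambda lag: autocorr(x, lag))

# periodic toy signal with a period of 8 samples; the predicted lag
# from the previous frame is slightly off at 9
x = [math.sin(2 * math.pi * n / 8) for n in range(64)]
lag = gated_pitch(x, predicted_lag=9, gate=3)
```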

   PDF File >>





VOWELS DETECTION / RECOGNITION ON THE BASE OF SHORT CROSS-CORRELATION FUNCTION SIDE PEAK PARAMETERS

Wjatcheslaw Antciperov

Institute of Radioengineering and Electronics of RAS, RUSSIA

Page 400

Abstract:
The report presents the latest results in continuous speech phonetic analysis, concerning the problems of stable and effective speech recognition. These results are obtained on the basis of an earlier discussed approach, at the heart of which lies the analysis of the dynamics of short cross-correlation function (CCF) parameters, namely the time location and the value of the CCF side peak. This approach naturally gives rise to the so-called detection / recognition paradigm in phonetic speech processing. From a general point of view, the paradigm implies the detection of statistically homogeneous signal fragments and the clarification of their structure. Within the approach, the detection and recognition procedures are mutually dependent and represent two sides of one uniform process. The effectiveness of the proposed technique is illustrated by a number of real speech processing examples.

   PDF File >>




Session: Emotional Speech Processing
Chair: Valery Petrushin, Accenture Technology Labs, United States



EMOTIONAL ASPECTS OF INTRINSIC SPEECH VARIABILITIES IN AUTOMATIC SPEECH RECOGNITION

Milos Cernak, Christian Wellekens

Institut Eurecom, FRANCE

Page 405

Abstract:
We analyze two German databases: the OLLO database, designed for speech recognition experiments on speech variabilities, and the Berlin emotional database, designed for the analysis and synthesis of emotional speech. The paper tries to find a relation between intrinsic speech variabilities and emotions, and studies this relation from the point of view of speech recognition. Acoustical analysis is performed on both databases, using Normalized Amplitude Quotient and F0 parameterization of five analyzed vowels [a], [e], [i], [o], and [u], merging their long and short variants. The Euclidean distance between the feature vectors of the two databases is used to find the relation, termed the emotional aspect of speech variabilities. Speech recognition experiments on the OLLO database show that the emotional aspects found also have discriminative power.

   PDF File >>





PERCEPTUAL AND STATISTICAL ANALYSIS OF EMOTIONAL SPEECH IN MAN-COMPUTER COMMUNICATION

Slobodan Jovicic (1), Mirjana Rajkovic (1), Miodrag Djordjevic (2), Zorka Kasic (3)

(1) School of Electrical Engineering, University of Belgrade, SERBIA AND MONTENEGRO
(2) Institute for Experimental Phonetics and Speech Pathology, SERBIA AND MONTENEGRO
(3) Faculty for Special Education and Rehabilitation, University of Belgrade, SERBIA AND MONTENEGRO

Page 409

Abstract:
This paper presents the results of a perceptual and statistical investigation of four emotions: anger, happiness, fear and sadness, in comparison to neutral speech. Perceptual analysis was performed through two tests: emotion evaluation inside Plutchik's circle and an emotion recognition test, with subsequent statistical analysis using the MDS (multidimensional scaling) procedure. Statistical analysis of the emotions was based on static and dynamic acoustic features extracted from the speech signals. ANOVA analysis of each class of features yielded a ranking of the features according to their importance for emotion discrimination. Correlation analysis of each dimension of the three-dimensional MDS representation with the selected features indicates which ones matter most in emotion identification. Finally, a three-level hierarchical model of emotion recognition is proposed.

   PDF File >>





A PITCH BASED ALGORITHM FOR INDEXING OF HUMOUR IN CONVERSATIONS

Narsimh Kamath

National Institute of Technology Karnataka, INDIA

Page 415

Abstract:
This paper presents a novel algorithm to automatically detect humorous segments in stored conversations. The voiced laughter of the speaker is recognized, and the onsets of these laughter bouts are used to annotate the stored conversations. An algorithm for laughter detection based on the acoustic properties of voiced laughter, namely pitch and harmonicity, is proposed and implemented. The algorithm is able to detect the onsets of voiced laughter bouts in clean speech. This work might find many applications, such as creating personalized videos and stored conversations, as well as automatic speech transcription.

   PDF File >>





SPEECH EMOTION RECOGNITION FOR AFFECTIVE HUMAN-ROBOT INTERACTION

Kwang-Dong Jang, Oh-Wook Kwon

Department of Control and Instrumentation Engineering Chungbuk National University, KOREA

Page 419

Abstract:
We evaluate the performance of a speech emotion recognition method for affective human-robot interaction. In the proposed method, emotion is classified into 6 classes: angry, bored, happy, neutral, sad, and surprised. After noise reduction and speech detection, a feature vector for an utterance is obtained from statistics of phonetic and prosodic information, and a pattern classifier based on Gaussian support vector machines decides the emotion class of the utterance. To simulate a human-robot interaction situation, we recorded speech commands and dialogs uttered 2 m away from the microphone. Experimental results show that the proposed method yields a classification accuracy of 58.6%, against 60.4% for human listeners, when the reference labels are given by the speakers' intention. With reference labels given by the listeners' majority decision, the proposed method achieves a classification accuracy of 51.2%.

   PDF File >>





PARAMETERS OF FRICATIVES AND AFFRICATES IN RUSSIAN EMOTIONAL SPEECH

Valery Petrushin (1), Veronika Makarova (2)

(1) Accenture Technology Labs, UNITED STATES
(2) University of Saskatchewan, CANADA

Page 423

Abstract:
The paper investigates the effect of emotive states on the characteristics of Russian fricatives and affricates. The experimental data come from RUSLANA, a database containing neutral utterances along with utterances portraying surprise, happiness, anger, sadness and fear. The paper focuses on the role of duration, energy and dynamic range in the expression of emotions at the segmental level.

   PDF File >>




Session: Speech and Language Resources
Chair: Christoph Draxler, Ludwig-Maximilians University of Munich, Germany



THE CHAINS CORPUS: CHARACTERIZING INDIVIDUAL SPEAKERS

Fred Cummins, Marco Grimaldi, Thomas Leonard, Juraj Simko

School of Computer Science and Informatics, University College Dublin, IRELAND

Page 431

Abstract:
We present a novel speech corpus collected with the primary aim of facilitating research in speaker identification. The corpus features approximately 36 speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, and at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English. The corpus will be made freely available for research purposes.

   PDF File >>





THE PROBLEM OF CHOICE AND PREPARATION OF A TEXT MATERIAL FOR SPEECH CORPORA

Olga Krivnova

Philological Faculty, Lomonosov Moscow State University, RUSSIA

Page 436

Abstract:
Many tasks connected with speech analysis involve the description, modelling and evaluation of the acoustic variability of sound units in various types of speech. Nowadays, large-scale, statistically representative research on the acoustic variability of sound units has become possible thanks to quickly developing computer technologies and the use of representative speech corpora containing special annotations. This paper is dedicated to the principles of selecting text material for scientific and applied modelling of acoustic variability. The computer tools necessary for creating a statistically representative and phonetically reliable speech corpus are also discussed, and the author's experience of preparing text material for Russian speech corpora is briefly described.

   PDF File >>





OPTIMIZATION OF ALLOPHONE DATABASE COMPRESSION WITH WAVELETS FOR POLISH SPEECH SYNTHESIS TTS SYSTEMS

Krzysztof Popowski, Edward Szpilewski

Institute of Computer Sciences, University of Bialystok, POLAND

Page 439

Abstract:
This paper presents a new optimized wavelet compression method for allophone databases for speech synthesis, using various wavelet functions together with effective quantization and coding. The approach obtains better compression results while still achieving good quality of the reconstructed speech signal. The article also presents several types of allophone databases, the wavelets used for compression, and a short introduction to new TTS systems that can use our encoded databases.

   PDF File >>





TO THE PROBLEM OF MULTILANGUAGE PHONETIC DATABASE FORMATION: VIBRANTS IN ENGLISH, GERMAN, RUSSIAN AND CHECHEN

Rodmonga Potapova, Elena Loseva

Moscow State Linguistic University, RUSSIA

Page 445

Abstract:
This paper outlines further results of research on vibrants in different languages, continuing the work presented at SPECOM'2005. This time, vibrants in four languages are investigated. The phoneme /r/ has always attracted the attention of linguists and has long been an object of research; it is notable for great intraspeaker and interspeaker variability. Because of its great interspeaker variability, /r/ is often described as having high speaker-discriminating power. For example, trilled [r]-sounds can differ across subjects in the number and amplitude of taps, although intraspeaker variability can pose a problem in such cases. There are still many uncertainties about some features of vibrants. The present research aims at forming a phonetic database of vibrants in English, German, Russian and Chechen and performing a comparative analysis of the vibrant systems of these languages, in order to find universal and distinctive features.

   PDF File >>





MOBILDAT-SK – A MOBILE TELEPHONE EXTENSION TO THE SPEECHDAT-E SK TELEPHONE SPEECH DATABASE IN SLOVAK

Milan Rusko, Darjaa Sakhia, Marian Trnka

Institute of Informatics, Slovak Academy of Sciences, SLOVAKIA

Page 449

Abstract:
The paper describes the design and the process of collection, annotation and evaluation of MobilDat-SK, a new Slovak mobile-telephone speech database that extends SpeechDat-E SK. The database contains recordings of 1100 speakers and is balanced according to the age, accent, and sex of the speakers. Every speaker recorded 50 files (either prompted or spontaneous) containing numbers, names, dates, money amounts, embedded command words, geographical names, phonetically balanced words, phonetically balanced sentences, yes/no answers and one longer, non-mandatory spontaneous utterance. The paper describes the structure of the database, the hardware and software solution for automatic recording, the speaker recruitment strategy, and the annotation and evaluation processes. MobilDat-SK has been developed for the “Intelligent Speech Communication Interface” project within the framework of the State Research and Development Task.

   PDF File >>




Session: Natural Language Processing
Chair: Rodmonga Potapova, Moscow State Linguistic University, Russia



USING A GENERAL RANK-BASED STATISTICS FRAMEWORK TO EVALUATE LANGUAGE MODELS

Pierre Alain, Olivier Boeffard

IRISA / University of Rennes 1 – ENSSAT, FRANCE

Page 457

Abstract:
Language modelling suffers from the lack of an unambiguous evaluation framework. Even though perplexity is a widely used criterion for comparing language models without any task assumptions, its main drawback is that it presupposes probability distributions and hence cannot compare all kinds of models. We suggest in this article abandoning perplexity and extending Shannon's entropy idea, which is based on model prediction performance, using rank-based statistics. Our methodology can evaluate the prediction of joint word sequences independently of task or model assumptions. Predicting a k-word sequence given an N-word vocabulary is an NP-hard computational task, so we propose some acceptable and effective search heuristics for an A* algorithm. Experiments are carried out on English with different kinds of language models. We show that long-term prediction language models are not more effective than standard n-gram models.

   PDF File >>





A NEW APPROACH FOR WORDS REORDERING BASED ON STATISTICAL LANGUAGE MODEL

Theologos Athanaselis, Stelios Bakamidis, Ioannis Dologlou

Institute for Language and Speech Processing, Maroussi, Athens, GREECE

Page 463

Abstract:
There are multiple reasons to expect that detecting word order errors in a text is a difficult problem, and the detection rates reported in the literature are in fact low. Although grammatical rules constructed by computational linguists improve the performance of grammar checkers in word order diagnosis, the repair task remains very difficult. This paper presents an approach to repairing word order errors in English text by reordering the words in a sentence and choosing the version that maximizes the number of trigram hits according to a language model. The novelty of the method lies in the use of an efficient confusion-matrix technique for reordering the words. Its comparative advantage is that it works with a large set of words and avoids the laborious and costly process of collecting word order errors to create error patterns.
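
The scoring criterion itself is easy to demonstrate. The brute-force sketch below scores every permutation by trigram hits and keeps the best; the paper instead uses an efficient confusion-matrix technique to avoid exactly this exhaustive search, and the tiny trigram set here is illustrative:

```python
from itertools import permutations

def trigram_hits(words, lm_trigrams):
    """Count how many consecutive word triples appear in the
    language model's trigram set."""
    return sum(1 for i in range(len(words) - 2)
               if tuple(words[i:i + 3]) in lm_trigrams)

def best_reordering(words, lm_trigrams):
    """Exhaustively score every permutation of the sentence and
    return the one with the most trigram hits."""
    return max(permutations(words),
               key=lambda p: trigram_hits(p, lm_trigrams))

# toy language model and a sentence with scrambled word order
lm = {("the", "cat", "sat"), ("cat", "sat", "down")}
fixed = best_reordering(["cat", "the", "down", "sat"], lm)
```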

   PDF File >>





BASIC PRINCIPLES OF KNOWLEDGES REPRESENTATION AND SPEECH INFORMATION PROCESSING WITHIN INTEGRATED INTELLIGENT SYSTEM

Andrey Baranovich, Oleg Sidorov

Moscow State Linguistic University, RUSSIA

Page 467

Abstract:
Within the framework of applied human-machine system synthesis, we consider the class of integrated intelligent systems with an extended tool sensorium whose major components are input channels of verbal and acoustic speech information. We formulate the main principles of synthesizing an information model-universum whose semiotic identification and categorial structurization of units can underlie mechanisms for the analysis and synthesis of arbitrary natural and artificial languages in their verbal interpretation. The results of simulating the knowledge subsystem of the integrated intelligent system may be used, in particular, to solve the problem of the semantic analysis of normative documents. The problem of ambiguous interpretation of text in the computer analysis of normative documents is discussed, and a solution for unambiguous text interpretation is proposed, associated with the development of conformed bijective dictionaries.

   PDF File >>





PARTS OF SPEECH RECOGNITION SYSTEM FOR THE TEXT-BASED POLISH SPEECH SYNTHESIZER

Bozena Piorkowska, Janusz Rafalko, L. Kalinowski, K. Pak

Institute of Computer Sciences, University of Bialystok, POLAND

Page 471

Abstract:
The article presents two methods for automatically recognizing the part of speech of a given word. It is crucial to determine which words belong to the subject and which to the predicate: this information greatly enhances proper sentence intonation because, as observations have shown, the most frequently stressed words in a sentence are those from the subject group. The two recognition algorithms, with their strengths and drawbacks, are discussed in the article, together with an evaluation of how the program based on these algorithms performs.

   PDF File >>





SYNONYM SEARCH IN WIKIPEDIA: SYNARCHER

Andrew Krizhanovsky

Saint-Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, RUSSIA

Page 474

Abstract:
The program Synarcher was developed for searching for synonyms (and related terms) in a text corpus of special structure (Wikipedia). The search results are presented in the form of a graph that can be explored interactively, including searching for graph elements. The paper presents an adaptation of Kleinberg's HITS algorithm to synonym search, the program architecture, and an evaluation of the program on test examples. The proposed algorithm could be applied to extending search requests and to building synonym dictionaries.
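
Kleinberg's HITS, which the paper adapts, is a short power iteration over hub and authority scores on a link graph. A generic sketch on a three-node toy graph (not Synarcher's adapted version or its Wikipedia data):

```python
def hits(adjacency, iterations=50):
    """Power iteration for Kleinberg's HITS on a small directed graph
    given as {node: [out-neighbours]}.  Returns (authority, hub)
    score dictionaries, each normalized to sum to 1."""
    nodes = list(adjacency)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of hub scores of nodes linking in
        auth = {n: sum(hub[m] for m in nodes if n in adjacency[m])
                for n in nodes}
        # hub score: sum of authority scores of linked-to nodes
        hub = {n: sum(auth[m] for m in adjacency[n]) for n in nodes}
        # normalise so the scores stay bounded across iterations
        na = sum(auth.values()) or 1.0
        nh = sum(hub.values()) or 1.0
        auth = {n: v / na for n, v in auth.items()}
        hub = {n: v / nh for n, v in hub.items()}
    return auth, hub

# toy graph: both "a" and "b" point at "c", so "c" is the authority
graph = {"a": ["c"], "b": ["c"], "c": []}
auth, hub = hits(graph)
```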

   PDF File >>





ALGORITHMIC INTERACTIVE PRESENTATION OF NOTIONS

Pavel Pankov (1), Sagyn Alaeva (2), Vasiliy Kutsenko (3)

(1) International University of Kyrgyzstan, KYRGYZSTAN
(2) Kyrgyz National University, KYRGYZSTAN
(3) Kyrgyz-Russian Slavic University, KYRGYZSTAN

Page 478

Abstract:
A kind of meta-language is proposed to present notions of natural languages in interactive form and to learn elements of languages on a computer, without any other language serving as a medium, by means of the user's arbitrary mouse actions with feedback. Such actions are implemented by means of the "Drag-and-Drop" mechanism and the notion of an "Active point".

   PDF File >>




Session: Multimodal Analysis and Synthesis
Chair: Benoit Macq, Universite catholique de Louvain, Belgium
Niels Ole Bernsen, University of Southern Denmark, Denmark



COMPARISON BETWEEN DIFFERENT FEATURE EXTRACTION TECHNIQUES IN LIPREADING APPLICATIONS

Leon J. M. Rothkrantz, Jacek C. Wojdel, Pascal Wiggers

Delft University of Technology, THE NETHERLANDS

Page 483

Abstract:
In this paper we present a novel way of processing the video signal for lipreading applications, together with a post-processing data transformation that can be used alongside it to improve audiovisual speech recognition results. The presented Lip Geometry Estimation (LGE) is compared with other geometry- and image-intensity-based techniques typically deployed for this task; the post-processing stage can be applied to any other feature extraction technique. We show to what extent different ways of processing the video signal are equivalent under appropriate transformations.

   PDF File >>





CROSSMODAL INTEGRATION AND MCGURK-EFFECT IN SYNTHETIC AUDIOVISUAL SPEECH

Katja Grauwinkel (1), Sascha Fagel (2)

(1) Department of Computer Sciences and Media, TFH Berlin University of Applied Sciences, GERMANY
(2) Institute for Speech Communication, Technical University Berlin, GERMANY

Page 489

Abstract:
This paper presents the results of a study investigating crossmodal processing of audiovisually synthesized speech stimuli. The perception of facial gestures has great influence on the interpretation of a speech signal: not only can paralinguistic information about the speaker's emotional state or motivation be obtained, but, especially when the acoustic signal is unclear, e.g. because of background noise or reverberation, watching the facial gestures can enhance speech intelligibility. On the other hand, visual information that is incongruent with auditory information can reduce the effectiveness of acoustic speech features, even if the acoustic signal is of good quality. How two modalities interact with each other constitutes an interdisciplinary research area, and bimodal speech processing is a remarkable example of how one modality (vision) affects the experience of another (audition). The present study shows that this effect can also be achieved with synthetic speech.

   PDF File >>





AUDIO-VISUAL SPEECH RECOGNITION FOR SLAVONIC LANGUAGES (CZECH AND RUSSIAN)

Petr Cisar, Jan Zelinka, Milos Zelezny (1), Alexey Karpov, Andrey Ronzhin (2)

(1) Department of Cybernetics, University of West Bohemia in Pilsen (UWB), CZECH REPUBLIC
(2) Speech Informatics Group, Saint-Petersburg Institute of Informatics and Automation of the Russian Academy of Sciences, RUSSIA

Page 493

Abstract:
The paper presents the results of recent experiments with audio-visual speech recognition for two popular Slavonic languages: Russian and Czech. It describes the applied test tasks, the process of multimodal database collection and data pre-processing, methods for visual feature extraction (geometric shape-based features; DCT and PCA pixel-based visual parameterization), as well as models for audio-visual recognition (concatenation of feature vectors and multi-stream models). The prototype systems that will use the audio-visual speech recognition engine are mainly directed at the market of intelligent applications such as inquiry machines, video conference communications, control of moving objects in noisy environments, etc.

   PDF File >>





CREATION AND SELECTION OF THE VISUAL FRONT END FEATURES AND THE AUDIO-VISUAL FEATURE FUSION FOR AUDIO-VISUAL SPEECH RECOGNITION

Josef Chaloupka

Department of Electronics and Signal Processing, Technical University of Liberec, CZECH REPUBLIC

Page 499

Abstract:
This contribution concerns the creation and selection of visual front-end speech features. The use of visual shape-based and appearance-based visual features is described; these features can be used for visual or audio-visual speech recognition. Before use, the features have to be normalized and selected in such a way that the recognition rate is high enough. A second task has been the fusion of different kinds of visual and acoustic speech features. The work concludes with experiments on audio-visual recognition of isolated words.

   PDF File >>





JOINT AUDIO-VISUAL UNIT SELECTION – THE JAVUS SPEECH SYNTHESIZER

Sascha Fagel

Berlin University of Technology, GERMANY

Page 503

Abstract:
The author presents a system for speech synthesis that selects and concatenates speech segments (units) of various sizes from an adequately prepared audio-visual speech database. The audio and the video track of the selected segments are used together in concatenation to preserve audio-visual correlations. The input text is converted into a target phone chain, and the database is searched for appropriate segments representing sub-chains of at least two phones that can be concatenated into the target utterance. The final segment sequence is selected from the possible segment sequences by a weighted sum of concatenation criteria for the audio and the video joins. The weights of these audio and video join costs can be used to trade off fluency between the audio and the video channel of the synthesized speech. The output shows the input text spoken audio-visually, with audio and video tracks that are reasonably fluent, synchronous, and intelligible.
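
The selection criterion, a weighted sum of audio and video join costs over a candidate sequence, can be sketched in a few lines (the candidate sets, weights, and cost values are illustrative; the paper's actual cost terms are not given in the abstract):

```python
def sequence_cost(joins, w_audio, w_video):
    """Total cost of a candidate segment sequence: a weighted sum of
    the audio and video concatenation costs at every join."""
    return sum(w_audio * a + w_video * v for a, v in joins)

def select_sequence(candidates, w_audio=0.5, w_video=0.5):
    """Pick the candidate sequence with the lowest weighted join cost.
    Shifting weight between w_audio and w_video trades off fluency
    between the two channels."""
    return min(candidates, key=lambda c: sequence_cost(c, w_audio, w_video))

# each candidate is a list of (audio_cost, video_cost) pairs per join
cands = [[(0.1, 0.9), (0.2, 0.8)],   # smooth audio, jumpy video
         [(0.4, 0.3), (0.5, 0.2)]]   # balanced joins
best = select_sequence(cands)
```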

   PDF File >>





STATISTICAL FACIAL EXPRESSION ANALYSIS FOR REALISTIC MPEG-4 FACIAL ANIMATION

Francois-Xavier Fanard, Olivier Martin, Benoit Macq

Laboratoire de Telecommunications,Universite catholique de Louvain, BELGIUM

Page 507

Abstract:
This paper presents a statistical study of facial expressions. The results can be used as a basis for developing a high-level, realistic, automated framework for real-time facial expression synthesis compliant with the MPEG-4 Facial Animation specifications. To achieve this goal, the entire Cohn-Kanade facial expression database has been manually labeled so as to provide accurate statistics on the positions and shapes of a set of facial features (eyes, eyebrows and mouth) for six different emotions. After describing this labeling process and the MPEG-4 Facial Animation specifications, the paper explains how to relate the extracted statistics to the low-level FAPs in order to obtain simplified and realistic facial synthesis using only high-level actions. The contribution of this paper is thus twofold: it extends the results presented in [4] (by quantifying FAP variations under the six emotions considered) and shows how these new results can be used to create realistic facial animation.

   PDF File >>





USING PHYSIOLOGICAL SIGNALS FOR SOUND CREATION

Jean-Julien Filatriau, Remy Lehembre, Quentin Noirhomme, Cedric Simon (1), Burak Arslan (2), Andrew Brouse (3), Julien Castet (4)

(1) Communications and Remote Sensing Lab, Universite Catholique de Louvain (UCL), BELGIUM
(2) TCTS Lab of the Faculté Polytechnique de Mons, BELGIUM
(3) Computer Music Research, University of Plymouth, Drake Circus, UK
(4) National Polytechnic Institute of Grenoble, FRANCE

Page 513

Abstract:
Recent advances in new technologies offer a wide range of innovative instruments for designing and processing sounds. Seeking to explore new avenues for music creation, specialists from the fields of brain-computer interfaces and sound synthesis worked together during the eNTERFACE05 workshop (Mons, Belgium). The aim of their work was to design an architecture for real-time sound synthesis driven by physiological signals. The following description links natural human signals to sound synthesis algorithms, thus offering rarely used pathways for musical exploration. This architecture was tested during a "bio-concert" given at the end of the workshop, where two musicians used their EEG and EMG signals to perform a musical creation.
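A minimal sketch of one plausible signal-to-sound mapping of the kind described above: the power of an EEG frequency band drives an oscillator frequency. The band edges, pitch range, and plain-DFT band-power estimate are illustrative assumptions, not the architecture used at the workshop.

```python
import math

def band_power(signal, fs, f_lo, f_hi):
    """Power of `signal` in the [f_lo, f_hi] Hz band via a plain DFT.
    O(n^2) for clarity; a real system would use an FFT or Welch's method."""
    n = len(signal)
    power = 0.0
    for k in range(n // 2 + 1):
        f = k * fs / n
        if f_lo <= f <= f_hi:
            re = sum(signal[t] * math.cos(2 * math.pi * k * t / n)
                     for t in range(n))
            im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n)
                      for t in range(n))
            power += (re * re + im * im) / n
    return power

def power_to_pitch(p, p_max, f_min=100.0, f_max=800.0):
    """Map normalized band power linearly onto an oscillator frequency (Hz)."""
    return f_min + (f_max - f_min) * min(p / p_max, 1.0)
```

In a real-time setting, `band_power` would be recomputed over a sliding window and the resulting pitch sent on to the synthesis engine.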

   PDF File >>





EEG INVERSE PROBLEM AND PRIORS IN A BRAIN-COMPUTER INTERFACE

Quentin Noirhomme, Benoit Macq

Universite Catholique de Louvain, BELGIUM

Page 519

Abstract:
The best Brain-Computer Interface (BCI) methods available today are based on invasive recording of electrical brain activity. Scalp-electrode methods are not as accurate, partially due to the filtering of the signal by the skull and to the distance from the sources. Surprisingly, methods for solving the EEG inverse problem have seldom been used to overcome these limitations. Inverse-problem methods estimate the brain activity from the scalp potentials. In this paper we study the application of inverse-problem methods to BCI. A minimum-norm method and four weighted minimum-norm approaches based on four kinds of a priori information were tested. Results were obtained by first processing the data with an inverse-solution method; the data were then classified by measuring the activation in preselected brain areas. The results were compared across priors, with scalp-potential results, and with the best BCI methods. Scalp-potential results were improved by more than 10%, and the results were equivalent to those of the best BCI methods. The best prior achieved 86% correct classification. Finally, these methods can be used in real time without increasing the computation time.
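A weighted minimum-norm estimate of the kind tested in the paper can be sketched with the standard regularized closed form; the diagonal weight vector encoding the prior, the regularization constant, and all dimensions below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def weighted_minimum_norm(L, v, w, lam=1e-2):
    """Weighted minimum-norm estimate of source activity.

    Minimizes ||W^(1/2) j||^2 subject to fitting the scalp potentials v,
    via the regularized closed form
        j = W^-1 L^T (L W^-1 L^T + lam * I)^-1 v
    where L is the (sensors x sources) lead-field matrix and w holds the
    diagonal of the prior weight matrix W.
    """
    Winv = np.diag(1.0 / w)
    G = L @ Winv @ L.T + lam * np.eye(L.shape[0])
    return Winv @ L.T @ np.linalg.solve(G, v)
```

Different priors correspond to different choices of the diagonal weights w (e.g. depth compensation, or down-weighting sources outside preselected areas); classification then measures the activation of the estimated sources in those areas.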

   PDF File >>





HANDS-FREE MOUSE CONTROL SYSTEM FOR HANDICAPPED OPERATORS

Alexey Karpov (1), Alexandre Cadiou (2)

(1) Speech Informatics Group, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, RUSSIA
(2) Advanced Institute of Electronics of Paris (ISEP), FRANCE

Page 525

Abstract:
The paper describes an evolution of the multimodal system ICANDO (Intellectual Computer Assistant for Disabled Operators), intended to assist persons without hands, or with disabilities of the hands or arms, in human-computer interaction. The evolution of this device concerns the head-movement tracking system. ICANDO combines modules for automatic speech recognition and head tracking in one multimodal system. The architecture of the system, the methods for head detection and tracking, and experiments with hands-free mouse control are described in the paper. The results reported at the end of the paper demonstrate the increased reliability and faster operation of the ICANDO system.

   PDF File >>





3D SYMBOL BASE TRANSLATION AND SYNTHESIS OF CZECH SIGN SPEECH

Zdenek Krnoul, Jakub Kanis, Milos Zelezny, Ludek Muller, Petr Cisar

University of West Bohemia, CZECH REPUBLIC

Page 530

Abstract:
This paper presents preliminary results on the translation of spoken Czech into Signed Czech and on the synthesis of signs by computer animation. The synthesis of animation employs a symbolic notation, from which an automatic synthesis process generates the articulation of the hands. The translation system is built on statistical grounds. A graphic editor is provided for the notation of new signs.

   PDF File >>





WHAT VOICE DO WE EXPECT FROM A SYNTHETIC CHARACTER?

Joao Cabral, Luis Oliveira, Guilherme Raimundo, Ana Paiva

INESC-ID/IST, PORTUGAL

Page 536

Abstract:
The emerging applications of synthetic characters, as a way to achieve more natural interactions, place new demands on synthetic voices in order to fulfill the expectations of the user. The work presented in this paper evaluates a synthetic voice used by a synthetic character in a storytelling situation. To allow for a better comparison, a real actor was filmed telling a children's story. The pitch, duration and energy of the recorded speech were copied to the synthetic speech generated with a FESTIVAL-based LPC diphone synthesizer. At the same time, the synthetic character was also animated with the gestures, emotions and facial expressions used by the actor. Using different conditions combining the synthetic and real voices with the synthetic and real characters, the voice was evaluated with regard to the comprehension of the storyteller, the expression of emotions, its credibility, and user satisfaction.

   PDF File >>




Session: Fundamentals of Speech Research
Chair: Rajmund Piotrowski, Herzen State Pedagogical University, Russia
Lev Stankevich, St. Petersburg State Technical University, Russia



SOME PRIORITY TRENDS OF MODERN FORENSIC SPEECHOLOGY

Rodmonga Potapova, Vsevolod Potapov

Moscow State Linguistic University, RUSSIA

Page 543

Abstract:
Forensic linguistics, and in particular forensic phonetics, is a discipline that is growing apace and attracting more and more attention from linguists, legal academics, public prosecutors, psychologists and others in related fields. The purpose of this paper is a simple one: to give a concise but informative picture of the breadth of linguistic, i.e. phonetic, expertise that has been introduced into legal cases by professional experts in Russia. The paper presents the results of studies in such priority fields of modern forensic speechology as the identification of foreign speech and of interfered speech, the identification of speech signals distorted with the help of special devices (the so-called voice changers), emotional speech (especially the state of 'fear', which is of paramount interest to experts), and the automation of phonetic expertise.

   PDF File >>





AUTOMATIC ANALYSIS AND SYNTHESIS OF SPEECH IN TEACHING PHILOLOGISTS THE PRINCIPLES OF MATHEMATICS AND INFORMATICS

Rajmund H. Piotrowski, Xenia R. Piotrowska, Yuri V. Romanov

Herzen State Pedagogical University of Russia, RUSSIA

Page 548

Abstract:
The article examines the role and place of modern computer technologies in higher education in the humanities.

   PDF File >>





AUTOMATIC ORIGIN OF A LANGUAGE IN AAC NEURON-LIKE SYSTEMS

Alexander Zhdanov, Alexander Kondukov, Tamara Naumkina, Olga Dmitrenko

Institute for System Programming RAS, RUSSIA

Page 550

Abstract:
The paper presents our first results on simulating the origin of a simple language in a neuron-like Autonomous Adaptive Control (AAC) system. The language origin is based on properties of neurons that allow them to associate different patterns: if one is the pattern of a real object and the other is the pattern of a verbal object, a neuron can associate the two. The set of pattern identifiers and special linguistic actions forms a language and allows the object to be controlled from outside, or to be used by the object itself for thinking.

   PDF File >>





SPEECH RHYTHM DISORDERS DUE TO SYNCHRONIZATION INDUCED IN COUPLED OSCILLATORS

Oleg Skljarov

Research Institute of Ear, Throat, Nose and Speech, RUSSIA

Page 555

Abstract:
We use coupled Leaky Integrate-and-Fire oscillators (LIFs). The activity of the LIFs creates speech rhythm, controlled by inhibition via a square-law map. In early stuttering, however, regular rhythms of convulsive repetitions are replaced by a mixture of repetitions and neurotic pauses; this mixture is a "stumbling block" for clinicians. Owing to delays, only inhibitory LIFs are capable of creating synchronous activity, in phase or in anti-phase, at medium or low coupling. This activity takes the form of slow oscillations damping to a background level. Bursts of activity above or below this level lead to neurotic disorders or to convulsive repetitions. Indeed, coupling decreased by GABA leads to convulsions.
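For orientation, the two textbook ingredients named in the abstract can be sketched as follows: a leaky integrate-and-fire update and a square-law (logistic-type) map. The parameters and the Euler discretization are generic illustrations, not the author's model.

```python
def lif_step(v, i_ext, tau=20.0, v_thresh=1.0, v_reset=0.0, dt=1.0):
    """One Euler step of a leaky integrate-and-fire unit.

    The membrane potential v decays toward zero with time constant tau
    and integrates the external input i_ext; on crossing v_thresh the
    unit fires and is reset.  Returns (new potential, fired?)."""
    v = v + dt * (-v / tau + i_ext)
    if v >= v_thresh:
        return v_reset, True
    return v, False

def logistic_map(x, r=3.7):
    """Square-law (logistic) map x -> r * x * (1 - x), a standard model
    of rhythm control that is periodic or chaotic depending on r."""
    return r * x * (1 - x)
```

Coupling several such inhibitory units with delays is what produces the in-phase or anti-phase synchronous activity the abstract describes.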

   PDF File >>





DEVELOPING A RUSSIAN VOWEL SPACE IN INFANCY: THE FIRST TWO YEARS

Jeannette M. Van der Stelt (1), Elena E. Lyakso (2), Louis C.W. Pols (1), Ton G. Wempe (1)

(1) Institute of Phonetic Sciences/ACLC, Universiteit van Amsterdam, THE NETHERLANDS
(2) Uchtomsky Institute of Physiology, St. Petersburg State University, RUSSIA

Page 561

Abstract:
In the first year of life infants do not yet produce vowels in the full sense of the word. Yet, for the tuned-in listener, the infant's sounds can elicit a vowel-like perception that, with age, becomes more adult-like. Mastering the vowels of the ambient language also means dealing with developmental changes in the speech mechanism and in perception. An enormous variability in vocalic productions is found. The Principal Component Analysis method on band-filter data employed in this study focuses on the spectral envelope, and thus accounts for frequency as well as intensity information, much like the information used by the ear. Results for Russian-learning infants indicate that between 6 months and two years of age the infants explore the vowel space in quite different manners. Data on 5 Dutch boys, acquiring 12 vowels, are compared to the Russian results in order to interpret the influence of the number of vowels per language.
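A minimal sketch of the analysis method named above: PCA applied to frames of band-filter (log-energy) spectra, so that each principal component captures a direction of spectral-envelope variation. The data layout and component count are illustrative assumptions, not the study's exact procedure.

```python
import numpy as np

def principal_components(bandfilter_frames, n_components=2):
    """PCA on band-filter spectra.

    Each row of `bandfilter_frames` holds one frame's log energies in a
    set of band filters.  Returns the projection of the mean-removed
    frames onto the top `n_components` principal components, i.e. a
    low-dimensional 'vowel space' for the vocalizations."""
    X = bandfilter_frames - bandfilter_frames.mean(axis=0)
    # eigen-decomposition of the covariance matrix over the bands
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return X @ eigvecs[:, order]
```

Because the analysis operates on the whole spectral envelope rather than on tracked formants, it remains applicable to infant vocalizations where formant tracking is unreliable.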

   PDF File >>





RECOGNITION OF WORDS AND PHRASES OF 4-5-YEARS-OLD CHILDREN BY ADULTS

Elena Lyakso, Anna Kurazova, Alexandra Gromova (1), Alexandr Ostrouxov (2)

(1) Saint Petersburg State University, Uchtomsky Institute of Physiology, RUSSIA
(2) Russian acoustic company “AUDITECH”, RUSSIA

Page 567

Abstract:
The present study is part of a longitudinal investigation of children's speech development in the Russian language. The aim of this part of the study is to investigate the recognition by adult listeners of words and phrases produced by 4-5-year-old children. It was shown that Russian-speaking adults recognized 60-100% of the children's words and phrases. When Russian speakers recognized 75% of a child's speech, words and phrases were recognized with the same probability; when the rate of recognition was lower, native speakers recognized words and phrases with different probabilities. The probability of recognizing children's speech depends on the development of features that are typical of adults' speech. The difficulties adults have in recognizing this speech are due to articulation 'mistakes', mainly in the pronunciation of consonants and, to a lesser extent, of vowels. A combination of these articulation mistakes in one word makes it impossible for native speakers to recognize the word.

   PDF File >>





HUMAN-ROBOT INTERACTION: GROUP BEHAVIOR LEVEL

Lev Stankevich, Denis Trotsky

Saint Petersburg State Technical University, RUSSIA

Page 571

Abstract:
This paper describes a new approach to the problem of human-robot interaction for controlling the behavior of a group of robots. Solving this problem is important for enabling teamwork among robots and other unmanned vehicles operated in a command-and-control manner. At the group level of interaction, a human operator can define the strategy and tactics of robot teamwork. To study the principles of interaction between a human operator and a robot team, we propose using a special multi-agent 3D game simulation environment (Basketball Server), which provides teamwork of basketball agents under the control of the operator. Using this environment, the operator can change team strategy and tactics on the fly through a special agent and modules embedded in the agent's program for interaction with a basketball agent. The environment provides 3D visualization of the game, which makes it possible to assess the effectiveness of individual and collective behaviors of the agents and their ability to solve complex attack and defense tasks.

   PDF File >>





PHONETIC RESEARCH OF THE SOUND FORM OF MODERN BURYAT LANGUAGE

Ljubov Radnaeva

Department of Phonetics, Saint Petersburg State University, RUSSIA

Page 577

Abstract:
This paper reports phonetic research on the properties of the sound form of the modern Buryat language. The research comprises: 1. A study of the phoneme system. 2. Quantitative and qualitative acoustic characteristics of the basic, positional and combinatorial allophones of vowel and consonant phonemes, including their duration and formant structure. 3. Rules for phonemic and phonetic transcription based on the IPA. 4. Statistical characteristics of the distribution of allophones, phonemes and syllables. 5. A phonetic model of the Buryat text. The research was carried out in the Laboratory of Experimental Phonetics and at the Department of Phonetics and Methods of Teaching Foreign Languages of Saint Petersburg State University.

   PDF File >>





GRANTS OF SCIENTIFIC FUNDS AS A PARAMETER OF AN ESTIMATION OF THE INTELLECTUAL CAPITAL OF SCIENCE OF SAINT PETERSBURG

Nelli Didenko, Andrey Petrovsky

St. Petersburg Scientific Centre of the Russian Academy of Sciences, RUSSIA

Page 583

Abstract:
This work presents the results of an analysis of the participation of St. Petersburg scientists in the international competitions that have been carried out by the Russian Foundation for Basic Research (RFBR) on a regular basis since 1996. The competition in these joint programs is shown to be stronger than in the corresponding Russian programs of the RFBR. The analysis of the data presented in this paper allows us to conclude that the level of participation of St. Petersburg scientists in projects of joint programs of the RFBR with foreign scientific funds is, with rare exceptions, lower than 10%, while this level in the Russian competitions of the RFBR reaches about 13%.

   PDF File >>



