Saturday, March 22, 2008
EVALUATION OF THE EFFECT OF CLASSIFIER ARCHITECTURE ON THE OVERALL PERFORMANCE OF ARABIC SPEECH RECOGNITION
Combined classifiers offer a solution to pattern classification problems arising from variation in the data acquisition conditions, in the signal representing the pattern to be recognized, and in the classifier architecture itself. This paper studies the effect of classifier architecture on the overall performance of an Arabic speech recognition system. Five different architectures are studied and their performance compared. It is found that architectures based on ensemble approaches outperform the modular approaches: the best ensemble-based architecture achieves 94% while the best modular-based architecture achieves 79.3% on the testing data.
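As a rough illustration of the ensemble idea (the paper's actual combination rule is not given here), a majority-vote combiner over several classifiers' output labels can be sketched in Python:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class labels proposed by several classifiers for one
    utterance by simple majority vote - one basic ensemble combination
    rule; real systems may weight classifiers or average scores instead."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers voting on the same utterance:
# majority_vote(["alif", "ba", "alif"]) picks "alif"
```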
Tuesday, October 16, 2007
Arabic Speech Recognition based on Combined Classifier
Tuesday, August 07, 2007
Neural Network Fixed size input vectors
One method, called time normalization, uses Dynamic Time Warping (DTW): it stretches or compresses the input speech duration, but it may cause signal distortion.
You can see an example written in MATLAB here: http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/
There is another good and easy method based on window techniques; for more information you can see this thesis: Application of a Back-Propagation Neural Network to Isolated-Word Speech Recognition.
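To illustrate the time-normalization idea, here is a minimal DTW sketch in Python (not the MATLAB code linked above); it aligns two scalar feature sequences of different lengths and returns the cumulative alignment cost:

```python
def dtw_distance(x, y):
    """Classic dynamic time warping distance between two feature sequences.
    cost[i][j] is the minimal cumulative distance aligning x[:i] with y[:j]."""
    n, m = len(x), len(y)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])          # local distance (scalar features)
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of x
                                 cost[i][j - 1],      # skip a frame of y
                                 cost[i - 1][j - 1])  # match frames
    return cost[n][m]
```

In a real recognizer the local distance would be computed between feature vectors (e.g. cepstral coefficients) rather than scalars.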
Monday, May 21, 2007
Window Function
The speech wave must be multiplied by an appropriate time window when we extract an N-sample interval from it for most kinds of speech processing.
It has two effects:
- It gradually attenuates the amplitude at both ends of the extracted interval, preventing an abrupt change at the endpoints.
- Multiplying by the window in the time domain corresponds to convolving the speech spectrum with the Fourier transform of the window function.
The Hamming window, the one usually used for speech analysis, is defined as w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)) for n = 0, ..., N-1.
Other common choices are the Hanning window and the rectangular window.
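As a quick sketch, the Hamming weights and their application to one extracted frame can be computed in plain Python (the frame here is just dummy data):

```python
import math

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

# Multiply an extracted N-sample interval by the window before the FFT:
frame = [1.0] * 8                                   # dummy 8-sample interval
windowed = [s * w for s, w in zip(frame, hamming(8))]
```

Note how the weights taper to 0.08 at both ends instead of cutting off abruptly, which is exactly the first effect described above.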
A/E Speech DataBase
This is the first step in building a speech recognition system: we should create a database for training and testing whatever algorithms we use.
English Speech Database
After a lot of searching on the internet, I found that most speech databases are not free and you have to buy them; this site: http://www.ldc.upenn.edu/ contains many speech databases, for example the TIMIT database.
But finally I found a free one here: the Speech Separation Challenge
http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm
The training and development sets are drawn from a closed set of 34 talkers of both genders. This document describes the corpus in more detail.
Training data is available separately for each of the 34 individual talkers.
The last challenge: we need to segment the speech wav files into words. I did it manually using a very efficient program, "Adobe Audition 2.0".
Arabic Speech Database
It is a big problem to create your own database if you work alone and your friends don't want to help you; it is disheartening.
I'm thinking of a new method to build a database without needing anybody; God willing, I will explain the new method after I make sure it works properly.
**You can build your own Arabic speech database from recordings of the Holy Quran.
Wednesday, September 06, 2006
Voice Conversion
A general voice conversion system works as follows. The system analyzes the speech samples of two speakers. This involves collecting the voice characteristic of the original and desired (target) speakers. After learning the characteristics of each speaker, the system automatically creates a conversion rule from the original speaker's voice characteristics into those of the desired speaker. This conversion rule is applied to the original speech to create a converted speech that exhibits the target speaker's voice characteristics.
We can use this technology in many fields. In the speech synthesis fields, the output speech can be enriched using voice conversion. In addition, it can also be used in telephone voice translation, low-bit speech encoding, speaker adaptation, and so on. Voice conversion depends much on the research about voice individuality.
Research indicates that the factors relevant to voice individuality can be divided into two types. One is acoustic parameters, such as pitch frequency and formant frequencies and bandwidths, which reflect the voice source and vocal tract of different people. The other is prosodic parameters, such as the timing, rhythm, and pauses of the voice, which usually depend on a speaker's social background.
The UPC Voice Conversion Toolkit
Saturday, August 26, 2006
Speech Processing Links
The largest Chinese source-code site: http://www.programsalon.com/
English Translation: http://www.google.com/translate?u=http%3A%2F%2Fwww.pudn.com%2F&langpair=zh%7Cen&hl=en&ie=UTF8
webcast.berkeley
Podcasts and Webcasts of UC Berkeley current and archived courses.
CS 224S/LINGUIST 136/236 Speech Recognition and Synthesis Winter 2005
CS 224S / LINGUIST 281 Speech Recognition and Synthesis, Winter 2006
Speech, Music and Hearing
Speech Processing Group
Connexions
A rapidly growing collection of free scholarly materials and a powerful set of free software tools.
Examples of Synthesized Speech
Friday, August 25, 2006
Arabic Speech Synthesis
Speech is the primary means of communication between people. Speech synthesis is the artificial production of human speech: the automatic generation of speech waveforms.
A diphone can be defined as a speech fragment that runs roughly from halfway through one phoneme to halfway through the next.
In this way the transition between two consecutive speech sounds is captured inside the diphone and need not be calculated.
Overview of speech synthesis technology
The front-end has two major tasks.
- First, it takes the raw text and converts things like numbers and abbreviations into their written-out word equivalents. This process is called text normalization or pre-processing.
- Then it assigns a phonetic transcription to each word, and divides and marks the text into various prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called letter-to-sound conversion. The output is called the symbolic linguistic representation.
The other part, the back-end, takes the symbolic linguistic representation and converts it into actual sound output. The back end is often referred to as the synthesizer.
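The front-end steps above can be sketched as a toy text normalizer; the tables and function names here are hypothetical examples, and real systems use far richer rules (and, for Arabic, diacritization would come first):

```python
# Hypothetical lookup tables for a toy text normalizer.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    """Expand abbreviations and spell out numbers digit by digit."""
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            words.extend(DIGITS[d] for d in token)
        else:
            words.append(token)
    return " ".join(words)

# normalize("Dr. Smith lives at 42 Main St.")
# -> "Doctor Smith lives at four two Main Street"
```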
The Arabic language
Modern Standard Arabic is generally adopted as the common medium of communication throughout the Arab world today.
The Arabic language contains 28 letters. There are 6 vowels in Arabic: 3 short and 3 long.
Research Challenges
We can classify them into two categories:
First, text-to-speech challenges:
- Memory requirements: in the concatenation method, memory requirements mainly appear when a large text needs to be converted into speech.
- Limited to only one speaker: because the diphones must be recorded in advance, and because of the memory requirements.
- Intelligibility: whether the system produces intelligible speech.
- Naturalness: whether the voice sounds natural and familiar to the listener.
Second, Arabic-language challenges:
- The absence of diacritics in modern Arabic text is one of the most critical problems facing computer processing of Arabic text.
- The Arabic language has a special phonological system, phonotactics, and syllabic structure.
- Conversion of Arabic script into phonetic rules.
Research target
The main target is to build a text-to-speech system using diphone concatenation. To achieve that goal, several sub-goals arise, such as:
- Automatic diacritization of Arabic text, to save the user time.
- Implementation of phonological rules (letter-to-sound conversion), responsible for automatically determining the phonetic transcription of the incoming text.
- Determining syllable types; we can classify syllables in Arabic either according to the length of the syllable or according to how the syllable ends.
- Converting syllables into diphones, which are mapped to pre-recorded sounds.
- Concatenating the diphone wave files into one file that will be played.
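The final concatenation step can be sketched with Python's standard wave module; the diphone file names in the commented usage line are hypothetical (they depend on your own naming scheme), and all input files are assumed to share one sample rate and format:

```python
import wave

def concatenate_diphones(diphone_paths, out_path):
    """Concatenate pre-recorded diphone WAV files into a single output WAV.
    All inputs are assumed to have identical channels/width/rate; the output
    header is copied from the first file and the frame count is fixed up by
    the wave module when the file is closed."""
    with wave.open(out_path, "wb") as out:
        params_set = False
        for path in diphone_paths:
            with wave.open(path, "rb") as src:
                if not params_set:
                    out.setparams(src.getparams())
                    params_set = True
                out.writeframes(src.readframes(src.getnframes()))

# Hypothetical usage with made-up diphone file names:
# concatenate_diphones(["s-a.wav", "a-l.wav", "l-am.wav"], "word.wav")
```

A real system would also smooth the joins (e.g. cross-fade or pitch-synchronous overlap-add) so the boundaries between diphones are not audible.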