Ehab Essa's Technical Blog: 2006

Wednesday, September 06, 2006

Voice Conversion

The key purpose of a voice conversion system is to transform the voice of one speaker into that of another speaker. Given two speakers, the goal is therefore to design a transformation that makes the speech of the first speaker sound as though it were uttered by the second speaker.

A general voice conversion system works as follows. The system analyzes speech samples from two speakers, collecting the voice characteristics of the original (source) and desired (target) speakers. After learning the characteristics of each speaker, the system automatically creates a conversion rule that maps the original speaker's voice characteristics onto those of the desired speaker. This conversion rule is then applied to the original speech to create converted speech that exhibits the target speaker's voice characteristics.

This technology can be used in many fields. In speech synthesis, the output speech can be enriched using voice conversion. It can also be used in telephone voice translation, low-bit-rate speech coding, speaker adaptation, and so on. Voice conversion depends heavily on research into voice individuality.

Research indicates that the factors relevant to voice individuality fall into two types. The first is acoustic parameters, such as pitch frequency and formant frequencies and bandwidths, which reflect the voice source and the vocal tract of different people. The second is prosodic parameters, such as the timing, rhythm, and pauses of speech, which usually depend on the social conditions of different people.

The UPC Voice Conversion Toolkit

Saturday, August 26, 2006

Speech Processing Links

Largest Source Code Chinese site http://www.programsalon.com/

English Translation: http://www.google.com/translate?u=http%3A%2F%2Fwww.pudn.com%2F&langpair=zh%7Cen&hl=en&ie=UTF8

webcast.berkeley

Podcasts and Webcasts of UC Berkeley current and archived courses.

http://webcast.berkeley.edu/

CS 224S/LINGUIST 136/236 Speech Recognition and Synthesis Winter 2005

http://www.stanford.edu/class/linguist236/

CS 224S / LINGUIST 281 Speech Recognition and Synthesis Winter 2006

http://www.stanford.edu/class/cs224s/

Speech, Music and Hearing

http://www.speech.kth.se/

Speech Processing Group

http://www.busim.ee.boun.edu.tr/~speech/

Connexions

A rapidly growing collection of free scholarly materials and a powerful set of free software tools.

http://cnx.org/

Examples of Synthesized Speech

http://www.ims.uni-stuttgart.de/~moehler/synthspeech/examples.html

Friday, August 25, 2006

Arabic Speech Synthesis

Introduction

Speech is the primary means of communication between people. Speech synthesis is the artificial production of human speech, that is, the automatic generation of speech waveforms.

A diphone can be defined as a speech fragment that runs roughly from the middle of one phoneme to the middle of the next phoneme.
In this way, the transition between two consecutive speech sounds is encapsulated in the diphone and need not be calculated.

Overview of speech synthesis technology

A text-to-speech system consists of two parts: a front-end and a back-end. The front-end has two major tasks.

First, it takes the raw text and converts things like numbers and abbreviations into their written-out word equivalents. This process is called text normalization or pre-processing.
Then it assigns a phonetic transcription to each word, and divides and marks the text into prosodic units such as phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called letter-to-sound conversion. The output of the front-end is called the symbolic linguistic representation.

The other part, the back-end, takes the symbolic linguistic representation and converts it into actual sound output. The back end is often referred to as the synthesizer.

The Arabic language

Modern Standard Arabic is generally adopted as the common medium of communication throughout the Arab world today.
The Arabic alphabet contains 28 letters. There are 6 vowels in Arabic, 3 short and 3 long.

Research Challenges

The challenges fall into two categories.
First, text-to-speech challenges:

Memory requirements: in the concatenation method, memory requirements become significant mainly when a large text has to be converted into speech.
Limited to only one speaker: because the diphones must be recorded in advance, and because of the memory requirements.
Intelligibility: whether the system produces intelligible speech.
Naturalness: whether the voice sounds natural and familiar to the listener.

Second, Arabic language challenges:

The absence of diacritics in modern Arabic text is one of the most critical problems facing computer processing of Arabic text.
The Arabic language has its own phonological system, phonotactics, and syllabic structure.
Arabic script has to be converted into phonetic form by rules.

Research target

The main target is to build a text-to-speech system using diphone concatenation. Achieving that goal involves several sub-goals, such as:

Automatic diacritization of Arabic text, to save the user's time.
Implementing phonological rules (letter-to-sound conversion), which automatically determine the phonetic transcription of the incoming text.
Determining syllable types. Syllables in Arabic can be classified either by their length or by how they end.
Converting syllables into diphones, which are mapped to pre-recorded sounds.
Concatenating the diphone wave files into one file that will be played.
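
The diphone conversion step listed above can be sketched as follows. This is only an illustration under stated assumptions: the phoneme symbols, the "_" silence marker, the "a-b" naming scheme, and the DiphoneSketch.ToDiphones name are all hypothetical and are not taken from the system described in this post.

```csharp
using System;
using System.Collections.Generic;

public static class DiphoneSketch
{
    // Hypothetical marker for the silence before and after an utterance.
    private const string Silence = "_";

    // Turns a phoneme sequence into diphone names, one per adjacent pair,
    // so that every phoneme-to-phoneme transition is covered by one unit.
    public static List<string> ToDiphones(IList<string> phonemes)
    {
        var padded = new List<string> { Silence };
        padded.AddRange(phonemes);
        padded.Add(Silence);

        var diphones = new List<string>();
        for (int i = 0; i + 1 < padded.Count; i++)
        {
            // Assumed naming scheme: "left-right".
            diphones.Add(padded[i] + "-" + padded[i + 1]);
        }
        return diphones;
    }
}
```

A sequence of n phonemes yields n + 1 diphone names with this padding. Each name could then be looked up in a table that maps it to a pre-recorded wave file.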

Friday, August 18, 2006

Concatenation Wave Files using C# 2005

The WAVE file format is a subset of Microsoft's RIFF specification for the storage of multimedia files. A RIFF file starts out with a file header followed by a sequence of data chunks. A WAVE file is often just a RIFF file with a single "WAVE" chunk which consists of two sub-chunks -- a "fmt " chunk specifying the data format and a "data" chunk containing the actual sample data. Call this form the "Canonical form".

The main idea is to create a single header for all the wav files to be concatenated, and then write the sample data of each file, one after another, into a single output file.
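
A minimal sketch of that idea, not the code from the CodeProject article. It assumes every input is in the canonical form described above (a 44-byte header followed only by sample data) and that all inputs share the same sample format. The WaveConcatSketch name and its methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class WaveConcatSketch
{
    // Canonical header: RIFF descriptor (12) + "fmt " chunk (24) + "data" chunk header (8).
    private const int HeaderSize = 44;

    // Writes a 32-bit value in little-endian byte order, as RIFF requires.
    private static void WriteInt32LE(byte[] buffer, int offset, int value)
    {
        unchecked
        {
            buffer[offset] = (byte)value;
            buffer[offset + 1] = (byte)(value >> 8);
            buffer[offset + 2] = (byte)(value >> 16);
            buffer[offset + 3] = (byte)(value >> 24);
        }
    }

    // Concatenates canonical WAVE files given as byte arrays.
    public static byte[] Concatenate(IList<byte[]> waves)
    {
        if (waves == null || waves.Count == 0)
            throw new ArgumentException("At least one WAVE file is required.");

        // Total size of the sample data of all inputs.
        int dataLength = 0;
        foreach (byte[] wave in waves)
        {
            if (wave.Length < HeaderSize)
                throw new ArgumentException("Input is shorter than a canonical header.");
            dataLength += wave.Length - HeaderSize;
        }

        // Reuse the first file's header, then patch the two size fields.
        byte[] output = new byte[HeaderSize + dataLength];
        Array.Copy(waves[0], 0, output, 0, HeaderSize);
        WriteInt32LE(output, 4, 36 + dataLength);  // RIFF chunk size
        WriteInt32LE(output, 40, dataLength);      // "data" sub-chunk size

        // Append the sample data of each file in order.
        int position = HeaderSize;
        foreach (byte[] wave in waves)
        {
            int length = wave.Length - HeaderSize;
            Array.Copy(wave, HeaderSize, output, position, length);
            position += length;
        }
        return output;
    }
}
```

The inputs could be read with File.ReadAllBytes and the result written with File.WriteAllBytes. Files that carry extra chunks, or that differ in sample rate or channel count, would need real header parsing instead of the fixed 44-byte offset.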

More details are on CodeProject:
http://www.codeproject.com/useritems/Concatenation_Wave_Files.asp

References
http://ccrma.stanford.edu/courses/422/projects/WaveFormat/
http://www.sonicspot.com/guide/wavefiles.html
http://www.planetsourcecode.com/vb/scripts/ShowCode.asp?txtCodeId=31678&lngWId=1

Saturday, May 27, 2006

Curvature Scale Space

CSS Overview

To compute the CSS image of a planar curve, the curve is first parameterized by the arc length parameter u.

In the CSS image, the horizontal axis represents the normalized arc length u and the vertical axis represents the width of the Gaussian kernel.

The intersection of every horizontal line with the contours in the image indicates the locations of the curvature zero crossings on the corresponding evolved curve.
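
For reference, the standard curvature scale space formulation can be written as below. The symbols x, y, X, Y, g, σ and κ are the usual notation from the CSS literature and are assumed here; only the arc length parameter u and the Gaussian kernel width come from the text above.

```latex
% Curve parameterized by arc length u
\Gamma(u) = \bigl(x(u),\; y(u)\bigr)

% Evolved curve: each coordinate convolved with a Gaussian of width \sigma
X(u,\sigma) = x(u) \otimes g(u,\sigma), \qquad
Y(u,\sigma) = y(u) \otimes g(u,\sigma), \qquad
g(u,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-u^{2}/(2\sigma^{2})}

% Curvature of the evolved curve (subscripts denote derivatives with respect to u)
\kappa(u,\sigma) =
  \frac{X_u(u,\sigma)\, Y_{uu}(u,\sigma) - X_{uu}(u,\sigma)\, Y_u(u,\sigma)}
       {\bigl(X_u(u,\sigma)^{2} + Y_u(u,\sigma)^{2}\bigr)^{3/2}}
```

Under this notation the CSS image is the set of points (u, σ) for which κ(u, σ) = 0, which matches the description of the two axes above.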

Friday, April 28, 2006

Corner detection

Corner detection is an important task in various computer vision and image-understanding systems. A corner detector should satisfy a number of important criteria:

All the true corners should be detected
No false corners should be detected
Corner points should be well localized
Corner detector should be robust with respect to noise
Corner detector should be efficient

Localization refers to how accurately the position of a corner is found. This is critical in applications requiring the precise alignment of multiple images (for example, in registration of medical images).

Applications of Corner Detectors

The use of interest points (and thus corner detectors) to find corresponding points across multiple images is a key step in many image processing and computer vision applications. Some of the most notable examples are:

stereo matching
image registration (of particular importance in medical imaging)
stitching of panoramic photographs
object detection/recognition
motion tracking
robot navigation

The figure shows part of a hypothetical system to illustrate how a corner detector might be used in an automated assembly line. This assembly line fills triangular gift boxes with four different chocolates. However, the boxes must be positioned properly on the conveyor belt to ensure the chocolates are packed properly into them. An overhead camera captures a picture of each box as it passes underneath, and a computer compares it to a stored image of a properly aligned box. By finding the corners in each image, the amount by which the box needs to be rotated can easily be computed.

Evaluating and Comparing Corner Detectors
Each of the corner detectors is evaluated on an artificial test image containing different corner types and on two real-world images.

The artificial image contains many corner types (L-Junction, Y-Junction, T-Junction, Arrow-Junction, and X-Junction).

Comparison of Select Corner Detectors

http://www.cim.mcgill.ca/~dparks/CornerDetector/mainComparison.htm

Thursday, April 27, 2006

Image Processing in C#

Bitmap Class

The Bitmap class is a specialization of the Image class. The PictureBox class holds a reference of type Image, but Image is an abstract class, so it cannot be instantiated directly; a PictureBox therefore holds instances of classes derived from Image, one of which is the Bitmap class.

Constructor: The Bitmap class has 12 constructors that build a Bitmap object from different parameters. A Bitmap can be constructed from another image, from the file path of an image given as a string, or as a new bitmap of a given size with Bitmap( int width , int height ).

PixelFormat: A public property that gets the pixel format of this image object.

Clone: An overloaded method that returns a copy of the object on which it was invoked. The parameterless version is inherited from the Image class; the Bitmap overloads copy a given rectangle of the bitmap in a given PixelFormat.

GetPixel: The public method that gets the color of the specified pixel.

SetPixel: The public method that sets the color of the specified pixel.

LockBits and UnlockBits: Public methods that lock a specified area of the image into system memory and unlock it again. LockBits takes a Rectangle object specifying the area to be locked, followed by two enumeration values: an ImageLockMode giving the access level and a PixelFormat giving the format of the data. Both enumerations are described below. LockBits returns a BitmapData object, and UnlockBits takes that BitmapData object as its parameter.

BitmapData LockBits( Rectangle , ImageLockMode , PixelFormat );

ImageLockMode Enumeration
Specifies flags that are passed to the flags parameter of the LockBits method. It has four members.
ReadOnly: Specifies that a portion of the image is locked for reading.
ReadWrite: Specifies that a portion of the image is locked for reading or writing.
UserInputBuffer: Specifies that the buffer used for reading or writing pixel data is allocated by the user.
WriteOnly: Specifies that a portion of the image is locked for writing.

BitmapData class
Scan0: The address of the first byte in the locked array. If the whole image is locked, this is the first byte of the image.
Stride: The width, in bytes, of a single row of pixel data in the locked array. This width is a multiple, or possibly a sub-multiple, of the pixel dimensions of the image and may be padded out to include a few more bytes. The reason is explained below.
Width: The width of the locked image.
Height: The height of the locked image.
PixelFormat: The actual pixel format of the data.

PixelFormat Enumeration
The pixel format defines the number of bits of memory associated with one pixel of data. It also defines the order of the color components within a single pixel of data. Normally, PixelFormat.Format24bppRgb is used. PixelFormat.Format24bppRgb specifies that the format is 24 bits per pixel; 8 bits each are used for the red, green, and blue components. In a 24-bit image, each pixel consists of 3 bytes, one each for red, green, and blue. The first byte of a pixel holds the blue component, the second byte the green component, and the third byte the red component; together they form the color of that pixel.

Basic Layout of the Locked Array
Scan0 is a pointer to the first byte of the image's pixel array, which contains Height rows of Stride bytes each, as shown in the diagram.

Each row holds the data of Width pixels, and each pixel takes a number of bytes fixed by the PixelFormat. The space taken by the pixels of one row is therefore:
Space taken by one row of pixels = Width (no. of pixels) X bytes per pixel (from the PixelFormat)
This space is approximately equal to the Stride, but not always exactly equal to it. For efficiency reasons, the Stride of an image is always a multiple of 4. For example, if an image is in 24 bits per pixel and has 25 pixels in each row, then each row needs 75 bytes (3 X 25 = 75), but 75 is not a multiple of 4. Hence the next multiple of 4 is used, which in this case is 76, and 1 byte in every row remains unused, as shown by the black area in the above diagram. If the space needed by one row of pixels is already a multiple of 4, then the Stride is equal to it and there is no unused space.
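
The rounding described above can be written as a small calculation. This is a sketch only: the StrideSketch class and its method names are hypothetical, and real code should read the value from BitmapData.Stride rather than compute it.

```csharp
using System;

public static class StrideSketch
{
    // Rounds the size of one row up to the next multiple of 4 bytes (32 bits).
    public static int ComputeStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Number of unused bytes at the end of each row.
    public static int ComputePadding(int width, int bitsPerPixel)
    {
        int rowBytes = (width * bitsPerPixel + 7) / 8;
        return ComputeStride(width, bitsPerPixel) - rowBytes;
    }
}
```

For the example in the text, ComputeStride(25, 24) gives 76 and ComputePadding(25, 24) gives 1; for a row of 24 pixels the stride is 72 and the padding is 0.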
Iterating through the Image
First of all create the Bitmap object as:

Bitmap image = new Bitmap( "c:\\images\\image.gif" );

Then get the BitmapData object from it by calling the LockBits method as:
BitmapData data = image.LockBits( new Rectangle( 0 , 0 , image.Width , image.Height ) , ImageLockMode.ReadWrite , PixelFormat.Format24bppRgb );
Then you can iterate in the image as:

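unsafe {

    // Pointer to the first byte of the locked pixel data.
    byte* ptr = ( byte* )( data.Scan0 );

    for( int i = 0 ; i < data.Height ; i++ )
    {
        for( int j = 0 ; j < data.Width ; j++ )
        {
            // write the logic implementation here
            // ptr[0] is blue, ptr[1] is green, ptr[2] is red
            ptr += 3;
        }
        // Skip the unused bytes at the end of the row.
        ptr += data.Stride - data.Width * 3;
    }
}

// Release the locked area when done.
image.UnlockBits( data );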

Here the unsafe keyword marks a block in which pointers may be used (the code must be compiled with the /unsafe option), and the statement:

ptr += data.Stride - data.Width * 3;

shows that you need to skip the unused space.
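
The same traversal can also be sketched without pointers, on a managed byte array that holds the locked data (for instance, one filled from Scan0 with Marshal.Copy). This is an illustrative sketch: the StrideWalkSketch class, the ToGrayscale method, and the simple averaging are assumptions, and the stride is assumed to be positive.

```csharp
using System;

public static class StrideWalkSketch
{
    // Converts a 24-bit buffer (blue, green, red byte order) to grayscale in place.
    // stride is the number of bytes per row, including any unused padding bytes.
    public static void ToGrayscale(byte[] pixels, int width, int height, int stride)
    {
        for (int y = 0; y < height; y++)
        {
            // Start of row y; using the stride here skips the padding of earlier rows.
            int index = y * stride;
            for (int x = 0; x < width; x++)
            {
                byte blue = pixels[index];
                byte green = pixels[index + 1];
                byte red = pixels[index + 2];

                // Plain average of the three components (an assumed, simple choice).
                byte gray = (byte)((red + green + blue) / 3);
                pixels[index] = gray;
                pixels[index + 1] = gray;
                pixels[index + 2] = gray;

                index += 3; // next pixel
            }
        }
    }
}
```

Computing the row start from the stride has the same effect as adding data.Stride - data.Width * 3 at the end of each row in the pointer version, and the padding bytes are never touched.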
