Sinhala Speech Emotion Recognition For Children-Part 1
Voices are an important factor in identifying emotional expression, because speech is the most essential communication channel enriched with emotion. This project aims to design and implement an emotional intelligence system capable of identifying the emotions of children between 6 months and 6 years of age, for speech pathologists who treat speech-impaired children.
Available Emotions
There are 3 emotions available: “Happy”, “Neutral”, and “Sad”.
Main Logic
Obtain different features such as power, pitch, and vocal tract configuration from the speech signal, and convert the speech waveform into a parametric representation at a relatively lower data rate for analysis.
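As a rough illustration of that idea, here is a minimal feature-extraction sketch using Librosa. The function name and the exact feature choices below (MFCCs as a proxy for vocal tract configuration, signal energy for power, chroma for pitch content) are my own assumptions; the project’s actual extraction function is the topic of the next part:

```python
import librosa
import numpy as np

def extract_features_sketch(file_name):
    """Illustrative only: reduce a speech waveform to one fixed-length
    feature vector at a much lower data rate than the raw samples."""
    # Load the audio at its native sample rate as a mono signal
    signal, sample_rate = librosa.load(file_name, sr=None, mono=True)
    # MFCCs roughly characterise the vocal tract configuration
    mfccs = np.mean(librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=40), axis=1)
    # Mean squared amplitude as a simple power estimate
    power = float(np.mean(signal ** 2))
    # Chroma features capture pitch-class content
    chroma = np.mean(librosa.feature.chroma_stft(y=signal, sr=sample_rate), axis=1)
    # 40 MFCCs + 1 power value + 12 chroma bins = 53 numbers per file
    return np.hstack([mfccs, power, chroma])
```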
Requirements
- Python 3.6+
First, using pip, we’ll need to install some dependencies:
- Librosa — for extracting audio features
- NumPy
- SoundFile
- Scikit-learn — for training the model
- PyAudio
- pandas
- wave
- tqdm==4.28.1
- matplotlib==2.2.3

pip3 install librosa==0.6.3 numpy soundfile==0.9.0 sklearn pyaudio==0.2.11
If you have already installed these libraries but are unsure whether any are missing, check-requirements-txt is a tool that automatically checks for missing packages against requirements.txt.
To install all of these packages into the current Python environment:
pip3 install -r requirements.txt
Main Flow
- Preparing the Dataset — downloading and converting the dataset into a form suitable for feature extraction
- Loading the Dataset — loading the dataset in Python, which involves extracting the audio features (power, pitch, and vocal tract configuration) from the speech signal
- Training the Model — simply training it on a suitable scikit-learn model
- Testing the Model — measuring accuracy (a minimal sketch of these last two steps follows this list)
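To make the flow concrete, here is a minimal sketch of the training and testing steps with scikit-learn. The stand-in random data, the 53-dimensional feature vectors, and the choice of MLPClassifier are my own illustrative assumptions, not necessarily what this project settles on:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-in data: in the real flow, X holds one extracted feature vector
# per audio file and y holds the matching emotion labels.
rng = np.random.RandomState(42)
X = rng.rand(120, 53)
y = rng.choice(["happy", "neutral", "sad"], size=120)

# Hold out a quarter of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Train a simple multi-layer perceptron classifier
model = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500, random_state=42)
model.fit(X_train, y_train)

# Measure accuracy on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
```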
Let’s get this started…
First, we need a dataset to train on 🤓
Dataset
This project uses 4 datasets, all already downloaded and formatted in the data folder.
1. RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song, which contains 24 actors (12 male, 12 female).
2. TESS: Toronto Emotional Speech Set. A set of 200 target words were spoken in the carrier phrase “Say the word _____” by two actresses (aged 26 and 64 years).
3. EMO-DB: a database of emotional utterances spoken by actors.
4. Custom: a somewhat unbalanced, noisy dataset located in data/train-custom for training and data/test-custom for testing, in which you can easily add or remove recording samples. Convert the raw audio to a 16000 Hz sample rate with a mono channel, and append the emotion to the end of the audio file name, separated by ‘_’ (e.g. “harmed1_happy.wav” will be parsed automatically as happy), as the sketch below shows.
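As a small illustration of that naming convention, a loader could recover the label like this (a sketch only; the project’s actual parsing code may differ):

```python
import os

def emotion_from_filename(path):
    """Parse the emotion label from names like 'harmed1_happy.wav'."""
    name = os.path.splitext(os.path.basename(path))[0]  # 'harmed1_happy'
    return name.split("_")[-1]                          # 'happy'

print(emotion_from_filename("data/train-custom/harmed1_happy.wav"))  # happy
```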
The main thing is that if you download and extract a dataset yourself, you need to lower the sample rate of the audio files, which makes it easy to extract the features with Librosa. In this project there is no need to do that, because the dataset is already prepared.
convert_wavs.py
Converts audio samples into a form suitable for feature extraction.
If you want to convert your own audio samples to a 16000 Hz sample rate and a mono channel as suggested, you need this Python script along with ffmpeg installed on your machine.
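In case it helps to see the idea, the conversion essentially boils down to an ffmpeg call like the one below. This is a sketch of the approach rather than the script’s exact contents, and it assumes ffmpeg is available on your PATH:

```python
import subprocess

def convert_audio(src_path, dst_path):
    """Resample an audio file to 16000 Hz mono using ffmpeg."""
    subprocess.check_call([
        "ffmpeg", "-y",    # overwrite the output file if it exists
        "-i", src_path,    # input audio file
        "-ac", "1",        # mix down to a single (mono) channel
        "-ar", "16000",    # resample to 16000 Hz
        dst_path,
    ])

# Example: convert_audio("my_recording.wav", "my_recording_16k.wav")
```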
Grid Search
Grid search results are already provided in the grid folder, but if you want to tune the various grid search parameters in parameters.py, you can run the script grid_search.py with:
python grid_search.py
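A grid search of this kind can be written with scikit-learn’s GridSearchCV. The parameter grid below is my own illustrative example, not the actual contents of parameters.py:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Stand-in data; grid_search.py would use the real extracted features
rng = np.random.RandomState(0)
X = rng.rand(120, 53)
y = rng.choice(["happy", "neutral", "sad"], size=120)

# Illustrative parameter grid (the project's real grids live in parameters.py)
param_grid = {
    "hidden_layer_sizes": [(200,), (300,), (400,)],
    "alpha": [0.001, 0.005, 0.01],
}

# 3-fold cross-validated search over every parameter combination
grid = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV score: {:.2f}%".format(grid.best_score_ * 100))
```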
Testing
You can test your own voice by executing the following command:
python test.py
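Under the hood, a test script of this kind typically records a short clip from the microphone and feeds its features to the trained model. Here is a sketch of the recording half using PyAudio; the helper below is my own illustration, not necessarily how test.py is written:

```python
import wave
import pyaudio

def record_wav(out_path, seconds=4, rate=16000):
    """Record `seconds` of mono 16-bit audio from the default microphone."""
    chunk = 1024
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                     input=True, frames_per_buffer=chunk)
    frames = [stream.read(chunk) for _ in range(int(rate / chunk * seconds))]
    stream.stop_stream()
    stream.close()
    pa.terminate()

    # Write the captured frames to a standard WAV file
    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))

record_wav("my_voice.wav")
# From here, the clip goes through the same feature extraction as the
# training data, and the trained model predicts one of the three emotions.
```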
If you enjoyed this piece, I’d love it if you hit the clap button 👏
See you soon with the next part, on how to create the function that handles extracting the features.