Sinhala Speech Emotion Recognition For Children-Part 1
Voices are an important factor in identifying emotional expression, because speech is the most essential communication channel enriched with emotion. This project aims to design and implement an emotional intelligence system capable of identifying the emotions of children between 6 months and 6 years of age, for speech pathologists who treat speech-impaired children.
Available Emotions
There are 3 emotions available: “Happy”, “Neutral”, and “Sad”.
Main Logic
Obtain different features such as power, pitch, and vocal tract configuration from the speech signal, and convert the speech waveform into a parametric representation at a relatively lower data rate for analysis.
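As a rough illustration of that idea, here is a minimal feature-extraction sketch using Librosa. The function name and the exact feature choices below (MFCCs as a proxy for vocal tract configuration, signal energy for power, chroma for pitch content) are my own assumptions; the project’s actual extraction function is the topic of the next part:

```python
import librosa
import numpy as np

def extract_features_sketch(file_name):
    """Illustrative only: reduce a speech waveform to one fixed-length
    feature vector at a much lower data rate than the raw samples."""
    # Load the audio at its native sample rate as a mono signal
    signal, sample_rate = librosa.load(file_name, sr=None, mono=True)
    # MFCCs roughly characterise the vocal tract configuration
    mfccs = np.mean(librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=40), axis=1)
    # Mean squared amplitude as a simple power estimate
    power = float(np.mean(signal ** 2))
    # Chroma features capture pitch-class content
    chroma = np.mean(librosa.feature.chroma_stft(y=signal, sr=sample_rate), axis=1)
    # 40 MFCCs + 1 power value + 12 chroma bins = 53 numbers per file
    return np.hstack([mfccs, power, chroma])
```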
Requirements
- Python 3.6+
First, using pip, we’ll need to install some dependencies:
- Librosa — for extracting audio features
- NumPy
- SoundFile
- Scikit-learn — for training the model
- PyAudio
- pandas
- wave
- tqdm==4.28.1
- matplotlib==2.2.3

pip3 install librosa==0.6.3 numpy soundfile==0.9.0 sklearn pyaudio==0.2.11
If you have already installed these libraries but are unsure whether any are missing, check-requirements-txt is a tool that automatically checks for missing packages against requirements.txt.
To install all of these packages into the current Python environment:
pip3 install -r requirements.txt
Main Flow
- Preparing the Dataset — downloading and converting the dataset into a form suitable for feature extraction
- Loading the Dataset — loading the dataset in Python, which involves extracting the audio features (power, pitch, and vocal tract configuration) from the speech signal
- Training the Model — simply training it on a suitable scikit-learn model
- Testing the Model — measuring accuracy (a minimal sketch of these last two steps follows this list)
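To make the flow concrete, here is a minimal sketch of the training and testing steps with scikit-learn. The stand-in random data, the 53-dimensional feature vectors, and the choice of MLPClassifier are my own illustrative assumptions, not necessarily what this project settles on:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-in data: in the real flow, X holds one extracted feature vector
# per audio file and y holds the matching emotion labels.
rng = np.random.RandomState(42)
X = rng.rand(120, 53)
y = rng.choice(["happy", "neutral", "sad"], size=120)

# Hold out a quarter of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Train a simple multi-layer perceptron classifier
model = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500, random_state=42)
model.fit(X_train, y_train)

# Measure accuracy on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
```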
Let’s get this started…
First, we need a dataset to train on 🤓
Dataset
This project uses 4 datasets, all already downloaded and formatted in the data folder.
1. RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song, which contains 24 actors (12 male, 12 female).
2. TESS: Toronto Emotional Speech Set. A set of 200 target words were spoken in the carrier phrase “Say the word _____” by two actresses (aged 26 and 64 years).
3. EMO-DB: a database of emotional utterances spoken by actors.
4. Custom: a somewhat unbalanced, noisy dataset located in data/train-custom for training and data/test-custom for testing, in which you can easily add or remove recording samples. Convert the raw audio to a 16000 Hz sample rate with a mono channel, and append the emotion to the end of the audio file name, separated by ‘_’ (e.g. “harmed1_happy.wav” will be parsed automatically as happy), as the sketch below shows.
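As a small illustration of that naming convention, a loader could recover the label like this (a sketch only; the project’s actual parsing code may differ):

```python
import os

def emotion_from_filename(path):
    """Parse the emotion label from names like 'harmed1_happy.wav'."""
    name = os.path.splitext(os.path.basename(path))[0]  # 'harmed1_happy'
    return name.split("_")[-1]                          # 'happy'

print(emotion_from_filename("data/train-custom/harmed1_happy.wav"))  # happy
```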
The main thing is that if you download and extract a dataset yourself, you need to lower the sample rate of the audio files, which makes it easy to extract the features with Librosa. In this project there is no need to do that, because the dataset is already prepared.
convert_wavs.py
Converts audio samples into a form suitable for feature extraction.
If you want to convert your own audio samples to a 16000 Hz sample rate and a mono channel as suggested, you need this Python script along with ffmpeg installed on your machine.
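In case it helps to see the idea, the conversion essentially boils down to an ffmpeg call like the one below. This is a sketch of the approach rather than the script’s exact contents, and it assumes ffmpeg is available on your PATH:

```python
import subprocess

def convert_audio(src_path, dst_path):
    """Resample an audio file to 16000 Hz mono using ffmpeg."""
    subprocess.check_call([
        "ffmpeg", "-y",    # overwrite the output file if it exists
        "-i", src_path,    # input audio file
        "-ac", "1",        # mix down to a single (mono) channel
        "-ar", "16000",    # resample to 16000 Hz
        dst_path,
    ])

# Example: convert_audio("my_recording.wav", "my_recording_16k.wav")
```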
Grid Search
Grid search results are already provided in the grid folder, but if you want to tune the various grid search parameters in parameters.py, you can run the script grid_search.py with:
python grid_search.py
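A grid search of this kind can be written with scikit-learn’s GridSearchCV. The parameter grid below is my own illustrative example, not the actual contents of parameters.py:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Stand-in data; grid_search.py would use the real extracted features
rng = np.random.RandomState(0)
X = rng.rand(120, 53)
y = rng.choice(["happy", "neutral", "sad"], size=120)

# Illustrative parameter grid (the project's real grids live in parameters.py)
param_grid = {
    "hidden_layer_sizes": [(200,), (300,), (400,)],
    "alpha": [0.001, 0.005, 0.01],
}

# 3-fold cross-validated search over every parameter combination
grid = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV score: {:.2f}%".format(grid.best_score_ * 100))
```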
Testing
You can test your own voice by executing the following command:
python test.py
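Under the hood, a test script of this kind typically records a short clip from the microphone and feeds its features to the trained model. Here is a sketch of the recording half using PyAudio; the helper below is my own illustration, not necessarily how test.py is written:

```python
import wave
import pyaudio

def record_wav(out_path, seconds=4, rate=16000):
    """Record `seconds` of mono 16-bit audio from the default microphone."""
    chunk = 1024
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                     input=True, frames_per_buffer=chunk)
    frames = [stream.read(chunk) for _ in range(int(rate / chunk * seconds))]
    stream.stop_stream()
    stream.close()
    pa.terminate()

    # Write the captured frames to a standard WAV file
    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))

record_wav("my_voice.wav")
# From here, the clip goes through the same feature extraction as the
# training data, and the trained model predicts one of the three emotions.
```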
If you enjoyed this piece, I’d love it if you hit the clap button 👏
See you soon with the next part, on how to create the function that handles extracting the features.