Speech detection using Mel-Frequency Cepstral Coefficients (MFCC) in R Studio!

A practical guide to implementing speech detection with the help of MFCC (Mel-frequency Cepstral Coefficient) feature extraction.

Rutvij Bhutaiya
Analytics Vidhya

--

The objective of the study is to extract features from a .wav file. Speech also reflects the speaker's mood and emotional condition while talking.

For example, when our favorite team is winning the game and a friend calls us, we generally talk to that friend with excitement in our tone. Similarly, if you had a fight with your boyfriend in the morning, your tone in a conversation with a cab driver could be angry or sad (depending on your emotional condition).

In this experiment, we have shown the practical implementation of the feature extraction from the speech signals.

Here, we do not explain how the audio signal is formed or the steps of MFCC, such as pre-emphasis and the Mel filter bank. Many studies and articles are available on these topics; you can refer to the internet or learn from — here.

Kindly note that, apart from MFCC, there are many other techniques for feature extraction from an audio signal, such as Linear Prediction Coefficients (LPC), Discrete Wavelet Transform (DWT), etc.

Photo by Jason Rosewell on Unsplash

The project can be applied from several business perspectives, such as:

  1. Telecallers: Which person on a team is more effective, and why? Can we identify parameters (tone, length of conversation, prospect interaction, etc.) from the best telecallers and use them to train junior telecallers?
  2. Sales: Better, more accurate sales-call analysis, done by combining NLP with speech analysis. Which salesperson talks about what, with which emotions and which tone? What were the client's reactions and tone? These and many more parameters can help a company fine-tune its salespeople.
  3. Recruitment: Past interview data, combined with data on the current best-performing employees, can help firms handpick the brightest minds in the future. Interviewers often say that it takes less than 2–4 minutes of initial interaction to decide whether an interviewee is a fit for the company; that judgment comes from long interviewing experience and from the candidate's way of talking, confidence, and emotions.
Photo by Magic Bowls on Unsplash

MFCC (Mel-frequency Cepstral Coefficient) implementation in R Studio:

For the study, we used the following libraries in R Studio:

library(wrassp)
library(readr)
library(tuneR)
library(signal)
library(oce)

An important part of any data analysis is data collection. For the study, we classified a few .wav files by tone/mood: angry, excited, normal, sad, etc.

However, before working on the actual data, I'll give you a basic understanding of why MFCC is important for feature extraction, using a spectrogram. Here we take a simple .wav file and create two spectrograms: one without MFCC and one from the MFCC data matrix.

Load the .wav file data:

path = 'male.wav'
audio = readWave('male.wav')

Check the components and structure of the .wav file data:

> str(audio)
Formal class ‘Wave’ [package “tuneR”] with 6 slots
..@ left : int [1:408226] -79 -104 -153 -175 -201 -209 -231 -224 -233 -221 …
..@ right : num(0)
..@ stereo : logi FALSE
..@ samp.rate: int 8000
..@ bit : int 16
..@ pcm : logi TRUE

We also faced a few limitations in practical speech detection analysis:

1. If there are 1,000 or 5,000 audio files, we need to load each and every file. Hence we need to repeat the same code to load each file into the system; that is, we need to call readWave() every time.

2. In real life, not every file is the same size. For example, in telecaller data, no two calls have the same number of signal elements. Hence, to train a speech detection model, we need to make sure every file has the same element size, e.g. @left: int [1:1000000], etc.

Take these two points as a note; I'll explain further in the study. A minimal sketch of how both can be handled follows.
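As a rough sketch of both points (an illustration, not the article's code; the calls/ folder name and the target length are assumptions), one can loop readWave() over a folder and truncate or zero-pad every signal to a fixed element size:

library(tuneR)

# Hypothetical folder of recordings; adjust the path to your own data.
files = list.files('calls/', pattern = '\\.wav$', full.names = TRUE)
target.len = 408226   # assumed fixed element size for every file

signals = lapply(files, function(f) {
  s = readWave(f)@left                      # point 1: readWave() for every file
  if (length(s) >= target.len) {
    s[1:target.len]                         # point 2: truncate longer recordings
  } else {
    c(s, rep(0L, target.len - length(s)))   # point 2: zero-pad shorter recordings
  }
})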

Now, what does this .wav file look like in signal form? For that, we plot the graph using plot().

plot(audio@left[10:408226], type = 'l', col = 'seagreen', xlab = 'Elements / Time', ylab = 'Freq', main = 'Audio Frequency Wave')

Note: In this audio file, a male speaker talks continuously, hence we see no break in the signal. We selected elements 10 through 408226.

Feature Extraction with MFCC in R Studio:

For more details on the melfcc() function, one can also visit — RDocumentation

sr = audio@samp.rate         # sampling rate in Hz

mfcc.m = melfcc(audio, sr = sr,
                wintime = 0.015,   # window length in seconds
                hoptime = 0.005,   # hop between successive windows in seconds
                # numcep = 3,      # by default melfcc() returns 12 features
                sumpower = TRUE,   # frequency scale transformation based on the power spectrum
                nbands = 40,       # number of spectral bands (filter banks)
                bwidth = 1,        # width of the spectral bands
                preemph = 0.95,    # pre-emphasis coefficient
                frames_in_rows = TRUE)

The output of the melfcc() function is stored in mfcc.m, and it has 12 features (one column per cepstral coefficient); a quick check of its dimensions is shown below the output matrix.

Output Matrix
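To quickly verify the shape of the output (the numbers are from this article's run):

dim(mfcc.m)   # 10203 12 -> 10203 frames (rows) x 12 cepstral coefficients (columns)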

A few important notes:

  • The mfcc.m dataset is a 10203 × 12 matrix, i.e. it has 10203 observations (rows).
  • The number of rows depends on the length of the audio file; as you remember, the ‘audio’ file has elements from 0:408226.
  • If we have 1,000 or 5,000 .wav files, we need to fix the same element range, 0:408226, for all the .wav files, whatever their original lengths.
  • This is very important because datasets from Kaggle and UCI generally have .wav files of a fixed length (say, 0:50000 elements, the same for every file), but real-world data files do not come in equal lengths.
  • Why is it so important to set all the .wav files to the same number of elements? Because, as we observed from the melfcc() output, the function creates n observations (10203 in the example above), and this consistency will help us in the further study of speech detection.
  • To run a supervised learning model to predict an outcome (say, whether customers are satisfied with a particular telecaller), we can use speech detection to classify good versus rude telecallers from their interactions with customers, and then act accordingly rather than waiting for a customer complaint or losing a customer to a telecaller's bad behaviour.
  • For example, suppose we extracted features from 5,000 .wav data files in matrix form; to use these data as input to a supervised learning model, we need to convert those 5,000 matrices into a data frame.
  • For the data frame, we simply convert each matrix into a vector first and then bind the vectors together into a data frame (see the sketch after this list).
  • Now, if the vectors differ in size, the data frame cannot be created; hence it is very important to set the same element size for all 5,000 .wav data files.
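As a minimal sketch of that conversion (an illustration, not the article's code; it reuses the signals list from the earlier sketch and assumes an 8 kHz, 16-bit format):

# Extract MFCC features from every equal-length signal.
mfcc.list = lapply(signals, function(s) {
  wav = Wave(s, samp.rate = 8000, bit = 16)   # assumed sampling rate and bit depth
  melfcc(wav, sr = wav@samp.rate, wintime = 0.015, hoptime = 0.005)
})

# Equal-length signals give equal-sized matrices, so as.vector() yields
# equal-length vectors that can be row-bound into one data frame.
mfcc.df = as.data.frame(do.call(rbind, lapply(mfcc.list, as.vector)))
# One row per .wav file; add a label column (e.g. satisfied / not satisfied) for training.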

A little technical! Take some rest and start again :) Listen to the song — Here

Photo by Maarten van den Heuvel on Unsplash

Okay, until now we have talked about feature extraction from the sound signal and how we can use it from a business perspective.

But in this last part, I'll also show the spectrogram from the raw signal data and from the MFCC feature data. We'll compare both spectrograms and check whether MFCC feature extraction really makes a difference in speech detection.

Before plotting the spectrograms, we did two things:

  1. Defined the parameters for the spectrogram, and
  2. Set a global function called spectrogram {…} (you can give the function any name)
1. # Determine duration
dur = length(audio@left) / audio@samp.rate   # duration of the raw signal in seconds
dur # in seconds

# Determine sample rate
fs = audio@samp.rate
fs # in Hz

## Spectrogram parameters
nfft = 512      # Fast Fourier Transform size: 512 (default), 1024 or 2048
window = 1500
overlap = 500


2. # Create a global function called 'spectrogram'
spectrogram = function(a) {
  # Compute the short-time Fourier transform
  spec = specgram(x = a,
                  n = nfft,
                  Fs = fs,
                  window = window,
                  overlap = overlap)

  # Structure of 'spec'
  str(spec)

  P = abs(spec$S)   # take the magnitude; without abs() the log below creates NAs

  # Normalize
  P = P / max(P)

  # Convert to dB
  P = 10 * log10(P)

  # Configure time axis
  t = spec$t

  # Plot spectrogram
  imagep(x = t,
         y = spec$f,
         z = t(P),
         col = oce.colorsViridis,
         ylab = 'Frequency in Hz',
         xlab = 'Time in Sec',
         main = 'Spectrogram',
         drawPalette = TRUE,
         decimate = FALSE)
}

# Spectrogram without MFCC
without.mfcc = spectrogram(as.matrix(audio@left))

# Spectrogram with MFCC
with.mfcc = spectrogram(mfcc.m)

Without MFCC:

With MFCC:

After comparing both spectrograms, we can clearly see that the frequencies have been converted to the Mel frequency scale.

For the detailed code in R Studio, you can also visit — Here.

On GitHub, I added two code files:

  1. ‘Basics.R’ for feature extraction and the spectrogram,
  2. ‘MFCC Function + Spectrogram Function.R’ for more than one .wav file. In this file, I have processed four .wav files, but one can load more .wav files according to their study requirements.
