[p] Title: I Created A Neural Network To Quickly Detect Spoken Vowels 20 Times Per Second

Quick disclaimer: I am aware that there is an international standard for labeling the different recognized speech sounds (phonemes), but I wanted ASCII or extended ASCII to keep the programming simple, so I use a different nomenclature. Besides, it's easier for me to recognize and read. Please forgive me.

So I have often wondered about the real rules that govern how people actually speak. For instance, people use something similar to a "glottal stop" to end words like "don't" and "that"; the "t" is not pronounced. Or how "r" is almost always used as a vowel (in American English). My favorite examples are "fur", "fir", and "-fer". All three are pronounced identically, and the typical "i, u, e" vowels are not pronounced at all. It's just pronounced "fr".

One day I was looking at a spectrogram of my voice, and I noticed some patterns. Vowels like "ah" in "stop" and "Bob" look very different from other vowels like "ee" in "green" and "bee". When we speak, the lowest and most prominent frequency is called the "fundamental", and there are many other frequencies at multiples of it called "harmonics". The sound "ah" has high volume on many of the harmonics, but the sound "ee" has a big gap where the harmonics are much, much smaller. Every vowel had its own combination of harmonic values.
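As a rough illustration of the fundamental-plus-harmonics idea (not the code from this project), the sketch below builds two synthetic frames and reads the spectrum at each harmonic with NumPy. The sample rate, the 120 Hz fundamental, the number of harmonics, and both amplitude profiles are made-up assumptions chosen to mimic the "many strong harmonics" and "big gap" shapes described above; they are not measurements.

```python
import numpy as np

SAMPLE_RATE = 16_000    # Hz, assumed for this sketch
FRAME_SECONDS = 0.05    # one 50 ms frame
F0 = 120.0              # assumed fundamental frequency in Hz

# Made-up harmonic amplitudes (harmonic 1 first); not measured from real speech.
AH_LIKE = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]      # strong on many harmonics
EE_LIKE = [1.0, 0.8, 0.05, 0.05, 0.05, 0.05, 0.5, 0.4]  # gap at harmonics 3-6

def synth_frame(harmonic_amps, f0=F0, sr=SAMPLE_RATE, seconds=FRAME_SECONDS):
    """Sum of sine waves at f0, 2*f0, 3*f0, ... with the given amplitudes."""
    t = np.arange(int(sr * seconds)) / sr
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(harmonic_amps))

def relative_levels(frame, f0=F0, sr=SAMPLE_RATE, n_harmonics=8):
    """Spectrum magnitude at the bin nearest each harmonic, scaled so the peak is 1."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    bins = [np.argmin(np.abs(freqs - f0 * k)) for k in range(1, n_harmonics + 1)]
    levels = spectrum[bins]
    return levels / levels.max()

ah = relative_levels(synth_frame(AH_LIKE))
ee = relative_levels(synth_frame(EE_LIKE))
print("ah-like:", np.round(ah, 2))
print("ee-like:", np.round(ee, 2))
```

With real speech the fundamental is not known in advance and drifts from frame to frame, so a real pipeline would need to estimate it or feed the network a wider slice of the spectrum.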

So I tried to create a set of rules by hand to classify different frequency patterns as different vowels. I could easily tell them apart by looking at them, but would the rules hold up to the test? I wrote a program to guess vowels this way, but it did not work well. There are so many knobs to turn when creating the rules. And because the patterns vary, I would also have to go through and determine the allowed range for every value, which would make the rules much more complex.
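To make the "knobs" problem concrete, here is a minimal sketch of what a hand-written rule might look like for telling just two vowels apart. The function name, the thresholds, and the rule itself are hypothetical, invented for illustration; they are not the rules that were actually tried.

```python
import numpy as np

# Hypothetical thresholds: each one is a knob that has to be tuned by hand.
SILENCE_THRESHOLD = 0.01  # peak harmonic level below this counts as "no vowel"
GAP_THRESHOLD = 0.2       # mean relative level of harmonics 3-6 below this is a "gap"

def classify_by_rules(levels):
    """Guess 'ah', 'ee', or 'none' from a list of harmonic levels (harmonic 1 first)."""
    levels = np.asarray(levels, dtype=float)
    peak = levels.max()
    if peak < SILENCE_THRESHOLD:
        return "none"
    relative = levels / peak
    # Rule: a deep gap in harmonics 3-6 suggests "ee"; otherwise assume "ah".
    if relative[2:6].mean() < GAP_THRESHOLD:
        return "ee"
    return "ah"

print(classify_by_rules([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]))
print(classify_by_rules([1.0, 0.8, 0.05, 0.05, 0.05, 0.05, 0.5, 0.4]))
```

Two vowels already need two thresholds and a choice of which harmonics to look at. Covering a dozen vowels this way means many more rules, each with its own range to tune.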

I started doing it by hand: tweak the values, see how it worked, then tweak the values again, and so on.

That's when it hit me! I'm doing what a neural network trainer does. I could use one to do this for me!

So I researched the nitty-gritty of getting one set up, recorded a lot of data (~45 minutes' worth), and trained the model. It took a few days to figure out some problems, but I eventually got it working.

I used Python with the TensorFlow + Keras library suite to create and train the neural network, PyAudio for recording training data and capturing realtime audio, and NumPy for data analysis. The neural network had 264 input nodes, 100 intermediate nodes, and 13 output nodes (one node for "no vowel", and 12 for the different vowels). The frequency calculation finishes within 1 millisecond, and the neural network finishes within 2 milliseconds, on my hardware (Intel i3-1115G4 at 4 GHz). It spends more of its time listening for audio than it does computing the answer. I got the best results running the loop 20 times per second (every 50 ms). I have also gotten it to run 50 times per second (every 20 ms), but at that rate it struggles on one or two vowels.
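For a sense of how small a 264-100-13 network is, here is a NumPy-only stand-in for the forward pass. It is not the Keras model from this project: the weights are random and untrained, and the ReLU hidden activation and softmax output are assumptions, since the post does not say which activations were used. Only the three layer sizes come from the description above.

```python
import numpy as np

N_INPUT, N_HIDDEN, N_OUTPUT = 264, 100, 13  # layer sizes from the post

rng = np.random.default_rng(0)
# Untrained placeholder weights; a real model would learn these from recordings.
W1 = rng.normal(0.0, 0.1, size=(N_INPUT, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, size=(N_HIDDEN, N_OUTPUT))
b2 = np.zeros(N_OUTPUT)

def predict(features):
    """Map one 264-value feature vector to 13 class probabilities."""
    hidden = np.maximum(0.0, features @ W1 + b1)  # ReLU (assumed)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())           # numerically stable softmax (assumed)
    return exp / exp.sum()

features = rng.random(N_INPUT)    # stand-in for one frame of frequency data
probs = predict(features)
winner = int(np.argmax(probs))    # index of the most likely of the 13 outputs
n_params = W1.size + b1.size + W2.size + b2.size
print("output index:", winner, "trainable parameters:", n_params)
```

Two small matrix multiplications per frame fit with the timing reported above: if the frequency step and the network together take about 3 ms at most, then roughly 47 ms of every 50 ms cycle is spent waiting for audio.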

Here is a list of the different vowels that it recognizes:

ă = aa cat, 1

ŏ = ah stop, 2

ē = ee green, 3

ō = oh gross, 4

oo = oo mood blue goose, 5

ĭ = ih sit, 6

ā = ay stay, 7

ĕ = eh pet, 8

ŭ = uh bump, 9

o͝o = ou would could should took, 10

r̃ = ur fur fir fer rural, 11 (I chose this symbol)