Speech
Speech is a complex phenomenon. People rarely understand how it is
produced and perceived. The naive perception is often that speech is built
of words, and that each word consists of phones. The reality is
unfortunately very different. Speech is a dynamic process without clearly
distinguished parts. It's always useful to get a sound editor, look into a
recording of speech, and listen to it. Here is, for example, a speech
recording in an audio editor.
All modern descriptions of speech are to some degree probabilistic: there
are no certain boundaries between units or between words. Speech-to-text
conversion and other speech applications are never 100% correct. That idea
is rather unusual for software developers, who usually work with
deterministic systems, and it creates a lot of issues specific to speech
technology.
In current practice, speech structure is understood as follows:
Speech is a continuous audio stream in which rather stable states mix with
dynamically changing states. In this sequence of states, one can define
more or less similar classes of sounds, or phones. Words are understood to
be built of phones, but this is certainly not true. The acoustic
properties of a waveform corresponding to a phone can vary greatly
depending on many factors - phone context, speaker, style of speech and so
on. So-called coarticulation makes phones sound very different from their
"canonical" representation. Next, since transitions between phones are
more informative than stable regions, developers often talk about diphones
- segments running from the middle of one phone to the middle of the next.
Sometimes developers talk about subphonetic units - different substates of
a phone. Often three or more regions of a different nature can easily be
found within a phone.
The number three is easily explained: the first part of the phone depends
on its preceding phone, the middle part is stable, and the last part
depends on the subsequent phone. That's why a phone is often modeled with
three states in speech recognition.
Sometimes phones are considered in context. Such context-dependent units
are called triphones or even quinphones. Note that, unlike phones and
diphones, a triphone is matched with the same range of the waveform as a
plain phone; it differs only by label, because it describes the phone in a
particular context. That's why we prefer to call such a unit a senone. A
senone's dependence on context can be more complex than just the left and
right neighbors; it can be a rather complex function defined by a decision
tree, or in some other way.
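To make the left-and-right case concrete, here is a minimal Python sketch
of triphone labeling. The "left-center+right" naming follows common ASR
practice; the function name and example phones are illustrative only.

    def to_triphones(phones):
        """Return context-dependent labels 'left-center+right' for a phone list."""
        labels = []
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else "SIL"                 # boundary -> silence
            right = phones[i + 1] if i < len(phones) - 1 else "SIL"
            labels.append(f"{left}-{p}+{right}")
        return labels

    print(to_triphones(["HH", "AH", "L", "OW"]))   # phones of "hello"
    # ['SIL-HH+AH', 'HH-AH+L', 'AH-L+OW', 'L-OW+SIL']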
Next, phones build subword units, like syllables. Sometimes syllables are
defined as "reduction-stable entities": when speech becomes fast, phones
often change, but syllables remain the same. Syllables are also related to
the intonational contour. There are other ways to build subwords -
morphologically based in morphology-rich languages, or phonetically based.
Subwords are often used in open-vocabulary speech recognition.
Subwords form words. Words are important in speech recognition because
they restrict combinations of phones significantly. With 40 phones and an
average word length of 7 phones, there could be up to 40^7 (about 1.6 x
10^11) phone sequences. Luckily, even a very educated person rarely uses
more than 20,000 words in practice, which makes recognition much more
feasible.
Words and other non-linguistic sounds, which we call fillers (breath, um,
uh, cough), form utterances. Utterances are separate chunks of audio
between pauses. They don't necessarily match sentences, which are a more
semantic concept.
On top of this, there are dialog acts like turns, but they go beyond the
scope of this document.
The common way to recognize speech is the following: we take a waveform,
split it into utterances at silences, and then try to recognize what is
being said in each utterance. To do that, we want to take all possible
combinations of words, try to match them with the audio, and choose the
best-matching combination. There are a few important concepts in this
matching process.
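Before matching, the waveform has to be split into utterances. Below is a
minimal energy-based splitter sketch, assuming mono 16 kHz samples in a
NumPy array; the threshold and pause length are illustrative constants,
and real systems use trained voice activity detectors instead.

    import numpy as np

    def split_by_silence(samples, rate=16000, frame_ms=10,
                         threshold=1e-4, min_silence_frames=30):
        frame_len = rate * frame_ms // 1000
        n_frames = len(samples) // frame_len
        # Per-frame energy decides "speech" vs "silence".
        energy = np.array([np.mean(samples[i*frame_len:(i+1)*frame_len] ** 2)
                           for i in range(n_frames)])
        voiced = energy > threshold

        utterances, start, silence_run = [], None, 0
        for i, v in enumerate(voiced):
            if v:
                if start is None:
                    start = i
                silence_run = 0
            elif start is not None:
                silence_run += 1
                if silence_run >= min_silence_frames:  # long pause ends the utterance
                    utterances.append((start * frame_len,
                                       (i - silence_run + 1) * frame_len))
                    start, silence_run = None, 0
        if start is not None:
            utterances.append((start * frame_len, n_frames * frame_len))
        return utterances  # list of (start_sample, end_sample) pairs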
First of all, there is the concept of features. Since the number of
parameters in the raw waveform is very large, we reduce it. Speech is
divided into frames; then, for each frame, typically 10 milliseconds long,
we extract 39 numbers that represent the speech. This is called a feature
vector. The way these numbers are generated is a subject of active
investigation, but in a simple case they are derived from the spectrum.
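A minimal sketch of such a front end, assuming NumPy and SciPy are
available: it computes 13 cepstral coefficients per 10 ms frame (the
cepstrum is the DCT of the log spectrum) and appends their first and
second time derivatives to reach the classic 39 numbers. Real front ends
add mel filterbanks and overlapping windows.

    import numpy as np
    from scipy.fftpack import dct

    def features(samples, rate=16000, frame_ms=10, n_ceps=13):
        frame_len = rate * frame_ms // 1000
        n_frames = len(samples) // frame_len
        ceps = []
        for i in range(n_frames):
            frame = samples[i*frame_len:(i+1)*frame_len] * np.hamming(frame_len)
            log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
            ceps.append(dct(log_spec, norm='ortho')[:n_ceps])  # cepstrum
        ceps = np.array(ceps)
        delta = np.gradient(ceps, axis=0)        # first derivative over time
        delta2 = np.gradient(delta, axis=0)      # second derivative
        return np.hstack([ceps, delta, delta2])  # shape: (n_frames, 39)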
Second, there is the concept of the model. A model describes a
mathematical object that gathers common attributes of the spoken word. In
practice, the audio model of a senone is a Gaussian mixture of its three
states - to put it simply, the most probable feature vectors. The concept
of a model raises several questions: how well does the model fit the
speech, can the model be improved despite its internal problems, and how
well does the model adapt to changed conditions.
The model of speech is called a Hidden Markov Model, or HMM. It is a
generic model that describes a black-box communication channel, in which a
process is described as a sequence of states that change into each other
with certain probabilities. This model is intended to describe any
sequential process, like speech, and it has been proven to be really
practical for speech decoding.
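A minimal sketch of such an HMM in Python, assuming NumPy: three emitting
states with left-to-right transitions, as in a single senone. In a real
decoder the emission likelihoods come from the Gaussian mixtures; here
they are a made-up table just to show the forward recursion.

    import numpy as np

    # trans[i][j] = probability of moving from state i to state j
    trans = np.array([[0.6, 0.4, 0.0],
                      [0.0, 0.7, 0.3],
                      [0.0, 0.0, 1.0]])

    def forward(obs_likelihoods):
        """obs_likelihoods[t][s] = P(feature at t | state s); returns P(observations)."""
        alpha = np.array([1.0, 0.0, 0.0]) * obs_likelihoods[0]  # start in state 0
        for frame in obs_likelihoods[1:]:
            alpha = (alpha @ trans) * frame  # sum over paths, weight by emission
        return alpha.sum()

    # Toy observation likelihoods for 4 frames over 3 states:
    obs = np.array([[0.9, 0.1, 0.0],
                    [0.5, 0.4, 0.1],
                    [0.1, 0.6, 0.3],
                    [0.0, 0.2, 0.8]])
    print(forward(obs))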
Third, there is the matching process itself. Since it would take more time
than the universe has existed to compare every feature vector with every
model, the search is optimized by many tricks. At any point we maintain
the best matching variants and extend them as time goes on, producing the
best matching variants for the next frame.
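The usual trick for this is beam pruning. Here is a minimal sketch, where
`extensions` and `score` are hypothetical stand-ins for the real model
lookups: at every frame, hypotheses scoring too far below the current best
are dropped before being extended.

    def beam_search(frames, initial_hyps, extensions, score, beam=10.0):
        hyps = initial_hyps  # dict: hypothesis -> log score
        for frame in frames:
            extended = {}
            for hyp, logp in hyps.items():
                for nxt in extensions(hyp):
                    new_logp = logp + score(nxt, frame)
                    # keep the best score per hypothesis
                    if nxt not in extended or new_logp > extended[nxt]:
                        extended[nxt] = new_logp
            best = max(extended.values())
            # prune everything that fell out of the beam
            hyps = {h: p for h, p in extended.items() if p > best - beam}
        return max(hyps, key=hyps.get)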
A lattice is a directed graph that represents variants of the recognition.
Often, getting the single best match is not practical; in that case,
lattices are a good intermediate format to represent the recognition
result.
N-best lists of variants are like lattices, though their representation is
not as dense as the lattice one.
Word confusion networks (sausages) are lattices where a strict order of
nodes is taken from the lattice edges.
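A minimal sketch of a lattice as a plain Python data structure; the toy
words, scores, and node names are made up. Enumerating complete paths and
sorting them by score yields an n-best list, with the top entry as the
best match.

    # Nodes are time points; edges carry a word hypothesis and its log score.
    lattice = {
        "start": [("recognize", -1.2, "mid"), ("wreck a nice", -1.5, "mid")],
        "mid":   [("speech", -0.8, "end"), ("beach", -1.1, "end")],
        "end":   [],
    }

    def best_paths(node, prefix=(), logp=0.0):
        """Enumerate complete paths with scores; sorting gives an n-best list."""
        if not lattice[node]:
            yield logp, " ".join(prefix)
            return
        for word, score, nxt in lattice[node]:
            yield from best_paths(nxt, prefix + (word,), logp + score)

    for score, words in sorted(best_paths("start"), reverse=True):
        print(f"{score:6.1f}  {words}")
    #   -2.0  recognize speech   <- best path
    #   ... remaining paths in score order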
Speech database - a set of typical recordings from the task domain. If we
develop a dialog system, it might be dialogs recorded from users. For a
dictation system it might be recordings of read text. Speech databases are
used to train, tune and test the decoding systems.
Text databases - sample texts collected for language model training and so
on. Usually, databases of texts are collected in sample text form. The
issue with collection is converting existing documents (PDFs, web pages,
scans) into spoken text form. That is, you need to remove tags and
headings, expand numbers into their spoken form, and expand abbreviations.
When speech recognition is being developed, the most complex issue is to
make the search precise (considering as many variants to match as
possible) while keeping it fast enough not to run for ages. There are also
issues with making the model match the speech, since models aren't
perfect.
Usually the system is tested on a
test database that is meant to represent the target task correctly.
The following characteristics are
used:
Word error rate. Suppose we have a reference text of N words and a
recognition result. From the alignment of the two, I words were inserted,
D words were deleted, and S words were substituted. The word error rate is
WER = (I + D + S) / N
WER is usually measured in percent.
Accuracy. It is almost the same thing as word error rate, but it doesn't
count insertions:
Accuracy = (N - D - S) / N
Accuracy is actually a worse measure for most tasks, since insertions also
matter in the final result. But for some tasks, accuracy is a reasonable
measure of decoder performance.
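The counts I, D, and S come from the cheapest alignment of hypothesis to
reference, which is an edit-distance computation. A minimal sketch in
Python, where insertions, deletions and substitutions all cost 1:

    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dist[i][j] = edit distance between first i ref and first j hyp words
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i                      # i deletions
        for j in range(len(hyp) + 1):
            dist[0][j] = j                      # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dist[i-1][j-1] + (ref[i-1] != hyp[j-1])
                dist[i][j] = min(sub, dist[i-1][j] + 1, dist[i][j-1] + 1)
        return dist[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 = 0.1667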
Speed. Suppose the audio file was 2 hours long and the decoding took 6
hours. Then the speed is 3xRT, that is, three times slower than real time.
ROC curves. For detection tasks there are false alarms and hits/misses, so
ROC curves are used. A ROC curve is a graph of the number of false alarms
versus the number of hits; the goal is to find the optimal operating
point, where the number of false alarms is small and the number of hits
approaches 100%.
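A minimal sketch of the points behind such a curve: sweep a score
threshold over the detector output and count hits and false alarms at each
setting. The scores and labels below are made-up illustration data.

    detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
                  (0.4, False), (0.3, True), (0.2, False)]  # (score, is_real_event)

    n_events = sum(1 for _, real in detections if real)
    for threshold in (0.85, 0.65, 0.45, 0.25):
        accepted = [real for score, real in detections if score >= threshold]
        hits = sum(accepted)
        false_alarms = len(accepted) - hits
        print(f"threshold {threshold}: hits {hits}/{n_events}, "
              f"false alarms {false_alarms}")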
There are other properties that aren't often taken into account but are
still important for many practical applications. Your first task should be
to build such a measure and systematically apply it during system
development. Your second task is to collect a test database and test how
your application performs.
Recorder
PARTS LIST
R1 = 1k
R2 = 470k
R3 = 10k
R4 = 5k1
R5 = 4k7
R6,7 = 100k
R8,9 = 1M
R10 = 10R
C1-10 = 100nF/63V
C11 = 47nF/63V
E1,4 = 220uF/16V
E2 = 4.7uF/16V
E3 = 22uF/16V
IC1 = ISD2560 + socket
IC2 = LM78L05
IC3 = LM386 + socket
MIC = Condenser microphone
S1,2 = Pushbutton (S1 = Start and Pause. S2 = Stop and Reset)
S3 = Change-over switch
Loudspeaker = 8R speaker
The two pushbuttons: S1 = Start/Pause, S2 = Stop/Reset.
If you want to play your message, set S3 to Play. Then push S1 to start
playing and push it again to pause.
If you want to delete your message, press S2 twice.
If you want to record a message, set S3 to Rec. Then push S1 to start and
S2 to stop.
http://www.hubcity.net