- New Pronunciations for MSTD: Overview
- Part 1 - Finite-State Transducer Dictionaries
- Part 2 - Festival Pronunciation
- References
New Pronunciations for MSTD: Overview
In this lab we'll be learning about pronunciation modeling with finite state transducers. During the workshop, our group will be finding new pronunciations (from the web and from transcription), then testing the quality of the new pronunciations in a speech recognition (ASR) model. However, in this lab we'll be using pronunciation models in a text-to-speech system.
Begin by ssh'ing into
login.clsp.jhu.edu, then into one of the x machines.
Then copy & unzip the tarball for the lab into your home directory:
cp /home/ws08mstd/kholling/pronun_lab.tgz .
tar -xzvf pronun_lab.tgz
Part 1 - Finite-State Transducer Dictionaries
For this first part of the lab, we are going to pretend that letters are the same as phonemes, and map (English) words to letters to build a fake-phonetic dictionary (a grapheme dictionary).
Consider the toy wordlist in toy.wordlist. Using this wordlist, and pretending that each letter represents a phone, we can create a fake pronunciation dictionary, toy.lex. Take a look at this dictionary, and note that it can be easily generated from our wordlist. (You may also note that our dictionary contains a few special symbols: <s> and </s>, which indicate the start and end of a sentence, and <unk>, which indicates an unknown token. These special symbols are given special "pronunciation" symbols as well.)
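The wordlist-to-lexicon step can be scripted in a few lines. Here is a sketch that builds the same kind of grapheme lexicon from a small demo wordlist (demo.wordlist and demo.lex are made-up names; in the lab you would read the provided toy.wordlist and add the special symbols by hand):

```shell
# A few demo words (in the lab: the provided toy.wordlist, minus special symbols).
printf 'John\nJacob\nSmith\n' > demo.wordlist
# Space out each word's letters to form its fake "pronunciation",
# e.g. "John" -> "John<TAB>J o h n".
while read -r word; do
  letters=$(printf '%s' "$word" | sed 's/./& /g; s/ $//')
  printf '%s\t%s\n' "$word" "$letters"
done < demo.wordlist > demo.lex
```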
From this dictionary, we would like to compile a transducer that maps from words to letters and vice versa. First, we can produce a text format for the transducer toy.fsm.txt, with words on the input side, and letters on the output side. Take a look at toy.fsm.txt, and note that it can also be easily generated from our dictionary file.
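One plausible way to script the lexicon-to-FSM-text step: each entry becomes a chain of arcs starting and ending at state 0, with the word on the first arc and epsilon inputs on the rest. This is a sketch on a made-up two-entry lexicon; the epsilon symbol name and exact layout are assumptions, so check them against the provided toy.fsm.txt:

```shell
# Demo lexicon: word<TAB>space-separated letters (mini.* are made-up names).
printf 'Jo\tJ o\nAb\tA b\n' > mini.lex
awk -F'\t' '
BEGIN { next_state = 1 }             # state 0 is the single start/final state
{
  n = split($2, sym, " ")
  src = 0
  for (i = 1; i <= n; i++) {
    in_label = (i == 1) ? $1 : "<eps>"   # the word rides on the first arc only
    dst = (i == n) ? 0 : next_state++    # the last arc loops back to state 0
    print src, dst, in_label, sym[i]
    src = dst
  }
}
END { print 0 }                      # mark state 0 as final
' mini.lex > mini.fsm.txt
```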
Compiling & Printing Transducers
For this part of the lab we'll be using the AT&T FSM toolkit.
Hopefully you'll be able to follow our examples to complete the lab,
but you can also consult the
FSM man pages and GRM man pages.
In particular, we will be using fsmcompile, fsmdraw, fsmcompose, fsmproject, fsmprint, farcompilestrings, and farprintstrings.
Draw the automaton represented by our toy.fsm.txt file. Note that each line in the
toy.fsm.txt file represents a transition from state to state. States are
identified by number (the first two columns in the file), the transition
input labels are in the third column, and the output labels are in the fourth
column.
Compile the transducer with
fsmcompile, using our
toy.wordlist and toy.letterlist files as the input and output symbol tables:
fsmcompile -t -i toy.wordlist -o toy.letterlist toy.fsm.txt > toy.trans.fsm
The FSM toolkit also includes some tools for generating visual representations of the automata and transducers.
fsmdraw -i toy.wordlist -o toy.letterlist toy.trans.fsm | dot -Tps > toy.ps
Take a look at toy.ps. Does it look right? Is there anything weird about it?
Transducers as Acceptors (Automata)
One thing that can be done with transducers is to project them onto
either input labels or output labels with fsmproject, which
turns the transducer into an acceptor that preserves only the labels
on the specified side. Note also that the order of the transducers matters
when composing, since fsmcompose matches the output labels of the first
transducer against the input labels of the second.
Compose our transducer with an example string:
echo "John Jacob Smith" | farcompilestrings -u "<unk>" -i toy.wordlist | \
  fsmcompose - toy.trans.fsm > string1.trans.fsm
Now string1.trans.fsm is a tiny transducer just for our example string,
and we can use
fsmproject to turn our transducer into a word-acceptor:
fsmproject -i string1.trans.fsm | fsmprint -i toy.wordlist
or a letter-acceptor:
fsmproject -o string1.trans.fsm | fsmprint -i toy.letterlist
You can also use
farprintstrings on these acceptors (after fsmprojecting on either input or output labels) to see the results as strings:
fsmproject -i string1.trans.fsm | farprintstrings -i toy.wordlist
fsmproject -o string1.trans.fsm | farprintstrings -i toy.letterlist
What happens if you
fsmcompose toy.trans.fsm with
"John Jacob Jingleheimer Schmidt", then call farprintstrings?
echo "John Jacob Jingleheimer Schmidt" | farcompilestrings -u "<unk>" -i toy.wordlist | \
  fsmcompose - toy.trans.fsm | fsmproject -o - | farprintstrings -i toy.letterlist
You should get "J o h n J a c o b <unk> <unk>", because our transducer accepts "John" and "Jacob" as known input tokens but can only map "Jingleheimer" and "Schmidt" to the unknown token (<unk>).
Now, let's improve our transducer a bit. Add "Jingleheimer" and "Schmidt"
to our transducer by editing the toy.fsm.txt file.
(You could also add your name, or any other words you'd like!) Don't forget
to update the toy.wordlist and toy.letterlist files as well. It'll be faster
if you use some scripting to update the files, but you could edit them by hand.
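For instance, the letterlist can be regenerated mechanically from the wordlist. A sketch on two demo words (new.wordlist and new.letterlist are made-up names; in the lab you'd run this over the full toy.wordlist, skipping the special <...> symbols):

```shell
# The two new words (demo only).
printf 'Jingleheimer\nSchmidt\n' > new.wordlist
# One character per line, then unique-sort (C locale keeps case distinct).
fold -w1 < new.wordlist | LC_ALL=C sort -u > new.letterlist
```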
Once you've done that,
fsmcompile your new transducer and use
fsmdraw to take a look at it.
fsmcompile ... > toy2.trans.fsm
fsmdraw ... > toy2.ps
Now what happens when you
fsmcompose toy2.trans.fsm with
"John Jacob Jingleheimer Schmidt", then call
echo "John Jacob Jingleheimer Schmidt" | farcompilestrings ... | \
  farprintstrings -i toy.letterlist
We could use a much larger wordlist, wsj.wordlist
(extracted from sections 2-21 of the Wall Street Journal portion of the Penn
Treebank), to build a word-to-letter transducer for a larger vocabulary:
fsmcompile -t -i wsj.wordlist -o wsj.letterlist wsj.fsm.txt > wsj.trans.fsm
How many states are in this FSM? (Figure this out from the wsj.fsm.txt file,
not by trying to
fsmdraw the transducer!)
How many more states is that than our toy FSMs?
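As a hint, state counting can be done straight from the text format. This sketch assumes (as in our toy file) that arc lines carry the source and destination state in the first two columns and that a one-field line marks a final state; states.fsm.txt is a tiny made-up stand-in for wsj.fsm.txt:

```shell
# A tiny stand-in for wsj.fsm.txt.
printf '0 1 John J\n1 2 <eps> o\n2\n' > states.fsm.txt
# Collect every state id from columns 1 and 2, then count the unique ids
# (3 for this demo file).
awk '{ print $1; if (NF > 1) print $2 }' states.fsm.txt | sort -nu | wc -l
```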
Even though the WSJ FSM is much larger, it doesn't help us a lot with our unknown-word problem from before:
echo "John Jacob Jingleheimer Schmidt" | farcompilestrings ... | \
  farprintstrings -i wsj.letterlist
Why do we still have an <unk> in our output?
Now we will produce a real pronunciation dictionary!
Take a look at the PronLex Dictionary, pronlex_arpabet.txt.gz. It is in essentially the same format as our toy dictionary toy.lex, except there are phones instead of letters. The phones are the space-delimited characters starting in the second column.
Find the PronLex dictionary entry for the word 'fulllength'. Note that
there are six pronunciations provided for this word. Can you find
other kinds of ambiguity in this dictionary that do
not exist in our toy dictionary from Part 1 of the lab?
Create a wordlist and a phonelist from your PronLex dictionary. (Don't
forget our special symbols, <s>, etc.). Yes, this will require some
scripting. If you need help, ask us!
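A sketch of the scripting involved, on a made-up two-entry dictionary (mini.dict etc. are demo names), assuming the PronLex layout described above: word in column 1, space-delimited phones in the remaining columns:

```shell
# A two-entry stand-in for the real dictionary.
printf 'cat k ae t\ndog d ao g\n' > mini.dict
# Column 1 -> wordlist; remember to add the special symbols too.
{ printf '<s>\n</s>\n<unk>\n'; awk '{ print $1 }' mini.dict; } | LC_ALL=C sort -u > mini.wordlist
# Columns 2..NF -> phonelist.
awk '{ for (i = 2; i <= NF; i++) print $i }' mini.dict | LC_ALL=C sort -u > mini.phonelist
```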
Convert our dictionary into FSM text-format (thank you, John!):
cp /home/ws08mstd/kholling/pronun_lab/pronlex2fst.pl .
zcat pronlex_arpabet.txt.gz | perl pronlex2fst.pl > pronlex.fsm.txt
Then compile it into a transducer, which we'll call pronlex.trans.fsm. (Please don't try to optimize (i.e. invert, push, determinize) pronlex.trans.fsm.)
fsmcompile -t -i pronlex.wordlist -o pronlex.phonelist pronlex.fsm.txt > pronlex.trans.fsm
Now compile the following string using this new FSM:
"july 3rd is the last day of summer school for the 2008 clsp johns hopkins workshop attendees"
Note that the PronLex dictionary always expects only lowercase letters.
echo <your string> | farcompilestrings -u "<unk>"...
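Since the dictionary expects lowercase, it's safest to normalize the input string before piping it into farcompilestrings, e.g.:

```shell
# Lowercase the input before compiling it into an FSM.
echo "July 3rd IS the LAST day" | tr 'A-Z' 'a-z'
# -> july 3rd is the last day
```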
Are many words "<unk>"? Which ones? Can you explain for some
of them why they wouldn't be in a speech recognizer dictionary?
Now produce a phone string FSM for this word string by composing it
with pronlex.trans.fsm. (Note: remember, words are on the input
side of the transducer, phones on the output side.) Is there more than one
resulting phone string? Try using farprintstrings to list all of them.
Could you have predicted, just from your pronlex_arpabet.txt file, that
there would be more than one possible phone string?
What does that tell us about pronunciation modeling? How might that be a
problem for spoken term detection?
Part 2 - Festival Pronunciation
Set up Festival
Set up and test out Festival:
echo "Good afternoon, welcome to Festival" > helloworld.txt
text2wave helloworld.txt -o helloworld.wav
Then, from your personal machine:
scp <username>@<xmachine>:~/helloworld.wav .
and play the wav file!
Learn a pronunciation model
Now we'll learn a new pronunciation model to use in Festival.
We're going to use Wagon (installed in /apps/share/festival/speech_tools)
to train a pronunciation model for a set of family names (a particularly
difficult pronunciation problem). Take a look at the name-pronunciation
dictionary, names.lex.
The dictionary consists of a name in the first column (in lowercase)
and its (space-delimited) transcription in a single-character phonetic
alphabet. Note that the dictionary has been aligned automatically
so that letters are mapped one-for-one onto phones. In some
cases this results in a "deletion" (indicated by a "#" on the phone
side); in others this results in an "amalgamation", as in the
combination of "i" and "k" into "_i_k_" in:
m c p h e r s o n
m _i_k_ f # E R s & n
We're going to train a Festival letter-to-sound (LTS) model from this data.
For your convenience we have divided the data into individual training
files for each letter. The features we will be using to train the model
are defined in new_pronuns/dict.letter_desc. Take a look at that
file, and note that the last line of that file defines the features we will
use (p.p.p.name, ..., n.n.n.name), which tell us that our context is three
letters on the left and three on the right.
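The 3-left/3-right window can be pictured with a small sketch (the per-letter training files already encode this for you; the '#' padding symbol here is purely illustrative):

```shell
# For each letter of "smith", print its 3 left neighbors, the letter itself,
# and its 3 right neighbors, padding the word edges with '#'.
echo "smith" | awk '{
  padded = "###" $0 "###"
  for (i = 1; i <= length($0); i++)
    print substr(padded, i, 7)   # 3 left + focus letter + 3 right
}'
```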
Train an LTS model by running Wagon as follows:
./makelts.sh > my_lts.scm
Take a look at the output scheme file, my_lts.scm. Can you make some
sense of it?
Note! You will have to log in to the
node for this next part of the lab!! If that machine gets too overloaded
with users during the lab, then team up for this part.
Invoke Festival and load your new model:
Festival Speech Synthesis System 1.96:beta July 2004
Copyright (C) University of Edinburgh, 1996-2004. All rights reserved.
For details type `(festival_warranty)'
festival>
festival> (load "my_lts.scm")
nil
festival> (require 'lts)
t
festival> (lts_predict "mcpherson" my_lts_rules)
("m" "ih" "k" "f" "er" "s" "ax" "n")
To quit Festival, type (quit).
Festival vs Your LTS Model
Okay, now that you've loaded your new model in Festival, let's try it out! First, produce a wav file using Festival's default pronunciation:
festival> (set! utt1 (SayText "mcpherson"))
festival> (utt.save.wave utt1 "pronun1.wav")
Now hear what it sounds like using your newly built letter-to-sound model:
festival> (set! phn2 (lts_predict "mcpherson" my_lts_rules))
festival> (set! utt2 (SayPhones phn2))
festival> (utt.save.wave utt2 "pronun1.my_lts.wav")
Experiment a bit!
Can you find any pronunciations that are improved by using our LTS rules? (Try taking a look at names.lex.)
Are there any pronunciations that are particularly bad in both Festival
and using our LTS rules? Why might that be? (Again, look at our names.lex
training data.) What are some ways we could try to improve the pronunciation?
References
AT&T FSM Toolkit:
- FSM: http://www.research.att.com/~fsmtools/fsm/
- FSM man pages: http://www.research.att.com/~fsmtools/fsm/man4/fsm.1.html
- GRM: http://www.research.att.com/~fsmtools/grm/
- GRM man pages: http://www.research.att.com/~fsmtools/grm/man4/grm.1.html