Pronunciations Lab

From CLSP Wiki

New Pronunciations for MSTD: Overview

In this lab we'll be learning about pronunciation modeling with finite state transducers. During the workshop, our group will be finding new pronunciations (from the web and from transcription), then testing the quality of the new pronunciations in a speech recognition (ASR) model. However, in this lab we'll be using pronunciation models in a text-to-speech system.

Begin by ssh'ing into login.clsp.jhu.edu, then into an x node. Then copy & unzip the tarball for the lab into your home directory:

 cp /home/ws08mstd/kholling/pronun_lab.tgz .
 tar -xzvf pronun_lab.tgz
 cd pronun_lab/
 export PATH=/home/clsp/ws08mstd/arnab/tools/fsm/bin/:/home/clsp/ws08mstd/arnab/tools/grm/bin/:$PATH

Part 1 - Finite-State Transducer Dictionaries

Grapheme Dictionary

For this first part of the lab, we are going to pretend that letters are the same as phonemes, and map (English) words to letters to build a fake-phonetic dictionary (a grapheme dictionary).

Consider the toy wordlist in toy.wordlist. Using this wordlist, and pretending that each letter represents a phone, we can create a fake pronunciation dictionary, toy.lex. Take a look at this dictionary, and note that it can be easily generated from our wordlist. (You may also note that our dictionary contains a few special symbols: <s> and </s>, which indicate the start and end of a sentence, and <unk>, which indicates an unknown token. These special symbols are given special "pronunciation" symbols as well.)

From this dictionary, we would like to compile a transducer that maps from words to letters and vice versa. First, we can produce a text format for the transducer toy.fsm.txt, with words on the input side, and letters on the output side. Take a look at toy.fsm.txt, and note that it can also be easily generated from our dictionary file.
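
The step from dictionary to FSM text format can be sketched in Python. This is only an illustration under assumptions, not the script used to build the lab files: it assumes each line of toy.lex is a word followed by its space-separated letters, it uses a single state 0 that is both start and final, and it writes the epsilon input label as `<eps>` (this must match whatever name the symbol files use). The helper name `lexicon_to_fsm_text` is hypothetical, and the real toy.fsm.txt may be laid out differently.

```python
def lexicon_to_fsm_text(lex_lines, eps="<eps>"):
    """Turn 'word sym1 sym2 ...' lines into AT&T-style FSM text lines.

    Each entry becomes a path that leaves state 0 and returns to it,
    so the machine accepts any sequence of dictionary words.
    """
    out = []
    next_state = 1  # state 0 is reserved as the start/final state
    for line in lex_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed entries
        word, symbols = fields[0], fields[1:]
        src = 0
        for i, sym in enumerate(symbols):
            last = i == len(symbols) - 1
            dst = 0 if last else next_state
            if not last:
                next_state += 1
            # the word is consumed on the first arc; later arcs read epsilon
            ilabel = word if i == 0 else eps
            out.append(f"{src} {dst} {ilabel} {sym}")
            src = dst
    out.append("0")  # a line with only a state number marks it as final
    return out


for fsm_line in lexicon_to_fsm_text(["John J o h n", "<s> <s>"]):
    print(fsm_line)
```

Looping every word back to state 0 is one simple design; it keeps the machine small per entry but shares no prefixes between words.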

Compiling & Printing Transducers

For this part of the lab we'll be using the AT&T FSM and GRM libraries. Hopefully you'll be able to follow our examples to complete the lab, but you can also use the FSM man pages and GRM man pages. In particular, we will be using fsmcompile and fsmcompose.

Draw the automaton represented by our toy.fsm.txt file. Note that each transition line in toy.fsm.txt describes a move from one state to another. States are identified by number: the first column is the source state and the second column is the destination state. The transition's input label is in the third column, and its output label is in the fourth column. (A line containing only a state number, if present, marks that state as final.)


Compile the transducer using fsmcompile, using our toy.wordlist and toy.letterlist files, as follows:

 fsmcompile -t -i toy.wordlist -o toy.letterlist toy.fsm.txt > toy.trans.fsm

The FSM toolkit also includes some tools for generating visual representations of the automata and transducers.

 fsmdraw -i toy.wordlist -o toy.letterlist toy.trans.fsm | dot -Tps > toy.ps
 ps2pdf toy.ps

Take a look at toy.pdf. Does it look right? Is there anything weird about it?


Transducers as Acceptors (Automata)

Transducers can be projected onto either their input labels or their output labels with fsmproject. Projection turns the transducer into an acceptor that keeps only the labels on the chosen side. Below we apply it to the result of a composition. Note that in composition the order of the transducers matters, because fsmcompose matches the output labels of the first transducer against the input labels of the second.
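
As a rough picture of what projection does (not of how fsmproject is implemented), the sketch below represents a transducer as a list of (source, destination, input label, output label) tuples and keeps only one side. The `project` function and the tuple layout are assumptions made for illustration.

```python
def project(arcs, side):
    """Keep one side's labels, giving acceptor arcs (src, dst, label).

    arcs: list of (src, dst, input_label, output_label) tuples.
    side: "input" or "output".
    """
    if side not in ("input", "output"):
        raise ValueError("side must be 'input' or 'output'")
    idx = 2 if side == "input" else 3
    return [(arc[0], arc[1], arc[idx]) for arc in arcs]


# A made-up path spelling out one word.
arcs = [(0, 1, "John", "J"), (1, 2, "<eps>", "o"),
        (2, 3, "<eps>", "h"), (3, 0, "<eps>", "n")]
print(project(arcs, "input"))   # word side
print(project(arcs, "output"))  # letter side
```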

Compose our transducer with an example string:

 echo "John Jacob Smith" | farcompilestrings -u "<unk>" -i toy.wordlist | \
   fsmcompose - toy.trans.fsm > string1.trans.fsm

Now string1.trans.fsm is a tiny transducer just for our example string, and we can use fsmproject to turn our transducer into a word-acceptor:

 fsmproject -i string1.trans.fsm | fsmprint -i toy.wordlist

or a letter-acceptor:

 fsmproject -o string1.trans.fsm | fsmprint -i toy.letterlist

by fsmprojecting on either input or output labels. Call farprintstrings on these acceptors to see results:

 fsmproject -i string1.trans.fsm | farprintstrings -i toy.wordlist
 fsmproject -o string1.trans.fsm | farprintstrings -i toy.letterlist

What happens if you fsmcompose toy.trans.fsm with "John Jacob Jingleheimer Schmidt", then call farprintstrings?

 echo "John Jacob Jingleheimer Schmidt" | farcompilestrings -u "<unk>" -i toy.wordlist | \
   fsmcompose - toy.trans.fsm | fsmproject -o - | farprintstrings -i toy.letterlist

You should get "J o h n J a c o b <unk> <unk>": our transducer accepts "John" and "Jacob" as known input tokens, but "Jingleheimer" and "Schmidt" are not in toy.wordlist, so each can only be accepted as the unknown token <unk>.
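
The effect of unknown-word mapping on this example can be mimicked in a few lines of Python. This is a sketch only: the `toy_lexicon` dictionary is a made-up stand-in for toy.lex (including the assumption that <unk> is "pronounced" as <unk>), and `words_to_letters` is a hypothetical helper, not part of the FSM or GRM tools.

```python
def words_to_letters(sentence, lexicon, unk="<unk>"):
    """Replace out-of-vocabulary words by `unk`, then spell each word out."""
    letters = []
    for token in sentence.split():
        if token not in lexicon:
            token = unk  # every unknown word collapses onto one symbol
        letters.extend(lexicon[token])
    return " ".join(letters)


# Hypothetical stand-in for the toy dictionary.
toy_lexicon = {
    "John": ["J", "o", "h", "n"],
    "Jacob": ["J", "a", "c", "o", "b"],
    "<unk>": ["<unk>"],
}
print(words_to_letters("John Jacob Jingleheimer Schmidt", toy_lexicon))
```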

Editing Transducers

Now, let's improve our transducer a bit. Add "Jingleheimer" and "Schmidt" to our transducer by editing the toy.fsm.txt file. (You could also add your name, or any other words you'd like!) Don't forget to update the toy.wordlist and toy.letterlist files as well. It'll be faster if you use some scripting to update the files, but you could edit them by hand as well...
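
If you go the scripting route, something like the following sketch could append new entries to a symbol file. It assumes the symbol files hold one "symbol id" pair per line with integer ids, which you should check against toy.wordlist before relying on it; `extend_symbol_table` is a hypothetical helper name.

```python
def extend_symbol_table(lines, new_symbols):
    """Append unseen symbols to 'symbol id' lines, using fresh integer ids."""
    seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            seen[fields[0]] = int(fields[1])
    next_id = max(seen.values(), default=-1) + 1
    out = [line.rstrip("\n") for line in lines]
    for sym in new_symbols:
        if sym not in seen:  # leave existing entries and ids untouched
            seen[sym] = next_id
            out.append(f"{sym} {next_id}")
            next_id += 1
    return out


table = ["<eps> 0", "John 1", "Jacob 2"]  # made-up contents
for entry in extend_symbol_table(table, ["Jingleheimer", "Schmidt", "John"]):
    print(entry)
```

The same function can be applied to the letter list for any letters the new words introduce; the transitions in toy.fsm.txt still have to be added separately.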


Once you've done that, fsmcompile your new transducer and use fsmdraw to take a look at it.

 fsmcompile ... > toy2.trans.fsm
 fsmdraw ... > toy2.ps

Now what happens when you fsmcompose toy2.trans.fsm with "John Jacob Jingleheimer Schmidt", then call farprintstrings?

 echo "John Jacob Jingleheimer Schmidt" | farcompilestrings ... | \
   farprintstrings -i toy.letterlist



Larger Transducers

We could use a much larger wordlist, wsj.wordlist (extracted from sections 2-21 of the Wall Street Journal portion of the Penn Treebank), to build a word-to-letter transducer with a larger vocabulary:

 fsmcompile -t -i wsj.wordlist -o wsj.letterlist wsj.fsm.txt > wsj.trans.fsm

How many states are in this FSM? (Figure this out from the wsj.fsm.txt file, not by trying to fsmdraw the transducer!) How many more states is that than our toy FSMs?
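
One possible way to count states from the text file is sketched below. It assumes the usual layout in which a transition line has at least three fields (source, destination, labels) and a final-state line has one or two (state and optional weight); `count_states` is a hypothetical helper.

```python
def count_states(fsm_lines):
    """Count distinct state ids in AT&T-style FSM text."""
    states = set()
    for line in fsm_lines:
        fields = line.split()
        if not fields:
            continue
        states.add(fields[0])      # source state, or a final state
        if len(fields) >= 3:
            states.add(fields[1])  # destination state of a transition
    return len(states)


# A made-up five-line machine with four states.
example = ["0 1 John J", "1 2 <eps> o", "2 3 <eps> h", "3 0 <eps> n", "0"]
print(count_states(example))
```

For the lab file you would pass an open file handle instead of the example list, for instance `count_states(open("wsj.fsm.txt"))`.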


Even though the WSJ FSM is much larger, it doesn't help us a lot with our unknown-word problem from before:

 echo "John Jacob Jingleheimer Schmidt" | farcompilestrings ... | \
   farprintstrings -i wsj.letterlist

Why do we still have an <unk> in our output?


Phonetic Dictionary

Now we will produce a real pronunciation dictionary!

Take a look at the PronLex Dictionary, pronlex_arpabet.txt.gz. It is in essentially the same format as our toy dictionary toy.lex, except there are phones instead of letters. The phones are the space-delimited characters starting in the second column.

Find the PronLex dictionary entry for the word 'fulllength.' Note that there are six pronunciations provided for this word. Can you find other kinds of ambiguity in this dictionary that do not exist in our toy dictionary from Part 1 of the lab?


Create a wordlist and a phonelist from your PronLex dictionary. (Don't forget our special symbols, <s>, etc.). Yes, this will require some scripting. If you need help, ask us!
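
A minimal sketch of this scripting step follows, assuming each dictionary line is a word followed by its space-separated phones and that the symbol files use "symbol id" lines with id 0 kept for epsilon. The epsilon name `<eps>`, the choice to put the special symbols in both lists, and the sample entries are assumptions for illustration only.

```python
def build_symbol_tables(lex_lines, specials=("<s>", "</s>", "<unk>"), eps="<eps>"):
    """Return (wordlist, phonelist) as lists of 'symbol id' lines."""
    words = {}   # dicts keep insertion order, so ids follow first appearance
    phones = {}
    for table in (words, phones):
        table[eps] = 0  # id 0 is kept for epsilon
        for sym in specials:
            table.setdefault(sym, len(table))
    for line in lex_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed entries
        words.setdefault(fields[0], len(words))
        for phone in fields[1:]:
            phones.setdefault(phone, len(phones))

    def as_lines(table):
        return [f"{sym} {idx}" for sym, idx in table.items()]

    return as_lines(words), as_lines(phones)


# Made-up entries in the assumed format.
wordlist, phonelist = build_symbol_tables(["cat k ae t", "cab k ae b"])
print(wordlist)
print(phonelist)
```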


Convert our dictionary into FSM text-format (thank you, John!):

 cp /home/ws08mstd/kholling/pronun_lab/pronlex2fst.pl .
 zcat pronlex_arpabet.txt.gz | ./pronlex2fst.pl > pronlex.fsm.txt

Then compile it into a transducer, which we'll call pronlex.trans.fsm. (Please don't try to optimize pronlex.trans.fsm, e.g. by inverting, pushing, or determinizing it.)

 fsmcompile -t -i pronlex.wordlist -o pronlex.phonelist pronlex.fsm.txt > pronlex.trans.fsm

Now compile the following string using this new FSM:
"july 3rd is the last day of summer school for the 2008 clsp johns hopkins workshop attendees"
Note that the PronLex dictionary always expects only lowercase letters.

 echo <your string> | farcompilestrings -u "<unk>"...

Are many words "<unk>"? Which ones? Can you explain for some of them why they wouldn't be in a speech recognizer dictionary?


Now produce a phone string FSM for this word string by composing it with pronlex.trans.fsm. (Note: remember, words are on the input side of the transducer, phones on the output side.) Is there more than one resulting phone string? Try using fsmrandgen:

 fsmrandgen -?...

Could you have predicted, just from your pronlex_arpabet.txt.gz file, that there would be more than one possible phone string?

What does that tell us about pronunciation modeling? How might that be a problem for spoken term detection?
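
To make the source of the ambiguity concrete: when a word has several dictionary entries, every combination of per-word pronunciations is a possible phone string for the sentence. The sketch below enumerates them for a tiny made-up lexicon in the assumed "word phone phone ..." format; the function names and the handling of unknown words are illustrative assumptions, and this is not how fsmcompose works internally.

```python
from itertools import product


def load_pronunciations(lex_lines):
    """Map each word to the list of its pronunciations (tuples of phones)."""
    prons = {}
    for line in lex_lines:
        fields = line.split()
        if len(fields) >= 2:
            prons.setdefault(fields[0], []).append(tuple(fields[1:]))
    return prons


def phone_strings(sentence, prons, unk="<unk>"):
    """List every phone string obtainable by picking one pronunciation per word."""
    per_word = [prons.get(word, [(unk,)]) for word in sentence.split()]
    return [" ".join(p for pron in combo for p in pron)
            for combo in product(*per_word)]


# Made-up entries: two pronunciations for "the", one for "day".
prons = load_pronunciations(["the dh ax", "the dh iy", "day d ey"])
for s in phone_strings("the day", prons):
    print(s)
```

The number of combinations is the product of the per-word pronunciation counts, so it grows quickly with sentence length.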


Part 2 - Festival Pronunciation

Set up Festival

Set up and test out Festival:

 export PATH=/apps/share/festival/festival/bin/:/apps/share/festival/speech_tools/bin/:$PATH
 echo "Good afternoon, welcome to Festival" > helloworld.txt
 text2wave helloworld.txt -o helloworld.wav

Then, from your personal machine:

 scp <username>@<xmachine>:~/helloworld.wav .

and play the wav file!


(See the Quick start section of the Festival Manual for general help.)

Learn a pronunciation model

Now we'll learn a new pronunciation model to use in Festival.

We're going to use Wagon (installed in /apps/share/festival/speech_tools) to train a pronunciation model for a set of family names (a particularly difficult pronunciation problem). Take a look at the name-pronunciation dictionary: new_pronuns/names.lex.


Each dictionary entry consists of a lowercase name in the first column, followed by its space-delimited transcription in a single-character phonetic alphabet. Note that the dictionary has been aligned automatically (using the algorithm described in Sproat, 2001), so that letters are mapped one-for-one onto phones. In some cases this results in a "deletion" (indicated by a "#" on the phone side); in others it results in an "amalgamation", as in the combination of "i" and "k" into "_i_k_" in:

 m c p h e r s o n
 m _i_k_ f # E R s & n
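
Reading the alignment back into a plain phone sequence can be sketched as follows. The rule used here (drop "#", split "_x_y_" units on underscores) is inferred from the single example above and is an assumption about the file format; `flatten_aligned_phones` is a hypothetical helper.

```python
def flatten_aligned_phones(aligned):
    """Turn an aligned phone row into the plain phone sequence."""
    phones = []
    for unit in aligned.split():
        if unit == "#":
            continue  # deletion: the letter maps to no phone
        if unit.startswith("_") and unit.endswith("_") and len(unit) > 2:
            # amalgamation: one letter maps to several phones
            phones.extend(p for p in unit.split("_") if p)
        else:
            phones.append(unit)
    return phones


letters = "m c p h e r s o n".split()
aligned = "m _i_k_ f # E R s & n"
print(len(letters) == len(aligned.split()))  # one phone unit per letter
print(flatten_aligned_phones(aligned))
```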

We're going to train a Festival letter-to-sound (LTS) model from this data. For your convenience we have divided the data into individual training files for each letter. The features we will be using to train the model are defined in new_pronuns/dict.letter_desc. Take a look at that file, and note that the last line of that file defines the features we will use (p.p.p.name, n.n.n.name), which tells us that our context is three letters on the left and three on the right.
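
The three-letters-left, three-letters-right context can be pictured with the sketch below, which lists the window around each letter of a name. The padding symbol "0" for positions beyond the word edge is a placeholder assumption, not necessarily what the lab's training files use, and `letter_contexts` is a hypothetical helper.

```python
def letter_contexts(letters, width=3, pad="0"):
    """Return (left context, letter, right context) for each letter."""
    padded = [pad] * width + list(letters) + [pad] * width
    rows = []
    for i, letter in enumerate(letters):
        c = i + width  # position of this letter in the padded list
        left = tuple(padded[c - width:c])
        right = tuple(padded[c + 1:c + 1 + width])
        rows.append((left, letter, right))
    return rows


for row in letter_contexts("mcp"):
    print(row)
```

Each row corresponds to one training example for the letter in the middle, with the aligned phone as the value to predict.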


Train an LTS model by running Wagon as follows:

 cd new_pronuns/
 unzip names.data.zip
 ./trainlettertrees.sh
 ./makelts.sh > my_lts.scm

Take a look at the output Scheme file, my_lts.scm. Can you make some sense of it?



Note! You will have to log in to the y01 node for this next part of the lab! If that machine gets too overloaded with users during the lab, then team up for this part.

Invoke Festival and load your new model:

 export PATH=/apps/share/festival/festival/bin/:/apps/share/festival/speech_tools/bin/:$PATH
 festival
 Festival Speech Synthesis System 1.96:beta July 2004
 Copyright (C) University of Edinburgh, 1996-2004. All rights reserved.
 For details type `(festival_warranty)'
 festival> (load "my_lts.scm")
 nil
 festival> (require 'lts)
 t
 festival> (lts_predict "mcpherson" my_lts_rules)
 ("m" "ih" "k" "f" "er" "s" "ax" "n")
 festival>

To quit festival, type ^d or (quit).

Festival vs Your LTS Model

Okay, now that you've loaded your new model in Festival, let's try it out! First, produce a wav file using the Festival pronunciation:

 festival> (set! utt1 (SayText "mcpherson"))
 festival> (utt.save.wave utt1 "pronun1.wav")

Now hear what it sounds like using your newly built letter-to-sound model:

 festival> (set! phn2 (lts_predict "mcpherson" my_lts_rules))
 festival> (set! utt2 (SayPhones phn2))
 festival> (utt.save.wave utt2 "pronun1.my_lts.wav")

Experiment a bit!

Can you find any pronunciations that are improved by using our LTS rules? (Try taking a look at names.lex.)

Are there any pronunciations that are particularly bad in both Festival and using our LTS rules? Why might that be? (Again, look at our names.lex training data.) What are some ways we could try to improve the pronunciation?



References

AT&T FSM Toolkit:

OpenFST:

Festival: