23rd February 2011 , 04:45 PM #1 Junior FaaDoO Engineer
Case study of NETtalk - Text-to-Speech Technology
The NETtalk network was part of a larger system for mapping English words as text into the corresponding speech sounds. NETtalk was configured and trained in a number of different ways. This study considers only the network that was trained on "continuous informal speech".
The task of reading aloud
For a human to read aloud requires -
- recognition of sequences of characters on a page
- a mapping between the visual recognition and an understanding of the pattern as analogous to a spoken or auditory sequence, which embodies grammatical, pronunciation and intonation rules of English,
- and the production of the sequence as speech by applying fine motor control with auditory feedback.
NETtalk converts human language in the form of a machine-readable text file into a stream of symbols representing phonemes with the use of a neural network. A separate speech synthesis system was applied to the phonemes to produce speech sounds.
NETtalk and human reading
NETtalk was intended to be a real-time text-to-speech system although the NETtalk network can operate on its own as a data translator without real-time constraints.
The NETtalk network addresses the mapping task from a symbolic representation of English text to a symbolic representation of the corresponding spoken words. Whether the task addressed by the network bears any useful analogy to a human cognitive process is not important to the utility of the network, as the whole text-to-speech system can be judged in comparison to a human reader. The network is intended to be used as a part of a practical artifact, rather than as a psychological model to give useful insights into the workings of the human mind.
Network input encoding
Network inputs were required to represent a sequence of seven characters selected from a set of 29, comprising the letters of the Roman alphabet plus three punctuation marks. These were encoded as seven sets of 29 input units, where for each set of 29 units a character was represented as a pattern with one unit "on" and each of the others "off".
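The input scheme above can be sketched in a few lines of Python. This is an illustrative reconstruction, not NETtalk's actual code, and the choice of the three punctuation symbols here (space, comma, period) is an assumption:

```python
# Hypothetical sketch of NETtalk's one-hot input encoding.
# Alphabet: 26 letters plus three punctuation symbols (assumed here
# to be space, comma, and period) = 29 symbols; a window of seven
# characters therefore needs 7 * 29 = 203 input units.
ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."  # 29 symbols (assumption)

def encode_window(window):
    """Encode a 7-character window as 203 binary input values."""
    assert len(window) == 7
    units = []
    for ch in window:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(ch)] = 1  # exactly one unit "on" per character
        units.extend(one_hot)
    return units

inputs = encode_window("a cat s")
print(len(inputs), sum(inputs))  # → 203 7 (203 units, seven of them "on")
```

Because each set of 29 units has exactly one unit active, the code is local: no two characters share any input units, which simplifies the input-to-hidden mapping.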
Network output encoding
Network outputs consist of 26 units to represent a single phoneme. Each output unit was used to represent an "articulatory feature" of which most phonemes had about three. The articulatory features correspond to actions or positions of the mouth during speech.
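Decoding the distributed output can be sketched as a nearest-neighbour match against the phoneme feature vectors. The feature assignments below are invented for illustration; NETtalk's actual articulatory feature table differs:

```python
# Hedged sketch of decoding the 26-unit output into a phoneme. Each
# phoneme is defined by which of the 26 articulatory features it
# activates (typically about three), and the network's real-valued
# outputs are matched to the closest phoneme vector.
import math

# Hypothetical phoneme -> active-feature-indices table (not NETtalk's real one).
PHONEME_FEATURES = {
    "/p/": {0, 5, 11},   # e.g. labial, stop, unvoiced
    "/b/": {0, 5, 12},   # e.g. labial, stop, voiced
    "/m/": {0, 12, 20},  # e.g. labial, voiced, nasal
}

def nearest_phoneme(outputs):
    """Pick the phoneme whose feature vector is closest to the 26 outputs."""
    def distance(features):
        target = [1.0 if i in features else 0.0 for i in range(26)]
        return math.dist(outputs, target)
    return min(PHONEME_FEATURES, key=lambda p: distance(PHONEME_FEATURES[p]))

noisy = [0.9 if i in {0, 5, 11} else 0.1 for i in range(26)]
print(nearest_phoneme(noisy))  # → /p/
```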
The network as a data translator
NETtalk takes a stream of individual characters as input and returns a stream of single phonemes as output.
The cognitive task of reading aloud requires that a timeless representation of language as text be mapped into a time-sequential representation of language as speech. The task of the NETtalk network is to perform a specific part of this temporal expansion. The MLP itself produces a sequence of phonemes, which is essentially another timeless representation, and it is left to a separate speech synthesis system to perform the actual conversion to a real-time signal.
There is an implied sense of time in English text, such that bringing it to life from a timeless representation requires that it be scanned in sequence at a certain rate. For a human reader, the process is more complex than scanning one character, or even a number of characters, at a time, as the visual system may process the forms of characters in combination and words as a whole. Characters are perceived in the context of the other characters around them, and the context of the text read and understood up to that point. The use of a buffered input system gives a temporal context to processing, but this context only extends six characters into the past because the process of shifting characters along in a fixed memory causes the oldest character to be forgotten, and no effect of its presence carries into further processing. The window of seven characters that forms the input to the NETtalk network mimics the visual context available to a human reader, but leaves no way of representing the wider context of sentence structure.
The stages of processing between time steps that cause a change in memory through time are
- the calculation of activations, leading to a new value for the phoneme output,
- the shifting of input characters to make room for the next input character,
- and setting the value of the new input character.
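The shifting buffer in the second and third steps can be sketched as a generator over the text. This is an illustrative reconstruction, not NETtalk's implementation, and the blank-padding at the start is an assumption:

```python
# Sketch of the buffered input described above: at each time step the
# seven-character window shifts left, the oldest character drops out
# of memory, and the next text character enters on the right.
from collections import deque

def windows(text, size=7, pad=" "):
    """Yield successive fixed-size windows over the text."""
    buf = deque(pad * size, maxlen=size)  # start with a blank buffer
    for ch in text:
        buf.append(ch)  # shifting: the oldest character is forgotten
        yield "".join(buf)

for w in windows("a cat"):
    print(repr(w))
# The final window is '  a cat'; nothing earlier than seven
# characters back survives in the buffer.
```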
There is clearly a great deal of structural meaning that can be revealed in the cognitive task of reading aloud. Grammatical, pronunciation and intonation rules for the English language are extremely complex and have been the subject of centuries of study. The mapping from text to phonemes comprises combinatorial regularities with many exceptions and families of exceptions. A feature of the task is that most of the rules of pronunciation can be captured in a context window of seven characters. One of the specific reasons Sejnowski and Rosenberg chose the automatic learning system was that there is no need to know the details and intricacies of the mapping. The encoding scheme used for inputs and outputs was designed to make the task as convenient to learn as possible. The inputs used local codes for each letter, thus simplifying the input to hidden unit mapping. The output codes used a distributed representation which to a large extent captured the combinatorial structure of the task.
The network during training
The connection weights are the most significant stored element. The values of the connection weights reflect the whole history of training data, with each pattern leaving its mark. Learning is very slow. An individual pattern only changes the weight values by a very small amount, so the whole pattern can not automatically be reconstructed from the weights. The weights act as a means of compressing the information in the training set into a small fixed amount of data.
The network was trained by presenting patterns in sequence and changing the weights slightly after each presentation. Windows of seven characters were presented in sequence. In this sense the temporal information in the data was present and used during training. If the training set is seen as a collection of sequences of seven characters, each with a corresponding phoneme, the order in which such a training set is presented to the network should not greatly affect the mapping formed by the network, and in this sense the temporal information in the data does not carry the same meaning during training.
The connection weight values change by a small amount at each step in the learning process in response to the input and target pair presented. Each pattern presented leaves its mark on the weights as a subtle shift in each value. A feedforward network maps any particular input pattern from its input domain uniquely to a particular output pattern, regardless of any previous activations. It can be seen as a timeless mapping from domain to range. The order of presentation of patterns during the learning phase should not affect the ability of the network to form the required mapping. Nevertheless, the order of presentation of patterns during the learning phase does affect the search path through weight space because the weights record cumulative changes. The repetition of the training set over many training "epochs" serves to reduce the tendency for more recently seen patterns to cause earlier patterns to be gradually "forgotten".
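The per-pattern update described above can be illustrated with a single linear unit trained by the delta rule. This is a deliberately minimal stand-in for NETtalk's backpropagation-trained MLP: each presentation nudges the weights only slightly, and repeating the set over many epochs is what lets the network settle on a mapping rather than simply tracking the most recent patterns:

```python
# Minimal sketch of online (per-pattern) weight updates. A single
# linear unit is trained with the delta rule on three patterns; the
# weights accumulate a small change after every presentation, so they
# record the whole history of training, and repetition over many
# epochs counteracts the forgetting of early patterns.
patterns = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
weights = [0.0, 0.0]
rate = 0.1  # small learning rate: each pattern leaves only a subtle mark

for epoch in range(200):
    for x, target in patterns:
        output = sum(w * xi for w, xi in zip(weights, x))
        error = target - output
        for i in range(len(weights)):
            weights[i] += rate * error * x[i]  # tiny shift per pattern

print([round(w, 2) for w in weights])  # → [1.0, 0.0]
```

The final weights fit all three patterns exactly, and presenting the patterns in a different order would trace a different path through weight space while still converging on the same mapping.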
An untrained network is initialized with random weight values which evolve through a slow learning process towards a structure that embodies the required functional mapping. The rules of the mapping are holistically encoded into the set of connection weight values. The encoding of the network outputs influences the structure that is formed in the network. The text-to-phoneme mapping was known to be quasi-combinatorial in that there are regularities and regular exceptions to those regularities. Some understanding of the structure of the mapping was expressed in the design of the output encoding, which encodes each type of mouth position on a different output unit. The mapping task is reduced to a mapping from text to whether the phoneme is voiced, another mapping from text to nasalization of a phoneme, and so on for each of the 26 articulatory features.
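The decomposition into per-feature mappings can be sketched as 26 independent binary questions asked of the same input window. The feature names and toy predicates below are illustrative only; in NETtalk the trained network supplies the per-feature decisions:

```python
# Sketch of the decomposition described above: one binary predictor
# per articulatory feature, all applied to the same character window.
# Feature names and rules here are invented for illustration.
FEATURES = ["voiced", "nasal", "labial"]  # ... up to 26 in NETtalk

def classify_window(window, feature_predictors):
    """Apply one binary predictor per articulatory feature."""
    return {f: feature_predictors[f](window) for f in FEATURES}

# Toy stand-in predicates keyed on the window's centre character
# (index 3 of seven); the trained network would supply these.
predictors = {
    "voiced": lambda w: w[3] in "bdgvz",
    "nasal":  lambda w: w[3] in "mn",
    "labial": lambda w: w[3] in "bpm",
}
print(classify_window(" a bat ", predictors))
# → {'voiced': True, 'nasal': False, 'labial': True}
```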
NETtalk performs an important part of a mapping from the spatial domain of written language to the temporal domain of spoken language, capturing much of the structure of a complex domain. The ability of the network to learn the required mapping is assisted by the design of the input and output encodings, though the network takes a long time to learn the mapping and replication of the simulation is difficult.
Sejnowski, T. J. and Rosenberg, C. R. (1986) NETtalk: a parallel network that learns to read aloud, Cognitive Science, 14, 179-211.
Crystal, D. (1987) The Cambridge Encyclopedia of Language, Cambridge University Press
Bunge, M. (1973) Method, Model and Matter, D. Reidel, Dordrecht