Another important task of text preprocessing is the handling of
abbreviations. They are a major challenge for the text preprocessing
component of a TTS system, both because they occur so frequently and
because there are no binding rules for their formation. Very
often the expansion of an abbreviation is ambiguous and can only be
resolved using the semantic or pragmatic context of the sentence.
- Duden-style abbreviations
- Abbreviations following the
classification scheme given in the ``Duden''[13] are
recognized in the function
``german_token_to_words'' and expanded in the
function ``ger_lookup_comb_abbr''. The
pronunciations of the resulting words are then looked up separately.
If no entry is found, they are spelled out letter by letter.
- Units
- For a unit to be recognized, the preceding token must be
a number and the abbreviation has to be found in
``ger_masseinheit_teststring'' or
``ger_masseinheit_teststring2''. If so, the unit is
converted on the basis of the information in
``ger_abbr_masseinheiten_dim_tab'' and
``ger_abbr_masseinheiten_tab''.
- Abbreviations of length 1
- Tokens consisting of a single letter are
always treated as abbreviations. They are not expanded, because they
usually have many different meanings.
- Abbreviations consisting of consonants only
- Tokens consisting
of consonants only are always treated as abbreviations, because they
are not pronounceable in German. The abbreviation is looked up in the
abbreviation tables by ``ger_translate_abbr''. If no
entry is found, the abbreviation is spelled out letter by letter.
- Abbreviations consisting of capital letters only
- If a token
with only capital letters is found in one of the abbreviation tables,
it is recognized as an abbreviation. Otherwise it is spoken like a
normal word, because words are often written in capital letters for
emphasis.
- Abbreviations followed by a period
- If a token has a period as its
punctuation feature, it is looked up in the appropriate table. If
found, it is expanded and the period is deleted from the punctuation
feature. Otherwise the period is assumed to mark the end of a
sentence.
- Ambiguous tokens
- Abbreviations that also appear as regular words
pose a special problem. For example, ``Art.'' may be the
abbreviation for ``Artikel'' or the word ``Art'' at the end of a
sentence. Resolving this would require taking the context
of such abbreviations into account, which is not yet implemented.
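The rules listed above can be read as a cascade of checks on a single token. The following Python sketch illustrates that reading only and is not the actual implementation, which lives in functions such as ``german_token_to_words''. The tables ABBR_TABLE and UNIT_TABLE, the function expand_token, the vowel set, and the order in which the rules are tried are all assumptions made for this example. Duden-style combined abbreviations are left out.

```python
import re

# Hypothetical miniature tables; the real system keeps its own, larger ones.
ABBR_TABLE = {
    "Nr": ["Nummer"],
    "bzw": ["beziehungsweise"],
    "EU": ["Europäische", "Union"],
}
UNIT_TABLE = {"km": "Kilometer", "kg": "Kilogramm"}

# Assumption: these letters count as vowels when testing pronounceability.
VOWELS = set("aeiouyäöü")


def expand_token(token, prev_token="", has_period=False):
    """Return (words, keep_period) for one token.

    keep_period tells the caller whether a trailing period still
    counts as sentence-final punctuation after the expansion.
    """
    # Units: the preceding token must be a number and the unit must be listed.
    if re.fullmatch(r"\d+([.,]\d+)?", prev_token) and token in UNIT_TABLE:
        return [UNIT_TABLE[token]], has_period

    # Length 1: always an abbreviation, but never expanded.
    if len(token) == 1 and token.isalpha():
        return [token], has_period

    # Followed by a period: expand if listed and drop the period.
    if has_period and token in ABBR_TABLE:
        return list(ABBR_TABLE[token]), False

    # Consonants only: look up, otherwise spell letter by letter.
    if token.isalpha() and not any(c in VOWELS for c in token.lower()):
        if token in ABBR_TABLE:
            return list(ABBR_TABLE[token]), has_period
        return list(token), has_period

    # Capitals only: an abbreviation only if listed.
    if token.isupper() and token in ABBR_TABLE:
        return list(ABBR_TABLE[token]), has_period

    # Everything else is spoken like a normal word; an unexpanded
    # period keeps its role as end-of-sentence marker.
    return [token], has_period
```

For instance, expand_token("km", prev_token="12") yields the unit name, while expand_token("Haus", has_period=True) returns the token unchanged and keeps the period as a sentence boundary. The ambiguous case of ``Art.'' is not handled here either: it would need an ABBR_TABLE entry plus the context check that the text describes as missing.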
Gregor Moehler
2001-07-17