One of the most important tasks of text preprocessing is the expansion
of numeric expressions. Sequences of digits occur in different
contexts and are pronounced differently depending on that context. The
following formats are distinguished:
- Fractions
- The numerator is converted with the help of
``german_parse_cardinal'', the fraction bar is not
spoken and the denominator is converted by the function
``german_parse_fractal''. Unfortunately, year ranges
written in the same way (e.g. WS 97/98) are currently treated as
fractions and therefore pronounced incorrectly.
- Ratios
- Conversion of the cardinals in a ratio is done by
``german_parse_cardinal'' and ``zu'' is inserted
between them (``3:5'' becomes ``drei zu fünf'').
- Phone numbers
- Phone numbers have the same format as fractions,
except that the first number (the area code) begins with a zero.
A hyphen may be used instead of the slash. Phone
numbers are read digit by digit; the function
``german_parse_charlist'' is responsible for this
conversion.
- Numeral compositions
- In numeral compositions such as
``Jäger90'' and ``16jährig'', the number is converted with the
help of ``german_parse_cardinal'' and the attached
word is prepended or appended to the expanded number.
- Years
- Years between 1100 and 1999 are spoken in hundreds, e.g.
``fünfzehnhundertsiebenundsechzig'' (1567, in English ``fifteen
hundred sixty-seven''). As years cannot be reliably distinguished
from cardinals, all cardinals in the range specified above
are spoken like years.
- Dates
- Dates are written DAY.MONTH.YEAR. YEAR may have two or
four digits or be left out completely. The conversion is done with the
help of ``german_parse_cardinal'' and
``german_parse_ordinal''. The
``german_ordinal_prediction_tree'' is used to
determine the inflexional suffixes of ordinals.
- Time
- Time is written HOURS.MINUTES or HOURS:MINUTES, followed
by ``Uhr'' or ``h'' and spoken as HOURS ``Uhr'' MINUTES. The
conversion of hours and minutes is done by
``german_parse_cardinal''. We have to make sure that
the word ``Uhr'' is not spoken twice, once after the hours and again
after the minutes. Therefore, each token is checked to see whether the
preceding token belongs to a time format as well.
- Currencies
- Currencies are written as CARDINAL,CARDINAL followed
by a unit (e.g. ``15,60 DM''; ``7,89 sfr''; ...). They are
pronounced by inserting the unit between the two cardinals. The
numbers are converted using
``german_parse_cardinal''.
``german_fetch_currency'' looks up the unit. Again,
we have to make sure that the unit is not spoken twice.
- Floating point numbers
- All sequences of digits that contain a
comma and have not been handled so far are converted as floating
point (i.e., ``floating comma'') numbers. The digits to the left of
the comma are converted with
``german_parse_cardinal'', then the comma is
pronounced and finally the digits to the right of the comma are read
one by one with the help of
``german_parse_charlist''.
- Ordinals
- Ordinals are cardinals followed by a period. Thus, we
have to distinguish between ordinals and cardinals at the end of a
sentence. For this task we use a list of words that can only
appear with capitals at the beginning of a sentence. The
inflexional suffixes are determined with the help of the
``german_ordinal_prediction_tree''. The ordinal
is expanded by ``german_parse_ordinal''.
- Cardinals
- If cardinals are grouped by periods or blanks into
blocks of three digits for legibility, the separators are removed to
form a single contiguous sequence of digits, which is expanded with
the help of ``german_parse_cardinal''.
- Roman numerals
- Roman numerals are converted into Arabic numbers
with ``ger_tok_roman_to_numstring'' and then
expanded like ordinals. If the Roman numeral is preceded by the name
of a king, queen, emperor or empress,
the article ``der''/``die'' has to be inserted between the name
and the number.
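A few of the formats listed above can be illustrated with a minimal sketch. It is written in Python for readability and is not the code described in this section: all function names (``cardinal'', ``year_or_cardinal'', ``digits'', ``roman_to_int'', ``expand'') are hypothetical stand-ins, the cardinal expansion is assumed to cover only 0 to 999999, and only ratios, floating point numbers, grouped cardinals, plain cardinals (with the year rule) and Roman numerals are handled. Fractions, phone numbers, dates, times, currencies and ordinals need context or inflection and are left out.

```python
import re

# Hypothetical helper tables; "ein" is the combining form of 1.
ONES = ["null", "ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
        "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
        "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = ["", "", "zwanzig", "dreißig", "vierzig", "fünfzig", "sechzig",
        "siebzig", "achtzig", "neunzig"]
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def _below_100(n):
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    # German puts the ones before the tens: 21 -> "einundzwanzig".
    return TENS[tens] if ones == 0 else ONES[ones] + "und" + TENS[tens]

def _below_1000(n):
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + "hundert" if hundreds else ""
    return head + (_below_100(rest) if rest else "")

def cardinal(n):
    """Expand 0..999999 into a German cardinal (assumed range of this sketch)."""
    if not 0 <= n < 1_000_000:
        raise ValueError("sketch only covers 0..999999")
    if n == 0:
        return "null"
    thousands, rest = divmod(n, 1000)
    head = _below_1000(thousands) + "tausend" if thousands else ""
    word = head + (_below_1000(rest) if rest else "")
    # A final "ein" becomes "eins" (1 -> "eins", 101 -> "einhunderteins").
    return word + "s" if word.endswith("ein") else word

def year_or_cardinal(n):
    """Cardinals from 1100 to 1999 are spoken in hundreds, like years."""
    if 1100 <= n <= 1999:
        century, rest = divmod(n, 100)
        return _below_100(century) + "hundert" + (cardinal(rest) if rest else "")
    return cardinal(n)

def digits(s):
    """Read a digit string one digit at a time."""
    return " ".join(cardinal(int(d)) for d in s)

def roman_to_int(s):
    """Convert an upper-case Roman numeral to an integer."""
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total

def expand(token):
    """Try the formats in a fixed order; unknown tokens pass through."""
    m = re.fullmatch(r"(\d+):(\d+)", token)          # ratio
    if m:
        return cardinal(int(m[1])) + " zu " + cardinal(int(m[2]))
    m = re.fullmatch(r"(\d+),(\d+)", token)          # floating point
    if m:
        return cardinal(int(m[1])) + " Komma " + digits(m[2])
    if re.fullmatch(r"\d{1,3}(\.\d{3})+", token):    # grouped cardinal
        return cardinal(int(token.replace(".", "")))
    if re.fullmatch(r"\d+", token):                  # plain cardinal or year
        return year_or_cardinal(int(token))
    return token
```

With these definitions, expand("3:5") gives ``drei zu fünf'' and expand("1567") gives ``fünfzehnhundertsiebenundsechzig'', as in the examples above. The order of the tests matters in the same way as in the list: a more specific format has to be tried before the more general one that would also match it.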
Gregor Moehler
2001-07-17