Multext - Document MSG 1. MtSeg/Tools. Last modified


MtSeg: Subtools


The Multext segmenter consists of the set of subtools described in the sections below. The user is free to chain together whichever subtools serve his or her application needs. MtSeg is a pre-defined script which invokes the entire segmentation subtool chain, in a logical order.


Entire segmentation : mtseg

Cutting at spaces : mtsegspace

Isolating punctuation : mtsegpunct

Splitting tokens : mtsegsplit

Identifying abbreviations : mtsegabbrev

Merging compounds: mtsegmerge

Identifying tokens by regular expressions : mtsegregex

Identifying sentence boundaries : mtsegsent


Entire segmentation : mtseg

Because the segmenter is designed as a set of tools, a script is provided which runs them in the following logical order:

  1. mtsegspace (to split text at spaces)
  2. mtsegpunct (to isolate punctuation marks)
  3. mtsegmerge (with a resource file dedicated to punctuation, to merge compound punctuation)
  4. mtsegsplit (to split tokens with internal punctuation)
  5. mtsegmerge (with a resource file dedicated to abbreviations, to merge compound abbreviations)
  6. mtsegabbrev (to identify abbreviations)
  7. mtsegmerge (to recombine multiword units)
  8. mtsegregex (with a resource file dedicated to dates, to identify dates)
  9. mtsegregex (with a resource file dedicated to numbers, to identify numbers)
  10. mtsegregex (with a resource file dedicated to enumerations, to identify enumerations)
  11. mtsegsent (to detect sentence boundaries)
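
The ordering above can be sketched as data plus a small driver. This is only an illustration in Python, not the mtseg script itself: the `CHAIN` table, the `run_chain` function, and the resource labels (descriptive placeholders, not actual file names) are assumptions, and each stage is a hypothetical callable supplied by the caller.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (tool, resource label) pairs in the order listed above.
# The labels are descriptive placeholders, not real resource file names.
CHAIN: List[Tuple[str, Optional[str]]] = [
    ("mtsegspace", None),
    ("mtsegpunct", None),
    ("mtsegmerge", "punctuation"),
    ("mtsegsplit", None),
    ("mtsegmerge", "abbreviations"),
    ("mtsegabbrev", None),
    ("mtsegmerge", "multiword units"),
    ("mtsegregex", "dates"),
    ("mtsegregex", "numbers"),
    ("mtsegregex", "enumerations"),
    ("mtsegsent", None),
]

# A stage takes the current lines and a resource label and returns new lines.
Stage = Callable[[List[str], Optional[str]], List[str]]


def run_chain(lines: List[str], stages: Dict[str, Stage],
              chain: List[Tuple[str, Optional[str]]] = CHAIN) -> List[str]:
    """Feed the output of each stage into the next one, in chain order."""
    for tool, resource in chain:
        lines = stages[tool](lines, resource)
    return lines
```

The table makes it visible that mtsegmerge and mtsegregex are each invoked three times, differing only in the resource they are given.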


Spaces : mtsegspace

This tool is the very first in the segmentation phase. It breaks the text down at spaces and tabs, and generates one token per resulting element. The REF field is updated with the location of the first character of the token in the SGML element (or in the line, for plain text). All tokens are assigned to the class TOKEN by this subtool.

It should be noted that in this phase, punctuation marks remain attached to the preceding/following token.
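
As a rough illustration of this step, the Python sketch below cuts a line at spaces and tabs and records where each token starts. The function name `split_at_spaces`, the tuple layout, and the 0-based offsets are assumptions made for the example; the actual REF format is not reproduced here.

```python
import re
from typing import List, Tuple


def split_at_spaces(line: str) -> List[Tuple[int, str, str]]:
    """Return (offset, class, text) for every run of non-blank characters.

    Offsets are 0-based positions in the line (an assumption of this sketch).
    Punctuation stays attached to its neighbouring token at this stage.
    """
    return [(m.start(), "TOKEN", m.group())
            for m in re.finditer(r"[^ \t]+", line)]


for offset, cls, text in split_at_spaces("Hello, world.\tBye"):
    print(offset, cls, text)
```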

This tool uses the following resource:

  1. the language-specific class definition resource (tbl.classes.xx).


Punctuation : mtsegpunct

The purpose of mtsegpunct is to identify and isolate punctuation marks. The rules defining which characters comprise punctuation and how they should be treated are defined in a resource file using regular expressions.

Note that this subtool can be set (in the resource file) to leave punctuation internal to a token untouched, in order to avoid separating compound words in this phase. For example, "aujourd'hui" and "porte-manteau" are not cut at the apostrophe and hyphen by this subtool. A later subtool, mtsegsplit, determines whether or not to break these strings at the punctuation mark.
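
The behaviour described above might be sketched as follows in Python. This is an illustration only: the real rules come from regular expressions in the resource file, whereas this sketch assumes a small hard-coded character set (`PUNCT_CHARS`) and a hypothetical function `isolate_punct`. The period is deliberately left out of the set, since final periods are handled later, once abbreviations are known.

```python
from typing import List, Tuple

# Assumed punctuation set for the example; the period, apostrophe and
# hyphen are intentionally excluded.
PUNCT_CHARS = ',;:!?"()'


def isolate_punct(token: str) -> List[Tuple[str, str]]:
    """Detach leading and trailing punctuation; leave internal marks alone."""
    start = 0
    while start < len(token) and token[start] in PUNCT_CHARS:
        start += 1
    end = len(token)
    while end > start and token[end - 1] in PUNCT_CHARS:
        end -= 1
    pieces = [(c, "PUNCTUATION") for c in token[:start]]
    if start < end:
        pieces.append((token[start:end], "TOKEN"))
    pieces += [(c, "PUNCTUATION") for c in token[end:]]
    return pieces


print(isolate_punct("(aujourd'hui,"))
print(isolate_punct("porte-manteau"))
```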

This tool uses the following resources :

  1. the language-specific class definition resource (tbl.classes.xx).
  2. the language-specific punctuation definitions resource (tbl.punct.xx)


Splitting tokens: mtsegsplit

The role of mtsegsplit is to split tokens with internal punctuation, such as the French "viens-tu", where it is appropriate. Note that this program splits only affixes that are typographically marked (apostrophe, hyphen) and not agglutinated tokens (such as "damelo" in Italian or "du" in French), which should be dealt with at the lexical level if necessary.

In order to determine whether or not a token containing internal punctuation is to be split, mtsegsplit consults a user-defined list declaring those strings which should be split, as they appear to the left or the right of internal punctuation. Whenever a token contains an apostrophe and/or a hyphen, if the string to the left or right of the apostrophe or hyphen is not in the list, the token is not broken down (e.g., in French, "aujourd'hui" remains unsplit because "aujourd'" and "'hui" are not in the list).

Note that this tool does not split the ending period from tokens because at this step, abbreviations have not been detected yet.
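
A minimal Python sketch of the lookup described above is given here. It assumes that list entries keep their punctuation mark (as "aujourd'" and "'hui" do in the example), and it models the user-defined list as two hypothetical sets, `left_parts` and `right_parts`; the actual layout of the split resource is not shown in this document.

```python
from typing import List, Set


def split_token(token: str, left_parts: Set[str],
                right_parts: Set[str]) -> List[str]:
    """Split at an internal apostrophe or hyphen only if the piece to the
    left or to the right (mark included) appears in the user-defined list."""
    for i, ch in enumerate(token):
        if ch not in "'-" or i == 0 or i == len(token) - 1:
            continue  # not an internal mark
        left, right = token[: i + 1], token[i:]
        if left in left_parts:
            return [left] + split_token(token[i + 1:], left_parts, right_parts)
        if right in right_parts:
            return split_token(token[:i], left_parts, right_parts) + [right]
    return [token]


print(split_token("viens-tu", {"l'"}, {"-tu"}))
print(split_token("aujourd'hui", {"l'"}, {"-tu"}))
```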

This tool uses the following resources :

  1. the language-specific class definition resource (tbl.classes.xx).
  2. the language-specific punctuation definitions resource (tbl.punct.xx)
  3. the language-specific 'split' definitions resource (tbl.split.xx).


Abbreviations : mtsegabbrev

To determine whether or not a given token is an abbreviation, mtsegabbrev consults a resource file containing a user-defined list of abbreviations.

Abbreviations composed of a series of capital letters separated by dots (such as E.E.C.) are recognized as abbreviations even if they are not in the abbreviation file.

Whenever a token ending with a period or dots ("...") is not recognised as an abbreviation, the punctuation mark is stripped off and output as a separate token with type PUNCTUATION according to the punctuation resource file, and the references are updated.
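
The three rules above can be illustrated with the Python sketch below. The function name `classify`, the class name ABBREVIATION, and the restriction to ASCII capitals in the pattern are assumptions of the example, not details taken from the tool.

```python
import re
from typing import List, Set, Tuple

# A series of capital letters separated by dots, e.g. "E.E.C."
# (ASCII capitals only; an assumption of this sketch).
INITIALS = re.compile(r"[A-Z](?:\.[A-Z])+\.?")


def classify(token: str, abbreviations: Set[str]) -> List[Tuple[str, str]]:
    """Keep known or dotted-capital abbreviations whole; otherwise strip a
    final "..." or "." and emit it as a separate PUNCTUATION token."""
    if token in abbreviations or INITIALS.fullmatch(token):
        return [(token, "ABBREVIATION")]
    for mark in ("...", "."):
        core = token[: -len(mark)]
        if token.endswith(mark) and core and not core.endswith("."):
            return [(core, "TOKEN"), (mark, "PUNCTUATION")]
    return [(token, "TOKEN")]


print(classify("E.E.C.", set()))
print(classify("house.", {"etc."}))
```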

This tool uses the following resources :

  1. the language-specific class definition resource (tbl.classes.xx).
  2. the language-specific punctuation definitions resource (tbl.punct.xx)
  3. the language-specific abbreviations resource (tbl.abbrev.xx).


Merging : mtsegmerge

The mtsegmerge program is intended to recombine fixed multi-token units such as "in spite of", and non-agglutinated compounds (common in German and Dutch, and which exist to some extent in most languages).

The user is free to define whatever compounds he or she wishes in the "compound" file.

Note that mtsegmerge is not dedicated to re-combining compound words only; it can be invoked or even re-invoked in the subtool chain with other resource files as input, which define other objects to be re-combined. For example, it can be called with the punctuation definitions resource file as input, in which unbreakable punctuation sequences (such as "...") are defined. This subtool can be used to re-combine these sequences that were broken down by earlier subtools.

This tool uses the following resources :

  1. the language-specific class definition resource (tbl.classes.xx).
  2. the language-specific "compound-words" definition resource (tbl.merge.xx).
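
The merging described in this section might be sketched as below. This is an assumption-laden illustration in Python: the `units` dictionary stands in for whatever resource file is passed to the tool, mapping a token sequence to its merged form, and the longest-match-first strategy is a choice made for the example rather than documented behaviour.

```python
from typing import Dict, List, Tuple


def merge_units(tokens: List[str],
                units: Dict[Tuple[str, ...], str]) -> List[str]:
    """Replace every listed token sequence by its merged form,
    trying the longest candidate first at each position."""
    max_len = max((len(u) for u in units), default=1)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in units:
                out.append(units[candidate])
                i += n
                break
        else:  # no unit starts here: copy the token unchanged
            out.append(tokens[i])
            i += 1
    return out


units = {("in", "spite", "of"): "in spite of", (".", ".", "."): "..."}
print(merge_units(["He", "left", "in", "spite", "of", "it", ".", ".", "."], units))
```

Calling the same function with a different `units` table mirrors the way mtsegmerge is re-invoked with different resource files.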


Identifying tokens by regular expressions : mtsegregex

The purpose of mtsegregex is to identify tokens, whether single or composed of several tokens, by means of regular expressions. Like mtsegmerge, it can recombine multi-token units where necessary; here the units are defined by regular expressions. For example, it can be used to identify dates or numbers.
A left and right context can be associated with the expression. An enumeration, for example, is only recognized if the full expression is preceded by the beginning of a paragraph.

In an expression, a space character (' ') must be written as an underscore ('_').

This tool uses the following resources :

  1. the language-specific class definition resource (tbl.classes.xx).
  2. a user-defined, language-specific expressions resource (tbl.date.xx, tbl.enum.xx).

The specific resource given to this tool MUST have one entry beginning with 'MT_' which specifies the full structure of the token to recognize.

Example:

MT_MYEXPRESSION [1-9]*! MYCLASS

Three optional directives are also available:

  1. MT_MAX_EXPR_LENGTH sets the maximum size (in number of tokens) of the expression (default is set to 14).
  2. MT_CLASS_BEFORE specifies the class(es) which must precede the expression (default is set to ANY CLASS).
  3. MT_CLASS_AFTER specifies the class(es) which must follow the expression (default is set to ANY CLASS).
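
To make the interplay of the expression, the underscore convention, and the three directives more concrete, here is a hedged Python sketch. It uses Python's `re` syntax, which may differ from the syntax the tool accepts; the function `match_expression`, its parameters, the DATE class, and the choice to keep the underscore in the merged token are all assumptions of the example.

```python
import re
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (text, class)


def match_expression(tokens: List[Token], pattern: str, cls: str,
                     max_len: int = 14,
                     class_before: Optional[Set[str]] = None,
                     class_after: Optional[Set[str]] = None) -> List[Token]:
    """Join up to max_len tokens with '_' and merge the longest window that
    fully matches the pattern and satisfies the context constraints."""
    regex = re.compile(pattern)
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        merged = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            text = "_".join(t for t, _ in tokens[i:i + n])
            before_ok = class_before is None or (
                i > 0 and tokens[i - 1][1] in class_before)
            after_ok = class_after is None or (
                i + n < len(tokens) and tokens[i + n][1] in class_after)
            if before_ok and after_ok and regex.fullmatch(text):
                merged, i = (text, cls), i + n
                break
        if merged is None:
            merged, i = tokens[i], i + 1
        out.append(merged)
    return out


toks = [("on", "TOKEN"), ("12", "TOKEN"), ("March", "TOKEN"),
        ("1996", "TOKEN"), (".", "PUNCTUATION")]
print(match_expression(toks, r"[0-9]{1,2}_[A-Z][a-z]+_[0-9]{4}", "DATE"))
```

Passing `None` for either context set plays the role of the ANY CLASS default.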





Sentences : mtsegsent

The purpose of mtsegsent is to identify sentence boundaries in a text. The tool does not modify the input lines it receives; rather, this tool simply adds a new line for each end-of-sentence boundary it detects.

Once the segmenter's subtools have assigned a class to each token, this tool uses that information to locate sentence boundaries. Each segmenter class is assigned a particular property, defined in the resource file tbl.sent.xx.

In fact, each class can be considered either as:

This tool uses the following resources :

  1. the language-specific class definition resource (tbl.classes.xx).
  2. the language-specific sentences directives definitions resource (tbl.sent.xx).



Copyright © Centre National de la Recherche Scientifique, 1996.