Asian Linguistic Suite
Teragram accurately and efficiently handles the complex processing of Asian languages such as Japanese, Chinese and Korean, each of which can be encoded in several formats. Teragram also provides services to enable your software for Unicode. Software vendors for text search, e-commerce search and many other text applications, as well as Internet service providers, are currently using various components of Teragram language suites to achieve completeness, accuracy and speed.
Teragram offers a complete set of Asian language tools and libraries that handle the subtle complexities of Asian text processing. Tasks such as character encoding recognition and mapping, word tokenization, and morphological stemming are basic processing requirements for Asian languages such as Chinese, Japanese and Korean. Asian languages are additionally complex because no single standard character encoding exists. For example, the number of characters used in Japanese is far greater than the 256 available in a single-byte encoding. Therefore, any Japanese encoding system (EUC, Shift-JIS, Unicode, etc.) uses at least some two-byte character encoding.
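The point about multi-byte encodings can be checked with a small sketch. It uses only the codecs that ship with Python, not any Teragram component, and the function name and sample string are illustrative assumptions.

```python
def encoded_lengths(text, encodings=("euc_jp", "shift_jis", "utf-8", "utf-16-le")):
    """Return the number of bytes `text` occupies in each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}


# Three Japanese characters never fit in three bytes.
print(encoded_lengths("日本語"))
# {'euc_jp': 6, 'shift_jis': 6, 'utf-8': 9, 'utf-16-le': 6}

# Plain ASCII stays at one byte per character in all but UTF-16.
print(encoded_lengths("abc"))
# {'euc_jp': 3, 'shift_jis': 3, 'utf-8': 3, 'utf-16-le': 6}
```

The same text yields different byte sequences and lengths in each encoding, which is why the encoding has to be known before any further processing.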
Teragram also provides tools to break texts into sequences of words rather than sequences of characters. This functionality is the critical first step toward any intelligent text processing in languages such as Chinese, Japanese and Korean. In addition to providing information on word sequences, Teragram's segmenting tools can associate part-of-speech information with each word, further expediting the complex task of classifying Asian language input text.
Teragram offers solutions to the following challenges of processing Asian language texts:
First, there are numerous standards used to represent Asian characters in text. Usually these standards are used implicitly, with no obvious way to identify them. Teragram language and character encoding recognition automatically recognizes the language and encoding used by any text.
Second, when processing text, it is critical to be able to map and unify all documents into a unique encoding. Teragram character mapping software enables you to convert between any two encodings, and also allows you to convert text into the portable and flexible Unicode representation.
Third, Asian language texts are written with little or no use of word separators. Chinese and Japanese text is written without any space separations, and Korean with limited space separations. Therefore, the first task of any information processing system is to segment the initial text into a sequence of words (this process is called word segmentation). This task is performed by Teragram word segmentation software.
And finally, there is a need for morphological analysis. This is apparent in the context of information retrieval. In English, relating a word like "children" to its root "child" is an obvious necessity. The importance of morphology, however, is even greater in a language like Chinese, Japanese or Korean. In fact, Asian text is written with limited or no space separations. The task of segmenting the initial text into a sequence of words is strongly related to the morphological analysis process. For example, recognizing that a given sequence of characters is a word usually means that the word has been recognized with a specific part of speech (verb, noun, etc.).
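The first challenge listed above, identifying an encoding that is not declared anywhere, can be illustrated with a deliberately naive approach: try to decode the bytes strictly under each candidate encoding and keep the ones that succeed. This is only a sketch and is not how Teragram's recognizer works; the function name and the candidate list are assumptions made for the example.

```python
CANDIDATES = ("utf-8", "euc_jp", "shift_jis", "gb2312", "big5", "euc_kr")


def candidate_encodings(data, encodings=CANDIDATES):
    """Return every encoding under which `data` decodes without error."""
    matches = []
    for enc in encodings:
        try:
            data.decode(enc)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        matches.append(enc)
    return matches


sample = "日本語のテキスト".encode("euc_jp")
print(candidate_encodings(sample))
```

Trial decoding often leaves several candidates, because the same bytes can be valid in more than one encoding. Telling them apart requires additional evidence, such as character frequency statistics for each language.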
Character Encoding Mapping
Teragram provides solutions for the growing complexity of recognizing, manipulating and converting numerous character encodings, including one-byte and multiple-byte encodings. In particular, Teragram software can map texts into Unicode and handle Unicode documents, using UTF-8, UCS-2, UCS-4 or other encodings. The Teragram Character Encoding Mapping Toolkit handles more than 200 language-character encodings (including Unicode, UTF-8, UCS-2, UCS-4, Shift-JIS, JIS, EUC, GB, extended GB, Big5, KSC, EBCDIC, ISO, Microsoft Code Page, IBM, Cyrillic, Latin-1, MacOS and many others) and allows mapping between any two of them. Teragram's Character Encoding API is designed to meet three important requirements: speed, simplicity and precision. First, the need for speed is obvious: character mapping operations are used so extensively that they must be quick. Second, Teragram pushes simplicity to its limit, so that any character encoding mapping can be done with five functions: two load and free data, one maps from an encoding to Unicode, one maps from Unicode to another encoding, and the last maps between any pair of encodings. The third requirement is the precise determination of the encoding. The pivot encoding in this API is the UCS-2 representation of Unicode. Teragram additionally provides a wide range of string manipulation tools for the Unicode standard.
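The five-function design with a Unicode pivot can be sketched as follows. This is not Teragram's API: every function name here is hypothetical, the codecs come from Python's standard library, and UTF-16-LE stands in for the UCS-2 pivot (the two agree for characters in the Basic Multilingual Plane).

```python
import codecs

PIVOT = "utf-16-le"  # stand-in for UCS-2 (assumption of this sketch)
_loaded = {}         # canonical encoding name -> codec information


def load_encoding(name):
    """Load the mapping data for `name`; raises LookupError if unknown."""
    info = codecs.lookup(name)
    _loaded[info.name] = info
    return info.name


def free_encoding(name):
    """Release the mapping data for `name`, if it was loaded."""
    _loaded.pop(codecs.lookup(name).name, None)


def to_unicode(data, source):
    """Map bytes in the loaded `source` encoding to pivot bytes."""
    text, _ = _loaded[codecs.lookup(source).name].decode(data)
    return text.encode(PIVOT)


def from_unicode(pivot_bytes, target):
    """Map pivot bytes to the loaded `target` encoding."""
    encoded, _ = _loaded[codecs.lookup(target).name].encode(pivot_bytes.decode(PIVOT))
    return encoded


def convert(data, source, target):
    """Map between any pair of loaded encodings by going through the pivot."""
    return from_unicode(to_unicode(data, source), target)


load_encoding("shift_jis")
load_encoding("euc_jp")
sjis_bytes = "日本語".encode("shift_jis")
print(convert(sjis_bytes, "shift_jis", "euc_jp") == "日本語".encode("euc_jp"))  # True
free_encoding("shift_jis")
free_encoding("euc_jp")
```

Routing every conversion through one pivot means each encoding needs only two tables (to and from Unicode) rather than one table per pair of encodings.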
Morphological Stemming and Segmentation
Teragram provides Asian language word-segmentation and stemming software for Chinese, Japanese and Korean with unmatched accuracy and speed. The importance of morphology is paramount in these languages. Chinese and Japanese text is written without space separations, and Korean with limited space separations. Therefore, the first task of any information processing system is to segment the initial text into a sequence of words (this process is called word segmentation). This task of breaking the input text into a sequence of words is strongly related to the morphological analysis process; for instance, recognizing that a given sequence of characters is a word usually means that the word has been recognized with a specific part of speech (verb, noun, etc.).
A fundamental problem in morphological analysis when retrieving information in Chinese is that Chinese does not use spaces to mark word boundaries. An information processing system must therefore first break the original Chinese text into a series of words or phrases, a process called word segmentation. The system can then recognize a given word or phrase as a particular part of speech, such as a noun or noun phrase, verb or verb phrase, adjective or adjective phrase, etc. Segmentation of Chinese text into words is a very difficult task. Many characters form one-character words by themselves, but these characters can also form multicharacter words when used with other characters. Chinese words have variable lengths, and the same character may occur in many different words.
The complexity of Chinese segmentation is shown in the following examples. First, consider the following example:
一心一意
Which means: Single heart and single mind. It is a four-character word, also called a quadrisyllabic Chengyu (idiom). The character "一" means "one" in English and it occurs twice in this idiom. It can also stand alone as a one-character word, the Chinese numeral one. A second example follows:
中华人民共和国
Which means "People's Republic of China." This is a seven-character word. In other contexts, 中华, 人民 and 共和国 are also multicharacter words and can join other words to form other compounds.
Teragram Chinese segmentation software uses extremely large dictionaries of various types to accurately and efficiently resolve the segmentation of Chinese text. Besides commonly used words, these dictionaries contain compounds, idioms, and names of companies, people and products, among many other kinds of entries.
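How a dictionary drives segmentation can be illustrated with forward maximum matching, a classic textbook technique that always takes the longest dictionary entry starting at the current position. It is shown here only as a sketch and is not claimed to be the method Teragram uses; the function name and the two toy dictionaries are assumptions.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation; `lexicon` must be non-empty.

    Characters not covered by any entry become one-character words.
    """
    longest = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


PARTS_LEXICON = {"中华", "人民", "共和国"}
FULL_LEXICON = PARTS_LEXICON | {"中华人民共和国", "一心一意", "一"}

print(segment("中华人民共和国", PARTS_LEXICON))  # ['中华', '人民', '共和国']
print(segment("中华人民共和国", FULL_LEXICON))   # ['中华人民共和国']
print(segment("一心一意", FULL_LEXICON))         # ['一心一意']
```

The same input is segmented differently depending on which entries the dictionary contains, which is why dictionary coverage of compounds, idioms and names matters so much for accuracy.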
Japanese, like Chinese, does not use spaces to mark word boundaries. As in Chinese, a character can form a one-character word by itself or combine with other characters to form multicharacter words. This characteristic makes the process of word segmentation difficult but critical. For example, see the following input sentence:
Jon (John) ga (subject-marker) hon (book) wo (object) katta (bought)
(John bought a book.)
Six words are found. Among these words, the word "katta" has been recognized both as an individual segment and as the past tense of the verb "kau" (to buy). The word segmentation and stemming software program, together with the library and its API, gives the programmer the ability to break any input sentence into a sequence of words, each word being related to a part of speech and any applicable morphological features.
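The kind of result described above, where each segment carries a part of speech and a dictionary form, can be sketched with a toy lexicon and longest-match lookup. This is not Teragram's library or API: the function name, the lexicon, the tag labels and the Japanese-script rendering of the romanized example sentence are all assumptions made for illustration.

```python
# Toy lexicon: surface form -> (part of speech, dictionary form).
LEXICON = {
    "ジョン": ("noun", "ジョン"),       # Jon (John)
    "が": ("particle", "が"),           # ga, subject marker
    "本": ("noun", "本"),               # hon (book)
    "を": ("particle", "を"),           # wo, object marker
    "買った": ("verb, past", "買う"),   # katta, past tense of kau (to buy)
}


def analyze(sentence, lexicon=LEXICON):
    """Return (surface, part of speech, dictionary form) for each segment."""
    longest = max(len(word) for word in lexicon)
    result = []
    i = 0
    while i < len(sentence):
        for size in range(min(longest, len(sentence) - i), 0, -1):
            surface = sentence[i:i + size]
            if surface in lexicon:
                pos, lemma = lexicon[surface]
                result.append((surface, pos, lemma))
                break
            if size == 1:
                # No entry matched: emit the character as an unknown word.
                result.append((surface, "unknown", surface))
        i += len(result[-1][0])
    return result


for surface, pos, lemma in analyze("ジョンが本を買った"):
    print(surface, pos, lemma)
```

Because the lexicon stores the dictionary form next to each inflected form, segmentation and stemming happen in the same lookup, which mirrors the close link between the two tasks described in the text.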
Along with Chinese and Japanese, Korean ranks among the most complex languages to analyze linguistically. In fact, many problems that appear only at the syntax level for languages like English are already present within Korean morphology. In English, a word can take at most five forms (five tenses of a verb), nouns are even simpler (only singular and plural), and everything else is invariable. This means that English morphology consists of relating a small number of words for each paradigm (root form).
In Korean, however, a verb or a noun, for example, can be analyzed as follows:
e.g., must have seen --->
( (to see) + (past) + (have) + (must) + (sentence ending))
e.g., from under the table --->
( (table) + (under) + (from)) .
The complexity inherent in the Korean language makes morphological analysis (stemming) both difficult and very important. It is difficult because no dictionary could list every possible segment (this would be tantamount to listing all the noun groups in English), and it is crucial because finding all the occurrences of a noun means that all the noun groups in which it appears have to be correctly analyzed.
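Since full segments cannot be listed, an analyzer has to decompose them instead. The sketch below peels known endings off the end of a segment until a known stem remains. It is a toy illustration, not Teragram's analyzer: the function name, the two small tables and the Korean forms chosen for "from under the table" are assumptions.

```python
STEMS = {"탁자": "table"}
ENDINGS = {"밑": "under", "에서": "from"}


def decompose(segment, stems=STEMS, endings=ENDINGS):
    """Split `segment` into a stem plus a chain of endings.

    Returns a list of (morpheme, gloss) pairs, or None if no analysis exists.
    """
    if segment in stems:
        return [(segment, stems[segment])]
    for ending, gloss in endings.items():
        # Strip one ending and try to analyze what is left of the segment.
        if segment.endswith(ending) and len(segment) > len(ending):
            head = decompose(segment[:-len(ending)], stems, endings)
            if head is not None:
                return head + [(ending, gloss)]
    return None


print(decompose("탁자밑에서"))
# [('탁자', 'table'), ('밑', 'under'), ('에서', 'from')]
```

With two small tables the function covers every combination of stem and ending chain, whereas a dictionary of whole segments would need a separate entry for each combination.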