Korpuste tükeldamine

rakendusi silpide ning allkeeltega

Translated title of the contribution: Cutting the text corpora: applications with syllables and sub-languages

Kairit Sirts, Leo Võhandu

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

In this paper we study different aspects of language by cutting language corpora in different ways. We examine two cuts that differ substantially in nature: splitting text into syllables to build a statistical language model, and dividing the language into sub-languages to identify the base vocabulary. Our syllable-based statistical language model includes the 500 most frequently observed syllables. It is a three-level model consisting of frequency tables for syllables, syllable pairs and syllable triplets. Each frequency table is a matrix with syllables, syllable pairs or syllable triplets in rows and syllables in columns; each cell records how many times the syllable in the column followed the element in the row. The Estonian pseudo-language generator is an application of this model: it generates text that is not fully Estonian but definitely sounds like it. The purpose of categorizing syllables is to sort them according to their possible positions in a word. We propose an algorithm for automatic syllable grouping that uses the data in the syllable frequency table, and we show experimentally how syllables are grouped into word-initial, word-internal and word-final syllables. Language can be divided into general language, which uses a base vocabulary, and different sub-languages, which contain specialized terminology. We discuss the definition of general language and propose an algorithm for automatically defining its base vocabulary.
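The frequency tables and the pseudo-language generator described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the text has already been split into syllables, ignores word boundaries and the 500-syllable limit, and the function names `build_tables` and `generate`, the fallback from longer to shorter contexts, and the sample syllables are all hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict


def build_tables(syllables, max_context=3):
    """Build one frequency table per context length (1, 2 and 3 syllables).

    tables[k][context][nxt] counts how often syllable `nxt` (the column)
    followed the k-syllable tuple `context` (the row).
    """
    tables = {k: defaultdict(Counter) for k in range(1, max_context + 1)}
    for i, nxt in enumerate(syllables):
        for k in range(1, max_context + 1):
            if i - k < 0:
                break  # not enough preceding syllables for this context length
            context = tuple(syllables[i - k:i])
            tables[k][context][nxt] += 1
    return tables


def generate(tables, start, length, rng=None):
    """Extend `start` up to `length` syllables by sampling from the tables.

    Assumption of this sketch: the longest context with observed
    continuations is tried first, then shorter ones.
    """
    rng = rng or random.Random()
    out = list(start)
    while len(out) < length:
        for k in range(max(tables), 0, -1):
            if len(out) < k:
                continue
            row = tables[k].get(tuple(out[-k:]))
            if row:
                candidates, counts = zip(*row.items())
                # Sample the next syllable in proportion to its observed count.
                out.append(rng.choices(candidates, weights=counts)[0])
                break
        else:
            break  # no table has a continuation for the current context
    return out


# Hypothetical, already syllabified input.
stream = ["ko", "du", "ma", "ja", "ko", "du", "maa"]
tables = build_tables(stream)
sample = generate(tables, ["ko"], 5, random.Random(0))
```

Here `tables[1]` corresponds to the syllable table, `tables[2]` to the syllable-pair table and `tables[3]` to the syllable-triplet table. Generation stops early when the current context was never followed by anything in the input.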
Original language: Estonian
Pages (from-to): 251-266
Number of pages: 16
Journal: Eesti Rakenduslingvistika Ühingu Aastaraamat
Issue number: 5
Publication status: Published - 2009
Externally published: Yes

Keywords

  • computational linguistics
  • syllabification
  • syllable association
  • graph representation
  • language model
  • syllable grouping
  • general language
  • sub-languages
  • Estonian
