358000, Republic of Kalmykia, Elista, I. K. Ilishkin St., 8
Reception: (84722) 3-55-06, fax: 2-37-84, e-mail: kigiran@mail.ru

Article

Author 1

RUS

Куканова В. В.

Калмыцкий институт гуманитарных исследований РАН

ENG

Kukanova V.

Kalmyk Institute for Humanities of the Russian Academy of Sciences

Title

RUS

ПРИНЦИПЫ СЕМАНТИЧЕСКОЙ РАЗМЕТКИ НАЦИОНАЛЬНОГО КОРПУСА КАЛМЫЦКОГО ЯЗЫКА

ENG

Principles of Semantic Annotation in the National Corpus of the Kalmyk Language

Abstract

RUS

The article describes the principles of semantic annotation in the National Corpus of the Kalmyk Language, which were developed on the basis of the Russian National Corpus. The Kalmyk corpus already has morphological annotation (www.kalmcorpora.ru); semantic annotation, however, is an important step in developing the corpus as an information and reference resource for future research on the structure and semantics of the language, in particular on the combinability of lexical units. Keywords: corpus linguistics, Kalmyk language, National Corpus of the Kalmyk Language, semantic annotation, faceted classification, tree classification.

ENG

This paper describes the principles of semantic annotation in the National Corpus of the Kalmyk Language (www.kalmcorpora.ru). Kalmyk is an agglutinative language with rich morphology; it belongs to the Mongolic language family and is used by the Oirats in Xinjiang (China) and by the Kalmyks living in the Lower Volga region of Russia. The corpus is an open collection of Kalmyk texts of different styles dating from 1950 to 2012, consisting mainly of literary works and newspaper articles. The model of morphological analysis is based on a formal description of inflectional types and paradigms, without which automated language processing in the corpus would not be possible. Semantic annotation is a crucial step in the development of the project because Kalmyk is an endangered language, so it is necessary to create conditions for thorough and systematic research of linguistic facts across a wide range of textual material, including specific word collocations. Children can learn grammatical rules and vocabulary, but it is difficult to acquire how a given word “works” in context, and without this knowledge one cannot produce natural speech. Semantically based computerized queries, drawing on the semantic annotation alone or in combination with the morphological description, yield relatively well-delimited material for researching various linguistic phenomena. The work on semantic annotation is based on the list of lexical units from the Kalmyk-Russian dictionary [1977] edited by B. Muniev. In other words, we use a dictionary-based approach to annotation.
Combining different methods for processing the word list, we analyze each word from four aspects: 1) lexical and grammatical characteristics (identifying categories within a part of speech); 2) thematic characteristics (a single list of themes for all parts of speech); 3) connotation (negative, positive, or both); 4) information on derivation (this is not the main purpose of the annotation, but we mark derivatives where they are easy to identify). The semantic annotation combines faceted and tree classification. As a result, we do not have a clear-cut ontology of the lexicon, and in the course of the work we found that polysemy makes it difficult to assign unambiguous characteristics. In some cases we add specific operators to the universal taxonomic classification to highlight branched systems within certain word groups in Kalmyk, for instance the system of designating animals according to their age and sex. These marks are necessary to convey cultural peculiarities reflected in the language. We analyze lexical units of all parts of speech except linking words, almost 27 thousand units in total. Two thirds of the words have more than one mark in each group of annotation. The result of this annotation is currently available only as a closed database (corpus), which we plan to open and publish by the end of 2014. At the moment, we are finding and correcting errors in the program code of the morphological analyzer.
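
The four aspects of annotation and the combination of faceted and tree classification described in the abstract can be illustrated with a minimal Python sketch. It is not the corpus's actual data model. The `Entry` class, the facet names, the theme paths, the `age`/`sex` operators, and the placeholder lemmas (`FOAL`, `MARE`, `GIFT`, which stand in for Kalmyk dictionary headwords) are all assumptions made for illustration. Each facet holds a set of marks, since most words carry more than one, and a theme is a path in a tree, so a query can match a whole branch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# A theme is a path from the root of a (hypothetical) thematic tree.
ThemePath = Tuple[str, ...]


@dataclass
class Entry:
    """One dictionary headword with its four annotation facets (illustrative)."""
    lemma: str
    pos: str
    lex_gram: Set[str] = field(default_factory=set)         # facet 1
    themes: Set[ThemePath] = field(default_factory=set)     # facet 2
    connotation: Set[str] = field(default_factory=set)      # facet 3
    derived_from: Optional[str] = None                      # facet 4
    operators: Dict[str, str] = field(default_factory=dict) # extra marks


# Placeholder lemmas; real entries would come from the dictionary word list.
HORSE = ("nature", "animal", "livestock", "horse")
LEXICON = [
    Entry("FOAL", "noun", {"common", "animate"}, {HORSE},
          operators={"age": "young"}),
    Entry("MARE", "noun", {"common", "animate"}, {HORSE},
          operators={"sex": "female"}),
    Entry("GIFT", "noun", {"common", "inanimate"},
          {("society", "exchange"), ("object", "artefact")},
          connotation={"positive"}),
]


def search(lexicon: List[Entry], theme_prefix: ThemePath = (),
           connotation: Optional[str] = None, **operators: str) -> List[str]:
    """Return lemmas matching every given facet; a theme prefix matches a subtree."""
    hits = []
    for entry in lexicon:
        if theme_prefix and not any(
            path[:len(theme_prefix)] == theme_prefix for path in entry.themes
        ):
            continue
        if connotation is not None and connotation not in entry.connotation:
            continue
        if any(entry.operators.get(k) != v for k, v in operators.items()):
            continue
        hits.append(entry.lemma)
    return hits


def annotate(lemmas: List[str], lexicon: List[Entry]):
    """Dictionary-based annotation: look each lemma up; unknown lemmas get None."""
    index = {entry.lemma: entry for entry in lexicon}
    return [(lemma, index.get(lemma)) for lemma in lemmas]
```

Under these assumptions, `search(LEXICON, ("nature", "animal"), age="young")` selects only the entries in the animal branch that also carry the age operator. This is the kind of query that combines a tree position with an additional facet.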

Keywords

corpus linguistics  ◆  Kalmyk language  ◆  National Corpus of the Kalmyk Language  ◆  semantic annotation  ◆  faceted classification  ◆  tree classification