Article

Author 1	Kukanova V. V., Kalmyk Institute for Humanities of the Russian Academy of Sciences
Title	Principles of Semantic Annotation in the National Corpus of the Kalmyk Language
Abstract	The article describes the principles of semantic annotation in the National Corpus of the Kalmyk Language, which were developed on the basis of the Russian National Corpus. The Kalmyk corpus (www.kalmcorpora.ru) already has morphological annotation; semantic annotation, intended to support future research on the structure and semantics of the language and in particular on the collocation of lexical units, is an important step in developing the corpus as an information and reference resource. This paper describes the principles of semantic annotation in the National Corpus of the Kalmyk Language (www.kalmcorpora.ru). Kalmyk is an agglutinative language with rich morphology. It belongs to the Mongolic language family and is spoken by the Oirats in Xinjiang (China) and by the Kalmyks living in the Lower Volga region of Russia. The corpus is an open collection of Kalmyk texts of different styles dating from 1950 to 2012, consisting mainly of literary works and newspaper articles. The model of morphological analysis is based on a formal description of inflectional types and paradigms, without which automated language processing in the corpus would not be possible. Semantic annotation is a crucial step in the development of the project because Kalmyk is an endangered language, which makes it necessary to create conditions for thorough and systematic research into linguistic facts across a wide range of textual material with particular word collocations. Children can learn grammatical rules and vocabulary, but it is difficult to learn how a given word “works” in context, and without this knowledge natural speech cannot be produced.
Thanks to semantically based computerized queries and the information derived from semantic annotation, used with or without the morphological description in the Kalmyk corpus, we can obtain relatively clear-cut material for researching various linguistic phenomena. The work on semantic annotation is based on the list of lexical units from the Kalmyk-Russian dictionary [1977] edited by B. Muniev; in other words, we use a dictionary-based approach to annotation. Combining different methods for processing the word list, we analyze the words from four aspects: 1) lexical and grammatical characteristics (identifying categories within each part of speech); 2) thematic characteristics (a single list of themes for all parts of speech); 3) word connotation (negative, positive, or both); 4) information on word derivation (this is not the main purpose of the annotation, but we try to mark derivatives where they are easy to identify). The semantic annotation is based on faceted and tree classification. As a result, we do not have a clear-cut ontology of the lexicon, and in the course of the work we found that it is difficult to assign unambiguous characteristics because of word polysemy. In some cases, we add specific operators to the universal taxonomic classification to highlight branched systems within certain word groups in Kalmyk, for instance the system of terms designating animals according to their age and sex. These marks are necessary to convey cultural peculiarities reflected in the language. We analyze lexical units of all parts of speech except linking words, which amounts to almost 27 thousand units. Two thirds of all the words have more than one mark in each group of annotation. The result of this annotation is currently available as a closed database (corpus), but we plan to open and publish it by the end of 2014. At the moment, we are finding and correcting errors in the program code of the morphological analyzer.
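The four annotation aspects and the combination of faceted and tree classification described in the abstract can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the corpus's actual data model: the class `Entry`, the functions `has_tag`, `annotate`, and `search`, the colon-separated tag paths, and the placeholder lemmas are all hypothetical and do not come from the corpus or from the Muniev dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Semantic marks for one dictionary lemma. All tag names are invented."""
    lemma: str
    lexgram: set = field(default_factory=set)      # aspect 1: lexical-grammatical class
    theme: set = field(default_factory=set)        # aspect 2: thematic class
    connotation: set = field(default_factory=set)  # aspect 3: positive / negative
    derivation: set = field(default_factory=set)   # aspect 4: derivational info


def has_tag(entry, facet, tag):
    """True if the facet holds `tag` itself or any tag below it in the tree."""
    values = getattr(entry, facet)
    return any(v == tag or v.startswith(tag + ":") for v in values)


def annotate(lemmas, lexicon):
    """Dictionary-based annotation: look each lemma up; unknown lemmas get None."""
    return [(lemma, lexicon.get(lemma)) for lemma in lemmas]


def search(lexicon, **conditions):
    """Return the lemmas whose entries satisfy every facet=tag condition."""
    return sorted(
        e.lemma for e in lexicon.values()
        if all(has_tag(e, facet, tag) for facet, tag in conditions.items())
    )


# Placeholder lexicon; a polysemous word simply carries several marks per facet.
lexicon = {
    "lemma_a": Entry("lemma_a", lexgram={"noun:concrete"},
                     theme={"animal:age:young", "animal:sex:male"},
                     connotation={"positive"}),
    "lemma_b": Entry("lemma_b", lexgram={"noun:concrete"},
                     theme={"animal:age:adult", "food"},
                     connotation={"positive", "negative"}),
    "lemma_c": Entry("lemma_c", lexgram={"adjective:qualitative"},
                     theme={"colour"}),
}

young_animals = search(lexicon, theme="animal:age:young")
all_animals = search(lexicon, theme="animal")
positive_nouns = search(lexicon, lexgram="noun", connotation="positive")
annotated = annotate(["lemma_c", "unknown"], lexicon)
```

In this sketch the facets are independent of one another, so a query can combine them freely, while the colon-separated paths inside a facet give the tree: a query for `animal` also finds entries marked more specifically for age or sex.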
Keywords	corpus linguistics ◆ Kalmyk language ◆ National Corpus of the Kalmyk Language ◆ semantic annotation ◆ faceted classification ◆ tree classification

Total number of the organization's publications in the Russian Science Citation Index (РИНЦ)	7534
Total number of citations of the organization's publications	32257
h-index (Hirsch index)	71