8 I.K. Ilishkin St., Elista, Republic of Kalmykia, 358000
Reception: (84722) 3-55-06, fax: 2-37-84, e-mail: kigiran@mail.ru
Article

Author 1

RUS

Бембеев Е. В.

Калмыцкий институт гуманитарных исследований Российской академии наук

ENG

Bembeev E.

Kalmyk Institute for Humanities of the Russian Academy of Sciences

Author 2

RUS

Куканова В. В.

Калмыцкий институт гуманитарных исследований Российской академии наук

ENG

Kukanova V.

Kalmyk Institute for Humanities of the Russian Academy of Sciences

Author 3

RUS

Каджиев А. Ю.

Калмыцкий институт гуманитарных исследований Российской академии наук

ENG

Kadzhiev A.

Kalmyk Institute for Humanities of the Russian Academy of Sciences

Title

RUS

ЧАСТОТНЫЙ СЛОВАРЬ СОВРЕМЕННОГО КАЛМЫЦКОГО ЯЗЫКА: ПРАВИЛА АНАЛИЗА ТЕКСТОВОГО МАТЕРИАЛА

ENG

Frequency Dictionary of Modern Kalmyk Language: Rules of Analysis of Text Material

Abstract

RUS

Статья посвящена описанию правил анализа текстового материала для создания частотного словаря калмыцкого языка на материале Национального корпуса калмыцкого языка (www.kalmcorpora.ru), который состоит из художественных текстов второй половины XX – начала XXI в., а также газетных статей и расшифровок устной речи. Объем художественных (прозаических и поэтических) текстов превышает 10 млн словоупотреблений. Тексты в корпусе, а также отдельные элементы текста (словоформы, знаки препинания, абзацы и т. п.) особым образом аннотированы. Создаваемый частотный словарь калмыцкого языка будет носить пилотный характер, поскольку это первый опыт разработки словаря подобного типа. На наш взгляд, объем созданного корпуса калмыцкого языка позволяет описать язык с точки зрения частотности употребления языковых единиц и значений: словоформ, слов, конструкций (2- и 3-граммных), грамматических значений, букв и др.

ENG

The article describes the rules for analyzing text material in order to create a frequency dictionary of the Kalmyk language based on the National Corpus of the Kalmyk Language (www.kalmcorpora.ru), which comprises literary texts from the second half of the 20th and the beginning of the 21st century, as well as newspaper articles and transcripts of spoken language. The fiction texts (prose and poetry) exceed 10 million word tokens. The texts in the Corpus, as well as individual text elements (word forms, punctuation marks, paragraphs, etc.), carry special annotations. The frequency dictionary being created is a pilot model, since it is the first attempt to develop a dictionary of this type. In our opinion, the size of the Corpus makes it possible to describe the language in terms of the usage frequency of linguistic units and meanings: word forms, words, constructions (2- and 3-grams), grammatical meanings, letters, etc. An experimental version of the National Corpus of the Kalmyk Language was launched in 2013 without morphological or semantic annotation, although the closed data already carried both types of annotation. The annotated material will be opened once the analyzer's program code has been adjusted and its efficiency reaches 90%. At present, the algorithm model of the morphological parser for Kalmyk successfully analyzes 70% of any text, counting only unambiguous parses. About 20% of the text receives multiple possible automatic analyses, while 10% receives no parse because the dictionary contains no stems for those words (they are mostly Russian loanwords that were not included in the dictionary edited by B.D. Muniev [1977] and some proper names).
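The three parser outcomes mentioned above (a single unambiguous parse, several competing parses, no parse) can be illustrated with a minimal sketch. It assumes a hypothetical input format in which each token is represented by the list of candidate analyses returned for it; the function name `parse_coverage` and the toy data are illustrative and are not taken from the corpus tooling.

```python
from collections import Counter

def parse_coverage(analyses_per_token):
    """Return the share of tokens with one, several, or no candidate parses.

    `analyses_per_token` is a hypothetical format: one list of candidate
    analyses per token, as a morphological parser might return them.
    """
    counts = Counter()
    for candidates in analyses_per_token:
        if len(candidates) == 0:
            counts["unparsed"] += 1      # no stem found in the dictionary
        elif len(candidates) == 1:
            counts["unambiguous"] += 1   # exactly one analysis
        else:
            counts["ambiguous"] += 1     # several competing analyses
    total = sum(counts.values())
    categories = ("unambiguous", "ambiguous", "unparsed")
    if total == 0:
        return {name: 0.0 for name in categories}
    return {name: counts[name] / total for name in categories}

# Toy input: 7 tokens with one parse, 2 with two parses, 1 with none.
toy = [["p"]] * 7 + [["p1", "p2"]] * 2 + [[]]
print(parse_coverage(toy))
```

On real data the shares would be computed over the tokens of a whole text rather than a ten-item toy list.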
The main idea behind the Frequency Dictionary is that the most frequently used units are the most significant in any language, while infrequent elements are equally significant from a different point of view: they can carry traces of historical development and can belong to various terminological systems, which indicates that a lexical unit is out of use in speech. The frequency of linguistic units and meanings has not been studied in Kalmyk linguistics; therefore, to investigate the frequency characteristics of Kalmyk speech, one must first identify and justify the parameters for measuring frequency and describing those characteristics. Thus, the aim of this article is to describe the rules for analyzing lexical units in order to develop the Frequency Dictionary of the Kalmyk language, in which the unit of observation is the lemma, that is, the initial form of a word without its lexical and grammatical annotations. This does not mean, however, that the dictionary ignores Kalmyk grammar: the processing of word forms and the compilation of the lemma list are governed by the rules of a formalized description of Kalmyk grammar, with a separate description for each part of speech. The central issue is to define the boundaries of the notions of a word and a lemma (the initial form of a word). The rules presented in the article build on the principles used to develop “The Frequency Dictionary of the Russian Language” [Frequency Dictionary … 1977] and “The Grammar Dictionary of the Russian Language” [Zalizniak 1987], revised for the purposes of the Kalmyk language; for units that do not exist in the literary written language, the rules have been developed anew.
Each part of speech has its own set of rules, which governs how the morphological parser processes the linear letter sequence of a vocabulary item for the Frequency Dictionary.
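As a rough illustration of the kinds of counts the abstract lists (word forms, lemmas, 2- and 3-grams, letters), the sketch below builds frequency tables from a short text. Everything in it is an assumption made for illustration: the lookup table `TOY_LEMMAS` stands in for the output of a morphological parser and uses English toy data rather than Kalmyk material, and the tokenizer simply splits on word characters and ignores sentence boundaries.

```python
import re
from collections import Counter

# Hypothetical stand-in for a parser's output: word form -> lemma.
# English toy data, NOT Kalmyk material from the corpus.
TOY_LEMMAS = {"reads": "read", "read": "read", "books": "book", "book": "book"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation dropped)."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def frequency_tables(text, lemma_lookup):
    """Count word forms, lemmas, unlemmatized forms, n-grams, and letters."""
    tokens = tokenize(text)
    return {
        "word_forms": Counter(tokens),
        # Lemma counts merge all word forms that share an initial form.
        "lemmas": Counter(lemma_lookup[t] for t in tokens if t in lemma_lookup),
        # Forms missing from the lookup are kept apart, not silently dropped.
        "unlemmatized": Counter(t for t in tokens if t not in lemma_lookup),
        "bigrams": Counter(ngrams(tokens, 2)),
        "trigrams": Counter(ngrams(tokens, 3)),
        "letters": Counter(ch for t in tokens for ch in t),
    }

tables = frequency_tables("She reads books. He reads a book.", TOY_LEMMAS)
print(tables["lemmas"].most_common())
```

Keeping unlemmatized forms in a separate table mirrors the situation described above, where part of the text receives no parse because its stems are missing from the dictionary.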

Keywords

корпусная лингвистика ◆ квантитативные методы в лингвистике ◆ частотный словарь ◆ калмыцкий язык ◆ правила лемматизации ◆ corpus linguistics ◆ quantitative methods in linguistics ◆ frequency dictionary ◆ the Kalmyk language ◆ lemmatization rules