Formatting texts for the national corpus of the Uzbek language (O‘zbek tili milliy korpusi uchun matnlarni formatlash)

No. 1 (1), 2022

Pages: 58–63

Language: Uzbek

Abstract

This article discusses a general approach to describing and encoding the methods used to include texts in the national corpus of the Uzbek language. A common format is justified by the diversity and incompatibility of existing text formats. Storing corpus texts in JSON makes it possible to increase corpus search speed and to overcome theoretical and technical scalability problems. The inclusion of the texts of the Alpomish epic in the corpus is described.
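As a rough illustration of what storing a corpus text in JSON might look like, the following Python sketch serializes one text record and builds a simple word index over it. The record layout (the field names `id`, `title`, `language`, `sentences`, `n`, `tokens`), the helper names, and the sample sentences are all assumptions made for this example; they are placeholders, not the schema described in the article and not quotations from the epic. The inverted index is likewise only one possible way to speed up lookups over such records.

```python
import json
from collections import defaultdict

# Hypothetical record layout: the field names and the two sample
# sentences are illustrative placeholders, not the authors' schema.
record = {
    "id": "alpomish-0001",
    "title": "Alpomish",
    "language": "uz",
    "sentences": [
        {"n": 1, "tokens": ["Alpomish", "otga", "mindi"]},
        {"n": 2, "tokens": ["Barchin", "yo‘lga", "qaradi"]},
    ],
}


def dump_record(rec):
    """Serialize a record to JSON text, keeping non-ASCII letters readable."""
    return json.dumps(rec, ensure_ascii=False, indent=2)


def build_index(records):
    """Map each lowercased token to the (text id, sentence number) pairs
    in which it occurs."""
    index = defaultdict(set)
    for rec in records:
        for sentence in rec["sentences"]:
            for token in sentence["tokens"]:
                index[token.lower()].add((rec["id"], sentence["n"]))
    return index


def search(index, word):
    """Return the sorted locations of a word, or an empty list if absent."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    text = dump_record(record)
    loaded = json.loads(text)      # round-trip through JSON text
    idx = build_index([loaded])
    print(search(idx, "otga"))
```

Because every text shares one record shape, new texts can be appended to the collection and indexed with the same code, whatever their original file format was.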
