Manana Tandaschwili

Frankfurt University

Head of the Department for Caucasian Languages

Frankfurt a/M, Germany

ORCID: https://orcid.org/0009-0005-7812-9124

tandaschwili@em.uni-frankfurt.de

Georgian Language in the Context of the New Paradigm of Science

(New Classification of Languages and Modern Status of the Georgian Language)

Abstract

The 21st century has brought significant changes to many fields of science, largely due to the rapid development of information technologies. The philosophy of science has evolved as well, posing new challenges for languages. While the traditional classification paradigm comprised genetic, typological, and relational classifications, a new paradigm has emerged that categorizes languages according to different parameters: 1) the legal status of the language, 2) the viability of the language and its areas of use, and 3) the degree of operation of languages in the digital age. Under the third parameter, languages are divided into high-resource languages (HRL) and low-resource languages (LRL). Consequently, the traditional Language Atlas, which reflects the genealogical classification or geographical distribution of languages, has been replaced by a new Language Atlas that shows the percentage of digital resources available for natural language processing (NLP) in relation to the number of languages spoken worldwide.

In both Georgian and international scientific literature and online portals (Nancy et al., 2017), Georgian is classified as a low-resource language (LRL). This presentation examines whether that classification is appropriate for Georgian in the context of the new scientific paradigm.

According to the theoretical literature, LRLs can be understood as less studied, resource-scarce, less computerized, less privileged, and less commonly taught languages (Cieri et al., 2016:4543-44; Magueresse et al., 2020:1). Artificial intelligence experts define LRLs as languages with a limited amount of linguistic data and resources for natural language processing (NLP) tasks. In the context of NLP and machine learning, having “low resources” typically means lacking the annotated texts, linguistic databases, or other resources needed to train and develop effective language models. A language may be considered low-resource because it has a minimal digital presence, lacks annotated datasets, is underrepresented in academic research, or has insufficient computational resources.

Based on these definitions and on the fact that both large databases and advanced language technologies are available for Georgian (Tandashvili & Kamarauli, 2021:87 ff.), I argue that the term “Low Resource Language” does not adequately apply to Georgian, in terms of either content or context.

In my opinion, the third parameter of the new language classification (the degree of operation of languages in the digital age) should be revised. Specifically, the current binary classification (HRL vs. LRL) should be replaced by a multi-level system that better reflects how languages function today. I propose the following levels:

  1. HRL (High Resource Language): Languages with sufficient resources to train artificial intelligence;
  2. SRL (Sufficient Resource Language): Languages whose digital resources are not necessarily adequate for training artificial intelligence but are sufficient for functioning in the digital era;
  3. URL (Under-resourced Language): Languages with digital resources insufficient for machine processing of various tasks;
  4. LRL (Low Resource Language): Languages represented by a small number of unstructured or random digital resources;
  5. UDR (Undigitized Language): Languages not yet represented by digital resources but for which it is possible to create such resources (big data);
  6. NRL (Nonresourced Language): Languages that, for various reasons, will likely never be represented by digital resources (e.g., endangered languages or languages spoken by such a small number of people that creating significant digital resources is infeasible).
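
The proposed scale can also be written down as an ordered data type, which makes the contrast with the binary scheme explicit. The following is only a minimal sketch: the names `ResourceLevel` and `collapse_to_binary` are hypothetical, the numeric values simply reuse the numbering of the list above, and the rule that everything below HRL counts as "LRL" in the binary scheme is an illustrative assumption, not a definition taken from the literature.

```python
from enum import IntEnum


class ResourceLevel(IntEnum):
    """Six proposed levels; values follow the list numbering (1 = best resourced)."""
    HRL = 1  # High Resource Language
    SRL = 2  # Sufficient Resource Language
    URL = 3  # Under-resourced Language
    LRL = 4  # Low Resource Language
    UDR = 5  # Undigitized Language
    NRL = 6  # Nonresourced Language


def collapse_to_binary(level: ResourceLevel) -> str:
    """Map a level onto the binary HRL/LRL scheme.

    Assumption for illustration only: every level below HRL is
    labeled "LRL" when only two categories are available.
    """
    return "HRL" if level is ResourceLevel.HRL else "LRL"


if __name__ == "__main__":
    for level in ResourceLevel:
        print(f"{level.value}. {level.name} -> binary label: {collapse_to_binary(level)}")
```

Under this assumption, five of the six levels receive the same binary label, which illustrates the loss of information that the multi-level system is meant to avoid.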

Keywords: Digital Kartvelology, Georgian Language, Low Resource Languages