A CORPUS-BASED SWADESH WORD LIST FOR LITERARY CHRISTIAN URMI (NEW ALPHABET TEXTS)

The Neo-Aramaic dialects are modern vernacular forms of Aramaic, which has a documented history in the Middle East of over 3,000 years. Due to upheavals in the Middle East over the last one hundred years, thousands of speakers of Neo-Aramaic dialects have been forced to migrate from their homes or have perished in massacres. As a result, the dialects are now highly endangered. The dialects exhibit a remarkable diversity of structures. Moreover, the considerable depth of attestation of Aramaic from earlier periods provides evidence for the pathways of change. For these reasons, research on Neo-Aramaic is important for more general fields of linguistics, in particular language typology and historical linguistics. The papers in this volume represent the full range of research that is currently being carried out on Neo-Aramaic dialects. They advance the field in numerous ways. In order to allow linguists who are not specialists in Neo-Aramaic to benefit from the papers, the examples are fully glossed.


Introduction
The aim of this paper is to compile a basic word list for the literary Neo-Aramaic dialect of the Christians of Urmi and to establish the etymologies of its entries. This study is intended as a starting point for a comparison of the lexicon in all dialects of the North-Eastern Neo-Aramaic (NENA) subgroup. Literary Christian Urmi is chosen for this study because it is attested in a very large corpus of texts.
Research on Neo-Aramaic in recent decades has produced descriptions of many dialects, especially within the NENA dialect subgroup. 2 We are now, therefore, in a good position to attempt to understand the genealogical relationships between the dialects. Hoberman (1988) has suggested a reconstruction of the proto-NENA pronominal system. One of the conclusions of Hoberman's study was that the dialects of Northern Iraqi Kurdistan share some morphological innovations, which may help to single them out as a cohesive subgroup. Fox (1994) attempts to explore relationships within NENA according to selected phonological, morphological and lexical features. The outcome of Fox's study was the identification of three major clusters of isoglosses, which, however, need to be checked with a broader range of data. 3 In this paper I shall present a Swadesh list of 110 basic words (following the version of Kassian et al. 2010) that are attested in a corpus of literary Christian Urmi. The corpus used for this purpose consists of a collection of books and newspapers issued in the latinised alphabet in Soviet Russia and Georgia from 1929 to 1938. This corpus was chosen on the assumption that these textual data provide sufficient documentation to create a basic word list. There are certain drawbacks in using literary texts for this purpose, because the language of literature and journalism may not reflect the true usage of a natural spoken language. The lexical features of the literary register, however, usually do not affect the usage within the scope of word lists consisting of 100 or even 200 words. It is important to note, however, that data collected from fieldwork are usually restricted in volume. The currently largest collection of spoken narrative texts of a Neo-Aramaic dialect (Khan 2016) amounts to approximately 70,000 words.
1 HSE University, Moscow. The research has been supported by RFBR grant No. 17-04-00472.
2 For a bibliography of these dialect descriptions see: Napiorkowska (2015, 583-594).

The Corpus 4
The books and newspapers in the Assyrian new alphabet (Novij Alfavit, henceforth NA) were published in Moscow and Tbilisi from 1929 to 1938. This project was an integral part of the latinisation campaign in the Soviet Union (Smith 1998, 121-42). After 1938 the publication of Assyrian books and the newspaper in NA ceased because most of the authors, editors and translators had been condemned to death by the Stalinist regime.
It is important to note that the books dated 1929-1931 were printed using the earlier variety of the Assyrian new alphabet, which is basically Cyrillic with the admixture of some Latin letters (t, d, j, l). A modified variety of the Assyrian NA was introduced in 1931 and was used later as a standard, with some further changes adopted in 1933. A table of correspondences between the transcription notations used by various scholars and the graphemes of the Assyrian NA is given in the appendix to this paper. The corpus includes 172 books and approximately 270 issues of the newspaper Kokhva d Madinkha with texts in NA. 5 The genres of the books are the following: translations of Russian literary texts (the largest part of the corpus), original literary fiction in Assyrian Neo-Aramaic, school textbooks, popular scientific texts, Soviet propagandistic and atheistic literature. Currently the corpus of digitised texts amounts to approximately 630,000 words from 46 books. 6 The word 'digitised' here means that the texts are available in doc/txt format and are electronically searchable. Recently the morphologically tagged corpus of the texts in NA has been made available for queries at: http://neo-aramaic.web-corpora.net/index_en.html

The Method of Presentation of the Results
Two kinds of queries were performed in order to determine the exponents of the meanings of the basic word list. First, the meanings of the word list were searched for in the Russian originals of the translated texts. 7 The corresponding exponent was then checked in the Neo-Aramaic translation. Second, the word count of the exponents was performed on the basis of the textual database of approximately 630,000 words. In some cases I searched in the literature beyond the digitised corpus, for example for anatomical terms such as foot, which were found in a school textbook on natural science. In the case of words with high frequency, the word count was made on a sample text file of 37,000 words. Each entry in the following list of basic words consists of: 1. the meaning
5 Most of the texts in this newspaper are printed in Syriac script.
6 The expected volume of the textual corpus after its full digitisation is more than 2 million words.
7 More than 80 percent of the searchable textual corpus consists of translations from Russian into Neo-Aramaic.
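The second kind of query, a word count of candidate exponents, can be illustrated with a short sketch. It is not the procedure actually used for this paper: the function names `tokenize` and `count_exponents`, the tokenisation rule and the Unicode normalisation step are assumptions made for illustration, and the sample sentence is the one quoted from THH 21/1 in this paper. Exact-form matching of this kind ignores inflected forms, which a morphologically tagged corpus would group together.

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Split a text in the New Alphabet into lower-cased word forms."""
    # Normalise to NFC so that letters such as ş or ç are single code points
    # (assumption: the text files may mix composed and decomposed forms).
    text = unicodedata.normalize("NFC", text).lower()
    # A token is any run of Unicode letters; digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", text)


def count_exponents(text, candidates):
    """Count exact occurrences of each candidate word form in the text."""
    counts = Counter(tokenize(text))
    return {
        form: counts[unicodedata.normalize("NFC", form).lower()]
        for form in candidates
    }


# Sample sentence quoted in this paper (THH 21/1); a real query would read
# the digitised text files instead.
sample = "Kirvijşi d meşə в leləvəti ki axlьj qəlpə d ijləni"
result = count_exponents(sample, ["qəlpə", "çuluxtə"])
print(result)
```

On a full corpus the same function would be applied to the concatenated text files, and the resulting counts compared between competing exponents of one meaning.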

The 110-Item Swadesh List
The 110-item Swadesh word list for the corpus of Neo-Aramaic texts in the New Alphabet is as follows.
8 The term is based on one of the classifications of Aramaic languages which divides the Aramaic languages of the Middle period into Western and Eastern branches (Rosenthal 1939).
In most of its uses qəlpə refers to objects similar to the bark of a tree: eggshell, nutshell, watermelon rind or, metaphorically, a turtle shell. There is only one clear usage of qəlpə in a translated text: Kirvijşi d meşə в leləvəti ki axlьj qəlpə d ijləni 'The hares feed at night on tree bark' (THH 21/1). The other one renders original Russian кора 'bark', but the text speaks metaphorically about the turtle shell (THH 10/4).
The Kurdish etymology for C. Urmi gura is suggested in Khan (2016, vol. 3, 169) with a question mark.

вarnəşə. > 50×.
The ratio of the usage of nəşə to вarnəşə is 10:1. Therefore, nəşə is the main exponent of the meaning in question.
The character of the Classical Syriac sources that use derivatives of ṭlʿ with meanings relating to 'to sleep' (the Bar Bahlul dictionary, The Book of Medicines) points to a probable Neo-Aramaic background of these terms in these dictionaries of CS.

Conclusions
The digitised corpus for literary Christian Urmi of approximately 630,000 words has been shown to be sufficient to establish the basic 110-word list with 117 exponents. More than 70 percent of the entries (87/117) have more than 50 attestations in the corpus.
There are seven meanings that have two exponents: bark (qəlpə, çuluxtə), to bite (qraţa, njasa), cold (qajra, qarьjra), green (qijnə, mijlənə), hair (kosə, mьsta), man (nəşə, вarnəşə), to sleep (dməxə, ţlaja). In the cases of cold and green the problem may be solved by statistical data: the exponents qajra for cold and qijnə for green have considerably more attestations in the corpus than the alternative exponents qarьjra and mijlənə. On the other hand, bare statistical data do not help in the case of bark (see the discussion of no. 3).
16 One of the attestations of this word was found in the text MPX 90/28, which is not yet digitised.
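The statistical criterion applied to cold and green can be expressed as a simple decision rule. The sketch below is illustrative only: the function `main_exponent`, the threshold `min_ratio` and all attestation counts are hypothetical placeholders rather than figures from the corpus. The only value taken from this paper is the 10:1 ratio of nəşə to вarnəşə, which the first set of placeholder counts is chosen to reproduce.

```python
def main_exponent(counts, min_ratio=2.0):
    """Return the dominant exponent, or None if no form clearly prevails.

    counts: mapping from exponent to number of attestations.
    min_ratio: hypothetical threshold for "considerably more" attestations.
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # nothing attested at all
    if len(ranked) == 1:
        return ranked[0][0]  # a single candidate needs no comparison
    (top, top_n), (_, second_n) = ranked[0], ranked[1]
    if second_n == 0 or top_n / second_n >= min_ratio:
        return top
    return None  # counts too close; the texts must be inspected manually


# Placeholder counts reproducing the reported 10:1 ratio (not corpus figures).
man = {"nəşə": 1000, "вarnəşə": 100}
# Placeholder counts for a near tie between two unnamed forms.
near_tie = {"form_a": 12, "form_b": 10}

print(main_exponent(man))
print(main_exponent(near_tie))
```

A `None` result corresponds to cases like bark, where frequency alone does not settle the question and the contexts of each attestation have to be examined.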