Introduction

These pages present some basic frequencies based on the Electronic Text Corpus of Sumerian Literature. The frequencies are in the form of alphabetic and rank frequency lists, and the basic units of the lists are either word forms (types), lexemes, or sign names. There are also lists based on only a part of the corpus, which can be used to analyse subgroups of the compositions making up the ETCSL.

Since the ETCSL is constantly being expanded and amended, the frequencies will necessarily change over time. It is therefore important to take note of the date attached to each of the lists.

Before making use of the many lists provided, be sure to read how they were produced and take note of the fact that certain parts of the corpus were excluded before the frequencies were put together. This means that a search in the corpus will yield a slightly different count from the frequency provided for the same word, lexeme, or sign name in any of the frequency lists.

How ...

When extracting the information to be included in the various lists, certain (arbitrary) decisions regarding format had to be made, partly due to requirements imposed by the programs used to produce the lists.

The main programs used to produce the lists are: WordSmith Tools 4.0 (http://www.lexically.net/wordsmith/) and the Ngram Statistics Package (http://ngram.sourceforge.net/). In addition, some purpose-built Perl programs were written.


Exclusions

In order to compile a base-profile of the corpus the following have been excluded: additions and non-primary variants within compositions; the catalogues; and those proverbs numbered 6.2 (to a large degree these duplicate proverbs which also occur in the files numbered 6.1). The Sumerian King List (2.1.1) has also been excluded because its statistical profile is atypical. If a composition has more than one version, Me-Turan and Susa versions are excluded and the remaining longest version is regarded as primary. Thus, the program extracting the material used as basis for the frequencies excludes:

c.0.1.1.xml
c.0.1.2.xml
c.0.2.01.xml
c.0.2.02.xml
c.0.2.03.xml
c.0.2.04.xml
c.0.2.05.xml
c.0.2.06.xml
c.0.2.07.xml
c.0.2.08.xml
c.0.2.11.xml
c.0.2.12.xml
c.0.2.13.xml
c.1.2.2.xml:<text id="c122.ver1" n="2">
c.1.8.1.2.xml:<text n="2">,<text n="3">
c.1.8.1.3.xml:<text id="c1813.ver1" n="2">,<text n="3">
c.1.8.1.4.xml:<text n="2">,<text n="3">,<text n="4">,<text n="5">
c.2.1.1.xml
c.2.1.5.xml:<text n="2">
c.2.2.6.xml:<text id="c226.ver1" n="2">
c.2.4.1.1.xml:<text n="2">
c.2.4.1.4.xml:<text n="2">,<text n="3">
c.2.5.4.13.xml:<text id="c25413.ver1" n="2">
c.2.5.4.15.xml:<text n="2">
c.3.1.06.1.xml:<text n="2">
c.3.1.08.xml:<text n="1">
c.3.1.20.xml:<text n="2">
c.4.07.8.xml:<text n="2">
c.4.08.18.xml:<text n="2">,<text n="3">
c.4.16.1.xml:<text n="2">
c.4.31.1.xml:<text n="2">
c.6.2.1.xml
c.6.2.2.xml
c.6.2.3.xml
c.6.2.4.xml
c.6.2.5.xml
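The exclusion logic described above can be sketched roughly as follows. The file names and the convention that non-primary versions are individual <text> elements are taken from the list; the data structures and the function name are made up for this illustration (the actual extraction program was written in Perl), and only a few entries of each table are filled in.

```python
# Sketch of the exclusion step, assuming filenames and <text n="..."> version
# numbers as shown in the list above. Only a few entries are filled in here.
EXCLUDED_FILES = {
    "c.0.1.1.xml", "c.0.1.2.xml",   # catalogues
    "c.2.1.1.xml",                   # Sumerian King List
    "c.6.2.1.xml", "c.6.2.2.xml",   # proverbs numbered 6.2
    # ... plus the remaining files excluded wholesale
}
EXCLUDED_VERSIONS = {
    "c.1.2.2.xml": {"2"},            # <text id="c122.ver1" n="2">
    "c.1.8.1.2.xml": {"2", "3"},     # <text n="2">, <text n="3">
    # ... and so on for the other files with excluded versions
}

def keep_text(filename, text_n):
    """Return True if a <text> element goes into the base profile."""
    if filename in EXCLUDED_FILES:
        return False
    return text_n not in EXCLUDED_VERSIONS.get(filename, set())
```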


Comparing a subset with the whole corpus

When comparing a subset of the compositions with the whole corpus, i.e. using the whole corpus as a reference corpus, all the compositions making up the subset are excluded from the reference corpus. This seemed to yield the best result.

The comparison is presented as a list of key words and is done with the help of WordSmith Tools. Below is an extract from the help text of that program. To read more about WordSmith Tools, go to http://www.lexically.net/wordsmith/.

Key words (from WordSmith Tools Help (c) Mike Scott)

The term "key word", though it is in common use, is not defined in Linguistics. This program identifies key words on a mechanical basis by comparing patterns of frequency. (A human being, on the other hand, may choose a phrase or a superordinate as a key word.)

A word is said to be "key" if
a) it occurs in the text at least as many times as the user has specified as a Minimum Frequency
b) its frequency in the text when compared with its frequency in a reference corpus is such that the statistical probability as computed by an appropriate procedure is smaller than or equal to a p value specified by the user.

positive and negative keyness
A word which is positively key occurs more often than would be expected by chance in comparison with the reference corpus. A word which is negatively key occurs less often than would be expected by chance in comparison with the reference corpus.

typical key words
KeyWords will usually throw up 3 kinds of words as "key".

First, there will be proper nouns. Proper nouns are often key in texts, though a text about racing could wrongly identify as key, names of horses which are quite incidental to the story. This can be avoided by specifying a higher Minimum Frequency.

Second, there are key words that human beings would recognise. The program is quite good at finding these, and they give a good indication of the text's "aboutness". (All the same, the program does not group synonyms, and a word which only occurs once in a text may sometimes be "key" for a human being. And KeyWords will not identify key phrases unless you are comparing wordlists based on word clusters.)

Third, there are high-frequency words like because or shall or already. These would not usually be identified by the reader as key. They may be key indicators more of style than of "aboutness". But the fact that KeyWords identifies such words should prompt you to go back to the text, perhaps with Concord [another WordSmith Tool, JE], to investigate why such words have cropped up with unusual frequencies.


Batch comparison

A routine has been set up to simplify the comparison of compositions. In principle, all the compositions in the corpus can now be compared with each other. The program that performs the comparisons outputs a tab-delimited list, which can easily be pasted into e.g. Excel. In the example below, all the compositions starting with c.2.4, Ur III praise poetry, were compared. The list was then pasted into Excel and sorted according to the percentage of matching lexeme bigrams. (A nice thing about Excel is that it can sort on multiple columns.) Only a fraction of the 1,892 lines of the complete list is shown in Table 1.1. A brief discussion of the table and a way of highlighting the similarities are found below the table.
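The core of such a pairwise comparison can be sketched as follows. This is a reconstruction of what the bigram columns of Table 1.1 appear to measure (shared bigram types, as a percentage of composition 1's bigram types), not the actual comparison program, and the function names are invented:

```python
def bigrams(tokens):
    """Set of adjacent-pair (bigram) types in a token sequence."""
    return set(zip(tokens, tokens[1:]))

def compare(comp1, comp2):
    """Shared bigram types of two compositions, and their
    percentage relative to composition 1's bigram types."""
    b1, b2 = bigrams(comp1), bigrams(comp2)
    shared = b1 & b2
    pct = 100 * len(shared) / len(b1) if b1 else 0.0
    return len(shared), round(pct, 1)
```

Running something like this over every ordered pair of compositions, once on the word files and once on the lexeme files, and writing one tab-delimited line per pair would produce a list of the kind shown in Table 1.1.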

Table 1.1. Comparison of individual compositions sorted according to % of common lexeme bigrams (column 11)
C1/2 = composition 1/2, W = word/token, L = lexeme, WL = word line, LL = lexeme line
Compos. (C1 > C2)	W in C1	Overlap with C2	%	L in C1	Overlap with C2	%	Biwords	%	Bilexemes	%	WL overlap	%	LL overlap	%
c.2.4.4.9 > c.2.4.2.24	4	3	75	4	3	75	2	66.7	2	66.7	0	0	0	0
c.2.4.1.5 > c.2.4.1.6	181	113	62.4	181	155	85.6	55	33.1	83	51.6	5	12.5	7	17.5
c.2.4.1.6 > c.2.4.1.5	244	130	53.3	242	186	76.9	55	24.6	83	39	5	9.6	7	13.5
c.2.4.4.9 > c.2.4.4.1	4	1	25	4	3	75	0	0	1	33.3	0	0	0	0
c.2.4.4.9 > c.2.4.2.17	4	2	50	4	2	50	1	33.3	1	33.3	0	0	0	0
c.2.4.1.8 > c.2.4.1.1	27	13	48.1	27	17	63	1	4.5	3	13.6	0	0	0	0
c.2.4.5.5 > c.2.4.5.4	93	57	61.3	93	69	74.2	7	7.7	12	13.2	0	0	0	0
c.2.4.2.b > c.2.4.2.04	113	37	32.7	109	81	74.3	3	2.8	13	12.9	0	0	0	0
c.2.4.2.b > c.2.4.2.02	113	37	32.7	109	81	74.3	2	1.9	13	12.9	0	0	0	0
c.2.4.1.a > c.2.4.2.02	75	30	40	73	54	74	3	4.1	9	12.7	0	0	0	0
c.2.4.2.25 > c.2.4.2.03	128	49	38.3	123	92	74.8	6	4.7	15	12.5	1	3.4	1	3.4
c.2.4.2.12 > c.2.4.2.02	52	9	17.3	49	35	71.4	0	0	6	12.5	0	0	1	5.6
c.2.4.4.5 > c.2.4.5.3	45	12	26.7	42	21	50	2	4.5	5	12.5	0	0	2	10
c.2.4.2.17 > c.2.4.2.04	219	123	56.2	208	171	82.2	8	3.9	23	12.3	1	1.9	1	2
c.2.4.2.17 > c.2.4.2.24	219	95	43.4	208	157	75.5	7	3.4	22	11.8	0	0	0	0
c.2.4.1.a > c.2.4.2.04	75	37	49.3	73	60	82.2	3	4.1	8	11.3	0	0	0	0
c.2.4.1.a > c.2.4.2.24	75	28	37.3	73	54	74	3	4.1	8	11.3	0	0	0	0
c.2.4.1.5 > c.2.4.1.3	181	69	38.1	181	127	70.2	5	3	18	11.2	0	0	0	0
c.2.4.5.5 > c.2.4.2.04	93	56	60.2	93	72	77.4	2	2.2	10	11	0	0	0	0
c.2.4.2.23 > c.2.4.2.05	49	15	30.6	48	32	66.7	1	2.1	5	10.9	0	0	1	4.3
c.2.4.2.b > c.2.4.2.01	113	27	23.9	109	72	66.1	3	2.8	11	10.9	0	0	0	0
c.2.4.2.05 > c.2.4.2.02	931	387	41.6	901	696	77.2	22	2.4	92	10.8	1	0.5	3	1.4
c.2.4.2.17 > c.2.4.2.15	219	97	44.3	208	148	71.2	9	4.4	19	10.2	0	0	0	0
c.2.4.5.1 > c.2.4.5.4	150	63	42	142	90	63.4	9	6.9	12	10.1	0	0	0	0
c.2.4.2.25 > c.2.4.2.05	128	48	37.5	123	89	72.4	1	0.8	12	10	0	0	0	0
c.2.4.2.a > c.2.4.2.07	111	24	21.6	111	67	60.4	1	1.6	6	10	0	0	0	0
c.2.4.1.a > c.2.4.1.1	75	30	40	73	56	76.7	2	2.7	7	9.9	0	0	0	0
c.2.4.2.b > c.2.4.2.05	113	38	33.6	109	77	70.6	1	0.9	10	9.9	0	0	0	0
c.2.4.2.b > c.2.4.2.15	113	30	26.5	109	65	59.6	2	1.9	10	9.9	0	0	0	0

From left to right, the columns contain:

  1. the two compositions compared
  2. the total number of words in composition 1
  3. the number of words it shares with composition no. 2
  4. the percentage of shared words
  5. the total number of lexemes in composition 1
  6. the number of lexemes it shares with composition no. 2
  7. the percentage of shared lexemes
  8. the number of shared word bigrams in the two compositions
  9. the percentage of shared word bigrams
  10. the number of shared lexeme bigrams
  11. the percentage of shared lexeme bigrams
  12. the number of equivalent lines in the two compositions
  13. the percentage of equivalent lines
  14. the number of similar lines when all words have been converted to lexemes AND all proper nouns have been converted to the dummy lexeme PROPER + type of proper noun, e.g. PROPER_RN
  15. the percentage of such lines.

What is potentially interesting about such a comparison is not only finding compositions that are similar in some respect, but also the ones that are (very) different within a particular grouping.

As always, there are a number of things one needs to be aware of when reading such a table. One is nicely illustrated by the first row of the table. This row yields extremely high numbers, but this is only because the composition being compared contains four words only, and hence is very "similar" to very many other compositions.

Row two in Table 1.1 contains the result of the comparison of c.2.4.1.5 with c.2.4.1.6 and may be more rewarding to follow up. These two compositions have a shared vocabulary (lexemes) of 85.6 % (seen from C1). They even have five identical lines (column 12) and seven similar lines (column 14). To illustrate the overlap between the two compositions, a program called MOSS (Measure Of Software Similarity) can be used (http://www.cs.berkeley.edu/~aiken/moss.html). This program was designed to spot plagiarism in computer programming classes, but it seems to do a good job of finding similarities between Sumerian compositions as well. Open c.2.4.1.5vsc.2.4.1.6.html and study the result. Note that we have compared the files consisting of the lexemes only. We may also compare the word files (c.2.4.1.5fvsc.2.4.1.6f.html), although the result is not equally colourful. However, MOSS does a good job of finding matches. MOSS is easy to download, install and run, provided you have Perl installed on your machine as well.

Another way of highlighting similarities between two compositions is to put in bold all the shared bigrams, either the lexeme or the word bigrams. The outputs from this process can then be put side by side in a web document: sharedlexemebigrams / sharedwordbigrams.
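That highlighting step can be sketched as follows, assuming plain token lists as input and HTML <b> tags for the bold text; the function name is invented, and the actual tool used may well have worked differently:

```python
def bold_shared_bigrams(tokens, other_tokens):
    """Wrap every word that takes part in a bigram shared with
    the other composition in <b>...</b>."""
    shared = (set(zip(tokens, tokens[1:]))
              & set(zip(other_tokens, other_tokens[1:])))
    bold = [False] * len(tokens)
    for i, pair in enumerate(zip(tokens, tokens[1:])):
        if pair in shared:
            bold[i] = bold[i + 1] = True
    return " ".join(f"<b>{t}</b>" if b else t
                    for t, b in zip(tokens, bold))
```

Running this twice, once in each direction, yields the two marked-up texts to be placed side by side.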


Collocations

It is sometimes said that words can be defined by the company they keep, i.e. that the meaning of a particular word is strongly influenced or coloured by the surrounding words. With the development of large corpora and dedicated software it has become very easy to study a word and its collocates. WordSmith Tools has options which let you highlight a word's left and right collocates. As an experiment, let us look at the nouns occurring immediately to the left of the following lexemes: babbar 'white', dadag 'to be bright', dalla 'to be bright', kug 'shining', and zalag 'to shine'. (Please ignore, for the sake of this experiment, the difference in word class marking.) As can be seen from the English labels given to these lexemes, their meanings, with perhaps the exception of babbar, seem similar, and they would very likely have been listed as synonyms in a thesaurus, much like a set of English words such as beaming, gleaming, glittering, glowing, shining, and luminous. We should also note that several of the Sumerian words are written with the same sign, UD (or UD.UD), i.e. babbar/dadag/zalag.
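The extraction behind such a left-collocate table can be imitated in a few lines. Here the part-of-speech check is reduced to a caller-supplied is_noun() test, standing in for the POS annotation in the ETCSL files, and the function name is made up for this illustration:

```python
from collections import Counter

def left_collocates(tokens, keyword, is_noun):
    """Count the nouns occurring immediately to the left of a keyword
    in a sequence of (lexeme) tokens."""
    counts = Counter()
    for left, word in zip(tokens, tokens[1:]):
        if word == keyword and is_noun(left):
            counts[left] += 1
    return counts
```

Sorting the resulting counts by frequency gives lists of the kind shown in Table 1.2.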

Table 1.2. Five comparable lexemes and their most frequent left collocates
Keyword (freq.)	Nouns immediately to the left of the keyword
babbar (73)	erin, tug2, siki, ud
dadag (79)	cu-luh, a, ki, cu
dalla (60)	ud, kalam, utu
kug (1,196)	an (An), cag4, ki, inim, barag, ki-tuc, an (heaven), cir3
zalag (98)	ud, saj-ki, cag4, igi

Before we venture some general remarks, note that the frequencies with which these lexemes occur differ significantly, and that this, of course, will have a bearing on the strength of any claims made. It is also vital to stress the importance of looking at the wider context of the above collocations and at the forms of the five keywords' collocates. In the sequence kalam dalla, for instance, dalla is not modifying kalam, since the form is kalam-ma 'in the Land'. dalla stands out in another way as well, since the wider collocation very often is ud-/utu-gin7 dalla e3 ≅ "to appear brightly in the manner of daylight/the sun god"(?).

At first glance, the five lexemes investigated seem to modify slightly different things, which lends support to the argument that they, and especially babbar, dadag, and zalag, should be seen as different lexemes. More interesting to follow up are perhaps the instances where the lexemes modify the same head, e.g. cag4 kug vs. cag4 zalag. Do the two combinations convey the same meaning? And what about ki dadag vs. ki kug?

Another observation we can make is that inim (words) and cir3 (songs) are modified by kug only, perhaps indicating a sense nuance of kug not present in the other lexemes. Places to be (sit, dwell) such as barag and ki-tuc are also modified by kug, while concepts associated with liquid stuff (a, cu-luh) occur with dadag, and materials, e.g. erin and tug2, with babbar.

The choice of the English words beaming, gleaming, etc. above was not accidental. Looking at the nouns occurring near these English words, one gets a picture similar to the one for the five Sumerian lexemes: some of the associated words overlap while others do not. Thus, we find that smiles and faces are beaming, eyes and faces are gleaming, eyes, prizes, and careers are glittering, eyes and faces are glowing, the sun, eyes, and light are shining, and the sun and eyes are luminous.

An exercise such as the one performed here can be done for any set of (synonymous) Sumerian words or lexemes to try and pinpoint their meanings or tease out similarities and differences between sets of semantically related words.


© Copyright 2003, 2004, 2005, 2006 The ETCSL project, Oriental Institute, University of Oxford