TLA Text Corpus


The text corpus is one of the two core parts of the Thesaurus Linguae Aegyptiae (TLA), the other one being the lemma lists. The corpus contains a constantly increasing number of Ancient Egyptian texts written in the hieroglyphic/hieratic or Demotic scripts, currently ranging from ca. 3,000 BCE to ca. 300 CE. (Coptic texts will be added later in the project.)

The Ancient Egyptian text world was a remarkable one, culturally and historically. Messages of varying length and complexity were written on immensely diverse objects from very different spheres of life. There are texts on portable objects such as papyri, ostraca (i.e., stone flakes or potsherds), and (complete) vessels, as well as texts on immovable objects such as the walls of temples and tombs, obelisks, statues, etc. These different kinds of support, with their material, formal, and functional features, contribute additional significance and add connotations to the meaning of the written texts.

Given this close interrelation and semantic interaction between written texts and their material support, an enhanced understanding of the Ancient Egyptian world view via texts needs to systematically take textual as well as material features of text objects into account. Consequently, all texts in the text corpus are annotated with a wide array of metadata relating to the texts themselves as well as to their material support (German: Textträger). It has always been a goal of the Academies’ project to develop a reasonably balanced, diverse corpus, i.e., to provide a representative range of textual and chronological variation in the corpus. As of now, the text corpus comprises about 1.44 million lemma tokens (hieroglyphic/hieratic: ca. 1,115,000; Demotic: ca. 326,000).

Furthermore, the texts in the TLA are not primarily conceptualized as abstract texts (e.g., Sinuhe) but as a (semantically coherent) textual string on a concrete support (e.g., papyrus Berlin ÄM P 3022).


Every written text and sentence as well as every text object (Textträger) in the corpus has its own unique stable ID number, e.g., “MORHQGR3SNBI3KHAF6YOW5WLL4”. The basic level of a text is its Egyptological transliteration. A growing part of the subcorpus of hieroglyphic/hieratic texts is also annotated with a digital hieroglyphic transcription in JSesh-specific Manuel de Codage and, to the extent possible, Unicode. All texts also come with a translation into a modern language (mostly German, sometimes English or French, depending on the author’s language proficiency). Texts may also include commentary notes.
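The layers described above (stable ID, transliteration as the basic level, optional hieroglyphic transcription, translation, notes) can be pictured as a simple record. The following Python sketch is purely illustrative: the class and field names are assumptions and do not reflect the TLA's actual data model, and the transliteration and translation strings are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextRecord:
    """Hypothetical record for one text; all field names are illustrative."""
    text_id: str                               # unique stable ID
    transliteration: str                       # basic level of every text
    translation: str                           # translation into a modern language
    translation_language: str = "de"           # mostly German, sometimes English or French
    hieroglyphs_mdc: Optional[str] = None      # JSesh-specific Manuel de Codage, if available
    hieroglyphs_unicode: Optional[str] = None  # Unicode, to the extent possible
    notes: List[str] = field(default_factory=list)

    def has_hieroglyphs(self) -> bool:
        """True if any digital hieroglyphic transcription is attached."""
        return self.hieroglyphs_mdc is not None or self.hieroglyphs_unicode is not None


record = TextRecord(
    text_id="MORHQGR3SNBI3KHAF6YOW5WLL4",  # example ID quoted in the text above
    transliteration="<transliteration placeholder>",
    translation="<translation placeholder>",
)
```

Keeping the hieroglyphic fields optional mirrors the fact that only part of the hieroglyphic/hieratic subcorpus carries a digital transcription.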

Text metadata and text object metadata

Texts and text objects systematically come with additional metadata which are not immanent in the text or text object itself. In order to enhance data retrieval, possible values of metadata are edited in controlled vocabularies (thesauri). Categories of data and metadata relating to texts and text objects are shown in the following table:

Text data and metadata | Text object metadata
Egyptological transliteration
(Digital) hieroglyphic transcription
Translation (German, English, or French)
Script (Hieroglyphic, Hieratic, Demotic)
Language phase (Old Egyptian, Middle Egyptian, etc.)
Dating of the text | Dating of the text object
Text category/type (‘genre’) | Type of text object
Agent of a social action
Archaeological context
Cultural context
Finding place
Current location
Bibliographical references | Bibliographical references
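The role of controlled vocabularies can be illustrated with a minimal Python sketch that checks metadata values against a fixed set of permitted terms. The dictionary keys, the function name, and the treatment of translation language as a controlled field are assumptions made for illustration; only the values named in the table are listed, and the real thesauri are far more extensive.

```python
# Hypothetical controlled vocabularies (illustrative subset only).
CONTROLLED_VOCABULARIES = {
    "script": {"Hieroglyphic", "Hieratic", "Demotic"},
    "translation_language": {"German", "English", "French"},
}


def invalid_fields(metadata: dict) -> list:
    """Return the names of fields whose value is not in its controlled vocabulary.

    Fields for which no vocabulary is defined here are not checked.
    """
    return sorted(
        name
        for name, value in metadata.items()
        if name in CONTROLLED_VOCABULARIES
        and value not in CONTROLLED_VOCABULARIES[name]
    )


print(invalid_fields({"script": "Hieratic", "translation_language": "German"}))     # []
print(invalid_fields({"script": "hieratic script", "current_location": "Berlin"}))  # ['script']
```

Restricting values in this way is what makes retrieval reliable: a search for one script finds every text tagged with exactly that term rather than with free-text variants.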

“Texts” and “sub-texts” in the TLA

A ‘text’ in the broader sense as conceptualized in the TLA is an entity marked as an independent textual unit by clearly marked text delimiters (beginning and end). An individual text may either consist of writing only, or it may be a multimodal composition of writing and illustrations. Examples of multimodal texts are the offering scenes on the walls of Egyptian temples, which display the king vis-à-vis a deity, both interacting with each other. Written labels (short phrases or sentences) identify the depicted entities or give information about their interactions. Such short textual units, although distinct entities, are part of the larger unit of the scene and are therefore conceptualized not as independent “texts,” but as dependent “sub-texts” in the TLA. One characteristic of sub-texts vs. texts in the TLA is that a fixed reading sequence of sub-texts cannot normally be established. Another characteristic is that, when interpreting a sub-text, it is necessary to take the accompanying sub-texts into account, i.e., the scene as a whole.
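The two characteristics of sub-texts named above (no fixed reading sequence, interpretation in the context of the whole scene) can be sketched as a data structure. This Python example is an illustration under stated assumptions: the class names, the method name, and the label strings are hypothetical and are not taken from the TLA's software.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SubText:
    label: str  # short phrase or sentence belonging to a scene


@dataclass
class Scene:
    """A multimodal unit whose sub-texts have no fixed reading sequence."""
    name: str
    # An unordered set is used on purpose: no reading order is implied.
    sub_texts: FrozenSet[SubText] = frozenset()

    def context_of(self, sub_text: SubText) -> FrozenSet[SubText]:
        """Accompanying sub-texts to consider when interpreting `sub_text`."""
        return self.sub_texts - {sub_text}


king = SubText("<label identifying the king>")
deity = SubText("<label identifying the deity>")
scene = Scene("offering scene", frozenset({king, deity}))
```

Here `scene.context_of(king)` yields the set containing only the deity's label, i.e., everything else in the scene that bears on the interpretation of the king's label.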

General principles of text editing in the TLA

As mentioned above, Egyptian texts in the TLA are primarily conceptualized as strings in Egyptological transliteration. Line/column counting generally follows the original source (tag “lc”, for ‘line[/column] count’). Conventional line counts of standard publications of (abstract) texts can be referred to in addition (“para” tag). Texts are divided into units of simple or complex sentences. Each sentence has a unique stable ID number by which it should be quoted, e.g., “TLA sentence IBUBd1NUc4LHaUPIlW0V9mCZyNQ”.
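As a minimal sketch of how a sentence with its two kinds of line count and its quotable ID might be represented, consider the following Python example. The class, its field names, and the placeholder values are assumptions for illustration; only the tag names “lc” and “para”, the citation pattern, and the example ID come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    sentence_id: str            # unique stable ID
    transliteration: str        # Egyptological transliteration of the sentence
    lc: str                     # line/column count following the original source
    para: Optional[str] = None  # conventional count of a standard publication, if any

    def citation(self) -> str:
        """Quote the sentence by its stable ID."""
        return f"TLA sentence {self.sentence_id}"


sentence = Sentence(
    sentence_id="IBUBd1NUc4LHaUPIlW0V9mCZyNQ",
    transliteration="<transliteration placeholder>",
    lc="<line count placeholder>",
)
print(sentence.citation())  # TLA sentence IBUBd1NUc4LHaUPIlW0V9mCZyNQ
```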

Each word token (or sometimes sequence of words) is lemmatized, i.e., it is linked to an entry (‘lemma’) in one of the TLA’s lemma lists. Moreover, the lemma tokens in many texts are also annotated with grammatical codes. These encode morphological inflection, mainly inflection that is overtly marked in script (e.g., gender and number of nouns), but sometimes also inflection that is covert in the purely consonantal script yet can be contextually reconstructed from syntax (e.g., genus verbi of an unmarked sḏm(=f), number of a relative sḏm.t.n(=f)). To keep grammatical annotation to a certain degree independent of continually debated theoretical premises, the tagging of tense/aspect/mood (TAM) features of inflected verbs is strictly limited to overt inflection, i.e., morphological features visible in the written form. For example, a morphologically unmarked nḥm(=f) is simply annotated as an instance of an (active or passive) “suffix conjugation” form without TAM specification. The lemma tokens of an increasing part of the texts also come with their original hieroglyphic spelling (or, in the case of hieratic, a hieroglyphic transcription). Authors are also encouraged to specify a particular sense of a lemma in context, either by picking one of a set of translations from the lemma list or by entering another specific sense themselves. In addition to these standard annotations, editors may add further annotations, such as semantic features (e.g., type of speech act, metaphorical domains), layout features (e.g., rubra, verse points, split columns, lists), etc.
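The annotation layers of a single token, and in particular the rule that TAM is recorded only when it is overtly marked, can be sketched as follows. This Python example is hypothetical throughout: the class, field names, and plain-English inflection labels are illustrative assumptions (the TLA's actual grammatical codes are not reproduced here), and the lemma ID is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Token:
    written_form: str                 # token in Egyptological transliteration
    lemma_id: str                     # link to an entry in a lemma list
    inflection: Tuple[str, ...] = ()  # grammatical annotation (illustrative labels)
    tam: Optional[str] = None         # set only if overtly marked in the written form
    hieroglyphs: Optional[str] = None # original spelling or hieroglyphic transcription
    sense: Optional[str] = None       # sense in context chosen by the author


def tam_label(token: Token) -> str:
    """Return the TAM value, or 'unspecified' when no overt marking was annotated."""
    return token.tam if token.tam is not None else "unspecified"


# A morphologically unmarked nḥm(=f): suffix conjugation, no TAM specification.
token = Token(
    written_form="nḥm",
    lemma_id="<lemma ID placeholder>",
    inflection=("suffix conjugation",),
)
print(tam_label(token))  # unspecified
```

Leaving `tam` empty rather than filling in a reconstructed value keeps the annotation independent of any particular theory of the Egyptian verbal system.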

The content of the text corpus

Currently, the following sub-corpora of ancient Egyptian texts are accessible in the TLA corpus. Sub-corpora with digital hieroglyphic transcription are marked with [H]; those with grammatical annotation are marked with [G]. (List as of early 2022.)

List of TLA authors

For a complete list of authors, see here.

History of the hieroglyphic/hieratic text corpus

The digital text corpus of the TLA was initiated as part of the Academy’s previous project “Altägyptisches Wörterbuch” (AAeW, 1992–2012) at the Berlin-Brandenburg Academy of Sciences and Humanities (funded by the Academies’ program of the Union of the German Academies of Sciences and Humanities). The idea was to create a digital successor to A. Erman’s & H. Grapow’s Wörterbuch der aegyptischen Sprache (1926–1931; 1950, 1963), notably including the Belegstellen (1935–1953) volumes, in the age of corpus-based computational lexicography: (i) a lemmatized, balanced digital corpus of Egyptian texts in hieroglyphic, hieratic, and Demotic script, on which is built (ii) a corpus-based ‘dictionary’ of the ancient Egyptian language.

In order to further complete the lemma list, additional texts were selected for inclusion in the TLA based on a set of criteria. Texts that had not been used for the original Wörterbuch project and texts that had been published or re-edited after the project had ended were favored for inclusion. Late Egyptian texts that were to be encoded in the Projet Ramsès (Liège), on the other hand, were disfavored. With the growth of the project team and increasing support from cooperating projects and individual researchers, a broader, more balanced, more diverse corpus is evolving.

Prospects for the future

Coptic text corpus

Coptic, the last phase of the ancient Egyptian language, is not yet represented in the TLA text corpus. Once the Coptic lemma list is implemented in the TLA, a sample corpus of texts from all Coptic dialects will be imported. This will come from the lemmatized digital text data generated by Wolf-Peter Funk over many decades. This legacy data was converted into a modern encoding format, i.e., Unicode, by Katrin John (cooperating project “Database and Dictionary of Greek Loanwords in Coptic,” FU Berlin) and will soon be processed for incorporation into the TLA.

Coffin texts

In collaboration with Wolfgang Schenkel, the project is preparing the transformation of his Coffin Text data (CTUrtext) so that the Coffin Texts can be incorporated into the TLA.