Thursday, 2 November 2017

Resolving and Defining Language Codes for Language Resources

A frequently requested functionality when working with language resources available from the web is the use (and the interpretation) of language codes. Language codes are an essential device to refer unambiguously to language varieties, but only as a technical means to identify languages, thereby supplementing the language name. This is necessary because the same language variety may be referred to by different names, and the same name may be applied to different language varieties: the Manding language Bamanakan is also known as Bambara, for example, while dütsch (or düdesch) was the medieval self-designation of both Low German and Dutch. In this blog post, I summarize the conventional standards and describe strategies for finding the appropriate language code for any given language.

There exist different standards for the purpose, most importantly ISO 639 (widely used but with a number of known limitations), and Glottolog (by linguists and for linguists, far more detailed and providing a structured [~ phylogenetic] view, but biased towards endangered modern languages, and thus rather sketchy in its historical dimension).

1. ISO 639 language tags

ISO 639 provides language identifiers as standardized by the International Organization for Standardization, whose standards are the most widely used in technical applications. With the dawn of multilingual information technology, standardized language identifiers became necessary, and they were first standardized as early as 1967. This original standard, still available as ISO/R 639, was withdrawn in 1988 and superseded by ISO 639:1988, which provided two-letter codes for a substantial set of languages. Unfortunately, combinations of two (ASCII) letters can distinguish at most 26² = 676 language varieties, which covers less than 10% of the languages currently spoken (as of 2016-08-19, SIL's Ethnologue lists 7,097 languages, Glottolog lists 7,943 language varieties). Accordingly, ISO 639:1988 was withdrawn in 2002 and superseded by the current ISO 639 standard.
ISO 639 not only extends the earlier two-letter codes, but also integrates other pre-existing standardization efforts, reflected in different profiles.

1.1 Two-letter codes (ISO 639-1)

ISO 639-1 continues the earlier two-letter codes for languages. A list (and an alignment with ISO 639-2 codes) is provided by the Library of Congress, the ISO 639-2 registration authority. Because of its brevity and widespread use in technical applications, an ISO 639-1 code should be used wherever one exists, even if the language also has a three-letter code [BCP47].

1.2 Three-letter codes from the librarian tradition (ISO 639-2)

ISO 639-2 is a standard for three-letter language identifiers, based on the MARC Code List for Languages, a system developed for use in libraries. As the Library of Congress is the maintenance agency for both lists, they are kept compatible in terms of code additions and deletions. However, coming from a librarian tradition, ISO 639-2 is (deliberately) limited in scope and coverage. The original MARC code list aims to "[include] individual codes for most of the major languages of the modern and ancient world, e.g. Arabic, Chinese, English, Hindi, Latin, Tagalog, etc. These are the languages that are most frequently represented in the total body of the world's literature." (LoC 2007, p.5, emphasis mine).
In 22 cases, ISO 639-2 provides two alternative codes, one for terminological (T) use and one for bibliographical (B) use. The T codes are aligned with ISO 639-3, whereas the B codes correspond to deviating MARC codes. For web resources, ISO 639-2/T codes are recommended, but only if no ISO 639-1 code exists [BCP47].
A list (and an alignment with ISO 639-1 codes) is provided by the Library of Congress.
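The selection rule described above (prefer the ISO 639-1 code, otherwise fall back to the ISO 639-2/T code) can be sketched in a few lines of Python. The REGISTRY table and the function name preferred_code are hypothetical and only hold a handful of entries for illustration; a real application would presumably load the full alignment published by the Library of Congress.

```python
# Hypothetical mini-registry for illustration only; a real application would
# load the full ISO 639-1 / ISO 639-2 alignment instead of hard-coding it.
# Each entry: (ISO 639-1 code or None, ISO 639-2/T code, ISO 639-2/B code).
REGISTRY = {
    "English": ("en", "eng", "eng"),
    "German": ("de", "deu", "ger"),
    "French": ("fr", "fra", "fre"),
    "Imperial Aramaic": (None, "arc", "arc"),
}


def preferred_code(language: str) -> str:
    """Return the ISO 639-1 code if one exists, else the ISO 639-2/T code."""
    alpha2, alpha3_t, _alpha3_b = REGISTRY[language]
    return alpha2 if alpha2 is not None else alpha3_t


if __name__ == "__main__":
    for name in REGISTRY:
        print(name, "->", preferred_code(name))
```

Note that the B code is never returned: under this rule it only matters when data has to be exchanged with MARC-based library systems.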

1.3 Three-letter codes for [almost] all human languages (ISO 639-3)

By providing identifiers for about 400 languages only, ISO 639-2 is deliberately limited in its coverage. ISO TC37/SC2 thus invited SIL International to develop a more exhaustive set of language codes as ISO 639-3. SIL International, originally known as the Summer Institute of Linguistics, is a faith-based (i.e., missionary) organization with a strong profile in linguistics, well-known in academia for Ethnologue, a near-exhaustive database of languages and information about them.

Similar to MARC, the Ethnologue employed (independently developed) three-letter language identifiers. For ISO 639-3, these were harmonized with ISO 639-2/T and complemented with identifiers for extinct and constructed languages provided by the Linguist List.

ISO 639-3 is designed for compatibility with ISO 639-2: "At the core of ISO 639-3 are the individual languages already accounted for in ISO 639-2. The large number of ... languages ... beyond those ... was derived primarily from Ethnologue ... [and] from Linguist List." [SIL 2015a] "The alpha-3 codes for ISO 639-2 and ISO 639-3 overlap. In particular, every individual language code element in the terminology code of ISO 639-2 is also included in ISO 639-3 [, and] ... every alpha-3 language identifier has a single denotation across the union of code elements from all parts of ISO 639" [SIL 2015b].

1.4 ISO 639-4 to ISO 639-6

These standards are grouped together because they are of limited practical relevance to the language resource community.
ISO 639-4 defines general principles of coding of the representation of names of languages [ISO 639-4:2010]. However, only the introduction is freely available, so most technical applications follow [BCP47] instead.
ISO 639-5 provides three-letter codes for language families and groups, maintained by the Library of Congress and thus primarily oriented towards supplementing ISO 639-2 [LoC 2013]. In particular, "this part of ISO 639 is intended to support the overall language coding (...) rather than provide a scientific classification of the languages of the world" [LoC 2013]. While ISO 639-5 provides a possible, but not uncontroversial, view of hierarchical grouping, it does not live up to the standards of linguists, both in terms of coverage (focusing on ISO 639-2) and design decisions (e.g., by postulating an Altaic group comprising Turkic, Tungus and Mongolian; here, the inclusion of Japanese and/or Korean has been suggested, as has abandoning the Altaic language group altogether).
As the aforementioned ISO 639 profiles face a number of known issues in terms of coverage and historical depth, the complementary ISO 639-6 profile aimed to provide 4-letter codes for language variants. ISO 639-6 was edited by Debbie Garside and maintained by GeoLang Ltd., a Welsh company that originally provided services for geolinguistic research, but later shifted their focus to cyber security. The standard was withdrawn in 2014 [ISO 639-6:2009], possibly as a result of this re-orientation of the maintainer.

1.5 Applications and Issues

The ISO 639 standards are highly successful in technical applications and, where applicable, recommended for language resource metadata, e.g., as part of the Dublin Core Metadata Initiative: dcterms:language recommends "to use a controlled vocabulary such as RFC 4646". RFC 4646 (and its successor RFC 5646, i.e., BCP47) defines tags for identifying languages on the basis of ISO 639. With a moderate degree of simplification, the following production rule applies:

$ISO_639 ("-" $ISO_639')? ("-" $ISO_15924)? ("-" $ISO_3166_1)?

with the following components:
  • $ISO_639 the shortest ISO-639 code applicable (obligatory), e.g., en for English.
  • $ISO_639' an extended language tag (e.g., where an ISO 639-3 code provides a finer granularity than ISO 639-1) (optional).
  • $ISO_15924 ISO 15924 4-letter code for script (optional), e.g., Latn for Latin.
  • $ISO_3166_1 ISO 3166 2-letter (or UN M.49 3-digit) region code (optional), e.g., DE (resp. 276) for Germany or US (resp. 840) for the USA.
This blog post can thus be characterized with any of the 10 following language tags:
  • en (it is English, indeed)
  • en-Latn (English in Latin characters)
  • en-Latn-DE (resp. en-Latn-276; English in Latin characters written in Germany)
  • en-Latn-US (resp. en-Latn-840; English in Latin characters compliant with the US variety ~ American English)
  • en-DE, en-276, en-US, en-840
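As a minimal sketch, the simplified production rule can be turned into a regular expression that decomposes such tags into their components. The names TAG and parse_tag are hypothetical, and the pattern assumes conventional casing (lowercase language, capitalized script, uppercase region); full BCP47 is case-insensitive and knows further subtags (variants, extensions, private use) that are deliberately not covered here.

```python
import re

# Simplified BCP47 pattern following the production rule above.
# Assumption: tags use conventional casing; variants, extensions and
# private-use subtags of full BCP47 are not handled.
TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"               # shortest applicable ISO 639 code
    r"(?:-(?P<extlang>[a-z]{3}))?"             # optional extended language subtag
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"         # optional ISO 15924 script code
    r"(?:-(?P<region>[A-Z]{2}|[0-9]{3}))?$"    # optional ISO 3166 / UN M.49 region
)


def parse_tag(tag: str) -> dict:
    """Split a language tag into its components, omitting absent ones."""
    match = TAG.match(tag)
    if match is None:
        raise ValueError(f"not covered by the simplified rule: {tag!r}")
    return {key: value for key, value in match.groupdict().items() if value is not None}


if __name__ == "__main__":
    print(parse_tag("en-Latn-DE"))
    # {'language': 'en', 'script': 'Latn', 'region': 'DE'}
```

All ten tags listed above are accepted by this pattern, and each yields en as its language component.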
When creating state-of-the-art editions of legacy resources, it is often necessary to determine the language code of a language variety that is described only informally. In order to find the language code for a language variety, different strategies can be pursued. ISO 639-2 and ISO 639-1 are basically subsets of ISO 639-3, which originates from SIL's Ethnologue. However, Ethnologue introduced a paywall in 2016, so that other alternatives are to be considered. As ISO 639-3 language tag maintenance is partially coordinated with the Linguist List, their MultiTree portal provides a semi-authoritative resource for searching for languages and their linguistic context. Glottolog is increasingly being considered as a freely available substitute for Ethnologue, and also provides ISO 639-3 codes (where appropriate).

Wikipedia and Wikipedia-derived resources such as DBpedia do not represent viable sources for determining language tags or for identifying languages. Although many Wikipedia pages provide ISO 639 codes, as well as links to Glottolog, the redirection formalism of Wikipedia introduces additional noise. As an example, Middle Aramaic, "spoken from the 3rd century CE into various periods of modern times in different areas" [disambiguation page], redirects to Aramaic, and the language code provided at the time of writing is arc, i.e., "Imperial Aramaic (700-300 BCE)".

On the web, BCP47 codes are used, for example, to mark strings with their language, e.g., in RDF 1.1. It is thus possible to give an RDF resource labels in different languages:

<http://dbpedia.org/resource/LLOD> rdfs:label "Linguistic Linked Open Data"@en .

In query languages, we can now explicitly query for English (French, etc.) labels, e.g., in SPARQL

<http://dbpedia.org/resource/LLOD> rdfs:label ?label. FILTER(lang(?label)='en')

An obvious problem is that BCP47 codes provide different levels of granularity, and that fine-grained BCP47 tags need to be decomposed in order to be compared with less granular ones. In order to look for text in (any standard variety of) English, it is thus not sufficient to test the language tag for equality; instead, the query must match a prefix of the tag (if it provides geoinformation) and treat equivalent subtags alike (e.g., numerical and alphabetical country codes). Because BCP47 codes can be complex, the "naive" approach with lang() and equality tests is error-prone and should be avoided. Instead, SPARQL provides the langMatches() function, which implements routines for matching BCP47 language codes. It takes the language tag of a literal, as returned by lang(), and a language range:

<http://dbpedia.org/resource/LLOD> rdfs:label ?label. FILTER(langMatches(lang(?label),'en'))

A frequent mistake is that, because of the known deficits of ISO 639-1 and ISO 639-2, practitioners in LLOD tend to prefer ISO 639-3 and use these codes even when BCP47 requires an ISO 639-1 code. Under these circumstances, langMatches() may return unexpected results.
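To illustrate why such mismatches occur, the following Python sketch approximates the prefix-based matching that langMatches() performs (basic filtering in the sense of RFC 4647). The function name lang_matches is hypothetical, and the sketch is an approximation for illustration, not the SPARQL implementation itself.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Approximate basic filtering: the range must equal the tag or be a
    prefix of it that ends at a subtag boundary; comparison ignores case."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        # The wildcard range matches any non-empty language tag.
        return tag != ""
    return tag == lang_range or tag.startswith(lang_range + "-")


if __name__ == "__main__":
    for tag in ("en", "en-US", "en-Latn-840", "eng", "de"):
        print(tag, lang_matches(tag, "en"))
```

A label tagged with the ISO 639-3 code eng is thus not found by a query for en, because eng neither equals en nor starts with en followed by a hyphen.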

The fact that BCP47 and ISO 639 represent complex, structured information in opaque strings may also lead to other problems: Without knowing the exact decomposition rules, and without a machine-readable specification of the relations between related ISO 639 codes, it is not possible to query this information. Also, as SIL has a natural bias towards languages of the ancient Near East, ISO 639-3 provides a very fine-grained classification for, e.g., Aramaic, distinguishing Old Aramaic (oar), Imperial Aramaic (arc), Jewish Palestinian Aramaic (jpa), Jewish Babylonian Aramaic (tmr), Samaritan Aramaic (sam), Classical Syriac (syc, Syriac Aramaic), Syriac (syr, as macro-language), etc. But these fine-grained distinctions are mostly motivated by theological interest, and other applications require a broader notion of "Aramaic". So, when we want to express references to (an unspecified variety of) "Aramaic" in, say, a glossary of the Old High German Diatessaron (which originates from a Latin translation of an Aramaic gospel harmony), it is difficult even for a specialist in older Germanic languages to identify the proper language tag for the underlying variety of Aramaic, in particular given the fact that ISO 639-3 documentation is sparse. Unfortunately, ISO 639-2 only gives us a reduced choice between Imperial Aramaic (arc, "until 300 BCE"), Samaritan Aramaic (sam, does not apply) and Syriac (syr, including modern varieties), whereas ISO 639-1 is not aware of Aramaic, resp. Syriac, at all. We are thus lacking a level of generalization here.

A well-known problem is that ISO 639 codes are often insufficiently granular for the needs of linguists, who require more fine-grained subclassifications of languages as well as a relatively flexible way to add new distinctions where necessary. As an example, orthographic traditions may change drastically over time, between genres, and even between different social groups, so that language processing of historical texts may require specialized language identifiers for a specific variety written at a specific time in a specific region, for a specific purpose and by persons from a specific group. In Sumerian, for example, women seem to have used a specific language variety with a largely deviating phonology and vocabulary, but this is documented in certain genres only. However, a BCP47 code to represent this Emesal variety does not exist. Similar gender dissociations can be found in the Central American Garifuna language, where the language of women and the language of men have historical origins in different language families, but cannot be differentiated with BCP47 (which recommends the ISO 639-3 code cab).

2. URI-based language codes

Tag-based approaches to language classification rely on unstructured lists of strings (tags) as their primary data structure, where relations between different categories are not formally represented, and with a fixed level of granularity. While BCP47 allows the meaning of language tags to be refined by intersecting them with categories for other levels (language variety, writing system, geographic region) to arrive at a more specific definition, these additional criteria are only indirectly related to linguistic classification, and thus potentially error-prone.

A second issue is that expanding the BCP47 vocabulary is a laborious and formal process. In order to introduce a novel language identifier for ISO 639-3, for example, a formal proposal needs to be submitted, verified by the ISO 639-3 registrar, and discussed within the community, e.g., on Linguist List and other appropriate discussion lists. Once a consensus towards a change emerges, it is documented and published as a Change Request, which is then open to further review and comment by any interested party for a period of three months, before finally being adopted, adopted in part, amended or withdrawn. However, this decision process is only partially driven by linguistic considerations and can be affected by external factors such as language politics. Assigning a language variety an ISO language identifier can be seen as a political move. As an example, the ISO 639-3 identifiers for Scanian (scy) and Jamtska (jmk), transitional dialects between Swedish and Danish, resp. Norwegian, have been retired in favor of the national language, Swedish; cf. the discussion on the re-activation of the identifier for Scanian.

These problems, i.e., the unstructured and static nature of language tags, their fixed level of granularity, and the administrative overhead of maintaining them, sparked early ideas on URI-based language identification: Regardless of the method of maintenance (be it by expert approval or any kind of formal process), additional information about language URIs can be provided in a machine-readable way, e.g., regarding their phylogenetic relations, and if necessary, URIs for novel language varieties can be created in a new namespace and put in relation with community-approved concepts.

2.1 ISO 639 in RDF

An RDF edition of ISO language codes was discussed at the W3C as early as the mid-2000s, but without a consensus on the maintainer, this never evolved into a concrete resource. Only recently did the Library of Congress, the registration authority for ISO 639-1, 639-2, and 639-5, add RDF serializations to its editions [639-1, 639-2, 639-5], whereas SIL, the registration authority for ISO 639-3, only provides TSV data. Accordingly, most LLOD resources point to URIs and RDF data sets provided by third parties instead. A current community practice (also adopted, for example, by the German National Library) is to refer to lexvo for ISO 639-3 URIs. Accordingly, it is possible to describe the language of this blog post as http://id.loc.gov/vocabulary/iso639-1/en, http://id.loc.gov/vocabulary/iso639-2/eng, or http://lexvo.org/id/iso639-3/eng.

While these RDF editions of ISO 639 merely provide a point of reference for language designations within and beyond the language resource community, they preserve the coverage and granularity issues of ISO 639 standards. However, it is now possible to develop more elaborate vocabularies tailored towards the needs of linguists which refer to ISO 639 language categories.

2.2 Glottolog

Glottolog is an academic repository that provides URIs and machine-readable information for identifying language varieties (or languoids). It has been collaboratively developed, and its languoid inventory is currently maintained by Martin Haspelmath and colleagues. Glottolog grew out of earlier efforts to create a unified bibliographical resource for language documentation (LangDoc), but it has found wide reception beyond this original use case. At the time of writing, for example, Glottolog IDs are also used by the wider community, e.g., in Wikipedia. A crucial aspect is that Glottolog avoids the notion of "language", as it comes with unintended political connotations (cf. Max Weinreich's "a language is a dialect with an army and a navy"), but instead defines a languoid as a language variety about (or in) which written literature exists. Accordingly, language families, proto-languages, national languages, historical varieties, dialects and sociolects can receive a unified treatment. A Glottolog ID combines a 4-letter alphabetic core with a 4-digit numerical code, e.g., stan1293 for (Standard) English. More importantly, it comes as a native URI: http://glottolog.org/resource/languoid/id/stan1293, which resolves via content negotiation to an HTML visualization or to RDF data, which then provides further links to ISO 639, lexvo, etc.
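A minimal sketch of how such a URI can be assembled from a Glottolog ID is given below. The names GLOTTOCODE, BASE and languoid_uri are hypothetical, and the validation pattern simply mirrors the description above (four lowercase letters followed by four digits); this is an assumption, and the actual Glottolog scheme may be more permissive.

```python
import re

# Assumption: IDs consist of four lowercase letters plus four digits,
# as described in the text; the real scheme may allow more.
GLOTTOCODE = re.compile(r"[a-z]{4}[0-9]{4}")
BASE = "http://glottolog.org/resource/languoid/id/"


def languoid_uri(glottocode: str) -> str:
    """Return the languoid URI for a syntactically plausible Glottolog ID."""
    if GLOTTOCODE.fullmatch(glottocode) is None:
        raise ValueError(f"not a plausible Glottolog ID: {glottocode!r}")
    return BASE + glottocode


if __name__ == "__main__":
    print(languoid_uri("stan1293"))
    # http://glottolog.org/resource/languoid/id/stan1293
```

The check is purely syntactic: it does not tell whether the ID is actually assigned to a languoid, which would require resolving the URI.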

More importantly, relations between languoids are also provided in a machine-readable way, e.g., phylogenetic relations: English is a subconcept of (skos:broader) "Macro-English" (macr1271, which groups together Modern English with a number of English Pidgins), etc., and it has further subconcepts (skos:narrower) such as Indian English (indi1255), New Zealand English (newz1240), etc. Glottolog is designed to be descriptively adequate, but extensible rather than exhaustive: Suggestions about novel or incorrect languoids can be reported via the website and will be addressed by the maintainers. So, even where a distinction is missing, it may be introduced upon request, and if properly justified by the accompanying scientific literature, it will be accepted.

As it is linked with ISO, etc., Glottolog can also be used to search for ISO language tags.

2.3 Language identification with URIs

Language identification in accordance with BCP47 has the great advantage of being compact, readable and well integrated in web technologies: XML provides an inheritance mechanism for xml:lang, and the Turtle format provides a short notation such as "some string"@en. With URI references, this cannot be directly reproduced; instead, explicit typed links between language resources (or their parts) and languoid URIs are required, e.g., with an RDF triple such as ... dcterms:language <http://glottolog.org/resource/languoid/id/stan1293>. In practice, a combination of both strategies should be employed, i.e., a language resource should define its object language (the language of the primary data, e.g., the text language in a corpus, or the language of lexical entries in a dictionary) with an explicit triple and an appropriate vocabulary (e.g., http://purl.org/dc/terms/), but use language tags for its description language (the language of annotations, labels or definitions). A conventional interpretation would thus regard untyped literals as originating from the object language, unless overridden by an explicit language tag. In most cases, ISO 639-1 language tags will suffice to identify the description language, whereas the various object languages require a more easily extensible language inventory as provided by Glottolog.
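The combined strategy can be sketched as a small Python function that emits Turtle: the object language is stated with an explicit dcterms:language triple pointing to a languoid URI, while the label carries a language tag for the description language. The function name describe_resource and the resource URI http://example.org/corpus are hypothetical and serve only as an illustration.

```python
def describe_resource(resource_uri: str, languoid_uri: str,
                      label: str, label_lang: str) -> str:
    """Emit Turtle that states the object language as a URI and tags the
    label with its (description) language."""
    # Escape backslashes and double quotes so the label is a valid Turtle string.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        f"<{resource_uri}>",
        f"    dcterms:language <{languoid_uri}> ;",
        f'    rdfs:label "{escaped}"@{label_lang} .',
    ])


if __name__ == "__main__":
    # http://example.org/corpus is a placeholder, not an existing resource.
    print(describe_resource(
        "http://example.org/corpus",
        "http://glottolog.org/resource/languoid/id/stan1293",
        "An English corpus",
        "en",
    ))
```

Keeping the two mechanisms apart means that the object language can use any languoid URI, including newly introduced ones, while labels remain queryable with standard language-tag matching.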

References

[BCP47] A. Phillips, M. Davis (ed., 2009), Best Current Practice: Tags for Identifying Languages, https://tools.ietf.org/html/bcp47, retrieved 2016-08-20; also known as RFC 5646
[ISO 639-6:2009] ISO (2009), Codes for the representation of names of languages -- Part 6: Alpha-4 code for comprehensive coverage of language variants, http://www.iso.org/iso/catalogue_detail?csnumber=43380, retrieved 2016-08-20
[ISO 639-4:2010] ISO (2010), Codes for the representation of names of languages — Part 4: General principles of coding of the representation of names of languages and related entities, and application guidelines, https://www.iso.org/obp/ui/#iso:std:39535:en, retrieved 2016-08-20
[LoC 2007] Library of Congress (2007), MARC Code List for Languages, Introduction, https://www.loc.gov/marc/languages/introduction.pdf, retrieved 2016-08-19
[LoC 2013] Library of Congress (2013), Codes for the Representation of Names of Languages, Part 5: Alpha-3 code for language families and groups, https://www.loc.gov/standards/iso639-5/langhome5.html, retrieved 2016-08-20
[SIL 2015a] SIL International (2015a), ISO 639-3, http://www-01.sil.org/iso639-3/default.asp, retrieved 2016-08-20
[SIL 2015b] SIL International (2015b), Relationship between ISO 639-3 and the other parts of ISO 639, http://www-01.sil.org/iso639-3/relationship.asp, retrieved 2016-08-20