White Paper: Foreign Languages on the Web
The "lang" Attribute and Classical Latin
Recently I was faced with a question on which I had no hard data. Fortunately, the international colleagues I work with who specialize in web accessibility were able to supply some interesting information and opinions, and I share their thoughts and observations here.
The Problem:
A web page (or, more specifically, a series of web pages) written in English also features extensive tracts of Classical Latin text, originating from the 12th and 13th centuries. The W3C WCAG 1.0 guidance states: Clearly identify changes in the natural language of a document's text and any text equivalents (e.g., captions).
(Priority 1, Checkpoint 4.1)
The question, however, was whether the non-trivial task of marking up these Latin texts to meet the WCAG requirement would be worth the investment. (<span lang="la"> Incipit: Holi writ haþ a liknesse to tre þat bereþ noote oþer appel</span>)
It is possible that doing so would still not serve a key constituency in practice, namely screen reader users, as it was not clear that a Latin speech synthesizer even exists today. In addition, "Screen readers without Unicode support will read a character outside Latin-1 as a question mark, and even in the latest version of JAWS, the most popular screen reader, Unicode characters are very difficult to read." [1]
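The quoted behaviour can be approximated in a few lines of Python. This is only a sketch under a stated assumption: that a reader without Unicode support behaves like a lossy conversion to Latin-1. The function names and sample strings are hypothetical and are not taken from any screen reader.

```python
def as_latin1_only_reader(text):
    """Replace every character that Latin-1 cannot encode with '?'.

    Assumption: this mimics a reader limited to Latin-1; real products
    may behave differently.
    """
    return text.encode("latin-1", errors="replace").decode("latin-1")


def outside_latin1(text):
    """Return the distinct characters in text that lie beyond U+00FF."""
    return sorted({ch for ch in text if ord(ch) > 0xFF})


if __name__ == "__main__":
    # Thorn (U+00FE) lies inside Latin-1, so it survives unchanged.
    print(as_latin1_only_reader("ha\u00fe"))
    # E and I with macron (U+0113, U+012B) lie outside Latin-1.
    print(as_latin1_only_reader("r\u0113g\u012bna"))
    print(outside_latin1("r\u0113g\u012bna"))
```

Running a check like `outside_latin1` over the source texts would show how many characters are actually at risk before any markup work begins.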
Opinions and Facts:
The facts and opinions prompted by the question centered on the following points:
- Screen Readers / Screen Reading Software
The number of languages supported by JAWS (the leading screen reading software package in the marketplace) is not limited to the list on the software vendor's website [2], because local distributors, for example Freedom Scientific Benelux, can deliver a JAWS version with a speech synthesizer for Dutch. It could not be determined, however, whether a JAWS speech synthesizer for Latin currently exists. [3] Even so, there is a sizable cottage industry of JAWS scripters who could add support for the characters if it is not currently available. Given that this is an academic project, it is conceivable that a blind researcher may wish to tap into this scripting resource to add the capability, if documents exist that would become more accessible with the investment.
Even if no speech synthesis is available for a language, screen readers like JAWS can announce language changes, and users can associate particular voice configurations with particular languages.
Looking beyond JAWS, Classical Latin [4] is among the current MBROLA voices [5] available. It is therefore (at least theoretically) usable with at least some screen readers and text-to-speech software, e.g. NVDA [6], FreeTTS (used by FireVox) [7], and Emacspeak [8].
- Typesetting / Alternate Usage
Because this is an academic project, it might be more important to mark up the language correctly for reasons other than accessibility. Words or even phrases can be machine-processed in various useful ways, e.g. for machine translation, and such processing is significantly more successful if the software knows for sure what language it is dealing with.
For example, if a user opens your HTML page in a word processor such as Microsoft Word, it will use the language markup, and this can be relevant when spelling checks are "on", i.e. when words classified as misspelled are highlighted. Declaring Latin words as Latin prevents the program from applying English spelling rules to them. (A copy of Word tested for this seemed to be Latin-ignorant. That is, it recognized the words as being in Latin but did not flag anything as misspelled and did not even hyphenate Latin words. This is probably better than treating them as English or some other language.)
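To illustrate how software can use language markup, here is a minimal sketch that splits an HTML fragment into text runs, each tagged with the language in effect at that point. It uses Python's standard html.parser module. The class name, function name and sample fragment are hypothetical, and the sketch assumes markup in which every non-void element has an explicit end tag.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not touch the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class LangRunExtractor(HTMLParser):
    """Collect (language, text) pairs, inheriting lang from ancestors."""

    def __init__(self, default_lang=""):
        super().__init__()
        self._stack = [default_lang]  # language in effect, innermost last
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        lang = dict(attrs).get("lang")
        # An element without its own lang inherits the enclosing one.
        self._stack.append(lang if lang is not None else self._stack[-1])

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.runs.append((self._stack[-1], text))


def text_by_language(html, default_lang=""):
    parser = LangRunExtractor(default_lang)
    parser.feed(html)
    parser.close()
    return parser.runs


if __name__ == "__main__":
    sample = ('<p lang="en">The text opens: '
              '<span lang="la">Incipit liber primus</span>, as usual.</p>')
    for lang, text in text_by_language(sample):
        print(lang, "|", text)
```

A spelling checker, translator or speech synthesizer could then route each run to the component for its language and leave unknown languages untouched.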
Even when the language markup is correct, however, search engines and related tools (such as Google) do not necessarily use this information. One respondent found web pages in Dutch with correct language markup that still showed up in search results, even when he explicitly asked Google to return only pages in English.
Character support is a different issue; it should not depend on language markup, and mostly doesn't. Generally, in special software like screen readers or specialized browsers, we should expect character support to be more restricted than in common modern browsers. Even Latin-1 isn't as safe as in "normal" browsing. For example, what would a screen reader do upon encountering a special character like "¶"? Would it recognize it as having a special meaning (paragraph separator) and pause? It probably spells it out. This might mean saying "pilcrow sign", perhaps independently of the language being used (since character names aren't widely localized, and most characters don't even have a name in most languages), which might be complete gibberish even to people who understand normal English.
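The names in question can be looked up with Python's standard unicodedata module, which exposes the English character names from the Unicode standard. The sketch below lists the non-ASCII characters in a string with their names. Whether a given screen reader speaks these names is an assumption made for illustration, and the function name is hypothetical.

```python
import unicodedata


def character_report(text):
    """List (code point, Unicode name) for each non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
        if ord(ch) > 0x7F
    ]


if __name__ == "__main__":
    # Pilcrow (U+00B6) followed by a word containing thorn (U+00FE).
    for code_point, name in character_report("\u00b6 ha\u00fe"):
        print(code_point, name)
```

The names are the same whatever the language of the surrounding text, which is the localization problem described above.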
- Current and Future Technology
Style sheets, either page or user style sheets, could be used to style words in a particular language differently from others, using a selector like [lang="la"] or :lang(la). However, this does not work in, for example, IE 6, which does not recognize such selectors. In some browsers, like Firefox, the user can right-click on a word and get information about its language. Finally, some day some browsers or other software could make real use of the markup.
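The two selectors are not equivalent, and the difference can be sketched in Python. The functions below are a simplified model written for illustration, not browser code: the first follows the CSS 2.1 description of :lang(), which matches the element's language (including an inherited one) when it equals the identifier or begins with it followed by a hyphen; the second models [lang="la"], which requires the attribute on the element itself with exactly that value. The function names are hypothetical, and comparing case-insensitively is a simplifying assumption.

```python
def lang_pseudo_class_matches(selector_lang, element_lang):
    """Model of :lang(C): equal to C, or starting with C plus '-'.

    element_lang is the language in effect for the element, whether
    declared on it or inherited from an ancestor.
    """
    c = selector_lang.lower()
    value = element_lang.lower()
    return value == c or value.startswith(c + "-")


def exact_attribute_matches(selector_value, lang_attribute):
    """Model of [lang="C"]: the element's own attribute must equal C.

    lang_attribute is None when the element has no lang attribute.
    """
    if lang_attribute is None:
        return False
    return lang_attribute.lower() == selector_value.lower()


if __name__ == "__main__":
    print(lang_pseudo_class_matches("la", "la-VA"))  # subtag still Latin
    print(lang_pseudo_class_matches("la", "lad"))    # different language
    print(exact_attribute_matches("la", None))       # inherited only
```

In practice this means :lang(la) also reaches elements nested inside a Latin passage, while [lang="la"] only reaches the element that carries the attribute.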
A special note of thanks goes out to the following contributors, who provided this information, and have been quoted (often verbatim) in this white paper:
- Charles McCathieNevile - Opera Software, Standards Group
- Jukka K. Korpela - http://www.cs.tut.fi/~jkorpela/personal.html
- Benjamin Hawkes-Lewis - http://benjaminhawkeslewis.com/
- Michael Moore - Texas Department of Assistive and Rehabilitative Services
- Christophe Strobbe - Katholieke Universiteit Leuven (Belgium) - Dept. of Electrical Engineering- SCD - Research Group on Document Architectures
- [1] http://en.wikipedia.org/wiki/Wikipedia:Accessibility
- [2] http://www.freedomscientific.com/fs_products/software_jawsinfo.asp
- [3] http://lists.w3.org/Archives/Public/w3c-wai-gl/2005AprJun/0097.html
- [4] http://tcts.fpms.ac.be/synthesis/mbrola/demo/la1.wav - NOTE: male voice, 188K Wav file
- [5] http://tcts.fpms.ac.be/synthesis/mbrola.html
- [6] http://www.nvda.fr/spip.php?article14
- [7] http://freetts.sourceforge.net/
- [8] http://web.mit.edu/ATIC/src/emacspeak-9.0/mbrola
