Wikivoyage talk:Article naming conventions/Accented characters

Moved from Project:Foreign-language names by (WT-en) Evan

The issue
The current style standards state that as a general rule "English" names should be used. Specifically,
 * Project:Foreign-language names indicates that "English-language names" should be used in article text.
 * Project:Article naming conventions indicates that "foreign language characters" should not be used in article names and that "foreign language spellings" should not be used in articles for technical reasons.

For the many instances where the place's native name is written in Latin script with non-English characters (which I'll haphazardly call "accented characters"), and a traditional English name (such as Munich) does not exist, some Wikivoyage contributors interpret the standard as advocating the simple replacement of accented characters by the closest English character. For example:


 * Malmo not Malm&ouml;
 * Sao Paulo not S&atilde;o Paulo
 * Popayan not Popay&aacute;n
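
The "closest English character" replacement described above can be sketched in a few lines of Python. This is only an illustration, not anything the wiki software is known to do; the helper name `strip_accents` is made up for this example. It decomposes each character (Unicode NFD) and drops the combining marks, which handles the three names above but would not touch letters such as ø or ł, which have no decomposition.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for native in ("Malmö", "São Paulo", "Popayán"):
    print(native, "->", strip_accents(native))
```

The result is a spelling that merely lacks the accents; nothing in the procedure consults English usage.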

However, it is unclear whether accented characters are considered foreign characters in the way that Greek characters are. They are (conspicuously) not specifically addressed or given as examples in either article. (Project:Foreign-language names says Montreal instead of Montr&eacute;al, but this is a bit special because English is used in Montreal, and the English pronunciation of the name differs significantly from the French beyond just the "&eacute;" sound.)

Why we should use accented characters
I think that accented characters should not be considered foreign, thus they ought to be used consistently in article text, and should be permissible in article names. Here is why:

1) Removing accents from a name does not make it an English-language name.

The standard says to use the common English-language name, but Malmo is not a traditional English name in the sense that Munich is. Malmo is simply the closest approximation of the written name using only English-language characters, i.e., it's what people use when they can't write Malm&ouml; or don't know that it's Malm&ouml;.

2) Wikivoyage can use accented characters.

Wikivoyage is written for a medium that (for the vast majority of users) can display accented characters, for an audience that understands accented characters (i.e., knows that "&ouml;" is some sort of variation of "o" rather than some totally incomprehensible squiggle).

Because of this, we should not be swayed by outside standards or usage, to which these conditions generally do not apply.

Thus, appealing to simple Google searches to measure usage is not legitimate. To put it bluntly, all those web pages that say Malmo do so because the writers are uninformed about the medium or the audience, or making certain assumptions about them that don't apply to Wikivoyage, or just lazy. The writer may assume that the reader's browser is not equipped to properly display accented characters. The writer may not know how to enter "&ouml;". The writer may not even realize the difference. Or the writer may just be too lazy to write it correctly. (I myself have written a lot of Wikivoyage content that omits accents, and that's because I am lazy.) That doesn't excuse them, and it doesn't excuse me.

I don't even think style standards such as the New York Times or Associated Press should be relied on, as those standards were established for media that could not easily use accented characters, and a target audience that benefited little from their use (see next item).

There is actually one area of external usage which I think we can look to, and that is English-language material published in the country (including web pages originating in the country). Munich is used in English-language material such as local tourist information and media in Germany. Similarly, English-language material published in Sweden consistently writes "Malm&ouml;". Surely nobody knows more about Swedish names than the Swedes, and if they write "Malm&ouml;" in English text, wouldn't that make it the "English name"?

3) It benefits the traveler.

Whether it resides in the traveler's head or on a sheet of paper, the correct representation will be better understood by locals, and will enable better understanding by the traveler. Not using accented characters will hinder the traveler. If I don't know that it's Malm&ouml; rather than Malmo, it will be harder for me to understand why it is pronounced the way it is. It will be harder for me to recognize the spoken name. Using accented characters also educates the traveler on basic pronunciation rules in the native language that can be applied when learning other words and names.

4) It does not hinder the traveler

If you don't know how to pronounce the accented character, you can just pronounce it "in English", and come reasonably close.

5) It makes Wikivoyage look better. 

The correct written representation will be better appreciated by locals. It won't be considered amateurish or incorrect.

Objections
There are a number of objections to using accented characters on technical grounds, which I respond to:

1) Accents mess up sorting and searching.

As far as I know, Wikivoyage doesn't use any automated alphabetical sorting in generating content. As for searching on Wikivoyage, redirects make things work out, so whether you search for the accented name or the unaccented equivalent, you end up in the same place. Sorting and searching Wikivoyage content outside of Wikivoyage (e.g., in your own word processor, on Google) should not be our concern.

2) Accented characters are difficult to enter with an English keyboard

The only time a Wikivoyage reader has to enter text is to search, and as stated above, redirects take care of searches. Surely someone searching for Malm&ouml; is not going to give up because he can't type "&ouml;", without trying "Malmo". As for the inconvenience to Wikivoyage writers, sorry, but the traveler comes first.

Thus, I don't believe that extended Latin characters in article names are a real problem. However, I think that restricting article names to English-language characters will not be a problem either, and will at the very least make URLs look nicer (Malmo instead of Malm%C3%B6). If there is a way in MediaWiki to name the article file Malmo but the title Malm&ouml;, that would be ideal.

-- (WT-en) Paul Richter 15:44, 29 May 2004 (EDT)
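
For readers wondering where Malm%C3%B6 comes from: it is the percent-encoded form of the UTF-8 bytes of the accented title. A minimal sketch using Python's standard library, purely as an illustration of the URL issue mentioned above:

```python
from urllib.parse import quote, unquote

title = "Malmö"

# quote() encodes non-ASCII characters as UTF-8 bytes by default,
# then writes each byte as %XX.
encoded = quote(title)
print(encoded)           # Malm%C3%B6
print(unquote(encoded))  # Malmö

# A pure-ASCII title passes through unchanged.
print(quote("Malmo"))    # Malmo
```

The ö occupies two bytes in UTF-8 (C3 B6), hence the two escapes for one letter.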


 * I'd like to cast in a rather belated vote with Paul here: I think we should allow the entirety of ISO 8859-1 (Latin-1) in article names, with redirects as appropriate. See Talk:Fuerth for yet another example of why not doing so is causing problems. (WT-en) Jpatokal 00:32, 17 Jan 2005 (EST)


 * OK, but what's so special about ISO 8859-1? We are using UTF-8 anyway... If we accept ç, why not ć? I would rather vote to accept any Latin letters with added accents. (Sure, that's because I'm an egoist. Most Polish-specific chars are in ISO 8859-2 ;-). The only general objection I have is that printers sometimes have problems with national chars. (I have some difficulties printing ISO 8859-2 characters under Mozilla/Linux on a PostScript printer, but I'm able to work around this.) As for URLs, if you need nice-looking syntax (e.g. in promotional materials) you can always create an ASCII-only redirect page and promote that URL. It would probably be even better if the article is under an ASCII-only title, and redirects are made from the real name. -- (WT-en) JanSlupski 08:49, 17 Jan 2005 (EST)


 * Just FYI, Wikivoyage is both edited and published in Unicode. There's no technical reason for limiting ourselves to the ISO 8859-2 charset. Maybe that's beside the point and you all realize that already? -- (WT-en) Mark 08:59, 17 Jan 2005 (EST)


 * ISO 8859-1 is the default character set of HTML pages, while UTF-8 isn't. I've seen a number of severely mangled attempts to use Unicode (in UTF-8 or various other encodings, often corrupted) on Wikipedia, hence my hesitation. I also don't think it adds much value to start using names like Tōkyō or Hà Nội, where the diacritics are not strictly necessary. (WT-en) Jpatokal 10:13, 17 Jan 2005 (EST)


 * I've had a look at English Wikipedia, and it appears that they do use ISO 8859-1, but Wikivoyage uses Unicode (in practice, yes, UTF-8). Wikivoyage quite clearly does not mangle UTF-8, but I'm very willing to believe that Wikipedia does. So again, I think there is no technical reason whatever to limit ourselves to any variation on ISO 8859. -- (WT-en) Mark 04:50, 18 Jan 2005 (EST)


 * The problem is not MediaWiki, the problem is users and browsers. A user can enter Latin-1 and be fairly sure it'll pass through unscathed, but outside the basic 8 bits things get iffy.


 * But I'm being conservative here. If we want to adopt UTF-8 wholesale in titles &mdash; and obviously this is more or less workable, because eg. the Japanese Wikipedia uses nothing but &mdash; then it's fine with me. (WT-en) Jpatokal 07:00, 18 Jan 2005 (EST)

Moved discussion
I moved this discussion from Project:Foreign-language names since Project:Foreign-language names is about doing this:


 * Mexico City (Spanish: Mexico)

...and not about accented chars in article names.

In addition, I've tried to rationalize the foreign-language characters part of the article naming conventions with the most-common-English-name part. I think the first section is clear: use the most common English name, or in the absence of any English name, take the most common local name and romanize it. --(WT-en) Evan 08:00, 18 Jan 2005 (EST)

Continuum of anglicization
I was thinking about anglicization and came up with this diagram:

Comments (or additions!) welcome. --(WT-en) Evan 08:16, 18 Jan 2005 (EST)


 * Helsingør was discussed on the Elsinore discussion page. And ø is not an accent, neither are æ or å. I'd better make a redirect for Helsingør. -- (WT-en) elgaard 14:31, 18 Jan 2005 (EST)
 * I understand and like the general idea, but this is hard to codify; I, for example, would say that the English name of São Paulo is São Paulo, as I can't see how this is different from Malmö being Malmö. A name written in Roman letters cannot be romanized, and "English name" should only be applied in the comparatively rare case when the place in question has a distinctly different and widely used name in English, eg. Copenhagen for København.


 * How about "any name written in Latin letters is used as is, complete with diacritics, unless the place already has a widely used alternative English name"? (WT-en) Jpatokal 09:05, 18 Jan 2005 (EST)


 * I don't think that's right. I think we should first exhaust options for English names, then, as a last resort, use romanized versions of local names. Places that don't have some name in English are very, very rare -- a quick review of the Getty Thesaurus of Geographic Names usually comes up with at least something.


 * The difference between Malmö and Sao Paulo is that the unaccented version of the Brazilian city's name is widespread -- in fact, it seems more popular than the accented version for English. But the unaccented version of the Swedish one ("Malmoe") is less widely used than the accented version. But, yes, understood: there's nothing inherent in either name that could tell you algorithmically whether or not to use the accented version.


 * See Talk:Sao Paulo for a discussion on "Sao Paulo" versus "São Paulo". I, too, was originally inclined to think of "São Paulo" as the most common English name, but some (admittedly light) research seems to point towards the unaccented version. We don't use "Sao Paulo" just because it's unaccented; we use it because it's the most common (we think) English name. It's definitely not ubiquitous, though -- one of those difficult cases. --(WT-en) Evan 13:46, 18 Jan 2005 (EST)


 * I believe that diacritics should not appear in article names. For example: Whakatane is more correctly written as Whakatāne in Māori, and in some English texts, although Whakatäne is also considered acceptable. However, in common English the ā or ä is normally written as a, mostly because (US English) keyboards (and previously typewriters) do not have ā or ä keys (though German ones might). While it is possible to copy the ā or ä, it is far easier to simply type a. If you cannot read the ā character because your computer does not have the macron characters loaded, or you do not know the keyboard combination to type the character, then I think this challenge highlights my concern about having diacritics in article names. If someone had to search for the placename, would they necessarily put the diacritics in? Would you? Would other Wikivoyage readers? If the answer to those questions is Yes, Definitely, then use diacritics. However, I do not believe that most, or even many, English readers will use diacritics; instead they will simply type the letters a to z without them. Finally, let me be clear: I have no objection to the placename with diacritics being the first (or any) word of the article. I just believe that the article name should not have them. If you want diacritics in article names, then make that name a redirect to the article name without diacritics. This also gives flexibility, as links work with and without diacritics. -- (WT-en) Huttite 06:13, 20 Jan 2005 (EST)


 * Getty gives Whakatane as the preferred English spelling.


 * That is the common New Zealand spelling too. However allowing diacritics potentially permits articles to be created in their native language form -- (WT-en) Huttite 07:16, 25 Jan 2005 (EST)


 * Yes, but we ask that the article names be the most common English name, not the native language name. --(WT-en) Evan 11:40, 25 Jan 2005 (EST)


 * I don't think that English's relative dearth of accented characters derives from their absence on our Druid forefathers' oaken keyboards. It's more likely, I believe, that causation is in the other direction and that keyboards follow English usage. That all said, it's a mistake to think that English does not use accented characters; risqué, résumé, and café all come immediately to mind. I'm no linguist, but as I understand it, words often enter English from other languages, maintain their foreign characters for a while, and eventually the "edges" get smoothed down with use. And, often, the words that enter our language this way are place names.


 * So, I think we should use the most common English name for a place, regardless of whether it has accented characters or not. --(WT-en) Evan 08:04, 20 Jan 2005 (EST)

Here's a simple solution: When in doubt, use Getty's English-Preferred name. This also resolves, among other things, the long-running Bombay-or-Mumbai debate in favor of Mumbai. Incidentally:


 * Are there diacritics (accent marks) in the vocabulary data?
 * Names and other information in the vocabularies may include dozens of different diacritics. However, not all diacritics may be viewed on the Web. Diacritics that are outside the Latin 1 character set are suppressed in the Web versions of the Getty vocabularies. These diacritics are expressed by codes in the licensed data files.

Neener neener. =) (WT-en) Jpatokal 08:58, 20 Jan 2005 (EST)


 * Can we please come to some sort of conclusion on this already? As Project:Article naming conventions now says accented chars are OK, I will hereby start using my own proposal above until convinced of the error of my ways. (WT-en) Jpatokal 03:37, 25 Jan 2005 (EST)


 * I have had a nasty little experience with an article named Großglockner Hochalpenstraße which displayed as GroÃŸglockner HochalpenstraÃŸe for both the title and the title's text in the article. (See Project:Bug reports 1.3.5) This error makes me wary of using diacritics for technical reasons. I would suggest that article names containing diacritics redirect to article names without them. By all means have diacritics in the article text but the article title generally should not, (unless it is a redirect article). Also having 2 alternative spellings means google will index the article under both alternatives. -- (WT-en) Huttite 07:16, 25 Jan 2005 (EST)
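
The mangled title quoted above has the look of UTF-8 bytes being read back as Windows-1252. That diagnosis is an assumption made here for illustration; the actual cause would be in the bug report. A small Python sketch reproduces the same garbling:

```python
original = "Großglockner Hochalpenstraße"

# Encode the title as UTF-8, then (wrongly) decode those bytes as
# Windows-1252. Each ß becomes two bytes, C3 9F, and each byte is
# shown as its own character.
mangled = original.encode("utf-8").decode("cp1252")
print(mangled)

# The damage is reversible as long as no bytes were dropped on the way.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

If that is what happened, the fault lies in a charset mismatch somewhere between browser and server rather than in the diacritics themselves.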


 * The conclusion is: most common English name, like it's always been. Getty's English or English-P name is probably usually a good starting point, but not definitive. If there is no English name, use a Romanized version of the most common local-language name. --(WT-en) Evan 07:44, 25 Jan 2005 (EST)


 * No, no! Up to now we have considered the name Łódź as already romanized (as in the table above). My understanding is that while using Łódź in article content can be allowed, the article should be named Lodz, with a possible redirect Łódź &rarr; Lodz. -- (WT-en) Jan Słupski 07:53, 25 Jan 2005 (EST)


 * Webster's gives Lodz, Getty gives Lódz. I think one of these two is the most common English name. I also think that we should refer to places in articles in the same language that we use for linking. --(WT-en) Evan 11:30, 25 Jan 2005 (EST)


 * Hmmm. But, as far as I understand, the Getty version is not in fact an English name. It's the result of stripping anything outside ISO 8859-1 from the original name (and it's done because the place has no English name!). I guess that this is done to support pre-UTF-8 browsers, etc. But I can hardly imagine that anybody would ever enter Lódz into a search box. --(WT-en) Jan Słupski 15:39, 25 Jan 2005 (EST)


 * You're mistaken. The so-called "Getty name", if I follow Jpatokal's suggestion correctly, is the name that the Getty Thesaurus of Geographic Names gives as the "English" or "English-Preferred" name for a place.


 * GToGN says: Note that, where there is an English name for any place, the Preferred English name is flagged as "English-P". (...) Note that for over 90% of the geographical places in the world, there is no English equivalent; English speakers use the vernacular name for these places. Also: Names and other information in the vocabularies may include dozens of different diacritics. However, not all diacritics may be viewed on the Web. Diacritics that are outside the Latin 1 character set are suppressed in the Web versions of the Getty vocabularies. These diacritics are expressed by codes in the licensed data files. So, back to the example: Lódz is not the English name (it's not flagged "English-P"); it is only the so-called Web name. In the licensed data it would appear in its full form, Łódź. I don't see any point in using the strange form Lódz. Of course I do not object to using the "English-P" form if one exists. But I suspect it would never use accented characters, as they are simply not part of the English language. -- (WT-en) JanSlupski 20:21, 25 Jan 2005 (EST)
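
How a form like Lódz can arise from Łódź is easy to sketch. Getty's actual procedure is not described in the quoted text, so everything below is an assumption: the function name `suppress_outside` and the small `FALLBACK` table are invented for illustration. The idea is to keep every character the target charset can hold and reduce the rest to a base letter.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping.
# Illustrative only, not an exhaustive table.
FALLBACK = {"Ł": "L", "ł": "l", "Ø": "O", "ø": "o", "Đ": "D", "đ": "d"}


def suppress_outside(name: str, encoding: str) -> str:
    """Hypothetical helper: keep characters that `encoding` can represent,
    and replace the others with their base letter."""
    out = []
    for ch in name:
        try:
            ch.encode(encoding)
            out.append(ch)  # representable, keep as is
            continue
        except UnicodeEncodeError:
            pass
        if ch in FALLBACK:
            out.append(FALLBACK[ch])
        else:
            # Decompose and drop the combining marks (e.g. ź becomes z).
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


print(suppress_outside("Łódź", "latin-1"))  # Lódz
print(suppress_outside("Łódź", "ascii"))    # Lodz
```

The ó survives the Latin-1 pass only because Latin-1 happens to contain it, which is why the result is a hybrid rather than either the Polish or the plain ASCII spelling.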


 * One more "No, no!" from me. The problem with "most common English name" is that there is no way of determining what it is (unless you're Evan). You yourself now list several choices in the table above, but the article can only have one name. We must figure out some consistent, neutral, mutually agreeable way of determining this, and I've proposed two: 1) as in Wikipedia, or 2) as in Getty. (Google isn't really an option because it doesn't handle diacritics or multiple languages very well.) Can we choose one, or do you have better ideas? (WT-en) Jpatokal 11:51, 25 Jan 2005 (EST)


 * In the past, we've taken into account a lot of different criteria: reference material, recent media citations, past media citation, Google popularity, other guidebooks. See Talk:Myanmar and Talk:Bombay for examples. I realize it's a hard process, but it's only so for a microscopic minority of articles. For most (Germany, Greece, Zimbabwe, Saint Petersburg) we've had little or no problem -- the issue hasn't even come up.


 * I think the Getty thesaurus is a very good starting point. I'm wary of the fact that their criteria for choosing names don't match ours (familiarity for contributors and readers), though. I also don't think it's a good idea to cede our decision-making powers to anyone else. There doesn't seem to be much point in offloading the discussion to Wikipedia, for example: there's just a bunch of contributors over there, arguing the points the same as we do. Why don't we start an article on Project:Finding the most common English name, and come up with some good sources and guidelines? --(WT-en) Evan 18:33, 25 Jan 2005 (EST)

Another proposition
OK, one more proposal: --(WT-en) Jan Słupski 15:29, 25 Jan 2005 (EST), modified -- (WT-en) JanSlupski 20:27, 25 Jan 2005 (EST)
 * &sup1;a city has an English name if there is an entry in the Getty Thesaurus of Geographic Names flagged English-P
 * article name never uses any diacritics. Only pure lower ASCII.
 * of course, in any case article content can (and should) contain original name, probably in header (before TOC)
 * there can be any redirects from original, non-romanized or with-diacritics names to the article