Friday, October 9, 2009

Visitor Language Detection by the Book: Assigning Language Quality


How do we choose which language at our site to show to a visitor?
My server-side language of the moment is PHP. But this post will be about the general business rules of detecting a visitor's language preferences and assigning them a language at your web site. Part 2 gives the actual PHP coding for the detection.


I based my original research on the phpGedView open source language detection code and section 14.4 (Accept-Language) of the HTTP/1.1 spec. I subsequently (a little late!) looked at Urbano Alvarez's solution to see if I might have done a better job and perhaps saved myself some work. Urbano's solution is very complete, and beautifully presented. I recommend it and hope not to detract. Where I think some improvement can be made is in assigning a visitor language according to the HTTP/1.1 spec.


So, what's the "right" way to assign a visitor language? Easy. Just use their HTTP Accept-Language header to assign a composite "quality" to each language you serve. Then serve the visitor the winning language.


And what's "quality"? On the visitor's side, it's the quality of their reading (their preference) of a given language (called a "range") from q=0.000 to q=1.000 (rounding is okay). On your side, it's the quality of your presentation of a given language (called a "tag") from q=0.000 to q=1.000 (rounding is okay). The composite quality of each of your languages (tags) is your visitor's quality multiplied by your language's quality.


The above explains two tools that are being ignored in the other language detection solutions I have seen. First, you are able to indicate the quality of each "translation" of your site. For example, "es,en;q=0.8" at Urbano's site and "en,es;q=0.6" at my site. Second, you are able to combine that quality with the client's preference strength (quality) to arrive at a winning assignment for the visitor.


And how do you assign their quality to your languages? That's the complicated part. I'll try to make it easy with examples.


The Accept-Language header looks like this:


or this

The first example means the reader (client) reads en-gb perfectly well (100%), en-us at 90%, all other en at 80%, etc. All other languages (ranges) get the default quality of 0%.
The second example assigns the default quality of 10% to all other languages.
If the Accept-Language header is missing, all languages (ranges) are 100% quality. In other words, serve whatever you like. The Google bot, for example, might send a header like "*;q=1", or simply no header, to get a site's best language.
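
To make the header rules concrete, here is a minimal parsing sketch in Python (not the PHP of Part 2). The function name parse_accept_language and the sample headers are my own illustrations, not taken from any library or from the examples above. It assumes a range with no q parameter counts as q=1, and that a malformed q value should be treated as 0.

```python
def parse_accept_language(header):
    """Return a dict mapping each lower-cased language range to its quality.

    A missing header (None) is returned as {"*": 1.0}: every language is
    fully acceptable, so the site may serve whatever it likes.
    """
    if header is None:
        return {"*": 1.0}
    ranges = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        quality = 1.0  # a range without a q parameter counts as q=1
        for param in params.split(";"):
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # assumption: malformed q means "not acceptable"
        # Clamp to the 0..1 interval that qualities are defined on.
        ranges[name.strip().lower()] = min(max(quality, 0.0), 1.0)
    return ranges


# Hypothetical headers, for illustration only.
print(parse_accept_language("en-gb,en-us;q=0.9,en;q=0.8"))
print(parse_accept_language("en-gb,en-us;q=0.9,en;q=0.8,*;q=0.1"))
```

Returning {"*": 1.0} for a missing header is a design choice that lets later steps treat "no header" and "*;q=1" the same way.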


We compare a given browser language range with a given site language tag. A range matches a tag if the range exactly (in its entirety!) equals the tag, or equals a prefix of the tag that is followed by "-". If several ranges match the same tag, only the longest one counts, and we assign its quality to our language tag.
Examples (assuming we serve en, en-us, and es-ar):

Their range | Our tag | Assign their quality?
en-us | en | No. Their entire range doesn't match our tag or its prefix.
en-cockney | en | No. Their entire range doesn't match our tag or its prefix.
en-cockney | en-us | No. Their entire range doesn't match our tag or its prefix.
en | en-us | Not in this case. This is a match, but it isn't the longest match we found. Their en-us preference range is a longer match for this tag.
en-us | en-us | Yes. Assign their quality.
en | en | Yes. Assign their quality.
es | es-ar | Yes. Assign their quality.
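
The examples above can be checked with a short Python sketch of the matching rule. The helper names range_matches_tag and best_matching_range are hypothetical, chosen for this illustration. The sketch assumes a prefix only counts when it ends at a "-" boundary, as the spec describes.

```python
def range_matches_tag(language_range, tag):
    """True if the whole range equals the tag, or a prefix of it ending at "-"."""
    language_range, tag = language_range.lower(), tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")


def best_matching_range(tag, ranges):
    """Return the longest client range that matches our tag, or None."""
    candidates = [r for r in ranges if range_matches_tag(r, tag)]
    return max(candidates, key=len, default=None)


# Client sent en and en-us; we serve en, en-us and es-ar.
client_ranges = ["en", "en-us"]
for our_tag in ("en", "en-us", "es-ar"):
    print(our_tag, "<-", best_matching_range(our_tag, client_ranges))
```

Comparing against the range plus "-" (rather than a bare startswith) keeps a range such as "e" or "en-u" from matching en-us by accident.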

About this, the specification notes:

Note: This use of a prefix matching rule does not imply that
language tags are assigned to languages in such a way that it is
always true that if a user understands a language with a certain
tag, then this user will also understand all languages with tags
for which this tag is a prefix. The prefix rule simply allows the
use of prefix tags if this is the case.

So, the specification explicitly denies assigning "related" languages to users who ask for a language with a certain tag. In other words, if a user asks for es-ar, the specification denies assigning her es-mx or even generic es.


We assign to each of our language tags the client quality of the longest matching range. If there's no matching range for a tag, we assign it the default quality: 1 if there's no Accept-Language header at all, the quality of the "*" range if the header has one, or 0 otherwise.
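
As a sketch of this assignment step in Python: the function name client_quality is hypothetical, and it assumes the header has already been parsed into a dict of lower-cased range to quality, with None standing for a missing header.

```python
def client_quality(tag, ranges):
    """Quality the client assigns to one of our tags.

    ranges: dict of lower-cased range -> quality, or None if the client
    sent no Accept-Language header (an assumption of this sketch).
    """
    if ranges is None:
        return 1.0  # no header: every language is fully acceptable
    tag = tag.lower()
    matches = [
        r for r in ranges
        if r != "*" and (tag == r or tag.startswith(r + "-"))
    ]
    if matches:
        return ranges[max(matches, key=len)]  # longest matching range wins
    return ranges.get("*", 0.0)  # the "*" quality if sent, else 0


sent = {"en-us": 0.9, "en": 0.8}
for our_tag in ("en", "en-us", "es-ar"):
    print(our_tag, client_quality(our_tag, sent))
```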


We combine quality by multiplying the assigned client quality of each of our language tags by our delivery quality. Our winning language is the one with the highest product, and it is the one we serve by default with no questions asked. As Urbano's script shows, we can then allow the user to pick a different language as an override.
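
A minimal Python sketch of the combination step follows. The name pick_language and the visitor numbers are made up for illustration; the site qualities reuse the "en,es;q=0.6" example from earlier in this post. On a tie, this sketch simply keeps whichever tag the site listed first, which is my assumption and not something the spec dictates.

```python
def pick_language(client_q, site_q):
    """Multiply client quality by site quality per tag and pick the best.

    client_q: tag -> quality already assigned from the client's header
    site_q:   tag -> quality of our own presentation of that language
    """
    composite = {tag: client_q.get(tag, 0.0) * q for tag, q in site_q.items()}
    winner = max(composite, key=composite.get)  # first listed tag wins ties
    return winner, composite


site = {"en": 1.0, "es": 0.6}      # "en,es;q=0.6"
visitor = {"en": 0.5, "es": 1.0}   # hypothetical assigned client qualities
winner, scores = pick_language(visitor, site)
print(winner, scores)
```

Here the visitor's strong preference for es outweighs the site's better en presentation (0.6 against 0.5), so es is served by default.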


That's all by the book. But it doesn't seem reasonable to deny es to an inattentive user who would accept the prefix language without saying so. (They say es-ar, but really, really also prefer es to en.) (Or, on the negative side, they say they hate en-cockney, but really hate all forms of English.) In real-world usage, though, the negative case is likely a sign of genuine user intent and attention, and probably doesn't apply to all flavors of the language. A deliberate user in that situation would naturally assume generic en would NOT be included in his non-acceptance of en-cockney. To solve that, we are on our own. I have toyed with three methods:

A. We can call es a pseudo-matching tag for the es-ar range, by calling it a match with a length of one, not two characters. That way, if the user also sent a preference for the es range, that real match overrides our non-compliant effort to be helpful. We follow the spirit of the specification while allowing for human nature and giving es a chance to fill in for es-ar. The disadvantage of this method is that it just as readily assigns a low quality as a high quality. So if a user expressed a negative (low) preference (quality) for, say, en-cockney, we end up assigning that low quality to our generic en if cockney is the only flavor of en she sent a quality for. Then en cannot win.

B. We can adjust the quality of the es-ar range downward (never upward, given how human nature works) to the average of the es-ar range quality and the browser default range quality, and assign that adjusted quality to es. That way the language has half a leg up (not down) on any random language. The disadvantage of this method is that, by following the spirit of the spec, we probably end up assigning a lower quality than the user really intended. Then es may not win as desired.

C. I suggest calling the match a pseudo-match and assigning the unadjusted quality of the range to your tag, but only as long as that quality is above 0.5. That way high qualities are preserved, better matches can override, and targeted negative (low) preferences don't proliferate.
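
Here is one way method C could look in Python. This is a sketch under my own assumptions, not a definitive implementation: the name client_quality_method_c is hypothetical, the header is assumed to be parsed into a dict of lower-cased range to quality, and when several longer ranges qualify (say es-ar and es-mx) it takes the highest of their qualities, which the description above does not specify.

```python
def client_quality_method_c(tag, ranges, threshold=0.5):
    """Client quality for our tag, with method C's pseudo-match fallback."""
    tag = tag.lower()
    # 1. Real matches by the book: the longest matching range wins.
    real = [
        r for r in ranges
        if r != "*" and (tag == r or tag.startswith(r + "-"))
    ]
    if real:
        return ranges[max(real, key=len)]
    # 2. Pseudo-match: our tag is a prefix of a range the client sent
    #    (our es against their es-ar), kept only if its quality is above 0.5.
    pseudo = [
        q for r, q in ranges.items()
        if r.startswith(tag + "-") and q > threshold
    ]
    if pseudo:
        return max(pseudo)  # assumption: take the best qualifying range
    # 3. Otherwise the usual default: the "*" quality if sent, else 0.
    return ranges.get("*", 0.0)


print(client_quality_method_c("es", {"es-ar": 1.0, "en": 0.7}))       # pseudo-match
print(client_quality_method_c("en", {"en-cockney": 0.1, "es": 1.0}))  # low q ignored
print(client_quality_method_c("es", {"es-ar": 1.0, "es": 0.3}))       # real match overrides
```

Because real matches are checked first, an explicit es range from the client always overrides the pseudo-match, and the low en-cockney preference never leaks onto generic en.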

It may help educate users with a language label like this:
"English quality 80%" with tip "Language en, site quality=1.000, user preference (as reported by browser)=0.800"
"Español calidad 80%" with tip "Idioma es, calidad del sitio=0.800, preferencia del usuario (como lo informó el navegador)=1.000"
