Sunday, June 17, 2012

How to choose which language to show a web site visitor

How should we automatically decide which language of our site to show a visitor?  Section 14.4 (Accept-Language) of the HTTP/1.1 spec tells us exactly how to make this decision.  I've discussed this in detail, with PHP code examples, elsewhere on this blog, but in this post I'd like to provide a more readable executive summary.

THE STRATEGY IN A NUTSHELL

1.  We are supposed to indicate the quality (0 to 1) of each of our languages offered.
2.  The browser is supposed to tell us the user's language preferences with strengths (0 to 1) of each, and the browser can use a "*" to specify a preference strength for all other languages.  Any language not specified gets a preference strength of 0.
3.  For each of our offered languages, we search the visitor's preferences for the right strength to multiply by our language quality.  We assign to our offered language the strength of their longest matching preference.
4.  The offered language with the highest composite score (our quality times its assigned strength) wins, and we show it to the visitor.
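
As a rough illustration of step 2, here is a minimal Python sketch of reading the visitor's preferences. In the real header each strength is written as a q parameter (for example en;q=0.9), and a missing q means 1.0; the examples in this post shorten that to en;0.9. The function name parse_accept_language and the handling of malformed values are my own assumptions, not something the spec prescribes.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into {language range: strength}."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        # Each item looks like "en-gb" or "en;q=0.8".
        lang, _, params = item.partition(";")
        lang = lang.strip().lower()
        if not lang:
            continue
        strength = 1.0  # a missing q parameter means full strength
        params = params.strip()
        if params.lower().startswith("q="):
            try:
                strength = float(params[2:])
            except ValueError:
                # Assumption: treat an unreadable q value as "not wanted".
                strength = 0.0
        # Clamp to the 0..1 range the spec allows.
        prefs[lang] = max(0.0, min(1.0, strength))
    return prefs


prefs = parse_accept_language("en-gb, en;q=0.9, es;q=0.5, *;q=0.1")
```

Languages that do not appear in the resulting dictionary (and are not covered by "*") are treated as having a strength of 0.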

HOW WE FIND THE MATCH FOR EACH OF OUR OFFERINGS

Each of the languages we offer has an ISO name like en, en-gb, es-ar, etc. For each of our languages, we consider the full ISO name and its two-letter base (en, en, and es for the names above). Then, for each of the visitor's language preferences, we determine whether it matches either our ISO name or our base, and if it does, we record it. The strength of the longest matching preference gets assigned to our offering, as in the following examples:

Our offering: en-us;1.0
Their preferences: en-gb;1.0,es;0.5,fr;0.3
Strength assigned: 0. None of their preferences matches our offering or its base, so we must assign 0.
Note that this seems wrong, but the HTTP spec (see Note 1 below) says it's the browser's responsibility to set or suggest the base language (en) when the user selects a specific language variant (en-gb).

Our offering: en-us;1.0
Their preferences: en-gb;1.0,en;0.9,en-us;0.8,es;0.5,fr;0.3
Strength assigned: 0.8. Their en and en-us preferences match our offering, and their en-us is the longest match.


Our offering: en;1.0
Their preferences: en-gb;1.0,es;0.5,fr;0.3
Strength assigned: 0. None of their preferences matches our offering or its base, so we must assign 0.  Again this seems wrong, but the HTTP spec (see Note 1 below) says it's the browser's responsibility to set or suggest the base language (en) when the user selects a specific language variant (en-gb).


Our offering: en-us;1.0
Their preferences: en;1.0,es;0.5,fr;0.3
Strength assigned: 1.0. Their en preference matches our base, so our en-us gets a strength of 1.0.

Our offering: en-us;1.0
Their preferences: en-gb;1.0,es;0.5,fr;0.3,*;0.5
Strength assigned: 0.5. None of their named preferences matches our en-us or its base, so their * preference applies.
 
Our offering: fr;0.8
Their preferences: en-gb;1.0,es;0.5,fr;0.3,*;0.5
Strength assigned: 0.3. Their fr preference matches our fr, so it takes precedence over their * preference.

We then multiply the assigned strength by our quality to get a composite score for each offering.  Whichever offered language has the highest composite score wins; it is the language we serve.
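
The matching and scoring rules above can be sketched in a few lines of Python. This is only an illustration under the simplifications used in this post (a preference matches our full name or our two-letter base). The names assigned_strength and choose_language are hypothetical, and returning None when nothing scores above 0 is my own assumption about how a caller would fall back to a site default.

```python
def assigned_strength(offering, prefs):
    """Strength of the visitor's longest preference matching our offering."""
    offering = offering.lower()
    base = offering.split("-")[0]
    # A preference matches if it equals our full name or our base.
    matches = [p for p in prefs if p in (offering, base)]
    if matches:
        return prefs[max(matches, key=len)]
    # No named match: use the "*" strength if given, otherwise 0.
    return prefs.get("*", 0.0)


def choose_language(offerings, prefs):
    """Return the offering with the highest quality times strength."""
    scores = {tag: quality * assigned_strength(tag, prefs)
              for tag, quality in offerings.items()}
    best = max(scores, key=scores.get)
    # Assumption: None tells the caller to fall back to a site default.
    return best if scores[best] > 0 else None


their_prefs = {"en-gb": 1.0, "es": 0.5, "fr": 0.3, "*": 0.5}
our_offerings = {"en-us": 1.0, "fr": 0.8}
winner = choose_language(our_offerings, their_prefs)
```

With these inputs, en-us is assigned 0.5 through the * preference and fr is assigned 0.3, so en-us has the higher composite score and is the language served.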

SECOND-GUESSING THE SPEC

If we don't trust browsers to help users express the preferences they really want, we can deviate from the spec: for example, we can let the base of their preference match the base of our offering, perhaps with a reduced strength in recognition of the fact that we are second-guessing the spec.
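
One way such a fallback might look, as a Python sketch. The function name assigned_strength_lenient and the penalty factor of 0.5 are arbitrary choices for illustration, not values taken from the spec. Letting the base-to-base fallback take precedence over the "*" preference is also a design decision you may want to make differently.

```python
def assigned_strength_lenient(offering, prefs, penalty=0.5):
    """Like the spec's matching, plus an off-spec base-to-base fallback."""
    offering = offering.lower()
    base = offering.split("-")[0]
    # Spec-style match first: their preference equals our name or our base.
    matches = [p for p in prefs if p in (offering, base)]
    if matches:
        return prefs[max(matches, key=len)]
    # Off-spec fallback: their en-gb may stand in for our en or en-us,
    # at a reduced strength because we are second-guessing the visitor.
    related = [strength for pref, strength in prefs.items()
               if pref != "*" and pref.split("-")[0] == base]
    if related:
        return penalty * max(related)
    return prefs.get("*", 0.0)


their_prefs = {"en-gb": 1.0, "es": 0.5, "fr": 0.3}
lenient = assigned_strength_lenient("en-us", their_prefs)
```

Here our en-us is assigned 0.5 instead of 0, because their en-gb shares our base and the penalty halves its strength of 1.0.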

TERMS USED IN THE SPEC

The following table shows how the terminology in this post compares with the HTTP spec terminology:

THIS POST | HTTP SPEC | EXAMPLE | MEANING
(language) preference | range | en-us, es | An ISO language abbreviation (en-us, es, etc.) for a visitor's language preference.
(language) offering | tag | en, es-ar | An ISO language abbreviation (en, es-ar, etc.) for a language we offer.
base | prefix | en, es | The part of a language abbreviation that comes before the "-". In ISO practice, the first two letters of a language abbreviation.
match | match | no (en-us not equal to en), yes (es equal to base of es-ar) | A preference matches an offering if it (in its entirety) equals the offering or the base of the offering.
strength | quality | 0.5, 1.0 (en-us 0.5 and es 1.0) | An indication of the strength of a visitor's language preference.
quality | quality | 1.0, 0.8 (en 100% and es-ar 80%) | An indication of the relative quality of the offered language at our web site.
assigned | assigned | 0, 1.0 (Yes, their en-us is not supposed to match our en.  See note 1 below.) | The strength assigned to each of our offerings.

en | en | match | Yes. Assign their quality.
es | es-ar | match | Yes. Assign their quality.

Note 1 (quoted from the HTTP Spec):
Note: When making the choice of linguistic preference available to the user, we remind implementors of the fact that users are not familiar with the details of language matching as described above, and should provide appropriate guidance. As an example, users might assume that on selecting "en-gb", they will be served any kind of English document if British English is not available. A user agent might suggest in such a case to add "en" to get the best matching behavior.
