Re: i18n: should cf-cli strings be exact duplicates of en-US translations?


John Feminella <jxf@...>
 

Thanks, that was super helpful. On a similar project, I overcame some of
the difficulties you described with a small shell script and use of `jq`.
Essentially, I got around the i18n4go problem by adding a second file
that was simply an old-English-key-to-new-semantic-key mapping. The script
then parsed the mapping and made the appropriate substitutions in the
language JSON.

In this scheme one would:

* add the key to be replaced to the mapping
* make the source replacements
* run the script and make the locale.json replacements
* run i18n4go

From the i18n4go perspective, nothing will have changed; it's as if the old
key was never there and things were named after the new key all along.
Repeat this until each key has been replaced.

Eventually, when all the replacements have been made, you delete the
mapping and resume using i18n4go as normal. (The mapping doesn't need to be
committed to source control.)
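
A minimal sketch of that substitution step, written in Python rather than shell and `jq`. It assumes each locale file is a JSON list of `{"id": ..., "translation": ...}` entries, as in the examples elsewhere in this thread. The function name `apply_key_mapping`, the sample mapping, and the key `cli.app.requested_state` are hypothetical.

```python
import json

def apply_key_mapping(entries, mapping):
    """Return a copy of entries with old English ids replaced by new semantic keys.

    entries: list of {"id": ..., "translation": ...} dicts (assumed locale file layout).
    mapping: dict of old English id -> new semantic key (the temporary mapping file).
    """
    renamed = []
    for entry in entries:
        new_entry = dict(entry)
        # Only the id changes; the existing translation is kept untouched.
        new_entry["id"] = mapping.get(entry["id"], entry["id"])
        renamed.append(new_entry)
    return renamed

# Hypothetical mapping and locale content, for illustration only.
mapping = {"requested state": "cli.app.requested_state"}
locale_entries = json.loads("""
[
  {"id": "requested state", "translation": "requested state"},
  {"id": "\\n\\nTIP:\\n", "translation": "\\n\\nTIP:\\n"}
]
""")

updated = apply_key_mapping(locale_entries, mapping)
print(json.dumps(updated, indent=2))
```

Entries whose id is not in the mapping pass through unchanged, so the script can be rerun after each batch of source replacements.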

It sounds like Dies and the CLI team have some more feedback, so I'll
revisit this when they've had a chance to comment.

On Thu, Sep 15, 2016, 20:58 Kris Hicks <khicks(a)pivotal.io> wrote:

The workflow is available here:
https://github.com/cloudfoundry/cli/blob/master/CONTRIBUTING.md#i18n

i18n4go, goi18n, bin/generate-language-resources are the three things that
need to be executed when updating any calls to the i18n.T() function (which
tends to be dot-imported, so it's just T()).

i18n4go digs through the AST to find calls to T(). It assumes you want the
strings that are in your call to T() as your translation key and (English)
value, and modifies the en-us.all.json accordingly. It also has the ability
to detect changes and removals.

goi18n is what fills out the *.translated.json and *.untranslated.json
files with those keys that are present in en-us.all.json but missing in,
for example, fr-fr.all.json. It puts empty values in the .translated files
and the English value in the .untranslated files.

bin/generate-language-resources rebuilds the binary representations of
those JSON files into i18n_resources.go.

To use keys instead of the English values in calls to T(), i18n4go would
need to be changed to just do the add/removal of keys, but leave empty
values in the en-us.all.json file for new entries (rather than adding the
English value as well). The English value would be added manually to that
file at that point.

That in itself isn't too complicated, except for the fact that i18n4go
doesn't have much in the way of tests, so making modifications should be
done with care.

The part that's not so great is that once you start switching to keys, you
need to switch everything to a key at once; otherwise i18n4go will not be
very useful at all. For example, if you switched one call to T() to be a
key instead of English, i18n4go would want to remove the old key and value,
and add the new key with a value that's the same as the key. You'd have to
restore the English value to the new entry. If you were to add a new call
to T(), you'd also get limited usefulness out of i18n4go as it would modify
the en-us.all.json to add the missing entry (which would have the correct
key, but wrong value), but you'd still need to update the value. The
workflow is already not very good; it would just be worse in this scenario.
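
The partial-migration problem can be illustrated with a toy model. This is a sketch of the behaviour described in this message, not i18n4go's actual code: the function `sync_english`, the simplified id-to-value dict, and the key `cli.app.requested_state` are all assumptions made for illustration.

```python
def sync_english(found_strings, english):
    """Mimic the add/remove behaviour described above.

    found_strings: strings passed to T() in the source.
    english: dict of id -> English value (a simplified en-us.all.json).
    New ids get a value identical to the id; ids no longer found are dropped.
    """
    synced = {}
    for text in found_strings:
        # Existing entries keep their value; new ones default to the id itself.
        synced[text] = english.get(text, text)
    return synced

english = {"requested state": "requested state", "instances": "instances"}

# One call site switched from T("requested state") to a hypothetical semantic key.
found = ["cli.app.requested_state", "instances"]

result = sync_english(found, english)
print(result)
```

In this model the old "requested state" entry disappears and the new entry's value is the key itself, so the English text has to be restored by hand.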

Using proper keys also tends to imply an order/hierarchy/convention to the
keys, and that requires a bit of thought to properly model both the
existing strings and new ones.

I haven't been on the CLI team for a number of months now, so some of the
above may have changed.

Cheers,

KH

On Thu, Sep 15, 2016 at 5:18 PM John Feminella <jxf(a)pivotal.io> wrote:

> The tooling around updating translations all assumes a particular
> workflow, which would need to change: i18n4go pores through the source code
> and compares the values it finds to what exists in the English file and
> makes updates in the translation files as necessary, for example.

Thanks for the feedback, Kris. I'm not familiar with the broader
translation workflow on CF. If you have some time offline I'd love to get
your thoughts and understand what you see as the challenges.

On Thu, Sep 15, 2016, 19:16 Kris Hicks <khicks(a)pivotal.io> wrote:

Disclosure: I used to be a developer on the CLI team

I really like the idea of having identifiers rather than English, for
the same reasons that John mentioned.

For strings where a colon existed, it would make sense to include the
translation twice, once with and once without the colon, for the reason Dies
brought up. I think that's fine; usually the difference is not a colon but
some number of newlines in the CLI codebase, like in the "\n\nTIP\n" example.

I had considered taking on this work when I was on the CLI team, but it
seemed like too big a change at the time. The tooling around updating
translations all assumes a particular workflow, which would need to change:
i18n4go pores through the source code and compares the values it finds to
what exists in the English file and makes updates in the translation files
as necessary, for example.

Cheers,

KH

On Thu, Sep 15, 2016 at 3:43 PM John Feminella <jxf(a)pivotal.io> wrote:

hi Dies,

> Can I ask the background of looking into this? Are you looking into
> adding support for another locale and finding the number of messages to
> translate too big?

No, I just thought it was an area that might be worth improving if
others agreed, and I've been involved in a number of i18n efforts on
various products. I am not personally or specifically blocked in any way by
this, though as I mentioned I think it is a beneficial suggestion.

That said, I think there are some areas that are more worth improving
than others (assuming there is agreement to change them at all). For
instance I think that embedding newlines in string keys, as in
"\n\nTIP:\n", could be modified to be more amenable for translators.


> French grammar rules dictate the requirement for a space before the
> colon. So “Hello:” becomes “Bonjour :”.
>
> That could be the background for a number of such instances that look
> like duplications.

I agree, these kinds of cases are sometimes tricky.

In such cases, my understanding is that in the same way that some
locales prefer "15 July" and others prefer "July 15", so too would you use
a locale-specific modifier for the colon. So in the code one might use
something like `T("greeting") + T("locale.separators.colon")` if you wanted
to be maximally correct, where "locale.separators.colon" maps to " :" for
fr-FR and ":" for en-US.
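
A minimal sketch of that lookup, in Python for brevity. The `RESOURCES` table and the `make_translator` helper are hypothetical stand-ins for the real resource files and the CLI's `T()` function; only the two keys and the fr-FR and en-US colon values come from the paragraph above.

```python
# Hypothetical per-locale resources; the keys follow the example above.
RESOURCES = {
    "en-US": {"greeting": "Hello", "locale.separators.colon": ":"},
    "fr-FR": {"greeting": "Bonjour", "locale.separators.colon": " :"},
}

def make_translator(locale):
    """Return a T-like lookup function for one locale."""
    table = RESOURCES[locale]

    def T(key):
        return table[key]

    return T

for locale in ("en-US", "fr-FR"):
    T = make_translator(locale)
    # The call site is identical for every locale; only the resources differ.
    print(locale, T("greeting") + T("locale.separators.colon"))
```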

best,
~ jf


On Thu, Sep 15, 2016, 18:19 Koper, Dies <diesk(a)fast.au.fujitsu.com>
wrote:

Hi John,



Thank you for your interest in the CLI’s internals.



Can I ask the background of looking into this? Are you looking into
adding support for another locale and finding the number of messages to
translate too big?



I’ve asked my team to answer.



There is a caveat with Benefit (3) and Disadvantage (2):

French grammar rules dictate the requirement for a space before the
colon. So “Hello:” becomes “Bonjour :”.

That could be the background for a number of such instances that look
like duplications.



Regards,

Dies Koper
Cloud Foundry Product Manager - CLI





*From:* John Feminella [mailto:jxf(a)pivotal.io]
*Sent:* Thursday, September 15, 2016 7:14 PM
*To:* Discussions about Cloud Foundry projects and the system overall.
*Subject:* [cf-dev] i18n: should cf-cli strings be exact duplicates
of en-US translations?



hi,



I'm interested in gathering the community's thoughts about a proposal
to improve the structure of the i18n translations.



## Issue



Currently the translation files for the CF CLI duplicate both the
identifier and the translation. For example, we have translations similar
to:



# en-US.json
{
  "id": "A required argument for this command is missing",
  "translation": "a required argument for this command is missing"
}



My understanding is that the practice most i18n-enabled projects
follow is that the id is exactly an identifier which conveys some semantic
meaning, rather than literally duplicating one specific language's
translation.



For example, a localization resources file might instead contain
something like:



{
  "id": "cli.errors.missing_argument",
  "translation": "a required argument for this command is missing"
}



## Benefits



The benefits of using the semantic-identifier approach are:



(1) Reduce number of change sites



With the current approach, when a translation that shows up frequently
changes, one needs to update many more locations than is strictly necessary,
instead of just a few (the resources files).



(2) Improve message refactoring potential



Increases the opportunities for potential message refactorings. For
example, a large number of messages include embedded newlines. One example:



{
  "id": "\n\nTIP:\n",
  "translation": "\n\nTIP:\n"
},



This seems bad, because one might want to use the translation for
"TIP:" without needing to have the newlines embedded there.
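
One way to picture the refactoring is to separate layout whitespace from the translatable text. The sketch below is illustrative only: the helper `split_layout` and the key `cli.tip` are hypothetical and not part of the CLI codebase.

```python
def split_layout(message):
    """Split a message into (leading whitespace, text, trailing whitespace)."""
    text = message.strip()
    # For an all-whitespace message, treat everything as leading layout.
    start = message.find(text) if text else len(message)
    leading = message[:start]
    trailing = message[start + len(text):]
    return leading, text, trailing

leading, text, trailing = split_layout("\n\nTIP:\n")
print(repr(leading), repr(text), repr(trailing))

# A call site could then keep the layout in code and translate only the text,
# for example: leading + T("cli.tip") + trailing (hypothetical key).
```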



(3) Reduce number of strings that need to be translated



Many translations that are virtually identical are duplicated
throughout the localization files. For example:



{
  "id": "requested state",
  "translation": "requested state"
},
{
  "id": "requested state:",
  "translation": "requested state:"
},



Under the current approach, if a string differs even by a single
character an entirely new translation is required, even if the semantic
meaning is the same.



With the proposed approach these can instead be merged.
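
A rough way to find merge candidates in an existing file is to group ids that become identical once surrounding whitespace and a trailing colon are removed. This is a sketch under that one normalization rule; the function name `merge_candidates` and the sample ids other than those quoted in this message are hypothetical.

```python
from collections import defaultdict

def merge_candidates(ids):
    """Group ids that match once surrounding whitespace and a trailing colon are removed."""
    groups = defaultdict(list)
    for message_id in ids:
        normalized = message_id.strip().rstrip(":").rstrip()
        groups[normalized].append(message_id)
    # Only groups with more than one member are candidates for merging.
    return {key: members for key, members in groups.items() if len(members) > 1}

ids = ["requested state", "requested state:", "instances", "\n\nTIP:\n"]
candidates = merge_candidates(ids)
print(candidates)
```

Each group would still need a review per locale before merging, since punctuation rules can differ between languages.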



(4) Clearer intent



The intent of the message is clearer because it is explicitly called out
in the identifier. Under the current approach it is not, so if a message
changes it's less clear what set of purposes it has.



## Disadvantages



(1) Must look at two places to determine correct translation



The main disadvantage of this approach is that a translator must look
at both the source and destination files to determine the correct
translation. For instance, someone translating from en-US to fr-FR has to
have both the en-US and fr-FR files open.



However, in general I would think one usually doesn't want a *literal* translation
("translate the string 'missing a required argument'"), but rather a
*semantic* translation ("write an error message indicating that the
command is missing a required argument"), so this would encourage better
translations overall.



(2) Theoretically possible to have wrong translations



In a language where the translations for something like "hello" and
"hello:" were different, it would not be correct to merge these and
differentiate with `T("greetings.hello")` vs. `T("greetings.hello") + ":"`.
I've looked at the existing translations and I don't believe that any such
cases currently exist, although I could be mistaken.



## Summary



Overall I believe this adds a lot of tangible benefits and reduces
cognitive overhead. I'm interested in the community's thoughts. If we agree
the current structure is not ideal, I would be willing to make a PR to
implement the proposal.



best,

~ jf

--

John Feminella
Advisory Platform Architect
✉ · jxf(a)pivotal.io
t · @jxxf

