How do you convert character case in UNIX accurately? (assuming i18N) - internationalization

I'm trying to get a feel for how to manipulate characters and character sets in UNIX accurately given the existence of differing locales, and to do so without requiring special tools beyond the standard UNIX ones.
My research has shown me the problem of the German sharp-s character: one character changes into two - and other problems. Using tr is apparently a very bad idea. The only alternative I see is this:
echo StUfF | perl -n -e 'print lc($_);'
but I'm not certain that will work, and it requires Perl - not a bad requirement necessarily, but a very big hammer...
What about awk and grep and sed and ...? That, more or less, is my question: how can I be sure that text will be lower-cased in every locale?

Perl's lc/uc work fine for most languages, but they don't handle Turkish correctly; see this bug report of mine for details. If you don't need to worry about Turkish, Perl is good to go.
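To make the sharp-s and Turkish cases concrete, here is a minimal sketch in Python (used purely as an illustration language, it is not one of the tools discussed above). Python's str.upper() and str.lower() apply Unicode's locale-independent mappings, so the sharp s expands to two letters, but the Turkish dotted/dotless i pairing is not applied. The turkish_lower and turkish_upper helpers are hypothetical names made up for this example, not part of any library.

```python
# Code points that Turkish pairs differently from the default Unicode mapping.
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def turkish_lower(text):
    """Lower-case text using the Turkish pairing: dotted I -> i, I -> dotless i."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


def turkish_upper(text):
    """Upper-case text using the Turkish pairing: i -> dotted I, dotless i -> I."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


# The German sharp s: one character becomes two when upper-cased.
print("Stra\u00dfe".upper())            # STRASSE

# The default mapping ignores Turkish rules: plain I lower-cases to dotted i.
print("I".lower())                      # i

# The hypothetical helpers apply the Turkish pairing explicitly.
print(turkish_lower("ISPARTA \u0130L"))
print(turkish_upper("\u0131sparta il"))
```

The point of the sketch is that the caller has to know the text is Turkish; nothing in the string itself says so, which is why no tool can lower-case correctly "in every locale" without being told the language.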

You can't be sure that text will be correct in every locale. That isn't possible: software libraries always have some errors in their implementation of i18n-related stuff.
If you're not afraid of using C++ or Java, you may take a look at ICU, which implements a broad set of collation, normalization, and similar rules.

Related

Automatic gettext translation generator for testing (pseudolocalization)

I'm currently in the process of making a site i18n-aware, marking hardcoded strings as translatable.
I wonder if there's any automated tool that would let me browse the site and quickly see which strings are marked and which still aren't. I saw a few projects like django-i18n-helper that try to highlight translated strings using HTML facilities, but this doesn't work well with JavaScript.
So I thought FДЦЖ CУЯILLIC, 𝔅𝔩𝔞𝔠𝔨𝔩𝔢𝔱𝔱𝔢𝔯 or ʇxǝʇ uʍop-ǝpısdn (or something along those lines) should do the trick. Easy to distinguish visually, still readable, yet doesn't depend on any rich text formatting besides Unicode support.
The problem is, I can't find any readily-available tool that'd eat gettext .po/.pot file(s) and spew out such translation. Still, I think the idea is pretty obvious, so there must be something out there, already.
In my case I'm using Python/Django, but I suppose this question applies to anything that uses gettext-compatible library. The only thing the tool should be aware of, is that there could be HTML fragments in translation strings.
The msgfilter program will let you run your translations through any program you want. It works especially well with GNU sed.
For example, to turn all your translations into uppercase (HTML is mostly case-insensitive, so this should work):
msgfilter -i django.po sed -e 's/\(.*\)/\U\1/'
The only strings in your app that have lowercase letters in them would then be the hardcoded ones.
If you really want to do faux cyrillic, you just have to write a program or script that reads Latin and outputs that, and feed that program to msgfilter instead of sed.
If your distribution has a talkfilters package, it might provide a few programs that might be useful in this specific case. All of these should work as msgfilter filters. (My personal favorite is chef. Bork bork bork!)
Haven't tried this myself yet, but I found the podebug tool from the Translate Toolkit. Based on its documentation (the flipped and unicode rewrite options), it looks like exactly the tool I wished for.
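If you'd rather write the filter yourself, as suggested for the faux Cyrillic case, a minimal sketch in Python might look like the following. It accents the vowels and leaves anything inside <...> untouched so HTML fragments survive. The names pseudolocalize and filter_stream are made up for this example, and the sketch assumes translations contain no format placeholders such as %(name)s, which it would mangle.

```python
import re

# Map plain vowels to accented look-alikes (single precomposed code points).
ACCENTED = str.maketrans(
    "aeiouAEIOU",
    "\u00e1\u00e9\u00ed\u00f3\u00fa\u00c1\u00c9\u00cd\u00d3\u00da",
)

# The capturing group makes re.split keep the tags in its result.
TAG = re.compile(r"(<[^>]*>)")


def pseudolocalize(text):
    """Accent the vowels in text, leaving HTML tags as they are."""
    parts = TAG.split(text)
    # Odd-indexed parts are the captured tags; even-indexed parts are text.
    return "".join(
        part if index % 2 else part.translate(ACCENTED)
        for index, part in enumerate(parts)
    )


def filter_stream(source, destination):
    """Read everything from source, write the pseudolocalized text to destination."""
    destination.write(pseudolocalize(source.read()))
```

Saved as a script that calls filter_stream(sys.stdin, sys.stdout), this could be handed to msgfilter in place of sed.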

Store and query a mapping in a file, without re-inventing the wheel

If I were using Python, I'd use a dict. If I were using Perl, I'd use a hash. But I'm using a Unix shell. How can I implement a persistent mapping table in a text file, using shell tools?
I need to look up mapping entries based on a string key, and query one of several fields for that key.
Unix already has colon-separated records for mappings like the system passwd table, but there doesn't appear to be a tool for reading arbitrary files formatted in this manner. So people resort to:
key=foo
fieldnum=3
value=$(cat /path/to/mapping | grep "^$key:" | cut -d':' -f$fieldnum)
but that's pretty long-winded. Surely I don't need to make a function to do that? Hasn't this wheel already been invented and implemented in a standard tool?
Given the conditions, I don't see anything hairy in the approach. But maybe consider awk to extract the data. The awk approach allows for picking only the first or the last entry, or imposing arbitrary additional conditions. Passing the key with -v and comparing it to the first field also avoids treating the key as a regex:
value=$(awk -F: -v key="$key" -v n="$fieldnum" '$1 == key {print $n}' /path/to/mapping)
Once bundled in a function it's not that scary:)
I'm afraid there's no better way, at least within POSIX. But you may also have a look at the join command.
Bash supports arrays, which is not exactly the same. See for example this guide.
area[11]=23
area[13]=37
area[51]=UFOs
echo ${area[11]}
See this LinuxJournal article for associative arrays, available in Bash >= 4.0. For older versions of Bash you can fake it:
hput () {
eval hash"$1"='$2'
}
hget () {
# expands to: echo "${hash<key>}"
eval echo '"${hash'"$1"'}"'
}
# then
hput a blah
hget a # yields blah
Your example is one of several ways to do this using shell tools. Note that cat is unnecessary.
key=foo
fieldnum=3
filename=/path/to/mapping
value=$(grep "^$key:" "$filename" | cut -d':' -f$fieldnum)
Sometimes join comes in handy, too.
AWK, Python, Perl, sed and various XML, JSON and YAML tools as well as databases such as MySQL and SQLite can also be used, of course.
Without using them, everything else can sometimes be convoluted. Unfortunately, there isn't any "standard" utility. I would say that the answer posted by pooh comes closest. AWK is especially adept at dealing with plain-text fields and records.
The answer in this case appears to be: no, there's no widely-available implementation of the ‘passwd’ file format for the general case, and wheel re-invention is necessary in each case.
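For comparison, here is what the reinvented wheel amounts to when written out in Python, the dict-like lookup the question alludes to. This is only a sketch: lookup is a made-up name, fields are numbered from 1 as with cut, and the key must match the whole first field rather than a prefix.

```python
def lookup(records, key, fieldnum, sep=":"):
    """Return field `fieldnum` (1-based) of the first record whose first
    field equals `key`, or None if no record matches or the field is missing.

    `records` is any iterable of lines, for example an open file.
    """
    for line in records:
        fields = line.rstrip("\n").split(sep)
        if fields[0] == key:
            if fieldnum <= len(fields):
                return fields[fieldnum - 1]
            return None
    return None


# Example records in the passwd-like format from the question.
sample = ["foo:1:alpha\n", "foobar:2:beta\n"]
print(lookup(sample, "foo", 3))
```

With a real file it would be used as: with open("/path/to/mapping") as fh: value = lookup(fh, "foo", 3).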

In Ruby, how to automatically convert non-supported characters in text-processing?

(Using Ruby 1.8)
I only have a basic understanding of encodings, but what I want to know is this: in any given script handling any given text file, is there some universal library or call I can use to turn non-standard characters into their nearest printable equivalent? I realize there's no all-in-one fix, but this is for an English (U.S. gov't) text file, so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-8­23
That hyphen is literally just a hyphen as I've typed it out. In the file, though, it's something that looks like a hyphen (an en dash?), but when I copy and paste it, for example into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash, or at least something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex because it knows how to convert accented characters to similar-looking characters, or to ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Grey's article linked in the answer. It explains the problem and ways to fix it, and ends up recommending Iconv too.
You could whitelist with gsub, replacing everything that isn't on the list (here with a dash):
string.gsub(/[^a-zA-Z0-9]/, '-')
Without more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything that isn't a letter, a number or an expected symbol).
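The "nearest printable equivalent" idea can also be sketched without Iconv. The following is Python rather than Ruby 1.8, and it swaps in a different technique: an explicit table for hyphen-like characters, then Unicode NFKD decomposition to strip accents. The to_ascii name and the choice to turn the soft hyphen into a visible dash are assumptions made for this example.

```python
import unicodedata

# Hyphen-like characters that word processors tend to produce:
# soft hyphen, hyphen, non-breaking hyphen, figure dash, en dash,
# em dash and minus sign. All of them become a plain ASCII "-".
DASHES = dict.fromkeys(
    map(ord, "\u00ad\u2010\u2011\u2012\u2013\u2014\u2212"), "-"
)


def to_ascii(text):
    """Return an ASCII approximation of text, degrading gracefully."""
    text = text.translate(DASHES)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything non-ASCII removes the marks and keeps the base.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("caf\u00e9 0\u20138\u00ad23"))   # cafe 0-8-23
```

Characters with no decomposition and no entry in the table are silently dropped, so the table needs extending for whatever else turns up in the file.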

Delimiter for meta data in Windows file name

I'm working on maintenance of an application that transfers a file to another system and uses a structured filename to include meta data including a language code. The current app uses a two character language code and a dash/hyphen for a delimiter.
Ex. Canada-EN-ProdName-ProdCode.txt
I'm converting it to use IETF language codes, so the dash delimiter won't do and I need a replacement. I'm trying to pick a delimiter that avoids future errors and am considering the tilde ~.
Ex. Canada~en-GB~ProdName~ProdCode.txt
This will be used only on Windows Server 2003+ systems. I certainly didn't come up with this system of parsing a filename to get meta data. Unfortunately, I can't include this in the file itself, and the destination system is expecting the language code to be in IETF format with the dash.
Any thoughts on potential issues with using the tilde in the filename, or perhaps a better character to use? I'm just looking for a second opinion in case I'm overlooking a possible failure. I believe Windows will use the tilde when shortening a long filename to 8.3 format, but I don't see that as an issue here as the OSs can handle long filenames.
The tilde is probably fine, but what's wrong with the good old underscore _ ? It has no special meaning on either Windows or Unix, and makes names that are relatively easy to read. If there are no other special considerations, I would avoid the tilde solely out of paranoia, since Windows does use it as a special character sometimes, as you mentioned.
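Whichever delimiter is chosen, the parsing side stays small. A sketch in Python, assuming exactly four fields in the order shown in the question; parse_name and the dictionary keys are made-up names for this example.

```python
import os


def parse_name(filename, delimiter="_"):
    """Split a structured filename into its four meta data fields.

    Raises ValueError if the name does not contain exactly four fields,
    for example because a product name contains the delimiter itself.
    """
    stem, _extension = os.path.splitext(filename)
    country, language, product_name, product_code = stem.split(delimiter)
    return {
        "country": country,
        "language": language,
        "product_name": product_name,
        "product_code": product_code,
    }


print(parse_name("Canada_en-GB_ProdName_ProdCode.txt")["language"])   # en-GB
```

The dash inside en-GB is left alone because it is no longer the delimiter, which is the whole point of the change.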
For anyone reading this question, I would strongly recommend anything but the tilde in the file name, or at least careful testing for speed problems with any .NET path work where one exists.
I used the tilde as a file name delimiter some time ago. I couldn't understand why simply getting a list of files from the folders was taking so long. It was a number of years later (having written a lot of speed-up code that gave only marginal advantage) that I discovered there is a problem (with DirectoryInfo(path).Name in .NET at least) where the mere existence of the tilde forces the underlying code to jump through a lot of hoops.
The slowdown was substantial (it was over a network, so I had thought it was a bandwidth or network issue for a fair while).
I understand this is a legacy overhang for when alternative short versions of filenames could be used for Windows files.
I am now stuck with the tilde in these file names but, given that the problem lay in some of the .NET path functions (I don't actually know if it still does), I could work around it by spotting a tilde and creating my own answers when it existed rather than passing it through.
If in any doubt just run speed tests with and without the tilde in filenames for say just 500-1,000 files.

What is the best character to use as a delimiter in a custom batch syntax?

I've written a little program to download images to different folders from the web. I want to create a quick and dirty batch file syntax and was wondering what the best delimiter would be for the different variables.
The variables might include urls, folder paths, filenames and some custom messages.
So are there any characters that cannot be used for the first three? That would be the obvious choice to use as a delimiter. How about the good old comma?
Thanks!
You can use either:
A control character: control characters don't normally appear in URLs, paths or file names. Tab (\t) is probably the best choice here.
Some combination of characters that is unlikely to occur in your data, for example #s#.
Tab is the generally preferred choice though.
Why not just use something that exists already? There are one or two choices: Perl, Python, Ruby, bash, sh, csh, Groovy, ECMAScript or, heaven forbid, Windows scripting files.
I can't see what you'd gain by writing yet another batch file syntax.
Tabs. And then expand or compress any tabs found in the text.
Choose a delimiter that has the least chance of collision with the names of any variable that you may have (which precludes #, /, : etc). The comma (,) looks good to me (unless your custom message has a few) or < and > (subject to previous condition).
However, you may also need to 'escape' delimiter characters occurring as part of the variables you want to delimit.
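One way to get the escaping for free is to lean on an existing tab-separated reader and writer instead of hand-rolling the rules. A sketch using Python's csv module with a tab delimiter; dump_rows, load_rows and the sample jobs are made up for this example.

```python
import csv
import io


def dump_rows(rows):
    """Serialize rows as tab-separated text, quoting fields only when needed."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def load_rows(text):
    """Parse tab-separated text produced by dump_rows back into lists."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


# Each job: url, folder path, filename, custom message.
jobs = [
    ["http://example.com/a,b.png", "C:\\images\\cats", "cat.png", "first, with a comma"],
    ["http://example.com/c.png", "C:\\images\\dogs", "dog.png", "second\twith a tab"],
]

text = dump_rows(jobs)
print(load_rows(text) == jobs)   # True
```

Commas in URLs or messages need no special treatment, and a message that happens to contain a tab is quoted by the writer and unquoted by the reader.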
This sounds like a really bad idea. There is no need to create yet another (data-representation) language; there are plenty that might fit your needs. In addition to Ruby, Perl, etc., you may want to consider YAML.
Designing good syntax for this sort of thing is difficult and fraught with peril. Does reinventing the wheel ring a bell?
I would use '|'
It's one of the rarest characters.
How about String.fromCharCode(1) ?

Resources