Why is endianness a concern for binary files?

I have read that endianness is a problem when talking about binary files, but I haven't found a reason for that.
Why is that? Also, why doesn't it apply to text files (which are by definition binary files that just get a different interpretation)?

Endianness is the order of bytes within multi-byte values, so it only matters for values that take more than one byte.
Take names, for example. In English, first names come... well, first. Then last names. For clarity, let's say that the given name precedes the family name. So "Thomas Edison, Nikola Tesla, Jack Nicholson" describes three people, whose given names are Thomas, Nikola and Jack. In Japan, family names come first, and given names follow. So e.g. "Kitano Takeshi" has Kitano as family name, and Takeshi as given name; and in English we usually write "Takeshi Kitano". Now if you aren't familiar with Japanese names or the pop culture, and you read "Shiina Ringo", you can't know if "Shiina" is a given name or a family name, unless someone informs you if "Shiina Ringo" is written in Japanese order or English order. (It's in Japanese order, "Shiina" is the family name.)
As long as you're inside the same culture, there's hardly any problem. Nobody in America would look at "Jack Frank" and doubt whether "Jack" or "Frank" was the given name. But if you go to a culture with the opposite order, suddenly there are a ton of problems. I live in Japan, and I have been randomly addressed as both "Mr $GivenName" and "Mr $LastName", simply because they had no idea whether I wrote my name using the Japanese or Western tradition.
On the other hand, people with mononyms don't have that problem. "Jennie" (the K-pop idol) is just "Jennie". "Cher" is just "Cher". There's no problem with "which one is the last name" when they don't use a last name.
Similarly, 16-bit integers take two bytes to represent. For example, 0xDEAD can be stored as the bytes 0xDE 0xAD ("big endian") or as 0xAD 0xDE ("little endian"). Going the other way, if you encounter the bytes 0xDE 0xAD without knowing their endianness, you can't tell whether they're supposed to mean 0xDEAD or 0xADDE.
Just like with names, it usually doesn't bite you as long as you're working within the same system. Almost everything on a Windows 10 machine is little-endian. Working on a big-endian system (if you can find one; they're nearly extinct by now) also won't cause problems. But if you create a little-endian file on Windows 10 and then read it on a big-endian system without compensating for the byte order, you'll get garbage.
Why doesn't it apply to text files? For the same reason no one messes up "Cher": when you have a stream of single bytes like 0x40, it's 0x40 whether you write it first-byte-first or first-byte-last; there is no multi-byte order to get wrong. (Note that this holds for single-byte encodings like ASCII and Latin-1, and also for byte-oriented multibyte encodings such as UTF-8, Shift-JIS and Big5, whose byte order is fixed by the encoding itself. Text files in encodings with multi-byte code units, like UTF-16/UCS-2 and UTF-32, are affected by endianness, which is why they often begin with a byte order mark.)
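To make that concrete, here is a minimal Python sketch of the 0xDEAD example (the struct module lets you pick the byte order explicitly):
import struct
value = 0xDEAD
big = struct.pack(">H", value)     # b'\xde\xad'  big endian
little = struct.pack("<H", value)  # b'\xad\xde'  little endian
# Reading little-endian bytes while assuming big endian gives 0xADDE, not 0xDEAD:
print(hex(struct.unpack(">H", little)[0]))  # 0xadde
# A single byte has no internal order, which is why single-byte text is safe:
print(struct.pack(">B", 0x40) == struct.pack("<B", 0x40))  # True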


How to create a filename with characters that are not part of UTF-8 on Windows?

[Edit/Disclaimer]: Comments pointed out that I need to clarify which encoding the user uses. I will update accordingly.
I have a customer from China who recently reported an issue with their filenames on Windows. The software works with most Chinese characters, but it seems he has found one file that fails.
Unfortunately, they are not able to send me the filename, as neither zipping the file nor transmitting it through other media seems to preserve the name.
What is the easiest way (e.g. through Python) to generate a filename on Windows that is covered by the NTFS file system encoding but not by UTF-8?
Unicode strings are encoded as a series of bytes. The encoding rules that determine what a series of bytes looks like to you on screen are the same rules the operating system uses to turn bytes back into characters.
Given that Windows uses (a variation of) Unicode, and you say you have a character that's not in Unicode, it also means that there is simply no way to represent that character.
Imagine if Unicode only contained the digits 0-9 and you asked someone how to encode the letter A. There's no answer to this, because only 0-9 are defined.
You could make up a new Unicode code point for your character, but then operating systems won't know what to do with it unless you also make your own font files.
I somehow doubt that's what you want to do, though, but it's an option. Could your customer rename the file before sending it to you?
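For what it's worth, on the question as literally asked: NTFS stores names as sequences of 16-bit code units and does not require them to form valid UTF-16, so a lone surrogate is one way to get a filename that strict UTF-8 cannot encode. A rough Python sketch, assuming a Windows machine where str paths are handed straight to the wide-character API:
name = "lone_surrogate_\ud800.txt"   # U+D800 is an unpaired surrogate
try:
    name.encode("utf-8")             # strict UTF-8 rejects lone surrogates
except UnicodeEncodeError as exc:
    print("not representable in UTF-8:", exc)
# On Windows this should create a file whose name many tools (zip, etc.)
# will fail to round-trip, much like what the customer is reporting:
with open(name, "w") as f:
    f.write("test")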

Translation and words with a fixed number of letters

In a part of my software I need, in different languages, lists of words having a fixed number of letters.
My software is already internationalized (using gettext, and it works fine). But since it's generally not possible to translate a 4-letter word from English into a 4-letter word in another language (apart from some exceptions, maybe), I can't imagine a reasonable way of letting gettext deal with these word lists.
So, for now, I decided to store my words like this (shortened example with English and French words):
FOUR_LETTERS_WORDS = {
    'fr': ["ANGE", "ANIS", "ASIE", "AUBE", "AVEN", "AZUR"],
    'en': ["LEFT", "LAMP", "ATOM", "GIRL", "PLUM", "NAVY", "GIFT", "DARK"],
}
(This is Python syntax, but the problem doesn't have much to do with the programming language used.)
The lists do not need to have the same length; they do not need to contain the same words.
My problem is the following: if my software is to be translated into another language, say German, then all the strings that fall within the scope of gettext will be listed in the .pot file and will be available for translation. But I also need a list of 4-letter words in German, and that list won't show up in the translation file.
I would like to know whether I should plan to ask the translator to also provide a list of such words, or whether there's a better way to deal with this situation (maybe a satisfying workaround with gettext?).
EDIT: I realized the question doesn't have much to do with the programming language, so I removed the python* tags.
You can do it with gettext. It's possible to use "keys" instead of complete sentences for translation.
If you use sentences as msgids in your .po files and don't want to translate the main language (let's say English), you don't need to translate them; just provide translation files for these words for the other languages. If gettext finds a translation, it uses it; otherwise it displays the key (the msgid). The key can be a complete sentence or an identifier, it does not matter.
To do that, you simply need to use a specific text domain for these words and the dgettext() function. The domain lets you separate files by context, or by whatever criterion you choose (functionality, sub-package, etc.).
Counting these words is not as easy. You could count them with grep -c, for instance, or provide a special key that contains the number of 4-letter words (but that would be a dirty hack you probably couldn't really rely on).
Maybe there's another way in Python, I don't know this language...
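Since the answer mentions not knowing Python, here is a rough sketch of the separate-domain idea using the standard library's gettext module; the "fourletterwords" domain name and the locale/ directory layout are made up for the example:
import gettext
# Assumed layout: compiled catalogs at locale/<lang>/LC_MESSAGES/fourletterwords.mo
gettext.bindtextdomain("fourletterwords", "locale")
# The English words double as the msgids ("keys"); each .po file maps them
# to a 4-letter word in its own language.
ENGLISH_KEYS = ["LEFT", "LAMP", "ATOM", "GIRL", "PLUM", "NAVY", "GIFT", "DARK"]
def four_letter_words():
    # dgettext() looks each key up in the "fourletterwords" domain only,
    # keeping these entries out of the application's main catalog.
    return [gettext.dgettext("fourletterwords", key) for key in ENGLISH_KEYS]
Note that this keeps every language's list keyed to the English words, so it shows the domain mechanics but does not by itself address the point that the lists may have different lengths and contents.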

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of Kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuu on jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。 (Thank you in advance!)
For data, dig into Google's Japanese IME (Mozc) data files here:
https://github.com/google/mozc/tree/master/src/data
There is a lot of interesting data there, including IPA dictionaries.
Edit:
You may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words:
https://taku910.github.io/mecab/
There are Ruby bindings for it too:
https://taku910.github.io/mecab/bindings.html
And here is somebody who tested Ruby with MeCab using the tagger's -Oyomi option:
http://hirai2.blog129.fc2.com/blog-entry-4.html
Just a quick follow-up to explain the actual solution we eventually used. Thanks to all who recommended MeCab; it appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combine the input file with the mecab-translated file, so that the resulting file now has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
Use the data from the resulting .csv file to seed our rails app with its built-in values.
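(For comparison only, since our script was Ruby: a rough Python sketch of the mecab call and the katakana-to-hiragana normalization, assuming the mecab binary is on PATH; the sample names are made up.)
import subprocess
def yomi(text):
    # Same -Oyomi output format as the command above: returns the reading
    # as full-width katakana.
    result = subprocess.run(["mecab", "-Oyomi"], input=text,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
def katakana_to_hiragana(s):
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana
    # counterparts, so a fixed code point shift converts between them.
    return "".join(chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
                   for ch in s)
# Sorting by the hiragana reading gives (roughly) gojuon order.
hospitals = ["高橋病院", "青森市民病院"]   # hypothetical sample names
hospitals.sort(key=lambda name: katakana_to_hiragana(yomi(name)))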
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is a good idea.
Then, I'll introduce another method.
If your app is written in Microsoft VBA, you can call the "GetPhonetic" function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by their pronunciation is not common. Most Japanese are used to prefectures sorted by the prefecture code (都道府県コード).
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" and "ISO 3166-2:JP".
See (Japanese Wikipedia):
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
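A tiny Python sketch of that approach (just an excerpt of the code table; the full list comes from JIS X 0401):
# Sorting by the two-digit code avoids any pronunciation lookup for prefectures.
PREFECTURE_CODES = {"01": "北海道", "02": "青森県", "13": "東京都",
                    "27": "大阪府", "47": "沖縄県"}
for code, name in sorted(PREFECTURE_CODES.items()):
    print(code, name)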

What does I18N safe mean?

I came across a comment in some code referring to said code being "I18N safe".
What does this refer to?
I + (some 18 characters) + N = InternationalizatioN
I18N safe means that steps were taken during design and development that will facilitate Localization (L10N) at a later point.
This most often refers to code or a construct that is ready for I18N - i.e. easily supported by common I18N techniques. For instance, the following is ready:
printf(loadResourceString("Result is %s"), result);
while the following is not:
printf("Result is " + result);
because the word order may vary in different languages. Unicode support, international date-time formatting and the like also qualify.
EDIT: added loadResourceString to make an example close to real life.
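The same contrast in Python, for illustration (the result value is hypothetical; _ is the usual gettext alias):
from gettext import gettext as _
result = 42  # hypothetical value
# Ready: the whole sentence is one translatable format string, so a translator
# can move the placeholder wherever the target language needs it.
print(_("Result is %(result)s") % {"result": result})
# Not ready: the word order is baked in by concatenation, and the fragment
# "Result is " can't be translated as a complete sentence.
print("Result is " + str(result))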
i18n means internationalization => i (18 letters) n. Code that's marked as i18n safe would be code that correctly handles non-ASCII character data (e.g. Unicode).
Internationalization. The derivation of it is "the letter I, eighteen letters, the letter N".
I18N stands for Internationalization.
It's a numeronym for Internationalization.
Unlike an acronym, a numeronym is a number-based word (e.g. 411 = information, K9 = canine).
In code, this will normally show up as a folder title.
Read more about it here: http://www.i18nguy.com/origini18n.html
i18n is a shorthand for "internationalization". This was coined at DEC and actually uses lowercase i and n.
As a side note: L10n stands for "localization" and uses a capital L to distinguish it from the lowercase i.
Without any additional information, I would guess that it means the code handles text as UTF-8 and is locale-aware. See this Wikipedia article for more information.
Can you be a bit more specific?
i18n-safe is a vague concept. It generally refers to code that will work in international environments - with different locale, keyboard, character sets etc. True i18n-safe code is hard to write.
It means that code cannot rely on
sizeof (char) == 1
meaning one byte per character, because a character could be a 4-byte UTF-32 character, a 2-byte UTF-16 character, or a multi-byte UTF-8 sequence, and occupy multiple bytes.
It means that code cannot rely on the length of a string equalling the number of bytes in a string. It means that code cannot rely on zero bytes in a string indicating a nul terminator. It means that code cannot simply assume ASCII encoding of text files, strings, and inputs.
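A quick Python illustration of the character-count-versus-byte-count point:
s = "café"
print(len(s))                  # 4 characters
print(len(s.encode("utf-8")))  # 5 bytes: 'é' takes two bytes in UTF-8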
I18N stands for Internationalization.
In a nutshell: I18N safe code means that it uses some kind of a lookup table for texts on the UI. For this you have to support non-ASCII encodings. This might seem to be easy, but there are some gotchas.
"I18N safe" coding means the code that doesn't introduce I18N bugs. I18N is a numeronym for Internationalization, where there are 18 characters between I and N.
There are multiple categories of issues related to i18n such as:
Culture format: date/time formats (DD/MM/YY in the UK, MM/DD/YY in the US), number formats, time zones and measurement units change from culture to culture. The data must be accepted, processed and displayed in the right format for the right culture/locale, as sketched below.
International character support: all the characters from all the different languages should be accepted, processed and displayed correctly.
Localizability: translatable strings should not be hard-coded. They should be externalized in resource files.
"I18N Safe" coding means that none of the above issue is introduced by the way the code is written.
i18n deals with:
- moving hard-coded strings out of the code (not all of them should be, by the way) so they can be localized/translated (localization == L10n), as others have pointed out, and also with
- locale-sensitive behaviour, such as:
-- text handling (how many words there are in a Japanese text is far from obvious), and ordering/collation in different languages/writing systems,
-- date/time handling (the simplest example is showing am/pm for the US and 24-hour clocks for France, going up to entirely different calendars for specific countries),
-- Arabic or Hebrew (orientation of the UI, of text, etc.),
-- encoding, as others have pointed out,
-- database issues.
It's a fairly comprehensive field. Just dealing with "string externalization" is far from enough.
Some (programming) languages are better than others at helping developers write i18n code (meaning code that will run in different locales), but it remains a software engineering responsibility.

Validating Kana Input

I am working on an application that allows users to input Japanese-language characters. I am trying to come up with a way to determine whether the user's input is Japanese text (hiragana, katakana, or kanji).
There are certain fields in the application where entering Latin text would be inappropriate and I need a way to limit certain fields to kanji-only, or katakana-only, etc.
The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.
Ideas?
It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.
Not sure of a perfect answer, but there is a Unicode range for katakana and hiragana listed on Wikipedia. (I would expect these to be available from unicode.org as well.)
Hiragana: Unicode 3040–309F
Katakana: Unicode 30A0–30FF
Checking those ranges against the input should work as a validation for hiragana or katakana for Unicode in a language-agnostic manner.
For kanji, I would expect it to be a little more complicated, as I expect that the Chinese characters used in Chinese and Japanese are both included in the same range. But then again, I may be wrong here. (I can't say whether Simplified Chinese and Traditional Chinese are included in the same range...)
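Putting those ranges into a quick Python sketch (the kanji range below is the CJK Unified Ideographs block, which covers common kanji but is my addition, not from the answer above):
import re
HIRAGANA_ONLY = re.compile(r"\A[\u3040-\u309F]+\Z")
KATAKANA_ONLY = re.compile(r"\A[\u30A0-\u30FF]+\Z")
KANJI_ONLY = re.compile(r"\A[\u4E00-\u9FFF]+\Z")  # CJK Unified Ideographs
def is_katakana_only(text):
    return bool(KATAKANA_ONLY.match(text))
print(is_katakana_only("カタカナ"))  # True
print(is_katakana_only("ひらがな"))  # False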
oh oh! I had this one once... I had a regex with the hiragana, then katakana and then the kanji. I forget the exact codes, I'll go have a look.
regex is great because you double the problems. And I did it in PHP, my choice for extra strong auto problem generation
--edit--
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';
I found this here, but it's not great... I'll keep looking
--edit--
I looked through my portable hard drive.... I thought I had kept that particular snippet from the last company... sorry.
