strange characters: interaction of R and Windows locale? - windows

WinXP-x32, R-2.13.0
Dear list,
I have a problem that (I think) relates to the interaction between Windows and R.
I am trying to scrape a table with data on the Hawai'ian Islands. This is my R code:
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
The output is (first set of columns):
> Islands
  Island          Nickname             Location
1 Hawaiʻi[7]      The Big Island       19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2 Maui[8]         The Valley Isle      20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3 Kahoʻolawe[9]   The Target Isle      20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4 LÄnaÊ»i[10]     The Pineapple Isle   20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5 Molokaʻi[11]    The Friendly Isle    21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6 Oʻahu[12]       The Gathering Place  21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7 Kauaʻi[13]      The Garden Isle      22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8 Niʻihau[14]     The Forbidden Isle   21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167
As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8")
but that didn't help.
It seems to me that there may be an issue with the interaction of the Windows settings of the character set and R.
sessionInfo() gives
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.2-0.2
I have also attempted to let R use another setting by entering: Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In addition, I have attempted to make the change directly from the windows command prompt, using: chcp 65001 and variations of that, but that didn't change anything.
I noticed from searching the web that others have the issue as well, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs both under WinXP-x32 and under Win7-x86.
Is there a way to make R override the windows settings or can the issue be solved otherwise?
I have also tried other websites, and the issue occurs every time when there is an é, ü, ä, î, et cetera in the text-to-be-scraped.
Thank you,
Roger

Not quite an answer:
If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems.
#Convert factors to character
Islands <- as.data.frame(lapply(Islands, as.character), stringsAsFactors = FALSE)
iconv(Islands$Island, "windows-1252", "UTF-8")
Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities).
It is also possible to simply strip out the offending characters, though this isn't ideal.
iconv(Islands$Island, "windows-1252", "ASCII", "")
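The garbling in the question is consistent with UTF-8 bytes being displayed as Windows-1252 text. As a minimal sketch of that mechanism (in Python rather than R, with the function names `to_mojibake` and `from_mojibake` invented for illustration), the round trip looks like this:

```python
def to_mojibake(text):
    """Simulate the bug: encode as UTF-8, then decode the bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def from_mojibake(garbled):
    """Undo it: recover the raw bytes via cp1252, then decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


# U+02BB (the okina in "Hawaiʻi") is the two bytes CA BB in UTF-8,
# which cp1252 shows as the two characters "Ê" and "»".
garbled = to_mojibake("Hawai\u02bbi")
repaired = from_mojibake(garbled)
```

Note that this repair goes from Windows-1252 back to bytes and then to UTF-8, which is the reverse of converting the text "from windows-1252 to UTF-8". It also only works when every byte maps to a cp1252 character: the "ā" in Lānaʻi is C4 81 in UTF-8, and 0x81 is unassigned in cp1252, so Python raises an error for that string instead of round-tripping it.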

Unable to replicate the error, however looking at the help files is useful.
Sys.setlocale("LC_TIME", "de") # Solaris: details are OS-dependent
Sys.setlocale("LC_TIME", "de_DE.utf8") # Modern Linux etc.
Sys.setlocale("LC_TIME", "de_DE.UTF-8") # ditto
Sys.setlocale("LC_TIME", "de_DE") # OS X, in UTF-8
Sys.setlocale("LC_TIME", "German") # Windows
On Windows you should use names like "English" or "Dutch_Netherlands.1252" to change these settings.
I tried to replicate your state
> Sys.setlocale("LC_ALL","Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
> Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
However, I do not get the funny characters in the console. In my own locale the ʻ was displayed with a substitute marker, but all functionality remained.
> Islands[1,1]
[1] Hawaiʻi[27]
8 Levels: Hawaiʻi[27] Kahoʻolawe[34] Kauaʻi[30] Lānaʻi[32] Maui[28] ... Oʻahu[29]
These characters can still be read without trouble and matched in the table.
> Encoding(as.character("Hawaiʻi"))
[1] "UTF-8"
> Encoding(as.character(Islands[1,1]))
[1] "UTF-8"
> grep("Hawaiʻi", as.character(Islands[1,1]))
[1] 1
If you still have problems, the cause must lie elsewhere. Note, however, that to change the locale under Windows you have to use different names than on Linux or OS X (see your own locale info for an example). On Windows, "Dutch" is probably enough.
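Because locale names differ per operating system, one way to avoid the "cannot be honored" warning is to try several spellings and keep the first one the OS accepts. The following is a rough Python sketch of that idea, not R code; `first_supported_locale` is a hypothetical helper, and the candidate names in the comment are simply the ones quoted in this thread.

```python
import locale


def first_supported_locale(candidates, category=locale.LC_CTYPE):
    """Return the first name in `candidates` that setlocale accepts, else None.

    The original setting for `category` is restored before returning.
    """
    saved = locale.setlocale(category)  # query only, does not change anything
    try:
        for name in candidates:
            try:
                locale.setlocale(category, name)
                return name
            except locale.Error:
                continue  # this spelling is not known to the OS
        return None
    finally:
        locale.setlocale(category, saved)


# Example call with the spellings mentioned above (results are OS-dependent):
# first_supported_locale(["en_US.UTF-8", "Dutch_Netherlands.1252", "Dutch"])
```

Which names succeed depends entirely on the platform, so the helper reports what the OS accepts rather than assuming a particular spelling.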

Related

How to read a file in utf8 encoding and output in Windows 10?

What is the proper procedure for reading and outputting UTF-8 encoded data in Windows 10?
My attempt to read a UTF-8 encoded file in Windows 10 and output its lines to the terminal does not reproduce the symbols of some languages.
OS: Windows 10
Native codepage: 437
Switched codepage: 65001
In the cmd window I issued the command chcp 65001. The following Ruby code reads the UTF-8 encoded file and outputs its lines with puts.
fname = 'hello_world.dat'
File.open(fname,'r:UTF-8') do |f|
puts f.read
end
hello_world.dat content
Afrikaans: Hello Wêreld!
Albanian: Përshendetje Botë!
Amharic: ሰላም ልዑል!
Arabic: مرحبا بالعالم!
Armenian: Բարեւ աշխարհ!
Basque: Kaixo Mundua!
Belarussian: Прывітанне Сусвет!
Bengali: ওহে বিশ্ব!
Bulgarian: Здравей свят!
Catalan: Hola món!
Chichewa: Moni Dziko Lapansi!
Chinese: 你好世界!
Croatian: Pozdrav svijete!
Czech: Ahoj světe!
Danish: Hej Verden!
Dutch: Hallo Wereld!
English: Hello World!
Estonian: Tere maailm!
Finnish: Hei maailma!
French: Bonjour monde!
Frisian: Hallo wrâld!
Georgian: გამარჯობა მსოფლიო!
German: Hallo Welt!
Greek: Γειά σου Κόσμε!
Hausa: Sannu Duniya!
Hebrew: שלום עולם!
Hindi: नमस्ते दुनिया!
Hungarian: Helló Világ!
Icelandic: Halló heimur!
Igbo: Ndewo Ụwa!
Indonesian: Halo Dunia!
Italian: Ciao mondo!
Japanese: こんにちは世界!
Kazakh: Сәлем Әлем!
Khmer: សួស្តី​ពិភពលោក!
Kyrgyz: Салам дүйнө!
Lao: ສະ​ບາຍ​ດີ​ຊາວ​ໂລກ!
Latvian: Sveika pasaule!
Lithuanian: Labas pasauli!
Luxemburgish: Moien Welt!
Macedonian: Здраво свету!
Malay: Hai dunia!
Malayalam: ഹലോ വേൾഡ്!
Mongolian: Сайн уу дэлхий!
Myanmar: မင်္ဂလာပါကမ္ဘာလောက!
Nepali: नमस्कार संसार!
Norwegian: Hei Verden!
Pashto: سلام نړی!
Persian: سلام دنیا!
Polish: Witaj świecie!
Portuguese: Olá Mundo!
Punjabi: ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ ਦੁਨਿਆ!
Romanian: Salut Lume!
Russian: Привет мир!
Scots Gaelic: Hàlo a Shaoghail!
Serbian: Здраво Свете!
Sesotho: Lefatše Lumela!
Sinhala: හෙලෝ වර්ල්ඩ්!
Slovenian: Pozdravljen svet!
Spanish: ¡Hola Mundo!
Sundanese: Halo Dunya!
Swahili: Salamu Dunia!
Swedish: Hej världen!
Tajik: Салом Ҷаҳон!
Thai: สวัสดีชาวโลก!
Turkish: Selam Dünya!
Ukrainian: Привіт Світ!
Uzbek: Salom Dunyo!
Vietnamese: Chào thế giới!
Welsh: Helo Byd!
Xhosa: Molo Lizwe!
Yiddish: העלא וועלט!
Yoruba: Mo ki O Ile Aiye!
Zulu: Sawubona Mhlaba!
Steven Penny suggested using PowerShell without changing the code page. The following picture demonstrates that the issue persists.
Installing Windows Terminal (which is not part of the Windows distribution) solves the UTF-8 output issue; please see the included screen capture.
The problem is that you are using some methods and tools that are really old. First:
Native codepage: 437
Switched codepage: 65001
You don't need to mess with the code page any more; just leave it as the default. Also, from your picture I see you are also using Console Host, which is also really old. Windows Terminal [1] has been available since 2019 and has built-in UTF-8 support. Using Windows Terminal, I can run your script, even without specifying UTF-8:
fname = 'hello_world.dat'
File.open(fname,'r') do |f|
puts f.read
end
and I get a perfect result:
To use Windows Terminal, download the msixbundle file [2], then install it. Or, as it's essentially just a Zip file, you can rename it to file.zip and extract it with Windows, then run WindowsTerminal.exe. Or, since you are really having trouble with this process, you can use a portable version I just created [3] (at your own risk).
[1] https://github.com/microsoft/terminal
[2] https://github.com/microsoft/terminal/releases/tag/v1.8.1444.0
[3] https://github.com/microsoft/terminal/files/6563899/CascadiaPackage_1.8.1444.0_x64.zip
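For comparison, here is a minimal Python sketch of the same read-and-print task that states the encoding explicitly at both ends, so the result does not depend on the active code page. The helper names `read_utf8` and `write_utf8` are made up for this example; `hello_world.dat` is the file from the question.

```python
import sys


def read_utf8(path):
    """Read a whole text file, decoding it explicitly as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def write_utf8(text, stream=None):
    """Write text as UTF-8 bytes to a binary stream (stdout by default)."""
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(text.encode("utf-8"))


# Usage, assuming hello_world.dat exists in the current directory:
# write_utf8(read_utf8("hello_world.dat"))
```

This only guarantees that the right bytes reach the terminal. Whether every script is rendered still depends on the terminal and its fonts, which is the point made above about Console Host versus Windows Terminal.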

cannot edit ghostprint ppd in Windows 10

I had some difficulty posing my problem in a way that the Title filter found pleasing. The real problem is that modifying only the GhostPDF.PPD file in the GS 9.26 installation in Windows 10 doesn't seem to affect the output after a re-installation using the Windows 10 Device Installer.
I print to a networked Sun SPARCprinter 1 which is controlled by Ghostprint (script?) compiled to run on SunOS 4.1.4. This has worked successfully for some years printing output from Windows XP using Adobe's PS driver and a SPARCstation PPD cobbled together from samples found on the net.
I've installed Artifex's 9.26 on Windows 10 and output to an LPR printer (The Sun). The output works, is recognized as PS output by the Sun, but produces a number of FATAL errors.
I need to edit the Windows Ghostscript installation to output PS files which are more suitable for the Sun.
So to my simple question: Do I need to modify anything in the Ghostscript Windows 10 installation other than the Ghostpdf.PPD file?
additional info:
SPARCstation 10 information:
SunOS 4.1.4
arcad# gcc -dumpversion
2.95.2 Note: I had to bootstrap this version up from the early GCC which could be compiled with the SunOS 4.1.4 C compiler. I had the impression I couldn't bring it up any further but could be mistaken.
arcad# gs --help
Aladdin Ghostscript 6.01 (2000-03-17)
Copyright (C) 2000 Aladdin Enterprises ...
Usage: gs [switches] [file1.ps file2.ps ...]
Most frequently used switches: (you can use # in place of =)
-dNOPAUSE no pause after page | -q `quiet', fewer messages
-g<width>x<height> page size in pixels | -r<res> pixels/inch resolution
-sDEVICE=<devname> select device | -dBATCH exit after last file
-sOutputFile=<file> select output file: - for stdout, |command for pipe,
embed %d or %ld for page #
Input formats: PostScript PostScriptLevel1 PostScriptLevel2 PDF
.....
For more information, see /usr/local/share/ghostscript/6.01/doc/Use.htm.
Note: I think this is the most recent GS version I can compile with this gcc version
printcap section:
gp|GhostPrinter:\
:lp=/dev/lpvi0:sd=/var/spool/gsprintspool:lf=/var/spool/gsprintspool/log:\
:mx#0:sh:if=/usr/local/libexec/lpfilter-gps:
Typical spool file ("....." indicates stuff not included here):
arcad# more dfA004DESKTOP-M8C5I86
%!PS-Adobe-3.0
%%Title: Document
%%Creator: PScript5.dll Version 5.2.2
%%CreationDate: 12/14/2018 19:56:8
%%For: jferg
%%BoundingBox: (atend)
%%Pages: (atend)
%%Orientation: Portrait
%%PageOrder: Special
%%DocumentNeededResources: (atend)
%%DocumentSuppliedResources: (atend)
%%DocumentData: Clean7Bit
%%TargetDevice: (Ghostscript) (3010) 815
%%LanguageLevel: 3
%%EndComments
%%BeginDefaults
%%PageBoundingBox: 0 0 612 792
%%ViewingOrientation: 1 0 0 1
%%EndDefaults
.....
%%EndResource
userdict /Pscript_WinNT_Incr 230 dict dup begin put
%%BeginResource: file Pscript_FatalError 5.0 0
userdict begin/FatalErrorIf{{initgraphics findfont 1 index 0 eq{exch pop}{dup
length dict begin{1 index/FID ne{def}{pop pop}ifelse}forall/Encoding
{ISOLatin1Encoding}stopped{StandardEncoding}if def currentdict end
/ErrFont-Latin1 exch definefont}ifelse exch scalefont setfont counttomark 3 div
cvi{moveto show}repeat showpage quit}{cleartomark}ifelse}bind def end
%%EndResource
userdict begin/PrtVMMsg{vmstatus exch sub exch pop gt{[
quires more memory than is available in this printer.)100 500
more of the following, and then print again:)100 485
put format, choose Optimize For Portability.)115 470
ce Settings page, make sure the Available PostScript Memory is accur--More--(2%)
ce the number of fonts in the document.)115 440
ocument in parts.)115 425 12/Times-Roman showpage
Error: Low Printer VM ]%%)= true FatalErrorIf}if}bind def end
2016 ge{/VM?{pop}bind def}{/VM? userdict/PrtVMMsg get def}ifelse
.....
SPARCprinter PPD file which works with Adobe PS in Windows XP:
john#hp2:~/sun-stuff/cups-sparc$ more SPARCprinter2.ppd
*PPD-Adobe: "4.1"
*% PostScript(R) Printer Description File for SPARCprinter
*% Date: 94/01/14
*% Copyright 1994 Sun Microsystems, Inc. All Rights Reserved.
*% Permission is granted for redistribution of this file as
*% long as this copyright notice is intact and the contents
*% of the file is not altered in any way from its original form.
*% End of Copyright statement
*% Changed margins on SPARCprinter JAF 3-3-2017
*FormatVersion: "4.1"
*FileVersion: "1.10"
*LanguageEncoding: ISOLatin1
*LanguageVersion: English
*PCFileName: "SPRN.PPD"
*Product: "(SPARCprinter)"
*PSVersion: "(3.000) 0"
*ModelName: "SPARCprinter"
*ShortNickName: "SPARCprinter"
*NickName: "SPARCprinter"
*% ==== Device Capabilities ===============
*LanguageLevel: "3"
*Extensions: CMYK Composite
*FreeVM: "4194304"
*ColorDevice: False
*DefaultColorSpace: Gray
*VariablePaperSize: False
*TTRasterizer: None
*FileSystem: False
..... more of the usual stuff
I don't really understand why you have installed Ghostscript on Windows. Windows is perfectly capable of producing PostScript files all of its own. In addition, the PPD file doesn't actually do very much, it is simply a text file with descriptions of the capabilities of the printer.
So the real problem is, or seems to be, that your SUN setup doesn't like the PostScript being produced by the new version of Windows.
You don't say how you are printing the PostScript file, nor how your printer is 'controlled by Ghostscript' (I'm not aware of any product called Ghostprint; there is a GSPrint as part of GSView, but that's really for Windows).
Assuming you are using Ghostscript on your Sparc workstation to drive the printer, then the most likely problem, I would say, is that you are using an old version of Ghostscript on the workstation, and it doesn't like the PostScript being generated by the newer version of Windows.
If you had included the transcript from the workstation Ghostscript installation it might be possible to say more but without that I'm rather guessing.
Another possibility is that you are using the ps2write device in Ghostscript to produce PostScript files on Windows. I can't think why you would be doing that, but it sort of fits your description. In that case editing the PPD file will have no effect, because Ghostscript doesn't use it.
Now the ps2write device emits level 2 PostScript (the clue is in the name), and it's possible again that your Sparc setup is so elderly that it doesn't understand level 2, or doesn't fully implement it. In which case you will probably get errors. Again, if you were to provide the text of the error messages this would help!
In the latter case, you are frankly out of luck. We dropped support for level 1 PostScript output some time ago, what with level 2 being 28 years old now and level 3 coming up on 20. If you need language level 1 output you will have to go back to a very old version of Ghostscript. Something like 9.07 (from 5 and a half years ago) was the last version that included the pswrite device.
With effort you could take the pswrite device and upgrade it so that it works with the current version of Ghostscript.
[EDIT]
My word, that's a really old version of Ghostscript!
You could try building a new version to replace it, but I also don't know if current code will compile on gcc 2.95. It 'should' because we only expect C89, but the third party libraries (which are essential) may very well not compile.
The PostScript file you quoted has been produced by Windows, not by Ghostscript (%%Creator: PScript5.dll Version 5.2.2). So it seems likely to me that your problem is the PostScript being produced by the newer version of Windows doesn't work with your 18 year old version of Ghostscript. That's not actually entirely surprising.
If you look at the DSC comments it says:
%%LanguageLevel: 3
And your Ghostscript information says that it supports language levels 1 and 2. At the time the level 3 spec had only just been published (1999), and clearly the maintainers back then hadn't had time to fully implement it.
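To check a spool file for this mismatch without reading it by eye, the DSC header can be scanned for the `%%LanguageLevel` comment. This is a small Python sketch under the assumption that the comment appears before `%%EndComments`, as in the file quoted in the question; `dsc_language_level` is a hypothetical helper name.

```python
import re


def dsc_language_level(ps_text):
    """Return the integer from the first %%LanguageLevel comment, or None.

    Only the header is scanned: the search stops at %%EndComments.
    """
    for line in ps_text.splitlines():
        if line.startswith("%%EndComments"):
            break
        match = re.match(r"%%LanguageLevel:\s*(\d+)", line)
        if match:
            return int(match.group(1))
    return None


header = "%!PS-Adobe-3.0\n%%Title: Document\n%%LanguageLevel: 3\n%%EndComments\n"
level = dsc_language_level(header)  # 3 for the header quoted in the question
```

A value above 2 would indicate a file that the Ghostscript 6.01 installation described above may not handle.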
Note that the ghostpdf.ppd file is intended for use with Ghostscript as a 'print to PDF' printer along with the RedMon port monitor.
Now it's not obvious to me which PPD file you are using, but both the ghostpdf.ppd file and the sparcprinter ppd file have:
*LanguageLevel: "3"
That tells the PostScript driver that it can use language level 3, which your Sparc Ghostscript doesn't support. You could try changing that to:
*LanguageLevel: "2"
and see if that makes a difference (you will have to uninstall the printers from Windows and re-install them with the modified PPD file).
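If the PPD is edited repeatedly, the change can be scripted. The sketch below, in Python, rewrites the `*LanguageLevel` entry in PPD text; `set_ppd_language_level` is an invented helper, and the sample text reuses lines from the SPARCprinter PPD quoted in the question.

```python
import re


def set_ppd_language_level(ppd_text, level):
    """Rewrite every line of the form *LanguageLevel: "N" to the given level."""
    pattern = re.compile(r'^(\*LanguageLevel:\s*)"\d+"', re.MULTILINE)
    return pattern.sub(lambda m: f'{m.group(1)}"{level}"', ppd_text)


sample = '*ColorDevice: False\n*LanguageLevel: "3"\n*FreeVM: "4194304"\n'
edited = set_ppd_language_level(sample, 2)
```

Only the text is changed here. The printer still has to be removed and re-installed with the modified file for Windows to pick it up, as described above.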
If it doesn't work, the only other thing I can think of is to use the Ghostscript you installed on the Windows system, and preprocess the PostScript file produced by Windows before you send it on. You can use the ps2write device in Ghostscript 9.26 to take in the level 3 file, and produce a level 2 file. It might be a bit bigger, but it ought to work.
To do that on Windows you would use something like:
gswin64c -dBATCH -dNOPAUSE -sDEVICE=ps2write -sOutputFile=out.ps input.ps
The file 'out.ps' should then be a level 2 PostScript file. I can't guarantee that the output will then work with the old version of Ghostscript on your Sparc, but you stand a chance!
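If this preprocessing step needs to run for every job, it can be wrapped in a script. The following Python sketch builds the argument list for the conversion described above and runs it with `subprocess`. The function names are hypothetical, the executable name `gswin64c` is assumed to be on the PATH, and `-dBATCH` and `-dNOPAUSE` are the switches listed in the `gs --help` output in the question.

```python
import subprocess


def ps2write_command(input_ps, output_ps, gs="gswin64c"):
    """Build the Ghostscript argument list for a ps2write conversion."""
    return [
        gs,
        "-dBATCH",             # exit after the last file
        "-dNOPAUSE",           # no pause after each page
        "-sDEVICE=ps2write",   # re-emit the job as level 2 PostScript
        f"-sOutputFile={output_ps}",
        input_ps,
    ]


def convert_to_level2(input_ps, output_ps, gs="gswin64c"):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(ps2write_command(input_ps, output_ps, gs), check=True)


cmd = ps2write_command("input.ps", "out.ps")
```

Passing the arguments as a list avoids shell quoting problems with spool file names.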

Ghostscript 'offending input'

When searching for an occurrence of text in a PostScript file, I receive the following error:
gsapi_run_string_continue returns -21
The API documentation specifies that return codes < 0 are "Error" but doesn't describe them any more specifically. The full error console output is below; the error occurs twice identically, so only one occurrence is displayed here.
GPL Ghostscript 9.15 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Displaying DSC file C:/Users/c-toothm/Desktop/PRDFlow12_30_2014_050307/1230ouptut.ps
Displaying page 1
%%[ ProductName: GPL Ghostscript ]%%
%%[ LastPage ]%%
Extracting text using pstotext...
Ghostscript returns error code -21
--- Begin offending input ---
evice /pop , d
initmatrix [1 0 0 1 0 0] concat colspSet`
0.00 43.32 +
0.94 0.95 +S
(XSFT2200041.img) run
EPSFILE2200041 restore
;
0 0 0 sco 5 Lw N 4950 4742 M 4800 4742 I K
0 0 0 sco 5 Lw N 4950 4752 M 4800 4752 I K
0 0 0 sco 5 Lw N 4950 4762 M 4800 476
--- End offending input ---
gsapi_run_string_continue returns -21
[duplicate error redacted]
Our production output creates a giant .ps file every day and this error occurs in many, but not all, .ps files when searching for text. Randomly selected .ps files from the web do not throw the error, so this GS build seems OK - definitely a problem with my file.
What "offending input" is being referred to here and what can I do to address it?
I'd need to see the PostScript file to tell you exactly what is wrong, but 'evice' is not a PostScript operator and so that is likely the problem. Also, from ghostpdl/gs/psi/ierrors.h error code -21 is e_undefined which means the interpreter has encountered an undefined token, which is some confirmation that this is the problem.
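When calling the API from a program, it can help to translate the numeric return code into the PostScript error name before logging it. Below is a small Python sketch with a deliberately partial table. The -21 entry is the one identified above; the other entries are assumptions recalled from the same header and should be verified against the ierrors.h in your Ghostscript source. `describe_gs_return_code` is a hypothetical helper.

```python
# Partial table of interpreter error codes.
# -21 is confirmed by the answer above; the rest are assumed and should be
# checked against ierrors.h for the Ghostscript version in use.
GS_ERROR_NAMES = {
    -18: "syntaxerror",
    -20: "typecheck",
    -21: "undefined",
    -22: "undefinedfilename",
    -25: "VMerror",
}


def describe_gs_return_code(code):
    """Return a short human-readable description of a gsapi return code."""
    if code >= 0:
        return f"{code}: not an error code"
    name = GS_ERROR_NAMES.get(code)
    if name is None:
        return f"{code}: error (not in this partial table)"
    return f"{code}: {name}"


message = describe_gs_return_code(-21)
```

This only names the error. Finding the cause still requires looking at the PostScript around the reported offending input.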
This could be because the file contains a 'typo' like that (perhaps it should be setpagedevice or something), or it could be because a filter is improperly terminated, or has insufficient data, and consumes extra bytes from the input stream, chewing up your program.
You should start by reproducing the error with the Ghostscript executable (you might also try the display device, to see whether the problem is related to pstotext). That will allow you to give a command line which other people can then duplicate. With that, and a copy of the offending file, I can tell you exactly what's wrong; without it, there is not much hope.
Bear in mind that PostScript is an interpreted programming language, so it's pretty much impossible to tell you what's wrong with your program without seeing the code.
FWIW you might like to try the Ghostscript txtwrite device instead of pstotext, the device doesn't rely on tinkering with the language like pstotext does. pstotext is also really old (the last release is coming up on its 11th birthday) and unsupported.....

How to show Japanese characters in Mac OS X Terminal?

So I'm running MeCab (http://mecab.sourceforge.net/#download) to word-segment and do morphological analysis of Japanese sentences. However, when I run the program, I see gibberish due to some encoding issues in Mac OS X Terminal. I googled the topic, added the -Dfile.encoding option, and added the following 3 lines to .inputrc:
set convert-meta off
set meta-flag on
set output-meta on
Nothing works. Any ideas how to show Japanese characters in Mac OS X Terminal? Here's the output of the run of the program test.java:
env DYLD_LIBRARY_PATH=. /usr/bin/java -Dfile.encoding=utf-8 test
0.98pre3
å¤ ̾»ì,°ìÈÌ,*,*,*,*,*
ª郎ã µ­¹æ,°ìÈÌ,*,*,*,*,*
¯ä ̾»ì,¸Çͭ̾»ì,Áȿ¥,*,*,*,*
º郎にこのæ µ­¹æ,°ìÈÌ,*,*,*,*,*
¬ã ̾»ì,¥µÊÑÀܳ,*,*,*,*,*
µ­¹æ,°ìÈÌ,*,*,*,*,*
æ¸ ̾»ì,°ìÈÌ,*,*,*,*,*
¡ã µ­¹æ,³ç
BOS/EOS,*,*,*,*,*,*,*,*
å ̾»ì,°ìÈÌ,*,*,*,*
ª郎 µ­¹æ,°ìÈÌ,*,*,*
¯ ̾»ì,¸Çͭ̾»ì,Áȿ¥,*,*
º郎にこ µ­¹æ,°ìÈÌ,*,*,*
¬ ̾»ì,¥µÊÑÀܳ,*,*,*,
µ­¹æ,°ìÈÌ,*,*,*
æ ̾»ì,°ìÈÌ,*,*,*,*
¡ µ­¹æ,³ç¸̳«,*,*,*,*
µ­¹æ,°ìÈÌ,*,*,*
BOS/EOS,*,*,*,*,*,*,*,*
EOS
I would have thought that this was the default setting, but you could try selecting "Unicode (UTF-8)" as the Character encoding from Preferences..., Settings, Advanced, International. If this is already set, you may want to confirm that your program output is actually encoded in UTF-8. It could be Shift-JIS, EUC, or even UTF-16? In that case, try enabling those encodings from Preferences..., Encodings.
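One way to confirm what the program is actually emitting is to capture its raw output bytes and see which candidate encodings decode them cleanly. The Python sketch below does that. `decodable_as` is a hypothetical helper, and the sample bytes are an assumed reconstruction of one garbled fragment from the output in the question.

```python
def decodable_as(data, candidates=("utf-8", "euc_jp", "shift_jis")):
    """Return the candidate encodings that decode `data` without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


# Assumed reconstruction: the bytes behind the first garbled field
# in the output shown in the question.
sample = b"\xcc\xbe\xbb\xec"
accepted = decodable_as(sample)
text = sample.decode("euc_jp")  # the two kanji for "noun"
```

For these four bytes, EUC-JP decodes to a plausible Japanese word while UTF-8 fails, which would point at the dictionary charset rather than at Terminal. A successful decode is only a hint, since several encodings can accept the same bytes, so the decoded text should be checked by eye.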
After this
% cd mecab-ipadic-2.7.0-xxxx
% ./configure --with-charset=utf8
% sudo make
% sudo make install
the output of 'mecab -D' is
% cd mecab-java-0.98pre3
% mecab -D
filename: /usr/local/lib/mecab/dic/ipadic/sys.dic
version: 102
charset: utf8
type: 0
size: 392126
left size: 1316
right size: 1316
Here's the output of running the test program.
bash-3.2$ env DYLD_LIBRARY_PATH=. /usr/bin/java test
0.98pre3
?? ??,????,??,?,*,*,??,???,???
? ??,???,*,*,*,*,?,?,?
?? ??,????,??,?,*,*,??,???,???
? ??,???,??,*,*,*,?,?,?
?? ???,*,*,*,*,*,??,??,??
? ??,??,*,*,*,*,?,??,??
? ??,???,??,*,*,*,?,?,?
?? ??,??,*,*,?????,???,??,???,???
? ???,*,*,*,????,???,?,?,?
? ??,??,*,*,*,*,?,?,?
EOS
BOS/EOS,*,*,*,*,*,*,*,*
?? ??,????,??,?,*,*,??,???,???
? ??,???,*,*,*,*,?,?,?
?? ??,????,??,?,*,*,??,???,???
? ??,???,??,*,*,*,?,?,?
?? ???,*,*,*,*,*,??,??,??
? ??,??,*,*,*,*,?,??,??
? ??,???,??,*,*,*,?,?,?
?? ??,??,*,*,?????,???,??,???,???
? ???,*,*,*,????,???,?,?,?
? ??,??,*,*,*,*,?,?,?
BOS/EOS,*,*,*,*,*,*,*,*
EOS
What am I missing to make encoding work?
P.S.: all Japanese encodings are enabled at Preferences - Encodings in Terminal, and the encoding (Preferences - Settings - Advanced - International) in Mac OS X Terminal is UTF-8.

DIR command output in various localized versions

I have a strange (don't ask) need to see a few examples of a Win XP cmd shell DIR command for lots (some) of different localized versions of windows (eg. French, Spanish, etc).
The specific command I need is (note that this command is important... if you don't bother to use this command then don't bother to respond):
dir /4 /-c /t:a /n /a:-d-h-s
I know it's a crazy hope but I'm hoping to be able to chop/parse the output regardless of localization.
Probably not what you want to hear but we found all sorts of problems in relying on behavior in different localizations of Windows.
We had a cmd file which worked fine in US English but when we sent it for localization, they found all sorts of issues, and we have to support about 23 different versions.
In the end, it was easier to write (actual C) code to get the information via Win32 and output it in the format we wanted. This removed reliance on specific localization formats and configuration issues (some commands output differently not just based on locale but also on user configuration).
My advice: find a different way of doing this.
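As an illustration of "a different way", the same information can be read from the file system directly instead of from localized DIR text. The answer above used C and Win32; the sketch below is a Python analogue using `os.scandir`, with `list_plain_files` as an invented name. It skips directories and reports last-access times (as `/t:a` does), but it does not reproduce the hidden/system attribute filter of `/a:-d-h-s`.

```python
import os
from datetime import datetime, timezone


def list_plain_files(directory):
    """Return sorted (name, size_in_bytes, last_access) tuples for regular files.

    last_access is formatted as 'YYYY-MM-DD HH:MM' in UTC, so the output
    does not depend on the machine's locale or regional settings.
    """
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # skip directories and other non-regular entries
            info = entry.stat(follow_symlinks=False)
            accessed = datetime.fromtimestamp(info.st_atime, tz=timezone.utc)
            rows.append((entry.name, info.st_size, accessed.strftime("%Y-%m-%d %H:%M")))
    return sorted(rows)
```

This requires being able to run code on the target machine, which may not be possible in the asker's situation.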
Polish windows Vista outputs:
C:\Users\Karol>dir /4 /-c /t:a /n /a:-d-h-s
Wolumin w stacji C to OS
Numer seryjny woluminu: 3EC1-6B83
Katalog: C:\Users\Karol
2009-12-10 21:19 2263 intlname.ols
2009-07-23 21:17 1480 laptop_to_epia.ppk
2009-07-23 21:17 466 laptop_to_epia.pub
2010-01-31 09:49 10392 _viminfo
4 plik(ów) 14601 bajtów
0 katalog(ów) 10880864256 bajtów wolnych
Here's the output for Korean XP:
C µå¶óÀ̺êÀÇ º¼·ý¿¡´Â À̸§ÀÌ ¾ø½À´Ï´Ù.
º¼·ý ÀÏ·Ã ¹øÈ£: 7C33-7DCE
C:\WINDOWS\system32 µð·ºÅ͸®
2009-02-02 ¿ÀÈÄ 11:39 1697 $winnt$.inf
2008-02-19 ¿ÀÈÄ 09:07 2151 12520437.cpx
2008-02-19 ¿ÀÈÄ 09:07 2233 12520850.cpx
2008-02-19 ¿ÀÈÄ 09:06 100352 6to4svc.dll
2008-02-19 ¿ÀÈÄ 08:47 1460 a15.tbl
(seem to have lost the unicode during transfer... but for my purposes that's ok).
Sure... it's the wrong way... but needs must when the devil drives. The underlying problem is that the machine the command runs on cannot be modified or relied upon. The parsing/chopping is pretty minor (pull out a filename, file size and the creation date). The good news is that the filename is guaranteed not to include any spaces, which means the last 2 fields of a split() are the size and the filename, and the first N fields are the date (note I don't need the date as a date; a string is fine). The tricky part may be ensuring that Unicode moves around correctly (unlike in the Korean example).
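The split-based approach described here can be sketched in Python as follows. `parse_dir_line` is a hypothetical helper. The check that the first field looks like a numeric date is an added heuristic (not from the thread) to reject summary lines such as the Polish "4 plik(ów) 14601 bajtów", and it assumes that every localization starts file rows with a digit-separated date.

```python
import re


def parse_dir_line(line):
    """Split one DIR output line into date text, size and filename.

    Returns None for header and summary lines. Assumes filenames contain
    no spaces and sizes have no thousands separators (the /-c switch).
    """
    parts = line.split()
    if len(parts) < 3:
        return None
    if not re.fullmatch(r"[0-9]+", parts[-2]):
        return None  # second-to-last field is not a plain byte count
    if not re.match(r"[0-9]+[-./][0-9]+", parts[0]):
        return None  # heuristic: file rows start with a numeric date
    return {
        "date": " ".join(parts[:-2]),  # kept as an opaque string
        "size": int(parts[-2]),
        "name": parts[-1],
    }


row = parse_dir_line("2009-12-10 21:19 2263 intlname.ols")
```

Because the date is kept as an opaque string, locale-specific pieces such as an AM/PM marker between the date and the time simply stay inside it.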

Resources