I'm running MeCab (http://mecab.sourceforge.net/#download) to do word segmentation and morphological analysis of Japanese sentences. However, when I run the program, the output is gibberish because of an encoding problem in the Mac OS X Terminal. I googled the topic, added the -Dfile.encoding option, and added the following 3 lines to .inputrc:
set convert-meta off
set meta-flag on
set output-meta on
Nothing works. Any ideas how to show Japanese characters in Mac OS X Terminal? Here's the output of the run of the program test.java:
env DYLD_LIBRARY_PATH=. /usr/bin/java -Dfile.encoding=utf-8 test
0.98pre3
å¤ ̾»ì,°ìÈÌ,*,*,*,*,*
ª郎ã µ¹æ,°ìÈÌ,*,*,*,*,*
¯ä ̾»ì,¸Çͭ̾»ì,Áȿ¥,*,*,*,*
º郎にこのæ µ¹æ,°ìÈÌ,*,*,*,*,*
¬ã ̾»ì,¥µÊÑÀܳ,*,*,*,*,*
µ¹æ,°ìÈÌ,*,*,*,*,*
æ¸ ̾»ì,°ìÈÌ,*,*,*,*,*
¡ã µ¹æ,³ç
BOS/EOS,*,*,*,*,*,*,*,*
å ̾»ì,°ìÈÌ,*,*,*,*
ª郎 µ¹æ,°ìÈÌ,*,*,*
¯ ̾»ì,¸Çͭ̾»ì,Áȿ¥,*,*
º郎にこ µ¹æ,°ìÈÌ,*,*,*
¬ ̾»ì,¥µÊÑÀܳ,*,*,*,
µ¹æ,°ìÈÌ,*,*,*
æ ̾»ì,°ìÈÌ,*,*,*,*
¡ µ¹æ,³ç¸̳«,*,*,*,*
µ¹æ,°ìÈÌ,*,*,*
BOS/EOS,*,*,*,*,*,*,*,*
EOS
I would have thought that this was the default setting, but you could try selecting "Unicode (UTF-8)" as the Character encoding from Preferences..., Settings, Advanced, International. If this is already set, you may want to confirm that your program output is actually encoded in UTF-8. It could be Shift-JIS, EUC, or even UTF-16? In that case, try enabling those encodings from Preferences..., Encodings.
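One way to confirm what the program actually emits is to capture its raw output bytes and try each candidate encoding in turn. The sketch below is in Python rather than Java, purely for illustration. The helper name `try_decodings` and the sample bytes are assumptions and not part of MeCab or its Java binding.

```python
def try_decodings(data, encodings=("utf-8", "euc_jp", "shift_jis", "utf-16")):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: a feature string as an EUC-JP dictionary might emit it.
sample = "名詞,一般".encode("euc_jp")

for name, text in try_decodings(sample).items():
    print(f"{name:10} {text!r}")
```

A decoding that succeeds is not proof of the right encoding, since more than one candidate can accept the same bytes, so the decoded text still has to be read by eye. How the bytes are captured is left out here. Redirecting the program's output to a file and reading it in binary mode is one option.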
After running this:
% cd mecab-ipadic-2.7.0-xxxx
% ./configure --with-charset=utf8
% sudo make
% sudo make install
the output of 'mecab -D' is
% cd mecab-java-0.98pre3
% mecab -D
filename: /usr/local/lib/mecab/dic/ipadic/sys.dic
version: 102
charset: utf8
type: 0
size: 392126
left size: 1316
right size: 1316
Here's the output of running the test program.
bash-3.2$ env DYLD_LIBRARY_PATH=. /usr/bin/java test
0.98pre3
?? ??,????,??,?,*,*,??,???,???
? ??,???,*,*,*,*,?,?,?
?? ??,????,??,?,*,*,??,???,???
? ??,???,??,*,*,*,?,?,?
?? ???,*,*,*,*,*,??,??,??
? ??,??,*,*,*,*,?,??,??
? ??,???,??,*,*,*,?,?,?
?? ??,??,*,*,?????,???,??,???,???
? ???,*,*,*,????,???,?,?,?
? ??,??,*,*,*,*,?,?,?
EOS
BOS/EOS,*,*,*,*,*,*,*,*
?? ??,????,??,?,*,*,??,???,???
? ??,???,*,*,*,*,?,?,?
?? ??,????,??,?,*,*,??,???,???
? ??,???,??,*,*,*,?,?,?
?? ???,*,*,*,*,*,??,??,??
? ??,??,*,*,*,*,?,??,??
? ??,???,??,*,*,*,?,?,?
?? ??,??,*,*,?????,???,??,???,???
? ???,*,*,*,????,???,?,?,?
? ??,??,*,*,*,*,?,?,?
BOS/EOS,*,*,*,*,*,*,*,*
EOS
What am I missing to make encoding work?
P.S. All Japanese encodings are enabled under Preferences - Encodings in Terminal, and the encoding (Preferences - Settings - Advanced - International) in the Mac OS X Terminal is UTF-8.
On database1:
show LC_CTYPE; shows C
show LC_COLLATE; shows C
show SERVER_ENCODING; shows UTF8
but set "PGPASSWORD=password1" & set "PGCLIENTENCODING=UTF8" & psql.exe -h 127.0.0.1 -p 5432 -U postgres -d database1 -c "INSERT INTO table1 (column1) VALUES ('mise à jour 1');"
shows: ERROR: invalid byte sequence for encoding "UTF8": 0xc8 0x20
The error disappears if PGCLIENTENCODING is set to ISO_8859_5, for example.
How can I fix this issue?
There is nothing much to fix. Your Windows shell uses a different encoding than UTF-8, so you have to set the client encoding to that encoding to make it work. To find out which client encoding to use, you must figure out which encoding your shell uses. That in turn depends on which shell you are using and how the Windows system was configured.
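To see why the server rejects the statement, it helps to look at the bytes the shell hands to psql. The following Python sketch only illustrates the mechanism. The code page used (cp1252) is an assumption, since the real one depends on the shell and the Windows configuration, and `is_valid_utf8` is a hypothetical helper.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "mise à jour 1"

# Assumption: the shell encodes command-line text with a legacy code page.
legacy_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")

print(legacy_bytes, is_valid_utf8(legacy_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

In a single-byte code page the accented letter is one high byte, and a lone high byte followed by a space is not a valid UTF-8 sequence. That has the same shape as the two bytes reported in the error message. Declaring the shell's real encoding as the client encoding lets the server convert the text instead of rejecting it.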
I'm trying to import a CSV file into Postgres with the COPY command.
After receiving the well-known 'ERROR: character with byte sequence 0xd0 0x9f in encoding "UTF8" has no equivalent in encoding "WIN1252"', I changed my client_encoding to utf8.
Now I'm getting a completely unreadable message:
ПОМИЛКÐ: Ð²Ñ–Ð´Ð½Ð¾ÑˆÐµÐ½Ð½Ñ "mytab" не Ñ–Ñнує
I tried changing the console code page with chcp 65001, but with no luck.
Can anybody help me with this extraordinarily rare and complex task: importing a CSV file into a database?
Solution:
I suspect the problem is due to the UA or RU localization of the installed DB.
Switching the DB message language should help (at least it helped me):
SET lc_messages TO 'en_US.UTF-8';
Please try on your PC and let me know if that helps.
My investigation:
In PowerShell I kept getting this error:
ERROR: character with byte sequence 0xd0 0x9f in encoding "UTF8" has no equivalent in encoding "WIN1252"
When I switch the encoding to UTF-8 with the command:
SET client_encoding TO 'UTF8';
I start getting the same unreadable symbols, but if I go to pgAdmin4 and run the same command, it gives me a well-explained error in Ukrainian:
ERROR: ПОМИЛКА: insert або update в таблиці "exam_results" порушує обмеження зовнішнього ключа "exam_results_subject_id_fkey"
DETAIL: Ключ (subject_id)=(0) не присутній в таблиці "subjects".
CONTEXT: SQL-оператор "insert into exam_results (student_id, subject_id, mark)
values ((random()*100000)::int,
(random()*1000)::int,
(random()*5)::int)"
Функція PL/pgSQL inline_code_block рядок 4 в SQL-оператор
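The unreadable text in the question is consistent with UTF-8 encoded Cyrillic being rendered through a single-byte code page. Here is a small Python sketch of that effect, assuming Windows-1252 (the WIN1252 named in the error) as the console code page. Both helper names are hypothetical.

```python
def shown_as_cp1252(text):
    """Encode text as UTF-8, then misread the bytes as Windows-1252."""
    # errors="replace" because some bytes have no Windows-1252 character.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def undo_cp1252(garbled):
    """Reverse the misreading; works only if no byte was lost on the way."""
    return garbled.encode("cp1252").decode("utf-8")


original = "Привіт"
garbled = shown_as_cp1252(original)
print(garbled)
print(undo_cp1252(garbled))
```

Every Cyrillic letter becomes two Latin-looking characters, which matches the shape of the message above. Switching lc_messages to English sidesteps the problem because the messages then contain only ASCII, which looks the same in both encodings.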
What is the proper procedure to read and output UTF-8 encoded data in Windows 10?
My attempt to read a UTF-8 encoded file in Windows 10 and print its lines to the terminal does not render the characters of some languages.
OS: Windows 10
Native codepage: 437
Switched codepage: 65001
In the cmd window I issued the command chcp 65001. The following Ruby code reads the UTF-8 encoded file and prints its contents with puts.
fname = 'hello_world.dat'
File.open(fname,'r:UTF-8') do |f|
  puts f.read
end
hello_world.dat content
Afrikaans: Hello Wêreld!
Albanian: Përshendetje Botë!
Amharic: ሰላም ልዑል!
Arabic: مرحبا بالعالم!
Armenian: Բարեւ աշխարհ!
Basque: Kaixo Mundua!
Belarussian: Прывітанне Сусвет!
Bengali: ওহে বিশ্ব!
Bulgarian: Здравей свят!
Catalan: Hola món!
Chichewa: Moni Dziko Lapansi!
Chinese: 你好世界!
Croatian: Pozdrav svijete!
Czech: Ahoj světe!
Danish: Hej Verden!
Dutch: Hallo Wereld!
English: Hello World!
Estonian: Tere maailm!
Finnish: Hei maailma!
French: Bonjour monde!
Frisian: Hallo wrâld!
Georgian: გამარჯობა მსოფლიო!
German: Hallo Welt!
Greek: Γειά σου Κόσμε!
Hausa: Sannu Duniya!
Hebrew: שלום עולם!
Hindi: नमस्ते दुनिया!
Hungarian: Helló Világ!
Icelandic: Halló heimur!
Igbo: Ndewo Ụwa!
Indonesian: Halo Dunia!
Italian: Ciao mondo!
Japanese: こんにちは世界!
Kazakh: Сәлем Әлем!
Khmer: សួស្តីពិភពលោក!
Kyrgyz: Салам дүйнө!
Lao: ສະບາຍດີຊາວໂລກ!
Latvian: Sveika pasaule!
Lithuanian: Labas pasauli!
Luxemburgish: Moien Welt!
Macedonian: Здраво свету!
Malay: Hai dunia!
Malayalam: ഹലോ വേൾഡ്!
Mongolian: Сайн уу дэлхий!
Myanmar: မင်္ဂလာပါကမ္ဘာလောက!
Nepali: नमस्कार संसार!
Norwegian: Hei Verden!
Pashto: سلام نړی!
Persian: سلام دنیا!
Polish: Witaj świecie!
Portuguese: Olá Mundo!
Punjabi: ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ ਦੁਨਿਆ!
Romanian: Salut Lume!
Russian: Привет мир!
Scots Gaelic: Hàlo a Shaoghail!
Serbian: Здраво Свете!
Sesotho: Lefatše Lumela!
Sinhala: හෙලෝ වර්ල්ඩ්!
Slovenian: Pozdravljen svet!
Spanish: ¡Hola Mundo!
Sundanese: Halo Dunya!
Swahili: Salamu Dunia!
Swedish: Hej världen!
Tajik: Салом Ҷаҳон!
Thai: สวัสดีชาวโลก!
Turkish: Selam Dünya!
Ukrainian: Привіт Світ!
Uzbek: Salom Dunyo!
Vietnamese: Chào thế giới!
Welsh: Helo Byd!
Xhosa: Molo Lizwe!
Yiddish: העלא וועלט!
Yoruba: Mo ki O Ile Aiye!
Zulu: Sawubona Mhlaba!
Steven Penny suggested using PowerShell without changing the code page. The following picture demonstrates that the issue persists.
Installing Windows Terminal (which is not part of the Windows distribution) solves the UTF-8 output issue; please see the included screen capture.
The problem is that you are using some methods and tools that are really old. First:
Native codepage: 437
Switched codepage: 65001
You don't need to mess with the code page any more; just leave it as the default. Also, from your picture I see you are using Console Host, which is also really old. Windows Terminal [1] has been available since 2019 and has built-in UTF-8 support. Using Windows Terminal, I can run your script, even without specifying UTF-8:
fname = 'hello_world.dat'
File.open(fname,'r') do |f|
  puts f.read
end
and I get a perfect result:
To use Windows Terminal, download the msixbundle file [2], then install it. Or, as it's essentially just a Zip file, you can rename it to file.zip, extract it with Windows, then run WindowsTerminal.exe. Or, since you are really having trouble with this process, you can use a portable version I just created [3] (at your own risk).
[1] https://github.com/microsoft/terminal
[2] https://github.com/microsoft/terminal/releases/tag/v1.8.1444.0
[3] https://github.com/microsoft/terminal/files/6563899/CascadiaPackage_1.8.1444.0_x64.zip
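Whichever terminal is used, the program can also state the encoding explicitly on both the read side and the write side instead of relying on defaults. The sketch below shows the idea in Python rather than Ruby. The function names are hypothetical, and whether the glyphs then display correctly still depends on the terminal and its font.

```python
import io


def read_text_utf8(path):
    """Read a whole file, decoding it explicitly as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def write_text_utf8(text, binary_stream):
    """Write text to a binary stream as UTF-8 bytes."""
    binary_stream.write(text.encode("utf-8"))
    binary_stream.flush()


# Demonstration with an in-memory stream instead of a real terminal.
buffer = io.BytesIO()
write_text_utf8("Japanese: こんにちは世界!\n", buffer)
print(len(buffer.getvalue()), "bytes written")
```

For real output, `sys.stdout.buffer` could be passed as the binary stream, for example `write_text_utf8(read_text_utf8("hello_world.dat"), sys.stdout.buffer)`.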
Since I recently updated my OS X system, PDFs compiled by xelatex no longer display Chinese (the characters are missing). I looked at the log and it says:
This is XeTeX, Version 3.14159265-2.6-0.99991 (TeX Live 2014) (preloaded format=xelatex 2015.3.9) 17 OCT 2015 21:18
entering extended mode
(/usr/local/texlive/2014/texmf-dist/tex/latex/amsfonts/umsb.fd
File: umsb.fd 2013/01/14 v3.01 AMS symbols B)
Missing character: There is no 林 in font ptmr7t!
Missing character: There is no 星 in font ptmr7t!
Missing character: There is no 宇 in font ptmr7t!
By the way, I am using the sublime latex tool plugin.
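Each "Missing character" line in the log names both the glyph and the font that lacks it, so a quick tally shows which font the Chinese text is ending up in. This is a minimal Python sketch. The helper name `missing_glyphs` is hypothetical, and the pattern assumes the message format shown in the log excerpt above.

```python
import re

MISSING = re.compile(r"Missing character: There is no (.+?) in font (\S+?)!")


def missing_glyphs(log_text):
    """Map each font name to the set of characters the log says it lacks."""
    found = {}
    for char, font in MISSING.findall(log_text):
        found.setdefault(font, set()).add(char)
    return found


log = """\
Missing character: There is no 林 in font ptmr7t!
Missing character: There is no 星 in font ptmr7t!
Missing character: There is no 宇 in font ptmr7t!
"""
print(missing_glyphs(log))
```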
I ran into this problem too.
Although my language is Japanese, I think the same solution applies.
Try this:
Open Sublime Text > Preferences > Browse Packages,
then open the file LaTeXTools > builders > traditionalBuilder.py.
Comment out lines 18-20 (DEFAULT_COMMAND_LATEXMK)
and paste this:
DEFAULT_COMMAND_LATEXMK = ["latexmk", "-cd",
    "-e", "$latex = 'platex -synctex=1 -src-specials -interaction=nonstopmode'",
    "-e", "$biber = 'biber %O --bblencoding=utf8 -u -U --output_safechars %B'",
    "-e", "$bibtex = 'pbibtex %O %B'",
    "-e", "$makeindex = 'makeindex %O -o %D %S'",
    "-e", "$dvipdf = 'dvipdfmx %O -o %D %S'",
    "-e", "$pdf_mode = '3'",
    "-e", "$pdf_update_method = '0'",
    "-e", "$pdf_previewer = 'open -a preview.app'",
    "-f", "-norc", "-gg", "-pdfdvi"]
Press Cmd+Shift+P, type "recon", then run LaTeXTools: Reconfigure and migrate settings.
Quit Sublime Text (quit the application, don't just close the window).
Reopen it and edit the tex file (or remove the old log files), then build.
WinXP-x32, R-2.13.0
Dear list,
I have a problem that (I think) relates to the interaction between Windows and R.
I am trying to scrape a table with data on the Hawai'ian Islands. This is my R code:
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
The output is (first set of columns):
> Islands
  Island         Nickname            Location
1 Hawaiʻi[7]     The Big Island      19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2 Maui[8]        The Valley Isle     20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3 Kahoʻolawe[9]  The Target Isle     20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4 LÄnaÊ»i[10]    The Pineapple Isle  20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5 Molokaʻi[11]   The Friendly Isle   21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6 Oʻahu[12]      The Gathering Place 21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7 Kauaʻi[13]     The Garden Isle     22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8 Niʻihau[14]    The Forbidden Isle  21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167
As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8")
but that didn't help.
It seems to me that there may be an issue with the interaction of the Windows settings of the character set and R.
sessionInfo() gives
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.2-0.2
I have also attempted to let R use another setting by entering: Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In addition, I have attempted to make the change directly from the windows command prompt, using: chcp 65001 and variations of that, but that didn't change anything.
I noticed from searching the web that others have the issue as well, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs both under WinXP-x32 and under Win7-x86.
Is there a way to make R override the windows settings or can the issue be solved otherwise?
I have also tried other websites, and the issue occurs every time when there is an é, ü, ä, î, et cetera in the text-to-be-scraped.
Thank you,
Roger
Not quite an answer:
If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems.
#Convert factors to character
Islands <- as.data.frame(lapply(Islands, as.character), stringsAsFactors = FALSE)
iconv(Islands$Island, "windows-1252", "UTF-8")
Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities).
It is possible to simply strip out the offending characters, though this isn't ideal.
iconv(Islands$Island, "windows-1252", "ASCII", "")
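To show why plain stripping is not ideal, the sketch below compares it with decomposing accented letters first so that their base letters survive. It is written in Python rather than R, purely as an illustration of the idea. Both function names are hypothetical and neither is an equivalent of iconv.

```python
import unicodedata


def strip_non_ascii(text):
    """Drop every character outside ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def fold_to_ascii(text):
    """Decompose accented letters first so their base letters survive."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


name = "L\u0101na\u02bbi"  # Lānaʻi, written with escapes to avoid ambiguity
print(strip_non_ascii(name))
print(fold_to_ascii(name))
```

Plain stripping loses the ā entirely, while the decomposed version keeps an unaccented a. The ʻokina has no ASCII counterpart and is dropped either way, so both results are lossy.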
I was unable to replicate the error; however, looking at the help files is useful:
Sys.setlocale("LC_TIME", "de") # Solaris: details are OS-dependent
Sys.setlocale("LC_TIME", "de_DE.utf8") # Modern Linux etc.
Sys.setlocale("LC_TIME", "de_DE.UTF-8") # ditto
Sys.setlocale("LC_TIME", "de_DE") # OS X, in UTF-8
Sys.setlocale("LC_TIME", "German") # Windows
On Windows you should use names like "English" or "Dutch_Netherlands.1252" to change these settings.
I tried to replicate your state
> Sys.setlocale("LC_ALL","Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
> Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
However, I do not get the funny characters in the console. In my own locale the ʻ was displayed differently, but all functionality remained.
> Islands[1,1]
[1] Hawaiʻi[27]
8 Levels: Hawaiʻi[27] Kahoʻolawe[34] Kauaʻi[30] Lānaʻi[32] Maui[28] ... Oʻahu[29]
And these funny characters can be read without trouble and matched in the table.
> Encoding(as.character("Hawaiʻi"))
[1] "UTF-8"
> Encoding(as.character(Islands[1,1]))
[1] "UTF-8"
> grep("Hawaiʻi", as.character(Islands[1,1]))
[1] 1
If you still have problems, the cause probably lies elsewhere. However, to change the locale under Windows you have to use different names than on Linux or OS X (see your own locale info, for example). On Windows, "Dutch" is probably enough.
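Because locale names differ between operating systems, one portable approach is to try several spellings and keep the first one the OS accepts. The sketch below uses Python's `locale` module as a stand-in for R's Sys.setlocale. The helper name and the candidate list are assumptions for illustration.

```python
import locale


def first_supported_locale(candidates, category=locale.LC_ALL):
    """Set and return the first locale name the OS accepts, else None."""
    for name in candidates:
        try:
            locale.setlocale(category, name)
            return name
        except locale.Error:
            continue
    return None


# Spellings in the styles used on Linux, OS X and Windows; "C" is the fallback.
chosen = first_supported_locale(
    ["nl_NL.UTF-8", "nl_NL", "Dutch_Netherlands.1252", "Dutch", "C"]
)
print(chosen)
```

Which name is chosen depends on the machine it runs on, and the call changes the locale of the running process as a side effect.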