Lynx UTF-8 support

I am using Lynx on OS X 10.11. However, it does not print UTF-8 for non-ASCII characters, but rather either an ASCII representation of them, or the ef bf bd "replacement" character (?).
I have been studying this guide for help.
The output from the locale command:
locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
When I run Lynx with
lynx http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
here is what the display looks like:
According to the posts in the article, Lynx should print UTF-8 properly.
lynx -dump ... prints the same.
(running export LC_ALL="en_US.UTF-8" doesn't help either.)
What is strange is that if I run it with the -mime_header argument, e.g.:
lynx -mime_header http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
it prints the characters properly (albeit as a dump rather than in the interactive browser):
EDIT:
I forgot to mention that -assume_charset=utf8 and -assume_unrec_charset=utf8 don't help either.
EDIT:
I am able to get the output I want by hard-coding CHARACTER_SET in lynx.cfg. This seems like a workaround, though, since the documentation states:
# ... The 'o'ptions menu setting will be stored in the user's RC
# file whenever those settings are saved, and thereafter will be used as the
# default. ...
However, the setting only persists for the session in which it is set. That won't work for me, as I am primarily using lynx -dump in a script. But since I use UTF-8 almost exclusively, I guess I can live with the hard-coded setting for now.

I think you should use
lynx -dump --display_charset=utf-8
rather than hard-coding the setting in the config file, so:
lynx --display_charset=utf-8 http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
Alternatively, check out
https://www.brow.sh/
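For scripted use, the option can be wrapped once so that every call gets the same display charset. This is only a minimal sketch, assuming a POSIX shell: the function name lynx_utf8_dump and the LYNX override variable are hypothetical (LYNX is not a Lynx feature) and exist here only so the resulting command line can be inspected without network access.

```shell
# Hypothetical wrapper: dump a page as UTF-8 text without editing lynx.cfg.
# LYNX is an assumed override hook for dry runs; it defaults to "lynx".
lynx_utf8_dump() {
    "${LYNX:-lynx}" -dump --display_charset=utf-8 "$@"
}
```

A script would call lynx_utf8_dump URL > page.txt. Setting LYNX=echo beforehand prints the arguments that would be passed instead of running Lynx.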

Related

Displaying Telugu on Terminal or iterm2 applications of Mac OS

Terminal and iTerm2 applications on my Mac don't display Telugu characters properly. The characters get jumbled up. I see the same issue with other languages such as Kannada and Sanskrit. Some characters look fine, but others are jumbled (as if one character were superimposed on another).
I set my text-encoding of Terminal to utf-8, did export LC_CTYPE=en_US.UTF-8 as suggested by other answers but nothing seems to work. Here is my locale:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Enabling the "double width" setting did not solve the problem either. I also checked "set locale environment variable on startup". That did not work either.
Note that the characters are displayed properly in other applications such as browsers, word processors, etc. So the problem is local to terminal apps like Terminal and iTerm2.
This is how the word "Telugu" is being displayed

Why is wget printing all messages in Russian when run from the terminal in PyCharm?

For some reason, wget has started printing all of its messages in Russian, but only when run from within PyCharm's terminal. Why is this happening and how can I change it back to English?
I am on OSX 10.13, and am using wget 1.19.4_1 installed using Homebrew. I have used wget on this computer before, and the text was in English. I cannot understand Russian, so nothing on this computer has ever been set to use Russian.
When I run locale, this is the result:
LANG=
LC_COLLATE="C"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Here's what I've found so far:
This only happens for terminals in JetBrains IDEs -- I have tried both PyCharm and JGrasp, and they are both affected. When run in the OSX Terminal app, wget outputs English text. It would still be nice to know why this is happening and how to fix it.
This problem seems to affect only wget.
Reinstalling wget using brew does not seem to have any effect.
There is no en_US locale in /usr/local/Cellar/wget/1.19.4_1/share/locale/, but I do not know if this could be the cause of my problem.
Copying .../en_GB/ (where ... is wget's locale folder) into a new folder called .../en_US/ does not get rid of the Russian text. Nor does replacing .../ru/ with .../en_GB/. I'm not sure exactly how a locale is defined, so this may or may not mean anything.
This just happened to me, except in French.
My fix: force the language values in the terminal:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
This made wget happy; I've now added them to ~/.profile to see whether that keeps the problem away for good.
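Setting LC_ALL works because of the order in which the locale variables are consulted: under POSIX, LC_ALL wins over the category variable (LC_MESSAGES for program messages), which in turn wins over LANG. The following is a minimal sketch of that precedence only. The function name is hypothetical, and it does not model GNU gettext's additional LANGUAGE variable or any language preference the operating system may supply when the variables are empty.

```shell
# Hypothetical helper: report which variable decides the message locale,
# following the POSIX precedence LC_ALL > LC_MESSAGES > LANG.
# Without arguments it inspects the current environment; three optional
# arguments (LC_ALL, LC_MESSAGES, LANG) allow what-if checks.
message_locale_source() {
    if [ "$#" -eq 0 ]; then
        set -- "${LC_ALL:-}" "${LC_MESSAGES:-}" "${LANG:-}"
    fi
    if [ -n "${1:-}" ]; then
        echo "LC_ALL=$1"
    elif [ -n "${2:-}" ]; then
        echo "LC_MESSAGES=$2"
    elif [ -n "${3:-}" ]; then
        echo "LANG=$3"
    else
        echo "unset"
    fi
}
```

Running message_locale_source in both the IDE terminal and the regular terminal shows whether the two environments differ in the variable that takes effect.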
jGRASP is using a pty to connect to external programs, and some other IDEs are probably doing the same. So it's possible that new ptys created the way these IDEs are doing it are defaulting to the wrong locale.
From jGRASP or another IDE you can compile and run a Java or C program to print the default locale to verify this. For Java it is java.util.Locale.getDefault() .
The source for the jGRASP pty connection is in .../jGRASP.app/Contents/Resources/jgrasp/src/wedge.c . You can see how the pty is created there. You can also modify and recompile to .../jgrasp/jbin/osx_run if it will help to trace down the problem. Compiling on OS X requires no special parameters, just: cc -o ../jbin/osx_run wedge.c .

How do I get accented letters to actually work on bash?

My bash installation on cygwin doesn't handle accented letters properly. I tried adding
set input-meta on # to accept 8-bit characters
set output-meta on # to show 8-bit characters
set convert-meta on # to show it as character, not the octal representation
to my .inputrc, but this doesn't quite work yet. Indeed, if I type
$ echo ù
then before I press Enter it is automatically changed to
$ echo \303
although the output is right, for I get
$ echo \303
ù
I get the same result for any other accented letter. Usually, though, I use a non-Italian keyboard, and I use AutoHotkey to replace a letter followed by an apostrophe with the accented letter. In that case, the accented letters are replaced with \302 and print garbage that depends on the letter: 3y for ù, ¢ for ò, and nothing for everything else.
How do I get all this to make sense?
EDIT: my locale settings, cygwin version and terminal are the following
$ uname -a
CYGWIN_NT-6.1-WOW64 ferdi-Asus 1.7.17(0.262/5/3) 2012-10-19 14:39 i686 Cygwin
$ locale
LANG=it_IT.UTF-8
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES="it_IT.UTF-8"
LC_ALL=
$ tty
/dev/pty1
I'm invoking it by simply clicking the Cygwin Terminal link. It points to
C:\cygwin\bin\mintty.exe -i /Cygwin-Terminal.ico -
The relevant part of the autohotkey script is the following
#NoEnv ; Recommended for performance and compatibility with future AutoHotkey releases.
SendMode Input ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir% ; Ensures a consistent starting directory.
...
::avra'::avrà
::avro'::avrò
...
To get accented letters on bash via Cygwin using Mintty 1.1.2 just do the following:
Go to the menu (if you don't see any menu, right click on your Terminal).
Click Options....
Click Text.
Change the Locale to C.
Change the Character set to ISO-8859-1 (Western European).
Then test it:

Unicode (utf-8) with git-bash

I'm having some trouble getting Unicode to work in git-bash (on Windows 7). I have tried many things without success, although I'm not quite sure what is responsible for this, so I might be working in the wrong direction.
It really seems this should be possible as the encoding for cmd.exe can be changed to unicode with 'chcp 65001'.
Here are some things I've tried (besides the obvious of looking through the configuration options in the GUI).
Setting environment variables in '.bashrc'. I guess it makes sense that this doesn't work, since I think it's a Linux thing. The 'locale' command does not exist.
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
Starting out in cmd.exe, changing the encoding to Unicode with 'chcp 65001', and then starting git-bash. This gives me a "Permission denied" error when I try to cat my Unicode test file, although catting a file without Unicode works just fine. As demonstrated, after dropping back out to cmd.exe I can still "cat" the file. With my default encoding (437) I can cat the file in bash (no "Permission denied", but the output is garbled).
S:\>chcp 65001
Active code page: 65001
S:\>"C:\Program Files (x86)\Git\bin\sh.exe" --login -i
zarac@TOWELIE /z
cat /s/unicode.txt
cat: write error: Permission denied
zarac@TOWELIE /z
cat /s/nounicode.txt
abc
zarac@TOWELIE /z
L /s/unicode.txt
-rw-r--r-- 1 zarac Administ 7 May 18 10:30 /s/unicode.txt
zarac@TOWELIE /z
whoami
towelie\zarac
zarac@TOWELIE /z
exit
Z:\>type S:\unicode.txt
abc£
Using the /U flag when starting the shell (it makes sense that this doesn't work, because that's not quite what the flag is for, if I understand correctly, but it has to do with Unicode so I tried it).
C:\Windows\SysWOW64\cmd.exe /U /C "C:\Program Files (x86)\Git\bin\sh.exe" --login -i
As I prefer to use Console2, I've tried adding a DWORD value named CodePage with the value 65001 (decimal) to the Windows registry under [HKEY_CURRENT_USER\Console] as well as [HKEY_CURRENT_USER\Console\Git Bash]. This seems to have the same effect as 'chcp 65001', except that it's "automatic". (http://stackoverflow.com/questions/379240/is-there-a-windows-command-shell-that-will-display-unicode-characters)
JPSoft's TCC/LE
PowerCMD
stackoverflow
duckduckgo
ixquick / google
So, method 2 seems viable if that permission issue can be fixed. However, I'm open to pretty much any solution, although I'd prefer one that lets me use Console2 (mostly due to its nifty tab feature). Perhaps one solution would be to set up an SSH server and then use Putty/Kitty to connect to it, but that's just wrong! ; )
PS. Is there any official documentation for git-bash?
I faced the same issue in MSYS Git 2.8.0 and, as it turned out, it only required a configuration change.
$ git --version
git version 2.8.0.windows.1
The default configuration of Git Bash console in my system did not show Greek filenames.
$ cd ~
$ ls
AppData/
'Application Data'@
Contacts/
Cookies@
Desktop/
Documents/
Downloads/
Favorites/
Links/
'Local Settings'@
NTUSER.DAT
.
.
.
''$'\316\244\316\261'' '$'\316\255\316\263\316\263\317\201\316\261\317\206\316\254'' '$'\316\274\316\277\317\205'@
The last line should display "Τα έγγραφά μου", the Greek translation of "My Documents". To fix it I followed the steps below:
1. Check your existing locale configuration:
$ locale
LANG=en
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
As shown above, in my case it was not UTF-8.
2. Change the locale to a UTF-8 encoding. Click the icon on the left side of the MINGW title bar, select "Options", and in the "Text" category choose the "UTF-8" character set. You should also choose a Unicode font, such as the default "Lucida Console".
3. Change the language for the current window (there is no need to do this in future windows, as they will be created with the settings from step 2):
$ LANG='C.UTF-8'
4. The ls command should now display the name properly:
AppData/
'Application Data'@
Contacts/
Cookies@
Desktop/
Documents/
Downloads/
Favorites/
Links/
'Local Settings'@
NTUSER.DAT
.
.
.
'Τα έγγραφά μου'@
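The escaped name in the first listing was never corrupted: each \NNN escape is one byte written in octal, and together the bytes form the UTF-8 encoding of the Greek text. A small sketch that decodes the first word, assuming a POSIX printf and a terminal that is already set to UTF-8 (the function name is hypothetical):

```shell
# Octal escapes copied from the first word of the escaped file name.
# \316\244 is U+03A4 (Τ) and \316\261 is U+03B1 (α) in UTF-8.
first_word() {
    printf '\316\244\316\261\n'
}

first_word   # prints Τα on a UTF-8 terminal
```

If this prints the Greek letters, the terminal side is fine and only the quoting done by ls (driven by the locale) was hiding the name.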
Found this answer elsewhere:
chcp.com 65001
Git bash chcp windows7 encoding issue
That's what actually solved it for me.
As CharlesB said in a comment, msysgit 1.7.10 handles unicode correctly. There are still a few issues but I can confirm that updating did solve the issue I was having.
See: https://github.com/msysgit/msysgit/wiki/Git-for-Windows-Unicode-Support
Check if the issue persists with Git 2.1 (August 2014).
See commit 617ce96 or commit 1c950a5 by Karsten Blees (kblees)
Win32: support Unicode console output
WriteConsoleW seems to be the only way to reliably print unicode to the
console (without weird code page conversions).
Also redirects vfprintf to the winansi.c version.
Win32: add Unicode conversion functions
Add Unicode conversion functions to convert between Windows native UTF-16LE encoding to UTF-8 and back.
To support repositories with legacy-encoded file names, the UTF-8 to UTF-16 conversion function tries to create valid, unique file names even for invalid UTF-8 byte sequences, so that these repositories can be checked out without error.
It is likely to be a port of something already integrated in msysgit, but at least that means the Windows version of Git won't have to diverge/patch from the main Git repo source code in order to include those improvements.
I can see that there are some problems with character encoding in Git Bash for Windows, though less so when working with Git itself and the tools it ships with (curl, cat, grep, etc.). Over the years I haven't run into encoding-related problems with those.
Normally these problems get resolved a little better with each new version. E.g., with the version from a year ago I couldn't enter characters like "ä" into the shell, so it was not possible to write
echo "ä"
to quickly test whether UTF-8 is supported, and at which level. A workaround is to write the byte sequence in octal:
$ echo -e "\0303\0244"
ä
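Since echo -e is not portable across shells, printf with octal escapes is a safer way to emit the same bytes, and piping through od shows what was actually produced regardless of how the terminal renders it. A minimal sketch, assuming a POSIX shell with od and tr available; the function name is hypothetical:

```shell
# Print the output of the given printf format string as lowercase hex bytes.
hex_bytes() {
    printf "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# \303\244 is the two-byte UTF-8 encoding of "ä" (U+00E4).
hex_bytes '\303\244'   # prints c3a4
```

Comparing the hex output with what the terminal displays separates "wrong bytes were produced" from "right bytes were rendered wrongly".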
I still have issues when I run my Windows php.exe binary to output text:
$ php -r 'echo "\xC3\xA4";'
ä
This does not give the "ä" in the terminal; it outputs "├ñ" instead. My workaround is to wrap the php command in a bash script that passes the output through cat:
#!/bin/bash
# Pass php's stdout and stderr through separate cat processes, keeping the two streams apart.
{ php.exe "$@" 2>&1 1>&3 | cat 1>&2; } 3>&1 | cat
ref. reg. stdout + stderr cat
This then magically makes php work again:
$ php -r 'echo "\xC3\xA4";'
ä
Applies to
$ git --version
git version 1.9.4.msysgit.1
I must admit I lack a deeper understanding of why all this is the way it is. But I'm happy to have finally found a workaround for using php in Git Bash with UTF-8 support.
For me the solution was just to enable unicode support.
Docs: https://github.com/msysgit/msysgit/wiki/Git-for-Windows-Unicode-Support
git config --global core.quotepath off
I found the following steps helpful:
Run Git Bash
Right-click and select Options...
Select Text group at the left
Change Font to Consolas
Select C as Locale and UTF-8 as Character set
Apply and Save.
In the terminal execute:
git config --global core.quotepath false
In rare cases, execute in the terminal as well:
export LANG='C.UTF-8'
The problem with chcp 65001 is that there are bugs in the C runtime (MSVCRT) that make stdio calls return inconsistent results when run under code page 65001.
That should be better with Git 2.23 (Q3 2019)
See commit 090d1e8 (03 Jul 2019) by Karsten Blees (kblees).
(Merged by Junio C Hamano -- gitster -- in commit 0328db0, 11 Jul 2019)
gettext: always use UTF-8 on native Windows
On native Windows, Git exclusively uses UTF-8 for console output (both with MinTTY and native Win32 Console).
Gettext uses setlocale() to determine the output encoding for translated text, however, MSVCRT's setlocale() does not support UTF-8.
As a result, translated text is encoded in system encoding (as per GetACP()), and non-ASCII chars are mangled in console output.
Side note: There is actually a code page for UTF-8: 65001.
In practice, it does not work as expected at least on Windows 7, though, so we cannot use it in Git. Besides, if we overrode the code page, any process spawned from Git would inherit that code page (as opposed to the code page configured for the current user), which would quite possibly break e.g. diff or merge helpers.
So we really cannot override the code page.
In init_gettext_charset(), Git calls gettext's bind_textdomain_codeset() with the character set obtained via locale_charset(); Let's override that latter function to force the encoding to UTF-8 on native Windows.
In Git for Windows' SDK, there is a libcharset.h and therefore we define HAVE_LIBCHARSET_H in the MINGW-specific section in config.mak.uname, therefore we need to add the override before that conditionally-compiled code block.
Rather than simply defining locale_charset() to return the string "UTF-8", though, we are careful not to break LC_ALL=C: the ab/no-kwset patch series, for example, needs to have a way to prevent Git from expecting UTF-8-encoded input.
And:
See commit 697bdd2 (04 Jul 2019), and commit 9423885, commit 39a98e9 (27 Jun 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit 0a2ff7c, 11 Jul 2019)
mingw: use Unicode functions explicitly
Many Win32 API functions actually exist in two variants: one with the A suffix that takes ANSI parameters (char * or const char *) and one with the W suffix that takes Unicode parameters (wchar_t * or const wchar_t *).
The ANSI variant assumes that the strings are encoded according to whatever is the current locale.
This is not what Git wants to use on Windows: we assume that char * variables point to strings encoded in UTF-8.
There is a pseudo UTF-8 locale on Windows, but it does not work as one might expect. In addition, if we overrode the user's locale, that would modify the behavior of programs spawned by Git (such as editors, difftools, etc), therefore we cannot use that pseudo locale.
Further, it is actually highly encouraged to use the Unicode versions
instead of the ANSI versions, so let's do precisely that.
Note: when calling the Win32 API functions without any suffix, it depends whether the UNICODE constant is defined before the relevant headers are #include'd.
Without that constant, the ANSI variants are used.
Let's be explicit and avoid that ambiguity.

OS X Terminal UTF-8 issues

Okay, so I finally got myself a MacBook Air after 15 years of Linux. Before I got it, my big concern was UTF-8 support, because whenever I get files sent to me from Windows or Mac clients there are always encoding issues, while on Ubuntu I can be sure that every program will produce perfectly UTF-8-encoded output.
And now, on my second day (today) with OS X, I'm tearing my hair out in frustration. Why?
When I open nano and type some Swedish characters like ÅÄÖ, it puts blank characters at the end of the line (which I guess is the other byte of each character).
When I open Python and try using Swedish characters, it does not output anything at all.
When I connect to an Ubuntu server through SSH, I can't type åäö in bash, though it works in Vim (still through SSH). And in nano, backspace does not work; if I check the box "Delete sends ctrl+H" in the Terminal preferences, backspace starts working in nano but stops working in Vim.
I've tried unchecking all encodings other than UTF-8 in the Terminal preferences, but that does not seem to work either.
I'm sure that every non-US person must have the same issues, so how do I fix them? I just want full UTF-8 support... :'(
For me, this helped:
I checked locale on my local shell in terminal
$ locale
LANG="cs_CZ.UTF-8"
LC_COLLATE="cs_CZ.UTF-8"
Then I connected via ssh to each remote host I use and edited /etc/profile as root, adding this line at the end:
export LANG=cs_CZ.UTF-8
After the next connection it works fine in bash, ls and nano.
Go to Terminal -> Preferences -> Advanced (tab), scroll down to International, and select Unicode (UTF-8) as Character Encoding.
And tick Set locale environment variables on startup.
Unfortunately, the Preferences dialog is not always very helpful, but by tweaking around you should be able to get everything working.
To be able to type Swedish characters in Terminal, add the following lines to your ~/.inputrc (most likely you must create this file):
set input-meta on
set output-meta on
set convert-meta off
This should do the job with both UTF-8 and other encodings in bash, nano and many other programs. Some programs, like tmux, also depend on the locale. In that case, adding for instance export LC_ALL=en_US.UTF-8 to your ~/.profile should help, but keep in mind that a few (mainly obscure) programs require a standard locale, so if you have trouble running or compiling a program, try going back to LC_ALL=C.
Some references that may be helpful:
http://homepage.mac.com/thgewecke/mlingos9.html#unicode
http://hints.macworld.com/article.php?story=20060825071728278
The following is a summary of what you need to do under OS X Mavericks (10.9). This is all summarized in
http://hints.macworld.com/article.php?story=20060825071728278
1. Go to Terminal->Preferences->Settings->Advanced.
Under International, make sure the character encoding is set to Unicode (UTF-8).
Also, and this is key: under Emulation, make sure that Escape non-ASCII input with Control-V is unchecked (i.e. is not set).
These two settings fix things for Terminal.
2. Make sure your locale is set to something that ends in .UTF-8. Type locale and look at the LC_CTYPE line. If it doesn't say something like en_US.UTF-8 (the stuff before the dot might change if you are using a non-US-English locale), then in your Bash .profile or .bashrc in your home directory, add a line like this:
export LC_CTYPE=en_US.UTF-8
This will fix things for command-line programs in general.
3. Add the following lines to .inputrc in your home directory (create it if necessary):
set meta-flag on
set input-meta on
set output-meta on
set convert-meta off
This makes Bash eight-bit clean, so it will pass UTF-8 characters in and out without messing with them.
Keep in mind you will have to restart Bash (e.g. close and reopen the Terminal window) to get it to pay attention to all the settings you make in 2 and 3 above.
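The locale check described above can also be scripted. This is a minimal sketch, assuming a POSIX shell; the function name is hypothetical. It applies the usual precedence (LC_ALL, then LC_CTYPE, then LANG) and accepts the common spellings of the codeset suffix.

```shell
# Hypothetical helper: succeed if the effective character-type locale
# names a UTF-8 codeset. Precedence: LC_ALL, then LC_CTYPE, then LANG.
ctype_is_utf8() {
    effective=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
    case $effective in
        *.UTF-8|*.utf8|*.UTF8|*.utf-8) return 0 ;;
        *) return 1 ;;
    esac
}

if ctype_is_utf8; then
    echo "UTF-8 locale in effect"
else
    echo "not UTF-8: consider export LC_CTYPE=en_US.UTF-8"
fi
```

The check only looks at the variable values; it does not verify that the named locale is actually installed on the system.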
Short, versatile answer (it also works for other national languages, even Lithuanian or Russian):
Open Terminal.
Edit .profile in your home directory with nano .profile (on Catalina or newer, nano .zshenv).
Add the line export LC_ALL=en_US.UTF-8
Press Ctrl+X and then Y (exit and save).
This fixed even the rare national characters of a small country for me. You may need to close and reopen Terminal for the changes to take effect.
Also, if you like the Linux behavior (you use a lot of Alt shortcuts like Alt+. or Alt+, in mc), then you should disable the Mac-style Option key function:
Terminal->Preferences->Profiles->Keyboard and check box:
Use Option as Meta key
To make nano work as you want it to, try:
export LANG="UTF-8"
Or get a newer version of nano via MacPorts:
# cf. http://www.macports.org/install.php
port info nano
port variants nano
sudo port install nano +utf8 +color +no_wrap
With respect to ssh and UTF-8 issues, comment out SendEnv LANG LC_* in /etc/ssh_config.
See: Terminal in OS X Lion: can't write åäö on remote machine
My terminal was just acting silly, not printing out åäö. I found (and set) this setting:
Under Terminal -> Preferences... -> Profiles -> Advanced.
Seems to have fixed my problem.
Check whether nano was actually built with UTF-8 support, using nano --version. Here it is on Cygwin:
nano --version
GNU nano version 2.2.5 (compiled 21:04:20, Nov 3 2010)
(C) 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
2008, 2009 Free Software Foundation, Inc.
Email: nano@nano-editor.org Web: http://www.nano-editor.org/
Compiled options: --enable-color --enable-extra --enable-multibuffer
--enable-nanorc --enable-utf8
Note the last bit.
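The same check can be scripted. This is a minimal sketch, assuming a nano build that lists --enable-utf8 among its compiled options as in the output above; builds that report their options differently would need another pattern. The function name is hypothetical.

```shell
# Hypothetical helper: read `nano --version` output on stdin and succeed
# if the compiled options include --enable-utf8.
has_utf8_option() {
    grep -q -e '--enable-utf8'
}

# Usage: nano --version | has_utf8_option && echo "nano has UTF-8 support"
```

Reading from stdin keeps the helper independent of where nano is installed, so it can also be fed saved output from another machine.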
Since nano is a terminal application, I guess this is more a terminal problem than a nano problem.
I ran into similar problems on OS X (I could not enter or view Chinese characters in the terminal).
I tried tweaking the system settings through the OS X UI, whose real effect is to change the LANG environment variable.
So in the end I just added the following to ~/.bashrc to fix the problem.
# I'm Chinese and I prefer English manual
export LC_COLLATE="zh_CN.UTF-8"
export LC_CTYPE="zh_CN.UTF-8"
export LC_MESSAGES="en_US.UTF-8"
export LC_MONETARY="zh_CN.UTF-8"
export LC_NUMERIC="zh_CN.UTF-8"
export LC_TIME="zh_CN.UTF-8"
BTW, don't set LC_ALL, because it overrides all the other LC_* settings.
Try:
Installing a Powerline-compatible font: https://github.com/powerline/fonts
Setting these environment variables in .zshrc or .bashrc:
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
Just add (or edit) a file on the remote server:
$ sudo nano /etc/environment
LANG=en_US.utf-8
LC_ALL=en_US.utf-8
PS: The top answer suggests changing the /etc/profile file on the remote server. That works, but this file is often overwritten by the system, so it doesn't help for long.
The /etc/profile file contains this disclaimer:
It's NOT a good idea to change this file unless you know what you are doing. It's much better to create a custom.sh shell script in /etc/profile.d/ to make custom changes to your environment, as this will prevent the need for merging in future updates.
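Following that advice, the setting could live in its own drop-in file. This is a minimal sketch only: the file name /etc/profile.d/locale.sh and the en_US.UTF-8 value are assumptions, and not every system sources /etc/profile.d/ for every kind of session.

```shell
# /etc/profile.d/locale.sh (hypothetical file name)
# Provide a UTF-8 default only when the session has no LANG yet, so a value
# passed in by the SSH client or set by the user is left alone.
if [ -z "${LANG:-}" ]; then
    LANG=en_US.UTF-8
    export LANG
fi
```

Because the file is separate from /etc/profile, a system update that replaces /etc/profile leaves it untouched.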
In my case, simply using the uxterm command instead of xterm solved the problem. It's available in /opt/X11/bin/uxterm by installing the XQuartz package provided by Apple.
