I'm unable to correctly pass UTF-8 string values as arguments to command line apps.
Approaches I've tried:
pass the value between double quotes: "café"
pass with single quotes: 'café'
use the char code: 'caf\233'
use a $ sign before the string: $'café'
I'm using Mac OS 10.10, iTerm and my current locale outputs:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
It is doubtful this has anything to do with the shell. I would make sure that your tools (both the writer tools and whatever you're reading with) correctly deal with UTF-8 at all. I would most suspect that whatever you're reading your tags with is interpreting and printing this as Latin-1. You should look inside the file with a hex editor and look for the tag. I'm betting it will be correct (C3 82, which is é in UTF-8, and é in Latin-1). Your output tool is probably the problem, not the writer (and definitely not the shell).
If your reading tool demands Latin-1, then you need to encode é as E9. The iconv tool can be useful in making those conversions for scripts.
Related
When I use unicode 6.0 character(for example, 'beer mug') in Bash(4.3.11), it doesn't display correctly.
Just copy and paste character is okay, but if you use utf-16 hex code like
$ echo -e '\ud83c\udf7a',
output is '??????'.
What's the problem?
You can't use UTF-16 with bash and a unix(-like) terminal. Bash strings are strings of bytes, and the terminal will (if you have it configured correctly) be expecting UTF-8 sequences. In UTF-8, surrogate pairs are illegal. So if you want to show your beer mug, you need to provide the UTF-8 sequence.
Note that echo -e interprets unicode escapes in the forms \uXXXX and \UXXXXXXXX, producing the corresponding UTF-8 sequence. So you can get your beer mug (assuming your terminal font includes it) with:
echo -e '\U0001f37a'
I am dealing with some multilingual data(English and Arabic) in a json file with a weird character i am not able to parse. I am not sure what the character is. I tried getting the ASCII value via vim and this is what i got
"38 0x26"
This is the status line in vim i used to get the value (http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character).
:set statusline=%<%f%h%m%r%=%b\ 0x%B\ \ %l,%c%V\ %P
This is how the character looks in vim -
I tried 'sed' and '.gsub' to replace this character unsuccessfully.
Is there a way where i can replace this character(preferably with .gsub ruby) with '&' or something else?
Thanks
try with something like
sed 's/[[:alpnum:][:space:]\[\]{}()\.\*\\\/_(AllAsciiVariationYouWant)/&/g;t
s/./?/g' YourFile
where (AllAsciiVariationYouWant) is all character that you want to keep as is (without the surrounding "()" )
JSON is encoded in UTF-8 (Unicode). If you're seeing funky-looking characters in your file, it's probably because your editor is not treating Unicode characters properly. That could be caused by the use of a terminal emulator that doesn't support Unicode; an incorrect $LANG setting; vim not being able to correctly determine the encoding of the file; and likely other reasons.
What terminal program are you using? What's your $LANG environment variable set to (echo $LANG)? If you're certain your terminal supports Unicode, try:
LANG=en_US.utf-8 vim your_file_here.json
(The above example assumes that U.S. English is appropriate for the file, which it may not be.)
As for replacing characters in the file, vim's substitution command can be used:
:%s/old text/new text/g
The above command will run the substitute command on all lines in the file (%), replacing every instance of "old text" with "new text". (The g at the end tells vim to replace every instance on a line, not just the first it finds.)
In the ruby file:
p __ENCODING__
#<Encoding:US-ASCII>
In vim:
set encoding?
encoding=utf-8
This is causing me grief (http://stackoverflow.com/questions/14495486/ruby-syntax-error-with-multiple-language-in-hash), which is patched but I still don't understand why the file shows as ASCII by ruby and utf-8 by vim.
As #melpomene commented, :set encoding tells you what encoding is used internally by Vim.
:set fileencoding will tell you what encoding Vim decided to use for your document. The possible values are given by the fileencodings option. ASCII is not part of the default list as it's usually handled transparently by the other encodings listed.
But that part of your question is puzzling me:
but I still don't understand why the file is ASCII
because it looks like you actively want that file to be treated as ASCII by the interpreter.
Anyway, that encoding directive is only used by Ruby: it doesn't mean that the file is actually encoded as ASCII or that Vim is supposed to care about it and treat it in a special way.
In short, whether your file is actually encoded in ASCII or not, Vim doesn't care.
So… what do you want exactly? That vim sets its fileencoding option to ASCII when you open a supposedly ASCII file? That your supposedly ASCII file be converted to another encoding?
edit
With that directive, you explicitely tell Ruby that the file's content must be treated as ASCII and Ruby says "OK, that's ASCII, if you say so.".
This directive doesn't change anything to the actual encoding of the file. It could be utf-8, latin1 or whatever.
Vim doesn't understand that directive.
Vim chooses the encoding it uses for that file according to a number of rules you should read about in :h encoding, :h fileencoding and :h fileencodings.
Vim doesn't treat ASCII in a special "ASCII" way, it just handles it has the subset of utf-8 that it is.
So, before we go further, please verify:
the encoding of the file with something like $ file /path/to/file
the fileencoding Vim uses for that file with :set fileencoding
I've got a Perl program that I wrote on Windows. It starts with:
$unused_header = <STDIN>;
my #header_fields = split('\|\^\|', $unused_header, -1);
Which should split input that consists of a very large file of:
The|^|Quick|^|Brown|^|Fox|!|
Into:
{The, Quick, Brown, Fox|!|}
Note: This line just does the headre alone, theres another one like it to do the repetitive data lines.
It worked great on windows, but on linux it fails. However, if I define a string with the same contents within Perl, and run the split on that, it works fine.
I think it's a UTF-16 encoding handling issue, but I'm not sure how to handle it. Does anyone know how I can get perl to understand the UTF-16 being piped into STDIN?
I found: http://www.haboogo.com/matching_patterns/2009/01/utf-16-processing-issue-in-perl.html but I'm not sure what to do with it.
If STDIN is UTF-16, use one of the following
binmode(STDIN, ':encoding(UTF-16le)'); # Byte order used by Windows.
binmode(STDIN, ':encoding(UTF-16be)'); # The other byte order.
binmode(STDIN, ':encoding(UTF-16)'); # Use BOM to determine byte order.
Tom has written a lengthy answer with regards to perl and unicode. It contains some bolierplate code to properly and fully support UTF-8, but you can replace with UTF-16 as needed.
I doubt it's a UTF-xx encoding issue, as neither Windows Perl nor Unix Perl will try to read data with those encodings unless you tell it to.
If the Unix script is reading the exact same file as the Windows script but behaves differently, maybe it's a line-ending issue. The dos2unix command on most Unix-y systems can change the line endings on a file, or you can strip off the line-endings yourself in the Perl script
$unused_header = <STDIN>;
$unused_header =~ s/\r?\n$//; # chop \r\n (Windows) or \n (Unix)
I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...
What's your locale set to in the shell? In Linux-based systems you can check this by running the locale command and change it by e.g.
$ export LANG=en_US
My guess is that you are using locale settings which have UTF-8 encoding and this is causing Ruby to assume that the text files were created according to utf-8 encoding rules. You can see this by trying
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)