Lua: How to print a Latin1 string with io.write()? - Windows

In Lua 5.4, I tried to print some strings in Latin1 encoding with io.write(), but some characters (à, é...) are not printed correctly.
How can I fix this?
Here is a screenshot of the failed print with win-125x.lua

I guess you are running Lua on Windows.
Because you are converting Latin1 characters to UTF-8, you should set the Windows console codepage to UTF-8 before running your Lua script, with the following command:
chcp 65001
Another option is to save your script with UTF-8 encoding, which removes the need to convert strings from cp1252 to UTF-8, and to use the chcp command before running your script.
Remember that standard Lua has no concept of string encoding and that Windows support for UTF-8 characters in the console is incomplete. Hence this kind of problem.
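If you would rather keep the source file in cp1252 and convert at run time, here is a minimal sketch (assuming Lua 5.4 on Windows; the sample bytes 0xE0 and 0xE9 are à and é, and the gsub trick only covers the Latin1 range, since cp1252's extra characters at 0x80-0x9F would need a lookup table):

-- Latin1 code points below 256 map directly to Unicode,
-- so each byte can be re-encoded with utf8.char.
local function latin1_to_utf8(s)
  return (s:gsub(".", function(c) return utf8.char(c:byte()) end))
end

os.execute("chcp 65001 > nul")           -- switch the console to UTF-8 first
io.write(latin1_to_utf8("\xE0 \xE9\n"))  -- prints: à é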
Check this related question too: Problem with accents while copying files in LUA

If you have the utf8 table (standard since Lua 5.3), you can do...
> do io.write(utf8.char(8364):rep(3)..'\n'):flush() end
€€€
To get the code you can do...
> do io.write(utf8.codepoint('€')..'\n'):flush() end
8364
But I am not sure whether that works on Windows; I am on Linux.

Related

MBCS to UTF-8: How to encode in Python

I am trying to create a duplicate file finder for Windows. My program works well in Linux, but it writes NUL characters to the log file on Windows. This is due to Windows' default MBCS file system encoding, while the file system encoding on Linux is UTF-8. How can I convert MBCS to UTF-8 to avoid this error?
Tell Python to use UTF-8 for the log file. In Python 3 you do this with:
open(..., encoding='utf-8')
If you want to convert an MBCS string to UTF-8 you can switch string encodings:
filename.encode('mbcs').decode('utf-8')
Use filename.encode(sys.getdefaultencoding())... to make the code work on Linux as well.
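For the asker's scenario, a minimal sketch of that fix (the log file name is hypothetical):

import os

# Open the log explicitly as UTF-8 instead of relying on
# Windows' MBCS default, so filenames are written intact.
with open('duplicates.log', 'w', encoding='utf-8') as log:
    for dirname, _, filenames in os.walk('.'):
        for name in filenames:
            log.write(os.path.join(dirname, name) + '\n')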
Just change the encoding to 'latin-1' (encoding='latin-1').
Using pure Python:
open(..., encoding='latin-1')
Using Pandas:
pd.read_csv(..., encoding='latin-1')

Python 3 not aware of Windows filename encodings?

The following code works well in Win7 until it crashes in the last print(f). It crashes when it finds some "exotic" characters in the filenames, such as the French "œ" in œuvre and the "Č" in Karel Čapek. The program stops with an encoding error, saying character x in the filename isn't a valid UTF-8 char.
Shouldn't Python 3 be aware of the UTF-16 encoding of Windows 7 paths?
How should I modify my code?
import os

rootDir = '.'
extensions = ['mobi', 'lit', 'prc', 'azw', 'rtf', 'odt', 'lrf', 'fb2', 'azw3']
files = []
for dirName, subdirList, fileList in os.walk(rootDir):
    # Collect paths whose extension matches one of the wanted ones.
    files.extend(os.path.join(dirName, fn) for fn in fileList
                 if any(fn.endswith(ext) for ext in extensions))
for f in files:
    print(f)
eryksun answered my question in a comment. I copy his answer here so the thread doesn't stand as unanswered. The win-unicode-console module solved the problem:
Python 3's raw FileIO class forces binary mode, which precludes using a UTF-16 text mode for the Windows console. Thus the default setup is limited to using an OEM/ANSI codepage. To avoid raising an exception, you'd have to use a less-strict 'replace' or 'backslashreplace' mode for sys.stdout. Switching to codepage 65001 (UTF-8) seems like it should be the answer, but the console host (conhost.exe) has problems with multibyte encodings. That leaves the UTF-16 wide-character API, such as via the win-unicode-console module.
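A minimal sketch of the 'backslashreplace' workaround the quote mentions (sys.stdout.reconfigure needs Python 3.7+; on the Pythons of that era you would rewrap sys.stdout or use the module named above):

import sys

# Escape characters the console codepage cannot encode instead of
# raising UnicodeEncodeError when printing filenames.
sys.stdout.reconfigure(errors='backslashreplace')
print('Karel \u010Capek')  # on a cp850 console: Karel \u010capek

# The wide-character alternative named in the answer:
#   import win_unicode_console
#   win_unicode_console.enable()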

Convert UTF-8 to CP1252 in Ubuntu with PHP or bash shell

I have a question about converting UTF-8 to CP1252 on Ubuntu with PHP or the shell.
Background: convert a CSV file from UTF-8 to CP1252 on Ubuntu with PHP or the shell, copy the file from Ubuntu to Windows, and open it with Notepad++.
Environment:
Ubuntu 10.04
PHP 5.3
a CSV file with letters (œ, à, ç)
Methods used:
With PHP
iconv("UTF-8", "CP1252", "content of file")
or
mb_convert_encoding("content of file", "CP1252", "UTF-8")
If I check the generated file with
file -i name_of_the_file
It displayed:
name_of_the_file: text/plain; charset=iso-8859-1
I copied this converted file to Windows and opened it with Notepad++; in the bottom-right corner, the encoding shown is ANSI.
And when I changed the encoding from ANSI to Windows-1252, the special characters were displayed correctly.
With Shell
iconv -f UTF-8 -t CP1252 name_of_the_file
The rest is the same.
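Put together, a minimal PHP sketch of the whole conversion (file names are hypothetical; //TRANSLIT approximates any characters CP1252 lacks):

<?php
// Read the UTF-8 CSV, transcode it, and write the CP1252 result.
$utf8   = file_get_contents('input_utf8.csv');
$cp1252 = iconv('UTF-8', 'CP1252//TRANSLIT', $utf8);
file_put_contents('output_cp1252.csv', $cp1252);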
Questions:
1. Why does the file command report ISO-8859-1 rather than CP1252 or ANSI?
2. Why were the special characters displayed correctly when I changed the encoding from ANSI to Windows-1252?
Thank you in advance!
1.
CP1252 and ISO-8859-1 are very similar; quite often a file encoded in one of them looks identical to the same file encoded in the other. See Wikipedia for which characters are in Windows-1252 but not in ISO-8859-1.
The letters à and ç are encoded identically in both encodings. While ISO-8859-1 doesn't have œ and CP1252 does, file might have missed that: AFAIK it doesn't analyse the entire file.
2.
"ANSI" is a misnomer used for the default non-Unicode encoding in Windows. In case of Western European languages, ANSI means Windows-1252. In case of Central European, it's Windows-1250, in case of Russian it's Windows-1251, and so on. Nothing apart from Windows uses the term "ANSI" to refer to an encoding.

File encoding using Ruby in Windows

I have two files in a Windows folder. Using the technique described here I found out that one file's encoding is ANSI and the other's is UTF-8.
However, if I open cmd or PowerShell and try to get the encoding in IRB with the following code, I always get "CP850":
File.open(file_name).read.encoding.name # => CP850
or
File.open(file_name).external_encoding.name # => CP850
Notepad++ also tells me that one file is ANSI and the other is UTF-8.
How can I get the proper encoding using Ruby on Windows?
It is impossible to tell for certain what encoding a file has, but it's possible to make an educated guess.
When you open a file, Ruby simply assumes it's encoded with the default 8-bit encoding (in your case CP850).
See Detect encoding, and What is ANSI format? about ANSI.
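Since Ruby only assumes the default, the practical fix is to tell it the real encoding when opening the file; a minimal sketch (file names are hypothetical):

# Read the ANSI file as Windows-1252 and transcode it to UTF-8 internally.
ansi_text = File.read('ansi_file.txt', encoding: 'Windows-1252:UTF-8')
utf8_text = File.read('utf8_file.txt', encoding: 'UTF-8')
puts ansi_text.encoding  # => UTF-8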

Ruby UTF-8 Encoding doesn't work in Windows even with Magic Comment

I'm trying to run a file (ruby anyfile.rb in cmd prompt) with the following contents:
# encoding: utf-8
puts 'áá'
I get the following error:
invalid multibyte char (UTF-8)
It seems that Ruby does not understand the magic comment...
EDIT: if I remove the "# encoding: utf-8" line and run it from the command prompt like this:
ruby -E UTF-8 encoding.rb
then it works. Any ideas?
EDIT2: when I run:
ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
I get [#<Encoding:CP850>, nil]. Maybe my Encoding.default_external is wrong?
Environment:
Windows XP (yes, I also hate Windows + Ruby)
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
I believe this is a classic case of "if you hear hooves, think horses, not zebras".
The error message is telling you that you have a byte sequence in your file that is not a valid UTF-8 multibyte sequence.
It is definitely possible that
It seems that Ruby does not understand the magic comment...
as you say, and that up until now nobody noticed that magic comments don't actually work because you are the first person in the history of humankind to actually try to use magic comments. (Actually, this is not possible. If Ruby didn't understand magic comments, it would complain about an invalid ASCII character, since ASCII is the default encoding if no magic comment is present.)
Or, there actually is an invalid multibyte UTF-8 sequence in your file.
Which do you think is more likely? If I were you, I would check my file.
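One quick way to run that check from Ruby itself, as a sketch (anyfile.rb is the script from the question):

# Flag any byte sequences in the file that are not valid UTF-8.
data = File.binread('anyfile.rb').force_encoding('UTF-8')
puts data.valid_encoding? ? 'valid UTF-8' : 'contains invalid UTF-8 bytes'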
I've encountered similar issues from time to time with files that were not saved as UTF-8, even when the magic comment says so.
I've found that Ruby 1.9.2 had issues properly converting UTF-8 to codepages 850 and 437, the defaults for the command prompt on Windows.
I recommend you upgrade to Ruby 1.9.3 (latest is patchlevel 125), which solves a lot of encoding issues, especially on Windows.
Also, verify that your saved file does not contain a Unicode BOM (so it is plain UTF-8) and is properly saved.
To verify that, you can switch the console codepage to Unicode (chcp 65001) and try: type myscript.rb
You should see the accented letters correctly.
Last but not least, ensure your command prompt uses a TrueType font so extended characters are displayed properly.
Hope that helps.
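The BOM check above can also be done from Ruby; a sketch (anyfile.rb is the script from the question):

# A UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
bom = File.binread('anyfile.rb', 3)
puts bom == "\xEF\xBB\xBF".force_encoding('BINARY') ? 'BOM present' : 'no BOM'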
Try
# encoding: iso-8859-1
Not everything that's text is UTF-8.
Are you sure you selected 'UTF-8' from the Encoding dropdown when you saved the file in Notepad? I've just tried this on an XP machine and your code example worked for me.
