windows "zip folders" file encodings

windows "zip folders" file encodings - windows

which encodings do windows use when reading filenames from zip archive thru zip folders?
as far as I know
cyrillic is represented as cp866 and
central european - cp437
what about other?
Portuguese
Español (Spanish)
Français (French)
Polski (Polish)
Türkçe (Turkish)
Deutsch (German)
Italiano (Italian)
العربية (Arabic)
Farsi
ไทย (Thai)
中文 (Chinese)
日本語 (Japanese)
한국어 (Korean)
Tiếng Việt (Vietnamese)
I think first seven of this are in cp437.
Chinese may be in Big5
But I know nothing about others.

According to Apache Commons Compress Zip package documentation it's the platforms default encoding.
Windows' "compressed folder" feature doesn't recognize any flag or
extra field and creates archives using the platforms default encoding
- and expects archives to be in that encoding when reading them.

Related

Visual Studio 2019 does not properly convert UTF-8 strings to UTF-16 strings in source files without BOM

I have the following source file (encoded in UTF-8 without BOM, displayed fine in the Source Code Editor):
#include <Windows.h>
int main()
{
MessageBoxW(0, L"Umlaute ÄÖÜ, 🙂", nullptr, 0);
return 0;
}
When running the program, the special characters (Umlaute and Emoji) are messed up in the Message Box.
However, if I save the source file manually as "UTF-8 with BOM", Visual Studio will properly convert the string to UTF-16 and when running the program, the special characters are displayed in the Message Box. But it would be annoying to convert every single file to UTF-8 with BOM. (Also, I think GCC for example does not like BOM?)
Why is Visual Studio messing up my string, if there is no BOM in the source file? The Auto-detect UTF-8 encoding without signature option is already enabled.
I tested the same source with MinGW-w64 and don't have the issue, regardless if there is a BOM or not.

Use the /utf-8 compiler switch. The MS compiler assumes a legacy ANSI encoding (Windows-1252 on US and Western European versions of Windows) if no BOM is found in the source file.

How do I properly unzip a zip with Chinese character that from Windows in OSX?

One day I just zipped a file with Chinese character called 周國賢 - 密封罩.flac, to a zip, using bandizip & designated encoding to utf-8.
And then I try to unzip it in my MacbookPro, which is (probably) using Macintosh as encoding. The file unzipped is called ©P∞ÍΩÂ - ±K´ ∏n.flac, which does not match the above Chinese name.
So, I try to test about the encoding, and found that Macintosh->big5 would return the Macintosh mysterious symbol into Cantonese, but have some unmatching characters: 周衰�璀� - 密封罩.flac.
I have tried another file: §˝µ· - ¨ı®ß.ape: and it actually output the correct name of the file: 王菲 - 紅豆.ape
So, here is my question: how do I unzip a file that with big5 chinese character properly and without any information loss? Or how do I zip a file correctly to prevent information loss/ incorrect characters? (edit #2: you can use bandizip to zip the file into utf-8 encoding)
BTW, The encoding converter I am using is https://r12a.github.io/apps/encodings/, which could be quite helpful for you to check for encoding. Don't forget to click change encodings shown. And I am not the owner of the encoding converter.
edit #1: I have found that the setting in bandizip is wrong...well sorry for the inconvenience caused. Nonetheless, I figure out that The Unarchiver in Mac Apple Store can unzip big5 correctly. This can be a workaround, but still I don't know how to unzip big5 characters properly WITHOUT any loss.

Russian character dispaly as '?' in installer

I have created my desktop application for English, French and Russian using NSIS. French and English are works fine but while select Russian language it display '?' instead of some characters.
For example: Original string is Äîáðî ïîæàëîâàòü â ìàñòåð íàñòðîéêè
It display at runtime as : .
I already have added !define MUI_LANGDLL_ALLLANGUAGES in my LanguageText.nsh file.
I am using windows 10 64Bit virtual machine in Russian language for test installer.
UPDATE
I have added Unicode true in my main file and also convert all files(.ini/.nsh) in UTF-8 using notepad as your suggestions.
Now Header on installer works perfectly and other character still as '?????'
Also when I open my language.nsh file in notepad, all characters display correctly. But when I open it in HM NIS it changes.
EX: Notepad file content :
LangString WMWelCome ${LANG_RUSSIAN} "Мастер установки поможет Вам установить $(^NameDA).\nIf на Ваш компьютер,если Вы хотите обновить программу,пожалуйста создайте резервное копирование программы, данных и баз данных.\nIt Перед началом установки закройте все другие приложения.\n\nЧто бы открыть Installation Guide нажмите на ссылку внизу.\n\nIf для получение помощи, пожалуйста свяжитесь с нами at\nhelp#windowmaker.com или посетите нашу страницу в интернете WebSite.\n\n$_CLICK"
While I open same file in HM NIS:
content change to :
LangString WMWelCome ${LANG_RUSSIAN} "ÐœÐ°ÑÑ‚ÐµÑ€ ÑƒÑÑ‚Ð°Ð½Ð¾Ð²ÐºÐ¸ Ð¿Ð¾Ð¼Ð¾Ð¶ÐµÑ‚ Ð’Ð°Ð¼ ÑƒÑÑ‚Ð°Ð½Ð¾Ð²Ð¸Ñ‚ÑŒ $(^NameDA).\nIf Ð½Ð° Ð’Ð°Ñˆ ÐºÐ¾Ð¼Ð¿ÑŒÑŽÑ‚ÐµÑ€,ÐµÑÐ»Ð¸ Ð’Ñ‹ Ñ…Ð¾Ñ‚Ð¸Ñ‚Ðµ Ð¾Ð±Ð½Ð¾Ð²Ð¸Ñ‚ÑŒ Ð¿Ñ€Ð¾Ð³Ñ€Ð°Ð¼Ð¼Ñƒ,Ð¿Ð¾Ð¶Ð°Ð»ÑƒÐ¹ÑÑ‚Ð° ÑÐ¾Ð·Ð´Ð°Ð¹Ñ‚Ðµ Ñ€ÐµÐ·ÐµÑ€Ð²Ð½Ð¾Ðµ ÐºÐ¾Ð¿Ð¸Ñ€Ð¾Ð²Ð°Ð½Ð¸Ðµ Ð¿Ñ€Ð¾Ð³Ñ€Ð°Ð¼Ð¼Ñ‹, Ð´Ð°Ð½Ð½Ñ‹Ñ… Ð¸ Ð±Ð°Ð· Ð´Ð°Ð½Ð½Ñ‹Ñ….\nIt ÐŸÐµÑ€ÐµÐ´ Ð½Ð°Ñ‡Ð°Ð»Ð¾Ð¼ ÑƒÑÑ‚Ð°Ð½Ð¾Ð²ÐºÐ¸ Ð·Ð°ÐºÑ€Ð¾Ð¹Ñ‚Ðµ Ð²ÑÐµ Ð´Ñ€ÑƒÐ³Ð¸Ðµ Ð¿Ñ€Ð¸Ð»Ð¾Ð¶ÐµÐ½Ð¸Ñ.\n\nÐ§Ñ‚Ð¾ Ð±Ñ‹ Ð¾Ñ‚ÐºÑ€Ñ‹Ñ‚ÑŒ Installation Guide Ð½Ð°Ð¶Ð¼Ð¸Ñ‚Ðµ Ð½Ð° ÑÑÑ‹Ð»ÐºÑƒ Ð²Ð½Ð¸Ð·Ñƒ.\n\nIf Ð´Ð»Ñ Ð¿Ð¾Ð»ÑƒÑ‡ÐµÐ½Ð¸Ðµ Ð¿Ð¾Ð¼Ð¾Ñ‰Ð¸, Ð¿Ð¾Ð¶Ð°Ð»ÑƒÐ¹ÑÑ‚Ð° ÑÐ²ÑÐ¶Ð¸Ñ‚ÐµÑÑŒ Ñ Ð½Ð°Ð¼Ð¸ at\nhelp#windowmaker.com Ð¸Ð»Ð¸ Ð¿Ð¾ÑÐµÑ‚Ð¸Ñ‚Ðµ Ð½Ð°ÑˆÑƒ ÑÑ‚Ñ€Ð°Ð½Ð¸Ñ†Ñƒ Ð² Ð¸Ð½Ñ‚ÐµÑ€Ð½ÐµÑ‚Ðµ WebSite.\n\n$_CLICK"

NSIS v2
NSIS v2 does not translate strings in any way, all strings are copied as raw bytes from the source files.
To properly build a multi-language installer you should put your Russian strings in a file named MyRussian.nsh and edit it with a editor that can save in the Windows 1251 codepage.
NSIS v3
NSIS v3 translates strings to Unicode internally in the compiler. I would recommend that you save your .nsi and .nsh files as UTF-8 with a BOM/SIG when building multi-language installers in v3. I would also recommend that you produce a Unicode installer and you can do that by adding Unicode True to your script.
You can also force a specific encoding by using the /charset option when using !include but using UTF-8 everywhere is less painful.
When you use MUI_LANGDLL_ALLLANGUAGES you are telling NSIS that you don't want to hide languages that might not display correctly on a specific machine. Only "Russian machines" can display ANSI installers correctly. Unicode installers work on every machine (except machines running Windows 95/98/ME obviously).

rubyzip and unicode characters in filenames

I am creating zip archive with rubyzip gem and Zip::ZipOutputStream class and have got a problem with unicode letters (cyrillic) - in archive they are presented as question marks like ????? ???? ??.doc. Does rubyzip support unicode?

I looked at rubyzip methods and it doesn't seem that rubyzip can change the language. It probably uses your computer's default code page. You could use chilkat zip instead as in this example unless you have specific requirements that cannot be addressed by chilkat.

You can use the following snippet to convert UTF-8 to CP437 which cover some unicode chars (just a few). Windows 7 and older assume that filenames are encoded in CP437.
# first normalize the string
normalized_filename = input.mb_chars.normalize.to_s
# then encode in cp437
filename_for_zip = normalized_filename.encode("cp437")
# add file to zip
zipfile.add(filename_for_zip, pdf_file)

You may just run zip directly.
`cd yourfolder; zip archivename file1 file2`
Notice specific quotes. Worked for me on Ubuntu for Cyrillic file names, while rubyzip was generating archive with non-readable file names.

Get encoding of a file in Windows

This isn't really a programming question, is there a command line or Windows tool (Windows 7) to get the current encoding of a text file? Sure I can write a little C# app but I wanted to know if there is something already built in?

Open up your file using regular old vanilla Notepad that comes with Windows.
It will show you the encoding of the file when you click "Save As...".
It'll look like this:
Whatever the default-selected encoding is, that is what your current encoding is for the file.
If it is UTF-8, you can change it to ANSI and click save to change the encoding (or visa-versa).
I realize there are many different types of encoding, but this was all I needed when I was informed our export files were in UTF-8 and they required ANSI. It was a onetime export, so Notepad fit the bill for me.
FYI: From my understanding I think "Unicode" (as listed in Notepad) is a misnomer for UTF-16.
More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicdoe

If you have "git" or "Cygwin" on your Windows Machine, then go to the folder where your file is present and execute the command:
file *
This will give you the encoding details of all the files in that folder.

The (Linux) command-line tool 'file' is available on Windows via GnuWin32:
http://gnuwin32.sourceforge.net/packages/file.htm
If you have git installed, it's located in C:\Program Files\git\usr\bin.
Example:
C:\Users\SH\Downloads\SquareRoot>file *
_UpgradeReport_Files; directory
Debug; directory
duration.h; ASCII C++ program text, with CRLF line terminators
ipch; directory
main.cpp; ASCII C program text, with CRLF line terminators
Precision.txt; ASCII text, with CRLF line terminators
Release; directory
Speed.txt; ASCII text, with CRLF line terminators
SquareRoot.sdf; data
SquareRoot.sln; UTF-8 Unicode (with BOM) text, with CRLF line terminators
SquareRoot.sln.docstates.suo; PCX ver. 2.5 image data
SquareRoot.suo; CDF V2 Document, corrupt: Cannot read summary info
SquareRoot.vcproj; XML document text
SquareRoot.vcxproj; XML document text
SquareRoot.vcxproj.filters; XML document text
SquareRoot.vcxproj.user; XML document text
squarerootmethods.h; ASCII C program text, with CRLF line terminators
UpgradeLog.XML; XML document text
C:\Users\SH\Downloads\SquareRoot>file --mime-encoding *
_UpgradeReport_Files; binary
Debug; binary
duration.h; us-ascii
ipch; binary
main.cpp; us-ascii
Precision.txt; us-ascii
Release; binary
Speed.txt; us-ascii
SquareRoot.sdf; binary
SquareRoot.sln; utf-8
SquareRoot.sln.docstates.suo; binary
SquareRoot.suo; CDF V2 Document, corrupt: Cannot read summary infobinary
SquareRoot.vcproj; us-ascii
SquareRoot.vcxproj; utf-8
SquareRoot.vcxproj.filters; utf-8
SquareRoot.vcxproj.user; utf-8
squarerootmethods.h; us-ascii
UpgradeLog.XML; us-ascii

Another tool that I found useful: https://archive.codeplex.com/?p=encodingchecker
EXE can be found here

Install git ( on Windows you have to use git bash console). Type:
file --mime-encoding *
for all files in the current directory , or
file --mime-encoding */*
for the files in all subdirectories

Here's my take how to detect the Unicode family of text encodings via BOM. The accuracy of this method is low, as this method only works on text files (specifically Unicode files), and defaults to ascii when no BOM is present (like most text editors, the default would be UTF8 if you want to match the HTTP/web ecosystem).
Update 2018: I no longer recommend this method. I recommend using file.exe from GIT or *nix tools as recommended by #Sybren, and I show how to do that via PowerShell in a later answer.
# from https://gist.github.com/zommarin/1480974
function Get-FileEncoding($Path) {
$bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)
if(!$bytes) { return 'utf8' }
switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
'^efbbbf' { return 'utf8' }
'^2b2f76' { return 'utf7' }
'^fffe' { return 'unicode' }
'^feff' { return 'bigendianunicode' }
'^0000feff' { return 'utf32' }
default { return 'ascii' }
}
}
dir ~\Documents\WindowsPowershell -File |
select Name,#{Name='Encoding';Expression={Get-FileEncoding $_.FullName}} |
ft -AutoSize
Recommendation: This can work reasonably well if the dir, ls, or Get-ChildItem only checks known text files, and when you're only looking for "bad encodings" from a known list of tools. (i.e. SQL Management Studio defaults to UTF16, which broke GIT auto-cr-lf for Windows, which was the default for many years.)

A simple solution might be opening the file in Firefox.
Drag and drop the file into firefox
Press Ctrl+I to open the page info
and the text encoding will appear on the "Page Info" window.
Note: If the file is not in txt format, just rename it to txt and try again.
P.S. For more info see this article.

I wrote the #4 answer (at time of writing). But lately I have git installed on all my computers, so now I use #Sybren's solution. Here is a new answer that makes that solution handy from powershell (without putting all of git/usr/bin in the PATH, which is too much clutter for me).
Add this to your profile.ps1:
$global:gitbin = 'C:\Program Files\Git\usr\bin'
Set-Alias file.exe $gitbin\file.exe
And used like: file.exe --mime-encoding *. You must include .exe in the command for PS alias to work.
But if you don't customize your PowerShell profile.ps1 I suggest you start with mine: https://gist.github.com/yzorg/8215221/8e38fd722a3dfc526bbe4668d1f3b08eb7c08be0
and save it to ~\Documents\WindowsPowerShell. It's safe to use on a computer without git, but will write warnings when git is not found.
The .exe in the command is also how I use C:\WINDOWS\system32\where.exe from powershell; and many other OS CLI commands that are "hidden by default" by powershell, *shrug*.

you can simply check that by opening your git bash on the file location then running the command file -i file_name
example
user filesData
$ file -i data.csv
data.csv: text/csv; charset=utf-8

Some C code here for reliable ascii, bom's, and utf8 detection: https://unicodebook.readthedocs.io/guess_encoding.html
Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM,
UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document.
For all other encodings, you have to trust heuristics based on statistics.
EDIT:
A powershell version of a C# answer from: Effective way to find any file's Encoding. Only works with signatures (boms).
# get-encoding.ps1
param([Parameter(ValueFromPipeline=$True)] $filename)
begin {
# set .net current directoy
[Environment]::CurrentDirectory = (pwd).path
}
process {
$reader = [System.IO.StreamReader]::new($filename,
[System.Text.Encoding]::default,$true)
$peek = $reader.Peek()
$encoding = $reader.currentencoding
$reader.close()
[pscustomobject]#{Name=split-path $filename -leaf
BodyName=$encoding.BodyName
EncodingName=$encoding.EncodingName}
}
.\get-encoding chinese8.txt
Name BodyName EncodingName
---- -------- ------------
chinese8.txt utf-8 Unicode (UTF-8)
get-childitem -file | .\get-encoding

Looking for a Node.js/npm solution? Try encoding-checker:
npm install -g encoding-checker
Usage
Usage: encoding-checker [-p pattern] [-i encoding] [-v]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--pattern, -p, -d [default: "*"]
--ignore-encoding, -i [default: ""]
--verbose, -v [default: false]
Examples
Get encoding of all files in current directory:
encoding-checker
Return encoding of all md files in current directory:
encoding-checker -p "*.md"
Get encoding of all files in current directory and its subfolders (will take quite some time for huge folders; seemingly unresponsive):
encoding-checker -p "**"
For more examples refer to the npm docu or the official repository.

Similar to the solution listed above with Notepad, you can also open the file in Visual Studio, if you're using that. In Visual Studio, you can select "File > Advanced Save Options..."
The "Encoding:" combo box will tell you specifically which encoding is currently being used for the file. It has a lot more text encodings listed in there than Notepad does, so it's useful when dealing with various files from around the world and whatever else.
Just like Notepad, you can also change the encoding from the list of options there, and then saving the file after hitting "OK". You can also select the encoding you want through the "Save with Encoding..." option in the Save As dialog (by clicking the arrow next to the Save button).

The only way that I have found to do this is VIM or Notepad++.

EncodingChecker
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

windows "zip folders" file encodings - windows

Related

Visual Studio 2019 does not properly convert UTF-8 strings to UTF-16 strings in source files without BOM

How do I properly unzip a zip with Chinese character that from Windows in OSX?

Russian character dispaly as '?' in installer

rubyzip and unicode characters in filenames

Get encoding of a file in Windows

Categories

Resources