Here is what I did:
I dumped a SQLite database containing UTF-8 data (sqlite3 example.db .dump > dump.sql), but since this was done in PowerShell, I assume the redirection converted it to Windows-1252
I loaded that dumped data into a new database, again using PowerShell (Get-Content dump.sql | sqlite3 example2.db)
I dumped that new database and am left with a new .sql file (this time not through PowerShell, so I assume it was left unmodified)
This new .sql file's UTF-8 characters are seriously mangled, and I was wondering whether there is a way to convert it back into correct UTF-8.
As a few examples, here are what some sequences are in the new file, and what they should be (all are viewed as UTF-8):
ÒüéÒü¬ÒüƒÒü½ should be あなたに
´╝ü should be ！ (a full-width exclamation mark)
Òé¡Òé╗Òé¡ should be キセキ
Does anyone have any idea as to how I might undo this mangling? Any method would be very helpful!
This is in PowerShell 7.0.1.
Edit:
On further inspection, you can reproduce my predicament by redirecting any such data to a file in PowerShell (note that the data cannot simply be typed into PowerShell itself). Hence, setting up a script like this gives the same outcome:
test.sh
#!/bin/bash
echo "キ"
Then running wsl ./test.sh > test.txt gives an output of Òé¡, not キ.
Edit 2:
It seems as if the code page the UTF-8 text was converted through is almost 437: some characters are restored under this assumption (e.g. 木), but others are not. If it's close to 437 but isn't 437, what could it be?
It turns out that, since I am in the UK, the code page I wanted was 850. Saving the file as code page 850 and then reloading it as UTF-8 fixed my issue!
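For anyone who wants to script the repair rather than round-trip through an editor, here is a minimal Python sketch of what worked for me (file names are placeholders; it assumes the text was decoded as code page 850 and re-saved as UTF-8, which is exactly the situation above):
# repair_dump.py -- undo UTF-8 -> cp850 -> UTF-8 mangling (sketch)
with open("dump_mangled.sql", "r", encoding="utf-8") as f:
    mangled = f.read()
# Turn the text back into the cp850 bytes it was mistaken for,
# then decode those bytes as the UTF-8 they really were.
repaired = mangled.encode("cp850").decode("utf-8")
with open("dump_repaired.sql", "w", encoding="utf-8") as f:
    f.write(repaired)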
I am concatenating files using Windows. I have used the TYPE and the COPY commands and I get the same artifact. At the place where my original files are joined in the new file, a stray character sequence (decimal: 139 175 168, hex: 8B AF A8) is inserted.
How can I troubleshoot this? Is there an easy explanation for how to avoid it, and why does this happen?
A very good explanation of why this happens is in Mark Tolonen's answer, so I will not repeat it.
Instead of the obsolete TYPE and COPY, one should use PowerShell now:
powershell -Command "& { Get-Content a*.txt | Out-File output.txt -Encoding utf8 }"
This command gets the content of all files matching a*.txt in the current folder and concatenates them into output.txt using UTF-8.
PowerShell is part of Windows 7 and later.
The extra bytes are a UTF-8 encoding signature: the Unicode byte order mark U+FEFF encoded in UTF-8 (the bytes EF BB BF) and written to the beginning of the file to indicate that the file is encoded in UTF-8. It's not required, but Windows assumes a text file is encoded in the local ANSI encoding (commonly Windows-1252) unless a BOM appears.
Many file tools don't know about this (DOS copy being one of them), so concatenating files can be troublesome.
Being ignorant of encodings often causes trouble these days. You can't simply concatenate two text files of unknown encoding; they may use different encodings.
If you know the encoding, use a tool that understands the encoding. Here's a very basic concatenation script written in Python that will convert encodings as well.
# cat.py
import sys

if len(sys.argv) < 5:
    print('usage: cat <in_encoding> <out_encoding> <outfile> <infile> [infile...]')
else:
    # Open the output file with the requested output encoding...
    with open(sys.argv[3], 'w', encoding=sys.argv[2]) as fout:
        # ...then decode each input file with the input encoding and copy it over.
        for file in sys.argv[4:]:
            with open(file, 'r', encoding=sys.argv[1]) as fin:
                fout.write(fin.read())
Given two files with UTF-8 w/ BOM encoding, this command will output UTF-8 (no BOM):
cat.py utf-8-sig utf-8 out.txt test1.txt test2.txt
Side note about Python: utf-8-sig encoding reads files and removes the BOM from the data if present, so it can be used to read any UTF-8 file with or without a BOM. utf-8-sig encoding writes a BOM at the start of a file, but utf-8 does not.
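A short illustration of that difference (the file name is arbitrary):
# utf-8-sig writes a BOM; plain utf-8 does not; utf-8-sig reads either form.
with open("with_bom.txt", "w", encoding="utf-8-sig") as f:
    f.write("hello")
with open("with_bom.txt", "rb") as f:
    print(f.read())        # b'\xef\xbb\xbfhello' -- the BOM bytes are visible
with open("with_bom.txt", "r", encoding="utf-8-sig") as f:
    print(f.read())        # 'hello' -- the BOM is stripped on read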
My installation needs to check the result of a command run from cmd.exe. Thus, I redirect the command's output to a text file and then try to read the file to get the result, as follows:
// send command to cmd to execute and redirect the result to a text file
// try to read the file
szDir = "D:\\";
szFileName = "MyFile.txt";
if Is(FILEEXISTS, szDir ^ szFileName) then
    listID = ListCreate(STRINGLIST);
    if listID != LIST_NULL then
        if OpenFileMode(FILE_MODE_NORMAL) = 0 then
            if OpenFile(nFileHandle, szDir, szFileName) = 0 then
                // I run into the problem here
                while (GetLine(nFileHandle, szCurLine) = 0)
                    ListAddString(listID, szCurLine, AFTER);
                endwhile;
                CloseFile(nFileHandle);
            endif;
        endif;
    endif;
endif;
The problem is that right after the command is executed and its output is redirected to MyFile.txt, I can set the open-file mode and open the file, but I cannot read any text into my list. ListReadFromFile() does not help either. If I open the file, edit it, and save it manually, my script works.
After debugging, I found that GetLine() returns an error code (-1), which means the file pointer must be at the end of the file or some other error occurred. However, FILE_MODE_NORMAL opens the file as read-only and sets the file pointer at the beginning of the file.
What could I be doing wrong? Is this something to do with read/write access to the file? I tried this command, without result:
icacls D:\MyFile.txt /grant Administrator:(R,W)
I am using InstallShield 2018 and Windows 10 64-bit, by the way. Your help is much appreciated.
EDIT 1: I suspected the encoding and tried a few things:
After running "wslconfig /l", the content of MyFile.txt opened in Notepad++ is without an encoding, but still appeared normal and readable. I tried to converted the content to UTF-8 but it did not work.
If I add something to the file (echo This line is appended >> MyFile.txt), the encoding changed to UTF-8, but the content in step 1 is changeed also. NULL (\0) is added to between every character and even repelace new line character. Maybe this is why GetLine() failed to read the file.
Work around: after step 1, I run "find "my_desired_content" MyFile.txt" > TempFile.txt and read TempFile.txt (which is encoded in UTF-8).
My ultimate goal is to check if "my_desired_content" apeears in the result of "wslconfig /l" so this is fine. However, what I don't understand is that both MyFile.txt and TempFile.txt are created from cmd command but they are encoded differently?
The problem is due to the contents of the file. Assuming this is the file generated by your linked question, you can examine its contents in a hex editor to find out the following facts:
Its contents are encoded in UTF-16 (LE) without a BOM
Its newlines are encoded as CR or CR CR instead of CR LF
I thought the newlines would be more important than the text encoding, but it turns out I had it backwards. If I change each of these things independently, GetLine seems to function correctly for either CR, CR CR, or CR LF, but only handles UTF-16 when the BOM is present. (That is, in a hex editor, the file starts with FF FE 57 00 instead of 57 00 for a file starting with the character W.)
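If you want to check a given file quickly without a hex editor, a small Python sketch like this (using the path from the question) shows whether the UTF-16 LE BOM is there:
# Print the first bytes and whether the file starts with the UTF-16 LE BOM.
with open(r"D:\MyFile.txt", "rb") as f:
    head = f.read(4)
print(head.hex())                                  # e.g. 'fffe5700' or '57007300'
print("has UTF-16 LE BOM:", head.startswith(b"\xff\xfe"))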
I'm at a bit of a loss for the best way to address this. If you're up for a challenge, you could read the file with FILE_MODE_BINARYREADONLY and use your extra knowledge about what should be in the file to ensure you interpret its encoding correctly. Note that for most of UTF-16, you can create a single code unit by combining two bytes in the following manner:
szResult[i] = (nHigh << 8) + nLow;
where nHigh and nLow are probably values like szBuffer[2*i + 1] and szBuffer[2*i], assuming you filled a STRING szBuffer by calling ReadBytes.
Other unproven ideas include editing it in binary to ensure the BOM (FF FE) is present, figuring out ways to ensure the file is originally created with the BOM, figuring out ways to create it in an alternate encoding, finding another command you can invoke to "fix" the file, or lodging a request with the vendor (my employer) and hoping the development team changes something to better handle this case.
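For the first of those unproven ideas, the binary edit itself is small; here is a sketch (it assumes the file really is BOM-less UTF-16 LE, so only apply it when you know that's the case):
# Prepend a UTF-16 LE BOM to the file if it is not already there.
path = r"D:\MyFile.txt"
with open(path, "rb") as f:
    data = f.read()
if not data.startswith(b"\xff\xfe"):
    with open(path, "wb") as f:
        f.write(b"\xff\xfe" + data)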
Here's an easier workaround. If you can safely assume that the command will append UTF-16 characters without a signature, you can append this output to a file that has just a signature. How do you get such a file?
You could create a file with just the BOM in your development environment, and add it to your Support Files. If you need to use it multiple times, copy it around first.
You could create it with code. Just call the following (error checking omitted for clarity)
OpenFileMode(FILE_MODE_APPEND_UNICODE);
CreateFile(nFileHandle, szDir, szFileName);
CloseFile(nFileHandle);
and if szDir ^ szFileName didn't exist, it will now be a file with just the UTF-16 signature.
Assuming this file is called sig.txt, you can then invoke the command
wslconfig /l >> sig.txt to write to that file. Note the doubled >> for append. The resulting file will include the Unicode signature you created ahead of time, plus the Unicode data output from wslconfig, and GetLine should interpret things correctly.
The biggest problem here is that this hardcodes around the behavior of wslconfig, and that behavior may change at any point. This is why Christopher alludes to recommending an API, and I agree completely. In the meantime, you could try to make this more robust by invoking the command via cmd /U (but my understanding of what that does or guarantees is fuzzy at best), or by trying the original way first and then retrying with the BOM.
This whole WSL thing is pretty new. I don't see any APIs for it, but rather than screen-scraping command output, you might want to look at this registry key:
HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Lxss
It seems to have the list of installed distros that come from the store. Coming from the store probably explains why this is HKCU and not HKLM.
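If you do go the registry route, a sketch of reading it from Python's winreg module might look like this (the "DistributionName" value name is my assumption about how store-installed distros are registered; this key isn't a documented interface and may change):
import winreg
# Enumerate subkeys of the Lxss key and print each distro's name value, if present.
LXSS = r"SOFTWARE\Microsoft\Windows\CurrentVersion\Lxss"
with winreg.OpenKey(winreg.HKEY_CURRENT_USER, LXSS) as lxss:
    for i in range(winreg.QueryInfoKey(lxss)[0]):      # [0] = number of subkeys
        with winreg.OpenKey(lxss, winreg.EnumKey(lxss, i)) as distro:
            try:
                name, _ = winreg.QueryValueEx(distro, "DistributionName")
                print(name)
            except FileNotFoundError:
                pass                                    # subkey without that value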
A brave new world.... sigh.
I've got an Adwords dump containing Thai keywords which I'll use for a join with data from another DB.
In theory, I grab the file, snip off the useless lines at the top and bottom, clean it up a little and upload it to PostgreSQL as a new table.
In practice, the characters get garbled on the way (actually, from the start) even though the file opens fine in Excel and OpenOffice. The below is true on both my local machine (running OSX) and the server (running Ubuntu).
First, I already set my locale to UTF-8:
$ echo "กระเป๋า สะพาย คอนเวิร์ส"
กระเป๋า สะพาย คอนเวิร์ส
However, looking at the CSV (let's assume it only contains the above string) on the CLI gives me this:
$ head file.csv
#0#2 *02" -#'4#L*
Any idea where the problem is?
The original file was in the wrong encoding.
$ file file.csv
file.csv: Little-endian UTF-16 Unicode English text
Quick fix:
$ iconv -f UTF-16 -t UTF-8 file.csv > file_utf8.csv
$ head file_utf8.csv
กระเป๋า สะพาย คอนเวิร์ส
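If iconv isn't available, the same conversion can be sketched in Python (file names as above; 'utf-16' relies on the BOM, so use 'utf-16-le' explicitly if your file has none):
# Re-encode a UTF-16 CSV as UTF-8.
with open("file.csv", "r", encoding="utf-16") as src:
    text = src.read()
with open("file_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)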
I have a VS2013 project with a custom build command. In the command script I set an environment variable, and read it out again in the same script. I can confirm by calling set that setting the variable works. However, depending on the variable name, I can't read it out again.
The following works as expected when run as a batch script:
set AVAR=xxx
set ABLAH=xxx
set BBLAH=xxx
set DEV=xxx
set #ABLAH=xxx
echo %AVAR%
echo %ABLAH%
echo %BBLAH%
echo %DEV%
echo %#ABLAH%
But it produces the following output in the project:
1> xxx
1> «LAH
1> »LAH
1> ÞV
1> xxx
In this case, the name AVAR works, but many others don't. Also, variables starting with # seem to work. Any idea what is going on?
I've found the solution. Visual Studio (MSBuild) converts %XX escape sequences, as in URLs. I only expected it to do so in URLs, as browsers do. However, it seems to replace them everywhere.
So when it encounters %ABCDE%, it recognizes %AB and inserts the character « = 0xAB, giving «CDE% to the batch interpreter. But if the code is not a valid hexadecimal number, it silently ignores it, and the interpreter sees the right characters. That's why variable names with # at the beginning always worked.
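To make the mechanism concrete, here is a tiny Python sketch of what the substitution effectively does (just an illustration of the observed behaviour, not MSBuild's actual code):
import re
def msbuild_percent_decode(s):
    # Replace %XX (two hex digits) with that character; leave anything else alone.
    return re.sub(r"%([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), s)
print(msbuild_percent_decode("echo %AVAR%"))    # echo %AVAR%  ('AV' is not a hex pair)
print(msbuild_percent_decode("echo %ABLAH%"))   # echo «LAH%   (%AB became '«')
print(msbuild_percent_decode("echo %DEV%"))     # echo ÞV%     (%DE became 'Þ')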
So the solution is to escape, with %25, at least every % that precedes a valid hex code 00-FF, or better yet all of them.
An easy solution would be to just edit the corresponding commands in the GUI (via project properties), and not directly in the .vcxproj or .props file. This way, VS inserts the correct escape codes. In my case this was not possible since the commands were defined as user macros (Property Pages: Common Properties/User Macros). My commands span multiple lines, but the user macro editor only supports single lines.
Another thing to watch out for is that it doesn't only replace percent signs. Other symbols have special meaning and have to be replaced, too. (This goes beyond XML entities, like & -> &amp;.) Here is a list of special characters from MSDN. The characters are: % $ # ' ; ? *. It doesn't seem to be necessary to replace all of them all the time, but if you notice funky behavior then this is a thing to look at. You can try entering these characters through the GUI and see whether and how VS escapes them in the project file.
One other character to note especially is the semicolon. If you define a property with unescaped semicolons, like <MyPaths>DirA;DirB</MyPaths>, MSBuild/VS will internally convert them to newlines (or it splits the property into a list, or something similar). But it will still show the paths separated by semicolons in the property pages! Except when you click the dropdown button next to a property and select <Edit...>; then it will show the paths as a list, separated by newlines! This is completely invisible most of the time, except when you set a property not in the XML or the GUI but by reading the output of a command into it. In that case the command must output newlines if you want the effect of a semicolon; otherwise you don't get multiple paths, but one long path with semicolons in it.
In North American and Western European countries, batch files are usually "ASCII" files using an OEM code page like code page 850 (OEM Multilingual Latin I) or code page 437 (OEM US), and not code page Windows-1252 as usually used for single-byte encoded text files. The code page to use for a batch file depends on the local settings for non-Unicode files in the console. The code page does not matter if only characters with a code value smaller than 128 are used in the batch file, i.e. if the batch file is a real ASCII file.
Therefore make sure that you edit and save the batch file as an ASCII file using the right code page, and not as a Unicode file using UTF-8, UTF-16 little endian, or UTF-16 big endian. The Visual Studio editor uses UTF-8 encoding for files by default, which is the wrong encoding for batch files.
The character « has code value 174 decimal (0xAE) in the code page 850 table. In the code page 1252 table, code value 174 is the character ®, which is an indication that you wanted the batch file to output characters encoded in UTF-8 (where ® is likewise code point 174, U+00AE) or Windows-1252.
A simple batch file for demonstration, stored as an ANSI file with code page Windows-1252:
@echo off
cls
echo This batch file was saved as ANSI file using code page Windows-1252.
echo.
echo Registered trademark symbol ® has code value 174 in Windows-1252.
echo.
echo But active code page is not Windows 1252 in console window.
echo.
chcp
echo.
echo Therefore the left guillemet character is output instead of registered
echo trademark symbol as this character has in code page 850 code value 174.
echo.
echo Press any key to continue ...
pause>nul
Batch files are for DOS/Windows and should therefore use carriage return + line feed as the line terminator, instead of just line feed (UNIX) or just carriage return (old Mac).
Some text editors display the line terminator type and the encoding or code page of the active file somewhere in the status bar at the bottom of the main application window.
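By the way, the code-value point above is easy to verify, for example (purely a demonstration of the two code pages):
# Byte value 174 (0xAE) is a different character in each code page.
print(bytes([174]).decode("cp850"))     # «  (left-pointing guillemet)
print(bytes([174]).decode("cp1252"))    # ®  (registered trademark sign)
print("«".encode("cp1252")[0])          # 171 (0xAB) -- where « sits in Windows-1252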
I am dealing with some multilingual data (English and Arabic) in a JSON file with a weird character I am not able to parse. I am not sure what the character is. I tried getting the ASCII value via vim, and this is what I got:
"38 0x26"
This is the vim status line I used to get the value (http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character):
:set statusline=%<%f%h%m%r%=%b\ 0x%B\ \ %l,%c%V\ %P
This is how the character looks in vim -
I tried 'sed' and '.gsub' to replace this character unsuccessfully.
Is there a way I can replace this character (preferably with Ruby's .gsub) with '&' or something else?
Thanks
Try something like:
sed 's/[[:alnum:][:space:]\[\]{}()\.\*\\\/_(AllAsciiVariationYouWant)]/&/g;t
s/./?/g' YourFile
where (AllAsciiVariationYouWant) stands for all the characters you want to keep as-is (without the surrounding "()").
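If sed feels fragile here, the same idea can be sketched in Python instead of Ruby's gsub (the code point below is only a placeholder; substitute whatever vim's statusline trick, or the ga command, reports for your character):
# Replace one specific offending code point with '&'.
bad = "\ufeff"          # placeholder: put the actual offending character here
with open("data.json", "r", encoding="utf-8") as f:
    text = f.read()
with open("data_clean.json", "w", encoding="utf-8") as f:
    f.write(text.replace(bad, "&"))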
JSON is encoded in UTF-8 (Unicode). If you're seeing funky-looking characters in your file, it's probably because your editor is not treating Unicode characters properly. That could be caused by the use of a terminal emulator that doesn't support Unicode; an incorrect $LANG setting; vim not being able to correctly determine the encoding of the file; and likely other reasons.
What terminal program are you using? What's your $LANG environment variable set to (echo $LANG)? If you're certain your terminal supports Unicode, try:
LANG=en_US.utf-8 vim your_file_here.json
(The above example assumes that U.S. English is appropriate for the file, which it may not be.)
As for replacing characters in the file, vim's substitution command can be used:
:%s/old text/new text/g
The above command will run the substitute command on all lines in the file (%), replacing every instance of "old text" with "new text". (The g at the end tells vim to replace every instance on a line, not just the first it finds.)