Mac issue with file encoding - macos

I have a script which is reading some data from one server and storing it in a file. But the file seems somehow corrupt. I can print it to the display, but checking the file with file produces
bash$ file -I filename
filename: text/plain; charset=unknown-8bit
Why is it telling me that the encoding is unknown? The first line of the file displays for me as
“The Galaxy A5 and A3 offer a beautifully crafted full metal unibody
A hex dump reveals that the first three bytes are 0xE2, 0x80, 0x9C followed by the regular ASCII text The Galaxy A5...
What's wrong? Why does file tell me the encoding is unknown, and what is it actually?

Based on the information in the question, the file is a perfectly good UTF-8 file. The first three bytes encode LEFT DOUBLE QUOTATION MARK (U+201C) aka a curly quote.
Maybe your version of file is really old.
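One way to confirm this without relying on file is to decode the bytes from the hex dump directly. The following is a minimal Python sketch, assuming the bytes are exactly those quoted in the question; the variable names are only illustrative.

```python
import unicodedata

# The three leading bytes from the hex dump, followed by the ASCII text.
raw = bytes([0xE2, 0x80, 0x9C]) + b"The Galaxy A5"

# decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")
first = text[0]

# Print the code point and its official Unicode name.
print(f"U+{ord(first):04X}", unicodedata.name(first))
```

If the decode succeeds, as it does here, the data is valid UTF-8 whatever file reports.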

You can use iconv to convert the file to the desired charset, for example:
iconv --from-code=UTF8 --to-code=YOURTARGET
To get a list of supported targets, use the --list flag.

Related

Gettext failed to extract non-ASCII characters

In my source files I have string containing non-ASCII characters like
sCursorFormat = TRANSLATE("Frequency (Hz): %s\nDegree (°): %s");
But when I extract them they vanish like
msgid ""
"Frequency (Hz): %s\n"
"Degree (): %s"
msgstr ""
I have specified the encoding when extracting as
xgettext --from-code=UTF-8
I'm running under MS Windows and the source files are C++ (not that it should matter).
The encoding of your source file is probably not UTF-8 but ANSI, meaning whatever encoding Windows uses for non-Unicode applications (probably code page 1252). If you open the file in a hex editor, you will see the single byte 0xB0 for the degree symbol. On its own, that byte is not a valid UTF-8 sequence; in UTF-8 the degree symbol is encoded as the two bytes 0xC2 0xB0. This is why the character vanishes when you use --from-code=UTF-8.
The solution is to use --from-code=windows-1252 or, better yet, to save all source files as UTF-8 and then use --from-code=UTF-8.
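The byte-level difference can be checked with a short Python sketch. It assumes the ANSI code page is Windows-1252, as guessed above; is_valid_utf8 is a hypothetical helper written for this illustration, not part of gettext.

```python
# The degree sign (U+00B0) in the two encodings under discussion.
cp1252_degree = "\u00b0".encode("cp1252")  # one byte
utf8_degree = "\u00b0".encode("utf-8")     # two bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(cp1252_degree.hex(), utf8_degree.hex())
# The Windows-1252 byte on its own is rejected by a UTF-8 decoder.
print(is_valid_utf8(b"Degree (" + cp1252_degree + b")"))
print(is_valid_utf8(b"Degree (" + utf8_degree + b")"))
```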

Concatenating files in Windows Command Prompt and the string "ï»¿"

I am concatenating files using Windows. I have used the TYPE and the COPY command and I get the same artifact. At the place where my original files are joined in the new file, the character string "ï»¿" (i.e. Decimal: 139 175 168 Hex: 8BAFA8) is inserted.
How can I troubleshoot this? Is there an easy way to avoid it, and why does it happen?
@Mark_Tolonen's answer already gives a very good explanation of why this happens, so I will not repeat it.
Instead of the older TYPE and COPY commands, you can use PowerShell:
powershell -Command "& { Get-Content a*.txt | Out-File output.txt -Encoding utf8 }"
This command reads all files matching a*.txt in the current folder and concatenates them into output.txt, encoded as UTF-8.
PowerShell is included with Windows 7 and later.
The extra bytes are a UTF-8 encoding signature. The Unicode byte order mark U+FEFF is encoded in UTF-8 and written to the beginning of the file to indicate the file is encoded in UTF-8. It's not required but Windows assumes a text file is encoded in the local ANSI encoding (commonly Windows-1252) unless a BOM appears.
Many file tools don't know about this (DOS copy being one of them), so concatenating files can be troublesome.
Today being ignorant of encodings often causes trouble. You can't simply concatenate two text files of unknown encoding...they may be different.
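To see where the stray characters come from, here is a small Python sketch of what a byte-wise copy does to two UTF-8 files that each start with a BOM. The file contents are invented for illustration, and Windows-1252 is assumed as the ANSI code page of the viewer.

```python
import codecs

bom = codecs.BOM_UTF8  # the UTF-8 signature, bytes EF BB BF

# Two hypothetical files, each saved as UTF-8 with a BOM.
file1 = bom + "first\n".encode("utf-8")
file2 = bom + "second\n".encode("utf-8")

# A tool that ignores encodings just appends the raw bytes,
# so the second BOM ends up in the middle of the result.
joined = file1 + file2

# A viewer that assumes Windows-1252 shows each BOM as three characters.
as_ansi = joined.decode("cp1252")
print(repr(as_ansi))
```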
If you know the encoding, use a tool that understands the encoding. Here's a very basic concatenate script written in Python that will convert encodings as well.
# cat.py
import sys
if len(sys.argv) < 5:
    print('usage: cat <in_encoding> <out_encoding> <outfile> <infile> [infile...]')
else:
    with open(sys.argv[3],'w',encoding=sys.argv[2]) as fout:
        for file in sys.argv[4:]:
            with open(file,'r',encoding=sys.argv[1]) as fin:
                fout.write(fin.read())
Given two files with UTF-8 w/ BOM encoding, this command will output UTF-8 (no BOM):
cat.py utf-8-sig utf-8 out.txt test1.txt test2.txt
Side note about Python: utf-8-sig encoding reads files and removes the BOM from the data if present, so it can be used to read any UTF-8 file with or without a BOM. utf-8-sig encoding writes a BOM at the start of a file, but utf-8 does not.
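The difference between the two codecs can be checked in memory, without touching any files. This is a minimal sketch; the sample string is arbitrary.

```python
text = "hello"

# Encoding: utf-8-sig prepends the BOM, utf-8 does not.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")
print(with_bom, without_bom)

# Decoding with utf-8-sig accepts both forms and drops a leading BOM.
from_bom = with_bom.decode("utf-8-sig")
from_plain = without_bom.decode("utf-8-sig")

# Decoding BOM-prefixed data with plain utf-8 keeps U+FEFF in the text.
kept = with_bom.decode("utf-8")
print(repr(from_bom), repr(from_plain), repr(kept))
```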

Converting from ANSI to UTF-8 using script

I have created a script (.sh file) to convert a CSV file from ANSI encoding to UTF-8.
The command I used is:
iconv -f "windows-1252" -t "UTF-8" $csvname -o $newcsvname
I got this from another Stack Overflow post, but the iconv command doesn't seem to be working.
Snapshot of input file contents in Notepad++
Snapshot of first CSV file below
Snapshot of second CSV file below
EDIT: I tried reducing the problematic input CSV file contents to a few lines (similar to the first file), and now it gets converted fine. Is there something wrong with the file contents itself then? How do I check that?
You can use the Python chardet character encoding detector to check which encoding the file is actually in, then pass that encoding to iconv:
iconv -f {character encoding} -t utf-8 {FileName} > {Output FileName}
This should work. Also check whether the file contains any junk characters, as these can cause the conversion to fail.
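As a rough way to look for such junk bytes, the sketch below lists every byte that cannot be decoded in a given single-byte encoding. It uses only the Python standard library rather than chardet, and find_undecodable is a hypothetical helper, so treat it as a starting point rather than a definitive check. It assumes a single-byte encoding such as Windows-1252; for multi-byte encodings the reported offsets may be misleading.

```python
def find_undecodable(data: bytes, encoding: str):
    """Return (offset, byte value) for each byte that fails to decode."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            offset = pos + exc.start  # exc.start is relative to the slice
            bad.append((offset, data[offset]))
            pos = offset + 1  # skip the bad byte and keep scanning
    return bad


# Invented sample: 0x81 and 0x9D are not assigned in Python's cp1252 codec.
sample = b"a,b\x81c\n\x9dz"
for offset, value in find_undecodable(sample, "cp1252"):
    print(f"offset {offset}: byte 0x{value:02X}")
```

To check a real file, read it in binary mode (open(path, "rb")) and pass the bytes to the function.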

How do I manipulate CSVs containing Unicode (Thai) characters using bash?

I've got an Adwords dump containing Thai keywords which I'll use for a join with data from another DB.
In theory, I grab the file, snip off the useless lines at the top and bottom, clean it up a little and upload it to PostgreSQL as a new table.
In practice, the characters get garbled on the way (actually, from the start) even though the file opens fine in Excel and OpenOffice. The below is true on both my local machine (running OSX) and the server (running Ubuntu).
First, I already set my locale to UTF-8:
$ echo "กระเป๋า สะพาย คอนเวิร์ส"
กระเป๋า สะพาย คอนเวิร์ส
However, looking at the CSV (let's assume it only contains the above string) on the CLI gives me this:
$ head file.csv
#0#2 *02" -#'4#L*
Any idea where the problem is?
The original file was in the wrong encoding.
$ file file.csv
file.csv: Little-endian UTF-16 Unicode English text
Quick fix:
$ iconv -f UTF-16 -t UTF-8 file.csv > file-utf8.csv
$ head file-utf8.csv
กระเป๋า สะพาย คอนเวิร์ส
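The same conversion can be sketched in Python, which also shows why the UTF-16 data looks garbled in a UTF-8 terminal. The helper name utf16_to_utf8 is hypothetical, and the sample data is built in memory to mimic a little-endian UTF-16 file with a BOM.

```python
import codecs

thai = "กระเป๋า สะพาย คอนเวิร์ส"

# Simulate the original file: BOM followed by little-endian UTF-16.
utf16_bytes = codecs.BOM_UTF16_LE + thai.encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 (byte order taken from the BOM) and re-encode as UTF-8."""
    return data.decode("utf-16").encode("utf-8")


utf8_bytes = utf16_to_utf8(utf16_bytes)

# The UTF-16 bytes are not valid UTF-8, which is why head shows garbage.
try:
    utf16_bytes.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False

print(readable_as_utf8, utf8_bytes.decode("utf-8"))
```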

How to grep for exact hexadecimal value of characters

I am trying to grep for the hexadecimal value of a range of UTF-8 encoded characters and I only want just that specific range of characters to be returned.
I currently have this:
grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt
But this returns every character that has any of those hex values in its hex representation, i.e. it returns anything from 00B9 to FFB9 as long as the B9 is present.
Is there a way I can specify using grep that I only want the exact/specific hex value range I search for?
Sample Input:
STRING_OPEN
Open
æ–­å¼€
Ouvert
Abierto
Открыто
Abrir
Now, using my grep statement, it should return the 3rd line and 6th line, but it also includes some text in my file that is Russian or Chinese, because the byte ranges for those languages include the hex values I'm searching for, like these:
断开
Открыто
I can't give out more sample input unfortunately as it's work related.
EDIT: Actually the below code snippet worked!
grep -P -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt
It found all the corrupted characters and there were no false positives. The only issue now is that the lines with the corrupted characters automatically get "uncorrupted", i.e. when I open the file, grep's output is the corrected version of the corrupted characters. For example, it finds æ–­å¼€ and in the text file it's shown as 断开.
Since you're using -P, you're probably using GNU grep, because that is a GNU grep extension. Your command works using GNU grep 2.21 with pcre 8.37 and a UTF-8 locale, however there have been bugs in the past with multi-byte characters and character ranges. You're probably using an older version, or it is possible that your locale is set to one that uses single-byte characters.
If you don't want to upgrade, it is possible to match this character range by matching individual bytes, which should work in older versions. You would need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE to something that uses single-byte characters (like C) will ensure that it will match individual bytes even in versions that correctly support multi-byte characters.
LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" $str_st_location >> output_st.txt
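The difference between the two byte patterns can be reproduced with Python's re module on raw bytes. This is only a sketch of the idea, not a replacement for the grep command. It assumes the corrupted text arose from UTF-8 bytes misread as Windows-1252 and saved again as UTF-8, which is a guess based on the sample in the question.

```python
import re

# Matches any single byte in B9..BF, including UTF-8 continuation bytes.
naive = re.compile(rb"[\xB9-\xBF]")
# Matches only the UTF-8 encodings of U+00B9..U+00BF (C2 B9 .. C2 BF).
precise = re.compile(rb"\xC2[\xB9-\xBF]")

russian = "Открыто".encode("utf-8")
chinese = "断开".encode("utf-8")
# Assumed corruption: UTF-8 bytes read as Windows-1252, then re-encoded.
corrupted = chinese.decode("cp1252").encode("utf-8")

for label, data in [("russian", russian), ("chinese", chinese), ("corrupted", corrupted)]:
    print(label, bool(naive.search(data)), bool(precise.search(data)))
```

The naive pattern matches all three samples, while the two-byte pattern matches only the corrupted one.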
