Converting from ANSI to UTF-8 using script - bash

I have created a script (.sh file) to convert a CSV file from ANSI encoding to UTF-8.
The command I used is:
iconv -f "windows-1252" -t "UTF-8" $csvname -o $newcsvname
I got this from another Stack Overflow post, but the iconv command doesn't seem to be working.
(The original question included screenshots of the input file contents in Notepad++ and of the first and second CSV files.)
EDIT: I tried reducing the problematic input CSV file to just a few lines (similar to the first file), and now it converts fine. Is there something wrong with the file contents themselves, then? How do I check that?

You can use the Python chardet character-encoding detector to determine the file's existing encoding, then convert with:
iconv -f {character encoding} -t utf-8 {FileName} > {Output FileName}
This should work. Also check whether the file contains any junk characters; they can cause the conversion to fail.
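For example, here is a minimal detection sketch using chardet (the file name input.csv is just a placeholder; install the library with pip install chardet):
# detect_encoding.py -- print the guessed encoding of a file
import chardet

with open('input.csv', 'rb') as f:   # read raw bytes, not text
    raw = f.read()

result = chardet.detect(raw)
# result looks like {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
print(result['encoding'], result['confidence'])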

Related

Concatenating files in Windows Command Prompt and the mystery three-byte string

I am concatenating files using Windows. I have used the TYPE and the COPY commands and I get the same artifact. At the place where my original files are joined in the new file, a three-byte character string (decimal 139 175 168, hex 8B AF A8) is inserted.
How can I troubleshoot this? Is there an easy explanation you can provide for how to avoid this? And why does this happen?
A very good explanation of why this happens is in Mark Tolonen's answer below, so I will not repeat it.
Instead of the obsolete TYPE and COPY commands, use PowerShell:
powershell -Command "& { Get-Content a*.txt | Out-File output.txt -Encoding utf8 }"
This command gets the content of all files matching a*.txt in the current folder and concatenates them into output.txt as UTF-8.
PowerShell is part of Windows 7 and later.
The extra bytes are a UTF-8 encoding signature. The Unicode byte order mark U+FEFF is encoded in UTF-8 and written to the beginning of the file to indicate the file is encoded in UTF-8. It's not required but Windows assumes a text file is encoded in the local ANSI encoding (commonly Windows-1252) unless a BOM appears.
Many file tools don't know about this (DOS copy being one of them), so concatenating files can be troublesome.
These days, being ignorant of encodings often causes trouble. You can't simply concatenate two text files of unknown encoding; the encodings may differ.
If you know the encoding, use a tool that understands the encoding. Here's a very basic concatenate script written in Python that will convert encodings as well.
# cat.py -- concatenate files, converting from <in_encoding> to <out_encoding>
import sys

if len(sys.argv) < 5:
    print('usage: cat <in_encoding> <out_encoding> <outfile> <infile> [infile...]')
else:
    with open(sys.argv[3], 'w', encoding=sys.argv[2]) as fout:
        for file in sys.argv[4:]:
            with open(file, 'r', encoding=sys.argv[1]) as fin:
                fout.write(fin.read())
Given two files with UTF-8 w/ BOM encoding, this command will output UTF-8 (no BOM):
cat.py utf-8-sig utf-8 out.txt test1.txt test2.txt
Side note about Python: utf-8-sig encoding reads files and removes the BOM from the data if present, so it can be used to read any UTF-8 file with or without a BOM. utf-8-sig encoding writes a BOM at the start of a file, but utf-8 does not.
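As a quick illustration of that difference (a small standalone sketch, not part of the original answer):
data = '\ufeffhello'.encode('utf-8')      # UTF-8 bytes beginning with an encoded BOM (EF BB BF)
print(repr(data.decode('utf-8-sig')))     # 'hello'        -- BOM stripped on read
print(repr(data.decode('utf-8')))         # '\ufeffhello'  -- BOM kept as a U+FEFF character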

How do I manipulate CSVs containing Unicode (Thai) characters using bash?

I've got an Adwords dump containing Thai keywords which I'll use for a join with data from another DB.
In theory, I grab the file, snip off the useless lines at the top and bottom, clean it up a little and upload it to PostgreSQL as a new table.
In practice, the characters get garbled on the way (actually, from the start) even though the file opens fine in Excel and OpenOffice. The below is true on both my local machine (running OSX) and the server (running Ubuntu).
First, my locale is already set to UTF-8, and Thai text displays correctly in the terminal:
$ echo "กระเป๋า สะพาย คอนเวิร์ส"
กระเป๋า สะพาย คอนเวิร์ส
However, looking at the CSV (let's assume it only contains the above string) on the CLI gives me this:
$ head file.csv
#0#2 *02" -#'4#L*
Any idea where the problem is?
The original file was in a different encoding than expected (UTF-16, not UTF-8):
$ file file.csv
file.csv: Little-endian UTF-16 Unicode English text
Quick fix:
$ iconv -f UTF-16 -t UTF-8 file.csv > file.utf8.csv
$ head file.utf8.csv
กระเป๋า สะพาย คอนเวิร์ส
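If you prefer to do the conversion in Python rather than with iconv, a minimal equivalent sketch (file names are placeholders; it assumes the file starts with a BOM, which lets the utf-16 codec pick the right byte order):
# Read UTF-16 (byte order taken from the BOM) and write plain UTF-8.
with open('file.csv', encoding='utf-16') as fin:
    text = fin.read()
with open('file.utf8.csv', 'w', encoding='utf-8') as fout:
    fout.write(text)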

Mac issue with file encoding

I have a script which is reading some data from one server and storing it in a file. But the file seems somehow corrupt. I can print it to the display, but checking the file with file produces
bash$ file -I filename
filename: text/plain; charset=unknown-8bit
Why is it telling me that the encoding is unknown? The first line of the file displays for me as
“The Galaxy A5 and A3 offer a beautifully crafted full metal unibody
A hex dump reveals that the first three bytes are 0xE2, 0x80, 0x9C followed by the regular ASCII text The Galaxy A5...
What's wrong? Why does file tell me the encoding is unknown, and what is it actually?
Based on the information in the question, the file is a perfectly good UTF-8 file. The first three bytes encode LEFT DOUBLE QUOTATION MARK (U+201C) aka a curly quote.
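You can verify this from a Python prompt (a quick check, not part of the original answer):
print(b'\xe2\x80\x9c'.decode('utf-8'))   # prints '“' -- U+201C LEFT DOUBLE QUOTATION MARK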
Maybe your version of file is really old.
You can use iconv to convert the file into the desired charset, e.g.
iconv --from-code=UTF8 --to-code=YOURTARGET
To get a list of supported targets, use the --list flag.

Where is the file needed for pdftotext output in UTF-8 format?

I want to use the XPDF-based pdftotext command-line tool to look at PDF files, hoping to get UTF-8 output. I have seen others on Stack Overflow getting it -- questions 4039930, 3809761 and 13618330 show that it is possible.
When I use the option -enc utf-8 these messages are displayed:
Syntax Error: Couldn't find unicodeMap file for the 'utf-8' encoding
Config Error: Couldn't get text encoding
I've seen documentation that (among others) UTF-8 encoding is "predefined" but I cannot find the file that I need to point to. (I've looked at multiple different downloads of XPDF-based software and have not yet found it.)
Any pointers would be appreciated.
EDIT: I am on Windows.
You should use UTF-8 instead of utf-8; the encoding names are case-sensitive. See the list of encodings pdftotext knows about:
$ pdftotext -listenc
Available encodings are:
UCS-2
ASCII7
Latin1
UTF-8
ZapfDingbats
Symbol
Proof code:
$ pdftotext -eol unix -nopgbrk -layout -enc utf-8 file.pdf
Syntax Error: Couldn't find unicodeMap file for the 'utf-8' encoding
Command Line Error: Couldn't get text encoding
$ pdftotext -eol unix -nopgbrk -layout -enc UTF-8 file.pdf
$ echo $?
0
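If you are driving pdftotext from a script, the working invocation above can be wrapped, for example (a sketch; file.pdf is a placeholder and pdftotext must be on the PATH):
import subprocess

# Same flags as the successful command above; pdftotext writes file.txt alongside file.pdf
# when no output file is given.
subprocess.run(
    ['pdftotext', '-eol', 'unix', '-nopgbrk', '-layout', '-enc', 'UTF-8', 'file.pdf'],
    check=True,
)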

Ruby mechanize: how to read a downloaded binary CSV file

I'm not very familiar with using Ruby for binary data. I'm using mechanize to download a large number of CSV files to my local disk. I then need to search these files for specific strings.
I use the save_as method in mechanize to save the file (which saves the file as binary). The content type of the file (according to mechanize) is:
application/vnd.ms-excel;charset=x-UTF-16LE-BOM
From here, I'm not sure how to read the file. I've tried reading it in as a normal file in ruby, but I just get the binary data. I've also tried just using standard unix tools (strings/grep) to try and search without any luck.
When I run the 'file' command on one of the files, I get:
foo.csv: Little-endian UTF-16 Unicode Pascal program text, with very long lines, with CRLF, CR, LF line terminators
I can see the data just fine with cat or vi. With vi I also see some control characters.
I've also tried both the csv and fastercsv Ruby libraries, but I get an 'IllegalFormatError' exception from both. I've also tried this solution without any luck.
Any help would be greatly appreciated. Thanks.
You can use the iconv command to convert to UTF-8:
# iconv -f 'UTF-16LE' -t 'UTF-8' bad_file.csv > good_file.csv
There is also a wrapper for iconv in the Ruby standard library; you could use that to convert the data after reading it into your program.
