How do I manipulate CSVs containing Unicode (Thai) characters using bash? - bash

I've got an Adwords dump containing Thai keywords which I'll use for a join with data from another DB.
In theory, I grab the file, snip off the useless lines at the top and bottom, clean it up a little and upload it to PostgreSQL as a new table.
In practice, the characters get garbled along the way (actually, right from the start), even though the file opens fine in Excel and OpenOffice. The following is true on both my local machine (running OS X) and the server (running Ubuntu).
First, my locale is already set to UTF-8, and Thai text echoes correctly in the terminal:
$ echo "กระเป๋า สะพาย คอนเวิร์ส"
กระเป๋า สะพาย คอนเวิร์ส
However, looking at the CSV (let's assume it only contains the above string) on the CLI gives me this:
$ head file.csv
#0#2 *02" -#'4#L*
Any idea where the problem is?

The original file was in the wrong encoding.
$ file file.csv
file.csv: Little-endian UTF-16 Unicode English text
Quick fix (iconv writes to stdout, so redirect it to a new file):
$ iconv -f UTF-16 -t UTF-8 file.csv > file.utf8.csv
$ head file.utf8.csv
กระเป๋า สะพาย คอนเวิร์ส
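For completeness, the rest of the pipeline described in the question might look something like the sketch below. The number of lines to trim, the file names, and the database and table names are all made up for illustration, and head -n -2 needs GNU coreutils (fine on Ubuntu):
$ iconv -f UTF-16 -t UTF-8 adwords.csv | sed '1,5d' | head -n -2 > keywords_utf8.csv
$ psql -d mydb -c "\copy keywords FROM 'keywords_utf8.csv' WITH (FORMAT csv)"
The iconv step fixes the encoding, sed drops the (hypothetical) five report lines at the top, head drops the two summary lines at the bottom, and psql's \copy loads the rows into the target table (which must already exist).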

Convert mangled characters back to UTF-8

Here is what I did:
I dumped a SQLite database with UTF-8 data (sqlite3 example.db .dump > dump.sql), but since this was in PowerShell, I assume the piping converted it to windows-1252.
I loaded that dumped data into a new database, again using PowerShell (Get-Content dump.sql | sqlite3 example2.db).
I dumped that new database and was left with a new .sql file (this time not through PowerShell, so I assume it was unmodified).
This new sql file's UTF-8 characters are seriously mangled, and I was wondering if there was a way to convert it back into correct UTF-8.
As a few examples, here are what some sequences are in the new file, and what they should be (all are viewed as UTF-8):
ÒüéÒü¬ÒüƒÒü½ should be あなたに
´╝ü should be a full width exclamation mark
Òé¡Òé╗Òé¡ should be キセキ
Does anyone have any idea as to how I might undo this mangling? Any method would be very helpful!
This is in PowerShell 7.0.1.
Edit:
On further inspection, you can duplicate my predicament by redirecting any such data to a file in PowerShell (note that the data cannot itself be entered in PowerShell). Hence, setting up a script like this gives the same outcome:
test.sh
#!/bin/bash
echo "キ"
And then running wsl ./test.sh > test.txt will give an output of Òé¡, not キ
Edit 2:
It seems as if the codepage the UTF-8 text was converted to is almost 437: some characters are restored using this assumption (e.g. 木), but others are not. If it's close to 437, but isn't, what could it be?
It turns out, since I am in the UK, the codepage I wanted was 850. Saving the file as 850 and then reloading it as UTF-8 fixed my issue!
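In shell terms, the round trip can be done with iconv: re-encode the mangled (but valid) UTF-8 text into code page 850, and the bytes that come out are the original UTF-8 again. A sketch, with made-up file names:
$ iconv -f UTF-8 -t CP850 dump_mangled.sql > dump_fixed.sql
$ file dump_fixed.sql    # should now report UTF-8 Unicode text
For other locales the culprit code page would differ (e.g. 437 in the US), so the -t value is whatever legacy OEM code page the console was using.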

Sending script and file content via STDIN

I generate (dynamically) a script by concatenating the following files:
testscript1
echo Writing File
cat > /tmp/test_file <<EOF
testcontent
line1
second line
testscript2
EOF
echo File is written
And I execute by calling
$ cat testscript1 testcontent testscript2 | ssh remote_host bash -s --
The effect is that the file /tmp/test_file is filled with the desired content.
Is there also a conceivable variant where binary files can be supplied in a similar fashion? Instead of cat, dd or other tools could of course be used, but the problem I see is 'telling' them that STDIN has ended (can I send ^D through that stream?).
I can't get my head around this problem, and I suspect there is no comparable solution. However, I might be wrong, so I'd be happy to hear from you.
Regards,
Mazze
can I send ^D through that stream
Yes but you don't want to.
Control+D, commonly notated ^D, is just a character -- or to be pedantic (as I often am), a codepoint in the usual character code (ASCII or a superset like UTF-8) that we treat as a character. You can send that character/byte by a number of methods, most simply printf '\004', but the receiving system won't treat it as end-of-file; it will instead be stored in the destination file, just like any other data byte, followed by the subsequent data that you meant to be a new command and file etc.
^D only causes end-of-file when input from a terminal (more exactly, a 'tty' device) -- and then only in 'cooked' mode (which is why programs like vi and less can do things very different from ending a file when you type ^D). The form of ssh you used doesn't make the input a 'tty' device. ssh can make the input (and output) a 'tty' (more exactly a subclass of 'tty' called a pseudo-tty or 'pty', but that doesn't matter here) if you add the -t option (in some situations you may need to repeat it as -t -t or -tt). But then if your binary file contains any byte with the value \004 -- or several other special values -- which is quite possible, then your data will be corrupted and garbage commands executed (sometimes), which definitely won't do what you want and may damage your system.
The traditional approach to what you are trying to do, back in the 1980s and 1990s, was 'shar' (shell archive), and the usual solution for handling binary data was 'uuencode', which converts binary data into only printable characters that can safely go through a link like this, matched by 'uudecode' which converts it back. See this surviving example from GNU. uuencode and uudecode were themselves part of the 'uucp' communication suite, used mostly for email and Usenet, all of which are now mostly obsolete and forgotten.
However, nearly all systems today contain a 'base64' program which provides equivalent (though not identical) functionality. Within a single system you can do:
base64 <infile | base64 -d >outfile
to get the same effect as cp infile outfile. In your case you can do something like:
{ echo "base64 -d <<END# >outfile"; base64 <infile; echo "END#"; otherstuff; } | ssh remote bash
You can also try:
cat testscript1 testcontent testscript2 | base64 | ssh <options> "base64 --decode | bash"
Don't worry about ^D: when your input is exhausted, the processes further along the pipeline will simply see end-of-file on their input.
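Putting that together for the original testscript1 / testcontent / testscript2 layout, a sketch might look like the following; payload.bin is a made-up name for the binary file to be transferred:
{
  echo 'echo Writing File'
  echo 'base64 -d >/tmp/test_file <<END#'    # END# contains '#', which never occurs in base64 output
  base64 < payload.bin                       # the binary payload, rendered as printable text
  echo 'END#'
  echo 'echo File is written'
} | ssh remote_host bash -s --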

Converting from ANSI to UTF-8 using script

I have created a script (.sh file) to convert a CSV file from ANSI encoding to UTF-8.
The command I used is:
iconv -f "windows-1252" -t "UTF-8" $csvname -o $newcsvname
I got this from another Stack Overflow post.
but the iconv command doesn't seem to be working.
(Snapshots of the input file contents in Notepad++ and of the first and second CSV files omitted.)
EDIT: I tried reducing the problematic input CSV file contents to a few lines (similar to the first file), and now it gets converted fine. Is there something wrong with the file contents itself then? How do I check that?
You can use the Python chardet character encoding detector to determine the file's existing character encoding, then convert:
iconv -f {character encoding} -t utf-8 {FileName} > {Output FileName}
This should work. Also check whether any junk characters exist in the file; they may cause errors during the conversion.
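A sketch of that two-step approach, assuming the chardet package is installed (it ships with a chardetect command-line tool) and using made-up file names:
$ enc=$(chardetect input.csv | awk '{print $2}')    # e.g. "windows-1252"
$ iconv -f "$enc" -t UTF-8 input.csv > output.csv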

BASH: Replacing special character groups

I have a rather tricky request...
We use a special application which is connected to an Oracle database. For control reasons the application uses special characters, which are defined by the application and saved in a LONG field of the database.
My task is to query the LONG field periodically and check for changes. To do that, I use a bash script to write the content to a file and compare the old and the new file with md5sum.
When there's a difference, I want to send the old file via mail. The problem is that the old file contains these special characters, and I don't know how to replace them with, for example, a string that describes them.
I tried to replace them on the basis of their ASCII code, but that didn't work. I've also tried to replace them by their appearance in the file (they look like this: ^P). That didn't work either.
When viewing the file in a text editor like nano, the characters are visible as described above. But when using cat on the file, the content is only displayed up to the first occurrence of such a control character.
As far as I know there is no way to replace them while querying the database, because the content is in a LONG field.
I hope you can help me.
Thank you in advance.
Marco
^P is the Control-P character, which is decimal 16 or hexadecimal 0x10, also known as the Data Link Escape (DLE) character in ASCII.
To replace all occurrences of 0x10 in a file with another string we can use our friend gsed:
gsed "s/\x10/Data Link Escape/g" yourfile.txt
This should replace all occurrences of characters containing the hex value 0x10 with the text string "Data Link Escape". You'll probably want to use a different string - this is just an example.
Depending on the system you're using, you may be able to use the standard sed command if your version of sed recognizes the \xNN single-character escape codes. If there are multiple hex characters you need to replace, you may want to create a file containing your sed commands, one for each hexadecimal character you need to replace, and tell sed or gsed to use the commands in that file - consult the sed or gsed man pages for how to do this.
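For example, a command file along those lines could look like this; the control characters beyond 0x10 and the replacement strings are just placeholders:
controls.sed
s/\x10/<DLE>/g
s/\x01/<SOH>/g
s/\x02/<STX>/g
and then run it with:
gsed -f controls.sed yourfile.txt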
Share and enjoy.
You can use xxd to change the string to its hex representation, then use xxd -r to convert back.
Or, you can use uuencode and uudecode.
One option is to run the file through cat -v. This replaces nonprinting characters with visible representations (using the ^ notation for control characters):
$ echo $'\x10\x12\x13\x14\x16' | cat -v
^P^R^S^T^V
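Tying this back to the workflow in the question, a minimal sketch of the compare-and-mail step might be the following; the file names, the replacement string, and the mail recipient are assumptions, and the \x10 escape again needs GNU sed or gsed:
old_sum=$(md5sum < old_dump.txt | awk '{print $1}')
new_sum=$(md5sum < new_dump.txt | awk '{print $1}')
if [ "$old_sum" != "$new_sum" ]; then
    # make the control characters readable before mailing the old file
    gsed 's/\x10/<DLE>/g' old_dump.txt | mail -s "LONG field changed" admin@example.com
fi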

using sed to change <CR><LF> to a symbol

I am working on Windows Vista with GnuWin32 (sed 4.2.1 and coreutils 5.3.0). I also have the ActivePerl 5.14.2 package.
I have a large multi-record file. The end of each record in the file is denoted with four dollar signs ($$$$). Within each logical record there are many CRLFs.
I would like to replace all instances of CRLF with a symbol such as |+|. Then I will replace $$$$ with CRLF. The result: one record per row for import into Excel for further manipulation.
I've tried several methods for transforming CRLF to |+| but without success.
For example, one method was: sed -e "s/[\r\n]/|+|/g" source_file_in target_file_out
Another method used tr -d to delete \r and then a second statement: sed -e "s/\n/|+|/g" source_file_in target_file_out
The tr statement worked; the sed statement did not.
I've read the following articles but don't see how to adapt them to replace \r\n with a symbol like |+|.
sed: how to replace CR and/or LF with "\r" "\n", so any file will be in one line
Replace string that contains CRLF?
How can I replace a newline (\n) using sed?
If this problem cannot be solved easily using sed (and tr), then I'll use Perl if someone shows me how.
Thank you Ed for your recommendation.
The awk script is not yet working completely, so I'll add some missing detail with the hope that you can fine tune your recommendation.
First, I'm running gawk v3.1.6.2962. I believe there may be differences in awk implementations, so this may be a useful bit of information.
Next, some more information about the type of data and origin of the data.
The data is about chemicals (text data that is input to a stereo-chemical drawing program).
The chemical files are in an .sdf format.
When I open "133711.sdf" in NotePad++ (using View/Show symbol/Show all characters), I see data that is shown in the screen shot:
https://dl.dropbox.com/u/3094317/_master_1_screen_shot_.png
As you see, LF only - no CR.
I believe this means that the origin of the .sdf files is a UNIX system.
Next, I run the Windows command COPY *.sdf _master_2_.txt. That creates the very large file-of-files that I want to parse into records.
_master_2_.txt has the same structure as 133711.sdf - LF only; no CR.
Then, I run your awk recommendation in a .BAT file. I need to replace your single quotes with double quotes because Microsoft made me.
awk -v FS="\r\n" -v OFS="|+|" -v RS="\$\$\$\$" -v ORS="\r\n" "{$1=$1}1" C:\_master_2_.txt >C:\output.txt
I've attached a screenshot of output.txt:
https://dl.dropbox.com/u/3094317/output.txt.png
As you can see, the awk command did not successfully replace "\r\n" with "|+|".
Further, Windows created the output.txt with CRLF.
It did successfully replace the four $ with CRLF.
Is this information adequate to update your awk recommendation to handle the Windows-related issues?
Try this with GNU awk:
awk -v FS='\r\n' -v OFS='|+|' -v RS='\\$\\$\\$\\$' -v ORS='\r\n' '{$1=$1}1' file
I see from your updated question that you're on Windows. To avoid ridiculous quoting rules and issues, put this in a file named "whatever.awk":
BEGIN{FS="\r\n"; OFS="|+|"; RS="\\$\\$\\$\\$"; ORS="\r\n"} {$1=$1}1
and run it as
awk -f whatever.awk file
and see if that does what you want.
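If the merged file really is LF-only (as the Notepad++ screenshot suggests), a variant of the same idea that splits records on plain \n and also swallows the newlines around the $$$$ separator may be worth trying. This is only a sketch, and whatever2.awk is a made-up name:
whatever2.awk
BEGIN {
    RS  = "\n?\\$\\$\\$\\$\n?"   # a record ends at $$$$, including the newlines around it
    FS  = "\n"                   # each line of a record is one field
    OFS = "|+|"                  # rejoin the lines with the |+| marker
    ORS = "\r\n"                 # terminate each output record with CRLF for Excel
}
{ $1 = $1; print }
Run it the same way: awk -f whatever2.awk _master_2_.txt > output.txt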
