Finding out file encoding - UTF-8

I am having trouble finding out the encoding of a file. It is the autocorrect list file (*.acl) of MS Office products. It consists of name/value pairs used to automatically substitute the name with the value. I looked into it using a hex editor and found the following mappings.
Char - hex in file - Unicode code point - UTF-8 hex
a - 00 61 - U+0061 - 61
A - 00 41 - U+0041 - 41
ä - 00 E4 - U+00E4 - C3 A4
© - 00 A9 - U+00A9 - C2 A9
The pattern I recognize is that the Unicode code point is simply stored as two bytes, instead of the character's UTF-8 byte sequence. (I validated this for ÄöÖüÜß and some other [a-zA-Z] characters, and it holds true for all of them. But I guess you don't need to see this to understand the pattern.)
Also, the byte sequence 00 00 00 09 seems to be used to separate keys from values, and 00 00 00 03 to separate the key/value pairs.
My motivation for looking into this in the first place: I want to manipulate this file myself using scripting. VBA is not an option, and I haven't managed to read the file into my script correctly yet. I hope you can help me understand the character encoding used.
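In case it is useful, this is roughly what I am trying, as a minimal Ruby sketch. It assumes the pattern above is UTF-16 with the byte order exactly as listed (two bytes per character, equal to the code point), and that the separators are the ones observed in the hex dump; the file name is just a placeholder:

# Read the .acl file as UTF-16 (byte order as shown in the dump above;
# swap to UTF-16LE if the real file stores the low byte first) and
# split it into key/value pairs. 'autocorrect.acl' is a placeholder.
text = File.binread('autocorrect.acl')
           .force_encoding('UTF-16BE')
           .encode('UTF-8')

text.split("\u0000\u0003").each do |record|      # 00 00 00 03 between pairs
  key, value = record.split("\u0000\u0009", 2)   # 00 00 00 09 between key and value
  puts "#{key.inspect} => #{value.inspect}"
end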
Thanks in advance for your time.

Related

JavaScript alternative to Ruby's force_encoding(Encoding::ASCII_8BIT)

I'm making my way through the Building Git book, but attempting to build my implementation in JavaScript. I'm stumped at the part about reading data with an encoding that apparently only Ruby uses. Here is the excerpt from the book about this:
Note that we set the string's encoding to ASCII_8BIT, which is Ruby's way of saying that the string represents arbitrary binary data rather than text per se. Although the blobs we'll be storing will all be ASCII-compatible source code, Git does allow blobs to be any kind of file, and certainly other kinds of objects — especially trees — will contain non-textual data. Setting the encoding this way means we don't get surprising errors when the string is concatenated with others; Ruby sees that it's binary data and just concatenates the bytes and won't try to perform any character conversions.
Is there a way to emulate this encoding in JS?
or
Is there an alternative encoding I can use that JS and Ruby share that won't break anything?
Additionally, I've tried using Buffer.from(< text input >, 'binary'), but it doesn't result in the same number of bytes that Ruby's ASCII-8BIT returns, because in Node.js 'binary' maps to ISO-8859-1.
Node certainly supports binary data; that's kind of what Buffer is for. However, it is crucial to know what you are converting into what. For example, the emoji "☺️" is encoded as six bytes in UTF-8:
// UTF-16 (JS) string to UTF-8 representation
Buffer.from('☺️', 'utf-8')
// => <Buffer e2 98 ba ef b8 8f>
If you happen to have a string that is not a native JS string (i.e. its encoding is different), you can use the encoding parameter to make Buffer interpret each character in a different manner (though only a handful of conversions are supported). For example, if we have a string of six characters that correspond to the six bytes above, it is not a smiley face to JavaScript, but Buffer.from can help us repackage it:
Buffer.from('\u00e2\u0098\u00ba\u00ef\u00b8\u008f', 'binary')
// => <Buffer e2 98 ba ef b8 8f>
JavaScript itself has only one encoding for its strings; thus, the parameter 'binary' is not really a binary encoding, but a mode of operation for Buffer.from: it tells it to treat the string as if each character were one byte (even though JavaScript internally uses UTF-16, so each character is actually stored in two bytes). Thus, if you use it on something that is not a string of characters in the range U+0000 to U+00FF, it will not do the correct thing, because there is no correct thing to do (GIGO principle). What it will actually do is take the low byte of each character, which is probably not what you want:
Buffer.from('STUFF', 'binary') // ASCII range: U+0053 U+0054 U+0055 U+0046 U+0046
// => <Buffer 53 54 55 46 46> ("STUFF")
Buffer.from('ＳＴＵＦＦ', 'binary') // fullwidth: U+FF33 U+FF34 U+FF35 U+FF26 U+FF26
// => <Buffer 33 34 35 26 26> (garbage)
So, Node's Buffer structure exactly corresponds to Ruby's ASCII-8BIT "encoding" (binary is an encoding like "bald" is a hair style — it simply means no interpretation is attached to bytes; e.g. in ASCII, 65 means "A"; but in binary "encoding", 65 is just 65). Buffer.from with 'binary' lets you convert weird strings where one character corresponds to one byte into a Buffer. It is not the normal way of handling binary data; its function is to un-mess-up binary data when it has been read incorrectly into a string.
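For comparison, here is the Ruby side of that equivalence, as a quick sketch: force_encoding relabels the same bytes without converting anything, which is exactly the state a Node Buffer is in.

# Same smiley as in the Buffer example above: two code points, six UTF-8 bytes.
s = "☺️"
s.length    # => 2 (characters, while the string is labeled UTF-8)
s.bytesize  # => 6

s.force_encoding(Encoding::ASCII_8BIT)  # relabels; no bytes are changed
s.length    # => 6 (each byte now counts as one "character", like a Buffer)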
I assume you are reading a file as a string, then trying to convert it to a Buffer, but your string is not actually in what Node considers the "binary" form: a sequence of characters in the range U+0000 to U+00FF, one character standing for one byte. (So "in Node.js binary maps to ISO-8859-1" is not really true, because ISO-8859-1 is a single-byte encoding of the bytes 0x00 to 0xFF, not a kind of string.)
Ideally, to have a binary representation of file contents, you would want to read the file as a Buffer in the first place (by using fs.readFile without an encoding), without ever touching a string.
(If my guess here is incorrect, please specify what the contents of your < text input > is, and how you obtain it, and in which case "it doesn't result in the same amount of bytes".)
EDIT: I seem to like typing Array.from too much. It's Buffer.from, of course.

GNU sort UTF-8 incorrect collation order

I am using GNU sort on Linux on a UTF-8 file, and some strings are not being sorted correctly. I have the LC_COLLATE variable set to en_US.UTF-8 in Bash. Here is a hex dump showing the problem.
5f ef ac 82 0a
5f ef ac 81 0a
5f ef ac 82 0a
5f ef ac 82 0a
These are four consecutive sorted lines. The 0a is the end of line. The order on the fourth byte is incorrect: the byte value 81 should not be between the 82 bytes. When this is displayed in the terminal window, the second line is a different character from the other three.
I doubt that this is a problem with the sort command because it is a GNU core utility, and it should be rock solid. Any ideas why this could be occurring? And why do I have to use hexdump to track down this problem; it's the 21st century already!
Using LC_COLLATE=C appears to be the only solution.
You can set this up for everything by editing /etc/default/locale
Unfortunately this loses a lot of the useful aspects of UTF-8 sorting, such as putting accented characters next to their base characters. But it is far less objectionable than the completely hideous mess the libc developers and the Unicode consortium made. They fail to understand the purpose of sorting, the need to preserve sort order when strings are concatenated, the need to always produce the same order, and how virtually every program in the world relies on this. Instead, they seem to feel it is important to "sort" typos, such as spaces inserted into the middle of names, by ignoring them (!).
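To see concretely what byte-order collation does with the four lines above, here is a quick illustration; Ruby is used only because its String#<=> compares raw bytes, which is the same ordering LC_COLLATE=C gives you.

# The four lines from the hex dump: "_" followed by U+FB02 (fl) or U+FB01 (fi).
lines = ["_\uFB02", "_\uFB01", "_\uFB02", "_\uFB02"]

lines.sort.map { |l| l.bytes.map { |b| format('%02x', b) }.join(' ') }
# => ["5f ef ac 81", "5f ef ac 82", "5f ef ac 82", "5f ef ac 82"]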
It was probably some kind of bug in the version you used. When I execute sort (version from GNU coreutils 8.30), it works as follows:
$ printf '\x5f\xef\xac\x82\x0a\x5f\xef\xac\x81\x0a\x5f\xef\xac\x82\x0a\x5f\xef\xac\x82\x0a' | LC_COLLATE=en_US.UTF-8 sort
_fi
_fl
_fl
_fl
which appears to work as expected. I didn't test whether it correctly handles NFC vs. NFD normalization forms, because I only use NFC myself.

EBCDIC to ASCII containing COMP types

I have seen many tools like Syncsort, Informatica, etc. that can efficiently convert EBCDIC mainframe files to ASCII.
Since our company is small and doesn't want to invest in any of those tools, I have the challenge of converting EBCDIC mainframe files to ASCII myself.
The upstream is a mainframe, and I am migrating the entire data set into HDFS; but since HDFS cannot handle mainframe formats directly, I have been asked to write a Spark/Java routine to convert these mainframe EBCDIC files.
I understand that when the file is exported, the text fields get converted to ASCII, but packed decimal (COMP/COMP-3) fields do not.
I need to write logic to convert these partially converted mainframe EBCDIC files to ASCII so that we can do our further processing in Hadoop.
Since I am new on this site and can't even attach my sample EBCDIC file, please consider the content below as a sample; it contains ASCII as well as junk characters.
The junk after the salary field is the Dept field, which has a COMP-3 data type. Below is the emp.txt file:
101GANESH 10000á?
102RAMESH 20000€
103NAGESH 40000€
Below is the copybook, empcopybook:
01 EMPLOYEE-DETAILS.
05 EMP-ID PIC 9(03).
05 EMP-NAME PIC X(10).
05 EMP-SAL PIC 9(05).
05 DEPT PIC 9(3) COMP-3.
There is a Java library called JRecord that you can use with Spark to convert binary EBCDIC files to ASCII.
You can find the code from this author here.
It is possible to integrate it with Scala via the newAPIHadoopFile function to run it in Spark. The code is written for Hadoop, but it will work fine with Spark.
There is also this option (it also uses JRecord):
https://wiki.cask.co/display/CE/Plugin+for+COBOL+Copybook+Reader+-+Fixed+Length
It is based on CopybookHadoop, which looks to be a clone of the CopybookInputFormat that Thiago mentioned.
Anyway, from the documentation:
This example reads data from a local binary file "file:///home/cdap/DTAR020_FB.bin" and parses it using the schema given in the text area "COBOL Copybook". It will drop the field "DTAR020-DATE" and generate structured records with the schema specified in the text area.
{
  "name": "CopybookReader",
  "plugin": {
    "name": "CopybookReader",
    "type": "batchsource",
    "properties": {
      "drop": "DTAR020-DATE",
      "referenceName": "Copybook",
      "copybookContents":
        "000100* \n
         000200* DTAR020 IS THE OUTPUT FROM DTAB020 FROM THE IML \n
         000300* CENTRAL REPORTING SYSTEM \n
         000400* \n
         000500* CREATED BY BRUCE ARTHUR 19/12/90 \n
         000600* \n
         000700* RECORD LENGTH IS 27. \n
         000800* \n
         000900 03 DTAR020-KCODE-STORE-KEY. \n
         001000    05 DTAR020-KEYCODE-NO PIC X(08). \n
         001100    05 DTAR020-STORE-NO PIC S9(03) COMP-3. \n
         001200 03 DTAR020-DATE PIC S9(07) COMP-3. \n
         001300 03 DTAR020-DEPT-NO PIC S9(03) COMP-3. \n
         001400 03 DTAR020-QTY-SOLD PIC S9(9) COMP-3. \n
         001500 03 DTAR020-SALE-PRICE PIC S9(9)V99 COMP-3. ",
      "binaryFilePath": "file:///home/cdap/DTAR020_FB.bin",
      "maxSplitSize": "5"
    }
  }
}
You can use Cobrix, which is a COBOL data source for Spark. It is open-source.
You can use Spark to load the files, parse the records and store them in any format you want, including plain text, which seems to be what you are looking for.
DISCLAIMER: I work for ABSA and I am one of the developers behind this library. Our focus is on 1) ease of use, 2) performance.
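Whichever library you pick, the heart of the problem is just unpacking COMP-3 nibbles: each byte holds two BCD digits, and the last nibble is the sign. Here is a minimal sketch of that logic, in Ruby purely for illustration (the production routine would be the Spark/Java code discussed above):

# Decode a COMP-3 (packed decimal) field such as DEPT PIC 9(3) COMP-3.
# unpack1('H*') yields one hex digit per nibble; the final nibble is
# the sign: 0xC = positive, 0xD = negative, 0xF = unsigned.
def unpack_comp3(bytes)
  nibbles = bytes.unpack1('H*')   # e.g. "\x01\x0C" => "010c"
  value   = nibbles[0..-2].to_i
  nibbles[-1] == 'd' ? -value : value
end

unpack_comp3("\x01\x0C".b)  # => 10 (a two-byte COMP-3 holding +010)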

Why does this Unicode / UTF-8 "En Dash" character in my JSON feed get mangled when I download it?

My JSON feed is here:
http://america.aljazeera.com/bin/ajam/api/story.json?path=/content/ajam/watch/shows/america-tonight/articles/2014/4/28/the-dark-side-oftheoilboomhumantraffickingintheheartland
It is a JSON representation of this HTML page; you can see the same en dash character in the subtitle of the page.
http://america.aljazeera.com/watch/shows/america-tonight/articles/2014/4/28/the-dark-side-oftheoilboomhumantraffickingintheheartland.html
The En Dash is in the 2nd key (description):
description: "In a North Dakota town that was once dying, oil and money are flowing – and bringing big-city problems",
after the word "flowing".
The page has the following HTTP header:
Content-Type: application/json;charset=UTF-8
which can be seen by requesting it via curl -v or curl -I
Downloading it in Ruby using HTTParty like so:
> r = HTTParty.get('http://america.aljazeera.com/bin/ajam/api/story.json?path=/content/ajam/watch/shows/america-tonight/articles/2014/4/28/the-dark-side-oftheoilboomhumantraffickingintheheartland')
> r['description']
=> "In a North Dakota town that was once dying, oil and money are flowing –\u0080\u0093 and bringing big-city problems"
mangles it, as seen above. After much research I realized this is a representation of the character's UTF-8 hex value, as seen here:
http://www.fileformat.info/info/unicode/char/2013/index.htm
specifically, this:
UTF-8 (hex) 0xE2 0x80 0x93 (e28093)
This data is later fed into an iPhone app and an Android app. On the Android app it looks like the attached screenshot. On an iPhone it looks fine; I think that is because only the first character is rendered, and that is a regular ASCII dash, while the next two characters are skipped.
Finally, downloading it in JavaScript using AJAX does seem to handle it correctly:
> r = json['description'].match(/flowing (.*) and/)[1]
> "–"
> r
> "–"
> r.length
> 3
> r.toString(16)
> "–"
So...what is going on? What can I do to fix it? Is the fault with the server or with my code?
The JSON feed you're using failed to interpret \u2013 correctly. Instead of generating the desired UTF-8 encoded byte sequence:
E2 80 93
it generated:
E2 80 93 C2 80 C2 93
The reason the iPhone app works fine may be that it ignores the control characters C2 80 and C2 93, while the Android app renders them as visible glyphs.
You'll need to clean those spurious sequences manually if you don't have control of the JSON feed.
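For example, one option after parsing is to strip the stray C1 control characters, as a sketch in Ruby (assuming the rest of the feed is valid UTF-8 and contains no legitimate C1 characters):

# U+0080 and U+0093 are C1 control characters left over from the botched
# encoding; removing the whole C1 range is safe for normal prose.
description = r['description'].gsub(/[\u0080-\u009F]/, '')
# => "In a North Dakota town that was once dying, oil and money are flowing – and bringing big-city problems"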

Convert signed COBOL number in Ruby

I have an ASCII file that is a dump of data from a COBOL-based system.
There is a field that the docs say is PIC S9(3)V9(7).
Here are two examples of the field in hex (and ASCII) and the resulting number it supposedly represents (taken from another source).
Hex                              Reported value
30 32 38 36 38 35 38 34 35 46    28.687321    (ASCII: 028685845F)
30 39 38 34 35 36 31 33 38 43    -98.480381   (ASCII: 098456138C)
I'm using Ruby, and even after adding the implied decimal I seem to be getting the numbers wrong. I'm trying to parse the IBM COBOL docs, but I would appreciate help.
Given an implied-decimal COBOL field of "PIC S9(3)V9(7).", how can I convert it into a signed float using Ruby?
Assuming the data bytes have been run through a dumb EBCDIC-to-ASCII translator, those two values are +28.6858456 and +98.4561383. Which means whatever generated that "reported value" column is either broken or using different bytes as its source.
It looks like the reported values might have been run through a low-precision floating-point conversion, but that still doesn't explain the wrong sign on the second one.
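To make the arithmetic concrete, here is a minimal Ruby sketch of undoing the overpunch after such a translation (Ruby because the question asks for it). The trailing letter encodes both the sign and the final digit: EBCDIC C6, meaning "+6", comes out as ASCII 'F'. The mapping below is the common IBM convention, and it is an assumption about how this particular dump was produced:

# Overpunch mapping after naive EBCDIC-to-ASCII translation:
# 'A'..'I' => +1..+9, 'J'..'R' => -1..-9, '{' => +0, '}' => -0;
# a plain trailing digit means the field was unsigned.
OVERPUNCH = {}
('A'..'I').each_with_index { |c, i| OVERPUNCH[c] = [+1, i + 1] }
('J'..'R').each_with_index { |c, i| OVERPUNCH[c] = [-1, i + 1] }
OVERPUNCH['{'] = [+1, 0]
OVERPUNCH['}'] = [-1, 0]

# PIC S9(3)V9(7) has 10 digits, with 7 after the implied decimal point.
def zoned_to_number(str, scale)
  sign, last = OVERPUNCH.fetch(str[-1]) { [+1, str[-1].to_i] }
  sign * (str[0..-2] + last.to_s).to_i / 10.0**scale
end

zoned_to_number('028685845F', 7)  # => 28.6858456
zoned_to_number('098456138C', 7)  # => 98.4561383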
As Mark Reed says, I think the numbers are +28.6858456 and +98.4561383.
But you can refer to this amazing doc for signed numbers in ASCII and EBCDIC:
EBCDIC to ASCII Conversion of Signed Fields
Hope it helps you.
028685845F
098456138C
It's likely that the two ASCII strings were converted from EBCDIC.
These are zoned decimal numbers, with the sign nibble turned into part of the final byte. Like others have said, the F and C are the sign nibbles.
Check this webpage http://www.simotime.com/datazd01.htm
F is for "unsigned"
C is for "signed positive"
The PIC S9(3)V9(7) is telling you that it's ddd.ddddddd (3 digits before the decimal point, 7 digits after, and the whole thing is signed).
It's possible that the two "strings" have different PICs; you will need to check the COBOL source that produced the numbers.
It would be best to get the original hexadecimal dump of the COBOL data (likely in EBCDIC) and post that. (But I also realize this is a 7.5-year-old post; the OP has probably moved on already.) What I wrote above is for whoever bumps into this thread in the future.
