EBCDIC to ASCII containing COMP types - hadoop

I have seen many tools such as Syncsort and Informatica that can efficiently convert EBCDIC mainframe files to ASCII.
Since our company is small and does not want to invest in any of those tools, I have the challenge of converting EBCDIC mainframe files to ASCII myself.
The upstream systems are mainframes and I am migrating the entire data set into HDFS, but since HDFS cannot handle mainframe formats natively I have been asked to
write a Spark/Java routine to convert these mainframe EBCDIC files.
I understand that when the file is exported, the text portion gets converted to ASCII, but the packed-decimal (COMP/COMP-3) fields do not get converted.
I need to write logic to convert these partially converted mainframe EBCDIC files to ASCII so that we can do our further processing in Hadoop.
Since I am new to this site I cannot attach my sample EBCDIC file, so please consider the content below as a sample file, which contains ASCII as well as junk characters.
The junk appears after the salary field; that is the Dept field, which has a COMP-3 data type. Below is the emp.txt file:
101GANESH 10000á?
102RAMESH 20000€
103NAGESH 40000€
Below is the empcopybook:
01 EMPLOYEE-DETAILS.
05 EMP-ID PIC 9(03).
05 EMP-NAME PIC X(10).
05 EMP-SAL PIC 9(05).
05 DEPT PIC 9(3) COMP-3.

There is a Java library called JRecord that you can use with Spark to convert binary EBCDIC files to ASCII.
You can find example code from this developer here.
It can be integrated with Spark through the newAPIHadoopFile function. The code is written for Hadoop, but it will work fine with Spark.
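For illustration, here is a minimal Java sketch of wiring such an InputFormat into Spark with newAPIHadoopFile. The CopybookInputFormat class, its key/value types and the "copybook.path" configuration key are assumptions, not part of this thread; check the JRecord-based library you actually use for the real names.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class EbcdicToAsciiJob {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("EbcdicToAscii"));

        Configuration hadoopConf = new Configuration();
        // Hypothetical configuration key: tells the InputFormat where the copybook lives.
        hadoopConf.set("copybook.path", "hdfs:///copybooks/empcopybook.cbl");

        // CopybookInputFormat and its LongWritable/Text key-value pair are assumed here;
        // substitute the actual InputFormat class shipped with the library you use.
        JavaPairRDD<LongWritable, Text> records = sc.newAPIHadoopFile(
                "hdfs:///data/emp_ebcdic.bin",
                CopybookInputFormat.class,
                LongWritable.class,
                Text.class,
                hadoopConf);

        // Each value is one decoded record; write it back out as plain ASCII text.
        records.values().map(Text::toString).saveAsTextFile("hdfs:///data/emp_ascii");

        sc.stop();
    }
}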

There is also this option (it also uses JRecord):
https://wiki.cask.co/display/CE/Plugin+for+COBOL+Copybook+Reader+-+Fixed+Length
It is based on CopybookHadoop, which looks to be a clone of the CopybookInputFormat that Thiago mentioned.
Anyway, from the documentation:
This example reads data from a local binary file "file:///home/cdap/DTAR020_FB.bin" and parses it using the schema given in the text area "COBOL Copybook"
It will drop field "DTAR020-DATE" and generate structured records with schema as specified in the text area.
{
  "name": "CopybookReader",
  "plugin": {
    "name": "CopybookReader",
    "type": "batchsource",
    "properties": {
      "drop": "DTAR020-DATE",
      "referenceName": "Copybook",
      "copybookContents":
        "000100* \n
         000200* DTAR020 IS THE OUTPUT FROM DTAB020 FROM THE IML \n
         000300* CENTRAL REPORTING SYSTEM \n
         000400* \n
         000500* CREATED BY BRUCE ARTHUR 19/12/90 \n
         000600* \n
         000700* RECORD LENGTH IS 27. \n
         000800* \n
         000900 03 DTAR020-KCODE-STORE-KEY. \n
         001000 05 DTAR020-KEYCODE-NO PIC X(08). \n
         001100 05 DTAR020-STORE-NO PIC S9(03) COMP-3. \n
         001200 03 DTAR020-DATE PIC S9(07) COMP-3. \n
         001300 03 DTAR020-DEPT-NO PIC S9(03) COMP-3. \n
         001400 03 DTAR020-QTY-SOLD PIC S9(9) COMP-3. \n
         001500 03 DTAR020-SALE-PRICE PIC S9(9)V99 COMP-3. ",
      "binaryFilePath": "file:///home/cdap/DTAR020_FB.bin",
      "maxSplitSize": "5"
    }
  }
}

You can use Cobrix, which is a COBOL data source for Spark. It is open-source.
You can use Spark to load the files, parse the records and store them in any format you want, including plain text, which seems to be what you are looking for.
DISCLAIMER: I work for ABSA and I am one of the developers behind this library. Our focus is on 1) ease of use, 2) performance.
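As a minimal sketch of what reading the file with Cobrix can look like from Spark's Java API: the "cobol" format name and the "copybook" option follow Cobrix's documented usage, but the paths below are placeholders and the details should be checked against the Cobrix version you use.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CobrixExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CobrixEbcdicToText")
                .getOrCreate();

        // "cobol" is the Cobrix data source; the copybook drives the record parsing.
        Dataset<Row> df = spark.read()
                .format("cobol")
                .option("copybook", "hdfs:///copybooks/empcopybook.cbl")
                .load("hdfs:///data/emp_ebcdic.bin");

        // Persist the parsed records as plain text (CSV) for further processing in Hadoop.
        df.write().option("header", "true").csv("hdfs:///data/emp_ascii_csv");

        spark.stop();
    }
}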

Unpacking COMP-3 digit using Record Editor/Jrecord

I have created a layout based on a COBOL copybook.
Layout snapshot:
I tried to load the data selecting the same layout, but it gives me the wrong result for some columns. I tried using all the binary numeric types.
CLASS-ORDER-EDGE
DIV-NO-EDG
OFFICE-NO-EDG
REG-AREA-NO-EDG
CITY-NO-EDG
COUNTY-NO-EDG
BILS-COUNT-EDG
REV-AMOUNT-EDG
USAGE-QTY-EDG
GAS-CCF-EDG
Result snapshot:
The input file can be found here:
https://drive.google.com/open?id=0B-whK3DXBRIGa0I0aE5SUHdMTDg
Expected output:
Related thread: Unpacking COMP-3 digit using Java
First problem: you have done an EBCDIC --> ASCII conversion on the file!
The EBCDIC --> ASCII conversion will try to convert binary fields as well as text.
For example:
Comp-3 value   Hex        Hex after ASCII conversion
400            x'400c'    x'200c'

x'40' is the EBCDIC space character; it gets converted to the ASCII space character x'20'.
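A small Java illustration of that corruption, treating the translation as a plain character-set conversion (Cp037 is the US EBCDIC code page; the byte values match the table above):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Comp3Corruption {
    public static void main(String[] args) {
        byte[] comp3 = { 0x40, 0x0C };   // packed decimal +400, i.e. x'400C'

        // Wrongly treat the binary field as EBCDIC text and translate it to a single-byte encoding.
        String asText = new String(comp3, Charset.forName("Cp037"));
        byte[] corrupted = asText.getBytes(StandardCharsets.ISO_8859_1);

        System.out.printf("before: %02X%02X%n", comp3[0], comp3[1]);          // 400C
        System.out.printf("after : %02X%02X%n", corrupted[0], corrupted[1]);  // 200C -- value destroyed
    }
}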
You need to do a binary transfer, keeping the file as EBCDIC:
Check the file on the mainframe: if it has RECFM=FB you can do a straight binary transfer.
If the file is RECFM=VB, make sure you transfer the RDW (Record Descriptor Word), or copy the VB file to an FB file on the mainframe.
Other points:
You will have to update the RecordEditor/JRecord settings:
The font will need to be EBCDIC (cp037 for US EBCDIC; look up the appropriate code page for other regions).
The FileStructure/FileOrganisation needs to change (fixed length / VB).
Finally
BILS-Count-EDG is either 9 characters long or starts in column 85 (and is 8 bytes long).
You should include the XML as text rather than pasting in a picture.
In the RecordEditor, if you right-click >>> Edit Record, it will show the fields as Value, Raw Text and Hex. That is useful for seeing what is going on.
You do not seem to accept many answers; what matters is not whether the answer solves your problem but whether it is a correct answer to the question.

How to remove special characters in XML through ESQL

I am having a problem with special characters coming in the input XML.
How can we remove the bad characters, which can appear anywhere in the XML fields, through ESQL code in the Broker Toolkit?
In the XML below, the Description field contains the bad character —:
<notificationsRequest>
<BillingCity>Troutdale</BillingCity>
<BillingCountry>United States</BillingCountry>
<BillingPostalCode>97060</BillingPostalCode>
<BillingState>Oregon</BillingState>
<BillingStreet>450 NW 257th Way, Suite 400</BillingStreet>
<CreatedById>005w0000003QlXtAAK</CreatedById>
<Type>Prospect</Type>
<Tyco_Operating_Co__c>Tyco IS - Commercial</Tyco_Operating_Co__c>
<Doing_Business_As_DBA__c>Columbia Gorge Outlets</Doing_Business_As_DBA__c>
<Description>As of January 2016—the property title should read Austell Columbia Gorge Equities, LLC-dba Columbia Gorge Outlets---so the title should be Austell Columbia Gorge Equities, LLC.</Description>
</notificationsRequest>
Your file seems to have the wrong encoding, or was corrupted during conversion from one encoding to another. If you are an MS Windows user, you can open it in Notepad++ and try converting its encoding to UTF-8 (or another likely encoding) to check the issue.

Gnu sort UTF-8 incorrect collation order

I am using GNU sort on Linux for a UTF-8 file and some strings are not being sorted correctly. I have the LC_COLLATE variable set to en_US.UTF-8 in BASH. Here is a hex dump showing the problem.
5f ef ac 82 0a
5f ef ac 81 0a
5f ef ac 82 0a
5f ef ac 82 0a
These are four consecutive sorted lines. The 0a is the end of line. The order on the fourth byte is incorrect. The byte value 81 should not be between the 82 value bytes. When this is displayed in the terminal window the second line is a different character from the other three.
I doubt that this is a problem with the sort command because it is a GNU core utility, and it should be rock solid. Any ideas why this could be occurring? And why do I have to use hexdump to track down this problem; it's the 21st century already!
Using LC_COLLATE=C appears to be the only solution.
You can set this up for everything by editing /etc/default/locale.
Unfortunately this loses a lot of useful aspects of UTF-8 sorting, such as putting accented characters next to their base characters. But it is far less objectionable than the completely hideous mess the libc developers and Unicode Consortium made of it. They fail to understand the purpose of sorting, the need to preserve sort order when strings are concatenated, the need to always produce the same order, and how virtually every program in the world relies on this. Instead they seem to feel it is important to "sort" typos such as spaces inserted into the middle of names by ignoring them (!).
It was probably some kind of bug in the version you used. When I execute sort (version from GNU coreutils 8.30), it works as follows:
$ printf '\x5f\xef\xac\x82\x0a\x5f\xef\xac\x81\x0a\x5f\xef\xac\x82\x0a\x5f\xef\xac\x82\x0a' | LC_COLLATE=en_US.UTF-8 sort
_ﬁ
_ﬂ
_ﬂ
_ﬂ
which appears to work as expected. I didn't bother to check whether it can successfully handle NFC vs NFD normalization forms, because I only use NFC myself.

Convert signed cobol number in ruby

I have an ASCII file that is a dump of data from a COBOL-based system.
There is a field that the docs say is PIC S9(3)V9(7).
Here are two examples of the field in hex (and ASCII) and the resulting number it is said to represent (taken from another source).
Hex                             ASCII        Reported value
30 32 38 36 38 35 38 34 35 46   028685845F   28.687321
30 39 38 34 35 36 31 33 38 43   098456138C   -98.480381
I'm using Ruby, and even after adding the implied decimal I seem to be getting the numbers wrong. I'm trying to parse the IBM COBOL docs, but I would appreciate help.
Given an implied-decimal COBOL field of PIC S9(3)V9(7), how can I convert it into a signed float using Ruby?
Assuming the data bytes have been run through a dumb EBCDIC-to-ASCII translator, those two values are +28.6858456 and +98.4561383. Which means whatever generated that "reported value" column is either broken or using different bytes as its source.
It looks like the reported values might have been run through a low-precision floating-point conversion, but that still doesn't explain the wrong sign on the second one.
As Mark Reed says, I think the numbers are +28.6858456 and +98.4561383.
You can also refer to this excellent document on signed numbers in ASCII and EBCDIC:
EBCDIC to ASCII Conversion of Signed Fields
I hope it helps you.
028685845F
098456138C
It's likely that the two ASCII strings were converted from EBCDIC.
These are zoned numbers with the sign nibble turned into a byte at the end. Like others have said, the F and C are the sign nibbles.
Check this webpage http://www.simotime.com/datazd01.htm
F is for "unsigned"
C is for "signed positive"
The PIC S9(3)V9(7) is telling you that it's ddd.ddddddd (3 digits before the decimal point, 7 digits after, and the whole thing is signed).
It's possible that the two strings have different PICs; you will need to check the COBOL source that produced the numbers.
It would be best to get the original hexadecimal dump of the COBOL data (likely in EBCDIC) and post that. (I also realize this is a 7.5-year-old post, and the OP has probably moved on already.) What I wrote above is for whoever bumps into this thread in the future.
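Following that reading (a character-for-character EBCDIC-to-ASCII translation of a zoned field), here is a minimal decoding sketch. It is written in Java rather than Ruby purely for illustration, and the overpunch letter ranges it handles ('{' and 'A'-'I' for positive, '}' and 'J'-'R' for negative) are the conventional ASCII translations of the C and D sign zones, not something stated in this thread.

import java.math.BigDecimal;

public class ZonedDecimal {

    // Decode an ASCII string that came from a zoned-decimal field via a byte-for-byte
    // EBCDIC->ASCII translation. The last character may carry an overpunched sign.
    // scale is the number of implied decimal places (7 for PIC S9(3)V9(7)).
    static BigDecimal decodeZoned(String s, int scale) {
        char last = s.charAt(s.length() - 1);
        int sign = 1;
        char lastDigit;
        if (last >= '0' && last <= '9') {          // F zone: unsigned
            lastDigit = last;
        } else if (last == '{') {                  // x'C0': positive, digit 0
            lastDigit = '0';
        } else if (last >= 'A' && last <= 'I') {   // x'C1'-x'C9': positive, digits 1-9
            lastDigit = (char) ('1' + (last - 'A'));
        } else if (last == '}') {                  // x'D0': negative, digit 0
            sign = -1;
            lastDigit = '0';
        } else if (last >= 'J' && last <= 'R') {   // x'D1'-x'D9': negative, digits 1-9
            sign = -1;
            lastDigit = (char) ('1' + (last - 'J'));
        } else {
            throw new IllegalArgumentException("Unexpected sign character: " + last);
        }
        String digits = s.substring(0, s.length() - 1) + lastDigit;
        BigDecimal value = new BigDecimal(digits).movePointLeft(scale);
        return sign < 0 ? value.negate() : value;
    }

    public static void main(String[] args) {
        // The two example fields from the question, PIC S9(3)V9(7) => scale 7.
        System.out.println(decodeZoned("028685845F", 7)); // 28.6858456
        System.out.println(decodeZoned("098456138C", 7)); // 98.4561383
    }
}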

&Atilde;&copy; and other codes

I got a file full of those codes, and I want to "translate" it into normal chars (a whole file, I mean). How can I do it?
Thank you very much in advance.
Looks like you originally had a UTF-8 file which has been interpreted as an 8 bit encoding (e.g. ISO-8859-15) and entity-encoded. I say this because the sequence C3A9 looks like a pretty plausible UTF-8 encoding sequence.
You will need to first entity-decode it, then you'll have a UTF-8 encoding again. You could then use something like iconv to convert to an encoding of your choosing.
To work through your example:
&Atilde;&copy; would be decoded as the byte sequence 0xC3A9
0xC3A9 = 11000011 10101001 in binary
the leading 110 in the first octet tells us this could be interpreted as a UTF-8 two byte sequence. As the second octet starts with 10, we're looking at something we can interpret as UTF-8. To do that, we take the last 5 bits of the first octet, and the last 6 bits of the second octet...
So, interpreted as UTF8 it's 00011101001 = E9 = é (LATIN SMALL LETTER E WITH ACUTE)
You mention wanting to handle this with PHP, something like this might do it for you:
//to load from a file, use
//$file=file_get_contents("/path/to/filename.txt");
//example below uses a literal string to demonstrate technique...
$file="Pr&Atilde;&copy;c&Atilde;&copy;dent is a French word";
$utf8=html_entity_decode($file);
$iso8859=utf8_decode($utf8);
//$utf8 contains "Précédent is a French word" in UTF-8
//$iso8859 contains "Précédent is a French word" in ISO-8859-1
