There is a question from an old test that is confusing me:
"When converting a file from UTF16-LE to UTF16-BE will the file be smaller or bigger afterwards?"
I thought it was just a different byte order, so I don't understand why it would change the file's size.
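The intuition is right: the conversion only swaps each 16-bit code unit's byte pair, so the size cannot change. A minimal sketch using Python's codecs to illustrate:

```python
# Converting UTF-16LE to UTF-16BE only swaps each byte pair,
# so the total size never changes.
text = "hi\u3041"  # two ASCII chars plus one BMP char

le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

print(len(le), len(be))  # both 6 bytes: 3 characters x 2 bytes each
print(le[0:2], be[0:2])  # b'h\x00' vs b'\x00h' -- same bytes, swapped
```

(Even if the file carries a BOM, FF FE simply becomes FE FF, so the size is still identical.)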
Related
I have an EBCDIC data file with variable-length records.
It contains binary (COMP), packed-decimal (COMP-3), display-numeric (PIC 9), and string (PIC X) fields.
How can I convert it to ASCII using a language such as Java, Perl, or COBOL?
First of all, you need to verify that the file is still EBCDIC. Some file transfer programs automatically convert the EBCDIC to ASCII, which corrupts the COMP and COMP-3 fields. You can verify this visually by looking for space characters in the alphanumeric fields. EBCDIC space is x'40'. ASCII space is x'20'.
Assuming that the file is EBCDIC, you have to write three separate conversions. COMP to ASCII, COMP-3 to ASCII, and alphanumeric to alphanumeric. Each field in the record has to be extracted and converted separately.
The alphanumeric to alphanumeric conversion is basically a lookup table for the alphabet and digits. You look up the EBCDIC character and replace it with an ASCII character.
COMP is a binary format: a big-endian two's-complement integer, usually 2, 4, or 8 bytes.
COMP-3 is a packed decimal format: two digits per byte, with a sign nybble. The sign usually follows the digits, but not always. For example, 124 looks like x'124F'. The last nybble is the sign: x'C' is positive, x'D' is negative, and x'F' is unsigned (treated as positive).
You need to treat each field by itself. You must not run the COMP and COMP-3 fields through the character translation; decode them as numbers instead. Convert the character fields with the method provided by your language of choice.
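A minimal sketch of the three conversions, assuming a hypothetical record layout (the field names and sizes below are illustrative, not from the question) and the common US EBCDIC code page cp037:

```python
import struct

# Hypothetical layout, as if from a copybook like:
#   NAME    PIC X(4).         (EBCDIC text)
#   COUNT   PIC S9(4) COMP.   (2-byte big-endian binary)
#   AMOUNT  PIC S9(3) COMP-3. (2-byte packed decimal, e.g. x'124C' = +124)
record = bytes.fromhex("C1C2C3C4" "007B" "124C")

name = record[0:4].decode("cp037")           # character translation only here
count = struct.unpack(">h", record[4:6])[0]  # COMP: plain big-endian integer

packed = record[6:8].hex()                   # '124c'
digits, sign = packed[:-1], packed[-1]       # COMP-3: digits, then sign nybble
amount = int(digits) * (-1 if sign == "d" else 1)

print(name, count, amount)                   # ABCD 123 124
```

Note that only the PIC X field goes through the EBCDIC-to-ASCII character table; the COMP and COMP-3 fields are decoded numerically from their raw bytes.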
Yes, it is possible (for Java, look at the JRecord project). JRecord can use a Cobol copybook to read a file in Java; it can also use an XML description.
Warnings
Before doing anything else
Make sure the file is still EBCDIC! If the file has been through an EBCDIC-to-ASCII conversion, the COMP/COMP-3 fields will be corrupted.
You describe the file as being variable length. Does that mean it has variable-length records? If so, make sure the RDW option was used to transfer the file.
If you do not get the file transfer done correctly, you will waste a lot of time and end up redoing what you have already done.
Cobol Copybook
I am assuming you can get a Cobol copybook. If so
Try editing the file using the RecordEditor. There is an outdated answer here; I will try to provide a more up-to-date one.
You can generate skeleton Java/JRecord code in the RecordEditor. See How do you generate java~jrecord code for a Cobol copybook.
JRecord Project
The JRecord project will read/write Cobol data files using
A Cobol Copybook
A Xml File Description
File Schema defined in Java
There are 3 sub-projects that can be used to convert simple Cobol files to CSV, Xml or Json files; more complicated files need Java/JRecord:
CobolToCsv
CobolToXml
CobolToJson
JRecord CodeGen
JRecord CodeGen will generate sample Java/JRecord code to Read/Write Cobol files (including Mainframe Cobol) from a Cobol Copybook. JRecord CodeGen is used by the RecordEditor to generate Java/JRecord programs.
I have an image that was given to me in ASCII character format (from an XML document, i.e. sdlf9sdf09e93jr), and I was told that I should be able to convert that into binary and use that data to view it as a .gif image. I've converted all the ASCII characters into binary, but now I'm stuck and not sure what to do from there.
EDIT: I've discovered the real problem I'm having: even though I've translated the ASCII to binary, I've only written to a file what the binary for those ASCII characters should look like. The "binary" characters are just a bunch of 0s and 1s stored as ASCII text. I'm using C to write this program that will create the gif, but I don't know how to produce a real .gif from my original XML parse.
First, you need to know the encoding scheme used to encode the image as a sequence of characters in the XML document.
Are you sure you have ASCII data? You will if it is base-64 encoded. If it is not base-64 encoded, perhaps you are seeing the Latin-1 characters corresponding to each byte.
But you say you have the byte sequence, the "binary" as you call it. If so, then do as MRAB says and save the bytes to a file, slap a .gif extension on that and open it with any viewer.
But again, make sure you know the encoding. Base-64 is common in XML.
If the binary is a GIF image, then just write it to a file with the extension ".gif".
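A minimal sketch of the whole pipeline, assuming the XML text is base-64 (the most common convention; the encoded string below is a stand-in, not the asker's data). The key point is to write the decoded *bytes* in binary mode, not the "0"/"1" characters:

```python
import base64

# Stand-in for the string pulled out of the XML document:
encoded = base64.b64encode(b"GIF89a" + b"\x00" * 10).decode("ascii")

data = base64.b64decode(encoded)  # raw bytes, not '0'/'1' characters
print(data[:6])                   # b'GIF89a' -- the GIF signature

with open("out.gif", "wb") as f:  # "wb": binary mode, raw bytes
    f.write(data)
```

If the first bytes of the decoded data are not `GIF87a` or `GIF89a`, the data either isn't a GIF or wasn't base-64 encoded in the first place.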
I was wondering how Windows interprets characters.
I made a file with a hex editor with the 3 bytes E3 81 81.
Those bytes are the ぁ character in UTF-8.
I opened Notepad and it displayed ぁ. I didn't specify the encoding of the file; I just created the bytes, and Notepad interpreted them correctly.
Is notepad somehow guessing the encoding?
Or is the hex editor saving those bytes with a specific encoding?
If the file only contains these three bytes, then there is no information at all about which encoding to use.
A byte is just a byte, and there is no way to include any encoding information in it. Besides, the hex editor doesn't even know that you intended to decode the data as text.
Notepad normally uses ANSI encoding, so if it reads the file as UTF-8 then it has to guess the encoding based on the data in the file.
If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file.
Notepad makes an educated guess. I don't know the details, but loading the first few kilobytes and trying to convert them from UTF-8 is very simple, so it probably does something similar to that.
...and sometimes it gets it wrong...
https://ychittaranjan.wordpress.com/2006/06/20/buggy-notepad/
There is an easy and efficient way to check whether a file is in UTF-8. See Wikipedia: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=581360767#Advantages, fourth bullet point. Notepad probably uses this.
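That check works because valid UTF-8 is self-verifying: random non-UTF-8 byte sequences are very unlikely to decode cleanly. A sketch of the idea, using Python's decoder as the validity test:

```python
# UTF-8 detection by validation: if the bytes decode cleanly as UTF-8,
# they almost certainly are UTF-8.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(b"\xe3\x81\x81"))  # True  (the UTF-8 bytes for U+3041)
print(looks_like_utf8(b"\xe3\x81"))      # False (truncated sequence)
```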
Wikipedia claims that Notepad used the IsTextUnicode function, which checks whether a particular text is written in UTF-16 (it may have stopped using it in Windows Vista, which fixed the "Bush hid the facts" bug): http://en.wikipedia.org/wiki/Bush_hid_the_facts.
How can I identify which encoding a file is in?
Open the file and choose Save As...; the dialog shows the default (current) encoding, i.e. the one the file was saved in.
I'm uploading a file that was originally ASCII and converted to EBCDIC from Windows to z/OS. My problem is that when I checked the file after uploading it, I see a lot of new lines.
When I checked its hex dump, I discovered that when the mainframe sees a x'15' it translates it into a newline. The file contains packed decimals, so the hex can legitimately contain, say, x'001500001C', but when I upload it the mainframe mistakes that x'15' for a newline. Can anyone help me with this problem?
You should put your FTP client (or library, if the upload is done by your code) into binary (TYPE IMAGE) mode instead of ASCII/EBCDIC mode, since you are sending a file that is already in EBCDIC, I believe.
It depends on the type of target "file" that you're uploading to.
If you're uploading to a member that has fixed block size (e.g., FB80), you'll need to ensure all the lines are padded out with spaces before you transmit it up (in binary mode).
Text mode transfers are not suitable for binary files (and your files are binary if they contain packed decimals - there's no reliable way for FTP to detect real line-end characters).
You'll need to fix your Windows ASCII-to-EBCDIC converter to be able to generate fixed length records.
The only other option is with a REXX script on the mainframe but this would still require being able to tell the difference between a real end-of-line marker and that marker within the binary data.
You could possibly tell the presence of a packed decimal by virtue of the fact that it consisted of BCD nybbles, the last of which is 0xC or 0xD, but that could also cause false positives or negatives.
My advice: when you convert it from ASCII to EBCDIC, pad out the lines to the desired record length at the same time.
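That advice can be sketched as follows, assuming an 80-byte fixed record length (LRECL=80) and the cp037 EBCDIC code page; both are assumptions, and a real converter would pad the assembled record bytes (text plus packed fields) rather than just text:

```python
# Pad each converted record out to the fixed record length with
# EBCDIC spaces (x'40') and emit no line-ending bytes at all.
LRECL = 80

def to_fixed_ebcdic(lines):
    for line in lines:
        rec = line.encode("cp037")      # ASCII/Unicode text -> EBCDIC bytes
        assert len(rec) <= LRECL        # record must fit in the block size
        yield rec.ljust(LRECL, b"\x40") # pad with EBCDIC spaces

records = list(to_fixed_ebcdic(["HELLO", "WORLD"]))
print(len(records[0]))       # 80
print(records[0][:6].hex())  # 'c8c5d3d3d640' -- HELLO plus one EBCDIC space
```

With every record exactly LRECL bytes and no newline bytes in the stream, a binary-mode transfer into an FB80 dataset lands each record in the right place regardless of what the packed-decimal bytes contain.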
The other point I'd like to raise is that if you just want to look at the files on the mainframe (not use them from any code that requires EBCDIC), the ISPF editor includes a few new commands (as of z/OS 1.9 if I remember correctly).
SOURCE ASCII will display the data as ASCII rather than EBCDIC. In addition, the LF command allows you to massage the ASCII stream in an FB member to correctly fix up line endings.
Is there a good way to see what format an image is, without having to read the entire file into memory?
Obviously this would vary from format to format (I'm particularly interested in TIFF files) but what sort of procedure would be useful to determine what kind of image format a file is without having to read through the entire file?
BONUS: What if the image is a Base64-encoded string? Any reliable way to infer it before decoding it?
Most image file formats have unique bytes at the start. The unix file command looks at the start of the file to see what type of data it contains. See the Wikipedia article on Magic numbers in files and magicdb.org.
Sure there is. Like the others have mentioned, most images start with some sort of 'magic', which always translates to a fixed base-64 prefix. A couple of examples:
A Bitmap will start with Qk3
A Jpeg will start with /9j/
A GIF will start with R0l (That's a zero as the second char).
And so on. It's not hard to take the different image types and figure out what they encode to. Just be careful: some formats have more than one piece of magic, so you need to account for all of them in your base-64 'translation code'.
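These prefixes fall straight out of each format's magic bytes, since base-64 maps every 3 input bytes to 4 output characters. A quick sketch of how to derive them:

```python
import base64

# JPEG magic is FF D8 FF (exactly 3 bytes -> a fully fixed 4-char prefix).
print(base64.b64encode(b"\xff\xd8\xff").decode())  # /9j/

# GIF magic is the 6-byte version string, e.g. GIF89a.
print(base64.b64encode(b"GIF89a").decode())        # R0lGODlh

# BMP magic is just the 2 bytes "BM"; the third base-64 character mixes in
# the next (variable) file byte, which is why only "Qk" is fully fixed.
print(base64.b64encode(b"BM").decode()[:2])        # Qk
```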
Either file on the *nix command-line or reading the initial bytes of the file. Most files come with a unique header in the first few bytes. For example, TIFF's header looks something like this: 0x00000000: 4949 2a00 0800 0000
For more information on the TIFF file format specifically if you'd like to know what those bytes stand for, go here.
TIFFs will begin with either II or MM (Intel little-endian or Motorola big-endian byte ordering).
The TIFF 6 specification can be downloaded here and isn't too hard to follow
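Putting the above together, a minimal sniffing sketch for TIFF: read only the first 4 bytes and check the byte-order mark plus the magic number 42 (the file name below is just for the example):

```python
import struct

def tiff_byte_order(path):
    with open(path, "rb") as f:
        header = f.read(4)  # no need to read the rest of the file
    # "II" = little-endian, "MM" = big-endian; both are followed by 42
    # encoded in that byte order.
    if header[:2] == b"II" and struct.unpack("<H", header[2:4])[0] == 42:
        return "little-endian (Intel)"
    if header[:2] == b"MM" and struct.unpack(">H", header[2:4])[0] == 42:
        return "big-endian (Motorola)"
    return None

# Example using the header bytes quoted in the answer above (4949 2a00 ...):
with open("sample.tif", "wb") as f:
    f.write(bytes.fromhex("49492a00 08000000"))
print(tiff_byte_order("sample.tif"))  # little-endian (Intel)
```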