I'm uploading a file that was originally ASCII and converted to EBCDIC from Windows OS to z/OS. My problem is that when I checked the file after uploading it, I see a lot of new lines.
When I tried to check it with its hex dump I discovered that when mainframe sees a x'15' it translates it into a newline. In the file there are packed decimals so the hex could contain let say a x'001500001c' but when I upload it, mainframe mistook it as a new line. Can anyone help me with this problem?
You should put your FTP client (or library if the upload is done by your code) into binary (IMAGE TYPE) mode instead of ascii/EBCDIC if you are sending a file already in EBCDIC i believe.
It depends on the type of target "file" that you're uploading to.
If you're uploading to a member that has fixed block size (e.g., FB80), you'll need to ensure all the lines are padded out with spaces before you transmit it up (in binary mode).
Text mode transfers are not suitable for binary files (and your files are binary if they contain packed decimals - there's no reliable way for FTP to detect real line-end characters).
You'll need to fix your Windows ASCII-to-EBCDIC converter to be able to generate fixed length records.
The only other option is with a REXX script on the mainframe but this would still require being able to tell the difference between a real end-of-line marker and that marker within the binary data.
You could possibly tell the presence of a packed decimal by virtue of the fact that it consisted of BCD nybbles, the last of which is 0xC or 0xD, but that could also cause false positives or negatives.
My advice: when you convert it from ASCII to EBCDIC, pad out the lines to the desired record length at the same time.
The other point I'd like to raise is that if you just want to look at the files on the mainframe (not use them from any code that requires EBCDIC), the ISPF editor includes a few new commands (as of z/OS 1.9 if I remember correctly).
SOURCE ASCII will display the data as ASCII rather than EBCDIC. In addition, the LF command allows you to massage the ASCII stream in an FB member to correctly fix up line endings.
Related
While compiling a c file, gcc by default compiles it to a file called "a.out". My professor said that the output file contains the binaries, but I when I open it I usually encounter unreadable text (VS Code says something like "This file contains unsupported text encoding").
I assumed that by 'binaries', I would be able to see literal zeroes and ones in the file but that does not seem to be the case. So what exactly does it output file look like or what exactly does it contain and what is 'text encoding'? Why can I not read it? What special characters might it contain? I'm aware of the fact that gcc first pre-processes, which means it removes all comments, expands all macros and copies the contents of any header files that might be included. You get the header file by running gcc -E <file_name>.c, then the this processed file is complied into assembly. Up to this point, the output files are readable, i.e., I can open them with VS Code, but after this the assembled code and the object file thereafter are human-unreadable.
For reference, I have no prior experience with programming or any language for that matter and this is my first CS related course in my first sem of college, and I apologize if this is too trivial of a question to ask.
I actually had the same confusion early on. Not about that file type specifically, but about binary vs text files.
After all aren't all files, even text ones binary? In the sense that all information is 1s and 0s? Well, yes, all information can be stored/transmitted as 1s and 0s, but that's not what binary/text files refer to.
It refers to what that information, the content of the file, those 1s and 0s represent.
In a text file the bytes encode characters. In a binary file the bits encode some information that is not text. The format and semantics of that information is completely free, it can mean anything and use whatever encoding scheme. It's up to the application that writes/reads the file to properly understand the bit patterns.
Most text editors (like VS Code) when open a file they treat it as a text file. I.e. they try to interpret the bit patterns as a text encoding scheme (e.g. ASCII or UTF-8) But not all bit patterns are valid ASCII/UTF-8 so that's why you get "unsupported text encoding".
If you want to inspect the actual 1s and 0 for both text and binary files you need to use a utility that shows you that, e.g. hex viewers/editors.
[Edit/Disclaimer]: Comments pointed out that I have to clarify the encoding the user uses. Will update accordingly
I have a customer from China who recently reported an issue with their filenames on Windows. The software works with most Chinese characters, but it seems he has found one file that fails.
Unfortunately, they are not able to send me over the filename as neither zipping nor transmitting the file through other mediums seem to preserve the filename.
What is the easiest way (e.g. through Python) to generate a filename on Windows that is covered by the NTFS file system encoding but not UTF8?
Unicode strings are encoded as a series of bytes. The rules of what a series of bytes visually looks like to you in an operating system, is what operating systems use to turn bytes into characters.
Given that Windows uses a (variation of-) Unicode, and you say you have a character that's not in unicode, it also means that there is simply no way to represent that character.
Imagine if unicode only contained the numbers 0-9, and you ask someone how to encode the letter A. There's no answer to this, because only 0-9 are defined.
You could make up a new unicode codepoint for your character, but then operating systems won't know what to do with that unless you also make your own font files.
I somehow doubt that that's what you want to do though, but it's an option. Could your customer rename the file before sending it to you?
During my implementation on a protocol buffer application, I tried to work with the text pbtxt files to ease up my programming. The idea was to switch to the pb binary format afterward, once I had a clearer understanding of the API. (I am working in C++)
I made my application working by importing the file with TextFormat::Parse. (The content of the file came from TextFormat::Print). I then generated the corresponding binary file, that I tried to import with myMessageVariable.ParsefromCodedStream (file not compressed). But I notice that only a very small part of the message is imported. The myMessageVariable.IsInitialized returns true, thus I guess that the library "thinks" that it has completely imported the file.
So my question: is there something different in the way the file are imported that could make the import "half-fail"? (Besides the obvious reason that one is binary and the other one is in text?) And what can we do against it?
There are a few differences in reading text data and reading binary data:
Text files sometimes use automatic linefeed conversion (\r\n vs. \n), especially on Windows platforms. This has to be disabled by opening the file in binary mode.
Binary files can contain null bytes at any point. Some text processing functions stop reading at the first null byte.
It could help if you can determine more about how much of the message gets parsed. Then you can look at what kind of bytes are near the problem point, using e.g. hex editor.
I was wondering how Windows interprets characters.
I made a file with a hex editor with the 3 bytes E3 81 81.
Those bytes are the ぁ character in UTF-8.
I opened the notepad and it displayed ぁ. I didn't specify the encoding of the file, I just created the bytes and the notepad interpreted it correctly.
Is notepad somehow guessing the encoding?
Or is the hex editor saving those bytes with a specific encoding?
If the file only contains these three bytes, then there is no information at all about which encoding to use.
A byte is just a byte, and there is no way to include any encoding information in it. Besides, the hex editor doesn't even know that you intended to decode the data as text.
Notepad normally uses ANSI encoding, so if it reads the file as UTF-8 then it has to guess the encoding based on the data in the file.
If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file.
Notepad makes an educated guess. I don't know the details, but loading the first few kilobytes and trying to convert them from UTF-8 is very simple, so it probably does something similar to that.
...and sometimes it gets it wrong...
https://ychittaranjan.wordpress.com/2006/06/20/buggy-notepad/
There is an easy and efficient way to check whether a file is in UTF-8. See Wikipedia: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=581360767#Advantages, fourth bullet point. Notepad probably uses this.
Wikipedia claims that Notepad used the IsTextUnicode function, which checks whether a patricular text is written in UTF-16 (it may have stopped using it in Windows Vista, which fixed the "Bush hid the facts" bug): http://en.wikipedia.org/wiki/Bush_hid_the_facts.
how to identify the file is in which encoding ....?
Go to the file and try to Save As... and you can see the default (current) encoding of the file (in which encoding it is saved).
I'm downloading via FTP some files with chinese names (BIG5 encoded), and Filezilla displays those filenames as garbage (as FTP cannot handle any encoding other than ASCII and UTF-8, as least the standard compliant ones).
Given a filename with garbled characters, is it possible for me to repair the encoding and get a proper filename String given that I already know the source encoding? Will the FTP client misinterpreting BIG5 as UTF-8 insert bytes that make conversion back to BIG5 difficult?
My proposed steps (in Java):
1. get the garbled filename using File object.
2. getbytes using UTF-8.
3. create a new string using those bytes in BIG5.
4. Write the decoded filename back to the file.
Will the above method work?
Not every sequence of bytes is a valid ASCII or UTF-8 string so it's quite likely that some of the bytes will have been discarded, converted to the replacement character, or otherwise irreversibly mangled. So it looks like you won't be able to retrieve the original filenames if they have been modified by FileZilla to become correctly formed UTF-8 or ASCII.
You might be lucky to be able to get a certain percentage of the original characters back, where they just happened to be both valid BIG5 and valid UTF-8, but I doubt you will be able to recover the entire filename.
You could post a few examples of your garbled filenames (as raw bytes encoded in hex) to get a more definite answer. That way we can see exactly what the damage is.