File with the first bit of every byte set to 0 - UTF-8

I was given a file that seems to be encoded in UTF-8, but every byte that should start with 1 starts with 0.
E.g. in places where one would expect the Polish letter 'ę', encoded in UTF-8 as the octal bytes \304 \231, there is \104 \031. Or, in binary, there is 01000100:00011001 instead of 11000100:10011001.
I assume that this was not done on purpose by an evil file creator who enjoys my headaches, but rather is the result of some erroneous operation performed on a correct UTF-8 file.
The question is: what "reasonable" operations could be the cause? I have no idea how the file was created; it was probably exported by some unknown software, then it could have been compressed, uploaded, copied & pasted, converted to another encoding, etc.
I'll be grateful for any idea : )
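One classic cause of exactly this damage is a 7-bit-clean transfer channel (for example, an old mail gateway or a transfer in 7-bit "ASCII" mode) that masks every byte to 7 bits; this is a hypothesis, not a confirmed cause, but it reproduces the example bytes exactly. A quick sketch:

```python
# Clear the most significant bit of every byte, as a 7-bit channel would.
original = "ę".encode("utf-8")               # b'\xc4\x99' (octal \304 \231)
damaged = bytes(b & 0x7F for b in original)  # b'\x44\x19' (octal \104 \031)
print(original.hex(), "->", damaged.hex())   # c499 -> 4419
```

Note that the damage is not fully reversible in general: a stripped continuation byte is indistinguishable from a genuine ASCII byte, although UTF-8's lead-byte structure often lets you reconstruct likely candidates.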

Related

Find and replace non utf8 character

I have a process that inserts data into PDFs that eventually loads into a system that gets searched based on that inserted data. The inserted data looks something like:
<<
/IBM-ODIndexes
<< /Private
<<
/DOB (05031983)
/FULL_NAME (TEST USER)
/YEAR (2020)
>>
/LastModified(D:20210112201530)
>>
However, there are instances where the data in the FULL_NAME field contains non-UTF-8 characters and then users are unable to search the data. Specifically, apostrophes come over from Microsoft Word and then get interpreted like this:
/FULL_NAME (JERRY OÃ<83>¢ââ<80><9a>‰â<80><9e>¢CONNELL)
In this case I am looking to strip out the apostrophe that is represented as Ã<83>¢ââ<80><9a>‰â<80><9e>¢ and replace it with a white space.
There are several complexities here, but in general I would say that the only reliable way to deal with it is to figure out the text encoding of the incoming document and converting it to the target encoding.
Ã<83>¢ââ<80><9a>‰â<80><9e>¢ is 34 characters (that is, at least 34 bytes), and no single encoding ever used that much space for a single character. What’s probably happening is multiple levels of encoding, such as HTML entities, base64, UTF-8/16/32 or escape characters like %% to represent % in SQL or \\ to represent \ in Bash. Reversing all these levels of encoding manually is going to involve quite a lot of reading the huge docx standard. The simpler alternative is to use a library which can just convert the entire text into a known character encoding for you, at which point you have to do at most a single conversion into UTF-8.
Another argument for this approach is that the "apostrophe string" contains otherwise harmless characters like "a" and "e". Without at least some understanding of the encodings, you're unlikely to be able to separate encoded characters from non-encoded ones, which would leave the resulting text full of invalid characters.
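For the specific Word-apostrophe case, the most common single layer is U+2019 (the curly right single quotation mark) encoded as UTF-8 and then decoded as cp1252; each extra layer of garbage repeats that mistake. A sketch of one round trip (the file in the question has likely gone through more layers than this):

```python
apostrophe = "\u2019"                      # Word's curly apostrophe
mojibake = apostrophe.encode("utf-8").decode("cp1252")
print(mojibake)                            # â€™

# Knowing the layers, the damage can be reversed instead of stripped:
fixed = mojibake.encode("cp1252").decode("utf-8")
assert fixed == apostrophe
```

Libraries such as ftfy exist to guess how many such layers were applied, which is usually more reliable than hand-rolled search/replace.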

A hint for end of ASCII data in a binary file

I'm developing software that stores its data in a binary file format. However, as a courtesy to innocent shell users who might cat such a file to inspect its contents, I'm thinking of having an ASCII-compatible "magic string" at the start of the file that tells the name and the version of the binary format.
I'm thinking of having at least ten lines (\n) in the message so that head with default settings doesn't hit the binary part.
Now, I wonder if there is any control character or escape code that would hint to the shell that the following content isn't interpretable as printable text, and should be just ignored? I tried 0x00 (the null byte) and 0x04 (ctrl-D) but they seem to be just ignored when catting the file.
cat treats every file as text. There is no way you can trigger an end-of-file, since EOF is not actually a character.
The other way around works, of course: specify a format that only starts reading the binary part from a certain marker on.
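As a sketch of the header idea (Python; the format name is made up): cat will indeed print everything, but tools like grep and less classify a file as binary when they see a NUL byte near the start, so placing one right after the text header is about the strongest hint available:

```python
MAGIC = b"MYFORMAT version 1\n"              # hypothetical magic string
NOTICE = b"(binary data follows)\n" * 10     # >= 10 lines, so plain `head` stops here
with open("demo.bin", "wb") as f:
    f.write(MAGIC)
    f.write(NOTICE)
    f.write(b"\x00")                         # NUL byte: binary-ness hint for grep/less
    f.write(bytes(range(256)))               # the actual binary payload

# A reader verifies the magic string before parsing the binary part:
with open("demo.bin", "rb") as f:
    assert f.read(len(MAGIC)) == MAGIC
```

This matches the answer's point: the reader, not cat, is what decides where text ends and binary begins.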

Find and replace (increment) ASN.1 BER hex value

I have a long string of hex (converted from BER ASN.1) where I need to find and increment a particular value which is incorrect.
<TAG> <LENGTH> <VALUE to INCREMENT>
The ASN.1 tag is 84, and the length byte will change from 01 to 02 when the value exceeds 127 decimal (above 0x7F a BER integer needs a leading 00 byte). The value to increment will therefore become 2 bytes.
The value should start at 00.
e.g.
- Original file: ...840101...840107...84020085...84020097
- New file: ...840100...840101...84020080...84020081
Any ideas how best to do this, preferably using standard bash commands?
Ilya Etingof hinted at this already, but to be explicit about it, BER uses TAG, LENGTH, VALUE (TLV) encoding, where the VALUE can itself be a TLV. If you change the length in a TLV that is nested inside a TLV, you will need to update all of the lengths of the enclosing TLVs as well. It is not a simple search/replace operation.
Assuming you already have the octet stream in text form, you may consider searching/replacing pieces of text with awk or sed. If you can only use bash, maybe variable substitution (${parameter/pattern/string} or ${parameter:offset:length}) would work?
Keep in mind however, that BER is quite flexible in the sense that (sometimes) the same data structure may be encoded differently and that would still constitute a valid encoding. The rationale behind that is to allow the encoder to optimize for its very own situation (e.g. save on memory or CPU cycles or on copying etc).
What I am trying to say is that, depending on your situation, there is a chance your search/replace logic may fail. The bullet-proof solution is to fully decode your BER octet stream, change the data structure you need and re-encode it back into BER.
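For the flat case shown in the question (single-byte tags, short-form lengths, no nesting), the decode/modify/re-encode approach can be sketched in a few lines; once nesting or long-form lengths appear, a real library such as pyasn1 is the right tool. All names below are illustrative:

```python
def resequence(hexstr, tag=0x84):
    """Walk a flat TLV stream and renumber every `tag` value from 0 upward."""
    data = bytearray.fromhex(hexstr)
    out, i, counter = bytearray(), 0, 0
    while i < len(data):
        t, length = data[i], data[i + 1]          # short-form length only
        value = data[i + 2 : i + 2 + length]
        if t == tag:
            # 2 bytes once the counter passes 0x7F, matching the
            # length-byte change from 01 to 02 described in the question
            value = counter.to_bytes(2 if counter > 0x7F else 1, "big")
            counter += 1
        out += bytes([t, len(value)]) + value
        i += 2 + length
    return out.hex()

print(resequence("840101" "840107" "84020085" "84020097"))
# -> 840100840101840102840103
```

With only four records the counter never passes 0x7F, so all lengths collapse back to 01; and, as noted above, any nesting would require recomputing the enclosing lengths as well.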

Unpacking COMP-3 digit using Record Editor/Jrecord

I have created a layout based on a COBOL copybook.
Layout snapshot:
I tried to load the data selecting the same layout, but it gives me wrong results for some columns. I tried using all the binary numeric types.
CLASS-ORDER-EDGE
DIV-NO-EDG
OFFICE-NO-EDG
REG-AREA-NO-EDG
CITY-NO-EDG
COUNTY-NO-EDG
BILS-COUNT-EDG
REV-AMOUNT-EDG
USAGE-QTY-EDG
GAS-CCF-EDG
result snapshot
The input file can be found in the attachment below:
https://drive.google.com/open?id=0B-whK3DXBRIGa0I0aE5SUHdMTDg
Expected output:
Related thread
Unpacking COMP-3 digit using Java
First problem: an EBCDIC --> ASCII conversion has been done on the file!
The EBCDIC --> ASCII conversion will also try to convert binary fields as well as text.
For example:
Comp-3 value   Hex       Hex after ASCII conversion
400            x'400c'   x'200c'
(x'40' is the EBCDIC space character; it gets converted to the ASCII space character x'20'.)
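The corruption above can be reproduced with Python's cp037 codec (US EBCDIC); a sketch of what the faulty text translation did to the packed field:

```python
packed = bytes.fromhex("400c")                 # COMP-3 encoding of +400
translated = packed.decode("cp037").encode("ascii")
print(translated.hex())                        # 200c -- EBCDIC space x'40' became ASCII x'20'
```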
You need to do a binary transfer, keeping the file as EBCDIC:
Check the file on the mainframe: if it has RECFM=FB you can do a straight transfer.
If the file is RECFM=VB, make sure you transfer the RDW (Record Descriptor Word), or copy the VB file to an FB file on the mainframe.
Other points:
You will have to update the RecordEditor/JRecord settings:
The font will need to be EBCDIC (cp037 for US EBCDIC; look up the appropriate code page for others).
The FileStructure/FileOrganisation needs to change (fixed length / VB).
Finally
BILS-COUNT-EDG is either 9 characters long or starts in column 85 (and is 8 bytes long).
You should include the XML as text, not paste in a picture.
In the RecordEditor, if you right-click >>> Edit Record, it will show the fields as Value, Raw Text and Hex. That is useful for seeing what is going on.
You do not seem to accept many answers; it is not relevant whether an answer solves your problem, only whether it is a correct answer to the question.
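For reference, once the bytes arrive intact, unpacking a COMP-3 (packed decimal) field is mechanical: two digits per byte, with the final nibble holding the sign. A sketch outside RecordEditor/JRecord:

```python
def unpack_comp3(raw: bytes) -> int:
    """Unpack IBM packed decimal: two digits per byte; the last
    nibble is the sign (0xD negative, 0xC/0xF positive/unsigned)."""
    digits = []
    for b in raw:
        digits.extend((b >> 4, b & 0x0F))
    sign = digits.pop()                    # final nibble is the sign
    value = int("".join(map(str, digits)))
    return -value if sign == 0xD else value

print(unpack_comp3(bytes.fromhex("400c")))    # 400
print(unpack_comp3(bytes.fromhex("12345d")))  # -12345
```

Run against the x'200c' bytes from the table above, the same routine would yield 200 instead of 400, which is exactly the kind of "wrong result" the question describes.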

Issues with getline/file reading in Windows

I created some .txt files on my Mac (didn't think that would matter at first, but...) so that I could read them in the application I am making in (unfortunately) Visual Studio on a different computer. They are basically files filled with records, with the number of entries per row at the top, e.g.:
2
int int
age name
9 Bob
34 Mary
12 Jim
...
In the code, which I originally just made (and tested successfully) on the Mac, I attempt to read this file and similar ones:
Table TableFromFile(string _filename){ //For a database system
    ifstream infile;
    infile.open(_filename.c_str());
    if(!infile){
        cerr << "File " << _filename << " could not be opened.";
        exit(1);
    }
    //Determine the number of attributes (columns) in the table,
    //which is the number on the first line of the input file
    std::string num;
    getline(infile, num);
    int numEntries = atoi(num.c_str());
    ...
    ...
In short, this causes a crash! As I looked into it, I found some interesting "Error reading characters of string" issues and found that numEntries is getting some crazy negative garbage value. This seems to be caused by the fact that "num", which should just be "2" as read from the first line, is actually coming out as "ÿþ2".
From a little research, it seems that these strange characters are formatting things...perhaps unicode/Mac specific? In any case, they are a problem, and I am wondering if there is a fast and easy way to make the text files I created on my Mac cooperate and behave in Windows just like they did in the Mac terminal. I tried connecting to a UNIX machine, putting a txt file there, running unix2dos on it, and put into back in VS, but to no avail...still those symbols at the start of the line! Should I just make my input files all over again in Windows? I am very surprised to learn that what you see is not always what you get when it comes to characters in a file across platforms...but a good lesson, I suppose.
As the commenter indicated, the bytes you're seeing are the byte order mark. See http://en.wikipedia.org/wiki/Byte_order_mark.
"ÿþ" is 0xFFFE, the UTF-16 "little endian" byte order mark. The "2" is your first actual character (for UTF-16, characters below 256 will be represented by bytes of the for 0xnn00;, where "nn" is the usual ASCII or UTF-8 code for that character, so something trying to read the bytes as ASCII or UTF-8 will do OK until it reaches the first null byte).
If you need to puzzle out the Unicode details of a text file the best tool I know of is the free SC Unipad editor (www.unipad.org). It is Windows-only but can read and write pretty much any encoding and will be able to tell you what there is to know about the file. It is very good at guessing the encoding.
Unipad will be able to open the file and let you save it in whatever encoding you want: ASCII, UTF-8, etc.
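If recreating the files is not practical, the BOM can also be handled in code. A Python sketch of the detection logic (the same checks port directly to the C++ reader above):

```python
import codecs

def read_text_any_bom(path):
    """Read a text file, honouring a UTF-16 or UTF-8 BOM if present."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")        # the utf-16 codec consumes the BOM
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    return raw.decode("utf-8")

# "ÿþ2" from the question is exactly BOM_UTF16_LE followed by "2":
with open("demo.txt", "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "2\nint int\n".encode("utf-16-le"))
print(read_text_any_bom("demo.txt").splitlines()[0])   # 2
```

With this approach the atoi/parsing code sees a clean "2" regardless of which platform wrote the file.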

Resources