calculate hash of binary file containing certain bytes

I'm having trouble understanding the principle/method of how to "manually" calculate the hash (SHA-256) of a file which consists of certain bytes.
To put into an example:
I have this binary file consisting of these bytes.
2C F2 BA A3 0E 26 5A 3B 2A 1F 01 4A 01 66 60 02
How do I get the following (correct) hash of the file? ea3cbd30dc6c18914d2cdafdd8bec0ff4ce5995c7b484cce3237900336abb574

1. Convert all bytes to ASCII characters, i.e. treat the raw bytes as a single byte string.
2. Hash that byte string to get the correct hash of the file.
Doing this manually is not recommended, since copy-and-paste or other factors can easily distort your ASCII string, so optimally you write a small program that does everything in one go, as in the sketch below.
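A minimal sketch in Python (hashlib is the standard library module; the printed digest should match the one given in the question, since SHA-256 is computed over the raw bytes):

import hashlib

# The file's raw bytes, as written out in hex in the question.
data = bytes.fromhex("2CF2BAA30E265A3B2A1F014A01666002")

# SHA-256 is computed over the raw bytes themselves, not over a
# hex-string rendering of them.
print(hashlib.sha256(data).hexdigest())
# expected (per the question):
# ea3cbd30dc6c18914d2cdafdd8bec0ff4ce5995c7b484cce3237900336abb574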

Related

Algo to find redundant data in a file

I have a binary file where one record is repeated multiple times. The file consists only of this record, repeated some number of times.
I don't know the size of the record. What is the best algorithm to extract the record and determine how many times it is repeated?
For example, suppose I have a file with the following memory representation in hex (ignore file headers and all that stuff):
3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C
BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F 5C BA 3F
5C BA
So here my record is 3F 5C BA, 3 bytes long, and it is repeated 15 times.
How can I get these values (the size of the record and the number of times it is repeated)? It can be done using Rabin-Karp, but is there a better, more efficient way to do it?
One possibility is to take the size of the file and factor it. For example, if the file size was 1280, then you know that the record size is one of the following:
1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,640,1280
You could then test each of those assumptions until you find a match or exhaust the possibilities.
Of course, this assumes that the file is not truncated or otherwise corrupted.
That's probably not the most efficient way to do it, but it's quick to code and may well be fast enough for your purposes. It rather depends on how large your files are and how often you'll want to do this. Sometimes the brute-force solution is the right solution, even if it's not the "best" solution.
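A brute-force sketch of this divisor approach in Python (the data and function name are just for illustration):

def find_record_by_divisors(data):
    """Try every divisor of len(data) as a candidate record size,
    smallest first; return (record, count) for the first size that
    tiles the whole file."""
    n = len(data)
    for size in range(1, n + 1):
        if n % size:
            continue                    # not a divisor of the file size
        record = data[:size]
        if all(data[i:i + size] == record for i in range(size, n, size)):
            return record, n // size

data = bytes.fromhex("3F5CBA" * 15)
record, count = find_record_by_divisors(data)
print(record.hex(), count)              # -> 3f5cba 15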
You can look at suffix trees: insert all suffixes of your string into a suffix tree, count the number of times a certain substring occurs, then do a tree traversal to find your answer.
1. Start with the assumption that the length l of your record is 1.
2. Check if your assumption is correct by comparing all subsequent blocks of size l. Stop as soon as you find a mismatch.
3. If no mismatch is found, you are finished. RETURN.
4. Search for the next occurrence of the block of length l. This gives you another candidate record length. If the next matching block starts at index i (zero-based), set l = i and go to step 2.
If you know that there is always a solution, you might be able to speed up step 2 a bit: once you have checked 50% of the data, you can stop.
Note: This answer assumes that you are looking for the shortest possible record. If all your bytes are, for instance, FF, you could find a lot of solutions other than l=1 (e.g. just one big record).
Example: Start with a record of size 1, in your case 3F. Then check whether this is the complete record by checking whether all subsequent bytes are 3F as well. You can stop with the next byte because it differs. Now look for the next 3F. It occurs at index 3 (zero based). Now you know your record is at least 3 bytes long. Assume your record is 3 bytes long. Check if all subsequent three byte blocks match your record. Done!
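A sketch of this algorithm in Python, assuming (as in the worked example above) that step 4 searches from the mismatch position onwards:

def shortest_record(data):
    """Candidate-length algorithm: grow l whenever the current
    candidate block fails to tile the data."""
    l = 1
    n = len(data)
    while l < n:
        block = data[:l]
        # Step 2: compare all subsequent blocks of size l to the first,
        # stopping at the first mismatch.
        mismatch = next((i for i in range(l, n, l) if data[i:i + l] != block), None)
        if mismatch is None:
            return block                 # step 3: everything matched
        # Step 4: the next occurrence of the block, searched from the
        # mismatch onwards, is the next candidate record length.
        i = data.find(block, mismatch)
        if i == -1:
            break                        # no repetition at all
        l = i
    return data                          # the whole file is the record

data = bytes.fromhex("3F5CBA" * 15)
record = shortest_record(data)
print(record.hex(), len(data) // len(record))   # -> 3f5cba 15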

COBOL COMP-3 number format issue

I have a COBOL "tape format" dump which has a mixture of text and number fields. I'm reading the file in C# as a binary array (array of byte). I have the copybook, and the formats line up fine on the text fields. There are a number of COMP-3 fields as well, but the data in those fields doesn't seem to match any BCD format. I know what the data should be, and I have the raw bytes of the COMP-3. I tried converting to EBCDIC first, which yielded no better results. Any thoughts on how a COMP-3 number could otherwise be internally stored? Below are three examples of the PIC, the raw data and the expected number. I know I have the field positions correct because there is alpha data on either side of the numbers, and that all lines up correctly.
First Example:
The PIC of the field is 9(9) COMP-3
There are 5 bytes to the data, the hex values are 02 01 20 91 22
The resulting data should be a date (00CCYYMMDD). This particular date should be 3-17-14.
Second Example:
The PIC of the field is S9(3) COMP-3
There are 2 bytes to the data, the hex values are 0A 14
The resulting value should be between 900 and 999
My understanding is that the "S" means that the last nibble should be 0xC or 0xD to indicate + or -
Third Example:
The PIC of the field is S9(15)V99 COMP-3
There are 9 bytes to the data, the hex values are 00 00 00 00 00 00 01 80 0C
The resulting value should be 12.00
There is no such thing as a COBOL "tape format" although the phrase may mean something to the person who gave you the data.
The clue to your problem is that you can read the text. Connect that to the EBCDIC tag and your reference to C#.
So, you are reading data which originally comes from a mainframe, most likely an IBM mainframe, which uses EBCDIC instead of ASCII.
COMP-3 is packed BCD, which C# does not have native support for.
What some kind soul has done for you is "convert" the data from EBCDIC to ASCII. Otherwise you wouldn't even recognise the "text".
Unfortunately, what that means for any binary, packed-decimal or floating-point fields (you won't see much of the last, but they are COMP-1/COMP-2) is that "convert" means "potentially scrambled", because the conversion assumes individual bytes with simple byte values, whereas all of those fields have conventional encodings, spanning multiple bytes or using non-EBCDIC values or both.
So: COMP-3 PIC 9(9). As you say, five bytes. It is unsigned, so the rightmost nibble will be F (all bits on). You are slightly out with your positions, due to the sign position being occupied even for an unsigned field.
On the Mainframe, it contains a value X'020140317F'. Only that field in its entirety can make any sense as to its value. However, the EBCDIC to ASCII conversion has made it X'0201209122'.
How?
Look up the EBCDIC value of X'02' and X'01'. They don't change. Look up the value of X'40', whoops, that's a space, change it to ASCII X'20'. Look up the value of X'31'. Actually nothing special there, and it has converted to something higher than X'7F', but if you look at the translation table used, I guess you'll see why it happens. The X'7F' is a double-quote, so gets changed to X'22'.
The other values you show suffer the same problem.
You should only ever take data from a mainframe in character-only format. There are many answers here on this; have a look at the related questions.
Have a look at this recent question: Convert COMP and COMP-3 Packed Decimal into readable value with C
OK, let's have a look at your first example. Given the format and value, the original BCD content should have been something like
02 01 40 31 7F
When transforming that from EBCDIC to ASCII we run into trouble with the first, second and fourth byte, because they are control characters; here we would need some more details on how the EBCDIC-to-ASCII converter worked. Looking at the two remaining bytes, those would be changed as follows:
EBCDIC ASCII CHARACTER
40 -> 20 (blank)
7F -> 22 "
So assuming the first two bytes remain unchanged and the third gets converted like 31->91 we end up with
02 01 20 91 22
which is what you got. So it looks like some kind of EBCDIC-to-ASCII conversion took place. If that is the case, you may not be able to repair the data, since the transformation may not be one-to-one and thus not reversible.
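This garbling can be reproduced with Python's built-in cp037 codec (IBM EBCDIC US), assuming the transfer used a cp037-style translation table:

packed = bytes.fromhex("020140317F")     # original packed date on the mainframe
# Decode as EBCDIC, re-encode as 8-bit characters: this is the
# byte-by-byte "conversion" the file transfer would have applied.
garbled = packed.decode("cp037").encode("latin-1")
print(garbled.hex())                     # -> 0201209122, as seen in the question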
Looking at the second example and using
EBCDIC ASCII CHARACTER
25 -> 0A (LF)
3C -> 14 (DC4)
you would have started with 25 3C which would fit the format but not the range you gave.
In the third example the original 01 20 0C could be converted to 01 80 0C since 20 also is an EBCDIC control-character with no direct ASCII-equivalent.
But given all other examples I would assume there is some codepage-conversion issue.
If you used some kind of file transfer to move the data from the (supposed) mainframe, make sure it is set to binary mode, and don't do any character conversion before you have split the file into fields and know what is meant to be a character and what is not.
EDIT: You can find a list of several EBCDIC and ASCII-based codepages here or look here for the same as one pdf.
I'm coming to this a bit late, but have a couple of suggestions that might make your life easier...
First, see if you can get your mainframe counterparts to convert all non-character (i.e. binary numeric and packed-decimal) data to display format (e.g. PIC X) before you download it. Then you only need to deal with the "printable" range of numeric characters representing 0 through 9. Printable-character-only code-page conversions are fairly standard and tend not to screw up as much. Reformatting data given a copybook is not a difficult prospect for anybody proficient in a mainframe environment. Unfortunately, sometimes you get the "runaround" and a claim is made that it is extremely costly, takes special software, or any one of a hundred other bogus excuses.
If you get the "runaround", then the next best thing is to download the file in binary format and do your own code-page conversion for the character data (fairly straightforward). Next, deal with the binary data based on your copybook definitions. With a few Googles you should be able to find enough information to get through converting the PACKED-DECIMAL (COMP-3) data to whatever you need.
Here are a couple of links to get you started:
Numeric Data Formats
Packed Decimal
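As a starting point, here is a minimal sketch of a COMP-3 decoder in Python (the function name and scale parameter are illustrative; scale stands for the implied decimal places of a V99 clause):

def unpack_comp3(raw, scale=0):
    """Decode an IBM packed-decimal (COMP-3) field: one digit per
    nibble, trailing sign nibble (C or F = positive, D = negative)."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("invalid packed-decimal digit")
    value = int("".join(map(str, digits)))
    if sign == 0x0D:                     # D = negative; C and F = positive
        value = -value
    return value / 10 ** scale if scale else value

# The question's corrected (EBCDIC-side) examples:
print(unpack_comp3(bytes.fromhex("253C")))                    # -> 253
print(unpack_comp3(bytes.fromhex("00000000000001200C"), 2))   # -> 12.0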
I do not recommend trying to reverse-engineer the code-page conversions applied by your file transfer package in order to decode the packed-decimal and other binary data.
OK, so thanks to both people who responded, as they pointed me in the right direction. This is indeed an ASCII/EBCDIC representation issue: the BCD is stored in EBCDIC. Using an ASCII-to-EBCDIC conversion table yields properly formatted BCD digits.
I used this link to map the data: http://shop.alterlinks.com/ascii-table/ascii-ebcdic-us.php
My data: 0A 14
Converted: 25 3C (turns out 253 is a valid value; the spec was wrong). C = +, all good.
My data: 01 80 0C (excluding leading zeros)
Converted: 01 20 0C = 12.00 (two implied decimal digits in the format). C = +, all good.
My data: 02 01 20 91 22
Converted: 02 01 40 31 7F = 2014/03/17 (F is the unused nibble). All good.
Thanks again for the two above answers which led me in the right direction.
You can avoid the above issues by having the data converted into a modern format for transferring data: XML.

How to encode and decode large OID values?

I have an OID of 1.3.6.1.2.1.2.2.1.8.4096 (ifOperStatus).
In my code I have:
MIB[0]=0x2b
MIB[1]=0x06
MIB[2]=0x01
MIB[3]=0x02
MIB[4]=0x01
MIB[5]=0x02
MIB[6]=0x02
MIB[7]=0x01
MIB[8]=0x08
MIB[9]=0xA0
MIB[10]=0x00
where A0 00 represents 4096.
4096 in hex is 0x1000.
Breaking this into two bytes gives 10 00.
SNMP sends OID subidentifiers as sequences of single bytes.
Therefore, a special rule is required for large numbers, because one byte (eight bits) can only represent a number from 0-255. The rule is that the highest-order bit is used as a flag to let the recipient know that the number spans more than one byte.
I have shifted the bits to the left and added 1 to the 8th bit.
Shift left: 20 00
Bit 8 becomes 1: A0 00
Reference: OID Encoding (http://www.rane.com/note161.html)
Have I encoded the 4096 correctly?
What about decoding a string of data back to the original OID?
Examples would be good for me to understand the concept.
Yes, you have encoded the OID correctly (as far as the contents go). The full encoding (with tag for OID and length, which were omitted) would be 06 0b 2b 06 01 02 01 02 02 01 08 a0 00.
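To illustrate the concept, here is a sketch of base-128 subidentifier encoding and decoding in Python (seven bits per byte, high bit set on all but the last byte; function names are illustrative):

def encode_subid(n):
    """Base-128 encode one OID subidentifier."""
    out = [n & 0x7F]                     # last byte: high bit clear
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)    # continuation bytes: high bit set
        n >>= 7
    return bytes(reversed(out))

def decode_subid(data, pos=0):
    """Decode one subidentifier starting at pos; return (value, next pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos

print(encode_subid(4096).hex())              # -> a000
print(decode_subid(bytes.fromhex("a000")))   # -> (4096, 2)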
With regards to encoding/decoding strings in OIDs (presumably an INDEX), the rules depend on whether or not the value in question is for an object defined as a fixed-length or variable-length string, and whether or not the IMPLIED keyword was used in defining the INDEX.
If it's a fixed-length string, or variable-length string with IMPLIED keyword (which would have to be the last INDEX object) then it is encoded simply as one subidentifier per byte of string. Otherwise, a variable-length string is encoded with one subidentifier to indicate the length of the string, followed by each byte encoded in a single subidentifier as with fixed-length.
RFC 2578 section 7.7 details the rules for encoding values for INDEX objects in an OID.
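For example, an OCTET STRING INDEX value of two bytes "ab" (0x61 0x62) in the variable-length form without IMPLIED would contribute the subidentifiers 2.97.98 (length first), while a fixed-length or IMPLIED definition would contribute just 97.98.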

output only blocks of size n with offset of multiple of stride k from start of binary input in shell

Given a block size n and another size k, I'm looking for a way to output only the blocks whose offset from the start of the input is a multiple of k.
Imagine a file consisting of a number of 4-tuples of 2-byte values. Given this input, I want only the first entry of each tuple.
example input:
00 00 11 11 22 22 33 33
44 44 55 55 66 66 77 77
88 88 99 99 aa aa bb bb
cc cc dd dd ee ee ff ff
example output with n=2 and k=8:
00 00 44 44 88 88 cc cc
which is only the first "column" of the input.
Now while it would be simple to do this in Perl or Python, I need this functionality in a shell script, as the target system has no Perl or Python, only basic utilities. I'm hoping there is a way to misuse an existing tool for this. If that is not possible, I would write some C to do it, but I would like to avoid that.
One use case would be extracting one audio channel from a raw audio file.
A term you might search for (other than "zebra stripes") is "stride." That's what some people call this idea of skipping k bytes each time.
It's not entirely clear from your post, but it looks like you actually want to be able to insert this filter in a pipeline and have it consume raw bytes and output the same. If this is the case, I'm not sure how it can be done easily in plain shell script, so I would suggest you either hunker down and write it in C, or get Python or something installed on the target system.
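Should Python end up installed on the target after all, the filter is only a few lines; a minimal sketch, assuming n <= k (the script name is hypothetical):

import sys

def stride_filter(infile, outfile, n, k):
    """Copy the first n bytes of every k-byte block of infile to
    outfile, i.e. the n-byte blocks at offsets that are multiples of k."""
    while True:
        block = infile.read(k)
        if not block:
            break
        outfile.write(block[:n])

# usage: python3 stride.py 2 8 < input.bin > output.bin
if __name__ == "__main__":
    n, k = int(sys.argv[1]), int(sys.argv[2])
    stride_filter(sys.stdin.buffer, sys.stdout.buffer, n, k)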

Searching Binary Data in Ruby

Using only pure Ruby (or justifiably commonplace gems), is there an efficient way to search a large binary document for a specific string of bytes?
Deeper context: the mpeg4 container format is a 4-byte-indexed serialised data structure; without having to parse the structure fully (I can assume it is valid), I want to pull out specific tags.
For those of you that haven't come across this 'dmap' serialization before it works something like this:
<4-byte length><4-byte tag><4-byte length><4-byte type definition><8 bytes of something I can't remember><data>
E.g., this defines the 'tvsh' (or TV Show) tag as being 'Futurama':
00 00 00 20 ...
74 76 73 68 tvsh
00 00 00 18 ....
64 61 74 61 data
00 00 00 01 ....
00 00 00 00 ....
46 75 74 75 Futu
72 61 6D 61 rama
The exact structure isn't really important, I'd like to write a method which can pull out the show name when I give it 'tvsh' or that it's season 2 if I give it 'tvsn'.
My first plan would be to use Regular Expressions, but I get the (unjustified) feeling that this would be slow.
Let me know your thoughts! Thanks in advance
In Ruby you can use the /n flag when creating your regex to tell Ruby that your input is 8-bit data.
You could use /(.{4})tvsh(.{4})data(.{8})([\x20-\x7F]+)/n to match 4 bytes, tvsh, 4 bytes, data, 8 bytes, and any number of ASCII characters. I don't see any reason why this regex would be significantly slower to execute than hand-coding a similar search. If you don't care about the 4-byte and 8-byte blocks, /tvsh.{4}data.{8}([\x20-\x7F])/n should be nearly as fast as a literal text search for tvsh.
If I understand your description correctly, the whole file consists of a number of such "blocks" with a fixed structure?
In that case, I suggest scanning the blocks one by one and skipping the ones that are not of interest to you. So each step should do the following (see the sketch after the list):
1. Read 8 bytes (using IO#readbytes or a similar method).
2. From the header just read, extract the size (first 4 bytes) and the tag (second 4).
3. If the tag is the one you need, skip the following 16 bytes and read size-24 bytes.
4. If the tag is not of interest, skip the following size-8 bytes (the 8-byte header has already been read).
5. Repeat.
For skipping bytes, you can use IO#seek.
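A sketch of that loop, in Python for concreteness (Ruby's IO#read and IO#seek map onto it one-to-one; the file name in the usage comment is hypothetical):

import struct

def find_atom(f, wanted):
    """Scan fixed-structure <length><tag>... blocks, skipping atoms
    that don't match; return the payload of the wanted tag, or None."""
    while True:
        header = f.read(8)
        if len(header) < 8:
            return None                  # end of file, tag not found
        size, tag = struct.unpack(">I4s", header)
        if tag == wanted:
            f.seek(16, 1)                # inner length/tag, type, reserved
            return f.read(size - 24)     # e.g. b"Futurama" for b"tvsh"
        f.seek(size - 8, 1)              # 8 header bytes already consumed

# with open("episode.m4v", "rb") as f:   # hypothetical file
#     print(find_atom(f, b"tvsh"))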
Theoretically you can use regexes against any arbitrary data, including binary strings. HTH.
