Edit and read binary files? - performance

For a data compression project, I want to be able to edit and read binary files. For this particular project it is very important to get 256 combinations out of 1 byte. I noticed that saving one character in Notepad resulted in a 1-byte file, which is great, so long as there are 256 characters linked to all 8-bit combinations. ASCII currently offers about 218 typeable characters; the rest are control characters.
I know that there are 256 combinations in 8 bits (1 byte) because 2^8 = 256, and I want to be able to use all of those combinations for data compression. So a binary editor and reader would be perfect!

I'm afraid your question is not specific enough. What kind of tooling are you looking for?
Hex is not fake binary; it's just another representation of the same data.
If you're on Windows, open the calculator and switch it to 'Programmer' mode. It will let you convert values between decimal, hexadecimal and binary representations.
Then find yourself a hex editor and you're in business.
Since you mention 'ASCII', I suggest you read Joel Spolsky's post on character encodings. It's an excellent post, which clarifies how difficult it can be to show plain text. The post is from 2003, but still valid.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
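To make the point about representations concrete, here is a minimal Python sketch (Python and the file name `all_bytes.bin` are assumptions, not something from the question). It writes every one of the 256 possible byte values to a file in binary mode, reads them back, and shows the same data as hex and as binary digits.

```python
import os
import tempfile

# Every possible byte value, 0 through 255, exactly once.
all_bytes = bytes(range(256))

# Binary mode ("wb"/"rb") applies no text encoding and no newline translation.
path = os.path.join(tempfile.mkdtemp(), "all_bytes.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.write(all_bytes)

with open(path, "rb") as f:
    data = f.read()

print(len(data))                 # number of bytes read back
print(data[:4].hex())            # first four bytes shown as hexadecimal
print(format(data[255], "08b"))  # last byte shown as eight binary digits
```

A hex editor shows the same thing interactively: the file holds bytes, and hex, decimal and binary are only ways of displaying them.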

Related

Why do people sometimes convert numbers or strings to bytes?

Sometimes I encounter questions about converting something to bytes. Is there any situation where it is vitally important to convert to bytes, or what could I convert something to bytes for?
In most languages the most common string functions come as part of the language or in a pre-made library/include/import, often employing object code to take advantage of processor-based string functions. Sometimes, however, you need to do something with a string that isn't natively supported by the language. So, since 8-bit days, people have viewed strings as arrays of 7- or 8-bit characters, each of which fits within a byte, and used conventions like ASCII to determine which byte value represents which character.
While standard languages often have functions like "string.replaceChar(OFFSET,'a')", this methodology can be painstakingly slow, because each call to the replaceChar method results in processing overhead which may be greater than the processing that needs to be done.
There is also the simplicity factor when designing your own string algorithms but, like I said, most of the common algorithms come prebuilt in modern languages (stringCompare, trimString, reverseString, etc.).
Suppose you want to perform an operation on a string which doesn't come as standard.
Suppose you want to add two numbers which are represented as decimal digits in strings, and the numbers are larger than the 64-bit word size of the processor. The RSA encryption/decryption behind the SSL browser padlock uses numbers which don't fit into the word size of a desktop computer; nonetheless, the desktop programs which deal with RSA certificates and keys must be able to process this data, which is actually held as strings.
There are many and varied reasons you would want to deal with a string as an array of bytes, but each of these reasons would be fairly specialised.
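The big-number case mentioned above can be sketched as schoolbook addition over the ASCII bytes of two digit strings. This is an illustrative Python sketch; the function name `add_decimal_strings` is made up for the example, and Python's own integers are already arbitrary-precision, so the byte-level loop is only there to show the idea.

```python
def add_decimal_strings(a: str, b: str) -> str:
    """Add two non-negative integers given as strings of decimal digits."""
    # Work on the ASCII bytes of each number, least significant digit first.
    x = a.encode("ascii")[::-1]
    y = b.encode("ascii")[::-1]
    digits = bytearray()
    carry = 0
    for i in range(max(len(x), len(y))):
        # The byte for "0" is 48, so subtracting 48 gives the digit's value.
        dx = x[i] - 48 if i < len(x) else 0
        dy = y[i] - 48 if i < len(y) else 0
        carry, d = divmod(dx + dy + carry, 10)
        digits.append(48 + d)
    if carry:
        digits.append(48 + carry)
    # Digits were produced least significant first, so reverse them back.
    return digits[::-1].decode("ascii")


print(add_decimal_strings("99999999999999999999", "1"))
```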

Bits and bytes: what form are they stored in?

I'm still confused about bits and bytes, although I've been searching the internet. Is one ASCII character = 1 byte = 8 bits? So 8 bits have 256 unique patterns, which cover all the ASCII codes. What form are they stored in on our computer?
And if I typed "Hello" does that mean this consists of 5 bytes?
Yes to everything you wrote. "Bit" is a binary digit: a 0 or a 1. Historically there existed bytes of smaller sizes; now "byte" only ever means "8 bits of information", or a number between 0 and 255.
No. ASCII is a character set with 128 codepoints stored as the values 0-127. Modern computers predominantly address 8-bit memory and disk locations so a 7-bit ASCII value takes up 8 bits.
There is no text but encoded text. An encoding maps a member of a character set to one or more bytes. Unless you absolutely know you are using ASCII, you probably aren't. There are quite a few character sets with encodings that cover all 256 byte values and use any combination of byte values to encode a string.
There are several character sets that are similar but have slightly fewer than 256 characters, and others that use more than one byte to encode a codepoint and don't use every combination of byte values.
Just so you know, Unicode is the predominant character set except in very specialized situations. It has several encodings. UTF-8 is often used for storage and streams. UTF-16 is often used in memory, particularly in Java, .NET, JavaScript, XML, …. When text is communicated between systems, there has to be an agreement, specification, standard, or indication about which character set and encoding it uses so a sequence of bytes can be interpreted as characters.
To add to the confusion, programming languages have data types called char, Character, etc. You have to look at the specific language's reference manual to see what they mean. For example, in C, char is simply an integer type whose size is that of the character encoding used by that C implementation. (C also calls this a "byte" and it is not necessarily 8 bits. In all other contexts, people mean 8 bits when they say "byte". If they want to be exceedingly unambiguous they might say "octet".)
"Hello" is five characters. In a specific character set, it is five codepoints. In a specific encoding for that character set, it could be 5, 10, 20, or some other number of bytes.
Also, in the source code of a specific language, a literal string like that might be "null-terminated". This means that you could say it is 6 "characters". Other languages might store a string as a counted sequence of code units. Again, you have to look at the language reference to know the underlying data structure of strings. Or, if the language and the libraries used with it are sufficiently high-level, you might never need to know such internals.
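The byte counts for "Hello" can be checked with a short Python sketch (Python is an assumption here; any language that exposes encodings would do). The "-le" codec variants are chosen so that no byte order mark is prepended to the output.

```python
text = "Hello"

# The same five characters under different encodings.
for encoding in ("ascii", "utf-8", "utf-16-le", "utf-32-le"):
    encoded = text.encode(encoding)
    print(encoding, len(encoded), encoded.hex())

# A C-style string literal would also store a terminating zero byte.
c_style = text.encode("ascii") + b"\x00"
print(len(c_style))
```

ASCII and UTF-8 give 5 bytes, UTF-16 gives 10, UTF-32 gives 20, and the null-terminated form takes 6.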

How does the smaz compression library work?

I'm currently working on a short-text compression project for my language. As a beginner, I know some basic compression algorithms like LZW, but I still don't understand how smaz works. I have 2 questions:
How does smaz work?
How do I build the codebook and the reversed codebook?
Can anyone explain it to me?
Thank you very much.
Trying to answer your questions:
How does smaz work?
According to [1]:
Smaz has a hard-wired constant built-in codebook of 254 common English
words, word fragments, bigrams, and the lowercase letters (except j,
k, q). The inner loop of the Smaz decoder is very simple:
Fetch the next byte X from the compressed file.
Is X == 254? Single byte literal: fetch the next byte L, and pass it straight through to the decoded text.
Is X == 255? Literal string: fetch the next byte L, then pass the following L+1 bytes straight through to the decoded text.
Any other value of X: lookup the X'th "word" in the codebook (that "word" can be from 1 to 5 letters), and copy that word to the decoded
text.
Repeat until there are no more compressed bytes left in the compressed file.
Because the codebook is constant, the Smaz decoder is unable to
"learn" new words and compress them, no matter how often they appear
in the original text.
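The decoder loop quoted above can be sketched in a few lines of Python. The five-entry `CODEBOOK` below is a hypothetical toy table, not the real Smaz codebook, so the byte counts it produces say nothing about the real library; only the handling of the values 254 and 255 follows the description.

```python
# Hypothetical toy codebook; the real Smaz table has 254 entries.
CODEBOOK = [b"the", b" ", b"e", b"nd", b"and"]


def toy_decode(compressed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(compressed):
        x = compressed[i]
        if x == 254:
            # Single byte literal: copy the next byte through unchanged.
            out.append(compressed[i + 1])
            i += 2
        elif x == 255:
            # Literal string: the next byte L is followed by L + 1 literal bytes.
            length = compressed[i + 1] + 1
            out += compressed[i + 2 : i + 2 + length]
            i += 2 + length
        else:
            # Any other value is an index into the codebook.
            out += CODEBOOK[x]
            i += 1
    return bytes(out)


print(toy_decode(bytes([0, 1, 2, 3])))
```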
This page could be helpful to understand the code.
How to build the codebook and reversed codebook?
The TODO file in the repository and the author's comments on Reddit point out that the dictionary was generated by an unreleased Ruby script. The author also explains:
btw what the Ruby program does is to consider all the possible substrings, and even all the possible separated words, and build a
table of frequencies, than adjust the weight based on the string
length, and finally hand tuning the table to compress specific things
very well. I added by hand the "http://" and ".com" token for example,
removing the final two entries.
An alternative to your project could be the shoco library which supports generation of a custom compression model based on your language.
The smaz source is only 178 lines, and just 99 lines without comments and codebook tables. You should read it to see how it works.
Smaz is a pretty simple codebook compression scheme (like LZW, which you know, but with a fixed table). The library contains a table of the most popular terms in English (lines 5-51 for the compression table and 56-76 for the decompression table) and replaces those terms with their indexes in the compressed string. Decompression does the reverse.
For example, the string "the end" would be compressed by 58%, because a term such as "the" becomes a one-byte index into the compression table. So a 7-byte string becomes a 4-byte string.
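The compression direction can be sketched the same way. This Python example assumes a greedy longest-match strategy and a hypothetical five-entry `CODEBOOK`; neither is taken from the smaz source, so treat it as an illustration of codebook substitution rather than a description of the library.

```python
# Hypothetical toy codebook; the real Smaz table is much larger.
CODEBOOK = [b"the", b" ", b"e", b"nd", b"and"]
INDEX = {word: i for i, word in enumerate(CODEBOOK)}
MAX_WORD = max(len(word) for word in CODEBOOK)


def flush_literals(pending: bytearray, out: bytearray) -> None:
    """Emit buffered bytes that matched no codebook entry."""
    while pending:
        chunk = bytes(pending[:256])
        del pending[:256]
        if len(chunk) == 1:
            out += bytes([254]) + chunk                  # single byte literal
        else:
            out += bytes([255, len(chunk) - 1]) + chunk  # literal string


def toy_encode(text: bytes) -> bytes:
    out = bytearray()
    pending = bytearray()
    i = 0
    while i < len(text):
        # Greedy: try the longest candidate first, then shorter ones.
        for size in range(min(MAX_WORD, len(text) - i), 0, -1):
            code = INDEX.get(text[i : i + size])
            if code is not None:
                flush_literals(pending, out)
                out.append(code)
                i += size
                break
        else:
            # No codebook entry starts here: buffer the byte as a literal.
            pending.append(text[i])
            i += 1
    flush_literals(pending, out)
    return bytes(out)


print(list(toy_encode(b"the end")))
print(list(toy_encode(b"the cat")))
```

With this toy table "the end" is covered entirely by codebook entries, while "cat" in "the cat" has to be carried as a literal string, which costs two extra bytes.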

Is there a good two way hash to convert an email address to a predictable, readable, unix username?

We are working with a number of Unix-based filesystems, all of which share a similar set of restrictions on which characters can be used in the username fields. One of those restrictions is no "@", "_", or "." in the names. Being Unix, there are a number of other restrictions.
So the question is whether there is a good known algorithm that can take an email address and turn it into a predictable Unix filename. We would need to reverse this at some point to get the email back.
I've considered doing things like "."->"DOT", "@"->"AT", etc. But there are size limitations and other things that are generally problematic. I could also optimize by mapping the @xyz.com part of the email to a special char or something. Each implementation would only have at most 3 domains it would need to support. I'm hoping someone has found a solution without a huge number of tradeoffs.
UPDATE:
-The two target filesystems are AFS and NFS.
-Base64 doesn't work as it has incompatible characters, such as "/".
-Readable is preferable.
Seems like the best answer would be to replace the @xyz.com domain with a single non-standard character, and then have a function that could shrink the first part of a name to something that fits in the username length restrictions of the various filesystems. But what is a good function for that?
You could try a modified version of the URL percent (%) encoding scheme used for URIs.
If the percent symbol isn't allowed on your particular filesystem(s), simply replace it with a different, allowed character (and remember to encode any occurrences of that character properly).
Using this method:
mail.address@server.com
Would become:
mail%2Eaddress%40server%2Ecom
Or, if you had to substitute (for example), the letter a instead of the % symbol:
ma61ila2Ea61ddressa40servera2Ecom
Not exactly human-readable perhaps, but easily enough processed through an encoding algorithm. For the best space efficiency, your escape character should be a character allowed by the filesystem, yet one that is not likely to appear frequently in an address.
This encoding scheme has the advantage that there is no size increase for most normal characters. The string length will ONLY go up for characters not supported by the filesystem.
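The scheme above can be sketched in Python as follows. The function names and the choice of "ASCII letters and digits" as the safe set are assumptions for illustration; a real deployment would use whatever set the target filesystems actually accept.

```python
import string

# Assumed safe set: characters taken to be allowed on every target filesystem.
SAFE = set(string.ascii_letters + string.digits)


def escape_encode(text: str, esc: str = "%") -> str:
    text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    out = []
    for ch in text:
        if ch in SAFE and ch != esc:
            out.append(ch)
        else:
            # Escape character followed by the two-digit hex code.
            out.append(f"{esc}{ord(ch):02X}")
    return "".join(out)


def escape_decode(encoded: str, esc: str = "%") -> str:
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == esc:
            out.append(chr(int(encoded[i + 1 : i + 3], 16)))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


print(escape_encode("mail.address@server.com"))
print(escape_encode("mail.address@server.com", esc="a"))
```

Because the escape character itself is always escaped, decoding is unambiguous whichever character is chosen.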
Check out base64. Encoding and decoding is well defined.
I'd prefer this over rolling my own format any day.
Hmm, from your question I'm not totally clear on this point, but since you wanted some conversion I'm assuming that you want something that is at least human readable?
Each OS may have different restrictions, but are you close enough to the platforms that you would be able to find out/test what is acceptable in a username? If you could find three 'special' characters that you could use just to do a replace on '@', '.', '_' you would be good to go. (Is that comprehensive? If not, you would need to make sure you know all of them, otherwise you could clash.) I searched a bit trying to find whether there was a POSIX standard, but wasn't able to find anything, so that's why I think that just testing what's valid would be the most direct route.
With even one special character, you could do URL encoding, either with '%' if it's available, or whatever you choose if not, say '!', then { '@'->'!40', '_'->'!5F', '.'->'!2E' }. (The spec, RFC 1738, http://www.rfc-editor.org/rfc/rfc1738.txt, defines the characters as US-ASCII, so you can just find a table, e.g. in Wikipedia's ASCII article, and look up the correct hex digits there.) Or you could just do your own simple mapping: since you don't need the whole ASCII set, you could use two characters per escaped character and have, say, '!a', '!u', '!p' for at, underscore, period.
If you have two special characters, say '%' and '!', you could delimit text that represents the character, say '%at!', '%us!', and '%pd!'. (This is pretty much HTML-style encoding, but instead of '&' and ';' you are using the available ones, and you're making up your own mnemonics.) Another idea is that you could use runs of a symbol to determine the translated character, where each new character flips which symbol is being used. (This conveniently stops the run if we need to put two of the disallowed characters next to each other.) So assume '%' and '!', with period being 1, underscore 2, and at-sign 3: 'mickey._sample_@fake.out' would become 'mickey%!!sample%%!!!fake%out'. There are other variations, but this one is easy to code.
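The alternating-run idea can be sketched in Python like this. The names `runs_encode` and `runs_decode` are made up for the example, and the sketch assumes the input never contains '%' or '!' itself.

```python
from itertools import groupby

SYMBOLS = "%!"                         # the two characters assumed to be allowed
RUN_LENGTH = {".": 1, "_": 2, "@": 3}  # period 1, underscore 2, at-sign 3
FROM_RUN = {n: ch for ch, n in RUN_LENGTH.items()}


def runs_encode(text: str) -> str:
    out = []
    which = 0  # index into SYMBOLS, flipped after every escaped character
    for ch in text:
        if ch in RUN_LENGTH:
            out.append(SYMBOLS[which] * RUN_LENGTH[ch])
            which ^= 1
        else:
            out.append(ch)
    return "".join(out)


def runs_decode(encoded: str) -> str:
    out = []
    # groupby yields maximal runs of identical characters.
    for ch, group in groupby(encoded):
        count = len(list(group))
        if ch in SYMBOLS:
            out.append(FROM_RUN[count])
        else:
            out.append(ch * count)
    return "".join(out)


print(runs_encode("mickey._sample_@fake.out"))
```

Because the symbol flips after every escaped character, two adjacent runs never use the same symbol, which is what lets the decoder split them.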
If none of this is an option (e.g. no symbols at all, just [a-zA-Z0-9]), then really I think the Base64 answer sounds about right. Really, once we're getting to anything other than a simple replacement (and even then), it's already getting hard to type, if that's the goal. But if you really need to keep the email mostly readable, what you do is implement some sort of escaping. I'm thinking of using '0' as your escape character, so now '0' becomes '00', '@' becomes '01', '.' becomes '02', and '_' becomes '03'. So now 'mickey01._sample_@fake.out' would become 'mickey0010203sample0301fake02out'. Not beautiful, but it should work since we escaped any raw 0's; just always make sure you define a mapping for whatever you choose as your escape char and you should be fine.
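A minimal Python sketch of the '0'-escape mapping described above (the function names are illustrative, and '@' is taken as the at-sign being escaped):

```python
# '0' is the escape character, so a raw '0' must itself be escaped.
ESCAPES = {"0": "00", "@": "01", ".": "02", "_": "03"}
UNESCAPES = {code[1]: ch for ch, code in ESCAPES.items()}


def zero_encode(text: str) -> str:
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def zero_decode(encoded: str) -> str:
    out = []
    chars = iter(encoded)
    for ch in chars:
        if ch == "0":
            # The character after the escape selects the original character.
            out.append(UNESCAPES[next(chars)])
        else:
            out.append(ch)
    return "".join(out)


print(zero_encode("mickey01._sample_@fake.out"))
```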
That's all I can think of atm. :) If there's no need for these usernames to be readable in the raw, a standard encoding is a good way to go; there's lots of nice, debugged, heavily field-tested code out there for it, and it solves your problem quite handily. :) Apparently Base64 won't work, since it can produce slashes; in that case, just use the 2-digit US-ASCII hex value for each character and you're done.
Given...
- the limited set of characters allowed in various file systems
- the desire to keep the encoded email address short (both for human readability and for possible concerns with file system limitations)
...a possible approach may be a two-step encoding logic whereby the email is
- first compressed using a lossless compression algorithm such as Lempel-Ziv, effectively turning it into a "binary" form, stored in a shorter array of bytes
- then encoded, as this array of bytes, using a Base64-like algorithm
The idea is to minimize the size of the binary representation, so that the expansion associated with the storage inefficiency of the encoding (which can only store roughly 6 bits, and probably a bit less, per character) doesn't cause the encoded string to be too long.
Without getting overly sophisticated in either the compression or the encoding, such a system would likely produce encoded strings that are maybe 4/5 of the input string size (the email address): the compression should easily halve the size, but the encoding, say Base32, would grow the binary form by 8/5, and 1/2 x 8/5 = 4/5.
Efforts to improve the compression ratio may allow the selection of more "wasteful" encoding schemes (with smaller character sets), and this may help make the output more human-readable and also more broadly safe on various flavors of file systems. For example, whereas Base64 seems optimal space-wise, using only uppercase letters (base 26) may ensure portability of the underlying scheme to file systems where the file names are not case sensitive.
Another benefit of the initial generic compression is that few, if any, assumptions need to be made about the syntax of valid input keys (email addresses here).
Ideas for compression:
LZ seems like a good choice, though one may consider priming its initial buffer with common patterns found in email addresses (for example ".com", or even "a.com", "b.com", etc.). This initial buffer would ensure several instances of "citations" per compressed email address, and hence a better overall compression ratio. To squeeze out a few more bytes, maybe LZH or other LZ variations could be used.
Aside from the priming of the buffer mentioned above, another customization may be to use a shorter buffer than typical LZ algorithms, since the strings we have to compress (email addresses) are themselves very short and would not benefit from, say, a 512-byte buffer. (Shorter buffer sizes allow shorter codes for the citations.)
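One readily available way to try the priming idea is zlib's preset dictionary, exposed in Python through the `zdict` argument. This is a sketch under assumptions: the `PRIMER` contents and function names are hypothetical, and zlib's DEFLATE is swapped in for the custom LZ variant the answer has in mind. The zlib container also adds header and checksum bytes, so very short inputs may not shrink at all with this particular tool.

```python
import zlib

# Hypothetical priming data: substrings expected to be common in the inputs.
# The zlib documentation suggests placing the most frequent ones at the end.
PRIMER = b".org.net.co.uk@example.com"


def primed_compress(address: str) -> bytes:
    comp = zlib.compressobj(level=9, zdict=PRIMER)
    return comp.compress(address.encode("ascii")) + comp.flush()


def primed_decompress(blob: bytes) -> str:
    # The decompressor must be given exactly the same dictionary.
    decomp = zlib.decompressobj(zdict=PRIMER)
    return (decomp.decompress(blob) + decomp.flush()).decode("ascii")


blob = primed_compress("mickey@example.com")
print(len(blob), primed_decompress(blob))
```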
Ideas for encoding:
Base64 is not suitable as-is because of the slash (/), plus (+) and equals (=) characters. Alternate characters could be used to replace these; dash (-) comes to mind, but finding three characters allowed by all "flavors" of the targeted file systems may be a stretch.
Nevertheless, Base64, with its ratio of 4 output characters per 3 payload bytes, provides what is probably the upper limit of achievable storage efficiency [for an acceptable character set].
At the lower end of this efficiency is maybe an ASCII representation of the hexadecimal values of the bytes in the array. This format, which doubles the payload bytes, may be acceptable length-wise, and is interesting because of its simplicity: there is a direct and simple relation between each nibble (4 bits) in the input and each character in the encoded string.
Base32, whereby A through Z encode 0 through 25 and 0 through 5 encode 26 through 31, respectively, is essentially a variation of Base64 with a ratio of 8 output characters per 5 payload bytes, and may be a very viable compromise.
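The Base32 variant described above differs from the alphabet used by Python's `base64.b32encode` (A-Z followed by 2-7) only in which digits stand for 26 through 31. A minimal sketch, with illustrative function names, that remaps the digits and drops the '=' padding:

```python
import base64

# The standard alphabet uses 2-7 for the values 26-31; remap them to 0-5.
TO_CUSTOM = str.maketrans("234567", "012345")
FROM_CUSTOM = str.maketrans("012345", "234567")


def b32_encode(payload: bytes) -> str:
    text = base64.b32encode(payload).decode("ascii")
    return text.rstrip("=").translate(TO_CUSTOM)


def b32_decode(text: str) -> bytes:
    standard = text.translate(FROM_CUSTOM)
    standard += "=" * (-len(standard) % 8)  # restore the stripped padding
    return base64.b32decode(standard)


print(b32_encode(b"\xff" * 5), len(b32_encode(bytes(5))))
```

Five payload bytes always produce eight output characters, which is the 8/5 ratio mentioned above.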

Encrypt printable text so result is still printable (can be typed)

I want to encrypt some info for a licensing system and I want the result to be able to be typed in by the user.
Update: This operation must be reversible (decrypt-able)
E.g.,
Encrypt ( ComputerID+ProductID) -> (any standard ASCII character that can be typed. Ideally maybe even just A-Z).
So far what I did was to convert the encrypted text to HEX (so it's any character from 0-F) but that doubles the number of characters.
I'm using VB6.
I'm thinking I'd do some operation on each pair of (Input$(x) and Key$(x)) and then do a MOD to keep it within a range of ascii values (maybe 0-9-A-Z)
Any suggestions of a good algorithm?
Look into Base64 "encryption."
Base64 represents binary data using 64 different ASCII characters, versus hex, which uses only 16. That makes Base64 more compact, which is what you are looking for.
EDIT:
Code to do this in VB6 is available here: http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
Per Fuzzy Lollipop, below, Base32 looks like an even better option. Bonus points if you can find an example of that.
EDIT: I found an example of Base32 for VB6 although I've not tried it yet. -Clay
Encode the encrypted bytes in hex, Base32, or Base64.
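To compare the three options by length, here is a small Python sketch (Python rather than VB6 is an assumption made for brevity; the 15-byte payload is a stand-in for encrypted data):

```python
import base64

payload = bytes(range(15))  # stand-in for 15 encrypted bytes

as_hex = payload.hex().upper()
as_b32 = base64.b32encode(payload).decode("ascii")
as_b64 = base64.b64encode(payload).decode("ascii")

print(len(as_hex), len(as_b32), len(as_b64))  # 30 24 20
```

Hex needs 2 characters per byte, Base32 needs 8 per 5 bytes, and Base64 needs 4 per 3 bytes. Base32 uses only uppercase letters and the digits 2-7, which suits text that has to be typed in.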
Do you want this to be reversible -- to recover the IDs from the encrypted text? If so then it matters how you combine the key and input strings.
Usually you'd XOR each byte pair (work with byte arrays to avoid Unicode issues), cycling through the key string if it's shorter than the input. You can then use Base N encoding (32, 64, etc.) to generate the license string.
Both operations are reversible: you can recover the XORed strings from the Base N string, then XOR with the key again to get the original IDs.
If you don't care about reversing the operations, then any convolution of key and ID will do. XOR is just the simplest.
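The XOR-then-encode approach can be sketched as follows. This is Python rather than VB6, the function names and sample values are made up, and it illustrates the reversible scheme described above rather than a vetted cipher; a non-empty key is assumed.

```python
import base64
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    # cycle() repeats the key when it is shorter than the data.
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


def make_license(ids: str, key: str) -> str:
    mixed = xor_bytes(ids.encode("utf-8"), key.encode("utf-8"))
    return base64.b32encode(mixed).decode("ascii")


def read_license(text: str, key: str) -> str:
    mixed = base64.b32decode(text)
    # XORing with the same key a second time restores the original bytes.
    return xor_bytes(mixed, key.encode("utf-8")).decode("utf-8")


code = make_license("PC-1234+PROD-7", "secret")  # hypothetical IDs and key
print(code, read_license(code, "secret"))
```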

Resources