Why do people sometimes convert numbers or strings to bytes? - byte

Sometimes I encounter questions about converting something to bytes. Is there any situation where it is vitally important to convert to bytes, or what might I convert something to bytes for?

In most languages the most common string functions come as part of the language or in a prebuilt library/include/import, often employing object code to take advantage of processor-based string instructions. However, sometimes you need to do something with a string that isn't natively supported by the language. So, since 8-bit days, people have viewed strings as an array of 7- or 8-bit characters, each of which fits within a byte, and used conventions like ASCII to determine which byte value represents which character.
While standard languages often have functions like "string.replaceChar(OFFSET,'a')", this approach can be painstakingly slow because each call to the replaceChar method incurs processing overhead which may be greater than the processing that actually needs to be done.
There is also the simplicity factor when designing your own string algorithms but, like I said, most of the common algorithms come prebuilt in modern languages (stringCompare, trimString, reverseString, etc.).
Suppose you want to perform an operation on a string which doesn't come as standard.
Suppose you want to add two numbers which are represented as decimal digits in strings, and the numbers are larger than the 64-bit word size of the processor. The RSA encryption/decryption behind the SSL browser padlocks uses numbers which don't fit into the word size of a desktop computer, but nonetheless the programs on a desktop which deal with RSA certificates and keys must be able to process these data, which are actually strings.
There are many and varied reasons you would want to deal with a string as an array of bytes, but each of these reasons would be fairly specialised.
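As a rough illustration of the big-number case above, here is a minimal Python sketch that adds two decimal strings by walking their ASCII bytes from the least significant digit, the way you would by hand. The function name add_decimal_strings is made up for this example, and it assumes both inputs contain only the digits 0-9.

```python
def add_decimal_strings(a, b):
    """Add two non-negative decimal numbers given as ASCII digit strings."""
    # View each string as bytes, least significant digit first.
    x = a.encode("ascii")[::-1]
    y = b.encode("ascii")[::-1]
    out = bytearray()
    carry = 0
    for i in range(max(len(x), len(y))):
        # 48 is the ASCII code of '0', so byte - 48 is the digit value.
        dx = x[i] - 48 if i < len(x) else 0
        dy = y[i] - 48 if i < len(y) else 0
        carry, digit = divmod(dx + dy + carry, 10)
        out.append(48 + digit)
    if carry:
        out.append(48 + carry)
    # Restore most-significant-first order and turn the bytes back into text.
    return bytes(out[::-1]).decode("ascii")
```

Python integers are already arbitrary precision, so this is only meant to show the byte-level view; in a language with fixed-width integers the same loop is what makes numbers wider than the machine word workable.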

Related

how to add signature to protobuf messages?

Is there a common way to sign protobuf messages? What I can imagine is to add a data field and a signature field to a message, use SerializeToArray (in cpp) or ToByteArray (in c#) to get the raw bytes, then use md5 or sha256, etc. to calculate the hash value, and assign the hash value to the field 'sign'. But I don't know if there is any difference in the raw bytes between different languages, or between proto2 and proto3?
The approach you discuss for signing is fine for integrity validation purposes, as long as your hashing algorithm is strong enough. If it is for anything stronger than an integrity checksum, you should probably use a true cryptographic signature (with public+private keys), as anyone can otherwise sign their own arbitrary payload, defeating the point.
You also seem to discuss determinism. The raw bytes in protobuf are not entirely deterministic. There are multiple valid ways of representing the same payload in protobuf, including:
reordering fields (numerical order is a "should", not a "must")
including or omitting zeros (different between proto2 and proto3)
packed vs sequential "repeated" encoding
the reality that "map" is usually backed by some platform-specific inbuilt map/dictionary type, which commonly do not define order, so in theory it can vary every time
not really an issue in reality, but in theory you can encode a varint with an arbitrary length (up to 10 bytes) simply by including unnecessary groups of zero bytes; similar to in text (JSON, etc) saying that 42, 042, 0042 and 0000000042 all represent the same integer; nobody does that, but: it would be valid
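To make the varint point in the list above concrete, here is a small Python sketch of base-128 varint encoding with optional redundant padding. It is an illustration written for this answer, not code from any protobuf library; the names encode_varint, decode_varint and extra_zero_groups are hypothetical.

```python
def encode_varint(value, extra_zero_groups=0):
    """Encode a non-negative int as a base-128 varint.

    extra_zero_groups appends redundant all-zero 7-bit groups, giving a
    longer but still decodable encoding of the same value.
    """
    groups = []
    while True:
        groups.append(value & 0x7F)  # low 7 bits first
        value >>= 7
        if value == 0:
            break
    groups.extend([0] * extra_zero_groups)
    # Every group except the last carries the continuation bit 0x80.
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])


def decode_varint(data):
    """Decode one varint from the start of data."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            break
    return result
```

Here encode_varint(42) and encode_varint(42, 1) produce different bytes that decode to the same integer, so a hash over the raw bytes would differ even though the payload is logically identical.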

Most efficient barcode to store a GUID

I have a system that I'm working on at the moment that requires users to log into the system, and the client wants to use a barcode scanner and cards to keep prices down. (Yes, a username and password would be cheaper, but she wants a card-type solution so she gets one.)
All my data uses GUIDs as key fields, so I'd like to store the GUID directly on the card in the barcode. While it's simple enough to code it as 3 of 9, it's not going to be the most efficient use of space.
Is there a best practice or most efficient method for storing GUIDs in a barcode? I'd have assumed that since there's a consistent length, and depth to the data there would be a standard, but I can't find it. Would be easy enough to generate my own - control char either end and then binary data between, but would like something that standard readers will know how to interpret.
Any help gratefully received.
There are no open standards for special-purpose data compaction with generic linear barcodes such as Code 39 and Code 128. Most ISO/IEC-standardised 2D barcodes do support a special-purpose data encoding mechanism called Extended Channel Interpretation (ECI) which allows you to specify that data conforms to a certain application standard or encoding regime, for example ECI 298765 for IPv4 address compaction [*]. Unfortunately GUID compaction isn't amongst those that have been registered and even if it were you would nevertheless need to handle this within your application as reader support would be lacking.
That leaves you with having to pre-encode (and subsequently decode) the GUID into a format that can be handled efficiently by some ubiquitous barcode symbology.
An efficient way to store a GUID would be to convert it to a 40-digit[†] decimal representation and store the result in a Code 128 barcode using double-density numeric compression ("Mode C").
For example, consider the GUID:
cd171f7c-560d-4a62-8d65-16b87419a58c
Expressed as a hexadecimal number:
0xCD171F7C560D4A628D6516B87419A58C
Converted to 40 decimal digits:
0272611800569275698104677545117639878028
Encoded within a Code 128 barcode (image not shown).
Your application would of course need to recognise this input as a decimal-encoded GUID and reverse the above process but I doubt that a significantly more efficient approach exists that doesn't require you to transform the data into an unusual radix and then deal with the complexities of handling ASCII control characters at scan time.
[*] The register of assigned ECI codes is available from the AIM store as "ECI Part 3: Register".
[†] Whilst it is possible to store the entire GUID range within 39 digits a 39-digit Mode C Code 128 symbol is in fact longer than a 40-digit symbol.
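The conversion steps above could be sketched in Python roughly as follows. This only covers the GUID-to-digits transformation and its reverse, not barcode rendering; the function names are made up for this example, and it assumes the GUID arrives in the usual hyphenated text form.

```python
import uuid


def guid_to_decimal(guid_text):
    """Return the GUID as a zero-padded 40-digit decimal string."""
    # uuid.UUID(...).int is the GUID read as one 128-bit integer.
    return str(uuid.UUID(guid_text).int).zfill(40)


def decimal_to_guid(digits):
    """Reverse guid_to_decimal: 40 decimal digits back to GUID text."""
    return str(uuid.UUID(int=int(digits)))
```

A scanned value would be passed through decimal_to_guid before being used as a key. Padding to an even 40 digits keeps the whole payload in digit pairs, which is what the double-density numeric mode works with.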

What is the name of this text compression scheme?

A couple years ago I read about a very lightweight text compression algorithm, and now I can't find a reference or remember its name.
It used the difference between each successive pair of characters. Since, for example, a lowercase letter predicts that the next character will also be a lowercase letter, the differences tend to be small. (It might have thrown out the low-order bits of the preceding character before subtracting; I cannot recall.) Instant complexity reduction. And it's Unicode friendly.
Of course there were a few bells and whistles, and the details of producing a bitstream, but it was super lightweight and suitable for embedded systems. No hefty dictionary to store. I'm pretty sure that the summary I saw was on Wikipedia, but I cannot find anything.
I recall that it was invented at Google, but it was not Snappy.
I think what you're on about is BOCU, Binary-Ordered Compression for Unicode or one of its predecessors/successors. In particular,
The basic structure of BOCU is simple. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF to 10FFFF. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.
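The subtraction step described in that quote can be sketched in Python as below. This is only the delta stage, not the BOCU byte encoding itself: the real scheme also defines how each delta is packed into bytes and how the "previous" value is chosen, none of which is reproduced here. Starting from a previous value of 0 is an assumption made for the sketch.

```python
def code_point_deltas(text, start=0):
    """Return the signed difference between each code point and the one before it."""
    deltas = []
    previous = start  # assumed initial state for this sketch
    for ch in text:
        current = ord(ch)
        deltas.append(current - previous)
        previous = current
    return deltas


def text_from_deltas(deltas, start=0):
    """Rebuild the original text by accumulating the deltas."""
    chars = []
    previous = start
    for delta in deltas:
        previous += delta
        chars.append(chr(previous))
    return "".join(chars)
```

For text that stays within one script, every delta after the first is small, which is what lets a later stage encode most characters in a single byte.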

endian.h for floats

I am trying to read an unformatted binary file that was written by a big-endian machine. My machine is 32-bit little-endian.
I already know how to swap bytes for different variable types, but it is cumbersome work. I found this set of functions, endian.h, that handles integer swapping very easily.
I was wondering if there is something similar for floats or strings, or if I have to program it from scratch, since they are handled differently from integers with regard to endianness.
Thanks.
I do not think there is a standard header for swapping floats. You could take a look at http://www.gamedev.net/page/resources/_/technical/game-programming/writing-endian-independent-code-in-c-r2091
which provides some helpful code.
As for strings, there is no need to do endian-swapping. Endianness determines the order of the bytes within a multi-byte variable. A string consists of a series of chars. Each char occupies only one byte (assuming a single-byte character encoding), so there is nothing to swap.
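For comparison, here is how the same idea looks in Python, where the struct module lets you state the byte order of the data explicitly instead of swapping by hand. This is a sketch assuming the file holds 4-byte IEEE 754 floats; the helper names are made up for this example.

```python
import struct


def read_be_float32(data, offset=0):
    """Interpret 4 bytes at offset as a big-endian IEEE 754 single."""
    return struct.unpack_from(">f", data, offset)[0]


def swap_float32_bytes(data):
    """Reverse the 4 bytes of one float, turning big-endian into little-endian."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes")
    return data[::-1]
```

The float is never converted numerically: only the order of its bytes changes, exactly as for a 32-bit integer. The text bytes of a single-byte string are simply left in place.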

Is there a good two way hash to convert an email address to a predictable, readable, unix username?

We are working with a number of unix based filesystems, all of which share a similar set of restrictions on which characters can be used in the username fields. One of those restrictions is no "#", "_", or "." in the names. Being unix, there are a number of other restrictions.
So the question is whether there is a good known algorithm that can take an email address and turn it into a predictable unix filename. We would need to reverse this at some point to get the email.
I've considered doing things like "."->"DOT", "#"->"AT", etc. But there are size limitations and other things that are generally problematic. I could also optimize by mapping the #xyz.com part of the email to a special char or something. Each implementation would only have at most 3 domains it would need to support. I'm hoping someone has found a solution without a huge number of tradeoffs.
UPDATE:
-The two target filesystems are AFS and NFS.
-Base64 doesn't work as it has incompatible characters, e.g. "/"
-Readable is preferable.
Seems like the best answer would be to replace the #xyz.com domain with a single non-standard character, and then have a function that could shrink the first part of a name to something that fits in the username length restrictions of the various filesystems. But what is a good function for that?
You could try a modified version of the URL percent (%) encoding scheme used for URIs.
If the percent symbol isn't allowed on your particular filesystem(s), simply replace it with a different, allowed character (and remember to encode any occurrences of that character properly).
Using this method:
mail.address#server.com
Would become:
mail%2Eaddress%40server%2Ecom
Or, if you had to substitute (for example), the letter a instead of the % symbol:
ma61ila2Ea61ddressa40servera2Ecom
Not exactly human-readable perhaps, but easily enough processed through an encoding algorithm. For the best space efficiency, your escape character should be a character allowed by the filesystem, yet one that is not likely to appear frequently in an address.
This encoding scheme has the advantage that there is no size increase for most normal characters. The string length will ONLY go up for characters not supported by the filesystem.
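A minimal Python sketch of this escaping scheme follows. It assumes that only ASCII letters and digits are safe on the target filesystem and that the input is plain ASCII; the names escape_encode, escape_decode and SAFE are made up for the example. The thread writes the at-sign as "#", but the sketch uses a literal "@", since that is the character %40 stands for.

```python
import string

# Assumption: only letters and digits are allowed unescaped.
SAFE = set(string.ascii_letters + string.digits)


def escape_encode(text, esc="%"):
    """Replace every unsafe character (and the escape itself) with esc + 2 hex digits."""
    out = []
    for ch in text:
        if ch in SAFE and ch != esc:
            out.append(ch)
        else:
            out.append("%s%02X" % (esc, ord(ch)))
    return "".join(out)


def escape_decode(text, esc="%"):
    """Reverse escape_encode."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == esc:
            out.append(chr(int(text[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With the default escape this reproduces the first encoded form above, and escape_encode("mail.address@server.com", esc="a") reproduces the second. Because the escape character is itself always escaped, decoding is unambiguous.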
Check out base64. Encoding and decoding is well defined.
I'd prefer this over rolling my own format any day.
Hmm, from your question I'm not totally clear on this point, but since you wanted some conversion I'm assuming that you want something that is at least human readable?
Each OS may have different restrictions, but are you close enough to the platforms that you would be able to find out/test what is acceptable in a username? If you could find three 'special' characters that you could use just to do a replace on '#', '.', '_' you would be good to go. (Is that comprehensive? if not you would need to make sure you know all of them otherwise you could clash.) I searched a bit trying to find whether there was a POSIX standard, but wasn't able to find anything, so that's why I think if you can just test what's valid that would be the most direct route.
With even one special character, you could do URL encoding, either with '%' if it's available, or with whatever you choose if not, say '!', then { '#'->'!40', '_'->'!5F', '.'->'!2E' }. (The spec, [RFC1738] http://www.rfc-editor.org/rfc/rfc1738.txt, defines the characters as US-ASCII, so you can just find a table, e.g. in Wikipedia's ASCII article, and look up the correct hex digits there.) Or, since you don't need the whole ASCII set, you could just do your own simple mapping with two characters per escaped character and have, say, '!a', '!u', '!p' for at, underscore, period.
If you have two special characters, say '%' and '!', you could delimit text that represents the character, say '%at!', '%us!', and '%pd!'. (This is pretty much html-style encoding, but instead of '&' and ';' you are using the available ones, and you're making up your own mnemonics.) Another idea is that you could use runs of a symbol to determine the translated character, where each new character flops which symbol is being used. (This conveniently stops the run if we need to put two of the disallowed characters next to each other.) So assume '%' and '!', with period being 1, underscore 2, and at-sign being 3: 'mickey._sample_#fake.out' would become 'mickey%!!sample%%!!!fake%out'. There are other variations but this one is easy to code.
If none of this is an option (e.g. no symbols at all, just [a-zA-Z0-9]), then really I think the Base64 answer sounds about right. Really, once we're getting to anything other than a simple replacement (and even that), it's already getting hard to type, if that's the goal. But if you really need to try to keep the email mostly readable, what you do is implement some sort of escaping. I'm thinking use '0' as your escape character, so now '0' becomes '00', '#' becomes '01', '.' becomes '02', and '_' becomes '03'. So now, 'mickey01._sample_#fake.out' would become 'mickey0010203sample0301fake02out'. Not beautiful but it should work; since we escaped any raw 0's, just always make sure you define a mapping for whatever you choose as your escape char and you should be fine.
That's all I can think of atm. :) Definitely, if there's no need for these usernames to be readable in the raw, Base64 [it seems like apparently Base64 won't work, since it can produce slashes. Heck, ok, just use the 2-digit US-ASCII hex value for each character and you're done...] is a good way to go; there's lots of nice debugged, heavily field-tested code out there for it and it solves your problem quite handily. :)
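The '0'-as-escape idea from this answer can be sketched in Python as follows. The mapping table is the one given in the answer, with '#' standing for the at-sign as it does throughout this thread; the function names are made up for the example.

```python
ESCAPE = "0"
# Character -> code that follows the escape character.
CODES = {"0": "0", "#": "1", ".": "2", "_": "3"}
DECODES = {code: ch for ch, code in CODES.items()}


def zero_escape(text):
    """Escape the disallowed characters, and the escape character itself."""
    return "".join(ESCAPE + CODES[ch] if ch in CODES else ch for ch in text)


def zero_unescape(text):
    """Reverse zero_escape by reading the character after each escape."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == ESCAPE:
            out.append(DECODES[next(chars)])
        else:
            out.append(ch)
    return "".join(out)
```

Running zero_escape on 'mickey01._sample_#fake.out' gives the 'mickey0010203sample0301fake02out' string from the answer. Any other digit passes through untouched, so only literal zeros and the three disallowed characters cost an extra character.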
Given...
- the limited set of characters allowed in various file systems
- the desire to keep the encoded email address short (both for human readability and for possible concerns with file system limitations)
...a possible approach may be a two-step encoding logic whereby the email is
first compressed using a lossless compression algorithm such as Lempel-Ziv, effectively turning it into a "binary" form, stored in a shorter array of bytes
then this array of bytes is encoded using a Base64-like algorithm
The idea is to minimize the size of the binary representation, so that the expansion associated with the storage inefficiency of the encoding (which can only store roughly 6 bits, and probably a bit less, per character) doesn't cause the encoded string to be too long.
Without getting overly sophisticated for either the compression or the encoding, such a system would likely produce encoded strings that are maybe 4/5 of the input string size (the email address): the compression should easily halve the size, but the encoding, say Base32, would grow the binary form size by 8/5.
Efforts in improving the compression ratio may allow the selection of more "wasteful" encoding schemes (with smaller character sets), and this may help make the output more human-readable and also more broadly safe on various flavors of file systems. For example, whereas Base64 seems optimal space-wise, using only uppercase letters (base 26) may ensure portability of the underlying scheme to file systems where the file names are not case sensitive.
Another benefit of the initial generic compression is that few, if any, assumptions need to be made about the syntax of valid input key (email addresses here).
Ideas for compression:
LZ seems like a good choice, though one may consider priming its initial buffer with common patterns found in email addresses (for example ".com" or even "a.com", "b.com", etc.). This initial buffer would ensure several instances of "citations" per compressed email address, hence a better compression ratio overall. To further squeeze a few bytes, maybe LZH or other LZ variations could be used.
Aside from the priming of the buffer mentioned above, another customization may be to use a shorter buffer than typical LZ algorithms, since the strings we have to compress (email address instances) are themselves very short and would not benefit from, say, a 512-byte buffer. (Shorter buffer sizes allow shorter codes for the citations.)
Ideas for encoding:
Base64 is not suitable as-is because of the slash (/), plus (+) and equals (=) characters. Alternate characters could be used to replace these; dash (-) comes to mind, but finding three characters allowed by all "flavors" of the targeted file systems may be a stretch.
Nevertheless, Base64 and its ratio of 4 output characters per 3 payload bytes provide what is probably the upper limit of storage efficiency that is barely achievable [for an acceptable character set].
At the lower end of this efficiency range is perhaps an ASCII representation of the hexadecimal values of the bytes in the array. This format, which doubles the payload size, may be acceptable length-wise, and is interesting because of its simplicity (there is a direct and simple relation between each nibble (4 bits) in the input and each character in the encoded string).
Base32, whereby A thru Z encode 0 thru 25 and 0 thru 5 encode 26 thru 31, respectively, is essentially a variation of Base64 with a ratio of 8 output characters per 5 payload bytes, and may be a very viable compromise.
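A rough Python sketch of the compress-then-encode idea, using only the standard library, might look like this. Several things here are assumptions for illustration: zlib's DEFLATE stands in for "an LZ variant", the PRIMER dictionary contents are invented, and Python's base64.b32encode uses the RFC 4648 alphabet (A-Z and 2-7) rather than the A-Z plus 0-5 alphabet described above.

```python
import base64
import zlib

# Hypothetical priming dictionary: substrings expected to be common in the inputs.
PRIMER = b".org.net.edu@example.com@gmail.com"


def encode_email(email):
    """Compress with a primed dictionary, then Base32-encode without padding."""
    compressor = zlib.compressobj(level=9, zdict=PRIMER)
    packed = compressor.compress(email.encode("ascii")) + compressor.flush()
    return base64.b32encode(packed).decode("ascii").rstrip("=")


def decode_email(name):
    """Reverse encode_email; the same PRIMER must be used on both sides."""
    padded = name + "=" * (-len(name) % 8)  # restore the stripped '=' padding
    packed = base64.b32decode(padded)
    decompressor = zlib.decompressobj(zdict=PRIMER)
    return (decompressor.decompress(packed) + decompressor.flush()).decode("ascii")
```

The output contains only uppercase letters and the digits 2-7. Note that the zlib container adds a header and checksum, so for very short addresses this particular sketch can come out longer than the input; a purpose-built short-buffer coder, as suggested above, would be needed to approach the estimated ratio.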
