Compression algorithm that produces url safe data - algorithm

I'm looking to store cookie data in a compact form.
Is there such a thing as a compression algorithm that produces URL safe output?
Currently my approach is
String jsonData = GSON.toJson(data);
byte[] cookieBinaryData = jsonData.getBytes("UTF-8");
byte[] cookieSnappyData = Snappy.compress(cookieBinaryData);
String cookieBase64Data = new Base64(true).encodeToString(cookieSnappyData);
From this cookieBase64Data is the one stored inside the cookie.
I would be happy to skip the Base64 hop.

How much are you saving by doing this? Is it worth it?
How about just saving an ID in a cookie and then looking up all the data in a database? Sort of like a long-lived session but you're controlling what data you store so there isn't a huge amount.

No. Compression is often a prefix-free code and not a charset. You can use yEnc encoding to safe some bits. Personally I'm not sure why yEnc is doomed. It uses an extended 8 bit ASCII charset and adds a lot less overhead (2%) to the compressed data: http://en.wikipedia.org/wiki/YEnc

Related

Is there an off the shelf binary format that allows string caching

I am investigating migrating of a highly customized and efficient binary format to one of the available binary formats. The data is stored on some low powered mobile among other places, so performance is important requirement.
Advantage of the current format is that all strings are stored in a pool. This means that we don't repeat the same string hundred of times in file, we read it only once during deserialization and all objects are referencing it by its index. It also means that we keep only one copy in memory. So a lot of advantages :)
I was not able to find a way for capnproto or flatbuffers to support this. Or would I need to build layer on top, and in generated object use integer index to strings explicitly?
Thanks you!
FlatBuffers supports string pooling. Simply serialize a string once, then refer to that string multiple times in other objects. The string will only occur in memory once.
Simplest example, schema:
table MyObject { name: string; id: string; }
code (C++):
FlatBufferBuilder fbb;
auto s = fbb.CreateString("MyPooledString");
// Both string fields point to the same data:
auto o = CreateMyObject(fbb, s, s);
fbb.Finish(o);
You can always do this manually like:
struct MyMessage {
stringTable #0 :List(Text);
# Now encode string fields as integer indexes into the string table.
someString #1 :UInt32;
otherString #2 :UInt32;
}
Cap'n Proto could in theory allow multiple pointers to point at the same object, but currently prohibits this for security reasons: it would be too easy to DoS servers that don't expect it by sending messages that are cyclic or contain lots of overlapping references. See the section on amplification attacks in the docs.

Compressing large string in ruby

I have a web application(ruby on rails) that sends some YAML as the value of a hidden input field.
Now I want to reduce the size of the text that is sent across to the browser. What is the most efficient form of lossless compression that would send across minimal data? I'm ok to incur additional cost of compression and decompression at the server side.
You could use the zlib implementation in the ruby core to in/de-flate data:
require "zlib"
data = "some long yaml string" * 100
compressed_data = Zlib::Deflate.deflate(data)
#=> "x\x9C+\xCE\xCFMU\xC8\xC9\xCFKW\xA8L\xCC\xCDQ(.)\xCA\xCCK/\x1E\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15D\x15\x04\x00\xB3G%\xA6"
You should base64-encode the compressed data to make it printable:
require 'base64'
encoded_data = Base64.encode64 compressed_data
#=> "eJwrzs9NVcjJz0tXqEzMzVEoLinKzEsvHhUcFRwVHBUcFRwVHBUcFUQVBACz\nRyWm\n"
Later, on the client-side, you might use pako (a zlib port to javascript) to get your data back. This answer probably helps you with implementing the JS part.
To give you an idea on how effective this is, here are the sizes of the example strings:
data.size # 2100
compressed_data.size # 48
encoded_data.size # 66
Same thing goes vice-versa when compressing on the client and inflating on the server.
Zlib::Inflate.inflate(Base64.decode64(encoded_data))
#=> "some long yaml stringsome long yaml str ... (shortened, as the string is long :)
Disclaimer:
The ruby zlib implementation should be compatible with the pako implementation. But I have not tried it.
The numbers about string sizes are a little cheated. Zlib is really effective here, because the string repeats a lot. Real life data usually does not repeat as much.
If you are working on a Rails application, you can also use the ActiveSupport::Gzip wrapper that allows compression/decompression of strings with gzip.
compressed_log = ActiveSupport::Gzip.compress('large string')
=> "\x1F\x8B\b\x00yq5c\x00\x03..."
original_log = ActiveSupport::Gzip.decompress(compressed_log)
=> "large string"
Behind the scenes, the compress method uses the Zlib::GzipWriter class which writes gzipped files. Similarly, the decompress method uses Zlib::GzipReader class which reads a gzipped file.

Salt a key for secure encryption Cocoa?

I was reading a tutorial on how to salt a key to make your encryption secure, but couldn't make much of it. I don't know a lot about cryptography, and need some help. I am using commoncrypto to encrypt files, and am done, except for the fact that it isn't secure...
This is what I have:
- (NSData *)AES256EncryptWithKey:(NSString *)key
{
// 'key' should be 32 bytes for AES256, will be null-padded otherwise
char keyPtr[kCCKeySizeAES256 + 1]; // room for terminator (unused)
bzero( keyPtr, sizeof( keyPtr ) ); // fill with zeroes (for padding)
NSLog(#"You are encrypting something...");
// fetch key data
[key getCString:keyPtr maxLength:sizeof( keyPtr ) encoding:NSUTF8StringEncoding];
NSUInteger dataLength = [self length];
//See the doc: For block ciphers, the output size will always be less than or
//equal to the input size plus the size of one block.
//That's why we need to add the size of one block here
size_t bufferSize = dataLength + kCCBlockSizeAES128;
void *buffer = malloc( bufferSize );
size_t numBytesEncrypted = 0;
CCCryptorStatus cryptStatus = CCCrypt( kCCEncrypt, kCCAlgorithmAES128, kCCOptionPKCS7Padding,
keyPtr, kCCKeySizeAES256,
NULL /* initialization vector (optional) */,
[self bytes], dataLength, /* input */
buffer, bufferSize, /* output */
&numBytesEncrypted );
if( cryptStatus == kCCSuccess )
{
//the returned NSData takes ownership of the buffer and will free it on deallocation
return [NSData dataWithBytesNoCopy:buffer length:numBytesEncrypted];
}
free( buffer ); //free the buffer
return nil;
}
If someone can help me out, and show me exactly how I would implement salt, that would be great! Thanks again!
dYhG9pQ1qyJfIxfs2guVoU7jr9oniR2GF8MbC9mi
Enciphering text
AKA jumbling it around, to try and make it indecipherable. This is the game you play in cryptography. To do this, you use deterministic functions.
Encrypting involves using a function which takes two parameters: usually a short, fixed length one, and an arbitrary length one. It produces output the same size as the second parameter.
We call the first parameter the key; the second, the plaintext; and the output, the ciphertext.
This will have an inverse function (which is sometimes the same one), which has the same signature, but given instead ciphertext will return the plaintext (when using the same key).
Obviously the property of a good encryption function is that the plaintext is not easily determinable from the ciphertext, without knowing the key. An even better one will produce ciphertext that is indistinguishable from random noise.
Hashing involves a function which takes one parameter, of arbitrary size, and returns an output of fixed size. Here, the goal is that given a particular output, it should be hard to find any input that will produce it. It is a one-way function, so it has no inverse. Again, it's awesome if the output looks completely random.
The problem with determinism
The above is all very well and good, but we have a problem with our ultimate goals of indecipherability when designing implementations of these functions: they're deterministic! That's no good for producing random output.
While we can design functions that still produce very random-looking output, thanks to confusion and diffusion, they're still going to give the same output given the same input. We both need this, and don't like it. We would never be able to decipher anything with a non-deterministic crypto system, but we don't like repeatable results! Repeatable means analysable... determinable (huh.). We don't want the enemy to see the same two ciphertexts and know that they came from the same input, that would be giving them information (and useful techniques for breaking crypto-systems, like rainbow tables). How do we solve this problem?
Enter: some random stuff inserted at the start.
That's how we defeat it! We prepend (or sometimes better, append), some unique random input with our actual input, everytime we use our functions. This makes our deterministic functions give different output even when we give the same input. We send the unique random input (when hashing, called a salt; when encrypting, called an Initialisation Vector, or IV) along with the ciphertext. It's not important whether the enemy sees this random input; our real input is already protected by our key (or the one-way hash). All that we were actually worried about is that our output is different all the time, so that it's non-analysable; and we've achieved this.
How do I apply this knowledge?
OK. So everybody has their app, and within it their cryptosystem protecting parts of the app.
Now we don't want to go reinventing the wheel with cryptosystems (Worst. Idea. Ever.), so some really knowledgable people have already come up with good components that can build any system (i.e, AES, RSA, SHA2, HMAC, PBKDF2). But if everyone is using the same components, then that still introduces some repeatability! Fortunately, if everyone uses different keys, and unique initial random inputs, in their own cryptosytem, they should be fine.
Enough already! Talk about implementation!
Let's talk about your example. You're wanting to do some simple encryption. What do we want for that? Well, A) we want a good random key, and B) we want a good random IV. This will make our ciphertext as secure as it can get. I can see you haven't supplied a random IV - it's better practice to do so. Get some bytes from a [secure/crypto]-random source, and chuck it in. You store/send those bytes along with the ciphertext. Yes, this does mean that the ciphertext is a constant length bigger than the plaintext, but it's a small price to pay.
Now what about that key? Sometimes we want a remember-able key (like.. a password), rather than a nice random one that computers like (if you have the option to just use a random key - do that instead). Can we get a compromise? Yes! Should we translate ASCII character passwords into bytes to make the key? HELL NO!
ASCII characters aren't very random at all (heck, they generally only use about 6-7 bits out of 8). If anything, what we want to do is make our key at least look random. How do we do this? Well, hashing happens to be good for this. What if we want to reuse our key? We'll get the same hash... repeatability again!
Luckily, we use the other form of unique random input - a salt. Make a unique random salt, and append that to your key. Then hash it. Then use the bytes to encrypt your data. Add the salt AND the IV along with your ciphertext when you send it, and you should be able to decrypt on the end.
Almost done? NO! You see the hashing solution I described in the paragraph above? Real cryptographers would call it amateurish. Would you trust a system which is amateurish? No! Am I going to discuss why it's amateurish? No, 'cus you don't need to know. Basically, it's just not REALLY-SUPER-SCRAMBLED enough for their liking.
What you need to know is that they've already devised a better system for this very problem. It's called PBKDF2. Find an implementation of it, and [learn to] use that instead.
Now all your data is secure.
Salting just involves adding a random string to the end of the input key.
So generate a random string of some length:
Generate a random alphanumeric string in cocoa
And then just append it to the key using:
NSString *saltedKey = [key stringByAppendingString:salt];
Unless salt is being used in a different way in the article you read this should be correct.
How a random salt is normally used:
#Ca1icoJack is completely correct in saying that all you have to do is generate some random data and append it to the end. The data is usually binary as opposed to alphanumeric though. The salt is then stored unencrypted alongside each hashed password, and gets concatenated with the user's plaintext password in order to check the hash every time the password gets entered.
What the is the point of a SALT if it's stored unencrypted next to the hashed password?
Suppose someone gets access to your hashed passwords. Human chosen passwords are fairly vulnerable to being discovered via rainbow tables. Adding a hash means the rainbow table needs to not only include the values having any possible combination of alphanumeric characters a person might use, but also the random binary salt, which is fairly impractical at this point in time. So, basically adding a salt means that a brute force attacker who has access to both the hashed password and the salt needs to both figure how how the salt was concatenated to the password (before or after, normally) and brute force each password individually, since readily available rainbow tables don't include any random binary data.
Edit: But I said encrypted, not hashed:
Okay, I didn't read very carefully, ignore me. Someone is going to have to brute-force the key whether it's salted or not with encryption. The only discernable benefit I can see would be as that article says to avoid having the same key (from the user's perspective) used to encrypt the same data produce the same result. That is useful in encryption for different reasons (encrypted messages tend to have repeating parts which could be used to help break the encryption more easily) but the commenters are correct in noting that it is normally not called an salt in this instance.
Regardless, the trick is to concatenate the salt, and store it alongside each bit of encrypted data.

Silverlight: Encoding a webClient stream

I've been trying to get this to work, but I'm very frustrated at this point. I am a beginner in this field, so maybe I'm just making mistakes.
What I need to do is to take in a website .html and store it into a txt file. Now the problem is that this website is in Russian (encoding windows-1251) and Silverlight only supports 3 encodings. So in order to bypass that limitation, I got my hands on an encoding class that transfers the stream into a byte array and then tries to pull the correctly encoded string from the text. The problem with this is that
1) I try to ensure that webClient recieves a Unicode encoded stream, because the other ones do not seem to create a retrievable string, but it still doesn't seem to work.
WebClient wc = new WebClient();
wc.Encoding = System.Text.Encoding.Unicode;
wc.DownloadStringCompleted += new DownloadStringCompletedEventHandler(wc_LoadCompleted);
wc.DownloadStringAsync(new Uri(site));
2) I fear that when I store the html into a txt file using streamWriter, the encoding is, yet again, somehow screwed up.
3) The encoding class is not doing its job.
Encoding rus = Encoding.GetEncoding(1251);
Encoding eng = Encoding.Unicode;
byte[] bytes = rus.GetBytes(string);
textBlock1.Text = eng.GetString(bytes);
Can anyone offer any help on this matter? This huge detriment to my project. Thanks in advance,
Since you want to handle an encoding alien to Silverlight you should start with downloading using OpenReadAsync and OpenReadCompleted.
Now you should be able to take the Stream provided by the event args Result property and supply it directly to the encoding component you have acquired to generate the correct string result.

Ruby on Rails - generating bit.ly style identifiers

I'm trying to generate UUIDs with the same style as bit.ly urls like:
http://bit [dot] ly/aUekJP
or cloudapp ones:
http://cl [dot] ly/1hVU
which are even smaller
how can I do it?
I'm now using UUID gem for ruby but I'm not sure if it's possible to limitate the length and get something like this.
I am currently using this:
UUID.generate.split("-")[0] => b9386070
But I would like to have even smaller and knowing that it will be unique.
Any help would be pretty much appreciated :)
edit note: replaced dot letters with [dot] for workaround of banned short link
You are confusing two different things here. A UUID is a universally unique identifier. It has a very high probability of being unique even if millions of them were being created all over the world at the same time. It is generally displayed as a 36 digit string. You can not chop off the first 8 characters and expect it to be unique.
Bitly, tinyurl et-al store links and generate a short code to represent that link. They do not reconstruct the URL from the code they look it up in a data-store and return the corresponding URL. These are not UUIDS.
Without knowing your application it is hard to advise on what method you should use, however you could store whatever you are pointing at in a data-store with a numeric key and then rebase the key to base32 using the 10 digits and 22 lowercase letters, perhaps avoiding the obvious typo problems like 'o' 'i' 'l' etc
EDIT
On further investigation there is a Ruby base32 gem available that implements Douglas Crockford's Base 32 implementation
A 5 character Base32 string can represent over 33 million integers and a 6 digit string over a billion.
If you are working with numbers, you can use the built in ruby methods
6175601989.to_s(30)
=> "8e45ttj"
to go back
"8e45ttj".to_i(30)
=>6175601989
So you don't have to store anything, you can always decode an incoming short_code.
This works ok for proof of concept, but you aren't able to avoid ambiguous characters like: 1lji0o. If you are just looking to use the code to obfuscate database record IDs, this will work fine. In general, short codes are supposed to be easy to remember and transfer from one medium to another, like reading it on someone's presentation slide, or hearing it over the phone. If you need to avoid characters that are hard to read or hard to 'hear', you might need to switch to a process where you generate an acceptable code, and store it.
I found this to be short and reliable:
def create_uuid(prefix=nil)
time = (Time.now.to_f * 10_000_000).to_i
jitter = rand(10_000_000)
key = "#{jitter}#{time}".to_i.to_s(36)
[prefix, key].compact.join('_')
end
This spits out unique keys that look like this: '3qaishe3gpp07w2m'
Reduce the 'jitter' size to reduce the key size.
Caveat:
This is not guaranteed unique (use SecureRandom.uuid for that), but it is highly reliable:
10_000_000.times.map {create_uuid}.uniq.length == 10_000_000
The only way to guarantee uniqueness is to keep a global count and increment it for each use: 0000, 0001, etc.

Resources