public static String sha(String base) {
try{
MessageDigest digest = MessageDigest.getInstance("SHA-512");
byte[] hash = digest.digest(base.getBytes("UTF-8"));
StringBuffer hexString = new StringBuffer();
for (int i = 0; i < hash.length; i++) {
String hex = Integer.toHexString(0xff & hash[i]);
if(hex.length() == 1) hexString.append('0');
hexString.append(hex);
}
return hexString.toString();
} catch(Exception ex){
throw new RuntimeException(ex);
}
}
The method above generates a string that does not start with $6$.
For example:
"0000" --> c6001d5b2ac3df314204a8f9d7a00e1503c9aba0fd4538645de4bf4cc7e2555cfe9ff9d0236bf327ed3e907849a98df4d330c4bea551017d465b4c1d9b80bcb0
However, we know that the first two or three characters indicate the hashing algorithm,
e.g.:
Blowfish --> $2$ or $2a$
SHA-512 --> $6$
Is there a difference between encoding and hashing, or what is the story?
UPDATE:
The Linux crypt command line generates a string with 86 chars, whereas Java 8 generates 128 chars.
#man 3 crypt
MD5 | 22 characters
SHA-256 | 43 characters
SHA-512 | 86 characters
It’s not clear why you expect a Java API function to produce the output of a GNU extension of a POSIX API function. Besides that, as you are generating the string in the Java code, you are the one who controls the size of the generated string.
The byte array returned by the SHA-512 digest has exactly 64 bytes, which shouldn’t come as a surprise, as 64 (bytes) * 8 == 512 (bits). This is what an SHA-512 hash is all about. It was your decision to encode it in hexadecimal form, which needs one character for four bits, hence 512/4 == 128 characters.
By the way, even for that functionality, it’s worth getting used to the existing APIs, Java programming language features and coding conventions:
public static String sha(String base) {
try {
MessageDigest digest = MessageDigest.getInstance("SHA-512");
StringBuilder hexString = new StringBuilder();
for(byte b: digest.digest(base.getBytes(StandardCharsets.UTF_8)))
hexString.append(String.format("%02x", b&0xff));
return hexString.toString();
} catch(NoSuchAlgorithmException ex){
throw new RuntimeException(ex);
}
}
However, if the crypt C function generates 86 characters, it’s most likely using a Base64 encoding, which needs one character for six bits; as 512 / 6 == 85.33…, that makes at least 86 chars.
If you want to encode using Base64, you can use a standard API starting with Java 8; for older versions you need either a 3rd-party library or to implement it yourself.
public static String sha(String base) {
try{
MessageDigest digest = MessageDigest.getInstance("SHA-512");
byte[] hash = digest.digest(base.getBytes(StandardCharsets.UTF_8));
return Base64.getEncoder().encodeToString(hash);
} catch(NoSuchAlgorithmException ex){
throw new RuntimeException(ex);
}
}
Note that the encoded string will have 88 characters, as the standard Base64 format uses padding, i.e. there will always be two = characters at the end here. If you know from context (like with SHA-512) that it has to be 512 bits, you can omit them when storing the result:
public static String sha(String base) {
try{
MessageDigest digest = MessageDigest.getInstance("SHA-512");
byte[] hash = digest.digest(base.getBytes(StandardCharsets.UTF_8));
String encoded = Base64.getEncoder().encodeToString(hash);
assert encoded.length()==88 && encoded.endsWith("==");
return encoded.substring(0, 86);
} catch(NoSuchAlgorithmException ex){
throw new RuntimeException(ex);
}
}
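As a side note (a hedged variant of the snippet above, assuming Java 8's java.util.Base64 and the same imports), the encoder itself can be asked to drop the padding, which avoids the substring step:
public static String sha(String base) {
    try {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        byte[] hash = digest.digest(base.getBytes(StandardCharsets.UTF_8));
        // withoutPadding() omits the trailing "==", yielding 86 characters directly
        return Base64.getEncoder().withoutPadding().encodeToString(hash);
    } catch(NoSuchAlgorithmException ex){
        throw new RuntimeException(ex);
    }
}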
crypt(3) with $6$ is a password hashing scheme that is merely based on SHA-512, not a plain SHA-512 digest.
You can find a Java implementation in the Apache Commons Codec project:
https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/digest/Crypt.html
https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/digest/Sha2Crypt.html
The algorithm actually feeds the result of the first SHA-512 digest back into the digest several thousand times in order to be deliberately slow, as this makes brute-force cracking of the password harder.
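For illustration (a minimal sketch, assuming the commons-codec artifact is on the classpath), Crypt.crypt defaults to the SHA-512 based scheme, so its output carries the $6$ prefix the question expects:
import org.apache.commons.codec.digest.Crypt;

public class CryptDemo {
    public static void main(String[] args) {
        // With no explicit salt, commons-codec generates a random one and uses
        // the SHA-512 based scheme, i.e. the result starts with "$6$".
        String hashed = Crypt.crypt("0000");
        System.out.println(hashed);

        // Passing the stored value back as the salt reproduces the hash,
        // which is how a crypt-style password check works.
        System.out.println(Crypt.crypt("0000", hashed).equals(hashed)); // true
    }
}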
Related
I have a function that needs a random sequence of bytes as input (e.g. a salt for hashing a password). I generate those bytes using a CSPRNG function and then encode them to Base64.
Now I pass that string to the function that needs it, but that function works with bytes, so if it receives a string it turns it into a byte buffer by reading the string as UTF-8. The input is then no longer the same sequence of bytes generated by the CSPRNG, but the UTF-8 encoding of the Base64 string of those random bytes. So if I generate N bytes, the encoding transformations turn them into 4/3*N bytes. Can I assume that these expanded bytes are still random after the transformations? Are there any security implications?
Here's a pseudo code to make it more clear:
function needsRandBytes(rand) {
if (typeof rand == 'string') {
rand = Buffer.from(rand, 'utf8'); // here's the expansion
}
// use the rand bytes...
}
randBytes = generateRandomBytes(N); // cryptographically secure function
randString = randBytes.toString('base64');
needsRandBytes(randString);
I want to access data in a database created by Rails for use by non-Ruby code. Some fields use attr_encrypted accessors, and the library in use is the symmetric-encryption gem. I consistently get a "wrong final block length" error if I try to decrypt the data with, e.g., the NodeJS crypto library.
I suspect this has to do either with character encoding or with padding, but I can't figure it out based on the docs.
As an experiment, I tried decrypting data from symmetric-encryption in Ruby's own OpenSSL library, and I get either a "bad decrypt" error or the same problem:
SymmetricEncryption.cipher = SymmetricEncryption::Cipher.new(
key: "1234567890ABCDEF",
iv: "1234567890ABCDEF",
cipher_name: "aes-128-cbc"
)
ciphertext = SymmetricEncryption.encrypt("Hello world")
c = OpenSSL::Cipher.new("aes-128-cbc")
c.iv = c.key = "1234567890ABCDEF"
c.update(ciphertext) + c.final
That gives me a "bad decrypt" error.
Interestingly, the encrypted data in the database can be decrypted by the symmetric-encryption gem, but isn't the same as the output of SymmetricEncryption.encrypt (and OpenSSL doesn't successfully decrypt it, either).
Edit:
psql=# SELECT "encrypted_firstName" FROM people LIMIT 1;
encrypted_firstName
----------------------------------------------------------
QEVuQwBAEAAuR5vRj/iFbaEsXKtpjubrWgyEhK5Pji2EWPDPoT4CyQ==
(1 row)
Then
irb> SymmetricEncryption.decrypt "QEVuQwBAEAAuR5vRj/iFbaEsXKtpjubrWgyEhK5Pji2EWPDPoT4CyQ=="
=> "Lurline"
irb> SymmetricEncryption.encrypt "Lurline"
=> "QEVuQwAAlRBeYptjK0Fg76jFQkjLtA=="
Looking at the source for the symmetric-encryption gem, by default it adds a header to the output and base64 encodes it, although both of these are configurable.
To decrypt using Ruby’s OpenSSL directly, you will need to decode it and strip off this header, which is 6 bytes long in this simple case:
ciphertext = Base64.decode64(ciphertext)
ciphertext = ciphertext[6..-1]
c = OpenSSL::Cipher.new("aes-128-cbc")
c.decrypt
c.iv = "1234567890ABCDEF"
c.key = "1234567890ABCDEF"
result = c.update(ciphertext) + c.final
Of course, you may need to alter this depending on what settings you are using in symmetric-encryption, e.g. the header length may vary. In order to decrypt the result from the database you will need to parse the header. Have a look at the source.
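For a non-Ruby consumer the same steps translate directly. Here is a hedged Java sketch (assuming the fixed 6-byte header of this simple case and the demo key/IV from the question, i.e. AES-128-CBC with PKCS5 padding; the ciphertext is the "Hello world" example used further below):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class SymmetricEncryptionHeaderDemo {
    public static void main(String[] args) throws Exception {
        byte[] raw = Base64.getDecoder().decode("QEVuQwAADWK0cKzgFIovdIThq9Scrg==");
        // Skip the "@EnC" magic, version and flags bytes (6 bytes when no IV/key/cipher name is embedded)
        byte[] ciphertext = Arrays.copyOfRange(raw, 6, raw.length);

        byte[] keyAndIv = "1234567890ABCDEF".getBytes(StandardCharsets.US_ASCII);
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE,
                new SecretKeySpec(keyAndIv, "AES"),
                new IvParameterSpec(keyAndIv));

        System.out.println(new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8)); // "Hello world"
    }
}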
Based on the Rust implementation done by @Shepmaster in my other question (and the source code for the symmetric-encryption gem), I have a working version in TypeScript. @matt is close with his answer, but the header can actually have additional bytes containing metadata about the encrypted data. Note that this doesn't handle (1) compressed encrypted data, or (2) setting the encryption algorithm from the header itself; neither situation is relevant to my use case.
import { createDecipher, createDecipheriv, Decipher } from "crypto";
// We use two types of encoding with SymmetricEncryption: Base64 and UTF-8. We
// define them in an `enum` for type safety.
const enum Encoding {
Base64 = "base64",
Utf8 = "utf8",
}
// Symmetric encryption's header contains the following data:
interface IHeader {
version: number, // The version of the encryption algo
isCompressed: boolean, // Whether the data is compressed (TODO: Implement)
hasIv: boolean, // Whether the header itself has the IV
hasKey: boolean, // Whether the header itself has the Key
hasCipherName: boolean, // Whether the header contains the cipher name
hasAuthTag: boolean, // Whether the header has an authorization tag
offset: number, // How many bytes into the encoded ciphertext the actual encrypted data starts
iv?: Buffer, // The IV, present only if `hasIv` is true
key?: Buffer, // The key, present only if `hasKey` is true
// The cipher name, present only if `hasCipherName` is true. Currently ignored.
cipherName?: string,
authTag?: string, // The authorization tag, present only if `hasAuthTag` is true
}
// Byte 6 of the header contains bit flags
interface IFlags {
isCompressed: boolean,
hasIv: boolean,
hasKey: boolean,
hasCipherName: boolean,
hasAuthTag: boolean
}
// The 7th byte through the end of the header holds the actual values. If all
// of the flags are false, the header ends at the 6th byte.
interface IValues {
iv?: Buffer,
key?: Buffer,
cipherName?: string,
authTag?: string,
size: number,
}
/**
* Represent the encoded ciphertext, complete with the SymmetricEncryption header.
*/
class Ciphertext {
// Bit flags corresponding to the data encoded in byte 6 of the
// header.
readonly FLAG_COMPRESSED = 0b1000_0000;
readonly FLAG_IV = 0b0100_0000;
readonly FLAG_KEY = 0b0010_0000;
readonly FLAG_CIPHER_NAME = 0b0001_0000;
readonly FLAG_AUTH_TAG = 0b0000_1000;
// The literal data encoded in bytes 1 - 4 of the header
readonly MAGIC_HEADER = "#EnC";
// If any of the values represented by the bit flags is present, the first 2
// bytes of the data tells us how long the actual value is. In other words,
// the first 2 bytes aren't the value itself, but rather give the info about
// the length of the rest of the value.
readonly LENGTH_INFO_SIZE = 2;
public header: IHeader | null;
public data: Buffer;
private cipherBuffer: Buffer;
constructor(private input: string) {
this.cipherBuffer = new Buffer(input, Encoding.Base64);
this.header = this.getHeader();
const offset = this.header ? this.header.offset : 0; // If no header, then no offset
this.data = this.cipherBuffer.slice(offset);
}
/**
* Extract the header from the data
*/
private getHeader(): IHeader | null {
let offset = 0;
// Bytes 1 - 4 are the literal `#EnC`. If that's absent, there's no
// SymmetricEncryption header.
if (this.cipherBuffer.toString(Encoding.Utf8, offset, offset += 4) != this.MAGIC_HEADER) {
return null;
}
// Byte 5 is the version
const version = this.cipherBuffer.readInt8(offset++); // Post increment
// Byte 6 is the flags
const rawFlags = this.cipherBuffer.readInt8(offset++);
const flags = this.readFlags(rawFlags);
// Bytes 7 - end are the values.
const values = this.getValues(offset, flags);
offset += values.size;
return Object.assign({ version, offset }, flags, values);
}
/**
* Get the values for `iv`, `key`, `cipherName`, and `authTag`, if any are
* set, based on the bitflags. Return that data, plus how many bytes in the
* header those values represent.
*
* @param offset - What byte we're on when we get to the values. Should be 7
* @param flags - The flags we've extracted, showing us which values to expect
*/
private getValues(offset: number, flags: IFlags): IValues {
let iv: Buffer | undefined = undefined;
let key: Buffer | undefined = undefined;
let cipherName: string | undefined = undefined;
let authTag: string | undefined = undefined;
let size = 0; // If all of the bit flags are false, there is no additional data.
// For each value, see if the flag is set to true. If it is, we need to
// read the value. Keys and IVs need to be `Buffer` types; other values
// should be strings.
[iv, size] = flags.hasIv ? this.readBuffer(offset) : [undefined, size];
[key, size] = flags.hasKey ? this.readBuffer(offset + size) : [undefined, size];
[cipherName, size] = flags.hasCipherName ? this.readString(offset + size) : [undefined, size];
[authTag, size] = flags.hasAuthTag ? this.readString(offset + size) : [undefined, size];
return { iv, key, cipherName, authTag, size };
}
/**
* Parse the byte representing the bit flags into an object for
* easier handling
*
* @param flags - The flags byte, read as an 8-bit integer
*/
private readFlags(flags: number): IFlags {
return {
isCompressed: (flags & this.FLAG_COMPRESSED) != 0,
hasIv: (flags & this.FLAG_IV) != 0,
hasKey: (flags & this.FLAG_KEY) != 0,
hasCipherName: (flags & this.FLAG_CIPHER_NAME) != 0,
hasAuthTag: (flags & this.FLAG_AUTH_TAG) != 0
}
}
/**
* Read a string out of the value at the specified offset. Return the value
* itself, plus the number of bytes consumed by the value (including the
* 2-byte encoding of the length of the actual value).
*
* @param offset - The offset (bytes from the beginning of the encoded,
* encrypted Buffer) at which the value in question begins
*/
private readString(offset: number): [string, number] {
// The length is the first 2 bytes, encoded as a little-endian 16-bit integer
const length = this.cipherBuffer.readInt16LE(offset);
// The total size occupied in the header is the 2 bytes encoding length plus the length itself
const size = this.LENGTH_INFO_SIZE + length;
const value = this.cipherBuffer.toString(Encoding.Base64, offset + this.LENGTH_INFO_SIZE, offset + size);
return [value, size];
}
/**
* Read a Buffer out of the value at the specified offset. Return the value
* itself, plus the number of bytes consumed by the value (including the
* 2-byte encoding of the length of the actual value).
*
* @param offset - The offset (bytes from the beginning of the encoded,
* encrypted Buffer) at which the value in question begins
*/
private readBuffer(offset: number): [Buffer, number] {
// The length is the first 2 bytes, encoded as a little-endian 16-bit integer
const length = this.cipherBuffer.readInt16LE(offset);
// The total size occupied in the header is the 2 bytes encoding length plus the length itself
const size = this.LENGTH_INFO_SIZE + length;
const value = this.cipherBuffer.slice(offset + this.LENGTH_INFO_SIZE, offset + size);
return [value, size];
}
}
/**
* Allow decryption of data encrypted by Ruby's `symmetric-encryption` gem
*/
class SymmetricEncryption {
private key: Buffer;
private iv?: Buffer;
constructor(key: string, private algo: string, iv?: string) {
this.key = new Buffer(key);
this.iv = iv ? new Buffer(iv) : undefined;
}
public decrypt(input: string): string {
const ciphertext = new Ciphertext(input);
// IV can be specified by the user. But if it's encoded in the header
// itself, go with that instead.
const iv = (ciphertext.header && ciphertext.header.iv) ? ciphertext.header.iv : this.iv;
// Key can be specified by the user. but if it's encoded in the header,
// go with that instead.
const key = (ciphertext.header && ciphertext.header.key) ? ciphertext.header.key : this.key;
const decipher: Decipher = iv ?
createDecipheriv(this.algo, key, iv) :
createDecipher(this.algo, key);
// Terse version of `update()` + `final()` that passes type checking
return Buffer.concat([decipher.update(ciphertext.data), decipher.final()]).toString();
}
}
const s = new SymmetricEncryption("1234567890ABCDEF", "aes-128-cbc", "1234567890ABCDEF");
console.log(s.decrypt("QEVuQwAADWK0cKzgFIovdIThq9Scrg==")); // => "Hello world"
This is a homework question that I can't get my head around at all.
It's a very simple encryption algorithm. You start with a string of characters as your alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!, .
Then you ask the user to enter their own string that will act as a map, such as:
0987654321! .,POIUYTREWQASDFGHJKLMNBVCXZ
Then the program uses this to make a map and allows you to enter text that gets encrypted.
For example MY NAME IS JOSEPH would be encrypted as .AX,0.6X2YX1PY6O3
This is all very easy; however, he said that it's a one-to-one mapping, and thus implied that if I enter .AX,0.6X2YX1PY6O3 back into the program I will get MY NAME IS JOSEPH out.
This doesn't happen, because .AX,0.6X2YX1PY6O3 becomes Z0QCDZQGAQFOALDH
The mapping only works to decrypt when you go backwards, but the question implies that the program just loops and runs the one algorithm every time.
Even if someone could just tell me that it is possible, I would be happy. I have pages and pages of paper filled with possible workings, but I came up with nothing; the only solution I see is to run the algorithm backwards, and I don't think we are allowed to do that.
Any ideas?
Edit:
Unfortunately I can't get this to work (using the orbit computation idea). What am I doing wrong?
//import scanner class
import java.util.Scanner;
public class Encryption {
static Scanner inputString = new Scanner(System.in);
//define alphabet
private static String alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!, .";
private static String map;
private static int[] encryptionMap = new int[40];//mapping int array
private static boolean exit = false;
private static boolean valid = true;
public static void main(String[] args) {
String encrypt, userInput;
userInput = new String();
System.out.println("This program takes a large reordered string");
System.out.println("and uses it to encrypt your data");
System.out.println("Please enter a mapping string of 40 length and the same characters as below but in different order:");
System.out.println(alpha);
//getMap();//don't get user input for map, for testing!
map=".ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!, ";//forced input for testing only!
do{
if (valid == true){
System.out.println("Enter Q to quit, otherwise enter a string:");
userInput = getInput();
if (userInput.charAt(0) != 'Q' ){//&& userInput.length()<2){
encrypt = encrypt(userInput);
for (int x=0; x<39; x++){//here I am trying to get the orbit computation going
encrypt = encrypt(encrypt);
}
System.out.println("You entered: "+userInput);
System.out.println("Encrypted Version: "+encrypt);
}else if (userInput.charAt(0) == 'Q'){//&& userInput.length()<2){
exit = true;
}
}
else if (valid == false){
System.out.println("Error, your string for mapping is incorrect");
valid = true;//reset condition to repeat
}
}while(exit == false);
System.out.println("Good bye");
}
static String encrypt(String userInput){
//use mapping array to encypt data
String encrypt;
StringBuffer tmp = new StringBuffer();
char current;
int alphaPosition;
int temp;
//run through the user string
for (int x=0; x<userInput.length(); x++){
//get character
current = userInput.charAt(x);
//get location of current character in alphabet
alphaPosition = alpha.indexOf(current);
//encryptionMap.charAt(alphaPosition)
tmp.append(map.charAt(alphaPosition));
}
encrypt = tmp.toString();
return(encrypt);
}
static void getMap(){
//get a mapping string and validate from the user
map = getInput();
//validate code
if (map.length() != 40){
valid = false;
}
else{
for (int x=0; x<40; x++){
if (map.indexOf(alpha.charAt(x)) == -1){
valid = false;
}
}
}
if (valid == true){
for (int x=0; x<40; x++){
int a = (int)(alpha.charAt(x));
int y = (int)( map.charAt(x));
//create encryption map
encryptionMap[x]=(a-y);
}
}
}
static String getInput(){
//get input(this repeats)
String input = inputString.nextLine();
input = input.toUpperCase();
if ("QUIT".equals(input) || "END".equals(input) || "NO".equals(input) || "N".equals(input)){
StringBuffer tmp = new StringBuffer();
tmp.append('Q');
input = tmp.toString();
}
return(input);
}
}
You will (probably) not get your original string back if you apply that substitution again. I say probably because you can construct such inputs (they all do things like if A->B then B->A). But most inputs won't do that. You would have to construct the reverse map to decrypt.
However, there is a trick you can do if you're only allowed to go forward. Keep applying the mapping and you'll eventually return to your original input. The number of times you'll have to do that depends on your input. To figure out how many times, compute the orbit of each character, and take the least common multiple of all the orbit sizes. For your input the orbits are size 1 (T->T, W->W), 2 (B->9->B H->3->H U->R->U P->O->P), 4 (C->8->N->,->C), 9 (A->...->Y->A), and 17 (E->...->V->E). The LCM of all those is 612, so 611 forward mappings applied to the ciphertext will return you to the plaintext.
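As a rough illustration of that orbit idea (a hedged sketch, not the poster's program; the alphabet and map string are taken from the question, the orbit/LCM computation is written from scratch), you can compute the number of applications that acts as the identity and then apply one fewer to the ciphertext:
import java.math.BigInteger;

public class OrbitDecrypt {
    static final String ALPHA = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!, .";
    static final String MAP   = "0987654321! .,POIUYTREWQASDFGHJKLMNBVCXZ";

    // Apply the substitution once
    static String encrypt(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) sb.append(MAP.charAt(ALPHA.indexOf(c)));
        return sb.toString();
    }

    // Size of the orbit of a single character under the mapping
    static int orbitSize(char c) {
        int size = 0;
        char cur = c;
        do {
            cur = MAP.charAt(ALPHA.indexOf(cur));
            size++;
        } while (cur != c);
        return size;
    }

    public static void main(String[] args) {
        // The LCM of all orbit sizes is the number of applications that restores the input
        BigInteger lcm = BigInteger.ONE;
        for (char c : ALPHA.toCharArray()) {
            BigInteger o = BigInteger.valueOf(orbitSize(c));
            lcm = lcm.divide(lcm.gcd(o)).multiply(o);
        }
        System.out.println("Identity after " + lcm + " applications"); // 612 for this map

        String cipher = encrypt("MY NAME IS JOSEPH");   // one forward application
        String plain = cipher;
        for (int i = 0; i < lcm.intValue() - 1; i++) {  // 611 more bring it back
            plain = encrypt(plain);
        }
        System.out.println(plain);                      // MY NAME IS JOSEPH
    }
}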
Well, you can get your string back this way only if you do the reverse mapping. One-to-one mapping means that a single letter of your default alphabet maps to only one letter of your new alphabet and vice versa, i.e. you can't map ABCD to ABBA. It doesn't imply that you can get your initial string back by doing a second round of encryption.
The thing you have described can be achieved if you use a finite alphabet and a displacement to encode your string. You can choose the displacement in such a way that after a number of rounds of encryption totalDisplacement mod alphabetSize == 0. Then you will get your string back going only forward.
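To make that concrete (a small hedged sketch, not part of the assignment: a plain shift cipher over the question's 40-character alphabet), a displacement of 8 comes back to the original after 5 rounds, because 5 * 8 mod 40 == 0:
public class ShiftCipher {
    static final String ALPHA = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!, .";
    static final int SHIFT = 8; // 40 / gcd(8, 40) = 5 rounds restore the input

    static String shift(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(ALPHA.charAt((ALPHA.indexOf(c) + SHIFT) % ALPHA.length()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "MY NAME IS JOSEPH";
        for (int round = 1; round <= 5; round++) {
            text = shift(text);
            System.out.println(round + ": " + text);
        }
        // After round 5 the total displacement is 40, i.e. 0 mod 40,
        // so the last line printed is the original plaintext again.
    }
}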
I apologize for creating a similar thread to many that are out there now, but I mainly wanted to also get some insight on some methods.
I have a list of Strings (could be just one or over a thousand).
Format = XXX-XXXXX-XX where each one is alphanumeric
I am trying to generate a unique string (currently 18 characters long, but it could probably be longer as long as it doesn't run into file name or path length limits) that I could reproduce if I have that same list. Order doesn't matter, although I may be interested if it's easier to restrict the order as well.
My current Java code follows (it failed today, hence why I am here):
public String createOutputFileName(ArrayList<String> alInput, EnumFPFunction efpf, boolean pHeaders) {
    /* create file name based on input list */
    String sFileName = "";
    long partNum = 0;
    for (String sGPN : alInput) {
        sGPN = sGPN.replaceAll("-", ""); //remove dashes
        partNum += Long.parseLong(sGPN, 36); //(base 36)
    }
    sFileName = Long.toString(partNum);
    if (sFileName.length() > 18) {
        sFileName = sFileName.substring(0, 18); //keep at most 18 characters
    }
    return sFileName;
}
So obviously, as I found out, just adding them did not work out so well (I also think I should take the last 18 digits rather than the first 18).
Are there any good methods out there (possibly CRC related) that would work?
To assist with my key creation:
The first 3 characters are almost always numeric and would probably have many duplicates (out of 100, there may only be 10 different starting numbers).
These characters are not allowed: I, O.
In the last two-character subset there will never be a letter followed by a number.
I would use the system time. Here's how you might do it in Java:
public String createOutputFileName() {
long mills = System.currentTimeMillis();
long nanos = System.nanoTime();
return mills + " " + nanos;
}
If you want to add some information about the items and their part numbers, you can, of course!
======== EDIT: "What do I mean by batch object" =========
class Batch {
ArrayList<Item> itemsToProcess;
String inputFilename; // input to external process
boolean processingFinished;
public Batch(ArrayList<Item> itemsToProcess) {
this.itemsToProcess = itemsToProcess;
inputFilename = null;
processingFinished = false;
}
public void processWithExternal() throws IOException, InterruptedException {
if(inputFilename != null || processingFinished) {
throw new IllegalStateException("Cannot initiate process more than once!");
}
String base = System.currentTimeMillis() + " " + System.nanoTime();
this.inputFilename = base + "_input";
writeItemsToFile();
// however you build your process, do it here
Process p = new ProcessBuilder("myProcess", "myargs", inputFilename).start();
p.waitFor();
processingFinished = true;
}
private void writeItemsToFile() throws IOException {
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(inputFilename)));
int flushcount = 0;
for(Item item : itemsToProcess) {
String output = item.getFileRepresentation();
out.println(output);
if(++flushcount % 10 == 0) out.flush();
}
out.flush();
out.close();
}
}
In addition to GlowCoder's response, I have thought of another "decent one" that would work.
Instead of just adding the list in base 36, I would do two separate things to the same list.
In this case, since negative or decimal numbers cannot occur, adding every number and, separately, multiplying every number, then concatenating the two base-36 strings, isn't a bad way either.
In my case, I would take the last nine digits of the sum and the last nine of the product. This would eliminate my previous errors and make it quite robust. Errors are obviously still possible once overflow starts occurring, but it could also work in this case. Extending the allowable string length would make it more robust as well.
Sample code:
public String createOutputFileName(ArrayList<String> alInput, EnumFPFunction efpf, boolean pHeaders) {
/* create file name based on input list */
String sFileName1 = "";
String sFileName2 = "";
long partNum1 = 0; // Starting point for addition
long partNum2 = 1; // Starting point for multiplication
for (String sGPN : alInput) {
//remove dashes
sGPN = sGPN.replaceAll("-", "");
partNum1 += Long.parseLong(sGPN, 36); //(base 36)
partNum2 *= Long.parseLong(sGPN, 36); //(base 36)
}
// Initial strings
sFileName1 = "000000000" + Long.toString(partNum1, 36); // base 36
sFileName2 = "000000000" + Long.toString(partNum2, 36); // base 36
// Cropped strings
sFileName1 = sFileName1.substring(sFileName1.length()-9, sFileName1.length());
sFileName2 = sFileName2.substring(sFileName2.length()-9, sFileName2.length());
return sFileName1 + sFileName2;
}
I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I'd like to generate a compression dictionary based on an example data file and use that dictionary forever as I read and write the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. It looks fairly straightforward to modify it such that the dictionary creation is a separate block of code. So I have two questions:
Am I on the right track or is there a better way?
Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?
Thanks.
Update: Now I'm thinking the ideal case would be to find a library that lets me store the dictionary separately from the compressed data. Does anything like that exist?
Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my modifications back to the blog author, but I haven't heard back yet. The compression rates I'm seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.
Update: I converted it to 16 bits and the compression is better. It's also much faster than the original code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace Book.Core
{
public class Huffman16
{
private readonly double log2 = Math.Log(2);
private List<Node> HuffmanTree = new List<Node>();
internal class Node
{
public long Frequency { get; set; }
public byte Uncoded0 { get; set; }
public byte Uncoded1 { get; set; }
public uint Coded { get; set; }
public int CodeLength { get; set; }
public Node Left { get; set; }
public Node Right { get; set; }
public bool IsLeaf
{
get { return Left == null; }
}
public override string ToString()
{
var coded = "00000000" + Convert.ToString(Coded, 2);
return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded.Substring(coded.Length - CodeLength), Frequency);
}
}
public Huffman16(long[] frequencies)
{
if (frequencies.Length != ushort.MaxValue + 1)
{
throw new ArgumentException("frequencies.Length must equal " + (ushort.MaxValue + 1));
}
BuildTree(frequencies);
EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
}
public static long[] GetFrequencies(byte[] sampleData, bool safe)
{
if (sampleData.Length % 2 != 0)
{
throw new ArgumentException("sampleData.Length must be a multiple of 2.");
}
var histogram = new long[ushort.MaxValue + 1];
if (safe)
{
for (int i = 0; i <= ushort.MaxValue; i++)
{
histogram[i] = 1;
}
}
for (int i = 0; i < sampleData.Length; i += 2)
{
histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
}
return histogram;
}
public byte[] Encode(byte[] plainData)
{
if (plainData.Length % 2 != 0)
{
throw new ArgumentException("plainData.Length must be a multiple of 2.");
}
Int64 iBuffer = 0;
int iBufferCount = 0;
using (MemoryStream msEncodedOutput = new MemoryStream())
{
//Write Final Output Size 1st
msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);
//Begin Writing Encoded Data Stream
iBuffer = 0;
iBufferCount = 0;
for (int i = 0; i < plainData.Length; i += 2)
{
Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];
//How many bits are we adding?
iBufferCount += FoundLeaf.CodeLength;
//Shift the buffer
iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;
//Are there at least 8 bits in the buffer?
while (iBufferCount > 7)
{
//Write to output
int iBufferOutput = (int)(iBuffer >> (iBufferCount - 8));
msEncodedOutput.WriteByte((byte)iBufferOutput);
iBufferCount = iBufferCount - 8;
iBufferOutput <<= iBufferCount;
iBuffer ^= iBufferOutput;
}
}
//Write remaining bits in buffer
if (iBufferCount > 0)
{
iBuffer = iBuffer << (8 - iBufferCount);
msEncodedOutput.WriteByte((byte)iBuffer);
}
return msEncodedOutput.ToArray();
}
}
public byte[] Decode(byte[] bInput)
{
long iInputBuffer = 0;
int iBytesWritten = 0;
//Establish Output Buffer to write unencoded data to
byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];
var current = HuffmanTree[HuffmanTree.Count - 1];
//Begin Looping through Input and Decoding
iInputBuffer = 0;
for (int i = 4; i < bInput.Length; i++)
{
iInputBuffer = bInput[i];
for (int bit = 0; bit < 8; bit++)
{
if ((iInputBuffer & 128) == 0)
{
current = current.Left;
}
else
{
current = current.Right;
}
if (current.IsLeaf)
{
bDecodedOutput[iBytesWritten++] = current.Uncoded1;
bDecodedOutput[iBytesWritten++] = current.Uncoded0;
if (iBytesWritten == bDecodedOutput.Length)
{
return bDecodedOutput;
}
current = HuffmanTree[HuffmanTree.Count - 1];
}
iInputBuffer <<= 1;
}
}
throw new Exception();
}
private static void EncodeTree(Node node, int depth, uint value)
{
if (node != null)
{
if (node.IsLeaf)
{
node.CodeLength = depth;
node.Coded = value;
}
else
{
depth++;
value <<= 1;
EncodeTree(node.Left, depth, value);
EncodeTree(node.Right, depth, value | 1);
}
}
}
private void BuildTree(long[] frequencies)
{
var tiny = 0.1 / ushort.MaxValue;
var fraction = 0.0;
SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
for (int i = 0; i <= ushort.MaxValue; i++)
{
var leaf = new Node()
{
Uncoded1 = (byte)(i >> 8),
Uncoded0 = (byte)(i & 255),
Frequency = frequencies[i]
};
HuffmanTree.Add(leaf);
if (leaf.Frequency > 0)
{
trees.Add(leaf.Frequency + (fraction += tiny), leaf);
}
}
while (trees.Count > 1)
{
var e = trees.GetEnumerator();
e.MoveNext();
var first = e.Current;
e.MoveNext();
var second = e.Current;
//Join smallest two nodes
var NewParent = new Node();
NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
NewParent.Left = first.Value;
NewParent.Right = second.Value;
HuffmanTree.Add(NewParent);
//Remove the two that just got joined into one
trees.Remove(first.Key);
trees.Remove(second.Key);
trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
}
}
}
}
Usage examples:
To create the dictionary from sample data:
var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(@"D:\nodes"), true);
To initialize an encoder with a given dictionary:
var huff = new Huffman16(freqs);
And to do some compression:
var encoded = huff.Encode(raw);
And decompression:
var raw = huff.Decode(encoded);
The hard part in my mind is how you build your static dictionary. You don't want to use the LZW dictionary built from your sample data. LZW wastes a bunch of time learning since it can't build the dictionary faster than the decompressor can (a token will only be used the second time it's seen by the compressor, so the decompressor can add it to its dictionary the first time it's seen). The flip side of this is that it's adding things to the dictionary that may never get used, just in case the string shows up again. (e.g., to have a token for 'stackoverflow' you'll also have entries for 'ac', 'ko', 've', 'rf' etc...)
However, looking at the raw token stream from an LZ77 algorithm could work well. You'll only see tokens for strings seen at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
Once you have a static dictionary, using LZW sans the dictionary update seems like an easy implementation but to get the best compression I'd consider a static Huffman table instead of the traditional 12 bit fixed size token (as George Phillips suggested). An LZW dictionary will burn tokens for all the sub-strings you may never actually encode (e.g, if you can encode 'stackoverflow', there will be tokens for 'st', 'sta', 'stac', 'stack', 'stacko' etc.).
At this point it really isn't LZW - what makes LZW clever is how the decompressor can build the same dictionary the compressor used only seeing the compressed data stream. Something you won't be using. But all LZW implementations have a state where the dictionary is full and is no longer updated, this is how you'd use it with your static dictionary.
LZW adds to the dictionary during decompression to ensure it has the same dictionary state as the compressor. Otherwise the decoding would not function properly.
However, if you were in a state where the dictionary was fixed then, yes, you would not need to add new codes.
Your approach will work reasonably well and it's easy to use existing tools to prototype and measure the results. That is, compress the example file and then the example and test data together. The size of the latter less the former will be the expected compressed size of a block.
LZW is a clever way to build up a dictionary on the fly and gives decent results. But a more thorough analysis of your typical data blocks is likely to generate a more efficient dictionary.
There's also room for improvement in how LZW represents compressed data. For instance, each dictionary reference could be Huffman encoded to a closer to optimal length based on the expected frequency of their use. To be truly optimal the codes should be arithmetic encoded.
I would look at your data to see if there's an obvious reason it's so easy to compress. You might be able to do something much simpler than LZ78. I've done both LZ77 (lookback) and LZ78 (dictionary).
Try running an LZ77 compressor on your data. There's no dictionary with LZ77, so you could use a library without alteration. Deflate is an implementation of LZ77.
Your idea of using a common dictionary is a good one, but it's hard to know whether the files are similar to each other or just self-similar without doing some tests.
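One cheap way to run such a test (a hedged Java sketch; java.util.zip's Deflater/Inflater accept a preset dictionary, which is exactly the "dictionary stored separately from the compressed data" setup the question asks about; the dictionary and block contents below are made up for illustration):
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryDemo {
    public static void main(String[] args) throws Exception {
        // Shared dictionary built offline from sample data; both sides must know
        // these bytes, but they are never stored with each compressed block.
        byte[] dictionary = "field1=,field2=,field3=,status=OK,".getBytes(StandardCharsets.UTF_8);
        byte[] block = "field1=42,field2=abc,field3=xyz,status=OK,".getBytes(StandardCharsets.UTF_8);

        Deflater deflater = new Deflater();
        deflater.setDictionary(dictionary);       // must be set before the input data
        deflater.setInput(block);
        deflater.finish();
        byte[] compressed = new byte[1024];
        int compressedLen = deflater.deflate(compressed);
        deflater.end();
        System.out.println(block.length + " bytes -> " + compressedLen + " bytes");

        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLen);
        byte[] restored = new byte[1024];
        int n = inflater.inflate(restored);
        if (n == 0 && inflater.needsDictionary()) { // the decompressor asks for the same dictionary
            inflater.setDictionary(dictionary);
            n = inflater.inflate(restored);
        }
        inflater.end();
        System.out.println(new String(restored, 0, n, StandardCharsets.UTF_8));
    }
}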
The right track is to use a library; almost every modern language has a compression library: C#, Python, Perl, Java, VB.net, whatever you use.
LZW saves some space by making the dictionary depend on previous inputs. It has an initial dictionary, and when you decompress something you add the new entries to the dictionary, so the dictionary keeps growing. (I am omitting some details here, but this is the general idea.)
You can omit this step by supplying the whole (complete) dictionary as the initial one. But this would cost some space.
I find this approach quite interesting for repeated log entries and it is something I would like to explore.
Can you share the compression statistics for using this approach for your use case so I can compare it with other alternatives?
Have you considered having the common dictionary grow over time or is that not a valid option?