Split encrypted messages into chunks and put them together again - gnupg

I want to send GPG-encrypted data via a GET request of a known format.
Issue #1: The data block size in the request is limited (4096 characters), which is not enough for a typical GPG message. So I need to chunk it.
Issue #2: Chunks may arrive in the wrong order. Each chunk must carry a unique message ID and a serial number, so that the message can be put back together.
GPG can output encrypted data in text form (ASCII armor). The RFC 2440 standard allows armored messages to be split into parts:
BEGIN PGP MESSAGE, PART X/Y
Used for multi-part messages, where the armor is split amongst Y
parts, and this is the Xth part out of Y.
BEGIN PGP MESSAGE, PART X
Used for multi-part messages, where this is the Xth part of an
unspecified number of parts. Requires the MESSAGE-ID Armor Header
to be used.
But, unfortunately, I've found no evidence that this feature is implemented in GPG.
And there is no mention of chunking public keys, which can actually be huge too.
So I gave up on the idea of using native GPG armor for chunking.
My current home-made solution: the binary encrypted data is split into chunks, then each chunk is put into a block that contains a UUID (the MessageID analog), the serial number of the block, the total number of blocks, and a CRC checksum of the block.
Like that:
[ UUID ][ Number ][ Total ][ Chunk of encrypted data ][ Checksum ]
Putting the message back together from those blocks is a bigger challenge, but doable as well.
But I want a cleaner solution, preferably in C++.
Could you help me?

Qt provides very simple methods for data serialization. I created a class to chunk, store, and rebuild binary data, and for now I don't think I need anything simpler.
But if someone knows a better solution, please share it with me.
#include <QByteArray>
#include <QByteArrayList>
#include <QByteArrayView>
#include <QDataStream>
#include <QMap>
#include <QUuid>
#include <stdexcept> // std::runtime_error
enum CHUNKER {
MESSAGE_READY = 0,
BLOCK_ADDED
};
struct ChunkedMessage {
QUuid UUID;
QByteArray Data;
};
class Chunker {
public:
Chunker();
~Chunker();
static quint16 GetChecksum(QByteArray *Block);
static QByteArrayList ArmorData(QByteArray *Data, qsizetype *ChunkSize);
CHUNKER AddBlock(QByteArray *Block, ChunkedMessage *Message);
private:
struct MessageBlock {
QUuid UUID;
quint32 Number;
quint32 Total;
QByteArray Data;
};
QMap<QUuid, quint32> Sizes;
QMap<QUuid, QMap<quint32, Chunker::MessageBlock>*> Stack;
MessageBlock DearmorChunk(QByteArray *Block);
bool CheckIntegrity(QUuid *UUID, QByteArray *Reconstructed);
};
Chunker::Chunker() { }
// Free the maps of messages that never became complete.
Chunker::~Chunker() { qDeleteAll(Stack); }
quint16 Chunker::GetChecksum(QByteArray *Block) { return qChecksum(QByteArrayView(*Block), Qt::ChecksumIso3309); }
QByteArrayList Chunker::ArmorData(QByteArray *Data, qsizetype *ChunkSize) {
QByteArrayList Result;
QUuid UUID = QUuid::createUuid();
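// Per-block overhead as written by QDataStream: the UUID as a QByteArray (4-byte length prefix + 16 bytes),
// two quint32 fields, the 4-byte length prefix of the data QByteArray, and the quint16 checksum.
// (sizeof(UUID.toRfc4122()) would be the size of the QByteArray object, not of the 16 UUID bytes.)
const qsizetype Overhead = (4 + 16) + sizeof(quint32) + sizeof(quint32) + 4 + sizeof(quint16);
const qsizetype RealChunkSize = (*ChunkSize) - Overhead;
if (RealChunkSize <= 0) throw std::runtime_error("Chunk size is too small");
// Round up, so that a length that is an exact multiple does not produce an extra empty block.
const quint32 ChunkCount = static_cast<quint32>(qMax<qsizetype>(1, ((*Data).length() + RealChunkSize - 1) / RealChunkSize));
for (quint32 Pos = 0; Pos < ChunkCount; Pos++) {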
QByteArray Block;
QDataStream Stream(&Block, QIODeviceBase::WriteOnly);
Stream << UUID.toRfc4122() << (Pos + 1) << ChunkCount << (*Data).mid(Pos * RealChunkSize, RealChunkSize);
Stream << Chunker::GetChecksum(&Block);
Result.push_back(Block);
}
return Result;
}
Chunker::MessageBlock Chunker::DearmorChunk(QByteArray *Block) {
Chunker::MessageBlock Result;
QDataStream Stream(Block, QIODeviceBase::ReadOnly);
QByteArray ClearBlock = (*Block).chopped(sizeof(quint16));
QByteArray BytesUUID;
quint16 Checksum;
Stream >> BytesUUID >> Result.Number >> Result.Total >> Result.Data >> Checksum;
Result.UUID = QUuid::fromRfc4122(QByteArrayView(BytesUUID));
if (Chunker::GetChecksum(&ClearBlock) != Checksum) throw std::runtime_error("Checksums are not equal");
return Result;
}
bool Chunker::CheckIntegrity(QUuid *UUID, QByteArray *Reconstructed) {
quint32 Size = this->Sizes[*UUID];
if (this->Stack[*UUID]->size() > Size) throw std::runtime_error("Corrupted message blocks");
if (this->Stack[*UUID]->size() < Size) return false;
for (quint32 Counter = 0; Counter < Size; Counter++) {
if (!(this->Stack[*UUID]->contains(Counter + 1))) return false;
(*Reconstructed).append((*(this->Stack[*UUID]))[Counter + 1].Data);
}
return true;
}
CHUNKER Chunker::AddBlock(QByteArray *Block, ChunkedMessage *Message) {
Chunker::MessageBlock DecodedBlock = Chunker::DearmorChunk(Block);
if (!this->Sizes.contains(DecodedBlock.UUID)) {
this->Sizes[(QUuid)DecodedBlock.UUID] = (quint32)DecodedBlock.Total;
this->Stack[(QUuid)DecodedBlock.UUID] = new QMap<quint32, Chunker::MessageBlock>;
}
(*(this->Stack[DecodedBlock.UUID]))[(quint32)(DecodedBlock.Number)] = Chunker::MessageBlock(DecodedBlock);
QByteArray ReconstructedData;
if (this->CheckIntegrity(&DecodedBlock.UUID, &ReconstructedData)) {
(*Message).UUID = (QUuid)(DecodedBlock.UUID);
(*Message).Data = (QByteArray)ReconstructedData;
this->Sizes.remove(DecodedBlock.UUID);
delete this->Stack[DecodedBlock.UUID];
this->Stack.remove(DecodedBlock.UUID);
return CHUNKER::MESSAGE_READY;
}
return CHUNKER::BLOCK_ADDED;
}
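For readers without Qt, the same split-and-reassemble idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: a 64-bit `id` stands in for the UUID, the names `Block`, `splitMessage` and `Reassembler` are made up for this sketch, and serialization and the checksum are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// One framed chunk. `id` is a hypothetical 64-bit stand-in for the UUID.
struct Block {
    std::uint64_t id;
    std::uint32_t number;  // 1-based position of this block
    std::uint32_t total;   // number of blocks in the whole message
    Bytes payload;
};

// Split `data` into blocks that carry at most `payloadSize` bytes each.
std::vector<Block> splitMessage(const Bytes& data, std::uint64_t id, std::size_t payloadSize) {
    assert(payloadSize > 0);
    // Ceiling division; an empty message still produces one (empty) block.
    const std::size_t count =
        std::max<std::size_t>(1, (data.size() + payloadSize - 1) / payloadSize);
    std::vector<Block> blocks;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t begin = i * payloadSize;
        const std::size_t end = std::min(data.size(), begin + payloadSize);
        blocks.push_back({id, static_cast<std::uint32_t>(i + 1),
                          static_cast<std::uint32_t>(count),
                          Bytes(data.begin() + static_cast<std::ptrdiff_t>(begin),
                                data.begin() + static_cast<std::ptrdiff_t>(end))});
    }
    return blocks;
}

class Reassembler {
public:
    // Stores the block; returns the full message once every part has arrived.
    std::optional<Bytes> add(const Block& block) {
        if (block.total == 0 || block.number == 0 || block.number > block.total)
            return std::nullopt;  // malformed header, ignore the block
        auto& parts = pending_[block.id];
        parts[block.number] = block.payload;  // a duplicate simply overwrites
        if (parts.size() < block.total) return std::nullopt;
        Bytes message;
        for (const auto& entry : parts)  // std::map iterates in block-number order
            message.insert(message.end(), entry.second.begin(), entry.second.end());
        pending_.erase(block.id);
        return message;
    }

private:
    std::map<std::uint64_t, std::map<std::uint32_t, Bytes>> pending_;
};
```

Blocks can be fed to `Reassembler::add` in any order; the call that delivers the last missing part returns the reassembled bytes. A real protocol would also need to serialize each block, verify a checksum, and expire messages that never complete.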

Related

Reading binary data

I am trying to read data from a binary file. One block of data is 76 bytes long (this varies with the number of the 2-byte "main data items" in the middle of the block). The first datum is 4 bytes, second is 4 bytes, and then there are a bunch of 2 byte main data items, and at the end are 2 more 2-byte pieces of data.
Based on this Delphi sample I've learned how to read the file with the code below:
short AShortInt; // 16 bits
int AInteger; // 32 bits
try
{
infile=new TFileStream(myfile,fmOpenRead); // myfile is binary
BR = new TBinaryReader(infile, TEncoding::Unicode, false);
for (int rows = 0; rows < 5; rows++) { // just read the first 5 blocks of data for testing
AInteger = BR->ReadInt32(); // read first two 4 byte integers for this block
AInteger = BR->ReadInt32();
for (int i = 0; i < 32; i++) { // now read the 32 2-byte integers from this block
AShortInt = BR->ReadInt16();
}
AShortInt = BR->ReadInt16(); // read next to last 2-byte int
AShortInt = BR->ReadInt16(); // read the last 2-byte int
}
delete infile;
delete BR;
Close();
}
catch(...)
{
delete infile; // closes the file, doesn't delete it.
delete BR;
ShowMessage("Can't open file!");
Close();
}
But what I would like to do is use a 76-byte buffer to read the entire block, and then pick the individual values out of that buffer. I put together the following code based on this question, and I can read a whole block of data into the buffer.
UnicodeString myfile = System::Ioutils::TPath::Combine(System::Ioutils::TPath::GetDocumentsPath(), "binaryCOM.dat");
TFileStream*infile=0;
try
{
infile=new TFileStream(myfile,fmOpenRead);
const int bufsize=76;
char*buf=new char[bufsize];
int a = 0;
while(int bytesread=infile->Read(buf,bufsize)) {
a++; // just a place to break on Run to Cursor
}
delete[]buf;
}
catch(...)
{
delete infile;
ShowMessage("Can't open file!");
Close();
}
But I can't figure out how to piece together values from subsets of the bytes in the buffer. Is there a way to concatenate bytes, so that I could read a block of data into a 76-byte buffer and then do something like this?
unsigned int FirstDatum = buf[0]+buf[1]+buf[2]+buf[3]; // concatenate the 4 bytes for the first piece of data
This will be an FMX app for Win32, iOS, and Android built in C++Builder 10.3.2.
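On the byte-concatenation idea: adding the bytes, as in `buf[0]+buf[1]+buf[2]+buf[3]`, only sums their values. To rebuild an integer, each byte has to be shifted to its position and the results OR-ed together. Below is a minimal sketch, assuming the file stores values in little-endian order and that each block holds 32 main items (both are assumptions, as are the names `readU16LE`, `readU32LE`, `DataBlock` and `parseBlock`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Read a little-endian 16-bit value starting at buf[offset].
inline std::uint16_t readU16LE(const unsigned char* buf, std::size_t offset) {
    return static_cast<std::uint16_t>(buf[offset] | (buf[offset + 1] << 8));
}

// Read a little-endian 32-bit value starting at buf[offset].
inline std::uint32_t readU32LE(const unsigned char* buf, std::size_t offset) {
    return static_cast<std::uint32_t>(buf[offset])
         | (static_cast<std::uint32_t>(buf[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(buf[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(buf[offset + 3]) << 24);
}

constexpr std::size_t kItems = 32;                              // assumed item count
constexpr std::size_t kBlockSize = 4 + 4 + 2 * kItems + 2 + 2;  // 76 bytes
static_assert(kBlockSize == 76, "layout described in the question");

struct DataBlock {
    std::uint32_t first;
    std::uint32_t second;
    std::int16_t items[kItems];
    std::int16_t nextToLast;
    std::int16_t last;
};

// Decode one kBlockSize-byte buffer into its fields.
DataBlock parseBlock(const unsigned char* buf) {
    assert(buf != nullptr);
    DataBlock block{};
    block.first = readU32LE(buf, 0);
    block.second = readU32LE(buf, 4);
    for (std::size_t i = 0; i < kItems; ++i)
        block.items[i] = static_cast<std::int16_t>(readU16LE(buf, 8 + 2 * i));
    block.nextToLast = static_cast<std::int16_t>(readU16LE(buf, 8 + 2 * kItems));
    block.last = static_cast<std::int16_t>(readU16LE(buf, 8 + 2 * kItems + 2));
    return block;
}
```

The buffer is taken as `unsigned char` on purpose: with plain `char`, bytes of 0x80 and above may be sign-extended before the shift and corrupt the result. Decoding byte by byte also avoids alignment and host-endianness problems, which matters when the same code targets Win32, iOS and Android.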
Here is my modified code using Remy's suggestion of TMemoryStream.
UnicodeString myfile = System::Ioutils::TPath::Combine(System::Ioutils::TPath::GetDocumentsPath(), "binaryCOM.dat");
TMemoryStream *MS=0;
TBinaryReader *BR=0;
std::vector<short> myArray;
short AShortInt;
int AInteger;
int NumDatums = 32; // the variable number of 2-byte main datums
try
{
MS = new TMemoryStream();
MS->LoadFromFile(myfile);
BR = new TBinaryReader(MS, TEncoding::Unicode, false);
for (int rows = 0; rows < 5; rows++) { // testing with first 5 blocks of data
AInteger = BR->ReadInt32(); // read first two 4 byte integers
AInteger = BR->ReadInt32(); // here
for (int i = 0; i < NumDatums; i++) { // read the main 2-byte data
AShortInt = BR->ReadInt16();
myArray.push_back(AShortInt); // push it into vector
}
AShortInt = BR->ReadInt16(); // read next to last 2-byte int
AShortInt = BR->ReadInt16(); // read the last 2-byte int
// code here to do something with this block of data just read from file
}
delete MS;
delete BR;
}
catch(...)
{
delete MS;
delete BR;
ShowMessage("Can't open file.");
}

SAFEARRAY data to unsigned char*

I am trying to convert a SAFEARRAY data pointer to unsigned char*. However, I am not getting the expected data. Here is a snippet.
SafeArrayLock(psaFrameData);
psaFrameData->rgsabound->cElements;
int nCount = psaFrameData->rgsabound->cElements - psaFrameData->rgsabound->lLbound + 1;
frameData = new unsigned char[nCount];
memset(frameData, 0, nCount);
for (int i = 0; i < nCount; ++i)
{
frameData[i] = ((unsigned char*)(psaFrameData)->pvData)[i];
}
SafeArrayUnlock(psaFrameData);
Do not manually lock the array and then access its pvData (or any of its other data members) directly. Use the various accessor functions instead, such as SafeArrayAccessData():
Increments the lock count of an array, and retrieves a pointer to the array data.
Try something more like this:
// safety check: make sure the array has only 1 dimension...
if (SafeArrayGetDim(psaFrameData) != 1)
{
// handle the error ...
}
else
{
// safety check: make sure the array contains byte elements...
VARTYPE vt = 0;
SafeArrayGetVartype(psaFrameData, &vt);
if (vt != VT_UI1)
{
// handle the error ...
}
else
{
// get a pointer to the array's byte data...
unsigned char *data;
if (FAILED(SafeArrayAccessData(psaFrameData, (void**)&data)))
{
// handle the error ...
}
else
{
// calculate the number of bytes in the array...
LONG lBound, uBound;
SafeArrayGetLBound(psaFrameData, 1, &lBound);
SafeArrayGetUBound(psaFrameData, 1, &uBound);
long nCount = uBound - lBound + 1;
// copy the bytes...
frameData = new unsigned char[nCount];
memcpy(frameData, data, nCount);
// release the pointer to the array's byte data...
SafeArrayUnaccessData(psaFrameData);
}
}
}

wcstombs & allocating memory for character array on heap

I'm reading a file that contains a single line of wide characters, and I never know in advance how long it is going to be. I've read it into a std::wstring, inString, and have managed to create the multibyte string out of thin air.
Q1 - Are these called r-values?
Q2 - How do I allocate memory for this on the heap and obtain a smart pointer to it? I do not want to use new or malloc (and eventually call delete or free), nor a fixed-size buffer on the stack (I can never know the maximum length).
Q3 - Can I make use of the make_shared or make_unique function templates here?
Q4 - To be specific, can I get a pointer like shared_ptr<char> pointing to the char array allocated on the heap?
I tried something like the following,
std::shared_ptr<char> MBString(const_cast<char*>(std::string(inString.begin(), inString.end()).c_str()));
but it did not work. I tried a few suggestions from the internet, but I still don't know how to do it.
Q5 - Wide-to-multibyte conversion aside, how do I, in general, allocate a char string of arbitrary length on the heap and get a smart pointer to it?
std::wfstream inFile(L"lengthUnkown.txt", std::ios::in);
std::wstring inString;
inFile >> inString;
std::wcout << inString << std::endl; //prints correctly
std::cout << (const_cast<char*>(std::string(inString.begin(), inString.end()).c_str())) << std::endl; //this prints the line correctly as expected
//convert wide character string to multi-byte on the heap pointed, to by MBString
//std::cout << MBString << std::endl; //I want to print the multi-byte string like this
return 0;
Not resource optimal but reliable:
wchar_t* mb2wstr(const char* inval) {
size_t size = std::strlen(inval);
#define OUTSZ (size+1)*sizeof(wchar_t)
auto buf = (wchar_t*)std::malloc(OUTSZ);
std::memset(buf, 0, OUTSZ);
std::setlocale(LC_CTYPE,""); // required for "mbstowcs" to work
size = std::mbstowcs(buf, inval, size);
if ( size == (size_t)(-1) ) {
std::free(buf);
buf = nullptr;
} else {
buf = (wchar_t*)std::realloc(buf,OUTSZ);
}
return buf;
#undef OUTSZ
}
char* wstr2mb(const wchar_t* inval) {
std::setlocale(LC_CTYPE,""); // required for "wcstombs" to work; set first, because MB_CUR_MAX depends on the locale
size_t size = std::wcslen(inval);
#define OUTSZ (size+1)*MB_CUR_MAX // Maximum length of a multibyte character in the current locale
auto buf = (char*)std::malloc(OUTSZ);
std::memset(buf, 0, OUTSZ);
size = std::wcstombs(buf, inval, size*MB_CUR_MAX); // the limit is in bytes and leaves room for the terminator
if ( size == (size_t)(-1) ) {
std::free(buf);
buf = nullptr;
} else {
buf = (char*)std::realloc(buf,size+1);
}
return buf;
#undef OUTSZ
}
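const std::string pwchar2string(const wchar_t* inval) {
char* tmp = wstr2mb(inval);
if (!tmp) return std::string(); // conversion failed
std::string out{tmp};
std::free(tmp);
return out;
}
const std::wstring pchar2wstring(const char* inval) {
wchar_t* tmp = mb2wstr(inval);
if (!tmp) return std::wstring(); // conversion failed
std::wstring out{tmp};
std::free(tmp);
return out;
}
const std::wstring string2wstring(const std::string& value) {
return pchar2wstring(value.c_str());
}
const std::string wstring2string(const std::wstring& value) {
return pwchar2string(value.c_str());
}
// The result is malloc'ed (or nullptr on failure); the caller must release it with std::free().
// Returning c_str() of a temporary string here would hand back a dangling pointer.
wchar_t* char2wchar(const char* value) {
return mb2wstr(value);
}
// Same ownership rule as char2wchar.
char* wchar2char(const wchar_t* value) {
return wstr2mb(value);
}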
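Regarding Q2 to Q5, a heap buffer owned by a smart pointer can be obtained without calling new, malloc or free directly. The following is a minimal sketch, not a definitive implementation: `toMultibyte` is a hypothetical helper name, the conversion uses whatever C locale is currently set, and C++14 or later is assumed for `std::make_unique<char[]>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <string>

// Convert a wide string to a heap-allocated multibyte C string using the
// current C locale. Returns an empty pointer if a character cannot be converted.
std::unique_ptr<char[]> toMultibyte(const std::wstring& wide) {
    // Worst case: every wide character needs MB_CUR_MAX bytes, plus the terminator.
    const std::size_t capacity = wide.size() * MB_CUR_MAX + 1;
    auto buffer = std::make_unique<char[]>(capacity);  // zero-filled array
    const std::size_t written = std::wcstombs(buffer.get(), wide.c_str(), capacity - 1);
    if (written == static_cast<std::size_t>(-1)) return nullptr;
    assert(written < capacity);  // the zero fill guarantees a terminator after the text
    return buffer;
}
```

The returned pointer can be printed with `std::cout << ptr.get()`. If shared ownership is really needed, a `std::unique_ptr<char[]>` can be moved into a `std::shared_ptr<char[]>` (C++17). In most code a plain `std::string` is the simpler choice, since it already owns a heap buffer of arbitrary length.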

Extract trailing int from string containing other characters

I have a problem extracting a signed int from a string in C++.
Assuming that I have the string images1234, how can I extract the 1234 from it without knowing the position of the last non-numeric character?
FYI, I have tried stringstream as well as lexical_cast, as suggested by others in other posts, but stringstream returns 0 while lexical_cast stopped working.
int main()
{
string virtuallive("Images1234");
//stringstream output(virtuallive.c_str());
//int i = stoi(virtuallive);
//stringstream output(virtuallive);
int i;
i = boost::lexical_cast<int>(virtuallive.c_str());
//output >> i;
cout << i << endl;
return 0;
}
How can I extract the 1234 from the string without knowing the position of the last non-numeric character in C++?
You can't. But the position is not hard to find:
// input is a non-const std::string; needs <string>, <cstdlib> and <cerrno>
auto last_non_numeric = input.find_last_not_of("1234567890");
char* endp = &input[0];
if (last_non_numeric != std::string::npos)
    endp += last_non_numeric + 1;
if (!*endp) { /* FAILURE, no number on the end */ }
errno = 0;
auto i = strtol(endp, &endp, 10);
if (*endp || errno == ERANGE) { /* weird FAILURE, maybe the number was really HUGE and couldn't convert */ }
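The same idea can be wrapped into a small self-contained function. This is a sketch under assumptions: `trailingNumber` is a name invented here, C++17 is assumed for `std::optional`, and only the digits are taken, so a '-' in front of them is treated as an ordinary non-digit character.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>

// Returns the run of decimal digits at the end of `input` as a long, or
// std::nullopt if there is no such run or the value does not fit in a long.
std::optional<long> trailingNumber(const std::string& input) {
    const std::size_t lastNonDigit = input.find_last_not_of("0123456789");
    const std::size_t start =
        (lastNonDigit == std::string::npos) ? 0 : lastNonDigit + 1;
    assert(start <= input.size());
    if (start == input.size()) return std::nullopt;  // no digits at the end
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(input.c_str() + start, &end, 10);
    if (errno == ERANGE || *end != '\0') return std::nullopt;  // out of range
    return value;
}
```

For example, `trailingNumber("Images1234")` yields 1234, while `trailingNumber("Images")` yields an empty optional.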
Another possibility would be to put the string into a stringstream, then read the number from the stream (after imbuing the stream with a locale that classifies everything except digits as white space).
// First the desired facet:
struct digits_only: std::ctype<char> {
    digits_only(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table() {
        // everything is white-space:
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size,std::ctype_base::space);
        // except digits, which are digits ('9' + 1, because the end of the range is exclusive)
        std::fill(&rc['0'], &rc['9' + 1], std::ctype_base::digit);
        // and '.', which we'll call punctuation:
        rc['.'] = std::ctype_base::punct;
        return &rc[0];
    }
};
Then the code to read the data:
std::istringstream virtuallive("Images1234");
virtuallive.imbue(std::locale(std::locale(), new digits_only));
int number;
// Since we classify the letters as white space, the stream will ignore them.
// We can just read the number as if nothing else were there:
virtuallive >> number;
This technique is useful primarily when the stream contains a substantial amount of data, and you want all the data in that stream to be interpreted in the same way (e.g., only read numbers, regardless of what else it might contain).

Lossless compression in small blocks with precomputed dictionary

I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I'd like to generate a compression dictionary based on an example data file and use that dictionary forever as I read and write the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. It looks fairly straightforward to modify it such that the dictionary creation is a separate block of code. So I have two questions:
Am I on the right track or is there a better way?
Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?
Thanks.
Update: Now I'm thinking the ideal case would be to find a library that lets me store the dictionary separately from the compressed data. Does anything like that exist?
Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my mods back to that blog author, but I haven't heard back yet. The compression rates I'm seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.
Update: I converted it to 16 bits and the compression is better. It's also much faster than the original code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace Book.Core
{
public class Huffman16
{
private readonly double log2 = Math.Log(2);
private List<Node> HuffmanTree = new List<Node>();
internal class Node
{
public long Frequency { get; set; }
public byte Uncoded0 { get; set; }
public byte Uncoded1 { get; set; }
public uint Coded { get; set; }
public int CodeLength { get; set; }
public Node Left { get; set; }
public Node Right { get; set; }
public bool IsLeaf
{
get { return Left == null; }
}
public override string ToString()
{
var coded = "00000000" + Convert.ToString(Coded, 2);
return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded.Substring(coded.Length - CodeLength), Frequency);
}
}
public Huffman16(long[] frequencies)
{
if (frequencies.Length != ushort.MaxValue + 1)
{
throw new ArgumentException("frequencies.Length must equal " + (ushort.MaxValue + 1));
}
BuildTree(frequencies);
EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
}
public static long[] GetFrequencies(byte[] sampleData, bool safe)
{
if (sampleData.Length % 2 != 0)
{
throw new ArgumentException("sampleData.Length must be a multiple of 2.");
}
var histogram = new long[ushort.MaxValue + 1];
if (safe)
{
for (int i = 0; i <= ushort.MaxValue; i++)
{
histogram[i] = 1;
}
}
for (int i = 0; i < sampleData.Length; i += 2)
{
histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
}
return histogram;
}
public byte[] Encode(byte[] plainData)
{
if (plainData.Length % 2 != 0)
{
throw new ArgumentException("plainData.Length must be a multiple of 2.");
}
Int64 iBuffer = 0;
int iBufferCount = 0;
using (MemoryStream msEncodedOutput = new MemoryStream())
{
//Write Final Output Size 1st
msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);
//Begin Writing Encoded Data Stream
iBuffer = 0;
iBufferCount = 0;
for (int i = 0; i < plainData.Length; i += 2)
{
Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];
//How many bits are we adding?
iBufferCount += FoundLeaf.CodeLength;
//Shift the buffer
iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;
//Are there at least 8 bits in the buffer?
while (iBufferCount > 7)
{
//Write to output
int iBufferOutput = (int)(iBuffer >> (iBufferCount - 8));
msEncodedOutput.WriteByte((byte)iBufferOutput);
iBufferCount = iBufferCount - 8;
iBufferOutput <<= iBufferCount;
iBuffer ^= iBufferOutput;
}
}
//Write remaining bits in buffer
if (iBufferCount > 0)
{
iBuffer = iBuffer << (8 - iBufferCount);
msEncodedOutput.WriteByte((byte)iBuffer);
}
return msEncodedOutput.ToArray();
}
}
public byte[] Decode(byte[] bInput)
{
long iInputBuffer = 0;
int iBytesWritten = 0;
//Establish Output Buffer to write unencoded data to
byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];
var current = HuffmanTree[HuffmanTree.Count - 1];
//Begin Looping through Input and Decoding
iInputBuffer = 0;
for (int i = 4; i < bInput.Length; i++)
{
iInputBuffer = bInput[i];
for (int bit = 0; bit < 8; bit++)
{
if ((iInputBuffer & 128) == 0)
{
current = current.Left;
}
else
{
current = current.Right;
}
if (current.IsLeaf)
{
bDecodedOutput[iBytesWritten++] = current.Uncoded1;
bDecodedOutput[iBytesWritten++] = current.Uncoded0;
if (iBytesWritten == bDecodedOutput.Length)
{
return bDecodedOutput;
}
current = HuffmanTree[HuffmanTree.Count - 1];
}
iInputBuffer <<= 1;
}
}
throw new Exception();
}
private static void EncodeTree(Node node, int depth, uint value)
{
if (node != null)
{
if (node.IsLeaf)
{
node.CodeLength = depth;
node.Coded = value;
}
else
{
depth++;
value <<= 1;
EncodeTree(node.Left, depth, value);
EncodeTree(node.Right, depth, value | 1);
}
}
}
private void BuildTree(long[] frequencies)
{
var tiny = 0.1 / ushort.MaxValue;
var fraction = 0.0;
SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
for (int i = 0; i <= ushort.MaxValue; i++)
{
var leaf = new Node()
{
Uncoded1 = (byte)(i >> 8),
Uncoded0 = (byte)(i & 255),
Frequency = frequencies[i]
};
HuffmanTree.Add(leaf);
if (leaf.Frequency > 0)
{
trees.Add(leaf.Frequency + (fraction += tiny), leaf);
}
}
while (trees.Count > 1)
{
var e = trees.GetEnumerator();
e.MoveNext();
var first = e.Current;
e.MoveNext();
var second = e.Current;
//Join smallest two nodes
var NewParent = new Node();
NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
NewParent.Left = first.Value;
NewParent.Right = second.Value;
HuffmanTree.Add(NewParent);
//Remove the two that just got joined into one
trees.Remove(first.Key);
trees.Remove(second.Key);
trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
}
}
}
}
Usage examples:
To create the dictionary from sample data:
var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(@"D:\nodes"), true);
To initialize an encoder with a given dictionary:
var huff = new Huffman16(freqs);
And to do some compression:
var encoded = huff.Encode(raw);
And decompression:
var raw = huff.Decode(encoded);
The hard part in my mind is how you build your static dictionary. You don't want to use the LZW dictionary built from your sample data. LZW wastes a bunch of time learning, since it can't build the dictionary faster than the decompressor can (a token will only be used the second time it's seen by the compressor, so the decompressor can add it to its dictionary the first time it's seen). The flip side of this is that it adds things to the dictionary that may never get used, just in case the string shows up again (e.g., to have a token for 'stackoverflow' you'll also have entries for 'ac', 'ko', 've', 'rf', etc.).
However, looking at the raw token stream from an LZ77 algorithm could work well. You'll only see tokens for strings seen at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
Once you have a static dictionary, using LZW without the dictionary update seems like an easy implementation, but to get the best compression I'd consider a static Huffman table instead of the traditional 12-bit fixed-size token (as George Phillips suggested). An LZW dictionary will burn tokens for all the sub-strings you may never actually encode (e.g., if you can encode 'stackoverflow', there will be tokens for 'st', 'sta', 'stac', 'stack', 'stacko', etc.).
At this point it really isn't LZW: what makes LZW clever is how the decompressor can build the same dictionary the compressor used while seeing only the compressed data stream, and that is something you won't be using. But all LZW implementations have a state where the dictionary is full and is no longer updated; this is how you'd use it with your static dictionary.
LZW adds to the dictionary during decompression to ensure it has the same dictionary state as the compressor. Otherwise the decoding would not function properly.
However, if you were in a state where the dictionary was fixed then, yes, you would not need to add new codes.
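The fixed-dictionary case can be sketched as follows. This is an illustration only, not LZW itself: the dictionary is supplied up front and never updated, encoding is a greedy longest match, and the names (`StaticDictionary`, `encode`, `decode`) as well as the 16-bit code width are assumptions made for this sketch. How the phrases are chosen from sample data is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// A fixed dictionary: codes 0..255 are the single bytes, higher codes are
// phrases chosen in advance. Nothing is added while encoding or decoding.
class StaticDictionary {
public:
    explicit StaticDictionary(const std::vector<std::string>& phrases) {
        for (int b = 0; b < 256; ++b)
            entries_.push_back(std::string(1, static_cast<char>(b)));
        for (const auto& phrase : phrases) {
            assert(!phrase.empty());
            entries_.push_back(phrase);
        }
        assert(entries_.size() <= 65536);  // codes must fit in 16 bits
    }

    // Greedy encoding: at each position emit the code of the longest match.
    std::vector<std::uint16_t> encode(const std::string& input) const {
        std::vector<std::uint16_t> codes;
        std::size_t pos = 0;
        while (pos < input.size()) {
            // The single-byte entry always matches, so progress is guaranteed.
            std::size_t best = static_cast<unsigned char>(input[pos]);
            for (std::size_t code = 256; code < entries_.size(); ++code) {
                const std::string& entry = entries_[code];
                if (entry.size() > entries_[best].size() &&
                    input.compare(pos, entry.size(), entry) == 0)
                    best = code;
            }
            codes.push_back(static_cast<std::uint16_t>(best));
            pos += entries_[best].size();
        }
        return codes;
    }

    // Decoding is a plain table lookup, since both sides share the table.
    std::string decode(const std::vector<std::uint16_t>& codes) const {
        std::string out;
        for (std::uint16_t code : codes) out += entries_.at(code);
        return out;
    }

private:
    std::vector<std::string> entries_;
};
```

With the phrases "stack", "overflow" and "stackoverflow", the text "stackoverflow on stack" encodes to six codes. The linear scan over all entries is only for clarity; a trie or hash lookup would be the usual choice when blocks are encoded hundreds of millions of times, and the codes could then be Huffman coded as suggested in the other answers.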
Your approach will work reasonably well and it's easy to use existing tools to prototype and measure the results. That is, compress the example file and then the example and test data together. The size of the latter less the former will be the expected compressed size of a block.
LZW is a clever way to build up a dictionary on the fly and gives decent results. But a more thorough analysis of your typical data blocks is likely to generate a more efficient dictionary.
There's also room for improvement in how LZW represents compressed data. For instance, each dictionary reference could be Huffman encoded to a closer to optimal length based on the expected frequency of their use. To be truly optimal the codes should be arithmetic encoded.
I would look at your data to see if there's an obvious reason it's so easy to compress. You might be able to do something much simpler than LZ78. I've done both LZ77 (lookback) and LZ78 (dictionary).
Try running an LZ77 compressor on your data. There's no dictionary with LZ77, so you could use a library without alteration. Deflate is an implementation of LZ77.
Your idea of using a common dictionary is a good one, but it's hard to know whether the files are similar to each other or just self-similar without doing some tests.
The right track is to use a library: almost every modern language has a compression library. C#, Python, Perl, Java, VB.NET, whatever you use.
LZW saves some space by making the dictionary depend on previous inputs. It has an initial dictionary, and as you decompress, you add entries to it, so the dictionary keeps growing. (I am omitting some details here, but this is the general idea.)
You can omit this step by supplying the whole (complete) dictionary as the initial one. But this would cost some space.
I find this approach quite interesting for repeated log entries, and it is something I would like to explore.
Can you share the compression statistics for using this approach for your use case so I can compare it with other alternatives?
Have you considered having the common dictionary grow over time or is that not a valid option?
