Lossless compression in small blocks with precomputed dictionary - algorithm

I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I'd like to generate a compression dictionary based on an example data file and use that dictionary forever as I read and write the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. It looks fairly straightforward to modify it such that the dictionary creation is a separate block of code. So I have two questions:
Am I on the right track or is there a better way?
Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?
Thanks.
Update: Now I'm thinking the ideal case would be to find a library that lets me store the dictionary separately from the compressed data. Does anything like that exist?
Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my mods back to that blog author, but I haven't heard back yet. The compression rates I'm seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.
Update: I converted it to 16 bits and the compression is better. It's also much faster than the original code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace Book.Core
{
  public class Huffman16
  {
    private readonly double log2 = Math.Log(2);

    private List<Node> HuffmanTree = new List<Node>();

    internal class Node
    {
      public long Frequency { get; set; }
      public byte Uncoded0 { get; set; }
      public byte Uncoded1 { get; set; }
      public uint Coded { get; set; }
      public int CodeLength { get; set; }
      public Node Left { get; set; }
      public Node Right { get; set; }

      public bool IsLeaf
      {
        get { return Left == null; }
      }

      public override string ToString()
      {
        var coded = "00000000" + Convert.ToString(Coded, 2);
        return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded.Substring(coded.Length - CodeLength), Frequency);
      }
    }

    public Huffman16(long[] frequencies)
    {
      if (frequencies.Length != ushort.MaxValue + 1)
      {
        throw new ArgumentException("frequencies.Length must equal " + (ushort.MaxValue + 1));
      }
      BuildTree(frequencies);
      EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
    }

    public static long[] GetFrequencies(byte[] sampleData, bool safe)
    {
      if (sampleData.Length % 2 != 0)
      {
        throw new ArgumentException("sampleData.Length must be a multiple of 2.");
      }
      var histogram = new long[ushort.MaxValue + 1];
      if (safe)
      {
        for (int i = 0; i <= ushort.MaxValue; i++)
        {
          histogram[i] = 1;
        }
      }
      for (int i = 0; i < sampleData.Length; i += 2)
      {
        histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
      }
      return histogram;
    }

    public byte[] Encode(byte[] plainData)
    {
      if (plainData.Length % 2 != 0)
      {
        throw new ArgumentException("plainData.Length must be a multiple of 2.");
      }

      Int64 iBuffer = 0;
      int iBufferCount = 0;

      using (MemoryStream msEncodedOutput = new MemoryStream())
      {
        //Write Final Output Size 1st
        msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);

        //Begin Writing Encoded Data Stream
        iBuffer = 0;
        iBufferCount = 0;
        for (int i = 0; i < plainData.Length; i += 2)
        {
          Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];

          //How many bits are we adding?
          iBufferCount += FoundLeaf.CodeLength;

          //Shift the buffer
          iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;

          //Are there at least 8 bits in the buffer?
          while (iBufferCount > 7)
          {
            //Write to output
            int iBufferOutput = (int)(iBuffer >> (iBufferCount - 8));
            msEncodedOutput.WriteByte((byte)iBufferOutput);
            iBufferCount = iBufferCount - 8;
            iBufferOutput <<= iBufferCount;
            iBuffer ^= iBufferOutput;
          }
        }

        //Write remaining bits in buffer
        if (iBufferCount > 0)
        {
          iBuffer = iBuffer << (8 - iBufferCount);
          msEncodedOutput.WriteByte((byte)iBuffer);
        }
        return msEncodedOutput.ToArray();
      }
    }

    public byte[] Decode(byte[] bInput)
    {
      long iInputBuffer = 0;
      int iBytesWritten = 0;

      //Establish Output Buffer to write unencoded data to
      byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];

      var current = HuffmanTree[HuffmanTree.Count - 1];

      //Begin Looping through Input and Decoding
      iInputBuffer = 0;
      for (int i = 4; i < bInput.Length; i++)
      {
        iInputBuffer = bInput[i];
        for (int bit = 0; bit < 8; bit++)
        {
          if ((iInputBuffer & 128) == 0)
          {
            current = current.Left;
          }
          else
          {
            current = current.Right;
          }
          if (current.IsLeaf)
          {
            bDecodedOutput[iBytesWritten++] = current.Uncoded1;
            bDecodedOutput[iBytesWritten++] = current.Uncoded0;
            if (iBytesWritten == bDecodedOutput.Length)
            {
              return bDecodedOutput;
            }
            current = HuffmanTree[HuffmanTree.Count - 1];
          }
          iInputBuffer <<= 1;
        }
      }
      throw new InvalidOperationException("Ran out of input before decoding completed.");
    }

    private static void EncodeTree(Node node, int depth, uint value)
    {
      if (node != null)
      {
        if (node.IsLeaf)
        {
          node.CodeLength = depth;
          node.Coded = value;
        }
        else
        {
          depth++;
          value <<= 1;
          EncodeTree(node.Left, depth, value);
          EncodeTree(node.Right, depth, value | 1);
        }
      }
    }

    private void BuildTree(long[] frequencies)
    {
      var tiny = 0.1 / ushort.MaxValue;
      var fraction = 0.0;

      SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
      for (int i = 0; i <= ushort.MaxValue; i++)
      {
        var leaf = new Node()
        {
          Uncoded1 = (byte)(i >> 8),
          Uncoded0 = (byte)(i & 255),
          Frequency = frequencies[i]
        };
        HuffmanTree.Add(leaf);
        if (leaf.Frequency > 0)
        {
          trees.Add(leaf.Frequency + (fraction += tiny), leaf);
        }
      }

      while (trees.Count > 1)
      {
        var e = trees.GetEnumerator();
        e.MoveNext();
        var first = e.Current;
        e.MoveNext();
        var second = e.Current;

        //Join smallest two nodes
        var NewParent = new Node();
        NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
        NewParent.Left = first.Value;
        NewParent.Right = second.Value;

        HuffmanTree.Add(NewParent);

        //Remove the two that just got joined into one
        trees.Remove(first.Key);
        trees.Remove(second.Key);

        trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
      }
    }
  }
}
Usage examples:
To create the dictionary from sample data:
var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(@"D:\nodes"), true);
To initialize an encoder with a given dictionary:
var huff = new Huffman16(freqs);
And to do some compression:
var encoded = huff.Encode(raw);
And decompression:
var raw = huff.Decode(encoded);

The hard part in my mind is how you build your static dictionary. You don't want to use the LZW dictionary built from your sample data. LZW wastes a bunch of time learning, since it can't build the dictionary faster than the decompressor can (a token will only be used the second time it's seen by the compressor, so the decompressor can add it to its dictionary the first time it's seen). The flip side of this is that it's adding things to the dictionary that may never get used, just in case the string shows up again. (e.g., to have a token for 'stackoverflow' you'll also have entries for 'ac', 'ko', 've', 'rf', etc.)
However, looking at the raw token stream from an LZ77 algorithm could work well. You'll only see tokens for strings seen at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
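As a rough illustration of picking those dictionary entries (this is not real LZ77 token extraction, just a naive substring-frequency count standing in for it; all names here are made up):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Count every short substring of the sample data and keep the most common
// ones that occur at least twice, as candidate dictionary phrases.
// (Lengths and limits are illustrative; real LZ77 output would be smarter.)
static List<String> topPhrases(String sample, int minLen, int maxLen, int keep) {
    Map<String, Integer> counts = new HashMap<>();
    for (int len = minLen; len <= maxLen; len++) {
        for (int i = 0; i + len <= sample.length(); i++) {
            counts.merge(sample.substring(i, i + len), 1, Integer::sum);
        }
    }
    return counts.entrySet().stream()
            .filter(e -> e.getValue() >= 2)                // seen at least twice
            .sorted((a, b) -> b.getValue() - a.getValue()) // most frequent first
            .limit(keep)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
}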
Once you have a static dictionary, using LZW sans the dictionary update seems like an easy implementation, but to get the best compression I'd consider a static Huffman table instead of the traditional 12-bit fixed-size token (as George Phillips suggested). An LZW dictionary will burn tokens for all the substrings you may never actually encode (e.g., if you can encode 'stackoverflow', there will be tokens for 'st', 'sta', 'stac', 'stack', 'stacko', etc.).
At this point it really isn't LZW - what makes LZW clever is how the decompressor can build the same dictionary the compressor used while seeing only the compressed data stream, and that is something you won't be using. But all LZW implementations have a state where the dictionary is full and no longer updated; this is how you'd use it with your static dictionary.

LZW adds to the dictionary during decompression to ensure it has the same dictionary state as the compressor. Otherwise the decoding would not function properly.
However, if you were in a state where the dictionary was fixed then, yes, you would not need to add new codes.
Your approach will work reasonably well and it's easy to use existing tools to prototype and measure the results. That is, compress the example file and then the example and test data together. The size of the latter less the former will be the expected compressed size of a block.
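A minimal sketch of that measurement using java.util.zip.Deflater (the helper names are mine, not from any library):

import java.util.zip.Deflater;

public class BlockSizeEstimate {
    // Deflate-compress a byte[] and return only the compressed length.
    static int compressedLength(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // only the count matters here
        }
        deflater.end();
        return total;
    }

    // Expected cost of one block given the sample as implicit dictionary:
    // size(sample + block) - size(sample).
    static int estimateBlockCost(byte[] sample, byte[] block) {
        byte[] both = new byte[sample.length + block.length];
        System.arraycopy(sample, 0, both, 0, sample.length);
        System.arraycopy(block, 0, both, sample.length, block.length);
        return compressedLength(both) - compressedLength(sample);
    }
}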
LZW is a clever way to build up a dictionary on the fly and gives decent results. But a more thorough analysis of your typical data blocks is likely to generate a more efficient dictionary.
There's also room for improvement in how LZW represents compressed data. For instance, each dictionary reference could be Huffman encoded to a closer to optimal length based on the expected frequency of their use. To be truly optimal the codes should be arithmetic encoded.

I would look at your data to see if there's an obvious reason it's so easy to compress. You might be able to do something much simpler than LZ78. I've done both LZ77 (lookback) and LZ78 (dictionary).
Try running a LZ77 on your data. There's no dictionary with LZ77, so you could use a library without alteration. Deflate is an implementation of LZ77.
Your idea of using a common dictionary is a good one, but it's hard to know whether the files are similar to each other or just self-similar without doing some tests.

The right track is to use a library -- almost every modern language has a compression library: C#, Python, Perl, Java, VB.net, whatever you use.
LZW saves some space by making the dictionary depend on previous inputs. It has an initial dictionary, and when you decompress something, you add it to the dictionary -- so the dictionary is growing. (I am omitting some details here, but this is the general idea.)
You can omit this step by supplying the whole (complete) dictionary as the initial one. But this would cost some space.
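Here is a minimal sketch of LZW encoding with a fixed, preloaded dictionary (my own code, in Java; it assumes the phrase list is prefix-closed, i.e. every prefix of a phrase is also present, or the longest-match loop below will miss entries):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaticLzw {
    private final Map<String, Integer> dict;

    // Build the fixed code table once (e.g., from sample data); it is never
    // updated during encoding, so both sides can share it forever.
    public StaticLzw(List<String> phrases) {
        dict = new HashMap<>();
        for (int c = 0; c < 256; c++) dict.put(String.valueOf((char) c), c);
        for (String p : phrases) dict.putIfAbsent(p, dict.size());
    }

    // Classic LZW longest-match loop, minus the "add w+k to the dictionary" step.
    public List<Integer> encode(String input) {
        List<Integer> out = new ArrayList<>();
        String w = "";
        for (char k : input.toCharArray()) {
            String wk = w + k;
            if (dict.containsKey(wk)) {
                w = wk;                   // keep extending the current match
            } else {
                out.add(dict.get(w));     // emit the longest match found
                w = String.valueOf(k);
            }
        }
        if (!w.isEmpty()) out.add(dict.get(w));
        return out;
    }
}

Decoding is then just a reverse lookup through the same fixed table, which is why the decoder no longer needs to add entries, as noted in an earlier answer.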

I find this approach quite interesting for repeated log entries and it's something I would like to explore.
Can you share the compression statistics for using this approach for your use case so I can compare it with other alternatives?
Have you considered having the common dictionary grow over time or is that not a valid option?

Related

how to overwrite repeated field values in google protocol buffers

My question is about how to overwrite a repeated field in Google Protocol Buffers.
Example
message seltMeasureParam
{
    repeated integer val = 1;
}
I want to fill val with 255 up to 8000 times using a for loop, then I want to set values at some particular positions within the range 4000.
Filling 255 8000 times is easy; I want to know how to set val at some particular sub-range within the 4000 range.
Please help with this. Thanks in advance.
This will depend entirely on the particular language / library that you are using. For example, if I use protobuf-net, the code-gen for repeated int32 val = 1; will generate (for that member) either (syntax="proto2"; or no syntax specified):
[global::ProtoBuf.ProtoMember(1, Name = @"val")]
public int[] Vals { get; set; }
or (syntax="proto3";)
[global::ProtoBuf.ProtoMember(1, Name = @"val", IsPacked = true)]
public int[] Vals { get; set; }
So then your code is simply:
var arr = new int[8000];
for(int i = 0; i < arr.Length; i++) arr[i] = 255;
obj.Vals = arr;
and then later: just set .Vals[someIndex] = someValue; with your other values.
How that works in other libraries and languages will vary depending on the API exposed in that framework.

sum and max values in a single iteration

I have a List of custom CallRecord objects:
public class CallRecord {
    private String callId;
    private String aNum;
    private String bNum;
    private int seqNum;
    private byte causeForOutput;
    private int duration;
    private RecordType recordType;
    .
    .
    .
}
There are two logical conditions and the output of each is:
Highest seqNum, sum(duration)
Highest seqNum, sum(duration), highest causeForOutput
As per my understanding, Stream.max(), Collectors.summarizingInt() and so on will each require a separate iteration for the above result. I also came across a thread suggesting a custom collector, but I am unsure.
Below is the simple, pre-Java 8 code that is serving the purpose:
if (...) {
    for (CallRecord currentRecord : completeCallRecords) {
        highestSeqNum = currentRecord.getSeqNum() > highestSeqNum ? currentRecord.getSeqNum() : highestSeqNum;
        sumOfDuration += currentRecord.getDuration();
    }
} else {
    byte highestCauseForOutput = 0;
    for (CallRecord currentRecord : completeCallRecords) {
        highestSeqNum = currentRecord.getSeqNum() > highestSeqNum ? currentRecord.getSeqNum() : highestSeqNum;
        sumOfDuration += currentRecord.getDuration();
        highestCauseForOutput = currentRecord.getCauseForOutput() > highestCauseForOutput ? currentRecord.getCauseForOutput() : highestCauseForOutput;
    }
}
Your desire to do everything in a single iteration is irrational. You should strive for simplicity first, performance if necessary, but insisting on a single iteration is neither.
The performance depends on too many factors to make a prediction in advance. The process of iterating (over a plain collection) itself is not necessarily an expensive operation and may even benefit from a simpler loop body in a way that makes multiple traversals with a straight-forward operation more efficient than a single traversal trying to do everything at once. The only way to find out, is to measure using the actual operations.
Converting the operation to Stream operations may simplify the code, if you use it straight-forwardly, i.e.
int highestSeqNum =
    completeCallRecords.stream().mapToInt(CallRecord::getSeqNum).max().orElse(-1);
int sumOfDuration =
    completeCallRecords.stream().mapToInt(CallRecord::getDuration).sum();
if (!condition) {
    byte highestCauseForOutput = (byte)
        completeCallRecords.stream().mapToInt(CallRecord::getCauseForOutput).max().orElse(0);
}
If you still feel uncomfortable with the fact that there are multiple iterations, you could try to write a custom collector performing all operations at once, but the result will not be better than your loop, neither in terms of readability nor efficiency.
Still, I’d prefer avoiding code duplication over trying to do everything in one loop, i.e.
for (CallRecord currentRecord : completeCallRecords) {
    int nextSeqNum = currentRecord.getSeqNum();
    highestSeqNum = nextSeqNum > highestSeqNum ? nextSeqNum : highestSeqNum;
    sumOfDuration += currentRecord.getDuration();
}
if (!condition) {
    byte highestCauseForOutput = 0;
    for (CallRecord currentRecord : completeCallRecords) {
        byte next = currentRecord.getCauseForOutput();
        highestCauseForOutput = next > highestCauseForOutput ? next : highestCauseForOutput;
    }
}
With Java 8 you can resolve it with a Collector without redundant iteration.
Normally, we can use the factory methods from Collectors, but in your case you need to implement a custom Collector that reduces a Stream<CallRecord> to an instance of SummarizingCallRecord, which contains the attributes you require.
Mutable accumulation/result type:
class SummarizingCallRecord {
    private int highestSeqNum = 0;
    private int sumDuration = 0;
    // getters/setters ...
}
Custom collector:
BiConsumer<SummarizingCallRecord, CallRecord> myAccumulator = (a, callRecord) -> {
    a.setHighestSeqNum(Math.max(a.getHighestSeqNum(), callRecord.getSeqNum()));
    a.setSumDuration(a.getSumDuration() + callRecord.getDuration());
};

BinaryOperator<SummarizingCallRecord> myCombiner = (a1, a2) -> {
    a1.setHighestSeqNum(Math.max(a1.getHighestSeqNum(), a2.getHighestSeqNum()));
    a1.setSumDuration(a1.getSumDuration() + a2.getSumDuration());
    return a1;
};

Collector<CallRecord, SummarizingCallRecord, SummarizingCallRecord> myCollector =
    Collector.of(
        () -> new SummarizingCallRecord(),
        myAccumulator,
        myCombiner
        // optionally: Collector.Characteristics.CONCURRENT/IDENTITY_FINISH/UNORDERED
    );
Execution example:
List<CallRecord> callRecords = new ArrayList<>();
callRecords.add(new CallRecord(1, 100));
callRecords.add(new CallRecord(5, 50));
callRecords.add(new CallRecord(3, 1000));

SummarizingCallRecord summarizingCallRecord = callRecords.stream()
    .collect(myCollector);

// Result:
// summarizingCallRecord.highestSeqNum = 5
// summarizingCallRecord.sumDuration = 1150
You don't need to, and should not, implement the logic with the Stream API, because the traditional for-loop is simple enough and the Java 8 Stream API can't make it simpler:
int highestSeqNum = 0;
long sumOfDuration = 0;
byte highestCauseForOutput = 0; // compute it even if it may not be used; there is no performance hit.

for (CallRecord currentRecord : completeCallRecords) {
    highestSeqNum = Math.max(highestSeqNum, currentRecord.getSeqNum());
    sumOfDuration += currentRecord.getDuration();
    highestCauseForOutput = (byte) Math.max(highestCauseForOutput, currentRecord.getCauseForOutput());
}
// Do something with or without highestCauseForOutput.

Swift Dictionary slow even with optimizations: doing unnecessary retain/release?

The following code, which maps simple value holders to booleans, runs over 20x faster in Java than Swift 2 - XCode 7 beta3, "Fastest, Aggressive Optimizations [-Ofast]", and "Fast, Whole Module Optimizations" turned on. I can get over 280M lookups/sec in Java but only about 10M in Swift.
When I look at it in Instruments I see that most of the time is going into a pair of retain/release calls associated with the map lookup. Any suggestions on why this is happening or a workaround would be appreciated.
The structure of the code is a simplified version of my real code, which has a more complex key class and also stores other types (though Boolean is an actual case for me). Also, note that I am using a single mutable key instance for the retrieval to avoid allocating objects inside the loop and according to my tests this is faster in Swift than an immutable key.
EDIT: I have also tried switching to NSMutableDictionary but when used with Swift objects as keys it seems to be terribly slow.
EDIT2: I have tried implementing the test in objc (which wouldn't have the Optional unwrapping overhead) and it is faster but still over an order of magnitude slower than Java... I'm going to pose that example as another question to see if anyone has ideas.
EDIT3 - Answer. I have posted my conclusions and my workaround in an answer below.
public final class MyKey : Hashable {
    var xi : Int = 0
    init( _ xi : Int ) { set( xi ) }
    final func set( xi : Int) { self.xi = xi }
    public final var hashValue: Int { return xi }
}

public func == (lhs: MyKey, rhs: MyKey) -> Bool {
    if ( lhs === rhs ) { return true }
    return lhs.xi == rhs.xi
}
...
var map = Dictionary<MyKey,Bool>()
let range = 2500
for x in 0...range { map[ MyKey(x) ] = true }

let runs = 10
for _ in 0...runs
{
    let time = Time()
    let reps = 10000
    let key = MyKey(0)
    for _ in 0...reps {
        for x in 0...range {
            key.set(x)
            if ( map[ key ] == nil ) { XCTAssertTrue(false) }
        }
    }
    print("rate=\(time.rate( reps*range )) lookups/s")
}
and here is the corresponding Java code:
public class MyKey {
    public int xi;
    public MyKey( int xi ) { set( xi ); }
    public void set( int xi) { this.xi = xi; }

    @Override public int hashCode() { return xi; }

    @Override
    public boolean equals( Object o ) {
        if ( o == this ) { return true; }
        MyKey mk = (MyKey)o;
        return mk.xi == this.xi;
    }
}
...
Map<MyKey,Boolean> map = new HashMap<>();
int range = 2500;
for (int x=0; x<range; x++) { map.put( new MyKey(x), true ); }

int runs = 10;
for (int run=0; run<runs; run++)
{
    Time time = new Time();
    int reps = 10000;
    MyKey buffer = new MyKey( 0 );
    for (int it = 0; it < reps; it++) {
        for (int x = 0; x < range; x++) {
            buffer.set( x );
            if ( map.get( buffer ) == null ) { Assert.assertTrue( false ); }
        }
    }
    float rate = reps*range/time.s();
    System.out.println( "rate = " + rate );
}
After much experimentation I have come to some conclusions and found a workaround (albeit somewhat extreme).
First let me say that I recognize that this kind of very fine-grained data structure access within a tight loop is not representative of general performance, but it does affect my application, and I imagine it affects others like games and heavily numeric applications. Also let me say that I know that Swift is a moving target and I'm sure it will improve - perhaps my workarounds (hacks) below will not be necessary by the time you read this. But if you are trying to do something like this today, and you are looking at Instruments and seeing the majority of your application time spent in retain/release, and you don't want to rewrite your entire app in objc, please read on.
What I have found is that almost anything that one does in Swift that touches an object reference incurs an ARC retain/release penalty. Additionally Optional values - even optional primitives - also incur this cost. This pretty much rules out using Dictionary or NSDictionary.
Here are some things that are fast that you can include in a workaround:
a) Arrays of primitive types.
b) Arrays of final objects, as long as the array is on the stack and not on the heap. E.g. declare an array within the method body (but outside of your loop, of course) and iteratively copy the values to it. Do not Array(array) copy it.
Putting this together you can construct a data structure based on arrays that stores e.g. Ints, and then store array indexes to your objects in that data structure. Within your loop you can look up the objects by their index in the fast local array. Before you ask "couldn't the data structure store the array for me" - no, because that would incur two of the penalties I mentioned above :(
All things considered this workaround is not too bad - if you can enumerate the entities that you want to store in the Dictionary / data structure, you should be able to host them in an array as described. Using the technique above I was able to exceed the Java performance by a factor of 2x in Swift in my case.
If anyone is still reading and interested at this point I will consider updating my example code and posting.
EDIT: I'd add an option: c) It is also possible to use UnsafeMutablePointer<> or Unmanaged<> in Swift to create a reference that will not be retained when passed around. I was not aware of this when I started and I would hesitate to recommend it in general because it's a hack, but I've used it in a few cases to wrap a heavily used array that was incurring a retain/release every time it was referenced.

Build trie faster

I'm making a mobile app which needs thousands of fast string lookups and prefix checks. To speed this up, I made a Trie out of my word list, which has about 180,000 words.
Everything's great, but the only problem is that building this huge trie (it has about 400,000 nodes) takes about 10 seconds currently on my phone, which is really slow.
Here's the code that builds the trie.
public SimpleTrie makeTrie(String file) throws Exception {
    String line;
    SimpleTrie trie = new SimpleTrie();

    BufferedReader br = new BufferedReader(new FileReader(file));
    while ( (line = br.readLine()) != null) {
        trie.insert(line);
    }
    br.close();
    return trie;
}
The insert method, which runs in O(length of key):
public void insert(String key) {
    TrieNode crawler = root;
    for (int level=0 ; level < key.length() ; level++) {
        int index = key.charAt(level) - 'A';
        if (crawler.children[index] == null) {
            crawler.children[index] = getNode();
        }
        crawler = crawler.children[index];
    }
    crawler.valid = true;
}
I'm looking for intuitive methods to build the trie faster. Maybe I build the trie just once on my laptop, store it somehow to the disk, and load it from a file in the phone? But I don't know how to implement this.
Or are there any other prefix data structures which will take less time to build, but have similar lookup time complexity?
Any suggestions are appreciated. Thanks in advance.
EDIT
Someone suggested using Java Serialization. I tried it, but it was very slow with this code:
public void serializeTrie(SimpleTrie trie, String file) {
    try {
        ObjectOutput out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
        out.writeObject(trie);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public SimpleTrie deserializeTrie(String file) {
    try {
        ObjectInput in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)));
        SimpleTrie trie = (SimpleTrie)in.readObject();
        in.close();
        return trie;
    } catch (IOException | ClassNotFoundException e) {
        e.printStackTrace();
        return null;
    }
}
Can this above code be made faster?
My trie: http://pastebin.com/QkFisi09
Word list: http://www.isc.ro/lists/twl06.zip
Android IDE used to run code: http://play.google.com/store/apps/details?id=com.jimmychen.app.sand
Double-Array tries are very fast to save/load because all data is stored in linear arrays. They are also very fast to lookup, but the insertions can be costly. I bet there is a Java implementation somewhere.
Also, if your data is static (i.e. you don't update it on phone) consider DAFSA for your task. It is one of the most efficient data structures for storing words (must be better than "standard" tries and radix tries both for size and for speed, better than succinct tries for speed, often better than succinct tries for size). There is a good C++ implementation: dawgdic - you can use it to build DAFSA from command line and then use a Java reader for the resulting data structure (example implementation is here).
You could store your trie as an array of nodes, with references to child nodes replaced with array indices. Your root node would be the first element. That way, you could easily store/load your trie from simple binary or text format.
public class SimpleTrie {
    public static class TrieNode {
        boolean valid;
        int[] children; // indices into nodes[]; 0 means "no child"
    }

    private TrieNode[] nodes; // preallocated; nodes[0] is the root
    private int numberOfNodes;

    private TrieNode getNode() {
        TrieNode t = nodes[++numberOfNodes];
        return t;
    }
}
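For instance, save/load methods added inside SimpleTrie could look like this (a sketch of mine, assuming children is an int[26] of node indices with 0 meaning no child):

// requires: import java.io.*;
void save(String file) throws IOException {
    try (DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(file)))) {
        out.writeInt(numberOfNodes);
        for (int i = 0; i < numberOfNodes; i++) {
            out.writeBoolean(nodes[i].valid);
            for (int c = 0; c < 26; c++) {
                out.writeInt(nodes[i].children[c]); // 0 = no child
            }
        }
    }
}

void load(String file) throws IOException {
    try (DataInputStream in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(file)))) {
        numberOfNodes = in.readInt();
        nodes = new TrieNode[numberOfNodes + 1];
        for (int i = 0; i < numberOfNodes; i++) {
            nodes[i] = new TrieNode();
            nodes[i].valid = in.readBoolean();
            nodes[i].children = new int[26];
            for (int c = 0; c < 26; c++) {
                nodes[i].children[c] = in.readInt();
            }
        }
    }
}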
Just build a large String[] and sort it. Then you can use binary search to find the location of a String. You can also do a query based on prefixes without too much work.
Prefix look-up example:
Compare method:
private static int compare(String string, String prefix) {
    if (prefix.length() > string.length()) return Integer.MIN_VALUE;
    for (int i=0; i<prefix.length(); i++) {
        char s = string.charAt(i);
        char p = prefix.charAt(i);
        if (s != p) {
            if (p < s) {
                // prefix is before string
                return -1;
            }
            // prefix is after string
            return 1;
        }
    }
    return 0;
}
Finds an occurrence of the prefix in the array and returns its location (MIN_VALUE or MAX_VALUE means not found):
private static int recursiveFind(String[] strings, String prefix, int start, int end) {
    if (start == end) {
        String lastValue = strings[start]; // start==end
        if (compare(lastValue, prefix) == 0)
            return start; // start==end
        return Integer.MAX_VALUE;
    }

    int low = start;
    int high = end + 1; // zero indexed, so add one.
    int middle = low + ((high - low) / 2);

    String middleValue = strings[middle];
    int comp = compare(middleValue, prefix);
    if (comp == Integer.MIN_VALUE) return comp;
    if (comp == 0)
        return middle;
    if (comp > 0)
        return recursiveFind(strings, prefix, middle + 1, end);
    return recursiveFind(strings, prefix, start, middle - 1);
}
Gets a String array and prefix, prints out occurrences of prefix in array
private static boolean testPrefix(String[] strings, String prefix) {
    int i = recursiveFind(strings, prefix, 0, strings.length-1);
    if (i == Integer.MAX_VALUE || i == Integer.MIN_VALUE) {
        // not found
        return false;
    }

    // Found an occurrence, now search up and down for other occurrences
    int up = i+1;
    int down = i;
    while (down >= 0) {
        String string = strings[down];
        if (compare(string, prefix) == 0) {
            System.out.println(string);
        } else {
            break;
        }
        down--;
    }
    while (up < strings.length) {
        String string = strings[up];
        if (compare(string, prefix) == 0) {
            System.out.println(string);
        } else {
            break;
        }
        up++;
    }
    return true;
}
Here's a reasonably compact format for storing a trie on disk. I'll specify it by its (efficient) deserialization algorithm. Initialize a stack whose initial contents are the root node of the trie. Read characters one by one and interpret them as follows. The meaning of a letter A-Z is "allocate a new node, make it a child of the current top of stack, and push the newly allocated node onto the stack". The letter indicates which position the child is in. The meaning of a space is "set the valid flag of the node on top of the stack to true". The meaning of a backspace (\b) is "pop the stack".
For example, the input
TREE \b\bIE \b\b\bOO \b\b\b
gives the word list
TREE
TRIE
TOO
On your desktop, construct the trie using whichever method you like, then serialize it with the following recursive algorithm (pseudocode):
serialize(node):
    if node is valid: put(' ')
    for letter in A-Z:
        if node has a child under letter:
            put(letter)
            serialize(child)
            put('\b')
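A sketch of the corresponding deserializer in Java (assuming a plain pointer-based TrieNode with a children[26] array and a valid flag):

import java.util.ArrayDeque;
import java.util.Deque;

static TrieNode deserialize(String data) {
    TrieNode root = new TrieNode();
    Deque<TrieNode> stack = new ArrayDeque<>();
    stack.push(root);
    for (char c : data.toCharArray()) {
        if (c == ' ') {
            stack.peek().valid = true;        // space: mark a complete word
        } else if (c == '\b') {
            stack.pop();                      // backspace: back up one level
        } else {
            TrieNode child = new TrieNode();  // A-Z: attach a child and descend
            stack.peek().children[c - 'A'] = child;
            stack.push(child);
        }
    }
    return root;
}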
This isn't a magic bullet, but you can probably reduce your runtime slightly by doing one big memory allocation instead of a bunch of little ones.
I saw a ~10% speedup in the test code below (C++, not Java, sorry) when I used a "node pool" instead of relying on individual allocations:
#include <string>
#include <fstream>

#define USE_NODE_POOL

#ifdef USE_NODE_POOL
struct Node;
Node *node_pool;
int node_pool_idx = 0;
#endif

struct Node {
    void insert(const std::string &s) { insert_helper(s, 0); }
    void insert_helper(const std::string &s, int idx) {
        if (idx >= s.length()) return;
        int char_idx = s[idx] - 'A';
        if (children[char_idx] == nullptr) {
#ifdef USE_NODE_POOL
            children[char_idx] = &node_pool[node_pool_idx++];
#else
            children[char_idx] = new Node();
#endif
        }
        children[char_idx]->insert_helper(s, idx + 1);
    }
    Node *children[26] = {};
};

int main() {
#ifdef USE_NODE_POOL
    node_pool = new Node[400000];
#endif
    Node n;
    std::ifstream fin("TWL06.txt");
    std::string word;
    while (fin >> word) n.insert(word);
}
Tries that preallocate space for all possible children (256) have a huge amount of wasted space. You are making your cache cry. Store those pointers to children in a resizable data structure.
Some tries will optimize by having one node to represent a long string, and break that string up only when needed.
Instead of a simple file you can use a database like SQLite and a nested set or Celko tree to store the trie, and you can also build a faster and shorter (fewer nodes) trie with a ternary search trie.
I don't like the idea of addressing nodes by index in an array, but only because it requires one more addition (index to the pointer). But with an array of preallocated nodes you will maybe save some time on allocation and initialization. And you can also save a lot of space by reserving the first 26 indices for leaf nodes. Thus you'll not need to allocate and initialize 180,000 leaf nodes.
Also with indices you will be able to read the prepared nodes array from disk in binary format. This has to be several times faster. But I'm not sure how to do this in your language. Is this Java?
If your source vocabulary is sorted, you may also save some time by comparing some prefix of the current string with the previous one, e.g. the first 4 characters. If they are equal you can start your
for(int level=0 ; level < key.length() ; level++) {
loop from the 5th level.
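A sketch of that sorted-input shortcut (my own code, reusing the asker's TrieNode, root, and getNode(); the path length of 64 assumes no word is longer than 63 characters):

private TrieNode[] path = new TrieNode[64]; // path[i] = node reached after i characters
private String prev = "";

public void insertSorted(String key) {
    path[0] = root;
    // find how many leading characters this word shares with the previous one
    int common = 0;
    int max = Math.min(prev.length(), key.length());
    while (common < max && prev.charAt(common) == key.charAt(common)) {
        common++;
    }
    // resume crawling at the deepest shared node instead of at the root
    TrieNode crawler = path[common];
    for (int level = common; level < key.length(); level++) {
        int index = key.charAt(level) - 'A';
        if (crawler.children[index] == null) {
            crawler.children[index] = getNode();
        }
        crawler = crawler.children[index];
        path[level + 1] = crawler;
    }
    crawler.valid = true;
    prev = key;
}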
Is it space inefficient or time inefficient? If you are rolling a plain trie then space may be part of the problem when dealing with a mobile device. Check out patricia/radix tries, especially if you are using it as a prefix look-up tool.
Trie:
http://en.wikipedia.org/wiki/Trie
Patricia/Radix trie:
http://en.wikipedia.org/wiki/Radix_tree
You didn't mention a language but here are two implementations of prefix tries in Java.
Regular trie:
http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/Trie.java
Patricia/Radix (space-effecient) trie:
http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/PatriciaTrie.java
Generally speaking, avoid creating lots of objects from scratch in Java; it is slow and carries massive overhead. Better to implement your own pooling class for memory management that allocates, say, half a million entries at a time in one go.
Also, serialization is too slow for large lexicons. Use a binary read to quickly populate the array-based representations proposed above.
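For example, a minimal node pool might look like this (a sketch; the class name and the 500,000 capacity are illustrative):

// Create all nodes up front in one pass and hand them out sequentially,
// replacing per-insert allocation in getNode().
class TrieNodePool {
    private final TrieNode[] pool;
    private int next = 0;

    TrieNodePool(int capacity) {
        pool = new TrieNode[capacity];
        for (int i = 0; i < capacity; i++) {
            pool[i] = new TrieNode();
        }
    }

    TrieNode get() {
        return pool[next++]; // no bounds check; size the pool generously
    }
}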

Simple encryption algorithm for homework: not getting decryption working properly

This is a homework question that I can't get my head around at all.
It's a very simple encryption algorithm. You start with a string of characters as your alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!, .
Then you ask the user to enter their own string that will act as a map, such as:
0987654321! .,POIUYTREWQASDFGHJKLMNBVCXZ
Then the program uses this to make a map and allows you to enter text that gets encrypted.
For example MY NAME IS JOSEPH would be encrypted as .AX,0.6X2YX1PY6O3
This is all very easy. However, he said that it's a one-to-one mapping and thus implied that if I enter .AX,0.6X2YX1PY6O3 back into the program I will get out MY NAME IS JOSEPH
This doesn't happen, because .AX,0.6X2YX1PY6O3 becomes Z0QCDZQGAQFOALDH
The mapping only decrypts when you go backwards, but the question implies that the program just loops and runs the one algorithm every time.
Even if someone could tell me that it is possible I would be happy. I have pages and pages of paper filled with possible workings, but I came up with nothing; the only solution I see is to run the algorithm backwards, and I don't think we are allowed to do that.
Any ideas?
Edit:
Unfortunately I can't get this to work (Using the orbit computation idea) What am I doing wrong?
//import scanner class
import java.util.Scanner;

public class Encryption {
    static Scanner inputString = new Scanner(System.in);
    //define alphabet
    private static String alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!, .";
    private static String map;
    private static int[] encryptionMap = new int[40];//mapping int array
    private static boolean exit = false;
    private static boolean valid = true;

    public static void main(String[] args) {
        String encrypt, userInput;
        userInput = new String();

        System.out.println("This program takes a large reordered string");
        System.out.println("and uses it to encrypt your data");
        System.out.println("Please enter a mapping string of 40 length and the same characters as below but in different order:");
        System.out.println(alpha);

        //getMap();//don't get user input for map, for testing!
        map = ".ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!, ";//forced input for testing only!

        do {
            if (valid == true) {
                System.out.println("Enter Q to quit, otherwise enter a string:");
                userInput = getInput();
                if (userInput.charAt(0) != 'Q') {//&& userInput.length()<2){
                    encrypt = encrypt(userInput);
                    for (int x=0; x<39; x++) {//here I am trying to get the orbit computation going
                        encrypt = encrypt(encrypt);
                    }
                    System.out.println("You entered: "+userInput);
                    System.out.println("Encrypted Version: "+encrypt);
                } else if (userInput.charAt(0) == 'Q') {//&& userInput.length()<2){
                    exit = true;
                }
            }
            else if (valid == false) {
                System.out.println("Error, your string for mapping is incorrect");
                valid = true;//reset condition to repeat
            }
        } while (exit == false);
        System.out.println("Good bye");
    }

    static String encrypt(String userInput){
        //use mapping array to encrypt data
        String encrypt;
        StringBuffer tmp = new StringBuffer();
        char current;
        int alphaPosition;
        int temp;
        //run through the user string
        for (int x=0; x<userInput.length(); x++){
            //get character
            current = userInput.charAt(x);
            //get location of current character in alphabet
            alphaPosition = alpha.indexOf(current);
            //encryptionMap.charAt(alphaPosition)
            tmp.append(map.charAt(alphaPosition));
        }
        encrypt = tmp.toString();
        return(encrypt);
    }

    static void getMap(){
        //get a mapping string and validate from the user
        map = getInput();
        //validate code
        if (map.length() != 40){
            valid = false;
        }
        else{
            for (int x=0; x<40; x++){
                if (map.indexOf(alpha.charAt(x)) == -1){
                    valid = false;
                }
            }
        }
        if (valid == true){
            for (int x=0; x<40; x++){
                int a = (int)(alpha.charAt(x));
                int y = (int)(map.charAt(x));
                //create encryption map
                encryptionMap[x] = (a - y);
            }
        }
    }

    static String getInput(){
        //get input (this repeats)
        String input = inputString.nextLine();
        input = input.toUpperCase();
        if ("QUIT".equals(input) || "END".equals(input) || "NO".equals(input) || "N".equals(input)){
            StringBuffer tmp = new StringBuffer();
            tmp.append('Q');
            input = tmp.toString();
        }
        return(input);
    }
}
You will (probably) not get your original string back if you apply that substitution again. I say probably because you can construct such inputs (they all do things like if A->B then B->A). But most inputs won't do that. You would have to construct the reverse map to decrypt.
However, there is a trick you can do if you're only allowed to go forward. Keep applying the mapping and you'll eventually return to your original input. The number of times you'll have to do that depends on your input. To figure out how many times, compute the orbit of each character, and take the least common multiple of all the orbit sizes. For your input the orbits are size 1 (T->T, W->W), 2 (B->9->B H->3->H U->R->U P->O->P), 4 (C->8->N->,->C), 9 (A->...->Y->A), and 17 (E->...->V->E). The LCM of all those is 612, so 611 forward mappings applied to the ciphertext will return you to the plaintext.
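A sketch of that computation (helper names are mine, not from the original post): find each character's cycle length under the map, then take the least common multiple.

static long forwardRounds(String alpha, String map) {
    long lcm = 1;
    for (int i = 0; i < alpha.length(); i++) {
        int orbit = 0;
        int j = i;
        do {
            j = alpha.indexOf(map.charAt(j)); // follow alpha[j] -> map[j]
            orbit++;
        } while (j != i);
        lcm = lcm / gcd(lcm, orbit) * orbit;
    }
    return lcm; // applying the map this many times is the identity
}

static long gcd(long a, long b) {
    return b == 0 ? a : gcd(b, a % b);
}

For the map in the question this should return 612, matching the count above: one encryption plus 611 further forward applications recovers the plaintext.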
Well, you can get your string back this way only if you do reverse mapping. One to one mapping means that a single letter of your default alphabet maps to only one letter of your new alphabet and vice versa. I.e. you can't map ABCD to ABBA. It doesn't imply that you can get your initial string by doing a second round of encryption.
The thing you have described can be achieved if you use a finite alphabet and a displacement to encode your string. You can choose the displacement in such a way that, after a number of rounds of encryption, totalDisplacement mod alphabetSize == 0. Then you will get your string back going only forward.
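A toy sketch of that displacement idea (my own code): with a 40-character alphabet, shifting by 8 each round returns the original after 5 rounds, since 5 * 8 mod 40 == 0.

static String shift(String alpha, String text, int disp) {
    StringBuilder sb = new StringBuilder();
    for (char c : text.toCharArray()) {
        // replace each character with the one disp positions later, cyclically
        sb.append(alpha.charAt((alpha.indexOf(c) + disp) % alpha.length()));
    }
    return sb.toString();
}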
