Sum and max values in a single iteration - Java 8

I have a List of custom CallRecord objects:
public class CallRecord {
private String callId;
private String aNum;
private String bNum;
private int seqNum;
private byte causeForOutput;
private int duration;
private RecordType recordType;
.
.
.
}
There are two logical conditions and the output of each is:
Highest seqNum, sum(duration)
Highest seqNum, sum(duration), highest causeForOutput
As per my understanding, Stream.max(), Collectors.summarizingInt() and the like would each require a separate iteration to produce the above results. I also came across a thread suggesting a custom collector, but I am unsure about it.
Below is the simple, pre-Java 8 code that is serving the purpose:
if (...) {
for (CallRecord currentRecord : completeCallRecords) {
highestSeqNum = currentRecord.getSeqNum() > highestSeqNum ? currentRecord.getSeqNum() : highestSeqNum;
sumOfDuration += currentRecord.getDuration();
}
} else {
byte highestCauseForOutput = 0;
for (CallRecord currentRecord : completeCallRecords) {
highestSeqNum = currentRecord.getSeqNum() > highestSeqNum ? currentRecord.getSeqNum() : highestSeqNum;
sumOfDuration += currentRecord.getDuration();
highestCauseForOutput = currentRecord.getCauseForOutput() > highestCauseForOutput ? currentRecord.getCauseForOutput() : highestCauseForOutput;
}
}

Your desire to do everything in a single iteration is irrational. You should strive for simplicity first, performance if necessary, but insisting on a single iteration is neither.
The performance depends on too many factors to make a prediction in advance. The process of iterating (over a plain collection) is not necessarily an expensive operation in itself, and it may even benefit from a simpler loop body, such that multiple traversals each doing a straightforward operation end up more efficient than a single traversal trying to do everything at once. The only way to find out is to measure, using the actual operations.
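To make "measure" concrete, here is a minimal JMH sketch, assuming the org.openjdk.jmh dependency and a hypothetical loadSampleRecords() helper, that would compare the combined loop with two separate stream passes:
import java.util.List;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class TraversalBench {
    List<CallRecord> records;

    @Setup
    public void setUp() {
        records = loadSampleRecords(); // hypothetical: load a realistic data set
    }

    @Benchmark
    public long singleLoop() { // one traversal doing both operations
        int highestSeqNum = 0;
        long sumOfDuration = 0;
        for (CallRecord r : records) {
            highestSeqNum = Math.max(highestSeqNum, r.getSeqNum());
            sumOfDuration += r.getDuration();
        }
        return sumOfDuration + highestSeqNum;
    }

    @Benchmark
    public long twoStreams() { // two traversals, each with a simple body
        int max = records.stream().mapToInt(CallRecord::getSeqNum).max().orElse(-1);
        long sum = records.stream().mapToInt(CallRecord::getDuration).sum();
        return sum + max;
    }
}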
Converting the operation to Stream operations may simplify the code if you use them straightforwardly, i.e.
int highestSeqNum=
completeCallRecords.stream().mapToInt(CallRecord::getSeqNum).max().orElse(-1);
int sumOfDuration=
completeCallRecords.stream().mapToInt(CallRecord::getDuration).sum();
if(!condition) {
byte highestCauseForOutput = (byte)
completeCallRecords.stream().mapToInt(CallRecord::getCauseForOutput).max().orElse(0);
}
If you still feel uncomfortable with the fact that there are multiple iterations, you could try to write a custom collector performing all operations at once, but the result will not be better than your loop, neither in terms of readability nor efficiency.
Still, I’d prefer avoiding code duplication over trying to do everything in one loop, i.e.
for(CallRecord currentRecord : completeCallRecords) {
int nextSeqNum = currentRecord.getSeqNum();
highestSeqNum = nextSeqNum > highestSeqNum ? nextSeqNum : highestSeqNum;
sumOfDuration += currentRecord.getDuration();
}
if(!condition) {
byte highestCauseForOutput = 0;
for(CallRecord currentRecord : completeCallRecords) {
byte next = currentRecord.getCauseForOutput();
highestCauseForOutput = next > highestCauseForOutput? next: highestCauseForOutput;
}
}

With Java 8 you can solve it with a Collector, without any redundant iteration.
Normally we can use the factory methods from Collectors, but in your case you need to implement a custom Collector that reduces a Stream<CallRecord> to an instance of SummarizingCallRecord, which contains the attributes you require.
Mutable accumulation/result type:
class SummarizingCallRecord {
private int highestSeqNum = 0;
private int sumDuration = 0;
// getters/setters ...
}
Custom collector:
BiConsumer<SummarizingCallRecord, CallRecord> myAccumulator = (a, callRecord) -> {
a.setHighestSeqNum(Math.max(a.getHighestSeqNum(), callRecord.getSeqNum()));
a.setSumDuration(a.getSumDuration() + callRecord.getDuration());
};
BinaryOperator<SummarizingCallRecord> myCombiner = (a1, a2) -> {
a1.setHighestSeqNum(Math.max(a1.getHighestSeqNum(), a2.getHighestSeqNum()));
a1.setSumDuration(a1.getSumDuration() + a2.getSumDuration());
return a1;
};
Collector<CallRecord, SummarizingCallRecord, SummarizingCallRecord> myCollector =
Collector.of(
() -> new SummarizingCallRecord(),
myAccumulator,
myCombiner
// optional characteristics: Collector.Characteristics.CONCURRENT / IDENTITY_FINISH / UNORDERED
);
Execution example:
List<CallRecord> callRecords = new ArrayList<>();
callRecords.add(new CallRecord(1, 100));
callRecords.add(new CallRecord(5, 50));
callRecords.add(new CallRecord(3, 1000));
SummarizingCallRecord summarizingCallRecord = callRecords.stream()
.collect(myCollector);
// Result:
// summarizingCallRecord.highestSeqNum = 5
// summarizingCallRecord.sumDuration = 1150

You don't need to, and should not, implement this logic with the Stream API, because the traditional for-loop is simple enough and the Java 8 Stream API can't make it simpler:
int highestSeqNum = 0;
long sumOfDuration = 0;
byte highestCauseForOutput = 0; // compute it even if it may not be used; there is no performance cost.
for(CallRecord currentRecord : completeCallRecords) {
highestSeqNum = Math.max(highestSeqNum, currentRecord.getSeqNum());
sumOfDuration += currentRecord.getDuration();
highestCauseForOutput = (byte) Math.max(highestCauseForOutput, currentRecord.getCauseForOutput()); // cast needed: Math.max promotes bytes to int
}
// Do something with or without highestCauseForOutput.


Algorithm / data structure for resolving nested interpolated values in this example?

I am working on a compiler and one aspect currently is how to wait for interpolated variable names to be resolved. So I am wondering how to take a nested interpolated variable string and build some sort of simple data model/schema for unwrapping the evaluated string so to speak. Let me demonstrate.
Say we have a string like this:
foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}
That has 1, 2, and 3 levels of nested interpolations in it. So essentially it should resolve something like this:
wait for x, y, one, two, and c to resolve.
when both x and y resolve, then resolve a{x}-{y} immediately.
when both one and two resolve, resolve baz{one}-{two}.
when a{x}-{y}, baz{one}-{two}, and c all resolve, then finally resolve the whole expression.
I am shaky on my understanding of the logic flow for handling something like this, wondering if you could help solidify/clarify the general algorithm (high level pseudocode or something like that). Mainly just looking for how I would structure the data model and algorithm so as to progressively evaluate when the pieces are ready.
I'm starting out trying and it's not clear what to do next:
{
dependencies: [
{
path: [x]
},
{
path: [y]
}
],
parent: {
dependency: a{x}-{y} // interpolated term
parent: {
dependencies: [
{
}
]
}
}
}
Some sort of tree is probably necessary, but I am having trouble figuring out what it might look like, wondering if you could shed some light on that with some pseudocode (or JavaScript even).
watch the leaf nodes at first
then, when the children of a node are completed, propagate upward and resolve the next parent node. This would mean that once x and y are done, it could resolve a{x}-{y}, but it would then wait until the other nodes are ready before doing the final top-level evaluation.
You can just simulate it by sending "events" to the system theoretically, like:
ready('y')
ready('c')
ready('x')
ready('a{x}-{y}')
function ready(variable) {
if ()
}
...actually that may not work, not sure how to handle the interpolated nodes in a hacky way like that. But even a high level description of how to solve this would be helpful.
export type SiteDependencyObserverParentType = {
observer: SiteDependencyObserverType
remaining: number
}
export type SiteDependencyObserverType = {
children: Array<SiteDependencyObserverType>
node: LinkNodeType
parent?: SiteDependencyObserverParentType
path: Array<string>
}
(What I'm currently thinking, some TypeScript)
Here is an approach in JavaScript:
Parse the input string to create a Node instance for each {} term, and create parent-child dependencies between the nodes.
Collect the leaf Nodes of this tree as the tree is being constructed: group these leaf nodes by their identifier. Note that the same identifier could occur multiple times in the input string, leading to multiple Nodes. If a variable x is resolved, then all Nodes with that name (the group) will be resolved.
Each node has a resolve method to set its final value.
Each node has a notify method that any of its child nodes can call in order to notify it that the child has been resolved with a value. This may (or may not yet) lead to a cascading call of resolve.
In a demo, a timer is set up that at every tick will resolve a randomly picked variable to some number.
I think that in your example, foo, and a might be functions that need to be called, but I didn't elaborate on that, and just considered them as literal text that does not need further treatment. It should not be difficult to extend the algorithm with such function-calling features.
class Node {
constructor(parent) {
this.source = ""; // The slice of the input string that maps to this node
this.texts = []; // Literal text that's not part of interpolation
this.children = []; // Node instances corresponding to interpolation
this.parent = parent; // Link to parent that should get notified when this node resolves
this.value = undefined; // Not yet resolved
}
isResolved() {
return this.value !== undefined;
}
resolve(value) {
if (this.isResolved()) return; // A node is not allowed to resolve twice: ignore
console.log(`Resolving "${this.source}" to "${value}"`);
this.value = value;
if (this.parent) this.parent.notify();
}
notify() {
// Check if all dependencies have been resolved
let value = "";
for (let i = 0; i < this.children.length; i++) {
const child = this.children[i];
if (!child.isResolved()) { // Not ready yet
console.log(`"${this.source}" is getting notified, but not all dependencies are ready yet`);
return;
}
value += this.texts[i] + child.value;
}
console.log(`"${this.source}" is getting notified, and all dependencies are ready:`);
this.resolve(value + this.texts.at(-1));
}
}
function makeTree(s) {
const leaves = {}; // nodes keyed by atomic names (like "x" "y" in the example)
const tokens = s.split(/([{}])/);
let i = 0; // Index in s
function dfs(parent=null) {
const node = new Node(parent);
const start = i;
while (tokens.length) {
const token = tokens.shift();
i += token.length;
if (token == "}") break;
if (token == "{") {
node.children.push(dfs(node));
} else {
node.texts.push(token);
}
}
node.source = s.slice(start, i - (tokens.length ? 1 : 0));
if (node.children.length == 0) { // It's a leaf
const label = node.texts[0];
leaves[label] ??= []; // Define as empty array if not yet defined
leaves[label].push(node);
}
return node;
}
dfs();
return leaves;
}
// ------------------- DEMO --------------------
let s = "foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}";
const leaves = makeTree(s);
// Create a random order in which to resolve the atomic variables:
function shuffle(array) {
for (var i = array.length - 1; i > 0; i--) {
var j = Math.floor(Math.random() * (i + 1));
[array[j], array[i]] = [array[i], array[j]];
}
return array;
}
const names = shuffle(Object.keys(leaves));
// Use a timer to resolve the variables one by one in the given random order
let index = 0;
function resolveRandomVariable() {
if (index >= names.length) return; // all done
console.log("\n---------------- timer tick --------------");
const name = names[index++];
console.log(`Variable ${name} gets a value: "${index}". Calling resolve() on the connected node instance(s):`);
for (const node of leaves[name]) node.resolve(index);
setTimeout(resolveRandomVariable, 1000);
}
setTimeout(resolveRandomVariable, 1000);
Your idea of building a dependency tree is really likeable.
Anyway, I tried to find the simplest possible solution.
Even if it already works, there are many possible optimizations; take this just as a proof of concept.
The background idea is to produce a List of Strings which you can read in order, where each element is a piece you need to resolve progressively. Each element may be required to resolve something that comes later in the List, and hence the overall expression. Once you have resolved all the chunks, you have all the pieces needed to resolve your original expression.
It's written in Java, I hope it's understandable.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
public class StackOverflow {
public static void main(String[] args) {
String exp = "foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}";
List<String> chunks = expToChunks(exp);
//it just reverses the order of the list
Collections.reverse(chunks);
System.out.println(chunks);
//output -> [c, two, one, baz{one}-{two}, y, x, a{x}-{y}]
}
public static List<String> expToChunks(String exp) {
List<String> chunks = new ArrayList<>();
//this first piece just finds the first opening brace and its matching closing brace
int begin = exp.indexOf("{") + 1;
int numberOfParenthesys = 1;
int end = -1;
for(int i = begin; i < exp.length(); i++) {
char c = exp.charAt(i);
if (c == '{') numberOfParenthesys ++;
if (c == '}') numberOfParenthesys --;
if (numberOfParenthesys == 0) {
end = i;
break;
}
}
//this if puts an end to the recursive calls
if(begin > 0 && begin < exp.length() && end > 0) {
//add the chunk to the final list
String substring = exp.substring(begin, end);
chunks.add(substring);
//remove from the starting expression the already considered chunk
String newExp = exp.replace("{" + substring + "}", "");
//recursive call for inner element on the chunk found
chunks.addAll(Objects.requireNonNull(expToChunks(substring)));
//calculate other chunks on the remaining expression
chunks.addAll(Objects.requireNonNull(expToChunks(newExp)));
}
return chunks;
}
}
Some details on the code:
The following piece finds the begin and end index of the first outer chunk of the expression. The background idea is: in a valid expression the number of opening braces must equal the number of closing braces, and the running count of opening (+1) and closing (-1) braces can never go negative.
So, using that simple loop, once I find the count of braces to be 0, I have also found the first chunk of the expression.
int begin = exp.indexOf("{") + 1;
int numberOfParenthesys = 1;
int end = -1;
for(int i = begin; i < exp.length(); i++) {
char c = exp.charAt(i);
if (c == '{') numberOfParenthesys ++;
if (c == '}') numberOfParenthesys --;
if (numberOfParenthesys == 0) {
end = i;
break;
}
}
The if condition validates the begin and end indexes and stops the recursion when no more chunks can be found in the remaining expression.
if(begin > 0 && begin < exp.length() && end > 0) {
...
}
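For illustration, here is how the recursion unfolds on the example input: with exp = "foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}", begin lands just after the first "{" and the counter reaches 0 at its matching "}", so the first chunk is a{x}-{y}. The recursive call on that chunk yields x and y, and the recursive call on the remaining expression foo-{baz{one}-{two}}-foo{c} yields baz{one}-{two}, one, two and c. Reversing the accumulated list produces [c, two, one, baz{one}-{two}, y, x, a{x}-{y}], matching the output shown in main.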

Java 8 Collections max size with count

I have a class like below.
class Student {
public Student(Set<String> seminars) {
this.seminar = seminars;
}
Set<String> seminar;
public Set<String> getSeminar()
{
return seminar;
}
}
And created a set of students like below.
List<Student> students = new ArrayList<Student>();
Set<String> seminars = new HashSet<String>();
seminars.add("SeminarA");
seminars.add("SeminarB");
seminars.add("SeminarC");
students.add(new Student(seminars)); //Student 1 - 3 seminars
students.add(new Student(seminars)); //Student 2 - 3 seminars
seminars = new HashSet<String>();
seminars.add("SeminarA");
seminars.add("SeminarB");
students.add(new Student(seminars)); //Student 3 - 2 seminars
seminars = new HashSet<String>();
students.add(new Student(seminars)); //Student 4 - 0 seminars
Now the question is: "I'm trying to get the count of students who have attended the maximum number of seminars." As you can see, there are 2 students who have attended 3 (the maximum) seminars, so I need to get that count.
I achieved this using 2 separate stream statements:
OptionalInt max = students.stream()
.map(Student::getSeminar)
.mapToInt(Set::size)
.max();
long count = students.stream()
.map(Student::getSeminar)
.filter(size -> size.size() == max.getAsInt())
.count();
Is there a way to achieve the same using one statement?
To solve this, you can use the following code:
students.stream()
.map(Student::getSeminar)
.collect(Collectors.groupingBy(Set::size, Collectors.counting()));
Your output will be:
{0=1, 2=1, 3=2}
As you can see, there are two students that are going to the maximum number of seminars which is three.
First we map each Student object in the stream to its set of seminars, and then we use the Collectors.groupingBy() method to group the sets by size.
If you want to get only the number of such students, take the count associated with the largest set size (taking the max of the counts themselves would be wrong whenever the largest group of students is not the group attending the most seminars):
students.stream()
.map(Student::getSeminar)
.collect(Collectors.groupingBy(Set::size, Collectors.counting()))
.entrySet().stream()
.max(Map.Entry.comparingByKey())
.map(Map.Entry::getValue)
.get();
Your output will be: 2.
In most practical situations, your approach of traversing the list twice is the best one. It is simple and the code is easy to understand. Fight the urge to conserve statements -- you don't get charged per semicolon!
However, there are some conceivable scenarios when the source data cannot be traversed twice. Perhaps it comes from some slow or once-only source like a network stream.
If you have such a situation, you might want to define a custom collector allMaxBy that collects all max elements into a downstream collector.
Then you would be able to write
long maxCount = students.stream()
.collect(allMaxBy(
comparingInt(s -> s.getSeminar().size()),
counting()
));
Here's an implementation of allMaxBy (the usage above assumes static imports of java.util.Comparator.comparingInt and java.util.stream.Collectors.counting):
public static <T, A, R> Collector<T, ?, R> allMaxBy(Comparator<? super T> cmp, Collector<? super T, A, R> downstream) {
class AllMax {
T val;
A acc = null; // null means empty
void add(T t) {
int c = acc == null ? 1 : cmp.compare(t, val);
if (c > 0) {
val = t;
acc = downstream.supplier().get();
downstream.accumulator().accept(acc, t);
} else if (c == 0) {
downstream.accumulator().accept(acc, t);
}
}
AllMax merge(AllMax other) {
if (other.acc == null) {
return this;
} else if (this.acc == null) {
return other;
}
int c = cmp.compare(this.val, other.val);
if (c == 0) {
this.acc = downstream.combiner().apply(this.acc, other.acc);
}
return c >= 0 ? this : other;
}
R finish() {
return downstream.finisher().apply(acc);
}
}
return Collector.of(AllMax::new, AllMax::add, AllMax::merge, AllMax::finish);
}
This is ugly, but the following would work:
long count = students.stream()
.filter(s -> s.getSeminar().size() == students
.stream().mapToInt(a -> a.getSeminar().size())
.max().orElse(0))
.count();
Explanation: this streams the students list and filters it so that only students with the maximum number of seminars remain (the nested stream recomputes that maximum for every student, which is why it's ugly), and then takes the count of that stream.
A slightly modified version of @Alex's solution, which sorts the keys while doing the grouping:
TreeMap<Integer, Long> agg = students.stream()
.map(Student::getSeminar)
.collect(Collectors.groupingBy(Set::size, TreeMap::new, Collectors.counting()));
System.out.println(agg);
System.out.println(agg.lastEntry().getKey() + " - " + agg.lastEntry().getValue());
output
{0=1, 2=1, 3=2}
3 - 2

How do I shuffle nodes in a linked list?

I just started a project for my Java2 class and I've come to a complete stop. I just can't get
my head around this method. Especially when the assignment does NOT let us use any other DATA STRUCTURE or shuffle methods from java at all.
So I have a Deck.class in which I've already created a linked list containing 52 nodes that hold 52 cards.
public class Deck {
private Node theDeck;
private int numCards;
public Deck ()
{
while(numCards < 52)
{
theDeck = new Node (new Card(numCards), theDeck);
numCards++;
}
}
public void shuffleDeck()
{
int rNum;
int count = 0;
Node current = theDeck;
Card tCard;
int range = 0;
while(count != 51)
{
// Store whatever is inside the current node in a temp variable
tCard = current.getItem();
// Generate a random number between 0 -51
rNum = (int)(Math.random()* 51);
// Send current on a loop a random amount of times
for (int i=0; i < rNum; i ++)
current = current.getNext(); ******<-- (Btw this is the line I'm getting my error, i sort of know why but idk how to stop it.)
// So wherever current landed get that item stored in that node and store it in the first on
theDeck.setItem(current.getItem());
// Now make use of the temp variable at the beginning and store it where current landed
current.setItem(tCard);
// Send current back to the beginning of the deck
current = theDeck;
// I've created a counter for another loop i want to do
count++;
// Send current a "count" amount of times for a loop so that it doesn't shuffle the cards that have been already shuffled.
for(int i=0; i<count; i++)
current = current.getNext(); ****<-- Not to sure about this last loop because if i don't shuffle the cards that i've already shuffled it will not count as a legitimate shuffle? i think? ****Also this is where i sometimes get a nullpointerexception****
}
}
}
Now I get different kinds of errors
When I call on this method:
it will sometimes shuffle just 2 cards but at times it will shuffle 3 - 5 cards then give me a NullPointerException.
I've pointed out where it gives me this error with asterisks in my code above
at one point I got it to shuffle 13 cards, but every time it did that it didn't quite shuffle them the right way; one card always kept repeating.
at another point I got all 52 cards to go through the while loop, but again it repeated one card various times.
So I really need some input in what I'm doing wrong. Towards the end of my code I think my logic is completely wrong but I can't seem to figure out a way around it.
Seems pretty long-winded.
I'd go with something like the following:
public void shuffleDeck() {
for(int i=0; i<52; i++) {
int card = (int) (Math.random() * (52-i));
deck.addLast(deck.remove(card));
}
}
So each card just gets moved to the back of the deck in a random order.
If you are authorized to use a secondary data structure, one way is simply to compute a random number within the number of remaining cards, select that card, and move it to the end of the secondary structure; repeat until the original list is empty, then replace your list with the secondary one.
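A minimal sketch of that idea, assuming the question's Node/Card API plus a hypothetical setNext mutator (the posted class doesn't show one); the "secondary structure" is just a second chain built from the detached nodes:
public void shuffleDeck() {
    Node newHead = null, newTail = null;            // the secondary list
    int remaining = numCards;
    while (theDeck != null) {
        int r = (int) (Math.random() * remaining);  // 0 .. remaining-1
        // Walk to the r-th node and detach it from the original list
        Node prev = null, current = theDeck;
        for (int i = 0; i < r; i++) {
            prev = current;
            current = current.getNext();
        }
        if (prev == null) theDeck = current.getNext();
        else prev.setNext(current.getNext());
        // Append the detached node to the end of the secondary list
        current.setNext(null);
        if (newTail == null) newHead = current;
        else newTail.setNext(current);
        newTail = current;
        remaining--;
    }
    theDeck = newHead;  // replace the deck with the shuffled list
}
Because each pick is uniform over the remaining cards, this produces a uniform permutation, like Fisher-Yates.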
My implementation shuffles a linked list using a divide-and-conquer algorithm
public class LinkedListShuffle
{
public static DataStructures.Linear.LinkedListNode<T> Shuffle<T>(DataStructures.Linear.LinkedListNode<T> firstNode) where T : IComparable<T>
{
if (firstNode == null)
throw new ArgumentNullException();
if (firstNode.Next == null)
return firstNode;
var middle = GetMiddle(firstNode);
var rightNode = middle.Next;
middle.Next = null;
var mergedResult = ShuffledMerge(Shuffle(firstNode), Shuffle(rightNode));
return mergedResult;
}
private static DataStructures.Linear.LinkedListNode<T> ShuffledMerge<T>(DataStructures.Linear.LinkedListNode<T> leftNode, DataStructures.Linear.LinkedListNode<T> rightNode) where T : IComparable<T>
{
var dummyHead = new DataStructures.Linear.LinkedListNode<T>();
DataStructures.Linear.LinkedListNode<T> curNode = dummyHead;
var rnd = new Random((int)DateTime.Now.Ticks);
while (leftNode != null || rightNode != null)
{
var rndRes = rnd.Next(0, 2);
if (rndRes == 0)
{
if (leftNode != null)
{
curNode.Next = leftNode;
leftNode = leftNode.Next;
}
else
{
curNode.Next = rightNode;
rightNode = rightNode.Next;
}
}
else
{
if (rightNode != null)
{
curNode.Next = rightNode;
rightNode = rightNode.Next;
}
else
{
curNode.Next = leftNode;
leftNode = leftNode.Next;
}
}
curNode = curNode.Next;
}
return dummyHead.Next;
}
private static DataStructures.Linear.LinkedListNode<T> GetMiddle<T>(DataStructures.Linear.LinkedListNode<T> firstNode) where T : IComparable<T>
{
if (firstNode.Next == null)
return firstNode;
DataStructures.Linear.LinkedListNode<T> fast, slow;
fast = slow = firstNode;
while (fast.Next != null && fast.Next.Next != null)
{
slow = slow.Next;
fast = fast.Next.Next;
}
return slow;
}
}
Just came across this and decided to post a more concise solution which allows you to specify how much shuffling you want to do.
For the purposes of the answer, you have a linked list containing PlayingCard objects;
LinkedList<PlayingCard> deck = new LinkedList<PlayingCard>();
And to shuffle them use something like this;
public void shuffle(int swaps) {
for (int i=0; i < swaps; i++) {
deck.add(deck.remove((int)(Math.random() * deck.size())));
}
}
The more swaps you do, the more randomised the list will be.

XTend For-Loop Support and Adding Range Support

I can't seem to find a great way to express the following in Xtend without resorting to a while loop:
for(int i = 0; i < 3; i++){
println("row ");
}
println("your boat");
So, I guess my question has two parts:
Is there a better way to do the above? I didn't see anything promising in their documentation
A large portion of the features the language has are just Xtend library extensions (and they're great!). Is there range() functionality à la Python that I don't know about?
I ended up rolling my own and got something like the following:
class LanguageUtil {
def static Iterable<Integer> range(int stop) {
range(0, stop)
}
def static Iterable<Integer> range(int start, int stop) {
new RangeIterable(start, stop, 1)
}
def static Iterable<Integer> range(int start, int stop, int step) {
new RangeIterable(start, stop, step)
}
}
// implements Iterator and Iterable which is bad form.
class RangeIterable implements Iterator<Integer>, Iterable<Integer> {
val int start
val int stop
val int step
var int current
new(int start, int stop, int step) {
this.start = start;
this.stop = stop;
this.step = step
this.current = start
}
override hasNext() {
current < stop
}
override next() {
val ret = current
current = current + step
ret
}
override remove() {
throw new UnsupportedOperationException("Auto-generated function stub")
}
/**
* This is bad form. We could return a
* new RangeIterable here, but that seems worse.
*/
override iterator() {
this
}
}
The exact replacement for
for(int i = 0; i < 3; i++){
println("row ");
}
is
for (i : 0 ..< 3) {
println("row ")
}
Notice the exclusive range operator: ..<
Also you can doing it more idiomatically with
(1..3).forEach[println("row")]
Very new to Xtend but man it makes programming on the jvm awesome.
To me a range-based forEach implies the range is somehow meaningful. For looping a specific number of times with no iteration variable, I find Ruby's times loop expresses the intent more clearly:
3.times [|println("row")]
Sadly it's not a part of IntegerExtensions, but the implementation is trivial:
def static times(int iterations, Runnable runnable) {
if (iterations < 0) throw new IllegalArgumentException(
'''Can't iterate negative («iterations») times.''')
for (i: 0 ..< iterations) runnable.run()
}
Heh, I found the answer a little while later:
for(i: 1..3) {
println("row ")
}
Since Xtend 2.6, we also support the "traditional" for-loop, just like in Java.
There is actually a version of forEach() that accepts a lambda with two parameters.
It is useful if you need to access the iteration index within the loop.
(10..12).forEach[ x, i | println('''element=«x» index=«i»''')]
prints:
element=10 index=0
element=11 index=1
element=12 index=2

Lossless compression in small blocks with precomputed dictionary

I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I'd like to generate a compression dictionary based on an example data file and use that dictionary forever as I read and write the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. It looks fairly straightforward to modify it such that the dictionary creation is a separate block of code. So I have two questions:
Am I on the right track or is there a better way?
Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?
Thanks.
Update: Now I'm thinking the ideal case would be to find a library that lets me store the dictionary separate from the compressed data. Does anything like that exist?
Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my mods back to that blog author, but I haven't heard back yet. The compression rates I'm seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.
Update: I converted it to 16 bits and the compression is better. It's also much faster than the original code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace Book.Core
{
public class Huffman16
{
private readonly double log2 = Math.Log(2);
private List<Node> HuffmanTree = new List<Node>();
internal class Node
{
public long Frequency { get; set; }
public byte Uncoded0 { get; set; }
public byte Uncoded1 { get; set; }
public uint Coded { get; set; }
public int CodeLength { get; set; }
public Node Left { get; set; }
public Node Right { get; set; }
public bool IsLeaf
{
get { return Left == null; }
}
public override string ToString()
{
var coded = "00000000" + Convert.ToString(Coded, 2);
return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded.Substring(coded.Length - CodeLength), Frequency);
}
}
public Huffman16(long[] frequencies)
{
if (frequencies.Length != ushort.MaxValue + 1)
{
throw new ArgumentException("frequencies.Length must equal " + (ushort.MaxValue + 1));
}
BuildTree(frequencies);
EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
}
public static long[] GetFrequencies(byte[] sampleData, bool safe)
{
if (sampleData.Length % 2 != 0)
{
throw new ArgumentException("sampleData.Length must be a multiple of 2.");
}
var histogram = new long[ushort.MaxValue + 1];
if (safe)
{
for (int i = 0; i <= ushort.MaxValue; i++)
{
histogram[i] = 1;
}
}
for (int i = 0; i < sampleData.Length; i += 2)
{
histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
}
return histogram;
}
public byte[] Encode(byte[] plainData)
{
if (plainData.Length % 2 != 0)
{
throw new ArgumentException("plainData.Length must be a multiple of 2.");
}
Int64 iBuffer = 0;
int iBufferCount = 0;
using (MemoryStream msEncodedOutput = new MemoryStream())
{
//Write Final Output Size 1st
msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);
//Begin Writing Encoded Data Stream
iBuffer = 0;
iBufferCount = 0;
for (int i = 0; i < plainData.Length; i += 2)
{
Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];
//How many bits are we adding?
iBufferCount += FoundLeaf.CodeLength;
//Shift the buffer
iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;
//Are there at least 8 bits in the buffer?
while (iBufferCount > 7)
{
//Write to output
int iBufferOutput = (int)(iBuffer >> (iBufferCount - 8));
msEncodedOutput.WriteByte((byte)iBufferOutput);
iBufferCount = iBufferCount - 8;
iBufferOutput <<= iBufferCount;
iBuffer ^= iBufferOutput;
}
}
//Write remaining bits in buffer
if (iBufferCount > 0)
{
iBuffer = iBuffer << (8 - iBufferCount);
msEncodedOutput.WriteByte((byte)iBuffer);
}
return msEncodedOutput.ToArray();
}
}
public byte[] Decode(byte[] bInput)
{
long iInputBuffer = 0;
int iBytesWritten = 0;
//Establish Output Buffer to write unencoded data to
byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];
var current = HuffmanTree[HuffmanTree.Count - 1];
//Begin Looping through Input and Decoding
iInputBuffer = 0;
for (int i = 4; i < bInput.Length; i++)
{
iInputBuffer = bInput[i];
for (int bit = 0; bit < 8; bit++)
{
if ((iInputBuffer & 128) == 0)
{
current = current.Left;
}
else
{
current = current.Right;
}
if (current.IsLeaf)
{
bDecodedOutput[iBytesWritten++] = current.Uncoded1;
bDecodedOutput[iBytesWritten++] = current.Uncoded0;
if (iBytesWritten == bDecodedOutput.Length)
{
return bDecodedOutput;
}
current = HuffmanTree[HuffmanTree.Count - 1];
}
iInputBuffer <<= 1;
}
}
throw new Exception();
}
private static void EncodeTree(Node node, int depth, uint value)
{
if (node != null)
{
if (node.IsLeaf)
{
node.CodeLength = depth;
node.Coded = value;
}
else
{
depth++;
value <<= 1;
EncodeTree(node.Left, depth, value);
EncodeTree(node.Right, depth, value | 1);
}
}
}
private void BuildTree(long[] frequencies)
{
var tiny = 0.1 / ushort.MaxValue;
var fraction = 0.0;
SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
for (int i = 0; i <= ushort.MaxValue; i++)
{
var leaf = new Node()
{
Uncoded1 = (byte)(i >> 8),
Uncoded0 = (byte)(i & 255),
Frequency = frequencies[i]
};
HuffmanTree.Add(leaf);
if (leaf.Frequency > 0)
{
trees.Add(leaf.Frequency + (fraction += tiny), leaf);
}
}
while (trees.Count > 1)
{
var e = trees.GetEnumerator();
e.MoveNext();
var first = e.Current;
e.MoveNext();
var second = e.Current;
//Join smallest two nodes
var NewParent = new Node();
NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
NewParent.Left = first.Value;
NewParent.Right = second.Value;
HuffmanTree.Add(NewParent);
//Remove the two that just got joined into one
trees.Remove(first.Key);
trees.Remove(second.Key);
trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
}
}
}
}
Usage examples:
To create the dictionary from sample data:
var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(@"D:\nodes"), true);
To initialize an encoder with a given dictionary:
var huff = new Huffman16(freqs);
And to do some compression:
var encoded = huff.Encode(raw);
And decompression:
var raw = huff.Decode(encoded);
The hard part in my mind is how you build your static dictionary. You don't want to use the LZW dictionary built from your sample data. LZW wastes a bunch of time learning, since it can't build the dictionary faster than the decompressor can (a token will only be used the second time it's seen by the compressor, so the decompressor can add it to its dictionary the first time it's seen). The flip side of this is that it's adding things to the dictionary that may never get used, just in case the string shows up again. (e.g., to have a token for 'stackoverflow' you'll also have entries for 'ac','ko','ve','rf' etc...)
However, looking at the raw token stream from an LZ77 algorithm could work well. You'll only see tokens for strings seen at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
Once you have a static dictionary, using LZW sans the dictionary update seems like an easy implementation, but to get the best compression I'd consider a static Huffman table instead of the traditional 12-bit fixed-size token (as George Phillips suggested). An LZW dictionary will burn tokens for all the sub-strings you may never actually encode (e.g., if you can encode 'stackoverflow', there will be tokens for 'st', 'sta', 'stac', 'stack', 'stacko' etc.).
At this point it really isn't LZW - what makes LZW clever is how the decompressor can build the same dictionary the compressor used only seeing the compressed data stream. Something you won't be using. But all LZW implementations have a state where the dictionary is full and is no longer updated, this is how you'd use it with your static dictionary.
LZW adds to the dictionary during decompression to ensure it has the same dictionary state as the compressor. Otherwise the decoding would not function properly.
However, if you were in a state where the dictionary was fixed then, yes, you would not need to add new codes.
Your approach will work reasonably well and it's easy to use existing tools to prototype and measure the results. That is, compress the example file and then the example and test data together. The size of the latter less the former will be the expected compressed size of a block.
LZW is a clever way to build up a dictionary on the fly and gives decent results. But a more thorough analysis of your typical data blocks is likely to generate a more efficient dictionary.
There's also room for improvement in how LZW represents compressed data. For instance, each dictionary reference could be Huffman encoded to a closer to optimal length based on the expected frequency of their use. To be truly optimal the codes should be arithmetic encoded.
I would look at your data to see if there's an obvious reason it's so easy to compress. You might be able to do something much simpler than LZ78. I've done both LZ77 (lookback) and LZ78 (dictionary).
Try running a LZ77 on your data. There's no dictionary with LZ77, so you could use a library without alteration. Deflate is an implementation of LZ77.
Your idea of using a common dictionary is a good one, but it's hard to know whether the files are similar to each other or just self-similar without doing some tests.
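One concrete way to run such tests: zlib's Deflate supports a preset dictionary that is stored separately from the compressed data, which is the split the question's update asks about. A minimal sketch using java.util.zip, with placeholder sample bytes standing in for the real dictionary and blocks:
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictDemo {
    public static void main(String[] args) throws DataFormatException {
        byte[] dict = "some representative sample text".getBytes();  // placeholder dictionary
        byte[] block = "some representative small block".getBytes(); // placeholder data block

        // Compress with a preset dictionary (the dictionary itself is NOT stored in the output)
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        def.setDictionary(dict);
        def.setInput(block);
        def.finish();
        byte[] compressed = new byte[1024];
        int clen = def.deflate(compressed);
        def.end();
        System.out.println(block.length + " -> " + clen + " bytes");

        // Decompress: the inflater signals that it needs the same dictionary
        Inflater inf = new Inflater();
        inf.setInput(compressed, 0, clen);
        byte[] restored = new byte[block.length];
        int rlen = inf.inflate(restored);
        if (rlen == 0 && inf.needsDictionary()) {
            inf.setDictionary(dict);
            rlen = inf.inflate(restored);
        }
        inf.end();
        System.out.println("restored " + rlen + " bytes");
    }
}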
The right track is to use a library -- almost every modern language has a compression library. C#, Python, Perl, Java, VB.net, whatever you use.
LZW saves some space by making the dictionary depend on previous inputs. It has an initial dictionary, and when you decompress something, you add the new entries to the dictionary -- so the dictionary grows. (I am omitting some details here, but this is the general idea.)
You can omit this step by supplying the whole (complete) dictionary as the initial one. But this would cost some space.
I find this approach quite interesting for repeated log entries, and it is something I would like to explore.
Can you share the compression statistics for using this approach for your use case so I can compare it with other alternatives?
Have you considered having the common dictionary grow over time or is that not a valid option?
