Non-blocking Queues - nonblocking

IBM (see Source) wrote about the benefits of the java.util.concurrent package introduced in Java 1.5, which offers non-blocking queues.
Please explain the weaknesses/disadvantages of the NonblockingCounter below.
public class NonblockingCounter {
private AtomicInteger value;
public int getValue() {
return value.get();
}
public int increment() {
int v;
do {
v = value.get();
}
while (!value.compareAndSet(v, v + 1)); // params - (expected, update)
return v + 1;
}
}
Source - http://www.ibm.com/developerworks/java/library/j-jtp04186/index.html

The disadvantage is that it spins while trying to increment the value if there's contention: every failed compareAndSet means another trip around the loop. That makes it a poor fit for high-contention situations.
The advantage is that it doesn't have lock acquisition/semaphore overhead. That makes it a good fit for low-contention situations.
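To make the spinning visible, here is a minimal sketch of the same retry loop that also counts failed compareAndSet attempts. The class name CasRetryDemo, the failedAttempts field, and the run helper are assumptions added for illustration; they are not part of the IBM article.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class CasRetryDemo {
    private final AtomicInteger value = new AtomicInteger();
    // Hypothetical instrumentation: how often a CAS lost the race and had to retry.
    private final AtomicLong failedAttempts = new AtomicLong();

    int increment() {
        while (true) {
            int v = value.get();
            // compareAndSet(expected, update) succeeds only if value still equals v.
            if (value.compareAndSet(v, v + 1)) {
                return v + 1;
            }
            failedAttempts.incrementAndGet();
        }
    }

    int getValue() {
        return value.get();
    }

    long getFailedAttempts() {
        return failedAttempts.get();
    }

    /** Runs the given number of threads, each incrementing the shared counter. */
    static CasRetryDemo run(int threads, int incrementsPerThread) {
        CasRetryDemo counter = new CasRetryDemo();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter;
    }

    public static void main(String[] args) {
        CasRetryDemo c = run(8, 100_000);
        System.out.println("value=" + c.getValue()
                + " failedAttempts=" + c.getFailedAttempts());
    }
}
```

The final value is always threads * incrementsPerThread, so no update is lost. The number of failed attempts varies from run to run and tends to grow with the number of competing threads, which is the cost described above.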

Related

Why no copy methods for channels in IOUtils?

Can anybody please tell me why there are no copy/copyLarge methods for channels in the IOUtils class?
Are those functionalities implemented in any other class? Or is it unnecessary?
Or is it just that those methods are not defined yet?
I'm not particularly a fan of 3rd party utilities and, consequently, using another library, such as Guava, is not an option.
Moreover, I'm asking about the functionalities just because Commons-IO is, transitively, on my classpath.
static long copy1(ReadableByteChannel source, WritableByteChannel target,
ByteBuffer buffer) throws IOException {
long count = 0L;
while (source.read(buffer) != -1) {
for (buffer.flip(); buffer.hasRemaining();) {
count += target.write(buffer);
}
buffer.clear();
}
return count;
}
static long copy2(ReadableByteChannel source, WritableByteChannel target,
ByteBuffer buffer) throws IOException {
long count = 0L;
while (source.read(buffer) != -1) {
buffer.flip();
count += target.write(buffer);
// keep any bytes the target did not accept yet
buffer.compact();
}
// drain whatever is left after end-of-stream, and count it too
for (buffer.flip(); buffer.hasRemaining(); ) {
count += target.write(buffer);
}
return count;
}
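As a quick check of the compact()-based loop, here is a minimal sketch that runs it against in-memory channels created with Channels.newChannel. The class name ChannelCopyDemo and the roundTrip helper are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

class ChannelCopyDemo {
    /** Copies source to target and returns the number of bytes written. */
    static long copy(ReadableByteChannel source, WritableByteChannel target,
                     ByteBuffer buffer) throws IOException {
        long count = 0L;
        while (source.read(buffer) != -1) {
            buffer.flip();                 // switch the buffer to draining mode
            count += target.write(buffer); // may be a partial write
            buffer.compact();              // keep unwritten bytes, back to filling mode
        }
        buffer.flip();
        while (buffer.hasRemaining()) {    // drain the tail after end-of-stream
            count += target.write(buffer);
        }
        return count;
    }

    /** Hypothetical helper: copies a byte array through channels with a small buffer. */
    static byte[] roundTrip(byte[] input, int bufferSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ReadableByteChannel source = Channels.newChannel(new ByteArrayInputStream(input));
             WritableByteChannel target = Channels.newChannel(out)) {
            copy(source, target, ByteBuffer.allocate(bufferSize));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "hello, channels".getBytes();
        System.out.println(new String(roundTrip(data, 4)));
    }
}
```

A deliberately small buffer forces several read/write cycles, which is where a missing flip() or compact() would show up.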

Why does Java's Map.merge not accept a supplier?

I want a method in Java that lets me modify a value if it exists, or insert one if it doesn't. Similar to merge, but:
I want to pass a value supplier and not a value, to avoid creating the value when it is not needed.
If the value exists, I don't want to reinsert it or remove it, just access its methods with a consumer.
I had to write this myself. The problem with writing it myself is that the version for concurrent maps is not trivial.
public static <K, V> V putOrConsume(Map<K, V> map, K key, Supplier<V> ifAbsent, Consumer<V> ifPresent) {
V val = map.get(key);
if (val != null) {
ifPresent.accept(val);
} else {
map.put(key, ifAbsent.get());
}
return val;
}
The best "standard" way of achieving it is to use compute():
Map<String, String> map = new HashMap<>();
BiFunction<String, String, String> convert = (k, v) -> v == null ? "new_" + k : "old_" + v;
map.compute("x", convert);
map.compute("x", convert);
System.out.println(map.get("x")); //prints old_new_x
Now, say, you have your Supplier and Consumer and would like to follow DRY principle. Then you could use a simple function combinator:
Map<String, String> map = new HashMap<>();
Supplier<String> ifAbsent = () -> "new";
Consumer<String> ifPresent = System.out::println;
BiFunction<String, String, String> putOrConsume = (k, v) -> {
if (v == null) return ifAbsent.get();
ifPresent.accept(v);
return v;
};
map.compute("x", putOrConsume); //nothing
map.compute("x", putOrConsume); //prints "new"
Obviously, you could write a combinator function that takes supplier and consumer and returns BiFunction to make the code above even more generic.
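A minimal sketch of such a combinator follows. The class name MapCombinators and the method name putOrConsume are assumptions chosen for this example, not an existing API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Consumer;
import java.util.function.Supplier;

class MapCombinators {
    /**
     * Builds a remapping function for Map.compute: supplies a value when the
     * key is absent, otherwise consumes the existing value and keeps it.
     */
    static <K, V> BiFunction<K, V, V> putOrConsume(Supplier<? extends V> ifAbsent,
                                                   Consumer<? super V> ifPresent) {
        return (k, v) -> {
            if (v == null) {
                return ifAbsent.get();
            }
            ifPresent.accept(v);
            return v;
        };
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        BiFunction<String, String, String> f =
                putOrConsume(() -> "new", s -> System.out.println(s));
        map.compute("x", f); // key absent: stores "new", prints nothing
        map.compute("x", f); // key present: prints "new"
    }
}
```

The returned function can be reused across keys and maps, since it holds no state of its own.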
The drawback of this approach is the extra write back into the map even if you simply consume the value, i.e. it will be slightly slower, by the cost of a key lookup. The good news is that map implementations will simply replace the value without creating a new node, i.e. no new objects will be created or garbage collected. Most of the time such trade-offs are justified.
map.compute(...) and map.putIfAbsent(...) are much more powerful than the fairly specialized putOrConsume(...) you propose. It is so asymmetrical that I would actually review the reasons why you need it in the code.
You can achieve what you want with Map.compute and a trivial helper method, as well as with the help of a local class to know if your ifAbsent supplier has been used:
public static <K, V> V putOrConsume(
Map<K, V> map,
K key,
Supplier<V> ifAbsent,
Consumer<V> ifPresent) {
class AbsentSupplier implements Supplier<V> {
boolean used = false;
public V get() {
used = true;
return ifAbsent.get();
}
}
AbsentSupplier absentSupplier = new AbsentSupplier();
V computed = map.compute(
key,
(k, v) -> v == null ?
absentSupplier.get() :
consumeAndReturn(v, ifPresent));
return absentSupplier.used ? null : computed;
}
private static <V> V consumeAndReturn(V v, Consumer<V> consumer) {
consumer.accept(v);
return v;
}
The tricky part is finding whether you have used your ifAbsent supplier to return either null or the existent, consumed value.
The helper method simply adapts the ifPresent consumer so that it behaves like a unary operator that consumes the given value and returns it.
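For the concurrent case raised in the question, the same idea can be sketched against ConcurrentHashMap, whose compute runs atomically for the given key. The class name PutOrConsumeDemo and the one-element boolean array used as a flag are assumptions for this illustration; the flag plays the role of the local AbsentSupplier class above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Supplier;

class PutOrConsumeDemo {
    /**
     * Inserts ifAbsent.get() when the key is missing and returns null;
     * otherwise passes the existing value to ifPresent and returns it.
     */
    static <K, V> V putOrConsume(Map<K, V> map, K key,
                                 Supplier<V> ifAbsent, Consumer<V> ifPresent) {
        boolean[] inserted = {false}; // set inside the lambda when we take the insert branch
        V result = map.compute(key, (k, v) -> {
            if (v == null) {
                inserted[0] = true;
                return ifAbsent.get();
            }
            ifPresent.accept(v);
            return v;
        });
        return inserted[0] ? null : result;
    }

    public static void main(String[] args) {
        Map<String, StringBuilder> map = new ConcurrentHashMap<>();
        putOrConsume(map, "x", () -> new StringBuilder("a"), sb -> sb.append("b"));
        putOrConsume(map, "x", () -> new StringBuilder("a"), sb -> sb.append("b"));
        System.out.println(map.get("x")); // the second call appended to the stored builder
    }
}
```

One caveat: the default ConcurrentMap.compute implementation may invoke the function more than once when it has to retry, so the consumer should be safe to repeat unless the map, like ConcurrentHashMap, overrides compute with an atomic version.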
Unlike the other answers, you can also use Map.compute and combine functions through interface default methods and static methods to make your code more readable. For example:
Usage
//only consuming if value is present
Consumer<V> action = ...;
map.compute(key,ValueMapping.ifPresent(action));
//create value if value is absent
Supplier<V> supplier = ...;
map.compute(key,ValueMapping.ifPresent(action).orElse(supplier));
//map value from key if value is absent
Function<K,V> mapping = ...;
map.compute(key,ValueMapping.ifPresent(action).orElse(mapping));
//orElse supports short-circuit feature
map.compute(key,ValueMapping.ifPresent(action)
.orElse(supplier)
.orElse(() -> fail("it should not be called "+
"if the value was computed by the previous orElse")));
<T> T fail(String message) {
throw new AssertionError(message);
}
ValueMapping
interface ValueMapping<T, R> extends BiFunction<T, R, R> {
default ValueMapping<T, R> orElse(Supplier<R> other) {
return orElse(k -> other.get());
}
default ValueMapping<T, R> orElse(Function<T, R> other) {
return (k, v) -> {
R result = this.apply(k, v);
return result!=null ? result : other.apply(k);
};
}
static <T, R> ValueMapping<T, R> ifPresent(Consumer<R> action) {
return (k, v) -> {
if (v!=null) {
action.accept(v);
}
return v;
};
}
}
Note
I used Objects.isNull in ValueMapping in a previous version. @Holger pointed out that this was overuse, and that it should be replaced with the simpler condition v != null.

How to sort comma separated keys in Reducer output?

I am running an RFM Analysis program using MapReduce. The OutputKeyClass is Text.class and I am emitting comma separated R (Recency), F (Frequency), M (Monetary) as the key from the Reducer, where R=BigInteger, F=BigInteger, M=BigDecimal, and the value is also a Text representing Customer_ID. I know that Hadoop sorts output based on keys, but my final result is a bit weird. I want the output keys to be sorted by R first, then F and then M. But I am getting the following output sort order for unknown reasons:
545,1,7652 100000
545,23,390159.402343750 100001
452,13,132586 100002
452,4,32202 100004
452,1,9310 100007
452,1,4057 100018
452,3,18970 100021
But I want the following output:
545,23,390159.402343750 100001
545,1,7652 100000
452,13,132586 100002
452,4,32202 100004
452,3,18970 100021
452,1,9310 100007
452,1,4057 100018
NOTE: The customer_ID was the key in Map phase and all the RFM values belonging to a particular Customer_ID are brought together at the Reducer for aggregation.
So after a lot of searching I found some useful material, a compilation of which I am posting now:
You have to start with your custom data type. Since I had three comma separated values which needed to be sorted in descending order, I had to create a TextQuadlet.java data type in Hadoop. The reason I am creating a quadlet is that the first part of the key will be the natural key and the remaining three parts will be R, F and M:
import java.io.*;
import org.apache.hadoop.io.*;
public class TextQuadlet implements WritableComparable<TextQuadlet> {
private String customer_id;
private long R;
private long F;
private double M;
public TextQuadlet() {
}
public TextQuadlet(String customer_id, long R, long F, double M) {
set(customer_id, R, F, M);
}
public void set(String customer_id2, long R2, long F2, double M2) {
this.customer_id = customer_id2;
this.R = R2;
this.F = F2;
this.M=M2;
}
public String getCustomer_id() {
return customer_id;
}
public long getR() {
return R;
}
public long getF() {
return F;
}
public double getM() {
return M;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.customer_id);
out.writeLong(this.R);
out.writeLong(this.F);
out.writeDouble(this.M);
}
@Override
public void readFields(DataInput in) throws IOException {
this.customer_id = in.readUTF();
this.R = in.readLong();
this.F = in.readLong();
this.M = in.readDouble();
}
// This hashcode function is important as it is used by the custom
// partitioner for this class.
@Override
public int hashCode() {
return (int) (customer_id.hashCode() * 163 + R + F + M);
}
@Override
public boolean equals(Object o) {
if (o instanceof TextQuadlet) {
TextQuadlet tp = (TextQuadlet) o;
return customer_id.equals(tp.customer_id) && R == (tp.R) && F==(tp.F) && M==(tp.M);
}
return false;
}
@Override
public String toString() {
return customer_id + "," + R + "," + F + "," + M;
}
// compareTo returns a negative value when this key should sort before
// tp, zero when the two keys tie, and a positive value when this key
// should sort after tp.
// The signs below are inverted relative to natural numeric order, so
// larger R, F and M values come first (descending sort).
@Override
public int compareTo(TextQuadlet tp) {
// Here my natural key is customer_id and I don't even take it into
// consideration.
// So as you might have concluded, I am sorting R, F, M in descending order.
if (this.R != tp.R) {
if(this.R < tp.R) {
return 1;
}
else{
return -1;
}
}
if (this.F != tp.F) {
if(this.F < tp.F) {
return 1;
}
else{
return -1;
}
}
if (this.M != tp.M){
if(this.M < tp.M) {
return 1;
}
else{
return -1;
}
}
return 0;
}
public static int compare(TextQuadlet tp1, TextQuadlet tp2) {
int cmp = tp1.compareTo(tp2);
return cmp;
}
public static int compare(Text customer_id1, Text customer_id2) {
int cmp = customer_id1.compareTo(customer_id2);
return cmp;
}
}
Next you'll need a custom partitioner so that all the values which have the same key end up at one reducer:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class FirstPartitioner_RFM extends Partitioner<TextQuadlet, Text> {
@Override
public int getPartition(TextQuadlet key, Text value, int numPartitions) {
// Mask the sign bit so a negative hashCode cannot yield a negative partition.
return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
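One detail worth checking in a partitioner like this: Java's % operator keeps the sign of the dividend, so a negative hashCode() produces a negative partition index. A minimal sketch in plain Java (no Hadoop types; the class and method names are hypothetical) contrasts the plain modulo with the masking idiom that Hadoop's default HashPartitioner also uses:

```java
class PartitionIndex {
    /** Plain modulo: negative when hash is negative. */
    static int naive(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    /** Clears the sign bit first, so the result is always in [0, numPartitions). */
    static int masked(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(naive(-7, 3));  // negative index
        System.out.println(masked(-7, 3)); // index within range
    }
}
```

Math.floorMod(hash, numPartitions) would be another way to keep the index non-negative.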
Thirdly, you'll need a custom group comparator so that all the values are grouped together by their natural key, which is customer_id, and not by the composite key, which is customer_id,R,F,M:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class GroupComparator_RFM_N extends WritableComparator {
protected GroupComparator_RFM_N() {
super(TextQuadlet.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
TextQuadlet ip1 = (TextQuadlet) w1;
TextQuadlet ip2 = (TextQuadlet) w2;
// Here we tell hadoop to group the keys by their natural key.
return ip1.getCustomer_id().compareTo(ip2.getCustomer_id());
}
}
Fourthly, you'll need a key comparator which will again sort the keys by R, F, M in descending order and implement the same sort technique that is used in TextQuadlet.java. Since I got lost while coding, I slightly changed the way I compared data types in this function, but the underlying logic is the same as in TextQuadlet.java:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class KeyComparator_RFM extends WritableComparator {
protected KeyComparator_RFM() {
super(TextQuadlet.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
TextQuadlet ip1 = (TextQuadlet) w1;
TextQuadlet ip2 = (TextQuadlet) w2;
// compare returns a negative value when w1 should sort before w2,
// zero when the two keys tie, and a positive value when w1 should
// sort after w2.
// The signs below are inverted relative to natural numeric order, so
// larger R, F and M values come first (descending sort).
if(ip1.getR() == ip2.getR()){
if(ip1.getF() == ip2.getF()){
if(ip1.getM() == ip2.getM()){
return 0;
}
else{
if(ip1.getM() < ip2.getM())
return 1;
else
return -1;
}
}
else{
if(ip1.getF() < ip2.getF())
return 1;
else
return -1;
}
}
else{
if(ip1.getR() < ip2.getR())
return 1;
else
return -1;
}
}
}
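The nested comparison above can be written more compactly with Long.compare and Double.compare, swapping the arguments to get descending order. The sketch below uses plain Java without Hadoop types so it can be run on its own; the Rfm class and the helper methods are hypothetical, and the sample rows are the ones from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RfmOrdering {
    /** Hypothetical stand-in for the R, F, M part of the composite key. */
    static final class Rfm {
        final String customerId;
        final long r;
        final long f;
        final double m;

        Rfm(String customerId, long r, long f, double m) {
            this.customerId = customerId;
            this.r = r;
            this.f = f;
            this.m = m;
        }
    }

    /** Descending by R, then F, then M: note the swapped arguments (b before a). */
    static final Comparator<Rfm> DESCENDING = (a, b) -> {
        int cmp = Long.compare(b.r, a.r);
        if (cmp != 0) return cmp;
        cmp = Long.compare(b.f, a.f);
        if (cmp != 0) return cmp;
        return Double.compare(b.m, a.m);
    };

    /** The rows from the question, in the order the job originally emitted them. */
    static List<Rfm> sampleRows() {
        return Arrays.asList(
                new Rfm("100000", 545, 1, 7652),
                new Rfm("100001", 545, 23, 390159.402343750),
                new Rfm("100002", 452, 13, 132586),
                new Rfm("100004", 452, 4, 32202),
                new Rfm("100007", 452, 1, 9310),
                new Rfm("100018", 452, 1, 4057),
                new Rfm("100021", 452, 3, 18970));
    }

    static List<String> sortedCustomerIds(List<Rfm> rows) {
        List<Rfm> copy = new ArrayList<>(rows);
        copy.sort(DESCENDING);
        List<String> ids = new ArrayList<>();
        for (Rfm row : copy) {
            ids.add(row.customerId);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(sortedCustomerIds(sampleRows()));
    }
}
```

Comparing numeric fields also avoids the lexicographic ordering a Text key would give, where "13" sorts before "4".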
And finally, in your driver class, you'll have to include our custom classes. Here I have used TextQuadlet, Text as the k-v pair, but you can choose any other classes depending on your needs:
job.setPartitionerClass(FirstPartitioner_RFM.class);
job.setSortComparatorClass(KeyComparator_RFM.class);
job.setGroupingComparatorClass(GroupComparator_RFM_N.class);
job.setMapOutputKeyClass(TextQuadlet.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(TextQuadlet.class);
job.setOutputValueClass(Text.class);
Do correct me if I am technically going wrong somewhere in the code or in the explanation as I have based this answer purely on my personal understanding from what I read on the internet and it works for me perfectly.

Throughput measure

I have to implement a rate-limiting algorithm to avoid reaching a throughput limit imposed by the service I'm interacting with.
The limit is specified as «N requests over 1 day», where N is of the order of 10^6.
I have a distributed system of clients interacting with the service, so they should share the measurement.
An exact solution would involve recording every event and then computing the limit whenever a call to the service occurs. Of course this approach is too expensive, so I'm looking for an approximate solution.
The first one I devised discretizes the detection of events: for example, maintaining at most 24 counters, each recording the number of requests that occurred within one hour.
Acceptable.
But I feel that a more elegant approach, even if driven by different «forces», would be to carry the idea over to the continuum.
Say that by recording the last N events I could easily infer the «current» throughput. Of course this algorithm suffers from ignoring the events that occurred in the preceding hours. I could improve it with an aging algorithm, but… and here follows my question:
Q: «Is there an elegant approximate solution to the problem of estimating the throughput of a service over a long period and with a high rate of events?»
As per my comments, you should use a monitor and have it sample the counter every 15 minutes or so to get a reasonable estimate of the number of requests.
I mocked something up here but haven't tested it; it should give you a starting point.
import java.util.LinkedList;
import java.util.Queue;
import java.util.Timer;
import java.util.TimerTask;
public class TestCounter {
private final Monitor monitor;
private TestCounter() {
monitor = new Monitor();
}
/** The thing you are limiting */
public void myService() {
if (monitor.isThresholdExceeded()) {
//Return error
} else {
monitor.incremenetCounter();
//do stuff
}
}
public static void main(String[] args) {
TestCounter t = new TestCounter();
for (int i = 0; i < 100000; i++) {
t.myService();
}
for (int i = 0; i < 100000; i++) {
t.myService();
}
}
private class Monitor {
private final Queue<Integer> queue = new LinkedList<Integer>();
private int counter = 0;
/** Number of 15 minute periods in a day (24 * 4). */
private final int numberOfSamples = 96;
private final int threshold = 1000000;
private boolean thresholdExceeded;
public Monitor() {
//Schedule a sample every 15 minutes.
Timer t = new Timer();
t.scheduleAtFixedRate(new TimerTask() {
@Override
public void run() {
sampleCounter();
}
}, 0L, 900000L /* ms in 15 minutes */
);
}
/** Could synchronise */
void incremenetCounter() {
counter++;
}
/** Could synchronise */
void sampleCounter() {
int tempCount = counter;
counter = 0;
queue.add(tempCount);
if (queue.size() > numberOfSamples) {
queue.poll();
}
int totalCount = 0;
for (Integer value : queue) {
totalCount += value;
}
if (totalCount > threshold) {
thresholdExceeded = true;
} else {
thresholdExceeded = false;
}
}
public boolean isThresholdExceeded() {
return thresholdExceeded;
}
}
}
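The bucketed idea from the question (24 hourly counters) can also be sketched without a timer, as a ring of counters that expires old buckets lazily on each call. This is an untested-in-production sketch under stated assumptions: the class name BucketedRateLimiter and its methods are hypothetical, the clock is passed in explicitly to keep the example deterministic, and it covers a single process only. A distributed setup would still need a shared store for the counters.

```java
class BucketedRateLimiter {
    private final long[] buckets;   // one counter per time slice, used as a ring
    private final long bucketMillis;
    private final long limit;
    private long currentBucket;     // absolute bucket number = time / bucketMillis
    private long total;             // sum of all live buckets

    BucketedRateLimiter(int bucketCount, long bucketMillis, long limit) {
        this.buckets = new long[bucketCount];
        this.bucketMillis = bucketMillis;
        this.limit = limit;
    }

    /** Returns true and records the request if the window total is below the limit. */
    synchronized boolean tryAcquire(long nowMillis) {
        advanceTo(nowMillis / bucketMillis);
        if (total >= limit) {
            return false;
        }
        buckets[(int) (currentBucket % buckets.length)]++;
        total++;
        return true;
    }

    /** Clears every bucket that has fallen out of the window since the last call. */
    private void advanceTo(long bucket) {
        if (bucket <= currentBucket) {
            return; // same slice, or the clock went backwards: nothing to expire
        }
        long steps = Math.min(bucket - currentBucket, buckets.length);
        for (long i = 1; i <= steps; i++) {
            int slot = (int) ((currentBucket + i) % buckets.length);
            total -= buckets[slot];
            buckets[slot] = 0;
        }
        currentBucket = bucket;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        BucketedRateLimiter limiter = new BucketedRateLimiter(24, hour, 1_000_000L);
        System.out.println(limiter.tryAcquire(System.currentTimeMillis()));
    }
}
```

The estimate is coarse by design: a request counts against the limit until its whole bucket expires, so the effective window is between 23 and 24 hours with these settings. More, smaller buckets tighten that error at the cost of memory.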

Lossless hierarchical run length encoding

I want to summarize rather than compress in a similar manner to run length encoding but in a nested sense.
For instance, I want : ABCBCABCBCDEEF to become: (2A(2BC))D(2E)F
I don't mind which option is picked when two possible nestings are equally good, e.g.
ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA which are of the same compressed length, despite having different structures.
However I do want the choice to be MOST greedy. For instance:
ABCDABCDCDCDCD would pick (2ABCD)(3CD), which has length six in original symbols, rather than ABCDAB(4CD), which has length 8 in original symbols.
In terms of background, I have some repeating patterns that I want to summarize so that the data is more digestible. I don't want to disrupt the logical order of the data, as it is important, but I do want to summarize it by saying: symbol A for 3 occurrences, followed by symbols XYZ for 20 occurrences, etc., and this can be displayed visually in a nested sense.
Ideas welcome.
I'm pretty sure this isn't the best approach, and depending on the length of the patterns, might have a running time and memory usage that won't work, but here's some code.
You can paste the following code into LINQPad and run it, and it should produce the following output:
ABCBCABCBCDEEF = (2A(2BC))D(2E)F
ABBABBABBABA = (3A(2B))ABA
ABCDABCDCDCDCD = (2ABCD)(3CD)
As you can see, the middle example encoded ABB as A(2B) instead of ABB. You would have to make that judgment yourself: whether single-symbol sequences like that should be encoded as a repeated symbol or not, or whether a specific threshold (like 3 or more) should be used.
Basically, the code runs like this:
1. For each position in the sequence, try to find the longest match (actually, it doesn't; it takes the first match with 2 or more repeats that it finds. I left the rest as an exercise for you, since I have to leave my computer for a few hours now).
2. It then tries to encode that sequence, the one that repeats, recursively, and spits out an X*seq type of object.
3. If it can't find a repeating sequence, it spits out the single symbol at that location.
4. It then skips what it encoded, and continues from #1.
Anyway, here's the code:
void Main()
{
string[] examples = new[]
{
"ABCBCABCBCDEEF",
"ABBABBABBABA",
"ABCDABCDCDCDCD",
};
foreach (string example in examples)
{
StringBuilder sb = new StringBuilder();
foreach (var r in Encode(example))
sb.Append(r.ToString());
Debug.WriteLine(example + " = " + sb.ToString());
}
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
return Encode<T>(values, EqualityComparer<T>.Default);
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
List<T> sequence = new List<T>(values);
int index = 0;
while (index < sequence.Count)
{
var bestSequence = FindBestSequence<T>(sequence, index, comparer);
if (bestSequence == null || bestSequence.Length < 1)
throw new InvalidOperationException("Unable to find sequence at position " + index);
yield return bestSequence;
index += bestSequence.Length;
}
}
private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
int sequenceLength = 1;
while (startIndex + sequenceLength * 2 <= sequence.Count)
{
if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
{
bool atLeast2Repeats = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
{
atLeast2Repeats = false;
break;
}
}
if (atLeast2Repeats)
{
int count = 2;
while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
{
bool anotherRepeat = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
{
anotherRepeat = false;
break;
}
}
if (anotherRepeat)
count++;
else
break;
}
List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
return new SequenceRepeat<T>(count, repeatedSequence);
}
}
sequenceLength++;
}
// fall back, we could not find anything that repeated at all
return new SingleSymbol<T>(sequence[startIndex]);
}
public abstract class Repeat<T>
{
public int Count { get; private set; }
protected Repeat(int count)
{
Count = count;
}
public abstract int Length
{
get;
}
}
public class SingleSymbol<T> : Repeat<T>
{
public T Value { get; private set; }
public SingleSymbol(T value)
: base(1)
{
Value = value;
}
public override string ToString()
{
return string.Format("{0}", Value);
}
public override int Length
{
get
{
return Count;
}
}
}
public class SequenceRepeat<T> : Repeat<T>
{
public Repeat<T>[] Values { get; private set; }
public SequenceRepeat(int count, Repeat<T>[] values)
: base(count)
{
Values = values;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
}
public override int Length
{
get
{
int oneLength = 0;
foreach (var value in Values)
oneLength += value.Length;
return Count * oneLength;
}
}
}
public class GroupRepeat<T> : Repeat<T>
{
public Repeat<T> Group { get; private set; }
public GroupRepeat(int count, Repeat<T> group)
: base(count)
{
Group = group;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, Group);
}
public override int Length
{
get
{
return Count * Group.Length;
}
}
}
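For readers who prefer Java, the same first-match strategy can be sketched over plain strings. This is a rough port under stated assumptions, not the code above: the class name NestedRle and its methods are hypothetical, it works on characters rather than generic symbols, and like the original it takes the shortest repeating unit it finds at each position rather than searching for the most greedy nesting.

```java
class NestedRle {
    /** Encodes s as nested runs, e.g. repeated units become "(countUnit)". */
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int index = 0;
        while (index < s.length()) {
            int[] repeat = findRepeat(s, index); // {unitLength, count} or null
            if (repeat == null) {
                // nothing repeats here: emit the single symbol
                out.append(s.charAt(index));
                index++;
            } else {
                String unit = s.substring(index, index + repeat[0]);
                // encode the repeating unit recursively to get the nesting
                out.append('(').append(repeat[1]).append(encode(unit)).append(')');
                index += repeat[0] * repeat[1];
            }
        }
        return out.toString();
    }

    /** Finds the shortest unit starting at start that occurs at least twice in a row. */
    static int[] findRepeat(String s, int start) {
        for (int len = 1; start + 2 * len <= s.length(); len++) {
            int count = 1;
            while (start + (count + 1) * len <= s.length()
                    && s.regionMatches(start, s, start + count * len, len)) {
                count++;
            }
            if (count >= 2) {
                return new int[] {len, count};
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (String example : new String[] {
                "ABCBCABCBCDEEF", "ABBABBABBABA", "ABCDABCDCDCDCD"}) {
            System.out.println(example + " = " + encode(example));
        }
    }
}
```

On the three examples it produces the same encodings as listed above, including A(2B) for the ABB unit.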
Looking at the problem theoretically, it seems similar to the problem of finding the smallest context free grammar which generates (only) the string, except in this case the non-terminals can only be used in direct sequence after each other, so e.g.
ABCBCABCBCDEEF
s->ttDuuF
t->Avv
v->BC
u->E
ABABCDABABCD
s->ABtt
t->ABCD
Of course, this depends on how you define "smallest", but if you count terminals on the right side of rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.
The problem of the smallest grammar is known to be hard, and is a well-studied problem. I don't know how much the "direct sequence" part adds to or subtracts from the complexity.
