How do streams stop? - java-8

While creating my own infinite stream with Stream.generate, I started wondering how the streams in the standard library stop.
For example when you have a list with records:
List<Record> records = getListWithRecords();
records.stream().forEach(/* do something */);
The stream won't be infinite and run forever; it stops when all items in the list have been traversed. But how does that work? The same applies to the stream created by Files.lines(path) (source: http://www.mkyong.com/java8/java-8-stream-read-a-file-line-by-line/).
And a second question: how can a stream created with Stream.generate be stopped in the same manner?

Finite streams simply aren’t created via Stream.generate.
The standard way of implementing a stream is to implement a Spliterator, sometimes via the Iterator detour. In either case, the implementation has a way to report an end, e.g. when Spliterator.tryAdvance returns false or its forEachRemaining method simply returns, or, in the case of an Iterator source, when hasNext() returns false.
A Spliterator may even report the expected number of elements before the processing begins.
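For example, a minimal iterator-backed stream ends as soon as hasNext() reports false:
Iterator<String> it = Arrays.asList("a", "b").iterator();
Spliterator<String> sp = Spliterators.spliteratorUnknownSize(it, 0);
StreamSupport.stream(sp, false).forEach(System.out::println); // prints a and b, then the stream ends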
Streams created via one of the factory methods in the Stream interface, like Stream.generate, may be implemented either with a Spliterator as well or using internal features of the stream implementation. Regardless of how they are implemented, you don't get access to that implementation to change its behavior, so the only way to make such a stream finite is to chain a limit operation to it.
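For example, chaining limit onto an otherwise infinite generator:
Stream.generate(() -> "ping") // infinite source
      .limit(3)               // short-circuits after three elements
      .forEach(System.out::println);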
If you want to create a non-empty finite stream that is not backed by an array or collection and none of the existing stream sources fits, you have to implement your own Spliterator and create a stream out of it. As said above, you can use an existing method to create a Spliterator out of an Iterator, but you should resist the temptation to use an Iterator just because it's familiar. A Spliterator is not hard to implement:
/** Like {@code Stream.generate}, but with an intrinsic limit. */
static <T> Stream<T> generate(Supplier<T> s, long count) {
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<T>(count, Spliterator.SIZED) {
            long remaining = count;

            public boolean tryAdvance(Consumer<? super T> action) {
                if (remaining <= 0) return false;
                remaining--;
                action.accept(s.get());
                return true;
            }
        }, false);
}
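For example, the following prints "hi" exactly three times, after which the stream reports its end on its own:
generate(() -> "hi", 3).forEach(System.out::println);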
From this starting point, you can add overrides for the default methods of the Spliterator interface, weighing development expense against potential performance improvements, e.g.
static <T> Stream<T> generate(Supplier<T> s, long count) {
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<T>(count, Spliterator.SIZED) {
            long remaining = count;

            public boolean tryAdvance(Consumer<? super T> action) {
                if (remaining <= 0) return false;
                remaining--;
                action.accept(s.get());
                return true;
            }

            /** May improve the performance of most non-short-circuiting operations. */
            @Override
            public void forEachRemaining(Consumer<? super T> action) {
                long toGo = remaining;
                remaining = 0;
                for (; toGo > 0; toGo--) action.accept(s.get());
            }
        }, false);
}

I have created a generic workaround for this:
public class GuardedSpliterator<T> implements Spliterator<T> {
    final Supplier<? extends T> generator;
    final Predicate<T> termination;
    final boolean inclusive;

    public GuardedSpliterator(Supplier<? extends T> generator, Predicate<T> termination, boolean inclusive) {
        this.generator = generator;
        this.termination = termination;
        this.inclusive = inclusive;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        T next = generator.get();
        boolean end = termination.test(next);
        if (inclusive || !end) {
            action.accept(next);
        }
        return !end;
    }

    @Override
    public Spliterator<T> trySplit() {
        return null; // cannot be split
    }

    @Override
    public long estimateSize() {
        return Long.MAX_VALUE; // size is unknown
    }

    @Override
    public int characteristics() {
        return Spliterator.ORDERED;
    }
}
Usage is pretty easy:
Random rnd = new Random(); // state for the generator
GuardedSpliterator<Integer> source = new GuardedSpliterator<>(
        () -> rnd.nextInt(),
        (i) -> i > 10,
        true
);
Stream<Integer> ints = StreamSupport.stream(source, false);
ints.forEach(i -> System.out.println(i));

Related

Custom FileInputFormat always assigns one file split to one slot

I have been writing protobuf records to our S3 buckets, and I want to use the Flink DataSet API to read them. So I implemented a custom FileInputFormat to achieve this. The code is below.
public class ProtobufInputFormat extends FileInputFormat<StandardLog.Pageview> {
    public ProtobufInputFormat() {
    }

    private transient boolean reachedEnd = false;

    @Override
    public boolean reachedEnd() throws IOException {
        return reachedEnd;
    }

    @Override
    public StandardLog.Pageview nextRecord(StandardLog.Pageview reuse) throws IOException {
        StandardLog.Pageview pageview = StandardLog.Pageview.parseDelimitedFrom(stream);
        if (pageview == null) {
            reachedEnd = true;
        }
        return pageview;
    }

    @Override
    public boolean supportsMultiPaths() {
        return true;
    }
}
public class BatchReadJob {
    public static void main(String... args) throws Exception {
        String readPath1 = args[0];

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ProtobufInputFormat inputFormat = new ProtobufInputFormat();
        inputFormat.setNestedFileEnumeration(true);
        inputFormat.setFilePaths(readPath1);

        DataSet<StandardLog.Pageview> dataSource = env.createInput(inputFormat);

        dataSource.map(new MapFunction<StandardLog.Pageview, String>() {
            @Override
            public String map(StandardLog.Pageview value) throws Exception {
                return value.getId();
            }
        }).writeAsText("s3://xxx", FileSystem.WriteMode.OVERWRITE);

        env.execute();
    }
}
The problem is that Flink always assigns one file split to one parallelism slot. In other words, it always processes the same number of file splits as the parallelism.
I want to know the correct way of implementing a custom FileInputFormat.
Thanks.
I believe the behavior you're seeing is because ExecutionJobVertex calls the FileInputFormat.createInputSplits() method with a minNumSplits parameter equal to the vertex (data source) parallelism. So if you want different behavior, you'll have to override the createInputSplits method.
You didn't say what behavior you actually want, though. If, for example, you just want one split per file, then you can override the testForUnsplittable() method in your subclass of FileInputFormat to always return true; it should also set the (protected) unsplittable boolean to true.
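A minimal sketch of that override, based on the hooks described above (the record-reading methods stay as in the question):
public class ProtobufInputFormat extends FileInputFormat<StandardLog.Pageview> {
    @Override
    protected boolean testForUnsplittable(FileStatus pathFile) {
        unsplittable = true; // consulted by createInputSplits() when generating splits
        return true;         // treat every file as one unsplittable unit
    }
    // reachedEnd() / nextRecord() / supportsMultiPaths() as shown above
}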

Java8 to Java7 - Migrate Comparators

I'm having trouble understanding how to "migrate" a simple Comparator to Java 7.
The version I'm currently using in Java 8 looks like this:
private static final Comparator<Entry> ENTRY_COMPARATOR =
        Comparator.comparing(new Function<Entry, EntryType>() {
            @Override
            public EntryType apply(Entry t) {
                return t.type;
            }
        })
        .thenComparing(Comparator.comparingLong(new ToLongFunction<Entry>() {
            @Override
            public long applyAsLong(Entry value) {
                return value.count;
            }
        }).reversed());
But in the build phase I get this error:
static interface method invocations are not supported in -source 7
How can I migrate the same comparator to Java 7? I've been googling and searching for a solution, but the only thing I can think of is to implement my own class as a Comparator implementation.
But if I go down that road, how can I apply "comparing", "thenComparing", and "reversed" all in the same "compare" method?
Thanks in advance
Even your java-8 version can be made a lot shorter and easier to read with:
Comparator.comparing(Entry::getType)
          .thenComparingLong(Entry::getCount)
          .reversed();
With Guava (Java 7 compatible), this looks a bit more verbose:
@Override
public int compare(Entry left, Entry right) {
    return ComparisonChain.start()
            .compare(left.getType(), right.getType(), Ordering.natural().reversed())
            .compare(left.getCount(), right.getCount(), Ordering.natural().reversed())
            .result();
}
You can write the logic in a single compare method:
public int compare(Entry one, Entry two) {
    int result = two.getType().compareTo(one.getType());
    if (result == 0) {
        result = Long.compare(two.getCount(), one.getCount());
    }
    return result;
}
Note that the reversed order is achieved by swapping the order of the compared Entry instances.
You can construct a Comparator<Entry> the Java 7 way; afterwards, you can chain the default methods just as in Java 8, but without using lambda expressions or method references as parameters:
private static final Comparator<Entry> ENTRY_COMPARATOR =
        new Comparator<Entry>() {
            @Override
            public int compare(Entry left, Entry right) {
                return left.type.compareTo(right.type);
            }
        }
        .thenComparingLong(new ToLongFunction<Entry>() {
            @Override
            public long applyAsLong(Entry entry) {
                return entry.count;
            }
        })
        .reversed();
The code above compiles with -source 1.7.

How to nicely do allOf/AnyOf with Collections of CompletionStage

Currently, doing something simple with collections of CompletionStage requires jumping through several ugly hoops:
public static CompletionStage<String> translate(String foo) {
    // just example code to reproduce
    return CompletableFuture.completedFuture("translated " + foo);
}

public static CompletionStage<List<String>> translateAllAsync(List<String> input) {
    List<CompletableFuture<String>> tFutures = input.stream()
            .map(s -> translate(s)
                    .toCompletableFuture())
            .collect(Collectors.toList()); // cannot use toArray because of generic array creation :-(
    return CompletableFuture.allOf(tFutures.toArray(new CompletableFuture<?>[0])) // not using size() on purpose, see comments
            .thenApply(nil -> tFutures.stream()
                    .map(f -> f.join())
                    .map(s -> s.toUpperCase())
                    .collect(Collectors.toList()));
}
What I want to write is:
public CompletionStage<List<String>> translateAllAsync(List<String> input) {
    // allOf takes a collection<futures<X>>,
    // and returns a future<collection<X>> for thenApply()
    return XXXUtil.allOf(input.stream()
            .map(s -> translate(s))
            .collect(Collectors.toList()))
            .thenApply(translations -> translations.stream()
                    .map(s -> s.toUpperCase())
                    .collect(Collectors.toList()));
}
The whole ceremony around toCompletableFuture, converting to an array, and join is boilerplate that distracts from the actual code semantics.
Possibly a version of allOf() returning a Future<Collection<Future<X>>> instead of Future<Collection<X>> would also be useful in some cases.
I could try implementing XXXUtil myself, but I wonder if there already is a mature third-party library for this and similar issues (such as Spotify's CompletableFutures). If so, I'd like to see the equivalent code for such a library as an answer.
Or maybe the original code posted above can somehow be written more compactly in a different way?
JUnit test code:
@Test
public void testTranslate() throws Exception {
    List<String> list = translateAllAsync(Arrays.asList("foo", "bar")).toCompletableFuture().get();
    Collections.sort(list);
    assertEquals(list,
            Arrays.asList("TRANSLATED BAR", "TRANSLATED FOO"));
}
I just looked into the source code of CompletableFuture.allOf and found that it basically creates a binary tree of nodes, handling two stages at a time. We can easily implement similar logic without using toCompletableFuture() explicitly, handling the result list generation in one go:
public static <T> CompletionStage<List<T>> allOf(
        Stream<? extends CompletionStage<? extends T>> source) {
    return allOf(source.collect(Collectors.toList()));
}

public static <T> CompletionStage<List<T>> allOf(
        List<? extends CompletionStage<? extends T>> source) {
    int size = source.size();
    if (size == 0) return CompletableFuture.completedFuture(Collections.emptyList());
    List<T> result = new ArrayList<>(Collections.nCopies(size, null));
    return allOf(source, result, 0, size - 1).thenApply(x -> result);
}

private static <T> CompletionStage<Void> allOf(
        List<? extends CompletionStage<? extends T>> source,
        List<T> result, int from, int to) {
    if (from < to) {
        int mid = (from + to) >>> 1;
        return allOf(source, result, from, mid)
                .thenCombine(allOf(source, result, mid + 1, to), (x, y) -> x);
    }
    return source.get(from).thenAccept(t -> result.set(from, t));
}
That’s it.
You can use this solution to implement the logic of your question’s code as
public static CompletionStage<List<String>> translateAllAsync(List<String> input) {
    return allOf(input.stream().map(s -> translate(s)))
            .thenApply(list -> list.stream()
                    .map(s -> s.toUpperCase())
                    .collect(Collectors.toList()));
}
though it would be more natural to use
public static CompletionStage<List<String>> translateAllAsync(List<String> input) {
    return allOf(input.stream().map(s -> translate(s).thenApply(String::toUpperCase)));
}
Note that this solution maintains the order, so there is no need to sort the result in the test case:
@Test
public void testTranslate() throws Exception {
    List<String> list = translateAllAsync(Arrays.asList("foo", "bar")).toCompletableFuture().get();
    assertEquals(list, Arrays.asList("TRANSLATED FOO", "TRANSLATED BAR"));
}
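If you would rather use a third-party library, Spotify's completable-futures (mentioned in the question) provides this collapsing operation out of the box. A sketch, assuming its com.spotify.futures.CompletableFutures.allAsList method, which takes a list of stages and returns a future of a list:
import com.spotify.futures.CompletableFutures;

public static CompletionStage<List<String>> translateAllAsync(List<String> input) {
    return CompletableFutures.allAsList(input.stream()
            .map(s -> translate(s).thenApply(String::toUpperCase))
            .collect(Collectors.toList()));
}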

CompletableFuture exceptionally breaks the work chain

The idea of using CompletableFuture is that it offers a chain, where the first several steps encapsulate beans before the last step uses them. Since an exception may happen in any of these steps, exceptionally is used for error handling. However, exceptionally only accepts a Throwable argument, and so far I haven't found a way to get at those encapsulated beans.
CompletableFuture.supplyAsync(this::msgSource)
        .thenApply(this::sendMsg)
        .exceptionally(this::errorHandler)
        .thenAccept(this::saveResult);

public List<Msg> msgSource()        // take messages from somewhere
public List<Msg> sendMsg(List<Msg>) // exceptions may happen, like 403 or timeout
public List<Msg> errorHandler()     // set the success flag to false in Msg
public void saveResult(List<Msg>)   // save the send result, success or not, in the data center
In the above example, the comments describe the working flow. However, since errorHandler neither accepts a List<Msg> nor passes one on, the chain is broken. How can I get the result of msgSource?
EDIT
public class CompletableFutureTest {
    private static Logger log = LoggerFactory.getLogger(CompletableFutureTest.class);

    public static void main(String[] args) {
        CompletableFutureTest test = new CompletableFutureTest();
        CompletableFuture future = new CompletableFuture();
        future.supplyAsync(test::msgSource)
                .thenApply(test::sendMsg)
                .exceptionally(throwable -> {
                    List<String> list = (List<String>) future.join(); // never completes
                    return list;
                })
                .thenAccept(test::saveResult);
        try {
            future.get();
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ExecutionException e) {
            e.printStackTrace();
        }
    }

    private List<String> saveResult(List<String> list) {
        return list;
    }

    private List<String> sendMsg(List<String> list) {
        throw new RuntimeException();
    }

    public List<String> msgSource() {
        List<String> result = new ArrayList<>();
        result.add("1");
        result.add("2");
        return result;
    }
}
A chain implies that each node, i.e. each completion stage, uses the result of the previous one. But if the previous stage failed with an exception, there is no such result. It's a special property of your sendMsg stage that its result happens to be the same value it received from the previous stage, but that has no influence on the logic or the API design. If sendMsg fails with an exception, it has no result the exception handler could use.
If you want to use the result of the msgSource stage in the exceptional case, you no longer have a linear chain. But CompletableFuture does allow you to model arbitrary dependency graphs, not just linear chains, so you can express it like this:
CompletableFuture<List<Msg>> source = CompletableFuture.supplyAsync(this::msgSource);
source.thenApply(this::sendMsg)
        .exceptionally(throwable -> {
            List<Msg> list = source.join();
            for (Msg m : list) m.success = false;
            return list;
        })
        .thenAccept(this::saveResult);
However, there is no semantic difference nor advantage over
CompletableFuture.runAsync(() -> {
    List<Msg> list = msgSource();
    try {
        list = sendMsg(list);
    } catch (Throwable t) {
        for (Msg m : list) m.success = false;
    }
    saveResult(list);
});
which expresses the same logic as an ordinary code flow.

locking on a cache key

I've read several questions similar to this one, but none of the answers provide ideas for how to clean up memory while still maintaining lock integrity. I estimate the number of key-value pairs at a given time to be in the tens of thousands, but the number of key-value pairs over the lifespan of the data structure is virtually unbounded (realistically it probably wouldn't be more than a billion, but I'm coding to the worst case).
I have an interface:
public interface KeyLock<K extends Comparable<? super K>> {
    public void lock(K key);
    public void unlock(K key);
}
with a default implementation:
public class DefaultKeyLock<K extends Comparable<? super K>> implements KeyLock<K> {
    private final ConcurrentMap<K, Mutex> lockMap;

    public DefaultKeyLock() {
        lockMap = new ConcurrentSkipListMap<K, Mutex>();
    }

    @Override
    public void lock(K key) {
        Mutex mutex = new Mutex();
        Mutex existingMutex = lockMap.putIfAbsent(key, mutex);
        if (existingMutex != null) {
            mutex = existingMutex;
        }
        mutex.lock();
    }

    @Override
    public void unlock(K key) {
        Mutex mutex = lockMap.get(key);
        mutex.unlock();
    }
}
This works nicely, but the map never gets cleaned up. What I have so far for a clean implementation is:
public class CleanKeyLock<K extends Comparable<? super K>> implements KeyLock<K> {
    private final ConcurrentMap<K, LockWrapper> lockMap;

    public CleanKeyLock() {
        lockMap = new ConcurrentSkipListMap<K, LockWrapper>();
    }

    @Override
    public void lock(K key) {
        LockWrapper wrapper = new LockWrapper(key);
        wrapper.addReference();
        LockWrapper existingWrapper = lockMap.putIfAbsent(key, wrapper);
        if (existingWrapper != null) {
            wrapper = existingWrapper;
            wrapper.addReference();
        }
        wrapper.lock();
    }

    @Override
    public void unlock(K key) {
        LockWrapper wrapper = lockMap.get(key);
        if (wrapper != null) {
            wrapper.unlock();
            wrapper.removeReference();
        }
    }

    private class LockWrapper {
        private final K key;
        private final ReentrantLock lock;
        private int referenceCount;

        public LockWrapper(K key) {
            this.key = key;
            lock = new ReentrantLock();
            referenceCount = 0;
        }

        public synchronized void addReference() {
            lockMap.put(key, this);
            referenceCount++;
        }

        public synchronized void removeReference() {
            referenceCount--;
            if (referenceCount == 0) {
                lockMap.remove(key);
            }
        }

        public void lock() {
            lock.lock();
        }

        public void unlock() {
            lock.unlock();
        }
    }
}
This works for two threads accessing a single key lock, but once a third thread is introduced the lock integrity is no longer guaranteed. Any ideas?
I don't buy that this works for two threads. Consider this:
(Thread A) calls lock(x), now holds lock x
thread switch
(Thread B) calls lock(x), putIfAbsent() returns the current wrapper for x
thread switch
(Thread A) calls unlock(x), the wrapper reference count hits 0 and it gets removed from the map
(Thread A) calls lock(x), putIfAbsent() inserts a new wrapper for x
(Thread A) locks on the new wrapper
thread switch
(Thread B) locks on the old wrapper
How about:
LockWrapper starts with a reference count of 1
addReference() returns false if the reference count is 0
in lock(), if existingWrapper != null, we call addReference() on it; if this returns false, it has already been removed from the map, so we loop back and try again from the putIfAbsent() (see the sketch below)
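A minimal sketch of that scheme (hypothetical; it assumes LockWrapper's constructor now initializes referenceCount to 1 and addReference() is changed to report failure):
@Override
public void lock(K key) {
    for (;;) {
        LockWrapper wrapper = new LockWrapper(key); // reference count starts at 1
        LockWrapper existing = lockMap.putIfAbsent(key, wrapper);
        if (existing == null) {
            wrapper.lock();
            return;
        }
        if (existing.addReference()) { // false means the count already hit 0
            existing.lock();
            return;
        }
        // stale wrapper, about to be removed from the map: retry from putIfAbsent()
    }
}

// in LockWrapper:
public synchronized boolean addReference() {
    if (referenceCount == 0) return false; // already released by the last holder
    referenceCount++;
    return true;
}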
I would use a fixed array by default for a striped lock, since you can size it to the concurrency level you expect. While there may be hash collisions, a good spreader will resolve that. If the locks are used for short critical sections, then you may be creating contention in the ConcurrentHashMap that defeats the optimization.
You're welcome to adapt my implementation, although I only implemented the dynamic version for fun. It didn't seem useful in practice, so only the fixed one was used in production. You can use the hash() function from ConcurrentHashMap to provide good spreading.
ReentrantStripedLock in: http://code.google.com/p/concurrentlinkedhashmap/wiki/IndexableCache
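For illustration, a minimal fixed-size striped lock along those lines (a hypothetical sketch, not the ReentrantStripedLock from the linked project):
import java.util.concurrent.locks.ReentrantLock;

final class StripedLock {
    private final ReentrantLock[] locks;

    StripedLock(int stripes) { // size to the expected concurrency level
        locks = new ReentrantLock[stripes];
        for (int i = 0; i < stripes; i++) locks[i] = new ReentrantLock();
    }

    ReentrantLock get(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16); // simple spreader to compensate for poor hashCodes
        return locks[(h & 0x7fffffff) % locks.length];
    }
}
Memory stays bounded by the stripe count, at the price of occasionally making unrelated keys share a lock.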
