I'm confused about this situation:
I have a Producer that produces an undetermined number of items from an underlying iterator, possibly a large number of them.
Each item must be mapped to a different interface (e.g., a wrapper, or a JavaBean built from a JSON structure).
So I'm thinking it would be good for Producer to return a stream: it's easy to convert an Iterator to a Stream (using Spliterators and StreamSupport.stream()), then apply Stream.map() and return the final stream.
The problem is that I have an invoker that does nothing with the resulting stream (e.g., a unit test), yet I still want the mapping code to be invoked for every item. At the moment I'm simply calling Stream.count() from the invoker to force that.
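To make this concrete, here is roughly what I have (the class and type names are simplified placeholders for my actual code):

import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class Producer {
    // Placeholder for the mapped interface (e.g., a JavaBean built from JSON)
    static class Item {
        final String value;
        Item(String value) { this.value = value; }
    }

    // Wrap the underlying iterator in a lazy Stream and map each element.
    Stream<Item> produce(Iterator<String> source) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(source, Spliterator.ORDERED),
                false)           // sequential stream
            .map(Item::new);     // the mapping that must run for every item
    }
}

// In the unit test I currently force the pipeline with a terminal operation:
// producer.produce(iterator).count();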
My questions are:
Am I doing it wrong? Should I use different interfaces? Note that I find implementing hasNext()/next() for an Iterator cumbersome, mainly because it forces you to create a new class (even if it can be anonymous) and keep a cursor and check it. The same goes for collection views: returning a collection that is materialized up front rather than being a dynamic view over the underlying iterator is out of the question (the input data set might be very large). The only alternative I like so far is a Java implementation of yield(). Nor do I want the stream to be consumed inside Producer (i.e., with forEach()), since some other invoker might want it to perform some real operation.
Is there a better practice to force the stream to be processed?
I have a question about the marker trait Sync after reading Extensible Concurrency with the Sync and Send Traits.
Java's "synchronize" means blocking, so I was very confused about how a Rust struct with Sync implemented whose method is executed on multiple threads would be effective.
I searched but found no meaningful answer. I'm thinking about it this way: every thread will get the struct's reference synchronously (blocking), but call the method in parallel, is that true?
Java: Accesses to this object from multiple threads become a synchronized sequence of actions when going through this code path.
Rust: It is safe to access this type synchronously through a reference from multiple threads.
(The two points above are not canonical definitions; they are just demonstrations of how similar words can be used in sentences to obtain different meanings.)
synchronized is implemented as a mutual-exclusion lock at runtime. Sync is a compile-time promise about the runtime properties of a specific type that allows other types to depend on those properties through trait bounds. A Mutex just happens to be one way to provide Sync behavior. Immutable types usually provide this behavior too, without any runtime cost.
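For the Java side of the comparison, a minimal sketch of what synchronized buys you at runtime:

// synchronized: a runtime mutual-exclusion lock. Threads calling increment()
// concurrently are serialized: each one blocks until it acquires the monitor.
class Counter {
    private int count = 0;

    synchronized void increment() { // acquires this object's monitor, blocking if held
        count++;
    }                               // monitor is released on exit

    synchronized int get() {
        return count;
    }
}

Sync sets up no such runtime arrangement; it only asserts, at compile time, that handing out &T to several threads cannot cause a data race.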
Generally you shouldn't rely on words having exactly the same meaning in different contexts. Java I/O stream != Java collections Stream != RxJava reactive stream ~= tokio Stream. C volatile != Java volatile. Etc.
Ultimately the prose matters a lot more than the keywords, which are just shorthands.
I am controlling the Consumer that gets passed to this forEach, so it may or may not be asked to perform an action.
list.parallelStream().forEach(x -> {});
Streams being lazy, the stream won't iterate, right? Nothing will happen is what I expect. Please tell me if I am wrong.
It will traverse the whole stream, submitting tasks to the fork-join pool, splitting the list into parts, and passing all the list elements to this empty lambda. Currently it's impossible to check at runtime whether a lambda expression is empty or not, so it cannot be optimized away.
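You can observe the traversal with a side effect (a small self-contained sketch):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class EmptyLambdaDemo {
    public static void main(String[] args) {
        List<Integer> list = List.of(1, 2, 3, 4, 5);
        AtomicInteger visited = new AtomicInteger();

        list.parallelStream()
            .peek(x -> visited.incrementAndGet()) // side effect proves the traversal happens
            .forEach(x -> {});                    // "empty" terminal op still pulls every element

        System.out.println(visited.get());        // prints 5: every element was visited
    }
}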
A similar problem appears with Collector. Every collector has a finisher operation, but in many cases it's an identity function like x -> x. In that case the code that uses the collector can sometimes be greatly optimized, but you cannot robustly detect whether the supplied lambda is the identity or not. To solve this, an additional collector characteristic called IDENTITY_FINISH was introduced instead. Were it possible to robustly detect whether the supplied lambda is an identity function, this characteristic would be unnecessary.
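For illustration, here is a hand-rolled collector that declares the characteristic explicitly (a sketch; Collectors.toList() does essentially this internally):

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class IdentityFinishDemo {
    public static void main(String[] args) {
        // The finisher below is an identity function; the characteristic is the
        // explicit promise of that fact, since it cannot be detected from the lambda.
        Collector<String, List<String>, List<String>> toList = Collector.of(
            ArrayList::new,                                        // supplier
            List::add,                                             // accumulator
            (left, right) -> { left.addAll(right); return left; }, // combiner
            finished -> finished,                                  // finisher: identity
            Collector.Characteristics.IDENTITY_FINISH);            // the declared promise

        List<String> result = Stream.of("a", "b").collect(toList);
        System.out.println(result); // [a, b]
    }
}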
Also look at the JDK-8067971 discussion. It proposes creating static constants like Predicate.TRUE (always true) or Predicate.FALSE (always false) to optimize operations like Stream.filter. For example, if Predicate.TRUE is supplied, the filtering step can be removed, and if Predicate.FALSE is supplied, the stream can be replaced with an empty stream at that point. Again, were it possible to detect at runtime that the supplied predicate is always true, it would be unnecessary to create such constants.
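To sketch the idea (the Predicates class and its constants here are hypothetical, mirroring the proposal; they do not exist in the JDK):

import java.util.function.Predicate;
import java.util.stream.Stream;

// Hypothetical sketch: well-known constants can be detected by reference
// comparison, so whole pipeline stages can be skipped.
final class Predicates {
    static final Predicate<Object> TRUE  = x -> true;  // hypothetical shared constant
    static final Predicate<Object> FALSE = x -> false; // hypothetical shared constant

    static <T> Stream<T> filter(Stream<T> stream, Predicate<? super T> p) {
        if (p == TRUE)  return stream;          // filter is a no-op: drop the stage
        if (p == FALSE) return Stream.empty();  // nothing can pass: short-circuit
        return stream.filter(p);                // unknown lambda: must be executed
    }
}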
I have been using Hadoop for quite a while now, but I'm not sure why Hadoop uses its own data types and not the Java data types. I have been searching for this on the internet, but nothing helped. Please help.
The short answer is: because of the serialization and deserialization performance they provide.
The long version:
The primary benefit of using Writables (Hadoop's data types) is in their efficiency. Compared to Java serialization, which would have been an obvious alternative choice, they have a more compact representation. Writables don't store their type in the serialized representation, since at the point of deserialization it is known which type is expected.
Here is a more detailed excerpt from Hadoop: The Definitive Guide:
Java serialization is not compact: classes that implement java.io.Serializable or java.io.Externalizable write their classname and the object representation to the stream. Subsequent instances of the same class write a reference handle to the first occurrence, which occupies only 5 bytes. However, reference handles don't work well with random access, because the referent class may occur at any point in the preceding stream - that is, there is state stored in the stream. Even worse, reference handles play havoc with sorting records in a serialized stream, since the first record of a particular class is distinguished and must be treated as a special case. All these problems can be avoided by not writing the classname to the stream at all, which is the approach Writable takes. The result is that the format is considerably more compact than Java serialization, and random access and sorting work as expected because each record is independent of the others (so there is no stream state).
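To see the size difference concretely, a small sketch (assuming hadoop-common is on the classpath; exact byte counts may vary by JDK version):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import org.apache.hadoop.io.IntWritable;

public class SizeComparison {
    public static void main(String[] args) throws IOException {
        // Writable: just the raw 4-byte int, no class metadata in the stream
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        new IntWritable(163).write(new DataOutputStream(writableBytes));

        // Java serialization: classname and serialization metadata, then the value
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(javaBytes)) {
            out.writeObject(163); // autoboxed to Integer
        }

        System.out.println("Writable:           " + writableBytes.size() + " bytes"); // 4
        System.out.println("Java serialization: " + javaBytes.size() + " bytes");     // roughly 80
    }
}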
Preamble
This is about improving message send efficiency in a JIT compiler. Despite referring to Smalltalk, this question applies to most dynamic JIT-compiled languages.
Problem
Given a message send site, it can be classified as monomorphic, polymorphic or megamorphic. If the receiver of the message send is always of the same type, it is a monomorphic send, as in
10 timesRepeat: [Object new].
where the receiver of new is always Object. For this kind of send, JITs emit monomorphic inline caches.
Sometimes a given send site refers to a few different object types, like:
#(1 'a string' 1.5) do: [:element | element print]
In this case, print is sent to different types of objects. For these cases, JITs usually emit polymorphic inline caches.
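Conceptually, the PIC for that print send behaves like a short chain of type tests (a Java sketch of the idea, not actual JIT output; the helper names are made up for illustration):

// Real PICs are generated machine code; this is only the control-flow shape.
class PicSketch {
    // One test per receiver type seen so far (a monomorphic inline cache
    // is just the one-entry version of this).
    static void printSend(Object element) {
        if (element instanceof Integer) printInteger((Integer) element);   // cached entry 1
        else if (element instanceof String) printString((String) element); // cached entry 2
        else if (element instanceof Double) printFloat((Double) element);  // cached entry 3
        else slowLookupAndCall(element);  // miss: extend the PIC or fall back to full lookup
    }

    static void printInteger(Integer i) { System.out.println(i); }
    static void printString(String s)   { System.out.println(s); }
    static void printFloat(Double d)    { System.out.println(d); }
    static void slowLookupAndCall(Object receiver) { System.out.println(receiver); }
}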
Megamorphic message sends occur when a message is sent to not just a few but many different object types at the same send site. One of the most prominent examples is this:
Behavior>>#new
^self basicNew initialize
Here, basicNew creates the object, then initialize does initialization. You could do:
Object new
OrderedCollection new
Dictionary new
and they will all execute the same Behavior>>#new method. As the implementation of initialize differs across a lot of classes, the PIC will quickly fill up. I'm interested in this kind of send site, knowing that they occur only infrequently (only 1% of sends are megamorphic).
Question
What are the possible and specific optimizations for megamorphic send sites to avoid doing a lookup?
I can imagine a few, and want to know more. After a PIC gets full, we'll have to fall back to the lookup (be it the full one or the global cached one), but to optimize we can:
Recycle the PIC, throwing away all entries (many entries could be old and not used frequently).
Call some sort of specific megamorphic lookup (i.e., one that would cache all previously dispatched types in an array accessed by the type hash); see the sketch after this list.
Inline the containing method (when inlined, the send site may stop being megamorphic).
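For point 2, the hash-indexed cache could look roughly like this (a hypothetical Java sketch; reflection stands in for the VM's real method lookup):

import java.lang.reflect.Method;

// Direct-mapped cache indexed by the receiver's type hash: one slot per
// bucket, a miss falls back to the slow lookup and overwrites the slot.
final class MegamorphicCache {
    private static final int SIZE = 64;                 // power of two, so masking works
    private final Class<?>[] classes = new Class<?>[SIZE];
    private final Method[] targets = new Method[SIZE];
    private final String selector;                      // e.g. "initialize"

    MegamorphicCache(String selector) { this.selector = selector; }

    Method lookup(Object receiver) throws NoSuchMethodException {
        Class<?> type = receiver.getClass();
        int slot = type.hashCode() & (SIZE - 1);        // hash of the receiver's type
        if (classes[slot] == type) {
            return targets[slot];                       // hit: no full lookup needed
        }
        Method found = type.getMethod(selector);        // miss: fall back to full lookup
        classes[slot] = type;                           // overwrite whatever was cached
        targets[slot] = found;
        return found;
    }
}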
I am reading Code Complete 2, Chapter 7.1, and I don't understand the point the author makes below.
7.1 Valid Reasons to Create a Routine
Hide pointer operations
Pointer operations tend to be hard to read and error prone. By isolating them in routines (or a class, if appropriate), you can concentrate on the intent of the operation rather than the mechanics of pointer manipulation. Also, if the operations are done in only one place, you can be more certain that the code is correct. If you find a better data type than pointers, you can change the program without traumatizing the routines that would have used the pointers.
Please explain or give an example of this point.
Essentially, the advice is a specific example of data hiding. It boils down to this:
Stick to object-oriented design and hide your data within objects.
In the case of pointers, the norm is to NEVER expose pointers to "internal" data structures as public members. Rather, make them private and expose ONLY certain meaningful manipulations that are allowed to be performed on the pointers as public member functions.
Portable / Easy to maintain
The added advantage (as explained in the section quoted) is that any change in the internal data structures never forces the external API to change. Only the internal implementation of the publicly exposed member functions needs to be modified to handle any changes.
Code re-use / Easy to debug
Also, pointer manipulations are now NOT copy-pasted and littered all around the code with no idea what exactly they do. They are limited to the member functions, which are written keeping in mind how exactly the internal data structures are being manipulated.
For example, if we have a table of data into which the user is allowed to add rows:
Do NOT expose:
pointers to the head/tail of the table.
pointers to the individual elements.
Instead, create a table object that exposes the functions:
addNewRowTop(newData)
addNewRowBottom(newData)
addNewRow(position, newData)
To take this further, we implement addNewRowTop() and addNewRowBottom() by simply calling addNewRow() with the proper position, itself computed from another internal variable of the table object.
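A minimal Java sketch of that table object (the function names come from the list above; LinkedList is a stand-in for whatever pointer-based structure is really used internally):

import java.util.LinkedList;

// The linked structure stays private, so callers never touch node pointers.
public class Table<T> {
    private final LinkedList<T> rows = new LinkedList<>(); // internal, free to change

    public void addNewRowTop(T newData)    { addNewRow(0, newData); }
    public void addNewRowBottom(T newData) { addNewRow(rows.size(), newData); }

    public void addNewRow(int position, T newData) {
        rows.add(position, newData); // the only place that manipulates the structure
    }
}

If the internal LinkedList is later replaced by, say, an array-backed structure, only addNewRow() changes; no caller is affected.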