Difference between JavaDStream & JavaReceiverInputDStream? - spark-streaming

What is the difference between JavaDStream and JavaReceiverInputDStream?
I have already tried both, but I can't see any difference. Also, does it affect the output of the print() function? I ask because in some examples of Twitter streams built with Spark Streaming, the output simply shows the batches, with no log lines blocking the output of the public tweets. Examples of the log lines that block my terminal output are:
INFO [JobGenerator], INFO [sparkDriver-akka.actor.default-dispatcher]
and so on. The whole terminal is filled with these INFO logs, so I cannot see the public tweets or the exceptions clearly.
My next question: at first the Twitter stream runs properly (public tweets are captured), but some time later the receiver stops receiving tweets even though the batches keep running. So it seems my application only receives public tweets at the beginning of the run and then stops receiving them entirely.
Is there a Spark file that contains the logs produced while the program runs? I can't read the logs clearly in the terminal.
Thx & Help me

JavaReceiverInputDStream - A Java-friendly interface to ReceiverInputDStream, the abstract class for defining any input stream that receives data over the network.
JavaDStream - A Java-friendly interface to DStream, the basic abstraction in Spark Streaming that represents a continuous stream of data. DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume, etc.) or generated by transforming existing DStreams using operations such as map and window. For operations applicable to key-value pair DStreams, see JavaPairDStream.
So the difference lies between ReceiverInputDStream and DStream. ReceiverInputDStream has the method getReceiver(), which gets the receiver object that will be sent to the worker nodes to receive data. JavaDStream does not offer this facility.
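As a rough illustration (not taken from the original answer): a receiver-based source such as socketTextStream returns a JavaReceiverInputDStream, while any transformation on it returns a plain JavaDStream; print() behaves the same for both. The host, port, and batch interval below are arbitrary assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamTypesExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("dstream-types");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Receiver-based source: the returned type exposes getReceiver()
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Any transformation returns a plain JavaDStream; it has no receiver of its own
        JavaDStream<String> upper = lines.map(String::toUpperCase);

        upper.print();   // print() behaves the same for both types

        jssc.start();
        jssc.awaitTermination();
    }
}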

Related

Best practices for transforming a batch of records using KStream

I am new to KStream and would like to know best practices or guidance on how to optimally process a batch of n records using KStream. I have working code, shown below, but it only handles single messages at a time.
KStream<String, String> sourceStream = builder.stream("upstream-kafka-topic",
        Consumed.with(Serdes.String(), Serdes.String()));

// transform sourceStream using an implementation of ValueTransformer<String, String>
sourceStream.transformValues(() -> new MyValueTransformer())
        .to("downstream-kafka-topic",
            Produced.with(Serdes.String(), Serdes.String()));
The code above works with single records because MyValueTransformer, which implements ValueTransformer, transforms a single String value. How do I make it work for a collection of String values?
You would need to somehow "buffer / aggregate" the messages. For example, you could add a state store to your transformer and store N messages inside the store. As long as the store contains fewer than N messages you don't do any processing and also don't emit any output (you might want to use flatTransformValues which allows you to emit zero results).
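A minimal sketch of that idea, assuming a pre-registered key-value store named "buffer-store" and an arbitrary BATCH_SIZE (both are assumptions, not part of the original answer):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class BufferingTransformer implements ValueTransformer<String, Iterable<String>> {

    private static final int BATCH_SIZE = 10;          // assumed batch size
    private KeyValueStore<Integer, String> store;      // backing state store, name assumed

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<Integer, String>) context.getStateStore("buffer-store");
    }

    @Override
    public Iterable<String> transform(String value) {
        // read everything buffered so far
        List<Integer> keys = new ArrayList<>();
        List<String> buffered = new ArrayList<>();
        try (KeyValueIterator<Integer, String> iter = store.all()) {
            while (iter.hasNext()) {
                KeyValue<Integer, String> entry = iter.next();
                keys.add(entry.key);
                buffered.add(entry.value);
            }
        }
        buffered.add(value);

        if (buffered.size() < BATCH_SIZE) {
            store.put(keys.size(), value);              // not full yet: keep buffering, emit nothing
            return Collections.emptyList();
        }

        // batch is full: clear the store and emit all buffered values downstream
        for (Integer key : keys) {
            store.delete(key);
        }
        return buffered;
    }

    @Override
    public void close() { }
}

It could be wired in with something like builder.addStateStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("buffer-store"), Serdes.Integer(), Serdes.String())) and then sourceStream.flatTransformValues(BufferingTransformer::new, "buffer-store").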
I'm not sure what you're trying to achieve. Kafka Streams is by design built to process one record at a time. If you want to process a collection or batch of messages, you have a few options.
You might not actually need Kafka Streams, since the example you mentioned doesn't do much with the message; in that case you can leverage a normal consumer, which lets you process in batches. Check the Spring Kafka implementation of this here -> https://docs.spring.io/spring-kafka/docs/current/reference/html/#receiving-messages (Kafka fetches batches at the network layer, and while you would normally process one record at a time, a standard client can also process batches). Alternatively, you could model your value type to contain an array of messages, so that each record you receive is an object with an embedded collection, which you could then process with Kafka Streams; check the array type for Avro -> https://avro.apache.org/docs/current/spec.html#Arrays
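As a rough sketch of the first option, using the plain Kafka consumer client rather than Spring Kafka (the broker address, group id, and batch handling below are assumptions):

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BatchConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // assumed broker address
        props.put("group.id", "batch-transformer");             // assumed consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("max.poll.records", "500");                   // upper bound on batch size

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("upstream-kafka-topic"));
            while (true) {
                // poll() returns whatever batch of records is currently available
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }
                List<String> batch = new ArrayList<>();
                for (ConsumerRecord<String, String> record : records) {
                    batch.add(record.value());
                }
                // hand the whole collection to the transformation logic in one call
                System.out.println("processing a batch of " + batch.size() + " records");
            }
        }
    }
}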
Check this part of the documentation to better understand the Kafka Streams concepts -> https://kafka.apache.org/31/documentation/streams/core-concepts

Need explanation on internal working of read_table method in pyarrow.parquet

I stored all the required parquet tables in a Hadoop filesystem, and each of these files has a unique path for identification. The paths are pushed into a RabbitMQ queue as JSON and are consumed by the consumer (in CherryPy) for processing. After successful consumption, the first path is sent for reading, and the following paths are read once the preceding reads are done. Now, to read a specific table, I am using the following line of code:
data_table = parquet.read_table(path_to_the_file)
Let's say I have five read tasks in the message. The first read is carried out and completes successfully, and then, before the remaining reads are performed, I manually stop my server. This stop does not send a successful-execution acknowledgement to the queue, since four read processes remain. Once I restart the server, the whole consumption and reading process starts again from the initial stage. And now, when read_table is called on the first path, it gets stuck completely.
Digging into the workflow of the read_table method, I found where it actually gets stuck, but I need a further explanation of how this method reads a file from a Hadoop filesystem.
path = 'hdfs://173.21.3.116:9000/tempDir/test_dataset.parquet'
data_table = parquet.read_table(path)
Can somebody please give me a picture of the internal implementation that happens after calling this method, so that I can find where the issue actually occurs and how to solve it?

Apache Flink relating/caching data options

This is a very broad question. I'm new to Flink and looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is: data is collected from various pieces of equipment and received as a JSON-encoded string with the format {“location.attribute”:value, “TimeStamp”:value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters, but the output needs to include a relation to the traceability code, for example {“location.alarm”:value, “location.traceability”:value, “TimeStamp”:value}.
What method does Flink use for caching values, in this case the current traceability code, whilst running analysis over other parameters received at a later time?
I'm mainly just looking for the area to research, as so far I've been unable to find any examples of this kind of scenario. Perhaps it's not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.
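A rough sketch of the keyed-state approach (the Event POJO, its field names, and the "traceability" attribute check are assumptions for illustration, not code from the original answer):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TraceabilityEnricher
        extends KeyedProcessFunction<String, TraceabilityEnricher.Event, TraceabilityEnricher.Event> {

    /** Minimal event POJO assumed for this sketch. */
    public static class Event {
        public String location;
        public String attribute;    // e.g. "traceability" or "alarm"
        public String value;
        public long timestamp;
        public String traceability; // filled in by the enricher

        public Event() {}
    }

    private transient ValueState<String> traceabilityCode;

    @Override
    public void open(Configuration parameters) {
        traceabilityCode = getRuntimeContext().getState(
                new ValueStateDescriptor<>("traceability", String.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        if ("traceability".equals(event.attribute)) {
            // remember the latest traceability code for this location (the key)
            traceabilityCode.update(event.value);
        } else {
            // attach the cached code to every later process-parameter event
            event.traceability = traceabilityCode.value();
            out.collect(event);
        }
    }
}

It would then be applied with something like stream.keyBy(e -> e.location).process(new TraceabilityEnricher()).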

Network shuffle in streaming

So, keyBy or groupBy causes a network shuffle that repartitions the stream. It is said to be pretty expensive, since it involves network communication along with serialization, deserialization, etc.
For example, if I run the following operators:
map(Mapper1).keyBy(0).map(Mapper2)
with a parallelism of 2, I would get something like this:
Mapper1(1) -\-/- Mapper2(1)
             X
Mapper1(2) -/-\- Mapper2(2)
And in the end, all records with the same key from Mapper1 are assigned to the same partition in Mapper2.
My question is:
I want to know what happens during the keyBy or groupBy in streaming. Is every processed element serialized and deserialized by every subtask? How can I compare the cost of keyBy or groupBy with another operation?
Also, I am familiar with the concept of a partitioner in batch systems, but I get a bit confused when I try to apply that to streaming.
Thank you !
Apache Flink buffers the outgoing records of a task and then sends them to the next task for processing. setBufferTimeout is a job-level parameter that can be configured via the StreamExecutionEnvironment, and its default value is 100 ms. After this time, the buffers are sent automatically even if they are not full.
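A small sketch of where that timeout is configured (the source, port, and the 10 ms value are arbitrary assumptions):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Flush network buffers at least every 10 ms, trading some throughput for lower latency.
        // The default is 100 ms; setBufferTimeout(0) flushes after every record,
        // and setBufferTimeout(-1) flushes only when a buffer is full.
        env.setBufferTimeout(10);

        env.socketTextStream("localhost", 9999)   // hypothetical source
           .map(String::toUpperCase)
           .keyBy(value -> value)                 // causes a network shuffle between subtasks
           .print();

        env.execute("buffer-timeout-example");
    }
}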
Also the following links are really helpful to understand the details:
https://flink.apache.org/2019/06/05/flink-network-stack.html
https://flink.apache.org/2019/07/23/flink-network-stack-2.html

Storm: how to set up various metrics for the same data source

I'm trying to set up Storm to aggregate a stream, but with various (DRPC-available) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and the gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me, for example, the total count of messages by gateway and/or by channel. Besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events, and from there aggregate the data as needed. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if any.
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
  - used to emit the messaging data
  - simulates the real data source
SeparateTopology
  - creates a separate DRPC stream for each metric needed
  - also, a separate query state is created for each metric
  - they all use the same spout instance
CombinedTopology
  - creates a single DRPC stream with all the metrics needed
  - creates a separate query state for each metric
  - each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
  - is it necessary to use the same spout instance, or can I just say new RandomMessageSpout() each time?
  - I like the idea that I don't need to persist data grouped by all the metrics, but just the groupings we need to extract later
  - is the spout-emitted data actually processed by all the state/query combinations, i.e. not just the first one that comes?
  - would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
  - I don't really like the idea that I need to persist data grouped by all the metrics, since I don't need all the combinations
  - it came as a surprise that all the metrics always return the same data
      - e.g. channel and gateway inquiries return status metrics data
      - I found that this was always the data grouped by the first field in the state definition
      - this topic explains the reasoning behind this behaviour
      - but I'm wondering if this is a good way of doing things in the first place (and I will find a way around this issue if need be)
  - SnapshotGet vs TupleCollectionGet in stateQuery
      - with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
      - any pointers as to what is the correct way of doing that?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams, one for each metric you need (see below), and then do a groupBy() on that metric's field and a persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
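Roughly, the branched version could look like the sketch below. The field names ("gateway", "channel"), the in-memory state, and the FixedBatchSpout standing in for RandomMessageSpout are assumptions for illustration, not code from the gist:

import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class BranchedMetricsTopology {

    public static TridentTopology build() {
        // stand-in for RandomMessageSpout: a fixed test spout emitting the same fields
        FixedBatchSpout spout = new FixedBatchSpout(
                new Fields("sender", "recipient", "channel", "gateway"), 3,
                new Values("alice", "bob", "sms", "gw1"),
                new Values("carol", "dave", "email", "gw2"));

        TridentTopology topology = new TridentTopology();

        // single spout, single stream ...
        Stream messages = topology.newStream("messages", spout);

        // ... branched into one persistent aggregation per metric
        TridentState countsByGateway = messages
                .groupBy(new Fields("gateway"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        TridentState countsByChannel = messages
                .groupBy(new Fields("channel"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        // DRPC queries (topology.newDRPCStream(...).stateQuery(...)) can then read
        // countsByGateway and countsByChannel independently
        return topology;
    }
}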
