I try store and parse&store some raw data with two strategies (serial & parallel)
Flux<PanasonicData> f = Flux.create(sink -> dataRepo.addConsumer(sink::next));
Flux.from(f).publishOn(Schedulers.single()).subscribe(this::save1);
Flux.from(f).publishOn(Schedulers.parallel()).map(MyClass::parse).subscribe(this::save2);
Or
ConnectableFlux<PanasonicData> cf = Flux.create(sink -> dataRepo.addConsumer(sink::next)).publish();
cf.autoConnect().publishOn(Schedulers.single()).subscribe(this::save1);
cf.autoConnect().publishOn(Schedulers.parallel()).map(MyClass::parse).subscribe(this::save2);
But the second task is never ran !!!
How can i run this two tasks with this two different strategies?
You can specify the minimum number of subscribers via autoConnect(int minSubscribers):
Flux<PanasonicData> cf = Flux.create(sink -> dataRepo.addConsumer(sink::next)).publish().autoConnect(2);
cf.publishOn(Schedulers.single()).subscribe(this::save1);
cf.publishOn(Schedulers.parallel()).map(MyClass::parse).subscribe(this::save2);
Related
Setting 'common' properties for child tasks is not working
The SCDF version I'm using is 2.9.6.
I want to make CTR A-B-C, each of tasks does follows:
A : sql select on some source DB
B : process DB data that A got
C : sql insert on some target DB
Simplest way to make this work seems to define shared work directory folder Path "some_work_directory", and pass it as application properties to A, B, C. Under {some_work_directory}, I just store each of task result as file, like select.result, process.result, insert.result, and access them consequently. If there is no precedent data, I could assume something went wrong, and make tasks exit with 1.
================
I tried with a composed task instance QWER, with two task from same application "global" named as A, B. This simple application prints out test.value application property to console, which is "test" in default when no other properties given.
If I tried to set test.value in global tab on SCDF launch builder, it is interpreted as app.*.test.value in composed task's log. However, SCDF logs on child task A, B does not catch this configuration from parent. Both of them fail to resolve input given at launch time.
If I tried to set test.value as row in launch builder, and pass any value to A, B like I did when task is not composed one, this even fails. I know this is not 'global' that I need, it seems that CTR is not working correctly with SCDF launch builder.
The only workaround I found is manually setting app.QWER.A.test.value=AAAAA and app.QWER.B.test.value=BBBBB in launch freetext. This way, input is converted to app.QWER-A.app.global4.test.value=AAAAA, app.QWER-B.app.global4.test.value=BBBBB, and print well.
I understand that, by this way, I could set detailed configurations for each of child task at launch time. However, If I just want to set some 'global' that tasks in one CTR instance would share, there seems to be no feasible way.
Am I missing something? Thanks for any information in advance.
CTR will orchestrate the execution of a collection of tasks. There is no implicit data transfer between tasks. If you want the data from A to be the input to B and then output of B becomes the input of C you can create one Task / Batch application that have readers and writers connected by a processor OR you can create a stream application for B and use JDBC source and sink for A and C.
Apologies if this has been already covered before here, I couldn't find anything closely related. I have this Kafka Streams app which reads from multiple topics, persist the records on a DB and then publish an event to an output topic. Pretty straightforward, it's stateless in terms of kafka local stores. (Topology below)
Topic1(T1) has 5 partitions, Topic2(T2) has a single partition. The issue here is, while consuming from two topics, if I want to go "full speed" with T1 (5 consumers), it doesn't guarantee that I will have dedicated consumers for each partition on T1. It will be distributed within the two topic partitions and I might end up with unbalanced consumers (and idle consumers), something like below:
[c1: t1p1, t1p3], [c2: t1p2, t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: (idle consumer)]
[c1: t1p1, t1p2], [c2: t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: t1p3]
With that said:
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
Is there any way to achieve a partition assignment like the following if I want go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Which of the topologies below is most optimal to what I want to achieve? Or is it completely unrelated?
Option A (Current topology)
Topologies:
Sub-topology: 0
Source: topic1-source (topics: [TOPIC1])
--> topic1-processor
Processor: topic1-processor (stores: [])
--> topic1-sink
<-- topic1-source
Sink: topic1-sink (topic: OUTPUT-TOPIC)
<-- topic1-processor
Sub-topology: 1
Source: topic2-source (topics: [TOPIC2])
--> topic2-processor
Processor: topic2-processor (stores: [])
--> topic2-sink
<-- topic2-source
Sink: topic2-sink (topic: OUTPUT-TOPIC)
<-- topic2-processor
Option B:
Topologies:
Sub-topology: 0
Source: topic1-source (topics: [TOPIC1])
--> topic1-processor
Source: topic2-source (topics: [TOPIC2])
--> topic2-processor
Processor: topic1-processor (stores: [])
--> response-sink
<-- topic1-source
Processor: topic2-processor (stores: [])
--> response-sink
<-- topic2-source
Sink: response-sink (topic: OUTPUT-TOPIC)
<-- topic2-processor, topic1-processor
If I use two streams for each topic instead of a single streams with multiple topic, would that work for what I am trying to achieve?
config1.put("application.id", "app1");
KakfaStreams stream1 = new KafkaStreams(config1, topologyTopic1);
stream1.start();
config2.put("application.id", "app2");
KakfaStreams stream2 = new KafkaStreams(config2, topologyTopic2);
stream2.start();
The initial assignments you describe, would never happen with Kafka Streams (And also not with any default Consumer config). If there are 5 partitions and you have 5 consumers, each consumer would get 1 partition assigned (for a plain consumer with a custom PartitionAssignor you could do the assignment differently, but all default implementations would ensure proper load balancing).
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
There is not issue with that.
Is there any way to achieve a partition assignment like the following if I want go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Depending how you write your topology, this would be the assignment Kafka Streams uses out-of-the-box. For you two options, option B would result in this assignment.
Which of the topologies below is most optimal to what I want to achieve? Or is it completely unrelated?
As mentioned above, Option B would result in the assignment above. For Option A, you could actually even use a 6th instance and each instance would processes exactly one partition (because there are two sub-topologies, you get 6 tasks, 5 for sub-topology-0 and 1 for sub-topology-1; sub-topologies are scaled out independently of each other); for Option A, you only get 5 tasks though because there is only one sub-topology and thus the maximum number of partitions of both input topic (that is 5) determines the number of tasks.
If I use two streams for each topic instead of a single streams with multiple topic, would that work for what I am trying to achieve?
Yes, it would be basically the same as Option A -- however, you get two consumer groups and thus "two application" instead of one.
I am running a very simple program which counts words in a S3 Files
JavaRDD<String> rdd = sparkContext.getSc().textFile("s3n://" + S3Plugin.s3Bucket + "/" + "*", 10);
JavaRDD<String> words = rdd.flatMap(s -> java.util.Arrays.asList(s.split(" ")).iterator()).persist(StorageLevel.MEMORY_AND_DISK_SER());
JavaPairRDD<String, Integer> pairs = words.mapToPair(s -> new Tuple2<String, Integer>(s, 1)).persist(StorageLevel.MEMORY_AND_DISK_SER());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b).persist(StorageLevel.MEMORY_AND_DISK_SER());
//counts.cache();
Map m = counts.collectAsMap();
System.out.println(m);
After running the program multiple times, I can see multiple entries
Storage entries
This means that everytime I am running the process it keeps on creating new cache.
The time taken for running the script everytime remains the same.
Also when I run the program, I always see these kind of logs
[Stage 12:===================================================> (9 + 1) / 10]
My understanding was that when we cache Rdds, it wont do the operations again and fetch the data from the cache.
So I need to understand that why Spark doesnt use the cached rdd and instead creates a new cached entry when the process is run again.
Does spark allows to use cached rdds across Jobs or is it available only in the current context
Cached data only persists for the length of your Spark application. If you run the application again, you will not be able to make use of cached results from previous runs of the application.
In logs it will show the total stages but when you go to localhost:4040 you can see there is some task skip because of caching so monitor jobs more properly with spark UI localhost:4040
I work with Spark 1.4.1. I want to listen to two different streams at the same time and find common events in both the streams.
For example: Assume one stream of temperature data and another stream of pressure data. I want to listen both the the streams and give an alert when both are high.
I have two questions
Is it possible to process to two different streams in a single spark
context.
Is it possible to have multiple spark context with variable window sizes in a single driver program.
Any other idea on how to work on the above situation will also be deeply appreciated.
Thanks
You can create multiple DStreams from the same StreamingContext. E.g.
val dstreamTemp: DStream[String, Int] = KafkaUtils.createStream(ssc, zkQuorum, group, "TemperatureData").map(...)
val dstreamPres: DStream[String, Int] = KafkaUtils.createStream(ssc, zkQuorum, group, "PressureData")
They will both have the same "batch duration" as that is defined on the StreamingContext. However, you can create new Windows:
val windowedStreamTemp = dstreamTemp.window(Seconds(20))
val windowedStreamPres = dstreamPres.window(Minutes(1))
You can also join the streams (assuming a stream of key-values). E.g.
val joinedStream = windowedStreamTemp.join(windowedStreamPres)
You can then alert on the joinedStream.
I am running my program on a Spark cluster. But when I look at the UI while the job is running, I see that only one worker does most of the tasks. My cluster has one master and 4 workers where the master is also a worker.
I want my task to complete as quickly as possible and I believe that if the number of tasks were to be divided equally among the workers, the job will be completed faster.
Is there any way I can customize this?
System.setProperty("spark.default.parallelism","20")
val sc = new SparkContext("spark://10.100.15.2:7077","SimpleApp","/home/madhura/spark",List("hdfs://master:54310/simple-project_2.10-1.0.jar"))
val dRDD = sc.textFile("hdfs://master:54310/in*",10)
val keyval=dRDD.coalesce(100,true).mapPartitionsWithIndex{(ind,iter) => iter.map(x => process(ind,x.trim().split(' ').map(_.toDouble),q,m,r))}
I tried this but it did not help.