Spark Streaming: how to sum up the results of several DStreams? - spark-streaming

I am using Spark Streaming + Kafka to build my message processing system, but I have a small technical problem, described below.
For example, I want to do a word count every 10 minutes, so in my earliest code I set the batch interval to 10 minutes. The code looks like this:
val sparkConf = new SparkConf().setAppName(args(0)).setMaster(args(1))
val ssc = new StreamingContext(sparkConf, Minutes(10))
But I don't think this is a very good solution, because 10 minutes is a long time and the batch holds more data than my memory can sustain. So I want to reduce the batch interval to 1 minute, like this:
val sparkConf = new SparkConf().setAppName(args(0)).setMaster(args(1))
val ssc = new StreamingContext(sparkConf, Minutes(1))
Then the problem is: how can I sum up the results of ten 1-minute batches into a 10-minute result? I think this work can only be done in the driver instead of the worker program. What can I do?
I am new to Spark Streaming. Can anyone give me a hand?

I may have an idea. In this situation I should use a stateful operation such as updateStateByKey(): what I want is a global 10-minute result, but each batch only gives me the intermediate result for 1 minute. So before each 10-minute period ends, I have to keep the state from each 1-minute batch (for example its word counts) and add the new counts to it as each batch arrives.

Posting here as I had a similar issue and came across the Window Operations section of the Spark Streaming guide. In the poster's original case, they want a count for the past 10 minutes, produced every 10 minutes, although their program calculates counts every 1 minute. Assuming we have counts defined and calculated as the standard word count (i.e. at a 1-minute batch duration, with tuples (word, count)), we could follow the guide and define something along the lines of
// Reduce/count the last 10 minutes' worth of data, every 10 minutes
val windowedWordCounts = counts.reduceByKeyAndWindow(_ + _, Minutes(10), Minutes(10))
where _ + _ is a sum function. Both the window length and the slide interval must be multiples of the batch interval (here 1 minute).
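To make the windowing idea concrete, here is a minimal sketch in plain Python (not the Spark API) of what a window whose length equals its slide interval computes: the per-batch (word, count) results are summed key by key over consecutive, non-overlapping groups of batches. The function name tumbling_window_counts and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def tumbling_window_counts(batches, window_size):
    """Sum per-batch word counts over consecutive, non-overlapping windows.

    batches: list of dicts mapping word -> count, one dict per batch
             (one per 1-minute batch in the question's setup).
    window_size: number of batches per window (10 in the question).
    """
    results = []
    for start in range(0, len(batches), window_size):
        total = Counter()
        for batch in batches[start:start + window_size]:
            # Counter.update adds counts key by key, which plays the role
            # of the _ + _ reduce function in the Scala snippet.
            total.update(batch)
        results.append(dict(total))
    return results


# Three "1-minute" batches grouped into windows of two batches each.
example = tumbling_window_counts(
    [{"a": 1, "b": 2}, {"a": 3}, {"b": 1}], window_size=2
)
```

In Spark itself the summation is distributed across the workers, so the driver does not need to add the ten batch results by hand.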

Related

Find the difference between 2 dates and check if smaller than a given value

My issue is that I want to take two timestamps and check whether the second (later) one is less than 59 minutes after the first one.
Following this thread, Compare two dates with JavaScript,
the Date object may do the job.
But the first thing I am not happy with is that it takes the time from my system.
Is it possible to get the time from some public server or something?
There is always a chance that the system clock gets manipulated between the two timestamps, so that would be too unreliable.
Some outside source would be great.
Then I am not too sure how to get the difference between two times (using two Date objects).
Many issues may pop up:
The times could be something like 3:59 and 6:12,
so just comparing minutes would give the wrong idea.
So we consider hours too.
But then there is the issue with modulo 24:
day 3 23:59 and day 4 0:33 wouldn't be handled properly either.
So include days too.
Then there is the modulo 30 thing, which on top of that changes from month to month.
So month and year have to be included as well.
So we would need the whole date, everything from the current year down to the second (because seconds would be nice too, for precision),
and comparing them would require tons of if clauses for year, month, etc.
Do Date objects have some predefined comparison function that actually keeps all these things in mind (I haven't even mentioned leap years yet, have I)?
Timing is very important, because exactly at the 59-minute mark (being off by up to 5 seconds wouldn't matter, but getting remotely close to 60 is forbidden)
a certain function has to be called that closes a website without fail.
The script opens the website at the 0-minute mark, does some stuff rinse-and-repeat style, and closes the page at the 59-minute mark.
Checking the time every few seconds would be smart.
Any good ideas how to implement such a time comparison that doesn't take too much computing power, yet is robust, so that a new month starting and the like doesn't mess it up?
You can compare the two Date objects directly. When creating a Date you can also pass it a value, as in new Date(value).
You can use this API to get the current UTC time; it returns a JSON object like this example:
{
"$id":"1",
"currentDateTime":"2019-11-09T21:12Z",
"utcOffset":"00:00:00",
"isDayLightSavingsTime":false,
"dayOfTheWeek":"Saturday",
"timeZoneName":"UTC",
"currentFileTime":132178075626292927,
"ordinalDate":"2019-313",
"serviceResponse":null
}
So you can use either the currentFileTime or the currentDateTime returned from that API to construct your date object.
Example:
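const date1 = new Date('2019-11-09T21:12Z') // time when I started writing this answer
const date2 = new Date('2019-11-09T21:16Z') // time when I finished writing this answer
const diffMs = date2 - date1 // subtracting two Date objects yields milliseconds
console.log(diffMs / 60000) // minutes it took me to write this: 4
console.log(diffMs < 59 * 60 * 1000) // true: the two times are less than 59 minutes apart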
Please keep in mind that due to network speeds, the time API will be a little bit off (by a few milliseconds)
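The worry about modulo 24, month lengths, and leap years goes away once both timestamps are treated as absolute instants and subtracted as a whole. As a rough illustration of that idea outside JavaScript, here is a minimal Python sketch; the function name within_limit and the sample timestamps are hypothetical, and the 59-minute limit comes from the question.

```python
from datetime import datetime, timedelta, timezone

# The question's limit: the second timestamp must be less than
# 59 minutes after the first.
LIMIT = timedelta(minutes=59)


def within_limit(start, end, limit=LIMIT):
    """Return True if `end` is less than `limit` after `start`.

    Subtracting two datetimes gives a timedelta that already accounts
    for day, month, and year boundaries, including leap years.
    """
    return end - start < limit


# Crossing a year boundary: 23:59 on 31 Dec to 00:33 on 1 Jan is 34 minutes.
new_year = within_limit(
    datetime(2019, 12, 31, 23, 59, tzinfo=timezone.utc),
    datetime(2020, 1, 1, 0, 33, tzinfo=timezone.utc),
)

# Crossing into a leap day: 23:30:00 to 00:29:30 is 59 minutes 30 seconds.
leap_day = within_limit(
    datetime(2020, 2, 28, 23, 30, 0, tzinfo=timezone.utc),
    datetime(2020, 2, 29, 0, 29, 30, tzinfo=timezone.utc),
)
```

The same single subtraction works in JavaScript, where date2 - date1 gives the difference in milliseconds.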

Loop is taking around 10 minutes in vb

For Each drow As DataGridViewRow In DgvItemList.Rows
drow.Cells("strSrNo").Value = drow.Index + 1
Next
I have more than 3500 records in DgvItemList. I am just numbering those records, but it took 9 to 10 minutes.
How can I reduce this time?
Two things. Each time you change the value, it could cause the DataGridView to update, so just before your loop, add
DgvItemList.SuspendLayout
and after the loop, add
DgvItemList.ResumeLayout
You could also change the loop to a Parallel.For loop, so your final code would be something like
DgvItemList.SuspendLayout
Parallel.For(0, DgvItemList.Rows.Count, Sub(index As Integer)
DgvItemList.Rows(index).Cells("strSrNo").Value = DgvItemList.Rows(index).Index + 1
End Sub)
DgvItemList.ResumeLayout
Try it with just the Suspend and Resume layout first. You may not get a vast amount of improvement from the parallelization, and Windows Forms controls are not thread-safe, so treat the Parallel.For version with caution.

How to save start time of all the individual samples in my jmeter test and use that in JSR223 Listener

I'm using InfluxDB to save the results of my JMeter test.
Below is the part of the code in my JSR223 Listener where I need your help.
result = new StringBuilder();
result.append("Thro_5,")
.append("label=")
.append(escapeValue(sampleResult.getSampleLabel()))
result.append("count=")
count=sampleResult.getSampleCount();
result.append(count)
result.append(",duration=")
dur1=sampleResult.getStartTime();
result.append(sampleResult.getEndTime()-sampleResult.getStartTime())
*****here code to write data to influxdb*****
I'm trying this code, in which I want to know the total duration the sample has taken so far, in order to calculate throughput.
a=sampleResult.getEndTime()-sampleResult.getStartTime()
.append(",throughput_=")
.append(totalSamplecount/(a/1000))
In the code above, sampleResult.getStartTime() should be the start time of the sample in the first loop iteration.
If I have 3 samples in my test with a loop count of 3, I want to save the start time of each sample in the first iteration and use that value when calculating the throughput of each sample.
Then, while I'm in the 3rd loop, I want to know the total duration so far since the first iteration, and compute totalSamplecount/duration.
As far as I know, sampleResult holds the result of the current sample only.
I'm stuck on 2 points:
Saving the start time of each sample and using it later in each iteration to calculate the duration.
Saving the total count of individual samples executed so far.
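One way to think about the two points is as a small piece of bookkeeping keyed by sample label: remember the first start time seen for each label, keep a running count, and divide the count by the time elapsed since that first start. The sketch below shows only this logic in plain Python; it is not JMeter or Groovy code, and the class name ThroughputTracker and its method are hypothetical. In a JSR223 Listener the two dictionaries would have to live somewhere that survives between invocations (JMeter properties are one possibility, which is an assumption here and not something the question confirms).

```python
class ThroughputTracker:
    """Tracks, per sample label, the first start time and the sample count."""

    def __init__(self):
        self.first_start_ms = {}  # label -> start time of the first sample (ms)
        self.counts = {}          # label -> number of samples seen so far

    def record(self, label, start_ms, end_ms, sample_count=1):
        """Record one sample and return throughput in samples per second.

        Throughput is the total count for this label divided by the time
        elapsed between the label's first start and this sample's end.
        """
        # setdefault only stores the start time the first time a label appears.
        self.first_start_ms.setdefault(label, start_ms)
        self.counts[label] = self.counts.get(label, 0) + sample_count
        elapsed_s = (end_ms - self.first_start_ms[label]) / 1000.0
        # Guard against a zero-length interval on the very first sample.
        return self.counts[label] / elapsed_s if elapsed_s > 0 else 0.0


tracker = ThroughputTracker()
first = tracker.record("login", start_ms=1000, end_ms=1500)
second = tracker.record("login", start_ms=2000, end_ms=3000)
```

Here the second call divides 2 samples by the 2 seconds between the first start (1000 ms) and the latest end (3000 ms).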

How to efficiently print strings with a time span condition in a stream?

Suppose we get a stream of string input with a clear data structure:
content : arrival time
Below are samples:
AAA : 12:00:00
ABC : 12:00:01
ABB : 12:00:02
ABM : 12:00:11
And we have a program that checks this input stream, and:
1) if this content has not been seen before, print the content;
2) if this content has arrived before and the time span is less than 10 seconds, print an empty string;
3) if this content has arrived before and the time span is more than 10 seconds, print the content.
A Hashtable(String, Date) works, and we can update the date when a new item comes in.
My question is:
What if the number of strings is very large and cannot be stored in a hashtable? Consider that we are designing a program that runs 24/7, so the hashtable grows bigger and bigger.
Is there any other way to solve this? And can we solve it with several servers?
You could use a Hashtable(String -> Date) and a simple Queue([Date, String]). Every time you add a new item to the hashtable, add it to the queue as well.
Every few minutes, pop items from the queue and remove them from the hashtable until you reach an item that isn't old enough to expire (less than 10 seconds old).
This way you only hold data from the last few minutes, and neither the hashtable nor the queue will grow too much.
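The hashtable-plus-queue idea can be sketched in Python as follows. This is only an illustration under stated assumptions: arrival times are plain numbers of seconds in non-decreasing order, the stored time is refreshed on every arrival, eviction runs on each arrival and not every few minutes, and a gap of exactly 10 seconds counts as recent (the question leaves that case open). The class name RecentFilter and its methods are hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 10  # time span from the question


class RecentFilter:
    """Suppresses content already seen within the last `window` seconds."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.last_seen = {}   # content -> most recent arrival time
        self.queue = deque()  # (arrival time, content), oldest first

    def _evict(self, now):
        # Drop queue entries older than the window. A queue entry may be
        # stale if the same content arrived again later, so the hashtable
        # entry is removed only when its time matches the queue entry.
        while self.queue and now - self.queue[0][0] > self.window:
            ts, content = self.queue.popleft()
            if self.last_seen.get(content) == ts:
                del self.last_seen[content]

    def offer(self, content, now):
        """Return the content if it should be printed, else an empty string."""
        self._evict(now)
        # After eviction, anything still in the hashtable was seen within
        # the window, so its presence alone means "suppress".
        seen_recently = content in self.last_seen
        self.last_seen[content] = now
        self.queue.append((now, content))
        return "" if seen_recently else content


f = RecentFilter()
outputs = [
    f.offer("AAA", 0),   # first time seen
    f.offer("ABC", 1),   # first time seen
    f.offer("AAA", 5),   # seen 5 seconds ago
    f.offer("AAA", 11),  # seen 6 seconds ago (time was refreshed at 5)
    f.offer("AAA", 30),  # last seen 19 seconds ago
]
```

Because expired entries are removed as time moves on, the memory used depends on how many distinct strings arrive within one window, not on how long the program has been running.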

Limitation in retrieving rows from a mongodb from ruby code

I have code that gets all the records from a MongoDB collection and then performs some computations.
My program takes too much time because "coll_id.find().each do |eachitem|......." returns only 300 records at a time.
If I place a counter inside the loop, it prints 300 records and then pauses for around 3 to 4 seconds before printing the counter values for the next set of 300 records.
coll_id.find().each do |eachcollectionitem|
puts "counter value for record " + counter.to_s
counter=counter +1
---- My computations here -----
end
Is this a limitation of the Ruby MongoDB API, or does some configuration need to be changed so that the code can access all the records at once?
How large are your documents? It's possible that the deserialization is taking a long time. Are you using the C extensions (bson_ext)?
You might want to try passing a logger when you connect. That could help sort out what's going on. Alternatively, can you paste in the MongoDB log? What's happening there during the pause?
