We create an hourly context in Esper and use a hint (enable_outputlimit_opt) with the output rate limiter, but we still run out of memory in certain cases.
Heap size (-Xmx): 12 GB
In some cases the data compresses badly. For example, we receive 6 GB of data that barely aggregates, so Esper's rollup shrinks it very little.
I also want to limit the size in Esper, so that if we hit the limit before the hourly context ends, I can flush the large chunk of data that compresses badly.
In other words, the output should be flushed either when the hourly context ends or when the size limit is reached, whichever comes first.
My Esper query:
private static final String HOURLY_CONTEXT =
"create context HourlyRollup start(0,*,*,*,*,0) end(59,*,*,*,*,59)";
private static final String HINT = "@Hint('enable_outputlimit_opt') ";
private static final String HOURLY_STATEMENT = HINT+
"context HourlyRollup "
+ "select count(*) as xcount,hourlyFloor(min(from_time)),a,b,c,d,e,f,"
+ "g,h,sum(h),sum(i),j,k,l,"
+ "m,n,y,o,p,q,r "
+ "from io.common.Bean where Dir in (-5,-3,0,1) "
+ "group by a,b,c,d,e,f,g,Direction,h,"
+ "i,j,k,l,m,l,n,o,p output all "+"when terminated";
Make sure your query is fully aggregated; otherwise Esper must retain the events, since the select clause asks for event data. When the query is fully aggregated, Esper retains only aggregated data, and that is what you want.
This is not fully aggregated:
select a, b, count(*) from Event group by a
This is better and fully aggregated:
select a, b, count(*) from Event group by a, b
The doc for this is here.
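The "flush on the hour or on size, whichever comes first" idea from the question can be sketched in plain Java. This is only an illustration under assumptions: the class SizeOrHourFlusher, its maxGroups limit and its per-key counting are hypothetical and are not part of the Esper API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not Esper): counts events per group key and flushes
 *  when the hour changes or when too many groups are held in memory. */
public class SizeOrHourFlusher {
    private static final long HOUR_MS = 3_600_000L;

    private final int maxGroups;                        // assumed size limit, in groups
    private final Map<String, Long> counts = new HashMap<>();
    private final List<Map<String, Long>> flushed = new ArrayList<>();
    private long currentHour = -1;                      // -1 means no event seen yet

    public SizeOrHourFlusher(int maxGroups) {
        this.maxGroups = maxGroups;
    }

    /** Adds one event; timestamps are assumed non-negative and non-decreasing. */
    public void onEvent(String groupKey, long timestampMs) {
        long hour = timestampMs / HOUR_MS;
        if (currentHour != -1 && hour != currentHour) {
            flush();                                    // hourly boundary reached
        }
        currentHour = hour;
        counts.merge(groupKey, 1L, Long::sum);
        if (counts.size() >= maxGroups) {
            flush();                                    // size limit reached first
        }
    }

    private void flush() {
        if (!counts.isEmpty()) {
            flushed.add(new HashMap<>(counts));         // hand off a copy, then forget
            counts.clear();
        }
    }

    public List<Map<String, Long>> flushedBatches() {
        return flushed;
    }

    public int pendingGroups() {
        return counts.size();
    }
}
```

With an early size-based flush, one hour can produce several partial batches, so whatever consumes the output would have to merge partial aggregates for the same key and hour.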
The documentation says that a SOQL for loop retrieves all sObjects using calls to query and queryMore, whereas a list for loop retrieves a number of object records. It advises using a SOQL for loop over a list for loop to avoid the heap size limit error.
Total heap size limit: 6 MB synchronous and 12 MB asynchronous.
In the case below, say each record takes 2 KB; 50,000 records then take 50,000 * 2 = 100,000 KB (approximately 100 MB in conList), which causes a heap size limit error because the allowed limit is 6 MB for synchronous execution.
List<Contact> conList = new List<Contact>();
conList = [SELECT Id, Phone FROM Contact];
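The arithmetic above can be checked with a few lines of Java. This is just a sketch: the class and method names are made up for illustration, and the 2 KB per record figure is the question's own assumption.

```java
public class HeapEstimate {
    /** Total size in KB for a given record count and per-record size in KB. */
    public static long totalKb(long records, long kbPerRecord) {
        return records * kbPerRecord;
    }

    /** True when the estimate exceeds a limit given in MB (1 MB = 1024 KB). */
    public static boolean exceedsLimit(long totalKb, long limitMb) {
        return totalKb > limitMb * 1024;
    }

    public static void main(String[] args) {
        long total = totalKb(50_000, 2);   // figures taken from the example above
        System.out.println(total + " KB, about " + (total / 1024.0) + " MB");
        System.out.println("over the 6 MB synchronous limit: " + exceedsLimit(total, 6));
        System.out.println("over the 12 MB asynchronous limit: " + exceedsLimit(total, 12));
    }
}
```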
To avoid this we should use a "SOQL for loop", as the con variable highlighted below will hold 1 record at a time, i.e. 2 KB of data at a time, thus preventing the heap size limit error.
for (List<Contact> con : [SELECT Id, Name FROM Contact]) {
    // process con here
}
Question: what does it mean that, with a "SOQL for loop", the con variable highlighted above will hold 1 record at a time, i.e. 2 KB of data at a time?
The main difference is that you can't use the retrieved records outside of the for loop if you go with the SOQL for loop. When you store the records in a List, you can manipulate that list in the for loop, and you can also use it in other operations later.
To give you an example:
List<Contact> conList = [SELECT Id, Name FROM Contact LIMIT 100];
for (Contact c : conList) {
    c.Title = 'Mr/Mrs';
}
update conList; // I am able to use the same list in the update call.
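As a rough Java analogy (not Apex), the difference between holding every record in one list and looping over batches can be sketched like this. The names are hypothetical and the batch size is just a parameter; in Apex, the list form of the SOQL for loop is documented as handing over records in batches of 200.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class BatchLoopSketch {
    /** Lazily yields ids 0..total-1 in batches, so only one batch exists at a time. */
    public static Iterable<List<Integer>> batches(int total, int batchSize) {
        return () -> new Iterator<List<Integer>>() {
            private int cursor = 0;

            @Override
            public boolean hasNext() {
                return cursor < total;
            }

            @Override
            public List<Integer> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                List<Integer> batch = new ArrayList<>();
                while (cursor < total && batch.size() < batchSize) {
                    batch.add(cursor++);
                }
                return batch;
            }
        };
    }

    /** Returns the largest batch seen, i.e. the most records held at once. */
    public static int maxHeld(int total, int batchSize) {
        int max = 0;
        for (List<Integer> batch : batches(total, batchSize)) {
            max = Math.max(max, batch.size());   // each batch goes out of scope after its iteration
        }
        return max;
    }
}
```

The trade-off matches the answer above: each batch is only reachable inside the loop body, whereas a single list stays usable afterwards at the cost of holding everything at once.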
We are using Esper to aggregate (i.e. group by) a certain set of events, but Esper is not dereferencing the aggregated objects.
Esper query:
private static final String HOURLY_CONTEXT =
"create context HourlyRollup start(0,*,*,*,*,0) end(59,*,*,*,*,59)";
This is our hourly context...
The bean in this query is not being dereferenced, and we end up with gigabytes of these objects.
private static final String HOURLY_STATEMENT =
"context HourlyRollup "
+ "select count(*) as xcount, hourlyFloor(min(from_time)), a, b, c, d, e, f,"
+ "g,h,sum(h),sum(i),j,k,l,"
+ "m,n,y,o,p,q,r "
+ "from io.common.Bean where Dir in (-5,-3,0,1) "
+ "group by a,b,c,d,e,f,g,Direction,h,"
+ "i,j,k,l,m,l,n,o,p output all "
+ "when terminated order by a,b,c,Dir,d,e";
private static final int HOURLY = RollupPeriod.HOURLY.ordinal();
When a select-clause selects properties of each event that are not in the group-by clause, that means that Esper cannot forget the event itself and keeps the events around until it is time to output and terminate.
This type of query is described at http://esper.espertech.com/release-8.2.0/reference-esper/html_single/index.html#processingmodel_aggregation_batch_group_agg
When a select-clause only has aggregated properties plus un-aggregated properties that appear in the group-by clause, that means Esper can forget the event and keep the aggregated value instead.
This type of query is described at http://esper.espertech.com/release-8.2.0/reference-esper/html_single/index.html#processingmodel_aggregation_batch_full_agg
So, check the expressions in the select-clause and make sure that event properties are either in the group-by clause or are all aggregated, for example "last(r)" or "first(r)" or similar.
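The check described above can be done mechanically. The sketch below is plain Java, not an Esper API: the helper name is hypothetical, and the two lists are copied by hand from the select and group-by clauses of the query in the question.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FullAggregationCheck {
    /** Returns the un-aggregated select properties that are missing from the group-by list. */
    public static Set<String> notGrouped(List<String> plainSelectProps, List<String> groupBy) {
        Set<String> missing = new LinkedHashSet<>(plainSelectProps);
        missing.removeAll(groupBy);
        return missing;
    }

    public static void main(String[] args) {
        // Un-aggregated properties in the select-clause of the question's query.
        List<String> select = List.of("a", "b", "c", "d", "e", "f", "g", "h",
                "j", "k", "l", "m", "n", "y", "o", "p", "q", "r");
        // Group-by clause of the same query, duplicates included as written.
        List<String> groupBy = List.of("a", "b", "c", "d", "e", "f", "g", "Direction", "h",
                "i", "j", "k", "l", "m", "l", "n", "o", "p");
        System.out.println(notGrouped(select, groupBy));   // prints [y, q, r]
    }
}
```

Any property this reports would need to be added to the group-by clause or wrapped in an aggregation such as last(...) for the query to count as fully aggregated.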
I have a Spring Cloud Kafka Streams application that rekeys incoming data so that two topics can be joined, then applies selectKey, mapValues and an aggregation. Over time the consumer lag increases, and scaling out by adding more instances of the app doesn't help a bit. With every instance the consumer lag still seems to increase.
I scaled the instances up and down between 1 and 18, but noticed no big difference. The number of messages it lags behind keeps increasing every 5 seconds, independent of the number of instances.
KStream<String, MappedOriginalSensorData> flattenedOriginalData = originalData
.through("atl-mapped-original-sensor-data-repartition", Produced.with(Serdes.String(), new MappedOriginalSensorDataSerde()));
//#2. Save modelid and algorithm parts of the key of the errorscore topic and reduce the key
// to installationId:assetId:tagName
//Repartition ahead of time avoiding multiple repartition topics and thereby duplicating data
KStream<String, MappedErrorScoreData> enrichedErrorData = errorScoreData
.through("atl-mapped-error-score-data-repartition", Produced.with(Serdes.String(), new MappedErrorScoreDataSerde()));
return enrichedErrorData
//#3. Join
.join(flattenedOriginalData, join(),
// allow messages within one second to be joined together based on their timestamp
// configure the retention period of the local state store involved in this join
new MappedErrorScoreDataSerde(),
new MappedOriginalSensorDataSerde()))
//#4. Set installation:assetid:modelinstance:algorithm::tag key back
.selectKey((k,v) -> v.getOriginalKey())
//#5. Map to ErrorScore (basically removing the originalKey field)
Then the aggregation part:
Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
// Set retention of changelog topic
// Configure how windows looks like and how long data will be retained in local stores
TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));
// Processing description:
// 2. With the groupByKey we group the data on the new key
// 3. With windowedBy we split up the data in time intervals depending on the provided LocalStore enum
// 4. With reduce we determine the maximum value in the time window
// 5. Materialized will make it stored in a table
.reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), materialized);
private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
    TimeWindows timeWindows = TimeWindows.of(windowSizeMs);
    return timeWindows;
}
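The grouping, windowing and max-reduce steps listed in the processing description can be illustrated without Kafka Streams. This is a plain Java sketch under assumptions: the class is hypothetical, windows are fixed-size and aligned to multiples of the window size, and timestamps are non-negative.

```java
import java.util.HashMap;
import java.util.Map;

public class WindowedMaxSketch {
    // "key@windowStart" -> maximum score seen for that key in that window
    private final Map<String, Double> maxByKeyAndWindow = new HashMap<>();
    private final long windowSizeMs;

    public WindowedMaxSketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    /** Start of the fixed-size window that contains the timestamp. */
    public long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    /** Group by key and window, then reduce by keeping the larger value. */
    public void add(String key, long timestampMs, double score) {
        String slot = key + "@" + windowStart(timestampMs);
        maxByKeyAndWindow.merge(slot, score, Math::max);
    }

    /** Maximum for the key in the window containing the timestamp, or null if none. */
    public Double max(String key, long timestampMs) {
        return maxByKeyAndWindow.get(key + "@" + windowStart(timestampMs));
    }
}
```

Only one value per key and window is kept here, which is the same shape as the reduce step: state grows with the number of keys and retained windows, not with the number of messages.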
I would expect that increasing the number of instances would decrease the consumer lag tremendously.
So in this setup there are multiple topics involved such as:
* original-sensor-data
* error-score
* kstream-joinother
* kstream-jointhis
* atl-mapped-original-sensor-data-repartition
* atl-mapped-error-score-data-repartition
* atl-joined-data-repartition
The idea is to join the original-sensor-data with the error-score. The rekeying requires the atl-mapped-* topics. Then the join uses the kstream-* topics, and in the end, as a result of the join, atl-joined-data-repartition is filled. After that the aggregation also creates topics, but I leave that out of scope for now.
original-sensor-data -> atl-mapped-original-sensor-data-repartition -> kstream-jointhis  -> atl-joined-data-repartition
error-score          -> atl-mapped-error-score-data-repartition     -> kstream-joinother -> atl-joined-data-repartition
Since increasing the number of instances no longer seems to have much effect now that I have introduced the join and the atl-mapped topics, I'm wondering whether this topology could have become its own bottleneck. Judging by the consumer lag, the original-sensor-data and error-score topics have a much smaller lag compared to, for instance, the atl-mapped-* topics. Is there a way to cope with this by removing these changelogs, or would that result in not being able to scale?
I have a Spark stream with records flowing in, and the batch interval is 1 second.
I want to union all the data in the stream. So I created an empty RDD and then, using the transform method, I union each RDD in the stream with this empty RDD.
I expect this empty RDD to hold all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me whether my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();
JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
.transform(rdd -> rdd.union(records));
transformedMessages.foreachRDD(record -> {
    System.out.println("Aman" + record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset<Row> ds = ss.createDataFrame(records, schema);
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so we have: transformedMessages = messages (leaving aside the flatMap function, which is not relevant to the discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema); records is still empty. Nothing in the flow of the program changes that, so it remains empty as an invariant over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
That said, please note that this process iteratively adds to the lineage of the 'records' RDD and also accumulates all data over time. This is not a job that can run stably for a long period of time because, eventually, given enough data, it will grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach does not seem to be scalable or sustainable.
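The reason records stays empty can be shown with ordinary Java lists standing in for RDDs. This is a sketch only: the union helper below is hypothetical and merely imitates the fact that a union returns a new collection and leaves its inputs untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UnionSketch {
    /** Returns a new list containing both inputs; neither input is modified. */
    public static <T> List<T> union(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>(left);
        out.addAll(right);
        return Collections.unmodifiableList(out);
    }

    public static void main(String[] args) {
        List<String> records = Collections.emptyList();
        List<String> batch = List.of("r1", "r2");

        union(batch, records);                // result discarded, as in the transform call
        System.out.println(records.size());   // prints 0

        records = union(batch, records);      // result assigned back, as suggested above
        System.out.println(records.size());   // prints 2
    }
}
```

When translating the reassignment to Java, note that a local variable captured by a lambda must be effectively final, so the accumulated value would need to live in a field or a holder such as java.util.concurrent.atomic.AtomicReference.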
I'm working with a time-based index storing syslog events.
All the data is coming from different sources (PCs).
Suppose I have this kind of events:
timestamp = 0
source = PC-1
event = event_type_1
timestamp = 1
source = PC-1
event = event_type_1
timestamp = 1
source = PC-2
event = event_type_1
I want to write a query that retrieves all the distinct values of the "source" field for documents that match event = event_type_1.
I expect exact values (no approximations).
To achieve this I have written a cardinality query together with an aggregation that specifies the correct size, because I have no prior knowledge of the number of distinct sources. I think this is an expensive operation, as it consumes a lot of memory.
Is there any other alternative to get this done?