Pattern Matching Redis keys - events

I have a question regarding pattern matching on Redis keys. Currently, I am storing a set of subscriptions, where each key is a composite of different event attributes. For example, if a subscription comes in as
S1 - {event:created, userId:1234, stateId:xyz}
It is stored in the cache for matching as follows (the attributes are sorted before creating the key):
event:created#stateId:xyz#userId:1234 = {S1}
Now there can be other subscriptions for this exact combination. But when an event comes in with any of the three attributes, it is matched against all keys in which those attributes appear as a substring. Example:
event:created#stateId:xyz#userId:1234 = {S1,S2,S3}
event:started#stateId:xyz#userId:1234 = {S4,S5,S6}
event:created#stateId:abc#userId:1234 = {S7,S8,S9}
The following will be the event and subscription chart.
event:created -> S1,S2,S3,S7,S8,S9
event:started -> S4,S5,S6
stateId:xyz -> S1,S2,S3,S4,S5,S6
userId:1234 -> S1,S2,S3,S4,S5,S6,S7,S8,S9
stateId:abc and userId:1234 -> S1,S2,S3,S4,S5,S6,S7,S8,S9
I tried using a SCAN on Redis with a pattern match, but it takes a long time as my cache can have a lot of entries, and SCAN takes O(N) time.
Any idea how I can do this efficiently? Maybe by using a secondary structure in Redis like a Tree or something? Or any other Redis data structure I should look at?
Thanks
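One direction worth exploring (a hedged sketch, not something from the question itself) is to keep a secondary index of Redis Sets, one per attribute value, alongside the composite keys. Matching then becomes a set union instead of a SCAN. The class below uses the Jedis client; the idx: prefix, host, and method names are assumptions for illustration.
import redis.clients.jedis.Jedis;
import java.util.Set;

public class SubscriptionIndex {
    private final Jedis jedis = new Jedis("localhost", 6379); // assumed local Redis

    // Called when a subscription is stored under its composite key:
    // also add it to one Set per attribute, e.g. SADD idx:event:created S1
    public void index(String subscriptionId, String... attributes) {
        for (String attribute : attributes) {
            jedis.sadd("idx:" + attribute, subscriptionId);
        }
    }

    // Return all subscriptions matching any attribute of an incoming event
    // (the OR semantics shown in the chart above), without scanning keys.
    public Set<String> match(String... attributes) {
        String[] indexKeys = new String[attributes.length];
        for (int i = 0; i < attributes.length; i++) {
            indexKeys[i] = "idx:" + attributes[i];
        }
        return jedis.sunion(indexKeys); // cost is proportional to the matched members
    }
}
The memory cost is one extra set entry per attribute per subscription, and each lookup only touches the sets for the event's attributes rather than every key in the cache.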

Are java 8 stream operations optimized? [duplicate]

There is a simple query that filters a list and gets a field value of the item found.
myList.getParents().stream()
.filter(x -> x.getSomeField() == 1)
.map(x -> x.getOtherField())
.findFirst();
Are the operations executed one after another, as written: from the initial list we filter everything whose someField is 1, then we build a new list of the other field's values, and then we take the first element of that new list?
Let's imagine there are 1,000,000 items in the list and 1,000 remain after filtering. Will it map all 1,000 items just to take the first one?
If I change the order as below, will that improve performance, or is the stream smart enough by itself?
myList.getParents().stream()
.filter(x -> x.getSomeField() == 1)
.findFirst()
.map(x -> x.getOtherField());
No. In a Java 8 stream pipeline, each data item is processed in one single pass. That enables short-circuit evaluation and gives more room to optimize.
For instance, in your case we take the first item and apply the filter; assume it satisfies the filter criteria. Then we do the mapping and push that element to findFirst. We don't need to access any other element in the stream source, since each item is processed in one pass. This short-circuit evaluation is what allows the optimization.
However, in the second version of the pipeline the map at the end is no longer a stream operation: the terminal operation findFirst ends the pipeline, and the map that follows is applied to the resulting Optional rather than to the stream.
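A minimal sketch (not part of the original answer) that makes the single-pass, short-circuiting behaviour visible by printing from each stage; the class name and sample values are made up for illustration.
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class StreamLazinessDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(5, 1, 7, 1, 9);

        Optional<String> first = numbers.stream()
                .peek(n -> System.out.println("filtering " + n))
                .filter(n -> n == 1)
                .peek(n -> System.out.println("mapping " + n))
                .map(n -> "value-" + n)
                .findFirst();

        // Prints "filtering 5", "filtering 1", "mapping 1" and then stops:
        // elements after the first match are never filtered or mapped.
        System.out.println(first.orElse("none"));
    }
}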

Getting duplicates with NiFi HBase_1_1_2_ClientMapCacheService

I need to remove duplicates from a flow I've developed; it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I receive some duplicates with the HBase cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed to DetectDuplicate (using the HBase Cache service). It routed 20,020 to "non-duplicate", and interestingly the table actually has 20,000 rows.
Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation "getAndPutIfAbsent" which is expected to return the original value, and then set the new value if it wasn't there. For example, first time through it would return null and set the new value, indicating it wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
V got = get(key, keySerializer, valueDeserializer);
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer);
if (! wasAbsent) return got;
else return null;
So because it is two separate calls there is a possible race condition...
Imagine node 1 executes the first line and gets null, but then node 2 performs both the get and the putIfAbsent. When node 1 now calls putIfAbsent it gets false, because node 2 just populated the cache, so node 1 returns the null value from its original get. Both calls therefore look like non-duplicates to DetectDuplicate.
The DistributedMapCacheServer, by contrast, locks the entire cache per operation, so it can provide an atomic getAndPutIfAbsent.
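For contrast, here is a minimal sketch (not NiFi code) of what atomic get-and-put-if-absent semantics look like when the underlying store supports them in a single operation, using ConcurrentHashMap as a stand-in for the DistributedMapCacheServer's locked cache.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class AtomicPutIfAbsentDemo {
    public static void main(String[] args) {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

        // putIfAbsent returns the previous value (or null) and stores the new
        // value in one atomic step, so two callers racing on the same key can
        // never both observe "absent".
        String first = cache.putIfAbsent("file-1.csv", "seen");   // null   -> not a duplicate
        String second = cache.putIfAbsent("file-1.csv", "seen");  // "seen" -> duplicate

        System.out.println("first call saw:  " + first);
        System.out.println("second call saw: " + second);
    }
}
Because the HBase client has to emulate this with two separate round trips, the window between them is where the duplicates slip through.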

How to do aggregation on fixed-size count-based sliding window?

How do I implement a sliding window aggregation (or transformation) with a fixed-size count-based window?
For example, if I have stream data like the following:
input stream = 1,2,3,4,5,6,7,8...
Assume that time is not relevant here, and say my aggregate function is AVERAGE and the window size is fixed at 3 records (not 3 ms, 3 seconds, 3 hours, etc.). I would like my output stream to be
output stream = avg(1,2,3), avg(2,3,4), avg(3,4,5), avg(4,5,6), avg(5,6,7)... = 2,3,4,5,6...
The windows documented in Kafka Streams are "time-based". Even the constructor of the base class Window has the following signature:
Window(long startMs, long endMs)
So I was not sure whether it is the right tool for non-time-based windowed aggregation.
Apache Flink supports count-based sliding and tumbling windows. That's exactly what I need, but I'm looking for a similar feature in Kafka Streams.
If time-ordering is no concern for you, you can implement a custom Transformer with attached state.
StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(...); // add KeyValueStore here
KStream result = builder.stream("topic").transform(...); // pass in name of your KeyValueStore, too
For your custom Transformer you can maintain a List per key, with the list being your window: as long as the list is smaller than your window size, you append the new record to it; when it reaches exactly the window size, you trigger the computation; if it exceeds the size, you trim it first and trigger the computation afterwards. A sketch of this is shown below.
See the docs for more details: https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html (Note that a Processor and a Transformer are basically the same thing.)
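Here is a rough sketch of that approach (not from the original answer), written against a more recent Kafka Streams client; older 1.x clients also required implementing a deprecated punctuate method. The topic names, the window-store name, and the comma-separated encoding of the per-key window are illustrative assumptions.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CountWindowAverage {

    static final int WINDOW_SIZE = 3;                 // fixed window of 3 records
    static final String STORE_NAME = "window-store";  // illustrative store name

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Attach a KeyValueStore that holds the current window per key.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore(STORE_NAME),
                Serdes.String(), Serdes.String()));

        KStream<String, String> input = builder.stream("input-topic");
        KStream<String, Double> averages =
                input.transform(() -> new SlidingAverage(), STORE_NAME);
        averages.to("output-topic", Produced.with(Serdes.String(), Serdes.Double()));
        // ... build the Topology and start KafkaStreams as usual ...
    }

    // Keeps the last WINDOW_SIZE values per key and emits their average.
    static class SlidingAverage implements Transformer<String, String, KeyValue<String, Double>> {
        private KeyValueStore<String, String> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, String>) context.getStateStore(STORE_NAME);
        }

        @Override
        public KeyValue<String, Double> transform(String key, String value) {
            List<String> window = new ArrayList<>();
            String current = store.get(key);
            if (current != null && !current.isEmpty()) {
                window.addAll(Arrays.asList(current.split(",")));
            }
            window.add(value);
            if (window.size() > WINDOW_SIZE) {
                window.remove(0);                      // trim the oldest record
            }
            store.put(key, String.join(",", window));
            if (window.size() < WINDOW_SIZE) {
                return null;                           // window not full yet, emit nothing
            }
            double sum = 0;
            for (String v : window) {
                sum += Double.parseDouble(v);
            }
            return KeyValue.pair(key, sum / WINDOW_SIZE);
        }

        @Override
        public void close() { }
    }
}
Note that this emits one average per incoming record once the window is full (avg(1,2,3), then avg(2,3,4), and so on), and that records within a key are windowed in arrival order, which is why the answer above stresses that time-ordering must not matter.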
If you wish to use Apache Storm, which is also a streaming engine, Kafka can be connected to it as a data source. Newer versions of Storm provide a concept called a tumbling window, which delivers an exact number of tuples to your topology. This can easily be used to solve your problem.
For more have a look at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_storm-component-guide/content/storm-windowing-concepts.html

Do I need caching after repartitioning

I have made a DataFrame which I repartitioned across the nodes based on its primary key:
val config=new SparkConf().setAppName("MyHbaseLoader").setMaster("local[10]")
val context=new SparkContext(config)
val sqlContext=new SQLContext(context)
val rows="sender,time,time(utc),reason,context-uuid,rat,cell-id,first-pkt,last-pkt,protocol,sub-proto,application-id,server-ip,server-domain-name, http-proxy-ip,http-proxy-domain-name, video,packets-dw, packets-ul, bytes-dw, bytes-ul"
val scheme= new StructType(rows.split(",").map(e=>new StructField(e.trim,StringType,true)))
val dFrame=sqlContext.read
.schema(scheme)
.format("csv")
.load("E:\\Users\\Mehdi\\Downloads\\ProbDocument\\ProbDocument\\ggsn_cdr.csv")
dFrame.registerTempTable("GSSN")
dFrame.persist(StorageLevel.MEMORY_AND_DISK)
val distincCount=sqlContext.sql("select count(distinct sender) as SENDERS from GSSN").collectAsList().get(0).get(0).asInstanceOf[Long]
dFrame.repartition(distincCount.toInt/3,dFrame("sender"))
Do I need to call persist again after repartitioning, for the next reducing jobs on the DataFrame?
Yes, repartition returns a new DataFrame so you would need to cache it again.
While the answer provided by Dikei seems to address your direct question, it is important to note that in a case like this there is typically no reason to explicitly cache at all.
Every shuffle in Spark (here, the repartition) serves as an implicit caching point. If some part of the lineage has to be re-executed and none of the executors has been lost, Spark won't have to go further back than the last shuffle: it can simply read the shuffle files.
This means that caching just before or just after a shuffle is typically a waste of time and resources, especially if you're not interested in in-memory-only storage or some non-standard caching mechanism.
You would need to persist the repartitioned DataFrame, since DataFrames are immutable and repartition returns a new DataFrame.
An approach you could follow is to persist dFrame, and after the repartition the new DataFrame that is returned is dFrameRepart. At that stage you can persist dFrameRepart and unpersist dFrame in order to free up memory, provided you won't be using dFrame again. If you are using dFrame after the repartition operation, both DataFrames can be persisted.
dFrame.registerTempTable("GSSN")
dFrame.persist(StorageLevel.MEMORY_AND_DISK)
val distincCount=sqlContext.sql("select count(distinct sender) as SENDERS from GSSN").collectAsList().get(0).get(0).asInstanceOf[Long]
val dFrameRepart = dFrame.repartition(distincCount.toInt/3, dFrame("sender")).persist(StorageLevel.MEMORY_AND_DISK)
dFrame.unpersist()

Minimizing shuffle when mapper output is mostly sorted

I have a map-reduce process in which the mapper takes input from a file that is sorted by key. For example:
1 ...
2 ...
2 ...
3 ...
3 ...
3 ...
4 ...
Then it gets transformed and 99.9% of the keys stay in the same order in relation to one another and 99% of the remainder are close. So the following might be the output of running the map task on the above data:
a ...
c ...
c ...
d ...
e ...
d ...
e ...
Thus, if you could make sure that a reducer took in a range of inputs and put that reducer in the same node where most of the inputs were already located, the shuffle would require very little data transfer. For example, suppose that I partitioned the data so that a-d were taken care of by one reducer and e-g by the next. Then if a-d could be run on the same node that had handled the mapping of 1-4, only two records for e would need to be sent over the network.
How do I construct a system that takes advantage of this property of my data? I have both Hadoop and Spark available and do not mind writing custom partitioners and the like. However, the full workload is such a classic example of MapReduce that I'd like to stick with a framework which supports that paradigm.
Hadoop mail archives mention consideration of such an optimization. Would one need to modify the framework itself to implement it?
From the Spark perspective there is no direct support for this: the closest is mapPartitions with preservesPartitioning=true. However, that will not directly help in your case because the keys may not be changed.
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
If you were able to know definitively that none of the keys would move outside of their original partitions the above would work. But the values on the boundaries would likely not cooperate.
What is the scale of the data compared to the migrating keys? You may consider adding a postprocessing step: first construct a partition for all migrating keys, have your mapper output a special key value for keys needing to migrate, and then postprocess the results to append them to the standard partitions. That is extra hassle, so you would need to weigh the tradeoff of an extra step and added pipeline complexity.
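As a concrete illustration of the custom-partitioner route the question mentions (a hedged sketch, not part of the answer above), a Hadoop Partitioner can pin fixed key ranges to fixed reduce tasks. The Text key type and the hard-coded range boundaries are assumptions; a real job would derive the boundaries from a sample of the data, much as TotalOrderPartitioner does.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class RangePartitioner extends Partitioner<Text, Text> {

    // Upper bounds (exclusive) of each partition's key range, e.g. keys < "e"
    // go to reducer 0, keys in [e, i) to reducer 1, and so on.
    private static final String[] UPPER_BOUNDS = {"e", "i", "n", "t"};

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        for (int i = 0; i < UPPER_BOUNDS.length && i < numPartitions - 1; i++) {
            if (k.compareTo(UPPER_BOUNDS[i]) < 0) {
                return i;
            }
        }
        // Everything at or above the last bound falls into the final partition.
        return Math.min(UPPER_BOUNDS.length, numPartitions - 1);
    }
}
It would be registered with job.setPartitionerClass(RangePartitioner.class). Keep in mind that this only controls which reduce task receives which key range; where that task is actually scheduled is still up to the framework, which is part of the tradeoff discussed above.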
