Aggregation over a specific partition in Apache Kafka Streams

Let's say I have a Kafka topic named SensorData to which two sensors, S1 and S2, send data (timestamp and value), each to a different partition, e.g. S1 -> P1 and S2 -> P2. Now I need to aggregate the values for these two sensors separately, say by calculating the average sensor value over a time window of 1 hour and writing it into a new topic SensorData1Hour. Given this scenario:
How can I select a specific topic partition using the KStreamBuilder#stream method?
Is it possible to apply an aggregation function over two (or more) different partitions of the same topic?

You cannot (directly) access single partitions and you cannot (directly) apply an aggregation function over multiple partitions.
Aggregations are always done per key: http://docs.confluent.io/current/streams/developer-guide.html#stateful-transformations
Thus, you could use a different key for each partition and then aggregate by key. See http://docs.confluent.io/current/streams/developer-guide.html#windowing-a-stream
The simplest way is to let each of your producers apply a key to each message right away.
If you want to aggregate multiple partitions, you first need to set a new key (e.g., using selectKey()) and set the same key for all data you want to aggregate (if you want to aggregate all partitions, you would use a single key value -- however, keep in mind, this might quickly become a bottleneck!).
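For illustration, here is a rough sketch of the per-sensor windowed average using the newer StreamsBuilder DSL (the question mentions the older KStreamBuilder). SensorReading, sensorReadingSerde and sumCountSerde are hypothetical placeholders for however your sensor messages and the (sum, count) aggregate state are (de)serialized:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

class SensorAverages {

    // SensorReading, sensorReadingSerde and sumCountSerde are assumptions, not part of the question.
    static void build(StreamsBuilder builder,
                      Serde<SensorReading> sensorReadingSerde,
                      Serde<double[]> sumCountSerde) {

        KStream<String, SensorReading> readings =
                builder.stream("SensorData", Consumed.with(Serdes.String(), sensorReadingSerde));

        KTable<Windowed<String>, Double> hourlyAvg = readings
                // re-key each record by its sensor id (S1, S2, ...) so aggregation is per sensor
                .selectKey((ignoredKey, reading) -> reading.getSensorId())
                .groupByKey(Grouped.with(Serdes.String(), sensorReadingSerde)) // repartitions on the new key
                // TimeWindows.of(...) in older Kafka Streams versions
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                // keep a running (sum, count) pair per sensor and window...
                .aggregate(
                        () -> new double[]{0.0, 0.0},
                        (sensorId, reading, agg) -> {
                            agg[0] += reading.getValue();
                            agg[1] += 1;
                            return agg;
                        },
                        Materialized.with(Serdes.String(), sumCountSerde))
                // ...and turn it into the average at read-out time
                .mapValues(agg -> agg[1] == 0 ? 0.0 : agg[0] / agg[1]);

        hourlyAvg.toStream()
                .map((windowedKey, avg) -> KeyValue.pair(windowedKey.key(), avg))
                .to("SensorData1Hour", Produced.with(Serdes.String(), Serdes.Double()));
    }
}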

Related

Elasticsearch data-stream selecting indices to search

With our current search-engine implementation we do something like:
search by a date range from/to (on @timestamp)
get all indices by some prefix (e.g. technical-logs*)
filter to only those indices that fall within the from/to range (e.g. if from=20230101 and to=20230118, we select all indices in that range with the prefix technical-logs-yyyyMMdd)
It seems like data streams could be beneficial for us. The problem I see is that the indices created by data streams are hidden by default, so I won't be able to see them and therefore won't be able to query only the indices I'm interested in (from/to).
Is there some easy mechanism to select only the indices we want, or does ES have some functionality for that? I know there is an @timestamp field, but I don't know whether it is also used to filter out only the indices that contain a given date.
That's the whole point of data streams: you don't need to know which indices to query. You just query the data stream (which behaves like an alias), or a subset thereof such as technical-logs*, and ES will make sure to query only the underlying backing indices that satisfy your constraints (the from/to time interval, etc.).
Time-series data streams use time-bound backing indices. Each of those backing indices is then sorted by @timestamp, so when you search for a specific time interval, ES will only query the relevant backing indices.
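For illustration, a rough Java sketch assuming the Elasticsearch Java API client (the exact builder names vary between client versions, so treat the calls as an assumption): you search the data stream by name with a range filter on @timestamp and let ES decide which backing indices to touch.

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.json.JsonData;

class TechnicalLogsSearch {
    static SearchResponse<Void> search(ElasticsearchClient client, String from, String to)
            throws Exception {
        // query the data stream name like an alias; ES prunes the hidden backing indices
        // based on the @timestamp range, so there is no need to enumerate them yourself
        return client.search(s -> s
                        .index("technical-logs")
                        .query(q -> q.range(r -> r
                                .field("@timestamp")
                                .gte(JsonData.of(from))   // e.g. "2023-01-01"
                                .lte(JsonData.of(to)))),  // e.g. "2023-01-18"
                Void.class);
    }
}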

Redis embedding value in the key vs json

I'm planning to store room availability in a Redis database. The JSON object looks like this:
{
  BuildingID: "RE0002439",
  RoomID: "UN0002384391290",
  SentTime: 1572616800,
  ReceivedTime: 1572616801,
  Status: "Occupied",
  EstimatedAvailableFrom: 1572620400000,
  Capacity: 20,
  Layout: "classroom"
}
This is going to be reported by both devices and apps (a tablet outside the room, a sensor inside some rooms, users, etc.) and will vary a lot, as we have hundreds of buildings and over 1000 rooms.
My intention is to use a simple key value structure in Redis. The main query would be which room is available now, but other queries are possible.
Because of that I was thinking that the key should look like
RoomID,Status,Capacity
My question is: since this is the main query we expect, is it a correct assumption to put all of these in the key? Should there be other fields in the key too, or should the key just be a number from a Redis increment, as it would be in SQL?
I could find plenty of questions about key hierarchy, but my object really has no hierarchy.
Unless you will use the Redis instance exclusively for this, using keys with pattern matching for common queries is not a good idea. KEYS is O(N), and so is SCAN when called multiple times to traverse the whole keyspace.
Consider RediSearch module, it would give you a lot of power on this use case.
If RediSearch is not an option:
You can use a single hash key to store all rooms, but then you have to store the whole JSON string as the value, and whenever you want to modify a field you need to get it, modify it, and set it again.
You are probably better off using multiple data structures; here's an idea to get you started:
Store each room as a hash key. If RoomID is unique you can use it as key, or pair it with building id if needed. This way, you can edit a field value in one operation.
HSET UN0002384391290 BuildingID RE0002439 Capacity 20 ...
Keep a set with all room IDs. SADD AllRooms UN0002384391290
Use sets and sorted sets as indexes for the rest:
A set of available rooms: Use SADD AvailableRooms UN0002384391290 and SREM AvailableRooms UN0002384391290 to mark rooms as available or not. This way your common query of all available rooms is as fast as it gets. You can use this in place of Status inside the room data. Use SISMEMBER to test whether a given room is available now.
A sorted set with capacity: Use ZADD RoomsByCapacity 20 UN0002384391290. So now you can start doing nice queries like ZRANGEBYSCORE RoomsByCapacity 15 +inf WITHSCORES to get all rooms with a capacity >=15. You then can intersect with available rooms.
Sets by layout: SADD RoomsByLayout:classroom UN0002384391290. Then you can intersect by layout, like SINTER AvailableRooms RoomsByLayout:classroom to get all available classrooms.
Sets by building: SADD RoomsByBuilding:RE0002439 UN0002384391290. Then you can intersect by buildings too, like SINTER AvailableRooms RoomsByLayout:classroom RoomsByBuilding:RE0002439 to get all available classrooms in a building.
You can mix sets with sorted sets, like ZINTERSTORE Available:RE0002439:ByCap 3 RoomsByBuilding:RE0002439 RoomsByCapacity AvailableRooms AGGREGATE MAX to get all available rooms scored by capacity in building RE0002439. Sorted sets only allow ZINTERSTORE and ZUNIONSTORE, so you need to clean up after your queries.
You can avoid sorted sets by using sets with capacity buckets, like Rooms:Capacity:1-5, Rooms:Capacity:6-10, etc.
Consider adding coordinates to your buildings, so your users can query by proximity. See GEOADD and GEORADIUS.
You may want to allow reservations and availability queries into the future. See Date range overlap on Redis?.
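If you drive this from Java, a rough Jedis sketch of the layout above might look like the following (the key names follow the examples above; wrapping the index updates in a MULTI/EXEC transaction is optional):

import java.util.Set;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

class RoomIndexExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String roomId = "UN0002384391290";

            // one hash per room, plus the index structures, updated in one MULTI/EXEC block
            Transaction tx = jedis.multi();
            tx.hset(roomId, "BuildingID", "RE0002439");
            tx.hset(roomId, "Capacity", "20");
            tx.hset(roomId, "Layout", "classroom");
            tx.sadd("AllRooms", roomId);
            tx.sadd("AvailableRooms", roomId);                 // the room is free right now
            tx.zadd("RoomsByCapacity", 20, roomId);
            tx.sadd("RoomsByLayout:classroom", roomId);
            tx.sadd("RoomsByBuilding:RE0002439", roomId);
            tx.exec();

            // main query: all available classrooms in building RE0002439
            Set<String> rooms = jedis.sinter("AvailableRooms",
                                             "RoomsByLayout:classroom",
                                             "RoomsByBuilding:RE0002439");
            System.out.println(rooms);

            // the room becomes occupied: move it out of the availability index
            jedis.srem("AvailableRooms", roomId);
        }
    }
}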

Designing the "mapper" and "reducer" functions for Hadoop

I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and the reducer are supposed to work for my specific application.
The input to my mapper is a large directed graph's connectivity. It is a 2 column input where each row is an individual edge connectivity. The first column is the start node id and the second column is the end node id of each edge. I'm trying to output the number of neighbors for each start node id into a 2 column text file, where the first column is sorted in order of increasing start node id.
My questions are:
(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.
(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?
If my understanding is correct, you want to count how many distinct values a key will have.
Simply emitting the input key-value pairs in the mapper, and then counting the distinct values per key (e.g., by adding them to a set and emitting the set size as the value of the reducer) in the reducer is one way of doing it, but a bit redundant, as you say.
In general, you want to reduce the network traffic, so you may want to do some more computation before the shuffling (and yes, the shuffle and sort between map and reduce is done by Hadoop for you).
Two easy ways to improve the efficiency are:
1) Use a combiner, which will output sets of values instead of single values. This way, you will send fewer key-value pairs to the reducers, and also, some values may be skipped, since they are already in the local value set of the same key.
2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure (e.g., a hashmap or multimap). The key can be the map input key and the value can be the set of values seen so far for this key. Each time you meet a new value for this key, you add it to this structure. At the end of each mapper, you emit this structure (or convert the values to an array) from the cleanup() method (close() in the old API); a sketch follows below.
You can look up both methods using the keywords "combiner" and "map-side aggregation".
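As a rough sketch of option 2 with the new MapReduce API (the class names and the comma-separated intermediate format are just illustrative choices, not a prescribed design):

import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map-side aggregation: buffer the neighbour sets in memory and flush them in cleanup(),
// so each locally seen start node sends a single, deduplicated record to the shuffle.
class NeighbourCountMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Set<String>> buffer = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
        String[] edge = line.toString().trim().split("\\s+"); // "startNodeId endNodeId"
        buffer.computeIfAbsent(edge[0], k -> new HashSet<>()).add(edge[1]);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        for (Map.Entry<String, Set<String>> e : buffer.entrySet()) {
            ctx.write(new Text(e.getKey()), new Text(String.join(",", e.getValue())));
        }
    }
}

// The reducer merges the partial neighbour sets and writes the distinct-neighbour count.
// The shuffle sorts the keys (as Text, i.e. lexicographically; use an IntWritable key or
// a custom comparator if you need numeric ordering of the node ids).
class NeighbourCountReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        Set<String> neighbours = new HashSet<>();
        for (Text v : values) {
            Collections.addAll(neighbours, v.toString().split(","));
        }
        ctx.write(key, new IntWritable(neighbours.size()));
    }
}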
A global sort on the key is a bit trickier. Again, there are two basic options, though neither is really good:
1) you use a single reducer, but then you don't gain anything from parallelism,
2) you use a total order partitioner, which needs some extra coding.
Other than that, you may want to move to Spark for a more intuitive and efficient solution.

Hive table sorted but inserted without sort

What happens if I
create table X (...) clustered by (date) sorted by (time)
but insert without sorting:
insert into x select * from raw
Will the data be sorted after it is fetched from raw, before it is inserted?
If unsorted data is inserted,
what does "sorted by" in the create table statement do?
Does it just work as a hint for later select queries?
The documentation explains:
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
I think it is clear that you want to insert the data sorted if you are using that option.
No, the data will not be sorted.
As another answer explains, the SORTED BY and CLUSTERED BY options do not change how data will be returned from queries. While the documentation is technically accurate, the purpose of CLUSTER BY is to write the underlying data to HDFS in a way that will make subsequent queries faster in some cases. Clustering (bucketing) is similar to partitioning in that it allows the query processor to skip reading rows, provided the clustering column is chosen wisely. A common use of buckets is sampling data, where you explicitly include only certain buckets, thereby avoiding reads against those excluded.

Efficient point-in-time query of group membership

We have a scenario like this:
Millions of records (Record 1, Record 2, Record 3...)
Partitioned into millions of small non-intersecting groups (Group A, Group B, Group C...)
Membership gradually changes over time, i.e. a record may be reassigned to another group.
We are redesigning the data schema, and one use case we need to support is given a particular record, find all other records that belonged to the same group at a given point in time. Alternatively, this can be thought of as two separate queries, e.g.:
To which group did Record 15544 belong, three years ago? (Call this Group g).
What records belonged to Group g, three years ago?
Supposing we use a relational database, the association between records and groups is easily modelled using a two-column table of record id and group id. A common approach for allowing historical queries is to add a timestamp column. This allows us to answer the question above as follows:
Find the row for Record 15544 with the most recent timestamp prior to the given date. This tells us Group g.
Find all records that have at any time belonged to Group g.
For each of these records, find the row with the most recent timestamp prior to the given date. If this indicates that the record was in Group g at that time, then add it to the result set.
This is not too bad (assuming the table is separately indexed by both record id and group id), and may even be the optimal algorithm for the naive table structure just described, but it does cost an index lookup for every record found in step 2. Is there an alternative data structure that would answer the query more efficiently?
ETA: This is only one of several use cases for the system, so we don't want to speed up this query at the expense of making queries about current groupings slower, nor do we want to pay a huge price in space consumption, etc.
How about creating two tables:
(recordID, time -> groupID) - the key is (recordID, time), sorted primarily by recordID and secondarily by time (call that map1)
(groupID, time -> List<recordID>) - the key is (groupID, time), sorted primarily by groupID and secondarily by time (call that map2)
At each record change:
Retrieve the current groupID of the record you are changing
set t <- current time
Create a new entry in map2 for the old group: (oldGroupID, t, list') - where list' is the same list, but without the record you just moved out of it.
Add a new entry in map2 for the new group: (newGroupID, t, list'') - where list'' is the old list for the new group, with the moved record added to it.
Add a new entry (recordID, t, newGroupID) to map1.
During query:
You need to find the entry in map1 that is 'closest' to and smaller than (recordID, desired_time) - a classic O(log N) operation in a sorted data structure.
This will give you the group g the element belonged to at the desired time.
Now look in map2 similarly for the entry with the key closest to but smaller than (g, desired_time). The value is the list of all records that were in the group at the desired time.
This requires quite a bit more space (by a constant factor, though), but every operation is O(log N), where N is the number of record changes.
An efficient sorted data structure for entries that are mostly stored on disk is a B+ tree, which many relational database implementations also use.
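As an in-memory illustration of the two-map idea, here is a rough Java sketch using TreeMap, whose floorEntry gives exactly the "closest entry less than or equal to the desired time" lookup (a B+ tree index plays the same role on disk). The class and method names are made up for the example:

import java.util.*;

class MembershipHistory {
    // map1: recordId -> (changeTime -> groupId)
    private final Map<String, TreeMap<Long, String>> recordHistory = new HashMap<>();
    // map2: groupId -> (changeTime -> snapshot of the group's members at that time)
    private final Map<String, TreeMap<Long, Set<String>>> groupHistory = new HashMap<>();

    // Record that recordId moves into newGroupId at time t.
    void move(String recordId, String newGroupId, long t) {
        TreeMap<Long, String> history =
                recordHistory.computeIfAbsent(recordId, k -> new TreeMap<>());
        Map.Entry<Long, String> current = history.floorEntry(t);

        if (current != null) {
            // write a new snapshot of the old group without this record
            String oldGroup = current.getValue();
            Set<String> without = new HashSet<>(membersAt(oldGroup, t));
            without.remove(recordId);
            groupHistory.computeIfAbsent(oldGroup, k -> new TreeMap<>()).put(t, without);
        }
        // write a new snapshot of the new group including this record
        Set<String> with = new HashSet<>(membersAt(newGroupId, t));
        with.add(recordId);
        groupHistory.computeIfAbsent(newGroupId, k -> new TreeMap<>()).put(t, with);

        history.put(t, newGroupId);
    }

    // "To which group did this record belong at time t?"
    String groupAt(String recordId, long t) {
        Map.Entry<Long, String> e =
                recordHistory.getOrDefault(recordId, new TreeMap<>()).floorEntry(t);
        return e == null ? null : e.getValue();
    }

    // "Which records were in this group at time t?"
    Set<String> membersAt(String groupId, long t) {
        Map.Entry<Long, Set<String>> e =
                groupHistory.getOrDefault(groupId, new TreeMap<>()).floorEntry(t);
        return e == null ? Collections.emptySet() : e.getValue();
    }
}

Note that, as described in the answer, every membership change writes a fresh copy of the affected group lists, which is what buys the O(log N) point-in-time lookups.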
