We have a microservice (written in Go) whose primary purpose is to get the logs from multiple IoT devices, do some processing on them, and put the result into a PostgreSQL table. The way the system works is that each device has its own sorted set where its logs are saved, and for each log the score is a timestamp (of course I know a time series would be a better choice, but we currently want to work with sorted sets). Now, these logs come in every 1 second from each device.
I want to process the data inside these sets every 5 seconds, but for each set the logs inside should pass some tests:
there should be more than one log inside the set
two logs can be removed from the set if the time difference between their timestamps is 1 second
when the logs are validated, they can be passed to other methods or functions for the rest of the processing. If the logs are invalid (there exists a log whose time difference from the other logs is more than 1 second), they go back to the set and wait for the next iteration to be checked again.
Problem:
My problem is basically that I don't know how to get the data out of the set, validate it, and put it back again! To be more clear: for each set, all or none of the logs inside can be removed, and this happens while new data is constantly coming in, and since I can't validate the data with Redis itself I don't know what to do. My current solution is as follows:
Every 5 seconds, all data from each set is removed from Redis and saved in some data structure inside the code (like a list...); then, after validating, the logs that could not be validated are put back into Redis. As you can see, this solution needs two database accesses from the code, and when the invalid logs are put back, they have to be re-sorted by Redis...
When there are many logs and many devices, I think this solution is not the best way to go. I'm not very experienced with Redis, so I would be thankful for your comments on the problem. Thanks.
Since you decided to use sorted sets, here are some things to know first:
"there should be more than one log inside the set": if there is no element in a sorted set, then the set/key doesn't exist. You can check whether there is any log in the sorted set via two different commands, ZCARD and EXISTS - both work in O(1).
There can't be the same log (exactly the same member) in a sorted set more than once. You need an identifier (such as a timestamp, UUID, hash, etc.) to distinguish individual logs from each other within a single sorted set. Adding the same member again just updates the score of the existing element (which may not be what you want):
127.0.0.1:6379> zadd mydevice 1234 "log-a"
(integer) 1
127.0.0.1:6379> zadd mydevice 12345 "log-a"
(integer) 0
127.0.0.1:6379> zrange mydevice 0 -1 withscores
1) "log-a"
2) "12345"
127.0.0.1:6379>
There is no single built-in way to do this at the data layer. You will need an application layer with business logic to accomplish what you need.
My suggestion would be to keep the combination of each IoT device + minute in a separate sorted set. So every minute each device gets a different key: you append the minute (e.g. 2020:06:06:21:21) to the device identifier key, and it will hold at most 60 logs, which you can check with ZCARD.
It would be something like this:
127.0.0.1:6379> zadd device:1:2020:06:06:21:21 1591442137 my-iot-payload
(integer) 1
127.0.0.1:6379> zadd device:1:2020:06:06:21:21 1591442138 my-iot-payload-another
(integer) 1
127.0.0.1:6379> zadd device:1:2020:06:06:21:21 1591442138 my-iot-payload-yet-another
(integer) 1
127.0.0.1:6379> zrange device:1:2020:06:06:21:21 0 -1
1) "my-iot-payload"
2) "my-iot-payload-another"
3) "my-iot-payload-yet-another"
127.0.0.1:6379>
In your application layer:
Every minute, for every device, you check the sorted sets (I know you said 5 seconds; if you want that, you need a modulo approach to separate them into 5-second-interval keys instead of minute keys).
You have the list of devices (maybe in a database table), and you know what time it is (so you can build the Redis key).
Get the minute/device-separated keys with ZRANGE (using the WITHSCORES option) to make the calculations and validations at the application level, for each device and for that exact minute.
If they pass, save them into your PostgreSQL database (then delete the sorted set key, or execute EXPIRE whenever you add a new element with ZADD).
If they fail, that's totally up to you. You have minute-separated logs for each device; you may delete them or parse them partially and save what you can (see the sketch below).
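Putting those steps together, here is a minimal Go sketch of the application-layer loop, assuming the go-redis client (github.com/go-redis/redis/v8); the key layout follows the example above, and saveToPostgres is a hypothetical stand-in for your own persistence code:

package worker

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

// saveToPostgres is a placeholder for your own PostgreSQL insert logic.
func saveToPostgres(ctx context.Context, deviceID string, logs []redis.Z) error {
	return nil // TODO: write the processed result into your table
}

// processMinute validates one device/minute key; if the logs pass, it hands
// them off for processing and deletes the key, otherwise it leaves the key
// untouched so the next iteration can check it again.
func processMinute(ctx context.Context, rdb *redis.Client, deviceID, minute string) error {
	key := fmt.Sprintf("device:%s:%s", deviceID, minute) // e.g. device:1:2020:06:06:21:21

	logs, err := rdb.ZRangeWithScores(ctx, key, 0, -1).Result() // already ordered by timestamp score
	if err != nil {
		return err
	}
	if len(logs) < 2 { // "there should be more than one log inside the set"
		return nil
	}

	// invalid if any two consecutive timestamps are more than 1 second apart
	for i := 1; i < len(logs); i++ {
		if logs[i].Score-logs[i-1].Score > 1 {
			return nil // keep the key for the next iteration
		}
	}

	if err := saveToPostgres(ctx, deviceID, logs); err != nil {
		return err
	}
	return rdb.Del(ctx, key).Err() // processed: drop the sorted set
}

Note that new logs can still arrive between the ZRANGE and the DEL; if that matters for the current minute, wrap the read-validate-delete step in a MULTI/EXEC with WATCH or a small Lua script.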
Related
I have the following structure:
user_id: [
{item_id:delivered_at},
{item_id:delivered_at},
{item_id:delivered_at}
]
What I need to do: I have an array of item_ids for a user_id as an argument, and I need to check whether they were delivered or not.
I see two approaches here:
I store just key:string; in this case the key will be user_id:item_id and the value delivered_at. So when I have an array of item_ids, I can query all of them in one network request and get back delivered_at for each item_id if it exists.
The second approach is to store a zset for each user_id: the key will be user_id, and in the zset the score is delivered_at and the value is item_id.
Why I don't like the first approach: I cannot easily get all items for a given user, which would be very handy, and the next task will be exactly this.
What would be perfect is if I could intersect the incoming set of item_ids with those in the zset.
As far as I understood, I can only intersect sets that are already stored in Redis, so in my case I would need to create a new temp set and then intersect it with the zset.
There is an alternative for the second approach: I can just get all items for the given user from the zset and then intersect the sets in the application. However, I don't like this, because there can be thousands of deliveries per user, and it isn't a great idea to transfer a couple of MB of data just to check 2 or 3 deliveries.
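(For reference, a minimal sketch of one way to keep that check cheap under the second approach: instead of fetching the whole zset, pipeline ZSCORE lookups for just the incoming item_ids. This assumes the go-redis client; the key layout and function name are hypothetical.)

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// deliveredAt looks up delivered_at for each incoming itemID in the user's
// zset; a nil entry means the item has not been delivered yet.
func deliveredAt(ctx context.Context, rdb *redis.Client, userID string, itemIDs []string) (map[string]*float64, error) {
	pipe := rdb.Pipeline()
	cmds := make(map[string]*redis.FloatCmd, len(itemIDs))
	for _, id := range itemIDs {
		cmds[id] = pipe.ZScore(ctx, "user:"+userID, id) // one round trip for all lookups
	}
	if _, err := pipe.Exec(ctx); err != nil && err != redis.Nil {
		return nil, err
	}
	out := make(map[string]*float64, len(itemIDs))
	for id, cmd := range cmds {
		if score, err := cmd.Result(); err == nil {
			s := score
			out[id] = &s
		} else {
			out[id] = nil // ZSCORE returned nil: not delivered
		}
	}
	return out, nil
}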
Is there any option to intersect an incoming set with one already present in Redis?
Is there a better approach to storing the data for this specific case? I've read some articles, including the one about secondary indexes on the Redis blog.
This is a problem I have been thinking about for a long time but I haven't written any code yet because I first want to solve some general problems I am struggling with. This is the main one.
Background
A single page web application makes requests for data to some remote API (which is under our control). It then stores this data in a local cache and serves pages from there. Ideally, the app remains fully functional when offline, including the ability to create new objects.
Constraints
Assume a server-side database of products containing roughly 50,000 products (50 MB)
Assume no DB type; we interact with it via a REST/GraphQL interface
Assume a single product record is < 1 kB
Assume a max payload of 256 kB per result set
Assume max 5 MB of storage on the client
Assume search result sets ranging between 0 ... 5000 items per search
Challenge
The challenge is to define a stateless but (network-)efficient way to fetch pages from a result set, so that it is deterministic which results we will get.
Example
In traditional paging, when getting the next 100 results for some query using this url:
https://example.com/products?category=shoes&firstResult=100&pageSize=100
the search result may look like this:
{
  "totalResults": 2458,
  "firstResult": 100,
  "pageSize": 100,
  "results": [
    {"some": "item"},
    {"some": "other item"},
    // 98 more ...
  ]
}
The problem with this is that there is no way, based on this information, to get exactly the objects that are on a certain page, because by the time we request the next page the result set may have changed (due to changes in the DB), which influences which items are part of it. Even a small change can have a big impact: one item removed from the DB that happened to be on page 0 of the result set will change what we get when requesting all subsequent pages.
Goal
I am looking for a mechanism to make the definition of the result set independent of future database changes, so that if someone was looking for shoes and got a result set of 2458 items, they could actually fetch all pages of that result set reliably, even if it was influenced by later changes in the DB (for this purpose I plan to not really delete items, but to set a removed flag on them).
Ideas so far
I have seen a solution where the result set included a "pages" property, which was an array with the first and last id of the items in that page. Assuming your IDs keep going up in number and you don't really delete items from the DB ever, the number of items between two IDs is constant. Meaning the app could get all items between those two IDs and always get the exact same items back. The problem with this solution is that it only works if the list is sorted in ID order... I need custom sorting options.
The only way I have come up with for now is to just send a list of all IDs in the result set... That way pages can be fetched by doing a SELECT * FROM products WHERE id IN (3,4,6,9,...)... but this feels rather inelegant...
Anyway, I am hoping this is not too broad or theoretical. I have a web-based DB, just no good idea of how to do paging with it. I am looking for answers that point me in a direction to learn, not full solutions.
Versioning the DB is the answer for result-set consistency.
Each record has a primary id, a modification counter (version number), and a timestamp of modification/creation. Instead of modifying record r, you add a new record with the same id, version number + 1, and sysdate as the modification time.
In the fetch response you add the DB request_time (do not use a client timestamp, due to possible time differences between client and server). The first page is served normally, but you return sysdate as request_time. Other pages are served differently: for each versioned table you add a condition like modification_time <= request_time and take the latest version of each record as of that time.
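A rough sketch of what such a page query could look like, assuming a Go service with database/sql, PostgreSQL-style placeholders, and a hypothetical versioned products table (all column names are illustrative):

import (
	"context"
	"database/sql"
	"time"
)

// fetchPage returns one page of a result set frozen at requestTime; the client
// received requestTime with the first page and sends it back on every later request.
func fetchPage(ctx context.Context, db *sql.DB, category string, requestTime time.Time, offset, limit int) (*sql.Rows, error) {
	const q = `
		SELECT p.id, p.name
		FROM products p
		WHERE p.category = $1
		  AND p.modification_time <= $2      -- ignore anything changed after request_time
		  AND p.version = (                  -- latest version of each record as of request_time
		        SELECT MAX(v.version)
		        FROM products v
		        WHERE v.id = p.id AND v.modification_time <= $2)
		  AND p.removed = FALSE              -- soft delete instead of real delete
		ORDER BY p.name
		LIMIT $4 OFFSET $3`
	return db.QueryContext(ctx, q, category, requestTime, offset, limit)
}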
You can cache the result set of IDs on the server side when a query arrives for the first time and return a unique ID to the frontend. This unique ID corresponds to the result set for that query. The frontend can then request something like next_page with the unique ID it got the first time it made the query. You should still go ahead with your approach of changing the DELETE operation into a removed flag, because it makes sure that none of the entries from the result set gets deleted. You can discard the cached result set when the frontend reaches the end of it, or you can set a time limit on the lifetime of the cache entry.
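A minimal sketch of that idea, using an in-memory cache for illustration (with multiple server instances you would need a shared store such as Redis; all names here are hypothetical):

package cache

import "sync"

// resultCache maps an opaque token to the frozen list of matching product IDs.
type resultCache struct {
	mu   sync.Mutex
	sets map[string][]int64
}

func newResultCache() *resultCache {
	return &resultCache{sets: make(map[string][]int64)}
}

// freeze stores the IDs returned by the first run of a query under a token.
func (c *resultCache) freeze(token string, ids []int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.sets[token] = ids
}

// page returns the slice of IDs for one page of a previously frozen result set.
func (c *resultCache) page(token string, offset, size int) ([]int64, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	ids, ok := c.sets[token]
	if !ok || offset >= len(ids) {
		return nil, ok
	}
	end := offset + size
	if end > len(ids) {
		end = len(ids)
	}
	return ids[offset:end], true
}

Each next_page request maps the token to the frozen ID list, and the actual rows are then fetched with SELECT ... WHERE id IN (...), similar to your last idea but without shipping the full ID list to the client.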
First of all, please do not treat this as a duplicate.
I have seen all the threads on this issue, but none matched my case.
I am developing an online registration system using JBOSS 6 and Oracle 11g. I want to give every registrant a unique form number sequentially.
For this, I think Oracle's sequence_name.nextval is best for inserting a unique yet sequential number into the primary key field, and for retrieving the same value I would use sequence_name.currval. Up to this point I hope it's OK.
But will this work correctly if two or more concurrent users submit the web form simultaneously? (I mean, will the values ever overlap or get interchanged among the concurrent users?)
More precisely, is it session dependent?
Let me give two hypothetical situations so that matter becomes clearer.
Say there are two users, user1 and user2, trying to register at the same time, sitting in New York and Paris respectively. max(form_no) is, say, 100 before they click the submit button. Now, in the code I have written, say
insert into member(....) values(seq_form_no.nextval,....).
Now, since the two users invoke the same query from two different terminals, will each get their own sequential id, or could user1 get user2's or vice versa? I hope I made the issue clear. The sequence values will be unique, I know, but I want each user to get back exactly the id that was inserted for them.
Thanks in advance.
I'm not sure I understand. But simply said, a SEQUENCE ensures uniqueness of the generated numbers among concurrent transactions/connections. Unless the sequence was created with the CYCLE option, from within a transaction you can rely on strictly monotonically increasing (resp. decreasing) numbering, but not on the absence of gaps (probably what you were expecting when talking about "sequential numbers").
It's worth mentioning that sequence numbers never go backward. When someone acquires a value, it is "consumed" from the sequence and never goes back in (CYCLE aside), even if you roll back the current transaction.
From the doc (emphasis mine):
When a sequence number is generated, the sequence is incremented, independent of the transaction committing or rolling back. If two users concurrently increment the same sequence, then the sequence numbers each user acquires may have gaps, because sequence numbers are being generated by the other user. One user can never acquire the sequence number generated by another user. After a sequence value is generated by one user, that user can continue to access that value regardless of whether the sequence is incremented by another user.
My JSP is a little bit ... "rusty", but something like this should work as expected:
<sql:update dataSource="${ds}" var="result">
INSERT INTO member(....) values(seq_form_no.nextval,....)
</sql:update>
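<%-- CURRVAL is session-scoped: both statements must run on the same connection (e.g. nest them inside <sql:transaction>) for this SELECT to return the value just inserted --%>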
<sql:query dataSource="${ds}" var="last_inserted_member_id">
SELECT seq_form_no.currval FROM DUAL
</sql:query>
I'm looking to store the following group of information: a minute timestamp (e.g. grouping all browser IDs seen in a 1-minute window) and then a list of browser references. I'd like to have only one instance of each browser ID.
What data structure in Redis can I use for this? Is there a more optimal way to store it?
...
12:06 -> browser1, browser7
12:07 -> browser8
12:08 -> browser4, browser5, browser6, browser9
...
Each row can have a time to live of about 1 day.
When adding a new browser ID, I first check whether it already exists somewhere in the data; if so, I delete it and add it to the new minute row.
Lastly, every minute I take the row from 30 minutes ago, process those browser IDs, and then remove that row once it is fully processed.
There could be up to 1 million browser references in this data structure at any one time.
Ok, new information, new answer :)
Let's make each browser a key in the database, pointing to the timestamp it's currently in, and also a key for each timestamp, holding a set of the browsers it "contains".
When a new browser is added:
Check if it's already in the system by checking if its key exists.
If it is, check which timestamp it belongs to, remove it from the old time stamp, add it to the new one. Update the browser key.
If not, add it to the time stamp and set the browser key.
To expire the keys I would probably not use the built-in EXPIRE; instead, use a cron job or something to:
Remove all browser keys in the timestamp
Remove the timestamp key.
Example data structure:
ts:12:01 -> {1, 3}
ts:12:02 -> {2}
browser:1 -> 12:01
browser:2 -> 12:02
browser:3 -> 12:01
This should be reasonably O(1), but with a slightly higher constant factor (multiple requests for each operation). That could possibly be reduced by using server-side Lua scripting.
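A rough Go sketch of the "add a browser" step described above, assuming the go-redis client (the key names are just illustrative):

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// addBrowser records that browserID was seen in the given minute, moving it
// out of whatever minute bucket it was in before.
func addBrowser(ctx context.Context, rdb *redis.Client, browserID, minute string) error {
	browserKey := "browser:" + browserID

	// Which timestamp bucket is this browser currently in, if any?
	oldMinute, err := rdb.Get(ctx, browserKey).Result()
	if err != nil && err != redis.Nil {
		return err
	}
	if err == nil && oldMinute != minute {
		if err := rdb.SRem(ctx, "ts:"+oldMinute, browserID).Err(); err != nil {
			return err
		}
	}

	// Add to the new bucket and update the browser -> minute pointer.
	if err := rdb.SAdd(ctx, "ts:"+minute, browserID).Err(); err != nil {
		return err
	}
	return rdb.Set(ctx, browserKey, minute, 0).Err()
}

Wrapping these steps in a MULTI/EXEC or a Lua script would make the move atomic.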
Hope that helps!
A list is enough. In fact, if the number of browsers is less than 400 (according to your conf file; the default is 400), Redis uses a sequential array (a ziplist) instead of a linked list, for space efficiency.
For more information: https://github.com/antirez/redis/blob/unstable/src/ziplist.h
I'm trying to set up a system in which ALL links posted by users and clicked by their followers are stored in redis in such a way that the following requirements are met:
Able to get the most-clicked links (for example, the top 10%) within a time frame (today, this week, all time, or custom).
Able to query all users who posted the same link.
Since we already use many keys, ideally we would store all of this in a single Redis key.
Values can be JSON-encoded if needed.
Here is what I came up so far:
- I use a single Redis hash in which each field is a single hour, so that in one day the hash will contain 24 fields.
- In each field, I store JSON encoded from an array with this format:
array("timestamp1" => array($url1, $url2, ...)
, "timestamp2" => array($url3, $url4, ...)
, ..., ...);
- The complete structure of the hash is:
[01/01/2010 00:00] => JSON(...),
[01/01/2010 01:00] => JSON(...),
....
This way, I can get all the clicks on any URL within any time-frame.
However, I can't seem to reuse this hash for getting all the users who posted the URL.
The question is: is there a better way to do this?
Updated 07/30/2011: I'm currently storing the minutes, the hours, the days, weeks, months, and years in the same hash.
So, one click is stored in many fields at once:
- in the field for the minute (format YmdHi)
- in the field for the hour (format YmdH)
- in the field for the day (format Ymd)
- in the field for the week (format YW)
- in the field for the month (format Ym)
- in the field for the year (format Y).
That way, when trying to get a specific timeframe, I only access the necessary fields, without looping through the hours.
For example, if I need clicks from 07/26/2011 20:00 to 07/28/2011 02:00, I only need to query 7 fields: 1 field for the full day of 07/27/2011, 4 fields for the hours from 20:00 to 23:00 on 07/26, and then 2 more fields for the hours from 00:00 to 01:00 on 07/28.
If you drop the third requirement it becomes a lot easier. A lot of people seem to think that you should always use hashes instead of keys, but this stems from a misunderstanding of a post about using hashes to improve performance in a particular, limited set of circumstances.
To get the most-clicked links, create a sorted set for each hour or day, with the member being the link and the score being the click count, updated with ZINCRBY. Use ZCARD and ZREVRANGE to get the top 10%. It is simplest if the set holds all links in the system, though there are strategies you can use to drop less popular items from the set if necessary.
To get all users posting a link, store a set of users for each link. You could do this with JSON and a key or hash storing details for the link, but a set makes updating and querying easier.
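A minimal Go sketch of that scheme, assuming the go-redis client (key names are illustrative):

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// recordClick bumps the click counter for url in that day's sorted set.
func recordClick(ctx context.Context, rdb *redis.Client, day, url string) error {
	return rdb.ZIncrBy(ctx, "clicks:"+day, 1, url).Err()
}

// recordPost remembers that userID posted url.
func recordPost(ctx context.Context, rdb *redis.Client, url, userID string) error {
	return rdb.SAdd(ctx, "posters:"+url, userID).Err()
}

// topLinks returns roughly the top 10% most-clicked links for one day.
func topLinks(ctx context.Context, rdb *redis.Client, day string) ([]string, error) {
	key := "clicks:" + day
	total, err := rdb.ZCard(ctx, key).Result()
	if err != nil {
		return nil, err
	}
	n := total / 10
	if n == 0 {
		n = 1
	}
	return rdb.ZRevRange(ctx, key, 0, n-1).Result()
}

A custom time frame can then be covered by combining the relevant per-hour or per-day sets with ZUNIONSTORE before reading the top entries, and the posters of any link are simply SMEMBERS posters:<url>.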
I recommend using some bucketing strategy, such as hashing keys or keeping the link-to-user records per month, as you have no control over how huge the data structure may grow. There could be millions of users visiting a particular link, and getting the details of all those users at once will be of no use. I believe what can be done is to maintain a counter or some metadata that acts as the current state, and then keep an archival store that is not in memory, or go for an in-memory data grid like GemFire.