Optimal storage structure in Redis

I'm looking to store the following group of information: a minute timestamp (e.g. grouping all browser IDs seen in a 1-minute window) and then a list of browser references. I'd like to have only one instance of a browser ID across the whole structure.
What Redis data structure can I use for this? Is there a more optimal way to store it?
...
12:06 -> browser1, browser7
12:07 -> browser8
12:08 -> browser4, browser5, browser6, browser9
...
Each row can have a time to live of about 1 day.
When adding a new browser ID, I first check whether that browser ID already exists somewhere in the data; if so, I delete it and add it to the new minute row.
Lastly, every minute I take the row from 30 minutes ago, process those browser IDs, and then remove that row from the list once fully processed.
There could be up to 1 million browser references in this data structure at any one time.

Ok, new information, new answer :)
Let's make each browser a key in the database, pointing to the timestamp it's currently in, and also a key for each timestamp, holding a set of the browsers it "contains".
When a new browser is added:
Check if it's already in the system by checking whether its key exists.
If it is, check which timestamp it belongs to, remove it from the old timestamp, and add it to the new one. Update the browser key.
If not, add it to the timestamp and set the browser key.
To expire the keys I would probably not use the built-in expire; instead, use a cron job or something similar to:
Remove all browser keys in the timestamp.
Remove the timestamp key.
Example data structure:
ts:12:01 -> {1, 3}
ts:12:02 -> {2}
browser:1 -> 12:01
browser:2 -> 12:02
browser:3 -> 12:01
This should be reasonably O(1), but with a slightly higher constant factor (multiple requests for each operation). That could be reduced by using server-side Lua scripting.
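As a minimal sketch of the operations above (assuming the redis-py client; the browser:<id> and ts:<minute> key names are illustrative, not a fixed convention):
import redis

r = redis.Redis()

def add_browser(browser_id, minute):
    # Which minute bucket (if any) is this browser currently filed under?
    old_minute = r.get(f"browser:{browser_id}")
    if old_minute is not None:
        # Already tracked: pull it out of its old minute bucket first.
        r.srem(f"ts:{old_minute.decode()}", browser_id)
    r.sadd(f"ts:{minute}", browser_id)
    r.set(f"browser:{browser_id}", minute)

def expire_minute(minute):
    # Cron-style cleanup: drop every browser key in the bucket, then the bucket.
    for member in r.smembers(f"ts:{minute}"):
        r.delete(f"browser:{member.decode()}")
    r.delete(f"ts:{minute}")
Wrapping each operation in a MULTI/EXEC pipeline, or in a server-side Lua script, would cut the multiple round trips down to one.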
Hope that helps!

A list is enough. In fact, if the number of browsers is less than 400 (according to your conf file; the default is 400), Redis uses a compact sequential array (a ziplist) in place of the list, for space efficiency.
For more information: https://github.com/antirez/redis/blob/unstable/src/ziplist.h

Related

validate and processing data in Redis sorted set efficiently

We have a microservice (written in Go) whose primary purpose is to get the logs from multiple IoT devices, do some processing on them, and put the result into a PostgreSQL table. The way the system works is that each device has its own sorted set in which its logs are saved, and for each log the score is a timestamp (of course I know a time series would be a better choice, but we currently want to work with sorted sets). Now, these logs come in every 1 second from each device.
I want to process the data inside these sets every 5 seconds, but for each set the logs inside should pass some tests:
there should be more than one log inside the set
two logs can be removed from the set if the time difference between their timestamps is 1 second
When the logs are validated they can be passed to other methods or functions to do the rest of the processing. If the logs are invalid (there exists a log whose time difference from the other logs is more than 1 second), they go back into the set and wait for the next iteration to be checked again.
Problem:
My problem is basically that I don't know how to get the data out of the set, validate it, and put it back again! To be more clear: for each set, all or none of the logs inside can be removed, and this occurs while new data is constantly coming in; since I can't validate the data within Redis itself, I don't know what to do. My current solution is as follows:
Every 5 seconds, all data from each set is removed from Redis and saved in some data structure inside the code (like a list); then, after validating, any logs that are not yet valid are put back into Redis. As you can see, this solution needs two database accesses from the code, and when putting the invalid logs back, they have to be re-sorted by Redis.
When there are many logs and many devices, I don't think this solution is the best way to go. I'm not very experienced with Redis, so I would be thankful for your comments on the problem. Thanks
Since you decided to use sorted sets, here are things to know first:
"there should be more than one log inside the set": if there is no element in the sorted set, then the set/key doesn't exist. You can check whether there is any log in the sorted set via two different commands, ZCARD and EXISTS; both work in O(1).
There can't be the same log (an exact duplicate) in a sorted set more than once. You need an identifier (such as a timestamp, UUID, hash, etc.) to separate individual logs from each other within a single sorted set. Adding the same member again just updates the score of the existing element, which may not be what you want:
127.0.0.1:6379> zadd mydevice 1234 "log-a"
(integer) 1
127.0.0.1:6379> zadd mydevice 12345 "log-a"
(integer) 0
127.0.0.1:6379> zrange mydevice 0 -1 withscores
1) "log-a"
2) "12345"
127.0.0.1:6379>
There is no single way to do this at the data layer with built-in methods. You will need an application layer with business logic to accomplish what you need.
My suggestion would be to keep each combination of IoT device + minute in a separate sorted set. So every minute each device gets a different key: you append the minute, e.g. 2020:06:06:21:21, to the device identifier key, and it will hold at most 60 logs. You can check the count with ZCARD.
It would be something like this:
127.0.0.1:6379> zadd device:1:2020:06:06:21:21 1591442137 my-iot-payload
(integer) 1
127.0.0.1:6379> zadd device:1:2020:06:06:21:21 1591442138 my-iot-payload-another
(integer) 1
127.0.0.1:6379> zadd device:1:2020:06:06:21:21 1591442138 my-iot-payload-yet-another
(integer) 1
127.0.0.1:6379> zrange device:1:2020:06:06:21:21 0 -1
1) "my-iot-payload"
2) "my-iot-payload-another"
3) "my-iot-payload-yet-another"
127.0.0.1:6379>
In your application layer:
Every minute, for every device, you check the sorted sets (I know you said 5 seconds; if you want that, you need a modulo scheme to separate them into 5-second interval keys instead of minute ones).
You have the list of devices (maybe in your database table), and you know what time it is (convert it to a Redis key).
Get the minute/device-separated keys with ZRANGE (with the WITHSCORES option) to run calculations and validations at the application level for each device and for that exact minute.
If they pass, save them into your PostgreSQL database (then delete the sorted set key, or set an expire whenever you add a new element with ZADD).
If they fail, that's totally up to you. You have minute-separated logs for each device; you may delete them or parse them partially and save that.
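As a rough sketch of that application-layer loop (redis-py assumed; the key format, the strict 1-second rule, and save_to_postgres are illustrative stand-ins, not the poster's actual code):
import redis

r = redis.Redis()

def save_to_postgres(logs):
    ...  # stand-in for the real PostgreSQL insert

def process_device_minute(device_id, minute_key):
    key = f"device:{device_id}:{minute_key}"
    logs = r.zrange(key, 0, -1, withscores=True)  # [(payload, timestamp), ...]
    if len(logs) < 2:
        return  # "more than one log" rule not met; retry next tick
    scores = [int(score) for _, score in logs]
    # Valid only if consecutive timestamps are exactly 1 second apart.
    if all(b - a == 1 for a, b in zip(scores, scores[1:])):
        save_to_postgres(logs)
        r.delete(key)  # processed: drop the whole minute bucket
    # Otherwise leave the key untouched and re-check on the next iteration.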

Smart pagination algorithm that works with local data cache

This is a problem I have been thinking about for a long time but I haven't written any code yet because I first want to solve some general problems I am struggling with. This is the main one.
Background
A single page web application makes requests for data to some remote API (which is under our control). It then stores this data in a local cache and serves pages from there. Ideally, the app remains fully functional when offline, including the ability to create new objects.
Constraints
Assume a server-side database of products containing ±50,000 products (50 MB)
Assume no specific DB type; we interact with it via a REST/GraphQL interface
Assume a single product record is < 1kB
Assume a max payload for a resultset of 256kB
Assume max 5MB storage on the client
Assume search result sets ranging between 0 ... 5000 items per search
Challenge
The challenge is to define a stateless but network-efficient way to fetch pages from a result set, so that it is deterministic which results we will get.
Example
In traditional paging, when getting the next 100 results for some query using this URL:
https://example.com/products?category=shoes&firstResult=100&pageSize=100
the search result may look like this:
{
  "totalResults": 2458,
  "firstResult": 100,
  "pageSize": 100,
  "results": [
    {"some": "item"},
    {"some": "other item"},
    // 98 more ...
  ]
}
The problem with this is that there is no way, based on this information, to get exactly the objects that are on a certain page, because by the time we request the next page the result set may have changed (due to changes in the DB), influencing which items are part of it. Even a small change can have a big impact: one item removed from the DB that happened to be on page 0 of the result set will change which results we get when requesting all subsequent pages.
Goal
I am looking for a mechanism that makes the definition of the result set independent of future database changes, so that if someone searched for shoes and got a result set of 2458 items, he could reliably fetch all pages of that result set even if it got influenced by later changes in the DB (for this purpose, I plan to not really delete items, but to set a removed flag on them).
Ideas so far
I have seen a solution where the result set included a "pages" property, which was an array with the first and last id of the items in that page. Assuming your IDs keep going up in number and you don't really delete items from the DB ever, the number of items between two IDs is constant. Meaning the app could get all items between those two IDs and always get the exact same items back. The problem with this solution is that it only works if the list is sorted in ID order... I need custom sorting options.
The only way I have come up with so far is to just send a list of all IDs in the result set... That way pages can be fetched by doing a SELECT * FROM products WHERE id IN (3,4,6,9,...)... but this feels rather inelegant...
Anyway, I am hoping it is not too broad or theoretical. I have a web-based DB, just no good idea of how to do paging with it. I am looking for answers that point me in a direction to learn, not full solutions.
Versioning the DB is the answer for result-set consistency.
Each record has a primary id, a modification counter (version number), and a timestamp of modification/creation. Instead of modifying record r, you add a new record with the same id, version number + 1, and sysdate as the modification time.
In the fetch response you add the DB request_time (do not use a client timestamp, due to possible clock differences between client and server). The first page is served normally, but you return sysdate as request_time. Subsequent pages are served differently: you add a condition like modification_time <= request_time for each versioned table.
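A hedged sketch of such a versioned page query (the products table, its columns, and sqlite-style placeholders are assumptions for illustration):
# Every page after the first filters on the request_time returned with page one.
PAGE_QUERY = """
    SELECT p.*
    FROM products p
    JOIN (SELECT id, MAX(version) AS version   -- latest version of each record...
          FROM products
          WHERE modification_time <= ?         -- ...as of the original request
          GROUP BY id) latest
      ON p.id = latest.id AND p.version = latest.version
    WHERE p.category = ?
    ORDER BY p.name
    LIMIT ? OFFSET ?
"""

def fetch_page(conn, request_time, category, page_size, first_result):
    # conn is a sqlite3 connection in this sketch; request_time came back with page one.
    return conn.execute(PAGE_QUERY, (request_time, category, page_size, first_result)).fetchall()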
You can cache the result set of IDs on the server side when a query arrives for the first time and return a unique ID to the frontend. This unique ID corresponds to the result set for that query, so the frontend can request something like next_page with the unique ID it got the first time it made the query. You should still go ahead with your approach of changing the DELETE operation into a removed flag, because it makes sure that none of the entries in the result set are deleted. You can discard the query's result set from the cache when the frontend reaches the end of the result set, or you can set a time limit on the lifetime of the cache entry.
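That idea could be sketched with Redis as the server-side cache (the key layout, the one-hour TTL, and the two stubbed database helpers are assumptions):
import uuid
import redis

r = redis.Redis()

def run_query(params):
    ...  # stand-in: returns the full list of matching IDs from the DB

def fetch_products_by_ids(ids):
    ...  # stand-in: SELECT * FROM products WHERE id IN (...)

def start_query(params):
    ids = run_query(params)
    token = uuid.uuid4().hex              # the unique ID handed to the frontend
    r.rpush(f"resultset:{token}", *ids)
    r.expire(f"resultset:{token}", 3600)  # lifetime limit on the cache entry
    return token

def next_page(token, first, size):
    ids = r.lrange(f"resultset:{token}", first, first + size - 1)
    return fetch_products_by_ids([i.decode() for i in ids])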

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on: how many "index" items (or filters, if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or to create multiple CFs just to get basic data, I am curious to know whether it's a good idea to include the above "filters" as columns (compound column segments).
Example:
Row Key          | timeUUID:data | timeUUID:country | timeUUID:source
======================================================================
timeUUID:section | JSON Object   | USA              | example.com
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma: the columns. A compound column name with a timeUUID lets me sort and do a time-based slice, but does the concept make sense?
Is this type of structure acceptable under current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on, even when it's as simple as this?
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
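Since a create-table snippet was mentioned, here is a hedged sketch of those tweaks in CQL via the Python driver (keyspace, table, and column names are illustrative, not the poster's actual schema):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

session.execute("""
    CREATE TABLE IF NOT EXISTS visits_by_section (
        section text,
        day text,               -- e.g. '20240101'; simpler than a timeUUID row key
        event_time timeuuid,
        payload text,           -- the full JSON log object
        country text,           -- denormalised copies of queryable fields
        source text,
        PRIMARY KEY ((section, day), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

# section comes first, so IN can be used on the last part of the partition key:
rows = session.execute(
    "SELECT payload, country, source FROM visits_by_section "
    "WHERE section = %s AND day IN (%s, %s)",
    ("shoes", "20240101", "20240102"),
)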

Redis Data Structure to Store All Clicks for All Links

I'm trying to set up a system in which ALL links posted by users and clicked by their followers are stored in Redis in such a way that the following requirements are met:
Able to get the most clicked links (for example, the top 10%) within a time frame (today, this week, all time, or custom).
Able to query all users who posted the same link.
Since we already use many keys, ideally we would store all of this in a single Redis key.
We can encode values to JSON if needed.
Here is what I've come up with so far:
- I use a single Redis hash where each field is a single hour, so that in one day the hash contains 24 fields.
- In each field, I store a JSON-encoded array with the format:
array("timestamp1" => array($url1, $url2, ...)
, "timestamp2" => array($url3, $url4, ...)
, ..., ...);
- The complete structure is this hash:
[01/01/2010 00:00] => JSON(...),
[01/01/2010 01:00] => JSON(...),
....
This way, I can get all the clicks on any URL within any time-frame.
However, I can't seem to reuse this hash for getting all the users who posted the URL.
The question is: Is there any better way to do this?
Updated 07/30/2011: I'm currently storing the minutes, the hours, the days, weeks, months, and years in the same hash.
So, one click is stored in many fields at once:
- in the field for the minute (format YmdHi)
- in the field for the hour (format YmdH)
- in the field for the day (format Ymd)
- in the field for the week (format YW)
- in the field for the month (format Ym)
- in the field for the year (format Y).
That way, when trying to get a specific timeframe, I only have to access the necessary fields without looping through the hours.
For example, if I need clicks from 07/26/2011 20:00 to 07/28/2011 02:00, I only need to query 7 fields: 1 field for the full day of 07/27/2011, 4 fields for the hours from 20:00 to 23:00 on 07/26, and then 2 more fields for hours from 00:00 to 01:00 on 07/28
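For illustration, the six field names a single click lands in could be derived like this (a hedged Python sketch; the PHP date formats above map onto the strftime codes shown):
from datetime import datetime

def click_fields(ts):
    # One hash field name per granularity, mirroring the formats listed above.
    return [
        ts.strftime("%Y%m%d%H%M"),  # minute (YmdHi)
        ts.strftime("%Y%m%d%H"),    # hour   (YmdH)
        ts.strftime("%Y%m%d"),      # day    (Ymd)
        ts.strftime("%Y%W"),        # week   (YW)
        ts.strftime("%Y%m"),        # month  (Ym)
        ts.strftime("%Y"),          # year   (Y)
    ]

# click_fields(datetime(2011, 7, 27, 20, 5))
# -> ['201107272005', '2011072720', '20110727', '201130', '201107', '2011']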
If you drop the third requirement it becomes a lot easier. A lot of people seem to think that you should always use hashes instead of keys, but this stems from a misunderstanding of a post about using hashes to improve performance in a particular limited set of circumstances.
To get the most clicked links, create a sorted set for each hour or day, with the member being the link and the score being the click count, updated using ZINCRBY. Use ZCARD and ZREVRANGEBYSCORE to get the top 10%. It is simplest if the set holds all links in the system, though there are strategies you can use to drop less popular items from the set if necessary.
To get all users posting a link, store a set of users for each link. You could do this with JSON and a key or hash storing details for the link, but a set makes updating and querying easier.
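A hedged redis-py sketch of both structures (key names and the per-day bucket are assumptions; rank-based ZREVRANGE is used here in place of ZREVRANGEBYSCORE to slice off the top 10%):
import math
import redis

r = redis.Redis()

def record_click(url, day):
    r.zincrby(f"clicks:{day}", 1, url)  # score = running click count

def record_post(url, user_id):
    r.sadd(f"posters:{url}", user_id)   # set of users who posted this link

def top_10_percent(day):
    key = f"clicks:{day}"
    n = max(1, math.ceil(r.zcard(key) * 0.10))
    return r.zrevrange(key, 0, n - 1, withscores=True)

def users_for_link(url):
    return r.smembers(f"posters:{url}")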
I recommend using some bucketing strategy, like hashing keys or keeping link-to-user records month-wise, as you have no control over how huge the data structure may grow. There could be millions of users visiting a particular link, and getting the details of all those users is of no use if they are thrown back at once. I believe what can be done is to maintain a counter or some metadata that acts like the current state, and then keep archival storage out of memory, or go for a memory grid like GemFire.

Can you sort a GET on a Cassandra column family by the Timestamp value created for each column entry, rather than the column Keys?

Basically I have a 'thread line' where new threads are made and a TimeUUID is used as a key, which obviously makes sorting new threads quite easy, especially when making a query for, say, the latest 20 threads.
My problem is that when a new 'post' is made to a thread, I want to be able to 'bump' that thread to the front of the 'thread line'. That is where the problem comes in: how do I make this happen so that I can still make queries that return threads in the right order, without producing any duplicates?
The only way I can see this working is if, rather than the column family sorting via TimeUUID, the column family sorts via the insertion timestamp; then I could use the unique thread IDs as column keys and retrieve them in the order they were inserted or reinserted, rather than by TimeUUID. Is this possible, or am I missing a simple trick that allows for this? As far as I know you have to set a particular comparator, otherwise it defaults to bytes?
Columns within a row are always sorted by name with the given comparator. You cannot sort by timestamp or value or anything else, or Cassandra would not be able to merge multiple updates to the same column correctly.
As to your use case, I can think of two options.
The most similar to what you are doing now would be to create a second columnfamily, ThreadMostRecentPosts, with timeuuid columns (you said "keys" but it sounds like you mean "columns"). When a new post arrives, delete the old most-recent column and add a new one.
This has two problems:
The unit of replication is the row, so having this grow indefinitely could be problematic. (Using expiring columns to age out no-longer-relevant thread information might help.)
You need a lock manager so that multiple posts to the same thread don't race and possibly leave multiple entries in this row.
I would suggest instead creating a row per day (for instance), whose columns are the thread IDs and whose values are the most recent post. Adding a new post just updates the value in that column; no delete/re-add is done, so the race is not a problem anymore. You don't get sorting for free anymore but that's okay because you're limiting it to a small enough set that you can do that sort in memory (say, yesterday's threads and today's).
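In modern CQL terms, a "row per day whose columns are thread IDs" becomes a partition per day; a hedged sketch (table and column names are assumptions):
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("forum")

session.execute("""
    CREATE TABLE IF NOT EXISTS thread_activity (
        day text,                 -- one partition per day, e.g. '20240101'
        thread_id uuid,
        last_post timestamp,
        PRIMARY KEY (day, thread_id)
    )
""")

day, thread_id = "20240101", uuid.uuid4()

# A new post simply overwrites the value; no delete/re-add, hence no race:
session.execute(
    "UPDATE thread_activity SET last_post = toTimestamp(now()) "
    "WHERE day = %s AND thread_id = %s",
    (day, thread_id),
)

# Pull today's (and yesterday's) partitions and sort by last_post in memory:
rows = session.execute(
    "SELECT thread_id, last_post FROM thread_activity WHERE day = %s", (day,)
)
bumped = sorted(rows, key=lambda row: row.last_post, reverse=True)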
(Finally, I would add that I can say from experience that having a cutoff past which old threads don't get bumped to the front by a new reply is a Good Thing.)
