Composite key / secondary indexing strategy in Redis

Say I have some data of the following sort that I want to store in Redis:
* UUID
* State (e.g. PROCESSED, WAITING_FOR_RESPONSE)
* […] other vals
The UUID and the State are the only two fields I will ever need to query on.
What data structure in Redis is most suited for this?
How would I go about structuring the keys?

Okay, I'm not sure I understand completely, but I'm going to try to go with it.
Assuming you need to look up all entities with state PROCESSED, you can use a set per state:
SADD PROCESSED 123-abcd-4567-0000
Then you can easily find all entities in the PROCESSED state; do the same for each state you want to query:
SMEMBERS PROCESSED
Now you'll also want a hash per entity holding its values:
HSET 123-abcd-4567-0000 state PROCESSED
HSET 123-abcd-4567-0000 otherproperty valuedata
This sets the "state" field in the hash for that UUID to PROCESSED. You'll need to keep the sets and the hashes in sync; you can use MULTI/EXEC, Lua scripts, or just handle it in your application code.
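For example, one way to keep them in sync (a sketch, assuming the entity is moving from WAITING_FOR_RESPONSE to PROCESSED) is to wrap the state transition in a MULTI/EXEC transaction:
127.0.0.1:6379> MULTI
OK
127.0.0.1:6379> SREM WAITING_FOR_RESPONSE 123-abcd-4567-0000
QUEUED
127.0.0.1:6379> SADD PROCESSED 123-abcd-4567-0000
QUEUED
127.0.0.1:6379> HSET 123-abcd-4567-0000 state PROCESSED
QUEUED
127.0.0.1:6379> EXEC
1) (integer) 1
2) (integer) 1
3) (integer) 0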
So in summary you have two major structures:
Sets to store the State-to-UUID information: one set per state.
Hashes to store the UUID-to-properties information: one hash per entity.
Example hashes (one per entity):
123-abcd-4567-0000 => { state: PROCESSED, active: true }
987-zxy-1234-0000 => { state: PROCESSED, active: false }
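Putting the two structures together, "give me all PROCESSED entities and their data" is a set read followed by a hash read per UUID (a sketch against the example data above):
127.0.0.1:6379> SMEMBERS PROCESSED
1) "123-abcd-4567-0000"
2) "987-zxy-1234-0000"
127.0.0.1:6379> HGETALL 123-abcd-4567-0000
1) "state"
2) "PROCESSED"
3) "active"
4) "true"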
But please clarify more if this doesn't seem to fit.
If you want to reduce your key space (one hash per entity can add up to a lot of keys), you can create a hash per attribute instead:
HSET states 123-abcd-4567-0000 PROCESSED
Thus you have one hash per attribute; the hash field is the UUID, and its value is the value of the property.
Example hashes (one per attribute):
states => { 123-abcd-4567-0000: PROCESSED, 987-zxy-1234-0000: PROCESSED }
active => { 123-abcd-4567-0000: true, 987-zxy-1234-0000: false }
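With this layout a point lookup is a single HGET, but listing all UUIDs in a given state now means scanning the attribute hash (a sketch against the example data):
127.0.0.1:6379> HGET states 123-abcd-4567-0000
"PROCESSED"
127.0.0.1:6379> HSCAN states 0
1) "0"
2) 1) "123-abcd-4567-0000"
   2) "PROCESSED"
   3) "987-zxy-1234-0000"
   4) "PROCESSED"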

RediSearch (a Redis module) supports adding secondary indexes to existing data in Redis, such as Hashes.
After defining a schema for the fields you would like to index, you can easily search on those fields' values. For example:
127.0.0.1:6379> FT.CREATE myIdx ON HASH PREFIX 1 doc: SCHEMA title TEXT
OK
127.0.0.1:6379> HSET doc:1 title "mytitle" body "lorem ipsum" url "http://redis.io"
(integer) 3
127.0.0.1:6379> FT.SEARCH myIdx "@title:mytitle" LIMIT 0 10
1) (integer) 1
2) "doc:1"
3) 1) "title"
   2) "mytitle"
   3) "body"
   4) "lorem ipsum"
   5) "url"
   6) "http://redis.io"

Related

Optimize range search between two numeric columns with "From" and "To" limits

I have a DB table with millions of records, structured like below:

ACCOUNT_RANGE_FROM   ACCOUNT_RANGE_TO   Name
12345670000          12345679999        XYZ
12345680000          12345689999        XYY
I need to check whether a given input number falls in any of the ranges in the table.
We currently use Oracle, and a query like [number > ACCOUNT_RANGE_FROM and number < ACCOUNT_RANGE_TO] is very slow because it does a full table scan over all the rows.
Creating an index on the columns is also of little help.
So I was thinking about how we could cache such data to improve search time. The application is developed in Spring Boot.
Could you please advise whether Redis is a suitable candidate for this use case, or whether an alternate approach should be evaluated.
You could use two Sorted Sets: one for the start of each range and one for the end. Using the values in your example:
127.0.0.1:6379> ZADD zset:acct_from 12345670000 XYZ 12345680000 XYY
(integer) 2
127.0.0.1:6379> ZADD zset:acct_to 12345679999 XYZ 12345689999 XYY
(integer) 2
Say you are looking for the label for the range where 12345671000 lies:
127.0.0.1:6379> ZRANGESTORE zset:tmp_from zset:acct_from -inf 12345671000 BYSCORE
(integer) 1
127.0.0.1:6379> ZRANGESTORE zset:tmp_to zset:acct_to 12345671000 inf BYSCORE
(integer) 2
Inspect the two temp sets:
127.0.0.1:6379> ZRANGE zset:tmp_to -inf inf BYSCORE
1) "XYZ"
2) "XYY"
127.0.0.1:6379> ZRANGE zset:tmp_from -inf inf BYSCORE
1) "XYZ"
The answer is the intersection of the two temp sets:
127.0.0.1:6379> ZINTER 2 zset:tmp_from zset:tmp_to
1) "XYZ"
In Spring:
redisTemplate.opsForZSet().add("zset:acct_from", "XYZ", 12345670000d);
redisTemplate.opsForZSet().add("zset:acct_to", "XYZ", 12345679999d);
redisTemplate.opsForZSet().add("zset:acct_from", "XYY", 12345680000d);
redisTemplate.opsForZSet().add("zset:acct_to", "XYY", 12345689999d);
// unfortunately I don't think ZRANGESTORE is available in Spring Data Redis - so you'll have to do the intersection in memory - this might be costly (memory-wise)
Set<String> tmpFrom = redisTemplate.opsForZSet().rangeByScore("zset:acct_from", Double.NEGATIVE_INFINITY, 12345671000d);
Set<String> tmpTo = redisTemplate.opsForZSet().rangeByScore("zset:acct_to", 12345671000d, Double.POSITIVE_INFINITY);
Set<String> result = new HashSet<>(tmpFrom);
result.retainAll(tmpTo);

Kafka Streams: Add Sequence to each message within a group of messages

Setup
Kafka 2.5
Apache KStreams 2.4
Deployment to OpenShift (containerized)
Objective
Group a set of messages from a topic using a set of value attributes and assign a unique group identifier.
-- This can be achieved by using selectKey and groupByKey:
originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .groupByKey();

groupedStream.mapValues((k, v) -> {
    v.setGroupKey(k);
    return v;
});
For each message within a specific group, create a new message with an itemCount number as one of the attributes.
E.g., a group with key "keypart1|keyPart2" can have 10 messages, and each message should get an incremental ID from 1 through 10.
Options considered: aggregate? Or count plus some additional StateStore-based implementation?
One of the options listed above can make use of a couple of state stores:
state store 1 -> mapping of each groupId to an individual item (KTable)
state store 2 -> count per groupId (KTable)
A join of these two tables would stamp a sequence on the messages as they get published to the final topic.
Other statistics:
The average number of messages per group would be in the thousands, except for an outlier case where it can go up to 500k.
In general, the candidates for a group should be available on the source within a span of 15 minutes at most.
The following points are of concern from an optimum-solution perspective:
I am still not clear how I would be able to stamp a sequence number on the messages unless some kind of state store is used to keep track of the messages published within a group.
Use of KTables and state stores (either explicitly, or implicitly through the use of KTables) would add considerably to the state store size.
Given that the problem involves stateful processing, the state store can't be avoided, but any possible optimizations might be useful.
Any thoughts or references to similar patterns would be helpful.
You can use one state store with which you maintain the ID for each composite key. When you get a message, you select the new composite key and then look up the next ID for that composite key in the state store. You stamp the message with the ID you just looked up. Finally, you increase the ID and write it back to the state store.
Code-wise, it would be something like:
// create the state store: composite key -> next ID
StoreBuilder<KeyValueStore<String, Long>> keyValueStoreBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("idMaintainer"),
    Serdes.String(),
    Serdes.Long()
);
// add the store to the topology
builder.addStateStore(keyValueStoreBuilder);
originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .repartition() // repartition() needs Kafka Streams 2.6+; on older versions use through() with an explicit topic
    // ValueTransformerWithKey (rather than ValueTransformer) so transform() can read the composite key
    .transformValues(() -> new ValueTransformerWithKey<String, Message, Message>() {
        private KeyValueStore<String, Long> state;

        @Override
        public void init(ProcessorContext context) {
            state = (KeyValueStore<String, Long>) context.getStateStore("idMaintainer");
        }

        @Override
        public Message transform(String key, Message value) {
            // look up the next ID for this composite key (start at 1)
            Long id = state.get(key);
            if (id == null) {
                id = 1L;
            }
            value.setItemCount(id); // app-specific: stamp the record with the ID
            // increase the ID and write it back to the state store
            state.put(key, id + 1);
            // return the stamped record
            return value;
        }

        @Override
        public void close() {
        }
    }, "idMaintainer")
    .to("output-topic");
You do not need to worry about concurrent access to the state store, because in Kafka Streams all records with the same key are processed by a single task, and tasks do not share state stores. That means all records with the same composite key will be processed by one task, which exclusively maintains the IDs for its composite keys in its state store.

Search by values in Redis cache - Secondary Indexing

I am new to Redis. I want to search by one or multiple values that come from an API.
E.g., let's say I want to store some securities data as below:
Value1
{
    "isin": "isin123",
    "id_bb_global": "BBg12345676",
    "cusip": "cusip123",
    "sedol": "sedol123",
    "cpn": "0.09",
    "cntry": "US",
    "144A": "xyz",
    "issue_cntry": "UK"
}
Value2
{
    "isin": "isin222",
    "id_bb_global": "BBG222",
    "cusip": "cusip222",
    "sedol": "sedol222",
    "cpn": "1.0",
    "cntry": "IN",
    "144A": "Y",
    "issue_cntry": "DE"
}
...
...
I want to search by cusip alone, by cusip plus id_bb_global, by ISIN plus exchange, or by sedol.
E.g., the search query data {"isin":"isin222", "cusip":"cusip222"} should return every matching data set.
What is the best way to store this kind of data structure in Redis, and what is the fastest way to retrieve it?
When you insert data, you can create sets to maintain the index.
{
    "isin": "isin123",
    "id_bb_global": "BBg12345676",
    "cusip": "cusip123",
    "sedol": "sedol123",
    "cpn": "0.09",
    "cntry": "US",
    "144A": "xyz",
    "issue_cntry": "UK"
}
For example, with the above data, if you want to filter by isin and cusip, you can create the respective sets isin:isin123 and cusip:cusip123 and add the item's ID to both of those sets.
Later on, if you want to find the items that are in both isin:isin123 and cusip:cusip123, you just have to run SINTER on those two sets.
Or if you want the items that are in either isin:isin123 or cusip:cusip123, you can use SUNION.
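A minimal sketch of that flow in redis-cli, assuming the record above is stored under a hypothetical item ID item:1:
127.0.0.1:6379> SADD isin:isin123 item:1
(integer) 1
127.0.0.1:6379> SADD cusip:cusip123 item:1
(integer) 1
127.0.0.1:6379> SINTER isin:isin123 cusip:cusip123
1) "item:1"
From item:1 you would then fetch the full record, for example from a hash keyed by the item ID.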

Querying Elasticsearch using array of values

I index items in Elasticsearch, where each item has these properties:
tags - array of strings, e.g. [ 'c++', 'java', 'python' ]
submitter_id - uuid
id - uuid
I also have a user with these properties:
tags - array of strings
following_ids - array of uuids
What I want to do is query Elasticsearch for items whose tags match the user's tags or whose submitter_id is one of the user's following_ids; I also boost fields. Right now I form the query like this:
"should"=>[{"match"=>{"tags"=>{"query"=>"yoga", "boost"=>3}}}, {"match"=>{"tags"=>{"query"=>"yogic technique", "boost"=>3}}},
{"match"=>{"tags"=>{"query"=>"lag jaa gale", "boost"=>3}}}, {"match"=>{"tags"=>{"query"=>"jonita gandhiband", "boost"=>3}}}
{"match"=>{"submitter_id"=>"fc8b720f-a306-4849-8bc1-38fafae7c92b"}},
{"match"=>{"submitter_id"=>"c35ec42f-2df0-4870-89a4-9e59c9df04ea"}}]
But if the user has a lot of tags or following_ids, I would soon run into the maximum clause count limit. How should I handle this?
Since you're looking for exact IDs and tags, you should be using the Terms Query anyway, but the added advantage in this case is that it accepts multiple terms, so you only need one clause for all your tags and one for your following IDs.
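In the same notation as the query above, it could collapse to two clauses (a sketch; note that terms queries match exact, non-analyzed values, so this assumes tags is a keyword-type field):
"should"=>[{"terms"=>{"tags"=>["yoga", "yogic technique", "lag jaa gale", "jonita gandhiband"], "boost"=>3}},
           {"terms"=>{"submitter_id"=>["fc8b720f-a306-4849-8bc1-38fafae7c92b", "c35ec42f-2df0-4870-89a4-9e59c9df04ea"]}}]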

Redis sort by multiple fields

In SQL it is easy to query with multiple sort fields, for example:
select * from user order by score desc, name desc
with two sort fields (score, name).
How should I do this in Redis?
Use a Redis sorted set, which is sorted by score. You have to compose a single score according to your needs:
finalScore = score*MAX_NAME_VALUE + getIntRepresentation(name)
// MAX_NAME_VALUE must be strictly greater than any value returned by getIntRepresentation()
and then use
zadd myset finalScore value
and then just use
zrevrange myset 0 10
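A worked example, assuming names are encoded as integers 0-999 so MAX_NAME_VALUE = 1000 (the names and numbers here are made up): alice (score 7, name code 42) gets finalScore 7042, bob (score 7, name code 99) gets 7099, and carol (score 8, name code 1) gets 8001.
127.0.0.1:6379> ZADD myset 7042 alice 7099 bob 8001 carol
(integer) 3
127.0.0.1:6379> ZREVRANGE myset 0 10
1) "carol"
2) "bob"
3) "alice"
carol sorts first on the higher score, and within score 7 bob comes before alice because his name code is higher, which matches the order by score desc, name desc semantics.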
