I'm trying to sort user objects stored as hashes in Redis.
The keys are like users:valid:2, users:valid:5, users:valid:10, users:invalid:14, etc.
My goal is to fetch users ordered by their ids ASC, valid users first.
I additionally store users ids in a sorted set where "score" is a filter id. Filters mapping is like { valid: 0, invalid: 1 }
$redis.zadd 'sorted-ids', 0, 2
$redis.zadd 'sorted-ids', 0, 5
$redis.zadd 'sorted-ids', 0, 10
$redis.zadd 'sorted-ids', 1, 14
This lets me fetch user ids already sorted ASC, valid first. Great! However...
$redis.zrange 'sorted-ids', 0, -1, with_scores: true
=> [["10", 0.0], ["2", 0.0], ["5", 0.0], ["14", 1.0]]
"10" comes before "2".
So the ids are not actually ASC-ordered because they're stored as strings.
https://redis.io/commands/zadd
When multiple elements have the same score, they are ordered
lexicographically (they are still ordered by score as a first key,
however, locally, all the elements with the same score are relatively
ordered lexicographically). The lexicographic ordering used is binary,
it compares strings as array of bytes.
Is there a way to achieve this sorting properly in Redis? For example, by storing ids as integers to avoid lexicographic ordering, or some other way?
You can pad your IDs (Sorted Set members) with zeros, e.g. '1' becomes '000000001', and lexicographical ordering will work for you. Make sure, however, that every padded ID has the same length (e.g. 9 digits).
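The padding idea can be checked without a Redis server. Below is a minimal sketch that mimics the sorted-set ordering (score first, then members compared byte by byte) in plain Python. The helper names `pad_id` and `sorted_set_order` and the 9-digit width are assumptions for illustration, not part of any Redis client.

```python
ID_WIDTH = 9  # assumed fixed width; must cover the largest id you will ever store


def pad_id(user_id: int, width: int = ID_WIDTH) -> str:
    """Return the id as a zero-padded, fixed-width string."""
    text = str(user_id)
    if user_id < 0 or len(text) > width:
        raise ValueError(f"id {user_id} does not fit in {width} digits")
    return text.zfill(width)


def sorted_set_order(entries):
    """Mimic sorted-set ordering: by score first, then members compared as bytes."""
    return sorted(entries, key=lambda entry: (entry[1], entry[0].encode()))


# (member, score) pairs from the question: score 0 = valid, 1 = invalid
raw = [("2", 0), ("5", 0), ("10", 0), ("14", 1)]
padded = [(pad_id(int(member)), score) for member, score in raw]

unpadded_order = [member for member, _ in sorted_set_order(raw)]
padded_order = [int(member) for member, _ in sorted_set_order(padded)]

print(unpadded_order)  # ['10', '2', '5', '14']
print(padded_order)    # [2, 5, 10, 14]
```

The padded member is what would be passed to zadd; converting back with int() recovers the original id when building the users:valid:N key.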
I have an application which has a dashboard, basically a table with hundreds of thousands of records.
This table has up to 50 different columns. These columns have different types in mapping: keyword, text, boolean, integer.
As records in the table might have the same values, I sort by an array of 2 attributes:
The first attribute is what the client wants to sort by. It can be a simple sorting object or a sort query with a nested filter.
The second attribute is a default sort by id, needed to order documents that have identical values for the column the customer wants to sort by.
I checked multiple topics/issues on GitHub and here on the Elastic forum to understand how to implement the search_after mechanism for paging backwards, but it's not working for all the cases I need.
Please have a look at the image:
Imagine the limit is 3, the customer is currently on the 3rd page of the table, and all the data is sorted by name asc, _id asc.
The names are: A, B, C, D, E on the image.
The ids are the numeric parts of the Doc labels.
When the customer wants to go back to the previous page, which is page #2 in my picture, I pass the following to Elasticsearch:
sort: [
{
name: 'desc'
},
{
_id: 'desc'
}
],
search_after: [null, Doc7._id]
As a result, I get only one document, which is Doc6: null in my image. That seems logical, because I ask Elasticsearch to search in desc order after null and id 7, and only one document matches: Doc6. But it's not what I need.
I can't come up with a solution that returns the data I need.
Could anyone help, please?
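One thing that may be worth checking: Elasticsearch places documents with a missing sort value last by default for both asc and desc, so flipping only the order does not give the exact reverse of the forward listing. The sketch below builds a "previous page" request that flips the order and also the `missing` placement. The helper names, the explicit `missing` setting on name, and the cursor value "7" are assumptions for illustration; whether this fits depends on the mapping and on what the image shows.

```python
def flip_clause(clause):
    """Reverse one sort clause: flip the order and the placement of missing values."""
    (field, spec), = clause.items()
    flipped = {"order": "desc" if spec["order"] == "asc" else "asc"}
    if "missing" in spec:
        swap = {"_last": "_first", "_first": "_last"}
        # Custom missing values (not _first/_last) are passed through unchanged.
        flipped["missing"] = swap.get(spec["missing"], spec["missing"])
    return {field: flipped}


def previous_page_body(forward_sort, first_hit_sort_values, limit):
    """Build a request body that walks backwards from the first hit of the current page."""
    return {
        "size": limit,
        "sort": [flip_clause(clause) for clause in forward_sort],
        "search_after": list(first_hit_sort_values),
    }


def restore_display_order(hits):
    """Hits come back in reversed order, so reverse them again before display."""
    return list(reversed(hits))


# Forward sort from the question, with the default missing placement made explicit.
forward_sort = [
    {"name": {"order": "asc", "missing": "_last"}},
    {"_id": {"order": "asc"}},
]

# Assumed cursor: the sort values of the first document on the current page.
body = previous_page_body(forward_sort, [None, "7"], limit=3)
print(body["sort"])
```

With `missing` set to `_first` on the reversed name sort, documents without a name are visited before E, D, C, and so on, which mirrors the forward listing instead of keeping the nulls at the end.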
Is it possible in ES to query and sort by an array of values? For example:
Give me all results, but results with "country_code" in ['de', 'au', 'es'] should be prioritized in the order given in the array.
Is that possible?
Elasticsearch does not really handle arrays. Internally an array is just the same field having 3 different values: country_code = "de" AND country_code = "au" AND country_code = "es" all at the same time. You can, though, use script-based sorting and handle arrays in Painless.
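A minimal sketch of what such a script-based sort could look like, built as a Python dict that would be sent as the search body. It assumes country_code is a single-valued keyword field with doc values; the Painless source, the `priority` parameter name, and the helper names are illustrative assumptions. `local_rank` only mirrors the script's logic locally so the intended ordering can be checked without a cluster.

```python
PRIORITY = ["de", "au", "es"]

# Painless script: rank = position in the priority list, or list length if absent.
PAINLESS_SOURCE = (
    "if (doc['country_code'].size() == 0) { return params.priority.size(); } "
    "int i = params.priority.indexOf(doc['country_code'].value); "
    "return i < 0 ? params.priority.size() : i;"
)


def build_priority_sort(priority):
    """Build a search body that sorts prioritized country codes first, in list order."""
    return {
        "query": {"match_all": {}},
        "sort": [
            {
                "_script": {
                    "type": "number",
                    "order": "asc",
                    "script": {
                        "lang": "painless",
                        "source": PAINLESS_SOURCE,
                        "params": {"priority": list(priority)},
                    },
                }
            }
        ],
    }


def local_rank(code, priority):
    """Same ranking rule as the script, for checking the intended order locally."""
    return priority.index(code) if code in priority else len(priority)


body = build_priority_sort(PRIORITY)
codes = ["fr", "es", "de", "us", "au"]
ranked = sorted(codes, key=lambda code: local_rank(code, PRIORITY))
print(ranked)  # ['de', 'au', 'es', 'fr', 'us']
```

Documents outside the list all share the same rank, so a second sort clause would be needed to order them among themselves.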
Problem description:
The goal is to efficiently query strings from a set of JSON documents while respecting document-level security, such that a user is only able to retrieve data from documents they have access to.
Suppose we have the following documents:
Document document_1, which has no restrictions:
{
"id": "document_1",
"set_of_strings_1": [
"the",
"quick",
"brown"
],
"set_of_strings_2": [
"fox",
"jumps",
"over",
],
"isPublic": true
}
Document document_2, which can only be accessed by 3 users:
{
"id": "document_2",
"set_of_strings_1": [
"the",
"lazy"
],
"set_of_strings_2": [
"dog"
],
"isPublic": false,
"allowed_users": [
"Alice",
"Bob",
"Charlie"
]
}
Now suppose user Bob (has access to both documents) makes the following query:
getStrings(
user_id: "Bob",
set_of_strings_id: "set_of_strings_1"
)
The correct response should be the union of set_of_strings_1 from both documents:
["the", "quick", "brown", "lazy"]
Now suppose user Dave (has access to document_1 only) makes the following query:
getStrings(
user_id: "Dave",
set_of_strings_id: "set_of_strings_1"
)
The correct response should be set_of_strings_1 from document_1:
["the", "quick", "brown"]
A further optimization is to handle prefix tokens. E.g. for the query
getStrings(
user_id: "Bob",
set_of_strings_id: "set_of_strings_1",
token: "t"
)
The correct response should be:
["the"]
Note: an empty token should match all strings.
However, I am happy to perform a simple in-memory prefix-match after the strings have been retrieved. The bottleneck here is expected to be the number of documents, not the number of strings.
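To pin down the expected behaviour, here is a small reference implementation of the query semantics described above, including the in-memory prefix match. It is deliberately the slow, scan-everything version and only serves as a specification to test faster designs against. The function names `can_access` and `get_strings` are assumptions mirroring the getStrings call in the examples.

```python
documents = [
    {
        "id": "document_1",
        "set_of_strings_1": ["the", "quick", "brown"],
        "set_of_strings_2": ["fox", "jumps", "over"],
        "isPublic": True,
    },
    {
        "id": "document_2",
        "set_of_strings_1": ["the", "lazy"],
        "set_of_strings_2": ["dog"],
        "isPublic": False,
        "allowed_users": ["Alice", "Bob", "Charlie"],
    },
]


def can_access(user_id, document):
    """A document is visible if it is public or the user is explicitly allowed."""
    return document["isPublic"] or user_id in document.get("allowed_users", [])


def get_strings(user_id, set_of_strings_id, token=""):
    """Union of the requested string set over all visible documents, filtered by prefix."""
    seen = {}  # dict keeps first-seen order while removing duplicates
    for document in documents:
        if not can_access(user_id, document):
            continue
        for value in document.get(set_of_strings_id, []):
            if value.startswith(token):  # an empty token matches everything
                seen.setdefault(value, None)
    return list(seen)


print(get_strings("Bob", "set_of_strings_1"))        # ['the', 'quick', 'brown', 'lazy']
print(get_strings("Dave", "set_of_strings_1"))       # ['the', 'quick', 'brown']
print(get_strings("Bob", "set_of_strings_1", "t"))   # ['the']
```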
What I have tried:
Approach 1: Naive approach
The naive solution here would be to:
put all the documents in a SQL database
perform a full-table scan to get all the documents (we can have millions of documents)
iterate through all the documents to figure out user permissions
filter out the set of documents the user can access
iterate through the filtered list to get all the strings
This is too slow.
Approach 2: Inverted indices
Another approach considered is to create an inverted index from users to documents, e.g.
users  | documents_they_can_see
user_1 | document_1, document_2, document_3
user_2 | document_1
user_3 | document_1, document_4
This will efficiently give us the document ids, which we can use against some other index to construct the string set.
If this next step is done naively, it still involves a linear scan through all the documents the user is able to access. To avoid this, we can create another inverted index mapping document_id#set_of_strings_id to the corresponding set of strings. Then we take the union of all the sets to get the result and run the prefix match afterwards. However, this involves taking the union of a large number of sets.
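The two inverted indices can be sketched in memory as follows. This is only an illustration of the data layout, not a recommendation for a particular store. The class name `StringIndex` and the choice to keep public documents in a separate set (so they are not repeated under every user) are assumptions.

```python
from collections import defaultdict


class StringIndex:
    """Two inverted indices: user -> document ids, and doc_id#set_id -> strings."""

    def __init__(self):
        self.public_docs = set()
        self.docs_by_user = defaultdict(set)
        self.strings_by_doc_and_set = defaultdict(set)

    def add_document(self, document):
        doc_id = document["id"]
        if document.get("isPublic"):
            self.public_docs.add(doc_id)
        for user in document.get("allowed_users", []):
            self.docs_by_user[user].add(doc_id)
        # Every list-valued field other than allowed_users is treated as a string set.
        for field, value in document.items():
            if isinstance(value, list) and field != "allowed_users":
                self.strings_by_doc_and_set[f"{doc_id}#{field}"].update(value)

    def get_strings(self, user_id, set_of_strings_id):
        visible = self.public_docs | self.docs_by_user.get(user_id, set())
        result = set()
        for doc_id in visible:  # still one set union per visible document
            key = f"{doc_id}#{set_of_strings_id}"
            result |= self.strings_by_doc_and_set.get(key, set())
        return result


index = StringIndex()
index.add_document({
    "id": "document_1",
    "set_of_strings_1": ["the", "quick", "brown"],
    "isPublic": True,
})
index.add_document({
    "id": "document_2",
    "set_of_strings_1": ["the", "lazy"],
    "isPublic": False,
    "allowed_users": ["Alice", "Bob", "Charlie"],
})

print(sorted(index.get_strings("Bob", "set_of_strings_1")))
print(sorted(index.get_strings("Dave", "set_of_strings_1")))
```

The loop in `get_strings` makes the cost visible: the work is proportional to the number of documents the user can see, which is the large union mentioned above.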
Approach 3: Caching
Use Redis with the following data model:
key                       | value
user_id#set_of_strings_id | [String]
Then we perform prefix match in-memory on the set of strings we get from the cache.
We want this data to be fairly up-to-date so the source-of-truth datastore still needs to be performant.
I don't want to reinvent the wheel. Is there a data structure or some off-the-shelf system that does what I am trying to do?
I want to find all distinct values of a term over a time range.
Example data:
[
{
'a': 123,
'b': 456,
'user': 'bob',
},
{
...
'user': 'sally',
...
},
{
...
'user': 'bob',
...
},
{
'x': 2,
'y': 3,
}
]
In this case I want to find all distinct values of user.
Note that some users will appear in multiple records, and not all records have a user.
Also, my requirement is that the list of returned users MUST be comprehensive (i.e. if there exists a record with a certain user, then that user MUST appear in the list of results).
Having the number of occurrences of each user would be nice too, but not required.
I considered Cardinality Aggregations but I'm concerned about the 'approximate' nature of the results. I need a comprehensive list of users.
How can I do this in elasticsearch?
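As mentioned in the comments, the terms aggregation is the one you are looking for. Results are approximate only if you ask for the N most common terms and the data is split across multiple shards.
In older Elasticsearch versions you can set size to zero to get "unlimited" (Integer.MAX_VALUE) results. That option was removed in Elasticsearch 5.0, where you need to set an explicit size large enough to cover every distinct value.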
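A minimal sketch of the request body for such a terms aggregation, built as a Python dict. The timestamp field name `@timestamp`, the aggregation name `distinct_users`, and the `max_terms` value are assumptions; user is assumed to be a keyword (not analyzed) field. An explicit aggregation size is used here, which must be at least the number of distinct users for the list to be complete. The `Counter` at the end only shows what the buckets should contain for the sample data.

```python
from collections import Counter


def distinct_users_body(start, end, max_terms=10000):
    """Search body: no hits, just one bucket per distinct user in the time range."""
    return {
        "size": 0,  # top-level size: return no documents, only aggregations
        "query": {"range": {"@timestamp": {"gte": start, "lt": end}}},
        "aggs": {
            "distinct_users": {
                # aggregation-level size: upper bound on the number of buckets
                "terms": {"field": "user", "size": max_terms}
            }
        },
    }


body = distinct_users_body("2015-01-01", "2015-02-01")

# Sample records from the question; records without a user produce no bucket.
records = [
    {"a": 123, "b": 456, "user": "bob"},
    {"user": "sally"},
    {"user": "bob"},
    {"x": 2, "y": 3},
]
expected_buckets = Counter(r["user"] for r in records if "user" in r)
print(expected_buckets)  # Counter({'bob': 2, 'sally': 1})
```

Each bucket in the response carries a key and a doc_count, which also covers the optional "number of occurrences" requirement.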
I have a mongo db collection for restaurants.
e.g.
{_id: uniquemongoid,
rank: 3,
city: 'Berlin'
}
Restaurants are listed by city and ordered by rank (an integer) - should I create an index on city and rank, or city/rank compound? (I query by city and sort by rank)
Furthermore, there are several boolean fields, e.g. { hasParking:true, familyFriendly:true }. Should I create indexes to speed up queries for these filters? Compound indexes? It's not clear to me whether I should create compound indexes, as a query can have one or more booleans set.
The best way to figure out whether you need indexes is to benchmark it with "explain()".
As for your suggested indexes:
You will need the city/rank compound index. Indexes in MongoDB can only be used left-to-right (at the moment), so for an equality search on "city" followed by a sort on "rank", the { city: 1, rank: -1 } index would work best.
Indexes on boolean fields are often not very useful, as on average MongoDB will still need to access half of your documents. After doing a selection by city (and hopefully a limit!) doing an extra filter for hasParking etc will not make MongoDB use both the city/rank and the hasParking index. MongoDB can only use one index per query.
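Why the compound index serves both the equality match and the sort can be illustrated without a database. The sketch below keeps a sorted list of (city, -rank, _id) entries, which stands in for a { city: 1, rank: -1 } index: all entries for one city are contiguous and already in rank-descending order, so a query is a range lookup plus a limit. This is a simplified model, not how MongoDB stores indexes internally; the sample data and function name are made up for illustration.

```python
from bisect import bisect_left, bisect_right

restaurants = [
    {"_id": 1, "city": "Berlin", "rank": 3},
    {"_id": 2, "city": "Berlin", "rank": 7},
    {"_id": 3, "city": "Paris", "rank": 5},
    {"_id": 4, "city": "Berlin", "rank": 1},
    {"_id": 5, "city": "Paris", "rank": 9},
]

# Model of the compound index: city ascending, then rank descending (stored negated).
index = sorted((r["city"], -r["rank"], r["_id"]) for r in restaurants)


def find_by_city_sorted_by_rank_desc(city, limit):
    """Equality on city uses the index prefix; results come out already sorted by rank."""
    lo = bisect_left(index, (city,))                 # first entry for this city
    hi = bisect_right(index, (city, float("inf")))   # one past the last entry
    return [doc_id for _, _, doc_id in index[lo:hi][:limit]]


print(find_by_city_sorted_by_rank_desc("Berlin", 2))  # [2, 1]
print(find_by_city_sorted_by_rank_desc("Paris", 10))  # [5, 3]
```

No separate sort step is needed after the lookup, which is the property the left-to-right rule is describing.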
1) Create the index { city: 1, rank: 1 }, which will serve your purpose.
You will avoid 2 indexes
2) Create documents in the following format and you can query for any number of fields you want.
{
info: [{hasParking:true}, {familyFriendly:true}],
_id:
rank:
city:
}
db.restaurants.ensureIndex({info : 1});
db.restaurants.find({ info :{ hasParking:true}})
Note that MongoDB doesn't use two indexes for the same query (except for $or queries). So, in case (2), if you want to add an additional filter on top of query (1), option (2) won't work. I am not sure about your requirement (2), so I am posting this solution anyway.