How to match the most specific string from a collection of strings

I want to search a collection of strings to find the most specific match from left to right. Each string can have some associated data that must be returned when a match occurs. Matches must be returned in order of specificity; that is, I should be able to iterate through matches from most specific to least specific (the other way around is also fine).
For example, for the following set of (string, data) tuples:
"abc", data_1
"xyz", data_2
"abcpqr", data_3
"abcpqrstu", data_4
"xyzpqr", data_2
If I search for abcpqr, I should get the following, in most-specific-to-least-specific order:
"abcpqr", data_3
"abc", data_1
For abcpqrstu:
"abcpqrstu" : data_4
"abcpqr", data_3
"abc", data_1
For abcpq:
"abc", data_1
Can I use a suffix tree or any other data structure to do this?
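One structure that fits is a trie (prefix tree); a suffix tree is more machinery than needed, since every match here is a prefix of the query. A minimal Python sketch (all names are illustrative): walk the trie along the query, record the payload at every node that terminates a stored string, and report the matches longest-first.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False   # True if a stored string ends at this node
        self.data = None        # payload for that string

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, data):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
        node.data = data

    def matches(self, query):
        # Collect (string, data) for every stored string that is a prefix
        # of the query, then return them most specific (longest) first.
        found = []
        node = self.root
        for i, ch in enumerate(query):
            node = node.children.get(ch)
            if node is None:
                break
            if node.terminal:
                found.append((query[:i + 1], node.data))
        return list(reversed(found))

trie = Trie()
for key, data in [("abc", "data_1"), ("xyz", "data_2"), ("abcpqr", "data_3"),
                  ("abcpqrstu", "data_4"), ("xyzpqr", "data_2")]:
    trie.insert(key, data)

print(trie.matches("abcpqr"))  # [('abcpqr', 'data_3'), ('abc', 'data_1')]
print(trie.matches("abcpq"))   # [('abc', 'data_1')]

Each lookup costs O(length of the query) plus the number of matches, independent of how many strings are stored.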

Related

Efficient data structure for searching data only in documents a user can access

Problem description:
The goal is to efficiently query strings from a set of JSON documents while respecting document-level security, such that a user is only able to retrieve data from documents they have access to.
Suppose we have the following documents:
Document document_1, which has no restrictions:
{
"id": "document_1",
"set_of_strings_1": [
"the",
"quick",
"brown"
],
"set_of_strings_2": [
"fox",
"jumps",
"over",
],
"isPublic": true
}
Document document_2, which can only be accessed by 3 users:
{
"id": "document_2",
"set_of_strings_1": [
"the"
"lazy"
],
"set_of_strings_2": [
"dog",
],
"isPublic": false,
"allowed_users": [
"Alice",
"Bob",
"Charlie"
]
}
Now suppose user Bob (who has access to both documents) makes the following query:
getStrings(
user_id: "Bob",
set_of_strings_id: "set_of_strings_1"
)
The correct response should be the union of set_of_strings_1 from both documents:
["the", "quick", "brown", "lazy"]
Now suppose user Dave (who has access to document_1 only) makes the following query:
getStrings(
user_id: "Dave",
set_of_strings_id: "set_of_strings_1"
)
The correct response should be set_of_strings_1 from document_1:
["the", "quick", "brown"]
A further optimization is to handle prefix tokens. E.g. for the query
getStrings(
user_id: "Bob",
set_of_strings_id: "set_of_strings_1",
token: "t"
)
The correct response should be:
["the"]
Note: an empty token should match all strings.
However, I am happy to perform a simple in-memory prefix match after the strings have been retrieved. The bottleneck here is expected to be the number of documents, not the number of strings.
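That in-memory prefix match is straightforward; a minimal Python sketch (the function name is illustrative):

def prefix_filter(strings, token=""):
    # Every string starts with "", so an empty token matches all strings.
    return [s for s in strings if s.startswith(token)]

prefix_filter(["the", "quick", "brown", "lazy"], token="t")  # -> ["the"]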
What I have tried:
Approach 1: Naive approach
The naive solution here would be to:
put all the documents in a SQL database
perform a full-table scan to get all the documents (we can have millions of documents)
iterate through all the documents to figure out user permissions
filter down to the set of documents the user can access
iterate through the filtered list to get all the strings
This is too slow.
Approach 2: Inverted indices
Another approach considered is to create an inverted index from users to documents, e.g.:

users     documents_they_can_see
user_1    document_1, document_2, document_3
user_2    document_1
user_3    document_1, document_4
This will efficiently give us the document ids, which we can use against some other index to construct the string set.
If this next step is done naively, it still involves a linear scan through all the documents the user is able to access. To avoid this, we can create another inverted index mapping document_id#set_of_strings_id to the corresponding set of strings. We then take the union of all those sets to get the result, and run the prefix match afterwards. However, this involves taking the union of a large number of sets.
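A minimal in-memory sketch of this two-index scheme (all data and names are illustrative):

# Inverted index 1: user -> documents they can see
user_to_docs = {
    "Bob":  {"document_1", "document_2"},
    "Dave": {"document_1"},
}

# Inverted index 2: document_id#set_of_strings_id -> set of strings
doc_set_to_strings = {
    "document_1#set_of_strings_1": {"the", "quick", "brown"},
    "document_2#set_of_strings_1": {"the", "lazy"},
}

def get_strings(user_id, set_of_strings_id):
    result = set()
    for doc_id in user_to_docs.get(user_id, ()):  # index lookup, no scan
        result |= doc_set_to_strings.get(f"{doc_id}#{set_of_strings_id}", set())
    return result

print(get_strings("Bob", "set_of_strings_1"))
# {'the', 'quick', 'brown', 'lazy'} -- this union step is the bottleneck at scale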
Approach 3: Caching
Use Redis with the following data model:

key                          value
user_id#set_of_strings_id    [String]

Then we perform the prefix match in memory on the set of strings we get from the cache.
We want this data to be fairly up-to-date, so the source-of-truth datastore still needs to be performant.
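A sketch of this cache with redis-py, using one Redis set per user_id#set_of_strings_id key (the connection and the helper function are assumptions):

import redis

r = redis.Redis()  # assumes a local Redis instance

# Populate the cache: one set per user_id#set_of_strings_id key.
r.sadd("Bob#set_of_strings_1", "the", "quick", "brown", "lazy")

def get_strings(user_id, set_of_strings_id, token=""):
    # Fetch the cached set, then prefix-match in memory.
    members = (m.decode() for m in r.smembers(f"{user_id}#{set_of_strings_id}"))
    return [s for s in members if s.startswith(token)]

print(get_strings("Bob", "set_of_strings_1", token="t"))  # ['the']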
I don't want to reinvent the wheel. Is there a data structure or some off-the-shelf system that does what I am trying to do?

FaunaDB: search for a document and get its ranking based on a score

I have the following collection of documents with this structure:
type Streak struct {
UserID string `fauna:"user_id"`
Username string `fauna:"username"`
Count int `fauna:"count"`
UpdatedAt time.Time `fauna:"updated_at"`
CreatedAt time.Time `fauna:"created_at"`
}
This looks like the following in FaunaDB Collections:
{
"ref": Ref(Collection("streaks"), "288597420809388544"),
"ts": 1611486798180000,
"data": {
"count": 1,
"updated_at": Time("2021-01-24T11:13:17.859483176Z"),
"user_id": "276989300",
"username": "yodanparry"
}
}
Basically I need a lambda or a function that takes in a user_id and spits out its rank within the collection, where rank is simply determined by sorting on the count field. For example, let's say I have the following documents (other fields omitted for simplicity):

user_id    count
abc        12
xyz        10
fgh        999
If I throw in fgh as an input for this lambda function, I want it to spit out 1 (or 0 if you start counting from 0).
I already have an index for user_id, so I can query and match a document reference from this index. I also have an index sorted_count that sorts documents by the count field in ascending order.
My current solution was to query all documents via the sorted_count index, then get the rank by iterating through the array. I think there should be a better solution for this; I'm just not seeing it.
Please help. Thank you!
Counting things in Fauna isn't as easy as one might expect, but you might still be able to do something more efficient than what you describe.
Assuming you have:
CreateIndex(
{
name: "sorted_count",
source: Collection("streaks"),
values: [
{ field: ["data", "count"] }
]
}
)
Then you can query this index like so:
Count(
Paginate(
Match(Index("sorted_count")),
{ after: 10, size: 100000 }
)
)
Which will return an object like this one:
{
before: [10],
data: [123]
}
Which tells you that there are 123 documents with count >= 10, which I think is what you want.
This means that, in order to get a user's rank based on their user_id, you'll need to implement this two-step process:
Determine the count of the user in question using your index on user_id.
Query sorted_count using the user's count as described above.
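The question is written in Go, but for illustration, here is a sketch of those two steps with Fauna's Python driver (the secret is a placeholder, the user_id index is assumed to return the streak document, and the pagination loop from the note below is omitted):

from faunadb import query as q
from faunadb.client import FaunaClient

client = FaunaClient(secret="YOUR_FAUNA_SECRET")  # placeholder secret

def rank_of(user_id):
    # Step 1: fetch the user's streak document via the user_id index.
    streak = client.query(q.get(q.match(q.index("user_id"), user_id)))
    count = streak["data"]["count"]
    # Step 2: count documents whose count is >= this user's count.
    result = client.query(
        q.count(q.paginate(q.match(q.index("sorted_count")),
                           after=count, size=100000)))
    return result["data"][0]  # 1-based rank; ties not resolved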
Note that if your collection has more than 100,000 documents, you'll need your Go code to iterate through all the pages based on the returned object's after field. 100,000 is Fauna's maximum allowed page size; see the Fauna docs on pagination for details.
Also note that this might not reflect whatever your desired logic is for resolving ties.

Filtering all field values per row

I have a table called 'sample'. Depending on which algorithm was used, each sample may have different field (property) names.
I need to be able to retrieve all samples that have field values containing or matching a user-supplied filter value.
So, for instance, one sample might have the properties:
example 1: "name", "gender", "state"
while another has:
example 2: "name", "gender", "rate"
and there would be thousands of such samples with more variation.
If a user looking at a table of samples shaped like example 2 above ("name", "gender", "rate") applied the filter "foo", I would need to query the table "sample" for all rows where any property's value contains or matches "foo" (so a value like "foobar" should match).
If they were looking at a set of samples shaped like example 1 ("name", "gender", "state"), I would need to do the same; however, I cannot hard-code the properties for either case.
In SQL I would get the field names and dynamically build a SQL query string, but with ReQL object dot notation I am struggling with how to do it.
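No answer was recorded here, but in ReQL's Python driver, something along these lines should work, since values() returns a document's field values without needing to know their names (the connection details are assumptions, and match returning null is treated as false by the predicate):

from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect("localhost", 28015)

# Keep rows where any field value, coerced to a string, contains "foo".
matching = (
    r.table("sample")
     .filter(lambda row: row.values().contains(
         lambda value: value.coerce_to("string").match("foo")))
     .run(conn)
)
for doc in matching:
    print(doc)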

MongoDB compound indexes vs single field indexes in terms of space consumption

According to this post, compound indexes are bigger in terms of size (I could not find much info in the docs, so if you could point me there I would be grateful).
Suppose I have to search for the whole address (we can assume I will always have all the fields available, both in the collection and in the query) through a collection of addresses like:
{
name: String,
street: String,
postcode: String,
City: String,
Country: String
}
My question is: how much bigger would a compound index be?
If a compound index is bigger than a single-field index, wouldn't it be better to add a hash of the concatenation of all values to every object, add a single index on the hash field, and search by that (although it does not sound like good practice)?
A hash and a compound index accomplish different things. A compound index has an order, and that order has an effect. For instance, the index { 'country' : 1, 'city' : 1, 'postcode' : 1 } would let you search for all addresses in a specific city of a specific country. A hash can't do that; hashes only support exact matches.
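To make the ordering point concrete, here is a rough PyMongo sketch (the addresses collection name and local connection are assumptions); queries on a leftmost prefix of the indexed fields can still use the compound index:

from pymongo import ASCENDING, MongoClient

db = MongoClient()["test"]  # assumes a local MongoDB instance

# Compound index: the field order matters, and leftmost prefixes remain usable.
db.addresses.create_index(
    [("country", ASCENDING), ("city", ASCENDING), ("postcode", ASCENDING)])

# Uses the index (matches a prefix of the indexed fields):
db.addresses.find({"country": "US", "city": "Chicago"})

# Also uses the index (leftmost field only):
db.addresses.find({"country": "US"})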
As for the hash idea, I don't see how it is bad practice at all; it's just a very narrow use case. Remember that every slight difference in spelling, additional whitespace, etc. will result in a different hash value, and that you can't answer even simple questions like "how many addresses in country X do we store?". But if you don't need that, why not?
By the way, MongoDB has built-in support for this. If the address is embedded, using a hashed index on the entire subdocument will accomplish what you need:
MongoDB supports hashed indexes of any single field. The hashing function collapses embedded documents and computes the hash for the entire value.
e.g.:
> db.hash.insert({ "name": "john", "address": { "city": "Chicago", "state": "IL", "country": "US" } });
WriteResult({ "nInserted" : 1 })
> db.hash.createIndex({ "address": "hashed" });
...
> // This query uses the index and finds the document:
> db.hash.find({ "address": { "city": "Chicago", "state": "IL", "country": "US" } });
> // This query won't find the document (the hash differs because state is missing), but is still fast (IXSCAN):
> db.hash.find({ "address": { "city": "Chicago", "country": "US" } });

In Couchbase or N1QL how can I check if the values in an array match

In Couchbase I have the following document structure:
{
name: "bob",
permissions: [
2,
4,
6
]
}
I need to be able to create a view or an N1QL query that will check whether the permissions for "bob" are contained within a given array.
e.g I have an array with contents
[1,2,3,4,5,6]
I need the "bob" document to be returned because my array contains 2,4,6 and so does "bob"
If my array contained 1,3,4,5,6 "bob" should not be selected because my array does not contain "2"
Essentially I want to match any documents whose permission entries are all contained in my array.
The solution can either a view or an N1QL query.
Using N1QL, you can do the following:
SELECT * FROM my_bucket WHERE EVERY p IN permissions SATISFIES p IN [ 1,2,3,4,5,6 ] END;
