Boost value depending on number of common terms - elasticsearch

Assume we have an index called users, where each user has hobbies stored as an array of strings, like ["basketball", "games"].
We want to fetch users with common hobbies, but boost users who share more than one hobby. For example, with 1 common hobby the boost should be 1, with 2 common hobbies it should be 3, and with 3 or more common hobbies it should be 5.
Is it possible with terms_set query? Should it be somehow wrapped with function_score?
terms_set: {
  hobbies: {
    terms: userDoingQuery.hobbies,
    minimum_should_match_script: {
      source: ???,
    },
    boost: ???,
  },
},
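One possible approach, sketched here in Python as a plain request-body builder: instead of a single terms_set with one boost, put one terms_set clause per tier into a bool query and wrap each in constant_score, so that the boosts of the matching tiers add up to 1, 3, and 5. The function names, the tier increments, and the assumption that hobbies is mapped as keyword are illustrative and not from the question; treat this as a starting point that has not been verified against a specific Elasticsearch version.

```python
# Tiers as (required common hobbies, boost increment).
# Matching should clauses have their scores summed, so 1 + 2 + 2
# yields the totals 1, 3 and 5 for 1, 2 and 3+ common hobbies.
TIERS = [(1, 1.0), (2, 2.0), (3, 2.0)]


def build_tiered_hobby_query(hobbies, field="hobbies"):
    """Build a bool query whose score grows with the number of shared hobbies.

    `field` is assumed to be a keyword field holding the hobby strings.
    """
    should = []
    for required, increment in TIERS:
        should.append({
            "constant_score": {
                "filter": {
                    "terms_set": {
                        field: {
                            "terms": list(hobbies),
                            # Constant script: this tier needs `required` matches.
                            "minimum_should_match_script": {"source": str(required)},
                        }
                    }
                },
                "boost": increment,
            }
        })
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


def expected_score(common_hobbies):
    """Score a document should receive under the assumptions above."""
    return sum(inc for req, inc in TIERS if common_hobbies >= req)
```

Wrapping the same logic in function_score with a script would be another option; the bool variant is shown because it only relies on terms_set and constant_score.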

Related

FaunaDB search document and get its ranking based on a score

I have the following Collection of documents with structure:
type Streak struct {
    UserID    string    `fauna:"user_id"`
    Username  string    `fauna:"username"`
    Count     int       `fauna:"count"`
    UpdatedAt time.Time `fauna:"updated_at"`
    CreatedAt time.Time `fauna:"created_at"`
}
This looks like the following in FaunaDB Collections:
{
  "ref": Ref(Collection("streaks"), "288597420809388544"),
  "ts": 1611486798180000,
  "data": {
    "count": 1,
    "updated_at": Time("2021-01-24T11:13:17.859483176Z"),
    "user_id": "276989300",
    "username": "yodanparry"
  }
}
Basically I need a lambda or a function that takes in a user_id and spits out its rank within the collection. The rank is determined simply by sorting on the count field. For example, let's say I have the following documents (I ignored other fields for simplicity):
user_id  count
abc      12
xyz      10
fgh      999
If I throw in fgh as an input for this lambda function, I want it to spit out 1 (or 0 if you start counting from 0).
I already have an index for user_id so I can query and match a document reference from this index. I also have an index sorted_count that sorts documents by the count field in ascending order.
My current solution was to query all documents by sorted_count index, then get the rank by iterating through the array. I think there should be a better solution for this. I'm just not seeing it.
Please help. Thank you!
Counting things in Fauna isn't as easy as one might expect. But you might still be able to do something more efficient than you describe.
Assuming you have:
CreateIndex(
  {
    name: "sorted_count",
    source: Collection("streaks"),
    values: [
      { field: ["data", "count"] }
    ]
  }
)
Then you can query this index like so:
Count(
  Paginate(
    Match(Index("sorted_count")),
    { after: 10, size: 100000 }
  )
)
Which will return an object like this one:
{
  before: [10],
  data: [123]
}
Which tells you that there are 123 documents with count >= 10, which I think is what you want.
This means that, in order to get a user's rank based on their user_id, you'll need to implement this two-step process:
1. Determine the count of the user in question using your index on user_id.
2. Query sorted_count using the user's count as described above.
Note that, in case your collection has more than 100,000 documents, you'll need your Go code to iterate through all the pages based on the returned object's after field. 100,000 is Fauna's maximum allowed page size. See the Fauna docs on pagination for details.
Also note that this might not reflect whatever your desired logic is for resolving ties.
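To make the rank logic concrete, here is a small in-memory sketch in Python. It does not talk to Fauna; it only mirrors the idea that the rank equals the number of entries with count >= the user's count in an ascending list. The function name rank_by_count is made up for this illustration.

```python
from bisect import bisect_left


def rank_by_count(sorted_counts, user_count):
    """Return the 1-based rank of user_count, where the highest count is rank 1.

    sorted_counts must be in ascending order, like the sorted_count index.
    The rank is the number of entries with count >= user_count.
    """
    first_at_or_above = bisect_left(sorted_counts, user_count)
    return len(sorted_counts) - first_at_or_above


counts = sorted([12, 10, 999])
print(rank_by_count(counts, 999))  # 1
print(rank_by_count(counts, 12))   # 2
print(rank_by_count(counts, 10))   # 3
```

With ties, every tied user gets the lowest rank of the group (two users with the lowest count out of three both rank 3), which may or may not be the tie-breaking you want.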

Maps vs Lists in Elasticsearch for optimized query performance

I have some data I will be putting into Elasticsearch, and want to decide on a format that will optimize query performance. In words, the query is: "Is ID X in category Y?". I have a fixed, small number of categories (say, 5), and possibly a large number of IDs to put into each category (currently in the dozens, but of indeterminate size in the future). Each ID will be in at most one category (possibly none).
Format 1:
{
  "field1": "value1",
  ...
  "categories": {
    "category1": ["id10", "id24", "id38",...],
    ...
    "category5": ["id62", "id19", "id82" ...]
  }
}
or
Format 2:
{
  "field1": "value1",
  ...
  "categories": {
    "id1": "category4",
    "id2": "category2",
    "id3": "category1",
    ...
  }
}
Which data format would be preferred? The latter format has linear lookup time, but possibly many keys.
I think format 1 is better. The number of IDs will grow in the future, and with format 2 every ID becomes its own field, so you may eventually need to turn off indexing for the categories object or raise the limit on the number of fields in the index. With format 1 it is also more convenient to determine the category of a single ID (indexOf). There are pros and cons to both. Maybe there's a better way.
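As a rough illustration of how "Is ID X in category Y?" could look with format 1, here is a Python sketch that builds a term query body and does the same check client-side. It assumes the category arrays are mapped as keyword fields; the function names are invented for this example.

```python
def id_in_category_query(doc_id, category):
    """Build a term query for format 1, where categories.<category> is a list of IDs.

    A term query on an array field matches if any element equals the value.
    Assumes the field is mapped as keyword.
    """
    return {"query": {"term": {f"categories.{category}": doc_id}}}


def id_in_category(document, doc_id, category):
    """Same check done client-side on an already fetched format 1 document."""
    return doc_id in document.get("categories", {}).get(category, [])


doc = {"field1": "value1", "categories": {"category1": ["id10", "id24", "id38"]}}
print(id_in_category_query("id10", "category1"))
print(id_in_category(doc, "id10", "category1"))
```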

Elasticsearch question, should I have duplicate data along 2 different indices? Not sure how to set up the data

Edit: 3 different indices. Sorry about the title :c
I am trying to grasp elasticsearch as fast as I can but I think I've confused myself majorly here. How should I set this data up?
I have 3 major searches:
1: Search by pokemon name. Eg: Show all Charizard in the system.
2: Search by trainer name Eg: Show all of John Doe's pokemon/checkins at the pokecenter.
3: Search by checkins at the pokecenter.
Should each of these be in its own separate index? I come primarily from an SQL background, so I want to have separate tables for all of these. But that isn't how elasticsearch works... so I am really confused here.
Should I have a separate index for each pokemon?
And then another separate index for each trainer?
And then another separate index for each checkin at the pokecenter?
Query return examples
1: Search by pokemon name.
{
  1 : {
    id: 9239329,
    pokeId: 6,
    name: Charizard,
    trainerId: 2932
  }
}
2: Search by trainer name
{
  1 : {
    id: 2932,
    name: John Doe,
    pokemon: [
      9239329
    ]
  }
}
3: Search by checkins at the pokecenter.
{
  1 : {
    id: 3232,
    date: 11/11/1111,
    pokemon: [
      9239329
    ],
    trainerId: 2932
  }
}
But if I have a separate index for EACH of these, while that would be fast, wouldn't that just be crazy, horrendous data duplication?
It depends on the scope of the project:
The ideal way is to give each one its own separate index. This allows you to scale them differently if needed, move them to another cluster, and give each one different replica settings.
The quick way is to have the checkins as an index, with the trainer as a nested object and, under that, the pokemon as a nested object.
Note: nested queries are slower, and writing the queries to return exactly what you want is a little trickier.
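To give an idea of what the "quick way" could look like, here is a Python sketch of a mapping for a single checkins index with trainer and pokemon as nested objects, plus a nested query for a pokemon name. The field names are taken from the question's examples; the field types, the typeless mapping layout, and the function name are assumptions.

```python
# Hypothetical mapping for one checkins index; types are guesses
# based on the example values in the question.
CHECKINS_MAPPING = {
    "mappings": {
        "properties": {
            "date": {"type": "date"},
            "trainer": {
                "type": "nested",
                "properties": {
                    "id": {"type": "long"},
                    "name": {"type": "text"},
                    "pokemon": {
                        "type": "nested",
                        "properties": {
                            "id": {"type": "long"},
                            "pokeId": {"type": "integer"},
                            "name": {"type": "text"},
                        },
                    },
                },
            },
        }
    }
}


def pokemon_name_query(name):
    """Find checkins containing a pokemon with the given name.

    Each nested level needs its own nested clause with the full path.
    """
    return {
        "query": {
            "nested": {
                "path": "trainer",
                "query": {
                    "nested": {
                        "path": "trainer.pokemon",
                        "query": {"match": {"trainer.pokemon.name": name}},
                    }
                },
            }
        }
    }
```

The doubly wrapped nested clause is the part that makes these queries a bit trickier to write than queries against flat, separate indices.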

Issue with defining a graphQL schema for an object

I have an object with some properties, but I am unable to define a schema for it.
Structure of the object
{
  "1": {
    average: 40,
    count: 15
  },
  "2": {
    average: 11,
    count: 2
  },
  "3": {
    average: 30,
    count: 2
  }
}
My schema
const typeDef = gql`
  type Query {
    scoreByRank: Count
  }
  type Count {
    "1": obj
    "2": obj
    "3": obj
  }
  type obj{
    average: Int
    count: Int
  }
`;
But this query is failing because of the object properties named "1", "2", etc. Is there a better way to build this schema?
GraphQL field names must follow certain rules. Specifically, they must match /^[_a-zA-Z][_a-zA-Z0-9]*$/. As you can see, a field name must not start with a digit 0-9. Certain field names might be inconvenient for a consumer of your GraphQL API depending on what languages they are using. The field name restrictions are defined so that most common languages can easily use the data provided by a GraphQL API.
The recommended solution is to follow the rules and come up with better field names, like _1, two or field3.
You could return an array instead of an object, but there is currently no easy way of enforcing a specific length for an array. On top of that, the indexing of an array would start at 0 rather than 1.
You could define a custom scalar for your object, but then you would lose the ability to leave out unnecessary fields and it would require far more convoluted code to work. As such, I would consider it bad practice. Only do it if you're dealing with an inflexible and opaque third party application that leaves you no other way.
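A small Python sketch of the two ideas above: checking candidate field names against the naming rule, and reshaping the object into an array of entries that carry their rank explicitly. The rank field and both function names are made up for this illustration.

```python
import re

# Same pattern as the GraphQL naming rule quoted above.
GRAPHQL_NAME = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")


def is_valid_graphql_name(name):
    """Return True if name is allowed as a GraphQL field name."""
    return GRAPHQL_NAME.fullmatch(name) is not None


def to_rank_list(scores):
    """Turn {"1": {...}, "2": {...}} into a list ordered by numeric key.

    Each entry gets a hypothetical `rank` field so the key is not lost.
    """
    ordered = sorted(scores.items(), key=lambda item: int(item[0]))
    return [{"rank": int(key), **value} for key, value in ordered]


scores = {
    "1": {"average": 40, "count": 15},
    "2": {"average": 11, "count": 2},
    "3": {"average": 30, "count": 2},
}
print([name for name in ["1", "_1", "two", "field3"] if is_valid_graphql_name(name)])
print(to_rank_list(scores))
```

With the array shape, the schema could expose a list of entries with rank, average, and count fields, none of which need a numeric field name.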

elasticsearch - comprehensive list of distinct values

I want to find all distinct values of a term over a time range.
Example data:
[
  {
    'a': 123,
    'b': 456,
    'user': 'bob',
  },
  {
    ...
    'user': 'sally',
    ...
  },
  {
    ...
    'user': 'bob',
    ...
  },
  {
    'x': 2,
    'y': 3,
  }
]
In this case I want to find all distinct values of user.
Note that some users will appear in multiple records, and not all records have a user.
Also, my requirement is that the list of returned users MUST be comprehensive (ie. if there exists a record with a certain user, then that user MUST appear in the list of results).
Having the number of occurrences of each user would be nice too, but not required.
I considered Cardinality Aggregations but I'm concerned about the 'approximate' nature of the results. I need a comprehensive list of users.
How can I do this in elasticsearch?
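As mentioned in the comments, a terms aggregation is what you are looking for. Its results are approximate only if you ask for the N most common terms and the data is split across multiple shards.
In older Elasticsearch versions (before 5.0) you can set size to zero to get "unlimited" (Integer.MAX_VALUE) results. Later versions no longer accept size 0 on a terms aggregation, so set size to a value larger than the number of distinct users instead.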
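For illustration, here is a Python sketch of a request body for such a terms aggregation over a time range, together with a helper that reads the buckets and refuses truncated results. The @timestamp field name, the max_users default, and the function names are assumptions; user is assumed to be a keyword field.

```python
def distinct_users_request(start, end, time_field="@timestamp", max_users=10000):
    """Build a search body that aggregates all distinct users in [start, end)."""
    return {
        "size": 0,  # top-level size: return no hits, only the aggregation
        "query": {"range": {time_field: {"gte": start, "lt": end}}},
        "aggs": {"users": {"terms": {"field": "user", "size": max_users}}},
    }


def extract_users(response):
    """Return {user: doc_count} from a terms aggregation response.

    sum_other_doc_count > 0 means some buckets were cut off, so the
    list would not be comprehensive.
    """
    agg = response["aggregations"]["users"]
    if agg.get("sum_other_doc_count", 0) > 0:
        raise ValueError("terms aggregation was truncated; increase max_users")
    return {bucket["key"]: bucket["doc_count"] for bucket in agg["buckets"]}
```

Records without a user field simply do not contribute a bucket, and the doc_count of each bucket gives the number of occurrences per user.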
