ElasticSearch, how to search for a document containing a specific array element

I am having a little problem with elasticsearch and wonder if someone can help me solve it.
I have a document containing an array of tuples (publications).
Something like:
{
  ....
  "publications": [
    { "item1": 385294, "item2": 11 },
    { "item1": 395078, "item2": 1 }
  ]
  ....
}
The problem I have is retrieving documents that contain a specific tuple, for example (item1 = 395078 AND item2 = 1).
Whatever I try, Elasticsearch seems to treat item1 and item2 separately; I cannot express that item1 and item2 must have specific values inside the same tuple rather than across the whole array...
Is there something I'm missing here?
Thanks

This is not possible in a straightforward way: Elasticsearch flattens arrays of objects before checking any condition.
This means a query for a=x AND b=y1 matches [{a=x, b=y}, {a=x1, b=y1}], which would not match under conventional per-element array checking.
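Concretely, the publications array from the question is indexed internally as flat, independent value lists, so the pairing between item1 and item2 inside each tuple is lost:
{
  "publications.item1": [385294, 395078],
  "publications.item2": [11, 1]
}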
What you can do here is one of the following:
Use the nested type - https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html (but for each element in the array, an extra document would be created); a mapping and query sketch for this follows below.
Or store the array as
publications: [
  { "385294": 11 },
  { "395078": 1 }
]
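For the nested option, here is a minimal sketch; the index name my_index is a placeholder, and the field names come from the question:

PUT my_index
{
  "mappings": {
    "properties": {
      "publications": { "type": "nested" }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "publications",
      "query": {
        "bool": {
          "must": [
            { "term": { "publications.item1": 395078 } },
            { "term": { "publications.item2": 1 } }
          ]
        }
      }
    }
  }
}

With a nested mapping, each element of publications is matched as its own hidden document, so both term clauses must hold inside the same tuple.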

Related

FaunaDB search document and get its ranking based on a score

I have the following Collection of documents with structure:
type Streak struct {
    UserID    string    `fauna:"user_id"`
    Username  string    `fauna:"username"`
    Count     int       `fauna:"count"`
    UpdatedAt time.Time `fauna:"updated_at"`
    CreatedAt time.Time `fauna:"created_at"`
}
This looks like the following in FaunaDB Collections:
{
  "ref": Ref(Collection("streaks"), "288597420809388544"),
  "ts": 1611486798180000,
  "data": {
    "count": 1,
    "updated_at": Time("2021-01-24T11:13:17.859483176Z"),
    "user_id": "276989300",
    "username": "yodanparry"
  }
}
Basically I need a lambda or a function that takes in a user_id and spits out its rank within the collection, where rank is simply the position when the collection is sorted by the count field. For example, let's say I have the following documents (I ignored other fields for simplicity):
user_id | count
--------+------
abc     | 12
xyz     | 10
fgh     | 999
If I throw in fgh as an input for this lambda function, I want it to spit out 1 (or 0 if you start counting from 0).
I already have an index for user_id, so I can query and match a document reference from this index. I also have an index sorted_count that sorts documents by the count field in ascending order.
My current solution is to query all documents via the sorted_count index and then get the rank by iterating through the array. I think there should be a better solution for this; I'm just not seeing it.
Please help. Thank you!
Counting things in Fauna isn't as easy as one might expect. But you might still be able to do something more efficient than you describe.
Assuming you have:
CreateIndex({
  name: "sorted_count",
  source: Collection("streaks"),
  values: [
    { field: ["data", "count"] }
  ]
})
Then you can query this index like so:
Count(
  Paginate(
    Match(Index("sorted_count")),
    { after: 10, size: 100000 }
  )
)
Which will return an object like this one:
{
  before: [10],
  data: [123]
}
Which tells you that there are 123 documents with count >= 10, which I think is what you want.
This means that, in order to get a user's rank based on their user_id, you'll need to implement the following two-step process (a combined sketch follows below):
1. Determine the count of the user in question using your index on user_id.
2. Query sorted_count using the user's count as described above.
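Here is a minimal FQL sketch of both steps combined; the index name streaks_by_user_id and the example user_id are assumptions for illustration, not something from your schema:

Let(
  {
    user: Get(Match(Index("streaks_by_user_id"), "276989300")),
    count: Select(["data", "count"], Var("user"))
  },
  Count(
    Paginate(
      Match(Index("sorted_count")),
      { after: Var("count"), size: 100000 }
    )
  )
)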
Note that, in case your collection has more than 100,000 documents, you'll need your Go code to iterate through all the pages based on the returned object's after field. 100,000 is Fauna's maximum allowed page size. See the Fauna docs on pagination for details.
Also note that this might not reflect whatever your desired logic is for resolving ties.

Elastic Ingest Pipeline split field and create a nested field

Dear friendly helpers,
I have an index that is fed by a database via Kafka. This database holds a field that aggregates a couple of pieces of information in the form key/value; key/value; (don't ask for the reason, I have no idea who designed it like that and why ;-) )
93/4; 34/12;
It can be empty, or it can hold 1..n key/value pairs.
I want to use an ingest pipeline and ideally have a "nested" field which holds all values that are in that field.
Probably like this:
{
  "categories": {
    "93": 7,
    "82": 4
  }
}
The use case is the following: we want to visualize the sum of a filtered set of these categories (they tell me how many minutes a specific process took longer) and relate them in ranges.
Example: I filter categories x, y, z and then group how many documents for the day had no delay, which had a delay of up to 5 minutes, and which had a delay between 5 and 15 minutes.
I have tried to get the fields neatly separated with the kv processor and wanted to work from there, but I guess it was a completely wrong approach.
"kv": {
"field": "IncomingField",
"field_split": ";",
"value_split": "/",
"target_field": "delays",
"ignore_missing": true,
"trim_key": "\\s",
"trim_value": "\\s",
"ignore_failure": true
}
When I test the pipeline, the result seems OK:
"delays": {
  "62": "3",
  "86": "2"
}
but there are two things that don't work.
I can't know upfront how many of these combinations I have, and thus converting the values from string to int in the same pipeline is an issue.
When I want to create a Kibana index pattern I end up with many fields like delays.82 and delays.82.keyword, which does not make sense at all for the use case, as I can't filter (get only the sum of delays where the key is one of x, y, z) and aggregate.
I have looked into other processors (dot_expander) but can't really get my head around how to get this working.
I hope my question is clear (I lack English skills, sorry) and that someone can point me in the right direction.
Thank you very much!
You should rather structure them as an array of objects with shared accessors, for instance:
[ {"key": 93, "value": 7}, ... ]
That way, you'll be able to aggregate on categories.key and categories.value (see the mapping and aggregation sketches at the end of this answer).
So this means iterating over the categories' entrySet() using a custom script processor, like so:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "extracts k/v pairs",
    "processors": [
      {
        "script": {
          "source": """
            def categories = ctx.categories;
            def kv_pairs = new ArrayList();
            for (def pair : categories.entrySet()) {
              def k = pair.getKey();
              def v = pair.getValue();
              kv_pairs.add(["key": k, "value": v]);
            }
            ctx.categories = kv_pairs;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "categories": {
          "82": 4,
          "93": 7
        }
      }
    }
  ]
}
P.S.: Do make sure your categories field is mapped as nested, because otherwise you'll lose the connections between the keys and the values (also called flattening).
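As a minimal sketch of what that could look like (the index name delays_index and the selected keys are placeholders, not from the question):

PUT delays_index
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "nested",
        "properties": {
          "key": { "type": "keyword" },
          "value": { "type": "integer" }
        }
      }
    }
  }
}

And an aggregation that sums the values of a filtered set of category keys might then look like:

GET delays_index/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "nested": { "path": "categories" },
      "aggs": {
        "selected": {
          "filter": { "terms": { "categories.key": ["93", "82"] } },
          "aggs": {
            "total_delay": { "sum": { "field": "categories.value" } }
          }
        }
      }
    }
  }
}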

Coupling in Elastic Search

We are thinking of doing coupling of the items stored in Elasticsearch. While indexing, we index the coupling information of the item in the item doc. Is there a way to query Elasticsearch so that the coupled items come together in the result?
For example:
item1 = {
  ...
  coupled_item: item2
  ...
}
item2 = {
  ...
  coupled_item: item1
  ...
}
query_result = [item3, item6, item1, item2, item4, item5]
Approach 1
One approach we considered was to add a score key to the item doc, set the scores of the coupled items equal, and then sort by that score while querying.
Cons
We are already sorting using this technique; we do not want to disturb that order, we just want to move the coupled item from its place to right below its partner item.
Approach 2
The other approach we considered was to query all the items from ES and then handle the coupling in our code.
Cons
This is not an optimal solution, plus we would also need to handle pagination ourselves in this case.
Is there a feature provided by Elasticsearch to handle coupling internally? If not, is there any other way we can handle this?
Coupled (dependent) documents can be duplicated inside each other.
item1 = {
  id: item1,
  coupled_item: {
    id: item2
  }
  ...
}
item2 = {
  id: item2,
  coupled_item: {
    id: item1
  }
  ...
}
es_query_result = [item1 {item2}, item3 {item4}, item5 {item6}]
application_flattened_result = [item1, item2, item3, item4, item5, item6]
This will add some challenges to document writes, because now two documents will get updated on every change (see the bulk sketch below), but it can be done. Additionally, pagination of the search query can get tricky as well.
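As a minimal sketch of the duplicated writes (the items index name and document IDs are placeholders), the two documents can at least be written in a single bulk request:

POST items/_bulk
{ "index": { "_id": "item1" } }
{ "id": "item1", "coupled_item": { "id": "item2" } }
{ "index": { "_id": "item2" } }
{ "id": "item2", "coupled_item": { "id": "item1" } }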
The only internally supported feature I know of that comes close enough is the nested type - https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html

RethinkDB: how to append to an array in a nested structure

I am trying to append to an array in a nested field, which I have to find based on runtime information.
Here's an example:
r.db("test")
.table("test")
.insert({ "stock": [{ "bin":"abc", "entries":[{ "state":1 }] }] })
The idea is that the document contains a "stock" key, which is an array of multiple "storage bins". Each bin has a name and a number of entries. I need to be able to append to entries in one of the bins, atomically, without affecting other bins.
I tried this approach:
r.db("test")
.table("test")
.update(function(item) {
return {"stock": item("stock")
.filter({ "bin": "abc" })
.append({ "state":42 })
}
})
…but that does not append at the right level (it appends the new entry as a sibling of the bin rather than inside its entries), and I am not certain it will preserve existing bins with names other than "abc".
When updating an element of an array, you should use changeAt with an index, or map over the array, instead of using filter (a changeAt variant is sketched after the map example below).
Here is what that query might look like:
r.table("test")
  .update(function(item) {
    return {
      "stock": item("stock").map(function(stock) {
        // branch: only the bin named "abc" gets a new entry, all others pass through unchanged
        return r.branch(
          stock("bin").eq("abc"),
          stock.merge({ "entries": stock("entries").append({ "state": 42 }) }),
          stock)
      })
    }
  })
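For completeness, here is a sketch of the changeAt variant mentioned above; it assumes a bin named "abc" exists and uses offsetsOf to find its index:

r.table("test")
  .update(function(item) {
    // offsetsOf returns the positions of matching elements; take the first one
    var idx = item("stock").offsetsOf(function(s) { return s("bin").eq("abc") })(0);
    return {
      "stock": item("stock").changeAt(
        idx,
        item("stock")(idx).merge({
          "entries": item("stock")(idx)("entries").append({ "state": 42 })
        })
      )
    }
  })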
Alternatively, if you stored your bins in an object keyed by bin name instead of an array, like this:
{ "stock": { "abc": { "entries": [{ "state": 1 }] } } }
the update query might look like this:
r.table("test")
  .filter(r.row.hasFields({ "stock": { "abc": true } }))
  .update({ "stock": { "abc": { "entries": r.row("stock")("abc")("entries").append({ "state": 42 }) } } })

Sorting by value in multivalued field in elasticsearch

I have a multivalue field with integers in the document, for example
{
  "values": [1, 2, 3, 4, 5]
}
I apply a range filter, for example from 2 to 4, and get a list of documents whose values contain 2, 3, or 4 (a sketch of such a filter follows).
Now I'd like to sort the results so that documents containing 3 come first.
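For context, such a range filter might look like this (a sketch; the field name comes from the example above):
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "values": { "gte": 2, "lte": 4 }
        }
      }
    }
  }
}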
I could do it using script sorting:
{
  "sort": {
    "_script": {
      "script": "doc['values'].getValues().contains(3) ? 0 : 1",
      "type": "number"
    }
  }
}
But I don't like its performance, because getValues() actually returns a List, and its contains method is O(n).
Are there any better ways?
