Skip duplicates on a field in an Elasticsearch search result

Is it possible to remove duplicates on a given field?
For example the following query:
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"_source": [
"name_admin",
"parent_sku",
"sku"
],
"size": 2
}
is retrieving
"hits" : [
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central30603",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816401",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
},
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central156578",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816395",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
}
]
I'd like to skip duplicates on parent_sku so that I get only one result per parent_sku, the way suggesters allow with "skip_duplicates": true.
I know I could achieve this with an aggregation, but I'd like to stick with a search, as my query is a bit more complicated and I'm using the scroll API, which doesn't work with aggregations.

Field collapsing should help here
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"collapse" : {
"field" : "parent_sku",
"inner_hits": {
"name": "parent",
"size": 1
}
},
"_source": false,
"size": 2
}
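The above query will return one document per parent_sku. Note that the Elasticsearch documentation states that collapse cannot be used together with scroll, so paging through collapsed results would have to rely on another mechanism.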
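For cases where collapsing on the server is not an option (for instance while paging through results with scroll), duplicates can also be dropped on the client. The following is only a minimal sketch: skip_duplicates is a hypothetical helper, and it assumes that parent_sku is returned in _source (so "_source": false would not work here) and that hits arrive in the order you want to keep.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def skip_duplicates(hits: Iterable[Dict[str, Any]], field: str) -> Iterator[Dict[str, Any]]:
    """Yield only the first hit seen for each distinct value of `field`."""
    seen: Set[Any] = set()
    for hit in hits:
        # Hits without the field all share the key None, so only one is kept.
        key = hit["_source"].get(field)
        if key in seen:
            continue  # a hit with this value was already yielded
        seen.add(key)
        yield hit


# The two hits from the question, trimmed to the relevant fields.
hits = [
    {"_id": "central30603", "_source": {"parent_sku": "SSP57", "sku": "SSP57816401"}},
    {"_id": "central156578", "_source": {"parent_sku": "SSP57", "sku": "SSP57816395"}},
]
unique_ids = [hit["_id"] for hit in skip_duplicates(hits, "parent_sku")]
print(unique_ids)
```

The set of seen values has to be kept across pages, so memory grows with the number of distinct parent_sku values.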


Adding a new document to a separate index using Elasticsearch processors

Is there a way to populate a separate index when I index some document(s)?
Let's assume I have something like:
PUT person/_doc/1
{
"name": "Jonh Doe",
"languages": ["english", "spanish"]
}
PUT person/_doc/2
{
"name": "Jane Doe",
"languages": ["english", "russian"]
}
What I want is that every time a person is added, a language is added to a language index.
Something like:
GET languages/_search
would give:
...
"hits" : [
{
"_index" : "languages",
"_type" : "doc",
"_id" : "russian",
"_score" : 1.0,
"_source" : {
"value" : "russian"
}
},
{
"_index" : "languages",
"_type" : "doc",
"_id" : "english",
"_score" : 1.0,
"_source" : {
"value" : "english"
}
},
{
"_index" : "languages",
"_type" : "doc",
"_id" : "spanish",
"_score" : 1.0,
"_source" : {
"value" : "spanish"
}
}
...
I was thinking of pipelines, but I don't see any processor that allows such a thing.
Maybe the answer is to create a custom processor. I have one already, but I'm not sure how I could insert a document into a separate index there.
Update: Using transforms as described in @Val's answer works, and seems to be the right answer indeed...
However, I am using Open Distro for Elasticsearch and transforms are not available there. Some alternative solution that works there would be greatly appreciated :)
Update 2: Looks like OpenSearch is replacing Open Distro for Elasticsearch, and it has a transform API \o/
A document entering an ingest pipeline cannot be cloned or split, as is possible in Logstash for instance. So you cannot index two documents from a single one.
However, just after indexing your person documents, it's definitely possible to hit the _transform API endpoint and create the languages index from the person one:
First create the transform:
PUT _transform/languages-transform
{
"source": {
"index": "person"
},
"pivot": {
"group_by": {
"language": {
"terms": {
"field": "languages.keyword"
}
}
},
"aggregations": {
"count": {
"value_count": {
"field": "languages.keyword"
}
}
}
},
"dest": {
"index": "languages",
"pipeline": "set-id"
}
}
You also need to create the pipeline that will set the proper ID for your language documents:
PUT _ingest/pipeline/set-id
{
"processors": [
{
"set": {
"field": "_id",
"value": "{{language}}"
}
}
]
}
Then, you can start the transform:
POST _transform/languages-transform/_start
And when it's done you'll have a new index called languages whose content is
GET languages/_search
=>
"hits" : [
{
"_index" : "languages",
"_type" : "_doc",
"_id" : "english",
"_score" : 1.0,
"_source" : {
"count" : 4,
"language" : "english"
}
},
{
"_index" : "languages",
"_type" : "_doc",
"_id" : "russian",
"_score" : 1.0,
"_source" : {
"count" : 2,
"language" : "russian"
}
},
{
"_index" : "languages",
"_type" : "_doc",
"_id" : "spanish",
"_score" : 1.0,
"_source" : {
"count" : 2,
"language" : "spanish"
}
}
]
Note that you can also run that transform on a schedule so that it executes regularly, or you can run it manually whenever it suits you, to rebuild the languages index.
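To see what the pivot computes, here is a small local simulation in Python. It reflects my reading of how a terms group_by and a value_count aggregation interact on a multi-valued field, not the actual transform implementation, and pivot_languages is a hypothetical name. Under that reading, each bucket counts every languages value of the documents it contains, which would explain why english shows a count of 4 with only two person documents.

```python
from collections import defaultdict
from typing import Dict, List


def pivot_languages(people: List[dict]) -> Dict[str, int]:
    """Group by each language value and count the language values per bucket."""
    counts: Dict[str, int] = defaultdict(int)
    for person in people:
        languages = person.get("languages", [])
        for language in set(languages):
            # value_count counts every value of the field in the documents
            # of the bucket, not only the value the bucket is keyed on.
            counts[language] += len(languages)
    return dict(counts)


# The two person documents from the question.
people = [
    {"name": "Jonh Doe", "languages": ["english", "spanish"]},
    {"name": "Jane Doe", "languages": ["english", "russian"]},
]
result = pivot_languages(people)
print(result)
```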
OpenSearch has its own _transform API. It works slightly differently; the transform can be created this way:
PUT _plugins/_transform/languages-transform
{
"transform": {
"enabled": true,
"description": "Insert languages",
"schedule": {
"interval": {
"period": 1,
"unit": "minutes"
}
},
"source_index": "person",
"target_index": "languages",
"data_selection_query": {
"match_all": {}
},
"page_size": 1,
"groups": [{
"terms": {
"source_field": "languages.keyword",
"target_field": "value"
}
}]
}
}
You just need to change the _index field in the ingest pipeline:
{
"description" : "sends the document to the languages index",
"processors": [
{
"set": {
"if": "[*your condition here*]",
"field": "_index",
"value": "languages",
"override": true
}
}
]
}

Elasticsearch dis_max query, return exact matching query

Let's assume I want to perform this query:
GET /_search
{
"query": {
"dis_max" : {
"queries" : [
{ "term" : { "title" : "Quick pets" }},
{ "term" : { "body" : "Quick pets" }}
],
"tie_breaker" : 0.7
}
}
}
According to the Elasticsearch documentation, this query returns a list of documents with the highest relevance score from any matching clause.
But how can I determine which underlying query caused a document to appear in the result list? How can I tell whether a result appears because of query 1 or query 2 in the list of queries? Can I somehow return this for each result document?
You can use named queries: give each clause a _name, and every hit lists the names of the clauses it matched under matched_queries.
Query:
{
"query": {
"dis_max": {
"queries": [
{
"term": {
"title.keyword": {
"value": "Quick pets",
"_name": "title"
}
}
},
{
"term": {
"body.keyword": {
"value": "Quick pets",
"_name": "body"
}
}
}
],
"tie_breaker": 0.7
}
}
}
Result:
"hits" : [
{
"_index" : "index55",
"_type" : "_doc",
"_id" : "mAGWNXIBrjSHR7JVvY4C",
"_score" : 0.6931471,
"_source" : {
"title" : "Quick pets"
},
"matched_queries" : [
"title"
]
},
{
"_index" : "index55",
"_type" : "_doc",
"_id" : "mQGXNXIBrjSHR7JVGI4E",
"_score" : 0.2876821,
"_source" : {
"title" : "ddd",
"body" : "Quick pets"
},
"matched_queries" : [
"body"
]
}
]
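On the client side, the matched_queries array can then be read per hit. A minimal Python sketch, assuming the hits have already been parsed into dictionaries; ids_by_matched_query is a hypothetical helper, not part of any Elasticsearch client:

```python
from collections import defaultdict
from typing import Any, Dict, List


def ids_by_matched_query(hits: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each query name to the ids of the hits that matched it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for hit in hits:
        # A hit can match several named clauses, or none at all.
        for name in hit.get("matched_queries", []):
            grouped[name].append(hit["_id"])
    return dict(grouped)


# The two hits from the result above, trimmed to the relevant fields.
hits = [
    {"_id": "mAGWNXIBrjSHR7JVvY4C", "matched_queries": ["title"]},
    {"_id": "mQGXNXIBrjSHR7JVGI4E", "matched_queries": ["body"]},
]
by_query = ids_by_matched_query(hits)
print(by_query)
```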
To get a breakdown of how the score was computed you can add the "explain": true option in the request body.
That will give you the full explanation of which clause accounted for which score.
Don't forget that dis_max returns a score equal to that of the highest scoring clause plus tie_breaker times the sum of the other clauses' scores.
Official ES documentation for explain here.
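The scoring rule can be written out as a small Python function. This is only an illustration of the formula described above, not Elasticsearch code; dis_max_score is a hypothetical name and the clause scores in the example are made up.

```python
from typing import Sequence


def dis_max_score(clause_scores: Sequence[float], tie_breaker: float = 0.0) -> float:
    """Best clause score plus tie_breaker times the sum of the remaining scores."""
    if not clause_scores:
        return 0.0  # no clause matched
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others


# A document matching both clauses with scores 1.0 and 0.5.
both = dis_max_score([1.0, 0.5], tie_breaker=0.7)
# A document matching a single clause: the tie_breaker has no effect.
single = dis_max_score([0.5], tie_breaker=0.7)
print(both, single)
```

With tie_breaker set to 0 only the best clause counts; with 1 the scores are simply summed.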

Bool Filter not showing just the filtered data in Elastic Search

I have an index "tag_nested" which holds data of the following form:
{
"jobid": 1,
"table_name": "table_A",
"Tags": [
{
"TagType": "WorkType",
"Tag": "ETL"
},
{
"TagType": "Subject Area",
"Tag": "Telecom"
}
]
}
I filter the data on "Tag" and "TagType" with the following query:
POST /tag_nested/_search
{
"query": {
"bool": {
"must": {"match_all": {}},
"filter": [
{"term": {
"Tags.Tag.keyword": "ETL"
}},
{"term": {
"Tags.TagType.keyword": "WorkType"
}}
]
}
}
}
It gives me the following output. The problem is that, while the query correctly filters out documents that don't have the requested tag, it returns all the "Tags" of each matching document instead of just the one I filtered on:
{
"_index" : "tag_nested",
"_type" : "_doc",
"_id" : "9",
"_score" : 1.0,
"_source" : {
"jobid" : 1,
"table_name" : "table_A",
"Tags" : [
{
"TagType" : "WorkType",
"Tag" : "ETL"
},
{
"TagType" : "Subject Area",
"Tag" : "Telecom"
}
]
}
}
Instead of the above result, I want my output to look like this:
{
"_index" : "tag_nested",
"_type" : "_doc",
"_id" : "9",
"_score" : 1.0,
"_source" : {
"jobid" : 1,
"table_name" : "table_A",
"Tags" : [
{
"TagType" : "WorkType",
"Tag" : "ETL"
}
]
}
}
Already answered here, here and here.
TL;DR you'll need to make your Tags field of type nested, resync your index & use inner_hits to only fetch the applicable tag group.
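A rough sketch of what that could look like, written as Python dictionaries that can be serialized and sent as request bodies. The function names are hypothetical, and the sketch assumes that only Tags is mapped explicitly, so the sub-fields keep their dynamically mapped .keyword variants as in the question. With this query, the matching tag objects are expected under each hit's inner_hits section rather than in _source.

```python
import json
from typing import Any, Dict


def nested_tag_mapping() -> Dict[str, Any]:
    """Index body that declares Tags as a nested field."""
    return {"mappings": {"properties": {"Tags": {"type": "nested"}}}}


def nested_tag_query(tag: str, tag_type: str) -> Dict[str, Any]:
    """Search body matching Tag and TagType within the same nested object."""
    return {
        # Top-level fields only; the matching tags come back via inner_hits.
        "_source": ["jobid", "table_name"],
        "query": {
            "nested": {
                "path": "Tags",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"Tags.Tag.keyword": tag}},
                            {"term": {"Tags.TagType.keyword": tag_type}},
                        ]
                    }
                },
                "inner_hits": {},
            }
        },
    }


mapping_body = nested_tag_mapping()
query_body = nested_tag_query("ETL", "WorkType")
print(json.dumps(query_body, indent=2))
```

Because the mapping type of an existing field cannot be changed in place, the data has to be reindexed into an index created with the new mapping.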

How to get results weighted by references from ElasticSearch

I have a dataset consisting of Notes referencing other Notes.
{id:"aaa", note: "lorem ipsum", references: ["ccc"]},
{id:"bbb", note: "lorem ipsum", references: ["aaa","ccc"]},
{id:"ccc", note: "lorem ipsum", references: ["bbb"]},
I want Elasticsearch to use the references to weight the results, so in this case, if I search for lorem, I should get id "ccc" back first since it is referenced the most. According to their docs, their graph solution does exactly this, but I also see examples where they are doing similar things.
But no explanation of how this is mapped to the Index. So my question is: how does one set up an ES index that uses references (indices)?
Other answers gave some clues, but then #7379490 provided the answer in another channel:
There is no way of doing this directly in ES. There are two possible solutions:
Precalculate the reference counts and pass them into ES as a new field on each document.
Or aggregate on the references and use the aggregation to sort the response:
{
"query": {
"function_score": {
"query": {
"match": {
"note": "lorem"
}
}
}
},
"aggs": {
"references": {
"terms": {
"field": "references.keyword",
"order": { "_count": "desc" }
}
}
}
}
function_score can be used to give a higher score to documents with more values in the references field.
Mapping:
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"note": {
"type": "text"
},
"references": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Query (the script multiplies the score by the number of values in references):
{
"query": {
"function_score": {
"query": {
"match": {
"note": "lorem"
}
},
"functions": [
{
"script_score": {
"script": "_score * doc['references.keyword'].length"
}
}
]
}
}
}
Result:
"hits" : [
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "iJObRHIBXTOhZcHPaYks",
"_score" : 0.035661265,
"_source" : {
"id" : 2,
"note" : "lorem ipsum",
"references" : [
1,
3
]
}
},
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "h5ObRHIBXTOhZcHPYok2",
"_score" : 0.017830633,
"_source" : {
"id" : 1,
"note" : "lorem ipsum",
"references" : [
3
]
}
},
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "iZObRHIBXTOhZcHPb4k4",
"_score" : 0.017830633,
"_source" : {
"id" : 3,
"note" : "lorem ipsum",
"references" : [
2
]
}
}
]
If your data contain referencesGiven and you need to search by referencesReceived, I would recommend a two-pass approach:
1. Insert all documents with an empty referencesReceived: [] field (or referencesReceivedCount: 0 if that is enough).
2. For each document, for each item in referencesGiven, update the document receiving the reference.
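The counting step can be sketched in plain Python before anything is sent to Elasticsearch. This is a minimal illustration using the three notes from the question; count_references_received is a hypothetical helper, and the resulting counts would still have to be written back to the documents, for example as a referencesReceivedCount field.

```python
from typing import Dict, List


def count_references_received(notes: List[dict]) -> Dict[str, int]:
    """Count, for every note id, how many times other notes reference it."""
    # First pass: every known note starts with zero received references.
    received: Dict[str, int] = {note["id"]: 0 for note in notes}
    # Second pass: credit the target of each outgoing reference.
    for note in notes:
        for target in note.get("references", []):
            received[target] = received.get(target, 0) + 1
    return received


notes = [
    {"id": "aaa", "note": "lorem ipsum", "references": ["ccc"]},
    {"id": "bbb", "note": "lorem ipsum", "references": ["aaa", "ccc"]},
    {"id": "ccc", "note": "lorem ipsum", "references": ["bbb"]},
]
received_counts = count_references_received(notes)
print(received_counts)
```

Here "ccc" ends up with the highest count, which matches the ranking the question asks for.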

Returning all documents when query string is empty

Say I have the following mapping:
{
'properties': {
'title': {'type': 'text'},
'created': {'type': 'text'}
}
}
Sometimes the user will query by created, and sometimes by title and created. In both cases I want the query JSON to be as similar as possible. What's a good way to create a query that filters only by created when the user is not using the title to query?
I tried something like:
{
bool: {
must: [
{range: {created: {gte: '2010-01-01'}}},
{query: {match_all: {}}}
]
}
}
But that didn't work. What would be the best way of writing this query?
Your query didn't work because created is of type text and not date. Range queries on string dates will not work as expected, so you should change your mapping from type text to date and reindex your data.
Follow this to reindex your data (with the new mappings) step by step.
Now, if I understand correctly, you want a generic query that filters on title and/or created depending on the user input.
In this case, my suggestion is to use a query_string query.
An example (version 7.4.x):
Mappings (created is now of type date instead of text)
PUT my_index
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"created": {
"type": "date"
}
}
}
}
Index a few documents
PUT my_index/_doc/1
{
"title":"test1",
"created": "2010-01-01"
}
PUT my_index/_doc/2
{
"title":"test2",
"created": "2010-02-01"
}
PUT my_index/_doc/3
{
"title":"test3",
"created": "2010-03-01"
}
Search Query (created)
GET my_index/_search
{
"query": {
"query_string": {
"query": "created:>=2010-02-01",
"fields" : ["created"]
}
}
}
Results
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"title" : "test2",
"created" : "2010-02-01"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"title" : "test3",
"created" : "2010-03-01"
}
}]
Search Query (title)
GET my_index/_search
{
"query": {
"query_string": {
"query": "test2",
"fields" : ["title"]
}
}
}
Results
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9808292,
"_source" : {
"title" : "test2",
"created" : "2010-02-01"
}
}
]
Search Query (title and created)
GET my_index/_search
{
"query": {
"query_string": {
"query": "(created:>=2010-02-01) AND test3"
}
}
}
Results
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.9808292,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.9808292,
"_source" : {
"title" : "test3",
"created" : "2010-03-01"
}
}
]
About fields in query_string: you can list both fields there. If you omit fields, the query applies to all fields in your mappings.
Hope this helps
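If you would rather keep the bool structure from the question, the request body can also be assembled in application code so that its shape stays the same whether or not a title is given. The following Python sketch is one possible way to do it, not the only one; build_search_body is a hypothetical helper, it assumes created is mapped as date, and it places match_all directly in must (without an extra query wrapper).

```python
from typing import Any, Dict, Optional


def build_search_body(created_from: str, title: Optional[str] = None) -> Dict[str, Any]:
    """Build a bool query that always filters on created and optionally matches title."""
    if title:
        must = [{"match": {"title": title}}]
    else:
        # No title given: match every document and let the filter narrow it down.
        must = [{"match_all": {}}]
    return {
        "query": {
            "bool": {
                "must": must,
                "filter": [{"range": {"created": {"gte": created_from}}}],
            }
        }
    }


only_created = build_search_body("2010-02-01")
title_and_created = build_search_body("2010-02-01", title="test3")
print(only_created)
print(title_and_created)
```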
