Adding a new document to a separate index using Elasticsearch processors - elasticsearch

Is there a way to populate a separate index when I index some document(s)?
Let's assume I have something like:
PUT person/_doc/1
{
"name": "Jonh Doe",
"languages": ["english", "spanish"]
}
PUT person/_doc/2
{
"name": "Jane Doe",
"languages": ["english", "russian"]
}
What I want is that every time a person is added, a language is added to a language index.
Something like:
GET languages/_search
would give:
...
"hits" : [
{
"_index" : "languages",
"_type" : "doc",
"_id" : "russian",
"_score" : 1.0,
"_source" : {
"value" : "russian"
}
},
{
"_index" : "languages",
"_type" : "doc",
"_id" : "english",
"_score" : 1.0,
"_source" : {
"value" : "english"
}
},
{
"_index" : "languages",
"_type" : "doc",
"_id" : "spanish",
"_score" : 1.0,
"_source" : {
"value" : "spanish"
}
}
...
I was thinking of pipelines, but I don't see any processor that allows such a thing.
Maybe the answer is to create a custom processor. I have one already, but I'm not sure how I could insert a document into a separate index from there.
Update: Using transforms as described in @Val's answer works, and seems to be the right answer indeed...
However, I am using Open Distro for Elasticsearch and transforms are not available there. An alternative solution that works there would be greatly appreciated :)
Update 2: Looks like OpenSearch is replacing Open Distro for Elasticsearch. And there is a transform API \o/

A document entering an ingest pipeline cannot be cloned or split, as is possible in Logstash for instance. So from a single document, you cannot index two documents.
However, just after indexing your person documents, it is definitely possible to hit the _transform API endpoint and create the languages index from the person index.
First create the transform:
PUT _transform/languages-transform
{
"source": {
"index": "person"
},
"pivot": {
"group_by": {
"language": {
"terms": {
"field": "languages.keyword"
}
}
},
"aggregations": {
"count": {
"value_count": {
"field": "languages.keyword"
}
}
}
},
"dest": {
"index": "languages",
"pipeline": "set-id"
}
}
You also need to create the pipeline that will set the proper ID for your language documents:
PUT _ingest/pipeline/set-id
{
"processors": [
{
"set": {
"field": "_id",
"value": "{{language}}"
}
}
]
}
Then, you can start the transform:
POST _transform/languages-transform/_start
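The transform runs asynchronously; if you want to check on its progress, the transform stats endpoint shows how many documents have been processed:

```
GET _transform/languages-transform/_stats
```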
And when it's done you'll have a new index called languages whose content is
GET languages/_search
=>
"hits" : [
{
"_index" : "languages",
"_type" : "_doc",
"_id" : "english",
"_score" : 1.0,
"_source" : {
"count" : 4,
"language" : "english"
}
},
{
"_index" : "languages",
"_type" : "_doc",
"_id" : "russian",
"_score" : 1.0,
"_source" : {
"count" : 2,
"language" : "russian"
}
},
{
"_index" : "languages",
"_type" : "_doc",
"_id" : "spanish",
"_score" : 1.0,
"_source" : {
"count" : 2,
"language" : "spanish"
}
}
]
Note that you can also run that transform on a schedule so that it executes regularly, or run it manually whenever it suits you, to rebuild the languages index.
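For intuition, the pivot boils down to the following client-side sketch in plain Python, using the two example documents from the question (the counts will differ from the sample output above, which was produced from a larger index):

```python
# Client-side sketch of what the pivot transform computes: group every
# person's language terms and count occurrences per term, then key each
# resulting document by the language itself (the job of the set-id pipeline).
from collections import Counter

def pivot_languages(persons):
    """Return {language: doc} like the transform's group_by + value_count."""
    counts = Counter()
    for person in persons:
        counts.update(person.get("languages", []))
    return {lang: {"language": lang, "count": n} for lang, n in counts.items()}

persons = [
    {"name": "John Doe", "languages": ["english", "spanish"]},
    {"name": "Jane Doe", "languages": ["english", "russian"]},
]
docs = pivot_languages(persons)
# docs has one entry per distinct language, e.g. docs["english"]["count"] == 2
```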
OpenSearch has its own _transform API. It works slightly differently; the transform can be created this way:
PUT _plugins/_transform/languages-transform
{
"transform": {
"enabled": true,
"description": "Insert languages",
"schedule": {
"interval": {
"period": 1,
"unit": "minutes"
}
},
"source_index": "person",
"target_index": "languages",
"data_selection_query": {
"match_all": {}
},
"page_size": 1,
"groups": [{
"terms": {
"source_field": "languages.keyword",
"target_field": "value"
}
}]
}
}
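The OpenSearch job then runs on the schedule above. If you would rather trigger it by hand, the transform plugin also exposes start/stop endpoints (sketched here from the plugin's API; verify against your OpenSearch version):

```
POST _plugins/_transform/languages-transform/_start
```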

Alternatively, you can just set the _index field in your ingest pipeline to redirect documents to another index:
{
"description" : "routes matching documents to the languages index",
"set": {
"if": "[*your condition here*]",
"field": "_index",
"value": "languages",
"override": true
}
}

Related

How to Order Completion Suggester with Fuzziness

When using a Completion Suggester with Fuzziness defined, the ordering of suggestion results is alphabetical instead of most relevant. It seems that whatever fuzziness is set to is removed from the end of the search/query term. This is not what I expected from reading the Completion Suggester Fuzziness docs, which state:
Suggestions that share the longest prefix to the query prefix will be scored higher.
But that is not true. Here is a use case that proves this:
PUT test/
{
"mappings":{
"properties":{
"id":{
"type":"integer"
},
"title":{
"type":"keyword",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
POST test/_bulk
{ "index" : {"_id": "1"}}
{ "title": "HOLARAT" }
{ "index" : {"_id": "2"}}
{ "title": "HOLBROOK" }
{ "index" : {"_id": "3"}}
{ "title": "HOLCONNEN" }
{ "index" : {"_id": "4"}}
{ "title": "HOLDEN" }
{ "index" : {"_id": "5"}}
{ "title": "HOLLAND" }
The above creates an index and adds some data.
If a suggestion query is done on said data:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"title-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}
It returns the first 3 results in alphabetical order of the last matching character, instead of the longest prefix first (which would be HOLLAND):
{
...
"suggest" : {
"title-suggestion" : [
{
"text" : "HOLL",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "HOLARAT",
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.0,
"_source" : {
"title" : "HOLARAT"
}
},
{
"text" : "HOLBROOK",
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.0,
"_source" : {
"title" : "HOLBROOK"
}
},
{
"text" : "HOLCONNEN",
"_index" : "test",
"_type" : "_doc",
"_id" : "3",
"_score" : 3.0,
"_source" : {
"title" : "HOLCONNEN"
}
}
]
}
]
}
}
If the size param is removed, we can see that the score is the same for all entries, instead of the longest prefix being scored higher as stated.
With this being the case, how can results from Completion Suggesters with Fuzziness defined be ordered with the longest prefix at the top?
This has been reported in the past, and this behavior is actually by design.
What I usually do in this case is send two suggest queries (similar to what has been suggested here): one for exact match and another for fuzzy match. If the exact match contains a suggestion, I use it; otherwise I resort to the fuzzy ones.
With the suggest query below, you'll get HOLLAND as exact-suggestion and then the fuzzy matches in fuzzy-suggestion:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"fuzzy-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
},
"exact-suggestion": {
"completion": {
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}
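On the client side, the fallback logic is then just a few lines. A Python sketch (the input mimics the shape of the suggest section of the search response; the suggestion names come from the query above):

```python
# Prefer exact-match suggestions; fall back to fuzzy ones only when the
# exact suggester returned no options.
def pick_suggestions(suggest):
    exact = suggest["exact-suggestion"][0]["options"]
    fuzzy = suggest["fuzzy-suggestion"][0]["options"]
    return exact if exact else fuzzy

# Shape mimicking response["suggest"] for the query above
response_suggest = {
    "exact-suggestion": [{"options": [{"text": "HOLLAND"}]}],
    "fuzzy-suggestion": [{"options": [{"text": "HOLARAT"}, {"text": "HOLBROOK"}]}],
}
best = pick_suggestions(response_suggest)
# best contains the exact match HOLLAND, not the fuzzy alphabetical ones
```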

Skip duplicates on field in a Elasticsearch search result

Is it possible to remove duplicates on a given field?
For example the following query:
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"_source": [
"name_admin",
"parent_sku",
"sku"
],
"size": 2
}
is retrieving
"hits" : [
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central30603",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816401",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
},
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central156578",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816395",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
}
]
I'd like to skip duplicates on parent_sku so that I only have one result per parent_sku, similar to what suggesters allow with "skip_duplicates": true.
I know I could achieve this with an aggregation, but I'd like to stick with a search, as my query is a bit more complicated and I'm using the scroll API, which doesn't work with aggregations.
Field collapsing should help here
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"collapse" : {
"field" : "parent_sku",
"inner_hits": {
"name": "parent",
"size": 1
}
},
"_source": false,
"size": 2
}
The above query will return one document per parent_sku.
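For intuition, here is a client-side Python sketch of what collapse does for you server-side: keep only the first (best-ranked) hit per parent_sku, assuming hits arrive sorted by score as in a normal response:

```python
# Keep the first hit seen for each distinct value of `field`,
# preserving the original (score-sorted) order.
def collapse_by(hits, field):
    seen = set()
    out = []
    for hit in hits:
        key = hit["_source"][field]
        if key not in seen:
            seen.add(key)
            out.append(hit)
    return out

hits = [
    {"_id": "central30603", "_source": {"parent_sku": "SSP57"}},
    {"_id": "central156578", "_source": {"parent_sku": "SSP57"}},
]
unique = collapse_by(hits, "parent_sku")
# unique keeps only the first SSP57 hit
```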

Bool Filter not showing just the filtered data in Elastic Search

I have an index "tag_nested" which has data of following type :
{
"jobid": 1,
"table_name": "table_A",
"Tags": [
{
"TagType": "WorkType",
"Tag": "ETL"
},
{
"TagType": "Subject Area",
"Tag": "Telecom"
}
]
}
I fire the following query to filter data on "Tag" and "TagType":
POST /tag_nested/_search
{
"query": {
"bool": {
"must": {"match_all": {}},
"filter": [
{"term": {
"Tags.Tag.keyword": "ETL"
}},
{"term": {
"Tags.TagType.keyword": "WorkType"
}}
]
}
}
}
It gives me the following output. The problem I am facing is that while the query correctly filters out non-matching documents, it shows all the "Tags" of each matching document instead of just the filtered one:
{
"_index" : "tag_nested",
"_type" : "_doc",
"_id" : "9",
"_score" : 1.0,
"_source" : {
"jobid" : 1,
"table_name" : "table_A",
"Tags" : [
{
"TagType" : "WorkType",
"Tag" : "ETL"
},
{
"TagType" : "Subject Area",
"Tag" : "Telecom"
}
]
}
}
Instead of above result I want my output to be like :
{
"_index" : "tag_nested",
"_type" : "_doc",
"_id" : "9",
"_score" : 1.0,
"_source" : {
"jobid" : 1,
"table_name" : "table_A",
"Tags" : [
{
"TagType" : "WorkType",
"Tag" : "ETL"
}
]
}
}
This has already been answered here, here and here.
TL;DR: you'll need to make your Tags field of type nested, resync your index, and use inner_hits to fetch only the applicable tag group.
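A minimal sketch of that approach (the tag_nested_v2 index name is made up for the resync; field names come from the question):

```
PUT tag_nested_v2
{
  "mappings": {
    "properties": {
      "Tags": { "type": "nested" }
    }
  }
}

POST _reindex
{
  "source": { "index": "tag_nested" },
  "dest": { "index": "tag_nested_v2" }
}

POST tag_nested_v2/_search
{
  "_source": { "excludes": ["Tags"] },
  "query": {
    "nested": {
      "path": "Tags",
      "query": {
        "bool": {
          "filter": [
            { "term": { "Tags.Tag.keyword": "ETL" } },
            { "term": { "Tags.TagType.keyword": "WorkType" } }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}
```

The inner_hits section of each hit then contains only the nested tag object that matched both conditions.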

Returning all documents when query string is empty

Say I have the following mapping:
{
  "properties": {
    "title": {"type": "text"},
    "created": {"type": "text"}
  }
}
Sometimes the user will query by created, and sometimes by title and created. In both cases I want the query JSON to be as similar as possible. What's a good way to create a query that filters only by created when the user is not querying by title?
I tried something like:
{
bool: {
must: [
{range: {created: {gte: '2010-01-01'}}},
{query: {match_all: {}}}
]
}
}
But that didn't work. What would be the best way of writing this query?
Your query didn't work because created is of type text and not date; range queries on string dates will not work as expected. You should change your mapping from type text to date and reindex your data.
Follow this step-by-step guide to reindex your data with the new mappings.
Now, if I understand correctly, you want a generic query that filters on title and/or created depending on the user input.
In this case, my suggestion is to use Query String.
An example (version 7.4.x):
Mappings
PUT my_index
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"created": { -------> change type to date instead of text
"type": "date"
}
}
}
}
Index a few documents
PUT my_index/_doc/1
{
"title":"test1",
"created": "2010-01-01"
}
PUT my_index/_doc/2
{
"title":"test2",
"created": "2010-02-01"
}
PUT my_index/_doc/3
{
"title":"test3",
"created": "2010-03-01"
}
Search Query (created)
GET my_index/_search
{
"query": {
"query_string": {
"query": "created:>=2010-02-01",
"fields" : ["created"]
}
}
}
Results
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"title" : "test2",
"created" : "2010-02-01"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"title" : "test3",
"created" : "2010-03-01"
}
}]
Search Query (title)
GET my_index/_search
{
"query": {
"query_string": {
"query": "test2",
"fields" : ["title"]
}
}
}
Results
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9808292,
"_source" : {
"title" : "test2",
"created" : "2010-02-01"
}
}
]
Search Query (title and created)
GET my_index/_search
{
"query": {
"query_string": {
"query": "(created:>=2010-02-01) AND test3"
}
}
}
Results
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.9808292,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.9808292,
"_source" : {
"title" : "test3",
"created" : "2010-03-01"
}
}
]
About the fields parameter in query_string: you can mention both fields; if you omit fields, the query will apply to all fields in your mapping.
Hope this helps.
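If you build the query on the client, the symmetry the question asks for ("as similar as possible in both cases") can be kept with a tiny helper. A Python sketch, using the field names from the examples above:

```python
# Build a query_string query that always filters on `created` and
# optionally adds a title term, keeping the JSON shape identical
# in both cases.
def build_query(created_gte, title=None):
    parts = [f"created:>={created_gte}"]
    if title:
        parts.append(title)
    return {
        "query": {
            "query_string": {
                "query": " AND ".join(f"({p})" for p in parts)
            }
        }
    }

q = build_query("2010-02-01", "test3")
# q["query"]["query_string"]["query"] == "(created:>=2010-02-01) AND (test3)"
```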

Is there any way to add the field in document but hide it from _source, also document should be analysed and searchable

I want to add a field to the document that is searchable, but that does not appear under _source when we do a get/search.
I have tried the index and store options, but it's not achievable through them.
It's more like _all or copy_to, except that in my case the value is provided by me (not collected from other fields of the document).
I am looking for a mapping through which I can achieve the cases below.
When I put document :
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
and do search
GET my_index/_search
output should be
{
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
}
also when I do the below search
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
it should return
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
Simply use source filtering to exclude the content field:
GET my_index/_search
{
"_source": {
"excludes": [ "content" ]
},
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
We can achieve this using the mapping below (note: the _doc mapping type wrapper is for 6.x; omit it on 7.x+):
PUT my_index
{
"mappings": {
"_doc": {
"_source": {
"excludes": [
"content"
]
},
"properties": {
"title": {
"type": "text",
"store": true
},
"date": {
"type": "date",
"store": true
},
"content": {
"type": "text"
}
}
}
}
}
Add document :
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
When you run the query to search content on the field 'content' :
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
You will get the result with hits as below:
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"date" : "2015-01-01",
"title" : "Some short title"
}
}
]
}
It hides the content field. :)
Hence we achieved it with the mapping itself: you don't need to exclude the field from the query each time you make a get/search call.
More reading on the _source field:
https://www.elastic.co/guide/en/elasticsearch/reference/6.6/mapping-source-field.html
