How to turn an array of objects into an array of strings while reindexing in Elasticsearch?

Let's say the source index has a document like this:
{
"name":"John Doe",
"sport":[
{
"name":"surf",
"since":"2 years"
},
{
"name":"mountainbike",
"since":"4 years"
}
]
}
How can I discard the "since" information so that, once reindexed, the document contains only the sport names? Like this:
{
"name":"John Doe",
"sport":["surf","mountainbike"]
}
Note that it would be fine if the resulting field keeps the same name, but it's not mandatory.

I don't know which version of Elasticsearch you're using, but here is a solution based on ingest pipelines, introduced with ingest nodes in ES v5.0.
1) A script processor extracts the name from each sub-object and stores the values in another field (here, sports).
2) The previous sport field is then removed with a remove processor.
You can use the Simulate Pipeline API to test it:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "random description",
"processors": [
{
"script": {
"lang": "painless",
"source": "ctx.sports =[]; for (def item : ctx.sport) { ctx.sports.add(item.name) }"
}
},
{
"remove": {
"field": "sport"
}
}
]
},
"docs": [
{
"_index": "index",
"_type": "doc",
"_id": "id",
"_source": {
"name": "John Doe",
"sport": [
{
"name": "surf",
"since": "2 years"
},
{
"name": "mountainbike",
"since": "4 years"
}
]
}
}
]
}
which outputs the following result :
{
"docs": [
{
"doc": {
"_index": "index",
"_type": "doc",
"_id": "id",
"_source": {
"name": "John Doe",
"sports": [
"surf",
"mountainbike"
]
},
"_ingest": {
"timestamp": "2018-07-12T14:07:25.495Z"
}
}
}
]
}
There may be a better solution, as I haven't used pipelines much. Alternatively, you could do this with Logstash filters before sending the documents to your Elasticsearch cluster.
For more information about pipelines, take a look at the ingest node reference documentation.
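To sanity-check the transformation outside Elasticsearch, the two processors can be mirrored in plain Python. This is only a sketch: the function name flatten_sports is made up for illustration, and, unlike the Painless one-liner above (which assumes the sport field is present), it tolerates documents without that field.

```python
def flatten_sports(source):
    """Mirror the script + remove processors on a plain dict."""
    doc = dict(source)  # shallow copy so the input document is left untouched
    # Script processor: collect the "name" of every sub-object.
    # Assumption: a missing or empty "sport" field yields an empty list.
    doc["sports"] = [item["name"] for item in doc.get("sport") or []]
    # Remove processor: drop the original field.
    doc.pop("sport", None)
    return doc


source = {
    "name": "John Doe",
    "sport": [
        {"name": "surf", "since": "2 years"},
        {"name": "mountainbike", "since": "4 years"},
    ],
}
print(flatten_sports(source))
```

If the pipeline has to cope with documents that lack the sport field, the script would need an equivalent guard before the loop.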

Related

How to update a text type field in Elasticsearch to a keyword field, where each word becomes a keyword in a list?

I’m looking to update a field in Elasticsearch from text to keyword type.
I’ve tried changing the type from text to keyword in the mapping and then reindexing, but with this method the entire text value is converted into one big keyword. For example, ‘limited time offer’ is converted into one keyword, rather than being broken up into something like ['limited', 'time', 'offer'].
Is it possible to change a text field into a list of keywords, rather than one big keyword? Also, is there a way to do this with only a mapping change and then reindexing?
You need to create a new index and reindex into it using an ingest pipeline that splits the text into a list of words.
Pipeline
POST _ingest/pipeline/_simulate
{
"pipeline": {
"processors": [
{
"split": {
"field": "items",
"target_field": "new_list",
"separator": " ",
"preserve_trailing": true
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"items": "limited time offer"
}
}
]
}
Results
{
"docs": [
{
"doc": {
"_index": "index",
"_id": "id",
"_version": "-3",
"_source": {
"items": "limited time offer",
"new_list": [
"limited",
"time",
"offer"
]
},
"_ingest": {
"timestamp": "2022-11-11T14:49:15.9814242Z"
}
}
}
]
}
Steps
1 - Create a new index
2 - Create a pipeline
PUT _ingest/pipeline/split_words_field
{
"processors": [
{
"split": {
"field": "items",
"target_field": "new_list",
"separator": " ",
"preserve_trailing": true
}
}
]
}
3 - Reindex with pipeline
POST _reindex
{
"source": {
"index": "idx_01"
},
"dest": {
"index": "idx_02",
"pipeline": "split_words_field"
}
}
Example:
PUT _ingest/pipeline/split_words_field
{
"processors": [
{
"split": {
"field": "items",
"target_field": "new_list",
"separator": " ",
"preserve_trailing": true
}
}
]
}
POST idx_01/_doc
{
"items": "limited time offer"
}
POST _reindex
{
"source": {
"index": "idx_01"
},
"dest": {
"index": "idx_02",
"pipeline": "split_words_field"
}
}
GET idx_02/_search
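The behaviour of the split processor used above can be approximated in Python, which helps when choosing the separator. This is a rough sketch rather than the processor's actual implementation: split_words is a hypothetical helper, and edge cases such as empty input may differ from what Elasticsearch does.

```python
import re


def split_words(value, separator=" ", preserve_trailing=False):
    """Approximate the split processor: the separator is treated as a regex."""
    parts = re.split(separator, value)
    if not preserve_trailing:
        # Without preserve_trailing, empty trailing fields are dropped.
        while len(parts) > 1 and parts[-1] == "":
            parts.pop()
    return parts


print(split_words("limited time offer"))
print(split_words("limited time ", preserve_trailing=True))
```

With a single-space separator, two consecutive spaces in the text produce an empty string in the middle of the list; a separator such as "\\s+" in the pipeline is one way to avoid that.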

Elasticsearch case-insensitive partial match over multiple fields

I'm implementing a search box in Elasticsearch and I have an Elasticsearch index with the following mappings:
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"brand": {
"type": "text"
}
}
}
}
And I'd like, quite simply, to do a query such as (in SQL):
SELECT * FROM <table> WHERE brand ILIKE '%test%' OR name ILIKE '%test%';
I've tried a query such as:
{
"query": {
"query_string": {
"query": "*test*",
"fields": ["brand", "name"]
}
}
}
and that gives me my desired result, however, I've noticed that the docs recommend not using query_string for a search box as it can lead to performance issues.
I then tried a multi_match query:
{
"query": {
"multi_match" : {
"query": "test"
}
}
}
But that yielded no results. Further, when I used an ngram tokenizer, it returned all documents all the time.
I've consulted countless resources on this and even on StackOverflow there are countless unanswered questions regarding this topic. Could somebody explain how this is achieved in the Elasticsearch world, or am I simply using the wrong tool for the job? Thanks.
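Since you have not provided sample documents, I have created a complete example. What you are trying to do is very much possible in Elasticsearch with simple wildcard queries combined in a bool should clause, as shown below. Note that the query targets the name.keyword and brand.keyword sub-fields, which the mapping in the question does not define, so it assumes they exist (as they do with the default dynamic mapping). Also note that wildcard matching on keyword fields is case-sensitive by default.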
{
"query": {
"bool": {
"should": [
{
"wildcard": {
"name.keyword": {
"value": "*test*"
}
}
},
{
"wildcard": {
"brand.keyword": {
"value": "*test*"
}
}
}
],
"minimum_should_match": 1,
"boost": 1.0
}
}
}
You can test the above query on the sample documents below:
{
"brand" : "test",
"name" : "name foo according to use"
}
{
"brand" : "barand name is foo",
"name" : "name foo according to use"
}
{
"brand" : "barand name is test",
"name" : "name tested according to use"
}
{
"brand" : "barand name is testing",
"name" : "test the name"
}
On the 4 sample documents above, the query returns the following documents:
"hits": [
{
"_index": "73885469",
"_id": "1",
"_score": 2.0,
"_source": {
"brand": "barand name is testing",
"name": "test the name"
}
},
{
"_index": "73885469",
"_id": "2",
"_score": 2.0,
"_source": {
"brand": "barand name is test",
"name": "name tested according to use"
}
},
{
"_index": "73885469",
"_id": "4",
"_score": 1.0,
"_source": {
"brand": "test",
"name": "name foo according to use"
}
}
]
which I believe are the documents you expect.
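The same query body can be generated for any list of fields, and the result above can be checked locally. The sketch below is illustrative only: build_wildcard_query and matched_clauses are hypothetical helpers, the case_insensitive flag is a wildcard query option that only exists in Elasticsearch 7.10 and later, and counting matching fields is just a stand-in for the score (each matching clause appears to contribute 1.0 in the output above).

```python
from fnmatch import fnmatchcase


def build_wildcard_query(term, fields, case_insensitive=False):
    """Build a bool/should query with one wildcard clause per field."""
    should = []
    for field in fields:
        clause = {"value": f"*{term}*"}
        if case_insensitive:
            clause["case_insensitive"] = True  # assumption: ES 7.10+
        should.append({"wildcard": {field: clause}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


def matched_clauses(doc, term, fields):
    """Count the fields containing the term (case-sensitive, like the query)."""
    return sum(fnmatchcase(doc[field], f"*{term}*") for field in fields)


docs = [
    {"brand": "test", "name": "name foo according to use"},
    {"brand": "barand name is foo", "name": "name foo according to use"},
    {"brand": "barand name is test", "name": "name tested according to use"},
    {"brand": "barand name is testing", "name": "test the name"},
]
query = build_wildcard_query("test", ["name.keyword", "brand.keyword"])
counts = [matched_clauses(doc, "test", ["name", "brand"]) for doc in docs]
print(counts)
```

Documents with a count of 0 are the ones the query leaves out; the others line up with the scores 1.0 and 2.0 shown in the hits.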

search first element of a multivalue text field in elasticsearch

I want to search on the first element of an array field in Elasticsearch documents, but I can't find out how to do it.
For testing, I created a new index with fielddata=true, but I still didn't get the response I wanted.
Document
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
Values
name : ["John", "Doe"]
My request
{
"query": {
"bool" : {
"must" : {
"script" : {
"script" : {
"source": "doc['name'][0]=params.param1",
"params" : {
"param1" : "john"
}
}
}
}
}
}
}
Incoming Response
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
You can use the following script that is used in a search request to return a scripted field:
{
"script_fields": {
"firstElement": {
"script": {
"lang": "painless",
"source": "params._source.name[0]"
}
}
}
}
Search Result:
"hits": [
{
"_index": "stof_64391432",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"fields": {
"firstElement": [
"John" <-- note this
]
}
}
]
You can use a Painless script to create a script field to return a customized value for each document in the results of a query.
You need to use the equality operator '==' to compare two values in the script query: the resulting boolean is true if the two values are equal and false otherwise. A single '=' is an assignment, not a comparison.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings":{
"properties":{
"name":{
"type":"text",
"fielddata":true
}
}
}
}
Index data:
{
"name": [
"John",
"Doe"
]
}
Search Query:
{
"script_fields": {
"my_field": {
"script": {
"lang": "painless",
"source": "params['_source']['name'][0] == params.params1",
"params": {
"params1": "John"
}
}
}
}
}
Search Result:
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"fields": {
"my_field": [
true <-- note this
]
}
}
]
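One reason the answer reads from _source rather than from doc['name'] is ordering. The Python sketch below is an approximation under stated assumptions: with fielddata on a text field, the script sees analysed tokens (lowercased by the standard analyzer) in sorted order, not the original array order. Both helper names are hypothetical.

```python
def first_from_source(source, field):
    """What params._source.<field>[0] returns: the first value as indexed."""
    return source[field][0]


def first_from_doc_values(source, field):
    """Rough model of doc['<field>'][0] on a text field with fielddata=true.

    Assumption: tokens are lowercased, de-duplicated and sorted.
    """
    tokens = sorted({word.lower() for value in source[field] for word in value.split()})
    return tokens[0]


source = {"name": ["John", "Doe"]}
print(first_from_source(source, "name"))
print(first_from_doc_values(source, "name"))
print(first_from_source(source, "name") == "john")
```

The last comparison is False because the comparison against the source value is case-sensitive, which is why the working example passes "John" rather than "john" as the parameter.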
Arrays of objects do not work as you would expect: you cannot query
each object independently of the other objects in the array. If you
need to be able to do this then you should use the nested data type
instead of the object data type.
You can use the script shown in my other answer if you just want to compare the value of the first element of the array to some other value. But based on your comments, it looks like your use case is quite different.
If you want to search on the first element of the array, you need to convert your data into nested form. With arrays of objects, you can't refer to "the first element" or "the last element" at search time.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"name": {
"type": "nested"
}
}
}
}
Index Data:
{
"booking_id": 2,
"name": [
{
"first": "John Doe",
"second": "abc"
}
]
}
{
"booking_id": 1,
"name": [
{
"first": "Adam Simith",
"second": "John Doe"
}
]
}
{
"booking_id": 3,
"name": [
{
"first": "John Doe",
"second": "Adam Simith"
}
]
}
Search Query:
{
"query": {
"nested": {
"path": "name",
"query": {
"bool": {
"must": [
{
"match_phrase": {
"name.first": "John Doe"
}
}
]
}
}
}
}
}
Search Result:
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_score": 0.9400072,
"_source": {
"booking_id": 2,
"name": [
{
"first": "John Doe",
"second": "abc"
}
]
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "3",
"_score": 0.9400072,
"_source": {
"booking_id": 3,
"name": [
{
"first": "John Doe",
"second": "Adam Simith"
}
]
}
}
]
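For reference, the nested query above can be built programmatically, and the expected hits can be checked against the sample data. This is only a sketch: both helpers are hypothetical, and the local check uses a case-insensitive substring test as a crude stand-in for match_phrase.

```python
def build_nested_phrase_query(path, field, phrase):
    """Build the nested match_phrase query used in the example."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"bool": {"must": [{"match_phrase": {f"{path}.{field}": phrase}}]}},
            }
        }
    }


def matching_booking_ids(docs, phrase):
    """Return the booking ids whose name.first contains the phrase."""
    return sorted(
        doc["booking_id"]
        for doc in docs
        if any(phrase.lower() in entry["first"].lower() for entry in doc["name"])
    )


docs = [
    {"booking_id": 2, "name": [{"first": "John Doe", "second": "abc"}]},
    {"booking_id": 1, "name": [{"first": "Adam Simith", "second": "John Doe"}]},
    {"booking_id": 3, "name": [{"first": "John Doe", "second": "Adam Simith"}]},
]
query = build_nested_phrase_query("name", "first", "John Doe")
print(matching_booking_ids(docs, "John Doe"))
```

Booking 1 is excluded because "John Doe" only appears in its second field, which matches the search result above.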

Elasticsearch: Transpose and aggregate the data

I am using ES 6.5. When I fetch the required messages, I have to transpose and aggregate them. See the example for more details.
Messages retrieved (2 messages, as an example):
{
"_index": "index_name",
"_type": "data",
"_id": "data_id",
"_score": 5.0851293,
"_source": {
"header": {
"id": "System_20190729152502239_57246_16667",
"creationTimestamp": "2019-07-29T15:25:02.239Z"
},
"messageData": {
"messageHeader": {
"date": "2019-06-03",
"mId": "1000",
"mDescription": "TEST"
},
"messageBreakDown": [
{
"category": "New",
"subCategory": "Sub",
"messageDetails": [
{
"Amount": 5.30
}
]
}
]
}
}
},
{
"_index": "index_name",
"_type": "data",
"_id": "data_id",
"_score": 5.09512,
"_source": {
"header": {
"id": "System_20190729152502239_57246_16667",
"creationTimestamp": "2019-07-29T15:25:02.239Z"
},
"messageData": {
"messageHeader": {
"date": "2019-06-03",
"mId": "1000",
"mDescription": "TEST"
},
"messageBreakDown": [
{
"category": "Old",
"subCategory": "Sub",
"messageDetails": [
{
"Amount": 4.30
}
]
}
]
}
}
}
Now I am looking for a query to post to ES which will transpose the data and group by category and sub-category.
If you check the messages, they have the same header.id (which is the main search criterion). Within this header.id, one message is for category New and the other for category Old (messageData.messageBreakDown is an array, and the category value is inside it).
So ideally, as both messages belong to the same mId, the output would have a New price and an Old price.
How can I aggregate to get the desired results?
The final output message can contain only the desired fields, e.g. date, mId, mDescription, New price and Old price (both in one output).
UPDATE:
Below is the mapping,
{"index_name":{"mappings":{"data":{"properties":{"header":{"properties":{"id":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"creationTimestamp":{"type":"date"}}},"messageData":{"properties":{"messageBreakDown":{"properties":{"category":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"messageDetails":{"properties":{"Amount":{"type":"float"}}},"subCategory":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"messageHeader":{"properties":{"mDescription":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"mId":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"date":{"type":"date"}}}}}}}}}}
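To make the desired shape concrete, here is a minimal client-side sketch of the transposition, applied to the hits after they have been fetched. It is not a server-side aggregation, and it rests on assumptions: rows are grouped by header.id and subCategory, the column names "New price" and "Old price" are derived from the category, and pivot_by_category and make_hit are hypothetical helpers.

```python
def pivot_by_category(hits):
    """Group hits by (header.id, subCategory) and turn categories into columns."""
    rows = {}
    for hit in hits:
        source = hit["_source"]
        header = source["messageData"]["messageHeader"]
        for breakdown in source["messageData"]["messageBreakDown"]:
            key = (source["header"]["id"], breakdown["subCategory"])
            row = rows.setdefault(key, {
                "date": header["date"],
                "mId": header["mId"],
                "mDescription": header["mDescription"],
                "subCategory": breakdown["subCategory"],
            })
            column = f"{breakdown['category']} price"
            amount = sum(detail["Amount"] for detail in breakdown["messageDetails"])
            row[column] = row.get(column, 0.0) + amount
    return list(rows.values())


def make_hit(category, amount):
    """Build a trimmed-down copy of one hit from the question."""
    return {"_source": {
        "header": {"id": "System_20190729152502239_57246_16667"},
        "messageData": {
            "messageHeader": {"date": "2019-06-03", "mId": "1000", "mDescription": "TEST"},
            "messageBreakDown": [{
                "category": category,
                "subCategory": "Sub",
                "messageDetails": [{"Amount": amount}],
            }],
        },
    }}


rows = pivot_by_category([make_hit("New", 5.30), make_hit("Old", 4.30)])
print(rows)
```

For the two sample messages this yields a single row carrying date, mId, mDescription, subCategory and one price column per category.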

Name searching in ElasticSearch

I have an index created in Elasticsearch with a field name where I store the whole name of a person: name and surname. I want to perform full-text search over that field, so I have indexed it using the analyzer.
My issue now is that if I search:
"John Rham Rham"
And in the index I have "John Rham Rham Luck", that value gets a higher score than "John Rham Rham".
Is there any possibility to give a better score to the exact match than to a value with more terms in the string?
Thanks in advance!
I worked out a small example (assuming you're running on ES 5.x, because of the difference in scoring):
DELETE test
PUT test
{
"settings": {
"similarity": {
"my_bm25": {
"type": "BM25",
"b": 0
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "text",
"similarity": "my_bm25",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
}
POST test/test/1
{
"name": "John Rham Rham"
}
POST test/test/2
{
"name": "John Rham Rham Luck"
}
GET test/_search
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "John Rham Rham",
"operator": "and"
}
}
},
"functions": [
{
"script_score": {
"script": "_score / doc['name.length'].getValue()"
}
}
]
}
}
}
This code does the following:
Replaces the default BM25 similarity with a custom one, setting the b parameter (field-length normalisation) to 0
-- You could also change the similarity to 'classic' to go back to TF/IDF which doesn't have this normalisation
Creates a sub-field of your name field, which counts the number of tokens in the name.
Divides the score by the number of tokens in the name
This will result in:
"hits": {
"total": 2,
"max_score": 0.3596026,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.3596026,
"_source": {
"name": "John Rham Rham"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.26970196,
"_source": {
"name": "John Rham Rham Luck"
}
}
]
}
}
Not sure if this is the best way of doing it, but it may point you in the right direction :)
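The effect of the two changes can be illustrated numerically. The sketch below uses the standard BM25 term-frequency formula with its usual defaults (k1 = 1.2, b = 0.75); the average length of 3.5 and the base score 1.0788078 are assumptions, the latter back-computed from the output above (0.3596026 times 3 tokens), and both function names are hypothetical.

```python
def bm25_tf_norm(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25, including field-length normalisation."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def length_penalised(score, token_count):
    """Mirror the script_score: divide the score by the token count."""
    return score / token_count


# With b = 0, the field length no longer influences the term score.
print(bm25_tf_norm(2, 3, 3.5, b=0) == bm25_tf_norm(2, 4, 3.5, b=0))

# Dividing the shared base score by the token count favours the shorter name.
base_score = 1.0788078  # assumption: back-computed from the hits above
print(length_penalised(base_score, 3), length_penalised(base_score, 4))
```

Setting b to 0 first means the only length effect left in the final score is the explicit division, which keeps the ranking easy to reason about.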
