How to update a text type field in Elasticsearch to a keyword field, where each word becomes a keyword in a list? - elasticsearch

I’m looking to update a field in Elasticsearch from text to keyword type.
I’ve tried changing the type from text to keyword in the mapping and then reindexing, but with this method the entire text value is converted into one big keyword. For example, ‘limited time offer’ is converted into one keyword, rather than being broken up into something like ['limited', 'time', 'offer'].
Is it possible to change a text field into a list of keywords, rather than one big keyword? Also, is there a way to do this with only a mapping change and then reindexing?

You need create a new index and reindex using a pipeline to create a list words.
Pipeline
POST _ingest/pipeline/_simulate
{
"pipeline": {
"processors": [
{
"split": {
"field": "items",
"target_field": "new_list",
"separator": " ",
"preserve_trailing": true
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"items": "limited time offer"
}
}
]
}
Results
{
"docs": [
{
"doc": {
"_index": "index",
"_id": "id",
"_version": "-3",
"_source": {
"items": "limited time offer",
"new_list": [
"limited",
"time",
"offer"
]
},
"_ingest": {
"timestamp": "2022-11-11T14:49:15.9814242Z"
}
}
}
]
}
Steps
1 - Create a new index
2 - Create a pipeline
PUT _ingest/pipeline/split_words_field
{
"processors": [
{
"split": {
"field": "items",
"target_field": "new_list",
"separator": " ",
"preserve_trailing": true
}
}
]
}
3 - Reindex with pipeline
POST _reindex
{
"source": {
"index": "idx_01"
},
"dest": {
"index": "idx_02",
"pipeline": "split_words_field"
}
}
Example:
PUT _ingest/pipeline/split_words_field
{
"processors": [
{
"split": {
"field": "items",
"target_field": "new_list",
"separator": " ",
"preserve_trailing": true
}
}
]
}
POST idx_01/_doc
{
"items": "limited time offer"
}
POST _reindex
{
"source": {
"index": "idx_01"
},
"dest": {
"index": "idx_02",
"pipeline": "split_words_field"
}
}
GET idx_02/_search

Related

search first element of a multivalue text field in elasticsearch

I want to search first element of array in documents of elasticsearch, but I can't.
I don't find it that how can I search.
For test, I created new index with fielddata=true, but I still didn't get the response that I wanted
Document
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
Values
name : ["John", "Doe"]
My request
{
"query": {
"bool" : {
"must" : {
"script" : {
"script" : {
"source": "doc['name'][0]=params.param1",
"params" : {
"param1" : "john"
}
}
}
}
}
}
}
Incoming Response
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
You can use the following script that is used in a search request to return a scripted field:
{
"script_fields": {
"firstElement": {
"script": {
"lang": "painless",
"inline": "params._source.name[0]"
}
}
}
}
Search Result:
"hits": [
{
"_index": "stof_64391432",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"fields": {
"firstElement": [
"John" <-- note this
]
}
}
]
You can use a Painless script to create a script field to return a customized value for each document in the results of a query.
You need to use equality equals operator '==' to COMPARE two
values where the resultant boolean type value is true if the two
values are equal and false otherwise in the script query.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings":{
"properties":{
"name":{
"type":"text",
"fielddata":true
}
}
}
}
Index data:
{
"name": [
"John",
"Doe"
]
}
Search Query:
{
"script_fields": {
"my_field": {
"script": {
"lang": "painless",
"source": "params['_source']['name'][0] == params.params1",
"params": {
"params1": "John"
}
}
}
}
}
Search Result:
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"fields": {
"my_field": [
true <-- note this
]
}
}
]
Arrays of objects do not work as you would expect: you cannot query
each object independently of the other objects in the array. If you
need to be able to do this then you should use the nested data type
instead of the object data type.
You can use the script as shown in my another answer if you want to just compare the value of the first element of the array to some other value. But based on your comments, it looks like your use case is quite different.
If you want to search the first element of the array you need to convert your data, into nested form. Using arrays of object at search time you can’t refer to “the first element” or “the last element”.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"name": {
"type": "nested"
}
}
}
}
Index Data:
{
"booking_id": 2,
"name": [
{
"first": "John Doe",
"second": "abc"
}
]
}
{
"booking_id": 1,
"name": [
{
"first": "Adam Simith",
"second": "John Doe"
}
]
}
{
"booking_id": 3,
"name": [
{
"first": "John Doe",
"second": "Adam Simith"
}
]
}
Search Query:
{
"query": {
"nested": {
"path": "name",
"query": {
"bool": {
"must": [
{
"match_phrase": {
"name.first": "John Doe"
}
}
]
}
}
}
}
}
Search Result:
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_score": 0.9400072,
"_source": {
"booking_id": 2,
"name": [
{
"first": "John Doe",
"second": "abc"
}
]
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "3",
"_score": 0.9400072,
"_source": {
"booking_id": 3,
"name": [
{
"first": "John Doe",
"second": "Adam Simith"
}
]
}
}
]

Term aggregation on ElasticSearch join

I would like to perform an aggregation on a join relation using ElasticSearch 7.7.
I need to know how many children I have for each parent.
The only way that I found to solve my issue is to use script inside term aggregation, but my concern is about performance.
/my_index/_search
{
"size": 0,
"aggs": {
"total": {
"terms": {
"script": {
"lang": "painless",
"source": "params['_source']['my_join']['parent']"
}
}
},
"max_total": {
"max_bucket": {
"buckets_path": "total>_count"
}
}
}
}
Someone knows a more fast way to execute this aggregation avoiding the script?
If the join field wasn't a parent/child I could replace the term aggregation with:
"terms": { "field": "my_field" }
To give more context I add some information about mapping:
I'm using Elastic 7.7.
I also attach a mapping with some sample documents:
{
"mappings": {
"properties": {
"my_join": {
"relations": {
"other": "doc"
},
"type": "join"
},
"reader": {
"type": "keyword"
},
"name": {
"type": "text"
},
"content": {
"type": "text"
}
}
}
}
PUT example/_doc/1
{
"reader": [
"A",
"B"
],
"my_join": {
"name": "other"
}
}
PUT example/_doc/2
{
"reader": [
"A",
"B"
],
"my_join": {
"name": "other"
}
}
PUT example/_doc/3
{
"content": "abc",
"my_join": {
"name": "doc",
"parent": 1
}
}
PUT example/_doc/4
{
"content": "def",
"my_join": {
"name": "doc"
"parent": 2
}
}
PUT example/_doc/5
{
"content": "def",
"acl_join": {
"name": "doc"
"parent": 1
}
}

How to turn an array of object to array of string while reindexing in elasticsearch?

Let say the source index have a document like this :
{
"name":"John Doe",
"sport":[
{
"name":"surf",
"since":"2 years"
},
{
"name":"mountainbike",
"since":"4 years"
},
]
}
How to discard the "since" information so once reindexed the object will contain only sport names? Like this :
{
"name":"John Doe",
"sport":["surf","mountainbike"]
}
Note that it would be fine if the resulting field keep the same name, but it's not mandatory.
I don't know which version of elasticsearch you're using, but here is a solution based on pipelines, introduced with ingest nodes in ES v5.0.
1) A script processor is used to extract the values from each subobject and set it in another field (here, sports)
2) The previous sport field is removed with a remove processor
You can use the Simulate pipeline API to test it :
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "random description",
"processors": [
{
"script": {
"lang": "painless",
"source": "ctx.sports =[]; for (def item : ctx.sport) { ctx.sports.add(item.name) }"
}
},
{
"remove": {
"field": "sport"
}
}
]
},
"docs": [
{
"_index": "index",
"_type": "doc",
"_id": "id",
"_source": {
"name": "John Doe",
"sport": [
{
"name": "surf",
"since": "2 years"
},
{
"name": "mountainbike",
"since": "4 years"
}
]
}
}
]
}
which outputs the following result :
{
"docs": [
{
"doc": {
"_index": "index",
"_type": "doc",
"_id": "id",
"_source": {
"name": "John Doe",
"sports": [
"surf",
"mountainbike"
]
},
"_ingest": {
"timestamp": "2018-07-12T14:07:25.495Z"
}
}
}
]
}
There may be a better solution, as I've not used pipelines a lot, or you could make this with Logstash filters before submitting the documents to your Elasticsearch cluster.
For more information about the pipelines, take a look at the reference documentation of ingest nodes.

How to configure elasticsearch regexp query

I try to configure elasticsearch request. I use DSL and try to find some data with word "swagger" into "message" field.
Here is one of correct answer I want to show :
{
"_index": "apiconnect508",
"_type": "audit",
"_id": "AWF1us1T4ztincEzswAr",
"_score": 1,
"_source": {
"consumerOrgId": null,
"headers": {
"http_accept": "application/json",
"content_type": "application/json",
"request_path": "/apim-5a7c34e0e4b02e66c60edbb2-2018.02/auditevent",
"http_version": "HTTP/1.1",
"http_connection": "keep-alive",
"request_method": "POST",
"http_host": "localhost:9700",
"request_uri": "/apim-5a7c34e0e4b02e66c60edbb2-2018.02/auditevent",
"content_length": "533",
"http_user_agent": "Wink Client v1.1.1"
},
"nlsMessage": {
"resource": "messages",
"replacements": [
"test",
"1.0.0",
"ext_mafashagov#rencredit.ru"
],
"key": "swagger.import.notification"
},
"notificationType": "EVENT",
"eventType": "AUDIT",
"source": null,
"envId": null,
"message": "API test version 1.0.0 was created from a Swagger document by ext_mafashagov#rencredit.ru.",
"userId": "ext_mafashagov#rencredit.ru",
"orgId": "5a7c34e0e4b02e66c60edbb2",
"assetType": "api",
"tags": [
"_geoip_lookup_failure"
],
"gateway_geoip": {},
"datetime": "2018-02-08T14:04:32.731Z",
"#timestamp": "2018-02-08T14:04:32.747Z",
"assetId": "5a7c58f0e4b02e66c60edc53",
"#version": "1",
"host": "127.0.0.1",
"id": "5a7c58f0e4b02e66c60edc55",
"client_geoip": {}
}
}
I try to find ths JSON by :
POST myAddress/_search
Next query works without "regexp" field. How should I configure regexp part of my query?
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"range": {
"#timestamp" : {"gte" : "now-100d"}
}
},
{
"term": {
"_type": "audit"
}
},
{
"regexp" : {
"message": "*wagger*"
}
}
]
}
}
}
},
"sort": {
"TraceDateTime": {
"order": "desc",
"ignore_unmapped": "true"
}
}
}
If message field is analyzed, this simple match query should work:
"match":{
"message":"*swagger*"
}
However if it is not analyzed, these two queries should also work for you:
These two queries are case sensitive so you should consider lower casing your field if you wish to keep it not analyzed.
"wildcard":{
"message":"*swagger*"
}
or
"regexp":{
"message":"swagger"
}
Be careful as wildcard and regexp queries degrade performance.

Elasticsearch OR filtered query does not return results

I have the following data set:
{
"_index": "myIndex",
"_type": "myType",
"_id": "220005",
"_score": 1,
"_source": {
"id": "220005",
"name": "Some Name",
"type": "myDataType",
"doc_as_upsert": true
}
}
Doing a direct match query like so:
GET typo3data/destination/_search
{
"query": {
"match": {
"name": "Some Name"
}
},
"size": 500
}
Will return the data just fine:
"hits": {
"total": 1,
"max_score": 3.442347,
"hits": [...
Doing an OR-query however (I am not sure which syntax is correct, the first syntax is taken from elasticsearch docs, the second is a working query taken from another project with the same versions):
GET typo3data/destination/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"or": {
"filters": [
{
"term": {
"name": "Some Name"
}
}
]
}
}
}
},
"size": 500
}
or
{
"query":
{
"match_all": {}
},
"filter":
{
"or":
[
{ "term": { "name": "Some Name"} },
{ "term": { "name": "Some Other Name"} }
]
},
"size": 1000
}
Does not return anything.
The mapping for the name field is:
"name": {
"type": "string",
"index": "not_analyzed"
}
Elasticsearch version is 1.4.4.
When indexing "some name" , this is broken into tokens as follows -
"some name" => [ "some" , "name" ]
Now in a normal match query , it also does the same above process before matching result. If either "same" or "name" is present , that document is qualified as result
match query ("some name") => search for term "some" or "name"
The term query does not analyze or tokenize your query. This means that it looks for a exact token or term of "some name" which is not present.
term query ("some name") => search for term "some name"
Hence you wont be seeing any result.
Things should work fine if you make the field not_analyzed , but then make sure the case is also matching,
You can read more about the same here.
After extending our mapping to include every field we have:
PUT typo3data/_mapping/destination
{
"someType": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"parentId": {
"type": "integer"
},
"type": {
"type": "string"
},
"generatedUid": {
"type": "integer"
}
}
}
}
The or-filters were working. So the general answer is: If you have such a problem, check your mappings closely and rather do too much work on them than too little.
If someone has an explanation why this might be happening, I will gladly pass the answer mark on to it.

Resources