How to insert a scripted field using an ingest pipeline - Elasticsearch

I have two fields in my docs:
{
"emails": ["", "", ""],
"name": ""
}
Once the docs are indexed, I want each of them to have a new field called uid that contains the name concatenated with all of the emails.
I can get a scripted field like that with this GET request on my index's _search endpoint:
{
"script_fields": {
"combined": {
"script": {
"lang": "painless",
"source": "def result=''; for (String email: doc['emails.keyword']) { result = result + email;} return doc['name'].value + result;"
}
}
}
}
What should the PUT request body for my ingest pipeline look like if I want the same field to be indexed with my docs?

Let's say I have the sample index and sample document below (note that the array field is named email here, not emails as in the question).
Sample Source Index
For the sake of understanding, I've created the below mapping.
PUT my_source_index
{
"mappings": {
"properties": {
"email":{
"type":"text"
},
"name":{
"type": "text"
}
}
}
}
Sample Document:
POST my_source_index/_doc/1
{
"email": ["john#gmail.com","doe#outlook.com"],
"name": "johndoe"
}
Follow the steps below.
Step 1: Create Ingest Pipeline
PUT _ingest/pipeline/my-pipeline-concat
{
"description" : "describe pipeline",
"processors" : [
{
"join": {
"field": "email",
"target_field": "temp_uuid",
"separator": "-"
}
},
{
"set": {
"field": "uuid",
"value": "{{name}}-{{temp_uuid}}"
}
},
{
"remove":{
"field": "temp_uuid"
}
}
]
}
Notice that the pipeline is created through the Ingest API and uses three processors, which are executed in sequence:
The first is a Join processor, which concatenates all the email ids into temp_uuid.
The second is a Set processor, which combines name with temp_uuid into uuid.
The third is a Remove processor, which deletes the temporary temp_uuid field.
Note that I am using - as the delimiter between all values. Feel free to use any separator you want.
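To see what the three processors do to a document, here is a minimal Python sketch that mimics them on a plain dict. It is only a local model for illustration, not the Elasticsearch implementation, and the function name apply_pipeline is hypothetical.

```python
def apply_pipeline(doc, separator="-"):
    """Local stand-in for the join -> set -> remove processor chain."""
    out = dict(doc)  # shallow copy so the input document is left untouched
    # join: concatenate the elements of the "email" array into temp_uuid
    out["temp_uuid"] = separator.join(out["email"])
    # set: render the template "{{name}}-{{temp_uuid}}" into uuid
    out["uuid"] = f"{out['name']}-{out['temp_uuid']}"
    # remove: drop the temporary field again
    del out["temp_uuid"]
    return out


sample = {"email": ["john#gmail.com", "doe#outlook.com"], "name": "johndoe"}
result = apply_pipeline(sample)
print(result["uuid"])  # johndoe-john#gmail.com-doe#outlook.com
```

The order matters: the Set step can only reference temp_uuid because the Join step has already created it.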
Step 2: Create Destination Index:
PUT my_dest_index
{
"mappings": {
"properties": {
"email":{
"type":"text"
},
"name":{
"type": "text"
},
"uuid":{ <--- Do not forget to add this
"type": "text"
}
}
}
}
Step 3: Apply Reindex API:
POST _reindex
{
"source": {
"index": "my_source_index"
},
"dest": {
"index": "my_dest_index",
"pipeline": "my-pipeline-concat" <--- Make sure you add pipeline here
}
}
Note how the pipeline is specified under dest when calling the Reindex API.
Step 4: Verify Destination Index (for example, by running GET my_dest_index/_search):
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_dest_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "johndoe",
"uuid" : "johndoe-john#gmail.com-doe#outlook.com", <--- Note this
"email" : [
"john#gmail.com",
"doe#outlook.com"
]
}
}
]
}
}
Hope this helps!

Related

ElasticSearch Query fields based on conditions on another field

Mapping
PUT /employee
{
"mappings": {
"post": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"properties":{
"id" : { "type" : "integer"},
"value" : { "type" : "keyword"}
}
},
"primary_email_id":{
"type": "integer"
}
}
}
}
}
Data
POST employee/post/1
{
"name": "John",
"email_ids": [
{
"id" : 1,
"value" : "1#email.com"
},
{
"id" : 2,
"value" : "2#email.com"
}
],
"primary_email_id": 2 // Here 2 refers to the id field of email_ids.id (2#email.com).
}
I need help forming a query to check whether an email id is already taken as a primary email.
E.g., if I query for 1#email.com, I should get No, as 1#email.com is not a primary email id.
If I query for 2#email.com, I should get Yes, as 2#email.com is the primary email id for John.
As far as I know, you cannot achieve what you are expecting with this mapping.
However, you can map the email_ids field as a nested type, add one more field such as isPrimary, and set it to true whenever the email is the primary email.
Index Mapping
PUT employee
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"type": "nested",
"properties": {
"id": {
"type": "integer"
},
"value": {
"type": "keyword"
},
"isPrimary":{
"type": "boolean"
}
}
},
"primary_email_id": {
"type": "integer"
}
}
}
}
Sample Document
POST employee/_doc/1
{
"name": "John",
"email_ids": [
{
"id": 1,
"value": "1#email.com"
},
{
"id": 2,
"value": "2#email.com",
"isPrimary": true
}
],
"primary_email_id": 2
}
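If existing documents only carry primary_email_id, the isPrimary flag could be derived before re-indexing. The following is a minimal Python sketch under that assumption; mark_primary is a hypothetical helper, and unlike the sample document above it also writes an explicit false on non-primary entries.

```python
import copy


def mark_primary(employee):
    """Return a copy of the document with isPrimary set on every email entry."""
    doc = copy.deepcopy(employee)  # do not mutate the caller's document
    primary_id = doc.get("primary_email_id")
    for email in doc.get("email_ids", []):
        # True only for the entry whose id matches primary_email_id
        email["isPrimary"] = email["id"] == primary_id
    return doc


john = {
    "name": "John",
    "email_ids": [
        {"id": 1, "value": "1#email.com"},
        {"id": 2, "value": "2#email.com"},
    ],
    "primary_email_id": 2,
}
flagged = mark_primary(john)
```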
Query
Keep the query below as it is and change only the email address to check whether that email is a primary one.
POST employee/_search
{
"_source": false,
"query": {
"nested": {
"path": "email_ids",
"query": {
"bool": {
"must": [
{
"term": {
"email_ids.value": {
"value": "2#email.com"
}
}
},
{
"term": {
"email_ids.isPrimary": {
"value": "true"
}
}
}
]
}
}
}
}
}
Result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.98082924,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.98082924
}
]
}
}
Interpret Result:
Elasticsearch will not return a boolean such as true or false, but you can derive one at the application level: read hits.total.value from the result and treat 0 as false and anything else as true.
PS: Answer is based on ES version 7.10.
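The application-level check described above could look like this minimal Python sketch. The helper name has_hits is hypothetical; it assumes the response has already been parsed into a dict, and it also accepts the bare integer form of hits.total returned by versions before 7.

```python
def has_hits(response):
    """Interpret a parsed _search response as a yes/no answer."""
    total = response["hits"]["total"]
    # ES 7+ returns {"value": n, "relation": ...}; older versions return a bare integer
    count = total["value"] if isinstance(total, dict) else total
    return count > 0


primary_response = {"hits": {"total": {"value": 1, "relation": "eq"}, "hits": []}}
not_primary_response = {"hits": {"total": {"value": 0, "relation": "eq"}, "hits": []}}
legacy_response = {"hits": {"total": 1, "hits": []}}
```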

Can't Highlight Dynamic Template Field Value in Elasticsearch

Follow up to this question.
I have a dynamic template which copies the text of a JSON blob to a single text field, and I'd like to search on that field and highlight matches. Here is my full code for ES 6.5
DELETE /test
PUT /test?include_type_name=true
{
"settings": {"number_of_shards": 1,"number_of_replicas": 1},
"mappings": {
"_doc": {
"dynamic_templates": [
{
"full_name": {
"match_mapping_type": "string",
"path_match": "content.*",
"mapping": {
"type": "text",
"copy_to": "content_text"
}
}
}
],
"properties": {
"content_text": {
"type": "text"
},
"content": {
"type": "object",
"enabled": "true"
}
}
}
}
}
PUT /test/_doc/1?refresh=true
{
"content": {
"a": {
"b": {
"text": "42"
}
}
}
}
GET /test/_search
{
"query": {
"match": {
"content_text": "42"
}
},
"highlight": {
"fields": {
"content_text": {}
}
}
}
The response does not show the highlighted content_text:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"content" : {
"a" : {
"b" : {
"text" : "42"
}
}
}
}
}
]
}
}
As you can see, the content_text field is not highlighted. It's also not in the response at all. How do I get highlights for this field to show up?
This is a tricky one, but will make sense once you read what follows.
As per the official documentation on highlighting, the actual content of a field is required to exist somewhere. So if the field is not stored (i.e. the mapping does not set store to true), the actual _source is loaded and the relevant field is extracted from _source.
In your case, the content_text field doesn't exist in the _source document (i.e. it is just indexed from other text fields present in content.*) and in the mapping, the store parameter is not set to true (it is false by default).
So you simply need to change your mapping to this:
"content_text": {
"store": true,
"type": "text"
},
And then your query will yield this:
"highlight" : {
"content_text" : [
"<em>42</em>"
]
}
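The lookup described above can be pictured with a small Python model. This is only a simplified illustration of the explanation, not how Elasticsearch is implemented; the function text_for_highlighting and its stored_fields argument are hypothetical.

```python
def text_for_highlighting(field, source, stored_fields):
    """Return the text a highlighter could work from, or None if there is none."""
    # A field mapped with "store": true has its own stored copy
    if field in stored_fields:
        return stored_fields[field]
    # Otherwise fall back to _source, which does not contain copy_to targets
    return source.get(field)


source = {"content": {"a": {"b": {"text": "42"}}}}
without_store = text_for_highlighting("content_text", source, stored_fields={})
with_store = text_for_highlighting(
    "content_text", source, stored_fields={"content_text": ["42"]}
)
```

In this model without_store is None, which corresponds to the missing highlight, while with_store holds the text that can be highlighted.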

Elastic Search shows "Unknown key for a START_OBJECT" exception

I am sending the following query to Elasticsearch in order to get the documents whose value lies within the range between from and to:
{
"range" : {
"variables.value.long" : {
"from" : -1.0E19,
"to" : 9.1E18,
"include_lower" : true,
"include_upper" : true,
"boost" : 1.0
}
}
}
However, Elasticsearch throws the following error:
{
"error": {
"root_cause": [
{
"type": "parsing_exception",
"reason": "Unknown key for a START_OBJECT in [range].",
"line": 2,
"col": 13
}
],
"type": "parsing_exception",
"reason": "Unknown key for a START_OBJECT in [range].",
"line": 2,
"col": 13
},
"status": 400
}
Does anybody know what this error means and why I am getting it?
Some context is missing here, such as your mappings or the full request you are running, but this is how a range query should look for your document.
Create index
PUT test_andromachiii
{
"mappings": {
"properties": {
"variables": {
"properties": {
"values": {
"properties": {
"long": {
"type": "double"
}
}
}
}
}
}
}
}
Index document
POST test_andromachiii/_doc
{
"variables": {
"values": {
"long": 9.1E18
}
}
}
Run Query
POST test_andromachiii/_search
{
"query": {
"range": {
"variables.values.long": {
"gte": -1.0E19,
"lte": 9.1E18,
"boost": 1
}
}
}
}
Note that lte means less than or equal to, and gte means greater than or equal to.
Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_andromachiii",
"_type" : "_doc",
"_id" : "gtGj73cBbr4pOF0Is9my",
"_score" : 1.0,
"_source" : {
"variables" : {
"values" : {
"long" : 9.1E18
}
}
}
}
]
}
}
It looks like you're using version <0.90.4. If that's the case, simply wrap your range in a parent query object:
{
"query":{
"range":{
"variables.value.long":{
"from":-1.0E19,
"to":9.1E18,
"include_lower":true,
"include_upper":true,
"boost":1.0
}
}
}
}
If you're using any newer version than that, note that:
The from, to, include_lower and include_upper parameters have been deprecated in 0.90.4 in favour of gt, gte, lt, and lte.
This error is saying (somewhat cryptically) that you have a range key with an Object value in a place where that key isn't recognised.
The specific cause here is that your range needs to sit under a higher-level query key (for example, the top-level query object or a bool query), not at the root of the request body.
Credit: https://discuss.elastic.co/t/unknown-key-for-a-start-object-in-should/140008/3
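Both fixes (wrapping the clause under query and replacing the deprecated parameters) can be sketched in Python as follows. The helper modernize_range is hypothetical, and it assumes that include_lower and include_upper default to true when they are absent.

```python
def modernize_range(clause):
    """Wrap a bare range clause under "query" and convert from/to into gte/gt/lte/lt."""
    field, params = next(iter(clause["range"].items()))
    params = dict(params)  # copy so the caller's clause is not modified
    lower = params.pop("from", None)
    upper = params.pop("to", None)
    include_lower = params.pop("include_lower", True)  # assumed default: inclusive
    include_upper = params.pop("include_upper", True)  # assumed default: inclusive
    if lower is not None:
        params["gte" if include_lower else "gt"] = lower
    if upper is not None:
        params["lte" if include_upper else "lt"] = upper
    return {"query": {"range": {field: params}}}


legacy = {
    "range": {
        "variables.value.long": {
            "from": -1.0e19,
            "to": 9.1e18,
            "include_lower": True,
            "include_upper": True,
            "boost": 1.0,
        }
    }
}
fixed = modernize_range(legacy)
```

Other parameters such as boost are passed through unchanged.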

Elasticsearch data model

I'm currently parsing text from internal résumés in my company. The goal is to index everything in Elasticsearch so that we can search them.
For the moment I have the following JSON document with no mapping defined.
Each coworker has a list of projects with the client name:
{
"name": "Jean Wisser",
"position": "Junior Developer",
"projects": [
{
"client": "SutrixMedia",
"missions": [
"Responsible for the quality on time and within budget",
"Writing specs, testing,..."
],
"technologies": "JIRA/Mantis/Adobe CQ5 (AEM)"
},
{
"client": "Société Générale",
"missions": [
" Writing test cases and scenarios",
" UAT"
],
"technologies": "HP QTP/QC"
}
]
}
The 2 main questions we would like to answer are:
Which coworkers have already worked for this company?
Which clients use this technology?
The first question is really easy to answer, for example:
Projects.client="SutrixMedia" returns the right résumé.
But how can I answer the second one?
I would like to make a query like Projects.technologies="HP QTP/QC" and get back only the client name ("Société Générale" in this case), NOT the entire document.
Is it possible to get this answer by defining a mapping with a nested type?
Or should I go for a parent/child mapping?
Yes, indeed, that's possible with ES 1.5.* if you map projects as nested type and then retrieve nested inner_hits.
So here goes the mapping for your sample document above:
curl -XPUT localhost:9200/resumes -d '
{
"mappings": {
"resume": {
"properties": {
"name": {
"type": "string"
},
"position": {
"type": "string"
},
"projects": {
"type": "nested", <--- declare "projects" as nested type
"properties": {
"client": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"missions": {
"type": "string"
},
"technologies": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
}'
Then, you can index your sample document from above:
curl -XPUT localhost:9200/resumes/resume/1 -d '{...}'
Finally, the following query retrieves only the nested inner_hits, so you get back only the nested object that matches Projects.technologies="HP QTP/QC":
curl -XPOST localhost:9200/resumes/resume/_search -d '
{
"_source": false,
"query": {
"nested": {
"path": "projects",
"query": {
"term": {
"projects.technologies.raw": "HP QTP/QC"
}
},
"inner_hits": { <----- only retrieve the matching nested document
"_source": "client" <----- and only the "client" field
}
}
}
}'
which yields only the client name instead of the whole matching document:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.4054651,
"hits" : [ {
"_index" : "resumes",
"_type" : "resume",
"_id" : "1",
"_score" : 1.4054651,
"inner_hits" : {
"projects" : {
"hits" : {
"total" : 1,
"max_score" : 1.4054651,
"hits" : [ {
"_index" : "resumes",
"_type" : "resume",
"_id" : "1",
"_nested" : {
"field" : "projects",
"offset" : 1
},
"_score" : 1.4054651,
"_source":{"client":"Société Générale"} <--- here is the client name
} ]
}
}
}
} ]
}
}
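On the application side, the client names can then be pulled out of the inner_hits section. Here is a minimal Python sketch, assuming the response has been parsed into a dict; the function name matching_clients is hypothetical.

```python
def matching_clients(response, path="projects"):
    """Collect the client of every nested hit returned under inner_hits."""
    clients = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {})
        for nested in inner.get("hits", {}).get("hits", []):
            clients.append(nested["_source"]["client"])
    return clients


# Trimmed-down version of the response shown above
response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_id": "1",
                "inner_hits": {
                    "projects": {
                        "hits": {
                            "total": 1,
                            "hits": [
                                {
                                    "_nested": {"field": "projects", "offset": 1},
                                    "_source": {"client": "Société Générale"},
                                }
                            ],
                        }
                    }
                },
            }
        ],
    }
}
clients = matching_clients(response)
print(clients)  # ['Société Générale']
```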

ElasticSearch : search and return nested type

I am pretty new to ElasticSearch and I am having trouble using nested mapping / query.
I have the following data structure added to my index :
{
"_id": "3",
"_rev": "6-e9e1bc15b39e333bb4186de05ec1b167",
"skuCode": "test",
"name": "Dragon vol. 1",
"pages": [
{
"id": "1",
"tags": [
{
"name": "dragon"
},
{
"name": "japonese"
}
]
},
{
"id": "2",
"tags": [
{
"name": "tagforanotherpage"
}
]
}
]
}
This index mapping is defined as below:
{
"metabook" : {
"metabook" : {
"properties" : {
"_rev" : {
"type" : "string"
},
"name" : {
"type" : "string"
},
"pages" : {
"type" : "nested",
"properties" : {
"tags" : {
"properties" : {
"name" : {
"type" : "string"
}
}
}
}
},
"skuCode" : {
"type" : "string"
}
}
}
}
}
My goal is to search all pages containing a specific tag, and return the book object with the filtered page list (I would like ES to return only pages that match the given tag). Something like (ignoring the second page) :
{
"_id": "3",
"_rev": "6-e9e1bc15b39e333bb4186de05ec1b167",
"skuCode": "test",
"name": "Dragon vol. 1",
"pages": [
{
"id": "1",
"tags": [
{
"name": "dragon"
},
{
"name": "japonese"
}
]
}
]
}
Here is the query I actually use :
{
"from": 0,
"size": 10,
"query" : {
"nested" : {
"path" : "pages",
"score_mode" : "avg",
"query" : {
"term" : { "tags.name" : "japonese" }
}
}
}
}
But it actually returns an empty result. What am I doing wrong? Maybe I should index my "pages" directly instead of books? What am I missing?
Thank you in advance !
Sadly you can't get back only parts of a document. If the document matches a query, you will get the whole thing back: the root and all nested docs. If you want to get only parts back, then you could look at using parent/child docs.
Also, you aren't seeing any hits because of a small error in the nested query: the field name must include the full path, starting from the nested field. Look closely at the field name:
{
"from": 0,
"size": 10,
"query" : {
"nested" : {
"path" : "pages",
"score_mode" : "avg",
"query" : {
"term" : { "pages.tags.name" : "japonese" }
}
}
}
}
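One way to avoid this mistake is to build the field name from the nested path. The following Python sketch assumes a hypothetical helper named nested_term_query that always prefixes the field with the path.

```python
def nested_term_query(path, relative_field, value, score_mode="avg"):
    """Build a nested term query whose field name is prefixed with the nested path."""
    return {
        "query": {
            "nested": {
                "path": path,
                "score_mode": score_mode,
                # The full path is required: "pages.tags.name", not "tags.name"
                "query": {"term": {f"{path}.{relative_field}": value}},
            }
        }
    }


body = nested_term_query("pages", "tags.name", "japonese")
```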
If you need help with parent child docs feel free to ask! (There should be examples if you do a google search)
Good luck!
