Querying child properties - elasticsearch

I want to copy my index over but skipping a property when it matches a specific value. I found out how to exclude a property all together, but I need something like this:
exclude 'terms property' WHERE 'source_terminology subproperty' not like '%earthMaterials%
Is this possible in ElasticSearch or should I approach it in a different way?
POST _reindex
{
"source" : {
"index" : "documents3",
"_source":{
"excludes": [
"terms"
]
}
},
"dest" : {
"index" : "documents4"
}
}
This is a reduced version of my mapping:
{
"documents4": {
"mappings": {
"doc": {
"properties": {
"abstract": {
"type": "text"
},
"author": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"terms": {
"properties": {
"source_terminology": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
}
}
},
"uri": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
}
}
}
}
}
}
}
}
}
}
This is a bit how my data looks like now:
{
"_index": "documents4",
"_type": "doc",
"_id": "6bf03d1e-f7dc-40c6-a32d-c9aa09e7b051",
"_score": 1,
"_source": {
"terms": [
{
"source_terminology": "exploration-activity-type",
"label": "feasibility study",
"uri": "http://resource.geosciml.org/classifier/cgi/exploration-activity-type/feasibility-study"
},
{
"source_terminology": "earthMaterialsAT",
"label": "rock",
"uri": "http://www.similarto.com/ontologies/lithology/2010/12/earthMaterialsAT#rock"
},
"title": "Miguel Auza Initial Prospectus"
}
}

You can use painless script to add the conditions you need.
POST _reindex
{
"source" : {
"index" : "documents4"
},
"dest" : {
"index" : "documents4-copy3"
},
"script": {
"source": "int index = 0; def list = new ArrayList(); for(term in ctx._source.terms) { if(term.source_terminology =~ /^(?:(?!exploration).)+$/) { list.add(0, index) } index++;} for(item in list) { ctx._source.terms.remove(item)}",
"lang": "painless"
}
}
You need script.painless.regex.enabled value set to true in elasticsearch.yml file for this to work.
Formatted version of Painless script
int index = 0;
def list = new ArrayList();
for (term in ctx._source.terms) {
if (term.source_terminology = ~ /^(?:(?!earthMaterials).)+$/) {
// Need to add matched index at start to avoid
// index_out_of_bounds_exception when removing items later
list.add(0, index)
// If you try to remove item as soon as match is found,
// you will get concurrent_modification_exception
}
index++;
}
for (item in list) {
ctx._source.terms.remove(item)
}

Related

Matching the stored values and queries in Elastic Search

I have a field called that is inside a nested field "name" that is a "Keyword" in elastic search.
Name field contains 2 values.
Jagannathan Rajagopalan
Rajagopalan.
If I query "Rajagopalan", I should get only the item #2.
If I query the complete Jagannathan Rajagopalan, I should get #1.
How do I achieve it?
You need to use the term query which is used for exact search. Added a working example according to your use-case.
Index mapping
{
"mappings": {
"properties": {
"name": {
"type": "nested",
"properties": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Index sample docs
{
"name" : {
"keyword" : "Jagannathan Rajagopalan"
}
}
And another doc
{
"name" : {
"keyword" : "Jagannathan"
}
}
And search query
{
"query": {
"nested": {
"path": "name",
"query": {
"bool": {
"must": [
{
"match": {
"name.keyword": "Jagannathan Rajagopalan"
}
}
]
}
}
}
}
}
Search result
"hits": [
{
"_index": "key",
"_type": "_doc",
"_id": "2",
"_score": 0.6931471,
"_source": {
"name": {
"keyword": "Jagannathan Rajagopalan"
}
}
}
]

Include joined children with Elasticsearch GET request

I have an Elasticsearch index events that has a join field so that an event can have multiple instances (i.e. the same event can occur on different dates). In this simplified mapping, an event doc has fields for title and url while an instance doc has start/end date fields:
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"url": {
"type": "keyword"
},
"dt": {
"type": "date"
},
"end_dt": {
"type": "date"
},
"event_or_instance": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"event": "instance"
}
}
}
}
}
I know how to get an event and includes all of its instances using has_child:
GET /events/_search
{
"query" : {
"bool": {
"filter": [
{
"term": {
"_id": {
"value": "c8871a79-1907-46c0-958c-9731c529b93e"
}
}
},
{
"has_child" : {
"type" : "instance",
"query" : { "match_all": {} },
"inner_hits" : {
"_source": true,
"sort": [{"dt": "asc"}]
}
}
}
]
}
},
"_source": true
}
This works fine, but is there a way to do this using the Get/Multi-get API instead of the Search API?

Elasticsearch remove a field from an object of an array in a dynamically generated index

I'm trying to delete fields from an object of an array in Elasticsearch. The index has been dynamically generated.
This is the mapping:
{
"mapping": {
"_doc": {
"properties": {
"age": {
"type": "long"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"result": {
"properties": {
"resultid": {
"type": "long"
},
"resultname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
},
"timestamp": {
"type": "date"
}
}
}
}
}
}
this is a document:
{
"result": [
{
"resultid": 69,
"resultname": "SFO"
},
{
"resultid": 151,
"resultname": "NYC"
}
],
"age": 54,
"name": "Jorge",
"timestamp": "2020-04-02T16:07:47.292000"
}
My goals is to remove all the fields resultid in result in all the document of the index. After update the document should look like this:
{
"result": [
{
"resultname": "SFO"
},
{
"resultname": "NYC"
}
],
"age": 54,
"name": "Jorge",
"timestamp": "2020-04-02T16:07:47.292000"
}
I tried using the following articles on stackoverflow but with no luck:
Remove elements/objects From Array in ElasticSearch Followed by Matching Query
remove objects from array that satisfying the condition in elastic search with javascript api
Delete nested array in elasticsearch
Removing objects from nested fields in ElasticSearch
Hopefully someone can help me find a solution.
You should reindex your index in a new one with _reindex API and call a script to remove your fields :
POST _reindex
{
"source": {
"index": "my-index"
},
"dest": {
"index": "my-index-reindex"
},
"script": {
"source": """
for (int i=0;i<ctx._source.result.length;i++) {
ctx._source.result[i].remove("resultid")
}
"""
}
}
After you can delete your first index :
DELETE my-index
And reindex it :
POST _reindex
{
"source": {
"index": "my-index-reindex"
},
"dest": {
"index": "my-index"
}
}
I combined the answer from Luc E with some of my own knowledge in order to reach a solution without reindexing.
POST INDEXNAME/TYPE/_update_by_query?wait_for_completion=false&conflicts=proceed
{
"script": {
"source": "for (int i=0;i<ctx._source.result.length;i++) { ctx._source.result[i].remove(\"resultid\")}"
},
"query": {
"bool": {
"must": [
{
"exists": {
"field": "result.id"
}
}
]
}
}
}
Thanks again Luc!
If your array has more than one copy of element you want to remove. Use this:
ctx._source.some_array.removeIf(tag -> tag == params['c'])

Count total number of words of all documents pointing to specific fields

Someone asked this question but no one seems to answer or tried to suggest possible ways to solve it: https://discuss.elastic.co/t/count-the-number-of-words-in-the-field-elastic-search-6-2/121373
Now, I'm trying to produce a report from Elasticsearch to count the number of WORDS / TOKENS from a specific field called title and content
Is there a proper aggregation for this?
For example, I have this query:
GET web/_search
{
"query":{
"bool":{
"must":[
{
"query_string":{
"fields":[
"title",
"content"
],
"query":"((\"Hello\") AND (\"World\")"
}
},
{
"range":{
"pub_date":{
"from":1569456000,
"to":1570060800
}
}
}
]
}
}
}
And for example, this query produced 23 DOCUMENTS, I want to make a response telling me how MANY words do those 23 documents contain based from the title and content fields?
I would leverage the token_count data type. In your index, you can add a sub-field of type token_count to your title and content fields, like this:
PUT web
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
Then, in order to find out the number of tokens, you can simply run a sum aggregation on the .length sub-field, like this:
POST web/_search
{
"size": 0,
"aggs": {
"title_tokens": {
"sum": {
"field": "title.length"
}
},
"content_tokens": {
"sum": {
"field": "content.length"
}
}
}
}
I am using data type called token_count It will calculate and store the count of tokens for each text. This count value can be utilized to get the token count of fields
PUT index18
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "index18",
"_type" : "_doc",
"_id" : "edJPtW0BVHM68p7X-Wlu",
"_score" : 1.0,
"_source" : {
"title" : "Mayor Isko"
}
},
{
"_index" : "index18",
"_type" : "_doc",
"_id" : "etJQtW0BVHM68p7XGmmr",
"_score" : 1.0,
"_source" : {
"title" : "Isko"
}
}
]
Query
GET index18/_search
{
"query": {"match_all": {}},
"aggs": {
"WordCount": {
"sum": {
"field": "title.length"
}
}
}
}

elasticsearch reindex nested object's element to keyword

I have an index structured like below:
"my_index": {
"mappings": {
"my_index": {
"properties": {
"adId": {
"type": "keyword"
},
"name": {
"type": "keyword"
},
"title": {
"type": "keyword"
},
"creativeStatistics": {
"type": "nested",
"properties": {
"clicks": {
"type": "long"
},
"creativeId": {
"type": "keyword"
}
}
}
}
}
}
}
I need to remove the nested object in a new index and just save the creativeId as a new keyword (to make it clear: I know I will loose the clicks data, and it is not important). It means the final new index scheme would be:
"my_new_index": {
"mappings": {
"my_new_index": {
"properties": {
"adId": {
"type": "keyword"
},
"name": {
"type": "keyword"
},
"title": {
"type": "keyword"
},
"creativeId": {
"type": "keyword"
}
}
}
}
}
Right now each row has exactly one creativeStatistics. and therefore there is no complexity in selecting one of the creativeIds.
I know it is possible to reindex using painless scripts, but I don't know how can I do that. Any help will be appreciated.
You can do it like this:
POST _reindex
{
"source": {
"index": "my_old_index"
},
"dest": {
"index": "my_new_index"
},
"script": {
"source": "if (ctx._source.creativeStatistics != null && ctx._source.creativeStatistics.size() > 0) {ctx._source.creativeId = ctx._source.creativeStatistics[0].creativeId; ctx._source.remove('creativeStatistics')}",
"lang": "painless"
}
}
You can also create a Pipeline by creating a Script Processor as follows:
PUT _ingest/pipeline/my_pipeline
{
"description" : "My pipeline",
"processors" : [
{ "script" : {
"source": "for (item in ctx.creativeStatistics) { if(item.creativeId!=null) {ctx.creativeId = item.creativeId;} }"
}
},
{
"remove": {
"field": "creativeStatistics"
}
}
]
}
Note that if you have multiple nested objects, it would append the last object's creativeId. And it would only add creativeId if a source document has one in its creativeStatistics.
Below is how you can then use reindex query:
POST _reindex
{
"source": {
"index": "creativeindex_src"
},
"dest": {
"index": "creativeindex_dest",
"pipeline": "my_pipeline"
}
}

Resources