How do you implement max(textstring) in ElasticSearch? - elasticsearch

I have a mapping like this:
{
"log": {
"mappings": {
"properties": {
"fingerprint": {
"type": "keyword"
},
"cid": {
"type": "integer"
},
"hash": {
"type": "keyword"
},
"index": {
"properties": {
"_index": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"path": {
"type": "text",
"analyzer": "path_analyzer"
},
"status": {
"type": "keyword"
},
"timestamp": {
"type": "date"
}
}
}
}
}
If I were going to do what I want in SQL, I would do something like this:
SELECT fingerprint, hash, MAX(path), SUM(CASE WHEN status = 1 THEN 1 ELSE 0 END) NOB, COUNT(DISTINCT cid) NOC FROM log GROUP BY fingerprint, hash, status;
I am using NEST and I am doing something like this:
var request = new SearchRequest("log")
{
Scroll = scrollRetain,
Size = size,
Query = new DateRangeQuery()
{
Field = "timestamp",
GreaterThan = startString,
LessThan = endString
},
Aggregations = new TermsAggregation("hashes")
{
Size = size,
Field = "hash",
Aggregations = new CardinalityAggregation("NOC", "cid") &&
new ValueCountAggregation("NOB", "status = 1")
}
};
Not sure how to replicate the MAX(varchar) in ElasticSearch, is this possible?
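One possible direction, offered as a sketch rather than a confirmed answer: strings sort lexicographically, so a nested terms aggregation ordered by key descending with size 1 returns the largest string in each bucket, which approximates MAX(path). The Python below only builds such a request body. It assumes a keyword sub-field named path.keyword (the mapping above does not define one, so it would have to be added), a version that accepts `_key` ordering, and status values stored as the string "1". The name max_path is made up for illustration, and like the NEST code above it groups by hash only.

```python
import json

def build_aggs(size):
    """Build a hypothetical aggregation body approximating the SQL above."""
    return {
        "hashes": {
            "terms": {"field": "hash", "size": size},
            "aggs": {
                # COUNT(DISTINCT cid); cardinality is an approximate count
                "NOC": {"cardinality": {"field": "cid"}},
                # SUM(CASE WHEN status = 1 ...): doc_count of the filtered bucket
                "NOB": {"filter": {"term": {"status": "1"}}},
                # MAX(path): the first key when keys are sorted descending.
                # ASSUMPTION: path.keyword is a keyword sub-field.
                "max_path": {
                    "terms": {
                        "field": "path.keyword",
                        "size": 1,
                        "order": {"_key": "desc"},
                    }
                },
            },
        }
    }

body = {"size": 0, "aggs": build_aggs(100)}
print(json.dumps(body, indent=2))
```

With this shape, the NOB count would be read from the filter bucket's doc_count, and the maximum path from the single bucket key under max_path.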

Related

Elasticsearch remove a field from an object of an array in a dynamically generated index

I'm trying to delete fields from an object of an array in Elasticsearch. The index has been dynamically generated.
This is the mapping:
{
"mapping": {
"_doc": {
"properties": {
"age": {
"type": "long"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"result": {
"properties": {
"resultid": {
"type": "long"
},
"resultname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
},
"timestamp": {
"type": "date"
}
}
}
}
}
}
this is a document:
{
"result": [
{
"resultid": 69,
"resultname": "SFO"
},
{
"resultid": 151,
"resultname": "NYC"
}
],
"age": 54,
"name": "Jorge",
"timestamp": "2020-04-02T16:07:47.292000"
}
My goal is to remove the resultid field from every object in result, across all documents in the index. After the update, the document should look like this:
{
"result": [
{
"resultname": "SFO"
},
{
"resultname": "NYC"
}
],
"age": 54,
"name": "Jorge",
"timestamp": "2020-04-02T16:07:47.292000"
}
I tried the approaches from the following Stack Overflow questions, but with no luck:
Remove elements/objects From Array in ElasticSearch Followed by Matching Query
remove objects from array that satisfying the condition in elastic search with javascript api
Delete nested array in elasticsearch
Removing objects from nested fields in ElasticSearch
Hopefully someone can help me find a solution.
You should reindex into a new index with the _reindex API and use a script to remove the field:
POST _reindex
{
"source": {
"index": "my-index"
},
"dest": {
"index": "my-index-reindex"
},
"script": {
"source": """
for (int i=0;i<ctx._source.result.length;i++) {
ctx._source.result[i].remove("resultid")
}
"""
}
}
Afterwards, you can delete the original index:
DELETE my-index
And reindex the data back into it:
POST _reindex
{
"source": {
"index": "my-index-reindex"
},
"dest": {
"index": "my-index"
}
}
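To make the effect of the script concrete, here is a small Python simulation of what the Painless loop does to a single document. It is only an illustration of the transformation, not something Elasticsearch runs; the function name remove_resultid is made up. Unlike the Painless version, which assumes every document has a result array, this sketch tolerates a missing field.

```python
import copy

def remove_resultid(source):
    """Mimic the Painless loop: drop 'resultid' from every entry in 'result'."""
    updated = copy.deepcopy(source)  # leave the input document untouched
    for entry in updated.get("result", []):
        entry.pop("resultid", None)  # no error if the key is already absent
    return updated

doc = {
    "result": [
        {"resultid": 69, "resultname": "SFO"},
        {"resultid": 151, "resultname": "NYC"},
    ],
    "age": 54,
    "name": "Jorge",
    "timestamp": "2020-04-02T16:07:47.292000",
}

cleaned = remove_resultid(doc)
print(cleaned["result"])
```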
I combined the answer from Luc E with some of my own knowledge in order to reach a solution without reindexing.
POST INDEXNAME/TYPE/_update_by_query?wait_for_completion=false&conflicts=proceed
{
"script": {
"source": "for (int i=0;i<ctx._source.result.length;i++) { ctx._source.result[i].remove(\"resultid\")}"
},
"query": {
"bool": {
"must": [
{
"exists": {
"field": "result.resultid"
}
}
]
}
}
}
Thanks again Luc!
If your array contains more than one copy of the element you want to remove, use this:
ctx._source.some_array.removeIf(tag -> tag == params['c'])

Querying child properties

I want to copy my index over, but skip a property when it matches a specific value. I found out how to exclude a property altogether, but I need something like this:
exclude 'terms property' WHERE 'source_terminology subproperty' not like '%earthMaterials%'
Is this possible in ElasticSearch or should I approach it in a different way?
POST _reindex
{
"source" : {
"index" : "documents3",
"_source":{
"excludes": [
"terms"
]
}
},
"dest" : {
"index" : "documents4"
}
}
This is a reduced version of my mapping:
{
"documents4": {
"mappings": {
"doc": {
"properties": {
"abstract": {
"type": "text"
},
"author": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"terms": {
"properties": {
"source_terminology": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"uri": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
}
}
}
This is roughly what my data looks like now:
{
"_index": "documents4",
"_type": "doc",
"_id": "6bf03d1e-f7dc-40c6-a32d-c9aa09e7b051",
"_score": 1,
"_source": {
"terms": [
{
"source_terminology": "exploration-activity-type",
"label": "feasibility study",
"uri": "http://resource.geosciml.org/classifier/cgi/exploration-activity-type/feasibility-study"
},
{
"source_terminology": "earthMaterialsAT",
"label": "rock",
"uri": "http://www.similarto.com/ontologies/lithology/2010/12/earthMaterialsAT#rock"
}
],
"title": "Miguel Auza Initial Prospectus"
}
}
You can use a Painless script to add the conditions you need.
POST _reindex
{
"source" : {
"index" : "documents4"
},
"dest" : {
"index" : "documents4-copy3"
},
"script": {
"source": "int index = 0; def list = new ArrayList(); for(term in ctx._source.terms) { if(term.source_terminology =~ /^(?:(?!earthMaterials).)+$/) { list.add(0, index) } index++;} for(item in list) { ctx._source.terms.remove(item)}",
"lang": "painless"
}
}
For this to work, script.painless.regex.enabled must be set to true in the elasticsearch.yml file.
Formatted version of the Painless script:
int index = 0;
def list = new ArrayList();
for (term in ctx._source.terms) {
if (term.source_terminology =~ /^(?:(?!earthMaterials).)+$/) {
// Need to add matched index at start to avoid
// index_out_of_bounds_exception when removing items later
list.add(0, index)
// If you try to remove item as soon as match is found,
// you will get concurrent_modification_exception
}
index++;
}
for (item in list) {
ctx._source.terms.remove(item)
}
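The reason for collecting indices first and inserting each one at the front can be seen in a plain Python simulation of the same logic. This is a sketch for illustration only: the function name is made up, and the third sample term is hypothetical. Because the index list ends up in descending order, each removal leaves the lower positions unchanged.

```python
import re

# Same pattern as the Painless script: it matches strings that do NOT
# contain "earthMaterials".
NOT_EARTH = re.compile(r"^(?:(?!earthMaterials).)+$")

def drop_non_matching_terms(terms):
    """Remove, in place, every term whose source_terminology lacks 'earthMaterials'."""
    to_remove = []
    for index, term in enumerate(terms):
        if NOT_EARTH.search(term["source_terminology"]):
            to_remove.insert(0, index)  # prepend, so the list ends up descending
    for index in to_remove:
        del terms[index]  # highest position first; lower positions stay valid
    return terms

terms = [
    {"source_terminology": "exploration-activity-type", "label": "feasibility study"},
    {"source_terminology": "earthMaterialsAT", "label": "rock"},
    # Hypothetical extra entry, added to exercise more than one removal.
    {"source_terminology": "exploration-result", "label": "hypothetical"},
]
drop_non_matching_terms(terms)
print([t["label"] for t in terms])
```

Removing in ascending order instead would shift later elements left after the first deletion, so the second stored index would point at the wrong element or past the end of the list.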

Update field in a document based on the condition in Kibana/Elasticsearch

I am trying to update a particular field in a document based on a condition. In SQL terms, I want to do the following:
Update index indexname
set name = "XXXXXX"
where source: file and name : "YYYYYY"
I am using the request below to update all the documents, but I am not able to add a condition.
POST indexname/_update_by_query
{
"query": {
"term": {
"name": "XXXXX"
}
}
}
Here is the template, I am using:
{
"indexname": {
"mappings": {
"idxname123": {
"_all": {
"enabled": false
},
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date1": {
"type": "date",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"source": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
Could someone show me how to add a condition on source and name, as described above?
Thanks,
Babu
You can use the query below to do what you are looking for. I'm assuming name and source are fields in your index.
POST <your_index_name>/_update_by_query
{
"script": {
"inline": "ctx._source.name = 'XXXXX'",
"lang": "painless"
},
"query": {
"bool": {
"must": [
{
"term": {
"name": {
"value": "YYYYY"
}
}
},
{
"term": {
"source": {
"value": "file"
}
}
}
]
}
}
}
You can use any of the Full Text Queries or Term Queries inside the Bool Query for searching, updating, or deleting.
Do spend some time going through them.
Note: Use Term Queries only on fields of type keyword. With the mapping above, that means querying name.keyword and source.keyword rather than the analyzed text fields.
Hope this helps!
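As a variation on the request above, the following Python sketch builds an _update_by_query body that targets the keyword sub-fields from the question's mapping and passes the new value as a script parameter. It is a sketch under assumptions: the helper name build_update_body is made up, and it uses the script key `source`, which newer versions expect in place of `inline`; an older cluster may still require `inline`.

```python
import json

def build_update_body(old_name, new_name, source_value):
    """Build a hypothetical _update_by_query body: set name where name and source match."""
    return {
        "script": {
            # ASSUMPTION: the cluster accepts "source"; older ones may need "inline".
            "source": "ctx._source.name = params.new_name",
            "lang": "painless",
            "params": {"new_name": new_name},
        },
        "query": {
            "bool": {
                # Every filter clause must match; scoring is irrelevant for an update.
                "filter": [
                    {"term": {"name.keyword": old_name}},
                    {"term": {"source.keyword": source_value}},
                ]
            }
        },
    }

body = build_update_body("YYYYYY", "XXXXXX", "file")
print(json.dumps(body, indent=2))
```

Passing the value through params keeps the script text constant between calls, and the term clauses on the keyword sub-fields compare against the exact stored strings.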

elasticsearch reindex nested object's element to keyword

I have an index structured like below:
"my_index": {
"mappings": {
"my_index": {
"properties": {
"adId": {
"type": "keyword"
},
"name": {
"type": "keyword"
},
"title": {
"type": "keyword"
},
"creativeStatistics": {
"type": "nested",
"properties": {
"clicks": {
"type": "long"
},
"creativeId": {
"type": "keyword"
}
}
}
}
}
}
}
I need to remove the nested object in a new index and just save the creativeId as a new keyword (to make it clear: I know I will lose the clicks data, and that is not important). The final mapping of the new index would be:
"my_new_index": {
"mappings": {
"my_new_index": {
"properties": {
"adId": {
"type": "keyword"
},
"name": {
"type": "keyword"
},
"title": {
"type": "keyword"
},
"creativeId": {
"type": "keyword"
}
}
}
}
}
Right now each row has exactly one creativeStatistics entry, so there is no complexity in selecting one of the creativeIds.
I know it is possible to reindex using Painless scripts, but I don't know how to do that. Any help will be appreciated.
You can do it like this:
POST _reindex
{
"source": {
"index": "my_old_index"
},
"dest": {
"index": "my_new_index"
},
"script": {
"source": "if (ctx._source.creativeStatistics != null && ctx._source.creativeStatistics.size() > 0) {ctx._source.creativeId = ctx._source.creativeStatistics[0].creativeId; ctx._source.remove('creativeStatistics')}",
"lang": "painless"
}
}
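For clarity, the transformation the script applies to each document can be simulated in plain Python. This is only an illustration of the logic: the function name flatten_creative_id and the sample document values are made up.

```python
def flatten_creative_id(source):
    """Mimic the reindex script: copy the first creativeId up and drop the nested field."""
    stats = source.get("creativeStatistics")
    if stats:  # skips both a missing field and an empty list
        source["creativeId"] = stats[0]["creativeId"]
        del source["creativeStatistics"]
    return source

# Hypothetical document; the values are made up for illustration.
doc = {
    "adId": "ad-1",
    "name": "banner",
    "title": "Spring sale",
    "creativeStatistics": [{"clicks": 12, "creativeId": "cr-42"}],
}
print(flatten_creative_id(doc))
```

As in the script, a document without creativeStatistics, or with an empty list, passes through unchanged.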
You can also create a Pipeline by creating a Script Processor as follows:
PUT _ingest/pipeline/my_pipeline
{
"description" : "My pipeline",
"processors" : [
{ "script" : {
"source": "for (item in ctx.creativeStatistics) { if(item.creativeId!=null) {ctx.creativeId = item.creativeId;} }"
}
},
{
"remove": {
"field": "creativeStatistics"
}
}
]
}
Note that if a document has multiple nested objects, the script keeps the creativeId of the last one, since each assignment overwrites the previous value. It also adds creativeId only when the source document has one in its creativeStatistics.
Below is how you can then use reindex query:
POST _reindex
{
"source": {
"index": "creativeindex_src"
},
"dest": {
"index": "creativeindex_dest",
"pipeline": "my_pipeline"
}
}

ElasticSearch - string concat aggregation?

I've got the following simple mapping:
"element": {
"dynamic": "false",
"properties": {
"id": { "type": "string", "index": "not_analyzed" },
"group": { "type": "string", "index": "not_analyzed" },
"type": { "type": "string", "index": "not_analyzed" }
}
}
Which basically is a way to store Group object:
{
id : "...",
elements : [
{id: "...", type: "..."},
...
{id: "...", type: "..."}
]
}
I want to find how many different groups exist sharing the same set of element types (ordered, including repetitions).
An obvious solution would be to change the schema to:
"element": {
"dynamic": "false",
"properties": {
"group": { "type": "string", "index": "not_analyzed" },
"concatenated_list_of_types": { "type": "string", "index": "not_analyzed" }
}
}
But, due to the requirements, we need to be able to exclude some types from group by (aggregation) :(
All fields of the document are mongo ids, so in SQL I would do something like this:
SELECT COUNT(id), concat_value FROM (
SELECT GROUP_CONCAT(type_id), group_id
FROM table
WHERE type_id != 'some_filtered_out_type_id'
GROUP BY group_id
) T GROUP BY concat_value
In Elasticsearch, with the given mapping, it's really easy to filter out types, and counting is also not a problem once we have a concatenated value. Needless to say, the sum aggregation does not work for strings.
How can I get this working? :)
Thanks!
Finally I solved this problem with scripting and by changing the mapping.
{
"mappings": {
"group": {
"dynamic": "false",
"properties": {
"id": { "type": "string", "index": "not_analyzed" },
"elements": { "type": "string", "index": "not_analyzed" }
}
}
}
}
There is still an issue with duplicate elements in the array: ScriptDocValues.Strings for some reason strips out duplicates. Still, here's an aggregation that counts by string concatenation:
{
"aggs": {
"path": {
"scripted_metric": {
"map_script": "key = doc['elements'].join('-'); _agg[key] = _agg[key] ? _agg[key] + 1 : 1",
"combine_script": "_agg",
"reduce_script": "_aggs.collectMany { it.entrySet() }.inject( [:] ) { result, e -> result << [ (e.key):e.value + ( result[ e.key ] ?: 0 ) ]}"
}
}
}
}
The result would be as follows:
"aggregations" : {
"path" : {
"value" : {
"5639abfb5cba47087e8b457e" : 362,
"568bfc495cba47fc308b4567" : 3695,
"5666d9d65cba47701c413c53" : 14,
"5639abfb5cba47087e8b4571-5639abfb5cba47087e8b457b" : 1,
"570eb97abe529e83498b473d" : 1
}
}
}
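The three scripts follow a map/combine/reduce pattern: each shard counts documents per concatenated key, and the per-shard maps are then merged by summing. The Python below is a rough simulation of that flow under assumed data; the shard contents and element ids are made up, and it does not reproduce the de-duplication of values mentioned above.

```python
from collections import Counter

def map_phase(shard_docs):
    """Per shard: count documents by their '-'-joined elements (map and combine steps)."""
    counts = Counter()
    for doc in shard_docs:
        counts["-".join(doc["elements"])] += 1
    return counts

def reduce_phase(shard_counts):
    """Merge the per-shard maps by summing the counts for each key (reduce step)."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return dict(total)

# Hypothetical shards with made-up element ids.
shard_a = [{"elements": ["t1"]}, {"elements": ["t1", "t2"]}]
shard_b = [{"elements": ["t1"]}, {"elements": ["t3"]}]
result = reduce_phase([map_phase(shard_a), map_phase(shard_b)])
print(result)
```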
