Add GeoIP data to old data in an Elasticsearch index

I recently added a GeoIP processor to my ingest pipeline in Elasticsearch. This works well and adds new fields to newly ingested documents.
I wanted to add the GeoIP fields to older data by running an _update_by_query on an index; however, it seems that it doesn't accept "processors" as a parameter.
What I want to do is something like this:
POST my_index*/_update_by_query
{
  "refresh": true,
  "processors": [
    {
      "geoip" : {
        "field": "doc['client_ip']",
        "target_field" : "geo",
        "database_file" : "GeoLite2-City.mmdb",
        "properties": ["continent_name", "country_iso_code", "country_name", "city_name", "timezone", "location"]
      }
    }
  ],
  "script": {
    "day_of_week": {
      "type": "long",
      "script": "emit(doc['@timestamp'].value.withZoneSameInstant(ZoneId.of(doc['geo.timezone'])).getDayOfWeek().getValue())"
    },
    "hour_of_day": {
      "type": "long",
      "script": "emit(doc['@timestamp'].value.withZoneSameInstant(ZoneId.of(doc['geo.timezone'])).getHour())"
    },
    "office_hours": {
      "script": "if (doc['day_of_week'].value < 6 && doc['day_of_week'].value > 0) { if (doc['hour_of_day'].value > 7 && doc['hour_of_day'].value < 19) { return 1; } else { return -1; } } else { return -1; }"
    }
  }
}
I receive the following error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "parse_exception",
        "reason" : "Expected one of [source] or [id] fields, but found none"
      }
    ],
    "type" : "parse_exception",
    "reason" : "Expected one of [source] or [id] fields, but found none"
  },
  "status" : 400
}

Since you have the ingest pipeline ready, you simply need to reference it in your call to the _update_by_query endpoint, like this:
POST my_index*/_update_by_query?pipeline=my-pipeline
^
|
add this
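
In case it helps, here is a minimal sketch of the whole flow, assuming the pipeline still needs to be created under that name (my-pipeline is a placeholder; note that the geoip processor takes the plain field name client_ip, not doc['client_ip']):

PUT _ingest/pipeline/my-pipeline
{
  "processors": [
    {
      "geoip": {
        "field": "client_ip",
        "target_field": "geo",
        "properties": ["continent_name", "country_iso_code", "country_name", "city_name", "timezone", "location"]
      }
    }
  ]
}

POST my_index*/_update_by_query?pipeline=my-pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "geo" }
      }
    }
  }
}

The must_not/exists query is optional; it simply restricts the update to documents that have not been enriched yet, so re-running the call doesn't reprocess everything.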

Related

Is it possible to add a new similarity metric to an existing index in Elasticsearch?

Let's say there is an existing index with a customized BM25 similarity metric like this:
{
  "settings": {
    "index": {
      "similarity": {
        "BM25_v1": {
          "type": "BM25",
          "b": 1.0
        }
      },
      "number_of_replicas": 0,
      "number_of_shards": 3,
      "refresh_interval": "120s"
    }
  }
}
And this similarity metric is used for two fields:
{
  'some_field': {
    'type': 'text',
    'norms': 'true',
    'similarity': 'BM25_v1'
  },
  'another_field': {
    'type': 'text',
    'norms': 'true',
    'similarity': 'BM25_v1'
  },
}
Now, I was wondering if it's possible to add another similarity metric (BM25_v2) to the same index and use this new metric for the another_field, like this:
"index": {
"similarity": {
# The existing metric, not changed.
"BM25_v1": {
"type": "BM25",
"b": 1.0
},
# The new similarity metric for this index.
"BM25_v2": {
"type": "BM25",
"b": 0.0
}
}
}
# ... and use the new metric for one of the fields:
{
'some_field': {
'type': 'text',
'norms': 'true',
'similarity': 'BM25_v1' # This field uses the same old metric.
},
'another_field': {
'type': 'text',
'norms': 'true',
'similarity': 'BM25_v2' # The new metric is used for this field.
},
}
I couldn't find any example of this scenario in the documentation, so I wasn't sure if it is possible at all.
Update: I have already seen this old, still-open issue, which concerns dynamic updates of similarity metrics in Elasticsearch, but it is not completely clear from that discussion what is and isn't possible. There have also been some attempts at supporting some level of similarity updates, though I think they are not documented (e.g. it is possible to change the parameters of an existing similarity metric, say b or k1 in an existing BM25-based metric).
TLDR;
I believe you can't.
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Mapper for [title] conflicts with existing mapper:\n\tCannot update parameter [similarity] from [my_similarity] to [my_similarity_v2]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Mapper for [title] conflicts with existing mapper:\n\tCannot update parameter [similarity] from [my_similarity] to [my_similarity_v2]"
  },
  "status" : 400
}
If you want to do this, I believe you will have to create a new field and reindex the data.
To reproduce
PUT /70973345
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "BM25",
          "b": 1.0
        }
      }
    }
  }
}
PUT /70973345/_mapping
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}
We insert some dummy data, and retrieve it.
POST /70973345/_doc
{
  "title": "I love rock'n roll"
}
POST /70973345/_doc
{
  "title": "I love pasta al'arabita"
}
POST /70973345/_doc
{
  "title": "pasta rock's"
}
GET /70973345/_search?explain=true
{
  "query": {
    "match": {
      "title": "pasta"
    }
  }
}
If we try to update the settings without closing the index first, we get an error.
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Can't update non dynamic settings ...."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Can't update non dynamic settings ...."
  },
  "status" : 400
}
POST /70973345/_close?wait_for_active_shards=0
PUT /70973345/_settings
{
  "index": {
    "similarity": {
      "my_similarity": {
        "type": "BM25",
        "b": 1.0
      },
      "my_similarity_v2": {
        "type": "BM25",
        "b": 0
      }
    }
  }
}
The update works fine, BUT:
PUT /70973345/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "similarity": "my_similarity_v2"
    }
  }
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Mapper for [title] conflicts with existing mapper:\n\tCannot update parameter [similarity] from [my_similarity] to [my_similarity_v2]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Mapper for [title] conflicts with existing mapper:\n\tCannot update parameter [similarity] from [my_similarity] to [my_similarity_v2]"
  },
  "status" : 400
}
It will not work, regardless of whether the index is open or closed, which makes me believe this is not possible. You might need to reindex the existing data into a new index.
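For the record, here is a sketch of that reindex workaround on the same example (the target index name 70973345_v2 is purely illustrative):
PUT /70973345_v2
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity_v2": {
          "type": "BM25",
          "b": 0
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "similarity": "my_similarity_v2" }
    }
  }
}
POST /_reindex
{
  "source": { "index": "70973345" },
  "dest": { "index": "70973345_v2" }
}
After the reindex, queries against 70973345_v2 score title with the new metric.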

ElasticSearch Accessing Nested Documents in Script - Null Pointer Exception

Gist: Trying to write a custom filter on nested documents using Painless. I want to add error checks for when there are no nested documents, to get past the null_pointer_exception.
I have a mapping as such (simplified and obfuscated)
{
  "video_entry" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "captions_added" : {
          "type" : "boolean"
        },
        "category" : {
          "type" : "keyword"
        },
        "is_votable" : {
          "type" : "boolean"
        },
        "members" : {
          "type" : "nested",
          "properties" : {
            "country" : {
              "type" : "keyword"
            },
            "date_of_birth" : {
              "type" : "date"
            }
          }
        }
      }
    }
  }
}
Each video_entry document can have 0 or more members nested documents.
Sample Document
{
  "captions_added": true,
  "category" : "Mental Health",
  "is_votable" : true,
  "members": [
    {"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
    {"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
  ]
}
If one or more nested documents exist, we want to write some Painless scripts that check certain fields across all the nested documents. My script works on mappings with a few documents, but when I try it on a larger set of documents I get null pointer exceptions despite having every null check possible. I've tried various access patterns and error-checking mechanisms, but I still get exceptions.
POST /video_entry/_search
{
  "query": {
    "script": {
      "script": {
        "source": """
          // various NULL checks that I already tried
          // also tried short circuiting on finding null values
          if (!params['_source'].empty && params['_source'].containsKey('members')) {
            def total = 0;
            for (item in params._source.members) {
              // custom logic here
              // if above logic holds true
              // total += 1;
            }
            return total > 3;
          }
          return true;
        """,
        "lang": "painless"
      }
    }
  }
}
Other Statements That I've Tried
if (params._source == null) {
  return true;
}
if (params._source.members == null) {
  return true;
}
if (!ctx._source.contains('members')) {
  return true;
}
if (!params['_source'].empty && params['_source'].containsKey('members') &&
    params['_source'].members.value != null) {
  // logic here
}
if (doc.containsKey('members')) {
  for (mem in params._source.members) {
  }
}
Error Message
&& params._source.members",
^---- HERE"
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
I've looked into changing the structure (flattening the document) and using must_not as indicated in this answer. They don't suit our use case, as we need to incorporate some more custom logic.
Different tutorials use ctx, some use doc, and some use params. To add to the confusion, Debug.explain(doc.members) and Debug.explain(params._source.members) return empty responses, and I'm having a hard time figuring out the types.
Any help is appreciated.
TLDR;
Elasticsearch flattens objects, so that this:
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" : "Smith"
    },
    {
      "first" : "Alice",
      "last" : "White"
    }
  ]
}
turns into this:
{
  "group" : "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" : [ "smith", "white" ]
}
To access the inner values of members, you need to reference them as doc['members.<field>'], since members does not exist as a field of its own.
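As a quick illustration, a script_fields sketch along these lines (assuming an index where user.first is mapped as keyword, so it has doc values; my_index is a placeholder) returns the flattened list of first names rather than an array of user objects:
GET /my_index/_search
{
  "script_fields": {
    "first_names": {
      "script": {
        "lang": "painless",
        "source": "doc['user.first']"
      }
    }
  }
}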
Details
As you may know, Elastic handles inner documents in its own way. [doc]
So you will need to reference them accordingly.
Here is what I did to make it work.
By the way, I have been using Kibana's Dev Tools.
PUT /so_test/
PUT /so_test/_mapping
{
  "properties" : {
    "captions_added" : {
      "type" : "boolean"
    },
    "category" : {
      "type" : "keyword"
    },
    "is_votable" : {
      "type" : "boolean"
    },
    "members" : {
      "properties" : {
        "country" : {
          "type" : "keyword"
        },
        "date_of_birth" : {
          "type" : "date"
        }
      }
    }
  }
}
POST /so_test/_doc/
{
  "captions_added": true,
  "category" : "Mental Health",
  "is_votable" : true,
  "members": [
    {"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
    {"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
  ]
}
POST /so_test/_doc/
{
  "captions_added": true,
  "category" : "Mental breakdown",
  "is_votable" : true,
  "members": []
}
POST /so_test/_doc/
{
  "captions_added": true,
  "category" : "Mental success",
  "is_votable" : true,
  "members": [
    {"country": "France", "date_of_birth": "1998-04-04T00:00:00"},
    {"country": "Japan", "date_of_birth": "1999-05-05T00:00:00"}
  ]
}
And then I did this query (it is only a bool filter, but I guess making it work for your own use case should not prove too difficult)
GET /so_test/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": """
              def flag = false;
              // /!\ notice how the field is referenced /!\
              if (doc['members.country'].size() != 0) {
                for (item in doc['members.country']) {
                  if (item == params.country) {
                    flag = true;
                  }
                }
              }
              return flag;
            """,
            "params": {
              "country": "Japan"
            }
          }
        }
      }
    }
  }
}
BTW, you were saying you were a bit confused about the execution contexts for Painless; you can find some details about them in the documentation.
[doc]
In this case the filter context is the one we want to look at.
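Since you also mentioned having a hard time figuring out the types: inside that same filter context you can let Debug.explain throw on purpose, and the error response will name the runtime type. A sketch against the test index above:
GET /so_test/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "Debug.explain(doc['members.country']);"
          }
        }
      }
    }
  }
}
The deliberately failed request should report a doc-values type such as ScriptDocValues.Strings, which behaves like a list of the flattened values rather than the raw maps you would get from params._source.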

Painless scripting Elastic Search : variable is not defined error when trying to access values from doc

I am trying to learn Painless scripting in Elasticsearch by following the official documentation (https://www.elastic.co/guide/en/elasticsearch/painless/6.0/painless-examples.html).
A sample of the document I am working with:
{
  "uid" : "CT6716617",
  "old_username" : "xyz",
  "new_username" : "abc"
}
The following script_fields query, which uses params._source to access document values, works:
{
  "script_fields": {
    "sales_price": {
      "script": {
        "lang": "painless",
        "source": "(params._source.old_username != params._source.new_username) ? \"change\" : \"nochange\"",
        "params": {
          "change": "change"
        }
      }
    }
  }
}
The same query using the doc map to access values fails:
{
  "script_fields": {
    "sales_price": {
      "script": {
        "lang": "painless",
        "source": "(doc['old_username'] != doc['new_username']) ? \"change\" : \"nochange\"",
        "params": {
          "change": "change"
        }
      }
    }
  }
}
The error message I get is:
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Variable [old_username] is not defined."
}
Based on the documentation, both approaches should work, especially the second one. I am not sure what I am missing here.

ElasticSearch - Copy one field value to other field for all documents

We have a field "name" in the index. We recently added a new field "alias".
I want to copy the name field's value to the new alias field for all documents.
Is there an update query that will do this?
If that is not possible, help me achieve this.
Thanks in advance.
I am trying this query:
http://URL/index/profile/_update_by_query
{
  "query": {
    "constant_score" : {
      "filter" : {
        "exists" : { "field" : "name" }
      }
    }
  },
  "script" : "ctx._source.alias = name;"
}
In the script, I am not sure how to reference the name field.
I am getting this error:
{
  "error": {
    "root_cause": [
      {
        "type": "class_cast_exception",
        "reason": "java.lang.String cannot be cast to java.util.Map"
      }
    ],
    "type": "class_cast_exception",
    "reason": "java.lang.String cannot be cast to java.util.Map"
  },
  "status": 500
}
Indeed, the syntax has changed a tiny little bit since then. You need to modify your query to this:
POST index/_update_by_query
{
  "query": {
    "constant_score" : {
      "filter" : {
        "exists" : { "field" : "name" }
      }
    }
  },
  "script" : {
    "inline": "ctx._source.alias = ctx._source.name;"
  }
}
UPDATE for ES 6
Use source instead of inline
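That is, on ES 6+ the same query becomes:
POST index/_update_by_query
{
  "query": {
    "constant_score" : {
      "filter" : {
        "exists" : { "field" : "name" }
      }
    }
  },
  "script" : {
    "source": "ctx._source.alias = ctx._source.name;"
  }
}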

CSV geodata into elasticsearch as a geo_point type using logstash

Below is a reproducible example of the problem I am having, using the most recent versions of logstash and elasticsearch.
I am using logstash to input geospatial data from a csv into elasticsearch as geo_points.
The CSV looks like the following:
$ head simple_base_map.csv
"lon","lat"
-1.7841,50.7408
-1.7841,50.7408
-1.78411,50.7408
-1.78412,50.7408
-1.78413,50.7408
-1.78414,50.7408
-1.78415,50.7408
-1.78416,50.7408
-1.78416,50.7408
I have created a mapping template that looks like the following:
$ cat simple_base_map_template.json
{
  "template": "base_map_template",
  "order": 1,
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "node_points" : {
      "properties" : {
        "location" : { "type" : "geo_point" }
      }
    }
  }
}
and have a logstash config file that looks like the following:
$ cat simple_base_map.conf
input {
  stdin {}
}
filter {
  csv {
    columns => [
      "lon", "lat"
    ]
  }
  if [lon] == "lon" {
    drop { }
  } else {
    mutate {
      remove_field => [ "message", "host", "@timestamp", "@version" ]
    }
    mutate {
      convert => { "lon" => "float" }
      convert => { "lat" => "float" }
    }
    mutate {
      rename => {
        "lon" => "[location][lon]"
        "lat" => "[location][lat]"
      }
    }
  }
}
output {
  stdout { codec => dots }
  elasticsearch {
    index => "base_map_simple"
    template => "simple_base_map_template.json"
    document_type => "node_points"
  }
}
I then run the following:
$ cat simple_base_map.csv | logstash-2.1.3/bin/logstash -f simple_base_map.conf
Settings: Default filter workers: 16
Logstash startup completed
....................................................................................................Logstash shutdown completed
However, when looking at the base_map_simple index, it appears the documents do not have a location field of type geo_point; instead, location is mapped as two doubles, lat and lon.
$ curl -XGET 'localhost:9200/base_map_simple?pretty'
{
  "base_map_simple" : {
    "aliases" : { },
    "mappings" : {
      "node_points" : {
        "properties" : {
          "location" : {
            "properties" : {
              "lat" : {
                "type" : "double"
              },
              "lon" : {
                "type" : "double"
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1457355015883",
        "uuid" : "luWGyfB3ToKTObSrbBbcbw",
        "number_of_replicas" : "1",
        "number_of_shards" : "5",
        "version" : {
          "created" : "2020099"
        }
      }
    },
    "warmers" : { }
  }
}
How would I need to change any of the above files to ensure that the data goes into Elasticsearch as a geo_point type?
Finally, I would like to be able to carry out a nearest neighbour search on the geo_points by using a command such as the following:
curl -XGET 'localhost:9200/base_map_simple/_search?pretty' -d'
{
  "size": 1,
  "sort": {
    "_geo_distance" : {
      "location" : {
        "lat" : 50,
        "lon" : -1
      },
      "order" : "asc",
      "unit": "m"
    }
  }
}'
Thanks
The problem is that in your elasticsearch output you named the index base_map_simple while in your template the template property is base_map_template, hence the template is not being applied when creating the new index. The template property needs to somehow match the name of the index being created in order for the template to kick in.
It will work if you simply change the latter to base_map_*, i.e. as in:
{
  "template": "base_map_*", <--- change this
  "order": 1,
  "settings": {
    "index.number_of_shards": 1
  },
  "mappings": {
    "node_points": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
UPDATE
Make sure to delete the current index as well as the template first, i.e.
curl -XDELETE localhost:9200/base_map_simple
curl -XDELETE localhost:9200/_template/logstash
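Then re-run the logstash command and check the mapping again; if the template kicked in, the location field should now look something like this:
$ curl -XGET 'localhost:9200/base_map_simple/_mapping?pretty'
...
"location" : {
  "type" : "geo_point"
}
...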
