Query on number field matching string fields - elasticsearch

I have an ELK stack set up. When I perform a query on a number field, it also matches against string fields. For example, I am sending load balancer logs to ELK, and if I run backend_processing_time:>5 against them, it also matches documents whose backend_processing_time is 0.001.
On the Kibana interface, it shows the query matching strings in the request message. I cannot understand how a query against a number field can match a string.
In the Dev Tools section of Kibana I tried to run the same query:
GET _search
{
  "query": {
    "range": {
      "backend_processing_time": {
        "gte": 50000000000
      }
    }
  }
}
Even with such a large backend_processing_time threshold I am still getting results, and I cannot understand why this is happening.
I searched on other fields of number type as well and found that all queries on number fields are getting matched against string-type fields.
Below is a sample search result that I get for the query backend_processing_time:>500000000. You can see in this result that the backend_processing_time field is tiny, yet it still gets a hit.
{
"_index": "logstash-2017.05.10",
"_type": "prod-quizelb-logs",
"_id": "AVvzYRgL49GPTZAKoDer",
"_score": null,
"_source": {
"backendport": 80,
"received_bytes": 0,
"request": "http://en.meaww.com:80/locales/en.json",
"backend_response": 200,
"verb": "GET",
"message": "2017-05-10T17:19:52.881044Z Prod-ELB 172.68.144.71:34803 10.1.91.253:80 0.000075 0.000606 0.000019 200 200 0 1881 \"GET http://en.meaww.com:80/locales/en.json HTTP/1.1\" \"Mozilla/5.0 (Linux; Android 6.0.1; SM-C900F Build/MMB29M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/58.0.3029.83 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/122.0.0.17.71;]\" - -\n",
"type": "prod-quizelb-logs",
"clientport": 34803,
"request_processing_time": 0.000075,
"urihost": "en.meaww.com:80",
"response_processing_time": 0.000019,
"path": "/locales/en.json",
"#timestamp": "2017-05-10T17:21:18.280Z",
"port": "80",
"response": 200,
"bytes": 1881,
"clientip": "172.68.144.71",
"proto": "http",
"#version": "1",
"elb": "Prod-ELB",
"httpversion": "1.1",
"backendip": "10.1.91.253",
"backend_processing_time": 0.000606,
"timestamp": "2017-05-10T17:19:52.881044Z"
},
"fields": {
"#timestamp": [
1494436878280
],
"timestamp": [
1494436792881
]
},
"highlight": {
"backend_processing_time.keyword": [
"#kibana-highlighted-field#6.06E-4#/kibana-highlighted-field#"
],
"request": [
"#kibana-highlighted-field#http#/kibana-highlighted-field#://#kibana-highlighted-field#en.meaww.com#/kibana-highlighted-field#:#kibana-highlighted-field#80#/kibana-highlighted-field#/#kibana-highlighted-field#locales#/kibana-highlighted-field#/#kibana-highlighted-field#en.json#/kibana-highlighted-field#"
],
"elb.keyword": [
"#kibana-highlighted-field#Prod-ELB#/kibana-highlighted-field#"
],
"urihost.keyword": [
"#kibana-highlighted-field#en.meaww.com:80#/kibana-highlighted-field#"
],
"verb": [
"#kibana-highlighted-field#GET#/kibana-highlighted-field#"
],
"request.keyword": [
"#kibana-highlighted-field#http://en.meaww.com:80/locales/en.json#/kibana-highlighted-field#"
],
"type": [
"#kibana-highlighted-field#prod#/kibana-highlighted-field#-#kibana-highlighted-field#quizelb#/kibana-highlighted-field#-#kibana-highlighted-field#logs#/kibana-highlighted-field#"
],
"message": [
"2017-05-10T17:19:#kibana-highlighted-field#52.881044Z#/kibana-highlighted-field# #kibana-highlighted-field#Prod#/kibana-highlighted-field#-#kibana-highlighted-field#ELB#/kibana-highlighted-field# 172.68.144.71:34803 10.1.91.253:#kibana-highlighted-field#80#/kibana-highlighted-field# 0.000075 0.000606 0.000019 200 200 0 1881 \"#kibana-highlighted-field#GET#/kibana-highlighted-field# #kibana-highlighted-field#http#/kibana-highlighted-field#://#kibana-highlighted-field#en.meaww.com#/kibana-highlighted-field#:#kibana-highlighted-field#80#/kibana-highlighted-field#/#kibana-highlighted-field#locales#/kibana-highlighted-field#/#kibana-highlighted-field#en.json#/kibana-highlighted-field# #kibana-highlighted-field#HTTP#/kibana-highlighted-field#/1.1\" \"#kibana-highlighted-field#Mozilla#/kibana-highlighted-field#/5.0 (#kibana-highlighted-field#Linux#/kibana-highlighted-field#; #kibana-highlighted-field#Android#/kibana-highlighted-field# #kibana-highlighted-field#6.0.1#/kibana-highlighted-field#; #kibana-highlighted-field#SM#/kibana-highlighted-field#-#kibana-highlighted-field#C900F#/kibana-highlighted-field# #kibana-highlighted-field#Build#/kibana-highlighted-field#/#kibana-highlighted-field#MMB29M#/kibana-highlighted-field#; #kibana-highlighted-field#wv#/kibana-highlighted-field#) #kibana-highlighted-field#AppleWebKit#/kibana-highlighted-field#/#kibana-highlighted-field#537.36#/kibana-highlighted-field# (#kibana-highlighted-field#KHTML#/kibana-highlighted-field#, #kibana-highlighted-field#like#/kibana-highlighted-field# #kibana-highlighted-field#Gecko#/kibana-highlighted-field#) #kibana-highlighted-field#Version#/kibana-highlighted-field#/4.0 #kibana-highlighted-field#Chrome#/kibana-highlighted-field#/#kibana-highlighted-field#58.0.3029.83#/kibana-highlighted-field# #kibana-highlighted-field#Mobile#/kibana-highlighted-field# #kibana-highlighted-field#Safari#/kibana-highlighted-field#/#kibana-highlighted-field#537.36#/kibana-highlighted-field# [#kibana-highlighted-field#FB_IAB#/kibana-highlighted-field#/#kibana-highlighted-field#FB4A#/kibana-highlighted-field#;#kibana-highlighted-field#FBAV#/kibana-highlighted-field#/122.0.0.17.71;]\" - -\n"
],
"urihost": [
"#kibana-highlighted-field#en.meaww.com#/kibana-highlighted-field#:#kibana-highlighted-field#80#/kibana-highlighted-field#"
],
"path": [
"/#kibana-highlighted-field#locales#/kibana-highlighted-field#/#kibana-highlighted-field#en.json#/kibana-highlighted-field#"
],
"verb.keyword": [
"#kibana-highlighted-field#GET#/kibana-highlighted-field#"
],
"proto.keyword": [
"#kibana-highlighted-field#http#/kibana-highlighted-field#"
],
"port": [
"#kibana-highlighted-field#80#/kibana-highlighted-field#"
],
"type.keyword": [
"#kibana-highlighted-field#prod-quizelb-logs#/kibana-highlighted-field#"
],
"proto": [
"#kibana-highlighted-field#http#/kibana-highlighted-field#"
],
"elb": [
"#kibana-highlighted-field#Prod#/kibana-highlighted-field#-#kibana-highlighted-field#ELB#/kibana-highlighted-field#"
],
"backend_processing_time": [
"#kibana-highlighted-field#6.06E#/kibana-highlighted-field#-4"
],
"port.keyword": [
"#kibana-highlighted-field#80#/kibana-highlighted-field#"
]
},
"sort": [
1494436878280
]
}
EDIT
I retrieved the mapping by running GET /logstash-2017.05.11/_mapping/prod-quizelb-logs in the Kibana console.
The mapping I get for backend_processing_time is:
"backend_processing_time": {
"type": "text",
"norms": false,
"fields": {
"keyword": {
"type": "keyword"
}
}
}
So the field is indeed of type text, which is what causes this behaviour.
Now I have another point of confusion: Kibana shows this field as a number, but Elasticsearch reports its type as text. Also, the mapping was created dynamically, as I never defined it myself; I think it is created by Logstash at the time the grok filter is applied.

You need to take control of the mapping of those indices so that your field is actually a number. Otherwise you cannot be sure what field type you will end up with. So, basically, you need something like this, either in an index template or in an explicit static mapping:
"backend_processing_time": {
"type": "integer"
}
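For instance, here is a minimal sketch of a legacy (5.x-style) index template that would force the ELB timing fields to be numeric on newly created daily indices; the template name is hypothetical, and float is used because the timing values are fractional seconds:
PUT _template/logstash_elb_numbers
{
  "template": "logstash-*",
  "mappings": {
    "prod-quizelb-logs": {
      "properties": {
        "backend_processing_time": { "type": "float" },
        "request_processing_time": { "type": "float" },
        "response_processing_time": { "type": "float" }
      }
    }
  }
}
Note that a template only affects indices created after it is added, so the existing logstash-2017.05.* indices would need to be reindexed (or simply age out) before range queries behave as expected.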

Remove the space in your query_string, i.e. your query_string should look like this:
backend_processing_time:>0.5
Read more about the query_string syntax here.

Related

HTTP metrics per endpoint uri in spring boot app

I have enabled Actuator for my project and I am interested in metrics per endpoint URI in my application.
I have two endpoints, / and /hello. When I visit /actuator/metrics/http.server.requests I get the following result:
{
"name": "http.server.requests",
"description": null,
"baseUnit": "seconds",
"measurements": [
{
"statistic": "COUNT",
"value": 11
},
{
"statistic": "TOTAL_TIME",
"value": 0.07724317
},
{
"statistic": "MAX",
"value": 0.024692496
}
],
"availableTags": [
{
"tag": "exception",
"values": [
"None"
]
},
{
"tag": "method",
"values": [
"POST",
"GET"
]
},
{
"tag": "uri",
"values": [
"/actuator/metrics/{requiredMetricName}",
"/actuator/health",
"/**",
"/hello",
"/"
]
},
{
"tag": "outcome",
"values": [
"CLIENT_ERROR",
"SUCCESS"
]
},
{
"tag": "status",
"values": [
"404",
"200"
]
}
]
}
However, I am interested in the metrics for each individual endpoint, / and /hello, with information such as average response time, max, min, etc.
Is there a configuration parameter for this? The above only provides aggregate metrics information; I would like to see each endpoint's metrics.
You can use tags to drill down into the aggregated results. For example, if you just want the metrics for /hello you would request:
/actuator/metrics/http.server.requests?tag=uri:/hello
And you can combine tags; for example, if you want the metrics for all requests made to /hello that returned a 200, you could request:
/actuator/metrics/http.server.requests?tag=uri:/hello&tag=status:200
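As a quick sketch (assuming the app runs on the default port 8080), the same requests with curl would be:
curl 'http://localhost:8080/actuator/metrics/http.server.requests?tag=uri:/hello'
curl 'http://localhost:8080/actuator/metrics/http.server.requests?tag=uri:/hello&tag=status:200'
The measurements in the filtered response are scoped to that tag combination, so the average response time for /hello is TOTAL_TIME divided by COUNT, and MAX is the largest observed time for that endpoint within the metric's time window.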

elasticsearch filebeat mapper_parsing_exception when using decode_json_fields

I have ECK set up and I'm using Filebeat to ship logs from Kubernetes to Elasticsearch.
I've recently added the decode_json_fields processor to my configuration, so that I'm able to decode the JSON that is usually in the message field:
- decode_json_fields:
    fields: ["message"]
    process_array: false
    max_depth: 10
    target: "log"
    overwrite_keys: true
    add_error_key: true
However, logs have stopped appearing since adding it.
Example log:
{
"_index": "filebeat-7.9.1-2020.10.01-000001",
"_type": "_doc",
"_id": "wF9hB3UBtUOF3QRTBcts",
"_score": 1,
"_source": {
"#timestamp": "2020-10-08T08:43:18.672Z",
"kubernetes": {
"labels": {
"controller-uid": "9f3f9d08-cfd8-454d-954d-24464172fa37",
"job-name": "stream-hatchet-cron-manual-rvd"
},
"container": {
"name": "stream-hatchet-cron",
"image": "<redacted>.dkr.ecr.us-east-2.amazonaws.com/stream-hatchet:v0.1.4"
},
"node": {
"name": "ip-172-20-32-60.us-east-2.compute.internal"
},
"pod": {
"uid": "041cb6d5-5da1-4efa-b8e9-d4120409af4b",
"name": "stream-hatchet-cron-manual-rvd-bh96h"
},
"namespace": "default"
},
"ecs": {
"version": "1.5.0"
},
"host": {
"mac": [],
"hostname": "ip-172-20-32-60",
"architecture": "x86_64",
"name": "ip-172-20-32-60",
"os": {
"codename": "Core",
"platform": "centos",
"version": "7 (Core)",
"family": "redhat",
"name": "CentOS Linux",
"kernel": "4.9.0-11-amd64"
},
"containerized": false,
"ip": []
},
"cloud": {
"instance": {
"id": "i-06c9d23210956ca5c"
},
"machine": {
"type": "m5.large"
},
"region": "us-east-2",
"availability_zone": "us-east-2a",
"account": {
"id": "<redacted>"
},
"image": {
"id": "ami-09d3627b4a09f6c4c"
},
"provider": "aws"
},
"stream": "stdout",
"message": "{\"message\":{\"log_type\":\"cron\",\"status\":\"start\"},\"level\":\"info\",\"timestamp\":\"2020-10-08T08:43:18.670Z\"}",
"input": {
"type": "container"
},
"log": {
"offset": 348,
"file": {
"path": "/var/log/containers/stream-hatchet-cron-manual-rvd-bh96h_default_stream-hatchet-cron-73069980b418e2aa5e5dcfaf1a29839a6d57e697c5072fea4d6e279da0c4e6ba.log"
}
},
"agent": {
"type": "filebeat",
"version": "7.9.1",
"hostname": "ip-172-20-32-60",
"ephemeral_id": "6b3ba0bd-af7f-4946-b9c5-74f0f3e526b1",
"id": "0f7fff14-6b51-45fc-8f41-34bd04dc0bce",
"name": "ip-172-20-32-60"
}
},
"fields": {
"#timestamp": [
"2020-10-08T08:43:18.672Z"
],
"suricata.eve.timestamp": [
"2020-10-08T08:43:18.672Z"
]
}
}
In the Filebeat logs I can see the following error:
2020-10-08T09:25:43.562Z WARN [elasticsearch] elasticsearch/client.go:407 Cannot
index event
publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0x36b243a0,
ext:63737745936, loc:(*time.Location)(nil)}, Meta:null,
Fields:{"agent":{"ephemeral_id":"5f8afdba-39c3-4fb7-9502-be7ef8f2d982","hostname":"ip-172-20-32-60","id":"0f7fff14-6b51-45fc-8f41-34bd04dc0bce","name":"ip-172-20-32-60","type":"filebeat","version":"7.9.1"},"cloud":{"account":{"id":"700849607999"},"availability_zone":"us-east-2a","image":{"id":"ami-09d3627b4a09f6c4c"},"instance":{"id":"i-06c9d23210956ca5c"},"machine":{"type":"m5.large"},"provider":"aws","region":"us-east-2"},"ecs":{"version":"1.5.0"},"host":{"architecture":"x86_64","containerized":false,"hostname":"ip-172-20-32-60","ip":["172.20.32.60","fe80::af:9fff:febe:dc4","172.17.0.1","100.96.1.1","fe80::6010:94ff:fe17:fbae","fe80::d869:14ff:feb0:81b3","fe80::e4f3:b9ff:fed8:e266","fe80::1c19:bcff:feb3:ce95","fe80::fc68:21ff:fe08:7f24","fe80::1cc2:daff:fe84:2a5a","fe80::3426:78ff:fe22:269a","fe80::b871:52ff:fe15:10ab","fe80::54ff:cbff:fec0:f0f","fe80::cca6:42ff:fe82:53fd","fe80::bc85:e2ff:fe5f:a60d","fe80::e05e:b2ff:fe4d:a9a0","fe80::43a:dcff:fe6a:2307","fe80::581b:20ff:fe5f:b060","fe80::4056:29ff:fe07:edf5","fe80::c8a0:5aff:febd:a1a3","fe80::74e3:feff:fe45:d9d4","fe80::9c91:5cff:fee2:c0b9"],"mac":["02:af:9f:be:0d:c4","02:42:1b:56:ee:d3","62:10:94:17:fb:ae","da:69:14:b0:81:b3","e6:f3:b9:d8:e2:66","1e:19:bc:b3:ce:95","fe:68:21:08:7f:24","1e:c2:da:84:2a:5a","36:26:78:22:26:9a","ba:71:52:15:10:ab","56:ff:cb:c0:0f:0f","ce:a6:42:82:53:fd","be:85:e2:5f:a6:0d","e2:5e:b2:4d:a9:a0","06:3a:dc:6a:23:07","5a:1b:20:5f:b0:60","42:56:29:07:ed:f5","ca:a0:5a:bd:a1:a3","76:e3:fe:45:d9:d4","9e:91:5c:e2:c0:b9"],"name":"ip-172-20-32-60","os":{"codename":"Core","family":"redhat","kernel":"4.9.0-11-amd64","name":"CentOS
Linux","platform":"centos","version":"7
(Core)"}},"input":{"type":"container"},"kubernetes":{"container":{"image":"700849607999.dkr.ecr.us-east-2.amazonaws.com/stream-hatchet:v0.1.4","name":"stream-hatchet-cron"},"labels":{"controller-uid":"a79daeac-b159-4ba7-8cb0-48afbfc0711a","job-name":"stream-hatchet-cron-manual-c5r"},"namespace":"default","node":{"name":"ip-172-20-32-60.us-east-2.compute.internal"},"pod":{"name":"stream-hatchet-cron-manual-c5r-7cx5d","uid":"3251cc33-48a9-42b1-9359-9f6e345f75b6"}},"log":{"level":"info","message":{"log_type":"cron","status":"start"},"timestamp":"2020-10-08T09:25:36.916Z"},"message":"{"message":{"log_type":"cron","status":"start"},"level":"info","timestamp":"2020-10-08T09:25:36.916Z"}","stream":"stdout"},
Private:file.State{Id:"native::30998361-66306", PrevId:"",
Finished:false, Fileinfo:(*os.fileStat)(0xc001c14dd0),
Source:"/var/log/containers/stream-hatchet-cron-manual-c5r-7cx5d_default_stream-hatchet-cron-4278d956fff8641048efeaec23b383b41f2662773602c3a7daffe7c30f62fe5a.log",
Offset:539, Timestamp:time.Time{wall:0xbfd7d4a1e556bd72,
ext:916563812286, loc:(*time.Location)(0x607c540)}, TTL:-1,
Type:"container", Meta:map[string]string(nil),
FileStateOS:file.StateOS{Inode:0x1d8ff59, Device:0x10302},
IdentifierName:"native"}, TimeSeries:false}, Flags:0x1,
Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400):
{"type":"mapper_parsing_exception","reason":"failed to parse field
[log.message] of type [keyword] in document with id
'56aHB3UBLgYb8gz801DI'. Preview of field's value: '{log_type=cron,
status=start}'","caused_by":{"type":"illegal_state_exception","reason":"Can't
get text on a START_OBJECT at 1:113"}}
It throws an error because log.message is apparently of type "keyword", even though that field does not exist in the index mapping.
I thought this might be an issue with "target": "log", so I've tried changing it to something arbitrary like "my_parsed_message" or "m_log" or "mlog", and I get the same error for all of them:
{"type":"mapper_parsing_exception","reason":"failed to parse field
[mlog.message] of type [keyword] in document with id
'J5KlDHUB_yo5bfXcn2LE'. Preview of field's value: '{log_type=cron,
status=end}'","caused_by":{"type":"illegal_state_exception","reason":"Can't
get text on a START_OBJECT at 1:217"}}
Elastic version: 7.9.2
The problem is that some of your JSON messages contain a message field that is sometimes a simple string and at other times a nested JSON object (as in the case you're showing in your question).
After this index was created, the very first message that was parsed was probably a string, and hence the mapping was modified to add the following field (line 10553):
"mlog": {
"properties": {
...
"message": {
"type": "keyword",
"ignore_above": 1024
},
}
}
You'll find the same pattern for my_parsed_message (line 10902), my_parsed_logs (line 10742), etc.
Hence, the next message that comes in with message being a JSON object, like
{"message":{"log_type":"cron","status":"start"}, ...
will be rejected, because it's an object, not a string.
Looking at the fields of your custom JSON, it seems you don't really have control over either their taxonomy (i.e. naming) or what they contain.
If you're serious about wanting to search within those custom fields (which I think you are, since you're parsing the field; otherwise you'd just store the stringified JSON), then I can only suggest working out a proper taxonomy to make sure they all get a consistent type.
If all you care about is logging your data, then I suggest simply disabling indexing of that message field. Another option is to set dynamic: false in your mapping so that those fields are ignored, i.e. so they don't modify your mapping.
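As a rough sketch of that second approach (assuming you keep target: "log" and can recreate the index or adjust its template; the index name here is only a placeholder), you could stop dynamic mapping under log and leave the parsed object unindexed:
PUT my-filebeat-index
{
  "mappings": {
    "properties": {
      "log": {
        "type": "object",
        "dynamic": false,
        "properties": {
          "message": {
            "type": "object",
            "enabled": false
          }
        }
      }
    }
  }
}
With enabled: false the parsed object is kept in _source but never added to the mapping, so a message that is sometimes a string and sometimes an object no longer triggers a mapper_parsing_exception, at the cost of not being searchable. Keep in mind that Filebeat's own log.offset and log.file.path fields also live under log, so in practice you would merge this into the existing Filebeat index template rather than replace it.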

elasticsearch: unable to set geo_shape value using XContentBuilder

I have the following mapping in Elasticsearch. I am able to PUT documents using the Sense plugin but unable to do so using XContentBuilder to set the geo_shape field value. I am getting the following error:
error:
[106]: index [streets], type [street], id [{dc872755-f307-4c5e-93f6-bba9c95791c7}], message [MapperParsingException[failed to parse [shape]]; nested: ElasticsearchParseException[shape must be an object consisting of type and coordinates];]
mapping:
PUT /streets
{
  "mappings": {
    "street": {
      "properties": {
        "id": {
          "type": "string"
        },
        "shape": {
          "type": "geo_shape",
          "tree": "quadtree"
        }
      }
    }
  }
}
code:
val bulkRequest: BulkRequestBuilder = esClient.prepareBulk()
// in loop
xb = jsonBuilder().startObject()
xb.field("id", guid)
xb.field("shape", jsonString) // removing this line creates the index OK but without the geo_shape
xb.endObject()
bulkRequest.add(esClient.prepareIndex("streets", "street", guid).setSource(xb))
// end loop
val bulkResponse: BulkResponse = bulkRequest.execute().actionGet()
if (bulkResponse.hasFailures) {
  println(bulkResponse.buildFailureMessage())
}
jsonString:
{
  "id": "{98b8fd8d-074c-4349-a83b-6e892bf2d0ef}",
  "shape": {
    "type": "LineString",
    "coordinates": [
      [-70.81866815832467, 43.12187109162505],
      [-70.83054813653018, 43.15917412985851],
      [-70.81320737213957, 43.23522269547419],
      [-70.90108590067649, 43.28102004268419]
    ],
    "crs": {
      "type": "name",
      "properties": {
        "name": "EPSG:4326"
      }
    }
  }
}
I'd appreciate any feedback.
Thanks
It might be a bit late for you, but this could help someone facing a similar issue nowadays.
Following your index mapping for the streets index, we have these properties: id and shape.
Your error message states:
shape must be an object consisting of type and coordinates
So in your concrete case the crs object is simply not accepted: Elasticsearch assumes WGS-84 (EPSG:4326) coordinates and rejects extra members inside a geo_shape value.
Here is an example of how to add a document to the streets index using curl:
curl -X POST "localhost:9200/streets/_doc?pretty" -H 'Content-Type: application/json' -d '
{
"id": 123,
"shape": {
"type": "Polygon",
"coordinates": [
[
[
32.85444259643555,
39.928694653732364
],
[
32.847232818603516,
39.9257985682691
],
[
32.837791442871094,
39.91947941109337
],
[
32.837276458740234,
39.91579296675271
],
[
32.85392761230469,
39.913423004886894
],
[
32.86937713623047,
39.91329133793421
],
[
32.88036346435547,
39.91539797880347
],
[
32.85444259643555,
39.928694653732364
]
]
]
}
}'
If you need to add a LineString instead of a Polygon, just change the 'type' attribute inside the 'shape' object.
I hope this helps people who have to add documents with shapes to an Elasticsearch database.
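For the LineString from the question, the document would look like this (a sketch only, following the same curl form as the Polygon example above; note that the crs member is dropped, and you may need to adjust the endpoint to streets/street on the older version used in the question):
curl -X POST "localhost:9200/streets/_doc?pretty" -H 'Content-Type: application/json' -d '
{
  "id": "{98b8fd8d-074c-4349-a83b-6e892bf2d0ef}",
  "shape": {
    "type": "LineString",
    "coordinates": [
      [-70.81866815832467, 43.12187109162505],
      [-70.83054813653018, 43.15917412985851],
      [-70.81320737213957, 43.23522269547419],
      [-70.90108590067649, 43.28102004268419]
    ]
  }
}'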

How to index geojson file in elasticsearch?

I am trying to store spatial data in the form of GeoJSON files, CSV files and shapefiles in Elasticsearch using Python. I am new to Elasticsearch, and even after following the documentation I am not able to index it successfully. Any help would be appreciated.
Sample GeoJSON file:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {
"ID_0": 105,
"ISO": "IND",
"NAME_0": "India",
"ID_1": 1288,
"NAME_1": "Telangana",
"ID_2": 15715,
"NAME_2": "Telangana",
"VARNAME_2": null,
"NL_NAME_2": null,
"HASC_2": "IN.TS.AD",
"CC_2": null,
"TYPE_2": "State",
"ENGTYPE_2": "State",
"VALIDFR_2": "Unknown",
"VALIDTO_2": "Present",
"REMARKS_2": null,
"Shape_Leng": 8.103535,
"Shape_Area": 127258717496
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
79.14429367552918,
19.500257885106404
],
[
79.14582245808431,
19.498859172536427
],
[
79.14600496956801,
19.498823981691853
],
[
79.14966523737327,
19.495821705263914
]
]
]
}
}
]
}
Code
import geojson
from datetime import datetime
from elasticsearch import Elasticsearch, helpers

def geojson_to_es(gj):
    for feature in gj['features']:
        date = datetime.strptime("-".join(feature["properties"]["event_date"].split('-')[0:2]) + "-" + feature["properties"]["year"], "%d-%b-%Y")
        feature["properties"]["timestamp"] = int(date.timestamp())
        feature["properties"]["event_date"] = date.strftime('%Y-%m-%d')
        yield feature

with open("GeoObs.json") as f:
    gj = geojson.load(f)

es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])

k = ({
    "_index": "YOUR_INDEX",
    "_source": feature,
} for feature in geojson_to_es(gj))

helpers.bulk(es, k)
Explanation
with open("GeoObs.json") as f:
    gj = geojson.load(f)

es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])
This portion of the code loads an external geojson file, then connects to Elasticsearch.
k = ({
    "_index": "conflict-data",
    "_source": feature,
} for feature in geojson_to_es(gj))

helpers.bulk(es, k)
The parentheses here create a generator which we will feed to helpers.bulk(es, k). Remember, _source is the original data in Elasticsearch speak, i.e. our raw JSON. _index is just the index into which we want to put our data. You'll see other examples that also include "_type": "_doc"; that is part of mapping types, which no longer exist in Elasticsearch 7.x+.
def geojson_to_es(gj):
    for feature in gj['features']:
        date = datetime.strptime("-".join(feature["properties"]["event_date"].split('-')[0:2]) + "-" + feature["properties"]["year"], "%d-%b-%Y")
        feature["properties"]["timestamp"] = int(date.timestamp())
        feature["properties"]["event_date"] = date.strftime('%Y-%m-%d')
        yield feature
The function geojson_to_es uses a generator to produce events. Instead of returning once and finishing, a generator function resumes after the yield keyword on each call. In this case, we are generating our GeoJSON features. In my code you also see:
date = datetime.strptime("-".join(feature["properties"]["event_date"].split('-')[0:2]) + "-" + feature["properties"]["year"], "%d-%b-%Y")
feature["properties"]["timestamp"] = int(date.timestamp())
feature["properties"]["event_date"] = date.strftime('%Y-%m-%d')
This is just an example of manipulating the data in the JSON before sending it out to Elasticsearch.
The key is that in your mapping file you must have something mapped as geo_point or geo_shape. These data types are how Elasticsearch recognizes geo data. An example from my mapping file:
...
{
  "properties": {
    "geometry": {
      "properties": {
        "coordinates": {
          "type": "geo_point"
        },
        "type": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    },
...
That is to say, before uploading your GeoJSON data with Python, you need to create your index and then apply a mapping file which includes either geo_shape or geo_point, using something like:
curl -X PUT "localhost:9200/YOUR_INDEX?pretty"
curl -X PUT localhost:9200/YOUR_INDEX/_mapping?pretty -H "Content-Type: application/json" -d @mapping.json
You must separate the GeoJSON features into (1) the geometry part and (2) the properties/attributes part. You cannot index GeoJSON features and feature collections directly (see the documentation); only the geometry part is supported as a field type.
So your final indexable document would look somewhat flattened:
{
"ID_0": 105,
"ISO": "IND",
"NAME_0": "India",
"ID_1": 1288,
"NAME_1": "Telangana",
"ID_2": 15715,
"NAME_2": "Telangana",
"VARNAME_2": null,
"NL_NAME_2": null,
"HASC_2": "IN.TS.AD",
"CC_2": null,
"TYPE_2": "State",
"ENGTYPE_2": "State",
"VALIDFR_2": "Unknown",
"VALIDTO_2": "Present",
"REMARKS_2": null,
"Shape_Leng": 8.103535,
"Shape_Area": 127258717496,
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
79.14429367552918,
19.500257885106404
],
[
79.14582245808431,
19.498859172536427
],
[
79.14600496956801,
19.498823981691853
],
[
79.14966523737327,
19.495821705263914
]
]
]
}
}
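A minimal mapping sketch for such a flattened document (the index name is the same placeholder used above; only the geometry field strictly needs an explicit type, the remaining attributes can be left to dynamic mapping):
PUT /YOUR_INDEX
{
  "mappings": {
    "properties": {
      "geometry": {
        "type": "geo_shape"
      }
    }
  }
}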

Filter by languages in 2nd and 4th level keys of a couchdb document

Given the following document in CouchDB....
{
  "_id": "002bafd55b353692a7ab2968074310cc2cbff258",
  "_rev": "1-bc853056ac61d817ae3c4ecb4f81322b",
  "names": [
    { "locale": "en", "value": "Example" },
    { "locale": "de", "value": "Beispiel" },
    { "locale": "fr", "value": "Exemple" }
  ],
  "details": [
    { "locale": "en", "value": "An Example is here" },
    { "locale": "de", "value": "Ein Beispiel ist hier" },
    { "locale": "fr", "value": "Un exemple est ici" }
  ]
}
...how can I write a view that will allow me to return a partial document with
the undesired languages filtered out?
curl ..snip.. '_design/locale_filter/?locale=en,de,fr,it'
curl ..snip.. '_design/locale_filter/?locale=en,fr'
curl ..snip.. '_design/locale_filter/?locale=en'
Should return something looking like this:
{
  "_id": "002bafd55b353692a7ab2968074310cc2cbff258",
  "_rev": "1-bc853056ac61d817ae3c4ecb4f81322b",
  "names": [
    { "locale": "en", "value": "Example" }
  ],
  "details": [
    { "locale": "en", "value": "An Example is here" }
  ]
}
There's also a sub-case where the documents have a deeper structure that repeats the names and details structure; in an ideal world these would also be filtered:
{
  "_id": "002bafd55b353692a7ab2968074310cc2cbff258",
  "_rev": "1-bc853056ac61d817ae3c4ecb4f81322b",
  "names": [ ... snip ... ],
  "details": [ ... snip ... ],
  "deeper": {
    "names": [
      { "locale": "en", "value": "Sub-Example" }
    ],
    "details": [
      { "locale": "en", "value": "The Sub-Example is here" }
    ]
  }
}
I also note that this might not be a view but rather a show; the CouchDB documentation says that a show is for transforming documents into any format.
My final question, as a beginner, is whether there's some way to make it easier to work on CouchDB views and design docs. Right now I'm experimenting with erica, which feels like overkill, as I'm pretty sure I don't want a couch app; I just want to easily maintain my views in files on disk and sync them with the CouchDB database whenever I've made significant enough changes.
I was able to implement this using show functions. I implemented two, the first one just for convenience:
(doc, req) ->
  all_locales = []
  for name in doc.names
    all_locales.push name.locale
  toJSON(all_locales)
(I also implemented it on details, and remove duplicate locales, in my real code.)
This allows me to do the following:
GET /_design/dbname/_show/list_locales/c0db9ad..snip..
and returns, for example, ["en", "de", "fr"] - whatever locales the document happens to have.
I can then follow up with the function to retrieve the filtered document:
(doc, req) ->
  locales = req.query.locales.split(",")
  doc.names = doc.names.filter (name) ->
    locales.indexOf(name.locale) > -1
  doc.overviews = doc.details.filter (overview) ->
    locales.indexOf(overview.locale) > -1
  return toJSON(doc) + "\n"
The usage pattern for this is:
GET /_design/dbname/_show/restrict_locales/c0db9ad..snip..?locales=en,fr
GET /_design/dbname/_show/restrict_locales/c0db9ad..snip..?locales=fr
GET /_design/dbname/_show/restrict_locales/c0db9ad..snip..?locales=en,fr,de,it,hu,zh
It works quite remarkably well, and was much faster than I expected. I believe the show function results are aggressively cached by CouchDB.
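On the side question of keeping these on disk: show functions live inside the design document itself, so one low-tech alternative to erica is to keep the design document as a JSON file (with the functions compiled from CoffeeScript to JavaScript and embedded as strings) and push it with curl whenever it changes. A rough sketch, with hypothetical file and database names:
{
  "_id": "_design/locale_filter",
  "language": "javascript",
  "shows": {
    "list_locales": "function (doc, req) { /* compiled from the CoffeeScript above */ }",
    "restrict_locales": "function (doc, req) { /* compiled from the CoffeeScript above */ }"
  }
}
curl -X PUT http://localhost:5984/dbname/_design/locale_filter -d @locale_filter.json
Updating an existing design document requires its current _rev, so a small sync script typically fetches the document first and includes that _rev in the PUT.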
