Elasticsearch cannot index array of integers - elasticsearch

Given mapping:
{
  "mappings": {
    "properties": {
      "user_ids": { "type": "integer" }
    }
  }
}
Observations:
Index {"user_ids": 1}, and data will show up correctly
Index {"user_ids": "x"}, and error is thrown failed to parse field [user_ids] of type [integer] in document, indicating that mapping is working correctly
However, indexing {"user_ids": [1]} just clears the field, without throwing error.
Question:
Why does this happen and how can I index arrays of integers?
Extra:
I removed the settings config entirely; it doesn't change anything.
I tried the keyword type; it doesn't change anything either.
If relevant, I use the latest opensearch-py.

It's not clear what you mean by "clears the field"; indexing an array of integers works perfectly fine, as shown in the example below. I hope you are issuing the same requests.
PUT <index-name>/_doc/1
{
  "user_ids": [1, 2, 3]
}
And the GET API returns all the integers in the array:
GET <index-name>/_doc/1
{
  "_source": {
    "user_ids": [
      1,
      2,
      3
    ]
  }
}

Turns out it was my own error: I was using elasticsearch-head for quickly checking values, and it doesn't support displaying array values :/ Once I double-checked with queries, they came back correct.
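For anyone else double-checking, a query like this shows the array really is indexed (index name and values taken from the example above; term queries match any value in the array):
GET <index-name>/_search
{
  "query": {
    "term": {
      "user_ids": 2
    }
  }
}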

Related

Elasticsearch 7 number_format_exception for input value as a String

I have a field in the index with the following mapping:
"sequence_number": {
  "type": "long",
  "copy_to": [
    "_custom_all"
  ]
}
and I am using this search query:
POST /my_index/_search
{
  "query": {
    "term": {
      "sequence_number": {
        "value": "we"
      }
    }
  }
}
I am getting this error message:
,"index_uuid":"FTAW8qoYTPeTj-cbC5iTRw","index":"my_index","caused_by":{"type":"number_format_exception","reason":"For input string: \"we\""}}}]},"status":400}
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:260) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:238) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:212) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1433) ~[elasticsearch-rest-high-level-client-7.1.1.jar:7.1.1]
at
How can I ignore number_format_exception errors, so that the query either returns nothing or ignores this particular filter - either is acceptable.
Thanks in advance.
What you are looking for is not possible. Ideally, you should have coerce enabled on your numeric fields so that your index doesn't contain dirty data.
The best solution is to handle this in the application that generates the Elasticsearch query: if you are searching on numeric fields, check for a NumberFormatException (or otherwise validate the input) there and reject the query, since your index doesn't contain dirty data in the first place.
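For illustration, a minimal sketch of such an application-side guard, assuming a Node.js client like the one used elsewhere in this thread (the function and variable names are hypothetical):
// Hypothetical guard: reject non-numeric input before building the term query
function buildSequenceNumberQuery(rawValue) {
  const value = Number(rawValue);
  if (!Number.isFinite(value)) {
    // Either reject the request or silently drop this filter; here we reject
    throw new Error(`sequence_number must be numeric, got: ${rawValue}`);
  }
  return {
    query: {
      term: {
        sequence_number: { value }
      }
    }
  };
}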
Edit: Another interesting approach is to validate the query up front using the Validate API, as suggested by @prakash. The only downside is that it adds another network call, but if your application is not latency-sensitive, it can be used as a workaround.
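For reference, a sketch of what that Validate API call might look like for the query above (index and field names taken from the question); if the value cannot be parsed as a number, the response should come back with "valid": false:
GET my_index/_validate/query
{
  "query": {
    "term": {
      "sequence_number": {
        "value": "we"
      }
    }
  }
}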

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster setup with EFK stack for logging, as described here. I have never worked with one of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). And even if this regexp matched, I would only find the log entries that contain it, not the specific value itself. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
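If you want to check the grok pattern before wiring anything up, the simulate API can be handy (using the sample message from the question):
POST _ingest/pipeline/parse-plz/_simulate
{
  "docs": [
    {
      "_source": {
        "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
      }
    }
  ]
}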
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
"index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": { "lat": 42.36, "lon": 7.33 }
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1
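One additional assumption worth calling out: for Kibana's map visualizations, the location field usually needs to be mapped as geo_point rather than being left to dynamic mapping. The exact syntax depends on your Elasticsearch version; on recent versions it might look like this (the index name is a placeholder):
PUT project.foo-index/_mapping
{
  "properties": {
    "location": {
      "type": "geo_point"
    }
  }
}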

Can inclusion of specific fields change the elasticsearch result set?

I have an ES query that returns 414 documents if I exclude a specific field from results.
If I include this field, the document count drops to 328.
The documents that get dropped are consistent and this happens whether I scroll results or query directly.
The field map for the field that reduces the result set looks like this:
"completion": {
"type": "object",
"enabled": false
}
Nothing special to it and I have other "enabled": false object type fields that return just fine in this query.
I tested against multiple indexes with the same data to rule out corruption (I hope).
This 'completion' object is a nested and ignored object that has 4 or 5 levels of nesting but once again, I have other similarly nested objects that return just fine for this query.
The query is a simple terms match for 414 terms (yes, this is terrible, we are rethinking our strategy on this):
var { _scroll_id, hits } = await elastic.search({
  index: index,
  type: type,
  body: shaQuery,
  scroll: '10s',
  _source_exclude: 'account,layout,surveydata,verificationdata,accounts,scores'
});
while (hits && hits.hits.length) {
  // Append all new hits
  allRecords.push(...hits.hits)
  var { _scroll_id, hits } = await elastic.scroll({
    scrollId: _scroll_id,
    scroll: '10s'
  })
}
The query is:
"query": {
"terms": {
"_id": [
"....",
"....",
"...."
}
}
}
In this example, I will only get back 328 results. If I add 'completion' to the _source_exclude then I get the full set back.
So, my question is: What are the scenarios where including a field in the result could limit the search when that field is totally unrelated to the search.
The #'s are specific to this example but consistent across queries. I just include them for context on the overall problem.
Also important is that this completion field has the same data and format across both included and excluded records, I can't see anything that would cause a problem.
The problem was found, and it was obscure. We saw that it was always failing at the same point, and when we examined the response a little more closely, the same error kept coming out:
{ took: 158,
  timed_out: false,
  _shards:
   { total: 5,
     successful: 4,
     skipped: 0,
     failed: 1,
     failures:
      [ { shard: 0,
          index: 'theindexname',
          node: '4X2vwouIRriYbQTQcHQ_sw',
          reason:
           { type: 'illegal_argument_exception',
             reason: 'cannot write xcontent for unknown value of type class java.math.BigInteger' } } ] } }
OK, well that's strange: we are not using BigIntegers at all. But, thanks to the power of Google, this issue in the Elasticsearch issue tracker turned up:
https://github.com/elastic/elasticsearch/pull/32888
"XContentBuilder to handle BigInteger and BigDecimal", which fixes a bug in 6.3 where fields that used BigInteger or BigDecimal would fail to serialize and thus break when source filtering was applied. We were running 6.3.
It is unclear why our systems were triggering this issue, but upgrading to 6.5 solved it entirely.
Obscure, obscure, obscure, but solved thanks to Javier's persistence.
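For anyone hitting something similar, a minimal sketch of the kind of check that would have surfaced this earlier, assuming the client returns the raw response body as in the snippet above (the helper name is hypothetical):
// Hypothetical helper: fail loudly on partial shard failures instead of
// silently accepting a smaller result set
function assertNoShardFailures(response) {
  const shards = response._shards || {};
  if (shards.failed > 0) {
    console.error('Shard failures:', JSON.stringify(shards.failures, null, 2));
    throw new Error(`${shards.failed} of ${shards.total} shards failed`);
  }
}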

Elasticsearch 2.x sort with missing value treated as 0.0

According to the mapping, the field I'm sorting on is of type double.
I was expecting the sort to treat missing values as 0, since I have some negative values which I expect to go below the missing ones. (Currently, missing goes below negative.)
How do I achieve this?
There is the exists query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html), which you can use along with a not filter in a function score query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html) for sorting, and give the needed boost/score for missing values, 0.0 in your case.
But I suggest a much simpler solution: just index your documents with that field set to 0.0 instead of leaving it missing, and it will make things much easier; you will simply be able to sort on the field.
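For reference, depending on your version, the sort option's missing parameter may also accept a custom value, which would let you treat missing as 0.0 without reindexing; a sketch, assuming the field is called my_double_field:
GET my_index/_search
{
  "sort": [
    {
      "my_double_field": {
        "order": "desc",
        "missing": 0.0
      }
    }
  ]
}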
You seem to be describing that missing or null values should be handled as 0. If you do not need to distinguish them from actual 0 values, you can use a null_value; see the manual.
If you do, in another situation, have to distinguish them, you might need to add a flag (a second field) to record "this was actually missing", but that really depends on your use case.
A quick example:
PUT test
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": {
          "type": "integer",
          "null_value": 0
        }
      }
    }
  }
}
PUT test/my_type/1
{
"status_code": null
}
PUT test/my_type/2
{
"status_code": -1
}
PUT test/my_type/3
{
"status_code": 1
}
GET test/my_type/_search
{
  "sort": [
    { "status_code": "desc" }
  ]
}
This will return the documents in the order 3, 1, 2: the missing value is treated as 0 and sorts between 1 and -1, as expected.

Sorting a match query with ElasticSearch

I'm trying to use ElasticSearch to find all records containing a particular string. I'm using a match query for this, and it's working fine.
Now, I'm trying to sort the results based on a particular field. When I try this, I get some very unexpected output, and none of the records even contain my initial search query.
My request is structured as follows:
{
  "query": {
    "match": { "_all": "some_search_string" }
  },
  "sort": [
    {
      "some_field": {
        "order": "asc"
      }
    }
  ]
}
Am I doing something wrong here?
In order to sort on a string field, your mapping must contain a non-analyzed version of this field. Here's a simple blog post I found that describes how you can do this using the multi_field mapping type.
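A sketch of what that can look like on current versions, using a keyword sub-field (index and field names are placeholders; on the older versions this question likely targets, the equivalent is a string sub-field with "index": "not_analyzed"):
PUT my_index
{
  "mappings": {
    "properties": {
      "some_field": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
You would then sort on the non-analyzed sub-field instead of the analyzed one:
"sort": [
  { "some_field.raw": { "order": "asc" } }
]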
