When trying to index in Elasticsearch 7.8.1, an error occurs saying "field" is too large, must be <= 32766. Is there a solution?

When trying to index in Elasticsearch 7.8.1, an error occurs saying "testField" is too large, must be <= 32766. Is there a solution?
Field Info
"testField":{
"type": "keyword",
"index": false
}

It is a known issue, and it is not yet clear how best to solve it. Lucene enforces a maximum term length of 32766 bytes, beyond which the document is rejected.
Until this gets solved, there are two immediate options you can choose from:
A. Use a script ingest processor to truncate the value to at most 32766 characters (the Lucene limit is counted in UTF-8 bytes, so this is only exact for single-byte text).
PUT _ingest/pipeline/truncate-pipeline
{
"description": "truncate",
"processors": [
{
"script": {
"source": """
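// Only truncate long values: substring() throws if the end index exceeds the string length.
if (ctx.testField != null && ctx.testField.length() > 32766) {
ctx.testField = ctx.testField.substring(0, 32766);
}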
"""
}
}
]
}
PUT my-index/_doc/123?pipeline=truncate-pipeline
{ "testField": "hgvuvhv....sjdhbcsdc" }
B. Use a text field with an appropriate analyzer that would truncate the value, but you'd lose the ability to aggregate and sort on that field.
If you want to keep your field as a keyword, I'd go with option A.
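One caveat with option A: substring() counts characters, while the 32766 limit applies to the UTF-8 encoded bytes of the term, so values containing multi-byte characters can still be rejected after truncation. If the value can be trimmed on the client before indexing, a byte-aware truncation might look like the Python sketch below. The names truncate_utf8 and MAX_TERM_BYTES are illustrative assumptions and not part of any Elasticsearch API.

```python
MAX_TERM_BYTES = 32766  # Lucene's maximum term length, counted in UTF-8 bytes


def truncate_utf8(value: str, max_bytes: int = MAX_TERM_BYTES) -> str:
    """Return value cut to at most max_bytes UTF-8 bytes (hypothetical helper)."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value  # already within the limit, nothing to do
    # Cut at the byte limit; errors="ignore" drops a trailing partial
    # multi-byte sequence so the result is still valid text.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    doc = {"testField": "\u00e9" * 40000}  # each character is 2 bytes in UTF-8
    doc["testField"] = truncate_utf8(doc["testField"])
    print(len(doc["testField"].encode("utf-8")))
```

The truncated document can then be indexed without the ingest pipeline, or with it as a second safety net.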

Related

How to highlight regexp matches in Elasticsearch with patterns that include spaces

I'm trying to use regexp in Elasticsearch to find some patterns and highlight them. The pattern I'm trying to find contains spaces.
I also used a keyword sub-field so that the text is not analyzed:
{
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
The exact pattern is ".*( [a-zA-Z]( |,|.)){3,5}.*". The query looks like this:
{
"_source": false,
"query": {
"bool": {
"should": [
{
"regexp": {
"transcript_data.transcript.keyword": {
"value": ".*( [a-zA-Z]){3,5}.*"
}
}
}
]
}
},
"highlight": {
"fields": {
"transcript_data.transcript.keyword": {}
}
}
}
The highlight seems to cover the whole document (start to end) even though the pattern lies in the middle of the text.
[Two images comparing the expected and the actual highlighting are omitted.]
For example, given the text
It's it's it's a steal. A hot eight mining and b k k t. I think these have went too
the output should be
<em>b k k t</em>
but I get ...
<em>It's it's it's a steal. A hot eight mining and b k k t. I think these have went too</em>
I believe this is because of the .*, but that also seems to be the way regexp works in ES. What am I doing wrong?
As far as I know, you cannot run a regexp that contains whitespace against a text field, because text fields are analyzed and split into multiple tokens, and the regexp is matched against each token individually. A regexp with whitespace on a text field will therefore not return any results.
Currently you are searching the keyword sub-field, which is not analyzed, and that is why the query matches. It is also why the entire field is highlighted: a keyword field stores the whole value as a single term, so the highlighter can only mark that one term.
You can use "require_field_match": "false" in the highlight section if you want to search on one field and highlight on a different one, but this will not work in your case either.
You can try a shingle analyzer, then search on the keyword field and highlight on the shingle field, but I am not sure this will fit your use case completely.
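If narrowing the highlight inside Elasticsearch is not possible, one workaround is to re-run the pattern on the client against the returned text. The Python sketch below assumes the intent is "3 to 5 single letters separated by spaces"; the pattern is rewritten for Python's re module (Lucene's regexp syntax has no lookarounds), and the helper name highlight is hypothetical.

```python
import re

# Client-side pattern: a single letter preceded by a space, followed by 2 to 4
# more " <letter>" groups, and not running into a longer word.
SPACED_LETTERS = re.compile(r"(?<= )[a-zA-Z](?: [a-zA-Z]){2,4}(?![a-zA-Z])")


def highlight(text: str, pattern: re.Pattern = SPACED_LETTERS) -> str:
    """Wrap every match of pattern in <em> tags (hypothetical helper)."""
    return pattern.sub(lambda m: f"<em>{m.group(0)}</em>", text)


if __name__ == "__main__":
    transcript = (
        "It's it's it's a steal. A hot eight mining and b k k t. "
        "I think these have went too"
    )
    print(highlight(transcript))
```

Elasticsearch is then only used to find the matching documents, and the exact span is located after the fact.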

How to treat certain field values as null in `Elasticsearch`

I'm parsing log files which for simplicity's sake let's say will have the following format :
{"message": "hello world", "size": 100, "forward-to": "127.0.0.1"}
I'm indexing these lines into an Elasticsearch index, where I've defined a custom mapping such that message, size, and forward-to are of type text, integer, and ip respectively. However, some log lines will look like this :
{"message": "hello world", "size": "-", "forward-to": ""}
This leads to parsing errors when Elasticsearch tries to index these documents. For technical reasons, it's far from trivial for me to pre-process these documents and change "-" and "" to null. Is there any way to define which values my mapping should treat as null? Is there perhaps an analyzer I can write that works on any field type and that I can add to all entries in my mapping?
Basically I'm looking for somewhat of the opposite of the null_value option. Instead of telling Elasticsearch what to turn a null_value into, I'd like to tell it what it should turn into a null_value. Also acceptable would be a way to tell Elasticsearch to simply ignore fields that look a certain way but still parse the other fields in the document.
So this one's easy apparently. Add the following to your mapping settings :
{
"settings": {
"index": {
"mapping": {
"ignore_malformed": "true"
}
}
}
}
This will still index the field (contrary to what I've understood from the documentation...) but it will be ignored during aggregations (so if you have 3 entries in an integer field that are "1", 3, and "hello world", an averaging aggregation will yield 2).
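To make the arithmetic in that example concrete, here is a small Python model of the behaviour described: strings that parse as integers are coerced, malformed values are skipped, and the average is taken over what remains. This is only an illustration of the reasoning, not Elasticsearch code, and the function names are made up for the sketch.

```python
from typing import Iterable, Optional


def coerce_integer(raw: object) -> Optional[int]:
    """Mimic integer coercion: return an int, or None for a malformed value."""
    if isinstance(raw, bool):
        return None  # bool is an int subclass in Python; treat it as malformed
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        try:
            return int(raw)  # "1" becomes 1
        except ValueError:
            return None  # "hello world", "-" and "" are malformed
    return None


def average_ignoring_malformed(values: Iterable[object]) -> Optional[float]:
    """Average only the values that could be coerced to an integer."""
    parsed = [v for v in map(coerce_integer, values) if v is not None]
    return sum(parsed) / len(parsed) if parsed else None


if __name__ == "__main__":
    print(average_ignoring_malformed(["1", 3, "hello world"]))
```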
Keep in mind that because of the way the option was implemented (and I would say this is a bug), indexing still fails for an object that is entered where a concrete value is expected, and vice versa. If you'd like to get around that, you can set the field's enabled value to false like this:
{
"mappings": {
"my_mapping_name": {
"properties": {
"my_unpredictable_field": {
"enabled": false
}
}
}
}
}
This comes at a price, though: the field won't be indexed, but the values entered will still be stored, so you can still access them by finding the document through another field. This usually shouldn't be an issue, as you likely won't be filtering documents based on the value of such an unpredictable field, but that depends on your specific use case. See here for the official discussion of this issue.

Ngram Tokenizer on field, not on query

I'm having trouble finding the solution for a use case here.
Basically, it's pretty simple: I need to perform a "contains" query, like SQL's LIKE '%...%'.
I've seen there is a regexp query, which I actually managed to get working perfectly, but as it seems to scale badly, I'm trying out nGrams. I've played around with them before and know how they work, but the behaviour isn't what I expect.
I've configured my analyzer with min_gram = 2 and max_gram = 20. Say I index a user called "Christophe". I want the query "Chris" to match, which it does, since "Chris" is a 5-gram of "Christophe". The problem is that "Risotto" matches as well, because the query is also broken down into nGrams, and "is" is a 2-gram of "Christophe".
What I need is for the analyzer to break the indexed field down into nGrams at indexing time and compare those to the FULL text of the query. Risotto should match Risotto, XXXRisottoXXX and so on, but not Risolo or anything else where only some of the nGrams match.
Is there any solution ?
You need to use the search_analyzer setting to have distinct index-time and search-time analyzers.
Sample from docs:
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
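Why a separate search analyzer fixes the "Risotto" problem can be seen in a small Python model. The sketch assumes character nGrams with min_gram 2 and max_gram 20 at index time, and compares two ways of treating the query: broken into nGrams (the problematic behaviour) or kept whole and lowercased (roughly what a non-nGram search analyzer does for a single word). The function names are illustrative only.

```python
def ngrams(text: str, min_gram: int = 2, max_gram: int = 20) -> set:
    """All lowercase character n-grams of text with min_gram <= n <= max_gram."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


def matches_with_ngram_query(indexed_value: str, query: str) -> bool:
    """Query is n-grammed too: any shared n-gram counts as a match."""
    return bool(ngrams(indexed_value) & ngrams(query))


def matches_with_whole_query(indexed_value: str, query: str) -> bool:
    """Query stays one term: it must itself be an n-gram of the indexed value."""
    return query.lower() in ngrams(indexed_value)


if __name__ == "__main__":
    print(matches_with_ngram_query("Christophe", "Risotto"))     # shares "is", "ri", "ris"
    print(matches_with_whole_query("Christophe", "Risotto"))
    print(matches_with_whole_query("Christophe", "Chris"))
    print(matches_with_whole_query("XXXRisottoXXX", "Risotto"))
```

Note that in this model a query longer than max_gram can never match, since no indexed n-gram is that long.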

ElasticSearch + Kibana - Unique count using pre-computed hashes

update: Added
I want to perform a unique count on my ElasticSearch cluster.
The cluster contains about 50 million records.
I've tried the following methods:
First method
Mentioned in this section:
Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
Second method
Mentioned in this section:
Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.
My property mapping
"my_prop": {
"index": "not_analyzed",
"fielddata": {
"format": "doc_values"
},
"doc_values": true,
"type": "string",
"fields": {
"hash": {
"type": "murmur3"
}
}
}
The problem
When I use unique count on my_prop.hash in Kibana I receive the following error:
Data too large, data for [my_prop.hash] would be larger than limit
ElasticSearch has 2g heap size.
The above also fails for a single index with 4 million records.
My questions
Am I missing something in my configurations?
Should I upgrade my machine? That does not seem to be a scalable solution.
ElasticSearch query
Was generated by Kibana:
http://pastebin.com/hf1yNLhE
ElasticSearch Stack trace
http://pastebin.com/BFTYUsVg
That error says you don't have enough memory (more specifically, heap memory for fielddata) to hold all the values of hash, so you need to move them off the heap and onto disk, which means using doc_values.
Since you are already using doc_values for my_prop, I suggest doing the same for my_prop.hash (and no, the settings of the main field are not inherited by its sub-fields): "hash": { "type": "murmur3", "index" : "no", "doc_values" : true }.

Elasticsearch sorting on string not returning expected results

When sorting on a string field with multiple words, Elasticsearch splits the string value and uses the min or max token as the sort value. For example, when sorting on a field with the value "Eye of the Tiger" in ascending order, the sort value is "Eye", and when sorting in descending order, the sort value is "Tiger".
Let's say I have "Eye of the Tiger" and "Wheel of Death" as entries in my index. When I do an ascending sort on this field, I would expect "Eye of the Tiger" to come first, since "E" comes before "W". What I'm seeing instead is that "Wheel of Death" comes first, since "D" (from "Death") is the min value of that entry and "E" is the min value of "Eye of the Tiger".
Does anyone know how to turn off this behavior and just allow a regular sort on this string field?
As mconlin mentioned, if you want to sort on the unanalyzed field, you need to specify "index": "not_analyzed" to get the sort you described. If you also want to keep this field tokenized for searching, this post by sloan shows a great example. Using a multi-field to keep two different mappings for one field is very common in Elasticsearch.
Hope this helps, let me know if I can offer more explanation.
If you want the sorting to be case-insensitive, "index": "not_analyzed" doesn't work, so I've created a custom sort analyzer.
index-settings.yml
index:
  analysis:
    analyzer:
      sort:
        type: custom
        tokenizer: keyword
        filter: [lowercase]
Mapping:
...
"articleName": {
"type": "string",
"analyzer": "standard",
"fields": {
"sort": {
"type": "string",
"analyzer": "sort"
}
}
}
...
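The difference between the two sort behaviours discussed in this thread can be modelled in a few lines of Python. The sketch assumes that an analyzed field sorts ascending by its smallest lowercase token, and that the keyword tokenizer plus lowercase filter keeps the whole value as one lowercase term. The function names are hypothetical and this is not how Elasticsearch is implemented internally.

```python
import re


def analyzed_sort_key(value: str) -> str:
    """Ascending sort key of an analyzed field: its smallest lowercase token."""
    return min(re.findall(r"\w+", value.lower()))


def keyword_lowercase_sort_key(value: str) -> str:
    """Sort key with keyword tokenizer + lowercase filter: the whole value."""
    return value.lower()


if __name__ == "__main__":
    titles = ["Eye of the Tiger", "Wheel of Death"]
    # "death" < "eye", so the analyzed field puts "Wheel of Death" first.
    print(sorted(titles, key=analyzed_sort_key))
    # Whole-value comparison restores the expected order.
    print(sorted(titles, key=keyword_lowercase_sort_key))
```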
