Truncate and index String values in Elasticsearch 2.3.x

I am running ES 2.3.3. I want to index a non-analyzed string but truncate it to a certain number of characters. According to the documentation, the ignore_above property will NOT index a field whose value is longer than the provided length. I don't want that. I want to take a field that could potentially be 30K characters long and truncate it to 10K, but still be able to filter and sort on the 10K that is retained.
Is this possible in ES 2.3.3, or do I need to do this in Java prior to indexing the document?

I want to index a non-analyzed String but truncate it to a certain number of characters.
Technically it's possible with the Update API and the upsert option, but, depending on your exact needs, it may not be very handy.
Let's say you want to index this document:
{
  "name": "foofoofoofoo",
  "age": 29
}
but you need to truncate the name field so that it keeps only 5 characters. Using the Update API, you'd have to execute a script:
POST http://localhost:9200/insert/test/1/_update
{
  "script": "ctx._source.name = ctx._source.name.substring(0,5);",
  "scripted_upsert": true,
  "upsert": {
    "name": "foofoofoofoo",
    "age": 29
  }
}
This means that if ES does not find a document with the given id (here id=1), it indexes the document inside the upsert element and then runs the given script. As you can see, this is rather inconvenient if you want automatically generated ids, because you have to provide the id in the URI.
Result:
GET http://localhost:9200/insert/test/1
{
  "_index": "insert",
  "_type": "test",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "foofo",
    "age": 29
  }
}
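One caveat: substring(0,5) will throw a StringIndexOutOfBoundsException whenever the incoming value is shorter than 5 characters. A guarded version of the script (a minimal sketch in the same Groovy-style inline scripting used above; the id 2 and the shorter name value are hypothetical) could look like this:
POST http://localhost:9200/insert/test/2/_update
{
  "script": "if (ctx._source.name.length() > 5) { ctx._source.name = ctx._source.name.substring(0,5); }",
  "scripted_upsert": true,
  "upsert": {
    "name": "foo",
    "age": 29
  }
}
For the 30K-to-10K use case you would replace 5 with 10000. Note that dynamic (inline) scripting must be enabled in ES 2.x for this to work.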

Related

Calculate field data size and store it in another field at indexing time in Elasticsearch 7.17

I am looking for a way to store the size of a field (in bytes) in a new field of the same document.
I.e. when a document is created with a field message that contains the value hello, I want another field message_size_bytes written, which in this example has the value 5.
I am aware of the possibilities using _update_by_query and _search with scripted fields, but I have so much data that I do not want to calculate the sizes at query time, but at index time.
Is there a way to do this using Elasticsearch 7.17 only? I do not have access to the data before it is passed to Elasticsearch.
You can use an ingest pipeline with a script processor.
You can create the pipeline using the command below:
PUT _ingest/pipeline/calculate_bytes
{
  "processors": [
    {
      "script": {
        "description": "Calculate bytes of message field",
        "lang": "painless",
        "source": """
          // Note: length() counts characters, which equals bytes only for ASCII values
          ctx['message_size_bytes'] = ctx['message'].length();
        """
      }
    }
  ]
}
After creating the pipeline, you can pass the pipeline name while indexing data, as below (the same works from Logstash, Java, or any other client as well):
POST 74906877/_doc/1?pipeline=calculate_bytes
{
  "message": "hello"
}
Result:
"hits": [
{
"_index": "74906877",
"_id": "1",
"_score": 1,
"_source": {
"message": "hello",
"message_size_bytes ": 5
}
}
]
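If you want the pipeline applied to every document without naming it on each request, you can also set it as the index's default pipeline (a sketch using the index.default_pipeline setting, assuming the same index and pipeline names as above):
PUT 74906877/_settings
{
  "index.default_pipeline": "calculate_bytes"
}
After this, plain POST 74906877/_doc requests will run through calculate_bytes automatically.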

Does Elasticsearch execute operations in a specific order?

I read that ES is near real-time, and therefore all index/create/update/delete etc. operations are not executed immediately.
Let's say I index 3 documents with the same id, in this order, with 1 millisecond between each, and then force a refresh:
{
  "_id": "A",
  "_source": { "text": "a" }
}
{
  "_id": "A",
  "_source": { "text": "b" }
}
{
  "_id": "A",
  "_source": { "text": "c" }
}
Then, if I search for a document with id "A", I will get 1 result, but which one?
When Elasticsearch performs a refresh, does it execute operations sequentially in the order in which they arrive?
In this instance it will come down to which indexing approach you take.
A bulk request does not guarantee that the order in which you submitted it is the order in which it will be applied. It might be the same order in (some of) your tests, but there's no guarantee that Elasticsearch provides there.
You can manage this by specifying a version on your documents, so that a higher version of a document is always the one that ends up indexed, as sketched below.
Indexing with 3 individual POSTs will be ordered, as you are making 3 separate and sequential requests, one after the other. That's because each request has the same _id, so it will be routed to the same shard and actioned in the order the requests are received.
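As a sketch of the versioning approach (the index name test is hypothetical), external versioning makes Elasticsearch keep whichever document carries the highest version, regardless of arrival order:
PUT test/_doc/A?version=1&version_type=external
{ "text": "a" }

PUT test/_doc/A?version=3&version_type=external
{ "text": "c" }

PUT test/_doc/A?version=2&version_type=external
{ "text": "b" }
The last request is rejected with a version conflict, so a search for id A returns { "text": "c" } even though the "b" document arrived after it.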

Empty fields aren't shown directly in Elasticsearch?

I added an extra field called "title" with the Put Mapping API and then tried a normal search on my index with GET index_name/type/_search, but the records don't show any field with "title" in them. Is it because the field has no content in it? If so, how do I get fields with no content?
Thank you.
If you have _source enabled, Elasticsearch will return the field value (whether empty or not) exactly as you sent it, as shown in the example below.
{
  "title": "" // see empty value
}
And a GET on this doc id returns the response below:
{
  "_index": "newso",
  "_type": "_doc",
  "_id": "1",
  "_version": 2,
  "_seq_no": 1,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "title": "" // same value is returned in response.
  }
}
EDIT: Based on @Val's comment: if you are looking for this newly added title field in old documents, where you didn't index the field, you will not be able to find it. Elasticsearch is schema-less and doesn't force you to index every field, and you can add or remove fields without updating the mapping.
For the same reason, even if you index a new document after adding the title field to the mapping but don't include the field, the title field will again not be returned for that document. You can still find such documents with an exists query, as sketched below.
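If you need to list the documents that are missing the field entirely, one option (a sketch, assuming the newso index from the example above) is to negate an exists query:
GET newso/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "title" }
      }
    }
  }
}
Note that a document indexed with "title": "" does have the field (an empty string is not null), so it would not be matched by this query.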

Elasticsearch query to get results irrespective of spaces in search text

I am trying to fetch data from Elasticsearch by matching on a field name. I have the following two records:
{
  "_index": "sam_index",
  "_type": "doc",
  "_id": "key",
  "_version": 1,
  "_score": 2,
  "_source": {
    "name": "Sample Name"
  }
}
and
{
  "_index": "sam_index",
  "_type": "doc",
  "_id": "key1",
  "_version": 1,
  "_score": 2,
  "_source": {
    "name": "Sample Name"
  }
}
When I search using text like sam, sample, Sa, etc., I am able to fetch both records using a match_phrase_prefix query. The query I tried is:
GET sam_index/doc/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "sample"
    }
  }
}
I am not able to fetch the records when I search with the string samplen. I need to search and get results irrespective of the spaces in the text. How can I achieve this in Elasticsearch?
First, you need to understand how Elasticsearch works and why it does or doesn't return a result.
Elasticsearch works on token matching. Documents you index go through an analysis process, and the tokens generated by that process are stored in an inverted index, which is used for searching.
When you run a query, the query also generates search tokens: either taken as-is from the query (in the case of a term query) or produced by the analyzer defined on the search field (in the case of a match query). Hence it's very important to understand the internals of your search query.
It's equally important to understand the mapping of your index; ES uses the standard analyzer by default on text fields, as the example below shows.
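You can inspect the tokens an analyzer produces with the _analyze API; for instance, the standard analyzer splits your value on whitespace and lowercases it:
POST _analyze
{
  "analyzer": "standard",
  "text": "Sample Name"
}
This returns the two tokens sample and name. There is no single token starting with samplen, which is why your match_phrase_prefix query finds nothing.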
You can use the Explain API to understand the internals of the query: which search tokens are generated by your search query, how documents matched them, and on what basis the score was calculated.
In your case, I created the name field as text with the word-joining analyzer explained in Ignore spaces in Elasticsearch, and I was able to get the document containing Sample Name when searching for samplen. A sketch of such an analyzer follows.
Let us know if you want to achieve the same and whether it solves your issue.
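A minimal sketch of one possible word-joining analyzer (the names strip_spaces and word_joined are hypothetical): it strips whitespace with a pattern_replace char filter and emits a single lowercased token via the keyword tokenizer:
PUT sam_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      },
      "analyzer": {
        "word_joined": {
          "type": "custom",
          "char_filter": ["strip_spaces"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": { "type": "text", "analyzer": "word_joined" }
      }
    }
  }
}
With this mapping, Sample Name is indexed as the single token samplename, so a match_phrase_prefix search for samplen matches it.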

Can't create a visualization in Kibana (no compatible fields) - but I have compatible fields

I'd appreciate any help with this, I'm really stuck.
I am trying to create a simple visualization in Kibana, a line graph based on a number value in my data (origin_file_size_bytes). When I try to add a Visualization graph, I get this error:
No Compatible Fields: The "test*" index pattern does not contain any of the following field types: number or date
My actual index does contain a number field, as does my data.
Thank you for any help!
Here's a sample entry from the Discover Menu:
{
  "_index": "lambda-index",
  "_type": "lambda-type",
  "_id": "LC08_L1TP_166077.TIF",
  "_version": 1,
  "_score": 2,
  "_source": {
    "metadata_processed": {
      "BOOL": true
    },
    "origin_file_name": {
      "S": "LC08_L1TP_166077.TIF"
    },
    "origin_file_size_bytes": {
      "N": "61667800"
    }
  }
}
My index pattern classifies the field as a string, even though it isn't one:
origin_file_size_bytes.N string
You cannot aggregate on a string field. As seen from your index pattern above, your field has been indexed as a string and NOT as a number. Elasticsearch dynamically determines the mapping type of data if it is not explicitly defined. Since you ingested the field as a string (the numeric value is quoted), ES correctly determined that the field is of type string.
For example, if you index a document with the 2 fields shown below without an explicit mapping, ES creates the message field as type string and the size field as a number (long):
POST my_index/_doc/1
{
  "message": "100",
  "size": 100
}
Index your field into ES as a number instead and you should be able to aggregate on it. One way is to declare the mapping explicitly before indexing, as sketched below.
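A minimal sketch of such a mapping (assuming the lambda-index/lambda-type names and the DynamoDB-style nested structure from the sample document above; numeric strings like "61667800" are coerced to long by default):
PUT lambda-index
{
  "mappings": {
    "lambda-type": {
      "properties": {
        "origin_file_size_bytes": {
          "properties": {
            "N": { "type": "long" }
          }
        }
      }
    }
  }
}
With this in place, origin_file_size_bytes.N is indexed as a number, shows up as such in the Kibana index pattern (after refreshing the pattern), and can be used in a line graph.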
