Elasticsearch manipulate existing field value to add new field - elasticsearch

I am trying to add a new field whose value is a hash of an existing field's value. So, I want to do:
my_index.hashedusername(new field) = crc32(my_index.username) (existing field)
For example
POST _update_by_query
{
"query": {
"match_all": {}
},
"script" : {
"source": "ctx._source.hashedusername = crc32(ctx._source.username);"
}
}
Please give me an idea how to do this..

java.util.zip.CRC32 is not available in the shared painless API so mocking that package will be non-trivial -- perhaps even unreasonable.
I'd suggest to compute the CRC32 hashes beforehand and only then send the docs to ES. Alternatively, scroll through all your documents, compute the hash and bulk-update your documents.
The painless API was designed to perform comparatively simple tasks and CRC32 is certainly outside of its purpose.
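As a rough sketch of the "compute the hash outside of ES and bulk-update" idea, something like the following could work. It only builds the request body for the _bulk API. Fetching the hits (for example with a scroll) and sending the body are left out, and the helper names crc32_of and bulk_update_lines are made up for this example.

```python
import json
import zlib


def crc32_of(text: str) -> int:
    """Return the unsigned CRC32 of the UTF-8 encoding of `text`."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def bulk_update_lines(index: str, hits: list) -> str:
    """Build an NDJSON body for the _bulk API from search/scroll hits.

    Each hit is assumed to look like
    {"_id": "...", "_source": {"username": "..."}}.
    """
    lines = []
    for hit in hits:
        username = hit["_source"]["username"]
        # Action line: partial update of an existing document.
        lines.append(json.dumps({"update": {"_index": index, "_id": hit["_id"]}}))
        # Payload line: only the new field is sent.
        lines.append(json.dumps({"doc": {"hashedusername": crc32_of(username)}}))
    if not lines:
        return ""
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_hits = [{"_id": "1", "_source": {"username": "alice"}}]
    print(bulk_update_lines("my_index", sample_hits), end="")
```

The resulting string would be sent as the body of POST _bulk. Documents indexed later need the hash computed the same way before they are sent to ES.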

Related

Type of field for prefix search in Elastic Search

I'm confused about which field type I should use for prefix search. Many answers suggest search_as_you_type, but I don't think autocomplete is what I'm going for.
I have a UUID field:
id: 34y72ca1-3739-41ff-bbec-f6d17479384c
The following terms should return the doc above:
3
34
34y72ca1
34y72ca1-3739
34y72ca1-3739-41ff-bbec-f6d17479384c
Using 3739 should not return it, as the ID doesn't start with 3739. Initially I wanted partial (substring) matching, but the wildcard field type is not supported by Amazon AWS, so I compromised on prefix search instead of partial search.
I tried the search_as_you_type field, but it doesn't return the result when I use the whole ID. In my use case the results are shown when the user presses Enter rather than live as they type, so it's OK if speed is compromised; I just hope for something that works well with many rows of data.
Thanks
If you have not explicitly defined an index mapping, then you need to use the id.keyword field instead of the id field for the prefix query to return the appropriate results. The id.keyword sub-field has the keyword type, so the whole ID is indexed as a single unanalyzed term instead of being split into tokens by the standard analyzer.
{
"query": {
"prefix": {
"id.keyword": {
"value": "34y72ca1"
}
}
}
}
Otherwise, you can modify your index mapping by adding multi-fields for the id field.
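A minimal sketch of what such a mapping and query could look like, written as Python dicts. The sub-field name raw is hypothetical, and the typeless mapping layout assumes Elasticsearch 7 or later. The helper keyword_prefix_match only models how a prefix query behaves on a single unanalyzed term; it does not talk to a cluster.

```python
# Hypothetical mapping: `id` stays a text field, and the multi-field `id.raw`
# indexes the same value as one unanalyzed keyword term.
MAPPING = {
    "mappings": {
        "properties": {
            "id": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}


def prefix_query(field: str, value: str) -> dict:
    """Build the body of a prefix query against `field`."""
    return {"query": {"prefix": {field: {"value": value}}}}


def keyword_prefix_match(stored: str, typed: str) -> bool:
    """Model of a prefix query on a keyword field: the whole stored
    value is one term, so only a leading match counts."""
    return stored.startswith(typed)


if __name__ == "__main__":
    uuid = "34y72ca1-3739-41ff-bbec-f6d17479384c"
    for typed in ["3", "34y72ca1", "34y72ca1-3739", "3739"]:
        print(typed, keyword_prefix_match(uuid, typed))
```

With the keyword sub-field, 3739 does not match because it is not at the start of the single stored term, which is the behaviour asked for in the question.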

Kibana scripted field which loops through an array

I am trying to use the metricbeat http module to monitor F5 pools.
I make a request to the F5 API and bring back JSON, which is saved to Kibana. But the JSON contains an array of pool members and I want to count the number which are up.
The advice seems to be that this can be done with a scripted field. However, I can't get the script to retrieve the array, e.g.
doc['http.f5pools.items.monitor'].value.length()
returns in the preview results with the same 'Additional Field' added for comparison:
[
{
"_id": "rT7wdGsBXQSGm_pQoH6Y",
"http": {
"f5pools": {
"items": [
{
"monitor": "default"
},
{
"monitor": "default"
}
]
}
},
"pool.MemberCount": [
7
]
},
If I try
doc['http.f5pools.items']
Or similar I just get an error:
"reason": "No field found for [http.f5pools.items] in mapping with types []"
Googling suggests that the doc construct does not contain arrays?
Is it possible to make a scripted field which can access the set of values? I.e. is my code, or the way I'm indexing the data, wrong?
If not, is there an alternative approach within Metricbeat? I don't want to have to make a whole new API to do the calculation and add a separate field.
-- update.
Weirdly it seems that the number values in the array do return the expected results. ie.
doc['http.f5pools.items.ratio']
returns
{
"_id": "BT6WdWsBXQSGm_pQBbCa",
"pool.MemberCount": [
1,
1
]
},
-- update 2
OK, so if the strings in the field have different values then you get all the values. If they are the same you just get one. wtf?
I'm adding another answer instead of deleting my previous one which is not the actual question but still may be helpful for someone else in future.
I found a hint in the same documentation:
Doc values are a columnar field value store
Upon googling this further I found this Doc Value Intro, which says that doc values are essentially an "uninverted index" useful for operations like sorting. My hypothesis is that while sorting you essentially don't want the same values repeated, and hence the data structure they use removes those duplicates. That still did not answer why it works differently for strings than for numbers: numbers are preserved, but strings are filtered down to unique values.
This “uninverted” structure is often called a “column-store” in other
systems. Essentially, it stores all the values for a single field
together in a single column of data, which makes it very efficient for
operations like sorting.
In Elasticsearch, this column-store is known as doc values, and is
enabled by default. Doc values are created at index-time: when a field
is indexed, Elasticsearch adds the tokens to the inverted index for
search. But it also extracts the terms and adds them to the columnar
doc values.
A deeper dive into doc values revealed that it is a compression technique which actually de-duplicates the values for efficient and memory-friendly operations.
Here's a NOTE given on the link above which answers the question:
You may be thinking "Well that’s great for numbers, but what about
strings?" Strings are encoded similarly, with the help of an ordinal
table. The strings are de-duplicated and sorted into a table, assigned
an ID, and then those ID’s are used as numeric doc values. Which means
strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using
fixed, variable or prefix-encoded strings.
Also, if you don't want this behavior, you can disable doc values.
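To illustrate the behaviour seen in the question, here is a toy Python model. It is not Elasticsearch code. It assumes (this is my understanding of the underlying Lucene storage, not something the quoted docs state) that a multi-valued keyword field is kept per document as a sorted set, while a multi-valued numeric field is kept as a sorted list that preserves duplicates.

```python
def keyword_doc_values(values):
    """Model of per-document doc values for a keyword field:
    sorted and de-duplicated."""
    return sorted(set(values))


def numeric_doc_values(values):
    """Model of per-document doc values for a numeric field:
    sorted, duplicates kept."""
    return sorted(values)


if __name__ == "__main__":
    # Two pool members with the same monitor string collapse to one value.
    print(keyword_doc_values(["default", "default"]))
    # Two pool members with the same ratio are both kept.
    print(numeric_doc_values([1, 1]))
```

Under that assumption, counting doc['...'].length on a string field counts distinct values, not array entries, which matches what "update 2" in the question observed.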
OK, solved it.
https://discuss.elastic.co/t/problem-looping-through-array-in-each-doc-with-painless/90648
So as I discovered arrays are prefiltered to only return distinct values (except in the case of ints apparently?)
The solution is to use params._source instead of doc[]
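For anyone landing here, a sketch of what counting from the source could look like. PAINLESS_COUNT is a Painless script kept as a Python string; I have not run it against a cluster, and the value 'default' is just taken from the sample document above (the real "is up" field and value in the F5 response are not shown in the question). count_matching is a plain Python model of the same loop so the logic can be checked locally.

```python
# Painless script (untested sketch): walk the original array in _source,
# which keeps duplicates, and count the entries whose monitor is 'default'.
PAINLESS_COUNT = """
int n = 0;
for (def item : params['_source']['http']['f5pools']['items']) {
  if (item['monitor'] == 'default') { n += 1; }
}
return n;
"""


def count_matching(source: dict, field: str, wanted: str) -> int:
    """Python model of the script: count array items whose `field` equals `wanted`."""
    items = source.get("http", {}).get("f5pools", {}).get("items", [])
    return sum(1 for item in items if item.get(field) == wanted)


if __name__ == "__main__":
    sample_source = {
        "http": {"f5pools": {"items": [{"monitor": "default"}, {"monitor": "default"}]}}
    }
    print(count_matching(sample_source, "monitor", "default"))
```

Reading _source is slower than doc values because the whole source has to be loaded and parsed for every document, so this is a trade-off rather than a free fix.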
Here is why doc doesn't work.
Quoting below:
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
Doc-values can only return "simple" field values like numbers, dates,
geo-points, terms, etc, or arrays of these values if the field is
multi-valued. It cannot return JSON objects
Also, important to add a null check as mentioned below:
Missing fields
The doc['field'] will throw an error if field is
missing from the mappings. In painless, a check can first be done with
doc.containsKey('field') to guard accessing the doc map.
Unfortunately, there is no way to check for the existence of the field
in mappings in an expression script.
Also, here is why _source works
Quoting below:
The document _source, which is really just a special stored field, can
be accessed using the _source.field_name syntax. The _source is loaded
as a map-of-maps, so properties within object fields can be accessed
as, for example, _source.name.first.
Responding to your comment with an example:
The keyword here is: It cannot return JSON objects. The field doc['http.f5pools.items'] is a JSON object.
Try running below and see the mapping it creates:
PUT t5/doc/2
{
"items": [
{
"monitor": "default"
},
{
"monitor": "default"
}
]
}
GET t5/_mapping
{
"t5" : {
"mappings" : {
"doc" : {
"properties" : {
"items" : {
"properties" : {
"monitor" : { <-- monitor is a property of items property(Object)
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
}

What is the difference between a field and a property in Elasticsearch?

I'm currently trying to understand the difference between fields (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html) and properties (https://www.elastic.co/guide/en/elasticsearch/reference/current/properties.html).
They are both somehow defined as a "subfield/subproperty" of a type/mapping property, both can have separate types and analyzers (as far as I understood it), both are accessed by the dot notation (mappingProperty.subField or mappingProperty.property).
I have the feeling that the docs use the terms "field" and "property" interchangeably, for example:
Type mappings, object fields and nested fields contain sub-fields,
called properties.
What is the difference between properties and (sub-)fields? How do I decide if I have a property or a field?
In other words, how do I decide if I use
{
"mappings": {
"_doc": {
"properties": {
"myProperty": {
"properties": {
}
}
}
}
}
}
or
{
"mappings": {
"_doc": {
"properties": {
"myProperty": {
"fields": {
}
}
}
}
}
}
Subfields are indexed from the parent property's source value, while sub-properties need to have a "real" value in the document's source.
If your source contains a real object, you need to create properties. Each property will correspond to a different value from your source.
If you only want to index the same value in different ways (for example with different analyzers), then use subfields.
It is often useful to index the same field in different ways for
different purposes. This is the purpose of multi-fields. For instance,
a string field could be mapped as a text field for full-text search,
and as a keyword field for sorting or aggregations:
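A small Python sketch to make the difference concrete. The names author, title and raw are made up for illustration, and the typeless mapping layout assumes Elasticsearch 7 or later. The mapping declares author with properties (each one needs its own value in the source) and title with a fields entry (title.raw is derived from the single title value).

```python
MAPPING = {
    "mappings": {
        "properties": {
            # Object with sub-properties: each has its own value in _source.
            "author": {
                "properties": {
                    "first": {"type": "keyword"},
                    "last": {"type": "keyword"},
                }
            },
            # One value in _source, indexed twice: as text and as keyword.
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            },
        }
    }
}

DOC = {"author": {"first": "Ada", "last": "Lovelace"}, "title": "Analytical Engine"}


def source_paths(doc: dict, prefix: str = "") -> set:
    """Return the dotted paths of all leaf values present in a source document."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= source_paths(value, path + ".")
        else:
            paths.add(path)
    return paths


if __name__ == "__main__":
    print(sorted(source_paths(DOC)))
```

The source contains author.first and author.last, but no title.raw: that sub-field exists only in the index, although both kinds are addressed with the same dot notation in queries.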
(Sorry, I find it hard to explain =| )
Note: This is an explanation from my current understanding. It may not be 100% accurate.
A property is what we used to call a field in an RDBMS (a standard relational database like MySQL). It stores properties of an object and provides the high-level structure for an index (which we can compare to a table in a relational DB).
A field, which is linked (or included) into the property concept, is a way to index that property using a specific analyzer.
So let's say you have:
One analyzer (A) to uppercase
One analyzer (B) to lowercase
One analyzer (C) to translate to Spanish (this doesn't even exist, just to give you an idea)
What an analyzer does is transform the input (the text of a property) into a series of tokens that will be indexed. When you search, the same analyzer is normally applied to the query text, so it is transformed into the same kind of tokens; those tokens are then used to find documents in the index, and each matching document is given a score.
(A) Dog = DOG
(B) Dog = dog
(C) Dog = perro
To search using a specific field configuration you call it using a dot:
The text field uses the standard analyzer.
The text.english field uses the English analyzer.
So the fields basically allow you to perform searches using different token generation models.
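Here is a toy Python model of that idea. It is not how Elasticsearch is implemented, and the two analyzers are made-up stand-ins. Each "field" is an inverted index built with its own analyzer, and a search only finds documents when the query is turned into the same kind of tokens.

```python
def analyze_lower(text):
    """Toy analyzer (B): split on whitespace and lowercase."""
    return [token.lower() for token in text.split()]


def analyze_upper(text):
    """Toy analyzer (A): split on whitespace and uppercase."""
    return [token.upper() for token in text.split()]


def build_index(docs, analyzer):
    """Build a token -> set of document ids map (a tiny inverted index)."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, analyzer):
    """Analyze the query and return ids of documents containing any of its tokens."""
    found = set()
    for token in analyzer(query):
        found |= index.get(token, set())
    return found


if __name__ == "__main__":
    docs = {1: "Dog", 2: "Cat"}
    lower_index = build_index(docs, analyze_lower)
    print(search(lower_index, "DOG", analyze_lower))
    print(search(lower_index, "DOG", analyze_upper))
```

Searching the lowercase index with the lowercase analyzer finds the document, while using the uppercase analyzer on the same index finds nothing, because the tokens no longer line up.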

nested count aggregations in elasticsearch

I have a type in Elasticsearch where each user can post any number of posts (the fields are "userid" and "post"). Now I need the count of users who posted 0 posts, 1 post, 2 posts and so on. How do I do it? I think it needs some nested aggregations, but I don't know how to proceed. Thanks in advance!
The best way of doing this is to add a separate field to store the number of posts.
Scripts are not very efficient, because their values are re-evaluated each time a query executes. With a separate field the value is indexed properly, which makes queries and aggregations very fast.
Of course you need to be sure you update this count each time you update the document.
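A sketch of how that could look, assuming one document per user and a hypothetical numeric field named post_count that you maintain yourself. AGG_BODY is the search body for a plain terms aggregation on that field; users_per_post_count is a local Python model of the buckets it should produce.

```python
from collections import Counter

# Search body: one bucket per distinct post_count value, no hits returned.
AGG_BODY = {
    "size": 0,
    "aggs": {
        "users_by_post_count": {
            "terms": {"field": "post_count"}
        }
    },
}


def users_per_post_count(user_docs):
    """Model of the aggregation: map each post count to the number of users having it."""
    return dict(Counter(doc["post_count"] for doc in user_docs))


if __name__ == "__main__":
    users = [
        {"userid": "a", "post_count": 0},
        {"userid": "b", "post_count": 1},
        {"userid": "c", "post_count": 1},
        {"userid": "d", "post_count": 2},
    ]
    print(users_per_post_count(users))
```

Users with zero posts only show up if they have a document with post_count set to 0. Note also that a terms aggregation returns a limited number of buckets by default, so its size parameter may need raising if there are many distinct counts.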
You can use script in aggregation:
POST index_name/type_name/_search
{
"aggs": {
"group By Post Count": {
"terms": {
"script" : "doc['post'].size()"
}
}
}
}
Make sure you enable scripting.
Hope this helps you.

ElasticSearch - Statistical facet on length of string field

I would like to retrieve data about a string field like the min, max and average length (by counting the number of characters inside the string). My issue is that aggregations can only be used for numeric fields. Besides, I tried it using a simple statistical facet,
{
"query":{
"match_all": {}
},
"facets":{
"stat1":{
"statistical":{
"field":"title"
}
}
}
}
but I get shard failures and SearchPhaseExecutionException. When trying with a script field the error returned is an OutOfMemoryError:
{
"query":{
"match_all": {}
},
"script_fields":{
"test1":{"script": "doc[\"title\"].value" }
}
}
Is it possible to retrieve such data about a simple "title" string field using CURL? Thank you!
I haven't actually tried the following, but I believe it should work.
First some useful doc-references:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-statistical-facet.html.
In order to implement the statistical facet, the relevant field values
are loaded into memory from the index. This means that per shard,
there should be enough memory to contain them. Since by default,
dynamic introduced types are long and double, one option to reduce the
memory footprint is to explicitly set the types for the relevant
fields to either short, integer, or float when possible.
I'm not sure exactly how to set the type of the script field to 'short', which is probably what you want in order to reduce memory. It SHOULD be possible, though.
ALSO: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-script-fields.html
It’s important to understand the difference between
doc['my_field'].value and _source.my_field. The first, using the doc
keyword, will cause the terms for that field to be loaded to memory
(cached), which will result in faster execution, but more memory
consumption. Also, the doc[...] notation only allows for simple valued
fields (can’t return a json object from it) and make sense only on
non-analyzed or single term based fields.
So an ALTERNATIVE would be to use _source instead of doc, which would not cache the values.
Gives:
{
"query" : {
"match_all" : {}
},
"facets" : {
"stat1" : {
"statistical" : {
"script" : "doc['title'].value.length()"
//"script" : "_source.title.length()" //ALTERNATIVE which isn't cached
}
}
}
}
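Facets were removed in later Elasticsearch versions in favour of aggregations, so on a newer cluster the same idea might be expressed with a stats aggregation driven by a script. This is an untested sketch: the sub-field title.keyword is an assumption (it exists only if the mapping defines it), and the script will fail on documents that lack the field. length_stats is a local Python model of the numbers the aggregation should report.

```python
# Search body (untested sketch): min/max/avg of the title length via a script.
STATS_BODY = {
    "size": 0,
    "aggs": {
        "title_length": {
            "stats": {
                "script": {
                    "lang": "painless",
                    "source": "doc['title.keyword'].value.length()",
                }
            }
        }
    },
}


def length_stats(titles):
    """Model of the aggregation: min, max and average string length."""
    if not titles:
        return None
    lengths = [len(title) for title in titles]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "avg": sum(lengths) / len(lengths),
    }


if __name__ == "__main__":
    print(length_stats(["a", "abc", "abcde"]))
```

If the lengths are needed often, indexing the length as its own numeric field at write time avoids running a script over every document.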

Resources