How to calculate Elasticsearch Field Size - elasticsearch

The question is: is there a way to identify the most expensive field in an Elasticsearch index?
The aim is to calculate and compare the storage and index size of two fields in an Elasticsearch index.
Also, is it wise to use dual-type fields?
For example, a string in Elasticsearch can be mapped as a text field, which is searchable, plus a .keyword sub-field, which is aggregatable.
Will it use double the storage and index space?

Is it wise to use dual-type fields? Like a string in Elasticsearch has a text field which is searchable and a .keyword field which is aggregatable.
It totally depends on the use case. Maintain both the keyword and the text representation of a field value if:
a) You need advanced search capability on the field.
b) Your current or future requirements include the ability to sort or aggregate on the field.
In real life I have seen that for short text fields like 'name', 'business-name', 'tag', etc. it makes sense to maintain both. But for larger texts, e.g. a description, there are generally no use cases for aggregation and sorting.
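For the storage comparison, one option is the analyze index disk usage API, which reports on-disk size per field. A minimal sketch in Python (the index name "products", the field names, and the local cluster URL are assumptions for illustration, and the _disk_usage endpoint is only available in recent Elasticsearch versions):

import requests

ES = "http://localhost:9200"  # assumed local cluster, no security

# text + .keyword multi-field: the string is indexed twice, once analyzed
# for full-text search and once as a single keyword term for sorting/aggregations
mapping = {
    "mappings": {
        "properties": {
            "business_name": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}
            },
            # large free text: keep only the analyzed text representation
            "description": {"type": "text"}
        }
    }
}
requests.put(f"{ES}/products", json=mapping)

# ... index some documents here ...

# Analyze index disk usage: the response breaks storage down per field,
# so the text and .keyword representations of the same string can be compared
usage = requests.post(
    f"{ES}/products/_disk_usage", params={"run_expensive_tasks": "true"}
).json()
print(usage["products"]["fields"].get("business_name"))
print(usage["products"]["fields"].get("business_name.keyword"))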

Related

elasticsearch - Tag data with lookup table values

I’m trying to tag my data according to a lookup table.
The lookup table has these fields:
• Key: represents the field name in the data I want to tag.
In the real data the field is a subfield of the “Headers” field.
An example for the “Key” field:
“*Server*” (* is a wildcard).
• Value: represents the wanted value of the field mentioned above.
The value in the lookup table is only part of the string in the real data value.
An example for the “Value” field:
“Avtech”.
• Vendor: the value I want to add to the real data if a field-value combination is found in a document.
An example for combination in the real data:
“Headers.Server : Linux/2.x UPnP/1.0 Avtech/1.0”
A match for that document in the lookup table would be:
Key = Server (with a wildcard on both sides).
Value = Avtech (with a wildcard on both sides).
Vendor = Avtech
So basically I’ll need to add a field to that document with the value “Avtech”.
The subfields in “Headers” are dynamic fields that change from document to document.
If a match is not found, I’ll need to set the tag field to the value “Unknown”.
I’ve tried to use the enrich processor, with the lookup table as the source data, where the match field is “Value” and the enrich field is “Vendor”.
In the enrich processor I didn’t know how to refer to the field, since it’s dynamic, and I wanted to check whether the value appears anywhere in the “Headers” subfields.
Also, I don’t think there will be a match between the “Value” in the lookup table and the value of the Headers subfield, since the “Value” field in the lookup table is a substring with wildcards on both sides.
I could use some help accomplishing what I’m trying to do, and with how to search with wildcards inside an enrich processor.
Or, if you have another idea besides the enrich processor, such as a parent-child or lookup-terms mechanism.
Thanks!
Adi.
There are two ways to accomplish this:
Using a combination of Logstash & Elasticsearch
Using only the Elasticsearch ingest node
Constraint: You need to know the position of the Vendor term occurring in the Header field.
Approach 1
If so, you can use the GROK filter to extract the term and, based on the term found, do a lookup to get the corresponding value.
Reference
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_static.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_streaming.html
Approach 2
Create an index consisting of KV pairs. In the ingest node, create a pipeline which consists of a Grok processor followed by an Enrich processor. The Grok part works the same way as in Approach 1, and you seem to have already got the Enrich part working.
Reference
https://www.elastic.co/guide/en/elasticsearch/reference/current/grok-processor.html
If you are able to isolate the subfield within the Header where the term of interest is present, it will make things easier for you.
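For what it's worth, a rough sketch of Approach 2 (the lookup index name "vendor-lookup", the policy and pipeline names, the subfield "headers.Server", and the grok pattern are all assumptions; the pattern assumes the vendor is the last name/version pair in the header, per the constraint above):

import requests

ES = "http://localhost:9200"  # assumed local cluster

# Enrich policy over the lookup index: match on "Value", bring back "Vendor"
policy = {
    "match": {
        "indices": "vendor-lookup",
        "match_field": "Value",
        "enrich_fields": ["Vendor"]
    }
}
requests.put(f"{ES}/_enrich/policy/vendor-policy", json=policy)
requests.post(f"{ES}/_enrich/policy/vendor-policy/_execute")

# Ingest pipeline: grok extracts an exact vendor token, enrich looks it up,
# and a final set processor fills in "Unknown" when nothing matched
pipeline = {
    "processors": [
        {
            "grok": {
                "field": "headers.Server",
                # assumes the vendor is the last "<name>/<version>" pair,
                # e.g. "Linux/2.x UPnP/1.0 Avtech/1.0" -> "Avtech"
                "patterns": ["%{WORD:vendor_token}/%{NOTSPACE}$"],
                "ignore_failure": True
            }
        },
        {
            "enrich": {
                "policy_name": "vendor-policy",
                "field": "vendor_token",
                "target_field": "vendor",
                "ignore_missing": True
            }
        },
        {
            # default to "Unknown" when no lookup match was found
            "set": {"field": "vendor.Vendor", "value": "Unknown", "override": False}
        }
    ]
}
requests.put(f"{ES}/_ingest/pipeline/tag-vendor", json=pipeline)

Because the token is extracted exactly before the lookup, the enrich match can stay a plain term match and no wildcard search is needed inside the processor.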

ElasticSearch and Searching in Arrays

We have an ES index with a field that stores its data as an array. In this field we include the original text, plus the text without any punctuation, special characters, etc. The problem is that, when searching on the field, the multiple values appear to be skewing the score.
For example, if we search on the term 'up', a document with the array ['up, up and away', 'up up and away'] scores higher with a multi_match (which we use because we may search more than one field) than a document whose array is simply ['up'].
In the end, I guess what I am looking for is a score that emulates calculating a score for each item in the array and returning the highest. I believe in this case comparing 'up' against 'Up' and against 'Up, Up and Away' would give a higher score for 'Up'.
Based on my research, I believe I may need to do custom scoring on this field. If that is true, is "score_mode": "max" what I want?
I think you slightly over-engineered your index. You don't need to create duplicate fields for the same information or remove punctuation and lowercase the values yourself.
I'd recommend you read up on Elasticsearch token filters and on how to create multiple analyzers for the same field.
For your exact use case, a document sample would certainly help. But in any case, looking at what you are dealing with: index your array of strings with the default analyzer and with a custom one that you build yourself. Then you can use the same field, but with different analyzers (differently processed text), to control your score.
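As an illustration of that suggestion, a minimal sketch (index and field names are made up): keep a single array field, attach a custom analyzer alongside the default one, and weight the sub-fields in the query to shape the score.

import requests

ES = "http://localhost:9200"  # assumed local cluster

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "cleaned": {
                    "type": "custom",
                    # the standard tokenizer already drops most punctuation
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "titles": {
                "type": "text",  # default analyzer
                "fields": {
                    "cleaned": {"type": "text", "analyzer": "cleaned"}
                }
            }
        }
    }
}
requests.put(f"{ES}/songs", json=index_body)

# The same stored array is now searchable through both analyzers; the boost
# on the sub-field is one knob for steering the relative score
query = {
    "query": {
        "multi_match": {
            "query": "up",
            "fields": ["titles", "titles.cleaned^2"]
        }
    }
}
print(requests.get(f"{ES}/songs/_search", json=query).json())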

Elasticsearch query on string representation of number

Good day:
I have an indexed field called amount, which is of string type. The value of amount can be either one or 1. Say in this example we have amount=1 in an indexed document; if I search for one, Elasticsearch will not return the document unless I use 1 in the search query. Thoughts on how I can get this to work? I'm thinking a tokenizer is what's needed.
Thanks.
You probably don't want this for sevenmillionfourhundredfifteenthousandtwohundredfourteen and the like, but only for a small number of values.
At index time I would convert everything to a proper number and store it in a numerical field, which then even allows sorting, if you need it. Apart from this, I would use synonyms at index and at query time and map everything to the digit strings, but in a general text field that is searched by default.
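A small sketch of the synonym part of that idea (the index name, field name, and the handful of synonym rules are assumptions):

import requests

ES = "http://localhost:9200"  # assumed local cluster

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "number_words": {
                    "type": "synonym",
                    # map the word forms you actually expect onto digit strings
                    "synonyms": ["one => 1", "two => 2", "three => 3"]
                }
            },
            "analyzer": {
                "amount_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "number_words"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "amount": {"type": "text", "analyzer": "amount_text"}
        }
    }
}
requests.put(f"{ES}/payments", json=index_body)
requests.put(f"{ES}/payments/_doc/1", json={"amount": "1"},
             params={"refresh": "true"})

# both queries now resolve to the same term "1" in the index
for q in ("1", "one"):
    body = {"query": {"match": {"amount": q}}}
    hits = requests.get(f"{ES}/payments/_search", json=body).json()
    print(q, hits["hits"]["total"])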

How are fields associated with terms in the inverted index in Elasticsearch?

As per my understanding, Elasticsearch uses a structure called an inverted index to provide full-text search. It is clear that the inverted index holds terms and the IDs of the documents containing each term, but a document can have any number of fields, and a field name can be used at query time to search only on that field. In that case, how does Elasticsearch restrict/limit the search to a particular field? I would like to know whether the inverted index contains the field name or a field ID along with the terms and document IDs.
A similar thing happens when you sort on a field, so there must be a way to associate terms with field names. Please help me understand the intricacies involved here.
Thanks in advance.
I would like to know if the inverted index contains the field name or a field ID along with the terms and document IDs.
Quoting from Lucene Docs
The same string in two different fields is considered a different term. Thus terms are represented as a pair of strings, the first naming the field, and the second naming text within the field.
In that case how does Elasticsearch restrict/limit the search to a particular field?
Each segment maintains term vectors: for each field in each document, a term vector may be stored. A term vector consists of the term text and the term frequency.
Hence, indexes are maintained for each field in each document.
We have one inverted index per field, per index.
And there is something called the field data cache (or doc values), which holds the "uninverted" index: that is where doc-to-field-value lookups happen.
I was also having this question, and I can share my understanding with you.
Elasticsearch creates an inverted index for each full-text field of the document. So if an index has 10 fields that allow full-text search, Elasticsearch will create 10 different inverted indices for those 10 fields and store the analyzer output in the inverted index of each field.
Thus, when you perform a search operation and specify which fields you want to search, Elasticsearch will search only the inverted indices of those specific fields.
To summarize, an inverted index is created at the field level.
I hope that helps
Thanks
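One quick way to see this per-field structure from the outside is the _termvectors API, which returns terms grouped by field. A small sketch (index, document, and field names are made up):

import requests

ES = "http://localhost:9200"  # assumed local cluster

requests.put(f"{ES}/articles/_doc/1",
             json={"title": "quick brown fox", "body": "the quick fox jumps"},
             params={"refresh": "true"})

tv = requests.get(
    f"{ES}/articles/_termvectors/1",
    json={"fields": ["title", "body"], "term_statistics": True}
).json()

# terms come back grouped per field under term_vectors.<field>.terms:
# the token "quick" is tracked separately for "title" and for "body"
for field, data in tv["term_vectors"].items():
    print(field, sorted(data["terms"].keys()))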

Sorting Solr multivalue fields based on field values

I have multiple Solr instances with separate schemas.
I need to receive a multivalued field in sorted order, e.g. by type: train_station, airport, city_district, and so on:
q=köln&sort=query({!v="type:(airport OR train_station)"}) desc
I would like to see airport-type documents before train_station-type documents. For now I always get the train_station type at the top.
How should I write the query?
You are getting train_stations at the top because of IDF.
A quick hack to fix it would be to use range queries (which have the advantage of constant scores) and query boosts: q=köln&sort=query({!v="type:([airport TO airport]^3 OR [train_station TO train_station]^2)"}) desc.
This way, documents which have airport in their type field will have a score of 3, documents which have train_station in their type field will have a score of 2, and documents which have both airport and train_station in their type field will have a score of 3+2=5 (up to a multiplicative constant).
A more elegant (and effective) way of doing this would be to write a custom query parser (or even a function query).
You can sort on a function only if it returns a single value per document. You definitely can't sort on a multiValued field or on any field that is tokenized. It seems like you would need a function that returns "airport" if the field contains "airport" (even if it also contains "train station"), and "train station" if it contains "train station" but not "airport", and then sort on that.
Another option would be to handle this at index time. Add a field called "airport_train_station_sort" whose value is 1 if the field contains "airport", 2 if it contains "train station" but NOT "airport", and 3 if it contains neither. Then simply sort on that field.
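A sketch of that index-time option (the core name "places", the field names, and the sample documents are made up; it assumes the derived field exists in the schema or is picked up in schemaless mode):

import requests

SOLR = "http://localhost:8983/solr/places"  # assumed core

def type_rank(types):
    # 1 = airport, 2 = train_station (without airport), 3 = anything else
    if "airport" in types:
        return 1
    if "train_station" in types:
        return 2
    return 3

docs = [
    {"id": "1", "name": "Koeln Hbf", "type": ["train_station"]},
    {"id": "2", "name": "Koeln/Bonn", "type": ["airport", "train_station"]},
]
for doc in docs:
    doc["airport_train_station_sort"] = type_rank(doc["type"])

# send the documents to Solr's JSON update endpoint and commit
requests.post(f"{SOLR}/update", params={"commit": "true"}, json=docs)

# query side: sort on the derived single-valued field, not the multivalued one
params = {"q": "köln", "sort": "airport_train_station_sort asc"}
print(requests.get(f"{SOLR}/select", params=params).json())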
You cannot solve this problem inside SOLR. Check the documentation: SOLR does not sort multivalued fields. Older versions of SOLR let you try, but the results were undefined and unpredictable.
You either change your schema and put this sort data into single-valued indexed fields, or you need to make several queries: first for airports, then for city districts, then for train stations.
To order items within the field itself you have to either index them in the order you want or do post-processing. Solr's sort will only sort docs!
