How Keyword and Numeric data Types are stored in elastic search? is it stored in inverted index? - elasticsearch

put sana/_mapping/learn { "properties": { "name":{"type":"text"}, "age":{"type":"integer"} } }
POST sana/learn { "name":"rosy", "age":23 }

Quoting the Elasticsearch doc:
Most fields are indexed by default, which makes them searchable. The
inverted index allows queries to look up the search term in unique
sorted list of terms, and from that immediately have access to the
list of documents that contain the term.
Keyword and numeric data types are also indexed and stored in the inverted index so that these fields are searchable, but if you want you can disable it by setting index type to false, in your index mapping, also on these fields(keyword,numeric) doc_values is enabled by default sorting and aggregations etc, but not enabled on analyzed string(text) fields.
Hope I answered your question and let me know if you have any doubt.

Related

Type of field for prefix search in Elastic Search

I'm confused on what index type I should apply for my field for prefix search, many show search_as_you_type but I think auto complete is not what I'm going for.
I have a UUID field:
id: 34y72ca1-3739-41ff-bbec-f6d17479384c
The following terms should return the doc above:
3
34
34y72ca1
34y72ca1-3739
34y72ca1-3739-41ff-bbec-f6d17479384c
Using 3739 should not return it as it doesn't start with 3739. Initially this is what I was going for but then the wildcard field is not supported by Amazon AWS, so I compromise for prefix search instead of partial search.
I tried search_as_you_type field but it doesn't return the result when I use the whole ID. Actually, my use case is when user click enter, the results will be shown, instead of real-live when they type, so if speed is compromised its OK, just that I hope for something that will be good for many rows of data.
Thanks
If you have not explicitly defined any index mapping, then you need to use id.keyword field instead of the id field for the prefix query to show the appropriate results. This uses the keyword analyzer instead of the standard analyzer
{
"query": {
"prefix": {
"id.keyword": {
"value": "34y72ca1"
}
}
}
}
Otherwise, you can modify your index mapping, by adding multi fields for id field

What is the difference between source filtering, stored fields, and doc values in elsaticsearch?

I've read the docs for source filtering, stored fields, and doc values.
In certain situations it can make sense to store a field. For instance, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field
The stored_fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.
All fields which support doc values have them enabled by default.
Example 1
I have documents with title (short string), and content (>1MB). I want to search for matching titles, and return the title.
With source filtering
GET /_search
{ _source: "obj.title", ... }
With stored fields
GET /_search
{ _source: false, stored_fields: ["title"], ... }
With doc values
GET /_search
{_source: false, stored_fields: "_none_", docvalue_fields: "title", ... }
Okay, so
Will the source filtered request read the full _source, title and content, from disk then apply the filter and return only the title, or will elasticsearch only read the title from disk?
Will the source filtered reques use doc values?
Do stored fields store the analyzed tokens or the original value?
Are stored fields or doc values more or less efficient than _source?
Will the source filtered request read the full _source, title and content, from disk then apply the filter and return only the title, or will elasticsearch only read the title from disk?
The document you send for indexing to Elasticsearch will be stored in a field called _source (by default). So this means that if your document contains a large amount of data (like in the content field in your case), the full content will be stored in the _source field. When using source filtering, first the whole source document must be retrieved from the _source field and then only the title field will be returned. You're wasting space because nothing really happens with the content field, since you're searching on title and returning only the title value.
In your case, you'd be better off to not store the _source document, and only store the title field (but it has some disadvantages, too, so read this before you do), basically like this:
PUT index
{
"mappings": {
"_source": {
"enabled": false
},
"properties": {
"title": {
"type": "text",
"store": true
},
"content": {
"type": "text"
}
}
}
}
Will the source filtered request use doc values?
doc-values are enabled by default on all fields, except on analyzed text fields. If you use _source filtering, it's not using doc values, as explained above, the _source field is retrieved and the fields you specified are filtered.
Do stored fields store the analyzed tokens or the original value?
Stored fields store the exact value as present in the _source document
Are stored fields or doc values more or less efficient than _source?
doc_values is a different beast, it's more of a optimization to store the tokens of non-analyzed fields in a way to will make it easy to sort, filter and aggregate on those values.
Stored fields (default is false) are also an optimization if you don't want to store the full source but only a few important fields (as explained above).
The _source field itself is a stored field that contains the whole document.

ElasticSearch: Metric aggregation and doc values / field-data

How does ES internally implement metric aggregations ?
Suppose documents in the index have below structure:
{
category: A,
measure: 20
}
Would for the below query which does terms aggregation on category and calculate sum(measure), the 'measure' field values
be extracted from the document (i.e. _source) and summed or
would the values be taken from doc-values / field data of 'measure' field
Query:
{
size: 0,
aggs: {
cat_aggs: {
terms: {
field: 'category'
},
aggs: {
sumAgg: {
sum: {field: 'measure'}
}
}
}
}
}
From the official documentation on metrics aggregations (emphasis added):
The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.
If you're using a newer ES 2.x version, then doc_values have become the norm over field data.
All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space
So to answer your question clearly, metrics aggregations are computed based on either field data or doc values that have been stored at indexing time, i.e. not computed based on source parsing at query time, unless your doing it from a script which accesses the _source directly.

analyzed vs not_analyzed: storage size

I recently started using ElasticSearch 2. And as I undestand analyzed vs not_analyzed in the mapping, not_analyzed should be better in storage (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 and https://www.elastic.co/blog/elasticsearch-storage-the-true-story).
For testing purposes I created some indexes with all the String field as analyzed (by default) and then I created some other indexes with all the fields as not_analyzed, my surprise came when I checked the size of the indexes and I saw that the indexes with the not_analyzed Strings were 40% bigger!! I was inserting the same documents in each index (35000 docs).
Any idea why this is happening? My documents are simple JSON documents. I have 60 String fields in each document that I want to set as not_analyzed and I tried both setting each field as not analyzed and also creating a dynamic template.
I edit for adding the mapping, although I think it has nothing special:
{
"mappings": {
"my_type" : {
"_ttl" : { "enabled" : true, "default" : "7d" },
"properties" : {
"field1" : {
"properties" : {
"field2" : {
"type" : "string", "index" : "not_analyzed"
}
more not_analyzed String fields here
...
...
...
}
not_analyzed fields are still indexed. They just don't have any transformations applied to them beforehand ("analysis" - in Lucene parlance).
As an example:
(Doc 1) "The quick brown fox jumped over the lazy dog"
(Doc 2) "Lazy like the fox"
Simplified postings list created by Standard Analyzer (default for analyzed string fields - tokenized, lowercased, stopwords removed):
"brown": [1]
"dog": [1]
"fox": [1,2]
"jumped": [1]
"lazy": [1,2]
"over": [1]
"quick": [1]
30 characters worth of string data
Simplified postings list created by "index": "not_analyzed":
"The quick brown fox jumped over the lazy dog": [1]
"Lazy like the fox": [2]
62 characters worth of string data
Analysis causes input to get tokenized and normalized for the purpose of being able to look up documents using a term.
But as a result, the unit of text is reduced to a normalized term (vs an entire field with not_analyzed), and all the redundant (normalized) terms across all documents are collapsed into a single logical list saving you all the space that would normally be consumed by repeated terms and stopwords.
From the documentation, it looks like not_analyzed makes the field act like a "keyword" instead of a "full-text" field -- let's compare these two!
Full text
These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed.
Keyword
Keyword fields are not_analyzed. Instead, the exact string value is added to the index as a single term.
I'm not surprised that storing an entire string as a term, rather than breaking it into a list of terms, doesn't necessarily translate to saved space. Honestly, it probably depends on the index's analyzer and the string being indexed.
As a side note, I just re-indexed about a million documents of production data and cut our index disk space usage by ~95%. The main difference I made was modifying what was actually saved in the source (AKA stored). We indexed PDFs for searching, but did not need them to be returned and so that saved us from saving this information in two different ways (analyzed and raw). There are some very real downsides to this, though, so be careful!
Doc1{
"name":"my name is mayank kumar"
}
Doc2.{
"name":"mayank"
}
Doc3.{
"name":"Mayank"
}
We have 3 documents.
So if field 'name' is 'not_analyzed' and we search for 'mayank' only second document would be returned.If we search for 'Mayank' only third document would be returned.
If field 'name' is 'analyzed' by a analyser 'lowercase analyser'(just as a example).We we search for 'mayank', all 3 documents would be returned.
If we search for 'kumar' ,first document would be returned.This happens because in first document the field value gets tokenised as "my" "name" "is" "mayank" "kumar"
'not_analyzed' is basically used for 'full-text' search(mostly except in wildcards matching).less space on disk.Takes less time during indexing.
'analyzed' is basically used for matching documents.more space on disk (if the analyze fields are big).Takes more time during indexing.(More fields due to analyze fields)

How to search fields with '-' characters in elastic search

I am new to elastic search. I have got following document where one of the field "eventId" has "-" in value.
When i try to search with complete value of eventId, i don't get any results.
Sample Document app/event
{
"tags": {}
"eventId": "cc98d57b-c6bc-424c-b54c-df1e3df0d942",
}
I haven't created any explicit settings for my index.
Thanks.
you should check if the tokenizer splits your value into multiple fields. Maybe your value is stored as 5 fields: "cc98d57b", "c6bc", "424c", "b54c" and "df1e3df0d942"
You can analyze that with the 'Kopf' Plugin (https://github.com/lmenezes/elasticsearch-kopf).
If that is your problem you should change your field mapping, so that the value is not analyzed ("index" : "not_analyzed").
For an example how to set that mapping see here: Elasticsearch mapping settings 'not_analyzed' and grouping by field in Java
After that, you should be able to search for your specific value.

Resources