How to index a field in elasticsearch but not store it in _source?

I have a collection of documents with a text field "features", and I would like this field to be indexed (so documents can be searched on it) but not stored (to save disk space).
How can I index a field like this "features" field in elasticsearch but not store it in _source?

The following index mapping will index the field value but not store it.
Index Mapping:
{
  "mappings": {
    "properties": {
      "features": {
        "type": "text",
        "index": true,
        "store": false
      }
    }
  }
}
Index Data:
{
  "features": "capacity"
}
Search Query:
{
  "stored_fields": [
    "features"
  ]
}
Search Result:
"hits": [
{
"_index": "67155998",
"_type": "_doc",
"_id": "1",
"_score": 1.0
}
]
UPDATE 1:
When a field is indexed, you can run queries against it. When a field is stored, its contents can be returned when a document matches.
But if you also want the content of the field not to appear in _source, you need to disable the _source field.
Modify your index mapping as follows:
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "features": {
        "type": "text",
        "index": true,
        "store": false
      }
    }
  }
}
Search Query:
{
  "query": {
    "match": {
      "features": "capacity"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67155998",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821
}
]
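If you only need to keep this one field out of _source, rather than disabling _source for the whole document, mapping-level source filtering is another option. A minimal sketch for the same "features" field; note that a value excluded from _source can still be searched, but it can no longer be displayed, reindexed, or updated from _source later:
{
  "mappings": {
    "_source": {
      "excludes": [
        "features"
      ]
    },
    "properties": {
      "features": {
        "type": "text"
      }
    }
  }
}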

Related

Username search in Elasticsearch

I want to implement a simple username search within Elasticsearch. I don't want weighted username searches yet, so I expected it wouldn't be too hard to find resources on how to do this. But in the end I came across NGrams and a lot of outdated Elasticsearch tutorials, and I completely lost track of the best practice for how to do this.
This is my setup now, but it is really bad because it matches so many unrelated usernames:
{
  "settings": {
    "index": {
      "max_ngram_diff": "11"
    },
    "analysis": {
      "analyzer": {
        "username_analyzer": {
          "tokenizer": "username_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "username_tokenizer": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "12"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "_all": { "enabled": false },
      "username": {
        "type": "text",
        "analyzer": "username_analyzer"
      }
    }
  }
}
I am using the newest Elasticsearch and I just want to query similar/exact usernames. I have a user DB, and users should be able to search for each other; nothing too fancy.
If you want to search for exact usernames, you can use the term query.
The term query returns documents that contain an exact term in a provided field. If you have not defined an explicit index mapping, you need to add .keyword to the field name; the .keyword sub-field is of type keyword, so the value is indexed as-is instead of going through the standard analyzer.
There is no need for an n-gram tokenizer if you only want to search for the exact term.
Adding a working example with index data, index mapping, search query, and search result
Index Mapping:
{
  "mappings": {
    "properties": {
      "username": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Index Data:
{
  "username": "Jack"
}
{
  "username": "John"
}
Search Query:
{
  "query": {
    "term": {
      "username.keyword": "Jack"
    }
  }
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"username": "Jack"
}
}
]
Edit 1:
To match similar terms, you can use the fuzziness parameter along with the match query:
{
  "query": {
    "match": {
      "username": {
        "query": "someting",
        "fuzziness": "auto"
      }
    }
  }
}
Search Result will be
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "3",
"_score": 0.6065038,
"_source": {
"username": "something"
}
}
]

How to search over all fields and return every document containing that search in elasticsearch?

I have a problem with searching in Elasticsearch.
I have an index with multiple documents that have several fields. I want to be able to search over all the fields with a single query and have it return every document that contains the value specified in the query. I found that simple_query_string worked well for this; however, it does not return consistent results. In my index I have documents with several fields that contain dates, for example:
"revisionDate" : "2008-01-01T00:00:00",
"projectSmirCreationDate" : "2008-07-01T00:00:00",
"changedDate" : "1971-01-01T00:00:00",
"dueDate" : "0001-01-01T00:00:00",
Those are just a few examples. However, when I query the index with, for example:
GET new_document-20_v2/_search
{
  "size": 1000,
  "query": {
    "simple_query_string": {
      "query": "2008"
    }
  }
}
It only returns two documents. This is a problem, because I have many more documents than just two that contain the value "2008" in their fields.
I also have problems searching file names.
In my index there are fields that contain fileNames like this:
"fileName" : "testPDF.pdf",
"fileName" : "demo.pdf",
"fileName" : "demo.txt",
When I query:
GET new_document-20_v2/_search
{
  "size": 1000,
  "query": {
    "simple_query_string": {
      "query": "demo"
    }
  }
}
I get no results.
But if I query:
GET new_document-20_v2/_search
{
  "size": 1000,
  "query": {
    "simple_query_string": {
      "query": "demo.txt"
    }
  }
}
I get the proper result.
Is there a better way to search across all documents and fields than what I did? I want it to return all the documents matching the query, not just two or zero.
Any help would be greatly appreciated.
Elasticsearch uses the standard analyzer if no analyzer is specified. Since no analyzer is specified on "fileName", demo.txt gets tokenized to:
{
  "tokens": [
    {
      "token": "demo.txt",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
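You can reproduce this output with the _analyze API; a minimal sketch (the index name here is only illustrative):
POST new_document-20_v2/_analyze
{
  "analyzer": "standard",
  "text": "demo.txt"
}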
Now, searching for demo will not return any result, but searching for demo.txt will.
You can instead use a wildcard query to search for documents having demo in fileName:
{
  "query": {
    "wildcard": {
      "fileName": {
        "value": "demo*"
      }
    }
  }
}
Search Result will be
"hits": [
{
"_index": "67303015",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"fileName": "demo.pdf"
}
},
{
"_index": "67303015",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"fileName": "demo.txt"
}
}
]
Since revisionDate, projectSmirCreationDate, changedDate, and dueDate are all of type date, you cannot do a partial search on these dates.
You can use multi-fields to add one more field (of type text) to each of the above fields. Modify your index mapping as shown below:
{
  "mappings": {
    "properties": {
      "changedDate": {
        "type": "date",
        "fields": {
          "raw": {
            "type": "text"
          }
        }
      },
      "projectSmirCreationDate": {
        "type": "date",
        "fields": {
          "raw": {
            "type": "text"
          }
        }
      },
      "dueDate": {
        "type": "date",
        "fields": {
          "raw": {
            "type": "text"
          }
        }
      },
      "revisionDate": {
        "type": "date",
        "fields": {
          "raw": {
            "type": "text"
          }
        }
      }
    }
  }
}
Index Data:
{
  "revisionDate": "2008-02-01T00:00:00",
  "projectSmirCreationDate": "2008-02-01T00:00:00",
  "changedDate": "1971-01-01T00:00:00",
  "dueDate": "0001-01-01T00:00:00"
}
{
  "revisionDate": "2008-01-01T00:00:00",
  "projectSmirCreationDate": "2008-07-01T00:00:00",
  "changedDate": "1971-01-01T00:00:00",
  "dueDate": "0001-01-01T00:00:00"
}
Search Query:
{
  "query": {
    "multi_match": {
      "query": "2008"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67303015",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"revisionDate": "2008-01-01T00:00:00",
"projectSmirCreationDate": "2008-07-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
},
{
"_index": "67303015",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"revisionDate": "2008-02-01T00:00:00",
"projectSmirCreationDate": "2008-02-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
}
]
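If you prefer to be explicit about which fields get searched instead of relying on the default field list, the same query can target the text sub-fields directly; a sketch using the field names from the mapping above:
{
  "query": {
    "multi_match": {
      "query": "2008",
      "fields": [
        "revisionDate.raw",
        "projectSmirCreationDate.raw",
        "changedDate.raw",
        "dueDate.raw"
      ]
    }
  }
}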

Scoring higher for shorter fields

I'm trying to get a higher score (or at least the same score) for the shortest values in Elasticsearch.
Let's say I have these documents: "Abc", "Abca", "Abcb", "Abcc". The field label.ngram uses an edge n-gram analyzer.
With a really simple query like this:
{
  "query": {
    "match": {
      "label.ngram": {
        "query": "Ab"
      }
    }
  }
}
I always get the documents "Abca", "Abcb", and "Abcc" first, instead of "Abc".
How can I get "Abc" first?
(should I use this: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html?)
Thanks!
This is happening due to field-length normalization. To get the same score, you have to disable norms on the field.
Norms store various normalization factors that are later used at query time in order to compute the score of a document relative to a query.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "norms": false,
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{
  "title": "Abca"
}
{
  "title": "Abcb"
}
{
  "title": "Abcc"
}
{
  "title": "Abc"
}
Search Query:
{
  "query": {
    "match": {
      "title": {
        "query": "Ab"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "65953349",
"_type": "_doc",
"_id": "1",
"_score": 0.1424427,
"_source": {
"title": "Abca"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "2",
"_score": 0.1424427,
"_source": {
"title": "Abcb"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "3",
"_score": 0.1424427,
"_source": {
"title": "Abcc"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "4",
"_score": 0.1424427,
"_source": {
"title": "Abc"
}
}
]
As mentioned by @ESCoder, disabling norms fixes the scoring, but it is not very useful if you want to rank your search results: it causes all the documents in your results to get the same score, which hurts the relevance of your search results considerably.
Maybe you should tweak the document-length normalization parameter of the default similarity algorithm (BM25) if you are on ES 5.x or higher. I tried doing this with your dataset and my settings, but couldn't get it to work.
A second option, which will mostly work as you suggested, is to store the length of the field in a separate field and use that for ranking. You would have to populate it from your application, since the analysis process generates multiple tokens for the same field. This is extra overhead, though, and I would prefer tweaking the similarity parameter.
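For reference, this is roughly what tweaking the BM25 length-normalization parameter could look like; a sketch only (the similarity name my_bm25 is illustrative), with b set to 0 so that field length no longer affects the score of that field:
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "b": 0
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "my_bm25"
      }
    }
  }
}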

I can't get nested stored_fields

I'm using Elasticsearch 5.0 and I have the following mapping:
{
  "mappings": {
    "object": {
      "properties": {
        "attributes": {
          "type": "nested",
          "properties": {
            "name": { "type": "keyword", "store": true },
            "value": { "type": "text", "store": true },
            "converted": { "type": "double", "store": true },
            "datetimestamp": { "type": "date", "store": true }
          }
        }
      }
    }
  }
}
Then I add one document:
{
  "attributes": [
    { "name": "attribute_string", "value": "string_value", "converted": null, "datetimestamp": null },
    { "name": "attribute_double", "value": "1234.567", "converted": 1234.567, "datetimestamp": null },
    { "name": "attribute_datetime", "value": "2015-01-01T12:10:30Z", "converted": null, "datetimestamp": "2015-01-01T12:10:30Z" }
  ]
}
When I query with "stored_fields", I don't get any fields in the results:
_search
{
  "stored_fields": ["attributes.converted"]
}
Results:
{
  "_index": "test_index",
  "_type": "object",
  "_id": "1",
  "_score": 1
}
But when I use "_source": ["attributes.converted"], I get a result:
{
  "_index": "test_index",
  "_type": "object",
  "_id": "1",
  "_score": 1,
  "_source": {
    "attributes": [
      { "converted": null },
      { "converted": 1234.567 },
      { "converted": null }
    ]
  }
}
What is the proper way to use stored_fields?
Does using "_source" affect performance compared to the "stored_fields" approach?
If the "_source" approach is as fast as "stored_fields", should I remove "store": true from the fields?
Thank you.
You're using nested types, so use inner_hits.
In the nested case, documents are returned based on matches in nested inner objects.
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html
As per the Elasticsearch docs:
On its own, stored_fields cannot be used to load fields in nested objects — if a field contains a nested object in its path, then no data will be returned for that stored field. To access nested fields, stored_fields must be used within an inner_hits block.
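A minimal sketch of what that could look like for this mapping (the match_all here is just a placeholder; in practice you would put your real nested query inside):
GET test_index/_search
{
  "query": {
    "nested": {
      "path": "attributes",
      "query": {
        "match_all": {}
      },
      "inner_hits": {
        "stored_fields": [
          "attributes.converted"
        ]
      }
    }
  }
}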

Can I extract the actual value of not_analyzed field when _source is disabled?

I have the following mapping:
{
  "articles": {
    "mappings": {
      "article": {
        "_all": {
          "enabled": false
        },
        "_source": {
          "enabled": false
        },
        "properties": {
          "content": {
            "type": "string",
            "norms": {
              "enabled": false
            }
          },
          "url": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    },
    "settings": {
      "index": {
        "refresh_interval": "30s",
        "number_of_shards": "20",
        "analysis": {
          "analyzer": {
            "default": {
              "filter": [
                "icu_folding",
                "icu_normalizer"
              ],
              "type": "custom",
              "tokenizer": "icu_tokenizer"
            }
          }
        },
        "number_of_replicas": "1"
      }
    }
  }
}
The question is: is it possible to somehow extract the actual values of the url field, given that it is not_analyzed and _source is not enabled? I need to do this only once for this index, so even a hacky way would be acceptable.
I know that not_analyzed means the string won't be tokenized, so it makes sense to me that it should be stored somewhere, but I don't know whether it is stored as hashes or 1:1, and I couldn't find information about this in the documentation.
My servers are running ES version 1.4.4 with JVM: 1.8.0_31
You can read the field data to retrieve the url from the document. We will be reading straight from the ES index, so we will get exactly what we are "matching" on; in this case, the exact URL you indexed, since it is not analyzed.
Using the example index you provided, I indexed two URLs (on a smaller subset of your provided mapping):
POST /articles/article/1
{
  "url": "https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
}

POST /articles/article/2
{
  "url": "http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
}
And then this query will provide me a new "fields" object for each hit:
GET /articles/article/_search
{
  "fielddata_fields": ["url"]
}
Giving us these results:
"hits": [
{
"_index": "articles",
"_type": "article",
"_id": "2",
"_score": 1,
"fields": {
"url": [
"http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
]
}
},
{
"_index": "articles",
"_type": "article",
"_id": "1",
"_score": 1,
"fields": {
"url": [
"https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
]
}
}
]
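Note that fielddata_fields matches the ES 1.4.4 version in the question. On recent Elasticsearch versions the same idea is expressed with docvalue_fields; a sketch (mapping types are gone from the URL on modern versions):
GET /articles/_search
{
  "docvalue_fields": ["url"]
}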
Hope that helps!
