I can't get nested stored_fields - elasticsearch

Hi all,
I'm using Elasticsearch 5.0 and I have the following mapping:
{
  "mappings": {
    "object": {
      "properties": {
        "attributes": {
          "type": "nested",
          "properties": {
            "name": { "type": "keyword", "store": true },
            "value": { "type": "text", "store": true },
            "converted": { "type": "double", "store": true },
            "datetimestamp": { "type": "date", "store": true }
          }
        }
      }
    }
  }
}
Then I add one document:
{
"attributes":[
{"name":"attribute_string", "value":"string_value","converted":null,"datetimestamp":null},
{"name":"attribute_double", "value":"1234.567","converted":1234.567,"datetimestamp":null},
{"name":"attribute_datetime", "value":"2015-01-01T12:10:30Z","converted":null,"datetimestamp":"2015-01-01T12:10:30Z"}
]
}
When I query with "stored_fields", I don't get the fields back in the results:
_search
{
"stored_fields":["attributes.converted"]
}
Results:
{
"_index": "test_index",
"_type": "object",
"_id": "1",
"_score": 1
}
But when I use "_source": ["attributes.converted"], I do get a result:
{
"_index": "test_index",
"_type": "object",
"_id": "1",
"_score": 1,
"_source": {
"attributes": [
{ "converted": null },
{ "converted": 1234.567 },
{ "converted": null }
]
}
}
What is the proper way to use stored_fields?
Does using "_source" affect performance compared to the "stored_fields" approach?
If the "_source" approach is as fast as "stored_fields", should I remove "store": true from the fields?
Thank you.

You're using nested types, so use inner_hits.
In the nested case, documents are returned based on matches in nested inner objects.
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html

As per the Elasticsearch docs:
On its own, stored_fields cannot be used to load fields in nested objects — if a field contains a nested object in its path, then no data will be returned for that stored field. To access nested fields, stored_fields must be used within an inner_hits block.
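For example, with the mapping from the question, a nested query whose inner_hits block requests the stored field should return the values per matching nested object. This is only a sketch (the match_all inner query is just a placeholder to select every nested attribute):
{
  "query": {
    "nested": {
      "path": "attributes",
      "query": { "match_all": {} },
      "inner_hits": {
        "stored_fields": ["attributes.converted"]
      }
    }
  }
}
Each hit should then carry an inner_hits section with a fields entry for attributes.converted, instead of the empty hits shown above.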

Related

Username search in Elasticsearch

I want to implement a simple username search within Elasticsearch. I don't want weighted username searches yet, so I expected it wouldn't be too hard to find resources on how to do this. But in the end, I came across NGrams and a lot of outdated Elasticsearch tutorials, and I completely lost track of the best practice for this.
This is my current setup, but it is really bad because it matches so many unrelated usernames:
{
  "settings": {
    "index": {
      "max_ngram_diff": "11"
    },
    "analysis": {
      "analyzer": {
        "username_analyzer": {
          "tokenizer": "username_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "username_tokenizer": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "12"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "_all": { "enabled": false },
      "username": {
        "type": "text",
        "analyzer": "username_analyzer"
      }
    }
  }
}
I am using the newest Elasticsearch and I just want to query similar/exact usernames. I have a user DB and users should be able to search for each other, nothing too fancy.
If you want to search for exact usernames, then you can use the term query.
The term query returns documents that contain an exact term in a provided field. If you have not defined an explicit index mapping, you need to add .keyword to the field name; this uses the keyword analyzer instead of the standard analyzer.
There is no need to use an n-gram tokenizer if you want to search for the exact term.
Adding a working example with index data, index mapping, search query, and search result
Index Mapping:
{
  "mappings": {
    "properties": {
      "username": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Index Data:
{
"username": "Jack"
}
{
"username": "John"
}
Search Query:
{
"query": {
"term": {
"username.keyword": "Jack"
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"username": "Jack"
}
}
]
Edit 1:
To match for similar terms, you can use the fuzziness parameter along with the match query
{
"query": {
"match": {
"username": {
"query": "someting",
"fuzziness":"auto"
}
}
}
}
Search Result will be
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "3",
"_score": 0.6065038,
"_source": {
"username": "something"
}
}
]

How to index a field in elasticsearch but not store it in _source?

I have a collection of documents with a text field "features", and I would like this field to be indexed (so documents can be searched on it) but not stored (in order to save disk space).
How can I index a field like this "features" field in Elasticsearch but not store it in _source?
The following index mapping will index the field value but not store it.
Index Mapping:
{
  "mappings": {
    "properties": {
      "features": {
        "type": "text",
        "index": "true",
        "store": "false"
      }
    }
  }
}
Index Data:
{
"features": "capacity"
}
Search Query:
{
"stored_fields": [
"features"
]
}
Search Result:
"hits": [
{
"_index": "67155998",
"_type": "_doc",
"_id": "1",
"_score": 1.0
}
]
UPDATE 1:
When a field is indexed, you can run queries against it. When a field is stored, its contents can be returned when a document matches.
But if you also don't want the content of the field to appear in _source, then you need to disable the _source field.
You need to modify your index mapping as follows:
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "features": {
        "type": "text",
        "index": "true",
        "store": "false"
      }
    }
  }
}
Search Query:
{
"query":{
"match":{
"features":"capacity"
}
}
}
Search Result:
"hits": [
{
"_index": "67155998",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821
}
]

Scoring higher for shorter fields

I'm trying to get a higher score (or at least the same score) for the shortest values in Elasticsearch.
Let's say I have these documents: "Abc", "Abca", "Abcb", "Abcc". The field label.ngram uses an EdgeNgram analyser.
With a really simple query like that:
{
"query": {
"match": {
"label.ngram": {
"query": "Ab"
}
}
}
}
I always get the documents "Abca", "Abcb", "Abcc" first, instead of "Abc".
How can I get "Abc" first?
(should I use this: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html?)
Thanks!
This is happening due to field-length normalization; to get the same score, you have to disable norms on the field.
Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "norms": false,
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{
"title": "Abca"
}
{
"title": "Abcb"
}
{
"title": "Abcc"
}
{
"title": "Abc"
}
Search Query:
{
"query": {
"match": {
"title": {
"query": "Ab"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65953349",
"_type": "_doc",
"_id": "1",
"_score": 0.1424427,
"_source": {
"title": "Abca"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "2",
"_score": 0.1424427,
"_source": {
"title": "Abcb"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "3",
"_score": 0.1424427,
"_source": {
"title": "Abcc"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "4",
"_score": 0.1424427,
"_source": {
"title": "Abc"
}
}
]
As mentioned by @ESCoder, disabling norms will fix the scoring, but it is not very useful if you care about ranking your search results: it causes all the documents in your results to have the same score, which hurts the relevance of your search results big time.
Maybe you should tweak the document length norm parameter (b) of the default similarity algorithm (BM25) if you are on ES 5.x or higher. I tried doing this with your dataset and my settings but couldn't get it to work.
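For reference, a custom BM25 similarity with a larger b value (stronger length normalization; the default is 0.75) can be declared in the index settings and assigned to the field. This is only a sketch of the mechanism, not a configuration verified to fix the edge n-gram case; the my_bm25 name is arbitrary and the analysis section from the mapping above is omitted:
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "b": 1.0
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "my_bm25"
      }
    }
  }
}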
The second option, which will mostly work, is the one you suggested: store the length of the field in a separate field, populated from your application (after the analysis process, various tokens are generated from the same field, so the length has to come from the application side). But this is extra overhead, and I would prefer doing it by tweaking the similarity algorithm parameter.
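A rough sketch of that second option (the title_length field, its values, and the sort clause are assumptions for illustration; my_analyzer refers to the edge_ngram analysis settings from the mapping above, which are omitted here): index the length alongside the title and use it as a tie-breaker after the score.
Index Mapping:
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "my_analyzer" },
      "title_length": { "type": "integer" }
    }
  }
}
Index Data (length computed by the application):
{ "title": "Abc", "title_length": 3 }
{ "title": "Abca", "title_length": 4 }
Search Query:
{
  "query": {
    "match": {
      "title": {
        "query": "Ab"
      }
    }
  },
  "sort": [
    { "_score": "desc" },
    { "title_length": "asc" }
  ]
}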

Can I extract the actual value of not_analyzed field when _source is disabled?

I have the following mapping:
{
  "articles": {
    "mappings": {
      "article": {
        "_all": {
          "enabled": false
        },
        "_source": {
          "enabled": false
        },
        "properties": {
          "content": {
            "type": "string",
            "norms": {
              "enabled": false
            }
          },
          "url": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    },
    "settings": {
      "index": {
        "refresh_interval": "30s",
        "number_of_shards": "20",
        "analysis": {
          "analyzer": {
            "default": {
              "filter": ["icu_folding", "icu_normalizer"],
              "type": "custom",
              "tokenizer": "icu_tokenizer"
            }
          }
        },
        "number_of_replicas": "1"
      }
    }
  }
}
The question is: is it possible to somehow extract the actual values of the url field, given that it is not_analyzed and _source is not enabled? I need to do this only once for this index, so even a hacky way would be acceptable.
I know that not_analyzed means the string won't be tokenized, so it makes sense to me that it should be stored somewhere, but I don't know whether it is stored hashed or 1:1, and I couldn't find information about this in the documentation.
My servers are running ES version 1.4.4 with JVM: 1.8.0_31
You can read the field data to retrieve the url values from the documents. We will be reading straight from the ES index, so we get exactly what we are "matching" on, which in this case is the exact URL you indexed, since the field is not analyzed.
Using the example index you provided, I indexed two URLs (into a smaller subset of your provided index):
POST /articles/article/1
{
"url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
}
POST /articles/article/2
{
"url":"http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
}
And then this query will provide me a new "fields" object for each hit:
GET /articles/article/_search
{
"fielddata_fields" : ["url"]
}
Giving us these results:
"hits": [
{
"_index": "articles",
"_type": "article",
"_id": "2",
"_score": 1,
"fields": {
"url": [
"http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
]
}
},
{
"_index": "articles",
"_type": "article",
"_id": "1",
"_score": 1,
"fields": {
"url": [
"https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
]
}
}
]
Hope that helps!

Elasticsearch "strict" mapping not working for fields with null values

I have an index for which I have set the mapping to "dynamic":"strict".
As expected, for the most part, if a field that is not listed in the mapping is introduced, Elasticsearch will reject it.
However, I am finding that any field with a null value is not caught and makes it into my index. Here is what my mapping looks like:
{
  "myindex": {
    "mappings": {
      "mystuff": {
        "dynamic": "strict",
        "_id": {
          "store": true,
          "index": "not_analyzed"
        },
        "_timestamp": {
          "enabled": true,
          "store": true
        },
        "_index": {
          "enabled": true
        },
        "_type": {
          "store": true
        },
        "properties": {
          "entitlements": {
            "type": "nested",
            "properties": {
              "accountNumber": {
                "type": "string",
                "index": "not_analyzed"
              },
              "active": {
                "type": "string",
                "index": "not_analyzed"
              },
              "assetEndDate": {
                "type": "date",
                "format": "date_time_no_millis"
              }
            }
          }
        }
      }
    }
  }
}
EDIT (including example scenarios)
With the mapping above, here are the scenarios I am seeing:
1) When posting a valid document (one that follows the mapping): 200 OK.
posted document:
{
"entitlements": [
{
"accountNumber": "123213",
"active": "true",
"assetEndDate": "2016-10-13T00:00:00Z"
}
]
}
elasticsearch response:
{
"_index": "myindex",
"_type": "mystuff",
"_id": "5",
"_version": 1,
"created": true
}
2) When posting an invalid document (one that does not follow the mapping): 400 StrictDynamicMappingException.
posted document:
{
"entitlements": [
{
"accountNumber":"123213",
"XXXXactive": "true",
"assetEndDate": "2016-10-13T00:00:00Z"
}
]
}
elasticsearch response:
{
"error": "StrictDynamicMappingException[mapping set to strict, dynamic introduction of [Xactive] within [entitlements] is not allowed]",
"status": 400
}
3) When posting an invalid document where the unknown field has a null value: 200 OK.
posted document:
{
"entitlements": [
{
"accountNumber":"123213",
"XXXXactive": null,
"assetEndDate": "2016-10-13T00:00:00Z"
}
]
}
elasticsearch response:
{
"_index": "myindex",
"_type": "mystuff",
"_id": "7",
"_version": 1,
"created": true
}
4) When posting an invalid document where the unknown fields have null values, including a field that appears nowhere in the mapping: 200 OK.
posted document:
{
"entitlements": [
{
"accountNumber":"123213",
"XXXXactive": null,
"assetEndDate": "2016-10-13T00:00:00Z",
"THIS_SHOULD_NOT_BE_HERE": null
}
]
}
elasticsearch response:
{
"_index": "myindex",
"_type": "mystuff",
"_id": "9",
"_version": 1,
"created": true
}
It is the 3rd and 4th scenarios, that I am concerned about.
It looks like this issue (or one very similar) was raised on the Elasticsearch GitHub repository here and has since been closed. However, the problem still appears to be present in version 1.7.
This is being seen locally, as well as on indexes I have deployed with AWS Elasticsearch Service.
Am I making a mistake somewhere, or has anyone found a solution to this problem?
