Can I extract the actual value of not_analyzed field when _source is disabled? - elasticsearch

I have the following mapping:
{
"articles":{
"mappings":{
"article":{
"_all":{
"enabled":false
},
"_source":{
"enabled":false
},
"properties":{
"content":{
"type":"string",
"norms":{
"enabled":false
}
},
"url":{
"type":"string",
"index":"not_analyzed"
}
}
}
},
"settings":{
"index":{
"refresh_interval":"30s",
"number_of_shards":"20",
"analysis":{
"analyzer":{
"default":{
"filter":[
"icu_folding",
"icu_normalizer"
],
"type":"custom",
"tokenizer":"icu_tokenizer"
}
}
},
"number_of_replicas":"1"
}
}
}
}
The question is: is it possible to somehow extract the actual values of the url field, given that it is not_analyzed and _source is disabled? I need to do this only once for this index, so even a hacky way would be acceptable.
I know that not_analyzed means the string won't be tokenized, so it makes sense to me that it should be stored somewhere, but I don't know whether it is stored as a hash or 1:1, and I couldn't find information about this in the documentation.
My servers are running ES version 1.4.4 with JVM: 1.8.0_31

You can read the field data to retrieve the URL from the document. This reads straight from the ES index, so you get exactly what you are "matching" on: in this case, the exact URL you indexed, because the field is not analyzed.
Using the example index you provided, I indexed two URLs (on a smaller subset of your provided index):
POST /articles/article/1
{
"url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
}
POST /articles/article/2
{
"url":"http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
}
This query then returns a new "fields" object for each hit:
GET /articles/article/_search
{
"fielddata_fields" : ["url"]
}
Giving us these results:
"hits": [
{
"_index": "articles",
"_type": "article",
"_id": "2",
"_score": 1,
"fields": {
"url": [
"http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
]
}
},
{
"_index": "articles",
"_type": "article",
"_id": "1",
"_score": 1,
"fields": {
"url": [
"https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
]
}
}
]
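Since this only has to be done once, the values can be collected from responses shaped like the one above with a few lines of Python. This is a minimal sketch and not part of the original answer: the `extract_urls` helper and the `sample` dict are hypothetical, and it assumes the response body has already been parsed from JSON. To cover a whole index you would page through the results (for example with the scroll API) and pass each page to the same helper.

```python
def extract_urls(response, field="url"):
    """Map each hit's _id to the first fielddata value of `field`."""
    urls = {}
    for hit in response.get("hits", {}).get("hits", []):
        values = hit.get("fields", {}).get(field, [])
        if values:  # hits without the field are skipped
            urls[hit["_id"]] = values[0]
    return urls


# Parsed response with the same shape as the results shown above.
sample = {
    "hits": {
        "hits": [
            {
                "_id": "2",
                "fields": {
                    "url": [
                        "http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
                    ]
                },
            },
            {
                "_id": "1",
                "fields": {
                    "url": [
                        "https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
                    ]
                },
            },
        ]
    }
}

for doc_id, url in sorted(extract_urls(sample).items()):
    print(doc_id, url)
```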
Hope that helps!

Related

Elastic returns unexpected result from Search using edge_ngram

I am working out how to store my data in elasticsearch. First I tried the fuzzy function and while that worked okay I did not receive the expected results. Afterwards I tried the ngram and then the edge_ngram tokenizer. The edge_ngram tokenizer looked like it works like an autocomplete. Exactly what I needed. But it still gives unexpected results. I configured min 1 and max 5 to get all results starting with the first letter I search for. While this works I still get those results as I continue typing.
Example: I have a name field filled with documents named The New York Times and The Guardian. Now when I search for T both occur as expected. But the same happens when I search for TT, TTT and so on.
It does not matter whether I execute the search in Kibana or from my application (which uses MultiMatch on all fields). Kibana even shows me that it matched the single letter T.
So what did I miss and how can I achieve getting the results like with an autocomplete but without having too many results?
When defining your index mapping, you need to set search_analyzer to standard. If no search_analyzer is defined explicitly, Elasticsearch uses the field's analyzer at search time as well. With the edge_ngram analyzer applied to the query, TT is itself split into the tokens t and tt, and the t token matches every name that starts with T.
Adding a working example with index data, mapping, search query and search result:
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 5,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard" // note this
}
}
}
}
Index Data:
{
"name":"The Guardian"
}
{
"name":"The New York Times"
}
Search Query:
{
"query": {
"match": {
"name": "T"
}
}
}
Search Result:
"hits": [
{
"_index": "69027911",
"_type": "_doc",
"_id": "1",
"_score": 0.23092544,
"_source": {
"name": "The New York Times"
}
},
{
"_index": "69027911",
"_type": "_doc",
"_id": "2",
"_score": 0.20824991,
"_source": {
"name": "The Guardian"
}
}
]
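The effect of the search analyzer can be reproduced outside Elasticsearch. The sketch below is a simplified model, not Elasticsearch code: `words`, `edge_ngrams`, `standard_terms` and `matches` are hypothetical helpers, the standard analyzer is approximated by lowercasing and splitting on non-letters, and a match query is modelled as "any query term is among the indexed tokens".

```python
def words(text):
    """Split on anything that is not a letter, like token_chars: ["letter"]."""
    return "".join(ch if ch.isalpha() else " " for ch in text).split()


def edge_ngrams(text, min_gram=1, max_gram=5):
    """Lowercased prefixes of each word, from min_gram to max_gram characters."""
    tokens = []
    for word in words(text.lower()):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


def standard_terms(text):
    """Rough stand-in for the standard analyzer: whole lowercased words."""
    return words(text.lower())


def matches(indexed_text, query_terms):
    """True if any query term equals one of the indexed edge n-grams."""
    indexed = set(edge_ngrams(indexed_text))
    return any(term in indexed for term in query_terms)


name = "The Guardian"
# Same analyzer at search time: "TT" also yields the token "t", which matches.
print(edge_ngrams("TT"), matches(name, edge_ngrams("TT")))
# Standard analyzer at search time: only "tt" is searched, which is not indexed.
print(standard_terms("TT"), matches(name, standard_terms("TT")))
```

With the query analyzed as whole words, each additional typed character narrows the result set, which is the autocomplete behaviour the question asks for.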

Username search in Elasticsearch

I want to implement a simple username search within Elasticsearch. I don't want weighted username searches yet, so I expected it wouldn't be too hard to find resources on how to do this. But in the end, I came across NGrams and a lot of outdated Elasticsearch tutorials, and I completely lost track of the best practice for doing this.
This is my current setup, but it is really bad because it matches so many unrelated usernames:
{
"settings": {
"index" : {
"max_ngram_diff": "11"
},
"analysis": {
"analyzer": {
"username_analyzer": {
"tokenizer": "username_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"username_tokenizer": {
"type": "ngram",
"min_gram": "1",
"max_gram": "12"
}
}
}
},
"mappings": {
"properties": {
"_all" : { "enabled" : false },
"username": {
"type": "text",
"analyzer": "username_analyzer"
}
}
}
}
I am using the newest Elasticsearch and I just want to query similar/exact usernames. I have a user db and users should be able to search for each other, nothing too fancy.
If you want to search for exact usernames, you can use the term query.
A term query returns documents that contain an exact term in the given field. If you have not defined an explicit index mapping, you need to add .keyword to the field name. This uses the keyword analyzer instead of the standard analyzer.
There is no need to use an n-gram tokenizer if you want to search for the exact term.
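To see why the n-gram setup in the question matches so many unrelated usernames, it helps to list the tokens it produces. The sketch below is a simplified model rather than Elasticsearch code: `ngrams` is a hypothetical helper that mimics an ngram tokenizer with min_gram 1 and max_gram 12 followed by a lowercase filter. Because no search_analyzer is set in that mapping, the query text is broken up the same way.

```python
def ngrams(text, min_gram=1, max_gram=12):
    """All lowercased substrings of `text` with length min_gram..max_gram."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


indexed = ngrams("Jack")   # tokens stored for the username "Jack"
query = ngrams("John")     # tokens produced from the search text "John"

# A single shared one-character token is enough for a match query to hit.
print(sorted(indexed & query))
```

Any two usernames that share a single character end up matching each other, which is why exact lookups are better served by the keyword field.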
Adding a working example with index data, index mapping, search query, and search result:
Index Mapping:
{
"mappings": {
"properties": {
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Index Data:
{
"username": "Jack"
}
{
"username": "John"
}
Search Query:
{
"query": {
"term": {
"username.keyword": "Jack"
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"username": "Jack"
}
}
]
Edit 1:
To match similar terms, you can use the fuzziness parameter with the match query:
{
"query": {
"match": {
"username": {
"query": "someting",
"fuzziness":"auto"
}
}
}
}
The search result will be:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "3",
"_score": 0.6065038,
"_source": {
"username": "something"
}
}
]
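The reason "someting" finds "something" is that the two strings are one edit apart, and "fuzziness": "auto" allows up to two edits for terms longer than five characters (one edit for three to five characters, none for shorter terms). The sketch below illustrates this with a plain Levenshtein distance. Both helpers are hypothetical, and Elasticsearch itself uses a Damerau-Levenshtein variant that also counts a transposition as a single edit, so treat it as an approximation.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def auto_fuzziness(term):
    """Edits allowed by the default AUTO setting, based on term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


query, indexed = "someting", "something"
print(levenshtein(query, indexed), auto_fuzziness(query))
```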

How to index a field in elasticsearch but not store it in _source?

I have a collection of documents with a text field "features", and would like to make this field indexed (so documents can be searched through the field) but not stored (in order to save disk space).
How to index a field in elasticsearch like this "features" field but not store it in _source?
The following index mapping indexes the field value but does not keep it as a separate stored field:
Index Mapping:
{
"mappings": {
"properties": {
"features": {
"type": "text",
"index": "true",
"store": "false"
}
}
}
}
Index Data:
{
"features": "capacity"
}
Search Query:
{
"stored_fields": [
"features"
]
}
Search Result:
"hits": [
{
"_index": "67155998",
"_type": "_doc",
"_id": "1",
"_score": 1.0
}
]
UPDATE 1:
When a field is indexed, you can run queries on it. When a field is stored, its contents can be returned when the document matches.
But if you also want the content of the field not to appear in _source, then you need to disable the _source field.
You need to modify your index mapping as follows:
{
"mappings": {
"_source": {
"enabled": false
},
"properties": {
"features": {
"type": "text",
"index": "true",
"store": "false"
}
}
}
}
Search Query:
{
"query":{
"match":{
"features":"capacity"
}
}
}
Search Result:
"hits": [
{
"_index": "67155998",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821
}
]
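Disabling _source removes every field from the returned documents. If only the one field should be left out, the _source mapping also accepts an excludes list. This goes beyond the original answer, so treat it as an untested sketch: the `mapping_excluding` helper is hypothetical and only builds the request body for creating the index.

```python
import json


def mapping_excluding(field, field_type="text"):
    """Index-creation body that indexes `field` but leaves it out of _source."""
    return {
        "mappings": {
            "_source": {"excludes": [field]},
            "properties": {field: {"type": field_type}},
        }
    }


body = mapping_excluding("features")
print(json.dumps(body, indent=2))
```

The field stays searchable, but anything that relies on reading it back from _source will no longer see it.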

Elasticsearch query returns strange sorted (score based) result

I'm using Elasticsearch v5.3.2
I have the following mapping:
{
"mappings":{
"info":{
"_all":{
"enabled": false
},
"properties":{
"info":{
"properties":{
"email":{
"doc_values":"false",
"fields":{
"ngram":{
"analyzer":"custom_nGram_analyzer",
"type":"text"
}
},
"type":"keyword"
}
}
}
}
}
},
"settings":{
"analysis":{
"analyzer":{
"custom_nGram_analyzer":{
"filter":[
"lowercase",
"asciifolding",
"custom_nGram_filter"
],
"tokenizer":"whitespace",
"type":"custom"
}
},
"filter":{
"custom_nGram_filter":{
"max_gram":16,
"min_gram":3,
"type":"ngram"
}
}
}
}
}
I see very strange results in terms of document scores when I execute the following query:
GET /info_2017_08/info/_search
{
"query": {
"multi_match": {
"query": "hotmail",
"fields": [
"info.email.ngram"
]
}
}
}
It brings the following results:
"hits": {
"total": 3,
"max_score": 1.3834574,
"hits": [
{
"_index": "info_2017_08",
"_type": "info",
"_id": "AV4uQnCjzNcTF2GMY730",
"_score": 1.3834574,
"_source": {
"info": {
"email": "pv53p8vg#gmail.com"
}
}
},
{
"_index": "info_2017_08",
"_type": "info",
"_id": "AV4uQm93zNcTF2GMY73x",
"_score": 0.3967861,
"_source": {
"info": {
"email": "-vb6sbw54#hotmail.com"
}
}
},
{
"_index": "info_2017_08",
"_type": "info",
"_id": "AV4uQmYbzNcTF2GMY73P",
"_score": 0.36409757,
"_source": {
"info": {
"email": "985pu4c.r02a#gmail.com"
}
}
}
]
}
Now pay attention to the scores. How come the first result has a higher score than the second one, when the first one is ...#gmail.com and the second one is ...#hotmail.com, and I searched for the term "hotmail"?
The second one should match the query with ngrams "mail" and "hotmail", while the first one will match the query only by ngram "mail", so what is the reason for such an outcome?
Thanks in advance.
Elasticsearch calculates the score of a document on each shard independently, using that shard's own TF/IDF statistics. Because of that, if you have two shards with the following content:
Shard 1: "info.email": "985pu4c.r02a#gmail.com"
Shard 2: "info.email": "1085pu4c.r02a#gmail.com", "info.email": "-vb6sbw54#hotmail.com"
then, for your specific query, the single document on the first shard will have a higher score than any document on the second shard.
You can examine the content of each shard using the following API call: GET index/_search?preference=_shards:0
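A small calculation shows how per-shard statistics can skew the ranking. The sketch below is a deliberately simplified illustration: it assumes the BM25 similarity that is the default in Elasticsearch 5.x, looks only at the IDF part of the score for a single n-gram, and uses the two example shards above. `bm25_idf` and `idf_on_shard` are hypothetical helper names.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25 inverse document frequency as computed by Lucene."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_on_shard(docs, gram):
    """IDF of `gram` using only the documents that live on one shard."""
    doc_freq = sum(1 for doc in docs if gram in doc)
    return bm25_idf(len(docs), doc_freq)


shard_1 = ["985pu4c.r02a#gmail.com"]
shard_2 = ["1085pu4c.r02a#gmail.com", "-vb6sbw54#hotmail.com"]

# The same n-gram carries a different weight depending on the shard.
print(round(idf_on_shard(shard_1, "mail"), 4))
print(round(idf_on_shard(shard_2, "mail"), 4))
```

With only a handful of documents these differences dominate the ranking. Running the search with search_type=dfs_query_then_fetch makes Elasticsearch collect term statistics across all shards before scoring, which is one way to check whether the shard distribution is the cause.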

I can't get nested stored_fields

All.
I'm using Elasticsearch 5.0 and I have the following mapping:
{
"mappings": {
"object":{
"properties":{
"attributes":{
"type":"nested",
"properties":{
"name": { "type": "keyword", "store":true},
"value": { "type": "text", "store":true },
"converted": {"type": "double", "store":true},
"datetimestamp": { "type": "date", "store":true}
}
}
}
}
}
}
Then I add one document:
{
"attributes":[
{"name":"attribute_string", "value":"string_value","converted":null,"datetimestamp":null},
{"name":"attribute_double", "value":"1234.567","converted":1234.567,"datetimestamp":null},
{"name":"attribute_datetime", "value":"2015-01-01T12:10:30Z","converted":null,"datetimestamp":"2015-01-01T12:10:30Z"}
]
}
When I query with "stored_fields", I don't get the fields in the results:
_search
{
"stored_fields":["attributes.converted"]
}
Results:
{
"_index": "test_index",
"_type": "object",
"_id": "1",
"_score": 1
}
But when I use "_source":["attributes.converted"], I get a result:
{
"_index": "test_index",
"_type": "object",
"_id": "1",
"_score": 1,
"_source": {
"attributes": [
{ "converted": null },
{ "converted": 1234.567 },
{ "converted": null }
]
}
}
What is the proper way to use stored_fields?
Does using "_source" affect performance compared to the "stored_fields" approach?
If the "_source" approach is as fast as "stored_fields", should I remove "store":true from the fields?
Thank you.
You're using nested types, so use inner_hits.
In the nested case, documents are returned based on matches in nested inner objects.
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html
As per the Elasticsearch docs:
On its own, stored_fields cannot be used to load fields in nested objects — if a field contains a nested object in its path, then no data will be returned for that stored field. To access nested fields, stored_fields must be used within an inner_hits block.
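Following the quoted documentation, the request would move stored_fields into an inner_hits block of a nested query. The sketch below only builds such a request body. The `nested_stored_fields_query` helper is hypothetical, the match_all inner query is an assumption chosen so that every nested object is returned, and the body has not been run against a 5.0 cluster.

```python
import json


def nested_stored_fields_query(path, fields):
    """Search body that asks for stored fields of nested objects via inner_hits."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"match_all": {}},  # assumption: match every nested object
                "inner_hits": {"stored_fields": fields},
            }
        }
    }


body = nested_stored_fields_query("attributes", ["attributes.converted"])
print(json.dumps(body, indent=2))
```

The stored values should then appear under each hit's inner_hits section rather than at the top level of the hit.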
