I have a keyword field that can contain characters with diacritics. Queries without diacritics should return results with those diacritics, but not vice versa. The first part can be resolved by using a normalizer, the configuration for which is also described in a related question. If I use that for e.g. {"title": "Sulgi"} and {"title": "Šulgi"}, searching for "Sulgi" will (correctly) return both documents. However, searching for "Šulgi" also returns both documents, instead of just the one with the diacritic. It seems ES is also normalizing the query input, which is generally good, but is it possible to change that behavior?
PUT _template/test
{
"index_patterns": ["*"],
"settings": {
"analysis": {
"normalizer": {
"exact": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "keyword",
"normalizer": "exact"
}
}
}
}
POST test/_doc/1
{
"title": "Sulgi"
}
POST test/_doc/2
{
"title": "Šulgi"
}
Example search query:
POST test/_search
{
"query": {
"term": {
"title":"Šulgi"
}
}
}
{
"took": 294,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.18232156,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"title": "Šulgi"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"title": "Sulgi"
}
}
]
}
}
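One way to get the asymmetric behaviour (folded queries match everything, diacritic queries match only the diacritic values) would be a second keyword sub-field whose normalizer only lowercases, and then choosing which field to query depending on whether the input contains diacritics. A minimal sketch; the normalizer name exact_diacritics and the sub-field with_diacritics are illustrative and not part of the template above:

PUT _template/test
{
  "index_patterns": ["*"],
  "settings": {
    "analysis": {
      "normalizer": {
        "exact": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        },
        "exact_diacritics": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "normalizer": "exact",
        "fields": {
          "with_diacritics": {
            "type": "keyword",
            "normalizer": "exact_diacritics"
          }
        }
      }
    }
  }
}

A term query for "Šulgi" against title.with_diacritics would then match only the document with the diacritic, while queries against title keep the folded behaviour shown above.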
I've set up a normalizer on an index field to support case-insensitive searches, but I can't seem to get it to work.
GET users/
Returns the following mapping:
{
"users": {
"aliases": {},
"mappings": {
"user": {
"properties": {
"active": {
"type": "boolean"
},
"first_name": {
"type": "keyword",
"fields": {
"normalize": {
"type": "keyword",
"normalizer": "search_normalizer"
}
}
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "users",
"creation_date": "1567936315432",
"analysis": {
"normalizer": {
"search_normalizer": {
"filter": [
"lowercase"
],
"type": "custom"
}
}
},
"number_of_replicas": "1",
"uuid": "5SknFdwJTpmF",
"version": {
"created": "6040299"
}
}
}
}
}
Although first_name is normalized to lowercase, queries on the first_name field are case sensitive.
Using the following query for a user with the first name Dave:
GET users/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name": {
"value": ".*dave.*"
}
}
}
]
}
}
}
GET users/_analyze
{
"analyzer" : "standard",
"text": "Dave"
}
returns
{
"tokens": [
{
"token": "dave",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Although "Dave" is tokenized to "dave" the following query
GET users/_search
{
"query": {
"match": {
"first_name": "dave"
}
}
}
Returns no hits.
Is there an issue with my current mapping, or with the query?
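As a side check (a sketch assuming the mapping above, not part of the original question): since first_name is a keyword field, analyzing "Dave" with the standard analyzer does not show what is actually indexed. _analyze can instead be pointed at a concrete field, which applies that field's own normalizer (or none):

GET users/_analyze
{
  "field": "first_name",
  "text": "Dave"
}

GET users/_analyze
{
  "field": "first_name.normalize",
  "text": "Dave"
}

The first request should return the term Dave unchanged and the second dave, which hints at why a lowercase match on first_name finds nothing.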
I think you have missed first_name.normalize in the query.
Indexing Records
POST test3/test3_type/_bulk
{"index": {}}
{"first_name": "Daveraj"}
{"index": {}}
{"first_name": "RajdaveN"}
{"index": {}}
{"first_name": "Dave"}
Query
POST test3/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name.normalize": {
"value": ".*dave.*"
}
}
}
]
}
}
}
Result
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0,
"hits": [
{
"_index": "test3",
"_type": "test3_type",
"_id": "M8-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Dave"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Mc-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Daveraj"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Ms-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "RajdaveN"
}
}
]
}
}
You have created a normalized multi-field, first_name.normalize, but you are searching on the original field first_name, which has no normalizer, so its terms are indexed exactly as provided (the original case is preserved).
The examples given here might help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
You need to explicitly specify the multi-field you want to search on. Note that even though a multi-field can't have its own content, it can index different terms than its parent field (although not always), because it may be analyzed with different analyzers or character/token filters.
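Concretely (a sketch assuming the mapping from the question), pointing the failing match query at the normalized sub-field should return the document, since both the indexed term and the query input are lowercased there:

GET users/_search
{
  "query": {
    "match": {
      "first_name.normalize": "dave"
    }
  }
}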
Assuming I have a document like this in Elasticsearch:
{
"videoName": "taylor.mp4",
"type": "long"
}
I tried full-text search using the DSL query:
{
"query": {
"match":{
"videoName": "taylor"
}
}
}
I need to get the above document, but I don't. If I specify taylor.mp4, it returns the document.
So I would like to know how to do a full-text search with delimiters.
Edit after KARTHEEK's answer:
The regexp fetches the taylor.mp4 document. But take the situation where the document in the video index is:
{
"videoName": "Akon - smack that.mp4",
"type": "long"
}
So, the query for retrieving this document can be:
{
"query": {
"match":{
"videoName": "smack that"
}
}
}
In this case, the document will be retrieved, since we use smack in the query string; match does the full-text search and gets us the document. But say I only know the keyword that: then match doesn't get the document, and I need to use regexp instead.
{
"query": {
"regexp":{
"videoName": "smack.* that.*"
}
}
}
On the other hand, if I switch to regexp and turn all my query strings into smack.* that.*, this also retrieves no documents. And we don't know which word will carry the .mp4 suffix. So my question is: we need to do the full-text search with match, and it should also handle the delimiters. Is there any other way?
Edit after Richa asked for the index mapping:
For http://localhost:9200/example/videos/_mapping:
{
"example": {
"mappings": {
"videos": {
"properties": {
"query": {
"properties": {
"match": {
"properties": {
"videoName": {
"type": "string"
}
}
}
}
},
"type": {
"type": "string"
},
"videoName": {
"type": "string"
}
}
}
}
}
}
Based on the query you mentioned, we can use a regular expression to get the result. Please find the result below and let me know if there is anything else you need.
curl -XGET "http://localhost:9200/test/sample/_search" -d'
{
"query": {
"regexp":{
"videoName": "taylor.*"
}
}
}'
Result:
{
"took": 22,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "sample",
"_id": "1",
"_score": 1,
"_source": {
"videoName": "taylor.mp4",
"type": "long"
}
}
]
}
}
Please use this mapping
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"videoName": {
"type": "string",
"term_vector": "yes"
}
}
}
}
}
After that, index the document you mentioned earlier:
PUT test_index/doc/1
{
"videoName": "Akon - smack that.mp4",
"type": "long"
}
Output (the response below is from a search, confirming the document was indexed):
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.15342641,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.15342641,
"_source": {
"videoName": "Akon - smack that.mp4",
"type": "long"
}
}
]
}
}
Request to inspect the indexed terms:
GET /test_index/doc/1/_termvector?fields=videoName
Results:
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"videoName": {
"field_statistics": {
"sum_doc_freq": 3,
"doc_count": 1,
"sum_ttf": 3
},
"terms": {
"akon": {
"term_freq": 1
},
"smack": {
"term_freq": 1
},
"that.mp4": {
"term_freq": 1
}
}
}
}
}
Using this, we can now search based on "smack":
POST /test_index/_search
{
"query": {
"match": {
"_all": "smack"
}
}
}
Result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.15342641,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.15342641,
"_source": {
"videoName": "Akon - smack that.mp4",
"type": "long"
}
}
]
}
}
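An alternative approach (a hedged sketch, not part of the answer above): the term vectors show that the standard analyzer keeps "that.mp4" as a single term, which is exactly why a plain match on taylor or that misses. Mapping videoName with an analyzer that also splits on punctuation makes match work without regexp. The index name videos_split and the analyzer/tokenizer names are illustrative:

PUT /videos_split
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "filename_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "analyzer": {
        "filename_analyzer": {
          "type": "custom",
          "tokenizer": "filename_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "videos": {
      "properties": {
        "videoName": {
          "type": "string",
          "analyzer": "filename_analyzer"
        }
      }
    }
  }
}

PUT /videos_split/videos/1
{
  "videoName": "Akon - smack that.mp4",
  "type": "long"
}

POST /videos_split/_search
{
  "query": {
    "match": {
      "videoName": "that"
    }
  }
}

With this analysis chain, "Akon - smack that.mp4" is indexed as akon, smack, that, mp4, so single-word queries such as that or mp4 match this document, and taylor would match the taylor.mp4 document.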
I have a mapping like this:
"properties": {
"id": {"type": "long", "index": "not_analyzed"},
"name": {"type": "string", "index": "not_analyzed"},
"skills": {"type": "string", "index": "not_analyzed"}
}
I want to store students' profiles in Elasticsearch using the given mapping. skills is a list of computer skills they specified on their profiles (python, javascript, ...).
Given a skill set like ['html', 'css', 'sass', 'javascript', 'django', 'bootstrap', 'angularjs', 'backbone'], I want to find all profiles that have at least 3 of the skills in this skill set. I am not interested in knowing which skills they have in common with our desired list, just in the count. Is there a way to do this in Elasticsearch?
There might be a better way I'm not thinking of, but you can do it with a script filter.
I set up a simplified version of your index, with a few docs:
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"skills": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"skills":["html","css","javascript"]}
{"index":{"_id":2}}
{"skills":["bootstrap", "angularjs", "backbone"]}
{"index":{"_id":3}}
{"skills":["python", "javascript", "ruby","java"]}
Then ran this query:
POST /test_index/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"script": {
"script": "count=0; for(s: doc['skills'].values){ for(x: skills){ if(s == x){ count +=1 } } } count >= 3",
"params": {
"skills": ["html", "css", "sass", "javascript", "django", "bootstrap", "angularjs", "backbone"]
}
}
}
}
}
}
and got back what I expected:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"skills": [
"html",
"css",
"javascript"
]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1,
"_source": {
"skills": [
"bootstrap",
"angularjs",
"backbone"
]
}
}
]
}
}
Here's the code all together:
http://sense.qbox.io/gist/1018a01f1df29cb793ea15661f22bc8b25ed3476
One could use a query_string query with the minimum_should_match option.
Example:
POST <index>/_search
{
"query": {
"filtered": {
"filter": {
"query": {
"query_string": {
"default_field": "skills",
"query": "html css sass javascript django bootstrap angularjs backbone \"ruby on rails\" ",
"minimum_should_match" : "3"
}
}
}
}
}
}
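For reference, on Elasticsearch 6.1 and later (a sketch that assumes skills is mapped as keyword there; it does not fit the older filtered-query syntax above), the terms_set query can express "at least 3 of these terms" directly, without a script filter:

POST /test_index/_search
{
  "query": {
    "terms_set": {
      "skills": {
        "terms": ["html", "css", "sass", "javascript", "django", "bootstrap", "angularjs", "backbone"],
        "minimum_should_match_script": {
          "source": "3"
        }
      }
    }
  }
}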
Despite the way Lucene structures nested documents, I'm trying to get my nested fields highlighted when a search hit is present in their content.
Here is the explanation from the Elasticsearch documentation (mapping, nested type):
Internal Implementation
Internally, nested objects are indexed as additional documents, but, since they can be guaranteed to be indexed within the same "block", it allows for extremely fast joining with parent docs.
Those internal nested documents are automatically masked away when doing operations against the index (like searching with a match_all query), and they bubble out when using the nested query.
Because nested docs are always masked to the parent doc, the nested docs can never be accessed outside the scope of the nested query. For example stored fields can be enabled on fields inside nested objects, but there is no way of retrieving them, since stored fields are fetched outside of the nested query scope.
0. In my case
I have an Elasticsearch index containing a mapping like the following:
{
"my_documents": {
"dynamic_date_formats": [
"dd.MM.yyyy",
"yyyy-MM-dd",
"yyyy-MM-dd HH:mm:ss"
],
"index_analyzer": "Analyzer2_index",
"search_analyzer": "Analyzer2_search_decompound",
"_timestamp": {
"enabled": true
},
"properties": {
"identifier": {
"type": "string"
},
"description": {
"type": "multi_field",
"fields": {
"sort": {
"type": "string",
"index": "not_analyzed"
},
"description": {
"type": "string"
}
}
},
"files": {
"type": "nested",
"include_in_root": true,
"properties": {
"content": {
"type": "string",
"include_in_root": true
}
}
},
"and then some other": "normal string fields"
}
}
}
I'm trying to execute a query like this:
{
"size": 100,
"query": {
"bool": {
"should": [
{
"nested": {
"path": "files",
"query": {
"bool": {
"should": {
"match": {
"content": {
"query": "burpcontrol",
"minimum_should_match": "85%"
}
}
}
}
}
}
},
{
"match": {
"description": {
"query": "burpcontrol",
"minimum_should_match": "85%"
}
}
},
{
"match": {
"identifier": {
"query": "burpcontrol",
"minimum_should_match": "85%"
}
}
} ]
}
},
"highlight": {
"pre_tags": [
"<span style=\"background-color: yellow\">"
],
"post_tags": [
"</span>"
],
"order": "score",
"no_match_size": 100,
"fragment_size": 50,
"number_of_fragments": 3,
"require_field_match": true,
"fields": {
"files.content": {},
"description": {},
"identifier": {}
}
}
}
The problems I have are:
1. require_field_match
If I use "require_field_match": false I obtain that, even if highlighting doesn't work on nested fields, the search term is highlighted anyway in ALL the fields.
This is the solution I'm actually using, but the performances are horrible. For 50 documents my query needs 25secs. 100 documents about 50secs. 10 documents 5secs.
And if I remove the nested field from the highlighting everything works fast as light!
2. include_in_root
I would like to have a flattened version of my nested fields (that is, to store them as normal objects/fields).
To do this I should specify
"files": { "type": "nested", "include_in_root": true, ...
but, for some reason, after reindexing I cannot see any additional flattened field in the document root (while I was expecting something like "files.content": ["content1", "content2", "..."]).
If it worked, it would be possible to access the content of the nested field via the flattened field and perform the highlighting on that.
Do you know if it is possible to achieve good (and performant) highlighting on nested fields or, at least, can you suggest why my query is so slow? (I have already optimised the fragments.)
There are a number of things you can do here, with a parent/child relationship. I'll go over a few, and hopefully that will lead you in the right direction; it will still take lots of testing to figure out whether this solution is going to be more performant for you. Also, I left out a few of the details of your setup, for clarity. Please forgive the long post.
I set up a parent/child mapping as follows:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"parent_doc": {
"properties": {
"identifier": {
"type": "string"
},
"description": {
"type": "string"
}
}
},
"child_doc": {
"_parent": {
"type": "parent_doc"
},
"properties": {
"content": {
"type": "string"
}
}
}
}
}
Then added some test docs:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"parent_doc","_id":1}}
{"identifier": "first", "description":"some special text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is special"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is not"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":2}}
{"identifier": "second", "description":"some different text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":2}}
{"content":"different child text, but special"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":3}}
{"identifier": "third", "description":"we don't want this parent"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":3}}
{"content":"or this child"}
If I'm understanding your specs correctly, we would want a query for "special" to return every one of these documents except the last two (correct me if I'm wrong). We want docs that match the text, have a child that matches the text, or have a parent that matches the text.
We can get back parents that match the query like this:
POST /test_index/parent_doc/_search
{
"query": {
"match": {
"description": "special"
}
},
"highlight": {
"fields": {
"description": {},
"identifier": {}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.1263815,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "1",
"_score": 1.1263815,
"_source": {
"identifier": "first",
"description": "some special text"
},
"highlight": {
"description": [
"some <em>special</em> text"
]
}
}
]
}
}
And we can get back children that match the query like this:
POST /test_index/child_doc/_search
{
"query": {
"match": {
"content": "special"
}
},
"highlight": {
"fields": {
"content": {}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.92364895,
"hits": [
{
"_index": "test_index",
"_type": "child_doc",
"_id": "geUFenxITZSL7epvB568uA",
"_score": 0.92364895,
"_source": {
"content": "text that is special"
},
"highlight": {
"content": [
"text that is <em>special</em>"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "IMHXhM3VRsCLGkshx52uAQ",
"_score": 0.80819285,
"_source": {
"content": "different child text, but special"
},
"highlight": {
"content": [
"different child text, but <em>special</em>"
]
}
}
]
}
}
We can get back parents that match the text and children that match the text like this:
POST /test_index/parent_doc,child_doc/_search
{
"query": {
"multi_match": {
"query": "special",
"fields": ["description", "content"]
}
},
"highlight": {
"fields": {
"description": {},
"identifier": {},
"content": {}
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.1263815,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "1",
"_score": 1.1263815,
"_source": {
"identifier": "first",
"description": "some special text"
},
"highlight": {
"description": [
"some <em>special</em> text"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "geUFenxITZSL7epvB568uA",
"_score": 0.75740534,
"_source": {
"content": "text that is special"
},
"highlight": {
"content": [
"text that is <em>special</em>"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "IMHXhM3VRsCLGkshx52uAQ",
"_score": 0.6627297,
"_source": {
"content": "different child text, but special"
},
"highlight": {
"content": [
"different child text, but <em>special</em>"
]
}
}
]
}
}
However, to get back all the docs related to this query, we need to use a bool query:
POST /test_index/parent_doc,child_doc/_search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "special",
"fields": [
"description",
"content"
]
}
},
{
"has_child": {
"type": "child_doc",
"query": {
"match": {
"content": "special"
}
}
}
},
{
"has_parent": {
"type": "parent_doc",
"query": {
"match": {
"description": "special"
}
}
}
}
]
}
},
"highlight": {
"fields": {
"description": {},
"identifier": {},
"content": {}
}
},
"fields": ["_parent", "_source"]
}
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.8866254,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "1",
"_score": 0.8866254,
"_source": {
"identifier": "first",
"description": "some special text"
},
"highlight": {
"description": [
"some <em>special</em> text"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "geUFenxITZSL7epvB568uA",
"_score": 0.67829096,
"_source": {
"content": "text that is special"
},
"fields": {
"_parent": "1"
},
"highlight": {
"content": [
"text that is <em>special</em>"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "IMHXhM3VRsCLGkshx52uAQ",
"_score": 0.18709806,
"_source": {
"content": "different child text, but special"
},
"fields": {
"_parent": "2"
},
"highlight": {
"content": [
"different child text, but <em>special</em>"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "NiwsP2VEQBKjqu1M4AdjCg",
"_score": 0.12531912,
"_source": {
"content": "text that is not"
},
"fields": {
"_parent": "1"
}
},
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "2",
"_score": 0.12531912,
"_source": {
"identifier": "second",
"description": "some different text"
}
}
]
}
}
(I included the "_parent" field to make it easier to see why docs were included in the results, as shown here).
Let me know if this helps.
Here is the code I used:
http://sense.qbox.io/gist/d69a4d6531dc063faa4b4e094cff2a472a73c5a6
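For completeness, a hedged sketch that is not part of the answer above: if you keep the nested mapping rather than switching to parent/child, the nested query supports inner_hits, and highlighting requested inside inner_hits runs only against the matching nested documents, avoiding the global "require_field_match": false workaround. my_index stands in for your index name:

POST /my_index/_search
{
  "query": {
    "nested": {
      "path": "files",
      "query": {
        "match": {
          "files.content": "burpcontrol"
        }
      },
      "inner_hits": {
        "highlight": {
          "fields": {
            "files.content": {}
          }
        }
      }
    }
  }
}

The highlighted fragments then come back under inner_hits for each matching files entry rather than on the root document.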