Elasticsearch - Get unsorted data while retrieving results using "docvalue_fields"

I'm trying to retrieve data using docvalue_fields from a text field. Below is the mapping:
PUT dvtest
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nl-tokenizer": {
            "type": "simple_pattern_split",
            "pattern": "\n|\r\n|\r|\n\r"
          }
        },
        "analyzer": {
          "newline": {
            "type": "custom",
            "filter": [
              "trim"
            ],
            "tokenizer": "nl-tokenizer"
          }
        }
      }
    }
  },
  "mappings": {
    "indexMapping": {
      "properties": {
        "data": {
          "norms": false,
          "type": "text",
          "analyzer": "newline",
          "fielddata": true
        }
      }
    }
  }
}
Query to put sample data in the index:
POST dvtest/indexMapping
{
  "data": """
The quick brown fox
Jumps over the lazy dog
"""
}
Query I'm using to retrieve data:
POST dvtest/indexMapping/_search
{
  "_source": false,
  "docvalue_fields": ["data"]
}
Now, when I try to retrieve data using the above query I get the result shown below. The tokens in the data field are sorted alphabetically, but I want to retrieve them in the order in which they were indexed.
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "dvtest",
"_type": "indexMapping",
"_id": "Kp5jNmQBrlmAX-4PVbm9",
"_score": 1,
"fields": {
"data": [
"Jumps over the lazy dog",
"The quick brown fox"
]
}
}
]
}
So, my desired result is:
...
"fields": {
  "data": [
    "The quick brown fox",
    "Jumps over the lazy dog"
  ]
}
...
I have tried searching on Google for help and also checked GitHub, but couldn't find anything useful.
Thanks in advance for any help! :)
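A note on why this happens: fielddata (like doc values) stores a field's terms in sorted order, so docvalue_fields can only ever return the tokens alphabetically; the original token order is not recorded there. If the goal is just to see the tokens of a piece of text in positional order, the _analyze API returns each token together with its position. A sketch against the index above (not from the original thread):
POST dvtest/_analyze
{
  "analyzer": "newline",
  "text": "The quick brown fox\nJumps over the lazy dog"
}
Each token in the response carries a position field, so "The quick brown fox" comes back before "Jumps over the lazy dog".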

Related

How do I search documents with their synonyms in Elasticsearch?

I have an index with some documents. These documents have a name field. But now my documents can have several names, and the number of names a document can have is uncertain: a document can have only one name, or there can be 10 names on one document.
The question is: how do I organize my index, documents and query so I can search for one document by its different names?
For example, there's a document with the names "automobile", "automobil" and "自動車". Whenever I query one of these names, I should get this document. Can I create a kind of array of these names and build a query to search for each one? Or is there a more appropriate way to do this?
Tldr;
It feels like you are looking for something like synonyms.
Solution
In the following example I am creating an index with a specific text analyser.
This analyser handles automobile, automobil and 自動車 as the same token.
PUT /74472994
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [ "automobile, automobil, 自動車" ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "synonym"
      }
    }
  }
}
POST /74472994/_doc
{
  "name": "automobile"
}
which allows me to perform the following requests:
GET /74472994/_search
{
  "query": {
    "match": {
      "name": "automobil"
    }
  }
}
GET /74472994/_search
{
  "query": {
    "match": {
      "name": "自動車"
    }
  }
}
And always get:
{
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.7198386,
    "hits": [
      {
        "_index": "74472994",
        "_id": "ROfyhoQBcn6Q8d0DlI_z",
        "_score": 1.7198386,
        "_source": {
          "name": "automobile"
        }
      }
    ]
  }
}
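To see the synonym expansion itself, the same analyzer can be exercised with the _analyze API. A sketch (the token list described below is what a synonym filter is expected to emit, not output copied from a live cluster):
POST /74472994/_analyze
{
  "analyzer": "synonym",
  "text": "automobil"
}
All three variants (automobil, automobile, 自動車) should come back as tokens at the same position, which is why any of them matches the indexed document.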

Elasticsearch NEST API: How to write Query descriptor to implement search with Starts with?

Using the Elasticsearch NEST client to search for a company name stored in Elasticsearch. Here is a sample of my query extensions.
I want to change it so that when I search for "Starbucks", it only returns records starting with "Starbucks". Currently it is returning all the records that contain "Starbucks".
Based on the documentation, I need to search on the "keyword" field in order to get the result.
I need sample code showing how to achieve this.
Elasticsearch index mapping:
"name": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
Code:
var escapedSearchTerm = ElasticsearchQueryExtensions.EscapeQuery(companyName);
return new QueryContainerDescriptor<SearchResponseStorageContractV1>().Bool(b => b.Must(mu => mu
    .QueryString(qs => qs
        .AllowLeadingWildcard(true)
        .AnalyzeWildcard(true)
        .Fields(f => f.Field(s => s.Company.Name).Field(s => s.Organization.CommonName))
        .Query(escapedSearchTerm)
    )));
I am not familiar with the Elasticsearch NEST client, but in JSON you can implement starts-with search functionality using a prefix query.
Adding a working example with index data, mapping, search query and search result.
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{
  "name": "Starbucks is a American multinational chain of coffeehouses"
}
{
  "name": "coffee at Starbucks"
}
Search Query:
{
  "query": {
    "prefix": {
      "name": {
        "value": "Starbucks",
        "case_insensitive": true // this param was introduced in 7.10.0
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "67424740",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"name": "Starbucks is a American multinational chain of coffeehouses"
}
}
]
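Since the mapping in the question already defines a keyword sub-field, an alternative is to aim the prefix query at that sub-field instead of creating a custom analyzer. A sketch, assuming the asker's name.keyword mapping (the index name company here is a placeholder):
GET /company/_search
{
  "query": {
    "prefix": {
      "name.keyword": {
        "value": "Starbucks"
      }
    }
  }
}
Note that without case_insensitive (available from 7.10), a prefix match on a keyword field is case-sensitive.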

How to exclude asterisks while searching with analyzer

I need to search by an array of values, and each value can be either simple text or text with asterisks (*).
For example:
["MYULTRATEXT"]
And I have the following index (I have a really big index, so I will simplify it):
................
{
  "settings": {
    "analysis": {
      "char_filter": {
        "asterisk_remove": {
          "type": "pattern_replace",
          "pattern": "(\\d+)*(?=\\d)",
          "replacement": "1$"
        }
      },
      "analyzer": {
        "custom_search_analyzer": {
          "char_filter": [
            "asterisk_remove"
          ],
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "keyword",
          "search_analyzer": "custom_search_analyzer"
        },
......................
And all data in the index is stored with asterisks * e.g.:
curl -X PUT "localhost:9200/locations/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
  "name" : "MY*ULTRA*TEXT"
}'
I need to get back exactly the same name value when I search by the string MYULTRATEXT:
curl -X POST 'localhost:9200/locations/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "terms": { "name": ["MYULTRATEXT"] } }
}'
It should return MY*ULTRA*TEXT, but it does not work, and I can't find a workaround. Any thoughts?
I tried pattern_replace, but it seems like I am doing something wrong or missing something here.
So I need to replace every * with an empty string while searching.
There appears to be a problem with the regex you provided and the replacement pattern.
I think what you want is:
"char_filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(\\w+)\\*(?=\\w)",
"replacement": "$1"
}
}
Note the following changes:
\d => \w (match word characters instead of only digits)
escape * since asterisks have a special meaning in regexes
1$ => $1 ($<GROUPNUM> is how you reference captured groups)
To see how Elasticsearch will analyze the text against an analyzer, or to check that you defined an analyzer correctly, Elasticsearch has the ANALYZE API endpoint that you can use: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
If you try this API with your current definition of custom_search_analyzer, you will find that "MY*ULTRA*TEXT" is analyzed to "MY*ULTRA*TEXT" and not "MYULTRATEXT" as you intend.
I have a personal app that I use to more easily interact with and visualize the results of the ANALYZE API. I tried your example and you can find it here: Elasticsearch Analysis Inspector.
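As a quick way to test the corrected pattern without reindexing, the _analyze API also accepts an ad-hoc analyzer definition. A sketch that mirrors custom_search_analyzer with the fixed char filter (assumed wiring, not from the original answer):
POST /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\w+)\\*(?=\\w)",
      "replacement": "$1"
    }
  ],
  "text": "MY*ULTRA*TEXT"
}
With the corrected pattern this should yield the single token MYULTRATEXT.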
This might help you: your regex pattern is the issue.
You want to replace all * occurrences with an empty string; the pattern below will do the trick.
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "asterisk_remove": {
          "type": "pattern_replace",
          "pattern": "(?<=\\w)(\\*)(?=\\w)",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "asterisk_remove"
          ],
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    }
  }
}
Analyze query
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["MY*ULTRA*TEXT"]
}
Results of analyze query
{
  "tokens": [
    {
      "token": "myultratext",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    }
  ]
}
Post a document
POST my_index/doc/1
{
  "name" : "MY*ULTRA*TEXT"
}
Search query
GET my_index/_search
{
  "query": {
    "match": {
      "name": "MYULTRATEXT"
    }
  }
}
Or
GET my_index/_search
{
  "query": {
    "match": {
      "name": "myultratext"
    }
  }
}
Results of search query
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "name": "MY*ULTRA*TEXT"
        }
      }
    ]
  }
}
Hope it helps

Return only exact matches (substrings) in full text search (elasticsearch)

I have an index in elasticsearch with a 'title' field (analyzed string field). If I have the following documents indexed:
{title: "Joe Dirt"}
{title: "Meet Joe Black"}
{title: "Tomorrow Never Dies"}
and the search query is "I want to watch the movie Joe Dirt tomorrow"
I want to find results where the full title matches as a substring of the search query. If I use a straight match query, all of these documents will be returned because they all match one of the words. I really just want to return "Joe Dirt" because the title is an exact match substring of the search query.
Is that possible in elasticsearch?
Thanks!
One way to achieve this is as follows:
1) While indexing, index the title using the keyword tokenizer.
2) While searching, use the shingle token filter to extract substrings from the query string and match them against the title.
Example:
Index Settings
put test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "substring"
          ]
        },
        "exact": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      },
      "filter": {
        "substring": {
          "type": "shingle",
          "output_unigrams": true
        }
      }
    }
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "analyzer": "exact"
            }
          }
        }
      }
    }
  }
}
Index Documents
put test/movie/1
{"title": "Joe Dirt"}
put test/movie/2
{"title": "Meet Joe Black"}
put test/movie/3
{"title": "Tomorrow Never Dies"}
Query
post test/_search
{
  "query": {
    "match": {
      "title.raw": {
        "analyzer": "substring",
        "query": "Joe Dirt tomorrow"
      }
    }
  }
}
Result:
"hits": {
  "total": 1,
  "max_score": 0.015511602,
  "hits": [
    {
      "_index": "test",
      "_type": "movie",
      "_id": "1",
      "_score": 0.015511602,
      "_source": {
        "title": "Joe Dirt"
      }
    }
  ]
}

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas"... on searching, both documents get the same score. I want the one where casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. Also, I am using aggregations, so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
  "bool": {
    "must": {
      "match": { "name": "cas" }
    },
    "should": {
      "prefix": { "name": "cas" }
    }
  }
}
I'm assuming the values you gave is in the name field, and that that field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
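If the prefix match should weigh more (or less) heavily in the final score, the should clause also accepts an explicit boost. A sketch along the same lines (the boost value of 2.0 is arbitrary):
{
  "bool": {
    "must": {
      "match": { "name": "cas" }
    },
    "should": {
      "prefix": { "name": { "value": "cas", "boost": 2.0 } }
    }
  }
}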
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and, also, the number of terms from the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that uses term_vector set to with_positions, an edgeNGram index analyzer, and a sub-field of type token_count:
PUT /test
{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions",
          "index_analyzer": "edgengram_analyzer",
          "search_analyzer": "keyword",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "name_ngrams": {
          "min_gram": "2",
          "type": "edgeNGram",
          "max_gram": "30"
        }
      },
      "analyzer": {
        "edgengram_analyzer": {
          "type": "custom",
          "filter": [
            "standard",
            "lowercase",
            "name_ngrams"
          ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "query": {
              "term": {
                "text": {
                  "value": "cas"
                }
              }
            },
            "script_score": {
              "script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
            },
            "boost_mode": "sum"
          }
        }
      ]
    }
  }
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
