How to match "prefix" and not whole string in elasticsearch? - elasticsearch

I have indexed documents, each with a field: "CodeName" that has values like the following:
document 1 has CodeName: "AAA01"
document 2 has CodeName: "AAA02"
document 3 has CodeName: "AAA03"
document 4 has CodeName: "BBB02"
When I try to use a match query on field:
"query": {
    "match": {
        "CodeName": "AAA"
    }
}
I expect to get results for "AAA01" and "AAA02", but instead I get an empty array. When I pass in "AAA01" (I type in the whole thing), I get a result. How do I make the query match more generically? I tried using "prefix" instead of "match" and got the same problem.
The mapping for "CodeName" is a "type": "string".

I expect to get results for "AAA01" and "AAA02"
This is not how Elasticsearch works. ES breaks your string into tokens using the tokenizer you specify. If you didn't specify any tokenizer/analyzer, the default standard analyzer splits words on spaces, hyphens, etc. and lowercases them. In your case, the values are stored as the single terms "aaa01", "aaa02" and so on. There is no term "aaa", and hence you don't get any results back.
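To see why, here is a rough Python simulation (not Elasticsearch's actual code) of the terms the standard analyzer would store for these four documents:

```python
import re

def standard_analyze(text):
    """Very rough approximation of the standard analyzer:
    lowercase, then split on non-alphanumeric characters."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

# Terms that end up in the inverted index for the four documents:
index_terms = {term
               for doc in ["AAA01", "AAA02", "AAA03", "BBB02"]
               for term in standard_analyze(doc)}

# Each whole value is one term -- there is no separate "aaa" term,
# so a match query for "AAA" finds nothing.
print("aaa01" in index_terms)  # True
print("aaa" in index_terms)    # False
```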
To fix this, you can use a match_phrase_prefix query, or set the type of the match query to phrase_prefix. Try this code:
"query": {
    "match_phrase_prefix": {
        "CodeName": "AAA"
    }
}
OR
"query": {
    "match": {
        "CodeName": {
            "query": "AAA",
            "type": "phrase_prefix"
        }
    }
}
Here is the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html. Also pay attention to the max_expansions parameter, as this query can sometimes be slow depending on your data.
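For instance, a bounded version of the first query might look like this, sketched as a Python dict (the value 10 is an arbitrary illustration):

```python
# The match_phrase_prefix query from the answer, as a Python dict,
# with max_expansions bounding how many terms the trailing prefix
# may expand to (10 is an arbitrary illustrative value).
query = {
    "query": {
        "match_phrase_prefix": {
            "CodeName": {
                "query": "AAA",
                "max_expansions": 10,
            }
        }
    }
}
```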
Note that for this technique you can keep the default mapping; you don't need to use nGram.

As far as I know, you should first index your data using a tokenizer of type nGram.
You can check the details in the documentation.
COMMENT RELATED:
I'm familiar with symfony-way of using elasticsearch and we are using it like this:
indexes:
    search:
        client: default
        settings:
            index:
                analysis:
                    analyzer:
                        custom_index_analyzer:
                            type: custom
                            tokenizer: nGram
                            filter: [lowercase, kstem]
                    tokenizer:
                        nGram:
                            type: nGram
                            min_gram: 2
                            max_gram: 20
        types:
            skill:
                mappings:
                    skill.name:
                        search_analyzer: custom_index_analyzer
                        index_analyzer: custom_index_analyzer
                        type: string
                        boost: 1
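To illustrate what the nGram config above does at index time, here is a rough Python approximation (not the actual Lucene tokenizer) of the tokens produced for "AAA01" with min_gram 2 and max_gram 20:

```python
def ngram_tokens(text, min_gram=2, max_gram=20):
    """Rough approximation of an nGram tokenizer plus lowercase filter."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

tokens = ngram_tokens("AAA01")
print(tokens)
# ['aa', 'aa', 'a0', '01', 'aaa', 'aa0', 'a01', 'aaa0', 'aa01', 'aaa01']

# Now a search for "AAA" (analyzed to "aaa") has a matching term:
print("aaa" in tokens)  # True
```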

Related

Atlas Search Index partial match

I have a test collection with these two documents:
{ _id: ObjectId("636ce11889a00c51cac27779"), sku: 'kw-lids-0009' }
{ _id: ObjectId("636ce14b89a00c51cac2777a"), sku: 'kw-fs66-gre' }
I've created a search index with this definition:
{
    "analyzer": "lucene.standard",
    "searchAnalyzer": "lucene.standard",
    "mappings": {
        "dynamic": false,
        "fields": {
            "sku": {
                "type": "string"
            }
        }
    }
}
If I run this aggregation:
[{
    $search: {
        index: 'test',
        text: {
            query: 'kw-fs',
            path: 'sku'
        }
    }
}]
Why do I get 2 results? I only expected the one with sku: 'kw-fs66-gre' 😬
During indexing, the standard analyzer breaks the string "kw-lids-0009" into three tokens [kw][lids][0009], and similarly tokenizes "kw-fs66-gre" as [kw][fs66][gre]. When you query for "kw-fs", the same analyzer tokenizes the query as [kw][fs], and so Lucene matches both documents, since both have the [kw] token in the index.
To get the behavior you're looking for, you should index the sku field as type autocomplete and use the autocomplete operator in your $search stage instead of text.
You're still getting 2 results because of the tokenization, i.e., you're still matching on [kw] in two documents. If you search for "fs66", you'll get a single match only. Results are scored based on relevance; they are not filtered. You can add {$project: {score: { $meta: "searchScore" }}} to your pipeline and see the difference in score between the matching documents.
If you are looking for exact matches only, consider the keyword analyzer, or a custom analyzer that strips the dashes, so that you deal with a single token per field instead of three.
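A sketch of the autocomplete approach, written as Python dicts (the edgeGram tokenization and gram sizes are illustrative choices, not requirements):

```python
# Illustrative Atlas Search index definition: "sku" mapped as type
# "autocomplete" (tokenization and gram sizes are example values).
index_definition = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "sku": {
                "type": "autocomplete",
                "tokenization": "edgeGram",
                "minGrams": 2,
                "maxGrams": 15,
            }
        }
    }
}

# The matching $search stage, using the autocomplete operator
# instead of the text operator:
search_stage = {
    "$search": {
        "index": "test",
        "autocomplete": {
            "query": "kw-fs",
            "path": "sku",
        }
    }
}
```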

ElasticSearch: Using match_phrase for all fields

As a user of ElasticSearch 5, I have been using something like this to search for a given phrase in all fields:
GET /my_index/_search
{
    "query": {
        "match_phrase": {
            "_all": "this is a phrase"
        }
    }
}
Now, the _all field is going away, and match_phrase does not seem to work like query_string, where you can simply use something like this to run a search for all fields:
"query": {
    "query_string": {
        "query": "word"
    }
}
What is the alternative for an exact phrase search across all fields, without using the _all field, from version 6.0 onward?
I have many fields per document so specifying all of them in the query is not really a solution for me.
You can find answer in Elasticsearch documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
It says:
Use a custom field and the mapping copy_to parameter
So, you have to create custom fields in source, and copy all other fields to it.
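A minimal sketch of such a mapping, as Python dicts (the field names all_text, title and body are hypothetical, and the exact mapping shape varies between ES versions):

```python
# Hypothetical field names ("all_text", "title", "body"): each source
# field is copied into "all_text" via copy_to, restoring _all-like
# behavior for phrase searches.
mapping = {
    "mappings": {
        "properties": {
            "all_text": {"type": "text"},
            "title": {"type": "text", "copy_to": "all_text"},
            "body": {"type": "text", "copy_to": "all_text"},
        }
    }
}

# A phrase search "across all fields" then targets the combined field:
query = {
    "query": {
        "match_phrase": {
            "all_text": "this is a phrase"
        }
    }
}
```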

Elasticsearch 6.2: terms query require lowercase input when searching on keyword

I've created an example index, with the following mapping:
{
    "_doc": {
        "_source": {
            "enabled": false
        },
        "properties": {
            "status": { "type": "keyword" }
        }
    }
}
And indexed a document:
{"status": "CMP"}
When searching the documents with this status with a terms query, I find no results:
{
    "query": {
        "terms": { "status": ["CMP"] }
    }
}
However, if I make the same query by putting the input in lowercase, I will find my document:
{
    "query": {
        "terms": { "status": ["cmp"] }
    }
}
Why is it? Since I'm searching on a keyword field, the indexed content should not be analyzed and should match an uppercase value...
@Oliver Charlesworth: in Elasticsearch 6.x you can keep the keyword datatype and lowercase the text with a normalizer (doc here). In either case, though, you have to change your index mapping and reindex your docs.
The index and mapping creation and the search were part of a test suite. It seems that the setup part of the test suite was not executed, and the mapping was not applied to the index.
The index was then using the default types instead of the mapping types, resulting in the use of string fields instead of keywords.
After changing the setup method of the automated tests, the mappings are well applied to the index, and the uppercase values for the status "CMP" are now matching documents.
The symptoms you're seeing shouldn't occur, unless something else is wrong.
A keyword index is not analysed, so your index should contain only CMP. A terms query is also not analysed, etc. so your index is searched only for CMP. Hence there should be a match.

Elasticsearch find missing word in phrase

How can I use Elasticsearch to find the missing word in a phrase? For example, I want to find all documents which contain this pattern: make * great again. I tried using a wildcard query, but it returned no results:
{
    "fields": [
        "file_name",
        "mime_type",
        "id",
        "sha1",
        "added_at",
        "content.title",
        "content.keywords",
        "content.author"
    ],
    "highlight": {
        "encoder": "html",
        "fields": {
            "content.content": {
                "number_of_fragments": 5
            }
        },
        "order": "score",
        "tags_schema": "styled"
    },
    "query": {
        "wildcard": {
            "content.content": "make * great again"
        }
    }
}
If I put in a word and use a match_phrase query I get results, so I know I have data which matches the pattern.
Which type of query should I use? Or do I need to add some type of custom analyzer to the field?
Wildcard queries operate on terms, so if you use it on an analyzed field, it will actually try to match every term in that field separately. In your case, you can create a not_analyzed sub-field (such as content.content.raw) and run the wildcard query on that. Or just map the actual field to not be analyzed, if you don't need to query it in other ways.
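A rough sketch of that mapping and the rewritten query, as Python dicts in ES 2.x-style syntax (the raw sub-field name follows the answer's example):

```python
# ES 2.x-style sketch: a not_analyzed "raw" sub-field (name follows the
# answer's example), so the wildcard pattern runs over the whole
# untokenized value instead of individual terms.
mapping = {
    "properties": {
        "content": {
            "properties": {
                "content": {
                    "type": "string",
                    "fields": {
                        "raw": {"type": "string", "index": "not_analyzed"}
                    }
                }
            }
        }
    }
}

# The wildcard query, rewritten against the unanalyzed sub-field;
# "*" can now span the missing word inside the stored phrase.
query = {
    "query": {
        "wildcard": {
            "content.content.raw": "make * great again"
        }
    }
}
```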

Implementing ElasticSearch custom filter for all queries

I'm trying to upgrade from ElasticSearch 0.90.3 to 2.0 and running into some issues. The people who originally set up and configured ES are no longer available, so I'm going about this with very little knowledge of how it works.
They configured ES 0.90.3 to use ElasticSearch-ServiceWrapper and Tire, beyond that there are only a couple small configuration changes.
For the most part, the upgrade went smoothly. I replaced the setup information in the cap deploy process to install ES 2.0 instead of 0.90.3, and the service is coming up; however, I cannot get the partial matching that was taking place before to work. I need to set up a standard filter that applies to all searches and matches all fields using partial matches. I've done tons of Google searches, and this is the closest I can come up with, but it still isn't returning partial matches.
index:
    settings:
        analysis:
            filter:
                autocomplete_filter:
                    type: edge_ngram
                    min_gram: 2
                    max_gram: 32
            analyzer:
                autocomplete:
                    type: custom
                    tokenizer: standard
                    filter: [ lowercase, autocomplete_filter ]
    mappings:
        access_point_status:
            properties:
                text:
                    type: string
                    analyzer: autocomplete
                    search_analyzer: standard
I was hoping to not need to replace Tire as that would make this upgrade much more involved, but if the problem lies within the queries and not the setup then I will go down that road. This is a sample query that is not returning the desired results:
curl -X GET 'http://localhost:9200/access_point_status/_search?from=0&size=100&pretty' -d '
{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "_all": {
                            "query": "1925",
                            "type": "phrase_prefix"
                        }
                    }
                }
            ]
        }
    },
    "sort": [ { "name": "asc" } ],
    "filter": { "term": { "domain": "domain_1" } },
    "size": 100,
    "from": 0
}'
Thanks
So I've found most of the issues. The indexes were being created by Tire and data_tables, using a different mapping. This couldn't be overwritten once created.
I created these filters and then applied them to the fields,
index:
    analysis:
        filter:
            edge_ngram_filter:
                type: edge_ngram
                min_gram: 2
                max_gram: 32
                side: front
        analyzer:
            character_only:
                type: custom
                tokenizer: standard
                filter: [ lowercase, edge_ngram_filter ]
            special_character:
                type: custom
                tokenizer: keyword
                filter: [ lowercase, edge_ngram_filter ]
And I'm matching about 95% of the things I would hope to with
curl -X GET 'http://localhost:9200/access_point_status/_search?from=0&size=100&pretty' -d '
{
    "query": {
        "bool": {
            "must": [
                {
                    "prefix": {
                        "_all": "bsap-"
                    }
                }
            ]
        }
    },
    "sort": [ { "name": "asc" } ],
    "filter": { "term": { "domain": "domain_1" } },
    "size": 100,
    "from": 0
}'
The only thing I'm missing is getting special characters to match, and uppercase characters are not matched either. I've tried several types of queries; query_string doesn't seem to match any partials. Anyone have any thoughts on other queries?
I need to match things like MAC addresses, IPs, and combined text/number fields with -_,. as separators.
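As a rough Python simulation (not the actual Lucene filter chain) of the special_character analyzer defined earlier, here is what the keyword tokenizer plus lowercase plus front edge_ngram emits for one of these values. Note that a prefix query is not analyzed, so the uppercase mismatch goes away only if the client lowercases the query string itself:

```python
def edge_ngrams(value, min_gram=2, max_gram=32):
    """Rough approximation of the special_character analyzer: keyword
    tokenizer (whole value is one token), lowercase filter, then front
    edge n-grams -- dashes and dots survive inside the grams."""
    value = value.lower()  # without this, "BSAP-" would never match "bsap-"
    return [value[:n] for n in range(min_gram, min(max_gram, len(value)) + 1)]

grams = edge_ngrams("BSAP-1925")
print(grams)
# ['bs', 'bsa', 'bsap', 'bsap-', 'bsap-1', 'bsap-19', 'bsap-192', 'bsap-1925']
```

Because "bsap-" is itself one of the indexed grams, a prefix (or term) query for the lowercased string "bsap-" matches; the same holds for dotted or colon-separated values like IPs and MAC addresses.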