wildcard on different tokens in elastic search

wildcard on different tokens in elastic search - elasticsearch

I have a document which looks like this
Name
Thomy tyson Olando Magua
Using ngram i was able to acheive the wildcard search so that if i type in omy tyson it can return me the above document pretty much similar to this sql query
select name from table where name like '%omy tyson%'
PUT sample
{
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "15"
}
}
}
},
"mappings": {
"typename": {
"properties": {
"name": {
"type": "string",
"fields": {
"search": {
"type": "string",
"analyzer": "my_ngram_analyzer"
}
}
}
}
}
}
}
PUT sample/typename/2
{
"name": "Thomy tyson Olando Magua"
}
{
"query": {
"bool": {
"should": [
{
"term": {
"name.search": "omy tyson"
}
}
]
}
}
}
Is there a way in elastic search where i can perform wildcard search on 2 different words separated by other words like
select name from table where name like '%omy Magua%'
So in this case i would like to perform partial search on first and fourth word.
Any feedback would be helpfull

Related

how to query for phrases(shingles) in Elasticsearch

I have the following string "Word1 Word2 StopWord1 StopWord2 Word3 Word4".
When I query for this string using ["bool"]["must"]["match"], I would like to return all text that matches "Word1Word2" and/or "Word3Word4".
I have created an analyzer that I would like to use for indexing and searching.
Using analyze API, I have confirmed that indexing is being done correctly. The shingles returned are "Word1Word2" and "Word3Word4"
I want to query so that text matching "Word1Word2" and/or "Word3Word4" are returned. How can I do this dynamically - meaning, I don't know up front how many shingles will be generated, so I don't know how many match_phrase to code up in a query.
"should":[
{ "match_phrase" : {"content": phrases[0]}},
{ "match_phrase" : {"content": phrases[1]}}
]

To query for shingles(and unigrams), you could set up your mappings to handle them cleanly in separate fields. In the example below, the field "shingles" will be used to analyze and retrieve shingles, while the implicit field will be used to handle unigrams.
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
PUT /my_index/_mapping/my_type
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": {
"match": {
"title": "<your query string>"
}
},
"should": {
"match": {
"title.shingles": "<your query string"
}
}
}
}
}
Ref. Elasticsearch: The Definitive Guide....

How to create and add values to a standard lowercase analyzer in elastic search

Ive been around the houses with this for the past few days trying things in various orders but cant figure out why its not working.
I am trying to create an index in Elasticsearch with an analyzer which is the same as the "standard" analyzer but retains upper case characters when records are stored.
I create my analyzer and index as follows:
PUT /upper
{
"settings": {
"index" : {
"analysis" : {
"analyzer": {
"rebuilt_standard": {
"tokenizer": "standard",
"filter": [
"standard"
]
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "rebuilt_standard"
}
}
}
}
}
Then add two records to test like this...
POST /upper/doc
{
"text" : "TEST"
}
Add a second record...
POST /upper/doc
{
"text" : "test"
}
Using /upper/_settings gives the following:
{
"upper": {
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "upper",
"creation_date": "1537788581060",
"analysis": {
"analyzer": {
"rebuilt_standard": {
"filter": [
"standard"
],
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "s4oDgdsFTxOwsdRuPAWEkg",
"version": {
"created": "6030299"
}
}
}
}
}
But when I search with the following query I still get two matches! Both the upper and lower cases which must mean the analyser is not applied when I store the records.
Search like so...
GET /upper/_search
{
"query": {
"term": {
"text": {
"value": "test"
}
}
}
}
Thanks in advance!

first thing first you set your analyzer on the title field instead of upon the text field (since your search is on the text property, and since you are indexing doc with only text property)
"properties": {
"title": {
"type": "text",
"analyzer": "rebuilt_standard"
}
}
try
"properties": {
"text": {
"type": "text",
"analyzer": "rebuilt_standard"
}
}
and keep us posted ;)

Custom analyzer, use case : zip-code [ElasticSearch]

Let be a set index/type named customers/customer.
Each document of this set has a zip-code as property.
Basically, a zip-code can be like:
String-String (ex : 8907-1009)
String String (ex : 211-20)
String (ex : 30200)
I'd like to set my index analyzer to get as many documents as possible that could match. Currently, I work like that :
PUT /customers/
{
"mappings":{
"customer":{
"properties":{
"zip-code": {
"type":"string"
"index":"not_analyzed"
}
some string properties ...
}
}
}
When I search a document I'm using that request :
GET /customers/customer/_search
{
"query":{
"prefix":{
"zip-code":"211-20"
}
}
}
That works if you want to search rigourously. But for instance if the zip-code is "200 30", then searching with "200-30" will not give any results.
I'd like to give orders to my index analyser in order to don't have this problem.
Can someone help me ?
Thanks.
P.S. If you want more information, please let me know ;)

As soon as you want to find variations you don't want to use not_analyzed.
Let's try this with a different mapping:
PUT zip
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"zip_code": {
"tokenizer": "standard",
"filter": [ ]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"zip": {
"type": "text",
"analyzer": "zip_code"
}
}
}
}
}
We're using the standard tokenizer; strings will be broken up at whitespaces and punctuation marks (including dashes) into tokens. You can see the actual tokens if you run the following query:
POST zip/_analyze
{
"analyzer": "zip_code",
"text": ["8907-1009", "211-20", "30200"]
}
Add your examples:
POST zip/_doc
{
"zip": "8907-1009"
}
POST zip/_doc
{
"zip": "211-20"
}
POST zip/_doc
{
"zip": "30200"
}
Now the query seems to work fine:
GET zip/_search
{
"query": {
"match": {
"zip": "211-20"
}
}
}
This will also work if you just search for "211". However, this might be too lenient, since it will also find "20", "20-211", "211-10",...
What you probably want is a phrase search where all the tokens in your query need to be in the field and also in the right order:
GET zip/_search
{
"query": {
"match_phrase": {
"zip": "211"
}
}
}
Addition:
If the ZIP codes have a hierarchical meaning (if you have "211-20" you want this to be found when searching for "211", but not when searching for "20"), you can use the path_hierarchy tokenizer.
So changing the mapping to this:
PUT zip
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"zip_code": {
"tokenizer": "zip_tokenizer",
"filter": [ ]
}
},
"tokenizer": {
"zip_tokenizer": {
"type": "path_hierarchy",
"delimiter": "-"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"zip": {
"type": "text",
"analyzer": "zip_code"
}
}
}
}
}
Using the same 3 documents from above you can use the match query now:
GET zip/_search
{
"query": {
"match": {
"zip": "1009"
}
}
}
"1009" won't find anything, but "8907" or "8907-1009" will.
If you want to also find "1009", but with a lower score, you'll have to analyze the zip code with both variations I have shown (combine the 2 versions of the mapping):
PUT zip
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"zip_hierarchical": {
"tokenizer": "zip_tokenizer",
"filter": [ ]
},
"zip_standard": {
"tokenizer": "standard",
"filter": [ ]
}
},
"tokenizer": {
"zip_tokenizer": {
"type": "path_hierarchy",
"delimiter": "-"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"zip": {
"type": "text",
"analyzer": "zip_standard",
"fields": {
"hierarchical": {
"type": "text",
"analyzer": "zip_hierarchical"
}
}
}
}
}
}
}
Add a document with the inverse order to properly test it:
POST zip/_doc
{
"zip": "1009-111"
}
Then search both fields, but boost the one with the hierarchical tokenizer by 3:
GET zip/_search
{
"query": {
"multi_match" : {
"query" : "1009",
"fields" : [ "zip", "zip.hierarchical^3" ]
}
}
}
Then you can see that "1009-111" has a much higher score than "8907-1009".

Elastic search Unorderered Partial phrase matching with ngram

Maybe I am going down the wrong route, but I am trying to set up Elasticsearch to use Partial Phrase matching to return parts of words from any order of a sentence.
Eg. I have the following input
test name
tester name
name test
namey mcname face
test
And I hope to do a search for "test name" (or "name test"), and I hope all of these return (hopefully sorted in order of score). I can do partial searches, and also can do out of order searches, but not able to combine the 2. I am sure this would be a very common issue.
Below is my Settings
{
"myIndex": {
"settings": {
"index": {
"analysis": {
"filter": {
"mynGram": {
"type": "nGram",
"min_gram": "2",
"max_gram": "5"
}
},
"analyzer": {
"custom_analyser": {
"filter": [
"lowercase",
"mynGram"
],
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "5"
}
}
}
}
}
}
}
My mapping
{
"myIndex": {
"mappings": {
"myIndex": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"analyzer": "custom_analyser"
}
}
}
}
}
}
And my query
{
"query": {
"bool": {
"must": [{
"match_phrase": {
"name": {
"query": "test name",
"slop": 5
}
}
}]
}
}
}
Any help would be greatly appreciated.
Thanks in advance

not sure if you found your solution - I bet you did because this is such an old post, but I was on the hunt for the same thing and found this: Query-Time Search-as-you-type
Look up slop.

ElasticSearch Reverse Wildcard Search

In ElasticSearch v5.2.2 I can search for "Jo*" using Wildcard and it will match the index value containing "Joseph"
But what if my index also has these values "Joseph","Jo", "Jos", "Jose" and "Josep" and I want to reverse the query.
How can I find "Jo", "Jos", "Jose" and "Josep" in the index using the string "Joseph" as search criteria?

That's possible, but you need to create an edgeNGram search analyzer in your index settings.
First create the settings like this. The name field will be indexed with the standard analyzer but searched with your custom prefix_search analyzer instead.
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"prefix_search": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"prefix"
]
}
},
"filter": {
"prefix": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 10
}
}
}
},
"mappings": {
"doc": {
"properties": {
"name": {
"type": "string",
"analyzer": "standard",
"search_analyzer": "prefix_search"
}
}
}
}
}
Then if you create a document like this:
PUT test/doc/1
{
"name": "Jos"
}
You can find it with a query like this one:
POST /test/doc/_search
{
"query": {
"match": {
"name": "Joseph"
}
}
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

wildcard on different tokens in elastic search - elasticsearch

Related

how to query for phrases(shingles) in Elasticsearch

How to create and add values to a standard lowercase analyzer in elastic search

Custom analyzer, use case : zip-code [ElasticSearch]

Elastic search Unorderered Partial phrase matching with ngram

ElasticSearch Reverse Wildcard Search

Categories

Resources