Elasticsearch search pattern with start string

I am new to Elasticsearch and am trying to implement search. Below are my index settings and mappings:
curl -XPUT localhost:9200/rets_data/ -d '{
"settings":{
"index":{
"analysis":{
"analyzer":{
"analyzer_startswith":{
"tokenizer":"keyword",
"filter":"lowercase"
},
"analyzer_whitespacewith":{
"tokenizer":"whitespace",
"filter":"lowercase"
}
}
}
}
},
"mappings":{
"city":{
"properties":{
"CityName":{
"analyzer":"analyzer_startswith",
"type":"string"
}
}
},
"rets_aux_subdivision":{
"properties":{
"nn":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"field_LIST_77":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"SubDivisionName":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"SubDivisionAlias":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
}
}
},
"rental_aux_subdivision":{
"properties":{
"nn":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"field_LIST_77":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"SubDivisionName":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"SubDivisionAlias":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
}
}
}
}
}'
Below is the search query:
curl -XGET localhost:9200/rets_data/rets_aux_subdivision/_search?pretty -d '{"query":{"match_phrase_prefix":{"nn":{"query":"boca w","max_expansions":50}}},"sort":{"total":{"order":"desc"}},"size":100}'
When I search for text such as "Boca r" or "Boca w", I get no results.
My expected result is as follows:
"Boca w" should return results starting with "Boca w", i.e. "Boca west", "Boca Woods", "Boca Winds".
Please help me with this.
Thanks

You should use an edge n-gram filter (edge_ngram); see the Elasticsearch documentation. The filter expands each word into its prefixes, like this:
Woods->[W,Wo,Woo,Wood,Woods]
This makes the index bigger, but searching is generally more efficient than alternatives such as wildcard queries. Here is a simple index definition of mine that applies edge n-grams to the title field:
{
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"ngram_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["lowercase","my_ngram"]
}
},
"filter" : {
"my_ngram" : {
"type" : "edge_ngram",
"min_gram" : 1,
"max_gram" : 50
}
}
}
}
},
"mappings":
{
"post":
{
"properties":
{
"id":
{
"type": "integer",
"index":"no"
},
"title":
{
"type": "text",
"analyzer":"ngram_analyzer"
}
}
}
}
}
And the search query:
{
"from" : 0,
"size" : 10,
"query" : {
"match" : {
"title":
{
"query":"press key han",
"operator":"or",
"analyzer":"standard"
}
}
}
}
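To make the idea concrete, here is a rough Python sketch of what the two analyzers above do. It is only an illustration, not Elasticsearch code: the helper names (edge_ngrams, index_analyze, search_analyze, matches) are made up for this example, and a simple regex stands in for the standard tokenizer.

```python
import re

def edge_ngrams(token, min_gram=1, max_gram=50):
    """Return the prefixes of `token` with lengths from min_gram up to max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

def index_analyze(text):
    """Rough stand-in for ngram_analyzer: split into words, lowercase, edge n-gram."""
    words = re.findall(r"\w+", text.lower())
    return {gram for word in words for gram in edge_ngrams(word)}

def search_analyze(text):
    """Rough stand-in for the standard analyzer at search time: words only, no n-grams."""
    return set(re.findall(r"\w+", text.lower()))

def matches(query, title, operator="or"):
    """A query term hits when it equals one of the n-grams indexed for the title."""
    grams = index_analyze(title)
    hits = [term in grams for term in search_analyze(query)]
    return all(hits) if operator == "and" else any(hits)

print(edge_ngrams("woods"))                        # ['w', 'wo', 'woo', 'wood', 'woods']
print(matches("boca w", "Boca Woods", "and"))      # True
print(matches("boca w", "Boca Raton", "and"))      # False
print(matches("boca w", "Boca Raton", "or"))       # True
```

Note that with "operator":"or" a single matching term is enough, so "boca w" would also hit "Boca Raton"; switching the operator to "and" requires every query term to match some prefix.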

What if you wrote your match like this:
"query": {
"match_phrase": {
"text": {
"query": "boca w"
}
}
},
"sort":{
"total":{
"order":"desc"
}
},
"size":100
Or you could use the wildcard query:
"query": {
"wildcard" : {
"yourfield" : "boca w*"
}
}
This SO question could be helpful. Hope it helps!

Related

Searching in all fields, case insensitive, and not analyzed

In Elasticsearch, how can I define a dynamic default mapping for any field (the fields are not predefined) so that values are searchable with spaces and case-insensitively?
For example, if I have two documents:
PUT myindex/mytype/1
{
"transaction": "test"
}
and
PUT myindex/mytype/2
{
"transaction": "test SPACE"
}
I'd like to perform the following queries:
Querying: "test", Expected result: "test"
Querying: "test space", Expected result: "test SPACE"
I've tried to use:
PUT myindex
{
"settings":{
"index":{
"analysis":{
"analyzer":{
"analyzer_keyword":{
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
}
},
"mappings":{
"test":{
"properties":{
"title":{
"analyzer":"analyzer_keyword",
"type":"string"
}
}
}
}
}
But it gives me both documents as results when looking for "test".

Apparently there was a mistake in how I ran my query.
Here's the solution I found to this problem when using a multi-field query:
#any field mapping - not analyzed and case insensitive
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"analyzer_keyword": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
}
},
"mappings": {
"doc": {
"dynamic_templates": [
{ "notanalyzed": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"analyzer":"analyzer_keyword"
}
}
}
]
}
}
}
#index test data
POST /test_index/doc/_bulk
{"index":{"_id":3}}
{"name":"Company Solutions", "a" : "a1"}
{"index":{"_id":4}}
{"name":"Company", "a" : "a2"}
#search for the document with name "company" and a "a2"
POST /test_index/doc/_search
{
"query" : {
"filtered" : {
"filter": {
"and": {
"filters": [
{
"query": {
"match": {
"name": "company"
}
}
},
{
"query": {
"match": {
"a": "a2"
}
}
}
]
}
}
}
}
}
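As a rough illustration of why the keyword tokenizer plus lowercase filter gives exact, case-insensitive matching, here is a small Python sketch. It is not Elasticsearch code; analyzer_keyword mirrors the analyzer name above, while whitespace_like and search are hypothetical helpers that stand in for a word-splitting analyzer and a match query. One possible reading of the original symptom (an assumption, not confirmed in the thread) is that the first mapping targeted a title field while the documents used transaction, so a word-splitting default analyzer was applied instead.

```python
def analyzer_keyword(value):
    """keyword tokenizer + lowercase filter: the whole value is one lowercased token."""
    return [value.lower()]

def whitespace_like(value):
    """Simplified stand-in for an analyzer that lowercases and splits on whitespace."""
    return value.lower().split()

# The two documents from the question, keyed by id.
docs = {1: "test", 2: "test SPACE"}

def search(query, analyzer):
    """Return ids of documents sharing at least one token with the analyzed query."""
    query_tokens = set(analyzer(query))
    return sorted(i for i, value in docs.items() if query_tokens & set(analyzer(value)))

print(search("test", analyzer_keyword))        # [1]
print(search("test space", analyzer_keyword))  # [2]
print(search("test", whitespace_like))         # [1, 2]
```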

Elasticsearch - Cardinality over Full Field Value

I have a document that looks like this:
{
"_id":"some_id_value",
"_source":{
"client":{
"name":"x"
},
"project":{
"name":"x November 2016"
}
}
}
I am attempting to perform a query that will fetch me the count of unique project names for each client. For this, I am using a query with cardinality over the project.name. I am sure that there are only 4 unique project names for this particular client. However, when I run my query, I get a count of 5, which I know is wrong.
The project names all contain the name of the client. For instance, if a client is "X", project names will be "X Testing November 2016", or "X Jan 2016", etc. I don't know if that is a consideration.
This is the mapping for the document type:
{
"mappings":{
"vma_docs":{
"properties":{
"client":{
"properties":{
"contact":{
"type":"string"
},
"name":{
"type":"string"
}
}
},
"project":{
"properties":{
"end_date":{
"format":"yyyy-MM-dd",
"type":"date"
},
"project_type":{
"type":"string"
},
"name":{
"type":"string"
},
"project_manager":{
"index":"not_analyzed",
"type":"string"
},
"start_date":{
"format":"yyyy-MM-dd",
"type":"date"
}
}
}
}
}
}
}
This is my search query:
{
"fields":[
"client.name",
"project.name"
],
"query":{
"bool":{
"must":{
"match":{
"client.name":{
"operator":"and",
"query":"ABC systems"
}
}
}
}
},
"aggs":{
"num_projects":{
"cardinality":{
"field":"project.name"
}
}
},
"size":5
}
These are the results I get (I have posted only 2 hits for the sake of brevity). Note that the num_projects aggregation returns 5, but it should return 4, which is the total number of projects.
{
"hits":{
"hits":[
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9IBwwoAW3mzgKz",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
},
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9YBwwoAW3mzgK2",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
}
],
"total":18,
"max_score":5.8553367
},
"_shards":{
"successful":5,
"failed":0,
"total":5
},
"took":4,
"aggregations":{
"num_projects":{
"value":5
}
},
"timed_out":false
}
FYI: The project names are ABC, ABC Nov 2016, ABC retest November, ABC Mobile App

You need the following mapping for your project.name field:
{
"mappings": {
"vma_docs": {
"properties": {
"client": {
"properties": {
"contact": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"project": {
"properties": {
"end_date": {
"format": "yyyy-MM-dd",
"type": "date"
},
"project_type": {
"type": "string"
},
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"project_manager": {
"index": "not_analyzed",
"type": "string"
},
"start_date": {
"format": "yyyy-MM-dd",
"type": "date"
}
}
}
}
}
}
}
This adds a sub-field called raw: whatever value is put in project.name is also indexed in project.name.raw, but left untouched (not tokenized or analyzed). The query you then need to use is:
{
"fields": [
"client.name",
"project.name"
],
"query": {
"bool": {
"must": {
"match": {
"client.name": {
"operator": "and",
"query": "ABC systems"
}
}
}
}
},
"aggs": {
"num_projects": {
"cardinality": {
"field": "project.name.raw"
}
}
},
"size": 5
}
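To see why the analyzed field inflates the count, here is a small Python sketch that counts distinct terms both ways for the four project names listed in the question. It is only an approximation: analyzed_terms and distinct_count are hypothetical helpers, and a simple regex stands in for the default analyzer. The analyzed count it prints need not equal the 5 from the question, since the actual index contents are not shown.

```python
import re

# Sample values taken from the question's FYI line.
project_names = ["ABC", "ABC Nov 2016", "ABC retest November", "ABC Mobile App"]

def analyzed_terms(value):
    """Simplified stand-in for the default analyzer: lowercased word tokens."""
    return re.findall(r"\w+", value.lower())

def distinct_count(values, analyze=None):
    """Count distinct indexed terms; analyze=None mimics a not_analyzed field."""
    terms = set()
    for value in values:
        terms.update(analyze(value) if analyze else [value])
    return len(terms)

print(distinct_count(project_names))                  # 4 (whole values, like project.name.raw)
print(distinct_count(project_names, analyzed_terms))  # 7 (individual tokens, like project.name)
```

The cardinality aggregation counts distinct terms in the index, so on an analyzed field it counts tokens rather than full project names.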

Optimising ElasticSearch aggregated search suggestions

I'm working on implementing an autocomplete field where the suggestions also contain the number of matching documents.
I have implemented this simply using a terms aggregation with include filter. So for instance given a user typing 'Chrysler' the following query may be generated:
{
"size": 0,
"query": {
"bool": {
"must": [
...
]
}
},
"aggs": {
"filtered": {
"filter": {
...
},
"aggs": {
"suggestions": {
"terms": {
"field": "prefLabel",
"include": "Chry.*",
"min_doc_count": 0
}
}
}
}
}
}
This works fine and I am able to get the data I need. However, I am concerned that this is not very well optimised and more could be done when the documents are indexed.
Currently we have the following mapping:
{
...
"prefLabel":{
"type":"string",
"index":"not_analyzed"
}
}
And I am wondering whether to add an analysed field, like so:
{
...
"prefLabel":{
"type":"string",
"index":"not_analyzed",
"copy_to":"searchLabel"
},
"searchLabel":{
"type":"string",
"analyzer":"???"
}
}
So my question is: what is the most optimal index-time analyser for this? (or, is this just crazy?)

I think that an edge n-gram tokenizer would speed things up:
curl -XPUT 'localhost:9200/test_ngram' -d '{
"settings" : {
"analysis" : {
"analyzer" : {
"suggester_analyzer" : {
"tokenizer" : "ngram_tokenizer"
}
},
"tokenizer" : {
"ngram_tokenizer" : {
"type" : "edgeNGram",
"min_gram" : "2",
"max_gram" : "7",
"token_chars": [ "letter", "digit" ]
}
}
}
},
"mappings": {
...
"searchLabel": {
"type": "string",
"index_analyzer": "suggester_analyzer",
"search_analyzer": "standard"
}
...
}
}'
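For a feel of what this tokenizer emits, here is a rough Python approximation (edge_ngram_tokenize is a made-up helper, not an Elasticsearch API). It splits on anything that is not a letter or digit and keeps leading n-grams of 2 to 7 characters. One thing worth checking, as far as I can tell: the analyzer above has no lowercase filter, so indexed tokens keep their case, while the standard search analyzer lowercases the query. The lowercase flag below is an assumption that mimics adding a lowercase filter to the index analyzer.

```python
import re

def edge_ngram_tokenize(text, min_gram=2, max_gram=7, lowercase=False):
    """Split on non-letter/digit characters, then emit leading n-grams of each run."""
    if lowercase:
        text = text.lower()
    tokens = []
    for run in re.findall(r"[^\W_]+", text):  # runs of letters and digits only
        for n in range(min_gram, min(len(run), max_gram) + 1):
            tokens.append(run[:n])
    return tokens

print(edge_ngram_tokenize("Chrysler"))
# ['Ch', 'Chr', 'Chry', 'Chrys', 'Chrysl', 'Chrysle']

# A query for "Chry" analyzed by the standard analyzer becomes the term "chry".
print("chry" in edge_ngram_tokenize("Chrysler"))                  # False
print("chry" in edge_ngram_tokenize("Chrysler", lowercase=True))  # True
```

Also note that with max_gram 7 no token longer than seven characters is indexed, so a search term of eight or more characters will not match.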

Multi field analyzer not working as expected

I'm confused. I have the following document indexed:
POST test/topic
{
"title": "antiemetics"
}
With the following query:
{
"query": {
"query_string" : {
"fields" : ["title*"],
"default_operator": "AND",
"query" :"anti emetics",
"use_dis_max" : true
}
},
"highlight" : {
"fields" : {
"*" : {
"fragment_size" : 200,
"pre_tags" : ["<mark>"],
"post_tags" : ["</mark>"]
}
}
}
}
and the following settings and mappings:
POST test
{
"settings":{
"index":{
"number_of_shards":1,
"analysis":{
"analyzer":{
"merge":{
"type":"custom",
"tokenizer":"keyword",
"filter":[
"lowercase"
],
"char_filter":[
"hyphen",
"space",
"html_strip"
]
}
},
"char_filter":{
"hyphen":{
"type":"pattern_replace",
"pattern":"[-]",
"replacement":""
},
"space":{
"type":"pattern_replace",
"pattern":" ",
"replacement":""
}
}
}
}
},
"mappings":{
"topic":{
"properties":{
"title":{
"analyzer":"standard",
"search_analyzer":"standard",
"type":"string",
"fields":{
"specialised":{
"type":"string",
"index":"analyzed",
"analyzer":"standard",
"search_analyzer":"merge"
}
}
}
}
}
}
}
I know my use of a multi-field doesn't make sense here, since the sub-field uses the same index analyzer as title, so please ignore that; I'm more interested in checking my understanding of analyzers. I expected the merge analyzer to turn the query "anti emetics" into "antiemetics", and I hoped the multi-field with that search analyzer would then match the indexed token "antiemetics". However, I get no results, even though I have confirmed with the analyze API that the analyzer removes whitespace from the query. Any idea why?

This seems to work with your setup:
POST /test_index/_search
{
"query": {
"match": {
"title.specialised": "anti emetics"
}
}
}
Here's some code I set up to play with it:
http://sense.qbox.io/gist/3ef6926644213cf7db568557a801fec6cb15eaf9
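A possible explanation, sketched in Python below, is the difference in what each query type hands to the search analyzer. This is my understanding rather than something stated above: a match query passes the whole string to the analyzer, whereas the query_string parser in Elasticsearch versions of that era splits the text on whitespace first and analyzes each piece separately. The function names are hypothetical and the html_strip step is only a crude regex stand-in.

```python
import re

def merge_analyzer(text):
    """Mimic the 'merge' analyzer: char filters, keyword tokenizer, lowercase."""
    text = re.sub(r"[-]", "", text)      # 'hyphen' char filter
    text = re.sub(r" ", "", text)        # 'space' char filter
    text = re.sub(r"<[^>]+>", "", text)  # crude stand-in for html_strip
    return [text.lower()]                # keyword tokenizer: one token for the whole input

def analyze_whole(query):
    """Like a match query: the full query string goes through the analyzer once."""
    return merge_analyzer(query)

def analyze_per_word(query):
    """Like a parser that splits on whitespace before analysis."""
    return [token for word in query.split() for token in merge_analyzer(word)]

print(analyze_whole("anti emetics"))     # ['antiemetics']
print(analyze_per_word("anti emetics"))  # ['anti', 'emetics']
```

In the second case the analyzer never sees the space, so it cannot join the two words, and neither "anti" nor "emetics" equals the indexed token "antiemetics".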

Highlight hits within attachment content with ElasticSearch

I'm having trouble getting ElasticSearch to highlight hits within attachment content indexed using the elasticsearch-mapper-attachments plugin.
My data at /stuff/file looks like this:
{
"id": "string",
"name": "string",
"attachment": "...base 64 encoded file"
}
My mapper configuration put to /stuff/file/_mapper looks like this:
{
"file" : {
"properties" : {
"attachment" : {
"type" : "attachment",
"path" : "full",
"fields": {
"name": { "store": true },
"title": { "store": true },
"content": { "store": true },
"attachment": {
"type": "string",
"term_vector": "with_positions_offsets",
"store": true
}
}
}
}
}
}
And when I query it at /stuff/_mapper/file I get this returned:
{
"stuff":{
"mappings":{
"file":{
"properties":{
"attachment":{
"type":"attachment",
"path":"full",
"fields":{
"attachment":{
"type":"string"
},
"author":{
"type":"string"
},
"title":{
"type":"string"
},
"name":{
"type":"string"
},
"date":{
"type":"date",
"format":"dateOptionalTime"
},
"keywords":{
"type":"string"
},
"content_type":{
"type":"string"
},
"content_length":{
"type":"integer"
},
"language":{
"type":"string"
}
}
},
"id":{
"type":"string"
},
"name":{
"type":"string"
}
}
}
}
}
}
And my query looks like this:
{
"size": 30,
"query": {
"multi_match": {
"query": "{{.Query}}",
"operator": "and",
"fields": ["id", "name^4", "attachment"],
"fuzziness": "AUTO",
"minimum_should_match": "80%"
}
},
"highlight" : {
"fields" : {
"attachment": { }
}
}
}
When I search for a term that is in the attachment, it returns the correct result but there is no highlighting. There was a similar question from a few years ago that swapped attachment for file in a few places but there were comments that this has changed again... What's the right configuration to get the highlighting to work?

It turns out that you can't overwrite an existing mapper configuration using PUT. You need to delete the existing configuration first (I actually had to delete the entire database; DELETE on the configuration didn't seem to have any effect). Once the mapper configuration was actually updated, highlighting worked fine.
