Highlight hits within attachment content with ElasticSearch

I'm having trouble getting ElasticSearch to highlight hits within attachment content indexed using the elasticsearch-mapper-attachments plugin.
My data at /stuff/file looks like this:
{
"id": "string",
"name": "string",
"attachment": "...base 64 encoded file"
}
My mapper configuration put to /stuff/file/_mapper looks like this:
{
"file" : {
"properties" : {
"attachment" : {
"type" : "attachment",
"path" : "full",
"fields": {
"name": { "store": true },
"title": { "store": true },
"content": { "store": true },
"attachment": {
"type": "string",
"term_vector": "with_positions_offsets",
"store": true
}
}
}
}
}
}
And when I query it at /stuff/_mapper/file I get this returned:
{
"stuff":{
"mappings":{
"file":{
"properties":{
"attachment":{
"type":"attachment",
"path":"full",
"fields":{
"attachment":{
"type":"string"
},
"author":{
"type":"string"
},
"title":{
"type":"string"
},
"name":{
"type":"string"
},
"date":{
"type":"date",
"format":"dateOptionalTime"
},
"keywords":{
"type":"string"
},
"content_type":{
"type":"string"
},
"content_length":{
"type":"integer"
},
"language":{
"type":"string"
}
}
},
"id":{
"type":"string"
},
"name":{
"type":"string"
}
}
}
}
}
}
And my query looks like this:
{
"size": 30,
"query": {
"multi_match": {
"query": "{{.Query}}",
"operator": "and",
"fields": ["id", "name^4", "attachment"],
"fuzziness": "AUTO",
"minimum_should_match": "80%"
}
},
"highlight" : {
"fields" : {
"attachment": { }
}
}
}
When I search for a term that is in the attachment, it returns the correct result but there is no highlighting. A similar question from a few years ago swapped attachment for file in a few places, but comments there say this has changed again. What's the right configuration to get highlighting to work?

It turns out that you can't overwrite an existing mapping configuration using PUT. You need to delete the existing configuration first (I actually had to delete the entire index; a DELETE on the configuration alone didn't seem to have any effect). Once the mapping configuration was actually updated, highlighting worked fine.

Related

How can I get distinct results?

I am querying Elasticsearch (version 6.4) and want unique search results (named eids). I wrote the query below. I want to run a text search over two fields, eLabel and pLabel, and get the distinct eid values that match. However, the result is not aggregated: it shows redundant ids from 0 to over 20. How can I adjust the query?
{
"query": {
"multi_match": {
"query": "Brazil Capital",
"fields": [
"eLabel",
"pLabel"
]
}
},
"size": 200,
"_source": [
"eid",
"eLabel"
],
"aggs": {
"eids": {
"terms": {
"field": "eid"
}
}
}
}
My current mappings are as follows.
eid : id of entity
eLabel: entity label (ex, Brazil)
prop_id: property id of the entity (eid)
pLabel: the label of the property (ex, is the capital of, is located at ...)
"mappings": {
"entity": {
"properties": {
"eLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"eid": {
"type": "keyword"
} ,
"subclass": {
"type": "boolean"
} ,
"pLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"prop_id": {
"type": "keyword"
} ,
"pType": {
"type": "keyword"
} ,
"way": {
"type": "keyword"
} ,
"chain": {
"type": "integer"
} ,
"siteKey": {
"type": "keyword"
},
"version": {
"type": "integer"
},
"docId": {
"type": "integer"
}
}
}
}
Based on your comment, you can use the bool query below. I don't think anything is wrong with the aggregation; just replace your query with this bool query and I think it will suffice.
A multi_match query retrieves a document even if it has eLabel = "Rio is capital of brazil" and pLabel = "something else entirely here", because a match in either field is enough.
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"eLabel": "capital"
}
},
{
"match": {
"pLabel": "brazil"
}
}
]
}
},
"size": 200,
"_source": [
"eid",
"eLabel"
],
"aggs": {
"eids": {
"terms": {
"field": "eid"
}
}
}
}
Note that if you only want the eid values and not the documents, you can set "size": 0 in the query above. That way only the aggregation results are returned.
Let me know if this helps!!
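As a rough illustration of the "size": 0 suggestion, here is a small Python sketch that builds such a request body and reads the distinct eid values out of a terms aggregation response. The helper names, the `max_buckets` parameter, and the sample response (including its eid values) are hypothetical and only meant to show the shape of the data. One thing worth knowing is that a terms aggregation returns a limited number of buckets by default, so the sketch sets an explicit bucket `size`.

```python
from typing import Any, Dict, List


def build_distinct_eid_request(e_text: str, p_text: str, max_buckets: int = 200) -> Dict[str, Any]:
    """Build a search body that returns only the `eids` aggregation."""
    return {
        "size": 0,  # no hits, aggregation results only
        "query": {
            "bool": {
                "must": [
                    {"match": {"eLabel": e_text}},
                    {"match": {"pLabel": p_text}},
                ]
            }
        },
        # `size` inside the terms aggregation caps the number of buckets returned.
        "aggs": {"eids": {"terms": {"field": "eid", "size": max_buckets}}},
    }


def distinct_eids(response: Dict[str, Any]) -> List[str]:
    """Collect one key per bucket from the `eids` terms aggregation."""
    buckets = response.get("aggregations", {}).get("eids", {}).get("buckets", [])
    return [bucket["key"] for bucket in buckets]


# Hypothetical response fragment, shaped like a terms aggregation result.
sample_response = {
    "hits": {"total": 3, "hits": []},
    "aggregations": {
        "eids": {
            "buckets": [
                {"key": "Q155", "doc_count": 2},
                {"key": "Q2844", "doc_count": 1},
            ]
        }
    },
}

body = build_distinct_eid_request("capital", "brazil")
print(distinct_eids(sample_response))
```

Each bucket key appears once regardless of how many documents share that eid, which is the "distinct" behaviour the question asks for.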

Elasticsearch: Why can't I use "5m" for precision in context queries?

I'm running Elasticsearch 5.5.
I have an index with the following mapping:
"mappings": {
"shops": {
"properties": {
"locations": {
"type": "geo_point"
},
"name": {
"type": "keyword"
},
"suggest": {
"type": "completion",
"contexts": [
{
"name": "location",
"type": "GEO",
"precision": "10m",
"path": "locations"
}
]
}
}
}
}
I'll add a document as follows:
PUT my_index/shops
{
"name":"random shop",
"suggest":{
"input":"random shop"
},
"locations":[
{
"lat":42.38471212,
"lon":-71.12612357
}
]
}
I try to query for the document with the following JSON call:
GET my_shops/_search
{
"suggest": {
"result": {
"prefix": "random",
"completion": {
"field": "suggest",
"size": 5,
"fuzzy": true,
"contexts": {
"location": [{
"lat": 42.38471212,
"lon": -71.12612357,
"precision": "10mi"
}]
}
}
}
}
}
I get the following errors:
(error screenshot not preserved; source: discourse.org)
But when I change the "precision" field to an int, I get the intended search results.
I'm confused on two fronts:
1. Why is there a context error? The documentation seems to say that this is OK: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/suggester-context.html
2. Why can't I use string values for the precision? At the bottom of that page, I see that precision can take either distances or numeric values.

Elasticsearch search pattern with Start string

I am new to Elasticsearch and trying to implement search. Below are my index and settings:
curl -XPUT localhost:9200/rets_data/ -d '{
"settings":{
"index":{
"analysis":{
"analyzer":{
"analyzer_startswith":{
"tokenizer":"keyword",
"filter":"lowercase"
},
"analyzer_whitespacewith":{
"tokenizer":"whitespace",
"filter":"lowercase"
}
}
}
}
},
"mappings":{
"city":{
"properties":{
"CityName":{
"analyzer":"analyzer_startswith",
"type":"string"
}
}
},
"rets_aux_subdivision":{
"properties":{
"nn":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"field_LIST_77":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"SubDivisionName":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"SubDivisionAlias":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
}
}
},
"rental_aux_subdivision":{
"properties":{
"nn":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"field_LIST_77":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"SubDivisionName":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
},
"SubDivisionAlias":{
"analyzer":"analyzer_whitespacewith",
"type":"string"
}
}
}
}
}'
Below is the search request:
curl -XGET localhost:9200/rets_data/rets_aux_subdivision/_search?pretty -d '{"query":{"match_phrase_prefix":{"nn":{"query":"boca w","max_expansions":50}}},"sort":{"total":{"order":"desc"}},"size":100}'
When I search for text like "Boca r" or "Boca w", I get no results.
My expected result is below.
"Boca w" should return results starting with "Boca w", i.e. "Boca west", "Boca Woods", "Boca Winds".
Please help me with this.
Thanks
You should use an edge n-gram filter (edge_ngram); see the Elasticsearch documentation. The filter produces multiple terms from each word, like this:
Woods->[W,Wo,Woo,Wood,Woods]
It makes the index bigger, but searching is more efficient than alternatives such as wildcards. Here is my simple index creation with ngrams on title.ngram:
{
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"ngram_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["lowercase","my_ngram"]
}
},
"filter" : {
"my_ngram" : {
"type" : "edge_ngram",
"min_gram" : 1,
"max_gram" : 50
}
}
}
}
},
"mappings":
{
"post":
{
"properties":
{
"id":
{
"type": "integer",
"index":"no"
},
"title":
{
"type": "text",
"analyzer":"ngram_analyzer"
}
}
}
}
}
And search query:
{
"from" : 0,
"size" : 10,
"query" : {
"match" : {
"title":
{
"query":"press key han",
"operator":"or",
"analyzer":"standard"
}
}
}
}
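To make the edge n-gram idea above concrete, here is a minimal Python sketch of what the filter does to each word and why a query like "boca w" can then match without wildcards. It is a simplified model, not Elasticsearch itself: whitespace splitting stands in for the standard tokenizer, the function names are made up for illustration, and it assumes every query term must match (the search query above uses "operator": "or" instead).

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 50) -> List[str]:
    """Return the leading substrings of `token` from min_gram to max_gram characters."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text: str) -> set:
    # Simplified stand-in for standard tokenizer + lowercase + edge n-gram filter.
    terms = set()
    for word in text.lower().split():
        terms.update(edge_ngrams(word))
    return terms


def prefix_search(query: str, titles: List[str]) -> List[str]:
    # Query terms are only lowercased and split (no n-grams at search time),
    # and every query term must be present among a title's indexed terms.
    query_terms = query.lower().split()
    return [t for t in titles if all(q in index_terms(t) for q in query_terms)]


titles = ["Boca West", "Boca Woods", "Boca Winds", "Boca Raton"]
print(edge_ngrams("Woods"))
print(prefix_search("boca w", titles))
```

Note that this model matches word prefixes anywhere in the text, so a title such as "West Boca" would also match "boca w". Anchoring to the start of the whole string would need a different analyzer setup.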
What if you have your match something like this:
"query": {
"match_phrase": {
"text": {
"query": "boca w"
}
}
},
"sort":{
"total":{
"order":"desc"
}
},
"size":100
Or you could use the wildcard query:
"query": {
"wildcard" : {
"yourfield" : "boca w*"
}
}
This SO could be helpful. Hope it helps!

Elasticsearch - Cardinality over Full Field Value

I have a document that looks like this:
{
"_id":"some_id_value",
"_source":{
"client":{
"name":"x"
},
"project":{
"name":"x November 2016"
}
}
}
I am attempting to perform a query that will fetch me the count of unique project names for each client. For this, I am using a query with cardinality over the project.name. I am sure that there are only 4 unique project names for this particular client. However, when I run my query, I get a count of 5, which I know is wrong.
The project names all contain the name of the client. For instance, if a client is "X", project names will be "X Testing November 2016", or "X Jan 2016", etc. I don't know if that is a consideration.
This is the mapping for the document type
{
"mappings":{
"vma_docs":{
"properties":{
"client":{
"properties":{
"contact":{
"type":"string"
},
"name":{
"type":"string"
}
}
},
"project":{
"properties":{
"end_date":{
"format":"yyyy-MM-dd",
"type":"date"
},
"project_type":{
"type":"string"
},
"name":{
"type":"string"
},
"project_manager":{
"index":"not_analyzed",
"type":"string"
},
"start_date":{
"format":"yyyy-MM-dd",
"type":"date"
}
}
}
}
}
}
}
This is my search query
{
"fields":[
"client.name",
"project.name"
],
"query":{
"bool":{
"must":{
"match":{
"client.name":{
"operator":"and",
"query":"ABC systems"
}
}
}
}
},
"aggs":{
"num_projects":{
"cardinality":{
"field":"project.name"
}
}
},
"size":5
}
These are the results I get (I have only posted 2 results for the sake of brevity). Note that the num_projects aggregation returns 5, but it should return 4, which is the total number of projects.
{
"hits":{
"hits":[
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9IBwwoAW3mzgKz",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
},
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9YBwwoAW3mzgK2",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
}
],
"total":18,
"max_score":5.8553367
},
"_shards":{
"successful":5,
"failed":0,
"total":5
},
"took":4,
"aggregations":{
"num_projects":{
"value":5
}
},
"timed_out":false
}
FYI: The project names are ABC, ABC Nov 2016, ABC retest November, ABC Mobile App
You need the following mapping for your project.name field:
{
"mappings": {
"vma_docs": {
"properties": {
"client": {
"properties": {
"contact": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"project": {
"properties": {
"end_date": {
"format": "yyyy-MM-dd",
"type": "date"
},
"project_type": {
"type": "string"
},
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"project_manager": {
"index": "not_analyzed",
"type": "string"
},
"start_date": {
"format": "yyyy-MM-dd",
"type": "date"
}
}
}
}
}
}
}
It's basically a subfield called raw: the same value that is put in project.name is also put in project.name.raw, but left untouched (not tokenized or analyzed). The query you then need to use is:
{
"fields": [
"client.name",
"project.name"
],
"query": {
"bool": {
"must": {
"match": {
"client.name": {
"operator": "and",
"query": "ABC systems"
}
}
}
}
},
"aggs": {
"num_projects": {
"cardinality": {
"field": "project.name.raw"
}
}
},
"size": 5
}
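The difference between counting over the analyzed field and over the raw subfield can be sketched in plain Python. This is only a rough model: the `analyzed_terms` helper is a hypothetical stand-in for an analyzer (lowercase, split on non-alphanumeric characters), and the list of values simply repeats the project names given in the question.

```python
import re
from typing import Iterable, List


def analyzed_terms(value: str) -> List[str]:
    # Rough stand-in for an analyzed string field: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def cardinality_analyzed(values: Iterable[str]) -> int:
    # Counts distinct terms, which is what an analyzed field exposes to aggregations.
    return len({term for v in values for term in analyzed_terms(v)})


def cardinality_raw(values: Iterable[str]) -> int:
    # Counts distinct whole values, as a not_analyzed subfield would.
    return len(set(values))


project_names = [
    "ABC", "ABC Nov 2016", "ABC retest November", "ABC Mobile App",
    "ABC", "ABC Nov 2016",  # repeated across documents
]
print(cardinality_analyzed(project_names))
print(cardinality_raw(project_names))
```

The analyzed count here need not equal the 5 reported in the question, since the real figure depends on which documents match and on the analyzer, and the cardinality aggregation is itself approximate. The point is only that terms, not whole names, are being counted until the raw subfield is used.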

Partial word search - ElasticSearch 1.7.2

I've been trying to build a search module for an application, using ElasticSearch. Below is the Index Structure I've constructed from sample code I read from other StackOverflow posts.
{
"megacorp4":{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"custom",
"tokenizer":"my_ngram_tokenizer",
"filter":[
"my_ngram_filter"
]
}
},
"filter":{
"my_ngram_filter":{
"type":"edgeNGram",
"min_gram":3,
"max_gram":15
}
},
"tokenizer":{
"my_ngram_tokenizer":{
"type":"edgeNGram",
"min_gram":3,
"max_gram":15
}
}
},
"mappings":{
"employee":{
"properties":{
"about":{
"type":"string",
"analyzer":"my_analyzer"
},
"age":{
"type":"long"
},
"first_name":{
"type":"string"
},
"interests":{
"type":"string",
"analyzer":"my_analyzer"
},
"last_name":{
"type":"string"
}
}
}
}
}
}
}
Below are the records I inserted to test the search functionality
[
{
"first_name":"John",
"last_name":"Smith",
"age":25,
"about":"I love to go rock climbing",
"interests":[
"sports",
"music"
]
},
{
"first_name":"Douglas",
"last_name":"Fir",
"age":35,
"about":"I like to build album climb cabinets",
"interests":[
"forestry",
"music"
]
},
{
"first_name":"Jane",
"last_name":"Smith",
"age":32,
"about":"I like to collect rock albums",
"interests":[
"music"
]
}
]
I ran a search on the 'about' field, using both the API (through Postman) and the Python client, as follows:
API Query:
localhost:9200/megacorp4/_search?q=climb
Python Query :
from elasticsearch import Elasticsearch
from pprint import pprint
es = Elasticsearch()
res = es.search(index="megacorp4", body={"query": {"match": {'about':"climb"}}})
pprint(res)
I only get exact matches; the record with 'climbing' is not in the output. However, when I replace 'climb' with 'climb*' in the query, I get 2 records, with 'climb' and 'climbing'. I don't want to use the '*' wildcard approach.
I've also tried the built-in 'english', 'standard' and 'ngram' analyzers, but nothing seemed to work.
I need help implementing partial-word search in full text.
Thanks in advance.
Use this mapping instead:
DELETE test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_ngram_filter"
]
}
},
"filter": {
"my_ngram_filter": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 15
}
}
}
},
"mappings": {
"employee": {
"properties": {
"about": {
"type": "string",
"analyzer": "my_analyzer"
},
"age": {
"type": "long"
},
"first_name": {
"type": "string"
},
"interests": {
"type": "string",
"analyzer": "my_analyzer"
},
"last_name": {
"type": "string"
}
}
}
}
}
POST /test/employee/_bulk
{"index":{}}
{"first_name":"John","last_name":"Smith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}
{"index":{}}
{"first_name":"Douglas","last_name":"Fir","age":35,"about":"I like to build album climb cabinets","interests":["forestry","music"]}
{"index":{}}
{"first_name":"Jane","last_name":"Smith","age":32,"about":"I like to collect rock albums","interests":["music"]}
GET /test/_search?q=about:climb
GET /test/_search
{
"query": {
"query_string": {
"query": "about:climb"
}
}
}
GET /test/_search
{
"query": {
"match": {
"about": "climb"
}
}
}
Two changes:
- you need another closing curly bracket for the settings part
- replace your custom tokenizer (which will not help you, since you already have the edgeNGram filter) with another one; my suggestion is the standard tokenizer
As for the ?q=climb part: by default this searches the _all field, which is analyzed with the standard analyzer and not with your custom one.
So the correct query is localhost:9200/megacorp4/_search?q=about:climb.
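The tokenizer change can be illustrated with a small Python model. It rests on an assumption worth stating loudly: that an edgeNGram tokenizer with no further configuration treats the whole field value as one string and emits only prefixes of it. Whitespace splitting stands in for the standard tokenizer, and the function names are hypothetical.

```python
from typing import List


def edge_ngrams(text: str, min_gram: int = 3, max_gram: int = 15) -> List[str]:
    """Leading substrings of `text` between min_gram and max_gram characters long."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]


def original_analyzer(text: str) -> set:
    # Assumption: the edgeNGram tokenizer sees the whole field value as one string,
    # so only prefixes of the full text are emitted; the filter then re-grams each.
    terms = set()
    for gram in edge_ngrams(text):
        terms.update(edge_ngrams(gram))
    return terms


def suggested_analyzer(text: str) -> set:
    # Whitespace split stands in for the standard tokenizer; each word is n-grammed.
    terms = set()
    for word in text.split():
        terms.update(edge_ngrams(word))
    return terms


about = "I love to go rock climbing"
print("climb" in original_analyzer(about))
print("climb" in suggested_analyzer(about))
print(sorted(suggested_analyzer(about)))
```

Under that assumption the original setup only ever indexes terms like "I l" or "I lo", so "climb" has nothing to match, while tokenizing into words first yields "cli", "clim", "climb" and so on for "climbing".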
