I have real estate data and am looking into storing it in Elasticsearch so that users can search the database in real time.
I want users to be able to search by key fields like price, lot size, year built, total bedrooms, etc. However, I also want them to be able to filter by keywords or amenities like "Has Pool", "Has Spa", "Parking Space", "Community".
Additionally, I need to keep a distinct list of property type, property status, schools, community, etc., so I can build drop-down menus for my users to select from.
What should the stored data structure look like? How can I maintain a list of the distinct schools, communities, and types to populate those drop-down menus?
The data I currently have is basically key/value pairs. I can clean it up and standardize it before storing it in Elasticsearch, but I'm puzzled about what is considered a good approach to storing this data.
Based on your question, here are baseline mappings and a basic query with facets/filters for you to start working with.
Mappings
PUT test_jay
{
"mappings": {
"properties": {
"amenities": {
"type": "keyword"
},
"description": {
"type": "text"
},
"location": {
"type": "geo_point"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"status": {
"type": "keyword"
},
"type": {
"type": "keyword"
}
}
}
}
We use the "keyword" field type for fields on which you will always do exact matches, like drop-down lists.
For fields where we only want full-text search, like description, we use the "text" type. In some cases, like titles, we want both field types.
I created a location field of type geo_point in case you want to put your properties on a map or do distance-based searches, such as finding nearby houses.
For amenities, a keyword field is enough; it can store an array of amenity values.
Ingest document
POST test_jay/_doc
{
"name": "Nice property",
"description": "nice located fancy property",
"location": {
"lat": 37.371623,
"lon": -122.003338
},
"amenities": [
"Pool",
"Parking",
"Community"
],
"type": "House",
"status": "On sale"
}
Remember keyword fields are case sensitive!
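Because keyword fields are matched byte-for-byte, it's worth normalizing values like amenities before indexing (alternatively, a lowercase normalizer can be configured on the keyword field itself). A minimal pre-indexing sketch; `normalize_amenities` is a hypothetical helper, not an Elasticsearch API:

```python
def normalize_amenities(raw_amenities):
    """Trim, collapse whitespace, canonicalize casing, and deduplicate
    amenity strings before indexing them into a keyword field."""
    seen = set()
    result = []
    for amenity in raw_amenities:
        canonical = " ".join(amenity.strip().split()).title()
        if canonical and canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result

print(normalize_amenities(["  pool", "POOL", "parking  space", "Community"]))
# ['Pool', 'Parking Space', 'Community']
```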
Search query
POST test_jay/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"query": "nice",
"fields": [
"name",
"description"
]
}
},
"filter": [
{
"term": {
"status": "On sale"
}
},
{
"term": {
"amenities":"Pool"
}
},
{
"term": {
"type": "House"
}
}
]
}
},
"aggs": {
"amenities": {
"terms": {
"field": "amenities",
"size": 10
}
},
"status": {
"terms": {
"field": "status",
"size": 10
}
},
"type": {
"terms": {
"field": "type",
"size": 10
}
}
}
}
The multi_match part does a full-text search on the name and description fields; you fill this one from the regular search box.
The filter part is then filled from the drop-down lists.
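To illustrate how the two parts get filled from the UI, here is a sketch of assembling that request body client-side; `build_search_body` is a made-up helper name:

```python
def build_search_body(text=None, filters=None):
    """Free text from the search box goes into must (scored);
    dropdown selections go into filter (unscored exact matches)."""
    bool_query = {}
    if text:
        bool_query["must"] = {
            "multi_match": {"query": text, "fields": ["name", "description"]}
        }
    else:
        bool_query["must"] = {"match_all": {}}
    bool_query["filter"] = [
        {"term": {field: value}} for field, value in (filters or {}).items()
    ]
    return {"query": {"bool": bool_query}}

body = build_search_body("nice", {"status": "On sale", "type": "House"})
```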
Query Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test_jay",
"_type" : "_doc",
"_id" : "zWysGHgBLiMtJ3pUuvZH",
"_score" : 0.2876821,
"_source" : {
"name" : "Nice property",
"description" : "nice located fancy property",
"location" : {
"lat" : 37.371623,
"lon" : -122.003338
},
"amenities" : [
"Pool",
"Parking",
"Community"
],
"type" : "House",
"status" : "On sale"
}
}
]
},
"aggregations" : {
"amenities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Community",
"doc_count" : 1
},
{
"key" : "Parking",
"doc_count" : 1
},
{
"key" : "Pool",
"doc_count" : 1
}
]
},
"type" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "House",
"doc_count" : 1
}
]
},
"status" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "On sale",
"doc_count" : 1
}
]
}
}
}
With the query response you can fill the facets for future filters.
I recommend playing around with this and then coming back with more specific questions.
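For example, turning the terms-aggregation buckets into dropdown options could look like this sketch (`facet_options` is a hypothetical helper; the response shape matches the output above):

```python
def facet_options(response, facet_name):
    """Convert a terms aggregation into (label, doc_count) pairs for a dropdown."""
    buckets = response["aggregations"][facet_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]

response = {
    "aggregations": {
        "amenities": {"buckets": [
            {"key": "Community", "doc_count": 1},
            {"key": "Parking", "doc_count": 1},
            {"key": "Pool", "doc_count": 1},
        ]}
    }
}
print(facet_options(response, "amenities"))
# [('Community', 1), ('Parking', 1), ('Pool', 1)]
```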
I've been reading about Elasticsearch suggesters, match_phrase_prefix, and highlighting, and I'm a bit confused as to which to use for my problem.
Requirement: I have a bunch of different text fields and need to be able to autocomplete and autosuggest across all of them, as well as handle misspellings. Basically the way Google works.
In the following Google example, when we start typing "Can", it lists words like Canadian, Canada, etc. This is autocomplete. However, it also lists additional words like tire, post, post tracking, coronavirus, etc. This is autosuggest: it searches for the most relevant words across all fields. If we type "canxad" it should suggest the same results despite the misspelling.
Could someone please give me some hints on how I can implement the above functionality across a bunch of text fields?
At first I tried this:
GET /myindex/_search
{
"query": {
"match_phrase_prefix": {
"myFieldThatIsCombinedViaCopyTo": "revis"
}
},
"highlight": {
"fields": {
"*": {}
},
"require_field_match" : false
}
}
but it returns highlights like this:
"In the aforesaid revision filed by the members of the Committee, the present revisionist was also party",
So that's not a "prefix" anymore...
I also tried this:
GET /myindex/_search
{
"query": {
"multi_match": {
"query": "revis",
"fields": ["myFieldThatIsCombinedViaCopyTo"],
"type": "phrase_prefix",
"operator": "and"
}
},
"highlight": {
"fields": {
"*": {}
}
}
}
But it still returns
"In the aforesaid revision filed by the members of the Committee, the present revisionist was also party",
Note: I have about 5 "text" fields that I need to search across. One of those fields is quite long (thousands of words). If I break things up into keywords, I lose the phrase. So it's like I need match_phrase_prefix across a combined text field, with fuzziness?
EDIT
Here's an example of a document (some fields taken out, content snipped):
{
"id" : 1,
"respondent" : "Union of India",
"caseContent" : "<snip>..against the Union of India, through the ...<snip>"
}
As @Vlad suggested, I tried this:
POST /cases/_search
{
"suggest": {
"respondent-suggest": {
"prefix": "uni",
"completion": {
"field": "respondent.suggest",
"skip_duplicates": true
}
},
"caseContent-suggest": {
"prefix": "uni",
"completion": {
"field": "caseContent.suggest",
"skip_duplicates": true
}
}
}
}
Which returns this:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"caseContent-suggest" : [
{
"text" : "uni",
"offset" : 0,
"length" : 3,
"options" : [ ]
}
],
"respondent-suggest" : [
{
"text" : "uni",
"offset" : 0,
"length" : 3,
"options" : [
{
"text" : "Union of India",
"_index" : "cases",
"_type" : "_doc",
"_id" : "dI5hh3IBEqNFLVH6-aB9",
"_score" : 1.0,
"_ignored" : [
"headNote.suggest"
],
"_source" : {
<snip>
}
}
]
}
]
}
}
So it looks like it matches on the respondent field, which is great! But it didn't match on the caseContent field, even though the text (see above) includes the phrase "against the Union of India". Shouldn't it match there? Or is it because of how the text is broken up?
Since you need autocomplete/suggest on each field, you need to run a suggest query on each field, not on the copy_to field. That way you're guaranteed to get the proper prefixes.
copy_to fields are great for searching in multiple fields, but not so good for auto-suggest/-complete type of queries.
The idea is that for each of your fields, you should have a completion sub-field so that you can get auto-complete results for each of them.
PUT index
{
"mappings": {
"properties": {
"text1": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
},
"text2": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
},
"text3": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
Your suggest queries would then run on all the sub-fields directly:
POST index/_search?pretty
{
"suggest": {
"text1-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text1.suggest"
}
},
"text2-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text2.suggest"
}
},
"text3-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text3.suggest"
}
}
}
}
That takes care of the auto-complete/suggest part. For misspellings, suggest queries also allow you to specify a fuzzy parameter.
UPDATE
If you need to do prefix search on all sentences within a body of text, the approach needs to change a bit.
The new mapping below creates a new completion field next to the text one. The idea is to apply a small transformation (i.e. split sentences) to what you're going to store in the completion field. So first create the index mapping like this:
PUT index
{
"mappings": {
"properties": {
"text1": {
"type": "text",
},
"text1Suggest": {
"type": "completion"
}
}
}
}
Then create an ingest pipeline that will populate the text1Suggest field with sentences from the text1 field:
PUT _ingest/pipeline/sentence
{
"processors": [
{
"split": {
"field": "text1",
"target_field": "text1Suggest.input",
"separator": "\\.\\s+"
}
}
]
}
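The separator is a regex (`\.\s+` once the JSON escaping is removed), so each period followed by whitespace starts a new completion input. In case you want to preview client-side what the pipeline will store, the equivalent split in Python is:

```python
import re

def split_sentences(text):
    """Mirror the split processor: break on a period followed by whitespace."""
    return [part for part in re.split(r"\.\s+", text) if part]

print(split_sentences("The crazy fox. The cat drinks milk. John goes to the beach"))
# ['The crazy fox', 'The cat drinks milk', 'John goes to the beach']
```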
Then we can index a document such as this one (with only the text1 field as the completion field will be built dynamically)
PUT test/_doc/1?pipeline=sentence
{
"text1": "The crazy fox. The quick snail. John goes to the beach"
}
What gets indexed looks like this (your text1 field + another completion field optimized for sentence prefix completion):
{
"text1": "The crazy fox. The cat drinks milk. John goes to the beach",
"text1Suggest": {
"input": [
"The crazy fox",
"The cat drinks milk",
"John goes to the beach"
]
}
}
And finally you can search for prefixes of any sentence, below we search for John and you should get a suggestion:
POST test/_search?pretty
{
"suggest": {
"text1-suggest": {
"prefix": "John",
"completion": {
"field": "text1Suggest"
}
}
}
}
In short, I have an index with text data extracted from pdfs, grouped into paragraphs (called blocks).
Each document consists of a list of 'blocks', where each 'block' contains text, a page number, and coordinates for the bounding box, e.g.:
{
blocks:[
{
text:"Some text",
bbox:[0,1,2,3],
page: 1
},
{
text:"Some more text",
bbox:[0,1,2,3],
page: 2
},
{
text:"Some other text",
bbox:[0,1,2,3],
page: 2
},
],
document_issuer: 12345
}
I would like to obtain a list of all documents where e.g. the word "cash" appears and all the blocks where this appears.
My index mapping is as follows; note how 'blocks' is defined as a nested object:
{
"mappings" : {
"properties" : {
"blocks" : {
"type" : "nested",
"properties" : {
"bbox" : {
"type" : "float"
},
"page" : {
"type" : "long"
},
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"document_issuer" : {
"type" : "long"
}
}
}
}
My query looks like:
{
'query':{
'nested':{
'path': 'blocks',
'query': {
'match':{'blocks.text': 'cash'}
},
'inner_hits': {}
}
}
}
Now, the surprising thing is that I get back inner_hits, but not every instance of the search term in the documents is highlighted. Using the example above, when searching for the term "text", I'd see inner_hits contain maybe the last two blocks but not the first one.
Aren't inner hits supposed to show every single hit?
If I understand it correctly, you're wondering why your inner_hits don't always return every block. The idea of inner_hits, though, is precisely that: since the blocks within your nested field are considered standalone subdocuments, inner_hits will only return those that matched, not all of them like the parent doc's _source does.
In other words, if I index the following document, where only one block contains "cash":
POST block_index/block
{"blocks":[{"text":"cash","bbox":[0,1,2,3],"page":1},{"text":"Some more text","bbox":[0,1,2,3],"page":2},{"text":"Some other text","bbox":[0,1,2,3],"page":2}],"document_issuer":12345}
and then limit what I want to see by using _source
GET block_index/_search
{
"_source": ["blocks.text", "inner_hits"], <----
"query": {
"nested": {
"path": "blocks",
"query": {
"match": {
"blocks.text": "cash"
}
},
"inner_hits": {
"_source": "blocks.text" <-----
}
}
}
}
I'll get something along the lines of
{
...
"hits" : {
"total" : 1,
"max_score" : 1.2800652,
"hits" : [
{
"_index" : "block_index",
"_type" : "block",
"_id" : "0iQ9mXEBdiyDG0RsIKyn",
"_score" : 1.2800652,
"_source" : {
"blocks" : [ <----
{
"text" : "cash"
},
{
"text" : "Some more text"
},
{
"text" : "Some other text"
}
]
},
"inner_hits" : {
"blocks" : {
"hits" : {
"total" : 1,
"max_score" : 1.2800652,
"hits" : [
{
"_index" : "block_index",
"_type" : "block",
"_id" : "0iQ9mXEBdiyDG0RsIKyn",
"_nested" : {
"field" : "blocks",
"offset" : 0
},
"_score" : 1.2800652,
"_source" : {
"text" : "cash" <-----
}
}
]
}
}
}
}
]
}
}
While I may want to see all my blocks' texts, I'm probably more interested in the ones that actually caused the whole parent doc to match after performing my nested query.
Hope this helps.
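Client-side, picking out the matching blocks from such a response could be sketched like this (hypothetical helper; the sample response is trimmed to the relevant keys):

```python
def matched_blocks(search_response):
    """Pull out, per parent document, only the nested blocks that matched."""
    results = []
    for hit in search_response["hits"]["hits"]:
        inner = hit["inner_hits"]["blocks"]["hits"]["hits"]
        results.append((hit["_id"], [h["_source"]["text"] for h in inner]))
    return results

# Shape mirrors the response above, trimmed down:
response = {
    "hits": {"hits": [{
        "_id": "0iQ9mXEBdiyDG0RsIKyn",
        "inner_hits": {"blocks": {"hits": {"hits": [
            {"_source": {"text": "cash"}}
        ]}}},
    }]}
}
print(matched_blocks(response))
# [('0iQ9mXEBdiyDG0RsIKyn', ['cash'])]
```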
I know that Elasticsearch does not support fuzziness with the cross_fields type in a multi_match query. I have a very difficult time with the Elasticsearch API and so I'm finding it challenging to build an analogous query that searches across multiple document fields with fuzzy string matching.
I have an index called papers with various fields such as Title, Author.FirstName, Author.LastName, PublicationDate, Journal etc... I want to be able to query with a string like "John Doe paper title 2015 journal name". cross_fields is the perfect multi_match type but it doesn't support fuzziness which is critical for my application.
Can anyone suggest a reasonable way to approach this? I've spent hours going through solutions on SO and the Elasticsearch forums with little success.
You can make use of a copy_to field for this scenario. Basically, you copy the values from the different fields into one new field (my_search_field below), and on that field you can run a fuzzy query via the fuzziness parameter using a simple match query.
Below are a sample mapping, document, and query:
Mapping:
PUT my_fuzzy_index
{
"mappings": {
"properties": {
"my_search_field":{ <---- Note this field
"type": "text"
},
"Title":{
"type": "text",
"copy_to": "my_search_field" <---- Note this
},
"Author":{
"type": "nested",
"properties": {
"FirstName":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
},
"LastName":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
}
}
},
"PublicationDate":{
"type": "date",
"copy_to": "my_search_field" <---- Note this
},
"Journal":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
}
}
}
}
Sample Document:
POST my_fuzzy_index/_doc/1
{
"Title": "Fountainhead",
"Author":[
{
"FirstName": "Ayn",
"LastName": "Rand"
}
],
"PublicationDate": "2015",
"Journal": "journal"
}
Query Request:
POST my_fuzzy_index/_search
{
"query": {
"match": {
"my_search_field": { <---- Note this field
"query": "Aynnn Ranaad Fountainhead 2015 journal",
"fuzziness": 3 <---- Fuzzy parameter
}
}
}
}
Response:
{
"took" : 15,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.1027813,
"hits" : [
{
"_index" : "my_fuzzy_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.1027813,
"_source" : {
"Title" : "Fountainhead",
"Author" : [
{
"FirstName" : "Ayn",
"LastName" : "Rand"
}
],
"PublicationDate" : "2015",
"Journal" : "journal"
}
}
]
}
}
So instead of thinking about applying a fuzzy query across multiple fields, you can go with this approach instead; that way your query is simplified.
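As a side note on what fuzziness tolerates: it is Levenshtein edit distance (insertions, deletions, substitutions), and Lucene normally caps it at 2 edits per term. A quick sketch of the metric to show why "Aynnn" and "Ranaad" still land on "Ayn" and "Rand":

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Aynnn", "Ayn"))    # 2
print(levenshtein("Ranaad", "Rand"))  # 2
```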
Let me know if this helps!
Having these documents:
{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
and
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
I want the _score to be calculated based on the confidence values for each tag. For example, if you search for "mountain", it should obviously return only the doc with id 2. If you search for "landscape", the score of doc 2 should be higher than doc 1, as the confidence of landscape in doc 2 is higher (48.36 vs 33.66). If you search for "coast landscape", this time the score of doc 1 should be higher than doc 2, because doc 1 has both coast and landscape in its tags array. I also want to multiply the score by "boost_multiplier" to boost some documents over others.
I found this question on SO: Elasticsearch: Influence scoring with custom score field in document
But when I tried the accepted solution (I enabled scripting on my ES server), it returns both documents with _score 1.0, regardless of the search term. Here is the query that I tried:
{
"query": {
"nested": {
"path": "tags",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"tags.tag": "coast landscape"
}
},
"script_score": {
"script": "doc[\"confidence\"].value"
}
}
}
}
}
}
I also tried what @yahermann suggested in the comments, replacing "script_score" with "field_value_factor" : { "field" : "confidence" }, but still the same result. Any idea why it fails, or is there a better way to do it?
Just to have the complete picture, here is the mapping definition that I've used:
{
"mappings": {
"photo": {
"properties": {
"created_at": {
"type": "date"
},
"description": {
"type": "text"
},
"height": {
"type": "short"
},
"id": {
"type": "keyword"
},
"tags": {
"type": "nested",
"properties": {
"tag": { "type": "string" },
"confidence": { "type": "float"}
}
},
"width": {
"type": "short"
},
"color": {
"type": "string"
},
"boost_multiplier": {
"type": "float"
}
}
}
},
"settings": {
"number_of_shards": 1
}
}
UPDATE
Following @Joanna's answer below, I tried the query, but whatever I put in the match query (coast, foo, bar), it always returns both documents with _score 1.0. I tried it on Elasticsearch 2.4.6, 5.3, and 5.5.1 in Docker. Here is the response I get:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635
{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
}]}}
UPDATE-2
I found this one on SO: Elasticsearch: "function_score" with "boost_mode":"replace" ignores function score
It basically says that if the function doesn't match, it returns 1. That makes sense, but I was running the query against the same docs, which was confusing.
FINAL UPDATE
Finally I found the problem, stupid me. ES101: if you send a GET request to the search API and your client drops the request body, it returns all documents with score 1.0 :) You should send a POST request... Thanks a lot @Joanna, it works perfectly!!!
You may try this query; it combines scoring from both the confidence and boost_multiplier fields:
{
"query": {
"function_score": {
"query": {
"bool": {
"should": [{
"nested": {
"path": "tags",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"tags.tag": "landscape"
}
},
"field_value_factor": {
"field": "tags.confidence",
"factor": 1,
"missing": 0
}
}
}
}
}]
}
},
"field_value_factor": {
"field": "boost_multiplier",
"factor": 1,
"missing": 0
}
}
}
}
When I search with the term coast, it returns only the document with id=1, as only this one has the term, with "_score": 100.27469.
When I search with landscape term - it returns two documents:
document with id=2 and scoring "_score": 85.83046
document with id=1 and scoring "_score": 59.7339
As the document with id=2 has a higher value in the confidence field, it gets a higher score.
When I search with coast landscape term - it returns two documents:
document with id=1 and scoring "_score": 160.00859
document with id=2 and scoring "_score": 85.83046
Although the document with id=2 has a higher confidence value, the document with id=1 matches both words, so it gets a much higher score. By changing the "factor" parameter, you can decide how much confidence should influence the results.
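A simplified model of how these numbers combine, ignoring the text-match relevance component (which is why the real responses show different absolute scores); `approximate_score` is an illustrative helper, not an Elasticsearch API:

```python
def approximate_score(matched_confidences, boost_multiplier, factor=1.0):
    """score_mode=sum adds one confidence-scaled contribution per matched
    nested tag; the outer function_score multiplies by boost_multiplier."""
    nested_sum = sum(conf * factor for conf in matched_confidences)
    return nested_sum * boost_multiplier

# Doc 1 matches both "coast" (43.58) and "landscape" (33.66);
# doc 2 matches only "landscape" (48.37):
assert approximate_score([43.58, 33.66], 1) > approximate_score([48.37], 1)
# A boost_multiplier of 4 lets a single-match doc overtake both:
assert approximate_score([48.37], 4) > approximate_score([43.58, 33.66], 1)
```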
boost_multiplier field
A more interesting thing happens when I index a new document: say it is almost the same as the document with id=2, but I set "boost_multiplier" : 4 and "id" : 3:
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "3",
"tags" : [
...
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
...
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 4
}
Running the same query with the term coast landscape returns three documents:
document with id=3 and scoring "_score": 360.02664
document with id=1 and scoring "_score": 182.09859
document with id=2 and scoring "_score": 90.00666
Although the document with id=3 has only one matching word (landscape), its boost_multiplier value considerably increased the score. Here, with "factor": 1, you can also decide how much this value should increase the score, and with "missing": 0, what should happen if no such field is indexed.