How to store real estate data in Elasticsearch?

I have real estate data and am looking into storing it in Elasticsearch so users can search the database in real time.
I want to let my users search by key fields like price, lot size, year built, total bedrooms, etc. I also want to let them filter by keywords or amenities like "Has Pool", "Has Spa", "Parking Space", "Community".
Additionally, I need to keep a distinct list of property types, property statuses, schools, communities, etc. so I can create drop-down menus for my users to select from.
What should the stored data structure look like? How can I maintain a list of the distinct schools, communities, and types to build those drop-down menus?
The data I currently have is basically key/value pairs. I can clean it up and standardize it before storing it in Elasticsearch, but I am puzzled about what is considered a good approach to storing this data.

Based on your question I will provide baseline mappings and a basic query with facets/filters for you to start working with.
Mappings
PUT test_jay
{
"mappings": {
"properties": {
"amenities": {
"type": "keyword"
},
"description": {
"type": "text"
},
"location": {
"type": "geo_point"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"status": {
"type": "keyword"
},
"type": {
"type": "keyword"
}
}
}
}
We use the "keyword" field type for fields where you will always be doing exact matches, like the values of a drop-down list.
For fields where we only want full-text search, like description, we use the "text" type. In some cases, like titles, I want both field types, which is why name has a keyword sub-field.
I created a location field of type geo_point in case you want to put your properties on a map or do distance-based searches, like nearby houses.
For amenities, a keyword field type is enough to store an array of amenities.
Ingest document
POST test_jay/_doc
{
"name": "Nice property",
"description": "nice located fancy property",
"location": {
"lat": 37.371623,
"lon": -122.003338
},
"amenities": [
"Pool",
"Parking",
"Community"
],
"type": "House",
"status": "On sale"
}
Remember keyword fields are case sensitive!
Search query
POST test_jay/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"query": "nice",
"fields": [
"name",
"description"
]
}
},
"filter": [
{
"term": {
"status": "On sale"
}
},
{
"term": {
"amenities":"Pool"
}
},
{
"term": {
"type": "House"
}
}
]
}
},
"aggs": {
"amenities": {
"terms": {
"field": "amenities",
"size": 10
}
},
"status": {
"terms": {
"field": "status",
"size": 10
}
},
"type": {
"terms": {
"field": "type",
"size": 10
}
}
}
}
The multi_match part does a full-text search on the name and description fields. You fill this one from the regular search box.
The filter part is then filled from the drop-down lists.
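As a rough illustration of how the search box and the drop-down lists could feed that request, here is a minimal Python sketch that only builds the request body. The function name build_property_search and the idea of passing None for an unselected drop-down are assumptions made for this example, not part of the answer above.

```python
from typing import Any, Dict, Optional


def build_property_search(text: str, filters: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Build a search body: free text goes in `must`, drop-down picks in `filter`."""
    # Only drop-downs where the user actually picked a value become term filters.
    term_filters = [
        {"term": {field: value}}
        for field, value in filters.items()
        if value is not None
    ]
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {"query": text, "fields": ["name", "description"]}
                },
                "filter": term_filters,
            }
        },
        # One terms aggregation per drop-down, mirroring the query shown above.
        "aggs": {
            field: {"terms": {"field": field, "size": 10}} for field in filters
        },
    }


# Hypothetical user input: a search text plus two selected drop-down values.
body = build_property_search(
    "nice", {"status": "On sale", "amenities": "Pool", "type": None}
)
```

Because the values are sent to keyword fields through term filters, they have to match the stored values exactly, including case.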
Query Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test_jay",
"_type" : "_doc",
"_id" : "zWysGHgBLiMtJ3pUuvZH",
"_score" : 0.2876821,
"_source" : {
"name" : "Nice property",
"description" : "nice located fancy property",
"location" : {
"lat" : 37.371623,
"lon" : -122.003338
},
"amenities" : [
"Pool",
"Parking",
"Community"
],
"type" : "House",
"status" : "On sale"
}
}
]
},
"aggregations" : {
"amenities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Community",
"doc_count" : 1
},
{
"key" : "Parking",
"doc_count" : 1
},
{
"key" : "Pool",
"doc_count" : 1
}
]
},
"type" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "House",
"doc_count" : 1
}
]
},
"status" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "On sale",
"doc_count" : 1
}
]
}
}
}
With the query response you can fill the facets for future filters.
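A minimal Python sketch of that step, assuming the response has already been parsed into a dict. The helper name dropdown_options is hypothetical, and the sample data is a trimmed copy of the response shown above.

```python
from typing import Any, Dict, List


def dropdown_options(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each terms aggregation in the response to the list of its bucket keys."""
    return {
        name: [bucket["key"] for bucket in agg.get("buckets", [])]
        for name, agg in response.get("aggregations", {}).items()
    }


# Trimmed copy of the aggregations part of the response shown above.
sample_response = {
    "aggregations": {
        "amenities": {"buckets": [
            {"key": "Community", "doc_count": 1},
            {"key": "Parking", "doc_count": 1},
            {"key": "Pool", "doc_count": 1},
        ]},
        "type": {"buckets": [{"key": "House", "doc_count": 1}]},
        "status": {"buckets": [{"key": "On sale", "doc_count": 1}]},
    }
}

options = dropdown_options(sample_response)
```

Note that these buckets only cover documents matching the current query and filters. To list every distinct value for a drop-down, the same aggregations could be run without the query and filters, with "size": 0 on the request so no hits are returned.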
I recommend playing around with this and then coming back with more specific questions.

Related

Elasticsearch aggregation on different search in same query

I want to run aggregations based only on the match, no matter what other parameters (terms, term, etc.) are used.
To be more specific, I have an online shop where I use multiple filters (color, size, etc.). If I check a value, for example color: red, the other colors are no longer aggregated.
The solution I am using is to make 2 separate queries: one for the search, where the filters are applied, and another for the aggregations. Any idea how I can combine the 2 separate queries?
You can take advantage of post_filter, which does not apply to your aggregations and only filters the hits to be returned. For example:
Create a shop
PUT online_shop
{
"mappings": {
"properties": {
"color": {
"type": "keyword"
},
"size": {
"type": "integer"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Populate it with a few products
POST online_shop/_doc
{"color":"red","size":35,"name":"Louboutin High heels abc"}
POST online_shop/_doc
{"color":"black","size":34,"name":"Louboutin Boots abc"}
POST online_shop/_doc
{"color":"yellow","size":36,"name":"XYZ abc"}
Apply a shared query to the hits as well as aggregations and use post_filter to ... post-filter the hits:
GET online_shop/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
}
]
}
},
"aggs": {
"by_color": {
"terms": {
"field": "color"
}
},
"by_size": {
"terms": {
"field": "size"
}
}
},
"post_filter": {
"bool": {
"must": [
{
"term": {
"color": {
"value": "red"
}
}
}
]
}
}
}
Expected result
{
...
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.11750763,
"hits" : [
{
"_index" : "online_shop",
"_type" : "_doc",
"_id" : "cehma3IBG_KW3EFn1QYa",
"_score" : 0.11750763,
"_source" : {
"color" : "red",
"size" : 35,
"name" : "Louboutin High heels abc"
}
}
]
},
"aggregations" : {
"by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "black",
"doc_count" : 1
},
{
"key" : "red",
"doc_count" : 1
},
{
"key" : "yellow",
"doc_count" : 1
}
]
},
"by_size" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 34,
"doc_count" : 1
},
{
"key" : 35,
"doc_count" : 1
},
{
"key" : 36,
"doc_count" : 1
}
]
}
}
}
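To show how the shop's checked filters could be turned into such a request, here is a minimal Python sketch that only builds the body. The function name build_faceted_search and the fixed facet list are assumptions for this example.

```python
from typing import Any, Dict


def build_faceted_search(name_query: str, selected: Dict[str, Any]) -> Dict[str, Any]:
    """Shared query and aggregations, with the checked facets moved to post_filter."""
    facet_fields = ["color", "size"]  # fields from the online_shop mapping above
    post_clauses = [
        {"term": {field: {"value": value}}} for field, value in selected.items()
    ]
    body: Dict[str, Any] = {
        # The shared query scopes both the hits and the aggregations.
        "query": {"bool": {"must": [{"match": {"name": name_query}}]}},
        "aggs": {
            f"by_{field}": {"terms": {"field": field}} for field in facet_fields
        },
    }
    if post_clauses:
        # Applied after the aggregations are computed, so it narrows only the hits.
        body["post_filter"] = {"bool": {"must": post_clauses}}
    return body


body = build_faceted_search("abc", {"color": "red"})
```

Since the color filter never enters "query", the by_color buckets still list black, red and yellow while the hits contain only the red product.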

Is there a way to get related documents if a match occurs after the query?

I am currently doing a fuzzy name search on some documents. These documents can be related to each other (for example name field of one document may contain the name and another may contain the alias for the same person). I will give these documents the same unique identifier. My question is, can I get the documents with same unique identifier if a match occurs in any of them?
Suppose that there are 4 documents like this.
[
{
"name": "Bob",
"uid": "1"
},
{
"name": "Bilbo",
"uid": "1"
},
{
"name": "Jack",
"uid": "2"
},
{
"name": "Mary",
"uid": "3"
}
]
When I query name "Bob", I expect to get both documents with "uid" = "1"
[
{
"name": "Bob",
"uid": "1"
},
{
"name": "Bilbo",
"uid": "1"
}
]
Elasticsearch doesn't have a concept of joins, so documents cannot be fetched by joining on "uid". There are two ways to work around this:
1. Using two queries
i. Get documents with name "Bob"
{
"query": {
"term": {
"name.keyword": {
"value": "Bob"
}
}
}
}
ii. Fetch the documents that share the uid values returned by the query above.
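The two steps can be sketched in Python as request-body builders. The function names are hypothetical, and the second query assumes uid has a keyword sub-field (uid.keyword) so that a terms query can match it exactly.

```python
from typing import Any, Dict, List


def name_query(name: str) -> Dict[str, Any]:
    """Step i: exact match on the name."""
    return {"query": {"term": {"name.keyword": {"value": name}}}}


def uids_from_hits(response: Dict[str, Any]) -> List[str]:
    """Collect the distinct uid values from the hits of the first query."""
    uids: List[str] = []
    for hit in response["hits"]["hits"]:
        uid = hit["_source"]["uid"]
        if uid not in uids:
            uids.append(uid)
    return uids


def related_query(uids: List[str]) -> Dict[str, Any]:
    """Step ii: every document that shares one of those uid values."""
    return {"query": {"terms": {"uid.keyword": uids}}}


# Hand-written stand-in for the response to the first query.
first_response = {"hits": {"hits": [{"_source": {"name": "Bob", "uid": "1"}}]}}
second_body = related_query(uids_from_hits(first_response))
```

The cost of this approach is an extra round trip, but each query stays simple and the second one can be paged like any other search.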
2. Using terms and bucket selector aggregation
Mapping:
{
"<mapping_name>" : {
"mappings" : {
"properties" : {
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"uid" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
Query:
1. Create a bucket (collection) per uid.
2. Create a sub-bucket on name that includes only "Bob", so uid 1 will have a bucket with key Bob and uid 2 will be empty. The terms aggregation needs a non-analyzed field, hence name.keyword.
3. Use a bucket_selector aggregation to keep only the uid buckets where the count of name sub-buckets is greater than or equal to 1. This removes uid 2.
4. Use a top_hits aggregation to get the documents under each remaining uid.
{
"size": 0,
"aggs": {
"uid": {
"terms": {
"field": "uid.keyword",
"size": 10
},
"aggs": {
"documents": {
"top_hits": {
"size": 10
}
},
"name": {
"terms": {
"field": "name.keyword",
"include": "Bob",
"size": 10
}
},
"my_bucket": {
"bucket_selector": {
"buckets_path": {"count": "name._bucket_count"},
"script": "params.count >= 1"
}
}
}
}
}
}
Result: All documents with uid 1 (the same uid as "Bob") are returned
"aggregations" : {
"uid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 2,
"documents" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index61",
"_type" : "_doc",
"_id" : "uCP1-nAB_Wo5RvhlZM6k",
"_score" : 1.0,
"_source" : {
"name" : "Bob",
"uid" : "1"
}
},
{
"_index" : "index61",
"_type" : "_doc",
"_id" : "uSP1-nAB_Wo5Rvhlbc4S",
"_score" : 1.0,
"_source" : {
"name" : "Bilbo",
"uid" : "1"
}
}
]
}
},
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Bob",
"doc_count" : 1
}
]
}
}
]
}
}
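Since the documents come back nested inside the aggregation buckets rather than in the usual hits list, the client has to unpack them. A minimal Python sketch, assuming the aggregation names used above ("uid" and "documents"); the helper name documents_by_uid is hypothetical and the sample is a trimmed copy of the result.

```python
from typing import Any, Dict, List


def documents_by_uid(aggregations: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Return {uid: [source documents]} from the uid buckets and their top_hits."""
    return {
        bucket["key"]: [hit["_source"] for hit in bucket["documents"]["hits"]["hits"]]
        for bucket in aggregations["uid"]["buckets"]
    }


# Trimmed copy of the aggregation result shown above.
sample_aggregations = {
    "uid": {
        "buckets": [
            {
                "key": "1",
                "doc_count": 2,
                "documents": {"hits": {"hits": [
                    {"_source": {"name": "Bob", "uid": "1"}},
                    {"_source": {"name": "Bilbo", "uid": "1"}},
                ]}},
            }
        ]
    }
}

related = documents_by_uid(sample_aggregations)
```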

Elasticsearch is not returning a document I expect in the search results

I have a collection of customers that have a first name, last name, email, description and owner id. I want to take a character string from the app and search on all the fields, with a priority order. I'm using boost to achieve that.
Currently I have a lot of test customers with the name Sean in various fields within the documents. I have 2 documents that contain an email with sean.jones#email.com. One document contains the same email in the description.
When I perform the following search, I'm missing the document in the search results that does not contain the email in the description.
Here is my query:
{
"query" : {
"bool" : {
"filter" : {
"match" : {
"ownerId" : "acct_123"
}
},
"must" : [
{
"bool" : {
"should" : [
{
"prefix" : {
"firstName" : {
"value" : "sean",
"boost" : 3
}
}
},
{
"prefix" : {
"lastName" : {
"value" : "sean",
"boost" : 3
}
}
},
{
"terms" : {
"boost" : 2,
"description" : [
"sean"
]
}
},
{
"prefix" : {
"email" : {
"value" : "sean",
"boost" : 1
}
}
}
]
}
}
]
}
}
}
Here is the document that I'm missing:
{
"_index" : "xxx",
"_id" : "cus_123",
"_version" : 1,
"_type" : "customers",
"_seq_no" : 9096,
"_primary_term" : 1,
"found" : true,
"_source" : {
"firstName" : null,
"id" : "cus_123",
"lastName" : null,
"email" : "sean.jones#email.com",
"ownerId" : "acct_123",
"description" : null
}
}
When I look at the current results, all of the documents have a score of 3.0. They have "Sean" in the name as well, so they score higher. When I do an _explain on the document I'm missing, with the query above, I get the following:
{
"_index": "xxx",
"_type": "customers",
"_id": "cus_123",
"matched": true,
"explanation": {
"value": 1.0,
"description": "sum of:",
"details": [
{
"value": 1.0,
"description": "sum of:",
"details": [
{
"value": 1.0,
"description": "ConstantScore(email._index_prefix:sean)",
"details": []
}
]
},
{
"value": 0.0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0.0,
"description": "# clause",
"details": []
},
{
"value": 1.0,
"description": "ownerId:acct_123",
"details": []
}
]
}
]
}
}
Here are my mappings:
{
"properties": {
"firstName": {
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"email": {
"analyzer": "my_email_analyzer",
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"lastName": {
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"description": {
"type": "text"
},
"ownerId": {
"type": "text"
}
}
}
"my_email_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email"
}
If I'm understanding this correctly, because this document only scores 1, it's not meeting a particular threshold. I've tried adjusting min_score but had no luck. Any thoughts on how I can get this document to be included in the search results?
Thanks so much.
It depends on what you mean by "missing":
1. Is it that the document is not counted in the number of hits (the "total")?
2. Or is it that the document itself does not show up as a hit in the hits list?
If it's #2, you may want to increase the number of documents Elasticsearch fetches and returns by adding a size clause to your search request (the default size is 10):
Example
"size": 50
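To see why a matching document can still be absent from the hits list, here is a small Python simulation of score-ordered paging. It does not call Elasticsearch; the number of higher-scoring customers (12) and their ids are made up for the example, while cus_123 and the scores 3.0 and 1.0 come from the question.

```python
from typing import Dict, List


def first_page(scores: Dict[str, float], size: int = 10) -> List[str]:
    """Return the ids of the top `size` documents, ordered by descending score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:size]]


# Hypothetical: 12 customers with "Sean" in a name field, each scoring 3.0.
scores = {f"cus_name_{i}": 3.0 for i in range(12)}
# The document from the question matches only on email and scores 1.0.
scores["cus_123"] = 1.0

default_page = first_page(scores)            # size defaults to 10
larger_page = first_page(scores, size=50)
```

With the default size the lower-scoring document is counted in the total but falls past the returned page; with "size": 50 it is included.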

Elasticsearch: Influence scoring with custom score field in document pt.2

Having these documents:
{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
and
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
I want the _score to be calculated from the confidence value of each tag. For example, if you search "mountain" it should obviously return only the doc with id 2. If you search "landscape", the score of doc 2 should be higher than that of doc 1, as the confidence of landscape in 2 is higher than in 1 (48.36 vs 33.66). If you search for "coast landscape", this time the score of doc 1 should be higher than that of doc 2, because doc 1 has both coast and landscape in the tags array. I also want to multiply the score by "boost_multiplier" to boost some documents against others.
I found this question on SO: Elasticsearch: Influence scoring with custom score field in document
But when I tried the accepted solution (I enabled scripting on my ES server), it returned both documents with a _score of 1.0, regardless of the search term. Here is the query that I tried:
{
"query": {
"nested": {
"path": "tags",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"tags.tag": "coast landscape"
}
},
"script_score": {
"script": "doc[\"confidence\"].value"
}
}
}
}
}
}
I also tried what @yahermann suggested in the comments, replacing "script_score" with "field_value_factor" : { "field" : "confidence" }, with the same result. Any idea why it fails, or is there a better way to do it?
Just to give the complete picture, here is the mapping definition that I've used:
{
"mappings": {
"photo": {
"properties": {
"created_at": {
"type": "date"
},
"description": {
"type": "text"
},
"height": {
"type": "short"
},
"id": {
"type": "keyword"
},
"tags": {
"type": "nested",
"properties": {
"tag": { "type": "string" },
"confidence": { "type": "float"}
}
},
"width": {
"type": "short"
},
"color": {
"type": "string"
},
"boost_multiplier": {
"type": "float"
}
}
}
},
"settings": {
"number_of_shards": 1
}
}
UPDATE
Following the answer of @Joanna below, I tried the query, but whatever I put in the match query (coast, foo, bar), it always returns both documents with a _score of 1.0. I tried it on Elasticsearch 2.4.6, 5.3 and 5.5.1 in Docker. Here is the response I get:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635
{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
}]}}
UPDATE-2
I found this one on SO: Elasticsearch: "function_score" with "boost_mode":"replace" ignores function score
It basically says that if the function doesn't match, it returns 1. That makes sense, but I'm running the query against the same docs. That's confusing.
FINAL UPDATE
Finally I found the problem, stupid me. ES 101: if you send a GET request to the search API, it returns all documents with score 1.0 :) You should send a POST request... Thanks a lot @Joanna, it works perfectly!!!
You may try this query - it combines scoring with both: confidence and boost_multiplier fields:
{
"query": {
"function_score": {
"query": {
"bool": {
"should": [{
"nested": {
"path": "tags",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"tags.tag": "landscape"
}
},
"field_value_factor": {
"field": "tags.confidence",
"factor": 1,
"missing": 0
}
}
}
}
}]
}
},
"field_value_factor": {
"field": "boost_multiplier",
"factor": 1,
"missing": 0
}
}
}
}
When I search with coast term - it returns:
document with id=1 as only this one has this term, and the scoring is "_score": 100.27469.
When I search with landscape term - it returns two documents:
document with id=2 and scoring "_score": 85.83046
document with id=1 and scoring "_score": 59.7339
As document with id=2 has higher value of confidence field, it gets higher scoring.
When I search with coast landscape term - it returns two documents:
document with id=1 and scoring "_score": 160.00859
document with id=2 and scoring "_score": 85.83046
Although document with id=2 has higher value of confidence field, document with id=1 has both matching words so it gets much higher scoring. By changing the value of "factor": 1 parameter, you can decide how much confidence should influence the results.
boost_multiplier field
More interesting thing happens when I index a new document: let's say it is almost the same as document with id=2 but I set "boost_multiplier" : 4 and "id": 3:
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "3",
"tags" : [
...
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
...
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 4
}
Running the same query with coast landscape term returns three documents:
document with id=3 and scoring "_score": 360.02664
document with id=1 and scoring "_score": 182.09859
document with id=2 and scoring "_score": 90.00666
Although document with id=3 has only one matching word (landscape), its boost_multiplier value considerably increased the scoring. Here, with "factor": 1, you can also decide how much this value should increase scoring and with "missing": 0 decide what should happen if no such field is indexed.
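The ranking behaviour described above can be reproduced with a simplified Python model of the query: each matching nested tag contributes its text score multiplied by its confidence, the contributions are summed ("score_mode": "sum"), and the total is multiplied by boost_multiplier. The constant text score of 1.0 and exact tag equality are assumptions of this sketch; Elasticsearch computes a real relevance score per tag, so the absolute numbers differ from the ones reported above.

```python
from typing import Any, Dict, Set


def model_score(doc: Dict[str, Any], terms: Set[str], text_score: float = 1.0) -> float:
    """Sum of text_score * confidence over matching tags, times boost_multiplier."""
    nested_sum = sum(
        text_score * tag["confidence"]
        for tag in doc["tags"]
        if tag["tag"] in terms  # simplification: whole-tag equality
    )
    return nested_sum * doc["boost_multiplier"]


# Documents reduced to the tags relevant for the searches discussed above.
doc1 = {"boost_multiplier": 1, "tags": [
    {"tag": "coast", "confidence": 43.58207236617374},
    {"tag": "landscape", "confidence": 33.660057321079655},
]}
doc2 = {"boost_multiplier": 1, "tags": [
    {"tag": "landscape", "confidence": 48.36547551196872},
]}
doc3 = {"boost_multiplier": 4, "tags": [
    {"tag": "landscape", "confidence": 48.36547551196872},
]}

docs = {"1": doc1, "2": doc2, "3": doc3}
terms = {"coast", "landscape"}
ranking = sorted(docs, key=lambda doc_id: model_score(docs[doc_id], terms), reverse=True)
```

Even in this simplified model the order for "coast landscape" is 3, 1, 2, which matches the order reported above: the multiplier of 4 outweighs the second matching tag of document 1.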

How can I get element at a particular index in elasticsearch?

I have stored three JSON objects in Elasticsearch; each object has a name and a projects array.
{"name": "haris","projects": [{"title": "Splunk"},{"title": "QRadar"},{"title": "LogAnalysis"}]}
{"name": "khalid","projects": [{"title": "MS"},{"title": "Google"},{"title": "Apple"}]}
{"name": "Hamid","projects": [{"title": "Toyota"},{"title": "Honda"},{"title": "Kia"}]}
I have written a query to extract a particular object by _id and its specific property projects
curl -XGET 'localhost:9200/jsontest/_search?pretty' -d '{"query" : { "match" : {"_id":"AV1kzzZqAzHWQ2S7B8f1"} }, "_source": ["projects"]}'
As expected it returns projects object
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "jsontest",
"_type" : "json",
"_id" : "AV1kzzZqAzHWQ2S7B8f1",
"_score" : 1.0,
"_source" : {
"projects" : [{"title" : "Splunk"},{"title" : "QRadar"},{"title" : "LogAnalysis"}
]
}
}
]
}
}
Question: is there a way to retrieve the value at a particular index of projects? This is dummy data; in my real scenario projects can have a large number of elements, and each element is itself a JSON object with a lot of properties. I only need to retrieve the value at a certain index of projects.
Here is what I would do.
First, the mapping
PUT test/my_objects/_mapping
{
"properties": {
"name":{
"type": "string",
"index": "not_analyzed"
},
"projects": {
"type": "nested"
}
}
}
Second, the projects are indexed
PUT test/my_objects/1111
{
"name": "haris",
"projects": [
{"title": "Splunk"},
{"title": "QRadar"},
{"title": "LogAnalysis"}
]
}
Finally the aggregation query
GET test/my_objects/_search
{
"aggs": {
"by_name": {
"terms": {
"field": "name"
},
"aggs": {
"by_project": {
"nested": {
"path": "projects"
},
"aggs": {
"by_title": {
"terms": {
"field": "projects.title"
}
}
}
}
}
}
}
}
It's not tested and a bit tedious because of the nested aggs, but it should work if you adapt it further to your requirements.
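For reading the result of such a request, here is a minimal Python sketch that walks the by_name, by_project and by_title levels. The helper name titles_by_name is hypothetical and the sample response is hand-written for illustration, not captured output.

```python
from typing import Any, Dict, List


def titles_by_name(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Return {name: [project title keys]} from the nested aggregation result."""
    return {
        name_bucket["key"]: [
            title_bucket["key"]
            for title_bucket in name_bucket["by_project"]["by_title"]["buckets"]
        ]
        for name_bucket in response["aggregations"]["by_name"]["buckets"]
    }


# Hypothetical response shape for the aggregation above.
sample_response = {
    "aggregations": {
        "by_name": {"buckets": [
            {
                "key": "haris",
                "doc_count": 1,
                "by_project": {
                    "doc_count": 3,
                    "by_title": {"buckets": [
                        {"key": "loganalysis", "doc_count": 1},
                        {"key": "qradar", "doc_count": 1},
                        {"key": "splunk", "doc_count": 1},
                    ]},
                },
            }
        ]}
    }
}

titles = titles_by_name(sample_response)
```

Keep in mind that terms buckets are ordered by document count by default, not by position in the projects array, so this result does not by itself identify the element at a given array index.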
