Geospatial marker clustering with Elasticsearch

I have several hundred thousand documents in an elasticsearch index with associated latitudes and longitudes (stored as geo_point types). I would like to be able to create a map visualization that looks something like this: http://leaflet.github.io/Leaflet.markercluster/example/marker-clustering-realworld.388.html
So, I think what I want is to run a query with a bounding box (i.e., the map boundaries that the user is looking at) and return a summary of the clusters within this bounding box. Is there a good way to accomplish this in elasticsearch? A new indexing strategy perhaps? Something like geohashes could work, but it would cluster things into a rectangular grid, rather than the arbitrary polygons based on point density as seen in the example above.
@kumetix - Good question. I'm responding to your comment here because the text was too long to fit in another comment. The geohash_precision setting dictates the maximum precision at which a geohash aggregation on that field will be able to return results. For example, if geohash_precision is set to 8, we can run a geohash aggregation on that field with at most precision 8. This would, according to the reference, return results grouped in geohash boxes of roughly 38.2m x 19m. A precision of 7 or 8 would probably be accurate enough for a web-based heatmap like the one I mentioned in the example above.
As far as how geohash_precision affects the cluster internals, I'm guessing the setting stores a geohash string of length <= geohash_precision inside the geo_point. Let's say we have a point at the Statue of Liberty: 40.6892,-74.0444. The full 12-character geohash for this is dr5r7p4xb2ts. Setting geohash_precision in the geo_point to 8 would internally store the strings:
d
dr
dr5
dr5r
dr5r7
dr5r7p
dr5r7p4
dr5r7p4x
and a geohash_precision of 12 would additionally internally store the strings:
dr5r7p4xb
dr5r7p4xb2
dr5r7p4xb2t
dr5r7p4xb2ts
resulting in a little more storage overhead for each geo_point. Setting geohash_precision to a distance value (1km, 1m, etc.) probably just stores the geohash at the closest corresponding string-length precision.
Note: here is how to calculate geohashes using Python:
$ pip install python-geohash
>>> import geohash
>>> geohash.encode(40.6892,-74.0444)
'dr5r7p4xb2ts'
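Those internally stored prefixes are just truncations of the full geohash, so you can reproduce the list above in the same session:
>>> gh = geohash.encode(40.6892, -74.0444)
>>> [gh[:n] for n in range(1, 9)]  # what geohash_precision: 8 would store
['d', 'dr', 'dr5', 'dr5r', 'dr5r7', 'dr5r7p', 'dr5r7p4', 'dr5r7p4x']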

In Elasticsearch 1.0, you can use the new Geohash Grid aggregation.
Something like geohashes could work, but it would cluster things into a rectangular grid, rather than the arbitrary polygons based on point density as seen in the example above.
This is true, but the geohash grid aggregation handles sparse data well, so all you need is enough points on your grid and you can achieve something pretty similar to the example in that map.
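For example, here's a sketch of what that could look like combined with the question's bounding-box filtering (index, type, and field names are assumptions on my part; precision controls the geohash cell size):
GET /things/thing/_search
{
  "size": 0,
  "aggs": {
    "viewport": {
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left": { "lat": 45.27, "lon": -34.45 },
            "bottom_right": { "lat": -35.31, "lon": 1.84 }
          }
        }
      },
      "aggs": {
        "clusters": {
          "geohash_grid": {
            "field": "location",
            "precision": 7
          }
        }
      }
    }
  }
}
Each bucket comes back with a geohash key and a doc_count, which you can render as a cluster marker at the cell's center.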

Try this:
https://github.com/triforkams/geohash-facet
We have been using it to do server-side clustering and it's pretty good.
Example query:
GET /things/thing/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "Location": {
            "top_left": {
              "lat": 45.274886437048941,
              "lon": -34.453125
            },
            "bottom_right": {
              "lat": -35.317366329237856,
              "lon": 1.845703125
            }
          }
        }
      }
    }
  },
  "facets": {
    "places": {
      "geohash": {
        "field": "Location",
        "factor": 0.85
      }
    }
  }
}

Related

How can I improve/make stronger text fuzzy searching in Elasticsearch?

Below is my setup. I am inserting users into Elasticsearch and doing weighted fuzzy username searches. The problem is that the fuzziness could be... fuzzier? Let me show you what I mean. This is my mapping:
{
  "mappings": {
    "properties": {
      "user_id": {
        "enabled": false
      },
      "username": {
        "type": "text"
      },
      "d_likes": {
        "type": "rank_feature"
      }
    }
  }
}
And I am inserting 2 users:
user_id: random, username: pietje, d_likes: 3
user_id: random, username: p13tje, d_likes: 30
Now the problem is that I need to write a lot of characters in the username field to get hits. This is how I search:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "username": {
              "query": "piet",
              "fuzziness": "auto"
            }
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "d_likes"
          }
        }
      ]
    }
  }
}
'piet' gives no results. That looks strange to me; I was hoping I would actually see both p13tje and pietje (in that order) because they are so similar. When my search query is pietj, I only get pietje and not p13tje.
So I was wondering: how can I get more hits out of the fuzzy search? I want autocompletion for usernames, and this is a pretty bad user experience, because it only gives autocompletion once you have filled in most of the characters. I just want the search to be looser and give more results.
From the Elasticsearch documentation:
When querying text or keyword fields, fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string.
The Levenshtein Edit Distance essentially is a way of measuring the difference between 2 string values.
You've set the fuzziness parameter to AUTO, which is a great default decision. However, for some short strings like yours, it can prove to be not as fuzzy as you'd want it to be.
This is because ElasticSearch (ES) will generate an edit distance based on the length of the string, which will determine how many edits away the string in the index is from your search query.
You haven't specified any low or high values, so for piet, a 4-character string, only one edit will be allowed.
pietje is actually two edits away - piet needs a j as well as an e so it won't show up.
p13tje is actually four edits away - it needs a j, an e, a change from 1 to i & a change from 3 to e so it also won't show up.
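You can verify these edit counts with a quick script. Here's a minimal Levenshtein implementation of my own, purely for illustration (Lucene itself uses Damerau-Levenshtein automata, which also count transpositions as single edits, though that makes no difference for these two examples):
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("piet", "pietje"))   # 2
print(levenshtein("piet", "p13tje"))   # 4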
The maximum allowed Levenshtein edit distance for ES fuzzy searching is 2 (larger distances are far more expensive to compute efficiently and are not processed by Lucene, the search engine ES is built on), so to fix this, set fuzziness to 2 manually.
"match": {
"username": {
"query": "piet",
"fuzziness": "2"
}
}
Hopefully, that will at least allow pietje to show up in the search and possibly even p13tje depending on if there are any other matches or not.
Instead of manually setting it to 2, you could also set the low and high distance arguments for AUTO; however, that will generate worse results (the format is AUTO:[low],[high], e.g. AUTO:15,30).
For example, with a low of 8 and a high of 20:
Usernames with a character length of 7 or lower will not get any fuzzy matching, as they will have to match exactly
Usernames with a character length between 8 & 19 will only be allowed 1 edit
Usernames with a character length of 20 or higher will be allowed 2 edits
You can try tweaking the low and high values if you'd like, but for the... fuzziest fuzziness, set the edit distance to the maximum allowed Levenshtein edit distance (2).
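If you do experiment with AUTO's thresholds, you'd need to lower them rather than raise them. A hypothetical example (values purely illustrative) that still grants piet, a 4-character string, the full 2 edits:
"match": {
  "username": {
    "query": "piet",
    "fuzziness": "AUTO:2,4"
  }
}
With AUTO:2,4, 1-character terms must match exactly, 2-3 character terms are allowed 1 edit, and terms of 4 characters or more are allowed 2 edits.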

Solr spatial search using an indexed field for radius?

So I have an index of cities, looks something like this:
{
  "location": "41.388587, 2.175888",
  "name": "BARCELONA",
  "radius": 20
}
We have a few dozen of these. I need to be able to query this index with a single lat/lng combination and see if it falls inside one of our "cities".
The location property is the centre of the city, and the radius is the radius of the city in km (assuming all the cities are circles). We can also assume no cities overlap.
How can I return whether or not a lat/lng combination falls within a city?
For example, given the point 40.419691, -3.701254, how can I determine if this falls within BARCELONA?
You can do it easily in Lucene, Solr, or ES.
In Solr, for example:
declare a field type of SpatialRecursivePrefixTreeFieldType. This allows you to index different shapes, not just points
using the lat/long and the radius, create a specific circle for each city, and index that shape in a field called 'shape', for example:
{
  "location": "41.388587, 2.175888",
  "name": "BARCELONA",
  "shape": "CIRCLE (2.175888 41.388587, 20)"
}
then you just query for any doc that intersects with your point (untested):
fq=shape:"Intersects(40.419691 -3.701254)"
Be sure to check the docs and javadocs for the specific version of Lucene/Solr/ES you are using, as APIs have been changing in this space
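Since the same approach works in ES, here is a rough sketch of the Elasticsearch equivalent using a geo_shape field (index, type, and field names are my own illustrations; circle is an Elasticsearch-supported extension shape, with the radius given alongside GeoJSON-style lon/lat coordinates):
PUT /cities
{
  "mappings": {
    "city": {
      "properties": {
        "name": { "type": "string" },
        "shape": { "type": "geo_shape" }
      }
    }
  }
}

PUT /cities/city/1
{
  "name": "BARCELONA",
  "shape": {
    "type": "circle",
    "coordinates": [2.175888, 41.388587],
    "radius": "20km"
  }
}

GET /cities/city/_search
{
  "query": {
    "geo_shape": {
      "shape": {
        "shape": {
          "type": "point",
          "coordinates": [-3.701254, 40.419691]
        }
      }
    }
  }
}
(The nesting only reads oddly because the field itself is named shape; the outer key is the field name, the inner shape is the query geometry.)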

Scoring documents by both textual match and distance to a point

I have an ElasticSearch index with a list of "shops".
I'd like to allow customers to search these shops by both geo_distance (so, search for a point and get shops near that location), and textual match, like matches on shop name / address.
I'd like to get results that match either of these two criteria, and I'd like the order of these results to be a combination of both. The stronger the textual match, and the closer to the point searched, the higher the result. (Obviously, there's going to be a formula to combine these two, that'll need tweaking, not too worried about that part yet).
My issue / what I've tried:
geo_distance is a filter, not a query, so I can't combine both on the query part of the request.
I can use a bool => should filter (rather than query) that matches on either name or location. This gives me the results I want, but not in order.
I can also have _geo_distance as part of a sort clause so that documents closer to the point rank higher.
What I haven't figured out is how I would take the "regular" _score that ElasticSearch gives to documents when doing textual matches, and combine that with the geo_distance score.
By having the textual match in the filter, it doesn't seem to affect the score of documents (which makes sense). And I don't see how I could combine the textual match in the query part and a geo_distance filter so it's an OR rather than an AND.
I guess my best bet would be the equivalent of this:
{
  function_score: {
    query: { ... },
    functions: [
      { geo_distance function },
      { multi_match_result score }
    ],
    score_mode: 'multiply'
  }
}
but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible.
Any pointers will be greatly appreciated.
I'm working with ElasticSearch v1.4, but I can upgrade if necessary.
but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible.
You can't really do it in the way that you're asking, but you can do what you want just as easily. For the simpler case, you get scoring just by using a normal query.
The problem with filters is that they're yes/no questions, so if you use them in a function_score, then it either boosts the score or it doesn't. What you probably want is degradation of the score as the distance from the origin grows. It's the yes/no nature that stops them from impacting the score at all. There's no improvement to relevancy implied by matching a filter -- it just means that it's part of the answer, but it doesn't make sense to say that it should be closer to the top/bottom as a result.
This is where the decay function score helps. It works with numbers, dates, and -- most helpfully here -- geo_points. In addition to the types of data it accepts, it can decay using gaussian, exponential, or linear decay functions. Which one you choose is honestly arbitrary; pick whichever gives the best "experience". I would suggest starting with gauss.
"function_score": {
"functions": [
"gauss": {
"my_geo_point_field": {
"origin": "0, 1",
"scale": "5km",
"offset": "500m",
"decay": 0.5
}
}
]
}
Note that when origin is given as an array it is in x, y format (standard GeoJSON order: longitude, latitude), but as a string -- as above -- it is "lat, lon".
Each one of the values impacts how the score decays, per the decay graph in the documentation. If you used an offset of 0, then the score would begin to drop as soon as the document is not exactly at the origin. The offset gives it some buffer within which everything is considered just as good.
The scale is directly associated with the decay in that the score will be chopped down by the decay value once it is scale-distance away from the origin (+/- the offset). In my example above, anything 5km from the origin would get half the score of anything at the origin.
Again, just note that the different types of decay functions change the shape of scoring.
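To make those numbers concrete, here's a small reimplementation of the documented gauss formula (my own sketch for intuition, not code from ES), using the example's scale of 5km, offset of 500m, and decay of 0.5:
import math

def gauss_multiplier(distance_km, scale_km=5.0, offset_km=0.5, decay=0.5):
    # sigma^2 is chosen so the multiplier equals `decay` once you are
    # `scale` beyond the offset, per the function_score documentation.
    adjusted = max(0.0, distance_km - offset_km)
    sigma_sq = -scale_km ** 2 / (2.0 * math.log(decay))
    return math.exp(-adjusted ** 2 / (2.0 * sigma_sq))

for d in (0.0, 0.5, 3.0, 5.5, 10.5):
    print(f"{d:4.1f} km -> x{gauss_multiplier(d):.3f}")
This prints multipliers of roughly 1.0, 1.0, 0.84, 0.5, and 0.06: full score inside the offset, exactly the decay value at offset + scale (5.5km), and a rapid fall-off beyond.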
I'd like the order of these results to be a combination of both.
This is the purpose of the bool / should compound query. You get OR behavior with score improvement based on each match. Combining this with the above, you'd want something like:
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": { ... }
        },
        {
          "function_score": {
            "functions": [
              {
                "gauss": {
                  "my_geo_point_field": {
                    "origin": "0, 1",
                    "scale": "5km",
                    "offset": "500m",
                    "decay": 0.5
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
NOTE: If you add a must, then the should behavior changes from literal OR-like behavior (at least 1 must match) to completely optional behavior (none must match).
I'm working with ElasticSearch v1.4, but I can upgrade if necessary.
Starting with Elasticsearch 2.0, every filter is a query and every query is also a filter. The only difference is the context that it's used in. This doesn't change my answer here, but it's something that may help you in the future in addition to what I say next.
Geo-related performance increased dramatically in ES 2.2+. You should upgrade (and recreate your geo-related indices) to take advantage of those changes. ES 5.0 will have similar benefits!

Calculate change rate of Time Series values

I have an application which writes Time Series data to Elasticsearch. The (simplified) data looks like the following:
{
  "timestamp": 1425369600000,
  "shares": 12271
},
{
  "timestamp": 1425370200000,
  "shares": 12575
},
{
  "timestamp": 1425370800000,
  "shares": 12725
},
...
I now would like to use an aggregation to calculate the change rate of the shares field by time "buckets". For example, the change rate of the share values within the last 10-minute bucket could IMHO be calculated as
# of shares t1
--------------
# of shares t0
I tried the Date Histogram aggregation, but I guess that's not what I need to calculate the change rates, because this would only give me the doc_count, and it's not clear to me how I could calculate the change rate from these:
{
  "aggs": {
    "shares_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "10m"
      }
    }
  }
}
Is there a way to achieve my goal with aggregations within Elasticsearch? I searched the docs but didn't find a matching method.
Thanks a lot for any help!
I think it is hard to achieve with out-of-the-box aggregate functions. However, you can take a look at percentile_ranks_aggregation and add your own modifications to the script to create point-in-time rates.
Also, sorry for going off-topic, but I wonder: is Elasticsearch the best fit for this kind of thing? As I understand it, at any given point in time you only need the previous sample's data to calculate the correct rate for the current sample. This sounds to me like a better fit for a real-time sliding-window implementation (even on a relational DB like Postgres), where you keep a fixed number of time buckets and, inside each bucket, the counters you are interested in. Once a new sample arrives, you update (slide) the window and calculate the updated rate for the most recent time bucket.
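To illustrate the client-side route against the aggregation from the question, here's a minimal sketch; it assumes you add a hypothetical max sub-aggregation named shares_max to the date_histogram so each bucket carries its latest share count:
# Buckets as they might come back in the date_histogram response;
# shares_max is an assumed `max` sub-aggregation on the `shares` field.
buckets = [
    {"key": 1425369600000, "shares_max": {"value": 12271.0}},
    {"key": 1425370200000, "shares_max": {"value": 12575.0}},
    {"key": 1425370800000, "shares_max": {"value": 12725.0}},
]

# Change rate of each bucket relative to the previous one: shares_t1 / shares_t0
for prev, curr in zip(buckets, buckets[1:]):
    rate = curr["shares_max"]["value"] / prev["shares_max"]["value"]
    print(curr["key"], round(rate, 4))  # e.g. 12575 / 12271 -> 1.0248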

ElasticSearch Custom Scoring with Arrays

Could anyone advise me on how to do custom scoring in ElasticSearch when matching an array of keywords against an array of keywords?
For example, let's say there is an array of keywords in each document, like so:
{ // doc 1
  "keywords": {
    "red":    { "weight": 1 },
    "green":  { "weight": 2.0 },
    "blue":   { "weight": 3.0 },
    "yellow": { "weight": 4.3 }
  }
},
{ // doc 2
  "keywords": {
    "red":   { "weight": 1.9 },
    "pink":  { "weight": 7.2 },
    "white": { "weight": 3.1 }
  }
},
},
...
And I want to get a score for each document based on a search that matches keywords against this array:
{
  "keywords": {
    "red":  { "weight": 2.2 },
    "blue": { "weight": 3.3 }
  }
}
But instead of just determining whether they match, I want to use a very specific scoring algorithm:
Scoring a single field is easy enough, but I don't know how to manage it with arrays. Any thoughts?
Ah an interesting question! (And one I think we can solve with some communication)
Firstly, have you looked at custom script scoring? I'm pretty sure you can do this (slowly) with that. If you were to go that route, I would consider a rescore phase, where the score is only calculated after the doc is known to be a hit.
However, I think you can do this with built-in Elasticsearch machinery. As far as I can work out, you are doing a dot product between docs (where the weights are actually halfway between what you are specifying and 1).
So, my first suggestion: remove the x/2n term from your "custom scoring" (dot product) and put your weights halfway between 1 and the custom weight (e.g. 1.9 => 1.45).
... I'm sorry, I will have to come back and edit this answer. I was thinking about using nested docs with a field-defined boost level, but alas, the _boost mapping parameter is only available for the root doc.
P.S. Just had a thought: you could have fields with defined boost levels and store the terms there. Then you can do this easily, but you lose precision. A doc would then look like:
{
  "boost_1": ["aquamarine"],
  "boost_2": null, // don't need to send this, just showing for clarity
  ...
  "boost_5": ["burgundy", "fuschia"]
  ...
}
You could then define these boostings in your mapping. One thing to note is that a field's boost value carries over to the _all field, so you would now have a bag of weighted terms in your _all field. You could then construct a bool/should query with lots of term queries at different boosts (for the weights of the second doc), as sketched below.
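A rough sketch of what that query could look like for the query-side weights from the question (red: 2.2, blue: 3.3), querying the _all field so the index-side field boosts apply (exact field names and weights here are illustrative):
{
  "query": {
    "bool": {
      "should": [
        { "term": { "_all": { "value": "red",  "boost": 2.2 } } },
        { "term": { "_all": { "value": "blue", "boost": 3.3 } } }
      ]
    }
  }
}
Each match then contributes roughly (index-side field boost) x (query-side boost) to the score, approximating the dot product.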
Let me know what you think! A very, very interesting question.
