How do I architect my Elasticsearch indexes for rapidly changing data?

I have two models in my MySQL database: Users and Posts.
Users have geolocation attributes (lat/long).
Posts simply have a body of text.
I want to use Elasticsearch to find all posts that match a string of text plus use the user's location as a filter. The problem is -- the user's location always changes (as people walk around the city). I will be frequently updating the lat/long of each user.
This is my current solution:
Index the posts and have a geolocation attribute in each
document. When a user changes location, run an elasticsearch batch
update on all that user's posts, and modify the geolocation attribute
on those documents.
Obviously this is not a scalable solution -- what if the user has 2000 posts and walks around the city? I'd have to update 2000 documents every minute.
Is there a way to "relationally map" the posts to the user's object and use it as a filter, so that when the location changes, I only need to update that user's object instead of all his posts?

Updating 2000 posts per minute is not a big deal either with the update by query plugin or with the upcoming reindex API. However, if you have many users with many posts and you need to update them in short intervals (e.g. 1 min), it might not be that scalable, indeed. Say if it takes 500 milliseconds to update all posts from a user, you'd start to lag behind at around 120 users.
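As a rough back-of-the-envelope check of the 120-user figure, here is a minimal sketch. It assumes the updates run strictly one after another and that every user's batch takes the same fixed time; both are simplifications, and the function name is made up for illustration.

```python
def max_users_per_interval(interval_seconds: float, seconds_per_user: float) -> int:
    """Number of users whose posts can be fully updated, one user after
    another, before the next update interval starts."""
    return int(interval_seconds // seconds_per_user)


# One-minute interval, 500 ms to update all posts of a single user.
print(max_users_per_interval(60, 0.5))
```

Beyond that number of users, each round of updates would take longer than the interval and the backlog would keep growing.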
Clearly, since the users' posts need to "follow" the user and don't keep the location the user had when she posted them, I would first query the users around a given location and get their IDs, and then run a second query on posts filtered by those user IDs and the matching body text.
It is perfectly OK to keep both of your indices simple and only update the location in a single user's document every minute. The two queries I'm suggesting should be quite fast, and you should not be worried about running them. People are often worried when they need to run two or more queries in order to find their results. Sometimes, tying the documents too tightly together is not the solution, and simply running two queries over two indices is the key and works perfectly well.
The query to retrieve users would look similar to the first one below, where you only retrieve the _id property of the user. I'm making the assumption that your user documents have the id of the user as their ES doc _id, so you do not have to retrieve the _source at all (i.e. "_source": false) which is even faster and you can simply return the _id with response filtering:
POST /users/_search?filter_path=hits.hits._id
{
  "size": 1000,
  "_source": false,
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "100m",
            "location": {
              "lat": 32.5362723,
              "lon": -80.3654783
            }
          }
        }
      ]
    }
  }
}
You'll get the _id values of all users who are currently within 100 meters of the desired geographic location. The next query then filters the posts by those IDs while matching their body text.
POST /posts/_search
{
  "size": 50,
  "query": {
    "bool": {
      "must": {
        "match": {
          "body": "some text"
        }
      },
      "filter": [
        {
          "terms": {
            "user_id": [ 1, 2, 3, 4 ]
          }
        }
      ]
    }
  }
}
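The glue between the two queries can be sketched in plain Python. This is only an illustration under a few assumptions: the helper names are made up, the request bodies mirror the two queries above, and actually sending them to Elasticsearch (with whatever client you use) is left out. The sample response is hand-written in the shape that filter_path=hits.hits._id returns.

```python
def build_nearby_users_query(lat, lon, distance="100m", size=1000):
    """Body for POST /users/_search?filter_path=hits.hits._id"""
    return {
        "size": size,
        "_source": False,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_distance": {
                            "distance": distance,
                            "location": {"lat": lat, "lon": lon},
                        }
                    }
                ]
            }
        },
    }


def extract_user_ids(users_response):
    """Collect the _id of every hit; a filtered response with no hits is just {}."""
    hits = users_response.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]


def build_posts_query(text, user_ids, size=50):
    """Body for POST /posts/_search, restricted to the given users."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {"match": {"body": text}},
                "filter": [{"terms": {"user_id": user_ids}}],
            }
        },
    }


# Hand-written stand-in for the first query's response.
sample_users_response = {"hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}}
user_ids = extract_user_ids(sample_users_response)
posts_query = build_posts_query("some text", user_ids)
print(posts_query["query"]["bool"]["filter"])
```

If the first query returns no users, the ID list is empty and the second query can be skipped entirely.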

Related

Most performant way to update a single document in Elasticsearch via an alias

I have an Elasticsearch setup with an alias that points to many indices. I need to update a single document, but I don't know which index it resides in.
There are two ways I can accomplish this as far as I can see:
_update_by_query:
POST my-alias/_update_by_query
{
  "query": {
    "terms": {
      "_id": ["my-id-to-update"]
    }
  },
  "script": {
    "source": "ctx._source['Field'] = 'new value'"
  }
}
read (which returns the specific index) then write:
GET my-alias/_search
{
  "query": {
    "terms": {
      "_id": ["my-id-to-update"]
    }
  }
}
POST my-index-returned-from-the-get/_update/my-id-to-update
{
  "doc": {
    "Field": "new value"
  }
}
Which method is more performant?
Which method is preferred?
Is there a better way than either of these two?
Both approaches will perform about the same, with one difference: the first approach sends only one request, whereas the second sends two. The first approach is therefore preferable, since it halves the number of API calls.
Also, in my opinion the first approach is much cleaner and fits the concept of Elasticsearch aliases better, because it hides the exact index name from your application: the application does not need to know which index your documents are in.
An important note about updating a document in Elasticsearch: documents are not updated in place. The old document is flagged as deleted and a new document is created (this is due to the Lucene implementation). The old document is only physically removed later, during Lucene segment merging.
You can find a good blog post about segment merging here.
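For comparison, the read-then-write variant from the question needs a small amount of client-side glue between its two requests. A minimal sketch, assuming hypothetical helper names and a hand-written search response (no Elasticsearch client is called here):

```python
def locate_index(search_response):
    """Return the concrete index of the first hit, or None if nothing matched."""
    hits = search_response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0]["_index"]


def build_update_request(index_name, doc_id, fields):
    """Path and body for the follow-up partial update request."""
    return f"/{index_name}/_update/{doc_id}", {"doc": fields}


# Hand-written stand-in for the response of GET my-alias/_search.
sample_response = {
    "hits": {"hits": [{"_index": "my-index-000002", "_id": "my-id-to-update"}]}
}
index_name = locate_index(sample_response)
if index_name is not None:
    path, body = build_update_request(index_name, "my-id-to-update", {"Field": "new value"})
    print(path, body)
```

This extra step, plus the second round trip, is what the single _update_by_query request avoids.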

Elastic Search Query the number of new vs returning users

Given a time range, I want to know how many users are new users and how many users are returning users.
My Elasticsearch index mapping contains the fields user_id, event_time, etc.
E.g., given a record (user_id: jack, event_time: 2019-10-31 00:00:00:000): if the user_id does not appear in the preceding 2 weeks (from 2019-10-17 00:00:00:000 to 2019-10-31 00:00:00:000), then we consider 'jack' a new user. Otherwise he is considered a returning user.
I was wondering whether Elasticsearch supports a query that can tell me the number of new users and returning users.
Thanks in advance!
I think you could use a range aggregation for that.
{
  "aggs": {
    "activity": {
      "range": {
        "field": "event_time",
        "ranges": [
          { "to": "now-2w/d" },
          { "from": "now-2w/d" }
        ]
      }
    }
  }
}
This should go through your whole index and create two buckets based on event time: one for everything older than two weeks, and another for everything newer than two weeks.
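To make the definition from the question concrete, here is a plain-Python sketch of the classification itself, independent of Elasticsearch. It assumes one reading of the question: a user counts as "new" if their first event inside the requested range has no earlier event of theirs in the 14 days before it. The function and variable names are made up for illustration.

```python
from datetime import datetime, timedelta


def classify_users(events, range_start, range_end, lookback=timedelta(days=14)):
    """events: iterable of (user_id, event_time) pairs.
    Returns (new_users, returning_users) as two sets."""
    times_by_user = {}
    for user_id, event_time in events:
        times_by_user.setdefault(user_id, []).append(event_time)

    new_users, returning_users = set(), set()
    for user_id, times in times_by_user.items():
        in_range = [t for t in times if range_start <= t < range_end]
        if not in_range:
            continue  # user was not active in the requested range
        first_seen = min(in_range)
        # Any earlier event inside the lookback window makes the user "returning".
        has_prior = any(first_seen - lookback <= t < first_seen for t in times)
        (returning_users if has_prior else new_users).add(user_id)
    return new_users, returning_users


events = [
    ("jack", datetime(2019, 10, 31, 0, 0)),
    ("jill", datetime(2019, 10, 20, 9, 0)),
    ("jill", datetime(2019, 10, 31, 12, 0)),
]
new, returning = classify_users(events, datetime(2019, 10, 31), datetime(2019, 11, 1))
print(len(new), len(returning))
```

Whatever query is used on the Elasticsearch side has to reproduce this per-user logic, counting users rather than individual events.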

Elasticsearch: Get sort index of the record

Let me describe my scenario with the real example.
I have a page where I need to show a list of companies sorted by the field "overallRank", with a few filters (like companyType and employeeSize).
Now, it's easy to get the results from the ES index for the filter and then sort them by overallRank. But I also want to know the rank of each company among all the company data, not only within the filtered result.
For example, Amazon is the 3rd company for location US and companyType=Private, but it is the 5th company in the US if we remove the companyType filter. While showing the result with the companyType filter, I want to know this overall ranking (i.e. 5th). Is it possible to include this field in the result somehow?
What I am currently doing is first getting the filtered result by companyType and location US, then getting the sorted result by location only. This second query gives the overall ranking within the location (where Amazon comes in 5th place). I then iterate over the first result and look up each company's position in the second result to determine its overall ranking.
The problem with this approach is that the second query, which determines the overall ranking across all company data, is very expensive because it has to retrieve around 60k results. With a batch size of 1000, it takes around 60 round trips to ES to load all the results into memory. That costs both time and space.
Can somebody please suggest a better way of doing this?
I think you can solve it using a filter aggregation with a top hits sub-aggregation.
As an example, you can do something like:
{
  "aggs": {
    "filtered_companies_by_us": {
      "filter": {
        "term": {
          "location": "US"
        }
      },
      "aggs": {
        "top_companies": {
          "top_hits": {
            "sort": [
              {
                "overallRank": {
                  "order": "desc"
                }
              }
            ],
            "size": 5
          }
        }
      }
    }
  }
}
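Whichever query produces the overall ordering, the lookup step described in the question (finding each filtered company's position in the overall list) can be done with a single dictionary instead of a nested scan. A minimal sketch with made-up company names and a hypothetical helper name:

```python
def overall_ranks(filtered_ids, overall_ids):
    """Map each filtered company to its 1-based position in the overall ordering."""
    position = {company_id: i for i, company_id in enumerate(overall_ids, start=1)}
    return {company_id: position.get(company_id) for company_id in filtered_ids}


# Overall ordering for location US (made-up data), and the companyType=Private subset.
overall = ["Apple", "Microsoft", "Google", "Walmart", "Amazon"]
filtered = ["Apple", "Google", "Amazon"]
print(overall_ranks(filtered, overall))
```

Building the dictionary is one pass over the overall list, and each lookup afterwards is constant time.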

Is it possible to make elasticsearch highlights linkable?

I'm successfully using ES for indexing documents and highlighting searched text. But now I have a new requirement: make all yellow highlights linkable, i.e. the user must be able to jump to the page containing the selected occurrence.
I haven't implemented a page preview of the document yet, but I'm sure there is software that takes a page number or byte offset and returns a docx or pdf page as an image. So I want Elasticsearch to return the index of each occurrence (most likely, the byte offset from the beginning of the document). After that I can probably use such index-to-image software to show the page with the occurrence to the user. Even if such software does not exist, I can open a RandomAccessFile, read the page with the occurrence, and somehow show it to the user. Either way, I need the occurrence index. Is it possible to get it from Elasticsearch?
My search request looks like:
http://localhost:9200/mongofilesindex/_search?pretty&source={
  "_source": ["filename", "metadata"],
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*test*"
        }
      }
    }
  },
  "highlight": {
    "pre_tags": ["<mark>"],
    "post_tags": ["</mark>"],
    "fields": {
      "content": {
        "fragment_size": 200,
        "number_of_fragments": 10
      }
    }
  }
}&size=10&from=0
Of course, I could use ES just for extracting the matching documents and then manually run KMP over the input stream, which works in linear time. But I want something better than linear, because I know that suffix automata and other complex data structures can return occurrences in O(search_string_len + occurrences_count), which is much better than O(doc_len).
I'm sure that Elasticsearch uses such data structures internally, and I'm probably missing some API for getting the occurrence indices.
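The linear-scan fallback mentioned above can be sketched as follows. This is only an illustration: it relies on Python's built-in str.find instead of a hand-written KMP, the function name is made up, and it returns character offsets within the already-extracted text rather than byte offsets within the original docx or pdf file.

```python
def find_occurrences(text, needle):
    """Return the start offset of every (possibly overlapping) occurrence of needle."""
    if not needle:
        return []
    offsets = []
    start = text.find(needle)
    while start != -1:
        offsets.append(start)
        # Move one character forward so overlapping matches are also found.
        start = text.find(needle, start + 1)
    return offsets


print(find_occurrences("a test of tests", "test"))
```

Mapping such an offset back to a page of the original document still depends on how the text was extracted from the file.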

ElasticSearch query referencing document

I read some time ago that there was a way to build a query that references another document in your index. At the time, this wasn't helpful to me, but I now have very large GIS areas that I need to query against and sending this data to ElasticSearch in the query body every time seems wasteful.
While my specific use-case relates to GIS, geo_shape, etc, it's a general issue that can be applied to other types of queries.
I have a document type areas that holds all of the predefined search areas (these are things like suburbs, states, etc) and entities that hold all of my search data, including a geo_point type field with lat/lon.
I need to be able to construct a geo_shape query for entities documents that references the mpoly attribute (which is a GeoShape type) on an areas document for its shape coordinates.
Unfortunately, neither Google nor the ElasticSearch docs have proved useful in this case, because people generally seem to be more interested in nested documents (related, but not what I'm looking for).
Finally found the answer myself while looking for something different. Unfortunately, the information about the GeoShape filter is not in the GeoShape query manual pages:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-filter.html#_pre_indexed_shape
{
  "filtered": {
    "query": {
      "match_all": {}
    },
    "filter": {
      "geo_shape": {
        "location": {
          "indexed_shape": {
            "id": "DEU",
            "type": "countries",
            "index": "shapes",
            "path": "location"
          }
        }
      }
    }
  }
}
If anyone has better information about how to do this generically, I will happily accept their answer instead.
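The filtered query used above was removed in Elasticsearch 5.0; on newer versions the same pre-indexed shape reference is normally placed in the filter clause of a bool query. A minimal Python sketch that builds such a request body, with a made-up helper name and with the "type" key left out on the assumption that the target cluster no longer uses mapping types:

```python
def build_indexed_shape_query(field, shape_index, shape_id, shape_path):
    """Search body that filters on a shape stored in another document."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_shape": {
                        field: {
                            "indexed_shape": {
                                "index": shape_index,
                                "id": shape_id,
                                "path": shape_path,
                            }
                        }
                    }
                }
            }
        }
    }


# Same values as in the example above: shape "DEU" from the "shapes" index.
body = build_indexed_shape_query("location", "shapes", "DEU", "location")
print(body)
```

For the use case in the question, the shape index would be the one holding the areas documents and the path would be mpoly.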
