Elasticsearch - maintain original timestamp across versions

I have enabled automatic _timestamp on my indexes, but every time a document is updated during a _bulk request (or a regular update) the timestamp is also updated. This makes sense.
I want to know if there's a way to keep the original timestamp after an update, so that we only ever see the timestamp for version 1 no matter how many times the document is updated to a new version.
I have over 4 million documents and bulk update in chunks of 1000, so I'd rather not iterate through every single item to compare timestamps.
Any tips?

For anyone who comes across this issue in the future: I ended up using a combination of bulk update actions, one of which is a script update that upserts a date. The date is written only when the document is first created and is left alone during later updates. I'm not sure if this is the most elegant solution but it works.
{ "update": {"_id" : "1"} }
{ "script": "", "upsert" : {"og_index_date" : 20130826} }
{ "update": {"_id" : "1"} }
{ "doc": {"field1" : "one", "field2": "two"}, "doc_as_upsert" : true }
Even though you're doubling your writes, it will maintain this date across versions.
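As a rough illustration, the two action pairs above could be generated per document like this. This is a minimal sketch: the helper names build_bulk_lines and build_bulk_body are hypothetical, the empty script simply mirrors the answer, and whether a given Elasticsearch version accepts an empty script is not verified here.

```python
import json


def build_bulk_lines(doc_id, fields, created_date):
    """Return the four NDJSON lines for one document (hypothetical helper)."""
    action = {"update": {"_id": doc_id}}
    return [
        # First pair: empty script plus upsert. The upsert body is only used
        # when the document does not exist yet, so the date is written once.
        json.dumps(action),
        json.dumps({"script": "", "upsert": {"og_index_date": created_date}}),
        # Second pair: regular partial update of the fields that change.
        json.dumps(action),
        json.dumps({"doc": fields, "doc_as_upsert": True}),
    ]


def build_bulk_body(docs, created_date):
    """Join the lines for many (doc_id, fields) pairs into one bulk body.

    The bulk API expects newline-delimited JSON ending with a newline.
    """
    lines = []
    for doc_id, fields in docs:
        lines.extend(build_bulk_lines(doc_id, fields, created_date))
    return "\n".join(lines) + "\n"


body = build_bulk_body([("1", {"field1": "one", "field2": "two"})], 20130826)
```

The resulting string would be sent as the body of a _bulk request against the target index; the chunking into batches of 1000 mentioned in the question is left out for brevity.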

Related

Elasticsearch past version document

I want to maintain last 2 versions of documents in Elasticsearch.
I created, for example, first update for product123
PUT /products/_doc/product123
{ "name" : "toothPaste",
"price" : 10
}
Then for second update product123:
PUT /products/_doc/product123
{
"name" : "toothPaste",
"price" : 12
}
When I query using the GET API, I get "price": 12 (the current version).
Is it possible to get "price": 10 (the previous version) of the same document?
The only way to do this in Elasticsearch is to manage it yourself, as any update applied to a document does not retain the previous version.
You could do this using separate documents, as MAZux mentioned above, or you could do it in different fields, e.g. price and previous_price.
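The price / previous_price idea can be sketched on the application side as follows. This is only an illustration under assumptions: with_previous is a hypothetical helper, and it presumes the application reads the current document before writing the new one.

```python
def with_previous(current, update, tracked=("price",)):
    """Merge `update` into `current`, copying each tracked field's old value
    into a `previous_<field>` field before it is overwritten."""
    merged = dict(current)
    for field in tracked:
        changed = (
            field in current
            and field in update
            and update[field] != current[field]
        )
        if changed:
            merged["previous_" + field] = current[field]
    merged.update(update)
    return merged


v1 = {"name": "toothPaste", "price": 10}
v2 = with_previous(v1, {"name": "toothPaste", "price": 12})
```

Here v2 carries both the new price and the old one, so a single GET returns the last two values. Keeping more history than that would need separate documents, as the answer notes.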

Elasticsearch Multi Index Query and Filter

I have 2 indexes, one that stores data about an event and one that stores the availability of that event. I am trying to create a single query that gets events by a query but only returns ones that are available, and I am having difficulty doing so.
The events index stores
{
"id" : "152ce52d-e975-4ebd-849a-0a12f535e644",
"createdAt" : 1.5519999143126902E12,
"description" : "A very not so concise description",
"geoHash" : "dnh00x6x5",
"name" : "a name",
...etc...
}
The availability index stores availability like so:
{
"eventId" : "152ce52d-e975-4ebd-849a-0a12f535e644",
"maxGuests" : 8,
"availability" : {
"lte" : "2019-10-18T22:15:00.000Z",
"gte" : "2019-10-18T02:30:00.000Z"
}
}
I am trying to create a query like below, but what I can't figure out is how to filter by listings that meet the criteria in the events index AND are available in the availability index.
GET events,availability/_search
{
"size": 5,
"from": 0,
"_source": [
"id"
],
"query": {
"bool": {
"must": [
{
"geo_distance": {
"distance": "25mi",
"geoHash": {
"lat": 34.0389,
"lon": -84.3826
}
}
}
],
"should": [],
"filter":[
{
"range" : {
"availability" : {
"gte" : "2019-10-31",
"lte" : "2020-11-01",
"relation" : "within"
}
}
}
]
}
}
}
--
The reason I want to only do one query is that the client is expecting a certain specified number of events. If I filter out the unavailable events after I get the event data then I am likely to be left with fewer events than the client expected and would need to do yet another search to fill the gap.
Also, of course, I could merge the two indices so that the event also stores the availability info, but I originally set them up this way because the availability info may have hundreds or thousands of entries per event.
What you want to accomplish is the equivalent of a SQL foreign key (a join). There is no way to get exactly what you want, meaning filtering documents from index A by querying index B. Your options are:
1. As you've mentioned, solve it at the application level (although this causes other problems for you, so it's not really a solution).
2. Merge the data into one index and accept duplicated event information. Although it seems expensive, duplication of data in a NoSQL database is to be expected. If you need a relational model then maybe you should use a SQL solution.
3. Use parent/child (the join datatype). The problem here is that you will need to have all the data in the same index. Moreover, parent and child will be stored in the same shard as well.
One approach (a bit more complex, though) that I believe would work for you is to use the nested datatype, which is effectively a more compact form of option 2 (combine your data in one index, but save the root information only once). Make events the root and store the availability entries as nested objects. When you want to add an availability entry you can use the update API, and when you query, you can search by both the root fields and the nested fields. If you need to retrieve specific availability entries for an event you can use inner hits.
What you are trying to do (a multi-index search) will not join your data automatically, so it will not work. Elasticsearch doesn't work that way, and the relational model is not suited to this product.
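To make the nested suggestion more concrete, here is a sketch of what the search body might look like once availability lives inside each event. The field name availabilities is an assumption (it is not in the original question) and is presumed to be mapped as nested, with availabilities.availability as a date range field. The geo_distance part is copied from the question.

```python
def nested_availability_query(lat, lon, distance, start, end, size=5):
    """Build a search body for events near a point that have at least one
    nested availability entry within [start, end]."""
    return {
        "size": size,
        "from": 0,
        "_source": ["id"],
        "query": {
            "bool": {
                "must": [
                    {
                        "geo_distance": {
                            "distance": distance,
                            "geoHash": {"lat": lat, "lon": lon},
                        }
                    }
                ],
                "filter": [
                    {
                        "nested": {
                            # Hypothetical nested field holding the entries
                            # that used to live in the availability index.
                            "path": "availabilities",
                            "query": {
                                "range": {
                                    "availabilities.availability": {
                                        "gte": start,
                                        "lte": end,
                                        "relation": "within",
                                    }
                                }
                            },
                            # Return the matching availability entries too.
                            "inner_hits": {},
                        }
                    }
                ],
            }
        },
    }


body = nested_availability_query(34.0389, -84.3826, "25mi", "2019-10-31", "2020-11-01")
```

Because the availability filter runs inside the same query as the event criteria, size applies to events that are already known to be available, which is the behaviour the question asks for.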
One last thing, it's a good thing to plan ahead, but it's a bad thing to try to optimize early on.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
An interesting read that summarizes the above

Joining logstash with parent record

I'm using Logstash to analyze my web servers' access logs. At this time, it works pretty well. I used a configuration file that produces this kind of data:
{
"type": "apache_access",
"clientip": "192.243.xxx.xxx",
"verb": "GET",
"request": "/publications/boreal:12345?direction=rtl&language=en",
...
"url_path": "/publications/boreal:12345",
"url_params": {
"direction": "rtl",
"language": "en"
},
"object_id": "boreal:12345"
...
}
These records are stored in the "logstash-2016.10.02" index (one index per day).
I also created another index named "publications". This index contains the publication metadata.
A JSON record looks like this:
{
"type": "publication",
"id": "boreal:12345",
"sm_title": "The title of the publication",
"sm_type": "thesis",
"sm_creator": [
"Smith, John",
"Dupont, Albert",
"Reegan, Ronald"
],
"sm_departement": [
"UCL/CORE - Center for Operations Research and Econometrics"
],
"sm_date": "2001",
"ss_state": "A"
...
}
And I would like to create a query like "give me all accesses for 'Smith, John' publications".
As my data are not all in the same index, I can't use a parent-child relation (am I right?).
I read this on a forum, but it's an old post:
By limiting itself to parent/child type relationships elasticsearch makes life
easier for itself: a child is always indexed in the same shard as its parent,
so has_child doesn’t have to do awkward cross shard operations.
Using Logstash, I can't place all the data in a single index named logstash. I have more than 1M accesses per month... In 1 year, I will have more than 15M records in 1 index... and I need to store the web access data for a minimum of 5 years (1M * 12 * 15 = 180M).
I don't think it's a good idea to deal with a single index containing more than 18M records (if I'm wrong, please let me know).
Is there a solution to my problem? I can't find an elegant one.
The only one I have at this time, in my Python script, is: a first query to collect all the ids of 'Smith, John' publications, then a loop over each publication to get all web server accesses for that specific publication.
So if "Smith, John" has 321 publications, I send 321 HTTP requests to ES and the response time is not acceptable (more than 7 seconds; not so bad given the number of records in ES, but not acceptable for the end user).
Thanks for your help; sorry for my English.
Renaud
An idea would be to use the Logstash elasticsearch filter to look up the matching publication while an access log document is being processed by Logstash.
That filter would retrieve the sm_creator field from the document in the publications index whose id matches object_id, and enrich the access log with whatever fields from the publication document you need. Thereafter, you can simply query the logstash-* indices.
elasticsearch {
hosts => ["localhost:9200"]
index => publications
query => "id:%{object_id}"
fields => {"sm_creator" => "author"}
}
As a result, your access log document will look like this afterwards, and for "give me all accesses for 'Smith, John' publications" you can simply query the author field (copied from sm_creator) in all your logstash indices:
{
"type": "apache_access",
"clientip": "192.243.xxx.xxx",
"verb": "GET",
"request": "/publications/boreal:12345?direction=rtl&language=en",
...
"url_path": "/publications/boreal:12345",
"url_params": {
"direction": "rtl",
"language": "en"
},
"object_id": "boreal:12345",
"author": [
"Smith, John",
"Dupont, Albert",
"Reegan, Ronald"
],
...
}
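The effect of that filter can be illustrated in plain Python. This is only a sketch of the enrichment logic, not of Logstash itself: enrich_access_log and accesses_by_author_query are hypothetical helpers, the in-memory publications dictionary stands in for the publications index, and match_phrase is one possible choice of query for the enriched field.

```python
def enrich_access_log(event, publications):
    """Copy sm_creator from the matching publication into event['author'].

    `publications` maps a publication id to its metadata document.
    """
    enriched = dict(event)
    publication = publications.get(event.get("object_id"))
    if publication is not None and "sm_creator" in publication:
        enriched["author"] = list(publication["sm_creator"])
    return enriched


def accesses_by_author_query(author):
    """Search body to run against logstash-* once events carry `author`."""
    return {"query": {"match_phrase": {"author": author}}}


publications = {
    "boreal:12345": {"sm_creator": ["Smith, John", "Dupont, Albert"]},
}
event = {"type": "apache_access", "object_id": "boreal:12345"}
enriched = enrich_access_log(event, publications)
query = accesses_by_author_query("Smith, John")
```

The design trade-off is the usual denormalization one: the author list is duplicated into every access record at index time, so a single search over logstash-* replaces the per-publication loop.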

Why are Elasticsearch aliases not unique

The Elasticsearch documentation describes aliases as a feature for reindexing data with zero downtime:
Create a new index and index the whole data
Let your alias point to the new index
Delete the old index
This would be a great feature if aliases were unique, but it's possible for one alias to point to multiple indexes. If the deletion of the old index fails, my application might talk to two indexes which might not be in sync. Even worse: the application doesn't know about that.
Why is it possible to reuse an alias?
It allows you to easily have several indexes that are both used individually and together with other indexes. This is useful for example when having a logging index where sometimes you want to query the most recent (logs-recent alias) and sometimes want to query everything (logs alias). There are probably lots of other use cases but this one pops up as the first for me.
As per the documentation you can send both the remove and add in one request:
curl -XPOST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
"actions" : [
{ "remove" : { "index" : "test1", "alias" : "alias1" } },
{ "add" : { "index" : "test2", "alias" : "alias1" } }
]
}'
After that succeeds you can remove your old index, and if that fails you will just have an extra index taking up some space until it's cleaned out.
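For completeness, the same request body can be built programmatically. A minimal sketch, where alias_swap_actions is a hypothetical helper and the index and alias names are the ones from the curl example:

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the _aliases body that moves `alias` from old_index to new_index.

    Both actions travel in one request, so the alias should never point at
    both indexes (or at neither) from the application's point of view.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = alias_swap_actions("alias1", "test1", "test2")
print(json.dumps(swap, indent=2))
```

Deleting the old index is deliberately kept out of this step, matching the answer: if that later deletion fails, the alias already points only at the new index.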

Performance tuning MongoDB query/update?

So I have a MongoDB instance where I am trying to update data in one collection with data from another collection. The two collections are participants with about 180k documents and questions with about 95k documents.
Documents in participants typically look something like this:
{
"_id" : ObjectId("52f90b8bbab16dd8594b82b4"),
"answers" : [
{
"_id" : ObjectId("52f90b8bbab16dd8594b82b9"),
"question_id" : 2081,
"sub_id" : null,
"values" : [
"Yes"
]
},
{
"_id" : ObjectId("52f90b8bbab16dd8594b82b8"),
"question_id" : 2082,
"sub_id" : 123,
"values" : [
"Would prefer to go alone"
]
},
{
"_id" : ObjectId("52f90b8bbab16dd8594b82b7"),
"question_id" : 2082,
"sub_id" : 456,
"values" : [
"Yes"
]
}
],
"created" : ISODate("2012-03-01T17:40:21Z"),
"email" : "anonymous",
"id" : 65,
"survey" : ObjectId("52f41d579af1ff4221399a7b"),
"survey_id" : 374
}
I am using the query below to perform the update:
db.participants.ensureIndex({"answers.question_id": 1, "answers.sub_id": 1});
print("created index for answer arrays!")
db.questions.find().forEach(function(doc){
db.participants.update(
{
"answers.question_id": doc.id,
"answers.sub_id": doc.sub_id
},
{
$set:
{
"answers.$.question": doc._id
}
},
false,
true
);
});
db.participants.dropIndex({"answers.question_id": 1, "answers.sub_id": 1});
But this takes about 20 minutes to run. I was hoping that adding the index would help with the performance, but it is still pretty slow. Is this index set up correctly, considering that I am indexing fields in an array of objects? Can anyone see anything I am doing that would cause the slowness? Any suggestions on where to start looking to improve the performance of this query?
I think you need to consider what you are actually doing here in order to understand why the index is not helping and indeed why this operation takes so long.
The first part of the answer is explained by what you are doing here:
db.questions.find()
That part alone says that you are asking to retrieve every document in your questions collection. And that is indeed what you intend, since you want to copy content from it into your participants collection, specifically the document _id for the "question". But by definition, when you fetch all documents, no index will be used.
So what you are doing is looping over every document in questions, then asking your update operation to match participants records against data from the "question". That means you are pulling all of your 95K documents "over the wire" and sending your update operation back "over the wire" 95K times. This is not happening on the server, and there is network traffic between your application and your MongoDB.
The index itself is not going to do much other than speed up the search for each participants record, which is better than scanning, and you should be getting the match. But that's not the part that is taking the time; it's the fetching of the questions that will be the largest issue. Also note that if you were updating
So if it's possible to run your update process on a machine that is as close as possible in networking terms to the MongoDB server, that is going to be your best performance improvement. You could also wind back your Write Concern if you want to be a little daring and/or can live with checking the integrity in another operation; putting it in "fire and forget" mode will reduce your network traffic and the time spent waiting for a response to each update (which is what is actually happening now).
Also see the guide if you are not sure of the concepts:
http://docs.mongodb.org/manual/core/write-concern/
In case anyone is interested I was able to take the run time of this update query from 20 minutes down to about a minute and a half by using projection when selecting the questions documents. Since I am only using the _id, id and sub_id fields I was able to do the following:
db.questions.find({},{_id: 1, id: 1, sub_id: 1}).forEach(function(doc){
....
Which drastically improved performance. Hope this helps someone!
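The same idea, written as a Python sketch rather than shell code. The helper question_to_update is hypothetical and only builds the filter and update documents from a projected question. The commented loop assumes a pymongo database handle named db, which is not part of the original post.

```python
# Only the fields the update needs; everything else stays on the server.
PROJECTION = {"_id": 1, "id": 1, "sub_id": 1}


def question_to_update(question):
    """Turn one projected question into the (filter, update) pair used above."""
    filter_doc = {
        "answers.question_id": question["id"],
        "answers.sub_id": question["sub_id"],
    }
    update_doc = {"$set": {"answers.$.question": question["_id"]}}
    return filter_doc, update_doc


# With pymongo (assumed, not imported here) the loop might look like:
#   for q in db.questions.find({}, PROJECTION):
#       db.participants.update_many(*question_to_update(q))

filter_doc, update_doc = question_to_update({"_id": "q-1", "id": 2082, "sub_id": 123})
```

The projection shrinks each of the roughly 95k documents pulled over the network, which is consistent with the earlier answer's point that fetching the questions dominates the run time.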
