Javers production best practices - javers

Are there any production best practices (tuning and similar) to follow in order to avoid:
- Performance issues
- Uncontrolled data growth
Are there also any cleanup utilities or best practices for removing "old" or unnecessary data?
Could using MongoDB solve potential performance or data maintenance problems?

There is one performance hint -- keep Javers data small. You should control the number of Snapshots persisted to JaversRepository.
Applications should track changes only in important data that are entered by users. You can call it core-domain data or business-relevant data. All technical data, data imported from other systems and generated data should be ignored. There are various ways of ignoring things in Javers.
At the end of the day, when you show a change log to your users, it should look like a concise, human-readable story, for example:
System.out.println(changes.prettyPrint());
Changes:
Commit 2.0 done by author at 15 Apr 2018, 22:50:15 :
* changes on Employee/Frodo :
- 'salary' changed from '10000' to '11000'
- 'subordinates' collection changes :
0. 'Employee/Sam' added
* new object: Employee/Sam
* changes on Employee/Sam :
- 'boss' changed from '' to 'Employee/Frodo'
- 'name' changed from '' to 'Sam'
- 'salary' changed from '0' to '2000'
Commit 1.0 done by author at 15 Apr 2018, 22:50:15 :
* new object: Employee/Frodo
* changes on Employee/Frodo :
- 'name' changed from '' to 'Frodo'
- 'salary' changed from '0' to '10000'

DRF datetime field serializer 'Invalid datetime for the timezone' twice a year

I'm importing data from csv to InfluxDB through a Django Rest Framework API endpoint.
The relevant part of the viewset:
if request.method == "PUT":
    measurements = InputMeasurementSerializer(data=request.data, many=True)
    measurements.is_valid(raise_exception=True)
The serializer:
class InputMeasurementSerializer(serializers.Serializer):
    value = serializers.FloatField()
    time = serializers.DateTimeField(input_formats=[
        "%Y-%m-%d_%H:%M:%S", ...])
The input data consists of time/value pairs, one for every 15 minutes:
time,value
2021.04.11 00:00:00,0.172
2021.04.11 00:15:00,0.76
2021.04.11 00:30:00,0.678
2021.04.11 00:45:00,1.211
It works fine for all dates except the time values between 02:00 and 03:00 on 28.03.2021 and on 25.10.2020, for which it throws the exception: Invalid datetime for the timezone "Europe/Budapest".
This could be related to daylight saving time, but I don't see how exactly. Does anyone have a clue what the problem could be?
In my settings.py:
TIME_ZONE = 'Europe/Budapest'
USE_TZ = True
It is indeed because of daylight saving time, as explained in the documentation:
Even if your website is available in only one time zone, it’s still good practice to store data in UTC in your database. The main reason is Daylight Saving Time (DST). Many countries have a system of DST, where clocks are moved forward in spring and backward in autumn. If you’re working in local time, you’re likely to encounter errors twice a year, when the transitions happen. (The pytz documentation discusses these issues in greater detail.) This probably doesn’t matter for your blog, but it’s a problem if you over-bill or under-bill your customers by one hour, twice a year, every year. The solution to this problem is to use UTC in the code and use local time only when interacting with end users.
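To see why exactly those two hours fail, here is a minimal sketch that classifies a naive wall-clock time in Europe/Budapest. It assumes Python 3.9+ with the standard zoneinfo module and an available time zone database; the helper name classify_local_time is made up for illustration and is not part of Django or DRF.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

BUDAPEST = ZoneInfo("Europe/Budapest")

def classify_local_time(naive, tz=BUDAPEST):
    """Return 'nonexistent', 'ambiguous' or 'ok' for a naive wall-clock time."""
    early = naive.replace(tzinfo=tz, fold=0)  # first reading of the wall clock
    late = naive.replace(tzinfo=tz, fold=1)   # second reading, if the hour repeats
    # A wall-clock time skipped by the spring transition does not survive
    # a round trip through UTC.
    round_trip = early.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None)
    if round_trip != naive:
        return "nonexistent"
    # During the autumn transition the same wall-clock time maps to two offsets.
    if early.utcoffset() != late.utcoffset():
        return "ambiguous"
    return "ok"

print(classify_local_time(datetime(2021, 3, 28, 2, 30)))
print(classify_local_time(datetime(2020, 10, 25, 2, 30)))
print(classify_local_time(datetime(2021, 4, 11, 0, 15)))
```

In spring the clock jumps from 02:00 to 03:00, so 02:30 never happens; in autumn the hour from 02:00 to 03:00 happens twice, so 02:30 cannot be mapped to a single instant. Sending timestamps in UTC (or with an explicit offset) avoids both cases.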

Prometheus: Prevent starting new time series on label change

Assuming the following metric:
cpu_count{machine="srv1", owner="Alice", department="ops"} 8
cpu_count{machine="srv1", owner="Bob", department="ops"} 8
I'd like to be able to prevent starting a new time series on owner change. It should still be considered the same instance, but I would like to be able to look up by owner.
I don't particularly care if it matches only on my_metric{owner=~"Bob"} or on both my_metric{owner=~"Bob"} and my_metric{owner=~"Alice"}, I just need to make sure it does not count twice on my_metric{machine=~"srv1"} or my_metric{department=~"ops"}.
I'm willing to accept that using labels to group instances in this manner is not the correct approach, but what is?
When you add the label "owner" to this kind of metric, I think you're trying to accomplish a kind of "asset management", which would be better done with a tool developed specifically for that purpose. Prometheus isn't a suitable tool for keeping track of who is using each machine in your company.
That said, every time the owner of a machine changes, you could work around this issue by deleting the old data series through the REST API with something like this:
curl --silent --user USER:PASS --globoff --request POST "https://PROMETHEUS-SERVER/api/v1/admin/tsdb/delete_series?match[]={machine='srv1',owner='Bob'}"
If you can change the code, it would be better to have a metric dedicated to the ownership:
# all metrics are identified as usual
cpu_count{machine="srv1", department="ops"} 8
# use an info metric to give details about the owner
machine_info{machine="srv1", owner="Alice", department="ops"} 1
You can still combine the information if you need it:
cpu_count * ON(machine,department) machine_info
That way, the owner is not polluting all your metrics. Still, you will have issues when changing the owner of a machine while waiting for the older metric to disappear (5 minutes before staleness).
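The difference between the two layouts can be sketched outside Prometheus. The following is a toy model in plain Python, not PromQL: series are (labels, value) pairs, sum_by mimics an aggregation by one label, and join_on is a rough stand-in for a join that copies the owner label from the info metric (what group_left(owner) would do in PromQL). All function names here are illustrative.

```python
from collections import defaultdict

def sum_by(series, label):
    """Sum sample values grouped by a single label."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)

def join_on(left, right, on, copy):
    """Multiply matching samples and copy the `copy` labels from the right side."""
    joined = []
    for l_labels, l_value in left:
        for r_labels, r_value in right:
            if all(l_labels[k] == r_labels[k] for k in on):
                merged = dict(l_labels)
                merged.update({k: r_labels[k] for k in copy})
                joined.append((merged, l_value * r_value))
    return joined

# Layout 1: owner is a label on the metric itself, so an owner change
# yields two series for the same machine while both are still live.
cpu_with_owner = [
    ({"machine": "srv1", "owner": "Alice", "department": "ops"}, 8),
    ({"machine": "srv1", "owner": "Bob", "department": "ops"}, 8),
]

# Layout 2: owner lives only on a separate info metric with value 1.
cpu_count = [({"machine": "srv1", "department": "ops"}, 8)]
machine_info = [({"machine": "srv1", "owner": "Alice", "department": "ops"}, 1)]

print(sum_by(cpu_with_owner, "department"))  # the machine is counted twice
print(sum_by(cpu_count, "department"))       # the machine is counted once
print(join_on(cpu_count, machine_info, on=("machine", "department"), copy=("owner",)))
```

With the second layout, aggregations over cpu_count never see the owner, and the owner is attached only in the queries that ask for it.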
I have not tried it but a solution could be to use the time at which the ownership changed (if you can provide it) as a metric value - epoch time in seconds.
# owner changed at Sun, 08 Mar 2020 22:05:53 GMT
machine_info{machine="srv1", owner="Alice", department="ops"} 1583705153
# Previous owner Sat, 01 Feb 2020 00:00:00 GMT
machine_info{machine="srv1", owner="Bob", department="ops"} 1580515200
And then use the following expression to get the latest owner whenever you need the current owner - only useful when owner has changed within the last 5 minutes:
machine_info == BOOL ON(machine,department) GROUP_LEFT (max(machine_info) BY(machine,department))
Quite a mouthful but it would approach what you want.

Elasticsearch Delete by Query Version Conflict

I am using Elasticsearch version 5.6.10. I have a query that deletes records for a given agency, so they can later be updated by a nightly script.
The query is written with elasticsearch-dsl and looks like this:
from elasticsearch_dsl import Q
def remove_employees_from_search(jurisdiction_slug, year):
    s = EmployeeDocument.search()
    s = s.filter('term', year=year)
    s = s.query('nested', path='jurisdiction', query=Q("term", **{'jurisdiction.slug': jurisdiction_slug}))
    response = s.delete()
    return response
The problem is that I get a ConflictError exception when trying to delete the records via that function. I have read that this occurs because the documents changed between the time the delete process started and the time it executed. But I don't see how this can be, because nothing else is modifying the records during the delete process.
I am going to add s = s.params(conflicts='proceed') in order to silence the exception. But this is a band-aid as I do not understand why the delete is not processing as expected. Any ideas on how to troubleshoot this? A snapshot of the error is below:
ConflictError:TransportError(409,
u'{
"took":10,
"timed_out":false,
"total":55,
"deleted":0,
"batches":1,
"version_conflicts":55,
"noops":0,
"retries":{
"bulk":0,
"search":0
},
"throttled_millis":0,
"requests_per_second":-1.0,
"throttled_until_millis":0,
"failures":[
{
"index":"employees",
"type":"employee_document",
"id":"24681043",
"cause":{
"type":"version_conflict_engine_exception",
"reason":"[employee_document][24681043]: version conflict, current version [5] is different than the one provided [4]",
"index_uuid":"G1QPF-wcRUOCLhubdSpqYQ",
"shard":"0",
"index":"employees"
},
"status":409
},
{
"index":"employees",
"type":"employee_document",
"id":"24681063",
"cause":{
"type":"version_conflict_engine_exception",
"reason":"[employee_document][24681063]: version conflict, current version [5] is different than the one provided [4]",
"index_uuid":"G1QPF-wcRUOCLhubdSpqYQ",
"shard":"0",
"index":"employees"
},
"status":409
}
You could try making it do a refresh first:
client.indices.refresh(index='your-index')
Source: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_indices_refresh
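In Python, the same idea could be sketched as follows. This is a hedged sketch, not a confirmed fix: it assumes a client object shaped like the elasticsearch-py client of the 5.x to 7.x era, exposing indices.refresh(index=...) and delete_by_query(index=..., body=...); newer clients may expect different keyword arguments. The function name refresh_then_delete is made up for illustration.

```python
def refresh_then_delete(client, index, query):
    """Refresh the index, then delete every document matching `query`.

    `client` is assumed to behave like an elasticsearch-py client
    (indices.refresh and delete_by_query); this is an assumption.
    """
    # Make all pending writes visible to search, so the delete-by-query
    # snapshot sees the current version of each document.
    client.indices.refresh(index=index)
    # Run the delete against the freshly refreshed index.
    return client.delete_by_query(index=index, body={"query": query})
```

A call might look like refresh_then_delete(client, "employees", {"term": {"year": 2018}}), where the index name and query are placeholders. Checking the version_conflicts field of the response shows whether the refresh actually removed the conflicts.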
First, this is a question that was asked 2 years ago, so take my response with a grain of salt due to the time gap.
I am using the JavaScript API, but I would bet that the flags are similar. When you index or delete, there is a refresh flag that lets you force the result to become visible to search.
I am not an Elasticsearch guru, but the engine must perform some systematic maintenance on the indices and shards to bring them to a stable state. This is probably done over time, so you would not necessarily see a state update immediately. Furthermore, from personal experience, I have seen cases where a delete does not seem to remove the item from the index. It might mark it as "deleted" and give the document a new version number, but the document seems to "stick around" (probably until general maintenance sweeps run).
Here I am showing the js API for delete, but it is the same for index and some of the other calls.
client.delete({
id: string,
index: string,
type: string,
wait_for_active_shards: string,
refresh: 'true' | 'false' | 'wait_for',
routing: string,
timeout: string,
if_seq_no: number,
if_primary_term: number,
version: number,
version_type: 'internal' | 'external' | 'external_gte' | 'force'
})
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_delete
refresh
'true' | 'false' | 'wait_for' - If true then refresh the affected shards to make this operation visible to search, if wait_for then wait for a refresh to make this operation visible to search, if false (the default) then do nothing with refreshes.
For additional reference, here is the page on Elasticsearch refresh info and what might be a fairly relevant blurb for you.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
Use the refresh API to explicitly refresh one or more indices. If the request targets a data stream, it refreshes the stream’s backing indices. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.

RethinkDB - Is this a valid optimistic locking implementation

I am trying out a lot of new ideas (DDD, Event Sourcing and CQRS) and evaluating RethinkDB as a potential data store for the Domain Events. In DDD, an aggregate is a set of objects that work together to provide a specific behaviour. Each aggregate is a transactional/consistency boundary. The root of the aggregate is an object that provides an API and hides the internal implementation.
To persist an aggregate it is usually recommended to use optimistic locking. The idea is to have a version number attribute in the aggregate, and when the time comes to save the aggregate we check that its version in the database matches the version that was read/updated in the application. This guarantees that nobody changed the aggregate in the meantime and prevents overwriting others' changes.
Obviously this version checking can't just happen in the application layer (think multiple application servers scenario). The application needs support from data store for doing atomic updates that take this version number into consideration.
Here is a simple implementation using the RethinkDB Ruby API.
I created a table called 'applicants' with one record
"id": "6b3b57a7-3ba8-4322-873e-1d6c8333daae" ,
"name": "Homer Simpson" ,
"updated_at": Mon Dec 28 2015 12:05:40 GMT+05:30 ,
"version": 1
Here is the sample test code that I ran twice in parallel
require 'rethinkdb'
include RethinkDB::Shortcuts
conn = r.connect(:host => 'localhost',
                 :port => 28015,
                 :db => 'test')
def update_applicant(conn, current_version)
  result = r.table('applicants').get('6b3b57a7-3ba8-4322-873e-1d6c8333daae').update{ |applicant|
    r.branch(
      applicant['version'].eq(current_version),
      {updated_at: Time.now, version: current_version + 1},
      {}
    )
  }.run(conn)
  fail 'optimistic locking failure' if result['unchanged'] == 1
rescue => e
  puts "optimistic locking failure: #{current_version}"
  current_version = r.table('applicants').get('6b3b57a7-3ba8-4322-873e-1d6c8333daae').run(conn)['version']
  retry
end
(1..100).each { |version| update_applicant(conn, version) }
conn.close
This seems to work but I want to make sure there will be no race conditions and other issues with this approach in a production environment. I am assuming that update is an atomic operation and using a branch in update still keeps it atomic.
I am looking for some validation and suggestions from RethinkDB devs/users. Thanks.
update is always an atomic operation unless you pass the non_atomic: true flag (which is sometimes necessary if the update contains a nondeterministic operation), so that code looks safe to me.
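The general pattern (read, conditional write on the version, retry on conflict) can be sketched independently of RethinkDB. The following is a toy in-memory model in Python, not RethinkDB code: a threading.Lock stands in for the atomicity that the data store provides, and the names InMemoryStore, update_if_version and save_with_retry are made up for illustration.

```python
import threading

class VersionConflict(Exception):
    """Raised when the stored version no longer matches the expected one."""

class InMemoryStore:
    """Toy stand-in for a data store offering an atomic conditional update."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()  # plays the role of the store's atomicity

    def insert(self, doc_id, fields):
        with self._lock:
            self._docs[doc_id] = dict(fields, version=1)

    def get(self, doc_id):
        with self._lock:
            return dict(self._docs[doc_id])  # return a copy, like a read from a DB

    def update_if_version(self, doc_id, expected_version, changes):
        """Apply `changes` only if the stored version equals `expected_version`."""
        with self._lock:
            doc = self._docs[doc_id]
            if doc["version"] != expected_version:
                raise VersionConflict(
                    "expected %d, found %d" % (expected_version, doc["version"]))
            doc.update(changes)
            doc["version"] = expected_version + 1
            return doc["version"]

def save_with_retry(store, doc_id, mutate, max_attempts=5):
    """Re-read and retry whenever another writer got in first."""
    for _ in range(max_attempts):
        current = store.get(doc_id)
        try:
            return store.update_if_version(doc_id, current["version"], mutate(current))
        except VersionConflict:
            continue  # somebody else saved in between: reload and try again
    raise VersionConflict("gave up after %d attempts" % max_attempts)
```

The important property is that the version check and the write happen as one atomic step inside the store; checking the version in application code and writing afterwards would reintroduce the race.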

Solr performance with commitWithin does not make sense

I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:
1. No indexing at all: ~45 seconds to post 2000 documents
2. Indexing included, commit after each add: ~8 minutes (!) to post and index 2000 documents
3. Indexing included, commitWithin 1 ms: ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense, I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed but I could actually see them being added by executing some queries during the experiment (via the solr web UI).
I am worried that I am missing something very big. Is it possible that committing after each add will degrade performance by a factor of 400?!
The code I use for point 2:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Whereas the code for point 3:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc, 1); // According to API documentation I understand there is no need to call an explicit commit after this
According to this wiki:
https://wiki.apache.org/solr/NearRealtimeSearch
commitWithin is a soft commit by default. Soft commits are very efficient at making added documents searchable right away, but the documents are not on disk yet: they are committed to RAM. In this setup you would use the updateLog to make the Solr instance tolerant to crashes.
What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. Instead, post a batch of documents and then issue a hard commit, or set your autoCommit to some reasonable value, like 10 minutes or 1 hour (depending on your users' expectations).
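The "post a batch, then commit once" idea can be sketched as follows. This is only an illustration in Python under stated assumptions: client is a hypothetical object with add(list_of_docs) and commit() methods, not the SolrJ API used in the question, and the helper names and the default batch size of 500 are arbitrary choices.

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def index_in_batches(client, docs, batch_size=500):
    """Send documents in batches and issue a single hard commit at the end.

    `client` is a hypothetical wrapper with add(docs) and commit().
    """
    sent = 0
    for chunk in batched(docs, batch_size):
        client.add(chunk)  # no commit here: documents are only queued in Solr
        sent += len(chunk)
    client.commit()        # one hard commit instead of one per document
    return sent
```

For the 2000 documents in the experiment, this would mean 4 add requests and 1 commit instead of 2000 adds each followed by a commit.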
