Understanding the Open Policy Agent (OPA) Disk-Storage implementation's use of .sst and .vlog files (BadgerDB)

I'm working through some OPA examples like this one that leverage disk storage. I've removed the temporary directory in favor of a permanent one (like we'd have in a production system) and I'm noticing some strange behavior. If I first write the example record
"authz": {
"tenants": {
"acmecorp.openpolicyagent.org": {
"tier": "gold"
},
"globex.openpolicyagent.org" :{
"tier": "silver"
}
}
}
then the directory is populated with 000001.sst, 000001.vlog, DISCARD, KEYREGISTRY, and MANIFEST files. However, on every subsequent read, new .sst and .vlog files are added with an incremented number, such as 000002.sst. It seems really inefficient to keep writing new files on writes, and especially on reads; why is this the case?
Also, is the expectation that I do my own garbage collection on another thread or is this something that comes built in with OPA or Badger?

It seems really inefficient to keep writing new files on writes and especially reads, why is this the case?
From the perspective of using OPA, this should be considered an implementation detail. I can't comment on the necessity of those files other than that this is how Badger does it. Badger itself is far from simple; it's a multi-layer system involving its own caches, etc. -- it's too complex (for me!) to judge its on-disk behaviour in any way.
Also, is the expectation that I do my own garbage collection on another thread or is this something that comes built in with OPA or Badger?
You are not expected to do any such thing. In fact, OPA has a goroutine running that will periodically call the advised GC routine; here's the code.
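For reference, the general pattern Badger advises (calling RunValueLogGC periodically until a round finds nothing to rewrite) looks roughly like the sketch below. This is illustrative only, not OPA's actual code; the import path version, directory, and interval are placeholders:

package main

import (
    "log"
    "time"

    badger "github.com/dgraph-io/badger/v3"
)

func main() {
    // Hypothetical data directory, standing in for wherever your store lives.
    db, err := badger.Open(badger.DefaultOptions("/var/lib/myapp/badger"))
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Periodically ask Badger to rewrite value-log (.vlog) files whose live
    // data has dropped below the given ratio; keep looping until the call
    // reports there is nothing left to rewrite (badger.ErrNoRewrite).
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        for {
            if err := db.RunValueLogGC(0.5); err != nil {
                break
            }
        }
    }
}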
If you find the need to dig into this further, the Badger community might be another good venue; see this Dgraph Discourse category. (And we can of course discuss this on the OPA Slack, too.)

Related

Where is Fyne's thread safety defined?

I was attracted to Fyne (and hence Go) by a promise of thread safety. But now that I'm getting better at reading Go, I'm seeing things that make me believe that the API as a whole is not thread safe and perhaps was never intended to be. So I'm trying to determine what "thread safe" means in Fyne.
I'm looking specifically at
func (l *Label) SetText(text string) {
    l.Text = text
    l.textProvider.SetText(text) // calls refresh
}
and noting that l.Text is also a string. Assignments in Go are not thread safe, so it seems obvious to me that if two threads fight over the text of a label and both call label.SetText at the same time, I can expect memory corruption.
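To make that concrete, here is a tiny standalone program, nothing Fyne-specific, just a stand-in struct with a string field and two goroutines assigning to it, which the race detector flags when run with go run -race:

package main

import "sync"

// fakeLabel stands in for a widget with a plain string field.
type fakeLabel struct {
    Text string
}

func (l *fakeLabel) SetText(text string) {
    l.Text = text // unsynchronized assignment
}

func main() {
    l := &fakeLabel{}
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(s string) {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                l.SetText(s) // two goroutines fight over the text
            }
        }(string(rune('A' + i)))
    }
    wg.Wait()
}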
"But you wouldn't do that", one might say. No, but I am worried about the case of someone editing the content of an Entry while an app thread decides it needs to replace all the Entry's text - this is entirely possible in my app because it supports simultaneous editing by multiple users over a network, so updates to all sorts of widgets come in asynchronously. (Note I don't care what happens if two people edit the same Entry at the same time; someone's changes will be lost and I don't care who's. But it must not result in memory corruption.) Note that one approach I could take would be to have the background thread create an entirely new Entry widget, which would then replace the one in the current Box. But is that thread safe?
It's not that I don't know how to serialize things with channels. But I was hoping that Fyne would eliminate the need for it (a blog post claims it does); and even using channels, I can't convince myself that a user meddling with a widget in various ways while some background thread is altering it, hiding it, etc., isn't going to result in crashes. Maybe all of that is serialized under the covers and is perfectly safe, but I don't want to find out the hard way that it isn't, because I'll have no way to fix it.
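For the record, this is the kind of channel-based serialization I mean; a minimal sketch with no Fyne dependency, where in the real app the closures would call widget methods like label.SetText:

package main

import "fmt"

func main() {
    // All widget mutations are funneled through one goroutine, so no two
    // updates ever touch the (stand-in) widget state concurrently.
    updates := make(chan func())
    done := make(chan struct{})
    text := "initial"

    go func() {
        for f := range updates {
            f() // in a real app: label.SetText(...), entry.Hide(), etc.
        }
        close(done)
    }()

    // Any goroutine (network handler, timer, ...) submits a closure
    // instead of touching the widget directly.
    for i := 0; i < 3; i++ {
        i := i
        updates <- func() { text = fmt.Sprintf("update %d", i) }
    }

    close(updates)
    <-done
    fmt.Println(text) // prints "update 2"
}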
Fyne is clearly pretty new and seems to have tons of promise, but documentation seems light on details. Is more information available somewhere? Have people tried this successfully?
You have found some race conditions here. There are plans to improve this, but the 1.2 release was required to get a new "BaseWidget" in place first - and that was only released a few weeks ago.
Setting fields directly is primarily for setup purposes and so is not expected to be used in the way you illustrate. That said, we do want to support it. The base widget will soon introduce something akin to SetFieldsAndRefresh(func()), which will ensure the safety of the code passed in and refresh the widget afterward.
There is indeed a race currently within Refresh(). The internal use of channels was designed to remove this - but there are some corner cases, such as multiple goroutines calling it. This is the area our new BaseWidget code can help with - widgets can then lock internally and automatically. Using this approach will be thread safe, with no changes required from the developer, in a future release.
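To give a rough idea of that direction, here is an illustrative sketch with made-up type and method names; it is not the actual Fyne implementation:

package main

import (
    "fmt"
    "sync"
)

// baseWidget shows the kind of locking a base widget can provide so that
// concrete widgets stay safe without extra work from the developer.
type baseWidget struct {
    mu sync.RWMutex
}

// label embeds baseWidget and guards its state with the embedded lock.
type label struct {
    baseWidget
    text string
}

// SetText takes the widget's lock, mutates state, then refreshes.
func (l *label) SetText(text string) {
    l.mu.Lock()
    l.text = text
    l.mu.Unlock()
    l.refresh()
}

// Text reads the state under the read lock.
func (l *label) Text() string {
    l.mu.RLock()
    defer l.mu.RUnlock()
    return l.text
}

func (l *label) refresh() {
    // A real toolkit would redraw here; this just reads the state safely.
    fmt.Println("refresh:", l.Text())
}

func main() {
    l := &label{}
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            l.SetText(fmt.Sprintf("from goroutine %d", n))
        }(i)
    }
    wg.Wait()
}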
The API so far has made it possible for developers to not worry about threading and to work from any goroutine - we do need to work internally to make it safer; you are quite right. https://github.com/fyne-io/fyne/issues/506

How to index tons of data at once with Rails, (Re)tire, and JSON without eating (all) memory?

In a Rails 3.2.x app (Ruby 1.9.3), a rake task uses (Re)tire to access an ES cluster and goes through approximately 1M rows to create a new index.
The task uses .to_json with specific attributes and methods listed to limit the resulting hash for each element.
Yet as the task runs, memory is eaten away, and the process usually ends up being killed by the system.
The task is already using find_in_batches. Smaller batch sizes (using find_each) don't help.
checking without index
Removing the index.import call does improve things (obviously): the task goes through the whole collection very fast without a problem. This points to ES, Tire, or the JSON conversion (and the relations it might call upon) as the source of the problem.
reducing the scope of the task
Adding back index.import and passing a very limited hash (with string keys) for each item does make things slower, but not too much, and does not eat memory away. So JSON might not be the culprit here.
adding attributes and methods back
The culprit seems to be one of the methods used to grab one of the additional attributes. It is based on a relation between the model and another one, ending up with a lot of models being involved and sifted through.
As pointed out in Index the results of a method in ElasticSearch (Tire + ActiveRecord), adding includes does help a bit, but the task still ends up heavy.
going around
I also tried to work around part of the problem by replacing the calls to Tire with the ES bulk API.
Generating JSON files and sending them with a Ruby HTTP lib can work. Yet the same memory problem arises, since the same requests are made to the DB.
What's left?
What I don't get is why, even with find_in_batches, Ruby keeps eating away memory. I would expect that after each batch of data, the memory related to that batch would be freed.
Next to try: GC.start calls, and deactivating Active Record caching around the task.
Yet, unless a solution limits memory use drastically (300 or 500 MB instead of 800+ MB), the underlying issue remains: indexing a lot of instances of a model, including data related to some other models.
Am I missing something about the import and includes that would solve the issue?
Would splitting the task into smaller background jobs (Resque, Sidekiq) help? I would suppose so, as each batch would be isolated from the others and, once processed, would really free up its memory (?). (Orchestrating those tasks would be another problem.)
Are there good practices related to indexing big quantities of data into ES?
I've been using Rails + Elasticsearch for a while and have done this kind of dance a few times.
A few things come to mind, in no particular order.
Did you try the more recent elasticsearch gem (instead of Tire)? I've updated my apps to use it and like having more control over what is done.
I would also try to force a GC sweep after each ActiveRecord loop. You could also be extra careful with memory allocation by explicitly resetting all local variables each time.
You could use the fork & exec trick to fork a brand new process for each loop; it would be the most effective GC you can get. It's a little overhead the first time you write it, but the pay-off is great. Take good care to limit the amount of memory used in the outer part of the task. Using a process-based background task system would partly achieve the same goal, but you might still get memory bloat.
Can you limit the use of ActiveRecord? If you need some basic associations, you could use a lower-level/simpler tool like Sequel (or similar) to work with Ruby hashes/arrays instead of full-fledged AR models.

How to deactivate safe mode in the mongo shell?

Short question is in the title: I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long question, for those willing to know the context:
I am working on a huge set of data like
{
    _id: ObjectId("azertyuiopqsdfghjkl"),
    stringdate: "2008-03-08 06:36:00"
}
and some other fields. There are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how I could make an update query like
db.data.update({}, {$set: {date: new Date("$stringdate")}}, {multi: true})
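// note: "$stringdate" above is just a string literal, so new Date("$stringdate") is evaluated client-side (to an invalid date) rather than reading each document's stringdate field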
but did not find how to make this work, and resolved myself to writing a script that takes the documents one after the other and makes an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
The problem is that it takes a very long time. I already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added. I also set an index on a relevant field to process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to ensure that the limiting factor is the database lock availability and not any other factor like CPU or network costs.
I monitored the whole thing with mongotop, mongostat, and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed that MongoDB does not have a more precise granularity on its write lock; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection onto a dozen shards, even while staying on the same server, because there would have been individual locks on each shard.
But since I can't do a thing right now about the current database structure, I looked for ways to improve performance so that I spend at least 90% of my time writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every time I make an update there is also a getLastError() call made afterwards, which I don't want, because there is a 99.99% chance of success, and even in case of failure I can still run an aggregation query after the end of the big process to retrieve the individual exceptions.
I don't think I would gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestion?
I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want but it won't help.
This is because for one:
write a script that takes the documents one after the other and makes an update to set a new field whose value is new Date(stringdate).
When the shell is used in a non-interactive mode, like within a loop, it doesn't actually call getLastError(). As such, lowering your write concern to 0 will do nothing.
I already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug, but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat, and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time
That's because of all the movement in your documents; that's kinda hard to fix.
I am a bit disappointed that MongoDB does not have a more precise granularity on its write lock
The write lock doesn't actually denote the concurrency of MongoDB; this is another common misconception that stems from the transactional SQL technologies.
Write locks in MongoDB are mutexes, for one.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

Why does loading cached objects increase the memory consumption drastically when computing them will not?

Relevant background info
I've built a little piece of software that can be customized via a config file. The config file is parsed and translated into a nested environment structure (e.g. .HIVE$db = an environment, .HIVE$db$user = "Horst", .HIVE$db$pw = "my password", .HIVE$regex$date = some regex for dates, etc.)
I've built routines that can handle those nested environments (e.g. look up the value "db/user" or "regex/date", change it, etc.). The thing is that the initial parsing of the config files takes a long time and results in quite big objects (actually three to four of them, between 4 and 16 MB). So I thought, "No problem, let's just cache them by saving the object(s) to .Rdata files." This works, but loading the cached objects makes my Rterm process go through the roof with respect to RAM consumption (over 1 GB!!), and I still don't really understand why (this doesn't happen when I "compute" the objects all anew, but that's exactly what I'm trying to avoid since it takes too long).
I already thought about maybe serializing it, but I haven't tested it as I would need to refactor my code a bit. Plus I'm not sure if it would affect the "loading back into R" part in just the same way as loading .Rdata files.
Question
Can anyone tell me why loading a previously computed object has such effects on memory consumption of my Rterm process (compared to computing it in every new process I start) and how best to avoid this?
If desired, I will also try to come up with an example, but it's a bit tricky to reproduce my exact scenario. Yet I'll try.
It's likely because the environments you are creating are carrying around their ancestors. If you don't need the ancestor information, then set the parents of such environments to emptyenv() (or just don't use environments if you don't need them).
Also note that formulas (and, of course, functions) have environments so watch out for those too.
If it's not reproducible by others, it will be hard to answer. However, I do something quite similar to what you're doing, yet I use JSON files to store all of my values. Rather than parse the text, I use RJSONIO to convert everything to a list, and getting stuff from a list is very easy. (You could, if you want, convert to a hash, but it's nice to have layers of nested parameters.)
See this answer for an example of how I've done this kind of thing. If that works out for you, then you can forego the expensive translation step and the memory ballooning.
(Taking a stab at the original question...) I wonder if your issue is that you are using an environment rather than a list. Saving environments might be tricky in some contexts. Saving lists is no problem. Try using a list or try converting to/from an environment. You can use the as.list() and as.environment() functions for this.

Core Data and threading

What are some of the obscure pitfalls of using Core Data and threads? I've read much of the documentation, and so far I've come across the following either in the docs or through painful experience:
Use a new NSManagedObjectContext for each thread, but a single NSPersistentStoreCoordinator is enough for the whole app.
Before sending an NSManagedObject's objectID back to the main thread (or any other thread), be sure the context has been saved (or at a minimum, it wasn't a newly-inserted-but-not-yet-saved object) - otherwise the objectID will actually be a temporary ID and not a persistent one.
Use mergeChangesFromContextDidSaveNotification: to detect when a save happens in another thread and use that to merge those changes with the current thread's context.
Bonus question/observation: I was led to believe by the wording of some of the docs that mergeChangesFromContextDidSaveNotification: is something only needed by the main thread to merge changes into the "main" context from worker threads - but I don't think that's the case.
I set up my importer to create batches of data which are imported using a subclass of NSOperation that owns its own context. The operations are loaded into an NSOperationQueue that's set to allow the default number of concurrent operations, so it's possible for several import batches to be running at the same time. I would occasionally get very strange validation errors and exceptions (like trying to add nil to a relationship) and other failures that I had never seen when I did all the same work on the main thread. It occurred to me (and perhaps this should have been obvious) that maybe the context merging needed to be done for all contexts in every thread - not just the "main" one! I don't know why I didn't think of that before, but I think this helped. (It hasn't been tested well enough yet for me to feel sure, though.) In any case, is it true that you need to observe that notification for ALL import threads that may be working with the same datasets and adding/updating the same entities? If so, this is yet another pitfall bullet point, IMO, although I have yet to be certain that it'll work.
Given how many of these I've run into with Core Data in general (and not all of them just about multi-threading), I have to wonder how many more are lurking. Since multi-threading so often ends up with bugs that are difficult if not impossible to reproduce due to the timing issues, I figured I'd ask if anyone had other important things that I may be missing that I need to concern myself with.
There is an entire rather large bit of documentation devoted to the subject of Core Data and Threading.
It isn't clear from your set of issues what isn't covered by that documentation.

Resources