How to do Lazy Map deserialization in Haskell - performance

Similar to this question by Gabriel Gonzalez: How to do fast data deserialization in Haskell
I have a big Map full of Integers and Text that I serialized using cereal. The file is about 10M.
Every time I run my program I deserialize the whole thing just so I can look up a handful of the items. Deserialization takes about 500ms, which isn't a big deal, but I always seem to like profiling on Friday.
It seems wasteful to always deserialize 100k to 1M items when I only ever need a few of them.
I tried decodeLazy and also changing the map to a Data.Map.Lazy (not really understanding how a Map can be lazy, but OK, it's there), and this has no effect on the time, except maybe it's a little slower.
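For reference, roughly what the eager load looks like (a sketch; names are illustrative, and Text needs a Serialize instance, e.g. from the cereal-text package):

import qualified Data.ByteString.Lazy as BL
import qualified Data.Map as M
import Data.Serialize (decodeLazy)
import Data.Text (Text)

-- Reads and decodes the entire file up front, even though we only
-- ever look up a handful of keys afterwards.
loadCache :: FilePath -> IO (Either String (M.Map Integer Text))
loadCache path = decodeLazy <$> BL.readFile path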
I'm wondering if there's something that can be a bit smarter, only loading and decoding what's necessary. Of course, a database like SQLite can be very large, but it only loads what it needs to complete a query. I'd like to find something like that without having to create a database schema.
Update
You know what would be great? Some fusion of Mongo with SQLite. Like you could have a JSON document database using flat-file storage ... and of course someone has done it https://github.com/hamiltop/MongoLiteDB ... in Ruby :(
Thought mmap might help. Tried the mmap library and segfaulted GHCi for the first time ever. No idea how I can even report that bug.
Tried the bytestring-mmap library and that works, but with no performance improvement. Just replacing this:
ser <- BL.readFile cacheFile
With this:
ser <- unsafeMMapFile cacheFile
Update 2
keyvaluehash may be just the ticket. Performance seems really good. But the API is strange and documentation is missing, so it will take some experimenting.
Update 3: I'm an idiot
Clearly what I want here is not lazier deserialization of a Map. I want a key-value database, and there are several options available, like dvm, tokyo-cabinet and this levelDB thing I've never seen before.
Keyvaluehash looks to be a native-Haskell key-value database, which I like, but I still don't know about its quality. For example, you can't ask the database for a list of all keys or all values (the only real operations are readKey, writeKey and deleteKey), so if you need that, you have to store it somewhere else. Another drawback is that you have to tell it a size when you create the database. I used a size of 20M so I'd have plenty of room, but the actual database it created occupies 266M. No idea why, since there isn't a line of documentation.

One way I've done this in the past is to just make a directory where each file is named by a serialized key. One can use unsafeInterleaveIO to "thunk" the deserialized contents of each read file, so that values are only forced on read...
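A minimal sketch of that idea (file layout and names are illustrative; assumes cereal for the value encoding):

import qualified Data.ByteString as BS
import Data.Serialize (Serialize, decode)
import System.FilePath ((</>))
import System.IO.Unsafe (unsafeInterleaveIO)

-- The file is only opened and decoded when the caller forces the result.
lazyLookup :: Serialize v => FilePath -> String -> IO (Either String v)
lazyLookup dir key = unsafeInterleaveIO $ do
    bytes <- BS.readFile (dir </> key)
    return (decode bytes)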

Related

I want to create a desktop app with database-like search functions but without the SQL database

I know basic SQL, and SQL is all I know when it comes to storing and retrieving data. I want to create 1 .exe, and it should contain all ~100,000 key-value pairs (I have the data in .txt files) and maybe an extra attribute for a description (this I would add myself, like a note to myself).
I also would like to write it in a new language I don't know yet, like Python or C# (I have made desktop apps written in Java & VB.net, all with SQL databases). So language will not be an issue, and I would appreciate suggestions.
These key-value pairs might not need to be updated, and I'm willing to re-compile/repackage the code to make 1 change in the data. The key is 6 letters followed by 2 digits, like hxnaaa01. Each of these letters represents or describes something about the entry itself, so I would also need to search for a specific letter at a specific position to get exactly what I need.
I know that regex would work well for what I need, but all I mentioned is all I know. I don't know enough, and I don't know what keywords to google.
I have read about XML and CSV. I don't really know what they are, and I'm not sure how all of this would fit in 1 executable.
To summarize, I need:
1 executable (Windows Desktop App)
Search function over ~100k KVPs + 1 more attribute (using regex?)
no database
with GUI
ability to add a "note" to each KVP
should be fast and lightweight
1 executable (Windows Desktop App), no database
Data persistence will require either additional files or a database; it's pretty much unavoidable. You can store data in memory, but it's only persisted for as long as it resides there.
You have another requirement: "fast and lightweight".
To achieve this requirement, you'll need to really think about your solution: what technology you use and how you can improve it in the future.
Although searching through data is pretty trivial, an efficient solution is not. It requires upfront research into algorithms, data structures and general practices (which is a rabbit hole in itself).
In the case of JSON [1], you'll need to create an additional file to contain all your key/value pairs; you can use C# to create the extra file (on first launch, for example).
JSON promises to be lightweight; I tend to agree, though some may not. Dealing with the filesystem, however, is often far from a lightweight solution.
JSON is very readable though:
{
    "key": "value",
    "comment": "oh this is cool"
}
There are a lot of factors that play into something being fast and lightweight, so there's a need for some research on your part.
Honestly, depending on your experience, I wouldn't focus so much on the fast; I'd focus more on it working, then refactor that into something fast if it's too slow. [2]
And again, depending on your experience, I'd stick to opening the file, using a for loop to find my key, doing something with the data found, and rewarding myself for having something that works.
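A minimal sketch of that approach in C#, assuming the pairs are shipped in a data.json file next to the .exe (the file name, field layout and regex are all illustrative):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Load all ~100k pairs once at startup.
        var json = File.ReadAllText("data.json");
        var pairs = JsonSerializer.Deserialize<Dictionary<string, string>>(json);

        // Example: keys whose third letter is 'n' and that end in two digits.
        var pattern = new Regex("^..n...[0-9]{2}$");
        foreach (var kvp in pairs)
        {
            if (pattern.IsMatch(kvp.Key))
                Console.WriteLine($"{kvp.Key}: {kvp.Value}");
        }
    }
}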
TL;DR: you need either a file or a database for truly persistent storage; JSON or a remotely hosted MySQL would work. Try not to focus too much on fast before you have something that works.
[1] https://www.json.org/json-en.html
[2] https://stackify.com/premature-optimization-evil/
https://stackoverflow.com/a/5581595/2932298

does ElasticSearch preserve disk storage when saving same value on same field

Let's say I am sending log entries to ElasticSearch. We are considering adding the calling method, calling class, and line of code to our log entries. Since these fields will contain similar values, would ElasticSearch attempt to preserve disk space by not copying this data for every occurrence of the same value?
EDIT - Additional clarification: I did not read anywhere that Elastic does this. I know that some data storage systems, like columnar databases, write their data to disk in a way that preserves disk storage by not writing duplicated data over and over again. So I am wondering if ElasticSearch implements similar techniques.
As far as I know: no, it doesn't. It would make several key features quite hard, I believe, and I have not seen any reference to this practice.
It's tricky to 'prove' the non-existence of such a mechanism unless you look at all the source code, but I would expect this page about disk usage tuning to contain references to this practice if it existed.
Did you read anywhere about this, or does it just seem practical to you?
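For what it's worth, the knobs that page does document are generic compression rather than deduplication; for example, choosing the best_compression codec when an index is created (the index name here is illustrative):

PUT /my-logs
{
  "settings": {
    "index": {
      "codec": "best_compression"
    }
  }
}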

How to deactivate safe mode in the mongo shell?

Short question is in the title: I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long Question for those willing to know the context:
I am working on a huge set of data like
{
    _id: ObjectId("azertyuiopqsdfghjkl"),
    stringdate: "2008-03-08 06:36:00"
}
and some other fields. There are about 250M documents like that (the whole database with indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how I could make an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, and resolved myself to write a script that takes the documents one after the other and makes an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
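Roughly, the script looks like this (legacy mongo shell; collection and field names as in the question):

// Set the real date field on every document that doesn't have one yet.
db.data.find({ date: { $exists: false } }).forEach(function (doc) {
    db.data.update(
        { _id: doc._id },
        { $set: { date: new Date(doc.stringdate) } }
    );
});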
Problem is that it takes a very long time. I already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added. I also set an index on a relevant field so I can process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to ensure that the limiting factor is database lock availability and not any other factor like CPU or network costs.
I monitored the whole thing with mongotop, mongostat and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed that mongodb does not have more precise granularity on its write lock; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection onto a dozen shards, even while staying on the same server, because then there would have been an individual lock on each shard.
But since I can't do a thing right now about the current database structure, I looked into how to improve performance so that I at least spend 90% of my time writing to mongo (up from 70% currently), and I figured out that since I run my script in the default mongo shell, every time I make an update there is also a getLastError() call afterwards, which I don't want, because there is a 99.99% chance of success, and even in case of failure I can still run an aggregation query after the end of the big process to retrieve the few exceptions.
I don't think I would gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestion?
I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) (http://docs.mongodb.org/manual/reference/method/db.getLastError/) to do what you want, but it won't help.
This is because, for one:
write a script that takes the documents one after the other and makes an update to set a new field whose value is new Date(stringdate)
When using the shell in a non-interactive mode, like within a loop, it doesn't actually call getLastError(). As such, lowering your write concern to 0 will do nothing.
I already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug, but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interfaces which confirmed that the write lock is taken 70% of the time
That's because of all the movement in your documents; that's kinda hard to fix.
I am a bit disappointed mongodb does not have more precise granularity on its write lock
The write lock doesn't actually denote the concurrency of MongoDB; this is another common misconception that stems from transactional SQL technologies.
Write locks in MongoDB are mutexes, for one.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

Why does loading cached objects drastically increase memory consumption when computing them does not?

Relevant background info
I've built a small application that can be customized via a config file. The config file is parsed and translated into a nested environment structure (e.g. .HIVE$db = an environment, .HIVE$db$user = "Horst", .HIVE$db$pw = "my password", .HIVE$regex$date = some regex for dates, etc.)
I've built routines that can handle those nested environments (e.g. look up the value "db/user" or "regex/date", change it, etc.). The thing is that the initial parsing of the config files takes a long time and results in quite a big object (actually three to four objects, between 4 and 16 MB). So I thought, "No problem, let's just cache them by saving the object(s) to .Rdata files". This works, but "loading" cached objects makes my Rterm process go through the roof with respect to RAM consumption (over 1 GB!!), and I still don't really understand why (this doesn't happen when I "compute" the object all anew, but that's exactly what I'm trying to avoid, since it takes too long).
I already thought about maybe serializing it, but I haven't tested it as I would need to refactor my code a bit. Plus I'm not sure if it would affect the "loading back into R" part in just the same way as loading .Rdata files.
Question
Can anyone tell me why loading a previously computed object has such effects on memory consumption of my Rterm process (compared to computing it in every new process I start) and how best to avoid this?
If desired, I will also try to come up with an example, but it's a bit tricky to reproduce my exact scenario. Yet I'll try.
It's likely because the environments you are creating are carrying around their ancestors. If you don't need the ancestor information, then set the parents of such environments to emptyenv() (or just don't use environments if you don't need them).
Also note that formulas (and, of course, functions) have environments, so watch out for those too.
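A minimal sketch of that suggestion, assuming .HIVE is a pure data tree (nothing in it needs to look names up through its enclosing environments):

# Detach every environment in the tree from its ancestors before saving.
strip_parents <- function(env) {
  parent.env(env) <- emptyenv()
  for (nm in ls(env, all.names = TRUE)) {
    val <- env[[nm]]
    if (is.environment(val)) strip_parents(val)
  }
  invisible(env)
}

strip_parents(.HIVE)
save(.HIVE, file = "hive-cache.Rdata")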
If it's not reproducible by others, it will be hard to answer. However, I do something quite similar to what you're doing, yet I use JSON files to store all of my values. Rather than parse the text, I use RJSONIO to convert everything to a list, and getting stuff from a list is very easy. (You could, if you want, convert to a hash, but it's nice to have layers of nested parameters.)
See this answer for an example of how I've done this kind of thing. If that works out for you, then you can forego the expensive translation step and the memory ballooning.
(Taking a stab at the original question...) I wonder if your issue is that you are using an environment rather than a list. Saving environments can be tricky in some contexts, whereas saving lists is no problem. Try using a list, or try converting to/from an environment. You can use the as.list() and as.environment() functions for this.
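For example, a sketch of that round-trip with recursive helpers (as.list()/list2env() only convert one level, so nested environments need the recursion; names are illustrative):

env_to_list <- function(x) {
  if (is.environment(x)) lapply(as.list(x, all.names = TRUE), env_to_list)
  else x
}

list_to_env <- function(x) {
  if (is.list(x)) list2env(lapply(x, list_to_env),
                           envir = new.env(parent = emptyenv()))
  else x
}

hive <- env_to_list(.HIVE)           # plain list: safe to save
save(hive, file = "hive-cache.Rdata")

load("hive-cache.Rdata")             # later, in a fresh session
.HIVE <- list_to_env(hive)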

Organizing memcache keys

I'm trying to find a good way to handle memcache keys for storing, retrieving and updating data to/from the cache layer in a more civilized way.
Found this pattern, which looks great, but how do I turn it into a functional part of a PHP application?
The Identity Map pattern: http://martinfowler.com/eaaCatalog/identityMap.html
Thanks!
Update: I have been told about the modified memcache (memcache-tag) that apparently does do a lot of this, but I can't install Linux software on my Windows development box...
Well, memcache use IS an identity map pattern. You check your cache, then you hit your database (or whatever else you're using). You can go about finding information about the source by storing objects instead of just values, but you'll take a performance hit for that.
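A minimal sketch of that flow using the Memcached extension (the class and the $loadFromSource callback are illustrative, not a standard API):

<?php
class CacheIdentityMap
{
    private $memcached;
    private $loaded = [];   // one in-process instance per key

    public function __construct(Memcached $memcached)
    {
        $this->memcached = $memcached;
    }

    public function find($key, callable $loadFromSource)
    {
        if (array_key_exists($key, $this->loaded)) {
            return $this->loaded[$key];          // same instance every time
        }
        $value = $this->memcached->get($key);
        if ($this->memcached->getResultCode() === Memcached::RES_NOTFOUND) {
            $value = $loadFromSource($key);      // fall back to the database
            $this->memcached->set($key, $value);
        }
        return $this->loaded[$key] = $value;
    }
}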
You effectively cannot ask the cache what it contains as a list. To mass-invalidate, you'll have to keep a list of what you put in and iterate over it, or you'll have to iterate over every possible key that could fit the pattern of concern. The resource you point out, memcache-tag, can simplify this, but it doesn't appear to be maintained in line with the memcache project.
So your options right now are iterative deletes, or totally flushing everything that is cached. Thus, I propose that a design consideration is the question you should be asking. In order to get a useful answer for you, I ask: why do you want to do this?
