Should I be expecting the same performance as a standard Java HashMap for in-memory reads/writes? - chronicle

My use case is that I need to store millions of symbols in a map where the key is a String, e.g.
"IBM"
and the value is a JSON string with information about the symbol, e.g.
"{ "Symbol": "IBM", "AssetType": "Common Stock", "Name": "International Business Machines Corporation" }".
When using a persisted ChronicleMap to store 25 million of these entries I'm getting noticeably worse performance than with a standard Java HashMap.
Some ballpark numbers: inserting 25 million records into a HashMap takes about ~70 seconds, vs ~125 seconds for ChronicleMap. Reading all the entries back from the HashMap takes 5 seconds vs 20 seconds for ChronicleMap.
I set averageKey/averageValue to sensible settings and I generously sized entries to 50 million, as I saw other posts suggesting.
I'm really just asking what my expectations should be here. Are the ballpark figures above in line with what ChronicleMap should be capable of compared to a normal HashMap?
Or am I wrong to treat it as a normal HashMap, and do factors such as the size of the data I'm putting in mean that performance will naturally differ between a standard HashMap and ChronicleMap?

That seems reasonable if you are persisting the data; a HashMap isn't persisted. ChronicleMap is capable of much more throughput. I would look at how many threads you are using.
ChronicleMap is not a normal HashMap in that it stores copies of the data off-heap, so it has no impact on GCs. There is a copy cost each way, however. Ideally, you would store the data as a Java object rather than JSON, but it should still work.
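For reference, a minimal sketch of how such a persisted map might be set up with the Chronicle Map 3.x builder is shown below; the file name and sample JSON are placeholders, and the sizing hints should come from your own data.

import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;

public class SymbolMapExample {
    public static void main(String[] args) throws Exception {
        String sampleJson = "{\"Symbol\":\"IBM\",\"AssetType\":\"Common Stock\","
                + "\"Name\":\"International Business Machines Corporation\"}";

        // Persisted, off-heap map: averageKey/averageValue guide the per-entry
        // allocation, and entries() should be a realistic upper bound.
        try (ChronicleMap<String, String> symbols = ChronicleMap
                .of(String.class, String.class)
                .name("symbols")
                .averageKey("IBM")
                .averageValue(sampleJson)
                .entries(50_000_000)
                .createPersistedTo(new File("symbols.dat"))) {

            symbols.put("IBM", sampleJson);
            System.out.println(symbols.get("IBM"));
        }
    }
}

Because every put serializes the value off-heap and every get deserializes a copy back, some overhead versus an on-heap HashMap is expected; a compact value type instead of a JSON String reduces that copy cost.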

Related

boltdb data format for (space) efficient storage

I need to store business data from MySQL into bolt. The data is a map[string]string that looks like this:
{"id": "<uuid>", "shop_id":"12345678", "date": "20181019"... }
Since the amount of data will be huge and keeps increasing, apart from splitting the data into standalone files (such as 201810.db), I would like to make the final file as small as possible. So I plan to encode the data myself using "lookup tables": since all keys are the same for every data item, I map id=1, shop_id=2, ... so that each key only consumes 1-2 bytes. I do the same encoding for values, so that columns which are highly redundant (i.e. where select distinct returns only a few results) consume less space in the bolt file.
Now my question is: how does bolt store keys and values? If I use the method described above, will bolt store more objects per page, so that space efficiency eventually improves?
Or, since it uses "pages", does even storing one byte of data consume a full page? If that is the case, do I have to manually group a bunch of objects until their combined size exceeds a bolt page in order to keep pages fully filled? Of course, this would hurt random access, but for my application that can be accepted at the expense of increased coding complexity.
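Bolt itself is a Go library, but the lookup-table encoding described in the question can be sketched independently of the store. The snippet below (Java, to match the rest of this page; the field codes and the one-byte length prefix are made up) just shows field names being replaced by one-byte ids before the record is written as a value.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LookupTableEncoding {
    // Every record uses the same field names, so one byte per field is enough
    // (the codes below are illustrative, not a fixed format).
    private static final Map<String, Integer> FIELD_CODES = Map.of(
            "id", 1,
            "shop_id", 2,
            "date", 3);

    static byte[] encode(Map<String, String> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> e : record.entrySet()) {
            byte[] value = e.getValue().getBytes(StandardCharsets.UTF_8);
            out.write(FIELD_CODES.get(e.getKey())); // 1-byte field id instead of the key string
            out.write(value.length);                // naive 1-byte length prefix (values under 256 bytes)
            out.writeBytes(value);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "<uuid>");
        row.put("shop_id", "12345678");
        row.put("date", "20181019");
        System.out.println("encoded size: " + encode(row).length + " bytes");
    }
}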

Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold up to the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys would be similar in length, the length of the values could vary widely: anything from a few bytes up to tens of MB. Still, the majority of values will be between 500 and 1000 bytes long.
Having read a bit about ChronicleMap, I am amazed about its features and am wondering why I can't find articles describing it being used as a general key-value database. To me there seem to be a lot of advantages of using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?
I voted for closing this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in sorted order by key.
A limitation of the current implementation is that you need to specify the number of entries that will be stored in the map in advance. If the actual number isn't close to the specified number, you will over-use memory and disk (not very severely, though, on Linux systems); but if the actual number of entries exceeds the specified number by approximately 20% or more, operation performance starts to degrade, and the performance hit grows linearly as the number of entries grows further. See https://github.com/OpenHFT/Chronicle-Map/issues/105
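To make that sizing requirement concrete, a builder for the use case in the question might look roughly like the sketch below (Chronicle Map 3.x assumed; the numbers come from the question, the file name is a placeholder, and maxBloatFactor is an optional knob in recent 3.x versions for tolerating some overshoot of the entry estimate).

import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;

public class KeyValueStoreExample {
    public static void main(String[] args) throws Exception {
        try (ChronicleMap<String, byte[]> store = ChronicleMap
                .of(String.class, byte[].class)
                .averageKey("representative-key")
                .averageValue(new byte[750])   // most values are 500-1000 bytes
                .entries(100_000_000)          // must be estimated up front
                .maxBloatFactor(2.0)           // allow some growth past the estimate
                .createPersistedTo(new File("kv-store.dat"))) {

            store.put("representative-key", new byte[] {1, 2, 3});
        }
    }
}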

Collecting large statistical sets with pg_stat_statements?

According to the Postgres pg_stat_statements documentation:
The module requires additional shared memory proportional to
pg_stat_statements.max. Note that this memory is consumed whenever the
module is loaded, even if pg_stat_statements.track is set to none.
and also:
The representative query texts are kept in an external disk file, and
do not consume shared memory. Therefore, even very lengthy query texts
can be stored successfully. However, if many long query texts are
accumulated, the external file might grow unmanageably large.
From these it is unclear what the actual memory cost of a high pg_stat_statements.max would be - say at 100k or 500k (the default is 5k). Is it safe to set the value that high, and what could be the negative repercussions of such high levels? Would aggregating statistics into an external database via logstash/fluentd be the preferred approach above certain sizes?
1.
From what I have read, it hashes the query and keeps the hash and counters in shared memory, saving the query text to the filesystem. So the following concern is more relevant than overloading shared memory:
if many long query texts are accumulated, the external file might grow
unmanageably large
The hash of the text is so much smaller than the text itself that I don't think you should worry about the extension's memory consumption where long queries are concerned, especially since the extension uses the query analyser (which runs for EVERY query ANYWAY):
the queryid hash value is computed on the post-parse-analysis
representation of the queries
Setting pg_stat_statements.max 10 times higher should take roughly 10 times more shared memory, I believe. The growth should be linear. The documentation does not say so explicitly, but logically it should be the case.
There is no definitive answer on whether it is safe to set it to a particular value, because there is no data on your other configuration values and hardware. But as the growth should be linear, consider this: if you set it to 5k and query runtime has grown by almost nothing, then setting it to 50k will prolong it by almost nothing times ten. BTW, my question: who is going to dig through 50,000 slow statements? :)
2.
This extension already pre-aggregates statistics per normalized statement (constants stripped out). You can select them straight from the DB, so moving the data to another database and selecting it there only gives you the benefit of unloading the original DB and loading another. In other words, you save 50 MB for a query on the original, but spend the same on the other. Does it make sense? For me, yes. This is what I do myself. But I also save execution plans for the statements (which is not part of the pg_stat_statements extension). I believe it depends on what you have and what you need. There is definitely no need for it just because of the number of queries alone - unless the external query-text file grows so large that the extension has to fall back on its recovery behaviour:
As a recovery method if that happens, pg_stat_statements may choose to
discard the query texts, whereupon all existing entries in the
pg_stat_statements view will show null query fields
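If you do query the view in place, as suggested above, a small JDBC sketch could look like the following (connection details are placeholders; note that the timing column is total_time in older Postgres releases and total_exec_time from version 13 on).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopStatements {
    public static void main(String[] args) throws Exception {
        // Adjust total_time to total_exec_time on PostgreSQL 13+.
        String sql = "SELECT queryid, calls, total_time, query "
                + "FROM pg_stat_statements ORDER BY total_time DESC LIMIT 20";

        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/postgres", "postgres", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%d calls, %.1f ms total: %s%n",
                        rs.getLong("calls"), rs.getDouble("total_time"),
                        rs.getString("query"));
            }
        }
    }
}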

How to Quickly Update Mongo Documents String Fields with Complex Functions

What is the fastest way to update documents in a Mongo database with complex functions, let's say a string search / replace or a sqrt calculation?
Since such operations are missing (there is no $replace, for example), it is not possible with update (which would probably be the fastest way, since on my test collection it only takes about 50 ms to set a field on some 100k objects).
When I simply iterate over all documents it takes about 45 seconds. It gets a little faster when I limit my query to the fields I'm using during the update.
This time of course grows larger on larger collections, hence the question of whether there is a faster way than iterating over the collection (e.g. via a map-reduce job?).
No :) Without native support for such functionality you'll be stuck with a read -> modify -> write approach. That said, if a machine can write a field to 100k objects in 50 ms, then reading, modifying and writing those same documents shouldn't take anywhere near 45 seconds. Are you sure the bottleneck is the database rather than the machine running that pass? Are you sure you're batching appropriately and not doing an update per document?
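For what it's worth, a batched read-modify-write pass with the MongoDB Java driver might look roughly like the sketch below; the database, collection and field names are invented, and the "complex function" is just a client-side string replace standing in for whatever transformation you need.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Projections;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class BulkStringReplace {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("items");

            List<WriteModel<Document>> batch = new ArrayList<>();
            // Only read the fields the transformation actually needs.
            for (Document doc : coll.find().projection(Projections.include("_id", "name"))) {
                String updated = doc.getString("name").replace("foo", "bar");
                batch.add(new UpdateOneModel<>(
                        Filters.eq("_id", doc.get("_id")),
                        Updates.set("name", updated)));
                if (batch.size() == 1000) {   // flush in batches instead of one update per document
                    coll.bulkWrite(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                coll.bulkWrite(batch);
            }
        }
    }
}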

Riak performance - unexpected results

In the last few days I have played a bit with riak. The initial setup was easier than I thought. Now I have a 3-node cluster, all nodes running on the same VM for the sake of testing.
I admit the hardware settings of my virtual machine are very much downgraded (1 CPU, 512 MB RAM), but still I am quite surprised by the slow performance of riak.
Map Reduce
Playing a bit with map reduce, I had around 2000 objects in one bucket, each about 1k-2k in size as JSON. I used this map function:
function(value, keyData, arg) {
  // decode the stored JSON object and keep it only if displayname contains "max"
  var data = Riak.mapValuesJson(value)[0];
  if (data.displayname.indexOf("max") !== -1) return [data];
  return [];
}
And it took over 2 seconds just to perform the HTTP request and return its result, not counting the time it took in my client code to deserialize the results from JSON. Removing 2 of the 3 nodes seemed to slightly improve the performance to just below 2 seconds, but this still seems really slow to me.
Is this to be expected? The objects were not that large in byte size, and 2000 objects in one bucket isn't that much either.
Insert
Batch inserting around 60,000 objects of the same size as above took rather long and actually didn't really work.
My script which inserted the objects into riak died at around 40,000 or so and said it couldn't connect to the riak node anymore. In the riak logs I found an error message which indicated that the node ran out of memory and died.
Question
This is really my first shot at riak, so there is definitely the chance that I screwed something up.
Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I used for interacting with riak is the limiting factor here?
Running all nodes on the same physical machine is rather stupid, but if this is a problem - how can I better test the performance of riak?
Is map reduce really that slow? I read about the performance hit of map reduce on the riak mailing list, but if map reduce is slow, how are you supposed to perform "queries" for data needed in near real time? I know that riak is not as fast as redis.
It would really help me a lot if anyone with more experience in riak could help me out with some of these questions.
This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.
Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large you bucket is, enumerating the keys in a single bucket of size n will take m time, where m is the total number of keys in all buckets.
These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large amounts of keys. Cassandra also implements true buckets.
A recommendation I'd have, now that some time has passed and several new versions of Riak have come about, is this: never rely on full-bucket map/reduce. That's not an optimized operation, and chances are very good there are other ways to structure your map/reduce so you don't have to look through so much data to pull out the few records you need.
Secondary indices, now available in newer versions of Riak, are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce over a secondary index with hundreds of thousands of keys in microseconds, where a full bucket scan would have taken multiple seconds.
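As a rough illustration of the secondary-index approach over the HTTP interface (assuming a local node on the default port 8098 and a backend that supports 2i, such as eleveldb; the bucket, key and index names here are made up):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RiakSecondaryIndexExample {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Store an object and tag it with an integer secondary index.
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8098/buckets/users/keys/max1"))
                .header("Content-Type", "application/json")
                .header("x-riak-index-ismax_int", "1")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"displayname\":\"max mustermann\"}"))
                .build();
        http.send(put, HttpResponse.BodyHandlers.discarding());

        // Query the index directly instead of map/reducing over the whole bucket;
        // the response is a JSON document listing the matching keys.
        HttpRequest query = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8098/buckets/users/index/ismax_int/1"))
                .build();
        HttpResponse<String> keys = http.send(query, HttpResponse.BodyHandlers.ofString());
        System.out.println(keys.body()); // e.g. {"keys":["max1"]}
    }
}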
I don't have direct experience of Riak, but have worked with Cassandra a little, which is similar.
Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.
Secondly, MapReduce is designed for batch processing, not realtime queries.
Riak and all similar Key-Value stores are designed for high write performance, high read performance for simple lookups, no complex querying at all.
Just for comparison, Cassandra on a single node (6 core, 6GB) can do 20,000 individual inserts per second.
