Berkeley DB File compression - berkeley-db-je

This is a follow-up to my previous question.
We are using Berkeley DB as temporary storage before the data is processed and stored in a relational DB. The problem arises when the size grows beyond a certain point. We now have to either split the files into smaller ones or compress the existing files. In this question I want to ask about the compression part: does Berkeley DB have any built-in compression utility, or do we have to do it programmatically? If it is built in, then it will presumably be faster.

From here:
According to the Berkeley DB FAQ there are two ways of optimizing it (before compression):
Compact
Vacuum
You can also implement your own compression algorithm as shown here.
How different is the Berkeley DB VACUUM from SQLite's?
SQLite implements the VACUUM command as a database dump followed by a complete reload from that dump. It is an expensive operation, locking the entire database for the duration of the operation. It is also an all-or-nothing operation: either it works, or it fails and you have to try again sometime. When SQLite finishes, the database is frequently smaller in size (the file size is smaller) and the btree is better organized (shallower) than before due to in-order key insertion of the data from the dump file. SQLite, when it works and when you can afford locking everyone out of the database, does a good job of VACUUM.
Berkeley DB approaches this in a completely different way. For many releases now Berkeley DB's B-Tree implementation has had the ability to compact while other operations are in-flight. Compacting is a process wherein the B-Tree nodes are examined and, when less than optimal, they are re-organized (reverse split, etc.). The shallower your B-Tree, the fewer lookups are required to find the data at a leaf node. Berkeley DB can compact sections of the tree, or the whole tree at once. For 7x24x365 (five-nines) operation this is critical. The BDB version of compact won't adversely impact ongoing database operations, whereas SQLite's approach does. But compaction doesn't address empty sections of the database (segments of the database file where deleted data once lived). Berkeley DB also supports compression of database files by moving data within the file, then truncating the file, returning that space to the filesystem. As of release 5.1 of Berkeley DB, the VACUUM command will compact and compress the database file(s). This operation takes more time than the dump/load approach of SQLite because it is doing more work to allow the database to remain operational. We believe this is the right trade-off, but if you disagree you can always dump/load the database in your code.
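If you end up doing the compression programmatically, one common approach is to compress each value before it goes into the database and decompress it on the way out. Below is a rough sketch using plain java.util.zip on top of Berkeley DB Java Edition; the environment/database setup is omitted and the helper names are just illustrative, so treat it as a starting point under those assumptions rather than a finished utility:

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.OperationStatus;

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class CompressedValues {

        // Deflate-compress a value before storing it under the given key.
        static void putCompressed(Database db, byte[] key, byte[] value) throws Exception {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setInput(value);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            db.put(null, new DatabaseEntry(key), new DatabaseEntry(out.toByteArray()));
        }

        // Fetch the stored bytes and inflate them back to the original value.
        static byte[] getDecompressed(Database db, byte[] key) throws Exception {
            DatabaseEntry data = new DatabaseEntry();
            if (db.get(null, new DatabaseEntry(key), data, null) != OperationStatus.SUCCESS) {
                return null;
            }
            Inflater inflater = new Inflater();
            inflater.setInput(data.getData(), data.getOffset(), data.getSize());
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
            inflater.end();
            return out.toByteArray();
        }
    }

Whether this beats splitting the files depends on how compressible your values are: record-level compression shrinks the payloads, while compact/vacuum is still what returns free pages to the filesystem.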


How is Hazelcast distributed cache faster than a call to the DB?

Let's say I have 2 servers that are using Hazelcast's distributed cache. If on server #1 I store 2 items in a map in that distributed cache, one of those items will be saved in the local backup, and the other will be stored in the backup of the other server's Hazelcast instance (please correct me if that is incorrect).
My question is: if I try to retrieve the second item from the cache (stored in the backup on server #2), a TCP call will be made to retrieve that data. How is this faster than just calling the DB?
First of all, let me correct how data is stored in Hazelcast.
Hazelcast uses a distribution algorithm based on consistent hashing, meaning the hashing algorithm returns the same output for the same input every time. The distribution is not 100% equal, but for a high number of elements it is pretty good and cost effective. That said, it doesn't guarantee one element per node; in the worst case both elements can end up on the same node.
By default Hazelcast also keeps one backup, which means each node will hold both elements (in a 2-node setup), either as owned data or as a backup for the failure case. You can make backups readable (read-from-backup=true); however, that introduces a slight chance of reading stale data (in the window after the owner is updated but before the backup is).
In addition, data in Hazelcast is, again by default, stored in serialized form, i.e. a binary, streamable representation.
OK, so how can all this be faster than a TCP connection to your database?
The answer is twofold:
Hazelcast is a key-value store. Therefore it is optimized for requesting data by key and answering with the value as quickly as possible.
Data is already serialized, therefore the byte stream is just "smashed" into the socket without any real further work to be done.
Your database, on the other hand, has to really query data from a table. The internal data structures that hold the information are optimized for complex queries, not for access by key. But, and this is important, current database implementations also optimize internally (in RAM) for fast access, so the effect only shows up for databases serving under high load. Caches (local or distributed) are designed to speed up slow operations, which means: if your database is blazingly fast, you won't see a benefit.
Anyway, when designing a system that you expect to grow exponentially, you should consider caching right from the start. A comprehensive introduction into caching and the ideas behind it is available in a caching whitepaper and article I wrote some time ago: https://dzone.com/articles/caching-why-you-should-care
I hope this answers your question :-)
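For completeness, here is roughly what the backup-reading setup described above looks like in code. This is a sketch only: the map name and values are made up, and the import paths can differ slightly between Hazelcast versions (older 3.x releases expose IMap from com.hazelcast.core, for example):

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;

    public class BackupReadExample {
        public static void main(String[] args) {
            Config config = new Config();
            // One synchronous backup per entry, and allow reads to be served
            // from a local backup copy (accepting possibly stale values).
            MapConfig mapConfig = new MapConfig("items")
                    .setBackupCount(1)
                    .setReadBackupData(true);
            config.addMapConfig(mapConfig);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            IMap<String, String> items = hz.getMap("items");

            items.put("a", "first item");
            items.put("b", "second item");

            // If this node only holds "b" as a backup, the read can still be
            // served locally because read-backup-data is enabled; otherwise
            // it costs one network hop to the owning member.
            System.out.println(items.get("b"));

            hz.shutdown();
        }
    }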

Solutions for using columnar store with compression but allowing recent rows to be updated - Greenplum

I'm looking for specific examples of where someone has solved the following problem in Greenplum (or in another MPP db):
I have large fact tables that I would like to store in columnar orientation and compress (in Greenplum this would specifically be level 5 zlib compression).
However, each new row has a short time period during which it can be updated before it becomes "static" - for example, a value is allowed to change until some flag is raised. In Greenplum, to use compression I need to use "append only" table types, meaning the row cannot be directly updated safely.
So, in my inexperienced head, I believe an approach to this could be to have two tables - one holding the rows while they are allowed to be updated which can use standard HEAP storage and has no compression, and one holding only those that have become "static" (the vast majority) which is in columnar orientation and is compressed.
Clearly there are mechanics involved that make life more complex (unioning the two if I want to look at everything, triggering the deletion from one table and the insertion into the other when the row becomes static, etc.), so I'd really appreciate hearing about a real-life solution to this problem.
Thanks
Andy.
Greenplum 4.3.x allows you to update column-oriented tables; you should use this version.
In general, you can have a partitioned table, partitioned by the "updatable" flag (one partition should be AO/CO, i.e. append-only column-oriented, the other heap). This way your select queries won't be affected, but moving data from one partition to another would require deleting the data from the heap partition and inserting it into the column-oriented partition (not that complex to implement, though).

Use Vertica Database for OLTP data?

Can Vertica Database be used for OLTP data?
And if so what are the pros and cons on doing this?
Looking for a Vertica vs. Oracle fight :) Since an Oracle license is so costly, would Vertica do the job for a better price?
Thanks all.
Using Vertica as a transactional database is a bad idea. It's designed to be a data warehousing tool. Essentially, it reads and writes data in an optimized fashion. Lots of transactions? That's not what it is designed to do.
I would recommend that you look into VoltDB. Michael Stonebraker, who is the force behind Vertica, founded that company as well. His basic philosophy is that Oracle, SQL Server, et al. do not do well for high performance because they are designed to do everything. The future is databases designed for specific tasks.
So he had some concepts for data warehousing which became Vertica. For transactional databases, there's VoltDB. Not owned by HP, for the record.
For the record, I haven't used VoltDB. From what I know, it isn't as mature as Vertica is as a solution but it looks like it has a ton of promise.
HP Vertica is a column store database. The nature of the way that data is organised within a column store does not lend itself to rapid writes.
HP Vertica gets around this by having a WOS (Write Optimised Store) and ROS (Read Optimised Store which is file based).
Data is moved out of the WOS into the ROS fairly rapidly and the ROS itself has a "merge up" process that takes small ROS files and merges them together to form larger and therefore more easily scanned files.
If you tried to use Vertica for OLTP then what would happen would be that you'd get loads of ROS containers and possibly hit the default limit of 1024 ROS containers very quickly.
If you fronted the store with some form of queuing mechanism to pass through records in larger batches, this would result in fewer and larger ROS files. It would work, but if your OLTP system needs to read data very soon after it is written, it would not fit the use case.
The WOS/ROS mechanism is a neat workaround for the fundamental performance penalty of writes in a column store DB, but fundamentally Vertica is not an OLTP DB; rather, it is a data mart technology that can ingest data in near real time.
I think there are different ways to read into this question.
Can you use Vertica as an OLTP database?
First I'll define this question a bit. An OLTP database means the database itself is responsible for the transaction processing, not simply receiving somewhat normalized data.
My answer here is absolutely not, unless perhaps it is a single-user database. There is practically no RI, no RI locking, there are table locks on DELETE/UPDATE, and you're likely to accumulate a delete vector under normal OLTP-type usage.
You can work around some of these with some extensive middleware programming (distributed locks, heavy avoidance of DELETE/UPDATE, etc). But why? There are tons of options out there that are not Oracle, don't carry a huge price tag but give you everything you need for OLTP.
Can you use Vertica to ingest and query OLTP data?
Yes, definitely. Best to use Vertica towards its strengths, though. Queries in Vertica tend to have a fair amount of overhead, and you can plow through large amounts of data with ease, even normalized. I would not use Vertica primarily to run point queries, grabbing a few rows here and there. It isn't that you can't, but you can't do it with the same concurrency as other databases that are meant for this purpose.
TL;DR Use the right tool for the right job. I really love using Vertica, but just because I like to swing a hammer doesn't mean that every problem is a nail.
This question is a little old now, but I'll share my experience.
I would not suggest Vertica for OLTP unless you very carefully consider your workload.
As mentioned in other answers, Vertica has 2 types of storage. ROS is the Read Optimized Store and WOS is the Write Optimized Store. WOS is purely in memory, so it performs better for inserts but queries more slowly, as all the small updates need to be queried and unioned. Vertica can handle small loads in theory, but in practice it didn't work out very well for us performance-wise. There are also drawbacks to WOS, namely that when the database fails, WOS is not necessarily preserved when it rolls back to the last good epoch. (ROS isn't either, but in practice you lose a lot less from ROS.)
ROS is a lot more reliable and gives better read performance, but you will never be able to handle more than a certain number of queries without a careful design. Although Vertica is horizontally scalable, in practice large tables get segmented across all nodes, and therefore queries must run on all nodes. So adding more nodes doesn't mean handling more concurrent queries, it just means less work per query. If your tables are small enough to be unsegmented, this might not be an issue for you.
Also worth noting is that OLTP typically implies lots of concurrent transactions, so you'll need to plan resource pools very carefully. By default, Vertica's planned concurrency for the general resource pool is the minimum of the number of cores per server and RAM/2GB. Essentially, this value determines the default memory allocation PER NODE for a segmented query. Therefore, by default, Vertica will not let you run more queries than you have cores. You can adjust this value, but once you hit a cap on memory there's not much you can do, because the memory is allocated per node, so adding more nodes doesn't even help. If you hit any errors at all for resource pool memory allocations, that is the first config you should look at.
Additionally, Vertica is bad with deletes and updates (which resolve to a delete and an insert in the background), so if these are a regular part of your workload then Vertica is probably a bad choice. Personally, we use MySQL for our dimension tables that require deletes/updates and then sync that data periodically into Vertica to use for joins.
Personally, I use Vertica as an OLTP-ish, realtime-ish database. We batch our loads into 5-minute intervals, which keeps Vertica happy in terms of how many and how large the inserts are. These batches are inserted using COPY DIRECT so that they avoid WOS entirely (only do this for large batches, as it forces ROS container creation and can be bad if done too often). As many projections as possible are unsegmented to allow better scale-out, since this makes a query hit only 1 node and allocate memory on only 1 node. It has worked well for us so far, and we load about 5 billion rows a day with realtime querying from our UI.
Up_one - considering the telecom use-case - are you doing CDR or something else?
To answer your original question: yes, Vertica may be a great fit, but it depends on how you are loading the data, how you are doing updates, what your data size is and what your SLA is. I am really familiar with this space because I implemented Vertica at a telecom that I worked for at the time.

Berkeley DB Java Edition - tuning for large amount of data

I need to load over 1 billion keys into Berkeley DB and therefore I want to tune it in advance to get better performance. With the standard configuration it now takes me about 15 min to load 1,000,000 keys, which is too slow.
Is there a proper way to tune, for example, the B+Tree of Berkeley DB (node size, etc.)?
(As a comparison, after tuning Tokyo Cabinet, it loads 1 billion keys in 25 min.)
P.S.
I'm looking for tuning tips in code, not parameters to set for a running system (like JVM size, etc.).
I'm curious: when Tokyo Cabinet loads 1B keys in 25 minutes, what are the sizes of the keys/values being stored? What I/O system and storage system are you using? Are you using the term "load" to mean 1B transactional commits to permanent stable storage? That would be ~666,666 inserts/second, which is physically impossible given any I/O system I'm aware of. Multiply that number by the key and value size and you're hopelessly beyond physical limits.
Please take a look at Gustavo Duarte's blog, read a bit about I/O systems and how things work in hardware, and then review your statement. I'm very interested in finding out what exactly Tokyo Cabinet is doing and what it isn't doing. If I had to guess, I'd say that it's committing to the file-system cache in the operating system, but not flushing (fdsync()-ing) those buffers to disk.
Full Disclosure: I'm a product manager at Oracle for Oracle Berkeley DB (a direct competitor of TokyoCabinet), I've been playing with these databases and the best hardware around for them for about ten years so I'm both biased and skeptical.
Berkeley DB has flags you can set on the transaction handle which mimic this and other similar methods of trading off durability (the "D" in ACID) for speed.
As far as how to make Berkeley DB Java Edition (BDB-JE) faster, you can try the following:
Deferred writes: this delays writing to the transaction log for as long as possible (when buffers are full, it flushes the data).
Sort your keys in advance: most B-Trees (ours included) do much better with in-order insertions for fast load times.
Increase the size of the log files from the default of 10MiB to something larger, like 100MiB; this reduces I/O cost.
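A rough sketch of what those settings look like with the JE API; the directory, log size and cache share below are illustrative rather than recommendations, and you should check the parameter names against your JE version:

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;

    import java.io.File;

    public class BulkLoadTuning {
        public static void main(String[] args) {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            // Raise the log file size from the 10 MiB default to 100 MiB.
            envConfig.setConfigParam(EnvironmentConfig.LOG_FILE_MAX,
                    String.valueOf(100L * 1024 * 1024));
            // Give the JE cache a generous share of the heap for internal nodes.
            envConfig.setCachePercent(70);

            File envHome = new File("/tmp/bdb-env");
            envHome.mkdirs();
            Environment env = new Environment(envHome, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            // Deferred write: changes are buffered and logged lazily; call
            // Database.sync() when they must be durable.
            dbConfig.setDeferredWrite(true);

            Database db = env.openDatabase(null, "bulk-load", dbConfig);
            // ... insert the keys here, ideally in sorted order ...
            db.sync();
            db.close();
            env.close();
        }
    }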
It's very important to be clear about claims of performance with databases. They seem simple, but it turns out to be very, very tricky to get them right so that they don't ever corrupt data or lose committed transactions.
I hope this helps you a bit.
Bulk inserts on BDB-JE are an order of magnitude faster if you group them into a single transaction. The reason is that each single commit causes (by default) a synchronous write to disk, while a transaction is synchronized only once, on commit. In my application, writing 100,000 small keys as single commits took more than a minute, while within one transaction it takes just a few seconds.
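A minimal sketch of that difference, assuming the environment and database were opened as transactional (the key and value generation here is purely illustrative):

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.Transaction;

    public class BatchedInserts {
        // Group many puts under one transaction so that only one commit
        // (and therefore one synchronous log flush) is paid per batch.
        // Assumes env and db were opened with setTransactional(true).
        static void loadBatch(Environment env, Database db, int count) {
            Transaction txn = env.beginTransaction(null, null);
            try {
                for (int i = 0; i < count; i++) {
                    DatabaseEntry key = new DatabaseEntry(String.valueOf(i).getBytes());
                    DatabaseEntry value = new DatabaseEntry(("value-" + i).getBytes());
                    db.put(txn, key, value);
                }
                txn.commit(); // one sync write instead of one per record
            } catch (RuntimeException e) {
                txn.abort();
                throw e;
            }
        }
    }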

Storage for Write Once Read Many

I have a list of 1 million digits. Every time the user submit an input, I would need to do a matching of the input with the list.
As such, the list would have Write Once Read Many (WORM) characteristics?
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database, but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would take too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question you'll need to think about two things:
Are you trying to minimize storage space, or are you trying to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you could optimize the datastructure for your most common operations (in this case a lookup) at the cost of memory space. For persistence, you could store the data to a flat file, and read the data during startup.
SQL databases are great for storing and reading relational data. For instance, names, addresses and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data: constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to deal with file system storage yourself, SQL is probably your best shot. You can optimize your table indexes, but it also adds another external dependency to your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 the ARCHIVE engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), then create it and store it as a binary file, read the binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (extra app dependency, design expertise); fast, but not as fast as holding the whole list in memory.
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list to it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick which has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each number into a file on that disk, one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work? Just store the numbers in a text file and read them into a hash set on application startup. On my computer, reading 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
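A minimal sketch of that simplest-thing approach (the file name is just a placeholder):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class NumberLookup {
        private final Set<String> numbers = new HashSet<>();

        // Load the one-number-per-line list once at application startup.
        public NumberLookup(String path) throws IOException {
            numbers.addAll(Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8));
        }

        // Constant-time membership check for each user submission.
        public boolean matches(String input) {
            return numbers.contains(input.trim());
        }

        public static void main(String[] args) throws IOException {
            NumberLookup lookup = new NumberLookup("numbers.txt");
            System.out.println(lookup.matches("1234567890"));
        }
    }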
