Database with low memory requirements and Ruby interface - ruby

I need a database with low memory requirements for a small virtual server with little memory. At the moment I'm deciding between SQLite and Kyoto Cabinet or Tokyo Cabinet. The database should have a Ruby interface.
Ideally I want to avoid key-value stores, because I have “complex” queries (more complex than looking up a single key) and tuples as keys. On the other hand, I don't want a fixed schema, and I'd like to avoid the planning and migration effort of a SQL database. A database server is also unnecessary because only a single application will use the database.
Do you have any recommendations and numbers for me?

There is schema-less PostgreSQL (PostgreSQL 9.2 + json). It is not as hard or confusing to set up as I thought. You get lots of flexibility with queries while still getting the benefits of schema-less storage. PG 9.2 also works with plv8js, a new language handler that allows you to create functions in JavaScript. Here is one example of how you can index and query JSON docs in PG 9.2: http://people.planetpostgresql.org/andrew/index.php?/archives/249-Using-PLV8-to-index-JSON.html
CouchDB (use BigCouch: it is based on CouchDB, but with fewer bugs/problems):
Very low memory requirements.
Schema-less.
HTTP-based interface. Ruby has plenty of HTTP clients (see the small sketch at the end of this answer). HTTP caching (like Varnish) can also speed up reads.
Creative/complex queries. You can create indexes and queries on any key in the document (record), and you can get very creative with queries since the indexes are very programmable.
Downsides:
Learning curve of setting up your queries/indexes.
You have to schedule a type of cleanup operation called "compaction".
Data will take up more space compared to other databases.
More: http://www.paperplanes.de/2010/7/26/10_annoying_things_about_couchdb.html
If disk is cheap and memory expensive, it would make a good candidate for your needs.
"...another strength of CouchDB, which has proven to serve thousands of concurrent requests only needed about 10MB of RAM - how awesome is that?!?!" (From: http://www.larsgeorge.com/2009/03/hbase-vs-couchdb-in-berlin.html )

SQLite3 is a great fit for what you are trying to do. It's used by a lot of companies as their embedded app database because it's flexible, fast, well tested, and has a small footprint. It's easy to create and blow away tables so it plays well with testing or single-application-use data stores.
The SQL dialect it supports is rich enough to do the normal things, but I'd recommend using Sequel with it. It's a great ORM that lets you treat it as a full-blown ORM, or drop all the way down to talking raw SQL to the database.
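To give a rough idea of what that looks like, here is a minimal Sequel sketch against an SQLite file; the events table and its columns are made up for illustration, and it shows both the dataset API and dropping down to raw SQL.

    require 'sequel'

    DB = Sequel.sqlite('app.db')   # a single file, no server process

    # Create a table on the fly (hypothetical schema).
    DB.create_table?(:events) do
      primary_key :id
      String :kind
      String :payload, text: true
      Time   :created_at
    end

    events = DB[:events]
    events.insert(kind: 'signup', payload: '{"user":"alice"}', created_at: Time.now)

    # Dataset API ...
    recent = events.where(kind: 'signup').order(Sequel.desc(:created_at)).limit(10).all

    # ... or raw SQL when you want it.
    counts = DB.fetch('SELECT kind, COUNT(*) AS n FROM events GROUP BY kind').all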

You are probably looking for a solution that has only a database file and no running server. In that case, SQLite should be a good choice - if you don't need it, just close the connection and that's it. SQLite has everything you need from an RDBMS (except for enforcing foreign keys directly, but that can be done with triggers), with a very small memory footprint, so you are probably worried more about the memory your ORM (if any) uses.
Personally, I use SQLite for that use case as well, as it is portable and easy to access and install (which shouldn't be a problem on a server anyway, but it matters in a desktop application).

BerkeleyDB with SQLite API is what you need.
http://www.oracle.com/technetwork/database/berkeleydb/overview/sql-160887.html

Related

storing data in secondary database

Our application (Java, Spring, Hibernate) uses Postgres to store data.
We are looking to add an analysis engine to the application. I want to explore using a NoSQL DB to run the analysis on. This is partly an attempt to learn NoSQL and partly to free the main application activity from the performance penalty (as much as possible).
So I want data changes to also sync to the NoSQL DB (in addition to Postgres). Any sync mechanism will affect the performance of the main data/transaction activity.
Is it a good idea to push the data changes to a message bus and free the main transaction as early as possible? Can anyone point me to frameworks/technologies/ideas that address this issue of the same data going to two different data stores?
The simplest solution would be sending data to a Postgres read replica and running your analytics queries on that. The performance impact is minimal and this would save a lot of time compared to alternative approaches.
Unless you really know what you are doing, I would avoid NoSQL for this kind of application. If your dataset is too big for a Postgres read replica, you might want to use Redshift, which is a columnar data store optimized for the types of analytics queries typically performed.

Handle huge data imported from Facebook

I'm currently creating a program that imports all the groups and feeds from Facebook that the user wants.
I use the Graph API with OAuth and this works very well.
But I have come to the point where I realize that one request can't handle the import of 1000 groups plus their feeds.
So I'm looking for a solution that imports this data in the background (like a cron job) into a database.
Requirements
Runs in background
Runs under Linux
RESTful
Questions
What's your experience with this?
Would Hadoop be the right solution?
You can use Neo4j.
Neo4j is a graph database, reliable and fast for managing and querying highly connected data.
http://www.neo4j.org/
1) Decide on the structure of your nodes, relationships, and their properties, and accordingly create an API that gets the data from Facebook and stores it in Neo4j.
I have used Neo4j in 3 big projects, and it is best for graph data.
2) Create a cron job that gets the data from Facebook and stores it in Neo4j.
I think using MySQL as a graph database is not a good idea; for large graph data, Neo4j is the better option.
Interestingly, you have already designed the appropriate solution yourself. In fact you need the following components:
a relational database, since you want to request data in a structured, quick way
-> from experience I would stress having a fully normalized data model (in your case with tables users, groups, users2groups), and using 4-byte surrogate keys rather than the larger keys from Facebook (for back-referencing you can store their keys as attributes, but internal relations are more efficient on surrogate keys)
-> establish indexes based on hashes rather than strings (e.g. crc32(lower(STRING))) - an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING)) (a small Ruby sketch of this lookup follows below, after the Hadoop note)
-> never, ever establish unique columns based on strings longer than 8 bytes; unique bulk inserts can be done based on hashes plus string checking via insert...select
-> once you have that settled you could also look into sparse matrices (see Wikipedia) and bitmaps to optimize users2groups (however, I have learned that this is an extra that should not keep you from getting a first version out soon)
a cron job that runs periodically
-> ideally within the caps Facebook gives you (so if they rule that you must not request more often than once per second, stick to that - not more, but also try to come as close to the cap as possible) -> invest some time in getting the management of this settled, especially if different types of requests need to be fired (requests for user records vs. requests for group records, which may be hit by the same cap)
-> most of the optimization can only be done during development - so if I were you I would stick to a high-level programming language that does not bother you too much with type juggling and that comes with broad support for associative arrays, such as PHP, and I would program the thing myself
-> I have had good experiences with setting up the cron job as a web page with output buffering deactivated (for PHP, look at ob_end_flush()) - it is easy to test, and the cron job can be triggered via curl; if you channel status output through a function of your own (e.g. with timestamps) it becomes flexible enough to run either via the browser or via the command line -> which means efficient testing + efficient production running
your user UI, which only queries your own database and never, ever the external system's API
lots of memory, to keep your performance high (optimal: all your data plus index data fits into the memory/cache dedicated to the database)
-> if you use MySQL as the database you should look into innodb_flush_log_at_trx_commit=0 and innodb_buffer_pool_size (just Google them if interested)
Hadoop is a file-system layer - it could help you with availability. However, I would put this into the category of "sparse matrix": nothing that should stop you from coming up with a solution. From my experience, availability is not a primary constraint in data-exposure projects.
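As an illustration of the hash-based lookup above, here is roughly how it could look from Ruby with Sequel against MySQL; the users table, the name_hash column and the connection string are assumptions for the sketch, not a drop-in recipe.

    require 'sequel'
    require 'zlib'

    # Hypothetical connection and schema: users(id, fb_id, name, name_hash),
    # with a plain integer index on name_hash instead of a long string index.
    DB = Sequel.connect('mysql2://user:pass@localhost/facebook_import')

    def find_user(name)
      h = Zlib.crc32(name.downcase)                      # the same hash the column was filled with
      DB[:users].where(name_hash: h, name: name).first   # the hash narrows the search, the string check confirms
    end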
-------------------------- UPDATE -------------------
I like Neo4j from the other answer, so I wondered what I can learn from it for my future projects. My experience with MySQL is that RAM is usually the biggest constraint. Increasing your RAM so that you can load the full database can gain you performance improvements by a factor of 2-1000, depending on where you are coming from. Everything else, such as index improvements and structure, follows from that. So if I had to make up a performance prioritization list, it would be something like this:
MySQL + enough RAM dedicated to the database to load all data
Neo4j + enough RAM dedicated to the database to load all data
I would still prefer MySQL. It stores records efficiently, but needs to run joins to derive relations (which Neo4j does not require to that extent). Join costs are usually low with the right indexes, and according to http://docs.neo4j.org/chunked/milestone/configuration-caches.html Neo4j needs to add extra management data for the property separation. For big-data projects that management data adds up, and in a fully-loaded-into-memory setup it requires you to buy more memory. Performance-wise, both of these options are the top picks. Much further down the line you would find this:
Neo4j + not enough RAM dedicated to the database to load all data
MySQL + not enough RAM dedicated to the database to load all data
In the worst case MySQL will even put indexes (at least partly) on disk, which can result in massive read delays. With Neo4j, in comparison, you can perform a direct jump from node to node, which should - at least in theory - be faster.

Postgres Hstore vs. Redis - performance wise

I read about hstore in Postgres, something similar to what Redis offers as well.
Our application is written in NodeJS. Two questions:
Performance-wise, is Postgres HStore comparable to Redis?
For session storage, what would you recommend - Redis, or Postgres with some other kind of data type (like hstore, or maybe even a plain relational table)? And how bad is one option vs. the other?
Another constraint, is that we will need to use the data that is already in PostgreSQL and combine it with the active sessions (which we aren't sure where to store at this point, if in Redis or PostgreSQL).
From what we have read, we have been pointed toward Redis as a session manager, but due to the PostgreSQL constraint we are not sure how to combine both and what performance issues may arise.
Thanks!
Redis will be faster than Postgres, because Pg offers reliability guarantees for your data (when the transaction is committed, it is guaranteed to be on disk), whereas Redis writes to disk when it feels like it, so it shouldn't be used for critical data.
Redis seems like a good option for your session data - or heck, you could even store it in a cookie or in your client-side JavaScript. But if you need data from your database on every request anyway, it might not even be worth involving Redis. It very much depends on your application.
Using PostgreSQL as a session manager is usually a bad idea.
For versions older than 9.1 there was a physical limit on transactions per second, determined by the parameters of the persistent media. For session management you usually don't need MGA (multi-generational architecture, i.e. MVCC), because there are no collisions; that means MGA is pure overhead, and databases without MGA and ACID can be significantly faster (10x or 100x).
I know of a use case where PostgreSQL was used for session management and the performance was really terrible and unstable - an e-shop with about 10,000 live sessions. When session management was moved to memcached, performance and stability increased significantly. PostgreSQL can probably handle 100 live sessions without problems; for higher numbers there are better tools.

Any good document-oriented DB for Windows desktop besides MongoDB, etc?

I've been searching for a document-oriented DB for a Windows desktop program. MongoDB seems to be the best one so far, because it's smaller (11 MB) and simpler compared to CouchDB (which is another option, but it seems to be more complex and the download size is almost 50 MB). Unfortunately, on 32-bit Windows the database size limit in MongoDB is 2 GB, and they don't intend to remove this limit anytime soon.
Do you have any recommendation? Requirements:
Open source;
schema-less, in BSON/JSON format;
Easy to deploy to a windows machine.
Many thanks!
I'm just curious: why would you need a non-relational database for a desktop application? These things are designed for high-availability clusters and really large amounts of data, both of which are irrelevant for desktop apps, where you usually have just one user at a time and not such a large dataset.
What I would use if I were you is an embedded database like HSQLDB or SQLite.
Now, if you want to make it schema-less for simplicity, just create your tables with only two columns: an id (long) and a data (varchar) column.
Then serialize/deserialize your objects to and from JSON yourself when accessing the data.
You can see a really easy way to do the JSON stuff here:
JSON Serializer for arbitrary HashMaps in Voldemort
Note: the question at the link above is Voldemort-specific, but the answer I received isn't and can be applied here as well (assuming you are using Java; if not, there is bound to be an easy way to do the same in your language, too).
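If the desktop app happens to be written in something other than Java, the pattern stays just as small; here is a minimal sketch of the id-plus-JSON-blob idea in Ruby with the sqlite3 gem and the built-in json library (the docs table and its columns are made up for illustration).

    require 'sqlite3'
    require 'json'

    db = SQLite3::Database.new('app.db')
    db.execute('CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, data TEXT)')

    # Store: serialize the object to JSON yourself.
    doc = { 'name' => 'alice', 'tags' => %w[admin beta] }
    db.execute('INSERT INTO docs (data) VALUES (?)', [doc.to_json])

    # Load: deserialize on the way out.
    id, data = db.get_first_row('SELECT id, data FROM docs ORDER BY id DESC LIMIT 1')
    restored = JSON.parse(data)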

Need to load the whole PostgreSQL database into RAM

How do I put my whole PostgreSQL database into RAM for faster access? I have 8 GB of memory and I want to dedicate 2 GB to the DB. I have read about the shared_buffers setting, but it only caches the most-accessed fragments of the database. I need a solution where the whole DB is put into RAM, every read happens from the RAM copy, and every write goes first to the RAM copy and then to the DB on the hard drive (something like the default fsync = on combined with shared_buffers in the PostgreSQL configuration).
I have asked myself the same question for a while. One of the disadvantages of PostgreSQL is that it does not seem to support IN MEMORY storage engines the way MySQL does...
Anyway, I ran into an article a couple of weeks ago describing how this could be done, although it only seems to work on Linux. I really can't vouch for it since I have not tried it myself, but it does seem to make sense, since a PostgreSQL tablespace is indeed assigned to a mounted location.
However, even with this approach, I am not sure you could put your index(es) into RAM as well; I do not think MySQL forces HASH index use with its IN MEMORY tables for nothing...
I also wanted to do a similar thing to improve performance, since I am also working with huge data sets. I am using Python; it has dictionary data types, which are basically hash tables of {key: value} pairs. Using these is very efficient and effective. Basically, to get my PostgreSQL table into RAM, I load it into such a Python dictionary, work with it, and persist it to the DB once in a while; it's worth it if used well.
If you are not using Python, I am pretty sure there is a similar dictionary-mapping data structure in your language.
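For the sake of illustration, the same trick in Ruby: load the hot table into an in-memory Hash with the pg gem, work against the Hash, and write back once in a while (the items table and its columns are hypothetical).

    require 'pg'

    conn = PG.connect(dbname: 'mydb')   # hypothetical database

    # Pull the whole (small enough) table into a Hash keyed by id.
    cache = {}
    conn.exec('SELECT id, payload FROM items') do |result|
      result.each { |row| cache[row['id'].to_i] = row['payload'] }
    end

    # Work against the Hash in memory ...
    cache[42] = 'updated payload'

    # ... and persist changes back periodically.
    conn.exec_params('UPDATE items SET payload = $1 WHERE id = $2', [cache[42], 42])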
Hope this helps!
If you are pulling data by id, use memcached - http://www.danga.com/memcached/ - together with PostgreSQL.
Set up an old-fashioned RAMdisk and tell pg to store its data there.
Be sure you back it up well though.
Perhaps something like a Tangosol Coherence cache if you're using Java.
With only an 8 GB database, if you've already optimized all the SQL activity and you're ready to solve query problems with hardware, I suggest you're in trouble. This is just not a scalable solution in the long term. Are you sure there is nothing you can do to make substantial differences on the software and database design side?
I haven't tried this myself (yet) but:
There is a standard docker image available for postgres - https://hub.docker.com/_/postgres/
docker supports tmpfs mounts that are entirely in-memory https://docs.docker.com/storage/tmpfs/
Theoretically, it should be possible to combine the two.
If you do this, you might also want to tweak seq_page_cost and random_page_cost to reflect the relative storage costs. See https://www.postgresql.org/docs/current/runtime-config-query.html
The pre-existing advice about query optimization and increasing shared_buffers still stands, though. The chances are that if you're having these problems on a database this small, simply putting it into RAM isn't the right fix.
One solution is to use the Fujitsu version of PostgreSQL, which supports in-memory columnstore indexes...
https://www.postgresql.fastware.com/in-memory-columnar-index-brochure
But it costs a lot...
Or run MS SQL Server with the In-Memory tables feature... Even the free Express edition has it!
