My use case:
A single-node, larger-than-memory "big dict" (or "big map"). The total size is too large for memory, e.g. 20 GB, but is fine for single-node disk. At that size, a single-file solution like SQLite becomes unwieldy. I also want easy cloud backups, so I want manageable file sizes. It needs to be a series of size-controllable files managed by the tool in a user-transparent way. Further, it should be embedded, i.e. a simple library, with no client/server.
Long story short, I picked RocksDB.
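For context, here is a minimal sketch of how I use it today as a dict, assuming the python-rocksdb bindings (keys/values are raw bytes; the size options are the knobs I lean on to keep individual files manageable, and the path/key names are placeholders):

    import rocksdb

    opts = rocksdb.Options()
    opts.create_if_missing = True
    opts.write_buffer_size = 64 * 1024 * 1024      # memtable size before flush to disk
    opts.target_file_size_base = 64 * 1024 * 1024  # rough cap on individual SST file size

    db = rocksdb.DB("bigdict.db", opts)  # a directory of managed files, not one big file

    # dict-like access
    db.put(b"user:123", b"serialized value")
    value = db.get(b"user:123")          # returns None if the key is missing

    # ordered scan over a key range
    it = db.iterkeys()
    it.seek(b"user:")
    for key in it:
        if not key.startswith(b"user:"):
            break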
Now for new requirements or nice-to-haves: I want to use a cloud blob store as the ultimate storage. For example, a couple of levels of hot caches would reside in memory or on local disk with a configurable total size; beyond that, reads/writes go to a cloud blob store.
After the initial creation of the dataset, usage is mostly reads. I don't care much about "distributed", multiple-machines-competing-to-write kinds of complexities.
I don't see that RocksDB has this option. There's rocksdb-cloud, which appears to be in "internal dev" mode, with no end-user documentation whatsoever.
Questions:
Is my use case reasonable? Would a cloud KV store (like GCP Firestore?) plus a naive flat in-memory cache have a similar effect?
How can I do this with RocksDB? Or with any alternative?
Thanks.
RocksDB allows you to define your own FileSystem or Env, in which you can implement the interaction layer with whatever special filesystem you want. So it's possible, but you need to implement or define the integration layer with the cloud KV store yourself. (Running on HDFS is an example; it defines its own Env.)
Comment to #jayZhuang; too long for a comment.
It looks like the code is modular in decent ways, but I can hardly say it "supports" cloud storage, because that requires forking and hacking the code itself. More reasonable for the end user would be an extension or plugin from the outside, basically "give me a few auth arguments and the location of the storage, and I do the rest". Covering the few major blob stores should be a modest effort this way.
For my part, I'm using RocksDB from Python via a barely maintained RocksDB Python client. (There are no actively maintained options.) I have nice Python utilities for cloud blob stores, but I'm sure there's no way to make RocksDB use those when it's driven from Python through an inactive client package. Although I am able to write C++ extensions for Python, that would mean digging into both RocksDB and the blob store in C++, which is not something I'll take on.
Thanks for the pointers. Do you know of any other examples closer to the end user?
Related
I'm currently working on a traditional monolith application, but I am in the process of breaking it up into Spring microservices managed by Kubernetes. The application allows uploading/downloading of large files, and these files are normally stored on the host filesystem. I'm wondering what would be the most viable method of persisting these files in a microservice architecture?
You have a bunch of different options; Googling your question, you'll find many answers for any budget and taste. Basically, you'd want high-availability storage like AWS S3. You could set up your own dedicated server to store these files as well if you wanted to cut costs, but then you'd have to worry about backups and availability. If you need low-latency access to these files, you'd want to put them behind a CDN as well.
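The calls themselves are trivial from any language. Here is a sketch in Python with boto3 just to show the shape of the API (bucket, keys, and paths are placeholders; the Java SDK is analogous):

    import boto3

    s3 = boto3.client("s3")

    # persist an uploaded file under a key derived from its ID
    s3.upload_file("/tmp/upload-42.bin", "my-file-bucket", "files/42.bin")

    # fetch it back when a download is requested
    s3.download_file("my-file-bucket", "files/42.bin", "/tmp/download-42.bin")

    # or hand clients a presigned URL so large downloads bypass the service entirely
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-file-bucket", "Key": "files/42.bin"},
        ExpiresIn=3600,
    )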
We are mostly on-prem. We ended up using NFS. It's the path of least resistance, but probably not the most performant, and making it highly available is tough. If you have the chance, I agree with Denis Pshenov that an S3-like system, for example MinIO, might be a better alternative.
Maybe you should have a look at the Rook project (https://rook.io/). It's easy to set up and provides different kinds of storage and persistence technologies to your cloud-native applications (CNAs).
There are many places to store your data. It also depends on the budget you are able to spend (holding duplicate data also means more storage, which costs money) and, mostly, on your business requirements:
Is all data needed at all times?
Are there geo/region-related cases?
How fast does a read/write operation need to be?
Do things need to be cached?
Stateful or stateless?
Are there operational requirements? How should this be maintained?
...
Apart from this, your microservices should not know where the data is actually stored. In Kubernetes you can use Persistent Volumes (https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can link to storage from your cloud provider or something else. The microservice should just mount the volume and be able to treat it like local files.
Note that cloud provider storage already includes solutions for scaling, concurrency, etc., so I would probably use a single blob storage under the hood.
However, it has to be said that there is a trend toward understanding a microservice as a package of data and logic coupled together, and toward accepting data duplication, which leads to better scalability.
See the following for more information:
http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/
https://github.com/katopz/best-practices/blob/master/best-practices-for-building-a-microservice-architecture.md#stateless-service-instances
https://12factor.net/backing-services
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html
I noticed that the Spring reference application (Sagan) uses the SimpleCacheManager implementation. See here for the source code of Sagan.
I was surprised by this choice because I thought that anything but a small application running on a single node would use something like a Redis cache manager rather than the simple cache manager.
How can a large application like Sagan, which I assume runs on Cloud Foundry, use this simple implementation?
Any comment welcome.
Well, the SimpleCacheManager choice was made because it was the simplest solution that could possibly work. Note that Sagan is, at least for now, not storing a lot of data in that cache; it merely uses it to respect various APIs' rate limits and to get better performance on some parts of the application.
Yes, Sagan is running on CloudFoundry (see this presentation) and is using CF marketplace services.
Even if cache consistency between instances is not a constraint for now, we could definitely add another marketplace service, here a Redis Cloud instance, and use this as a central cache repository.
Now that we're considering using that cache for more features, it even makes sense to at least consider that use case, since it could lower our monthly bill (paying a small fee for a Redis service and using less memory for our CF instances).
In any case, thanks a lot, balteo, for this insightful question; we've created a GitHub issue for it.
I am currently building a high-traffic GIS system that uses Python on the web front end. The system is 99% read-only. In the interest of performance, I am considering using an externally generated cache of pre-generated, read-optimised GIS information, stored in an SQLite database on each individual web server. In short, it's going to be used as a distributed read-only cache that doesn't have to hop over the network. The back-end OLTP store will be PostgreSQL, but that will handle less than 1% of the requests.
I have considered using Redis, but the dataset is quite large, so it would push up the administrative and memory costs on the virtual machines this is hosted on. Memcached is not suitable, as it cannot do range queries.
Am I going to hit read-concurrency problems with SQLite doing this?
Is this a sensible approach?
OK, after much research and performance testing, SQLite is suitable for this. It has good read concurrency on static data. SQLite only becomes an issue if you are doing writes as well as heavy reads.
More information here:
http://www.sqlite.org/lockingv3.html
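As a sketch of the setup (Python's built-in sqlite3 module; file name and schema are placeholders): each web server opens the pre-generated cache file in read-only mode, so readers only take shared locks and never block each other.

    import sqlite3

    # open the pre-generated cache read-only via a URI; no writes, so no lock contention
    conn = sqlite3.connect("file:gis_cache.sqlite?mode=ro", uri=True, check_same_thread=False)

    # range query against an indexed column (placeholder schema)
    rows = conn.execute(
        "SELECT id, lat, lon FROM features WHERE lat BETWEEN ? AND ?",
        (51.0, 52.0),
    ).fetchall()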
If the use case is just a cache, why not use something like memcached (http://memcached.org/)?
You can find memcached bindings for Python in the PyPI repository.
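For example, with the python-memcached package (one of several bindings on PyPI; server address, key, and TTL below are placeholders):

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    # cache a serialized blob for an hour, then read it back
    mc.set("tile:51.5:-0.1", b"...serialized GIS data...", time=3600)
    blob = mc.get("tile:51.5:-0.1")  # returns None on a cache miss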
Another option is to use materialized views in Postgres; this way you keep things simple and have everything in one place.
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
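A sketch of that approach, issuing the Postgres SQL from Python via psycopg2 (connection string and view definition are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=gis user=web")

    # precompute the read-optimised shape of the data once
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE MATERIALIZED VIEW IF NOT EXISTS features_cache AS
            SELECT id, lat, lon, properties
            FROM features
        """)

    # refresh periodically, or after the rare writes
    with conn, conn.cursor() as cur:
        cur.execute("REFRESH MATERIALIZED VIEW features_cache")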
I'm considering using HDFS as a horizontally scaling file storage system for our client video hosting service. My main concern is that HDFS wasn't developed for this need; it is more "an open source system currently being used in situations where massive amounts of data need to be processed".
We don't want to process data, just store it, and build on top of HDFS something like a small internal Amazon S3 analog.
A probably important point is that the stored file sizes will be quite big, from 100 MB to 10 GB.
Has anyone used HDFS for such purposes?
If you are using an S3 equivalent, then it should already provide a distributed, mountable filesystem, no? Perhaps you can check out OpenStack at http://openstack.org/projects/storage/.
The main disadvantage would be the lack of POSIX semantics. You can't mount the drive, and you need special APIs to read from and write to it. The Java API is the main one. There is a project called libhdfs that provides a C API over JNI, but I've never used it. Thriftfs is another option.
I'm also not sure about the read performance compared to other alternatives. Maybe someone else knows. Have you checked out other distributed filesystems like Lustre?
You may want to consider MongoDB for this. It has GridFS, which allows you to use it as file storage. You can then horizontally scale your storage through sharding and provide fault tolerance with replication.
http://docs.mongodb.org/manual/core/gridfs/
http://docs.mongodb.org/manual/replication/
http://docs.mongodb.org/manual/sharding/
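A sketch of what storing and fetching a video through GridFS looks like from Python with pymongo (connection string and file names are placeholders):

    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["videos"]
    fs = gridfs.GridFS(db)

    # store: GridFS splits the file into fixed-size chunks behind the scenes
    with open("clip.mp4", "rb") as f:
        file_id = fs.put(f, filename="clip.mp4")

    # retrieve: read the chunks back as one file-like object
    data = fs.get(file_id).read()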
Please give me some hints on my issue.
I'm building a queue data structure that:
has a real-time backup on hard disk
can restore from the backup
can respond to massive enqueue/dequeue requests
Thank you!
Is this an exercise you're doing? If not, you should probably look at some of the production message queueing technologies (e.g. MSMQ for Windows), which support persisting the queues on disk rather than just storing them in memory.
In terms of your requirements:
1. Has a real-time backup on hard disk
Yes, MSMQ can do that.
2. Can restore from the backup
And that.
3. Can respond to massive enqueue/dequeue requests
And this...
If you can avoid it, don't roll your own. For Java try ActiveMQ.
You're probably looking at something more complex than just a simple library.
Since this is an exercise (probably meant to get you thinking about the underlying data structures), you might start the simple way by queueing to a MySQL database. Your performance will suck compared to dedicated queueing software, but you can at least get the rest of the infrastructure working.
After that, you'll probably be looking at some form of custom file format and a multi-threaded server on top of it. You might be able to pull it off using something like BDB or SQLite for the I/O layer, thus saving you the trouble of writing the actual disk routines.
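A minimal sketch of that idea in Python, using SQLite as the I/O layer (table name and schema are made up for illustration): every enqueue/dequeue is one small transaction, so the queue survives a crash, and "restoring the backup" is just reopening the file.

    import sqlite3

    class DiskQueue:
        """FIFO queue whose items are persisted in an SQLite file."""

        def __init__(self, path="queue.db"):
            self.conn = sqlite3.connect(path)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS q (id INTEGER PRIMARY KEY AUTOINCREMENT, item BLOB)"
            )
            self.conn.commit()

        def enqueue(self, item):
            with self.conn:  # one transaction per enqueue, flushed to disk
                self.conn.execute("INSERT INTO q (item) VALUES (?)", (item,))

        def dequeue(self):
            with self.conn:  # fetch-and-delete in a single transaction
                row = self.conn.execute("SELECT id, item FROM q ORDER BY id LIMIT 1").fetchone()
                if row is None:
                    return None
                self.conn.execute("DELETE FROM q WHERE id = ?", (row[0],))
                return row[1]

    q = DiskQueue()
    q.enqueue(b"job-1")
    print(q.dequeue())  # b'job-1'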