I'm considering using HDFS as a horizontally scaling file storage system for our client's video hosting service. My main concern is that HDFS wasn't developed for this need; it is more "an open source system currently being used in situations where massive amounts of data need to be processed".
We don't want to process the data, just store it, and build on top of HDFS something like a small internal Amazon S3 analog.
A probably important point is that the stored files will be quite big, from 100 MB to 10 GB.
Has anyone used HDFS for such purposes?
If you are using an S3 equivalent, then it should already provide a distributed, mountable file system, no? Perhaps you can check out OpenStack at http://openstack.org/projects/storage/.
The main disadvantage would be the lack of POSIX semantics. You can't mount the drive, and you need special APIs to read from and write to it. The Java API is the main one. There is a project called libhdfs that provides a C API over JNI, but I've never used it. Thriftfs is another option.
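For instance, here is a minimal sketch of reading and writing over WebHDFS with the Python hdfs package (the namenode URL, user, and paths are placeholders); it just illustrates that access goes through a dedicated API rather than a mounted filesystem:

    # Minimal WebHDFS sketch with the Python "hdfs" client.
    # Namenode URL, user, and paths below are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="video")

    # Upload a local video file into HDFS.
    client.upload("/videos/clip-001.mp4", "clip-001.mp4")

    # Stream it back in chunks.
    with client.read("/videos/clip-001.mp4", chunk_size=8 * 1024 * 1024) as reader:
        for chunk in reader:
            pass  # push each chunk to an HTTP response, transcoder, etc.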
I'm also not sure about the read performance compared to other alternatives. Maybe someone else knows. Have you checked out other distributed filesystems like Lustre?
You may want to consider MongoDB for this. It has GridFS, which allows you to use it as file storage. You can then scale your storage horizontally through shards and provide fault tolerance with replication. (A minimal sketch follows the links below.)
http://docs.mongodb.org/manual/core/gridfs/
http://docs.mongodb.org/manual/replication/
http://docs.mongodb.org/manual/sharding/
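As an illustration, a minimal GridFS sketch with pymongo (connection string, database name, and filenames are placeholders):

    # Minimal GridFS sketch with pymongo.
    # Connection string, database name, and filenames are placeholders.
    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["media"]
    fs = gridfs.GridFS(db)

    # Store a large file; GridFS splits it into 255 kB chunks under the hood.
    with open("movie.mp4", "rb") as f:
        file_id = fs.put(f, filename="movie.mp4")

    # Read it back by id, or look up the latest version by filename.
    data = fs.get(file_id).read()
    same = fs.find_one({"filename": "movie.mp4"}).read()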
My use case:
A single-node, out-of-memory "big dict" (or "big map"). The total size is too large for memory, e.g. 20 GB, but is fine for a single node's disk. Due to the total size, it's unwieldy with a single-file solution like SQLite. I also want easy cloud backups, so I want manageable file sizes. It needs to be a series of size-controllable files managed by the tool in a user-transparent way. Further, it should be embedded, i.e. a simple lib, no client/server.
Long story short, I picked Rocksdb.
Now, new requirements or nice-to-haves: I want to use a cloud blob store as the ultimate storage. For example, a couple of levels of hot caches reside in memory or on local disk with a configurable total size; beyond that, reads and writes go to a cloud blob store.
After the initial creation of the dataset, the usage is mainly reads. I don't care much about "distributed", multiple-machines-competing-to-write kinds of complexity.
I don't see that RocksDB has this option. There's rocksdb-cloud, which appears to be in "internal dev" mode, with no end-user docs whatsoever.
Questions:
Is my use case reasonable? Would a cloud KV store (like GCP Firestore?) plus a naive flat in-memory cache have a similar effect?
How to do this with Rocksdb? Or any alternative?
Thanks.
RocksDB allows you to define your own FileSystem or Env, in which you can implement the interaction layer with whatever special filesystem you want. So it's possible, but you need to implement or define the integration layer with the cloud KV store yourself. (Running on HDFS is an example, which defines its own Env.)
Comment to #jayZhuang; too long for a comment.
This looks like the code is modular in decent ways, but one can hardly say it "supports" cloud storage, because that requires forking and hacking the code itself. More reasonable for the end user would be an extension or plugin from the outside, basically "give me a few auth arguments and the location of the storage and I'll do the rest". Covering the few major blob stores should be a modest effort this way.
For my part, I'm using RocksDB from Python via a barely maintained RocksDB Python client. (There are no actively maintained options.) I have nice Python utilities for cloud blob stores, but I'm sure there's no way to make RocksDB use them from Python through an inactive RocksDB Python client package. Although I am able to write C++ extensions for Python, that would require digging into both RocksDB and the blob store in C++; it's not something I'll take on.
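For the "cloud KV store plus naive in-memory cache" alternative raised in the question, a minimal sketch could look like the following; BlobStoreClient (with upload/download methods) is a hypothetical stand-in for whatever blob store utilities you already have:

    # Sketch of a read-mostly "big dict" backed by a cloud blob store with a
    # bounded in-memory cache. The blob_client argument is a hypothetical
    # wrapper exposing upload(key, value) and download(key) methods.
    import functools

    class BlobBackedDict:
        def __init__(self, blob_client, cache_size=100_000):
            self._blobs = blob_client
            # lru_cache gives a bounded read-through cache for free.
            self._cached_get = functools.lru_cache(maxsize=cache_size)(self._fetch)

        def _fetch(self, key: str) -> bytes:
            # Slow path: read from the blob store.
            return self._blobs.download(key)

        def __getitem__(self, key: str) -> bytes:
            return self._cached_get(key)

        def __setitem__(self, key: str, value: bytes) -> None:
            self._blobs.upload(key, value)
            # Naive invalidation: drop the whole cache on writes.
            self._cached_get.cache_clear()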
Thanks for the pointers. Do you know of any other examples closer to the end user?
I'm currently working on a traditional monolith application, but I am in the process of breaking it up into spring microservices managed by kubernetes. The application allows the uploading/downloading of large files and these files are normally stored on the host filesystem. I'm wondering what would be the most viable method of persisting these files in a microservice architecture?
You have a bunch of different options; Googling your question will turn up many answers for any budget and taste. Basically you'd want high-availability storage like AWS S3. You could set up your own dedicated server to store these files if you wanted to cut costs, but then you'd have to worry about backups and availability. If you need low-latency access to these files, you'd want to put them behind a CDN as well.
We are mostly on-prem. We ended up using NFS. It is the path of least resistance, but probably not the most performant, and making it highly available is tough. If you have the chance, I agree with Denis Pshenov that an S3-like system, for example MinIO, might be a better alternative.
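As an illustration, here is a minimal boto3 sketch against an S3-compatible endpoint (AWS S3 or a self-hosted MinIO); the endpoint URL, bucket, key, and credentials are placeholders:

    # Minimal boto3 sketch against an S3-compatible endpoint (AWS S3 or MinIO).
    # Endpoint URL, bucket, key, and credentials are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio.internal:9000",  # drop this line for real AWS S3
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Upload a large file; boto3 handles multipart uploads transparently.
    s3.upload_file("/tmp/upload-1234.bin", "user-files", "uploads/1234.bin")

    # Hand clients a time-limited download link instead of streaming the
    # file through the microservice itself.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "user-files", "Key": "uploads/1234.bin"},
        ExpiresIn=3600,
    )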
Maybe you should have a look at the Rook project (https://rook.io/). It's easy to set up and provides different kinds of storage and persistence technologies to your cloud-native applications.
There are many places to store your data. It also depends on the budget you are able to spend (holding duplicate data also means more storage, which costs money) and mostly on your business requirements.
Is all data needed at all time?
Are there geo/region-related cases?
How fast does a read/write operation need to be?
Do things need to be cached?
Stateful or stateless?
Are there operational requirements? How should this be maintained?
...
Apart from this, your microservices should not know where the data is actually stored. In Kubernetes you can use Persistent Volumes (https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can link to storage from your cloud provider or something else. The microservice should just mount the volume and be able to treat it like a local file.
Note that the cloud providers' storage offerings already include solutions for scaling, concurrency, etc., so I would probably use a single blob storage under the hood.
However, it has to be said that there is a trend towards understanding a microservice as a package of data and logic coupled together, and towards accepting duplicated data, which leads to better scalability.
See the following for more information:
http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/
https://github.com/katopz/best-practices/blob/master/best-practices-for-building-a-microservice-architecture.md#stateless-service-instances
https://12factor.net/backing-services
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html
I'm just beginning to learn about Big Data and I'm interested in Hadoop. I'm planning on building a simple analytics system to make sense of certain events that occur on my site.
So I'm planning to have code (both front end and back end) trigger some events that would queue messages (most likely with RabbitMQ). These messages will then be processed by a consumer that writes the data continuously to HDFS. Then I can run a MapReduce job at any time to analyze the current data set.
I'm leaning towards Amazon EMR for the Hadoop functionality. So my question is this: from my server running the consumer, how do I save the data to HDFS? I know there's a command like "hadoop dfs -copyFromLocal", but how do I use this across servers? Is there a tool available?
Has anyone tried a similar thing? I would love to hear about your implementations. Details and examples would be very much helpful. Thanks!
Since you mention EMR: it takes input from a folder in S3 storage, so you can use your preferred language's library to push data to S3 and analyse it later with EMR jobs. For example, in Python one can use boto.
There are even drivers that allow you to mount S3 storage as a device, but some time ago all of them were too buggy to use in production systems. Maybe things have changed since then.
EMR FAQ:
Q: How do I get my data into Amazon S3?
You can use Amazon S3 APIs to upload data to Amazon S3. Alternatively, you can use many open source or commercial clients to easily upload data to Amazon S3.
Note that EMR (as well as S3) implies additional costs, and its usage is justified for really big data. Also note that it is always beneficial to have relatively large files, both in terms of Hadoop performance and storage costs.
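As a rough sketch of both points (pushing data with boto and keeping files large), a consumer could buffer incoming messages to a local file and only upload it to S3 once it reaches a decent size; the bucket name, key prefix, and the 128 MB threshold below are placeholders:

    # Sketch: buffer messages locally and ship them to S3 in large files.
    # Bucket name, key prefix, and the 128 MB roll size are placeholders.
    import os
    import time
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-analytics-bucket"
    ROLL_SIZE = 128 * 1024 * 1024  # large files keep Hadoop/EMR happy
    BUFFER_PATH = "/tmp/events.current"

    def append_message(message: bytes) -> None:
        with open(BUFFER_PATH, "ab") as f:
            f.write(message + b"\n")
        if os.path.getsize(BUFFER_PATH) >= ROLL_SIZE:
            flush()

    def flush() -> None:
        key = f"events/{int(time.time())}.log"  # e.g. partition keys by timestamp
        s3.upload_file(BUFFER_PATH, BUCKET, key)
        os.remove(BUFFER_PATH)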
I have read many questions/comments regarding saving images in the DB or on the file system on the server side. However, I'm still confused. For now I allow users to upload images (limited to 10 MB each); I save each image in a server folder and serve it via an Apache context path configuration pointed at that location. However, due to the number of images and the high load, we want to provide load balancing and failover functionality. So I have two options:
Add code to replicate the uploaded image to all servers, or use rsync to do that.
Use CouchDB or MongoDB and save the image as an attachment of a document, so I get replication functionality out of the box.
Can anyone show me the pros/cons of these approaches? Can CouchDB/MongoDB match the read performance of the file system?
You can also store the files in a distributed file system. The benefit over a DB-backed image server is that you do not have to alter the application. Obviously, storing all the data the same way, including images, may be a benefit for you, but changing the architecture of an already working system may also be problematic.
For example, GlusterFS may be installed on top of a "normal" file system to give you distributed features while minimizing changes to the system itself. Via its plugins (translators), it is supposed to support all the features you would expect from a cloud system: replication, load balancing, striping of files into distributed parts, and failover.
Can CouchDB/MongoDB match the read performance of the file system?
No, there will be a lag between file system timings and database timings; this is an unfortunate reality.
I have no idea about your current setup, load, and performance, so I cannot really advise on what to do; however, Apache isn't really a good image server anyway.
Your best bet might be to look into a CDN cache for your images.
I have a load-balanced environment with over 10 web servers running IIS. All websites access a single file storage that hosts all the pictures. We currently have 200 GB of pictures, stored in directories of 1000 images each. Right now all the images are on a single storage device (RAID 10) connected to a single server that acts as the file server. All web servers are connected to the file server on the same LAN.
I am looking to improve the architecture so that we would have no single point of failure.
I am considering two alternatives:
Replicate the file storage to all of the webservers so that they all access the data locally
Replicate the file storage to a second storage device, so that if something happens to the current one, we can switch over to it.
Obviously the main operations on the file storage are reads, but there are also a lot of write operations. What do you think is the preferred method? Any other ideas?
I am currently ruling out the use of a CDN, as it would require an architecture change in the application which we cannot make right now.
Certain things I would normally consider before going for an architecture change are:
What are the issues with the current architecture?
What am I doing wrong with the current architecture? (If it has been working for a while, minor tweaks will normally solve a lot of issues.)
Will it allow me to grow easily? (There will always be an upper limit; based on the past growth of data, you can plan for it effectively.)
Reliability
Ease of maintenance / monitoring / troubleshooting
Cost
200 GB is not a lot of data. You can go for a home-grown solution, or use something like a NAS, which will allow you to expand later on, and keep a hot-swappable replica of it.
Replicating to the storage of all the web servers is a very expensive setup, and since, as you said, there are a lot of write operations, it will have a large overhead in replicating to all the servers (which will only increase with the number of servers and growing data). There is also the issue of stale data being served by one of the other nodes. Apart from that, troubleshooting replication issues will be a mess with 10 and growing nodes.
Unless the lookup/read/write of files is very time-critical, replicating to all the web servers is not a good idea. Web users will hardly notice a difference of 100-200 ms in load time.
There are some enterprise solutions for this sort of thing, but I don't doubt that they are expensive. A NAS doesn't scale well, and you still have a single point of failure, which is not good.
There are some ways that you can write code to help with this. You could cache the images on the web servers the first time they are requested; this will reduce the load on the image server.
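As a rough sketch of that pull-through cache (shown in Python for brevity; the same logic applies in whatever your IIS sites are written in), with the image server URL and cache directory as placeholders:

    # Pull-through cache sketch: serve from local disk if present, otherwise
    # fetch once from the central image server and keep a copy.
    # Image server URL and cache directory are placeholders.
    import os
    import urllib.request

    IMAGE_SERVER = "http://fileserver.internal/images"
    CACHE_DIR = "/var/cache/images"

    def get_image(name: str) -> bytes:
        cached = os.path.join(CACHE_DIR, name)
        if os.path.exists(cached):
            with open(cached, "rb") as f:
                return f.read()
        # Cache miss: hit the central file server, then store the copy locally.
        with urllib.request.urlopen(f"{IMAGE_SERVER}/{name}") as resp:
            data = resp.read()
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cached, "wb") as f:
            f.write(data)
        return data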
You could set up a master-slave arrangement, so that you have one main image server and other servers that copy from it. You could load-balance these, and put some logic in your code so that if a slave doesn't have a copy of an image, you check the master. You could also assign them a priority order so that if the master is not available, the first slave becomes the master.
Since you have so little data in your storage, it makes sense to buy several big HDDs, or to use the free space on your web servers to keep copies. It will reduce the strain on your backend storage system, and when it fails, you can still deliver content to your users. Even better, if you need to scale (more downloads), you can simply add a new server and the stress on your backend won't change much.
If I had to do this, I'd use rsync or unison to copy the image files into the exact same path on the web servers as they occupy on the storage device (this way, you can swap the copy out for a network file system mount at any time).
Run rsync every now and then (for example after any upload, or once a night; you'll know best which schedule fits you).
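For example, a small push script (sketched in Python; host names and paths are placeholders) could fan rsync out to every web server after an upload or on a schedule:

    # Sketch: mirror the image directory to every web server with rsync.
    # Host names and paths are placeholders.
    import subprocess

    WEB_SERVERS = ["web01", "web02", "web03"]   # ... through web10
    SRC = "/mnt/storage/images/"                # trailing slash: copy contents
    DEST = "/var/www/images/"

    for host in WEB_SERVERS:
        # -a preserves permissions/timestamps, -z compresses, --delete mirrors removals
        subprocess.run(
            ["rsync", "-az", "--delete", SRC, f"{host}:{DEST}"],
            check=True,
        )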
A more versatile solution would be to use a P2P protocol like BitTorrent. This way, you could publish all the changes on the storage backend to the web servers, and they'd optimize the updates automatically.