I have an Elastic Beanstalk environment with EC2 instances. Since these are volatile, we can't rely on them to store files.
This is a CodeIgniter app using the Twig template system, and users can edit the template files. Currently these template files are stored on the file system (VPS).
When we migrate to Elastic Beanstalk, we can no longer store the template files as we do now.
What's the best approach to this? S3? ElastiCache (Memcached)?
From your description it seems you have two main requirements -
Frequent read access
Durable persistence
Like you said, using filesystem storage is not scalable or reliable for your use case.
I think you should consider using ElastiCache backed by DynamoDB. Since you care about persistence, ElastiCache alone is not a feasible solution; DynamoDB provides the durability that ElastiCache does not.
DynamoDB has low read and write latencies, so it may turn out to be fast enough for your use case even before the cache is involved.
Here is a presentation on this topic that you might find helpful: http://dig.csail.mit.edu/2013/Talks/dig-seminar-0926-daniela.pdf
Your DynamoDB schema will depend on how your data is structured and on whether you want to maintain a history of file edits, etc. You can also just store each file in S3 and cache it using ElastiCache, but that may not give you enough flexibility: S3 has no partial writes, so you have to rewrite the entire object every time a file is edited. With DynamoDB you can structure your storage so that you can edit chunks of a file on demand. Also remember that when this answer was written, S3 reads after an overwrite were only eventually consistent (S3 has provided strong read-after-write consistency since December 2020). DynamoDB supports both strongly and eventually consistent reads.
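To make the "cache backed by a durable store" idea concrete, here is a minimal sketch of cache-aside reads with write-through writes. It is not tied to any AWS SDK: `TemplateStore` is a hypothetical name, and plain dicts stand in for ElastiCache (`cache`) and a DynamoDB table (`durable`), so treat it as an illustration of the pattern rather than a working integration.

```python
class TemplateStore:
    """Cache-aside reads and write-through writes over two key-value stores."""

    def __init__(self, cache, durable):
        # Assumption: both arguments are dict-like stand-ins. `cache` plays the
        # role of ElastiCache, `durable` the role of a DynamoDB table.
        self.cache = cache
        self.durable = durable

    def get(self, name):
        # Serve from the cache when possible.
        if name in self.cache:
            return self.cache[name]
        # Cache miss: read the durable copy (KeyError if it does not exist)
        # and repopulate the cache for the next reader.
        body = self.durable[name]
        self.cache[name] = body
        return body

    def put(self, name, body):
        # Write the durable copy first so an edit is never cache-only.
        self.durable[name] = body
        self.cache[name] = body


store = TemplateStore(cache={}, durable={})
store.put("home.twig", "<h1>{{ title }}</h1>")
store.cache.clear()  # simulate a cache node being replaced
print(store.get("home.twig"))
```

Losing the cache only costs one slower read per template; the edit itself survives because it was written to the durable store first.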
Related
I'm currently working on a traditional monolith application, but I am in the process of breaking it up into Spring microservices managed by Kubernetes. The application allows the uploading/downloading of large files, and these files are normally stored on the host filesystem. I'm wondering what would be the most viable method of persisting these files in a microservice architecture?
You have a bunch of different options; Googling your question you'll find many answers, for any budget and taste. Basically you'd want high-availability storage like AWS S3. You could set up your own dedicated server to store these files as well if you wanted to cut costs, but then you'd have to worry about backups and availability. If you need low-latency access to these files then you'd want to have them behind a CDN as well.
We are mostly on-prem, and we ended up using NFS. It was the path of least resistance, but it is probably not the most performant option, and making it highly available is tough. If you have the chance, I agree with Denis Pshenov that an S3-like system, for example MinIO, might be a better alternative.
Maybe you should have a look at the rook project (https://rook.io/). It's easy to set up and provides different kinds of storage and persistence technologies to your CNAs.
There are many places to store your data. The choice depends on the budget you are able to spend (holding duplicate data also means more storage, which costs money) and mostly on your business requirements.
Is all data needed at all time?
Are there geo/region-related cases?
How fast does a read / write operation need to be?
Do things need to be cached?
Stateful or stateless?
Are there operational requirements? How should this be maintained?
...
Apart from this, your microservices should not know where the data is actually stored. In Kubernetes you can use Persistent Volumes (https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can link to storage from your cloud provider or something else. The microservice should just mount the volume and be able to treat it like a local file system.
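From the service's point of view, that can be as simple as reading and writing under a directory it is told about. A minimal sketch, assuming the mount path arrives through a hypothetical `UPLOAD_DIR` environment variable (the helper names are made up too); the code has no idea whether the directory is a local disk, NFS, or a cloud volume:

```python
import os
import tempfile
from pathlib import Path


def storage_root():
    # Assumption: UPLOAD_DIR is a hypothetical variable that would be set to
    # the volume's mount path. Falls back to the system temp dir for local runs.
    return Path(os.environ.get("UPLOAD_DIR", tempfile.gettempdir()))


def save_upload(name, data, root=None):
    root = Path(root) if root is not None else storage_root()
    # Keep only the final path component so a client-supplied name
    # cannot escape the storage directory.
    target = root / Path(name).name
    target.write_bytes(data)
    return target


def load_upload(name, root=None):
    root = Path(root) if root is not None else storage_root()
    return (root / Path(name).name).read_bytes()


with tempfile.TemporaryDirectory() as demo_dir:
    saved = save_upload("report.pdf", b"%PDF-demo", root=demo_dir)
    print(saved.name, len(load_upload("report.pdf", root=demo_dir)))
```

Swapping the backing storage then becomes a deployment change (a different volume behind the same path) rather than a code change.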
Note that the Cloud Provider Storages already include solutions for scaling, concurrency etc. So I would probably use a single Blob-Storage under the hood.
However, it has to be said that there is a trend to treat a microservice as a package of data and logic coupled together, and to accept duplicating data, which leads to better scalability.
See for more information:
http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/
https://github.com/katopz/best-practices/blob/master/best-practices-for-building-a-microservice-architecture.md#stateless-service-instances
https://12factor.net/backing-services
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html
I need to create a food ordering service using microservices. It must be scalable and clustered, and ordering takes several steps, so I need to store user data between steps / requests.
What is a good approach to keeping state and user data? Store it in a DB? A cache? Shared memory?
Are there any tutorials on best practices for this?
(I'm going to use Spring / Spring Boot and modules.)
Anything that you cannot afford to lose (usually the business data) will go in DB and can be parallelly cached in an in-memory DB like Redis that has a cache eviction algorithm inbuilt.
Anything that, if lost, is not a big deal (usually the technical things that are not directly linked with the business data) can go only in an in-memory DB.
Since you are using Spring, you could probably use something like Redis with Spring Data Redis. There are already known Spring solutions (such as this) to fall back on api calls to fetch data from DB if the Redis server goes down. You can also run multiple Redis instances behind Redis Sentinel to provide failover. Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes. Also, you can configure Redis to persist the data in file system once daily or so to backup the cache data for disaster recovery.
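The fall-back idea is language-independent, so here is a small Python sketch of it rather than Spring code. Everything in it is a stand-in: `FlakyCache` imitates a Redis client that can become unreachable, `CacheUnavailable` is a made-up exception, and a dict plays the database. It only shows the control flow: try the cache, and if the cache is down, keep serving from the DB.

```python
class CacheUnavailable(Exception):
    """Raised by the cache stand-in when it cannot be reached."""


class FlakyCache:
    # Hypothetical in-memory stand-in for a Redis client.
    def __init__(self):
        self.data = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise CacheUnavailable(key)
        return self.data.get(key)

    def set(self, key, value):
        if not self.up:
            raise CacheUnavailable(key)
        self.data[key] = value


def load_order_state(order_id, cache, db):
    try:
        cached = cache.get(order_id)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Cache is down: the DB is the source of truth, so read it directly.
        return db[order_id]
    state = db[order_id]  # cache miss
    try:
        cache.set(order_id, state)
    except CacheUnavailable:
        pass  # failing to repopulate the cache must not fail the request
    return state


db = {"order-1": {"step": 2, "items": ["pizza"]}}
cache = FlakyCache()
print(load_order_state("order-1", cache, db))  # miss, then cached
cache.up = False
print(load_order_state("order-1", cache, db))  # served from the DB
```

This only works for data that also lives in the DB, which is the first category above; cache-only data is simply gone while the cache is down.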
If you are looking for a fully managed service, AWS provides "Step Functions" to satisfy your stateful requirements: https://stackoverflow.com/questions/tagged/aws-step-functions
I am currently a bit confused about which database to use for geolocation tracking. What I want to do is update the location of a group of people every 30 seconds. The data is sent to the server using WebSockets. Each user has an ID in the database, and I would like to update the location of that user every 30 seconds. After doing so, I would like to query these locations and show them in real time to another group of users. My question is: what are the advantages and disadvantages of DynamoDB and Redis? Which one is faster and scales more easily? I am expecting almost 2 million QPS.
Both can scale fairly well, but this depends heavily on your use case and architecture.
DynamoDB is a cloud based NoSQL storage system, and Redis is an in memory data structure store. This means that queries to DynamoDB would involve making a roundtrip to Amazon's servers, while queries to Redis would be over RAM (so, much, much lower latency).
As a consequence of the above, the amount of data you can store in Redis would be limited by the RAM available on your hardware. That said, in the event of Redis or your hardware crashing for some reason, you would have to be content with some level of data loss. You can mitigate this somewhat by configuring Redis persistence so that Redis writes to disk regularly (either every N seconds or by manually triggering a write in your code) and mitigate further by then copying those writes to S3 or elsewhere. This trades performance (depending on your scale) for data safety somewhat due to I/O latency. See the documentation for Redis persistence and this blog post by the GitHub engineering team mentioning their decision to remove Redis persistence for performance reasons.
Meanwhile all of the issues above are abstracted away for you by DynamoDB since AWS handles availability for you behind the scenes. You are really only limited by how much you can afford and usage (read/write per second) limits.
DynamoDB does not have native support for querying and inserting geospatial data (there is a library for it, but it seems to be unmaintained); Redis does. With DynamoDB you could write your own code for this.
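If you do end up writing the geospatial part yourself, the core of a "who is within N km" query is a great-circle distance check. A rough sketch using the haversine formula follows; the function names and the in-memory `locations` dict are made up for illustration, and the linear scan would need a real index (geohash buckets, for example) long before 2 million QPS.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def users_within(locations, lat, lon, radius_km):
    # `locations` maps a user id to its latest (lat, lon) pair.
    # Linear scan: fine for a demo, not for millions of users.
    return sorted(
        uid
        for uid, (ulat, ulon) in locations.items()
        if haversine_km(lat, lon, ulat, ulon) <= radius_km
    )


locations = {"alice": (0.0, 0.5), "bob": (0.0, 2.0)}
print(users_within(locations, 0.0, 0.0, 100))
```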
DynamoDB does not have support for namespacing, or rather, DynamoDB is namespaced by your AWS account meaning you would not be able to maintain a separate DynamoDB instance with the same table names (say for production vs dev data) on the same AWS account. Redis doesn't either, but you can trivially spin up a separate Redis instance for this.
See also Redis MEMORY USAGE command and Redis memory optimization docs.
I'm just beginning to learn about Big Data and I'm interested in Hadoop. I'm planning on building a simple analytics system to make sense of certain events that occur on my site.
So I'm planning to have code (both front and back end) to trigger some events that would queue messages (most likely with RabbitMQ). These messages will then be processed by a consumer that would write the data continuously to HDFS. Then, I can run a map reduce job at any time to analyze the current data set.
I'm leaning towards Amazon EMR for the Hadoop functionality. So my question is this, from my server running the consumer, how do I save the data to HDFS? I know there's a command like "hadoop dfs -copyFromLocal", but how do I use this across servers? Is there a tool available?
Has anyone tried a similar thing? I would love to hear about your implementations. Details and examples would be very much helpful. Thanks!
Since you mention EMR: it takes its input from a folder in S3 storage, so you can use your preferred language's library to push data to S3 and analyse it later with EMR jobs. For example, in Python one can use boto.
There are even drivers that let you mount S3 storage as a device, but some time ago all of them were too buggy to use in production systems. Maybe things have changed with time.
EMR FAQ:
Q: How do I get my data into Amazon S3? You can use Amazon S3 APIs to
upload data to Amazon S3. Alternatively, you can use many open source
or commercial clients to easily upload data to Amazon S3.
Note that EMR (as well as S3) implies additional costs, and that its usage is justified only for really big data. Also note that it is always beneficial to have relatively large files, both in terms of Hadoop performance and storage costs.
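One way to get those larger files from a queue consumer is to buffer events and upload them in batches instead of one object per message. Below is a hedged sketch: `EventBatcher`, the key naming scheme, and the 64 MB default are all made up for illustration, and `upload` is any callable taking `(key, body)`. With boto3 that callable could wrap `s3.put_object(Bucket=..., Key=key, Body=body)`; here a dict stands in for the bucket.

```python
import json


class EventBatcher:
    """Buffers events as JSON lines and uploads them in large batches."""

    def __init__(self, upload, max_bytes=64 * 1024 * 1024):
        # Assumption: `upload` is a caller-supplied callable(key, body_bytes).
        self.upload = upload
        self.max_bytes = max_bytes
        self.lines = []
        self.size = 0
        self.seq = 0

    def add(self, event):
        line = json.dumps(event, sort_keys=True).encode("utf-8") + b"\n"
        self.lines.append(line)
        self.size += len(line)
        # Upload once the buffered batch reaches the size threshold.
        if self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        if not self.lines:
            return  # nothing buffered, nothing to upload
        key = f"events/batch-{self.seq:06d}.jsonl"
        self.upload(key, b"".join(self.lines))
        self.seq += 1
        self.lines, self.size = [], 0


bucket = {}  # stand-in for an S3 bucket
batcher = EventBatcher(lambda key, body: bucket.__setitem__(key, body),
                       max_bytes=30)
batcher.add({"type": "click"})
batcher.add({"type": "view"})
batcher.flush()  # also call this on shutdown so the tail is not lost
print(sorted(bucket))
```

The trade-off is that events sitting in the buffer are lost if the consumer crashes, so you would normally acknowledge the queue messages only after a successful flush.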
I'm considering using HDFS as a horizontally scalable file storage system for our client video hosting service. My main concern is that HDFS wasn't developed for this need; it is more "an open source system currently being used in situations where massive amounts of data need to be processed".
We don't want to process the data, just store it: build something like a small internal Amazon S3 analog on top of HDFS.
Probably an important point is that the stored files will be quite big, from 100 MB to 10 GB.
Has anyone used HDFS for such purposes?
If you are using an S3 equivalent then it should already provide a distributed, mountable file system, no? Perhaps you can check out OpenStack at http://openstack.org/projects/storage/.
The main disadvantage would be the lack of POSIX semantics. You can't mount the drive, and you need special APIs to read and write from it. The Java API is the main one. There is a project called libhdfs that makes a C API over JNI, but I've never used it. Thriftfs is another option.
I'm also not sure about the read performance compared to other alternatives. Maybe someone else knows. Have you checked out other distributed filesystems like Lustre?
You may want to consider MongoDB for this. They have GridFS which will allow you to use it as a storage. You can then horizontally scale your storage through shards and provide fault tolerance with replication.
http://docs.mongodb.org/manual/core/gridfs/
http://docs.mongodb.org/manual/replication/
http://docs.mongodb.org/manual/sharding/