Amazon EMR vs EC2 for off-loading BI & Analytics, anno 2018 - amazon-ec2

I looked at some posts but they are a bit older on this topic. I have read the AWS and other blogs as well, but ...
My simple non-programming question for AWS in today's environment is:
If we have a DWH of, say, 20+ TB and growing that we want to off-load to the cloud, as many are doing, and we have a regular daily DWH feed with some mutations, should we use EMR or EC2 on AWS?
Moreover, it is a completely batch environment, with no streaming or Kafka requirements. Spark will be used for sure.
EMR seems great, but I have the impression it is for Data Scientists to do whatever they want whenever they want. For more regular ETL I am wondering if this is suited. The appeal of less management is certainly a boon.
In the docs on AWS I cannot find a definitive answer, hence this question.
My impression is that, with AMIs and bootstrapping of your own services, EMR is certainly one way to go, and that EC2 would be more for a Kafka cluster, or for when you really want to control your own environment and tooling completely, based on, say, Cloudera's distribution.

So, the answer here is for others who may need to assess which options apply for off-loading. It is actually not so hard in hindsight. Note that Azure and other non-AWS vendors are not considered here. In a nutshell, then:
EMR is an AWS-managed Hadoop service (PaaS)
EMR provides the tools that Amazon feels will do the job for Data Science, Analytics, etc. But you can "bootstrap" your own requirements / software, if needed.
EMR clusters comprise short-running EC2 instances, and provisioning happens behind the scenes, as it were. Patches are applied easily this way. You can scale up and down very easily as well. Compute and storage are decoupled, which allows this scaling to occur easily.
Elasticity obviously applies more to compute; data needs to be there for as long as you need it. EMR relies on S3 to save results for the longer term. After saving, you terminate the EMR cluster and, when required, start a new EMR cluster and attach your saved S3 results, if applicable, to this new cluster. EMRFS allows S3 to look like part of HDFS and provides easy access. EBS-backed storage also exists, which allows results to be saved to storage tied to the EC2 instance for the duration of that instance.
It's a new way of doing things. One has access to "spot" instances at, obviously, spot prices. Billing is less predictable as it depends on what you do, but it could well be cheaper overall, provided it is governed correctly. An example of this is Expedia's management of EMR clusters.
Ad-hoc querying is not well served by S3, so you will need another AWS managed service such as Presto / Athena or Redshift (Spectrum), which is an additional set of services and cost. Just mentioning this due to slower S3 performance.
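As a rough illustration of the "start a cluster, run the batch, save to S3, terminate" pattern described above, here is a minimal sketch that builds the arguments for boto3's EMR `run_job_flow` call. The function names, the release label, the instance type, the step name and the S3 URIs are all assumptions made for the example, and the default IAM role names are only valid if those default roles exist in your account.

```python
def build_transient_cluster_request(
    name,
    log_uri,
    script_uri,
    release_label="emr-5.20.0",  # assumption: use a release available to you
    instance_type="m4.xlarge",   # assumption: size this for your own workload
    core_count=2,
    spot_task_count=0,
):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if spot_task_count > 0:
        # Task nodes hold no HDFS data, so they are the usual place for spot.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": spot_task_count}
        )
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": groups,
            # Transient cluster: shut down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "daily-dwh-load",  # hypothetical step name
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(emr_client, request):
    """Submit the request; emr_client is expected to be boto3.client('emr')."""
    response = emr_client.run_job_flow(**request)
    return response["JobFlowId"]
```

The Spark script itself would read from and write to `s3://` paths, so nothing of value lives on the cluster when it terminates.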
EC2 (IaaS) is more "traditional"
You elect to take this path if you want to provision EC2 instances yourself, as you want control of the software and of what goes into your Hadoop environment.
EC2 instances (VMs) have compute power, memory and EBS-backed temporary storage, and use EFS for the file systems for HDFS or, say, Kudu, and S3. S3 is not as easy to access as it is via EMRFS with EMR.
You install and maintain the Hadoop software yourself and apply patches, etc. Management of Hadoop on these EC2 instances is of course less of a big deal with Cloudera and Cloudbreak.
Billing is more predictable one could argue, on the basis of up-time of an EC2 instance, and billing applies continuously for any persisted storage.
An important point: one can combine an EC2 approach for, say, DWH loading on Hadoop (if "off-loading") with EMR clusters for Data Science.
MR Data Locality
This is not adhered to in either approach unless bare-metal options are used, but then the elasticity (the E) that allows cost savings is harder to achieve in both cases.
Data locality seems to be assumed by most, but it has actually gone with cloud computing, as expected, and performance still seems quite OK for Data Science etc.
For ad hoc querying, Amazon say they are not so sure about S3, and in my experience at least, using EFS for HDFS/Parquet or Kudu works pretty quickly, to say the least.

Related

Choosing Hadoop solution for Big Data project - Pricing Options

I have to use Hadoop for my research work and I am deciding on the best option to start with. So far I have ended up going with Cloudera. I've downloaded the QuickStart VM and started working through different tutorials.
The issue is that my system can't handle it: it performs very slowly, and I think it might just stop working once I feed it all the data and run other services.
I was advised to go for a cloud service with a 4-node cluster. Can someone please help me by suggesting the best option and the estimated pricing to consider? A 1-year plan might be enough to complete my research.
Thanks.
If you are a Linux user, just download the individual components (like HDFS, MR1, YARN, HBase, Hive, etc.) from the Cloudera archives instead of loading the Cloudera QuickStart VM.
If you want to try the 4 node cluster, easiest option is to use cloud.
There are plenty of cloud providers. I have personally used AWS, Google Cloud, Microsoft Azure, IBM SmartCloud. Out of which, AWS is the best to start with.
It is pay-as-you-go (you pay for what you use). I recommend a decent EC2 setup (4 x m3.large machines):
Type: m3.large, CPU: 2, RAM: 7.5 GB, Storage: 1 x 32 GB SSD, Price: $0.133 per hour (AWS Pricing)
If you plan to do the research for one year, I recommend that you go for VPC.
Cons of AWS EC2:
If you launch a machine in EC2, the moment you restart it, your IP and hostname will change.
In AWS VPC, your IP and hostname will remain the same.
If you run 4 machines 24x7 for one month, it costs you $389.44.
You can calculate the AWS cost yourself.
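The monthly figure above is just price x machines x hours. A small sketch of that arithmetic follows; the 732-hour month (30.5 days) is an assumption on my part, and the result lands within a few cents of the $389.44 quoted, with the exact value depending on the hours-per-month convention used.

```python
HOURLY_PRICE_USD = 0.133  # m3.large on-demand price quoted above
INSTANCE_COUNT = 4
HOURS_PER_MONTH = 732     # assumption: 30.5 days x 24 hours


def monthly_cost(hourly_price, instances, hours=HOURS_PER_MONTH):
    """Return the on-demand cost in USD, rounded to cents."""
    return round(hourly_price * instances * hours, 2)


print(monthly_cost(HOURLY_PRICE_USD, INSTANCE_COUNT))  # prints 389.42
```

Changing `hours` lets you see what you would save by running the cluster only while you are actually working.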
As best as I can see you have two paths:
Set up Hadoop on a cloud service provider (e.g. Amazon's EC2 or Red Hat's OpenShift).
Use Hadoop-as-a-service (e.g. Amazon's EMR or Microsoft's HDInsight).
The first path, setting up Hadoop in a cloud service provider will require you to become a semi-competent Hadoop administrator. If that's your goal, great! However you'll spend a great deal of time learning the necessary skills and mindset to become that. I don't suspect that that is your goal.
The second path is the one I'd recommend out of these two. Using Hadoop-as-a-service you will get up and running faster, but it will cost more up front and on an ongoing (per-hour) basis. You'll still probably save money because you'll be spending less time troubleshooting your Hadoop cluster and more time doing the work you wanted to do in the first place.
I have to wonder, if you can even fit your dataset on your laptop, why are you using big data tools in the first place? True, they'll work. However, Big Data is at least partially defined as data sets and computational problems that just don't fit on a single machine.

Designing an Analytics System with Hadoop

I'm just beginning to learn about Big Data and I'm interested in Hadoop. I'm planning on building a simple analytics system to make sense of certain events that occur on my site.
So I'm planning to have code (both front and back end) to trigger some events that would queue messages (most likely with RabbitMQ). These messages will then be processed by a consumer that would write the data continuously to HDFS. Then, I can run a map reduce job at any time to analyze the current data set.
I'm leaning towards Amazon EMR for the Hadoop functionality. So my question is this, from my server running the consumer, how do I save the data to HDFS? I know there's a command like "hadoop dfs -copyFromLocal", but how do I use this across servers? Is there a tool available?
Has anyone tried a similar thing? I would love to hear about your implementations. Details and examples would be very much helpful. Thanks!
Since you mention EMR: it takes its input from a folder in S3 storage, so you can use your preferred language's library to push data to S3 and analyse it later with EMR jobs. For example, in Python one can use boto.
There are even drivers that allow you to mount S3 storage as a device, but some time ago all of them were too buggy to use in production systems. Maybe things have changed with time.
EMR FAQ:
Q: How do I get my data into Amazon S3? You can use Amazon S3 APIs to upload data to Amazon S3. Alternatively, you can use many open source or commercial clients to easily upload data to Amazon S3.
Note that EMR (as well as S3) implies additional costs, and that its usage is justified for really big data. Also note that it is always beneficial to have relatively large files, both in terms of Hadoop performance and storage costs.
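To make the two points above concrete (push to S3 from your consumer, and prefer fewer, larger files), here is a minimal sketch of a consumer-side buffer that collects events and uploads them as one object per batch. It assumes a boto3 S3 client is passed in; the class name, key layout and 64 MB threshold are assumptions for illustration, not recommendations from the EMR docs.

```python
import json
import time


def make_key(prefix, batch_start_epoch, sequence):
    """Build an object key such as events/2018/01/31/batch-000001.jsonl."""
    day = time.strftime("%Y/%m/%d", time.gmtime(batch_start_epoch))
    return "{}/{}/batch-{:06d}.jsonl".format(prefix.strip("/"), day, sequence)


class S3BatchWriter:
    """Buffer JSON events in memory and upload them as one larger S3 object."""

    def __init__(self, s3_client, bucket, prefix, max_bytes=64 * 1024 * 1024):
        self.s3 = s3_client  # expected to be boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.sequence = 0
        self._lines = []
        self._size = 0
        self._started = None

    def add(self, event):
        """Queue one event; upload and return the key once the batch is full."""
        line = (json.dumps(event) + "\n").encode("utf-8")
        if self._started is None:
            self._started = time.time()
        self._lines.append(line)
        self._size += len(line)
        if self._size >= self.max_bytes:
            return self.flush()
        return None

    def flush(self):
        """Upload whatever is buffered; return the key, or None if empty."""
        if not self._lines:
            return None
        self.sequence += 1
        key = make_key(self.prefix, self._started, self.sequence)
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=b"".join(self._lines))
        self._lines, self._size, self._started = [], 0, None
        return key
```

The RabbitMQ consumer would call `add()` per message and `flush()` on shutdown or on a timer; events still in the buffer are lost if the process dies, so acknowledge messages only after a successful flush if that matters.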

Amazon EC2 consideration - redundancy and elastic IPs

I've been tasked with determining if Amazon EC2 is something we should move our ecommerce site to. We currently use Amazon S3 for a lot of images and files. The cost would go up by about $20/mo for our host costs, but we could sell our server for a few thousand dollars. This all came up because right now there are no procedures in place if something happened to our server.
How reliable is Amazon EC2? Is the redundancy good? I don't see anything about this in the FAQ, and it's a problem on our current system that I'm looking to solve.
Are elastic IPs beneficial? It sounds like you could point DNS to that IP and then on Amazon's end, reroute that IP address to any EC2 instance so you could easily get another instance up and running if the first one failed.
I'm aware of scalability, it's the redundancy and reliability that I'm asking about.
At work, I've had something like 20-40 instances running at all times for over a year. I think we've had 1-3 alert emails come from Amazon suggesting that we terminate and boot another instance (presumably because they are detecting possible failure in the underlying hardware). We've never had an instance go down suddenly, which seems rather good.
Elastic IPs are amazing and are part of the solution. The other part is being able to rapidly bring up new instances. I've learned that you shouldn't care about instances going down, that it's more important to use proper load balancing and be able to bring up commodity instances quickly.
Yes, it's very good. If you aren't able to put together a concurrent redundancy (where you have multiple servers fulfilling requests simultaneously), using the elastic IP to quickly redirect to another EC2 instance would be a way to minimize downtime.
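The Elastic IP redirect mentioned above can be scripted. The following is only a sketch: it assumes a boto3 EC2 client, an Elastic IP allocated in a VPC (hence `AllocationId`), and a standby instance that is already running; the function names are made up for the example.

```python
def is_healthy(ec2_client, instance_id):
    """Return True if the instance is running and both status checks pass."""
    response = ec2_client.describe_instance_status(
        InstanceIds=[instance_id],
        IncludeAllInstances=True,  # also report instances that are not running
    )
    statuses = response["InstanceStatuses"]
    if not statuses:
        return False
    status = statuses[0]
    return (
        status["InstanceState"]["Name"] == "running"
        and status["InstanceStatus"]["Status"] == "ok"
        and status["SystemStatus"]["Status"] == "ok"
    )


def fail_over_if_unhealthy(ec2_client, allocation_id, primary_id, standby_id):
    """Move the Elastic IP to the standby instance when the primary is unhealthy.

    Returns the new association id, or None if no failover was needed.
    """
    if is_healthy(ec2_client, primary_id):
        return None
    response = ec2_client.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_id,
        AllowReassociation=True,  # take the address over from the primary
    )
    return response.get("AssociationId")
```

DNS keeps pointing at the same address throughout, so clients only notice the time it takes to detect the failure and re-associate.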
Yeah, I think moving from an in-house server to Amazon will definitely make a lot of sense economically. EBS-backed instances ensure that even if the machine gets rebooted, the data stored on it is not lost. And if you have a clear separation between your application and data layer and can have them on different machines, then you can build even better redundancy for your data.
For example, if you use MySQL, you can consider using the Amazon RDS service, which gives you a highly available and reliable MySQL instance, fully managed (patches and all). The application layer can then be made more resilient by having several smaller instances rather than one larger instance, through load balancing.
The cost you will save on is really hardware maintenance and the cost you would have to incur to build in disaster recovery.

Benefits of EBS vs. instance-store (and vice-versa) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm unclear as to what benefits I get from EBS vs. instance-store for my instances on Amazon EC2. If anything, it seems that EBS is way more useful (stop, start, persist + better speed) at relatively little difference in cost...? Also, is there any metric as to whether more people are using EBS now that it's available, considering it is still relatively new?
The bottom line is you should almost always use EBS backed instances.
Here's why
EBS backed instances can be set so that they cannot be (accidentally) terminated through the API.
EBS backed instances can be stopped when you're not using them and resumed when you need them again (like pausing a Virtual PC), at least with my usage patterns saving much more money than I spend on a few dozen GB of EBS storage.
EBS backed instances don't lose their instance storage when they crash (not a requirement for all users, but makes recovery much faster)
You can dynamically resize EBS instance storage.
You can transfer the EBS instance storage to a brand new instance (useful if the hardware at Amazon you were running on gets flaky or dies, which does happen from time to time)
It is faster to launch an EBS backed instance because the image does not have to be fetched from S3.
If the hardware your EBS-backed instance runs on is scheduled for maintenance, stopping and starting the instance automatically migrates it to new hardware. I was also able to move an EBS-backed instance on failed hardware by force-stopping the instance and launching it again (your mileage may vary on failed hardware).
I'm a heavy user of Amazon and switched all of my instances to EBS backed storage as soon as the technology came out of beta. I've been very happy with the result.
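Two of the points in the list above, termination protection and stop/start, map onto a handful of EC2 API calls. This is a minimal sketch assuming a boto3 EC2 client is passed in; the helper names are made up for the example.

```python
def protect_from_termination(ec2_client, instance_id):
    """Turn on API termination protection for an EBS-backed instance."""
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )


def stop_and_wait(ec2_client, instance_id):
    """Stop the instance and block until it reports the 'stopped' state."""
    ec2_client.stop_instances(InstanceIds=[instance_id])
    ec2_client.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])


def start_and_wait(ec2_client, instance_id):
    """Start the instance again and block until it is 'running'."""
    ec2_client.start_instances(InstanceIds=[instance_id])
    ec2_client.get_waiter("instance_running").wait(InstanceIds=[instance_id])


def restart_on_new_hardware(ec2_client, instance_id):
    """A stop followed by a start, which is what moves the instance to a new host."""
    stop_and_wait(ec2_client, instance_id)
    start_and_wait(ec2_client, instance_id)
```

Note that a stop/start cycle is different from a reboot: a reboot keeps the instance on the same host.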
EBS can still fail - not a silver bullet
Keep in mind that any piece of cloud-based infrastructure can fail at any time. Plan your infrastructure accordingly. While EBS-backed instances provide a certain level of durability compared to ephemeral storage instances, they can and do fail. Have an AMI from which you can launch new instances as needed in any availability zone, back up your important data (e.g. databases), and if your budget allows it, run multiple instances of servers for load balancing and redundancy (ideally in multiple availability zones).
When Not To
At some points in time, it may be cheaper to achieve faster IO on Instance Store instances. There was a time when it was certainly true. Now there are many options for EBS storage, catering to many needs. The options and their pricing evolve constantly as technology changes. If you have a significant amount of instances that are truly disposable (they don't affect your business much if they just go away), do the math on cost vs. performance. EBS-backed instances can also die at any point in time, but my practical experience is that EBS is more durable.
99% of our AWS setup is recyclable. So for me it doesn't really matter if I terminate an instance -- nothing is lost ever. E.g. my application is automatically deployed on an instance from SVN, our logs are written to a central syslog server.
The only benefit of instance storage that I see are cost-savings. Otherwise EBS-backed instances win. Eric mentioned all the advantages.
[2012-07-16] I would phrase this answer a lot differently today.
I haven't had any good experience with EBS-backed instances in the past year or so. The last downtimes on AWS pretty much wrecked EBS as well.
I am guessing that a service like RDS uses some kind of EBS as well, and that seems to work for the most part. On the instances we manage ourselves, we have got rid of EBS where possible.
We got rid of it to the extent that we moved a database cluster back to iron (= real hardware). The only remaining piece in our infrastructure is a DB server where we stripe multiple EBS volumes into a software RAID and back up twice a day. Whatever would be lost in between backups, we can live with.
EBS is a somewhat flaky technology since it's essentially a network volume: a volume attached to your server remotely. I am not negating the work done on it; it is an amazing product, since essentially unlimited persistent storage is just an API call away. But it's hardly fit for scenarios where I/O performance is key.
And in addition to how network storage behaves, all network is shared on EC2 instances. The smaller an instance (e.g. t1.micro, m1.small) the worse it gets because your network interfaces on the actual host system are shared among multiple VMs (= your EC2 instance) which run on top of it.
The larger instance you get, the better it gets of course. Better here means within reason.
When persistence is required, I would always advise people to use something like S3 to centralize between instances. S3 is a very stable service. Then automate your instance setup to a point where you can boot a new server and it gets ready by itself. Then there is no need to have network storage which lives longer than the instance.
So all in all, I see no benefit to EBS-backed instances whatsoever. I would rather add a minute to bootstrap than run with a potential SPOF.
We like instance-store. It forces us to make our instances completely recyclable, and we can easily automate the process of building a server from scratch on a given AMI. This also means we can easily swap out AMIs. Also, EBS still has performance problems from time to time.
Eric pretty much nailed it. We (Bitnami) are a popular provider of free AMIs for popular applications and development frameworks (PHP, Joomla, Drupal, you get the idea). I can tell you that EBS-backed AMIs are significantly more popular than S3-backed ones. In general I think S3-backed instances are used for distributed, time-limited jobs (for example, large-scale processing of data) where if one machine fails, another one is simply spun up. EBS-backed AMIs tend to be used for 'traditional' server tasks, such as web or database servers that keep state locally and thus require the data to be available in the case of crashing.
One aspect I did not see mentioned is the fact that you can take snapshots of an EBS-backed instance while running, effectively allowing you to have very cost-effective backups of your infrastructure (the snapshots are block-based and incremental)
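Snapshotting a running instance, as described above, amounts to finding its attached volumes and snapshotting each one. A minimal sketch, assuming a boto3 EC2 client; the function name and description format are made up, and pagination of `describe_volumes` is ignored for brevity.

```python
def snapshot_instance_volumes(ec2_client, instance_id, label="backup"):
    """Start a snapshot of every EBS volume attached to the instance.

    Returns the list of snapshot ids. The snapshots complete asynchronously.
    """
    response = ec2_client.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )
    snapshot_ids = []
    for volume in response["Volumes"]:
        volume_id = volume["VolumeId"]
        snapshot = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description="{} of {} ({})".format(label, instance_id, volume_id),
        )
        snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids
```

For a database, you would normally flush or briefly freeze writes before the call so that the snapshot is consistent at the application level, not just at the block level.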
I've had the exact same experience as Eric at my last position. Now in my new job, I'm going through the same process I performed at my last job... rebuilding all their AMIs for EBS backed instances - and possibly as 32bit machines (cheaper - but can't use same AMI on 32 and 64 machines).
EBS backed instances launch quickly enough that you can begin to make use of the Amazon AutoScaling API which lets you use CloudWatch metrics to trigger the launch of additional instances and register them to the ELB (Elastic Load Balancer), and also to shut them down when no longer required.
This kind of dynamic autoscaling is what AWS is all about - where the real savings in IT infrastructure can come into play. It's pretty much impossible to do autoscaling right with the old s3 "InstanceStore"-backed instances.
I'm just starting to use EC2 myself so not an expert, but Amazon's own documentation says:
we recommend that you use the local instance store for temporary data and, for data requiring a higher level of durability, we recommend using Amazon EBS volumes or backing up the data to Amazon S3.
Emphasis mine.
I do more data analysis than web hosting, so persistence doesn't matter as much to me as it might for a web site. Given the distinction made by Amazon itself, I wouldn't assume that EBS is right for everyone.
I'll try to remember to weigh in again after I've used both.
EBS is like the virtual disk of a VM:
Durable, instances backed by EBS can be freely started and stopped (saving money)
Can be snapshotted at any point in time, to get point-in-time backups
AMIs can be created from EBS snapshots, so the EBS volume becomes a template for new systems
Instance storage is:
Local, so generally faster
Non-networked, in normal cases EBS I/O comes at the cost of network bandwidth (except for EBS-optimized instances, which have separate EBS bandwidth)
Has limited I/O operations per second (IOPS). Even provisioned I/O maxes out at a few thousand IOPS
Fragile. As soon as the instance is stopped, you lose everything in instance storage.
Here's where to use each:
Use EBS for the backing OS partition and permanent storage (DB data, critical logs, application config)
Use instance storage for in-process data, noncritical logs, and transient application state. Example: external sort storage, tempfiles, etc.
Instance storage can also be used for performance-critical data, when there's replication between instances (NoSQL DBs, distributed queue/message systems, and DBs with replication)
Use S3 for data shared between systems: input dataset and processed results, or for static data used by each system when launched.
Use AMIs for prebaked, launchable servers
Most people choose to use an EBS-backed instance as it is stateful. It is safer because everything you have running and installed inside it will survive a stop/start or any instance failure.
Instance store is stateless: you lose it, with all the data inside, in case of any instance failure. However, it is free and faster because the instance volume is tied to the physical server where the VM is running.
For someone new to all this who accidentally landed here:
As of now, all AMIs in the Quick Start section are EBS-backed.
There's also a good explanation in the official docs of the difference between EBS and instance store.
If you run multiple instances and scheduling them is one of your priorities for avoiding unexpected charges, I would recommend not using the instance store.
As explained in the documentation on EBS volumes and in the answers from j2d3 and Siddharth Sharma, an instance-store instance can run for as long as you want, but it cannot be stopped. This means that it cannot be scheduled with an automatic start/stop or instance recovery.
Moreover, for this kind of scheme there is also no benefit in using EBS-backed instances on Elastic Beanstalk, as it is designed to ensure that all the resources you need are kept running. It will always automatically relaunch any services that you stop.
As for the rest, weighing the total charges for the VPC, EBS and ELB that come on top of EC2-Classic, EC2-VPC with ELB is mostly the best choice because, unlike on EC2-Classic, a stopped instance retains its associated Elastic IP addresses and the EBS volume is kept automatically.
In conclusion, taking the main part of your question:
it seems that EBS is way more useful (stop, start, persist + better speed) at relatively little difference in cost...?
The answer is yes, and if your instance is EBS-backed, it can be stopped. It will remain in your account and you will not be charged for it while it is stopped. You will be charged only for the volume, although EBS is charged hourly. You may also consider that, among all the available types, you have the flexibility to resize the EBS volume.
Besides the benefits already listed by Eric, be aware that in terms of cost S3 may or may not be cheaper than EBS. I agree that there is relatively little difference in cost if you keep both types of instance running all the time within the same platform and application architecture.
However, if the scenario is to run the application on a lower-cost service, pull all unhandled tasks and roll them over to the VPC/EBS instance via a pipeline or Lambda for a short time, say <1 hour a day, which is impossible to do when you use an instance store, then it will be a different story.

What is the point of instance storage on EC2?

I'm building some AMIs from one of the basic ones on EC2. One of the instance types is running Tomcat and contains a lot of Lucene indexes; another instance will be running MySQL and have correspondingly large data requirements with it.
I'm trying to define the best way to include those in the AMIs that I'm authoring. If I mount /mnt/lucene and /mnt/mysql, those don't get included in the AMI generated. So it seems to me like the preferred way to deal with those is to have an EBS for each one, take snapshots and spin up instances which have their own EBS based on the most recent snapshots. Is that the best way to proceed?
What is the point of instance storage? It seems like it will only work as a temporary storage area - what am I missing? Presumably there is a reason Amazon offer up to 800GB of storage on standard large instances...
Instance storage is faster than EBS. You don't mention what you will be doing with your instances, but for some applications speed might be more important than durability. For an application that is primarily doing data mining on a large database, having a few hundred gigs of local, fast storage to host the DB might be beneficial. Worker nodes in a MapReduce cluster might also be great candidates for instance storage, depending on what type of job it was.
Another point of instance storage is that it's independent. There have been many EBS outages (google e.g. "site:aws.amazon.com ebs outage"). If the instance runs at all, it has the instance storage available. Obviously if you rely on instance storage, you need to run multiple instances (on multiple availability zones) and tolerate single failing instances.
I know this is late to the game, but one other little considered factoid...
EBS storage makes it exceedingly easy to create AMIs, whereas instance-store-based storage requires that AMIs be created locally on the machine itself, with a whole bunch of work to prep, store, and register the AMI.
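For the approach raised in the question (one EBS volume per data set, restored from the most recent snapshot when a new instance comes up), the lookup can be sketched as below. It assumes a boto3 EC2 client and that your snapshots carry a `Name` tag; the tag convention and function names are assumptions for the example.

```python
def latest_completed_snapshot(snapshots):
    """Pick the most recently started snapshot that has finished, or None."""
    completed = [snap for snap in snapshots if snap["State"] == "completed"]
    if not completed:
        return None
    return max(completed, key=lambda snap: snap["StartTime"])


def volume_from_latest_snapshot(ec2_client, name_tag, availability_zone):
    """Create a new EBS volume from the newest snapshot tagged Name=name_tag."""
    response = ec2_client.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "tag:Name", "Values": [name_tag]}],
    )
    snapshot = latest_completed_snapshot(response["Snapshots"])
    if snapshot is None:
        raise LookupError("no completed snapshot tagged {!r}".format(name_tag))
    volume = ec2_client.create_volume(
        SnapshotId=snapshot["SnapshotId"],
        # The volume must live in the same zone as the instance it attaches to.
        AvailabilityZone=availability_zone,
    )
    return volume["VolumeId"]
```

The new volume would then be attached to the instance and mounted at, say, /mnt/lucene or /mnt/mysql as part of the boot script.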
