Choosing a Hadoop solution for a Big Data project - pricing options - hadoop

I have to use Hadoop for my research work and I am deciding on the best option to start with. So far I have settled on Cloudera: I've downloaded the QuickStart VM
and started learning from different tutorials.
The issue is that my system can't really afford to run it: it performs very slowly, and I think it might stop working altogether once I feed it all the data and run the other services.
I was advised to go for a cloud service with a 4-node cluster. Can someone please help me by suggesting the best option and the estimated pricing to consider? A 1-year plan should be enough to complete my research.
Thanks.

If you are a Linux user, just download the individual components (HDFS, MR1, YARN, HBase, Hive, etc.) from the Cloudera Archives instead of loading the Cloudera QuickStart VM.
If you want to try a 4-node cluster, the easiest option is to use the cloud.
There are plenty of cloud providers. I have personally used AWS, Google Cloud, Microsoft Azure, and IBM SmartCloud; of these, AWS is the best to start with.
It is pay-as-you-go. I can recommend a decent EC2 setup: 4 x m3.large machines.
Type: m3.large, CPU: 2 vCPUs, RAM: 7.5 GB, Storage: 1 x 32 GB SSD, Price: $0.133 per hour (see AWS Pricing).
If you plan to do the research for one year, I recommend you go with a VPC.
Cons of plain AWS EC2:
If you launch a machine in EC2, the moment you restart it, its IP address and hostname will change.
In an AWS VPC, your IP address and hostname remain the same.
If you run 4 machines 24x7 for one month, it costs you about $389.44.
You can also calculate the AWS cost yourself.
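To sanity-check that figure yourself, the arithmetic is just instances x hourly rate x hours. A minimal sketch, assuming the $0.133/hour rate quoted above and roughly 732 hours in a month:

```python
# Rough monthly cost estimate for 4 x m3.large running 24x7.
# The hourly rate is the one quoted above; hours per month is an approximation.
RATE_PER_HOUR = 0.133    # USD, m3.large on-demand
NODES = 4
HOURS_PER_MONTH = 732    # ~24 hours x 30.5 days

monthly_cost = NODES * RATE_PER_HOUR * HOURS_PER_MONTH
print(f"Estimated monthly cost: ${monthly_cost:.2f}")
# Prints "Estimated monthly cost: $389.42", close to the $389.44 above.
```

Actual AWS billing varies by region and by the number of hours in the month, so treat this as an order-of-magnitude check, not a quote.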

As best as I can see, you have two paths:
Set up Hadoop yourself with a cloud service provider (e.g. Amazon's EC2 or
Red Hat's OpenShift).
Use Hadoop-as-a-service (e.g. Amazon's EMR or Microsoft's HDInsight).
The first path, setting up Hadoop with a cloud service provider, will require you to become a semi-competent Hadoop administrator. If that's your goal, great! However, you'll spend a great deal of time acquiring the necessary skills and mindset, and I don't suspect that that is your goal.
The second path is the one I'd recommend of the two. Using Hadoop-as-a-service you will get up and running faster, but it will cost more up front and on an ongoing (per-hour) basis. You'll still probably save money, because you'll be spending less time troubleshooting your Hadoop cluster and more time doing the work you wanted to do in the first place.
I have to wonder: if you can even fit your dataset on your laptop, why are you using big data tools in the first place? True, they'll work. However, Big Data is at least partially defined by data sets and computational problems that just don't fit on a single machine.

Related

Amazon EMR vs EC2 for Off loading BI & Analytics anno 2018

I looked at some posts but they are a bit older on this topic. I have read the AWS and other blogs as well, but ...
My simple non-programming question for AWS in today's environment is:
If we have a DWH of, say, 20+ TB and growing that we want to off-load to the cloud, as many are doing, and we have a regular daily DWH feed with some mutations, should we, in the case of AWS, use EMR or EC2?
Moreover, it is a complete batch environment, no Streaming or KAFKA requirements. Usage of SPARK for sure.
EMR seems great, but I have the impression it is for Data Scientists to do whatever they want, whenever they want. For more regular ETL I am wondering if it is well suited. The appeal of less management is certainly a boon.
In the docs on AWS I cannot find a definitive answer, hence this question.
My impression is that with AMIs and bootstrapping your own services, EMR is certainly one way to go, and that EC2 would be more for a KAFKA cluster, or for when you really want to control your own environment and tooling completely, based on, say, Cloudera's distribution per se.
So, the answer here is for others that may need to assess which options apply for off-loading. It is actually not so hard in hindsight. Note that AZURE and other non-AWS vendors are not considered here. In a nutshell, then:
EMR is a (PaaS) AWS Managed Hadoop Service
EMR provides tools that AMAZON feels will do the job for Data Science, Analytics, etc. But you can "bootstrap" your own requirements / software, if needed.
EMR clusters comprise short-running EC2 instances, and provisioning happens under the hood, as it were. You get patches applied easily this way. You can scale up and down very easily as well. Compute and storage are divorced, which is what allows this scaling to occur so easily.
Elasticity obviously applies more to compute; data needs to be there as long as you need it. EMR relies on S3 to save results to for the longer term. After saving, one terminates the EMR cluster and, when required, starts a new EMR cluster and attaches your saved S3 results - if applicable - to this new cluster. EMRFS allows S3 to look like part of HDFS and provides easy access. EBS-backed storage also exists, which allows saving results to storage tied to the EC2 instance for the duration of that instance.
It's a new way of doing things. One has access to "spot" instances with, obviously, spot prices. Billing is less predictable as it depends on what you do, but overall it could well be cheaper - provided it is governed correctly. An example of this is Expedia's management of EMR clusters.
Ad-hoc querying is not well served by S3, so you will need other AWS Managed Services such as Presto / Athena or Redshift (Spectrum), which means an additional set of services and cost. Just mentioning this due to slower S3 performance.
EC2 (IaaS) is more "traditional"
You elect to take this path if you want to provision EC2 instances yourself, as you want control of the software and of what you run in your Hadoop environment.
EC2 instances - VMs - have compute power, memory, and EBS-backed temporary storage, and can use EFS for file systems for HDFS or, say, KUDU, as well as S3. S3 access is not as easy as under EMRFS with EMR.
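To illustrate the extra wiring S3 needs on self-managed EC2 (where there is no EMRFS), a typical core-site.xml fragment for the s3a connector from the hadoop-aws module might look like this - a sketch only; the credential values are placeholders, and in practice an IAM instance role is preferable to embedded keys:

```xml
<!-- core-site.xml: enabling the s3a connector on a self-managed cluster -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value> <!-- placeholder -->
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value> <!-- placeholder -->
</property>
```

With that in place, jobs can address buckets directly, e.g. `hadoop fs -ls s3a://my-bucket/` (bucket name hypothetical) - workable, but a far cry from EMRFS making S3 look like part of HDFS.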
You install and maintain the Hadoop software yourself and apply patches, etc. Management of Hadoop on these EC2 instances is of course less of a big deal with Cloudera and Cloudbreak.
Billing is more predictable one could argue, on the basis of up-time of an EC2 instance, and billing applies continuously for any persisted storage.
Important point, one can combine an EC2 approach for, say, DWH Loading on Hadoop - if "off-loading", and EMR Clusters for Data Science.
MR Data Locality
This is not adhered to in either approach unless bare-metal options are used, but then the elasticity - the E - is harder to achieve for both parties, and it is the elasticity which allows the cost savings.
Data locality seems to be assumed by most, but it has actually gone with cloud computing, as expected, and performance seems quite OK for Data Science etc.
For ad-hoc querying, AMAZON say they are not so sure about S3, and from my experience, using EFS for HDFS/PARQUET or KUDU works pretty quickly, to say the least.

About renting and using a cluster on Amazon EC2

I am researching now in the topic of improving the MapReduce scheduler but unfortunately my university does not provide a cluster for research purposes. I was thinking about renting a cluster and I heard about Amazon EC2, but I have no experience with its services and I do not know how to use them.
I am in need of 5 machines with the following specifications (for each machine):
A dual-processor (2.2 GHz AMD Opteron(tm) Processor 4122 with 4 physical cores)
8GB of RAM
500GB disk
I want to setup the Linux operating system and the Hadoop framework manually, just like I would if I had the machines physically on my hands. I would like to know if Amazon EC2 offers something like this, and I would like to estimate the cost of this infrastructure for, let's say, a month.
In the case I choose Amazon's Elastic MapReduce framework, would I be able to control the version of Hadoop? Would I also be able to change the configuration of the scheduler in it, so that I can plug in my algorithm?
Finally, I would like to know if there is any kind of simulator for MapReduce to make different experiments.
Please excuse my multiple questions, I am new in this field and any guidance would be really appreciated.
I was thinking about renting a cluster and I heard about Amazon EC2, but I have no experience with its services and I do not know how to use them.
Amazon's AWS has elaborate documentation; for reference, here is the Getting Started link to get you going. Also, the AWS self-paced labs are worth checking out.
I am in need of 5 machines with the following specifications (for each machine): A dual-processor, 8GB of RAM, and 500GB of disk.
Amazon's AWS provides a wide range of EC2 instance types. Choose which one best fits your use-case from a list of instance types.
I want to setup the Linux operating system and the Hadoop framework manually, just like I would if I had the machines physically on my hands. I would like to know if Amazon EC2 offers something like this, and I would like to estimate the cost of this infrastructure for, let's say, a month.
AWS does not provide a VM without an OS installed on it. All the VMs provided by AWS come pre-loaded with an OS, and you can manually install Hadoop on top of that. Of course, AWS provides a wide range of OS choices.
Amazon AWS also provides a Simple Monthly Calculator to work out how much your cluster might cost, based on the instances you have selected and the number of EBS volumes you have attached to each instance.
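As a back-of-the-envelope version of what that calculator does for the 5-machine cluster described above, assuming (hypothetically) about $0.14/hour for an instance with 8 GB of RAM and about $0.05 per GB-month for magnetic EBS storage - check the current AWS price list for real numbers:

```python
# Rough monthly estimate: 5 machines, 8 GB RAM and 500 GB EBS disk each.
# Both rates are illustrative assumptions, not quoted AWS prices.
INSTANCE_RATE = 0.14      # USD/hour, assumed
EBS_RATE = 0.05           # USD per GB-month, assumed
MACHINES = 5
DISK_GB = 500
HOURS = 24 * 30           # one month of continuous uptime

compute = MACHINES * INSTANCE_RATE * HOURS
storage = MACHINES * DISK_GB * EBS_RATE
print(f"Compute ${compute:.2f} + storage ${storage:.2f} = ${compute + storage:.2f}/month")
```

Note how the persistent storage is billed per GB-month regardless of uptime, while compute is billed only while the instances run - stopping the cluster between experiments cuts the compute term but not the storage term.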
In the case I choose Amazon's Elastic MapReduce framework, would I be able to control the version of Hadoop? Would I also be able to change the configuration of the scheduler in it, so that I can plug in my algorithm?
If you are using AWS EMR to deploy a Hadoop cluster, then you can select the version of Hadoop to be installed; the Hadoop versions supported by Amazon are 2.4.0, 2.2.0, 1.0.3, and 0.20.205.
Finally, I would like to know if there is any kind of simulator for MapReduce to make different experiments.
I did not understand the part about a MapReduce simulator, though.

About online distributed environment

I am learning MapReduce and Hadoop now. I know I can do some tests and run some samples on a single node, but I really want to practice in a real distributed environment. So I want to ask:
Is there a website which can offer a distributed environment for me to do some experiments?
Somebody told me that I can use Amazon Web Services to build a distributed environment. Is that right? Does anyone have such experience?
And I want to know: how did you guys learn Hadoop before you used it in your work?
Thank you!
There are a few options:
If you just want to learn about the Map/Reduce paradigm, I would recommend you take a look at JSMapReduce. This is embedded directly in the browser, you have nothing to install, and you can create real Map/Reduce programs.
If you want to learn about Hadoop specifically, Amazon has a service called Elastic Map Reduce (EMR), which is essentially Hadoop running on AWS. It lets you write your Hadoop job, decide how many machines you want in your cluster and which type of machines you want, and then run it; EMR does everything else - it bootstraps the machines for you, runs your job, and stores the results on S3. I would recommend looking at this tutorial to get an idea of how to set up a job on EMR. Just remember, EMR is not free, so you'll have to pay for your computing resources.
Alternatively, if you're not looking to pay the cost of EMR, you can always set up Hadoop on your local machine in non-distributed mode and experiment with it, as described here. Even if it's a single-node setup, the abstractions will be the same as if you were using a big cluster, so it's a good way to get up to speed and then move to EMR or a real cluster when you want to get serious.
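For reference, moving from non-distributed to the single-node "pseudo-distributed" mode described in the Hadoop docs amounts to two small configuration edits - sketched here with the upstream default values (the fs.defaultFS name is the Hadoop 2.x property; older 1.x releases call it fs.default.name):

```xml
<!-- core-site.xml: point the default file system at a local HDFS daemon -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- hdfs-site.xml: a single node can only hold one replica of each block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

With these set, you format the NameNode once and start the daemons, and your jobs run through real HDFS and real job scheduling, just on one machine.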
Amazon offers a free tier, so you can spin up some VMs and try experimenting that way. The micro instances they offer aren't very powerful, but they are fine for small-scale tests.
You can also spin up VMs on your desktop if it is powerful enough. I have done this myself using VMPlayer. You can install any flavor of Linux you like for free. Ubuntu is pretty easy to start with. When you setup the networking for your VMs, be sure to use bridged networking. That way each VM will get its own IP address on your network so they can communicate with each other.
Well, it's maybe not '100% online', but it should give you a really good alternative, with some details.
If you are not ready to pay for online cluster resources (such as the EMR solution mentioned here), and you don't want to build your own cluster, but you are not satisfied with a single-node setup, you can try to build a virtual cluster on a powerful enough desktop.
You need a minimum of 3 VMs - I prefer Ubuntu - and 4 is better. To see real Hadoop you need a minimum replication factor of 3, so you need 3 DataNodes and 3 TaskTrackers. You also need a NameNode / JobTracker; it could be one of the nodes used as a DataNode, but I'd recommend having a separate VM. If you need HBase, for example, you again need one Master and a minimum of 3 RegionServers. So, again, you need 3 - but better 4 - VMs.
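So the VMs can find each other by name, each one needs consistent host entries; a sketch of /etc/hosts under a bridged-networking setup (the addresses and hostnames are examples, not prescriptions):

```
# /etc/hosts on every VM; addresses are examples for a bridged home network
192.168.1.10  namenode    # NameNode / JobTracker
192.168.1.11  datanode1   # DataNode + TaskTracker
192.168.1.12  datanode2   # DataNode + TaskTracker
192.168.1.13  datanode3   # DataNode + TaskTracker
```

Keeping the same file on all nodes avoids the classic problem of daemons registering under names the other nodes cannot resolve.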
There is a pretty good free product, Cloudera CDH, which is a 'somewhat commercial' Hadoop distribution. They also have a manager with a GUI and simplified installation. BTW, they have even prepared demo VMs, but I have never used them. You can download everything here. They also host a lot of material about Hadoop and their environment.
An alternative between a completely free solution with VMs on a desktop and a paid service like EMR is a virtual cluster built on top of one dedicated server, if you have one spare. This is what I personally did: one physical server running VMware's free offering, 4 virtual machines, 1 SSD for the OS, and 3 'general' HDDs for storage. Every VM runs Ubuntu 11.04 (again, free), with Cloudera Manager Free Edition and CDH. So everything is free, but you need some hardware, which is often available as spare capacity. And you have a playground. OK, you need to invest time, but to my mind you will get the greatest experience from this approach.
Although I do not know much about it, another option may be Greenplum's analytic workbench (1000 node cluster w/ Hadoop for testing): http://www.greenplum.com/solutions/analytics-workbench

ActiveMQ Shared File System Master Slave on Amazon EC2

We want to use an ActiveMQ master/slave configuration based on a shared file system on Amazon EC2 - that's the final goal. Operating system should be Ubuntu 12.04, but that shouldn't make too much difference.
Why not a master/slave configuration based on RDS? We've tried that and it's easy to set up (including multi-AZ). However, it is relatively slow and the failover takes approximately three minutes - so we want to find something else.
Which shared file system should we use? We did some research and came to the following conclusion (which might be wrong, so please correct me):
GlusterFS is often suggested and should be supporting multi-AZs fine.
NFSv4 should be working (while NFSv3 is said to corrupt the file system), but I didn't see too many references to it on EC2 (rather: asked for NFS, got the suggestion to use GlusterFS). Is there any particular reason for that?
Ubuntu's Ceph isn't stable yet.
Hadoop Distributed File System (HDFS) sounds like overkill to me and the NameNode would again be a single point of failure.
So GlusterFS it is? We found hardly any success stories - instead, rather dissuasive entries in the bug tracker without any real explanation: "I would not recommend using GlusterFS with master/slave shared file system with more than probably 15 enqueued/dequeued messages per second." Does anyone know why, or is anyone successfully using ActiveMQ on GlusterFS with a lot of messages?
EBS or ephemeral storage? Since GlusterFS should replicate all data, we could use ephemeral storage - or are there any advantages to using EBS (IMHO snapshots are not relevant for our scenario)?
We'll probably try out GlusterFS, but according to Murphy's Law we'll run into problems at the worst possible moment. So we'd rather try to avoid that by (hopefully) getting a few more opinions on this. Thanks in advance!
PS: Why didn't I post this on ServerFault? It would be a better fit, but on SO there are 10 times more posts about this topic, so I stuck with the flock.
Just an idea... but with ActiveMQ 5.7 (or maybe already 5.6) you can have pluggable lockers (http://activemq.apache.org/pluggable-storage-lockers.html). So it might be an option to use the file system as storage and RDS as just a locking mechanism. Note that I have never tried this before.
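Sketched in activemq.xml, the idea might look roughly like this - untested, as noted; the shared directory path and the "mysql-ds" DataSource bean (which would point at the RDS instance) are placeholders:

```xml
<!-- activemq.xml: messages live on the shared file system,
     while the exclusive-master lock is held via the RDS database -->
<persistenceAdapter>
  <kahaDB directory="/mnt/shared/activemq/kahadb">
    <locker>
      <!-- one of the pluggable lockers; "#mysql-ds" is a placeholder
           reference to a DataSource bean defined elsewhere in the file -->
      <lease-database-locker dataSource="#mysql-ds"/>
    </locker>
  </kahaDB>
</persistenceAdapter>
```

The appeal is that the slow or unreliable locking behavior of the shared file system stops mattering: only the message data crosses it, and master election goes through the database.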

Amazon EC2 consideration - redundancy and elastic IPs

I've been tasked with determining if Amazon EC2 is something we should move our ecommerce site to. We currently use Amazon S3 for a lot of images and files. The cost would go up by about $20/mo for our host costs, but we could sell our server for a few thousand dollars. This all came up because right now there are no procedures in place if something happened to our server.
How reliable is Amazon EC2? Is the redundancy good? I don't see anything about this in the FAQ, and it's a problem with our current system that I'm looking to solve.
Are elastic IPs beneficial? It sounds like you could point DNS at that IP and then, on Amazon's end, reroute that IP address to any EC2 instance, so you could easily get another instance up and running if the first one failed.
I'm aware of scalability, it's the redundancy and reliability that I'm asking about.
At work, I've had something like 20-40 instances running at all times for over a year. I think we've had 1-3 alert emails from Amazon suggesting that we terminate an instance and boot another (presumably because they detected possible failure in the underlying hardware). We've never had an instance go down suddenly, which seems rather good.
Elastic IPs are amazing and are part of the solution. The other part is being able to rapidly bring up new instances. I've learned that you shouldn't care about individual instances going down; it's more important to use proper load balancing and to be able to bring up commodity instances quickly.
Yes, it's very good. If you aren't able to put together a concurrent redundancy (where you have multiple servers fulfilling requests simultaneously), using the elastic IP to quickly redirect to another EC2 instance would be a way to minimize downtime.
Yeah, I think moving from an in-house server to Amazon will definitely make a lot of sense economically. EBS-backed instances ensure that even if the machine gets rebooted, the data on its disk is not lost. And if you have a clear separation between your application and data layers and can put them on different machines, then you can build even better redundancy for your data.
For example, if you use MySQL, then you can consider using the Amazon RDS service, which gives you a highly available and reliable MySQL instance, fully managed (patches and all). The application layer can then be made more resilient by having several smaller instances rather than one larger instance, behind load balancing.
The costs you will save on are really hardware maintenance and the cost you would otherwise incur to build in disaster recovery.
