ActiveMQ Shared File System Master Slave on Amazon EC2 - amazon-ec2

We want to use an ActiveMQ master/slave configuration based on a shared file system on Amazon EC2 - that's the final goal. Operating system should be Ubuntu 12.04, but that shouldn't make too much difference.
Why not a master/slave configuration based on RDS? We've tried that and it's easy to set up (including multi-AZ). However, it is relatively slow and the failover takes approximately three minutes - so we want to find something else.
Which shared file system should we use? We did some research and came to the following conclusion (which might be wrong, so please correct me):
GlusterFS is often suggested and should be supporting multi-AZs fine.
NFSv4 should be working (while NFSv3 is said to corrupt the file system), but I didn't see too many references to it on EC2 (rather: asked for NFS, got the suggestion to use GlusterFS). Is there any particular reason for that?
Ubuntu's Ceph isn't stable yet.
Hadoop Distributed File System (HDFS) sounds like overkill to me and the NameNode would again be a single point of failure.
So GlusterFS it is? We found hardly any success stories. Instead rather dissuasive entries in the bug tracker without any real explanation: "I would not recommend using GlusterFS with master/slave shared file system with more than probably 15 enqueued/dequeued messages per second." Does anyone know why or is anyone successfully using ActiveMQ on GlusterFS with a lot of messages?
EBS or ephemeral storage? Since GlusterFS should replicate all data, we could use the ephemeral storage or are there any advantages of using EBS (IMHO snapshots are not relevant for our scenario).
We'll probably try out GlusterFS, but according to Murphy's Law we'll run into problems at the worst possible moment. So we'd rather try to avoid that by (hopefully) getting a few more opinions on this. Thanks in advance!
PS: Why didn't I post this on ServerFault? It would be a better fit, but on SO there are 10 times more posts about this topic, so I stuck with the flock.

Just an idea.... but with activemq 5.7 (or maybe already 5.6) you can have pluggable lockers (http://activemq.apache.org/pluggable-storage-lockers.html). So it might be an option to use the filesystem as storage and RDS as just a locking mechanism. Note I have never tried this before.

Related

Choosing Hadoop solution for Big Data project - Pricing Options

I have to use Hadoop for my research work and I am deciding for the best option to start with. So far I have end up to go with Cloudera. I've downloaded the quick start VM
and started learning different turorials.
The issue is that my system can't afford to run it and perform very slow and I think it might just stop working after I feed it with all the data and run other services.
I was advised to go for a cloud service with 4 cluster node. Can someone please help me by providing the best option and estimated pricing to consider? 1 year plan might be enough to complete my research.
Thanks.
If you are a linux user, Just download the individual components(like hdfs, MR1, YARN, Hbase, Hive etc...) from this Cloudera Archives instead of loading Cloudera Quickstart VM.
If you want to try the 4 node cluster, easiest option is to use cloud.
There are plenty of cloud providers. I have personally used AWS, Google Cloud, Microsoft Azure, IBM SmartCloud. Out of which, AWS is the best to start with.
It is like pay as you go(use).I can recommend you to use a decent EC2 Machine(4 X m3.large Machines)
Type: m3.large CPU:2 RAM:7.5G Storage: 1 x 32 SSD Price: $0.133 per Hour AWS Pricing
If you plan to do the research for one year, I recommend you to go for VPC.
Cons of AWS EC2:
If you launch a machine in EC2, the moment you restart your machine, Your IP and the hostname will get changed.
In AWS VPC, your IP and hostname will remain the same.
If you use 4 Machinesx24x7xone month,it costs you $389.44.
You can calculate the AWS cost by yourself
As best as I can see you have two paths:
Setup Hadoop in a cloud service provider (i.e. Amazon's EC2 or
Redhat's Openshift.
Use Hadoop-as-a-service (i.e. Amazon's EMR or Microsoft's HDInsight).
The first path, setting up Hadoop in a cloud service provider will require you to become a semi-competent Hadoop administrator. If that's your goal, great! However you'll spend a great deal of time learning the necessary skills and mindset to become that. I don't suspect that that is your goal.
The second path is the one I'd recommend out of these two. Using Hadoop-as-a-service you will get up and running faster, but will cost more up front and on an ongoing (per hour basis). You'll still probably save money because you'll be spending less time troubleshooting your Hadoop cluster and more time doing the work you wanted to do in the first place.
I have to wonder, if you can even fit your dataset on your laptop, why are you using big data tools in the first place? True, they'll work. However Big Data is at least partially defined as data sets and computational problems that just don't fit on a single machine.

About online distributed environment

I am learning Mapreduce and Hadoop now. I know I can do some tests and run some samples on a singe node. But I really want to do some practice on a real distributed environment. So I want to ask :
Is there a website which can offer a distributed environment for me to do some experiments?
Somebody told me that I can use Amazon web service to build a distributed environment. Is it real? Does someone have such an experience?
And I want to know how you guys learn hadoop before you use it in your work?
Thank you!
There are a few options:
If you just want to learn about the Map/Reduce paradigm, I would recommend you take a look at JSMapReduce. This is embedded directly in the browser, you have nothing to install, and you can create real Map/Reduce programs.
If you want to learn about Hadoop specifically, Amazon has this thing called Elastic Map Reduce which is essentially Hadoop running on AWS, so this enables you to write your Hadoop job, decide how many machines you want in your cluster, which type of machines you want, and then run it, and EMR will do everything, bootstrap the machines for you, run your job and store the results on S3. I would recommend looking at this tutorial to get an idea how to setup a job on EMR. Just remember, EMR is not free, so you'll have to pay for your computing resources.
Alternatively if you're not looking to pay the cost of EMR, you could always setup Hadoop on your local machine in non-distributed mode, and experiment with it, as described here. Even if it's a single node setup, the abstractions will be the same as if you were using a big cluster, so it's a good way to get up to speed and then go on EMR or a real cluster when you want to get serious.
Amazon offers a free tier, so you can spin up some vms and try experimenting that way. The micro instances they have aren't very powerful, but are fine for small scale tests.
You can also spin up VMs on your desktop if it is powerful enough. I have done this myself using VMPlayer. You can install any flavor of Linux you like for free. Ubuntu is pretty easy to start with. When you setup the networking for your VMs, be sure to use bridged networking. That way each VM will get its own IP address on your network so they can communicate with each other.
Well, it's maybe not about '100% online' but should give really good alternative with some details.
If you are not ready to pay for online cluster resources (such as EMR solution mentioned here) and you don't like to build your own cluster but you are not satisfied with single node setup, you can try to build virtual cluster on powerful enough desktop.
You need minimun 3 VM, I prefer Ubuntu. 4 is better. To see real Hadoop you need minimal replication factor 3. So you need 3 dataNode, 3 taskTrackers. Well, you also need nameNode / JobTracker - it could be one of nodes used for dataNode but I'd recommend to have separate VM. If you need HBase, for example, you again need one Master and minimum 3 RegionServer. So, again, you need 3 but better 4 VM,
There is pretty good free product, Cloudera CDH which is 'somewhat commercial' Hadoop distribution. They also have manager with GUI and simplified installation. BTW they have even prepared demo VMs but I never have used them. You can download everything here. They also host lot of materials about Hadoop and their environment.
Alternative between completely free solution with VMs on desktop and paid service like EMR is your virtual cluster built on top of one dedicated server if you have spare. This is what I personally did. One physical server powered by VmWare free solution, 4 virtual machine, 1 SSD for OS and 3 'general' HDD for storages. Every VM runs Ubuntu 11.04 (again free). Cloudera manager free edition, CDH. So everything is free but you need some hardware that is often available as spare. And you have playground. OK, you need to invest time but by my mind you will get greatest experience from this approach.
Although I do not know much about it, another option may be Greenplum's analytic workbench (1000 node cluster w/ Hadoop for testing): http://www.greenplum.com/solutions/analytics-workbench

how to monitor a linux system on amazon ec2 without cloudwatch?

I would like to monitor the following on Amazon ec2 instances loaded with amazon linux, every X minutes :
disk statistics
process stats (similar to what top does)
ram usage
check if my scripts are running fine
should I use my own scripts and things or are there any tools that already achieve this ?
I searched and there was a suggestion about munin
what seems to be the better approach ?
Here is a great article on scale. About halfway down, the author lists monitoring tools and how they differ.
http://highscalability.com/blog/2010/8/16/scaling-an-aws-infrastructure-tools-and-patterns.html
We've implemented Cacti, it was very easy, it creates all sorts of graphs/reports (most of what you mentioned). Munin is one that he lists as well, but we have not tried that solution yet.

Managing server instance identity on EC2

I recently brought up a cluster on EC2, and I felt like I had to invent a lot of things. I'm wondering what kinds of tools, patterns, ideas are out there for how to deal with this.
Some context:
I had 3 different kinds of servers, so first I created AMIs for each of them. The first AMI had zookeeper, so step one in deploying the system was to get the zookeeper server running.
My script then made a note of the mapping between EC2's completely arbitrary and unpredictable hostnames, and the zookeeper server.
Then as I brought up new instances of the other 2 kinds of servers, the first thing I would do is ssh to the new server, and add the zookeeper server to its /etc/hosts file. Then as the server software on each instance starts up, it can find zookeeper.
Obviously this is a problem that lots of people have to solve, and it probably works a little bit differently in different clouds.
Are there products that address this concept? I was pretty surprised that EC2 didn't provide some kind of way to tie your own name to its name.
Thanks for any ideas.
How to do some service discovery on Amazon EC2 seems to have some good options.
I think you might want to look at http://puppetlabs.com/mcollective/introduction/ and the suite of tools from http://puppetlabs.com in general.
From the site:
The Marionette Collective AKA MCollective is a framework to build server orchestration or parallel job execution systems.
Primarily we’ll use it as a means of programmatic execution of Systems Administration actions on clusters of servers. In this regard we operate in the same space as tools like Func, Fabric or Capistrano.
I am fairly certain mcollective was built to solve exactly the problem you are trying to address. But, be forewarned, it's not a DNS-based solution, it's a method of addressing arbitrarily large and arbitrarily tagged groups of hosts.

Has anyone tried using ZooKeeper?

I was currently looking into memcached as way to coordinate a group of server, but came across Apache's ZooKeeper along the way. It looks interesting, and Yahoo uses it, so it shouldn't be bad, but I'd never heard of it before, so I'm kind of skeptical. Has anyone else given it a try? Any comments or ideas?
ZooKeeper and Memcached have different purposes. You can use memcached to do server coordination, but you'll have to do most of this work yourself. Memcached only allows coordination in that it caches common data lookups to be used by multiple clients. From reading ZooKeeper's documentation, it has a much broader focus than this. ZooKeeper seems to provide support for server clustering, which isn't the same as the cache clustering memcached provides.
Have a look at Brad Fitzpatrick's Linux Journal article on memcached to get a better idea what I mean.
To get an overview of what Zookeper is capable of, watch the following presentation by it's creators. It's capable of so much more (creating queue's, electing master processes amongst a group of peers, distributed high performance run time configurations, rendezvous points for dis-joined processes, determining if processes are still running, etc).
http://zookeeper.sourceforge.net/index.sf.shtml
To answer your question, if "coordination" is what you are looking for Zookeeper is much better targeted at that than memcached.
Zookeeper is great for coordinating data across servers. It does a good job of ordering every transaction and making guarantees that transactions happen in order. However when first breaking into it the documentation sucks; it's very 'high-level' without enough concrete examples or explanations as how to properly handle certain events. One of the included examples (as of version 3.3.3) had its own bugs in it.
Your code will also need to be cognizant of event driven interactions, and polling interactions. With massively distributed architecture, when acting upon 'events' you can inadvertently create a stampede that could not be desirable for your environment (herding effect).

Resources