Do we have to compile Mesos on each and every node? - mesos

I am building a Mesos cluster with three masters and two slaves. Each VM has at least 4 vCPUs and 2 GB of memory. In the past, I compiled Mesos on each of the servers, which took a ridiculous amount of time to complete, even when I used this command to build:
make -j 4 V=0
My question is, am I doing it wrong? The building instructions say nothing about multiple node clusters. I don't want to install from the distribution binaries because I want the latest.

I suggest you use Docker.
Below are two Docker images from Mesosphere that provide up-to-date builds for the master and the slave:
https://hub.docker.com/r/mesosphere/mesos-master/tags/
https://hub.docker.com/r/mesosphere/mesos-slave/tags/
With these, you can simply pull the images from Docker Hub and run containers for the Mesos master and Mesos slave directly, without compiling the source code.
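For example, a minimal sketch of pulling and running them (the tag 1.4.1 and the ZooKeeper address are assumptions you would replace with your own):
# pick a tag from the links above
docker pull mesosphere/mesos-master:1.4.1
docker pull mesosphere/mesos-slave:1.4.1
# master, assuming a ZooKeeper ensemble at zk://10.0.0.1:2181/mesos
docker run -d --net=host \
  -e MESOS_ZK=zk://10.0.0.1:2181/mesos \
  -e MESOS_QUORUM=2 \
  -e MESOS_WORK_DIR=/var/lib/mesos \
  mesosphere/mesos-master:1.4.1
# slave, pointing at the same ZooKeeper ensemble
docker run -d --net=host --privileged \
  -e MESOS_MASTER=zk://10.0.0.1:2181/mesos \
  -e MESOS_WORK_DIR=/var/lib/mesos \
  -v /var/run/docker.sock:/var/run/docker.sock \
  mesosphere/mesos-slave:1.4.1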

Related

Flink running error on Mac standalone: org.apache.flink.configuration.IllegalConfigurationException

I am new to Flink and have installed YARN and Flink on my MacBook with an M1 Pro chip.
When I tried to submit a job using bin/flink run -m yarn-cluster examples/streaming/SocketWindowWordCount.jar --port 8882, it returned an error: Caused by: org.apache.flink.configuration.IllegalConfigurationException: The number of requested virtual cores for application master 1 exceeds the maximum number of virtual cores 0 available in the Yarn Cluster.
Can anyone tell me how to fix this?
I'm really in a hurry. Many thanks!
I believe that Flink doesn't support ARM architectures. There are quite a number of open tickets covering the various problems with running Flink on Apple M1s. https://issues.apache.org/jira/browse/FLINK-13448 shows that some work has been done, but there are still open issues such as https://issues.apache.org/jira/browse/FLINK-25188, https://issues.apache.org/jira/browse/FLINK-24932, https://issues.apache.org/jira/browse/FLINK-22331, and https://issues.apache.org/jira/browse/FLINK-25505, among others.

Hadoop installation using Cloudera VMware

Can anyone please let me know the minimum RAM required (of the host machine) for running Cloudera's hadoop on VMware workstation?
I have 6GB of RAM. The documentation says that the RAM required by the VM is 4 GB.
Still, when I run it, CentOS starts loading and then the VM crashes. I have no other active applications running at the time.
Are there any other options apart from installing hadoop manually?
You may be running into your localhost running out of memory, or some other issue preventing the machine from booting completely. There are a couple of other options if you don't want to deal with a manual install:
If you have access to a Docker environment, try the Docker image they provide.
Run it in the cloud with AWS, GCE, or Azure; they usually have a small allotment of personal/student credits available.
For AWS, EMR also makes it easy for you to run something repeatedly.
For really short durations, you could try the demo from Bitnami (https://bitnami.com/stack/hadoop) and just run whatever you need to there.
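If you go the Docker route, here is a minimal sketch (the cloudera/quickstart image, its tag, and the port are assumptions; check what Cloudera currently publishes):
docker pull cloudera/quickstart:latest
# run the single-node quickstart image and expose the Hue UI on port 8888
docker run --hostname=quickstart.cloudera --privileged=true -t -i \
  -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart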

Very slow network performance of Docker containers with host's network

I'm having a problem with sluggish network performance between Docker containers and host's network. I asked this question on the Docker's forum but have received no answers so far.
Problem
Set-up: two Macs on the same local network; the first runs an MQTT broker (mosquitto); the second runs Docker for Mac. Two C++ programs run on the second Mac and exchange data multiple times through the MQTT broker (on the first Mac), using the Paho MQTT C library.
Native run: when I ran the two C++ programs natively, the network performance was excellent as expected. The programs were built with XCode 7.3.
Docker runs: when I ran either of the C++ programs, or both of them, in Docker, the network performance dropped dramatically, roughly 30 times slower than the native run. The Docker image is based on ubuntu:latest, and the programs were built by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.
I tried to use the host network (--network="host" in Docker run) but it didn't help. I also tried to run the MQTT broker on the second Mac (so that the broker and the containers ran on the same host); the problem persisted. The problem existed on both my work LAN and my home network.
In theory, it could have been that the C++ programs were generally slow in Docker containers. But I doubt this was the case because in my experience, the general performance of C++ code in Docker is about as fast as in the native environment.
Question
What could be the cause of this problem? Are there any settings in Docker that can solve this issue?
Your problem sounds very similar to this open issue on the Docker for Mac repo. Unfortunately, there doesn't seem to be a known solution, but the discussion in there may be useful. My personal guess at the moment is that the bug lives near the hyperkit virtualization being used on Docker for Mac specifically.
In my case, I was oddly able to bypass this issue by using a different physical router, but I have no idea why it worked. Sadly that's not really a 'solution' though.
I hate that this isn't a great answer, but I wanted to at least share the discussion in the open issue. Good luck and keep us posted.
I suspect the default allocation of memory and CPU for the containers might not be optimal for the kind of network performance you are trying to achieve.
Investigate the utilization of resources within the containers using standard tools like top, htop, strace, etc. Or you can use the docker stats command while these instances are at peak operation:
$ docker stats node1 node2
CONTAINER   CPU %   MEM USAGE/LIMIT    MEM %   NET I/O
node1       0.07%   796 KB/64 MB       1.21%   788 B/648 B
node2       0.07%   2.746 MB/64 MB     4.29%   1.266 KB/648 B
Then you might want to modify various resource allocation parameters available with docker run.
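For instance, a rough sketch of raising the limits (the image name my-mqtt-client and the values are placeholders to tune for your workload):
# give the container 2 CPUs, 512 MB of RAM and 1 GB of RAM+swap
docker run -d --cpus=2 --memory=512m --memory-swap=1g my-mqtt-client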
EDIT: Another thing to check would be the MTU of the actual system interface versus the setting on the Docker interfaces. Use
--mtu=BYTES to set the MTU of the Docker interface to match your system interface's MTU value.
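A minimal sketch of checking and matching the MTU (the interface name en0 and the value 1500 are assumptions for a typical setup):
# check the MTU of the host interface
ifconfig en0 | grep mtu
# on a Linux host, start the daemon with a matching value
dockerd --mtu=1500
# on Docker for Mac, the equivalent is adding { "mtu": 1500 } to the daemon's daemon.json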

How to install pyspark & Spark for learning purposes on a laptop with limited resources?

I have a Windows 7 laptop with 6 GB of RAM. What is the most RAM/resource-efficient way to install pyspark & Spark on this laptop just for learning purposes? I don't want to work on actual big data; a small dataset is ideal since this is just for learning pyspark & Spark in general. I would prefer the latest version of Spark.
FYI: I don't have Hadoop installed.
Thanks
You've basically got three options:
Build everything from source
Install Virtualbox and use a pre-built VM like Cloudera Quickstart
Install Docker and find a suitable container
Getting everything up and running when you choose to build from source can be a pain. You've got to install the JDK, build Hadoop and Spark (both of which require you to install additional software to build them), set up a bunch of environment variables, and then pray that nothing got messed up.
VMs are nice, particularly the one from Cloudera, but you'll often be stuck with an older version of Spark and it might be tight with the resources you described.
I'd go with Docker.
Once you've got docker installed, it becomes very easy to try Spark (and lots of other technologies). My favorite containers for playing around use ipython or jupyter notebooks.
Install Docker:
https://docs.docker.com/installation/windows/
Jupyter Notebook Python, Spark, Mesos Stack
https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook
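Once Docker is installed, a minimal sketch of running that stack (the port mapping is an assumption; see the repo's README for current usage):
docker pull jupyter/pyspark-notebook
# expose the notebook server on http://localhost:8888
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook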
One thing to keep in mind is that you are going to have to allocate a certain amount of memory for the VM and the remaining memory still has to operate Windows. Windows 7 requires a minimum of 1 GB for a 32-bit OS or 2 GB for a 64-bit OS. So likely you are only going to wind up with around 4 GB of RAM for running the VM, which is not much.
Assuming you are 64-bit, note that Cloudera requires a minimum of 4 GB RAM to run CDH 5, but if you want to run Cloudera Express, you need 8 GB.
Running Docker from Windows will require you to use boot2docker, which keeps the entire VM in memory. It uses minimal memory (like around 27 MB) to run, so you should be fine there. A MUCH better solution than running VirtualBox!
Another option to consider would be to spin up a free machine on something like Amazon Web Services (http://aws.amazon.com) or Google Cloud (http://cloud.google.com). Particularly with the latter, you can get a free trial amount of credits, which you could use to spin up a machine with more RAM than you would typically get with AWS.

Set up a multi-node Hadoop cluster using virtual machines on my laptop

I have a Windows 7 laptop and I need to set up a Hadoop (multi-node) cluster on it.
I have the following things ready -
Virtualization software, i.e. VirtualBox and VMware Player.
Two virtual machines, i.e.
Ubuntu - for Hadoop master and
Ubuntu - for (1X) Hadoop slave
Has anyone set up such a cluster using virtual machines on their laptop?
If yes, please help me install it.
I've searched on Google but I can't figure out how to configure this multi-node Hadoop cluster using VMs.
How do I run two Ubuntu OS instances on Windows 7 using VMware or VirtualBox?
Should we use the same Ubuntu version for both VM images, or VM images with different versions of Ubuntu Linux?
Yes, you can use two Ubuntu nodes. I am using five nodes (1 master, 4 datanodes).
If you want to install a multi-node cluster in VMware:
Just download Ubuntu from this link: http://www.ubuntu.com/download/desktop
Install two machines, and install Java and OpenSSH on each.
Then download the shell script for a multi-node installation from this link:
https://github.com/tonyreddy/Apache-MultiNode-Insatallation-Shellscript
And try it.
All the best!
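For the Java and OpenSSH step mentioned above, a rough sketch on each Ubuntu VM (the OpenJDK 8 package is an assumption; pick the JDK version your Hadoop release supports):
sudo apt-get update
# install a JDK and the SSH server Hadoop needs for logins between nodes
sudo apt-get install -y openjdk-8-jdk openssh-server
# generate a key and authorize it so the master can ssh to the slaves without a password
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys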
Since you're running Hadoop on your laptop, you're obviously doing it for learning purposes, building a POC, or functional debugging.
Instead of going through the hassle of installing and setting up Hadoop and related Big Data software, you can simply install a pre-configured pseudo-distributed VM.
Some good options are:
Cloudera QuickStart VM
Hortonworks Sandbox
I've been using Cloudera's VM on my laptop for quite some time now and it's been working great.
Cloudera and Hortonworks are the fastest way to get it up and running.
Make sure you have enough RAM installed on your laptop for the operating system that is already running; otherwise your laptop will often restart abruptly while you use the virtual machines.
Let me give you an example:
If you are using Windows 10, it needs 3-5 GB of RAM to work smoothly.
This means that if you load a virtual machine of 5 GB into your RAM, Windows may crash when it cannot find enough RAM to operate.
You should upgrade the RAM from 8 GB to 12 GB, or ideally 16 GB, for smooth operation of your laptop.
Hope it helps