Are there any solutions to launch Hadoop MapReduce jobs using GPUs?

I'm working on a solution to improve the performance of MapReduce jobs, especially their response time, and for that I'm looking for ways to run jobs on GPUs.
If you have proposals, I'm interested.

Hadoop YARN 3.1.x introduced support for scheduling (Nvidia) GPUs ("GPU on YARN").
Disclaimer: I haven't tried it, and it's still a new feature, so YMMV.
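As a rough illustration of what that involves, here is a minimal sketch based on the Hadoop 3.1+ GPU documentation (the full setup also needs resource-types.xml, cgroups isolation, and the DominantResourceCalculator, so check the docs for your exact version). The NodeManager-side piece is enabling the GPU resource plugin in yarn-site.xml:

<property>
  <!-- enable the built-in GPU resource plugin on each NodeManager -->
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/gpu</value>
</property>

A MapReduce job can then request GPUs through the generic resource properties (for example mapreduce.map.resource.yarn.io/gpu=1 in recent 3.x releases). Note that YARN only schedules and isolates the GPU; your map/reduce code still has to use CUDA or OpenCL (for example via JCuda or a native binary) to actually execute on the device.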

Related

Can apache mahout ALS work without hadoop?

I tried using ParallelALSFactorizationJob, but it crashes here:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
The command-line help mentions using the filesystem, but it seems to want Hadoop. How can I run it on Windows? The mahout.cmd file is broken:
"===============DEPRECATION WARNING==============="
"This script is no longer supported for new drivers as of Mahout 0.10.0"
"Mahout's bash script is supported and if someone wants to contribute a fix for this"
"it would be appreciated."
So is that possible (ALS on Windows, without Hadoop)?
Mahout is a community-driven project and its community is very strong.
"Apache Mahout is one of the first and most prominent Big Data machine
learning platforms. It implements machine learning algorithms on top
of distributed processing platforms such as Hadoop and Spark."
-Tiwary, C. (2015). Learning Apache Mahout.
Apache Spark is an open-source, in-memory, general-purpose computing system that runs on both Windows and Unix-like systems. Instead of Hadoop-style disk-based computation, Spark loads data into cluster memory, where it can be queried repeatedly.
"As Spark is gaining popularity among data scientists, the Mahout
community is also quickly working on making Mahout algorithms function
on Spark's execution engine to speed up its calculation 10 to 100
times faster. Mahout provides several important building blocks to
create recommendations using Spark."
-Gupta, A (2015). Learning Apache Mahout Classification.
(This last book also provides a step-by-step guide to using Mahout's Spark shell; the authors don't use Windows, and it isn't clear whether they use Hadoop. For more information on that topic, see the implementation section at https://mahout.apache.org/users/sparkbindings/play-with-shell.html.)
In addition to this, you can build recommendation engines using the building blocks available in Spark MLlib, such as DataFrames, RDDs, Pipelines, and Transformers, and
in Spark, (...) the Alternating Least Squares (ALS) method is used for
generating model-based collaborative filtering.
-Gorakala, S. (2016). Building Recommendation Engines.
At this point, there's one more question to answer before answering yours: can we run Spark without Hadoop? Yes: Spark can run in local or standalone mode without an HDFS or YARN installation.
So, yes, it's possible to use the ALS method on Windows using Spark (without Hadoop).
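For a concrete starting point, here is a minimal sketch of model-based collaborative filtering with Spark MLlib's ALS in local mode, with no Hadoop cluster involved; the file name and column names are made up, and on Windows you may additionally need winutils.exe (via HADOOP_HOME) even though no Hadoop cluster is used:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

# Local mode: the whole pipeline runs on this one machine.
spark = SparkSession.builder.master("local[*]").appName("als-demo").getOrCreate()

# Hypothetical ratings file with columns userId,itemId,rating.
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Top 5 item recommendations per user.
model.recommendForAllUsers(5).show(truncate=False)

spark.stop()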

Suggestion for building a small Hadoop cluster for learning purposes

I have a test for my Big Data class where I have to do some sort of big data analytics on 'smaller' datasets. I actually have my stuff figured out: I installed Hadoop 2.8.1 and Spark 2.2.0 (I use PySpark to build my program) in standalone mode on Ubuntu 16.04, built from source. I'm good to go to do my thing on my own.
The thing is, some of my friends are struggling to configure all of this, and I thought to myself, "why don't I build a little cluster with my classmates?" So I'm looking for suggestions.
My laptop has 12 GB of RAM and an Intel Core i5.
If I understand correctly, your friends have trouble setting up Spark in standalone mode (meaning no cluster at all, just local computation). I don't think setting up a cluster for them to work with would take away from the complexity they'll face. Or are they trying to set up a cluster? Because Spark's standalone mode really doesn't need much configuration.
Another approach is to use a preconfigured VM that everyone can use individually, either prepared by yourself or one of the sandboxes offered by providers such as Cloudera and Hortonworks.
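To illustrate how little configuration local use actually needs, here is a minimal sketch (assuming a plain Spark download or pip install pyspark; the data file and column name are made up):

from pyspark.sql import SparkSession

# "local[*]" runs Spark on all cores of this laptop; no cluster,
# no Hadoop, and no configuration files are required.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("classwork")
         .getOrCreate())

df = spark.read.csv("dataset.csv", header=True, inferSchema=True)  # hypothetical file
df.groupBy("label").count().show()                                 # hypothetical column

spark.stop()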

Running and Contributing Hadoop Scheduler algorithm

I am actively looking for a way to contribute to Hadoop development. I want to run a user-defined scheduling algorithm on a Hadoop cluster and see how it performs. I have already configured a Hadoop cluster and have gone through the YARN Scheduler Load Simulator (SLS). Now, to run other user-defined algorithms and compare their performance against the regular schedulers (Fair/Capacity), what procedures/steps do I need to follow? Which would be better, the simulator or a regular cluster? I am a noob in Hadoop and badly need help with this.

getting frustrated trying to set up pseudo-dist hadoop cluster

I have spent the past 8 hours trying to set up my Hadoop cluster, and to be honest, it's getting exhausting. It's not just today; it's been a few weeks, to be exact. I have tried probably 20-30 different tutorials I found on the web, and each time I get errors towards the end, like SSH connection issues, JVM failures, PATH issues, or the worst of all (WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable), and many more. All of that leads me to reset my VMware Player image and reinstall Ubuntu over and over again.
I am working towards the Hadoop Developer certification, and I need hands-on experience with everything from MapReduce and Hadoop to the ecosystem (Hive, Pig, etc.). The only thing in my way right now is setting up a cluster for practice. I have run out of options.
My question is: is there any way (the easier the better, but any will do) to install Hadoop MapReduce version 2 (YARN) without pulling my hair out? I would really like something that has been shown to be consistent and has worked for multiple people.
64-bit Ubuntu.
EDIT: Thanks to everyone in advance
You haven't stated whether you're using vanilla Hadoop or a distribution. If you're using the vanilla Apache Hadoop version, you may want to try a distribution like CDH.
The CDH5B2 documentation specifically addresses how to perform the installation on Ubuntu. The distribution contains YARN, Spark, Hive, Pig, Sqoop, Flume, etc., so it should fit all of your needs.
Thank you for that. You set me down the right path.
For those interested in development, go here
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo
It will save you a lot of pain.
Hadoop Video description
http://www.youtube.com/watch?v=o19zLaTuuSk

Hadoop virtual cluster vs single machine

I have a question regarding the speed and performance of using multiple virtualized nodes on a single machine versus a single node on that same machine. Which one will perform better?
The reason I ask is that I am currently learning Hadoop on a single machine, and I see some tutorials on the internet that show the use of multiple virtualized nodes on a single machine.
Thank you in advance
There is always some overhead that comes with virtualization, so unless it's really necessary I wouldn't advise running Hadoop in a virtualized environment.
That being said, I know VMware did a lot of work on making Hadoop run in a virtualized environment, and they have published some benchmarks in which they claim that under certain conditions VMs can outperform a native installation. I haven't played much with vSphere, but this could be something to look at if you want to explore virtualization further. Don't take the numbers for granted, though; it really depends on the hardware you're running, so under some conditions you might gain some performance with VMs, but I'm guessing from experience that in most cases you won't gain anything.
If you're just getting started and testing with Hadoop, I think virtualizing is overkill. You can very easily run Hadoop in pseudo-distributed mode, which means you run multiple Hadoop daemons on the same box, each as a separate process. That's what I used to get started with Hadoop, and it's a good head start. You can find more info here (or you might need another page depending on which Hadoop version you're running).
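For reference, the core of a pseudo-distributed setup comes down to two small configuration changes, sketched here along the lines of the Apache single-node tutorial (ports and paths may differ between versions):

<!-- core-site.xml: point clients at a local HDFS NameNode -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- hdfs-site.xml: a single machine, so keep one replica per block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

After formatting the NameNode (hdfs namenode -format) and running start-dfs.sh, the NameNode and DataNode run as separate JVMs on the one machine; YARN can be added the same way with start-yarn.sh plus its own small yarn-site.xml/mapred-site.xml additions.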
If you get to the point where you want to test with a real cluster, but don't have the resources, I would advise looking at Amazon Elastic Map/Reduce: it gives you a cluster on demand and it's pretty cheap. That way you can do more advanced tests. More info here.
The bottom line is: if the purpose is simply testing, I don't think you really need a virtual cluster.
A performance analysis case study on this topic showed that a virtual Hadoop cluster is only around 4% less efficient than its native counterpart: Virtualized Hadoop performance case study.

Resources