We are currently working on a project where we want to execute image/video detection algorithms on a cluster of about 80-100 Linux servers. We looked into possible software tools and considered Hadoop as a solution. I installed Hadoop on VMware images in a master/slave configuration and was able to execute the analysis programs, written in C++, via Hadoop Streaming. Due to the nature of the analysis algorithms, we cannot divide the video files (about 1 GB each) into splits. Since Hadoop MapReduce works by dividing the input files into 64-128 MB splits and assigning the programs to the nodes where those splits reside (data locality), I could not make the programs run on the slave nodes, only on the master node. Hence, at this point, can we make Hadoop Streaming run programs based on node availability (CPU etc.) rather than on the location of the splits?
I wonder whether Hadoop is the right tool for executing image/video detection algorithms. Could there be a better solution, maybe Apache Mesos, Spark, etc.?
Sincerely
We did some research on the above issue and concluded that for our case, which is just running analysis modules on particular video files, a Hadoop/Spark implementation would be overkill and a roundabout way of solving the problem. A better approach would be to use open-source MPI/scheduler applications. After our analysis, even developing our own job scheduler that runs on multiple nodes seems fairly easy and could be done in a relatively short period of 2-3 months.
Sincerely
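A minimal sketch of the scheduler idea described above, assuming whole video files are the unit of work and each node pulls the next file as soon as it is free, so placement follows node availability rather than data location. The node names, file names, and the `analyze` function are hypothetical placeholders; a real version would launch the C++ analysis program on the remote host instead of calling a local function.

```python
import queue
import threading


def analyze(node, video):
    # Hypothetical stand-in for launching the real analysis program on `node`.
    return f"{video} analyzed on {node}"


def run_jobs(nodes, videos):
    """Hand whole files to whichever node is free next (pull model)."""
    pending = queue.Queue()
    for video in videos:
        pending.put(video)

    results = {}
    lock = threading.Lock()

    def worker(node):
        while True:
            try:
                video = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this node
            outcome = analyze(node, video)
            with lock:
                results[video] = outcome

    # One thread per node stands in for one worker process per server.
    threads = [threading.Thread(target=worker, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    done = run_jobs(["node01", "node02", "node03"],
                    [f"video_{i}.mp4" for i in range(10)])
    for video in sorted(done):
        print(done[video])
```

Because idle workers pull work themselves, a slow file on one node never blocks the others, which is the behaviour the split-based placement could not give us.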
I have watched hundreds of videos and read hundreds of articles, but they are all so complicated. Why can't people give a clear introduction to what something is before breaking it down? What the hell is Hadoop? I get that it is some kind of distributed file system and that it has cool features like high performance, HDFS, YARN, MapReduce, Hadoop Common, blah blah. Please, someone tell me what it is. Is it software like Visual Studio, Anaconda Navigator, or Android Studio? Or is it a huge company with thousands of data servers where you can upload your company's data and manage it perfectly over there? Why do these videos on YouTube say that Hadoop is storage efficient? Does it mean you use Hadoop's data servers and they store your data efficiently? I am absolutely sure that I am not the only one who asks these questions when watching these videos on YouTube.
Thanks in advance!
Hadoop is an ecosystem, meaning it consists of several pieces of software that work together in distributed mode.
The three main components of Hadoop, each provided as a service, are HDFS, YARN, and MapReduce.
HDFS is a distributed file system. That means it is a file system, much like the ones on a PC (NTFS, FAT32, ext4, etc.). The main goal of a file system is to manage files. Each file in a file system consists of blocks (think of a file as being divided into chunks). A local file system keeps these blocks on one machine, whereas HDFS splits them up and distributes them across several machines.
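To make the block idea concrete, here is a small sketch, assuming a fixed block size and a simple round-robin placement with a replication factor. The function names and node names are made up for illustration, and real HDFS uses a more elaborate placement policy than round-robin.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes, replication: int = 3):
    """Round-robin placement: each block is stored on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 1000, block_size=256)
    layout = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])
    for block_id, holders in layout.items():
        print(block_id, len(blocks[block_id]), holders)
```

Since every block lives on more than one machine, losing a single node does not lose the file.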
YARN is a resource manager, like the resource manager inside your OS, but it manages the resources of several machines. When you run an application (e.g. Notepad) on your PC, the OS gives it CPU cores (when needed) and memory. When you submit an application to YARN, it gives the application CPU cores (when needed) and memory across the machines of the cluster, and your application must know how to work in distributed mode.
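As a rough illustration of what "managing resources across several machines" means, here is a toy first-fit allocator. It is not how YARN's own schedulers work; the `Node` class, the `allocate` function, and the numbers are assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int
    free_mem_mb: int


def allocate(nodes, containers, cores, mem_mb):
    """First-fit: grant up to `containers` containers wherever resources are free."""
    granted = []
    for node in nodes:
        while (len(granted) < containers
               and node.free_cores >= cores
               and node.free_mem_mb >= mem_mb):
            # Reserve the resources on this node for one container.
            node.free_cores -= cores
            node.free_mem_mb -= mem_mb
            granted.append(node.name)
    return granted


if __name__ == "__main__":
    cluster = [Node("node1", 4, 8192), Node("node2", 2, 4096)]
    print(allocate(cluster, containers=3, cores=2, mem_mb=2048))
```

The application asks for a number of containers, and the manager decides which machines they land on.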
MapReduce helps you write programs that run in distributed mode without needing to know how to write a distributed application yourself. MapReduce is something like a programming language plus a compiler whose programs YARN knows how to run.
One day Google decided to introduce a distributed file system
and named it the Google File System, or GFS.
The idea was published in a 2003 paper: [https://static.googleusercontent.com/media/research.google.com/en/ir/archive/gfs-sosp2003.pdf]
At the same time, Doug Cutting, the creator of Apache Lucene, was working on Apache Nutch. The paper helped the Nutch developers implement a nice distributed file system named NDFS, the Nutch Distributed File System.
Then Nutch was able to split files into multiple blocks, store them on multiple servers, and replicate them to preserve data reliability.
But this data needs to be processed, and you cannot simply fetch terabytes of data onto a single machine and process it there.
In 2004, Google again came up with a great idea, named MapReduce.
[http://research.google.com/archive/mapreduce.html]
With MapReduce, you can put your data processing code near your data blocks, process them locally (map phase), then gather the results and sum them up (reduce phase).
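The two phases can be sketched with a word count in plain Python. This only imitates the idea on one machine: the strings in `blocks` stand in for data blocks on different servers, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(block):
    """Would run where the block lives: emit (word, 1) for each word."""
    return [(word, 1) for word in block.split()]


def shuffle(pairs):
    """Group the intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values gathered for each key."""
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    blocks = ["big data big", "data is big"]
    intermediate = [pair for block in blocks for pair in map_phase(block)]
    print(reduce_phase(shuffle(intermediate)))
```

Only the small (word, count) pairs travel over the network, not the blocks themselves.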
Now Nutch had both NDFS and MapReduce.
Quoting from the book Hadoop: The Definitive Guide by Tom White:
NDFS and the MapReduce
implementation in Nutch were applicable beyond the realm of search,
and in February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop.
That is the story of Hadoop.
But now, other tools have come along to work with Hadoop,
like Apache Hive, which helps us run SQL-like queries on big, distributed data in the Hadoop file system (now named HDFS),
or Apache Pig, which uses a high-level language to run complex analyses on files stored in Hadoop using the MapReduce framework.
There are many other tools, like HBase, Spark, Flume, Sqoop, Crunch, etc.
We don't call them Hadoop, but they live in a Hadoop-centered ecosystem,
so we call this community the Hadoop ecosystem.
It's better to start by reading the book Hadoop: The Definitive Guide by Tom White
instead of reading multiple blog posts and watching YouTube videos.
I tried using ParallelALSFactorizationJob, but it crashes here:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
The command-line help mentions using the filesystem, but it seems to want Hadoop. How can I run it on Windows? The mahout.cmd file is broken:
"===============DEPRECATION WARNING==============="
"This script is no longer supported for new drivers as of Mahout 0.10.0"
"Mahout's bash script is supported and if someone wants to contribute a fix for this"
"it would be appreciated."
So is that possible (ALS + Windows - hadoop)?
Mahout is a community-driven project and its community is very strong.
"Apache Mahout is one of the first and most prominent Big Data machine
learning platforms. It implements machine learning algorithms on top
of distributed processing platforms such as Hadoop and Spark."
-Tiwary, C. (2015). Learning Apache Mahout.
Apache Spark is an open-source, in-memory, general-purpose computing system that runs on both Windows and Unix-like systems. Instead of Hadoop-style disk-based computation, Spark loads the data into cluster memory, where it can be queried repeatedly.
"As Spark is gaining popularity among data scientists, the Mahout
community is also quickly working on making Mahout algorithms function
on Spark's execution engine to speed up its calculation 10 to 100
times faster. Mahout provides several important building blocks to
create recommendations using Spark."
-Gupta, A (2015). Learning Apache Mahout Classification.
(This last book also provides a step-by-step guide to using Mahout's Spark shell, although they don't use Windows and it isn't clear whether they use Hadoop. For more information on that topic, see the implementation section at https://mahout.apache.org/users/sparkbindings/play-with-shell.html.)
In addition to this, you can build recommendation engines in Spark using tools such as DataFrames, RDDs, Pipelines, and Transforms available in Spark MLlib, and
in Spark, (...) the Alternating Least Squares (ALS) method is used for
generating model-based collaborative filtering.
-Gorakala, S. (2016). Building Recommendation Engines.
At this point, there's still one question to settle before answering yours: can we run Spark without Hadoop? It can, in local or standalone mode.
So, yes, it's possible to use the ALS method on Windows using Spark (without Hadoop).
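For readers who want to see what the ALS method itself does, independent of Mahout or Spark, here is a minimal pure-Python sketch. It assumes a rank-1 model (one factor per user and per item) to keep the algebra short, and it is not the implementation used by either library. The function names, the `reg` parameter, and the toy ratings are made up for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Alternate closed-form least-squares updates of user and item factors.

    `ratings` maps (user, item) -> observed rating; missing pairs are ignored.
    """
    x = [1.0] * n_users   # one factor per user
    y = [1.0] * n_items   # one factor per item
    for _ in range(iterations):
        # Fix the item factors and solve for each user factor.
        for u in range(n_users):
            num = sum(r * y[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(y[i] ** 2 for (uu, i) in ratings if uu == u) + reg
            x[u] = num / den
        # Fix the user factors and solve for each item factor.
        for i in range(n_items):
            num = sum(r * x[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(x[u] ** 2 for (u, ii) in ratings if ii == i) + reg
            y[i] = num / den
    return x, y


def squared_error(ratings, x, y):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())


if __name__ == "__main__":
    # Toy data: user 2 has not rated item 1.
    observed = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
    users, items = als_rank1(observed, n_users=3, n_items=2)
    print("fit error:", squared_error(observed, users, items))
    print("predicted rating for user 2, item 1:", users[2] * items[1])
```

The product of a user factor and an item factor gives a prediction for pairs that were never rated, which is how this kind of model produces recommendations.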
I am actively looking for a way to contribute to Hadoop development. I want to run a user-defined scheduling algorithm on a Hadoop cluster and measure its performance. I have already configured a Hadoop cluster and have gone through the YARN Scheduler Load Simulator (SLS). Now, to run other user-defined algorithms and compare their performance with the regular schedulers (Fair/Capacity), what steps do I need to follow? Which would be better, the simulator or a regular cluster? I am new to Hadoop and badly need help with this.
I have a question regarding the speed and performance of
using multiple virtualized nodes on a single machine vs. a single node on the machine itself.
Which one will perform better?
The reason I ask is that I am currently learning Hadoop on a single machine, and I have seen some tutorials on the internet that show the use of multiple virtualized nodes on a single machine.
Thank you in advance
There is always some overhead that comes with virtualization, so unless it is really necessary I wouldn't advise running Hadoop in a virtualized environment.
That being said, I know VMware did a lot of work on making Hadoop work in a virtualized environment, and they have published some benchmarks in which they claim that, under certain conditions, VMs perform better than a native installation. I haven't played much with vSphere, but this could be something to look at if you want to explore virtualization further. But don't take the numbers at face value: it really depends on the type of hardware you're running, so in some conditions I think you might gain some performance with VMs, but I'm guessing from experience that in most cases you won't gain anything.
If you're just getting started and testing with Hadoop, I think virtualizing is overkill. You can very easily run Hadoop in pseudo-distributed mode, which means that you can run multiple Hadoop daemons on the same box, each as a separate process. That's what I used to get started with Hadoop, and it's a good head start. You can find more info here (or might need another page depending on which Hadoop version you're running).
If you get to the point where you want to test with a real cluster but don't have the resources, I would advise looking at Amazon Elastic MapReduce: it gives you a cluster on demand and it's pretty cheap. That way you can do more advanced tests. More info here.
The bottom line is: if the purpose is simply testing, you don't really need a virtual cluster.
A performance analysis case study conducted on this topic showed that a virtual Hadoop cluster is only around 4% less efficient compared to its native counterpart: Virtualized hadoop performance case study
I work in a research group doing a lot of Machine Learning and Computational Biology.
We currently have a cluster, but it is poorly maintained, suffers from low I/O throughput, and most critically doesn't have any setup for scheduling or load-balancing. Therefore, to use it, you have to find a free node yourself, ssh into that node, run your script on the command line, and manually collect your results.
What is the best software stack to implement an easy to use scheduler and load-balancer, such that users can submit their job to a central queue, have it run automatically when resources are available, and easily get their results back?
There are a number of scheduler/resource manager options that are open source and well regarded:
Torque/Maui, descendants of the venerable PBS, now maintained by Adaptive Computing
Slurm, a newer project out of LLNL, which has the advantage that it scales very well
Open Grid Engine, née Sun Grid Engine
But there are also a number of entire software stacks that aim to make managing a cluster easier:
Warewulf, out of LBL
Rocks
I'm making this a community wiki for others who have suggestions.