Suggestion for building a small Hadoop cluster for learning purpose - hadoop

I have a test for my Big Data class where I have to do some sort of big data analytics with 'smaller' datasets. I actually have my stuff figured it out. I installed Hadoop 2.8.1 and Spark 2.2.0 (I use PySpark to build a program) in standalone mode on my Ubuntu 16.04 from source. I'm actually good to go to do my thing by my own.
The thing is, some of my friends are struggling in configuring all of these and I thought to myself "why don't I make my own little cluster with my classmates". So I'm looking for suggestions.
My laptop has 12 GB RAM and Intel Core i5.

If I understand correctly, your friends have trouble setting up spark in standalone mode (meaning no cluster at all, just local computation). I don't think setting up a cluster they can work with takes away from the complexity they will face. Or are they trying to set up a cluster? Because standalone mode of Spark really doesn't need much configuration.
Another approach is to use a preconfigured VM everyone can use individually. Either prepared by yourself, or there are sandboxes by different providers, e.g. Cloudera and Hortonworks.

Related

install Hadoop,Pig and hive in laptop

I want to install hadoop, pig and hive in my laptop. I don't know how to install and configure hadoop,pig and hive and what software are required to do it.
Please let me know exact steps require to install/configure Hadoop, Pig and hive in laptop.
and i can use windows OS and i install the hadoop in windows OS
For beginners, I would recommend sticking to a good prepackaged Hadoop distribution/sandbox. Even if you want to learn how to setup up a Hadoop cluster before using the tools it provides (e.g. Hive etc.), setting up a common distribution is a lot easier at least in the beginning.
Prepackaged sandboxes for Hadoop are going to be in Linux. But most likely, you will not need to do a lot in Linux to start using Hadoop if you start from these sandboxes. Personally, I think the time you will save by avoiding support and documentation issues on Windows ports will compensate greatly for any added effort required for jumping into Linux, and you will at least enter the domain of Linux which itself is a tremendously important tool.
For prepackaged solutions, you may try to aim at Cloudera quickstart VM or MapR quickstart VM as these are the most widely used distributions. By using sandboxes, you will skip the installation process (which may be hectic if you don't know what you want and specially if you aren't familiar with Linux) and jump right into usage of tools. Due to availability of good documentation for large vendors such as Cloudera and MapR, you will also face lesser issues in accessing the tools you want to learn.
Follow the vendor specific setup guidelines (also listed on the download pages as getting started guides) for further details on setting up the sandbox.
Once you have the sandbox setup, you can use a lot of different ways to access Hive and Pig. You can use a command line interface for Hive (called beeline). If you are familiar with JDBC, you can access Hive through that. Install Apache-Thrift to enable much wider access options, but you can also save that for later.
I would not recommend learning Pig unless you have very specific uses for it. If you are familiar with Java (or Scala, or even Python, among other options), try writing some Map-Reduce style jobs to learn more about how Hadoop works. Open Ambari (or Cloudera Manger etc.) interface which comes pre-configured with these sandboxes and see the tools and services that come pre-packaged with the sandbox. These are the most common ones and can be used as a useful list for starters. Start learning about them (but skip Pig if you can, even if it is pre-installed ;)
Once you are familiar with the sandbox you have, I would suggest going for Apache Nifi which has easier learning curve and give a lot of flexibility. But you will most likely have to setup a new sandbox for that. It may also serve as a good revision exercise for learning. Integrate that with your Hadoop sandbox, implement some decent use cases and you will have some good experience to show.

Can apache mahout ALS work without hadoop?

I tried using ParallelALSFactorizationJob, but it crashes here:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
Command line help mentions using filesystem, but it seems it wants hadoop. How can I run it on Windows, mahout.cmd file is broken:
"===============DEPRECATION WARNING==============="
"This script is no longer supported for new drivers as of Mahout 0.10.0"
"Mahout's bash script is supported and if someone wants to contribute a fix for this"
"it would be appreciated."
So is that possible (ALS + Windows - hadoop)?
Mahout is a community-driven project and its community is very strong.
"Apache Mahout is one of the first and most prominent Big Data machine
learning platforms. It implements machine learning algorithms on top
of distributed processing platforms such as Hadoop and Spark."
-Tiwary, C. (2015). Learning Apache Mahout.
Apache Spark is an open-source, in-memory, general purpose computing system that runs on both Windows and Unix like systems. Instead of Hadoop-like disk-based computation, Spark uses cluster memory to upload all the data into the memory, and this data can be queried repeatedly.
"As Spark is gaining popularity among data scientists, the Mahout
community is also quickly working on making Mahout algorithms function
on Spark's execution engine to speed up its calculation 10 to 100
times faster. Mahout provides several important building blocks to
create recommendations using Spark."
-Gupta, A (2015). Learning Apache Mahout Classification.
(This last book also provides a step by step guide Using Mahout's Spark shell (they don't use Windows and it isn't clear if they use Hadoop or not though). For more information on that topic, see the implementation section at https://mahout.apache.org/users/sparkbindings/play-with-shell.html.)
In addition to this, you can build recommendation engines using Spark such as DataFrames, RDD, Pipelines, and Transforms available in Spark MLlib and
in Spark, (...) the Alternating Least Squares (ALS) method is used for
generating model-based collaborative filtering.
-Gorakala, S. (2016). Building Recommendation Engines.
At this point, there's one question still to answer before answering your question: can we run Spark without Hadoop?.
So, yes, it's possible to use ALS method on Windows using Spark (without Hadoop).

New to Titan db, help installing titan db

I am new to Titan db and I have been reading the documentation in this website: http://s3.thinkaurelius.com/docs/titan/0.5.4/ I really could not find much documentations on installing Titan db, can I install it on my windows 7 or do I need to install it on a Virtual machine that runs on Linux?
Is this the only download I need to get started? Titan 0.5.4 with Hadoop 2 (signature). https://github.com/thinkaurelius/titan/wiki/Downloads ?
Do I also need to install hadoop or will the link above i provided will install it as well?
The Titan distribution you mentioned generally has what you need to get started. I'm not sure what you mean by "installing titan" because how you set it up is highly dependent on what you plan to do with it. If you are just playing around in the Titan Gremlin Console, just unapackage the Titan distribution you mentioned and start it with the included gremlin.sh file (or in the case of windows gremlin.bat). If you plan to use a backend like hbase or cassandra, you probably need to have those up and running first - check their documentation to understand how that works. The same could be said of the indexing backends like elasticsearch. In short, your question is a bit too general to answer with real specifics.
I would recommend that you spend some more time reviewing the documentation you referenced to make sure you understand the Titan architecture and approach. If you skipped Getting Started or didn't fully understand that, I would re-read it and try out the examples. If that's a bit too much you might even consider stepping back to just pure Gremlin and trying the Getting Started docs there.

Which distribution of hadoop is better?

I am working with massive data, my input data is about 100 GB.I want to choose one of the hadoop distributions, but i don't know to choose mapr cluster or cloudera cluster. i want to use free versions(mapr M3 and cloudera CDH4 that uses hadoop 0.20).
which of them is better? which configurations do i use that they work the best?
Thanks.
Actually speaking, answer to this question is the most common answer in this world, it depends. It's totally upto you and your requirements. One might find one particular flavor more suitable for his/her needs, and you might find the same flavor less useful. Moreover it's all about personal choice, like I personally like Apache's Hadoop. All are good. It's just that which one fits into your needs.
Which of them is better? is a controversial topic. Questions like this often end up as heated arguments. See this question for example. So, i'm not going to list down advantages of any one over the other. But there are certain differences among these different flavors of Hadoop which could probably help you during your thought process.
The major difference between CDH(Apache Hadoop as well) and MapR is that MapR uses its own proprietary file system, MapRFS instead of HDFS. The M3 Edition is free and available for unlimited production use. Support is provided on a community basis and through MapR's Forums. CDH is 100% open source and you can use the "Standard" version of Cloudera Manager without any charges. And Apache, well it's Apache :). Do what ever you feel like.
MapR has even partnered recently with Canonical, the organization behind the Ubuntu operating system, in an effort to make Hadoop available as an integrated part of Ubuntu through its repositories. The partnership announced that MapR's M3 Edition for Apache Hadoop will be packaged and made available for download as an integrated part of the Ubuntu operating system(see this if you need more info on this). The source code is available on Github. CDH codebase is same as Apache's, with some patches of their own.
But the free edition lacks some good features like JobTracker HA, NameNode HA, Mirroring, Snapshot etc. CDH4, being based on Hadoop-2.x provides you the HA features though. By virtue of its design MapR doesn't have any SPOF, like CDH3(or Hadoop-1.x) does. The MapRFS stores data in volumes, conceptually in a set of containers distributed across a cluster. Each container includes its own metadata, eliminating the central NameNode single point of failure. Still the API is Apache Hadoop compatible. MapR setup requirements differ from Apache/CDH. Like MapR requires raw volumes to be available for installation for example. Once you have the correct hardware & OS pre-requisites, setup times and eval times should be on the same order of magnitude as Apache/CDH.
IMHO, M3 is not gonna give you huge advantages over Apache/CDH as some of the catchy MapR features are not present in M3 free edition, like NFS-HA, Snapshots etc.
Being the first one Cloudera definitely has an extra edge in terms of experience and a solid customer base. But MapR has gone more innovative in terms of significant changes to the MapReduce and HDFS components to improve performance.
I'll write some more after sometime, as i'm on a call and you are waiting for the answer ;)

Hadoop virtual cluster vs single machine

I have a question regarding speed & performance of
using multiple virtualized nodes in a single machine VS single node on the single machine itself.
which one will perform better?
The reason why I ask this question is because I am currently learning hadoop on a single machine, and I see some tutorials on the internet that shows the use of multiple virtualized nodes in a single machine.
Thank you in advance
There is always some overhead that comes with virtualization, so unless really necessary I wouldn't advise to run Hadoop in a virtualized environment.
That being said, I know VMWare did a lot of work on making Hadoop work in a virtualized environment, and they have published some benchmarks in which they claim under certain conditions to have better performance with VMs that a native application. I haven't played much with vSphere, but this could be something to look at if you want to explore virtualization further. But don't take the numbers for granted, it really depends on the type of hardware you're running, so in some conditions I think you might gain some performance with VMs, but I'm guessing from experience that in most cases you won't gain anything.
If you're just getting started and testing with Hadoop, I think virtualizing is overkill. You can very easily run Hadoop in pseudo-distributed mode, which means that you can run multiple Hadoop daemons on the same box, each as a separate process. That's what I used to get started with Hadoop, and it's a good head start. You can find more info here (or might need another page depending on which Hadoop version you're running).
If you get to the point where you want to test with a real cluster, but don't have the resources, I would advise looking at Amazon Elastic Map/Reduce: it gives you a cluster on demand and it's pretty cheap. That way you can do more advanced tests. More info here.
the bottom line is, I think if the purpose is simply testing, you don't really need a virtual cluster.
A performance analysis case study conducted on this topic showed that a virtual Hadoop cluster is only around 4% less efficient compared to its native counterpart: Virtualized hadoop performance case study

Resources