How to run mahout from command line with KNN based Item Recommender? - hadoop

I'm new to Mahout and still trying to figure things out.
I'm trying to run a KNN-based recommender with Mahout 0.8 on a Hadoop cluster (a distributed recommender). In Mahout 0.8 KNN is deprecated, but it is still usable (at least when I call it from Java code).
I have several questions:
1) Is it true that there are basically two Mahout implementations?
distributed (runs from the command line)
non-distributed (runs from a jar file)
2) Assuming (1) is correct, does Mahout support running a KNN-based recommender from the command line? Can someone point me in the right direction?
3) Assuming (1) is wrong, how can I build a recommender in Java (I'm using Eclipse) that runs distributed on a Hadoop cluster?
Thanks!

KNN is being deprecated because it is being replaced with the item-based and user-based cooccurrence recommenders and the ALS-WR recommender, which are better and more modern.
1) Yes, but not all code has a CLI. For the most part, the CLI jobs in Mahout are Hadoop/distributed jobs that write their output to files in HDFS. They can also be run from jar files with your own code wrapping them, as you must do with the local/non-distributed/non-Hadoop versions, which have no CLI. The in-memory recommenders require you to pass in a user ID to get recs, so you have to write code to do that. The Hadoop versions do have a CLI, since they precalculate all recs for all users and put them in files. You'll probably insert those into your DB or serve them up some other way.
2) No. To my knowledge, only the user-based, item-based, and ALS-WR recommenders are supported from the command line, which runs the Hadoop/distributed version of the recommenders. This can of course work on a single machine, even using the local filesystem, since Hadoop can be set up that way.
3) For the in-memory recommenders, just write your driver code and run it in Eclipse; since Hadoop is not involved, it works fine. If you want to use the Hadoop versions, set up Hadoop on your dev machine to run locally using the local filesystem, and everything works fine in Eclipse. Once you have things debugged, move the code to your Hadoop cluster. You can also debug remotely on the cluster, but that is another question altogether.
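To make the "pass in a user ID" point concrete, here is a minimal, dependency-free sketch of an in-memory item-based cooccurrence recommender in plain Java. It does not use Mahout's API at all: the class CooccurrenceSketch, its methods, and the sample data are hypothetical names made up for illustration, and the scoring (summing raw cooccurrence counts) is a simplification rather than what Mahout actually computes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names are Mahout APIs. */
final class CooccurrenceSketch {

    /** Made-up sample data: user ID -> set of item IDs the user liked. */
    static Map<String, Set<String>> samplePrefs() {
        Map<String, Set<String>> prefs = new HashMap<>();
        prefs.put("u1", new HashSet<>(Arrays.asList("A", "B")));
        prefs.put("u2", new HashSet<>(Arrays.asList("A", "B", "C")));
        prefs.put("u3", new HashSet<>(Arrays.asList("B", "C", "D")));
        return prefs;
    }

    /** For each ordered item pair (a, b), counts how many users liked both. */
    static Map<String, Map<String, Integer>> cooccurrence(Map<String, Set<String>> prefs) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Set<String> items : prefs.values()) {
            for (String a : items) {
                for (String b : items) {
                    if (!a.equals(b)) {
                        counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    /** Returns up to howMany items the user has not seen, highest score first. */
    static List<String> recommend(Map<String, Set<String>> prefs, String userId, int howMany) {
        Set<String> seen = prefs.getOrDefault(userId, Collections.<String>emptySet());
        Map<String, Map<String, Integer>> counts = cooccurrence(prefs);
        Map<String, Integer> scores = new TreeMap<>(); // sorted keys give a deterministic tie order
        for (String item : seen) {
            Map<String, Integer> row = counts.getOrDefault(item, Collections.<String, Integer>emptyMap());
            for (Map.Entry<String, Integer> e : row.entrySet()) {
                if (!seen.contains(e.getKey())) {
                    scores.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((x, y) -> Integer.compare(scores.get(y), scores.get(x))); // stable sort
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        // The driver has to supply the user ID, just like with the in-memory recommenders.
        System.out.println(recommend(samplePrefs(), "u1", 2)); // prints [C, D]
    }
}
```

The distributed jobs avoid this per-user call by looping over every user up front and writing the results to files, which is why they can be driven from a CLI.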
The latest thing in Mahout recommenders is one that is trained in the background using Hadoop, and then the output is indexed by Solr. You then query Solr with items the user has expressed a preference for; there is no need to precalculate all recs for all users, since they are returned from a Solr query in near real time. This is in Mahout 1.0-SNAPSHOT's mahout/examples/ or here https://github.com/pferrel/solr-recommender
BTW this code is being integrated with Mahout 1.0 and moved to run on Spark instead of Hadoop, so even the training step will be much faster.
Update:
I've clarified what can be run from the CLI above.

Related

Can apache spark run without hadoop?

Are there any dependencies between Spark and Hadoop?
If not, are there any features I'll miss when I run Spark without Hadoop?
Spark is an in-memory distributed computing engine.
Hadoop is a framework for distributed storage (HDFS) and distributed processing (YARN).
Spark can run with or without Hadoop components (HDFS/YARN)
Distributed Storage:
Since Spark does not have its own distributed storage system, it has to depend on one of these storage systems for distributed computing.
S3 – Non-urgent batch jobs. S3 fits very specific use cases when data locality isn’t critical.
Cassandra – Perfect for streaming data analysis and an overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
Distributed processing:
You can run Spark in three different modes: Standalone, YARN and Mesos
Have a look at the below SE question for a detailed explanation about both distributed storage and distributed processing.
Which cluster type should I choose for Spark?
Spark can run without Hadoop, but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works really well once done (you can read a summary of what was needed to set it up properly here).
(Edit) Note: since version 2.3.0, Spark also has native support for Kubernetes.
By default, Spark does not have a storage mechanism.
To store data, it needs a fast and scalable file system. You can use S3, HDFS, or any other file system. Hadoop is an economical option due to its low cost.
Additionally, if you use Tachyon, it will boost performance with Hadoop. Hadoop is highly recommended for Apache Spark processing.
As per Spark documentation, Spark can run without Hadoop.
You can run it in standalone mode without any resource manager.
But if you want to run a multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS, S3, etc.
Yes, Spark can run without Hadoop. All core Spark features will continue to work, but you'll miss things like easily distributing all your files (code as well as data) to all the nodes in the cluster via HDFS.
Yes, you can install Spark without Hadoop.
That would be a little tricky.
You can refer to Arnon's link on using Parquet with S3 configured as the data storage.
http://arnon.me/2015/08/spark-parquet-s3/
Spark only does the processing, and it uses memory to perform the task, but to store the data you need some data storage system. This is where Hadoop comes in: it provides the storage for Spark.
One more reason for using Hadoop with Spark is that both are open source and integrate with each other easily compared to other data storage systems. Other storage such as S3 is trickier to configure, as mentioned in the link above.
But Hadoop also has its own processing unit, called MapReduce.
Want to know the difference between the two?
Check this article: https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
I think this article will help you understand
what to use,
when to use and
how to use !!!
Yes, of course. Spark is an independent computation framework. Hadoop is a distributed storage system (HDFS) with the MapReduce computation framework. Spark can get data from HDFS, as well as from any other data source such as a traditional database (JDBC), Kafka, or even local disk.
Yes, Spark can run with or without a Hadoop installation. For more details you can visit https://spark.apache.org/docs/latest/
Yes, Spark can run without Hadoop. You can install Spark on your local machine without Hadoop. But the Spark distribution comes with prebuilt Hadoop libraries, and these are used when you install it on your local machine.
You can run Spark without Hadoop, but Spark has a dependency on Hadoop's winutils, so some features may not work. Also, if you want to read Hive tables from Spark, then you need Hadoop.
My English is not good, forgive me!
TL;DR
Use local mode (single node) or standalone mode (cluster) to run Spark without Hadoop, but it still needs Hadoop dependencies for logging and some file processing.
Running Spark on Windows is strongly NOT recommended!
Local mode
Spark has many run modes; one of them, called local, runs without Hadoop.
So here is the first question: how do we tell Spark that we want to run in local mode?
After reading the official doc, I gave it a try on my Linux OS:
1) Install Java and Scala. This is not the core content, so I'll skip it.
2) Download the Spark package
There are two types of package: "without hadoop" and "hadoop integrated".
The most important thing is that "without hadoop" does NOT mean "runs without Hadoop"; it only means Hadoop is not bundled, so you can bundle it with your custom Hadoop!
Spark can run without Hadoop (HDFS and YARN) but needs Hadoop dependency jars, such as the Parquet/Avro etc. SerDe classes, so I strongly recommend the "integrated" package (with the "without hadoop" package you will find some logging dependencies like log4j and slf4j and other common utility classes missing, whereas all of these are bundled with the Hadoop-integrated package)!
3) Run in local mode
The simplest way is to just run the shell, and you will see the welcome log:
# same as ./bin/spark-shell --master local[*]
./bin/spark-shell
Standalone mode
Same as above, but step 3 is different.
# Start up the cluster
# if you want to run in the foreground
# export SPARK_NO_DAEMONIZE=true
./sbin/start-master.sh
# run this on every worker
./sbin/start-worker.sh spark://VMS110109:7077
# Submit a job or just start a shell
./bin/spark-shell --master spark://VMS110109:7077
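The value given to --master is what selects the run mode in the commands above. As a rough illustration, here is a small plain-Java sketch that maps the common master URL forms to the mode they select. SparkMasterUrl and describe are hypothetical helper names, not part of Spark; the URL forms themselves (local, local[N], local[*], spark://host:port, yarn, mesos://host:port, k8s://...) are the ones I know from the Spark docs, and anything else is simply reported as unknown.

```java
/** Hypothetical helper, not a Spark class: names the run mode a master URL selects. */
final class SparkMasterUrl {

    static String describe(String master) {
        if (master.equals("local") || master.startsWith("local[")) {
            return "local";       // single JVM, no cluster manager and no Hadoop daemons
        }
        if (master.startsWith("spark://")) {
            return "standalone";  // Spark's own master/worker daemons
        }
        if (master.equals("yarn")) {
            return "yarn";        // cluster location is read from the Hadoop configuration
        }
        if (master.startsWith("mesos://")) {
            return "mesos";
        }
        if (master.startsWith("k8s://")) {
            return "kubernetes";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String[] samples = {"local[*]", "spark://VMS110109:7077", "yarn"};
        for (String m : samples) {
            System.out.println(m + " -> " + describe(m));
        }
    }
}
```

Only the yarn form needs a Hadoop installation to point at; local and standalone do not.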
On Windows?
I know many people run Spark on Windows just for study, but things are quite different on Windows, and I really strongly do NOT recommend using Windows.
The most important thing is to download winutils.exe from here and set the system variable HADOOP_HOME to point to where winutils is located.
At this moment 3.2.1 is the latest release of Spark, but it has a bug. You will get an exception like Illegal character in path at index 32: spark://xxxxxx:63293/D:\classe when you run ./bin/spark-shell.cmd. Starting a standalone cluster first and then using ./bin/spark-shell.cmd, or using a lower version, is a temporary fix.
For more detail and a solution you can refer here.
No. It requires a full-blown Hadoop installation to start working - https://issues.apache.org/jira/browse/SPARK-10944

Difference between Pig in local mode vs pig-withouthadoop.jar

I wanted to know what the performance gain or loss is if I use Pig in local mode (which internally calls MapReduce) vs. using the pig-withouthadoop.jar file.
Does pig-withouthadoop.jar really not use Hadoop?
And if I only want to use Pig without a cluster, e.g. to design a data flow, what should I use: Pig in local mode or the pig-withouthadoop.jar file?
Currently I have written my script using Pig local mode, and while trying to deploy on a server and set up Pig in local mode, I think I also need HADOOP_HOME to be set in the environment variables before setting the PIG_HOME variable.
Kindly advise.
Thanks in advance. :)
Let me answer your questions in sequence:
1) Regarding performance: if we assume the file size and the Pig script are the same in local mode and Hadoop mode, then processing will definitely be faster in local mode, as all the tasks are performed in a single JVM. In Hadoop mode, the input file has to be carried to the data nodes, and the Pig script or UDFs also have to be carried to the cluster, which takes more time. In both cases, though, the Pig scripts and UDFs are internally converted to map and reduce tasks, and the number of map and reduce classes constructed will be the same. We can check this with the EXPLAIN command.
2) No. Pig internally contains a bundle of Hadoop jars. So if you haven't started Hadoop with the start-all.sh command, Pig will still work, as it uses the bundled Hadoop jars. Now, the interesting part: if you have installed Hadoop and then use Pig without starting Hadoop, it will sometimes fail because of a Hadoop version mismatch. So, to be on the safe side, start Hadoop explicitly. So Pig always uses Hadoop. :)
3) Always use Hadoop local mode if the file size is small. As already explained, Pig by default comes with Hadoop jars.
4) Yes, you need to set this if you are using Hadoop explicitly.
Local mode will literally run Pig, HDFS and MR1 (or YARN+MR2) in one JVM.
It's not really relevant to compare performance difference in local vs cluster modes. Local mode is generally used for testing or running small MR jobs that can work on 1 node.
With regard to pig-withouthadoop.jar, I can see how the jar's name can be construed to mean that Pig won't be using Hadoop. But that is not the case.
Pig packages two jars relevant to execution:
pig.jar, which is an "uber jar" that also includes all the Hadoop and MapReduce jars. You can literally take that jar to a box which does not already have Hadoop installed and run Pig (after setting the right configs and environment).
pig-withouthadoop.jar, which is the one to use when Hadoop is already installed and configured, as it is on most clusters. This jar is half the size of the uber jar, for obvious reasons.
Either way, you'll need to ensure the Hadoop configs hdfs-site.xml, mapred-site.xml, etc. are in the standard location (/etc/hadoop/conf/ typically) for Pig to work.

Running a hadoop job using java command

I have a simple Java program that sets up an MR job. I could successfully execute this on Hadoop infrastructure (Hadoop 2.x) using 'hadoop jar '. But I want to achieve the same thing using the java command as below.
java className
How can I pass hadoop configuration to this className?
What extra arguments do I need to supply?
Any link/documentation would be highly appreciated.
You can run it with java the same way you run your 'hadoop jar' command with its other parameters.
First check that this command prints the Hadoop classpath:
$ hadoop classpath
Then add your custom jar to the classpath and give the class to run (className, as in your question):
$ java -cp `hadoop classpath`:/my/tools/jar/tools.jar className
I am able to get mine working with this on my Hadoop cluster.
I don't think you can find documentation on this. The hadoop command is a script, and a lot of classes are used there, e.g. FsShell, the class for accessing the filesystem, and RunJar, the class used when we run a jar. Adding the Hadoop-related libraries and configuration files to the classpath is handled in the hadoop command itself.
You had better take a look at the hadoop script.
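Since the configuration files reach your program through the classpath, a quick way to see whether a plain java launch picked them up is to look for them as classpath resources. Below is a minimal sketch in plain Java with no Hadoop dependency; HadoopConfProbe and onClasspath are hypothetical names of mine, and the list of file names is just the usual set of *-site.xml files, so adjust it to whatever your cluster uses.

```java
/** Hypothetical probe, not a Hadoop class: reports which config files are visible on the classpath. */
final class HadoopConfProbe {

    /** True if the named resource can be found by the current class loader. */
    static boolean onClasspath(String resource) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(resource) != null;
    }

    public static void main(String[] args) {
        // Assumed file names; a cluster may not use all of them.
        String[] files = {"core-site.xml", "hdfs-site.xml", "mapred-site.xml", "yarn-site.xml"};
        for (String f : files) {
            System.out.println(f + (onClasspath(f) ? " found" : " NOT found") + " on classpath");
        }
    }
}
```

If you run it once with plain java and once with -cp set to the output of hadoop classpath, the difference in what it reports should show what the hadoop script was adding for you. If the files are not found, the job would most likely fall back to default settings instead of your cluster's.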
How can you do that? Executing any jar file means it has to run in a distributed environment where all the daemons work together to complete the execution.
We are not running locally or on the local file system. So it needs to be executed according to the norms of HDFS, and I don't think we can execute it like we do on a local file system.
Hadoop is a framework that simplifies distributed computing. Before Hadoop, programmers already knew about parallel processing and multithreading concepts. But when you deal with multiple machines you need to know how to handle
Communication between machines
Network processing
Fault tolerance: what if one machine fails?
and many more, which is a huge task. That's where Hadoop simplifies your job. It takes care of all the operating-level stuff so you can focus on just your business logic.
So in your case, based on what you are asking, there is no direct answer. Just passing parameters to your program won't make it work; you would need to write a lot of libraries to deal with distributed computing. If you want to explore them, I would suggest going ahead and reading the Hadoop source code.
http://hadoop.apache.org/version_control.html

running a non mapreduce program in hadoop

I have a question. I have a program written in NetBeans. The program reads data from Cassandra and writes the result back into it. My program is not MapReduce at all. I built the program and made a .jar file from it. Now I want to know if I can execute it in Hadoop.
Actually, I want to know: can I run a non-MapReduce program in Hadoop?
You could architect this program to run on Hadoop v2 as a Yarn application. This would require re-architecting your application to fit the Yarn paradigm. An example of how to do this is given here: Writing App Framework on Yarn
This is not a simple exercise. Also, if you are interested in using Hadoop, I would consider simply rewriting your application to use HBase (another NoSQL column-oriented database and a competitor to Cassandra), which is written specifically for Hadoop and integrates with MapReduce.
This question is ages old but has never been answered. Anyhow, two projects are looking into this issue:
Apache Slider (incubating): http://slider.incubator.apache.org/
and
Apache Myriad (incubating): http://myriad.incubator.apache.org/
Slider is mainly sponsored by Hortonworks while Myriad is a MapR / Mesosphere project with large assistance from PayPal.

Cascading HBase Tap

I am trying to write Scalding jobs which have to connect to HBase, but I have trouble using the HBase tap. I have tried using the tap provided by Twitter Maple, following this example project, but it seems that there is some incompatibility between the Hadoop/HBase version that I am using and the one that was used as client by Twitter.
My cluster is running Cloudera CDH4 with HBase 0.92 and Hadoop 2.0.0-cdh4.1.3. Whenever I launch a Scalding job connecting to HBase, I get the exception
java.lang.NoSuchMethodError: org.apache.hadoop.net.NetUtils.getInputStream(Ljava/net/Socket;)Ljava/io/InputStream;
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:363)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1046)
...
It seems that the HBase client used by Twitter Maple is expecting some method on NetUtils that does not exist on the version of Hadoop deployed on my cluster.
How do I track down what exactly is the mismatch - what version would the HBase client expect and so on? Is there in general a way to mitigate these issues?
It seems to me that often client libraries are compiled with hardcoded version of the Hadoop dependencies, and it is hard to make those match the actual versions deployed.
The method actually exists but has changed its signature. Basically, it boils down to having different versions of Hadoop libraries on your client and server. If your server is running Cloudera, you should be using the HBase and Hadoop libraries from Cloudera. If you're using Maven, you can use Cloudera's Maven repository.
It seems like library dependencies are handled in Build.scala. I haven't used Scala yet, so I'm not entirely sure how to fix it there.
The change that broke compatibility was committed as part of HADOOP-8350. Take a look at Ted Yu's comments and the responses. He works on HBase and had the same issue. Later versions of the HBase libraries should automatically handle this issue, according to his comment.
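To track down this kind of mismatch yourself, it helps to know that the JVM links a call by method name, parameter types and return type together, so a method whose return type changed is reported as NoSuchMethodError even though a method with that name still exists. Here is a small reflection sketch in plain Java that prints where a class was loaded from and checks for an exact signature. MethodProbe and its methods are hypothetical names; by default it inspects java.net.Socket so that it runs anywhere, and on your client you would pass org.apache.hadoop.net.NetUtils as the first argument instead.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

/** Hypothetical diagnostic, not part of Hadoop or HBase. */
final class MethodProbe {

    /** Jar or directory the class was loaded from; JDK classes usually have no code source. */
    static String loadedFrom(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return src == null ? "(JDK runtime)" : String.valueOf(src.getLocation());
    }

    /** True only if a public method matches the name, the parameter types AND the return type. */
    static boolean hasMethod(Class<?> c, String name, Class<?> returnType, Class<?>... params) {
        try {
            Method m = c.getMethod(name, params);
            return m.getReturnType().equals(returnType);
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: class to inspect, args[1]: optional method name filter
        String className = args.length > 0 ? args[0] : "java.net.Socket";
        Class<?> c = Class.forName(className);
        System.out.println(c.getName() + " loaded from " + loadedFrom(c));
        for (Method m : c.getMethods()) {
            if (args.length < 2 || m.getName().equals(args[1])) {
                System.out.println("  " + m); // includes the return type
            }
        }
    }
}
```

Run with the same classpath as the failing job and with the arguments org.apache.hadoop.net.NetUtils getInputStream, it should show which Hadoop jar is actually being picked up and what getInputStream returns in that version, which you can compare against the signature in the stack trace.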
