Apache Kylin installation without Sandbox - hadoop

I was wondering if there are any resources regarding Apache Kylin installation without any sandbox (like cloudera, hortonworks) support. I have managed to do the following:
Install Hadoop 2.6
Install Hive
Install HBase
Then I used the binary from kylin site and so far been able to run it. The problem start when I try to build a cube, the map reduce job gets stuck in step 2. I am thinking if it is still assuming to be in sandbox mode and not submitting job to hadoop at all (there is no entry in hadoop jobtracker).
So I need solution regarding the two:
1. Possible configuration of kylin in pure hadoop setup (no sandbox)
2. somehow enable the kylin setup to submit job to hadoop.

There is no such sandbox or non-sandbox configuration in Kylin. Just make sure the machine where Kylin runs has hadoop setup correctly and you should be fine.
Under the scene, kylin.sh uses hbase classpath and hive -e set | grep 'env:CLASSPATH' to detect hadoop settings. Double check these commands work as expect if you are not sure what cluster Kylin connects to.
If Kylin has problem submitting MR jobs, check two places. First is hadoop resource manager, see if the job has really been submitted or not. Sometimes it's just running slow. Second check kylin.log, see if any exception there. Post the log to kylin dev mailing list and someone will be able to help.

You can install hadoop-2.6 , hive-0.14 ,hbase-0.98.8-hadoop2 with Zookeeper inbuilt or external zookeeper-3.5
Now you can run kylin-v1.1-release on it
If you still face Issues paste the log here

Related

Jenkins as JobServer on Hadoop EdgeNode

I´m not sure that someone can help me but I´ll take a try.
I´m running Jenkins on an Openshift-Cluster to use it for Deployment and as a jobserver for running ETL-Jobs. These jobs are transferring data from flatfiles to databases and from db to db.
Now, I should expand the system to transfer data to a hadoop cluster using MapR.
What I would like to know is, how can I use a new Jenkins-Slave as a jobserver on an EdgeNode from the hadoop-cluster using MapR. Do I need the Jenkins on the EdgeNode or am I able to use MapR from my existing Jenkins-Jobserver?
Mabye, someone is able to help me or has some informations/links how to solve it.
Thx to all....
"Use MapR" isn't quite clear to me because I just view it as Hadoop at the end of the day, but you can effectively make your Jenkins slave an "edge node" by installing only the Hadoop Java (maybe also MapR) client utilities plus any XML configuration files from the other edge nodes that define how to communicate with the cluster.
Then, Jenkins would be able to run sh("hadoop jar app.jar"), for example
If you're using Openshift, you might also try putting a Hadoop client inside a Docker image that could run in Jenkins, or anywhere else

does Apache Kylin need a Apache Derby or Mysql for run the sample cube

I installed Java and Hadoop and Hbase and Hive and Spark and Kylin.
hadoop-3.0.3
hbase-1.2.6
apache-hive-2.3.3-bin
spark-2.2.2-bin-without-hadoop
apache-kylin-2.3.1-bin
I will be grateful if someone in Help me with Kyle's installation and configuration them.
http://kylin.apache.org/docs/ this may help you. You can send email to dev#kylin.apache.org, then the questions will be discussed and answered in the mailing list. There are some tips for sending the email: 1. provide Kylin version 2. provide log information 3.provide the usage scenario. If you want to get a quick start, you can run Kylin in a Hadoop sandbox VM or in the cloud, for example, start a small AWS EMR or Azure HDInsight cluster and then install Kylin in one of the nodes. When you use Kylin-2.3.1, I suggest you use Spark-2.1.2.

installation of Oozie on a separate machine then Hadoop

Very new to Oozie, hence please excuse me if I sound like a newbie.
I have a hadoop cluster which is up and running. I want to install Oozie, this i want on a separate machine then then hadoop. Is this possible? the reason for asking is that on every installation guide I have seen it asks to install hadoop on the machine hence am not sure if its technically possible to have hadoop on a separate machine then Oozie.
Thanks in advance
Oozie server serves client's requests, it's a web application which uses embedded Tomcat, it can be installed on any machine where hadoop is reachable from, it's not tied to hadoop by itself. You can specify hadoop's nameNode and jobTracker in workflow properties so oozie will know where to send it's jobs.

Hadoop ResourceManager not show any job's record

I install Hadoop MultiNode cluster based on this link http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/
then I try to run wordcount example in my environment, but when I access to Resource Manager http://HadoopMaster:8088 to see the job's details, no records show in UI.
I also search this problem, one guy give the solution like that Hadoop is not showing my job in the job tracker even though it is running but in my case, I'm just running hadoop's example, in which wordcount also didn't add any extra configuration for yarn.
Anyone has install successfully Hadoop2 Muiltinode and Hadoop web UI works correctly can help me about this issue or can give a link to install correctly.
Whether you got the output of word-count job?

Can apache spark run without hadoop?

Are there any dependencies between Spark and Hadoop?
If not, are there any features I'll miss when I run Spark without Hadoop?
Spark is an in-memory distributed computing engine.
Hadoop is a framework for distributed storage (HDFS) and distributed processing (YARN).
Spark can run with or without Hadoop components (HDFS/YARN)
Distributed Storage:
Since Spark does not have its own distributed storage system, it has to depend on one of these storage systems for distributed computing.
S3 – Non-urgent batch jobs. S3 fits very specific use cases when data locality isn’t critical.
Cassandra – Perfect for streaming data analysis and an overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
Distributed processing:
You can run Spark in three different modes: Standalone, YARN and Mesos
Have a look at the below SE question for a detailed explanation about both distributed storage and distributed processing.
Which cluster type should I choose for Spark?
Spark can run without Hadoop but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3 which was a little tricky to set up but works really well once done (you can read a summary of what needed to properly set it here).
(Edit) Note: since version 2.3.0 Spark also added native support for Kubernetes
By default , Spark does not have storage mechanism.
To store data, it needs fast and scalable file system. You can use S3 or HDFS or any other file system. Hadoop is economical option due to low cost.
Additionally if you use Tachyon, it will boost performance with Hadoop. It's highly recommended Hadoop for apache spark processing.
As per Spark documentation, Spark can run without Hadoop.
You may run it as a Standalone mode without any resource manager.
But if you want to run in multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS,S3 etc.
Yes, spark can run without hadoop. All core spark features will continue to work, but you'll miss things like easily distributing all your files (code as well as data) to all the nodes in the cluster via hdfs, etc.
Yes, you can install the Spark without the Hadoop.
That would be little tricky
You can refer arnon link to use parquet to configure on S3 as data storage.
http://arnon.me/2015/08/spark-parquet-s3/
Spark is only do processing and it uses dynamic memory to perform the task, but to store the data you need some data storage system. Here hadoop comes in role with Spark, it provide the storage for Spark.
One more reason for using Hadoop with Spark is they are open source and both can integrate with each other easily as compare to other data storage system. For other storage like S3, you should be tricky to configure it like mention in above link.
But Hadoop also have its processing unit called Mapreduce.
Want to know difference in Both?
Check this article: https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
I think this article will help you understand
what to use,
when to use and
how to use !!!
Yes, of course. Spark is an independent computation framework. Hadoop is a distribution storage system(HDFS) with MapReduce computation framework. Spark can get data from HDFS, as well as any other data source such as traditional database(JDBC), kafka or even local disk.
Yes, Spark can run with or without Hadoop installation for more details you can visit -https://spark.apache.org/docs/latest/
Yes spark can run without Hadoop. You can install spark in your local machine with out Hadoop. But Spark lib comes with pre Haddop libraries i.e. are used while installing on your local machine.
You can run spark without hadoop but spark has dependency on hadoop win-utils. so some features may not work, also if you want to read hive tables from spark then you need hadoop.
Not good at english,Forgive me!
TL;DR
Use local(single node) or standalone(cluster) to run spark without Hadoop,but stills need hadoop dependencies for logging and some file process.
Windows is strongly NOT recommend to run spark!
Local mode
There are so many running mode with spark,one of it is called local will running without hadoop dependencies.
So,here is the first question:how to tell spark we want to run on local mode?
After read this official doc,i just give it a try on my linux os:
Must install java and scala,not the core content so skip it.
Download spark package
There are "without hadoop" and "hadoop integrated" 2 type of package
The most important thing is "without hadoop" do NOT mean run without hadoop but just not bundle with hadoop so you can bundle it with your custom hadoop!
Spark can run without hadoop(HDFS and YARN) but need hadoop dependency jar such as parquet/avro etc SerDe class,so strongly recommend to use "integrated" package(and you will found missing some log dependencies like log4j and slfj and other common utils class if chose "without hadoop" package but all this bundled with hadoop integrated pacakge)!
Run on local mode
Most simple way is just run shell,and you will see the welcome log
# as same as ./bin/spark-shell --master local[*]
./bin/spark-shell
Standalone mode
As same as blew,but different with step 3.
# Starup cluster
# if you want run on frontend
# export SPARK_NO_DAEMONIZE=true
./sbin/start-master.sh
# run this on your every worker
./sbin/start-worker.sh spark://VMS110109:7077
# Submit job or just shell
./bin/spark-shell spark://VMS110109:7077
On windows?
I kown so many people run spark on windown just for study,but here is so different on windows and really strongly NOT recommend to use windows.
The most important things is download winutils.exe from here and configure system variable HADOOP_HOME to point where winutils located.
At this moment 3.2.1 is the most latest release version of spark,but a bug is exist.You will got a exception like Illegal character in path at index 32: spark://xxxxxx:63293/D:\classe when run ./bin/spark-shell.cmd,only startup a standalone cluster then use ./bin/sparkshell.cmd or use lower version can temporary fix this.
For more detail and solution you can refer for here
No. It requires full blown Hadoop installation to start working - https://issues.apache.org/jira/browse/SPARK-10944

Resources