Missing spark-env.sh if I installed pyspark with pip - pip

I installed pyspark 2.2.0 with pip, but I see neither a file named spark-env.sh nor the conf directory. I would like to define variables like SPARK_WORKER_CORES in this file. How should I proceed?
I am using Mac OS X El Capitan with Python 2.7.

PySpark from PyPI (i.e. installed with pip or conda) does not contain the full PySpark functionality; it is only intended for use with a Spark installation in an already existing cluster, in which case you might want to avoid downloading the whole Spark distribution. From the docs:
The Python packaging for Spark is not intended to replace all of the
other use cases. This Python packaged version of Spark is suitable for
interacting with an existing cluster (be it Spark standalone, YARN, or
Mesos) - but does not contain the tools required to setup your own
standalone Spark cluster. You can download the full version of Spark
from the Apache Spark downloads page.
So what you should do is download the full Spark distribution, as the docs say. PySpark is an essential component of it, and the conf directory is included.
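Once a full distribution is unpacked, variables such as SPARK_WORKER_CORES belong in conf/spark-env.sh under that directory (the distribution ships a conf/spark-env.sh.template you can copy instead). As a minimal sketch, the file could also be generated from Python. The helper name write_spark_env is hypothetical and not part of PySpark, and the settings shown in the usage note are only illustrative values.

```python
import shlex
from pathlib import Path


def write_spark_env(spark_home, settings):
    """Write conf/spark-env.sh under spark_home, one export line per setting.

    write_spark_env is a hypothetical helper, not part of PySpark.
    """
    conf_dir = Path(spark_home) / "conf"
    conf_dir.mkdir(parents=True, exist_ok=True)
    lines = ["#!/usr/bin/env bash"]
    for name, value in sorted(settings.items()):
        # Quote each value so spaces or shell metacharacters survive sourcing.
        lines.append("export {}={}".format(name, shlex.quote(str(value))))
    env_file = conf_dir / "spark-env.sh"
    env_file.write_text("\n".join(lines) + "\n")
    return env_file
```

For example, write_spark_env("/opt/spark", {"SPARK_WORKER_CORES": 2}) would write the file under /opt/spark/conf, assuming Spark was unpacked to /opt/spark. Note that this overwrites any existing spark-env.sh.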

Related

installing Cloudera Impala without installing CDH?

I'm using pure Apache Hadoop with Hive. I need to install Apache Impala to integrate with Hive and Kudu. I'm trying to install it from source using this link and these build instructions. That page says it covers how to "build Impala from source and how to configure and run Impala in a single node development environment."
Can I use the Apache Impala source build in a production environment with pure Apache Hadoop (not Cloudera)? I also saw this page from another Stack Overflow question. The answer there recommends installing from packages, but those packages depend on the CDH versions of HBase, Hadoop, etc.
Can anyone show how to install Apache Impala in a production environment on pure Apache Hadoop, without Cloudera's Hadoop?
Thanks.

Installing spark on hadoop

I installed Hadoop 2.7 on my Mac and now want to install Spark on it, but I can't find any documentation for this. Can anybody explain step by step how to install Spark on Hadoop?
Steps to Install Apache Spark
1) Open the Apache Spark website: http://spark.apache.org/
2) Click the Downloads tab; a new page will open.
3) Choose "Pre-built for Hadoop 2.7 and later".
4) Choose "Direct Download".
5) Click "Download Spark: spark-2.0.2-bin-hadoop2.7.tgz" and save it to your desired location.
6) Go to the downloaded .tgz file and extract it.
7) If your archive tool only removed the gzip layer in step 6, extract the resulting spark-2.0.2-bin-hadoop2.7.tar again [the file name will differ as the version changes] to produce the spark-2.0.2-bin-hadoop2.7 folder.
8) Open a shell prompt and go to the bin directory of the spark-2.0.2-bin-hadoop2.7 folder [the folder name will differ as the version changes].
9) Execute the command ./spark-shell (the launcher script has no .sh extension).
You will then be in the Spark shell, where you can execute Spark commands.
Quick start guide from Spark: https://spark.apache.org/docs/latest/quick-start.html
Hope this helps!
Running Spark on a YARN cluster takes many more steps, since both Hadoop and Spark have to be installed, so I wrote a step-by-step blog post on it. Following it, you can install both and run the Spark shell on YARN. See the link below:
https://blog.knoldus.com/2016/01/30/spark-shell-on-yarn-resource-manager-basic-steps-to-create-hadoop-cluster-and-run-spark-on-it/
Here are the steps I took to install Apache Spark on a Linux CentOS system with Hadoop:
1) Install a default Java runtime (e.g. sudo yum install java-11-openjdk). Note that Spark 2.4.x is documented for Java 8; Java 11 is supported from Spark 3.0 onward.
2) Download the latest release of Apache Spark from spark.apache.org.
3) Extract the Spark tarball (tar xvf spark-2.4.5-bin-hadoop2.7.tgz).
4) Move the folder created by the extraction to the /opt/ directory (sudo mv spark-2.4.5-bin-hadoop2.7/ /opt/spark).
5) Run /opt/spark/bin/spark-shell if you wish to work with Scala, or /opt/spark/bin/pyspark if you want to work with Python.
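The extract-and-locate steps above can also be sketched in Python with the standard tarfile module. This is only an illustration: extract_spark and launcher_paths are hypothetical helper names, and the sketch assumes the archive contains a single top-level folder, as the Spark tarballs named above do.

```python
import tarfile
from pathlib import Path


def extract_spark(tarball, dest_dir):
    """Extract a Spark .tgz into dest_dir and return the top-level folder."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression, so a .tgz unpacks in one pass.
    with tarfile.open(str(tarball), "r:*") as archive:
        # Assumes a single top-level folder such as spark-<version>-bin-hadoop<version>.
        top_level = archive.getnames()[0].split("/")[0]
        archive.extractall(str(dest))  # only extract archives you trust
    return dest / top_level


def launcher_paths(spark_home):
    """Return the Scala and Python shell launchers expected under bin/."""
    bin_dir = Path(spark_home) / "bin"
    return {"scala": bin_dir / "spark-shell", "python": bin_dir / "pyspark"}
```

After extraction, launcher_paths(home)["scala"] points at the spark-shell script and launcher_paths(home)["python"] at pyspark, which can then be started from a terminal.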

Install Spark 1.5 in existing Hortonworks HDP Cluster

I'm new to Hadoop and want to find out how to install Spark 1.5.1 on an existing Hadoop cluster: 4 nodes, Ubuntu 14.04, Hadoop 2.3.2, Ambari version 2.1.2.1. I followed the tutorial, but it provides a Spark version for Ubuntu 12, and I cannot install it on our system. So I got stuck after step 1:
sudo apt-get install spark_2_3_2_1_12-master -y
I got this error:
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package spark_2_3_2_1_12-master
Can anyone provide some guidelines on how to install 1.5?
We currently have Spark 1.4 installed, up, and running, but we need functionality that requires 1.5!
Ubuntu 14.04 Trusty Tahr is not officially supported by HDP. If you look at the repos available for stack updates (the HDP stack public repos), they only have ones for CentOS, Red Hat, and Oracle Linux. Did you try using sbt (the Simple Build Tool that Spark's build uses) to build the Spark 1.5 source against your Hadoop install? You would need to set SPARK_HADOOP_HOME to your Hadoop location. See this for a step-by-step guide with Ubuntu 14.04 and an earlier version of Spark. I don't see why the same steps would fail with Spark 1.5.

how to install Spark and Hadoop from tarball separately [Cloudera]

I want to install the Cloudera distribution of Hadoop and Spark from tarballs.
I have already set up Hadoop in pseudo-distributed mode on my local machine and successfully ran a YARN example.
I have downloaded the latest CDH 5.3.x tarballs from here.
But the folder structure of the Spark download from Cloudera is different from the one on the Apache website. This may be because Cloudera provides its own version, maintained separately.
So far I have not found any documentation on installing Spark separately from this Cloudera tarball.
Could someone help me understand how to do it?
Spark can be extracted to any directory. To submit a job, run the ./bin/spark-submit command (available in the extracted Spark directory) with the required parameters. To start the interactive Spark shell, use ./bin/spark-shell.
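To make "with the required parameters" a little more concrete, here is a minimal Python sketch that assembles a spark-submit command line. The helper name spark_submit_command is hypothetical, and the master value, application file, and arguments are placeholders to adapt to your cluster; only the --master option itself is a standard spark-submit flag.

```python
import os


def spark_submit_command(spark_home, app, master="local[*]", app_args=()):
    """Build the argument list for bin/spark-submit (hypothetical helper)."""
    launcher = os.path.join(spark_home, "bin", "spark-submit")
    # --master selects where the job runs; "local[*]" runs it locally
    # with as many worker threads as there are logical cores.
    return [launcher, "--master", master, app] + list(app_args)
```

The resulting list can be passed to subprocess.call to launch the job, or joined with spaces to get the equivalent shell command.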

How to download CDH4 setup manually

How can I download the CDH4 setup manually?
I mean I want to download the setup without using apt-get from the Ubuntu command prompt.
CDH4 is Cloudera's distribution of Apache Hadoop. CDH is a collection of Apache Hadoop and several components of its ecosystem.
Assuming that you are asking for the source, each component can be downloaded as a tarball from the following location:
CDH Packaging and Tarball Information
