Using Amazon EMR, Hive 0.13, Hadoop 2.x, and Presto Server 0.89. I am trying to set up Presto to query data that is usually queried through Hive. The Hive metadata is stored in MySQL. Presto Server is installed and set up on all nodes. For the most part, everything is set up as documented on prestodb.io.
I first start the server on all nodes (coordinator and workers), and then start the CLI on the coordinator/name node. When I try to run a query using the commands below, I get a "Query ... No worker nodes available" error:
presto-cli --server localhost:8080 --catalog jmx --schema default
presto:default> SELECT * FROM sys.node;
"Query ... No worker nodes available"
If I include the node-scheduler.include-coordinator=true in my coordinator config.properties file, 1 node is returned from this query.
Configs:
etc/config.properties (only on coordinator)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://aws.internal.ip.of.coordinator:8080
etc/config.properties (only on workers)
coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery.uri=http://aws.internal.ip.of.coordinator:8080
etc/catalog/hive.properties (all nodes)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://aws.internal.ip.of.coordinator:9083
etc/catalog/jmx.properties (all nodes)
connector.name=jmx
etc/jvm.config (all nodes)
-server
-Xmx16G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-XX:ReservedCodeCacheSize=150M
etc/log.properties
com.facebook.presto=INFO
etc/node.properties
node.environment=production
node.id=unique-uuid #used uuidgen
node.data-dir=/mnt/presto-data
A simple mistake on my part was keeping this from running: I had a random semicolon instead of a period in my aws.internal.ip.of.coordinator IP address. Looking at my configs, I just didn't see it.
The configs above will work on an Amazon EMR multi-node cluster similar to the one described above.
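In hindsight, a quick sanity check of the URI properties would have caught it. A minimal sketch, assuming it is run from the Presto install directory on each node:
# Print every *.uri property and flag values that are not scheme://IPv4:port
grep -h '\.uri=' etc/config.properties etc/catalog/hive.properties \
  | grep -Ev '://([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]+$' \
  && echo "Suspicious URI found above - check for stray characters"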
Related
I had a lot of issues surrounding the installation/activation of Hive 2.1.0 on our HDP 2.6.2 cluster. But finally I got it working, so I wanted to share the steps involved with the community. I got these steps from different sources, which I will also mention below each step. My specifications:
Clustered HDP 2.6.2 (hortonworks) environment
Kerberos
Hive 1.2.1000 -> Hive 2.1.0
Step 1: Enable Hive Interactive Query
Follow the steps on the Hortonworks website. This includes enabling YARN pre-emption and some other YARN settings. After adjusting YARN, you can enable Hive Interactive Query via Ambari. You also have to specify a default queue that is at least 20% of your total cluster capacity.
Source
Step 2: Kerberos related settings
Make sure you add the following settings to the custom hiveserver2-interactive site in Ambari, where ${REALMNAME} is the name of your LDAP realm.
hive.llap.zk.sm.keytab.file=/etc/security/keytabs/hive.llap.zk.sm.keytab
hive.llap.zk.sm.principal=hive/_HOST@${REALMNAME}
hive.llap.daemon.keytab.file=/etc/security/keytabs/hive.service.keytab
hive.llap.daemon.service.principal=hive/_HOST@${REALMNAME}
Now you have to put those 2 keytabs (basically the same keytabs) on every YARN node. This can be done manually or through Ambari (Kerberos service). Make sure those keytabs are owned by hive:hadoop (chown) and have mode 440 (read for owner and group only).
Note: you also need a user hive on all those nodes.
Source
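A quick way to sanity-check this step on a YARN node might be the following (keytab paths are the ones from the settings above):
id hive                                              # the hive user must exist on the node
ls -l /etc/security/keytabs/hive.llap.zk.sm.keytab /etc/security/keytabs/hive.service.keytab
# expected: -r--r----- hive hadoop (i.e., chmod 440, chown hive:hadoop)
klist -kt /etc/security/keytabs/hive.service.keytab  # principals should look like hive/<host>@<REALM>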
Step 3: Zookeeper configuration
It could be that Hive is not recognized by ZooKeeper; this will give ACL errors when trying to start HiveServer2 Interactive. To cope with this issue, I added the right hive ACL nodes through a ZooKeeper client host.
su -
# First, authenticate with the hive keytab
kinit hive/`hostname` -kt /etc/security/keytabs/hive.service.keytab
# Second, connect to a zookeeper client on your cluster
/usr/hdp/current/zookeeper-server/bin/zkCli.sh -server ${ZOOKEEPER_CLIENT}
# Third, check the current status of the user-hive acl
getAcl /llap-sasl/user-hive
# Fourth, if it is not there, create the following nodes
create /llap-sasl/user-hive "" sasl:hive:cdrwa,world:anyone:r
create /llap-sasl/user-hive/llap0 "" sasl:hive:cdrwa,world:anyone:r
create /llap-sasl/user-hive/llap0/workers "" sasl:hive:cdrwa,world:anyone:r
# Fifth, change the llap-sasl node to add the user hive
setAcl /llap-sasl sasl:hive:cdrwa,world:anyone:r
Source 1, Source 2
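To double-check that the ACLs are in place, a quick look from the same zkCli session as above should now show the hive SASL entries:
getAcl /llap-sasl
getAcl /llap-sasl/user-hive
# expected on both: an entry for sasl:hive with cdrwa and world:anyone with r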
Basically, this should work for Kerberized environments. If you get errors related to ACLs, go back to your ZooKeeper settings and check that everything is fine. If you get errors related to a missing hive user, check whether the hive user is added correctly to the nodes. If you get an error related to Kerberos (principal or keytabs), check whether the keytabs are on the designated (YARN) nodes with the correct permissions.
What is really executed, and where, when using JDBC drivers to connect to e.g. Oracle?
1: I have started a spark master as
spark-class.cmd org.apache.spark.deploy.master.Master
and a worker like so
spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077
and spark shell as
spark-shell --master spark://myip:7077
in spark-defaults.conf I have
spark.driver.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
spark.executor.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
and in spark-env.sh I have
SPARK_CLASSPATH=C:/jdbcDrivers/ojdbc8.jar
I can now run queries against Oracle in the spark-shell:
val jdbcDF = spark.read.format("jdbc").option("url","jdbc:oracle:thin:@...
This works fine without separately adding the jdbc driver jar in the scala shell.
When I start the master and worker in the same way, but create a Scala project in Eclipse and connect to the master as follows:
val sparkSession = SparkSession.builder.
master("spark://myip:7077")
.appName("SparkTestApp")
.config("spark.jars", "C:\\pathToJdbc\\ojdbc8.jar")
.getOrCreate()
then it fails if I don't explicitly add the jdbc jar in the scala code.
How is the execution different? Why do I need to specify the jdbc jar in the code? What is the purpose of connecting to the master if it doesn't rely on the master and workers I started?
If I use multiple workers with jdbc will they use only one connection or will they simultaneously read in parallel over several connections?
You are certainly using more pieces than the sample needs, and that is what got you confused.
The two lines, spark-class.cmd org.apache.spark.deploy.master.Master and spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077, started a Spark Standalone cluster with one master and one worker. See Spark Standalone Mode.
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
You chose to start the Spark Standalone cluster manually (as described in Starting a Cluster Manually).
I doubt that spark-defaults.conf is used by the cluster at all. The file is there to configure the Spark applications that you spark-submit to a cluster (as described in Dynamically Loading Spark Properties):
bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace.
With that said, I think we can safely put Spark Standalone aside. It does not add much to the discussion (and does confuse a bit).
"Installing" JDBC Driver for Spark Application
In order to use a JDBC driver in your Spark application, you should spark-submit with --driver-class-path command-line option (or spark.driver.extraClassPath property as described in Runtime Environment):
spark.driver.extraClassPath Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
I strongly recommend using spark-submit --driver-class-path.
$ ./bin/spark-submit --help
...
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
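For instance, a submit along these lines would make the Oracle driver available without touching the application code (the main class and application jar below are placeholders; the driver path is the one from the question):
# --driver-class-path puts ojdbc8.jar on the driver's classpath;
# --jars ships it to the executors as well.
spark-submit \
  --master spark://myip:7077 \
  --class com.example.SparkTestApp \
  --driver-class-path C:/jdbcDrivers/ojdbc8.jar \
  --jars C:/jdbcDrivers/ojdbc8.jar \
  target/sparktestapp.jar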
You can read my notes on how to use a JDBC driver with PostgreSQL in Working with Datasets from JDBC Data Sources (and PostgreSQL).
PROTIP Use SPARK_PRINT_LAUNCH_COMMAND=1 to check out the command line of spark-submit.
All above applies to spark-shell too (as it uses spark-submit under the covers).
I have a 12-node cluster. Its hardware information is:
NameNode: CPU Core i3 2.7 GHz | 8 GB RAM | 500 GB HDD
DataNode: CPU Core i3 2.7 GHz | 2 GB RAM | 500 GB HDD
I have installed Hadoop 2.7.2 using the normal installation process on Ubuntu, and it works fine. But I want to add a client machine, and I have no clue how to add one.
Questions:
What is the installation process for the client machine?
How do I run Pig/Hive scripts on that client machine?
The client should have the same copy of the Hadoop distribution and configuration that is present on the Namenode. Only then will the client know on which node the JobTracker/ResourceManager is running, and the IP of the Namenode for accessing HDFS data.
Also, you need to update /etc/hosts on the client machine with the IP addresses and hostnames of the namenode and datanodes.
Note that you shouldn't start any Hadoop services on the client machine.
Steps to follow on client machine:
create a user account on the cluster, say user1
create an account on the client machine with the same name: user1
configure the client machine to access the cluster machines (SSH without passphrase, i.e., passwordless login)
copy/get the same Hadoop distribution as the cluster onto the client machine and extract it to /home/user1/hadoop-2.x.x
copy (or edit) the Hadoop configuration files (*-site.xml) from the Namenode of the cluster - from these the client will know where the Namenode/ResourceManager is running.
Set environment variables: JAVA_HOME, HADOOP_HOME (/home/user1/hadoop-2.x.x)
Set hadoop bin to your path: export PATH=$HADOOP_HOME/bin:$PATH
test it out: hadoop fs -ls /, which should list the root directory of the cluster HDFS.
you may face some issues, like privileges, or may need to set JAVA_HOME in places like conf/hadoop-env.sh on the client machine; update/comment on any error you get. A sketch consolidating the environment setup and the test above follows.
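Minimal sketch of the client-side environment (the JAVA_HOME path is an assumption; the Hadoop path uses 2.7.2 as in the question):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64    # assumed JDK location; adjust to yours
export HADOOP_HOME=/home/user1/hadoop-2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop         # holds the *-site.xml files copied from the cluster
export PATH=$HADOOP_HOME/bin:$PATH
hadoop fs -ls /                                        # should list the root of the cluster HDFS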
Answers to more questions from comments:
1. How to load data from the client node to HDFS? - Just run hadoop fs commands from the client machine: hadoop fs -put /home/user1/data/* /user/user1/data. You can also write shell scripts to run these command(s) if you need to run them many times; a sketch follows.
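A tiny wrapper script along those lines could look like this (the source and destination paths are the ones from the command above):
#!/bin/bash
# Copy local data into the user's HDFS directory; -f overwrites files that already exist
set -e
SRC=/home/user1/data
DEST=/user/user1/data
hadoop fs -mkdir -p "$DEST"
hadoop fs -put -f "$SRC"/* "$DEST"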
2. Why am I installing Hadoop on the client if we only use ssh to connect remotely to the master node? - Because the client needs to communicate with the cluster and needs to know where the cluster nodes are. The client will be running Hadoop jobs: hadoop fs commands, Hive queries, hadoop jar commands, Spark jobs, developing MapReduce jobs, etc., and for these the client needs the Hadoop binaries on the client node. Basically, you are not only using ssh to connect; you are performing operations on the Hadoop cluster from the client node, so you need the Hadoop binaries there. ssh is used by the Hadoop binaries on the client node when you run operations like hadoop fs -ls / from the client node against the cluster (remember adding $HADOOP_HOME/bin to PATH as part of the installation process above).
When you say "we only use ssh", that sounds to me like this: when you want to change or access Hadoop configuration files on the cluster, you connect to the cluster nodes using ssh as part of administrative work; but when you need to run Hadoop commands/jobs against the cluster from the client node, you don't need to ssh manually - the Hadoop installation on the client node takes care of it. Without a Hadoop installation, how could you run Hadoop commands/jobs/queries from the client node against the cluster?
3. Should the user name 'user1' be the same? What if it is different? - It will work. You can install Hadoop on the client node under a group user, say qa or dev, with all users on the client node having sudo under that group. Then, when user1 on the client node needs to run a Hadoop job on the cluster, user1 should be able to sudo -i -u qa and run the Hadoop command from there.
I wanted to understand how Hive knows which of the Hadoop namenodes is in the active state, and what happens when the active namenode fails.
Hive is configured via metatool to point to the configured dfs.nameservices for HA HDFS. See https://cwiki.apache.org/confluence/display/Hive/Hive+MetaTool. dfs.nameservices is a logical address while the actual namenodes are configured with dfs.ha.namenodes.[id].
As for which Namenode is active, that state is stored in ZooKeeper. When the active namenode fails, failover is triggered after a configured time (5-second default, ha.zookeeper.session-timeout.ms). A fencing script is required and triggers the standby namenode to become active.
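For example, you can ask HDFS directly which namenode is currently active (nn1/nn2 stand in for whatever ids you configured in dfs.ha.namenodes.<nameservice>):
hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
hdfs haadmin -getServiceState nn2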
In an HDFS HA environment, the namenode URL should be a logical name (e.g., hdfs://logicalnamenode). We need to configure Hive to work with HA. For that, you need to change Hive's namenode configuration with the metatool command.
List the current NN configuration
~# metatool -listFSRoot
hdfs://namenode:8020/user/hive/warehouse
The following command will update the old NN configuration with the logical name:
metatool -updateLocation hdfs://logicalnamenode hdfs://namenode:8020 -tablePropKey avro.schema.url
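You can then re-run the listing to confirm the warehouse root points at the logical nameservice (the expected output assumes the update above succeeded):
metatool -listFSRoot
# expected: hdfs://logicalnamenode/user/hive/warehouse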
I am trying to understand how Spark runs on a YARN cluster/client. I have the following questions in my mind.
Is it necessary that Spark is installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes in the cluster execute tasks and should be able to interpret the code (Spark APIs) in the Spark application sent to the cluster by the driver.
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?
Adding to other answers.
Is it necessary that Spark is installed on all the nodes in the YARN cluster?
No, not if the Spark job is scheduled on YARN (either client or cluster mode). Spark installation on all the nodes is needed only for standalone mode.
These are visualizations of the Spark app deployment modes.
Spark Standalone Cluster
In cluster mode the driver sits in one of the Spark worker nodes, whereas in client mode it sits in the machine that launched the job.
YARN cluster mode
YARN client mode
(Table comparing these deployment modes omitted; see pics source.)
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?
A Hadoop installation is not mandatory, but the configuration files (not all of them) are! We can call such nodes gateway nodes. This is for two main reasons; a short sketch follows them.
The configuration contained in the HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode, the ResourceManager's address is picked up from the Hadoop configuration (yarn-default.xml). Thus, the --master parameter is yarn.
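As a rough sketch, submitting from such a gateway node only needs the copied configs and --master yarn (the config path and the examples jar name are assumptions):
export HADOOP_CONF_DIR=/etc/hadoop/conf   # client-side configs copied from the cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 100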
Update (2017-01-04): Spark 2.0+ no longer requires a fat assembly jar for production deployment. source
We are running spark jobs on YARN (we use HDP 2.2).
We don't have spark installed on the cluster. We only added the Spark assembly jar to the HDFS.
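Staging the jars on HDFS is just a couple of hadoop fs commands, roughly like this (the local paths to the jars are assumptions; the HDFS path matches the example below):
hadoop fs -mkdir -p /spark
hadoop fs -put lib/spark-assembly-1.3.1-hadoop2.6.0.jar /spark/
hadoop fs -put lib/spark-examples-1.3.1-hadoop2.6.0.jar /spark/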
For example, to run the Pi example:
./bin/spark-submit \
--verbose \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 4 \
hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - this config tells YARN where to take the Spark assembly from. If you don't use it, the jar will be uploaded from the machine where you run spark-submit.
About your second question: the client node doesn't need Hadoop installed. It only needs the configuration files. You can copy the directory from your cluster to your client.
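A minimal way to do that copy (host name and destination directory are placeholders):
scp -r user@master:/etc/hadoop/conf ~/hadoop-conf
export HADOOP_CONF_DIR=~/hadoop-conf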
1 - Spark follows a master/slave architecture, so on your cluster you have to install a Spark master and N Spark slaves. You can run Spark in standalone mode, but using the YARN architecture will give you some benefits.
There is a very good explanation of it here : http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
2 - It is necessary if you want to use YARN or HDFS, for example, but as I said before, you can run it in standalone mode.
Let me try to cut the clutter and make it short for the impatient.
6 components: 1. client, 2. driver, 3. executors, 4. application master, 5. workers, and 6. resource manager; 2 deploy modes; and 2 resource (cluster) managers.
Here's the relation:
Client
Nothing special; it is the one that submits the Spark app.
Worker, executors
Nothing special; one worker holds one or more executors.
Master, & resource (cluster) manager
(no matter whether client or cluster mode)
in yarn, resource manager and master sit in two different nodes;
in standalone, resource manager == master, same process in the same node.
Driver
in client mode, sits with client
in yarn - cluster mode, sits with the (application) master (in this case, the client process exits after the app is submitted)
in standalone - cluster mode, sits with one worker
Voilà!