Spark History Server setup - Windows

I am trying to set up the Spark History Server configuration locally. I am using Windows and PyCharm for PySpark programming. I am able to view the Spark Web UI at localhost:4040.
The things I have done are:
spark-defaults.conf (where I have added the last three lines):
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///D:///tmp///spark-events
Run the history server:
C:\Users\hp\spark>bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/08/09 08:58:04 INFO HistoryServer: Started daemon with process name: 13476@DESKTOP-B9KRC6O
20/08/09 08:58:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/09 08:58:23 INFO SecurityManager: Changing view acls to: hp
20/08/09 08:58:23 INFO SecurityManager: Changing modify acls to: hp
20/08/09 08:58:23 INFO SecurityManager: Changing view acls groups to:
20/08/09 08:58:23 INFO SecurityManager: Changing modify acls groups to:
20/08/09 08:58:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hp); groups with view permissions: Set(); users with modify permissions: Set(hp); groups with modify permissions: Set()
20/08/09 08:58:24 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
20/08/09 08:58:26 INFO Utils: Successfully started service on port 18080.
20/08/09 08:58:26 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://DESKTOP-B9KRC6O:18080
After I run my PySpark program successfully, I am unable to see the job details in the Spark History Server web UI, even though the server is started.
The references I have already used:
Windows: Apache Spark History Server Config
How to run Spark History Server on Windows
The code I use is as follows:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("madhu").setMaster("local")
sc = SparkContext(conf=conf)
spark = SparkSession(sc).builder.getOrCreate()

def readtable(dbname, table):
    # Read a MySQL table over JDBC into a DataFrame.
    hostname = "localhost"
    jdbcPort = 3306
    username = "root"
    password = "madhu"
    jdbc_url = "jdbc:mysql://{0}:{1}/{2}?user={3}&password={4}".format(
        hostname, jdbcPort, dbname, username, password)
    dataframe = spark.read.format("jdbc").options(
        driver="com.mysql.jdbc.Driver", url=jdbc_url, dbtable=table).load()
    return dataframe

t1 = readtable("db", "table1")
t2 = readtable("db2", "table2")
t2.show()

spark.stop()
Please help me with how I can achieve this. I will provide any data that is required.
I have also tried the directory path as:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///D:/tmp/spark-events

You must provide the correct master URL in the application and run the application with spark-submit.
You can find the master URL in the Spark UI at localhost:4040.
In the following example, the master URL is spark://XXXX:7077.
Your application should then use:
conf = SparkConf().setAppName("madhu").setMaster("spark://XXXX:7077")
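As a minimal sketch (assuming a standalone master at spark://XXXX:7077 and that file:///D:/tmp/spark-events is the directory your history server reads), the configuration could also set spark.eventLog.dir, since the application writes its event log there while spark.history.fs.logDirectory only tells the history server where to read:

from pyspark import SparkConf, SparkContext

# Hypothetical values: replace spark://XXXX:7077 with the master URL shown in the Spark UI,
# and keep the event log directory in sync with spark.history.fs.logDirectory.
conf = (SparkConf()
        .setAppName("madhu")
        .setMaster("spark://XXXX:7077")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "file:///D:/tmp/spark-events"))
sc = SparkContext(conf=conf)

After submitting the application with spark-submit and letting it finish, the completed run should appear in the history server UI at localhost:18080 once its event log is written to that directory.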

Related

How to set Hadoop class parameters in Hive, similar to Pig as shown here?

I want Hive to automatically acquire a Kerberos ticket whenever hive (more specifically the hive shell, not HiveServer) is executed, and also renew it automatically if a job runs longer than the ticket timeout.
I found similar functionality in Pig (see this). I tested it and it works: it acquires a ticket automatically from the keytab, so I don't have to acquire one manually using kinit before starting a job. It also renews tickets whenever needed, as mentioned in the docs.
While researching, I came across user-name-handling-in-hadoop and found a similar log statement dumping the configuration parameters of the UserGroupInformation class when starting hive.
Since I want this to happen every time hive is executed, I tried putting the properties in HADOOP_OPTS, which looks like this:
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.conf=/etc/krb5.conf -Dhadoop.security.krb5.principal=root@MSI.COM -Dhadoop.security.krb5.keytab=/etc/security/keytab/user.service.keytab"
But whenever I execute hive, it dumps the following parameters, which means it is not considering the principal and keytab; the property names may be wrong, since I used the names I found in Pig. The krb5.conf property is taken into consideration, because changing the name of the conf file produces a "default realm can't be found" error, as it can no longer read the correct conf file.
23/01/23 23:33:28 DEBUG security.UserGroupInformation: hadoop login commit
23/01/23 23:33:28 DEBUG security.UserGroupInformation: using kerberos user:null
23/01/23 23:33:28 DEBUG security.UserGroupInformation: using local user:UnixPrincipal: root
23/01/23 23:33:28 DEBUG security.UserGroupInformation: Using user: "UnixPrincipal: root" with name root
23/01/23 23:33:28 DEBUG security.UserGroupInformation: User entry: "root"
23/01/23 23:33:28 DEBUG security.UserGroupInformation: Assuming keytab is managed externally since logged in from subject.
23/01/23 23:33:28 DEBUG security.UserGroupInformation: UGI loginUser:root (auth:KERBEROS)
Thanks in advance for any guidance.
Ultimately, I want the hive shell or hive CLI, whenever it is invoked, to automatically request a Kerberos ticket and renew it if needed.

Apache NiFi 1.7.1 restart issue on Windows machine

I am using Apache NiFi on an Apache Hadoop cluster, and when I try to start it using the run-start.sh file I get the error below.
environment: WINDOWS
2018-08-23 20:00:31,856 WARN [main] org.apache.nifi.bootstrap.Command Failed to set permissions so that only the owner can read pid file D:\NIFI\nifi-1.7.1-bin\nifi-1.7.1\bin\..\run\nifi.pid; this may allows others to have access to the key needed to communicate with NiFi. Permissions should be changed so that only the owner can read this file
2018-08-23 20:00:31,866 WARN [main] org.apache.nifi.bootstrap.Command Failed to set permissions so that only the owner can read status file D:\NIFI\nifi-1.7.1-bin\nifi-1.7.1\bin\..\run\nifi.status; this may allows others to have access to the key needed to communicate with NiFi. Permissions should be changed so that only the owner can read this file
2018-08-23 20:00:31,879 INFO [main] org.apache.nifi.bootstrap.Command Launched Apache NiFi with Process ID 9888
I am then unable to connect from the front end. This happens when I shut down and try to reconnect again; I generally stop the server using Ctrl+C on Windows.
Kindly help.

Hadoop 3: how to configure / enable erasure coding?

I'm trying to set up a Hadoop 3 cluster.
Two questions about the erasure coding feature:
How can I ensure that erasure coding is enabled?
Do I still need to set the replication factor to 3?
Please indicate the relevant configuration properties related to erasure coding/replication, in order to get the same data security as Hadoop 2 (replication factor 3) but with the disk space benefits of Hadoop 3 erasure coding (only 50% overhead instead of 200%).
In Hadoop 3 you can enable an erasure coding policy on any folder in HDFS. By default, erasure coding is not enabled in Hadoop 3; you enable it with the setPolicy command, specifying the desired folder path.
1: To verify that erasure coding is enabled, you can run the getPolicy command.
2: In Hadoop 3, the replication factor setting affects only folders that are not configured with an erasure coding setPolicy. You can use both erasure coding and replication factor settings in a single cluster.
Command to List the supported erasure policies:
./bin/hdfs ec -listPolicies
Command to Enable XOR-2-1-1024k Erasure policy:
./bin/hdfs ec -enablePolicy -policy XOR-2-1-1024k
Command to Set Erasure policy to HDFS directory:
./bin/hdfs ec -setPolicy -path /tmp -policy XOR-2-1-1024k
Command to Get the policy set to the given directory:
./bin/hdfs ec -getPolicy -path /tmp
Command to remove (i.e. unset) the policy from a directory:
./bin/hdfs ec -unsetPolicy -path /tmp
Command to Disable policy:
./bin/hdfs ec -disablePolicy -policy XOR-2-1-1024k
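If you want to script these steps (for example, to verify policies from a client node), here is a minimal Python sketch that wraps the same hdfs ec subcommands with subprocess; it assumes the hdfs binary is on the PATH and reuses the XOR-2-1-1024k policy from the examples above:

import subprocess

def hdfs_ec(*args):
    # Run an `hdfs ec` subcommand and return its output (assumes `hdfs` is on the PATH).
    result = subprocess.run(["hdfs", "ec"] + list(args), capture_output=True, text=True, check=True)
    return result.stdout

# List the supported policies, enable XOR-2-1-1024k, apply it to /tmp, and read it back.
print(hdfs_ec("-listPolicies"))
hdfs_ec("-enablePolicy", "-policy", "XOR-2-1-1024k")
hdfs_ec("-setPolicy", "-path", "/tmp", "-policy", "XOR-2-1-1024k")
print(hdfs_ec("-getPolicy", "-path", "/tmp"))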

Getting an error while running Sqoop in an EC2 instance

I installed Sqoop on my EC2 instance following the reference at http://kontext.tech/docs/DataAndBusinessIntelligence/p/configure-sqoop-in-a-edge-node-of-hadoop-cluster. My Hadoop cluster is also working well.
I got the error Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster and solved it with the solution given in that link. But unfortunately I get another error while running a Sqoop import:
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild.
Please suggest how I can overcome this error.
This is how my sqoop-env.template.sh looks:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# included in all the hadoop scripts with source command
# should not be executable directly
# also should not be passed any arguments, since we need original $*
# Set Hadoop-specific environment variables here.
#Set path to where bin/hadoop is available
#export HADOOP_COMMON_HOME=$HOME/hadoop-3.1.0
#Set path to where hadoop-*-core.jar is available
#export HADOOP_MAPRED_HOME=$HOME/hadoop-3.1.0
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
#export HIVE_HOME=
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=`

StreamSets MapR FS origin/destination KerberosPrincipal exception (using Hadoop impersonation in MapR 6.0)

I am trying to do a simple data move from a MapR FS origin to a MapR FS destination (this is not my real use case; I am just doing this simple movement for testing purposes). When trying to validate this pipeline, the error message I see in the staging area is:
HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal
Trying different variations of the Hadoop FS URI field (e.g. mfs:///mapr/mycluster.cluster.local, maprfs:///mycluster.cluster.local) does not seem to help. Looking at the logs after trying to validate, I see:
2018-01-04 10:28:56,686 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Created source of type: com.streamsets.pipeline.stage.origin.maprfs.ClusterMapRFSSource@16978460 DClusterSourceOffsetCommitter *admin preview-pool-1-thread-3
2018-01-04 10:28:56,697 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Error connecting to FileSystem: java.io.IOException: Provided Subject must contain a KerberosPrincipal ClusterHdfsSource *admin preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Authentication Config: ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 ERROR Issues: Issue[instance='MapRFS_01' service='null' group='HADOOP_FS' config='null' message='HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal''] ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,169 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Validation Error: Failed to configure or connect to the 'maprfs:///mapr/mycluster.cluster.local' Hadoop file system: java.io.IOException: Provided Subject must contain a KerberosPrincipal HdfsTargetConfigBean *admin 0 preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
However, to my knowledge, the system is not running Kerberos, so this error message is a bit confusing to me. Uncommenting #export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}" in the sdc environment variable file for native MapR authentication did not seem to help (even when reinstalling and commenting this line before running the StreamSets MapR setup script).
Does anyone have any idea what is happening and how to fix it? Thanks.
This answer was provided on the MapR community forums and worked for me (using MapR v6.0). Note that the instructions here differ from those currently provided by the StreamSets documentation. Throughout these instructions, I was logged in as user root.
After installing streamsets (and the mapr prerequisites) as per the documentation...
Change the owner of the StreamSets $SDC_DIST or $SDC_HOME location to the mapr user (or whatever other user you plan to use for Hadoop impersonation): $chown -R mapr:mapr $SDC_DIST (for me this was the /opt/streamsets-datacollector dir.). Do the same for $SDC_CONF (/etc/sdc for me) as well as /var/lib/sdc and /var/log/sdc.
In $SDC_DIST/libexec/sdcd-env.sh, set the user and group name (near the top of the file) to mapr user "mapr" and enable mapr password login. The file should end up looking like:
# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr
# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr
....
# Indicate that MapR Username/Password security is enabled
export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}"
Edit the file /usr/lib/systemd/system/sdc.service to look like:
[Service]
User=mapr
Group=mapr
$cd into /etc/systemd/system/ and create a directory called sdc.service.d. Within that directory, create a file (with any name) and add the contents (without spaces):
Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
If you are using MapR's SASL ticket auth system (or something similar), generate a ticket for this user on the node that is running StreamSets. In this case, use the $maprlogin password command.
Then finally, restart the sdc service: $systemctl daemon-reload, then $systemctl restart sdc.
Run something like $ps -aux | grep sdc | grep maprlogin to check that the sdc process is owned by mapr and that the -Dmaprlogin.password.enabled=true parameter has been successfully set. Once this is done, you should be able to validate/run MapR FS to MapR FS operations in the StreamSets pipeline builder in batch processing mode.
** NOTE: If using the Hadoop Configuration Directory param. instead of the Hadoop FS URI, remember to have the files from your $HADOOP_HOME/conf directory (e.g. hadoop-site.xml, yarn-site.xml, etc.) (in the case of MapR, something like /opt/mapr/hadoop/hadoop-<version>/etc/hadoop/) either soft-linked or hard-copied to a directory $SDC_DIST/resources/<some hadoop config dir. you may need to create> (I just copy everything in the directory) and add this path to the Hadoop Configuration Directory param. for your MapR FS (or Hadoop FS) stage. In the sdc web UI Hadoop Configuration Directory box, it would look like Hadoop Configuration Directory: <the directory within $SDC_DIST/resources/ that holds the hadoop files>.
** NOTE: If you are still logging errors of the form
2018-01-16 14:26:10,883 ingest2sa_demodata_batch/ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41 ERROR Error in Slave Runner: ClusterRunner *admin runner-pool-2-thread-29
com.streamsets.datacollector.runner.PipelineRuntimeException: CONTAINER_0800 - Pipeline 'ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41' validation error : HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal'
you may also need to add -Dmaprlogin.password.enabled=true to the pipeline's /cluster/Worker Java Options tab for the origin and destination hadoop FS stages.
** The video linked in the MapR community post also says to generate a mapr ticket for the sdc user (the default user the sdc process runs as when running as a service), but I did not do this and the solution still worked for me (so if anyone has any idea why it should be done regardless, please let me know in the comments).
