Can I run Spark commands in Python on my local machine against Hadoop? - hadoop

I want to run the code below on my local machine.
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
def quiet_logs(sc):
    # Raise the log4j level so only errors from Spark internals are printed.
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
# Spark DataFrame jobs
spark = SparkSession.builder.getOrCreate()
I do not have Spark installed on my machine.
Does this make sense?
My goal is to load data into Hadoop from my local machine.
Thanks in advance.

If you want to load data into Hadoop from your local machine, one common approach is:
-> Send the data from your local machine to a Hadoop edge node. Use SFTP for this.
-> Move the data from the edge node's local disk into HDFS using
hdfs dfs -put (or hdfs dfs -copyFromLocal)
-> Run your Spark job against the data in HDFS, then load the result into a Hive table or wherever your use case requires.
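The first two steps above can be sketched as a small Python helper that only builds the commands, so they can be inspected before anything runs. This is a sketch under assumptions: the host name, user and paths are hypothetical placeholders, and scp stands in for an interactive SFTP session.

```python
import shlex

def build_upload_commands(local_file, edge_host, edge_dir, hdfs_dir):
    """Return the commands for the edge-node route, in order.

    All arguments are illustrative placeholders, not values from a real cluster.
    """
    file_name = local_file.rsplit("/", 1)[-1]
    staged = f"{edge_dir.rstrip('/')}/{file_name}"
    # Step 1: copy the file from the local machine to the edge node.
    copy_to_edge = ["scp", local_file, f"{edge_host}:{staged}"]
    # Step 2: on the edge node, push the staged file from local disk into HDFS.
    remote_put = " ".join(
        shlex.quote(part) for part in ["hdfs", "dfs", "-put", staged, hdfs_dir]
    )
    put_into_hdfs = ["ssh", edge_host, remote_put]
    return [copy_to_edge, put_into_hdfs]

if __name__ == "__main__":
    commands = build_upload_commands(
        "/tmp/sales.csv",              # hypothetical local file
        "user@edge-node.example.com",  # hypothetical edge node
        "/home/user/staging",          # hypothetical staging directory
        "/data/raw",                   # hypothetical HDFS directory
    )
    for command in commands:
        print(shlex.join(command))
```

Once the printed commands look right, each list can be passed to subprocess.run. None of this needs Spark or Hadoop installed on the local machine, only SSH access to the edge node.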

Related

Sqoop failing when importing as avro in AWS EMR

I'm trying to perform a Sqoop import on Amazon EMR (Hadoop 2.8.5, Sqoop 1.4.7). The import works fine when the Avro option (--as-avrodatafile) is not specified, but once it is set, the job fails with
19/10/29 21:31:35 INFO mapreduce.Job: Task Id : attempt_1572305702067_0017_m_000000_1, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
Using the option -D mapreduce.job.user.classpath.first=true doesn't work.
Running locally (on my machine), I found that copying the avro-1.8.1.jar from Sqoop into the Hadoop lib folder works. On the EMR cluster, however, I only have access to the master node, so that approach doesn't help there, because the master node is not the one that runs the jobs.
Has anyone faced this problem?
The solution I found was to connect to every node in the cluster (I thought I only had access to the master node, but I was wrong; in EMR we have access to all nodes) and replace the Avro jar bundled with Hadoop with the Avro jar that ships with Sqoop. It's not an elegant solution, but it works.
[UPDATE]
It turned out that the option -D mapreduce.job.user.classpath.first=true wasn't working because I was using s3a for the target dir, whereas Amazon says we should use s3. As soon as I switched to s3, Sqoop could perform the import correctly, so there is no need to replace any file on the nodes. Using s3a can lead to strange errors under EMR because of Amazon's own configuration, so don't use it. Even in terms of performance, s3 is better than s3a on EMR, since the s3 implementation is Amazon's.
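As an illustration of the fix described in the update, here is a minimal Python sketch that assembles the Sqoop command line with the classpath option and rewrites an s3a:// target dir to s3://. The JDBC URL, table and bucket names are hypothetical, and the helper names are not part of Sqoop.

```python
def emr_target_dir(uri):
    """Rewrite an s3a:// URI to s3://; leave any other URI untouched."""
    prefix = "s3a://"
    if uri.startswith(prefix):
        return "s3://" + uri[len(prefix):]
    return uri

def build_sqoop_import(connect, table, target_dir):
    """Build the argument list for an Avro import (connection values are placeholders)."""
    return [
        "sqoop", "import",
        # Generic Hadoop options such as -D must come right after the tool name.
        "-D", "mapreduce.job.user.classpath.first=true",
        "--connect", connect,
        "--table", table,
        "--target-dir", emr_target_dir(target_dir),
        "--as-avrodatafile",
    ]

if __name__ == "__main__":
    argv = build_sqoop_import(
        "jdbc:mysql://db.example.com/shop",  # hypothetical source database
        "orders",                            # hypothetical table
        "s3a://my-bucket/sqoop/orders",      # rewritten to s3:// before use
    )
    print(" ".join(argv))
```

Normalising the scheme in one place means an s3a:// path copied from another job cannot slip into an EMR import unnoticed.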

Do we need to run hiveserver2 on our client machine to access hive metastore?

I am using Spark's Java API to access the Hive metastore. Only Spark is installed on my machine and nothing else; I have no Hadoop directory or Hive folder. I have created hive-site.xml, hdfs-site.xml, core-site.xml and yarn-site.xml inside the spark/conf directory. My Hive metastore is set up on another machine, which is part of the Hadoop cluster and is the namenode. I can access the Hive metastore from spark/bin/beeline and spark/bin/spark-shell on my desktop, but when I try to access it from the Java API, a metastore_db folder and a derby.log file are created in my project, which means I am not reaching the Hive metastore.
SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL basic example")
    .enableHiveSupport()
    .config("spark.sql.warehouse.dir", "hdfs://bigdata-namenode:9000/user/hive/warehouse")
    .config("mapred.input.dir.recursive", true)
    .config("hive.mapred.supports.subdirectories", true)
    .config("spark.sql.hive.thriftServer.singleSession", true)
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .master("local")
    .getOrCreate();
spark.sql("show databases").show();
When I start the Thrift server on my desktop (i.e. the client machine), I get this log, thriftserver.log,
which says spark.sql.warehouse.dir is set to a path on my local file system, i.e. not HDFS, where the actual warehouse is located.
/spark/conf/core-site.xml
/spark/conf/hive-site.xml

How to connect Sqoop to multiple hadoop clusters

Is there any way to have Sqoop connect to different Hadoop clusters, so that multiple Sqoop jobs can be created to export data to multiple Hadoop clusters?
"to export data to multiple hadoop clusters"
If data is going into Hadoop, that's technically a Sqoop import.
It's not clear how you currently manage different clusters from one machine, but you would need the conf folder of every environment to be available for Sqoop to read.
The sqoop command-line program is a wrapper which runs the bin/hadoop script shipped with Hadoop. If you have multiple installations of Hadoop present on your machine, you can select the Hadoop installation by setting the $HADOOP_HOME environment variable.
For example:
$ HADOOP_HOME=/path/to/some/hadoop sqoop import --arguments...
or:
$ export HADOOP_HOME=/some/path/to/hadoop
$ sqoop import --arguments...
If $HADOOP_HOME is not set, Sqoop will use the default installation location for Cloudera’s Distribution for Hadoop, /usr/lib/hadoop.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_controlling_the_hadoop_installation
This also depends on how you set up Hadoop: Hortonworks only has Sqoop 1, while Cloudera (and maybe MapR) have Sqoop2, and the instructions there are probably different, since the Sqoop2 architecture is different.
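For Sqoop 1, the environment-variable mechanism quoted above can be driven from a script. The following Python sketch builds a per-cluster environment for a subprocess; the install and conf paths are hypothetical, and sqoop_env is an illustrative helper, not a Sqoop API.

```python
import os

def sqoop_env(hadoop_home, hadoop_conf_dir=None, base_env=None):
    """Return a copy of the environment that points Sqoop at one Hadoop install."""
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_HOME"] = hadoop_home
    if hadoop_conf_dir is not None:
        env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    else:
        # Drop any inherited value so the configuration is read from
        # $HADOOP_HOME/conf/, as the Sqoop user guide describes.
        env.pop("HADOOP_CONF_DIR", None)
    return env

if __name__ == "__main__":
    # Hypothetical cluster layouts; replace with your own paths.
    clusters = {
        "cluster-a": ("/opt/hadoop-a", None),
        "cluster-b": ("/opt/hadoop-b", "/etc/hadoop-b/conf"),
    }
    for name, (home, conf_dir) in clusters.items():
        env = sqoop_env(home, conf_dir)
        print(name, env["HADOOP_HOME"], env.get("HADOOP_CONF_DIR"))
```

The returned dictionary can be passed as the env argument of subprocess.run when launching sqoop, so each job sees only its own cluster's settings.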

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a Spark program in cluster mode on Amazon EC2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this text file in cluster mode, even though I can read it in standalone mode. In cluster mode, it tries to read from HDFS, so I put the file in HDFS at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /wordcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong? Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you just copied the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, all on the node's disk. I believe you have misunderstood what the hadoop commands do.
Instead, you should try putting the file explicitly into the persistent HDFS (/root/persistent-hdfs/), and then loading it using the hdfs://... prefix.
Is the persistent HDFS server up?
Please take a look here; it seems Spark only starts the ephemeral HDFS server by default. To switch to the persistent HDFS server, you must do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
Please try these things, I hope they can serve you well.
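To make the hdfs://... prefix concrete, here is a small Python sketch that builds a fully qualified HDFS URI. The namenode host is hypothetical, and port 9000 is an assumption; use whatever fs.defaultFS is set to in your cluster's core-site.xml.

```python
def hdfs_uri(namenode_host, path, port=9000):
    """Build a fully qualified HDFS URI (host and port are assumptions)."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    return f"hdfs://{namenode_host}:{port}{path}"

if __name__ == "__main__":
    uri = hdfs_uri("master.example.com", "/wordcount/input/input.txt")
    print(uri)
    # The resulting string is what would be handed to Spark, e.g.
    # sc.textFile(uri), instead of a bare path such as "/user/input/CHANGES.txt".
```

With an explicit scheme and host, the path no longer depends on which file system Spark treats as the default in a given deploy mode.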

Export data into Hive from a node without Hadoop(HDFS) installed

Is it possible to export data to a Hive server from a node that has neither Hadoop (HDFS) nor Sqoop installed?
I would read the data from a source, which could be MySQL or just files in some directory, and then use the Hadoop core classes or something like Sqoop to export the data into my Hadoop cluster.
I am programming in Java.
Since your final destination is a Hive table, I would suggest the following:
Create the final Hive table.
Use the following command to load data from the other node:
LOAD DATA LOCAL INPATH '<full local path>/kv1.txt' OVERWRITE INTO TABLE table_name;
Refer to this.
Using Java, you could use the JSch library to invoke these shell commands.
Hope this helps.
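The LOAD DATA statement above can be generated rather than typed by hand. This is a minimal Python sketch, assuming a simple table-name format and rejecting paths that contain quotes; the path and table name in the example are placeholders, and the same logic ports directly to Java.

```python
import re

# Accepts "table" or "db.table" made of letters, digits and underscores
# (a deliberately strict assumption for this sketch).
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")

def load_data_statement(local_path, table_name, overwrite=True):
    """Build a HiveQL LOAD DATA LOCAL INPATH statement."""
    if not _TABLE_NAME.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    if "'" in local_path:
        raise ValueError("path must not contain single quotes")
    overwrite_clause = "OVERWRITE " if overwrite else ""
    return (
        f"LOAD DATA LOCAL INPATH '{local_path}' "
        f"{overwrite_clause}INTO TABLE {table_name};"
    )

if __name__ == "__main__":
    # Placeholder path and table name.
    print(load_data_statement("/data/exports/kv1.txt", "table_name"))
```

The generated string can then be sent to Hive through whichever client is available on the remote side, for example over an SSH session as suggested above.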
