Sqoop failing when importing as Avro in AWS EMR

I'm trying to perform a Sqoop import in Amazon EMR (Hadoop 2.8.5, Sqoop 1.4.7). The import works fine when the Avro option (--as-avrodatafile) is not specified, but once it is set, the job fails with
19/10/29 21:31:35 INFO mapreduce.Job: Task Id : attempt_1572305702067_0017_m_000000_1, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
Using the option -D mapreduce.job.user.classpath.first=true doesn't help.
Running locally (on my machine) I found that copying the avro-1.8.1.jar shipped with Sqoop into the Hadoop lib folder works, but in the EMR cluster I only have access to the master node, so the same fix doesn't work there because the master node isn't the one that runs the tasks.
Has anyone faced this problem?

The solution I found was to connect to every node in the cluster (I thought I only had access to the master node, but I was wrong: in EMR we have access to all nodes) and replace the Avro jar included with Hadoop with the Avro jar that comes with Sqoop. It's not an elegant solution, but it works.
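For reference, the swap on each node looked roughly like the commands below (the jar locations and Avro versions are assumptions that vary by EMR release, so verify them on your own cluster):
# copy the newer Avro jar shipped with Sqoop into the Hadoop lib directory
sudo cp /usr/lib/sqoop/lib/avro-1.8.1.jar /usr/lib/hadoop/lib/
# move the older Avro jar bundled with Hadoop out of the way
sudo mv /usr/lib/hadoop/lib/avro-1.7.*.jar /tmp/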
[UPDATE]
It turned out that the option -D mapreduce.job.user.classpath.first=true wasn't working because I was using s3a as the target dir, while Amazon says we should use s3. As soon as I started using s3, Sqoop performed the import correctly, so there is no need to replace any files on the nodes. Using s3a can lead to strange errors under EMR due to Amazon's own configuration, so don't use it. Even in terms of performance, s3 is better than s3a on EMR, since the s3 implementation is Amazon's own.
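For reference, a minimal sketch of the kind of invocation that worked (the JDBC URL, credentials, table, and bucket names below are placeholders, not the real ones from my job):
sqoop import \
  -D mapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://mydb.example.com:3306/mydb \
  --username myuser -P \
  --table mytable \
  --target-dir s3://mybucket/sqoop/mytable \
  --as-avrodatafile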

Related

Is there a way to load the install-interpreter.sh file in EMR in order to load 3rd party interpreters?

I have an Apache Zeppelin notebook running and I'm trying to load the jdbc and/or postgres interpreter into my notebook in order to write to a Postgres DB from Zeppelin.
The main resource on loading new interpreters tells me to run the command below to get other interpreters:
./bin/install-interpreter.sh --all
However, when I run this command in the EMR terminal, I find that the EMR cluster does not come with an install-interpreter.sh executable file.
What is the recommended path?
1. Should I find the install-interpreter.sh file and upload it to the EMR cluster under ./bin/?
2. Is there an EMR configuration at launch time that would enable the install-interpreter.sh file?
Currently all tutorials and documentation assume that you can run the install-interpreter.sh file.
The solution is to not run the command below from the root directory (i.e. ./):
./bin/install-interpreter.sh --all
Instead, on EMR, run the command above from the Zeppelin installation directory, which on the EMR cluster is /usr/lib/zeppelin.
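For example, to install only the jdbc interpreter (sudo may be needed depending on how Zeppelin was installed; the --name flag installs a single interpreter instead of all of them):
cd /usr/lib/zeppelin
sudo ./bin/install-interpreter.sh --name jdbc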

Running an Oozie job with a modified Hadoop config file to copy from S3 to HDFS

Hello, I am trying to copy a file from my S3 bucket into HDFS using the cp command.
I do something like
hadoop --config config fs -cp s3a://path hadooppath
This works well when my config directory is on my local machine.
However, now I am trying to set it up as an Oozie job, and I am unable to pass the configuration files present in the config directory on my local system. Even if the config is in HDFS, it still doesn't seem to work. Any suggestions?
I tried the -D option with Hadoop and passed name/value pairs, but it still throws an error. It only works from my local system.
Did you try distcp in Oozie? Hadoop 2.7.2 supports S3 as a data source, and you can schedule the copy with coordinators. Just pass the credentials to the coordinator either via the REST API or in the properties file. It's an easy way to copy data periodically (in a scheduled manner).
${HADOOP_HOME}/bin/hadoop distcp s3://<source>/ hdfs://<destination>/
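If you want to test the copy outside Oozie first, the credentials can also be passed as -D properties on the command line; a sketch assuming the s3a connector, with placeholder bucket, paths, and keys:
hadoop distcp \
  -D fs.s3a.access.key=YOUR_ACCESS_KEY \
  -D fs.s3a.secret.key=YOUR_SECRET_KEY \
  s3a://mybucket/path/file.csv hdfs:///user/hadoop/path/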

Sqoop import is showing error - Jobtracker is not yet running

I am trying to do a Sqoop import, but it shows the error "Jobtracker is not yet running".
When I try sqoop eval and select a few rows, it works, but while doing the import I get the error. I have included snapshots of both the eval and import commands I tried.
Hadoop commands (hadoop fs -ls, -put) are working.
I started start-all.sh, and when I check with jps afterwards all the daemons are running, but after a few minutes all the daemons stop.
Sqoop eval just runs against the RDBMS and returns the result set; Hadoop does not come into the picture here.
Sqoop import pulls the data from the RDBMS and loads it into HDFS, and in your case it is not able to connect to HDFS. Restart Hadoop and check the JobTracker and NameNode logs. Check whether the NameNode and DataNode storage directories mentioned in hdfs-site.xml exist; otherwise point them to new directories.
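A quick way to see why the daemons die (the log file names below are the usual defaults and may differ in your setup):
# restart and check which daemons survive
stop-all.sh && start-all.sh
jps
# inspect the NameNode and DataNode logs for the reason they exit
tail -n 50 $HADOOP_HOME/logs/hadoop-*-namenode-*.log
tail -n 50 $HADOOP_HOME/logs/hadoop-*-datanode-*.log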

Import data between Hadoop clusters with different versions using the command line

Can you tell me the exact command to copy data between HDFS clusters running two different Hadoop versions, one on Hadoop 2.0.4-alpha and the other on 2.4.0? How can I use the distcp command in this case?
When the versions differ, use hftp instead of the plain hdfs scheme. You can see examples on the Cloudera website. Use hftp for the source cluster address and hdfs for the destination cluster address.
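A sketch of the command, run from the destination (newer) cluster; the hostnames are placeholders, 50070 is the usual NameNode HTTP port that hftp uses, and 8020 the usual HDFS RPC port:
hadoop distcp hftp://source-namenode:50070/user/data hdfs://dest-namenode:8020/user/data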

NoServerForRegionException while running Hadoop MapReduce job on HBase

I am executing a simple Hadoop MapReduce program with HBase as both input and output.
I am getting the error:
java.lang.RuntimeException: org.apache.hadoop.hbase.client.NoServerForRegionException: Unable to find region for OutPut,,99999999999999 after 10 tries.
This exception appeared for us when there was a difference in HBase versions.
Our code was built and run with the 0.94.x HBase jars, whereas the HBase server was running 0.90.3.
When we changed our pom file to the right version (0.90.3) of the HBase jars, it started working fine.
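Two quick checks that can reveal this kind of mismatch (assuming a Maven build):
# version the HBase server is actually running
hbase version
# version of the HBase jars the job bundles
mvn dependency:tree | grep hbase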
Run bin/hbase hbck and find out on which machines the region servers are running.
Make sure that all your region servers are up and running.
Use bin/hbase-daemon.sh start regionserver to start a region server.
Even if the region server on a machine is started, it may still fail because of time synchronization.
Make sure you have NTP installed on all region server nodes and on the HBase master node.
Since HBase works with key-value pairs and uses the timestamp as an index, it only tolerates a limited clock skew between nodes.
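For example (assuming NTP is set up with ntpd on the nodes):
# report region assignments and any inconsistencies
hbase hbck
# check that the local clock is synchronized
ntpstat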
Deleting the WAL logs (or moving them to /tmp) helped in our case:
hdfs dfs -mv /apps/hbase/data/MasterProcWALs/state-*.log /tmp
