-- --schema doesn't work with Sqoop create-hive-table - hortonworks-data-platform

HDP-2.5.0.0 using Ambari 2.4.0.1
I'm able to create a table in HCatalog from a SQL Server source db, for example:
sqoop import --null-string '\\N' --null-non-string '\\N' --hive-delims-replacement '\0D' --hcatalog-home /usr/hdp/current/hive-webhcat --hcatalog-database MS_Management_Coaching --hcatalog-table TripAggregate --create-hcatalog-table --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' --validate --connect 'jdbc:sqlserver://<DB server>;database=Management' --username uname --password pwd --table TripAggregate -- --schema Coaching
But when I try to use the create-hive-table tool, the -- --schema option doesn't work, no matter where I position it:
-bash-4.2$ sqoop create-hive-table --hive-database test --connect 'jdbc:sqlserver://<DB Server>;database=Management' --username uname --password pwd --table TripAggregate -- --schema Coaching
Warning: /usr/hdp/2.5.0.0-1245/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/10/12 21:28:13 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.5.0.0-1245
16/10/12 21:28:13 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/10/12 21:28:13 ERROR tool.BaseSqoopTool: Error parsing arguments for create-hive-table:
16/10/12 21:28:13 ERROR tool.BaseSqoopTool: Unrecognized argument: --
16/10/12 21:28:13 ERROR tool.BaseSqoopTool: Unrecognized argument: --schema
16/10/12 21:28:13 ERROR tool.BaseSqoopTool: Unrecognized argument: DriverCoaching
Try --help for usage instructions.

If the argument -- is given on the command-line, then subsequent arguments are sent directly to the underlying tool.
After looking into the Sqoop source code, I found that the create-hive-table flow does not pass arguments on to the underlying tool. That's why you are not able to use -- --schema in your command.
Relevant part of the source code for ImportTool:
public void validateOptions(SqoopOptions options)
    throws InvalidOptionsException {
  // If extraArguments is full, check for '--' followed by args for
  // mysqldump or other commands we rely on.
  options.setExtraArgs(getSubcommandArgs(extraArguments));
  int dashPos = getDashPosition(extraArguments);
  if (hasUnrecognizedArgs(extraArguments, 0, dashPos)) {
    throw new InvalidOptionsException(HELP_STR);
  }
  validateImportOptions(options);
  validateIncrementalOptions(options);
  validateCommonOptions(options);
  validateCodeGenOptions(options);
  validateOutputFormatOptions(options);
  validateHBaseOptions(options);
  validateHiveOptions(options);
  validateHCatalogOptions(options);
  validateAccumuloOptions(options);
}
Relevant part of the source code for CreateHiveTable:
public void validateOptions(SqoopOptions options)
    throws InvalidOptionsException {
  if (hasUnrecognizedArgs(extraArguments)) {
    throw new InvalidOptionsException(HELP_STR);
  }
  validateCommonOptions(options);
  validateOutputFormatOptions(options);
  validateHiveOptions(options);
  if (options.getTableName() == null) {
    throw new InvalidOptionsException(
        "--table is required for table definition importing." + HELP_STR);
  }
}
As you can see, the latter does no checking for -- args: every leftover argument, including the -- itself, is treated as unrecognized.
Edit:
--hive-import creates the Hive table by default, and you can use -- --schema with the import command. If you want Sqoop to create the Hive table for you and import the data into that table, this should work for you.
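To make the difference between the two validateOptions methods concrete, here is a minimal Python sketch of the two checks. It is a simplified model written for illustration, not Sqoop's code: the function names are hypothetical stand-ins for getDashPosition, getSubcommandArgs and hasUnrecognizedArgs, and the sketch returns the offending arguments instead of logging them.

```python
from typing import List, Sequence


def dash_position(extra_args: Sequence[str]) -> int:
    """Index of the first bare '--', or len(extra_args) if there is none."""
    for i, arg in enumerate(extra_args):
        if arg == "--":
            return i
    return len(extra_args)


def subcommand_args(extra_args: Sequence[str]) -> List[str]:
    """Everything after the first '--' (what would go to the underlying tool)."""
    return list(extra_args[dash_position(extra_args) + 1:])


def import_style_unrecognized(extra_args: Sequence[str]) -> List[str]:
    """Import-style check: only arguments before '--' count as unrecognized."""
    return list(extra_args[:dash_position(extra_args)])


def create_hive_table_style_unrecognized(extra_args: Sequence[str]) -> List[str]:
    """create-hive-table-style check: every leftover argument is unrecognized."""
    return list(extra_args)


# Arguments left over after the known options have been parsed.
leftover = ["--", "--schema", "Coaching"]

print(import_style_unrecognized(leftover))             # []
print(subcommand_args(leftover))                       # ['--schema', 'Coaching']
print(create_hive_table_style_unrecognized(leftover))  # ['--', '--schema', 'Coaching']
```

In this model the import-style check finds nothing to complain about and hands --schema Coaching on, while the create-hive-table-style check reports all three tokens, which matches the three "Unrecognized argument" lines in the log above.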

Related

PySpark with io.github.spark-redshift-community: BasicAWSCredentialsProvider not found

I'm trying to load data from my Redshift database using PySpark.
I'm using "io.github.spark-redshift-community" as the connector. It requires a "tempdir" parameter that points to S3. My code looks like the following:
import findspark
findspark.add_packages("io.github.spark-redshift-community:spark-redshift_2.12:5.0.3")
findspark.add_packages("com.amazonaws:aws-java-sdk-bundle:1.12.262")
findspark.add_packages("org.apache.hadoop:hadoop-aws:3.3.4")
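findspark.init()
from pyspark.sql import SparkSession  # imported after findspark.init() so that pyspark is on sys.path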
spark = SparkSession.builder.master("local[8]").appName("Dim_Customer").getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", S3_ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", S3_SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3a.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
df_read_1 = spark.read \
.format("io.github.spark_redshift_community.spark.redshift") \
.option("url", "jdbc:redshift://IP/DATABASE?user=USER&password=PASS") \
.option("dbtable", "table") \
.option("tempdir", "s3a://url/")\
.option("forward_spark_s3_credentials", "true") \
.load()
But I'm getting an error: Class org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found
I've found some sources saying to change BasicAWSCredentialsProvider to SimpleAWSCredentialsProvider, but then I get another error: NoSuchMethodError.
Could someone help me, please?
Is there a problem with the hadoop and aws-java-sdk versions?
Thank you in advance!

Problem connecting remote PrestoDB from local machine using pyspark

I'm trying to connect to a Presto DB installed on a remote server from my local Mac using pyspark; my code is below. I have downloaded the Presto driver and placed it under /user/name//Hadoop/spark-2.3.1-bin-hadoop2.7/jars (I guess this is where I'm making a mistake, but I'm not sure).
from pyspark.sql import SparkSession, HiveContext
from pyhive import presto, hive
def main():
    spark = SparkSession.builder\
        .appName("tests")\
        .enableHiveSupport()\
        .getOrCreate()
    df_presto = spark.read.format("jdbc") \
        .option("driver", "io.prestosql.jdbc.PrestoDriver")\
        .option("url", "jdbc:presto://host.com:443/hive") \
        .option("user", "user_name")\
        .option("password", "password") \
        .option("dbtable", "(select column from table_name limit 10) tmp") \
        .load()
Presto driver: presto-jdbc-340.jar
When I try to execute the code, I get the error below:
Traceback (most recent call last):
File "/Users/user_name/Hadoop/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/Users/user_name/Hadoop/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.IllegalArgumentException: java.net.UnknownHostException: ip-10-120-99-149.ec2.internal;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
Any idea how I can fix this?

Passing external yml file in my spark-job/code not working throwing "Can't construct a java object for tag:yaml.org,2002"

I am using Spark 2.4.1 and Java 8. I am trying to load an external property file while submitting my Spark job with spark-submit.
I am using the Typesafe Config dependency below to load my property file.
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.1</version>
In my spark driver class MyDriver.java I am loading the YML file as below
String ymlFilename = args[1].toString();
Optional<QueryEntities> entities = InputYamlProcessor.process(ymlFilename);
I have all code here including InputYamlProcessor.java
https://gist.github.com/BdLearnerr/e4c47c5f1dded951b18844b278ea3441
This works fine locally, but when I run it on the cluster it gives an error.
Error:
Can't construct a java object for tag:yaml.org,2002:com.snp.yml.QueryEntities; exception=Class not found: com.snp.yml.QueryEntities
in 'reader', line 1, column 1:
entities:
^
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
at org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450)
at org.yaml.snakeyaml.Yaml.loadAs(Yaml.java:444)
at com.snp.yml.InputYamlProcessor.process(InputYamlProcessor.java:62)
Caused by: org.yaml.snakeyaml.error.YAMLException: Class not found: com.snp.yml.QueryEntities
at org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:650)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:331)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:341)
... 12 more
My spark job script is
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/column_family_condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties column_family_condition.yml
What am I doing wrong here? How can I fix this issue?
Any fix would be much appreciated.
Tested:
I added something like this inside the class, just before the line that throws the error above, to check whether the issue is really a missing class.
public static void printTest() {
  QueryEntity e1 = new QueryEntity();
  e1.setTableName("tab1");
  List<QueryEntity> li = new ArrayList<QueryEntity>();
  li.add(e1);
  QueryEntities ll = new QueryEntities();
  ll.setEntitiesList(li);
  ll.getEntitiesList().stream().forEach(e -> logger.error("e1 Name :" + e.getTableName()));
  return;
}
Output :
19/09/18 04:40:33 ERROR yml.InputYamlProcessor: e1 Name :tab1
Can't construct a java object for tag:yaml.org,2002:com.snp.helpers.QueryEntities; exception=Class not found: com.snp.helpers.QueryEntities
in 'reader', line 1, column 1:
entitiesList:
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
What is wrong here?
This has nothing to do with the QueryEntities class itself, i.e.
YAMLException: Class not found: com.snp.yml.QueryEntities
is an issue with the YAML constructor. CustomClassLoaderConstructor tells SnakeYAML which class loader to use when it resolves the class.
Changed to:
Yaml yaml = new Yaml(new CustomClassLoaderConstructor(com.snp.helpers.QueryEntities.class.getClassLoader()));
From:
/*Constructor constructor = new Constructor(com.snp.helpers.QueryEntities.class);
Yaml yaml = new Yaml( constructor );*/

How to load Impala table directly to Spark using JDBC?

I am trying to write a spark job with Python that would open a jdbc connection with Impala and load a VIEW directly from Impala into a Dataframe. This question is pretty close but in scala: Calling JDBC to impala/hive from within a spark job and creating a table
How do I do this? There are plenty of examples for other datasources such as MySQL, PostgreSQL, etc. but I haven't seen one for Impala + Python + Kerberos. An example would be of great help. Thank you!
Tried this with information from the web but it didn't work.
SPARK Notebook
#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'
# use Java8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
# JDBC Drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30
# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false
pyspark --master yarn-client \
--driver-memory 4G \
--executor-memory 2G \
# --num-executors 10 \
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
--driver-class-path $JDBC_PATH/*.jar
Python Code
properties = {
"driver": "com.cloudera.impala.jdbc41.Driver",
"AuthMech": "1",
# "KrbRealm": "EXAMPLE.COM",
# "KrbHostFQDN": "impala.example.com",
"KrbServiceName": "impala"
}
# imp_env is the hostname of the db, works with other impala queries ran inside python
url = "jdbc:impala:imp_env;auth=noSasl"
db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)
I received this error message:
Py4JJavaError: An error occurred while calling o42.jdbc.
: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver
You can use
--jars $(echo /dir/of/jars/*.jar | tr ' ' ',')
instead of
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
or for another approach please see my answer
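If the launch command is assembled from Python rather than from a shell, the same comma-separated list can be built with glob. This is only a sketch of the idea in the answer above: jars_argument is a hypothetical helper name, and /dir/of/jars is the placeholder directory from that answer.

```python
import glob
import os


def jars_argument(jar_dir: str) -> str:
    """Return a comma-separated list of all .jar files in jar_dir, sorted by name."""
    return ",".join(sorted(glob.glob(os.path.join(jar_dir, "*.jar"))))


if __name__ == "__main__":
    # /dir/of/jars is a placeholder; point it at the real driver directory.
    print("--jars", jars_argument("/dir/of/jars"))
```

Unlike the echo | tr version, this does not turn spaces inside a path into commas, because the paths are never split on whitespace.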
The first approach is to run spark-submit on the impala_jdbc_connection.py script below, like this:
spark-submit --driver-class-path /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --jars /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --class com.cloudera.impala.jdbc41.Driver impala_jdbc_connection.py
impala_jdbc_connection.py
from pyspark.sql import SparkSession

# Spark's JDBC source reads the driver class from the "driver" option
properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver"
}
# initialize the Spark session
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "jar-packages-list")
    .config("spark.sql.warehouse.dir", "hdfs://dwh-hdp-node01.dev.ergo.liferunoffinsuranceplatform.com:8020/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)
db_df = spark.read.jdbc(url='jdbc:impala://host_ip_address:21050/database_name', table='table_name', properties=properties)
db_df.show()
The second approach is not a direct import from Impala into Spark, but rather a conversion of query results into a Spark DataFrame.
pip install impyla
Source: https://github.com/cloudera/impyla
Connect to Impala, fetch the results from the Impala database, and convert the result to a Spark DataFrame:
from impala.dbapi import connect
conn = connect(host='IP_ADDRESS_OF_HOST', port=21050)
cursor = conn.cursor()
cursor.execute('select * from database.table')
res = cursor.fetchall()  # convert res to spark dataframe
for data in res:
    print(data)
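The snippet above fetches the rows but leaves the conversion step as a comment. One way to do it, sketched under assumptions: take the column names from cursor.description and pass the rows with those names to spark.createDataFrame. The helper names are hypothetical, and sqlite3 is used here only as a stand-in DB-API 2.0 cursor so the example runs without an Impala server; with impyla the cursor would come from conn.cursor() as above.

```python
import sqlite3


def fetch_columns_and_rows(cursor, query):
    """Run query on a DB-API 2.0 cursor and return (column_names, rows)."""
    cursor.execute(query)
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    return columns, rows


def to_spark_dataframe(spark, columns, rows):
    """Build a Spark DataFrame from fetched rows; Spark infers the column types."""
    return spark.createDataFrame(rows, schema=columns)


# Stand-in demo with sqlite3 (assumption: used only because it needs no server).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
cur.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b')")
columns, rows = fetch_columns_and_rows(cur, "SELECT id, name FROM t ORDER BY id")
print(columns, rows)  # ['id', 'name'] [(1, 'a'), (2, 'b')]
```

With a live SparkSession, to_spark_dataframe(spark, columns, rows) would then give the DataFrame. Note that every row passes through the driver's memory, and that type inference needs at least one row, so an empty result set would need an explicit schema.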
I did this in an Azure Databricks notebook after setting up the jar in the cluster libraries. I generally followed the previous post, except that the d in the Driver config key is upper case. Worked great.
properties = {
    "Driver": "com.cloudera.impala.jdbc41.Driver"
}
db_df = spark.read.jdbc(url= 'jdbc:impala://hostname.domain.net:21050/dbname;AuthMech=3;UID=xxxx;PWD=xxxx', table ='product', properties = properties)
db_df.show()
This works for me:
spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)

Sqoop merge fails when column contains new line character

Ran Sqoop with options: --fields-terminated-by '\001' --optionally-enclosed-by '\003'
Ran it twice to create 2 directories. This created a QueryResult.java with following line:
private final DelimiterSet __inputDelimiters = new DelimiterSet((char) 1, (char) 10, (char) 3, (char) 0, false);
So far so good!
Used this QueryResult class to run a 'Sqoop Merge' but when it comes to the column that follows the column with new line character, it dies with the exception: java.util.NoSuchElementException
Sqoop version:
Sqoop 1.4.4-mapr
git commit id 16d0124c5b5f7bc68b8f67fbe77f0c91d46d64c1
Compiled by root on Wed Aug 28 17:22:49 PDT 2013
Any ideas?
This is no longer an issue. I fixed it by adding the --hive-drop-import-delims option. Hope this helps someone. Thanks.
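A toy Python sketch of why an embedded newline can break record parsing and why dropping the delimiters avoids it. This is a deliberately naive parser written for illustration, not Sqoop's: it ignores the \003 enclosing character, and the assumption is that records are split on the newline before fields are parsed. The drop_import_delims helper mimics what --hive-drop-import-delims is documented to do, which is to remove \n, \r and \01 from string fields.

```python
FIELD_SEP = "\x01"  # --fields-terminated-by '\001'
RECORD_SEP = "\n"   # (char) 10 in the generated DelimiterSet


def serialize(rows):
    """Join fields and records with the delimiters above."""
    return "".join(FIELD_SEP.join(row) + RECORD_SEP for row in rows)


def parse_records(text, n_fields):
    """Split text into records, then fields; fail on records with a wrong field count."""
    records = []
    for line in text.split(RECORD_SEP):
        if not line:
            continue
        fields = line.split(FIELD_SEP)
        if len(fields) != n_fields:
            raise ValueError(f"expected {n_fields} fields, got {len(fields)}: {line!r}")
        records.append(fields)
    return records


def drop_import_delims(value):
    """Remove newline, carriage return and Ctrl-A characters from a string field."""
    return value.replace("\n", "").replace("\r", "").replace("\x01", "")


rows = [["1", "line one\nline two", "x"]]

try:
    parse_records(serialize(rows), 3)
except ValueError as err:
    print("parse failed:", err)

cleaned = [[drop_import_delims(v) for v in row] for row in rows]
print(parse_records(serialize(cleaned), 3))
```

In the first call the record is cut in two at the embedded newline, so the parser runs out of fields right after the column that contains it, which is the same place the merge job was reported to fail.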
