HDP-2.5.0.0 using Ambari 2.4.0.1
I'm able to create a table in HCatalog from a SQL Server source db, for example :
sqoop import --null-string '\\N' --null-non-string '\\N' --hive-delims-replacement '\0D' --hcatalog-home /usr/hdp/current/hive-webhcat --hcatalog-database MS_Management_Coaching --hcatalog-table TripAggregate --create-hcatalog-table --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' --validate --connect 'jdbc:sqlserver://<DB server>;database=Management' --username uname--password pwd--table TripAggregate -- --schema Coaching
but when I try to use a the --create-hive-table, the -- --schema option doesn't work, no matter where I position it :
-bash-4.2$ sqoop create-hive-table --hive-database test --connect 'jdbc:sqlserver://<DB Server>;database=Management' --username uname--password pwd--table TripAggregate -- --schema Coaching
Warning: /usr/hdp/2.5.0.0-1245/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/10/12 21:28:13 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.5.0.0-1245
16/10/12 21:28:13 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/10/12 21:28:13 ERROR tool.BaseSqoopTool: Error parsing arguments for create-hive-table:
16/10/12 21:28:13 ERROR tool.BaseSqoopTool: Unrecognized argument: --
16/10/12 21:28:13 ERROR tool.BaseSqoopTool: Unrecognized argument: --schema
16/10/12 21:28:13 ERROR tool.BaseSqoopTool: Unrecognized argument: DriverCoaching
Try --help for usage instructions.
If the argument -- is given on the command-line, then subsequent arguments are sent directly to the underlying tool.
After looking into sqoop code, I found that in --create-hive-table flow does not go to the underlying tool. That's why you are not able to use -- --schema in your command.
Useful part of source code for ImportTool :
public void validateOptions(SqoopOptions options)
throws InvalidOptionsException {
// If extraArguments is full, check for '--' followed by args for
// mysqldump or other commands we rely on.
options.setExtraArgs(getSubcommandArgs(extraArguments));
int dashPos = getDashPosition(extraArguments);
if (hasUnrecognizedArgs(extraArguments, 0, dashPos)) {
throw new InvalidOptionsException(HELP_STR);
}
validateImportOptions(options);
validateIncrementalOptions(options);
validateCommonOptions(options);
validateCodeGenOptions(options);
validateOutputFormatOptions(options);
validateHBaseOptions(options);
validateHiveOptions(options);
validateHCatalogOptions(options);
validateAccumuloOptions(options);
}
Useful part of source code for CreateHiveTable :
public void validateOptions(SqoopOptions options)
throws InvalidOptionsException {
if (hasUnrecognizedArgs(extraArguments)) {
throw new InvalidOptionsException(HELP_STR);
}
validateCommonOptions(options);
validateOutputFormatOptions(options);
validateHiveOptions(options);
if (options.getTableName() == null) {
throw new InvalidOptionsException(
"--table is required for table definition importing." + HELP_STR);
}
}
you see no checking for -- args is done in the later.
Edit:
--hive-import by default creates hive table and you can use -- --schema with import command. If you want sqoop to create hive table for you and import data in that table. It should work for you.
Related
I'm trying to load data from my redshift database using PySpark.
I'm using "io.github.spark-redshift-community" as connector. It's requires a "tempdir" parameter to use a S3. My code looks like the following:
import findspark
findspark.add_packages("io.github.spark-redshift-community:spark-redshift_2.12:5.0.3")
findspark.add_packages("com.amazonaws:aws-java-sdk-bundle:1.12.262")
findspark.add_packages("org.apache.hadoop:hadoop-aws:3.3.4")
findspark.init()
spark = SparkSession.builder.master("local[8]").appName("Dim_Customer").getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", S3_ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", S3_SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3a.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
df_read_1 = spark.read \
.format("io.github.spark_redshift_community.spark.redshift") \
.option("url", "jdbc:redshift://IP/DATABASE?user=USER&password=PASS") \
.option("dbtable", "table") \
.option("tempdir", "s3a://url/")\
.option("forward_spark_s3_credentials", "true") \
.load()
But I'm getting an error: Class org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found
I've fond some sources saying to change BasicAWSCredentialsProvider to SimpleAWSCredentialsProvider, but I get another error: NoSuchMethodError.
Could someone help me, please?
Is that any problem with the hadoop and aws-java-sdk versions?
Thank you in advance!
I'm trying to connecct to a presto DB installed in a remote server from my mac local machine using pyspark, below is my code. I have downloaded the presto driver and placed it under /user/name//Hadoop/spark-2.3.1-bin-hadoop2.7/jars ( I guess this is where I'm making a mistake, but not sure)
from pyspark.sql import SparkSession, HiveContext
from pyhive import presto, hive
def main():
spark = SparkSession.builder\
.appName("tests")\
.enableHiveSupport()\
.getOrCreate()
df_presto = spark.read.format("jdbc") \
.option("driver", "io.prestosql.jdbc.PrestoDriver")\
.option("url", "jdbc:presto://host.com:443/hive") \
.option("user", "user_name")\
.option("password", "password") \
.option("dbtable", "(select column from table_name limit 10) tmp") \
.load()
Preso driver : presto-jdbc-340.jar
When I tried to execute the code, I'm getting an error as below
Traceback (most recent call last):
File "/Users/user_name/Hadoop/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/Users/user_name/Hadoop/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.IllegalArgumentException: java.net.UnknownHostException: ip-10-120-99-149.ec2.internal;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
Any idea how can I fix this?
I am using spark 2.4.1 version and java8. I am trying to load external property file while submitting my spark job using spark-submit.
As I am using below TypeSafe to load my property file.
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.1</version>
In my spark driver class MyDriver.java I am loading the YML file as below
String ymlFilename = args[1].toString();
Optional<QueryEntities> entities = InputYamlProcessor.process(ymlFilename);
I have all code here including InputYamlProcessor.java
https://gist.github.com/BdLearnerr/e4c47c5f1dded951b18844b278ea3441
This is working fine in my local but when I run on cluster this gives error
Error :
Can't construct a java object for tag:yaml.org,2002:com.snp.yml.QueryEntities; exception=Class not found: com.snp.yml.QueryEntities
in 'reader', line 1, column 1:
entities:
^
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
at org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450)
at org.yaml.snakeyaml.Yaml.loadAs(Yaml.java:444)
at com.snp.yml.InputYamlProcessor.process(InputYamlProcessor.java:62)
Caused by: org.yaml.snakeyaml.error.YAMLException: Class not found: com.snp.yml.QueryEntities
at org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:650)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:331)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:341)
... 12 more
My spark job script is
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/column_family_condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties column_family_condition.yml
What am I doing wrong here? How to fix this issue ?
Any fix is highly thankful.
Tested :
I printed something like this inside the class , before the line where getting above... to check if the issue is really class not found.
public static void printTest() {
QueryEntity e1 = new QueryEntity();
e1.setTableName("tab1");
List<QueryEntity> li = new ArrayList<QueryEntity>();
li.add(e1);
QueryEntities ll = new QueryEntities();
ll.setEntitiesList(li);
ll.getEntitiesList().stream().forEach(e -> logger.error("e1 Name :" + e.getTableName()));
return;
}
Output :
19/09/18 04:40:33 ERROR yml.InputYamlProcessor: e1 Name :tab1
Can't construct a java object for tag:yaml.org,2002:com.snp.helpers.QueryEntities; exception=Class not found: com.snp.helpers.QueryEntities
in 'reader', line 1, column 1:
entitiesList:
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
What is wrong here ?
This has got nothing to do with QueryEntities
i.e. YAMLException: Class not found: com.snp.yml.QueryEntities
is YML constructor issue
Changed To
Yaml yaml = new Yaml(new CustomClassLoaderConstructor(com.snp.helpers.QueryEntities.class.getClassLoader()));
From
/*Constructor constructor = new Constructor(com.snp.helpers.QueryEntities.class);
Yaml yaml = new Yaml( constructor );*/
I am trying to write a spark job with Python that would open a jdbc connection with Impala and load a VIEW directly from Impala into a Dataframe. This question is pretty close but in scala: Calling JDBC to impala/hive from within a spark job and creating a table
How do I do this? There are plenty of examples for other datasources such as MySQL, PostgreSQL, etc. but I haven't seen one for Impala + Python + Kerberos. An example would be of great help. Thank you!
Tried this with information from the web but it didn't work.
SPARK Notebook
#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'
# use Java8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
# JDBC Drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30
# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false
pyspark --master yarn-client \
--driver-memory 4G \
--executor-memory 2G \
# --num-executors 10 \
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
--driver-class-path $JDBC_PATH/*.jar
Python Code
properties = {
"driver": "com.cloudera.impala.jdbc41.Driver",
"AuthMech": "1",
# "KrbRealm": "EXAMPLE.COM",
# "KrbHostFQDN": "impala.example.com",
"KrbServiceName": "impala"
}
# imp_env is the hostname of the db, works with other impala queries ran inside python
url = "jdbc:impala:imp_env;auth=noSasl"
db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)
I received this error msg (Full Error Log):
Py4JJavaError: An error occurred while calling o42.jdbc.
: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver
You can use
--jars $(echo /dir/of/jars/*.jar | tr ' ' ',')
instead of
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
or for another approach please see my answer
1st approach is to use spark-submit on below impala_jdbc_connection.py script like spark-submit --driver-class-path /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --jars /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --class com.cloudera.impala.jdbc41.Driver impala_jdbc_connection.py
impala_jdbc_connection.py
properties = {
"drivers": "com.cloudera.impala.jdbc41.Driver"
}
#initalize the spark session
spark = (
SparkSession.builder
.config("spark.jars.packages", "jar-packages-list")
.config("spark.sql.warehouse.dir","hdfs://dwh-hdp-node01.dev.ergo.liferunoffinsuranceplatform.com:8020/user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
)
db_df = spark.read.jdbc(url= 'jdbc:impala://host_ip_address:21050/database_name', table ='table_name', properties = properties)
db_df.show()
2nd approach is not a direct import from impala to spark but rather a conversion of results to spark dataframe
pip install impyla Source: https://github.com/cloudera/impyla
Connect to impala and fetch results from impala database and convert result to spark dataframe
from impala.dbapi import connect
conn = connect(host = 'IP_ADDRESS_OF_HOST', port=21050)
cursor = conn.cursor()
cursor.execute('select * from database.table')
res= cursor.fetchall() # convert res to spark dataframe
for data in res:
print(data)
Did this in Azure Databricks notebook after setting up the jar in the cluster libraries. Generally followed previous post except that d is upper case for Driver config. Worked great.
properties = {
"Driver": "com.cloudera.impala.jdbc41.Driver"
}
db_df = spark.read.jdbc(url= 'jdbc:impala://hostname.domain.net:21050/dbname;AuthMech=3;UID=xxxx;PWD=xxxx', table ='product', properties = properties)
db_df.show()
This works for me:
spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)
Ran Sqoop with options: --fields-terminated-by '\001' --optionally-enclosed-by '\003'
Ran it twice to create 2 directories. This created a QueryResult.java with following line:
private final DelimiterSet __inputDelimiters = new DelimiterSet((char) 1, (char) 10, (char) 3, (char) 0, false);
So far so good!
Used this QueryResult class to run a 'Sqoop Merge' but when it comes to the column that follows the column with new line character, it dies with the exception: java.util.NoSuchElementException
Sqoop version:
Sqoop 1.4.4-mapr
git commit id 16d0124c5b5f7bc68b8f67fbe77f0c91d46d64c1
Compiled by root on Wed Aug 28 17:22:49 PDT 2013
Any ideas?
This is no longer an issue. Fixed it by adding --hive-drop-import-delims option. Hope this helps someone. Thanks.