Sqoop merge fails when column contains new line character

Ran Sqoop with options: --fields-terminated-by '\001' --optionally-enclosed-by '\003'
Ran it twice to create two directories. This created a QueryResult.java with the following line:
private final DelimiterSet __inputDelimiters = new DelimiterSet((char) 1, (char) 10, (char) 3, (char) 0, false);
So far so good!
Used this QueryResult class to run a 'Sqoop merge', but when it reaches the column that follows the one containing the newline character, it dies with the exception: java.util.NoSuchElementException
Sqoop version:
Sqoop 1.4.4-mapr
git commit id 16d0124c5b5f7bc68b8f67fbe77f0c91d46d64c1
Compiled by root on Wed Aug 28 17:22:49 PDT 2013
Any ideas?

This is no longer an issue. Fixed it by adding the --hive-drop-import-delims option. Hope this helps someone. Thanks.
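For reference, a hedged sketch of what the re-run import could look like with that extra flag; the connect string, table, and target directory are placeholders, and only the last option is new:
sqoop import \
  --connect jdbc:mysql://db-host/source_db \
  --table MY_TABLE \
  --target-dir /data/my_table_run1 \
  --fields-terminated-by '\001' \
  --optionally-enclosed-by '\003' \
  --hive-drop-import-delims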

Related

How to delete temp table created by VerticaPy - error the table "v_temp_schema"."train" already exists

I want to read a csv file using VerticaPy - https://www.vertica.com/python/documentation_last/utilities/read_csv/
I am running Vertica using a Docker image on my laptop.
The Jupyter code is:
import sys
!{sys.executable} -m pip install vertica-python
!{sys.executable} -m pip install verticapy

from verticapy import *

conn_info = {'host': '127.0.0.1',
             'port': 5433,
             'user': 'dbadmin',
             'password': '',
             'database': 'kaggle_titanic'}

train = read_csv("train.csv")
train
The code runs fine the first time, but when I run it again I get this error:
NameError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8344/783506563.py in <module>
23 # cur.execute("drop table kaggle_titanic.v_temp_schema.train")
24
---> 25 train = read_csv("train.csv")
26 train
27 #iris2
~\Anaconda3\lib\site-packages\verticapy\utilities.py in read_csv(path, cursor, schema, table_name, sep, header, header_names, dtype, na_rep, quotechar, escape, genSQL, parse_n_lines, insert, temporary_table, temporary_local_table, ingest_local)
918 result = cursor.fetchall()
919 if (result != []) and not (insert):
--> 920 raise NameError(
921 'The table "{}"."{}" already exists !'.format(schema, table_name)
922 )
NameError: The table "v_temp_schema"."train" already exists !
Looks like VerticaPy creates a temp table in the first run. How do I delete it?
I tried adding this code:
import vertica_python

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("drop table kaggle_titanic.v_temp_schema.train")
But got this error:
MissingSchema: Severity: ROLLBACK, Message: Schema "v_temp_schema" does not exist, Sqlstate: 3F000, Routine: RangeVarGetObjid, File: /data/jenkins/workspace/RE-ReleaseBuilds/RE-Jackhammer_3/server/vertica/Catalog/Namespace.cpp, Line: 281, Error Code: 4650, SQL: 'drop table kaggle_titanic.v_temp_schema.train'
I had to stop/start the database (from the Docker image) to run the code again, but clearly that isn't the right way.
How do I delete v_temp_schema.train?
I answered this request on GitHub: https://github.com/vertica/VerticaPy/issues/285
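For completeness, a minimal sketch of one way to clean up before re-running. It assumes the table created by read_csv is a session-local temporary table living in v_temp_schema, so the DROP is issued on the same cursor that is then passed to read_csv (the cursor parameter is visible in the traceback's signature above):
import vertica_python
import vertica_python.errors
from verticapy import read_csv

conn_info = {'host': '127.0.0.1', 'port': 5433, 'user': 'dbadmin',
             'password': '', 'database': 'kaggle_titanic'}

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()

try:
    # Clear a leftover table from an earlier run in this session (assumption:
    # read_csv put it in the session-local v_temp_schema).
    cur.execute('DROP TABLE IF EXISTS v_temp_schema.train')
except vertica_python.errors.MissingSchema:
    # Nothing to clean up: v_temp_schema only shows up once a temp table exists.
    pass

train = read_csv("train.csv", cursor=cur)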

Pyspark - error when saving data into Hive table "unresolved operator 'InsertIntoTable HiveTableRelation'"

I use the following:
- pyspark library, version 2.3.1
- python, version 2.7.1
- hadoop, version 2.7.3
- hive, version 1.2.1000.2.6.5.30-1
- spark, version 2
My Hive table looks like this:
CREATE TABLE IF NOT EXISTS my_database.my_table
(
division STRING COMMENT 'Sample column'
)
I want to save data into Hive using pyspark. I use the following code:
spark_session = SparkSession.builder.getOrCreate()
hive_context = HiveContext(spark_session.sparkContext)
hive_table_schema = hive_context.table("my_database.my_table").schema
df_to_save = spark_session.createDataFrame([["a"],["b"],["c"]], schema=hive_table_schema)
df_to_save.write.mode("append").insertInto("my_database.my_table")
But the following error occurs:
Traceback (most recent call last):
File "/home/my_user/mantis service_quality_check__global/scripts/row_counts_preprocess.py", line 147, in <module> df_to_save.write.mode("append").insertInto(hive_table_row_counts_str)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 716, in insertInto
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"unresolved operator 'InsertIntoTable HiveTableRelation `my_database`.`my_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [division#14], false, false;;\n'InsertIntoTable HiveTableRelation `my_database`.`my_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [division#14], false, false\n+- LogicalRDD [division#2], false\n"
Is there someone who can help with this? I have been stuck on it for a few days.
I found the issue. The SparkSession has to support Hive: the method enableHiveSupport() has to be called when the Spark session is created.
The creation of the Spark session then looks like the following:
spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()
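Putting the fix together with the snippet from the question, a short sketch of the full flow (imports added):
from pyspark.sql import SparkSession, HiveContext

# enableHiveSupport() must be called before the session is first created;
# otherwise the Hive catalog is not used and insertInto cannot be resolved.
spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()
hive_context = HiveContext(spark_session.sparkContext)

# Reuse the Hive table's schema so the DataFrame columns line up with the table.
hive_table_schema = hive_context.table("my_database.my_table").schema
df_to_save = spark_session.createDataFrame([["a"], ["b"], ["c"]], schema=hive_table_schema)
df_to_save.write.mode("append").insertInto("my_database.my_table")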

Airflow Failed: ParseException line 2:0 cannot recognize input near

I'm trying to run a test task on Airflow but I keep getting the following error:
FAILED: ParseException 2:0 cannot recognize input near 'create_import_table_fct_latest_values' '.' 'hql'
Here is my Airflow Dag file:
import airflow
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.models import DAG

args = {
    'owner': 'raul',
    'start_date': datetime(2018, 11, 12),
    'provide_context': True,
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email': ['raul.gregglino@leroymerlin.ru'],
    'email_on_failure': True,
    'email_on_retry': False
}

dag = DAG('opus_data',
          default_args=args,
          max_active_runs=6,
          schedule_interval="@daily"
          )

import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='create_import_table_fct_latest_values.hql ',
    hiveconf_jinja_translate=True,
    dag=dag
)

deps = {}

# Explicitly define the dependencies in the DAG
for downstream, upstream_list in deps.iteritems():
    for upstream in upstream_list:
        dag.set_dependency(upstream, downstream)
Here is the content of my HQL file, in case this might be the issue and I just can't see it:
I'm testing the connection to check whether the table gets created or not; then I'll try to LOAD DATA, hence the LOAD DATA statement is commented out.
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
id_product STRING,
id_model STRING,
id_attribute STRING,
attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED ',';
#LOAD DATA LOCAL INPATH
#'/media/windows_share/schemas/opus/fct_latest_values_20181106.csv'
#OVERWRITE INTO TABLE opus_data.fct_latest_values_new_data;
In the HQL file it should be FIELDS TERMINATED BY ',':
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
id_product STRING,
id_model STRING,
id_attribute STRING,
attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
And comments in an HQL file should start with --, not #.
Also, this seems incorrect and is causing the exception: hql='create_import_table_fct_latest_values.hql '
Have a look at this example:
# Create full path for the file
hql_file_path = os.path.join(os.path.dirname(__file__), source['hql'])
print hql_file_path

run_hive_query = HiveOperator(
    task_id='run_hive_query',
    dag=dag,
    hql="""
{{ local_hive_settings }}
""" + "\n " + open(hql_file_path, 'r').read()
)
See here for more details.
Or put all the HQL into the hql parameter:
hql='CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data ...'
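For example, a sketch of that second option using the operator from the question, with the corrected HQL passed inline (only the hql argument changes):
import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    # The statement itself, instead of a file name that Hive would try to parse as a query.
    hql="""
    CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
        id_product STRING,
        id_model STRING,
        id_attribute STRING,
        attribute_value STRING
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    """,
    hiveconf_jinja_translate=True,
    dag=dag
)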
I managed to find the answer to my issue.
It was related to the path my HiveOperator was reading the file from. As no Variable had been defined to tell Airflow where to look, I was getting the error mentioned in my post.
Once I defined it using the webserver interface (see picture), my DAG started to work properly.
I made a change to my DAG code regarding the file location, for organization only, and this is how my HiveOperator looks now:
import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='hql/create_import_table_fct_latest_values2.hql',
    hiveconf_jinja_translate=True,
    dag=dag
)
Thanks to @panov.st, who helped me in person to identify my issue.
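As an alternative to the Airflow Variable, the DAG-level template_searchpath argument can point the template engine at the directory holding the .hql files; a rough sketch, where the path is a placeholder:
dag = DAG('opus_data',
          default_args=args,
          max_active_runs=6,
          schedule_interval="@daily",
          # Placeholder: the directory that contains the .hql files.
          template_searchpath='/usr/local/airflow/dags/hql'
          )

import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    # Resolved against template_searchpath and rendered as a Jinja template.
    hql='create_import_table_fct_latest_values.hql',
    hiveconf_jinja_translate=True,
    dag=dag
)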

oracle ojdbc7 issue - camel-sql component failed to execute queries after upgrading to ojdbc7 driver

After upgrading the ojdbc driver to ojdbc7, the camel-sql component started having issues when executing queries with named parameters.
Below is a simple SQL query that failed during execution. Not all queries are failing, only a few of them.
SELECT some_id
FROM app_schema.app_store_table
WHERE store_id = :#storeId
AND source = :#source;
On the database side, I verified that the username has access to the object in app_schema and also tried creating a PUBLIC synonym, but no luck.
Based on my understanding, it fails while trying to get the parameter count in SqlProducer.java → process() → execute() → ps.getParameterMetaData()
jdbcTemplate.execute(statementCreator, new PreparedStatementCallback<Map<?, ?>>() {
    public Map<?, ?> doInPreparedStatement(PreparedStatement ps) throws SQLException {
        int expected = parametersCount > 0 ? parametersCount : ps.getParameterMetaData().getParameterCount();
        ...
Further debugging found the following error from the ojdbc7 driver. It confirms that the ojdbc7 driver failed to parse the query correctly: in the example below, the "." ([DOT]) between schema_name and table_name is dropped by the ojdbc driver before execution.
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.OracleConnection isValid
FINE: 19C71BE Exit [17.696372ms]
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.PhysicalConnection prepareStatement
FINE: 19C71BE Public Enter: "SELECT some_id FROM app_schema.app_store_table WHERE store_id = ? and source = ?"
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.PhysicalConnection prepareStatement
FINE: 19C71BE Return: oracle.jdbc.driver.OraclePreparedStatementWrapper@c2c269
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.PhysicalConnection prepareStatement
FINE: 19C71BE Exit [226.482362ms]
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.PhysicalConnection prepareStatement
FINE: 19C71BE Public Enter: "SELECT store_id, source FROM app_schemaapp_store_table" <-- the .[DOT] between schema_name and table_name is dropped by the driver, which results in this error.
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.PhysicalConnection prepareStatement
FINE: 19C71BE Return: oracle.jdbc.driver.OraclePreparedStatementWrapper@e1e592
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.PhysicalConnection prepareStatement
FINE: 19C71BE Exit [0.413825ms]
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.OraclePreparedStatement getMetaData
FINE: E1E592 Public Enter:
Aug 30, 2017 9:06:57 AM oracle.jdbc.driver.T4CTTIoer processError
SEVERE: CD5A42 Throwing SQLException: ORA-00942: table or view does not exist
This can be replicated easily using the following commands:
-- example of the .[DOT] being dropped between schema_name and table_name. Refer to Parameter SQL.
C:\Users\xxxxxx\Desktop\ojdbc\oracle>java -cp ojdbc7_g-12.1.0.2.jar oracle.jdbc.driver.OracleParameterMetaDataParser "SELECT some_id FROM app_schema.app_store_table WHERE store_id = ? and source = ?"
SQL:SELECT some_id FROM app_schema.app_store_table WHERE store_id = :1 and source = :2
SqlKind:SELECT, Parameter Count=2
Parameter SQL: SELECT store_id, source FROM app_schemaapp_store_table
-- example of invalid parsing of the query. Refer to Parameter SQL.
C:\Users\vm\ojdbc\oracle>java -cp ojdbc7_g-12.1.0.2.jar oracle.jdbc.driver.OracleParameterMetaDataParser "SELECT f_some_id FROM app_store_table WHERE f_store_id = ? and f_source = ?"
SQL:SELECT f_some_id FROM app_store_table WHERE f_store_id = :1 and f_source = :2
SqlKind:SELECT, Parameter Count=2
Parameter SQL: SELECT f, f FROM app_store_table
Exception:
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:392)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:385)
at oracle.jdbc.driver.T4CTTIfun.processError(T4CTTIfun.java:1018)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:522)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
at oracle.jdbc.driver.T4C8Odscrarr.doODNY(T4C8Odscrarr.java:96)
at oracle.jdbc.driver.T4CPreparedStatement.doDescribe(T4CPreparedStatement.java:717)
at oracle.jdbc.driver.OracleStatement.describe(OracleStatement.java:4404)
at oracle.jdbc.driver.OracleResultSetMetaData.<init>(OracleResultSetMetaData.java:52)
at oracle.jdbc.driver.OracleStatement.getResultSetMetaData(OracleStatement.java:4387)
at oracle.jdbc.driver.OraclePreparedStatement.getMetaData(OraclePreparedStatement.java:5581)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.getMetaData(OraclePreparedStatementWrapper.java:1509)
at oracle.jdbc.driver.OracleParameterMetaData.getParameterMetaData(OracleParameterMetaData.java:70)
at oracle.jdbc.driver.OraclePreparedStatement.getParameterMetaData(OraclePreparedStatement.java:12861)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.getParameterMetaData(OraclePreparedStatementWrapper.java:1551)
at org.apache.commons.dbcp2.DelegatingPreparedStatement.getParameterMetaData(DelegatingPreparedStatement.java:266)
at org.apache.commons.dbcp2.DelegatingPreparedStatement.getParameterMetaData(DelegatingPreparedStatement.java:266)
at org.apache.camel.component.sql.SqlProducer$2.doInPreparedStatement(SqlProducer.java:92)
at org.apache.camel.component.sql.SqlProducer$2.doInPreparedStatement(SqlProducer.java:90)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:589)
... 29 more
The same component worked well with ojdbc6.
Environment details:
- Camel: 2.15.1.redhat-621166
- Fuse: 6.2.1
- OJDBC: ojdbc7 (12.1.0.2)
Additional dependencies:
- commons-dbcp2 (2.1.1)
- commons-pool2 (2.4.2)
- spring-jdbc (3.2.12.RELEASE)
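Given the parametersCount fallback visible in the SqlProducer snippet above, one hedged workaround sketch (assuming camel-sql exposes that value as a parametersCount endpoint option) is to supply the count on the endpoint so ps.getParameterMetaData() is never consulted:
import org.apache.camel.builder.RouteBuilder;

// Sketch only: the route and endpoint names are illustrative, and parametersCount=2
// is assumed to map onto the fallback shown in the SqlProducer snippet above.
public class StoreLookupRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:lookupStore")
            .to("sql:SELECT some_id FROM app_schema.app_store_table"
                + " WHERE store_id = :#storeId AND source = :#source"
                + "?parametersCount=2");
    }
}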

How to write spark dataframe to impala database

I use the following code to write a Spark dataframe to Impala through a JDBC connection.
df.write.mode("append").jdbc(url="jdbc:impala://10.61.1.101:21050/test;auth=noSasl", table="t_author_classic_copy", properties=pro)
But I get the following error: java.sql.SQLException: No suitable driver found
Then I changed the mode:
df.write.mode("overwrite").jdbc(url="jdbc:impala://10.61.1.101:21050/test;auth=noSasl", table="t_author_classic_copy", properties=pro)
but I still get an error:
CAUSED BY: Exception: Syntax error
), Query: CREATE TABLE t_author_classic_copy1 (id TEXT NOT NULL, domain_id TEXT NOT NULL, pub_num INTEGER , cited_num INTEGER , rank DOUBLE PRECISION ).
This works for me:
spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
import org.apache.spark.sql.SaveMode
sqlContext.sql("select * from my_users").write.mode(SaveMode.Append).jdbc(jdbcURL, "users", connectionProperties)
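For a pyspark session like the one in the question, a rough equivalent sketch; the driver class name and the explicit "driver" property are assumptions for the Cloudera Impala JDBC41 driver (without the driver on the classpath, DriverManager can raise "No suitable driver found"):
# Launch with the driver jar on the classpath, e.g.:
#   pyspark --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
jdbc_url = "jdbc:impala://10.61.1.101:21050/test;AuthMech=0"

pro = {
    # Assumed class name for the Cloudera Impala JDBC41 driver.
    "driver": "com.cloudera.impala.jdbc41.Driver",
}

df.write.mode("append").jdbc(url=jdbc_url, table="t_author_classic_copy", properties=pro)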
