Using teradata fast export within sqoop command - hadoop

I'm having an issue sqooping from a Teradata database when using the Teradata fast export method (--method internal.fastexport); an example Sqoop command is below:
-Dhadoop.security.credential.provider.path=jceks:/PATH/TO/password/password.jcecks
-Dteradata.db.job.data.dictionary.usexviews=false
--connect
jdbc:teradata://DATABASE
--password-alias
password.alias
--username
USER
--connection-manager
org.apache.sqoop.teradata.TeradataConnManager
--fields-terminated-by
'\t'
--lines-terminated-by
'\n'
--null-non-string
''
--null-string
''
--num-mappers
8
--split-by
column3
--target-dir
/THE/TARGET/DIR
--query
SELECT column1,column2,column3 WHERE column3 > '2020-01-01 00:00:00' and column3 <= '2020-01-12 10:41:20' AND $CONDITIONS
--
--method
internal.fastexport
The error I am getting is
Caused by: com.teradata.connector.common.exception.ConnectorException: java.sql.SQLException: [Teradata Database] [TeraJDBC ] [Error 3524] [SQLState 42000] The user does not have CREATE VIEW access to database DATABASE.
I suspect fast export requires a staging table/view to be created temporarily, and that the job under the hood ingests from that temporary object. Is this a Sqoop mechanism, and is it possible to turn it off?
Many thanks
Dan

Fast export itself does not use any view to extract the data. The view is created by Sqoop based on the --query value; hence, the user running the job must have the CV (CREATE VIEW) right granted on DATABASE.
You can check the user's rights on the database by running the query below, replacing USER_NAME and DATABASE_NAME with their values in your environment.
AccessRight = 'CV' means CREATE VIEW, so leave that filter as it is.
SELECT *
FROM dbc.AllRoleRights
WHERE RoleName IN
    (SELECT RoleName FROM dbc.RoleMembers WHERE Grantee = 'USER_NAME')
  AND DatabaseName = 'DATABASE_NAME'
  AND AccessRight = 'CV'
ORDER BY 1,2,3,5;
You may also need the CT (CREATE TABLE) right in order to create the log table for fast export. The table names are given by the Sqoop parameters --error-table and --error-database.
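If the CV right is missing, a DBA can grant it to the user. A minimal sketch, using the same USER_NAME and DATABASE_NAME placeholders as above:
GRANT CREATE VIEW ON DATABASE_NAME TO USER_NAME;
-- only needed if the fast export log/error tables must also be created:
GRANT CREATE TABLE ON DATABASE_NAME TO USER_NAME;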

Related

How to pass a string value in sqoop free form query

I need to import data from a few different SQL Servers which have the same tables, the same table structure, and even the same primary key values. So, to uniquely identify a record ingested from a SQL Server, say "S1", I want to have an extra column, say "serverName", in my Hive tables. How should I add this in my Sqoop free-form query?
All I want to do is pass a hardcoded value along with the list of columns so that the hardcoded value gets stored in Hive. Once that works, I can take care of dynamically changing this value depending upon the server.
sqoop import --connect "connDetails" --username "user" --password "pass" --query "select col1, col2, col3, 'S1' from table where \$CONDITIONS" --hive-import --hive-overwrite --hive-table stg.T1 --split-by col1 --as-textfile --target-dir T1 --hive-drop-import-delims
'S1' is the hardcoded value here. I am thinking in the SQL way, where a hardcoded value passed in the select list is returned as part of the query result. Any pointers on how to get this done?
Thanks in Advance.
SOLVED: It just needed an alias for the hardcoded value. The Sqoop command executed is:
sqoop import --connect "connDetails" --username "user" --password "pass" --query "select col1, col2, col3, 'S1' as serverName from table where \$CONDITIONS" --hive-import --hive-overwrite --hive-table stg.T1 --split-by col1 --as-textfile --target-dir T1 --hive-drop-import-delims

Hadoop-Sqoop import without an integer value using split-by

I am importing data from MemSQL to HDFS using Sqoop. My source table in MemSQL doesn't have any integer column, so I created a new table that includes a new column 'test' along with the existing columns.
The following is the query:
sqoop import --connect jdbc:mysql://XXXXXXXXX:3306/db_name --username XXXX --password XXXXX --query "select closed,extract_date,open,close,cast(floor(rand()*1000000 as int) as test from tble_name where \$CONDITIONS" --target-dir /user/XXXX--split-by test;
This query gave me the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'as int) as test from table_name where (1 = 0)' at line 1
I tried it another way as well:
sqoop import --connect jdbc:mysql://XXXXX:3306/XXXX --username XXXX --password XXXX --query "select closed,extract_date,open,close,ceiling(rand()*1000000) as test from table_name where \$CONDITIONS" --target-dir /user/dfsdlf --split-by test;
With this query the job gets executed, but there is no data being transferred. It says the split-by column is of float type and must be changed to an integer type.
Please help me change the split-by column from float type to integer type.
The problem mostly seems to be related to the use of an alias as the --split-by parameter.
If you need to use that particular expression for the split, you can run the query
'select closed,extract_date,open,close,ceiling(rand()*1000000) from table_name' in the console, take the column name the console reports for that expression, and use it in --split-by 'complete_column_name_from_console' (here it would be --split-by 'ceiling(rand()*1000000)').
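If the expression has to be kept in the query, another option (just a sketch, reusing the placeholder names from the question) is to fix the CAST in the first attempt: its closing parenthesis for floor() is misplaced, and in MySQL/MemSQL syntax CAST usually expects SIGNED rather than INT as the target type, so an integer-typed alias can be produced like this:
sqoop import --connect jdbc:mysql://XXXXXXXXX:3306/db_name --username XXXX --password XXXXX --query "select closed,extract_date,open,close,cast(floor(rand()*1000000) as signed) as test from table_name where \$CONDITIONS" --target-dir /user/XXXX --split-by test
If the alias is still rejected by --split-by, fall back to using the full expression as described above.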

Is it possible to write a Sqoop incremental import with filters on the new file before importing?

My doubt is: say I have a file A1.csv with 2000 records loaded into a SQL Server table, and I import this data into HDFS. Later that day, I add 3000 records to the same table on SQL Server.
Now I want to run an incremental import so that the second chunk of data is added to HDFS, but I do not want all 3000 records to be imported. I only need some of the data, say 1000 records that satisfy a certain condition, to be imported as part of the incremental import.
Is there a way to do that using the Sqoop incremental import command?
Please Help, Thank you.
You need a unique key or a timestamp field to identify the deltas, which are the new 1000 records in your case. Using that field, you have two options to bring the data into Hadoop.
Option 1
is to use Sqoop incremental append; below is an example of it:
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
-m 4 \
--split-by major
Arguments:
--check-column (col): Specifies the column to be examined when determining which rows to import.
--incremental (mode): Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value): Specifies the maximum value of the check column from the previous import.
Option 2
is to use the --query argument in Sqoop, where you can use native SQL for MySQL or whichever database you connect to.
Example:
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
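To additionally filter which of the new rows get imported, one option is to combine incremental append with Sqoop's --where argument. A minimal sketch, reusing the STUDENT example from Option 1 and assuming a hypothetical status column plus a last imported student_id of 2000:
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--where "status = 'ACTIVE'" \
--incremental append \
--check-column student_id \
--last-value 2000 \
-m 4 \
--split-by major
The --where condition is combined with the check-column/last-value delta, so only new rows that also satisfy the condition should be imported.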

Error unrecognized argument --hive-partition-key

I am getting the error "Unrecognized argument --hive-partition-key" when I run the following statement:
sqoop import
--connect 'jdbc:sqlserver://192.168.56.1;database=xyz_dms_cust_100;username-hadoop;password=hadoop'
--table e_purchase_category
--hive_import
--delete-target-dir
--hive-table purchase_category_p
--hive-partition-key "creation_date"
--hive-partition-value "2015-02-02"
The partitioned table exists.
The Hive partition key (creation_date in your example) should not be part of your database table when you are using hive-import. When you create a partitioned table in Hive, you do not include the partition column in the table schema; the same applies to Sqoop hive-import.
Based on your Sqoop command, I am guessing that the creation_date column is present in your SQL Server table. If so, you might be getting this error:
ERROR tool.ImportTool: Imported Failed:
Partition key creation_date cannot be a column to import.
To resolve this issue, I have two solutions:
Make sure that the partition column is not present in the SQL Server table. Then, when Sqoop creates the Hive table, it adds that partition column and its value as a directory in the Hive warehouse.
Change the Sqoop command to use a free-form query that selects all the columns except the partition column, and do the hive-import. Below is an example of this solution.
Example:
sqoop import
--connect jdbc:mysql://localhost:3306/hadoopexamples
--query 'select City.ID, City.Name, City.District, City.Population from City where $CONDITIONS'
--target-dir /user/XXXX/City
--delete-target-dir
--hive-import
--hive-table City
--hive-partition-key "CountryCode"
--hive-partition-value "USA"
--fields-terminated-by ','
-m 1
Another method:
You can also try to do the task in separate steps:
Create a partitioned table in Hive (example: city_partition); a DDL sketch is shown after this list.
Load data from the RDBMS using Sqoop hive-import into a plain Hive table (example: city).
Using INSERT OVERWRITE, load data into the partitioned table (city_partition) from the plain Hive table (city), like:
INSERT OVERWRITE TABLE city_partition
PARTITION (CountryCode='USA')
SELECT id, name, district, population FROM city;
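For the first step, a minimal DDL sketch for the partitioned table; the column types here are assumptions based on the standard MySQL world sample City table:
-- the partition column (CountryCode) is declared only in PARTITIONED BY, not in the column list
CREATE TABLE city_partition (
  id INT,
  name STRING,
  district STRING,
  population INT
)
PARTITIONED BY (CountryCode STRING);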
The following could be applied too:
sqoop import --connect jdbc:mysql://localhost/akash
--username root
--P
--table mytest
--where "dob='2019-12-28'"
--columns "id,name,salary"
--target-dir /user/cloudera/
--m 1 --hive-table mytest
--hive-import
--hive-overwrite
--hive-partition-key dob
--hive-partition-value '2019-12-28'

How to convert mysql DDL into hive DDL

Given a SQL script containing DDL for creating tables in a MySQL database, I would like to convert the script into Hive DDL so that I can create the tables in Hive. I could write an interpreter myself, but I thought there might be details I could miss (e.g. data format conversion, int, bigint, time, date, etc.) since I am very new to Hive DDL.
I have seen this thread, How to transfer mysql table to hive?, which mentions Sqoop (http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html). However, from what I see, Sqoop certainly translates the DDL, but only as an intermediate step (so the translated DDL is nowhere to be found). Am I missing a command that would output the translation, taking the MySQL DDL as input?
For example, my MySQL DDL looks like:
CREATE TABLE `user_keyword` (
`username` varchar(32) NOT NULL DEFAULT '',
`keyword_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`username`,`keyword_id`),
KEY `keyword_id` (`keyword_id`),
CONSTRAINT `analyst_keywords_ibfk_1` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`keyword_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
And the output Hive DDL would be like:
CREATE TABLE user_keyword (
username string,
keyword_id int
);
I actually thought this was not supported, but after looking at the source, here is what I saw in HiveImport.java:
/**
 * @return true if we're just generating the DDL for the import, but
 * not actually running it (i.e., --generate-only mode). If so, don't
 * do any side-effecting actions in Hive.
 */
private boolean isGenerateOnly() {
  return generateOnly;
}

/**
 * @return a File object that can be used to write the DDL statement.
 * If we're in gen-only mode, this should be a file in the outdir, named
 * after the Hive table we're creating. If we're in import mode, this should
 * be a one-off temporary file.
 */
private File getScriptFile(String outputTableName) throws IOException {
  if (!isGenerateOnly()) {
    return File.createTempFile("hive-script-", ".txt",
        new File(options.getTempDir()));
  } else {
    return new File(new File(options.getCodeOutputDir()),
        outputTableName + ".q");
  }
}
So basically you should be able to do only the DDL generation by using the --generate-only option in conjunction with --outdir; the DDL script will be created in the specified output directory and named after your table.
For example, based on the link you provided:
sqoop import --verbose --fields-terminated-by ',' --connect jdbc:mysql://localhost/test --table employee --hive-import --warehouse-dir /user/hive/warehouse --fields-terminated-by ',' --split-by id --hive-table employee --outdir /tmp/mysql_to_hive/ddl --generate-only
will create /tmp/mysql_to_hive/ddl/employee.q
Alternatively, one could use the create-hive-table tool to do that. The create-hive-table tool populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS, or one planned to be imported. This effectively performs the --hive-import step of sqoop-import without running the preceding import. For example:
sqoop create-hive-table --connect jdbc:mysql://localhost/demo
--username root --table t2 --fields-terminated-by ',' --hive-table t2
This command will create a blank Hive table t2 based on the schema of the same table in MySQL, without importing any data.
