Given a SQL script containing the DDL for creating tables in a MySQL database, I would like to convert the script into Hive DDL so that I can create the tables in Hive. I could have written an interpreter myself, but I thought there might be details I could miss (e.g. data type conversion: int, bigint, time, date, etc.) since I am very new to Hive DDL.
I have seen this thread, How to transfer mysql table to hive?, which mentions Sqoop (http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html). However, from what I see, Sqoop certainly translates the DDL, but only as an intermediate step (so the translated DDL is nowhere to be found). Am I missing the command that would output the translation, taking the MySQL DDL as input?
For example, my MySQL DDL looks like:
CREATE TABLE `user_keyword` (
  `username` varchar(32) NOT NULL DEFAULT '',
  `keyword_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`username`,`keyword_id`),
  KEY `keyword_id` (`keyword_id`),
  CONSTRAINT `analyst_keywords_ibfk_1` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`keyword_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
And the output Hive DDL would be like:
CREATE TABLE user_keyword (
  username string,
  keyword_id int
);
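For reference, a slightly fuller hand-written Hive equivalent is sketched below. The delimiter and storage clauses are assumptions on my part rather than part of the question, Hive has no enforced primary/foreign keys to carry over, and MySQL's unsigned int maps to int or bigint depending on the range you need:
-- sketch only; adjust delimiter/storage to your data layout
CREATE TABLE user_keyword (
  username   STRING,
  keyword_id INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;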
I actually thought this was not supported, but after looking at the source, here is what I saw in HiveImport.java:
/**
 * @return true if we're just generating the DDL for the import, but
 * not actually running it (i.e., --generate-only mode). If so, don't
 * do any side-effecting actions in Hive.
 */
private boolean isGenerateOnly() {
  return generateOnly;
}

/**
 * @return a File object that can be used to write the DDL statement.
 * If we're in gen-only mode, this should be a file in the outdir, named
 * after the Hive table we're creating. If we're in import mode, this should
 * be a one-off temporary file.
 */
private File getScriptFile(String outputTableName) throws IOException {
  if (!isGenerateOnly()) {
    return File.createTempFile("hive-script-", ".txt",
        new File(options.getTempDir()));
  } else {
    return new File(new File(options.getCodeOutputDir()),
        outputTableName + ".q");
  }
}
So basically you should be able to do only the DDL generation using the option --generate-only in conjunction with --outdir; the generated script will be created in the specified output directory and named after your table.
For example, based on the link you provided:
sqoop import --verbose --fields-terminated-by ',' --connect jdbc:mysql://localhost/test --table employee --hive-import --warehouse-dir /user/hive/warehouse --split-by id --hive-table employee --outdir /tmp/mysql_to_hive/ddl --generate-only
will create /tmp/mysql_to_hive/ddl/employee.q
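The generated .q file is an ordinary Hive script. Its exact contents depend on the source table's columns, but a rough, illustrative sketch for a simple employee table might look like the following (the column names and types here are made up; the real ones come from the MySQL schema):
-- illustrative sketch only
CREATE TABLE IF NOT EXISTS `employee` (
  `id` INT,
  `name` STRING,
  `salary` DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;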
Alternatively, one could use the create-hive-table tool to do that. The create-hive-table tool populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS, or one planned to be imported. This effectively performs the --hive-import step of sqoop-import without running the preceding import. For example,
sqoop create-hive-table --connect jdbc:mysql://localhost/demo
--username root --table t2 --fields-terminated-by ',' --hive-table t2
This command will create an empty Hive table t2 based on the schema of the table of the same name in MySQL, without importing any data.
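Once the table exists, you can inspect the definition Sqoop produced from the Hive shell, for example:
SHOW CREATE TABLE t2;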
Related
I am having an issue sqooping from a Teradata database when using the Teradata method "--fast-export"; an example sqoop command is below:
-Dhadoop.security.credential.provider.path=jceks:/PATH/TO/password/password.jcecks
-Dteradata.db.job.data.dictionary.usexviews=false
--connect jdbc:teradata://DATABASE
--password-alias password.alias
--username USER
--connection-manager org.apache.sqoop.teradata.TeradataConnManager
--fields-terminated-by '\t'
--lines-terminated-by '\n'
--null-non-string ''
--null-string ''
--num-mappers 8
--split-by column3
--target-dir /THE/TARGET/DIR
--query "SELECT column1,column2,column3 WHERE column3 > '2020-01-01 00:00:00' and column3 <= '2020-01-12 10:41:20' AND $CONDITIONS"
--
--method internal.fastexport
The error I am getting is
Caused by: com.teradata.connector.common.exception.ConnectorException: java.sql.SQLException: [Teradata Database] [TeraJDBC ] [Error 3524] [SQLState 42000] The user does not have CREATE VIEW access to database DATABASE.
I suspect fast export causes a staging table/view to be created temporarily, and that the job under the hood ingests from that temporary object. Is this a Sqoop mechanism, and is it possible to turn it off?
Many thanks
Dan
Fast export does not create any view to extract data. The view is created by Sqoop based on the --query value; hence, the user running the job must have the CV right granted on DATABASE.
You can check the user's rights on the database by running the query below, replacing USER_NAME and DATABASE_NAME with their values in your environment.
ACCESS_RIGHT = 'CV' means CREATE VIEW, so leave that condition as it is.
SELECT *
FROM dbc.allRoleRights
WHERE roleName IN (SELECT roleName FROM dbc.roleMembers WHERE grantee = 'USER_NAME')
  AND DATABASENAME = 'DATABASE_NAME'
  AND ACCESS_RIGHT = 'CV'
ORDER BY 1,2,3,5;
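If the right turns out to be missing, a Teradata DBA can grant it directly (or via a role); a minimal sketch, with USER_NAME and DATABASE_NAME again as placeholders:
GRANT CREATE VIEW ON DATABASE_NAME TO USER_NAME;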
You may also need the CT (CREATE TABLE) right in order to create the error/log table for fast export, which is specified via the Sqoop parameters --error-table and --error-database.
I am importing data from MemSQL to HDFS using Sqoop. My source table in MemSQL doesn't have any integer column, so I created a new table that includes a new column 'test' along with the existing columns.
The following is the query:
sqoop import --connect jdbc:mysql://XXXXXXXXX:3306/db_name --username XXXX --password XXXXX --query "select closed,extract_date,open,close,cast(floor(rand()*1000000 as int) as test from tble_name where \$CONDITIONS" --target-dir /user/XXXX --split-by test;
This query gave me the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'as int) as test from table_name where (1 = 0)' at line 1
I tried it another way as well:
sqoop import --connect jdbc:mysql://XXXXX:3306/XXXX --username XXXX --password XXXX --query "select closed,extract_date,open,close,ceiling(rand()*1000000) as test from table_name where \$CONDITIONS" --target-dir /user/dfsdlf --split-by test;
With this query the job gets executed, but no data is transferred. It says the split-by column is of float type and must strictly be changed to an integer type.
Please help me change the split-by column from float type to integer type.
The problem mostly seems to be related to the use of an alias as the --split-by parameter.
If you need to split by that particular derived column, you can run the query
'select closed,extract_date,open,close,ceiling(rand()*1000000) from table_name' in the console, note the column name it produces for that expression, and use it in --split-by 'complete_column_name_from_console' (here it should be --split-by 'ceiling(rand()*1000000)').
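As an aside, the syntax error from the first attempt is not a Sqoop issue but a misplaced parenthesis; the expression inside --query would need to read roughly as follows (SIGNED being the usual integer cast target in MySQL/MemSQL syntax):
select closed, extract_date, open, close,
       cast(floor(rand()*1000000) as signed) as test
from tble_name where \$CONDITIONS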
I have a table that doesn't have any primary key or date-modified/timestamp column. It is just like a transaction table that keeps saving all data (no deletes/updates).
My problem now is that I want to ingest the data into HDFS without loading the whole table again every time I run the incremental load.
The code below imports the latest rows to HDFS when the table has a primary key:
sqoop job \
--create tb_w_PK_DT_append \
-- \
import \
--connect jdbc:mysql://10.217.55.176:3306/SQOOP_Test \
--username root \
--incremental append \
--check-column P_id \
--last-value 0 \
--target-dir /data \
--query "SELECT * FROM tb_w_PK_DT WHERE \$CONDITIONS" \
-m 1;
Is there any solution to import only the latest data without a primary key or date-modified column?
I know I am a bit late to answer this, but I just wanted to share for reference. Consider the scenario where you don't have a primary key column or a date column on your source table and you want to sqoop only the incremental data to HDFS.
Let's say there's a table which holds a history of data, new rows are inserted on a daily basis, and you just need the newly inserted rows in HDFS. If your source is SQL Server, you can create an INSERT or UPDATE trigger on your history table.
You can create an INSERT trigger as shown below:
CREATE TRIGGER transactionInsertTrigger
ON [dbo].[TransactionHistoryTable]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO [dbo].[TriggerHistoryTable]
    (
        product, price, payment_type, name, city, state, country, Last_Modified_Date
    )
    SELECT
        product, price, payment_type, name, city, state, country, GETDATE() AS Last_Modified_Date
    FROM
        inserted i
END
Create a table to hold the records whenever an insert event occurs on your main table. Keep the schema the same as your main table; however, you can add extra columns to it.
The above trigger will insert a row into this table whenever a new row gets inserted into your main TransactionHistoryTable.
CREATE TABLE [dbo].[TriggerHistoryTable](
[product] [varchar](20) NULL,
[price] [int] NULL,
[payment_type] [varchar](20) NULL,
[name] [varchar](20) NULL,
[city] [varchar](20) NULL,
[state] [varchar](20) NULL,
[country] [varchar](20) NULL,
[Last_Modified_Date] [date] NULL
) ON [PRIMARY]
Now, if we insert two new rows into the main TransactionHistoryTable, this insert event fires our trigger, which copies the two rows into TriggerHistoryTable in addition to the main TransactionHistoryTable:
insert into [Transaction_db].[dbo].[TransactionHistoryTable]
values
('Product3',2100,'Visa','Cindy' ,'Kemble','England','United Kingdom')
,('Product4',50000,'Mastercard','Tamar','Headley','England','United Kingdom')
;
select * from TriggerHistoryTable;
Now you can sqoop from TriggerHistoryTable, which will contain the daily inserted or updated records. You can also use incremental Sqoop, since we have added a date column to it. Once you have imported the data to HDFS, you can clear this table daily or weekly, for example as sketched below. This is just an example with SQL Server; you can have triggers with Teradata, Oracle, and other databases as well, and you can also set up an UPDATE/DELETE trigger.
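A minimal sketch of that periodic cleanup, assuming rows older than the current day have already been imported (the cutoff is my assumption; adjust it to your import schedule):
-- hypothetical cleanup: remove rows already picked up by earlier imports
DELETE FROM [dbo].[TriggerHistoryTable]
WHERE Last_Modified_Date < CAST(GETDATE() AS date);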
You can follow these steps (a HiveQL sketch of the same idea follows the list):
1) The initial load data (the previous day's data) is in HDFS - relation A.
2) Import the current data into HDFS using Sqoop - relation B.
3) Use Pig to load the above two HDFS directories into relations A and B, defining the schema.
4) Convert them to tuples and join them on all columns.
5) The join result will have two tuples in each row ((A,B),(A,B)); fetch the rows from the join where tuple B is null ((A,D),).
6) Now flatten the join by tuple A and you will have the new/updated records (A,D).
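The same new-versus-previous comparison can also be sketched in HiveQL rather than Pig (this is an alternative I'm naming explicitly, not the answer's own method; the table and column names are made up):
-- hypothetical tables: previous_load (relation A) and current_load (relation B);
-- col1..col3 stand in for "all columns" and are assumed non-NULL here
SELECT b.*
FROM current_load b
LEFT OUTER JOIN previous_load a
  ON a.col1 = b.col1
 AND a.col2 = b.col2
 AND a.col3 = b.col3
WHERE a.col1 IS NULL;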
If your data has a field like a rowid, you can import incrementally using --last-value in the Sqoop arguments.
Please refer to https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
I am getting the error "Unrecognized argument --hive-partition-key" when I run the following statement:
sqoop import
--connect 'jdbc:sqlserver://192.168.56.1;database=xyz_dms_cust_100;username-hadoop;password=hadoop'
--table e_purchase_category
--hive_import
--delete-target-dir
--hive-table purchase_category_p
--hive-partition-key "creation_date"
--hive-partition-value "2015-02-02"
The partitioned table exists.
The Hive partition key (creation_date in your example) should not be part of your database table when you are using hive-import. When you create a partitioned table in Hive, you do not include the partition column in the table schema; the same applies to Sqoop's hive-import.
Based on your sqoop command, I am guessing that the creation_date column is present in your SQL Server table. If so, you might be getting this error:
ERROR tool.ImportTool: Imported Failed:
Partition key creation_date cannot be a column to import.
To resolve this issue, I have two solutions:
Make sure that the partition column is not present in the SQL Server table. Then, when Sqoop creates the Hive table, it adds the partition column and its value as a directory in the Hive warehouse.
Change the sqoop command to use a free-form query that selects all the columns except the partition column, and do the hive-import. Below is an example of this solution.
Example:
sqoop import
--connect jdbc:mysql://localhost:3306/hadoopexamples
--query 'select City.ID, City.Name, City.District, City.Population from City where $CONDITIONS'
--target-dir /user/XXXX/City
--delete-target-dir
--hive-import
--hive-table City
--hive-partition-key "CountryCode"
--hive-partition-value "USA"
--fields-terminated-by ','
-m 1
Another method:
You can also try to do the task in separate steps:
Create a partitioned table in Hive (example: city_partition), as sketched after this list.
Load data from the RDBMS using sqoop hive-import into a plain (non-partitioned) Hive table (example: city).
Using INSERT OVERWRITE, load data into the partitioned table (city_partition) from the plain Hive table (city), like:
INSERT OVERWRITE TABLE city_partition
PARTITION (CountryCode='USA')
SELECT id, name, district, population FROM city;
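For the first step, the partitioned table might be declared along these lines (a sketch only; the column types are assumptions based on the query above):
CREATE TABLE city_partition (
  id INT,
  name STRING,
  district STRING,
  population INT
)
PARTITIONED BY (CountryCode STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';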
This approach can be applied too:
sqoop import --connect jdbc:mysql://localhost/akash
--username root
-P
--table mytest
--where "dob='2019-12-28'"
--columns "id,name,salary"
--target-dir /user/cloudera/
--m 1 --hive-table mytest
--hive-import
--hive-overwrite
--hive-partition-key dob
--hive-partition-value '2019-12-28'
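After the import, the new partition can be verified from the Hive shell, for example:
SHOW PARTITIONS mytest;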
I have a MySQL table that looks like:
Member_ID <- primary key
Member_Name
Member_Type
I ran the command below:
./bin/sqoop import --connect jdbc:mysql://${ip}/testdb --username root --password blabla \
--query 'SELECT * from member where Member_ID < 5 AND $CONDITIONS' \
--split-by Member_ID --hbase-create-table --hbase-table member --column-family i
But after the import, the HBase table looks like:
rowkey - row : 1
Columns - Member_name=bla, Member_Type=bla
Note that Sqoop turned my Member_ID into the row key, which is expected. But in the columns, I am seeing all the other fields except Member_ID. Is there any way I can have Member_ID as my row key and also have the Member_ID column included in the column family?
Does this also mean that if my primary key is not called "id", then after the sqoop import I lose the name of my primary key? In my case, after the import, I have no way of knowing that the row key used to be called "Member_ID".
Got it sorted by setting the property sqoop.hbase.add.row.key, e.g.:
sqoop import -Dsqoop.hbase.add.row.key=true