Sqoop Import - primary key is not included in the column family - sqoop

I have a MySQL table that looks like this:
Member_ID <- primary key
Member_Name
Member_Type
I ran the command below:
./bin/sqoop import --connect jdbc:mysql://${ip}/testdb -username
root -password blabla --query 'SELECT * from member where Member_ID <
5 AND $CONDITIONS' --split-by Member_ID --hbase-create-table
--hbase-table member --column-family i
But after the import, the HBase table looks like this:
rowkey - row : 1
Columns - Member_name=bla, Member_Type=bla
Note that Sqoop turned my Member_ID into the row key, which is expected. But in my columns I am seeing all the other fields except Member_ID. Is there any way I can have Member_ID as my row key and also have the Member_ID column included in the column family?
Does this also mean that, if my primary key is not called "id", after the Sqoop import I lose the name of my primary key? In my case, after the import, I have no idea that the row key used to be called "Member_ID".

Got it sorted by setting the property sqoop.hbase.add.row.key.
e.g. sqoop import -Dsqoop.hbase.add.row.key=true
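For reference, a minimal sketch of the original import with that property set (generic -D options must come immediately after the tool name; connection details are the placeholders from the question):
./bin/sqoop import -Dsqoop.hbase.add.row.key=true \
--connect jdbc:mysql://${ip}/testdb \
--username root --password blabla \
--query 'SELECT * from member where Member_ID < 5 AND $CONDITIONS' \
--split-by Member_ID \
--hbase-create-table --hbase-table member \
--column-family i
With this property enabled, Sqoop keeps Member_ID as the HBase row key and also stores it as a regular column inside column family i.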

Related

Hadoop-Sqoop import without an integer value using split-by

I am importing data from MemSQL to HDFS using Sqoop. My source table in MemSQL doesn't have any integer column, so I created a new table that adds a column 'test' to the existing columns.
Following is the query:
sqoop import --connect jdbc:mysql://XXXXXXXXX:3306/db_name --username XXXX --password XXXXX --query "select closed,extract_date,open,close,cast(floor(rand()*1000000 as int) as test from table_name where \$CONDITIONS" --target-dir /user/XXXX --split-by test;
This query gave me the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'as int) as test from table_name where (1 = 0)' at line 1
I tried it another way as well:
sqoop import --connect jdbc:mysql://XXXXX:3306/XXXX --username XXXX --password XXXX --query "select closed,extract_date,open,close,ceiling(rand()*1000000) as test from table_name where \$CONDITIONS" --target-dir /user/dfsdlf --split-by test;
With this query the job gets executed, but no data is transferred. Sqoop says the split-by column is of float type and must be changed strictly to an integer type.
Please help me change the split-by column from float type to integer type.
The problem mostly seems to be related to the use of an alias as the --split-by parameter.
If it's required to split on that particular column, you can run the query
'select closed,extract_date,open,close,ceiling(rand()*1000000) from table_name' in the console, take the column name the console reports for that expression, and use it in --split-by 'complete_column_name_from_console' (here it should be --split-by 'ceiling(rand()*1000000)').
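Putting that together, a sketch of the adjusted command (same placeholder connection details as in the question; the alias is dropped and the expression itself is passed to --split-by):
sqoop import \
--connect jdbc:mysql://XXXXX:3306/XXXX \
--username XXXX --password XXXX \
--query "select closed,extract_date,open,close,ceiling(rand()*1000000) from table_name where \$CONDITIONS" \
--split-by 'ceiling(rand()*1000000)' \
--target-dir /user/dfsdlf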

I want to sqoop data into a Hive column-partitioned table using a sqoop import job. How can we do this?

I have a Hive table partitioned on the state column.
My RDBMS columns are as follows:
id int, fname varchar(45), lname varchar(45), email varchar(45), password varchar(45), street varchar(45), city varchar(45), state varchar(45), zipcode varchar(45), c_time timestamp
sample data:
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,69696,2017-07-20 20:24:17.0
Sqoop Job:
sqoop job --create customer_partition -- import --connect jdbc:mysql://host/serverName
--username root -P --table customers --check-column c_time --incremental lastmodified
--last-value 0 --merge-key id --target-dir '/user/cloudera/partitionedTables/customers_partition/'
--fields-terminated-by ',' --hive-import
--hive-table customers_partition --hive-partition-key "state";
Hive partitioned Table:
create external table customers_partition(id int, fname varchar(64), lname varchar(64), email varchar(64),
password varchar(64), street varchar(45), city varchar(45), zipcode varchar(64), cob_dt timestamp)
partitioned by (state varchar(45))
row format delimited
fields terminated by ','
location '/hdfsPath/customers_partition/';
After the Sqoop import, the output file in the HDFS folder contains data in the format below:
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,69696,2017-07-20 20:24:17.0
which has the same columns, in the same order, as the RDBMS table.
When I run the Hive query select * from customers_partition; it returns 0 records.
This is because, due to the partition, the Hive table's column arrangement differs from the normal RDBMS column arrangement.
How can we solve this issue? I want to sqoop data directly into the Hive partitioned table, and records need to be updated whenever I run this sqoop job. If I am going about it wrong, is there an alternative way to do this?
Also, how can I do the same with two or more Hive partition columns?
You need to add the --hive-partition-value argument. The partition value must be a string.
Since you are using sqoop job --create, --last-value 0 is not required. Please remove it.
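A minimal sketch of the Hive-import part with the partition arguments added (the partition value 'TX' is hypothetical, and the state column is excluded via --columns because a Hive partition key cannot also be an imported column):
sqoop import \
--connect jdbc:mysql://host/serverName \
--username root -P \
--table customers \
--columns "id,fname,lname,email,password,street,city,zipcode,c_time" \
--fields-terminated-by ',' \
--hive-import \
--hive-table customers_partition \
--hive-partition-key "state" \
--hive-partition-value "TX"
Note that --hive-partition-value is a single static value, so each run loads into exactly one partition; you would need one run (or one saved job) per state value.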

Incremental data load using sqoop without primary key or timestamp

I have a table that doesn't have any primary key or datemodified/timestamp column. This table is just like a transaction table that keeps accumulating data (no deletes or updates).
My problem is that I want to load the data into HDFS without re-importing the whole table every time I run the incremental load.
The code below gets the latest rows imported to HDFS when the table has a primary key:
sqoop job \
--create tb_w_PK_DT_append \
-- \
import \
--connect jdbc:mysql://10.217.55.176:3306/SQOOP_Test \
--username root \
--incremental append \
--check-column P_id \
--last-value 0 \
--target-dir /data \
--query "SELECT * FROM tb_w_PK_DT WHERE \$CONDITIONS" \
-m 1;
Is there any solution to import only the latest data without a primary key or date-modified column?
I know I am a bit late to answer this, but I just wanted to share it for reference, for the scenario where you don't have a primary key column or a date column on your source table and you want to sqoop only the incremental data to HDFS.
Let's say there is a table which holds a history of data, new rows are inserted into it on a daily basis, and you just need the newly inserted rows in HDFS. If your source is SQL Server, you can create an INSERT or UPDATE trigger on your history table.
You can create an INSERT trigger as shown below:
CREATE TRIGGER transactionInsertTrigger
ON [dbo].[TransactionHistoryTable]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
INSERT INTO [dbo].[TriggerHistoryTable]
(
product ,price,payment_type,name,city,state,country,Last_Modified_Date
)
SELECT
product,price,payment_type,name,city,state,country,GETDATE() as Last_Modified_Date
FROM
inserted i
END
Create a table to hold the records whenever an insert event occurs on your main table. Keep the schema the same as your main table; however, you can add extra columns to it.
The above trigger will insert a row into this table whenever a new row gets inserted into your main TransactionHistoryTable.
CREATE TABLE [dbo].[TriggerHistoryTable](
[product] [varchar](20) NULL,
[price] [int] NULL,
[payment_type] [varchar](20) NULL,
[name] [varchar](20) NULL,
[city] [varchar](20) NULL,
[state] [varchar](20) NULL,
[country] [varchar](20) NULL,
[Last_Modified_Date] [date] NULL
) ON [PRIMARY]
Now if we insert two new rows into the main TransactionHistoryTable, this insert event fires our trigger, which inserts the two rows into TriggerHistoryTable in addition to the main TransactionHistoryTable:
insert into [Transaction_db].[dbo].[TransactionHistoryTable]
values
('Product3',2100,'Visa','Cindy' ,'Kemble','England','United Kingdom')
,('Product4',50000,'Mastercard','Tamar','Headley','England','United Kingdom')
;
select * from TriggerHistoryTable;
Now you can sqoop from TriggerHistoryTable, which holds only the daily inserted or updated records. You can also use incremental sqoop, since we have added a date column to this table. Once you have imported the data into HDFS, you can clear this table on a daily or weekly basis. This is just an example with SQL Server; you can have triggers with Teradata, Oracle, and other databases as well, and you can also set up an UPDATE/DELETE trigger.
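For example, a sketch of an incremental import from the trigger table, keyed on the added Last_Modified_Date column (-m 1 because the trigger table has no primary key to split on; host, credentials, and the starting --last-value are hypothetical placeholders):
sqoop import \
--connect 'jdbc:sqlserver://<host>:1433;databaseName=Transaction_db' \
--username <user> --password <password> \
--table TriggerHistoryTable \
--target-dir /data/TriggerHistoryTable \
--incremental lastmodified \
--check-column Last_Modified_Date \
--last-value '2019-01-01' \
--append \
-m 1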
You can follow these steps
1) The initial load data (previous day's data) is in HDFS - relation A
2) Import the current data into HDFS using sqoop - relation B
3) Use Pig to load the above two HDFS directories into relations A and B and define the schema.
4) Convert them to tuples and join them by all columns.
5) The join result will have two tuples in each row ((A,B),(A,B)); fetch the result from the join where tuple B is null ((A,D),).
6) Now flatten the join by tuple A and you will have the new/updated records (A,D).
If your data has a field like rowid, you can import using --last-value in the sqoop arguments.
Please refer to https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

Error unrecognized argument --hive-partition-key

I am getting the error Unrecognized argument --hive-partition-key when I run the following statement:
sqoop import
--connect 'jdbc:sqlserver://192.168.56.1;database=xyz_dms_cust_100;username=hadoop;password=hadoop'
--table e_purchase_category
--hive-import
--delete-target-dir
--hive-table purchase_category_p
--hive-partition-key "creation_date"
--hive-partition-value "2015-02-02"
The partitioned table exists.
The Hive partition key (creation_date in your example) should not be part of your database table when you are using hive-import. When you create a table in Hive with a partition, you do not include the partition column in the table schema; the same applies to sqoop hive-import as well.
Based on your sqoop command, I am guessing that the creation_date column is present in your SQL Server table. If so, you might be getting this error:
ERROR tool.ImportTool: Imported Failed:
Partition key creation_date cannot be a column to import.
To resolve this issue, I have two solutions:
Make sure the partition column is not present in the SQL Server table. Then, when Sqoop creates the Hive table, it adds the partition column and its value as a directory in the Hive warehouse.
Change the sqoop command to use a free-form query that selects all the columns except the partition column, and do the hive-import. Below is an example of this solution.
Example:
sqoop import
--connect jdbc:mysql://localhost:3306/hadoopexamples
--query 'select City.ID, City.Name, City.District, City.Population from City where $CONDITIONS'
--target-dir /user/XXXX/City
--delete-target-dir
--hive-import
--hive-table City
--hive-partition-key "CountryCode"
--hive-partition-value "USA"
--fields-terminated-by ','
-m 1
Another method:
You can also try to do the task in separate steps:
Create a partitioned table in Hive (example: city_partition).
Load the data from the RDBMS into a plain Hive table using sqoop hive-import (example: city).
Using INSERT OVERWRITE, load the data from the plain Hive table (city) into the partitioned table (city_partition), like:
INSERT OVERWRITE TABLE city_partition
PARTITION (CountryCode='USA')
SELECT id, name, district, population FROM city;
This could be applied too:
sqoop import --connect jdbc:mysql://localhost/akash
--username root
--P
--table mytest
--where "dob='2019-12-28'"
--columns "id,name,salary"
--target-dir /user/cloudera/
--m 1 --hive-table mytest
--hive-import
--hive-overwrite
--hive-partition-key dob
--hive-partition-value '2019-12-28'

How to convert mysql DDL into hive DDL

Given a SQL script containing DDL for creating tables in a MySQL database, I would like to convert the script into Hive DDL so that I can create the tables in Hive. I could have written an interpreter myself, but I thought there might be details I could miss (e.g. data format conversion, int, bigint, time, date, etc.) since I am very new to Hive DDL.
I have seen this thread, How to transfer mysql table to hive?, which mentions Sqoop: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html. However, from what I can see, Sqoop does translate the DDL, but only as an intermediate step (thus the translated DDL is nowhere to be found). Am I missing the command that would output the translation, given the MySQL DDL as input?
For example, my MySQL DDL looks like:
CREATE TABLE `user_keyword` (
`username` varchar(32) NOT NULL DEFAULT '',
`keyword_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`username`,`keyword_id`),
KEY `keyword_id` (`keyword_id`),
CONSTRAINT `analyst_keywords_ibfk_1` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`keyword_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
And the output Hive DDL would be like:
CREATE TABLE user_keyword (
username string,
keyword_id int
);
I actually thought this was not supported, but after looking at the source, here is what I saw in HiveImport.java:
/**
* @return true if we're just generating the DDL for the import, but
* not actually running it (i.e., --generate-only mode). If so, don't
* do any side-effecting actions in Hive.
*/
private boolean isGenerateOnly() {
return generateOnly;
}
/**
* @return a File object that can be used to write the DDL statement.
* If we're in gen-only mode, this should be a file in the outdir, named
* after the Hive table we're creating. If we're in import mode, this should
* be a one-off temporary file.
*/
private File getScriptFile(String outputTableName) throws IOException {
if (!isGenerateOnly()) {
return File.createTempFile("hive-script-", ".txt",
new File(options.getTempDir()));
} else {
return new File(new File(options.getCodeOutputDir()),
outputTableName + ".q");
}
}
So basically you should be able to do only the DDL generation using the option --generate-only in conjunction with --outdir; the DDL will be created in the specified output dir, in a file named after your table.
For example, based on the link you provided:
sqoop import --verbose --fields-terminated-by ',' --connect jdbc:mysql://localhost/test --table employee --hive-import --warehouse-dir /user/hive/warehouse --fields-terminated-by ',' --split-by id --hive-table employee --outdir /tmp/mysql_to_hive/ddl --generate-only
will create /tmp/mysql_to_hive/ddl/employee.q
Alternatively, one could use the create-hive-table tool to do that. The create-hive-table tool populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS, or one planned to be imported. This effectively performs the --hive-import step of sqoop-import without running the preceding import. For example,
sqoop create-hive-table --connect jdbc:mysql://localhost/demo
-username root --table t2 --fields-terminated-by ',' --hive-table t2
This command will create a blank hive table t2 based on the schema of the same table in MySQL without importing the data.
