Incremental data load using sqoop without primary key or timestamp - hadoop

I have a table that doesn't have a primary key or a datemodified/timestamp column. It is like a transaction table that keeps accumulating data (no deletes/updates).
My problem is that I want to ingest the data into HDFS without reloading the whole table every time I run the incremental load.
The job below appends only the newest rows to HDFS when the table has a primary key:
sqoop job \
--create tb_w_PK_DT_append \
-- \
import \
--connect jdbc:mysql://10.217.55.176:3306/SQOOP_Test \
--username root \
--incremental append \
--check-column P_id \
--last-value 0 \
--target-dir /data \
--query "SELECT * FROM tb_w_PK_DT WHERE \$CONDITIONS" \
-m 1;
Is there any way to import only the latest data when the table has no primary key or date-modified column?

I know I am a bit late to answer this, but I wanted to share it for reference. This covers the scenario where you don't have a primary key column or a date column on your source table and you want to sqoop only the incremental data to HDFS.
Let's say there is a table that holds a history of data, new rows are inserted daily, and you only need the newly inserted rows in HDFS. If your source is SQL Server, you can create an INSERT or UPDATE trigger on your history table.
You can create an INSERT trigger as shown below:
CREATE TRIGGER transactionInsertTrigger
ON [dbo].[TransactionHistoryTable]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
INSERT INTO [dbo].[TriggerHistoryTable]
(
product ,price,payment_type,name,city,state,country,Last_Modified_Date
)
SELECT
product,price,payment_type,name,city,state,country,GETDATE() as Last_Modified_Date
FROM
inserted i
END
Create a table to hold the records whenever an insert event occurs on your main table. Keep the schema the same as your main table; you can also add extra columns to it (here, Last_Modified_Date).
The trigger above will insert a row into this table whenever a new row is inserted into your main TransactionHistoryTable.
CREATE TABLE [dbo].[TriggerHistoryTable](
[product] [varchar](20) NULL,
[price] [int] NULL,
[payment_type] [varchar](20) NULL,
[name] [varchar](20) NULL,
[city] [varchar](20) NULL,
[state] [varchar](20) NULL,
[country] [varchar](20) NULL,
[Last_Modified_Date] [date] NULL
) ON [PRIMARY]
Now, if we insert two new rows into the main TransactionHistoryTable, the trigger fires on the insert event and writes those two rows into TriggerHistoryTable as well as the main TransactionHistoryTable:
insert into [Transaction_db].[dbo].[TransactionHistoryTable]
values
('Product3',2100,'Visa','Cindy' ,'Kemble','England','United Kingdom')
,('Product4',50000,'Mastercard','Tamar','Headley','England','United Kingdom')
;
select * from TriggerHistoryTable;
Now you can sqoop from TriggerHistoryTable, which holds only the daily inserted or updated records. Since we added a date column, you can also use an incremental Sqoop import against it, as sketched below. Once you have imported the data to HDFS, you can clear this table daily or weekly. This is just an example with SQL Server; you can create triggers on Teradata, Oracle, and other databases as well, and you can also set up UPDATE/DELETE triggers.
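A minimal sketch of such an incremental import, assuming a SQL Server JDBC connection and the Last_Modified_Date column from the trigger table above (host, credentials, and paths are placeholders, not from the original post):
sqoop job \
--create trigger_history_incr \
-- \
import \
--connect "jdbc:sqlserver://<host>:1433;databaseName=Transaction_db" \
--username <user> \
--password <password> \
--table TriggerHistoryTable \
--incremental lastmodified \
--check-column Last_Modified_Date \
--last-value "1900-01-01 00:00:00" \
--append \
--target-dir /data/trigger_history \
-m 1
Each run imports only rows whose Last_Modified_Date is newer than the stored last value, and the saved job keeps track of that value for you.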

You can follow these steps (a Pig sketch is shown after the list):
1) The initial load (previous day's data) is already in HDFS -- relation A.
2) Import the current data into HDFS using Sqoop -- relation B.
3) Use Pig to load the two HDFS directories into relations A and B, defining the schema.
4) Join the two relations on all columns, keeping unmatched rows from the new import (an outer join).
5) Each joined row carries fields from both sides; keep the rows where the old (A) side is null, since those rows exist only in the new import.
6) Project the new-import (B) side of those rows; these are the new/updated records.
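A minimal Pig sketch of this approach, assuming comma-delimited files and hypothetical columns (product, price, name); adjust the paths and schema to your data:
-- previous day's data already in HDFS
A = LOAD '/data/previous' USING PigStorage(',') AS (product:chararray, price:int, name:chararray);
-- today's full import produced by Sqoop
B = LOAD '/data/current' USING PigStorage(',') AS (product:chararray, price:int, name:chararray);
-- keep every row of B, matching against A on all columns
J = JOIN B BY (product, price, name) LEFT OUTER, A BY (product, price, name);
-- rows with no match in A exist only in the new import
NEW_ROWS = FILTER J BY A::product IS NULL;
-- project only the B-side columns
RESULT = FOREACH NEW_ROWS GENERATE B::product, B::price, B::name;
STORE RESULT INTO '/data/incremental' USING PigStorage(',');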

If your data has a monotonically increasing field like a rowid, you can use Sqoop's incremental append mode with --check-column and --last-value (see the sketch after the link).
Please refer to https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
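A minimal sketch, assuming the source table has a column named rowid (connection details are placeholders):
sqoop import \
--connect jdbc:mysql://<host>:3306/<db> \
--username <user> \
--table <table> \
--incremental append \
--check-column rowid \
--last-value 0 \
--target-dir /data/<table> \
-m 1
On subsequent runs, pass the highest rowid imported so far as --last-value, or save this as a sqoop job so the value is tracked automatically.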

Related

CRUD operations in Hive

I'm trying to do CRUD operations in Hive. Insert queries run successfully, but when I try to run update and delete I get the exception below.
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
List of the queries I ran
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
INSERT INTO TABLE pageviews PARTITION (datestamp)
VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');
Source : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete
Update and delete queries I'm trying to run
update students1 set age = 36 where name ='barney rubble';
update students1 set name = 'barney rubble1' where age =36;
delete from students1 where age=32;
Hive Version : 2.1(Latest)
Note: I'm aware that Hive is not meant for UPDATE and DELETE commands on big data sets; I'm still trying this to learn about Hive CRUD operations.
Can someone point out where I'm going wrong with the update/delete queries?
Make sure you are setting the properties listed here:
https://community.hortonworks.com/questions/37519/how-to-activate-acid-transactions-in-hive-within-h.html
I tested this in Hive 1.1.0 (CDH 5.8.3) and it works with the same example you provided in your comment.
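For reference, a sketch of the settings typically needed for Hive ACID (transactional) tables; exact requirements vary by Hive version, so treat this as an outline rather than the definitive list from the linked page:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
-- the table must be bucketed, stored as ORC, and marked transactional
-- (here the students table is recreated with the transactional property)
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
With these in place, UPDATE and DELETE against the transactional table should no longer raise the SemanticException about the transaction manager.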

How transfer a Table from HBase to Hive?

How can I transfer an HBase table into Hive correctly?
What I tried before can be read in this question:
How insert overwrite table in hive with diffrent where clauses?
(I made one table to import all the data. The problem is that the data is still in rows and not in columns, so I made three tables for news, social and all, each with a specific where clause. After that I did two joins on the tables to get the result table. So I had six tables in total, which is not really performant!)
To sum up my problem: in HBase the column family values are stored as rows, like this:
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a datastructure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is the approach you can use.
Use the HBase storage handler to create the table in Hive.
Example script:
CREATE TABLE hbase_table_1(key string, value string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you provided into a Hive external table.
select name,collect_set(concat_ws(',',type,val)) input from TESTTABLE
group by name ;
I am grouping the data by name. For the sample data, the query returns one row per name with an array like ["all,1","social,0","news,1"] (the literal used as input in the next query).
Now I wrote a custom mapper that takes that array as input and emits the values as separate columns:
FROM (
SELECT '["all,1","social,0","news,1"]' AS input FROM TESTTABLE GROUP BY name
) d
MAP d.input USING 'python test.py' AS all, social, news;
Alternatively, you can insert the output into another table that has the columns name, all, social, news.
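The original test.py is not shown in the answer; a minimal sketch of what such a mapper could look like, assuming the input is the array string shown above and the desired output order is all, social, news (a hypothetical script, adjust to your data):
#!/usr/bin/env python
# Hypothetical Hive MAP/TRANSFORM script: reads lines like
#   ["all,1","social,0","news,1"]
# from stdin and emits tab-separated values in the order all, social, news.
import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    pairs = json.loads(line)              # e.g. ["all,1","social,0","news,1"]
    values = {}
    for pair in pairs:
        key, val = pair.split(',', 1)     # "all,1" -> ("all", "1")
        values[key] = val
    print('\t'.join([values.get('all', ''),
                     values.get('social', ''),
                     values.get('news', '')]))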
Hope this helps

Hive and Sqoop partition

I have sqooped data from a Netezza table and the output file is in HDFS, but one column is a timestamp and I want to load it as a date column in my Hive table. Using that column I want to partition the table by date. How can I do that?
Example: in HDFS the data looks like 2013-07-30 11:08:36.
In Hive I want to load only the date (2013-07-30), not the timestamp, and partition on that column daily.
How can I pass the partition-by column dynamically?
I have tried loading the data into one table as a source; in the final table I will do insert overwrite table partition (date_column=<dynamic date>) select * from table1.
Set these 2 properties -
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And the query can look like this (the target table name and column list are placeholders):
INSERT OVERWRITE TABLE <target_table> PARTITION (DATE_STR)
SELECT
<col1>,
<col2>,
...
-- the partition column must be the last column
to_date(date_column) AS DATE_STR
FROM table1;
You can also explore the two hive-import options below; with an incremental import you can load directly into the current day's partition (see the sketch after this list):
--hive-partition-key
--hive-partition-value
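A minimal sketch using these options, assuming a Netezza JDBC connection and hypothetical table and column names (adjust to your environment):
sqoop import \
--connect jdbc:netezza://<host>:5480/<db> \
--username <user> \
--password <password> \
--table SOURCE_TABLE \
--hive-import \
--hive-table emp_history \
--hive-partition-key date_str \
--hive-partition-value '2013-07-30' \
-m 1
Note that --hive-partition-value takes a single static value per run, so this pattern suits loading one day's data into that day's partition.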
You can simply load the EMP_HISTORY table from EMP by enabling dynamic partitioning and converting the timestamp to a date with the to_date function. The code might look something like this:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE EMP_HISTORY PARTITION (join_date)
SELECT e.name AS name, e.age AS age, e.salary AS salary, e.loc AS loc, to_date(e.join_date) AS join_date FROM EMP e;

Sqoop Import - primary key is not included in the column family

I have a MySQL table that looks like:
Member_ID <- primary key
Member_Name
Member_Type
I ran the command below:
./bin/sqoop import \
--connect jdbc:mysql://${ip}/testdb \
--username root \
--password blabla \
--query 'SELECT * from member where Member_ID < 5 AND $CONDITIONS' \
--split-by Member_ID \
--hbase-create-table \
--hbase-table member \
--column-family i
But after the import, the HBase table looks like:
rowkey - row : 1
Columns - Member_name=bla, Member_Type=bla
Note that Sqoop turned my Member_ID into the row key, which is expected. But in the columns I see all the other fields except Member_ID. Is there any way I can keep Member_ID as my row key and also have the Member_ID column included in the column family?
Does this also mean that if my primary key is not called "id", I lose the name of my primary key after the Sqoop import? In my case, after the import, there is no indication that the row key used to be called "Member_ID".
Got it sorted by setting the property sqoop.hbase.add.row.key, e.g.:
sqoop import -Dsqoop.hbase.add.row.key=true
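Applied to the original command it might look like this (a sketch; the -D generic option has to come right after the tool name):
./bin/sqoop import \
-Dsqoop.hbase.add.row.key=true \
--connect jdbc:mysql://${ip}/testdb \
--username root \
--password blabla \
--query 'SELECT * from member where Member_ID < 5 AND $CONDITIONS' \
--split-by Member_ID \
--hbase-create-table \
--hbase-table member \
--column-family i
With this property set, the row key column (Member_ID) is also stored inside the column family, so the original column name is not lost.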

How to convert mysql DDL into hive DDL

Given a SQL script containing DDL for creating tables in a MySQL database, I would like to convert the script into Hive DDL so that I can create the tables in Hive. I could have written an interpreter myself, but thought there might be details I could miss (e.g., data format conversion: int, bigint, time, date, etc.) since I am very new to Hive DDL.
I have seen this thread How to transfer mysql table to hive?, which mentioned Sqoop (http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html). However, from what I can see, Sqoop certainly translates the DDL, but only as an intermediate step (so the translated DDL is nowhere to be found). Am I missing the command that would output the translation given the MySQL DDL as input?
For example, my MySQL DDL look like:
CREATE TABLE `user_keyword` (
`username` varchar(32) NOT NULL DEFAULT '',
`keyword_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`username`,`keyword_id`),
KEY `keyword_id` (`keyword_id`),
CONSTRAINT `analyst_keywords_ibfk_1` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`keyword_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
And the output Hive DDL would be like:
CREATE TABLE user_keyword (
username string,
keyword_id int
);
I actually thought this was not supported, but after looking at the source, here is what I saw in HiveImport.java:
/**
 * @return true if we're just generating the DDL for the import, but
 * not actually running it (i.e., --generate-only mode). If so, don't
 * do any side-effecting actions in Hive.
 */
private boolean isGenerateOnly() {
  return generateOnly;
}

/**
 * @return a File object that can be used to write the DDL statement.
 * If we're in gen-only mode, this should be a file in the outdir, named
 * after the Hive table we're creating. If we're in import mode, this should
 * be a one-off temporary file.
 */
private File getScriptFile(String outputTableName) throws IOException {
  if (!isGenerateOnly()) {
    return File.createTempFile("hive-script-", ".txt",
        new File(options.getTempDir()));
  } else {
    return new File(new File(options.getCodeOutputDir()),
        outputTableName + ".q");
  }
}
So basically you should be able to do only the DDL generation by using the option --generate-only in conjunction with --outdir; the table DDL will be created in the specified output directory, in a file named after the table.
For example based on the link you provided:
sqoop import --verbose \
--connect jdbc:mysql://localhost/test \
--table employee \
--hive-import \
--hive-table employee \
--warehouse-dir /user/hive/warehouse \
--fields-terminated-by ',' \
--split-by id \
--outdir /tmp/mysql_to_hive/ddl \
--generate-only
will create /tmp/mysql_to_hive/ddl/employee.q
Alternatively, one could use the create-hive-table tool. The create-hive-table tool populates a Hive metastore with a table definition based on a database table that was previously imported to HDFS, or one planned to be imported. This effectively performs the --hive-import step of sqoop-import without running the preceding import. For example:
sqoop create-hive-table \
--connect jdbc:mysql://localhost/demo \
--username root \
--table t2 \
--fields-terminated-by ',' \
--hive-table t2
This command will create a blank hive table t2 based on the schema of the same table in MySQL without importing the data.
