incremental import in sqoop on a table with jumbled data and no modified time column - hadoop

Suppose I have a table Customer:
CustomerID CustomerName CustomerBill
7 John 100
2 Bill 500
4 Mark 200
Here CustomerID is the primary key but the records are in no particular order. There is no modified time column in the corresponding table in the database. The previous entries can change as well. How do I do incremental imports on the data?
The database I am using is Sybase and importing it to Hive.

Records are in no particular order, so append mode cannot be used.
There is no modified-time column in the corresponding table in the database, so lastmodified mode cannot be used.
Sqoop does not do anything special here: it needs an incrementing ID or an update timestamp to build a SQL query that fetches ONLY the inserted/updated records.
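To make that concrete, this is roughly the kind of filter Sqoop needs to be able to generate in each mode; the column names come from the example table, and modified_ts is a hypothetical column that this table does not have (which is exactly the problem):
-- append mode: fetch only rows whose check column is greater than the
-- last value already imported (assumes the ID grows with every insert)
SELECT CustomerID, CustomerName, CustomerBill FROM Customer WHERE CustomerID > 7;
-- lastmodified mode: fetch only rows touched since the last import
-- (assumes a modified_ts column that is set on every insert/update)
SELECT CustomerID, CustomerName, CustomerBill FROM Customer WHERE modified_ts >= '2016-01-01 00:00:00';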

Related

PLSQL Daily record of changes on table, then select from day

Oracle PL/SQL question: one table should be archived day by day. The table contains about 50,000 records, but only a few records change during a day. A second table (the destination/history table) has one additional field, import_date. Two days = 100,000 records; it should be 50,000 + a few records with information about the changes during each day.
I need a simple solution to copy data from the source table to the destination like a "LOG": only changes are copied/registered. But I should also be able to check the dataset of the source table for a given day.
Is there a mechanism like MERGE or something similar?
Normally you'd have a day_table and a master_table. All records are loaded from the day_table into the master, and only the master is manipulated, with the day table used to store the raw data.
You could add a new column to the master, such as date_modified, and have the app update this field when a record changes, or a flag used to indicate that it has changed.
Another way to do this is to have an active/latest flag. Instead of changing the record, it is duplicated, with the flag set to indicate which is the latest/old record. This might be easier for comparison,
e.g. select * from master_table where record = 'abcd'
This would show 2 rows - the original loaded at 1pm and the modified active one changed at 2pm.
There's no need to have another table; you could then base a view on this flag,
e.g. CHANGED_RECORDS_VIEW = select * from master_table where flag = 'Y'
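A minimal sketch of that flag-based variant, with illustrative names (the record and some_value columns are made up, and 'Y' marks the changed/active row as in the view above):
-- add the tracking columns described above
ALTER TABLE master_table ADD (date_modified DATE, flag CHAR(1) DEFAULT 'N');
-- instead of updating the loaded row in place, insert a new copy of it
-- with the flag set to mark it as the changed/active version
INSERT INTO master_table (record, some_value, date_modified, flag)
VALUES ('abcd', 'new value', SYSDATE, 'Y');
-- view exposing only the changed records, as in the answer
CREATE OR REPLACE VIEW changed_records_view AS
SELECT * FROM master_table WHERE flag = 'Y';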
I once faced a similar issue; please find the solution below.
Tables we had:
The master table always has records in it and keeps growing.
One backup table to store all the master records on a daily basis.
Solution:
From morning to evening, records are inserted into and updated in the master table. New records were identified by their timestamp: whenever a record is inserted or updated, the corresponding timestamp is stored with it.
At night, we created a job schedule to run a procedure (CREATE_JOB; please check the Oracle documentation for further details) that runs at exactly 10:00 pm to bulk collect all the records in the master table based on today's date and insert them into the backup table.
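A rough sketch of that setup; the table and column names (master_table, backup_table, last_modified) are assumptions, not taken from the question:
CREATE OR REPLACE PROCEDURE archive_master AS
BEGIN
  -- copy today's inserted/updated rows into the backup table
  INSERT INTO backup_table (id, payload, last_modified, backup_date)
  SELECT id, payload, last_modified, SYSDATE
  FROM master_table
  WHERE TRUNC(last_modified) = TRUNC(SYSDATE);
  COMMIT;
END;
/
BEGIN
  -- schedule the procedure to run every night at 10:00 pm
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'NIGHTLY_MASTER_BACKUP',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'ARCHIVE_MASTER',
    repeat_interval => 'FREQ=DAILY;BYHOUR=22;BYMINUTE=0',
    enabled         => TRUE);
END;
/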
This scenario that I have explained should help you. Please also check out the concept of job scheduling. Thank you.

How does Sqoop treat updated rows during import?

Suppose there is a table in Oracle (or any RDBMS) which contains data that is flushed out every day.
example:
1234,Raj,Kolkata,1000,09092015
Suppose I import this row today using a standard Sqoop import and store it in HDFS as a flat file. The next day, the row is deleted from the source table. But suppose the same record is updated (say the sal field 1000 is updated to 2000) after 7 days.
If I then run a Sqoop import again, how will it treat the data and how will it store it?
Will there be two entries of the same record, or will the newer value be updated?
Will this record
<1234, Raj, Kolkata, 1000, 09092015>
be replaced by this one?
<1234, Raj, Kolkata, 2000, 09092015>
If you perform incremental imports in Sqoop, you can control what happens when rows are updated, as well as what happens when new rows are inserted, by means of the --incremental argument. You have two options:
append (sqoop import (...) --incremental append). This option is used when new rows are continually added to your database and you want to import them. In this case, you need to tell Sqoop which column to check (in order to detect these new rows) by means of the --check-column parameter.
lastmodified (sqoop import (...) --incremental lastmodified). This option is what you want in your example: it lets you tell Sqoop that you want to check for updated rows in the table (that you already imported) and set them to the new values. Bear in mind that you have to specify, by means of the --check-column parameter, the column name which Sqoop will use to detect the updated rows, and that this column is required to hold a date value (for instance, date, datetime, time or timestamp). In your example you would need an extra column holding a date value, and you would have to update that value every time you change any of the other columns, in order for that row to be imported.
Of course, if you update a row but do not update the field specified by --check-column in that row, it will not be updated in your destination table.
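For instance, a lastmodified import for a case like yours might look roughly like the sketch below; the JDBC URL, table name, last_upd_date column and target directory are placeholders, not details taken from the question:
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/scott/.pw \
  --table EMP_PAY \
  --incremental lastmodified \
  --check-column last_upd_date \
  --last-value "2015-09-09 00:00:00" \
  --merge-key id \
  --target-dir /data/emp_pay
With --merge-key set to the primary key, Sqoop merges the newly fetched rows with the existing output, so an updated record replaces the old copy instead of appearing twice.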
I hope this helps.

Deleting very large table records where id not in another table

I have one table, values, which has 80 million records, and another table, values_history, which has 250 million records.
I want to filter the values_history table and keep only the data whose id is present in the values table.
delete from values_history where id not in (select id from values);
This query takes such a long time that I have to abort the process.
Please suggest some ideas to speed up the process.
Can I delete the records in batches, say 1,000,000 at a time?
I extracted the required records and inserted them into a temp table; this took 2 hours. After that I dropped the table and then inserted the extracted data back into the main table. The whole process took around 4 hours, which is fine for me. I had dropped the foreign keys and all other constraints before that.
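As a sketch, the sequence described above might look like this (exact DDL varies by database; constraint and index handling is omitted, and the work-table name is made up):
-- 1. extract only the rows worth keeping into a work table
CREATE TABLE values_history_keep AS
SELECT vh.* FROM values_history vh
WHERE EXISTS (SELECT 1 FROM values v WHERE v.id = vh.id);
-- 2. empty the original table (after dropping/disabling its constraints)
TRUNCATE TABLE values_history;
-- 3. load the kept rows back into the original table
INSERT INTO values_history
SELECT * FROM values_history_keep;
-- 4. drop the work table
DROP TABLE values_history_keep;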

Can I insert data multiple times into a bucketed hive table

I have a bucketed hive table. It has 4 buckets.
CREATE TABLE user(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
CLUSTERED BY(user_id) INTO 4 BUCKETS;
Initially I inserted some records into this table using the following query.
set hive.enforce.bucketing = true;
insert into user
select * from second_user;
After this operation, I see in HDFS that 4 files are created under this table's directory.
Later I needed to insert another set of data into the user table, so I ran the query below.
set hive.enforce.bucketing = true;
insert into user
select * from third_user;
Now another 4 files are created under the user table's directory, so it has 8 files in total.
Is this fine to do this kind of multiple inserts into a bucketed table?
Does it affect the bucketing of the table?
I figured it out!!
Actually, if you do multiple inserts on a bucketed Hive table, Hive won't complain as such.
All Hive queries will work fine.
Having said that, such an operation spoils the bucketing of the table. I mean that after multiple inserts into a bucketed table, sampling fails.
TABLESAMPLE doesn't work properly after multiple inserts.
Even the sort-merge bucket map join doesn't work after such an operation.
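For reference, this is the kind of bucket-sampling query that stops behaving as expected once several inserts have each added their own set of 4 files (a sketch against the user table defined above):
-- expects bucket 1 of user_id to live in a single, well-defined file
SELECT * FROM user TABLESAMPLE(BUCKET 1 OUT OF 4 ON user_id);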
I don't think that should be an issue, because you have declared that you want bucketing on user_id, so every time you insert it will create 4 more files.
Bucketing is used for faster query processing, so if it is making 4 more files every time, it will make your query processing even faster.

Import most recent data from CSV to SQL Server with SSIS

Here's the deal; the issue isn't with getting the CSV into SQL Server, it's getting it to work how I want it... which I guess is always the issue :)
I have a CSV file with columns like DATE, TIME, BARCODE, etc. I use a derived column transformation to concatenate the DATE and TIME into a DATETIME for my import into SQL Server, and I import all data into the database. The issue is that we only get a new .CSV file every 12 hours, and for example's sake we will say the .CSV is updated four times in a minute.
With the logic that we will run the job every 15 minutes, we will get a ton of overlapping data. I imagine I will use a variable, say LastCollectedTime, which can be pulled from my SQL database using MAX(READTIME). My problem is that I only want to collect rows with a ReadTime more recent than that variable.
Destination table structure:
ID, ReadTime, SubID, ...datacolumns..., LastModifiedTime where LastModifiedTime has a default value of GETDATE() on the last insert.
Any ideas? Remember, our readtime is a Derived Column, not sure if it matters or not.
Here is one approach that you can make use of:
Let's assume that your destination table in SQL Server is named BarcodeData.
Create a staging table (say BarcodeStaging) in your database that has the same column structure as your destination table BarcodeData, into which the CSV data will be imported.
In the SSIS package, add an Execute SQL Task before the Data Flow Task to truncate the staging table BarcodeStaging.
Import the CSV data into the staging table BarcodeStaging and not into the actual destination table.
Use the MERGE statement (I assume that you are using SQL Server 2008 or a later version) to compare the staging table BarcodeStaging and the actual destination table BarcodeData, using the DateTime column as the join key. If there are unmatched rows, copy them from the staging table and insert them into the destination table, as sketched below.
Technet link to MERGE statement: http://technet.microsoft.com/en-us/library/bb510625.aspx
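A sketch of that MERGE step under the assumptions above (the column list is illustrative; ReadTime stands in for the derived DATETIME key and the remaining data columns are elided):
MERGE INTO dbo.BarcodeData AS target
USING dbo.BarcodeStaging AS source
  ON target.ReadTime = source.ReadTime
WHEN NOT MATCHED BY TARGET THEN
  INSERT (ReadTime, SubID /* ...datacolumns... */)
  VALUES (source.ReadTime, source.SubID /* ... */);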
Hope that helps.
