How to skip header in Sqoop? - sqoop

How can I run a Sqoop export from HDFS to MSSQL while skipping the header row of my CSV?
I've tried to research this, but I can't find any answer, so this is the problem I'm having right now.
If my CSV has a header, some of my data is not saved; I think Sqoop is skipping some of the rows. When I changed my table's data types to varchar, the header also got saved. That's why I deleted the header and ran the export again, and it saved my data to the table without any problem.
The CSV files I receive always have a header, which is why I'm looking for a Sqoop setting to skip the CSV header.
Thank you.

Try sending everything to a staging table and using a stored procedure call. In that stored procedure you can filter out the header record before inserting into the final table. I could not find any direct property for this either. Otherwise, the best way is to strip the header from the file before running Sqoop; you are already aware of that process.
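For illustration, a minimal T-SQL sketch of such a procedure (the table and column names are hypothetical, and the staging columns are assumed to be varchar so the header row actually loads): the header ends up in staging as a row whose card_id column holds the literal column name, so it can simply be filtered out.
CREATE PROCEDURE dbo.load_card_transactions
AS
BEGIN
    -- move everything except the header record into the final table
    INSERT INTO dbo.card_transactions (card_id, transaction_dt, amount)
    SELECT card_id, transaction_dt, amount
    FROM dbo.stg_card_transactions
    WHERE card_id <> 'card_id';              -- skip the header row

    -- clear the staging table for the next load
    TRUNCATE TABLE dbo.stg_card_transactions;
END;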

-- First remove the header from the file, then run the steps below
-- (sed -i edits a local copy; strip the header before loading the file into HDFS)
sed -i 1d your_File_Name.csv
--Verify count
select count(*) from stg_card_transactions;
--Remove Dups from Stg Table
alter ignore table stg_card_transactions
add unique index idx_card_txns (card_id,transaction_dt);
--Verify no dups
select card_id,transaction_dt,count(*) from stg_card_transactions group by card_id,transaction_dt having count(*) > 1;
--Dropping index used for removing dups
alter table stg_card_transactions drop index idx_card_txns;
--Loading main table
insert into card_transactions
select card_id,member_id,amount,postcode,pos_id,STR_TO_DATE(transaction_dt,'%d-%m-%Y %H:%i:%s'),status from stg_card_transactions;
commit;
--Verify the count
select count(*) from card_transactions;

Related

Cursor and CSV using utl_file package

Hi, I want to create a CSV file using PL/SQL's UTL_FILE package. For that I am opening a cursor and writing its rows with UTL_FILE, but I don't want to write duplicate data, because I want to create that CSV file daily from the same table. Please help.
I tried using a cursor, but I have no idea how to restrict duplicate entries, since I create the CSV file from the same table on a daily basis.
A cursor selects data; it is its where clause that filters which data it'll return.
Therefore, set it so that it fetches only the rows you're interested in. For example, one option is to use a timestamp column which tells when that particular row was inserted into the table. The cursor would then
select ...
from that_table
where timestamp_column >= trunc(sysdate)
to select only data created today. It is up to you to adjust that condition to any other value you want.
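For illustration, a minimal PL/SQL sketch, assuming a directory object CSV_DIR and a table with id, name and a created_ts timestamp column (all names are hypothetical):
DECLARE
  l_file UTL_FILE.FILE_TYPE;
BEGIN
  -- open (and overwrite) today's extract file in the CSV_DIR directory object
  l_file := UTL_FILE.FOPEN('CSV_DIR', 'daily_extract.csv', 'w');
  -- the cursor's WHERE clause keeps only rows inserted today,
  -- so rows already exported on previous days are never written again
  FOR r IN (SELECT id, name
              FROM that_table
             WHERE created_ts >= TRUNC(SYSDATE))
  LOOP
    UTL_FILE.PUT_LINE(l_file, r.id || ',' || r.name);
  END LOOP;
  UTL_FILE.FCLOSE(l_file);
END;
/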

How to create a table definition from a CSV file and also copy data at the same time

I want to load data from a CSV file into Vertica. I don't want to create the table and then copy the data in two separate steps. Instead, I want to create the table, specify the CSV file and then let Vertica figure out the column definitions (names, data types) itself and load the data.
Something like create table titanic_train () as COPY FROM '/data/train.csv' PARSER fcsvparser() rejected data as table titanic_train_rejected abort on error no commit;
Is it possible?
I guess that if a table has 100s of columns then automating the create table, column definition and data copy would be much easier/faster than doing these steps separately.
It's always several steps, no matter what.
Use the built-in bits of Vertica:
CREATE FLEX TABLE foo();
COPY foo FROM '/data/mycsvs/foo.csv' PARSER fCsvParser();
SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('foo');
-- THEN, either:
SELECT * FROM foo_view;
-- OR: create a ROS Table:
CREATE TABLE foo_ros AS SELECT * FROM foo_view;
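As a quick sanity check (a sketch, not a required step), the keys table that COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW builds alongside the view should let you see which column names and data types Vertica guessed from the CSV:
SELECT key_name, frequency, data_type_guess FROM foo_keys;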
Or get a CSV-to-DDL parser from the net, like https://github.com/marco-the-sane/d2l, install it, and then:
$ d2l -coldelcomma -chardelquote -drp -copy /data/mycsvs/foo.csv | vsql
So, in the second instance, it's one step, but it calls both d2l and vsql.

How to take a backup as insert queries from an Oracle select statement inside a UNIX batch job?

I wrote a UNIX batch job which updates a table with some "where" conditions. Before updating those records, I need to take a backup (insert statements) of the records returned by those "where" conditions and store it in a ".dat" file. Could you please help with this?
The most straightforward way to create a backup of the table would be to use a CREATE TABLE ... AS SELECT statement with the where condition(s) of your update statement. For example, let's take a sample update statement:
UPDATE sometable
SET field1 = 'value'
WHERE company = 'Oracle'
This update would update the field1 column of every row where the company name is Oracle. You could create a backup of sometable by issuing the following command:
CREATE TABLE sometable_backup AS (SELECT * FROM sometable WHERE company = 'Oracle');
This will create a table called sometable_backup that will contain all of the rows that match the where clause of the update.
You can then use Data Pump or another utility to create an export .dat file of that specific table. You can use that .dat file to import into other databases.

Update/Edit records in Hdfs using Hive

I have some records of people in HDFS. I use an external table in Hive to view and analyze that data, and I can also use it externally in other programs.
Recently I got a use case where I have to update the data in HDFS. From the documentation I learned that we can't update or delete data using an external table.
Another problem is that the data is not in ORC format; it is actually in TEXTFILE format, so I am unable to update or delete data in an internal table either. As it is in production, I can't copy it anywhere to convert it to ORC format. Please suggest how to edit the data in HDFS.
You can update or delete using INSERT OVERWRITE plus a select from the table itself, applying filters and additional transformations:
insert overwrite table mytable
select col1, --apply transformations here
       col2, --for example: case when col2=something then something_else else col2 end as col2
       ...
       colN
from mytable
where ... --filter out the records you want to delete
This approach works for both external and managed tables and for all storage formats. Just write a SELECT that returns the required dataset and add INSERT OVERWRITE on top of it.
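For example, a sketch with a hypothetical people table: the CASE expression "updates" a column value, and the WHERE clause "deletes" rows by leaving them out of the rewritten data.
insert overwrite table people
select id,
       case when city = 'Bombay' then 'Mumbai' else city end as city, --the "update"
       age
from people
where id <> 12345; --the "delete": rows that fail the filter are dropped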

Hive delete duplicate records

In Hive, how can I delete duplicate records? Below is my case.
First, I load data from the product table into products_rcfileformat. There are 25 rows of records in the product table:
FROM products INSERT OVERWRITE TABLE products_rcfileformat
SELECT *;
Second, I load data from the product table into products_rcfileformat again. There are still 25 rows of records in the product table, but this time I'm NOT using the OVERWRITE clause:
FROM products INSERT INTO TABLE products_rcfileformat
SELECT *;
When I query the data it gives me a total of 50 rows, which is right.
Checking HDFS, it seems HDFS made another copy of the file, xxx_copy_1, instead of appending to 000000_0.
Now I want to remove the records that came from xxx_copy_1. How can I achieve this with a Hive command? If I'm not mistaken, I can remove the xxx_copy_1 file using the hdfs dfs -rm command, followed by rerunning the insert overwrite command. But I want to know whether this can be done with a Hive command, for example something like a delete statement?
Partition your data such that the rows you want to drop (use the window function row_number to identify them) are in a partition unto themselves. You can then drop the partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.
More detail about partitioning: www.tutorialspoint.com/hive/hive_partitioning.htm
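A sketch of that idea using the tables from the question (the dedup table name and the ordering column are hypothetical): row_number() marks every copy beyond the first as a duplicate, dynamic partitioning routes those copies into their own partition, and dropping that partition removes them without touching the rest of the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- partitioned copy of the table; is_dup=1 will hold the duplicate rows
CREATE TABLE products_dedup (product_id INT, product_name STRING, price DOUBLE)
PARTITIONED BY (is_dup INT);

INSERT OVERWRITE TABLE products_dedup PARTITION (is_dup)
SELECT product_id, product_name, price,
       CASE WHEN rn = 1 THEN 0 ELSE 1 END AS is_dup
FROM (
  SELECT product_id, product_name, price,
         row_number() OVER (PARTITION BY product_id ORDER BY product_id) AS rn
  FROM products_rcfileformat
) t;

-- the duplicates now sit in the is_dup=1 partition only
ALTER TABLE products_dedup DROP PARTITION (is_dup = 1);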
Regarding "it seems HDFS made another copy of the file, xxx_copy_1, instead of appending to 000000_0": the reason is that HDFS files are write-once, not editable in place, and the Hive warehouse files (or whatever the table location may be) live in HDFS, so Hive has to create a second file.
Regarding "Now I want to remove the records that came from xxx_copy_1. How can I achieve this with a Hive command?": please check this post - Removing DUPLICATE rows in hive based on columns. Let me know if you are satisfied with the answer there. I have another method, which removes duplicate entries, but it may not be in the way you want.
