Drop Columns while importing Data in sqoop

I am importing data from Oracle to Hive. My table doesn't have any integer column that can be used as a primary key, so I am not able to use one as my split-by column.
Alternatively, I created a row_num column for all rows present in the table; this row_num column will be used as the split-by column. Finally, I want to drop this column from my Hive table.
The column list is huge; I don't want to select all columns using --columns, nor do I want to create any temporary table for this purpose.
Please let me know whether we can handle this with Sqoop arguments.

Maybe a little tweak on the --query parameter can help you?
Something like below:
sqoop import --query 'query string'
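As a sketch of that idea (the JDBC URL, credentials, and table name below are placeholders, not from the question), the free-form query can project all columns plus Oracle's ROWNUM as a synthetic split column, so no integer primary key is needed:

```shell
# Hypothetical sketch: use Oracle's ROWNUM as a synthetic split-by column.
# dbhost, ORCL, scott, and mytable are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --query 'SELECT t.*, ROWNUM AS row_num FROM mytable t WHERE $CONDITIONS' \
  --split-by row_num \
  --target-dir /user/hive/warehouse/mytable \
  --hive-import --hive-table mytable
```

Note that row_num still lands in the Hive table this way; Sqoop itself has no "drop this column after import" argument, so the column would then have to be removed on the Hive side (for example with ALTER TABLE ... REPLACE COLUMNS, or a CTAS copy).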

Related

How to create a table in HBase shell without passing values & row key?

Hello, I am new to HBase. My question is: how can I create a table in HBase with a column family and column names inside the column family, without passing values and a row key? Is it possible to create that table in the HBase shell?
In SQL we create a table first and add data later; how can we do the same thing in HBase?
HBase is a NoSQL key-value database. Tables can be created just by specifying a table name and a column family, for example create 'sampletable', 'm', where sampletable is the table name and m is the column family. Column names (qualifiers) are not part of the schema; they are supplied per row when you write data. If you want to use SQL queries on HBase, try Apache Phoenix.
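A minimal hedged sketch of such a session (the table, row key, and qualifier names are made up for illustration): the table is created with only a column family, and column qualifiers appear only once data is put:

```shell
# Hypothetical HBase shell session: schema holds just the column family;
# the qualifier m:name exists only because a put supplies it.
hbase shell <<'EOF'
create 'sampletable', 'm'
put 'sampletable', 'row1', 'm:name', 'alice'
scan 'sampletable'
EOF
```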

How to drop hive column?

I have two columns, Id and Name, in a Hive table, and I want to delete the Name column. I have used the following command:
ALTER TABLE TableName REPLACE COLUMNS(id string);
The result was that the Name column values were assigned to the Id column.
How can I drop a specific column of the table? Is there any other command in Hive to achieve my goal?
In addition to the existing answers to the question Alter hive table add or drop column:
As per Hive documentation,
REPLACE COLUMNS removes all existing columns and adds the new set of columns.
REPLACE COLUMNS can also be used to drop columns. For example, ALTER TABLE test_change REPLACE COLUMNS (a int, b int); will remove column c from test_change's schema.
The query you are using is right, but it modifies only the schema, i.e. the metastore. It does not modify anything on the data side.
So, before dropping the column you should make sure that you have the correct data file.
In your case the data file should not contain Name values.
If you don't want to modify the file, then create another table with only the specific column that you need:
CREATE TABLE tablename AS SELECT id FROM already_existing_table;
Let me know if this helps.
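One hedged way to turn that CTAS idea into a full copy-and-swap (the table name TableName is from the question; the _new/_old suffixes are my own convention, and you should verify the copy before dropping anything):

```shell
# Sketch: copy the wanted column into a new table, then swap names.
hive -e "
  CREATE TABLE TableName_new AS SELECT id FROM TableName;
  ALTER TABLE TableName RENAME TO TableName_old;
  ALTER TABLE TableName_new RENAME TO TableName;
  -- after verifying the data: DROP TABLE TableName_old;
"
```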

Sqoop: How to deal with duplicate values while importing data from RDBMS to Hive tables

How should duplicate values be handled while importing data from an RDBMS into Hive tables?
And how can redundancy be handled if the values are already present in the Hive tables?
If your data has a unique identifier and you are running incremental imports, you can specify it as the --merge-key value of the import. This will merge the values that were already in the table with the newest ones; the newer value will override the older one.
If you are not running incremental imports, you can use sqoop merge to unify the data.
From the Sqoop docs:
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.
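A hedged sketch of a sqoop merge invocation, following the shape shown in the Sqoop docs (the directories are placeholders, and the jar file and class name come from an earlier sqoop codegen or import step):

```shell
# Merge a newer dataset onto an older one, keyed on the id column.
# Paths, datatypes.jar, and Foo are placeholders for your own artifacts.
sqoop merge \
  --merge-key id \
  --new-data /user/hive/incoming/newer \
  --onto /user/hive/incoming/older \
  --target-dir /user/hive/warehouse/merged \
  --jar-file datatypes.jar \
  --class-name Foo
```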
The important thing is that you have a single unique primary key for each record. Otherwise, you can generate one when importing the data: run the import with --query and build the new unique-key column in the select, concatenating existing columns until you get a unique combination.
--query "SELECT CONVERT(VARCHAR(128), [colum1]) + '_' + CONVERT(VARCHAR(128), [column2]) AS CompoundKey ,* FROM [dbo].[tableName] WHERE \$CONDITIONS" \
There is no direct Sqoop option that will provide the solution you are looking for. You will have to set up an EDW-style process to achieve your goal:
1. Import data into a staging table (in Hive; create a staging database for this purpose). This should be a copy of the target table, but data types may vary as per your transformation requirements.
2. Load data from the staging table (Hive) into the target table (Hive), applying transformations. In your case:
Insert into table trgt.table
select * from stg.table stg_tbl
where stg_tbl.col1 not in (select col1 from trgt.table);
Here trgt is the target database and stg is the staging database; both are in Hive.

Add a fixed value column to a table in Hive

I am porting my Pig script to Hive. I need to add a status column to a table which is imported from an Oracle database.
My Pig script looks like this:
user_data = LOAD 'USER_DATA' USING PigStorage(',') AS (USER_ID:int,MANAGER_ID:int,USER_NAME:int);
user_data_status = FOREACH user_data GENERATE
USER_ID,
MANAGER_ID,
USER_NAME,
'active' AS STATUS;
Here I am adding the STATUS column with the value 'active' to the user_data table.
How can I add such a column to an existing table while importing the table via HiveQL?
As far as I know, you will have to reload the data, as you did in Pig.
For example, if you already have the table user_data with columns USER_ID:int, MANAGER_ID:int, USER_NAME:int and you want USER_ID:int, MANAGER_ID:int, USER_NAME:int plus a STATUS column with the value 'active',
you can reload the table user_data_status by using something like this:
INSERT OVERWRITE TABLE user_data_status SELECT *, 'active' AS STATUS FROM user_data;
Though there are options to add columns to the existing table, that would only update the metadata in the metastore, and the values would default to NULL.
If I were you, I would rather reload the complete data than try to update the entire table with an UPDATE command after altering the column structure. Hope this helps!

How to deal with a primary key while exporting data from Hive to an RDBMS using Sqoop

Here is my scenario: I have data in the Hive warehouse and I want to export it into a table named "sample" in the "test" database in MySQL. What happens if one column is the primary key in test.sample and the data in Hive (which we are exporting) has duplicate values under that key? Obviously the job will fail, so how can I handle this kind of scenario?
Thanks in advance.
If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test -table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Alternatively, before performing the export operation, clean your data by removing duplicates on the primary key: take a DISTINCT on that primary column and then export to MySQL.
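One hedged way to do that deduplication in Hive before exporting (the column names id and name are assumptions standing in for the real schema of the sample table, and the MySQL host is a placeholder):

```shell
# Keep exactly one row per primary-key value, then export the clean copy.
hive -e "
  CREATE TABLE sample_dedup AS
  SELECT id, name FROM (
    SELECT id, name,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
    FROM sample
  ) t
  WHERE rn = 1;
"
sqoop export \
  --connect jdbc:mysql://dbhost/test --table sample \
  --username root -P \
  --export-dir /user/hive/warehouse/sample_dedup
```

A plain SELECT DISTINCT works too when entire rows are duplicated; the ROW_NUMBER() variant also handles rows that share a key but differ in other columns.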
