How can I import from a SQL view to Hive with Sqoop - hadoop

I can import SQL tables to Hive, but when I try to import a SQL view, I get errors.
Any ideas?

From Sqoop documentation:
If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
I think that's your case here. Provide an additional option, for example:
--split-by tablex_id
where tablex_id is a field in your view that can be used as a primary key or index.
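As a hedged sketch, a full command for importing a view this way might look like the following; the connection string, credentials, view name, and Hive table name are placeholders, not taken from the original question. In most JDBC-based connectors, --table also accepts a view name, since Sqoop just issues a SELECT against it.
# hypothetical example: import a view, supplying --split-by because a view has no primary key
sqoop import \
  --connect jdbc:mysql://<host>/<database> \
  --username <user> -P \
  --table <view_name> \
  --split-by tablex_id \
  --hive-import --hive-table <hive_table>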

There is no special command to import a view from an RDBMS.
Use the regular import command with --split-by, which Sqoop treats the same way as a primary key.

You can import RDBMS view data, but you have to pass -m 1 or --split-by in your command.
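A minimal sketch of the -m 1 variant (all names below are placeholders); with a single mapper no split column is needed, at the cost of losing parallelism:
# hypothetical: single-mapper import of a view, so no --split-by is required
sqoop import --connect jdbc:mysql://<host>/<database> --username <user> -P --table <view_name> -m 1 --hive-import --hive-table <hive_table>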

Related

Sqoop import to route records with null values in a particular column into another table

I am trying to move records with null values in a particular column into one table and non-null records into another during a Sqoop import. I tried to explore on Google, but there is not much beyond the --null-string and --null-non-string params, and those will just replace nulls with the defined characters ...
I can think of the following ways to handle it:
Once imported into Hive, run a dedup to filter out the records, but this is something to try only in the worst case.
Handle it at the Sqoop level itself (no clue on this).
Could any expert here help me with the above ask?
ENV details: it's a plain Apache Hadoop cluster, Sqoop version 1.4.6.
We can try making use of the --query option along with the sqoop import command:
--query 'select * from <table> where <column> is null and $CONDITIONS'
And in a similar way for the not-null condition as well.
There will be 2 sqoop import jobs here.
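A hedged sketch of those two jobs; the connection details, table, column, and target directories are placeholders. Note that $CONDITIONS must appear in the query, and a free-form query needs either -m 1 or --split-by:
# job 1: rows where the column IS NULL
sqoop import \
  --connect jdbc:mysql://<host>/<database> --username <user> -P \
  --query 'select * from <table> where <column> is null and $CONDITIONS' \
  --target-dir /user/data/<table>_nulls \
  -m 1
# job 2: rows where the column IS NOT NULL
sqoop import \
  --connect jdbc:mysql://<host>/<database> --username <user> -P \
  --query 'select * from <table> where <column> is not null and $CONDITIONS' \
  --target-dir /user/data/<table>_non_nulls \
  --split-by <id_column>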

Incremental Sqoop import to a Hive table

It is known that the --incremental switch of Sqoop import doesn't work for Hive imports. But what is the workaround for that?
1) One thing I could come up with is to create a Hive table, bring the incremental data to HDFS through Sqoop, and then manually load it. But if we do it that way, each time we do that load the data would be overwritten. Please correct me if I am wrong.
2) How effective is --query when sqooping data into Hive?
Thank you
You can do a Sqoop incremental append to the Hive table. There is no straight option, but below is one way you can achieve it.
Store the incremental table as an external table in Hive.
It is more common to import the incremental changes since the last time the data was updated and then merge them. In the following example, --check-column is used to fetch records newer than last_import_date, which is the date of the last incremental data update:
sqoop import --connect jdbc:teradata://{host name}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
For the second part of your question:
--query is also a very useful argument you can leverage in a Sqoop import. It gives you the flexibility of basic joins on the RDBMS tables and the flexibility to play with date and time formats. If I were in your shoes, I would do this: using the query, import the data in the shape I need, then append it to my original table; while loading from the temporary table to the main table, I can play more with the data. I would suggest using a query if the updates are not too frequent.
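A hedged sketch of that flow, with table, column, and path names as placeholders: pull only the rows newer than the last load with --query into a staging directory, then append the staged files to the main Hive table (INTO TABLE appends, unlike OVERWRITE):
# 1) import only the new/changed rows into a staging directory
sqoop import \
  --connect <jdbc-url> --username <user> -P \
  --query 'select * from SOURCE_TBL where modified_date > <last_import_date> and $CONDITIONS' \
  --target-dir /user/hive/incremental_staging \
  -m 1
-- 2) in Hive, append the staged files to the main table (file format and schema assumed to match)
LOAD DATA INPATH '/user/hive/incremental_staging' INTO TABLE main_table;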

Sqoop import all the primary-key and non-primary-key tables at a time

Can we import tables using Sqoop having both primary-key and non-primary-key tables at the same time? E.g., I have 200 primary-key tables and 200 non-primary-key tables in the database. How can we import all 400 tables at once?
In addition to Jamie's answer:
You can add the --autoreset-to-one-mapper flag to your sqoop import-all-tables... command.
Say you are using 8 mappers (-m 8) in your command. Then, with the above flag, tables with primary keys will be split across the number of mappers, and tables without primary keys will be loaded using 1 mapper.
So, overall, your efficiency will improve.
Check the first point of the Sqoop documentation for details.
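A hedged example of such a command, with connection details as placeholders:
# tables with a primary key split across 8 mappers; tables without one fall back to a single mapper
sqoop import-all-tables \
  --connect jdbc:mysql://<host>/<database> \
  --username <user> -P \
  --warehouse-dir /user/<user>/imports \
  -m 8 \
  --autoreset-to-one-mapper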
Yes, you can add the flag -m 1 to the import command for all your tables (both the 200 with a primary key and the 200 without).
By adding this option, Sqoop will use only one mapper to retrieve the data from each table, so your command would look something like this:
sqoop import-all-tables --connect your-database --username user --password pwd -m 1

How to deal with a primary key while exporting data from Hive to an RDBMS using Sqoop

Here is my scenario: I have data in the Hive warehouse and I want to export this data into a table named "sample" in the "test" database in MySQL. What happens if one column is a primary key in test.sample and the data in Hive (which we are exporting) has duplicate values under that key? Then obviously the job will fail, so how could I handle this kind of scenario?
Thanks in advance
If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test --table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Before performing the export operation, massage your data by removing duplicates on the primary key: take a distinct on that primary key column and then export to MySQL.
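A hedged HiveQL sketch of that dedup step; the table is assumed to have columns (id, name, amount) with id as the primary key on the MySQL side, all hypothetical names. Keep one row per key with row_number(), write the result to a staging table, and point the export at that table's directory instead:
-- keep exactly one row per primary-key value before exporting
CREATE TABLE sample_dedup AS
SELECT id, name, amount
FROM (
  SELECT id, name, amount,
         row_number() OVER (PARTITION BY id ORDER BY id) AS rn
  FROM sample
) t
WHERE rn = 1;
Then run the sqoop export above with --export-dir /user/hive/warehouse/sample_dedup.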

How to create an external table in Hive using Sqoop. Need suggestions

Using Sqoop I can create a managed table but not an external table.
Please let me know what the best practices are for unloading data from a data warehouse and loading it into a Hive external table.
1. The tables in the warehouse are partitioned. Some are partitioned by date, some by state.
Please share the thoughts or practices you use in a production environment.
Sqoop does not support creating Hive external tables. Instead you might:
Use the Sqoop codegen command to generate the SQL for creating the Hive internal table that matches your remote RDBMS table (see http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_codegen_literal)
Modify the generated SQL to create a Hive external table
Execute the modified SQL in Hive
Run your Sqoop import command, loading into the pre-created Hive external table
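A hedged sketch of steps 2-4 under illustrative assumptions; the column list, delimiter, and paths are hypothetical, not the actual codegen output:
-- 2)/3) the generated CREATE TABLE statement, edited to be EXTERNAL with an explicit LOCATION, run in Hive
CREATE EXTERNAL TABLE my_ext_table (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hive/external/my_ext_table';
# 4) import into the directory backing the external table; --append lets Sqoop write into the existing directory
sqoop import \
  --connect jdbc:mysql://<host>/<database> --username <user> -P \
  --table <source_table> \
  --target-dir /user/hive/external/my_ext_table \
  --fields-terminated-by ',' \
  --append \
  -m 1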
Step 1: Import data from MySQL into a Hive table.
sqoop import \
  --connect jdbc:mysql://localhost/ \
  --username training --password training \
  --table <table-name> \
  --hive-import --hive-table <hive-table-name> \
  -m 1 \
  --fields-terminated-by ','
Step 2: In Hive, change the table type from managed to external.
Alter table <Table-name> SET TBLPROPERTIES('EXTERNAL'='TRUE')
Note: you can import directly into a Hive table, or else into the HDFS location that backs Hive.
My best suggestion is to Sqoop your data to HDFS and create an EXTERNAL table for raw operations and transformations.
Finally, mash the data up into the internal table. I believe this is one of the best practices for getting things done in a proper way.
Hope this helps!!!
Refer to these links:
https://mapr.com/blog/what-kind-hive-table-best-your-data/
In the above, if you want to skip ahead, go directly to the point 2.2.1, External or Internal.
https://hadoopsters.net/2016/07/15/hive-tables-internal-and-external-explained/
After referring to the first link, the second will clarify most of your questions.
Cheers!!
