I recently started working on sqoop - hive/hadoop on Linux. I have to import hive data from one table to oracle table. I am using simple sqoop export to do this. I have 6 million lines in hive table.
This command is giving me very poor performance and taking long time (85 minutes) to complete the job.
Query ->
sqoop export --connect jdbc:oracle:thin:#server:port:db--username user --password password--export-dir /user/hive/warehouse/tb --table tb--columns 'col1,col2..col33' --input-fields-terminated-by ',' --input-null-string '\\N' --input-null-non-string '\\N' -m 1
Is there any configuration change which can help me which can help to make it faster.
It's hard to help without any additional information. I would suggest to start the export job again and monitor the environment to see where the bottle neck is (database? network? hadoop?). It might be also helpful to try OraOop connector as it's usually faster.
Is this a "wide" dataset? It might be a sqoop bug https://issues.apache.org/jira/browse/SQOOP-2920 if number of columns is very high (in hundreds), sqoop starts choking (very high on cpu).
When number of fields is small, it's usually other way around - when sqoop is bored and Oracle can't keep up. In this case we normally don't go over 45-55 mappers.
Related
It is known that --incremental sqoop import switch doesn't work for HIVE import through SQOOP. But what is the workaround for that?
1)One thing I could make up is that we can create a HIVE table, and bring incremental data to HDFS through SQOOP, and then manually load them. but if we are doing it , each time do that load, the data would be overwritten. Please correct me if I am wrong.
2) How effective --query is when sqooping data to HIVE?
Thank you
You can do the sqoop incremental append to the hive table, but there is no straight option, below is one of the way you can achieve it.
Store the incremental table as an external table in Hive.
It is more common to be importing incremental changes since the last time data was updated and then merging it.In the following example, --check-column is used to fetch records newer than last_import_date, which is the date of the last incremental data update:
sqoop import --connect jdbc:teradata://{host name}/Database=retail —connection manager org.apache.sqoop.teradata.TeradataConnManager --username dbc -password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
second part of your question
Query is also a very useful argument you can leverage in swoop import, that will give you the flexibility of basic joins on the rdbms table and flexibility to play with the date and time formats. If I were in your shoes I would do this, using the query I will import the data in the way I need and than I will append it to my original table and while loading from temporary to main table I can play more with the data. I would suggest using query if the updates are not too frequent.
Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1
The source is a MS SQL database of around 1.6TB and around 25 tables
The ultimate objective is to check if the existing queries can run faster on the HDP
There isn't a luxury of time and availability to import the data several times, hence, the import has to be done once and the Hive tables, queries etc. need to be experimented with, for example, first create a normal, partitioned table in ORC. If it doesn't suffice, try indexes and so on. Possibly, we will also evaluate the Parquet format and so on
4.As a solution to 4., I decided to first import the tables onto HDFS in Avro format for example :
sqoop import --connect 'jdbc:sqlserver://server;database=dbname' --username someuser --password somepassword --as-avrodatafile --num-mappers 8 --table tablename --warehouse-dir /dataload/tohdfs/ --verbose
Now I plan to create a Hive table but I have some questions mentioned here.
My question is that given all the points above, what is the safest(in terms of time and NOT messing the HDFS etc.) approach - to first bring onto HDFS, create Hive tables and experiment or directly import in Hive(I dunno if now I delete these tables and wish to start afresh, do I have to re-import the data)
For Loading, you can try these options
1) You can do a mysql import to csv file that will be stored in your Linux file system as backup then do a distcp to HDFS.
2) As mentioned, you can do a Sqoop import and load the data to Hive table (parent_table).
For checking the performance using different formats & Partition table, you can use CTAS (Create Table As Select) queries, where you can create new tables from the base table (parent_table). In CTAS, you can mention the format like parque or avro etc and partition options is also there.
Even if you delete new tables created by CTAS, the base table will be there.
Based on my experience, Parque + partition will give a best performance, but it also depends on your data.
I see that the connection and settings are all correct. But I dint see --fetch-size in the query. By default the --fetch-size is 1000 which would take forever in your case. If the no of columns are less. I would recommend increasing the --fetch-size 10000. I have gone up to 50000 when the no of columns are less than 50. Maybe 20000 if you have 100 columns. I would recommend checking the size of data per row and then decide. If there is one column which has size greater than 1MB data in it. Then I would not recommend anything above 1000.
Is there anyone here who has worked with sqoop and hp vertica?
I am trying to export data from sqoop to vertica and I find that the performance is extremely poor.
I can switch to the HP vertica connector... but I still want to know why sqoop works so slow when exporting data to vertica.
I also found that when inserting data, sqoop does not support upserts against vertica. I want to know if this issue will be fixed anytime soon?
sqoop export -Dsqoop.export.records.per.statement=1 --driver
com.vertica.jdbc.Driver --mysql-delimiters  --username **** --password **** --
connect jdbc:vertica://hostname/schema?ConnectionLoadBalance=1 --export-dir <hdfs-
data-dir> --table <table_name>
One of the issues is that sqoop if forcing us to set sqoop.export.records.per.statement to 1 for Vertica. Otherwise it throws an error.
I've never used sqoop, but the command line data import function in vertica uses the COPY function; basically it makes a temp file and then runs a file import in the background. It wouldn't be a graceful solution, but you could try dumping your data to a gzip and then running the COPY function directly. I find that the gzip is always the bottleneck for files over a certain threshold (~50Mb+), never the COPY. Could be a backdoor to a faster import.
i work sqoop with vertica database, i use sqoop to export data from the vertica to the hive/HDFS and it work grate, you just need to add the vertica jar to the sqoop folder.
when i want to asq vertica on data that in the HDFS/Hive i use the hcatalog of the vertica. in version 8.1.* it comes with the vertica database and you don't need more connectors.
hcatalog
Using sqoop I can create managed table but not the externale table.
Please let me know what are the best practices to unload data from data warehouse and load them in Hive external table.
1.The tables in warehouse are partitioned. Some are date wise partitioned some are state wise partitioned.
Please put your thoughts or practices used in production environment.
Sqoop does not support creating Hive external tables. Instead you might:
Use the Sqoop codegen command to generate the SQL for creating the Hive internal table that matches your remote RDBMS table (see http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_codegen_literal)
Modify the generated SQL to create a Hive external table
Execute the modified SQL in Hive
Run your Sqoop import command, loading into the pre-created Hive external table
Step 1: import data from mysql to hive table.
sqoop import
--connect jdbc:mysql://localhost/
--username training --password training
--table --hive-import --hive-table -m 1
--fields-terminated-by ','
Step 2: In hive change the table type from Managed to External.
Alter table <Table-name> SET TBLPROPERTIES('EXTERNAL'='TRUE')
Note:you can import directly into hive table or else to back end of hive.
My best suggestion is to SQOOP your data to HDFS and create EXTERNAL for Raw operations and Transformations.
Finally mashed up data to the internal table. I believe this is one of the best practices to get things done in a proper way.
Hope this helps!!!
Refer to these links:
https://mapr.com/blog/what-kind-hive-table-best-your-data/
In the above if you want to skip directly to the point -->2.2.1 External or Internal
https://hadoopsters.net/2016/07/15/hive-tables-internal-and-external-explained/
After referring to the 1st link then second will clarify most of your questions.
Cheers!!
I want to import the data from DB2 databese to the hadoop(HDFS,Hive).One way is to do it by sqoop, can we do the same with some other way?Pls share the other approach of doing this..thanks
Sqoop is the best way to go. Anything else would require a serious amount of custom code. I've actually been on a project where we had a pretty esoteric reason we couldn't use Sqoop, and it ended up not being that trivial. You end up worrying about translating types, handling null values, encodings, escaping, retries, transactions, etc, etc.
Why reinvent the wheel? There are no other RDBMS <-> Hive connectors I know of because Sqoop does it well. Use Sqoop unless you have a very good, very specific reason not to.
Try this Sqoop command.
sqoop import --driver com.ibm.db2.jcc.DB2Driver --connect jdbc:db2://db2.my.com:50000/databaseName --username database_name --password database_password --table table_name --split-by tbl_primarykey --target-dir sqoopimports
Use the DB2 export utility to export data from a database to a file and then FTP flat files to Hadoop, and load into Hive.
Simple Export operation requires target file, a file format, and a source file.
db2 export to "target" of "fileformat" select * from "soruce"