I want to import data from a DB2 database into Hadoop (HDFS, Hive). One way is to use Sqoop; can we do the same some other way? Please share any other approaches. Thanks.
Sqoop is the best way to go. Anything else would require a serious amount of custom code. I've actually been on a project where we had a pretty esoteric reason we couldn't use Sqoop, and it ended up not being that trivial. You end up worrying about translating types, handling null values, encodings, escaping, retries, transactions, etc, etc.
Why reinvent the wheel? There are no other RDBMS <-> Hive connectors I know of because Sqoop does it well. Use Sqoop unless you have a very good, very specific reason not to.
Try this Sqoop command.
sqoop import --driver com.ibm.db2.jcc.DB2Driver --connect jdbc:db2://db2.my.com:50000/databaseName --username database_name --password database_password --table table_name --split-by tbl_primarykey --target-dir sqoopimports
Use the DB2 export utility to export data from a database to a file and then FTP flat files to Hadoop, and load into Hive.
A simple export operation requires a target file, a file format, and a SELECT statement that identifies the source data.
db2 export to "target" of "fileformat" select * from "source"
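The export-and-load route above could be scripted roughly as follows. This is only a sketch under assumptions: the schema, table, file, and directory names are made up, `hdfs dfs -put` from a node with a Hadoop client stands in for the FTP step, a DB2 connection is assumed to be open already (`db2 connect to ...`), and the Hive table is assumed to exist as comma-delimited text. Note that DB2 normally wraps character columns in double quotes in DEL files, so the Hive table definition has to cope with that.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# All argument values used below are placeholders, not values from the post.
export_db2_to_hive() {
  src_table=$1    # DB2 table, e.g. MYSCHEMA.ORDERS
  local_file=$2   # delimited file written by DB2
  hdfs_dir=$3     # staging directory in HDFS
  hive_table=$4   # existing Hive table stored as comma-delimited text

  # 1. Export the table to a comma-delimited (DEL) file.
  run db2 "export to $local_file of del select * from $src_table"
  # 2. Copy the file into HDFS (used here instead of FTP).
  run hdfs dfs -mkdir -p "$hdfs_dir"
  run hdfs dfs -put -f "$local_file" "$hdfs_dir/"
  # 3. Move the staged file(s) into the Hive table.
  run hive -e "LOAD DATA INPATH '$hdfs_dir' INTO TABLE $hive_table"
}

export_db2_to_hive MYSCHEMA.ORDERS /tmp/orders.del /user/hive/staging/orders orders
```

Everything Sqoop normally handles for you (types, nulls, escaping, retries) is still your problem with this approach, which is the point made in the first answer.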
It is known that --incremental sqoop import switch doesn't work for HIVE import through SQOOP. But what is the workaround for that?
1) One approach I can think of is to create a Hive table, bring the incremental data into HDFS through Sqoop, and then load it manually. But if we do that, the data would be overwritten each time we run the load. Please correct me if I am wrong.
2) How effective is --query when sqooping data to Hive?
Thank you
You can do a Sqoop incremental append to the Hive table, but there is no direct option. Below is one way to achieve it.
Store the incremental table as an external table in Hive.
It is more common to import the incremental changes since the last time the data was updated and then merge them. In the following example, --check-column is used to fetch records newer than last_import_date, which is the date of the last incremental data update:
sqoop import --connect jdbc:teradata://{host name}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
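For the merge step itself, here is a rough sketch of one commonly described reconciliation pattern: expose the Sqoop target directory as an external table, then build a view that keeps only the newest row per key. The names `base_table`, `incremental_table`, and the key column `id` are assumptions for illustration (only `modified_date` and the directory come from the command above), and the base table is assumed to be stored as comma-delimited text so that `LIKE` yields a layout matching Sqoop's default output.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print HiveQL that (1) exposes the incremental directory as an external
# table and (2) defines a view holding the latest version of every row.
merge_sql() {
  base=$1       # existing Hive table with the full history
  incr=$2       # external table over the Sqoop target directory
  incr_dir=$3   # HDFS directory written by the incremental import
  key=$4        # primary-key column (hypothetical)
  ts=$5         # last-modified column used as --check-column
  cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS $incr LIKE $base
  LOCATION '$incr_dir';
CREATE VIEW IF NOT EXISTS ${base}_reconciled AS
SELECT u.* FROM
  (SELECT * FROM $base UNION ALL SELECT * FROM $incr) u
JOIN
  (SELECT $key, MAX($ts) AS max_ts FROM
     (SELECT * FROM $base UNION ALL SELECT * FROM $incr) a
   GROUP BY $key) latest
ON u.$key = latest.$key AND u.$ts = latest.max_ts;
EOF
}

run hive -e "$(merge_sql base_table incremental_table /user/hive/incremental_table id modified_date)"
```

Because nothing is loaded over the base table, the overwrite concern from the question does not arise; the view can later be materialized into a new base table if needed.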
As for the second part of your question:
--query is also a very useful argument you can leverage in a sqoop import. It gives you the flexibility of basic joins on the RDBMS tables and lets you adjust date and time formats. If I were in your shoes, I would use --query to import the data in the shape I need into a temporary table and then append it to my original table; while loading from the temporary table to the main table I can transform the data further. I would suggest using --query if the updates are not too frequent.
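A free-form query import might look like the sketch below. The JDBC URL, user, tables (`orders`, `customers`), and columns are invented for illustration. What Sqoop itself requires is the literal `$CONDITIONS` token in the WHERE clause, a `--target-dir`, and a `--split-by` column when more than one mapper is used.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

import_with_query() {
  jdbc_url=$1     # e.g. jdbc:mysql://dbhost/retail (placeholder)
  user=$2
  target_dir=$3   # HDFS staging directory for the result

  # Single quotes keep $CONDITIONS literal; Sqoop substitutes a range
  # predicate for it in each mapper. Tables and columns are hypothetical.
  query='SELECT o.id, o.total, c.name, o.modified_date FROM orders o JOIN customers c ON o.customer_id = c.id WHERE $CONDITIONS'

  # -P prompts for the password instead of putting it on the command line.
  run sqoop import --connect "$jdbc_url" --username "$user" -P \
    --query "$query" --split-by o.id --target-dir "$target_dir" -m 4
}

import_with_query jdbc:mysql://dbhost/retail etl_user /user/hive/staging/orders_joined
```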
Is there anyone here who has worked with sqoop and hp vertica?
I am trying to export data from sqoop to vertica and I find that the performance is extremely poor.
I can switch to the HP vertica connector... but I still want to know why sqoop works so slow when exporting data to vertica.
I also found that when inserting data, sqoop does not support upserts against vertica. I want to know if this issue will be fixed anytime soon?
sqoop export -Dsqoop.export.records.per.statement=1 --driver com.vertica.jdbc.Driver --mysql-delimiters --username **** --password **** --connect jdbc:vertica://hostname/schema?ConnectionLoadBalance=1 --export-dir <hdfs-data-dir> --table <table_name>
One of the issues is that sqoop is forcing us to set sqoop.export.records.per.statement to 1 for Vertica. Otherwise it throws an error. That means one INSERT statement per record.
I've never used sqoop, but the command line data import function in vertica uses the COPY function; basically it makes a temp file and then runs a file import in the background. It wouldn't be a graceful solution, but you could try dumping your data to a gzip and then running the COPY function directly. I find that the gzip is always the bottleneck for files over a certain threshold (~50Mb+), never the COPY. Could be a backdoor to a faster import.
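A rough sketch of that dump-and-COPY route, assuming the HDFS files are comma-delimited text and that `vsql` is available on the machine doing the load. The host, user, table, and paths are placeholders, and `vsql` will ask for a password unless one is supplied some other way.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

load_into_vertica() {
  export_dir=$1      # HDFS directory holding the delimited files
  local_file=$2      # local path for the merged file (placeholder)
  vertica_host=$3
  vertica_user=$4
  vertica_table=$5

  # 1. Merge the HDFS part files into one local file.
  run hdfs dfs -getmerge "$export_dir" "$local_file"
  # 2. Compress it; this produces "$local_file.gz".
  run gzip -f "$local_file"
  # 3. Bulk-load with COPY instead of row-by-row INSERTs.
  run vsql -h "$vertica_host" -U "$vertica_user" \
    -c "COPY $vertica_table FROM LOCAL '$local_file.gz' GZIP DELIMITER ','"
}

load_into_vertica /user/hive/warehouse/tb /tmp/tb.csv vertica-host dbadmin public.tb
```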
I work with sqoop and a Vertica database. I use sqoop to export data from Vertica to Hive/HDFS and it works great; you just need to add the Vertica JAR to the sqoop folder.
When I want to query data in HDFS/Hive from Vertica, I use Vertica's HCatalog connector. In version 8.1.* it ships with the Vertica database and you don't need any other connectors.
hcatalog
Using sqoop I can create a managed table but not an external table.
Please let me know the best practices for unloading data from a data warehouse and loading it into a Hive external table.
1. The tables in the warehouse are partitioned: some by date, some by state.
Please put your thoughts or practices used in production environment.
Sqoop does not support creating Hive external tables. Instead you might:
Use the Sqoop codegen command to generate the SQL for creating the Hive internal table that matches your remote RDBMS table (see http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_codegen_literal)
Modify the generated SQL to create a Hive external table
Execute the modified SQL in Hive
Run your Sqoop import command, loading into the pre-created Hive external table
Step 1: Import data from MySQL into a Hive table.
sqoop import \
--connect jdbc:mysql://localhost/<db-name> \
--username training --password training \
--table <table-name> --hive-import --hive-table <hive-table-name> -m 1 \
--fields-terminated-by ','
Step 2: In hive change the table type from Managed to External.
Alter table <Table-name> SET TBLPROPERTIES('EXTERNAL'='TRUE')
Note: you can import directly into a Hive table, or into the HDFS directory that backs it.
My best suggestion is to Sqoop your data into HDFS and create EXTERNAL tables over it for raw operations and transformations.
Finally, load the transformed data into an internal table. I believe this is one of the best practices for getting this done properly.
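As a minimal sketch of that flow: land the data in HDFS with a plain Sqoop import (no --hive-import), then point an external table at the directory. The connection details, table names, and especially the column list are hypothetical; a real table needs the column list of its source.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

import_as_external() {
  jdbc_url=$1
  user=$2
  src_table=$3    # table in the source database
  hdfs_dir=$4     # directory that will back the external table
  hive_table=$5

  # 1. Land the data as comma-delimited text files in HDFS.
  run sqoop import --connect "$jdbc_url" --username "$user" -P \
    --table "$src_table" --target-dir "$hdfs_dir" \
    --fields-terminated-by ',' -m 1
  # 2. Point an external table at that directory.
  #    The column list here is made up for illustration.
  run hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS $hive_table (id INT, name STRING, state STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '$hdfs_dir'"
}

import_as_external jdbc:mysql://localhost/training training customers /data/raw/customers raw_customers
```

Dropping an external table removes only the metadata and leaves the files in place, which is why it suits the raw layer.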
Hope this helps!!!
Refer to these links:
https://mapr.com/blog/what-kind-hive-table-best-your-data/
In the link above, if you want to skip directly to the point, go to section 2.2.1, External or Internal.
https://hadoopsters.net/2016/07/15/hive-tables-internal-and-external-explained/
After reading the first link, the second will clarify most of your questions.
Cheers!!
What is the difference between Apache Sqoop and Hive? I know that sqoop is used to import/export data between an RDBMS and HDFS, and Hive is a SQL abstraction layer on top of Hadoop. Can I use Sqoop to import data into HDFS and then use Hive to query it?
Yes, you can. In fact, many people use sqoop and hive for exactly what you describe.
In my project I had to load the historical data from my RDBMS, which was Oracle, and move it to HDFS. I had Hive external tables defined for this path, which allowed me to run Hive queries to do transformations. We also wrote MapReduce programs on top of this data to produce various analyses.
Sqoop transfers data between HDFS and relational databases. You can use Sqoop to transfer data from a relational database management system (RDBMS) such as MySQL or Oracle into HDFS and use MapReduce on the transferred data. Sqoop can export this transformed data back into an RDBMS as well. More info http://sqoop.apache.org/docs/1.4.3/index.html
Hive is a data warehouse software that facilitates querying and managing large datasets residing in HDFS. Hive provides schema on read (as opposed to schema on write for RDBMS) onto the data and the ability to query the data using a SQL-like language called HiveQL. More info https://hive.apache.org/
Yes you can. As a matter of fact, that's exactly how it is meant to be used.
Sqoop :
With this tool we can integrate HDFS with external data sources such as SQL databases, NoSQL stores, and data warehouses, and we can also export data back out, since it works in both directions.
Sqoop can also move data from a relational database into HBase.
Hive: 1. As I understand it, we can import data into Hive from SQL databases, but not from NoSQL databases.
We can't export the data from HDFS into Sql Databases.
We can use both together using the below two options
sqoop create-hive-table --connect jdbc:mysql://<hostname>/<dbname> --table <table name> --fields-terminated-by ','
The above command creates the Hive table; it will have the same name and schema as the table in the external database.
Load the data
hive> LOAD DATA INPATH '<filename>' INTO TABLE <table name>;
This can be shortened to one step if you know that you want to import straight from a database into Hive:
sqoop import --connect jdbc:mysql://<hostname>/<dbname> --table <table name> -m 1 --hive-import
I recently started working with sqoop and hive/hadoop on Linux. I have to export data from a Hive table into an Oracle table. I am using a simple sqoop export to do this. I have 6 million rows in the Hive table.
This command gives me very poor performance and takes a long time (85 minutes) to complete the job.
Query ->
sqoop export --connect jdbc:oracle:thin:@server:port:db --username user --password password --export-dir /user/hive/warehouse/tb --table tb --columns 'col1,col2..col33' --input-fields-terminated-by ',' --input-null-string '\\N' --input-null-non-string '\\N' -m 1
Is there any configuration change that can help make it faster?
It's hard to help without any additional information. I would suggest starting the export job again and monitoring the environment to see where the bottleneck is (database? network? hadoop?). It might also be helpful to try the OraOop connector, as it's usually faster.
Is this a "wide" dataset? It might be a sqoop bug https://issues.apache.org/jira/browse/SQOOP-2920 if number of columns is very high (in hundreds), sqoop starts choking (very high on cpu).
When the number of fields is small, it's usually the other way around: sqoop sits idle while Oracle can't keep up. In that case we normally don't go over 45-55 mappers.
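One thing that stands out in the command from the question is `-m 1`, which runs the whole export through a single mapper. A hedged sketch of the same export with more parallelism and JDBC batching follows; the mapper count of 8 is an arbitrary starting point rather than a recommendation, and the connection values are the placeholders from the question.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

export_to_oracle() {
  jdbc_url=$1
  user=$2
  table=$3
  export_dir=$4
  mappers=$5    # number of parallel map tasks, each with its own connection

  # --batch asks the JDBC driver to batch the INSERT statements;
  # -m raises parallelism above the single mapper used in the question.
  run sqoop export --connect "$jdbc_url" --username "$user" -P \
    --table "$table" --export-dir "$export_dir" \
    --input-fields-terminated-by ',' \
    --input-null-string '\\N' --input-null-non-string '\\N' \
    --batch -m "$mappers"
}

export_to_oracle jdbc:oracle:thin:@server:port:db user tb /user/hive/warehouse/tb 8
```

Raise the mapper count gradually while watching the database, since at some point Oracle rather than Hadoop becomes the limit, as the answer above notes.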