How to overwrite the data in Hive using Sqoop - hadoop

I am trying to load data into an already existing table in Hive via Sqoop from a MySQL database. I am referring to the guide below for reference:
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_importing_data_into_hive
--hive-import has been tried and tested successfully.
I created a Hive table as below:
create table sqoophive (id int, name string, location string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Loaded the data as required.
I want to use --hive-overwrite option to overwrite the content in the above table. As per the guide mentioned above - "--hive-overwrite Overwrite existing data in the Hive table."
"If the Hive table already exists, you can specify the --hive-overwrite option to indicate that existing table in hive must be replaced."
So I tried the commands below separately to get the result:
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' --table sample --hive-import --hive-overwrite --hive-table sqoophive -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' --table sample --hive-overwrite --hive-table sqoophive -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
But rather than replacing the content in the existing table, it just created a file under the path /user/<username>/<mysqltablename>
Can somebody please explain to me where I am going wrong?

The first query should work fine. I didn't give --fields-terminated-by and --lines-terminated-by, as the schema already exists.
Both keywords --hive-import and --hive-overwrite should be there.
If only --hive-overwrite is there, it doesn't load data into the table; it just copies to HDFS.
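For example, a minimal working form of the first command (reusing the connection details from the question; the terminator options can be dropped since the table schema already exists) would be:
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' --table sample --hive-import --hive-overwrite --hive-table sqoophive -m 1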

It's putting the _SUCCESS file in
/user/<username>/<mysqltablename>
You can change where that goes with --warehouse-dir
ex: --warehouse-dir /tmp
One would think that --hive-overwrite would handle this, meaning remove that directory first. But for good reason Hive doesn't want to start removing dirs in HDFS. What if something else was put in there?
--hive-overwrite is saying, "I'm going to overwrite the rows in Hive, not just add to the table." Thus you will not have duplicates.
You have to remove that directory and the _SUCCESS file first, or better yet, right after the import is successful.
hadoop fs -rm -R /user/<username>/<mysqltablename>
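If you want that cleanup to run only after a successful import, you can chain the two commands in the shell (a sketch, reusing the question's connection details and the placeholder path above):
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' --table sample --hive-import --hive-overwrite --hive-table sqoophive -m 1 && hadoop fs -rm -R /user/<username>/<mysqltablename>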

sqoop import without --target-dir or --warehouse-dir (for --hive-import) will import to /user/<username>/<mysqltablename>:
By default, Sqoop will import a table named foo to a directory named
foo inside your home directory in HDFS. For example, if your username
is someuser, then the import tool will write to
/user/someuser/foo/(files). You can adjust the parent directory of the
import with the --warehouse-dir argument.
You can also explicitly choose the target directory with --target-dir param
But as @hrobertv said, --hive-overwrite does not delete the existing dir; it overwrites the HDFS data location of the Hive table. If you want to save the new data at the same location as the original, you have to delete the existing table dir first and then run sqoop import, specifying --target-dir or --warehouse-dir along with --hive-overwrite to store the data at the location you need.
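For example, to keep the staging data out of your home directory entirely, you could point the import at a throwaway parent directory (the /tmp/sqoop-staging path below is just an illustration, reusing the question's connection details):
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' --table sample --hive-import --hive-overwrite --hive-table sqoophive -m 1 --warehouse-dir /tmp/sqoop-staging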

Related

sqoop import as parquet file to target dir, but can't find the file

I have been using Sqoop to import data from MySQL to Hive; the command I used is below:
sqoop import --connect jdbc:mysql://localhost:3306/datasync \
--username root --password 654321 \
--query 'SELECT id,name FROM test WHERE $CONDITIONS' --split-by id \
--hive-import --hive-database default --hive-table a \
--target-dir /tmp/yfr --as-parquetfile
The Hive table is created and the data is inserted; however, I cannot find the parquet file.
Does anyone know?
Best regards,
Feiran
Sqoop import to hive works in 2 steps:
Fetching data from RDBMS to HDFS
Creating the Hive table if it does not exist, and loading the data into it
In your case,
firstly, data is stored at --target-dir i.e. /tmp/yfr
Then, it is loaded into Hive table a using
LOAD DATA INPATH ... INTO TABLE ...
command.
As mentioned in the comments, the data is moved to the Hive warehouse directory; that's why there is no data left in --target-dir.
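If you want to confirm where the files actually ended up, you can ask Hive for the table's storage location and then list that path (a quick check; not specific to Parquet):
hive -e 'DESCRIBE FORMATTED default.a;' | grep -i location
Running hdfs dfs -ls on the Location it prints will show the part files.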

sqoop-hive Import adding an extra column

I have imported data into Hive using Sqoop successfully. I then added a column in Oracle and imported that particular column to Hive again using sqoop-import. But it is being appended to the first column's data, the remaining columns are null, and no new column appeared in Hive. Can anyone resolve the issue?
Without looking at your import statements, I am assuming that in your second import you are trying to append to the existing import but only importing the new column using the --columns and --append arguments. It will not work this way, as it appends at the end of the file, not at the end of each line.
You will need to overwrite the existing data in HDFS using --hive-overwrite and alter the Hive table to add the additional column, OR just drop the Hive table and use --create-hive-table in the Sqoop command.
So your import command should look like this:
sqoop --import \
--connect $CONNECTION_STR \
--username $USER \
--password $PASS \
--table $ORACLE_TABLE \
--hive-import \
--hive-overwrite \
--hive-home $HIVE_HOME \
--hive-table $HIVE_TABLE
Change the values to the actual values for your environment.
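If you prefer the alter-table route mentioned above instead of dropping and re-creating the table, the Hive side is a one-line schema change (the table and column names here are placeholders):
ALTER TABLE your_hive_table ADD COLUMNS (new_column STRING);
Run it before the --hive-overwrite import so the reloaded rows line up with the new schema.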

Configuring Sqoop with Mysql?

I have successfully installed Sqoop. Now the problem is how to connect it to an RDBMS and how to load data from the RDBMS to HDFS using Sqoop.
By using Sqoop you can load data directly into Hive tables or store the data in some target directory in HDFS.
If you need to copy data from the RDBMS into some directory in HDFS:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--columns column_name(s) {In case you need only specific columns}
--target-dir '/tmp/myfolder'
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--fields-terminated-by ',' {how do you want your data to look in target file}
Boundary Query: This is something you can specify. If you do not specify it, then by default it is run as an inner query, which adds up to a complex query.
If you specify it explicitly, it runs as a normal query and hence performance is improved.
Also, you may want to restrict the number of records, say based on a column ID; suppose you need data from ID 1 to 1000. Then, using a boundary query and split-by, you will be able to restrict your imported data.
--boundary-query "select 0,1000 from employee'
--split-by ID
Split-By: You use --split-by on a Sqoop import to specify the column on the basis of which the split is made. By default, if you do not specify this, Sqoop picks the table's primary key as the split-by column.
Split-by distributes the table's rows across the mappers, each of which writes its own part file; by default the number of mappers is 4.
This may seem unwanted, but in case you have a composite primary key or no primary key at all, Sqoop fails to pick a split column and may error out.
Note: you will not face this issue if you set the number of mappers to 1. In that case no split-by condition is used, since there is only one mapper, so the query runs fine. This can be done using
--m 1
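Putting boundary-query and split-by together, a full import restricted to the 0-1000 ID range (mirroring the snippet above; connection details and names are placeholders) might look like:
sqoop import --connect ConnectionString --username username --password Your_Database_Password --table employee --boundary-query 'select 0,1000 from employee' --split-by ID --target-dir '/tmp/myfolder' --m 4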
If you need to copy data from the RDBMS into a Hive table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--hive-import
--hive-table serviceorderdb.productinfo
Running a query instead of importing the entire table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password
--query "select name from employees where name like '%s' and \$CONDITIONS"
--m 5 {set number of mappers to 5}
--target-dir '/tmp/myfolder'
--fields-terminated-by ',' {how do you want your data to look in target file}
You may have noticed the extra token $CONDITIONS. This is because this time you specified no table and gave a query explicitly. When Sqoop runs, it looks for boundary conditions, which it does not find. Then it looks for a table and a primary key to apply a boundary query to, which again it does not find. Hence the query must include the $CONDITIONS token, which gives Sqoop a place to inject its own boundary condition.
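To make that concrete: with more than one mapper, a free-form --query import also needs --split-by (or -m 1), and Sqoop replaces $CONDITIONS in each mapper's copy of the query with a range predicate on the split column, roughly like this (illustrative column and values only):
select name from employees where name like '%s' and ( id >= 1 ) AND ( id < 500 )
select name from employees where name like '%s' and ( id >= 500 ) AND ( id <= 1000 )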
Checking if your connection is set up properly: for this you can just run list-databases, and if you see your databases listed then your connection is fine.
$ sqoop list-databases
--connect jdbc:mysql://localhost/
--username root
--password pwd
Connection String for Different Databases :
MYSQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle: jdbc:oracle:thin:@//host_name:port_number/service_name
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename
You may learn more about sqoop imports from : https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
By using the sqoop import command you can import data from an RDBMS into HDFS, Hive, or HBase.
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m 1
By using this command data will be stored in HDFS.
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName
--username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns from a table if you don't specify both the username and the table in correct case. Usually, specifying both in uppercase will resolve the issue.
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use import and export tools, do incremental import jobs, save jobs, solve problems with jdbc drivers and much more. http://shop.oreilly.com/product/0636920029519.do

Sqoop Import is completed successfully. How to view these tables in Hive

I am trying something on Hadoop and its related things. For this, I have configured Hadoop, HBase, Hive, and Sqoop on an Ubuntu machine.
raghu#system4:~/sqoop$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/mysql --username root --password password --table user --hive-import -m 1
All goes fine, but when I enter the Hive command line and execute show tables, there is nothing. I am able to see that these tables are created in HDFS.
I have seen some options in Sqoop import - it can import to Hive/HDFS/HBase.
When importing into Hive, it is indeed importing directly into HDFS. Then why Hive?
Where can I execute HiveQL to check the data?
From Cloudera support, I understood that I can use Hue to check it. But I think Hue is just a user interface to Hive.
Could someone help me here?
Thanks in advance,
Raghu
I was having the same issue. I was able to work around/through it by importing the data directly into HDFS and then creating an external Hive table to point at that specific location in HDFS. Here is an example that works for me.
create external table test (
sequencenumber int,
recordkey int,
linenumber int,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
location '/user/hdfs/testdata';
You will need to change your location to where you saved the data in HDFS.
Can you post the output from sqoop? Try using --verbose option.
Here's an example of the command I use, and it does import directly to a Hive table.
sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" --hive-table hive_users --connect jdbc:mysql://$MYSQL_HOST/$DATABASE_NAME --table users --username $MYSQL_USER --password $MYSQL_PASS --hive-import
When we do not give any database in the sqoop import command, the table will be created in the default database with the same name as the RDBMS table.
You can specify the database into which you want to import the RDBMS table in Hive with --hive-database.
Instead of creating the Hive table every time, you can import the table structure into Hive using Sqoop's create-hive-table command. It will import the table as a managed table; you can then convert it to an external table by changing the table properties, and then add partitions. This will reduce the effort of finding the right data types. Please note that there will be a precision change.
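The managed-to-external conversion mentioned above is just a table-property change in HiveQL (the table name here is a placeholder):
ALTER TABLE imported_table SET TBLPROPERTIES ('EXTERNAL'='TRUE');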
Whenever you use Sqoop with the Hive import option, Sqoop connects directly to the source database and gets the corresponding table's metadata (the table's schema), so there is no need to create a table structure in Hive. This schema is then provided to Hive when the --hive-import option is used.
So the output of all the Sqoop data on HDFS will by default be stored in the default directory, i.e. /user/sqoop/tablename/part-m files.
With the Hive import option, the tables will be loaded directly into the default warehouse directory, i.e.
/user/hive/warehouse/tablename
Command: sudo -u hdfs hadoop fs -ls -R /user/
This lists recursively all the files within /user/.
Now go to Hive and type show databases; if there is only the default database,
then type show tables.
Remember that OK is common default system output and is not part of the command output.
hive> show databases;
OK
default
Time taken: 0.172 seconds
hive> show tables;
OK
genre
log_apache
movie
moviegenre
movierating
occupation
user
Time taken: 0.111 seconds
Try a sqoop command like this; it's working for me and directly creates the Hive table, so you need not create an external table every time:
sqoop import --connect DB_HOST --username ***** --password ***** --query "select * from SCHEMA.TABLE where \$CONDITIONS"
--num-mappers 5 --split-by PRIMARY_KEY --hive-import --hive-table HIVE_DB.HIVE_TABLE_NAME --target-dir SOME_DIR_NAME;
The command you are using imports data into the $HIVE_HOME directory. If the HIVE_HOME environment variable is not set or points to a wrong directory, you will not be able to see imported tables.
The best way to find the hive home directory is to use the Hive QL SET command:
hive -S -e 'SET' | grep warehouse.dir
Once you have retrieved the Hive home directory, append the --hive-home <hive-home-dir> option to your command.
Another possible reason is that in some Hive setups the metadata is cached and you cannot see the changes immediately. In this case you need to flush the metadata cache, using the INVALIDATE METADATA; command.

sqoop import complete but hive show tables can't see table

After installing Hadoop and Hive (CDH version), I execute
./sqoop import -connect jdbc:mysql://10.164.11.204/server -username root -password password -table user -hive-import --hive-home /opt/hive/
All goes fine, but when I enter the Hive command line and execute show tables, there is nothing.
When I use ./hadoop fs -ls, I can see /user/(username)/user exists.
Any help is appreciated.
---EDIT-----------
/sqoop import -connect jdbc:mysql://10.164.11.204/server -username root -password password -table user -hive-import --target-dir /user/hive/warehouse
The import fails due to:
11/07/02 00:40:00 INFO hive.HiveImport: FAILED: Error in semantic analysis: line 2:17 Invalid Path 'hdfs://hadoop1:9000/user/ubuntu/user': No files matching path hdfs://hadoop1:9000/user/ubuntu/user
11/07/02 00:40:00 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive exited with status 10
at com.cloudera.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:326)
at com.cloudera.sqoop.hive.HiveImport.executeScript(HiveImport.java:276)
at com.cloudera.sqoop.hive.HiveImport.importTable(HiveImport.java:218)
at com.cloudera.sqoop.tool.ImportTool.importTable(ImportTool.java:362)
at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:423)
at com.cloudera.sqoop.Sqoop.run(Sqoop.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:180)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:218)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:228)
Check your hive-site.xml for the value of the property
javax.jdo.option.ConnectionURL. If you do not define this explicitly,
the default value will use a relative path for creation of hive
metastore (jdbc:derby:;databaseName=metastore_db;create=true) which
will be different depending upon where you launch the process from.
This would explain why you cannot see the table via show tables.
Define this property value in your hive-site.xml using an absolute path.
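To check which metastore your current session is actually pointing at, you can print that property from the Hive CLI (the SET command prints a property's current value):
hive -S -e 'set javax.jdo.option.ConnectionURL;'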
There is no need to create the table in Hive; refer to the query below:
sqoop import --connect jdbc:mysql://xxxx.com/DatabaseName --username root --password admin --table tablename (the MySQL table) --direct -m 1 --hive-import --create-hive-table --hive-table tablename --target-dir '/user/hive/warehouse/tablename' (the table you want to create in Hive) --fields-terminated-by '\t'
In my case Hive stores data in /user/hive/warehouse directory in HDFS. This is where Sqoop should put it.
So I guess you have to add:
--target-dir /user/hive/warehouse
Which is default location for Hive tables (might be different in your case).
You might also want to create this table in Hive:
sqoop create-hive-table --connect jdbc:mysql://host/database --table tableName --username user --password password
In my case it creates the table in Hive's default database; you can give it a try.
sqoop import --connect jdbc:mysql://xxxx.com/Database name --username root --password admin --table NAME --hive-import --warehouse-dir DIR --create-hive-table --hive-table NAME -m 1
Hive tables will be created by the Sqoop import process. Please make sure /user/hive/warehouse is created in your HDFS. You can browse HDFS at http://localhost:50070/dfshealth.jsp (Browse the File System option).
Also include the full HDFS location in --target-dir, i.e. hdfs://:9000/user/hive/warehouse, in the sqoop import command.
First of all, create the table definition in Hive with exactly the same field names and types as in MySQL.
Then, perform the import operation
For Hive Import
sqoop import --verbose --fields-terminated-by ',' --connect jdbc:mysql://localhost/test --table tablename --hive-import --warehouse-dir /user/hive/warehouse --split-by id --hive-table tablename
'id' can be the primary key of the existing table
'localhost' can be your local IP
'test' is the database name
the 'warehouse' directory is in HDFS
I think all you need is to specify the hive table where data should go.
add "--hive-table database.tablename" to the sqoop command and remove the --hive-home /opt/hive/. I think that should resolve the problem.
