Copying data from HDFS to Hive using Sqoop - hadoop

I want to copy data from HDFS into a Hive table. I tried the code below; it doesn't throw any error, but the data is also not copied into the specified Hive table. Here is my code:
sqoop import --connect jdbc:mysql://localhost/sampleOne \
--username root \
--password root \
--external-table-dir "/WithFields" \
--hive-import \
--hive-table "sampleone.customers"
Here sampleone is the database in Hive, customers is a newly created table in Hive, and --external-table-dir is the HDFS path from which I want to load data into the Hive table. What am I missing in the code above?

If the data is already in HDFS, you do not need Sqoop to populate a Hive table. The steps to do this are below.
This is the data in HDFS
# hadoop fs -ls /example_hive/country
/example_hive/country/country1.csv
# hadoop fs -cat /example_hive/country/*
1,USA
2,Canada
3,USA
4,Brazil
5,Brazil
6,USA
7,Canada
This is the Hive table creation DDL
CREATE TABLE sampleone.customers
(
id int,
country string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Verify Hive table is empty
hive (sampleone)> select * from sampleone.customers;
<no rows>
Load Hive table
hive (sampleone)> LOAD DATA INPATH '/example_hive/country' INTO TABLE sampleone.customers;
Verify Hive table has data
hive (sampleone)> select * from sampleone.customers;
1 USA
2 Canada
3 USA
4 Brazil
5 Brazil
6 USA
7 Canada
Note: This approach will move the data from the /example_hive/country location on HDFS into the Hive warehouse directory (also on HDFS) backing the table.
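If you would rather leave the files in place at /example_hive/country instead of having them moved, an external table is one alternative. A minimal sketch, using the same columns and delimiters as above (the name customers_ext is just illustrative):
CREATE EXTERNAL TABLE sampleone.customers_ext
(
id int,
country string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/example_hive/country';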

Related

How to import parquet data from S3 into HDFS using Sqoop?

I am trying to import data into a table in RDS. The data is in Parquet file format and is present in S3.
I thought of importing the data from S3 into HDFS using Sqoop and then exporting it into the RDS table using Sqoop. I was able to find the command to export data from HDFS to RDS, but I couldn't find one for importing Parquet data from S3. Could you please help with how to structure the sqoop import command in this case?
You can use Spark to copy data from S3 to HDFS.
Read this blog for more details.
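If all you need is to get the Parquet files from S3 onto HDFS, a plain hadoop distcp (rather than Spark) is another option. A sketch, assuming the s3a credentials are already configured and both paths are placeholders:
hadoop distcp s3a://<bucket_name>/<parquet_path> hdfs:///user/test/parquet_from_s3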
The approach that seemed simplest and best to me is below:
Create a Parquet table in Hive and load it with the Parquet data from S3
create external table if not exists parquet_table(<column name> <column's datatype>) stored as parquet;
LOAD DATA INPATH 's3a://<bucket_name>/<parquet_file>' INTO TABLE parquet_table;
Create a CSV table in Hive and load it with the data from Parquet table
create external table if not exists csv_table(<column name> <column's datatype>)
row format delimited fields terminated by ','
stored as textfile
location 'hdfs:///user/hive/warehouse/csvdata'
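The statement above only defines the CSV table; assuming the column lists of the two tables line up, an insert along these lines populates it from the Parquet table:
insert into table csv_table select * from parquet_table;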
Now that we have a CSV/textfile table in Hive, Sqoop can easily export the table from HDFS to the MySQL table in RDS.
sqoop export --table <mysql_table_name> --export-dir hdfs:///user/hive/warehouse/csvdata --connect jdbc:mysql://<host>:3306/<db_name> --username <username> --password-file hdfs:///user/test/mysql.password --batch -m 1 --input-null-string "\\N" --input-null-non-string "\\N" --columns <column names to be exported, without whitespace in between the column names>

Loading sequence file data into a Hive table created using STORED AS SEQUENCEFILE is failing

I am importing content from MySQL to HDFS as sequence files using the sqoop import command below:
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db"
--username retail_dba --password cloudera
--table orders
--target-dir /user/cloudera/sqoop_import_seq/orders
--as-sequencefile
--lines-terminated-by '\n' --fields-terminated-by ','
Then I'm creating the Hive table using the command below:
create table orders_seq(order_id int,order_date string,order_customer_id int,order_status string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS SEQUENCEFILE
But when I tried to load the sequence data produced by the first command into the Hive table using the command below,
LOAD DATA INPATH '/user/cloudera/sqoop_import_seq/orders' INTO TABLE orders_seq;
it gave the error below:
Loading data to table practice.orders_seq
Failed with exception java.lang.RuntimeException: java.io.IOException: WritableName can't load class: orders
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Where am I going wrong?
First of all, is it necessary to have the data in that format?
Let's suppose you do need the data in that format. The LOAD DATA command is not necessary: once Sqoop finishes importing the data, you just have to create a Hive table pointing at the same directory where you imported the data.
One side note from your scripts:
create table orders_seq(order_id int,order_date string,order_customer_id int,order_status string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS SEQUENCEFILE
Your sqoop command uses --fields-terminated-by ',', but when creating the table you use FIELDS TERMINATED BY '|'.
In my experience, the best approach is to sqoop the data as Avro; this automatically creates an Avro schema. Then you just have to create a Hive table using the previously generated schema (via the AvroSerDe) and the location where you stored the data from the sqoop process.
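A rough sketch of that Avro approach, reusing the connection details from the question (the target directory, schema file name, and schema location on HDFS are assumptions; Sqoop writes the generated .avsc file into the local working directory or --outdir):
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--table orders \
--target-dir /user/cloudera/sqoop_import_avro/orders \
--as-avrodatafile

# copy the generated Avro schema to HDFS so Hive can reference it
hadoop fs -mkdir -p /user/cloudera/schemas
hadoop fs -put orders.avsc /user/cloudera/schemas/orders.avsc
Then create the Hive table on top of the imported files:
CREATE EXTERNAL TABLE orders_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/cloudera/sqoop_import_avro/orders'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/cloudera/schemas/orders.avsc');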

How to define Hive table structure using the sqoop import-mainframe --create-hive-table command

We are trying to import a flat mainframe file and load it into a Hive table. I was able to import it and load it into a Hive table using sqoop import-mainframe, but my entire file is placed in one column, and that column does not even have a name.
Is there a possibility to define the table structure in the sqoop import command itself?
We are using the command below to import from the mainframe and load it into the Hive table:
sqoop import-mainframe --connect mainframe.com --dataset mainframedataset --username xxxxx -P --hive-import --create-hive-table --hive-table table1 --warehouse-dir /warehouse/
Sample mainframe data:
ASWIN|1234|1000.00
XXXX|1235|200.00
YYYY|1236|150.00
Hive table create script generated by sqoop:
CREATE TABLE Employee ( DEFAULT_COLUMN STRING) COMMENT 'Imported by sqoop on 2016/08/26 02:12:04' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
As per Sqoop docs,
By default, each record in a dataset is stored as a text record with a newline at the end. Each record is assumed to contain a single text field with the name DEFAULT_COLUMN. When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates.
Your pipe-separated (PSV) file will be loaded into HDFS.
Now create table1 (the Hive table) yourself using:
CREATE TABLE table1 (Name string, Empid int,Amount float) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\012' STORED AS TEXTFILE
Now run your sqoop import command without the --create-hive-table flag. It should work.
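For reference, the adjusted command would then look roughly like this (the same options as before, minus --create-hive-table, so the import loads into the table1 created above):
sqoop import-mainframe --connect mainframe.com --dataset mainframedataset --username xxxxx -P --hive-import --hive-table table1 --warehouse-dir /warehouse/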

Importing tables from multiple databases into Hadoop and Union

I have this specific scenario:
There are year-wise databases in SQL Server with naming like "FOOXXYY", where XXYY denotes the fiscal year. Now I want to take a specific table "bar" from all of these databases, union them into a single table in Hive, and store it in HDFS.
What will be the best and fastest approach to go about it?
You need to create a database, create a partitioned table, add partitions, and run four different sqoop commands, one connecting to each database, to load the data into the partitions. Here are sample code snippets.
Create the database and then the partitioned table like this:
CREATE TABLE `order_items`(
`order_item_id` int,
`order_item_order_id` int,
`order_item_order_date` string,
`order_item_product_id` int,
`order_item_quantity` smallint,
`order_item_subtotal` float,
`order_item_product_price` float)
PARTITIONED BY (
`order_month` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
You can then add partitions using these commands:
alter table order_items add partition (order_month=201301);
alter table order_items add partition (order_month=201302);
Once the table is created, you can run describe formatted order_items. It will give the path of the table, which you can validate using the dfs -ls command in Hive.
From describe formatted
Location: hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/retail_ods.db/order_items
dfs -ls /apps/hive/warehouse/retail_ods.db/order_items
You will get the two partition directories; capture the paths.
By now you have the table and a partition for each year (in your case). Now you can use a sqoop import command for each database to query the table and copy the data into the respective partition.
You can find a sample sqoop command below. You can even pass a query as part of the sqoop import command (see the Sqoop user guide).
sqoop import \
--connect "jdbc:mysql://sandbox.hortonworks.com:3306/retail_db" \
--username=retail_dba \
--password=hadoop \
--table order_items \
--target-dir /apps/hive/warehouse/retail_ods.db/order_items/order_month=201301 \
--append \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--outdir java_files
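The sample above uses MySQL; for the SQL Server databases from the question (FOOXXYY, table bar), a per-database import might look roughly like this, run once per fiscal year. The JDBC URL, credentials, partition column name, and split column are assumptions, and the SQL Server JDBC driver must be on Sqoop's classpath:
sqoop import \
--connect "jdbc:sqlserver://<host>:1433;databaseName=FOOXXYY" \
--username <user> \
--password <password> \
--query "SELECT * FROM bar WHERE \$CONDITIONS" \
--split-by <primary_key_column> \
--target-dir /apps/hive/warehouse/<db_name>.db/bar/fiscal_year=XXYY \
--append \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--outdir java_files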

Sqoop Import is completed successfully. How to view these tables in Hive

I am trying something on Hadoop and its related tools. For this, I have configured Hadoop, HBase, Hive, and Sqoop on an Ubuntu machine.
raghu#system4:~/sqoop$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/mysql --username root --password password --table user --hive-import -m 1
Everything goes fine, but when I enter the Hive command line and execute show tables, there is nothing. I am able to see that these tables are created in HDFS.
I have seen some options in Sqoop import - it can import to Hive/HDFS/HBase.
When importing into Hive, it is indeed importing directly into HDFS. Then why Hive?
Where can I execute HiveQL to check the data?
From Cloudera support, I understood that I can use Hue to check it. But I think Hue is just a user interface to Hive.
Could someone help me here?
Thanks in advance,
Raghu
I was having the same issue. I was able to work around it by importing the data directly into HDFS and then creating an external Hive table pointing at that specific location in HDFS. Here is an example that works for me.
create external table test (
sequencenumber int,
recordkey int,
linenumber int,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
location '/user/hdfs/testdata';
You will need to change your location to where you saved the data in HDFS.
Can you post the output from sqoop? Try using the --verbose option.
Here's an example of the command I use, and it does import directly to a Hive table.
sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" --hive-table hive_users --connect jdbc:mysql://$MYSQL_HOST/$DATABASE_NAME --table users --username $MYSQL_USER --password $MYSQL_PASS --hive-import
When we do not give a database in the sqoop import command, the table will be created in Hive's default database with the same name as the RDBMS table.
You can specify the Hive database into which you want to import the RDBMS table with --hive-database.
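For example, applied to the import command from the question, a sketch (the target database sampleone is just illustrative, and --hive-database requires a reasonably recent Sqoop version):
sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root --password password --table user --hive-import --hive-database sampleone --hive-table user -m 1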
Instead of creating the Hive table every time, you can import the table structure into Hive using Sqoop's create-hive-table tool. It will create the table as a managed table; you can then convert it to an external table by changing the table properties, and add partitions. This reduces the effort of finding the right data types. Please note that there may be precision changes.
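A rough sketch of that flow (the connection details and table names are placeholders):
sqoop create-hive-table --connect jdbc:mysql://<host>:3306/<db_name> --username <user> --password <password> --table users --hive-table hive_users
Then, in Hive, flip the managed table to external:
ALTER TABLE hive_users SET TBLPROPERTIES ('EXTERNAL'='TRUE');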
Whenever you use Sqoop with the Hive import option, Sqoop connects directly to the source database, reads the corresponding table's metadata (its schema), and provides that schema to Hive, so there is no need to create the table structure in Hive yourself.
Without the Hive option, the output of the sqoop import on HDFS will by default be stored in the default directory, i.e. /user/sqoop/tablename/part-m files.
With the Hive import option, the tables will be loaded directly into the default warehouse directory, i.e.
/user/hive/warehouse/tablename
Command: sudo -u hdfs hadoop fs -ls -R /user/
This recursively lists all the files under /user/.
Now go to Hive and type show databases. If there is only the default database,
then type show tables.
Remember that OK is common default system output and is not part of the command output.
hive> show databases;
OK
default
Time taken: 0.172 seconds
hive> show tables;
OK
genre
log_apache
movie
moviegenre
movierating
occupation
user
Time taken: 0.111 seconds
Try a sqoop command like this; it works for me and directly creates the Hive table, so you need not create an external table every time:
sqoop import --connect DB_HOST --username ***** --password ***** --query "select * from SCHEMA.TABLE where \$CONDITIONS"
--num-mappers 5 --split-by PRIMARY_KEY --hive-import --hive-table HIVE_DB.HIVE_TABLE_NAME --target-dir SOME_DIR_NAME;
The command you are using imports data into the $HIVE_HOME directory. If the HIVE_HOME environment variable is not set or points to a wrong directory, you will not be able to see imported tables.
The best way to find the hive home directory is to use the Hive QL SET command:
hive -S -e 'SET' | grep warehouse.dir
Once you have retrieved the Hive home directory, append the --hive-home <hive-home-dir> option to your command.
Another possible reason is that in some Hive setups the metadata is cached and you cannot see the changes immediately. In this case you need to flush the metadata cache using the INVALIDATE METADATA; command.
