How to import parquet data from S3 into HDFS using Sqoop? - hadoop

I am trying to import data into a table in RDS. The data is in Parquet format and is stored in S3.
My plan was to import the data from S3 into HDFS using Sqoop and then export it into the RDS table, also using Sqoop. I was able to find the command to export data from HDFS to RDS, but I couldn't find one for importing Parquet data from S3. Could you please help with how to structure the sqoop import command in this case?

You can use Spark to copy data from S3 to HDFS.
Read this blog for more details.
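For example, from spark-shell, something along these lines (the bucket, paths, and target directory are placeholders, and the s3a connector must already be configured with your AWS credentials):
spark.read.parquet("s3a://<bucket_name>/<parquet_path>/").write.parquet("hdfs:///user/<username>/parquet_data/")
Once the Parquet files are on HDFS you can continue with the export to RDS using sqoop export.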

The approach that seemed simplest and best to me is as follows:
Create a Parquet table in Hive and load it with the Parquet data from S3
create external table if not exists parquet_table(<column name> <column's datatype>) stored as parquet;
LOAD DATA INPATH 's3a://<bucket_name>/<parquet_file>' INTO TABLE parquet_table;
Create a CSV table in Hive and load it with the data from Parquet table
create external table if not exists csv_table(<column name> <column's datatype>)
row format delimited fields terminated by ','
stored as textfile
location 'hdfs:///user/hive/warehouse/csvdata';
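The statement above only defines the CSV table; to actually populate it from the Parquet table, an insert along these lines is needed (assuming the two tables have matching column lists):
insert into table csv_table select * from parquet_table;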
Now that we have a CSV/textfile table in Hive, Sqoop can easily export it from HDFS to the MySQL table in RDS.
sqoop export --table <mysql_table_name> --export-dir hdfs:///user/hive/warehouse/csvdata --connect jdbc:mysql://<host>:3306/<db_name> --username <username> --password-file hdfs:///user/test/mysql.password --batch -m 1 --input-null-string "\\N" --input-null-non-string "\\N" --columns <column names to be exported, without whitespace in between the column names>

Related

Loading sequence file data into a Hive table created using "stored as sequencefile" is failing

I am importing the content from MySQL to HDFS as sequence files using the below sqoop import command:
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db"
--username retail_dba --password cloudera
--table orders
--target-dir /user/cloudera/sqoop_import_seq/orders
--as-sequencefile
--lines-terminated-by '\n' --fields-terminated-by ','
Then I'm creating the Hive table using the below command:
create table orders_seq(order_id int,order_date string,order_customer_id int,order_status string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS SEQUENCEFILE
But when I try to load the sequence data produced by the first command into the Hive table using the below command:
LOAD DATA INPATH '/user/cloudera/sqoop_import_seq/orders' INTO TABLE orders_seq;
I get the below error:
Loading data to table practice.orders_seq
Failed with exception java.lang.RuntimeException: java.io.IOException: WritableName can't load class: orders
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Where am I going wrong?
First of all, is it necessary to have the data in that format?
Let's suppose you do have to have the data in that format. The LOAD DATA command is not necessary: once Sqoop finishes importing the data, you just have to create a Hive table pointing at the same directory where you sqooped the data.
One side note from your scripts:
create table orders_seq(order_id int,order_date string,order_customer_id int,order_status string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS SEQUENCEFILE
Your sqoop command uses --fields-terminated-by ',' but when you create the table you use FIELDS TERMINATED BY '|'.
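If you do stay with a plain delimited-text import, the table DDL should simply use the same delimiter as the sqoop command; a minimal corrected sketch (table name is illustrative):
create table orders_txt(order_id int, order_date string, order_customer_id int, order_status string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;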
In my experience, the best approach is to sqoop the data as Avro; this automatically creates an Avro schema. Then you just have to create a Hive table using the previously created schema (with the AvroSerDe) and the location where you stored the sqooped data.
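A rough sketch of that approach, reusing the connection details from the question (paths are illustrative):
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table orders --target-dir /user/cloudera/sqoop_import_avro/orders --as-avrodatafile
Sqoop also generates an orders.avsc schema file in the local code-generation directory; copy it to HDFS and point a Hive table at it (STORED AS AVRO assumes Hive 0.14+; older versions need the explicit AvroSerDe ROW FORMAT instead):
hdfs dfs -put orders.avsc /user/cloudera/avro_schemas/orders.avsc
CREATE EXTERNAL TABLE orders_avro
STORED AS AVRO
LOCATION '/user/cloudera/sqoop_import_avro/orders'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/cloudera/avro_schemas/orders.avsc');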

How to define Hive table structure using the sqoop import-mainframe --create-hive-table command

We are trying to import a flat mainframe file and load it into a Hive table. I was able to import and load it using sqoop import-mainframe, but the entire file is placed in one column, and that column does not even have a name.
Is there a possibility to define the table structure in sqoop import command itself?
We are using the below command to import from the mainframe and load into the Hive table:
sqoop import-mainframe --connect mainframe.com --dataset mainframedataset --username xxxxx -P --hive-import --create-hive-table --hive-table table1 --warehouse-dir /warehouse/
Sample mainframe data:
ASWIN|1234|1000.00
XXXX|1235|200.00
YYYY|1236|150.00
Hive table create script generated by sqoop:
CREATE TABLE Employee ( DEFAULT_COLUMN STRING) COMMENT 'Imported by sqoop on 2016/08/26 02:12:04' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
As per Sqoop docs,
By default, each record in a dataset is stored as a text record with a newline at the end. Each record is assumed to contain a single text field with the name DEFAULT_COLUMN. When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates.
Your pipe-separated (PSV) file will be loaded into HDFS.
Now create table1 (the Hive table) yourself using:
CREATE TABLE table1 (Name string, Empid int,Amount float) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\012' STORED AS TEXTFILE
Now run your sqoop import command without the --create-hive-table flag. It should work.
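For example, just dropping the --create-hive-table flag from your original command and keeping everything else the same:
sqoop import-mainframe --connect mainframe.com --dataset mainframedataset --username xxxxx -P --hive-import --hive-table table1 --warehouse-dir /warehouse/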

If we are importing data from MySQL to HDFS using Sqoop, what file format is stored in HDFS?

By default, Sqoop imports your data as comma-separated text files. It supports a number of other file formats, which can be selected with the arguments listed below.
Sqoop arguments that control the file format of import commands:
--as-avrodatafile: data is imported as Avro data files.
--as-sequencefile: data is imported as SequenceFiles.
--as-textfile: the default format; imported data is stored as delimited (CSV) text files.
Example: you would pass one of these flags like below (connection details are placeholders):
sqoop import --connect jdbc:mysql://<host>:3306/<db_name> --table <table_name> --as-avrodatafile
When importing a table from MySQL to HDFS using Sqoop, the table data is stored in the /user/<hadoop-username>/<tablename> folder. It will contain two files named _SUCCESS and part-m-00000, and one directory named _logs.
The actual table data is stored in part-m-00000. Most probably it will be a comma-delimited file.
If you want to query the table, it is better to use Hive rather than raw HDFS files. Just import from MySQL into Hive using Sqoop so that you can query the table from the Hive command line in the future.
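To have a quick look at what Sqoop actually produced on HDFS, commands along these lines work (username and table name are placeholders):
hadoop fs -ls /user/<hadoop-username>/<tablename>
hadoop fs -cat /user/<hadoop-username>/<tablename>/part-m-00000 | head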

Can I use Sqoop to import data into RCFile format?

According to http://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1764646
You can import data in one of two file formats: delimited text or
SequenceFiles.
But what about RCFile?
Is it possible to use Sqoop to import data from Oracle DB into HDFS in RCFile format?
If yes, how to do it?
Sqoop does not currently support RC files directly. There is a JIRA, SQOOP-640, to add this functionality.
Step 1: Create an RCFile-formatted (base) table in Hive.
CREATE TABLE IF NOT EXISTS tablename (<hive columns>) STORED AS RCFILE;
Step 2: Sqoop import into this RCFile table using the HCatalog integration.
sqoop import
--connect sourcedburl
--username XXXX
--password XXXX
--table source_table
--hcatalog-database hivedb
--hcatalog-table tablename
[ HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.]
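Alternatively, newer Sqoop versions can create the HCatalog table for you in the desired storage format, so Step 1 can be skipped; a rough sketch (same placeholder connection details as above):
sqoop import --connect sourcedburl --username XXXX --password XXXX --table source_table --hcatalog-database hivedb --hcatalog-table tablename --create-hcatalog-table --hcatalog-storage-stanza "stored as rcfile"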

Sqoop Import is completed successfully. How to view these tables in Hive

I am experimenting with Hadoop and related tools. For this, I have configured Hadoop, HBase, Hive, and Sqoop on an Ubuntu machine.
raghu#system4:~/sqoop$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/mysql --username root --password password --table user --hive-import -m 1
All goes fine, but when I enter the Hive command line and execute show tables, there is nothing. I am able to see that these tables are created in HDFS.
I have seen some options in Sqoop import - it can import to Hive/HDFS/HBase.
When importing into Hive, it is indeed importing directly into HDFS. Then why Hive?
Where can I execute HiveQL to check the data?
From Cloudera support, I understood that I can use Hue to check it. But I think Hue is just a user interface to Hive.
Could someone help me here?
Thanks in advance,
Raghu
I was having the same issue. I was able to work around it by importing the data directly into HDFS and then creating an external Hive table pointing at that specific location in HDFS. Here is an example that works for me:
create external table test (
sequencenumber int,
recordkey int,
linenumber int,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
location '/user/hdfs/testdata';
You will need to change your location to where you saved the data in HDFS.
Can you post the output from Sqoop? Try using the --verbose option.
Here's an example of the command I use, and it does import directly to a Hive table.
sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" --hive-table hive_users --connect jdbc:mysql://$MYSQL_HOST/$DATABASE_NAME --table users --username $MYSQL_USER --password $MYSQL_PASS --hive-import
When we do not specify any database in the sqoop import command, the table is created in Hive's default database with the same name as the RDBMS table.
You can specify the Hive database into which you want to import the RDBMS table with "--hive-database".
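For example (host, credentials, and names are placeholders; if your Sqoop version lacks --hive-database, passing --hive-table <hive_db_name>.<hive_table_name> works as well):
sqoop import --connect jdbc:mysql://<host>:3306/<db_name> --username <username> -P --table <table_name> --hive-import --hive-database <hive_db_name> --hive-table <hive_table_name> -m 1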
Instead of creating the Hive table by hand every time, you can import the table structure into Hive using Sqoop's create-hive-table tool. It imports the table as a managed table; you can then convert it to an external table by changing its table properties, and then add partitions. This reduces the effort of finding the right data types, though note that there may be precision changes.
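A rough sketch of that workflow (connection details and names are placeholders):
sqoop create-hive-table --connect jdbc:mysql://<host>:3306/<db_name> --username <username> -P --table <rdbms_table> --hive-table <hive_db>.<hive_table>
ALTER TABLE <hive_db>.<hive_table> SET TBLPROPERTIES('EXTERNAL'='TRUE');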
Whenever you use Sqoop with the Hive import option, Sqoop connects directly to the source database and gets the corresponding table's metadata (its schema), so there is no need to create the table structure in Hive yourself. This schema is then provided to Hive when the --hive-import option is used.
Without the Hive option, the Sqoop output on HDFS is by default stored under /user/<username>/<tablename>/ as part-m files.
With the Hive import option, the tables land directly in the default warehouse directory, i.e.
/user/hive/warehouse/<tablename>
Command: sudo -u hdfs hadoop fs -ls -R /user/
This recursively lists all the files under /user.
Now go to Hive and type show databases. If there is only the default database, then type show tables.
Remember that OK is just standard Hive console output and is not part of the command output.
hive> show databases;
OK
default
Time taken: 0.172 seconds
hive> show tables;
OK
genre
log_apache
movie
moviegenre
movierating
occupation
user
Time taken: 0.111 seconds
Try a sqoop command like this; it works for me and creates the Hive table directly, so you need not create an external table every time:
sqoop import --connect DB_HOST --username ***** --password ***** --query "select * from SCHEMA.TABLE where \$CONDITIONS"
--num-mappers 5 --split-by PRIMARY_KEY --hive-import --hive-table HIVE_DB.HIVE_TABLE_NAME --target-dir SOME_DIR_NAME;
The command you are using imports data into the $HIVE_HOME directory. If the HIVE_HOME environment variable is not set or points to the wrong directory, you will not be able to see the imported tables.
The best way to find the Hive home directory is to use the HiveQL SET command:
hive -S -e 'SET' | grep warehouse.dir
Once you have retrieved the Hive home directory, append the --hive-home <hive-home-dir> option to your command.
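For example, taking the import command from the question and adding the option (the value is the directory retrieved above):
sqoop-import --connect jdbc:mysql://localhost:3306/mysql --username root --password password --table user --hive-import --hive-home <hive-home-dir> -m 1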
Another possible reason is that in some Hive setups the metadata is cached and you cannot see the changes immediately. In this case you need to flush the metadata cache using the INVALIDATE METADATA; command.
