Data in HDFS files not seen under hive table - hadoop

I have to create a hive table from data present in oracle tables.
I'm doing a sqoop, thereby converting the oracle data into HDFS files. Then I'm creating a hive table on the HDFS files.
The sqoop completes successfully and the files also get generated in the HDFS target directory.
Then I run the create table script in hive. The tables gets created. But it is an empty table, no data is seen in the hive table.
Has anyone faced a similar problem?

Hive default delimiter is ctrlA, if you don't specify any delimiter it will take default delimiter. Add below line in your hive script .
row format delimited fields terminated by '\t'

Your Hive script and your expectation is wrong. You are trying to create a partitioned table on the data that you have already imported, partitions won't work that way. If your query has no partition in it then you can able to see data.
Basically If you want partitioned table , you can't create on the under lying data like you have tried above. If you want hive partition load the data from intermediate table or that sqoop directory to your partitioned table to get Hive partitions.

Related

Unable to partition hive table backed by HDFS

Maybe this is an easy question but, I am having a difficult time resolving the issue. At this time, I have an pseudo-distributed HDFS that contains recordings that are encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive table and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition the date is both for performance but, more so, because the "LOAD DATA ... " statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against hive table without partitioning.
Any thoughts ?
I see that you have created EXTERNAL TABLE. So you cannot add or drop partition using hive. you need to create a folder using hdfs or MR or SPARK. EXTERNAL table can only be read by hive but not managed by HDFS. You can check the hdfs location '/test/dt=20171117' and you will see that folder has not been created.
My suggestion is create the folder(partition) using "hadoop fs -mkdir '/test/20171117'" then try to query the table. although it will give 0 row. but you can add the data to that folder and read from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the elephantbird library works, but you'll want to double check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore

Create a HIVE table and save it to a tab-separated file?

I have some data in hdfs.
This data was migrated from a PostgreSQL database by using Sqoop.
The data has the following hadoopish format, like _SUCCESS, part-m-00000, etc.
I need to create a Hive table based on this data and then I need to export this table to a single tab-separated file.
As far as I know, I can create a table this way.
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Then I can save the table as tsv file:
hive -e 'select * from some_table' > /home/myfile.tsv
I don't know how to load data from hdfs into a Hive table.
Moreover, should I manually define the structure of a table using create or is there any automated way when all columns are created automatically?
I don't know how to load data from hdfs into Hive table
You create a table schema over a hdfs directory like you're doing.
should I manually define the structure of a table using create or is there any automated way when all columns are created automatically?
Unless you didn't tell sqoop to create the table, you must do it manually.
export this table into a single tab-separated file.
A query might work, or unless sqoop set the delimiter to \t, then you need to create another table from the first specifying such column separator. And then, you don't even need to query the table, just run hdfs dfs -getMerge on the directory

how to preprocess the data and load into hive

I completed my hadoop course now I want to work on Hadoop. I want to know the workflow from data ingestion to visualize the data.
I am aware of how eco system components work and I have built hadoop cluster with 8 datanodes and 1 namenode:
1 namenode --Resourcemanager,Namenode,secondarynamenode,hive
8 datanodes--datanode,Nodemanager
I want to know the following things:
I got data .tar structured files and first 4 lines have got description.how to process this type of data im little bit confused.
1.a Can I directly process the data as these are tar files.if its yes how to remove the data in the first four lines should I need to untar and remove the first 4 lines
1.b and I want to process this data using hive.
Please suggest me how to do that.
Thanks in advance.
Can I directly process the data as these are tar files.
Yes, see the below solution.
if yes, how to remove the data in the first four lines
Starting Hive v0.13.0, There is a table property, tblproperties ("skip.header.line.count"="1") while creating a table to tell Hive the number of rows to ignore. To ignore first four lines - tblproperties ("skip.header.line.count"="4")
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE
tblproperties("skip.header.line.count"="4");
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the below steps to achieve your goal:
Copy the data(ie.tar file) to the client system where hadoop is installed.
Untar the file and manually remove the description and save it in local.
Create the metadata(i.e table) in hive based on the description.
Eg: If the description contains emp_id,emp_no,etc.,then create table in hive using this information and also make note of field separator used in the data file and use the corresponding field separator in create table query. Assumed that file contains two columns which is separated by comma then below is the syntax to create the table in hive.
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ','
Since, data is in structured format, you can load the data into hive table using the below command.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME.
Now, local data will be moved to hdfs and loaded into hive table.
Finally, you can query the hive table using SELECT * FROM TABLENAME;

Sqoop - Create empty hive partitioned table based on schema of oracle partitioned table

I have an oracle table which has 80 columns and id partitioned on state column. My requirement is to create a hive table with similar schema of oracle table and partitioned on state.
I tried using sqoop -create-hive-table option. But keep getting an error
ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.IllegalArgumentException: Partition key state cannot be a column to import.
I understand that in Hive the partitioned column should not be in table definition, but then how do I get around the issue?
I do not want to manually write create table command, as I have 50 such tables to import and would like to use sqoop.
Any suggestion or ideas?
Thanks
There is a turn around for this.
Below is the procedure i fallow :
On Oracle run query to get the schema for a table and store it to a file.
Move that file to Hadoop
On Hadoop create a shell script which constructs a HQL file.
That hql file contains "Hive create table statement along with columns". For this we can use the above file(Oracle schema file copied to hadoop).
For this script to run u need to just pass Hive database name,table name, partition column name,path, etc.. depending on u r customization level.At the end of this shell script add "hive -f HQL filename".
If everything is ready it just takes couple of mins for each table creation.

How to Load Data into Hive from HDFS

I am Trying to load data into hive from HDFS . But I Observed that data is moving , meaning after loading the data into hive environment if i look at the HDFS the data which i have loaded is not present . can You Please answer this question with example .
If you would like to create a table in Hive from data in HDFS without moving the data into /user/hive/warehouse/, you should use the optional EXTERNAL and LOCATION keywords. For example, from this page, we have the following example CREATE TABLE statement:
hive> CREATE EXTERNAL TABLE userline(line STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/home/admin/userdata';
Without those, Hive will take your data from HDFS and load it into /user/hive/warehouse (and if the table is dropped, the data is also deleted).

Resources