Hive Table Creation based on file structure - hadoop

I have a question: is there any way in Hive to create a table while loading data into the Hive warehouse or into an external table?
As far as I know, Hive is based on schema-on-read, so the table structure must match the file structure. But what if the file is huge and we don't know its structure, for example its columns and their data types?
How do we load such files into a Hive table then?
In short: how do I load a file from HDFS into a Hive table without knowing its schema?
I'm new to Hive, so pardon me if my understanding is wrong.
Thanks

With Sqoop you can create the Hive table while importing the data (see its --hive-import and --create-hive-table options).
Please refer to this link to create a Hive table while importing data
(or)
If you have imported the data in Avro format, you can generate the Avro schema with
/usr/bin/Avro/avro-tools-*.jar and then use the generated schema when creating the table in Hive. Hive takes the column definitions from the schema and reads the data from HDFS.
Please refer to this link to extract the schema from an Avro data file
(or)
When you import data with sqoop --as-avrodatafile, Sqoop creates an .avsc file containing the schema, so you can use this .avsc file when creating the table.
CREATE EXTERNAL TABLE avro_tbl
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<hdfs-location>'
TBLPROPERTIES ('avro.schema.url'='<schema-file>');
(or)
If you import the data with NiFi and NiFi pulls it in Avro format, you can use the ExtractAvroMetadata processor to extract the Avro schema, store it in HDFS, and create the table from that schema.
If you want the table in ORC format, the ConvertAvroToORC processor adds a hive.ddl attribute to the flowfile; you can execute that DDL statement to create the ORC table in Hive.

Related

Schema on read in hive for tsv format file

I am new to Hadoop. I have data in TSV format with 50 columns and I need to store it in Hive. How can I create the table and load the data on the fly using schema on read, without manually writing a CREATE TABLE statement?
Hive requires you to run a CREATE TABLE statement because the Hive metastore must be updated with the description of what data location you're going to be querying later on.
Schema-on-read doesn't mean that you can query every possible file without knowing metadata beforehand such as storage location and storage format.
SparkSQL or Apache Drill, on the other hand, will let you infer the schema from a file, but you must again define the column types for a TSV if you don't want everything to be a string column (or coerced to unexpected types). Both of these tools can interact with a Hive metastore for "decoupled" storage of schema information.
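When the column types are not known up front, one common workaround is to declare every column as STRING in the CREATE TABLE statement and cast at query time. The following is only a sketch under assumptions: the table name events_raw, the column names, and the HDFS path are hypothetical, only three of the 50 columns are shown, and the skip.header.line.count property is included on the assumption that the TSV file has a header row.

```sql
-- Hypothetical table, columns, and location; extend the column list to all 50 fields.
CREATE EXTERNAL TABLE events_raw (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/tsv/events'
-- Assumes the first line of each file is a header; drop this property otherwise.
TBLPROPERTIES ('skip.header.line.count'='1');

-- Types are applied when reading, once you have learned what each column holds.
SELECT CAST(col1 AS BIGINT) AS id,
       col2                 AS name,
       CAST(col3 AS DOUBLE) AS amount
FROM events_raw
LIMIT 10;
```

Values that cannot be cast come back as NULL, which makes it easy to spot columns whose guessed type is wrong.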
You can use Hue:
http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/
Or, with Spark, you can infer the schema of the CSV file and save it as a Hive table.
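// Assumes a spark-shell session where `spark` was created with Hive support enabled.
val df = spark.read
  .option("delimiter", "\t")
  .option("header", "true")
  .option("inferSchema", "true") // <-- HERE: Spark scans the file to guess column types
  .csv("/home/cloudera/Book1.csv")
// Save the data with the inferred schema as a Hive table ("book1" is a placeholder name).
df.write.mode("overwrite").saveAsTable("book1")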

Where does Hive data get stored?

I am a little confused about where Hive stores its data.
Does it store its data in HDFS or in an RDBMS?
Does the Hive metastore use an RDBMS to store the Hive tables' metadata?
Thanks in advance!
Hive data is stored in a Hadoop-compatible filesystem: S3, HDFS, or another compatible filesystem.
Hive metadata is stored in an RDBMS such as MySQL; see supported RDBMS.
The location of a Hive table's data in S3 or HDFS can be specified for both managed and external tables.
The difference between managed and external tables is that, for a managed table, DROP TABLE drops the table and deletes its data, whereas for an external table DROP TABLE drops only the table definition; the data remains as is and can be used to create other tables over it.
See details here: Create/Drop/Truncate Table
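The difference can be sketched with two small tables. The table names, columns, and the /data/sales path below are hypothetical and only serve as an illustration.

```sql
-- Managed table: files live under the warehouse directory
-- (hive.metastore.warehouse.dir) unless a LOCATION is given.
CREATE TABLE sales_managed (id BIGINT, amount DOUBLE);

-- External table: Hive only records the location; the files belong to you.
CREATE EXTERNAL TABLE sales_external (id BIGINT, amount DOUBLE)
LOCATION '/data/sales';

-- Shows the storage location and the table type (MANAGED_TABLE or EXTERNAL_TABLE).
DESCRIBE FORMATTED sales_managed;
DESCRIBE FORMATTED sales_external;

DROP TABLE sales_managed;   -- removes the metastore entry and the data files
DROP TABLE sales_external;  -- removes only the metastore entry; /data/sales is untouched
```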
Here are the answers to your questions, but I suggest reading a Hive book or the Apache Hive site for a better understanding.
Does it store its data in HDFS or in an RDBMS? Hive data is stored in HDFS (or another Hadoop-compatible filesystem), not in the RDBMS. For managed tables, the data is stored by default in the Hive warehouse, which is a directory in HDFS. For external tables, the user can specify a location anywhere in HDFS.
Does the Hive metastore use an RDBMS to store the Hive tables' metadata? Yes, Hive uses an RDBMS to store the metadata.

Create hive table from table schema stored in .avsc file

I have a Hive table schema stored in an HDFS file, schema.avsc.
I want to create a Hive table with the same schema and load data into it from another HDFS path.
1: How can I create the table?
2: How can I load the data stored in the HDFS file into the created table?
How can I create the table?
The Apache Hive documentation on the AvroSerDe shows the syntax for creating a table based on an Avro schema stored in a file. For convenience, I'll repeat one of the examples here:
CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='http://schema_provider/kst.avsc');
This example pulls the schema file from a web server. The documentation also shows other options, such as pulling from a local file, depending on your specific needs.
I recommend reading the entire AvroSerDe documentation page. There is a lot of useful information there about getting the most out of using Hive with Avro.
How can I load the data stored in the HDFS file into the created table?
You can define an external table that references the existing HDFS files. The documentation page for External Tables shows the syntax. Repeating an example:
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
After defining the external table, you can then use an INSERT-SELECT query that reads from the external table and writes to the Avro table. The documentation on Inserting data into Hive Tables from queries describes the INSERT-SELECT syntax. For example:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt;
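Putting the two steps together for the situation in the question might look roughly like the sketch below. It assumes Hive 0.14 or later (for STORED AS AVRO), that the source files are comma-delimited text, and that schema.avsc declares two fields, id (long) and name (string), in that order. The table names, column names, and both HDFS paths are hypothetical.

```sql
-- 1. External staging table over the existing HDFS files (hypothetical path).
CREATE EXTERNAL TABLE events_stg (
  id   BIGINT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/incoming/events';

-- 2. Avro table whose columns are taken from the schema file (hypothetical path).
CREATE TABLE events_avro
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/schema.avsc');

-- 3. Copy the rows; the SELECT list must match the field order in schema.avsc.
INSERT OVERWRITE TABLE events_avro
SELECT id, name
FROM events_stg;
```

If the files at the source path are already Avro files written with this schema, the staging table is unnecessary: an external Avro table with a LOCATION pointing at them should be enough.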

External Hive table from AVRO files says it has no data

I created an external Hive table that points to a location containing several Avro files. The CREATE statement ran without any issues and produced the expected columns. However, the table returns no data when I run a query. I tried creating the table in a few different ways and couldn't get it to work. I have also verified that the directory contains the Avro files.
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as AVRO
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
Any ideas?
It turns out the schema file (which was produced by Sqoop) was incorrect. I ended up creating a new schema file with "avro-tools getschema". Once I used that schema file, everything worked as expected.
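If the table already exists, one option after regenerating the schema is to point the table at the new file rather than recreating it. This is a minimal sketch, assuming the corrected schema was uploaded under the hypothetical name ags_fixed.avsc next to the original one.

```sql
-- ags_fixed.avsc is a hypothetical name for the schema regenerated with avro-tools.
ALTER TABLE table_name
SET TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags_fixed.avsc');

-- Quick checks that the columns and the rows are now visible.
DESCRIBE table_name;
SELECT * FROM table_name LIMIT 10;
```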

Load from HIVE table into HDFS as AVRO file

I want to write a Hive table out to HDFS as an .avro file.
Currently I am able to export a table from Hive to HDFS as a file, but I am not able to specify the format of the target file. Can someone help me with this?
So your question is really
How do I convert a Hive table to a different storage format?
Create a new table with the same fields and types as the existing table, but declare it with the Avro storage format. Then insert into the new table from the old table.
INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable;
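As a rough sketch of both steps, assuming Hive 0.14 or later (for STORED AS AVRO) and an oldtable with two hypothetical columns, id and name. The LOCATION is also a placeholder; it is what lets you choose where the Avro files land in HDFS.

```sql
-- Hypothetical columns and path; copy the real column list from DESCRIBE oldtable.
CREATE EXTERNAL TABLE newtable (
  id   BIGINT,
  name STRING
)
STORED AS AVRO
LOCATION '/user/hive/export/newtable_avro';

-- Hive rewrites the rows as Avro container files under the LOCATION above.
INSERT OVERWRITE TABLE newtable
SELECT id, name
FROM oldtable;
```

Because the table is external, dropping newtable afterwards leaves the Avro files in place.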

Resources