Hive INSERT and LOAD DATA queries are not working - Hadoop

My Hive queries are not working. Hive allows me to create databases, show databases, and create tables, but it does not let me move a local file into an HDFS table, and INSERT queries are not working either.
I tried reinitializing my metastore, reformatting the namenode, and recreating every directory, but still nothing works.
My datanode is not starting. Is this problem related to the datanode? What should I do?
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Error caching map.xml.
This error appears whenever I run any query other than CREATE TABLE or CREATE DATABASE.

From the errors above, you will not be able to write to HDFS.
Hive allows me to create databases, show databases, and create tables, but it does not let me move a local file into an HDFS table, and INSERT queries are not working either.
CREATE DATABASE and CREATE TABLE only create metadata (and empty directories), while LOAD DATA and INSERT have to write actual file blocks to HDFS, which fails if no datanode is running or HDFS is out of space. Freeing up HDFS space (and getting the datanode running again) should fix this.
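A quick way to confirm this, assuming a standard Hadoop setup (adjust paths and users to your cluster):
jps                      # the DataNode process should appear in this list
hdfs dfsadmin -report    # shows live datanodes and the remaining HDFS capacity
hdfs dfs -df -h /        # shows how much free space HDFS reports
If no datanode shows up as live, check the datanode log first; a common cause after reformatting the namenode is a clusterID mismatch in the datanode's data directory, which keeps the datanode from starting.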

Related

SAS to HIVE2 Cloudera - Error trying to write

I get the following error while trying to write to the Hive2 database:
ERROR: java.io.IOException: Could not get block locations. Source file "/tmp/sasdata-e1-...dlv - Aborting...block==null
The error appears when trying to write a new table or append rows to an existing table. I can connect correctly to the database (through a libname) and read tables from the schema, but when I try to create a new table, the table gets created empty because the error above happens.
Can someone help, please?
Thank you
Remember that Hive is mostly just a metadata store that helps you read files from HDFS. Yes, it does this through a database paradigm, but it is really operating on HDFS: each table is created in an HDFS directory, and files are created inside it.
This sounds like you don't have write permission on the HDFS folder you are writing to (but you do have read permission).
To solve this problem you need to understand which user you are using and where the data is being written.
If you are creating a simple table, check whether you can write to the Hive warehouse directory. If you are deliberately creating files in a specific HDFS folder, check that folder instead.
Here's a command to help you determine where the data is being written:
show create table [mytable]
If it doesn't mention an HDFS location, you need to get permissions on the Hive warehouse (typically located at hdfs:/user/hive/warehouse, but if it is elsewhere the actual location is defined in $HIVE_HOME/conf/hive-default.xml).
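A minimal sketch for checking this, assuming the default warehouse path (run it as the user that submits the Hive queries; perm_test is just a throwaway file name):
hdfs dfs -ls /user/hive/warehouse                  # note the owner, group and permissions of the directory
hdfs dfs -touchz /user/hive/warehouse/perm_test    # if this fails, you have the permission problem described above
hdfs dfs -rm /user/hive/warehouse/perm_test        # clean up the test file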

Main purpose of the MetaStore in Hive?

I am a little confused about the purpose of the MetaStore. When you create a table in Hive:
CREATE TABLE <table_name> (column1 data_type, column2 data_type);
LOAD DATA INPATH '<HDFS_file_location>' INTO TABLE managed_table;
So I know this command takes the contents of the file in HDFS and creates a metadata form of it and stores it in the MetaStore (including the column types, column names, the location of the data in HDFS, and so on). It doesn't actually move the data from HDFS into Hive.
But what is the purpose of storing this metadata?
When I connect to Hive using Spark SQL, for example, the MetaStore doesn't contain the actual information from HDFS, just metadata. So is the MetaStore simply used by Hive for the parsing and compiling steps of a HiveQL query and to create the MapReduce jobs?
The metastore is for storing schema (table definitions, including the location in HDFS, SerDe, columns, comments, types, partition definitions, views, access permissions, etc.) and statistics. There is no such operation as moving data from HDFS to Hive, because Hive table data is stored in HDFS (or another compatible filesystem such as S3). You can define a new table, or even several tables, on top of some location in HDFS and put files in it. You can change an existing table location or a partition location; all this information is stored in the metastore, so Hive knows how to access the data. A table is a logical object defined in the metastore, and the data itself is just files in some location in HDFS.
See also this answer about the Hive query execution flow (high level): https://stackoverflow.com/a/45587873/2700344
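To make the "table on top of a location" idea concrete, here is a sketch (the table name, columns, delimiter, and paths are made up for illustration):
-- define a table over files that already exist in HDFS; nothing is moved or copied
CREATE EXTERNAL TABLE events (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/events/2018';
-- re-point the table: only the metastore entry changes, the files stay where they are
-- (some Hive versions expect a full hdfs:// URI here)
ALTER TABLE events SET LOCATION '/data/events/2019';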
Hive performs schema-on-read operations, which means that for the data to be processed in a structured manner (i.e. as a table-like object), the layout of that data needs to be described in a relational structure.
takes the contents of the file in HDFS and creates a MetaData form of it
As far as I know, no files are actually read when you create a table.
SparkSQL connects to the metastore directly. Both Spark and HiveServer have their own query parsers; parsing is not part of the metastore. MapReduce/Tez/Spark jobs are not handled by the metastore either. It's just a relational database. If it's MySQL, Postgres, or Oracle, you can easily connect to it and inspect the contents. By default, both Hive and Spark use an embedded Derby database.

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.
This is the CREATE TABLE statement that I am using:
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly performance but, more so, because "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have verified that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE, so you cannot expect Hive to manage the partition directories for you; you need to create the folder yourself using HDFS, MR, or Spark. An EXTERNAL table is only read by Hive, not managed by it. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then try to query the table. It will return 0 rows, but you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a SequenceFile? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your partition directories need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
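For the table in this question, that repair step could look like the sketch below, assuming the partition directories follow the dt=... naming convention under the table's LOCATION:
-- e.g. directories such as /test/dt=20171117/ containing the sequence files
MSCK REPAIR TABLE test_messages;
SHOW PARTITIONS test_messages;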

Questions about Hive

I have this environment:
Hadoop environment (1 master, 4 slaves) with several applications: Ambari, Hue, Hive, Sqoop, HDFS... and a separate server in production (not part of the Hadoop cluster) with a MySQL database.
My goal is:
Optimize the queries made on this MySQL server that are slow to execute today.
What did I do:
I imported the MySQL data into HDFS using Sqoop.
My doubts:
Can't I run SELECTs directly on the data in HDFS using Hive?
Do I have to load the data into Hive to run the queries?
If new data is entered into the MySQL database, what is the best way to get this data, insert it into HDFS, and then into Hive again? (Maybe in real time)
Thank you in advance
Can't I run SELECTs directly on the data in HDFS using Hive?
You can. Create an external table in Hive that points to your HDFS location; then you can run any HQL over it.
Do I have to load the data into Hive to run the queries?
With an external table, you don't need to load data into Hive; the data stays in the same HDFS directory.
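For example, a sketch of such an external table over the directory Sqoop wrote to (the path, columns, and delimiter are assumptions; match them to your actual Sqoop output, which is comma-delimited text by default):
CREATE EXTERNAL TABLE orders (id INT, customer_id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/orders';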
If new data is entered into the MySQL database, what is the best way to get this data?
You can use Sqoop incremental import for this. It fetches only newly added/updated data (depending on the incremental mode). You can create a Sqoop job and schedule it as needed.
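A sketch of such a job (the connection string, credentials, table, and column names are placeholders; append mode on an auto-increment key is shown, and lastmodified mode is the alternative when existing rows get updated):
sqoop job --create incr_orders -- import \
  --connect jdbc:mysql://mysql-host/salesdb \
  --username etl --password-file /user/etl/.mysql-pass \
  --table orders --target-dir /user/hadoop/orders \
  --incremental append --check-column id --last-value 0
# run it from cron (or Oozie); Sqoop stores the new last-value after each run
sqoop job --exec incr_orders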
You can try Impala, which is much faster than Hive for SQL queries. You need to define tables, most likely specifying a delimiter, the storage format, and where the data is stored on HDFS (I don't know what kind of data you are storing). Then you can write SQL queries that read the data from HDFS.
I have no experience with real-time data ingestion from relational databases, but you can try scheduling Sqoop jobs with cron.

How Hive stores the data (loaded from HDFS)?

I am fairly new to Hadoop (HDFS and HBase) and the Hadoop ecosystem (Hive, Pig, Impala, etc.). I have a good understanding of Hadoop components such as the NameNode, DataNode, Job Tracker, and Task Tracker, and how they work in tandem to store data efficiently.
While trying to understand the fundamentals of a data access layer such as Hive, I need to understand where exactly a table's data (created in Hive) gets stored. We can create external and internal tables in Hive. Since external tables can be in HDFS or any other file system, Hive doesn't store the data for such tables in its warehouse. What about internal tables? Such a table will be created as a directory on the Hadoop cluster. Once we load data into these tables from the local or HDFS file system, are there further files created to store the data in the tables created in Hive?
Say for example:
A sample file named test_emp_feedback.csv was brought from local file system to HDFS.
A table (emp_feedback) was created in Hive with a structure similar to the CSV file structure. This led to the creation of a directory in the Hadoop cluster, say /users/big_data/hive/emp_feedback.
Now, once I create the table and load data into the emp_feedback table from test_emp_feedback.csv:
Is Hive going to create a copy of the file in the emp_feedback directory? Won't that cause data redundancy?
Creating a managed table will create a directory with the same name as the table under the Hive warehouse directory (usually /user/hive/warehouse/dbname/tablename). The table structure (Hive metadata) is created in the metastore (RDBMS/HCatalog).
Before you load data into the table, this directory (with the same name as the table, under the Hive warehouse) is empty.
There are two possible scenarios.
If the table is external, the data is not copied to the warehouse directory at all.
If the table is managed (not external), when you load your data into the table it is moved (not copied) from its current HDFS location to the Hive warehouse directory (/user/hive/warehouse/dbname/tablename). So this will not replicate the data.
Caution: it is always advisable to create an external table unless the data is used only by Hive. Dropping a managed table deletes the data from HDFS (the Hive warehouse).
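A quick way to see the move for yourself, assuming the CSV was staged under /tmp and using the table directory from the question (both paths are only for illustration):
hadoop fs -put test_emp_feedback.csv /tmp/
# now in Hive run: LOAD DATA INPATH '/tmp/test_emp_feedback.csv' INTO TABLE emp_feedback;
hadoop fs -ls /tmp/test_emp_feedback.csv          # the staged file is gone after the load
hadoop fs -ls /users/big_data/hive/emp_feedback/  # it now sits under the table's directory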
To answer your question:
For External Tables:
Hive does not move the data into its warehouse directory. If the external table is dropped, the table metadata is deleted but not the data.
For Internal Tables:
Hive moves the data into its warehouse directory. If the table is dropped, both the table metadata and the data are deleted.
For your reference
Difference between Internal & External tables:
For External Tables
An external table stores its files on HDFS, but the table is not completely tied to the source file.
If you delete an external table, the file still remains on HDFS.
As an example, if you create an external table called "table_test" in Hive using HiveQL and link the table to the file "file", then deleting "table_test" from Hive will not delete "file" from HDFS.
External table files are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level.
Metadata is maintained on the master node, and deleting an external table from Hive only deletes the metadata, not the data/file.
For Internal Tables
Stored in a directory based on the hive.metastore.warehouse.dir setting; by default, internal tables are stored under /user/hive/warehouse. You can change this by updating the location in the config file.
Deleting the table deletes the metadata and the data from the master node and HDFS, respectively.
Internal table file security is controlled solely via Hive. Security needs to be managed within Hive, probably at the schema level (depending on the organization).
Hive may have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
Hive should not own the data and control settings, directories, etc.; you may have another program or process that will do those things.
You are not creating the table based on an existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
Source:
HDInsight: Hive Internal and External Tables Intro
Internal & external tables in Hadoop- HIVE
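As an illustration of the internal vs. external behaviour described above (the table names, columns, and external location are made up):
-- managed (internal) table: DROP removes the metadata and the files under the warehouse directory
CREATE TABLE feedback_managed (id INT, msg STRING);
DROP TABLE feedback_managed;
-- external table: DROP removes only the metastore entry; the files in /data/feedback remain
CREATE EXTERNAL TABLE feedback_external (id INT, msg STRING)
LOCATION '/data/feedback';
DROP TABLE feedback_external;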
It would not cause data redundancy. For managed (not external) tables, Hive moves the data into its warehouse directory. In your example, the data will be moved from its original location on HDFS to '/users/big_data/hive/emp_feedback'.
Be careful with the removal of a managed table; it will also remove the data from HDFS.
You can load the data in two ways:
A) Use LOAD DATA INPATH 'file_location_of_csv' INTO TABLE emp_feedback;
Note that this command will remove the content from the source directory and move it into the (internal) table.
or
B) Use the copyFromLocal or put command to copy the local file into HDFS, then create an external table over that location. The data won't be moved from its source; you can drop the external table and the source data is still available.
e.g.
create external table emp_feedback (
  emp_id int,
  emp_name string
)
row format delimited fields terminated by ','
location '/location_in_hdfs_for_csv_file';
When you drop an external table, only the Hive metadata for the table is dropped. The data still exists at the HDFS location.
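The copy step for option B could look like this (the local file name and HDFS directory are placeholders matching the example above):
hadoop fs -mkdir -p /location_in_hdfs_for_csv_file
hadoop fs -copyFromLocal test_emp_feedback.csv /location_in_hdfs_for_csv_file/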
Got it. This is what I was able to understand so far.
It all depends on which type of table is being created and where the file is picked up from.
