Error creating a Hive table in HDInsight from a different blob container: Path is not legal

CREATE TABLE test1 (Column1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH 'asv://hivetest@mystorageaccount.blob.core.windows.net/foldername' OVERWRITE INTO TABLE test1;
Loading the data generates the following error:
FAILED: Error in semantic analysis: Line 1:18 Path is not legal
''asv://hivetest@mystorageaccount.blob.core.windows.net/foldername'':
Move from:
asv://hivetest@mystorageaccount.blob.core.windows.net/foldername to:
asv://hdi1@hdinsightstorageaccount.blob.core.windows.net/hive/warehouse/test1
is not valid. Please check that values for params "default.fs.name"
and "hive.metastore.warehouse.dir" do not conflict.
The container hivetest is not my default HDInsight container. It is even located on a different storage account. However, the problem is probably not with the account credentials, as I have edited core-site.xml to include mystorageaccount.
How can I load data from a non-default container?

Apparently it's impossible by design to load data into a Hive table from a non-default container. The workaround suggested by the answer in the link is to use an external table.
I was trying to use a non-external table so I could take advantage of partitioning, but apparently it's possible to partition even an external table, as explained here.
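For example, a minimal sketch of that workaround, reusing the placeholder container and account names from the question (the partition column dt and the date value are purely illustrative):
-- sketch only: partitioned external table over the non-default container
CREATE EXTERNAL TABLE test1_ext (Column1 string)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'asv://hivetest@mystorageaccount.blob.core.windows.net/foldername';

-- register each partition directory explicitly; no data is moved
ALTER TABLE test1_ext ADD PARTITION (dt='2013-01-01')
LOCATION 'asv://hivetest@mystorageaccount.blob.core.windows.net/foldername/2013-01-01';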

Related

data deleted from hdfs after using hive load command [duplicate]

When loading data from HDFS into Hive using the
LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;
command, it looks like it moves the hdfs_file to the hive/warehouse directory.
Is it possible (and how?) to copy it instead of moving it, so that the file can be used by another process?
From your question I assume that you already have your data in HDFS.
So you don't need LOAD DATA, which moves the files to the default Hive location /user/hive/warehouse. You can simply define the table using the external keyword, which leaves the files in place but creates the table definition in the Hive metastore. See here:
Create Table DDL
e.g.:
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Please note that the format you use might differ from the default (as mentioned by JigneshRawal in the comments). You can use your own delimiter, for example when using Sqoop:
row format delimited fields terminated by ','
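Putting those two snippets together, a minimal sketch (the path, table, and column names are just the placeholders from the example above) might look like:
-- sketch: external table with a custom delimiter; the files stay where they are
create external table table_name (
id int,
myfields string
)
row format delimited fields terminated by ','
stored as textfile
location '/my/location/in/hdfs';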
I found that when you use EXTERNAL TABLE and LOCATION together, Hive creates the table and initially no data is present (assuming your data location is different from the Hive LOCATION).
When you use the LOAD DATA INPATH command, the data gets MOVED (instead of copied) from the data location to the location you specified while creating the Hive table.
If a location is not given when you create the Hive table, the internal Hive warehouse location is used, and the data gets moved from your source data location to the internal Hive data warehouse location (i.e. /user/hive/warehouse/).
An alternative to LOAD DATA is available in which the data is not moved from your existing source location to the Hive data warehouse location.
You can use the ALTER TABLE command with the LOCATION option. The required command is below:
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07') LOCATION 'hdfs/path/to/location/'
The only condition here is that the location should be a directory instead of a file.
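For context, here is a minimal sketch of the kind of partitioned table the ALTER TABLE command above assumes; the table and column names are illustrative, and only the date_col partition column is taken from the command:
-- a partitioned external table; partitions are then attached with the
-- ALTER TABLE ... ADD PARTITION ... LOCATION command shown above
CREATE EXTERNAL TABLE table_name (
id INT,
myfields STRING
)
PARTITIONED BY (date_col STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;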
Hope this will solve the problem.

Inserting local csv to a Hive table from Qubole

I have a csv on my local machine, and I access Hive through the Qubole web console. I am trying to upload the csv as a new table, but couldn't figure out how. I have tried the following:
LOAD DATA LOCAL INPATH <path> INTO TABLE <table>;
I get an error saying No files matching path file.
I am guessing that the csv has to be on some remote server where Hive is actually running, and not on my local machine. The solutions I saw don't explain how to handle this issue. Can someone help me out regarding this?
Qubole allows you to define Hive external/managed tables on data sitting in your cloud storage (S3 or Azure storage), so LOAD from your local box won't work. You will have to upload the file to your cloud storage and then define an external table against it:
CREATE External TABLE orc1ext(
`itinid` string, itinid1 string)
stored as ORC
LOCATION
's3n://mybucket/def.us.qubole.com/warehouse/testing.db/orc1';
INSERT INTO TABLE orc1ext SELECT itinid, itinid
FROM default.default_qubole_airline_origin_destination LIMIT 5;
First, create a table in Hive using the field names present in your csv file. The syntax you are using seems correct.
Use the syntax below for creating the table:
CREATE TABLE foobar(key string, stats map<string, bigint>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':' ;
and then load the data using the format below, making sure the path name is correct:
LOAD DATA LOCAL INPATH '/yourfilepath/foobar.csv' INTO TABLE foobar;

Hive error - Select * from table ;

I created an external table in Hive, and it was created successfully.
create external table load_tweets(id BIGINT,text STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/cloudera/data/tweets_raw';
But, when I did:
hive> select * from load_tweets;
I got the below error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.ByteArrayInputStream@5dfb0646; line: 1, column: 2]
Please suggest how to fix this. Is the Twitter output file that was created using Flume corrupted, or is it something else?
You'll need to do two additional things.
1) Put data into the file (perhaps using INSERT). Or maybe it's already there. In either case, you'll then need to
2) from Hive, msck repair table load_tweets;
For Hive tables, the schema and other meta-information about the data is stored in what's called the Hive metastore (it's actually a relational database under the covers). When you perform operations on Hive tables created without the LOCATION keyword (that is, internal, not external, tables), Hive will automatically update the metastore.
But most Hive use-cases cause data to be appended to files that are updated using other processes, and thus external tables are common. If new partitions are created externally, before you can query them with Hive you need to force the metastore to sync with the current state of the data using msck repair table <tablename>;.
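If only a handful of partitions were added externally, an alternative to a full repair is to register them by hand. This is a hedged sketch; the table name, partition column, and path are hypothetical rather than taken from the question:
-- assumes a partitioned external table; all names here are hypothetical
ALTER TABLE my_events ADD IF NOT EXISTS PARTITION (dt='2017-02-07')
LOCATION '/data/my_events/dt=2017-02-07';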

Where is HIVE metadata stored by default?

I have created an external table in Hive using following:
create external table hpd_txt(
WbanNum INT,
YearMonthDay INT ,
Time INT,
HourlyPrecip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/user/hive/external';
Now this table is created in location */hive/external.
Step-1: I loaded data in this table using:
load data inpath '/input/hpd.txt' into table hpd_txt;
The data is successfully loaded into the specified path (*/external/hpd_txt).
Step-2: I delete the data from the */hive/external path using the following:
hadoop fs -rmr /user/hive/external/hpd_txt
Questions:
Why is the file deleted from its original path? (*/input/hpd.txt is deleted from HDFS, but the table is created in the */external path.)
After I delete the data from HDFS as in Step-2 and again use show tables;, it still shows the table hpd_txt in the external path.
So where is this coming from?
Thanks in advance.
Hive doesn't know that you deleted the files. Hive still expects to find the files in the location you specified. You can do whatever you want in HDFS, but this doesn't get communicated to Hive. You have to tell Hive if things change.
hadoop fs -rmr /user/hive/external/hpd_txt
For instance, the above command doesn't delete the table; it just removes the file. The table still exists in the Hive metastore. If you want to delete the table, then use:
drop table if exists tablename;
Since you created the table as an external table, this will drop the table from Hive. The files will remain if you haven't removed them. If you want to delete an external table and the files the table is reading from, you can do one of the following:
Drop the table and then remove the files
Change the table to managed and drop the table
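A sketch of the second option, flipping the EXTERNAL flag so that the subsequent DROP also deletes the files (note that the exact capitalization of the property value can matter depending on the Hive version):
-- convert to a managed table so DROP TABLE removes the data as well
ALTER TABLE hpd_txt SET TBLPROPERTIES ('EXTERNAL'='FALSE');
DROP TABLE hpd_txt;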
Finally, the Hive warehouse directory is by default located at /user/hive/warehouse (the table metadata itself lives in the metastore database).
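You can confirm the configured warehouse directory from the Hive shell, for example:
-- prints the current value of the warehouse directory property
SET hive.metastore.warehouse.dir;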
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for the table. This comes in handy if you already have data generated. Otherwise, data will have to be loaded (conventionally, or by creating a file in the directory the Hive table points to).
When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
Source: Hive docs
So, in your Step-2, removing the file /user/hive/external/hpd_txt removes the data source (the data backing the table), but the table still exists and continues to point to hdfs://localhost:9000/user/hive/external as it was created.
@Anoop: Not sure if this answers your question. Let me know if you have any further questions.
Do not use the LOAD DATA INPATH command. The LOAD operation MOVEs (it does not COPY) the data into the corresponding Hive table location. Use put or copyFromLocal to copy a file from the local (non-HDFS) file system into HDFS, and then just provide the HDFS file location in the CREATE TABLE statement after the put command has run.
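As a sketch of that workflow, assuming the csv has already been copied into an HDFS directory such as /data/hpd/ with hadoop fs -put (the directory and table name here are illustrative):
-- assumes the file was already copied into /data/hpd/ with hadoop fs -put
CREATE EXTERNAL TABLE hpd_ext (
WbanNum INT,
YearMonthDay INT,
Time INT,
HourlyPrecip INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/hpd/';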
Deleting an external table does not remove the HDFS file from disk; that is the advantage of an external table. For external tables, Hive just stores the metadata needed to access the data files, so if you drop the table the data file is left untouched at its HDFS location. In the case of internal tables, however, both the metadata and the data are removed when you drop the table.
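If you are not sure whether a given table is external or managed before dropping it, one way to check is:
-- the output includes a Table Type row (EXTERNAL_TABLE or MANAGED_TABLE)
DESCRIBE FORMATTED hpd_txt;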
After going through your helpful comments and other posts, I have found the answer to my question.
If I use the LOAD DATA INPATH command, it "moves" the source file to the location where the external table is created. Although the file won't be affected when the table is dropped, having its location changed is not good. So use LOAD DATA LOCAL INPATH when loading data into internal tables.
To load data into an external table from a file already located in HDFS, use the LOCATION clause in the CREATE TABLE query to point to the source directory, for example:
create external table hpd(WbanNum string,
YearMonthDay string ,
Time string,
hourprecip string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/input/hpd/';
So this sample location points to the data already present in HDFS at that path, and there is no need to use the LOAD DATA INPATH command here.
It's good practice to store source files in their own dedicated directories, so that there is no ambiguity when external tables are created and the data sits in a properly managed directory structure.
Thanks a lot for helping me understand this concept guys! Cheers!

WHY does this simple Hive table declaration work? As if by magic

The following HQL works to create a Hive table in HDInsight which I can successfully query. But, I have several questions about WHY it works:
My data rows are, in fact, terminated by carriage return line feed, so why does 'COLLECTION ITEMS TERMINATED BY \002' work? And what is \002 anyway? And no location for the blob is specified so, again, why does this work?
All attempts at creating the same table and specifying "CREATE EXTERNAL TABLE...LOCATION '/user/hive/warehouse/salesorderdetail'" have failed. The table is created but no data is returned. Leave off "external" and don't specify any location and suddenly it works. Wtf?
CREATE TABLE IF NOT EXISTS default.salesorderdetail(
SalesOrderID int,
ProductID int,
OrderQty int,
LineTotal decimal
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
Any insights are greatly appreciated.
UPDATE: Thanks for the help so far. Here's the exact syntax I'm using to attempt external table creation. (I've only changed the storage account name.) I don't see what I'm doing wrong.
drop table default.salesorderdetailx;
CREATE EXTERNAL TABLE default.salesorderdetailx(SalesOrderID int,
ProductID int,
OrderQty int,
LineTotal decimal)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
LOCATION 'wasb://mycn-1@my.blob.core.windows.net/mycn-1/hive/warehouse/salesorderdetailx'
When you create your cluster in HDInsight, you have to specify the underlying blob storage. It assumes that you are referencing that blob storage. You don't need to specify a location because your query is creating an internal table (see answer #2 below), which is created at a default location. External tables need to specify a location in Azure blob storage (outside of the cluster) so that the data in the table is not deleted when the cluster is dropped. See the Hive DDL for more information.
By default, tables are created as internal, and you have to specify the "external" to make them external tables.
Use EXTERNAL tables when:
Data is used outside Hive
You need data to be updateable in real time
Data is needed when you drop the cluster or the table
Hive should not own data and control settings, directories, etc.
Use INTERNAL tables when:
You want Hive to manage the data and storage
Short term usage (like a temp table)
Creating table based on existing table (AS SELECT)
Does the container "user/hive/warehouse/salesorderdetail" exist in your blob storage? That might explain why it is failing for your external table query.
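For reference, a hedged sketch of the wasb LOCATION syntax using the placeholder names from the question: the container goes before the @, the storage account after it, and then the path inside that container (note that the update above repeats the container name inside the path, which is worth double-checking):
-- wasb://<container>@<storage account>.blob.core.windows.net/<path inside the container>
CREATE EXTERNAL TABLE default.salesorderdetailx (SalesOrderID int,
ProductID int,
OrderQty int,
LineTotal decimal)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
LOCATION 'wasb://mycn-1@my.blob.core.windows.net/hive/warehouse/salesorderdetailx';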
