I am trying to load a file from HDFS into Hive using Spark SQL with the queries below.
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS src (value STRING)")
hiveContext.sql("LOAD DATA INPATH '/data/spark_test/kv1.txt' INTO TABLE src")
hiveContext.sql("FROM src SELECT *").collect().foreach(println)
What I find is that after the second statement (loading the file), the file appears in /apps/hive/warehouse/src/ but is no longer at /data/spark_test/kv1.txt. Why is that? I am using Spark 1.6.1.
This is Hive's default behavior: when you load data into a table with the LOAD DATA INPATH command, Hive moves the original source file to the table location.
You can find the same file inside the table location. Run the commands below to find it.
describe extended src; --copy location
hadoop fs -ls <location>
Since src is an external table, you can create it directly on top of the existing data instead of loading the data in a separate step.
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS src (value STRING) location '/data/spark_test'")
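To check where a table's files ended up after a load, the two commands above can be combined into one step. This is only a rough sketch: it assumes the hive CLI is on the PATH and that DESCRIBE FORMATTED prints a "Location:" line, and the function name table_location is made up for illustration.

```shell
# Print the storage location of a Hive table.
# Assumption: the `hive` CLI is available; `table_location` is a hypothetical helper.
table_location() {
  # DESCRIBE FORMATTED prints a line such as "Location:   hdfs://.../src";
  # awk keeps the second field of that line.
  hive -S -e "DESCRIBE FORMATTED $1" | awk '$1 == "Location:" { print $2 }'
}

# Usage (needs a running cluster):
#   hadoop fs -ls "$(table_location src)"
```

If the moved file shows up in that listing, the load behaved as described above.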
When loading data from HDFS into Hive using the
LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;
command, it looks like it moves hdfs_file to the hive/warehouse directory.
Is it possible to copy it instead of moving it, so that the file can still be used by another process? If so, how?
From your question I assume that you already have your data in HDFS.
So you don't need LOAD DATA, which moves the files to the default Hive location, /user/hive/warehouse. You can simply define the table using the external keyword, which leaves the files in place but creates the table definition in the Hive metastore. See the Hive documentation:
Create Table DDL
For example:
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Please note that the format you use might differ from the default (as mentioned by JigneshRawal in the comments). You can use your own delimiter, for example when using Sqoop:
row format delimited fields terminated by ','
I found that when you use EXTERNAL TABLE and LOCATION together, Hive creates the table and initially no data is present (assuming your data is in a different location from the Hive LOCATION).
When you use the LOAD DATA INPATH command, the data gets moved (not copied) from its source location to the location you specified when creating the Hive table.
If no location is given when you create the Hive table, Hive uses its internal warehouse location, and the data is moved from your source location to the warehouse (i.e. /user/hive/warehouse/).
There is an alternative to LOAD DATA that does not move the data from its existing source location to the Hive warehouse.
You can use the ALTER TABLE command with the LOCATION option. The required command is:
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07') LOCATION 'hdfs/path/to/location/'
The only condition is that the location must be a directory, not a file.
Hope this will solve the problem.
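If a real copy is needed rather than an external table or partition, one workaround is to copy the file to a staging directory first and run LOAD DATA on the copy, so the original stays where the other process expects it. A minimal sketch, assuming the hdfs and hive CLIs are on the PATH; the function name, the staging directory and the argument order are all hypothetical.

```shell
# Copy an HDFS file to a staging directory, then LOAD DATA from the copy.
# LOAD DATA INPATH moves the staged copy; the original file is left untouched.
# Assumption: `hdfs` and `hive` CLIs are available; `copy_then_load` is a hypothetical helper.
copy_then_load() {
  src="$1"; staging="$2"; table="$3"
  hdfs dfs -mkdir -p "$staging" &&
  hdfs dfs -cp "$src" "$staging/" &&
  hive -e "LOAD DATA INPATH '$staging/$(basename "$src")' INTO TABLE $table"
}

# Usage (needs a running cluster):
#   copy_then_load /data/spark_test/kv1.txt /tmp/hive_staging src
```

The trade-off is that the data is stored twice, which the external-table approaches above avoid.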
I am using Vertica 7.2 and am trying to access ORC data in HDFS. The directory location in HDFS is '/user/<path_to_ORC_dir>/' with all the ORC files that underlie a Hive table that is stored in ORC format.
The hadoopConfDir parameter in Vertica has been set to /etc/hadoop/conf. The hadoop conf directory from the separate hadoop cluster has been copied to each node in the Vertica cluster under /etc/hadoop/conf/. I have made an external Vertica table to read from the hdfs location using:
CREATE EXTERNAL TABLE test (col1 INT, etc...) AS COPY FROM 'hdfs:///user/<path_to_ORC_dir>/*' on any node orc;
However, when I try to query from the table I get the following error
select * from test;
Error opening file [hdfs:///user/<path_to_ORC_dir>/000004_0] for read: Could not find HDFS configurations for [hdfs:///user/<path_to_ORC_dir>/000004_0]
My ORC files are named 0..._0 and the file specified in the error changes each time I query.
When I make a table using the specified file in error, instead of the entire directory, I can query the table without any problems.
CREATE EXTERNAL TABLE test1 (col1 INT, etc...) AS COPY FROM 'hdfs:///user/<path_to_ORC_dir>/000004_0' on any node orc;
select * from test1;
...correct results...
What is the cause of the HDFS configuration error when trying to read the entire directory as opposed to the single file? Also to note, I can query the Hive table that is built on 'hdfs:///user/<path_to_ORC_dir>/' without any problems.
I have created an external table in Hive using the following:
create external table hpd_txt(
WbanNum INT,
YearMonthDay INT ,
Time INT,
HourlyPrecip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/user/hive/external';
Now this table is created in location */hive/external.
Step-1: I loaded data in this table using:
load data inpath '/input/hpd.txt' into table hpd_txt;
The data is successfully loaded into the specified path (*/external/hpd_txt).
Step-2: I delete the table from the */hive/external path using the following:
hadoop fs -rmr /user/hive/external/hpd_txt
Questions:
Why is the file removed from its original path? (*/input/hpd.txt is deleted from HDFS, but the table is created in the */external path.)
After I delete the table from HDFS as in step 2 and run show tables; again, it still lists the table hpd_txt in the external path.
So where is this coming from?
Thanks in advance.
Hive doesn't know that you deleted the files. Hive still expects to find the files in the location you specified. You can do whatever you want in HDFS and this doesn't get communicated to Hive. You have to tell Hive if things change.
hadoop fs -rmr /user/hive/external/hpd_txt
For instance, the above command doesn't delete the table; it just removes the files. The table still exists in the Hive metastore. If you want to delete the table, use:
drop table if exists tablename;
Since you created the table as an external table, this drops the table from Hive but leaves the files in place if you haven't removed them. If you want to delete an external table and the files it reads from, you can do one of the following:
Drop the table and then remove the files
Change the table to managed and drop the table
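The first option could look roughly like this. It is a sketch only: it assumes the hive and hdfs CLIs are on the PATH, and the function name and its arguments are invented for illustration.

```shell
# Drop an external table from the metastore, then delete its data directory.
# Assumption: `hive` and `hdfs` CLIs are available; `drop_table_and_files` is a hypothetical helper.
drop_table_and_files() {
  table="$1"; location="$2"
  # Removes only the table definition, because the table is external.
  hive -e "DROP TABLE IF EXISTS $table" &&
  # Removes the files the table was reading from.
  hdfs dfs -rm -r "$location"
}

# Usage (needs a running cluster):
#   drop_table_and_files hpd_txt /user/hive/external
```

Dropping the table first means Hive never points at a directory that has already been deleted.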
Finally, the default warehouse directory for Hive (hive.metastore.warehouse.dir) is /user/hive/warehouse; the metastore itself is a separate database that holds the table definitions.
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. Otherwise, you load the data yourself (conventionally, or by creating a file in the directory the Hive table points to).
When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
Source: Hive docs
So, in your step 2, removing /user/hive/external/hpd_txt removes the data source (the data the table points to), but the table still exists and continues to point to hdfs://localhost:9000/user/hive/external, as it was created.
@Anoop: Not sure if this answers your question. Let me know if you have any further questions.
Do not use the LOAD DATA INPATH command. The load operation moves (does not copy) the data into the corresponding Hive table. Use put or copyFromLocal to copy the file from the local file system into HDFS, then provide that HDFS location in the CREATE TABLE statement.
Dropping an external table does not remove the HDFS files from disk. That is the advantage of an external table: Hive stores only the metadata needed to access the data files, so if you drop the table, the data files are left untouched in their HDFS location. For internal tables, Hive also owns the data, and both metadata and data are removed when you drop the table.
After going through your helpful comments and other posts, I have found the answer to my question.
If I use the LOAD DATA INPATH command, it moves the source file to the location where the external table is created. The file won't be affected by dropping the table, but changing its location is not good. So use LOAD DATA LOCAL INPATH when loading data into internal tables.
To load data into an external table from files already in HDFS, use LOCATION in the CREATE TABLE query to point to the directory that contains the source files, for example:
create external table hpd(WbanNum string,
YearMonthDay string ,
Time string,
hourprecip string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/input/hpd/';
This sample location points to the data already present in HDFS at that path, so there is no need to use the LOAD DATA INPATH command here.
It's good practice to store source files in their own dedicated directories, so that there is no ambiguity when external tables are created on top of them.
Thanks a lot for helping me understand this concept guys! Cheers!
I am using Hive v0.13.
My data is stored in HDFS, and I use CREATE EXTERNAL TABLE to create a table over that data. Everything works fine and I can issue SELECT statements. My question is: under the warehouse directory (hive.metastore.warehouse.dir) I don't see any files or data being added. Is this normal? I know that with an external table the data is not copied to the warehouse directory, but shouldn't the table metadata be stored there?
When you create an internal table, Hive creates a directory named after the table under the directory specified in hive.metastore.warehouse.dir. For me it is /apps/hive/warehouse.
Suppose you have created a table named test_tbl. Then there will be a directory /apps/hive/warehouse/test_tbl, and Hive stores the metadata in MySQL or whichever RDBMS you configured for the metastore. When you load data with the LOAD DATA INPATH command, it goes into this directory.
For an external table, you specify a location in your CREATE statement, so Hive doesn't create any directory in the default warehouse directory because you have already provided the location. It just stores the metadata in the RDBMS.
You can put data directly into that location with the hdfs dfs -put command, and Hive will treat it as data for the table associated with that directory. So this is expected behavior for an external table.
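For example, adding a local file to an external table's directory might look like the sketch below. It assumes the hdfs CLI is on the PATH; the function name and the paths in the usage line are hypothetical.

```shell
# Upload a local file into the HDFS directory that an external table points to.
# Assumption: the `hdfs` CLI is available; `add_file_to_table_dir` is a hypothetical helper.
add_file_to_table_dir() {
  file="$1"; table_dir="$2"
  hdfs dfs -mkdir -p "$table_dir" &&
  hdfs dfs -put "$file" "$table_dir/"
}

# Usage (needs a running cluster):
#   add_file_to_table_dir ./part-0001.csv /data/test_tbl
```

No Hive statement is involved; the next query on the table simply reads whatever files are in that directory.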
When you create an external table, the metadata is stored in the RDBMS, i.e. the metastore database, and the data you insert or load is stored in the table's directory.
Whether a table is external or managed, its metadata is always in the RDBMS. When you query a table, Hive gets the table schema from the metastore and the data from HDFS, applies the schema to the data, and displays the result.
So no metadata is created in the warehouse directory for external tables.
I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.
Can I avoid having all my text data stored twice?
EDIT: I load it via the following command
LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Then, I can find the exact same file in:
/user/hive/warehouse/sandbox.db/test/day=20130220
I assumed it was copied.
Use an external table:
CREATE EXTERNAL TABLE sandbox.test(id BIGINT, name STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';
If you want to use partitioning with an external table, you will be responsible for managing the partition directories.
The location specified must be an HDFS directory.
If you drop an external table, Hive WILL NOT delete the source data.
If you want to manage your raw files yourself, use external tables. If you want Hive to do it, then let Hive store them inside its warehouse path.
Instead of having your Java application copy data directly to HDFS, you could keep those files in the local file system and import them into HDFS via Hive with the following command.
LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Notice the LOCAL keyword.
You can use an ALTER TABLE partition statement to avoid data duplication.
create External table if not exists TestTable (testcol string) PARTITIONED BY (year INT,month INT,day INT) row format delimited fields terminated by ',';
ALTER table TestTable ADD partition (year='2014',month='2',day='17') location 'hdfs://localhost:8020/data/2014/2/17/';
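When a new directory arrives every day, the partition statement can be wrapped in a small helper that a scheduler calls. This is a sketch under assumptions: the hive CLI is on the PATH, the data sits under a year/month/day directory layout like the one above, and the function name and argument order are made up.

```shell
# Register an existing HDFS directory as a partition of an external table.
# The data is not moved or copied; only the metastore is updated.
# Assumption: the `hive` CLI is available; `add_day_partition` is a hypothetical helper.
add_day_partition() {
  table="$1"; year="$2"; month="$3"; day="$4"; base="$5"
  hive -e "ALTER TABLE $table ADD IF NOT EXISTS PARTITION (year=$year, month=$month, day=$day) LOCATION '$base/$year/$month/$day/'"
}

# Usage (needs a running cluster):
#   add_day_partition TestTable 2014 2 17 hdfs://localhost:8020/data
```

IF NOT EXISTS keeps the call safe to repeat if the same day is registered twice.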
Hive (at least when running in true cluster mode) cannot refer to external files in the local file system. Hive can automatically import the files during table creation or a load operation. The likely reason is that Hive runs MapReduce jobs internally to extract the data. MapReduce reads from HDFS, writes back to HDFS, and runs in distributed mode, so a file stored in the local file system cannot be used by the distributed infrastructure.