I am trying to load data into Hive from an RDBMS using Sqoop.
Once I populate the Hive table with data and try to run a count(*), the query runs forever. Also, if I drop the (external) Hive table, delete everything from the HDFS directory, and then create a similar table, the new table comes pre-populated with the old data (the same as in the dropped table), even though I deleted everything from my HDFS directory and cleared the trash as well.
Still, the data reappears and a count(*) runs indefinitely on it.
UPDATE 1
It's a standalone Hortonworks (2.4) sandbox environment.
I dropped the table from hive and also removed related files from HDFS.
I have a script to create and load data.
drop table employee;
and then I run the following commands:
hadoop fs -rm -r /user/hive/warehouse/intermidiateTable/*
hadoop fs -rm -r .Trash/Current/user/hive/warehouse/intermidiateTable/*
and then I create the table using the same query as this:
create external table employee (id int, name string, account_no bigint, balance bigint, date_field timestamp, created_by string, created_date string, batch_id int, updated_by string, updated_date string)
row format delimited
fields terminated by ','
lines terminated by '\n'
location '/user/hive/warehouse/intermidiateTable';
and when I run a select query, the table comes back populated with the older data.
Also, a select count(*) runs indefinitely.
Can somebody recommend a solution?
If you are creating the external table inside the warehouse directory itself, then what is the purpose of declaring the table as 'external'?
Aren't external tables supposed to live outside the warehouse directory, so that you control the data files rather than Hive itself?
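For example, a minimal sketch of what that separation could look like (employee_ext and the /data/employee path are only illustrative, not part of the question):
create external table employee_ext (id int, name string, account_no bigint, balance bigint)
row format delimited
fields terminated by ','
lines terminated by '\n'
location '/data/employee';  -- outside /user/hive/warehouse, so the files stay under your control
Dropping employee_ext then only removes the metadata; the files under /data/employee are untouched.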
Related
I am trying to drop a table and recreate it in Hive. After dropping the table, if I run a select query on the table, it shows the old rows that were in the table before dropping. How is this possible when the table has already been dropped? Why does it retain rows even after the table is dropped and recreated?
hive> select * from abc;
A 30
B 40
hive> drop table abc;
hive> create external table abc ( name string, qty int);
hive> select * from abc;
A 30
B 40
The problem is that you are dropping an external table. When you drop an external table, the source files of that table still exist at its location, so when you create a new external table with the same name (and location), the data is read straight from that source path again. To resolve this, first get the path of the table using the following command:
hive> describe formatted database_name.table_name;
Then copy the entire location that appears in the description, for example:
/user/hive/warehouse/database_name.db/table_name
After this, use the following command to delete all the data files of the given table:
hive> dfs -rmr /user/hive/warehouse/database_name.db/table_name;
OR
hive> dfs -rm -r /user/hive/warehouse/database_name.db/table_name;
Then you can wipe it completely using the DROP TABLE command.
I don't know Hive, but if it is anything like Oracle (which I kind of know), then an external table points to a file stored on your disk.
Therefore, once you dropped it you couldn't use it (of course). But then you created another EXTERNAL TABLE (see the 5th line in your example), and of course you were able to select from it once again,
because you didn't delete the FILE that is the data source for that external table.
When using the PARTITIONED BY or CLUSTERED BY keywords while creating Hive tables,
Hive creates separate directories or files corresponding to each partition or bucket. But is this still valid for external tables? My understanding is that the data files backing external tables are not managed by Hive. So does Hive create additional files corresponding to each partition or bucket and move the corresponding data into these files?
Edit - Adding details.
A few extracts from "Hadoop: The Definitive Guide", Chapter 17: Hive:
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested sub directories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
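Illustratively (this is not the book's exact listing), the partitioned layout is nested like this:
/user/hive/warehouse/logs/
    dt=2001-01-01/
        country=GB/
            file1
            file2
        country=US/
            file3
    dt=2001-01-02/
        country=GB/
            file4
        country=US/
            file5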
The above table was obviously a managed table, so Hive had ownership of the data and created a directory structure for each partition, as in the above tree structure.
In the case of an external table:
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Followed by the same load operation:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
How will Hive handle these partitions? For external tables without partitions, Hive simply points to the data file and answers any query by parsing that file. But when loading data into a partitioned external table, where are the partitions created?
Hopefully in the Hive warehouse? Can someone support or clarify this?
Suppose you are partitioning on date, as this is a common thing to do.
CREATE EXTERNAL TABLE mydatabase.mytable (
var1 double
, var2 INT
)
PARTITIONED BY (date String)
LOCATION '/user/location/wanted/';
Then add all your partitions:
ALTER TABLE mytable ADD PARTITION( date = '2017-07-27' );
ALTER TABLE mytable ADD PARTITION( date = '2017-07-28' );
So on and so forth.
Finally, you can place your data files in the proper partition locations. You will have an external partitioned table.
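For instance, a minimal sketch of dropping a day's file into its partition directory, assuming the partition directories follow the date=... convention under the table location (the local file name data_2017-07-27.csv is made up):
hadoop fs -mkdir -p /user/location/wanted/date=2017-07-27
hadoop fs -put data_2017-07-27.csv /user/location/wanted/date=2017-07-27/
Hive will then read those rows whenever the 2017-07-27 partition is queried.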
There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
id integer,
name string
)
PARTITIONED BY (country String)
LOCATION 'xxxx';
Next, you have to run an MSCK command (metastore consistency check):
msck repair table database.table
This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.
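As a rough sketch of the whole flow (the country=... directory names are just an assumed layout under the table's LOCATION):
-- expected layout under LOCATION 'xxxx' before the repair:
--   xxxx/country=GB/part-00000
--   xxxx/country=US/part-00000
msck repair table database.table;
show partitions database.table;  -- should now list country=GB and country=US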
I have a Hive table TEST with this configuration:
create external table if not exists TEST (
ID bigint,
ACTIVITY_ID string,
BATCH_NBR string
)
PARTITIONED BY (year INT, month INT, day INT)
CLUSTERED BY (BATCH_NBR) into 20 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/lake/hive/test';
And I have data files in this location which I can easily load into the Hive table, and it works:
/user/lake/hive/test/2013/01/01/part-r-00001
Now, if I create another table STORE and insert some data from this TEST table, the folder structure gets changed. I was expecting that, after loading the same data, the location for the STORE table would have something like this:
/user/core/store/2014/07/03/batch123231.1313
But the above location changed to this:
/user/core/store/year=2013/month=01/day=01/
I'm using the query insert overwrite table STORE select * from TEST; to load the STORE table from TEST.
How can I load that table and preserve the same folder structure in destination?
Internal tables in Hive follow their own/default folder structure under the /apps/hive/warehouse folder and will not preserve the folder structure when the data is loaded from an external Hive table. I was using an internal table for "Store", so it was not working as expected.
Is it possible to have multiple hive tables represented within the same HDFS directory structure? In other words, is there a way to have multiple hive tables pointing to same/overlapping HDFS paths?
Here is my situation:
I have a table named "mytable", located in hdfs:/tables/mytable
CREATE EXTERNAL TABLE mytable
(
id int,
...
[a whole bunch of columns]
...
)
PARTITIONED BY (logname STRING)
STORED AS [I-do-not-know-what-just-yet]
LOCATION 'hdfs:/tables/mytable';
So, HDFS will look like:
hdfs:/tables/mytable/logname=tarzan/....
hdfs:/tables/mytable/logname=jane/....
hdfs:/tables/mytable/logname=whoa/....
Is it possible to have a hive table, named "tarzan", located in hdfs:/tables/mytable/logname=tarzan ? Same with hive table "jane", located in hdfs:/tables/mytable/logname=jane, etc.
The tarzan, jane, whoa, etc sub-tables share some columns (timestamp, ip_address, country, user_id, and some others), but there will also be a lot of columns that they do not have in common.
Is there a way to store this data once in HDFS, and use it for multiple tables as I described above? Furthermore, is there a way to store the data in an efficient way, since many of the tables will have columns that are not in common? Would a file format like RCFILE or PARQUET work in this case?
Thanks so much for any hints or help anyone can provide,
Yes, we can have multiple hive tables with the same underlying HDFS directory.
Example:
Create table emp and load data file file3 into it.
create table emp (id int, name string, salary int)
row format delimited
fields terminated by ',';
-- the default location would be used
load data
local inpath '/home/parv/testfiles/file3'
into table emp;
Create another table, mirror. When you select data from the mirror table, it will be the same as the emp table (the contents of file3).
create table mirror (id int, name string, salary int)
row format delimited
fields terminated by ','
location 'hdfs:///user/hive/warehouse/parv.db/base';
Load data into the mirror table. When you select data from either the mirror table or the emp table, it will return the same results (the contents of file3 and file4).
load data
local inpath '/home/parv/testfiles/file4'
into table mirror;
Conclusion:
The same data files are shared between the emp and mirror tables.
Strangely, though, the HDFS filesystem only shows a data directory for the emp table and not for the mirror table. However, both tables are present in Hive and so can be queried.
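One way to check this (a sketch, using the table names from the example above) is to compare the Location field reported for each table:
describe formatted emp;
describe formatted mirror;
If the two tables really share the same files, the Location lines will show where each one is reading from.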
Answering my own question:
It IS possible to have multiple hive tables represented by the same HDFS directory structure, but for what I am looking to do:
A mytable table partitioned by logname (logname=tarzan, logname=jane, etc...)
A separate table for each logname: A "tarzan" table with only columns used by the tarzan table, and not any other logname, same for the "jane" table, etc
Only represent the data one time in HDFS
A better solution is to keep the one mytable table, partitioned by logname, AND create a view for each logname, with only the subset of columns needed in each.
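For illustration, a minimal sketch of one such view (tarzan_only_col stands in for whatever columns are specific to that log; adjust the names to your schema):
CREATE VIEW tarzan AS
SELECT `timestamp`, ip_address, country, user_id, tarzan_only_col
FROM mytable
WHERE logname = 'tarzan';
Queries against the tarzan view then read only the logname=tarzan partition of the shared data.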
Yes, you could point multiple tables to the same location on HDFS. However, Hive doesn't support dynamic columns.
Is there a reason you can't just have 3 different tables? This would allow you do have different schemas (columns) for each.
--Brandon
I have a log file in HDFS whose values are delimited by commas. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file into a Hive table which has the columns "timestamp" and "action" and is partitioned by "userid" and "deviceid". How can I ask Hive to take the last 2 columns in the log file as the partition for the table? All examples, e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');", require the partitions to be defined in the script, but I want the partitions to be set up automatically from the HDFS file.
One solution is to create an intermediate non-partitioned table with all 4 columns, populate it from the file, and then run an INSERT into first_table PARTITION (userid, deviceid) select timestamp, action, userid, deviceid from intermediate_table; but that is an additional task and we will have 2 very similar tables. Or we could create an external table as the intermediate.
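A minimal sketch of that intermediate-table approach for this exact log layout (the table names, column names, and staging location are illustrative):
create external table logs_stg (ts string, action string, userid string, deviceid string)
row format delimited fields terminated by ','
location '/user/myname/logs/';

create table logs (ts string, action string)
partitioned by (userid string, deviceid string);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table logs partition (userid, deviceid)
select ts, action, userid, deviceid from logs_stg;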
Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is that:
LOAD DATA simply copies files; it doesn't read them, so it cannot figure out what to partition by.
He suggests that you load the data into an intermediate table first (or use an external table pointing to all the files) and then let dynamic partition insert kick in to load it into a partitioned table.
As mentioned in Denny Lee's answer, we need to involve a staging table (invites_stg),
managed or external, and then INSERT from the staging table into the partitioned table (invites in this case).
Make sure these two properties are set:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And finally, insert into the partitioned table (India, from the example in the link below), listing the partition column last in the SELECT:
INSERT OVERWRITE TABLE India PARTITION (STATE) SELECT <non-partition columns>, STATE FROM invites_stg;
Refer this link for help: http://www.edupristine.com/blog/hive-partitions-example
I worked through this very same scenario, but instead, what we did was create separate HDFS data files for each partition you need to load.
Since our data comes from a MapReduce job, we used MultipleOutputs in our Reducer class to multiplex the data into the corresponding partition files. Afterwards, it is just a matter of building the load script using the partition taken from the HDFS file name.
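Purely as an illustration of that last step (the paths, table name, and file-name convention are all made up; assume the reducers wrote one file per userid/deviceid pair named like userid111-deviceid222-r-00000 under /output/logs):
hadoop fs -ls /output/logs | grep '^-' | awk '{print $NF}' | while read f; do
  name=$(basename "$f")
  userid=$(echo "$name" | cut -d'-' -f1)
  deviceid=$(echo "$name" | cut -d'-' -f2)
  echo "LOAD DATA INPATH '$f' INTO TABLE logs PARTITION (userid='$userid', deviceid='$deviceid');"
done > load_partitions.hql
Running the generated load_partitions.hql with hive -f then moves each file into its own partition.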
How about
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE DB.EXAMPLE_TABLE PARTITION (PARTITION_COL_NAME='PARTITION_VALUE');
CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUS STRING,
PINCODE INT,
TELEPHONE BIGINT,
TALUK STRING,
DISTRICT STRING,
POSTAL_DIVISION STRING,
POSTAL_REGION STRING,
POSTAL_CIRCLE STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Instruct Hive to load partitions dynamically:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
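With both properties set, the earlier dynamic-partition insert (INSERT OVERWRITE TABLE India PARTITION (STATE) ... FROM invites_stg) will create one subdirectory per distinct STATE value under the India table's location, roughly like this (the state values are illustrative):
.../warehouse/india/state=Kerala/000000_0
.../warehouse/india/state=Karnataka/000000_0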