Hive - dynamic partition insert fails with fewer than one hundred partitions - hadoop

I am trying to insert records in Hive from one table (not partitioned) into another using dynamic partitioning. I've set these Hive properties as suggested in a few other questions:
hive.exec.dynamic.partition=true
hive.exec.max.dynamic.partitions=5000
hive.exec.max.dynamic.partitions.pernode=2048
hive.exec.dynamic.partition.mode=nonstrict
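Applied in the Hive session these amount to (the names of the two max-partitions limits are as I understand them):
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=5000;
SET hive.exec.max.dynamic.partitions.pernode=2048;
SET hive.exec.dynamic.partition.mode=nonstrict;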
Here you can find the table definitions together with the insert query I am running:
Non-partitioned table
CREATE EXTERNAL TABLE IF NOT EXISTS dataretention.non_partitioned(
recordType STRING,
potentialDuplicate STRING,
...
partDate STRING,
partHour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/tmp/temp-tables/csv/not-partitioned'
Partitioned table
CREATE TABLE IF NOT EXISTS dataretention.partitioned(
recordType STRING,
potentialDuplicate STRING,
...)
PARTITIONED BY (partDate STRING, partHour STRING) STORED AS ORC LOCATION '/tmp/temp-tables/partitioned'
Insert query
INSERT INTO dataretention.partitioned PARTITION(partDate, partHour) SELECT recordType, ... partDate, partHour FROM dataretention.non_partitioned;
The file is 1000 records long and the tables have 158 fields. I first tested this on the Hortonworks HDP 2.4 sandbox. Since I suspected a resource problem, I moved to a 4-machine m4.xlarge (4 cores, 16 GB RAM each) cluster on AWS. I was not able to go over 9 different partitions on the sandbox, and 99 different ones on the cluster. I get a vertex failure from Tez four times, and after that it quits the job.
Any help or suggestion is really appreciated. Thank you.

Related

Rules to be followed before creating a Hive partitioned table

As part of my requirement, I have to create a new Hive table and insert into it programmatically. To do that, I have the following DDL to create a Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName String,
ssn String,
hiveCount String,
sapCount String,
countDifference String,
percentDifference String,
sap_UpdTms String,
hive_UpdTms String)
COMMENT 'This table contains record count of corresponding tables of all the source systems present on Hive & SAP'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '';
Inserting data into a partition of a Hive table is something I can handle with an insert query from the program. In the DDL above I haven't added a "PARTITIONED BY" column, because I am not totally clear on the rules for partitioning a Hive table. A couple of rules I know are:
While inserting data from a query, the partition column should be the last one in the SELECT.
The PARTITIONED BY column shouldn't be an existing column in the table.
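For example, something like this is what I have in mind (load_date is a hypothetical partition column, not part of my DDL above; the all-dynamic insert also needs hive.exec.dynamic.partition.mode=nonstrict):
CREATE EXTERNAL TABLE IF NOT EXISTS countData_partitioned (
tableName String,
ssn String,
hiveCount String,
sapCount String,
countDifference String,
percentDifference String,
sap_UpdTms String,
hive_UpdTms String)
PARTITIONED BY (load_date String)   -- rule 2: load_date is not repeated in the column list
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

INSERT INTO TABLE countData_partitioned PARTITION (load_date)
SELECT tableName, ssn, hiveCount, sapCount, countDifference, percentDifference,
sap_UpdTms, hive_UpdTms, current_date FROM countData;   -- rule 1: the partition column comes last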
Could anyone let me know if there are any other rules for partitioning a Hive table?
Also, in my case, we run the program twice a day to insert data into the table, and every time it runs there could be 8k to 10k records. I am thinking of adding a PARTITIONED BY column for the current date (just "mm/dd/yyyy") and inserting it from the code.
Is there a better way to implement the partition idea for my requirement, if adding a date (String format) is not recommended?
What you mentioned is fine, but I would recommend the yyyyMMdd format because it sorts better and is more standardized than seeing 03/05 and not knowing which is the day and which is the month.
If you want to run it twice a day, and you care about the time the job runs, then use PARTITIONED BY (dt STRING, hour STRING).
Also, don't use STORED AS TEXTFILE. Use Parquet or ORC instead.
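Putting that together, a minimal sketch of the DDL (the trimmed column list and the location are only placeholders):
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName String,
hiveCount String,
sapCount String)
PARTITIONED BY (dt STRING, hour STRING)
STORED AS ORC
LOCATION '/some/hdfs/path';
-- e.g. the morning run writes into dt='20170727', hour='06' and the evening run into hour='18'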

Hive: Does hive support partitioning and bucketing while using external tables

When using the PARTITIONED BY or CLUSTERED BY keywords while creating Hive tables, Hive creates separate directories for each partition and separate files for each bucket. But is this still valid for external tables? My understanding is that the data files backing an external table are not managed by Hive. So does Hive create additional directories or files for each partition or bucket and move the corresponding data into them?
Edit - Adding details.
A few extracts from "Hadoop: The Definitive Guide", Chapter 17 (Hive):
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested subdirectories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
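(Illustrative layout; only the dt='2001-01-01', country='GB' partition comes from the LOAD above, the rest is made up to show the nesting.)
/user/hive/warehouse/logs/
    dt=2001-01-01/
        country=GB/
            file1
            file2
        country=US/
            file3
    dt=2001-01-02/
        country=GB/
            file4
        country=US/
            file5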
The above table was obviously a managed table, so Hive owned the data and created a directory for each partition, as in the tree structure above.
In case of external table
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Followed by the same set of load operations:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
How will Hive handle these partitions? For external tables without partitions, Hive simply points to the data files and answers any query by parsing them. But when loading data into a partitioned external table, where are the partitions created?
Hopefully in the Hive warehouse? Can someone support or clarify this?
Suppose you partition on date, as this is a common thing to do.
CREATE EXTERNAL TABLE mydatabase.mytable (
var1 double
, var2 INT
)
PARTITIONED BY (date String)
LOCATION '/user/location/wanted/';
Then add all your partitions:
ALTER TABLE mytable ADD PARTITION( date = '2017-07-27' );
ALTER TABLE mytable ADD PARTITION( date = '2017-07-28' );
And so on. Finally, put your data files in the proper partition locations and you will have an external, partitioned table.
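If a partition's data lives somewhere other than the default date=... subdirectory, you can also point each partition at its own path (the path here is only an example):
ALTER TABLE mytable ADD PARTITION( date = '2017-07-29' ) LOCATION '/some/other/path/2017-07-29/';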
There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
id integer,
name string
)
PARTITIONED BY (country String)
LOCATION 'xxxx';
Next you have to run an MSCK command (metastore consistency check):
msck repair table database.table
This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.
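For MSCK REPAIR to find anything, the data files must already sit in subdirectories named after the partition column under the table location, for example (layout illustrative):
xxxx/country=IN/part-00000
xxxx/country=US/part-00000
After the repair, SHOW PARTITIONS database.table; should list country=IN and country=US.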

Is it possible to have multiple hive tables represented within the same HDFS directory structure?

Is it possible to have multiple hive tables represented within the same HDFS directory structure? In other words, is there a way to have multiple hive tables pointing to same/overlapping HDFS paths?
Here is my situation:
I have a table named "mytable", located in hdfs:/tables/mytable
CREATE EXTERNAL TABLE mytable
(
id int,
...
[a whole bunch of columns]
...
)
PARTITIONED BY (logname STRING)
STORED AS [I-do-not-know-what-just-yet]
LOCATION 'hdfs:/tables/mytable';
So, HDFS will look like:
hdfs:/tables/mytable/logname=tarzan/....
hdfs:/tables/mytable/logname=jane/....
hdfs:/tables/mytable/logname=whoa/....
Is it possible to have a hive table, named "tarzan", located in hdfs:/tables/mytable/logname=tarzan ? Same with hive table "jane", located in hdfs:/tables/mytable/logname=jane, etc.
The tarzan, jane, whoa, etc sub-tables share some columns (timestamp, ip_address, country, user_id, and some others), but there will also be a lot of columns that they do not have in common.
Is there a way to store this data once in HDFS, and use it for multiple tables as I described above? Furthermore, is there a way to store the data in an efficient way, since many of the tables will have columns that are not in common? Would a file format like RCFILE or PARQUET work in this case?
Thanks so much for any hints or help anyone can provide,
Yes, we can have multiple hive tables with the same underlying HDFS directory.
Example:
Create table emp and load data file file3 into it.
create table emp (id int, name string, salary int)
row format delimited
fields terminated by ',';
-- default location would be used
load data
local inpath '/home/parv/testfiles/file3'
into table emp;
Create another table, mirror. When you select data from the mirror table, it will be the same as the emp table (the contents of file3).
create table mirror (id int, name string, salary int)
row format delimited
fields terminated by ','
location 'hdfs:///user/hive/warehouse/parv.db/emp';
Load data into the mirror table. When you select data from either the mirror table or the emp table, it returns the same results (the contents of file3 and file4).
load data
local inpath '/home/parv/testfiles/file4'
into table mirror;
Conclusion:
Same data files are shared among both tables emp and mirror.
But, strangely, the HDFS filesystem only shows a data directory for the emp table and not for the mirror table. However, both tables are present in Hive and can be queried.
Answering my own question:
It IS possible to have multiple hive tables represented by the same HDFS directory structure, but for what I am looking to do:
A mytable table partitioned by logname (logname=tarzan, logname=jane, etc...)
A separate table for each logname: A "tarzan" table with only columns used by the tarzan table, and not any other logname, same for the "jane" table, etc
Only represent the data one time in HDFS
A better solution is to have the one mytable table, partitioned by logname, AND create a view for each logname with only the subset of columns it needs.
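Roughly (the per-logname columns beyond the shared ones are just examples):
CREATE VIEW tarzan AS
SELECT `timestamp`, ip_address, country, user_id, tarzan_col_1, tarzan_col_2
FROM mytable
WHERE logname = 'tarzan';

CREATE VIEW jane AS
SELECT `timestamp`, ip_address, country, user_id, jane_col_1
FROM mytable
WHERE logname = 'jane';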
Yes, you could point multiple tables to the same location on HDFS. However, Hive doesn't support dynamic columns.
Is there a reason you can't just have 3 different tables? This would allow you do have different schemas (columns) for each.
--Brandon

Hive: Create Table and Partition By

I have a table with data loaded as follows:
create table xyzlogTable (dateC string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' with serdeproperties( "input.regex" = "(\\S+)\\t(\\d+):(\\d+):(\\d+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s") stored as textfile;
load data local inpath '/home/hadoop/hive/xyxlogData/' into table xyzlogTable;
The total row count turns out to be more than 3 million. Some queries work fine and some seem to get into an infinite loop.
After seeing that SELECT and GROUP BY queries were taking a long time and sometimes not even returning results, I decided to go for partitioning.
But both the following statements are failing:
create table xyzlogTable (datenonQuery string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) partitioned by (dateC string);
FAILED: Error in metadata: AlreadyExistsException(message:Table xyzlogTable already exists)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Alter table xyzlogTable (datenonQuery string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) partitioned by (dateC string);
FAILED: Parse Error: line 1:12 cannot recognize input 'xyzlogTable' in alter table statement
Any idea what the problem is?
This is precisely why I prefer using external tables in Hive. The table you created is not external (you used CREATE TABLE instead of CREATE EXTERNAL TABLE). With non-external tables, dropping the table drops both the metadata (name, column names, types, etc.) and the table's data in HDFS. In contrast, when an external table is dropped, only the metadata is removed; the data in HDFS sticks around.
You have a few options going forward:
If the cost of import is high and the data is not already partitioned: keep this table around, but create a new table, say xyzlogTable_partitioned, that will be a partitioned version of this table. You can use dynamic partitioning in Hive to populate the new table (a sketch follows after this list).
If the cost of import is high but the data is already partitioned, for example if you already have the data in separate files for each partition in HDFS: create a new partitioned table and have a bash script (or equivalent) move (or copy and later delete, if you are conservative) the files from the HDFS directory corresponding to the un-partitioned table into the directory corresponding to the appropriate partition of the new table.
If import is cheap: drop the entire table, re-create a new partitioned table and re-import. Often, if the import process is not aware of the partitioning scheme (in other words, if the import can't push data directly into the appropriate partitions), it's a common pattern to have an unpartitioned table (like the one you already have) as a staging table and then use a Hive query with dynamic partitioning to populate a new partitioned table which gets used in the subsequent queries of the workflow.
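A minimal sketch of the dynamic-partitioning route from the first and third options (table and column names taken from the question; the new table name is just the suggestion above):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE xyzlogTable_partitioned (
hours string, minutes string, seconds string, TimeTaken string,
Method string, UriQuery string, ProtocolStatus string)
PARTITIONED BY (dateC string);

INSERT OVERWRITE TABLE xyzlogTable_partitioned PARTITION (dateC)
SELECT hours, minutes, seconds, TimeTaken, Method, UriQuery, ProtocolStatus, dateC
FROM xyzlogTable;   -- the partition column comes last in the SELECT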
You should first drop your table which was already created and then create the partitioned table. Or change your table name.

Hive loading in partitioned table

I have a log file in HDFS, values are delimited by comma. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file into a Hive table which has the columns "timestamp" and "action" and is partitioned by "userid" and "deviceid". How can I ask Hive to take the last 2 columns of the log file as the partition for the table? All examples, e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');", require the partition values to be defined in the script, but I want the partitions to be set up automatically from the HDFS file.
One solution is to create an intermediate non-partitioned table with all 4 columns, populate it from the file and then run an INSERT INTO first_table PARTITION (userid, deviceid) SELECT timestamp, action, userid, deviceid FROM intermediate_table; but that is an additional task and we end up with 2 very similar tables. Or we could create an external table as the intermediate one.
Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is that:
LOAD DATA simply copies the files; it doesn't read them, so it cannot figure out what to partition on
He suggests loading the data into an intermediate table first (or using an external table pointing to all the files) and then letting a dynamic partition insert kick in to load it into the partitioned table
As mentioned in Denny Lee's answer, we need to involve a staging table (invites_stg), managed or external, and then INSERT from the staging table into the partitioned table (invites in this case).
Make sure we have these two properties set:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And finally, insert into the partitioned table (here using the India example defined below):
INSERT OVERWRITE TABLE India PARTITION (STATE) SELECT COL's FROM invites_stg;
Refer to this link for help: http://www.edupristine.com/blog/hive-partitions-example
I worked through this very same scenario, but instead what we did was create separate HDFS data files for each partition that needed to be loaded.
Since our data was coming from a MapReduce job, we used MultipleOutputs in our Reducer class to multiplex the data into the corresponding partition files. Afterwards, it is just a matter of building the load script using the partition value from the HDFS file name.
How about
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE DB.EXAMPLE_TABLE PARTITION (PARTITION_COL_NAME='PARTITION_VALUE');
CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUS STRING,
PINCODE INT,
TELEPHONE BIGINT,
TALUK STRING,
DISTRICT STRING,
POSTAL_DIVISION STRING,
POSTAL_REGION STRING,
POSTAL_CIRCLE STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Instruct Hive to dynamically load partitions:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
