I have a table in which one column contains JSON data (around 135 fields per JSON record). I have a script that selects this JSON data and loads those 135 fields into another table (table2), which has 135 matching columns. Below is a sample of the field in table1 and the corresponding columns in table2.
table1
column1
{"transactiondetails":{"transaction_id":"123","timestamp":"2020-01-01T09:03:57" ---____-----____----____},
"inquiry_details":{inquiry_id":"123","Language":"English(us)","postal_code":"123456" -----_____-----_____},
128 more fields similar to this....
table2 - column names
transactionid, timestamp, inquiry_id, language, postal_code..... 128 more columns
The issue is that the HQL script takes a long time to run. The script is a simple INSERT OVERWRITE into table2 selecting data from table1. Below are the performance tuning parameters I am using:
set spark.executor.extraJavaOptions=-Xss16m;
set hive.execution.engine=spark;
set mapreduce.map.memory.mb=4096;
set spark.master=yarn-cluster;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=10;
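For context, the insert itself is essentially of this shape. This is only a minimal sketch, assuming the fields are pulled out with get_json_object; the column and JSON key names follow the sample above, and the real script has 135 such expressions and may extract them differently:
INSERT OVERWRITE TABLE table2
SELECT
  get_json_object(column1, '$.transactiondetails.transaction_id') AS transactionid,
  get_json_object(column1, '$.transactiondetails.timestamp')      AS `timestamp`,
  get_json_object(column1, '$.inquiry_details.inquiry_id')        AS inquiry_id,
  get_json_object(column1, '$.inquiry_details.Language')          AS language,
  get_json_object(column1, '$.inquiry_details.postal_code')       AS postal_code
  -- ... 130 more get_json_object calls ...
FROM table1;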
Is there any way to reduce the time taken for this task?
As part of my requirement, I have to create a new Hive table and insert into it programmatically. To do that, I have the following DDL to create a Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName String,
ssn String,
hiveCount String,
sapCount String,
countDifference String,
percentDifference String,
sap_UpdTms String,
hive_UpdTms String)
COMMENT 'This table contains record count of corresponding tables of all the source systems present on Hive & SAP'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '';
Inserting data into a partition of the Hive table is something I can handle with an insert query from the program. In the DDL above, I haven't added a "PARTITIONED BY" column because I am not totally clear on the rules for partitioning a Hive table. A couple of rules I know are:
When inserting data from a query, the partition column should be the last one.
The PARTITIONED BY column shouldn't be an existing column in the table.
Could anyone let me know if there are any other rules for partitioning a Hive table?
Also, in my case, we run the program twice a day to insert data into the table, and every run can produce 8k to 10k records. I am thinking of adding a PARTITIONED BY column for the current date (just "mm/dd/yyyy") and inserting it from the code.
Is there a better way to implement the partitioning for my requirement, if adding a date (in String format) is not recommended?
What you mentioned is fine, but I would recommend the yyyyMMdd format because it sorts better and is more standard than seeing 03/05 and not knowing which is the day and which is the month.
If you want to run it twice a day and you care about the time the job runs, then use PARTITIONED BY (dt STRING, hour STRING).
Also, don't use STORED AS TEXTFILE. Use Parquet or ORC instead.
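For example, here is a minimal sketch of how the countData DDL above could look with that partitioning and ORC. The dt/hour values supplied at insert time and the staging_countData source table are placeholders, not something from your setup:
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
  tableName String,
  ssn String,
  hiveCount String,
  sapCount String,
  countDifference String,
  percentDifference String,
  sap_UpdTms String,
  hive_UpdTms String)
PARTITIONED BY (dt STRING, hour STRING)
STORED AS ORC;

-- Static partition values supplied by the program on each run:
INSERT INTO TABLE countData PARTITION (dt='20200101', hour='09')
SELECT tableName, ssn, hiveCount, sapCount, countDifference, percentDifference, sap_UpdTms, hive_UpdTms
FROM staging_countData;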
I am working with Hive. I need to create a table with 'n' normal columns and 100 or more partition columns, and I am able to create that table successfully.
Now, when I try to load that table with data from another table that has the same schema (but where all columns are non-partition columns), I get an error like this:
Failed with exception MetaException(message:Attempt to store value "c1=v1/c2=v2/c3=v3/....c100=v100"
in column "PART_NAME" that has maximum length of 767. Please correct your data!)
Taking the last line of the error into consideration, I tried shortening the column names and their values so that the resulting partition path would be shorter, and it worked! But it shouldn't have to be that way; in a real scenario the column names and values could be anything, and so could the length of the partition path.
For example, here is my create table query:
CREATE TABLE xyz (c0 int)
PARTITIONED BY (c1 String, c2 String, c3 String, c4 String.......c100 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
And here is my insert into query:
INSERT INTO TABLE xyz PARTITION (c1, c2, c3...., c100) SELECT c0, c1, c2, c3, c4...., c100 FROM table123;
Am I doing something wrong, or do I have to set some properties to be able to use that many partition columns (100 or more)?
Please give me a clue; I am stuck on this.
Thanks
I agree with the experts that we should not go for so many partitions in a table.
I would also like to point out that most nodes are Unix/Linux based, and we cannot create a folder or file name longer than 255 bytes. That may be the reason you are getting this error, since a partition is just a folder:
Linux has a maximum filename length of 255 characters for most
filesystems (including EXT4), and a maximum path of 4096 characters.
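As an illustration of that advice (only a sketch, with hypothetical column choices), keep just the one or two columns you actually filter on as partition columns and leave the rest as ordinary columns:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Only c1 and c2 remain partition columns; everything else is a regular column.
CREATE TABLE xyz (
  c0 int,
  c3 String,
  c4 String)
PARTITIONED BY (c1 String, c2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Dynamic-partition insert: the partition columns come last in the SELECT.
INSERT INTO TABLE xyz PARTITION (c1, c2)
SELECT c0, c3, c4, c1, c2 FROM table123;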
I have a large table in Hive with 1.5 billion+ rows. One of the columns is category_id, which has ~20 distinct values. I want to sample the table such that I have 1 million rows for each category.
I checked out "Random sample table with Hive, but including matching rows" and "Hive: Creating smaller table from big table", and I figured out how to get a random sample from the entire table, but I'm still unable to figure out how to get a sample for each category_id.
I understand you want to sample your table into multiple files. You might want to look at Hive bucketing or dynamic partitions to balance your records across multiple folders/files.
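For instance, here is a minimal sketch of the dynamic-partition route. The sampled_by_category table, the value_col column, and the 0.013 sampling fraction are placeholders: with ~1.5 billion rows and ~20 roughly even categories, about 1.3% per category lands near 1 million rows, and the per-category counts will only be approximate.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE sampled_by_category (value_col STRING)
PARTITIONED BY (category_id INT)
STORED AS ORC;

-- One partition (folder) per category_id, each holding a ~1M-row random sample.
INSERT OVERWRITE TABLE sampled_by_category PARTITION (category_id)
SELECT value_col, category_id
FROM big_table
WHERE rand() <= 0.013;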
I have to build a POC with Hadoop for a database using interactive queries (a ~300 TB log database). I'm trying Impala, but I haven't found any way to use sorted or indexed data. I'm a newbie, so I don't even know if it is possible.
How do I query sorted/indexed columns in Impala?
By the way, here is my table's code (simplified).
I would like to have fast access on the "column_to_sort" column below.
CREATE TABLE IF NOT EXISTS myTable (
unique_id STRING,
column_to_sort INT,
content STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
STORED AS textfile;
Using Hive 0.12.0, I am looking to populate a table that is partitioned and bucketed, using data stored on HDFS. I would also like to create an index on this table on a foreign key which I will use a lot when joining tables.
I have a working solution but something tells me it is very inefficient.
Here is what I do:
I load my data in a "flat" intermediate table (no partition, no buckets):
LOAD DATA LOCAL INPATH 'myFile' OVERWRITE INTO TABLE my_flat_table;
Then I select the data I need from this flat table and insert it into the final partitioned and bucketed table:
FROM my_flat_table
INSERT OVERWRITE TABLE final_table
PARTITION(date)
SELECT
col1, col2, col3, to_date(my_date) AS date;
The bucketing was defined earlier when I created my final table:
CREATE TABLE final_table
(col1 TYPE1, col2 TYPE2, col3 TYPE3)
PARTITIONED BY (date DATE)
CLUSTERED BY (col2) INTO 64 BUCKETS;
And finally, I create the index on the same column I use for bucketing (is that even useful?):
CREATE INDEX final_table_index ON TABLE final_table (col2) AS 'COMPACT';
All of this is obviously really slow, so how would I go about optimizing the loading process?
Thank you
Whenever I had a similar requirement, I used almost the same approach you are using, as I couldn't find an alternative that worked more efficiently.
However, to make the dynamic partitioning process a bit faster, I tried setting a few configuration parameters, like:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions = 2000;
set hive.exec.max.dynamic.partitions.pernode = 10000;
I am sure you must already be using the first two; the last two you can set depending on your data size.
You can check out the Configuration Properties page and decide for yourself which parameters might help speed up your process, e.g. increasing the number of reducers used.
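For example (the property values below are purely illustrative; pick them based on your data volume):
-- Let Hive pick more reducers by lowering the data size handled per reducer
set hive.exec.reducers.bytes.per.reducer=134217728;
set hive.exec.reducers.max=500;
-- Or fix the reducer count explicitly
set mapreduce.job.reduces=200;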
I can't guarantee that this approach will save you time, but you will definitely make the most of your cluster setup.