We have our dataset in s3 (parquet files) in the below format, data divided as multiple parquet files based on the row number
data1_1000000.parquet
data1000001_2000000.parquet
data2000001_3000000.parquet
...
Created hive table on top of it using,
CREATE EXTERNAL TABLE parquet_hive (
foo string
) STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';
Totally there 22000 parquet files and the size of the folder is nearly 300GB. When i run the count query on this table in Hive, it is taking 6 hours to return the result which is nearly 7 billion records. How can we make it faster? Can i create partition or index on the table or this is the time it usually take when pulling data from s3. Can anyone advice, what is wrong here.
Thanks.
Related
I have a partitioned ORC table in Hive. After loading the table with all possible partitions I get on HDFS - multiple ORC files i.e. each partition directory on HDFS has an ORC file in it. I need to combine all these ORC files under each partition to a single big ORC file for some use-case.
Can someone suggest me a way to combine these multiple ORC files (belonging to each partition) into a single big ORC file.
I've tried creating a new Non Partitioned ORC table from the Partitioned table.. It does reduce the number of files but not to a single file.
PS: Creating a table out of another one is a completely a map task and hence setting the number of reducers to 1 using the property 'set mapred.reduce.tasks=1;' doesn't help.
Thanks
You can use the CONCATENATE command to combine the small orc files. This can be done at table as well as partition level:
The syntax as per the orc documentation:
users can request an efficient merge of small ORC files together by
issuing a CONCATENATE command on their table or partition. The files
will be merged at the stripe level without reserialization.
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
Currently an insert overwrite table T1 select * from T2; will take around 100 minutes in my cluster. Table T1 is ORC formatted and T2 is text formatted. I am reading a 60 GB of text data from T2 and inserting into ORC table T1(10 GB after insertion). If i use text format for both tables insert will take around 50 min. In both cases what are the things we can do to improve write speed( I have large tables coming in) or any other suggestions??
I have recently derived an approach which splits the source file into partitions this takes around 6mins from text table to orc table in hive for 100GB data.
Approach below
Before inserting the file into text table
1.split the file into small partitions in unix location using split command
2.then remove the original file from the path and just keep the files splitted.
Inserting into text table
3.now load the data into text table
4.it will take some mins to load and u can see that there will be same number of partitions as you have done at unix level
Inserting into orc table
Ex: you have splitted the actual file into let say 20 partitions
then you would see 20 tasks/containers being run on the cluster to load into the orc table which is very much faster than the other
solutions which i came across
#despicable-me
That is probably a normal behaviour as when you write data from text to text - it just writes data line by line from one file into another. Text-to-ORC will do some more work besides of it. Comparing to the text-to-text operation, text-to-orc importing will perform additional bucket-partition operations and compression operations to you data. That is the resaon of your time impacts. ORC format gives two main benefits upon text format:
save of space due to compression
improve access time to work with the data
Usually the INSERT operation is a single time operation, while access operations will be very frequent. So it usually makes sence to spend some more time at the beginning on importing the data and then have a huge benefite in saving space due to optimized storage of the data and
in optimized access time to this data
I have an use case which required around 200 hive parquet table.
I need to load these parquet table from flat text files. But we can not directly load parquet table from flat text file.
So I am using following approach
Created a temporary managed text table.
Loaded temp table with text data.
Created external parquet table.
Loaded parquet table with text table using select query.
Dropped text file for temporary text table (but keep table in metastore).
As this approach is keeping temporary metadata (for 200 tables) in metastore. So I have second approach is that I will drop temporary text table too along with text files from hdfs. And next time re-create temporary table and delete once parquet get created.
Now, as I need to follow above steps for all 200 tables for every 2 hours, So will creating and deleting tables from metastore impact anything in cluster during production?
Which approach can impact production, keeping temporary metadata in metastore, creating and deleting tables (metadata) from hive metastore?
Which approach can impact production, keeping temporary metadata in
metastore, creating and deleting tables (metadata) from hive
metastore?
No, there is no impact, the backend of the HiveMetastore should be able to handle 200 * n Changes per hour easily. If you're unsure start with 50 tables and monitor the backend DB performance.
I have two scripts which parse data from raw logs and write it into ORC tables in HIVE. One script creates more columns and another less. Both tables partitioned by date field.
As the result I have ORC tables with different sizes of files.
Table with larger number of columns consists of many small files (~4MB per file inside each partition) and tables with less columns consists of few large files (~250 MB per file inside each partition).
I suppose it happens because of stripe.size setting in ORC. But I don't know how to check size of stripe for existing table. Commands like "show create" and "describe" don't reveal any custom settings, it means that stripe size for tables should be equal to 256 MB.
I'm looking for any advice to check stripe.size for existing ORC table.
Or explanation how file size inside ORC tables depends on data in that tables.
P.s.It matters later when I'm reading from that tables with Map Reduce and there are small number of reducers for tables with big files.
Try the Hive ORC File Dump Utility: ORC File Dump Utility.
I created a table in hive from data stored in hdfs with this command:
create external table users
(ID INT, NAME STRING, ADRESS STRING, EMAIL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/data/tpch/users';
This users table stored in hdfs has 10gb. And the create table just took 1second to create the table and load the data. So this is strange or it is really fast. My doubt is, to check the time of load tables with data in hive can be with that command above with location? Or that command just create a reference to data stored in hdfs?
So what is the correct way to check the time to load data in hive tables?
Because 1second seems really fast, mysql or another relational database probably need 30 or more minutes for load 10gb of data into a table.
Your create table statement is pointing to external storage for the tables, so Hive is not copying the data over. The documentation explains external tables like this:
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so
that Hive does not use a default location for this table. This comes
in handy if you already have data generated. When dropping an EXTERNAL
table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather
than being stored in a folder specified by the configuration property
hive.metastore.warehouse.dir.
This is not 100% explicit, but the idea is that Hive is pointing to the table contents rather than managing it directly.