Repartition in Hadoop - hadoop

My question is mostly theoretical, but i have some tables that already follow some sort of partition scheme, lets say my table is partitioned by day, but after working with the data for sometime we want to modifity to month partitions instead, i could easily recreare the table with the new partition definition and reinsert the data, is this the best approach? sounds slow when the data is huge, i have seen there are multiple alter commands in hive for partitions, is there one that can help me achieve what i need?
Maybe there is another choice of concatenating the files and then recreating the table with the new partition?
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
If there are any relevant references they are appreciated as well.

If the files are in daily folders, you can not mount many daily folders into single month partition, for each month, files needs to be moved to month folder. You can not do it as a metadata only operation.
If you are good in shell scripting you can write loop in hadoop fs -ls <table location> | sort, in the loop save path into variable, check if substring including yyyy-MM is different from previous, then create yyyy-MM folder. For each row in a loop do copy everything into month location (hadoop fs -cp daily_location/* month_location/), all can be done in single loop.
If you are on S3 and using AWS-CLI commands, creating of folders is not necessary, just copy.
If there are too many small files, you may want to concatenate them in monthly folders, if it is ORC, you can execute ALTER TABLE PARTITION CONCATENATE. If not ORC, then better use Hive INSERT OVERWRITE, it will do all that for you, you can configure merge task and finally your files will be in optimal size. Additionally you can improve compression efficiency and make possible to use bloom filters and internal indexes(if it is ORC/Parquet) if you add distribute by partition_col sort by <keys used in filters/joins>, this will greatly reduce table size and improve queries performance.
So, better use Hive for this task because it gives you opportunity to improve data storage: change storage format, concatenate files, sort to reduce compressed size and make indices and bloom filters be really useful.

Related

HDFS File Compaction with continuous ingestion

We have few tables in HDFS which are getting approx. 40k new files per day. We need to compact these tables every two weeks and for that we need to stop ingestion.
We have spark ingestion getting data from kafka and adding to HDFS (Hive external tables) every 30 mins. The data is queried as soon as it is ingested, our SLA is less than an hour so we can not increase the batch interval.
The tables are partition on two fields, we get older data constantly so most of the partitions are updated during each injection batch
eg:
/user/head/warehouse/main_table/state=CA/store=macys/part-00000-017258f8-aaa-bbb-ccc-wefdsds.c000.snappy.parquet
We are looking into ways to reduce number of file creations but even with that we will have to do compaction every 3/4 weeks if not two.
As most of the partitions are updated constantly, we need to stop the injection (~ 1 day) before starting compaction which is impacting our users.
I am looking for ways to compact automatically with out stopping the ingestion?
The chosen partitioning scheme is somewhat unfortunate. Still there are a couple of things you can do. I'm relying on the fact that you can change partition's location atomically in Hive (alter table ... partition ... set location):
Copy a partition's hdfs directory to a different location
Compact copied data
Copy new files that were ingested since step 1
do "alter table ... partition ... set location" to point Hive to a new compacted location.
Start ingesting to this new location (in case if this step is tricky you can just as well replace the "small" files in the original partition location with their compacted version and do "alter table ... partition ... set location" again to point Hive back to the original partition location.
You'll have to keep this process running iterating partition-by-partition on a continuous basis.
Thank you Facha for your suggestions, really appreciate it.
I am pretty new to HDFS concept so please dont mind basic questions,
What would be the impact on running queries which are accessing these specific files while doing swapping of uncompacted files with compacted files (alter table ... partition ... set location). I believe that the queries might fail. Who can we minimize the impact?
Copy a partition's hdfs directory to a different location
As we have two partitions in one table, state and store, will I have to iterate through each sub partition?
/tableName/state=CA/store=macys/file1.parquet
/tableName/state=CA/store=macys/file2.parquet
/tableName/state=CA/store=JCP/file2.parquet
/tableName/state=CA/store=JCP/file2.parquet
/tableName/state=NY/store=macys/file1.parquet
/tableName/state=NY/store=macys/file2.parquet
/tableName/state=NY/store=JCP/file2.parquet
/tableName/state=NY/store=JCP/file2.parquet
For each state
for each store
get list of files in this dir to replace later
compact
/tableName/state=$STATE/store=$STORE (SPARK JOb?)
replace uncompacted files with compacted files
alter table ... partition ... set location
I would prefer your other suggestion in step 5 " just as well replace the "small" files in the original partition location with their compacted version"
How would I go ahead with implementing it, will it be best done with scripting or scala or some other programing language. I have basic knowledge of scripting, good experiencs in java and new to scala but can learn in couple of days.
Regards,
P

How to efficiently update Impala tables whose files are modified very frequently

We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top os those directories we have 4-5 Impala (2.1) tables. The process writing those files in HDFS is Spark Structured Streaming (2.3.1)
Right now, we are running some DDL queries as soon as we get the files written to HDFS:
ALTER TABLE table1 RECOVER PARTITONS to detect new partitions (and their HDFS directories and files) added to the table.
REFRESH table1 PARTITIONS (partition1=X, partition2=Y), using all the keys for each partition.
Right now, this DDL is taking a bit too long and they are getting queued in our system, damaging the data availability of the system.
So, my question is: Is there a way to do this data incorporation more efficiently?
We have considered:
Using the ALTER TABLE .. RECOVER PARTITONS but as per the documentation, it only refreshes new partitions.
Tried to use REFRESH .. PARTITON ... with multiple partitions at once, but the statement syntaxis does not allow to do that.
Tried batching the queries but the Hive JDBC drives does not support batching queries.
Shall we try to do those updates in parallel given that the system is already busy?
Any other way you are aware of?
Thanks!
Victor
Note: The way in which we know what partitions need refreshed is by using HDFS events as with Spark Structured Streaming we donĀ“t know exactly when the files are written.
Note #2: Also, the files written in HDFS are sometimes small, so it would be great if it could be possible to merge those files at the same time.
Since nobody seems to have the answer for my problem, I would like to share the approach we took to make this processing more efficient, comments are very welcome.
We discovered (doc. is not very clear on this) that some of the information stored in the Spark "checkpoints" in HDFS is a number of metadata files describing when each Parquet file was written and how big was it:
$hdfs dfs -ls -h hdfs://...../my_spark_job/_spark_metadata
w-r--r-- 3 hdfs 68K 2020-02-26 20:49 hdfs://...../my_spark_job/_spark_metadata/3248
rw-r--r-- 3 hdfs 33.3M 2020-02-26 20:53 hdfs://...../my_spark_job/_spark_metadata/3249.compact
w-r--r-- 3 hdfs 68K 2020-02-26 20:54 hdfs://...../my_spark_job/_spark_metadata/3250
...
$hdfs dfs -cat hdfs://...../my_spark_job/_spark_metadata/3250
v1
{"path":"hdfs://.../my_spark_job/../part-00004.c000.snappy.parquet","size":9866555,"isDir":false,"modificationTime":1582750862638,"blockReplication":3,"blockSize":134217728,"action":"add"}
{"path":"hdfs://.../my_spark_job/../part-00004.c001.snappy.parquet","size":526513,"isDir":false,"modificationTime":1582750862834,"blockReplication":3,"blockSize":134217728,"action":"add"}
...
So, what we did was:
Build a Spark Streaming Job polling that _spark_metadata folder.
We use a fileStream since it allow us to define the file filter to use.
Each entry in that stream is one of those JSON lines, which is parsed to extract the file path and size.
Group the files by the parent folder (which maps to each Impala partition) they belong to.
For each folder:
Read a dataframe loading only the targeted Parquet files (to avoid race conditions with the other job writing the files)
Calculate how many blocks to write (using the size field in the JSON and a target block size)
Coalesce the dataframe to the desired number of partitions and write it back to HDFS
Execute the DDL REFRESH TABLE myTable PARTITION ([partition keys derived from the new folder]
Finally, delete the source files
What we achieved is:
Limit the DDLs, by doing one refresh per partition and batch.
By having batch time and block size configurable, we are able to adapt our product to different deployment scenarios with bigger or smaller datasets.
The solution is quite flexible, since we can assign more or less resources to the Spark Streaming job (executors, cores, memory, etc.) and also we can start/stop it (using its own checkpointing system).
We are also studying the possibily of applying some data repartitioning, while doing this process, to have partitions as close as possible to the most optimum size.

How does the CONCATENATE in ALTER TABLE command in HIVE works

I am trying to understand how exactly the ALTER TABLE CONCATENATE in HIVE Works.
I saw this link How does Hive 'alter table <table name> concatenate' work? but all I got from this links is that for ORC Files, the merge happens at a stripe level.
I am looking for a detailed explanation of how CONCATENATE works. As an e.g I initially had 500 small ORC Files in the HDFS. I ran the Hive ALTER TABLE CONCATENATE and the files merged to 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16 and finally I ended up in two large files.( used version Hive 0.12 ) So I wanted to understand
How exactly CONCATENATE works? Does it looks at the existing number of files , as well as the size ? How will it determine the no: of output ORC files after concatenation?
Is there any known issues with using the Concatenate ? We are planning to run the concatenate one a day in the maintenance window
Is Using CTAS an alternative to concatenate and which is better? Note that my requirement is to reduce the no of ORC files (ingested through Nifi) without compromising performance of Read
Any help is appreciated and thanks in advance
Concatenated file size can be controlled with following two values:
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;
These values should be set based on your HDFS/MapR-FS block size.
As commented by #leftjoin it is indeed the case that you can get different output files for the same underlying data.
This is discussed more in the linked HCC thread but the key point is:
Concatenation depends on which files are chosen first.
Note that having files of different sizes, should not be a problem in normal situations.
If you want to streamline your process, then depending on how big your data is, you may also want to batch it a bit before writing to HDFS. For instance, by setting the batch size in NiFi.

Spark read.parquet takes too much time

Hi I don't understand why this code takes too much time.
val newDataDF = sqlContext.read.parquet("hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*")
It's supposed than no bytes are transferred to the driver program, isn't it? How does read.parquet works?
What I can see from the Spark web UI is that read.spark fires about 4000 tasks (there's a lot of parquet files inside that folder).
The issue most likely is the file indexing that has to occur as the first step of loading a DataFrame. You said the spark.read.parquet fires off 4000 tasks, so you probably have many partition folders? Spark will get an HDFS directory listing and recursively get the FileStatus (size and splits) of all files in each folder. For efficiency Spark indexes the files in parallel, so you want to ensure you have enough cores to make it as fast as possible. You can also be more explicit in the folders you wish to read or define a Parquet DataSource table over the data to avoid the partition discovery each time you load it.
spark.sql("""
create table mydata
using parquet
options(
path 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*'
)
""")
spark.sql("msck repair table mydata")
From this point on, when you query the data it will no longer have to do the partition discovery, but it'll still have to get the FileStatus for the files within the folders you query. If you add new partitions you can either add the partition explicitly of force a full repair table again:
spark.sql("""
alter table mydata add partition(foo='bar')
location 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711/foo=bar'
""")

1 Billion records join(Filters) in Spark with Parquet file format vs HadoopText Input format

When reading a 1 Billion records of a table in Spark from Hive and this table have date and country columns as partitions. It is running for very long time since we are doing many transformations on it. If I change the Hive table file format to Parquet then will it be there any performance? Any suggestions on improvement of performance .
Change the Orc to Parquet maybe will not improve the performance.
But it depends of the type of data you have. If you are working with nested objects you need to use Parquet, Orc is not good for that.
But to create some improvement, I suggest you to do some steps that can help with your data in Hive.
Check the number of files in Hive.
One common thing that can create big problems in Hive Query is the number of files in each partition, and the size of these files are. If you are using Spark to store the data, I suggest you to check the size of the files and if they are stored with the size of your Hadoop block. If not, try to use the command CONCATENATE to solve that problem. As you can see here.
Predicate PushDown
This is what Hive, and Orc files can give you with the best performance in query the data. I suggest you to run one ANALYSE command to force the creation of the Statistics of your table, this will improve the performance and if the data is not efficient this will help. Check here and with this will update the Hive Metastore and will give you some relevant data information.
Ordered Data
If it is possible, try to store your data ordered by some column, and filter and do other stuffs in that column. Your join can be improved with this.

Resources