exception: org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update - spark-streaming

I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https://docs.delta.io/1.0.0/best-practices.html), I added a compaction job that runs once a day:
val path = "..."
val numFiles = 16

spark.read
  .format("delta")
  .load(path)
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .save(path)
Every time the compaction job runs, the streaming job gets the following exception:
org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
I tried to add the following config parameters to the streaming job:
spark.databricks.delta.retryWriteConflict.enabled = true # would be false by default
spark.databricks.delta.retryWriteConflict.limit = 3 # optionally limit the maximum amount of retries
It doesn't help.
Any idea how to solve the problem?

When you're streaming the data in, small files are being created (additive) and these files are being referenced in your delta log (an update). When you perform your compaction, you're trying to resolve the small-files overhead by collating the data into larger files (currently 16). These large files are created alongside the small ones, but the switch only happens when the delta log is written. That is, transactions 0-100 make 100 small files, compaction occurs, and your new transaction tells you to now refer to the 16 large files instead. The problem is, transactions 101-110 have already occurred from the streaming job while the compaction was running. After all, you're compacting ALL of your data, so you essentially have a merge conflict.
The solution is to go to the next step in the best practices and compact only selected partitions using:
.option("replaceWhere", partition)
When you compact every day, the partition variable should represent yesterday's partition of your data. No new files are being written to that partition, so the delta log can identify that the concurrent changes do not touch the data currently arriving for today.
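For illustration, a minimal sketch of that partition-scoped compaction, assuming the table were partitioned by a date column (the column name and value below are placeholders, not from the question):

// Sketch only: assumes a "date" partition column, which the original
// (unpartitioned) table does not have yet.
val path = "..."
val numFiles = 16
val partition = "date = '2021-11-10'"   // yesterday's partition (placeholder value)

spark.read
  .format("delta")
  .load(path)
  .where(partition)
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .option("replaceWhere", partition)    // rewrite only the files in that partition
  .format("delta")
  .mode("overwrite")
  .save(path)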

Related

What should be the minimum size of a valid ORC file with snappy compression

The scenario I am dealing with: each hour, 10k ORC files are generated in HDFS by a Spark Streaming application, and at the end of the hour a Spark merge job runs, merges those small files into bigger chunks, and writes them to the Hive landing path for the external table to pick up. Sometimes a corrupt ORC file makes the merge job fail. The task is to find the corrupt ORC files, move them to a badrecordspath, and then let the Spark merge job begin. After going through the theory of the ORC format, it seems a valid ORC file has "ORC" (as a string) followed by one more byte at the end of the file. How do I check that in an optimised way so that it won't take much time to validate those 10k ORC files? I thought of writing a bash shell script, but it seems to take a good amount of time to validate ORC files on HDFS. My idea is to narrow down the validation if I know the minimum size of a valid ORC file, because most of our corrupt files are very tiny (mostly 3 bytes). Any suggestion would be very helpful.
PS: I can't use set spark.sql.files.ignoreCorruptFiles=true because I have to track the files and move those to bad records path.
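For what it's worth, a minimal sketch of the tail-byte check described in the question, assuming a valid ORC file ends with the 3-byte magic "ORC" immediately followed by the 1-byte postscript length (the helper name is mine, not from the thread):

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: checks the last 4 bytes of the file without streaming its body.
def looksLikeOrc(fs: FileSystem, file: Path): Boolean = {
  val len = fs.getFileStatus(file).getLen
  if (len < 4) false                        // too small to even hold the magic + length byte
  else {
    val in = fs.open(file)
    try {
      val tail = new Array[Byte](3)
      in.readFully(len - 4, tail)           // positioned read: only the tail is fetched
      new String(tail, "US-ASCII") == "ORC"
    } finally in.close()
  }
}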
I found a solution. We can use set spark.sql.files.ignoreCorruptFiles=true and then track the ignored files using the method below:
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.input_file_name

// listOfCompleteFiles: the full listing of the input files, taken before the read
def trackIgnoreCorruptFiles(df: DataFrame, listOfCompleteFiles: List[Path]): List[Path] = {
  // files that actually made it into the DataFrame, i.e. were not ignored
  val listOfFileAfterIgnore = df.withColumn("file_name", input_file_name())
    .select("file_name")
    .distinct()
    .collect()
    .map(x => new Path(x(0).toString))
    .toList
  // everything that was listed but never read is what Spark ignored as corrupt
  listOfCompleteFiles.diff(listOfFileAfterIgnore)
}
input_file_name is a built-in Spark function that returns the full path of the file, and we add it as a column to the DataFrame df. listOfFileAfterIgnore is therefore the list of files that survived the ignore; diffing it against the complete file listing gives the actual list of files ignored by Spark, which the method returns. You can then easily move those files to a badRecordsPath for future analysis.
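A hypothetical wiring of the method, just to show where the complete listing comes from (the input directory path is a placeholder, not from the question):

import org.apache.hadoop.fs.{FileSystem, Path}

// placeholder path; take the full listing before reading so the diff is meaningful
val inputDir = new Path("hdfs:///data/landing/hour=01")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val listOfCompleteFiles = fs.listStatus(inputDir).map(_.getPath).toList

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.orc(inputDir.toString)

val ignoredFiles = trackIgnoreCorruptFiles(df, listOfCompleteFiles)
// ignoredFiles can now be moved to a badRecordsPath for later analysis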

HDFS File Compaction with continuous ingestion

We have a few tables in HDFS which are getting approx. 40k new files per day. We need to compact these tables every two weeks, and for that we need to stop ingestion.
We have Spark ingestion getting data from Kafka and adding it to HDFS (Hive external tables) every 30 minutes. The data is queried as soon as it is ingested; our SLA is less than an hour, so we cannot increase the batch interval.
The tables are partitioned on two fields. We constantly receive older data, so most of the partitions are updated during each ingestion batch,
e.g.:
/user/head/warehouse/main_table/state=CA/store=macys/part-00000-017258f8-aaa-bbb-ccc-wefdsds.c000.snappy.parquet
We are looking into ways to reduce the number of files created, but even with that we will have to compact every 3-4 weeks, if not every two.
As most of the partitions are updated constantly, we need to stop the ingestion (~1 day) before starting compaction, which impacts our users.
Is there a way to compact automatically without stopping the ingestion?
The chosen partitioning scheme is somewhat unfortunate. Still, there are a couple of things you can do. I'm relying on the fact that you can change a partition's location atomically in Hive (alter table ... partition ... set location):
1. Copy the partition's HDFS directory to a different location.
2. Compact the copied data.
3. Copy any new files that were ingested since step 1.
4. Run "alter table ... partition ... set location" to point Hive to the new, compacted location.
5. Start ingesting to this new location. (If this step is tricky, you can just as well replace the "small" files in the original partition location with their compacted version and run "alter table ... partition ... set location" again to point Hive back to the original partition location.)
You'll have to keep this process running, iterating partition by partition, on a continuous basis.
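A rough sketch of one iteration of that loop, driven from Spark; the /tmp scratch location and the number of output files are placeholders, while the table name and partition values come from the example path above:

// Sketch only: the scratch location under /tmp is a placeholder.
val src  = "hdfs:///user/head/warehouse/main_table/state=CA/store=macys"
val work = "hdfs:///tmp/compaction/state=CA/store=macys"

// steps 1-2: copy the partition aside and compact it in one pass
spark.read.parquet(src)
  .coalesce(4)                       // target number of large files
  .write.mode("overwrite").parquet(work)

// step 3: repeat the copy/compact for any files ingested since the first pass

// step 4: atomically repoint the Hive partition at the compacted copy
spark.sql(
  s"""ALTER TABLE main_table PARTITION (state='CA', store='macys')
     |SET LOCATION '$work'""".stripMargin)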
Thank you Facha for your suggestions, I really appreciate it.
I am pretty new to HDFS concepts, so please don't mind the basic questions.
What would be the impact on running queries that are accessing these specific files while swapping the uncompacted files for the compacted ones (alter table ... partition ... set location)? I believe the queries might fail. How can we minimize the impact?
Regarding "Copy a partition's hdfs directory to a different location": as we have two partition fields in one table, state and store, will I have to iterate through each sub-partition?
/tableName/state=CA/store=macys/file1.parquet
/tableName/state=CA/store=macys/file2.parquet
/tableName/state=CA/store=JCP/file1.parquet
/tableName/state=CA/store=JCP/file2.parquet
/tableName/state=NY/store=macys/file1.parquet
/tableName/state=NY/store=macys/file2.parquet
/tableName/state=NY/store=JCP/file1.parquet
/tableName/state=NY/store=JCP/file2.parquet
for each state
  for each store
    get the list of files in this dir to replace later
    compact /tableName/state=$STATE/store=$STORE (Spark job?)
    replace the uncompacted files with the compacted files
    alter table ... partition ... set location
I would prefer your other suggestion in step 5, to "just as well replace the 'small' files in the original partition location with their compacted version".
How would I go about implementing it: would it be best done with scripting, Scala, or some other programming language? I have basic knowledge of scripting, good experience in Java, and I am new to Scala but can learn it in a couple of days.
Regards,
P

How to efficiently update Impala tables whose files are modified very frequently

We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files to HDFS is Spark Structured Streaming (2.3.1).
Right now, we are running some DDL queries as soon as we get the files written to HDFS:
ALTER TABLE table1 RECOVER PARTITIONS to detect new partitions (and their HDFS directories and files) added to the table.
REFRESH table1 PARTITION (partition1=X, partition2=Y), using all the keys for each partition.
Right now, these DDL statements are taking a bit too long and they are getting queued up in our system, hurting the data availability of the system.
So, my question is: Is there a way to do this data incorporation more efficiently?
We have considered:
Using ALTER TABLE .. RECOVER PARTITIONS, but as per the documentation it only picks up new partitions.
Trying REFRESH .. PARTITION ... with multiple partitions at once, but the statement syntax does not allow that.
Batching the queries, but the Hive JDBC driver does not support batching queries.
Shall we try to do those updates in parallel, given that the system is already busy?
Any other way you are aware of?
Thanks!
Victor
Note: the way we know which partitions need to be refreshed is by using HDFS events, since with Spark Structured Streaming we don't know exactly when the files are written.
Note #2: also, the files written to HDFS are sometimes small, so it would be great if it were possible to merge those files at the same time.
Since nobody seems to have the answer to my problem, I would like to share the approach we took to make this processing more efficient. Comments are very welcome.
We discovered (the documentation is not very clear on this) that part of the information stored in the Spark "checkpoints" in HDFS is a set of metadata files describing when each Parquet file was written and how big it was:
$hdfs dfs -ls -h hdfs://...../my_spark_job/_spark_metadata
-rw-r--r-- 3 hdfs 68K 2020-02-26 20:49 hdfs://...../my_spark_job/_spark_metadata/3248
-rw-r--r-- 3 hdfs 33.3M 2020-02-26 20:53 hdfs://...../my_spark_job/_spark_metadata/3249.compact
-rw-r--r-- 3 hdfs 68K 2020-02-26 20:54 hdfs://...../my_spark_job/_spark_metadata/3250
...
$hdfs dfs -cat hdfs://...../my_spark_job/_spark_metadata/3250
v1
{"path":"hdfs://.../my_spark_job/../part-00004.c000.snappy.parquet","size":9866555,"isDir":false,"modificationTime":1582750862638,"blockReplication":3,"blockSize":134217728,"action":"add"}
{"path":"hdfs://.../my_spark_job/../part-00004.c001.snappy.parquet","size":526513,"isDir":false,"modificationTime":1582750862834,"blockReplication":3,"blockSize":134217728,"action":"add"}
...
So, what we did was:
Build a Spark Streaming job polling that _spark_metadata folder.
We use a fileStream, since it allows us to define the file filter to use.
Each entry in that stream is one of those JSON lines, which is parsed to extract the file path and size.
Group the files by the parent folder (which maps to an Impala partition) they belong to.
For each folder (see the sketch after this list):
  Read a DataFrame loading only the targeted Parquet files (to avoid race conditions with the other job writing the files).
  Calculate how many blocks to write (using the size field in the JSON and a target block size).
  Coalesce the DataFrame to the desired number of partitions and write it back to HDFS.
  Execute the DDL REFRESH TABLE myTable PARTITION (partition keys derived from the new folder).
  Finally, delete the source files.
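A condensed sketch of that per-folder step, assuming the JSON lines of one polled batch have already been parsed into (path, size) entries; the Entry case class, the target block size, and the refresh/delete handling are placeholders of mine:

// Sketch only: `entries` stands in for the parsed "path"/"size" pairs of one batch.
case class Entry(path: String, size: Long)
val entries: Seq[Entry] = Seq()                     // filled from the polled _spark_metadata batch
val targetBlockSize = 128L * 1024 * 1024            // configurable target output size

entries
  .groupBy(e => e.path.substring(0, e.path.lastIndexOf('/')))   // parent folder ~ Impala partition
  .foreach { case (folder, files) =>
    val numBlocks = math.max(1, (files.map(_.size).sum / targetBlockSize).toInt)
    spark.read.parquet(files.map(_.path): _*)       // load only the targeted files
      .coalesce(numBlocks)
      .write.mode("append").parquet(folder)         // write the compacted files back
    // then: REFRESH myTable PARTITION (...)  -- issued against Impala, e.g. over JDBC
    // and finally delete the original small source files
  }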
What we achieved is:
Limiting the DDLs, by doing one refresh per partition and batch.
By having the batch time and block size configurable, we are able to adapt our product to different deployment scenarios with bigger or smaller datasets.
The solution is quite flexible, since we can assign more or fewer resources to the Spark Streaming job (executors, cores, memory, etc.) and we can also start/stop it (using its own checkpointing system).
We are also studying the possibility of applying some data repartitioning while doing this process, to get the partitions as close as possible to the optimal size.

Apache Nifi - Consume Kafka + Merge Content + Put HDFS to avoid small files

I have around 2,000,000 messages in a Kafka topic and I want to put these records into HDFS using NiFi, so I am using the PutHDFS processor along with ConsumeKafka_0_10. Because that generates small files in HDFS, I am using the MergeContent processor to merge the records before pushing the file.
This works fine for a small number of messages, but for topics with massive data it writes a single file for every record. Please help if the configuration needs changes.
Thank you!!
The Minimum Number of Entries is set to 1, which means a bin could contain anywhere from 1 up to the Maximum Number of Entries. Try making that minimum something higher, like 100k.

Spark read.parquet takes too much time

Hi, I don't understand why this code takes so much time.
val newDataDF = sqlContext.read.parquet("hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*")
No bytes are supposed to be transferred to the driver program, are they? How does read.parquet work?
What I can see from the Spark web UI is that read.parquet fires about 4000 tasks (there are a lot of parquet files inside that folder).
The issue is most likely the file indexing that has to occur as the first step of loading a DataFrame. You said spark.read.parquet fires off 4000 tasks, so you probably have many partition folders? Spark will get an HDFS directory listing and recursively get the FileStatus (size and splits) of all files in each folder. For efficiency, Spark indexes the files in parallel, so you want to ensure you have enough cores to make it as fast as possible. You can also be more explicit about the folders you wish to read, or define a Parquet DataSource table over the data to avoid the partition discovery each time you load it:
spark.sql("""
create table mydata
using parquet
options(
path 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*'
)
""")
spark.sql("msck repair table mydata")
From this point on, when you query the data it will no longer have to do the partition discovery, but it will still have to get the FileStatus for the files within the folders you query. If you add new partitions, you can either add the partition explicitly or force a full repair table again:
spark.sql("""
alter table mydata add partition(foo='bar')
location 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711/foo=bar'
""")
