How to delete empty partitions in CrateDB?

CrateDB version: 4.x.x
We have one table that is partitioned by day.
We take a snapshot of the table for each partition, and after the backup we delete that day's data.
Because of the many partitions, the shard count is more than 2000 (the configured number of shards is 6).
I have observed that old partitions contain no data but still exist in the database.
As a result, after restarting CrateDB the cluster takes longer to become healthy and available for writes.
Is there any way to delete those partitions?
Is there any way to stop the replication of data when the cluster starts up? It takes too long for the cluster to become healthy, and the table is not writable until that process has finished.
Any solution for this issue would be a great help.

You should be able to delete empty partitions with a DELETE that uses an exact match on the PARTITIONED BY column, for example: DELETE FROM <tbl> WHERE <partitioned_by_column> = <value>
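If there are many old days to clean up, the statements can be generated instead of typed by hand. This is only a minimal sketch: the object name, the table name and the column name in the usage note are hypothetical, and it assumes the partition column holds values that compare equal to an ISO date literal such as '2021-01-01'.

```scala
import java.time.LocalDate

object EmptyPartitionCleanup {
  // Builds one exact-match DELETE for a single day's partition.
  // Identifiers are interpolated as-is, so only pass trusted table and column names.
  def deleteStatement(table: String, column: String, day: LocalDate): String =
    s"DELETE FROM $table WHERE $column = '$day'"

  // One statement per day in [from, until), oldest first.
  def deleteStatements(table: String, column: String, from: LocalDate, until: LocalDate): List[String] =
    Iterator.iterate(from)(_.plusDays(1))
      .takeWhile(_.isBefore(until))
      .map(day => deleteStatement(table, column, day))
      .toList
}
```

For example, EmptyPartitionCleanup.deleteStatements("metrics", "day", LocalDate.parse("2021-01-01"), LocalDate.parse("2021-02-01")) yields 31 statements that can then be run through whatever client you already use.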

Related

exception: org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update

I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https://docs.delta.io/1.0.0/best-practices.html) I added a compaction job that runs once a day.
val path = "..."
val numFiles = 16
spark.read
  .format("delta")
  .load(path)
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .save(path)
Every time the compaction job runs the streaming job gets the following exception:
org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
I tried to add the following config parameters to the streaming job:
spark.databricks.delta.retryWriteConflict.enabled = true # would be false by default
spark.databricks.delta.retryWriteConflict.limit = 3 # optionally limit the maximum amount of retries
It doesn't help.
Any idea how to solve the problem?
While you stream data in, small files are created (additive), and each batch is recorded as a commit in the Delta log. Compaction addresses the small-file overhead by rewriting the data into larger files (currently 16). These large files are written alongside the small ones, and the switch only takes effect when the compaction commit is written to the Delta log. Say transactions 0-100 produce 100 small files; compaction starts, and its commit tells readers to use the 16 large files instead. The problem is that transactions 101-110 from the streaming job have already been committed while the compaction was running. Because you are compacting ALL of your data, this is essentially a merge conflict.
The solution is to go to the next step in the best practices and compact only selected partitions using:
.option("replaceWhere", partition)
When you compact every day, the partition variable should represent the partition of your data for yesterday. No new files are being written to that partition, and the delta log can identify that the concurrent changes will not apply to currently incoming data for today.
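Put together, a daily job along these lines might look like the sketch below. It follows the pattern from the Delta best-practices page, but the names CompactYesterday, partitionPredicate and compact are made up for illustration, and it assumes the table is partitioned by a column called date. The table in the question is not partitioned, so it would have to be rewritten with such a partition column first.

```scala
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

object CompactYesterday {
  // Assumption: the Delta table is partitioned by a column named "date"
  // whose values compare equal to an ISO date literal.
  def partitionPredicate(day: LocalDate): String = s"date = '$day'"

  // Rewrites only the given day's partition into numFiles files.
  def compact(spark: SparkSession, path: String, day: LocalDate, numFiles: Int): Unit = {
    val partition = partitionPredicate(day)
    spark.read
      .format("delta")
      .load(path)
      .where(partition)                   // read only the partition being compacted
      .repartition(numFiles)
      .write
      .option("dataChange", "false")      // the rewrite does not change the data
      .format("delta")
      .mode("overwrite")
      .option("replaceWhere", partition)  // replace only that partition
      .save(path)
  }
}
```

The daily job would then call CompactYesterday.compact(spark, path, LocalDate.now().minusDays(1), 16), so the streaming job, which writes to today's partition, is not touching the files being replaced.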

HDFS File Compaction with continuous ingestion

We have a few tables in HDFS that receive approx. 40k new files per day. We need to compact these tables every two weeks, and for that we need to stop ingestion.
We have a Spark ingestion job that reads data from Kafka and adds it to HDFS (Hive external tables) every 30 minutes. The data is queried as soon as it is ingested; our SLA is less than an hour, so we cannot increase the batch interval.
The tables are partitioned on two fields. We constantly receive older data, so most of the partitions are updated during each ingestion batch,
e.g.:
/user/head/warehouse/main_table/state=CA/store=macys/part-00000-017258f8-aaa-bbb-ccc-wefdsds.c000.snappy.parquet
We are looking into ways to reduce the number of files created, but even then we will have to compact every three to four weeks, if not every two.
As most of the partitions are updated constantly, we need to stop ingestion (for about a day) before starting compaction, which impacts our users.
I am looking for ways to compact automatically without stopping ingestion.
The chosen partitioning scheme is somewhat unfortunate. Still there are a couple of things you can do. I'm relying on the fact that you can change partition's location atomically in Hive (alter table ... partition ... set location):
1. Copy a partition's HDFS directory to a different location.
2. Compact the copied data.
3. Copy the new files that were ingested since step 1.
4. Run "alter table ... partition ... set location" to point Hive to the new compacted location.
5. Start ingesting to this new location. (If this step is tricky, you can just as well replace the "small" files in the original partition location with their compacted version and run "alter table ... partition ... set location" again to point Hive back to the original partition location.)
You'll have to keep this process running, iterating partition by partition on a continuous basis.
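Steps 1 to 4 for a single partition could be sketched roughly as follows, using the Hadoop FileSystem API and Spark. Treat it as an outline under assumptions rather than a tested job: the object and method names are made up, the staging and compacted directories are assumed not to exist yet, and a batch that lands between step 3 and step 4 is not handled.

```scala
import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.sql.SparkSession

object CompactPartition {
  // Hive statement that repoints one (state, store) partition to a new directory.
  def setLocationSql(table: String, state: String, store: String, location: String): String =
    s"ALTER TABLE $table PARTITION (state='$state', store='$store') SET LOCATION '$location'"

  // File names present now but absent from the copy, i.e. ingested after step 1.
  def newFiles(copied: Set[String], current: Set[String]): Set[String] = current -- copied

  def run(spark: SparkSession, table: String, state: String, store: String,
          original: Path, staging: Path, compacted: Path, numFiles: Int): Unit = {
    val conf = spark.sparkContext.hadoopConfiguration
    val fs = original.getFileSystem(conf)
    // Data file names in a directory, skipping hidden files such as _SUCCESS.
    def names(dir: Path): Set[String] =
      fs.listStatus(dir).filter(_.isFile).map(_.getPath.getName)
        .filterNot(n => n.startsWith("_") || n.startsWith(".")).toSet

    // 1. Copy the partition directory to a different location (staging must not exist yet).
    FileUtil.copy(fs, original, fs, staging, false, conf)
    // 2. Compact the copied data (compacted must not exist yet).
    spark.read.parquet(staging.toString).repartition(numFiles).write.parquet(compacted.toString)
    // 3. Copy the files that were ingested since step 1.
    newFiles(names(staging), names(original)).foreach { name =>
      FileUtil.copy(fs, new Path(original, name), fs, new Path(compacted, name), false, conf)
    }
    // 4. Point Hive to the compacted location.
    spark.sql(setLocationSql(table, state, store, compacted.toString))
  }
}
```

Running this per (state, store) pair on a schedule gives the continuous, partition-by-partition process described above.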
Thank you Facha for your suggestions, really appreciate it.
I am pretty new to HDFS concepts, so please don't mind the basic questions.
What would be the impact on running queries that are accessing these specific files while the uncompacted files are swapped for compacted ones (alter table ... partition ... set location)? I believe the queries might fail. How can we minimize the impact?
Copy a partition's hdfs directory to a different location
As we have two partitions in one table, state and store, will I have to iterate through each sub partition?
/tableName/state=CA/store=macys/file1.parquet
/tableName/state=CA/store=macys/file2.parquet
/tableName/state=CA/store=JCP/file2.parquet
/tableName/state=CA/store=JCP/file2.parquet
/tableName/state=NY/store=macys/file1.parquet
/tableName/state=NY/store=macys/file2.parquet
/tableName/state=NY/store=JCP/file2.parquet
/tableName/state=NY/store=JCP/file2.parquet
For each state
  for each store
    get list of files in this dir to replace later
    compact /tableName/state=$STATE/store=$STORE (Spark job?)
    replace uncompacted files with compacted files
    alter table ... partition ... set location
I would prefer your other suggestion in step 5: "just as well replace the "small" files in the original partition location with their compacted version".
How would I go about implementing it? Would it be best done with scripting, Scala, or some other programming language? I have basic knowledge of scripting, good experience in Java, and am new to Scala but can learn it in a couple of days.
Regards,
P

Spark read.parquet takes too much time

Hi, I don't understand why this code takes so much time.
val newDataDF = sqlContext.read.parquet("hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*")
Supposedly no bytes are transferred to the driver program, are they? How does read.parquet work?
What I can see from the Spark web UI is that read.parquet fires about 4000 tasks (there are a lot of parquet files inside that folder).
The issue most likely is the file indexing that has to occur as the first step of loading a DataFrame. You said the spark.read.parquet fires off 4000 tasks, so you probably have many partition folders? Spark will get an HDFS directory listing and recursively get the FileStatus (size and splits) of all files in each folder. For efficiency Spark indexes the files in parallel, so you want to ensure you have enough cores to make it as fast as possible. You can also be more explicit in the folders you wish to read or define a Parquet DataSource table over the data to avoid the partition discovery each time you load it.
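For the first option, being explicit about the folders, one way is to build the list of folders yourself and pass them to read.parquet instead of a glob. The helper below is a sketch only: DayPaths and forMonth are made-up names, and it assumes there is one folder per day named yyyyMMdd with the data directly inside it, which the 201711* glob suggests but does not prove.

```scala
import java.time.YearMonth
import java.time.format.DateTimeFormatter

object DayPaths {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")

  // One path per calendar day of the month: <root>/20171101, <root>/20171102, ...
  def forMonth(root: String, month: YearMonth): Seq[String] =
    (1 to month.lengthOfMonth).map(day => s"$root/${month.atDay(day).format(fmt)}")
}
```

The result can be passed as varargs, e.g. spark.read.parquet(DayPaths.forMonth(root, YearMonth.of(2017, 11)): _*), or narrowed to just the days a job needs. Every path in the list has to exist, otherwise the read fails. For the second option, the DataSource table can be defined like this: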
spark.sql("""
create table mydata
using parquet
options(
path 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*'
)
""")
spark.sql("msck repair table mydata")
From this point on, when you query the data it will no longer have to do the partition discovery, but it will still have to get the FileStatus for the files within the folders you query. If you add new partitions, you can either add the partition explicitly or force a full repair of the table again:
spark.sql("""
alter table mydata add partition(foo='bar')
location 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711/foo=bar'
""")

Can Hive periodically append or insert incremental data to the same table file in hdfs?

I'm loading captured network data every minute from Spark Streaming (fed by a Flume exec source), aggregating it by IP address, and saving it to Hive at the end. To make it faster I created a Hive ORC table partitioned on IP address, and it works well. The only issue is that every minute it creates many small files of a few KB each (how many depends on the number of IP addresses). For now I use "ALTER TABLE...CONCATENATE;" to merge them manually, but I think it could be easier, so I want to ask whether there is a solution that can incrementally merge/append new data to the first minute's table files instead of creating new table files every minute. Any suggestion is appreciated!
I give up; it looks like there is no direct solution, as Hive can't append content to an existing data file, for performance reasons. My alternative for now is still to concatenate every week. The problem is that queries break with an error message (complaining that a data file can't be found) while the concatenation is running, so there is a big business impact. I'm now thinking of replacing Hive with HBase or Kudu, which are more flexible and provide update/delete operations.

How row level deletes are handled in HBASE?

I am a newbie in HBase, so could someone please clarify my question about row-level deletes in HBase?
Say we have 10 records in a table, so every record will be stored in a separate HFile. If we try to delete any record, it will delete the actual HFile. As I understand it, this is how row-level deletes are handled in HBase.
But during compaction, smaller HFiles are merged into a larger HFile, so all the data will be stored together in larger HFiles. How will row-level deletes be handled then, if all the data is stored together?
Basically the row just gets marked for deletion, and the actual deletion happens during the next major compaction. Please see the Deletion in HBase article for details.
An HFile is not created as soon as you insert data. The data is first stored in the memstore. Once the memstore is sufficiently large, it is flushed to an HFile. A new HFile is not created for every record or row. Also remember that, since records are held in memory, they are sorted and then flushed to the HFile. This is why records in HFiles are always sorted.
HFiles are immutable (any file in HDFS is expected to be immutable, for that matter). Deletion of records does not happen right away; the records are marked for deletion with a delete marker (tombstone). When the system runs a major compaction, the records marked for deletion are actually removed, and the new HFile does not contain them. A minor compaction merges files but generally keeps the delete markers. Until a major compaction runs, the record still exists on disk; however, it is masked from the results whenever it is queried.
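The mechanism can be illustrated with a small toy model in plain Scala. This is not the HBase API and ignores versions, column families and regions; the names TombstoneModel, Store and storedOnDisk are invented for the example. It only shows the three ideas above: writes go to a memstore, flushes produce immutable sorted files, and a delete is a marker that masks older data until a major compaction drops both.

```scala
import scala.collection.immutable.SortedMap

object TombstoneModel {
  // A cell is either a value or a delete marker; a higher seq means newer.
  sealed trait Cell { def seq: Long }
  final case class Put(value: String, seq: Long) extends Cell
  final case class Tombstone(seq: Long) extends Cell

  // Stand-in for an immutable HFile: rows sorted by key.
  type HFile = SortedMap[String, Cell]

  final class Store {
    private var memstore = Map.empty[String, Cell]
    private var hfiles = Vector.empty[HFile]
    private var clock = 0L
    private def tick(): Long = { clock += 1; clock }

    def put(row: String, value: String): Unit = memstore += (row -> Put(value, tick()))
    // A delete only writes a marker; existing files are never modified.
    def delete(row: String): Unit = memstore += (row -> Tombstone(tick()))

    // Flush the memstore into a new sorted, immutable file.
    def flush(): Unit =
      if (memstore.nonEmpty) {
        hfiles = hfiles :+ SortedMap(memstore.toSeq: _*)
        memstore = Map.empty
      }

    // Reads take the newest cell for the row; a tombstone masks older values.
    def get(row: String): Option[String] = {
      val cells = memstore.get(row).toList ++ hfiles.flatMap(_.get(row))
      if (cells.isEmpty) None
      else cells.maxBy(_.seq) match {
        case Put(value, _) => Some(value)
        case Tombstone(_)  => None
      }
    }

    // True while any file still physically holds a cell for the row.
    def storedOnDisk(row: String): Boolean = hfiles.exists(_.contains(row))

    // Merge all files into one, dropping tombstones and the data they mask.
    def majorCompact(): Unit = {
      val newest = hfiles.flatten.groupBy(_._1)
        .map { case (row, cells) => row -> cells.map(_._2).maxBy(_.seq) }
      val live = newest.filter { case (_, cell) => cell.isInstanceOf[Put] }
      hfiles = Vector(SortedMap(live.toSeq: _*))
    }
  }
}
```

In this model, after a put, a flush, a delete and another flush, get returns None for the row while storedOnDisk is still true; only majorCompact makes the row disappear from the files.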

Resources