How are row-level deletes handled in HBase?

I am a newbie to HBase, so could someone please clarify my question about row-level deletes in HBase?
Say we have 10 records in a table. My understanding was that every record is stored in a separate HFile, so if we delete a record, the actual HFile gets deleted; I thought this is how row-level deletes are handled in HBase.
But during compaction, smaller HFiles are merged into larger HFiles, so all the data ends up stored together in larger HFiles. Now, how are row-level deletes handled if all the data is stored together?

Basically it just gets marked for deletion and the actual deletion happens during the next compaction. Please see the Deletion in HBase article for details.

An HFile is not created as soon as you insert data. The data is first stored in the memstore; once the memstore is sufficiently large, it is flushed to an HFile, so a new HFile is not created for every record or row. Also remember that because records sit in memory first, they get sorted before being flushed to the HFile. This is why records in HFiles are always sorted.
HFiles are immutable (any file in HDFS, for that matter, is expected to be immutable). Deletion of records does not happen right away; they are marked for deletion with a delete marker (tombstone). When the system runs a compaction, the records marked for deletion are physically removed and the new HFile does not contain them (the delete markers themselves are only dropped during a major compaction). Until a compaction happens, the record still exists in the HFile, but it is masked from the results whenever it is queried.
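For concreteness, here is a minimal sketch of that lifecycle using the HBase Java client from Scala. The table name 'emp' and row key 'row-42' are made-up placeholders; the calls only illustrate the delete-marker / flush / compaction behaviour described above.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Scan}
import org.apache.hadoop.hbase.util.Bytes

object DeleteAndCompactSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()              // picks up hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val tableName = TableName.valueOf("emp")            // hypothetical table
    val table = connection.getTable(tableName)
    val admin = connection.getAdmin

    // Deleting a row only writes a delete marker (tombstone); existing HFiles are untouched.
    table.delete(new Delete(Bytes.toBytes("row-42")))

    // Flush the memstore so the tombstone lands in a new HFile.
    admin.flush(tableName)

    // A raw scan returns delete markers and deleted cells that compaction has not removed yet.
    val rawScan = new Scan()
    rawScan.setRaw(true)
    val scanner = table.getScanner(rawScan)
    var result = scanner.next()
    while (result != null) {
      println(result)                                   // the tombstone for 'row-42' shows up here
      result = scanner.next()
    }
    scanner.close()

    // A major compaction rewrites the store files; the tombstone and the masked cells are dropped.
    admin.majorCompact(tableName)

    table.close()
    admin.close()
    connection.close()
  }
}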

Related

How to delete empty partitions in CrateDB?

CrateDB: 4.x.x
We have one table that is partitioned by day.
We take a snapshot of the table based on that partition, and after taking the backup we delete that day's data.
Because of the many partitions, the shard count is more than 2000, while the configured number of shards is 6.
I have observed that old partitions have no data but still exist in the database.
So it takes more time for the cluster to become healthy and writable after restarting CrateDB.
Is there any way to delete those partitions?
Is there any way to stop the replication of data when the cluster starts up? It takes too much time for the cluster to become healthy, and the table is not writable until that process finishes.
Any solution for this issue would be a great help.
You should be able to delete empty partitions with a DELETE that uses an exact match on the partitioned-by column, like DELETE FROM <tbl> WHERE <partitioned_by_column> = <value>.
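If you want to script that cleanup, here is a minimal sketch from Scala over JDBC. The host, schema/table name (doc.metrics), partition column (day) and partition value are all made-up placeholders, and it assumes the PostgreSQL JDBC driver talking to CrateDB's PostgreSQL wire protocol (port 5432 by default).

import java.sql.DriverManager

object DropEmptyPartitionSketch {
  def main(args: Array[String]): Unit = {
    // CrateDB speaks the PostgreSQL wire protocol; 'crate' is the default user.
    val connection = DriverManager.getConnection("jdbc:postgresql://crate-host:5432/doc", "crate", "")
    try {
      // An exact match on the partitioned-by column drops that whole partition,
      // even when it no longer holds any rows.
      val statement = connection.prepareStatement("DELETE FROM doc.metrics WHERE day = ?")
      statement.setString(1, "2023-01-15")   // adjust the value/type to your partition column
      statement.executeUpdate()
      statement.close()
    } finally {
      connection.close()
    }
  }
}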

HBase: HFile stats not changed after flush

I have an HBase table 'emp'. I created some rows in it using the hbase shell, among which the biggest rowkey is 123456789.
When I check the HBase UI (the web console), following the path below:
regions -> emp,,1582232348771.4f2d545621630d98353802540fbf8b00. -> hdfs://namenode:9000/hbase/data/default/emp/4f2d545621630d98353802540fbf8b00/personal data/15a04db0d3a44d2ca7e12ab05684c876 (store file)
I can see Key of biggest row: 123456789, so everything is good.
But the problem came when I deleted the row with rowkey 123456789 using the hbase shell. I also put some other rows, then finally flushed the table with flush 'emp'.
I see a second HFile generated, but the Key of biggest row of the first HFile is still 123456789.
So I am very confused: this row no longer exists in my HBase table, and I already did a flush (so everything in the memstore should be in HFiles). Why do the stats still show this rowkey? What is going on behind the scenes?
And how can I update the stats?
You're correct in that everything in the memstore is now in HFiles, but until a compaction takes place the deleted row will still exist, albeit marked for deletion in the new, second HFile.
If you force a compaction with major_compact 'table_name', 'col_fam', you should see this record disappear (and be left with one HFile). Maybe there's a small bug in the stats that doesn't take deleted records into account?
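If you drive this from code rather than the shell, keep in mind that a major compaction request is asynchronous. A rough sketch with the HBase 2.x client API from Scala (the table name is a placeholder) that waits for the compaction to finish before re-checking the store files:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{CompactionState, ConnectionFactory}

object WaitForMajorCompaction {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val admin = connection.getAdmin
    val tableName = TableName.valueOf("emp")   // hypothetical table

    admin.majorCompact(tableName)              // only requests the compaction; returns immediately

    // Poll until the servers report no compaction running for this table.
    // (Give the request a moment to start before trusting a NONE answer.)
    Thread.sleep(2000)
    while (admin.getCompactionState(tableName) != CompactionState.NONE) {
      Thread.sleep(1000)
    }
    println("major compaction finished; the old HFile and its stale stats should be gone")

    admin.close()
    connection.close()
  }
}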

Spark read.parquet takes too much time

Hi, I don't understand why this code takes so much time.
val newDataDF = sqlContext.read.parquet("hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*")
No bytes are supposed to be transferred to the driver program at this point, right? How does read.parquet work?
What I can see from the Spark web UI is that read.parquet fires about 4000 tasks (there are a lot of Parquet files inside that folder).
The issue is most likely the file indexing that has to occur as the first step of loading a DataFrame. You said spark.read.parquet fires off about 4000 tasks, so you probably have many partition folders. Spark gets an HDFS directory listing and recursively fetches the FileStatus (size and splits) of every file in each folder. For efficiency Spark indexes the files in parallel, so you want to ensure you have enough cores to make it as fast as possible. You can also be more explicit about the folders you wish to read (see the sketch at the end of this answer), or define a Parquet DataSource table over the data to avoid the partition discovery each time you load it.
spark.sql("""
create table mydata
using parquet
options(
path 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*'
)
""")
spark.sql("msck repair table mydata")
From this point on, when you query the data it will no longer have to do the partition discovery, but it will still have to get the FileStatus for the files within the folders you query. If you add new partitions you can either add the partition explicitly or force a full repair table again:
spark.sql("""
alter table mydata add partition(foo='bar')
location 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711/foo=bar'
""")

How exactly does the Cassandra read procedure work?

I have a little experience with Cassandra, but I have one question regarding the Cassandra read process.
Suppose we have 7 SSTables for a given table in our Cassandra DB. Now, if we perform a read query for data that is not in the memtable, Cassandra will look into the SSTables. My question is:
During this process, will Cassandra load all 7 SSTables into memory, or will it just look into all the SSTables and load only the relevant rows into memory instead of loading the SSTables entirely?
Thanking you in advance!!
And please do correct me if I have interpreted something wrong.
It would also be great if someone could explain or point to better resources on how SSTables work.
During this process will Cassandra load all the SSTables (7)?
No, Cassandra would not load all 7 SSTables. Each SSTable has a Bloom filter (held in memory) that indicates whether that SSTable might contain the data.
If the Bloom filter indicates the data might be in that SSTable, Cassandra looks into the partition key cache and from there gets the entry in the compression offset map (also in memory) to locate the compressed block that holds the data we are looking for.
If the key is found in the partition key cache, the compressed block is then read (disk I/O) to get the data.
If it is not found there, Cassandra looks into the partition summary to get the location of the index entry, reads that location (disk I/O) into memory, and then continues with the compression offset map flow described above.
To start with, this Cassandra Reads link should help; it depicts the read path pictorially.
One more thing: there is also a row cache, which holds hot (frequently accessed) rows; if the row is found in the row cache, the SSTables are not hit or loaded at all.
Go through this rowcache link to understand the row cache and the partition key cache.
Another great presentation, shared by Jeff Jirsa, is Understanding Cassandra Table Options. It is really worth going through.
On a different note, there is also compaction, which happens periodically to reduce the number of SSTables and to drop rows based on tombstones.
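To make the "not all 7 SSTables are read" point concrete, here is a toy sketch in Scala. None of these types are Cassandra APIs; they just model the per-SSTable gate described above, where the in-memory Bloom filter check happens first and the disk lookup only runs for SSTables that might contain the key.

final case class Cell(value: String, writeTime: Long)

final case class SSTableModel(name: String, data: Map[String, Cell]) {
  // Stand-in for the Bloom filter: a real one may also return occasional false positives,
  // but it never says "no" for a key that is actually present.
  def bloomFilterMightContain(key: String): Boolean = data.contains(key)

  // Stand-in for the key cache / partition summary / index / compression offset map steps (disk I/O).
  def lookup(key: String): Option[Cell] = {
    println(s"disk lookup in $name")
    data.get(key)
  }
}

object ReadPathSketch {
  def read(key: String, sstables: Seq[SSTableModel]): Option[Cell] =
    sstables
      .filter(_.bloomFilterMightContain(key))   // most SSTables are ruled out here, purely in memory
      .flatMap(_.lookup(key))                   // only the survivors pay the disk I/O
      .sortBy(-_.writeTime)                     // the most recently written version of the cell wins
      .headOption

  def main(args: Array[String]): Unit = {
    val tables = (1 to 7).map { i =>
      SSTableModel(s"sstable-$i", if (i == 3) Map("k1" -> Cell("v1", 100L)) else Map.empty)
    }
    println(read("k1", tables))   // prints one "disk lookup" line, not seven
  }
}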

How to count rows that got copied from one Hive table to another

I am moving data from one Hive table to another Hive table. While moving the data, I add a few new columns, add a partition, and also apply compression.
I wanted to know if there is an easy way to verify that the number of rows moved from one table to the other is the same, just to validate the data-move action.
Currently I am doing a count on both tables, which takes too much time as the number of rows is on the order of 10^10.
Thanks
When a map-reduce job is triggered during the transfer of data from the first table to the second, you can use the record counters from that map/reduce job to validate the row count.
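One way to read those counters after the job has run, sketched with the Hadoop MapReduce client API from Scala. The job ID is a placeholder (take the real one from the Hive console output or the ResourceManager UI), and the exact names of Hive's own RECORDS_* operator counters vary by Hive version, so the sketch simply prints every counter group so you can pick the right ones to compare. Reading a finished job this way also requires the job history server to be reachable.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Cluster, JobID, TaskCounter}

object ReadJobCounters {
  def main(args: Array[String]): Unit = {
    val cluster = new Cluster(new Configuration())
    // Placeholder job id for illustration only.
    val job = cluster.getJob(JobID.forName("job_1234567890123_0042"))
    if (job == null) {
      println("job not found (is the history server running?)")
    } else {
      val counters = job.getCounters

      // Generic MapReduce record counters.
      println("map input records:  " + counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue)
      println("map output records: " + counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue)

      // Print everything, including Hive's own operator counters; compare the
      // rows-read counter of the source with the rows-written counter of the target.
      val groupIt = counters.iterator()
      while (groupIt.hasNext) {
        val group = groupIt.next()
        val counterIt = group.iterator()
        while (counterIt.hasNext) {
          val counter = counterIt.next()
          println(s"${group.getName} :: ${counter.getName} = ${counter.getValue}")
        }
      }
    }
    cluster.close()
  }
}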
