We have a lot of data in an HBase table. I am new to the NoSQL world. We want to keep data only for a fixed time. Should we write a separate cleanup script, or can we rely on the TTL configuration?
I went through the available docs but I don't understand the exact behaviour.
The HBase documentation clearly says that data older than TTL will be automatically deleted by HBase.
Remember that data is never deleted by HBase until it does a compaction -- where it rewrites its data files. Once the data passes its TTL it becomes invisible to reads, but it is only physically removed when a major compaction happens.
It behaves the way the documentation says: all values in a row whose timestamps are older than the configured TTL will be deleted at the next major compaction. TTL is an attribute of the column family. If you want the TTL to apply to the entire table, simply set the same value for each column family in the table. This way you will get rid of the data once you are done with it.
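For reference, here is a minimal sketch of setting a TTL on a column family with the HBase Java client (2.x Admin/builder API; the namespace, table, family name, and the seven-day value are placeholders, not from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SetTtl {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // TTL is set per column family, in seconds (here: seven days).
            // Expired cells stop being returned by reads right away and are
            // physically removed at the next major compaction.
            // "ns:mytable" and "cf1" are placeholder names.
            admin.modifyColumnFamily(TableName.valueOf("ns:mytable"),
                ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf1"))
                    .setTimeToLive(7 * 24 * 60 * 60)
                    .build());
        }
    }
}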
Currently, there's a denormalized table inside a MySQL database that contains hundreds of columns and millions of records.
The original source of the data does not have any way to track the changes so the entire table is dropped and rebuilt every day by a CRON job.
Now, I would like to import this data into Elasticsearch. What is the best way to approach this? Should I use Logstash to connect directly to the table and import it, or is there a better way?
Exporting the data into JSON or similar is an expensive process since we're talking about gigabytes of data every time.
Also, should I drop the index in Elasticsearch as well, or is there a way to make it recognize the changes?
In any case - I'd recommend using index templates to simplify index creation.
Now for the ingestion strategy, I see two possible options:
Rework your ETL process to do a merge instead of dropping and recreating the entire table. This would definitely be slower but would allow shipping only deltas to ES or any other data source.
As you've imagined yourself, you should probably be fine with Logstash running daily jobs. Create a daily index and drop the old one during the daily migration (see the sketch below).
You could introduce a buffer such as Kafka into your infrastructure, but I feel that might be overkill for your current use case.
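If it helps, a rough sketch of the template-plus-daily-index approach using the Elasticsearch low-level Java REST client (the host, template name, index pattern, and dates are made up; Logstash would still do the actual bulk loading):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class DailyIndexRollover {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // One-time setup: an index template so every daily index
            // matching mytable-* is created with the same settings/mappings.
            Request template = new Request("PUT", "/_index_template/mytable");
            template.setJsonEntity(
                "{\"index_patterns\":[\"mytable-*\"],"
                + "\"template\":{\"settings\":{\"number_of_shards\":1}}}");
            client.performRequest(template);

            // After today's import into mytable-2020.01.02 has succeeded,
            // drop yesterday's index (dates here are placeholders).
            client.performRequest(new Request("DELETE", "/mytable-2020.01.01"));
        }
    }
}

(The _index_template endpoint assumes Elasticsearch 7.8+; older versions use the legacy _template API instead.)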
RocksDB supports concurrent writes to a memtable via the option allow_concurrent_memtable_write, which is part of RocksDB's immutable DBOptions. Since this is a DB option, the setting applies to all column families created in the DB.
But I have a requirement where I want to enable concurrent writes for certain CFs and disable them for others, treating it more like a ColumnFamilyOptions setting.
I understand that I can have two database pointers and separate the column families based on the concurrent-writes setting. Still, I would like to know if it can be done within the same DB.
Thanks in advance
No, it is not possible; it's a DB-level option, not a column family option.
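To illustrate the workaround already mentioned in the question, here is a sketch with the RocksJava binding: two separate databases, one per setting, since allow_concurrent_memtable_write lives on DBOptions (the paths and column family names are hypothetical):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;

public class SplitByWriteMode {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        // allow_concurrent_memtable_write is a DB-level option, so the only
        // way to vary it is to place the CFs in two separate databases.
        DBOptions concurrentOpts = new DBOptions()
            .setCreateIfMissing(true)
            .setCreateMissingColumnFamilies(true)
            .setAllowConcurrentMemtableWrite(true);
        DBOptions serialOpts = new DBOptions()
            .setCreateIfMissing(true)
            .setCreateMissingColumnFamilies(true)
            .setAllowConcurrentMemtableWrite(false);

        List<ColumnFamilyHandle> hot = new ArrayList<>();
        List<ColumnFamilyHandle> cold = new ArrayList<>();
        // "hot_cf"/"cold_cf" and the /tmp paths are made-up examples.
        RocksDB concurrentDb = RocksDB.open(concurrentOpts, "/tmp/db_concurrent",
            Arrays.asList(
                new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY),
                new ColumnFamilyDescriptor("hot_cf".getBytes())), hot);
        RocksDB serialDb = RocksDB.open(serialOpts, "/tmp/db_serial",
            Arrays.asList(
                new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY),
                new ColumnFamilyDescriptor("cold_cf".getBytes())), cold);

        // ... route each write to the database whose setting fits the CF ...
        concurrentDb.close();
        serialDb.close();
    }
}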
I have a table in HBase named 'xyz'. When I do an update operation on this table, it stores a new version even though it is the same record.
How can I prevent the second record from being added?
E.g.:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
The put statements above produce two records with different timestamps when I check all versions. I expected the second record not to be added because it has the same value.
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
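For what it's worth, a minimal sketch of the GET-before-PUT check with the HBase Java client (reusing the table and values from the question). Note the read and the write are not atomic; if that matters, Table.checkAndMutate can push the comparison to the server side:

import java.util.Arrays;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutIfChanged {
    // Write only when the stored value differs from the new one.
    static void putIfChanged(Table table, byte[] row, byte[] cf,
                             byte[] qual, byte[] value) throws Exception {
        Result current = table.get(new Get(row).addColumn(cf, qual));
        byte[] existing = current.getValue(cf, qual);
        if (existing == null || !Arrays.equals(existing, value)) {
            table.put(new Put(row).addColumn(cf, qual, value));
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("ns:xyz"))) {
            putIfChanged(table, Bytes.toBytes("1"), Bytes.toBytes("cf1"),
                Bytes.toBytes("name"), Bytes.toBytes("NewYork"));
        }
    }
}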
As mentioned by @Ben Watson, HBase is best known for its write performance, since it doesn't need to check for the existence of a value: multiple versions are maintained by default.
One hack you can do is use custom versioning. As in the put example above, you already have two versions for the row key. Now, if you insert the same record with the same timestamp, HBase will overwrite the existing cell with the value instead of adding a new version.
NOTE: it is left to your application to supply the same timestamp for a particular value.
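A sketch of that hack with the Java client: the application supplies an explicit timestamp (the constant here is an arbitrary placeholder), so re-inserting the same cell overwrites it instead of adding a version:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FixedTimestampPut {
    static void putWithFixedTimestamp(Table table) throws Exception {
        // The application must derive the same timestamp for the same value;
        // 1L is just a placeholder constant for illustration.
        long fixedTs = 1L;
        Put put = new Put(Bytes.toBytes("1"))
            .addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"),
                       fixedTs, Bytes.toBytes("NewYork"));
        table.put(put);
        table.put(put);  // same row/family/qualifier/timestamp: no new version
    }
}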
I've seen a few (literally, only a few) links, and nothing in the documentation, that talk about clustering with Firebird or whether it can be done.
Then I shot for the moon with this question, CLUSTER command for Firebird?, but the answerer told me that Firebird doesn't even have clustered indexes at all, so now I'm really confused.
Does Firebird physically order data at all? If so, can it be ordered by any key, not just primary, and can the clustering/defragging be turned on and off so that it only does it during downtime?
If not, isn't this a hit to performance since it will take the disk longer to put together disparate rows that naturally should be right next to each other?
(DB noob)
MVCC
I found out that Firebird is based upon MVCC, so old data actually isn't overwritten until a "sweep". I like that a lot!
Again, I can't find much, but it seems like a real shame that data wouldn't be defragged according to a key.
This says that database pages are defragmented but provides no further explanation.
Firebird does not cluster records. It was designed to avoid the problems that require clustering and the fragmentation problems that come with clustered indexes. Indexes and data are stored separately, on different types of pages. Each data page contains data from only one table. Records are stored in the order they were inserted, give or take concurrent inserts, which generally go on separate pages. When old records are removed, new records will be stored in their place, so new records sometimes appear on the same page as older ones.
Many tables use an artificial primary key, generally ascending, which might be a database-generated sequence or a timestamp. That practice causes records to be stored in key order, but that order is by no means guaranteed. Nor is it very interesting. When the primary key is artificial, most queries that return groups of related records use secondary indexes. That is where clustering becomes a performance hit: a look-up on a secondary index requires traversing two indexes, because the secondary index provides only the key into the primary index, which must then be traversed to find the data.
On the larger issue of defragmentation and space usage, Firebird tracks the free space on pages so new records will be inserted on pages that have had records removed. If a page becomes completely empty, it will be reallocated. This space management is done as the database runs.
As you know, Firebird uses Multi-Version Concurrency Control, so when a record is updated or deleted, Firebird creates a new record version but keeps the old version around. When all transactions that were running before the change was committed have ended, the old record version no longer serves any purpose, and Firebird will remove it. In many applications, old versions are removed in the normal course of running the database: when a transaction touches a record with old versions, Firebird checks the state of the old versions and removes them if no running transaction can read them. There is also a function called "sweep" that systematically removes unneeded old record versions. Sweep can run concurrently with other database activity, though it's better to schedule it when the database load is low. So no, it's not true that nothing is removed until you run a sweep.
Best regards,
Ann Harrison
who's worked with Firebird and its predecessors for an embarrassingly long time
BTW - as the first person to answer mentioned, Firebird does leave space on pages so that the old version of a record stays on the same page as the newer version. It's not a fixed percentage of the space, but 16 bytes per record stored on the page, so pages of tables with very short records have more free space and tables that have long records have less.
On restore, database pages are created ~70% full (as I recall, unless you specify gbak's -use_all_space switch) and the restore is done one table at a time, writing pages to the end of the database file as needed. You can imagine a scenario where pages could be condensed down to much less. Hence bringing the data together and "defragging" it.
As far as controlling the physical grouping on disk or doing an online defrag -- in Firebird there is none. Remember that just because you need to access a page does not mean your disk does a read -- file system and database cache can avoid it!
Suppose that, a couple hundred gigs after starting to use Hive, I want to add a column.
From the various articles and pages I have seen, I cannot understand the consequences in terms of:
storage space required (double?)
blocking (can I still read the table from other processes?)
time (is it quick, or as slow as a MySQL change?)
underlying storage (do I need to change all the underlying files? How can it be done using RCFile?)
Bonus points to whoever can answer the same question for structs in a Hive column.
If you add a column to a Hive table, only the underlying metastore is updated.
The required storage space is not increased as long as you do not add data.
The change can be made while other processes are accessing the table.
The change is very quick (only the underlying metastore is updated).
You do not have to change the underlying files; existing records have the value NULL for the new column.
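For example, a hedged sketch issuing the DDL over HiveServer2 via JDBC (the connection URL, table, and column names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddHiveColumn {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // Metadata-only change: no data files are rewritten (RCFile
            // included), and existing rows read back NULL for the column.
            // "mytable" and "new_col" are made-up names.
            stmt.execute("ALTER TABLE mytable ADD COLUMNS (new_col STRING)");
        }
    }
}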
I hope this helps.
ALTER TABLE commands modify the metadata only. The underlying data remains untouched. However, it is the user's responsibility to ensure that any alteration does not break data consistency.
Also, any change to the metadata is applied to the metastore, which is most typically MySQL, in which case the response time is comparable.
Altering the definition will only modify how the files are read, not the contents of the underlying files.
If your files were tab delimited text with 3 columns, you could create a table that references those files with a scheme like new_table(line STRING) that would read the entire line without parsing out columns based upon the tab characters.
When you add a column, there are no further delimiters in the existing records, so the new column defaults to NULL, as Helmut mentioned.