Why do B-tree block or SSTable file implementations keep headers/indexes at the end of the file/block? - leveldb

Why does this tendency (of keeping the "control structure" at the end of the file/block) exist?

This is more of a question about the SSTable format rather than about Bigtable.
See https://www.quora.com/Why-do-HFile-and-SSTable-store-their-indexes-at-the-end-of-the-file.
It's to avoid having to seek again when writing out the index - you can build up the index in memory as you write out the file and then write the index at the end.
The index contains offsets, and you may not know its correct size until you have reached the end of the data.
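A minimal Python sketch of that idea (this is not leveldb's actual on-disk format; the record encoding and footer layout here are invented for illustration): data entries are appended as they arrive, the (key, offset) index is accumulated in memory, and the index plus a fixed-size footer pointing to it are written last.

```python
import struct

def write_sstable_like_file(path, records):
    """Toy data-then-index layout (not leveldb's actual block format)."""
    index = []                                   # (key, offset) pairs built in memory
    with open(path, "wb") as f:
        for key, value in records:               # records assumed sorted by key
            index.append((key, f.tell()))        # remember where this entry starts
            entry = key.encode() + b"\x00" + value.encode()
            f.write(struct.pack("<I", len(entry)) + entry)

        index_offset = f.tell()                  # the index goes after all the data
        for key, offset in index:
            k = key.encode()
            f.write(struct.pack("<I", len(k)) + k + struct.pack("<Q", offset))

        # Fixed-size footer: a reader seeks to (end of file - 8) to find the index.
        f.write(struct.pack("<Q", index_offset))

write_sstable_like_file("demo.sst", [("apple", "1"), ("banana", "2")])
```

Because nothing before the footer is ever revisited, the file can be written with a single forward pass and no extra seeks.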

Related

VSAM Search VS COBOL search/loop

I have a file that could contain about 3 million records. Certain records of this file will need to be updated multiple times throughout the run of the program. If I need to pull specific records from this file, which of the following is most efficient:
Indexed VSAM search
Indexed flat file with a COBOL search all
Buffering all of the data into working storage and writing a loop to handle the search
Obviously, if you can buffer all of the data into memory (and if the host system can support a working set of pages big enough to allow all of it to actually remain in RAM without paging), then this would probably be the fastest possible approach.
But be very careful to consider "hidden disk I/O" caused by the virtual-memory paging subsystem! If the requested "in-memory" data is, in fact, not in memory, a page fault will occur and your process will stop in its tracks until the page has been retrieved. (And if "page stealing" occurs, well, you're in trouble: your "in-memory" strategy just turned into a possibly very inefficient disk-based one.) If keys are distributed randomly, your process has a gigantic working set that it is accessing randomly. If all of that memory is not actually in memory, and does not stay there, you're in trouble.
If you are making updates to a large file, consider sorting the updates-delta file before processing it, so that all occurrences of the same key will be adjacent. You can now write your COBOL program to take advantage of this (and, of course, to abend if an out-of-sequence record is ever detected!). If the key in "this" record is identical to the key of the "previous" one, then you do not need to re-read the record. (And, you do not actually need to write the old record, until the key does change.) As the indexed-file access method is presented with the succession of keys, each key is likely to be "close to" the one previously-requested, such that some of the necessary index-tree pages will already be in-memory. Obviously, you will need to benchmark this, but the amount of time spent sorting the file can be far less than the amount of time spent in index-lookups. (Which actually can be considerable.)
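A small Python sketch of that grouping idea (not COBOL; the record is modeled as a dict, and read_record/rewrite_record are hypothetical stand-ins for the keyed READ/REWRITE calls): because the delta file is sorted, all updates for a key arrive together, so each distinct key costs one read and one rewrite.

```python
from itertools import groupby
from operator import itemgetter

def apply_sorted_deltas(deltas, read_record, rewrite_record):
    """deltas: (key, update_dict) pairs already sorted by key.
    read_record / rewrite_record stand in for the keyed READ / REWRITE calls."""
    previous_key = None
    for key, group in groupby(deltas, key=itemgetter(0)):
        if previous_key is not None and key < previous_key:
            raise ValueError("delta file out of sequence")   # the "abend" check
        record = read_record(key)                # one keyed read per distinct key
        for _, update in group:                  # all updates for this key are adjacent
            record.update(update)
        rewrite_record(key, record)              # one rewrite per distinct key
        previous_key = key
```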
Mike's answer raises the important issue of "hidden I/O" (which depends on the machine, the configuration, and the amount of data)...
If you are likely to update many records, the option Mike suggests is the most useful one.
If you are likely to update only a few records (I'd guess below 2%), another approach can be quite a bit faster (it needs a benchmark!); a sketch follows the list below:
read every key via indexed VSAM search
store the changed record in memory (a big OCCURS table); if you will only change some values and the record is quite big, then store just the possibly changed values + key in the table, without an actual REWRITE
before doing a VSAM search: look in your OCCURS table to see whether you have read the key already, and take the values either from there or via a new read
...
at program end: go through your OCCURS table and REWRITE all records (if you have the complete record a REWRITE is enough, otherwise you'd need a READ first to get the complete record)
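A rough Python sketch of that caching pattern (the OCCURS table becomes a dict; vsam_read/vsam_rewrite are hypothetical stand-ins for the keyed VSAM calls):

```python
# The changed-record cache stands in for the big OCCURS table, keyed by record key.
changed = {}

def read_record(key, vsam_read):
    """Check the cache before issuing another keyed read."""
    if key in changed:
        return changed[key]
    return vsam_read(key)                 # indexed VSAM search only when not cached

def update_record(key, record):
    """Remember the changed record instead of issuing a REWRITE immediately."""
    changed[key] = record

def flush_changes(vsam_rewrite):
    """At program end, write every changed record back in a single pass."""
    for key, record in changed.items():
        vsam_rewrite(key, record)
```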
Performance often comes down to: "know your data and the likely program flow, then try the best 2-3 approaches, benchmark, and decide".

are writes always faster than reads in Cassandra?

I was listening to this talk on data modelling in Cassandra. The speaker makes the general statement that 'writes are faster than reads in Cassandra'.
Is this always the case? If so, why?
That's still true, even though the difference is not as big as it used to be. A write generally performs better because it involves less I/O: a write operation is complete when the data has been written both to the commit log (a file) and to memory (the memtable). When the memtable reaches its maximum size, the whole table is flushed to an on-disk SSTable. A read, by contrast, may require more I/O for several reasons. A read operation first consults a bloom filter (a filter associated with each SSTable that can save I/O time by saying that the data is definitely not present in that SSTable) and then, if the filter returns a positive result, Cassandra starts seeking through the SSTable to look for the data. HTH, Carlo
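A toy Python model of why that asymmetry exists (grossly simplified and not Cassandra's actual code; the class and its limits are invented): a write is a sequential log append plus an in-memory insert, while a read may have to check the memtable, consult per-SSTable filters, and search one or more SSTables.

```python
class ToyLSMStore:
    """Grossly simplified model of an LSM write/read path (not Cassandra's code)."""

    def __init__(self, memtable_limit=4):
        self.commit_log = []          # stands in for the sequential on-disk log
        self.memtable = {}            # a sorted in-memory structure in real systems
        self.sstables = []            # list of (filter, data_dict) pairs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Write path: sequential log append + memory insert, no random I/O.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush the whole memtable to an immutable "SSTable" with a crude filter.
        bloom = set(self.memtable)    # a real bloom filter is probabilistic and smaller
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        # Read path: memtable first, then newest-to-oldest SSTables,
        # skipping any whose filter says the key is definitely absent.
        if key in self.memtable:
            return self.memtable[key]
        for bloom, data in reversed(self.sstables):
            if key in bloom and key in data:
                return data[key]
        return None
```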

hbase - If the rowkey is designed to be very long (e.g. 200 characters) but is helpful for scans and filters, is there any harm in the long rowkey design?

If the rowkey is designed to be very long (e.g. 200 characters) but is helpful for scans and filters, is there any harm in the long rowkey design?
I would say: don't make rowkeys too long. Even though long rowkeys may seem tempting for scanning based on some filters, they will take up more heap space than warranted. Store files in HBase are LSM trees. To speed up random access within the store files, an index is stored for the Data Blocks and Meta Blocks, which contains the first key of each block along with other information. Added up over lots of blocks, this can take up a big chunk of RAM.
Check the total size of the store file index for your HFiles and see whether this is problematic in your case. If long keys are unavoidable, some options are to increase the block size and enable compression.
Also look at https://issues.apache.org/jira/browse/HBASE-3551 for some interesting reading.
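A back-of-the-envelope way to gauge the effect (the constants and per-entry overhead below are made up for illustration): the block index holds roughly one entry per block, each containing the block's first key, so the index's heap footprint grows linearly with key length.

```python
def block_index_estimate(store_file_gb, block_size_kb, key_len_bytes, overhead_bytes=50):
    """Rough estimate (invented constants) of block-index heap in MB:
    one entry per block, each holding the block's first key plus fixed overhead."""
    blocks = (store_file_gb * 1024 * 1024) / block_size_kb
    return blocks * (key_len_bytes + overhead_bytes) / (1024 * 1024)

# 1 TB of store files with 64 KB blocks is ~16M blocks.
print(block_index_estimate(1024, 64, 20))    # 20-byte keys  -> roughly 1 GB of index
print(block_index_estimate(1024, 64, 200))   # 200-byte keys -> roughly 4 GB of index
```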
There are two choices for rowkey design: tall-narrow and flat-wide. You can choose between them according to your business needs. There is no harm in a long rowkey design.

What are the differences between 'shrink space compact' and 'coalesce'?

The Oracle documentation says that, when altering an index, the clauses shrink space compact and coalesce are quite similar and can be substituted for each other, but Tom found some differences in their behavior.
Since coalesce is not available in the Standard Edition of Oracle Database, I suppose there are some benefits to using it.
So, what are the differences? Can I perform shrink space compact on a dynamically changing index?
The above answer is false. There are basically 4 options.
1 - ALTER INDEX COALESCE
2 - ALTER INDEX SHRINK SPACE
3 - ALTER INDEX SHRINK SPACE COMPACT
4 - ALTER INDEX REBUILD
Options 1 and 3 do NOT free up blocks. They just free up space in existing blocks. Coalesce does a slightly worse job: it leaves more blocks with only 25-50% free space, while shrink space compact leaves more blocks with 75-100% free space. The total number of blocks, however, stays the same. For example, after randomly deleting 1/5 of the rows from an index with 200 blocks, a coalesce will leave roughly 1/5 of the index blocks with 25-50% free space while the rest remain full.
On the other hand, shrink space and rebuild do free up blocks and merge their contents into existing ones, thus reducing the total number of blocks. I think the only difference is speed. When you delete only 5% of a large table, there's no reason to rebuild the entire index, and doing so would be very slow; shrink space may be a bit faster here because it does not rebuild the entire index, it just reorganizes the blocks.
Obviously, the fastest choices would be coalesce or shrink space with the compact option.
First of all, indexes generally do not need to be frequently rebuilt. They generally grow to a steady size and stay there, and rebuilding them produces only a temporary benefit to queries that is then counterbalanced by increased load in modifying them due to an increased rate of block splits. So don't forget that the best optimisation for a process is to eliminate it completely -- if you think you have a need for frequent rebuilds then post a question and maybe the cause can be explained and a different approach be found.
Anyway, coalesce reduces the number of blocks that are holding index data, thus freeing up blocks completely so that they can be re-used for new index entries. The freed blocks are still allocated to the index, though. This can prevent indexes from growing too large.
Shrink does something similar but moves the populated blocks to allow freed blocks at the "end" of the index segment to be deallocated from it. Thus the index segment actually gets smaller. This requires an exclusive lock on the table.

Fastest way to remove duplicate lines in very large .txt files

What is the best way to remove duplicate lines from large .txt files, e.g. 1 GB and larger?
Because removing adjacent duplicates is simple, we can reduce this problem to just sorting the file.
Assume that we can't load the whole data set into RAM because of its size.
Right now I'm waiting to retrieve all records from a SQL table with one unique index field (I loaded the file lines into the table earlier), and I'm wondering whether there is a way to speed it up.
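A rough Python sketch of the sort-then-deduplicate approach described above (an external merge sort over fixed-size runs; the chunk size and helper names are arbitrary):

```python
import heapq
import itertools
import tempfile

def dedupe_large_file(src, dst, chunk_lines=1_000_000):
    """Sort memory-sized chunks into temporary runs, then merge the sorted runs,
    writing each line only when it differs from the previous one."""
    runs = []
    with open(src) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)

    with open(dst, "w") as out:
        previous = None
        for line in heapq.merge(*runs):          # streaming k-way merge of sorted runs
            if line != previous:                 # duplicates are now adjacent
                out.write(line)
                previous = line

    for run in runs:
        run.close()
```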
You could try a bloom filter. While you may get some false positives (though you can get arbitrarily close to 0% at the cost of more processing) it should be pretty fast as you don't need to compare or even do a log(n) search for each line you see.
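A minimal bloom-filter sketch for this use case (the bit-array size and hash count are arbitrary, and note that a false positive here means a unique line is mistakenly treated as a duplicate and dropped):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k bit positions derived from one SHA-256 digest."""

    def __init__(self, size_bits=64 * 1024 * 1024, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)    # 64 Mbit is about 8 MB of RAM

    def _positions(self, line):
        digest = hashlib.sha256(line.encode()).digest()
        for i in range(self.num_hashes):         # 7 x 4 bytes fits in the 32-byte digest
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, line):
        for pos in self._positions(line):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, line):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(line))

def dedupe_with_bloom(src, dst):
    seen = BloomFilter()
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if not seen.might_contain(line):     # a false positive drops a unique line
                seen.add(line)
                fout.write(line)
```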
