What is the default value of 'hbase.client.scanner.caching' in HBase 0.90.6 specifically? Is it 1 or 100?
Pretty sure it's 100. The book says so, and 1 would be an odd default for a value defined like this:
Number of rows that will be fetched when calling next on a scanner if it is not served from (local, client) memory. Higher caching values will enable faster scanners but will eat up more memory and some calls of next may take longer and longer times when the cache is empty.
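To make the trade-off in that description concrete, here is a rough back-of-the-envelope sketch. It assumes each fetch returns exactly `caching` rows and ignores the extra call that detects the end of the scan as well as any size-based limits; `scanner_fetches` is a hypothetical helper, not part of HBase.

```python
import math


def scanner_fetches(total_rows: int, caching: int) -> int:
    """Rough number of fetches needed to stream total_rows from a scanner.

    Simplified model: every fetch returns `caching` rows until the rows run out.
    """
    if caching < 1:
        raise ValueError("caching must be at least 1")
    return math.ceil(total_rows / caching)


# Compare a few caching values for a scan over 10,000 rows.
for caching in (1, 100, 1000):
    print(caching, scanner_fetches(10_000, caching))
```

Fewer fetches means fewer round trips, but each fetch holds up to `caching` rows in client memory at once, which is the cost the description warns about.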
I am writing a simple big data application on top of my existing data in HBase. Sometimes I feel that a Scan could work faster than a single Get, so I want to experiment by converting my Get commands into equivalent Scans.
So, given the row keys below, suppose I want the equivalent of Get(12):
row keys
12
123
21
22
What do I need to set as the start row and stop row of my scan, or is there another parameter I should configure on the scan?
If the scan direction is the default (i.e. not reversed), what normally works for me is to set the start row to 12 and the stop row to 12x, where x is a trailing character that you know doesn't occur in your row key space and that sorts lexicographically after every character that can occur in your row keys. For example, I usually use '~' as the trailing symbol, but something else might work better for you.
Also, Scan has the .setLimit(int) method, which can limit your scan to just 1 row. You can use both together. However, I'm not sure why this should work faster than a Get.
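Here is a small sketch of how those ranges behave on the keys from the question. It is plain Python over sorted byte strings, not the HBase client, and it assumes the usual inclusive start row and exclusive stop row; `scan` is a hypothetical helper. The `b"\x00"` suffix in the last case is a common convention for "the smallest key after 12", shown here as an alternative to the '~' trick.

```python
from bisect import bisect_left


def scan(sorted_keys, start_row, stop_row, limit=None):
    """Return keys in [start_row, stop_row) in sorted order, up to limit."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    rows = sorted_keys[lo:hi]
    return rows if limit is None else rows[:limit]


keys = sorted([b"12", b"123", b"21", b"22"])

# Start "12", stop "12~": a prefix scan, so it also matches "123".
prefix = scan(keys, b"12", b"12~")

# Same range with a limit of 1: only the first matching row comes back.
first = scan(keys, b"12", b"12~", limit=1)

# Stop row "12" + 0x00: nothing can sort between "12" and this stop row.
exact = scan(keys, b"12", b"12" + b"\x00")

print(prefix, first, exact)
```

One thing to watch with the limit approach: if row 12 did not exist, the limited prefix scan would return 123 instead of nothing, so it is only equivalent to a Get when the row is known to be present.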
If your scans really do seem faster than your gets, it may have something to do with the call queue configuration of your cluster. For example, your cluster might be configured to allocate more handlers to scans than to gets. That's not the default behavior, but someone may have configured it that way, and on a very busy cluster that could explain what you are seeing.
In some of my indices, I'm setting "index.blocks.read_only_allow_delete": true through the PUT /index/_settings API call. But after around 10 seconds, the setting disappears and the index is writable again.
I'm wondering if this could be a bug in ES, as in version 6.8 a change was made to reset this setting automatically once a node whose disk had gone over the flood stage is back below the normal thresholds.
I'm seeing this odd behaviour in ES 7.9. What I expected is that, if ES changed the setting to true because of the watermarks, it could reset it to false later, but if an operator changed the setting to true manually, ES would respect that setting.
These are the docs where I read about that behaviour:
Controls the flood stage watermark, which defaults to 95%. Elasticsearch enforces a read-only index block ( index.blocks.read_only_allow_delete ) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark.
Cross-posted here.
I ended up using index.blocks.read_only instead, as this one is not updated by Elasticsearch automatically.
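For reference, a minimal sketch of the request that switch amounts to. It only builds the method, path, and JSON body for the same PUT /index/_settings call mentioned in the question and does not send anything; `read_only_request` and the index name `my-index` are placeholders.

```python
import json


def read_only_request(index_name: str, read_only: bool = True):
    """Build (method, path, body) for toggling index.blocks.read_only."""
    body = json.dumps({"index.blocks.read_only": read_only})
    return "PUT", f"/{index_name}/_settings", body


# "my-index" is a placeholder index name.
method, path, body = read_only_request("my-index")
print(method, path, body)
```

Sending the same body with the value set to false through the same endpoint would be the way to make the index writable again.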
I was wondering how I can configure HBase to store just the first version of each cell. Suppose the following HTable:
row_key    cf1:c1    timestamp
----------------------------------------
1          x         t1
After a put of ("1","cf1:c2",t2), with ColumnDescriptor.DEFAULT_VERSIONS = 2, the table becomes:
row_key    cf1:c1    timestamp
----------------------------------------
1          x         t1
1          x         t2
where t2>t1.
My question is how I can change this so that the first version of a cell is the only version that can be stored and retrieved. In the example above, the only version would be the 't1' one. In other words, I want HBase to ignore inserts for cells that already exist.
I know that setting VERSIONS to 1 for the HTable and putting with a timestamp of Long.MAX_VALUE - System.currentTimeMillis() would solve my problem, but I don't know whether it is the best solution. What are the concerns with changing the timestamp to Long.MAX_VALUE - System.currentTimeMillis()? Does it have any performance issues?
There are two strategies that I can think of:
1. One version + inverted timestamp
Setting VERSIONS to 1 for the HTable and putting with a timestamp of Long.MAX_VALUE - System.currentTimeMillis() will generally work and does not have any major performance issues.
On write:
When multiple versions of the same cell are written to HBase, all of them are written at write time (without any impact on performance). After compaction, only the cell with the highest timestamp survives.
The cell with the highest timestamp in this scheme is the one written by the client with the lowest value for System.currentTimeMillis(). Note that this might not actually be the client that tried to write to the cell first, since the clocks of HBase clients might be out of sync.
On read:
When multiple versions of the same cell are found, pruning occurs at read time. This can happen at any time, since your writes can occur at any time, even after compaction. This has a very slight impact on performance.
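A quick sketch of why the inverted timestamp keeps the earliest write. This is plain Python arithmetic, not HBase code; `LONG_MAX` mirrors Java's Long.MAX_VALUE, and `surviving_version` is a hypothetical stand-in for "the one version kept when VERSIONS is 1", which is assumed to be the cell with the highest timestamp.

```python
LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def inverted_timestamp(client_millis: int) -> int:
    """Cell timestamp used for the put: earlier clock readings give larger values."""
    return LONG_MAX - client_millis


def surviving_version(writes):
    """writes: list of (client_millis, value). Returns the value whose cell
    timestamp is highest, i.e. the one kept when only one version is stored."""
    return max(writes, key=lambda w: inverted_timestamp(w[0]))[1]


# Three writes to the same cell, arriving out of order.
writes = [
    (1_700_000_005_000, "second"),
    (1_700_000_000_000, "first"),
    (1_700_000_009_000, "third"),
]
winner = surviving_version(writes)
print(winner)
```

The winner is decided purely by each client's clock reading, which is why clock skew between clients can change which write counts as "first".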
2. checkAndPut
To get true ordering through atomicity, meaning only the first write to reach the region server will succeed, you can use the checkAndPut operation:
From the docs:
public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException
Atomically checks if a row/family/qualifier value matches the expected
value. If it does, it adds the put. If the passed value is null, the
check is for the lack of column (ie: non-existance)
So by setting value to null your Put will only succeed if the cell did not exist. If your Put succeeded then the return value will be true. This gives true atomicity, but at a write performance cost.
On write:
A row lock is set and a Get is issued internally before existence is checked. Once non-existence is confirmed, the Put is issued. As you can imagine, this has a pretty big performance impact for each write, since each write now also involves a read and a lock.
During compaction nothing needs to happen, because only one Put will ever make it to HBase, and that is always the first Put to reach the region server.
It should be noted that there is no way to batch these kinds of checkAndPut operations by using checkAndMutate, since each Put needs its own check. This means each Put needs to be a separate request, so you will be paying a latency cost as well when writing in batches.
On read:
Only one version will ever make it to HBase, so there is no impact here.
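The semantics described for strategy 2 can be modelled in a few lines. This is an in-memory illustration only, not the HBase client API: `TinyTable` and `check_and_put_if_absent` are hypothetical names, and a single process-wide lock stands in for the row lock.

```python
import threading


class TinyTable:
    """In-memory model of 'put only if the cell does not exist yet'."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()  # stand-in for the row lock

    def check_and_put_if_absent(self, row, column, value):
        # The check and the write happen under one lock, so they are atomic.
        with self._lock:
            if (row, column) in self._cells:
                return False  # cell already exists: the put is rejected
            self._cells[(row, column)] = value
            return True

    def get(self, row, column):
        return self._cells.get((row, column))


table = TinyTable()
first_ok = table.check_and_put_if_absent("1", "cf1:c1", "x")
second_ok = table.check_and_put_if_absent("1", "cf1:c1", "y")
print(first_ok, second_ok, table.get("1", "cf1:c1"))
```

The return value is what tells the caller whether its write was the first one, which corresponds to the boolean returned by checkAndPut.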
Picking between strategies:
If true ordering really matters, or you need to read each row before or after you write to HBase anyway (for example to find out whether your write succeeded), you're better off with strategy 2. In all other cases I'd recommend strategy 1, since its write performance is much better. In that case, just make sure your clients are properly time synced.
You can insert the Put with Long.MAX_VALUE - timestamp and configure the table to store only 1 version (max versions => 1). This way only the first (earliest) Put will be returned by the Scan, because all successive Puts will have a smaller timestamp value.
I'm parsing data from one table and writing it back to another one. The inputs are characteristics, written as text. The output is a boolean field that needs to be updated. For example, a characteristic would be "has 4 wheel drive" and I want to set a boolean has_4weeldrive to true.
I'm going through all the characteristics that belong to a car and set it to true if found, else to null. The filter after the tmap_1 filters the rows for which the attribute is true, and then updates that in a table. I want to do that for all different characteristics (around 10).
If I do it for one characteristic, the job runs fine. As soon as I have more than 1, it only loads 1 record and waits indefinitely. I can of course make 10 jobs and it will run, but then I need to go through all the characteristics 10 times, which doesn't feel right. Is this a locking issue? Is there a better way to do this? The target and source DB is PostgreSQL, if that makes a difference.
Shared connections could cause problems like this.
Also make sure you're committing after each update. Talend uses 1 thread for execution (except in the enterprise version), so multiple shared outputs could cause problems.
Setting the commit to 1 should eliminate the problem.
I would like to know how to change the limit on an item that is cached in memcache 1.7.1.1. This is not the quota size for the node or cluster but rather the limit on a single cached item. This limit, from what I have read, is set to 1 MB.
I have content which is larger than the 1 MB per item limit and I would like to increase it to something higher.
I have read of a configuration parameter to "membase.exe", the "-I" parameter. But I cannot seem to find it.
I really do not want to go down the road of recompiling with my own limit coded in.
Thank you.
This is really only half an answer, but have you considered just using a membase bucket instead? You'll get a 20 MB limit with no config changes necessary.