HBase - hotspotting check - hadoop

I am using HBase, and I suspect that my rowkey design has caused hotspotting. Before trying to salt the rowkey, I would like to check whether hotspotting has already occurred. Is there any way in HBase to analyze the data distribution across region servers to check whether hotspotting has occurred?
Thanks,
Partha

You can use the HMaster Info Web UI to detect this.
It should be http://master-address:16010 by default.
If it's not available, check whether the UI has been disabled in the configuration (hbase-site.xml) and make sure that hbase.master.info.port is not set to -1.
Once you are on it, click on the table that you want to check.
You will land on a page like this:
https://docs.prediction.io/images/cloudformation/hbase-32538c47.png
Then, if you see that one region server has a lot more regions than the others, that is a good hint that this region server is probably hotspotted.
It means that the regions in this part of the rowkey space are being split more often! The requests per second can also be an indicator, but in my experience it's not always very accurate.
These are just hints, though. The only simple, reliable way I know to be sure that a hotspot is occurring is to benchmark it, because when it happens the write performance is really, REALLY different. So measure the throughput you get with a hashed rowkey on the same data, then compare. You'll see very quickly whether there is a hotspot.
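If you want to run that comparison, below is a minimal sketch of such a benchmark in Python. It assumes the HBase Thrift gateway is reachable and the happybase client is installed; the table names, column family, key format and row counts are all made up for illustration, and for a meaningful comparison you would pre-split both tables across your region servers first.

import hashlib
import time

import happybase

connection = happybase.Connection('master-address')  # assumed Thrift gateway host

def salted_key(original_key, buckets=16):
    # Prefix the rowkey with a hash-derived bucket to spread writes across regions.
    bucket = int(hashlib.md5(original_key.encode()).hexdigest(), 16) % buckets
    return ('%02d-%s' % (bucket, original_key)).encode()

def bench(table_name, make_key, rows=100000):
    # Write `rows` puts with monotonically increasing keys and return puts/second.
    if table_name.encode() not in connection.tables():
        connection.create_table(table_name, {'cf': dict()})
    table = connection.table(table_name)
    start = time.time()
    with table.batch(batch_size=1000) as batch:
        for i in range(rows):
            key = 'event-%010d' % i
            batch.put(make_key(key), {b'cf:value': b'x'})
    return rows / (time.time() - start)

print('sequential rowkeys:', bench('bench_plain', lambda k: k.encode()), 'puts/s')
print('hashed rowkeys    :', bench('bench_salted', salted_key), 'puts/s')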

Related

How do I get Sphinx stats?

I am new to Sphinx and want to make it index a 2 million row table (~1.5GB in size). I will use plain indexes.
At the moment, I don't know how much memory I should put in the mem_limit config. My idea is that I could simply keep the default and then see how many results are being swapped (stay on disk) or expired (i.e., how often frequently-used results in memory go to disk).
I'm not sure yet exactly how Sphinx works, anyway, but this is my understanding for now. However, how can I see stats like these, just like we can see the STATS for Memcached?
Having some kind of stats would definitely help me know how to better tune Sphinx for my application.
In case it's relevant, I use MariaDB and PHP on CentOS.
In case it's not clear, mem_limit is ONLY used by the indexer program, i.e., while creating the index.
Frankly, the setting isn't all that critical. Just set it as high as your available memory allows.
It's not applicable to searchd, which is what actually answers queries.
There is a 'SHOW STATUS' command, but it doesn't really have anything about memory:
http://sphinxsearch.com/docs/current.html#sphinxql-show-status
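Since searchd's SphinxQL listener speaks the MySQL wire protocol (port 9306 by default), you can run SHOW STATUS from any MySQL client. Here is a quick sketch in Python, assuming the pymysql package and the default listener; the host and port are assumptions to adapt to your setup.

import pymysql

# Connect to searchd's SphinxQL listener (e.g., listen = 9306:mysql41 in sphinx.conf).
conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='')
try:
    with conn.cursor() as cur:
        cur.execute('SHOW STATUS')
        for counter, value in cur.fetchall():
            print('%-25s %s' % (counter, value))
finally:
    conn.close()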
Memory usage (and there are no variables to control it!) can be obtained from general OS commands. On Linux, for example, possibly something like memstat.

Monitor Hadoop Cluster using Collectl

I am evaluating various system monitoring tools to use one to monitor my hadoop cluster.
One of the tools I am impressed by is collectl, and I have been playing around with it for a couple of days.
I am struggling to figure out how to aggregate the metrics captured by collectl when using colmux.
Say I have 10 nodes in my hadoop cluster, each running collectl as a service. Using colmux I can see the performance metrics of each node in a single view (in single- and multi-line formats). Great!
But what if I want the aggregate of CPU, IO, etc. across all the nodes in the cluster? That is, I want to see how my cluster as a whole is performing by aggregating the per-node performance metrics into corresponding cluster-level numbers.
Any help is greatly appreciated. Thanks!
I had already answered this on the mailing list but for the benefit of those not on it I'll repeat myself here..
That's a cool idea. So if I understand you correctly you might see some sort of total line at the bottom? I can always add to my wish list but no promises. But I think I may also have a solution if you don't mind doing a little extra work on your own ;) btw - can I assume you've installed readkey so you can change sort columns with the arrow keys?
If you run colmux with --noesc, it will take it out of full-screen mode and simply print everything as scrolling output. If you then also include "--lines 99999" (or some big number), it will print all the output from all the remote systems so you don't miss anything. Finally, you can pipe the output through perl, python, bash, or whatever your favorite scripting tool might be and do the totals yourself. Then, whenever you see a new header fly by, print the totals and reset the counters to 0. You could even add timestamps and maybe ultimately make it your own open-source project. I bet others would find it useful too.
-mark
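A rough sketch of that piping approach in Python follows. The header detection and the column layout are assumptions (they depend on which subsystems you run colmux with), so adjust the parsing to your own output.

import sys
import time

# Read "colmux --noesc --lines 99999" output on stdin, sum the numeric columns
# of each block of data lines, and print cluster-wide totals when a header goes by.
# e.g.  colmux -addr node[1-10] ... --noesc --lines 99999 | python totals.py

def is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False

totals = None
for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    if fields[0].startswith('#') or not is_number(fields[-1]):
        # Header or other non-data line: flush the running totals and reset.
        if totals:
            print(time.strftime('%H:%M:%S'), 'cluster totals:',
                  ' '.join('%g' % t for t in totals))
        totals = None
        continue
    # Data line: assume the first field is the host name, the rest are metrics.
    values = [float(v) for v in fields[1:] if is_number(v)]
    if totals is None:
        totals = [0.0] * len(values)
    totals = [t + v for t, v in zip(totals, values)]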

Tabular Model Not Caching Results

Can anyone point me in the direction of how to troubleshoot why a Tabular model that I have built does not seem to cache query results?
It is my understanding that MDX queries against a Tabular model will be cached; however, with our model they never seem to be, and I can't figure out why.
My best guess is that it's memory pressure and the system is clearing down the RAM, but even that is a guess.
Are there any counters, DMVs, or other perfmon stats, etc., that I can use to actually see what is going on and check?
Thanks.
Plenty of places to look, but I'd recommend starting with a Profiler/xEvent trace. Below is an example of 2 runs of the same MDX query.
The first run is on a cold cache...
The second run is on a warm cache, and you can see that it is resolving the query from cache...
This is much easier to see if you can isolate the query on a non-production server (e.g., a test/dev environment). There are quite a few reasons why a particular query may not be taking advantage of the cache... but you first need to confirm that it is not using the cache.

How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long Question for those willing to know the context:
I am working on a huge set of data like
{
_id:ObjectId("azertyuiopqsdfghjkl"),
stringdate:"2008-03-08 06:36:00"
}
and some other fields; there are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how to make an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, and resigned myself to writing a script that takes the documents one after the other and issues an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
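For concreteness, here is a minimal sketch of such a per-document script in Python with pymongo; the connection string, database name, and the filter on the missing date field are assumptions, while the data collection and the stringdate format come from the snippet above.

from datetime import datetime

from pymongo import MongoClient

# Database name and connection string are assumptions; "data" and "stringdate"
# are taken from the example document above.
coll = MongoClient('mongodb://localhost:27017')['mydb']['data']

cursor = coll.find({'date': {'$exists': False}}, {'stringdate': 1})
for doc in cursor:
    real_date = datetime.strptime(doc['stringdate'], '%Y-%m-%d %H:%M:%S')
    coll.update_one({'_id': doc['_id']}, {'$set': {'date': real_date}})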
The problem is that it takes a very long time. I already figured out that if only I had inserted empty date objects when I created the database I would now get better performance, since documents have to be relocated when a new field is added. I also set an index on a relevant field so that I can process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to make sure that the limiting factor is write lock availability and not some other factor like CPU or network cost.
I monitored the whole thing with mongotop, mongostat and the web monitoring interfaces, which confirmed that the write lock is held 70% of the time. I am a bit disappointed that MongoDB does not have finer granularity on its write lock; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection across a dozen shards, even while staying on the same server, because each shard would have had its own lock.
But since I can't change the current database structure right now, I looked for ways to improve performance so that I at least spend 90% of my time writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every time I make an update a getLastError() call is also issued afterwards, and I don't want it: there is a 99.99% chance of success, and even in case of failure I can still run an aggregation query after the big process finishes to retrieve the individual exceptions.
I don't think I would gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestions?
I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want, but it won't help.
This is because, for one:
writing a script that takes the documents one after the other and issues an update to set a new field whose value is new Date(stringdate).
When the shell is used in a non-interactive mode, such as within a loop, it doesn't actually call getLastError(). As such, lowering your write concern to 0 will do nothing.
I already figured out that if only I had inserted empty date objects when I created the database I would now get better performance, since documents have to be relocated when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug, but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interfaces, which confirmed that the write lock is held 70% of the time
That's because of all the movement in your documents; that's kind of hard to fix.
I am a bit disappointed that MongoDB does not have finer granularity on its write lock
The write lock doesn't actually determine the concurrency of MongoDB; this is another common misconception that stems from transactional SQL technologies.
Write locks in MongoDB are mutexes, for one.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have gotten yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

Write heavy dml-operations in MongoDB

I'm running MongoDB (2.2) on Linux, and I have a few questions.
I have a schema with many fields and sub-fields, and one index over these fields.
How fast are updates/deletes applied to the index? I have about 3 updates/deletes etc. per second.
Is there a rule, like after 10,000 updates you have to compact or rebuild the index?
Are changes in the fields immediately visible in the index? If not, is there a delay or a temporary table for these updates/deletes?
Thanks in advance - Brandon
Indexes are updated at the time of the insert/update/remove. As for performance, the best answer would be to just test it.
Not that I know of. If you need to do regular compaction or repair, you should have replication too (you can run it on the same host if resources permit).
Yes (well, on the same DB connection; on another it might take a bit more time. But if you're having that problem, I'm not the right person to answer you anyway ;)
Having said that, I strongly suggest you take a look at some of the presentations at http://www.10gen.com/presentations - I'm sorry I can't point out the ones that were particularly interesting and usable; I suggest you browse and pick the ones that seem interesting to you.
Note that MongoDB does things VERY differently and has quite a few gotchas for the unprepared. It is however a great DB once you know how to use it.
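Since "just test it" is the practical answer to the performance question, here is a rough sketch of such a micro-benchmark with pymongo; the database, collection, field names and document counts are all made up for illustration.

import random
import time

from pymongo import MongoClient

db = MongoClient('mongodb://localhost:27017')['benchdb']  # assumed connection

def run(coll_name, with_index):
    # Load a test collection, optionally index the field we keep updating,
    # then time updates that are looked up via _id so both runs use the same
    # access path and only the index-maintenance cost differs.
    coll = db[coll_name]
    coll.drop()
    ids = coll.insert_many(
        [{'n': i, 'payload': 'x' * 100} for i in range(50000)]).inserted_ids
    if with_index:
        coll.create_index('n')
    start = time.time()
    for _ in range(5000):
        coll.update_one({'_id': random.choice(ids)},
                        {'$set': {'n': random.randrange(10 ** 9)}})
    return 5000 / (time.time() - start)

print('updates/s without index on n:', run('bench_plain', False))
print('updates/s with index on n   :', run('bench_indexed', True))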
