Hbase concurrency making it slow - hadoop

I have 1 master server and 5 region server and each server has 200 GB disk space and 16 GB RAM on each. I created a table in HBase which has 10 million records. I am using hbase-0.96 version on hadoop 2.
Table Name - sh_self_profiles
column family - profile
In this table, we have 30 columns in each row.
When I get a single column value from HBase, it takes around 10 ms. My problem is when I hit 100 or more concurrent requests the time slowly accumulates and increases to more than 400 ms instead of completing in 10ms only. When 100 requests are hit linearly, each one takes 10 ms only.

One thing that you should check is how well distributed your table is.
You can do this by going to the HBase master web console http://:60010, you will be able to see how many regions you have for your table. If you have not done anything special on table creation you could easily have only one or two regions, which means that all the requests are being directed to a single region server.
If this is the case, you can recreate your table with pre-split regions (I would suggest a multiple of 5, such as 15 or 20), and make sure that the concurrent gets that you are doing are equally spread over the row-key space.
Also, pls check how much RAM you have allocated to the region server - you might need to increase it from the default. If you are not running anything else other than HBase Region Sever on those machines, you could probably increase to 8GB ram.
Other than that, you could also adjust the default for hbase.regionserver.handler.count.
I hope this helps.

Which client are you using? Are you using the standard Java client, the Thrift client, the HTTP REST client, or something else? If your use case is a high amount of random reads of single column values, I highly recommend you try asynchbase as it is much faster than the standard synchronous Java client.

Related

Why queries are getting progressively slower when using postgres with spring batch?

I'm running a job using Spring Batch 4.2.0 with postgres (11.2) as backend. It's all wrapped in a spring boot app. I've 5 steps and each runs using a simple partitioning strategy to divide data by id ranges and reads data into each partition (which are processed by separate threads). I've about 18M rows in the table, each step reads, changes few fields and writes back. Each step reads all 18M rows and writes back. The issue I'm facing is, the queries that run to pull data into each thread scans data by id range like,
select field_1, field_2, field_66 from table where id >= 1 and id < 10000.
In this case each thread processes 10_000 rows at a time. When there's no traffic the query takes less than a second to read all 10,000 rows. But when the job runs there's about 70 threads reading all that data in. It goes progressively slower to almost a minute and a half, any ideas where to start troubleshooting this?
I do see autovacuum running in the backgroun for almost the whole duration of job. It definitely has enough memory to hold all that data in memory (about 6GB max heap). Postgres has sufficient shared_buffers 2GB, max_wal_size 2GB but not sure if that in itself is sufficient. Another thing I see is loads of COMMIT queries hanging around when checking through pg_stat_activity. Usually as much as number of partitions. So, instead of 70 connections being used by 70 partitions there are 140 conections used up with 70 of them running COMMIT. As time progresses these COMMITs get progressively slower too.
You are probably hitting https://github.com/spring-projects/spring-batch/issues/3634.
This issue has been fixed and will be part of version 4.2.3 planned to be released this week.

Creating high throughput Elasticsearch cluster

We are in process of implementing Elasticsearch as a search solution in our organization. For the POC we implemented a 3-Node cluster ( each node with 16 VCores and 60 GB RAM and 6 * 375GB SSDs) with all the nodes acting as master, data and co-ordination node. As it was a POC indexing speeds were not a consideration we were just trying to see if it will work or not.
Note : We did try to index 20 million documents on our POC cluster and it took about 23-24 hours to do that which is pushing us to take time and design the production cluster with proper sizing and settings.
Now we are trying to implement a production cluster (in Google Cloud Platform) with emphasis on both indexing speed and search speed.
Our use case is as follows :
We will bulk index 7 million to 20 million documents per index ( we have 1 index for each client and there will be only one cluster). This bulk index is a weekly process i.e. we'll index all data once and will query it for whole week before refreshing it.We are aiming for a 0.5 million document per second indexing throughput.
We are also looking for a strategy to horizontally scale when we add more clients. I have mentioned the strategy in subsequent sections.
Our data model has nested document structure and lot of queries on nested documents which according to me are CPU, Memory and IO intensive. We are aiming for sub second query times for 95th percentile of queries.
I have done quite a bit of reading around this forum and other blogs where companies have high performing Elasticsearch clusters running successfully.
Following are my learnings :
Have dedicated master nodes (always odd number to avoid split-brain). These machines can be medium sized ( 16 vCores and 60 Gigs ram) .
Give 50% of RAM to ES Heap with an exception of not exceeding heap size above 31 GB to avoid 32 bit pointers. We are planning to set it to 28GB on each node.
Data nodes are the workhorses of the cluster hence have to be high on CPUs, RAM and IO. We are planning to have (64 VCores, 240 Gb RAM and 6 * 375 GB SSDs).
Have co-ordination nodes as well to take bulk index and search requests.
Now we are planning to begin with following configuration:
3 Masters - 16Vcores, 60GB RAM and 1 X 375 GB SSD
3 Cordinators - 64Vcores, 60GB RAM and 1 X 375 GB SSD (Compute Intensive machines)
6 Data Nodes - 64 VCores, 240 Gb RAM and 6 * 375 GB SSDs
We have a plan to adding 1 Data Node for each incoming client.
Now since hardware is out of windows, lets focus on indexing strategy.
Few best practices that I've collated are as follows :
Lower number of shards per node is good of most number of scenarios, but have good data distribution across all the nodes for a load balanced situation. Since we are planning to have 6 data nodes to start with, I'm inclined to have 6 shards for the first client to utilize the cluster fully.
Have 1 replication to survive loss of nodes.
Next is bulk indexing process. We have a full fledged spark installation and are going to use elasticsearch-hadoop connector to push data from Spark to our cluster.
During indexing we set the refresh_interval to 1m to make sure that there are less frequent refreshes.
We are using 100 parallel Spark tasks which each task sending 2MB data for bulk request. So at a time there is 2 * 100 = 200 MB of bulk requests which I believe is well within what ES can handle. We can definitely alter these settings based on feedback or trial and error.
I've read more about setting cache percentage, thread pool size and queue size settings, but we are planning to keep them to smart defaults for beginning.
We are open to use both Concurrent CMS or G1GC algorithms for GC but would need advice on this. I've read pros and cons for using both and in dilemma in which one to use.
Now to my actual questions :
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
We will be sending query requests via coordinator nodes. Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
Say a search request come up to coordinator node it blocks 1 thread there and send request to data nodes which in turn blocks threads on data nodes as per where query data is lying. Is this assumption correct ?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
References
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
We did try to index 20 million documents on our POC cluster and it took about 23-24 hours
That is surprisingly little — like less than 250 docs/s. I think my 8GB RAM laptop can insert 13 million docs in 2h. Either you have very complex documents, some bad settings, or your bottleneck is on the ingestion side.
About your nodes: I think you could easily get away with less memory on the master nodes (like 32GB should be plenty). Also the memory on data nodes is pretty high; I'd normally expect heap in relation to the rest of the memory to be 1:1 or for lots of "hot" data maybe 1:3. Not sure you'll get the most out of that 1:7.5 ratio.
CMS vs G1GC: If you have a current Elasticsearch and Java version, both are an option, otherwise CMS. You're generally trading throughput for (GC) latency, so if you benchmark be sure to have a long enough timeframe to properly hit GC phases and run as close to production queries in parallel as possible.
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
I'd say the coordinator is fine. Unless you use a custom routing key and the bulk only contains data for that specific data node, 5/6th of the documents would need to be forwarded to other data nodes anyway (if you have 6 data nodes). And you can offload the bulk processing and coordination handling to non data nodes.
However, overall it might make more sense to have 3 additional data nodes and skip the dedicated coordinating node. Though this is something you can only say for certain by benchmarking your specific scenario.
Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
I'm not sure I understand the question. But have you looked into https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster, which might shed some more light on this topic?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
While there are different queues for different query operations, there is otherwise no clear separation of tasks (like "only use 20% of the resources for indexing). Maybe go a little more conservative on the parallel bulk requests to avoid overloading the node.
If you are not reading from an index while it's being indexed (ideally you flip an alias once done): You might want to disable the refresh rate entirely and let Elasticsearch create segments as needed, but do a force refresh and change the setting once done. Also you could try running with 0 replicas while indexing, change replicas to 1 once done, and then wait for it to finish — though I'd benchmark if this is helping overall and if it's worth the added complexity.

HBase request per second is zero

I just created a table in HBase and filled it with data. From the 7 regionservers it appears the data was written to region servers 6 and 7.
But I dont understand why the requests per second is zero for servers 6 and 7?
Read request count and Write request count are the total number of read and write requests seen by a particular region server since its restart. These numbers are kept in memory only for performance reasons and exposed via JMX and regionserver load API that the HBase UI uses to expose them. You could fetch them yourself using the API (or JMX) and export to a DB for persistence.
Request per second is the rate of total requests (read+write) that the regionserver in question is seeing right now. The rate is calculated based on the delta of the number of requests seen by that regionserver during a period divided by the length of the period. This particular detail (and this period) can differ based on HBase versions. In HBase 2.x, it is controlled by hbase.regionserver.metrics.period; while in previous versions there was no such setting and the period was fixed (if I remember correctly).
To answer your question, the comparison of rate of total requests and the count of total requests is not apples-to-apples. The rate only reflects current traffic while the count reflects lifetime number of requests since region server's restart. If you really think about it, it does not make sense to have a rate of requests for lifetime, because any real use case is concerned with the current rate only.
If you bulk-filled the tables via put(List<Put>), there would only have been a very small number of requests, as records are sent in batches.

Correcting improper usage of Cassandra

I have a similar question that was unanswered (but had many comments):
How to make Cassandra fast
My setup:
Ubuntu Server
AWS service - Intel(R) Xeon(R) CPU E5-2680 v2 # 2.80GHz, 4GB Ram.
2 Nodes of Cassandra Datastax Community Edition: (2.1.3).
PHP 5.5.9. With datastax php-driver
I come from a MySQL database knowledge with very basic NoSQL hands on in terms of ElasticSearch (now called Elastic) and MongoDB in terms of Documents storage.
When I read how to use Cassandra, here are the bullets that I understood
It is distributed
You can have replicated rings to distribute data
You need to establish partition keys for maximum efficiency
Rethink your query rather than to use indices
Model according to queries and not data
Deletes are bad
You can only sort starting from the second key of your primary key set
Cassandra has "fast" write
I have a PHP Silex framework API that receive a batch json data and is inserted into 4 tables as a minimum, 6 at maximum (mainly due to different types of sort that I need).
At first I only had two nodes of Cassandra. I ran Apache Bench to test. Then I added a third node, and it barely shaved off a fraction of a second at higher batch size concurrency.
Concurrency Batch size avg. time (ms) - 2 Nodes avg. time (ms) - 3 Nodes
1 5 288 180
1 50 421 302
1 400 1 298 1 504
25 5 1 993 2 111
25 50 3 636 3 466
25 400 32 208 21 032
100 5 5 115 5 167
100 50 11 776 10 675
100 400 61 892 60 454
A batch size is the number of entries (to the 4-6 tables) it is making per call.
So batch of 5, means it is making 5x (4-6) table insert worth of data. At higher batch size / concurrency the application times out.
There are 5 columns in a table with relatively small size of data (mostly int with text being no more than approx 10 char long)
My keyspace is the following:
user_data | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
My "main" question is: what did I do wrong? It seems to be this is relatively small data set of that considering that Cassandra was built on BigDataTable at very high write speed.
Do I add more nodes beyond 3 in order to speed up?
Do I change my replication factor and do Quorum / Read / Write and then hunt for a sweet spot from the datastax doc: http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Do I switch framework, go node.js for higher concurrency for example.
Do I rework my tables, as I have no good example as to how effectively use column family? I need some hint for this one.
For the table question:
I'm tracking history of a user. User has an event and is associated to a media id, and there so extra meta data too.
So columns are: event_type, user_id, time, media_id, extra_data.
I need to sort them differently therefore I made different tables for them (as I understood how Cassandra data modeling should work...I am perhaps wrong). Therefore I'm replicating the different data across various tables.
Help?
EDIT PART HERE
the application also has redis and mysql attached for other CRUD points of interest such as retrieving a user data and caching it for faster pull.
so far on avg with MySQL and then Redis activated, I have a 72ms after Redis kicks in, 180ms on MySQL pre-redis.
The first problem is you're trying to benchmark the whole system, without knowing what any individual component can do. Are you trying to see how fast an individual operation is? or how many operations per second you can do? They're different values.
I typically recommend you start by benchmarking Cassandra. Modern Cassandra can typically do 20-120k operations per second per server. With RF=3, that means somewhere between 5k and 40k reads / second or writes/second. Use cassandra-stress to make sure cassandra is doing what you expect, THEN try to loop in your application and see if it matches. If you slow way down, then you know the application is your bottleneck, and you can start thinking about various improvements (different driver, different language, async requests instead of sync, etc).
Right now, you're doing too much and analyzing too little. Break the problem into smaller pieces. Solve the individual pieces, then put the puzzle together.
Edit: Cassandra 2.1.3 is getting pretty old. It has some serious bugs. Use 2.1.11 or 2.2.3. If you're just starting development, 2.2.3 may be OK (and let's assume you'll actually go to production with 2.2.5 or so). If you're ready to go prod tomorrow, use 2.1.x instead.

Cassandra running out of memory (heap space)

We are experimenting a bit with Cassandra lately (version 1.0.7) and we seem to have some problems with memory. We use EC2 as our test environment and we have three nodes with 3.7G of memory and 1 core # 2.4G, all running Ubuntu server 11.10.
The problem is that the node we hit from our thrift interface dies regularly (approximately after we store 2-2.5G of data). Error message: OutOfMemoryError: Java Heap Space and according to the log it in fact used all of the allocated memory.
The nodes are under relatively constant load and store about 2000-4000 row keys a minute, which are batched through the Trift interface in 10-30 row keys at once (with about 50 columns each). The number of reads is very low with around 1000-2000 a day and only requesting the data of a single row key. The is currently only one used column family.
The initial thought was that something was wrong in the cassandra-env.sh file. So, we specified the variables 'system_memory_in_mb' (3760) and the 'system_cpu_cores' (1) according to our nodes' specification. We also changed the 'MAX_HEAP_SIZE' to 2G and the 'HEAP_NEWSIZE' to 200M (we think the second is related to the Garbage Collection). Unfortunately, that did not solve the issue and the node we hit via thrift keeps on dying regularly.
In case you find this useful, swap is off and unevictable memory seems to be very high on all 3 servers (2.3GB, we usually observe the amount of unevictable memory on other Linux servers of around 0-16KB) (We are not quite sure how the unevictable memory ties into Cassandra, its just something we observed while looking into the problem). The CPU is pretty much idle the entire time. The heap memory is clearly being reduced once in a while according to nodetool, but obviously grows over the limit as time goes by.
Any ideas? Thanks in advance.
cassandra-env.sh defaults are perfect for almost all workloads, so until you know why this is happening best to put them back to their defaults or you may be making things worse without realizing.
I see concurrent reads and writes of 2k/sec/node on our cluster, so 2k-4k writes per minute is very little, although the fact that it's only the node accepting your connections that is dying is a little strange.
If you connect your app to the thrift endpoint on one of the other nodes is it then that one that dies?
Client connections use memory so might be worth double checking you're not connecting too many at a time. "netstat -A inet | grep 9160" on the dying cassandra node should tell you how many client connections you have. Depending heavily on your application you'd expect 10s or 100s rather than 1000s.
What do the writes look like?
Are you writing the same row keys repeatedly and if so are you appending new column names or overwriting the same ones?
How big is each write? Anything else you can tell me?
If you're overwriting the same column names in the same row keys constantly compaction may be struggling.
If you're appending new column names to the same row keys constantly you might be growing your rows too large to fit into memory.
the output of "nodetool -h localhost tpstats" on the dying node might also give some clues as to where you're falling down. Anything constantly pending is probably bad news, especially at such a low write rate.
If you're going to use cassandra in production you should get graphing of the internals to better understand what's going on. jmxtrans and graphite should be your new best friends.
There are some things you can try tweaking. First make sure you dont have row caching on your column family. Also worth while checking the log for errors and tpstats incase something died due to an error and something is getting backed up in a queue. The stack trace of the exception could be meaningful too since there are actually different types of OOMs that could just mean kernel tweaks.
If your just using too much memory per node then you want for the size of your data set try checking the cfstats, you can identify roughly how much space is spent on bloom filters. As you have more rows in a CF this can get linearly larger and is part of the base minimum memory your nodes are going to require.
nodetool cfstats | grep Bloom.*Used | awk '{ SUM += $5} END { print SUM " bytes" }'
Since you dont read very often you can probably increase the false positive rate on them. Each SSTable has a bloom filter it uses to check if a row exists in it or not. You can change with cqlsh
ALTER TABLE MyColumnFamily WITH bloom_filter_fp_chance = 0.1;
After that call an upgrade on that CF (this will be slow) per node
nodetool upgradesstables MyKeyspace MyColumnFamily
There are consequences to this where reads may take longer since there is a 10%-ish (the .1) chance it will check SSTables for rows that dont exist in it, resulting in extra disk seeks.
Another major memory sink if you have column families with large amount of rows is the sampling rate of the index. This can be modified per node level in the cassandra.yaml
http://www.datastax.com/docs/1.1/configuration/node_configuration#index-interval
If you have it set up to take heap dumps on OOM (-XX:+HeapDumpOnOutOfMemoryError on by default I believe) there should be some heap dumps available in the /var/lib/cassandra/data directory. You can open these up in visualvm or whatever tool you like to identify what part of the heaps is where.

Resources