Is 200K rows/second Clickhouse max performance? - performance

I am testing ClickHouse insert performance and so far I am able to insert over 200K rows/second. To me, this is good. However, I see that system utilization is not very high and wonder if I can push it further.
CH runs on a server with dual xxx 14-core CPUs @ 2.4 GHz (56 vCPUs) and 256 GB of memory. Inserting 1B rows takes 1 hour 10 minutes. During that time I see:
load avg: 23.68, 22.44, 20.32
%Cpu: 2.93 us, 0.54 sy, 0.14 ni, 95.3 id, 0.96 wa, 0.05 hi, 0.09 si, 0 st
clickhouse-serv (%CPU, RES): 134.3%, 25.6g
These numbers are averages from "top" sampled every 5 seconds.
I have observed that clickhouse-server's %CPU usage never goes above 200%, as if there were a hard limit.
CH version: 21.2.2.8
Engine: Buffer (MergeTree) w/ default configuration; w/o Buffer it performs about 10% worse
dataset: in json, 2608 B/row, 150 columns
per insert: 500K rows, which is about 1.2GB
insert by 20 processes with clickhouse-clients from a different server
500K rows/insert and 20 clients give the best performance (I have tried other combinations)
Linux 4.18.x (Red Hat)
Questions:
Is 200K rows/second (or 200% CPU usage) the max per CH server? If not, how can I improve it?
Can I run more than one CH server instance on one machine? Would that be practical and give better performance?
If there is no such limit on the clickhouse-server side (and I am not doing something wrong), I am checking whether anything else could be imposing such a limit on the application (clickhouse-server).
Thanks in advance.

dataset: in json, 2608 B/row, 150 columns
insert by 20 processes with clickhouse-clients from a different server
In this case clickhouse-client parses the JSON, so CPU utilization is probably at 100% on the other server. You need more inserting nodes to parse the JSON.
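The parsing bottleneck is easy to reproduce outside ClickHouse: JSON decoding is pure CPU work, so a single client process caps out at roughly one core. A minimal Python sketch (the row shape is a hypothetical stand-in for the real 150-column, ~2.6 KB rows) of how spreading the parse across processes, like spreading it across inserting nodes, scales it:

```python
import json
import multiprocessing as mp

def parse_chunk(lines):
    # JSON decoding is CPU-bound -- this is what clickhouse-client
    # spends its time on before any bytes reach the server.
    return [json.loads(line) for line in lines]

if __name__ == "__main__":
    # Hypothetical stand-in for the real 150-column rows.
    row = json.dumps({"col%d" % i: i for i in range(150)})
    lines = [row] * 20_000

    # One process = roughly one core's worth of parse throughput.
    single = parse_chunk(lines)

    # N processes parse N chunks concurrently, like N inserting clients.
    n = mp.cpu_count()
    chunk = len(lines) // n
    chunks = [lines[i * chunk:(i + 1) * chunk] for i in range(n)]
    with mp.Pool(n) as pool:
        parallel = pool.map(parse_chunk, chunks)

    assert len(single) == 20_000
    assert sum(len(c) for c in parallel) == chunk * n
```

If the inserting machine's cores are saturated by parsing, adding client machines (or switching to a cheaper input format) raises throughput without touching the server.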

Related

How to speed up big query in ClickHouse?

Background:
I submitted a local query in ClickHouse (without using cache), and it processed 414.43 million rows, 42.80 GB.
The query lasted 100+ seconds.
My ClickHouse instance was installed on an AWS c5.9xlarge EC2 instance with a 12T st1 EBS volume.
During this query, the IOPS goes up to 500 and the read throughput up to 20 MB/s.
As a comparison, st1 EBS max IOPS is 500 and max throughput is 500 MB/s.
Here is my question:
Does 500 IOPS actually limit my query (file-reading) speed? (Never mind the cache.) Should I change the EBS volume type to gp2 or io1 to increase IOPS?
Is there any setting that can improve throughput at the same IOPS? (As far as I can see, it's actually far from the ceiling.)
I tried increasing 'max_block_size' to read more of the file at a time, but it doesn't seem to help.
How can I extend the cache time? A big query took minutes; the cache took seconds. But the cache doesn't seem to last very long.
How can I warm up columns to meet all requirements? Please show SQL examples.
Does 500 IOPS actually limit my query (file-reading) speed?
yes
Should I change EBS volume type to gp2 or io1 to increase IOPS?
yes
Is there any setting that can improve throughput at the same IOPS?
tune max_bytes_to_read
reduce number of columns (in select)
reduce number of parts (in select)
How to extend the cache time?
min_merge_bytes_to_use_direct_io=1
How can I warm up columns to meet all requirements? Please show SQL examples.
select a,b,c,d from T Format Null
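A back-of-the-envelope check with the numbers from the question shows why the answer to the first question is "yes" — at 500 IOPS the volume would need 1 MB reads to reach its 500 MB/s bandwidth cap, but the observed reads average only ~40 KB:

```python
# Figures from the question above.
iops_limit = 500        # st1 max IOPS
observed_mb_s = 20      # observed read throughput, MB/s
bandwidth_mb_s = 500    # st1 max throughput, MB/s

# Average read size actually achieved per I/O.
avg_read_kb = observed_mb_s * 1024 / iops_limit
print(f"average read size: {avg_read_kb:.0f} KB")

# Read size needed per I/O to saturate the bandwidth cap at 500 IOPS.
needed_kb = bandwidth_mb_s * 1024 / iops_limit
print(f"read size needed to hit bandwidth cap: {needed_kb:.0f} KB")
```

With reads this small, the IOPS ceiling, not bandwidth, is the binding limit — which is why the suggestions above (fewer columns, fewer parts) all amount to making each I/O count for more.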

What will be specific benchmark of elasticsearch reads, writes and updates?

We are using Elasticsearch (version 5.6.0) for data updates of around 13M documents; each document has a nested structure with at most 100 key-value pairs, and it takes around 34 minutes to update 99 indices. The hardware is as follows:
5 M4-4x large machines (32G RAM and 8 cores)
500GB disk
So, what is the ideal update time Elasticsearch should take for this update?
What optimizations can I do to get good performance?
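As a baseline for "what should the update time be", it helps to turn the figures above into a rate (a quick sketch, using only the numbers stated in the question):

```python
# Figures from the question above.
docs = 13_000_000
minutes = 34
nodes = 5

docs_per_sec = docs / (minutes * 60)
print(f"{docs_per_sec:,.0f} updates/sec across the cluster")

per_node = docs_per_sec / nodes
print(f"{per_node:,.0f} updates/sec per node")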

How to increase greenplum concurrency and # query per sec

We have a fairly big Greenplum v4.3 cluster. 18 hosts, each host has 3 segment nodes. Each host has approx 40 cores and 60G memory.
The table we have is 30 columns wide and has 0.1 billion rows. The query we are testing has a 3-10 sec response time when there is no concurrency pressure. As we increase the number of queries we fire in parallel, the latency increases from an average of 3 secs to 50-ish secs, as expected.
But we've found that regardless of how many queries we fire in parallel, we only get a very low QPS (queries per second), almost just 3-5 queries/sec. We've set max_memory=60G, memory_limit=800MB, and active_statements=100, hoping the CPU and memory would be highly utilized, but they are still poorly used, around 30%-40%.
I have a strong feeling we are failing to feed the cluster in parallel properly, even though we hoped to get the best out of the CPU and memory utilization. Is there anything wrong with the settings, or is there anything else I am not aware of?
There might be multiple reasons for such behavior.
Firstly, every Greenplum query uses no more than one processor core per logical segment. Say you have 3 segments on every node with 40 physical cores: each query will utilize at most 3 cores per node, so you will need about 40 / 3 ≈ 13 parallel queries to utilize all of your CPUs. So for your number of cores per node it may be better to create more segments (gpexpand can do this). By the way, are the tables used in the queries compressed?
Secondly, it may be a bad query. If you provide the query plan, it may help us understand. There are some query types in Greenplum that can have the master as a bottleneck.
Finally, it might be bad OS or block-device settings.
I think the documentation page Managing Resources might help you manage your resources.
You can use a resource group to limit/control your resources, especially the CONCURRENCY attribute (the maximum number of concurrent transactions, including active and idle transactions, that are permitted in the resource group).
A resource queue helps limit ACTIVE_STATEMENTS.
Note: ACTIVE_STATEMENTS counts the statements currently running; when each query costs 50s and more queries keep arriving, this may not work well. Maybe 5 * 50 is better.
Also, you need to configure the memory/CPU settings so that your queries can proceed.
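The core-utilization rule from the first answer (one core per segment per host for a given query) can be sketched as arithmetic, using the cluster figures from the question:

```python
# Figures from the question: 40 cores and 3 segments per host.
cores_per_host = 40
segments_per_host = 3

# Rule of thumb: one query uses at most one core per segment on each host.
cores_per_query_per_host = segments_per_host

queries_to_saturate = cores_per_host / cores_per_query_per_host
print(f"~{queries_to_saturate:.0f} parallel queries to use all cores per host")
```

This is why 30%-40% utilization at low concurrency is expected with only 3 segments per host: either more segments (gpexpand) or more truly concurrent queries are needed to saturate the CPUs.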

Correcting improper usage of Cassandra

I have a similar question that was unanswered (but had many comments):
How to make Cassandra fast
My setup:
Ubuntu Server
AWS service - Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 4 GB RAM.
2 nodes of Cassandra DataStax Community Edition (2.1.3).
PHP 5.5.9, with the DataStax php-driver.
I come from a MySQL database background with very basic NoSQL hands-on experience: Elasticsearch (the company is now called Elastic) and MongoDB for document storage.
When I read how to use Cassandra, here are the bullets that I understood
It is distributed
You can have replicated rings to distribute data
You need to establish partition keys for maximum efficiency
Rethink your query rather than to use indices
Model according to queries and not data
Deletes are bad
You can only sort starting from the second key of your primary key set
Cassandra has "fast" write
I have a PHP Silex framework API that receives batched JSON data and inserts it into at least 4 tables, 6 at maximum (mainly due to the different sort orders I need).
At first I only had two nodes of Cassandra. I ran Apache Bench to test. Then I added a third node, and it barely shaved off a fraction of a second at higher batch size concurrency.
Concurrency  Batch size  avg. time (ms), 2 nodes  avg. time (ms), 3 nodes
1            5           288                      180
1            50          421                      302
1            400         1,298                    1,504
25           5           1,993                    2,111
25           50          3,636                    3,466
25           400         32,208                   21,032
100          5           5,115                    5,167
100          50          11,776                   10,675
100          400         61,892                   60,454
A batch size is the number of entries (to the 4-6 tables) it makes per call.
So a batch of 5 means it is making 5 × (4-6) tables' worth of inserts. At higher batch sizes / concurrency the application times out.
There are 5 columns in a table, with relatively small data (mostly ints, with text no longer than approximately 10 characters).
My keyspace is the following:
user_data | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
My "main" question is: what did I do wrong? This seems like a relatively small data set, considering that Cassandra was modeled after Bigtable and built for very high write speeds.
Do I add more nodes beyond 3 in order to speed up?
Do I change my replication factor, use QUORUM reads/writes, and then hunt for a sweet spot in the DataStax doc: http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Do I switch frameworks, e.g. go to node.js for higher concurrency?
Do I rework my tables? I have no good example of how to use column families effectively; I need some hints for this one.
For the table question:
I'm tracking the history of a user. A user has an event associated with a media id, and there is some extra metadata too.
So columns are: event_type, user_id, time, media_id, extra_data.
I need to sort them differently, therefore I made different tables for them (as I understood Cassandra data modeling should work... I am perhaps wrong). Therefore I'm duplicating the data across various tables.
Help?
EDIT PART HERE
The application also has Redis and MySQL attached for other CRUD points of interest, such as retrieving user data and caching it for faster pulls.
So far, on average, with MySQL and then Redis activated, I get 72 ms after Redis kicks in, and 180 ms on MySQL pre-Redis.
The first problem is that you're trying to benchmark the whole system without knowing what any individual component can do. Are you trying to see how fast an individual operation is, or how many operations per second you can do? They're different values.
I typically recommend you start by benchmarking Cassandra. Modern Cassandra can typically do 20-120k operations per second per server. With RF=3, that means somewhere between 5k and 40k reads / second or writes/second. Use cassandra-stress to make sure cassandra is doing what you expect, THEN try to loop in your application and see if it matches. If you slow way down, then you know the application is your bottleneck, and you can start thinking about various improvements (different driver, different language, async requests instead of sync, etc).
Right now, you're doing too much and analyzing too little. Break the problem into smaller pieces. Solve the individual pieces, then put the puzzle together.
Edit: Cassandra 2.1.3 is getting pretty old. It has some serious bugs. Use 2.1.11 or 2.2.3. If you're just starting development, 2.2.3 may be OK (and let's assume you'll actually go to production with 2.2.5 or so). If you're ready to go prod tomorrow, use 2.1.x instead.
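On the point that per-operation latency and operations per second are different values: a minimal timing harness that reports both (the `op()` body is a hypothetical stand-in — swap in a real driver call, e.g. a single Cassandra insert, when measuring for real):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def op():
    # Stand-in for one operation; replace with a real insert/read.
    time.sleep(0.001)

def benchmark(n_ops=200, concurrency=20):
    latencies = []

    def timed():
        t0 = time.perf_counter()
        op()
        latencies.append(time.perf_counter() - t0)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as ex:
        for _ in range(n_ops):
            ex.submit(timed)
    wall = time.perf_counter() - start

    avg_latency_ms = 1000 * sum(latencies) / len(latencies)
    throughput = n_ops / wall           # ops completed per second
    return avg_latency_ms, throughput

lat, tput = benchmark()
print(f"avg latency {lat:.2f} ms, throughput {tput:.0f} ops/s")
```

Measuring each component this way (stress tool first, then the application loop) shows whether added concurrency is raising throughput or just inflating per-operation latency — which is what the benchmark table above suggests is happening at concurrency 100.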

What is elastic search bounded by? Is it cpu, memory etc

I am running elastic search in my personal box.
Memory: 6GB
Processor: Intel® Core™ i3-3120M CPU @ 2.50GHz × 4
OS: Ubuntu 12.04 - 64-bit
ElasticSearch Settings: Only running locally
Version : 1.2.2
ES_MIN_MEM=3g
ES_MAX_MEM=3g
threadpool.bulk.queue_size: 3000
indices.fielddata.cache.size: 25%
http.compression: true
bootstrap.mlockall: true
script.disable_dynamic: true
cluster.name: elasticsearch
index size: 252MB
Scenario:
I am trying to test the performance of my bulk queries/aggregations. The test case runs asynchronous HTTP requests to node.js, which in turn calls Elasticsearch. The tests are run from a Java method, starting with 50 requests at a time. Each request is split and parallelized into two asynchronous (async.parallel) bulk queries in node.js. I am using the node-elasticsearch API (which uses the Elasticsearch 1.3 API). The two bulk queries contain 13 and 10 queries respectively, and both are sent asynchronously to Elasticsearch from node.js. When Elasticsearch returns, the query results are combined and sent back to the test case.
Observations:
I see that all the CPU cores are utilized at 100%. Memory utilization is around 90%. The combined response time for all 50 requests is 30 seconds. If I run each of the queries from the bulk requests alone, each returns in less than 100 milliseconds. Node.js takes negligible time to forward requests to Elasticsearch and combine the responses.
Even if I run the test case synchronously from Java, the response time does not change. It seems that Elasticsearch is not doing parallel processing. Is this because I am CPU- or memory-bound? One more observation: if I change the Elasticsearch heap size from 1 to 3 GB, the response time does not change.
Also I am pasting top command output:
top - 18:04:12 up 4:29, 5 users, load average: 5.93, 5.16, 4.15
Tasks: 224 total, 3 running, 221 sleeping, 0 stopped, 0 zombie
Cpu(s): 98.2%us, 1.0%sy, 0.0%ni, 0.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 5955796k total, 5801920k used, 153876k free, 1548k buffers
Swap: 6133756k total, 708336k used, 5425420k free, 460436k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17410 root 20 0 7495m 3.3g 27m S 366 58.6 5:09.57 java
15356 rmadd 20 0 1015m 125m 3636 S 19 2.2 1:14.03 node
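Reading the top output: a 1-minute load average of ~5.9 on a 4-core box, ~98% user CPU, and swap in use all point at CPU (and memory) saturation rather than any Elasticsearch-internal limit. The check, as a quick sketch with the numbers from the output above:

```python
# Figures from the top output above.
cores = 4
load_1min = 5.93        # 1-minute load average
cpu_user_pct = 98.2     # %us from the Cpu(s) line

# Load above the core count with user CPU pegged => the box is CPU-bound.
cpu_saturated = load_1min > cores and cpu_user_pct > 90
print("CPU-bound" if cpu_saturated else "not CPU-bound")
```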
Questions:
Is this expected because I am running Elasticsearch on my local machine and not in a cluster? Can I improve performance on my local machine? I will definitely start a cluster, but first I want to know how to improve performance scalably. What is Elasticsearch bound by?
I was not able to find this in the forums, and I am sure this would help others. Thanks for your help.
