Are there anyways to speed Greenplum up? - greenplum

I have 3 vms(one master two segment hosts of the opensource version), each has 32 cores 64 threads and the memory of 251G.
There is one big table,which have nearly 70 fields and one hundred million records.
The parts of definition as follows:
with (appendonly=true,compresslevel=5)
distributed by(record_id) partition by range(dt_date)
(partition p201012 start ('2021-01-01'::date) end ('2021-01-31'::date) every ('1 days'::interval))
The culster have 30 primary segments and 30 mirror segments.
Both insertion(<2000 records/s) and selection(about 25s) are too slow,since we have one hundred million records a day and more than one second is not allowed.
So my questions are: is there anyone using Gpdb? Are there anyways to speed it up?
Thank u!

My initial thought is that you have way too many primaries/mirrors (30) for that few cpus on the system. The general rule of thumb is 3-4 cpu/vcpu per segment (i.e., postgres database). Your system should be reconfigured to only have 2 primaries per host to better utilize the cpu and memory on the hosts; 1 might even be better with that small amount of memory on the segment hosts.
As it stands now, you are swamping the system with too many databases trying to utilize too few system resources.
Jim

Related

How to increase greenplum concurrency and # query per sec

We have a fairly big Greenplum v4.3 cluster. 18 hosts, each host has 3 segment nodes. Each host has approx 40 cores and 60G memory.
The table we have is 30 columns wide, which has 0.1 billion rows. The query we are testing has 3-10 secs response time when there is no concurrency pressure. As we increase the # of queries we fired in parallel, the latency is decreasing from avg 3 secs to 50ish secs as expected.
But we've found that regardless how many queries we fired in parallel, we only have like very low QPS(query per sec), almost just 3-5 queries/sec. We've set the max_memory=60G, memory_limit=800MB, and active_statments=100, hoping the CPU and memory can be highly utilized, but they are still poorly used, like 30%-40%.
I have a strong feeling, we tried to feed up the cluster in parallel badly, hoping to take the best out of the CPU and Memory utilization. But it doesn't work as we expected. Is there anything wrong with the settings? or is there anything else I am not aware of?
There might be multiple reasons for such behavior.
Firstly, every Greenplum query uses no more than one processor core on one logical segment. Say, you have 3 segments on every node with 40 physical cores. Running two parallel queries will utilize maximum 2 x 3 = 6 cores on every node, so you will need about 40 / 6 ~= 6 parallel queries to utilize all of your CPUs. So, maybe for your number of cores per node its better to create more segments (gpexpand can do this). By the way, are the tables that used in the queries compressed?
Secondly, it may be a bad query. If you will provide a plan for the query, it may help to understand. There some query types in Greenplum that may have master as a bottleneck.
Finally, that might be some bad OS or blockdev settings.
I think this document page Managing Resources might help you mamage your resources
You can use Resource Group limit/controll your resource especialy concurrency attribute(The maximum number of concurrent transactions, including active and idle transactions, that are permitted in the resource group).
Resouce queue help limits ACTIVE_STATEMENTS
Note: The ACTIVE_STATEMENTS will be the total statement current running, when you have 50s cost queries and next incoming queries, this could not be working, mybe 5 * 50 is better.
Also, you need config memory/CPU settings to enable your query can be proceed.

Cassandra partition size and performance?

I was playing around with cassandra-stress tool on my own laptop (8 cores, 16GB) with Cassandra 2.2.3 installed out of the box with having its stock configuration. I was doing exactly what was described here:
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
And measuring its insert performance.
My observations were:
using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications I had ~7000 inserts/sec.
when modifying line 35 in the code above (cluster: fixed(1000)) to "cluster: fixed(100)", i. e. configuring my test data distribution to have 100 clustering keys instead of 1000, the performance was jumping up to ~11000 inserts/sec
when configuring it to have 5000 clustering keys per partition, the performance was reducing to just 700 inserts/sec
The documentation says however Cassandra can support up to 2 billion rows per partition. I don't need that much still I don't get how just 5000 records per partition can slow the writes 10 times down or am I missing something?
Supporting is a little different from "best performaning". You can have very wide partitions, but the rule-of-thumb is to try to keep them under 100mb for misc performance reasons. Some operations can be performed more efficiently when the entirety of the partition can be stored in memory.
As an example (this is old example, this is a complete non issue post 2.0 where everything is single pass) but in some versions when the size is >64mb compaction has a two pass process, that halves compaction throughput. It still worked with huge partitions. I've seen many multi gb ones that worked just fine. but the systems with huge partitions were difficult to work with operationally (managing compactions/repairs/gcs).
I would say target the rule of thumb initially of 100mb and test from there to find own optimal. Things will always behave differently based on use case, to get the most out of a node the best you can do is some benchmarks closest to what your gonna do (true of all systems). This seems like something your already doing so your definitely on the right path.

High speed time series query postgresql hardware considerations

Just have some general questions on hardware choices for polling lots of low-medium density sensor data time series as fast as possible.
System Overview
The data consists of multiple time series of approx 50-100K data points from various sensors at different locations at different times. ~15 columns of data in the main table, but some columns have long array values in them. I'm interfacing with C#/python to send queries to a local database and then work on the returned data.
The Problem and Some Thoughts
As it stands, the logging server is on average hardware (budget 4tb hdds, basic quadcore, 8gb ram) and depending on the amount of data requested in the query, it takes forever (re: "annoying amounts of waiting") to return an entire time series. I have thousands of independent time series and I want to compare multiple ones against each other. Returning a single time series takes at least 30 for 50K rows with a * can take 30+ seconds, but as low as 100ms for only a few columns. A lot of different queries are used so I don't have the luxury of caching for repeated queries.
So what I was thinking of is, instead of work directly from the online server, make an offline copy of the database onto more dedicated hardware that can be used for faster analysis. I don't need all of the data at any one time (i.e. I can pick a a location and time range, copy that offline, and work on that)
The table design is very basic. The most often queried table data has primary key:
PRIMARY KEY (location_id, time_logged, sensor_id)
With simple query like
SELECT * FROM table
WHERE location_id = 2154321 AND sensor_id = 254;
This is most common, and will be used along with some additional WHERE conditions.
There are thousands of location_id and potentially dozens of sensor_id for each location_id
The Questions
(with regard to the above table/pk/query setup)
How much does more RAM help with faster queries. i.e. 64GB vs 8GB
How much does faster cpu / more cores help with faster queries. What kind of CPU (quad/6/12+) would provide the best speedup. Is there diminishing returns?
How would one set up hard disk drives to help with faster queries (RAID cluster with SSD or mechanical).
Would getting a couple or even 4+ cheap mechanical hard disk drives provide a significant speedup in RAID?
I've read about columnar store and how it can be useful for time series (https://www.citusdata.com/blog/76-postgresql-columnar-store-for-analytics). Can anyone shed any insight on this and is it worth setting up?
Will increasing the planner statistics value help significantly?
Any general recommendations / first steps to get the best kind of query speedup? Another dedicated PC with multi-core and lots of ram? A NAS? Dedicated PC with multi-drive RAID
I'm relatively new to working with databases so don't really know what to expect in terms of performance so any pointers would be helpful.
thanks!

Performance issues when pushing data at a constant rate to Elasticsearch on multiple indexes at the same time

We are experiencing some performance issues or anomalies on a elasticsearch specifically on a system we are currently building.
The requirements:
We need to capture data for multiple of our customers, who will query and report on them on a near real time basis. All the documents received are the same format with the same properties and are in a flat structure (all fields are of primary type and no nested objects). We want to keep each customer’s information separate from each other.
Frequency of data received and queried:
We receive data for each customer at a fluctuating rate of 200 to 700 documents per second – with the peak being in the middle of the day.
Queries will be mostly aggregations over around 12 million documents per customer – histogram/percentiles to show patterns over time and the occasional raw document retrieval to find out what happened a particular point in time. We are aiming to serve 50 to 100 customer at varying rates of documents inserted – the smallest one could be 20 docs/sec to the largest one peaking at 1000 docs/sec for some minutes.
How are we storing the data:
Each customer has one index per day. For example, if we have 5 customers, there will be a total of 35 indexes for the whole week. The reason we break it per day is because it is mostly the latest two that get queried with occasionally the remaining others. We also do it that way so we can delete older indexes independently of customers (some may want to keep 7 days, some 14 days’ worth of data)
How we are inserting:
We are sending data in batches of 10 to 2000 – every second. One document is around 900bytes raw.
Environment
AWS C3-Large – 3 nodes
All indexes are created with 10 shards with 2 replica for the test purposes
Both Elasticsearch 1.3.2 and 1.4.1
What we have noticed:
If I push data to one index only, Response time starts at 80 to 100ms for each batch inserted when the rate of insert is around 100 documents per second. I ramp it up and I can reach 1600 before the rate of insert goes to close to 1sec per batch and when I increase it to close to 1700, it will hit a wall at some point because of concurrent insertions and the time will spiral to 4 or 5 seconds. Saying that, if I reduce the rate of inserts, Elasticsearch recovers nicely. CPU usage increases as rate increases.
If I push to 2 indexes concurrently, I can reach a total of 1100 and CPU goes up to 93% around 900 documents per second.
If I push to 3 indexes concurrently, I can reach a total of 150 and CPU goes up to 95 to 97%. I tried it many times. The interesting thing is that response time is around 109ms at the time. I can increase the load to 900 and response time will still be around 400 to 600 but CPU stays up.
Question:
Looking at our requirements and findings above, is the design convenient for what’s asked? Are there any tests that I can do to find out more? Is there any setting that I need to check (and change)?
I've been hosting thousands of Elasticsearch clusters on AWS over at https://bonsai.io for the last few years, and have had many a capacity planning conversation that sound like this.
First off, it sounds to me like you have a pretty good cluster design and test rig going here. My first intuition here is that you are legitimately approaching the limits of your c3.large instances, and will want to bump up to a c3.xlarge (or bigger) fairly soon.
An index per tenant per day could be reasonable, if you have relatively few tenants. You may consider an index per day for all tenants, using filters to focus your searches on specific tenants. And unless there are obvious cost savings to discarding old data, then filters should suffice to enforce data retention windows as well.
The primary benefit of segmenting your indices per tenant would be to move your tenants between different Elasticsearch clusters. This could help if you have some tenants with wildly larger usage than others. Or to reduce the potential for Elasticsearch's cluster state management to be a single point of failure for all tenants.
A few other things to keep in mind that may help explain the performance variance you're seeing.
Most importantly here, indexing is incredibly CPU bottlenecked. This makes sense, because Elasticsearch and Lucene are fundamentally just really fancy string parsers, and you're sending piles of strings. (Piles are a legitimate unit of measurement here, right?) Your primary bottleneck is going to be the number and speed of your CPU cores.
In order to take the best advantage of your CPU resources while indexing, you should consider the number of primary shards you're using. I'd recommend starting with three primary shards to distribute the CPU load evenly across the three nodes in your cluster.
For production, you'll almost certainly end up on larger servers. The goal is for your total CPU load for your peak indexing requirements ends up under 50%, so you have some additional overhead for processing your searches. Aggregations are also fairly CPU hungry. The extra performance overhead is also helpful for gracefully handling any other unforeseen circumstances.
You mention pushing to multiple indices concurrently. I would avoid concurrency when bulk updating into Elasticsearch, in favor of batch updating with the Bulk API. You can bulk load documents for multiple indices with the cluster-level /_bulk endpoint. Let Elasticsearch manage the concurrency internally without adding to the overhead of parsing more HTTP connections.
That's just a quick introduction to the subject of performance benchmarking. The Elasticsearch docs have a good article on Hardware which may also help you plan your cluster size.

what is the max number of nodes in rethinkdb?

I've read on rethinkdb's doc that we can have a number of nodes from one to sixteen but actually I don't know if it is a way of speaking or a real limit.
I launched 20 VirtualBox VMs to create a cluster and I found troubles to have all nodes in the cluster online at the same time, 3 or 4 nodes loose connectivity. This makes sense with the 16 limit but I havent found similar limits for other nosql databases.
Is 16 a real maximum number of nodes per cluster limit on rethinkdb?
thanks!
Short answer is: There is no hard limit.
It is written 16 machines because that is what we have tested so far.
Some tests have been run with 64 nodes and while it doesn't scale as much as it should, it still works.
RethinkDB is aiming for a smooth experience with 100 servers and 100.000 tables -- see https://github.com/rethinkdb/rethinkdb/issues/1861 to track progress.
Also if you run 20 VMs on the same machine, the host may not have enough resources to run the cluster, which would explains the timeouts.

Resources