We have two instances of the same Galera cluster, i.e. the same database schema, with the data differing by about one month; they are a staging and a production environment respectively.
They run on identical hardware, the same operating system, the same Galera version and the same configuration, as a 3-node cluster. Network latency is a non-issue: each node lives in its own Docker container on the same machine, and they are regularly backed up.
There is a particularly heavy query that takes up to 2 seconds on staging, while on production it takes up to two and a half minutes. Examining the query plans, the number of records at each step is of a similar magnitude in both environments, but the plans diverge in strategy at the two following steps:
staging (2s):
8 DERIVED b ref bay_depend_id,bay_relies_on_id bay_depend_id 4 rwms.l.bay_id 2 Using where
8 DERIVED b ref bay_depend_id,bay_relies_on_id bay_depend_id 4 rwms.l.bay_id 2 Using where
production (2mins):
8 DERIVED b ALL bay_depend_id,bay_relies_on_id 1 Using where; Using join buffer (flat, BNL join)
8 DERIVED b ALL bay_depend_id,bay_relies_on_id 1 Using where; Using join buffer (incremental, BNL join)
The query plan for all the other steps is the same, apart from the number of records, but at these two steps the evaluation strategy changes. I can't really wrap my head around why, or around how to resolve the issue, mostly because every working condition is practically identical.
How big are those tables? (On each node.)
EXPLAIN is not guaranteed to be consistent across servers, but a difference this drastic deserves more investigation. Please provide the query, the table size(s), the output of EXPLAIN FORMAT=JSON SELECT ..., and SHOW CREATE TABLE.
Doing this may get them in sync:
ANALYZE TABLE ...
It is fast, and it rebuilds the statistics for the given table. Note: statistics are not shared between nodes, so do that on each node.
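For example (the table name here is only a guess based on the key names in the plan; substitute the real source table of the derived step):
-- statistics are per node, so repeat this on all three nodes
ANALYZE TABLE bay_dependencies;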
Related
See these two Snowflake query profile images. They are doing similar work (updating the same 370M-row table joined with a small table, 21k rows in one case and 9k in the other), but the performance differs by more than 5x.
The first query finished in around 15 minutes, using one X-Small VDW:
[Query profile screenshot: fast query, finished in around 15 minutes]
And this query, updating the same 370M-row table but joining with an even smaller DIM table of 9k rows, was still running after 1 hour 30 minutes:
[Query profile screenshot: still running after 90 minutes]
From the query profile, I cannot explain why the 2nd query runs so much slower than the first one. The 2nd one is run right after the first one.
Any idea? Thanks
In the second query you can see that bytes spilled to local storage is 272 GB. This means the working set was too large to fit in the cluster's memory and had to spill to the locally attached SSD. From a performance perspective this is a costly operation, and it is probably why the second query took so long to run (query 1 only spilled 2 GB). The easiest solution is to increase the size of the VDW, or you could rewrite the query:
https://docs.snowflake.net/manuals/user-guide/ui-query-profile.html#queries-too-large-to-fit-in-memory
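If resizing the warehouse is the route taken, it is a one-line change; a minimal sketch, assuming a hypothetical warehouse named my_wh and stepping up one size from X-Small:
-- more memory per cluster makes it less likely the join spills to disk
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';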
Note also that query 1 managed to read 100% of its data set from VDW memory, which is very efficient, whereas query 2 could only find about half of its data set there and so had to perform remote I/O (reads from cloud storage) to get the rest. Queries run before query 1 and 2 had pulled that data into the local VDW cache, which retains it on an LRU basis.
The join for the slow query is producing more rows than are flowing into it. This can be what you want, but often it's caused by duplicate values in the tables. I'd do a sanity check on whether that's expected here.
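A quick way to run that sanity check, with hypothetical names (dim_table for the small table, join_key for the join column):
-- any rows returned mean the key is duplicated and the join will fan out
SELECT join_key, COUNT(*) AS dup_count
FROM dim_table
GROUP BY join_key
HAVING COUNT(*) > 1;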
A few days ago I was handed a very interesting project: recoding the data input of a large simulation framework.
Before I took it over, it used an ... interesting ... file reader, which sadly didn't work.
The task is easy to describe: get up to 2+ million floats per second from a database (originally PostgreSQL was planned; now I'm also looking into Elasticsearch or Cassandra) into a Java proxy, then remap, buffer and feed them to a high-performance data bus.
The second part was easy enough, and the first part was standard, right? Erm, sadly wrong.
As it turned out, I somehow smashed into a brick wall of "not more than 1 MB/s", which is only about 5% of the transfer rate needed. Yes, yes, the problem could surely be killed with more hardware, but why introduce a built-in brake if there might be better solutions?
A tad of background on the data:
A unit consists of a uuid, a type and a timeseries of values (floats). One float per minute for a whole year -> 60*24*365 = 525,600 floats.
The consumer asks for 150k units at 15 time-steps per second -> 2.25 million floats per second.
We plan to pre-split the time series into day-batches, which seems a tad more manageable. Even then that's roughly 9 KB per unit-day × 150k units ≈ 1.35 GB of data for the buffer.
So, I tried several different ways to store and retrieve the data.
In PostgreSQL the most promising approach so far is a table:
uuid uuid,
type text,
t1...t365 float4[]
PK is the combination of uuid and type.
Simple enough, I only need to SELECT tx FROM table.
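As a DDL sketch of that layout (hypothetical table name; only the first and last of the 365 day columns written out):
-- one array of per-minute floats per day column
CREATE TABLE units (
    uuid uuid,
    type text,
    t1   float4[],
    -- ... t2 through t364 ...
    t365 float4[],
    PRIMARY KEY (uuid, type)
);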
Info on my testing setup: as I can't yet test with a full dataset (even faking the data is painstakingly slow, at 2 seconds per unit), I usually test with batches of 2k units.
Under these conditions the SELECT needs 12 seconds, which is roughly 5% of the speed I need and much slower than I assumed it would be. According to my tests that's NOT a transmission bottleneck (same result if I use a DB on the same machine). 1 MB/s???
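Back-of-the-envelope, assuming the test SELECT pulls a single day column (1,440 four-byte floats per unit) for the 2k test units, the raw payload works out to about:
2,000 units × 1,440 floats × 4 bytes ≈ 11.5 MB
11.5 MB / 12 s ≈ 1 MB/s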
I haven't found any obvious bottleneck yet, so throwing more hardware at it might not work at all.
Does anyone knowledgeable about PostgreSQL have an idea of what's slowing it down so much?
OK, so I tried another system. Elasticsearch might be faster, and it would also solve some other problems, like a nasty cluster-access API I might have to add later on.
The mapping (index name timeseries_ID, document type unit):
PUT timeseries_ID
{
  "mappings": {
    "unit": {
      "properties": {
        "uuid": { "type": "keyword" },
        "type": { "type": "keyword" },
        "day":  { "type": "integer" },
        "data": { "type": "float" }
      }
    }
  }
}
To my surprise, this was even slower: only 100k floats per second, tops. Only 2% of what I need. Yes, yes, I know, add more nodes, but isn't Elasticsearch supposed to be a tad faster than Postgres?
Does anyone have an idea for an Elasticsearch mapping, or any configuration, that might be more promising?
Now, on Monday I'll get to work on my tests with Cassandra. The planned table is simple:
day int PRIMARY KEY,
uuid text,
type text,
data list<float>
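In CQL that plan would look roughly like the following sketch (hypothetical table name; note that with day alone as the primary key there can only be one row per day, so uuid and type would presumably have to become clustering columns to hold multiple units):
CREATE TABLE timeseries_by_day (
    day int,
    uuid text,
    type text,
    data list<float>,
    PRIMARY KEY (day, uuid, type)
);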
Is there a more sensible setup for this kind of data in Cassandra?
Or does anyone have a better idea altogether to solve this problem?
Clarifications
If I understand what you want, I can't see why you are using a database at all.
The reason for using a DB is that the data comes from outside people with varying IT skills. So we need a central storage for the data, with an API and maybe a simple web frontend, to check integrity, clean the data up and reformat it into our own internal format. A database seems to work better for that than some proprietary file layout, which could more easily be clogged and made unusable by inexperienced data contributors.
I'm not sure it's clear on what you're trying to do here or why you seem to be randomly trying different database servers.
It's not random. Postgres was a typical case of someone going: "oh, but we already use that one for our data, so why should I have to learn something new?"
The Elasticsearch approach tries to leverage its distributed cluster and replication features: we could keep a central permanent storage and just spin up a temporary ES cluster on our VMs with replicas of the needed data. That way ES would handle all the data transport into the compute cluster.
Cassandra is a suggestion from my boss, as it's supposed to be much more scalable than Postgres. Plus it could comfort those who prefer a more SQL-like API.
But - can you show e.g. the query in PostgreSQL you are trying to run - is it multiple uuids but a short time-period or one uuid and a longer one?
Postgres: the simplest approach is to get all uuids and exactly one day-batch. So a simple:
SELECT t1 FROM table;
where t1 is the column holding the data for day-batch one.
For testing with my (till now) limited fake data (roughly 2% of the full set) I sadly have to go with: SELECT t1, t2 ... t50 FROM table
Depending on testing, I might also split that one large SELECT into several/many smaller ones, probably with a uuid-hash based split and with indexes set accordingly, as in the sketch below. It's all a question of balancing overhead and reliability. Nothing is final yet.
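Purely as a sketch of that idea, with hypothetical names (units table, 8 slices) and PostgreSQL's internal hashtext() helper standing in for whatever hash is finally chosen:
-- fetch slice 0 of 8; the other slices use the remaining remainders
SELECT uuid, t1
FROM units
WHERE abs(hashtext(uuid::text)) % 8 = 0;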
Are we talking multiple consumers or just one?
At the start, one consumer; the plan is to later have multiple, much smaller, instances of the simulation. The 150k-unit one is the "single large consumer" assumption.
How often are the queries issued?
As needed; in the full-fledged approach that would be every 96 seconds, more often if I switch to smaller queries.
Is this all happening on the same machine or is the db networked to this Java proxy and if so by what.
At the moment I'm testing on one or two machines only: preliminary tests are made solely on my workstation, with later tests moving the DB to a second machine. They communicate over a standard gigabit LAN.
In the full-fledged version the simulation(s) will run on VMs in a cluster, with the DB getting a dedicated strong server to itself.
Add the execution plan generated using EXPLAIN (ANALYZE, VERBOSE).
I had to jury-rig something with a small batch (281 units) only:
Seq Scan on schema.fake_data (cost=0.00..283.81 rows=281 width=18) (actual time=0.033..3.335 rows=281 loops=1)
  Output: t1
Planning time: 0.337 ms
Execution time: 1.493 ms
Executing the thing for real: 10 seconds for a mere 1.6 MB.
Now faking a 10k-unit load by selecting t1-t36 (I know, not even close to the real thing):
Seq Scan on opsim.fake_data (cost=0.00..283.81 rows=281 width=681) (actual time=0.012..1.905 rows=281 loops=1)
  Output: *
Planning time: 0.836 ms
Execution time: 2.040 ms
Executing the thing for real: 2 minutes for ~60 MB.
The problem is definitely not the planning or the execution. Neither is it the network, as I get the same slow read on my local system. But heck, even a slow HDD manages at LEAST 30 MB/s, and a cheap network 12.5 MB/s ... I know, I know, that's the gross rate, but how come I get less than 1 MB/s out of these DBs? Is there some bandwidth limit per connection? Aunt Google at least gave me no indication of anything like that.
I have a similar question that was unanswered (but had many comments):
How to make Cassandra fast
My setup:
Ubuntu Server
AWS instance: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 4 GB RAM.
2 nodes of Cassandra DataStax Community Edition (2.1.3).
PHP 5.5.9, with the DataStax php-driver.
I come from a MySQL background, with only very basic hands-on NoSQL experience: Elasticsearch (the company is now called Elastic) and MongoDB for document storage.
When I read up on how to use Cassandra, here are the points I took away:
It is distributed
You can have replicated rings to distribute data
You need to establish partition keys for maximum efficiency
Rethink your query rather than to use indices
Model according to queries and not data
Deletes are bad
You can only sort starting from the second key of your primary key set
Cassandra has "fast" write
I have a PHP Silex framework API that receives batched JSON data and inserts it into 4 tables at a minimum, 6 at a maximum (mainly due to the different sort orders I need).
At first I only had two nodes of Cassandra. I ran Apache Bench to test. Then I added a third node, and it barely shaved off a fraction of a second at higher batch size concurrency.
Concurrency   Batch size   Avg. time (ms), 2 nodes   Avg. time (ms), 3 nodes
1             5            288                       180
1             50           421                       302
1             400          1,298                     1,504
25            5            1,993                     2,111
25            50           3,636                     3,466
25            400          32,208                    21,032
100           5            5,115                     5,167
100           50           11,776                    10,675
100           400          61,892                    60,454
A batch size is the number of entries (to the 4-6 tables) made per call.
So a batch of 5 means 5 × (4-6) tables' worth of inserts. At higher batch sizes / concurrency the application times out.
There are 5 columns in each table, with relatively small data (mostly ints, with text no longer than approx. 10 characters).
My keyspace is the following:
user_data | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
My "main" question is: what did I do wrong? It seems to be this is relatively small data set of that considering that Cassandra was built on BigDataTable at very high write speed.
Do I add more nodes beyond 3 in order to speed up?
Do I change my replication factor and the read/write consistency level (e.g. QUORUM), and then hunt for a sweet spot from the DataStax docs: http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Do I switch frameworks, e.g. go to Node.js for higher concurrency?
Do I rework my tables? I have no good example of how to use column families effectively; I need some hints for this one.
For the table question:
I'm tracking the history of a user. A user has an event and is associated with a media id, and there is some extra metadata too.
So columns are: event_type, user_id, time, media_id, extra_data.
I need to sort them in different ways, so I made different tables for each (as I understood Cassandra data modeling is supposed to work ... I am perhaps wrong), replicating the same data across the various tables.
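Just to illustrate that per-query modeling, a rough sketch of what one of those tables could look like (hypothetical table name and column types, one table per sort order):
-- events for one user, newest first; one partition per user
CREATE TABLE user_data.events_by_user (
    user_id    text,
    time       timestamp,
    event_type text,
    media_id   text,
    extra_data text,
    PRIMARY KEY (user_id, time)
) WITH CLUSTERING ORDER BY (time DESC);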
Help?
EDIT:
The application also has Redis and MySQL attached for other CRUD points of interest, such as retrieving a user's data and caching it for faster pulls.
So far, on average, I see 72 ms once Redis kicks in, and 180 ms from MySQL pre-Redis.
The first problem is you're trying to benchmark the whole system, without knowing what any individual component can do. Are you trying to see how fast an individual operation is? or how many operations per second you can do? They're different values.
I typically recommend you start by benchmarking Cassandra. Modern Cassandra can typically do 20-120k operations per second per server. With RF=3, that means somewhere between 5k and 40k reads/second or writes/second. Use cassandra-stress to make sure Cassandra is doing what you expect, THEN loop in your application and see if it matches. If it slows way down, then you know the application is your bottleneck, and you can start thinking about various improvements (different driver, different language, async requests instead of sync, etc.).
Right now, you're doing too much and analyzing too little. Break the problem into smaller pieces. Solve the individual pieces, then put the puzzle together.
Edit: Cassandra 2.1.3 is getting pretty old. It has some serious bugs. Use 2.1.11 or 2.2.3. If you're just starting development, 2.2.3 may be OK (and let's assume you'll actually go to production with 2.2.5 or so). If you're ready to go prod tomorrow, use 2.1.x instead.
I'm running a simplistic application on a Spark/Cassandra cluster. Since moving to a new environment (Spark 1.5 instead of 1.2, plus a minor Cassandra version upgrade), a substantial performance degradation has been observed (from 4 s to 1-5 min for the same task and the same amount of data).
After initial investigation it seems that, for exactly the same code, many more tasks are generated from the Spark driver's perspective (20k+, where it used to be up to 5), and the logs on the executors' side reflect the same situation:
many sequential executions of the same query on different partitions:
...
CassandraTableScanRDD: Fetched 0 rows from x.y for partition 20324 in 0.138 s.
CassandraTableScanRDD: Fetched 0 rows from x.y for partition 20327 in 0.058 s.
CassandraTableScanRDD: Fetched 0 rows from x.y for partition 20329 in 0.053 s.
...
where it used to be a single one:
CassandraTableScanRDD: Fetched 905 rows from x.y for partition 0 in 2.992 s.
Since the application code is the same, I wonder what could possibly have caused such a difference in partitioning behaviour, and what can be done to remediate it?
NB! The setup of the two environments is different; the configuration is not shared/inherited.
Thanks.
The new version of the Spark Cassandra Connector uses a system table inside more modern Cassandra versions to estimate split sizes. That table is updated every 5 minutes (currently), although the number of splits you are seeing is extremely large. The value read from that table is divided by your split size.
If you are using a C* version older than 2.1.5, this table does not exist and the partitioning needs to be done manually.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size
You can manually pass in the number of splits via the ReadConf if you are continuing to see issues.
I want to know what configuration would be ideal for my case. I have 4 servers (nodes), each with 128 GB RAM, and all 4 nodes will be in one cluster.
The total number of indexes will be 10, each receiving 1,500,000 documents per day.
Since I'll have 4 servers (nodes), I'll set master: true and data: true on all of them, so that if one node goes down another can become master. Every index will have 5 shards.
I want to know which config parameters I should alter in order to get the maximum potential out of Elasticsearch.
Also, tell me how much memory is enough for my usage, since I'll have very frequent select queries in production (maybe 1000 requests per second).
I need a detailed suggestion.
I'm not sure anyone can give you a definitive answer on exactly how to configure your servers, since it is very dependent on your data structure, mapping, and specific queries.
You should read this great article series by Elastic regarding production environments