Correcting improper usage of Cassandra - performance

I have a similar question that was unanswered (but had many comments):
How to make Cassandra fast
My setup:
Ubuntu Server
AWS service - Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 4GB RAM.
2 Nodes of Cassandra Datastax Community Edition: (2.1.3).
PHP 5.5.9. With datastax php-driver
My background is in MySQL, with only very basic NoSQL hands-on experience: Elasticsearch (the company is now called Elastic) and MongoDB for document storage.
When I read how to use Cassandra, here are the bullets that I understood
It is distributed
You can have replicated rings to distribute data
You need to establish partition keys for maximum efficiency
Rethink your queries rather than relying on secondary indices
Model according to queries and not data
Deletes are bad
You can only sort starting from the second key of your primary key set
Cassandra has "fast" writes
I have a PHP Silex framework API that receives a batch of JSON data and inserts it into 4 tables at minimum, 6 at maximum (mainly due to the different sort orders I need).
At first I only had two nodes of Cassandra. I ran Apache Bench to test. Then I added a third node, and it barely shaved off a fraction of a second at higher batch size concurrency.
Concurrency | Batch size | avg. time (ms), 2 nodes | avg. time (ms), 3 nodes
          1 |          5 |                     288 |                     180
          1 |         50 |                     421 |                     302
          1 |        400 |                   1,298 |                   1,504
         25 |          5 |                   1,993 |                   2,111
         25 |         50 |                   3,636 |                   3,466
         25 |        400 |                  32,208 |                  21,032
        100 |          5 |                   5,115 |                   5,167
        100 |         50 |                  11,776 |                  10,675
        100 |        400 |                  61,892 |                  60,454
A batch size is the number of entries (across the 4-6 tables) it writes per call.
So a batch of 5 means it is making 5 x (4-6) tables' worth of inserts. At higher batch size / concurrency the application times out.
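To see why the higher settings time out, it helps to count how many statements each benchmark setting generates. A back-of-the-envelope sketch (the 4-6 tables and batch sizes are from the setup above; the per-request totals are derived, not measured):

```python
# Rough count of INSERT statements generated per Apache Bench request:
# each of `batch_size` entries fans out to 4-6 table inserts.
def inserts_per_request(batch_size, tables=6):
    return batch_size * tables

per_request = inserts_per_request(400)  # heaviest setting: 2400 inserts/request
in_flight = 100 * per_request           # at concurrency 100: 240,000 in flight
print(per_request, in_flight)           # 2400 240000
```

At the heaviest setting the two-node cluster is being asked to absorb up to a quarter of a million statements at once, which goes a long way toward explaining the timeouts.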
There are 5 columns in a table with relatively small size of data (mostly int with text being no more than approx 10 char long)
My keyspace is the following:
user_data | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
My "main" question is: what did I do wrong? This seems like a relatively small data set, considering that Cassandra's design (modeled on Google's Bigtable) targets very high write throughput.
Do I add more nodes beyond 3 in order to speed up?
Do I change my replication factor and tune read/write consistency (e.g. QUORUM), hunting for a sweet spot from the DataStax doc: http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Do I switch frameworks, e.g. go to Node.js for higher concurrency?
Do I rework my tables, as I have no good example of how to effectively use column families? I need some hints for this one.
For the table question:
I'm tracking the history of a user. A user has an event, which is associated with a media id, and there is some extra metadata too.
So columns are: event_type, user_id, time, media_id, extra_data.
I need to sort them differently, so I made different tables for them (as I understood Cassandra data modeling should work... I am perhaps wrong). So I'm duplicating the same data across several tables, one per query.
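That "one table per query" fan-out can be sketched like this (the table names and sort orders here are hypothetical, purely to illustrate the pattern of duplicating one event into several query-specific tables):

```python
# Hypothetical sketch: one incoming user-history event is written to
# several tables, each keyed/clustered for a different read pattern.
def fan_out(event):
    """Return (table, row) pairs for one event."""
    return [
        # partition by user, cluster by time: "recent events for a user"
        ("events_by_user_time", (event["user_id"], event["time"],
                                 event["event_type"], event["media_id"])),
        # partition by media, cluster by time: "recent events for a media id"
        ("events_by_media_time", (event["media_id"], event["time"],
                                  event["event_type"], event["user_id"])),
        # partition by user + type: "a user's events of one type"
        ("events_by_user_type", (event["user_id"], event["event_type"],
                                 event["time"], event["media_id"])),
    ]

rows = fan_out({"user_id": 1, "event_type": "play",
                "time": 1700000000, "media_id": 42, "extra_data": ""})
print(len(rows))  # one write per table: 3 here
```

This duplication is normal in Cassandra; the cost is that every logical entry multiplies into one physical write per table, which is exactly the 4-6x factor in the benchmark above.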
Help?
EDIT PART HERE
The application also has Redis and MySQL attached for other CRUD points of interest, such as retrieving a user's data and caching it for faster pulls.
So far, on average, a MySQL read takes 180 ms before caching, and 72 ms once Redis kicks in.

The first problem is that you're trying to benchmark the whole system without knowing what any individual component can do. Are you trying to see how fast an individual operation is, or how many operations per second you can do? Those are different measurements.
I typically recommend you start by benchmarking Cassandra. Modern Cassandra can typically do 20-120k operations per second per server. With RF=3, that means somewhere between 5k and 40k reads / second or writes/second. Use cassandra-stress to make sure cassandra is doing what you expect, THEN try to loop in your application and see if it matches. If you slow way down, then you know the application is your bottleneck, and you can start thinking about various improvements (different driver, different language, async requests instead of sync, etc).
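The RF arithmetic behind that recommendation can be written out explicitly (the 20-120k ops/sec/server figure is the answer's rule of thumb, not a measurement of this particular cluster):

```python
# Rule-of-thumb math: with replication factor RF, every logical write
# is performed on RF replicas, so a server's raw capacity is divided
# by RF to get its logical (client-visible) throughput.
def logical_per_server(raw_ops_per_server, rf):
    return raw_ops_per_server // rf

low = logical_per_server(20_000, 3)    # 6666 - the answer rounds this to ~5k
high = logical_per_server(120_000, 3)  # 40000
print(low, high)
```

Comparing your cassandra-stress numbers against this range tells you whether the cluster itself is healthy before you blame (or tune) the PHP layer.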
Right now, you're doing too much and analyzing too little. Break the problem into smaller pieces. Solve the individual pieces, then put the puzzle together.
Edit: Cassandra 2.1.3 is getting pretty old. It has some serious bugs. Use 2.1.11 or 2.2.3. If you're just starting development, 2.2.3 may be OK (and let's assume you'll actually go to production with 2.2.5 or so). If you're ready to go prod tomorrow, use 2.1.x instead.

Related

Why are queries getting progressively slower when using Postgres with Spring Batch?

I'm running a job using Spring Batch 4.2.0 with Postgres (11.2) as the backend, all wrapped in a Spring Boot app. I have 5 steps, each using a simple partitioning strategy to divide the data by id ranges; each partition is processed by a separate thread. There are about 18M rows in the table; each step reads all 18M rows, changes a few fields, and writes them back. The issue I'm facing is that the queries pulling data into each thread scan by id range, like:
select field_1, field_2, field_66 from table where id >= 1 and id < 10000.
In this case each thread processes 10,000 rows at a time. When there's no other traffic, the query takes less than a second to read all 10,000 rows. But when the job runs, there are about 70 threads reading that data in, and it gets progressively slower, up to almost a minute and a half. Any ideas where to start troubleshooting this?
I do see autovacuum running in the background for almost the whole duration of the job. The JVM definitely has enough memory to hold all that data (about 6 GB max heap). Postgres has sufficient shared_buffers (2 GB) and max_wal_size (2 GB), but I'm not sure if that in itself is enough. Another thing I see is loads of COMMIT queries hanging around when checking pg_stat_activity - usually as many as the number of partitions. So instead of 70 connections being used by 70 partitions, there are 140 connections, 70 of them running COMMIT. As time progresses these COMMITs get progressively slower too.
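The id-range partitioning described above can be sketched as follows (the 18M rows and 10,000-row chunks are from the description; the partitioner itself is a simplified stand-in for Spring Batch's, not its actual code):

```python
# Simplified stand-in for a Spring Batch range partitioner: split
# [min_id, max_id] into half-open chunks that each thread reads with
# "WHERE id >= lo AND id < hi".
def partition_ranges(min_id, max_id, chunk):
    ranges = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk, max_id + 1)
        ranges.append((lo, hi))
        lo = hi
    return ranges

parts = partition_ranges(1, 18_000_000, 10_000)
print(len(parts), parts[0])  # 1800 (1, 10001)
```

With 5 steps over 1800 chunks each, the job issues thousands of these range scans, so any per-query slowdown (e.g. from the COMMIT pileup described above) compounds quickly.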
You are probably hitting https://github.com/spring-projects/spring-batch/issues/3634.
This issue has been fixed and will be part of version 4.2.3 planned to be released this week.

How to get 2+ million floats a second from a DB to a consumer?

A few days ago I was granted the very interesting project, of recoding the data input of a large simulation framework.
Before I accepted that task, it used an ... interesting ... file reader. That sadly didn't work.
The task is easy to describe: get up to 2+ million floats per second from a database (originally PostgreSQL was planned; now looking into Elasticsearch or Cassandra) into a Java proxy, then remap, buffer, and feed them to a high-performance data bus.
The second part was easy enough, and the first part was standard, right? Erm, sadly wrong.
As it turned out, I somehow smashed into a brick wall of "not more than 1 Mb / s"... which is only 5% of the transfer rate needed. Yes, yes, the problem could surely be killed with more hardware, but why introduce a "built in brake", if there might be better solutions?
A tad background on the data:
A unit consists of a uuid, a type and a timeseries of values (floats). One float per minute for a whole year -> 60*24*365 = 525,600 floats.
The consumer asks for 150k units, at 15 time-steps per second -> 2.25 million floats per second.
We plan to pre-split the timeseries into day-batches, which seems a tad more manageable. Even then, that's about 9 kB x 150k units = 1.35 GB of data for the buffer.
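The sizing arithmetic above can be checked in a few lines (the ~9 kB per unit-day figure is the question's own buffer estimate, which evidently includes per-float overhead beyond the raw 4-byte float4):

```python
floats_per_year = 60 * 24 * 365   # one float per minute for a year: 525,600
floats_per_sec = 150_000 * 15     # 150k units x 15 time-steps: 2.25M/s
floats_per_day = 60 * 24          # floats in one unit's day-batch: 1440
raw_bytes = floats_per_day * 4    # raw float4 size of a unit-day: 5760 B
buffered = 9_000 * 150_000        # question's ~9 kB/unit-day estimate: 1.35 GB
print(floats_per_year, floats_per_sec, floats_per_day, raw_bytes, buffered)
```

Note the gap between raw float4 size (~5.8 kB/unit-day, i.e. ~864 MB total) and the ~9 kB/1.35 GB estimate: storage and serialization overhead is a significant fraction of the payload at this scale.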
So, I tried several different ways to store and retrieve the data.
In PostgreSQL the most promising approach so far is a table:
uuid uuid,
type text,
t1...t365 float4[]
PK is the combination of uuid and type.
Simple enough, I only need to SELECT tx FROM table.
Info on my testing setup: as I can't yet test with a full dataset (even faking the data is painstakingly slow, at 2 sec per unit) I usually test with batches of 2k units.
Under these conditions the SELECT takes 12 seconds. According to my tests that's NOT due to a transmission bottleneck (same result with the DB on the same machine). That is roughly 5% of the speed I need, and much slower than I assumed it would be. 1 Mb/s???
I haven't found any apparent bottleneck yet, so throwing more hardware at it, might even not work at all.
Anyone knowledgeable about PostgreSQL have an idea what's slowing it down so much?
OK, so I tried another system. ElasticSearch might be faster, and also solve some other questions, like a nasty cluster access API I might have to add later on.
mapping: {
  _index: timeseries_ID,
  _type: unit,
  uuid: text (keyword),
  type: text (keyword),
  day: integer,
  data: array[float]
}
To my surprise, this was even slower. Only 100k floats per second tops. Only 2% of what I need. Yes yes I know, add more nodes, but isn't ElasticSearch supposed to be a tad faster than Postgres?
Anyone an idea for an ElasticSearch mapping or any configs that might be more promising?
Now, on Monday I'll get to work on my tests with Cassandra. The planned table is simple:
  day int PRIMARY KEY,
  uuid text,
  type text,
  data array[float]
Is there a more sensible setup for this kind of data for Cassandra?
Or does anyone have a better idea altogether to solve this problem?
Clarifications
If I do understand what you want I can't see why you are using a database at all.
The reason for using a DB is that the data comes from outside people with varying IT skills. So we need central storage for the data, with an API and maybe a simple web frontend to check integrity, clean the data up, and reformat it into our own internal format. A database seems to work better on that front than some proprietary file format, which could more easily be clogged and made unusable by inexperienced data contributors.
I'm not sure it's clear on what you're trying to do here or why you seem to be randomly trying different database servers.
It's not random. Postgres was a typical case of someone going: "oh, but we already use that one for our data, so why should I have to learn something new?"
The Elasticsearch approach is trying to leverage the distributed cluster and replication features. Thus we can have a central permanent storage, and just order a temporary ES cluster on our vms with replicates of the needed data. This way ES would handle all the data transport into the computer cluster.
Cassandra is a suggestion from my boss, as it's supposed to be much more scalable than Postgres. Plus it could comfort those who prefer a more SQL-like API.
But - can you show e.g. the query in PostgreSQL you are trying to run - is it multiple uuids but a short time-period or one uuid and a longer one?
Postgres: The simplest approach is to get all uuids and exactly one daybatch. So a simple:
SELECT t1 FROM table;
Where t1 is the column holding the data for dayset one.
For testing with my (till now) limited fake data (roughly 2% of the full set) I sadly have to go with: SELECT t1, t2 ... t50 FROM table
Depending on testing, I might also go with splitting that one large select into some/many smaller ones. I'm thinking about going by a uuid-hash based split, with indexes set accordingly of course. It's all a question of balancing overhead and reliability. Nothing is final yet.
Are we talking multiple consumers or just one?
At the start one consumer, planned is to have multiple instances of the simulation. Much smaller instances though. The 150k unit one is supposed to be the "single large consumer" assumption.
How often are the queries issued?
As needed, in the full fledged approach that would be every 96 seconds. More often if I switch towards smaller queries.
Is this all happening on the same machine or is the db networked to this Java proxy and if so by what.
At the moment I'm testing on one or two machines only: preliminary tests are made solely on my workstation, with later tests moving the DB to a second machine. They communicate over a standard gigabit LAN.
In the full fledged version the simulation(s) will run on vms on a cluster with the DB having a dedicated strong server for itself.
add the execution plan generated using explain (analyze, verbose)
I had to jury-rig something with a small batch (281 units) only:
Seq Scan on schema.fake_data (cost=0.00..283.81 rows=281 width=18) (actual time=0.033..3.335 rows=281 loops=1)
Output: t1, Planning time: 0.337 ms, Execution time: 1.493 ms
Executing the query for real: 10 sec for a mere 1.6 MB.
Now faking a 10k unit by calling t1-t36 (I know, not even close to the real thing):
Seq Scan on opsim.fake_data (cost=0.00..283.81 rows=281 width=681) (actual time=0.012..1.905 rows=281 loops=1)
Output: *, Planning time: 0.836 ms, Execution time: 2.040 ms
Executing it for real: 2 min for ~60 MB.
The problem is definitely not the planning or execution. Neither is it the network, as I get the same slow read on my local system. But heck, even a slow HDD manages at LEAST 30 MB/s, and a cheap network 12.5 MB/s... I know, I know, that's gross throughput, but how come I get < 1 MB/s out of these DBs? Is there some bandwidth limit per connection? Aunt Google at least gave me no indication of anything like that.
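Putting numbers on "definitely not the planner": the end-to-end rates computed from the two runs quoted above work out to a tiny fraction of the hardware floor mentioned (derived purely from the quoted sizes and wall-clock times):

```python
# Effective transfer rates from the two measurements above.
def rate_mb_per_s(megabytes, seconds):
    return megabytes / seconds

run1 = rate_mb_per_s(1.6, 10)   # 0.16 MB/s for the 281-unit SELECT t1
run2 = rate_mb_per_s(60, 120)   # 0.5 MB/s for the t1-t36 run
print(run1, run2)               # both far below even a slow HDD's ~30 MB/s
```

Since planning and execution together account for only a few milliseconds, nearly all of those 10-120 seconds is spent somewhere between the executor and the client (result serialization, row-by-row fetching, and driver deserialization are the usual suspects).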

Cassandra partition size and performance?

I was playing around with cassandra-stress tool on my own laptop (8 cores, 16GB) with Cassandra 2.2.3 installed out of the box with having its stock configuration. I was doing exactly what was described here:
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
And measuring its insert performance.
My observations were:
using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications I had ~7000 inserts/sec.
when modifying line 35 in the code above (cluster: fixed(1000)) to "cluster: fixed(100)", i. e. configuring my test data distribution to have 100 clustering keys instead of 1000, the performance was jumping up to ~11000 inserts/sec
when configuring it to have 5000 clustering keys per partition, the performance was reducing to just 700 inserts/sec
The documentation says however Cassandra can support up to 2 billion rows per partition. I don't need that much still I don't get how just 5000 records per partition can slow the writes 10 times down or am I missing something?
Supporting something is a little different from it being the best-performing option. You can have very wide partitions, but the rule of thumb is to keep them under 100 MB for miscellaneous performance reasons. Some operations can be performed more efficiently when the entire partition can be held in memory.
As an example (an old one - this is a complete non-issue post-2.0, where everything is single-pass): in some versions, when a partition's size exceeded 64 MB, compaction used a two-pass process, which halved compaction throughput. It still worked with huge partitions - I've seen many multi-GB ones that worked just fine - but systems with huge partitions were difficult to work with operationally (managing compactions/repairs/GCs).
I would say target the 100 MB rule of thumb initially and test from there to find your own optimum. Things always behave differently based on use case; to get the most out of a node, the best you can do is run benchmarks as close as possible to what you're actually going to do (true of all systems). This seems like something you're already doing, so you're definitely on the right path.
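A quick way to apply the 100 MB rule of thumb is to estimate partition size from rows-per-partition and average row size (the 300-byte row size below is a hypothetical input, not from the question - measure yours from your actual schema):

```python
# Estimate partition size and compare against the ~100 MB rule of thumb.
def partition_mb(rows_per_partition, avg_row_bytes):
    return rows_per_partition * avg_row_bytes / 1e6

# Hypothetical 300-byte rows at the cluster sizes tried in the question:
for rows in (100, 1000, 5000):
    mb = partition_mb(rows, 300)
    print(rows, mb, mb <= 100)
```

All three settings from the stress test are far under 100 MB by this estimate, which supports the point that the slowdown at 5000 clustering keys is about write-path behavior per partition, not about hitting the size ceiling.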

Hbase concurrency making it slow

I have 1 master server and 5 region servers, each with 200 GB of disk space and 16 GB of RAM. I created a table in HBase with 10 million records. I am using HBase 0.96 on Hadoop 2.
Table Name - sh_self_profiles
column family - profile
In this table, we have 30 columns in each row.
When I get a single column value from HBase, it takes around 10 ms. My problem is that when I issue 100 or more concurrent requests, the latency slowly accumulates and increases to more than 400 ms instead of staying at 10 ms. When 100 requests are issued sequentially, each takes only 10 ms.
One thing that you should check is how well distributed your table is.
You can do this by going to the HBase master web console (http://&lt;master&gt;:60010), where you will be able to see how many regions you have for your table. If you have not done anything special at table creation you could easily have only one or two regions, which means that all requests are being directed to a single region server.
If this is the case, you can recreate your table with pre-split regions (I would suggest a multiple of 5, such as 15 or 20), and make sure that the concurrent gets that you are doing are equally spread over the row-key space.
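If the row keys are evenly distributed (e.g. hashed hex strings), the pre-split points can be computed by dividing the keyspace evenly. A sketch of just the boundary computation (the HBase shell/API call that consumes these split points is omitted; the 4-byte hex prefix is an assumption about the key format):

```python
# Compute N-1 split points that divide an evenly distributed hex
# row-key space into N regions (for pre-splitting at table creation).
def hex_split_points(num_regions, key_bytes=4):
    space = 16 ** (2 * key_bytes)   # size of the hex keyspace
    width = 2 * key_bytes           # hex digits per split point
    return [format(space * i // num_regions, "0{}x".format(width))
            for i in range(1, num_regions)]

print(hex_split_points(4))  # ['40000000', '80000000', 'c0000000']
```

With, say, 15 or 20 regions spread over the 5 region servers, concurrent gets that are uniform over the key space land on different servers instead of queueing on one.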
Also, please check how much RAM you have allocated to the region servers - you might need to increase it from the default. If you are not running anything other than the HBase RegionServer on those machines, you could probably increase it to 8 GB.
Other than that, you could also adjust the default for hbase.regionserver.handler.count.
I hope this helps.
Which client are you using? Are you using the standard Java client, the Thrift client, the HTTP REST client, or something else? If your use case is a high amount of random reads of single column values, I highly recommend you try asynchbase as it is much faster than the standard synchronous Java client.

Duplicate Key Filtering

I am looking for a distributed solution to screen/filter a large volume of keys in real-time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I am looking for a system to store a rolling 10 days’ worth of keys, at approximately 100 bytes per key. I was wondering how this type of large scale problem has been solved before using Hadoop. Would HBase be the correct solution to use? Has anyone ever tried a partially in-memory solution like Zookeeper?
I can see a number of solutions to your problem, but the real-time requirement really narrows it down. By real-time, do you mean you want to see whether a key is a duplicate as it's being created?
Let's talk about queries per second. You say 100B/day (that's a lot, congratulations!). That's 1.15 Million queries per second (100,000,000,000 / 24 / 60 / 60). I'm not sure if HBase can handle that. You may want to think about something like Redis (sharded perhaps) or Membase/memcached or something of that sort.
If you were to do it in HBase, I'd simply push the upwards of a trillion keys (10 days x 100B keys/day) in as the row keys of the table, and put some value in there to store them (because you have to). Then you can just do a get to figure out whether the key is in there. This is kind of hokey and doesn't really play to HBase's strengths, since it only uses the keyspace - effectively, HBase becomes a B-tree service in this case. I don't think this is a good idea.
If you relax the constraint of not having to do it in real time, you could use MapReduce in batch to dedup. That's pretty easy: it's just Word Count without the counting. You group by the key you have, and you'll see the duplicates in the reducer if multiple values come back. With enough nodes and enough latency, you can solve this problem efficiently. Here is some example code for this from the MapReduce Design Patterns book: https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch3/DistinctUserDriver.java
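The "Word Count without the counting" idea can be sketched with plain map/sort/reduce steps (this is a toy in-memory version of the pattern in the linked DistinctUserDriver, not the Hadoop code itself):

```python
# Toy MapReduce-style dedup: map each record to its key, sort (the
# shuffle/sort phase), then emit each key once from the "reducer".
from itertools import groupby

def dedup(records, key_of):
    mapped = sorted(key_of(r) for r in records)  # shuffle/sort phase
    return [k for k, _ in groupby(mapped)]       # reduce: emit each key once

records = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d")]
print(dedup(records, lambda r: r[0]))  # ['k1', 'k2', 'k3']
```

In real MapReduce the sort is done by the framework across nodes, so the only code you write is the identity-ish mapper and a reducer that emits the key.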
ZooKeeper is for distributed process communication and synchronization. You don't want to be storing trillions of records in zookeeper.
So, in my opinion, you're better served by an in-memory key/value store such as Redis, but you'll be hard pressed to store that much data in memory.
I am afraid this is impossible with traditional systems :|
Here is what you have mentioned:
100 billion per day means approximately 1 million per second!!!!
The size of each key is 100 bytes.
You want to check for duplicates in a 10-day working set, which means 1 trillion items.
These assumptions result in lookups against a set of 1 trillion objects totalling 90 TERABYTES!!!!!
Any solution to this real-time problem must provide a system that can look up 1 million items per second in this volume of data.
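Those figures can be checked directly (with the same rounding the answer uses; the "90 terabytes" is the binary-prefixed reading of 100 TB):

```python
keys_per_day = 100_000_000_000
qps = keys_per_day / 86_400      # ~1.16M lookups/second
working_set = keys_per_day * 10  # 1 trillion keys over 10 days
total_bytes = working_set * 100  # 100 bytes per key = 100 TB
tib = total_bytes / 2**40        # ~90.9 TiB, the "90 terabytes"
print(round(qps), working_set, round(tib, 1))
```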
I have some experience with HBase, Cassandra, Redis, and Memcached. I am sure you cannot achieve this performance on any disk-based storage like HBase, Cassandra, or HyperTable (and add any RDBMS like MySQL, PostgreSQL, etc. to that list). The best performance of Redis and Memcached that I have heard of in practice is around 100k operations per second on a single machine. That means you would need 90 machines, each with 1 TERABYTE of RAM!!!!!!!!
Even a batch processing system like Hadoop cannot do this job in less than an hour, and I guess it would take hours or days even on a big cluster of 100 machines.
You are talking about very, very, very big numbers (90 TB, 1M per second). Are you sure about this???