How to get 2+ million floats a second from a DB to a consumer? - performance

A few days ago I was handed the very interesting project of recoding the data input of a large simulation framework.
Before I took it over, it used an ... interesting ... file reader. That sadly didn't work.
The task is easy to describe: get up to 2+ million floats per second from a database (originally planned PostgreSQL, now also looking into Elasticsearch and Cassandra) into a Java proxy, then remap, buffer and feed them to a high-performance data bus.
The second part was easy enough, and the first part should have been standard, right? Erm, sadly wrong.
As it turned out, I somehow smashed into a brick wall of "not more than 1 MB/s"... which is only 5% of the transfer rate needed. Yes, yes, the problem could surely be killed with more hardware, but why introduce a "built-in brake" if there might be better solutions?
A bit of background on the data:
A unit consists of a uuid, a type and a timeseries of values (floats). One float per minute for a whole year -> 60*24*365 = 525,600 floats.
The consumer asks for 150k units, at 15 time-steps per second -> 2.25 million floats per second.
We plan to pre-split the timeseries into day-batches, which seems a tad more manageable. Even at this rate that's about 9 kB * 150k units ≈ 1.35 GB of data for the buffer.
So, I tried several different ways to store and retrieve the data.
In PostgreSQL the most promising approach so far is a table:
uuid uuid,
type text,
t1...t365 float4[]
The PK is the combination of uuid and type.
Simple enough, I only need to SELECT tx FROM table.
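In case it helps, here is a rough sketch of what a plain JDBC read of one day-batch looks like on the Java side (illustrative only, not my exact code; connection details and names are placeholders). One detail that matters: with autocommit on, the stock PostgreSQL JDBC driver materializes the whole result set in memory before returning it, so the autocommit/fetch-size settings below are not optional.

import java.sql.*;

public class DayBatchReader {
    public static void main(String[] args) throws SQLException {
        // connection details are placeholders
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/simdata", "user", "pass")) {

            // cursor-based fetching in the PG driver needs autocommit off
            // and a non-zero fetch size, otherwise the whole result set
            // is pulled into memory in one go
            con.setAutoCommit(false);

            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT uuid, type, t1 FROM fake_data")) {
                ps.setFetchSize(500);   // rows per round trip, tune as needed

                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        String uuid = rs.getString("uuid");
                        String type = rs.getString("type");
                        // float4[] comes back as a java.sql.Array of Float
                        Float[] day = (Float[]) rs.getArray("t1").getArray();
                        // ... remap and hand off to the buffer / data bus
                    }
                }
            }
            con.commit();
        }
    }
}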
Info on my testing setup: as I can't yet test with a full dataset (even faking it is painfully slow, at 2 seconds per unit) I usually test with batches of 2k units.
Under those conditions the SELECT takes 12 seconds. According to my tests that's NOT due to a transmission bottleneck (same result if I use a DB on the same machine). That is roughly 5% of the speed I need, and much slower than I assumed it would be. 1 MB/s???
I haven't found any apparent bottleneck yet, so throwing more hardware at it might not even work at all.
Anyone knowledgeable about PostgreSQL have an idea what's slowing it down so much?
OK, so I tried another system. Elasticsearch might be faster, and could also solve some other issues, like a nasty cluster access API I might have to add later on.
mapping: {
  _index: timeseries_ID
  _type: unit
  uuid: text (keyword)
  type: text (keyword)
  day: integer
  data: array[float]
}
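To give an idea of the retrieval side, this is roughly the kind of scroll-based query I have in mind; just an illustrative sketch, assuming the Elasticsearch high-level REST Java client, with index/field names as placeholders (the TimeValue and client imports move around between client versions):

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class EsDayBatchReader {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // ask for one day of all units, in large scroll pages
            SearchRequest request = new SearchRequest("timeseries_id");
            request.source(new SearchSourceBuilder()
                    .query(QueryBuilders.termQuery("day", 1))
                    .size(10_000));
            request.scroll(TimeValue.timeValueMinutes(1));

            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            while (response.getHits().getHits().length > 0) {
                for (SearchHit hit : response.getHits().getHits()) {
                    // "data" holds the day-batch of floats for one unit;
                    // JSON numbers come back as Double from the source map
                    @SuppressWarnings("unchecked")
                    java.util.List<Double> data =
                            (java.util.List<Double>) hit.getSourceAsMap().get("data");
                    // ... remap and buffer
                }
                SearchScrollRequest scroll = new SearchScrollRequest(response.getScrollId());
                scroll.scroll(TimeValue.timeValueMinutes(1));
                response = client.scroll(scroll, RequestOptions.DEFAULT);
            }
        }
    }
}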
To my surprise, this was even slower: only 100k floats per second tops, just 2% of what I need. Yes, yes, I know, add more nodes, but isn't Elasticsearch supposed to be a tad faster than Postgres?
Does anyone have an idea for an Elasticsearch mapping or any configs that might be more promising?
Now, on Monday I'll get to my tests with Cassandra. The planned table is simple:
day int PRIMARY KEY,
uuid text,
type text,
data list<float>
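On the Java side I would expect the read to look roughly like this; again only a sketch, assuming the DataStax Java driver 3.x, with keyspace and table names as placeholders:

import java.util.List;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraDayBatchReader {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("simdata")) {

            // fetch all units for one day-batch
            ResultSet rs = session.execute(
                    "SELECT uuid, type, data FROM timeseries_by_day WHERE day = ?", 1);

            for (Row row : rs) {
                String uuid = row.getString("uuid");
                String type = row.getString("type");
                List<Float> data = row.getList("data", Float.class);  // the day's floats
                // ... remap and buffer
            }
        }
    }
}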
Is there a more sensible setup for this kind of data in Cassandra?
Or does anyone have a better idea altogether to solve this problem?
Clarifications
If I understand what you want, I can't see why you are using a database at all.
The reason for using a DB is that the data comes from outside people with varying IT skills. So we need a central storage for the data, with an API as well as maybe a simple web frontend to check the integrity, clean it up and reformat it into our own internal system. A database seems to work better on that front than some proprietary file system, which could more easily be clogged and made unusable by inexperienced data contributors.
I'm not sure it's clear what you're trying to do here or why you seem to be randomly trying different database servers.
It's not random. Postgres was a typical case of someone going: "oh, but we already use that one for our data, so why should I have to learn something new?"
The Elasticsearch approach is trying to leverage the distributed cluster and replication features. That way we can have a central permanent storage and just order a temporary ES cluster on our VMs with replicas of the needed data. This way ES would handle all the data transport into the compute cluster.
Cassandra is a suggestion from my boss, as it's supposed to be much more scalable than Postgres. Plus it could comfort those who prefer a more SQL-like API.
But - can you show e.g. the query in PostgreSQL you are trying to run - is it multiple uuids over a short time period, or one uuid over a longer one?
Postgres: the simplest approach is to get all uuids and exactly one day-batch. So a simple:
SELECT t1 FROM table;
where t1 is the column holding the data for day-set one.
For testing with my (so far) limited fake data (roughly 2% of the full set) I sadly have to go with: SELECT t1, t2 ... t50 FROM table
Depending on testing, I might also split that one large SELECT into some/many smaller ones. I'm thinking about a uuid-hash based split, with indexes set accordingly of course; a rough sketch of what I mean is below. It's all a question of balancing overhead and reliability. Nothing is final yet.
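Roughly what I mean by the split (only a sketch; it assumes a hypothetical indexed shard column pre-computed from the uuid hash, and plain JDBC with placeholder connection details):

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

public class ShardedFetch {
    static final int SHARDS = 8;  // number of partial queries, to be tuned

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(SHARDS);
        List<Future<Integer>> results = new ArrayList<>();

        for (int shard = 0; shard < SHARDS; shard++) {
            final int s = shard;
            results.add(pool.submit(() -> {
                // one connection per worker; the shard column is assumed to be
                // a small int derived from hashing the uuid, with an index on it
                try (Connection con = DriverManager.getConnection(
                         "jdbc:postgresql://dbhost/simdata", "user", "pass");
                     PreparedStatement ps = con.prepareStatement(
                         "SELECT uuid, t1 FROM fake_data WHERE shard = ?")) {
                    ps.setInt(1, s);
                    int rows = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            Float[] day = (Float[]) rs.getArray("t1").getArray();
                            rows++;  // ... remap and buffer
                        }
                    }
                    return rows;
                }
            }));
        }
        for (Future<Integer> f : results) f.get();  // wait for all partial fetches
        pool.shutdown();
    }
}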
Are we talking multiple consumers or just one?
At the start one consumer; the plan is to have multiple instances of the simulation later, but much smaller ones. The 150k-unit case is meant to be the "single large consumer" assumption.
How often are the queries issued?
As needed; in the full-fledged approach that would be every 96 seconds, more often if I switch to smaller queries.
Is this all happening on the same machine, or is the DB networked to this Java proxy, and if so by what?
At the moment I'm testing on one or two machines only: preliminary tests are run solely on my workstation, with later tests moving the DB to a second machine. They are connected over a standard gigabit LAN.
In the full-fledged version the simulation(s) will run on VMs on a cluster, with the DB having a dedicated strong server to itself.
Add the execution plan generated using EXPLAIN (ANALYZE, VERBOSE).
I had to jury-rig something with a small batch (281 units) only:
Seq Scan on schema.fake_data (cost=0.00..283.81 rows=281 width=18) (actual time=0.033..3.335 rows=281 loops=1)
  Output: t1
Planning time: 0.337 ms
Execution time: 1.493 ms
Executing the thing for real: 10 seconds for a mere 1.6 MB.
Now faking a 10k-unit query by selecting t1-t36 (I know, not even close to the real thing):
Seq Scan on opsim.fake_data (cost=0.00..283.81 rows=281 width=681) (actual time=0.012..1.905 rows=281 loops=1)
  Output: *
Planning time: 0.836 ms
Execution time: 2.040 ms
Executing the thing for real: 2 minutes for ~60 MB.
The problem is definitely not the planning or execution. Neither is it the network, as I get the same slow read on my local system. But heck, even a slow HDD manages at LEAST 30 MB/s, and a cheap network 12.5 MB/s ... I know, I know, those are gross figures, but how come I get < 1 MB/s out of those DBs? Is there some bandwidth limit per connection? Aunt Google at least gave me no indication of anything like that.

Related

extremely high SSD write rate with multiple concurrent writers

I'm using QuestDB as the backend for storing collected data, using the same script for different data sources.
My problem is the extremely high disk (SSD) usage. Over 4 days it has written 335 MB per second.
What am I doing wrong?
I am inserting data using the ILP interface:
sender.row(
    metric,            # table name
    symbols=symbols,   # symbol (tag) columns
    columns=data,      # the measured values
    at=row['ts']       # designated timestamp of the row
)
I don't know how much data you are ingesting, so I'm not sure whether 335 MB per second is a lot or not. But since you are surprised by it, I am going to assume your actual throughput is lower than that. It might be the case that your data is out of order, especially if you are ingesting from multiple data sources.
QuestDB always keeps the data in a table in incremental order by the designated timestamp. If data arrives out of order, the whole partition needs to be rewritten. This can lead to write amplification, where your data is being rewritten very often.
Until literally a few days ago you would need to change the default config to fine-tune this, but since version 6.6.1 it is adjusted dynamically.
Maybe you want to give version 6.6.1 a try. Alternatively, if data from different sources arrives out of order (relative to each other), you might want to create separate tables for the different sources, so data is always in order within each table.
I have been experimenting a lot and it seems that you're absolutely right. I was ingesting 14 different clients into a single table. After splitting this into 14 tables, one for each client, the problem disappeared.
Another advantage is that I need one symbol less, as I no longer have to distinguish the rows.
By the way - thank you and your team for this marvellous tool you gave us! It makes my work so much easier!!
Regards

Cassandra partition size and performance?

I was playing around with the cassandra-stress tool on my own laptop (8 cores, 16 GB) with Cassandra 2.2.3 installed out of the box in its stock configuration. I was doing exactly what is described here:
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
and measuring the insert performance.
My observations were:
Using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications I got ~7000 inserts/sec.
When modifying line 35 in the code above from "cluster: fixed(1000)" to "cluster: fixed(100)", i.e. configuring my test data distribution to have 100 clustering keys per partition instead of 1000, the performance jumped up to ~11000 inserts/sec.
When configuring it to have 5000 clustering keys per partition, the performance dropped to just 700 inserts/sec.
The documentation says, however, that Cassandra can support up to 2 billion rows per partition. I don't need that many, but I still don't get how just 5000 records per partition can slow the writes down 10 times. Am I missing something?
Supported is a little different from "best performing". You can have very wide partitions, but the rule of thumb is to try to keep them under 100 MB, for miscellaneous performance reasons. Some operations can be performed more efficiently when the entire partition fits in memory.
As an example (this is an old example, and a complete non-issue post 2.0, where everything is single pass): in some versions, when the partition size is >64 MB, compaction becomes a two-pass process, which halves compaction throughput. It still worked with huge partitions; I've seen many multi-GB ones that worked just fine. But the systems with huge partitions were difficult to work with operationally (managing compactions/repairs/GCs).
I would say target the rule of thumb of 100 MB initially and test from there to find your own optimum. Things will always behave differently based on the use case; to get the most out of a node, the best you can do is run benchmarks as close as possible to what you're actually going to do (true of all systems). This seems like something you're already doing, so you're definitely on the right path.

High speed time series query postgresql hardware considerations

Just some general questions on hardware choices for polling lots of low-to-medium density sensor data time series as fast as possible.
System Overview
The data consists of multiple time series of approx. 50-100K data points from various sensors at different locations at different times. There are ~15 columns of data in the main table, but some columns have long array values in them. I'm interfacing with C#/Python to send queries to a local database and then work on the returned data.
The Problem and Some Thoughts
As it stands, the logging server is on average hardware (budget 4 TB HDDs, basic quad-core, 8 GB RAM) and, depending on the amount of data requested in the query, it takes forever (re: "annoying amounts of waiting") to return an entire time series. I have thousands of independent time series and I want to compare multiple ones against each other. Returning a single time series of 50K rows with a SELECT * can take 30+ seconds, but as little as 100 ms for only a few columns. A lot of different queries are used, so I don't have the luxury of caching for repeated queries.
So what I was thinking is: instead of working directly from the online server, make an offline copy of the database onto more dedicated hardware that can be used for faster analysis. I don't need all of the data at any one time (i.e. I can pick a location and time range, copy that offline, and work on that).
The table design is very basic. The most often queried table has the primary key:
PRIMARY KEY (location_id, time_logged, sensor_id)
with a simple query like:
SELECT * FROM table
WHERE location_id = 2154321 AND sensor_id = 254;
This is the most common query and will be used along with some additional WHERE conditions.
There are thousands of location_ids and potentially dozens of sensor_ids for each location_id.
The Questions
(with regard to the above table/pk/query setup)
How much does more RAM help with faster queries, i.e. 64 GB vs 8 GB?
How much does a faster CPU / more cores help with faster queries? What kind of CPU (quad/6/12+ cores) would provide the best speedup? Are there diminishing returns?
How would one set up hard drives to help with faster queries (RAID cluster with SSDs or mechanical drives)?
Would a couple or even 4+ cheap mechanical hard drives in RAID provide a significant speedup?
I've read about columnar stores and how they can be useful for time series (https://www.citusdata.com/blog/76-postgresql-columnar-store-for-analytics). Can anyone shed any insight on this, and is it worth setting up?
Will increasing the planner statistics target help significantly?
Any general recommendations / first steps to get the best kind of query speedup? Another dedicated PC with many cores and lots of RAM? A NAS? A dedicated PC with a multi-drive RAID?
I'm relatively new to working with databases, so I don't really know what to expect in terms of performance; any pointers would be helpful.
Thanks!

mongodb performance: string length

Is there an optimum length for short strings in MongoDB with performance in mind?
I'm currently implementing a comment system limiting the comment length to somewhere around 150-300 chars, and was wondering if there is a string length in that general range that would be more performant than others.
The thing about MongoDB is that performance is generally hardware dependent - the only way you can really find out is to test this on the hardware you'll be using in production, with test data as close to real data as possible.
I've conducted quite a few tests on MongoDB, both on my laptop and on a Xeon server. I noticed horrible results on the laptop, e.g. a bulk insert of 10,000 records would take 90 seconds. But the same test on the server took 0.2 seconds, which I wasn't expecting. Of course the server was going to be faster, but my point is that you can't really make any assumptions about speed based on others' results.

Riak performance - unexpected results

In the last few days I played a bit with Riak. The initial setup was easier than I thought. Now I have a 3-node cluster, with all nodes running on the same VM for the sake of testing.
I admit the hardware settings of my virtual machine are very much downgraded (1 CPU, 512 MB RAM), but still I am quite surprised by the slow performance of Riak.
Map Reduce
Playing a bit with map reduce, I had around 2000 objects in one bucket, each about 1k-2k in size as JSON. I used this map function:
function(value, keyData, arg) {
  // parse the stored Riak object as JSON
  var data = Riak.mapValuesJson(value)[0];
  // keep only objects whose displayname contains "max"
  if (data.displayname.indexOf("max") !== -1) return [data];
  return [];
}
And it took over 2 seconds just to perform the HTTP request and return its result, not counting the time it took in my client code to deserialize the results from JSON. Removing 2 of the 3 nodes seemed to slightly improve the performance to just below 2 seconds, but this still seems really slow to me.
Is this to be expected? The objects were not that large in byte size, and 2000 objects in one bucket isn't that much either.
Insert
Batch inserting around 60,000 objects of the same size as above took rather long and actually didn't really work.
My script, which inserted the objects into Riak, died at around 40,000 or so and said it couldn't connect to the Riak node anymore. In the Riak logs I found an error message which indicated that the node ran out of memory and died.
Question
This is really my first shot at Riak, so there is definitely a chance that I screwed something up.
Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I used for interacting with Riak is the limiting factor here?
Running all nodes on the same physical machine is rather stupid, but if this is a problem - how can I better test the performance of Riak?
Is map reduce really that slow? I read about the performance hit that map reduce has on the Riak mailing list, but if MapReduce is slow, how are you supposed to perform "queries" for data needed in near real time? I know that Riak is not as fast as Redis.
It would really help me a lot if anyone with more experience with Riak could help me out with some of these questions.
This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.
Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large your bucket is, enumerating the keys in a single bucket of size n will take time proportional to m, where m is the total number of keys in all buckets.
These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and MapReduce across large numbers of keys. Cassandra also implements true buckets.
A recommendation I'd make, now that some time has passed and several new versions of Riak have come out, is this: never rely on full-bucket map/reduce. That's not an optimized operation, and chances are very good there are other ways to optimize your map/reduce so you don't have to look through so much data to pull out the few records you need.
Secondary indexes, now available in newer versions of Riak, are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce over a secondary index with hundreds of thousands of keys in microseconds, where a full bucket scan would take multiple seconds.
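For illustration, a 2i lookup is just an HTTP GET against the index endpoint, roughly like the sketch below (plain JDK HTTP; host, port and bucket name are placeholders for your setup):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class Riak2iLookup {
    public static void main(String[] args) throws Exception {
        // query the ismax_int secondary index for the value 1;
        // "mybucket" and the host are placeholders
        URL url = new URL("http://localhost:8098/buckets/mybucket/index/ismax_int/1");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()))) {
            // Riak answers with a JSON document like {"keys":["key1","key2",...]}
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) body.append(line);
            System.out.println(body);  // the matching keys, fetch each one as usual
        } finally {
            con.disconnect();
        }
    }
}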
I don't have direct experience of Riak, but have worked with Cassandra a little, which is similar.
Firstly, performance will probably depend a lot on the number of cores available and the amount of memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4 GB+ of RAM would be a good starting point.
Secondly, MapReduce is designed for batch processing, not realtime queries.
Riak and all similar key-value stores are designed for high write performance and high read performance for simple lookups, with no complex querying at all.
Just for comparison, Cassandra on a single node (6 cores, 6 GB) can do 20,000 individual inserts per second.

Resources