I'm using QuestDB as backend for storing collected data using the same script for different data sources.
My problem ist the extremly high disk (ssd) usage. During 4 days it has written 335MB per second.
What am I doing wrong?
Inserting data using the ILP interface
sender.row(
metric,
symbols=symbols,
columns=data,
at=row['ts']
)
I don't know how much data you are ingesting, so not sure if 335 MB per second is much or not. But since you are surprised by it I am going to assume your throughput is lower than that. It might be the case your data is out of order, specially if ingesting from multiple data sources.
QuestDB keeps the data per table always in incremental order by designated timestamp. If data arrives out of order, the whole partition needs to be rewritten. This might lead to write amplification where you see your data is being rewritten very often.
Until literally a few days ago, to fine tune this you would need to change the default config, but since version 6.6.1, this is dynamically adjusted.
Maybe you want to give a try to version 6.6.1, or alternatively if data from different sources is arriving out of order (relative to each other), you might want to create separate tables for different sources, so data is always in order for each table.
I have been experimenting a lot and it seems that you're absolutely right. I was ingesting 14 different clients into a single table. After having splitted this to 14 tables, one for each client, the problem disappeared.
Another advantage is the fact that I need a symbol less as I do not have to distinguish the rows.
By the way - thank you and your team for this marvellous tool you gave us! It makes my work so much easier!!
Saludos
Related
I am trying to improve the performance of my database, which simplified set-up is the following :
EDIT
One table with 3 rows (id_device, timestamp, data) with a composite btree index (id_device, timestamp)
1k devices sending data every minute
The insert are quite fast, since PostgreSQL merely writes the rows in the order they are received. However, when trying to get many data with consecutive timestamp of a given device, the query is not so fast. The way I understand it is that due to the way the data is collected, there is never more than one row of a given device on each page of the table. Therefore, if I want to get 10k data with consecutive timestamp of a given device, PostgreSQL has to fetch 10k pages from disk. Besides, since this operation can be done on any of the 1k devices, those pages are not going to be kept in RAM.
I have tried to CLUSTER the table, and it indeed solve the performance issue, but this operation is incredibly long (~1 day) and it locks the entire table, so I discarded this solution.
I have read about the partitionning, but that would mean a lot of scripting if I need to add a new table every time a new devices is connected, and it seems to me a bit bug-prone.
I am rather confident in the fact that this set-up is not particularly original, so is there an advice I could use?
Thanks for reading,
Guillaume
I'm guessing your index also has low selectivity, because you're indexing device_id first (which are only 1000 different) and not timestamp first.
Depends on what you do with the data you fetch, but maybe the solution could be batching the operation, such as fetching the data for a predetermined period and processing data for all 1000 devices in one go.
I'm doing some research for a new project, for which the constraints and specifications have yet to be set. One thing that is wanted is a large number of paths, directly under the root domain. This could ramp up to millions of paths. The paths don't have a common structure or unique parts, so I have to look for exact matches.
Now I know it's more efficient to break up those paths, which would also help in the path lookup. However I'm researching the possibility here, so bear with me.
I'm evaluating methods to accomplish this, while maintaining excellent performance. I thought of the following methods:
Storing the paths in an SQL database and doing a lookup on every request. This seems like the worst option and will definitely not be used.
Storing the paths in a key-value store like Redis. This would be a lot better, and perform quite well I think (have to benchmark it though).
Doing string/regex matching - like many frameworks do out of the box - for this amount of possible matches is nuts and thus not really an option. But I could see how doing some sort of algorithm where you compare letter-by-letter, in combination with some smart optimizations, could work.
But maybe there are tools/methods I don't know about that are far more suited for this type of problem. I could use any tips on how to accomplish this though.
Oh and in case anyone is wondering, no this isn't homework.
UPDATE
I've tested the Redis approach. Based on two sets of keywords, I got 150 million paths. I've added each of them using the set command, with the value being a serialized string of id's I can use to identify the actual keywords in the request. (SET 'keyword1-keyword2' '<serialized_string>')
A quick test in a local VM with a data set of one million records returned promising results: benchmarking 1000 requests took 2ms on average. And this was on my laptop, which runs tons of other stuff.
Next I did a complete test on a VPS with 4 cores and 8GB of RAM, with the complete set of 150 million records. This yielded a database of 3.1G in file size, and around 9GB in memory. Since the database could not be loaded in memory entirely, Redis started swapping, which caused terrible results: around 100ms on average.
Obviously this will not work and scale nice. Either each web server needs to have a enormous amount of RAM for this, or we'll have to use a dedicated Redis-routing server. I've read an article from the engineers at Instagram, who came up with a trick to decrease the database size dramatically, but I haven't tried this yet. Either way, this does not seem the right way to do this. Back to the drawing board.
Storing the paths in an SQL database and doing a lookup on every request. This seems like the worst option and will definitely not be used.
You're probably underestimating what a database can do. Can I invite you to reconsider your position there?
For Postgres (or MySQL w/ InnoDB), a million entries is a notch above tiny. Store the whole path in a field, add an index on it, vacuum, analyze. Don't do nutty joins until you've identified the ID of your key object, and you'll be fine in terms of lookup speeds. Say a few ms when running your query from psql.
Your real issue will be the bottleneck related to disk IO if you get material amounts of traffic. The operating motto here is: the less, the better. Besides the basics such as installing APC on your php server, using Passenger if you're using Ruby, etc:
Make sure the server has plenty of RAM to fit that index.
Cache a reference to the object related to each path in memcached.
If you can categorize all routes in a dozen or so regex, they might help by allowing the use of smaller, more targeted indexes that are easier to keep in memory. If not, just stick to storing the (possibly trailing-slashed) entire path and move on.
Worry about misses. If you've a non-canonical URL that redirects to a canonical one, store the redirect in memcached without any expiration date and begone with it.
Did I mention lots of RAM and memcached?
Oh, and don't overrate that ORM you're using, either. Chances are it's taking more time to build your query than your data store is taking to parse, retrieve and return the results.
RAM... Memcached...
To be very honest, Reddis isn't so different from a SQL + memcached option, except when it comes to memory management (as you found out), sharding, replication, and syntax. And familiarity, of course.
Your key decision point (besides excluding iterating over more than a few regex) ought to be how your data is structured. If it's highly structured with critical needs for atomicity, SQL + memcached ought to be your preferred option. If you've custom fields all over and obese EAV tables, then playing with Reddis or CouchDB or another NoSQL store ought to be on your radar.
In either case, it'll help to have lots of RAM to keep those indexes in memory, and a memcached cluster in front of the whole thing will never hurt if you need to scale.
Redis is your best bet I think. SQL would be slow and regular expressions from my experience are always painfully slow in queries.
I would do the following steps to test Redis:
Fire up a Redis instance either with a local VM or in the cloud on something like EC2.
Download a dictionary or two and pump this data into Redis. For example something from here: http://wordlist.sourceforge.net/ Make sure you normalize the data. For example, always lower case the strings and remove white space at the start/end of the string, etc.
I would ignore the hash. I don't see the reason you need to hash the URL? It would be impossible to read later if you wanted to debug things and it doesn't seem to "buy" you anything. I went to http://www.sha1-online.com/, and entered ryan and got ea3cd978650417470535f3a4725b6b5042a6ab59 as the hash. The original text would be much smaller to put in RAM which will help Redis. Obviously for longer paths, the hash would be better, but your examples were very small. =)
Write a tool to read from Redis and see how well it performs.
Profit!
Keep in mind that Redis needs to keep the entire data set in RAM, so plan accordingly.
I would suggest using some kind of key-value store (i.e. a hashing store), possibly along with hashing the key so it is shorter (something like SHA-1 would be OK IMHO).
How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into couchbase buckets, and the performance implications of the views required on the data. My data consists of more than a dozen relational DBs, with at least half with hundreds of millions of rows in a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on and allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had an fairly large amount of views (around 8 design documents and some dd's with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries can make a single index that may appear crowded at first, but is actually very flexible.
The less design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rules to give you an exact member.
But based on the best practice document that you have used and some discussion(here) you should be able to design your database/views properly.
Let's start with the last question:
YES the reason why Couchbase advice to have a small number of bucket is for performance - and more importantly resources consumption- reason. I am inviting you to read these blog posts that help to understand what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
yes most the time your will organize the design document/and views by type of document.
It is NOT a problem to have all the document "types" in a single(few) buckets, this is in fact the way your work with Couchbase
The most important part to look is, the size of your doc (to see how "long" will be the parsing of the JSON) and how often the document will be created/updated, and also deleted, since the JS code of the view is ONLY executed when you create/change the document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how any views in each document you will have?
In fact the most expensive part is not during the indexing or quering it is more when you have to rebalance the data and indices between nodes (add, remove , failure of nodes)
Finally, but it looks like you already know it, this chapter is quite good to understand how views works (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.
I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up" and our website that uses the data receives timeouts when attempting to access the data. I even tried splitting it up into 100 record chunks and even tried a WAITFOR DELAY '000:00:5' to see if it would help to pause between merging the chunks. It's still rather sluggish.
I'm looking for any suggestions, best practices, or examples on how to merge large sets of data without locking the tables.
Thanks
Change your front end to use NOLOCK or READ UNCOMMITTED when doing the selects.
You can't NOLOCK MERGE,INSERT, or UPDATE as the records must be locked in order to perform the update. However, you can NOLOCK the SELECTS.
Note that you should use this with caution. If dirty reads are okay, then go ahead. However, if the reads require the updated data then you need to go down a different path and figure out exactly why merging 3M records is causing an issue.
I'd be willing to bet that most of the time is spent reading data from the disk during the merge command and/or working around low memory situations. You might be better off simply stuffing more ram into your database server.
An ideal amount would be to have enough ram to pull the whole database into memory as needed. For example, if you have a 4GB database, then make sure you have 8GB of RAM.. in an x64 server of course.
I'm afraid that I've quite the opposite experience. We were performing updates and insertions where the source table had only a fraction of the number of rows as the target table, which was in the millions.
When we combined the source table records across the entire operational window and then performed the MERGE just once, we saw a 500% increase in performance. My explanation for this is that you are paying for the up front analysis of the MERGE command just once instead of over and over again in a tight loop.
Furthermore, I am certain that merging 1.6 million rows (source) into 7 million rows (target), as opposed to 400 rows into 7 million rows over 4000 distinct operations (in our case) leverages the capabilities of the SQL server engine much better. Again, a fair amount of the work is in the analysis of the two data sets and this is done only once.
Another question I have to ask is well is whether you are aware that the MERGE command performs much better with indexes on both the source and target tables? I would like to refer you to the following link:
http://msdn.microsoft.com/en-us/library/cc879317(v=SQL.100).aspx
From personal experience, the main problem with MERGE is that since it does page lock it precludes any concurrency in your INSERTs directed to a table. So if you go down this road it is fundamental that you batch all updates that will hit a table in a single writer.
For example: we had a table on which INSERT took a crazy 0.2 seconds per entry, most of this time seemingly being wasted on transaction latching, so we switched this over to using MERGE and some quick tests showed that it allowed us to insert 256 entries in 0.4 seconds or even 512 in 0.5 seconds, we tested this with load generators and all seemed to be fine, until it hit production and everything blocked to hell on the page locks, resulting in a much lower total throughput than with the individual INSERTs.
The solution was to not only batch the entries from a single producer in a MERGE operation, but also to batch the batch from producers going to individual DB in a single MERGE operation through an additional level of queue (previously also a single connection per DB, but using MARS to interleave all the producers call to the stored procedure doing the actual MERGE transaction), this way we were then able to handle many thousands of INSERTs per second without problem.
Having the NOLOCK hints on all of your front-end reads is an absolute must, always.
I have a big doubt.
Let's take as example a database for a whatever company's orders.
Let's say that this company make around 2000 orders per month, so, around 24K order per year, and they don't want to delete any orders, even if it's 5 years old (hey, this is an example, numbers don't mean anything).
In the meaning of have a good database query speed, its better have just one table, or will be faster having a table for every year?
My idea was to create a new table for the orders each year, calling such orders_2008, orders_2009, etc..
Can be a good idea to speed up db queries?
Usually the data that are used are those of the current year, so there are less lines the better is..
Obviously, this would give problems when I search in all the tables of the orders simultaneously, because should I will to run some complex UNION .. but this happens in the normal activities very rare.
I think is better to have an application that for 95% of the query is fast and the remaining somewhat slow, rather than an application that is always slow.
My actual database is on 130 tables, the new version of my application should have about 200-220 tables.. of which about 40% will be replicated annually.
Any suggestion?
EDIT: the RDBMS will be probably Postgresql, maybe (hope not) Mysql
Smaller tables are faster. Period.
If you have history that is rarely used, then getting the history into other tables will be faster.
This is what a data warehouse is about -- separate operational data from historical data.
You can run a periodic extract from operational and a load to historical. All the data is kept, it's just segregated.
Before you worry about query speed, consider the costs.
If you split the code into separate code, you will have to have code that handles it. Every bit of code you write has the chance to be wrong. You are asking for your code to be buggy at the expense of some unmeasured and imagined performance win.
Also consider the cost of machine time vs. programmer time.
If you use indexes properly, you probably need not split it into multiple tables. Most modern DBs will optimize access.
Another option you might consider is to have a table for the current year, and at the end append the data to another table which has data for all the previous years. ?
I would not split tables by year.
Instead I would archive data to a reporting database every year, and use that when needed.
Alternatively you could partition the data, amongst drives, thus maintaining performance, although i'm unsure if this is possible in postgresql.
For the volume of data you're looking at splitting the data seems like a lot of trouble for little gain. Postgres can do partitioning, but the fine manual [1] says that as a rule of thumb you should probably only consider it for tables that exceed the physical memory of the server. In my experience, that's at least a million rows.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
I agree that smaller tables are faster. But it depends on your business logic if it makes sense to split a single entity over multiple tables. If you need a lot of code to manage all the tables than it might not be a good idea.
It also depends on the database what logic you're able to use to tackle this problem. In Oracle a table can be partitioned (on year for example). Data is stored physically in different table spaces which should make it faster to address (as I would assume that all data of a single year is stored together)
An index will speed things up but if the data is scattered across the disk than a load of block reads are required which can make it slow.
Look into partitioning your tables in time slices. Partitioning is good for the log-like table case where no foreign keys point to the tables.