I am using MongoDB as a temporary log store. The collection receives ~400,000 new rows an hour. Each row contains a UNIX timestamp and a JSON string.
Periodically I would like to copy the contents of the collection to a file on S3, creating a file for each hour containing ~400,000 rows (e.g. today_10_11.log contains all the rows received between 10am and 11am). I need to do this copy while the collection is receiving inserts.
My question: what is the performance impact of having an index on the timestamp field on the 400,000 hourly inserts, versus the additional time it will take to query an hour's worth of rows?
The application in question is written in Ruby, runs on Heroku, and uses the MongoHQ plugin.
Mongo indexes the _id field by default, and the ObjectId already starts with a timestamp, so basically, Mongo is already indexing your collection by insertion time for you. So if you're using the Mongo defaults, you don't need to index a second timestamp field (or even add one).
To get the creation time of an ObjectId in Ruby:
ruby-1.9.2-p136 :001 > id = BSON::ObjectId.new
=> BSON::ObjectId('4d5205ed0de0696c7b000001')
ruby-1.9.2-p136 :002 > id.generation_time
=> 2011-02-09 03:11:41 UTC
To generate the object ids for a given time:
ruby-1.9.2-p136 :003 > past_id = BSON::ObjectId.from_time(1.week.ago)
=> BSON::ObjectId('4d48cb970000000000000000')
So, for example, if you wanted to load all docs inserted in the past week, you'd simply search for _ids greater than past_id and less than id. So, through the Ruby driver:
collection.find({:_id => {:$gt => past_id, :$lt => id}}).to_a
=> #... a big array of hashes.
You can, of course, also add a separate field for timestamps, and index it, but there's no point in taking that performance hit when Mongo's already doing the necessary work for you with its default _id field.
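Applied to the hourly S3 dump in the question, a rough sketch with the current mongo Ruby driver could look like the following (the connection string, collection name and file name are placeholders; uploading the finished file to S3, e.g. with the aws-sdk gem, is left out):

require 'mongo'
require 'json'

collection = Mongo::Client.new('mongodb://localhost:27017/logs')[:entries]   # placeholder connection/collection

hour_start = Time.utc(2011, 2, 9, 10)    # 10:00 UTC
hour_end   = hour_start + 3600           # 11:00 UTC

# ObjectIds generated inside that hour fall between these two synthetic ids.
start_id = BSON::ObjectId.from_time(hour_start)
end_id   = BSON::ObjectId.from_time(hour_end)

# Write one JSON line per document; inserts can keep flowing while this runs.
File.open('today_10_11.log', 'w') do |file|
  collection.find(_id: { '$gte' => start_id, '$lt' => end_id }).each do |doc|
    file.puts(doc.to_json)
  end
end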
More information on ObjectIds is available in the MongoDB documentation.
I have an application like yours, and currently it has 150 million log records. At 400k an hour, this DB will get large fast. Indexing the timestamp on those 400k hourly inserts is far more worthwhile than paying for unindexed queries later. I have no problem inserting tens of millions of records an hour with an indexed timestamp, yet an unindexed query on timestamp takes a couple of minutes on a four-server shard (CPU bound); the indexed query comes back instantly. So definitely index it: the write overhead of the index is not that high, and 400k records an hour is not much for Mongo.
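For reference, a hedged sketch of creating and using such an index with the current mongo Ruby driver (older drivers expose create_index/ensure_index instead; connection details are placeholders):

require 'mongo'

collection = Mongo::Client.new('mongodb://localhost:27017/logs')[:entries]   # placeholder

# One-time: index the timestamp field; background: true keeps inserts flowing while it builds.
collection.indexes.create_one({ timestamp: 1 }, background: true)

# An hour's worth of rows is then a cheap indexed range scan instead of a collection scan.
hour_start = Time.utc(2011, 2, 9, 10).to_i
collection.find(timestamp: { '$gte' => hour_start, '$lt' => hour_start + 3600 }).count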
One thing you do have to look out for, though, is memory size. At 400k records an hour you are inserting roughly 10 million a day, which works out to about 350 MB of memory a day (around 35 bytes per index entry) just to keep that index in RAM. So if this goes on for a while, your index can grow larger than memory fast.
Also, if you are truncating records after some time period using remove, I have found that removes generate a large amount of disk I/O, and that workload ends up disk bound.
Certainly every write will also have to update the index data. If you're going to be running large queries on the data, you will definitely want an index.
Consider storing the timestamp in the _id field instead of a MongoDB ObjectId. As long as the timestamps you store are unique, you'll be OK here: _id doesn't have to be an ObjectId, and it always gets an automatic index. This may be your best bet, as you won't add an additional index burden.
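A rough sketch of that approach with the Ruby driver (connection, collection and payload are placeholders; note the uniqueness caveat above, since duplicate _ids are rejected with a duplicate-key error):

require 'mongo'

collection = Mongo::Client.new('mongodb://localhost:27017/logs')[:entries]   # placeholder

# _id is indexed automatically, so using the timestamp as _id avoids a second index.
collection.insert_one(_id: Time.now.utc, payload: '{"some":"json"}')

# Pulling an hour of rows is then a range query on _id.
hour_start = Time.utc(2011, 2, 9, 10)
collection.find(_id: { '$gte' => hour_start, '$lt' => hour_start + 3600 }).to_a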
I'd just use a capped collection, unindexed, with space for, say, 600k rows to allow for slush. Once per hour, dump the collection to a text file, then use grep to filter out rows that aren't from your target hour. This doesn't let you leverage the nice bits of the DB, but it means you never have to worry about collection indexes, flushes, or any of that nonsense. The performance-critical bit is keeping the collection free for inserts, so if you can do the "hard" bit (filtering by time) outside of the DB, you shouldn't see any appreciable performance impact. 400-600k lines of text is trivial for grep and likely shouldn't take more than a second or two.
If you don't mind a bit of slush in each log, you can just dump and gzip the collection. You'll get some older data in each dump, but unless you insert over 600k rows between dumps, you should have a continuous series of log snapshots of 600k rows apiece.
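A hedged sketch of this approach with the current mongo Ruby driver (collection name, sizes and file names are placeholders; the actual grep filtering happens outside Ruby, as described above):

require 'mongo'
require 'json'

client = Mongo::Client.new('mongodb://localhost:27017/logs')   # placeholder

# One-time: a capped collection with room for ~600k rows of slush
# (size is a byte limit and is required; max caps the document count).
client[:entries, capped: true, size: 512 * 1024 * 1024, max: 600_000].create

# Once per hour: dump the whole collection as JSON lines, then filter the file
# by hour with grep (or gzip it as-is if a bit of slush per log is acceptable).
File.open('hourly_dump.log', 'w') do |file|
  client[:entries].find.each { |doc| file.puts(doc.to_json) }
end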
Related
I've got a 3GB SQLite database file with a single table with 40 million rows and 14 fields (mostly integers and very short strings and one longer string), no indexes or keys or other constraints -- so really nothing fancy. I want to check if there are entries where a specific integer field has a specific value. So of course I'm using
SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField=?)
I haven't got much experience with SQLite and databases in general and on my first test query, I was shocked that this simple query took about 30 seconds. Subsequent tests showed that it is much faster if a matching row occurs at the beginning, which of course makes sense.
Now I'm thinking of doing an initial SELECT DISTINCT barField FROM FooTable at application startup and caching the results in software. But I'm sure there must be a cleaner SQLite way to do this; I mean, that should be part of a DBMS's job, right?
But so far I've only created primary keys for speeding up queries, which doesn't work here because the field values are non-unique. So how can I speed up this query so that it runs in (more or less) constant time? (It doesn't have to be lightning fast; I'd be completely fine if it was under one second.)
Thanks in advance for answering!
P.S. Oh, and there will be about 500K new rows every month for an indefinite period of time, and it would be great if that doesn't significantly increase query time.
Adding an index on barField should speed up the subquery inside the EXISTS clause:
CREATE INDEX barIdx ON FooTable (barField);
To satisfy the query, SQLite would only have to seek the index once and detect that there is at least one matching value.
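If you happen to be calling SQLite from Ruby, a minimal usage sketch with the sqlite3 gem (the database path and the probe value are placeholders):

require 'sqlite3'

db = SQLite3::Database.new('foo.db')   # placeholder path to the 3 GB file

# One-time: build the index. This takes a while on 40 million rows, but afterwards
# the EXISTS probe is a single index seek.
db.execute('CREATE INDEX IF NOT EXISTS barIdx ON FooTable (barField)')

# The membership test itself, with a bound parameter; EXISTS returns 1 or 0.
found = db.get_first_value('SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField = ?)', 42) == 1
puts found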
I'm creating the data model for a time-series application on Cassandra 2.1.3. We will be preserving X amount of data for each user of the system, and I'm wondering what the best approach is to design for this requirement.
Option1:
Use a 'bucket' in the partition key, so data for X period goes into the same row. Something like this:
((id, bucket), timestamp) -> data
I can delete a single row at once at the expense of maintaining this bucket concept. It also limits the range I can query on timestamp, probably resulting in several queries.
Option2:
Store all the data in the same row; deletes are then per column.
(id, timestamp) -> data
Range queries are easy again. But what about performance after many column deletes?
Given that we plan to use TTL to let the data expire, which of the two models would deliver the best performance? Is the tombstone overhead of Option1 << Option2 or will there be a tombstone per column on both models anyway?
I'm trying to avoid burying myself in the tombstone graveyard.
I think it will all depend on how much data you plan on having for the given partition key you end up choosing, what your TTL is and what queries you are making.
I typically lean towards option #1, especially if your TTL is the same for all writes. In addition, if you are using LeveledCompactionStrategy or DateTieredCompactionStrategy, Cassandra will do a great job keeping data from the same partition in the same SSTable, which will greatly improve read performance.
If you use Option #2, data for the same partition could likely be spread across multiple levels (if using LCS) or, in general, across multiple SSTables, which may cause you to read from a lot of SSTables, depending on the nature of your queries. There is also the issue of hotspotting, where you could overload particular Cassandra nodes if you have a really wide partition.
The other benefit of #1 (which you allude to) is that you can easily delete the entire partition, which creates a single partition-level tombstone and is much cheaper. Also, if you are using the same TTL, data within that partition will expire at pretty much the same time.
I do agree that it is a bit of a pain to have to make multiple queries to read across multiple partitions, as it pushes some complexity into the application. You may also need to maintain a separate table to keep track of the buckets for a given id if they cannot be determined implicitly.
As far as performance goes, do you see it as likely that you will need to read across partitions when your application makes queries? For example, if you have a query for 'the most recent 1000 records' and a partition is typically wider than that, you may only need to make one query for Option #1. However, if you want a query like 'give me all records', Option #2 may be better, as otherwise you'll need to make a query for each bucket.
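To illustrate that last point, here is a rough sketch using the DataStax Ruby driver (the cassandra-driver gem); the host, keyspace, id and bucket values are placeholders, and the table names match the schemas shown in the next answer:

require 'cassandra'

cluster = Cassandra.cluster(hosts: ['127.0.0.1'])   # placeholder host
session = cluster.connect('my_keyspace')            # placeholder keyspace

user_id = 42

# Option #1: one query per (id, bucket) partition, so reading a wide time range
# means looping over every bucket you need.
per_bucket = session.prepare('SELECT timestamp, data FROM option1 WHERE id = ? AND bucket = ?')
buckets    = [2014, 2015]                            # placeholder bucket values
rows       = buckets.flat_map { |b| session.execute(per_bucket, arguments: [user_id, b]).to_a }

# Option #2: everything for an id lives in one partition, so a single range query suffices.
all_rows = session.execute(
  'SELECT timestamp, data FROM option2 WHERE id = ? AND timestamp >= ?',
  arguments: [user_id, Time.now - 7 * 24 * 3600]
)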
After creating the tables you described above:
CREATE TABLE option1 (
  id bigint,
  bucket bigint,
  timestamp timestamp,
  data text,
  PRIMARY KEY ((id, bucket), timestamp)
) WITH default_time_to_live = 10;
CREATE TABLE option2 (
  id bigint,
  timestamp timestamp,
  data text,
  PRIMARY KEY (id, timestamp)
) WITH default_time_to_live = 10;
I inserted a test row into each:
INSERT INTO option1 (id,bucket,timestamp,data) VALUES (1,2015,'2015-03-16 11:24:00-0500','test1');
INSERT INTO option2 (id,timestamp,data) VALUES (1,'2015-03-16 11:24:00-0500','test2');
I then waited 10 seconds, queried with tracing on, and saw identical tombstone counts for each table. So either way, that shouldn't be too much of a concern for you.
The real issue is that if you think you'll ever hit the limit of 2 billion columns per partition, then Option #1 is the safe one. If you have a lot of data, Option #1 might perform better (because you'll be eliminating the need to look at partitions that don't match your bucket), but really either one should be fine in that respect.
tl;dr:
As the issues of performance and tombstones are going to be similar no matter which option you choose, I'm thinking that Option #2 is the better one, just due to ease of querying.
I have approx. 10 million Article objects in a MongoDB database (accessed through Mongoid). The huge number of Article objects makes the queries quite time consuming to perform.
As exemplified below, for each week (e.g. 700 days ago, ..., 7 days ago, 0 days ago) I am recording how many articles are in the database up to that date.
But for every query I make, the time consumption increases, and the CPU usage quickly climbs past 100%.
articles = Article.where(published: true).asc(:datetime)
days = Date.today.mjd - articles.first.datetime.to_date.mjd
days.step(0, -7) do |n|
current_date = Date.today - n.days
previous_articles = articles.lt(datetime: current_date)
previous_good_articles = previous_articles.where(good: true).size
previous_bad_articles = previous_articles.where(good: false).size
end
Is there a way to keep the Article objects in memory, so that I only need to call the database on the first line?
A MongoDB database is not built for that.
I think the best way is to run a daily script that creates your data for that day and saves it in a Redis database (http://www.redis.io).
Redis stores your data in server memory, so you can access it at any time of day, and it is very quick.
Don't Repeat Yourself (DRY) is a best practice that applies not only to code but also to processing. Many applications have natural epochs for summarizing data; a day is a good choice in your case, and if the data is historical it only has to be summarized once. So you reduce processing of 10 million Article documents down to roughly 700 day-summary documents. You need special code for merging in today's data if you want up-to-the-minute accuracy, but the savings on the historical data are well worth the effort.
I politely disagree with the statement, "A MongoDB database is not built for that." You can see from the above that it is all about not repeating processing. The 700 day-summary documents can be stored in any reasonable data store. Since you already are using MongoDB, simply use another MongoDB collection for the day summaries. There's no need to spin up another data store if you don't want to. The summary data will easily fit in memory, and the reduction in processing means that your working set size will no longer be blown out by the historical processing.
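As a hedged sketch of how the day summaries could be built with the aggregation pipeline (assuming Mongoid on the official mongo driver, MongoDB 3.0+ for $dateToString, and datetime stored as a BSON date; the article_daily_summaries collection name is an assumption, field names follow the question):

pipeline = [
  { '$match' => { 'published' => true } },
  { '$group' => {
      '_id'  => { '$dateToString' => { 'format' => '%Y-%m-%d', 'date' => '$datetime' } },
      'good' => { '$sum' => { '$cond' => ['$good', 1, 0] } },
      'bad'  => { '$sum' => { '$cond' => ['$good', 0, 1] } }
  } }
]

summaries = Article.collection.database['article_daily_summaries']   # assumed collection name

Article.collection.aggregate(pipeline).each do |doc|
  # One summary document per day, upserted so re-running the script is harmless.
  summaries.find(_id: doc['_id']).update_one(
    { '$set' => { 'good' => doc['good'], 'bad' => doc['bad'] } },
    upsert: true
  )
end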
I have configured free text search on a table in my postgres database. Pretty simple stuff, with firstname, lastname and email. This works well and is fast.
I do, however, sometimes experience long delays when inserting a new entry into the table, where the insert keeps running for minutes and also generates huge WAL files (we use the WAL files for replication).
Is there anything I need to be aware of with my free text index? Like Postgres maybe restructuring it behind the scenes for performance reasons? My index is currently around 400 MB.
Thanks in advance!
Christian
Given the size of the WAL files, I suspect you are right that it is an index update/rebalancing that is causing the issue. However I have to wonder what else is going on.
I would recommend against storing tsvectors in separate columns. A better way is to run an index on to_tsvector()'s output. You can have multiple indexes for multiple languages if you need. So instead of a trigger that takes, say, a field called description and stores the tsvector in desc_tsvector, I would recommend just doing:
CREATE INDEX mytable_description_tsvector_idx ON mytable USING gin (to_tsvector('english', description));
(The two-argument form of to_tsvector is required in an index expression because the one-argument form is not immutable, and a GIN index is what actually accelerates @@ full-text matches.)
Now, if you need a consistent search interface across a whole table, there are more elegant ways of doing this using "table methods."
In general the functional index approach has fewer issues associated with it than anything else.
Now a second thing you should be aware of is partial indexes. If you need to, you can index only records of interest. For example, if most of my queries only check the last year, I can:
CREATE INDEX mytable_description_tsvector_idx ON mytable USING gin (to_tsvector('english', description))
WHERE created_at > '2014-01-01';
(The predicate of a partial index must be immutable, so you cannot use now() there; pick a fixed cutoff date and recreate the index periodically, and make sure your queries also constrain created_at so the planner can use the partial index.)
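A search only benefits from the expression index if the query uses the same expression. A hedged sketch from Ruby with the pg gem (connection parameters, table and search term are placeholders):

require 'pg'

conn = PG.connect(dbname: 'mydb')   # placeholder connection

# to_tsvector('english', description) must match the index expression exactly,
# otherwise the planner cannot use the GIN index and falls back to a sequential scan.
result = conn.exec_params(<<~SQL, ['smith'])
  SELECT description
    FROM mytable
   WHERE to_tsvector('english', description) @@ plainto_tsquery('english', $1)
SQL

result.each { |row| puts row['description'] }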
Will querying get slower in CouchDB as the number of documents grows?
Example Scenario:
I have a combobox in a form for the customer name. When the user types a customer name, I have to autofill matching names.
There will be around 10k customer documents in CouchDB. I understand that I have to create a view to do this.
The CouchDB database is on the local machine where the application resides.
Question:
Will it take more than 2 - 3 seconds to query the DB for matching customer names?
Will each query take more time if there are many documents in CouchDB (say, around 100,000 documents)?
Any pointers on how to create views/index will be helpful.
Thanks in advance.
The view runs on every document, but only once. After that, the document's view value(s) are stored forever. Fetching a customer by name will be very fast because you would normally have only a few new documents to process in the view at query time.
Query time will not increase noticeably if you have more documents. Technically, access times grow logarithmically with the number of documents. However, in practice fetching documents is basically constant time and very unlikely to be a problem.
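For the "how to create views" part, a rough sketch over plain HTTP from Ruby (the database name, field name and design-document name are assumptions; the map function itself is the JavaScript that CouchDB runs):

require 'net/http'
require 'json'
require 'cgi'

uri  = URI('http://127.0.0.1:5984/customers')   # placeholder database URL
http = Net::HTTP.new(uri.host, uri.port)

# A design document with one view that emits the lower-cased customer name as key.
# (The PUT will conflict if the design document already exists; include its _rev to update it.)
design = {
  'views' => {
    'by_name' => {
      'map' => "function(doc) { if (doc.name) emit(doc.name.toLowerCase(), null); }"
    }
  }
}
http.send_request('PUT', "#{uri.path}/_design/customers", design.to_json,
                  'Content-Type' => 'application/json')

# Autocomplete: a prefix range on the key. "\ufff0" is a conventional high sentinel
# that closes a CouchDB prefix range; keys must be JSON-encoded in the query string.
prefix = 'jo'
query  = {
  'startkey' => "\"#{prefix}\"",
  'endkey'   => "\"#{prefix}\ufff0\"",
  'limit'    => '10'
}.map { |k, v| "#{k}=#{CGI.escape(v)}" }.join('&')

res = http.send_request('GET', "#{uri.path}/_design/customers/_view/by_name?#{query}")
puts JSON.parse(res.body)['rows'].map { |r| r['key'] }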