Does parsing take the same amount of time regardless of how much data is loaded, or does it vary with size? I expect to have 20k to 50k records in a table for a SaaS project.
If you can give me this information, I would be grateful.
The time is influenced by the record count, but it's also influenced by indexes and the type of query you're running. 50k records isn't a particularly large set; how it performs really depends on the data and the query.
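The question doesn't say which database is in use, so as a rough illustration of how an index changes the plan for a table in this size range, here is a small sketch using SQLite through Python's sqlite3 module (the table and column names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO records (customer, amount) VALUES (?, ?)",
    [(f"customer-{i % 500}", i * 0.1) for i in range(50_000)],
)

# Without an index the query is a full table scan.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM records WHERE customer = ?", ("customer-42",)
).fetchall())

# With an index on the filtered column the plan switches to an index search.
conn.execute("CREATE INDEX idx_records_customer ON records (customer)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM records WHERE customer = ?", ("customer-42",)
).fetchall())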
I am planning to use Elasticsearch to store user order data. There could be 20 million orders per year in my system; 20 million orders probably take about 10 GB.
My question is whether I should create one index to hold all of the order data. I have read in the ES docs that it's best to keep roughly 20 GB of data per primary shard. If I create one index with 5 primary shards, does that mean I am fine storing 100 GB (200 million orders) in this index?
Another approach is to create an index per year, for example order-2020, order-2021, order-2022, etc., and give each index fewer primary shards. I understand this pattern may help if I want to apply a retention period to my order data, but apart from that, what other benefits does it offer?
From a query performance perspective, which approach is better?
In terms of search speed and aggregation accuracy, a multi-index, multi-shard layout will inevitably incur some loss, but for the health of the data it is recommended to split the data by year. You can use an alias to tie the yearly indices together; the loss in query performance is much smaller than the loss in aggregation accuracy.
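A minimal sketch of the index-per-year-plus-alias layout, assuming the elasticsearch-py client (7.x-style calls); the index names, shard counts, alias name, and the "status" field are only examples:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One index per year, each with a small number of primary shards.
for year in (2020, 2021, 2022):
    es.indices.create(
        index=f"order-{year}",
        body={"settings": {"number_of_shards": 1, "number_of_replicas": 1}},
        ignore=400,  # ignore "index already exists"
    )

# A single alias ties the yearly indices together for searching.
es.indices.put_alias(index="order-*", name="orders")

# Queries go through the alias and fan out to the yearly indices.
resp = es.search(index="orders", body={"query": {"term": {"status": "paid"}}})
print(resp["hits"]["total"])

Retention then becomes a cheap operation: dropping an old year is a single index deletion rather than a large delete-by-query.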
We have a database with more than a billion daily statistical records. Each record has multiple metrics (m1 through m10), and several immutable tags.
A record can also be associated with zero or more groups. The idea was to use multiple tags (e.g. g1, g2) to indicate that a specific record belongs to specific groups.
Our data is stored at the daily level, and most time-series databases are really optimized for more granular data. This becomes a problem when we want to produce monthly or quarterly graphs (e.g. InfluxDB has a maximum aggregation period of 7d). We need a database that is optimized for day-level data points and can produce quick aggregations at the month/quarter/year level.
Furthermore, the relationship between records and groups is mutable. We need the database to support batch updates of records (pseudo: ADD TAG group1 TO records WHERE record_id: 101), or at least fast deletion/reinsertion of updated data. This operation should be relatively fast.
We need something that can produce near-real-time results when aggregating data across tens of millions (filtered) records.
Our original solution is based on Elasticsearch and it works quite well, but we wanted to explore alternatives in the time-series database niche. Can anyone recommend a time-series database that supports these features?
Try ClickHouse. It is optimized for real-time processing and querying of large amounts of data. We successfully used it to store hundreds of billions of records per day on a 15-node cluster. ClickHouse is able to scan billions of records per second per CPU core, and its query performance scales linearly with the number of available CPU cores.
ClickHouse also supports infrequent data updates, so you can update groups for particular rows.
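As a rough sketch of both points (month-level rollups and occasional group updates), assuming the clickhouse-driver Python client and a made-up table daily_stats with a Date column d, a record_id, an Array(String) column groups, and metric columns m1..m10:

from clickhouse_driver import Client

client = Client("localhost")  # connection details are placeholders

# Month-level rollup over day-level points, filtered by group tag.
rows = client.execute(
    """
    SELECT toStartOfMonth(d) AS month, sum(m1) AS m1_total
    FROM daily_stats
    WHERE has(groups, 'g1')
    GROUP BY month
    ORDER BY month
    """
)

# Infrequent batch "re-tagging": ClickHouse mutations rewrite the affected parts
# asynchronously, so this suits occasional updates rather than per-request writes.
client.execute(
    "ALTER TABLE daily_stats UPDATE groups = arrayPushBack(groups, 'group1') "
    "WHERE record_id = 101"
)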
If you want a more traditional TSDB, then take a look at VictoriaMetrics. It is built on architecture ideas from ClickHouse, so it is fast and provides good on-disk data compression.
I have content that is about 50 TB in size. The number of documents in this set is about 250 million. The daily increment is not very large; it may be about 10,000 documents of varying sizes, totaling under 50 MB.
The current indexing effort is taking way too long and is guesstimated to complete in 100+ days!!!
So ... is this really that large of a data set? To me, 50 TB of content (in this day and age) is not very large. Do you have content of this size? If you do, how did you reduce the time taken for one-time indexing? Also, how did you reduce the time taken by real-time indexing?
If you can answer ... great. If you can point me in the right direction ... appreciate that as well.
Thanks in advance.
rd
There are a number of factors to consider.
You can start with the client you use to index. Which client are you using? Is it SolrJ, a framework that listens to a database (like Oracle or HBase), or the REST API?
This can make a difference. Solr is good at handling the load, but the client framework and the data preparation on the client side also need to be optimized. For example, if you use HBase Indexer (which reads from HBase tables and writes to Solr), you can expect a few million documents to be indexed in an hour or so; at that rate, 250 million should not take too long to complete.
After the client, you enter the Solr environment. How many fields are you indexing in your documents? Do you have stored fields or any other overhead in your field types?
Config parameters like autoCommit (based on the number of records or RAM size), softCommit as mentioned in the comment above, the number of parallel indexing threads, and the hardware are some of the points to consider.
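On the client side, batching documents and deferring commits usually matters a great deal. A minimal sketch, assuming the pysolr client and a placeholder core URL (your own client, batch size, and commit strategy will differ):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=60)

BATCH = 1000  # tune to your document size and heap

def index_documents(docs):
    """docs: an iterable of dicts matching the Solr schema."""
    buffer = []
    for doc in docs:
        buffer.append(doc)
        if len(buffer) >= BATCH:
            solr.add(buffer, commit=False)  # let autoCommit/softCommit control visibility
            buffer = []
    if buffer:
        solr.add(buffer, commit=False)
    solr.commit()  # one explicit hard commit at the end of the run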
You can find a comprehensive checklist here and verify each item. Happy designing!
I am researching Hadoop to see which of its products suits our need for quick queries against large data sets (billions of records per set).
The queries will be performed against chip sequencing data. Each record is one line in a file. To be clear, a sample record from the data set is shown below.
one line (record) looks like:
1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 ***103570835*** F .. 23G 24C
The highlighted field is called "position of match" and the query we are interested in is the # of sequences in a certain range of this "position of match". For instance the range can be "position of match" > 200 and "position of match" + 36 < 200,000.
Any suggestions on the Hadoop product I should start with to accomplish the task? HBase, Pig, Hive, or ...?
Rough guideline: If you need lots of queries that return fast and do not need to aggregate data, you want to use HBase. If you are looking at tasks that are more analysis and aggregation-focused, you want Pig or Hive.
HBase allows you to specify start and end rows for scans, meaning it should satisfy the query example you provided, and it seems the most appropriate for your use case.
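A small sketch of such a range scan, assuming the happybase Python client and a hypothetical table keyed by the zero-padded "position of match" so that lexicographic row order matches numeric order:

import happybase

connection = happybase.Connection("localhost")  # Thrift gateway address is a placeholder
table = connection.table("sequences")

# Row keys are the zero-padded position of match, e.g. b"0103570835".
start = str(201).zfill(10).encode()            # position of match > 200
stop = str(200_000 - 36).zfill(10).encode()    # position of match + 36 < 200,000

count = 0
for _row_key, _data in table.scan(row_start=start, row_stop=stop):
    count += 1

print("sequences in range:", count)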
For posterity, here's the answer Xueling received on the Hadoop mailing list:
First, further detail from Xueling:
The datasets won't be updated often, but queries against a data set are frequent. The quicker the query, the better. For example, we have done testing on a MySQL database (5 billion records randomly scattered into 24 tables), and the slowest query against the biggest table (400,000,000 records) is around 12 minutes. So if using any Hadoop product can speed up the search, then that product is what we are looking for.
The response, from Cloudera's Todd Lipcon:
In that case, I would recommend the following:

Put all of your data on HDFS.
Write a MapReduce job that sorts the data by position of match.
As a second output of this job, write a "sparse index" - basically a set of entries giving offsets into every 10K records or so. If you index every 10K records, then 5 billion total will mean 100,000 index entries. Each index entry shouldn't be more than 20 bytes, so 100,000 entries will be 2 MB. This is super easy to fit into memory. (You could probably index every 100th record instead and end up with 200 MB, still easy to fit in memory.)

Then, to satisfy your count-range query, you can simply scan your in-memory sparse index. Some of the indexed blocks will be completely included in the range, in which case you just add up the "number of entries following" column. The start and finish blocks will be partially covered, so you can use the file offset info to load those blocks off HDFS, start reading at those offsets, and finish the count.

Total time per query should be <100 ms, no problem.
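To make the sparse-index idea concrete, here is a minimal single-machine sketch in Python (no MapReduce); the file name and the field position of the "position of match" column are assumptions based on the sample record above, and boundary handling is simplified:

INDEX_EVERY = 10_000  # one index entry per 10K records, as suggested above
POS_FIELD = 7         # 0-indexed field of "position of match" in each line

def build_sparse_index(path):
    """Return parallel lists: block start positions, byte offsets, block record counts."""
    positions, offsets, counts = [], [], []
    with open(path, "rb") as f:            # records must already be sorted by position
        offset = f.tell()
        line = f.readline()
        n = 0
        while line:
            if n % INDEX_EVERY == 0:
                positions.append(int(line.split()[POS_FIELD]))
                offsets.append(offset)
                counts.append(0)
            counts[-1] += 1
            n += 1
            offset = f.tell()
            line = f.readline()
    return positions, offsets, counts

def count_in_range(path, index, lo, hi):
    """Count records with lo <= position of match < hi."""
    positions, offsets, counts = index
    total = 0
    for i, start_pos in enumerate(positions):
        end_pos = positions[i + 1] if i + 1 < len(positions) else None
        if start_pos >= lo and end_pos is not None and end_pos < hi:
            # Block lies entirely inside the range: just add its record count.
            total += counts[i]
        elif (end_pos is None or end_pos >= lo) and start_pos < hi:
            # Boundary (or final) block: read it from disk and count line by line.
            with open(path, "rb") as f:
                f.seek(offsets[i])
                for _ in range(counts[i]):
                    pos = int(f.readline().split()[POS_FIELD])
                    if lo <= pos < hi:
                        total += 1
    return total

# Example: position of match > 200 and position of match + 36 < 200,000
# index = build_sparse_index("sorted_records.txt")
# print(count_in_range("sorted_records.txt", index, 201, 199_964))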
A few subsequent replies suggested HBase.
You could also take a quick look at JAQL (http://code.google.com/p/jaql/), though unfortunately it's for querying JSON data. But maybe it helps anyway.
You may want to look at NoSQL database approaches like HBase or Cassandra. I would prefer HBase, as it has a growing community.
We have about 10K rows in a table. We want to have a form where we have a select drop down that contains distinct values of a given column in this table. We have an index on the column in question.
To increase performance I created a little cache table that contains the distinct values so we didn't need to do a select distinct field from table against 10K rows. Surprisingly it seems doing select * from cachetable (10 rows) is no faster than doing the select distinct against 10K rows. Why is this? Is the index doing all the work? At what number of rows in our main table will there be a performance improvement by querying the cache table?
For a DB, 10K rows is nothing. You're not seeing much difference because the actual calculation time is minimal, with most of it consumed by other, constant, overhead.
It's difficult to predict when you'd start noticing a difference, but it would probably be at around a million rows.
If you've already set up caching and it's not detrimental, you may as well leave it in.
10k rows is not much... start caring when you reach 500k ~ 1 million rows.
Indexes do a great job, especially if you only have 10 different values for that index.
This depends on numerous factors - the amount of memory your DB has, the size of the rows in the table, use of a parameterised query and so forth - but generally 10K is not a lot of rows, and particularly if the table is well indexed it's not going to cause any modern RDBMS any sweat at all.
As a rule of thumb I would generally only start paying close attention to performance issues on a table when it passes the 100K rows mark, and 500K doesn't usually cause much of a problem if indexed correctly and accessed by such. Performance usually tends to fall off catastrophically on large tables - you may be fine on 500K rows but crawling on 600K - but you have a long way to go before you are at all likely to hit such problems.
Is the index doing all the work?
You can tell how the query is being executed by viewing the execution plan.
For example, try this:
explain plan for select distinct field from table;
select * from table(dbms_xplan.display);
I notice that you didn't include an ORDER BY on that. If you do not include ORDER BY then the order of the result set may be random, particularly if Oracle uses the HASH algorithm for building the distinct list. You ought to check that.
So I'd look at the execution plans for the original query that you think is using an index, and at the one based on the cache table. Maybe post them and we can comment on what's really going on.
Incidentally, the cache table would usually be implemented as a materialised view, particularly if the master table is generally pretty static.
Serious premature optimization. Just let the database do its job, maybe with some tweaking to the configuration (especially if it's MySQL, which has several cache types and settings).
Your query on 10K rows most probably uses a HASH SORT UNIQUE.
As 10K rows most probably fit into db_buffers and hash_area_size, all operations are performed in memory, and you won't notice any difference.
But if the query will be used as a part of a more complex query, or will be swapped out by other data, you may need disk I/O to access the data, which will slow your query down.
Run your query in a loop in several sessions (as many sessions as there will be users connected), and see how it performs in that case.
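A rough sketch of that kind of concurrency test in Python, assuming the python-oracledb driver; the connection details, query, session count, and iteration count are all placeholders:

import time
from concurrent.futures import ThreadPoolExecutor

import oracledb  # pip install oracledb; swap in your own DB-API driver if needed

DSN = "localhost/XEPDB1"       # placeholder connection details
QUERY = "SELECT DISTINCT field FROM my_table"
SESSIONS = 30                  # roughly the number of concurrently connected users
ITERATIONS = 20                # how many times each session repeats the query

def session_worker(session_id):
    # Each worker opens its own connection, mimicking a separate user session.
    conn = oracledb.connect(user="scott", password="tiger", dsn=DSN)
    cursor = conn.cursor()
    timings = []
    for _ in range(ITERATIONS):
        start = time.perf_counter()
        cursor.execute(QUERY)
        cursor.fetchall()
        timings.append(time.perf_counter() - start)
    conn.close()
    return session_id, max(timings), sum(timings) / len(timings)

with ThreadPoolExecutor(max_workers=SESSIONS) as pool:
    for sid, worst, avg in pool.map(session_worker, range(SESSIONS)):
        print(f"session {sid}: worst {worst:.3f}s, avg {avg:.3f}s")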
For future plans and for scalability, you may want to look into an indexing service that works purely in memory, or something faster than the TCP round-trip to the DB. A lot of people (including myself) use Lucene to achieve this by normalizing the data into flat files.
Lucene has a built-in RAM-based directory implementation, which can build the index entirely in memory - removing the dependency on the file system and greatly increasing speed.
Lately, I've architected systems that have a single RAM-based index wrapped by a web service. Then I have my Ajax-style dropdowns query that web service for high availability and high speed - no DB layer, no file system, just pure memory and, if remote, TCP packet speed.
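This isn't Lucene, but as a simplified stand-in for the same architecture (the dropdown values held purely in memory behind a small web service, with no DB call per request), here is a sketch using Flask; the endpoint name and the data loading are made up:

from flask import Flask, jsonify

app = Flask(__name__)

# Loaded once at startup (e.g. from the database or a flat-file export) and
# held entirely in memory; refresh it on whatever schedule fits your data.
DROPDOWN_VALUES = sorted({"red", "green", "blue"})

@app.route("/dropdown-values")
def dropdown_values():
    # No DB round-trip per request: the list is served straight from memory.
    return jsonify(DROPDOWN_VALUES)

if __name__ == "__main__":
    app.run(port=8080)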
If you have an index on the column, then all the values are in the index and the DBMS never has to look at the table; it just looks at the index, which has only 10 entries. If this is mostly read-only data, then cache it in memory. Caching helps scalability a lot by relieving the database of work. A query that is quick on a database with no users might perform poorly when 30 queries are running at the same time.
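A tiny sketch of such in-process caching for read-mostly data, using only the Python standard library; the loader function and the 10-minute refresh interval are arbitrary placeholders:

import time

_CACHE = {"values": None, "loaded_at": 0.0}
TTL_SECONDS = 600  # refresh at most every 10 minutes

def load_distinct_values_from_db():
    # Placeholder: run SELECT DISTINCT field FROM my_table with your DB driver.
    return ["value-a", "value-b", "value-c"]

def get_distinct_values():
    # Serve from memory; hit the database only when the cached copy is stale.
    now = time.monotonic()
    if _CACHE["values"] is None or now - _CACHE["loaded_at"] > TTL_SECONDS:
        _CACHE["values"] = load_distinct_values_from_db()
        _CACHE["loaded_at"] = now
    return _CACHE["values"]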