Group By Date in Rethinkdb is slow - rethinkdb

I am trying to group by date, as follows, to get a total count:
r.db('analytic').table('events')
  .group([
    r.row('created_at').inTimezone("+08:00").year(),
    r.row('created_at').inTimezone("+08:00").month(),
    r.row('created_at').inTimezone("+08:00").day()
  ])
  .count()
However, it is slow: it takes over 2 seconds for 17,656 records.
Is there any way to make the group-by-date query faster?

If you want to group and count all the records, it's going to have to read every record, so the speed will be determined mostly by your hardware rather than the specific query. If you only want one particular range of dates, you could get that much faster with an indexed between query.
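If only a particular range is needed, the indexed `between` approach the answer mentions could look like the sketch below. The helper function is hypothetical and the index name is an assumption; only the database and table names come from the question.

```javascript
// Hypothetical helper: compute the UTC instants bounding one calendar day
// in a fixed-offset zone such as +08:00.
function dayBoundsUtc(year, month, day, offsetHours) {
  // Local midnight minus the offset is the corresponding UTC instant.
  const start = new Date(Date.UTC(year, month - 1, day) - offsetHours * 3600 * 1000);
  const end = new Date(start.getTime() + 24 * 3600 * 1000);
  return [start, end];
}

// One-time setup (RethinkDB JavaScript driver assumed):
//   r.db('analytic').table('events').indexCreate('created_at')
//
// Counting one local day then becomes an indexed range scan instead of a
// full-table group:
//   const [start, end] = dayBoundsUtc(2016, 3, 14, 8);
//   r.db('analytic').table('events')
//     .between(start, end, {index: 'created_at'})
//     .count()
```

The official JavaScript driver accepts native Date objects as ReQL times, so the computed bounds can be passed to `between` directly.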

Related

Best way to retrieve 150,000 records from Oracle with JDBC

I have been searching for an answer to this today, and it seems the best approach divides opinion somewhat.
I have 150,000 records that I need to retrieve from an Oracle database using JDBC. Is it better to retrieve the data with one select query, letting the JDBC driver transfer the records from the database using an Oracle cursor and the default fetchSize, or to split the query into batches using LIMIT / OFFSET?
With the LIMIT / OFFSET option, the pro is that you control the number of results returned in each chunk. The cons are that the query is executed multiple times, and that you also need to run a COUNT(*) up front using the same query to calculate the number of iterations required.
The pro of retrieving everything at once is that you rely on the JDBC driver to manage the retrieval of data from the database. The con is that the setFetchSize() hint can sometimes be ignored, meaning we could end up with a huge ResultSet containing all 150,000 records at once!
It would be great to hear some real-life experiences with similar issues; recommendations would be much appreciated.
The native way with Oracle JDBC is to use prepareStatement for the query, call executeQuery, and fetch the results in a loop with a defined fetchSize.
Yes, the details depend on the Oracle Database and JDBC driver versions, and in some cases the requested fetchSize can be ignored. But the typical problem is that the fetch size gets reset to fetchSize = 1, so you effectively make a round trip for each record (not that you get all records at once).
Your LIMIT / OFFSET alternative seems sensible at first glance, but if you investigate the implementation you will probably decide not to use it.
Say you divide the result set into 15 chunks of 10K each:
You open 15 queries, each of them on average with half the resource consumption of the original query (OFFSET selects the rows and then skips them).
So the only thing you will achieve is that the processing takes approximately 7.5x more time.
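The arithmetic behind that estimate can be checked with a rough cost model (an illustration, not measured data): chunk i must skip i × 10K rows before returning its own 10K, so on average each query costs about half a full scan, and the total comes out at roughly 8 full scans' worth of rows touched.

```javascript
// Rough cost model for LIMIT / OFFSET chunking: to serve chunk i the
// database touches offset + limit = (i + 1) * chunk rows.
function relativeCost(total, chunk) {
  const n = Math.ceil(total / chunk);
  let touched = 0;
  for (let i = 0; i < n; i++) {
    touched += (i + 1) * chunk; // rows skipped by OFFSET plus rows returned
  }
  return touched / total; // 1.0 would equal a single full scan
}

// relativeCost(150000, 10000) → 8, i.e. about 8/15 ≈ half a scan per query
```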
Best Practice
Take your query, write a simple script with a JDBC fetch, and use a 10046 trace to see the fetch size actually used.
Test with a range of fetch sizes, observe the performance, and choose the optimal one.
My preference is to maintain a safe execution time with the ability to continue if interrupted. I prefer this approach because it is future-proof and respects memory and execution-time limits. Remember you're not planning for today, you're planning for six months down the road: what is 150,000 today may be 1.5 million in six months.
I use a length + 1 recipe to know if there is more to fetch, although the count query would let you show a progress bar in % if that is important.
A 150,000-record result set is a memory-pressure question, and the answer depends on the average size of each row. If a row is three integers, that's small; if a row holds a bunch of text fields with user-profile details, it's potentially very large. So be prudent about which fields you pull.
Also ask whether you need to pull all the records every time. It may be useful to apply a sync pattern and only pull records with an updated date newer than your last pull.
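The "length + 1" recipe above can be sketched with a stand-in fetch function; in real code fetchPage would execute the JDBC query with LIMIT pageSize + 1, and all names here are hypothetical.

```javascript
// Stand-in for the real query: ask for one row more than the page size.
function fetchPage(rows, offset, pageSize) {
  return rows.slice(offset, offset + pageSize + 1);
}

// Read everything page by page; the extra row tells us whether to continue,
// so no up-front COUNT(*) is needed.
function readAll(rows, pageSize) {
  const out = [];
  let offset = 0;
  for (;;) {
    const page = fetchPage(rows, offset, pageSize);
    const more = page.length > pageSize; // got the extra row => more to fetch
    out.push(...page.slice(0, pageSize));
    if (!more) return out;
    offset += pageSize;
  }
}
```

Because each iteration knows on its own whether more data remains, the loop can stop at any page boundary and resume later from the saved offset.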

Google datastore - index a date created field without having a hotspot

I am using Google Datastore and will need to query it to retrieve some entities. These entities will need to be sorted by newest to oldest. My first thought was to have a date_created property which contains a timestamp. I would then index this field and sort on this field. The problem with this approach is it will cause hotspots in the database (https://cloud.google.com/datastore/docs/best-practices).
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.
Obviously, sorting data on dates is probably the most common sort performed on a database. If I can't index timestamps, is there another way I can sort my queries from newest to oldest without hotspots?
As you note, indexing monotonically increasing values doesn't scale and can lead to hotspots. Whether you are actually affected depends on your particular usage.
As a general rule, the hotspotting point of this pattern is 500 writes per second. If you know you're definitely going to stay under that you probably don't need to worry.
If you do need more than 500 writes per second but have an upper limit in mind, you could try a sharded approach. Basically, if your upper bound on writes per second is x, then n = ceiling(x/500), where n is the number of shards. When you write your timestamp, prepend random(1, n) to the start of the key. This creates n random key ranges, each of which can sustain up to 500 writes per second. When you query your data, you'll need to issue n queries and do some client-side merging of the result streams.
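A minimal sketch of that scheme (the key format is an assumption for illustration):

```javascript
// n = ceiling(maxWritesPerSec / 500): the number of shard prefixes needed
// to keep each key range under Datastore's ~500 writes/sec hotspot limit.
function shardCount(maxWritesPerSec) {
  return Math.ceil(maxWritesPerSec / 500);
}

// Prepend random(1, n) so consecutive timestamps land in different ranges.
function shardedKey(timestampMs, n) {
  const shard = 1 + Math.floor(Math.random() * n);
  return `${shard}-${timestampMs}`;
}
```

Reading the data back then means issuing one query per shard prefix and merging the n sorted result streams on the client.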

HBase Group by time stamp and count

I would like to scan the entire HBase table and get the count of records added on a particular day, on a daily basis.
Since we do not have multiple versions of the columns, I can use the timestamp of the latest version (there will only ever be one).
One approach is MapReduce: the map scans all the rows and emits the timestamp (the actual date) and 1 as the key and value; the reducer then counts by timestamp value. The approach is essentially a group count keyed on timestamp.
Is there a better way of doing this? Once implemented, this job would run daily to verify the counts against other modules (the Hive table row count and the Solr document count). I use this as the starting point for identifying errors at the different integration points in the application.
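The map/reduce shape described above, sketched in-memory: the real job would run over HBase cell timestamps, and dates here are bucketed in UTC for simplicity.

```javascript
// Map: each cell timestamp becomes a (calendar-date, 1) pair.
function mapPhase(timestamps) {
  return timestamps.map(ts => [new Date(ts).toISOString().slice(0, 10), 1]);
}

// Reduce: sum the 1s per date, yielding a per-day record count.
function reducePhase(pairs) {
  const counts = {};
  for (const [date, one] of pairs) {
    counts[date] = (counts[date] || 0) + one;
  }
  return counts;
}
```

The per-day totals from the reducer are what would be compared against the Hive row count and the Solr document count.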

Parse: parse performance if we have 50000 rows in a table

Does Parse take the same time regardless of the amount of data, or does it vary with size? I expect to have 20k to 50k records in a table for a SaaS project.
If you can give me this meta-information I would be grateful.
The time is influenced by record count, but it's also influenced by indexes and the type of query you're running. 50k records isn't a particularly large set, but how it performs really depends on the data and the query.

Parse DB - Lazy-evaluation on queries

Does the Parse.com database engine have built in lazy evaluation for repeated queries?
For example, let's say I have a table with millions of rows and a column that must be summed several times per minute. Obviously one would not sum millions of values every time. Should I maintain a running-total variable that is updated on every row insertion, or would the repeated queries be handled lazily?
On Parse, you should use counters and increment/decrement them based on other actions. Count queries do not scale.
You can use before/afterSave triggers, or other Cloud Code functions, to modify these counters.
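The counter pattern in miniature, as a plain sketch rather than actual Parse code: in Parse this logic would live in an afterSave Cloud Code trigger calling increment() on a stats object, and all names below are hypothetical.

```javascript
// Running totals maintained on write, so reads never scan the table.
const stats = { rowCount: 0, total: 0 };

// Stand-in for Parse.Cloud.afterSave: each insert bumps the counters
// (in Parse, stats.increment('rowCount') would do this atomically).
function afterSaveRow(row) {
  stats.rowCount += 1;
  stats.total += row.value;
}

[{ value: 5 }, { value: 7 }].forEach(afterSaveRow);
// stats now holds the sum and count without touching any stored rows.
```

The trade-off is that every write pays a small fixed cost so that the frequent "sum the column" read becomes a single object fetch.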
