HBase Group by time stamp and count - hadoop

I would like to scan the entire HBase table and get a count of the number of records added on each particular day, on a daily basis.
Since we do not keep multiple versions of the columns, I can use the timestamp of the latest version (there will only ever be one).
One approach is to use MapReduce: the map phase scans all the rows and emits the timestamp (truncated to the actual date) as the key and 1 as the value; in the reducer I would then count per timestamp key. The approach is essentially a group-and-count by timestamp.
Is there a better way of doing this? Once implemented, this job would run daily to verify the counts against other modules (the Hive table row count and the Solr document count). I use it as the starting point for identifying errors in the flow at the different integration points in the application.
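To make the approach concrete, here is a minimal sketch of that mapper/reducer pair using the standard HBase TableMapper API; the table name, column family, job wiring, and date format are left out or assumed, so treat it as an outline rather than a finished job.

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (date of the row's latest cell timestamp, 1) for every row.
public class DailyCountMapper extends TableMapper<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        if (value.isEmpty()) {
            return;
        }
        // Only one version exists per column, so the first cell's timestamp
        // is the write time of the record.
        long ts = value.rawCells()[0].getTimestamp();
        context.write(new Text(day.format(new Date(ts))), ONE);
    }
}

// Reducer: sum the 1s per day, i.e. the group-and-count by timestamp.
class DailyCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text date, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable c : counts) {
            total += c.get();
        }
        context.write(date, new LongWritable(total));
    }
}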

Related

Cassandra | How can I compare the current set of data with the previous one?

I am new to Cassandra and I need to store a set of data in a table periodically (every 15 minutes). Each set can be around 1500 records. I have to insert this set into a Cassandra table in such a way that all 1500 records are tied to the same partition key, meaning all 1500 records must live on the same node.
After 15 minutes, another batch of 1500 records will have to be stored in the same fashion, but with a different partition key.
The GOAL is to compare the last two sets of data and find the records with differences.
So the 1500 records (current) will be compared to the 1500 records (previous), and I need to find out which ones have changed and then run some business logic on the changed ones.
If I use a timeuuid as the partition key, then each of my 1500 records will have a different timeuuid and thus they will not land on the same node.
I looked into maintaining incremental counters in Cassandra, but there seems to be no good way to do it, and besides, maintaining a COUNTER table on a single node is an anti-pattern in a distributed design.
How to create auto increment IDs in Cassandra
Can you please suggest an optimal way to solve this problem?
In simpler words, my requirement comes down to:
How can I compare the current set of data with the previous one?
By the way, I will be using Spring Boot to connect to and write data to Cassandra.
Thanks in advance!
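One possible data model, sketched with Spring Data Cassandra since Spring Boot is mentioned: use the 15-minute batch window (the collection time truncated to 15 minutes) as the partition key, so all ~1500 records of a batch share one partition, and a stable record identifier as the clustering column. Comparing "current" with "previous" then becomes two single-partition reads joined on the record id in application code. The table and column names below are made up.

import java.time.Instant;

import org.springframework.data.cassandra.core.cql.PrimaryKeyType;
import org.springframework.data.cassandra.core.mapping.PrimaryKeyColumn;
import org.springframework.data.cassandra.core.mapping.Table;

// One row per record; all records of the same 15-minute batch share a partition.
@Table("snapshot_batch")
public class SnapshotRecord {

    // Partition key: the start of the 15-minute batch window.
    @PrimaryKeyColumn(name = "batch_time", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
    private Instant batchTime;

    // Clustering column: a stable business identifier, so the same record
    // can be matched across consecutive batches.
    @PrimaryKeyColumn(name = "record_id", ordinal = 1, type = PrimaryKeyType.CLUSTERED)
    private String recordId;

    private String payload;

    // getters/setters omitted for brevity

    // Helper: truncate a timestamp to its 15-minute window.
    public static Instant batchWindow(Instant now) {
        long quarter = 15 * 60;
        return Instant.ofEpochSecond((now.getEpochSecond() / quarter) * quarter);
    }
}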

Google datastore - index a date created field without having a hotspot

I am using Google Datastore and will need to query it to retrieve some entities. These entities need to be sorted from newest to oldest. My first thought was to have a date_created property containing a timestamp; I would then index this field and sort on it. The problem with this approach is that it can cause hotspots in the database (https://cloud.google.com/datastore/docs/best-practices).
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.
Obviously, sorting data by date is probably the most common sort performed on a database. If I can't index timestamps, is there another way I can sort my queries from newest to oldest without hotspots?
As you note, indexing monotonically increasing values doesn't scale and can lead to hotspots. Whether you are actually impacted by this depends on your particular usage.
As a general rule, the hotspotting threshold for this pattern is 500 writes per second. If you know you're definitely going to stay under that, you probably don't need to worry.
If you do need more than 500 writes per second but have an upper limit in mind, you could attempt a sharded approach. Basically, if your upper bound on writes per second is x, then n = ceiling(x/500), where n is the number of shards. When you write your timestamp, prepend random(1, n) at the start. This creates n random key ranges, each of which can sustain up to 500 writes per second. When you query your data, you'll need to issue n queries and do some client-side merging of the result streams.
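A rough illustration of that sharding idea, kept independent of any particular Datastore client library; the shard count, property format, and merge strategy are all assumptions, not an official API.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the sharded-timestamp pattern: prepend a random shard id to the
// indexed value on write, then fan out over all shards on read and merge.
public class ShardedTimestamp {

    // n = ceiling(expectedWritesPerSecond / 500)
    private final int numShards;

    public ShardedTimestamp(int expectedWritesPerSecond) {
        this.numShards = (int) Math.ceil(expectedWritesPerSecond / 500.0);
    }

    // Value to store in the indexed property, e.g. "3_0000001546300800000".
    // The random prefix spreads writes across n separate key ranges.
    public String indexedValue(long timestampMillis) {
        int shard = ThreadLocalRandom.current().nextInt(1, numShards + 1);
        return shard + "_" + String.format("%019d", timestampMillis);
    }

    // On read: run one query per shard prefix, then merge the per-shard
    // results client-side by the real timestamp, newest first.
    public <T> List<T> merge(List<List<T>> perShardResults, Comparator<T> byTimestampDesc) {
        List<T> merged = new ArrayList<>();
        for (List<T> shardResults : perShardResults) {
            merged.addAll(shardResults);
        }
        merged.sort(byTimestampDesc);
        return merged;
    }
}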

Group By Date in Rethinkdb is slow

I am trying to group by date, as follows, to get a total count:
r.db('analytic').table('events').group([r.row('created_at').inTimezone("+08:00").year(), r.row('created_at').inTimezone("+08:00").month(),r.row('created_at').inTimezone("+08:00").day()]).count()
However, it is slow: it took over 2 seconds for 17,656 records.
Is there any way to make a group by date faster?
If you want to group and count all the records, it's going to have to read every record, so the speed will be determined mostly by your hardware rather than the specific query. If you only want one particular range of dates, you could get that much faster with an indexed between query.
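For the single-date-range case, a sketch using the RethinkDB Java driver, assuming a secondary index has been created on created_at; the connection details, year, and index name are assumptions, and the exact return type of run() differs between driver versions.

import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;

public class EventsForOneDay {
    private static final RethinkDB r = RethinkDB.r;

    public static void main(String[] args) {
        Connection conn = r.connection().hostname("localhost").port(28015).connect();

        // One-time setup (run once, not per query), so range scans can use the index:
        // r.db("analytic").table("events").indexCreate("created_at").run(conn);

        // Count only the events of a single day (+08:00) via the index,
        // instead of grouping over the whole table.
        Object total = r.db("analytic").table("events")
                .between(
                        r.time(2016, 1, 1, "+08:00"),
                        r.time(2016, 1, 2, "+08:00"))
                .optArg("index", "created_at")
                .count()
                .run(conn);

        // Depending on the driver version, run() returns the value directly
        // or wraps it in a Result object.
        System.out.println(total);
        conn.close();
    }
}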

How to append the data to existing hive table without partition

I have created a Hive table which contains historical stock data for the past 10 years. From now on, I have to append data on a daily basis.
I thought of partitioning the table by date, but that leads to many partitions, approximately 3,000, plus a new partition for every new date; I don't think this is feasible.
Can anyone suggest the best approach for storing all the historical data in the table and appending the new data as it comes?
As with every partitioned table, the decision on how to partition it depends primarily on how you are going to query the table.
Another consideration is how much data you're going to have per partition, as partitions should not be too small. As an absolute minimum, each one should be at least as big as one HDFS block, since otherwise you end up with too many directories.
That said, I don't think 3,000 partitions would be a problem. At a previous job we had a huge table with one partition per hour; each hour was about 20 GB, and we had 6 months of data, so about 4,000 partitions, and it worked just fine.
In our case, most people cared most about the last week and the last day.
As a first step, I suggest you research how the table is going to be used, that is, will all 10 years be queried, or mostly just the most recent data?
As a second step, study how big the data is, consider whether it may grow with the new loads, and see how big each partition is going to be.
Once you've determined these two points, you can make a decision: you could just use daily partitions (which could be fine; 3,000 partitions is not bad), or you could go weekly or monthly.
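Purely to illustrate the daily-partition option, a sketch executed through the Hive JDBC driver; the table name, columns, file path, and connection URL below are all made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateDailyPartitionedTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint.
        Connection con = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "hive", "");
        try (Statement stmt = con.createStatement()) {
            // Historical stock data partitioned by trade date (one partition per day).
            stmt.execute("CREATE TABLE IF NOT EXISTS stock_history ("
                    + " symbol STRING, open_price DOUBLE, close_price DOUBLE, volume BIGINT)"
                    + " PARTITIONED BY (trade_date STRING)"
                    + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            // Daily append: load the new day's file straight into its own partition.
            stmt.execute("LOAD DATA LOCAL INPATH '/data/stocks/2019-01-02.csv'"
                    + " INTO TABLE stock_history PARTITION (trade_date = '2019-01-02')");
        } finally {
            con.close();
        }
    }
}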
You can use this command:
LOAD DATA LOCAL INPATH '<FILE_PATH>' INTO TABLE <TABLE_NAME>;
It will create new files under the HDFS directory mapped to the table name. Even though this avoids having too many partitions, you will still run into a too-many-files issue.
Periodically, you need to do this:
Create a stage table.
Move the data from the target table into the stage table by running a LOAD command.
Run an INSERT into the target table selecting from the stage table.
The target table now holds the data in a number of files equal to the number of reducers.
Drop the stage table.
You can run this process at regular intervals (probably once a month); a sketch follows below.
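A hedged sketch of that periodic compaction cycle, again through the Hive JDBC driver. The table names, warehouse path, and connection URL are placeholders, and the LOAD path must point at the target table's actual HDFS directory with a matching file format.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Compact a table that accumulates one small file per daily load: move the
// files into a stage table, then rewrite them back through an INSERT, which
// leaves far fewer files (roughly one per reducer) instead of one per load.
public class CompactStockTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "hive", "");
        try (Statement stmt = con.createStatement()) {
            // 1. Create the stage table with the same layout as the target.
            stmt.execute("CREATE TABLE IF NOT EXISTS stock_data_stage LIKE stock_data");

            // 2. Move the target table's files into the stage table.
            //    The path below is a placeholder for the target table's warehouse directory.
            stmt.execute("LOAD DATA INPATH '/user/hive/warehouse/stock_data'"
                    + " INTO TABLE stock_data_stage");

            // 3. Rewrite the data back into the (now empty) target table.
            stmt.execute("INSERT INTO TABLE stock_data SELECT * FROM stock_data_stage");

            // 4. Drop the stage table.
            stmt.execute("DROP TABLE stock_data_stage");
        } finally {
            con.close();
        }
    }
}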

Parse DB - Lazy-evaluation on queries

Does the Parse.com database engine have built-in lazy evaluation for repeated queries?
For example, let's say I have a table with millions of rows and there is a column that must be summed several times per minute. Obviously one would not sum millions of values every time. Should I keep a running-total variable which is updated upon every row insertion, or would the repeated queries be handled lazily?
On Parse, you should use counters and increment/decrement them based on other actions. Count queries do not scale.
You can use before/afterSave triggers or other cloud functions, in Cloud Code, to modify these counters.
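The answer refers to Cloud Code (JavaScript) triggers; purely to illustrate the counter pattern itself, here is a sketch using the Parse Android/Java SDK's atomic increment from the client side instead. The class and field names are made up.

import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;

// Keep a running total in a single "Stats" object and bump it atomically
// whenever a row is added, instead of summing millions of rows per query.
public class RunningTotal {

    public static void addRow(double amount) throws ParseException {
        // The new row itself.
        ParseObject row = new ParseObject("Row");
        row.put("amount", amount);
        row.save();

        // Atomically increment the pre-aggregated total.
        ParseQuery<ParseObject> query = ParseQuery.getQuery("Stats");
        ParseObject stats = query.getFirst();
        stats.increment("total", amount);
        stats.save();
    }
}

In Cloud Code, the same increment would live in an afterSave trigger on the row class, which keeps the counter update server-side.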
