Has anyone faced a situation where NiFi omits some records when using incremental refresh with a maximum-value column? We have noticed that some records are inserted later than the value in the column used for the maximum value would suggest. Is there any way to investigate such a situation?
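One way to investigate is to compare the max-value column against the actual insertion time, if the source table has (or can be given) an audit column. The sketch below assumes such a column exists; the connection string, table name, and column names (source_table, max_value_col, inserted_at) are placeholders, not anything NiFi provides.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class LateArrivalCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder values: the watermark stored in the processor's state,
        // and the time that incremental run actually executed.
        Timestamp watermark = Timestamp.valueOf("2024-01-15 12:00:00");
        Timestamp lastRun   = Timestamp.valueOf("2024-01-15 12:00:05");

        // Rows whose max-value column is already below the stored watermark,
        // but which were physically inserted after the last incremental run,
        // are exactly the rows an incremental fetch would skip.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pw");   // placeholder connection
             PreparedStatement ps = con.prepareStatement(
                 "SELECT id, max_value_col, inserted_at FROM source_table " +
                 "WHERE max_value_col <= ? AND inserted_at > ?")) {
            ps.setTimestamp(1, watermark);
            ps.setTimestamp(2, lastRun);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println("late-arriving row, would be skipped: " + rs.getLong("id"));
                }
            }
        }
    }
}
```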
I am new to Cassandra and I need to store a set of data in a table periodically (every 15 minutes). This set can contain around 1500 records. I have to insert this set into the Cassandra table in such a way that all 1500 records are tied to the same partition key, meaning all 1500 records must be present on the same node.
After 15 minutes, another batch of 1500 records will have to be stored in the same fashion, but with a different partition key.
The GOAL is to compare the last two sets of data and find the records that differ.
So the 1500 records (now) will be compared to the 1500 records (previous), and I need to find out which ones have changed and then apply some business logic to the changed ones.
If I use a timeuuid as the partition key, then each of my 1500 records will have a different timeuuid and thus they will not all be present on the same node.
I was searching for a way to maintain incremental counters in Cassandra, but it seems there is no good way, and besides, maintaining a COUNTER table on a single node is an anti-pattern in a distributed design.
How to create auto increment IDs in Cassandra
Can you please suggest the optimal way to solve this problem?
In simpler words, my requirement comes down to:
How can I compare the current set of data with the previous one ?
By the way, I will be using Spring Boot to connect to and write data to Cassandra.
Thanks in advance!
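One way to get all 1500 records of a batch onto the same partition is to derive the partition key from the 15-minute window itself rather than from a timeuuid. A minimal sketch with the DataStax Java driver; the keyspace, table, and column names (demo, snapshots, bucket, record_id, payload) are illustrative, not from the question:

```java
// Hypothetical schema:
// CREATE TABLE snapshots (
//     bucket    bigint,   -- 15-minute window; shared by all ~1500 rows of one batch
//     record_id text,
//     payload   text,
//     PRIMARY KEY ((bucket), record_id)
// );
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.time.Instant;

public class SnapshotWriter {
    private static final long WINDOW_MS = 15 * 60 * 1000L;

    // Every record written inside the same 15-minute window gets the same partition key.
    static long bucketFor(Instant ts) {
        return ts.toEpochMilli() / WINDOW_MS;
    }

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("demo").build()) {
            PreparedStatement insert = session.prepare(
                    "INSERT INTO snapshots (bucket, record_id, payload) VALUES (?, ?, ?)");
            long bucket = bucketFor(Instant.now());
            session.execute(insert.bind(bucket, "record-1", "some payload"));
            // Comparing the current batch with the previous one is then two
            // single-partition reads: WHERE bucket = ? for bucket and bucket - 1.
        }
    }
}
```

With this layout, comparing "now" against "previous" is just reading partitions bucket and bucket - 1 and diffing them in application code.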
I have been searching for an answer to this today, and it seems the best approach divides opinion somewhat.
I have 150,000 records that I need to retrieve from an Oracle database using JDBC. Is it better to retrieve the data with one select query, letting the JDBC driver take care of transferring the records from the database using an Oracle cursor and the default fetchSize - OR to split the query into batches using LIMIT / OFFSET?
With the LIMIT / OFFSET option, I think the pro is that you can control the number of results returned in each chunk. The cons are that the query is executed multiple times, and you also need to run a COUNT(*) up front using the same query to calculate the number of iterations required.
The pro of retrieving everything at once is that you rely on the JDBC driver to manage the retrieval of data from the database. The con is that the setFetchSize() hint can sometimes be ignored, meaning we could end up with a huge ResultSet containing all 150,000 records at once!!
It would be great to hear some real-life experiences of solving similar issues; recommendations would be much appreciated.
The native way in Oracle JDBC is to prepare the statement for the query, execute it, and fetch the results in a loop with a defined fetchSize.
Yes, of course the details depend on the Oracle Database and JDBC driver versions, and in some cases the requested fetchSize can be ignored. But the typical problem is that the fetch size gets reset to fetchSize = 1, so you effectively make a round trip for each record (not that you get all records at once).
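For reference, a minimal sketch of that native approach; the connection string, table, and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchAll {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pw");   // placeholder connection
             PreparedStatement ps = con.prepareStatement("SELECT id, payload FROM big_table")) {
            ps.setFetchSize(1000);      // rows per round trip; tune this, don't trust the default
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Only the current fetch window is held in memory, not all 150,000 rows.
                    process(rs.getLong("id"), rs.getString("payload"));
                }
            }
        }
    }

    private static void process(long id, String payload) {
        // business logic per row
    }
}
```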
Your alternative with LIMIT seems meaningful at first view, but if you investigate the implementation you will probably decide not to use it.
Say you divide the result set into 15 chunks of 10K each:
You open 15 queries, each of them consuming on average about half the resources of the original query (OFFSET still selects the skipped rows and then discards them).
So the only thing you achieve is that the processing takes roughly 15 × 0.5 = 7.5 times longer.
Best Practice
Take your query, write a simple script with a JDBC fetch loop, and use a 10046 trace to see the fetch size that is actually used.
Test with a range of fetch sizes, observe the performance, and choose the optimal one.
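A rough harness for that test might look like this; the SQL and the candidate fetch sizes are placeholders, and the timings are wall-clock only:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeProbe {
    // Runs the same query with several fetch sizes and reports elapsed time.
    // A 10046 trace on the session would confirm the fetch size actually used.
    static void probe(Connection con, String sql) throws Exception {
        for (int fetchSize : new int[] {10, 100, 500, 1000, 5000}) {
            long start = System.nanoTime();
            int rows = 0;
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setFetchSize(fetchSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;     // drain the result set; measure pure fetch cost
                    }
                }
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("fetchSize=%d rows=%d elapsed=%dms%n", fetchSize, rows, ms);
        }
    }
}
```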
My preference is to maintain a safe execution time with the ability to continue if interrupted. I prefer this approach because it is future-proof and respects memory and execution-time limits. Remember, you're not planning for today, you're planning for six months down the road: what may be 150,000 today may be 1.5M in six months.
I use a length + 1 recipe to know if there is more to fetch, although the count query would let you show a progress bar in % if that is important.
A 150,000-record result set is really a memory-pressure question, and the answer depends on the average size of each row. If a row is three integers, that's small. If a row carries a bunch of text elements storing user-profile details, that's potentially very large. So be prudent about which fields you pull.
You also need to ask whether you really have to pull all the records every time. It may be useful to apply a sync pattern and only pull records with an updated date newer than your last pull.
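A sketch combining both ideas, the length + 1 check and the updated-date sync filter. It assumes Oracle 12c-or-later FETCH FIRST syntax and an updated_at column; the table and column names are invented for the example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

public class IncrementalPull {
    static final int PAGE_SIZE = 1000;

    static class Page {
        final List<String> rows = new ArrayList<>();
        boolean hasMore;           // set when the PAGE_SIZE + 1'th row shows up
    }

    // Pull one page of rows changed since the last successful sync.
    // Asking for PAGE_SIZE + 1 rows tells us whether another page exists
    // without running a separate COUNT(*).
    static Page pull(Connection con, Timestamp lastSync) throws Exception {
        String sql = "SELECT id, payload FROM big_table "
                   + "WHERE updated_at > ? ORDER BY updated_at, id "
                   + "FETCH FIRST " + (PAGE_SIZE + 1) + " ROWS ONLY";   // Oracle 12c+ syntax (assumption)
        Page page = new Page();
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSync);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    if (page.rows.size() == PAGE_SIZE) {
                        page.hasMore = true;            // caller schedules another pull
                        break;
                    }
                    page.rows.add(rs.getLong("id") + ":" + rs.getString("payload"));
                }
            }
        }
        return page;
    }
}
```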
I am using Google Datastore and will need to query it to retrieve some entities, sorted from newest to oldest. My first thought was to have a date_created property containing a timestamp, then index this field and sort on it. The problem with this approach is that it can cause hotspots in the database (https://cloud.google.com/datastore/docs/best-practices):
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.
Obviously, sorting data on dates is probably the most common sorting performed on a database. If I can't index timestamps, is there another way I can sort my queries from newest to oldest without hotspots?
As you note, indexing monotonically increasing values doesn't scale and can lead to hotspots. Whether you are actually impacted by this depends on your particular usage.
As a general rule, the hotspotting point for this pattern is about 500 writes per second. If you know you're definitely going to stay under that, you probably don't need to worry.
If you do need more than 500 writes per second but have an upper limit in mind, you could attempt a sharded approach. Basically, if your upper limit on writes per second is x, then n = ceiling(x / 500), where n is the number of shards. When you write your timestamp, prepend random(1, n) at the start. This creates n random key ranges, each of which can sustain up to 500 writes per second. When you query your data, you'll need to issue n queries and do some client-side merging of the result streams.
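A sketch of that sharded layout with the google-cloud-datastore Java client; the kind and property names (Event, shard, created, payload) and the shard count are made up for the example:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

import com.google.cloud.Timestamp;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.FullEntity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;
import com.google.cloud.datastore.StructuredQuery.OrderBy;
import com.google.cloud.datastore.StructuredQuery.PropertyFilter;

public class ShardedTimestamps {
    static final int NUM_SHARDS = 4;   // n = ceiling(expected writes per second / 500)

    static final Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

    // Write: pick a random shard so no single index range absorbs every new timestamp.
    static void write(String payload) {
        long shard = ThreadLocalRandom.current().nextInt(1, NUM_SHARDS + 1);
        FullEntity<?> event = FullEntity.newBuilder(
                datastore.newKeyFactory().setKind("Event").newKey())
            .set("shard", shard)
            .set("created", Timestamp.now())
            .set("payload", payload)
            .build();
        datastore.add(event);
    }

    // Read: one query per shard, newest first, merged client-side.
    static List<Entity> latest(int limit) {
        List<Entity> merged = new ArrayList<>();
        for (long shard = 1; shard <= NUM_SHARDS; shard++) {
            Query<Entity> q = Query.newEntityQueryBuilder()
                .setKind("Event")
                .setFilter(PropertyFilter.eq("shard", shard))
                .setOrderBy(OrderBy.desc("created"))
                .setLimit(limit)
                .build();
            QueryResults<Entity> results = datastore.run(q);
            while (results.hasNext()) {
                merged.add(results.next());
            }
        }
        merged.sort(Comparator.comparing((Entity e) -> e.getTimestamp("created")).reversed());
        return merged.subList(0, Math.min(limit, merged.size()));
    }
}
```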
I would like to scan an entire HBase table and get the count of records added on a particular day, on a daily basis.
Since we do not have multiple versions of the columns, I can use the timestamp of the latest version (which will always be the only one).
One approach is to use MapReduce: the map scans all the rows and emits the timestamp (the actual date) and 1 as the key and value, and the reducer then counts per timestamp. The approach is essentially a group count keyed on timestamp.
Is there a better way of doing this? Once implemented, this job would run daily to verify the counts against other modules (Hive table row count and Solr document count). I use this as the starting point to identify any errors during the flow at the different integration points in the application.
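For illustration, a minimal sketch of that map/reduce job using the HBase TableMapper API; the table name and output path are placeholders:

```java
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyRowCount {

    // Map: emit (yyyy-MM-dd of the row's cell timestamp, 1) for every row scanned.
    static class DayMapper extends TableMapper<Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");

        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            long ts = value.rawCells()[0].getTimestamp();   // single version per column
            context.write(new Text(fmt.format(new Date(ts))), ONE);
        }
    }

    // Reduce: sum the 1s per day, i.e. a group count keyed on the date.
    static class DayReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text day, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) total += c.get();
            context.write(day, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "daily-row-count");
        job.setJarByClass(DailyRowCount.class);
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);      // recommended for MapReduce scans
        TableMapReduceUtil.initTableMapperJob(
                "my_table", scan, DayMapper.class, Text.class, LongWritable.class, job);
        job.setReducerClass(DayReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```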
We have real-time data coming into our system, and online queries that we need to serve. To serve these online queries faster, we do some pre-processing of the data.
Now my question is: how do I preprocess the online real-time data? There has to be a way for me to figure out whether a piece of data has already been processed or not. To track this, I am considering the following approaches:
1. I can have a flag that says whether the data is processed or unprocessed, based on which I can decide whether to process it.
2. I can have a column family where I insert the data with a TTL, plus a topic in a message bus like Kafka that gives me the row identifier in Cassandra so that I can process that row.
3. I can have a column family per day, plus a topic in a message bus like Kafka that gives me the row identifier in the corresponding column family.
4. I can have a keyspace per day, plus a topic in a message bus like Kafka that gives me the row identifier in the corresponding column family.
I read somewhere that as the number of deletions increases, the number of tombstones grows and query times slow down. Now I am confused about which of the four approaches above to choose, or whether there is a better way to solve this.
According to the DataStax blog post below, the third option (a column family per day) might be a better fit.
Cassandra Anti-patterns
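For illustration, a sketch of how the third option could look: a table per day plus a Kafka topic carrying row identifiers. The keyspace, table prefix, topic, and column names are invented for the example; dropping a whole day's table later avoids the per-row deletes that generate tombstones.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Collections;
import java.util.Properties;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DailyTableProcessor {

    // Table-per-day keeps cleanup simple: old days are dropped whole (DROP TABLE)
    // instead of deleted row by row, which would leave tombstones behind.
    static String tableFor(LocalDate day) {
        return "events_" + day.format(DateTimeFormatter.BASIC_ISO_DATE);   // e.g. events_20240115
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "preprocessor");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (CqlSession session = CqlSession.builder().withKeyspace("realtime").build();
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("rows-to-process"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : batch) {
                    // The message value carries the row identifier written by the ingest path.
                    String rowId = rec.value();
                    Row row = session.execute(
                        "SELECT * FROM " + tableFor(LocalDate.now()) + " WHERE id = ?", rowId).one();
                    if (row != null) {
                        // preprocess(row) ... then persist the derived view for online queries
                    }
                }
            }
        }
    }
}
```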