Goal: read a big Cassandra table and process it row by row, in parallel
Constraints:
not all rows in memory
no Spark, we have to use Camel
One shot, no polling of the table
I did a first working version with CassandraQL, but this component seems limited to a single query with everything in memory; I did not find any mechanism like fetchSize/fetchMoreResults. I looked at the CassandraProducer class, PollingConsumerPollingStrategy, ResultSetConversionStrategy... and saw nothing.
Would it be possible to read the table in chunks of, say, 1000 elements, where each chunk generates an exchange that is later split across different threads?
I think a ProducerTemplate injecting the first exchanges into the route could be the answer, but I don't understand how to control the rate at which exchanges are produced so that too many rows don't end up in memory (to do so, we would for example need to check the size of the next blocking queue and, if more than X elements are still unconsumed, wait before producing more messages).
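To make the idea concrete, this is roughly what I have in mind, assuming the DataStax 3.x driver and a bounded Camel SEDA queue (the keyspace, table and endpoint options are only illustrative, not a working solution):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import org.apache.camel.CamelContext;
import org.apache.camel.ProducerTemplate;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class CassandraChunkFeeder {

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Bounded in-memory queue (max 1000 rows) drained by 4 threads.
                from("seda:rows?size=1000&concurrentConsumers=4")
                    .process(exchange -> {
                        Row row = exchange.getIn().getBody(Row.class);
                        // process one row here
                    });
            }
        });
        context.start();

        ProducerTemplate template = context.createProducerTemplate();

        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try {
            Session session = cluster.connect("my_keyspace");
            // The driver pages transparently: only ~1000 rows are fetched at a time.
            Statement stmt = new SimpleStatement("SELECT * FROM big_table").setFetchSize(1000);
            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {
                // Same queue name and size as the route; blockWhenFull makes this
                // call wait instead of piling up rows when the consumers are slower.
                template.sendBody("seda:rows?size=1000&blockWhenFull=true", row);
            }
        } finally {
            cluster.close();
            context.stop();
        }
    }
}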
Maybe there are other options to do something like this ?
Maybe I did not see the magic parameter in CassandraQL ?
Maybe we can override some parts of CassandraQL ?
Thank you
This is not going to be a direct answer to your question, but I hope it kicks off some discussion. As someone learning Cassandra and having spent a bit of time on it, the question got me thinking. It mainly targets the fetchSize/fetchMoreResults part of the question.
First of all, two of your constraints contradict each other:
Not all rows in memory
I don't want them all fetched at once
One shot, no polling of the partition
I don't want to go back to the db more than once.
Unless what you actually meant is:
Not all rows in memory
I don't want them all fetched at once
You can go back to the partition many times, as long as you go straight back to where you left off last time.
As long as the time it takes to fetch the first page is the same as the time it takes to fetch the second page, and the time for the 19th page is the same as the time for the 20th page.
i.e. not starting from the first row each time
So I am going to assume you meant the second scenario and go with it.
Cassandra queries are going to satisfy the following two properties:
They are going to have a restriction on clustering columns
They are already ordered by clustering columns
Now consider the following table:
department (partition key), firstName (clustering key), personId (clustering key), lastname, etc. as regular columns
First query
select department, firstName, lastname, etc
from person
where department = 'depart1'
order by firstName ASC
limit 25;
Second query (let's say the last record on that page had firstName='kavi' and personId=25):
select department, firstName, lastname, etc
from person
where department = 'depart1' and (firstName, personId) > ('kavi', 25)
order by firstName ASC
limit 25;
As you can see, we can easily construct a Cassandra query that brings back each chunk of a given size in roughly constant time.
Now back to the integration framework.
I remember a concept called a watermark in Mule, where the endpoint can store a value and remember it so that it can start from there next time. In this case, the firstName and personId of the last record of the previous page are the watermark, which is what lets you issue the second query. I am sure we should be able to do the same with Camel.
I hope I have convinced you that polling is not an issue when each chunk is retrieved in constant time.
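Here is a rough Java sketch of that watermark loop with the DataStax 3.x driver, using the person table from the example (the keyspace name, the page size of 25 and the assumption that personId is an int are placeholders, and it relies on the multi-column clustering restriction shown above):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class WatermarkPager {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        PreparedStatement firstPage = session.prepare(
            "SELECT department, firstName, personId, lastname FROM person "
          + "WHERE department = ? LIMIT 25");
        PreparedStatement nextPage = session.prepare(
            "SELECT department, firstName, personId, lastname FROM person "
          + "WHERE department = ? AND (firstName, personId) > (?, ?) LIMIT 25");

        // Watermark = clustering key of the last row of the previous page.
        String lastFirstName = null;
        Integer lastPersonId = null;

        while (true) {
            ResultSet rs = (lastFirstName == null)
                ? session.execute(firstPage.bind("depart1"))
                : session.execute(nextPage.bind("depart1", lastFirstName, lastPersonId));

            int rowsInPage = 0;
            for (Row row : rs) {
                // process(row);
                lastFirstName = row.getString("firstName");
                lastPersonId = row.getInt("personId");
                rowsInPage++;
            }
            if (rowsInPage < 25) {
                break;   // last page reached
            }
        }
        cluster.close();
    }
}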
Related
In my application I am making a query to Oracle and getting the data this way:
<select id="getAll" resultType="com.mappers.MyOracleMapper">
SELECT * FROM "OracleTable"
</select>
I get all the data, but the problem is that there is a lot of it and processing it all at once takes too long; the response from the database takes 3-4 minutes, which is not convenient.
How can I receive the rows in portions without using the id field (since it does not exist, I do not know why)? That is, take the first portion of rows, for example the first 50, process them, and then take the next portion. Ideally, the number of rows per portion would be set in a properties variable.
I can't figure out how to do this in MyBatis. This is new to me. Thanks in advance.
There is such a field and it is unique.
OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY
does not work, because the Oracle version is earlier than 12c.
If you want to read millions of rows that's going to take time. It's normal to expect a few minutes to read and receive all the data over the wire.
Now, you have two options:
Use a Cursor
In MyBatis you can read the result of the query using the buffering a cursor gives you. The cursor reads a few hundred rows at a time and your app reads them one by one. Your app doesn't notice that behind the scenes there is buffering. Pretty good. For example, you can do:
Cursor<Client> clients = this.sqlSession.selectCursor("getAll");
for (Client c : clients) {
// process one client
}
Consider that cursors remain open until the end of the transaction. If you close the transaction (or exit the method marked as @Transactional) the cursor won't be usable anymore.
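If you use mapper interfaces, the same cursor can be declared as the mapper's return type and consumed inside a read-only transaction. A minimal sketch, assuming a ClientMapper interface and Spring transaction management (both names are assumptions, not part of your setup):

import org.apache.ibatis.cursor.Cursor;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class ClientBatchService {

    private final ClientMapper clientMapper;   // mapper declaring: Cursor<Client> getAll();

    public ClientBatchService(ClientMapper clientMapper) {
        this.clientMapper = clientMapper;
    }

    // The cursor must be consumed inside the transaction that opened it.
    @Transactional(readOnly = true)
    public void processAll() {
        try (Cursor<Client> clients = clientMapper.getAll()) {
            for (Client c : clients) {
                // process one client at a time; only the driver's fetch buffer is in memory
            }
        }
    }
}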
Use Manual Pagination
This solution can work well for the first pages of the result set, but it becomes increasingly inefficient and slow the further you advance into the result set. Use it only as a last resort.
The only case where this strategy can be efficient is when you have the chance of implementing "key set pagination". I assume it's not the case here.
You can modify your query to perform explicit pagination. For example, you can do:
<select id="getPage" resultType="com.mappers.MyOracleMapper">
select * from (
SELECT rownum rnum, x.*
FROM OracleTable
WHERE rownum <= #{endingRow}
ORDER BY id
) x
where rnum >= #{startingRow}
</select>
You'll need to provide the extra parameters startingRow and endingRow.
NOTE: It's imperative that you include an ORDER BY clause; otherwise the pagination logic is meaningless. Choose any ordering you want, preferably something backed by an existing index.
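A minimal sketch of the calling loop, using the getPage statement above; the class name and the idea that the page size comes from a property are made up:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.ibatis.session.SqlSession;

public class PagedReader {

    // e.g. loaded from a property such as app.page-size
    private static final int PAGE_SIZE = 50;

    public void processAll(SqlSession sqlSession) {
        int startingRow = 1;
        while (true) {
            Map<String, Object> params = new HashMap<>();
            params.put("startingRow", startingRow);
            params.put("endingRow", startingRow + PAGE_SIZE - 1);

            List<MyOracleMapper> page = sqlSession.selectList("getPage", params);
            for (MyOracleMapper row : page) {
                // process one row
            }
            if (page.size() < PAGE_SIZE) {
                break;          // last page
            }
            startingRow += PAGE_SIZE;
        }
    }
}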
This is a problem I have been thinking about for a long time but I haven't written any code yet because I first want to solve some general problems I am struggling with. This is the main one.
Background
A single page web application makes requests for data to some remote API (which is under our control). It then stores this data in a local cache and serves pages from there. Ideally, the app remains fully functional when offline, including the ability to create new objects.
Constraints
Assume a server-side database of products containing roughly 50,000 products (about 50 MB)
Assume no db type, we interact with it via REST/GraphQL interface
Assume a single product record is < 1kB
Assume a max payload for a resultset of 256kB
Assume max 5MB storage on the client
Assume search result sets ranging between 0 ... 5000 items per search
Challenge
The challenge is to define a stateless but network-efficient way to fetch pages from a result set, so that it is deterministic which results we will get.
Example
In traditional paging, when getting the next 100 results for some query using this url:
https://example.com/products?category=shoes&firstResult=100&pageSize=100
the search result may look like this:
{
"totalResults": 2458,
"firstResult": 100,
"pageSize": 100,
"results": [
{"some": "item"},
{"some": "other item"},
// 98 more ...
]
}
The problem with this is that there is no way, based on this information, to get exactly the objects that are on a certain page. Because by the time we request the next page, the result set may have changed (due to changes in the DB), influencing which items are part of the result set. Even a small change can have a big impact: one item removed from the DB, that happened to be on page 0 of the result set, will change what results we will get when requesting all subsequent pages.
Goal
I am looking for a mechanism to make the definition of the result set independent of future database changes, so that if someone was looking for shoes and got a result set of 2458 items, he could actually fetch all pages of that result set reliably, even if the underlying data changes later (for this purpose I plan not to really delete items, but to set a removed flag on them).
Ideas so far
I have seen a solution where the result set included a "pages" property, which was an array with the first and last id of the items in that page. Assuming your IDs keep going up in number and you don't really delete items from the DB ever, the number of items between two IDs is constant. Meaning the app could get all items between those two IDs and always get the exact same items back. The problem with this solution is that it only works if the list is sorted in ID order... I need custom sorting options.
The only way I have come up with for now is to just send a list of all IDs in the result set... That way pages can be fetched by doing a SELECT * FROM products WHERE id IN (3,4,6,9,...)... but this feels rather inelegant...
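For example, a trivial sketch of that idea: freeze the ID list once, then cut it into pages, each of which can later be fetched with an IN query (class and method names are made up):

import java.util.ArrayList;
import java.util.List;

public class IdPageFetcher {

    private static final int PAGE_SIZE = 100;

    // Split the frozen result-set IDs into pages; each page is later fetched
    // with a "SELECT ... WHERE id IN (...)" query against the API.
    public static List<List<Long>> toPages(List<Long> resultSetIds) {
        List<List<Long>> pages = new ArrayList<>();
        for (int i = 0; i < resultSetIds.size(); i += PAGE_SIZE) {
            pages.add(resultSetIds.subList(i, Math.min(i + PAGE_SIZE, resultSetIds.size())));
        }
        return pages;
    }
}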
Anyway, I hope this is not too broad or theoretical. I have a web-based DB, just no good idea of how to do paging with it. I am looking for answers that point me in a direction to learn, not full solutions.
Versioning the DB is the answer for result-set consistency.
Each record has a primary id, a modification counter (version number) and a timestamp of modification/creation. Instead of modifying record r, you add a new record with the same id, version number + 1 and sysdate as the modification time.
In the fetch response you add the DB request_time (do not use a client timestamp, due to possible clock differences between client and server). The first page is served normally, but you return sysdate as request_time. Subsequent pages are served differently: you add a condition like modification_time <= request_time for each versioned table.
You can cache the result set of IDs on the server side when a query arrives for the first time and return a unique ID to the frontend. This unique ID corresponds to the result set for that query. The frontend can then request something like next_page with the unique ID it got the first time it made the query. You should still go ahead with your approach of changing the DELETE operation to a removed flag, because it ensures that none of the entries from the result set is deleted. You can discard the result set from the cache when the frontend reaches the end of it, or you can set a time limit on the lifetime of the cache entry.
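A minimal sketch of that server-side cache, with made-up names; a real implementation would also expire entries by TTL as described above:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ResultSetCache {

    private final Map<String, List<Long>> cache = new ConcurrentHashMap<>();

    // First request: freeze the matching IDs and hand back a token.
    public String store(List<Long> matchingIds) {
        String token = UUID.randomUUID().toString();
        cache.put(token, matchingIds);
        return token;
    }

    // "next_page" requests: resolve the token back to the frozen ID list
    // and return just the slice of IDs for the requested page.
    public List<Long> page(String token, int pageNumber, int pageSize) {
        List<Long> ids = cache.get(token);
        int from = pageNumber * pageSize;
        if (ids == null || from >= ids.size()) {
            return Collections.emptyList();
        }
        return ids.subList(from, Math.min(from + pageSize, ids.size()));
    }

    // Drop the entry when the client reaches the end of the result set.
    public void release(String token) {
        cache.remove(token);
    }
}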
I have a site with millions of users (well, actually it doesn't have any yet, but let's imagine), and I want to calculate some stats like "log-ins in the past hour".
The problem is similar to the one described here: http://highscalability.com/blog/2008/4/19/how-to-build-a-real-time-analytics-system.html
The simplest approach would be to do a select like this:
select count(distinct user_id)
from logs
where date>='20120601 1200' and date <='20120601 1300'
(of course other conditions could apply for the stats, like log-ins per country)
Of course this would be really slow, especially if it has millions (or even thousands) of rows, and I want to run this query every time a page is displayed.
How would you summarize the data? What should go to the (mem)cache?
EDIT: I'm looking for a way to de-normalize the data, or to keep the cache up to date. For example, I could increment an in-memory variable every time someone logs in, but that would only tell me the total number of logins, not the "logins in the last hour". Hope it's clearer now.
IMO the more correct approach here would be to implement a continuous calculation that holds the relevant counters in memory. Every time a user is added to your system you can fire an event, which can be processed in multiple ways to update last-hour, last-day or even total-user counters. There are some great frameworks out there for this sort of processing: Twitter Storm is one of them, another is GigaSpaces XAP (disclaimer: I work for GigaSpaces) and specifically this tutorial, and also Apache S4 and GridGain.
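As a rough illustration of the "continuous calculation" idea (plain Java with made-up names, not a Storm/XAP/S4 example), you could keep per-minute buckets of user ids in memory and answer the last-hour question from them:

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class LoginStats {

    // minute bucket -> user IDs seen in that minute
    private final ConcurrentMap<Long, Set<String>> perMinute = new ConcurrentHashMap<>();

    // fire this from the login event
    public void onLogin(String userId) {
        long minute = System.currentTimeMillis() / 60_000;
        perMinute.computeIfAbsent(minute, m -> ConcurrentHashMap.newKeySet()).add(userId);
        // drop buckets older than 24h so the map does not grow forever
        perMinute.keySet().removeIf(m -> m < minute - 24 * 60);
    }

    // distinct users who logged in during the last 60 minutes
    public int distinctLoginsLastHour() {
        long now = System.currentTimeMillis() / 60_000;
        Set<String> users = new HashSet<>();
        for (long m = now - 59; m <= now; m++) {
            Set<String> bucket = perMinute.get(m);
            if (bucket != null) {
                users.addAll(bucket);
            }
        }
        return users.size();
    }
}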
If you don't have a db then never mind. I don't have millions of users, but I have a table with a year's worth of logons that has a million rows, and simple stats like that come back in under a second. A million rows is not that much for a database. You cannot make the date a PK, as you can have duplicates. For minimal fragmentation and insert speed, make the date a clustered non-unique ascending index, which matches the order the data comes in. Not sure which DB you have, but in MSSQL you can do this. An index on user_id is something to test: it would slow down inserts, as that index will fragment. If you are looking at a fairly tight time span, a table scan might be OK.
Why distinct user_id rather than counting every login? A login is a login.
Have a property that only runs the query every x seconds (even if that is every second) and reports the cached answer. If 200 pages hit that property in one second, you surely don't want 200 queries. And if the stat is one second stale for information covering the last hour, it is still a valid stat.
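A minimal sketch of such a property in plain Java (class name made up, the expensive query left as a placeholder):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class CachedStat {

    private final long maxAgeMillis;          // the "x seconds" setting
    private final AtomicLong lastRefresh = new AtomicLong();
    private final AtomicReference<Long> cached = new AtomicReference<>(0L);

    public CachedStat(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    // 200 page hits in one second still run the expensive query only once
    public long loginsLastHour() {
        long now = System.currentTimeMillis();
        long last = lastRefresh.get();
        if (now - last > maxAgeMillis && lastRefresh.compareAndSet(last, now)) {
            cached.set(runExpensiveQuery());
        }
        return cached.get();
    }

    private long runExpensiveQuery() {
        // select count(distinct user_id) from logs where date >= ... (placeholder)
        return 0L;
    }
}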
I've ended up using Esper/NEsper. Also, Uri's suggestions were useful.
Esper allows me to compute real-time stats of data as it's being obtained.
If you're just running off of logs, you probably want to look at something like Splunk.
Generally, if you want this in-memory and fast (real time), you would create a distributed cache of login data with an eviction after e.g. 24 hours, and then you could query that cache for e.g. logins within the past hour.
Assuming a login record looks something like:
public class Login implements Serializable {
public Login(String userId, long loginTime) {..}
public String getUserId() {..}
public long getLoginTime() {..}
public long getLastSeenTime() {..}
public void setLastSeenTime(long logoutTime) {..}
public long getLogoutTime() {..}
public void setLogoutTime(long logoutTime) {..}
String userId;
long loginTime;
long lastSeenTime;
long logoutTime;
}
To support the eviction after 24 hours, simply configure an expiry (TTL) on the cache
<expiry-delay>24h</expiry-delay>
To query for all users currently logged in:
long oneHourAgo = System.currentTimeMillis() - 60*60*1000;
Filter query = QueryHelper.createFilter("loginTime > " + oneHourAgo
+ " and logoutTime = 0");
Set idsLoggedIn = cache.keySet(query);
To query for the number of logins and/or active users in the past hour:
long oneHourAgo = System.currentTimeMillis() - 60*60*1000;
Filter query = QueryHelper.createFilter("loginTime > " + oneHourAgo
+ " or lastSeenTime > " + oneHourAgo);
int numActive = cache.keySet(query).size();
(See http://docs.oracle.com/cd/E15357_01/coh.360/e15723/api_cq.htm for more info on queries. All these examples were from Oracle Coherence.)
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
Basically I have a 'thread line' where new threads are made and a TimeUUID is used as a key, which obviously makes sorting new threads quite easy, especially when, say, querying for the latest 20 threads, etc.
My problem is that when a new 'post' is made to a thread I want to be able to 'bump' that thread to the front of the 'thread line', and that is where the problem comes in: how do I make this happen so that queries still return threads in the right order, without producing any kind of duplicates, etc.?
The only way I can see this working is if, rather than the column family sorting by TimeUUID, I get the column family to sort by insertion timestamp, so I can use the unique thread IDs as column keys and retrieve them in the order they are inserted or re-inserted rather than by TimeUUID. Is this possible, or am I missing a simple trick that allows for this? As far as I know you have to set a particular comparator, or otherwise it defaults to bytes?
Columns within a row are always sorted by name with the given comparator. You cannot sort by timestamp or value or anything else, or Cassandra would not be able to merge multiple updates to the same column correctly.
As to your use case, I can think of two options.
The most similar to what you are doing now would be to create a second columnfamily, ThreadMostRecentPosts, with timeuuid columns (you said "keys" but it sounds like you mean "columns"). When a new post arrives, delete the old most-recent column and add a new one.
This has two problems:
The unit of replication is the row, so having this grow indefinitely could be problematic. (Using expiring columns to age out no-longer-relevant thread information might help.)
You need a lock manager so that multiple posts to the same thread don't race and possibly leave multiple entries in this row.
I would suggest instead creating a row per day (for instance), whose columns are the thread IDs and whose values are the most recent post. Adding a new post just updates the value in that column; no delete/re-add is done, so the race is not a problem anymore. You don't get sorting for free anymore but that's okay because you're limiting it to a small enough set that you can do that sort in memory (say, yesterday's threads and today's).
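In today's CQL terms that layout might look roughly like the sketch below (DataStax 3.x Java driver; the keyspace, table and column names are made up, and the "most recent post" value is assumed to be a timestamp):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Date;
import java.util.List;
import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class ThreadActivity {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("forum");

        // one partition ("row") per day; one column per thread; the value is
        // the time of the most recent post
        session.execute("CREATE TABLE IF NOT EXISTS thread_activity ("
            + " day text, thread_id timeuuid, last_post timestamp,"
            + " PRIMARY KEY (day, thread_id))");

        // a new post just overwrites last_post for that thread -- no
        // delete/re-add, so concurrent posters cannot race
        UUID threadId = UUIDs.timeBased();   // the thread being bumped
        session.execute(
            "INSERT INTO thread_activity (day, thread_id, last_post) VALUES (?, ?, ?)",
            "2012-06-01", threadId, new Date());

        // read back one day's threads and sort by last_post in memory
        List<Row> rows = new ArrayList<>(session.execute(
            "SELECT thread_id, last_post FROM thread_activity WHERE day = ?",
            "2012-06-01").all());
        rows.sort(Comparator.comparing(
            (Row r) -> r.getTimestamp("last_post"), Comparator.reverseOrder()));

        cluster.close();
    }
}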
(Finally, I would add that I can say from experience that having a cutoff past which old threads don't get bumped to the front by a new reply is a Good Thing.)
I'm using a table with a counter to ensure unique ids for a child element.
I know it is usually better to use a sequence, but I can't use one because I have a lot of counters (a customer can create a couple of buckets, and each of them needs its own counter; they have to start at 1, because my customer requires "human readable" keys).
I'm creating records (let's call them items) that have a prikey (bucket_id, num = counter).
I need to guarantee that the bucket_id / num combination is unique (so using a sequence as prikey won't fix my problem).
The creation of rows doesn't happen in pl/sql, so I need to claim the number (btw: it's not against the requirements to have gaps).
My solution was:
UPDATE bucket
SET counter = counter + 1
WHERE id = param_id
RETURNING counter INTO num_forprikey;
PL/SQL returns num_forprikey so the item record can be created.
Question:
Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?
Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?
Yes, at least up to a point. The first user to issue that update gets a lock on the row. So no other user can successfully issue that same statement until user numero uno commits (or rolls back). So uniqueness is guaranteed.
Obviously, the caveat is concurrency. Your access to the row is serialized, so there is no way for two users to get a new PRIKEY simultaneously. This is not necessarily a problem; it depends on how many users you have creating new items, and how often they do it. One user peeling off numbers in the same session won't notice a thing.
I seem to recall this problem from many years back, working of all things on an INGRES database. There were no sequences in those days, so a lot of effort was put into finding the best-scaling solution for this problem by the top INGRES minds of the day. I was fortunate enough to be working alongside them, so even though my mind is pitifully smaller than any of theirs, proximity = residual effect and I retained something. This was one of the things. Let me see if I can remember.
1) for each counter you need a row in a work table.
2) each time you need a number
a) lock the row
b) update it
c) get its new value (you use returning for this which I avoid like the plague)
d) commit the update to release your lock on the row
The reason for the commit is to get some kind of scalability. There will always be a limit, but you do not serialize on getting a number for any length of time.
In the Oracle world we would improve the situation by using a function defined as an AUTONOMOUS_TRANSACTION to acquire the next number. If you think about it, this solution requires that gaps be allowed, which you said is OK. By committing the number update independently of the main transaction, you gain scalability but you introduce gaps.
You will have to accept the fact that your scalability will drop dramatically in this scenario. This is due to at least two reasons:
1) the update/select/commit sequence does its best to reduce the time during which the KEY row is locked, but it is still not zero. Under heavy load, you will serialize and eventually be limited.
2) you are committing on every key get. A commit is an expensive operation, requiring many memory and file management actions on the part of the database. This will also limit you.
In the end you are likely looking at three or more orders of magnitude drop in concurrent transaction load because you are not using sequences. I base this on my experience of the past.
But if your customer requires it, what can you do, right?
Good luck. I have not tested the code for syntax errors, I leave that to you.
create or replace function get_next_key (key_name_p in varchar2) return number is
pragma autonomous_transaction;
key_v number;
begin
update key_table set key = key + 1 where key_name = key_name_p;
select key into key_v from key_table where key_name = key_name_p;
commit;
return (key_v);
end;
/
show errors
You can still use sequences, just use the row_number() analytic function to please your users. I described it here in more detail: http://rwijk.blogspot.com/2008/01/sequence-within-parent.html
Regards,
Rob.
I'd figure out how to make sequences work. They're the only guarantee, though an exception clause could be coded.
http://www.orafaq.com/forum/t/83382/0/ The benefit of sequences (and they can be created dynamically) is that you can specify nocache and guarantee order.