I have a process where one table holds a series of item movements that must be applied to another table holding items and their stocks. The item movements table is mapped by the following entity:
public class ItemMovement {

    enum ItemMovementStatus { NEW, APPLIED }

    private Long movementId;
    private String itemName;
    private String itemCategory;
    private String arbitraryQualifier;
    private Date movementDate;
    private ItemMovementStatus status;

    // getters & setters
}
The inventory item entity is like this:
public class InventoryItem {

    private String itemName;
    private Double itemStock;
    private String arbitraryQualifier;

    // getters & setters
}
Movements are generated over the course of a month, at the end of which all the movements must be "applied" to the inventory table. "Applying" a movement essentially means subtracting the ITEM_MOVEMENT_QTY of each movement from the STOCK value of the inventory table where an exact match exists. If an exact match exists with at least the required movement quantity, the job is done. If not, we take what we can and continue taking from another item that falls into the same ITEM_CATEGORY. If that inventory item still does not have enough to cover the requested movement quantity, we must take from inventory items that share the same ARBITRARY_QUALIFIER as the movement item.
The problem with this last step is that the matches for ARBITRARY_QUALIFIER can run to hundreds or even thousands per movement item because, as its name implies, this qualifier can relate two totally unrelated items. In the (very) worst case, although very remote, ALL inventory items could be a match for a given item movement.
Initially I wanted to retrieve all matches (in chunks, i.e. the query is "paginated") like this:
select m, i from ItemMovement m, InventoryItem i
where m.itemName = i.itemName
   or m.itemCategory = i.itemCategory
   or m.arbitraryQualifier = i.arbitraryQualifier
then process each movement with ALL of its matches in a very OO-oriented approach, but this takes too long with more than 20K movements and 20K inventory items. I can actually SEE that the query retrieving the data (even when paginated) takes too long (more than a minute per page). Taking too long is not the problem per se; the problem is a constraint that forbids me from holding the transaction open for more than 10 minutes. I know I can increase this value, but I would like to know whether the approach I am taking is the right one.
I would of course like to keep a very OO approach in order to fully control the state transitions of the item movements and keep the arithmetic very clear. I believe the "real" questions here are:
Is OO the right way to do this kind of process? If so, is there a design pattern that better addresses my problem in terms of performance and maintainability of the code?
Is this a process where I should look, for example, at a PL/SQL stored procedure instead?
I am using EJB and JPA (Hibernate). The database is Oracle.
Thanks,
Hibernate was not designed for batch operations.
You will be able to get better performance with PL/SQL procedures than you can get with Hibernate.
You must consider the entire system architecture when deciding whether to do it in Java or in PL/SQL. In most cases it is possible to get sufficient performance for batch jobs with Hibernate.
Regardless of what you choose, there are often ways to improve performance within the given architecture.
One very important thing to consider with Hibernate is how you access your objects. You want to load the objects from the database using a small number of SELECT operations. Your job will most likely run faster if you load all objects using two SELECTs (one per object type) than if you use 20,000 (one per movement).
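To make that concrete, here is a rough sketch of the "two SELECTs, then work in memory" idea with plain JPA (the class and method names outside the question's entities are made up, and the matching arithmetic is only indicated):

import javax.persistence.EntityManager;
import java.util.List;

// Hypothetical helper: load everything with two queries, apply the movements in
// memory, and let Hibernate flush the dirty entities in one batch at commit time.
public class MovementApplier {

    private final EntityManager em;

    public MovementApplier(EntityManager em) {
        this.em = em;
    }

    public void applyPendingMovements() {
        // One SELECT per object type instead of one per movement.
        List<ItemMovement> movements = em
            .createQuery("select m from ItemMovement m where m.status = :status", ItemMovement.class)
            .setParameter("status", ItemMovement.ItemMovementStatus.NEW)
            .getResultList();

        List<InventoryItem> inventory = em
            .createQuery("select i from InventoryItem i", InventoryItem.class)
            .getResultList();

        for (ItemMovement movement : movements) {
            // Apply the itemName / itemCategory / arbitraryQualifier matching rules
            // against the in-memory inventory list here, marking the movement APPLIED.
        }
    }
}

Whether loading the whole inventory is acceptable depends on how big it really is; with the 20K x 20K figures from the question it should fit in memory comfortably.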
Goal: read a big Cassandra table, process line by line in parallel
Constraints:
not all rows in memory
no Spark, we have to use Camel
One shot, no need polling the table
I did a first working version with CassandraQL, but this component seems to be limited to one query with everything in memory; I did not find mechanics like fetchSize/fetchMoreResults. I looked at the CassandraProducer class, PollingConsumerPollingStrategy, ResultSetConversionStrategy... and saw nothing.
Could it be possible to read a table in chunks of 1000 elements, for example, where each chunk generates an exchange that is later split across different threads?
I think that maybe a ProducerTemplate injecting the first exchanges into the route could be the answer, but I don't understand how I could manage the exchange production rate to avoid having too many rows in memory (to do so, we would need, for example, to check the size of the next blocking queue: if there are more than X unconsumed elements, wait before producing more messages).
Maybe there are other options to do something like this ?
Maybe I did not see the magic parameter in CassandraQL ?
Maybe we can override some parts of CassandraQL ?
Thank you
This is not going to be a direct answer to your question, but I hope it kicks off some discussion. As someone learning Cassandra and having spent a bit of time on this, it got me thinking. It mainly targets the fetchSize/fetchMoreResults part of the question.
First of all, two of your constraints are contradictory:
"Not all rows in memory": I don't want them all fetched at once.
"One shot, no need polling the partition": I don't want to go back to the DB more than once.
Unless what you actually meant is:
"Not all rows in memory": I don't want them all fetched at once.
You can go back to the partition many times, as long as you go back straight to where you left off last time: as long as the time it takes for the first page is the same as the time it takes for the second page, and the time for the 19th page is the same as for the 20th page, i.e. not starting from the first row each time.
So I am going to assume that you meant the second scenario, and go with it.
Queries in Cassandra are going to satisfy the following two properties:
They are going to have a restriction on clustering columns
They are already ordered by clustering columns
Now consider the following table:
department (partition key), firstName (clustering key), personId (clustering key), lastname, etc. as normal columns
First query:
select department, firstName, lastname, etc
from person
where department = 'depart1'
order by firstName ASC
limit 25;
Second query (let's say the last record on the previous page had personId=25 and firstName='kavi'):
select department, firstName, lastname, etc
from person
where department = 'depart1' and (firstName, personId) > ('kavi', 25)
order by firstName ASC
limit 25;
As you can see, we can easily construct a Cassandra query that brings back each chunk of a given size in roughly constant time.
Now back to the integration framework.
I remember a concept called a watermark in Mule, where the endpoint can store a value and remember it so it can start from there next time. In this case, the firstName and personId of the last record of the last page are the watermark, which is what lets you issue the second query. I am sure we should be able to do the same with Camel.
I hope I have convinced you that polling is not an issue when each chunk is retrieved in constant time.
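To make the watermark idea concrete, here is a rough sketch in plain Java with the DataStax driver (driver version, keyspace name and column types are assumptions; wiring this into a Camel route, e.g. via a ProducerTemplate, is left out):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class WatermarkChunkReader {

    private static final int CHUNK_SIZE = 25;

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            String firstNameMark = null;    // watermark: last firstName served
            Integer personIdMark = null;    // watermark: last personId served

            while (true) {
                ResultSet rs;
                if (firstNameMark == null) {
                    rs = session.execute(
                        "SELECT firstName, personId, lastname FROM person "
                      + "WHERE department = ? LIMIT " + CHUNK_SIZE, "depart1");
                } else {
                    // Restart exactly where the previous chunk ended, using the
                    // clustering columns as the watermark.
                    rs = session.execute(
                        "SELECT firstName, personId, lastname FROM person "
                      + "WHERE department = ? AND (firstName, personId) > (?, ?) "
                      + "LIMIT " + CHUNK_SIZE, "depart1", firstNameMark, personIdMark);
                }

                int rowsInChunk = 0;
                for (Row row : rs) {
                    rowsInChunk++;
                    firstNameMark = row.getString("firstName");
                    personIdMark = row.getInt("personId");
                    // hand the row (or the collected chunk) to the processing threads here
                }
                if (rowsInChunk < CHUNK_SIZE) {
                    break;    // last chunk reached
                }
            }
        }
    }
}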
Suppose I have a large (300-500k) collection of text documents stored in the relational database. Each document can belong to one or more (up to six) categories. I need users to be able to randomly select documents in a specific category so that a single entity is never repeated, much like how StumbleUpon works.
I don't really see a way I could implement this using slow NOT IN queries with a large number of users and documents, so I figured I might need to implement some custom data structure for this purpose. Perhaps there is already a paper describing an algorithm that could be adapted to my needs?
Currently I'm considering the following approach:
Read all the entries from the database
Create a linked-list-based index for each category from the IDs of the documents belonging to that category, and shuffle it
Create a Bloom filter containing all of the entries viewed by a particular user
Traverse the index with an iterator, using the Bloom filter to skip already-viewed items
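If it helps, here is a minimal sketch of that approach using Guava's BloomFilter (the class, the way IDs are loaded and the sizing numbers are made up for illustration):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CategoryIndex {

    private final List<Long> shuffledIds;

    public CategoryIndex(List<Long> documentIdsForCategory) {
        this.shuffledIds = new ArrayList<>(documentIdsForCategory);
        Collections.shuffle(this.shuffledIds);
    }

    // One filter per user; ~1% false positives means an unseen document is
    // occasionally skipped, which is acceptable for a StumbleUpon-style feature.
    public static BloomFilter<Long> newViewedFilter(int expectedViews) {
        return BloomFilter.create(Funnels.longFunnel(), expectedViews, 0.01);
    }

    /** Next document the user has (probably) not viewed, or null if exhausted. */
    public Long nextUnviewed(BloomFilter<Long> viewedByUser) {
        for (Long id : shuffledIds) {
            if (!viewedByUser.mightContain(id)) {
                viewedByUser.put(id);    // remember it so it is not served again
                return id;
            }
        }
        return null;
    }
}

The per-user filter would have to be persisted somewhere (session, cache, or a blob column) between requests.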
If you track via a table which entries the user has seen... try this. I'm going to use MySQL because that's the quickest example I can think of, but the gist should be clear.
On a link being 'used'...
insert into viewed (userid, url_id) values ("jj", 123)
On looking for a link...
select p.url_id
from pages p left join viewed v on v.url_id = p.url_id
where v.url_id is null
order by rand()
limit 1
This causes the database to do a one-for-one join, and you're limiting your query to return only one entry that the user has not seen yet.
Just a suggestion.
Edit: It is possible to make this a single operation, but then there's no guarantee that the URL was actually delivered to the user.
It depends on how users get their random entries.
Option 1:
A user pages through some entities and stops after a couple of them. For example, the user sees the current random entity, moves to the next one, reads it, does this a couple of times, and that's it.
The next time this user (or another) gets an entity from this category, the set of already-viewed entities is cleared, so you may return an already-viewed entity.
For that option I would recommend keeping a (hash) set of already-viewed entity IDs; every time the user asks for a random entity, choose one randomly from the DB and check that it is not already in the set.
Because the set is so small and your data is so big, the chance of hitting an already-viewed ID is tiny, so this takes O(1) most of the time.
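A tiny sketch of option 1 (the class and the pickRandomIdFromDb supplier are made up; it stands for whatever single-row random query you use, e.g. ORDER BY rand() LIMIT 1 or a random offset):

import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

public class Option1Picker {

    // small per-user/per-session set of already-viewed IDs
    private final Set<Long> viewedIds = new HashSet<>();

    public Long next(Supplier<Long> pickRandomIdFromDb) {
        while (true) {
            Long candidate = pickRandomIdFromDb.get();
            if (candidate == null) {
                return null;               // nothing available for this category
            }
            if (viewedIds.add(candidate)) {
                return candidate;          // not seen before in this session
            }
            // already seen: with a tiny set and a huge table this almost never loops
        }
    }
}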
Option 2:
A user pages through the entities and the viewed entities are remembered across all users and across every visit to your page.
In that case you will probably work through all the entities in each category, and saving all the viewed entities plus checking whether an entity has been viewed will take some time.
For that option I would get all the IDs for this category, shuffle them, and store them in a linked list. When you want a random not-yet-viewed entity, just take the head of the list and delete it (O(1)).
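And a matching sketch of option 2 (class name made up; loading the IDs is left out):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

public class Option2Picker {

    private final Deque<Long> remaining;

    public Option2Picker(List<Long> allIdsForCategory) {
        List<Long> shuffled = new ArrayList<>(allIdsForCategory);
        Collections.shuffle(shuffled);            // shuffle once up front
        this.remaining = new ArrayDeque<>(shuffled);
    }

    /** Next never-viewed ID, or null once the category is exhausted. */
    public Long next() {
        return remaining.pollFirst();             // O(1), no lookup needed
    }
}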
I assume that for any given <user, category> pair, the number of documents viewed is pretty small relative to the total number of documents available in that category.
So can you just store indexed triples <user, category, document> indicating which documents have been viewed, and then just take an optimistic approach with respect to randomly selected documents? In the vast majority of cases, the randomly selected document will be unread by the user. And you can check quickly because the triples are indexed.
I would opt for a pseudorandom approach:
1.) Determine number of elements in category to be viewed (SELECT COUNT(*) WHERE ...)
2.) Pick a random number in range 1 ... count.
3.) Select a single document (SELECT * FROM ... WHERE [same as when counting] ORDER BY [some stable order]). Depending on the SQL dialect in use, there are different clauses for retrieving only the part of the result set you want (MySQL's LIMIT clause, SQL Server's TOP clause, etc.).
If the number of documents is large, the chance of serving the same user the same document twice is negligibly small. Using the scheme described above you don't have to store any state information at all.
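For illustration, the three steps over JDBC might look like this (table and column names are invented, and the LIMIT/OFFSET syntax is the MySQL flavour mentioned above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;

public class PseudorandomPicker {

    public Long pickRandomDocumentId(Connection conn, long categoryId) throws SQLException {
        // 1) count the documents matching the criteria
        long count;
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM documents WHERE category_id = ?")) {
            ps.setLong(1, categoryId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                count = rs.getLong(1);
            }
        }
        if (count == 0) {
            return null;
        }

        // 2) pick a random position, then 3) fetch exactly that row in a stable order
        long offset = ThreadLocalRandom.current().nextLong(count);
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id FROM documents WHERE category_id = ? ORDER BY id LIMIT 1 OFFSET ?")) {
            ps.setLong(1, categoryId);
            ps.setLong(2, offset);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong("id") : null;
            }
        }
    }
}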
You may want to consider a nosql solution like Apache Cassandra. These seem to be ideally suited to your needs. There are many ways to design the algorithm you need in an environment where you can easily add new columns to a table (column family) on the fly, with excellent support for a very sparsely populated table.
edit: one of many possible solutions below:
Create a CF (column family, i.e. table) for each category (creating these on the fly is quite easy).
Add a row to each category CF for each document belonging to the category.
Whenever a user hits a document, you add a column named after that user and set it to true on the row. Obviously this table will be huge, with millions of columns, and probably quite sparsely populated, but no problem: reading it is still constant time.
Now finding a new document for a user in a category is simply a matter of selecting any row where that user's column is null.
You should get constant time writes and reads, amazing scalability, etc if you can accept Cassandra's "eventually consistent" model (ie, it is not mission critical that a user never get a duplicate document)
I've solved similar in the past by indexing the relational database into a document oriented form using Apache Lucene. This was before the recent rise of NoSQL servers and is basically the same thing, but it's still a valid alternative approach.
You would create a Lucene Document for each of your texts with a textId (relational database id) field and multi valued categoryId and userId fields. Populate the categoryId field appropriately. When a user reads a text, add their id to the userId field. A simple query will return the set of documents with a given categoryId and without a given userId - pick one randomly and display it.
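Roughly, the query side looks like this (field names follow the description above; the Lucene version and how the index directory is obtained are assumptions):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

public class UnreadTextFinder {

    /** textId of a random unread text in the category, or null if there is none. */
    public String randomUnreadText(Directory index, String categoryId, String userId)
            throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // categoryId must match, userId must NOT appear on the document
            BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("categoryId", categoryId)), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("userId", userId)), BooleanClause.Occur.MUST_NOT)
                .build();

            TopDocs hits = searcher.search(query, 100);    // a sample to pick from is enough
            if (hits.scoreDocs.length == 0) {
                return null;
            }
            ScoreDoc pick = hits.scoreDocs[ThreadLocalRandom.current().nextInt(hits.scoreDocs.length)];
            return searcher.doc(pick.doc).get("textId");
        }
    }
}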
Store a user's past X selections in a cookie or something.
Return the last selections to the server with the users new criteria
Randomly choose one of the texts satisfying the criteria until it is not a member of the last X selections of the user.
Return this choice of text and update the list of last X selections.
I would experiment to find the best value of X but I have in mind something like an X of say 16?
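A sketch of that loop, with X = 16 (the class name and the randomCandidate supplier are made up; reading and writing the cookie is left out):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

public class LastXFilter {

    private static final int X = 16;

    // the user's last X selections, oldest first (this is what would live in the cookie)
    private final Deque<Long> lastSelections = new ArrayDeque<>();

    public Long nextText(Supplier<Long> randomCandidate) {
        Long candidate;
        do {
            candidate = randomCandidate.get();        // random text satisfying the criteria
        } while (candidate != null && lastSelections.contains(candidate));

        if (candidate != null) {
            lastSelections.addLast(candidate);
            if (lastSelections.size() > X) {
                lastSelections.removeFirst();         // keep only the last X selections
            }
        }
        return candidate;
    }
}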
I have a site with millions of users (well, actually it doesn't have any yet, but let's imagine), and I want to calculate some stats like "log-ins in the past hour".
The problem is similar to the one described here: http://highscalability.com/blog/2008/4/19/how-to-build-a-real-time-analytics-system.html
The simplest approach would be to do a select like this:
select count(distinct user_id)
from logs
where date>='20120601 1200' and date <='20120601 1300'
(of course other conditions could apply for the stats, like log-ins per country)
Of course this would be really slow, especially with millions (or even thousands) of rows, and I want to run this query every time a page is displayed.
How would you summarize the data? What should go to the (mem)cache?
EDIT: I'm looking for a way to de-normalize the data, or to keep the cache up-to-date. For example I could increment an in-memory variable every time someone logs in, but that would help to know the total amount of logins, not the "logins in the last hour". Hope it's more clear now.
IMO the more correct approach here would be to implement a continuous calculation that holds the relevant counters in memory. Every time a user is added to your system you can fire an event, which can be processed in multiple ways to update last-hour, last-day or even total-user counters. There are some great frameworks out there for this sort of processing. Twitter Storm is one of them; another is GigaSpaces XAP (disclaimer: I work for GigaSpaces), and specifically this tutorial, and there are also Apache S4 and GridGain.
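To illustrate the idea without any particular framework, a minimal in-memory version could bucket login events per minute and sum the last 60 buckets on read (the class and method names are made up; counting distinct users instead of logins would need a set of user IDs per bucket):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

public class LoginsLastHour {

    // minute bucket -> number of logins in that minute
    private final ConcurrentMap<Long, LongAdder> perMinute = new ConcurrentHashMap<>();

    /** Called from the login event handler. */
    public void onLogin(long epochMillis) {
        long minute = epochMillis / 60_000;
        perMinute.computeIfAbsent(minute, m -> new LongAdder()).increment();
        // prune buckets older than 24h to bound memory (could also be a scheduled task)
        perMinute.keySet().removeIf(m -> m < minute - 24 * 60);
    }

    /** Cheap read: sums at most 60 small counters instead of scanning the log table. */
    public long loginsInLastHour(long nowMillis) {
        long currentMinute = nowMillis / 60_000;
        long total = 0;
        for (long m = currentMinute - 59; m <= currentMinute; m++) {
            LongAdder bucket = perMinute.get(m);
            if (bucket != null) {
                total += bucket.sum();
            }
        }
        return total;
    }
}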
If you don't have a DB then never mind. I don't have millions of users, but I have a table with a year's worth of logins containing a million rows, and simple stats like that come back in under a second. A million rows is not that much for a database. You cannot make date the PK because you can have duplicates. For minimal fragmentation and insert speed, make date a clustered non-unique index ascending, which is the order the data arrives in. Not sure which DB you have, but in MSSQL you can do that. An index on user_id is something to test: it would slow down inserts, as that index will fragment. If you are looking at a fairly tight time span, a table scan might be OK.
Why distinct user_id rather than just counting logins? A login is a login.
Have a property that only runs the query every X seconds, even if that is every second, and returns the cached answer in between. If 200 pages hit that property in one second, you certainly don't want 200 queries. And if a stat covering the last hour is one second stale, it is still a valid stat.
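Something like this, for example (the names and the LongSupplier holding the actual SQL query are made up):

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class ThrottledStat {

    private final LongSupplier expensiveQuery;    // e.g. the COUNT(DISTINCT user_id) query
    private final long intervalMillis;

    private final AtomicLong lastRefresh = new AtomicLong(0);
    private volatile long cachedValue = 0;

    public ThrottledStat(LongSupplier expensiveQuery, long intervalMillis) {
        this.expensiveQuery = expensiveQuery;
        this.intervalMillis = intervalMillis;
    }

    public long get() {
        long now = System.currentTimeMillis();
        long last = lastRefresh.get();
        // only the thread that wins this CAS pays for the query; everyone else
        // keeps getting the (at most intervalMillis stale) cached answer
        if (now - last >= intervalMillis && lastRefresh.compareAndSet(last, now)) {
            cachedValue = expensiveQuery.getAsLong();
        }
        return cachedValue;
    }
}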
I've ended up using Esper/NEsper. Uri's suggestions were also useful.
Esper allows me to compute real-time stats of data as it's being obtained.
If you're just running off of logs, you probably want to look at something like Splunk.
Generally, if you want this in-memory and fast (real time), you would create a distributed cache of login data with an eviction after e.g. 24 hours, and then you could query that cache for e.g. logins within the past hour.
Assuming a login record looks something like:
public class Login implements Serializable {
    public Login(String userId, long loginTime) {..}

    public String getUserId() {..}
    public long getLoginTime() {..}
    public long getLastSeenTime() {..}
    public void setLastSeenTime(long lastSeenTime) {..}
    public long getLogoutTime() {..}
    public void setLogoutTime(long logoutTime) {..}

    String userId;
    long loginTime;
    long lastSeenTime;
    long logoutTime;
}
To support the eviction after 24 hours, simply configure an expiry (TTL) on the cache
<expiry-delay>24h</expiry-delay>
To query for all users currently logged in:
long oneHourAgo = System.currentTimeMillis() - 60*60*1000;
Filter query = QueryHelper.createFilter("loginTime > " + oneHourAgo
+ " and logoutTime = 0");
Set idsLoggedIn = cache.keySet(query);
To query for the number of logins and/or active users in the past hour:
long oneHourAgo = System.currentTimeMillis() - 60*60*1000;
Filter query = QueryHelper.createFilter("loginTime > " + oneHourAgo
+ " or lastSeenTime > " + oneHourAgo);
int numActive = cache.keySet(query).size();
(See http://docs.oracle.com/cd/E15357_01/coh.360/e15723/api_cq.htm for more info on queries. All these examples were from Oracle Coherence.)
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
I'm fairly new to the more complex parts of Core Data.
My application has a core data store with 15K rows. There is a single entity.
I need to display a subset of those rows in a table view filtered on a calculated search criteria, and for each row displayed add a value that I calculate in real time but don't store in the entity.
The calculation needs to use a couple of values supplied by the user.
A hypothetical example:
Entity: contains fields "id", "first", and "second"
User inputs: 10 and 20
Search / Filter Criteria: only display records where the entity field "id" is a prime number between the two supplied numbers. (I need to build some sort of complex predicate method here I assume?)
Display: all fields of all records that meet the criteria, along with a derived field (not in the Core Data entity) that is the sum of the "id" field and a random number, so each row in the table view would contain four fields:
"id", "first", "second", -calculated value-
From my reading / Googling it seems that a transient property might be the way to go, but I can't work out how to do this given that the search criteria and the resultant property need to calculate based on user input.
Could anyone give me any pointers that will help me implement this code? I'm pretty lost right now, and the examples I can find in books etc. don't match my particular needs well enough for me to adapt them as far as I can tell.
Thanks
Darren.
The first thing you need to do is stop thinking in terms of fields, rows and columns, as none of those structures are actually part of Core Data. In this case it matters because Core Data supports arbitrarily complex fetches but the SQLite store does not, so if you use a SQLite store your fetches are restricted to those SQLite supports.
In this case, predicates aimed at SQLite can't perform complex operations such as calculating whether an attribute value is prime.
The best solution for your first case would be to add a boolean attribute isPrime and then modify the setter for your id attribute to calculate whether the set id value is prime and set isPrime accordingly. That will be stored in the SQLite store and can be fetched against, e.g. isPrime == YES && (first <= %@) && (second >= %@)
The second case would simply use a transient property for which you would supply a custom getter to calculate its value when the managed object was in memory.
One often overlooked option is to not use an sqlite store but to use an XML store instead. If the amount of data is relatively small e.g. a few thousand text attributes with a total memory footprint of a few dozen meg, then an XML store will be super fast and can handle more complex operations.
SQLite is sort of the stunted stepchild in Core Data. It is useful for large data sets and low memory, but with memory becoming ever more plentiful it's losing its edge. I find myself using it less these days. You should consider whether you really need SQLite in this particular case.
I'm using table with a counter to ensure unique id's on a child element.
I know it is usually better to use a sequence, but I can't use one because I have a lot of counters: a customer can create a couple of buckets, and each of them needs its own counter starting at 1 (it's a requirement; my customer needs "human readable" keys).
I'm creating records (let's call them items) that have a prikey (bucket_id, num = counter).
I need to guarantee that the bucket_id / num combination is unique (so using a sequence as prikey won't fix my problem).
The creation of rows doesn't happen in pl/sql, so I need to claim the number (btw: it's not against the requirements to have gaps).
My solution was:
UPDATE bucket
SET counter = counter + 1
WHERE id = param_id
RETURNING counter INTO num_forprikey;
PL/SQL returns num_forprikey so the item record can be created.
Question:
Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?
"Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?"
Yes, at least up to a point. The first user to issue that update gets a lock on the row. So no other user can successfully issue that same statement until user numero uno commits (or rolls back). So uniqueness is guaranteed.
Obviously, the cavil is regarding concurrency. Your access to the row is serialized, so there is no way for two users to get a new PRIKEY simultaneously. This is not necessarily a problem. It depends on how many users you have creating new Items, and how often they do it. One user peeling off numbers in the same session won't notice a thing.
I seem to recall this problem from many years back, working on (of all things) an INGRES database. There were no sequences in those days, so a lot of effort was put into finding the best-scaling solution for this problem by the top INGRES minds of the day. I was fortunate enough to be working alongside them, so even though my mind is pitifully smaller than any of theirs, proximity = residual effect and I retained something. This was one of those things. Let me see if I can remember.
1) For each counter you need a row in a work table.
2) Each time you need a number:
a) lock the row
b) update it
c) get its new value (you use RETURNING for this, which I avoid like the plague)
d) commit the update to release your lock on the row
The reason for the commit is to get some kind of scalability. There will always be a limit, but you do not serialize on getting a number for any length of time.
In the Oracle world we would improve the situation by using a function defined as an AUTONOMOUS_TRANSACTION to acquire the next number. If you think about it, this solution requires that gaps be allowed, which you said is OK. By committing the number update independently of the main transaction you gain scalability, but you introduce gaps.
You will have to accept the fact that your scalability will drop dramatically in this scenario. This is due to at least two reasons:
1) the update/select/commit sequence does its best to reduce the time during which the KEY row is locked, but it is still not zero. Under heavy load, you will serialize and eventually be limited.
2) you are committing on every key get. A commit is an expensive operation, requiring many memory and file management actions on the part of the database. This will limit you as well.
In the end you are likely looking at three or more orders of magnitude drop in concurrent transaction load because you are not using sequences. I base this on my experience of the past.
But if your customer requires it, what can you do, right?
Good luck. I have not tested the code for syntax errors, I leave that to you.
create or replace function get_next_key (key_name_p in varchar2) return number is
  pragma autonomous_transaction;
  key_v number;
begin
  update key_table
     set key = key + 1
   where key_name = key_name_p;

  select key
    into key_v
    from key_table
   where key_name = key_name_p;

  commit;
  return (key_v);
end;
/
show errors
You can still use sequences, just use the row_number() analytic function to please your users. I described it here in more detail: http://rwijk.blogspot.com/2008/01/sequence-within-parent.html
Regards,
Rob.
I'd figure out how to make sequences work. They're the only guarantee, though an exception clause could be coded:
http://www.orafaq.com/forum/t/83382/0/ The benefit of sequences (and they can be created dynamically) is that you can specify NOCACHE and guarantee order.