JPA best way to avoid n+1 when I need to make a calculation for each row - performance

My application is used to find places in a city. Each place needs a score to be calculated, and this score cannot be computed in advance (and stored somewhere) because it is different for each user and changes over time. Here is what I'm doing at the moment, and it is TERRIBLY inefficient (15 times slower than if I mock the database call inside the loop):
1. SQL (native) query to fetch all the places that match the search (I select only the columns I need).
2. Loop through the list and, for each POI, make a DB call to get the info needed to calculate the score (I need different values residing in different tables).
3. Make the calculation.
4. Sort by score descending.
5. Cut the list according to the pagination settings (yes, I cannot put LIMIT directly in the query as I don't know the scores yet...).
6. Return the list.
Well, this takes 15 seconds in total.
If I remove step 2 and simply mock the DB call, it only takes 600 ms.
My tables look like this:
place_tag_count table:
place_id | tag_id | tag_count
1        | 100    | 15
1        | 200    | 25
1        | 300    | 35

user_tag_score table:
user_id | tag_id | score
1000    | 100    | 0.5
1000    | 200    | 0.3
As a simplified example, the place score is the sum of the user's tag scores multiplied by the tag counts found in place_tag_count:
score = 0.5 * 15 + 0.3 * 25 + ... (I won't complicate things here, but if a tag score is missing I do another calculation that needs other DB calls...)
The query at step 1 returns distinct places, so because the calculation needs all the tag counts of the place and the user's tag scores, I need to make that extra DB call for each POI.
My question is: what would be the BEST way to avoid the N+1 calls in my situation? I have thought of some alternatives, but I would prefer the opinion of a more experienced person before going in head first.
1. Instead of returning distinct places in the query from step 1, return the same place several times, grouped by place_id, tag_id for example; in my Java code I just loop, and when the place_id changes it means I'm processing another place (see the sketch below).
2. Make the query from step 1 a bit more complicated and aggregate all the numbers I need into a comma-separated list (but that requires some kind of sub-select, which might affect the speed of the query).
3. Other solutions?
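For illustration, option 1 could look roughly like the query below. This is only a sketch based on the example tables above: the real query would also need the place-search filtering, and :userId is just a placeholder parameter name. It returns one row per place/tag pair with the user's tag score already joined in, so a single pass in Java can detect when place_id changes and compute each score without further DB calls.

-- one row per (place_id, tag_id); LEFT JOIN keeps tags the user has no score for
SELECT p.place_id,
       p.tag_id,
       p.tag_count,
       u.score AS user_tag_score
FROM   place_tag_count p
       LEFT JOIN user_tag_score u
              ON u.tag_id  = p.tag_id
             AND u.user_id = :userId
ORDER  BY p.place_id, p.tag_id;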

Related

Jmeter - How to perform data compare on two JDBC requests for the entire DB

I want to compare data between two databases, where there are 700K records to be compared, and that number is increasing all the time.
My architecture is as follows: based on the MAIN query, I run JDBC1 & JDBC2, and then I compare the data between JDBC1 & JDBC2.
If I put a static limit in the main query,
select * from transaction_info
order by id asc
limit 1000
everything works fine, and I know how to compare the data.
But how can I make all this dynamic, so I can break down the main query like:
1. Query first 1000 rows, then compare/assert.
2. Query next 1000 rows then compare/assert.
And all this to continue till the 700K records?
Any help is appreciated!
As part of your main query SQL, you can include a combination of LIMIT and OFFSET to fetch rows incrementally, and parameterize the OFFSET part as below to make it dynamic:
select * from my_table limit 1000 offset ${offset_value}
For the offset variable, define a Counter config element whose starting value is 1000 and which increments by 1000 up to a maximum value of 700K. If you want each user to run this process, check "Track counter independently for each user".
This way, each iteration fetches 1000 rows, then the subsequent 1000, in sequence.
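One caveat worth adding (an assumption on my part, not stated in the original answer): keep the ORDER BY from your main query in the paginated statement, otherwise the OFFSET windows are not guaranteed to be stable, non-overlapping slices of the table:

select * from transaction_info
order by id asc
limit 1000 offset ${offset_value}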
My suggestion for Test Plan
You can remove the Loop Controller and place the sub-queries along with the assertion as siblings of the main query, and set the Thread Group's loop count to 700. This way the main query, the sub-queries and the assertion are triggered 700 times, each time fetching, processing and asserting 1000 rows at a time.
Sample Test Plan
Thread Group Loop Count Configuration
Hope this helps!

PowerBI filter table based on value of measure_A OR measure_B [duplicate]

We are trying to implement a dashboard that displays various tables, metrics and a map where the dataset is a list of customers. The primary filter condition is the disjunction of two numeric fields. We want the user to be able to select a threshold for [field 1] and a separate threshold for [field 2] and then impose the condition [field 1] >= <threshold> OR [field 2] >= <threshold>.
After that, we want to also allow various other interactive slicers so the user can restrict the data further, e.g. by country or account manager.
Power BI naturally imposes AND between all filters and doesn't have a neat way to specify OR. Can you suggest a way to define a calculation using the two numeric fields that is then applied as a filter within the same interactive dashboard screen? Alternatively, is there a way to first prompt the user for the two threshold values before the dashboard is displayed -- so when they click Submit on that parameter-setting screen they are then taken to the main dashboard screen with the disjunction already applied?
Added in response to a comment:
The data can be quite simple: no complexity there. The complexity is in getting the user interface to enable a disjunction.
Suppose the data was a list of customers with customer id, country, gender, total value of transactions in the last 12 months, and number of purchases in last 12 months. I want the end-user (with no technical skills) to specify a minimum threshold for total value (e.g. $1,000) and number of purchases (e.g. 10) and then restrict the data set to those where total value of transactions in the last 12 months > $1,000 OR number of purchases in last 12 months > 10.
After doing that, I want to allow the user to see the data set on a dashboard (e.g. with a table and a graph) and from there select other filters (e.g. gender=male, country=Australia).
The key here is to create separate parameter tables and combine conditions using a measure.
Suppose we have the following Sales table:
Customer  Value  Number
-----------------------
A           568       2
B          2451      12
C          1352       9
D           876       6
E           993      11
F          2208      20
G          1612       4
Then we'll create two new tables to use as parameters. You could do a calculated table like
Number = VALUES(Sales[Number])
Or something more complex like
Value = GENERATESERIES(0, ROUNDUP(MAX(Sales[Value]),-2), ROUNDUP(MAX(Sales[Value]),-2)/10)
Or define the table manually using Enter Data or some other way.
In any case, once you have these tables, name their columns what you want (I used MinCount and MinValue) and write your filtering measure
Filter = IF(MAX(Sales[Number]) > MIN(Number[MinCount]) ||
MAX(Sales[Value]) > MIN('Value'[MinValue]),
1, 0)
Then put your Filter measure as a visual-level filter where Filter is not 0, and use the MinCount and MinValue columns as slicers.
If you select 10 for MinCount and 1000 for MinValue then your table should look like this:
Notice that E and G only exceed one of the thresholds, and that A and D are excluded.
To my knowledge, there is no such built-in slicer feature in Power BI at the time of writing. There is, however, a suggestion in the Power BI forum that requests functionality like this. If you're willing to use the Power Query Editor, it's easy to obtain the values you're looking for, but only with hard-coded values for your limits or thresholds.
Let me show you how for a synthetic dataset that should fit the structure of your description:
Dataset:
CustomerID,Country,Gender,TransactionValue12,NPurchases12
51,USA,M,3516,1
58,USA,M,3308,12
57,USA,M,7360,19
54,USA,M,2052,6
51,USA,M,4889,5
57,USA,M,4746,6
50,USA,M,3803,3
58,USA,M,4113,24
57,USA,M,7421,17
58,USA,M,1774,24
50,USA,F,8984,5
52,USA,F,1436,22
52,USA,F,2137,9
58,USA,F,9933,25
50,Canada,F,7050,16
56,Canada,F,7202,5
54,Canada,F,2096,19
59,Canada,F,4639,9
58,Canada,F,5724,25
56,Canada,F,4885,5
57,Canada,F,6212,4
54,Canada,F,5016,16
55,Canada,F,7340,21
60,Canada,F,7883,6
55,Canada,M,5884,12
60,UK,M,2328,12
52,UK,M,7826,1
58,UK,M,2542,11
56,UK,M,9304,3
54,UK,M,3685,16
58,UK,M,6440,16
50,UK,M,2469,13
57,UK,M,7827,6
Desktop table:
Here you see an Input table and a subset table using two Slicers. If the forum suggestion gets implemented, it should hopefully be easy to change a subset like below to an "OR" scenario:
Transaction Value > 1000 OR Number of purchases > 10 using Power Query:
If you use Edit Queries > Advanced filter you can set it up like this:
The last step under Applied Steps will then contain this formula:
= Table.SelectRows(#"Changed Type2", each [NPurchases12] > 10 or [TransactionValue12] > 1000)
Now your original Input table will look like this:
Now, if only we were able to replace the hardcoded 10 and 1000 with a dynamic value, for example from a slicer, we would be fine! But no...
I know this is not what you were looking for, but it was the best 'negative answer' I could find. I guess I'm hoping for a better solution just as much as you are!

MAX() SQL Equivalent on Redis

I'm new to Redis, and now I have a problem improving my stats application. The current SQL used to generate the statistics is:
SELECT MIN(created_at), MAX(created_at) FROM table ORDER BY id DESC limit 10000
It should return the MIN and MAX values of the created_at field for the last 10,000 records.
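As a side note (my reading of the intent, not part of the original question): the aggregates are computed before ORDER BY and LIMIT apply, so the query above actually returns the min/max over the whole table. Restricting them to the last 10,000 rows would need a subquery along these lines:

SELECT MIN(created_at), MAX(created_at)
FROM (SELECT created_at FROM table ORDER BY id DESC LIMIT 10000) AS last_rows;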
I have read about RANGE and SCORING in Redis, and it seems they can be used to solve this problem, but I'm still confused about SCORING the last 10,000 records. Can they be used to solve this problem, or is there another way to solve it using Redis?
Regards
Your target appears to be somewhat unclear - are you looking to store all the records in Redis? If so, what other columns does the table table have and what other queries do you run against it?
I'll take your question at face value, but note that in most NoSQL databases (Redis included) you need to store your data according to how you plan on fetching it. Assuming that you want to get the min/max creation dates of the last 10K records, I suggest that you keep them in a Sorted Set. The Sorted Set's members will be the unique id and their scores will be the creation date (use the epoch value), for example, rows with ids 1, 2 & 3 were created at dates 10, 100 & 1000 respectively:
ZADD table 10 1 100 2 1000 3 ...
Getting the minimal creation date is easy now - just do ZRANGE table 0 0 WITHSCORES - and the max is just a ZRANGE table -1 -1 WITHSCORES away. The only "tricky" part is making sure that the Sorted Set is kept updated, so for every new record you'll need to remove the lowest id from the set and add the new one. In pseudo Python code this would look something like the following:
def updateMinMaxSortedSet(id, date):
    # keep the Sorted Set trimmed to the last 10K records
    if redis.zcard('table') >= 10000:
        # assumes ids are assigned sequentially, so id - 10000 is the oldest member
        redis.zrem('table', id - 10000)
    # member = id, score = creation date (epoch), as in the ZADD example above
    redis.zadd('table', {id: date})

Salesforce SOQL query length and efficiency

I am trying to solve a problem of deleting only rows matching two criteria, each being a list of ids. These ids come in pairs: if the item to be deleted has one id, it must also have the matching second id of the pair, so just using two IN clauses will not work. I have come up with two solutions.
1) Use the two IN clauses, but then loop over the items and check that the two ids in question appear in the correct pairing, i.e.:
for (Object__c obj : [SELECT Id, Relation1__c, Relation2__c
                      FROM Object__c
                      WHERE Relation1__c IN :idlist1 AND Relation2__c IN :idlist2]) {
    // keep only rows whose two ids form one of the pre-built pairs
    if (preConstructedPairingsAsString.contains('' + obj.Relation1__c + obj.Relation2__c)) {
        listToDelete.add(obj);
    }
}
2) Loop over the ids and build an admittedly long query.
I like the second choice because I only get the items I need and can just pass the list to delete, but I know that Salesforce has hangups with SOQL queries. Is there a penalty to the second option? Is it better to build and run a query off a long string, or to get more objects than necessary and filter?
In general you want to put as much logic as you can into SOQL queries, because that won't use any script statements and they execute faster than your code. However, there is a 10k character limit on SOQL queries (it can be raised to 20k), so based on my back-of-the-envelope calculations you'd only be able to put in around 250 id pairs before hitting that limit.
I would go with option 1 or if you really care about efficiency you can create a formula field on the object that pairs the ids and filter on that.
formula: relation1__c + '-' + relation2__c
// the SOQL for loop processes (and here deletes) the matching records in batches
for (List<Object__c> objs : [SELECT Id FROM Object__c WHERE formula__c IN :idpairs]) {
    delete objs;
}

Windows Azure Paging Large Datasets Solution

I'm using Windows Azure Table Storage to store millions of entities, and I'm trying to figure out the best solution that easily allows for three things:
1) a search on an entity will retrieve that entity and at least (pageSize) entities either side of it;
2) if there are more entities beyond (pageSize) entities either side of that entity, then next-page and previous-page links are shown, and this continues until either the start or the end is reached;
3) the order is reverse chronological.
I've decided that the PartitionKey will be the Title provided by the user, as each container is unique in the system. The RowKey uses Steve Marx's lexicographical algorithm:
http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure
which, when converted to JavaScript instead of C#, looks like this:
pad(new Date(100000000 * 86400000).getTime() - new Date().getTime(), 19) + "_" + uuid()
uuid() is a JavaScript function that returns a GUID, and pad adds zeros up to 19 characters in length. So records in the system look something like this:
PK RK
TEST 0008638662595845431_ecf134e4-b10d-47e8-91f2-4de9c4d64388
TEST 0008638662595845432_ae7bb505-8594-43bc-80b7-6bd34bb9541b
TEST 0008638662595845433_d527d215-03a5-4e46-8a54-10027b8e23f8
TEST 0008638662595845434_a2ebc3f4-67fe-43e2-becd-eaa41a4132e2
This pattern allows for every new entity inserted to be at the top of the list which satisfies point number 3 above.
With a nice way of adding new records to the system, I thought I would then create a mechanism that looks at the first half of the RowKey, i.e. the 0008638662595845431_ part, and does a greater-than or less-than comparison depending on the paging direction relative to the already-found item. In other words, to get the row immediately before 0008638662595845431 I would do a query like so:
var tableService = azure.createTableService();
var minPossibleDateTimeNumber = pad(new Date(-100000000 * 86400000).getTime() - new Date().getTime(), 19);
tableService.getTable('testTable', function (error) {
    if (error === null) {
        var query = azure.TableQuery
            .select()
            .from('testTable')
            .where('PartitionKey eq ?', 'TEST')
            .and('RowKey gt ?', minPossibleDateTimeNumber + '_')
            .and('RowKey lt ?', '0008638662595845431_')
            .and('Deleted eq ?', 'false');
        // ...then run the query, e.g. with tableService.queryEntities(query, callback)
    }
});
If more than 1000 results are returned and Azure gives me a continuation token, I thought I would remember the last item's RowKey, i.e. the number part 0008638662595845431. The next query then uses the remembered value as the starting value, and so on.
I am using the Windows Azure Node.js SDK and the language is JavaScript.
Can anybody see gotcha's or problems with this approach?
I do not see how this can work effectively and efficiently, especially to get the rows for a previous page.
To be efficient, the prefix of your “key” needs to be a serially incrementing or decrementing value, instead of being based on a timestamp. A timestamp-generated value would have duplicates as well as holes, making the mapping of page size to row count inefficient at best and difficult to determine at worst.
Also, this potential algorithm is dependent on a single partition key, destroying table scalability.
The challenge here would be to have a method of generating a serially incremented key. One solution is to use a SQL database and perform an atomic update on a single row, such that an incrementing or decrementing value is produced in sequence. Something like UPDATE … SET X = X + 1 and return X. Maybe using a stored procedure.
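As a rough sketch of that idea (the Counters table and its column names here are hypothetical, and the syntax assumes SQL Server / Azure SQL), the atomic increment could be a single statement:

UPDATE Counters
SET    CurrentValue = CurrentValue + 1
OUTPUT inserted.CurrentValue
WHERE  CounterName = 'rowkey';

The OUTPUT clause returns the incremented value from the same atomic statement, so concurrent callers cannot receive the same number.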
So the key could be a zero-left-padded, serially generated number, split such that, say, the first N digits of the number are the partition key and the remaining M digits are the row key.
For example
PKey RKey
00001 10321
00001 10322
….
00954 98912
Now, since the rows are in sequence it is possible to write a query with the exact key range for the page size.
Caveat: there is a small risk of a failure occurring between generating a serial key and writing to table storage, in which case there may be holes in the table. However, your paging algorithm should be able to detect and work around such instances quite easily, by specifying a page size slightly larger than necessary or by retrying with an adjusted range.
