How could I quickly look up items in a List of loaded entities? (performance)

I have built an MVC 5 application, using EF 6 to query the database. One page shows a cross table of two dimensions: substances against properties of these substances. It is rendered as an HTML table. Many cells do not have a value. This is what it looks like:
sub 1 sub 2 sub 3
prop A 1.0
prop B 1.5 X
prop C 0.6 Y
The cell values are actually more complex, including tool tips, footnotes, etc.
I implemented the generation of the HTML table with the following steps:
1) create a list of unique properties;
2) create a list of unique substances;
3) loop through the properties;
4) render a row for each;
5) loop through the substances;
6) see if there is a value for the combination of property and substance;
7) render the cell's value or an empty one.
Using the ANTS performance profiler, I found that step 6 has a huge performance problem with increasing numbers of substances and properties: the hit count explodes to hundreds of millions with a few hundred substances and a few tens of properties (the largest selection the user can make). The execution time is many minutes. It seems to scale as N(substances)^2 * N(properties)^2.
The code looks like:
Value currentValue = values
    .Where(val => val.substance.Id == currentSubstanceId
               && val.property.Id == currentPropertyId)
    .SingleOrDefault();
where values is a List and Value is an entity that I read from to render the cells. values had been pre-loaded from the database, and no queries show up in the SQL Server Profiler.
Since not all cells have a value, I thought it best to loop through the rows and columns and check whether there is a value for each combination; I cannot just loop through the list of values.
What could I try to improve this? I thought about:
Create some sort of C# object, using the substance.Id and property.Id as a compound key and fill it from the List object. Which would be fastest?
Create a LINQ query which returns an object that already contains the empty cells, like (substances cross join properties) left join values. I could do this in SQL easily, but could this be done with LINQ (see the sketch after this list)? Could the object which stores the result have the Value as a member field, so I can still use it to render the cells?
Stop pre-loading and just run a database query for the value of each combination, possibly benefiting from database indexes.
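For the second option, something like this is what I have in mind, though I have not tested it (Substances, Properties and values stand for the in-memory collections I already have, and Id is assumed to be the same type on both sides):
var cells =
    from s in Substances
    from p in Properties                       // cross join: one entry per cell
    join v in values
        on new { SubstanceId = s.Id, PropertyId = p.Id }
        equals new { SubstanceId = v.substance.Id, PropertyId = v.property.Id }
        into match
    from v in match.DefaultIfEmpty()           // left join: v is null for empty cells
    select new { Substance = s, Property = p, Value = v };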
I am considering restricting the number of substances and properties the user may select, but I would rather not do that.
Additional info
As requested by C.Zonnenberg, some more info about the query.
The query to fill the list of values is basically as follows:
I create an IQueryable to which I add filters for the requested substances and properties. I then include the substance, property and value details, found in related entities, and execute query.ToList(); a simplified sketch of this follows below. The actual SQL query, as seen in the SQL Profiler, looks complex, involving SubstanceId IN (...) and PropertyId IN (...), but it takes far less than a second to execute.
It returns a list of proxies, like: {System.Data.Entity.DynamicProxies.SubstancePropertyValue_078F758A4FF9831024D2690C4B546F07240FAC82A1E9D95D3826A834DCD91D1E}
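In code, the query construction is roughly this shape (simplified; db, Values and the navigation property names are placeholders standing in for my actual context and model):
// Lambda Include requires 'using System.Data.Entity;' (EF 6).
IQueryable<Value> query = db.Values
    .Where(v => requestedSubstanceIds.Contains(v.substance.Id))
    .Where(v => requestedPropertyIds.Contains(v.property.Id))
    .Include(v => v.substance)     // pre-load related entities
    .Include(v => v.property);

List<Value> values = query.ToList();   // one SQL round trip, well under a second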

I think your best bet is your first option. But to do that efficiently I would also modify the source data (values) and turn it into a dictionary, so you have a structure that's optimized for indexed lookup:
// Key: (substance Id, property Id); value: the Value entity itself.
var dict = values.ToDictionary(
    v => Tuple.Create(v.substance.Id, v.property.Id));
Then for each cell:
Value currentValue;
dict.TryGetValue(Tuple.Create(currentSubstanceId, currentPropertyId),
    out currentValue);
Further, you may benefit from parallelization, for instance by fetching the cell values in a Parallel.ForEach loop over all substances.
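A minimal sketch of that idea, assuming substances and properties are lists with int Ids, dict is the dictionary from above, and the actual HTML rendering still happens sequentially afterwards (uses System.Collections.Concurrent and System.Threading.Tasks):
// Pre-fetch the cell values per substance in parallel.
var cellsBySubstance = new ConcurrentDictionary<int, Value[]>();

Parallel.ForEach(substances, substance =>
{
    var cells = new Value[properties.Count];
    for (int i = 0; i < properties.Count; i++)
    {
        Value v;
        dict.TryGetValue(Tuple.Create(substance.Id, properties[i].Id), out v);
        cells[i] = v;   // null means an empty cell
    }
    cellsBySubstance[substance.Id] = cells;
});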

Related


Finding TOP items in query
I have an Access query with fields (in simplified form) name, type, value. I need to extract the top x records (according to value) FOR EVERY PAIR (name, type) with x depending on the pair. The query already has the column "value" sorted for each pair.
Solution 1. Do separate queries per pair, take the top x in each and build the union of the queries. Wrong! The number of pairs is large; Access can't handle the resulting query.
Solution 2. Add an extra column to the query, call it "Valid", and set it to True in all records. Then use VBA to traverse the query's recordset one record at a time and set Valid to False for the non-top items. Then do an additional query dropping the False records. Wrong again: the recordset is not editable in VBA (even though "Valid" has nothing to do with any tables used in the query). Yes, I opened the recordset in VBA with dbOpenDynaset -- no dice.
Any ideas? Thanks

Windows Azure Paging Large Datasets Solution

I'm using Windows Azure Table Storage to store millions of entities, and I'm trying to figure out the best solution that easily allows for three things:
1) a search on an entity, will retrieve that entity and at least (pageSize) number of entities either side of that entity
2) if there are more entities beyond (pageSize) number of entities either side of that entity, then page next or page previous links are shown, this will continue until either the start or end is reached.
3) the order is reverse chronological order
I've decided that the PartitionKey will be the Title provided by the user, as each container is unique in the system. The RowKey uses Steve Marx's lexicographical algorithm:
http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure
which when converted to javascript instead of c# looks like this:
pad(new Date(100000000 * 86400000).getTime() - new Date().getTime(), 19) + "_" + uuid()
uuid() is a JavaScript function that returns a GUID, and pad() adds zeros up to 19 characters in length. So records in the system look something like this:
PK RK
TEST 0008638662595845431_ecf134e4-b10d-47e8-91f2-4de9c4d64388
TEST 0008638662595845432_ae7bb505-8594-43bc-80b7-6bd34bb9541b
TEST 0008638662595845433_d527d215-03a5-4e46-8a54-10027b8e23f8
TEST 0008638662595845434_a2ebc3f4-67fe-43e2-becd-eaa41a4132e2
This pattern allows for every new entity inserted to be at the top of the list which satisfies point number 3 above.
With a nice way of adding new records to the system, I thought I would then create a mechanism that looks at the first half of the RowKey, i.e. the 0008638662595845431_ part, and does a greater-than or less-than comparison depending on the direction relative to the item already found. In other words, to get the row immediately before 0008638662595845431 I would do a query like so:
var tableService = azure.createTableService();
var minPossibleDateTimeNumber = pad(new Date(-100000000 * 86400000).getTime() - new Date().getTime(), 19);

tableService.getTable('testTable', function (error) {
    if (error === null) {
        var query = azure.TableQuery
            .select()
            .from('testTable')
            .where('PartitionKey eq ?', 'TEST')
            .and('RowKey gt ?', minPossibleDateTimeNumber + '_')
            .and('RowKey lt ?', '0008638662595845431_')
            .and('Deleted eq ?', 'false');
        // ... run the query and render the results ...
    }
});
If the results returned exceed 1,000 and Azure gives me a continuation token, then I thought I would remember the last item's RowKey, i.e. the number part 0008638662595845431. So the next query will have the remembered value as the starting value, etc.
I am using the Windows Azure Node.js SDK and the language is JavaScript.
Can anybody see gotchas or problems with this approach?
I do not see how this can work effectively and efficiently, especially to get the rows for a previous page.
To be efficient, the prefix of your "key" needs to be a serially incrementing or decrementing value, instead of being based on a timestamp. A timestamp-generated value would have duplicates as well as holes, making the mapping of page size to row count inefficient at best and difficult to determine at worst.
Also, this potential algorithm is dependent on a single partition key, destroying table scalability.
The challenge here would be to have a method of generating a serially incremented key. One solution is to use a SQL database and perform an atomic update on a single row, such that an incrementing or decrementing value is produced in sequence, something like UPDATE ... SET X = X + 1 and return X, possibly in a stored procedure.
So the key could be a zero-left-padded, serially generated number, split such that, say, the first N digits of the number are the partition key and the remaining M digits are the row key.
For example
PKey RKey
00001 10321
00001 10322
….
00954 98912
Now, since the rows are in sequence it is possible to write a query with the exact key range for the page size.
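A rough C# sketch of the idea; GetNextSerial() is a hypothetical stand-in for whatever atomic counter you use (e.g. the SQL UPDATE above), and the digit counts are just examples. For reverse chronological order you would use a decrementing counter, or subtract the serial from a large constant, so that newer rows sort first.
// Hypothetical: GetNextSerial() returns the next value from an atomic counter.
long serial = GetNextSerial();

// Zero-pad to a fixed width, e.g. 10 digits: 5 for the partition key, 5 for the row key.
string padded = serial.ToString("D10");
string partitionKey = padded.Substring(0, 5);   // first N digits
string rowKey = padded.Substring(5);            // remaining M digits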
Caveat: there is a small risk of a failure occurring between generating a serial key and writing to table storage, in which case there may be holes in the table. However, your paging algorithm should be able to detect and work around such instances quite easily, by specifying a page size slightly larger than necessary or by retrying with an adjusted range.

Query core data store based on a transient calculated value

I'm fairly new to the more complex parts of Core Data.
My application has a core data store with 15K rows. There is a single entity.
I need to display a subset of those rows in a table view filtered on a calculated search criteria, and for each row displayed add a value that I calculate in real time but don't store in the entity.
The calculation needs to use a couple of values supplied by the user.
A hypothetical example:
Entity: contains fields "id", "first", and "second"
User inputs: 10 and 20
Search / Filter Criteria: only display records where the entity field "id" is a prime number between the two supplied numbers. (I need to build some sort of complex predicate method here I assume?)
Display: all fields of all records that meet the criteria, along with a derived field (not in the Core Data entity) that is the sum of the "id" field and a random number, so each row in the table view would contain 4 fields:
"id", "first", "second", -calculated value-
From my reading / Googling it seems that a transient property might be the way to go, but I can't work out how to do this given that the search criteria and the resultant property need to calculate based on user input.
Could anyone give me any pointers that will help me implement this code? I'm pretty lost right now, and the examples I can find in books etc. don't match my particular needs well enough for me to adapt them as far as I can tell.
Thanks
Darren.
The first thing you need to do is to stop thinking in terms of fields, rows and columns, as none of those structures are actually part of Core Data. In this case, it is important because Core Data supports arbitrarily complex fetches but the SQLite store does not. So, if you use a SQLite store, your fetches are restricted to those supported by SQLite.
In this case, predicates aimed at SQLite can't perform complex operations such as calculating whether an attribute value is prime.
The best solution for your first case would be to add a boolean attribute isPrime and then modify the setter for your id attribute to calculate whether the set id value is prime or not and set isPrime accordingly. That will be stored in the SQLite store and can be fetched against, e.g. isPrime == YES && ((first <= %@) && (second >= %@))
The second case would simply use a transient property for which you would supply a custom getter to calculate its value when the managed object was in memory.
One often overlooked option is to not use an sqlite store but to use an XML store instead. If the amount of data is relatively small e.g. a few thousand text attributes with a total memory footprint of a few dozen meg, then an XML store will be super fast and can handle more complex operations.
SQLite is sort of the stunted stepchild in Core Data. It's useful for large data sets and low memory, but with memory becoming ever more plentiful, it's losing its edge. I find myself using it less these days. You should consider whether you need SQLite in this particular case.

Ultragrid: how best to add a set of sub rows programmatically?

I have an Infragistics UltraGrid that is being used to display a list of attributes. Sometimes an attribute is an array, so I am adding a sub row for each element so the user can optionally expand the row showing the array attribute and see all the element values.
So for each element I use:
var addedRow = mGrid.DisplayLayout.Bands[1].AddNew();
which, if I have 300 elements, gets called 300 times and takes around 9 seconds (I have profiled the application and this call is taking 98% of the elapsed time).
Is there a way to add these sub rows more efficiently?
I know I'm late with an answer, but hopefully someone can use it anyway. Whenever I need to set rows and sub rows for an UltraGrid, I simply set the data source by using LINQ and anonymous types to generate the proper collection.
Say you have a list of persons (Id, Name) and a list of cars (Id, CarName, and OwnerId (the person's Id)).
Now you would like to show a grid listing all persons, with an expandable sub row showing which cars they own. Simply do the following:
List<Person> persons = GetAllPersons();
List<Car> cars = GetAllCars();
grid.DataSource = persons
    .Select(x => new { x.Id, x.Name, Cars = cars.Where(z => z.OwnerId == x.Id).ToList() })
    .ToList();
Note the anonymous type I create: this will generate a list of objects having an Id, a Name, and a collection of cars. Also note that I call the ToList method twice in the last line; this is necessary in order to get the UltraGrid to bind properly.
Note furthermore that if you need to edit the grid, the above method might not be sufficient, as the UltraGrid needs an underlying data source for modifying, and I don't believe this will cope. But on the internet you'll find extensions that can copy a LINQ collection into a DataTable (a sketch of such an extension is below); do that, and you should also be able to edit the grid.
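A minimal sketch of such an extension, using reflection (assuming simple, flat property types; a nested collection like Cars above would still need separate handling; requires System.Data):
public static DataTable ToDataTable<T>(this IEnumerable<T> items)
{
    var table = new DataTable();
    var props = typeof(T).GetProperties();

    // One column per public property.
    foreach (var prop in props)
        table.Columns.Add(prop.Name,
            Nullable.GetUnderlyingType(prop.PropertyType) ?? prop.PropertyType);

    // One row per item.
    foreach (var item in items)
    {
        var row = table.NewRow();
        foreach (var prop in props)
            row[prop.Name] = prop.GetValue(item, null) ?? DBNull.Value;
        table.Rows.Add(row);
    }
    return table;
}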
I have often used the anonymous-type approach above and it performs extremely well, even for huge collections.
Hope this helps somebody
You might want to use ultraGrid1.BeginUpdate() and ultraGrid1.EndUpdate(true) to stop the screen from repainting; it made a huge performance difference for my app.
Also, in my case I was populating more than 10,000 rows, so I used an UltraDataSource.
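A sketch of that BeginUpdate/EndUpdate pattern applied to the AddNew loop from the question (untested; arrayElements is a hypothetical placeholder for the 300 array elements):
mGrid.BeginUpdate();   // suspend painting while the rows are added
try
{
    foreach (var element in arrayElements)   // hypothetical source collection
    {
        var addedRow = mGrid.DisplayLayout.Bands[1].AddNew();
        // ... populate addedRow.Cells from element ...
    }
}
finally
{
    mGrid.EndUpdate(true);   // resume painting and refresh once
}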

Aggregate child table values in Birt

I have this Birt report that I inherited from another developer, consisting of a child table inside a master table. For each row in the master table, the child table lists items belonging to the current master row item.
The two tables are fed from different data sets, the child table dataset taking a parameter indicating the master item whose child items to fetch.
Now, what I need to do is add a SUM aggregate to the bottom of the master table, showing the total (for all master items) of a certain field in the child table.
Consider, for example, the following data:
MasterItem1
ChildItem1 SomeValue
ChildItem2 SomeValue
ChildItem3 SomeValue
MasterItem2
ChildItem1 SomeValue
ChildItem2 SomeValue
ChildItem3 SomeValue
--------------------------------
Total
(Why wasn't this done with grouping instead? Short answer: there are in fact two child tables for each master row, containing different numbers and types of fields, so the previous developer probably didn't figure out a way to accomplish this with grouping.)
At first I thought I could simply add another child table inside the Total field, with an aggregate summing up the values from the child dataset. That didn't work, however, since the child dataset requires a parameter indicating the master item whose children to fetch, so there is no way to get ALL values from the child dataset at once.
I'm thinking there might be a way to create an expression that references the SomeValue fields in the child table directly, instead of going through the child data set.
Any suggestions are greatly appreciated.
It should be possible to declare a global variable at the start of the report, then add each of the child values to it in one of the child table row events and output it at the end of the report - if you're comfortable writing JavaScript, this is probably the quickest solution.
If you're not comfortable writing Javascript (I'm not) or if the above technique doesn't work out, you could try either:
creating a third dataset, combining the master data items from the main report with the child data items from the subreport and outputting the total in a new data table, or
combining the two existing child data value tables via a union (so that if the master table is A, the main child table is B and the subreport table is C, you have AB union AC), replacing the subreport table and the existing detail rows with new detail rows conditional on child row type, and a total at the end of the report based on the AC values.
Obviously, the latter of these approaches is more complicated - but I think it should be easier to understand and maintain.
The global variable is the way to go. For each row in the child table, add the required value to the global variable and then access it for display at the bottom of the table. No hard JavaScript is required:
var Sum = reportContext.getPersistentGlobalVariable( "RunningAggregate" );
Sum = Sum + row["column Name"];
reportContext.setPersistentGlobalVariable( "RunningAggregate", Sum );
You can then access the Global Variable in the footer of your table via a Dynamic Text item.
Good Luck!
Thanks Mark and Mystik, both your answers led me on the right path!
My final solution is as follows:
1) Declare the sum-variable in the initialize method for the report:
var total = 0;
2) Add each row's value to the sum variable in the onRender method of the data field containing the values:
total += parseInt(this.getValue());
3) Use the sum-variable as expression in the total-field.
Works like a charm.
Update:
Found a bug in my solution: the last line was left out of the sum. I think the value of the total-cell in the table footer is being defined before the last line has been rendered.
Fix:
Moved summing code from onRender method to onCreate
Added the following code to the total-cell's onRender method:
this.setDisplayValue(total);
