Referencing object's identity before submitting changes in LINQ - linq

is there a way of knowing ID of identity column of record inserted via InsertOnSubmit beforehand, e.g. before calling datasource's SubmitChanges?
Imagine I'm populating some kind of hierarchy in the database, but I wouldn't want to submit changes on each recursive call of each child node (e.g. if I had Directories table and Files table and am recreating my filesystem structure in the database).
I'd like to do it that way, so I create a Directory object, set its name and attributes,
then InsertOnSubmit it into DataContext.Directories collection, then reference Directory.ID in its child Files. Currently I need to call InsertOnSubmit to insert the 'directory' into the database and the database mapping fills its ID column. But this creates a lot of transactions and accesses to database and I imagine that if I did this inserting in a batch, the performance would be better.
What I'd like to do is to somehow use Directory.ID before commiting changes, create all my File and Directory objects in advance and then do a big submit that puts all stuff into database. I'm also open to solving this problem via a stored procedure, I assume the performance would be even better if all operations would be done directly in the database.

One way to get around this is to not use an identity column. Instead build an IdService that you can use in the code to get a new Id each time a Directory object is created.
You can implement the IdService by having a table that stores the last id used. When the service starts up have it grab that number. The service can then increment away while Directory objects are created and then update the table with the new last id used at the end of the run.
Alternatively, and a bit safer, when the service starts up have it grab the last id used and then update the last id used in the table by adding 1000 (for example). Then let it increment away. If it uses 1000 ids then have it grab the next 1000 and update the last id used table. Worst case is you waste some ids, but if you use a bigint you aren't ever going to care.
Since the Directory id is now controlled in code you can use it with child objects like Files prior to writing to the database.
Simply putting a lock around id acquisition makes this safe to use across multiple threads. I've been using this in a situation like yours. We're generating a ton of objects in memory across multiple threads and saving them in batches.
This blog post will give you a good start on saving batches in Linq to SQL.

Not sure off the top if there is a way to run a straight SQL query in LINQ, but this query will return the current identity value of the specified table.
USE [database];
GO
DBCC CHECKIDENT ("schema.table", NORESEED);
GO

Related

Simulating server-side group and sort in Azure table storage

I have a table to which I add records whenever the user views a particular resource. The key fields are
Username
Resource
Date Viewed
On a history page of my app, I want to present a set number (e.g., top 5) of the user's most recently viewed Resources, but I want to group by Resource, so that if some were viewed several times, only the most recent of each one is shown.
To be clear, if the raw data looked like this:
UserA | ResourceA | Jan 1
UserA | ResourceA | Jan 2
UserA | ResourceB | Jan 3
UserA | ResourceA | Jan 4
...
...only the bottom two records would appear in the history page.
I know you can get server-side chronological sorting by using a string derived from the date in the PartitionKey or RowKey fields.
I also see that you could enable a crude grouping mechanism by using Username and Resource as your PartitionKey and RowKey fields, and then using Insert-or-update, to maintain a table in which you kept pointers for the most recent value for each combination. However, those records wouldn't be sorted chronologically.
Is there any way to design a set of tables so that I can get the data I need without retrieving tons of extra entities and sorting on the client? I'm willing to get elaborate with the design if that's what it takes. Thanks in advance!
First, I would strongly recommend that you read this excellent Azure Storage Table Design Guide: Designing Scalable and Performant Tables document from Storage team.
Yes, I would agree that it is somewhat tricky with Azure Table Storage but it is doable :).
What you have to do is keep multiple copies of the same data. Each copy will serve a different purpose.
Considering the scenario where you want to fetch most recent lines for Resource A and B, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks) reversed i.e. DateTime.MaxValue.Ticks - LastAccessedDateTime.Ticks. Reverse ticks is required to that most recent entries will show up on the top of the table.
RowKey: Resource name.
AccessDate: Indicates the last access date/time.
User: Name of the user who accessed that resource.
So when you are interested in just finding out most recently used resources, you could start fetching records from the top.
In short, your data storage approach should be primarily governed by how you want to fetch the data. It would even mean you will have to save the same data multiple times.
UPDATE
As discussed in the comments below, Table Service doesn't directly support Server Side Grouping. This is something that you would need to do on your own. What you could do is create a separate table to store the access counts. As and when the resources are accessed, you basically either insert a new record in that table or update the count for that resource in that table.
Assuming you're always interested in finding out resource access count within a date/time range, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks). The precision would depend on your reporting requirement. For example, if you want to maintain access counts by day then your precision would be a day.
RowKey: Resource name.
AccessCount: This field will constantly update as and when a resource is accessed.
LastAccessDateTime: This field will denote when a resource was last accessed.
For updating access counts, I would recommend that you make use of a background process. Basically in this approach, as a resource is accessed you add a message in a queue. This message will have resource name and date/time resource was last accessed. Then have a background process poll this queue and fetch messages. As the messages are received, you first get the current count and last access date/time for that resource. If no records are found, you simply insert a record in this table with count as 1. If a record is found then you compare the date/time from the table with the date/time sent in the message. If the date/time from the table is smaller than the date/time sent in the message, you update both count (increase that by 1) and last access date/time. If the date/time from the table is more than the date/time sent in the message, you only update the count.
Now to find most accessed resources in a time span, you simply query this table. Assuming there are limited number of resources (say in 100s), you can get this information from the table with at least 1 request. Since you're dealing with small amount of data, you can simply download this data on the client side and order it anyway you see fit. However to see the access details for a particular resource, you would have to fetch detailed data (1000 entities at a time).
Part of your brain might still be unconsciously trapped in relational-table design paradigms, I'm still getting to grips with that issue myself.
Rather than think of table storage as a database table (with the "query-ability" that goes with it) try visualizing it in more simple (dumb) terms.
A design problem I'm working on now is storing financial transaction data, and I want to know what the total $ amount of these transactions are. Because Azure table storage doesn't (yet?) offer aggregate functions I can't simply go .Sum(). To get around that I'm going to:
Sum the values of the transactions in my app before I pass them to azure.
I'll then pass that the result of the sum into azure as a separate piece of information, called RunningTotal.
Later on I can just return RunningTotal rather than pulling down all the transactions, and I can repeat the process by increment the value of RunningTotal each time i get new transactions.
Of course there are risks to this but the app is a personal one so the risk level is low and manageable, at least as a proof-of-concept.
Perhaps you can use a similar approach for the design of your system: compute useful values in advance. I'll almost be using table storage as a long-term cache rather than a database.

Cassandra Best Practice on edits: delete & re-insert vs. update?

I am new to Cassandra. I am looking at many examples online. Here is one from JHipster Cassandra examples on GitHub:
https://gist.github.com/jdubois/c3d3bedb869466731316
The repository save(user) method does a read (to look for existence) then a delete and re-insert of the existing user across all the denormalized tables whenever the user data changed.
Is this best practice?
Is this only because of how the data model for this sample is designed?
Is this sample's design a result of twisting a POJO framework into a NoSQL database design?
When would I want to just do a update in Cassandra? It supports updates at the field-level, so it seems like that would be preferred.
First of all, the delete operations should be part of the batch for more robust error handling. But it looks like there are also some concurrency issues with the code. It will update the user based on the current user value read before. It's not save to assume this will still be the latest value while save() is actually executed. It will also just overwrite any keys in the lookup table that might be in use for a different user at that point. E.g. the login could already exist for another user while executing insertByLoginStmt.
It is not necessary to delete a row before inserting a new one.
But if you are replacing rows and new columns are different from existing columns then you need to delete all existing columns and insert new columns. Or insert new and delete old, does not matter if happens in batch.

ADOX Rearrange Or Insert Columns Rather than Append them in Access Vb6, VB.Net or CSharp

I need to insert a field in the middle of current fields of a database table. I'm currently doing this in VB6 but may get the green light to do this in .net. Anyway I'm wondering since Access gives you the ability to "insert" fields in the table is there a way to do this in ADOX? If I had to I could step back and use DAO, but not sure how to do it there either.
If yor're wondering why I want to do this applications database has changed over time and I'm being asked to create Upgrade program for some of the installations with older versions.
Any help would be great.
This should not be necessary. Use the correct list of fields in your queries to retrieve them in the required order.
BUT, if you really need to do that, the only way i know is to create a new table with the fields in the required order, read the data from the old table into the new one, delete the old table and rename the new table as the old one.
I hear you: in Access the order of the fields is important.
If you need a comprehensive way to work with ADOX, your go to place is Allen Browne's website. I have used it to from my novice to pro in handling Access database changes. Here it is: www.AllenBrowne.com. Go to Access Tips then scroll down to ADOX Code.
That is also where I normally refer people with doubts about capabilities of Access as a database :)
In your case, you will juggle through creating a new table with the new field in the right position, copying data to the new table, applying properties to the fields, deleting original table, renaming the new table to the required (original) name.
That is the correct order. Do not apply field properties before copying the data. Some indexes and key properties may not be applied when the fields already have data.
Over time, I have automated this so I just run an application to do detect and implement the required changes for me. But that took A LOT of work-weeks.

Does Core Data/SQLite compress redundant information?

I want to use Core Data (probably with SQLite backing) to store a large database. Much of the string data will be the same between numerous rows. Does Core Data/SQLite see such redundancy, and automatically save space in the db files?
Do I need to make sure that the same text in different rows is the same string object before adding it to the db? If so, how do I detect that a new piece of text matches something anywhere in the existing db?
No, Core Data does not attempt to analyze your data to avoid duplication. If you want to save 10 million objects with the same attributes, you'll get 10 million copies.
If you want to avoid creating duplicate instances, you need to do a fetch for matching instances before creating a new one. The general approach is
Fetch objects matching new data-- according to whatever standard indicates a duplicate for your app. Use a predicate with the fetch that contains the attribute(s) that you don't want to duplicate.
If you find anything, either (a) update the instances you find with any new values you have, or (b) if there are no new values, do nothing.
If you don't find anything, create a new instance.
Application-layer logic can help reduce space at the cost of application complexity.
Say your name field can contain either an integer or a string. (SQLite's weak typing makes this easy to do).
If string -- that's the name right there.
If integer -- go look it up on a name table, using the int as key
Of course you have to create that name table, either on the fly as data is inserted, or a once-in-a-while trawl through the data for new names that are worth surrogating in this way.

Entity Framework performance issues

I'm having a problem with performance with the entity framework.
Here's the scenario.
I have an entity called "Segment". Each of these are stored in their own table in the DB.
"Segments" have a custom property called "IsHPMSSegment" which is a calculated field. It is calculated by calling a stored procedure in the DB that takes the "ID" of the "Segment" and compares some of it's value against values in another table.
One of the queries we need to run is stated as follows: Get me all Segments that are HPMS Segments.
Since the "ISHPMSSegment" value of "Segment" is a custom property, I cannot retrieve it's value directly from the DB when the segments are first selected. Instead, as each "Segment" is being created in the result set, entity framework queries the db again to get the value for "IsHPMSSegment". So everytime a "Segment" is being filled, it has to query the DB once again for each Segment returned.
Example: If I get all "Segments" with an ID greater than 5, and the resultset is 1000 segments, then the DB is hit for a total of 1001 times. Once for the initial select query that gets the 1000 records, and then another 1000 times to fill the "IsHPMSSegment" value of each of the "Segments".
The only workaround I can think of it to create a view in the DB ("vSegments") that contains this extra calculated property, and then link my EF object to this view, instead of to the "Segment" table. That way this property would be filled in the first query.
I then have two choices for the remaining functionality (insert, update, delete):
1) wire up my insert, update, and delete functions for the entity to stored procedures
2) make the view updatable
All of this seems like a lot of extra work just to address this performance issue, and I'm left wondering what benefit there is to using EF at all?
Is there a better solution to the "view + stored procedures" idea I stated above (still using EF)?
If not, what benefit does EF provide me? If I was creating my own DAL from scratch, I would still have to create stored procedures and/or views. How much effort am I really saving by using EF and having to program around it's limitations?
On top of all this, EF doesn't seem to handle updating multiple records at once in a satisfactory way. It sends a single update statement call for each record you are updating, even if you are updating them all exactly the same. This also seems to be a detractor (unless there is some workaround for this that I am unaware of).
This is entirely subjective. In my option the separation of duties between your layers is getting mixed up and causing you problems.
My suggestion would be to remove your stored procedure and move the logic into you business layer. Creation of your 'segments' should start in your business layer and have all the appropriate logic done against it. The final state can then be pushed into your data access layer for persistence.

Resources