EF4 Import/Lookup thousands of records - my performance stinks! - performance

I'm trying to set up something for a movie store website (using ASP.NET, EF4, SQL Server 2008), and in my scenario, I want to allow a "Member" store to import their catalog of movies stored in a text file containing ActorName, MovieTitle, and CatalogNumber as follows:
Actor, Movie, CatalogNumber
John Wayne, True Grit, 4577-12 (repeated for each record)
This data will be used to look up an actor and movie and create a "MemberMovie" record. My import speed is terrible if I import more than 100 or so records using these tables:
Actor Table: Fields = {ID, Name, etc.}
Movie Table: Fields = {ID, Title, ActorID, etc.}
MemberMovie Table: Fields = {ID, CatalogNumber, MovieID, etc.}
My methodology to import data into the MemberMovie table from a text file is as follows (after the file has been uploaded successfully):
Create a context.
For each line in the file, look up the actor in the Actor table.
For each Movie belonging to that actor, look up the matching title.
If a matching Movie is found, add a new MemberMovie record to the context and call ctx.SaveChanges().
The performance of my implementation is terrible. My expectation is that this can be done with thousands of records in a few seconds (after the file has been uploaded), and I've got something that times out the browser.
My question is this: What is the best approach for performing bulk lookups/inserts like this? Should I call SaveChanges only once rather than for each newly created MemberMovie? Would it be better to implement this using something like a stored procedure?
A snippet of my loop is roughly this (edited for brevity):
while ((fline = file.ReadLine()) != null)
{
    string[] token = fline.Split(separator);
    string actor = token[0];
    string title = token[1];
    string CatNumber = token[2];

    // Look up the actor by name; skip the line if there is no match.
    Actor found_actor = ctx.Actors.Where(a => a.Name.Equals(actor)).FirstOrDefault();
    if (found_actor == null)
        continue;

    // Look up the movie among this actor's movies, ignoring case.
    Movie found_movie = found_actor.Movies.Where(s => s.Title.Equals(title, StringComparison.CurrentCultureIgnoreCase)).FirstOrDefault();
    if (found_movie == null)
        continue;

    ctx.MemberMovies.AddObject(new MemberMovie()
    {
        MemberProfileID = profile_id,
        CatalogNumber = CatNumber,
        Movie = found_movie
    });

    try
    {
        // Saves once per line, i.e. one round trip per record.
        ctx.SaveChanges();
    }
    catch
    {
        // error handling omitted for brevity
    }
}
Any help is appreciated!
Thanks, Dennis

First:
Some time ago I wrote an answer about calling SaveChanges after 1, n or all rows:
When should I call SaveChanges() when creating 1000's of Entity Framework objects? (like during an import)
It is actually better to call SaveChanges after a batch of several rows than after every single row, but not to wait until all rows have been added.
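The batching idea can be sketched outside of EF. The following is a minimal illustration in Python with the standard sqlite3 module, not the EF4 API: the table name member_movie, the helper names, and the batch size of 100 are all assumptions made for the example. Each commit plays the role of one SaveChanges call.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def import_in_batches(conn, rows, batch_size=100):
    """Insert rows, committing once per batch. Returns the number of commits."""
    commits = 0
    for chunk in batched(rows, batch_size):
        conn.executemany(
            "INSERT INTO member_movie (catalog_number, movie_id) VALUES (?, ?)",
            chunk,
        )
        conn.commit()  # one commit per batch, analogous to one SaveChanges per batch
        commits += 1
    return commits


# Hypothetical in-memory table standing in for MemberMovie.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member_movie (catalog_number TEXT, movie_id INTEGER)")
rows = [("4577-%d" % i, i) for i in range(250)]
commits = import_in_batches(conn, rows, batch_size=100)
```

With 250 rows and a batch size of 100, the loop commits three times instead of 250. The right batch size depends on the data and the database, so it is worth measuring.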
Second:
Make sure you have an index on Name in the Actors table and on Title in the Movies table; that should help. Also, you shouldn't select the whole Actor if you only need its ID:
Instead of:
Actor found_actor = ctx.Actors.Where(a => a.Name.Equals(actor)).FirstOrDefault();
you can select:
int? found_actor_id = ctx.Actors.Where(a => a.Name.Equals(actor)).Select(a => (int?)a.ID).FirstOrDefault();
and then
Something.ActorID = found_actor_id;
This can be faster because it doesn't load the whole Actor entity and doesn't require additional lookups, especially when combined with an index.
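The same two ideas (index the lookup column, fetch only the ID) can be shown in a small self-contained sketch. It uses Python's sqlite3 module rather than EF4 and SQL Server, and the table, column, and index names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")
# Index the column used in the WHERE clause of the lookup.
conn.execute("CREATE INDEX ix_actor_name ON actor (name)")
conn.executemany(
    "INSERT INTO actor (name, bio) VALUES (?, ?)",
    [("John Wayne", "long text..."), ("Kim Darby", "long text...")],
)


def find_actor_id(conn, name):
    """Return only the actor's id, or None when no actor has that name."""
    row = conn.execute(
        "SELECT id FROM actor WHERE name = ? LIMIT 1", (name,)
    ).fetchone()
    return row[0] if row else None
```

Returning None for a missing actor mirrors the nullable ID in the C# snippet, so the caller can skip the line without loading any other columns.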
Third:
If you send a very large file, a timeout is still possible even with good performance. You should run the import in a separate thread and return the response immediately. You can assign an identifier to each import and let the user check its status by that ID.
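A rough sketch of that pattern, written in Python rather than ASP.NET: the function names start_import and get_status and the in-memory job table are hypothetical, and a real site would probably persist job state somewhere more durable than a dictionary.

```python
import threading
import uuid

_jobs = {}                 # job id -> status dictionary (in-memory, for illustration)
_lock = threading.Lock()   # protects _jobs across threads


def _run_import(job_id, lines):
    done = 0
    for line in lines:
        # Placeholder for the real per-line lookup and insert.
        done += 1
        with _lock:
            _jobs[job_id]["processed"] = done
    with _lock:
        _jobs[job_id]["state"] = "finished"


def start_import(lines):
    """Start the import on a worker thread and return its id right away."""
    job_id = str(uuid.uuid4())
    with _lock:
        _jobs[job_id] = {"state": "running", "processed": 0}
    worker = threading.Thread(target=_run_import, args=(job_id, lines), daemon=True)
    worker.start()
    return job_id, worker


def get_status(job_id):
    """Return a snapshot of the job's status for a status-check request."""
    with _lock:
        return dict(_jobs[job_id])
```

The upload request would return only the job ID to the browser, which can then poll a status endpoint backed by get_status. The worker is returned here only so a caller can wait for it in a test.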

Related

Insert a list of objects into my room database at once, and the order of objects is changed

I tried to insert a list of objects from my legacy litepal database into my room database, and when I retrieved them from my room database, I found the order of my objects is no longer the same as that of my old list.
Below is my code:
// LitePal is my legacy database
val litepalNoteList = LitePal.findAll(Note::class.java)
if (litepalNoteList.isNotEmpty()) {
    litepalNoteList.forEach { note ->
        // before insertion, I want to turn my legacy Note objects into TrainingLog objects
        // Note and TrainingLog are different types, but their content should be the same
        val noteContent = note.note_content
        val htmlContent = note.html_note_content
        val createdDate = note.created_date
        val isMarked = note.isLevelUp
        val legacyLog = TrainingLog(
            noteContent = noteContent,
            htmlLogContent = htmlContent,
            createdDate = createdDate,
            isMarked = isMarked)
        // each note is inserted from its own coroutine on the IO dispatcher
        logViewModel.viewModelScope.launch(Dispatchers.IO) {
            trainingLogDao.insertNewTrainingLog(legacyLog)
        }
    } // the end of forEach
}
The problem is that in my Room database, the order of the TrainingLog objects differs randomly from that of my old list in the LitePal database.
Does anyone know why this is happening?
If the order matters, then you should extract data using the ORDER BY phrase. Otherwise you are leaving the order up to the query optimiser.
So, instead of @Query("SELECT * FROM trainingLog"), you could order the result by using @Query("SELECT * FROM trainingLog ORDER BY createdDate ASC").
The efficiency of extracting the above would be improved by having an index on the createdDate column/field (in Room, @ColumnInfo(index = true)). However, it should be noted that there are overheads to having an index. Insertions, deletions, and updates may incur additional processing to maintain the index. Additionally, an index uses more space.
You may wish to have an insert function that can take a list rather than run multiple threaded inserts. Room will then (I believe) do all the inserts in a single transaction (1 disk write instead of many).
e.g.
instead of or as well as
@Insert
fun insert(trainingLog: TrainingLog): Long
you could have
@Insert
fun insert(trainingLogList: List<TrainingLog>): LongArray
Then all you need to do is build the List in your loop and then after the loop invoke the single insert.
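Room sits on top of SQLite, so both suggestions (one list insert in a single transaction, and an explicit ORDER BY when reading) can be sketched with Python's sqlite3 module. This is only an illustration of the idea, not Room code, and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training_log ("
    "id INTEGER PRIMARY KEY, note_content TEXT, created_date TEXT)"
)

# Build the whole list first, as you would in the forEach loop.
legacy_notes = [
    ("third", "2021-03-01"),
    ("first", "2021-01-01"),
    ("second", "2021-02-01"),
]

# One transaction for the whole list: the connection used as a context
# manager commits once on success and rolls back on error.
with conn:
    conn.executemany(
        "INSERT INTO training_log (note_content, created_date) VALUES (?, ?)",
        legacy_notes,
    )

# Read back with an explicit ORDER BY so the order never depends on the planner.
ordered = [
    row[0]
    for row in conn.execute(
        "SELECT note_content FROM training_log ORDER BY created_date ASC"
    )
]
```

Because the order is stated in the query, the result is the same regardless of the order in which the rows were inserted.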

Check if data already exists before inserting into BigQuery table (using Python)

I am setting up a daily cron job that appends a row to BigQuery table (using Python), however, duplicate data is being inserted. I have searched online and I know that there is a way to manually remove duplicate data, but I wanted to see if I could avoid this duplication in the first place.
Is there a way to check a BigQuery table to see if a data record already exists first in order to avoid inserting duplicate data? Thanks.
CODE SNIPPET:
import webapp2
import logging
import uuid
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = 'foo'
DATASET_ID = 'bar'
TABLE_ID = 'foo_bar_table'

class UpdateTableHandler(webapp2.RequestHandler):
    def get(self):
        credentials = GoogleCredentials.get_application_default()
        bigquery_service = discovery.build('bigquery', 'v2', credentials=credentials)
        try:
            the_fruits = Stuff.query(Stuff.fruitTotal >= 5).filter(Stuff.fruitColor == 'orange').fetch()
            for fruit in the_fruits:
                # some code here
                basket = dict()
                basket['id'] = fruit.fruitId
                basket['Total'] = fruit.fruitTotal
                basket['PrimaryVitamin'] = fruit.fruitVitamin
                basket['SafeRaw'] = fruit.fruitEdibleRaw
                basket['Color'] = fruit.fruitColor
                basket['Country'] = fruit.fruitCountry
                body = {
                    'rows': [
                        {
                            'json': basket,
                            'insertId': str(uuid.uuid4())
                        }
                    ]
                }
                # one streaming insert request per fruit
                response = bigquery_service.tabledata().insertAll(projectId=PROJECT_ID,
                                                                  datasetId=DATASET_ID,
                                                                  tableId=TABLE_ID,
                                                                  body=body).execute(num_retries=5)
                logging.info(response)
        except Exception as e:
            logging.error(e)

app = webapp2.WSGIApplication([
    ('/update_table', UpdateTableHandler),
], debug=True)
The only way to test whether the data already exists is to run a query.
If you have lots of data in the table, that query could be expensive, so in most cases we suggest you go ahead and insert the duplicate, and then merge duplicates later on.
As Zig Mandel suggests in a comment, you can query over a date partition if you know the date when you expect to see the record, but that may still be expensive compared to inserting and removing duplicates.
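To make the "insert now, merge duplicates later" idea concrete, here is a small plain-Python sketch of the merge step. It is not BigQuery code: in practice the merge would be a query run inside BigQuery. The function name merge_duplicates is hypothetical, and treating the 'id' field from the question's basket as the duplicate key is an assumption.

```python
def merge_duplicates(rows, key="id"):
    """Keep the first row seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value in seen:
            continue  # a row with this key was already kept
        seen.add(value)
        unique.append(row)
    return unique


# Hypothetical rows, including one duplicate produced by a repeated cron run.
rows = [
    {"id": 1, "Color": "orange", "Total": 7},
    {"id": 2, "Color": "orange", "Total": 9},
    {"id": 1, "Color": "orange", "Total": 7},
]
deduplicated = merge_duplicates(rows)
```

Which field or combination of fields identifies a duplicate is a decision about the data; a daily job might, for example, need the date as part of the key.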

Attribute routing not working with dictionaries

Being new to attribute routing, I'd like to ask for help getting this to work.
This test is a simple dynamic DB table viewer: Given a table name (or stored query name or whatever) and optionally some WHERE parameters, return query results.
Table COMPANIES (one of any number of tables which has an associated SELECT query stored somewhere, keyed by table name):
ID NAME HQ INDUSTRY
1 Apple USA Consumer electronics
2 Bose USA Low-quality, expensive audio equipment
3 Nokia FIN Mobile Phones
Controller:
[Route("view/{table}/{parameters}")]
public object Get(string table, Dictionary<string, string> parameters) {
    var sql = GetSql(table);
    var dbArgs = new DynamicParameters(parameters);
    return Database.Query(sql, dbArgs); // Return stuff/unrelated to problem
}
SQL stored in some resource or table. Obviously the parameters must match exactly:
SELECT * FROM companies
WHERE name = :name
-- OR hq = :hq
-- OR ...etc. Doesn't matter since it never gets this far.
Request (Should look clean, but the exact URL format isn't important):
www.website.com/view/companies?hq=fin --> 404: No matching controller
www.website.com/view/companies/hq=fin --> parameters is null
www.website.com/view/companies/hq=fin&name=nokia --> Exception: A potentially dangerous Request.Path value was detected from the client (&).
When I use: [Route("view/{table}{parameters}")] I get:
A path segment cannot contain two consecutive parameters. They must be separated by a '/' or by a literal string. Parameter name: routeTemplate. Makes sense.
My question is: How do I accept a table name and any number of unknown parameters in the usual key1=val1&key2=val2 form (not some awkward indexed format like the one mentioned here) which will be later bound to SQL parameters, preferably using a vanilla data structure rather than something like FormCollection.
I don't think that binding URL parameters to a Dictionary is built-in to the framework. I'm sure there's a way to extend it if you wanted to.
I think the quickest (but still acceptable) option is to get the query string parameters using Request.GetQueryNameValuePairs() like this:
[Route("view/{table}")]
public object Get(string table) {
    // Collect every key=value pair from the query string into a dictionary.
    Dictionary<string, string> parameters = Request.GetQueryNameValuePairs()
        .ToDictionary(x => x.Key, x => x.Value);
    var sql = GetSql(table);
    var dbArgs = new DynamicParameters(parameters);
    return Database.Query(sql, dbArgs); // Return stuff/unrelated to problem
}
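The shape of that solution (the table name comes from the path, the parameters from the query string) can be illustrated with a few lines of standard-library Python. This is not Web API code, and the function name split_view_url is hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def split_view_url(url):
    """Return (table, parameters) for a URL like /view/<table>?k1=v1&k2=v2."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    table = segments[-1]                      # last path segment is the table name
    parameters = dict(parse_qsl(parts.query)) # key1=val1&key2=val2 -> dictionary
    return table, parameters
```

For example, split_view_url("http://www.website.com/view/companies?hq=fin&name=nokia") yields the table name "companies" and a dictionary with the keys hq and name. As with ToDictionary in the C# version, a key repeated in the query string needs a deliberate decision; here the last value wins.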

Entity Framework, Table Per Type and Linq - Getting the "Type"

I have an Abstract type called Product, and five "Types" that inherit from Product in a table per type hierarchy fashion as below:
I want to get all of the information for all of the Products, including a smattering of properties from the different objects that inherit from products to project them into a new class for use in an MVC web page. My linq query is below:
//Return the required products
var model = from p in Product.Products
where p.archive == false && ((Prod_ID == 0) || (p.ID == Prod_ID))
select new SearchViewModel
{
ID = p.ID,
lend_name = p.Lender.lend_name,
pDes_rate = p.pDes_rate,
pDes_details = p.pDes_details,
pDes_totTerm = p.pDes_totTerm,
pDes_APR = p.pDes_APR,
pDes_revDesc = p.pDes_revDesc,
pMax_desc = p.pMax_desc,
dDipNeeded = p.dDipNeeded,
dAppNeeded = p.dAppNeeded,
CalcFields = new DAL.SearchCalcFields
{
pDes_type = p.pDes_type,
pDes_rate = p.pDes_rate,
pTFi_fixedRate = p.pTFi_fixedRate
}
}
The problem I have is accessing p.pTFi_fixedRate: it is not returned with the Products collection of entities because it belongs to the derived type Fixed. How do I return the properties of the types derived from Product (such as Fixed) using LINQ and the Entity Framework? I actually need to return some fields from all the different derived types (Disc, Track, etc.) for use in calculations. Should I run separate LINQ queries, checking the type of each Product that is returned?
This is a really good question. I've had a look in the Julie Lerman book and scouted around the internet and I can't see an elegant answer.
If it were me, I would create a data transfer object with all the properties of the types, then have a separate query for each type, and then union them all. I would insert blanks into the DTO properties where the properties aren't relevant to that type. Then I would hope that the EF engine makes a reasonable stab at creating decent SQL.
Example
var results = (from p in context.Products.OfType<Disc>()
               select new ProductDTO { basefield1 = p.val1, discField = p.val2, fixedField = "" })
    .Union(
        from p in context.Products.OfType<Fixed>()
        select new ProductDTO { basefield1 = p.val1, discField = "", fixedField = p.val2 });
But that can't be the best answer, can it? Are there any others?
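The projection idea (one flat DTO, with blanks for fields a given subtype doesn't have) can be sketched independently of EF. The Python below is only an illustration: the class and field names loosely follow the placeholders in the example above (val1, val2) and are otherwise assumptions.

```python
from dataclasses import dataclass


@dataclass
class Disc:
    val1: str
    val2: str


@dataclass
class Fixed:
    val1: str
    val2: str


@dataclass
class ProductDTO:
    basefield1: str
    disc_field: str = ""    # blank unless the product is a Disc
    fixed_field: str = ""   # blank unless the product is a Fixed


def to_dto(product):
    """Project one product of any known subtype into the common DTO."""
    if isinstance(product, Disc):
        return ProductDTO(basefield1=product.val1, disc_field=product.val2)
    if isinstance(product, Fixed):
        return ProductDTO(basefield1=product.val1, fixed_field=product.val2)
    raise TypeError("unsupported product type: %s" % type(product).__name__)


products = [Disc("Product A", "2.5"), Fixed("Product B", "3.1")]
dtos = [to_dto(p) for p in products]
```

Each additional subtype adds one branch and one optional field, which is the same bookkeeping the unioned per-type queries require.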
So Fixed is inherited from Product? If so, you should probably be querying for Fixed instead, and the Product properties will be pulled into it.
If you are just doing calculations and getting some totals or something, you might want to look at using a stored procedure. It will amount to fewer database calls and allow for much faster execution.
Well it depends on your model, but usually you need to do something like:
var model = from p in Product.Products.Include("SomeNavProperty")
.... (rest of query)
Where SomeNavProperty is the entity type that loads pTFi_fixedRate.

How can i import data to a table from another table

I want to make a movie database application. I have a table of movies (title, genre, director, etc.) and a table of directors. When I create a new movie entry, I want a drop-down list for choosing from the existing directors.
Basically you will need two queries: one for retrieving your directors and one for inserting a movie. Since you are using an ORM (in your case Entity Framework), you will not have to write the queries by hand, but you will have to use its API to do the job.
Here is an example:
MovieModel model = new MovieModel();
List<Directors> directors = model.Directors.ToList();
Movie movie = new Movie();
movie.Name = "Sample";
movie.Id = 1;
movie.Director = directors.First(x => x.Id == 1);
model.AddObject("Movies", movie); // "Movies" is an assumed entity set name
model.SaveChanges();
In that example the MovieModel is your Entity Framework context and the list of directors is the collection you need to bind to your combo box.
Note that directors.First(x => x.Id == 1) needs to be replaced with combobox.SelectedItem in your code.
Hope that helps.

Resources