Linq to objects very slow compared to dataview

Linq to objects very slow compared to dataview - performance

I'm a big fan of Linq for typeing, clarity and brevity. But I'm finding it very slow to search for matching records compared to the old dataview by a factor of some 2000 times!
I am writing an app to backup large sets of files - 500,000 files and 500 gb of data. I have created a manifest of files in the backup set and compare the files in the directory with those in the manifest documenting what's been backed up already. This way I know which files have changed and so need to be copied.
The slow step is this one:
var matchingMEs = from m in manifest
where m.FullName == fi.FullName
select m;
where manifest = List<ManifestEntry> and ManifestEntry is a relatively simple POCO.
Overall performance is 17-18 records per second.
When I use a dataview:
DataView vueManifest = new DataView(dt, "", "FullName", DataViewRowState.CurrentRows);
then in the loop find the matching manifest entries with a .FindRows:
matchingMEs = vueManifest.FindRows(fi.FullName);
... then I'm getting some 35,000 files per second throughput!
Is this normal? I can't believe that Linq comes at such a price. Is it the Linq or the objects that slow things down?
(btw, I tried using a Dictionary and a SortedList as well as the List<ManifestEntries> and they all gave about the same result.)

Your DataView is sorting by fullname and hence FindRows can jump straight to the correct record(s), whereas your linq query has to iterate through list until it reaches the correct record(s).
This will definitely be noticeable if you have 500,000 entries.
Assuming fullname is unique, then when you switched to using a dictionary, I would suspect you are still iterating through it using a similar linq query, something like
var matchingME = (from m in manifest where m.Key == fi.FullName select m).Single();
whereas you should be using
var matchingME = manifest[fi.FullName] ;

Related

Need a mechanism for solve "OutOfMemoryException" for handle big Millions of IEnumerable<data> using Linq Group by with order each inner group list

I faced OutOfMemoryException for handle millions Of IEnumerable data collections.
Ex) (not runs because format only)
Var list = new List(…) // (Count=3 million)
Var result = list.groupby(g=> g.key)
.select(s=>
new CustomClass(
s.OrderBy(o1 => o1.date)
.ThenBy(o2 => o2.skey)));
I done various operations using this result.
Ex create to dictionary and added to collectionview then filter finally filtered data’s only display to the data grid with pagination facility(in wpf).
One time screen display occupy more than 1 GB. We runs in x86 architecture.
.net framework 4.6.2
With any second operation ex. Next search with read from db. “OutOfMemoryException” will occur.
Could you please provide me any idea for solve this.

AppleScript Fastest Way to Get Selected Rows From a Large Table?

I have a large table with about 1000 rows. If I try to get selected rows from the table, it takes more than 30 seconds. Is there way to speed up the process? I use JXA.
selected = table.rows.whose({selected:true})()
names = ""
for (r in selected) {
names+=", "+selected[r].uiElements[1].name()
}
console.log(names)
Is there a faster way?
Thanks!

Apparently you can just narrow down using specifier without going through each element to get names for them.
names = table.rows.whose({selected:true}).uiElements[1].name()
This was much faster for my case.

Data Entity Framework and LINQ - Get giant data set and execute a command one at a time on each object

We are pulling in a giant dataset of records (in the 100's of thousands) and then need to update a field on each one, one at a time in an atomic transation. They records are unrelated to each other and we don't want to do a blind update to all couple hundred thousand (there are views and indexes on this table that make that very prohibitive). The ONLY way that I could get this to work without doing a giant transation was as follows (container is a reference to a custom ObjectContext):
var expiredWorkflows = from iw in container.InitiatedWorkflows
where iw.InitiationStatusID != 1 && iw.ExpirationDate < DateTime.Now
select iw.ID;
foreach (int expiredWorkflow in expiredWorkflows)
container.ExecuteStoreCommand("UPDATE dbo.InitiatedWorkflow SET InitiationStatusID = 7 WHERE ID = #ID", new SqlParameter() { ParameterName = "#ID", Value = expiredWorkflow.ToString() } );
We tried looping through each one and just updating the field via the container and then calling SaveChanges(), but that runs everything as one transaction. We tried calling SaveChanges() in the foreach loop, but that threw transaction exceptions. Is there any way to what we are trying to do using the ObjectContext, so it would do something like (the above select would be changed to return the full object, not just the ID):
foreach (var expiredWorkflow in expiredWorkflows)
expiredWorkflow.InitiationStatusID = 7
container.SaveChanges(SaveOptions.OneAtATime);

Speaking generally, if the operation you need to carry out is as simple as the sort of UPDATE your code above suggests, this is the sort of operation that will run far better on the back end database--assuming, of course, there's some clear way to select only the rows that need to be changed. Entity Framework is intended more for manipulating small to medium sets of objects that can easily be loaded into memory and twiddled there, not large bulk-processing operations for which stored procedures are often best. EF can certainly perform those big operations, but it will take a lot longer to execute one SQL statement per row.

Improve SQL Server 2005 Query Performance

I have a course search engine and when I try to do a search, it takes too long to show search results. You can try to do a search here
http://76.12.87.164/cpd/testperformance.cfm
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
Thanks

You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender" - with only "Male" and "Female" the possible values, there is no point to an index here). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant number (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are search for x LIKE '%something' then there is no point for an index. If you think of an index as specifying a particular order for rows, then sorting by x if you're searching for "%something" is useless: you're going to have to scan all rows anyway.
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
SELECT
*
FROM (
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY sq.name) AS Row,
sq.*
FROM (
SELECT
c.*,
p.providername,
p.school,
p.website,
p.type
FROM
cpd_COURSES c, cpd_PROVIDERS p
WHERE
c.providerid = p.providerid AND
c.activatedYN = 'Y' AND
(
c.name like '%accounting%' OR
c.title like '%accounting%' OR
c.keywords like '%accounting%'
)
) sq
) AS temp
WHERE
Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid in which case you don't need an index, because it'll already have one.
Additionally, the activatedYN column is a T/F column and (according to my rule above about restricting the possible values by only 50%) a T/F column should not be indexed, either.
Finally, because searching with a x LIKE '%accounting%' query, you don't need an index on name, title or keywords either - because it would never be used.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid.
SQL Server Specific
Because you're using SQL Server, the Management Studio has a number of tools to help you decide where you need to put indexes. If you use the "Index Tuning Wizard" it is actually usually pretty good at tell you what will give you the good performance improvements. You just cut'n'paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.

Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns "courseid" and "keyword" (varchar(24) should be sufficient for the longest keyword?), with a composite clustered index on courseid+keyword
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...

Using CF9? Try using Solr full text search instead of %xxx%?

You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed yellow pages - if you want to look up a person by their last name, the phonebook is already sorted in that way - Last Name is the clustered index field. If you wanted to find phone numbers for people named Jennifer or the person with the phone number 867-5309, you'd have to search through every entry and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order along with the page in the phonebook that the person is listed, it would be a lot faster. These would be the unclustered indexes.

I would try changing your IN statements to an EXISTS query to see if you get better performance on the Zip code lookup. My experience is that IN statements work great for small lists but the larger they get, you get better performance out of EXISTS as the query engine will stop searching for a specific value the first instance it runs into.
<CFIF zipcodes is not "">
EXISTS (
SELECT zipcode
FROM cpd_CODES_ZIPCODES
WHERE zipcode = p.zipcode
AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
(COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
)
</CFIF>

Using LINQ to query flat text files with fixed-length records?

I've got a file filled with records like this:
NCNSCF1124557200811UPPY19871230
The codes are all fixed-length, and some of them link to other flat files (sort of like a relational database). What's the best way of querying this data using LINQ?
This is what I came up with intuitively, but I was wondering if there's a more elegant way:
var records = File.ReadAllLines("data.txt");
var table = from record in records
select new { FirstCode = record.Substring(0, 2),
OtherCode = record.Substring(18, 4) };

For one thing I wouldn't read it all into memory to start with. It's very easy to write a LineReader class which iterates over a file a line at a time. I've got a version in MiscUtil which you can use.
Unless you only want to read the results once, however, you might want to call ToList() at the end to avoid reading the file multiple times. (This is still nicer than reading all the lines and keeping that in memory - you only want to do the splitting once.)
Once you've basically got in-memory collections of all the tables, you can use normal LINQ to Objects to join them together etc. You might want to go to a more sophisticated data model to get indexes though.

I don't think there's a better way out of the box.
One could define a Flat-File Linq Provider which could make the whole thing much simpler, but as far as I know, no one has yet.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio