Finding strings that are not in DB already - performance

I have some bad performance issues in my application. One of the big operations is comparing strings.
I download a list of strings, approximately 1000 - 10000. These are all unique strings.
Then I need to check if these strings already exist in the database.
The LINQ query I'm using looks like this:
IEnumerable<string> allNewStrings = DownloadAllStrings();

var selection = from a in allNewStrings
                where !(from o in context.Items
                        select o.TheUniqueString).Contains(a)
                select a;
Am I doing something wrong, or how could I make this process faster, preferably with LINQ?
Thanks.

You are running the inner query over the unique strings once for every element in allNewStrings (1000 - 10000 times), so it's extremely inefficient.
Query the unique strings separately so that it is executed only once:
IEnumerable<string> allNewStrings = DownloadAllStrings();

var uniqueStrings = from o in context.Items
                    select o.TheUniqueString;

var selection = from a in allNewStrings
                where !uniqueStrings.Contains(a)
                select a;
Now you can see that the last query can be written with Except, which is more efficient for set operations like this one:
var selection = allNewStrings.Except(uniqueStrings);
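Since allNewStrings is an IEnumerable<string>, this Except runs in memory via LINQ to Objects: the strings from uniqueStrings are enumerated once into a set, and the downloaded strings are streamed against it. The query above is lazy, so a small sketch of materializing it:
// Runs the DB query once and computes the difference in memory.
var newOnly = selection.ToList();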

An alternative solution would be to use a HashSet:
var set = new HashSet<string>(DownloadAllStrings());
set.ExceptWith(context.Items.Select(s => s.TheUniqueString));
The set will now contain the strings that are not in the DB.

Related

Search an array of strings in a large string and check if any exist using LINQ

I have an array of strings:
var searchString = new string[] {"1:PS", "2:PS"};
and a large result string, e.g.
var largeString = "D9876646|10|1:PS^CD9876647100|11|2:PS"
How do I check if any of the options in searchString exist in largeString?
I know it can be done quite easily with a loop, but I am looking for another way, since I need to append this as a search clause in a LINQ query.
You can use LINQ for it with a simple Any() call, like this:
var hasAny = searchString.Any(sub => largeString.Contains(sub));
However, this is as slow as a foreach loop. You can find the answer faster with a regex constructed from searchString:
var regex = string.Join("|", searchString.Select(Regex.Escape));
var hasAny = Regex.IsMatch(largeString, regex);
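If the same keyword set is tested against many strings, one option (a sketch, assuming System.Text.RegularExpressions) is to build the Regex once and reuse it:
using System.Linq;
using System.Text.RegularExpressions;

// Build the pattern once; Regex.Escape guards against keywords that contain regex metacharacters.
var pattern = string.Join("|", searchString.Select(Regex.Escape));
var matcher = new Regex(pattern, RegexOptions.Compiled);

// Reuse the compiled matcher for every string you need to check.
bool hasAny = matcher.IsMatch(largeString);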
Depending on the nature of your LINQ provider (assuming it isn't LINQ to Objects), you may want to add individual tests for each member of searchString. The best way to do this is probably with PredicateBuilder:
// dbType is a placeholder for your entity type; q is the IQueryable you are filtering.
var sq = PredicateBuilder.New<dbType>();
foreach (var s in searchString)
    sq = sq.Or(r => r.largeString.Contains(s));
q = q.Where(sq);

How to optimize this LINQ to Objects query?

I'm matching some in-memory lists of entities with a .Contains (subselect) query to filter old users out of the new ones.
Checking for performance problems, I saw this:
The oldList mostly has around 1000 users in it, while the new list varies from 100 to 500. Is there a way to optimize this query?
Absolutely - build a set instead of checking a list each time:
// Change string to whatever the type of UserID is.
var oldUserSet = new HashSet<string>(oldList.Select(o => o.UserID));
var newUsers = NewList.Where(n => !oldUserSet.Contains(n.UserID))
                      .ToList();
The containment check on a HashSet should be O(1), assuming few hash collisions, instead of the O(N) cost of checking each new user against the whole sequence.
You could make a HashSet<T> of your user IDs in advance. This will cause the Contains to become an O(1) operation:
var oldSet = new HashSet<int>(oldList.Select(o => o.UserID));
var newUsers = NewList.Where(n => !oldSet.Contains(n.UserID)).ToList();
While those HashSet<T> answers are straightforward and simple, some may prefer a LINQ-centric solution.
LINQ to Objects implements Join and GroupJoin with a hash-based lookup. Just use one of those - this example uses GroupJoin:
List<User> newUsers =
(
    from n in NewList
    join o in oldList on n.UserId equals o.UserId into oldGroup
    where !oldGroup.Any()
    select n
).ToList();
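The same anti-join can be written in extension-method syntax if you prefer it; a sketch using the names from the query above:
// GroupJoin pairs each new user with its (possibly empty) group of matching old users.
List<User> newUsers = NewList
    .GroupJoin(oldList,
               n => n.UserId,
               o => o.UserId,
               (n, olds) => new { NewUser = n, Matches = olds })
    .Where(x => !x.Matches.Any())   // keep only users with no match in oldList
    .Select(x => x.NewUser)
    .ToList();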

EF - LINQ expression and using a List of ints to get the best performance

So I have a list (table) of about 100k items and I want to retrieve all values that match a given list.
I have something like this.
The Sections table's key is NOT a primary key, so I'm expecting each value in listOfKeys to return a few rows.
List<int> listOfKeys = new List<int>(){1,3,44};
var allSections = Sections.Where(s => listOfKeys.Contains(s.id));
I don't know if it makes a difference but generally listOfKeys will only have between 1 to 3 items.
I'm using the Entity Framework.
So my question is: is this the best / fastest way to include a list in a LINQ expression?
I'm assuming that it isn't better to use another .NET ICollection data object. Should I be using a Union or something?
Thanks
As long as listOfKeys contains only a small number of items (say fewer than 50) and it is a local list (not from the database), this is fine. The generated query will basically be WHERE id IN (...) or WHERE id = ... OR id = ..., and that's easy for the database engine to handle.
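To illustrate the point (a sketch; the exact SQL shape depends on the EF version):
var listOfKeys = new List<int> { 1, 3, 44 };

// With a small local list, EF translates Contains into an IN clause,
// roughly: SELECT ... FROM Sections WHERE id IN (1, 3, 44)
var allSections = Sections.Where(s => listOfKeys.Contains(s.id)).ToList();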
A Join would probably be more efficient:
var allSections = from s in Sections
                  join k in listOfKeys on s.id equals k
                  select s;
Or, if you prefer the extension method syntax:
var allSections = Sections.Join(listOfKeys, s => s.id, k => k, (s, k) => s);

Longish LINQ query breaks the SQLite parser - simplify?

I'm programming a search for a SQLite database using C# and LINQ.
The idea of the search is that you can provide one or more keywords, any of which must be contained in any of several column entries for that row to be added to the results.
The implementation consists of several LINQ queries which are all combined with Union. The more keywords and columns that have to be considered, the more complicated the resulting query becomes. This can lead to SQL code that is too long for the SQLite parser.
Here is some sample code to illustrate:
IQueryable<Reference> query = null;

if (searchAuthor)
    foreach (string w in words)
    {
        string word = w;
        var result = from r in _dbConnection.GetTable<Reference>()
                     where r.ReferenceAuthor.Any(a => a.Person.LastName.Contains(word)
                                                   || a.Person.FirstName.Contains(word))
                     orderby r.Title
                     select r;
        query = query == null ? result : query.Union(result);
    }

if (searchTitle)
    foreach (string word in words)
    {
        var result = from r in _dbConnection.GetTable<Reference>()
                     where r.Title.Contains(word)
                     orderby r.Title
                     select r;
        query = query == null ? result : query.Union(result);
    }
//...
Is there a way to structure the query in a way that results in more compact SQL?
I tried to force the creation of smaller SQL statements by calling GetEnumerator() on the query after every loop. But apparently Union() doesn't operate on the data, but on the underlying LINQ/SQL statement, so I was still generating statements that were too long.
The only solution I can think of right now is to actually retrieve the data after every "sub-query" and do the union on the data itself rather than in the statement. Any ideas?
For something like that, you might want to use a PredicateBuilder, as shown in the chosen answer to this question.
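A rough sketch of that approach, assuming LINQKit's PredicateBuilder and reusing the names from the question (Reference, words, searchAuthor, searchTitle); all keyword tests are OR-ed into a single WHERE clause instead of unioning separate queries:
// Sketch only: assumes LINQKit (PredicateBuilder, AsExpandable) works with your provider.
var predicate = PredicateBuilder.New<Reference>();

foreach (string w in words)
{
    string word = w;   // avoid capturing the loop variable
    if (searchAuthor)
        predicate = predicate.Or(r => r.ReferenceAuthor.Any(
            a => a.Person.LastName.Contains(word) || a.Person.FirstName.Contains(word)));
    if (searchTitle)
        predicate = predicate.Or(r => r.Title.Contains(word));
}

var query = _dbConnection.GetTable<Reference>()
                         .AsExpandable()          // lets the provider translate the composed predicate
                         .Where(predicate)
                         .OrderBy(r => r.Title);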

LINQ subquery question

Can anybody tell me how I would get the records in the first statement that are not in the second statement (see below)?
from or in TblOrganisations
where or.OrgType == 2
select or.PkOrgID
Second query:
from o in TblOrganisations
join m in LuMetricSites
on o.PkOrgID equals m.FkSiteID
orderby m.SiteOrder
select o.PkOrgID
If you only need the IDs then Except should do the trick:
var inFirstButNotInSecond = first.Except(second);
Note that Except treats the two sequences as sets. This means that any duplicate elements in first won't be included in the results. I suspect that this won't be a problem since the name PkOrgID suggests a unique ID of some kind.
(See the documentation for Enumerable.Except and Queryable.Except for more info.)
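A tiny LINQ to Objects illustration of those set semantics:
var first  = new[] { 1, 1, 2, 3 };
var second = new[] { 2 };

// Except also removes duplicates within first: the result is 1, 3 (the second 1 is dropped).
var result = first.Except(second);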
Do you need the whole records, or just the IDs? The IDs are easy...
var ids = firstQuery.Except(secondQuery);
EDIT: Okay, if you can't do that, you'll need something like:
var secondQuery = ...; // As you've already got it
var query = from or in TblOrganisations
            where or.OrgType == 2
            where !secondQuery.Contains(or.PkOrgID)
            select ...;
Check the SQL it produces, but I think it should do the right thing. Note that there's no point in performing any ordering in the second query - or even the join against TblOrganisations. In other words, you could use:
var query = from or in TblOrganisations
            where or.OrgType == 2
            where !LuMetricSites.Select(m => m.FkSiteID).Contains(or.PkOrgID)
            select ...;
Use Except:
var filtered = first.Except(second);
