Linq query increase performance efficient query - linq

int mappedCount = (from product in products
from productMapping in DbContext.ProductCategoryMappings
.Where(x => product.TenantId == x.TenantId.ToString() &&
x.ProductId.ToString().ToUpper() == product.ProductGuid.ToUpper())
join tenantCustMapping in DbContext.TenantCustCategories
on productMapping.Value equals tenantCustMapping.Id
select 1).ToList().Sum();
I need to improve the performance of this query.
When mapping products, two tables are involved, and each item can have multiple products.

If you want to increase performance, you need to know the volumes of data being sent around. How many "products" are in your variable?
It may be quicker to convert your products list to integers / GUIDs on the client and send that to your database, rather than sending strings that the database has to run ToUpper() on before comparing them.
Something like:
var convertedList = products.Select(product => new { TenantId = int.Parse(product.TenantId), ProductId = Guid.Parse(product.ProductGuid) }).ToList();
Then send that to your database and compare the values directly.
I think changing "Select 1).ToList().Sum()" to ".Count()" will improve performance. Even if not, it'll help readability.
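A minimal sketch of both suggestions together, run against in-memory lists rather than a real DbContext. The record types and sample data are hypothetical stand-ins for the question's entities, and the column types (int tenant id, Guid product id) are an assumption; whether a given provider translates the same join against a local list is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Hypothetical shapes standing in for the question's entities.
    record Product(string TenantId, string ProductGuid);
    record ProductCategoryMapping(int TenantId, Guid ProductId, int Value);
    record TenantCustCategory(int Id);

    static void Main()
    {
        var guid = Guid.NewGuid();
        var products = new List<Product> { new("7", guid.ToString().ToUpper()) };
        var mappings = new List<ProductCategoryMapping>
        {
            new(7, guid, 1), new(7, guid, 2), new(8, guid, 1)
        };
        var categories = new List<TenantCustCategory> { new(1), new(2) };

        // Parse once on the client so the comparison runs on typed keys,
        // with no ToString()/ToUpper() per row.
        var keys = products
            .Select(p => new { TenantId = int.Parse(p.TenantId), ProductId = Guid.Parse(p.ProductGuid) })
            .ToList();

        // Count() replaces "select 1 ... ToList().Sum()".
        int mappedCount = (from key in keys
                           join m in mappings
                               on new { key.TenantId, key.ProductId }
                               equals new { m.TenantId, m.ProductId }
                           join c in categories on m.Value equals c.Id
                           select m).Count();

        Console.WriteLine(mappedCount); // 2
    }
}
```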

Related

How to design querying multiple tags on analytics database

I would like to store custom tags for each user purchase transaction. For example, if a user bought shoes, the tags might be "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", ...
These are the tags that the seller is interested in querying later to understand the sales.
My idea is that whenever a new tag comes in, I create a new code for it (something like a hash code, but sequential). Codes start with the 26 letters "a-z" and then continue with "aa, ab, ac ... zz" and so on. All the tags given for one transaction are then kept in a single column called tag (varchar), separated by "|".
Let us assume mapping is (at application level)
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
So when storing the above purchase transaction, the tag column will look like tag="|a|z|ay|bc|cq|". The seller can then search for the number of SHOES sold by adding the WHERE condition tag LIKE '%|ay|%'. The problem is that I cannot use an index (sort key in Redshift) for a LIKE pattern that starts with %. How can I solve this, given that I might have 100 million records? I don't want a full table scan.
Is there any solution for this?
Update_1:
I have not followed the bridge table concept (cross-reference table) because I want to perform a GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, but won't a bridge table give me two rows? Then my sum() would be doubled.
I got a suggestion like the one below:
EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' AND trans_id = tr.trans_id)
in the WHERE clause, once for each tag (note: this assumes tr is an alias for the transaction table in the surrounding query).
I have not followed this either, since I have to combine AND and OR conditions on the tags, for example ("SPORTS" AND "ADIDAS") or "SHOE" AND ("NIKE" OR "ADIDAS").
Update_2:
I have not followed the bitfield approach, since I don't know whether Redshift supports it. Also, I assume my system will have a minimum of 3500 tags; allocating one bit for each results in about 438 bytes for each transaction, even though a transaction can have at most 5 tags. Is there any optimisation here?
Solution_1:
I have thought of adding a min value (SMALLINT) and a max value (SMALLINT) alongside the tag column, and applying an index on those.
So something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching shoes (ay=51) is
minTag <= 51 AND maxTag >= 51 AND tag LIKE '%|ay|%'
And the query for searching shoes (ay=51) AND SIZE_12 (cq=95) is
minTag <= 51 AND maxTag >= 95 AND tag LIKE '%|ay|%|cq|%'
Will this give any benefit? Kindly suggest any alternatives.
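For reference, the sequential code scheme described above (a..z, then aa, ab, ...) is bijective base 26. A minimal C# sketch, with hypothetical helper names ToTagCode and FromTagCode, that reproduces the numbers used in Solution_1:

```csharp
using System;
using System.Text;

class Program
{
    // Hypothetical helper: 1 -> "a", 26 -> "z", 27 -> "aa", ...
    static string ToTagCode(int number)
    {
        if (number < 1) throw new ArgumentOutOfRangeException(nameof(number));
        var code = new StringBuilder();
        while (number > 0)
        {
            number--;                                   // shift to 0-based digit
            code.Insert(0, (char)('a' + number % 26));  // prepend the lowest digit
            number /= 26;
        }
        return code.ToString();
    }

    // Inverse mapping: "a" -> 1, "aa" -> 27, ...
    static int FromTagCode(string code)
    {
        int number = 0;
        foreach (char c in code)
            number = number * 26 + (c - 'a' + 1);
        return number;
    }

    static void Main()
    {
        Console.WriteLine(ToTagCode(26));    // z
        Console.WriteLine(ToTagCode(27));    // aa
        Console.WriteLine(FromTagCode("ay")); // 51
        Console.WriteLine(FromTagCode("cq")); // 95
    }
}
```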
You can implement auto-tagging while the files are loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag each object using the AWS s3api CLI.
Example below:
aws s3api put-object-tagging --bucket --key --tagging "TagSet=[{Key=Addidas,Value=AY}]"
capture tags dynamically by sending and as a parameter
2. Load the tags into DynamoDB as a metadata store.
3. Load the data into Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, so that if a purchase is marked by a tag there is a 1 at that tag's position and if not there is a 0. For every row, you will have a sequence of 0s and 1s with the same length as the number of tags you have. This sequence is sortable. You would still need to look into the middle of it, but you know the specific position to look at, so you don't need LIKE, just SUBSTRING. For further optimization, you can convert this bit mask to integer values (unique for each sequence) and match on those, but AFAIK Redshift doesn't support that out of the box yet, so you would have to define the rules yourself.
UPD: It looks like the best option here is to keep tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of order_id, tag_id, distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table with the order table. Lookups for orders with a particular tag, and further aggregations over those orders, should then be efficient. There is no silver bullet for optimizing this in a flat table; at least I don't know of one that would not bring a lot of unnecessary complexity compared with the "relational" solution.
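The double-counting concern from Update_1 comes from joining to the tag table; filtering with an existence test per tag keeps one row per transaction. A minimal in-memory sketch of that idea in C#, where the record types, tag strings and amounts are hypothetical and the lookup plays the role of the separate order_id / tag_id table:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Hypothetical shapes: one row per transaction, one row per (transaction, tag).
    record Transaction(int TransId, decimal Amount);
    record TransactionTag(int TransId, string Tag);

    static void Main()
    {
        var transactions = new List<Transaction> { new(1, 100m), new(2, 80m), new(3, 50m) };
        var transactionTags = new List<TransactionTag>
        {
            new(1, "SHOES"), new(1, "NIKE"), new(1, "SPORTS"),
            new(2, "SHOES"), new(2, "ADIDAS"),
            new(3, "SHOES"), new(3, "WOODLAND"),
        };

        // Tags grouped by transaction id, standing in for the bridge table.
        var tagsByTrans = transactionTags.ToLookup(t => t.TransId, t => t.Tag);

        // "SHOES" AND ("NIKE" OR "ADIDAS"): each test is an existence check,
        // so a transaction is counted once no matter how many tags match.
        decimal total = transactions
            .Where(tr => tagsByTrans[tr.TransId].Contains("SHOES")
                      && (tagsByTrans[tr.TransId].Contains("NIKE")
                          || tagsByTrans[tr.TransId].Contains("ADIDAS")))
            .Sum(tr => tr.Amount);

        Console.WriteLine(total); // 180
    }
}
```

In SQL the same shape would be one EXISTS subquery per tag, combined with AND / OR in the WHERE clause.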

Linq Query Where Contains

I'm attempting to make a LINQ "where contains" query quicker.
The data set contains 256,999 clients. The ids variable is just a simple list of GUIDs, and it could contain as few as 3 records.
The query below can take up to a minute to return the 3 records. This is because the logic goes through all 256,999 records to see whether each one is in the list of 3.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like the query to check whether the three records are within the pool of 256,999 instead, which should be much quicker.
I don't want to loop, as the 3 records could be far more (thousands). The more loops, the more hits to the DB.
I don't want to grab all the DB records (256,999) and then do the query, as it would take nearly the same amount of time.
If I grab just the Ids for all 256,999 records from the DB, it takes a second. This is where the Ids come from (a filtered, small and simple list).
Any ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
The reason the Id query is quicker is that only one field is returned and it's a single-table query.
The main query contains joins and aggregates (below). So I get the Ids from a quick and easy query, then use the Ids to get the more detailed information.
SELECT Clients.Id as ClientId, Clients.ClientRef as ClientRef, Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname as FullName,
[Address1] ,[Address2],[Address3],[Town],[County],[Postcode],
Clients.Consent AS Consent,
CONVERT(nvarchar(10), Clients.Dob, 103) as FormatedDOB,
CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END As Gender,
Convert(nvarchar(10), Max(Assessments.TestDate),103) as LastVisit,
CASE WHEN Max(Convert(integer,Assessments.Submitted)) = 1 Then 'true' ELSE 'false' END AS Submitted,
CASE WHEN Max(Convert(integer,Assessments.GPSubmit)) = 1 Then 'true' ELSE 'false' END AS GPSubmit,
CASE WHEN Max(Convert(integer,Assessments.QualForPay)) = 1 Then 'true' ELSE 'false' END AS QualForPay,
Clients.UserIds AS LinkedUsers
FROM Clients
Left JOIN Assessments ON Clients.Id = Assessments.ClientId
Left JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname, [Address1] ,[Address2],[Address3],[Town],[County],[Postcode],Clients.Consent, Clients.Dob, Clients.IsMale,Clients.UserIds --,Layouts.LayoutName, Layouts.SubmissionProcess
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains part, as the pool of Ids is smaller than the main pool.
The way I've sped it up for now: I do a String.Join on the list of Ids and add them as a WHERE clause within the main SQL. This has reduced the time to a second or so.
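A hedged variation on the String.Join workaround: instead of concatenating the GUID values into the SQL text, generate numbered placeholders and pass the values as arguments. LINQ to SQL's DataContext.ExecuteQuery accepts {0}-style placeholders with a params object[] argument list; the helper name BuildInPredicate is hypothetical, and the sketch only builds the predicate and arguments without touching a database. SQL Server caps the number of parameters per command (around 2100), so very large id lists would need batching or another approach.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Hypothetical helper: builds "column IN ({0}, {1}, ...)" plus the matching arguments.
    static (string Predicate, object[] Args) BuildInPredicate(string column, IReadOnlyList<Guid> ids)
    {
        if (ids.Count == 0)
            return ("1 = 0", Array.Empty<object>()); // an empty list matches nothing

        var placeholders = string.Join(", ",
            Enumerable.Range(0, ids.Count).Select(i => "{" + i + "}"));
        return ($"{column} IN ({placeholders})", ids.Cast<object>().ToArray());
    }

    static void Main()
    {
        var ids = new List<Guid> { Guid.NewGuid(), Guid.NewGuid(), Guid.NewGuid() };
        var (predicate, args) = BuildInPredicate("Clients.Id", ids);

        Console.WriteLine(predicate);   // Clients.Id IN ({0}, {1}, {2})
        Console.WriteLine(args.Length); // 3
    }
}
```

In the main query, the predicate would go in a WHERE clause placed before the GROUP BY, and the arguments would be passed as the second parameter of ExecuteQuery.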

Linq query returns duplicate results when .Distinct() isn't used - why?

When I use the following Linq query in LinqPad I get 25 results returned:
var result = (from l in LandlordPreferences
where l.Name == "Wants Student" && l.IsSelected == true
join t in Tenants on l.IsSelected equals t.IsStudent
select new { Tenant = t});
result.Dump();
When I add .Distinct() to the end I only get 5 results returned, so, I'm guessing I'm getting 5 instances of each result when the above is used.
I'm new to Linq, so I'm wondering if this is because of a poorly built query? Or is this the way Linq always behaves? Surely not - if I returned 500 rows with .Distinct(), does that mean without it there's 2,500 returned? Would this compromise performance?
It's a poorly built query.
You are joining LandlordPreferences with Tenants on a boolean value instead of a foreign key.
So, most likely, you have 5 selected landlord preferences and 5 tenants that are students. Each student will be returned for each landlord preference: 5 x 5 = 25. This is a cartesian product and has nothing to do with LINQ. A similar query in SQL would behave the same.
If you added the landlord preference to your result (select new { Tenant = t, Landlord = l }), you would see that no two results are actually the same.
If you can't fix the query somehow, Distinct is your only option.
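The 5 x 5 effect is easy to reproduce with in-memory data. A minimal sketch, assuming five matching preference rows and five student tenants (the record types and counts are illustrative, not taken from the asker's database):

```csharp
using System;
using System.Linq;

class Program
{
    // Hypothetical shapes for the two tables in the question.
    record LandlordPreference(int LandlordId, string Name, bool IsSelected);
    record Tenant(int Id, bool IsStudent);

    static void Main()
    {
        var prefs = Enumerable.Range(1, 5)
            .Select(i => new LandlordPreference(i, "Wants Student", true))
            .ToList();
        var tenants = Enumerable.Range(1, 5)
            .Select(i => new Tenant(i, true))
            .ToList();

        // Joining on a boolean pairs every selected preference with every student.
        var result = from l in prefs
                     where l.Name == "Wants Student" && l.IsSelected
                     join t in tenants on l.IsSelected equals t.IsStudent
                     select new { Tenant = t };

        Console.WriteLine(result.Count());            // 25
        Console.WriteLine(result.Distinct().Count()); // 5
    }
}
```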

Using Linq to bring back last 3,4...n orders for every customer

I have a database with customers orders.
I want to use Linq (to EF) to query the db to bring back the last(most recent) 3,4...n orders for every customer.
Note:
Customer 1 may have just made 12 orders in the last hr; but customer 2 may not have made any since last week.
I can't for the life of me work out how to write the query in LINQ (lambda expressions) to get the data set back.
Any good ideas?
Edit:
Customers and orders is a simplification. The table I am querying is actually a record of outbound messages to various web services. It just seemed easier to describe as customers and orders. The relationship is the same.
I am building a task that checks the last n messages for each web service to see if there were any failures. We want a semi-real-time health status of the web services.
@CoreySunwold
My table Looks a bit like this:
MessageID, WebserviceID, SentTime, Status, Message, Error,
Or from a customer/order context if it makes it easier:
OrderID, CustomerID, StatusChangedDate, Status, WidgetName, Comments
Edit 2:
I eventually worked out something
(Hat tip to @StephenChung, who basically came up with the exact same thing, but in classic LINQ query syntax)
var q = myTable.Where(d => d.EndTime > DateTime.Now.AddDays(-1))
.GroupBy(g => g.ConfigID)
.Select(g =>new
{
ConfigID = g.Key,
Data = g.OrderByDescending(d => d.EndTime)
.Take(3).Select(s => new
{
s.Status,
s.SentTime
})
}).ToList();
It does take a while to execute. So I am not sure if this is the most efficient expression.
This should give the last 3 orders of each customer (if having orders at all):
from o in db.Orders
group o by o.CustomerID into g
select new {
CustomerID=g.Key,
LastOrders=g.OrderByDescending(o => o.TimeEntered).Take(3).ToList()
}
However, I suspect this will force the database to return the entire Orders table before picking out the last 3 for each customer. Check the SQL generated.
If you need to optimize, you'll have to write the SQL by hand so that it only returns up to the last 3 per customer, then make it into a view.
You can use SelectMany for this purpose:
customers.SelectMany(x=>x.orders.OrderByDescending(y=>y.Date).Take(n)).ToList();
How about this? I know it'll work with regular collections but don't know about EF.
yourCollection.OrderByDescending(item=>item.Date).Take(n);
var ordersByCustomer =
db.Customers.Select(c=>c.Orders.OrderByDescending(o=>o.OrderID).Take(n));
This will return the orders grouped by customer.
var lastOrders = orders.Where(x => x.CustomerID == 1).OrderByDescending(x => x.Date).Take(4);
This will take the last 4 orders for a single customer. The specific query depends on your table / entity structure.
Btw: you can read x as an order. So it reads like: get the orders where order.CustomerID is equal to 1, order them by order.Date descending and take the first 4 'rows'.
Somebody might correct me here, but I think doing this in LINQ with a single query is probably very difficult if not impossible. I would use a stored procedure and something like this:
select
*
from
(
select
o.*
,RANK() OVER (PARTITION BY c.id ORDER BY o.order_time DESC) AS rnk
from
customers c
inner join
[order] o
on
o.cust_id = c.id
) ranked
where
ranked.rnk <= 10 -- this is "n"
The rank has to be computed in a derived table, because a window function alias cannot be referenced in the WHERE clause of the same SELECT. I've not used this syntax for a while so it might not be quite right, but if I understand the question then I think this is the best approach.
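For comparison, the group / order / take pattern from the LINQ answers above, as a self-contained sketch over an in-memory list. The Order record and sample data are hypothetical; how an EF provider translates the same expression to SQL depends on the provider and version, so the generated SQL is worth checking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Hypothetical order shape; TimeEntered mirrors the field name used above.
    record Order(int OrderId, int CustomerId, DateTime TimeEntered);

    static void Main()
    {
        var start = new DateTime(2024, 1, 1);
        var orders = new List<Order>();
        // Customer 1 has five orders, customer 2 has two.
        for (int i = 0; i < 5; i++) orders.Add(new Order(i + 1, 1, start.AddHours(i)));
        for (int i = 0; i < 2; i++) orders.Add(new Order(i + 6, 2, start.AddDays(-7 - i)));

        int n = 3;
        var lastOrders = orders
            .GroupBy(o => o.CustomerId)
            .Select(g => new
            {
                CustomerId = g.Key,
                // Newest first, then keep at most n per customer.
                Orders = g.OrderByDescending(o => o.TimeEntered).Take(n).ToList()
            })
            .ToList();

        foreach (var group in lastOrders)
            Console.WriteLine($"{group.CustomerId}: {string.Join(", ", group.Orders.Select(o => o.OrderId))}");
        // 1: 5, 4, 3
        // 2: 6, 7
    }
}
```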

How can this query be improved?

I am using LINQ to write a query - one query shows all active customers , and another shows all active as well as inactive customers.
if(showall)
{
var prod = Dataclass.Customers.Where(multiple factors ) (all inactive + active)
}
else
{
var prod = Dataclass.Customers.Where(multiple factors & active=true) (only active)
}
Can I do this using only one query? The issue is that the multiple factors are repeated in both queries.
thanks
var customers = Dataclass.Customers.Where(multiple factors);
var activeCust = customers.Where(x => x.active);
I really don't understand the question either. I wouldn't want to make this a one-liner because it would make the code unreadable
I'm assuming you are trying to minimize the number of roundtrips?
If "multiple factors" is the same, you can just filter for active users after your first query:
var onlyActive = prod.Where(p => p.active == true);
Wouldn't you just use your first query to return all customers? Otherwise you'd be returning the active users twice.
Options I'd consider
Bring all customers once, order by 'status' column so you can easily split them into two sets
Focus on minimizing DB roundtrips. Whatever you do in the front end costs an order of magnitude less than going to the DB.
Revise user requirements. For ex. consider paging on results - it's unlikely that end user will need all customers at once.
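One way to avoid repeating the shared conditions is to build the query in steps and append the active filter only when needed. A minimal sketch over an in-memory list; the Customer record and the Region test are hypothetical stand-ins for the "multiple factors". Because LINQ queries are deferred, the same composition on an IQueryable source would typically still run as a single database query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Hypothetical customer shape.
    record Customer(string Name, string Region, bool Active);

    static IEnumerable<Customer> FindCustomers(IEnumerable<Customer> customers, bool showAll)
    {
        // Shared conditions, written once.
        var query = customers.Where(c => c.Region == "North");

        // Add the active filter only when inactive customers should be hidden.
        if (!showAll)
            query = query.Where(c => c.Active);

        return query;
    }

    static void Main()
    {
        var customers = new List<Customer>
        {
            new("Ann", "North", true),
            new("Bob", "North", false),
            new("Cid", "South", true),
        };

        Console.WriteLine(FindCustomers(customers, showAll: true).Count());  // 2
        Console.WriteLine(FindCustomers(customers, showAll: false).Count()); // 1
    }
}
```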

Resources