Data structure algorithms for database searching

I am used to the traditional ways of doing database searching:
using wildcards for term searches
using a where clause for specific data like addresses and names
But at other times, I have found that these common methods produce very bloated code, especially for complex searches.
Are there algorithms out there that you use for complex database searching? I tried to look for some but had a hard time finding any. I stumbled across binary search, but I can't find a use for it :(
EDIT: Here's pseudocode for a search I was working on. It uses jQuery range sliders for the minimum and maximum price:
query = 'select * from table'
if set minprice and not set maxprice
    if minprice = 'nomin'
        query += ' where price < maxprice'
    else
        query += ' where price < maxprice and price < minprice'
if not set minprice and set maxprice
    if maxprice = 'nomax'
        query += ' where price > minprice'
    else
        query += ' where price > minprice and price < maxprice'
if set maxprice and set minprice
    if maxprice = 'nomax'
        query += ' where price > minprice'
    else
        query += ' where price > minprice and price < maxprice'
This need not be the code you base your answers on. I'm looking for more elegant ways of doing database searching.
EDIT: By elegant I mean ways of rewriting the code to achieve faster queries in fewer lines of code.

Alright, I'm still not very clear on what you want, but I'll give it a shot...
If you're trying to speed up the query, you don't need to worry about "improved algorithms". Just make sure that any columns that you're searching on (price in your example) have an index on them, and the database will take care of searching efficiently. It's very good at it, I promise.
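To see the effect of an index for yourself, you can ask the database how it plans to run a query. The following is only a rough sketch using Python's built-in sqlite3 module; the table name items, the index name idx_items_price and the sample data are made up for illustration, and other databases have their own EXPLAIN syntax.

```python
import sqlite3

# Hypothetical table and index names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items (name, price) VALUES (?, ?)",
    [(f"item{i}", float(i)) for i in range(1000)],
)


def plan(sql):
    """Return the query planner's description of how it will run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " ".join(row[-1] for row in rows)


range_query = "SELECT * FROM items WHERE price > 100 AND price < 200"

plan_before = plan(range_query)  # no index on price yet: full table scan
conn.execute("CREATE INDEX idx_items_price ON items (price)")
plan_after = plan(range_query)   # the planner can now search the index

print(plan_before)
print(plan_after)
```

The query text is identical in both cases; only the presence of the index changes how the database executes it.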
As for reducing the amount of code, again, I can't speak for every case, but your above pseudocode is bloated because you're handling the exact same case multiple times. My code for something like that would be more like this (pseudocode, no particular language):
if (set(minprice) and minprice != 'nomin')
    conditions[] = 'price > minprice'
if (set(maxprice) and maxprice != 'nomax')
    conditions[] = 'price < maxprice'
query = 'select * from table'
if (notempty(conditions))
    query += ' where ' + conditions.join(' and ')

Remember, the speed of a query depends on more than the query itself; it also depends greatly on how the database is structured. Is this a standard relational layout, a star schema, or something else? Are your keys indexed, and do you have secondary indexes? Are you expecting to bring back a lot of data, or just a couple of rows? Are you searching on columns where the database has to do a text search, or on numeric values? And on top of that, how is the database physically laid out: are indexes and heavily hit tables on separate drives? As others have mentioned, a specific example would be more helpful in trying to solve this.

When interfacing with a database, you're far better off with a complex and ugly query than with an 'elegant' query which has you duplicating database search functionality inside your application. Each call to the database has a cost associated with it. If you write code to search a database within your application, it's virtually guaranteed to be more expensive.
Unless you are actually writing a database (tall order), let the database do the searching.

Try to focus on reorganizing your query-building process, so that conceptually:
query = select + ' where ' + filter1 + filter2
Build each piece separately:
select = 'select * from table'
filter1 = ''
if set minprice
    if minprice != 'nomin'
        filter1 = 'price > minprice'
and so on, until you build the full query:
query = select
if any filter on
    query += ' where '
first = true
if set filter1
    if not first
        query += ' and '
    query += filter1
    first = false
and so on...
You can also put your filters in an array; that scales better as you add more filters.

The major problem with your code is that it unnecessarily mulls over every possible combination of set(minprice) and set(maxprice), while they can be treated independently:
query = 'select * from table'
conditions = []  # array of strings representing conditions
if set(minprice):
    conditions.append("price > minprice")
if set(maxprice):
    conditions.append("price < maxprice")
if len(conditions) > 0:
    query += ' WHERE ' + " and ".join(conditions)
In general it is beneficial to separate generation of conditions (the if set(...) lines above) from building the actual query. This way you don't need a separate if to generate (or skip) an "AND" or "WHERE" before each generated condition but instead you can just process it in one place (the last two lines above) adding the infixes as necessary.
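As a runnable version of that idea, here is a minimal sketch in Python. The function name build_price_query, the table name items and the ? placeholder style are assumptions for illustration, and the 'nomin'/'nomax' sentinels are taken from the question. It also collects the values as bound parameters instead of pasting them into the SQL string, which is an addition on my part.

```python
def build_price_query(minprice=None, maxprice=None):
    """Build a SELECT with optional price bounds.

    Returns (sql, params). The table name `items` and the `?`
    placeholder style are assumptions for this sketch.
    """
    conditions = []  # SQL fragments, one per active filter
    params = []      # values bound to the placeholders, in order

    # Each filter is decided on its own, independent of the others.
    if minprice is not None and minprice != "nomin":
        conditions.append("price > ?")
        params.append(minprice)
    if maxprice is not None and maxprice != "nomax":
        conditions.append("price < ?")
        params.append(maxprice)

    # The WHERE/AND glue is handled in exactly one place.
    sql = "SELECT * FROM items"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


print(build_price_query(10, 50))
```

Adding another filter then means adding one more independent if block; the code that assembles the query does not change.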

Related

Using a single query for multiple searches in ElasticSearch

I have a dataset with documents that are identifiable by three fields, let's say "name","timestamp" and "country". Now, I use elasticsearch-dsl-py, but I can read native elasticsearch queries, so I can accept those as answers as well.
Here's my code to get a single document by the three fields:
def get(name, timestamp, country):
    search = Item.search()
    search = search.filter("term", name=name)
    search = search.filter("term", timestamp=timestamp)
    search = search.filter("term", country=country)
    search = search[:1]
    return search.execute()[0]
This is all good, but sometimes I'll need to get 200+ items and calling this function means 200 queries to ES.
What I'm looking for is a single query that will take a list of the three field-identifiers and return all the documents matching it, no matter the order.
I've tried using ORs + ANDs but unfortunately the performance is still poor, although at least I'm not making 200 round trips to the server.
from elasticsearch_dsl import Q

def get_batch(list_of_identifiers):
    search = Item.search()
    batch_query = None
    for ref in list_of_identifiers:
        # AND the three fields together for one identifier
        sub_query = Q("match", name=ref["name"])
        sub_query &= Q("match", timestamp=ref["timestamp"])
        sub_query &= Q("match", country=ref["country"])
        # OR the per-identifier queries into one batch query
        if batch_query is None:
            batch_query = sub_query
        else:
            batch_query |= sub_query
    search = search.filter(batch_query)
    return search.scan()
Is there a faster/better approach to this problem?
Is using a multi-search going to be the faster option than using should/musts (OR/ANDs) in a single query?
EDIT: I tried multi-search and there was virtually no difference in the time. We're talking about seconds here. For 6 items it takes 60ms to get the result, for 200 items we're talking about 4-5 seconds.
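For readers who prefer native queries, the OR-of-ANDs approach described above corresponds roughly to the request body built below. This is only a sketch: the helper name build_batch_query is made up, and it uses term clauses in filter context (as in get() above), which assumes the three fields are indexed as exact values. Clauses in filter context are not scored, but I have not measured whether this form is any faster on this data.

```python
def build_batch_query(identifiers):
    """Build a native bool query: one `should` clause per identifier,
    each requiring all three fields to match exactly."""
    should = [
        {
            "bool": {
                "filter": [
                    {"term": {"name": ref["name"]}},
                    {"term": {"timestamp": ref["timestamp"]}},
                    {"term": {"country": ref["country"]}},
                ]
            }
        }
        for ref in identifiers
    ]
    return {
        "query": {
            "bool": {
                "filter": [
                    # At least one of the per-identifier clauses must match.
                    {"bool": {"should": should, "minimum_should_match": 1}}
                ]
            }
        },
        # Assumes each identifier matches at most one document.
        "size": len(identifiers),
    }


body = build_batch_query([
    {"name": "a", "timestamp": "2020-01-01T00:00:00", "country": "US"},
    {"name": "b", "timestamp": "2020-01-02T00:00:00", "country": "DE"},
])
```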

Linq query increase performance efficient query

int mappedCount = (from product in products
                   from productMapping in DbContext.ProductCategoryMappings
                       .Where(x => product.TenantId == x.TenantId.ToString() &&
                                   x.ProductId.ToString().ToUpper() == product.ProductGuid.ToUpper())
                   join tenantCustMapping in DbContext.TenantCustCategories
                       on productMapping.Value equals tenantCustMapping.Id
                   select 1).ToList().Sum();
I need to improve the performance of this query.
It maps products across two tables, and each item can have multiple products.
If you want to increase performance, you need to know the volumes of data that are being sent around. How many "products" are in your variable?
It may be quicker to convert your products list to contain integers / GUIDs and send that to your database, rather than sending strings that the database has to run ToUpper() on before comparing them.
Something like:
var convertedList = products.Select(x => new { TenantId = int.Parse(x.TenantId), ProductId = Guid.Parse(x.ProductGuid) }).ToList();
Then send that to your database and compare the values directly.
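The same idea, sketched in Python rather than C# purely as an illustration: parse the string identifiers into typed values once, so later comparisons need no per-row case conversion. The function name normalise_products and the dictionary keys are assumptions that mirror the property names in the question.

```python
import uuid


def normalise_products(products):
    """Parse string identifiers into typed values once, up front.

    `products` is assumed to be a list of dicts with string
    'TenantId' and 'ProductGuid' entries.
    """
    return [
        (int(p["TenantId"]), uuid.UUID(p["ProductGuid"]))
        for p in products
    ]


upper = normalise_products(
    [{"TenantId": "7", "ProductGuid": "A3F1C2D4-0000-4000-8000-1234567890AB"}]
)
lower = normalise_products(
    [{"TenantId": "7", "ProductGuid": "a3f1c2d4-0000-4000-8000-1234567890ab"}]
)
# Typed values compare equal regardless of the original letter case.
print(upper == lower)
```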
I think changing "Select 1).ToList().Sum()" to ".Count()" will improve performance. Even if not, it'll help readability.

Linq Query Where Contains

I'm attempting to make a LINQ where-contains query quicker.
The data set contains 256,999 clients. Ids is just a simple list of GUIDs, and it might contain only 3 records.
The query below can take up to a minute to return the 3 records. This is because the logic goes through all 256,999 records to see whether each of them is in the list of 3 records.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like the query to check whether the three records are within the pool of 256,999 instead, which should be much quicker.
I don't want to do a loop, as the list could hold far more than 3 records (thousands). The more loops, the more hits to the db.
I don't want to grab all the db records (256,999) and then do the query, as it would take nearly the same amount of time.
If I grab just the Ids for all the 256,999 from the DB, it takes a second. This is where the Ids come from. (A filtered, small and simple list.)
Any Ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
The reason the Id query is quicker is that only one field is returned and it's a single-table query.
The main query contains sub-queries (below). So I get the Ids from a quick and easy query, then use the Ids to get the more detailed information.
SELECT Clients.Id as ClientId, Clients.ClientRef as ClientRef, Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname as FullName,
[Address1] ,[Address2],[Address3],[Town],[County],[Postcode],
Clients.Consent AS Consent,
CONVERT(nvarchar(10), Clients.Dob, 103) as FormatedDOB,
CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END As Gender,
Convert(nvarchar(10), Max(Assessments.TestDate),103) as LastVisit,
CASE WHEN Max(Convert(integer,Assessments.Submitted)) = 1 Then 'true' ELSE 'false' END AS Submitted,
CASE WHEN Max(Convert(integer,Assessments.GPSubmit)) = 1 Then 'true' ELSE 'false' END AS GPSubmit,
CASE WHEN Max(Convert(integer,Assessments.QualForPay)) = 1 Then 'true' ELSE 'false' END AS QualForPay,
Clients.UserIds AS LinkedUsers
FROM Clients
Left JOIN Assessments ON Clients.Id = Assessments.ClientId
Left JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname, [Address1] ,[Address2],[Address3],[Town],[County],[Postcode],Clients.Consent, Clients.Dob, Clients.IsMale,Clients.UserIds -- ,Layouts.LayoutName, Layouts.SubmissionProcess
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains part, as the pool of Ids would be smaller than the main pool.
The way I've sped it up for now: I've done a String.Join on the list of Ids and added them as a WHERE clause within the main SQL. This has reduced the time to a second or so.
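For illustration, pushing the Id filter into the SQL looks roughly like this in Python with the built-in sqlite3 module. It is a sketch, not the code from the post: the table and column names are simplified stand-ins, and it binds the Ids as parameters rather than joining them into the string directly, which avoids quoting problems.

```python
import sqlite3


def fetch_clients(conn, ids):
    """Fetch only the clients whose id is in `ids`, filtering in SQL."""
    if not ids:
        return []  # an empty IN list is rejected by many databases
    # One placeholder per id; the values themselves are bound separately.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT id, client_ref FROM clients "
        f"WHERE id IN ({placeholders}) ORDER BY client_ref"
    )
    return conn.execute(sql, list(ids)).fetchall()


# Hypothetical data standing in for the Clients table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id TEXT PRIMARY KEY, client_ref TEXT)")
conn.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [(f"id-{i}", f"REF{i:04d}") for i in range(1000)],
)

rows = fetch_clients(conn, ["id-3", "id-42", "id-999"])
print(rows)
```

Databases generally cap the number of parameters per statement, so for thousands of Ids you may need to send them in batches or load them into a temporary table and join against it.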

lucene.net, document boost not working

I am a beginner developing my very first project with Lucene.Net (version 3.0.3): an address search utility.
I am using the standard analyzer and the query parser. (Suppose I have a single field, stored and analyzed.)
Sample data (every row is a document with a single field,
the postcode and street columns concatenated):
UB6 9AH Greenford Road something
UB6 9AP Greenford Road something
UB1 3EB Greenford Road something
PR8 3JT Greenford Road something
HA1 3QD something Greenford Road
SM1 1JY something Greenford Road something
Searching
StringBuilder customQuery = new StringBuilder();
// Phrase match on the whole search term, boosted by the word count
customQuery.Append(_searchFieldName + ":\"" + searchTerm + "\"^" + (wordsCount));
// Required prefix match for each word
foreach (var word in words.Where(word => !string.IsNullOrEmpty(word)))
{
    customQuery.Append(" +" + _searchFieldName + ":" + word + "*");
}
Query query = _parser.Parse(customQuery.ToString());
_searcher.Search(query, collector);
All of the above (searching) works fine.
Question
If I search for "Greenford road",
I may want the row that has 'SM1' to come up first (that is, I want to prioritise results by postcode).
I have tested query-time boosting and it works fine,
but I may sometimes have a long list of priority postcodes, so I don't want to loop over each postcode and set its priority at query time.
I want document-time boosting,
but whatever document boost I set at indexing time, it doesn't affect my search results:
doc.Add(new Field(SearchFieldName, SearchField, Field.Store.YES, Field.Index.ANALYZED));
if (condition == true)
{
    doc.Boost = 2; // or 5 or 200 etc. (nothing works)
}
Please help.
I tried to understand similarity and scoring, but there is too much mathematics there.
I recently had this problem myself and I think it might be due to wildcard queries (It was in my case at least). There is another post here that explains the issue better, and provides a possible solution:
Lucene .net Boost not working when using * wildcard

Is there an OR clause in LINQ?

I am trying to query an XML document for the specific records that I need. I know that the line containing the "or where" case below is incorrect, but I'm hoping it will illustrate what I am trying to accomplish. Can you do a conditional where clause on two separate properties?
XDocument xd = XDocument.Load("CardData.xml");
SearchList.ItemsSource = from x in xd.Descendants("card")
                         where x.Element("title").Value.ToUpper().Contains(searchterm.ToUpper())
                         or where x.Element("id").Value.Contains(searchterm)
                         select new Card
                         {
                             Title = x.Element("title").Value
                         };
Yes - simply use the boolean or || and combine your conditions into one Where clause:
where x.Element("title").Value.ToUpper().Contains(searchterm.ToUpper()) ||
x.Element("id").Value.Contains(searchterm)
Also note just as a minor optimization, I would pre-compute some of the operations you currently have in your Where clause so they are not performed on every item in the list - probably doesn't matter but it might when you have a lot of elements (and is just a good habit to get into in my opinion):
string searchTermUpperCase = searchterm.ToUpper();
SearchList.ItemsSource = from x in xd.Descendants("card")
                         where x.Element("title").Value.ToUpper().Contains(searchTermUpperCase) ||
                               x.Element("id").Value.Contains(searchterm)
                         ..
