Optimizing algorithms for multiple queries of the same kind

There is a particular class of algorithmic coding problems which require us to evaluate multiple queries, and each query can be of one of two kinds:
Perform search over a range of data
Update the data over a given range
One example I've been working on recently (though it is not the only one) is this: Quadrant Queries
Now, to optimize my algorithm, I have had one idea:
I can use dynamic programming to keep the search results for a particular range, and generate data for other ranges as required.
For example, if I have to calculate the sum of the numbers in an array from index 4 to 7, I can keep precomputed sums of the elements up to index 4 and up to index 7, which is easy; then I just need the difference of the two plus the 4th element, which is O(1). But this raises another problem: during an update operation, I would have to refresh the stored results for all the elements following the updated one. That seems inefficient, though I have not tried it in practice.
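A minimal sketch of this prefix-sum idea in Python (the array contents are made up for illustration); the final comment is exactly the update problem described above:

data = [5, 2, 8, 1, 9, 3, 7, 4]
prefix = [0]                      # prefix[i] = sum of the first i elements
for x in data:
    prefix.append(prefix[-1] + x)

def range_sum(lo, hi):
    # Sum of data[lo..hi] inclusive: the difference of two prefix sums, O(1).
    return prefix[hi + 1] - prefix[lo]

print(range_sum(4, 7))  # 9 + 3 + 7 + 4 = 23

# The catch: after data[i] changes, every prefix[j] with j > i is stale,
# so a single point update costs O(n) to rebuild.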
Someone suggested that I can combine subsequent update operations using some special data structure (I actually read this on some forum).
Question: Is there a known way to optimize these kinds of problems? Is there a special data structure that does it? And the idea I mentioned: is it possible that it is more efficient than the direct approach? Should I try it out?

This might help:
Segment Trees (the Range-Range part, i.e. range updates combined with range queries)
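For the simpler point-update/range-sum case, a minimal iterative segment tree sketch in Python might look like this (data made up); range updates additionally need lazy propagation, which is what the range-update/range-query material covers:

class SegmentTree:
    # Point update and range-sum query, both O(log n).
    def __init__(self, data):
        self.n = len(data)
        self.tree = [0] * self.n + list(data)  # leaves live at tree[n:]
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, i, value):
        # Set data[i] = value, then fix the ancestors on the way up.
        i += self.n
        self.tree[i] = value
        while i > 1:
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def query(self, lo, hi):
        # Sum over data[lo:hi] (hi exclusive), walking up from both ends.
        res = 0
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                res += self.tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                res += self.tree[hi]
            lo //= 2
            hi //= 2
        return res

st = SegmentTree([5, 2, 8, 1, 9, 3, 7, 4])
print(st.query(4, 8))  # 23
st.update(5, 10)
print(st.query(4, 8))  # 30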

Related

Data structure to Filter Data Quickly

I am doing a bit of research into building an efficient filtering algorithm for data with many properties. This is a fun project for me to learn new data structures.
For example, say I wanted all RPGs on PlayStation which had English releases.
Now I want to allow for much more complex queries.
Is there a good data structure to handle filtering on attributes like this without the need to give all of the attributes, so that I can give only a few and still find the correct games?
I currently plan to have "buckets", each describing one attribute; for example, all game IDs for a given genre will be in one bucket, and so forth. Then I will use a hash algorithm to add 1 to each matching game, and only use the games which have the correct count after the search.
But I want to try to find a faster or easier method, any suggestions when it comes to filtering many attributes to find sets of items?
Thanks,
What do you mean by "without the need to give all of the attributes"? Are you saying you have N attributes and you want to find the items that match l < N of the attributes, or are you saying that you don't want to compute an index for each attribute?
Hashing each attribute into buckets will give you O(1) time at the expense of O(n) space to store each index.
You could sort your list by one or two attributes to make some lookups O(log n), at the expense of having to do the sorting up front in O(n log n) time.
You could get kind of clever with Bloom filters for your attributes and let some attributes overlap. This would lead to some false positives, but you could filter those out after the fact. This gives you constant space with constant-time lookup in the average case (but O(n) time in the worst case).
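As a concrete illustration of the bucket idea, here is a minimal inverted-index sketch in Python (the catalogue and attribute names are made up): each (attribute, value) pair maps to a set of game ids, and a query intersects only the buckets it actually names.

from collections import defaultdict

games = [
    {"id": 1, "genre": "RPG", "platform": "PlayStation", "language": "English"},
    {"id": 2, "genre": "RPG", "platform": "PC", "language": "English"},
    {"id": 3, "genre": "Shooter", "platform": "PlayStation", "language": "Japanese"},
]

index = defaultdict(set)  # (attribute, value) -> set of game ids
for g in games:
    for attr, value in g.items():
        if attr != "id":
            index[(attr, value)].add(g["id"])

def filter_games(**criteria):
    # Intersect the buckets for whichever attributes the caller supplies.
    buckets = [index[(attr, value)] for attr, value in criteria.items()]
    return set.intersection(*buckets) if buckets else set()

print(filter_games(genre="RPG", platform="PlayStation"))  # {1}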

Fast algorithm for approximate lookup on multiple keys

I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters - one for each combination A&A, A&B, B&A and B&B. For these sorts of criteria I could concatenate the fields or something similar and create a unique key to look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is that I need to include a number of each in the same table. In other words, I could have three criteria - two with A/B entries, and one with less-than-x/greater-than-x type entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, then the closest boundary might be 100, but if the first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, then I want it to recognise that 50 is the closest boundary that applies - not 100.
I have a procedure to do the lookup but it gets very slow as the tables get bigger - it basically goes through each criteria, checks if a match is still possible, and if so it looks at more criteria - if not, it moves on to check the next entry in the table. So in other words, my procedure requires cycling through the table entries one by one and checking for a match. I have tried to optimise that by ensuring the tables that are input to the procedure are as small as possible and by making sure it looks at the criteria that are least likely to match first (so that it checks each entry as quickly as possible) but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
This is basically what a query optimizer does in SQL land. There are fast, free, in-memory databases for exactly this purpose. Check out SQLite: https://www.sqlite.org/inmemorydb.html.
It sounds like you are doing what is called a 'full table scan' for each query, which is like the last resort for a query optimizer.
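A minimal sketch of that suggestion in Python, using the standard-library sqlite3 module (the table and column names are made up to mirror the A/B-plus-boundary example above):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE params (c1 TEXT, c2 TEXT, bound REAL, value REAL)")
con.executemany("INSERT INTO params VALUES (?, ?, ?, ?)", [
    ("A", "A", 50.0, 1.0),
    ("A", "B", 100.0, 2.0),
])
# The index is what lets the optimizer avoid a full table scan.
con.execute("CREATE INDEX idx_params ON params (c1, c2, bound)")

# Largest boundary at or below 101 for the A&A combination -> 50.
row = con.execute(
    "SELECT value FROM params WHERE c1 = ? AND c2 = ? AND bound <= ? "
    "ORDER BY bound DESC LIMIT 1", ("A", "A", 101.0)).fetchone()
print(row)  # (1.0,)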
As I've understood, you want to select entries by criteria like
A & not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < upper_x2 & ...
The easiest way is to have them sorted by each possible xi, where i = 1, 2, ..., in separate sets, and to have separate 'words' for the various combinations of A, B, ...
The search works as follows (see the sketch after these steps):
Select the proper 'word' by the Boolean criteria combination
For each i, find the population of the lower_xi..upper_xi range in the corresponding set (this operation is O(log N))
Select the i for which the population is lowest
While iterating over the instances in the lower_xi..upper_xi range, filter the results by checking the other upper/lower bound criteria (for all xj where j != i)
Note that this is a general solution. Of course, if you know some relation between your bounds, you may use a list sorted by the respective combination(s) of item values.
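A minimal sketch of these steps in Python, for one Boolean 'word' with two numeric criteria (all names and data are made up); bisect provides the O(log N) population counts:

import bisect

# Sorted (value, row_id) pairs for x1; a row store holds the other criteria.
x1 = sorted([(0.5, 0), (1.2, 1), (3.0, 2), (7.5, 3)])
rows = {0: (0.5, 10.0), 1: (1.2, 2.0), 2: (3.0, 5.0), 3: (7.5, 1.0)}

def population(pairs, lo, hi):
    # How many values fall in [lo, hi): two binary searches, O(log N).
    return bisect.bisect_left(pairs, (hi,)) - bisect.bisect_left(pairs, (lo,))

lo1, hi1 = 1.0, 8.0
lo2, hi2 = 1.5, 6.0
print(population(x1, lo1, hi1))  # 3 candidates on the x1 criterion alone

# Iterate the narrowest range and filter by the remaining bounds.
start = bisect.bisect_left(x1, (lo1,))
end = bisect.bisect_left(x1, (hi1,))
print([rid for _, rid in x1[start:end] if lo2 <= rows[rid][1] < hi2])  # [1, 2]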

Find all the rows given the number of matching attributes and the query

Below is my problem definition:
Given a database D, where each row has m categorical attributes, and given a query consisting of a vector of m categorical attributes together with a matching threshold k: how do I efficiently find all the row ids such that the number of attributes matching the query is greater than or equal to k?
The easier version (I think) is: given a vector of <= m categorical attributes, how do I find the ids of all the rows that match those attributes?
In some of the questions (e.g. this one), they scan the whole database every time a query comes in. I don't think this is fast enough, though I'm actually not sure of the complexity.
If possible, I want to avoid scanning all the rows in the database. I am therefore thinking of building some kind of index, and I am wondering whether there is any existing work on this.
In addition, is there a similar, named problem? I would like to take a look at it.
Thank you very much for your help.
(Regarding the coding, I mainly code in Python 2.7 for this.)
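For what it's worth, here is a minimal index sketch in Python (the data is made up, with m = 3): one inverted list per (position, value) pair, and a counter accumulating per-row match counts, so only rows sharing at least one attribute with the query are touched. In the worst case this still visits many rows, but it avoids a full scan when the query values are selective.

from collections import Counter, defaultdict

D = {0: ("a", "x", "p"), 1: ("a", "y", "p"), 2: ("b", "y", "q")}

index = defaultdict(set)  # (position, value) -> set of row ids
for rid, row in D.items():
    for pos, value in enumerate(row):
        index[(pos, value)].add(rid)

def rows_matching_at_least(query, k):
    # Count how many query attributes each candidate row matches.
    counts = Counter()
    for pos, value in enumerate(query):
        for rid in index[(pos, value)]:
            counts[rid] += 1
    return sorted(rid for rid, c in counts.items() if c >= k)

print(rows_matching_at_least(("a", "y", "q"), 2))  # [1, 2]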

Indexing by float or double field algorithm

I have a task that requires fast search over a huge in-memory array of objects by some of the objects' fields. I need to select the subset of objects satisfying some criteria.
The criteria may be specified as a floating-point value or a range of such values (e.g. 2.5..10).
The problem is that the float property to be searched on is not quite uniformly distributed; the array could contain a few objects with values in the range 10-20 (for example), another million objects with values 0-1, and another million with values 100-150.
So, is it possible to build an index for searching those objects efficiently? Code samples are welcome.
If the in-memory array is ordered, then binary search would be my first attempt. The Wikipedia entry has example code as well.
http://en.wikipedia.org/wiki/Binary_search_algorithm
If you're doing lookups only, a single sort followed by multiple binary searches is good.
You could also try a perfect hash algorithm, if you want the ultimate in lookup speed and little more.
If you need more than just lookups, check out treaps and red-black trees. The former are fast on average, while the latter are decent performers with a low operation duration variability.
You could try a range tree, for the range requirement.
I fail to see what the distribution of values has to do with building an index (with the possible exception of exact duplicates). Since the data fits in memory, just extract all the fields with their original positions, sort them, and use a binary search as suggested by @MattiLyra.
Are we missing something?
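A minimal sketch of that suggestion in Python (the object and field names are made up): sort (value, original position) pairs once, then answer each range query with two binary searches; once the data is ordered, the skewed distribution is irrelevant.

import bisect

class Obj:
    def __init__(self, weight):
        self.weight = weight

objects = [Obj(w) for w in [0.3, 120.0, 0.7, 14.2, 101.5, 0.1]]

# Build the index once: field values paired with original positions, sorted.
index = sorted((o.weight, i) for i, o in enumerate(objects))
keys = [k for k, _ in index]

def find_in_range(lo, hi):
    # All original positions whose field lies in [lo, hi]: O(log n + hits).
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return [pos for _, pos in index[start:end]]

print(find_in_range(2.5, 110.0))  # [3, 4] -> the objects with 14.2 and 101.5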

Do I need to implement a b-tree search for this?

I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective-C may even have a built-in method for that (many languages I know do). A B-tree probably won't help much, unless you want to store the data on disk.
I don't know about Objective-C, but C (plain ol' C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a btree is probably overkill...
Since they are sorted in ascending order and you only need the bigger ones, I would serialize that array, explode it by the int and keep the part of the serialized string that holds the bigger ints, then unserialize it and voilà.
Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It's easy, but the work needed is in proportion to the amount of data to be searched: doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element: if its value is bigger than what we are looking for, we look in the first half; otherwise, we look in the second half. We repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration, so it's logarithmic.
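For reference, the JavaScript example above can be reproduced with Python's built-in binary search from the bisect module:

import bisect

numbers = [1, 7, 23, 23, 23, 89, 1002, 1003]

def index_of_value_greater_than(arr, x):
    # Index of the first element strictly greater than x, in O(log n).
    return bisect.bisect_right(arr, x)

print(index_of_value_greater_than(numbers, 100))  # 6
print(index_of_value_greater_than(numbers, 7))    # 2
# For the first element >= x, use bisect.bisect_left instead.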
