Is range tree widely used in spacial search problems? - data-structures

I am looking for some data structures for range searching. I think range trees offer a good time complexity (but with some storage requirements).
However, it seems to me that other data structures, like KD-trees, are more discussed and recommended than range trees. Is this true? If so, why?

I would expect that it is because kd-trees can straightforwardly be extended to contains objects other than points. This gives it many applications in e.g. virtual worlds, where we want quick querying of triangles. Similar extensions of range trees are not straightforward, and in fact I've never seen any.
To give a quick recap: a kd-tree can preprocess a set of n points in d-space in O(n log n) time into a structure using O(n) space, such that any d-dimensional range query can be answered in O(n1-1/d + k) time, where k is the number of answers. A range trees takes O(n logd-1n) time to preprocess, takes O(n logd-1n) space, and can answer range queries in O(logd-1n + k) time.
The query time for a range tree is obviously a lot better than that of a kd-tree, if we're talking about 2- or 3-dimensional space. However, the kd-tree has several advantages. First of all, it always requires only linear storage. Secondly, it is always constructed in O(n log n) time. Third, if the dimensionality is very high, it will outperform a range tree unless your points sets are very large (although arguably, at this point a linear search will be almost as fast as a kd-tree).
I think another important point is that kd-trees are more well known by people than range trees. I'd never heard of a range tree before taking a course in computational geometry, but I'd heard of and worked with kd-trees before (albeit in a computer graphics setting).
EDIT: You ask what is a better data structure for 2D or 3D fixed radius search when you have millions of points. I really can't tell you! I'd be inclined to say a range tree will be faster if you perform many queries, but for 3D the construction will be slower by a factor of O(log n), and memory use may become an issue before speed does. I'd recommend integrating good implementations of both structures and simply testing what does a better job for your particular requirements.

For your needs (particle simulation in 2D or 3D), you want a data structure capable of all nearest neighbor queries. The cover tree is a data structure that is best suited for this task. I came across it while computing the nearest neighbors for kernel density estimation. This Wikipedia page explains the basic definition of the tree, and John Langford's page has a link to a C++ implementation.
The running time of a single query is O(c^12*logn), where c is the expansion constant of the dataset. This is an upper bound - in practice, the data structure performs faster than others. This paper shows that the running time of batch processing of all nearest neighbors (for all the data points), as needed for a particle simulation, is O(c^16*n), and this theoretical linear bound is also practical for your need. Construction time is O(nlogn) and storage is O(n).

Related

How to optimize the algorithm used to calculate K-nearest neighbor algorithm?

KNN is such a straightforward algorithm that's easy to implement:
# for each test datapoint in X_test:
# calculate its distance from every points in X_train
# find the top k most closest points
# take majority vote of the k neighbors and use that as prediction for this test data point
Yet I think the time complexity is not good enough. How is the algorithm optimized when it is implemented in reality? (like what trick or data structure it's using?)
The k-nearest neighbor algorithm differs from other learning methods because no
model is induced from the training examples. The data remains as they are; they
are simply stored in memory.
A genetic algorithm is combined with k-NN to improve performance. Another successful technique known as instance
selection is also proposed to face simultaneously, the efficient storage and noise of
k-NN. you can try this: when a new instance should be classified; instead of
involving all learning instances to retrieve the k-neighbors which will increase the
computing time, a selection of a smaller subset of instances is first performed.
you can also try:
Improving k-NN speed by reducing the number of training
documents
Improving k-NN by neighborhood size and similarity
function
Improving k-NN by advanced storage structures
What you describe is the brute force kNN calculation with O(size(X_test)*size(X_train)*d), where d is the number of dimensions in the feature vectors.
More efficient solution use spatial indexing to put an index on the X_train data. This typically reduces the individual lookups to O( log(size(X_train)) * d) or even O( log(size(X_train)) + d).
Common spatial indexes are:
kD-Trees (they are often used, but scale badly with 'd')
R-Trees, such as the RStarTree
Quadtrees (Usually not efficient for large 'd', but for example the PH-Tree works well with d=1000 and has excellent remove/insertion times (disclaimer, this is my own work))
BallTrees (I don't really know much about them)
CoverTrees (Very fast lookup for high 'd', but long build-up times
There are also the class of 'approximate' NN searches/queries. These trade correctness with speed, they may skip a few of the closest neighbors. You can find a performance comparison and numerous implementations in python here.
If you are looking for Java implementations of some of the spatial indexes above, have a look at my implementations.

Most efficient implementation to get the closest k items

In the K-Nearest-Neighbor algorithm, we find the top k neighbors closest to a new point out of N observations and use those neighbors to classify the point. From my knowledge of data structures, I can think of two implementations of this process:
Approach 1
Calculate the distances to the new point from each of N observations
Sort the distances using quicksort and take the top k points
Which would take O(N + NlogN) = O(NlogN) time.
Approach 2
Create a max-heap of size k
Calculate the distance from the new point for the first k points
For each following observation, if the distance is less than the max in the heap, pop that point from the heap and replace it with the current observation
Re-heapify (logk operations for N points)
Continue until there are no more observations at which point we should only have the top 5 closest distances in the heap.
This approach would take O(N + NlogK) = O(NlogK) operations.
Are my analyses correct? How would this process be optimized in a standard package like sklearn? Thank you!
Here's a good overview of the common methods used: https://en.wikipedia.org/wiki/Nearest_neighbor_search
What you describe is linear search (since you need to compute the distance to every point in the dataset).
The good thing is that this always works. The bad thing about it is that is slow, especially if you query it a lot.
If you know a bit more about your data you can get better performance. If the data has low dimensionality (2D, 3D) and is uniformly distributed (this doesn't mean perfectly, just not in very dense and very tight clusters) then space partitioning works great because it cuts down quickly on the points that are too far anyway (complexity O(logN) ). They work also for higher dimensionallity or if there are some clusters, but the performance suffers a bit (still better overall than linear search).
Usually space partitioning or locality sensitive hashing are enough for common datasets.
The trade-off is that you use more memory and some set-up time to speed up future queries. If you have a lot of queries then it's worth it. If you only have a few, not so much.

What are some fast approximations of Nearest Neighbor?

Say I have a huge (a few million) list of n vectors, given a new vector, I need to find a pretty close one from the set but it doesn't need to be the closest. (Nearest Neighbor finds the closest and runs in n time)
What algorithms are there that can approximate nearest neighbor very quickly at the cost of accuracy?
EDIT: Since it will probably help, I should mention the data are pretty smooth most of the time, with a small chance of spikiness in a random dimension.
There are exist faster algoritms then O(n) to search closest element by arbitary distance. Check http://en.wikipedia.org/wiki/Kd-tree for details.
If you are using high-dimension vector, like SIFT or SURF or any descriptor used in multi-media sector, I suggest your consider LSH.
A PhD dissertation from Wei Dong (http://www.cs.princeton.edu/cass/papers/cikm08.pdf) might help you find the updated algorithm of KNN search, i.e, LSH. Different from more traditional LSH, like E2LSH (http://www.mit.edu/~andoni/LSH/) published earlier by MIT researchers, his algorithm uses multi-probing to better balance the trade-off between recall rate and cost.
A web search on "nearest neighbor" lsh library finds
http://www.mit.edu/~andoni/LSH/
http://www.cs.umd.edu/~mount/ANN/
http://msl.cs.uiuc.edu/~yershova/MPNN/MPNN.htm
For approximate nearest neighbour, the fastest way is to use locality sensitive hashing (LSH). There are many variants of LSHs. You should choose one depending on the distance metric of your data. The big-O of the query time for LSH is independent of the dataset size (not considering time for output result). So it is really fast. This LSH library implements various LSH for L2 (Euclidian) space.
Now, if the dimension of your data is less than 10, kd tree is preferred if you want exact result.

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using euclidean distance, currently k=2 if this makes it easiser)
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimension grows. In my sample implementation, its only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which
supports data structures and
algorithms for both exact and
approximate nearest neighbor searching
in arbitrarily high dimensions.
Based on our own experience, ANN
performs quite efficiently for point
sets ranging in size from thousands to
hundreds of thousands, and in
dimensions as high as 20. (For applications in significantly higher
dimensions, the results are rather
spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of
different data structures, based on
kd-trees and box-decomposition trees,
and employs a couple of different
search strategies.
I'd try it first directly and if that doesn't produce satisfactory results I'd use it with the data set after applying PCA/ICA (since it's quite unlikely your going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis(PCA) is a good idea when you want to reduce the dimensional space your data live on.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (that is the 4th for example NN to your query, while you are looking for the 1st NN).
That approach cost you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to solve the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it will be hashed to a bucket, where that bucket (and usually its neighboring ones) contain good NN candidates).
FALCONN is a good C++ implementation, which focuses in cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything and I'd have a hard time figure out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation, n log (n) for the sort, but you have to do the sort n times (different for EVERY value of n). Messy, but still polynomial time to get your answers.
BK-Tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings it should give you a spring board for other approaches. The other thing I can think of are R-Trees, however I don't know if they've been generalized for large dimensions. I can't say more than that since I neither have used them directly nor implemented them myself.
One very common implementation would be to sort the Nearest Neighbours array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use methods like indirect sorting, example Numpy.argpartition in Python Numpy library to sort only the closest K values you are interested in. No need to sort the entire array.
#Grembo's answer above should be reduced significantly. as you only need K nearest Values. and there is no need to sort the entire distances from each point.
If you just need K neighbours this method will work very well reducing your computational cost, and time complexity.
if you need sorted K neighbours, sort the output again
see
Documentation for argpartition

When does Big-O notation fail?

What are some examples where Big-O notation[1] fails in practice?
That is to say: when will the Big-O running time of algorithms predict algorithm A to be faster than algorithm B, yet in practice algorithm B is faster when you run it?
Slightly broader: when do theoretical predictions about algorithm performance mismatch observed running times? A non-Big-O prediction might be based on the average/expected number of rotations in a search tree, or the number of comparisons in a sorting algorithm, expressed as a factor times the number of elements.
Clarification:
Despite what some of the answers say, the Big-O notation is meant to predict algorithm performance. That said, it's a flawed tool: it only speaks about asymptotic performance, and it blurs out the constant factors. It does this for a reason: it's meant to predict algorithmic performance independent of which computer you execute the algorithm on.
What I want to know is this: when do the flaws of this tool show themselves? I've found Big-O notation to be reasonably useful, but far from perfect. What are the pitfalls, the edge cases, the gotchas?
An example of what I'm looking for: running Dijkstra's shortest path algorithm with a Fibonacci heap instead of a binary heap, you get O(m + n log n) time versus O((m+n) log n), for n vertices and m edges. You'd expect a speed increase from the Fibonacci heap sooner or later, yet said speed increase never materialized in my experiments.
(Experimental evidence, without proof, suggests that binary heaps operating on uniformly random edge weights spend O(1) time rather than O(log n) time; that's one big gotcha for the experiments. Another one that's a bitch to count is the expected number of calls to DecreaseKey).
[1] Really it isn't the notation that fails, but the concepts the notation stands for, and the theoretical approach to predicting algorithm performance. </anti-pedantry>
On the accepted answer:
I've accepted an answer to highlight the kind of answers I was hoping for. Many different answers which are just as good exist :) What I like about the answer is that it suggests a general rule for when Big-O notation "fails" (when cache misses dominate execution time) which might also increase understanding (in some sense I'm not sure how to best express ATM).
It fails in exactly one case: When people try to use it for something it's not meant for.
It tells you how an algorithm scales. It does not tell you how fast it is.
Big-O notation doesn't tell you which algorithm will be faster in any specific case. It only tells you that for sufficiently large input, one will be faster than the other.
When N is small, the constant factor dominates. Looking up an item in an array of five items is probably faster than looking it up in a hash table.
Short answer: When n is small. The Traveling Salesman Problem is quickly solved when you only have three destinations (however, finding the smallest number in a list of a trillion elements can last a while, although this is O(n). )
the canonical example is Quicksort, which has a worst time of O(n^2), while Heapsort's is O(n logn). in practice however, Quicksort is usually faster then Heapsort. why? two reasons:
each iteration in Quicksort is a lot simpler than Heapsort. Even more, it's easily optimized by simple cache strategies.
the worst case is very hard to hit.
But IMHO, this doesn't mean 'big O fails' in any way. the first factor (iteration time) is easy to incorporate into your estimates. after all, big O numbers should be multiplied by this almost-constant facto.
the second factor melts away if you get the amortized figures instead of average. They can be harder to estimate, but tell a more complete story
One area where Big O fails is memory access patterns. Big O only counts operations that need to be performed - it can't keep track if an algorithm results in more cache misses or data that needs to be paged in from disk. For small N, these effects will typically dominate. For instance, a linear search through an array of 100 integers will probably beat out a search through a binary tree of 100 integers due to memory accesses, even though the binary tree will most likely require fewer operations. Each tree node would result in a cache miss, whereas the linear search would mostly hit the cache for each lookup.
Big-O describes the efficiency/complexity of the algorithm and not necessarily the running time of the implementation of a given block of code. This doesn't mean Big-O fails. It just means that it's not meant to predict running time.
Check out the answer to this question for a great definition of Big-O.
For most algorithms there is an "average case" and a "worst case". If your data routinely falls into the "worst case" scenario, it is possible that another algorithm, while theoretically less efficient in the average case, might prove more efficient for your data.
Some algorithms also have best cases that your data can take advantage of. For example, some sorting algorithms have a terrible theoretical efficiency, but are actually very fast if the data is already sorted (or nearly so). Another algorithm, while theoretically faster in the general case, may not take advantage of the fact that the data is already sorted and in practice perform worse.
For very small data sets sometimes an algorithm that has a better theoretical efficiency may actually be less efficient because of a large "k" value.
One example (that I'm not an expert on) is that simplex algorithms for linear programming have exponential worst-case complexity on arbitrary inputs, even though they perform well in practice. An interesting solution to this is considering "smoothed complexity", which blends worst-case and average-case performance by looking at small random perturbations of arbitrary inputs.
Spielman and Teng (2004) were able to show that the shadow-vertex simplex algorithm has polynomial smoothed complexity.
Big O does not say e.g. that algorithm A runs faster than algorithm B. It can say that the time or space used by algorithm A grows at a different rate than algorithm B, when the input grows. However, for any specific input size, big O notation does not say anything about the performance of one algorithm relative to another.
For example, A may be slower per operation, but have a better big-O than B. B is more performant for smaller input, but if the data size increases, there will be some cut-off point where A becomes faster. Big-O in itself does not say anything about where that cut-off point is.
The general answer is that Big-O allows you to be really sloppy by hiding the constant factors. As mentioned in the question, the use of Fibonacci Heaps is one example. Fibonacci Heaps do have great asymptotic runtimes, but in practice the constants factors are way too large to be useful for the sizes of data sets encountered in real life.
Fibonacci Heaps are often used in proving a good lower bound for asymptotic complexity of graph-related algorithms.
Another similar example is the Coppersmith-Winograd algorithm for matrix multiplication. It is currently the algorithm with the fastest known asymptotic running time for matrix multiplication, O(n2.376). However, its constant factor is far too large to be useful in practice. Like Fibonacci Heaps, it's frequently used as a building block in other algorithms to prove theoretical time bounds.
This somewhat depends on what the Big-O is measuring - when it's worst case scenarios, it will usually "fail" in that the runtime performance will be much better than the Big-O suggests. If it's average case, then it may be much worse.
Big-O notation typically "fails" if the input data to the algorithm has some prior information. Often, the Big-O notation refers to the worst case complexity - which will often happen if the data is either completely random or completely non-random.
As an example, if you feed data to an algorithm that's profiled and the big-o is based on randomized data, but your data has a very well-defined structure, your result times may be much faster than expected. On the same token, if you're measuring average complexity, and you feed data that is horribly randomized, the algorithm may perform much worse than expected.
Small N - And for todays computers, 100 is likely too small to worry.
Hidden Multipliers - IE merge vs quick sort.
Pathological Cases - Again, merge vs quick
One broad area where Big-Oh notation fails is when the amount of data exceeds the available amount of RAM.
Using sorting as an example, the amount of time it takes to sort is not dominated by the number of comparisons or swaps (of which there are O(n log n) and O(n), respectively, in the optimal case). The amount of time is dominated by the number of disk operations: block writes and block reads.
To better analyze algorithms which handle data in excess of available RAM, the I/O-model was born, where you count the number of disk reads. In that, you consider three parameters:
The number of elements, N;
The amount of memory (RAM), M (the number of elements that can be in memory); and
The size of a disk block, B (the number of elements per block).
Notably absent is the amount of disk space; this is treated as if it were infinite. A typical extra assumption is that M > B2.
Continuing the sorting example, you typically favor merge sort in the I/O case: divide the elements into chunks of size θ(M) and sort them in memory (with, say, quicksort). Then, merge θ(M/B) of them by reading the first block from each chunk into memory, stuff all the elements into a heap, and repeatedly pick the smallest element until you have picked B of them. Write this new merge block out and continue. If you ever deplete one of the blocks you read into memory, read a new block from the same chunk and put it into the heap.
(All expressions should be read as being big θ). You form N/M sorted chunks which you then merge. You merge log (base M/B) of N/M times; each time you read and write all the N/B blocks, so it takes you N/B * (log base M/B of N/M) time.
You can analyze in-memory sorting algorithms (suitably modified to include block reads and block writes) and see that they're much less efficient than the merge sort I've presented.
This knowledge is courtesy of my I/O-algorithms course, by Arge and Brodal (http://daimi.au.dk/~large/ioS08/); I also conducted experiments which validate the theory: heap sort takes "almost infinite" time once you exceed memory. Quick sort becomes unbearably slow, merge sort barely bearably slow, I/O-efficient merge sort performs well (the best of the bunch).
I've seen a few cases where, as the data set grew, the algorithmic complexity became less important than the memory access pattern. Navigating a large data structure with a smart algorithm can, in some cases, cause far more page faults or cache misses, than an algorithm with a worse big-O.
For small n, two algorithms may be comparable. As n increases, the smarter algorithm outperforms. But, at some point, n grows big enough that the system succumbs to memory pressure, in which case the "worse" algorithm may actually perform better because the constants are essentially reset.
This isn't particularly interesting, though. By the time you reach this inversion point, the performance of both algorithms is usually unacceptable, and you have to find a new algorithm that has a friendlier memory access pattern AND a better big-O complexity.
This question is like asking, "When does a person's IQ fail in practice?" It's clear that having a high IQ does not mean you'll be successful in life and having a low IQ does not mean you'll perish. Yet, we measure IQ as a means of assessing potential, even if its not an absolute.
In algorithms, the Big-Oh notation gives you the algorithm's IQ. It doesn't necessarily mean that the algorithm will perform best for your particular situation, but there's some mathematical basis that says this algorithm has some good potential. If Big-Oh notation were enough to measure performance you would see a lot more of it and less runtime testing.
Think of Big-Oh as a range instead of a specific measure of better-or-worse. There's best case scenarios and worst case scenarios and a huge set of scenarios in between. Choose your algorithms by how well they fit within the Big-Oh range, but don't rely on the notation as an absolute for measuring performance.
When your data doesn't fit the model, big-o notation will still work, but you're going to see an overlap from best and worst case scenarios.
Also, some operations are tuned for linear data access vs. random data access, so one algorithm while superior in terms of cycles, might be doggedly slow if the method of calling it changes from design. Similarly, if an algorithm causes page/cache misses due to the way it access memory, Big-O isn't going to going to give an accurate estimate of the cost of running a process.
Apparently, as I've forgotten, also when N is small :)
The short answer: always on modern hardware when you start using a lot of memory. The textbooks assume memory access is uniform, and it is no longer. You can of course do Big O analysis for a non-uniform access model, but that is somewhat more complex.
The small n cases are obvious but not interesting: fast enough is fast enough.
In practice I've had problems using the standard collections in Delphi, Java, C# and Smalltalk with a few million objects. And with smaller ones where the dominant factor proved to be the hash function or the compare
Robert Sedgewick talks about shortcomings of the big-O notation in his Coursera course on Analysis of Algorithms. He calls particularly egregious examples galactic algorithms because while they may have a better complexity class than their predecessors, it would take inputs of astronomical sizes for it to show in practice.
https://www.cs.princeton.edu/~rs/talks/AlgsMasses.pdf
Big O and its brothers are used to compare asymptotic mathematical function growth. I would like to emphasize on the mathematical part. Its entirely about being able reduce your problem to a function where the input grows a.k.a scales. It gives you a nice plot where your input (x axis) related to the number of operations performed(y-axis). This is purely based on the mathematical function and as such requires us to accurately model the algorithm used into a polynomial of sorts. Then the assumption of scaling.
Big O immediately loses its relevance when the data is finite, fixed and constant size. Which is why nearly all embedded programmers don't even bother with big O. Mathematically this will always come out to O(1) but we know that we need to optimize our code for space and Mhz timing budget at a level that big O simply doesn't work. This is optimization is on the same order where the individual components matter due to their direct performance dependence on the system.
Big O's other failure is in its assumption that hardware differences do not matter. A CPU that has a MAC, MMU and/or a bit shift low latency math operations will outperform some tasks which may be falsely identified as higher order in the asymptotic notation. This is simply because of the limitation of the model itself.
Another common case where big O becomes absolutely irrelevant is where we falsely identify the nature of the problem to be solved and end up with a binary tree when in reality the solution is actually a state machine. The entire algorithm regimen often overlooks finite state machine problems. This is because a state machine complexity grows based on the number of states and not the number of inputs or data which in most cases are constant.
The other aspect here is the memory access itself which is an extension of the problem of being disconnected from hardware and execution environment. Many times the memory optimization gives performance optimization and vice-versa. They are not necessarily mutually exclusive. These relations cannot be easily modeled into simple polynomials. A theoretically bad algorithm running on heap (region of memory not algorithm heap) data will usually outperform a theoretically good algorithm running on data in stack. This is because there is a time and space complexity to memory access and storage efficiency that is not part of the mathematical model in most cases and even if attempted to model often get ignored as lower order terms that can have high impact. This is because these will show up as a long series of lower order terms which can have a much larger impact when there are sufficiently large number of lower order terms which are ignored by the model.
Imagine n3+86n2+5*106n2+109n
It's clear that the lower order terms that have high multiples will likely together have larger significance than the highest order term which the big O model tends to ignore. It would have us ignore everything other than n3. The term "sufficiently large n' is completely abused to imagine unrealistic scenarios to justify the algorithm. For this case, n has to be so large that you will run out of physical memory long before you have to worry about the algorithm itself. The algorithm doesn't matter if you can't even store the data. When memory access is modeled in; the lower order terms may end up looking like the above polynomial with over a 100 highly scaled lower order terms. However for all practical purposes these terms are never even part of the equation that the algorithm is trying to define.
Most scientific notations are generally the description of mathematical functions and used to model something. They are tools. As such the utility of the tool is constrained and only as good as the model itself. If the model cannot describe or is an ill fit to the problem at hand, then the model simply doesn't serve the purpose. This is when a different model needs to be used and when that doesn't work, a direct approach may serve your purpose well.
In addition many of the original algorithms were models of Turing machine that has a completely different working mechanism and all computing today are RASP models. Before you go into big O or any other model, ask yourself this question first "Am I choosing the right model for the task at hand and do I have the most practically accurate mathematical function ?". If the answer is 'No', then go with your experience, intuition and ignore the fancy stuff.

Resources