I recently completed a course in Data Structures and learned about ways of storing data, such as linked chains, trees, arrays, and their various sub-types. Each kind differs in how efficiently it supports operations on the data, such as sorting, finding, or simply adding and removing entries. To keep this question simple, I will focus on traversing a data structure to find a particular entry, which in most average cases is O(n) efficiency* (see notes).
Now in some languages, there exists a technique by which user passed input can be instantiated and used as a new variable within the program. For instance, if I had a method that would do the following
myNewVariable("number1 = 1;"), number1 would be created with a value of 1. I will not focus on any particular language, as this is a question of efficiency, not of "whether it can be done". For the sake of this question, say the method used to create the new variables has a well-defined structure that takes in integers to define new variables. For instance, if a user passed in 1 as input, it would be stored as number1=1, or if the user input is 3, the variable would be number3=3, and so on for however many inputs the user passes. For the sake of this question, we will say the user inputs 100 integers from 1 to 100.
What I seek to know is whether storing data like this has any efficiency bonus over storing it inside a data structure. From my current software perspective, there appears to be an efficiency bonus: sorting the data would no longer have a purpose, because an entry could be found by checking whether the number already exists as a variable, which would be O(1) efficiency when finding a particular entry (all you do is try to find number# where # is the actual number). However, from a hardware perspective, it doesn't appear that anything has changed. We have just offloaded more work for the program in question to remember for itself. Data is data, and while it can be compressed and decompressed, a data entry is a data entry, and the same efficiency concerns would still be in place no matter how large the data actually is.
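To make the comparison concrete: a program that creates variables by name at run time still has to keep those names somewhere, and a name-to-value table is one common way to do that. The following is only a sketch under that assumption, with a hash map standing in for the runtime's variable table. The class and method names (NamedVariables, define, isDefined) are hypothetical and not taken from any particular language.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for "variables created from user input".
// Assumption: the runtime keeps such variables in a name -> value table.
class NamedVariables {
    private final Map<String, Integer> table = new HashMap<>();

    // Plays the role of creating the variable numberN with value N.
    void define(int n) {
        table.put("number" + n, n);
    }

    // Plays the role of asking "does numberN exist?".
    // This is an ordinary hash lookup: O(1) on average, not free.
    boolean isDefined(int n) {
        return table.containsKey("number" + n);
    }

    public static void main(String[] args) {
        NamedVariables vars = new NamedVariables();
        for (int i = 1; i <= 100; i++) {
            vars.define(i);
        }
        System.out.println(vars.isDefined(42));   // true
        System.out.println(vars.isDefined(101));  // false
    }
}
```

Seen this way, the O(1) lookup comes from the table behind the names, which is itself a data structure.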
I want to confirm whether I am wrong (or not) in thinking that storing data outside of a data structure like an array, a linked chain, or a tree has any efficiency bonus whatsoever. If no response can be achieved here in a timely manner, I will ask my professor about it to see if she knows and share it here.
Notes:
(For those unaware, O(n) is big O notation, which describes how the number of steps an algorithm takes grows with the problem size n. O(1) means the number of steps stays constant no matter how large the problem is, which is the best that can be achieved. O(n^2) is worse, because the number of steps grows with the square of the problem size.)
I have learned about dynamic arrays (arrays of non-fixed size), such as vector in C++ and ArrayList in Java, and how they can be implemented.
Basically, when the array is full we create another array of double the size and copy the old items into the new array.
So can we implement an array of non-fixed size with random access, like vector and ArrayList, without spending time copying the old elements?
In other words, is there a data structure like that (dynamic size, random access, and no need to copy elements)?
Depending on what you mean by "like", this ranges from trivially impossible to already existing.
First the trivially impossible. When we create an array, we mark a section of memory as being only for that array. If you have 3 such arrays that can grow without bound, one of them will eventually run into another. Given that we can actually create arrays that are bigger than available memory (it just pages to disk), we have to manage this risk, not avoid it.
But how big an issue is it? Amortized over all appends, copying costs O(1) per element, no matter how big the array gets. And the overhead is low. The cost of this dynamism is that you need to always check where the array starts. But that's a pretty fast check.
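The amortized cost can be checked by counting copies. This is a minimal sketch, not how any particular library implements it: GrowableIntArray and its copies counter are hypothetical names, and the initial capacity of 1 is an arbitrary choice for illustration.

```java
import java.util.Arrays;

// Hypothetical growable int array that counts the element copies caused by resizing.
class GrowableIntArray {
    private int[] data = new int[1]; // assumed initial capacity of 1
    private int size = 0;
    long copies = 0;                 // total elements moved across all resizes

    void add(int value) {
        if (size == data.length) {
            copies += size;                        // each existing element is copied once
            data = Arrays.copyOf(data, size * 2);  // double the capacity
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[index]; // plain O(1) random access
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        GrowableIntArray a = new GrowableIntArray();
        int n = 1_000_000;
        for (int i = 0; i < n; i++) {
            a.add(i);
        }
        // With doubling, the total number of copies stays below 2 * n.
        System.out.println(a.copies + " copies for " + n + " appends");
    }
}
```

Because each resize copies as many elements as the array held, and the capacities form a doubling sequence, the copies sum to less than twice the number of appends.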
Alternately we can move to paged memory. Now an array access looks like, "Check what page it is on, then look at where it is in the page." Now your array can grow, but you never change where anything is. But if you want it to grow without bound, you have to add levels. We can implement it, and it does avoid copying, but this form of indirection has generally NOT proven worth it for general purpose programming. However paging is used in databases. And it is also used by operating systems to manage turning what the program thinks is the address of the data, to the actual address in memory. If you want to dive down that rabbit hole, TLB is worth looking at.
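A single level of the paged approach can be sketched as follows. This is an illustration under assumptions, not a production design: PagedIntArray and PAGE_SIZE are hypothetical, and the page directory is an ordinary ArrayList, so page references (not elements) are still copied occasionally when the directory grows. Avoiding even that is what adding levels is for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical paged int array: an element never moves once it has been written.
class PagedIntArray {
    private static final int PAGE_SIZE = 1024; // assumed page size
    // One level of indirection: a directory of fixed-size pages.
    private final List<int[]> pages = new ArrayList<>();
    private int size = 0;

    void add(int value) {
        if (size == pages.size() * PAGE_SIZE) {
            pages.add(new int[PAGE_SIZE]); // grow by adding a page, not by copying elements
        }
        pages.get(size / PAGE_SIZE)[size % PAGE_SIZE] = value;
        size++;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // "Check what page it is on, then look at where it is in the page."
        return pages.get(index / PAGE_SIZE)[index % PAGE_SIZE];
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        PagedIntArray a = new PagedIntArray();
        for (int i = 0; i < 5000; i++) {
            a.add(i * 2);
        }
        System.out.println(a.get(4999)); // 9998
    }
}
```

Every access pays for the extra lookup in the directory, which is the indirection cost mentioned above.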
But there are other options that exist as well. Instead of fixed sized pages, we can have variable sized ones. This approach gets very complicated, very quickly. But the result is very useful. Look up ropes for more.
The browser that I wrote this on stores the text of what I wrote using a rope. This is how it can easily offer features like multi-level undo and editing in the middle of the document. However, the raw performance cost of such schemes is significant. It is clearly worthwhile if you need the features, but otherwise we don't do it.
In short, every set of choices we make has tradeoffs. The design you'd like to improve on has proven to offer the best tradeoff between dynamic size and raw performance. That's why it appears everywhere from Python lists to C++ vectors.
I have a question about fundamentals in data structures.
I understand that an array's access time is faster than a linked list's: O(1) for the array vs. O(N) for the linked list.
But a linked list beats an array at removing an element, since no shifting is needed: O(N) for the array vs. O(1) for the linked list.
So my understanding is that if the majority of operations on the data is delete then using a linked list is preferable.
But if the use case is:
delete elements but not too frequently
access ALL elements
Is there a clear winner? In a general case I understand that the downside of using the list is that I access each node which could be on a separate page while an array has better locality.
But is this a theoretical or an actual concern that I should have?
And is the mixed type, i.e. creating a linked list from an array (using extra fields), a good idea?
Also does my question depend on the language? I assume that shifting elements in array has the same cost in all languages (at least asymptotically)
Singly-linked lists are very useful and can be better performance-wise relative to arrays if you are doing a lot of insertions/deletions, as opposed to pure referencing.
I haven't seen a good use for doubly-linked lists for decades.
I suppose there are some.
In terms of performance, never make decisions without understanding relative performance of your particular situation.
It's fairly common to see people asking about things that, comparatively speaking, are like getting a haircut to lose weight.
Before writing an app, I first ask if it should be compute-bound or IO-bound.
If IO-bound I try to make sure it actually is, by avoiding inefficiencies in IO, and keeping the processing straightforward.
If it should be compute-bound then I look at what its inner loop is likely to be, and try to make that swift.
Regardless, no matter how much I try, there will be (sometimes big) opportunities to make it go faster, and to find them I use this technique.
Whatever you do, don't just try to think it out or go back to your class notes.
Your problem is different from anyone else's, and so is the solution.
The problem with a list is not just the fragmentation, but mostly the data dependency. If you access every Nth element in array you don't have locality, but the accesses may still go to memory in parallel since you know the address. In a list it depends on the data being retrieved, and therefore traversing a list effectively serializes your memory accesses, causing it to be much slower in practice. This of course is orthogonal to asymptotic complexities, and would harm you regardless of the size.
Over the past few days I have been preparing for my very first phone interview for a software development job. In researching questions I came across this article.
Everything was great until I got to this passage,
"When would you use a linked list vs. a vector? "
Now from experience and research these are two very different data structures, a linked list being a dynamic array and a vector being a 2d point in space. The only correlation I can see between the two is if you use a vector as a linked list, say myVector(my value, pointer to neighbor)
Thoughts?
Vector is another name for dynamic arrays. It is the name used for the dynamic array data structure in C++. If you have experience in Java you may know them with the name ArrayList. (Java also has an old collection class called Vector that is not used nowadays because of problems in how it was designed.)
Vectors are good for random read access and insertion and deletion in the back (takes amortized constant time), but bad for insertions and deletions in the front or any other position (linear time, as items have to be moved). Vectors are usually laid out contiguously in memory, so traversing one is efficient because the CPU memory cache gets used effectively.
Linked lists on the other hand are good for inserting and deleting items in the front or back (constant time), but not particularly good for much else: For example deleting an item at an arbitrary index in the middle of the list takes linear time because you must first find the node. On the other hand, once you have found a particular node you can delete it or insert a new item after it in constant time, something you cannot do with a vector. Linked lists are also very simple to implement, which makes them a popular data structure.
I know it's a bit late for this questioner but this is a very insightful video from Bjarne Stroustrup (the inventor of C++) about why you should avoid linked lists with modern hardware.
https://www.youtube.com/watch?v=YQs6IC-vgmo
With the fast memory allocation on computers today, it is much quicker to create a copy of the vector with the items updated.
I don't like the number one answer here, so I figured I'd share some actual research into this conducted by Herb Sutter from Microsoft. The test used up to 100k items in a container, but he also claimed that the vector would continue to outperform a linked list at even half a million entities. Unless you plan on your container having millions of entities, your default choice for a dynamic container should be the vector. I summarized more or less what he says, but will also link the reference at the bottom:
"[Even if] you preallocate the nodes within a linked list, that gives you half the performance back, but it's still worse [than a vector]. Why? First of all it's more space -- The per element overhead (is part of the reason) -- the forward and back pointers involved within a linked list -- but also (and more importantly) the access order. The linked list has to traverse to find an insertion point, doing all this pointer chasing, which is the same thing the vector was doing, but what actually is occurring is that prefetchers are that fast. Performing linear traversals with data that is mapped efficiently within memory (allocating and using say, a vector of pointers that is defined and laid out), it will outperform linked lists in nearly every scenario."
https://youtu.be/TJHgp1ugKGM?t=2948
Use vector unless "data size is big" or "strong safety guarantee is essential".
data size is big
Inserting in the middle of a vector takes linear time (because of the need to shuffle things around), but its other operations, such as accessing the nth element, are constant time. So there is not much overhead if the data size is small.
As per "C++ Coding Standards" by Herb Sutter and Andrei Alexandrescu:
"Using a vector for small lists is almost always superior to using list. Even though insertion in the middle of the sequence is a linear-time operation for vector and a constant-time operation for list, vector usually outperforms list when containers are relatively small because of its better constant factor, and list's Big-Oh advantage doesn't kick in until data sizes get larger."
strong safety guarantee
list provides the strong safety guarantee for insertion: if an exception is thrown, the container is left unchanged.
http://www.cplusplus.com/reference/list/list/insert/
As a correction on the Big O time of insertion and deletion within a linked list, if you have a pointer that holds the position of the current element, and methods used to move it around the list, (like .moveToStart(), .moveToEnd(), .next() etc), you can remove and insert in constant time.
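In Java, the cursor described above corresponds to a ListIterator over a LinkedList. The sketch below is only an illustration (CursorEditDemo and edit are hypothetical names): while walking the list once, it removes and inserts at the cursor, and each of those edits is constant time because no search or shifting is needed.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

class CursorEditDemo {
    // Removes every negative value and inserts a 0 after each value equal to marker.
    static List<Integer> edit(List<Integer> input, int marker) {
        LinkedList<Integer> list = new LinkedList<>(input);
        ListIterator<Integer> cursor = list.listIterator();
        while (cursor.hasNext()) {
            int value = cursor.next();
            if (value < 0) {
                cursor.remove(); // unlink the node just returned: O(1)
            } else if (value == marker) {
                cursor.add(0);   // insert right after it: O(1); the new 0 is not revisited
            }
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(edit(Arrays.asList(1, -2, 3, -4, 3), 3)); // [1, 3, 0, 3, 0]
    }
}
```

The walk itself is still linear; only the edits at the current position are constant time.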
I've been coding for quite some time now, and my work pertains to solving real-world business scenarios. However, I have not really come across any practical usage of some of the data structures like Linked Lists, Queues, Stacks, etc.
Not even at the business framework level. Of course, there is the ubiquitous HashTable, ArrayList and of late the List...but is there any practical usage of some of the other basic data structures?
It would be great if someone gave a real-world solution where a Doubly Linked List "performs" better than the obvious easily usable counterpart.
Of course it’s possible to get by with only a Map (aka HashTable) and a List. A Queue is only a glorified List but if you use a Queue everywhere you really need a queue then your code gets a lot more readable because nobody has to guess what you are using that List for.
And then there are algorithms that work a lot better when the underlying data structure is not a plain List but a DoublyLinkedList due to the way they have to navigate the list. The same is valid for all other data structures: there’s always a use for them. :)
Stacks can be used for pairing (parsing), such as matching open brackets to closing brackets.
Queues can be used for messaging, or activity processing.
Linked lists, or doubly linked lists, can be used for circular navigation.
Most of these algorithms are usually at a lower level than your usual "business" application. For example, indices on a database are a variation of a multiply linked list. The function-calling mechanism (or a parse tree) is implemented with a stack. Queues and FIFOs are used for servicing network requests, etc.
These are just examples of collection structures that are optimized for speed in various scenarios.
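The bracket-matching use of a stack mentioned above can be sketched like this. BracketMatcher and isBalanced are hypothetical names, and the sketch only looks at the three bracket pairs shown, ignoring every other character.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BracketMatcher {
    // Returns true if every opening bracket is closed by the matching bracket, in order.
    static boolean isBalanced(String text) {
        Deque<Character> expected = new ArrayDeque<>(); // closers we are still waiting for
        for (char c : text.toCharArray()) {
            switch (c) {
                case '(': expected.push(')'); break;
                case '[': expected.push(']'); break;
                case '{': expected.push('}'); break;
                case ')':
                case ']':
                case '}':
                    // The most recently opened bracket must be the one being closed.
                    if (expected.isEmpty() || expected.pop() != c) {
                        return false;
                    }
                    break;
                default:
                    break; // not a bracket, ignore
            }
        }
        return expected.isEmpty(); // anything left over was never closed
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("{[()]}")); // true
        System.out.println(isBalanced("([)]"));   // false
    }
}
```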
LIFO-Stack and FIFO-Queue are reasonably abstract (behavioral spec-level) data structures, so of course there are plenty of practical uses for them. For example, LIFO-Stack is a great way to help remove recursion (stack up the current state and loop, instead of making a recursive call); FIFO-Queue helps "buffer up" and "peel away" work nuggets in a coroutine arrangement; etc, etc.
Doubly-linked-List is more of an implementation issue than a behavioral spec-level one, mostly... can be a good way to implement a FIFO-Queue, for example. If you need a sequence with fast splicing and removal given a pointer to one sequence item, you'll find plenty of other real-world uses, too.
I use queues, linked lists etc. in business solutions all the time.
Except they are implemented by Oracle, IBM, JMS etc.
These constructs are generally at a much lower level of abstraction than you would want while implementing a business solution. Where a business problem would benefit from such low-level constructs (e.g. delivery route planning, production line scheduling, etc.), there is usually a package available to do it for you.
I don't use them very often, but they do come up. For example, I'm using a queue in a current project to process asynchronous character equipment changes that must happen in the order the user makes them.
A linked list is useful if you have a subset of "selected" items out of a larger set of items, where you must perform one type of operation on a "selected" item and a default operation or no operation at all on a normal item and the set of "selected" items can change at will (possibly due to user input). Because linked list removal can be done nearly instantaneously (vs. the traversal time it would take for an array search), if the subsets are large enough then it's faster to maintain a linked list than to either maintain an array or regenerate the whole subset by scanning through the whole larger set every time you need the subset.
With a hash table or binary tree, you could search for a single "selected" item, but you couldn't search for all "selected" items without checking every item (or having a separate dictionary for every permutation of selected items, which is obviously impractical).
A queue can be useful if you are in a scenario where you have a lot of requests coming in and you want to make sure to handle them fairly, in order.
I use stacks whenever I have a recursive algorithm, which usually means it's operating on some hierarchical data structure, and I want to print an error message if I run out of memory instead of simply letting the software crash if the program stack runs out of space. Instead of calling the function recursively, I store its local variables in an object, run a loop, and maintain a stack of those objects.
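As a rough illustration of that last technique, here is a tree walk that keeps its pending work on an explicit stack instead of the call stack. It is a sketch under assumptions: TreeNode and IterativeTreeSum are hypothetical, the only "local variable" that needs saving here is the node itself, and the out-of-memory reporting described above is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical hierarchical data: each node has a value and any number of children.
class TreeNode {
    final int value;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(int value) {
        this.value = value;
    }
}

class IterativeTreeSum {
    // Sums all values without recursion; the explicit stack lives on the heap.
    static long sum(TreeNode root) {
        long total = 0;
        Deque<TreeNode> pending = new ArrayDeque<>();
        if (root != null) {
            pending.push(root);
        }
        while (!pending.isEmpty()) {
            TreeNode node = pending.pop();
            total += node.value;
            for (TreeNode child : node.children) {
                pending.push(child); // replaces the recursive call on each child
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // A chain deep enough that a recursive version may overflow the call stack.
        TreeNode root = new TreeNode(1);
        TreeNode current = root;
        for (int i = 1; i < 100_000; i++) {
            TreeNode next = new TreeNode(1);
            current.children.add(next);
            current = next;
        }
        System.out.println(sum(root)); // 100000
    }
}
```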
I'm trying to think of a naming convention that accurately conveys what's going on within a class I'm designing. On a secondary note, I'm trying to decide between two almost-equivalent user APIs.
Here's the situation:
I'm building a scientific application, where one of the central data structures has three phases: 1) accumulation, 2) analysis, and 3) query execution.
In my case, it's a spatial modeling structure, internally using a KDTree to partition a collection of points in 3-dimensional space. Each point describes one or more attributes of the surrounding environment, with a certain level of confidence about the measurement itself.
After adding (a potentially large number of) measurements to the collection, the owner of the object will query it to obtain an interpolated measurement at a new data point somewhere within the applicable field.
The API will look something like this (the code is in Java, but that's not really important; the code is divided into three sections, for clarity):
// SECTION 1:
// Create the aggregation object, and get the zillion objects to insert...
ContinuousScalarField field = new ContinuousScalarField();
Collection<Measurement> measurements = getMeasurementsFromSomewhere();
// SECTION 2:
// Add all of the zillion objects to the aggregation object...
// Each measurement contains its xyz location, the quantity being measured,
// and a numeric value for the measurement. For example, something like
// "68 degrees F, plus or minus 0.5, at point 1.23, 2.34, 3.45"
for (Measurement m : measurements) {
    field.add(m);
}
// SECTION 3:
// Now the user wants to ask the model questions about the interpolated
// state of the model. For example, "what's the interpolated temperature
// at point (3, 4, 5)?"
Point3d p = new Point3d(3, 4, 5);
Measurement result = field.interpolateAt(p);
For my particular problem domain, it will be possible to perform a small amount of incremental work (partitioning the points into a balanced KDTree) during SECTION 2.
And there will be a small amount of work (performing some linear interpolations) that can occur during SECTION 3.
But there's a huge amount of work (constructing a kernel density estimator and performing a Fast Gauss Transform, using Taylor series and Hermite functions, but that's totally beside the point) that must be performed between sections 2 and 3.
Sometimes in the past, I've just used lazy-evaluation to construct the data structures (in this case, it'd be on the first invocation of the "interpolateAt" method), but then if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch.
In other projects, I've required the user to explicitly call an "object.flip()" method, to switch from "append mode" into "query mode". The nice thing about a design like this is that the user has better control over the exact moment when the hard-core computation starts. But it can be a nuisance for the API consumer to keep track of the object's current mode. And besides, in the standard use case, the caller never adds another value to the collection after starting to issue queries; data-aggregation almost always fully precedes query preparation.
How have you guys handled designing a data structure like this?
Do you prefer to let an object lazily perform its heavy-duty analysis, throwing away the intermediate data structures when new data comes into the collection? Or do you require the programmer to explicitly flip the data structure from append-mode into query-mode?
And do you know of any naming convention for objects like this? Is there a pattern I'm not thinking of?
ON EDIT:
There seems to be some confusion and curiosity about the class I used in my example, named "ContinuousScalarField".
You can get a pretty good idea for what I'm talking about by reading these wikipedia pages:
http://en.wikipedia.org/wiki/Scalar_field
http://en.wikipedia.org/wiki/Vector_field
Let's say you wanted to create a topographical map (this is not my exact problem, but it's conceptually very similar). So you take a thousand altitude measurements over an area of one square mile, but your survey equipment has a margin of error of plus-or-minus 10 meters in elevation.
Once you've gathered all the data points, you feed them into a model which not only interpolates the values, but also takes into account the error of each measurement.
To draw your topo map, you query the model for the elevation of each point where you want to draw a pixel.
As for the question of whether a single class should be responsible for both appending and handling queries, I'm not 100% sure, but I think so.
Here's a similar example: HashMap and TreeMap classes allow objects to be both added and queried. There aren't separate interfaces for adding and querying.
Both classes are also similar to my example, because the internal data structures have to be maintained on an ongoing basis in order to support the query mechanism. The HashMap class has to periodically allocate new memory, re-hash all objects, and move objects from the old memory to the new memory. A TreeMap has to continually maintain tree balance, using the red-black-tree data structure.
The only difference is that my class will perform optimally if it can perform all of its calculations once it knows the data set is closed.
If an object has two modes like this, I would suggest exposing two interfaces to the client. If the object is in append mode, then you make sure that the client can only ever use the IAppendable implementation. To flip to query mode, you add a method to IAppendable such as AsQueryable. To flip back, call IQueryable.AsAppendable.
You can implement IAppendable and IQueryable on the same object, and keep track of the state in the same way internally, but having two interfaces makes it clear to the client what state the object is in, and forces the client to deliberately make the (expensive) switch.
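A minimal Java sketch of the two-interface idea follows. The names are hypothetical (AppendableField and QueryableField are used instead of IAppendable and IQueryable, partly to avoid clashing with java.lang.Appendable), plain doubles stand in for Measurement, and computing a mean stands in for the expensive analysis.

```java
import java.util.ArrayList;
import java.util.List;

// View offered while the object is in append mode.
interface AppendableField {
    void add(double sample);
    QueryableField asQueryable(); // the deliberate, expensive switch
}

// View offered while the object is in query mode.
interface QueryableField {
    double mean();
    AppendableField asAppendable();
}

// One object implements both views and tracks its state internally.
class SampleField implements AppendableField, QueryableField {
    private final List<Double> samples = new ArrayList<>();
    private double cachedMean;

    // Clients start with the append-only view.
    static AppendableField create() {
        return new SampleField();
    }

    @Override
    public void add(double sample) {
        samples.add(sample);
    }

    @Override
    public QueryableField asQueryable() {
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        cachedMean = samples.isEmpty() ? 0 : sum / samples.size(); // placeholder analysis
        return this;
    }

    @Override
    public double mean() {
        return cachedMean;
    }

    @Override
    public AppendableField asAppendable() {
        return this;
    }

    public static void main(String[] args) {
        AppendableField field = SampleField.create();
        field.add(1.0);
        field.add(3.0);
        QueryableField query = field.asQueryable();
        System.out.println(query.mean()); // 2.0
    }
}
```

A client holding only an AppendableField reference cannot call mean() without first going through asQueryable(), which is the point of the split. A determined client could still cast, or keep using an old reference after switching, so this makes the state visible rather than enforcing it.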
I generally prefer to have an explicit change, rather than lazily recomputing the result. This approach makes the performance of the utility more predictable, and it reduces the amount of work I have to do to provide a good user experience. For example, if this occurs in a UI, where do I have to worry about popping up an hourglass, etc.? Which operations are going to block for a variable amount of time, and need to be performed in a background thread?
That said, rather than explicitly changing the state of one instance, I would recommend the Builder Pattern to produce a new object. For example, you might have an aggregator object that does a small amount of work as you add each sample. Then instead of your proposed void flip() method, I'd have an Interpolator interpolator() method that gets a copy of the current aggregation and performs all your heavy-duty math. Your interpolateAt method would be on this new Interpolator object.
If your usage patterns warrant, you could do simple caching by keeping a reference to the interpolator you create, and return it to multiple callers, only clearing it when the aggregator is modified.
This separation of responsibilities can help yield more maintainable and reusable object-oriented programs. An object that can return a Measurement at a requested Point is very abstract, and perhaps a lot of clients could use your Interpolator as one strategy implementing a more general interface.
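The aggregator/interpolator split with simple caching might look roughly like this. It is a sketch, not the real math: Aggregator and Interpolator follow the names used above, but plain doubles stand in for Measurement and Point3d, and a mean stands in for the heavy-duty computation.

```java
import java.util.ArrayList;
import java.util.List;

// Immutable result of the expensive step, built from a snapshot of the samples.
final class Interpolator {
    private final double mean;

    Interpolator(List<Double> snapshot) {
        double sum = 0;
        for (double s : snapshot) {
            sum += s;
        }
        this.mean = snapshot.isEmpty() ? 0 : sum / snapshot.size(); // placeholder for the heavy math
    }

    // Placeholder: a real implementation would use the query location x.
    double interpolateAt(double x) {
        return mean;
    }
}

class Aggregator {
    private final List<Double> samples = new ArrayList<>();
    private Interpolator cached; // cleared whenever the aggregation changes

    void add(double sample) {
        samples.add(sample);
        cached = null;
    }

    // Builds the interpolator from a copy of the current data, reusing it until the next add.
    Interpolator interpolator() {
        if (cached == null) {
            cached = new Interpolator(new ArrayList<>(samples));
        }
        return cached;
    }

    public static void main(String[] args) {
        Aggregator aggregator = new Aggregator();
        aggregator.add(2.0);
        aggregator.add(4.0);
        System.out.println(aggregator.interpolator().interpolateAt(0.0)); // 3.0
    }
}
```

Because an Interpolator is built from a copy, one that a caller already holds keeps answering from the data it was built with, even if more samples are added later.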
I think that the analogy you added is misleading. Consider an alternative analogy:
Key[] data = new Key[...];
data[idx++] = new Key(...); /* Fast! */
...
Arrays.sort(data); /* Slow! */
...
boolean contains = Arrays.binarySearch(data, datum) >= 0; /* Fast! */
This can work like a set, and actually, it gives better performance than Set implementations (which are implemented with hash tables or balanced trees).
A balanced tree can be seen as an efficient implementation of insertion sort. After every insertion, the tree is in a sorted state. The predictable time requirements of a balanced tree are due to the fact the cost of sorting is spread over each insertion, rather than happening on some queries and not others.
The rehashing of hash tables does result in less consistent performance, and because of that, hash tables aren't appropriate for certain applications (perhaps a real-time microcontroller). But even the rehashing operation depends only on the load factor of the table, not the pattern of insertion and query operations.
For your analogy to hold strictly, you would have to "sort" (do the hairy math) your aggregator with each point you add. But it sounds like that would be cost prohibitive, and that leads to the builder or factory method patterns. This makes it clear to your clients when they need to be prepared for the lengthy "sort" operation.
Your objects should have one role and responsibility. In your case should the ContinuousScalarField be responsible for interpolating?
Perhaps you might be better off doing something like:
IInterpolator interpolator = field.GetInterpolator();
Measurement measurement = interpolator.InterpolateAt(...);
I hope this makes sense, but without fully understanding your problem domain it's hard to give you a more coherent answer.
"I've just used lazy-evaluation to construct the data structures" -- Good
"if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch." -- Interesting
"in the standard use case, the caller never adds another value to the collection after starting to issue queries" -- Whoops, false alarm, actually not interesting.
Since lazy eval fits your use case, stick with it. That's a very heavily used model because it is so delightfully reliable and fits most use cases very well.
The only reason for rethinking this is (a) the use case change (mixed adding and interpolation), or (b) performance optimization.
Since use case changes are unlikely, you might consider the performance implications of breaking up interpolation. For example, during idle time, can you precompute some values? Or with each add is there a summary you can update?
Also, a highly stateful (and not very meaningful) flip method isn't so useful to clients of your class. However, breaking interpolation into two parts might still be helpful to them -- and help you with optimization and state management.
You could, for example, break interpolation into two methods.
public void interpolateAt( Point3d p );
public Measurement interpolatedMeasurement();
This borrows the relational database Open and Fetch paradigm. Opening a cursor can do a lot of preliminary work, and may start executing the query, you don't know. Fetching the first row may do all the work, or execute the prepared query, or simply fetch the first buffered row. You don't really know. You only know that it's a two part operation. The RDBMS developers are free to optimize as they see fit.
Do you prefer to let an object lazily perform its heavy-duty analysis,
throwing away the intermediate data structures when new data comes
into the collection? Or do you require the programmer to explicitly
flip the data structure from append-mode into query-mode?
I prefer using data structures that allow me to incrementally add to it with "a little more work" per addition, and to incrementally pull the data I need with "a little more work" per extraction.
Perhaps if you do some "interpolate_at()" call in the upper-right corner of your region, you only need to do calculations involving the points in that upper-right corner,
and it doesn't hurt anything to leave the other 3 quadrants "open" to new additions.
(And so on down the recursive KDTree).
Alas, that's not always possible -- sometimes the only way to add more data is to throw away all the previous intermediate and final results, and re-calculate everything again from scratch.
The people who use the interfaces I design -- in particular, me -- are human and fallible.
So I don't like using objects where those people must remember to do things in a certain way, or else things go wrong -- because I'm always forgetting those things.
If an object must be in the "post-calculation state" before getting data out of it,
i.e. some "do_calculations()" function must be run before the interpolateAt() function gets valid data,
I much prefer letting the interpolateAt() function check if it's already in that state,
running "do_calculations()" and updating the state of the object if necessary,
and then returning the results I expected.
Sometimes I hear people describe this as "freezing" the data, "crystallizing" the data, "compiling" it, or "putting the data into an immutable data structure".
One example is converting a (mutable) StringBuilder or StringBuffer into an (immutable) String.
I can imagine that for some kinds of analysis, you expect to have all the data ahead of time,
and pulling out some interpolated value before all the data has been put in would give wrong results.
In that case,
I'd prefer to set things up such that the "add_data()" function fails or throws an exception
if it (incorrectly) gets called after any interpolateAt() call.
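A small sketch of that "freeze on first query" behaviour, under assumptions: FreezingField is a hypothetical class, doubles stand in for the measurements, a mean stands in for "do_calculations()", and IllegalStateException is just one reasonable choice of exception.

```java
import java.util.ArrayList;
import java.util.List;

class FreezingField {
    private final List<Double> samples = new ArrayList<>();
    private boolean frozen = false;
    private double mean;

    // Fails loudly if data arrives after the first query.
    void addData(double sample) {
        if (frozen) {
            throw new IllegalStateException("addData() called after interpolateAt()");
        }
        samples.add(sample);
    }

    // The first call runs the calculations and freezes the object.
    double interpolateAt(double x) {
        if (!frozen) {
            doCalculations();
            frozen = true;
        }
        return mean; // placeholder: a real implementation would use x
    }

    private void doCalculations() {
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        mean = samples.isEmpty() ? 0 : sum / samples.size(); // placeholder for the heavy analysis
    }

    public static void main(String[] args) {
        FreezingField field = new FreezingField();
        field.addData(2.0);
        field.addData(4.0);
        System.out.println(field.interpolateAt(0.0)); // 3.0
        try {
            field.addData(6.0);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```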
I would consider defining a lazily-evaluated "interpolated_point" object that doesn't really evaluate the data right away, but only tells the program that sometime in the future the data at that point will be required.
The collection isn't actually frozen, so it's OK to continue adding more data to it,
up until the point something actually extracts the first real value from some "interpolated_point" object,
which internally triggers the "do_calculations()" function and freezes the object.
It might speed things up if you know not only all the data, but also all the points that need to be interpolated, all ahead of time.
Then you can throw away data that is "far away" from the interpolated points,
and only do the heavy-duty calculations in regions "near" the interpolated points.
For other kinds of analysis, you do the best you can with the data you have, but when more data comes in later, you want to use that new data in your later analysis.
If the only way to do that is to throw away all the intermediate results and recalculate everything from scratch, then that's what you have to do.
(And it's best if the object automatically handled this, rather than requiring people to remember to call some "clear_cache()" and "do_calculations()" function every time).
You could have a state variable. Have a method for starting the high level processing, which will only work if the STATE is in SECTION-1. It will set the state to SECTION-2, and then to SECTION-3 when it is done computing. If there's a request to the program to interpolate a given point, it will check if the state is SECTION-3. If not, it will request the computations to begin, and then interpolate the given data.
This way, you accomplish both - the program will perform its computations at the first request to interpolate a point, but can also be requested to do so earlier. This would be convenient if you wanted to run the computations overnight, for example, without needing to request an interpolation.
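The state-variable idea could be sketched like this. Everything here is illustrative: StagedField, the State names, and prepare() are hypothetical, a mean stands in for the real computation, and resetting to the first state when data is added after computing is my own assumption, since the description above does not say what should happen in that case.

```java
import java.util.ArrayList;
import java.util.List;

class StagedField {
    // Stand-ins for SECTION-1, SECTION-2 and SECTION-3.
    enum State { ACCUMULATING, COMPUTING, READY }

    private final List<Double> samples = new ArrayList<>();
    private State state = State.ACCUMULATING;
    private double mean;
    int computeRuns = 0; // counts how often the expensive step actually ran

    void add(double sample) {
        samples.add(sample);
        state = State.ACCUMULATING; // assumption: new data invalidates earlier results
    }

    // Starts the heavy processing; does nothing unless we are still accumulating.
    void prepare() {
        if (state != State.ACCUMULATING) {
            return;
        }
        state = State.COMPUTING;
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        mean = samples.isEmpty() ? 0 : sum / samples.size(); // placeholder for the heavy analysis
        computeRuns++;
        state = State.READY;
    }

    // Triggers the computation on demand if nobody asked for it earlier.
    double interpolateAt(double x) {
        if (state != State.READY) {
            prepare();
        }
        return mean; // placeholder: a real implementation would use x
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        StagedField field = new StagedField();
        field.add(1.0);
        field.add(5.0);
        field.prepare();                              // e.g. started overnight
        System.out.println(field.interpolateAt(0.0)); // 3.0
        System.out.println(field.computeRuns);        // 1
    }
}
```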