In the OCW Advanced Data Structures course, Prof. E. Demaine mentions a Data Structure that is able to find all the points dominated by a query point (b2, b3) using O(n) space and O(k) time, provided that a search for point b3 has already been completed, where k is the size of the output.
The solution works by transforming the above problem into a ray stabbing problem, and using a technique similar to fractional cascading, as shown in the following image from the lecture notes:
While the concept itself is intuitive, implementing the actual data structure is not straightforward at all.
Chazelle describes this technique in a paper as Filtering Search (p. 712).
I would like to find additional literature or answers that describe and explain this data structure and algorithm (perhaps with pseudocode and more images, with a focus on implementation).
Additionally, I would also like to know more about whether this structure can be implemented in a way that is not "static". That is, I would like to be able to insert and delete points from the structure as efficiently as possible.
The book "Computational Geometry: Algorithms and Applications" covers data structures for questions like these. Each chapter has a nice section describing where to learn more, including more complex structures for answering the same problems that are not covered in the book. There are enough diagrams, but not much pseudocode.
Many structures like this can be dynamized using techniques discussed in the book "The Design of Dynamic Data Structures". Jeff Erickson has some nice notes on the topic. Using fractional cascading with it is discussed in "Cache-Oblivious Streaming B-trees"; see the section about "cache-oblivious lookahead arrays".
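One standard technique of this kind is the logarithmic method of Bentley and Saxe, which turns a static structure for a decomposable query into an insert-only dynamic one. Below is a minimal sketch, not the O(k) structure from the lecture: StaticDominance is a hypothetical stand-in that just scans a sorted prefix, "dominated" is assumed to mean x <= qx and y <= qy, and deletions are not handled.

```python
from bisect import bisect_right


class StaticDominance:
    """Hypothetical static structure: points sorted by x, built once."""

    def __init__(self, points):
        self.points = sorted(points)
        self.xs = [p[0] for p in self.points]

    def query(self, qx, qy):
        # Report every stored point (x, y) with x <= qx and y <= qy.
        end = bisect_right(self.xs, qx)
        return [p for p in self.points[:end] if p[1] <= qy]


class LogarithmicMethod:
    """Insert-only dynamization: level i is empty or holds exactly 2**i points."""

    def __init__(self):
        self.levels = []  # levels[i] is None or a StaticDominance

    def insert(self, point):
        carry = [point]
        i = 0
        # Merge full levels upward, like adding 1 to a binary counter.
        while i < len(self.levels) and self.levels[i] is not None:
            carry.extend(self.levels[i].points)
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = StaticDominance(carry)  # rebuild from scratch

    def query(self, qx, qy):
        # The query is decomposable: the answer is the union over all levels.
        result = []
        for level in self.levels:
            if level is not None:
                result.extend(level.query(qx, qy))
        return sorted(result)
```

Each point is rebuilt into a larger level at most O(log n) times, so inserts are cheap in the amortized sense, while a query has to visit up to O(log n) static structures.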
As stated in the title, I'm simply looking for algorithms or solutions one might use to take in the twitter firehose (or a portion of it) and
a) identify questions in general
b) for a question, identify questions that could be the same, with some degree of confidence
Thanks!
(A)
I would try to identify questions using machine learning and the Bag of Words model.
Create a labeled set of tweets, and label each of them with a binary flag: question or not question.
Extract the features from the training set. The features are traditionally words, but whenever I tried it, using bi-grams significantly improved the results (3-grams were not helpful in my cases).
Build a classifier from the data. I have usually found that SVM gives better performance than other classifiers, but you can use others as well, such as Naive Bayes or KNN (but you will probably need a feature selection algorithm for these).
Now you can use your classifier to classify a tweet. (1)
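The steps above can be sketched with nothing but the standard library. This is only an illustration under assumptions: the six training tweets are made up, the features are whitespace-split unigrams plus bigrams, and Naive Bayes stands in for the SVM so that no external library is needed.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Unigrams plus bigrams, lower-cased, split on whitespace."""
    words = text.lower().split()
    bigrams = [a + " " + b for a, b in zip(words, words[1:])]
    return words + bigrams


class NaiveBayes:
    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for f in features(text):
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[f] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up labeled set: True means "question".
TRAIN = [
    ("how do i reset my password", True),
    ("what is the best pizza in town", True),
    ("does anyone know a good dentist", True),
    ("i love this new phone", False),
    ("the best pizza in town is here", False),
    ("just reset my password again", False),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
```

A real labeled set would need to be far larger, and the tokenizer would have to deal with punctuation, mentions and hashtags.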
(B)
This issue is referred to in the world of Information Retrieval as "duplicate detection" or "near-duplicate detection".
You can at least find questions which are very similar to each other using Semantic Interpretation, as described by Markovitch and Gabrilovich in their wonderful article Wikipedia-based Semantic Interpretation for Natural Language Processing. At the very least, it will help you identify if two questions are discussing the same issues (even though not identical).
The idea goes like this:
Use Wikipedia to build, for each term t, a vector that represents its semantics: the entry vector_t[i] is the tf-idf score of the term i as it co-appeared with the term t. The idea is described in detail in the article; reading the first 3-4 pages is enough to understand it, so there is no need to read it all. (2)
For each tweet, construct a vector which is a function of the vectors of its terms. Compare between two vectors - and you can identify if two questions are discussing the same issues.
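As a rough sketch of that last step, assuming the per-term vectors have already been built: here the "function of the vectors of its terms" is simply their sum, and two tweets are compared by cosine similarity. The tiny TERM_VECTORS table and its dimension labels are invented for illustration and are not taken from the article.

```python
import math
from collections import Counter

# Invented toy vectors; real ones would be derived from Wikipedia.
TERM_VECTORS = {
    "python": {"Programming": 0.9, "Snakes": 0.4},
    "java": {"Programming": 0.8, "Coffee": 0.5},
    "espresso": {"Coffee": 0.9},
    "cobra": {"Snakes": 0.9},
}


def text_vector(text, term_vectors):
    """Sum the vectors of the known terms in the text."""
    total = Counter()
    for term in text.lower().split():
        for dim, weight in term_vectors.get(term, {}).items():
            total[dim] += weight
    return total


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(dim, 0.0) for dim, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two tweets would then be treated as "about the same issue" when their cosine similarity exceeds some threshold, which has to be tuned on real data.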
EDIT:
On second thought, the BoW model is not a good fit here, since it ignores the position of terms. However, I believe that if you add NLP processing for extracting features (for example, for each term, also denote whether it is pre-subject or post-subject, as determined by NLP processing), combining it with machine learning will yield pretty good results.
(1) For evaluation of your classifier, you can use cross-validation, and check the expected accuracy.
(2) I know Evgeny Gabrilovich published the implemented algorithm they created as an open source project, just need to look for it.
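For footnote (1), k-fold cross-validation can be sketched as follows. The helper name k_fold_accuracy and the train_fn/predict_fn callbacks are hypothetical; any classifier can be plugged in, and the examples should be shuffled first if they are ordered by label.

```python
def k_fold_accuracy(examples, k, train_fn, predict_fn):
    """Average held-out accuracy over k folds of (input, label) pairs.

    Assumes len(examples) >= k so that no fold is empty.
    """
    accuracies = []
    for fold in range(k):
        # Hold out every k-th example, starting at offset `fold`.
        test = examples[fold::k]
        train = [ex for i, ex in enumerate(examples) if i % k != fold]
        model = train_fn(train)
        correct = sum(1 for x, y in test if predict_fn(model, x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Toy usage: a rule that needs no training and calls anything ending in "?" a question.
examples = [("how?", True), ("hi", False), ("why?", True), ("ok?", False)]
score = k_fold_accuracy(
    examples,
    k=2,
    train_fn=lambda train: None,
    predict_fn=lambda model, x: x.endswith("?"),
)
```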
I have been reading parts of Introduction to Algorithms by Cormen et al, and have implemented some of the algorithms.
In order to test my implementations I wrote some glue code to do file io, then made some sample input by hand and some more sample input by writing programs that generate sample input.
However, I am doubtful as to the quality of my own sample inputs: I may have missed corner cases or the more interesting possibilities, I may have miscalculated the proper output, etc.
Is there a set of test inputs and outputs for various algorithms collected somewhere on the Internet so that I might be able to test my code? I am looking for test data reasonably specific to particular algorithms, rather than contest problems that often involve a problem solving component as well.
I understand that I might have to adjust my code depending on the format the input is collected in (e.g. the various constraints on the inputs; for graph algorithms, the representation of the graph; etc.), although I am hoping that the changes I would have to make would be reasonably trivial.
Edit:
Some particular datasets I am currently looking for are:
Lists of numbers
Skewed so that Quick sort performs badly.
Skewed so that Fibonacci Heap performs particularly well or poorly for specific operations.
Graphs (for which High Performance Mark has offered a number of interesting references)
Sparse graphs (with specific bounds on number of edges),
Dense graphs,
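For the first item in the list, a skewed input is easy to generate yourself if the quicksort under test always picks the first element as its pivot: already-sorted input then forces n(n-1)/2 comparisons. The sketch below assumes that pivot rule (a randomized or median-of-three pivot needs a different adversarial input) and counts comparisons so the effect can be checked.

```python
import random


def quicksort_comparisons(values):
    """Sort a copy with first-element-pivot quicksort; return (sorted, comparisons)."""
    data = list(values)
    comparisons = 0
    stack = [(0, len(data) - 1)]  # explicit stack avoids deep recursion
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = data[lo]
        i = lo
        # Move everything smaller than the pivot to the front of the range.
        for j in range(lo + 1, hi + 1):
            comparisons += 1
            if data[j] < pivot:
                i += 1
                data[i], data[j] = data[j], data[i]
        data[lo], data[i] = data[i], data[lo]  # put the pivot in its final slot
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return data, comparisons


n = 500
sorted_input = list(range(n))            # worst case for this pivot rule
shuffled_input = sorted_input[:]
random.Random(0).shuffle(shuffled_input)  # typical case, fixed seed

_, worst = quicksort_comparisons(sorted_input)
_, typical = quicksort_comparisons(shuffled_input)
```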
Since I am still working through the book, if you are in a similar situation to mine, or you just feel the list could be improved, please feel free to edit the list; some time soon, I may come to need datasets similar to what you are looking for. I am not entirely sure how editing privileges work, but if I have any say over it, I will try to approve it.
I don't know of any one resource which will provide you with sample inputs for all the types of algorithm that Cormen et al cover but for graph datasets here are a couple of references:
Knuth's Stanford Graphbase
and
the Stanford Large Network Dataset Collection
which I stumbled across while looking for the link to the former. You might be interested in this one too:
the Matrix Market
Why not edit your question and let SO know what other types of input you are looking for?
I am going to stick my head on the line and say that I do not know of any such source, and I very much doubt that such a source exists.
As you seem to be aware, algorithms can be applied to almost any sort of data, and so it would be fruitless to attempt to provide sample data.
There are various types of trees I know. For example, binary trees can be classified as binary search trees, two trees, etc.
Can anyone give me a complete classification of all the trees in computer science?
Please provide me with reliable references or web links.
It's virtually impossible to answer this question since there are essentially arbitrarily many different ways of using trees. The issue is that a tree is a structure - it's a way of showing how various pieces of data are linked to one another - and what you're asking for is every possible way of interpreting the meaning of that structure. This would be similar, for example, to asking for all uses of calculus in engineering; calculus is a tool with which you can solve an enormous class of problems, but there's no concise way to explain all possible uses of the integral because in each application it is used a different way.
In the case of trees, I've found that there are thousands of research papers describing different tree structures and ways of using trees to solve problems. They arise in string processing, genomics, computational geometry, theory of computation, artificial intelligence, optimization, operating systems, networking, compilers, and a whole host of other areas. In each of these domains they're used to encode specific structures that are domain-specific and difficult to understand without specialized knowledge of the field. No one reference can cover all these areas in any reasonable depth.
In short, you seem to already know the structure of a tree, and this general notion is transferrable to any of the above domains. But to try to learn every possible way of using this structure or all its applications would be a Herculean undertaking that no one, not even the legendary Don Knuth, could ever hope to achieve in a lifetime.
Wikipedia has a nice compilation of the various trees at the bottom of the page
Dictionary of Algorithms and Data Structures has more information
What specifics are you looking for?
I'm a self-taught developer and, quite frankly, am not all that great at figuring out which search or sort algorithm to use in any particular situation. I was just wondering if there was a Design Patterns-esque listing of the common algorithms available out there in the ether for me to bookmark. Something like:
Name of algorithm (with aliases, if any)
Problem it addresses
Big-O cost
Algorithm itself
Examples
Other algorithms it may be used with/substituted for
I'm just looking for a simple, concise listing of the algorithms I probably should know in one location. Is there anything like this available?
The web site http://www.sorting-algorithms.com/ shows many popular sorting algorithms, and describes their complexity and implementation. It goes the extra step to show, via animations, how those algorithms perform on different types of data (e.g. pre-sorted, sparse, reverse-sorted, etc.).
This site has some examples of sorting algorithms, including visual aids to help you get the hang of it. I personally like the various best/worst/average/few unique cases they show.
Wikipedia has a nice table that lists most of the common sorting algorithms along with classification of them and basic analysis of their complexity characteristics.
The more common sorting algorithms have pseudocode and more in-depth analysis. For less common sorting algorithms, you'll probably have better luck finding details in academic papers or real implementations.
You should read CLRS.
In terms of problem variety, there are millions, and it all comes from puzzles and math.
Skiena has nice problems of many different varieties.
There is a great article on Wikipedia:
http://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_of_algorithms
But I would suggest reading some book. Almost every book has one chapter about sorting.
I want to read a book on data structures and algorithms, but I would like to know if there is any specific topic in discrete mathematics considered very important as a prerequisite to understanding the material presented in a data structures book.
P.S. I am a self-taught programmer; I didn't take any computer science courses.
"Discrete math" is more a buzzword that contains the basics from a dozen different topics (logic, algorithms, theory of computation, number theory, digital design, etc.) all marginally related to programming. Reading a discrete mathematics book would be about the same as reading the first chapter or two of books on all these topics.
The most essential thing to understand is boolean logic, which you're probably already pretty good at if you're self-taught; algorithms are also very important. The theory of computation stuff is fairly interesting, but not really useful unless you're really into algorithms, or want to write your own parser. Number theory is good to learn if you want to get into cryptography.
You don't really need to know any of these things to read about data structures.
Mathematical induction is probably the single most important concept nobody has mentioned yet. It is essential for understanding and proving the properties of algorithms on trees and other inductively defined data structures.
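As a small worked example of the kind of argument meant here (my own illustration, taking the height of a single-node tree to be 0), induction on the height bounds the size of a binary tree:

```latex
\textbf{Claim.} A binary tree of height $h$ has at most $2^{h+1}-1$ nodes.

\textbf{Base case} ($h = 0$): the tree is a single node, and $2^{0+1}-1 = 1$.

\textbf{Inductive step.} Assume the claim holds for all heights up to $h$,
and let $N(h)$ denote the maximum number of nodes in a tree of height $h$.
A tree of height $h+1$ is a root plus two subtrees, each empty or of height
at most $h$, so
\[
N(h+1) \;\le\; 1 + 2\,\bigl(2^{h+1}-1\bigr) \;=\; 2^{h+2}-1 .
\]
```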
BTW, the classic textbook on this topic is Concrete Mathematics: A Foundation for Computer Science, by Ronald Graham, Donald Knuth, and Oren Patashnik.
But life is too short to read a textbook just so you can read a textbook. Dive in. If you find yourself lost, go find the background you need.
Some topics usually found in introductory discrete math books that come in handy in an algorithms/data structure course are:
Some basic probability/statistics: Useful in understanding hashing and randomized algorithms
Most discrete math books have a chapter on graphs and related concepts, things like topological sorting, relations, partial and total orders.
Set theory and formal logic: Essential tools in reasoning about the correctness and complexity of algorithms.
There are probably a few others that escape me at this moment. It's been a while since I left college.
Having said this, a good data-structure/algorithm book often has one or two introductory chapters and sections in most other chapters that are aimed at bringing the reader up to speed on some of the relevant discrete math topics. But IMO, it is better to know this stuff just to have a more thorough understanding, if you have the time and inclination. Otherwise, I don't think you will find yourself stuck if you have a good book.
PS:
The topics I mention are from these two books:
"Discrete and Combinatorial Mathematics: An Applied Introduction" by Grimaldi
"Discrete Mathematics and its Applications" by Rosen
("Concrete Math" is way too heavy to read just for data structures)
Go ahead and read the data structures book, you'll be fine.
For data structures and algorithms I think you will mostly want to know the area of Calculus related to the computing of series limits. This, in turn, involves some knowledge of Algebra.
You need to know how to compute series limits in order to be able to compute algorithm complexity.
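For instance (an illustration of my own, using insertion sort's worst case, where the i-th insertion shifts i elements), the running time is a finite series whose growth is read off from a limit:

```latex
\[
T(n) \;=\; \sum_{i=1}^{n-1} i \;=\; \frac{n(n-1)}{2},
\qquad
\lim_{n\to\infty} \frac{T(n)}{n^{2}} \;=\; \frac{1}{2},
\]
so $T(n) = \Theta(n^{2})$.
```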
If you are interested not only in data structures but in all computer science fields, discrete math includes Boolean algebra and its applications, which are the basis of computer architecture and assembly language, but I don't think it's related to data structures and algorithms.