Data Structures and data representation in a given context

I have started dedicating time to learning algorithms and data structures. So my first and most basic question is: how do we represent data depending on the context?
I have given it time and thought, and came up with this conclusion.
Groups of the same kind of data -> lists/arrays
Classification of data [like population by gender, then by age, etc.] -> trees
Relations [like the relations between a product bought and other products] -> graphs
I am posting this question to find out what the Stack Overflow community thinks about my interpretation of data structures. Since it is a generic topic, I could not find a justification for my thinking online. Please correct me if I am wrong.

This looks like an oversimplification.
The data structure we want to use depends on what we are going to do with the data.
For example, when we store records about people and need fast access by index, we can use an array.
When we store the same records about people but need to find them by name quickly, we can use a search tree.
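As a tiny illustration (a minimal sketch; the records and names are made up, and a plain dict stands in for the search structure):

    # Records about people stored in an array/list: O(1) access by position.
    people = [
        {"name": "Alice", "age": 34},
        {"name": "Bob", "age": 27},
        {"name": "Carol", "age": 41},
    ]
    print(people[1])                          # fast access by index

    # The same records keyed by name for fast lookup.
    # (A balanced search tree would also keep the keys ordered;
    # a hash-based dict gives expected O(1) lookup.)
    by_name = {p["name"]: p for p in people}
    print(by_name["Carol"])                   # fast access by name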
Graphs are a theoretical concept, not a data structure.
They can be stored as an adjacency matrix (two-dimensional array, suitable for small or dense graphs), or as lists of adjacent edges (array/list of dynamic arrays/lists, suitable for large or sparse graphs), or implicitly (generated on the fly), or otherwise.
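For example, the same small directed graph stored both ways (a minimal sketch; the vertices and edges are arbitrary):

    # Directed graph with vertices 0..3 and edges 0->1, 0->2, 1->2, 2->3.
    edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
    n = 4

    # Adjacency matrix: O(n^2) memory, O(1) edge lookup; good for small or dense graphs.
    matrix = [[False] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = True

    # Adjacency lists: O(n + m) memory; good for large or sparse graphs.
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    print(matrix[0][2], adj[2])  # True [3]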

Related

Building a Decision Tree based on a sparse, multi-value matrix

Decision tree learning algorithms have a general advantage over e.g. NNs / RNNs: their internal structure is humanly comprehensible and can be sanity-checked, and it can even improve human decision making. Algorithms like ID3 are excellent for building toy models from even non-toy-scale data.
For a recent project of mine, there is a large existing database that contains the outcome column for all rows; however, there are two extra complexities:
Not all column values are filled for every training sample (the matrix is sparse), and
Values aren't binary: there are 2-5 (up to ~7) values for many columns.
I'd like to automatically train and learn the closest-matching decision tree given the data above, and so I turn to the nice community of Stack Overflow:
What specific algorithm can I use to extract a decision tree, given the constraints above?
Edit note: insta-accept for naming an algorithm, giving a short description of how it works, and linking to an OSS implementation (in any language) thereof. Thanks!
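No specific algorithm is named in this excerpt, but as one possible starting point: a CART-style tree (as implemented in scikit-learn) can handle the multi-valued columns once they are encoded, and the sparse/missing entries can be imputed first. A rough sketch under those assumptions, with invented column names and data:

    # Rough sketch: impute missing entries, encode multi-valued (categorical)
    # columns as integers, then fit a CART-style tree with scikit-learn.
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier

    # Toy stand-in for the sparse, multi-valued matrix (column names invented).
    X = pd.DataFrame({
        "color":  ["red", "blue", np.nan, "green", "red", "blue"],
        "size":   ["S", np.nan, "L", "M", "M", "S"],
        "region": ["eu", "us", "us", np.nan, "eu", "apac"],
    })
    y = ["yes", "no", "no", "yes", "yes", "no"]

    model = make_pipeline(
        SimpleImputer(strategy="most_frequent"),   # fill the gaps in the sparse matrix
        OrdinalEncoder(),                          # 2-7 values per column -> integers
        DecisionTreeClassifier(max_depth=3, random_state=0),
    )
    model.fit(X, y)
    print(model.predict(X[:2]))

Ordinal encoding does impose an artificial order on the category values; one-hot encoding avoids that at the cost of more columns, and C4.5, for example, handles multi-way splits on categorical attributes and has its own treatment of missing values.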

What is a data structure? A simple, straightforward explanation required

I have to explain what a data structure is to someone, so what would be the easiest way to explain it? Would it be right if I say:
"A data structure is used to organize data (arrange data in some fashion) so that we can perform certain operations quickly, with as little resource usage as possible."
A data structure describes how values are placed together in locations, and how their location addresses and indices are themselves stored as values.
At a more abstract level, these "structures" include linked lists, arrays, pointers, graphs and binary trees, and you can do things with them (the algorithms). Each comes with capabilities such as being sorted, needing sortedness, fast access and so on.
This is fundamental and not too complicated, and a good grasp of data structures and their correct usage lets you solve problems elegantly. For learning data structures, a language like Pascal is more beneficial than C.
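A tiny illustration of that definition (a minimal sketch; the values are arbitrary): the same data, organized two different ways, makes the same operation cheap or expensive.

    # The same values organized two ways, so that a membership test is cheap or costly.
    values = list(range(1_000_000))   # list: a membership test scans linearly
    lookup = set(values)              # hash set: a membership test is expected O(1)

    print(999_999 in values)  # True, but may walk the whole list
    print(999_999 in lookup)  # True, found via a hash lookup almost immediately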
In computer science, a data structure is a particular way of organizing data in a computer so that it can be used efficiently.
Source: wikipedia (https://en.wikipedia.org/wiki/Data_structure)
I would say what you wrote is pretty close. :)

Is there a formalism for this data structure?

I'm looking for a mathematical formalism for a data structure I'm working with, so that I can track down relevant theorems and algorithms.
Suppose you have the following:
A directed acyclic graph of topics.
At each topic, there are one or more relations between the topic, items in a set of documents, and items in a set of groups.
The groups may be a simple set or they may end up as a DAG. They are used to manage the visibility of the association of a document with a topic.
Only recently have I come across hypergraphs, which seem relevant but too general. Is there a formalism for this data structure? If not, can it be described more succinctly in mathematical terms?
This looks like formal concept analysis (http://en.wikipedia.org/wiki/Formal_concept_analysis), especially Galois lattices.
A lattice has more constraints than what you describe, but maybe you can adopt this formalism in your application, or start from here to see if there are related works closer to your needs.
I guess you already know http://en.wikipedia.org/wiki/Ontology_%28information_science%29 , which is also the starting point of a lot of resources.
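For concreteness, one plain (non-formal) way to encode the structure described in the question might look like this (a minimal sketch; all names and types are invented for illustration):

    # Topics form a DAG; each topic carries relations that tie it to documents
    # and to the groups that control visibility of those associations.
    from dataclasses import dataclass, field

    @dataclass
    class Relation:
        topic: str
        documents: frozenset    # items from the document set
        groups: frozenset       # items from the group set (which may itself be a DAG)

    @dataclass
    class Topic:
        name: str
        parents: list = field(default_factory=list)     # DAG edges between topics
        relations: list = field(default_factory=list)

    biology = Topic("biology")
    genetics = Topic("genetics", parents=[biology])
    genetics.relations.append(
        Relation("genetics", frozenset({"doc-42"}), frozenset({"staff", "editors"}))
    )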

Multi-dimensional data structure

Which of the following data structures
R-tree,
R*-tree,
X-tree,
SS-tree,
SR-tree,
VP-tree,
metric-trees
provide reasonably good performance for insertion, update and search of multidimensional data stored in the corresponding form?
Is there a better data structure out there for handling multidimensional data?
What kind of multi-dimensional data are you talking about? The R-tree wiki states that it is used for indexing multi-dimensional data, but it seems clear that it will be primarily useful for data which is multi-dimensional in the same kind of feature, i.e. vertical location and horizontal location, longitude and latitude, etc.
If the data is multi-dimensional simply because there are a lot of attributes for the data and it needs to be analyzed along many of these dimensions, then a relational representation is probably best.
The real issue is how you optimize the relations and indices for the type of queries you need to answer. For this, you need to do some domain analysis beforehand, and some performance analysis after the first iteration, to determine whether there are better ways to structure and index your tables.
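If the data really is a set of points in a modest number of numeric dimensions and the workload is search-heavy, a k-d tree is often a reasonable first thing to try before the R-tree variants (a minimal sketch, assuming SciPy is available; the points are random stand-ins). Note that SciPy's k-d tree is built once and does not support incremental inserts, whereas R-tree implementations such as the rtree package do.

    # Nearest-neighbour search over 3-dimensional points with a k-d tree.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.random((10_000, 3))     # 10k points in 3 dimensions
    tree = cKDTree(points)

    query = np.array([0.5, 0.5, 0.5])
    dist, idx = tree.query(query, k=5)   # the 5 nearest neighbours
    print(idx, dist)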

Help Understanding Cross Validation and Decision Trees

I've been reading up on decision trees and cross-validation, and I understand both concepts. However, I'm having trouble understanding cross-validation as it pertains to decision trees. Essentially, cross-validation allows you to alternate between training and testing when your dataset is relatively small, so as to get the most out of your error estimation. A very simple algorithm goes something like this (a rough code sketch of these steps follows the list):
1. Decide on the number of folds you want (k).
2. Subdivide your dataset into k folds.
3. Use k-1 folds as a training set to build a tree.
4. Use the remaining fold as a test set to estimate statistics about the error in your tree.
5. Save your results for later.
6. Repeat steps 3-5 k times, leaving out a different fold for your test set each time.
7. Average the errors across your iterations to predict the overall error.
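Roughly, those steps in code (a sketch assuming scikit-learn; the dataset here is just a placeholder):

    # The listed steps with scikit-learn's KFold; iris is a stand-in dataset.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)   # steps 1-2: choose k, make folds

    errors = []
    for train_idx, test_idx in kf.split(X):                # step 6: repeat for each fold
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[train_idx], y[train_idx])               # step 3: train on k-1 folds
        err = 1.0 - tree.score(X[test_idx], y[test_idx])   # step 4: error on held-out fold
        errors.append(err)                                 # step 5: save results

    print(np.mean(errors))                                 # step 7: average the errors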
The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick? One idea I had was to pick the one with minimal error (although that doesn't make it optimal, just that it performed best on the fold it was given; maybe using stratification will help, but everything I've read says it only helps a little bit).
As I understand cross-validation, the point is to compute in-node statistics that can later be used for pruning, so each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're averaging your error, how do you merge these stats within each node across k trees when each tree could vary in what it chooses to split on, etc.?
What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.
Any help with this little wrinkle would be much appreciated.
The problem I can't figure out is at the end you'll have k Decision trees that could all be slightly different because they might not split the same way, etc. Which tree do you pick?
The purpose of cross-validation is not to help select a particular instance of the classifier (or decision tree, or whatever automatic learning application) but rather to qualify the model, i.e. to provide metrics such as the average error ratio, the deviation relative to this average, etc., which can be useful in assessing the level of precision one can expect from the application. One of the things cross-validation can help assess is whether the training data is big enough.
With regard to selecting a particular tree, you should instead run yet another training pass on 100% of the available training data, as this typically produces a better tree. (The downside of the cross-validation approach is that we need to divide the [typically small] amount of training data into "folds", and as you hint in the question this can lead to trees that are either overfit or underfit for particular data instances.)
In the case of decision trees, I'm not sure what your reference to statistics gathered in the nodes and used to prune the tree pertains to. Maybe a particular use of cross-validation-related techniques?
For the first part, as the others have pointed out, we usually use the entire dataset for building the final model, but we use cross-validation (CV) to get a better estimate of the generalization error on new, unseen data.
For the second part, I think you are confusing CV with the validation set, used to avoid overfitting the tree by pruning a node when some function value computed on the validation set does not increase before/after the split.
Cross-validation isn't used for building/pruning the decision tree. It's used to estimate how well the tree (built on all of the data) will perform, by simulating the arrival of new data (building the tree without some elements, just as you wrote). It doesn't really make sense to pick one of the trees generated by it, because the model is constrained by the data you have (and not using all of it might actually be worse when you use the tree for new data).
The tree is built over the data that you choose (usually all of it). Pruning is usually done using some heuristic (e.g. 90% of the elements in the node belong to class A, so we don't go any further, or the information gain is too small).
The main point of using cross-validation is that it gives you a better estimate of the performance of your trained model when used on different data.
Which tree do you pick? One option would be to build a new tree using all your data as the training set.
It has already been mentioned that the purpose of cross-validation is to qualify the model. In other words, cross-validation provides an error/accuracy estimate for a model generated with the selected "parameters", regardless of which data was used.
The cross-validation process can be repeated with different parameters until we are satisfied with the performance. Then we can train the model with the best parameters on the whole data.
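In scikit-learn terms that workflow might look roughly like this (a sketch; the parameter grid and dataset are just placeholders):

    # Pick tree parameters by cross-validation, then refit on the whole dataset.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]},
        cv=5,                     # 5-fold cross-validation per parameter combination
    )
    search.fit(X, y)              # refit=True (default) retrains the best model on all data
    print(search.best_params_, search.best_score_)
    final_tree = search.best_estimator_   # trained on the whole dataset with the best params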
I am currently facing the same problem, and I think there is no "correct" answer, since the concepts are contradictory and it's a trade-off between model robustness and model interpretability.
I basically chose the decision tree algorithm for the sake of easy interpretability, visualization, and straightforward hands-on application.
On the other hand, I want to prove the robustness of the model using cross-validation.
I think I will apply a two-step approach:
1. Apply k-fold cross-validation to show robustness of the algorithm with this dataset
2. Use the whole dataset for the final decision tree for interpretable results.
You could also randomly choose one of the trees from the cross-validation, or the best-performing one, but then you would lose the information from the hold-out set.
