I am using the neo4j-core gem (the Neo4j::Node API). It is the only MRI-compatible Ruby binding of Neo4j that I could find, and hence is valuable, but its documentation is poor (it has missing links, lots of typographical errors, and is difficult to comprehend). In the Label and Index Support section of the first link, it says:
Create a node with an [sic] label person and one property
Neo4j::Node.create({name: 'kalle'}, :person)
Add index on a label
person = Label.create(:person)
person.create_index(:name)
drop index
person.drop_index(:name)
(whose second code line I believe is a typographical error for the following)
person = Neo4j::Label.create(:person)
What is a label? Is it the name of a database table, or is it an attribute peculiar to a node?
If it is the name of a node, I don't understand the fact that (according to the API in the second link) the methods Neo4j::Node.create and Neo4j::Node#add_label can take multiple arguments for the label. What does it mean to have multiple labels on a node?
Furthermore, if I repeat the create command with the same label argument, it creates a different node object each time. What does it mean to have multiple nodes with the same name? Isn't a label something that identifies a node?
What is an index? How are labels and indices different?
Labels are a way of grouping nodes. You can give a label to many nodes or just one node. Think of a label as a collection of nodes that are grouped together. Labels allow you to assign indexes and other constraints.
An index allows quick lookup of nodes or edges without having to traverse the entire graph to find them. Think of it as a table of direct pointers to the particular nodes/edges indexed.
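To make the distinction concrete, here is a rough sketch of the two ideas in Python; this is my own illustration of the concepts, not how Neo4j stores anything:
# A label is a named grouping of nodes; an index is a direct lookup
# table from a property value to the nodes that carry it.
nodes = [{"name": "kalle"}, {"name": "anna"}]

labels = {"person": {0, 1}}                # label -> ids of its nodes

name_index = {"kalle": {0}, "anna": {1}}   # property value -> node ids

# With the index, finding a node by name is a dictionary lookup rather
# than a scan over every node in the graph:
kalle_nodes = [nodes[i] for i in name_index["kalle"]]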
As I read what you pasted from the docs (and without, admittedly, knowing the slightest thing about neo4j):
It's a graph database, where every piece of data is a node with a certain number of properties.
Each node can have a label (or more, presumably?). Think of it as a type -- or perhaps more appropriately, in Ruby parlance, a Module.
It's a database, so nodes can be part of an index for quicker access. So can subsets of nodes, and therefore nodes with a certain label.
Put another way: Think of the label as the table in a DB. Nodes as DB rows, which can belong to one or more labels/tables, or no label/table at all for that matter. And indexes as DB indexes on sets of rows.
I am in the process of merging two data sets together in Stata and came up with a potential concern.
I am planning on sorting each data set in exactly the same manner on several categorical variables that are common to both sets of data. HOWEVER, several of the categorical variables have more categories present in one data set than in the other. I have been careful enough to ensure that the coding matches up in both data sets (e.g. Red is coded as 1 in both data set A and B, but data set A has only Red, Green and Blue whereas data set B has Red, Green, Blue, and Yellow).
If I were to sort each data set the same way and generate an id variable (gen id = _n) and merge on that, would I run into any problems?
There is no statistical question here; this is purely about data management in Stata. So I too shall shortly vote for this to be migrated to Stack Overflow, where I would be one of those who might try to answer it, which I will do now.
Generating identifiers the way you describe is not how to think about merging data sets, regardless of any of the other details in your question.
Imagine any two data sets, and then in each data set, generate an identifier that is based on the observation numbers, as you propose. Generating such similar identifiers does not create a genuine merge key. You might as well say that four values "Alan" "Bill" "Christopher" "David" in one data set can be merged with "William" "Xavier" "Yulia" "Zach" in another data set because both can be labelled with observation numbers 1 to 4.
My advice is threefold:
Try what you are proposing with your data and try to understand the results.
Consider whether you have something else altogether, namely an append problem. It is quite common to confuse the two.
If both of those fail, come back with a real problem and real code and real results for a small sample, rather than abstract worries.
I think I may have solved my problem - I figured I would post an answer specifically relating to the problem in case anybody has the same issue.
I have two data sets: One containing information about the amount of time IT help spent at a customer and another data set with how much product a customer purchased. Both data sets contain unique ID numbers for each company and the fiscal quarter and year that link the sets together (e.g. ID# 1001 corresponds to the same company in both data sets). Additionally, the IT data set contains unique ID numbers for each IT person and the customer purchases data set contains a unique ID number for each purchase made. I am not interested in analysis at the individual employee level, so I collapsed the IT time data set to the total sum of time spent at a given company regardless of who was there.
I was interested in merging both data sets so that I could perform analysis to estimate some sort of "responsiveness" (or elasticity) function linking together IT time spent and products purchased.
I am certain this is a case of "merging" data because I want to add more VARIABLES, not OBSERVATIONS; that is, I wish to widen my final data set horizontally, not lengthen it vertically.
Stata 12 has many options for merging: one-to-one (1:1), many-to-one (m:1), and one-to-many (1:m). Supposing that I treat my IT time data set as my master and my purchases data set as my merging set, I would perform a "1:m" or one-to-many merge, because each collapsed IT observation per quarter per company corresponds to MANY purchases.
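For illustration, the same one-to-many merge can be sketched in Python with pandas (the column names here are hypothetical stand-ins for the actual ID, quarter, and year variables):
import pandas as pd

# Collapsed IT data: one observation per company per quarter.
it_time = pd.DataFrame({
    "company_id": [1001, 1001, 1002],
    "quarter": ["2012Q1", "2012Q2", "2012Q1"],
    "it_hours": [40, 25, 10],
})

# Purchase data: potentially many observations per company per quarter.
purchases = pd.DataFrame({
    "company_id": [1001, 1001, 1002],
    "quarter": ["2012Q1", "2012Q1", "2012Q1"],
    "amount": [500, 750, 200],
})

# One-to-many merge on the shared keys, mirroring Stata's
# "merge 1:m company_id quarter using purchases":
merged = it_time.merge(purchases, on=["company_id", "quarter"],
                       how="left", validate="1:m")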
I'm looking for the best way to represent a set of structured data.
I'm designing a product picker. It will ask the user some questions to narrow down the set of products.
For example:
1st question: "What's the product Group?"
Answer: Group1
In Group1, available Product Categories are (pick one):
Category1
Category2
Category4
Answer: Category4
In Category4 for Group1, available Types are:
Type3
Type5
Answer: Type5
For Type5, in Category4, in Group1, the available Product Characteristics are... etc.
So each new question shows a list based not only on the previous answer, but on all the answers before it (i.e. some Types available in Category4 would be different if that Category4 were under Group2). It's like a tree where each child can be under multiple parents.
There may be up to 10 such levels.
What's the most efficient structure to store this hierarchy?
Without any extra knowledge of the problem and the different distributions, here is what you should do:
Each node will have an n-dimensional array of bits stored in it, where n is its level (Groups are level 0). Then, when you reach level i, you look over all nodes in that level and, for each one, check whether the bit that corresponds to the current history of answers is on or off. (There are no pointers or such between the nodes; "nodes" is just a convenient name I'm using.)
The dimensions of the arrays at each level are the sizes of the previous levels; e.g. at the Types level (level 2), you would have 2-dimensional arrays with dimensions (# Groups) * (# Categories).
Example:
To know whether or not Type5 should appear under Category4 in Group1, you would go to its array at cell [1][4]; if the bit is on (1) then it should appear, otherwise (0) it shouldn't.
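To make the lookup concrete, here is a minimal sketch in Python (the sizes are invented for illustration):
# One bit matrix per Type node: bits[g][c] is True exactly when that
# Type is available under Group g and Category c.
N_GROUPS, N_CATEGORIES = 3, 5

type5_bits = [[False] * N_CATEGORIES for _ in range(N_GROUPS)]
type5_bits[1][4] = True      # Type5 appears under Group1, Category4

def is_available(bits, group, category):
    """The level-2 membership test described in the example above."""
    return bits[group][category]

print(is_available(type5_bits, 1, 4))   # True
print(is_available(type5_bits, 2, 4))   # False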
If you are using a language that allows pointer arithmetic (like C/C++), you can slightly optimize the matrix access by maintaining the offset you need to go to, since it always starts the same: [1], [1][4], [1][4][5], ...; but this should come at a much later time, when everything already works properly.
If later on you get to know more details about your problem, such as that most of these connections do or don't exist, then you could think about using sparse matrices, for example, instead of regular ones.
I have a list of tuples.
[
"Bob": 3,
"Alice": 2,
"Jane": 1,
]
When incrementing the counts
"Alice" += 2
the order should be maintained:
[
"Alice": 4,
"Bob": 3,
"Jane": 1,
]
When everything is in memory, there are rather simple ways (some more efficient, some less) to implement this (using an index, insertion sort, etc.). The question, though, is: what is the most promising approach when the list does not fit into memory?
Bonus question: What if not even the index fits into memory?
How would you approach this?
B+ trees order a number of items using a key. In this case, the key is the count, and the item is the person's name. The entire B+tree doesn't need to fit into memory - just the current node being searched. You can set the maximum size of the nodes (and indirectly the depth of the tree) so that a node will fit into memory. (In practice nodes are usually far smaller than memory capacity.)
The data items are stored at the leaves of the tree, in so-called blocks. You can either store the items inline in the index, or store pointers to external storage. If the data is regularly sized, this can make for efficient retrieval from files. In the question example, the data items could be single names, but it would be more efficient to store blocks of names, all names in a block having the same count. The names within each block could also be sorted. (The names in the blocks themselves could be organized as a B-tree.)
If the number of names becomes large enough that the B+tree blocks are becoming excessively large, the key can be made into a composite key, e.g. (count, first-letter). When searching the tree, only the count needs to be compared to find all names with that count. When inserting, or searching for a specific name with a given count, the full key can be compared to include filtering by name prefix.
Alternatively, instead of a composite key, the data items can point to offsets/blocks in an external file that contains the blocks of names, which will keep the B+tree itself small.
If the blocks of the B+tree are linked together, range queries can be implemented efficiently by searching for the start of the range and then following block pointers until the end of the range is reached. This would allow you to efficiently implement "find all names with a count between 10 and 20".
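As a rough illustration of the ordered-index idea in Python, with a sorted key list and binary search standing in for the B+tree (node structure and splitting omitted):
import bisect

counts = {"Bob": 3, "Alice": 2, "Jane": 1}        # name -> count
keys = sorted((c, n) for n, c in counts.items())  # ordered (count, name) keys

def increment(name, delta):
    """Move the key to its new position, as a B+tree update would."""
    old = counts[name]
    keys.pop(bisect.bisect_left(keys, (old, name)))
    counts[name] = old + delta
    bisect.insort(keys, (old + delta, name))

def names_with_count_between(lo, hi):
    """Range query: all names whose count lies in [lo, hi]."""
    start = bisect.bisect_left(keys, (lo, ""))
    end = bisect.bisect_left(keys, (hi + 1, ""))
    return [name for _, name in keys[start:end]]

increment("Alice", 2)                      # Alice: 2 -> 4
print(names_with_count_between(3, 4))      # ['Bob', 'Alice']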
As the other answers have noted, an RDBMS is the pre-packaged way of storing lists that don't fit into memory, but I hope this gives an insight into the structures used to solve the problem.
A relational database such as MySQL is specifically designed for storing large amounts of data the sum of which does not fit into memory, querying against this large amount of data, and even updating it in place.
For example:
CREATE TABLE `people` (
`name` VARCHAR(255),
`count` INT
);
INSERT INTO `people` VALUES
('Bob', 3),
('Alice', 2),
('Jane', 1);
UPDATE `people` SET `count` = `count` + 2 WHERE `name` = 'Alice';
After the UPDATE statement, the query SELECT * FROM people ORDER BY count DESC; will show:
+-------+-------+
| name  | count |
+-------+-------+
| Alice |     4 |
| Bob   |     3 |
| Jane  |     1 |
+-------+-------+
You can additionally record the original insertion order of the people in your table (useful, for example, to break ties in count) by adding an auto-incrementing primary key:
CREATE TABLE `people` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(255),
`count` INT,
PRIMARY KEY(`id`)
);
INSERT INTO `people` VALUES
(DEFAULT, 'Bob', 3),
(DEFAULT, 'Alice', 2),
(DEFAULT, 'Jane', 1);
An RDBMS? Even a flat-file version like SQLite. Otherwise, a combination utilizing lazy loading: keep only X records in memory, namely the top Y records plus the Z most recent ones whose counts were updated. Or else a table of key and count columns where you run UPDATEs changing the values; the ordered list can be retrieved with a simple SELECT ... ORDER BY.
Read about B-trees and B+-trees. With these, the index can always be made small enough to fit into memory.
An interesting approach, quite unlike B-trees, is the Judy array.
What you seem to be looking for are out-of-core algorithms for container classes, specifically an out-of-core list container. Check out the stxxl library for some great examples of out-of-core algorithms and processing.
You may also want to look at this related question
As far as "implementation details tackling this 'by hand'", you could read about how database systems do this by searching for the original papers on database design or finding graduate course notes on database architecture.
I did some searching and found a survey article by G. Graefe titled "Query evaluation techniques for large databases". It fairly exhaustively covers every aspect of querying large databases, and the whole of section 4 addresses how "query evaluation systems ... access base data stored in the database". Graefe's survey was also linked from the course page for CPS 216: Advanced Database Systems at Duke, Fall 2001. Week 5 was on Physical Data Organization, which says that most commercial DBMSs organize data on disk using blocks in the N-ary Storage Model (NSM): records are stored from the beginning of each block and a "directory" exists at the end.
See also:
Spring 2004 CPS 216 lecture notes
MIT OCW 6.830 Database Systems
Of course I know I could use a database. This question was more about the implementation details tackling this "by hand"
So basically, you are asking "How does a database do this?" To which the answer is, it uses a tree (for both the data and the index), and only stores part of the tree in memory at any one time.
As has already been mentioned, B-Trees are especially useful for this: since hard-drives always read a fixed amount at a time (the "sector size"), you can make each node the size of a sector to maximize efficiency.
You do not specify that you need to add or remove any elements from the list, just keep it sorted.
If so, a straightforward flat-file approach - typically using mmap for convenience - will work and be faster than a more generic database.
You can use a binary search to locate an item, or maintain a table of how many slots hold each count value.
As you access an item, the part of the file it is in (think in terms of memory 'pages') gets read into RAM automatically by the OS, and the slot and its adjacent slots even get copied into the L1 cache line.
You can do an immediate comparison with its adjacent slots to see whether the increment or decrement causes the item to be out of order; if it does, you can use a linear iteration (perhaps augmented with a bsearch) to locate the first/last item with the appropriate count, and then swap them.
Managing files is what the OS is built to do.
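A minimal sketch of this approach in Python, assuming fixed-size records (a 16-byte name plus a 64-bit count, stored sorted by descending count) and a simple swap-with-neighbour loop in place of the bsearch-assisted variant described above:
import mmap
import struct

RECORD = struct.Struct("16sq")  # 16-byte name, signed 64-bit count

def read_record(mm, i):
    name, count = RECORD.unpack_from(mm, i * RECORD.size)
    return name.rstrip(b"\0"), count

def write_record(mm, i, name, count):
    RECORD.pack_into(mm, i * RECORD.size, name, count)

def increment(mm, i, delta):
    """Bump record i's count, then swap it toward the front while it
    out-counts its predecessor, keeping the file in descending order."""
    name, count = read_record(mm, i)
    count += delta
    write_record(mm, i, name, count)
    while i > 0:
        prev_name, prev_count = read_record(mm, i - 1)
        if prev_count >= count:
            break
        write_record(mm, i - 1, name, count)
        write_record(mm, i, prev_name, prev_count)
        i -= 1

# Assumes counts.dat already holds RECORD-sized entries.
with open("counts.dat", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    increment(mm, 1, 2)   # e.g. "Alice" += 2 if she is record 1
    mm.flush()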
I'm currently working on implementing a list-type structure at work, and I need it to be crazy efficient. In my search for efficient data structures I stumbled across a patent for a quad linked list, and this sparked my interest enough to make me forget about my current task and start investigating the quad list instead. Unfortunately, the internet was very secretive about the whole thing, and Google didn't produce much in terms of usable results. The only explanation I got was the patent description, which stated:
A quad linked data structure that provides bidirectional search capability for multiple related fields within a single record. The data base is searched by providing sets of pointers at intervals of N data entries to accommodate a binary search of the pointers followed by a linear search of the resultant range to locate an item of interest and its related field.
This, unfortunately, just makes me more puzzled, as I cannot wrap my head around the non-layman explanation. So I turn to you all in the hope that you can explain to me what this quad linked list really is, because I know that not knowing will drive me up and over the walls pretty quickly.
Do you know what a quad linked list is?
I can't be sure, but it sounds a bit like a skip list.
Even if that's not what it is, you might find skip lists handy. (To the best of my knowledge they are unidirectional, however.)
I've not come across the term formally before, but from the patent description, I can make an educated guess.
A linked list is one where each node has a link to the next...
a -->-- b -->-- c -->-- d -->-- null
A doubly linked list means each node holds a link to its predecessor as well.
  --<--   --<--   --<--
|       |       |       |
a -->-- b -->-- c -->-- d -->-- null
Let's assume the list is sorted. If I want to perform binary search, I'd normally go half way down the list to find the middle node, then go into the appropriate interval and repeat. However, linked list traversal is always O(n) - I have to follow all the links. From the description, I think they're just adding additional links from a node to "skip" a fixed number of nodes ahead in the list. Something like...
  --<--   --<--   --<--
|       |       |       |
a -->-- b -->-- c -->-- d -->-- null
|                       |
|----------->-----------|
 -----------<-----------
Now I can traverse the list more rapidly, especially if I choose the extra link targets carefully (i.e. ensure each one jumps forward or back by about half of the remaining interval, as in a binary search). I then find the rough interval I want with these links, and use the normal links to find the item.
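A toy version of the idea in Python (my own illustration, giving each node at most one fixed-distance skip link):
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None   # ordinary link to the successor
        self.skip = None   # long-range link several nodes ahead

def build(values, skip_distance=2):
    """Build a sorted linked list and wire up the skip links."""
    nodes = [Node(v) for v in sorted(values)]
    for i, node in enumerate(nodes):
        if i + 1 < len(nodes):
            node.next = nodes[i + 1]
        if i + skip_distance < len(nodes):
            node.skip = nodes[i + skip_distance]
    return nodes[0]

def find(head, target):
    """Follow a skip link while it doesn't overshoot, else step."""
    node = head
    while node is not None and node.value < target:
        if node.skip is not None and node.skip.value <= target:
            node = node.skip
        else:
            node = node.next
    return node if node is not None and node.value == target else None

head = build(["a", "b", "c", "d"])
assert find(head, "c").value == "c"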
This is a good example of why I hate software patents. It's eminently obvious stuff, wrapped in florid prose to confuse people.
I don't know if this is exactly a "quad-linked list", but it sounds like something like this:
#include <string>

struct Customer {
    // First ordering: a normal doubly-linked list of all customers.
    Customer *nextCustomer;
    Customer *prevCustomer;

    // Second ordering: the same customers, linked in firstName order.
    std::string firstName;
    Customer *nextByFirstName;
    Customer *prevByFirstName;

    // Third ordering: the same customers, linked in lastName order.
    std::string lastName;
    Customer *nextByLastName;
    Customer *prevByLastName;
};
That is: you maintain several orderings through your collection. You can easily navigate in firstName order, or in lastName order. It's expensive to keep the links up to date, but it makes navigation quite quick.
Of course, this could be something completely different.
My reading of it is that a quad linked list is one which can be traversed (backwards or forwards) in O(n) in two different ways, i.e. sorted according to FieldX or FieldY:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
So if you had a quad linked list of employees you could store it sorted by name AND sorted by age, and enumerate either in O(n).
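Concretely, such a record might be sketched like this in Python (the field names are mine, not the patent's):
class Employee:
    """A record threaded onto two doubly-linked orderings at once;
    four links per node, hence "quad"."""
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.next_by_name = self.prev_by_name = None   # name ordering
        self.next_by_age = self.prev_by_age = None     # age ordering

def walk(head, link):
    """Enumerate one ordering in O(n) by following one set of links,
    e.g. walk(first_by_name, "next_by_name")."""
    node = head
    while node is not None:
        yield node
        node = getattr(node, link)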
One source of the patent is this. There are, it appears, two claims, the second of which is more nearly relevant:
A computer implemented method for organizing and searching a set of related records, wherein each record includes:
i) a fixed ID field; and
ii) a variable ID field; the method comprising the steps of:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
(c) generating first and second sets of field pointers, wherein the first set of field pointers includes an ordered set of pointers that point to every Nth fixed ID field when the records are ordered with respect to the fixed ID field, and the second set of pointers includes an ordered set of pointers that point to every Nth variable ID field when the records are ordered with respect to the variable ID field;
(d) when searching for a particular record by reference to its fixed ID field, conducting a binary search of the first set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(e) examining, by linear search, the fixed ID fields within the range determined in step (d) to locate the particular record;
(f) when searching for a particular record by reference to its variable ID field, conducting a binary search of the second set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(g) examining, by linear search, the variable ID fields within the range determined in step (f) to locate the particular record.
When you work through the patent gobbledegook, I think it means approximately the same as having two skip lists (one for forward search, one for backwards search) on each of two keys (hence 4 lists in total, and the name 'quad-list'). I don't think it is a very good patent - it looks to be an obvious application of skip lists to a data set where you have two keys to search on.
The description isn't particularly good, but as best I can gather, it sounds like a less-efficient skip list.