What is the data structure behind Clojure's sets?

I recently listened to Rich Hickey's interview on Software Engineering Radio. During the interview Rich mentioned that Clojure's collections are implemented as trees. I'm hoping to implement persistent data structures in another language, and would like to understand how sets and Clojure's other persistent data structures are implemented.
What would the tree look like at each point in the following scenario?
Create the set {1 2 3}
Create the union of {1 2 3} and {4}
Create the difference of {1 2 3 4} and {1}
I'd like to understand how the three sets generated ({1 2 3}, {1 2 3 4}, and {2 3 4}) share structure, and how "deletions" are handled.
I'd also like to know the maximum number of branches that a node may have. Rich mentioned in the interview that the trees are shallow, so presumably the branching factor is greater than two.

You probably need to read the work of Phil Bagwell. His research on data structures is the basis of the persistent data structures in Clojure, Haskell and Scala.
There is this talk by Phil at Clojure/Conj: http://www.youtube.com/watch?v=K2NYwP90bNs
There are also some papers:
Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays
Ideal Hash Trees
More from Phil here
You can also read Purely Functional Data Structures by Chris Okasaki. This blog post talks about the book: http://okasaki.blogspot.com.br/2008/02/ten-years-of-purely-functional-data.html

You should really read Clojure Programming; it covers this in great detail, including pictures. Briefly though, the collections are trees, and lookups are walks down through those trees. We can sketch your examples like this:
(def x #{1 2 3})

      x
    / | \
   1  2  3

(def y (conj x 4))

      x     y
    / | \ / | \
   1  2  3  4        y shares the nodes for 1, 2 and 3 with x

(def z (difference y #{1}))

      x     y     z
    / | \ / | \ / | \
   1  2  3  4        z shares the nodes for 2, 3 and 4 with y;
                     the node for 1 simply isn't referenced from z

Note that these diagrams are just indicative; I'm not saying this is exactly the layout Clojure uses internally, but it's the gist.

I like SCdF's drawings and explanations, but if you're looking for more depth you should read the excellent series of articles on Clojure's data structures at Higher-Order. It explains in detail how Clojure's maps work, and Clojure's sets are just a thin layer on top of its maps: #{:a :b} is implemented as a wrapper around {:a :a, :b :b}. Those articles also answer your branching-factor question: each node has at most 32 children.

Here's a starting point: https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/PersistentHashSet.java
You can see it's implemented in terms of PersistentHashMap.
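To make the "sets are a thin layer over maps" point concrete, here is a minimal Python sketch of the idea. The names are made up; a real PersistentHashMap is a hash array mapped trie that shares structure between versions, whereas this toy copies the whole table on every update:

```python
class PSet:
    """Toy persistent set built on a map, where each element maps to itself."""

    def __init__(self, impl=None):
        # `impl` stands in for the underlying persistent map; a plain
        # dict copied on write replaces the real structure-sharing trie.
        self._impl = impl if impl is not None else {}

    def conj(self, x):
        # Adding returns a *new* set; the original is untouched.
        new = dict(self._impl)
        new[x] = x                      # element maps to itself, like {:a :a}
        return PSet(new)

    def disj(self, x):
        # "Deletion" also returns a new set rather than mutating.
        new = dict(self._impl)
        new.pop(x, None)
        return PSet(new)

    def contains(self, x):
        return x in self._impl

x = PSet().conj(1).conj(2).conj(3)      # {1 2 3}
y = x.conj(4)                           # union with {4} -> {1 2 3 4}
z = y.disj(1)                           # difference with {1} -> {2 3 4}
```

All three sets remain usable afterwards; in Clojure the three versions would additionally share most of their tree nodes instead of copying.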

Related

Better Model/Algorithm to predict multiple target

I want to use multiple variables to predict multiple targets. Note that "multiple targets" here doesn't mean multi-label.
Let's go for an example like this:
# In this example, x1, x2, x3 are used to predict y1, y2
import pandas as pd
pd.DataFrame({'x1': [1, 2, 3], 'x2': [2, 3, 4], 'x3': [1, 1, 1], 'y1': [1, 2, 1], 'y2': [2, 3, 3]})
   x1  x2  x3  y1  y2
0   1   2   1   1   2
1   2   3   1   2   3
2   3   4   1   1   3
In my limited experience with data mining, I found two solutions that might help:
Build two xgboost models to predict y1 and y2 respectively
Use a fully-connected layer to map [x1, x2, x3] to [y1, y2], which seems like a promising solution
I'd like to know whether either is good practice, and what the better way to predict multiple targets would be.
Regardless of your approach, two outputs mean you need two functions. I hope it's clear that a layer producing two outputs is equivalent to two layers producing one output each.
The only thing worth taking into account here (only relevant for deeper models) is whether you want to build intermediate representations of your input that are shared for predicting both outputs, i.e. x → h1 → h2 → .. → hN, hN → y1, hN → y2. Doing so would enforce your hN representation to act as a task-indifferent, multi-purpose encoder while simultaneously reducing the complexity of having two models learn the same thing.
For shallow architectures, such as the single-layer one you described, this is meaningless.
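That equivalence is easy to check numerically. The sketch below uses random made-up data (not the table from the question) and a single linear layer with no bias:

```python
import numpy as np

# One weight matrix with two output columns vs. the same columns used
# as two separate one-output models: the predictions are identical.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # rows: samples; columns: x1, x2, x3
W = rng.normal(size=(3, 2))      # one layer, two outputs (y1, y2)

joint = X @ W                    # predict y1 and y2 together
y1 = X @ W[:, 0]                 # first "model"
y2 = X @ W[:, 1]                 # second "model"

assert np.allclose(joint[:, 0], y1)
assert np.allclose(joint[:, 1], y2)
```

Shared hidden layers only change this picture for deeper models, as noted above: the shared part computes one representation that both output heads consume.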

Relational Algebra Division

I'm currently dealing with a relational algebra division issue. I have the following two relations:
             A | B | C                  B
            ---|---|---                ---
             1 | 2 | 3                  2
Relation R = 1 | 2 | 6    Relation T =
             4 | 2 | 2
             4 | 5 | 6
Now I'm doing the following operation: R ÷ T
When I calculate this, my result is as follows:
         A | C
        ---|---
         1 | 3
R ÷ T =  1 | 6
         4 | 2
For me this is because, for the division, I look at those tuples in R that appear in combination with all tuples in T. But when I use a relational algebra calculator such as RelaX, it returns

         A | C
        ---|---
R ÷ T =  4 | 2
Where did I make a mistake? Thanks in advance for any help.
Performing division on schemas like these is not ideal for fully understanding how the operator works. Definitions of this operator are not very clear, and in practice it is usually replaced by a combination of other operators.
A clearer way to see how this works in your case is to create a new instance of R with the columns reordered into a new schema (A, C, B), i.e. making the attribute of T appear as the last attribute of R. In a division this simple it's straightforward to see the result, but imagine you had schemas R(A, D, B, C) and T(B, D): the attributes of T appear in a different order in R, and even instances without many tuples would be difficult to check just by looking at them.
This might be the most difficult operator defined in relational algebra, as a query using it usually involves concepts from selection, projection and join. It is also complicated to explain in words alone.
A good way of thinking about this operator is to think of GROUP BY in SQL. In your example this means grouping by attributes A, C, which creates a group for every combination of values of those attributes that appears in the instance. Each group holds the set of all B-values associated with that combination of A, C. If you think of the values of attribute B in the instance of T as a set, you can quickly verify: for each group obtained by grouping by A, C, if the set of B-values in T is included in the group's set of B-values from R, then that combination of A, C is a tuple of the resulting relation.
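That GROUP BY reading can be sketched directly as a toy implementation (not how any real engine evaluates division); applied to the question's instances it reproduces the result computed by hand:

```python
from collections import defaultdict

# R(A, B, C) and T(B) from the question.
R = [(1, 2, 3), (1, 2, 6), (4, 2, 2), (4, 5, 6)]
T = {2}

# Group R by the attributes *not* in T (here A and C), collecting
# the set of B-values seen for each (A, C) combination.
groups = defaultdict(set)
for a, b, c in R:
    groups[(a, c)].add(b)

# Keep a group iff its B-values include every B-value of T.
result = {ac for ac, bs in groups.items() if T <= bs}
```

With T = {2}, every (A, C) group except (4, 6) contains the value 2, so result is {(1, 3), (1, 6), (4, 2)}.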
I know I'm a bit late to this question but I've seen many people confused about this. If it's not clear enough, I leave reference to a document I wrote explaining it much more in detail and with a really good example, HERE.

Type of pseudo code

First of all, sorry for this basic question, but I really need to know about the languages used to show the execution flow of programs in computer science books.
Example:
    A = 4
    t1 = A * B
L1: t2 = t1 / C
    if t2 < W goto L2
    M = t1 * k
    t3 = M + I
L2: H = I
    M = t3 - H
    if t3 ≥ 0 goto L3
    goto L1
L3: halt
Does this language follow some specific standard? Is this pseudocode, or an intermediate representation of code?
There are no technical rules for Pseudocode, unless you are attempting to conform to standards/syntax for a particular language.
Pseudocode is meant to be human readable and still convey the flow and meaning of the code.
Books that use Pseudocode typically conform to a Java, C, or Pascal-type (among others) structure to make the code easy to read for those familiar with the languages.
The naming conventions that I have seen in the past usually lean toward C or Java-esque naming conventions.
You can find more information here: http://en.wikipedia.org/wiki/Pseudocode
The purpose of pseudocode is to describe an algorithm in a manner which is readable and unambiguous. (Different authors place different amount of emphasis on those two goals, which are frequently in opposition.)
Pseudocode does not need to look like English (or another spoken/written language), nor does it need to look like a real programming language. Ideally its constructs should be familiar to programmers of many different languages.
That pseudocode meets the requirement fairly well; I don't see anything in it whose effect I can't readily understand.

Haskell Range Map library

Is there a Haskell library that allows me to have a Map from ranges to values? (Preferably somewhat efficient.)
let myRangeMap = RangeMap [(range 1 3, "foo"),(range 2 7, "bar"),(range 9 12, "baz")]
in rangeValues 2
==> ["foo","bar"]
I've written a library for searching in overlapping intervals because the existing ones did not fit my needs. I think it may have a more approachable interface than, for example, SegmentTree:
https://www.chr-breitkopf.de/comp/IntervalMap/index.html
It's also available on Hackage: https://hackage.haskell.org/package/IntervalMap
This task is called a stabbing query on a set of intervals. An efficient data structure for it is called (one-dimensional) segment tree.
The SegmentTree package provides an implementation of this data structure, but unfortunately I cannot figure out how to use it. (I feel that the interface of this package does not provide the right level of abstraction.)
Perhaps the rangemin library does what you want?
Good old Data.Map (and its more efficient Data.IntMap cousin) has a function
splitLookup :: Ord k => k -> Map k a -> (Map k a, Maybe a, Map k a)
which splits a map into submaps of keys less than / greater than a given key. This can be used for certain kinds of range searching.
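As a reference point for the semantics the question asks for, here is a brute-force sketch in Python; the names mirror the question's hypothetical API, and a segment tree or IntervalMap would answer the same query in logarithmic rather than linear time:

```python
# Each entry pairs a closed range with a value, as in the question.
range_map = [((1, 3), "foo"), ((2, 7), "bar"), ((9, 12), "baz")]

def range_values(p, rm=range_map):
    # "Stabbing query": return the value of every interval containing p.
    return [v for (lo, hi), v in rm if lo <= p <= hi]
```

range_values(2) yields ["foo", "bar"], matching the example in the question.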

Aggregating automatically-generated feature vectors

I've got a classification system, which I will unfortunately need to be vague about for work reasons. Say we have 5 features to consider, it is basically a set of rules:
A B C D E Result
1 2 b 5 3 X
1 2 c 5 4 X
1 2 e 5 2 X
We take a subject and get its values for A-E, then try matching the rules in sequence. If one matches we return the first result.
C is a discrete value, which could be any of a-e. The rest are just integers.
The ruleset has been automatically generated from our old system and has an extremely large number of rules (~25 million). The old rules were if statements, e.g.
result("X") if $A >= 1 && $A <= 10 && $C eq 'A';
As you can see, the old rules often do not even use some features, or accept ranges. Some are more annoying:
result("Y") if ($A == 1 && $B == 2) || ($A == 2 && $B == 4);
The ruleset needs to be much smaller, as it has to be human-maintained, so I'd like to shrink the rule sets so that the first example would become:
A B C D E Result
1 2 bce 5 2-4 X
The upshot is that we can split the ruleset by the Result column and shrink each part independently. However, I can't think of an easy way to identify and shrink down the rules. I've tried clustering algorithms, but they choke because some of the data is discrete, and treating it as continuous is imperfect. Another example:
A B C Result
1 2 a X
1 2 b X
(repeat a few hundred times)
2 4 a X
2 4 b X
(ditto)
In an ideal world, this would be two rules:
A B C Result
1 2 * X
2 4 * X
That is, not only would the algorithm identify the relationship between A and B, it would also deduce that C is noise (not important for the rule).
Does anyone have an idea of how to go about this problem? Any language or library is fair game, as I expect this to be a mostly one-off process. Thanks in advance.
Check out the Weka machine learning lib for Java. The API is a little bit crufty but it's very useful. Overall, what you seem to want is an off-the-shelf machine learning algorithm, which is exactly what Weka contains. You're apparently looking for something relatively easy to interpret (you mention that you want it to deduce the relationship between A and B and to tell you that C is just noise.) You could try a decision tree, such as J48, as these are usually easy to visualize/interpret.
Twenty-five million rules? How many features? How many values per feature? Is it possible to iterate through all combinations in practical time? If you can, you could begin by separating the rules into groups by result.
Then, for each result, do the following. Considering each feature as a dimension, and the allowed values for a feature as the metric along that dimension, construct a huge Karnaugh map representing the entire rule set.
The map has two uses. One: feed it to automated reduction methods such as the Quine-McCluskey algorithm. A lot of work has been done in this area. There are even a few programs available, although probably none of them will handle a Karnaugh map of the size you're going to build.
Two: when you have created your final reduced rule set, iterate over all combinations of all values for all features again, and construct another Karnaugh map using the reduced rule set. If the maps match, your rule sets are equivalent.
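A toy illustration of the Quine-McCluskey-style merge step mentioned above: two rules with the same result that differ in exactly one feature collapse into a single rule with a wildcard at that position. This sketch handles exact values only (no ranges), and repeating it to a fixed point shrinks the question's C example:

```python
def merge_once(rules):
    # One pass of merging: pair up rules differing in exactly one
    # position, replace that position with "*", and keep any rule
    # that couldn't be paired with another.
    merged, used = set(), set()
    rules = list(rules)
    for i, r1 in enumerate(rules):
        for r2 in rules[i + 1:]:
            diff = [k for k, (a, b) in enumerate(zip(r1, r2)) if a != b]
            if len(diff) == 1:
                k = diff[0]
                merged.add(r1[:k] + ("*",) + r1[k + 1:])
                used.update((r1, r2))
    merged |= {r for r in rules if r not in used}
    return merged

# The (A, B, C) -> X rules from the question's second example.
rules = {(1, 2, "a"), (1, 2, "b"), (2, 4, "a"), (2, 4, "b")}
shrunk = merge_once(rules)
```

Here shrunk is {(1, 2, "*"), (2, 4, "*")}, the two wildcard rules the question hoped for; real Quine-McCluskey additionally tracks which original rules each merged term covers so redundant terms can be dropped.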
You could try a neural network approach, trained via backpropagation, assuming you have or can randomly generate (based on the old ruleset) a large set of data that hit all your classes. Using a hidden layer of appropriate size will allow you to approximate arbitrary discriminant functions in your feature space. This is more or less the same idea as clustering, but due to the training paradigm should have no issue with your discrete inputs.
This may, however, be a little too "black box" for your case, particularly if you have zero tolerance for false positives and negatives (although, it being a one-off process, you can get an arbitrary degree of confidence by checking against a gargantuan validation set).
