Sorting vectors in a Clojure map of vectors

I have a map of vectors, like this:
{2 ["a" "c" "b"], 1 ["z" "y" "x"]}
I want to get a map that is sorted by keys, and then each corresponding vector is also sorted, like this:
{1 ["x" "y" "z"], 2 ["a" "b" "c"]}
I know I can sort by keys by doing (into (sorted-map) themap), and I know that I can supply a transducer to into, but I'm coming up short as to exactly how the transducer should look. Here's a transducer I've tried:
(defn xform [entry]
  (vector (first entry) (vec (sort (second entry)))))
However, when I try to apply it to my map, I get this exception:
java.lang.IllegalArgumentException: Don't know how to create ISeq from: clojure.core$conj__4345
How can I get this to work? Is there a better way than using into with a transducer?

Like this:
(into (sorted-map)
      (map (fn [[k v]] [k (vec (sort v))]))
      {2 ["a" "c" "b"], 1 ["z" "y" "x"]})

I'm trying to understand the syntax of this cartesian-product function in Clojure

Here's some code for a cartesian product; it can take two lists, two vectors, or any combination of the two. I'd really appreciate help with the second, fourth, and final lines, explaining what each line is doing.
(defn cartesian-product                          ; function name definition
  ([] '(()))                                     ; need help understanding this
  ([xs & more]                                   ; at least two variables, xs is one of them
   (mapcat #(map (partial cons %)                ; mapcat means create a concatenated map of the following
                                                 ; still trying to figure out partial, but cons takes a
                                                 ; variable and puts it in front of a sequence
                 (apply cartesian-product more)) ; this is the sequence that is mapped
                                                 ; using (partial cons %)
           xs)))                                 ; not sure what this is here for
Here is a reworked version that illustrates what is going on (and how):
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test))
;----------------------------------------------------------------------------
; Lesson: how map & mapcat work
(defn dup
  "Return 2 of the arg in a vector"  ; note: a docstring must come before the params vector
  [x]
  [x x])
(dotest
  (let [nums [0 1 2]]
    (is= (mapv inc nums) [1 2 3])
    (is= (mapv dup nums) [[0 0]   ; like a matrix, 2-D
                          [1 1]
                          [2 2]])
    ; mapcat glues together the inner "row" vectors, so the result is 1-D instead of 2-D
    (is= (mapcat dup nums) [0 0 1 1 2 2])))
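The question also asks about partial and cons, so here are two quick REPL checks (a sketch; any sequence works the same way):
(cons 1 [:a :b])            ;=> (1 :a :b)  cons puts an item at the front of a sequence
((partial cons 1) [:a :b])  ;=> (1 :a :b)  partial fixes cons's first argument, returning a one-argument function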
Then the reworked code:
;----------------------------------------------------------------------------
(def empty-matrix [[]]) ; one row, zero cols: the product of no sequences is a single empty tuple
(defn cartesian-product ; function name definition
  "When called with 1 or more sequences, returns a list of all possible combinations
   of one item from each collection"
  ([]               ; if called with no args
   empty-matrix)    ; return an empty matrix
  ; if called with 1 or more args,
  ([xs              ; first arg is named `xs` (i.e. plural for x values)
    & more]         ; all other args are wrapped in a list named `more`
   (let [recursion-result (apply cartesian-product more) ; get cartesian prod of sequences 2..N
         inner-fn (fn [arg]
                    (map (partial cons arg) ; for each recursion-result, glue arg to the front of it
                         recursion-result))
         ; for each item in the first sequence (xs), glue it to the front of
         ; each recursion result and then convert 2-D -> 1-D
         output (mapcat inner-fn xs)]
     output)))
And some unit tests to show it in action:
(dotest
  (is= (cartesian-product [1 2 3]) [[1] [2] [3]])
  (is= (cartesian-product [1 2 3] [:a :b])
       [[1 :a]
        [1 :b]
        [2 :a]
        [2 :b]
        [3 :a]
        [3 :b]])
  (is= (cartesian-product [1 2 3] [:a :b] ["apple" "pea"])
       [[1 :a "apple"]
        [1 :a "pea"]
        [1 :b "apple"]
        [1 :b "pea"]
        [2 :a "apple"]
        [2 :a "pea"]
        [2 :b "apple"]
        [2 :b "pea"]
        [3 :a "apple"]
        [3 :a "pea"]
        [3 :b "apple"]
        [3 :b "pea"]]))

Fast way to estimate item counts above a given threshold? Probabilistic data structure?

I have a large list of values, drawn from the range 0 to 100,000 (represented here as letters for clarity). There might be a few thousand items in each input.
[a a a a b b b b c f d b c f ... ]
I want to find the counts of the items whose counts exceed a certain threshold. For example, if the threshold is 3, the answer is {a: 4, b: 5}.
The obvious way to do this is to group by identity, count each grouping and then filter.
This is a language-agnostic question, but in Clojure (don't be put off if you don't know Clojure!):
(filter (fn [[k cnt]] (> cnt threshold)) (frequencies input))
This function runs over a very large number of inputs, and each input is itself very large, so the grouping and filtering is an expensive operation. I want to find some kind of guard function that will return early if the input can never produce any outputs over the given threshold, or that otherwise partitions the problem space. For example, the most simplistic guard is: if the size of the input is less than the threshold, return nil.
I'm looking for a better guard function that will skip the computation if the input can't produce any outputs. Or a quicker way to produce the output.
Obviously it has to be less expensive than the grouping itself. One great solution involved the count of the input by the distinct set of inputs but that ended up being as expensive as grouping...
I have an idea that probabilistic data structures might hold the key. Any ideas?
(I tagged hyperloglog, although I don't think it applies because it doesn't provide counts)
You might like to look at Narrator. It's designed for 'analyzing and aggregating streams of data'.
A simple query-seq to do what you're initially after is:
(require '[narrator.query :refer [query-seq query-stream]])
(require '[narrator.operators :as n])
(def my-seq [:a :a :b :b :b :b :c :a :b :c])
(query-seq (n/group-by identity n/rate) my-seq)
==> {:a 3, :b 5, :c 2}
Which you can filter as you suggested.
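For example, with a threshold of 2 (a quick sketch):
(let [threshold 2]
  (filter (fn [[_ cnt]] (> cnt threshold))
          (query-seq (n/group-by identity n/rate) my-seq)))
==> ([:a 3] [:b 5])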
You can use quasi-cardinality to quickly determine the number of unique items in your sample (and thus your partition question). It uses HyperLogLog cardinality estimation algorithm for this, e.g.
(query-seq (n/quasi-cardinality) my-seq)
==> 3
quasi-frequency-by demonstrated here:
(defn freq-in-seq
  "returns a function that, when given a value, returns the frequency of that value in the sequence s
   e.g. ((freq-in-seq [:a :a :b :c]) :a) ==> 2"
  [s]
  (query-seq (n/quasi-frequency-by identity) s))

((freq-in-seq my-seq) :a) ==> 3
quasi-distinct-by:
(query-seq (n/quasi-distinct-by identity) my-seq)
==> [:a :b :c]
There's also real-time stream analysis with query-stream.
Here's something showing how you can sample the stream to get the count of changes over each 'period' of values read:
(s/stream->seq
  (->> my-seq
       (map #(hash-map :timestamp %1 :value %2) (range))
       (query-stream (n/group-by identity n/rate)
                     {:value :value :timestamp :timestamp :period 3})))
==> ({:timestamp 3, :value {:a 2, :b 1}}
     {:timestamp 6, :value {:b 3}}
     {:timestamp 9, :value {:a 1, :b 1, :c 1}}
     {:timestamp 12, :value {:c 1}})
The result is a sequence of changes every 3 items (period 3), with the appropriate timestamp.
You can also write custom stream aggregators, which would probably be how you'd go about accumulating the values in the stream above. I had a quick go at these and failed abysmally to get one working (I'm only on my lunch break at the moment), but this works in its place:
(defn lazy-value-accum
  ([s] (lazy-value-accum s {}))
  ([s m]
   (when-not (empty? s)
     (lazy-seq
       (let [new-map (merge-with + m (:value (first s)))]
         (cons new-map
               (lazy-value-accum (rest s) new-map)))))))
(lazy-value-accum
  (s/stream->seq
    (->> my-seq
         (map #(hash-map :timestamp %1 :value %2) (range))
         (query-stream (n/group-by identity n/rate)
                       {:value :value :timestamp :timestamp :period 3}))))
==> ({:a 2, :b 1} {:a 2, :b 4} {:a 3, :b 5, :c 1} {:a 3, :b 5, :c 2})
This shows a gradually accumulating count of each value after every period's worth of samples, and it can be read lazily.
What about using partition-all to produce a lazy list of partitions of maximum size n, applying frequencies to each partition, merging them, and then filtering the final map?
(defn lazy-count-and-filter
  [coll n threshold]
  (filter #(< threshold (val %))
          (apply (partial merge-with +)
                 (map frequencies
                      (partition-all n coll)))))
ex:
(lazy-count-and-filter [:a :c :b :c :a :d :a] 2 1)
==> ([:a 3] [:c 2])
If you're looking to speed up the work on a single node, consider reducers or core.async, as this blog post illustrates.
If this is a very large dataset, and this operation is needed frequently, and you have resources to have a multi-node cluster, you could consider setting up either Storm or Onyx.
Realistically, it sounds like reducers will give you the most benefit for the least amount of work. With all the options that I've listed, the solutions that are more powerful/flexible/faster require more time upfront to understand. In order of simplest to most powerful, they are reducers, core.async, Storm, Onyx.
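As a rough, unbenchmarked sketch of the reducers route (the helper name freqs-over and the chunk size 512 are arbitrary choices, not from any library):
(require '[clojure.core.reducers :as r])

(defn freqs-over
  "Counts the items in coll and keeps only those whose count exceeds threshold.
   r/fold builds frequency maps for chunks of the vector in parallel, then merges them."
  [coll threshold]
  (->> (r/fold 512
               (r/monoid (partial merge-with +) hash-map) ; merge per-chunk maps
               (fn [acc x] (update acc x (fnil inc 0)))   ; count within a chunk
               (vec coll))
       (filter (fn [[_ cnt]] (> cnt threshold)))
       (into {})))

(freqs-over [:a :a :a :a :b :b :b :b :b :c] 3)
==> {:a 4, :b 5}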

Clojure thread-first with filter function

I'm having a problem stringing some forms together to do some ETL on a result set from a korma function.
I get back from korma sql:
({:id 1 :some_field "asd" :children [{:a 1 :b 2 :c 3} {:a 1 :b 3 :c 4} {:a 2 :b 2 :c 3}] :another_field "qwe"})
I'm looking to filter this result set by getting the "children" where the :a keyword is 1.
My attempt:
;mock of korma result
(def data '({:id 1 :some_field "asd" :children [{:a 1 :b 2 :c 3} {:a 1 :b 3 :c 4} {:a 2 :b 2 :c 3}] :another_field "qwe"}))
(-> data
    first
    :children
    (filter #(= (% :a) 1)))
What I'm expecting here is a vector of hashmaps where :a is set to 1, i.e.:
[{:a 1 :b 2 :c 3} {:a 1 :b 3 :c 4}]
However, I'm getting the following error:
IllegalArgumentException Don't know how to create ISeq from: xxx.core$eval3145$fn__3146 clojure.lang.RT.seqFrom (RT.java:505)
From the error I gather it's trying to create a sequence from a function... though I'm just not able to connect the dots as to why.
Further, if I separate the filter function entirely by doing the following:
(let [children (-> data first :children)]
  (filter #(= (% :a) 1) children))
it works. I'm not sure why the thread-first is not applying the filter function, passing in the :children vector as the coll argument.
Any and all help much appreciated.
Thanks
You want the thread-last macro:
(->> data first :children (filter #(= (% :a) 1)))
yields
({:a 1, :b 2, :c 3} {:a 1, :b 3, :c 4})
The thread-first macro in your original code is equivalent to writing:
(filter (:children (first data)) #(= (% :a) 1))
This results in an error because the anonymous function ends up in the collection position, and filter cannot create a seq from a function.
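You can confirm the expansion at the REPL (pred here stands in for the anonymous predicate):
(macroexpand-1 '(-> data first :children (filter pred)))
;;-> (filter (:children (first data)) pred)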
The thread-first (->) and thread-last (->>) macros are always problematical in that it is easy to make a mistake in choosing one over the other (or in mixing them up as you have done here). Break down the steps like so:
(ns tstclj.core
  (:use cooljure.core) ; see https://github.com/cloojure/tupelo/
  (:gen-class))

(def data [{:id 1
            :some_field "asd"
            :children [{:a 1 :b 2 :c 3}
                       {:a 1 :b 3 :c 4}
                       {:a 2 :b 2 :c 3}]
            :another_field "qwe"}])

(def v1 (first data))
(def v2 (:children v1))
(def v3 (filter #(= (% :a) 1) v2))

(spyx v1) ; from tupelo.core/spyx
(spyx v2)
(spyx v3)
You will get results like:
v1 => {:children [{:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1} {:c 3, :b 2, :a 2}], :another_field "qwe", :id 1, :some_field "asd"}
v2 => [{:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1} {:c 3, :b 2, :a 2}]
v3 => ({:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1})
which is what you desired. The problem is that you really needed to use thread-last for the filter form. The most reliable way of avoiding this problem is to always be explicit and use the Clojure as-> threading form, or, even better, it-> from the Tupelo library:
(def result (it-> data
(first it)
(:children it)
(filter #(= (% :a) 1) it)))
By using thread-first, you accidentally wrote the equivalent of this:
(def result (it-> data
(first it)
(:children it)
(filter it #(= (% :a) 1))))
and the error reflects the fact that the function #(= (% :a) 1) can't be cast into a seq. Sometimes, it pays to use a let form and give names to the intermediate results:
(let [result-map   (first data)
      children-vec (:children result-map)
      a1-maps      (filter #(= (% :a) 1) children-vec)]
  (spyx a1-maps))
;;-> a1-maps => ({:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1})
We could also look at either of the two previous solutions and notice that the output of each stage is used as the last argument to the next function in the pipeline. Thus, we could also solve it with thread-last:
(def result3 (->> data
first
:children
(filter #(= (% :a) 1))))
(spyx result3)
;;-> result3 => ({:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1})
Unless your processing chain is very simple, I find it is just about always clearer to use the it-> form to be explicit about how the intermediate value should be used by each stage of the pipeline.
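If you'd rather stay in core Clojure, the as-> form mentioned above reads much the same (a sketch):
(as-> data it
  (first it)
  (:children it)
  (filter #(= (% :a) 1) it))
;;-> ({:a 1, :b 2, :c 3} {:a 1, :b 3, :c 4})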
I'm not sure why the thread-first is not applying the filter function, passing in the :children vector as the coll argument.
This is precisely how the thread-first macro behaves: it inserts the threaded value as the first argument of each form, not the last. From clojuredocs.org:
Threads the expr through the forms. Inserts x as the second item in the first form, making a list of it if it is not a list already.
So, in your case the application of filter ends up being:
(filter [...] #(= (% :a) 1))
If you must use thread-first (instead of thread-last), then you can get around this by partially applying filter and its predicate:
(->
data
first
:children
((partial filter #(= (:a %) 1)))
vec)
; [{:a 1, :b 2, :c 3} {:a 1, :b 3, :c 4}]

Is there a standard way to compare Clojure vectors in the 'conventional' way

Clojure vectors have the uncommon property that, when you compare them, their lengths are considered before their contents. In Haskell, for example:
Prelude> [1, 3] > [1, 2, 3]
True
and Ruby
1.9.3p392 :003 > [1, 3] <=> [1, 2, 3]
=> 1
But in Clojure:
user=> (compare [1, 3] [1, 2, 3])
-1
Now you can implement the 'conventional' comparison yourself:
(defn vector-compare [[value1 & rest1] [value2 & rest2]]
  (let [result (compare value1 value2)]
    (cond
      (not= result 0) result
      (nil? value1)   0 ; value2 will be nil as well
      :else           (recur rest1 rest2))))
but I expect this way of comparing vectors is so common that there is a standard way to achieve this. Is there?
The compare function compares two things if they implement the java.lang.Comparable interface. Vectors in Clojure implement this interface, as shown at this link; the implementation checks the length first. There is no core function that does what you want, so you will have to roll your own.
One other point: the Haskell version is comparing lists (not vectors), and computing a list's length is not cheap, so it makes sense for lists to avoid the length check when comparing. A Clojure vector's length, by contrast, is an O(1) operation, so checking it first is reasonable.
Something like this?
(first (filter (complement zero?) (map compare [1 3] [1 2 3])))
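Note that this one-liner returns nil rather than 0 for equal vectors, and also nil when one vector is a prefix of the other, since map stops at the shorter input. A sketch with a length fallback for those cases (the name lexi-compare is just for illustration):
(defn lexi-compare [v1 v2]
  (or (first (filter (complement zero?) (map compare v1 v2)))
      (compare (count v1) (count v2))))

(lexi-compare [1 3] [1 2 3]) ;=> 1
(lexi-compare [1 2] [1 2 3]) ;=> -1, matching the Haskell and Ruby behaviour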

sorting according to a custom comparator

I've got a map looking like this:
user> (frequencies "aaabccddddee")
{\a 3, \b 1, \c 2, \d 4, \e 2}
And I'd like to have a function that sorts the key/value pairs according to the order in which each character appears in a string I'd pass as an argument.
Something like this:
user> (somesort "defgcab" (frequencies "aaabccddddee"))
[[\d 4] [\e 2] [\c 2] [\a 3] [\b 1]]
(In the example above, 'f' and 'g' do not appear in the map and are hence ignored. It is guaranteed that the string -- "defgcab" in this example -- contains every character/key in the map.)
The resulting collection doesn't matter much as long as it is sorted.
I've tried several things but cannot find a way to make this work.
I kinda prefer using sort-by to do the sorting logic, and just create a custom sort key for your collection:
(defn sorter [coll] (zipmap coll (range)))

(sort-by (comp (sorter "defgcab") key)
         (frequencies "aaabccddddee"))
;=> ([\d 4] [\e 2] [\c 2] [\a 3] [\b 1])
Edit: this has the further advantage that you can keep your collection a map if you want, although you have to do a little more work:
(defn map-sorter [coll]
  (let [order (zipmap coll (range))]
    (fn [a b]
      (compare (order a) (order b)))))

(into (sorted-map-by (map-sorter "defgcab"))
      (frequencies "aaabccddddee"))
;=> {\d 4, \e 2, \c 2, \a 3, \b 1}
(defn somesort [str st]
  (filter (fn [[k v]] v)
          (map (fn [c] [c (get st c)]) str)))
How this works:
Using map on the "sorting string": for each character in that string, build a [key value] vector by looking the character up in the map
Using filter: drop the pairs whose value is nil (characters not present in the map)
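For example, with the data from the question:
(somesort "defgcab" (frequencies "aaabccddddee"))
;=> ([\d 4] [\e 2] [\c 2] [\a 3] [\b 1])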
Ankur's solution expressed with for, which is maybe a bit easier to read:
(defn somesort [str st]
  (for [c str
        :let [v (get st c)]
        :when v]
    [c v]))
This works assuming the characters in the custom sorting string are unique.
The sorting string is iterated over once, each of its characters is looked up in a map. If you make sure the map passed to the function is a hash map, then this is linear.
You can also use find, which returns a [k v] map entry, with map (dropping the nil entries for characters absent from the map):
(let [fs (frequencies "abbccc")]
  (remove nil? (map #(find fs %) "defgcab")))
;=> ([\c 3] [\a 1] [\b 2])
