Clojure thread-first with filter function

I'm having a problem stringing some forms together to do some ETL on a result set from a korma function.
I get back from korma sql:
({:id 1 :some_field "asd" :children [{:a 1 :b 2 :c 3} {:a 1 :b 3 :c 4} {:a 2 :b 2 :c 3}] :another_field "qwe"})
I'm looking to filter this result set by getting the "children" where the :a keyword is 1.
My attempt:
;mock of korma result
(def data '({:id 1 :some_field "asd" :children [{:a 1 :b 2 :c 3} {:a 1 :b 3 :c 4} {:a 2 :b 2 :c 3}] :another_field "qwe"}))
(-> data
    first
    :children
    (filter #(= (% :a) 1)))
What I'm expecting here is a vector of the hashmaps in which :a is 1, i.e.:
[{:a 1 :b 2 :c 3} {:a 1 :b 3 :c 4}]
However, I'm getting the following error:
IllegalArgumentException Don't know how to create ISeq from: xxx.core$eval3145$fn__3146 clojure.lang.RT.seqFrom (RT.java:505)
From the error I gather it's trying to create a sequence from a function...though just not able to connect the dots as to why.
Further, if I separate the filter function entirely by doing the following:
(let [children (-> data first :children)]
  (filter #(= (% :a) 1) children))
it works. I'm not sure why thread-first is not applying the filter function, passing in the :children vector as the coll argument.
Any and all help much appreciated.
Thanks

You want the thread-last macro:
(->> data first :children (filter #(= (% :a) 1)))
yields
({:a 1, :b 2, :c 3} {:a 1, :b 3, :c 4})
The thread-first macro in your original code is equivalent to writing:
(filter (:children (first data)) #(= (% :a) 1))
This results in an error because your anonymous function ends up in the collection position, and a function is not a sequence.
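You can check this at the REPL with macroexpand-1 (a one-step sketch; x and pred stand in for the real values):
(macroexpand-1 '(-> x (filter pred)))
;; => (filter x pred)
The threaded value lands in the first argument position, which is where filter expects its predicate.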

The thread-first (->) and thread-last (->>) macros are always problematic in that it is easy to make a mistake in choosing one over the other (or to mix them up, as you have done here). Break down the steps like so:
(ns tstclj.core
  (:use cooljure.core) ; see https://github.com/cloojure/tupelo/
  (:gen-class))

(def data [{:id 1 :some_field "asd"
            :children [{:a 1 :b 2 :c 3}
                       {:a 1 :b 3 :c 4}
                       {:a 2 :b 2 :c 3}]
            :another_field "qwe"}])
(def v1 (first data))
(def v2 (:children v1))
(def v3 (filter #(= (% :a) 1) v2))
(spyx v1) ; from tupelo.core/spyx
(spyx v2)
(spyx v3)
You will get results like:
v1 => {:children [{:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1} {:c 3, :b 2, :a 2}], :another_field "qwe", :id 1, :some_field "asd"}
v2 => [{:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1} {:c 3, :b 2, :a 2}]
v3 => ({:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1})
which is what you desired. The problem is that you really needed to use thread-last for the filter form. The most reliable way of avoiding this problem is to always be explicit and use the Clojure as-> threading form, or, even better, it-> from the Tupelo library:
(def result (it-> data
              (first it)
              (:children it)
              (filter #(= (% :a) 1) it)))
By using thread-first, you accidentally wrote the equivalent of this:
(def result (it-> data
              (first it)
              (:children it)
              (filter it #(= (% :a) 1))))
and the error reflects the fact that the function #(= (% :a) 1) can't be cast into a seq. Sometimes, it pays to use a let form and give names to the intermediate results:
(let [result-map   (first data)
      children-vec (:children result-map)
      a1-maps      (filter #(= (% :a) 1) children-vec)]
  (spyx a1-maps))
;;-> a1-maps => ({:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1})
We could also look at either of the two previous solutions and notice that the output of each stage is used as the last argument to the next function in the pipeline. Thus, we could also solve it with thread-last:
(def result3 (->> data
                  first
                  :children
                  (filter #(= (% :a) 1))))
(spyx result3)
;;-> result3 => ({:c 3, :b 2, :a 1} {:c 4, :b 3, :a 1})
Unless your processing chain is very simple, I find it is just about always clearer to use the it-> form to be explicit about how the intermediate value should be used by each stage of the pipeline.
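For reference, the same pipeline written with the built-in clojure.core/as-> (mentioned above, and requiring no extra library) looks like this:
(as-> data it
  (first it)
  (:children it)
  (filter #(= (% :a) 1) it))
;; => ({:a 1, :b 2, :c 3} {:a 1, :b 3, :c 4})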

I'm not sure why thread-first is not applying the filter function, passing in the :children vector as the coll argument.
The thread-first macro does apply the filter function, and it does pass in the :children vector, just not as the coll argument.
From clojuredocs.org:
Threads the expr through the forms. Inserts x as the
second item in the first form, making a list of it if it is not a
list already.
So, in your case the application of filter ends up being:
(filter [...] #(= (% :a) 1))
If you must use thread-first (instead of thread-last), then you can get around this by partially applying filter and its predicate:
(-> data
    first
    :children
    ((partial filter #(= (:a %) 1)))
    vec)
; [{:a 1, :b 2, :c 3} {:a 1, :b 3, :c 4}]
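Another workaround, sketched here, is to wrap the step in an anonymous function; the extra parentheses call it with the threaded value:
(-> data
    first
    :children
    (#(filter (fn [m] (= (:a m) 1)) %))
    vec)
; [{:a 1, :b 2, :c 3} {:a 1, :b 3, :c 4}]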

Related

Convert vector of strings into hash-map in Clojure

I have the following data structure:
["a 1" "b 2" "c 3"]
How can I transform that into a hash-map?
I want the following data structure:
{:a 1 :b 2 :c 3}
Use clojure.string/split and then use keyword and Integer/parseInt:
(->> ["a 1" "b 2" "c 3"]
(map #(clojure.string/split % #" "))
(map (fn [[k v]] [(keyword k) (Integer/parseInt v)]))
(into {}))
=> {:a 1, :b 2, :c 3}
and one more :)
(->> ["a 1" "b 2" "c 3"]
(clojure.pprint/cl-format nil "{~{:~a ~}}")
clojure.edn/read-string)
;;=> {:a 1, :b 2, :c 3}
(into {}
      (map #(clojure.edn/read-string (str "[:" % "]")))
      ["a 1" "b 2" "c 3"])
;; => {:a 1, :b 2, :c 3}
(def x ["a 1" "b 2" "c 3"])
(clojure.edn/read-string (str "{:" (clojure.string/join " :" x) "}"))
;;=> {:a 1, :b 2, :c 3}

Calling up rand-int into a variable

I need to create random numbers and store them into a variable for multiple calculations and for different calls.
(defn position []
  (def x (rand-int 2147483647))
  (def y (rand-int 2147483647))
  (def z (rand-int 2147483647)))
What I want to do is call this function in a loop, do calculations with the values, and store the results away.
Could anyone help, please? There is probably a better way.
If you want an arbitrary number of things, you need to use an arbitrarily sized data structure. In this case you can probably use the function repeatedly:
(repeatedly 5 #(rand-int 2147483647))
In this example we take 5 elements (change this to as many as you need) from repeatedly running the no-argument anonymous function #(rand-int 2147483647), which is what you seem to need.
To generate an infinite lazy sequence of random ints you can use:
(repeatedly #(rand-int 2147483647))
To generate many positions you can also use repeatedly:
(defn rand-position []
  {:x (rand-int 2147483647)
   :y (rand-int 2147483647)
   :z (rand-int 2147483647)})

(def positions (repeatedly rand-position))
(take 5 positions) ;; will generate 5 random positions, each represented as a map
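From there, the "call it in a loop and do calculations" part is just mapping over the generated positions. A minimal sketch (the sum is only a stand-in for your real calculation):
(def results
  (mapv (fn [{:keys [x y z]}] (+ x y z)) ; example calculation per position
        (take 5 positions)))
;; results is a vector of five numbers, one per random position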
With functional programming, I wouldn't suggest you go down the path of defining global variables (using def). It is better to design your functions to operate on some data structure and return the same or another data structure. In your case, the data structure is called a position. Given it is not a primitive (like an integer), you have to decide how to model it. With Clojure, you can pick either a vector or a map for it, and then all your functions have to follow. Here is how I would go through this development process:
Warning: long reply ahead...
Prologue
(require '[clojure.spec.alpha :as s])
(require '[clojure.spec.gen.alpha :as gen])
(def gen-samples
  "A function to create a generator, and to generate
  some samples from it in one step"
  (comp gen/sample s/gen))
Spec out your data structure
;; a dimension is an integer from 0 (inclusive) to 2147483647 (exclusive)
(s/def ::dim (s/int-in 0 2147483647))
(gen-samples ::dim)
;; => (1 0 2 1 0 0 2 0 0 27)
;; Option 1: position as a collection of 3 dimensions
(s/def ::position (s/coll-of ::dim :count 3))
(gen-samples ::position)
;; => ([0 0 0] [0 0 0] [1 0 0] [0 1 1] [1 0 1] [3 2 1] [26 1 0] [7 1 1] [6 24 1] [2 0 21])
;; Option 2: position as a map - with x,y,z as keys and with dimension as values
(s/def ::x ::dim)
(s/def ::y ::dim)
(s/def ::z ::dim)
(s/def ::position (s/keys :req-un [::x ::y ::z]))
(gen-samples ::position)
;; => ({:x 1, :y 1, :z 0} {:x 0, :y 0, :z 0} {:x 1, :y 2, :z 1} {:x 1, :y 2, :z 0} {:x 2, :y 2, :z 5} {:x 4, :y 1, :z 13} {:x 4, :y 8, :z 7} {:x 2, :y 5, :z 10} {:x 22, :y 3, :z 4} {:x 124, :y 1, :z 8})
Assuming you take option 2, now spec out your function
;; in this case, move-east is a function which takes a position
;; and returns another position - with x-dimension of the
;; new position always greater than the old one
(s/fdef move-east
  :args (s/cat :pos ::position)
  :ret ::position
  :fn #(> (-> % :ret :x) (-> % :args :pos :x)))
Implementation - the easy part
(defn move-east [pos]
  (update pos :x inc))
Some manual test
(-> ::position gen-samples first)
;; => {:x 1, :y 1, :z 0}
(move-east *1)
;; => {:x 2, :y 1, :z 0}
Auto test based on the spec
(require '[clojure.spec.test.alpha :as stest])
(-> `move-east
    stest/check
    stest/summarize-results)
;; => {:total 1, :check-passed 1}
;; what if my function is wrong?
(defn move-east [pos]
  (update pos :x dec))
(-> `move-east
    stest/check
    stest/summarize-results)
;; => {:total 1, :check-failed 1}
;; what is wrong?
(-> `move-east
    stest/check
    first
    stest/abbrev-result)
;; which basically returns a result like below...
;; showing return x is -1 and hence fails the ::dim spec
{:clojure.spec.alpha/problems
 ({:path [:ret :x],
   :pred (clojure.core/fn [%]
           (clojure.spec.alpha/int-in-range? 0 2147483647 %)),
   :val -1,
   :via [:play/dim],
   :in [:x]}),
 :clojure.spec.alpha/failure :check-failed}

Fast way to estimate item counts above a given threshold? Probabilistic data structure?

I have a large list of values, drawn from the range 0 to 100,000 (represented here as letters for clarity). There might be a few thousand items in each input.
[a a a a b b b b c f d b c f ... ]
I want to find the items whose counts are over a certain threshold. For example, if the threshold is 3, the answer is {a: 4, b: 5}.
The obvious way to do this is to group by identity, count each grouping and then filter.
This is a language agnostic question, but in Clojure (don't be put off if you don't know Clojure!):
(filter (fn [[k cnt]] (> cnt threshold)) (frequencies input))
This function runs over a very large number of inputs, and each input is very large, so the grouping and filtering is an expensive operation. I want to find some kind of guard function that will return early if the input can never produce any outputs over the given threshold, or that will otherwise partition the problem space. For example, the most simplistic guard is: if the size of the input is less than the threshold, return nil.
I'm looking for a better guard function that will skip the computation if the input can't produce any outputs, or a quicker way to produce the output.
Obviously it has to be less expensive than the grouping itself. One great solution involved counting the input by the distinct set of inputs, but that ended up being as expensive as the grouping...
I have an idea that probabilistic data structures might hold the key. Any ideas?
(I tagged hyperloglog, although I don't think it applies because it doesn't provide counts)
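For concreteness, the trivial guard mentioned above would be a sketch like this (cheap, but it only helps when the input is shorter than the threshold):
(defn counts-over [input threshold]
  ;; if the input has no more than `threshold` items, no value can
  ;; occur more than `threshold` times, so skip the expensive step
  (when (> (count input) threshold)
    (filter (fn [[k cnt]] (> cnt threshold)) (frequencies input))))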
You might like to look at Narrator. It's designed for 'analyzing and aggregating streams of data'.
A simple query-seq to do what you're initially after is:
(require '[narrator.query :refer [query-seq query-stream]])
(require '[narrator.operators :as n])
(def my-seq [:a :a :b :b :b :b :c :a :b :c])
(query-seq (n/group-by identity n/rate) my-seq)
==> {:a 3, :b 5, :c 2}
Which you can filter as you suggested.
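For example, with a threshold of 2 for this data (a quick sketch):
(into {}
      (filter (fn [[_ cnt]] (> cnt 2))
              (query-seq (n/group-by identity n/rate) my-seq)))
==> {:a 3, :b 5}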
You can use quasi-cardinality to quickly determine the number of unique items in your sample (and thus answer your partition question). It uses the HyperLogLog cardinality estimation algorithm for this, e.g.
(query-seq (n/quasi-cardinality) my-seq)
==> 3
quasi-frequency-by demonstrated here:
(defn freq-in-seq
  "returns a function that, when given a value, returns the frequency
  of that value in the sequence s
  e.g. ((freq-in-seq [:a :a :b :c]) :a) ==> 2"
  [s]
  (query-seq (n/quasi-frequency-by identity) s))
((freq-in-seq my-seq) :a) ==> 3
quasi-distinct-by:
(query-seq (n/quasi-distinct-by identity) my-seq)
==> [:a :b :c]
There's also real-time stream analysis with query-stream.
Here's something showing you how you can sample the stream to get count of changes over 'period' values read:
(s/stream->seq
  (->> my-seq
       (map #(hash-map :timestamp %1 :value %2) (range))
       (query-stream (n/group-by identity n/rate)
                     {:value :value :timestamp :timestamp :period 3})))
==> ({:timestamp 3, :value {:a 2, :b 1}} {:timestamp 6, :value {:b 3}} {:timestamp 9, :value {:a 1, :b 1, :c 1}} {:timestamp 12, :value {:c 1}})
The result is a sequence of changes every 3 items (period 3), with the appropriate timestamp.
You can also write custom stream aggregators, which would probably be how you'd go about accumulating the values in the stream above. I had a quick go at these and failed abysmally to get them working (I'm only on my lunch break at the moment), but this works in their place:
(defn lazy-value-accum
  ([s] (lazy-value-accum s {}))
  ([s m]
   (when-not (empty? s)
     (lazy-seq
       (let [new-map (merge-with + m (:value (first s)))]
         (cons new-map
               (lazy-value-accum (rest s) new-map)))))))
(lazy-value-accum
  (s/stream->seq
    (->> my-seq
         (map #(hash-map :timestamp %1 :value %2) (range))
         (query-stream (n/group-by identity n/rate)
                       {:value :value :timestamp :timestamp :period 3}))))
==> ({:a 2, :b 1} {:a 2, :b 4} {:a 3, :b 5, :c 1} {:a 3, :b 5, :c 2})
which shows a gradually accumulating count of each value after every period samples, and can be read lazily.
What about using partition-all to produce a lazy list of partitions of maximum size n, applying frequencies to each partition, merging them, and then filtering the final map?
(defn lazy-count-and-filter
  [coll n threshold]
  (filter #(< threshold (val %))
          (apply (partial merge-with +)
                 (map frequencies
                      (partition-all n coll)))))
ex:
(lazy-count-and-filter [:a :c :b :c :a :d :a] 2 1)
==> ([:a 3] [:c 2])
If you're looking to speed up the work on a single node, consider reducers or core.async, as this blog post illustrates.
If this is a very large dataset, and this operation is needed frequently, and you have resources to have a multi-node cluster, you could consider setting up either Storm or Onyx.
Realistically, it sounds like reducers will give you the most benefit for the least amount of work. With all the options that I've listed, the solutions that are more powerful/flexible/faster require more time upfront to understand. In order of simplest to most powerful, they are reducers, core.async, Storm, Onyx.
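To make the reducers option concrete, here is a minimal sketch of my own (not from the blog post): a parallel frequency count with r/fold, which partitions a vector input and merges the partial count maps:
(require '[clojure.core.reducers :as r])

(defn pfrequencies [coll] ; coll should be a vector for parallel fold
  (r/fold (fn ([] {}) ([m1 m2] (merge-with + m1 m2))) ; combine partial maps
          (fn [m x] (assoc m x (inc (get m x 0))))    ; count within a chunk
          coll))

(defn counts-over [threshold coll]
  (into {} (filter (fn [[_ cnt]] (> cnt threshold))
                   (pfrequencies coll))))

(counts-over 3 [:a :a :a :a :b :b :b :b :b :c])
==> {:a 4, :b 5}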

Clojure sort map over value

I'm trying to sort a map over the values.
The input-map looks like:
{:Blabla 1, :foo 1, :bla-bla 1, :Bla 2, :bla/bla 1, :bla 4, :blub 2, :hello 1, :Foo 2}
The output should look like:
{:bla 4 :Bla 2 :blub 2 :Foo 2 :Blabla 1 :bla-bla 1 :bla/bla 1 :foo 1 :hello 1}
I used sorted-map-by like in the documentation here:
http://clojuredocs.org/clojure.core/sorted-map-by
(defn sort-keyword-list [texts]
  (let [results (word-counts texts)]
    ;; results is now {:Blabla 1, :foo 1, :bla-bla 1, :Bla 2, :bla/bla 1, :bla 4, :blub 2, :hello 1, :Foo 2}
    (into (sorted-map-by (fn [key1 key2]
                           (compare [(get results key2) key2]
                                    [(get results key1) key1])))
          results)))
Well I found out that this solution works only if the keywords have no special characters like "/" or "-" inside. Is this a known bug?
So how can I sort a map by values quickly, without writing my own slow sort algorithm?
In my Clojure 1.6.0 REPL, the code in the question already sorts by value:
user=> (into (sorted-map-by (fn [key1 key2]
                              (compare [(get x key2) key2]
                                       [(get x key1) key1])))
             x)
{:bla 4, :blub 2, :Foo 2, :Bla 2, :bla/bla 1, :hello 1, :foo 1, :bla-bla 1, :Blabla 1}
If you want entries with the same value to be sorted by key, you need to stringify the keys. Here's why:
user=> x
{:bla-bla 1, :Blabla 1, :bla/bla 1, :hello 1, :bla 4, :foo 1, :Bla 2, :Foo 2, :blub 2}
user=> (sort (keys x))
(:Bla :Blabla :Foo :bla :bla-bla :blub :foo :hello :bla/bla)
user=> (sort (map str (keys x)))
(":Bla" ":Blabla" ":Foo" ":bla" ":bla-bla" ":bla/bla" ":blub" ":foo" ":hello")
Here is a solution based on the suggestion by #user100464 with explicit considerations of comparison of keys, when the values are the same.
Note: I chose to sort in decreasing order by reversing the order of the arguments to the comparisons: (compare (x k2) (x k1)) and (compare k2 k1).
(defn sort-by-value-then-key [x]
  (into (sorted-map-by (fn [k1 k2]
                         (let [v-c (compare (x k2) (x k1))]
                           (if (zero? v-c)
                             (compare k2 k1)
                             v-c)))) ; fall back to the value comparison
        x))
One may customize at (compare k2 k1) to implement more elaborate key comparison.
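For example, to also get the alphabetical tie-breaking discussed in the previous answer, the keys can be stringified inside the comparator; a sketch building on the function above:
(defn sort-by-value-then-key-str [x]
  (into (sorted-map-by (fn [k1 k2]
                         (let [v-c (compare (x k2) (x k1))]
                           (if (zero? v-c)
                             (compare (str k1) (str k2)) ; namespaced keywords sort alphabetically too
                             v-c))))
        x))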

Clojure: I have many sorted maps and want to reduce all their values in order into a super map of keys -> vectors

I have seen this but can't work out how to apply it (no pun intended) to my situation.
I have a sorted list of maps like this: (note there can be more than two keys in the map)
({:name1 3, :name2 7}, {:name1 35, :name2 7}, {:name1 0, :name2 3})
What I am after is this data structure afterwards:
({:name1 [3,35,0]}, {:name2 [7,7,3]})
I've been struggling with this for a while and can't seem to get anywhere near.
Caveats: The data must stay sorted and I have N keywords not just two.
I'd go for merge-with with some preprocessing added:
(def maps [{:a :b :e :f} {:a :c} {:a :d :e :g}])
(apply merge-with concat (for [m maps [k v] m] {k [v]}))
>>> {:e (:f :g), :a (:b :c :d)}
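If you want vectors in the result, as in the desired output, substituting into for concat keeps the per-key values as vectors (a small variation on the same idea):
(apply merge-with into (for [m maps [k v] m] {k [v]}))
>>> {:a [:b :c :d], :e [:f :g]}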
I think the function you want is merge-with:
user=> (def x {:a 1 :b 2})
user=> (def y {:a 3 :b 4})
user=> (merge-with vector x y)
{:a [1 3], :b [2 4]}
user=>
user=> (def z {:a 5 :b 6 :c 7})
user=> (merge-with vector x y z)
{:a [[1 3] 5], :c 7, :b [[2 4] 6]} ; oops
user=> (merge-with #(vec (flatten (vector %1 %2))) x y z)
{:a [1 3 5] :b [2 4 6] :c 7}
user=>
This is my attempt at the problem. It is not as elegant as Rafal's solution.
(def maps [{:a :b :e :f} {:a :c} {:a :d :e :g}])
(apply merge-with #(if (list? %1) (conj %1 %2) (list %1 %2)) maps)
>>> {:a (:d :b :c), :e (:f :g)}
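Applying the same merge-with idea to the data from the question (assuming the keys are keywords, as in the desired output) gives exactly the requested shape:
(def rows [{:name1 3 :name2 7} {:name1 35 :name2 7} {:name1 0 :name2 3}])
(apply merge-with into (for [m rows [k v] m] {k [v]}))
>>> {:name1 [3 35 0], :name2 [7 7 3]}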
