clojure sqlkorma library: out of memory error - jdbc

I'm doing what I thought was a fairly straightforward task: run a SQL query (over about 65K rows of data) using the sqlkorma library (http://sqlkorma.com), transform each row in some way, and then write the result to a CSV file. I don't really think that 65K rows is all that large given that I have an 8GB laptop, but I also assumed that a SQL result set would be fetched lazily and so the whole thing would never be held in memory at once. So I was really surprised when I ended up with this stack trace:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at clojure.lang.PersistentHashMap$BitmapIndexedNode.assoc(PersistentHashMap.java:777)
at clojure.lang.PersistentHashMap.createNode(PersistentHashMap.java:1101)
at clojure.lang.PersistentHashMap.access$600(PersistentHashMap.java:28)
at clojure.lang.PersistentHashMap$BitmapIndexedNode.assoc(PersistentHashMap.java:749)
at clojure.lang.PersistentHashMap$TransientHashMap.doAssoc(PersistentHashMap.java:269)
at clojure.lang.ATransientMap.assoc(ATransientMap.java:64)
at clojure.lang.PersistentHashMap.create(PersistentHashMap.java:56)
at clojure.lang.PersistentHashMap.create(PersistentHashMap.java:100)
at clojure.lang.PersistentArrayMap.createHT(PersistentArrayMap.java:61)
at clojure.lang.PersistentArrayMap.assoc(PersistentArrayMap.java:201)
at clojure.lang.PersistentArrayMap.assoc(PersistentArrayMap.java:29)
at clojure.lang.RT.assoc(RT.java:702)
at clojure.core$assoc.invoke(core.clj:187)
at clojure.core$zipmap.invoke(core.clj:2715)
at clojure.java.jdbc$resultset_seq$thisfn__204.invoke(jdbc.clj:243)
at clojure.java.jdbc$resultset_seq$thisfn__204$fn__205.invoke(jdbc.clj:243)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.Cons.next(Cons.java:39)
at clojure.lang.PersistentVector.create(PersistentVector.java:51)
at clojure.lang.LazilyPersistentVector.create(LazilyPersistentVector.java:31)
at clojure.core$vec.invoke(core.clj:354)
at korma.db$exec_sql$fn__343.invoke(db.clj:203)
at clojure.java.jdbc$with_query_results_STAR_.invoke(jdbc.clj:669)
at korma.db$exec_sql.invoke(db.clj:202)
at korma.db$do_query$fn__351.invoke(db.clj:225)
at clojure.java.jdbc$with_connection_STAR_.invoke(jdbc.clj:309)
at korma.db$do_query.invoke(db.clj:224)
at korma.core$exec.invoke(core.clj:474)
at db$query_db.invoke(db.clj:23)
at main$_main.doInvoke(main.clj:32)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
As far as I can tell from the stack trace, execution has not made it outside the query code (meaning it hasn't reached my transformation/write-to-CSV code at all). If it matters, my SQL is fairly straightforward, basically SELECT * FROM my_table WHERE SOME_ID IS NOT NULL AND ROWNUM < 65000 ORDER BY some_id ASC. This is Oracle (to explain ROWNUM above), but I don't think that matters.
EDIT:
Code sample:
(defmacro query-and-print [q] `(do (dry-run ~q) ~q))
(defn query-db []
  (query-and-print
    (select my_table
            (where (and (not= :MY_ID "BAD DATA")
                        (not= :MY_ID nil)
                        (raw (str "rownum < " rows))))
            (order :MY_ID :asc))))
; args contains rows 65000, and configure-app sets up the jdbc
; connection string, and sets a var with rows value
(defn -main [& args]
  (when (configure-app args)
    (let [results (query-db)
          dedup   (dedup-with-merge results)]
      (println "Result size: " (count results))
      (println "Dedup size: " (count dedup))
      (to-csv "target/out.csv" (transform-data dedup)))))

clojure.java.jdbc creates lazy sequences:
(defn resultset-seq
  "Creates and returns a lazy sequence of maps corresponding to
   the rows in the java.sql.ResultSet rs. Based on clojure.core/resultset-seq
   but it respects the current naming strategy. Duplicate column names are
   made unique by appending _N before applying the naming strategy (where
   N is a unique integer)."
  [^ResultSet rs]
  (let [rsmeta (.getMetaData rs)
        idxs (range 1 (inc (.getColumnCount rsmeta)))
        keys (->> idxs
                  (map (fn [^Integer i] (.getColumnLabel rsmeta i)))
                  make-cols-unique
                  (map (comp keyword *as-key*)))
        row-values (fn [] (map (fn [^Integer i] (.getObject rs i)) idxs))
        rows (fn thisfn []
               (when (.next rs)
                 (cons (zipmap keys (row-values)) (lazy-seq (thisfn)))))]
    (rows)))
Korma, however, fully realizes that lazy sequence by pouring every row into a vector:
(defn- exec-sql [{:keys [results sql-str params]}]
  (try
    (case results
      :results (jdbc/with-query-results rs (apply vector sql-str params)
                 (vec rs))
      :keys (jdbc/do-prepared-return-keys sql-str params)
      (jdbc/do-prepared sql-str params))
    (catch Exception e
      (handle-exception e sql-str params))))
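Given that, one workaround is to bypass Korma for this particular query and call clojure.java.jdbc directly, consuming the lazy row sequence while the connection is still open and writing each CSV row as you go. Here is a minimal sketch using the 0.x-era clojure.java.jdbc API visible in the stack trace; db-spec, transform-row, and write-csv-row! are assumed placeholders for your own connection map and CSV code:
(require '[clojure.java.jdbc :as jdbc])

;; Sketch only: stream rows one at a time instead of realizing them all.
;; db-spec, transform-row, and write-csv-row! are hypothetical names.
(defn export-to-csv! [db-spec sql writer]
  (jdbc/with-connection db-spec
    (jdbc/with-query-results rows [sql]
      ;; doseq consumes the lazy seq without retaining its head,
      ;; so only the current row needs to be held in memory
      (doseq [row rows]
        (write-csv-row! writer (transform-row row))))))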

Besides the with-lazy-results route in https://github.com/korma/Korma/pull/66, a completely different way to resolve the problem is to simply increase the heap size available to your JVM by setting the appropriate flag. JVMs are not allowed to use all the free memory on your machine; they are strictly limited to the amount you tell them they're allowed to use.
One way to do this is to set :jvm-opts ["-Xmx4g"] in your project.clj file. (Adjust the exact heap size as necessary.) Another way is to do something like:
export JAVA_OPTS=-Xmx4g
lein repl # or whatever launches your Clojure process
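For completeness, a minimal project.clj using the first approach might look like this (the project name and dependency versions here are only illustrative):
(defproject my-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [korma "0.3.0"]]
  ;; give the JVM a 4 GB heap; adjust as needed
  :jvm-opts ["-Xmx4g"])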
The with-lazy-results route is better in the sense that you can operate on any sized result set, but it's not merged into mainline Korma and requires some updating to work with recent versions. It's good to know how to adjust the JVM's allowed heap size anyway.

Related

Randomly sampling an item always returns the same value

I am attempting to write a script that samples one of my colleagues' names randomly:
#!/usr/bin/env gosh -b
(define-module utils
  (use data.random :only (samples$ integers-between$))
  (use gauche.generator :only (generator->list))
  (export generator->first
          sample1)
  (define (generator->first gen)
    (car (generator->list gen 1)))
  (define (sample1 items)
    (generator->first (samples$ items))))

(import (utils :only (sample1)))

(define team-members
  (list "billy"
        "nilly"
        "silly"
        "willy"))

(define (main args)
  (print (sample1 team-members))
  0)
However it always emits the same value:
for i in {1..25}; do
gosh ./random-team-member
done
Why is this happening?
Do I need to somehow initialize the random number generator? Am I misusing generator->list?
Observe the documentation for data.random.
The random seed is initialized by a fixed value when the module is loaded. You can get and set the random seed [with random-data-seed].
I don't see any obvious way to initialize the random seed from some other source of randomness, such as the system clock. So if you want unpredictable randomness, you'll need to find entropy yourself - perhaps read from /dev/urandom?

Completion for frame local variables from backtrace

I'm trying to add completion at point for frame-local variables from backtrace-frames during invocations of read--expression by debugger-eval-expression or edebug-eval-expression.
I constructed the following completion table to add frame-local variables to the already available table for local elisp variables,
;; completion table for locals in current frame
(defvar my-backtrace-locals-completion-table
  (completion-table-in-turn
   (completion-table-dynamic
    (lambda (_string)
      (when-let* ((idx (backtrace-get-index)) ;backtrace.el
                  (frame (nth idx backtrace-frames)))
        (backtrace-frame-locals frame)))
    'do-switch-buffer)
   elisp--local-variables-completion-table)) ;elisp-mode.el
which seems to work fine. E.g., to reproduce:
(1) evaluate
;; debug-on-error = t
(let ((my-local-var '(1 2))) (mapcan #'car this-local-var))
(2) from the debugger's second frame, evaluate with eval-expression:
(funcall my-backtrace-locals-completion-table "my-" nil t)
which returns the expected ("my-local-var").
The problem is that following the above steps, but calling debugger-eval-expression instead, doesn't work. The environment where the table is evaluated isn't finding a backtrace-frame (with or without do-switch-buffer).
How can I define the table to be evaluated in the proper buffer?
emacs v27.0.50
The completion table above doesn't quite return the expected candidates for debugger-eval-expression. The environment where the expression is evaluated has locals from frames higher than, but not including, the one at point in the Backtrace buffer.
So, the available locals should be only those from higher frames, e.g.:
(eval-when-compile (require 'dash))

(defvar my-backtrace-locals-completion-table
  (completion-table-dynamic
   (lambda (_string)
     (when backtrace-frames
       (--mapcat
        (-some->> (backtrace-frame-locals it) (--map (car it)))
        (nthcdr (1+ (backtrace-get-index)) backtrace-frames))))
   'do-switch-buffer))
Then, redefining debugger-eval-expression's interactive spec to use the new locals table in place of the normal elisp table provides the correct completions (passing the 'do-switch-buffer arg to completion-table-dynamic so it finds completions in the original buffer).
(defun my-backtrace#eval (orig-fn exp &optional nframe)
  (interactive
   (let ((elisp--local-variables-completion-table
          my-backtrace-locals-completion-table))
     (list (read--expression "[my] Eval in stack frame: "))))
  (apply orig-fn (list exp nframe)))

(advice-add 'debugger-eval-expression :around #'my-backtrace#eval)

Clojure jdbc - query single column flattened result

I'm trying to read data (about 760k rows) from a single column into one (flattened) vector. The result of clojure.java.jdbc/query is a seq of maps, e.g. ({:key "a"} {:key "b"} ...). With the option :as-arrays? true provided, [[:key] ["a"] ["b"] ...] is returned. To flatten the result, I've also used the option :row-fn first and got [:key "a" "b" ...]. Finally, I applied rest to get rid of the :key.
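Concretely, that flattening approach looks something like this (the db spec, table, and column names are assumed):
;; sketch of the approach described above; names are assumed
(def flat-keys
  (rest (jdbc/query db ["SELECT key FROM tbl"]
                    {:as-arrays? true :row-fn first})))
;; => ("a" "b" ...)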
Wrapping and unwrapping of rows with vectors seems like a lot of unnecessary work. I'm also not happy with performance. Is there a faster / more idiomatic way? I've tried...
;; assumes (:import (java.sql Statement ResultSet) (java.util ArrayList))
(jdbc/with-db-connection [con -db-spec-]
  (with-open [^Statement stmt (.createStatement (:connection con))
              ^ResultSet res (.executeQuery stmt query)]
    (let [ret (ArrayList.)]
      (while (.next res)
        (.add ret (.getString res 1)))
      (into [] ret))))
... but it's not much faster, and it's ugly.
EDIT
A nicer way to do it is via transducers (see here):
(into []
      (map :key)
      (jdbc/reducible-query
       connection
       ["SELECT key FROM tbl"]
       {:raw? true}))
You can just use :row-fn :key. Not sure what performance you are expecting, but on my i5 PC retrieving 760K records took ~3 seconds (H2 file-based database):
(time
 (count
  (jdbc/query db ["select top 760000 key from table1"] {:row-fn :key})))
;; => 760000
;; "Elapsed time: 3003.456295 msecs"

Clojurescript, how to access a map within a indexed sequence

(println (get-in @app-state ["my-seq"]))
This returns the following sequence, of type cljs.core/IndexedSeq:
([-Jg95JpA3_N3ejztiBBM {create_date 1421803605375,
website "www.example.com", first_name "one"}]
[-Jg95LjI7YWiWE233eK1 {create_date 1421803613191,
website "www.example.com", first_name "two"}]
[-Jg95MwOBXxOuHyMJDlI {create_date 1421803618124,
website "www.example.com", first_name "three"}])
How can I access the maps in the sequence by uid? For example, the map belonging to
-Jg95LjI7YWiWE233eK1
If you need the order of the data, you have the following possibilities:
Store the data once in order and once as a map. So, when adding a new entry, do something like:
(defn add-entry
  [uid data]
  (swap! app-state update-in ["my-seq"]
         #(-> %
              (update-in [:sorted] conj data)
              (update-in [:by-uid] assoc uid data))))
With lookup functions being:
(defn sorted-entries
  []
  (get-in @app-state ["my-seq" :sorted]))

(defn entry-by-uid
  [uid]
  (get-in @app-state ["my-seq" :by-uid uid]))
This has the best lookup performance but will use more memory and make the code a little bit more complex.
Search for the entry in the seq:
(defn entry-by-uid
  [uid]
  (->> (get @app-state "my-seq")
       (filter (comp #{uid} first))
       (first)))
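For example, assuming the uids are stored as strings:
;; hypothetical usage; returns the whole [uid data] entry
;; (take second of the result if you only want the map)
(entry-by-uid "-Jg95LjI7YWiWE233eK1")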
In the worst case, this has to traverse the whole seq to find your entry.
If order does not matter, I recommend storing the data as a map in the first place:
(defn add-entry
  [uid data]
  (swap! app-state assoc-in ["my-seq" uid] data))

(defn entry-by-uid
  [uid]
  (get-in @app-state ["my-seq" uid]))

Dynamic values passing in Prepared Statement

I am using clojure.java.jdbc for database access in Clojure.
I want to use prepared statements with select.
From my previous question I got an answer like this:
(jdbc/query (:conn dbinfo)
            ["select * from users where username = ? and password = ?"
             "harikk09"
             "amma123"])
It works, too.
Now I want to make this parameter list dynamic, so I wrote a function like this:
(defn values-builder [param] (:value param))
which works correctly and returns a collection of values when checked with println:
(println (map values-builder params))
gives
(harikk09 amma123)
But when I try to execute it like this, where sql-query is the previously mentioned query:
(jdbc/query (:conn dbinfo) sql-query (map values-builder params))
it throws an exception:
Caused by: java.lang.IllegalArgumentException: No value supplied for key:
clojure.lang.LazySeq@ab5111fa
Can anyone help me to rectify this error?
I think clojure expects a list of parameters without () or [].
The JDBC query and prepared values together need to be a collection. So you need to make a collection out of a string and a collection of parametrized values. To prepend a single item onto the front of a collection, use cons
(jdbc/query (:conn dbinfo) (cons sql-query (map values-builder params)))
Use apply to splice in the arguments
(apply jdbc/query (:conn dbinfo) sql-query (map values-builder params))
Update: as noted below, apply won't work since the SQL needs to be in a vector together with the params.
In that case you need to cons the sql query onto the generated params list:
(jdbc/query (:conn dbinfo) (cons sql-query (map values-builder params)))
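Equivalently, you can build the required vector with into (same dbinfo, sql-query, and params as above):
;; builds [sql-query param1 param2 ...], the sql-params vector
;; that clojure.java.jdbc's query expects
(jdbc/query (:conn dbinfo)
            (into [sql-query] (map values-builder params)))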
