I'm trying to have a function write a database sql dump to text file from a select statement. The volume returned can be very large, and I'm interested in doing this as fast as possible.
With a large result set I also need to log, every x-interval, the total number of rows written and how many rows per second have been written since the last interval. I have a (map ) that actually does the write inside a (with-open ), so I believe the side effect of logging completed rows should happen there (see the comments in the code).
My questions are:
1) How do I write "rows-per-second" during the interval and "total rows so far"?
2) Is there anything else I should keep in mind while writing large JDBC result sets to a file (or named pipe, bulk loader, etc.)?
3) Does the (doall ) around the (map ) function fetch all results, making it non-lazy and potentially memory intensive?
4) Would fixed width be possible as an option? I believe that would be faster for a named pipe to a bulk loader. The trade-off would be more disk I/O in exchange for less CPU spent on downstream parsing. However, this might require introspection on the returned result set (with .getMetaData?).
(ns metadata.db.table-dump
  (:use
    [clojure.pprint]
    [metadata.db.connections]
    [metadata.db.metadata]
    [clojure.string :only (join)]
    [taoensso.timbre :only (debug info warn error set-config!)])
  (:require
    [clojure.java.io :as io]
    [clojure.java.jdbc :as j]
    [clojure.java.jdbc.sql :as sql]))
(set-config! [:appenders :spit :enabled?] true)
(set-config! [:shared-appender-config :spit-filename] "log.log")
(let [field-delim    "\t"
      row-delim      "\n"
      report-seconds 10
      sql            "select * from comcast_lineup "
      joiner         (fn [v] (str (join field-delim v) row-delim))
      results        (rest (j/query local-postgres [sql] :as-arrays? true :row-fn joiner))]
  (with-open [wrtr (io/writer "test.txt")]
    (doall
      (map #(.write wrtr %)
           ; Somehow in here I want to log with (info ) rows written so
           ; far, and "rows per second" every 10 seconds.
           results)))
  (info "Completed write"))
Couple general tips:
At the JDBC level you may need to use setFetchSize to avoid loading the entire resultset into RAM before it even gets to Clojure. See What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
Make sure clojure.java.jdbc is actually returning a lazy seq (it probably is?)-- if not, consider resultset-seq
doall will indeed force the whole thing to be in RAM; try doseq instead
Consider using an atom to keep count of rows written as you go; you can use this to write rows-so-far, etc.
Sketch:
(let [.. your stuff ..
      start     (System/currentTimeMillis)
      row-count (atom 0)]
  (with-open [^java.io.Writer wrtr (io/writer "test.txt")]
    (doseq [row results]
      (.write wrtr row)
      (swap! row-count inc)
      (when (zero? (mod @row-count 10000))
        (println (format "written %d rows" @row-count))
        (println (format "rows/s %.2f" (rate-calc-here)))))))
You may get some use out of my answer to Idiomatic clojure for progress reporting?
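The (rate-calc-here) placeholder in the sketch above could be filled in roughly as follows. This is only a minimal sketch under some assumptions: write-rows!, report-every and log-fn are hypothetical names that do not appear in the original post, and the report fires every N rows rather than every N seconds. The rate is computed over the interval since the previous report, which is what the question asks for.

```clojure
(defn write-rows!
  "Writes each string in rows to wrtr. Every report-every rows, calls log-fn
  with the running total and the rows/second since the previous report.
  Returns the total number of rows written.
  (Hypothetical helper, not part of clojure.java.jdbc.)"
  [^java.io.Writer wrtr rows report-every log-fn]
  (let [total      (atom 0)
        last-nanos (atom (System/nanoTime))]
    (doseq [^String row rows]
      (.write wrtr row)
      (let [n (swap! total inc)]
        (when (zero? (mod n report-every))
          (let [now     (System/nanoTime)
                elapsed (max 1 (- now @last-nanos))] ; guard against a zero-length interval
            (log-fn {:total        n
                     :rows-per-sec (/ (* report-every 1e9) elapsed)})
            (reset! last-nanos now)))))
    @total))

;; Usage with an in-memory writer; swap in (io/writer "test.txt") and
;; timbre's info for real use.
(with-open [w (java.io.StringWriter.)]
  (write-rows! w (map #(str % "\n") (range 25)) 10 println))
```

Because doseq is used, no row is retained after it has been written, so memory use should stay flat as long as the rows sequence itself is lazy.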
To your situation specifically
1) You could pass an index as the second argument to the anonymous function you are mapping. Inside that function the index tells you which row you are writing, and you can use it to update an atom.
user> (def stats (atom {}))
#'user/stats
user> (let [start-time (. (java.util.Date.) getTime)]
        (dorun (map (fn [line index]
                      (println line) ; write to log file here
                      (reset! stats [{:lines index
                                      :start start-time
                                      :end (. (java.util.Date.) getTime)}]))
                    ["line1" "line2" "line3"]
                    (rest (range)))))
line1
line2
line3
nil
user> @stats
[{:lines 3, :start 1383183600216, :end 1383183600217}]
user>
The contents of stats can then be printed/logged every few seconds to update the UI
3) You almost certainly want to use dorun instead of doall because, as you suspect, doall will run out of memory on a large enough data set. dorun drops the results as they are written, so you can run it on arbitrarily large data if you are willing to wait long enough.
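The difference between doall and dorun can be seen on a small in-memory example. This is a minimal sketch; write-row! is a hypothetical stand-in for the real (.write wrtr row) call and only counts how often it was invoked.

```clojure
(def writes (atom 0))

(defn write-row!
  "Stand-in for (.write wrtr row): counts the call and returns the row."
  [row]
  (swap! writes inc)
  row)

;; doall realizes the whole seq AND keeps every result reachable.
(def kept (doall (map write-row! ["a" "b" "c"])))

;; dorun realizes the seq only for its side effects and returns nil.
(def dropped (dorun (map write-row! ["a" "b" "c"])))

;; doseq runs the side effect per element and never builds a result seq.
(doseq [row ["a" "b" "c"]]
  (write-row! row))
```

Here kept holds all three results while dropped is nil; with millions of rows, that retained sequence is what exhausts the heap.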
Related
The text files have a list of paths with a different prefix.
Let's say before.txt looks like this:
before/pictures/img1.jpeg
before/pictures/img2.jpeg
before/pictures/img3.jpeg
and after.txt looks like this:
after/pictures/img1.jpeg
after/pictures/img3.jpeg
The function deleted-files should strip the differing prefixes (before, after) and compare the two files to print the list of paths that are missing from after.txt.
Code so far:
(ns dirdiff.core
(:gen-class))
(defn deleted-files [prefix-file1 prefix-file2 file1 file2]
  (let [before (slurp "resources/davor.txt")
        after  (slurp "resources/danach.txt")]
    ;; TODO: strip the prefixes and compare the two lists
    ))
Expected output (the file that was deleted):
/pictures/img2.jpeg
How can I filter the lists in clojure.clj to show only the missing ones?
You probably want to compute a set difference between the two sets of filenames after the prefixes have been removed:
(require 'clojure.set 'clojure.string)

;; Transducer: keep only lines that start with prefix, then drop the prefix.
(defn deprefixing [prefix]
  (comp (filter #(clojure.string/starts-with? % prefix))
        (map #(subs % (count prefix)))))

;; Read a file into a set of lines, transformed by the transducer xf.
(defn load-string-set [xf filename]
  (->> filename
       slurp
       clojure.string/split-lines
       (into #{} xf)))

(defn deleted-files [prefix-file1 prefix-file2 file1 file2]
  (clojure.set/difference (load-string-set (deprefixing prefix-file1) file1)
                          (load-string-set (deprefixing prefix-file2) file2)))

(deleted-files "before" "after"
               "/tmp/before.txt" "/tmp/after.txt")
;; => #{"/pictures/img2.jpeg"}
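To try the same idea at the REPL without creating files on disk, the lines can be supplied directly. This is a minimal sketch using the sample data from the question; strip-prefix, before-lines, after-lines and deleted are hypothetical names chosen for illustration.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn strip-prefix
  "Returns the set of lines that start with prefix, with the prefix removed."
  [prefix lines]
  (into #{}
        (comp (filter #(str/starts-with? % prefix))
              (map #(subs % (count prefix))))
        lines))

(def before-lines ["before/pictures/img1.jpeg"
                   "before/pictures/img2.jpeg"
                   "before/pictures/img3.jpeg"])

(def after-lines ["after/pictures/img1.jpeg"
                  "after/pictures/img3.jpeg"])

;; Paths present before but missing afterwards.
(def deleted
  (set/difference (strip-prefix "before" before-lines)
                  (strip-prefix "after" after-lines)))
;; deleted => #{"/pictures/img2.jpeg"}
```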
Here is how I would approach it, starting from this template project:
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [clojure.set :as set]
    [tupelo.string :as str]))

(defn file-dump->names
  [file-dump-str prefix]
  (it-> file-dump-str
    (str/whitespace-collapse it)
    (str/split it #" ")
    (mapv #(str/replace % prefix "") it)))

(defn delta-files
  [before-files-in after-files-in
   before-prefix after-prefix]
  (let-spy [before-files     (file-dump->names before-files-in before-prefix)
            after-files      (file-dump->names after-files-in after-prefix)
            before-files-set (set before-files)
            after-files-set  (set after-files)
            delta-sorted     (vec (sort (set/difference before-files-set after-files-set)))]
    delta-sorted))
and a unit test to show it in action:
(dotest
(let [before-files "before/pictures/img1.jpeg
before/pictures/img2.jpeg
before/pictures/img3.jpeg "
after-files "after/pictures/img1.jpeg
after/pictures/img3.jpeg "
before-prefix "before"
after-prefix "after"]
(is= (delta-files before-files after-files before-prefix after-prefix)
["/pictures/img2.jpeg"])
))
Be sure to study these documentation sources, including books like Getting Clojure and the Clojure CheatSheet.
Notes:
I like to use let-spy and let-spy-pretty to illustrate the progression of code. It produces output like so:
-------------------------------
Clojure 1.10.2 Java 15
-------------------------------
Testing tst.demo.core
before-files => ["/pictures/img1.jpeg" "/pictures/img2.jpeg" "/pictures/img3.jpeg"]
after-files => ["/pictures/img1.jpeg" "/pictures/img3.jpeg"]
before-files-set => #{"/pictures/img3.jpeg" "/pictures/img2.jpeg" "/pictures/img1.jpeg"}
after-files-set => #{"/pictures/img3.jpeg" "/pictures/img1.jpeg"}
delta-sorted => ["/pictures/img2.jpeg"]
Ran 2 tests containing 1 assertions.
0 failures, 0 errors.
The spyx macro is also very useful for debugging. See the README and the API docs.
Preamble
I was looking through the source code in clojure.core for no particular reason.
I started reading defmacro ns, here is the abridged source:
(defmacro ns
  "...docstring..."
  {:arglists '([name docstring? attr-map? references*])
   :added "1.0"}
  [name & references]
  (let [...
        ; Argument processing here.
        name-metadata (meta name)]
    `(do
       (clojure.core/in-ns '~name)
       ~@(when name-metadata
           `((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))
       (with-loading-context
         ~@(when gen-class-call (list gen-class-call))
         ~@(when (and (not= name 'clojure.core) (not-any? #(= :refer-clojure (first %)) references))
             `((clojure.core/refer '~'clojure.core)))
         ~@(map process-reference references))
       (if (.equals '~name 'clojure.core)
         nil
         (do (dosync (commute @#'*loaded-libs* conj '~name)) nil)))))
Looking Closer
While trying to read it I noticed some strange macro patterns. In particular, look at:
~@(when name-metadata
    `((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))
The clojure.core version
Here is a standalone working extraction from the macro:
(let [name-metadata 'name-metadata
      name 'name]
  `(do
     ~@(when name-metadata
         `((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))))
=> (do (.resetMeta (clojure.lang.Namespace/find (quote name)) name-metadata))
When I ran this code I couldn't help but wonder why there is a double list at the point `((.resetMeta.
My version
I found that by replacing the unquote-splicing (~@) with a plain unquote (~), the double list became unnecessary. Here is a working standalone example:
(let [name-metadata 'name-metadata
      name 'name]
  `(do
     ~(when name-metadata
        `(.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata))))
=> (do (.resetMeta (clojure.lang.Namespace/find (quote name)) name-metadata))
My Question
Thus, why does clojure.core choose this seemingly extraneous way of doing things?
My Own Thoughts
Is this an artifact of convention?
Are there other similar instances where this is used in more complex ways?
~ always emits a form; ~@ can potentially emit nothing at all. Thus one sometimes uses ~@ to splice in a single expression conditionally:
;; always yields the form (foo x)
;; (with x replaced with its macro-expansion-time value):
`(foo ~x)
;; results in (foo) if x is nil, (foo x) otherwise:
`(foo ~@(if x [x]))
That's what's going on here: the (.resetMeta …) call is emitted within the do form that ns expands to only if name-metadata is truthy (non-false, non-nil).
In this instance it doesn't really matter: one could use ~, drop the extra brackets, and accept that the macroexpansion of an ns form with no name metadata would have an extra nil in the do form. For the sake of a prettier expansion, though, it makes sense to use ~@ and only emit a form to handle name metadata when it is actually useful.
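The difference is easy to check at the REPL. The following is a minimal sketch with two hypothetical helper functions, with-unquote and with-splice, that build a do form around an optional sub-form. do is used instead of foo because syntax-quote leaves special forms unqualified, which keeps the output readable.

```clojure
(defn with-unquote
  "Always emits exactly one form inside the do, even when form is nil."
  [form]
  `(do ~(when form form)))

(defn with-splice
  "Emits the form only when it is non-nil; otherwise splices in nothing."
  [form]
  `(do ~@(when form [form])))

(with-unquote nil)             ;; => (do nil)
(with-splice nil)              ;; => (do)
(with-unquote '(println "hi")) ;; => (do (println "hi"))
(with-splice '(println "hi"))  ;; => (do (println "hi"))
```

The extra vector (or the extra list in clojure.core's ns) exists only so that ~@ has a sequence to splice; splicing nil or an empty sequence contributes no forms at all.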
This code works, printing the rows in the given table:
(defn count-extreferences-subset [config]
  (let [emr-dbs (:emr-databases config)]
    (println "Counting external references: " emr-dbs)
    (jdbc/with-db-connection [dbconn (:db-spec (first emr-dbs))]
      (let [q  "SELECT * FROM LOCREG"
            rs (jdbc/query dbconn [q])]
        (dorun (map println rs))))))
According to the documentation in clojure.jdbc, this should also work, but should print the rows as the result set is realized (preventing memory overflow for large result sets):
(defn count-extreferences-subset [config]
  (let [emr-dbs (:emr-databases config)]
    (println "Counting external references: " emr-dbs)
    (jdbc/with-db-connection [dbconn (:db-spec (first emr-dbs))]
      (let [q "SELECT * FROM LOCREG"
            _ (jdbc/query dbconn [q] {:row-fn println})]))))
However this fails at run-time with the following exception:
java.lang.IllegalArgumentException: No value supplied for key: {:row-fn #object[clojure.core$println 0x46ed7a0e "clojure.core$println@46ed7a0e"]}
Any idea why the use of the :row-fn option is failing?
I believe the curly braces are the problem. Your code should follow the following pattern:
(jdbc/query db-spec
["select name, cost from fruit where cost = 12"]
:row-fn add-tax)
You can find more information in The Clojure Cookbook. I highly recommend buying a copy!
Yes, I am aware of the emacs profiler feature. I'm looking for something similar to the time keyword in bash, something like:
(time (myfunc))
which would return or print the time taken by the myfunc call. Is there such a thing?
benchmark.el provides benchmark-run and benchmark-run-compiled functions as well as a benchmark version to run interactively. The linked example:
C-u 512 M-x benchmark (sort (number-sequence 1 512) '<)
Elapsed time: 0.260000s (0.046000s in 6 GCs)
The timer used by all those functions is the benchmark-elapse macro, which you can also use directly if desired:
ELISP> (require 'benchmark)
ELISP> (benchmark-elapse
(sit-for 2))
2.00707889
I found exactly what I was looking for at http://nullprogram.com/blog/2009/05/28/
(defmacro measure-time (&rest body)
  "Measure and return the running time of the code block."
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    `(let ((,start (float-time)))
       ,@body
       (- (float-time) ,start))))
I'm trying to read millions of rows from a database and write to a text file.
This is a continuation of my question database dump to text file with side effects
My problem now seems to be that the logging doesn't happen until the program completes. Another indicator that I'm not processing lazily is that the text file isn't written at all until the program finishes.
Based on an IRC tip, my issue likely has to do with :result-set-fn, which defaults to doall in the clojure.java.jdbc/query code.
I have tried replacing this with a for expression, but memory consumption is still high because it pulls the entire result set into memory.
How can I have a :result-set-fn that doesn't pull everything in like doall? How can I progressively write the log file as the program is running, rather than dump everything once the -main execution is finished?
(let [db-spec             local-postgres
      sql                 "select * from public.f_5500_sf "
      log-report-interval 1000
      fetch-size          100
      field-delim         "\t"
      row-delim           "\n"
      db-connection       (doto (j/get-connection db-spec) (.setAutoCommit false))
      statement           (j/prepare-statement db-connection sql :fetch-size fetch-size)
      joiner              (fn [v] (str (join field-delim v) row-delim))
      start               (System/currentTimeMillis)
      ;; elapsed time is in milliseconds, so divide by 1000 to get rows per second
      rate-calc           (fn [r] (float (/ r (/ (- (System/currentTimeMillis) start) 1000))))
      row-count           (atom 0)
      result-set-fn       (fn [rs] (lazy-seq rs))
      lazy-results        (rest (j/query db-connection [statement] :as-arrays? true :row-fn joiner :result-set-fn result-set-fn))]
  (.setAutoCommit db-connection false)
  (info "Started dbdump session...")
  (with-open [^java.io.Writer wrtr (io/writer "output.txt")]
    (info "Running query...")
    (doseq [row lazy-results]
      (.write wrtr row)))
  (info (format "Completed write with %d rows" @row-count)))
I took the recent fixes for clojure.java.jdbc by putting [org.clojure/java.jdbc "0.3.0-beta1"] in my project.clj dependencies listing. This one enhances/corrects the :as-arrays? true functionality of clojure.java.jdbc/query described here.
I think this helped somewhat; however, I might still have been able to override the :result-set-fn to vec.
The core issue was resolved by tucking all row logic into :row-fn. The initial OutOfMemory problems had to do with iterating through j/query result sets rather than defining the specific :row-fn.
New (working) code is below:
(defn -main []
  (let [; {{{
        db-spec             local-postgres
        source-sql          "select * from public.f_5500 "
        log-report-interval 1000
        fetch-size          1000
        row-count           (atom 0)
        field-delim         "\u0001" ; unlikely to be in source feed,
                                     ; although I should still check in
                                     ; replace-newline below (for when "\t"
                                     ; is used especially)
        row-delim           "\n"     ; unless fixed-width, target doesn't
                                     ; support non-printable chars for recDelim like
        db-connection       (doto (j/get-connection db-spec) (.setAutoCommit false))
        statement           (j/prepare-statement db-connection source-sql :fetch-size fetch-size :concurrency :read-only)
        start               (System/currentTimeMillis)
        ;; elapsed time is in milliseconds, so divide by 1000 to get rows per second
        rate-calc           (fn [r] (float (/ r (/ (- (System/currentTimeMillis) start) 1000))))
        replace-newline     (fn [s] (if (string? s) (clojure.string/replace s #"\n" " ") s))
        row-fn              (fn [v]
                              (swap! row-count inc)
                              (when (zero? (mod @row-count log-report-interval))
                                (info (format "wrote %d rows" @row-count))
                                (info (format "\trows/s %.2f" (rate-calc @row-count)))
                                (info (format "\tPercent Mem used %s " (memory-percent-used))))
                              (str (join field-delim (doall (map #(replace-newline %) v))) row-delim))
        ]; }}}
    (info "Started database table dump session...")
    (with-open [^java.io.Writer wrtr (io/writer "./sql/output.txt")]
      (j/query db-connection [statement] :as-arrays? true :row-fn
               #(.write wrtr (row-fn %))))
    (info (format "\t\t\tCompleted with %d rows" @row-count))
    (info (format "\t\t\tCompleted in %s seconds" (float (/ (- (System/currentTimeMillis) start) 1000))))
    (info (format "\t\t\tAverage rows/s %.2f" (rate-calc @row-count)))
    nil))
Other things I experimented with (with limited success) involved the timbre logging and turning off standard out; I wondered whether, when using a REPL, it might cache the results before displaying them back in my editor (vim fireplace), and I wasn't sure if that was using a lot of the memory.
Also, I added the logging parts around memory free with (.freeMemory (java.lang.Runtime/getRuntime)). I wasn't as familiar with VisualVM and pinpointing exactly where my issue was.
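The memory-percent-used helper called in the code above is not shown in the post. One plausible definition, given here purely as an assumption about what it might have looked like, computes the used heap as a percentage of the maximum heap from java.lang.Runtime:

```clojure
(defn memory-percent-used
  "Returns the used heap as a percentage of the JVM's maximum heap.
  (Assumed definition; the original helper is not shown in the post.)"
  []
  (let [rt   (Runtime/getRuntime)
        used (- (.totalMemory rt) (.freeMemory rt))] ; bytes currently in use
    (* 100.0 (/ used (double (.maxMemory rt))))))

;; Example: log it the same way the row-fn above does.
(println (format "Percent Mem used %s" (memory-percent-used)))
```

A value that stays roughly flat while the row count climbs is a reasonable sign that rows are being streamed rather than accumulated.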
I am happy with how it works now, thanks everyone for your help.
You can use prepare-statement with the :fetch-size option. Otherwise, the query itself is eager despite the results being delivered in a lazy sequence.
prepare-statement requires a connection object, so you'll need to explicitly create one. Here's an example of how your usage might look:
(let [db-spec    local-postgres
      sql        "select * from big_table limit 500000 "
      fetch-size 10000 ;; or whatever's appropriate
      cnxn       (doto (j/get-connection db-spec)
                   (.setAutoCommit false))
      stmt       (j/prepare-statement cnxn sql :fetch-size fetch-size)
      results    (rest (j/query cnxn [stmt]))]
  ;; ...
  )
Another option
Since the problem seems to be with query, try with-query-results. It's considered deprecated but is still there and works. Here's an example usage:
(let [db-spec    local-postgres
      sql        "select * from big_table limit 500000 "
      fetch-size 100 ;; or whatever's appropriate
      cnxn       (doto (j/get-connection db-spec)
                   (.setAutoCommit false))
      stmt       (j/prepare-statement cnxn sql :fetch-size fetch-size)]
  (j/with-query-results results [stmt] ;; binds the results to `results`
    (doseq [row results]
      ;;
      )))
I've found a better solution: you need to declare a cursor and fetch chunks of data from it in a transaction. Example:
(db/with-tx
  (db/execute! "declare cur cursor for select * from huge_table")
  (loop []
    (when-let [rows (-> "fetch 10 from cur" db/query not-empty)]
      (doseq [row rows]
        (process-a-row row))
      (recur))))
Here, db/with-tx, db/execute! and db/query are my own shortcuts declared in db namespace:
(def ^:dynamic
  *db* {:dbtype "postgresql"
        :connection-uri <some db url>})
(defn query [& args]
  (apply jdbc/query *db* args))
(defn execute! [& args]
  (apply jdbc/execute! *db* args))
(defmacro with-tx
  "Runs a series of queries in a transaction."
  [& body]
  `(jdbc/with-db-transaction [tx# *db*]
     (binding [*db* tx#]
       ~@body)))
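The chunked loop itself can be exercised without a database. Below is a minimal sketch in which fake-cursor is a hypothetical in-memory stand-in for the "fetch 10 from cur" query, and process-in-chunks is a hypothetical generalization of the loop above that also returns the number of rows handled. Neither name comes from clojure.java.jdbc.

```clojure
(defn fake-cursor
  "Returns a zero-argument function that hands out coll in chunks of
  chunk-size and returns an empty vector once coll is exhausted.
  (In-memory stand-in for a database cursor.)"
  [coll chunk-size]
  (let [remaining (atom coll)]
    (fn []
      (let [chunk (vec (take chunk-size @remaining))]
        (swap! remaining #(drop chunk-size %))
        chunk))))

(defn process-in-chunks
  "Calls fetch-chunk! repeatedly until it returns an empty result, passing
  every row to process-row!. Returns the number of rows processed."
  [fetch-chunk! process-row!]
  (loop [n 0]
    (if-let [rows (not-empty (fetch-chunk!))]
      (do (doseq [row rows]
            (process-row! row))
          (recur (+ n (count rows))))
      n)))

;; Usage: 25 rows fetched 10 at a time.
(process-in-chunks (fake-cursor (range 25) 10) println)
```

Only one chunk is held in memory at a time, which is the property the cursor-based approach relies on.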