I have the following directories:
mod01
mod02
mod03
...
mod100
When I use
(list (directory-files dir t "\\(mod\\)\\([0-9]\\)" nil))
the output is:
mod01
mod02
mod03
...
mod10
mod100
...
mod99
As you can see, mod100 is not in the correct position. The desired output is:
mod01
mod02
...
mod10
mod11
...
mod100
Thank you for your advice
Supply a custom predicate function extracting the numeric part:
(sort
(directory-files dir t "\\(mod\\)\\([0-9]\\)" nil)
(lambda (x y)
(<
(string-to-number (replace-regexp-in-string ".*mod\\([[:digit:]]+\\).*" "\\1" x))
(string-to-number (replace-regexp-in-string ".*mod\\([[:digit:]]+\\).*" "\\1" y)))))
As described in the help doc for directory-files, sorting uses the predicate string-lessp, for which (string-lessp "100" "9") returns t. You could write your own predicate: set nosort to true and use cl-sort to sort the contents by extracting the numeric part of the strings. If you are on a machine with access to sort -V, you could just wrap a shell command:
(defun my-sort (&optional dir)
(interactive "D")
(with-temp-buffer
(shell-command
(concat "ls " (shell-quote-argument (or dir default-directory)) "| sort -V")
(current-buffer))
(split-string (buffer-string) "\n" t))) ; t drops the empty string after the final newline
Using sort's version sorting should result in the desired ordering.
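To make the ordering problem concrete outside Emacs, here is a minimal sketch in Python (not Emacs Lisp, and not taken from either answer) that contrasts plain string ordering with ordering by the extracted number. The helper name numeric_suffix and the sample names are assumptions chosen for illustration.

```python
import re


def numeric_suffix(name):
    """Return the integer that follows 'mod' in name, or -1 if there is none."""
    match = re.search(r"mod(\d+)", name)
    return int(match.group(1)) if match else -1


names = ["mod01", "mod02", "mod10", "mod100", "mod99"]

# Plain string comparison works character by character, so "mod100" sorts
# before "mod99" because "1" is less than "9".
lexicographic = sorted(names)

# Comparing the extracted integers gives the order the question asks for.
numeric = sorted(names, key=numeric_suffix)

print(lexicographic)
print(numeric)
```

The key function plays the same role as the lambda passed to sort in the first answer: both compare the numbers rather than the strings.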
The text files have a list of paths with a different prefix.
Let's say before.txt looks like this:
before/pictures/img1.jpeg
before/pictures/img2.jpeg
before/pictures/img3.jpeg
and after.txt looks like this:
after/pictures/img1.jpeg
after/pictures/img3.jpeg
The function deleted-files should strip the differing prefixes (before, after) and compare the two files to print the paths that are missing from after.txt.
Code so far:
(ns dirdiff.core
(:gen-class))
(defn deleted-files [prefix-file1 prefix-file2 file1 file2]
(let [before (slurp "resources/davor.txt")
(let [after (slurp "resources/danach.txt")
)
Expected output (the file that was deleted):
/pictures/img2.jpeg
How can I filter the lists in clojure.clj to show only the missing ones?
You probably want to compute a set difference between the two sets of filenames after the prefixes have been removed:
(require 'clojure.set 'clojure.string) ; namespaces used by the functions below
(defn deprefixing [prefix]
(comp (filter #(clojure.string/starts-with? % prefix))
(map #(subs % (count prefix)))))
(defn load-string-set [xf filename]
(->> filename
slurp
clojure.string/split-lines
(into #{} xf)))
(defn deleted-files [prefix-file1 prefix-file2 file1 file2]
(clojure.set/difference (load-string-set (deprefixing prefix-file1) file1)
(load-string-set (deprefixing prefix-file2) file2)))
(deleted-files "before" "after"
"/tmp/before.txt" "/tmp/after.txt")
;; => #{"/pictures/img2.jpeg"}
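For comparison, the same idea can be sketched in Python rather than Clojure: strip each listing's prefix, collect the results into sets, and subtract. The function names strip_prefix and deleted_paths are hypothetical, and the lines are passed in directly instead of being read from disk.

```python
def strip_prefix(lines, prefix):
    """Return the set of paths in lines that start with prefix, prefix removed."""
    return {line[len(prefix):] for line in lines if line.startswith(prefix)}


def deleted_paths(before_lines, after_lines, before_prefix, after_prefix):
    """Return the sorted paths present in the 'before' listing but not in 'after'."""
    before = strip_prefix(before_lines, before_prefix)
    after = strip_prefix(after_lines, after_prefix)
    return sorted(before - after)


before_lines = [
    "before/pictures/img1.jpeg",
    "before/pictures/img2.jpeg",
    "before/pictures/img3.jpeg",
]
after_lines = [
    "after/pictures/img1.jpeg",
    "after/pictures/img3.jpeg",
]

result = deleted_paths(before_lines, after_lines, "before", "after")
print(result)
```

As in the Clojure version, lines that do not start with the given prefix are ignored rather than treated as errors.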
Here is how I would approach it, starting from this template project:
(ns tst.demo.core
(:use tupelo.core tupelo.test)
(:require
[clojure.set :as set]
[tupelo.string :as str]
))
(defn file-dump->names
[file-dump-str prefix ]
(it-> file-dump-str
(str/whitespace-collapse it)
(str/split it #" ")
(mapv #(str/replace % prefix "") it)))
(defn delta-files
[before-files-in after-files-in
before-prefix after-prefix]
(let-spy [before-files (file-dump->names before-files-in before-prefix)
after-files (file-dump->names after-files-in after-prefix)
before-files-set (set before-files)
after-files-set (set after-files)
delta-sorted (vec (sort (set/difference before-files-set after-files-set)))]
delta-sorted))
and a unit test to show it in action:
(dotest
(let [before-files "before/pictures/img1.jpeg
before/pictures/img2.jpeg
before/pictures/img3.jpeg "
after-files "after/pictures/img1.jpeg
after/pictures/img3.jpeg "
before-prefix "before"
after-prefix "after"]
(is= (delta-files before-files after-files before-prefix after-prefix)
["/pictures/img2.jpeg"])
))
Be sure to study these documentation sources, including books like Getting Clojure and the Clojure CheatSheet.
Notes:
I like to use let-spy and let-spy-pretty to illustrate the progression of code. It produces output like so:
-------------------------------
Clojure 1.10.2 Java 15
-------------------------------
Testing tst.demo.core
before-files => ["/pictures/img1.jpeg" "/pictures/img2.jpeg" "/pictures/img3.jpeg"]
after-files => ["/pictures/img1.jpeg" "/pictures/img3.jpeg"]
before-files-set => #{"/pictures/img3.jpeg" "/pictures/img2.jpeg" "/pictures/img1.jpeg"}
after-files-set => #{"/pictures/img3.jpeg" "/pictures/img1.jpeg"}
delta-sorted => ["/pictures/img2.jpeg"]
Ran 2 tests containing 1 assertions.
0 failures, 0 errors.
The spyx macro is also very useful for debugging. See the README and the API docs.
Preamble
I was looking through the source code in clojure.core for no particular reason.
I started reading defmacro ns, here is the abridged source:
(defmacro ns
"...docstring..."
{:arglists '([name docstring? attr-map? references*])
:added "1.0"}
[name & references]
(let [...
; Argument processing here.
name-metadata (meta name)]
`(do
(clojure.core/in-ns '~name)
~@(when name-metadata
`((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))
(with-loading-context
~@(when gen-class-call (list gen-class-call))
~@(when (and (not= name 'clojure.core) (not-any? #(= :refer-clojure (first %)) references))
`((clojure.core/refer '~'clojure.core)))
~@(map process-reference references))
(if (.equals '~name 'clojure.core)
nil
(do (dosync (commute @#'*loaded-libs* conj '~name)) nil)))))
Looking Closer
While reading it I noticed some strange macro patterns. In particular, consider:
~@(when name-metadata
`((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))
The clojure.core version
Here is a standalone working extraction from the macro:
(let [name-metadata 'name-metadata
name 'name]
`(do
~@(when name-metadata
`((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))))
=> (do (.resetMeta (clojure.lang.Namespace/find (quote name)) name-metadata))
When I ran this code I couldn't help but wonder why there is a double list at the point `((.resetMeta.
My version
I found that after replacing the unquote-splicing (~@) with a plain unquote (~), the double list was unnecessary. Here is a working standalone example:
(let [name-metadata 'name-metadata
name 'name]
`(do
~(when name-metadata
`(.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata))))
=> (do (.resetMeta (clojure.lang.Namespace/find (quote name)) name-metadata))
My Question
Thus, why does clojure.core choose this seemingly extraneous way of doing things?
My Own Thoughts
Is this an artifact of convention?
Are there other similar instances where this is used in more complex ways?
~ always emits a form; ~@ can potentially emit nothing at all. Thus sometimes one uses ~@ to splice in a single expression conditionally:
;; always yields the form (foo x)
;; (with x replaced with its macro-expansion-time value):
`(foo ~x)
;; results in (foo) if x is nil, (foo x) otherwise:
`(foo ~@(if x [x]))
That's what's going on here: the (.resetMeta …) call is emitted within the do form that ns expands to only if name-metadata is truthy (non-false, non-nil).
In this instance, it doesn't really matter – one could use ~, drop the extra brackets and accept that the macroexpansion of an ns form with no name metadata would have an extra nil in the do form. For the sake of a prettier expansion, though, it makes sense to use ~@ and only emit a form to handle name metadata when it is actually useful.
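As a rough analogy outside Clojure, Python's iterable unpacking inside a list display behaves much like unquote-splicing: unpacking an empty list contributes nothing, whereas placing a value directly always contributes one element, even when that value is None. The sketch below is only an illustration of that difference; the function names are hypothetical and nothing here is part of clojure.core.

```python
def with_unquote(form):
    """Like a plain unquote: the slot is always filled, even with None."""
    return ["do", form]


def with_unquote_splicing(form):
    """Like unquote-splicing a one-element list: None splices in nothing."""
    return ["do", *([form] if form is not None else [])]


print(with_unquote(None))            # an extra None ends up in the result
print(with_unquote_splicing(None))   # the slot disappears entirely
print(with_unquote_splicing("reset-meta-call"))
```

The wrapping in a one-element list corresponds to the double parentheses in the ns source: the outer list exists only to be spliced away.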
I am trying to write a program that adds the string "hello world" to its own source code. The file is named source.rkt. It gives me this error:
source.rkt:6:31: #%datum: keyword used as an expression in: #:mode
#(118 6)
This is the code:
#lang racket
(provide (all-defined-out))
(define out (open-output-file "source.rkt"
[#:mode 'text
#:exists 'can-update]))
(write "hello world" out)
(close-output-port out)
The brackets are not literal syntax; in the documentation they mark those arguments as optional. Therefore, the correct syntax is:
(define out (open-output-file "source.rkt"
#:mode 'text
#:exists 'can-update))
I want to extract the URLs from a reddit page. My code is:
#lang racket
(require net/url)
(require html)
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define in (get-pure-port reddit #:redirections 5))
(define response-html (read-html-as-xml in))
(define content-0 (list-ref response-html 0))
(close-input-port in)
The content-0 above is
(element
(location 0 0 15)
(location 0 0 82)
...
I'm wondering how to extract specific content from it.
Usually it's more convenient to deal with HTML as x-expressions instead of the html module's structs.
Also you should probably use call/input-url to handle closing the port automatically.
You can combine both of these ideas by defining a read-html-as-xexpr function and using it like this:
#lang racket/base
(require html
net/url
xml)
(define (read-html-as-xexpr in) ;; input-port? -> xexpr?
(caddr
(xml->xexpr
(element #f #f 'root '()
(read-html-as-xml in)))))
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(call/input-url reddit
get-pure-port
read-html-as-xexpr)
That will return a big x-expression like:
'(html
((lang "en") (xml:lang "en") (xmlns "http://www.w3.org/1999/xhtml"))
(head
()
(title () "programming: search results")
(meta
((content " reddit, reddit.com, vote, comment, submit ")
(name "keywords")))
(meta
((content "reddit: the front page of the internet") (name "description")))
(meta ((content "origin") (name "referrer")))
(meta ((content "text/html; charset=UTF-8") (http-equiv "Content-Type")))
... snip ...
How to extract specific pieces of that?
For simple HTML where I don't expect the overall structure to change, I will often just use match.
However a more correct and robust way to go about it is to use the xml/path module.
UPDATE: I noticed your question started by asking about extracting URLs. Here's the example updated to use se-path*/list to get all the href attributes of all the <a> elements:
#lang racket/base
(require html
net/url
xml
xml/path)
(define (read-html-as-xexprs in) ;; (-> input-port? xexpr?)
(caddr
(xml->xexpr
(element #f #f 'root '()
(read-html-as-xml in)))))
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define xe (call/input-url reddit
get-pure-port
read-html-as-xexprs))
(se-path*/list '(a #:href) xe)
Result:
'("#content"
"http://www.reddit.com/r/announcements/"
"http://www.reddit.com/r/Art/"
"http://www.reddit.com/r/AskReddit/"
"http://www.reddit.com/r/askscience/"
"http://www.reddit.com/r/aww/"
"http://www.reddit.com/r/blog/"
"http://www.reddit.com/r/books/"
"http://www.reddit.com/r/creepy/"
"http://www.reddit.com/r/dataisbeautiful/"
"http://www.reddit.com/r/DIY/"
"http://www.reddit.com/r/Documentaries/"
"http://www.reddit.com/r/EarthPorn/"
"http://www.reddit.com/r/explainlikeimfive/"
"http://www.reddit.com/r/Fitness/"
"http://www.reddit.com/r/food/"
... snip ...
How can I write an Emacs Lisp function to find all hrefs in an HTML file and extract all of the links?
Input:
<html>
<a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
<h1>Emacs Lisp</h1>
<a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>
Output:
http://www.stackoverflow.com|StackOverFlow
http://news.ycombinator.com|Hacker News
I've seen the re-search-forward function mentioned several times during my search. Here's what I think that I need to do based on what I've read so far.
(defun extra-urls (file)
...
(setq buffer (...
(while
(re-search-forward "http://" nil t)
(when (match-string 0)
...
))
I took Heinzi's solution and came up with the final solution that I needed. I can now take a list of files, extract all URLs and titles, and place the results in one output buffer.
(defun extract-urls (fname)
  "Extract HTML href URLs and titles from FNAME to buffer \"new-urls.csv\" in |-separated format."
  (let ((in-buf (find-file-noselect fname)) ; Save for clean up
        (u1 '()))
    (with-current-buffer in-buf
      (goto-char (point-min)) ; Need to do this in case the buffer is already open
      (while (re-search-forward "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>" nil t)
        ;; Group 1 is the URL, group 2 is the title
        (push (concat (match-string 1) "|" (match-string 2) "\n") u1))) ; Build the list of URLs
    (kill-buffer in-buf) ; Don't leave a mess of buffers
    (with-current-buffer (get-buffer-create "new-urls.csv") ; Send results to new buffer
      (mapc #'insert (nreverse u1))) ; Restore document order before inserting
    (switch-to-buffer "new-urls.csv"))) ; Finally, show the new buffer
;; Create a list of files to process
;;
(mapcar 'extract-urls '(
"/tmp/foo.html"
"/tmp/bar.html"
))
If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:
(defun getlinks ()
(beginning-of-buffer)
(replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
(beginning-of-buffer)
(replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
(beginning-of-buffer)
(replace-regexp "
+" "
")
(beginning-of-buffer)
(replace-regexp "^LINK:\\(.*\\)$" "\\1")
)
It replaces all links with LINK:url|description, deletes all lines containing anything else, deletes empty lines, and finally removes the "LINK:".
Detailed HOWTO: (1) Correct the bug in your example html file by replacing <href with <a href, (2) copy the above function into Emacs scratch, (3) hit C-x C-e after the final ")" to load the function, (4) load your example HTML file, (5) execute the function with M-: (getlinks).
Note that the linebreaks in the third replace-regexp are important. Don't indent those two lines.
You can use the xml library; examples of using the parser are found here. To parse your particular file, the following does what you want:
(defun my-grab-html (file)
(interactive "fHtml file: ")
(let ((res (car (xml-parse-file file)))) ; 'car because xml-parse-file returns a list of nodes
(mapc (lambda (n)
(when (consp n) ; don't operate on the whitespace, xml preserves whitespace
(let ((link (cdr (assq 'href (xml-node-attributes n)))))
(when link
(insert link)
(insert "|")
(insert (car (xml-node-children n))) ;# grab the text for the link
(insert "\n")))))
(xml-node-children res))))
This does not recursively parse the HTML to find all the links, but it should get you started in the direction of the general solution.
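To illustrate what a general solution has to do, here is a minimal sketch in Python (not Emacs Lisp) that finds links at any nesting depth. It uses the standard library's event-based html.parser instead of walking a parsed tree, so it is a different technique from the xml-parse-file approach above; the class name LinkCollector and the function extract_links are hypothetical, and the sample input is the question's HTML with the second link wrapped in extra elements.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element, however deeply nested."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_text):
    """Return one 'url|title' string per link found in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return ["{}|{}".format(href, text) for href, text in parser.links]


sample = """<html>
<a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
<h1>Emacs Lisp</h1>
<div><p><a href="http://news.ycombinator.com" _target="_blank">Hacker News</a></p></div>
</html>"""

lines = extract_links(sample)
print("\n".join(lines))
```

Because the parser reports start and end tags as it encounters them, no explicit recursion is needed; an Emacs Lisp version built on xml-parse-file would instead have to descend into xml-node-children of each node.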