Extracting URLs from an Emacs buffer? - elisp

How can I write an Emacs Lisp function to find all hrefs in an HTML file and extract all of the links?
Input:
<html>
<a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
<h1>Emacs Lisp</h1>
<a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>
Output:
http://www.stackoverflow.com|StackOverFlow
http://news.ycombinator.com|Hacker News
I've seen the re-search-forward function mentioned several times during my search. Here's what I think that I need to do based on what I've read so far.
(defun extra-urls (file)
...
(setq buffer (...
(while
(re-search-forward "http://" nil t)
(when (match-string 0)
...
))

I took Heinzi's solution and came up with the final solution that I needed. I can now take a list of files, extract all URLs and titles, and place the results in one output buffer.
(defun extract-urls (fname)
  "Extract href URLs and titles from FNAME into buffer \"new-urls.csv\".
Each result is written as one \"url|title\" line."
  (let ((in-buf (find-file-noselect fname)) ; remember the buffer for cleanup
        (urls '()))
    (with-current-buffer in-buf
      (goto-char (point-min)) ; needed in case the buffer is already open
      (while (re-search-forward
              "<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>" nil t)
        ;; Group 1 is the URL, group 2 is the title.
        (push (concat (match-string 1) "|" (match-string 2) "\n") urls)))
    (kill-buffer in-buf) ; don't leave a mess of buffers
    (with-current-buffer (get-buffer-create "new-urls.csv") ; send results to new buffer
      (mapc #'insert (nreverse urls)))
    (switch-to-buffer "new-urls.csv"))) ; finally, show the new buffer
;; Process a list of files (mapc, because only the side effect matters)
(mapc #'extract-urls '("/tmp/foo.html"
                       "/tmp/bar.html"))

If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:
(defun getlinks ()
(beginning-of-buffer)
(replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
(beginning-of-buffer)
(replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
(beginning-of-buffer)
(replace-regexp "
+" "
")
(beginning-of-buffer)
(replace-regexp "^LINK:\\(.*\\)$" "\\1")
)
It replaces all links with LINK:url|description, deletes all lines containing anything else, deletes empty lines, and finally removes the "LINK:".
Detailed HOWTO: (1) Correct the bug in your example html file by replacing <href with <a href, (2) copy the above function into Emacs scratch, (3) hit C-x C-e after the final ")" to load the function, (4) load your example HTML file, (5) execute the function with M-: (getlinks).
Note that the linebreaks in the third replace-regexp are important. Don't indent those two lines.
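If several links can share a line, or you would rather not modify the buffer, a search loop is an alternative to the replacements above. The following is only a minimal sketch: my-collect-links is a hypothetical name, the example.com-style URLs are made up, and the regexp assumes double-quoted href values and link text without nested tags.

```elisp
(defun my-collect-links (html)
  "Return a list of \"url|text\" strings for the links in the string HTML.
Hypothetical helper: assumes double-quoted href values and link
text that contains no nested tags."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (let ((links '()))
      ;; No leading ^.* here, so several links on one line are all found.
      (while (re-search-forward
              "<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>" nil t)
        (push (concat (match-string 1) "|" (match-string 2)) links))
      (nreverse links))))

(my-collect-links
 "<a href=\"http://a.example\">A</a> <a href=\"http://b.example\" _target=\"_blank\">B</a>")
;; → ("http://a.example|A" "http://b.example|B")
```

Working on a temporary buffer leaves the visited file untouched, and the caller decides what to do with the returned list.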

You can use the xml library that ships with Emacs. To parse your particular file, the following does what you want:
(require 'xml)

(defun my-grab-html (file)
  "Insert a \"href|text\" line at point for each top-level link in FILE."
  (interactive "fHtml file: ")
  (let ((res (car (xml-parse-file file)))) ; `car' because xml-parse-file returns a list of nodes
    (mapc (lambda (n)
            (when (consp n) ; skip the whitespace strings; xml preserves whitespace
              (let ((link (cdr (assq 'href (xml-node-attributes n)))))
                (when link
                  (insert link)
                  (insert "|")
                  (insert (car (xml-node-children n))) ; grab the text for the link
                  (insert "\n")))))
          (xml-node-children res))))
This does not recursively parse the HTML to find all the links, but it should get you started in the direction of the general solution.
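The recursive walk that this answer leaves open could look roughly like the sketch below. It is an illustration rather than part of the original answer: my-collect-hrefs is a hypothetical name, the sample markup is made up, and the input is assumed to be well-formed enough for the xml parser, which much real-world HTML is not.

```elisp
(require 'xml)

(defun my-collect-hrefs (node)
  "Return a list of (HREF . TEXT) pairs for every <a> element under NODE.
Hypothetical helper; NODE is a node as returned by the xml parser."
  (when (consp node)                    ; strings are text nodes, skip them
    (let ((href (and (eq (xml-node-name node) 'a)
                     (cdr (assq 'href (xml-node-attributes node)))))
          ;; Recurse into every child and flatten the per-child results.
          (rest (apply #'append
                       (mapcar #'my-collect-hrefs (xml-node-children node)))))
      (if href
          (cons (cons href (car (xml-node-children node))) rest)
        rest))))

(with-temp-buffer
  (insert "<html><a href=\"http://a.example\">A</a>"
          "<div><a href=\"http://b.example\">B</a></div></html>")
  (my-collect-hrefs (car (xml-parse-region (point-min) (point-max)))))
;; → (("http://a.example" . "A") ("http://b.example" . "B"))
```

Returning pairs instead of inserting text keeps the traversal separate from the output format, so the url|title lines can be produced afterwards.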

Related

Trailing backslash error web-mode content type

I get this error when trying to set the content type in web-mode: File mode specification error: (invalid-regexp Trailing backslash)
I have had a hard time debugging this. I'm very new to Emacs, so I need some help setting up web-mode. I have been following the documentation at web-mode.org, but it has been difficult to decipher. Thanks.
(use-package
web-mode
:defer 2
:ensure t
:mode ("\\.html?\\"
"\\.hbs$\\"
"\\.vue$\\"
"\\.css?\\"
"components/.*\\.js[x]?\\'"
"containers/.*\\.js[x]?\\'")
:config (progn
(setq web-mode-enable-auto-closing t
web-mode-enable-auto-opening t
web-mode-enable-auto-pairing t
web-mode-enable-auto-indentation t
web-mode-enable-auto-quoting t
;; right now paired with AutoComplete
web-mode-ac-sources-alist
'(("css" . (ac-source-css-property))
("vue" . (ac-source-words-in-buffer ac-source-abbrev))
("html" . (ac-source-words-in-buffer ac-source-abbrev)))
web-mode-content-types-alist
'(("jsx" . "components/.*\\.js[x]?\\'")
("jsx" . "containers/.*\\.js[x]?\\'")))))
;; usually I set them in containers/ or components/ directories
;; to keep them separate from plain JS
;; adjust indents for web-mode to 2 spaces
(defun my-web-mode-hook ()
"Hooks for Web mode. Adjust indents"
;;; http://web-mode.org/
(setq web-mode-markup-indent-offset 2)
(setq web-mode-css-indent-offset 2)
(setq web-mode-code-indent-offset 2))
(add-hook 'web-mode-hook 'my-web-mode-hook)
In the list of regexps after :mode, make sure that they all end with \\'. Currently two of them do, but four of them lost the final ' character. Also drop the $ in front of \\': Emacs treats $ as an anchor only at the very end of a regexp (or before \\) or \\|), so "\\.hbs$\\'" would look for a literal dollar sign.
:mode ("\\.html?\\'"
       "\\.hbs\\'"
       "\\.vue\\'"
       "\\.css?\\'"
       "components/.*\\.js[x]?\\'"
       "containers/.*\\.js[x]?\\'")
\' is a special regexp construct that "matches the empty string, but only at the end of the buffer or string being matched against".
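To see what the end-of-string anchor does, the forms below can be evaluated one at a time in the *scratch* buffer. This is just an illustrative sketch; the file names are made up.

```elisp
;; The \' construct anchors at the end of the string being matched:
(string-match-p "\\.html?\\'" "index.html")       ; → 5
(string-match-p "\\.html?\\'" "index.html.bak")   ; → nil

;; $ also matches before a newline inside the string; \' does not:
(string-match-p "\\.html$" "index.html\nrest")    ; → 5
(string-match-p "\\.html\\'" "index.html\nrest")  ; → nil

;; A pattern that ends in a lone backslash is rejected, which is the
;; error from the question:
(condition-case err
    (string-match-p "\\.html?\\" "index.html")
  (invalid-regexp err))
;; → (invalid-regexp "Trailing backslash")
```

The returned numbers are the position where the match starts, so any non-nil value means the file name would select the mode.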

Sort directory-files

I have the following directories:
mod01
mod02
mod03
...
mod100
When I use
(list (directory-files dir t "\\(mod\\)\\([0-9]\\)" nil))
the output is:
mod01
mod02
mod03
...
mod10
mod100
...
mod99
As you can see, mod100 is not in the correct position. The desired output is:
mod01
mod02
...
mod10
mod11
...
mod100
Thank you for your advice
Supply a custom predicate function extracting the numeric part:
(sort
 (directory-files dir t "\\(mod\\)\\([0-9]\\)" nil)
 (lambda (x y)
   (<
    (string-to-number (replace-regexp-in-string ".*mod\\([[:digit:]]+\\).*" "\\1" x))
    (string-to-number (replace-regexp-in-string ".*mod\\([[:digit:]]+\\).*" "\\1" y)))))
As described in the help for directory-files, sorting uses the predicate string-lessp, which compares the names character by character, so (string-lessp "100" "9") returns t. You could write your own predicate, set nosort to t, and use cl-sort to sort the contents by the numeric part of the names. If you are on a machine with access to sort -V, you could also just wrap a shell command:
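(defun my-sort (&optional dir)
  "Return the entries of DIR (default: `default-directory') in version order."
  (interactive "D")
  (with-temp-buffer
    (shell-command
     ;; Expand first, so that a leading ~ is not quoted away.
     (concat "ls "
             (shell-quote-argument (expand-file-name (or dir default-directory)))
             " | sort -V")
     (current-buffer))
    ;; The final t drops the empty string after the last newline.
    (split-string (buffer-string) "\n" t)))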
Using sort's version sorting should result in the desired ordering.
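The same ordering is also available without a shell. The sketch below assumes Emacs 25.1 or later, where string-version-lessp is built in; my-sorted-mods is a hypothetical name, not something from the answers above.

```elisp
(defun my-sorted-mods (dir)
  "Return the mod<N> entries of DIR in natural (version) order.
Hypothetical helper; requires `string-version-lessp' (Emacs 25.1+)."
  (sort (directory-files dir t "\\`mod[0-9]+\\'" t) ; final t: skip the default sort
        #'string-version-lessp))

;; The predicate on its own, with plain strings:
(sort (list "mod01" "mod100" "mod02" "mod10" "mod99")
      #'string-version-lessp)
;; → ("mod01" "mod02" "mod10" "mod99" "mod100")
```

string-version-lessp compares runs of digits as numbers, so no regexp is needed to pull out the numeric part.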

Clojure.core source: Why ~@ (unquote-splicing operator) with a quoted double list inside, instead of ~ (unquote operator)

Preamble
I was looking through the source code in clojure.core for no particular reason.
I started reading defmacro ns, here is the abridged source:
(defmacro ns
  "...docstring..."
  {:arglists '([name docstring? attr-map? references*])
   :added "1.0"}
  [name & references]
  (let [...
        ; Argument processing here.
        name-metadata (meta name)]
    `(do
       (clojure.core/in-ns '~name)
       ~@(when name-metadata
           `((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))
       (with-loading-context
         ~@(when gen-class-call (list gen-class-call))
         ~@(when (and (not= name 'clojure.core) (not-any? #(= :refer-clojure (first %)) references))
             `((clojure.core/refer '~'clojure.core)))
         ~@(map process-reference references))
       (if (.equals '~name 'clojure.core)
         nil
         (do (dosync (commute @#'*loaded-libs* conj '~name)) nil)))))
Looking Closer
While trying to read it, I saw some strange macro patterns. In particular, we can look at:
~@(when name-metadata
    `((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))
The clojure.core version
Here is a standalone working extraction from the macro:
(let [name-metadata 'name-metadata
      name 'name]
  `(do
     ~@(when name-metadata
         `((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))))
=> (do (.resetMeta (clojure.lang.Namespace/find (quote name)) name-metadata))
When I ran this code, I couldn't help but wonder why there is a double list at the point `((.resetMeta.
My version
I found that by replacing the unquote-splicing (~@) with a plain unquote (~), the double list becomes unnecessary. Here is a working standalone example:
(let [name-metadata 'name-metadata
      name 'name]
  `(do
     ~(when name-metadata
        `(.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata))))
=> (do (.resetMeta (clojure.lang.Namespace/find (quote name)) name-metadata))
My Question
Thus, why does clojure.core choose this seemingly extraneous way of doing things?
My Own Thoughts
Is this an artifact of convention?
Are there other similar instances where this is used in more complex ways?
~ always emits a form; ~@ can potentially emit nothing at all. Thus sometimes one uses ~@ to splice in a single expression conditionally:
;; always yields the form (foo x)
;; (with x replaced with its macro-expansion-time value):
`(foo ~x)
;; results in (foo) if x is nil, (foo x) otherwise:
`(foo ~@(if x [x]))
That's what's going on here: the (.resetMeta …) call is emitted within the do form that ns expands to only if name-metadata is truthy (non-false, non-nil).
In this instance, it doesn't really matter – one could use ~, drop the extra brackets and accept that the macroexpansion of an ns form with no name metadata would have an extra nil in the do form. For the sake of a prettier expansion, though, it makes sense to use ~@ and only emit a form to handle name metadata when it is actually useful.

How to extract element from html in Racket?

I want to extract the URLs from a reddit page. My code is:
#lang racket
(require net/url)
(require html)
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define in (get-pure-port reddit #:redirections 5))
(define response-html (read-html-as-xml in))
(define content-0 (list-ref response-html 0))
(close-input-port in)
The content-0 above is
(element
(location 0 0 15)
(location 0 0 82)
...
I'm wondering how to extract specific content from it.
Usually it's more convenient to deal with HTML as x-expressions instead of the html module's structs.
Also you should probably use call/input-url to handle closing the port automatically.
You can combine both of these ideas by defining a read-html-as-xexpr function and using it like this:
#lang racket/base
(require html
net/url
xml)
(define (read-html-as-xexpr in) ;; input-port? -> xexpr?
(caddr
(xml->xexpr
(element #f #f 'root '()
(read-html-as-xml in)))))
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(call/input-url reddit
get-pure-port
read-html-as-xexpr)
That will return a big x-expression like:
'(html
((lang "en") (xml:lang "en") (xmlns "http://www.w3.org/1999/xhtml"))
(head
()
(title () "programming: search results")
(meta
((content " reddit, reddit.com, vote, comment, submit ")
(name "keywords")))
(meta
((content "reddit: the front page of the internet") (name "description")))
(meta ((content "origin") (name "referrer")))
(meta ((content "text/html; charset=UTF-8") (http-equiv "Content-Type")))
... snip ...
How to extract specific pieces of that?
For simple HTML where I don't expect the overall structure to change, I will often just use match.
However a more correct and robust way to go about it is to use the xml/path module.
UPDATE: I noticed your question started by asking about extracting URLs. Here's the example updated to use se-path*/list to get all the href attributes of all the <a> elements:
#lang racket/base
(require html
net/url
xml
xml/path)
(define (read-html-as-xexprs in) ;; (-> input-port? xexpr?)
(caddr
(xml->xexpr
(element #f #f 'root '()
(read-html-as-xml in)))))
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define xe (call/input-url reddit
get-pure-port
read-html-as-xexprs))
(se-path*/list '(a #:href) xe)
Result:
'("#content"
"http://www.reddit.com/r/announcements/"
"http://www.reddit.com/r/Art/"
"http://www.reddit.com/r/AskReddit/"
"http://www.reddit.com/r/askscience/"
"http://www.reddit.com/r/aww/"
"http://www.reddit.com/r/blog/"
"http://www.reddit.com/r/books/"
"http://www.reddit.com/r/creepy/"
"http://www.reddit.com/r/dataisbeautiful/"
"http://www.reddit.com/r/DIY/"
"http://www.reddit.com/r/Documentaries/"
"http://www.reddit.com/r/EarthPorn/"
"http://www.reddit.com/r/explainlikeimfive/"
"http://www.reddit.com/r/Fitness/"
"http://www.reddit.com/r/food/"
... snip ...

exception: current-continuation-marks: no corresponding prompt in the continuation: #<continuation-prompt-tag:web>

Why do I get an exception on (redirect/get) in this program?
#lang web-server
(require web-server/formlets web-server/page)
(struct app (nm) #:mutable)
(define (start req) (render-main-page req))
This function is used by most pages. It generates the complete page xexpr by calling each of the given page-piece generator functions, each of which may embed its own URLs:
(define (render-page embed/url a-title blocks)
(response/xexpr `(html (head (title ,a-title)
(body ,@(map (lambda (block) (block embed/url)) blocks))))))
This is a piece of the first page's generator function:
(define (render-action-panel embed/url action)
`(a ([href ,(embed/url action)]) "New"))
This is the first page:
(define/page (render-main-page)
(local [(define (new-handler req) (render-app-page req (app "new value")))
(define (panel-block embed/url) (render-action-panel embed/url new-handler))]
(render-page embed/url "Title" (list panel-block))))
This is a piece of the second page's generator function (it represents the form):
(define (add-app-formlet an-app) (formlet (#%# ,{=> input-string nm}) nm))
(define (render-app-form embed/url an-app save-handler)
`(div (form ([action ,(embed/url save-handler)][method "POST"])
(span ,@(formlet-display (add-app-formlet an-app) ))
(span (input ([type "submit"][value "Save"]))))))
This is the second page. Its save handler throws the exception when it tries to do post-redirect-get:
(define/page (render-app-page an-app)
(local [(define (save-handler req)
(render-app-page
(redirect/get)
(set-app-nm! an-app (formlet-process (add-app-formlet an-app) req))))
(define (form-block embed/url)
(render-app-form embed/url an-app save-handler ))
]
(render-page embed/url "Title - form: " (list form-block))))
(require web-server/servlet-env)
(serve/servlet start)
Which redirect/get are you using?
The one from web-server/lang/servlet (which should be used with #lang web-server) is different from the one from web-server/servlet (which should be used with #lang racket and friends).
This error message means that you are using the one from web-server/servlet.
FWIW, web-server/page cannot be used with #lang web-server because it is just a simple macro over the send/suspend/dispatch from web-server/servlet.
