I want to extract the urls in reddit, my code is
#lang racket
(require net/url)
(require html)
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define in (get-pure-port reddit #:redirections 5))
(define response-html (read-html-as-xml in))
(define content-0 (list-ref response-html 0))
(close-input-port in)
The content-0 above is
(element
(location 0 0 15)
(location 0 0 82)
...
I'm wondering how to extract specific content from it.
Usually it's more convenient to deal with HTML as x-expressions instead of the html module's structs.
Also you should probably use call/input-url to handle closing the port automatically.
You can combine both of these ideas by defining a read-html-as-xexpr function and using it like this:
#lang racket/base
(require html
net/url
xml)
(define (read-html-as-xexpr in) ;; input-port? -> xexpr?
(caddr
(xml->xexpr
(element #f #f 'root '()
(read-html-as-xml in)))))
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(call/input-url reddit
get-pure-port
read-html-as-xexpr)
That will return a big x-expression like:
'(html
((lang "en") (xml:lang "en") (xmlns "http://www.w3.org/1999/xhtml"))
(head
()
(title () "programming: search results")
(meta
((content " reddit, reddit.com, vote, comment, submit ")
(name "keywords")))
(meta
((content "reddit: the front page of the internet") (name "description")))
(meta ((content "origin") (name "referrer")))
(meta ((content "text/html; charset=UTF-8") (http-equiv "Content-Type")))
... snip ...
How to extract specific pieces of that?
For simple HTML where I don't expect the overall structure to change, I will often just use match.
However a more correct and robust way to go about it is to use the xml/path module.
UPDATE: I noticed your question started by asking about extracting URLs. Here's the example updated to use se-path*/list to get all the href attributes of all the <a> elements:
#lang racket/base
(require html
net/url
xml
xml/path)
(define (read-html-as-xexprs in) ;; (-> input-port? xexpr?)
(caddr
(xml->xexpr
(element #f #f 'root '()
(read-html-as-xml in)))))
(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define xe (call/input-url reddit
get-pure-port
read-html-as-xexprs))
(se-path*/list '(a #:href) xe)
Result:
'("#content"
"http://www.reddit.com/r/announcements/"
"http://www.reddit.com/r/Art/"
"http://www.reddit.com/r/AskReddit/"
"http://www.reddit.com/r/askscience/"
"http://www.reddit.com/r/aww/"
"http://www.reddit.com/r/blog/"
"http://www.reddit.com/r/books/"
"http://www.reddit.com/r/creepy/"
"http://www.reddit.com/r/dataisbeautiful/"
"http://www.reddit.com/r/DIY/"
"http://www.reddit.com/r/Documentaries/"
"http://www.reddit.com/r/EarthPorn/"
"http://www.reddit.com/r/explainlikeimfive/"
"http://www.reddit.com/r/Fitness/"
"http://www.reddit.com/r/food/"
... snip ...
Related
Good afternoon. The defun macro places the procedure code in the functional cell of the symbol and, if no errors are found, prints the name of the symbol in the mini-buffer.
(defun InsertSsulku ()
"Вставка текста ссылки на якорь-ноду в точку"
(interactive)
(let* ((karta '(keymap "Вставка ссылок"
("Решения" menu-item "Решения"
(keymap "Решения"
("#ref{сбрнкФйлвРквдств, текст-ссылки}" menu-item "#anchor{сбрнкФйлвРквдств}" fint-function :key-sequence nil)))
("Скрипты" menu-item "Скрипты"
(keymap "Скрипты"
("#ref{Родительский скрипт, текст ссылки}" menu-item "#anchor{Родительский скрипт}" fint-function :key-sequence nil)
("#ref{Дочерний скрипт, текст ссылки}" menu-item "#anchor{Дочерний скрипт}" fint-function :key-sequence nil)))))
(выбзнч (x-popup-menu (list '(0 0) (selected-frame))
karta)))
(insert (nth (- (length выбзнч) 1) выбзнч))))
The next question is, why does the procedure begin to run when defined?
The next question is, why does the procedure begin to run when defined?
It doesn't. What do you mean by that?
Preamble
I was looking throught the source code in clojure.core for no particular reason.
I started reading defmacro ns, here is the abridged source:
(defmacro ns
"...docstring..."
{:arglists '([name docstring? attr-map? references*])
:added "1.0"}
[name & references]
(let [...
; Argument processing here.
name-metadata (meta name)]
`(do
(clojure.core/in-ns '~name)
~#(when name-metadata
`((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))
(with-loading-context
~#(when gen-class-call (list gen-class-call))
~#(when (and (not= name 'clojure.core) (not-any? #(= :refer-clojure (first %)) references))
`((clojure.core/refer '~'clojure.core)))
~#(map process-reference references))
(if (.equals '~name 'clojure.core)
nil
(do (dosync (commute ##'*loaded-libs* conj '~name)) nil)))))
Looking Closer
And then trying to read it I saw some strange macro patterns, in particular we can look at:
~#(when name-metadata
`((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))
The clojure.core version
Here is a standalone working extraction from the macro:
(let [name-metadata 'name-metadata
name 'name]
`(do
~#(when name-metadata
`((.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata)))))
=> (do (.resetMeta (clojure.lang.Namespace/find (quote name)) name-metadata))
When I ran this could I couldn't help but wonder why there is a double list at the point `((.resetMeta.
My version
I found that by just removing the unquote-splicing (~#) the double list was unnecessary. Here is a working standalone example:
(let [name-metadata 'name-metadata
name 'name]
`(do
~(when name-metadata
`(.resetMeta (clojure.lang.Namespace/find '~name) ~name-metadata))))
=> (do (.resetMeta (clojure.lang.Namespace/find (quote name)) name-metadata))
My Question
Thus, why does clojure.core choose this seemingly extraneous way of doing things?
My Own Thoughts
Is this an artifact of convention?
Are there other similar instances where this is used in more complex ways?
~ always emits a form; ~# can potentially emit nothing at all. Thus sometimes one uses ~# to splice in a single expression conditionally:
;; always yields the form (foo x)
;; (with x replaced with its macro-expansion-time value):
`(foo ~x)`
;; results in (foo) is x is nil, (foo x) otherwise:
`(foo ~#(if x [x]))
That's what's going on here: the (.resetMeta …) call is emitted within the do form that ns expands to only if name-metadata is truthy (non-false, non-nil).
In this instance, it doesn't really matter – one could use ~, drop the extra brackets and accept that the macroexpansion of an ns form with no name metadata would have an extra nil in the do form. For the sake of a prettier expansion, though, it makes sense to use ~# and only emit a form to handle name metadata when it is actually useful.
I try to write a program that adds in the source code the string "hello world". The name of the file in source.rkt. It gives me this error:
source.rkt:6:31: #%datum: keyword used as an expression in: #:mode
#(118 6)
This is the code:
#lang racket
(provide (all-defined-out))
(define out (open-output-file "source.rkt"
[#:mode 'text
#:exists 'can-update]))
(write "hello world" out)
(close-output-port out)
The brackets are not literals. They mean optional. Therefore, the correct syntax is:
(define out (open-output-file "source.rkt"
#:mode 'text
#:exists 'can-update))
why do i get exception on (redirect/get) in this program
#lang web-server
(require web-server/formlets web-server/page) (struct app (nm) #:mutable)
(define (start req) (render-main-page req))
this function is to be used by most pages and generates comlete page xexpr by calling each given piece of page generator functions, each of which may embed their urls
(define (render-page embed/url a-title blocks)
(response/xexpr `(html (head (title ,a-title)
(body ,#(map (lambda (block) (block embed/url)) blocks))))))
this is piece of first page generator function
(define (render-action-panel embed/url action)
`(a ([href ,(embed/url action)]) "New"))
this is first page
(define/page (render-main-page)
(local [(define (new-handler req) (render-app-page req (app "new value")))
(define (panel-block embed/url) (render-action-panel embed/url new-handler))]
(render-page embed/url "Title" (list panel-block))))
this is piece of second page generator function (represents form)
(define (add-app-formlet an-app) (formlet (#%# ,{=> input-string nm}) nm))
(define (render-app-form embed/url an-app save-handler)
`(div (form ([action ,(embed/url save-handler)][method "POST"])
(span ,#(formlet-display (add-app-formlet an-app) ))
(span (input ([type "submit"][value "Save"]))))));)
the second form,
save handler throws exception when trying do post-redirect-get
(define/page (render-app-page an-app)
(local [(define (save-handler req)
(render-app-page
(redirect/get)
(set-app-nm! an-app (formlet-process (add-app-formlet an-app) req))))
(define (form-block embed/url)
(render-app-form embed/url an-app save-handler ))
]
(render-page embed/url "Title - form: " (list form-block))))
(require web-server/servlet-env)
(serve/servlet start)
Which redirect/get are you using?
The one from web-server/lang/servlet (which should be used with #lang web-server) is different than the one from web-server/servlet (which should be used with #lang racket (and friends))
This error message means that you are using the one from web-server/servlet.
FWIW, web-server/page cannot be used with #lang web-server because it is just a simple macro over the send/suspend/dispatch from web-server/servlet.
How can I write an Emacs Lisp function to find all hrefs in an HTML file and extract all of the links?
Input:
<html>
<a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
<h1>Emacs Lisp</h1>
<a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>
Output:
http://www.stackoverflow.com|StackOverFlow
http://news.ycombinator.com|Hacker News
I've seen the re-search-forward function mentioned several times during my search. Here's what I think that I need to do based on what I've read so far.
(defun extra-urls (file)
...
(setq buffer (...
(while
(re-search-forward "http://" nil t)
(when (match-string 0)
...
))
I took Heinzi's solution and came up with the final solution that I needed. I can now take a list of files, extract all URL's and titles, and place the results in one output buffer.
(defun extract-urls (fname)
"Extract HTML href url's,titles to buffer 'new-urls.csv' in | separated format."
(setq in-buf (set-buffer (find-file fname))); Save for clean up
(beginning-of-buffer); Need to do this in case the buffer is already open
(setq u1 '())
(while
(re-search-forward "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>" nil t)
(when (match-string 0) ; Got a match
(setq url (match-string 1) ) ; URL
(setq title (match-string 2) ) ; Title
(setq u1 (cons (concat url "|" title "\n") u1)) ; Build the list of URLs
)
)
(kill-buffer in-buf) ; Don't leave a mess of buffers
(progn
(with-current-buffer (get-buffer-create "new-urls.csv"); Send results to new buffer
(mapcar 'insert u1))
(switch-to-buffer "new-urls.csv"); Finally, show the new buffer
)
)
;; Create a list of files to process
;;
(mapcar 'extract-urls '(
"/tmp/foo.html"
"/tmp/bar.html"
))
If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:
(defun getlinks ()
(beginning-of-buffer)
(replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
(beginning-of-buffer)
(replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
(beginning-of-buffer)
(replace-regexp "
+" "
")
(beginning-of-buffer)
(replace-regexp "^LINK:\\(.*\\)$" "\\1")
)
It replaces all links with LINK:url|description, deletes all lines containing anything else, deletes empty lines, and finally removes the "LINK:".
Detailed HOWTO: (1) Correct the bug in your example html file by replacing <href with <a href, (2) copy the above function into Emacs scratch, (3) hit C-x C-e after the final ")" to load the function, (4) load your example HTML file, (5) execute the function with M-: (getlinks).
Note that the linebreaks in the third replace-regexp are important. Don't indent those two lines.
You can use the 'xml library, examples of using the parser are found here. To parse your particular file, the following does what you want:
(defun my-grab-html (file)
(interactive "fHtml file: ")
(let ((res (car (xml-parse-file file)))) ; 'car because xml-parse-file returns a list of nodes
(mapc (lambda (n)
(when (consp n) ; don't operate on the whitespace, xml preserves whitespace
(let ((link (cdr (assq 'href (xml-node-attributes n)))))
(when link
(insert link)
(insert "|")
(insert (car (xml-node-children n))) ;# grab the text for the link
(insert "\n")))))
(xml-node-children res))))
This does not recursively parse the HTML to find all the links, but it should get you started in the direction of the general solution.