How to apply the string() function to each attribute? - xpath

I need to grab some hyper references in a Bash script.
The following command uses curl and xmllint to read all href attributes of an HTML page:
curl --silent -L google.com | xmllint --html --xpath '//a/@href' -
But I need only the values of the attributes. The value of an attribute can be selected with the string() function, but if I use it, I get only the first element of the list of attributes:
curl --silent -L google.com | xmllint --html --xpath 'string(//a/@href)' -
How can I apply the string() function to each attribute?

You could do (notice the difference in the XPath expression):
curl --silent -L google.com | xmllint --html --xpath '//a/@*' -
and then add another pipe to send the output to sed, filtering out the attribute names to get the values you want. But this is a somewhat odd way to extract data from a document.
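As a sketch of that sed step (the sample string below is invented for illustration; real xmllint output depends on the page), assuming you queried //a/@href and got back space-separated href="…" chunks:

```shell
# xmllint --html --xpath '//a/@href' - prints chunks like: href="..." href="..."
sample=' href="http://example.com/a" href="http://example.com/b"'

# Keep only the quoted values, one URL per line.
printf '%s\n' "$sample" \
  | grep -o 'href="[^"]*"' \
  | sed 's/^href="//; s/"$//'
# → http://example.com/a
#   http://example.com/b
```

The grep -o isolates each attribute, and sed strips the name and quotes; the same pattern works for other attribute names by adjusting the regex.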

Related

How to write a script that will use regex to output only the heading and paragraph text from the http://example.com website

I am a beginner in scripting and I am working on bash scripting for my work.
For this task I tried the sed command, which didn't work.
For your problem, the following would work:
#!/bin/bash
curl -s http://example.com/ | grep -P "\s*\<h1\>.*\<\/h1\>" |sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p'
curl -s http://example.com/ | grep -P "\s*\<p\>.*\<\/p\>" |sed -n 's:.*<p>\(.*\)</p>.*:\1:p'
The first line fetches the page via curl, grep isolates the <h1>..</h1> part (assuming there's only one, as in your example), and sed extracts the first capturing group ( \(.*\) ) via \1.
The second line does the same, but for the <p> tag.
I could cram these two lines into one grep, but these will work fine!
Edit:
If the <p> tag's content spans multiple lines, the above wouldn't work; you may have to use pcregrep:
curl -s http://example.com/ | pcregrep -M "\s*\<p\>(\n|.)*\<\/p\>"
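The sed capture itself can be sanity-checked on a literal sample line (the sample text here is invented, standing in for what curl would return):

```shell
# A line as it might appear in the fetched HTML.
line='  <h1>Example Domain</h1>'

# -n suppresses default output; :...:\1:p prints only the captured group.
printf '%s\n' "$line" | sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p'
# → Example Domain
```

Using : as the s-command delimiter avoids having to escape the / characters in the closing tag.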
You can use the following one-liner:
curl -s http://example.com/ | sed -n '2,$p' > /tmp/tempfile && cat /tmp/tempfile | xmllint --xpath '/html/head/title/text()' - && echo ; cat /tmp/tempfile | xmllint --xpath '/html/body/div/p/text()' -
This uses xmllint's xpath command to extract the text within <title> and <p> tags.

Variable as tag for xpath - bash script (html)

I am currently trying to print a set of data from a web page using XPath. I am having issues with looping over an array of tags, as the example below produces empty $element variables:
declare -a Elements=('//*[@id="page-wrapper"]/div[1]' '//*[@id="page-wrapper"]/div[2]');
COUNTER=1
for tag in "${Elements[@]}";
do
element="$(curl -s http://mypage | xmllint --html --xpath '$tag' - 2>text.txt | tr -d 'a-z<>=""/')" \
echo ELEMENT $COUNTER : $element
let COUNTER=COUNTER+1
done
If I manually replace the '$tag' with the XPath expression (e.g. having the following):
element="$(curl -s http://mypage | xmllint --html --xpath '//*[@id="page-wrapper"]/div[1]' - 2>text.txt | tr -d 'a-z<>=""/')" \
everything works perfectly. Any ideas what I am doing wrong? I believe there is something wrong with the syntax surrounding $tag, but I cannot see what exactly I'm doing incorrectly. Any nudge in the right direction would be much appreciated!
Change the line
xmllint --html --xpath '$tag'
to
xmllint --html --xpath "$tag"
It's the rule of thumb: single quotes do NOT expand variables in bash; you need to double-quote them. Using single quotes around a variable name deprives $ of doing variable interpolation.
Also a good read: Expressions don't expand in single quotes, use double quotes for that.
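A minimal demonstration of the difference (runnable on its own):

```shell
tag='//a/@href'

echo '$tag'   # single quotes: prints the literal text  $tag
echo "$tag"   # double quotes: prints the value         //a/@href
```

This is exactly why xmllint received the four-character string $tag as its XPath expression and matched nothing.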

how to pass xml to xmllint in a variable instead of a file?

I want to do this:
xmllint --xpath "//filestodelete[filename = somename]/text()" #filestodelete#
where filestodelete is a BPEL variable of type XML, but it does not work.
How can I do this?
Assuming you've put your query text in a shell variable named query (to make my examples terser) --
With bash, you can use a herestring:
xmllint --xpath "$query" - <<<"$filestodelete"
With POSIX sh, you need to use a heredoc:
xmllint --xpath "$query" - <<EOF
$filestodelete
EOF
By the way -- since not all versions of xmllint support --xpath, you'd have better compatibility across releases if you used XMLStarlet instead, which has supported the following since its initial creation:
xmlstarlet sel -t -m "$query" -v . <<<"$filestodelete"
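A minimal, self-contained illustration of both stdin forms, with cat standing in for xmllint (the XML content is invented for the example):

```shell
filestodelete='<files><filename>somename</filename></files>'

# bash herestring: the variable's contents become stdin.
cat <<<"$filestodelete"

# POSIX heredoc: works in plain sh as well.
cat <<EOF
$filestodelete
EOF
```

Either way, the document never needs to touch the filesystem; the trailing - in the xmllint invocation tells it to read that stdin.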

Extract value via OSX Terminal from .html for "curl" submission within a single script

How do I extract the value attribute from the following line of an HTML page via Terminal, so I can submit it afterwards via "curl -d" in the same script?
<input type="hidden" name="au_pxytimetag" value="1234567890">
Edit: How do I transfer the extracted value to the "curl -d" command within a single script? Might be a silly question, but I'm a total noob. =0)
EDITED:
I cannot tell from your question what you are actually trying to do. I originally thought you were trying to extract a variable from a file, but it seems you actually want to, firstly, get that file; secondly, extract a variable; and thirdly, use that variable for something else. So let's address each of those steps:
Firstly you want to grab a page using curl, so you will do
curl www.some.where.com
and the page will be output on your terminal. But actually you want to search for something on that page, so you need to do
curl www.some.where.com | awk something
or
curl www.some.where.com | grep something
But you want to put that into a variable, so you need to do
var=$(curl www.some.where.com | awk something)
or
var=$(curl www.some.where.com | grep something)
The actual command I think you want is
var=$(curl www.some.where.com | awk -F\" '/au_pxytimetag/{print $(NF-1)}')
Then you want to use the variable var for another curl operation, so you will need to do
curl -d "param1=$var" http://some.url.com/somewhere
Original answer
I'd use awk like this:
var=$(awk -F\" '/au_pxytimetag/{print $(NF-1)}' yourfile)
to take the second-to-last field on the line containing au_pxytimetag, using " as the field separator.
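To see the field splitting concretely, here is that awk command run on the sample line from the question (no network needed):

```shell
line='<input type="hidden" name="au_pxytimetag" value="1234567890">'

# With " as separator the fields are:
#   <input type= | hidden |  name= | au_pxytimetag |  value= | 1234567890 | >
# so $(NF-1) is the value.
printf '%s\n' "$line" | awk -F\" '/au_pxytimetag/{print $(NF-1)}'
# → 1234567890
```

This relies on the value attribute being the last quoted attribute on the line, which holds for the markup in the question.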
Then you can use it like this
curl -d "param1=$var&param2=SomethingElse" http://some.url.com/somewhere
You can use xmllint:
value=$(xmllint --html --xpath "string(//input[@name='au_pxytimetag']/@value)" index.html)
You can do it with my Xidel:
xidel http://webpage -e "//input[@name='au_pxytimetag']/@value"
But you do not need to.
With
xidel http://webpage -f "(//form)[1]" -e "//what-you-need-from-the-next-page"
you can send all the values from the first form on the webpage to the form action, and then query something from the next page.
You can try:
grep au_pxytimetag input.html | sed "s/.* value=\"\(.*\)\".*/\1/"
EDIT:
If you need this on a script:
#!/bin/bash
DATA=$(grep au_pxytimetag input.html | sed "s/.* value=\"\(.*\)\".*/\1/")
curl http://example.com -d "$DATA"
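The sed substitution used above can be verified against the sample line from the question before wiring it into the curl call:

```shell
line='<input type="hidden" name="au_pxytimetag" value="1234567890">'

# Greedy .* anchors on the last ' value="' and captures everything
# up to the closing quote.
printf '%s\n' "$line" | sed 's/.* value="\(.*\)".*/\1/'
# → 1234567890
```

Quoting "$DATA" in the curl invocation matters: if the extracted value ever contains spaces or shell metacharacters, an unquoted expansion would split it into multiple arguments.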

Extract just the link from "a href"

xmllint sample.xml --xpath "//a[text()='some value']/#href"
outputs:
href="http://some.address"
How can I output just the link without attribute name?
xmllint sample.xml --xpath "string(//a[text()='some value']/#href)"
