Extract just the link from "a href" - xpath

xmllint sample.xml --xpath "//a[text()='some value']/@href"
outputs:
href="http://some.address"
How can I output just the link without attribute name?

xmllint sample.xml --xpath "string(//a[text()='some value']/@href)"

Related

How to write a script that will use regex to output only the heading and paragraph text from the http://example.com website

I am a beginner in scripting and am working on a bash script for my work.
For this task I tried the sed command, which didn't work.
For your problem, the following would work:
#!/bin/bash
curl -s http://example.com/ | grep -P "\s*\<h1\>.*\<\/h1\>" |sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p'
curl -s http://example.com/ | grep -P "\s*\<p\>.*\<\/p\>" |sed -n 's:.*<p>\(.*\)</p>.*:\1:p'
The first line fetches the page with curl, uses grep to select the <h1>..</h1> line (assuming there is only one, as in your example), and then uses sed to print the first capturing group, \(.*\), via \1.
The second line does the same for the <p> tag.
I could cram these two lines into one grep, but they work fine as they are!
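To see what the sed capture group does without touching the network, here is a minimal sketch that runs the same substitutions on a hard-coded sample. The html variable and its contents are made up for illustration; they are not the real page.

```shell
# Hypothetical sample markup standing in for the curl output
html='<html><body>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples.</p>
</body></html>'

# -n suppresses default output; the p flag prints only lines where the substitution matched
heading=$(printf '%s\n' "$html" | sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p')
paragraph=$(printf '%s\n' "$html" | sed -n 's:.*<p>\(.*\)</p>.*:\1:p')

printf 'heading: %s\n' "$heading"
printf 'paragraph: %s\n' "$paragraph"
```

Using : as the sed delimiter avoids having to escape the / in the closing tags.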
Edit:
If a <p> element spans multiple lines, the above won't work; you may have to use pcregrep:
curl -s http://example.com/ | pcregrep -M "\s*\<p\>(\n|.)*\<\/p\>"
You can use the following one-liner:
curl -s http://example.com/ | sed -n '2,$p' > /tmp/tempfile && cat /tmp/tempfile | xmllint --xpath '/html/head/title/text()' - && echo ; cat /tmp/tempfile | xmllint --xpath '/html/body/div/p/text()' -
This uses xmllint's --xpath option to extract the text within the <title> and <p> tags.

Variable as tag for xpath - bash script (html)

I am currently trying to print a set of data from a web page using xpath. I am having issues with looping over an array of tags, as the example below produces empty $element variables:
declare -a Elements=('//*[@id="page-wrapper"]/div[1]' '//*[@id="page-wrapper"]/div[2]');
COUNTER=1
for tag in "${Elements[@]}";
do
element="$(curl -s http://mypage | xmllint --html --xpath '$tag' - 2>text.txt | tr -d 'a-z<>=""/')" \
echo ELEMENT $COUNTER : $element
let COUNTER=COUNTER+1
done
If I manually replace the '$tag' with the xpath (e.g. with the following):
element="$(curl -s http://mypage | xmllint --html --xpath '//*[@id="page-wrapper"]/div[1]' - 2>text.txt | tr -d 'a-z<>=""/')" \
everything works perfectly. Any ideas what I am doing wrong? I believe there is something wrong with the syntax surrounding $tag, but I cannot see what exactly I'm doing incorrectly. Any nudge in the right direction would be much appreciated!
Change the line
xmllint --html --xpath '$tag'
to
xmllint --html --xpath "$tag"
It's a rule of thumb: single quotes do NOT expand variables in bash; you need double quotes for that. Putting single quotes around a variable name prevents $ from triggering variable interpolation.
Also a good read: Expressions don't expand in single quotes, use double quotes for that.
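The difference is easy to check in isolation. The following is a small sketch with no curl or xmllint involved; the variable names single and double are only illustrative.

```shell
tag='//*[@id="page-wrapper"]/div[1]'

single='$tag'   # single quotes: the literal four characters $tag
double="$tag"   # double quotes: the value stored in tag

printf 'single quotes: %s\n' "$single"
printf 'double quotes: %s\n' "$double"
```

With single quotes xmllint receives the literal text $tag as its XPath expression, which would explain the empty $element in the loop.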

xmllint: Formatting without adding header

Is there a way to use $xmllint --format file without the header section?
<?xml version="1.0"?>
<Tag>
<Sub>A</Sub>
</Tag>
I know you can use --c14n but that does not seem to mix well with --format.
As $xmllint --format --c14n file will just produce:
<Tag><Sub>A</Sub></Tag>
Desired Result
<Tag>
<Sub>A</Sub>
</Tag>
You can use sed to remove the first line. Not saying it's the best but it would get you going:
xmllint --format <file> | sed 1d
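Note that sed 1d drops the first line unconditionally. A slightly more defensive sketch, assuming the declaration (when present) is always the first line, deletes that line only if it starts with <?xml. The formatted variable below is a hand-written stand-in for the output of xmllint --format.

```shell
# Hypothetical stand-in for the output of: xmllint --format file
formatted='<?xml version="1.0"?>
<Tag>
  <Sub>A</Sub>
</Tag>'

# Delete line 1 only when it begins with an XML declaration
body=$(printf '%s\n' "$formatted" | sed '1{/^<?xml/d;}')

printf '%s\n' "$body"
```

In a real pipeline the same sed expression would simply replace sed 1d after xmllint --format.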
You would preferably avoid a million separate calls to xmllint, and to sed or tail.
I'm not sure if xmllint supports in-place editing. If it does, something like this might be possible:
xargs < list_of_files_to_change.txt xmllint --inplace --format
xargs < list_of_files_to_change.txt sed -i 1d

how to pass xml to xmllint in a variable instead of a file?

I want to do this:
xmllint --xpath "//filestodelete[filename = somename]/text()" #filestodelete#
where filestodelete is a BPEL variable of type XML,
but it does not work.
How can I do this?
Assuming you've put your query text in a shell variable named query (to make my examples terser) --
With bash, you can use a herestring:
xmllint --xpath "$query" - <<<"$filestodelete"
With POSIX sh, you need to use a heredoc:
xmllint --xpath "$query" - <<EOF
$filestodelete
EOF
By the way -- since not all versions of xmllint support --xpath, you'd have better compatibility across releases if you used XMLStarlet instead, which has supported the following from its initial creation:
xmlstarlet sel -t -m "$query" -v . <<<"$filestodelete"
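To check that the variable really arrives on standard input, the sketch below swaps the XML tool for a stand-in function that just echoes its input. The name read_stdin and the sample XML are hypothetical; in practice the function would be replaced by xmllint --xpath "$query" - or the xmlstarlet call.

```shell
# Hypothetical XML held in a shell variable
filestodelete='<root><filestodelete><filename>somename</filename></filestodelete></root>'

# Stand-in for a command that reads XML from stdin, e.g. xmllint --xpath "$query" -
read_stdin() { cat; }

# POSIX heredoc: the variable is expanded inside the document body
via_heredoc=$(read_stdin <<EOF
$filestodelete
EOF
)

# Equivalent portable alternative: pipe the variable in with printf
via_pipe=$(printf '%s\n' "$filestodelete" | read_stdin)

printf '%s\n' "$via_heredoc"
```

Both forms deliver the same bytes to the command, so the choice is mostly a matter of style and shell portability.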

How to apply the string() function to each attribute?

I need to grab some hyper references in a Bash script.
The following command uses curl and xmllint to read all href attributes of a HTML page:
curl --silent -L google.com | xmllint --html --xpath '//a/@href' -
But I need only the values of the attributes. The value of an attribute can be selected with the string() function. But if I use it, I get only the first element of the list of attributes:
curl --silent -L google.com | xmllint --html --xpath 'string(//a/@href)' -
How can I apply the string() function to each attribute?
You could do (notice the difference in the XPath expression):
curl --silent -L google.com | xmllint --html --xpath '//a/@*' -
and then add another pipe to send the output to sed, filtering out the attribute names to get the values you want. But this is a sort of odd way to extract stuff from a document.
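As a rough sketch of that filtering step, assuming xmllint prints attributes in the form href="value" (either space-separated on one line or one per line, depending on the version), grep -o can isolate each attribute and sed can strip the name and quotes. The sample string is made up and stands in for the real xmllint output.

```shell
# Hypothetical stand-in for the attribute output of xmllint --xpath
sample=' href="http://a.example/" href="/search?q=1"'

# grep -o emits each href="..." match on its own line;
# sed then removes the leading href=" and the trailing quote
links=$(printf '%s\n' "$sample" \
  | grep -o 'href="[^"]*"' \
  | sed -e 's/^href="//' -e 's/"$//')

printf '%s\n' "$links"
```

This relies on attribute values never containing a double quote, which holds for the serialized output but is worth keeping in mind.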
