Simple dom document iteration - xpath

I have HTML like this:
<html>
<body>
<div class="somethingunneccessary"></div>
<div class="container">
<div>
<p>text1</p>
<p>text2</p>
<p>text3</p>
</div>
<div>
<p>text4</p>
<p>text5</p>
<p>text6</p>
</div>
<div>
<p>text7</p>
<p>text8</p>
<p>text9</p>
</div>
<div>
<p>text10</p>
<p>text11</p>
<p>text12</p>
</div>
<div>
<p>text13</p>
<p>text14</p>
<p>text15</p>
</div>
</div>
</body>
</html>
What I'm trying to accomplish is the following:
1./ Loop over the div elements within the div having a class container.
2./ During the iteration I want to grab the text from the 3rd p tag.
The looping part is essential; I don't want to just slice out the p tags by themselves.
I have some code, but it doesn't loop:
$doc=new DOMDocument();
$doc->loadHTML($htmlsource);
$xpath = new DOMXpath($doc);
$commentxpath = $xpath->query("/html/body/div[2]/div[5]/p[3]");
$commentdata = $commentxpath->item(0)->nodeValue;
How do I loop through each inner div element and extract the 3rd p tag?
Like I said, the looping is essential.

During the iteration I want to grab the text from the 3rd p tag
Try:
"//div[@class='container']/div/p[3]"
This should return the third p of every div inside the div with class container.
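For comparison, here is a minimal sketch of the same idea with an explicit loop. It is written in Python rather than PHP and uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath. The shortened sample markup and the names html_source and third_paragraphs are assumptions made for illustration; HTML that is not well-formed XML would need a real HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the markup from the question (assumed sample).
html_source = """
<html>
<body>
<div class="somethingunneccessary"></div>
<div class="container">
<div><p>text1</p><p>text2</p><p>text3</p></div>
<div><p>text4</p><p>text5</p><p>text6</p></div>
<div><p>text7</p><p>text8</p><p>text9</p></div>
</div>
</body>
</html>
"""

root = ET.fromstring(html_source)

third_paragraphs = []
# Loop over each inner div of the div with class "container" ...
for inner_div in root.findall(".//div[@class='container']/div"):
    # ... and query relative to that div for its third p child.
    third_p = inner_div.find("p[3]")
    if third_p is not None:
        third_paragraphs.append(third_p.text)

print(third_paragraphs)
```

The point of the sketch is that the second query runs relative to the current div, so each iteration sees only that div's own p elements.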

You may have to query by attribute (see the related question "php xpath get attribute value"):
$xpath->query("/html/body/div[@class='container']");

Just try
/html/body/div/div//p
That should return only the p elements XD


Auto generate XPath for known element in HTML tree using python

Is there any way (libs, not manually) for generating relative XPath for a known element in HTML?
Let's say, the second p element inside the div with class="content":
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>****</p>
<p>****</p>
</div>
</body>
</html>
Use case:
The idea is to guess where the elements I might be interested in are, for example the title, content, or author. After I've found an element, I want to generate an XPath for it and use it later. I'm using Python 3.
Try something like this:
from lxml import etree
datum = """
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>
"""
root = etree.fromstring(datum)
tree = etree.ElementTree(root)
find_text = etree.XPath("//p[text()='target']")
for target in find_text(root):
    # getpath() returns the absolute XPath of the matched element
    print(tree.getpath(target))
Output:
/html/body/div[2]/p[2]

Select element based on cousin value

Let's say I have this HTML (ignore the tag names):
<div>
<card>
<h2>1</h2>
</card>
<footer>
<p>text 1</p>
</footer>
</div>
<div>
<card>
<h2>2</h2>
</card>
<footer>
<p>text 2</p>
</footer>
</div>
<div>
<card>
<h2>3</h2>
</card>
<footer>
<p>text 2</p>
</footer>
</div>
I want to select the p tag whose cousin h2 has the value 2 (that is, the p with text 2).
If I use the expression //h2[text()="2"]/../following::footer/p, I get 2 p tags.
How do I select only the p tag whose cousin h2 has the value 2?
EDIT: Robbie Averill's answer was the first to work, but you should check the other answers; they are very good too.
You can navigate from the h2 matched up to the div that contains the element you want, then target footer/p elements from there:
//h2[text()="2"]/../../footer/p
Try the XPath below to select the required element:
//card[h2="2"]/following-sibling::footer/p
This XPath,
//div[card/h2="2"]/footer/p
will select footer/p cousins of card/h2 elements with string values of 2.
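To check the "go up to the shared div, then back down" idea, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. That module supports only a subset of XPath, so the sketch uses the parent step (..) form. The wrapping root element and the variable names are hypothetical, and the third footer text is changed to "text 3" (an assumption) so the result is unambiguous.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a hypothetical <root> element so it
# parses as a single XML document. The third footer text is "text 3" here
# (an assumption) to make the selected element easy to tell apart.
fragment = """
<root>
<div><card><h2>1</h2></card><footer><p>text 1</p></footer></div>
<div><card><h2>2</h2></card><footer><p>text 2</p></footer></div>
<div><card><h2>3</h2></card><footer><p>text 3</p></footer></div>
</root>
"""

root = ET.fromstring(fragment)

# Find the card whose h2 child has the text "2", step up to the div they
# share, then go down into that div's footer/p.
cousins = root.findall(".//card[h2='2']/../footer/p")
cousin_texts = [p.text for p in cousins]
print(cousin_texts)
```

Because the path never leaves the matching div, footers in later divs are not picked up, unlike with the following:: axis.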

How to insert tags in html code with bash

I have the following html code:
<!DOCTYPE html>
<html>
...
<main>
<div class=postLink>
Write your title here
<div id="description"> Write the description of the post. It will show in the index.</div>
<div id="datePost">19-06-2017</div>
<div id="tags">
</div>
</div>
<div class=postLink>
Write your title here
<div id="description"> Write the description of the post. It will show in the index.</div>
<div id="datePost">19-06-2017</div>
<div id="tags">
</div>
</div>
...
</main>
<footer>
</footer>
</body>
</html>
I want to insert the tags into the div with id "tags", but only once.
I tried this:
for tag in $tags
do
    sed -i "/<div id=\"tags\">/a $tag" "$publicDir/index.html"
done
But that puts all the tags into every postLink.
I want each postLink to have only its own tags, not those of the other postLinks. Like this:
<div class=postLink>
Write your title here
<div id="description"> Write the description of the post. It will show in the index.</div>
<div id="datePost">19-06-2017</div>
<div id="tags">
Hellotag
Othertag
</div>
</div>
<div class=postLink>
Write your title here
<div id="description"> Write the description of the post. It will show in the index.</div>
<div id="datePost">19-06-2017</div>
<div id="tags">
Byetag
Xtag
</div>
</div>
Sorry for my bad English.
I solved the problem from another angle.
I create a variable called string that holds the HTML of the "postLink" up to the point where the tags go. A for loop then appends the tags, and finally I add the closing div tags.
This way I build a single "postLink" with its own tags, and the tags don't affect any postLink that was already on the HTML page.
string="<div class=postLink>\n\t $title\n <div id=\"description\">$description</div>\n<div id=\"datePost\">$date</div>\n<div id=\"tags\">\n"
for tag in $tags
do
    string+="$tag\n"
done
string+="</div>\n</div>"
sed -i "/<main>/a $string" "$publicDir/index.html"
Thanks to John Goofy and 123 for trying to help me. :)
Using awk and passing the text to be printed inside the tags div as a variable should achieve what you need. You could use a for loop to build this variable from different files within the tags directory where required; just make sure you add newlines and tabs where needed.
awk -v var="\n\t Xtag\n" '{ if ( $0 ~ "tags" && done != 1 ) { print $0 var; done=1 } else { print $0 } }' html
Here html is the file in question. We pass the text to be added as a variable called var, and when we find a line that contains tags (you can shore up the regular expression here where necessary) we print the line followed by the additional text. The variable done is then set to 1 so that the if condition fails on the next tags div and the text isn't printed again.
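The "insert once, then set a flag" logic of the awk command can also be sketched in Python. This is only an illustration under stated assumptions: the function name insert_after_first_match, the sample page lines, and the tag names are hypothetical, and the sketch works on a list of lines rather than editing a file in place.

```python
def insert_after_first_match(lines, marker, new_lines):
    """Return a copy of lines with new_lines inserted after the first line containing marker."""
    result = []
    done = False  # plays the same role as the awk variable "done"
    for line in lines:
        result.append(line)
        if not done and marker in line:
            result.extend(new_lines)
            done = True
    return result


# Hypothetical, shortened page with two postLink blocks.
page = [
    '<div class=postLink>',
    '<div id="tags">',
    '</div>',
    '</div>',
    '<div class=postLink>',
    '<div id="tags">',
    '</div>',
    '</div>',
]

updated = insert_after_first_match(page, '<div id="tags">', ["Hellotag", "Othertag"])
print("\n".join(updated))
```

Only the first tags div receives the new lines; the second postLink is left untouched.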

Check <div> at the same level

I have several <div> elements at the same level, and I want to check the value of one div and retrieve the value of another, using local-name() where possible.
<div class="x-extension-property">
<div class="x-extension-property-id">I own a house</div>
<div class="x-extension-key"></div>
<div class="x-extension-value">This is the value I want </div>
<div class="x-extension-data-type"></div>
</div>
In a single XPath statement, I would like to detect that x-extension-property-id = "I own a house" and, when that matches, retrieve the value of x-extension-value, which is "This is the value I want".
I have not tested it, but something like this should work:
/div/div[@class='x-extension-property-id' and text() = 'I own a house']/../div[@class='x-extension-value']
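Since the expression above is untested, here is a minimal Python sketch of the same "match one sibling, return another" idea using the standard library's xml.etree.ElementTree. That module supports only a subset of XPath (no "and", no text() comparison), so the sketch chains two predicates instead; the [.='...'] form assumes Python 3.7 or later. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

snippet = """
<div class="x-extension-property">
<div class="x-extension-property-id">I own a house</div>
<div class="x-extension-key"></div>
<div class="x-extension-value">This is the value I want </div>
<div class="x-extension-data-type"></div>
</div>
"""

root = ET.fromstring(snippet)

# Match the id div by class and text, step up to the parent,
# then select the sibling div that holds the value.
path = (
    "./div[@class='x-extension-property-id'][.='I own a house']"
    "/../div[@class='x-extension-value']"
)
match = root.find(path)
value = match.text.strip() if match is not None else None
print(value)
```

If the id text does not match, find() returns None and value stays None, so no value is returned for the wrong property.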

nokogiri + mechanize css selector by text

I am new to Nokogiri and so far most familiar with CSS selectors. I am trying to parse information from a table; below is a sample of the table and the code I'm using. I'm stuck on the appropriate if statement, as it seems to return the whole contents of the table.
Table:
<div class="holder">
<div class ="row">
<div class="c1">
<!-- Content I Don't need -->
</div>
<div class="c2">
<span class="data">
<!-- Content I Don't Need -->
</span>
</div>
</div>
...
<div class="row">
<div class="c1">
SPECIFIC TEXT
</div>
<div class="c2">
<span class="data">
What I want
</span>
</div>
</div>
</div>
My script (if SPECIFIC TEXT is found in the table, it returns every "div.c2 span.data" value, so I've either misunderstood do loops or if statements):
data = []
page.agent.get(url)
page.search('div.row').each do |row_data|
  if (row_data.search('div.c1:contains("/SPECIFIC TEXT/")').text.strip)
    temp = row_data.search('div.c2 span.data').text.strip
    data << temp
  end
end
There's no need to stop and insert ruby logic when you can extract what you need in a single CSS selector.
data = page.search('div.row > div.c1:contains("SPECIFIC TEXT") + div.c2 span.data')
This will include only the elements that match the selector (i.e. those that follow the SPECIFIC TEXT).
Here's where your logic may have gone wrong:
This code
if (row_data.search('div.c1:contains("SPECIFIC TEXT")'...
temp = row_data.search('div.c2 span.data')...
first searches the row for the specific text, then, if it matches, returns ALL elements matching the second query, which has the same starting point. Note also that .text.strip returns a String, and every String (even an empty one) is truthy in Ruby, so the if condition as written is always true. The key is the + in the CSS selector above, which will return elements immediately following (i.e. the next sibling element). I'm making an assumption, of course, that the next element is always what you want.
I'd do
require 'nokogiri'
html = <<_
<div class="holder">
<div class ="row">
<div class="c1">
<!-- Content I Don't need -->
</div>
<div class="c2">
<span class="data">
<!-- Content I Don't Need -->
</span>
</div>
</div>
<div class="row">
<div class="c1">
SPECIFIC TEXT
</div>
<div class="c2">
<span class="data">
What I want
</span>
</div>
</div>
</div>
_
doc = Nokogiri::HTML(html)
css_string = 'div.row > div.c1[text()*="SPECIFIC TEXT"] + div.c2 span.data'
doc.at(css_string).text.strip
# => "What I want"
How those selectors would work here -
[name*="value"] - Selects elements that have the specified attribute with a value containing a given substring.
Child Selector (“parent > child”) - Selects all direct child elements specified by "child" of elements specified by "parent".
Next Adjacent Selector (“prev + next”) - Selects all next elements matching "next" that are immediately preceded by a sibling "prev".
Class Selector (“.class”) - Selects all elements with the given class.
Descendant Selector (“ancestor descendant”) - Selects all elements that are descendants of a given ancestor.
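The selectors above do the filtering inside a single query. For comparison, the loop-based approach from the question can be sketched in Python with the standard library's xml.etree.ElementTree (not Nokogiri, and XPath subset only, no CSS). The sample markup and the variable names are assumptions for illustration; the important part is that the condition is an explicit boolean test and the second lookup is scoped to the current row.

```python
import xml.etree.ElementTree as ET

# Assumed, shortened sample with one unwanted row and one wanted row.
table = """
<div class="holder">
<div class="row">
<div class="c1">other text</div>
<div class="c2"><span class="data">not wanted</span></div>
</div>
<div class="row">
<div class="c1"> SPECIFIC TEXT </div>
<div class="c2"><span class="data"> What I want </span></div>
</div>
</div>
"""

holder = ET.fromstring(table)

data = []
for row in holder.findall("./div[@class='row']"):
    # Text of this row's c1 cell, or "" if the cell is missing.
    label = row.findtext("./div[@class='c1']", default="")
    # Explicit substring test, so rows without the text are skipped.
    if "SPECIFIC TEXT" in label:
        span = row.find("./div[@class='c2']/span[@class='data']")
        if span is not None and span.text:
            data.append(span.text.strip())

print(data)
```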
