I am trying to use XPath to traverse the code of a newspaper page (for the sake of practice). Right now I'd like to get the main article, its picture, and the small description that comes with it. But I'm not that skilled in XPath so far, and I can't get to the small description.
Within this code:
<div class="margenesPortlet">
<div class="fondoprincipal">
<div class="margenesPortlet">
<a href='notas/n1092329.htm' ><img id="LinkNotaA1_Foto" src="http://i.oem.com.mx/5cfaf266-bb93-436c-82bc-b60a78d21fb6.jpg" height="250" width="300" border="0" /></a>
<div class="piefoto_esto">Un tubo de 12 pulgadas al lado de la Vialidad Sacramento que provocó el corte del servicio durante toda la mañana y hasta alrededor de las cuatro de la tarde. Foto: El Heraldo de Chihuahua</div>
<div class="cabezaprincesto"><a href='notas/n1092329.htm' class='cabezaprincesto' >Sin agua 8 mil usuarios</a></div>
<div class="resumenesto"><a href='notas/n1092329.htm' class='resumenesto' >La ruptura de una línea en el tanque de rebombeo de agua Sacramento dejó sin servicio a ocho mil usuarios, en once colonias del sur de la ciudad. </a></div>
</div>
</div>
</div>
I want to get the picture (with or without caption) and then the title of the article. These 3 things I can get by using:
//div[@class='fondoprincipal'] <-- gives me the main image and caption
//a[@class='cabezaprincesto']/text() <-- gives me the article's title
But I can't get hold of the small description, which is the div with class="resumenesto". I haven't tried getting anything by that id, because the same id is used over and over through the rest of the HTML, so it returns lots of extra items.
How can I get this particular one? And then, would any of you recommend a good way of parsing it to another webpage? I was thinking maybe PHP writing some HTML using those values, but I'm not really sure...
Edit
What I mean by "this particular one" is how do I get this div class="resumenesto", the one residing within div class="fondoprincipal"...
Edit 2
Thank you, now xPath Traversing is a little bit more clear. But then about my second question, would any of you recommend me a good way of parsing it to another webpage? I was thinking maybe php writing some html using those values but I'm not sure really..
You say "id" of resumenesto, but in your code example the div you're talking about has a class of resumenesto.
Further, when you use an xpath of something like this:
//div[@class='resumenesto']
What you're getting is a list of nodes matching that xpath.
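So if you want to refer to a single item in that list, you need to specify which one. The parentheses matter: without them, [1] picks the first matching div under each parent rather than the first match in the whole document:
(//div[@class='resumenesto'])[1]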
Further, what do you mean by "this particular one"? The only way to tell xpath specificity is to give it context, for instance "the div with class resumenesto that resides within some other div", or "the first of the divs with class resumenesto".
Read W3Schools' overview of XPath syntax for some more info.
Edit:
To get the div residing within "fondoprincipal":
//div[@class='fondoprincipal']//div[@class='resumenesto']
This tells xpath to find any descendant div with class fondoprincipal within the document, and within that div, find any descendant div with class resumenesto.
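To make the scoping concrete, here is a minimal sketch using Python's standard-library ElementTree, which understands a small subset of XPath. The choice of Python is mine, not the asker's; the markup is a shortened copy of the question's HTML, and the second summary (notas/otra.htm) is invented to stand in for the other summaries on the page.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the markup from the question. The last div
# (notas/otra.htm, "Otra nota") is hypothetical: it stands in for the other
# summaries found elsewhere on the page.
PAGE = """
<body>
  <div class="fondoprincipal">
    <div class="margenesPortlet">
      <div class="cabezaprincesto"><a href="notas/n1092329.htm" class="cabezaprincesto">Sin agua 8 mil usuarios</a></div>
      <div class="resumenesto"><a href="notas/n1092329.htm" class="resumenesto">La ruptura de una línea dejó sin servicio a ocho mil usuarios.</a></div>
    </div>
  </div>
  <div class="resumenesto"><a href="notas/otra.htm" class="resumenesto">Otra nota</a></div>
</body>
"""

root = ET.fromstring(PAGE)

# Unscoped: every summary div in the document.
all_summaries = root.findall(".//div[@class='resumenesto']")

# Scoped: only summary divs that are descendants of div.fondoprincipal.
main_summaries = root.findall(
    ".//div[@class='fondoprincipal']//div[@class='resumenesto']"
)

# The text itself sits in the <a> inside that div.
main_text = main_summaries[0].findtext("a")

print(len(all_summaries), len(main_summaries))
print(main_text)
```

The unscoped query returns both summaries, while the scoped one returns only the one inside the main block.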
And to narrow your search you can add the div too:
//div[@class='resumenesto']/a[@class='resumenesto']/text()
To get the text you need:
//div[@class='fondoprincipal']//a[@class='resumenesto']
Note that you want to get the a (instead of the div, as Raul suggested), since it's in that element that you get the text.
Regarding putting it on a page, you can do it in asp.net. Use the XElement to load the values and then the XPathSelectElement to get the values (http://msdn.microsoft.com/en-us/library/bb156083.aspx).
<mat-chip role="option" cdkdrag="" container="body" class="cdk-drag mat-chip mat-focus-indicator mat-primary mat-standard-chip mat-chip-with-trailing-icon ng-star-inserted" tabindex="-1" aria-disabled="false" style="" xpath="1"><div class="mat-chip-ripple"></div> Table data <mat-icon role="img" aria-hidden="false" aria-label="icon-drag-doubledot" class="mat-icon notranslate icon-drag material-icons mat-icon-no-color" data-mat-icon-type="font"></mat-icon><mat-icon role="img" matchipremove="" aria-hidden="false" aria-label="icon-close" class="mat-icon notranslate mat-chip-remove mat-chip-trailing-icon icon-close material-icons mat-icon-no-color" data-mat-icon-type="font"></mat-icon></mat-chip>
From the HTML above, I want to extract the text "Table data". I am unable to get it using the XPaths below:
//mat-chip[@role='option']//div
//mat-chip[@role='option']//div/following-sibling::text()[1]
When executing, it throws the error: 'Failed: invalid selector: The result of the xpath expression "//mat-chip[@role='option']//div/following-sibling::text()[2]" is: [object Text]. It should be an element.'
Can anyone please help?
That's right, the text belongs to this element: //mat-chip[@role='option']
I checked it locally, all good. If it doesn't work on your end, then the problem is somewhere else.
P.S. proof
Selenium expects an element, not a text() node. So use this:
//mat-chip[@role='option']/div
Although I doubt that [@role='option'] has any benefits.
If you are certain that there will be only one div inside that mat-chip, you could speed up the XPath by adding [1], like this:
//mat-chip[@role='option']/div[1]
This tells the XPath engine to stop searching after it has found this first div.
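To see where the "Table data" text actually lives, here is a small sketch with Python's standard-library ElementTree (my choice for illustration; the thread itself is about Selenium). The markup is a trimmed copy of the question's mat-chip, with most attributes dropped. It suggests that the div is empty and that the text is a text node sitting directly inside mat-chip, right after the div.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <mat-chip> from the question (most attributes removed).
CHIP = """
<mat-chip role="option">
  <div class="mat-chip-ripple"></div> Table data
  <mat-icon aria-label="icon-drag-doubledot"></mat-icon>
  <mat-icon aria-label="icon-close"></mat-icon>
</mat-chip>
"""

chip = ET.fromstring(CHIP)
ripple = chip.find("div")

# Text *inside* the div: there is none.
div_text = "".join(ripple.itertext())

# Text of the whole chip: all text nodes below <mat-chip>, concatenated.
chip_text = "".join(chip.itertext()).strip()

print(repr(div_text))
print(repr(chip_text))
```

In Selenium terms this corresponds to locating the mat-chip element itself and reading its text, rather than asking the locator to return a text node.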
Is there a way to get nodes containing a specific string that is split across 2 tags? I tried this, but it doesn't work: I can't manage to ignore the foreign tag.
$crawler->filterXPath('//p/text()[contains(., "caractère a priori")]');
<p>leur caractère <foreign xml:lang="lat">a priori</foreign>, soit..</p>
Thanks a lot !
The below XPath should work for you, it will return only <p> nodes which contain the text specified in the contains statement. I've expanded the example a bit, for me to test, and included a fiddle here.
XPath:
div/p[contains(., 'caractère a priori')]
Input
<div>
<p>leur caractère <foreign xml:lang="lat">a priori</foreign>, soit..</p>
<p>leur poisson <foreign xml:lang="lat">a priori</foreign>, soit..</p>
</div>
Output
<p>leur caractère <foreign xml:lang="lat">a priori</foreign>, soit..</p>
Hopefully that gives you enough to go on!
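The reason contains(., ...) on the p works is that "." is the string value of the whole element, i.e. all of its descendant text joined together, whereas text() tests each direct text node separately. A rough Python sketch of that difference, using the standard-library ElementTree (which has no contains() of its own, so the check is done in plain Python; the helper name string_value is mine):

```python
import xml.etree.ElementTree as ET

DOC = """
<div>
  <p>leur caractère <foreign xml:lang="lat">a priori</foreign>, soit..</p>
  <p>leur poisson <foreign xml:lang="lat">a priori</foreign>, soit..</p>
</div>
"""

root = ET.fromstring(DOC)
needle = "caractère a priori"


def string_value(elem):
    """Concatenate all descendant text, like the XPath string value of a node."""
    return "".join(elem.itertext())


def direct_text_nodes(elem):
    """Text nodes that are direct children of elem, like XPath's text()."""
    return [elem.text or ""] + [child.tail or "" for child in elem]


# Equivalent in spirit to: div/p[contains(., 'caractère a priori')]
by_string_value = [p for p in root.findall("p") if needle in string_value(p)]

# Equivalent in spirit to: div/p[text()[contains(., 'caractère a priori')]]
by_text_nodes = [
    p for p in root.findall("p")
    if any(needle in node for node in direct_text_nodes(p))
]

print(len(by_string_value), len(by_text_nodes))
```

The first list holds the one matching paragraph; the second is empty, because no single text node contains the whole phrase.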
The XPath expressions I have defined below work properly when tested individually. However, when I call them on the items of a stored selection, structured as shown underneath, trouble comes up and the results are disorganized. Ignore my linguistic mistakes, if any.
Storage=xpath('//div[@class="info"]')
for item in Storage:
    Name=item.xpath('//span[@itemprop="name"]/text()')
    Address=item.xpath('//span[@itemprop="streetAddress" and @class="street-address"]/text()')
    Phone=item.xpath('//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
My question is: how do I build the XPath expressions for "Name", "Address", and "Phone" so that they are evaluated against each item taken from "Storage", as I tried to do above? Thanks.
Here is the HTML element for that expression, if needed.
<div class="info"><h2 class="n">36. <span itemprop="name">The Coffee Table Eagle Rock</span></h2><div data-tripadvisor="{"rating":"4.0","count":"11"}" data-israteable="true" class="info-section info-primary"><div class="result-rating three half "><span class="count">(5)</span></div><div class="ta-rating extra-rating ta-4-0"></div><span class="ta-count">(11)</span><p itemscope="" itemtype="http://schema.org/PostalAddress" itemprop="address" class="adr"><span itemprop="streetAddress" class="street-address">1958 Colorado Blvd</span><span itemprop="addressLocality" class="locality">Los Angeles, </span><span itemprop="addressRegion">CA</span> <span itemprop="postalCode">90041</span></p><div itemprop="telephone" class="phones phone primary">(323) 255-2200</div></div><div class="info-section info-secondary"><div class="categories">Coffee & Espresso RestaurantsBars</div><div class="links">WebsiteMenu</div><a data-analytics="{"adclick":true,"events":"event7,event6","category":"8004238","impression_id":"fbd98612-6b8a-43c2-b31e-fd579de20126","listing_id":"11287432","item_id":-1,"listing_type":"free","ypid":"11287432","content_provider":"MDM","srid":"L-webyp-1c6db222-cc63-48d8-90d1-2d5dc8754cca-11287432","item_type":"PUP","lhc":"8004238","ldir":"LA","rate":3.5,"hasTripAdvisor":true,"mip_claimed_staus":"mip_unclaimed","mip_ypid":"11287432","click_id":523,"listing_features":"orderonline"}" href="https://yellowpages.pingup.com/Bkm3xG?ypid=11287432&uvid=t3pfPllxtLYkH2dlkSbiCC1marvZprsz1YhqhycO80NYrDv0OMX3uTJ3ryFG464RywmpWCrB&source=web-prod" rel="nofollow" target="_blank" class="action order-online" data-impressed="1">Order Online</a></div><div class="preferred-listing-features"></div><div class="snippet"><figure class="avatar-1 color-1"></figure><p class="body with-avatar">I went here recently with my 2 year old for breakfast. I got the Silverlake omelet and the breakfast sandwich for my son. The food was great (especi…</p></div></div>
If you want to get child/descendant elements of an already selected item, you need to use .// to start from the current ("item") element, rather than //, which starts from the root element. Try the code below:
Storage=xpath('//div[@class="info"]')
for item in Storage:
    Name=item.xpath('.//span[@itemprop="name"]/text()')
    Address=item.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()')
    Phone=item.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
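For a self-contained illustration of the per-item pattern, here is a sketch with Python's standard-library ElementTree instead of the asker's library (an assumption on my part). ElementTree only accepts relative paths on an element, so it shows just the working .// form. The first listing is trimmed from the question; the second one ("Example Cafe") is made up so the loop has two items.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the listing from the question, plus a second, made-up
# listing ("Example Cafe") so that the loop has more than one item.
LISTINGS = """
<div id="results">
  <div class="info">
    <h2 class="n">36. <span itemprop="name">The Coffee Table Eagle Rock</span></h2>
    <p class="adr"><span itemprop="streetAddress" class="street-address">1958 Colorado Blvd</span></p>
    <div itemprop="telephone" class="phones phone primary">(323) 255-2200</div>
  </div>
  <div class="info">
    <h2 class="n">37. <span itemprop="name">Example Cafe</span></h2>
    <p class="adr"><span itemprop="streetAddress" class="street-address">1 Hypothetical St</span></p>
    <div itemprop="telephone" class="phones phone primary">(000) 000-0000</div>
  </div>
</div>
"""

root = ET.fromstring(LISTINGS)

rows = []
for item in root.findall(".//div[@class='info']"):
    # Each path starts with "." so it is resolved relative to this item only.
    name = item.findtext(".//span[@itemprop='name']")
    address = item.findtext(".//span[@itemprop='streetAddress']")
    phone = item.findtext(".//div[@itemprop='telephone']")
    rows.append((name, address, phone))

for row in rows:
    print(row)
```

Each tuple contains the values of one listing only, which is the behaviour the question was after.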
I am newbie here. Please advise. How to select checkbox in my case?
<ul class="phrases-list" style="">
<li>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
The following doesn't work for me:
When /I check box "([^\"]+)"$/ do |label|
page.check(label)
end
step: And I check box "Dog - Wikipedia, the free encyclopedia"
If you can change the html, wrap the input and span in a label element
<ul class="phrases-list" style="">
<li>
<label>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
</label>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
which has the added benefit of clicks on the "Dog - Wikipedia ..." text triggering the checkbox too. With that change your step should work as written. If you can't modify the html then things get more difficult.
Something like
find('span', text: label).find(:xpath, './preceding-sibling::input').set(true)
should work, although I'm curious how you're using these checkboxes from JS with nothing tying them to any specific value
Let's assume that you are prevented from changing the HTML. In this case, it would probably be easiest to query for the element via XPath. For example:
# Here's the XPath query
q = "//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input"
# Use the query to find the checkbox. Then, check the checkbox.
page.find(:xpath, q).set(true)
Okay - it's not as bad as it looks! Let's analyze this XPath so we can understand what it's doing:
//span
This first part says "Search the entire HTML document and discover all 'span' elements." Of course, there are probably a LOT of "span" elements in the HTML document, so we'll need to restrict this:
//span[contains(text(), 'Dog - Wikipedia')]
Now we're only searching for the "span" elements that contain the text "Dog - Wikipedia". Presumably, this text will uniquely identify the desired "span" element on the page (if not, then just search for more of the text).
At this point, we have the "span" element that is adjacent to the desired "input" element. So, we can query for the "input" element using the "preceding-sibling::" XPath Axis:
//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input
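If it helps to see the two steps (find the span by its text, then walk back to a sibling input) outside of Capybara, here is a rough emulation in Python with the standard-library ElementTree. ElementTree supports neither contains() nor the preceding-sibling axis, so both are done by hand. The function name checkbox_before_span and the second "Cat" list item are mine, added for illustration, and the function returns only the nearest preceding input rather than all of them.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the list from the question. The second <li> ("Cat")
# is hypothetical, added so there is more than one checkbox to choose from.
LIST = """
<ul class="phrases-list">
  <li>
    <input type="checkbox" class="select-phrase"/>
    <span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
    (en.wikipedia.org)
  </li>
  <li>
    <input type="checkbox" class="select-phrase"/>
    <span class="prase-title"> Cat - Wikipedia, the free encyclopedia </span>
  </li>
</ul>
"""

root = ET.fromstring(LIST)


def checkbox_before_span(tree, label):
    """Return the nearest <input> that precedes a <span> containing label."""
    for parent in tree.iter():
        children = list(parent)
        for index, child in enumerate(children):
            # Step 1: //span[contains(text(), label)]
            if child.tag == "span" and label in (child.text or ""):
                # Step 2: preceding-sibling::input (nearest one only)
                earlier_inputs = [c for c in children[:index] if c.tag == "input"]
                if earlier_inputs:
                    return earlier_inputs[-1]
    return None


checkbox = checkbox_before_span(root, "Dog - Wikipedia")
print(checkbox.tag, checkbox.attrib)
```

The returned element is the checkbox in the same li as the "Dog" span, not the one next to the "Cat" span.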
Let's say I have a simple page that has fewer IDs than I'd like for testing:
<div class="__panel_body">
<div class="__panel_header">Real Estate Rating</div>
<div class="__panel_body">
<div class="__panel_header">Property Rating Info</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">General Risks</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">Amenities</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
</div>
I'm using Jeff Morgan's Page Object gem and I want to make accessors for the edit links in any given section.
The challenge is that the panel headers differentiate what body I want to choose. Then I need to access the parent and get all links with class "icon.edit". Assume I can't change the HTML to solve this.
Here's a start
module RealEstateRatingPageFields
  div(:general_risks_section, ....)
  def general_risks_edit_links
    general_risks_section_element.links(class: "icon.edit")
  end
end
How do I get the general_risks_section accessor to work, though?
I want that to represent the parent div to the panel header with text 'General Risks'...
There are a number of ways to get the general risk section.
Using a Block
The accessors can take a block where you can more programmatically describe how to locate the element. This allows you to locate a distinguishing element and then traverse the DOM to the element you actually want. In this case, you can locate the header with the matching text and navigate to its parent.
div(:general_risks_section) { div_element(class: '__panel_header', text: 'General Risks').parent }
Using XPath
While harder to read and write, you could also use an XPath locator. The concept and thought process is the same as using the block. The only benefit is that it reduces the number of element calls, which slightly improves performance.
div(:general_risks_section, xpath: './/div[@class="__panel_body"][./div[@class="__panel_header" and text() = "General Risks"]]')
The XPath is saying:
.//div                           # Find a div element that
[@class="__panel_body"]          # Has the class "__panel_body" and
[./div[                          # Contains a div element that
  @class="__panel_header" and    # Has the class "__panel_header" and
  text() = "General Risks"       # Has the text "General Risks"
]]
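The same "select the parent by what its child contains" idea can be tried outside the Page Object gem. Below is a sketch in Python with the standard-library ElementTree (my choice, purely for illustration). ElementTree's XPath subset has no "and" and no text(), so the predicate is approximated with [div='General Risks'], which matches a panel that has a child div whose full text is exactly "General Risks".

```python
import xml.etree.ElementTree as ET

# The panel markup from the question, unchanged apart from indentation.
PANELS = """
<div class="__panel_body">
  <div class="__panel_header">Real Estate Rating</div>
  <div class="__panel_body">
    <div class="__panel_header">Property Rating Info</div>
    <a class="icon.edit"></a>
    <a class="icon.edit"></a>
  </div>
  <div class="__panel_body">
    <div class="__panel_header">General Risks</div>
    <a class="icon.edit"></a>
    <a class="icon.edit"></a>
  </div>
  <div class="__panel_body">
    <div class="__panel_header">Amenities</div>
    <a class="icon.edit"></a>
    <a class="icon.edit"></a>
  </div>
</div>
"""

root = ET.fromstring(PANELS)

# A panel body that has a child div whose text is exactly "General Risks".
sections = root.findall(".//div[@class='__panel_body'][div='General Risks']")
general_risks = sections[0]

# The edit links that are direct children of that panel.
edit_links = general_risks.findall("a[@class='icon.edit']")

print(len(sections), general_risks.findtext("div"), len(edit_links))
```

Only the "General Risks" panel matches, and its two edit links are returned without touching the links of the neighbouring panels.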
Using the Body Text
Given the HTML, you could also just locate the section directly based on its text.
div(:general_risks_section, class: '__panel_body', text: 'General Risks')
Note that this assumes that the HTML given was not simplified. If there are actually other text nodes, this probably would not be the best option.