I'm trying to get the value of the data-content_published_date attribute, which is the published date of an article, with XPath, but for some reason I can't!
Here is how I tried:
//div[@id='page-segment-values']/@data-content_published_date
And here is the div tag:
<div id="page-segment-values" class="">
<div class="keyvals" data-content_author_name="Eli Meixler" data-content_cms_id="5461956" data-content_headline="British Academic" data-content_modified_date="" data-content_published_date="2018-11-22T08:15:43.000Z" data-content_shown_on_platform="own" data-content_type="article" data-path="/5461956/british-academic-sentenced-spying-united-arab-emirates/" data-referrer="" data-search="" data-content_is_post="post" data-title="A British Academic Has Been Sentenced to Life in Prison on Espionage Charges in the United Arab Emirates" data-affiliate_link_count="0" data-content_cms_category="World" data-content_cms_tags="onetime|overnight|United Arab Emirates" data-content_cms_terms="World,onetime,overnight,United Arab Emirates" data-time_inc_brand="time.com" data-time_inc_application="front end" data-content_syndicated="false" data-content_syndicated_brand="" data-content_syndicated_url="" data-content_nlp_sentiment_label="negative" data-content_nlp_sentiment_score="-0.2" data-content_nlp_sentiment_magnitude="5.9" data-content_nlp_entities="Matthew Hedges" data-content_nlp_payload="{"entities":[{"type":"PERSON","text":"Matthew Hedges","relevance":0.57477295,"disambiguation":{}},{"type":"WORK_OF_ART","text":"British Academic Sentenced to Life on Spying Charges","relevance":0.02937225,"disambiguation":{}},{"type":"LOCATION","text":"British","relevance":0.029201617,"disambiguation":{"mid":"\/m\/07ssc","wikipedia_url":"https:\/\/en.wikipedia.org\/wiki\/United_Kingdom"}},{"type":"PERSON","text":"academic","relevance":0.020707963,"disambiguation":{}},{"type":"OTHER","text":"life","relevance":0.018888554,"disambiguation":{}},{"type":"OTHER","text":"spying charges","relevance":0.013851213,"disambiguation":{}},{"type":"LOCATION","text":"prison","relevance":0.0133453375,"disambiguation":{}},{"type":"LOCATION","text":"United Arab 
Emirates","relevance":0.011457251,"disambiguation":{"wikipedia_url":"https:\/\/en.wikipedia.org\/wiki\/United_Arab_Emirates","mid":"\/m\/0j1z8"}},{"type":"OTHER","text":"threats","relevance":0.009783207,"disambiguation":{}},{"type":"EVENT","text":"blowback","relevance":0.009783207,"disambiguation":{}}],"categories":[{"label":"\/Law & Government\/Public Safety\/Crime & Justice","score":0.64}],"docSentiment":{"magnitude":5.9,"score":-0.2,"label":"negative"},"language":"en"}" data-content_nlp_categories="/Law & Government/Public Safety/Crime & Justice"></div>
</div>
Does anyone have any idea why I can't access its value, and how to get it?
Use the inner <div> element and the attribute you are looking for.
//div[@class='keyvals']/@data-content_published_date
You've missed the nested div element, so you're trying to get the value of data-content_published_date from the outer div.
The correct XPath query should be
//div[@id='page-segment-values']/div/@data-content_published_date
Or, even more precisely,
//div[@id='page-segment-values']/div[@class='keyvals']/@data-content_published_date
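As a rough check of the corrected path, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes a trimmed, well-formed copy of the markup (most attributes are omitted, and the <body> wrapper is not in the original). ElementTree implements only a subset of XPath, so the final /@data-content_published_date step is replaced here by a call to .get().

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question; most attributes are omitted,
# and the <body> wrapper is an assumption added to give the snippet one root.
html = """
<body>
  <div id="page-segment-values" class="">
    <div class="keyvals"
         data-content_author_name="Eli Meixler"
         data-content_published_date="2018-11-22T08:15:43.000Z"></div>
  </div>
</body>
"""

root = ET.fromstring(html)

# The outer div does not carry the attribute, so reading it there yields None.
outer = root.find(".//div[@id='page-segment-values']")
print(outer.get("data-content_published_date"))

# Step down into the nested div, then read the attribute from that element.
inner = root.find(".//div[@id='page-segment-values']/div[@class='keyvals']")
published = inner.get("data-content_published_date")
print(published)
```

With a full XPath engine the attribute step from the answers above can be used directly; the .get() call is only a workaround for ElementTree's limited syntax.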
<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>
Hello, I want to extract the "posted this 4 minutes ago" or just "4 minutes" using xpath. Can anybody help me? Thank you
The div whose class equals postbodytop contains three child nodes: an a element, a text node, and a span. Your path should start at the div and then select the child text node, for which the appropriate test is text().
div/text()
Of course, this is just a fragment of a bigger page, and your XPath may need something at the start, e.g. /html/body/. If there are other div elements at the same level as the <div class="postbodytop">, be more specific about the div, e.g. div[@class="postbodytop"] instead of just div in that XPath expression.
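For comparison, here is a minimal sketch of the same extraction in Python's standard-library xml.etree.ElementTree. This is not the text() node test itself: ElementTree has no text nodes and instead stores text that follows a child element on that child as .tail. The snippet assumes the quotation marks are literally part of the text, as shown in the question, and the regular expression for pulling out "4 minutes" is only an illustration.

```python
import re
import xml.etree.ElementTree as ET

# Markup from the question, with the quote on the class attribute made consistent.
html = """
<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>
"""

div = ET.fromstring(html)

# The text between </a> and <span> is the tail of the <a> element.
link = div.find("a")
# Trim whitespace first, then the surrounding quotes and spaces
# (assumed to be literal characters in the page).
posted = link.tail.strip().strip('" ')
print(posted)

# Optionally keep just the "<number> <unit>" part before "ago".
match = re.search(r"(\d+\s+\w+) ago", posted)
elapsed = match.group(1) if match else None
print(elapsed)
```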
I'm looking to get the output:
50ml milk
From the following code:
<ul class="ingredients-list__group">
<li>50ml <a href="/glossary/milk" class="tooltip-processed">milk
<div class="tooltip">
<h2
class="node-title">Milk</h2> <span class="fonetic">mill-k</span>
<p>One of the most widely used ingredients, milk is often referred to as a complete food. While cow…</p>
</div>
</a>
</li>
</ul>
Currently I'm using the XPATH:
//ul[@class="ingredients-list__group"]/li
But getting:
50ml milk Milk mill-kOne of the most widely used ingredients, milk is often referred to as a complete food. While cow…
How do I exclude the stuff within the div/tooltip?
With XPath 2.0:
//ul[@class="ingredients-list__group"]/li/concat(./text()[1], ./a/text()[1])
With XPath 1.0:
concat(//ul[@class="ingredients-list__group"]/li/text()[1], //ul[@class="ingredients-list__group"]/li/a/text()[1])
You can select the relevant text nodes using
//ul[@class="ingredients-list__group"]//
text()[not(ancestor::div[@class='tooltip'])]
If you're in XPath 2.0 you can then put this in a call of string-join() to join these into a single string. If you're stuck with 1.0, you'll have to return multiple text nodes to the calling application and concatenate them together in the host language code.
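The "concatenate in the host language" route can be sketched roughly as follows in Python. This uses the standard library's xml.etree.ElementTree, which has neither text() nor the ancestor axis, so a small hypothetical helper (text_outside_tooltip, not part of any library) walks the tree and skips the tooltip div by hand. Whitespace is collapsed at the end, which is an assumption about the desired output.

```python
import xml.etree.ElementTree as ET

html = """
<ul class="ingredients-list__group">
  <li>50ml <a href="/glossary/milk" class="tooltip-processed">milk
    <div class="tooltip">
      <h2 class="node-title">Milk</h2> <span class="fonetic">mill-k</span>
      <p>One of the most widely used ingredients, milk is often referred to as a complete food. While cow…</p>
    </div>
  </a>
  </li>
</ul>
"""

def text_outside_tooltip(elem):
    """Collect text pieces under elem, skipping anything inside div.tooltip."""
    parts = [elem.text or ""]
    for child in elem:
        if not (child.tag == "div" and child.get("class") == "tooltip"):
            parts.extend(text_outside_tooltip(child))
        # The tail follows the child, so it belongs to elem, not to the child.
        parts.append(child.tail or "")
    return parts

li = ET.fromstring(html).find("li")
# Join the pieces in the host language and collapse runs of whitespace.
ingredient = " ".join("".join(text_outside_tooltip(li)).split())
print(ingredient)
```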
The XPath expressions below work properly when tested individually. However, when I call them on the items of a Storage object, structured as shown below, they produce disorganized results.
Storage=xpath('//div[@class="info"]')
for item in Storage:
    Name=item.xpath('//span[@itemprop="name"]/text()')
    Address=item.xpath('//span[@itemprop="streetAddress" and @class="street-address"]/text()')
    Phone=item.xpath('//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
My question is: how do I build the XPath expressions so that "Name", "Address", and "Phone" are taken from each item in "Storage", as I tried to do above? Thanks.
Here is the html element for that expression, if needed.
<div class="info"><h2 class="n">36. <span itemprop="name">The Coffee Table Eagle Rock</span></h2><div data-tripadvisor="{"rating":"4.0","count":"11"}" data-israteable="true" class="info-section info-primary"><div class="result-rating three half "><span class="count">(5)</span></div><div class="ta-rating extra-rating ta-4-0"></div><span class="ta-count">(11)</span><p itemscope="" itemtype="http://schema.org/PostalAddress" itemprop="address" class="adr"><span itemprop="streetAddress" class="street-address">1958 Colorado Blvd</span><span itemprop="addressLocality" class="locality">Los Angeles, </span><span itemprop="addressRegion">CA</span> <span itemprop="postalCode">90041</span></p><div itemprop="telephone" class="phones phone primary">(323) 255-2200</div></div><div class="info-section info-secondary"><div class="categories">Coffee & Espresso RestaurantsBars</div><div class="links">WebsiteMenu</div><a data-analytics="{"adclick":true,"events":"event7,event6","category":"8004238","impression_id":"fbd98612-6b8a-43c2-b31e-fd579de20126","listing_id":"11287432","item_id":-1,"listing_type":"free","ypid":"11287432","content_provider":"MDM","srid":"L-webyp-1c6db222-cc63-48d8-90d1-2d5dc8754cca-11287432","item_type":"PUP","lhc":"8004238","ldir":"LA","rate":3.5,"hasTripAdvisor":true,"mip_claimed_staus":"mip_unclaimed","mip_ypid":"11287432","click_id":523,"listing_features":"orderonline"}" href="https://yellowpages.pingup.com/Bkm3xG?ypid=11287432&uvid=t3pfPllxtLYkH2dlkSbiCC1marvZprsz1YhqhycO80NYrDv0OMX3uTJ3ryFG464RywmpWCrB&source=web-prod" rel="nofollow" target="_blank" class="action order-online" data-impressed="1">Order Online</a></div><div class="preferred-listing-features"></div><div class="snippet"><figure class="avatar-1 color-1"></figure><p class="body with-avatar">I went here recently with my 2 year old for breakfast. I got the Silverlake omelet and the breakfast sandwich for my son. The food was great (especi…</p></div></div>
If you want to get child/descendant elements of an already selected item, you need to use .// to start from the current ("item") element, not //, which starts from the document root. Try the following:
Storage=xpath('//div[@class="info"]')
for item in Storage:
    Name=item.xpath('.//span[@itemprop="name"]/text()')
    Address=item.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()')
    Phone=item.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
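To see the per-item scoping in isolation, here is a small self-contained sketch. It uses Python's standard-library xml.etree.ElementTree rather than whatever library the question uses, so the API differs (findtext instead of xpath(...)/text()), and it tests a single attribute per lookup because ElementTree predicates do not support "and". The second listing is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two trimmed listings; the second one is invented for this example.
html = """
<body>
  <div class="info">
    <h2 class="n">36. <span itemprop="name">The Coffee Table Eagle Rock</span></h2>
    <span itemprop="streetAddress" class="street-address">1958 Colorado Blvd</span>
    <div itemprop="telephone" class="phones phone primary">(323) 255-2200</div>
  </div>
  <div class="info">
    <h2 class="n">37. <span itemprop="name">Example Cafe</span></h2>
    <span itemprop="streetAddress" class="street-address">1 Example St</span>
    <div itemprop="telephone" class="phones phone primary">(000) 000-0000</div>
  </div>
</body>
"""

root = ET.fromstring(html)
rows = []
for item in root.findall(".//div[@class='info']"):
    # The leading "." keeps each lookup inside the current item.
    name = item.findtext(".//span[@itemprop='name']")
    address = item.findtext(".//span[@itemprop='streetAddress']")
    phone = item.findtext(".//div[@itemprop='telephone']")
    rows.append((name, address, phone))

for row in rows:
    print(row)
```

Each tuple then holds the values of one listing only, instead of every name, address, and phone on the page.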
I am a newbie here. Please advise: how do I select the checkbox in my case?
<ul class="phrases-list" style="">
<li>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
The following doesn't work for me:
When /I check box "([^\"]+)"$/ do |label|
page.check(label)
end
step: And I check box "Dog - Wikipedia, the free encyclopedia"
If you can change the html, wrap the input and span in a label element
<ul class="phrases-list" style="">
<li>
<label>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
</label>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
which has the added benefit of clicks on the "Dog - Wikipedia ..." text triggering the checkbox too. With that change your step should work as written. If you can't modify the html then things get more difficult.
Something like
find('span', text: label).find(:xpath, './preceding-sibling::input').set(true)
should work, although I'm curious how you're using these checkboxes from JS with nothing tying them to any specific value
Let's assume that you are prevented from changing the HTML. In this case, it would probably be easiest to query for the element via XPath. For example:
# Here's the XPath query
q = "//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input"
# Use the query to find the checkbox. Then, check the checkbox.
page.find(:xpath, q).set(true)
Okay - it's not as bad as it looks! Let's analyze this XPath so we can understand what it's doing:
//span
This first part says "Search the entire HTML document and find all 'span' elements." Of course, there are probably a LOT of "span" elements in the HTML document, so we'll need to restrict this:
//span[contains(text(), 'Dog - Wikipedia')]
Now we're only searching for the "span" elements that contain the text "Dog - Wikipedia". Presumably, this text will uniquely identify the desired "span" element on the page (if not, then just search for more of the text).
At this point, we have the "span" element that is adjacent to the desired "input" element. So, we can query for the "input" element using the "preceding-sibling::" XPath Axis:
//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input
I am trying to use XPath to traverse the code of a newspaper site (for the sake of practice). Right now I'd like to get the main article, its picture, and the small description that comes with it. But I'm not that skilled in XPath so far, and I can't get to the small description.
Within this code:
<div class="margenesPortlet">
<div class="fondoprincipal">
<div class="margenesPortlet">
<a href='notas/n1092329.htm' ><img id="LinkNotaA1_Foto" src="http://i.oem.com.mx/5cfaf266-bb93-436c-82bc-b60a78d21fb6.jpg" height="250" width="300" border="0" /></a>
<div class="piefoto_esto">Un tubo de 12 pulgadas al lado de la Vialidad Sacramento que provocó el corte del servicio durante toda la mañana y hasta alrededor de las cuatro de la tarde. Foto: El Heraldo de Chihuahua</div>
<div class="cabezaprincesto"><a href='notas/n1092329.htm' class='cabezaprincesto' >Sin agua 8 mil usuarios</a></div>
<div class="resumenesto"><a href='notas/n1092329.htm' class='resumenesto' >La ruptura de una línea en el tanque de rebombeo de agua Sacramento dejó sin servicio a ocho mil usuarios, en once colonias del sur de la ciudad. </a></div>
</div>
</div>
</div>
I want to get the picture (with or without caption) and then the title of the article. These I can get by using:
//div[@class='fondoprincipal'] <-- gives me the main image and caption
//a[@class='cabezaprincesto']/text() <-- gives me the article's title
But I can't get hold of the small description, which is the div with class="resumenesto". I haven't tried getting anything by that id, because the same id is used over and over through the rest of the HTML, so it returns lots of extra items.
How can I get this particular one? And would any of you recommend a good way of passing it to another webpage? I was thinking maybe PHP writing some HTML using those values, but I'm not really sure...
Edit
What I mean by "this particular one" is how do I get this div class="resumenesto", the one residing within div class="fondoprincipal"...
Edit 2
Thank you, XPath traversal is a little clearer now. But about my second question: would any of you recommend a good way of passing it to another webpage? I was thinking maybe PHP writing some HTML using those values, but I'm not really sure.
You say "id" of resumenesto, but in your code example the div you're talking about has a class of resumenesto.
Further, when you use an xpath of something like this:
//div[@class='resumenesto']
What you're getting is a list of nodes matching that xpath.
So if you want to refer to a single item in that list, you need to specify which one. Parenthesize the path so that the index applies to the whole result list rather than to each parent's children:
(//div[@class='resumenesto'])[1]
Further, what do you mean by "this particular one"? The only way to tell xpath specificity is to give it context, for instance "the div with class resumenesto that resides within some other div", or "the first of the divs with class resumenesto".
Read W3Schools' overview of XPath syntax for some more info.
Edit:
To get the div residing within "fondoprincipal":
//div[@class='fondoprincipal']//div[@class='resumenesto']
This tells xpath to find any descendant div with class fondoprincipal within the document, and within that div, find any descendant div with class resumenesto.
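The difference between the unscoped and the scoped query can be sketched in Python with the standard library's xml.etree.ElementTree, which accepts this particular path shape when it is written relative to the root (a leading "."). The markup is trimmed, and the second "resumenesto" block outside "fondoprincipal" is made up to show why the scoping matters.

```python
import xml.etree.ElementTree as ET

# Trimmed markup; the last div (outside "fondoprincipal") is invented for the example.
html = """
<body>
  <div class="margenesPortlet">
    <div class="fondoprincipal">
      <div class="margenesPortlet">
        <div class="cabezaprincesto"><a href="notas/n1092329.htm" class="cabezaprincesto">Sin agua 8 mil usuarios</a></div>
        <div class="resumenesto"><a href="notas/n1092329.htm" class="resumenesto">La ruptura de una línea en el tanque de rebombeo de agua Sacramento dejó sin servicio a ocho mil usuarios, en once colonias del sur de la ciudad. </a></div>
      </div>
    </div>
  </div>
  <div class="resumenesto"><a href="notas/other.htm" class="resumenesto">Another summary</a></div>
</body>
"""

root = ET.fromstring(html)

# Unscoped: every div with class "resumenesto" in the document.
all_summaries = root.findall(".//div[@class='resumenesto']")

# Scoped: only those nested somewhere under div.fondoprincipal.
main_summaries = root.findall(
    ".//div[@class='fondoprincipal']//div[@class='resumenesto']"
)

print(len(all_summaries), len(main_summaries))
summary_text = main_summaries[0].findtext("a").strip()
print(summary_text)
```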
And to narrow your search you can add the div too:
//div[@class='resumenesto']/a[@class='resumenesto']/text()
To get to the text you need:
//div[@class='fondoprincipal']//a[@class='resumenesto']
Note that you want to get the a (instead of the div, as Raul suggested), since it's in that element that you get the text.
Regarding putting it on a page, you can do it in asp.net. Use the XElement to load the values and then the XPathSelectElement to get the values (http://msdn.microsoft.com/en-us/library/bb156083.aspx).