Extract, create and append using XPath and Telegram Instant View

<iframe src="https://expmle.com/subdirectory/sample_title" />
How can I create the <a> tag below from the code above and append it, using XPath and Telegram Instant View functions?
<a href="https://expmle.com/subdirectory/sample_title">sample_title</a>
I want to extract the whole src and its last path segment, and use both to create the <a> tag.
Any suggestions would be much appreciated :)

This should work:
<a>: //iframe # Find the iframe and convert it to <a>
@set_attr(href, ./@src) # Set the href attribute from src
$anchor # Store the current <a> in a variable
@set_attr(text, ./@href) # Copy href into a new text attribute; @match will process it and replace its value with the match
@match("\\w+_\\w+"): $@/@text # Find the future link name ("sample_title"); @match replaces the whole @text value with it
@prepend(@text): $anchor # Then put this name inside $anchor

# Append the <a> tag after the iframe
@after(<a>, href, ./@src, content, ./@src): //iframe
# Keep everything after the last slash
@match("[^\/]+$", "1"): $@/@content
# Move the attribute value inside the tag
@append(@content): $@
If the last $@ doesn't work, define a variable for the <a> instead.
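To check the "everything after the last slash" pattern outside of Instant View, here is a minimal Python sketch. Python is not part of the Instant View rules language, and the helper name last_segment is made up for this example; it only illustrates what the regular expression [^/]+$ picks out of the src URL from the question.

```python
import re


def last_segment(url):
    """Return the text after the final slash, or None if the URL ends with a slash."""
    match = re.search(r"[^/]+$", url)
    return match.group(0) if match else None


# The src value from the question's iframe
src = "https://expmle.com/subdirectory/sample_title"
name = last_segment(src)

# Build the link the question asks for: full src as href, last segment as text
anchor = f'<a href="{src}">{name}</a>'
print(anchor)
```

A URL with a trailing slash has no match, so the helper returns None in that case and the caller has to decide what the link text should be.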

Related

How to scrape the href using Nokogiri

I have a variable e which stores a Nokogiri::XML::Element object.
When I execute puts e, I get the following on the screen:
<h3 class="fixed-recipe-card__h3">
<a href="https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/" data-content-provider-id="" data-internal-referrer-link="hub recipe" class="fixed-recipe-card__title-link">
<span class="fixed-recipe-card__title-link">Chocolate Covered Strawberries</span>
</a>
</h3>
I would like to scrape this part: https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/
How can I do this using Nokogiri?
If you want to extract the link, you can use:
e.at_css("a").attributes["href"].value
.at_css returns the first element matching the CSS selector (another Nokogiri::XML::Element). To get a list of all matching elements, use .css instead.
.attributes gives you a hash mapping attribute name to Nokogiri::XML::Attr. Once you look up the desired attribute in this hash (href), you can call .value to get the actual text value.

How to search through and update some text only, leaving out text inside some element name

I have this HTML fragment:
<p>Yes. No. Both. Maybe a plane?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>
I need to replace the word plane with a link (an <a> element whose text is plane), but only when it is outside an <a></a> anchor tag and outside a heading tag (<h1> to <h6>).
This is what I've tried:
require 'nokogiri'
h = '<p>Yes. No. Both. Maybe a plane?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>'
doc = Nokogiri::HTML::DocumentFragment.parse(h)
# Try 1: This outputs all content, but I need to avoid <a>/<h#>
doc.content
# Try 2: The below line removes headings permanently - I need them to remain
# doc.search(".//h2").remove
# Try 3: This just comes out empty - why?
# doc.xpath('text()')
# doc.xpath('//text()')
# then,
# code to replace `plane` is here ...
# this part is not needed
# then,
doc.to_html
I tried various other variations of xpath to no avail. What am I doing wrong?
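After some playing around, it appears you need the XPath selector p/text(). A bare text() only matches text nodes that are direct children of the fragment, and here all of the text sits inside <p> and <h2> elements. Things then get more complicated because you're trying to replace normal text with a link element.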
When I just tried using gsub, Nokogiri was escaping the new link, so I needed to split the text element into multiple sibling elements where I could replace some of the siblings with link elements instead of text nodes.
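doc.xpath('p/text()').grep(/plane/) do |node|
  # Splitting on a capture group keeps the 'plane' pieces in the result
  node_content, *remaining_texts = node.content.split(/(plane)/)
  node.content = node_content
  remaining_texts.each do |text|
    if text == 'plane'
      # Pass the link markup (an <a> element wrapping the word) here instead of plain text
      node = node.add_next_sibling('plane').last
    else
      node = node.add_next_sibling(text).last
    end
  end
end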
puts doc
# <p>Yes. No. Both. Maybe a plane?</p>
# <h2 id="2-is-it-a-plane">2. Is it a plane?</h2>
# <p>Yes. No. Both.</p>
# <h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2>
# <p>Is it a bird? Is it a plane? No, it’s Superman.</p>
A more general purpose XPath selector for all elements, except headings and links, might be:
*[not(name()='a')][not(name()='h1')][not(name()='h2')][not(name()='h3')][not(name()='h4')][not(name()='h5')][not(name()='h6')]/text()
You may need to tweak this some as I'm not an XML or Nokogiri expert, but it appears to me to be working for the provided example, at least, so it should get you going.
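The same split-and-rebuild idea can be sketched in Python with only the standard library. This uses xml.etree.ElementTree rather than Nokogiri, so it illustrates the technique and is not the answer's code. The helper name link_word, the #plane href, and the shortened sample markup are assumptions made for the example, and it only handles the text at the start of each element.

```python
import re
import xml.etree.ElementTree as ET

# Tags whose text must stay untouched: links and headings
SKIP = {"a", "h1", "h2", "h3", "h4", "h5", "h6"}


def link_word(elem, word, href):
    """Wrap each occurrence of `word` in elem's leading text in an <a> child."""
    # Splitting on a capture group keeps the matched word in the result:
    # [text, word, text, word, ..., text], always an odd number of parts
    parts = re.split(f"({re.escape(word)})", elem.text or "")
    elem.text = parts[0]
    for position, i in enumerate(range(1, len(parts), 2)):
        link = ET.Element("a", {"href": href})
        link.text = parts[i]
        link.tail = parts[i + 1]  # text that follows this occurrence
        elem.insert(position, link)


# Hypothetical, shortened version of the fragment from the question
root = ET.fromstring(
    "<div><p>Maybe a plane?</p><h2>2. Is it a plane?</h2>"
    "<p>Is it a bird? Is it a plane? No.</p></div>"
)
for child in root:
    if child.tag not in SKIP:
        link_word(child, "plane", "#plane")

result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the new nodes are created as elements and not pasted in as strings, the serializer does not escape the link markup, which is the problem the gsub attempt ran into.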

XPath - Nested path scraping

I'm trying to scrape an HTML page. I would like to fetch the three alternate texts (the alt attributes, highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:
This is one possible XPath:
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation:
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Notice the use of the @attribute_name syntax to reference an attribute in XPath. Leaving out the @ symbol changes the meaning of the selector: it then refers to a child element with that name instead of an attribute.
/a[@class='cover-wrapper'] : From each <div> element found by the previous part of the XPath, find the child <a> element whose class attribute equals 'cover-wrapper'.
/img/@alt : Then, from each such <a> element, find the child <img> element and return its alt attribute.
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
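A minimal sketch of that path, using Python's standard-library xml.etree.ElementTree so it runs without extra packages. The sample markup is hypothetical, since the original page was only shown as a screenshot. ElementTree cannot return attributes through a trailing /@alt step, so the sketch selects the <img> elements and reads alt from each; with lxml, the full expression including /@alt can be passed to tree.xpath() directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the page from the question
SAMPLE = """
<body>
  <div class="showcase-wrapper" id="slide-1">
    <a class="cover-wrapper" href="/one"><img src="1.jpg" alt="First cover"/></a>
  </div>
  <div class="showcase-wrapper" id="slide-2">
    <a class="cover-wrapper" href="/two"><img src="2.jpg" alt="Second cover"/></a>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE.strip())

# Attribute tests use @name inside the predicate; without @, the name
# would be read as a child element
slide_1 = [
    img.get("alt")
    for img in root.findall(".//div[@id='slide-1']/a[@class='cover-wrapper']/img")
]

# Dropping the id filter returns the alt text of every slide
all_slides = [img.get("alt") for img in root.findall(".//a[@class='cover-wrapper']/img")]

print(slide_1)
print(all_slides)
```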
Try this:
//a[@class="cover-wrapper"]/img/@alt
This first selects the <a> element whose class is cover-wrapper, then its child img element, and finally the alt attribute of that img.
To select the whole <a> element that wraps the image:
//a[@class="cover-wrapper"]
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt

How to get the data using XPath from a div with display:none?

I want to extract data from a div element whose style attribute contains 'display:none'.
<div class='test' style='display:none;'>
<div id='test2'>data</div>
</div>
Here is what I tried:
//div[@class = "test"]//div[contains(@style, \'display:none\')';
Please help.
Try these changes:
1) Put normal quotes around "display:none", as you did for the class attribute, and close the predicate with ]
2) The div with class test and the div with the style attribute are one and the same, so contains() has to be applied to that same div:
'//div[@class = "test" and contains(@style, "display:none")]'
Or with the quotes the other way around. What matters is that the quotes around the expression differ from the quotes inside it:
"//div[@class = 'test' and contains(@style, 'display:none')]"
If this still does not work, please post the error message.
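An element hidden with display:none is still present in the parsed tree, so it can be selected like any other. The sketch below shows this with Python's standard-library xml.etree.ElementTree, which is an assumption of this example and not the asker's tool. ElementTree has no contains() function, so the substring test is done in Python; with a full XPath 1.0 engine such as lxml, the corrected expression above can be passed to xpath() as is. The helper name hidden_divs is made up for the example.

```python
import xml.etree.ElementTree as ET

# The markup from the question
SNIPPET = """
<div class='test' style='display:none;'>
<div id='test2'>data</div>
</div>
"""

root = ET.fromstring(SNIPPET.strip())


def hidden_divs(tree_root):
    """Return <div class="test"> elements whose style contains display:none."""
    return [
        el
        for el in tree_root.iter("div")  # iter() also visits the root itself
        if el.get("class") == "test" and "display:none" in el.get("style", "")
    ]


hidden = hidden_divs(root)

# Both conditions apply to the same outer div; the data sits in its child
inner_text = hidden[0].find("div[@id='test2']").text
print(inner_text)
```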

How do I use Nokogiri to scrape text from an image tag?

I need to get text from a list of image tags that are formatted like this:
<img src="/images/TextImage.ashx?text=Richmond" style="border-width:0px;" class="">
When I enter the XPath into Nokogiri, I get:
[#<Nokogiri::XML::Element:0x80513954 name="img" attributes=[#<Nokogiri::XML::Attr:0x805138dc name="src" value="/images/TextImage.ashx?text=Richmond">, #<Nokogiri::XML::Attr:0x805138b4 name="style" value="border-width:0px;">]>]
Is there any way that I can tell Nokogiri to return "Richmond"? I'm looking for a method that will return the text after a certain string. If there is not a way to get only "Richmond", how do I get it to return the value?
You can extract the src attribute with an XPath expression like
src = doc.at_xpath '//img/@src'
After that, you’ll need to extract the name from the attribute, probably with a regex.
For example (this may need to be more involved, depending on what formats are possible in the src attribute in your HTML page):
/\?text=(.*)/ =~ src
puts $1
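For comparison, here is a minimal Python sketch of the same extraction, shown two ways: the regular expression from the answer, and query-string parsing with urllib.parse. It assumes the src value has already been read from the attribute as a plain string, and the variable names are made up for the example.

```python
import re
from urllib.parse import parse_qs, urlparse

# The src value from the question's <img> tag
src = "/images/TextImage.ashx?text=Richmond"

# Option 1: the same regular expression as in the answer
match = re.search(r"\?text=(.*)", src)
from_regex = match.group(1) if match else None

# Option 2: parse the query string, which keeps working if more
# parameters are added after text=
from_query = parse_qs(urlparse(src).query).get("text", [None])[0]

print(from_regex)
print(from_query)
```

The regex version captures everything after text=, including any later parameters, so the query-string version is the safer choice when the src format can vary.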

Resources