I am trying to use XPath to get the @content attribute of the following HTML:
<meta content="52222" name="DCSext.job_id">
I use this XPath as part of a Scrapy spider:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*')
    for site in sites:
        il = DataItemLoader(response=response, selector=site)
        il.add_xpath('listing_id', 'meta[@name="DCSext.job_id"]@content')
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        il.add_xpath('loc_pj', substring-after('h1[@class="title heading"]/text()',':'))
        il.add_xpath('title', 'head/title/text()')
        il.add_xpath('post_date', 'div[@id="extr"]/div/dl/dd[3]/text()')
        il.add_xpath('web_url', 'head/link[@rel="canon"]@href')
        yield il.load_item()
I get the following error for the underlined line:
exceptions.ValueError: Invalid XPath: meta[@name="DCSext.job_id"]@content
How to fix this? Thanks a lot!
An attribute is selected as its own location step, so a slash is needed before @content. The correct expression is:
meta[@name="DCSext.job_id"]/@content
                           ^
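As a small, self-contained illustration of the attribute predicate, here is a sketch using Python's standard-library ElementTree rather than Scrapy. ElementTree implements only a subset of XPath and does not accept a trailing /@content step, so the attribute is read with .get() after the element is matched. The SAMPLE document is made up for the example, and the meta tag is self-closed so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the real page is HTML, not XML.
SAMPLE = """<html>
  <head>
    <meta content="52222" name="DCSext.job_id" />
    <meta content="ignored" name="other" />
  </head>
</html>"""

root = ET.fromstring(SAMPLE)

# The predicate [@name="..."] filters <meta> elements by attribute value.
meta = root.find('.//meta[@name="DCSext.job_id"]')

# ElementTree cannot return attributes from the path itself, so read it here.
listing_id = meta.get("content") if meta is not None else None
print(listing_id)
```

With a full XPath engine, such as the one behind Scrapy selectors, the same lookup is written in one expression with the /@content step shown above.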
I am trying to get all the images from an .mht file using the Nokogiri gem. Because the .mht file is quoted-printable encoded, all the image tags I get back contain stray characters:
<img alt='3D"AFC-Logo' src="3D%22https://upload.=" width='3D"75"' height='3D"75"'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/wikimedia-butto=" width='3D"88"' height='3D"31"' alt='3D"Wikimedia'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/poweredby_mediawiki_8=" alt='3D"Powered' width='3D"88"' height='3D"31"'>
This is the link to that .mht file: https://drive.google.com/file/d/1DtbgrFyCEcggAk1nqpZSluNhRt-k3t95/view?usp=sharing
And below is the code that I am using to get all the images from the .mht file:
html = File.open("1646037951.mht").read
image_links = get_image_links(html)
def get_image_links(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//img[@src]")
  raise "No <img .../> tags!" if nodes.empty?
  nodes.inject([]) do |uris, node|
    puts node.to_s
    uris << node.attr('src').strip
  end.uniq
end
I have tried decoding it with .unpack('M').first, but that does not work: it returns the same result as above.
Or maybe Rails has something for this?
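The 3D and trailing = fragments in the output above are what quoted-printable text looks like when it is parsed without being decoded: =3D is an encoded "=", and a = at the end of a line is a soft line break. An .mht file is a MIME message, so one approach is to let a MIME parser undo the transfer encoding of the HTML part before extracting the img tags. The sketch below shows that idea in Python with the standard library only, not in Ruby. SAMPLE_MHT, ImgSrcCollector and image_links_from_mht are made-up names for illustration, and the sample message is far smaller than a real .mht file.

```python
import email
from html.parser import HTMLParser

# Hypothetical, tiny MHTML-like message with one quoted-printable HTML part.
SAMPLE_MHT = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    '<img alt=3D"Logo" src=3D"https://example.com/static/images/lo=\n'
    'go.png" width=3D"75" height=3D"75">\n'
    "--BOUNDARY--\n"
)


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src.strip())


def image_links_from_mht(raw):
    message = email.message_from_string(raw)
    collector = ImgSrcCollector()
    for part in message.walk():
        if part.get_content_type() != "text/html":
            continue
        # decode=True undoes the Content-Transfer-Encoding (quoted-printable here).
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        collector.feed(payload.decode(charset, errors="replace"))
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(collector.sources))


links = image_links_from_mht(SAMPLE_MHT)
print(links)
```

Decoding the whole file in one pass is less reliable, because the MIME headers and any base64 parts are not quoted-printable; decoding part by part avoids that.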
I'm writing new code and am having trouble getting the desired output. The code reads an HTML file and finds the <a> tags, but it outputs only the URL. I added code to build the complete link: I'm trying to insert the URL twice within the string.
####### Parse for <a> tags and save ############
with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')
    links = []
    for link in soup2.findAll('a', attrs={'href': re.compile("^https://")}):
        links.append(''"{link}"'<br>')
        time.sleep(.1)
with open("page-2.html", 'w') as html:
    html.write('{links}\n'.format(links=links))
This should give you the desired HTML output file:
from bs4 import BeautifulSoup

with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')

with open("page2.html", 'w') as h:
    # href=True skips anchors that have no href attribute
    for link in soup2.find_all('a', href=True):
        # {0} is used twice: once as the link target, once as the visible text
        h.write('<a href="{0}">{0}</a><br>\n'.format(link.get('href')))
This gives me what I want, I guess, but not exactly. I would rather see the link text written out as "https://whatever.com/text/text/" than as "whatever.com/text/text".
####### Parse for <a> tags and save ############
with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')
    links = []
    for link in soup2.findAll('a', attrs={'href': re.compile("^https://")}):
        links.append('{0}</a><br>'.format(link,link))
with open("page-2.html", 'w') as html:
    html.write('{links}\n'.format(links=links))
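Two things seem to get in the way in the snippet above: link is the whole tag rather than its href, and formatting the list itself writes its Python representation, brackets and quotes included. A minimal sketch of the intended result follows, assuming the goal is one anchor per https URL with the full URL as both target and visible text. It swaps BeautifulSoup for the standard-library html.parser so that it runs without dependencies; HrefCollector, render_links and the sample string are hypothetical names for illustration.

```python
from html import escape
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values of <a> tags that start with https://."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("https://"):
            self.hrefs.append(href)


def render_links(page_html):
    collector = HrefCollector()
    collector.feed(page_html)
    lines = []
    for href in collector.hrefs:
        safe = escape(href, quote=True)
        # The same URL is inserted twice: as the target and as the link text.
        lines.append('<a href="{0}">{0}</a><br>'.format(safe))
    # Join the strings; formatting the list itself would write "[...]".
    return "\n".join(lines)


sample = (
    '<p><a href="https://whatever.com/text/text/">x</a> '
    '<a href="/relative">y</a></p>'
)
output = render_links(sample)
print(output)
```

Writing output to page-2.html then works the same way as in the snippets above, with a single write call.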
Is it possible to find Outlook-specific markup via Capybara/Nokogiri?
Given the following markup (the erb <% %> tags are processed into regular HTML):
...
<div>
<!--[if gte mso 9]>
<v:rect
xmlns:v="urn:schemas-microsoft-com:vml" fill="true" stroke="false"
style="width:<%= card_width %>px;height:<%= card_header_height %>px;"
>
<v:fill type="tile"
src="<%= avatar_background_url.split('?')[0] %>"
color="<%= background_color %>" />
<v:textbox inset="0,0,0,0">
<![endif]-->
<div>
How can I get the list of <v:fill ../> tags? (Or, failing that, how can I get the whole comment, if finding a tag inside a conditional comment is a problem?)
I have tried the following:
doc.xpath('//v:fill')
*** Nokogiri::XML::XPath::SyntaxError Exception: ERROR: Undefined namespace prefix: //v:fill
Do I need to somehow register the VML namespace?
EDIT - following @ThomasWalpole's approach:
doc.xpath('//comment()').each do |comment_node|
  vml_node_match = /<v\:fill.*src=\"(?<url>http\:[^"]*)"[^>]*\/>/.match(comment_node)
  if vml_node_match
    original_image_uri = URI.parse(vml_node_match['url'])
    vml_tag = vml_node_match[0]
    handle_vml_image_replacement(original_image_uri, comment_node, vml_tag)
  end
end
My handle_vml_image_replacement then ends up calling the following replace_comment_image_src:
def self.replace_comment_image_src(node:, comment:, old_url:, new_url:)
  new_url = new_url.split('?').first # VML does not support URLs with query params
  puts "Replacing comment src URL in #{comment} by #{new_url}"
  node.content = node.content.gsub(old_url, new_url)
end
But then it seems the comment is no longer actually a comment, and I can sometimes see the HTML as if it had been escaped... Am I using the wrong method to change the comment text with Nokogiri?
Here's the final code that I used for my email interceptor, thanks to @Thomas Walpole and @sschmeck for their help along the way.
My goal was to replace images (linking to localhost) in VML markup with globally available images, for testing with services like MOA or Litmus.
doc.xpath('//comment()').each do |comment_node|
  # Note: cannot capture the beginning of the tag, since it might span several lines
  src_attr_match = /.*src=\"(?<url>http\:[^"]*)"[^>]*\/>/.match(comment_node)
  next unless src_attr_match
  original_image_uri = URI.parse(src_attr_match['url'])
  handle_comment_image_replacement(original_image_uri, comment_node)
end
This later calls the following (after picking a URL replacement strategy depending on the source image type):
def self.replace_comment_image_src(node:, old_url:, new_url:)
  new_url = new_url.split('?').first
  node.native_content = node.content.gsub(old_url, new_url)
end
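The approach above treats each comment as plain text, finds the src URL with a regular expression, and writes the modified text back. A rough Python equivalent on a raw HTML string is sketched below, as an illustration only. rewrite_comment_srcs, the replace_url callback and the cdn.example.com host are made-up names, and the patterns assume double-quoted src attributes and comments that do not contain "-->" in their body.

```python
import re

# Matches an HTML comment, including Outlook conditional comments.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Matches a double-quoted src attribute holding an http: URL.
SRC_RE = re.compile(r'src="(http:[^"]*)"')


def rewrite_comment_srcs(html_text, replace_url):
    """Rewrite src URLs inside comments only; markup outside is untouched."""

    def rewrite_src(match):
        new_url = replace_url(match.group(1))
        # Query parameters are dropped, as in the Ruby version above.
        return 'src="{}"'.format(new_url.split("?")[0])

    def rewrite_comment(match):
        return "<!--" + SRC_RE.sub(rewrite_src, match.group(1)) + "-->"

    return COMMENT_RE.sub(rewrite_comment, html_text)


sample = (
    '<div><!--[if gte mso 9]><v:fill type="tile"\n'
    ' src="http://localhost:3000/bg.png?v=1" color="#fff" />'
    '<![endif]--></div><img src="http://localhost:3000/a.png">'
)
result = rewrite_comment_srcs(
    sample,
    lambda url: url.replace("http://localhost:3000", "https://cdn.example.com"),
)
print(result)
```

Because the comment body is rebuilt as a string and placed back between the comment delimiters, nothing is escaped along the way, which is the same effect native_content= has in the Nokogiri version.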
I'm using the RhoMobile platform.
I'm trying to pass a value from my .rb file to my .erb file.
I have a properties file, and in my app.rb file I read values for keys in this properties file.
The value is saved in application.rb, and I want to use it in my app.erb.
Here is some code:
myFunc(<%= Rho::RhoConfig.getValue %>)
I am not going to question whether you're doing things right, but this should work:
myFunc("<%= Rho::RhoConfig.getValue %>")
Try this:
<script type="text/javascript" charset="utf-8">
  var rho_config_value = <%= Rho::RhoConfig.getValue || 'null' %>;
  myFunc(rho_config_value)
</script>
myFunc('<%= Rho.get_app.getValue('key')%>')
I'm trying to get a gh-pages site up and running. First time using Jekyll.
I have a super basic layout (default.html) in /_layouts:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<div class="wrapper">
<section id="main">
{{ content }}
</section>
</div>
</body>
</html>
And a single content page (index.html)
---
layout: default
---
Hello World
My _config.yml file is simply
pygments: true
When running jekyll --no-auto --server I get the following error. No files are generated.
.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/psych.rb:203:in `parse':
(<unknown>): did not find expected node content while parsing a flow
node at line 3 column 1 (Psych::SyntaxError)
Anyone know what's wrong here?
Since line 3 is <head>, it is possible that some basic metadata is missing, such as <title>.
Every template I have seen has a title (zinga, Symplicity, ...; either fixed or generated), and the most basic template has one too (see "Hello World, I'm Jekyll"):
<html>
<head>
<title>Hello world!</title>
</head>
<body>
<h1>Hello world!</h1>
<p>This is my first Jekyll website.</p>
</body>
</html>
You should check whether what it is parsing is YAML at all.
The way I'm checking this is by putting some debug output in the gem directly and re-running.
Edit psych.rb, which for me is at /home/user/.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/psych.rb. Look for def self.load and change it from
def self.load yaml, filename = nil
  result = parse(yaml, filename)
  result ? result.to_ruby : result
end
to
def self.load yaml, filename = nil
  puts "****************#{filename}"
  result = parse(yaml, filename)
  result ? result.to_ruby : result
end
and look for the output in your terminal when you re-run the command.
I am currently dealing with deploying a rails app with capistrano (no jekyll at all). In my case, the output was blank, which is obviously not a filename. So now I'm investigating further up the chain. I hope that gets you started.
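To see what text would be handed to a YAML parser for a page like the index.html above, the front matter can be pulled out by hand. The sketch below is a rough illustration in Python, assuming Jekyll-style front matter delimited by lines containing only "---". split_front_matter is a made-up helper, not part of Jekyll or Psych, and it does not parse the YAML itself.

```python
def split_front_matter(text):
    """Return (front_matter, body); front_matter is None if there is none."""
    lines = text.splitlines()
    # Front matter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return None, text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return front, body
    # Opening delimiter without a closing one: treat as no front matter.
    return None, text


page = "---\nlayout: default\n---\nHello World\n"
front, body = split_front_matter(page)
print(repr(front))
print(repr(body))
```

Printing the extracted block for each source file, together with the contents of _config.yml, is one way to narrow down which input the parser is rejecting.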