YQL Yahoo Finance Scraper on XML in Ruby - ruby

I am using a YQL query (the standard example query, with GOOG, YHOO, MSFT and AAPL) to generate XML for all of the available fields. I wanted to scrape the YQL site for the XML output once it is generated using a Ruby script, so that I could run it over and over again for different stocks and store the data somewhere. I haven't finished my script yet, but what I have seems to just not run. Here is the code:
yahoo_finance_scrape.rb
require 'rubygems'
require 'nokogiri'
require 'restclient'
PAGE_URL = "http://developer.yahoo.com/yql/console/"
yql_query = 'use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") '
if page = RestClient.post(PAGE_URL, {'name' => yql_query, 'submit' => 'Test'})
puts "YQL query: #{yql_query}, is valid"
xml_output = Nokogiri::HTML(page)
lines = xml_output.css('#container #layout-doc #yui-gen3000008 #yui-gen3000009 #yui_3_11_0_3_1393417778356_354
#yui-gen3000015 #yui-gen3000016 div#yui_3_11_0_2_1393417778356_10 #centerBottomView
#outputContainer div#output #outputTabContent #formattedView #viewContent #prexml')
lines.each do |line|
puts line.css('span').map{|span| span.text}.join(' ')
sleep 0.03
end
end
When I run the program, it only prints
"YQL query: use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") , is valid"
And then just stops. Oh, I am using that Github url because yahoo.finance.quotes was not working, and someone else on Stackoverflow suggested to use it.
If you want to check the css tags, just go to http://developer.yahoo.com/yql/console/ and enter my query and do an inspect element on it. I would post it here, but I don't know how.

The output is just the content of your yql_query var. so this does not help much.
You probably should not put the "use xxxx ax quotes" as a string in your code.
Check out what "someone else" had in mind.
The RestClient.post() method returns a response object. With all HTTP operations, always check the response.code, otherwise you don't know about errors.
response = RestClient.post(...)
puts "HTTP Response code: #{response.code}"
if response.code == 200
page = repsonse.to_str
...
end
According to the Nokogiri website the xml_output.css() method filters like it is a css selector. if you have for example "#container #layout-doc", this means "filter elements with the id 'layout-doc' inside elements of the id 'container' and so on. Is this really what you itend to do? if yes, the last "#prexml" should be enough and much less error-prone, as ids should normally be unique.

Related

Process Jekyll content to replace first occurrence of any post title with a hyperlink of the post with that title

What I'm trying to do
I am building a Jekyll ruby plugin that will replace the first occurrence of any word in the post copy text content with a hyperlink linking to the URL of a post by the same name.
The problems I'm having
I've gotten this to work but I can't figure out two problems in the process_words method:
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokigiri, even though that seems to be the tool of choice here.
If a post's URL is not at post.data['url'], where is it?
Also, is there a more efficient, cleaner way to do this?
The current code works but will replace the first occurrence even if it's the value of an HTML attribute, like an anchor or a meta tag.
Example result
We have a blog with 3 posts:
Hobbies
Food
Bicycles
And in the "Hobbies" post body text, we have a sentence with each word appearing in it for the first time in the post, like so:
I love mountain biking and bicycles in general.
The plugin would process that sentence and output it as:
I love mountain biking and bicycles in general.
My current code (UPDATED 1)
# _plugins/hyperlink_first_word_occurance.rb
require "jekyll"
require 'uri'
module Jekyll
# Replace the first occurance of each post title in the content with the post's title hyperlink
module HyperlinkFirstWordOccurance
POST_CONTENT_CLASS = "page__content"
BODY_START_TAG = "<body"
ASIDE_START_TAG = "<aside"
OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*!
CLOSING_ASIDE_TAG_REGEX = %r!</aside(.*)>\s*!
class << self
# Public: Processes the content and updates the
# first occurance of each word that also has a post
# of the same title, into a hyperlink.
#
# content - the document or page to be processes.
def process(content)
#title = content.data['title']
#posts = content.site.posts
content.output = if content.output.include? BODY_START_TAG
process_html(content)
else
process_words(content.output)
end
end
# Public: Determines if the content should be processed.
#
# doc - the document being processes.
def processable?(doc)
(doc.is_a?(Jekyll::Page) || doc.write?) &&
doc.output_ext == ".html" || (doc.permalink&.end_with?("/"))
end
private
# Private: Processes html content which has a body opening tag.
#
# content - html to be processes.
def process_html(content)
content.output = if content.output.include? ASIDE_START_TAG
head, opener, tail = content.output.partition(CLOSING_ASIDE_TAG_REGEX)
else
head, opener, tail = content.output.partition(POST_CONTENT_CLASS)
end
body_content, *rest = tail.partition("</body>")
processed_markup = process_words(body_content)
content.output = String.new(head) << opener << processed_markup << rest.join
end
# Private: Processes each word of the content and makes
# the first occurance of each word that also has a post
# of the same title, into a hyperlink.
#
# html = the html which includes all the content.
def process_words(html)
page_content = html
#posts.docs.each do |post|
post_title = post.data['title'] || post.name
post_title_lowercase = post_title.downcase
if post_title != #title
if page_content.include?(" " + post_title_lowercase + " ") ||
page_content.include?(post_title_lowercase + " ") ||
page_content.include?(post_title_lowercase + ",") ||
page_content.include?(post_title_lowercase + ".")
page_content = page_content.sub(post_title_lowercase, "#{ post_title.downcase }")
elsif page_content.include?(" " + post_title + " ") ||
page_content.include?(post_title + " ") ||
page_content.include?(post_title + ",") ||
page_content.include?(post_title + ".")
page_content = page_content.sub(post_title, "#{ post_title }")
end
end
end
page_content
end
end
end
end
Jekyll::Hooks.register %i[posts pages], :post_render do |doc|
# code to call after Jekyll renders a post
Jekyll::HyperlinkFirstWordOccurance.process(doc) if Jekyll::HyperlinkFirstWordOccurance.processable?(doc)
end
Update 1
Updated my code with #Keith Mifsud's advice. Now using either the sidebar's aside element or the page__content class to select body content to work on.
Also improved checking and replacing the correct term.
PS: The code base example I started with working on my plugin was #Keith Mifsud's jekyll-target-blank plugin
this code looks very familiar :) I suggest you look into the Rspecs test file to test against your issues: https://github.com/keithmifsud/jekyll-target-blank
I'll try to answer your questions, sorry I couldn't test these myself the time of writing.
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokigiri, even though that seems to be the tool of choice here.
Your requirements here are:
1) Ignore content outside the <body></body> tags.
This seems to already be implemented in the process_html() method. This method is stating the only process the body_content and it should work as it is. Have you got tests for it? How are you debugging it? The same string splitting works in my plugin. I.e. only content inside the body is processed.
2) Ignore content inside the Table of Contents (TOC).
I suggest you extend the process_html() method by further splitting the body_content variable. Search for content in between the opening and closing tags of your TOC (by id, css class etc..) and exclude it, then add it back in it's position before or after process_words string.
3) Whether to use the Nokigiri plugin?
This plugin is great for parsing html. I think you are parsing strings and then creating html. So vanilla Ruby and the URI plugin should suffice. You can still use it if you want but it won't be any faster then splitting strings in ruby.
If a post's URL is not at post.data['url'], where is it?
I think you should a have method to get all all post titles and then match the "words" against the array. You can get all the posts collection from the doc itself doc.site.posts and foreach post return the title. The the process_words() method can check each work to see if it matched an item from the array. But what if the title is made of more than one word?
Also, is there a more efficient, cleaner way to do this?
So far so good. I'll start with getting the issues fixed and then refactor for speed and coding standards.
Again I suggest you use testing to help you with this.
Let me know if I can help more :)

Searching by text with Mechanize/Nogokiri

I'm trying to scrape some data on average GPA and more from a lot of pages similar to this one:
http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers
My issue is that gpa_headers is nil but there is at least one h3 element containing "GPA".
What could be causing this issue? I thought it may be that since the page has dynamic elements that Mechanize had some issue with that yet I can puts page.body and the output includes:
... <h3 style="text-align:center;">GPA REQUIREMENT</h3> ...
Which, by my understanding should be found with the xpath I used.
If there is a better approach to this I would like to know that too.
This looks to be a problem with the DOM structure of the site, as it contains a tag named style which isn't being closed and looks like this:
<td colspan='7'><style='text-align:center;font-style:italic'>The
institution has been granted Candidate for Accreditation status by the
Commission on Accreditation in Physical Therapy Education (1111 North
Fairfax Street, Alexandria, VA, 22314; phone: 703.706.3245; email: <a
href='mailto:accreditation#apta.org'>accreditation#apta.org</a>).
Candidacy is not an accreditation status nor does it assure eventual
accreditation. Candidate for Accreditation is a pre-accreditation
status of affiliation with the Commission on Accreditation in Physical
Therapy Education that indicates the program is progressing toward
accreditation.</td>
as you can see, the td tag closes but the inner style never did.
If you don't need this part of the code I would recommend removing this before trying to work with the entire response. I don't have experience with ruby but I would do something like:
Get the raw body of the response.
Replace the part that matches this regex '(<style=\'.*)</td>' with empty string, or close the tag yourself.
Work with this new response body.
Now you would be able to work with xpath selectors.
eLRuLL gives the source of the problem above. Here is an example of how I fixed the issue:
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
mangled_text = page.body
fixed_text = mangled_text.sub(/<style=.+?<\/td>/, "</td>")
page = Nokogiri::HTML(fixed_text)
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers
This will return the header that I was looking for above:
[#<Nokogiri::XML::Element:0x2b28a8ec0c38 name="h3" attributes=[#<Nokogiri::XML::Attr:0x2b28a8ec0bc0 name="style" value="text-align:center;">] children=[#<Nokogiri::XML::Text:0x2b28a8ec0774 "GPA REQUIREMENT">]>]
A more reliable solution is to work with a HTML5 parser like nokogumbo:
require 'nokogumbo'
doc = Nokogiri::HTML5(page.body)
gpa_headers = doc.search('//h3[contains(text(), "GPA")]')

Posting data on website using Mechanize Nokogiri Selenium

I need to post data on a website through a program.
To achieve this I am using Mechanize Nokogiri and Selenium.
Here's my code :
def aeiexport
# first Mechanize is submitting the form to identify yourself on the website
agent = Mechanize.new
agent.get("https://www.glou.com")
form_login_AEI = agent.page.forms.first
form_login_AEI.util_vlogin = "42"
form_login_AEI.util_vpassword = "666"
# this is suppose to submit the form I think
page_compet_list = agent.submit(form_login_AEI, form_login_AEI.buttons.first)
#to be able to scrap the page you end up on after submitting form
body = page_compet_list.body
html_body = Nokogiri::HTML(body)
#tds give back an array of td
tds = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td")
# Checking my array of td with some condition
tds.each do |td|
link = td.children.first # Select the first children
if link.html = "2015 32 92 0076 012"
# Only consider the html part of the link, if matched follow the previous link
previous_td = td.previous
previous_url = previous_td.children.first.href
#following the link contained in previous_url
page_selected_compet = agent.get(previous_url)
# to be able to scrap the page I end up on
body = page_selected_compet.body
html_body = Nokogiri::HTML(body)
joueur_access = html_body.search('#tabs0head2 a')
# clicking on the link
joueur_access.click
rechercher_par_numéro_de_licence = html_body.css('.L1').xpath("//table/tbody/tr/td[1]/a[1]")
pure_link_rechercher_par_numéro_de_licence = rechercher_par_numéro_de_licence['href']
#following pure_link_rechercher_par_numéro_de_licence
page_submit_licence = agent.get(pure_link_rechercher_par_numéro_de_licence)
body_submit_licence = page_submit_licence.body
html_body = Nokogiri::HTML(body_submit_licence)
#posting my data in the right field
form.field_with(:name => 'lic_cno[0]') == "9511681"
1) So far what do you think about this code, Do you think there is an error in there
2) This part is the one I am really not sure about : I have posted my data in the right field but now I need to submit it. The problem is that the button I need to click is like this:
<input type="button" class="button" onclick="dispatchAndSubmit(document.JoueurRechercheForm, 'rechercher');" value="Rechercher">
it triggers a javascript function onclick. I am triying Selenium to trigger the click event. Then I end up on another page, where I need to click a few more times.. I tried this:
driver.find_element(:value=> 'Rechercher').click
driver.find_element(:name=> 'sel').click
driver.find_element(:value=> 'Sélectionner').click
driver.find_element(:value=> 'Inscrire').click
But so far I have not succeeded in posting the data.
Could you please tell me if selenium will enable me to do what I need to do. If can I do it ?
At a glance your code can use less indentation and more white space/empty lines to separate the internal logic of AEIexport (which should be changed to aei_export since Ruby uses snake case for method names. You can find more recommendations on how to style ruby code here).
Besides the style of your code, an error I found at the beginning of your method is using an undefined variable page when defining form_login_AEI.
For your second question, I'm not familiar with Selenium; however since it does use a real web browser it can handle JavaScript. Watir is another possible solution.
An alternative would be to view the page source (i.e. in Firebug) and understand what the JavaScript on the page does. Then use Mechanize to follow the link manually.

Disable HTML within XML escaping with Nokogiri

I'm trying to parse an XML document from the Google Directions API.
This is what I've got so far:
x = Nokogiri::XML(GoogleDirections.new("48170", "48104").xml)
x.xpath("//DirectionsResponse//route//leg//step").each do |q|
q.xpath("html_instructions").each do |h|
puts h.inner_html
end
end
The output looks like this:
Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>
Turn <b>right</b> onto <b>N Territorial Rd</b>
Turn <b>left</b> onto <b>Gotfredson Rd</b>
...
I would like the output to be:
Turn <b>right</b> onto <b>N Territorial Rd</b>
The problem seems to be Nokogiri escaping the html within the xml
I trust Google, but I think it would be also good to sanitize it further to:
Turn right onto N Territorial Rd
But I can't (using sanitize perhaps) without the raw xml. Ideas?
Because I don't have the Google Directions API installed I can't access the XML, but I have a strong suspicion the problem is the result of telling Nokogiri you're dealing with XML. As a result it's going to return you the HTML encoded like it should be in XML.
You can unescape the HTML using something like:
CGI::unescape_html('Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>')
=> "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>\n"
unescape_html is an alias to unescapeHTML:
Unescape a string that has been HTML-escaped
CGI::unescapeHTML("Usage: foo "bar" <baz>")
# => "Usage: foo \"bar\" "
I had to think about this a bit more. It's something I've run into, but it was one of those things that escaped me during the rush at work. The fix is simple: You're using the wrong method to retrieve the content. Instead of:
puts h.inner_html
Use:
puts h.text
I proved this using:
require 'httpclient'
require 'nokogiri'
# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new
doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
puts html.text
end
Which outputs:
Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]
The difference is that inner_html is reading the content of the node directly, without decoding. text decodes it for you. text, to_str and inner_text are aliased to content internally in Nokogiri::XML::Node for our parsing pleasure.
Wrap your nodes in CDATA:
def wrap_in_cdata(node)
# Using Nokogiri::XML::Node#content instead of #inner_html (which
# escapes HTML entities) so nested nodes will not work
node.inner_html = node.document.create_cdata(node.content)
node
end
Nokogiri::XML::Node#inner_html escapes HTML entities except in CDATA sections.
fragment = Nokogiri::HTML.fragment "<div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>"
puts fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>
fragment.xpath(".//span").each {|node| node.inner_html = node.document.create_cdata(node.content) }
fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span>\n</div>
This is not a great or DRY solution, but it works:
puts h.inner_html.gsub("<b>" , "").gsub("</b>", "").gsub("<div style=\"font-size:0.9em\">", "").gsub("</div>", "")

Ruby script for posting comments

I have been trying to write a script that may help me to comment from command line.(The sole reason why I want to do this is its vacation time here and I want to kill time).
I often visit and post on this site.So I am starting with this site only.
For example to comment on this post I used the following script
require "uri"
require 'net/http'
def comment()
response = Net::HTTP.post_form(URI.parse("http://www.geeksforgeeks.org/wp-comments-post.php"),{'author'=>"pikachu",'email'=>"saurabh8c#gmail.com",'url'=>"geekinessthecoolway.blogspot.com",'submit'=>"Have Your Say",'comment_post_ID'=>"18215",'comment_parent'=>"0",'akismet_comment_nonce'=>"70e83407c8",'bb2_screener_'=>"1330701851 117.199.148.101",'comment'=>"How can we generalize this for a n-ary tree?"})
return response.body
end
puts comment()
Obviously the values were not hardcoded but for sake of clearity and maintaining the objective of the post i am hardcoding them.
Beside the regular fields that appear on the form,the values for the hidden fields i found out from wireshark when i posted a comment the normal way.I can't figure out what I am missing?May be some js event?
Edit:
As few people suggested using mechanize I switched to python.Now my updated code looks like:
import sys
import mechanize
uri = "http://www.geeksforgeeks.org/"
request = mechanize.Request(mechanize.urljoin(uri, "archives/18215"))
response = mechanize.urlopen(request)
forms = mechanize.ParseResponse(response, backwards_compat=False)
response.close()
form=forms[0]
print form
control = form.find_control("comment")
#control=form.find_control("bb2_screener")
print control.disabled
# ...or readonly
print control.readonly
# readonly and disabled attributes can be assigned to
#control.disabled = False
form.set_all_readonly(False)
form["author"]="Bulbasaur"
form["email"]="ashKetchup#gmail.com"
form["url"]="9gag.com"
form["comment"]="Y u no put a captcha?"
form["submit"]="Have Your Say"
form["comment_post_ID"]="18215"
form["comment_parent"]="0"
form["akismet_comment_nonce"]="d48e588090"
#form["bb2_screener_"]="1330787192 117.199.144.174"
request2 = form.click()
print request2
try:
response2 = mechanize.urlopen(request2)
except mechanize.HTTPError, response2:
pass
# headers
for name, value in response2.info().items():
if name != "date":
print "%s: %s" % (name.title(), value)
print response2.read() # body
response2.close()
Now the server returns me this.On going through the html code of the original page i found out there is one more field bb2_screener that i need to fill if I want to pretend like a browser to the server.But the problem is the field is not written inside the tag so mechanize won't treat it as a field.
Assuming you have all the params correct, you're still missing the session information that the site stores in a cookie. Consider using something like mechanize, that'll deal with the cookies for you. It's also more natural in that you tell it which fields to fill in with which data. If that still doesn't work, you can always use a jackhammer like selenium, but then technically you're using a browser.

Resources