File Creation/Loop Problems in Ruby - ruby

EDIT: My original question was way off, my apologies. Mark Reed has helped me find out the real problem, so here it is.
Note that this code works:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
source_url = "www.flickr.com"
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts textarea
create_file.close
Which is really awesome, but I need it to do this to ~110 URLs, not just Flickr. Here's my loop that isn't working:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
File.open('sources.txt').each_line do |source_url|
  puts "Visiting #{source_url}"
  page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
  textarea = page.css('textarea')
  filename = source_url.to_s + ".txt"
  create_file = File.open("#{filename}", 'w')
  create_file.puts "#{textarea}"
  create_file.close
end
What am I doing wrong with my loop?

OK, now you're looping over the lines of the input file. When you do that, each string you get ends in a newline. So you're trying to create a file with a newline in the middle of its name, which is not legal on Windows.
Just chomp the string:
File.open('sources.txt').each_line do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
end
You can also use File.foreach instead of File.open(...).each_line:
File.foreach('sources.txt') do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
end
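To see the trailing newline for yourself, here is a minimal sketch that runs without the real list. The temporary sources.txt and the second hostname are made up for illustration. It also uses the chomp: true option (available in Ruby 2.4 and later), which strips the line ending while reading, so no separate chomp! call is needed.

```ruby
require 'tmpdir'
require 'fileutils'

# Build a throwaway sources.txt (hypothetical contents) so the sketch
# runs without the real list of URLs.
dir = Dir.mktmpdir
list_path = File.join(dir, 'sources.txt')
File.write(list_path, "www.flickr.com\nwww.example.com\n")

# Without chomp, each line still carries its trailing newline.
raw_lines = File.foreach(list_path).to_a

# chomp: true strips the line ending as each line is read.
filenames = File.foreach(list_path, chomp: true).map do |source_url|
  "#{source_url}.txt"
end

p raw_lines.first   # prints "www.flickr.com\n"
p filenames

# Remove the temporary directory again.
FileUtils.remove_entry(dir)
```

Either approach works; chomp: true just keeps the loop body a line shorter.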

You're putting your parentheses in the wrong place:
create_file = File.open(variable, 'w')

Related

How do I parse XML nodes from an API request?

How do I save the information from an XML page that I got from an API?
The URL is "http://api.url.com?number=8-6785503" and it returns:
<OperatorDataContract xmlns="http://psgi.pts.se/PTS_Number_Service" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>Tele2 Sverige AB</Name>
<Number>8-6785503</Number>
</OperatorDataContract>
How do I parse the Name and Number nodes to a file?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://api.url.com?number=8-6785503"
doc = Nokogiri::XML(open(url))
File.open("exporterad.txt", "w") do |file|
  doc.xpath("//*").each do |item|
    title = item.xpath('//result[group_name="Name"]')
    phone = item.xpath("/Number").text.strip
    puts "#{title} ; \n"
    puts "#{phone} ; \n"
    company = " #{title}; #{phone}; \n\n"
    file.write(company.gsub(/^\s+/,''))
  end
end
Besides the fact that your code isn't valid Ruby, you're making it a lot harder than necessary, at least for a simple scrape and save:
require 'nokogiri'
require 'open-uri'
url = "http://api.pts.se/PTSNumberService/Pts_Number_Service.svc/pox/SearchByNumber?number=8-6785503"
doc = Nokogiri::XML(open(url))
File.open("exported.txt", "w") do |file|
  name = doc.at('Name').text
  number = doc.at('Number').text
  file.puts name
  file.puts number
end
Running that results in a file called "exported.txt" that contains:
Tele2 Sverige AB
8-6785503
You can build upon that as necessary.
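As one way to build on it, here is a rough sketch that puts both fields on a single semicolon-separated line, which seems to be the shape the question was aiming for. It uses REXML, which ships with Ruby, instead of Nokogiri so that it runs without installing a gem, and it parses the sample response from the question as a string rather than fetching the URL. The fields hash and the record variable are names invented for this example.

```ruby
require 'rexml/document'

# Sample response copied from the question above.
xml = <<~XML
  <OperatorDataContract xmlns="http://psgi.pts.se/PTS_Number_Service" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
    <Name>Tele2 Sverige AB</Name>
    <Number>8-6785503</Number>
  </OperatorDataContract>
XML

doc = REXML::Document.new(xml)

# Collect each child element of the root as name => text.
fields = {}
doc.root.each_element { |element| fields[element.name] = element.text }

# One record per line: "name; number".
record = "#{fields['Name']}; #{fields['Number']}"
puts record
```

Inside the File.open block you would call file.puts record instead of puts.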

xpath search using libxml + ruby

I am trying to search for a specific node in an XML file using XPath. This search worked just fine under REXML, but REXML was too slow for large XML docs, so I moved over to LibXML.
My simple example processes a Yum repomd.xml file; an example can be found here: http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml
My test script is as follows:
require 'rubygems'
require 'libxml'
p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse
filelist = repomd.find_first("/repomd/data[@type='filelists']/location@href")
puts "Length: " + filelist.length.to_s
filelist.each do |f|
  puts f.attributes['href']
end
I get this error:
Error: Invalid expression.
/usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find': Error: Invalid expression. (LibXML::XML::Error)
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find'
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:130:in `find_first'
from /tmp/scripty.rb:6
I have also tried simpler examples like below, but still no dice.
p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse
filelist = repomd.root.find(".//location")
puts "Length: " + filelist.length.to_s
In the above case I get the output:
Length: 0
Your inspired guidance would be greatly appreciated, and I have searched for what I am doing wrong, and I just can't figure it out...
Here is some code that will fetch the file and process it, still doesn't work...
require 'rubygems'
require 'open-uri'
require 'libxml'
raw_xml = open('http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml').read
p = LibXML::XML::Parser.string(raw_xml)
repomd = p.parse
filelist = repomd.find_first("//data[@type='filelists']/location[@href]")
puts "First: " + filelist
In the end I reverted back to REXML and used stream processing. Much faster and much easier XPath syntax implementation.
Looking at your code, it seems you want to collect only those location elements that have an href attribute. If that's the case, the expression below should work:
"//data[@type='filelists']/location[@href]"

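A possible reason for the "Length: 0" result, assuming the file declares a default XML namespace on its root element: an unprefixed name in XPath then may not match the namespaced elements. The sketch below shows the attribute syntax (@type, @href) together with an explicit namespace mapping. It uses REXML, since that is what the asker went back to, and a trimmed, made-up sample rather than the real file. The namespace URI, the href values and the "r" prefix are assumptions for illustration.

```ruby
require 'rexml/document'

# Trimmed, made-up sample in the shape of a repomd.xml file; the namespace
# declaration and href values are assumptions for illustration.
xml = <<~XML
  <repomd xmlns="http://linux.duke.edu/metadata/repo">
    <data type="filelists">
      <location href="repodata/filelists.xml.gz"/>
    </data>
    <data type="primary">
      <location href="repodata/primary.xml.gz"/>
    </data>
  </repomd>
XML

doc = REXML::Document.new(xml)

# Bind a prefix of our choosing ("r") to the document's default namespace.
namespaces = { 'r' => 'http://linux.duke.edu/metadata/repo' }

# @type and @href test attributes; the r: prefix qualifies the element names.
location = REXML::XPath.first(
  doc, "//r:data[@type='filelists']/r:location[@href]", namespaces
)

href = location.attributes['href']
puts href
```
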
Output several times

I have the following code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
time = Time.new
url = "http://mobile.bahn.de/bin/mobil/bhftafel.exe/dox?input=Richard-Strauss-Stra%DFe%2C+M%FCnchen%23625127&date=" +
time.strftime("%d%m%Y") +
"&time=" +
time.strftime("%H") +
"%3A" +
time.strftime("%M") +
"&productsFilter=1111111111000000&REQTrain_name=&maxJourneys=10&start=Suchen&boardType=Abfahrt&ao=yes"
doc = Nokogiri::HTML(open(url))
doc.xpath('//div//p').remove
doc.encoding = 'UTF-8'
doc = doc.xpath('//div').each do |node|
  text = node.text.gsub(/\n([ \t]*\n)+/,"\n",).gsub(/^\s+|\s+$/,'').gsub("Startseite", '').gsub("Impressum", '')
  puts text unless text.empty?
end
I have two problems:
The code outputs three times and not one time.
The German umlauts, such as ä and ü, cause problems.
The original HTML is long and not indented, so it is very hard to debug.
But I think you need to replace:
doc = doc.xpath('//div').each do |node|
With:
doc = doc.xpath('//body/div').each do |node|
The first expression matches every <div> in the document, so it returns the //body/div elements and then, separately, the <div>s nested inside them, which prints the nested text more than once.
I had no problems with umlaut characters, using puts, but did have problems writing them to a file. What is your exact problem? It might be best if you create a new question on Stack Overflow for the umlauts issue.
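The difference between the two expressions can be sketched on a tiny, made-up page. This uses REXML on well-formed markup so it runs without Nokogiri; the element contents are invented, and real HTML would normally still need an HTML parser.

```ruby
require 'rexml/document'

# Made-up, well-formed markup: one top-level <div> wrapping two nested ones.
html = <<~HTML
  <html><body>
    <div>
      <div>Departure A</div>
      <div>Departure B</div>
    </div>
  </body></html>
HTML

doc = REXML::Document.new(html)

# //div selects divs at any depth; //body/div only direct children of <body>.
all_divs = REXML::XPath.match(doc, '//div')
top_divs = REXML::XPath.match(doc, '//body/div')

puts "//div matched #{all_divs.size} element(s)"
puts "//body/div matched #{top_divs.size} element(s)"
```

Because the outer <div> already contains the text of the inner ones, looping over every match of //div prints that text again for each nested element.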

Search Websites Content

How do you search a website's source code with Ruby? It's hard to explain, but here's the code for doing it in Python:
import urllib2, re
word = "How to ask"
source = urllib2.urlopen("http://stackoverflow.com").read()
if re.search(word,source):
    print "Found it "+word
Here's one way:
require 'open-uri'
word = "How to ask"
open('http://stackoverflow.com') do |f|
  puts "Found it #{word}" if f.read =~ /#{word}/
end
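One thing to watch for: /#{word}/ treats the search text as a regular expression, so characters like . or ? act as metacharacters. If you want a plain text match, String#include? or Regexp.escape avoids that. A small sketch on an in-memory string follows; the sample source text and the helper name matches_escaped? are made up for illustration.

```ruby
# Hypothetical helper: regex match with the search text escaped first.
def matches_escaped?(source, word)
  !!(source =~ /#{Regexp.escape(word)}/)
end

source = 'release 1x0 notes'   # stand-in for the fetched page source
word = '1.0'

# Unescaped, the "." matches any character, so "1x0" counts as a hit.
unescaped = !!(source =~ /#{word}/)

# A literal check does not find "1.0" in this text.
literal = source.include?(word)

puts "unescaped regex: #{unescaped}"                         # prints true
puts "include?: #{literal}"                                  # prints false
puts "escaped regex: #{matches_escaped?(source, word)}"      # prints false
```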
If all you want to do is search, jcrossley3 gave you your answer. If you want to do something more complicated, you should look at an HTML parser that lets you treat the website like a DOM tree. Have a look at why's great Hpricot gem to do just that.
require 'hpricot'
require 'open-uri'
doc = open("http://qwantz.com/") { |f| Hpricot(f) }
doc.search("//p[@class='posted']")
(doc/"p/a/img").each do |img|
  puts img.attributes['class']
end

Ruby Regex Help

I want to extract the members' home site links from a site.
They look like this:
<a href="http://www.ptop.se" target="_blank">
I tested this pattern on this site:
http://www.rubular.com/
<a href="(.*?)" target="_blank">
It should output http://www.ptop.se.
Here is the code:
require 'open-uri'
url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
open(url) { |page| content = page.read()
links = content.scan(/<a href="(.*?)" target="_blank">/)
links.each {|link| puts #{link}
}
}
If you run this, it doesn't work. Why not?
I would suggest that you use one of the good Ruby HTML/XML parsing libraries, e.g. Hpricot or Nokogiri.
If you need to log in on the site you might be interested in a library like WWW::Mechanize.
Code example:
require "open-uri"
require "hpricot"
require "nokogiri"
url = "http://itproffs.se/forumv2"
# Using Hpricot
doc = Hpricot(open(url))
doc.search("//a[@target='_blank']").each { |user| puts "found #{user.inner_html}" }
# Using Nokogiri
doc = Nokogiri::HTML(open(url))
doc.xpath("//a[@target='_blank']").each { |user| puts "found #{user.text}" }
Several issues with your code:
I don't know what you mean by using #{link}. Outside a string, # starts a comment, so puts #{link} is just a bare puts. If you want to append a '#' character to the link, make sure you wrap that in quotes, i.e. "#{link}"
String#scan accepts a block. Use it to loop through the matches.
The page you are trying to access does not return any links that the regex would match anyway.
Here's something that would work:
require 'open-uri'
url = "http://itproffs.se/forumv2/"
open(url) do |page|
  content = page.read()
  content.scan(/<a href="(.*?)" target="_blank">/) do |match|
    match.each { |link| puts link }
  end
end
There're better ways to do it, I am sure. But this should work.
Hope it helps
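If you'd rather not use the block form, here is a small sketch of the same scan on an in-memory string. With one capture group, scan returns an array of one-element arrays, so flatten gives you a plain list of links. The content string stands in for page.read, and the second and third anchors are made up for illustration.

```ruby
# Stand-in for page.read; the second and third links are made up.
content = <<~HTML
  <a href="http://www.ptop.se" target="_blank">
  <a href="http://www.example.com" target="_blank">
  <a href="http://www.example.org">
HTML

# One capture group: scan returns [["..."], ["..."]], flatten unwraps it.
links = content.scan(/<a href="(.*?)" target="_blank">/).flatten

links.each { |link| puts link }
```

The third anchor has no target attribute, so the pattern skips it.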
