require "openssl"
require "nokogiri"
require 'csv'
require "open-uri"
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
$n=0
#~ Open_Page
page = ('http://www.residentadvisor.net/dj/aguycalledgerald/tracks?sort=mostcharted')
html = Nokogiri::HTML(open(page))
#~ Array
names= []
html.css('a').each do |x|
names<< x.text.strip.gsub(/\t/,'')
names.delete('RA on YouTube')
names.delete('Login')
names.delete('Register')
names.delete('Resident Advisor')
names.delete('Submit')
names.delete('Listings')
names.delete('Clubs')
names.delete('News')
names.delete('Reviews')
names.delete('Features')
names.delete('Films')
names.delete('Submit event')
names.delete('Artists')
names.delete('Photos')
names.delete('DJ Charts')
names.delete('Labels')
names.delete('Podcasts')
names.delete('Search')
names.delete('Top 1000')
names.delete('Top 100')
names.delete('Local')
names.delete('Favourites')
names.delete('Create an artist profile')
names.delete('Reviews')
names.delete('Features')
names.delete('A')
names.delete('B')
names.delete('C')
names.delete('D')
names.delete('E')
names.delete('F')
names.delete('G')
names.delete('H')
names.delete('I')
names.delete('J')
names.delete('K')
names.delete('L')
names.delete('M')
names.delete('N')
names.delete('O')
names.delete('P')
names.delete('Q')
names.delete('R')
names.delete('S')
names.delete('T')
names.delete('U')
names.delete('V')
names.delete('W')
names.delete('X')
names.delete('Y')
names.delete('Z')
names.delete('0-9')
names.delete('RA')
names.delete('About')
names.delete('Advertise')
names.delete('Jobs')
names.delete('RA In Residence')
names.delete('Ticketing FAQ')
names.delete('Sell tickets on RA')
names.delete('Privacy')
names.delete('Terms')
names.delete('RA is also available in Japanese. 日本版')
names.delete('Download the RA Guide')
names.delete('RA on Twitter')
names.delete('RA on Facebook')
names.delete('RA on Google+')
names.delete('RA on Instagram')
names.delete('RA on Soundcloud')
names.delete('Biography')
names.delete('Events')
names.delete('Tracks')
names.delete('RA News')
names.delete('RA Editorial')
names.delete('Remixes')
names.delete('Solo productions')
names.delete('Collaborations')
names.delete('Laboratory Instinct')
names.delete('Highgrade Records')
names.delete('Bosconi')
names.delete('!K7')
names.delete('Perlon')
names.delete('Beatstreet')
names.delete('Title')
names.delete('Label')
names.delete('Release Date')
names.delete('51 chartings')
puts names
end
#~ To_CSV
for $n in 0..names.count do
CSV.open('Most_Charted.csv','a+') do |csv|
csv << [names[$n]]
end
end
That creates a CSV file with:
PositiveNoise (Carl Craig remix) System 7 & Guy Called Gerald A-Wave 22 chartings
Voodoo Ray (Shield Re-Edit) A Guy Called Gerald 18 chartings
Falling (D. Digglers Cleptomania remix) Tom Clark & Benno Blome feat. A Guy Called Gerald 18 chartings
How Long Is Now A Guy Called Gerald 14 chartings
Groove Of The Ghetto A Guy Called Gerald 12 chartings
Voodoo Ray A Guy Called Gerald 10 chartings
Falling (D Diggler's Rescue remix) Tom Clark & Benno Blome feat. A Guy Called Gerald 9 chartings
and so on.
How do I pass only the first 5 song names to the CSV file?
Be sure you know what you are doing when you disable SSL checks.
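If the certificate trouble is limited to this one site, a gentler option is to pass the verify mode to open-uri for that single request instead of redefining the constant for the whole process. A minimal sketch (open-uri accepts an ssl_verify_mode option):
require 'open-uri'
require 'openssl'
# Only this request skips verification; the global default stays VERIFY_PEER.
page = open('https://www.residentadvisor.net/dj/aguycalledgerald/tracks?sort=mostcharted',
            ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)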
You can find a better selector for the track list, so you don't need all those deletes: the tracks are all inside ul.tracks.
Then I'd suggest you make the whole thing a class so you can encapsulate the behavior. And don't use $ globals; they aren't needed here and are usually a sign of bad code.
Here is a working sample:
require "openssl"
require "nokogiri"
require 'csv'
require "open-uri"
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
class Tracklist
def initialize(url)
@url = url
end
def parse(top = nil)
html = Nokogiri::HTML(open(url))
result = []
html.css('ul.tracks li').each do |node|
title = node.css('div.title a:nth-child(1)').first
result << title.text if !title.nil?
break if top && result.length == top
end
result
end
private
attr_reader :url
end
list = Tracklist.new("https://www.residentadvisor.net/dj/aguycalledgerald/tracks?sort=mostcharted")
p list.parse(5)
If you need more information about the tracks, then you can extract more details in the loop inside the parse method.
This code stops parsing after top has been reached. Afterwards you can build your CSV as you like.
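For example, a short sketch that feeds the five parsed titles into a CSV, reusing the column layout from your original script:
require 'csv'
list = Tracklist.new("https://www.residentadvisor.net/dj/aguycalledgerald/tracks?sort=mostcharted")
CSV.open('Most_Charted.csv', 'w') do |csv|
  # one row per track title, limited to the first five
  list.parse(5).each { |title| csv << [title] }
end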
require 'open-uri'
require 'nokogiri'
def scrape(url)
html = open(url).read
nokogiri_doc = Nokogiri::HTML(html)
final_array = []
nokogiri_doc.search("a").each do |element|
element = element.text
final_array << element
end
final_array.each_with_index do |index|
puts "#{index}"
end
end
scrape('http://www.infranetsol.com/')
With this I'm only getting the text of the a tags, but I need the email address and phone number in an Excel file.
All you have is text. So what you can do is keep only the strings that look like an email or a phone number.
For instance, if you keep your result in an array
a = scrape('http://www.infranetsol.com/')
You can get the elements containing an email (strings with a '@'):
a.select { |s| s.match(/.*@.*/) }
You can get the elements containing a phone number (strings with at least 5 consecutive digits):
a.select{ |s| s.match(/\d{5}/) }
The whole code:
require 'open-uri'
require 'nokogiri'
def scrape(url)
html = open(url).read
nokogiri_doc = Nokogiri::HTML(html)
final_array = []
nokogiri_doc.search("a").each do |element|
element = element.text
final_array << element
end
final_array.each_with_index do |index|
puts "#{index}"
end
end
a = scrape('http://www.infranetsol.com/')
email = a.select { |s| s.match(/.*@.*/) }
phone = a.select{ |s| s.match(/\d{5}/) }
# in your example, you will have the email addresses in email
# and unfortunately a complex string for phone.
# you can use scan to extract the phone numbers from the text and flat_map
# to get an array without sub-arrays
# But keep in mind this will only work with this text
phone.flat_map{ |elt| elt.scan(/\d[\d ]*/) }
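Since you said you want this in an Excel file, one simple route is to write a CSV, which Excel opens directly. A minimal sketch reusing the email and phone arrays built above (the file name is just an example):
require 'csv'
CSV.open('contacts.csv', 'w') do |csv|
  csv << ['emails', 'phones']
  # join multiple matches into one cell each
  csv << [email.join('; '), phone.flat_map { |elt| elt.scan(/\d[\d ]*/) }.join('; ')]
end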
I would like to be able to insert templated YARD doc style comments into my existing legacy Rails application. At present it has few comments. I would like class headers and method headers that have the params specified (by extraction from the method signatures, I presume) and placeholders for return values.
In PHP I had tools that would examine the code and insert the generated doc header comments into the code at the proper spots. In Ruby, with duck typing etc., I am certain that things like the types of the @params cannot easily be guessed at, and I am OK with that - I expect to review the code files one by one manually after insertion. I would just like to avoid having to insert all the skeleton templates into the code (over 500 files) if possible.
I have searched for a gem, etc. that does this and have come across none so far. Are there any out there?
It seems you will have to write it yourself, but this is not a big problem given access to Ruby's S-expressions, which will parse the source for you. So you can do it like this:
require 'ripper'
def parse_sexp( sexp, stack=[] )
case sexp[0]
when :module
name = sexp[1][1][1]
line_number = sexp[1][1][2][0]
parse_sexp(sexp[2], stack+[sexp[0], sexp[1][1][1]])
puts "#{line_number}: Module: #{name}\n"
when :class
name = sexp[1][1][1]
line_number = sexp[1][1][2][0]
parse_sexp(sexp[3], stack+[sexp[0], sexp[1][1][1]])
puts "#{line_number}: Class: #{stack.last}::#{name}\n"
when :def
name = sexp[1][1]
line_number = sexp[1][2][0]
parse_sexp(sexp[3], stack+[sexp[0], sexp[1][1]])
puts "#{line_number}: Method: #{stack.last}##{name}\n"
else
if sexp.kind_of?(Array)
sexp.each { |s| parse_sexp(s,stack) if s.kind_of?(Array) }
end
end
end
sexp = Ripper.sexp(open 'prog.rb')
parse_sexp(sexp)
Prog.rb was:
$ cat -n prog.rb
1 module M1
2 class C1
3 def m1c1
4 a="test"
5 puts "hello"
6 return a if a.empty?
7 puts "hello2"
8 a
9 end
10 end
11 class C2 < C3
12 def m1c2
13 puts "hello"
14 end
15 end
16 class C3
17 end
18 end
What you'll get is:
#line_number #entity
3: Method: C1#m1c1
2: Class: M1::C1
12: Method: C2#m1c2
11: Class: M1::C2
16: Class: M1::C3
1: Module: M1
So you only need to customize the template, and extract the parameters which are available in the same array:
#irb > pp Ripper.sexp("def method(param1);nil; end")
...[:def,
[:@ident, "method", [1, 4]],
[:paren,
[:params, [[:@ident, "param1", [1, 11]]]...
A slightly harder task is to find out what is returned, but it's still doable: look for :return nodes while :def is the last entry on the stack, and also treat the last statement of the method as a return value.
And finally put those comments above the appropriate lines in the source file.
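As a rough sketch of those last two steps (the helper names here are made up for illustration, and only plain positional parameters are handled), you can read the parameter names out of the :params node and splice a YARD skeleton above the def line:
require 'ripper'

# Pull plain positional parameter names out of a :def sexp.
# Handles both `def m(a, b)` (params wrapped in :paren) and `def m a, b`.
def param_names(def_sexp)
  params = def_sexp[2]
  params = params[1] if params && params[0] == :paren
  (params && params[1] || []).map { |ident| ident[1] }
end

# Build the YARD skeleton lines for one method.
def yard_skeleton(names, indent)
  lines = names.map { |n| "#{indent}# @param #{n} [Object] TODO describe" }
  lines << "#{indent}# @return [Object] TODO describe"
end

p param_names(Ripper.sexp("def method(param1); nil; end")[1][0])
# => ["param1"]

# Splice a skeleton above a 1-based line number (work bottom-up in a real file
# so earlier line numbers stay valid).
src = File.readlines('prog.rb')
line_number = 3                               # the `def m1c1` found by parse_sexp above
indent = src[line_number - 1][/\A\s*/]
src.insert(line_number - 1, *yard_skeleton([], indent).map { |l| l + "\n" })
File.write('prog_commented.rb', src.join)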
This is killing me, and searching here and on the big G is confusing me even more.
I followed the tutorial at Railscasts #190 on Nokogiri and was able to write myself a nice little parser:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.target.com/c/movies-entertainment/-/N-5xsx0/Ntk-All/Ntt-wwe/Ntx-matchallpartial+rel+E#navigation=true&facetedValue=/-/N-5xsx0&viewType=medium&sortBy=PriceLow&minPrice=0&maxPrice=10&isleaf=false&navigationPath=5xsx0&parentCategoryId=9975218&RatingFacet=0&customPrice=true"
doc = Nokogiri::HTML(open(url))
puts doc.at_css("title").text
doc.css(".standard").each do |item|
title = item.at_css("span.productTitle a")[:title]
format = item.at_css("span.description").text
price = item.at_css(".price-label").text[/\$[0-9\.]+/]
link = item.at_css("span.productTitle a")[:href]
puts "#{title}, #{format}, #{price}, #{link}"
end
I'm happy with the results and able to see it in the Windows console. However, I want to export the results to a CSV file and have tried numerous ways (with no luck) and I know I'm missing something. My latest updated code (after downloading the html files) is below:
require 'rubygems'
require 'nokogiri'
require 'csv'
@title = Array.new
@format = Array.new
@price = Array.new
@link = Array.new
doc = Nokogiri::HTML(open("index1.html"))
doc.css(".standard").each do |item|
@title << item.at_css("span.productTitle a")[:title]
@format << item.at_css("span.description").text
@price << item.at_css(".price-label").text[/\$[0-9\.]+/]
@link << item.at_css("span.productTitle a")[:href]
end
CSV.open("file.csv", "wb") do |csv|
csv << ["title", "format", "price", "link"]
csv << [@title, @format, @price, @link]
end
It works and spits a file out for me, but just the last result. I followed the tutorial at Andrew!: Web Scraping... and trying to mix what I'm trying to achieve with someone else's process is confusing.
I assume it's looping through all of the results and only printing the last. Can someone give me pointers on how I should loop this (if that's the problem) so that all the results are in their respective columns?
Thanks in advance.
You're storing values in four arrays, but you're not enumerating the arrays when you generate your output.
Here is a possible fix:
CSV.open("file.csv", "wb") do |csv|
csv << ["title", "format", "price", "link"]
until @title.empty?
csv << [@title.shift, @format.shift, @price.shift, @link.shift]
end
end
Note that this is a destructive operation that shifts the values off of the arrays one at a time, so in the end they will all be empty.
There are more efficient ways to read and convert the data, but this will hopefully do what you want for now.
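One non-destructive alternative, for instance, is to zip the four arrays into rows instead of shifting them:
CSV.open("file.csv", "wb") do |csv|
  csv << ["title", "format", "price", "link"]
  # each zipped element is already a [title, format, price, link] row
  @title.zip(@format, @price, @link).each { |row| csv << row }
end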
There are several things you could do to write this more in the "Ruby way":
require 'rubygems'
require 'nokogiri'
require 'csv'
doc = Nokogiri::HTML(open("index1.html"))
CSV.open('file.csv', 'wb') do |csv|
csv << %w[title format price link]
doc.css('.standard').each do |item|
csv << [
item.at_css('span.productTitle a')[:title],
item.at_css('span.description').text,
item.at_css('.price-label').text[/\$[0-9\.]+/],
item.at_css('span.productTitle a')[:href]
]
end
end
Without sample HTML it's not possible to test this, but, based on your code, it looks like it'd work.
Notice that in your code you're using instance variables. They're not necessary because you aren't defining a class to have an instance of. You can use local values instead.
I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.
There are exactly 43612 XML files that I want to crawl and store in a CSV file.
My script works if I crawl maybe 500 XML files, but anything larger takes too much time and it freezes or something.
I have divided the code into pieces here so it is easier to read; the whole script/code is here: https://gist.github.com/1981074
I am using two libraries because I couldn't find a way to do this all in Nokogiri. I personally find REXML easier to use.
My question: how can I fix it so it won't take a week to crawl all this? How do I make it run faster?
HERE IS MY SCRIPT:
Require the necessary lib:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
Create a bunch of arrays to store the grabbed data:
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
Grab all the XML links from a specific site and store them in an array called @urls:
htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))
htmldoc.xpath('//a/@href').each do |links|
@urls << links.content
end
Loop through the @urls array and grab every element node that I want with XPath:
@urls.each do |url|
# Loop through the XML files and grab element nodes
xmldoc = REXML::Document.new(open(url).read)
# Root element
root = xmldoc.root
# Fetch the info id
@ID << root.attributes["id"]
# TitleSv
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
|e| m = e.text
m = m.to_s
next if m.empty?
@titleSv << m
}
Then store them in a CSV file.
CSV.open("eduction_normal.csv", "wb") do |row|
(0..@ID.length - 1).each do |index|
row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
end
end
It's hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what's blocking you.
Libraries
You're using a lot of libraries here that probably aren't necessary.
You use both REXML and Nokogiri. They both do the same job. Except Nokogiri is much better at it (benchmark).
Use Hashes
Instead of storing data at index in 15 arrays, have one set of hashes.
For instance,
items = Set.new
doc.xpath('//a/@href').each do |url|
item = {}
item[:url] = url.content
items << item
end
items.each do |item|
xml = Nokogiri::XML(open(item[:url]))
item[:id] = xml.root['id']
...
end
Collect the data, then write to file
Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.
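For instance, something like this at the end of the crawl (a sketch assuming the hash keys used above, plus whatever other keys you add):
require 'csv'
CSV.open('eduction_normal.csv', 'wb') do |csv|
  csv << %w[id title_sv title_en]                    # header row
  items.each do |item|
    csv << item.values_at(:id, :title_sv, :title_en) # one row per crawled record
  end
end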
Be DRY
In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try instead to abstract out the common code.
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
|e| m = e.text
m = m.to_s
next if m.empty?
@titleSv << m
}
Move what's common to a method
def get_value(xml, path)
str = ''
xml.elements.each(path) do |e|
str = e.text.to_s
next if str.empty?
end
str
end
And move anything constant to another hash
xml_paths = {
:title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
:title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
...
}
Now you can combine these techniques to make for much cleaner code:
item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
I hope this helps!
It won't work without your fixes. And I believe you should do what @Ian Bishop said and refactor your parsing code:
require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'
class Links < Pioneer::Base
include REXML
def locations
["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
end
def processing(req)
doc = Nokogiri::HTML(req.response.response)
doc.xpath('//a/@href').map do |links|
links.content
end
end
end
class Crawler < Pioneer::Base
include REXML
def locations
Links.new.start.flatten
end
def processing(req)
xmldoc = REXML::Document.new(req.response.response)
root = xmldoc.root
id = root.attributes["id"]
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
title = e.text.to_s
CSV.open("eduction_normal.csv", "a") do |f|
f << [id, title ...]
end
end
end
end
Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)
If you really want to speed it up, you're going to have to go concurrent.
One of the simplest ways is to install JRuby and then run your application with one small modification: install either the 'peach' or 'pmap' gems and then change your items.each to items.peach(n) (parallel each), where n is the number of threads. You'll need at least one thread per CPU core, but if you put I/O in your loop then you'll want more.
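For example, with peach the crawl loop above barely changes (a sketch assuming the peach gem's Array#peach API as described; pick the thread count to match your cores and how much I/O you do):
require 'peach'
# Each thread fetches and parses its own share of the URLs.
items.to_a.peach(8) do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
end
# Write the CSV afterwards in a single thread, as before.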
Also, use Nokogiri, it's much faster. Ask a separate Nokogiri question if you need to solve something specific with Nokogiri. I'm sure it can do what you need.
I found this script on Pastebin; it's an IRC bot that will find YouTube videos for you. I haven't touched it at all (bar the channel settings). It works well, however it won't grab the URL of the video that was searched for. This code is not mine! I just would like to get it to work, as it would be quite useful!
#!/usr/bin/env ruby
require 'rubygems'
require 'cinch'
require 'nokogiri'
require 'open-uri'
require 'cgi'
bot = Cinch::Bot.new do
configure do |c|
c.server = "irc.freenode.net"
c.nick = "YouTubeBot"
c.channels = ["#test"]
end
helpers do
#Grabs the first result and returns the TITLE,LINK,DESCRIPTION
def youtube(query)
doc = Nokogiri::HTML(open("http://www.youtube.com/results?q=#{CGI.escape(query)}"))
result = doc.css('div#search-results div.result-item-main-content')[0]
title = result.at('h3').text
link = "www.youtube.com"+"#{result.at('a')[:href]}"
desc = result.at('p.description').text
rescue
"No results found"
else
CGI.unescape_html "#{title} - #{desc} - #{link}"
end
end
on :channel, /^!youtube (.+)/ do |m, query|
m.reply youtube(query)
end
on :channel, "polkabot quit" do |m|
m.channel.part("bye")
end
end
bot.start
Currently if i use the command
!youtube asdf
I get this returned:
19:25 < YouTubeBot> asdfmovie - Worldwide Store www.cafepress.com ...
asdfmovie cakebomb tomska
epikkufeiru asdf movie ... tomska ... [Baby Giggling] Man: Got your nose! [Baby ...
- www.youtube.com#
As you can see, the URL is just www.youtube.com#, not the URL of the video.
Thanks a lot!
This is a selector issue: it looks like it's the third 'a' that has the href you want, so try:
link = "www.youtube.com#{result.css('a')[2][:href]}"