HTML parsing with Oga and httpclient in Ruby - ruby

I'm trying to download a page with httpclient and parse it using Oga (https://github.com/YorickPeterse/oga).
My program looks like this:
require 'httpclient'
require 'oga'
url = 'http://stackoverflow.com/questions/1496096/is-there-a-limit-to-the-length-of-html-attributes'
c = HTTPClient.new
content = c.get_content(url)
document = Oga.parse_html(content)
I get this error:
LL::ParserError: Unexpected end of input, expected element closing tag instead on line 431
parser_error at /home/binaryplease/.rvm/gems/jruby-1.7.19/gems/oga-0.3.1-java/lib/oga/xml/parser.rb:255
each_token at /home/binaryplease/.rvm/gems/jruby-1.7.19/gems/oga-0.3.1-java/lib/oga/xml/parser.rb:231
parse at org/libll/Driver.java:303
parse at /home/binaryplease/.rvm/gems/jruby-1.7.19/gems/oga-0.3.1-java/lib/oga/xml/parser.rb:262
parse_html at /home/binaryplease/.rvm/gems/jruby-1.7.19/gems/oga-0.3.1-java/lib/oga/oga.rb:25
(root) at test.rb:12
I verified that httpclient downloads the page correctly and that the file doesn't end at that line. I also tried other links; some work, but most of them give me this error.
In general, smaller pages seem to work just fine.
Is there a problem with the library, or am I making an error?

Related

Downloading a track from Soundcloud using Ruby SDK

I am trying to download a track from Soundcloud using the ruby sdk (soundcloud 0.2.0 gem) with an app. I have registered the app on soundcloud and the client_secret is correct. I know this because I can see my profile info and tracks using the app.
Now when I try to download a track using the following code
@track = current_user.soundcloud_client.get(params[:track_uri])
data = current_user.soundcloud_client.get(@track.download_url)
File.open("something.mp3","wb"){|f|f.write(data)}
the file has nothing in it when I open it. I've tried many approaches, including the following one:
data = current_user.soundcloud_client.get(@track.download_url)
file = File.read(data)
And this one gives me an error
can't convert nil into String
on line 13 which is in
app/controllers/store_controller.rb:13:in `read'
that is the File.read function.
I have double checked that the track I am trying to download is public and downloadable.
I tried to test the download_url that is being used explicitly by copying it from console and sending a request using Postman and it worked. I am not sure why it is not working with the app when other things are working so well.
What I want to do is to successfully be able to either download or at least get the data which I could store somewhere.
Version details:
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
Rails 3.2.18
soundcloud 0.2.0
There are a few things you have to understand before doing this.
Not every track on SoundCloud can be downloaded! Only tracks that are flagged as downloadable can be downloaded, so your code has to handle that case.
Your track URL has to be "resolved" before you can get download_url, and once you have download_url you have to append your client_id to get the final download URL.
Tracks can be big, and downloading them takes time! You should never do tasks like this straight from a controller or model in your Rails app. If a task runs long, always use a background worker or some other kind of background processing, such as Sidekiq.
Command-line client example
This is an example of a working client that you can use to download tracks from SoundCloud. It uses the official SoundCloud API wrapper for Ruby, assumes that you are using Ruby 1.9.x, and does not depend on Rails in any way.
# We use Bundler to manage our dependencies
require 'bundler/setup'
# We store SC_CLIENT_ID and SC_CLIENT_SECRET in .env
# and dotenv gem loads that for us
require 'dotenv'; Dotenv.load
require 'soundcloud'
require 'open-uri'
# Ruby 1.9.x has a problem with following redirects so we use this
# "monkey-patch" gem to fix that. Not needed in Ruby >= 2.x
require 'open_uri_redirections'
# First there is the authentication part.
client = SoundCloud.new(
client_id: ENV.fetch("SC_CLIENT_ID"),
client_secret: ENV.fetch("SC_CLIENT_SECRET")
)
# Track URL, publicly visible...
track_url = "http://soundcloud.com/forss/flickermood"
# We call SoundCloud API to resolve track url
track = client.get('/resolve', url: track_url)
# If track is not downloadable, abort the process
unless track["downloadable"]
puts "You can't download this track!"
exit 1
end
# We take track id, and we use that to name our local file
track_id = track.id
track_filename = "%s.aif" % track_id.to_s
download_url = "%s?client_id=%s" % [track.download_url, ENV.fetch("SC_CLIENT_ID")]
File.open(track_filename, "wb") do |saved_file|
open(download_url, allow_redirections: :all) do |read_file|
saved_file.write(read_file.read)
end
end
puts "Your track was saved to: #{track_filename}"
Also note that the files are in AIFF (Audio Interchange File Format). To convert them to MP3, you can run something like this with ffmpeg:
ffmpeg -i 293.aif final-293.mp3
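If the conversion should run from the same Ruby script, one option is to build the ffmpeg argument list in Ruby and hand it to system. This is only a minimal sketch: the helper name ffmpeg_command is hypothetical, the "final-" prefix simply mirrors the command above, and ffmpeg is assumed to be installed and on the PATH.

```ruby
# Hypothetical helper (not part of the SoundCloud gem): builds the ffmpeg
# argument list for converting a downloaded AIFF file to MP3, mirroring
# `ffmpeg -i 293.aif final-293.mp3`. The MP3 is written to the current
# working directory.
def ffmpeg_command(aiff_path)
  # Strip the directory and any extension: "293.aif" becomes "293"
  base = File.basename(aiff_path, ".*")
  ["ffmpeg", "-i", aiff_path, "final-#{base}.mp3"]
end

# Usage (assumes ffmpeg is on the PATH):
#   system(*ffmpeg_command("293.aif")) or abort("ffmpeg failed")
```

Passing the arguments as a list rather than one interpolated string avoids shell quoting problems if a filename ever contains spaces.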

Getting random "read_nonblock': end of file reached (EOFError)" with Net::HTTP.start

When I execute the following code...
http = Net::HTTP.start('jigsaw.w3.org')
http.request_post('/css-validator/validator', ' ', 'Content-type' => "multipart/form-data")
...then I very often get the following error:
EOFError: end of file reached
from /Users/josh/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:153:in `read_nonblock'
Is this only me? What could be the problem? Sometimes it seems to work, but most of the time it doesn't.
The problem seems to be on the side of the host:
Loading http://jigsaw.w3.org/css-validator/DOWNLOAD.html manually in a browser results most of the time in "no data received" at the moment.
I'm trying to set up the downloadable command line version of the validator on my local machine and use this. More info here: How can I validate CSS on internal web pages?
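If the remote host only drops the connection intermittently, a retry wrapper can paper over some of the failures. The following is a rough sketch, not a fix for the server-side problem: the helper name with_retries and its attempts and base_delay parameters are made up for illustration, and it only retries on EOFError and connection resets.

```ruby
require 'net/http'

# Hypothetical helper: runs the block and retries it when the server
# closes the connection early (EOFError) or resets it. After the last
# attempt the original exception is re-raised.
def with_retries(attempts: 3, base_delay: 0.5)
  tries = 0
  begin
    tries += 1
    yield
  rescue EOFError, Errno::ECONNRESET
    raise if tries >= attempts
    # Wait a little longer after each failed attempt
    sleep(base_delay * tries)
    retry
  end
end

# Usage, mirroring the request from the question:
#   response = with_retries do
#     Net::HTTP.start('jigsaw.w3.org') do |http|
#       http.request_post('/css-validator/validator', ' ',
#                         'Content-type' => 'multipart/form-data')
#     end
#   end
```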

Send request and response get from server - Need working code for this WSDL file in RUBY

Please help: I need Ruby code that calls the web service described by the WSDL file at the URL below. The parameter to pass is pcEmployeeCode.
The parameter is passed as a string, e.g. pcEmployeeCode="02385"
url ="http://online.mccolls.com.au:8080/wsa/wsa1/wsdl?targetURI=urn:OHWebServiceFMGT"
Please refer to this WSDL file and post your code if you successfully get a response.
Any ideas?
I tried the following, but I still get an error:
require 'rubygems'
#require 'soap4r'
require 'http-access2'
require 'soap/rpc/driver'
require 'net/http'
Net::HTTP.version_1_2
url = "http://online.mccolls.com.au:8080/wsa/wsa1/wsdl?targetURI=urn:OHWebServiceFMGT"
soap = SOAP::RPC::Driver.new(url,"urn:OHWebServiceFMGT")
soap.wiredump_dev = STDERR
soap.options["protocol.http.ssl_config.verify_mode"] = OpenSSL::SSL::VERIFY_NONE
soap.add_method('getEmployeeFmgtDetails','pcEmployeeCode')
puts soap.getEmployeeFmgtDetails('02385')

Adding 'curl' command to Sinatra and Rails?

(GAVE UP ON INSTALLING CURB. POSTED NEW QUESTION PER SUGGESTION OF ONE OF THE RESPONDENTS)
I thought 'curl' was built in, but I got an undefined method error in a Sinatra app. Is there a gem I need to add?
Same question for Rails 3.
My use case is that I simply have to 'hit' an external URL (http://kickstartme.someplace.com?action=ACTIONNAME&token=XYZXYZXYZ) to kickstart a remote process.
The external URL returns XML describing success or failure in this format:
<session>
<success>true</success>
<token>xyzxyzxyz</token>
<id>abcabcabc</id>
</session>
So really, ALL I need is for my Rails and Sinatra apps to hit that URL, parse whatever is returned, AND gracefully handle the remote server failing to reply.
require 'open-uri'
require 'nokogiri'
response = open("http://kickstartme.someplace.com?action=ACTIONNAME&token=XYZXYZXYZ").read
doc = Nokogiri::XML(response)
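The snippet above does not yet cover the "gracefully handle the remote server failing to reply" part. One way to sketch that is shown below. Note the swap: to stay free of gem dependencies it uses Net::HTTP and REXML instead of open-uri and Nokogiri (REXML ships as a bundled gem on Ruby 3.0 and later). The helper names parse_session and kickstart, the 5-second timeout, and the choice to return nil on any failure are all assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'
require 'rexml/document'

# Hypothetical helper: turns the <session> reply shown in the question
# into a Hash. Returns nil if the body is not a well-formed <session>.
def parse_session(xml)
  session = REXML::Document.new(xml).root
  return nil unless session && session.name == 'session'
  text = lambda { |name| (el = session.elements[name]) && el.text }
  {
    success: text.call('success') == 'true',
    token:   text.call('token'),
    id:      text.call('id')
  }
rescue REXML::ParseException
  nil
end

# Hypothetical helper: hits the URL and returns the parsed reply, or nil
# when the server times out, refuses the connection, or answers with a
# non-2xx status.
def kickstart(url, timeout: 5)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  return nil unless response.is_a?(Net::HTTPSuccess)
  parse_session(response.body)
rescue Timeout::Error, SocketError, SystemCallError, EOFError
  nil
end

# Usage:
#   reply = kickstart("http://kickstartme.someplace.com?action=ACTIONNAME&token=XYZXYZXYZ")
#   puts(reply && reply[:success] ? "started" : "remote process did not start")
```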
Use curb, a Ruby binding to libcurl. You will get all the curl features without having to shell out with system.
curl -b "auth=abcdef; ASP.NET_SessionId=lotsatext;" example.com
turns into
curl = Curl::Easy.new('http://example.com/')
curl.cookies = 'auth=abcdef; ASP.NET_SessionId=big-wall-of-text;'
curl.perform
More curb examples

Can I use Ruby's built-in RSS module to read an Atom feed?

I am in an environment where I don't have access to install any gems. I only have a standard Ruby installation (version 1.8.7).
I am trying something like this:
require 'rss/1.0'
require 'rss/2.0'
require 'open-uri'
source = "http://www.example.com/feed.atom" # url or local file
content = "" # raw content of rss feed will be loaded here
open(source) do |s| content = s.read end
rss = RSS::Parser.parse(content, false)
When I parse the content, I get nil. So I am wondering whether the built-in RSS module supports parsing an Atom feed.
Look under RSS::Maker to see what it can parse.
As an alternative, consider trying the nokogiri gem.
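Since installing gems is not an option in the question, another route is to read the Atom XML directly with REXML, which is part of the standard library on Ruby 1.8.7. The sketch below is only that: the helper name atom_entries and the choice to return title and link pairs are illustrative assumptions, and it does not use the RSS module at all.

```ruby
require 'rexml/document'

# Atom elements live in this XML namespace, so the XPath queries below
# bind it to the (arbitrary) prefix "atom".
ATOM_NS = { 'atom' => 'http://www.w3.org/2005/Atom' }

# Hypothetical helper: returns one [title, href] pair per <entry>
# in the feed, in document order.
def atom_entries(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//atom:entry', ATOM_NS).map do |entry|
    title = REXML::XPath.first(entry, 'atom:title', ATOM_NS)
    link  = REXML::XPath.first(entry, 'atom:link', ATOM_NS)
    [title && title.text, link && link.attributes['href']]
  end
end

# Usage, with `content` loaded as in the question:
#   atom_entries(content).each { |title, href| puts "#{title}: #{href}" }
```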
