Trying to use open-uri in Ruby, some HTML content is coming in as "Loading..."

I am trying to create a program that grabs a specific piece of information from a webpage so that I can compare it against the same piece of information fetched later. Right now I'm working on getting the piece of information that will change. The text in question appears if I inspect the element in the page, but when I fetch the page with open-uri it comes in as "Loading...". Is there a way to get all of the HTML text?
This is the current code I have:
require 'open-uri'
contents = open('https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841', &:read)
File.open("testing.txt", "w") do |file|
  file.puts contents
end
Any help getting the "Loading..." text to turn into the actual HTML content would be amazing.
Thanks

The problem
So, open-uri just makes an HTTP request and gives you access to the response body; in this case, the body is HTML. That HTML contains a placeholder for the data, which is what you're seeing. The HTML also tells the browser to load some JavaScript, which makes another request to the server for the data; when the data comes back, the JavaScript replaces the placeholder with the real content. So, to handle this, you ultimately need whatever comes back from that request the JavaScript makes.
Three solutions
Ordered from my least favourite to my most favourite.
You can try to evaluate the JavaScript yourself and have it operate on the HTML. This is going to be painful, so I don't recommend it, but if you want to go down that path, there's a gem called therubyracer (it wraps the V8 JavaScript engine); see the sketch after this list.
You can launch a web browser, let the browser handle all the JavaScript, and then ask the browser for the HTML after it's been updated. This is what Rahul's solution does, and it's a really nice solution. It's not my favourite because it's pretty heavyweight and you're limited to information displayed in the HTML. This approach is called "scraping", and it's fragile (a designer moves something around the page and your script breaks), and the information is in human presentation format, which means you usually have to do a lot of little parsing.
You can open your browser's devtools, go to the Network tab, filter to the XHR requests, and reload the page. One of those requests fetched the data that was used to fill in the placeholder. Figure out which one it is, and then you can make that request yourself. There are ways this can be fragile, too: e.g. sometimes you have to have the right cookies, and you often have to experiment to figure out how much of what the browser sent you actually need (usually it's far less than was sent, which is true in your case). Protip: when you do this, separate requesting the data from parsing and exploring it (i.e. save it to a file and then, while looking through the data, read it from the file rather than making a new request every time; this way it won't change on you and you won't get rate limited).
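If you wanted to try option 1, here's a minimal sketch using the mini_racer gem (a maintained successor to therubyracer; both embed V8). It only shows the idea of evaluating JavaScript from Ruby; wiring it up to the page's scripts, which expect a browser environment, is where the pain lies.
require 'mini_racer' # gem install mini_racer
context = MiniRacer::Context.new
# evaluating standalone JavaScript is the easy part
context.eval('1 + 1') # => 2
# the page's scripts expect a DOM and XHR, which you'd have to fake out yourself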
Solution #3
So, I was curious and went ahead and tried solution number 3 myself, and it worked pretty admirably. Check it out:
require 'uri'
require 'net/http'
# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)
# set some headers
req['origin'] = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache' # no caching, just in case,
req['pragma'] = 'no-cache' # we prob don't want stale data
# looks like you can pass it an awful lot of filters to use
req.set_form_data(
"page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
"distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
"entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
"minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
"filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
"displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
"inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
"daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
"minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
"minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
"maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)
# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"
# parse the response
require 'json'
json = JSON.parse res.body
# we're on page 1 of 1, and there are 48 results on this page
json['page'] # => 1
json['listings'].size # => 48
json['remainingResults'] # => false
# apparently we're looking at some sort of car or smth
json['modelId'] # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"
# a bunch of places sell this car
json['sellers'].size # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"
# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl'] # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString'] # => "$8,972"
listing['priceString'] # => "$6,890"
listing['daysOnMarket'] # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear'] # => 2006
listing['mileageString'] # => "81,803"
# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] } # => 0
json['listings'].count { |l| l['salvage'] } # => 0
# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
# ["Fair Deal", 11],
# ["No Price Analysis", 23],
# ["High Price", 8],
# ["Overpriced", 2]]

Your web page makes an AJAX request, and open-uri only returns the server-rendered page; it does not wait for the AJAX request to finish.
You can use the code below, which waits for the page to load:
# load the libraries
require 'watir'
browser = Watir::Browser.new
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"
# giving some time for website to load
sleep 2
puts browser.html
NOTE: you need chromedriver to run this script: http://chromedriver.chromium.org/downloads
If you don't want the URL to open in a visible browser window, you can run the browser headlessly.
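For example, a minimal sketch (assuming a reasonably recent Watir, which passes the headless option through to Chrome):
require 'watir'
# run Chrome without opening a visible window
browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841'
sleep 2 # still give the AJAX request time to finish
puts browser.html
browser.close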

Related

Proper way to upload a doc to FSCrawler for indexing in Elasticsearch

I'm prototyping a Rails application to upload documents to FSCrawler (running the REST interface), to incorporate into an Elasticsearch index. Using their example, this works:
response = `curl -F "file=@#{params[:document][:upload].tempfile.path}" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"`
The file gets uploaded, and the content gets indexed. This is an example of what I get:
"{\n \"ok\" : true,\n \"filename\" : \"RackMultipart20200130-91061-16swulg.pdf\",\n \"url\" : \"http://127.0.0.1:9200/local/_doc/d661edecf3e28572676e97a6f0d1d\",\n \"doc\" : {\n \"content\" : \"\\n \\n \\n\\nBasically, what you need to know is that Dante is all IP-based, and makes use of common IT standards. Each Dante device behaves \\n\\nmuch like any other network device you would already find on your network. \\n\\nIn order to make integration into an existing network easy, here are some of the things that Dante does: \\n\\n▪ Dante...
When I run curl at the command line, I get EVERYTHING, like the "filename" being properly set. If I use it as above, in the Rails controller, as you can see, the filename is set to the Tempfile's filename. That's not a workable solution. Trying to use params[:document][:upload].tempfile (without .path) or just params[:document][:upload] both fail entirely.
I'm trying to do this "the right way," but every incarnation of using a proper HTTP client to do this fails. I can't figure out how to invoke an HTTP POST that will submit a file to FSCrawler the way curl (on the command line) does it.
In this example, I'm just trying to send the file by using the Tempfile object. For some reason, FSCrawler gives me the error in the comment, and I get a little metadata, but no content is indexed:
## Failed to extract [100000] characters of text for ...
## org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
uri = URI("http://127.0.0.1:8080/fscrawler/_upload?debug=true")
request = Net::HTTP::Post.new(uri)
form_data = [['file', params[:document][:upload].tempfile,
{ filename: params[:document][:upload].original_filename,
content_type: params[:document][:upload].content_type }]]
request.set_form form_data, 'multipart/form-data'
response = Net::HTTP.start(uri.hostname, uri.port) do |http|
http.request(request)
end
If I change the above to use params[:document][:upload].tempfile.path, then I don't get the error about the InputStream, but I also (still) do not get any content indexed. This is an example of what I get:
{"_index":"local","_type":"_doc","_id":"72c9ecf2a83440994eb87d28786e6","_version":3,"_seq_no":26,"_primary_term":1,"found":true,"_source":{"content":"/var/folders/bn/pcc1h8p16tl534pw__fdz2sw0000gn/T/RackMultipart20200130-91061-134tcxn.pdf\n","meta":{},"file":{"extension":"pdf","content_type":"text/plain; charset=ISO-8859-1","indexing_date":"2020-01-30T15:33:45.481+0000","filename":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"},"path":{"virtual":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf","real":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"}}}
If I try to use RestClient, and I try to send the file by referencing the actual path to the Tempfile, then I get this error message, and I get nothing:
## Unsupported media type
response = RestClient.post 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  file: params[:document][:upload].tempfile.path,
  content_type: params[:document][:upload].content_type
If I try to .read() the file, and submit that, then I break the FSCrawler form:
## Internal server error
request = RestClient::Request.new(
  :method => :post,
  :url => 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  :payload => {
    :multipart => true,
    :file => File.read(params[:document][:upload].tempfile),
    :content_type => params[:document][:upload].content_type
  })
response = request.execute
Obviously, I've been trying this every way I can, but I can't replicate whatever curl is doing with any known Ruby-based HTTP clients. I'm utterly lost as to how to get Ruby to submit data to FSCrawler in a way that will get the document contents indexed properly. I've been at this far longer than I care to admit. What am I missing here?
I finally tried Faraday, and, based on this answer, came up with the following:
connection = Faraday.new('http://127.0.0.1:8080') do |f|
  f.request :multipart
  f.request :url_encoded
  f.adapter :net_http
end
file = Faraday::UploadIO.new(
  params[:document][:upload].tempfile.path,
  params[:document][:upload].content_type,
  params[:document][:upload].original_filename
)
payload = { :file => file }
response = connection.post('/fscrawler/_upload', payload)
Using Fiddler helped me to see the results of my attempts as I got closer and closer to the curl request. This snippet posts the request almost exactly as curl does. To route the call through the Fiddler proxy, I just needed to add , proxy: 'http://localhost:8866' to the end of the connection setup.
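For reference, a minimal sketch of that proxy variant (8866 is the Fiddler proxy port from the answer above):
connection = Faraday.new('http://127.0.0.1:8080', proxy: 'http://localhost:8866') do |f|
  f.request :multipart
  f.request :url_encoded
  f.adapter :net_http
end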

rack-attack isn't filtering blacklisted referers

I have set up the rack-attack config per the advanced configuration instructions. I am using Heroku and have confirmed that the env variable contains all of the URLs and that everything is properly formatted.
I have even gone into the console on Heroku and run the following:
req = Rack::Attack::Request.new({'HTTP_REFERER' => '4webmasters.org'})
and then tested with:
Rack::Attack.blacklisted?(req)
to which I get:
=> true
but in Google Analytics the referrals are still filled with every URL on my list. What am I missing?
My config includes this pretty standard block:
# Split on a comma with 0 or more spaces after it.
# E.g. ENV['HEROKU_VARIABLE'] = "foo.com, bar.com"
# spammers = ["foo.com", "bar.com"]
spammers = ENV['HEROKU_VARIABLE'].split(/,\s*/)
#
# Turn spammers array into a regexp
spammer_regexp = Regexp.union(spammers) # /foo\.com|bar\.com/
blacklist("block referer spam") do |request|
request.referer =~ spammer_regexp
end
#
HEROKU_VARIABLE =>
"ertelecom.ru, 16clouds.com, bee.lt, belgacom.be, virtua.com.br, nodecluster.net, telesp.net.br, belgacom.be, veloxzone.com.br, baidu.com, floating-share-buttons.com, 4webmasters.org, trafficmonetizer.org, webmonetizer.net, success-seo.com, buttons-for-website.com, videos-for-your-business.com, Get-Free-Traffic-Now.com, 100dollars-seo.com, e-buyeasy.com, free-social-buttons.com, traffic2money.com, erot.co, success-seo.com, semalt.com"
These referrers are Google Analytics referrer spam. They never actually hit your website, so blocking them with rack-attack is pointless; the data you see from them in GA is all fake. To stop this in your GA, set up a filter to ignore visits from those referrers.

`open_http': 403 Forbidden (OpenURI::HTTPError) for the string "Steve_Jobs" but not for any other string

I was going through the Ruby tutorials provided at http://ruby.bastardsbook.com/ and I encountered the following code:
require "open-uri"
remote_base_url = "http://en.wikipedia.org/wiki"
r1 = "Steve_Wozniak"
r2 = "Steve_Jobs"
f1 = "my_copy_of-" + r1 + ".html"
f2 = "my_copy_of-" + r2 + ".html"
# read the first url
remote_full_url = remote_base_url + "/" + r1
rpage = open(remote_full_url).read
# write the first file to disk
file = open(f1, "w")
file.write(rpage)
file.close
# read the second url
remote_full_url = remote_base_url + "/" + r2
rpage = open(remote_full_url).read
# write the second file to disk
file = open(f2, "w")
file.write(rpage)
file.close
# open a new file:
compiled_file = open("apple-guys.html", "w")
# reopen the first and second files again
k1 = open(f1, "r")
k2 = open(f2, "r")
compiled_file.write(k1.read)
compiled_file.write(k2.read)
k1.close
k2.close
compiled_file.close
The code fails with the following trace:
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:277:in `open_http': 403 Forbidden (OpenURI::HTTPError)
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:518:in `open'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:30:in `open'
from /Users/arkidmitra/tweetfetch/samecode.rb:11
My problem is not that the code fails but that whenever I change r2 to anything other than Steve_Jobs, it works. What is happening here?
Your code runs fine for me (Ruby MRI 1.9.3) when I request a wiki page that exists.
When I request a wiki page that does NOT exist, I get a MediaWiki 404 error.
Steve_Jobs => success
Steve_Austin => success
Steve_Rogers => success
Steve_Foo => error
Wikipedia does a ton of caching, so if you see responses for "Steve_Jobs" that differ from those for other people who do exist, my best guess is that Wikipedia is caching the Steve Jobs article because he's famous, and potentially adding extra checks/verifications to protect the article from rapid changes, defacings, etc.
The solution for you: always open the url with a User Agent string.
rpage = open(remote_full_url, "User-Agent" => "Whatever you want here").read
Details from the Mediawiki docs: "When you make HTTP requests to the MediaWiki web service API, be sure to specify a User-Agent header that properly identifies your client. Don't use the default User-Agent provided by your client library, but make up a custom header that includes the name and the version number of your client: something like "MyCuteBot/0.1".
On Wikimedia wikis, if you don't supply a User-Agent header, or you supply an empty or generic one, your request will fail with an HTTP 403 error. See our User-Agent policy."
I think this happens for locked-down entries like "Steve Jobs", "Al Gore", etc. This is specified in the same book that you are referring to:
For some pages – such as Al Gore's locked-down entry – Wikipedia will not respond to a web request if a User-Agent isn't specified. The "User-Agent" typically refers to your browser, and you can see this by inspecting the headers you send for any page request in your browser. By providing a "User-Agent" key-value pair (I basically use "Ruby" and it seems to work), we can pass it as a hash (I use the constant HEADERS_HASH in the example) as the second argument of the method call.
It is specified later at http://ruby.bastardsbook.com/chapters/web-crawling/
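Putting that together, a minimal sketch along the lines the book describes (HEADERS_HASH is the book's name for the constant; the User-Agent value here is just an example):
require 'open-uri'
HEADERS_HASH = { "User-Agent" => "MyFetcher/0.1 (Ruby)" }
# the headers hash goes in as the second argument to open
rpage = open("http://en.wikipedia.org/wiki/Steve_Jobs", HEADERS_HASH).read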

How to calculate the amount of data downloaded and the total data to be downloaded in Ruby

I'm trying to build a desktop client that manages some downloads with Ruby. I would like to know how to determine how much of the data has been downloaded and the total size of the data to be downloaded.
I'm trying to do this with Ruby, so any help would be useful.
Thanks in advance.
Like Wayne said in his comment, it depends on the protocol used to transfer the files. With HTTP, for example, the HTTP response will include a Content-Length header which tells you the length of the file you are downloading. After that, you have to keep track of the number of bytes that you've read from the HTTP connection.
Something like this seems to work (for HTTP), but I wouldn't be surprised if it could be done more elegantly:
require 'net/http'
url = URI.parse('http://www.google.com/index.html')
req = Net::HTTP::Get.new(url.path)
Net::HTTP.start(url.host, url.port) do |http|
  http.request(req) do |res|
    remaining = res.content_length
    puts "total length: #{remaining}"
    res.read_body do |segment|
      puts "read #{segment.length} bytes"
      remaining = remaining - segment.length
      puts "#{remaining} bytes remaining"
    end
  end
end
www.google.com/index.html is a bad example since the content gets returned in one segment, but try it on a larger object and you should see multiple "read..." lines.
If you're using Net::HTTP, then the length of whatever you're requesting should be in the response header. Net::HTTP responses mix in Net::HTTPHeader, where you'll find content_length(). Note that it only works if the size is determined before the transfer happens.
Net::HTTPResponse has a method that reads the body in chunks, so you can use that to determine progress: start at 0, add the length of each chunk as it arrives, and compare the running total to the total size (see the sketch after the example below).
http.request_get('/index.html') { |res|
  res.read_body do |segment|
    print segment
  end
} # example taken from the Ruby documentation
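A minimal sketch of that progress calculation (www.google.com is just a stand-in URL; any server that sends Content-Length will do):
require 'net/http'

uri = URI('http://www.google.com/index.html')
Net::HTTP.start(uri.host, uri.port) do |http|
  http.request_get(uri.path) do |res|
    total = res.content_length # may be nil if the server streams with chunked encoding
    read  = 0
    res.read_body do |segment|
      read += segment.length
      puts "#{read} / #{total} bytes (#{(100.0 * read / total).round(1)}%)" if total
    end
  end
end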
If you're using FTP, then it should be easier, through Net::FTP. Connect to the server, get the size of a given file with size(filename), and then download the file with get, getbinaryfile or gettextfile.
This is the signature of the get method: get(remotefile, localfile = File.basename(remotefile), blocksize = DEFAULT_BLOCKSIZE) {|data| ...}
ftp.get('file.something', 'file.something.local', 1024) { |data|
  puts "Downloaded #{data.size} more bytes"
}
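Combining size with the download block gives you progress; a minimal sketch (the host, credentials, and filenames are placeholders):
require 'net/ftp'

Net::FTP.open('ftp.example.com') do |ftp|
  ftp.login('user', 'password')
  total = ftp.size('file.something') # total bytes, reported by the server
  read  = 0
  ftp.getbinaryfile('file.something', 'file.something.local', 1024) do |data|
    read += data.size
    puts "#{read} / #{total} bytes"
  end
end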

Multipart File Upload in Ruby

I simply want to upload an image to a server with POST. As simple as this task sounds, there seems to be no simple solution in Ruby.
In my application I am using WWW::Mechanize for most things, so I wanted to use it for this too. I had code like this:
f = File.new(filename, File::RDWR)
reply = agent.post(
  'http://rest-test.heroku.com',
  {
    :pict => f,
    :function => 'picture2',
    :username => @username,
    :password => @password,
    :pict_to => 0,
    :pict_type => 0
  }
)
f.close
This results in a totally garbled file on the server that looks scrambled all over.
My next step was to downgrade WWW::Mechanize to version 0.8.5. This worked until I tried to run it, which failed with an error like "Module not found in hpricot_scan.so". Using the Dependency Walker tool I found that hpricot_scan.so needed msvcrt-ruby18.dll. Yet after I put that .dll into my Ruby/bin folder, it gave me an empty error box, from which point I couldn't debug much further. So the problem here is that Mechanize 0.8.5 depends on Hpricot instead of Nokogiri (which works flawlessly).
The next idea was to use a different gem, so I tried Net::HTTP. After some research I found that there is no native support for multipart forms in Net::HTTP; instead you have to build a class that does the encoding and so on for you. The most helpful thing I could find was the Multipart class by Stanislav Vitvitskiy. This class looked good, but it does not do what I need, because I don't want to post only files; I also want to post normal data, and that is not possible with his class.
My last attempt was to use RestClient. This looked promising, as there were examples of how to upload files. Yet I couldn't get it to post the form as multipart.
f = File.new(filename, File::RDWR)
reply = RestClient.post(
  'http://rest-test.heroku.com',
  :pict => f,
  :function => 'picture2',
  :username => @username,
  :password => @password,
  :pict_to => 0,
  :pict_type => 0
)
f.close
I am using http://rest-test.heroku.com, which sends the request back for debugging, and I always get this back:
POST http://rest-test.heroku.com/ with a 101 byte payload,
content type application/x-www-form-urlencoded
{
"pict" => "#<File:0x30d30c4>",
"username" => "s1kx",
"pict_to" => "0",
"function" => "picture2",
"pict_type" => "0",
"password" => "password"
}
This clearly shows that it does not use multipart/form-data as content-type but the standard application/x-www-form-urlencoded, although it definitely sees that pict is a file.
How can I upload a file in Ruby to a multipart form without implementing the whole encoding and data aligning myself?
Long problem, short answer: I was missing the binary mode for reading the image under Windows.
f = File.new(filename, File::RDWR)
had to be
f = File.new(filename, "rb")
Another method is to use Bash and curl. I used this method when I wanted to test multiple file uploads.
bash_command = 'curl -v -F "file=@texas.png,texas_reversed.png" http://localhost:9292/fog_upload/upload'
command_result = `#{bash_command}` # the backticks are important
puts command_result
