Sinatra + Mongo parallel requests - ruby

I'm developing a small script that does some data crunching. If I run ab -n10 -c1 (a benchmark sending requests one after another), the requests take ~750ms. If instead I use -c2 (sending requests two at a time), the requests seem to take >2s. Here's what the code looks like:
get '/url/' do
# ...
# Images is a mongodb collection
Images.find({'searchable_data.i' => {
'$in' => color_codes
}}).limit(5_000).each do |image|
found_images << {
:url => image['url'],
:searchable_data => image['searchable_data'],
}
end
# ...
From debugging I noticed that the requests to Mongo fire at roughly the same time and return at roughly the same time, but they take >2x the time they would if I ran them one at a time. I've also watched the CPU/memory usage of the mongo processes, and mongo doesn't even flinch. Here's how I connect to Mongo:
configure do
# ...
server_connection = Mongo::Connection.new(db.host, db.port, :pool_size => 60)
DB = server_connection.db(db_name)
DB.authenticate(db.user, db.password) unless (db.user.nil? || db.password.nil?)
Images = DB[:Images]
end
Is there something I'm doing wrong? I can't imagine the Mongodb driver being that bad.

Related

I lose the user session with Ruby + Sinatra + Puma + Sequel only when Puma worker processes > 1

My app on Heroku (Ruby + Sinatra + Puma + Sequel) works fine with one worker process. When I increase the worker processes to 2, or increase the dynos to 2, I start losing the user session randomly at different points in the system, which makes it very difficult to locate the specific error in the Heroku logs.
The same app works fine with:
But you lose the value of session[:usuario] with:
My app rack sinatra class:
class Main < Sinatra::Application
use Rack::Session::Pool
set :protection, :except => :frame_options
def usuarioLogueado?
if defined?( session[:usuario] )
if session[:usuario].nil?
return false
else
return true
end
else
return false
end
end
get "/" do
if usuarioLogueado?
redirect "/app"
.....
else
redirect "/home"
end
end
end
My sequel connection:
pool_size = 10
# db = Sequel.connect(strConexion, :max_connections => pool_size)
# db.extension(:connection_validator)
# db.pool.connection_validation_timeout = -1
My puma.rb: (20 connections max DB)
workers Integer(ENV['WEB_CONCURRENCY'] || 1)
threads_count = Integer(ENV['MAX_THREADS'] || 10)
threads threads_count, threads_count
preload_app!
rackup DefaultRackup
port ENV['PORT'] || 3000
Rack::Session::Pool is a simple memory based session store. Each process has its own store and they are not shared between processes or hosts. When a request gets directed to a different dyno or different process on the same dyno, the session data will not be available.
You could look at sticky sessions, but they won’t work in all situations (e.g. when dynos are created or destroyed) and won’t work at all if you have multiple processes on a single dyno.
You should look at using cookie based sessions, or set up a shared server side store such as memcached with Dalli, so that it doesn’t matter which dyno or process each request is routed to.
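To illustrate the difference, here is a minimal standard-library sketch (not the actual Rack implementation). The two hashes stand in for the per-process stores that Rack::Session::Pool keeps, and encode_session / decode_session are hypothetical helpers showing the idea behind a signed cookie: the session data travels with the request, so any worker that shares the secret can read it. In Sinatra itself, cookie sessions are what `enable :sessions` gives you, provided every worker and dyno uses the same `set :session_secret` value.

```ruby
require 'openssl'
require 'json'

# Two in-memory stores, one per "worker process" (assumption: this mimics
# how a per-process memory store behaves; it is not Rack's code).
store_worker_a = {}
store_worker_b = {}

# The login request happens to land on worker A.
store_worker_a['session-id-1'] = { 'usuario' => 'ana' }

found_on_a = store_worker_a.key?('session-id-1') # worker A knows the session
found_on_b = store_worker_b.key?('session-id-1') # worker B has never seen it

# Signed-cookie idea: serialize the data and sign it with a shared secret.
SECRET = 'shared-secret-for-illustration-only'

def encode_session(data, secret)
  payload = [JSON.generate(data)].pack('m0') # Base64 without newlines
  signature = OpenSSL::HMAC.hexdigest('SHA256', secret, payload)
  "#{payload}--#{signature}"
end

def decode_session(cookie, secret)
  payload, signature = cookie.split('--', 2)
  return nil if payload.nil? || signature.nil?
  expected = OpenSSL::HMAC.hexdigest('SHA256', secret, payload)
  # A production implementation would use a constant-time comparison here.
  return nil unless signature == expected
  JSON.parse(payload.unpack1('m0'))
end

cookie = encode_session({ 'usuario' => 'ana' }, SECRET)
session_seen_by_any_worker = decode_session(cookie, SECRET)
```

Keep in mind that cookie sessions are limited by the browser's cookie size, so a shared server-side store is the better fit when the session holds a lot of data.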

Trying to use open-uri in ruby, some HTML contents are coming in as "Loading..."

I am trying to write a program that reads a specific value on a web page and later compares it against a fresh read. I'm currently working on extracting the piece of information that will change. The text that changes appears when I inspect the element in the browser, but when I use open-uri it comes in as "Loading..." (see picture). Is there a way to get all the HTML text?
Picture here.
This is the current code I have
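require 'open-uri'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
contents = URI.open('https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841', &:read)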
File.open("testing.txt", "w") do |line|
line.puts "\r" + "#{contents}"
end
Any help to get the Loading... to change to the actual HTML code would be amazing.
Thanks
The problem
So, open-uri just makes HTTP requests and gives you access to the body. In this case, the body is HTML. That HTML has a placeholder for this data, which is what you're seeing. The HTML then loads some JavaScript that makes another request to the server to get the data, and when the data comes in, it replaces the placeholder with the real data. So, to handle this, you ultimately need whatever is coming back from the request the JavaScript is making.
Three solutions
Ordered from my least favourite to my most favourite.
1. You can try to evaluate the JavaScript to have it operate on the html. This is going to be painful, so I don't recommend it, but if you wanted to go down that path, I think there's a gem called "therubyracer" or something (IIRC, it wraps V8).
2. You can launch a web browser, let the browser handle all the cray cray, and then ask the browser for the html after it's been updated. This is what Rahul's solution does, and it's a really nice solution. It's not my favourite because it's pretty heavy and you're relegated to information displayed in the html. This is called "scraping", and it's pretty fragile (some designer moves something around the page and your script breaks), and the information is in human presentation format, which means you usually have to do a lot of little parsing things.
3. You can open your browser's devtools, go to the network tab, filter to the XHR requests, and reload the page. One of these made the request to get the data that was used to fill in the placeholder. Figure out which one it is and then you can make that request yourself. There are ways this can be fragile, too, e.g. sometimes you have to have the right cookies, and you often have to experiment with what the browser sent to figure out how much of it you need (usually it's way less than was sent, which is true for your case). Protip: When you do this, separate requesting the data from parsing and exploring it (i.e. save it to a file and then, while looking through the data, get it from the file rather than making a new request every time... this way it won't change on you and you won't get rate limited).
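The protip about separating the request from the exploration could look something like this. It's only a sketch: fetch_cached is a hypothetical helper name, the cache path is arbitrary, and the lambda stands in for the real HTTP request so the example runs without touching the network.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: return the saved body if the cache file exists,
# otherwise run the block (the real request), save its result, and return it.
def fetch_cached(path)
  return File.read(path) if File.exist?(path)
  body = yield
  File.write(path, body)
  body
end

cache_path = File.join(Dir.tmpdir, "listings_cache_#{Process.pid}.json")
File.delete(cache_path) if File.exist?(cache_path) # start clean for the demo

# Stand-in for the real request; it counts how often it gets called.
request_count = 0
fake_request = lambda do
  request_count += 1
  JSON.generate('listings' => [{ 'carYear' => 2006 }])
end

first_body  = fetch_cached(cache_path, &fake_request) # performs the "request"
second_body = fetch_cached(cache_path, &fake_request) # served from the file

listings = JSON.parse(second_body)['listings']
```

While exploring the JSON you keep calling fetch_cached, and you delete the cache file whenever you want fresh data.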
Solution #3
So, I was curious and went ahead and tried solution number 3 myself, and it worked pretty admirably, check it out:
require 'uri'
require 'net/http'
# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)
# set some headers
req['origin'] = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache' # no caching, just in case,
req['pragma'] = 'no-cache' # we prob don't want stale data
# looks like you can pass it an awful lot of filters to use
req.set_form_data(
"page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
"distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
"entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
"minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
"filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
"displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
"inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
"daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
"minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
"minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
"maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)
# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"
# parse the response
require 'json'
json = JSON.parse res.body
# we're on page 1 of 1, and there are 48 results on this page
json['page'] # => 1
json['listings'].size # => 48
json['remainingResults'] # => false
# apparently we're looking at some sort of car or smth
json['modelId'] # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"
# a bunch of places sell this car
json['sellers'].size # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"
# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl'] # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString'] # => "$8,972"
listing['priceString'] # => "$6,890"
listing['daysOnMarket'] # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear'] # => 2006
listing['mileageString'] # => "81,803"
# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] } # => 0
json['listings'].count { |l| l['salvage'] } # => 0
# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
# ["Fair Deal", 11],
# ["No Price Analysis", 23],
# ["High Price", 8],
# ["Overpriced", 2]]
Your web page loads its data with an Ajax request, and open-uri only returns the page as the server sent it; it does not wait for the Ajax request to complete.
You can use the code below, which gives the page time to load:
#load the libraries
require 'watir'
browser = Watir::Browser.new
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"
# giving some time for website to load
sleep 2
puts browser.html
NOTE: you need chromedriver to use this script: http://chromedriver.chromium.org/downloads
If you don't want to open the URL in a visible browser, you can use headless WebKit.

metriks log to file not working

I wrote a basic program to test the ruby metriks gem
require 'metriks'
require 'metriks/reporter/logger'
@registry = Metriks::Registry.new
@logger = Logger.new('/tmp/metrics.log')
@reporter = Metriks::Reporter::Logger.new(:logger => @logger)
@reporter.start
@registry.meter('tasks').mark
print "Hello"
@registry.meter('tasks').mark
@reporter.stop
After I execute the program, there is nothing in the log other than the line written when it was created.
$ cat /tmp/metrics.log
# Logfile created on 2015-06-15 14:23:40 -0700 by logger.rb/44203
You should either pass in your own registry when instantiating Metriks::Reporter::Logger or use the default registry (Metriks::Registry.default) if you are using a logger to log metrics.
Also, the default log write interval is 60 seconds, and your code completes before that, so even if everything is set up correctly, nothing gets recorded. Since you want to use your own registry, this should work for you (I'm adding a short sleep because I'm using an interval of 1 second):
require 'metriks'
require 'metriks/reporter/logger'
@registry = Metriks::Registry.new
@logger = Logger.new('/tmp/metrics.log')
@reporter = Metriks::Reporter::Logger.new(:logger => @logger,
:registry => @registry,
:interval => 1)
@reporter.start
@registry.meter('tasks').mark
print "Hello"
@registry.meter('tasks').mark
# Just giving it a little time so the metrics will be recorded.
sleep 2
@reporter.stop
But I don't really think short intervals are good.
UPDATE: Also, I think @reporter.write will write the metrics immediately, regardless of the interval, so you don't have to use sleep (better).

Get all open pull requests from an organisation using the Github API Ruby gem

For our organisation's dashboard, I'd like to keep a count of all the open PRs on all our repositories. At the moment, all I've got is to loop through all the repos, and count through all the open PRs on each repo like so (which often results in a rate limit error):
connection = Github.new oauth_token: MY_OAUTH_TOKEN
pulls = 0
connection.repos.list(:org => GITHUB_ORGANISATION).each do |repo|
pulls += connection.pull_requests.list(:user => repo['owner']['login'], :repo => repo['name']).count
end
I know there must be a nicer way round this. Any ideas? (short of screen scraping!)
OK, so I think I've cracked this now. Pull requests are issues, so I can get all issues, and loop through the issues like so:
pulls = 0
issues = connection.issues.list(:org => GITHUB_ORGANISATION, :filter => 'all', :auto_pagination => true)
issues.each do |issue|
if issue["pull_request"]
pulls += 1
end
end
Once you remember that pull requests are issues too, everything just falls into place.
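The counting loop can also be written with Enumerable#count. The sketch below runs on made-up sample hashes rather than a live API response; the only assumption carried over from the code above is that an issue hash has a "pull_request" key when the issue is a pull request.

```ruby
# Made-up sample data standing in for the issues returned by the API.
issues = [
  { 'number' => 1, 'title' => 'Fix typo',       'pull_request' => { 'url' => 'placeholder' } },
  { 'number' => 2, 'title' => 'Crash on start' },
  { 'number' => 3, 'title' => 'Add dashboard',  'pull_request' => { 'url' => 'placeholder' } }
]

# Count only the issues that are pull requests.
pulls = issues.count { |issue| issue['pull_request'] }
```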

MongoDB return codes meaning (ruby driver)

I'm calling collection update from the Ruby driver for MongoDB and get a return code of 117.
How do I generally interpret the error codes that I get?
If you are using safe mode, the update method returns a hash containing the output of getLastError. However, when you are not using safe mode, we simply return the number of bytes that were sent to the server.
# setup connection & get handle to collection
connection = Mongo::Connection.new
collection = connection['test']['test']
# remove existing documents
collection.remove
=> true
# insert test document
collection.insert(:_id => 1, :a => 1)
=> 1
collection.find_one
=> {"_id"=>1, "a"=>1}
# we sent a message with 64 bytes to a mongod
collection.update({_id: 1},{a: 2.0})
=> 64 # number of bytes sent to server
# with safe mode we updated one document -- output of getLastError command
collection.update({_id: 1},{a: 3.0}, :safe => true)
=> {"updatedExisting"=>true, "n"=>1, "connectionId"=>19, "err"=>nil, "ok"=>1.0}
This is something that could be made clearer in the documentation. I will update it for the next ruby driver release.
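If your code has to cope with both modes, one option is to branch on the type of the return value. The helper below is a hypothetical sketch (describe_update_result is not part of the driver), and it assumes only the two return shapes described above: an Integer byte count without safe mode and a getLastError hash with it.

```ruby
# Hypothetical helper: turn the return value of an update into a readable message.
def describe_update_result(result)
  case result
  when Integer
    # Without safe mode the number is only the size of the message sent.
    "unacknowledged write: #{result} bytes sent to the server"
  when Hash
    # With safe mode the hash is the output of getLastError.
    "acknowledged write: n=#{result['n']}, updatedExisting=#{result['updatedExisting']}"
  else
    "unexpected result: #{result.inspect}"
  end
end

unsafe_message = describe_update_result(117)
safe_message = describe_update_result(
  'updatedExisting' => true, 'n' => 1, 'connectionId' => 19, 'err' => nil, 'ok' => 1.0
)
```

So a bare 117 is not an error code at all; it just means a 117-byte message went out and nothing was confirmed.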

Resources