I can successfully upload a single file using a Mechanize form like this:
def add_attachment(form, attachments)
  attachments.each_with_index do |attachment, i|
    form.file_uploads.first.file_name = attachment[:path]
  end
end
Here, form is a Mechanize form. But if attachments has more than one element, the last one overwrites the previous ones, because the first accessor always returns the same element of the file_uploads array.
To fix this, I tried the following, which results in an error because there is only one element in this array.
def add_attachment(form, attachments)
  attachments.each_with_index do |attachment, i|
    form.file_uploads[i].file_name = attachment[:path]
  end
end
If I try to create a new file_upload object, it also doesn't work:
def add_attachment(form, attachments)
  attachments.each_with_index do |attachment, i|
    form.file_uploads[i] ||= Mechanize::Form::FileUpload.new(form, attachment[:path])
    form.file_uploads[i].file_name = attachment[:path]
  end
end
Any idea how I can upload multiple files using Mechanize?

So, I solved this issue, but not exactly how I imagined it would work out.
The site I was trying to upload files to was a Redmine project. Redmine uses jQuery UI for the file uploader, which confused me, since Mechanize doesn't run JavaScript. But it turns out that Redmine degrades nicely if JavaScript is disabled, and I could take advantage of this.
When JavaScript is disabled, only one file at a time can be uploaded in the edit form, but going to the 'edit' URL for the issue that was just created gives the chance to upload a second file. My solution was to attach a file, submit the form, and then click the 'Update' link on the resulting page, which presented a new form with another upload field that I could use to attach the next file. I did this for all attachments but the last, so that the remaining form processing could be completed and the form submitted a final time. Here is the relevant bit of code:
def add_attachment(agent, form, attachments)
  attachments.each_with_index do |attachment, i|
    form.file_uploads.first.file_name = attachment[:path]
    if i < attachments.length - 1
      submit_form(agent, form)
      agent.page.links_with(text: 'Update').first.click
      form = get_form(agent)
    end
  end
end

I used the following:
form.file_uploads[0].file_name = "path to the first file to be uploaded"
form.file_uploads[1].file_name = "path to the second file to be uploaded"
form.file_uploads[2].file_name = "path to the third file to be uploaded"
and it worked fine. Hope this helps.
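
The indexed approach only works when the page's form already contains one file field per attachment. Below is a minimal sketch of a helper that pairs each attachment with its own field and fails loudly when there are too few. The name assign_uploads is hypothetical, and the only Mechanize calls it assumes are file_uploads and file_name= as used in the snippets above. The stand-in structs exist only so the sketch runs without a live page.

```ruby
# Hypothetical helper: give each attachment its own file field.
# Relies only on form.file_uploads and FileUpload#file_name=, as used above.
def assign_uploads(form, attachments)
  fields = form.file_uploads
  if attachments.length > fields.length
    raise ArgumentError,
          "form has #{fields.length} file field(s), need #{attachments.length}"
  end

  attachments.each_with_index do |attachment, i|
    fields[i].file_name = attachment[:path]
  end
  form
end

# Stand-ins for a Mechanize form, used only to exercise the helper offline.
FakeUpload = Struct.new(:file_name)
FakeForm = Struct.new(:file_uploads)

demo_form = FakeForm.new([FakeUpload.new, FakeUpload.new])
assign_uploads(demo_form, [{ path: "a.txt" }, { path: "b.txt" }])
```

If the error is raised, the form really has fewer upload fields than attachments, and submitting one file per request (as in the Redmine answer) is the likely fallback.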


How to check that a PDF file has some link with Ruby/Rspec?

I am using prawnpdf/pdf-inspector to test that content of a PDF generated in my Rails app is correct.
I would like to check that the PDF file contains a link with a certain URL. I looked at yob/pdf-reader but haven't found any useful information on this topic.
Is it possible to test URLs within a PDF with Ruby/RSpec?
I would like something like the following:
expect(urls_in_pdf(pdf)).to include 'https://example.com/users/1'
pdf-reader (https://github.com/yob/pdf-reader) provides a method called text on each page.
Do something like:
pdf = PDF::Reader.new("tmp/pdf.pdf")
assert pdf.pages[0].text.include? 'https://example.com/users/1'
assuming what you are looking for is on the first page.
Since pdf-inspector seems only to return text, you could try to use the pdf-reader directly (pdf-inspector uses it anyways).
reader = PDF::Reader.new("somefile.pdf")
reader.pages.each do |page|
  puts page.raw_content # This should also give you the link
end
I only took a quick look at the GitHub page, so I am not sure exactly what raw_content returns. But there is also a low-level method to directly access the objects of the PDF:
reader = PDF::Reader.new("somefile.pdf")
puts reader.objects.inspect
With that it surely is possible to get the url.
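
Clickable links in a PDF are usually stored as link annotations on the page rather than in the page text, so one option is to walk each page's annotations via the low-level objects. The following is a minimal sketch, not a tested recipe: uris_from_annotations and urls_in_pdf are hypothetical names, and the glue assumes that page.attributes[:Annots] and reader.objects.deref behave as described in the pdf-reader documentation. The gem is required lazily so the pure helper can be exercised without it.

```ruby
# Pure helper: pull URI targets out of already-dereferenced annotation hashes.
def uris_from_annotations(annotations)
  annotations
    .select { |a| a.is_a?(Hash) && a[:Subtype] == :Link && a[:A].is_a?(Hash) }
    .map { |a| a[:A][:URI] }
    .compact
end

# Assumed glue for pdf-reader: read each page's /Annots array and
# dereference the entries and their /A action dictionaries.
def urls_in_pdf(path)
  require "pdf/reader"
  reader = PDF::Reader.new(path)
  reader.pages.flat_map do |page|
    annots = reader.objects.deref(page.attributes[:Annots]) || []
    resolved = annots.map do |ref|
      annot = reader.objects.deref(ref)
      annot.is_a?(Hash) ? annot.merge(A: reader.objects.deref(annot[:A])) : annot
    end
    uris_from_annotations(resolved)
  end
end

# Hand-written sample data shaped like PDF annotation dictionaries.
sample = [
  { Subtype: :Link, A: { S: :URI, URI: "https://example.com/users/1" } },
  { Subtype: :Text, Contents: "a note" },
  { Subtype: :Link, Dest: [1, :Fit] } # internal link, no URI action
]
found = uris_from_annotations(sample)
```

With a helper like this, the expectation from the question could be written against urls_in_pdf("tmp/pdf.pdf"), provided the generated PDF stores its links as URI actions.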

How to not show extracted links and scraped items?

Newbie here, running Scrapy on Windows. How can I avoid showing the extracted links and crawled items in the command window? I found comments in the "parse" section at this link: http://doc.scrapy.org/en/latest/topics/commands.html. I'm not sure whether it's relevant or how to apply it if so. Here is more detail with part of the code, starting from my second Ajax request (in the first Ajax request, the callback function is "first_json_response"):
def first_json_response(self, response):
    data = json.loads(response.body)
    meta = {'results': data['results']}
    yield Request(url=url, callback=self.second_json_response,
                  headers={'x-requested-with': 'XMLHttpRequest'}, meta=meta)

def second_json_response(self, response):
    meta = response.meta
    data2 = json.loads(response.body)
The "second_json_response" callback retrieves the response to the request made in first_json_response and loads the newly requested data. "meta" and "data" are then both used to define the items that need to be crawled. Currently, the meta and links are shown in the Windows terminal where I run my code. I guess it takes the computer some extra time to show them on the screen, so I want them to disappear. I hope that running Scrapy in a kind of batch mode will speed up my lengthy crawling process.
Thanks! I really appreciate your comment and suggestion!
From the Scrapy documentation:
"You can set the log level using the --loglevel/-L command line option, or using the LOG_LEVEL setting."
So append --loglevel=ERROR to your scrapy crawl command. That should make all the info output disappear from your command line, but I don't think it will speed things up much.
In your pipelines.py file, try using something like:
import json

class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)  # write the serialized item to the file
        return item
This way, when you yield an item from your spider class, it will be written to items.jl.
Hope that helps.

Avoid repeated calls to an API in Jekyll Ruby plugin

I have written a Jekyll plugin to display the number of pageviews on a page by calling the Google Analytics API using the garb gem. The only trouble with my approach is that it makes a call to the API for each page, slowing down build time and also potentially hitting the user call limits on the API.
It would be possible to return all the data in a single call and store it locally, and then look up the pageview count from each page, but my Jekyll/Ruby-fu isn't up to scratch. I do not know how to write the plugin to run once to get all the data and store it locally where my current function could then access it, rather than calling the API page by page.
Basically my code is written as a liquid block that can be put into my page layout:
class GoogleAnalytics < Liquid::Block
  def initialize(tag_name, markup, tokens)
    super # options that appear in block (between tag and endtag)
    @options = markup # optional options passed in by opening tag
  end

  def render(context)
    path = super
    # Read in credentials and authenticate
    cred = YAML.load_file("/home/cboettig/.garb_auth.yaml")
    Garb::Session.api_key = cred[:api_key]
    token = Garb::Session.login(cred[:username], cred[:password])
    profile = Garb::Management::Profile.all.detect {|p| p.web_property_id == cred[:ua]}
    # place query, customize to modify results
    data = Exits.results(profile,
                         :filters => {:page_path.eql => path},
                         :start_date => Chronic.parse("2011-01-01"))
    # (rest of render omitted in this excerpt)
  end
end
Full version of my plugin is here
How can I move all the calls to the API to some other function and make sure jekyll runs that once at the start, and then adjust the tag above to read that local data?
EDIT: It looks like this can be done with a Generator that writes the data to a file. See the example on this branch. Now I just need to figure out how to subset the results: https://github.com/Sija/garb/issues/22
To store the data, I had to:
1. Write a Generator class (see Jekyll wiki plugins) to call the API.
2. Convert the data to a hash (for easy lookup by path, see 5):
   result = Hash[data.collect{|row| [row.page_path, [row.exits, row.pageviews]]}]
3. Write the data hash to a JSON file.
4. Read in the data from the file in my existing Liquid block class. Note that the block tag works from the _includes dir, while the generator works from the root directory.
5. Match the page path, which is easy once the data is converted to a hash.
Code for the full plugin, showing how to create the generator and write files, etc., is here.
And thanks to Sija on GitHub for help on this.
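
The caching steps can be sketched in plain Ruby without Jekyll or garb. This is only an illustration under stated assumptions: Row is a stand-in for the API result rows, write_cache and pageviews_for are hypothetical names, and the cache file location is made up. In the real plugin the write would happen once in the Generator and the read in the Liquid block.

```ruby
require "json"
require "tmpdir"

# Stand-in for the rows returned by the API (page_path, exits, pageviews).
Row = Struct.new(:page_path, :exits, :pageviews)

# Steps 2 and 3: index the rows by path and write the hash to a JSON file.
def write_cache(rows, file)
  result = Hash[rows.collect { |row| [row.page_path, [row.exits, row.pageviews]] }]
  File.write(file, JSON.generate(result))
  result
end

# Steps 4 and 5: read the file back and look up one page by its path.
# Unknown paths fall back to zero counts.
def pageviews_for(path, file)
  data = JSON.parse(File.read(file))
  _exits, pageviews = data.fetch(path, [0, 0])
  pageviews
end

cache_file = File.join(Dir.mktmpdir, "analytics.json") # hypothetical location
rows = [Row.new("/index.html", 3, 120), Row.new("/about.html", 1, 45)]
write_cache(rows, cache_file)
views = pageviews_for("/about.html", cache_file)
missing = pageviews_for("/nope.html", cache_file)
```

The API is queried once per build this way, and each page render becomes a hash lookup in a local file.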

amazon s3 and carrierwave random image name in bucket does not match in database

I'm using CarrierWave, Rails and Amazon S3. Every time I save an image, the image shows up in S3 and I can see it in the management console with a name like this:
But in the model, the name is this:
First off, why is the random name different? I am generating it in the uploader like so:
def filename
if original_filename
I know it is not generating a random string every call because the wrong url in the model is consistent and saved. Somewhere in the process a new one must be getting generated to save in the model after the image name has been saved and sent to amazon s3. Strange.
Also, can I make the URL use the path style (s3/bucket) instead of the subdomain style (bucket.s3) without using a regex? Is there an option in CarrierWave for that?
CarrierWave by default doesn't store the URL. Instead, it generates it every time you need it.
So, every time filename is called it will return a different value, because of Time.now.to_i.
Use created_at column instead, or add a new column for storing the random id or the full filename.
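
The difference between a name that is recomputed on every call and one that is generated once can be shown with a plain-Ruby stand-in. This is a sketch, not CarrierWave itself: SketchUploader and its method names are hypothetical, and SecureRandom stands in for the Time.now.to_i-based name so that the effect does not depend on timing.

```ruby
require "securerandom"

# Plain-Ruby stand-in for an uploader; not CarrierWave itself.
class SketchUploader
  def initialize(original_filename)
    @original_filename = original_filename
  end

  # Recomputed on every call, so two calls can disagree.
  def unstable_filename
    "#{SecureRandom.hex(8)}#{File.extname(@original_filename).downcase}"
  end

  # Generated once, then reused on every later call.
  def stable_filename
    @token ||= SecureRandom.hex(8)
    "#{@token}#{File.extname(@original_filename).downcase}"
  end
end

uploader = SketchUploader.new("Photo.JPG")
first = uploader.stable_filename
second = uploader.stable_filename
```

Memoizing in an instance variable only helps within one uploader instance. To keep the name stable across requests, the value still has to be stored on the model, for example in a column as suggested above.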
I solved it by saving the filename if it was still the original filename. In the uploader, put:
def filename
  if original_filename && original_filename == @filename
    @filename = "#{any_string}#{File.extname(original_filename).downcase}"
  end
  @filename
end
The question of the subdomain versus the path is not actually an issue. It works with the subdomain, i.e. https://s3.amazonaws.com/bucket-name/ and https://bucket-name.s3.amazonaws.com/ both work fine.

How do I use CGI in a Heroku application written with Ruby and Sinatra?

I am trying to move information from a text form to a new web page using CGI. To do this, I set action to action="new.html" in the form. Then, in the relevant part of my .rb file, I have:
get "/new.html" do
  @graph = Koala::Facebook::API.new(session[:access_token])
  @app = @graph.get_object(ENV["FACEBOOK_APP_ID"])
  if session[:access_token]
    @query=CGI.new() # Line of interest
    @input=@query["tool_1"] # Line of interest
  end
  erb :my_tools_F
end

post "/new.html" do
  redirect "/new.html"
end
The new web page loads, but @input is blank when I call it in the .erb file. Prior to this part of the script, I did require CGI. My web host is Heroku, and both of the .erb files are in a directory called views. The application is built to be launched on Facebook.
The example code is here.
It seems like you're trying to get the parameters from the form. I had another answer here, but that wasn't working for you. You can easily do this without CGI, and you should consider using the built-in methods to do so. However, before you can do that, there are some errors in your GitHub post that need fixing.
Your folder Views should read views. Small but it matters. I couldn't get the pages rendering correctly.
On your new.erb and index.erb on line 33 it reads:
<input type="submit" value="Add"">
There is an extra " at the end. Just remove it to look like:
<input type="submit" value="Add">
Lastly, to do what you need to do:
get "/new.html" do
  erb :new
end

post "/new.html" do
  @input = params[:tool_1]
  erb :new
end
instead of what you did. Search http://www.sinatrarb.com/intro for params to see how form parameters are accessed.
