How to scrape "data:image" URIs; Encountering Errno::ENAMETOOLONG? - ruby

I have been trying to write a script that scrapes a page for images, the way it is outlined in
"Save all image files from a website".
I tested that method with another page and it worked fine, but when I point it at a page whose images are data:image URIs, which look like:
data:image/jpg;base64,/9j/4FEJFOIEJNFOEJOIAD//gAQTGFGRGREGg2LjEwMAD/2wBDAAgEBAQEREGREWGRWEGUFBQYGBgYGBgYGB...
I get an error beginning with initialize': File name too long and ending in (Errno::ENAMETOOLONG).
Has anyone found a way to deal with situations like this?

data:image URLs actually contain the image inline, encoded as Base64. All you need to do is grab that data and decode it:
require 'base64'
File.open(File.basename(url), 'wb') { |f| f.write(Base64.decode64(url[/base64,(.*)/, 1])) }
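Note that applying File.basename to the data URI itself can still produce a name that is far too long, which is exactly what raises Errno::ENAMETOOLONG. A minimal sketch (the save_data_uri helper and its basename parameter are mine, not part of the answer above) that derives a short file name from the MIME type instead:
require 'base64'

def save_data_uri(url, basename = 'image')
  ext  = url[%r{\Adata:image/(\w+)}, 1] || 'bin'  # e.g. "jpg" from "data:image/jpg;base64,..."
  data = url[/base64,(.*)/m, 1]                   # everything after "base64,"
  File.open("#{basename}.#{ext}", 'wb') { |f| f.write(Base64.decode64(data)) }
end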

Related

How to check that a PDF file has some link with Ruby/Rspec?

I am using prawnpdf/pdf-inspector to test that the content of a PDF generated in my Rails app is correct.
I want to check that the PDF file contains a link with a certain URL. I looked at yob/pdf-reader but haven't found anything useful on this topic.
Is it possible to test URLs within a PDF with Ruby/RSpec?
I would like to write something like the following:
expect(urls_in_pdf(pdf)).to include 'https://example.com/users/1'
The pdf-reader gem (https://github.com/yob/pdf-reader) provides a text method on each page.
Do something like
pdf = PDF::Reader.new("tmp/pdf.pdf")
assert pdf.pages[0].text.include? 'https://example.com/users/1'
assuming what you are looking for is on the first page.
Since pdf-inspector seems to return only text, you could try using pdf-reader directly (pdf-inspector uses it anyway).
reader = PDF::Reader.new("somefile.pdf")
reader.pages.each do |page|
puts page.raw_content # This should also give you the link
end
I only had a quick look at the GitHub page, so I am not sure exactly what raw_content returns. But there is also a low-level method to access the objects of the PDF directly:
reader = PDF::Reader.new("somefile.pdf")
puts reader.objects.inspect
With that it should certainly be possible to get the URL.
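If neither text nor raw_content surfaces the link, here is a rough sketch of the urls_in_pdf helper from the question, built on that low-level object access (it assumes the link annotations store their target inline as a /URI entry in the action dictionary):
require 'pdf-reader'

def urls_in_pdf(path)
  reader = PDF::Reader.new(path)
  urls = []
  # Link annotations typically reference an action dictionary with a /URI key.
  reader.objects.each do |_ref, obj|
    urls << obj[:URI].to_s if obj.is_a?(Hash) && obj[:URI]
  end
  urls.uniq
end

# In a spec:
#   expect(urls_in_pdf("tmp/pdf.pdf")).to include 'https://example.com/users/1'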

Parsing Liquid in a Jekyll generator before converting to JSON

Best to start by saying that I am very new to Ruby and Liquid. I have searched around looking for some resource on this issue, but as yet haven't been able to find anything of real use.
I have a Jekyll site, which utilises the HTML5 History API. I have a Jekyll generator plugin which creates a single JSON file which holds all the post and page content, ready for use with HTML5 PushState and PopState. This part is functioning properly and is tested.
My problem comes when I have a post/page on the site which has Liquid tags in it. I am guessing I need to parse these Liquid tags to get the template output before I create my JSON object for each post/page. Here is what I have for pages as an example:
# Iterate over all pages
site.pages.each do |page|
  # Encode the page HTML content to JSON
  link = page.url
  @content = Liquid::Template.parse(page.content)
  hash[link] = { "body_class" => page.data['body_class'],
                 "content"    => converter.convert(@content.render),
                 "title"      => '<h1>' + page.data["content_title"] + '</h1>' }
end
At the minute this basically removes all Liquid tags from the generated JSON file, leaving nothing in their place.
Here is my full generator file on Github which is based very heavily on nice work by Jezen Thomas.
The output JSON file is also in that repo with the site, or can be accessed quickly here. The blog.html content is the last item in the JSON file and shows the empty h1 and div tags which should have content.
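One possible direction (untested against this repo, and not a confirmed fix) is to render each page through Liquid with Jekyll's site payload and registers before converting, so the tags have the context they need; something like:
site.pages.each do |page|
  link = page.url
  payload  = site.site_payload
  template = Liquid::Template.parse(page.content)
  rendered = template.render(payload, :registers => { :site => site, :page => page })
  hash[link] = { "body_class" => page.data['body_class'],
                 "content"    => converter.convert(rendered),
                 "title"      => '<h1>' + page.data["content_title"] + '</h1>' }
end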

Render a view's output later via a delayed_job

If I render HTML I get HTML to the browser, which works great. However, how can I get a route's response (the HTML) when it is called from a module or class?
I need to do this because I'm sending documents to DocRaptor, and rather than store the markup/HTML in a db column, I would like to store record IDs and create the markup when the job executes.
A possible solution is to use Ruby's HTTP library, HTTParty, wget, or something similar to open up the route and use the response body. Before doing so I thought I'd ask around.
Thanks!
-- Update --
Here's something like what I ended up doing. Quick tip, in case anyone does this and needs their helper methods: you need to extend the ActionView instance with ApplicationHelper:
av = ActionView::Base.new()
av.view_paths = ActionController::Base.view_paths
av.extend ApplicationHelper #or any other helpers your template may need
body = av.render(:template => "orders/receipt.html.erb",:locals => {:order => order})
Link:
http://www.rigelgroupllc.com/blog/2011/09/22/render-rails3-views-outside-of-your-controllers/
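For completeness, a rough sketch (the job class and the DocRaptor call are hypothetical; only the rendering lines come from the update above) of wrapping this in a delayed_job so that only the record ID is stored and the markup is built when the job executes:
class ReceiptJob < Struct.new(:order_id)
  def perform
    order = Order.find(order_id)

    av = ActionView::Base.new
    av.view_paths = ActionController::Base.view_paths
    av.extend ApplicationHelper

    body = av.render(:template => "orders/receipt.html.erb", :locals => { :order => order })
    DocRaptorClient.create_document(body) # hypothetical call; send the markup to DocRaptor here
  end
end

# Enqueue with just the ID:
#   Delayed::Job.enqueue ReceiptJob.new(order.id)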
Check this question out; it contains the code you probably want in an answer:
Rails 3 > Rendering views in rake task

Avoid repeated calls to an API in Jekyll Ruby plugin

I have written a Jekyll plugin to display the number of pageviews on a page by calling the Google Analytics API using the garb gem. The only trouble with my approach is that it makes a call to the API for each page, slowing down build time and also potentially hitting the user call limits on the API.
It would be possible to return all the data in a single call and store it locally, and then look up the pageview count from each page, but my Jekyll/Ruby-fu isn't up to scratch. I do not know how to write the plugin to run once to get all the data and store it locally where my current function could then access it, rather than calling the API page by page.
Basically my code is written as a Liquid block that can be put into my page layout:
class GoogleAnalytics < Liquid::Block
  def initialize(tag_name, markup, tokens)
    super # options that appear in block (between tag and endtag)
    @options = markup # optional options passed in by opening tag
  end

  def render(context)
    path = super
    # Read in credentials and authenticate
    cred = YAML.load_file("/home/cboettig/.garb_auth.yaml")
    Garb::Session.api_key = cred[:api_key]
    token = Garb::Session.login(cred[:username], cred[:password])
    profile = Garb::Management::Profile.all.detect { |p| p.web_property_id == cred[:ua] }
    # Place query, customize to modify results
    data = Exits.results(profile,
                         :filters => { :page_path.eql => path },
                         :start_date => Chronic.parse("2011-01-01"))
    data.first.pageviews
  end
end
The full version of my plugin is here.
How can I move all the calls to the API into some other function, make sure Jekyll runs it once at the start, and then adjust the tag above to read that local data?
EDIT: Looks like this can be done with a Generator that writes the data to a file; see the example on this branch. Now I just need to figure out how to subset the results: https://github.com/Sija/garb/issues/22
To store the data, I had to:
1. Write a Generator class (see Jekyll wiki plugins) to call the API.
2. Convert the data to a hash for easy lookup by path (see step 5):
result = Hash[data.collect { |row| [row.page_path, [row.exits, row.pageviews]] }]
3. Write the data hash to a JSON file.
4. Read in the data from the file in my existing Liquid block class. Note that the block tag works from the _includes dir, while the generator works from the root directory.
5. Match the page path, which is easy once the data is converted to a hash:
result[path][1]
Code for the full plugin, showing how to create the generator, write the files, etc., is here.
And thanks to Sija on GitHub for help on this.
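Putting those steps together, a condensed sketch of the Generator half (the class name and output path are my own choices; the Garb calls come from the question, and Exits is the report class defined in the full plugin):
require 'json'
require 'yaml'
require 'garb'
require 'chronic'

module Jekyll
  class AnalyticsGenerator < Generator
    safe true

    def generate(site)
      # Authenticate once per build, using the same credentials file as the block tag.
      cred = YAML.load_file("/home/cboettig/.garb_auth.yaml")
      Garb::Session.api_key = cred[:api_key]
      Garb::Session.login(cred[:username], cred[:password])
      profile = Garb::Management::Profile.all.detect { |p| p.web_property_id == cred[:ua] }

      # One query for all pages, keyed by path for fast lookup from the Liquid block.
      data = Exits.results(profile, :start_date => Chronic.parse("2011-01-01"))
      result = Hash[data.collect { |row| [row.page_path, [row.exits, row.pageviews]] }]

      # The block tag reads relative to _includes, so write the cache there.
      File.open("_includes/pageviews.json", "w") { |f| f.write(result.to_json) }
    end
  end
end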

How to serve generated images with sinatra in ruby

I wrote a simple Sinatra app that generates an image with RMagick from some user input. The image is saved in the ./public directory with a unique file name, and that file name is used in the HTML generated by Sinatra so that each user gets the correct image. Once a day a script deletes files older than one hour. This is clearly a terrible hack, but I have no web experience!
Is there any way to serve the RMagick image in Sinatra without first saving it to disk?
Use the Image#to_blob method to turn the in-memory image into a string:
require 'sinatra'
require 'rmagick'

get '/' do
  content_type 'image/png'
  img = Magick::Image.read('logo:')[0]
  img.format = 'png'
  img.to_blob
end
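The same idea extends to the user-driven case in the question: build the image in memory from the request parameters and return the blob directly (the route, parameter name, and drawing code below are only an illustration):
require 'sinatra'
require 'rmagick'

get '/label' do
  text = params[:text] || 'hello'

  # Draw the requested text onto a small in-memory canvas.
  img = Magick::Image.new(200, 60) { self.background_color = 'white' }
  draw = Magick::Draw.new
  draw.annotate(img, 0, 0, 10, 40, text) { self.pointsize = 24 }

  content_type 'image/png'
  img.format = 'png'
  img.to_blob
end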
