Using gzip compression in Sinatra with Ruby

Note: I had another, similar question about how to gzip data using Ruby's zlib. That one was technically answered, and I didn't feel I could keep evolving it once it had been answered, so although this question is related it is not the same...
The following code (I believe) is gzip'ing a static CSS file and storing the result in the result variable. But what do I do with that result now? How can I send this data back to the browser so it is recognised as gzip'ed rather than served at the original file size (e.g. when checking my YSlow score I want to see it credit me for gzip'ing static resources)?
require 'zlib'

z = Zlib::Deflate.new(6, 31)  # windowBits = 31 starts a gzip (not plain zlib) stream
result = z.deflate(File.read('public/Assets/Styles/build.css'))
result << z.finish  # could also have done: result = z.deflate(file, Zlib::FINISH)
z.close
...one thing to note is that in my previous question the respondent clarified that Zlib::Deflate.deflate will not produce gzip-encoded data; it only produces zlib-encoded data, so I need to use Zlib::Deflate.new with the windowBits argument set to 31 to start a gzip stream.
But when I run this code I don't actually know what to do with the result variable and its content. There is no information on the internet (that I can find) about how to send gzip-encoded static resources (JavaScript, CSS, HTML, etc.) to the browser so the page loads quicker. It seems every Ruby article I read assumes Ruby on Rails!
Any help really appreciated.

After zipping the file you would simply return the result and make sure to set the Content-Encoding: gzip header on the response. Google has a nice little introduction to gzip compression and what you have to watch out for. Here is what you could do in Sinatra:
require 'sinatra'
require 'zlib'
require 'stringio'

get '/whatever' do
  headers['Content-Encoding'] = 'gzip'
  StringIO.new.tap do |io|
    gz = Zlib::GzipWriter.new(io)
    begin
      gz.write(File.read('public/Assets/Styles/build.css'))
    ensure
      gz.close
    end
  end.string
end
One final word of caution, though. You should probably choose this approach only for content that you created on the fly or if you just want to use gzip compression in a few places.
If, however, your goal is to serve most or even all of your static resources with gzip compression enabled, then it will be a much better solution to rely on what is already supported by your web server instead of polluting your code with this detail. There's a good chance that you can enable gzip compression with some configuration settings. Here's an example of how it is done for nginx.
Another alternative would be to use the Rack::Deflater middleware.

Just to highlight the Rack::Deflater way as an answer:
As mentioned above, just enable the middleware in config.ru:
use Rack::Deflater
That's pretty much it!
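For completeness, a fuller config.ru sketch (the app.rb file name and the classic-style Sinatra::Application class are assumptions about how the app is structured):

# config.ru
require 'rack/deflater'
require_relative 'app'   # assumed entry point of the Sinatra app

use Rack::Deflater       # compresses responses when the client sends Accept-Encoding: gzip
run Sinatra::Application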

Since the goal here is to compress web-related data like CSS files, I want to recommend Brotli. It was heavily optimized for exactly this purpose, and any modern web browser supports it.
You can use the ruby-brs bindings for Ruby.
gem install ruby-brs
require "brs"
require "sinatra"
get "/" do
headers["Content-Encoding"] = "br"
BRS::String.compress File.read("sample.css")
end
You can use the streaming interface instead; it is similar to the Zlib interface.
require "brs"
require "sinatra"
get "/" do
headers["Content-Encoding"] = "br"
StringIO.new.tap do |io|
writer = BRS::Stream::Writer.new io
begin
writer.write File.read("sample.css")
ensure
writer.close
end
end
.string
end
You can also use the nonblocking methods; see the ruby-brs documentation for more information.

Related

View routes in Sinatra

The following code snippet is how I am handling routes in my Sinatra application. All of my views are contained in my views/pages directory. These are just haml files that represent static html, with some javascript. Are there any negative implications to loading views in this manner? If the page does not exist, it throws a file not found error. I am worried this is somehow an attack vector.
error RuntimeError do
  status 500
  "A RuntimeError occurred"
end

get '/:page' do
  begin
    haml "pages/#{params['page']}".to_sym
  rescue Errno::ENOENT
    status 404
    "404"
  end
end
I'm not sure if this is a security concern here (I'm not that into all the details of Sinatra), but I tend to be paranoid when using user-specified data like params['page'] in your example. As said, I'm not sure whether Sinatra sanitizes the value and makes this example impossible, but imagine it contained something like ../db_connection.yml. Sinatra would then be told to load the haml file pages/../db_connection.yml, which might actually exist, and would display your database configuration to the user.
If you don't have any weird symlinks in your pages directory, it would probably be enough to collapse runs of dots in the passed string with something like .gsub(/\.+/, ".") (or strip all dots if you don't need them in the name, to be even more paranoid). I'm not sure whether there are any multi-byte insecurities where someone could do something ugly with encodings, though, and the replacing might be useless because the exploit would work nevertheless.
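A minimal sketch of that kind of defensive filtering (the exact gsub rules are an assumption, and Sinatra's own path traversal protection, mentioned below, may already make it redundant):

get '/:page' do
  # collapse runs of dots and strip path separators before building the view name
  page = params['page'].gsub(/\.+/, '.').gsub(%r{[/\\]}, '')
  begin
    haml "pages/#{page}".to_sym
  rescue Errno::ENOENT
    status 404
    "404"
  end
end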
Edit: A short read of the Sinatra manual yielded this:
By the way, unless you disable the path traversal attack protection (see below), the request path might be modified before matching against your routes.
So it seems it should be safe to just use the params value without any special filtering; you may, however, want to look into the documentation some more (especially the security sections). In any case, I don't think it's much of a security issue if it's possible to figure out whether a file in your pages directory exists or not.
Are there any negative implications to loading views in this manner?
Time would be one: it takes a lot longer to generate a page than to serve a static one. Resource use would be another, for the same reason. Added complexity is another. Reinventing the wheel is yet another.
Why not just put static pages in the public directory? Or why not use a static site generator?
Pick the tool that fits your needs, and don't reinvent the wheel (especially when the framework has already provided you with that wheel!)
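For reference, a minimal sketch of the public-directory route (this folder is Sinatra's default, so the setting is shown only for clarity):

require 'sinatra'

set :public_folder, File.join(__dir__, 'public')  # Sinatra's default static folder
# anything in public/ is now served as-is, e.g. public/about.html at /about.html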

What is the difference between Ruby's 'open-uri' and 'Net::HTTP' gems?

It seems like both of these gems perform very similar tasks. Can anyone give examples of where one gem would be more useful than the other? I don't have specific code that I'm referring to, I'm more wondering about general use cases for each gem. I know this is a short question, I will fill in the blanks upon request. Thanks.
The reason they look like they perform similar tasks is that OpenURI is a wrapper for Net::HTTP, Net::HTTPS, and Net::FTP.
Usually, unless you feel you need a lower level interface, using OpenURI is better as you can get by with less code. Using OpenURI you can open a URL/URI and treat it as a file.
See: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/open-uri/rdoc/OpenURI.html
and http://ruby-doc.org/stdlib-1.9.3//libdoc/net/http/rdoc/Net.html
I just found out that open does follow redirections, while Net::HTTP doesn't, which is an important difference.
For example, open('http://www.stackoverflow.com') { |content| puts content.read } will display the proper HTML after following the redirection, while Net::HTTP.get(URI('http://www.stackoverflow.com')) will show the redirection message and 302 status code.
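A quick sketch of that difference (on current Rubies open-uri's entry point is URI.open; the URL is just an illustration):

require 'open-uri'
require 'net/http'

# OpenURI follows the redirect and yields the final page body.
URI.open('http://www.stackoverflow.com') { |content| puts content.read[0, 200] }

# Net::HTTP hands you the redirect response itself; following it is up to you.
response = Net::HTTP.get_response(URI('http://www.stackoverflow.com'))
puts response.code         # e.g. "301" or "302"
puts response['location']  # the redirect target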

Multiple Converters/Generators in Jekyll?

I would like to write a Jekyll plugin that makes all posts available in PDF format by utilizing Kramdown's LaTeX export capabilities. For each post in Markdown format, I'd like to end up with the normal .html post along with a .tex file containing the LaTeX markup and finally a .pdf.
Following the documentation for creating plugins, I see two ways of approaching the problem, either with a Converter or with a Generator.
Converter plugins seem to run after the built-in Converters, so the .markdown files have all been converted to .html by the time they reach the Converter.
When I try to implement a Generator, I am able to use FileUtils to write a file successfully, but by the end of Jekyll's build cycle that file has been removed. There seems to be a StaticFile class which you can use to register new output files with Jekyll, but I cannot find any real guidance on how to use it.
If you take a look at the ThumbGenerator class in this plugin: https://github.com/matthewowen/jekyll-slideshow/blob/master/_plugins/jekyll_slideshow.rb you'll see a similar example. This particular plugin makes thumbnail-sized versions of all images in the site. Hopefully it gives a useful guide to how you can interact with Jekyll's StaticFile class (though I'm not a Ruby pro, so forgive any poor style).
Unfortunately, there isn't really documentation for this - I gleaned it from reading through the source.
I wrote this a few months ago and don't particularly remember the details (which is why I gave an example rather than a workthrough), but if this doesn't get you on the right track let me know and I'll try to help.
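As a rough sketch of the shape such a Generator can take (the class name, output folder, and PDF-writing step are illustrative assumptions, not code from that plugin), the key point is registering the generated file with site.static_files so Jekyll's cleanup phase keeps it:

module Jekyll
  class PdfGenerator < Generator
    safe true

    def generate(site)
      site.posts.docs.each do |post|
        pdf_name = post.basename_without_ext + '.pdf'
        # ... write the PDF into File.join(site.source, 'pdfs', pdf_name) here ...

        # Register it so Jekyll copies it into _site and does not delete it afterwards.
        site.static_files << Jekyll::StaticFile.new(site, site.source, 'pdfs', pdf_name)
      end
    end
  end
end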
I tried to do the same, but with a direct HTML-to-PDF conversion.
It did not work inside a GitLab CI pipeline at the time, but it does work on my workstation (see here), using a third possibility: a hook!
(Here with PDFKit.)
require 'pdfkit'

module Jekyll
  Jekyll::Hooks.register :site, :post_write do |site|
    site.posts.docs.each do |post|
      filename = post.site.dest + post.id + ".pdf"
      dirname  = File.dirname(filename)
      Dir.mkdir(dirname) unless File.exist?(dirname)
      kit = PDFKit.new(post.content, :page_size => 'Letter')
      kit.stylesheets << './css/bootstrap.min.css'
      kit.to_file(filename)
    end
  end
end

How can I scrape, parse and crawl files in Ruby?

I have a number of data files to process from a data warehouse that have the following format:
:header 1 ...
:header n
# remarks 1 ...
# remarks n
# column header 1
# column header 2
DATA ROWS
(Example: "#### ## ## ##### ######## ####### ###afp## ##e###")
The data is separated by white spaces and has both numbers and other ASCII chars. Some of those pieces of data will be split up and made more meaningful.
All of the data will go into a database, initially an SQLite db for development, and then pushed up to another, more permanent, storage.
These files will be pulled in actually via HTTP from the remote server and I will have to crawl a bit to get some of it as they span folders and many files.
I was hopeful to get some input what the best tools and methods may be to accomplish this the "Ruby way", as well as to abstract out some of this. Otherwise, I'll tackle it probably similar to how I would in Perl or other such approaches I've taken before.
I was thinking along the lines of using OpenURI to open each url, then if input is HTML collect links to crawl, otherwise process the data. I would use String.scan to break apart the file appropriately each time into a multi-dimensional array parsing each component based on the established formatting by the data provider. Upon completion, push the data into the database. Move on to next input file/uri. Rinse and repeat.
I figure I must be missing some libs that those with more experience would use to clean/quicken this process up dramatically and make the script much more flexible for reuse on other data sets.
Additionally, I will be graphing and visualizing this data as well as generating reports, so perhaps that should too be considered.
Any input on a better approach or libs to simplify this?
Your question focuses a lot on "low level" details: parsing URLs and so on. One key aspect of the "Ruby Way" is "Don't reinvent the wheel." Leverage existing libraries. :)
My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.
Without knowing too much about your constraints, I would look at storing your results in MongoDB. After thinking about this, I did a quick search and found a nice tutorial: Scraping a blog with Anemone and MongoDB.
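A minimal sketch of that combination (the start URL and the CSS selector are placeholders, and storage is left as a stub):

require 'anemone'
require 'nokogiri'

Anemone.crawl('http://www.example.com/') do |anemone|
  anemone.on_every_page do |page|
    doc = Nokogiri::HTML(page.body)
    # pull out whatever interests you, e.g. links to the raw data files
    data_links = doc.css('a').map { |a| a['href'] }.compact
    # ... store page.url and data_links (MongoDB, SQLite, etc.) ...
  end
end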
I've written probably a bajillion spiders and site analyzers and find that Ruby has some nice tools that should make this an easy process.
OpenURI makes it easy to retrieve pages.
URI.extract makes it easy to find links in pages. From the docs:
Description
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test#example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test#example.com"]
Simple, untested logic to get started might look like:
require "openuri"
require "uri"
urls_to_scan = %w[
http://www.example.com/page1
http://www.example.com/page2
]
loop do
break if urls_to_scan.empty?
url = urls_to_scan.shift
html = open(url).read
# you probably want to do something to make sure the URLs are not
# pointing outside the site you're walking.
#
# Something like:
#
# URI.extract(html).select{ |u| u[%r{^http://www\.example\.com}i] }
#
new_urls = URI.extract(html)
if (new_urls.any?)
urls_to_scan += new_urls
else
; # parse your file as data using the content in html
end
end
Unless you own the site you're crawling, you want to be kind and gentle: don't run as fast as possible, because it's not your pipe. Pay attention to the site's robots.txt file or risk being banned.
There are true web-crawler gems for Ruby, but the basic task is so simple I never bother with them. If you want to check out other alternatives, visit some of the links to the right for other questions on SO that touch on this subject.
If you need more power or flexibility, the Nokogiri gem makes short work of parsing HTML, allowing you to use CSS accessors to search for tags of interest. There are some pretty powerful gems for making it easy to grab pages such as typhoeus.
Finally, while ActiveRecord, which is recommended in some comments, is nice, finding documentation for using it outside of Rails can be difficult or confusing. I recommend using Sequel. It is a great ORM, very flexible, and well documented.
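For example, a tiny Sequel sketch against the SQLite development database mentioned in the question (the table and column names are made up):

require 'sequel'

DB = Sequel.sqlite('warehouse_dev.db')   # file-backed; Sequel.sqlite with no argument is in-memory

DB.create_table? :records do
  primary_key :id
  String   :source_url
  String   :raw_line
  DateTime :fetched_at
end

DB[:records].insert(source_url: 'http://www.example.com/page1',
                    raw_line:   'first data row goes here',
                    fetched_at: Time.now)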
Hi, I would start by taking a very close look at the gem called Mechanize before firing up any basic open-uri stuff, because that functionality is built into Mechanize. It's a brilliant, fast, and easy-to-use gem for automating web crawling. Since your data format is pretty strange (at least compared to JSON, XML or HTML) I don't think you will get much use out of the built-in parser, but you could still take a look at it; it's called Nokogiri and is extremely smart as well. In the end, after crawling and fetching the resources, you will probably have to fall back on some good old regular expression work.
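A minimal Mechanize sketch of the fetch-and-follow part (the URL and the link filter are placeholders):

require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'

page = agent.get('http://www.example.com/data/')

# follow only links that look like data files or sub-folders
page.links_with(href: /\.(dat|txt)\z|\/\z/).each do |link|
  target = link.click
  # target.body now holds the raw contents, ready for your regex-based parsing
end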
Good luck!

Ruby CGI Stream Audio

What I want to do is use a CGI script (written in Ruby) to read a binary file off of the filesystem (audio, specifically) and stream it to the browser.
This is how I am doing that so far,
require 'config'

ENV['REQUEST_URI'] =~ /\?(.*)$/
f = $config[:mdir] + '/' + $1.url_decode
f = File.open f, 'rb'

print "Content-Type: audio/mpeg\r\n" # TODO: make it guess the MIME type
print "\r\n"

# Output the file in 4k blocks
while blk = f.read(4096)
  $stdout.puts blk
  $stdout.flush
end
f.close
It's not perfect (there are security holes, such as exposing the whole filesystem), but it just isn't working right. It's reading the right file and, as far as I can tell, outputting it in 4k blocks like it should. Using Safari, if I go to the URL it shows a question mark on the audio player. If I use wget to download it, it appears to work and is about the right size, but it is corrupted: it begins playing fine, then crackles, and then stops.
How should I go about doing this? Do I need to Base64-encode it, and if so, can I do that without loading the whole file into memory in one go?
Btw, this is only going to be used over a local area network, and I want easy setup, so I'm not interested in a dedicated streaming server.
You could just use IO#print instead of IO#puts but this has some disadvantages.
Don't do the file handling in Ruby
Ruby is not good at doing stupid tasks very fast. With this code, you will probably not be able to fill your bandwidth.
No CGI
All you want to do is expose part of the filesystem via HTTP; here are some options for how to do it:
Set your document root to the folder you want to expose.
Make a symlink to the folder you want to expose.
Write some kind of rule in your webserver config to map certain url's to a folder
Use any of these and the http server will do for you what you want.
X-Sendfile
Some HTTP servers honor a special header called X-Sendfile. If you do
print "X-Sendfile: #{file}\r\n"
the server will use the specified file as the response body. In lighttpd this works only through FastCGI, so you would need a FastCGI wrapper like http://cgit.stbuehler.de/gitosis/fcgi-cgi/. I don't know about Apache and the others.
Use a C extension
C extensions are good at doing stupid tasks fast. I wrote a C extension which does nothing but read from one IO and write to another IO. With this C extension you can fill a gigabit LAN through Ruby: git://johannes.krude.de/io2io
Use it like this:
IO2IO.do(file, STDOUT)
puts is for writing "lines" of "text", and therefore appends a newline to whatever you give it (turning your mpegs into garbage). You want syswrite.
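A corrected version of the output loop from the question, assuming path already holds the validated file path (a sketch only; the security issues mentioned above still need handling):

print "Content-Type: audio/mpeg\r\n"
print "\r\n"
$stdout.flush                  # flush the buffered headers before switching to syswrite

File.open(path, 'rb') do |f|
  while (blk = f.read(4096))
    $stdout.syswrite(blk)      # raw bytes, no newline appended
  end
end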