Ruby CGI Stream Audio

What I want to do is use a CGI script (written in Ruby) to read a binary file off of the filesystem (audio, specifically) and stream it to the browser.
This is how I am doing that so far,
require 'config'

ENV['REQUEST_URI'] =~ /\?(.*)$/
f = $config[:mdir] + '/' + $1.url_decode
f = File.open f, 'rb'

print "Content-Type: audio/mpeg\r\n" # TODO: make it guess the MIME type
print "\r\n"

# Output the file
while blk = f.read(4096)
  $stdout.puts blk
  $stdout.flush
end
f.close
It's not perfect; there are security holes (it exposes the whole filesystem), but beyond that it just isn't working right. It's reading the right file and, as far as I can tell, outputting it in 4k blocks like it should. Using Safari, if I go to the URL, it shows a question mark on the audio player. If I use wget to download it, it appears to work and is about the right size, but the file is corrupted: it begins playing fine, then crackles, and then stops.
How should I go about doing this? Do I need to Base64-encode it, and if so, can I do that without loading the whole file into memory in one go?
Btw, this is only going to be used over local area network, and I want easy setup, so I'm not interested in a dedicated streaming server.

You could just use IO#print instead of IO#puts, but this approach has some disadvantages.
Don't do the file handling in Ruby
Ruby is not good at doing simple, repetitive tasks very fast. With this code, you will probably not be able to fill your bandwidth.
No CGI
All you want to do is expose part of the filesystem via HTTP. Here are some options for doing that:
Set your document root to the folder you want to expose.
Make a symlink to the folder you want to expose.
Write some kind of rule in your web server config to map certain URLs to a folder
Use any of these, and the HTTP server will do the work for you.
X-Sendfile
Some HTTP servers honor a special header called X-Sendfile. If you print
print "X-Sendfile: #{file}\r\n"
the server will use the specified file as the response body. In lighttpd this works only through FastCGI, so you would need a FastCGI wrapper like: http://cgit.stbuehler.de/gitosis/fcgi-cgi/ . I don't know about Apache and others.
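For illustration, a minimal CGI sketch assuming the server in front of it honors X-Sendfile (the path is hypothetical, and some servers use a different header, e.g. nginx's X-Accel-Redirect):
#!/usr/bin/env ruby
# Sketch: let the web server stream the file for us via X-Sendfile.
# Assumes the server is configured to honor this header.
file = '/srv/media/song.mp3' # hypothetical path
print "X-Sendfile: #{file}\r\n"
print "Content-Type: audio/mpeg\r\n"
print "\r\n" # no body needed; the server substitutes the file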
Use a C extension
C extensions are good at doing simple tasks fast. I wrote a C extension which does nothing but read from one IO and write to another. With this C extension you can fill a gigabit LAN through Ruby: git://johannes.krude.de/io2io
Use it like this:
IO2IO.do(file, STDOUT)

puts is for writing "lines" of "text", and therefore appends a newline to whatever you give it (turning your mpegs into garbage). You want syswrite.
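Concretely, a minimal sketch of the question's output loop with that fix applied (flushing first, because the headers were written through Ruby's buffered print while syswrite bypasses the buffer):
$stdout.flush # push out the buffered headers before bypassing the buffer
while blk = f.read(4096)
  $stdout.syswrite blk # writes the bytes verbatim, no newline appended
end
f.close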

Related

Determining encoding for a file in Ruby

I have come up with a method to determine encoding (or at least a guess at it) for a file that I pass in:
def encoding_type(file_path)
  File.read(file_path).encoding.name
end
The problem with this is that I have a file that is 15GB, so that means the entire file is being read into memory.
Is there any way to accomplish what I am doing in this method without needing to read the entire file into memory?
The file --mime command will return the MIME type and encoding of the file:
file --mime myfile
myfile: text/plain; charset=iso-8859-1
def detect_charset(file_path)
  # Note: this interpolates the path into a shell command, so only use it
  # with trusted/escaped paths.
  `file --mime #{file_path}`.strip.split('charset=').last
rescue => e
  Rails.logger.warn "Unable to determine charset of #{file_path}"
  Rails.logger.warn "Error: #{e.message}"
end
The method you suggest in your question will not do what you think. It simply tags the string with the Encoding.default_internal encoding, possibly after transcoding it from Encoding.default_external. These are both usually UTF-8. The encoding will always be Encoding.default_internal after you run that code; it is not guessed or determined from the actual file.
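For example (a sketch; the file name is illustrative):
s = File.read('some_file')
s.encoding # => usually #<Encoding:UTF-8>, i.e. whatever the process
           #    default is, regardless of the bytes actually on disk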
If you have a file and you really don't know what encoding it is, you indeed will have to guess. There's no way to be 100% sure you've gotten it right as the author intended (and some files are corrupt and mixed encoding or not legal in any encoding).
There are libraries with heuristics meant to try and guess (they won't be right all the time).
Here's one, which I've never actually used myself, but the likeliest prospect I found in 10 minutes of googling: https://github.com/oleander/rchardet . There might be other Ruby gems for this. You could also use Ruby's system() to call a command-line utility that tries to do this, such as the Linux file command mentioned above.
If you don't want to load the entire file in to test it, you can certainly just load part of it in. The chardet library will probably work more reliably the more data it has, but, sure, just read the first X bytes of the file in and then ask chardet to guess its encoding.
require 'chardet19'
first1000bytes = File.read(file, 1000)
cd = CharDet.detect(first1000bytes)
cd.encoding
cd.confidence
You can also always check whether any string in Ruby is valid for the encoding it's tagged with:
str.valid_encoding?
So you could simply go through a variety of encodings and see if it's valid:
orig_encoding = str.encoding
str.force_encoding("ISO-8859-1").valid_encoding?
str.force_encoding("UTF-8").valid_encoding?
str.force_encoding(orig_encoding) # put it back to what it was
But it's certainly possible for a file to be valid in more than one encoding, or to be valid in a given encoding but read as nonsense by humans in that encoding.
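Putting that together, a small first-valid-wins sketch (the candidate list is an assumption, and order matters, since permissive encodings like ISO-8859-1 treat almost any byte sequence as valid):
# Return the first candidate encoding under which the bytes are valid, or nil.
def guess_encoding(str, candidates = %w[UTF-8 ISO-8859-1])
  candidates.find { |enc| str.dup.force_encoding(enc).valid_encoding? }
end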
If you have your best-guess encoding, but the string is still not valid_encoding? for it, it may just have a few bad bytes in it. You can remove them with String.scrub in Ruby 2.1, or with a pure-Ruby backport of String.scrub on earlier versions.
Hope this helps give you some idea of what you're dealing with and what your options are.

How to parse mailbox file in Ruby?

The Ruby gem rmail has methods to parse a mailbox file on local disk. Unfortunately this gem is broken under Ruby 2.0.0, and it might not get fixed, because folks are migrating to the gem mail.
The gem mail has the method Mail.read('filename.txt'), but that parses only the first message in a mailbox.
That gem, and the built-in Net::IMAP, have flooded the net with tutorials on accessing mailboxes through IMAP.
So, is there still a way to parse a plain old file, without imap?
As the lone rubyist in my group I'd rather not embarrass myself by resorting to http://docs.python.org/2/library/mailbox.html.
Or, worse yet, PHP's imap_open('/var/mail/www-data', ...) -- if only Net::IMAP.new accepted filenames like that.
The good news is that the mbox format is dead simple, though its simplicity is also why it was eventually replaced: parsing a large mailbox file just to extract a single message is not especially efficient.
If you can split apart the mailbox file into separate strings, you can pass these strings to the Mail library for parsing.
An example starting point:
def parse_message(message)
  mail = Mail.new(message)
  do_other_stuff! # placeholder for your own processing
end

message = nil
while (line = STDIN.gets)
  if (line.match(/\AFrom /))
    parse_message(message) if (message)
    message = ''
  else
    message << line.sub(/^>From/, 'From')
  end
end
parse_message(message) if (message) # don't drop the final message
The key is that each message starts with "From ", and the trailing space is significant. Header lines are written as From: with a colon, and any body line that starts with ">From" is an escaped "From" and should be unescaped, as the sub in the loop does. It's quirks like this that make the format really inadequate, but if Maildir isn't an option, this is what you've got to do.
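For reference, an illustrative two-message fragment in that format (note the ">From" escaping in the first body):
From alice@example.com Thu Jan  1 00:00:00 2015
From: alice@example.com
Subject: First message

>From here on, this body line was escaped by the mail agent.

From bob@example.com Thu Jan  1 00:01:00 2015
From: bob@example.com
Subject: Second message

Second body.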
You could use tmail to parse mailboxes. It was replaced by mail, but I can't really find a class in mail that substitutes for it, so you might want to stick with tmail.
EDIT: as @tadman pointed out, it does not work with Ruby 1.9. However, you could port the class (and put it on GitHub for everyone else to use :-) )
The mbox format is about as simple as you can get. It's simply the concatenation of all the messages, separated by blank lines. The first line of each message starts with the five characters "From "; when messages are added to the file, any line which starts with "From" has a > prefixed, so you can reliably use the fact that a line starts with "From " as an indicator that it is the start of a message.
Of course, since this is an old format and it was never standardized, there are a number of variants. One variant uses the Content-Length header to determine the length of a message, and some implementations of this variant fail to insert the '>'. However, I think this is rare in practice.
A big problem with mbox format is that the file needs to be modified in place by mail agents; consequently, every implementation has some locking procedure. Of course, there is no standardization there, so you need to watch out for other processes modifying the mailbox while you are reading it. In practice, many mail systems solved this problem by using maildir format instead, in which a mailbox is actually a directory and every message is a single file.
Other things you might want to do include MIME decoding, but you should be able to find utilities which do that.

Using gzip compression in Sinatra with Ruby

Note: I had another similar question about how to gzip data using Ruby's zlib, which technically was answered, and I didn't feel I could keep evolving that question since it had been answered. So although this question is related, it is not the same...
The following code (I believe) gzips a static CSS file and stores the result in the result variable. But what do I do with it next? How can I send this data back to the browser so it is recognised as gzipped rather than served at the original file size (e.g. when checking my YSlow score I want to see it credit me for gzipping static resources)?
z = Zlib::Deflate.new(6, 31)
z.deflate(File.read('public/Assets/Styles/build.css'))
z.flush
#result = z.finish # could also have done: result = z.deflate(file, Zlib::FINISH)
z.close
...one thing to note is that in my previous question the respondent clarified that Zlib::Deflate.deflate will not produce gzip-encoded data. It will only produce zlib-encoded data and so I would need to use Zlib::Deflate.new with the windowBits argument equal to 31 to start a gzip stream.
But when I run this code I don't actually know what to do with the result variable and its content. There is no information on the internet (that I can find) about how to send gzip-encoded static resources (like JavaScript, CSS, HTML etc.) to the browser, thus making the page load quicker. It seems every Ruby article I read is based on someone using Ruby on Rails!!?
Any help really appreciated.
After zipping the file you would simply return the result and ensure to set the header Content-Encoding: gzip for the response. Google has a nice, little introduction to gzip compression and what you have to watch out for. Here is what you could do in Sinatra:
get '/whatever' do
  headers['Content-Encoding'] = 'gzip'
  StringIO.new.tap do |io|
    gz = Zlib::GzipWriter.new(io)
    begin
      gz.write(File.read('public/Assets/Styles/build.css'))
    ensure
      gz.close
    end
  end.string
end
One final word of caution, though. You should probably choose this approach only for content that you created on the fly or if you just want to use gzip compression in a few places.
If, however, your goal is to serve most or even all of your static resources with gzip compression enabled, then it will be a much better solution to rely on what is already supported by your web server instead of polluting your code with this detail. There's a good chance that you can enable gzip compression with some configuration settings. Here's an example of how it is done for nginx.
Another alternative would be to use the Rack::Deflater middleware.
Just to highlight the Rack::Deflater way as an answer: as mentioned in the comment above, just enable the middleware in config.ru
use Rack::Deflater
That's pretty much it!
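For example, a minimal config.ru sketch (the app file name and classic-style Sinatra app are assumptions):
# config.ru -- 'app.rb' containing a classic-style Sinatra app is assumed
require './app'
use Rack::Deflater
run Sinatra::Application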
Since what's being compressed here is web-related data like CSS files, I want to recommend brotli. It was heavily optimized for exactly this purpose, and any modern web browser today supports it.
You can use the ruby-brs bindings for Ruby.
gem install ruby-brs
require "brs"
require "sinatra"
get "/" do
headers["Content-Encoding"] = "br"
BRS::String.compress File.read("sample.css")
end
You can use the streaming interface instead; it is similar to the Zlib interface.
require "brs"
require "sinatra"
get "/" do
headers["Content-Encoding"] = "br"
StringIO.new.tap do |io|
writer = BRS::Stream::Writer.new io
begin
writer.write File.read("sample.css")
ensure
writer.close
end
end
.string
end
You can also use the nonblocking methods; see the ruby-brs documentation for more information.

How do I get Zlib to compress to a stream in Ruby?

I’m trying to upload files to Amazon S3 using AWS::S3, but I’d like to compress them with Zlib first. AWS::S3 expects its data to be a stream object, i.e. you would usually upload a file with something like
AWS::S3::S3Object.store('remote-filename.txt', open('local-file.txt'), 'bucket')
(Sorry if my terminology is off; I don’t actually know much about Ruby.) I know that I can zlib-compress a file with something like
data = Zlib::Deflate.deflate(File.read('local-file.txt'))
but passing data as the second argument to S3Object.store doesn’t seem to do what I think it does. (The upload goes fine but when I try to access the file from a web browser it doesn’t come back correctly.) How do I get Zlib to deflate to a stream, or whatever kind of object S3Object.store wants?
I think my problem before was not that I was passing the wrong kind of thing to S3Object.store, but that I was generating a zlib-compressed data stream without the header you’d usually find in a .gz file. In any event, the following worked:
str = StringIO.new
gz = Zlib::GzipWriter.new(str)       # wrap the StringIO so gzip-framed bytes land in it
gz.write File.read('local-file.txt')
gz.close                             # finishes the stream and writes the gzip trailer
AWS::S3::S3Object.store('remote-filename.txt', str.string, 'bucket')

How can I scrape, parse and crawl files in Ruby?

I have a number of data files to process from a data warehouse that have the following format:
:header 1 ...
:header n
# remarks 1 ...
# remarks n
# column header 1
# column header 2
DATA ROWS
(Example: "#### ## ## ##### ######## ####### ###afp## ##e###")
The data is separated by white spaces and has both numbers and other ASCII chars. Some of those pieces of data will be split up and made more meaningful.
All of the data will go into a database, initially an SQLite db for development, and then pushed up to another, more permanent, storage.
These files will actually be pulled in via HTTP from the remote server, and I will have to crawl a bit to get some of them, as they span folders and many files.
I was hoping to get some input on what the best tools and methods may be to accomplish this the "Ruby way", as well as how to abstract out some of this. Otherwise, I'll tackle it probably similar to how I would in Perl or other such approaches I've taken before.
I was thinking along the lines of using OpenURI to open each url, then if input is HTML collect links to crawl, otherwise process the data. I would use String.scan to break apart the file appropriately each time into a multi-dimensional array parsing each component based on the established formatting by the data provider. Upon completion, push the data into the database. Move on to next input file/uri. Rinse and repeat.
I figure I must be missing some libs that those with more experience would use to clean/quicken this process up dramatically and make the script much more flexible for reuse on other data sets.
Additionally, I will be graphing and visualizing this data as well as generating reports, so perhaps that should too be considered.
Any input on a better approach or libs to simplify this?
Your question focuses a lot on "low level" details: parsing URLs and so on. One key aspect of the "Ruby Way" is "Don't reinvent the wheel." Leverage existing libraries. :)
My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.
Without knowing too much about your constraints, I would look at storing your results in MongoDB. After thinking about this, I did a quick search and found a nice tutorial: Scraping a blog with Anemone and MongoDB.
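A minimal Anemone sketch (the URL is illustrative):
require 'anemone'

# Crawl the site and hand every fetched page to your own processing code.
Anemone.crawl('http://www.example.com/') do |anemone|
  anemone.on_every_page do |page|
    puts page.url # replace with your parsing/storing logic
  end
end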
I've written probably a bajillion spiders and site analyzers and find that Ruby has some nice tools that should make this an easy process.
OpenURI makes it easy to retrieve pages.
URI.extract makes it easy to find links in pages. From the docs:
Description
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test#example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test#example.com"]
Simple, untested, logic to start might look like:
require "openuri"
require "uri"
urls_to_scan = %w[
http://www.example.com/page1
http://www.example.com/page2
]
loop do
break if urls_to_scan.empty?
url = urls_to_scan.shift
html = open(url).read
# you probably want to do something to make sure the URLs are not
# pointing outside the site you're walking.
#
# Something like:
#
# URI.extract(html).select{ |u| u[%r{^http://www\.example\.com}i] }
#
new_urls = URI.extract(html)
if (new_urls.any?)
urls_to_scan += new_urls
else
; # parse your file as data using the content in html
end
end
Unless you own the site you're crawling, you want to be kind and gentle: don't run as fast as possible, because it's not your pipe. Pay attention to the site's robots.txt file or risk being banned.
There are true web-crawler gems for Ruby, but the basic task is so simple I never bother with them. If you want to check out other alternatives, visit some of the links to the right for other questions on SO that touch on this subject.
If you need more power or flexibility, the Nokogiri gem makes short work of parsing HTML, allowing you to use CSS accessors to search for tags of interest. There are some pretty powerful gems for making it easy to grab pages such as typhoeus.
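For example, a minimal Nokogiri sketch (the CSS selector is an assumption):
require 'nokogiri'

doc = Nokogiri::HTML(html)         # html fetched earlier, e.g. via OpenURI
doc.css('a.data-file').each do |a| # hypothetical selector for the file links
  puts a['href']
end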
Finally, while ActiveRecord, which is recommended in some comments, is nice, finding documentation for using it outside of Rails can be difficult or confusing. I recommend using Sequel. It is a great ORM, very flexible, and well documented.
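And a minimal Sequel sketch along those lines (table and column names are illustrative):
require 'sequel'

DB = Sequel.sqlite('warehouse.db')  # the SQLite dev database mentioned in the question
DB.create_table? :data_rows do
  primary_key :id
  String :raw                       # one raw data row, before further parsing
end
DB[:data_rows].insert(raw: 'a parsed line goes here')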
I would start by taking a very close look at the gem called Mechanize before firing up any basic open-uri stuff, because it's built into Mechanize. It's a brilliant, fast, and easy-to-use gem for automating web crawling. Since your data format is pretty strange (at least compared to JSON, XML or HTML), I don't think you will get any use out of the built-in parser, but you could still take a look at it: it's called Nokogiri and is extremely smart as well. But in the end, after crawling and fetching the resources, you will probably have to go with some good old regular expression stuff.
Good luck!
