upload analogue with XSendFile? - ruby

Is there some way to use something similar to X-Sendfile for uploading files, e.g. saving a particular stream or parameter from the request to a file without loading it entirely into memory?
(In particular, with Apache 2 and Ruby FCGI.)

require 'open-uri'
CHUNK_SIZE = 8192
# "wb" so the data is written unchanged on platforms that translate newlines
File.open("local_filename.dat", "wb") do |w|
  open("http://some_file.url") do |r|
    w.write(r.read(CHUNK_SIZE)) while !r.eof?
  end
end

Apache's ModPorter seems to be the way.
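
The chunked copy loop above can be generalised to any IO-like source. The following is only a sketch: `save_stream`, `UPLOAD_CHUNK_SIZE` and the temp file name are made-up names, and the `StringIO` stands in for whatever object your server exposes for the request body. It does not show how Apache or FCGI hands you that object.

```ruby
require 'stringio'
require 'tmpdir'

UPLOAD_CHUNK_SIZE = 8192 # bytes per read; an arbitrary choice

# Copies an IO-like source to dest_path in fixed-size chunks, so only one
# chunk is held in memory at a time. Returns the number of bytes written.
def save_stream(source, dest_path, chunk_size = UPLOAD_CHUNK_SIZE)
  written = 0
  File.open(dest_path, 'wb') do |out|
    # read(n) returns nil once the source is exhausted
    while (chunk = source.read(chunk_size))
      written += out.write(chunk)
    end
  end
  written
end

# Stand-in for a request body (assumption: the real one responds to #read).
body = StringIO.new('x' * 20_000)
path = File.join(Dir.tmpdir, 'upload_sketch.dat')
puts save_stream(body, path)
```

On Ruby 1.9 and later, `IO.copy_stream(source, dest_path)` should do the same job without the explicit loop.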

Related

Parse huge file (10+gb) and write content in another one

I'm trying to use Sphinx Search Server to index a really huge file (around 14 GB).
The file is whitespace-separated, one entry per line.
To be able to use it with Sphinx, I need to provide an XML file to the Sphinx server.
How can I do it without killing my computer?
What is the best strategy? Should I try to split the main file into several smaller files? What's the best way to do it?
Note: I'm doing it in Ruby, but I'm totally open to other hints.
Thanks for your time.
I think the main idea would be to parse the main file line by line while generating the result XML, and to feed that XML to Sphinx every time it gets large enough. Rinse and repeat.
What parsing do you need to do? If the transformations are restricted to just one line of input at a time and are not too complicated, I would use awk instead of Ruby...
I hate it when people don't post their solution after asking a question, so I'll try not to be one of them. Hopefully it will help somebody.
I added a simple reader method to the File class, then used it to loop over the file with a chunk size of my choice. Quite simple actually, and it works like a charm with Sphinx.
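class File
  # New class method: opens file_path and yields it chunk by chunk.
  # The default is a real size (1 KB); passing nil down to read would
  # load the whole file in one go.
  def self.seq_read(file_path, chunk_size = 1024)
    open(file_path, "rb") do |f|
      f.each_chunk(chunk_size) do |chunk|
        yield chunk
      end
    end
  end
  # New instance method: yields successive chunks of chunk_size bytes
  def each_chunk(chunk_size = 1024)
    yield read(chunk_size) until eof?
  end
end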
Then just use it like this:
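source_path = "./my_very_big_file.txt"
CHUNK_SIZE = 10 * 1024 * 1024 # 10 MB (10.megabytes needs ActiveSupport)
File.seq_read(source_path, CHUNK_SIZE) do |chunk|
  # Note: a line that straddles a chunk boundary arrives split in two
  chunk.each_line do |line|
    ...
  end
end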
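
The line-by-line idea suggested earlier in the thread avoids the chunk-boundary problem, because `File.foreach` only ever holds one line in memory. Here is a rough sketch of it; `lines_to_xml` is a made-up name, and the `<entries>`/`<entry>`/`<field>` elements are placeholders rather than the schema Sphinx actually expects.

```ruby
require 'cgi'

# Converts a whitespace-separated text file into XML one line at a time.
# Element names are placeholders (assumption), not Sphinx's real format.
# Returns the number of entries written.
def lines_to_xml(input_path, output_path)
  count = 0
  File.open(output_path, 'w') do |out|
    out.puts '<entries>'
    File.foreach(input_path) do |line|
      fields = line.split
      next if fields.empty? # skip blank lines
      count += 1
      out.puts %(  <entry id="#{count}">)
      # escape &, < and > so the output stays well-formed
      fields.each { |f| out.puts "    <field>#{CGI.escapeHTML(f)}</field>" }
      out.puts '  </entry>'
    end
    out.puts '</entries>'
  end
  count
end
```

To feed Sphinx in batches, the same loop could start a new output file every N entries instead of writing a single one.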

Ruby RSS::Parser.to_s silently fails?

I'm using Ruby 1.8.7's RSS::Parser, part of stdlib. I'm new to Ruby.
I want to parse an RSS feed, make some changes to the data, then output it (as RSS).
The docs say I can use '#to_s', and it seems to work with some feeds, but not others.
This works:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://news.ycombinator.com/rss'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns expected output: XML text.
This fails:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://feeds.feedburner.com/devourfeed'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns nothing (empty quotes).
And yet, if I change the last line to:
p rss
I can see that the object is filled with all of the feed data. It's the to_s method that fails.
Why?
How can I get some kind of error output to debug a problem like this?
From what I can tell, the problem isn't in to_s, it's in the parser itself. Stepping way into the parser.rb code showed nothing being returned, so to_s returning an empty string is valid.
I'd recommend looking at something like Feedzirra.
Also, as an FYI, take a look at Ruby's OpenURI module (require 'open-uri') for easy retrieval of web assets such as feeds. OpenURI is simple but adequate for most tasks. Net::HTTP is lower level, so you have to write a lot more code to get the same functionality.
I had the same problem, so I started debugging the code. I think Ruby's rss library has a few too many required elements. The channel needs to have title, link, and description; if one is missing, to_s will fail.
The second feed in the example above is missing the description, which makes to_s fail...
I believe this is a bug, but I really don't understand the code, and I barely know Ruby, so who knows. It would seem natural to me that to_s would try its best even if some elements are missing.
Either way
rss.channel.description="something"
rss.to_s
will "work"
The problem lies in def have_required_elements?
Or in the
self.class::MODELS
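
To get a visible warning instead of a silent empty string, one option is to check the channel yourself before calling to_s. This is a sketch only: `missing_channel_elements` is a made-up helper, the list of required elements is taken from the answer above rather than from the library source, and the `Struct` is a stand-in for `rss.channel` so the example runs without fetching a feed.

```ruby
# Elements the answer above reports as required on the channel.
REQUIRED_CHANNEL_ELEMENTS = %w[title link description]

# Returns the names of required elements that are nil or blank on the
# given channel-like object (anything responding to title/link/description).
def missing_channel_elements(channel)
  REQUIRED_CHANNEL_ELEMENTS.select do |name|
    value = channel.public_send(name)
    value.nil? || value.to_s.strip.empty?
  end
end

# Stand-in for rss.channel (assumption: the real object has the same readers).
FakeChannel = Struct.new(:title, :link, :description)
channel = FakeChannel.new('Some feed', 'http://example.com/', nil)

missing = missing_channel_elements(channel)
warn "channel is missing: #{missing.join(', ')}" unless missing.empty?
```

With a parsed feed you would pass `rss.channel` instead of the stand-in, and fill in whatever is reported before calling `rss.to_s`.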

Preserve key order loading YAML from a file in Ruby

I want to preserve the order of the keys in a YAML file loaded from disk, processed in some way and written back to disk.
Here is a basic example of loading YAML in Ruby (v1.8.7):
require 'yaml'
configuration = nil
File.open('configuration.yaml', 'r') do |file|
  configuration = YAML::load(file)
  # at this point configuration is a hash with keys in an undefined order
end
# process configuration in some way
File.open('output.yaml', 'w+') do |file|
  YAML::dump(configuration, file)
end
Unfortunately, this will destroy the order of the keys in configuration.yaml once the hash is built. I cannot find a way of controlling what data structure is used by YAML::load(), e.g. alib's orderedmap.
I've had no luck searching the web for a solution.
Use Ruby 1.9.x. Earlier versions of Ruby do not preserve the insertion order of Hash keys, but 1.9 does.
If you're stuck using 1.8.7 for whatever reason (like I am), I've resorted to using active_support/ordered_hash. I know activesupport seems like a big include, but they've refactored it in later versions to where you pretty much only require the part you need in the file and the rest gets left out. Just gem install activesupport, and include it as shown below. Also, in your YAML file, be sure to use an !!omap declaration (and an array of Hashes). Example time!
# config.yml #
months: !!omap
  - january: enero
  - february: febrero
  - march: marzo
  - april: abril
  - may: mayo
Here's what the Ruby behind it looks like.
# loader.rb #
require 'yaml'
require 'active_support/ordered_hash'
# Load up the file into a Hash
config = File.open('config.yml','r') { |f| YAML::load f }
# So long as you specified an !!omap, this is actually a
# YAML::PrivateClass, an array of Hashes
puts config['months'].class
# Parse through its value attribute, stick results in an OrderedHash,
# and reassign it to our hash
ordered = ActiveSupport::OrderedHash.new
config['months'].value.each { |m| ordered[m.keys.first] = m.values.first }
config['months'] = ordered
I'm looking for a solution that allows me to recursively dig through a Hash loaded from a .yml file, look for those YAML::PrivateClass objects, and convert them into an ActiveSupport::OrderedHash. I may post a question on that.
Someone else ran into the same issue. There is an ordered hash gem. Note that what it gives you is not a plain Hash but a subclass of Hash. You might give it a try, but if you run into problems using it with YAML, you should consider upgrading to Ruby 1.9.
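
For anyone who can move to Ruby 1.9 or later, here is a small sketch of the load, modify, dump round trip keeping key order. The keys and values are made-up sample data, and a string is used in place of the configuration file to keep the example self-contained.

```ruby
require 'yaml'

# Sample data standing in for configuration.yaml; keys are deliberately
# not in alphabetical order.
source = "zebra: 1\napple: 2\nmango: 3\n"

config = YAML.load(source)
config['apple'] = 20 # changing a value does not move the key

puts config.keys.inspect
puts YAML.dump(config)
```

Keys added during processing are appended at the end, since Hash order is insertion order.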

HTML to Plain Text with Ruby?

Is there anything out there to convert html to plain text (maybe a nokogiri script)? Something that would keep the line breaks, but that's about it.
If I write something on googledocs, like this, and run that command, it outputs (removing the css and javascript), this:
\n\n\n\n\nh1. Test h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$!
So the formatting's all messed up. I'm sure someone has solved the details like these somewhere out there.
Actually, this is much simpler:
require 'rubygems'
require 'nokogiri'
puts Nokogiri::HTML(my_html).text
You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.
You could start with something like this:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
doc = Nokogiri::HTML(open(uri))
doc.css('script, link').each { |node| node.remove }
puts doc.css('body').text.squeeze(" \n")
Is simply stripping tags and excess line breaks acceptable?
html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
First strips tags, second takes duplicate line breaks down to one, third removes line breaks at the start and end of the string.
require 'open-uri'
require 'nokogiri'
url = 'http://en.wikipedia.org/wiki/Wolfram_language'
doc = Nokogiri::HTML(open(url))
text = ''
doc.css('p,h1').each do |e|
  text << e.content
end
puts text
This extracts just the desired text from a webpage (most of the time). If, for example, you also wanted to include links, add a to the CSS selector passed to doc.css (e.g. 'p,h1,a').
I'm using the sanitize gem.
(" " + Sanitize.clean(html).gsub("\n", "\n\n").strip).gsub(/^ /, "\t")
It does drop hyperlinks though, which may be an issue for some applications. But I'm doing NLP text analysis, so this is perfect for my needs.
If you are using Rails, you can do this:
html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
puts ActionView::Base.full_sanitizer.sanitize(html)
You want hpricot_scrub:
http://github.com/UnderpantsGnome/hpricot_scrub
You can specify which tags to strip / keep in a config hash.
If it's in Rails, you may use this:
html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe
Building slightly on Matchu's answer, this worked for my (very similar) requirements (squish comes from ActiveSupport):
html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish
Hope it makes someone's life a bit easier :-)
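
Since several answers lose the line breaks, here is a dependency-free sketch that turns `<br>` and a few closing block tags into newlines before stripping the rest. `html_to_text` and `BLOCK_BREAK` are made-up names, the list of tags treated as line breaks is an arbitrary choice, and regex-based stripping like this will trip over malformed or unusual markup where a real parser such as Nokogiri would not.

```ruby
require 'cgi'

# Tags treated as line breaks (an arbitrary, incomplete list).
BLOCK_BREAK = %r{<br\s*/?>|</(?:p|div|h[1-6]|li|tr)>}i

def html_to_text(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, '') # drop script/style bodies
  text = text.gsub(BLOCK_BREAK, "\n")                    # keep line breaks
  text = text.gsub(/<[^>]*>/, '')                        # strip remaining tags
  text = CGI.unescapeHTML(text)                          # &amp; -> & and friends
  text = text.gsub(/[ \t]+/, ' ')                        # collapse runs of spaces
  text = text.gsub(/ ?\n ?/, "\n")                       # trim spaces around breaks
  text.gsub(/\n{2,}/, "\n").strip                        # collapse blank lines
end

html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
puts html_to_text(html)
```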

How to load a Web page and search for a word in Ruby

How do I load a web page and search for a word in Ruby?
Here's a complete solution:
require 'open-uri'
if open('http://example.com/').read =~ /searchword/
  # do something
end
For something simple like this, I would prefer to write a couple of lines of code instead of using a full-blown gem. Here is what I would do:
require 'net/http'
# let's take the url of this page
uri = 'http://stackoverflow.com/questions/1878891/how-to-load-a-web-page-and-search-for-a-word-in-ruby'
response = Net::HTTP.get_response(URI.parse(uri)) # => #<Net::HTTPOK 200 OK readbody=true>
# match the word Ruby
/Ruby/.match(response.body) # => #<MatchData "Ruby">
I would only go down the path of using a gem if I needed to do more than this and one of the gems already implemented the algorithm I needed.
I suggest using Nokogiri or Hpricot to open and parse HTML documents. If you need something simple that doesn't require parsing the HTML, you can just use the open-uri library built into most Ruby distributions. If you need something more complex for posting forms (or logging in), you can elect to use Mechanize.
Nokogiri is probably the preferred solution post _why, but both are about as simple as this:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri(open("http://www.example.com"))
if doc.inner_text.match(/someword/)
  puts "got it"
end
Both also allow you to search using xpath-like queries or CSS selectors, which allows you to grab items out of all divs with class=foo, for example.
Fortunately, it's not that big of a leap to move between open-uri, nokogiri and mechanize, so use the first one that meets your needs, and revise your code once you realize you need the capabilities of one of the other libraries.
You can also use the mechanize gem, with something similar to this:
require 'rubygems'
require 'mechanize'
# newer mechanize releases use Mechanize.new instead of WWW::Mechanize.new
mech = WWW::Mechanize.new.get('http://example.com') do |page|
  if page.body =~ /mysearchregex/
    puts "found it"
  end
end
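
If you want to match a whole word rather than any substring, the search itself can be separated from the fetching. A rough sketch, with `fetch_body` and `contains_word?` as made-up helper names; the usage line is left as a comment because it needs network access.

```ruby
require 'net/http'
require 'uri'

# Fetches the body of a page. Net::HTTP.get does not follow redirects.
def fetch_body(url)
  Net::HTTP.get(URI.parse(url))
end

# True if word occurs in text as a whole word, ignoring case.
# Regexp.escape keeps characters like "." or "+" in word literal.
def contains_word?(text, word)
  !!(text =~ /\b#{Regexp.escape(word)}\b/i)
end

# Usage (needs network access):
#   puts "found it" if contains_word?(fetch_body('http://example.com/'), 'searchword')
```

Note that this searches the raw HTML, so a word that only appears inside a tag or attribute also counts as a match.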
