I have a small application that processes emails downloaded from an IMAP server with fetchmail. The processing consists of finding base64-encoded attachments that contain an XML file.
Here is the code (somewhat stripped):
def extract_data_from_mailfile(mailfile)
  begin
    mail = TMail::Mail.load(mailfile)
  rescue
    return nil
  end
  bodies_found = []
  if mail.multipart? then
    mail.parts.each do |m|
      bodies_found << m.body
    end
  end
  ## Let's parse the parts we found in the mail to see if one of them
  ## looks XML-ish. Hacky but works for now.
  ## was XML.
  bodies_found.each do |body|
    if body =~ /^<\?XML /i then
      return body
    end
  end
  return nil # Nothing found.
end
This works great, but it breaks on large XML files (typically mail files larger than 600k).
>> mail.parts[1].body.size
=> 487424 <-- should have been larger - doesn't include the end of the file
Base64-decoding doesn't happen automatically either. But this is when I try to run decode manually:
>> Base64::decode64(mail.parts[1].body)
[...] ll="SMTP"></Sendt><Sendt"
That's part of the XML-file, but it has been clipped.
Is there any way to get the entire attachment? Any tips?
I see your code breaks out of the loop at the first XML fragment it finds. Perhaps the larger messages divide their XML into smaller chunks inside the same multipart MIME message? You would then return an array of bodies and concatenate them:
mail.parts[1].body[0] + mail.parts[1].body[1]
(PS. It's a long shot, I haven't tried this)
Related
I am trying to generate XML, but the loop isn't breaking. Here is a part of the code:
@key = 0
@cont.each do |pr|
  xml.product {
    @key += 1
    puts @key.to_s
    begin
      @main = Nokogiri::HTML(open(@url+pr['href'], "User-Agent" => "Ruby/#{RUBY_VERSION}","From" => "foo@bar.invalid", "Referer" => "http://www.ruby-lang.org/"))
    rescue
      puts "rescue"
      next
    end
    puts pr['href']
    puts @key.to_s
    break # this break doesn't work
    #something else
  }
end
Most interesting is that in the final generated XML file, break worked. The file contains only one product, but on the console @key was printed for every item, which means the each loop doesn't break.
Could it be a Nokogiri XML-specific error, because of open brackets in the head of the loop?
In general, I think the way you're going about generating the XML is confused. Don't convolute your code any more than necessary. Instead of starting to generate some XML and then aborting inside the block because you can't find the page you want, grab the pages you want first, then start processing.
I'd move the begin/rescue block outside the XML generation. Its existence inside the XML generation block results in poor logic and questionable practices of using next and break. Instead I'd recommend something like this untested code:
@main = []
@cont.each do |pr|
  begin
    @main << Nokogiri::HTML(
      open(@url + pr['href'])
    )
  rescue
    puts 'rescue'
    next
  end
end
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root {
    xml.products {
      @main.each do |m|
        xml.product {
          xml.id_ m.at('id').text
          xml.name m.at('name').text
        }
      end
    }
  }
end
puts builder.to_xml
Which makes it easy to see that the code is keying off being able to retrieve a page.
This code is untested because we have no idea what your input values are or what your output should look like. Having valid input, expected output and a working example of your code that demonstrates the problem is essential if you want help debugging a problem with your code.
The use of @url + pr['href'] isn't generally a good idea. Instead use the URI class to build up the URL for you. URI handles encoding and ensures the URI is valid.
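As a rough illustration of that suggestion, here is a minimal sketch using URI.join from Ruby's standard library. The helper name build_url and the example.com addresses are made up for the example; they are not from the question. Note that URI.join resolves a relative href against the base and raises on a malformed URI, which is the validity check mentioned above.

```ruby
require 'uri'

# Hypothetical helper: resolves href against base, or returns nil
# when either piece is not a valid URI.
def build_url(base, href)
  URI.join(base, href)
rescue URI::InvalidURIError
  nil
end

base = 'http://www.example.com/catalog/' # assumed base URL

# A relative href is appended to the base path.
puts build_url(base, 'items/42')
# An absolute path replaces the base path.
puts build_url(base, '/items/42')
```

Plain string concatenation would turn the second case into a doubled slash, whereas URI.join applies the usual relative-reference rules.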
def test_post_with_file filename = 'test01.xml'
  File.open(filename) do |file|
    response = @http_client.post(url, {'documents'=>file})
  end
end
How do I modify the above method to handle a multi-file-array post/upload?
file_array = ['test01.xml', 'test02.xml']
You mean like this?
def test_post_with_file(file_array=[])
  file_array.each do |filename|
    File.open(filename) do |file|
      response = @http_client.post(url, {'documents'=>file})
    end
  end
end
I was having the same problem and finally figured out how to do it:
def test_post_with_file(file_array)
  form = file_array.map { |n| ['documents[]', File.open(n)] }
  response = @http_client.post(@url, form)
end
You can see in the docs how to pass multiple values: http://rubydoc.info/gems/httpclient/HTTPClient#post_content-instance_method .
In the "body" row, I tried without success to use the 4th example. Somehow HttpClient just decides to apply .to_s to each hash in the array.
Then I tried the 2nd solution, and it wouldn't work either because only the last value is kept by the server. But I discovered after some tinkering that the second solution works if the parameter name includes the square brackets to indicate that there are multiple values in an array.
Maybe this is a bug in Sinatra (that's what I'm using), maybe the handling of such data is implementation-dependent, maybe the HttpClient doc is outdated/wrong. Or a combination of these.
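To make the difference between an array of pairs and a hash more concrete, here is a small sketch that uses only the standard library. It encodes plain file names rather than uploading files, so it is not what HTTPClient sends for a multipart post; it only shows that repeated keys survive in an array of pairs and collapse in a hash. How the server interprets the `[]` suffix is framework-dependent, as noted above.

```ruby
require 'uri'

file_array = ['test01.xml', 'test02.xml']

# An array of [name, value] pairs can repeat the same parameter name.
form = file_array.map { |name| ['documents[]', name] }
puts URI.encode_www_form(form)
# prints: documents%5B%5D=test01.xml&documents%5B%5D=test02.xml

# A hash can hold each key only once, so only the last value remains.
puts form.to_h.inspect
```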
BACKGROUND
I have an XML response from a device's REST API, and I need to pick out a particular key/value pair. Currently I use HTTParty's get to retrieve the XML and then pick out the text. I think I am doing it the hard way and there must be a much easier method.
QUESTIONS
Is there an easier way to accomplish this that is simpler to understand and more reusable?
XML looks like this. I am trying to pick out the formatted="Off" key/value pair.
<?xml version="1.0" encoding="UTF-8"?><properties><property id="ST" value="0" formatted="Off" uom="on/off"/></properties>
Code I am currently using:
require 'httparty'
class Rest
  include HTTParty
  format :xml
end
listen_for (/status (.*)/i) do |input|
  command_status input.downcase.strip
  request_completed
end
def command_status(input)
  inputst = @inputSt[input]
  unless inputst.nil?
    status = status_input(inputst)
    say "#{input} is #{status}"
  else
    say "I'm sorry, but I am not programmed to check #{input} status."
  end
end
def status_input(input)
  # Battery-operated devices do not continuously report status, so this will be blank until the first change after an ISY reboot or power cycle.
  resp = Rest.get(@isyIp + input, @isyAuth).inspect
  resp = resp.gsub(/^.*tted"=>"/, "")
  status = resp.gsub(/", "uom.*$/, "")
  return status.downcase.strip
end
I figured out how to parse the XML into a HASH using parsed_response and understanding the resulting HASH depth. Thanks for the tip, Dave!
def status_input(input)
  # Battery-operated devices do not continuously report status, so this will be blank until the first change after an ISY reboot or power cycle.
  resp = Hash[Rest.get(@isyIp + input, @isyAuth).parsed_response]
  status = resp["properties"]["property"]["formatted"]
  return status.downcase.strip
end
Thanks for your help!
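For comparison, the same attribute can be read without going through a hash at all. The sketch below swaps in REXML (shipped with Ruby) instead of HTTParty's parsed_response and works on the sample XML from the question; the method name formatted_status is invented for the example, and fetching the XML from the device is left out.

```ruby
require 'rexml/document'

# Sample response copied from the question.
xml = '<?xml version="1.0" encoding="UTF-8"?><properties><property id="ST" value="0" formatted="Off" uom="on/off"/></properties>'

# Hypothetical helper: returns the lower-cased "formatted" attribute of the
# first <property> element, or nil when the element is missing.
def formatted_status(xml)
  root = REXML::Document.new(xml).root
  prop = root && root.elements['property']
  return nil unless prop
  prop.attributes['formatted'].to_s.downcase.strip
end

puts formatted_status(xml) # prints: off
```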
post '/upload' do
  unless params[:file] && (tmpfile = params[:file][:tempfile]) && (name = params[:file][:filename])
    return haml(:upload)
  end
  time = Time.now.to_s
  time.gsub!(/\s/, '')
  name = time + name
  while blk = tmpfile.read(65536)
    File.open(File.join(Dir.pwd,"public/uploads", name), "wb") { |f| f.write(tmpfile.read) }
  end
  'success'
end
Everything goes where expected, but the files end up corrupted.
This bit looks really funky:
while blk = tmpfile.read(65536)
  File.open(File.join(Dir.pwd,"public/uploads", name), "wb") { |f| f.write(tmpfile.read) }
end
I'm guessing you're trying to read your tempfile a 65536-byte block at a time, and then write those blocks successively to your destination file. But you never write blk, which is the first block you read; you write the rest of the file (tmpfile.read) instead. And even if this loop did write blocks like it should, it opens the file anew for each block, overwriting the old contents! Anyway, I suspect you meant something like this:
File.open(File.join(Dir.pwd,"public/uploads", name), "wb") do |f|
  # Use the same variable name as the route (tmpfile), and write each block as it is read
  while(blk = tmpfile.read(65536))
    f.write(blk)
  end
end
That said, if you've got the file as a temp file (presumably already on your local file system), maybe all you need to do is move that file? It'll go way faster if that's the case - if the source and destination are on the same disk, it's just a matter of swapping some file system pointers, rather than copying all that data.
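A minimal sketch of the move idea, assuming the upload is an ordinary Tempfile on the local disk (which is what params[:file][:tempfile] usually is under Rack, though that is worth checking in your setup). The helper name move_upload is made up for the example; FileUtils.mv is from the standard library and falls back to copy-and-delete when source and destination are on different file systems.

```ruby
require 'fileutils'
require 'tempfile'
require 'tmpdir'

# Hypothetical helper: moves an uploaded tempfile to dest_dir/name
# and returns the destination path.
def move_upload(tmpfile, dest_dir, name)
  tmpfile.close # flush pending writes; the file stays on disk
  dest = File.join(dest_dir, name)
  FileUtils.mv(tmpfile.path, dest)
  dest
end
```

Inside the route this would be called as something like move_upload(tmpfile, File.join(Dir.pwd, "public/uploads"), name), replacing the read/write loop.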
Hope that helps!
The code opens and replaces the file during every iteration of the loop, which causes part of the problem. The code also reads the tmpfile into blk then throws that data away. Time.now.to_s contains colons, which is the path separator on Mac OS X, and could cause a problem on OS X. The user-supplied filename could contain some bad stuff like .. which may allow users to overwrite files. Try this instead:
require 'pathname'
require 'zaru'
require 'active_support/core_ext/object/try' # Object#try comes from ActiveSupport, not core Ruby
post '/upload' do
  unless tmpfile = params[:file].try(:[], :tempfile)
    return haml(:upload)
  end
  name = Zaru.sanitize!("#{Time.now.to_i}#{params[:file][:filename]}")
  Pathname.pwd.join("public/uploads", name).open("wb") do |f|
    while blk = tmpfile.read(65536)
      f.write(blk)
    end
  end
  'success'
end
You should also make sure that the filename doesn't end in something nefarious, like .js or .css, which could be exploited.
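One way to approach that check is an allowlist of extensions rather than a list of bad ones. This is only a sketch: the constant, the helper name, and the chosen extensions are assumptions for illustration, and an extension check does not replace inspecting the content itself.

```ruby
# Hypothetical allowlist; adjust it to the file types the application expects.
ALLOWED_EXTENSIONS = %w[.xml .txt .csv].freeze

# Returns true only when the final extension of the file name is allowed.
def allowed_upload_name?(filename)
  base = File.basename(filename.to_s)
  ALLOWED_EXTENSIONS.include?(File.extname(base).downcase)
end

puts allowed_upload_name?('report.XML') # prints: true
puts allowed_upload_name?('evil.js')    # prints: false
```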
I am trying to build a script that gives me feedback about progress on the command line. Currently it just prints a new line for every n-th progress step. The console looks like this:
10:30:00 Parsed 0 of 1'000'000 data entries (0 %)
10:30:10 Parsed 1'000 of 1'000'000 data entries (1 %)
10:30:20 Parsed 2'000 of 1'000'000 data entries (2 %)
[...] etc [...]
11:00:00 Parsed 1'000'000 of 1'000'000 data entries (100 %)
Even though the timestamps and progress numbers are fictional, you should see the problem.
What I want is to do it "wget-style", with a progress bar updated in place on the command line, with the line width in mind.
First I thought about using curses, because I had tried it while learning C, but I never got comfortable with it, and I think it is bloated for the purpose of manipulating just a few lines. I also don't need any coloring, and most other libraries I found seemed to be specialized for coloring.
Can someone help me with this problem?
A while ago I created a class for a status text in which you can change part of the text within the line. It might be useful to you.
The class, with an example of its use:
class StatusText
  def initialize(parms={})
    @previous_size = 0
    @stream = parms[:stream]==nil ? $stdout : parms[:stream]
    @parms = parms
    @parms[:verbose] = true if parms[:verbose] == nil
    @header = []
    @on_change = nil
    pushHeader(@parms[:base]) if @parms[:base]
  end

  def setText(complement)
    text = "#{@header.join(" ")}#{@parms[:before]}#{complement}#{@parms[:after]}"
    printText(text)
  end

  def cleanAll
    printText("")
  end

  def cleanContent
    printText "#{@parms[:base]}"
  end

  def nextLine(text=nil)
    if @parms[:verbose]
      @previous_size = 0
      @stream.print "\n"
    end
    if text!=nil
      line(text)
    end
  end

  def line(text)
    printText(text)
    nextLine
  end

  #Callback in the case the status text changes
  #might be useful to log the status changes
  #The callback function receives the new text
  def onChange(&block)
    @on_change = block
  end

  def pushHeader(head)
    @header.push(head)
  end

  def popHeader
    @header.pop
  end

  def setParm(parm, value)
    @parms[parm] = value
    if parm == :base
      #Array has no last= method, so replace the last header by index
      @header[-1] = value unless @header.empty?
    end
  end

  private

  def printText(text)
    #If not verbose leave without printing
    if @parms[:verbose]
      if @previous_size > 0
        #go back
        @stream.print "\033[#{@previous_size}D"
        #clean
        @stream.print(" " * @previous_size)
        #go back again
        @stream.print "\033[#{@previous_size}D"
      end
      #print
      @stream.print text
      @stream.flush
      #store size
      @previous_size = text.gsub(/\e\[\d+m/,"").size
    end
    #Call callback if existent
    @on_change.call(text) if @on_change
  end
end

a = StatusText.new(:before => "Evolution (", :after => ")")
(1..100).each {|i| a.setText(i.to_s); sleep(1)}
a.nextLine
Just copy and paste it into a Ruby file and try it out. I use escape sequences to reposition the cursor.
The class has lots of features I needed at the time (like piling up elements in the status bar) that you can use to complement your solution, or you can just clean it up to its core.
I hope it helps.
In the meantime I found some gems that give me a progress bar. I will list them here:
ProgressBar from paul at github
a more recent version from pgericson at github
ruby-progressbar from jfelchner at github
simple_progressbar from bitboxer at github
I tried the one from pgericson and the one from jfelchner. They both have pros and cons, but both fit my needs. I will probably fork and extend one of them in the future.
I hope this helps others find more quickly what I spent months searching for.
Perhaps replace your output with this:
print "Progress #{progress_var}%\r"
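Building on the carriage-return trick, here is a small sketch of a bar that redraws itself on one line. The helper progress_line and its default width of 40 columns are assumptions for the example; it does not query the real terminal width, and it relies on the terminal honoring "\r".

```ruby
# Hypothetical helper: builds one progress line with a bar `width` columns wide.
def progress_line(done, total, width = 40)
  ratio = total.zero? ? 1.0 : done.to_f / total
  filled = (ratio * width).round
  format("[%s%s] %3d%%", "#" * filled, " " * (width - filled), (ratio * 100).round)
end

total = 50
(0..total).each do |i|
  # "\r" returns the cursor to the start of the line, so each print
  # overwrites the previous bar instead of adding a new line.
  $stdout.print "\r#{progress_line(i, total)}"
  $stdout.flush
  sleep 0.01 # stands in for real work
end
$stdout.puts
```

Because every redraw has the same length, no stale characters are left behind, so no extra clearing step is needed.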