I have encountered an issue processing a file upload in Ruby when the filename contains characters that appear to be percent encoded.
Sample File Name: Test %AF.jpg
Sample Form
<form id="upload" enctype="multipart/form-data"
action="/object/upload" accept-charset="UTF-8" method="post">
<label for="file">File:</label>
<input type="file" name="file" id="file">
</form>
Handle Upload
puts params[:file].original_filename
puts params[:file].tempfile
puts params[:file].headers
Sample Output
Test �.jpg
#<File:0x00005618f8a3e4d8>
Content-Disposition: form-data; name="file"; filename="Test %AF.jpg"
Content-Type: image/jpeg
Problem Summary
params[:file].original_filename has already been percent-decoded into an invalid byte, so the literal "%AF" string cannot be restored from it.
Here is my solution to the problem above. The original filename is extracted from the headers attribute of the uploaded file.
def uploaded_filename
filename = params[:file].original_filename
unless filename.valid_encoding?
begin
m = /filename="([^"]+)"/.match(params[:file].headers)
filename = m[1] if m
rescue StandardError
filename = filename.encode('UTF-8', invalid: :replace, undef: :replace)
end
end
filename
end
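The fallback logic above can be tried outside Rails with plain strings. This is only a sketch: recover_filename is a hypothetical helper name, the header text is copied from the sample output, and it uses String#scrub as a last resort instead of encode.

```ruby
# Hypothetical helper (not part of Rails): choose a usable filename from the
# decoded name and the raw part headers.
def recover_filename(decoded_name, raw_headers)
  return decoded_name if decoded_name.valid_encoding?

  # Prefer the undecoded name exactly as it appears in Content-Disposition.
  match = /filename="([^"]+)"/.match(raw_headers.to_s)
  return match[1] if match

  # Last resort: replace the invalid bytes so the string is safe to use.
  decoded_name.scrub('?')
end

headers = %(Content-Disposition: form-data; name="file"; filename="Test %AF.jpg"\r\n) +
          %(Content-Type: image/jpeg\r\n)
decoded = "Test \xAF.jpg" # percent-decoding "%AF" yields a byte that is invalid UTF-8

puts decoded.valid_encoding?            # false
puts recover_filename(decoded, headers) # Test %AF.jpg
puts recover_filename(decoded, '')      # Test ?.jpg
```

Unlike the method above, this version still returns a valid string when the header has no filename parameter.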
In a script that sends emails through Net::SMTP, I need to figure out how to encode the email body properly so that it supports accented characters. I've separated the email into 3 parts (headers, body and attachment), following this tutorial: https://www.tutorialspoint.com/ruby/ruby_sending_email.htm
Fixing this issue for the Subject header field wasn't a big deal:
require 'base64'
MARKER = 'FOOBAR'
def self.headers
<<~EOF
From: someemail@domain.tld
To: anotheremail@domaim.tld
# Base64 encoded UTF-8
Subject: =?UTF-8?B?#{Base64.strict_encode64('Accentuated characters supportés')}?=
Date: #{Time.now.to_s}
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=#{MARKER}
--#{MARKER}
EOF
end
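The Subject trick above is an RFC 2047 "B" encoded-word. A minimal sketch of it as a reusable helper follows; encoded_word and decode_word are hypothetical names, and the sketch assumes the encoded result stays under the 75-character limit for a single encoded-word, so longer subjects would need to be split.

```ruby
require 'base64'

# Hypothetical helper: wrap a header value as an RFC 2047 "B" encoded-word.
def encoded_word(text)
  "=?UTF-8?B?#{Base64.strict_encode64(text.encode('UTF-8'))}?="
end

# Reverse operation, handy for checking what a mail client should display.
def decode_word(word)
  payload = word[/\A=\?UTF-8\?B\?(.+)\?=\z/i, 1]
  Base64.strict_decode64(payload).force_encoding('UTF-8')
end

subject = encoded_word('Accentuated characters supportés')
puts "Subject: #{subject}"
puts decode_word(subject)
```

This only applies to header fields; the body needs its own charset declaration, which is a separate mechanism.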
I was tempted to reproduce the same logic for the email body, without any success. I've tried several headers, such as Language, Content-Language and Content-Transfer-Encoding, again without any success. Using Ruby's .encode!('utf-8') was also ineffective.
The only working solution I can think of would be to send HTML-encoded characters: using &eacute; instead of é inside an HTML block. I'd like to avoid this solution, though, as I want to improve my understanding of encoding issues.
Does anyone have a suggestion about this issue?
Here's my code so far, if it can help anyone:
require 'net/smtp'
module Reports
module SMTP
MARKER = 'FOOBAR'
def self.send_report(file_path)
file_content = File.binread(file_path)
encoded_content = [file_content].pack('m') # Base64
mail_content = headers + body + attachment(file_path, encoded_content)
begin
Net::SMTP.start('my.smtp.srv', 25, 'HELO_fqdn', 'username', 'p455w0rD', :plain) do |smtp|
smtp.send_message(mail_content, 'from@domain.tld', ['to1@domain.tld', 'to2@domain.tld'])
end
rescue => e
puts e.inspect, e.backtrace
end
end
def self.headers
# see above
end
def self.body
<<~EOF
Content-Type: text/html
Content-Transfer-Encoding:utf8
Here's a franglish text with some characters accentués
--#{MARKER}
EOF
end
def self.attachment(file_path, encoded_content)
<<~EOF
Content-Type: multipart/mixed; name = #{file_path}
Content-Transfer-Encoding:base64
Content-Disposition: attachment; filename = #{file_path}
#{encoded_content}
--#{MARKER}--
EOF
end
end
end
Note: these emails are correctly decoded by ProtonMail web clients, but our company's web client (OBM) displays neither the accented characters nor the attachment properly.
Adding ;charset="utf-8" to the Content-Type header of the body part fixed my problem.
Content-Type: text/html;charset="utf-8"
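For reference, here is a sketch of a body part with the charset declared. html_body_part is a hypothetical helper. Note one assumption that goes beyond the fix above: the sketch swaps in base64 as the Content-Transfer-Encoding, because utf8 is not one of the defined values (7bit, 8bit, binary, quoted-printable, base64).

```ruby
require 'base64'

# Hypothetical helper: build the HTML body part of the multipart message.
# The charset parameter tells the client how to decode the bytes; base64
# keeps the transported text pure ASCII (an assumption of this sketch).
def html_body_part(html, boundary)
  [
    'Content-Type: text/html; charset="utf-8"',
    'Content-Transfer-Encoding: base64',
    '', # blank line separating the part headers from the content
    Base64.encode64(html.encode('UTF-8')).chomp,
    "--#{boundary}",
    ''
  ].join("\n")
end

puts html_body_part("Here's a franglish text with some characters accentués", 'FOOBAR')
```

The blank line between the part headers and the content is required, otherwise the client treats the text as more headers.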
I have made my own Jekyll plugin which gives some text special CSS (a spoiler hider).
This is my code:
class Spoiler < Liquid::Tag
  def initialize(tag_name, input, tokens)
    super
    @input = input
  end

  def render(context)
    # Close the wrapper with </div> rather than a second opening <div>
    "<div class='spoiler'>#{@input}</div>"
  end
end
Liquid::Template.register_tag('spoiler', Spoiler)
Here is an example of how I want to use it in my Markdown posts:
---
layout: post
title: "testing file"
date: 2019-09-25
category: article
---
aaaaaaaaaaa {% spoiler secret text %} bbbbbbbbbbbb
but the tag shows up on the page as literal text instead of an HTML element.
When I look at the generated source code, the text looks like this:
<p>aaaaaaa &lt;div class='spoiler'&gt;secret text &lt;/div&gt; bbbbbbbb</p>
What should I do to make the Jekyll plugin generate an HTML element instead of text?
PS: If I manually replace &lt; by < and &gt; by >, it works fine.
Technically, every block of text separated by a blank line gets rendered into an HTML <p> element.
To avoid generating <p> tags automatically, explicitly wrap the lines in a <div>:
<div>
aaaaaaaaaaa {% spoiler secret text %} bbbbbbbbbbbb
</div>
If I have the below HTML:
<a class="name" title="file_name" href="/somelink">NAMEOFFILE</a>
<div class="profile">
<img class="FFVAD" decoding="auto" style="" sizes="496px" src="https://websitename.com/054a89a69181e68399c756d746f3b996/followme.jpg">
</div>
How do I use Watir to download the image and save it under the link's title?
So the file followme.jpg would be downloaded and saved as file_name.jpg.
Apologies, I figured this one out in the end. I stored the title of the element into a string, then used the string when calling the file write.
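require 'open-uri'
@image_src = @browser.div(:class => "profile").image(:class => "FFVAD").src
# Use the link's title attribute as the base name of the saved file
@userimage = @browser.link(:class => "name").title
@filename = "./folder/#{@userimage}.jpg"
File.open(@filename, 'wb') do |f|
  # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3+
  f.write URI.open(@image_src).read
end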
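Building the target path can be separated from the browser and download steps. The sketch below uses a hypothetical helper, target_path, which is not part of Watir. It assumes you want to keep the extension from the image URL and replace characters that are awkward in file names.

```ruby
require 'uri'

# Hypothetical helper: build the local path for a downloaded image from the
# link title and the image URL.
def target_path(title, image_src, dir = './folder')
  ext = File.extname(URI.parse(image_src).path) # e.g. ".jpg"
  ext = '.jpg' if ext.empty?                    # assumed default when the URL has no extension
  safe_title = title.strip.gsub(/[^\w.-]+/, '_') # replace spaces, slashes, etc.
  File.join(dir, "#{safe_title}#{ext}")
end

puts target_path('file_name', 'https://websitename.com/054a89a69181e68399c756d746f3b996/followme.jpg')
# ./folder/file_name.jpg
```

The returned string can then be passed to File.open in place of the hard-coded .jpg path.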
To extract URLs I am using the following:
require 'open-uri'
# URI.extract expects a String, so read the response body first
html = URI.open('http://lab/links.html').read
urls = URI.extract(html)
This works great.
Now I need to extract a list of URLs without the http or https prefix, which are separated by <br > tags. Since there is no http or https prefix, URI.extract doesn't work.
domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php
Each unprefixed URL is separated from the next by a <br > tag.
I have been looking at the question "Nokogiri Xpath to retrieve text after <BR> within <TD> and <SPAN>" but couldn't get it to work.
Output
domain1.com/index.html
domain2.com/home/~john/index.html
domain3.com/a/b/c/d/index.php
Intermediate solution
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://lab/noprefix_domains.html"))
doc.search('br').each do |n|
n.replace("\n")
end
puts doc
I still need to strip out the rest of the HTML tags (!DOCTYPE, html, body, p)...
Solution
str = ""
doc.traverse { |n| str << n.to_s if (n.name == "text" or n.name == "br") }
puts str.split /\s*<\s*br\s*>\s*/
Thanks.
Assuming that you already have a method to extract the example string you showed in your question, you can use split on the string:
str = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
str.split /\s*<\s*br\s*>\s*/
#=> ["domain1.com/index.html",
# "domain2.com/home/~john/index.html",
# "domain3.com/a/b/c/d/index.php"]
This will split the string at every <br> tag. It will also remove whitespace before and after the <br> and allow for whitespace inside the <br> tag, e.g. <br > or < br>. If you need to handle self-closing tags, too (e.g. <br />), use this regex instead:
/\s*<\s*br\s*\/?\s*>\s*/
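Putting both variants together, a small sketch might look like the following. split_on_br and BR_TAG are hypothetical names. The sketch also adds case-insensitive matching and drops empty entries, which is an assumption about what is wanted when the text starts or ends with a <br> tag.

```ruby
# Matches <br>, <br >, < br>, <br/> and <br />, in any letter case,
# together with the whitespace around the tag.
BR_TAG = %r{\s*<\s*br\s*/?\s*>\s*}i

# Hypothetical helper: split text on <br> tags and drop empty pieces.
def split_on_br(text)
  text.split(BR_TAG).map(&:strip).reject(&:empty?)
end

str = 'domain1.com/index.html<br >domain2.com/home/~john/index.html<BR />domain3.com/a/b/c/d/index.php'
puts split_on_br(str)
# domain1.com/index.html
# domain2.com/home/~john/index.html
# domain3.com/a/b/c/d/index.php
```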
I need to truncate some data received from a URI.parse request. It is full of HTML and other data; the result at the end is what I need.
Here's the string (abbreviated): ' junk"Result">Q8:0;junk
What is the best way to strip the extra stuff from the string so that I can split the data I need into variables?
Thanks in advance,
Philip
pabbott#cpak.com
I would recommend using Nokogiri to extract the value from the Result span:
require 'nokogiri'
response = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="w3.org/1999/xhtml"><head><title>;
</title></head><body>
<form name="form1" method="post" action="tenHSServer.aspx?t=34&f=DeviceValue&d=R10" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTkzNDcxNzcwM2RkM4AHUDZdWZytDdspzLq7+FOXRfQ=" />
</div>
<span id="Result">R10:100;</span>
</form></body>
</html>'
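result = nil
doc = Nokogiri::HTML(response)
# at_css returns the first matching node, or nil when nothing matches
# (css returns a NodeSet, which is truthy even when it is empty)
span = doc.at_css('#Result')
result = span.text if span
puts result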
#=> R10:100;
However, if you cannot or do not want to install Nokogiri, use this regexp instead:
result = response.scan(/id=["']Result["']>([^<]*)<\//m).flatten.first
puts result
#=> R10:100;
Remove everything up to and including <span id="Result"> with the first call to sub().
Then remove everything from </span> onwards from what's left with the second call to sub().
Assume you store your HTML in the variable mystring. The /m flag lets . match newlines, which is needed because the response spans several lines.
result = mystring.sub(/.*<span id="Result">/m, '').sub(/<\/span>.*/m, '')
If you can't always rely on the elements being spans, you could use the following:
result = mystring.sub(/.*id="Result">/m, '').sub(/<\/.*/m, '')
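Once the value is isolated, splitting it into variables is a short step. The sketch below is built on assumptions: extract_result and parse_result are hypothetical helpers, and the "device:value;" layout is inferred from the R10:100; and Q8:0; samples only.

```ruby
# Hypothetical helper: pull the text of the element whose id is "Result".
def extract_result(html)
  html[/id=["']Result["'][^>]*>([^<]*)</, 1]
end

# Hypothetical helper: split "R10:100;" into its two fields.
# Assumes a single "device:value;" pair, as in the samples.
def parse_result(result)
  device, value = result.to_s.strip.chomp(';').split(':', 2)
  { device: device, value: value }
end

html = '<form><span id="Result">R10:100;</span></form>'
parsed = parse_result(extract_result(html))
puts parsed[:device] # R10
puts parsed[:value]  # 100
```

Both fields come back as strings; call to_i on the value if a number is needed.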