gather all links to *.txt files - ruby

I need to retrieve all the links to text files in an HTML document. I don't know the best way to do this, but I have tried the following in Ruby:
line.scan(/<a href="([\w+:\/.-]*.txt)/)
However, I am not sure whether this expression covers all possible links pointing to a text file. Are there any built-in regular expressions for this, or does anyone know a better way to retrieve all links to text files in a huge web page?

This will walk through the HTML and find all hrefs with a '.txt' extension:
#!/usr/bin/env ruby
require 'nokogiri'
html = <<EOT
<html>
<head><title>foo</title></head>
<body>
<a href="file.txt">text file</a>
<a href="file.jpg">jpg file</a>
<a href="file2.txt">text file 2</a>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
puts doc.search('a').select { |n| n['href'][/\.txt$/] }.map{ |n| n['href'] }
> file.txt
> file2.txt
It's using Nokogiri to parse the content, which really is a lot more bullet-proof than trying to use regex.

Try this (captures all txt files, not just links):
html.scan(/[^\s"']+\.txt/)
To capture links to text files only:
html.scan(/<a [^<>\n]*?href=["']([^\s"']+\.txt)["'][^<>\n]*?>.*?<\/a>/m)
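To see how the two patterns above differ, here is a small sketch that runs both against a made-up document. The sample HTML, the constant names and the method names are assumptions for illustration only; the regular expressions are copied unchanged from the answer.

```ruby
# Made-up sample document (an assumption, not taken from the question).
SAMPLE_HTML = <<~HTML
  <a href="notes.txt">notes</a>
  <a class="x" href='docs/readme.txt' id="r">readme</a>
  <a href="photo.jpg">photo</a>
  <p>see changelog.txt</p>
HTML

# The two patterns from the answer above, unchanged.
ANY_TXT  = /[^\s"']+\.txt/
TXT_LINK = /<a [^<>\n]*?href=["']([^\s"']+\.txt)["'][^<>\n]*?>.*?<\/a>/m

# Every token ending in .txt, whether or not it is a link target.
def all_txt_names(html)
  html.scan(ANY_TXT)
end

# Only the href values of <a> elements that end in .txt.
def txt_link_targets(html)
  html.scan(TXT_LINK).flatten
end

p all_txt_names(SAMPLE_HTML)    # ["notes.txt", "docs/readme.txt", "changelog.txt"]
p txt_link_targets(SAMPLE_HTML) # ["notes.txt", "docs/readme.txt"]
```

Neither pattern handles query strings such as file.txt?v=2 or unquoted attribute values, which is one reason a parser is usually the safer choice.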

Related

Replace text String from the shell disabling any regular expression

I need to replace a large set of broken HTML links in a file. For that, I need a find/replace with regular expressions completely disabled, i.e. the kind of basic find/replace you would do in Notepad.
I came across a Ruby one-liner which should do exactly that:
ruby -p -i -e "gsub('Home', 'NEWLINK')" test.txt
However, the file test.txt is not changed, and no output is printed. (I don't know much about Ruby, so I might just be missing something obvious.)
Is there any other tool which does what I need?
Edit: I'd expect that the following test.txt file:
Home
....is changed to:
NEWLINK
Thanks
Instead of a regular expression, consider using an HTML parser, which actually understands HTML and won't leave you with a broken HTML document.
# link_parser.rb
require 'bundler/inline'
gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri'
end
fn = ARGV[0]
if fn && File.exist?(fn)
  puts "Processing #{fn}..."
  # 'r+' opens the file for reading and writing ('rw' is not a valid mode)
  File.open(fn, 'r+') do |file|
    doc = Nokogiri::HTML(file)
    links = doc.css('a[href="index.php?option=com_content&view=article&id=130&catid=111&Itemid=324"]')
    if links.any?
      # Nokogiri reads and writes attributes through [] and []=
      links.each do |link|
        link['href'] = "NEWLINK"
      end
      file.rewind
      file.write(doc.to_s)
      # drop leftover bytes if the new document is shorter than the old one
      file.truncate(file.pos)
      puts "#{links.length} links replaced"
    else
      puts "No links found"
    end
  end
else
  puts "File not found."
end
ruby link_parser.rb path/to/file.html
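For the plain, regex-free find/replace the question asks about, a minimal standard-library sketch could look like this. The method names and the sample string are hypothetical. It relies on String#gsub matching a String pattern literally, and uses the block form so that backslashes in the replacement are not read as back-references.

```ruby
# Replace every literal occurrence of `old` in `text` with `new`.
# A String pattern is matched literally by String#gsub, and the block form
# keeps the replacement text from being interpreted in any way.
def replace_literal(text, old, new)
  text.gsub(old) { new }
end

# Hypothetical convenience wrapper: rewrite a whole file in place.
def replace_literal_in_file(path, old, new)
  File.write(path, replace_literal(File.read(path), old, new))
end

# Made-up sample input (an assumption, not the asker's real file).
sample = '<a href="index.php?id=130&view=article">Home</a>'
puts replace_literal(sample, 'index.php?id=130&view=article', 'NEWLINK')
# prints: <a href="NEWLINK">Home</a>
```

Unlike the parser-based script, this does not check that the result is still valid HTML; it simply swaps one string for another.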

Injecting base directory to RDiscount Markdown conversion

I am using Ruby's RDiscount to convert markdown to HTML.
The markdown documents contain links (or images) that are relative to the path of the markdown file itself.
Is there a way for me to tell RDiscount that it should prepend all relative links with a string (folder) of my choice?
I am trying to achieve an effect similar to how GitHub shows images in README - where the files are looked for in the same directory as the README.
Here is some example code:
require 'rdiscount'
markdown = "![pic](image.png)\n\n[link](somewhere.html)"
doc = RDiscount.new(markdown)
# I would like to do something here, like:
# doc.base_link_path = 'SOMEFOLDER'
html = doc.to_html
puts html
# actual output =>
# <p><img src="image.png" alt="pic" /></p>
# <p><a href="somewhere.html">link</a></p>
# desired output =>
# <p><img src="SOMEFOLDER/image.png" alt="pic" /></p>
# <p><a href="SOMEFOLDER/somewhere.html">link</a></p>
I looked at the RDiscount class documentation but did not find anything like this.
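One possible workaround, assuming RDiscount really has no such option, is to rewrite the relative targets in the Markdown source before handing it to RDiscount. The sketch below uses only the standard library. The method name prefix_relative_links is hypothetical, and the pattern only covers inline links and images of the form [text](target) and ![alt](target), not reference-style links or raw HTML.

```ruby
# Prefix relative targets of inline Markdown links and images with `base`.
# Targets that start with a URL scheme (http:, mailto:, ...), a slash or a
# fragment marker (#) are left untouched.
def prefix_relative_links(markdown, base)
  markdown.gsub(/(!?\[[^\]]*\]\()([^)\s]+)/) do
    opener = Regexp.last_match(1)
    target = Regexp.last_match(2)
    relative = target !~ %r{\A(?:[a-z][a-z0-9+.-]*:|/|#)}i
    relative ? "#{opener}#{File.join(base, target)}" : "#{opener}#{target}"
  end
end

markdown = "![pic](image.png)\n\n[link](somewhere.html)"
puts prefix_relative_links(markdown, 'SOMEFOLDER')
# prints:
# ![pic](SOMEFOLDER/image.png)
#
# [link](SOMEFOLDER/somewhere.html)
```

The rewritten string can then be passed to RDiscount.new as in the question's example.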

HAML::Engine, image is not rendering

I am using just Ruby (no Rails) to make a gem that takes a YAML file as input and produces a PDF.
I am using pdfKit and Haml. In the haml file I need to render an image. When using pure HTML, PDFKit shows the image, but when using Haml::Engine and the haml file the image doesn't get rendered.
I have put the image, the .module and the haml file under one folder to make sure the issue is not the path.
In my gemspec I have the following
.
.
gem.add_development_dependency("pdfkit", "0.8.2")
gem.add_development_dependency("wkhtmltopdf-binary")
gem.add_development_dependency("haml")
The HTML that works:
<html>
<head>
<title>
junk
</title>
</head>
<body>
HI
<img src="watermark.png"/>
Bye
</body>
</html>
Haml version that doesn't work:
%html
  %head
    %title
      junk
  %body
    HI
    = img "watermark.png"
    Bye
module:
require "pdfkit"
require "yaml"
require "haml"
require "pry"
.
.
junk = Haml::Engine.new(File.read("lib/abc/models/junk.haml")).render
kit2 = PDFKit.new(junk)
kit2.to_file("pdf/junk.pdf")
When using the HTML file, the PDF renders the image. However, when I use the Haml file, this is how my PDF looks:
And if I use
%img(src="watermark.png")
I get the following error when pdf is generated
Exit with code 1 due to network error: ContentNotFoundError
The PDF still gets generated, but it still looks like the image above.
So, without using Rails or helpers such as img_tag or image_tag, how can I use plain %img in the Haml file to render the image properly?
Following is the output of junk, when I create #img = "watermark.png"
1 pry(DD)> junk
=> "<html>\n<head>\n<title>\njunk\n</title>\n</head>\n<body>\nHI\n<img src='watermark.png'>\nBye\n</body>\n</html>\n"
Replace the img tag %img(src=#img) still same result
I think this is not a Haml issue but a wkhtmltopdf one. Apparently it's not very good at handling relative URLs to images. It should work with an absolute path, for example:
%img(src="/home/user/junk/watermark.png")
To create an image tag in HAML you will need to use
%img(src="foo.png")
# or
%img{src: "foo.png"}
If you use
= img "watermark.png"
then you are calling the img method (if it exists), passing it "watermark.png" as argument and outputting the return value of that method in the generated HTML.
Honestly, I'm not sure how that HTML template can work when the Haml template that generates the same HTML does not. So I guess you run the script from different directories?
Anyway, the problem is that you need absolute paths for your files. Otherwise wkhtmltopdf will not be able to resolve your images.
You can use File.expand_path (see the sample) for this, or write a custom helper.
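As a minimal sketch of what such a helper might look like: the name absolute_asset_path and the /tmp directory are assumptions for illustration. Note the argument order of File.expand_path: the file name comes first, the directory it is resolved against second.

```ruby
# Hypothetical helper: absolute path of an asset that lives in base_dir.
# File.expand_path(file_name, dir_string) resolves file_name against
# dir_string; without a directory it falls back to the current one.
def absolute_asset_path(name, base_dir = Dir.pwd)
  File.expand_path(name, base_dir)
end

puts absolute_asset_path('watermark.png', '/tmp')
```

The returned string can be interpolated into the src attribute of %img so that wkhtmltopdf receives an absolute path.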
I tried the following:
I created three files:
/tmp/script.rb
/tmp/watermark.png
/tmp/template.haml
Where watermark.png is a screenshot I took from this question, script.rb is:
require "pdfkit"
require "haml"
template = File.read("template.haml")
html = Haml::Engine.new(template).render
File.write("template.html", html)
pdf = PDFKit.new(html)
pdf.to_file("output.pdf")
And template.haml is:
%html
  %head
    %title Something
  %body
    BEFORE
    %img(src="#{File.expand_path("watermark.png", __dir__)}")
    AFTER
When I run this script like:
ruby-2.5.0 tmp$ ruby script.rb
Then the HTML that is generated is:
<html>
<head>
<title>Something</title>
</head>
<body>
BEFORE
<img src='/tmp/watermark.png'>
AFTER
</body>
</html>
And the generated PDF looks like:

How can I directly use pandoc to generate docx files within a Sinatra app?

I have a Sinatra app which needs to provide downloadable reports in Microsoft Word format. My approach to creating the reports is to generate the content using ERB, and then convert the resulting HTML into docx. Pandoc seems to be the best tool for accomplishing this, but my implementation involves generating some temporary files which feels kludgy.
Is there a more direct way to generate the docx file and send it to the user?
I know that PandocRuby exists, but I couldn't quite get it working for my purposes. Here is an example of my current implementation:
# set up the docx MIME type
configure do
  mime_type :docx, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
end
# route to generate the report
get '/report/:name' do
  content_type :docx
  input = erb :report, :layout=>false # get the HTML content for the input file
  now = Time.now.to_i.to_s # create a unique file name
  input_path = File.join('tmp', now+'.txt')
  f = File.new(input_path, "w+")
  f.write(input.to_s) # write the HTML to the input file
  f.close()
  output_path = File.join('tmp', now+'.docx') # create a unique output file
  system "pandoc -f html -t docx -o #{output_path} #{input_path}" # convert the input file to docx
  send_file output_path
end
A recent update to pandoc-ruby added support for piping binary output to standard output. Does that solve your problem?
I don't have any experience with Sinatra, and I have not tried to use pandoc-ruby to pipe binary output, but something like
puts PandocRuby.convert(input, :from => :html, :to => :docx)
might do the trick.
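If pandoc-ruby does not fit, another option is to talk to the pandoc executable through pipes with Ruby's standard Open3 module, which avoids temporary files altogether. This is an untested sketch under a few assumptions: that pandoc is on the PATH, and that your pandoc version writes docx to standard output when given -o -. The method name html_to_docx and the command: parameter are hypothetical.

```ruby
require 'open3'

# Assumed command line: read HTML on stdin, write docx to stdout ("-o -").
PANDOC_HTML_TO_DOCX = %w[pandoc -f html -t docx -o -].freeze

# Pipe `html` through an external converter and return its raw output.
# `command` is an argv array, so no shell is involved.
def html_to_docx(html, command: PANDOC_HTML_TO_DOCX)
  output, status = Open3.capture2(*command, stdin_data: html, binmode: true)
  raise "conversion failed (exit #{status.exitstatus})" unless status.success?
  output
end

# Hypothetical use inside the Sinatra route from the question:
#   get '/report/:name' do
#     content_type :docx
#     html_to_docx(erb(:report, :layout => false))
#   end
```

Returning the binary string from the route lets Sinatra send it as the response body, so nothing needs to be written to tmp or cleaned up afterwards.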

Ruby Premature EOF?

I'm trying to write one file into another one in Ruby, but the output seems to stop prematurely.
Input file - large CSS file with base64 embedded fonts
Output file - basic html file.
#write some HTML before the CSS (works)
...
#write the external CSS (doesn't work, output finished prematurely)
while !ext_css_file.eof()
out_file.puts(ext_css_file.read())
end
...
#write some HTML after the CSS (works)
The resulting file is basically a valid HTML file, with a truncated CSS (in the middle of an embedded font)
When doing a puts on the result of read(), I get the same result: The CSS file is read only up to this last string: "RMSHhoPCAGt/mELDBESFBQSggGfAgESKCUAAAAAAAwAlgABAAAAAAABAAUADAABAAAAAAAC"
It is difficult to provide a detailed solution without more insight into what the CSS file actually contains. Based on your code above, I would try something like this instead:
#write some HTML before the CSS (works)
...
#write the external CSS (doesn't work, output finished prematurely)
out_file.puts(ext_css_file.read())
...
#write some HTML after the CSS (works)
I don't think you need the .eof check because the read method reads and returns the entire file contents, or an empty string or nil if at the end of file. See here: http://apidock.com/ruby/IO/read
I would tend to read and write the same type of data. For instance if I were writing data into the new file using puts, I would read data using readlines. If I were writing binary data using write, I would read the data using read. I would be consistent with either strings or bytes and not mix the two.
Try something like this...
File.open('writable_file_path', 'w') do |f|
  # f.puts "some html"
  f.puts IO.readlines('css_file_path')
  # f.puts "some more html"
end
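The question does not show how the two files are opened or closed, so the following is a guess at a more robust shape rather than a diagnosis. It reads and writes in binary mode, so no newline or encoding conversion takes place, and uses the block form of File.open so the output is flushed and closed before the script moves on. The method name write_page and its parameters are hypothetical.

```ruby
# Hypothetical helper: write `before`, the full contents of css_path and
# `after` into out_path.
def write_page(out_path, css_path, before:, after:)
  File.open(out_path, 'wb') do |out|
    out.write(before)
    out.write(File.binread(css_path)) # whole file, byte for byte
    out.write(after)
  end # the block form flushes and closes the file here
end

# Usage (paths are placeholders):
#   write_page('page.html', 'fonts.css',
#              before: "<html><head><style>\n",
#              after:  "\n</style></head><body></body></html>\n")
```

If the output is still truncated with this shape, the input file itself is worth checking, for example by comparing File.size with the length of what File.binread returns.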
