Extract text from PDF (I have link to PDF) in Ruby

I have a link like
http://www.downloads.com/help.pdf
I want to download this, and parse it to get the text content.
How do I go about this? I also plan to tag-ize (if there is such a word) the extracted text.

You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader
Or the command-line utility pdftotext.

The Yomu gem will also be able to extract the text from a PDF (as well as other MIME types) for you.
require 'yomu'
Yomu.new(file_path).text

You can also take a look at DocRipper, a gem I maintain, that provides a Ruby interface for text extraction from a number of document formats including PDF, doc, docx and sketch.
DocRipper uses pdftotext under the hood and avoids Java dependencies.
require 'doc_ripper'
DocRipper::rip('/path/to/file.pdf') # => "Pdf text"
You can read remote files using the Ruby standard library:
require 'open-uri'
require 'doc_ripper'
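# open-uri returns a StringIO (which has no #path) for bodies under 10 KB, and a Tempfile otherwise
tmp_file = URI.open("some_uri")
DocRipper::rip(tmp_file.path)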

Related

Regex for Ruby code extraction out of plain text?

I want to extract ruby code snippets out of plain text.
The gem https://github.com/Erol/yomu makes it possible to extract the text of a PDF document. Now I want to get just the well-formed Ruby code out of, for instance, a Ruby programming book.
Any idea what a regex for multi-line matches of Ruby methods and classes could look like?
I tried many different expressions, but did not get the results I expected.
Try this:
1. Go through the file line by line and try to parse each line as Ruby code.
2. If a line parses as Ruby, start adding more lines to it until they don't parse as Ruby code anymore.
3. Voila, here is your first code snippet.
4. Maybe apply some filter to exclude trivial snippets like single words.
5. Repeat.
This is a common practice for extracting source code from unstructured text like emails and whatnot. It has been used to scan millions of emails for research projects.
Use the ripper core library to parse Ruby code.
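A minimal sketch of this idea using Ripper. It varies the steps above slightly: the first line of a method (def foo) does not parse on its own, so the sketch starts at lines that open with def, class or module and keeps adding lines until the block parses. The names valid_ruby? and extract_snippets and the max_lines cap are hypothetical choices of mine, not part of any library:

```ruby
require 'ripper'

# True when +source+ parses as Ruby; Ripper.sexp returns nil on a syntax error.
def valid_ruby?(source)
  !Ripper.sexp(source).nil?
end

# Hypothetical helper: collect blocks that start with def/class/module
# and parse as Ruby. max_lines caps how far a single block may grow.
def extract_snippets(text, max_lines: 200)
  lines = text.lines
  snippets = []
  i = 0
  while i < lines.size
    if lines[i] =~ /\A\s*(def|class|module)\b/
      last = [i + max_lines, lines.size].min - 1
      # Grow the candidate one line at a time until it parses.
      stop = (i..last).find { |j| valid_ruby?(lines[i..j].join) }
      if stop
        snippets << lines[i..stop].join
        i = stop + 1
        next
      end
    end
    i += 1
  end
  snippets
end

# Usage: extract_snippets(Yomu.new(file_path).text)
```

Prose lines that happen to begin with "class" or "module" can still produce false starts, so some filtering of the results is probably still needed.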

where is a list of markdown tags supported by redcarpet gem

Is there a list of the markdown tags supported by the redcarpet gem?
For example, some markdown implementations support centering text, some don't. Rather than trial and error experimentation, it seems like such a popular gem would be documented somewhere?
I don't think redcarpet is responsible for the markdown syntax itself. It's simply a renderer that relies on an underlying library to interpret the text.
After some research, it seems redcarpet was originally based on the Upskirt library (later renamed Sundown), which implements the syntax defined by this Daring Fireball project:
Markdown is a text-to-HTML conversion tool for web writers. Markdown
allows you to write using an easy-to-read, easy-to-write plain text
format, then convert it to structurally valid XHTML (or HTML).
Thus, “Markdown” is two things: (1) a plain text formatting syntax;
and (2) a software tool, written in Perl, that converts the plain text
formatting to HTML. See the Syntax page for details pertaining to
Markdown’s formatting syntax. You can try it out, right now, using the
online Dingus.
You can find the syntax here

Extract all Urls from a folder

I want to extract all URLs from a folder using Ruby, but I have no idea how to do this. Can someone please help me? I have spent a lot of time on Google but could not find any suggestions.
Thx
Ruby's URI class can scan a document and return all URLs. Look at the extract method.
Wrap that in a loop that scans your directory using Dir::glob or Dir::entries and reads each file using File.read.
If you want, you can write a quick parser-based scanner using Nokogiri, but it's probably going to have the same results. URI's method is easier.
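A rough sketch of that loop, assuming you only care about http and https links. The helper name urls_in_folder is made up for illustration. Note that current Ruby documentation marks URI.extract as obsolete, although it is still available:

```ruby
require 'uri'

# Hypothetical helper: collect every http(s) URL found in the files under +dir+.
def urls_in_folder(dir)
  # '**' makes the glob descend into subfolders; sort keeps the order stable.
  paths = Dir.glob(File.join(dir, '**', '*')).sort
  paths.flat_map do |path|
    next [] unless File.file?(path)
    # scrub replaces invalid bytes so binary files do not raise encoding errors.
    text = File.read(path).scrub
    URI.extract(text, %w[http https])
  end.uniq
end

# Usage: puts urls_in_folder('/path/to/folder')
```

URI.extract works on plain text, so it may pick up trailing punctuation next to a URL; check the results before relying on them.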
You can use Nokogiri to parse and search HTML documents.
> require 'nokogiri'
> require 'open-uri'
> doc = Nokogiri::HTML(URI.open("http://www.example.com"))
> doc.css("a").map{|node| node.attr("href")}
=> ["http://www.iana.org/domains/special"]

Find FontName from a TTF file in UBUNTU

I need to programmatically install (on Ubuntu) a few hundred fonts for my application (developed in Ruby on Rails) and track the names of the fonts as they are installed.
So I need a tool or library on Ubuntu that takes a TTF file as input and provides the font name as output.
Even a way to do this in Ruby (a gem or anything else that can find the font name for a given TTF file) would be a great help.
Thanks in Advance.
I imagine there ought to be a ruby font parser of some sort that could do this, but if you have a little skill you could probably make your own pretty easily. You'll need two important pieces of info:
Data structure for TTF file header
Data structure for the 'name' table (where the font name is stored)
Read the header, locate the 'name' table, search the 'name' table for a nameID 4 entry ("full name", e.g. "My Cool Font Bold Italic"). Or maybe nameID 1 if you just want the family name ("My Cool Font").
The data structures are pretty simple and should not be much trouble at all for even a beginner developer to understand and parse.
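As a sketch of what such a hand-rolled parser could look like in plain Ruby: read the table directory, find the 'name' table, then look for the requested nameID. The helper name ttf_name is hypothetical, and as a simplification it only decodes Unicode and Windows platform entries (stored as UTF-16BE) and does not handle .ttc collections:

```ruby
# Hypothetical helper: return a string from the 'name' table of a .ttf file,
# or nil if none is found. name_id 4 is the full name, 1 the family name.
def ttf_name(path, name_id = 4)
  File.open(path, 'rb') do |f|
    # Offset table: sfnt version (uint32), number of tables (uint16), 6 more bytes.
    _version, num_tables = f.read(12).unpack('Nn')

    # Table directory: 16-byte records of tag, checksum, offset, length.
    records = Array.new(num_tables) { f.read(16).unpack('a4NNN') }
    name_record = records.find { |tag, _sum, _off, _len| tag == 'name' }
    return nil unless name_record
    table_offset = name_record[2]

    # 'name' table header: format, record count, offset of string storage.
    f.seek(table_offset)
    _format, count, string_offset = f.read(6).unpack('n3')

    # Name records: platformID, encodingID, languageID, nameID, length, offset.
    entries = Array.new(count) { f.read(12).unpack('n6') }
    entries.each do |platform_id, _enc, _lang, id, length, offset|
      # Simplification: only Unicode (0) and Windows (3) entries, as UTF-16BE.
      next unless id == name_id && [0, 3].include?(platform_id)
      f.seek(table_offset + string_offset + offset)
      return f.read(length).force_encoding('UTF-16BE').encode('UTF-8')
    end
    nil
  end
end

# Usage: puts ttf_name('some/path/myfont.ttf')
```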
Basically, I used fop-ttfreader to get the details from a TTF file:
fop-ttfreader fontfile.ttf xmlfile.xml
http://manpages.ubuntu.com/manpages/natty/man1/fop-ttfreader.1.html
The resulting XML file can then be parsed to get the font details.
This takes two steps instead of one; as djangodude suggested, it could be done in a single piece of code by parsing the file ourselves. In my case it is a one-time process, like a rake task, so this was enough to finish my job. :P
Hope this may help someone.
The ttfunk gem will do most of the heavy lifting without you needing to do any parsing yourself:
require 'ttfunk'
file = TTFunk::File.open("some/path/myfont.ttf")
puts "name: #{file.name.font_name.join(', ')}"
(excerpt from the readme in said gem)

Parsing an RSS item that has a colon in the tag with Ruby?

I'm trying to parse the info from an RSS feed that has this tag structure:
<dc:subject>foo bar</dc:subject>
using the built-in Ruby RSS library. Obviously, doing item.dc:subject throws errors, but I can't figure out any way to pull out that info. Is there any way to get this to work? Or is it possible with a different RSS library?
Tags with ':' in them are really XML tags with a namespace. I never had good results using the RSS module because the feed formats often don't meet the specs, causing the module to give up. I highly recommend using Nokogiri to parse the feed, whether it is RDF, RSS or ATOM.
Nokogiri can use XPath accessors or CSS accessors, and both support namespaces. The last two lines would be equivalent:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(URI.open('http://somehost.com/rss_feed'))
doc.at('//dc:subject').text
doc.at('dc|subject').text
When dealing with namespaces you'll need to add the declaration to the XPath accessor:
doc.at('//dc:subject', 'dc' => 'link to dc declaration')
See the "Namespaces" section for more info.
Without a URL or a better sample I can't do more, but that should get you pointed in a better direction.
A couple of years ago I wrote a big RSS aggregator for my job using Nokogiri that handled RDF, RSS and ATOM. Ruby's RSS library wasn't up to the task but Nokogiri was awesome.
If you don't want to roll your own, Paul Dix's Feedzirra (since renamed Feedjira) is a good gem for processing feeds.
The RSS module seems to have the ability to do those XML namespace attributes, i.e. <dc:date> like this:
feed.items.each do |item|
  puts "Date: #{item.dc_date}"
end
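The same accessor style should work for the tag in the question. Here is a small self-contained sketch, assuming the rss library is available (it ships as a bundled gem in recent Ruby versions). The feed below is made-up sample data; note the xmlns:dc declaration on the root element:

```ruby
require 'rss'

# Made-up sample feed for illustration only.
xml = <<~XML
  <?xml version="1.0"?>
  <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
      <title>Example</title>
      <link>http://example.com/</link>
      <description>Sample feed</description>
      <item>
        <title>First post</title>
        <dc:subject>foo bar</dc:subject>
      </item>
    </channel>
  </rss>
XML

# The second argument (false) skips strict validation of the feed.
feed = RSS::Parser.parse(xml, false)
feed.items.each do |item|
  # dc:subject is exposed as dc_subject, with an underscore instead of the colon.
  puts "Subject: #{item.dc_subject}"
end
```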
I think item['dc:subject'] might work.
