How to show the difference between two text files using Nokogiri - Ruby

I need to pull out the difference between two text files on the internet. I tried looking at existing answers, and all of them directed me to Nokogiri.
Is there a way to pull out the difference in the data using Nokogiri in Ruby, or is there a better way to do this?

You can use the diff-lcs gem (Nokogiri parses HTML and XML, so it won't help with plain text).
require 'diff/lcs'
require 'open-uri'
# open-uri lets you call #read directly on a URI
text1 = URI.parse('http://www.example.org/text1.txt').read
text2 = URI.parse('http://www.example.org/text2.txt').read
# Compare line by line; diffing the raw strings would work character by character
diff = Diff::LCS.diff(text1.lines, text2.lines)
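To inspect the result: Diff::LCS.diff returns groups of change objects, each exposing an action ('+' or '-'), a position, and the element itself. A minimal sketch:
diff.each do |hunk|
  hunk.each do |change|
    puts "#{change.action} #{change.position}: #{change.element}"
  end
end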
Unfortunately, as you declined to provide an example of output even after several people asked you about it, I can't say much more than this.

Related

Extract all URLs from a folder

I want to extract all URLs from the files in a folder using Ruby, but I have no idea how to do this. Can someone please help me? I have spent a lot of time on Google but could not find any suggestions.
Thanks
Ruby's URI class can scan a document and return all URLs. Look at the extract method.
Wrap that in a loop that scans your directory using Dir::glob or Dir::entries and reads each file using File.read.
If you want, you can write a quick parser-based scanner using Nokogiri, but it's probably going to have the same results. URI's method is easier.
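A minimal sketch of that, assuming the folder path (here '/path/to/folder') is yours to fill in:
require 'uri'
# Read every file in the folder and collect anything that looks like an http(s) URL
# (URI.extract is deprecated in newer Rubies but still works)
urls = Dir.glob('/path/to/folder/*').flat_map do |file|
  next [] unless File.file?(file)
  URI.extract(File.read(file), %w[http https])
end
puts urls.uniq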
You can use Nokogiri to parse and search HTML documents.
> require 'nokogiri'
> require 'open-uri'
> doc = Nokogiri::HTML(open("http://www.example.com"))
> doc.css("a").map{|node| node.attr("href")}
=> ["http://www.iana.org/domains/special"]

Wikiquote API?

I want to get a structured version of a Wikiquote page via JSON (basically I need all the phrases).
Example: http://en.wikiquote.org/wiki/Fight_Club_(film)
I tried with: http://en.wikiquote.org/w/api.php?format=xml&action=parse&page=Fight_Club_(film)&prop=text
but I get all the HTML source code. I need each phrase as an element of an array.
How could I achieve that with DBpedia?
For one thing, I am not sure whether you can query Wikiquote using DBpedia, and secondly, DBpedia only gives you infobox data in a structured way; it does not give you the article content in a structured way at all. Instead, with a little bit of trouble, you can use the MediaWiki API to get the data.
EDIT
The URI you are trying returns text, so that makes things easier, but not completely.
Try this piece of code in your console:
require 'nokogiri'
require 'json'
require 'open-uri'
# Fetch the parse result as JSON; the rendered HTML sits under parse/text/*
content = JSON.parse(URI.parse("http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_%28film%29&prop=text").read)
data = content['parse']['text']['*']
# Parse that HTML fragment and pull the text out of every list item
xpath_data = Nokogiri::HTML(data)
xpath_data.xpath("//ul/li").map { |data_node| data_node.text }
This is the closest I have come to an answer. Of course, it is not completely right, because you will get a lot of unnecessary data. But if you dig into Nokogiri and XPath and work out how to pinpoint the nodes you need, you can get a solution that will give you the correct quotes at least 90% of the time.
Just change the format to JSON. Look up the MediaWiki API for more details.
http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_(film)&prop=text

How do I write a web scraper in Ruby?

I would like to crawl a popular site (say Quora) that doesn't have an API, get some specific information, and dump it into a file - say a CSV, .txt, or .html, formatted nicely :)
E.g. return only a list of all the 'Bios' of the Users of Quora that have, listed in their publicly available information, the occupation 'UX designer'.
How would I do that in Ruby?
I have a moderate understanding of how Ruby and Rails work. I just completed a Rails app, mainly written all by myself, but I am no guru by any stretch of the imagination.
I understand regexes, etc.
Your best bet would be to use Mechanize. It can follow links, submit forms, anything you will need, web-client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.
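For example, a minimal sketch (the URL and link text are placeholders):
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.example.com/')
# Print the href of every link on the page
page.links.each { |link| puts link.href }
# Follow a link by its text; #click fetches and parses the next page
link = page.link_with(text: 'More information...')
next_page = link.click if link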
If you want something more high-level, try Wombat, a gem I built on top of Mechanize and Nokogiri. It can parse pages and follow links using a really simple, high-level DSL.
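A rough sketch of that DSL, loosely adapted from the project's README (the URL and selector are illustrative):
require 'wombat'
# Crawl one page and return a hash of the properties declared in the block
data = Wombat.crawl do
  base_url "http://www.example.com"
  path "/"
  headline xpath: "//h1"
end
# data is e.g. {"headline" => "Example Domain"}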
I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.
All you have to do is take a look at the HTML source of the pages and try to find an XPath or CSS expression that matches the desired elements, then use something like:
doc.search("//p[@class='posted']")
Mechanize is awesome. If you're looking to learn something new, though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.
Nokogiri is great, but I find the output messy to work with. I wrote a Ruby gem to easily create classes from HTML: https://github.com/jassa/hyper_api
The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.
E.g.
Post = HyperAPI.new_class do
  string title: 'div#title'
  string body: 'div#body'
  string author: '#details .author'
  integer comments_count: '#extra .comment' do
    size
  end
end
# => Post
post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>

Parsing an RSS item that has a colon in the tag with Ruby?

I'm trying to parse the info from an RSS feed that has this tag structure:
<dc:subject>foo bar</dc:subject>
using the built-in Ruby RSS library. Obviously, doing item.dc:subject throws errors, but I can't figure out any way to pull out that info. Is there any way to get this to work, or is it possible with a different RSS library?
Tags with ':' in them are really XML tags with a namespace. I never had good results using the RSS module because the feed formats often don't meet the specs, causing the module to give up. I highly recommend using Nokogiri to parse the feed, whether it is RDF, RSS or ATOM.
Nokogiri can use either XPath accessors or CSS accessors, and both support namespaces. The last two lines below are equivalent:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(open('http://somehost.com/rss_feed'))
doc.at('//dc:subject').text  # XPath form; namespace prefix uses ':'
doc.at('dc|subject').text    # CSS form; namespace prefix uses '|'
When dealing with namespaces you'll need to add the declaration to the XPath accessor:
doc.at('//dc:subject', 'dc' => 'link to dc declaration')
See the "Namespaces" section for more info.
Without a URL or a better sample I can't do more, but that should get you pointed in a better direction.
A couple of years ago I wrote a big RSS aggregator for my job using Nokogiri; it handled RDF, RSS and Atom. Ruby's RSS library wasn't up to the task, but Nokogiri was awesome.
If you don't want to roll your own, Paul Dix's Feedzirra is a good gem for processing feeds.
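A brief sketch of Feedzirra usage (the feed URL is a placeholder):
require 'feedzirra'
# Fetch and parse the feed in one step
feed = Feedzirra::Feed.fetch_and_parse('http://somehost.com/rss_feed')
# Each entry exposes accessors such as title, url and published
feed.entries.each do |entry|
  puts entry.title
end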
The RSS module seems to have the ability to handle those XML namespaced elements, e.g. <dc:date>, like this:
feed.items.each do |item|
  puts "Date: #{item.dc_date}"
end
I think item['dc:subject'] might work.

Extract text from a PDF (I have a link to the PDF) in Ruby

I have a link like
http://www.downloads.com/help.pdf
I want to download this, and parse it to get the text content.
How do I go about this? I also plan to tag-ize (if there is a word like that) the extracted text.
You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader
Or the command-line utility pdftotext.
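A minimal sketch combining open-uri with pdf-reader, along the lines of that text example (the URL is the one from the question):
require 'pdf-reader'
require 'open-uri'
# Download the PDF and hand the IO straight to PDF::Reader
io = URI.parse('http://www.downloads.com/help.pdf').open
reader = PDF::Reader.new(io)
# Print the extracted text of every page
reader.pages.each { |page| puts page.text }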
The Yomu gem will also be able to extract the text from a PDF (as well as other MIME types) for you.
require 'yomu'
Yomu.new(file_path).text
You can also take a look at DocRipper, a gem I maintain that provides a Ruby interface for text extraction from a number of document formats, including PDF, doc, docx and sketch.
DocRipper uses pdftotext under the hood and avoids Java dependencies.
require 'doc_ripper'
DocRipper::rip('/path/to/file.pdf') # => "Pdf text"
You can read remote files using the Ruby standard library:
require 'open-uri'
require 'doc_ripper'
# Note: open-uri returns a Tempfile (which has a #path) for larger downloads,
# but small responses come back as a StringIO, which has no #path
tmp_file = open("some_uri")
DocRipper::rip(tmp_file.path)
