I want to extract all URLs from a folder of files using Ruby, but I have no idea how to do this. I have spent a lot of time searching Google but could not find any suggestions.
Thanks
Ruby's URI class can scan a document and return all URLs. Look at the extract method.
Wrap that in a loop that scans your directory using Dir::glob or Dir::entries and reads each file using File.read.
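For example, a minimal sketch (the folder path is a placeholder, and the scheme list is an assumption; widen it if you need ftp, mailto, etc.):

require 'uri'

# Read every file in the folder and collect whatever URLs URI.extract finds
urls = Dir.glob('/path/to/folder/*').flat_map do |file|
  next [] unless File.file?(file)
  URI.extract(File.read(file), %w[http https])
end

puts urls.uniq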
If you want, you can write a quick parser-based scanner using Nokogiri, but it's probably going to have the same results. URI's method is easier.
You can use Nokogiri to parse and search HTML documents.
> require 'nokogiri'
> require 'open-uri'
> doc = Nokogiri::HTML(open("http://www.example.com"))
> doc.css("a").map{|node| node.attr("href")}
=> ["http://www.iana.org/domains/special"]
I need to pull out the difference between two text files hosted on the internet. I tried looking at previous answers, and all of them direct me to Nokogiri.
Is there a way to pull out the difference between the files using Nokogiri in Ruby, or is there a better way to do this?
You can use the diff-lcs gem.
require 'diff/lcs'
require 'open-uri'

# Fetch both remote files as strings
text1 = URI.parse('http://www.example.org/text1.txt').read
text2 = URI.parse('http://www.example.org/text2.txt').read

# Compare line-by-line; passing the raw strings instead
# would diff character-by-character
diff = Diff::LCS.diff(text1.lines, text2.lines)
Unfortunately, as you declined to provide an example of the output even after several people asked you for it, I can't say much more than this.
I need to be able to make a Ruby application (no Rails, if possible) that opens an external YAML file with over 104K lines in it, reads from it, and filters out the following three things:
!ruby/object:EvtEvent
!ruby/object:NwsPost
!ruby/object:Asset
and then outputs these things to an XML file that would have to be built by the Ruby program.
I am unclear how to start with setting this up, as I am only a junior-level developer with one year's experience.
Although I found a snippet on Stack Overflow showing how to use Nokogiri for this, I don't know exactly where to put this code, which I would have to modify for my situation:
require 'yaml'
require 'nokogiri'

yaml = "getOrderDetails:
  Id: '114'
  Name: 'XYZ'"

doc = YAML.load yaml

output = Nokogiri::XML::Builder.new do |xml|
  xml.product {
    xml.id doc["getOrderDetails"]["Id"]
    xml.name doc["getOrderDetails"]["Name"]
  }
end

puts output.to_xml
#=> <?xml version="1.0"?>
#=> <product>
#=>   <id>114</id>
#=>   <name>XYZ</name>
#=> </product>
How would I code the init.rb file to launch a Ruby program that would open the YAML file in question, read from it, and then output it to XML?
What other Ruby files would I need to put in my lib folder for such a Ruby program to handle this task?
The code can go wherever it's convenient. Ruby has no real expectation of file locations; you just run them. Your development team probably has guidelines, so talk to them.
"init.rb" is a non-descriptive name for a file. Try to use something more indicative of the purpose of the script.
Reading a remote file for this purpose is easy with OpenURI:

require 'open-uri'

foo = open('http://domain.com/path/to/file.yaml').read

will read the contents of the file and store them in the variable foo.
The contents of the YAML can then be parsed easily using:

yaml = YAML.load(foo)

At that point yaml will contain an array or a hash, which can be accessed as normal.
What's more interesting is that, once OpenURI is loaded, it patches Kernel#open, which is why open can read straight from a URL above. It's tempting to expect YAML.load_file to inherit that magic:

require 'open-uri'
yaml = YAML.load_file('http://domain.com/path/to/file.yaml')

but load_file opens its file through File.open, which OpenURI does not patch, so passing it a URL will most likely fail. Stick with the open-then-YAML.load two-step shown above.
Nokogiri's Builder interface is probably a good way to go.
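For instance, a rough sketch, assuming the file is a top-level YAML sequence of tagged objects whose attributes are simple scalars with XML-safe names (the filename 'data.yml', the output file, and the records/record element names are all made up for illustration):

require 'yaml'
require 'nokogiri'

# The three tags from the question
TAGS = ['!ruby/object:EvtEvent', '!ruby/object:NwsPost', '!ruby/object:Asset']

# Parse to Psych's AST rather than live objects, so the custom
# !ruby/object tags don't need their classes to be defined
doc = YAML.parse_file('data.yml') # placeholder filename

# Assumes the document root is a sequence of tagged mappings
nodes = doc.root.children.select { |node| TAGS.include?(node.tag) }

builder = Nokogiri::XML::Builder.new do |xml|
  xml.records {
    nodes.each do |node|
      xml.record(type: node.tag.split(':').last) {
        # A mapping node stores keys and values as a flat child list
        node.children.each_slice(2) do |key, value|
          # Only scalar values are handled in this sketch
          xml.send(key.value, value.value) if value.is_a?(Psych::Nodes::Scalar)
        end
      }
    end
  }
end

File.write('output.xml', builder.to_xml)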
I would like to crawl a popular site (say Quora) that doesn't have an API and get some specific information and dump it into a file - say either a csv, .txt, or .html formatted nicely :)
E.g. return only a list of all the 'Bios' of the Users of Quora that have, listed in their publicly available information, the occupation 'UX designer'.
How would I do that in Ruby?
I have a moderate enough level of understanding of how Ruby & Rails work. I just completed a Rails app - mainly all written by myself. But I am no guru by any stretch of the imagination.
I understand RegExs, etc.
Your best bet would be to use Mechanize. It can follow links, submit forms, anything you will need, web client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.
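A minimal sketch of the Mechanize flow (the URL, the '.bio' selector, and the link text are placeholders; you'd find the real ones by inspecting the site's HTML):

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.example.com/') # placeholder URL

# Mechanize pages wrap Nokogiri documents, so CSS/XPath searches work
bios = page.search('.bio').map(&:text) # '.bio' is an assumed selector

# Follow a link by its visible text (link_with returns nil if not found)
link = page.link_with(text: 'Next')
next_page = link.click if link

puts bios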
If you want something more high level, try wombat, which is this gem I built on top of Mechanize and Nokogiri. It is able to parse pages and follow links using a really simple and high level DSL.
I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.
All you have to do is take a look at the HTML source of the pages and find an XPath or CSS expression that matches the desired elements, then use something like:

doc.search("//p[@class='posted']")
Mechanize is awesome. If you're looking to learn something new though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.
Nokogiri is great, but I find the output messy to work with. I wrote a ruby gem to easily create classes off HTML: https://github.com/jassa/hyper_api
The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.
E.g.
Post = HyperAPI.new_class do
  string title: 'div#title'
  string body: 'div#body'
  string author: '#details .author'
  integer comments_count: '#extra .comment' do
    size
  end
end
# => Post

post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>
I'm trying to parse the info from an RSS feed that has this tag structure:
<dc:subject>foo bar</dc:subject>
using the built-in Ruby RSS library. Obviously, doing item.dc:subject throws errors, but I can't figure out any way to pull out that info. Is there any way to get this to work? Or is it possible with a different RSS library?
Tags with ':' in them are really XML tags with a namespace. I never had good results using the RSS module because the feed formats often don't meet the specs, causing the module to give up. I highly recommend using Nokogiri to parse the feed, whether it is RDF, RSS or ATOM.
Nokogiri can use either XPath or CSS accessors, and both support namespaces. The last two lines below are equivalent:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(open('http://somehost.com/rss_feed'))
doc.at('//dc:subject').text
doc.at('dc|subject').text
When dealing with namespaces you'll need to add the declaration to the XPath accessor:
doc.at('//dc:subject', 'dc' => 'link to dc declaration')
See the "Namespaces" section for more info.
Without a URL or a better sample I can't do more, but that should get you pointed in a better direction.
A couple of years ago I wrote a big RSS aggregator for my job using Nokogiri that handled RDF, RSS and Atom. Ruby's RSS library wasn't up to the task, but Nokogiri was awesome.
If you don't want to roll your own, Paul Dix's Feedzirra is a good gem for processing feeds.
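For example, a quick sketch with Feedzirra (the feed URL is a placeholder):

require 'feedzirra'

feed = Feedzirra::Feed.fetch_and_parse('http://somehost.com/rss_feed')
feed.entries.each do |entry|
  puts entry.title
end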
The RSS module seems to be able to handle namespaced elements such as <dc:date> like this:
feed.items.each do |item|
  puts "Date: #{item.dc_date}"
end
I think item['dc:subject'] might work.
I have a link like
http://www.downloads.com/help.pdf
I want to download this, and parse it to get the text content.
How do I go about this? I also plan to tag-ize (if there is such a word) the extracted text.
You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader
Or the command-line utility pdftotext.
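For the pdf-reader route, a minimal sketch (the path is a placeholder):

require 'pdf-reader'

reader = PDF::Reader.new('help.pdf') # placeholder path
reader.pages.each do |page|
  puts page.text
end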
The Yomu gem will also be able to extract the text from a PDF (as well as other MIME types) for you.
require 'yomu'
Yomu.new(file_path).text
You can also take a look at DocRipper, a gem I maintain, that provides a Ruby interface for text extraction from a number of document formats including PDF, doc, docx and sketch.
DocRipper uses pdftotext under the hood and avoids Java dependencies.
require 'doc_ripper'

DocRipper::rip('/path/to/file.pdf') # => "Pdf text"
You can read remote files using the Ruby standard library:
require 'open-uri'
require 'doc_ripper'
tmp_file = open("some_uri")
DocRipper::rip(tmp_file.path)