Any Ruby models to traverse DOM's quickly? - ruby

Does anyone know of any Ruby libraries/gems that allow you to traverse a DOM quickly?
I need something which is fast, and doesn't have a lot of dependencies. I've been trying to use Nokogiri, but I'm concerned with the number of 'bug segmentation faults' I've been getting.

Hpricot is a personal favourite of mine.

Related

Ruby/Rails HTML page parsing

I want to parse a web page (a catalog) with a Ruby library and store the results in a database. I'm finding it hard to choose which library is best for this purpose. I'm familiar with Hpricot, but I'm not really sure that nowadays it is on the edge.
P.S - Or any kind of data to parse URL-s?
Thank you!
I think Nokogiri with open-uri is the best choice for HTML parsing.
Why do you care about a library that "nowadays is on the edge"? If you feel confident with Hpricot, then use it. Don't waste your time on endless searching: just start writing the program. That is my answer.
Hehe, I was looking to quote the Hpricot author on this matter, and I found this comment:
Hpricot was the work of the hacker _why who has now disappeared. But
even before he disappeared nokogiri overtook hpricot in performance.
He even tweeted "caller asks, “should i use hpricot or nokogiri?” if
you're NOT me: use nokogiri. and if you're me: well cut it out, stop
being me"
And here is a link to the comment I've quoted:
http://news.ycombinator.com/item?id=1955644
Summing this up: go with Nokogiri.

find repeat patterns in webpages in ruby

I am trying to find a way of detecting repeated patterns in webpages so that I can extract the content into my database.
EDIT: I don't know what the repeated pattern is beforehand, so I can't just search for a given pattern via a regex or something.
For example, say you have 10 sites selling cars and the sites are all different. On each site, the cars are listed in the HTML in a repeated way down the page for that site.
The other sites will list them in a different way, but each with its own repeated pattern.
Does anyone know how to do this, or have any experience of this sort of thing?
I love Ruby, so I was hoping to do it in Ruby, if anyone has or knows of any libs/gems that may help me out.
Rick, machine pattern matching is a complicated topic, and not something that you'll find a good library for out of the box on Ruby.
Kyle's answer was a start. Once you get the page with Ruby, the typical technology for this would be XPath, or "The XML Path Language".
Using XPath you can write simple selectors that will extract every item matching a pattern. For instance, every link in an HTML document would be //a, every h1 would be //h1, and every image directly inside a div, where the image has the class "car", would be something like //div/img[@class="car"].
The result of the XPath is an enumerable list of each item, you can then query for sub-elements, get the content() of the elements, and build relationships to extract the data you need.
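The selectors above can be tried out with a small sketch. To keep it runnable without installing any gems, it uses REXML from Ruby's standard distribution instead of Nokogiri, and the page fragment is made up for illustration. With Nokogiri, the same XPath expressions would be passed to `doc.xpath`.

```ruby
require 'rexml/document'

# Hypothetical, well-formed page fragment used only for illustration.
html = <<~HTML
  <html>
    <body>
      <h1>Cars for sale</h1>
      <div><img class="car" src="a.jpg"/><a href="/cars/1">Ford</a></div>
      <div><img class="logo" src="logo.png"/><a href="/cars/2">Fiat</a></div>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Every link in the document, reduced to its text.
links = REXML::XPath.match(doc, '//a').map { |a| a.text }

# Every image directly inside a div whose class attribute is "car".
car_images = REXML::XPath.match(doc, '//div/img[@class="car"]')
car_sources = car_images.map { |img| img.attributes['src'] }
```

Here `links` holds the text of both anchors and `car_sources` holds only the `src` of the image marked with the "car" class.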
The go-to library for Ruby is called Nokogiri, and it is available as a gem - the direct documentation is a little weak, but it's all covered there if you know what to look for.
Some libraries for Ruby combine the crawling with an easy way to access the underlying HTML/XML as a Nokogiri document. One such example is Anemone, which is a "framework for building web spiders in Ruby" - and I can recommend it very highly.
In Ruby, if you want to get the text of a webpage, all you have to do is use the Net::HTTP class. The get method takes a host name (without the http:// scheme) and a path, and returns the body of the page as a string.
require 'net/http'
Net::HTTP.get 'www.target-site.com', '/target-page.html'
You're probably going to want to use some sort of HTML/XML parser after that to build a model of the page and navigate over it. I've heard good things about Hpricot.

How to create a spreadsheet with formulas using Rails?

I need some gem/plugin to create an Excel spreadsheet with formulas to use in my Rails application. Any suggestions?
I've used Roo and it's quite good and easy to do spreadsheet processing (once you get all the gem dependencies installed). However, it doesn't support formulas natively. It won't eval the formula and return the result (this would be difficult I think -- use the excel engine?) but it will give you the text of the formula, for example:
=SUM(.A1,.B1)
It'd be pretty easy to handle this specific case but if you have many different formulas and functions then rolling your own evaluator is going to be difficult. Going and getting A1 and B1 to add them together is very doable with Roo. It's just a question of how complex your formulas are.
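As a rough illustration of handling that one specific case, the sketch below evaluates only formulas shaped like the example above. The `cells` hash and the `evaluate_sum` helper are hypothetical names invented here; in practice the values would come from whatever the spreadsheet library returns for each cell.

```ruby
# Hypothetical cell values, as they might be read from a sheet.
cells = { 'A1' => 2, 'B1' => 3.5 }

# Evaluate only formulas of the form =SUM(.A1,.B1).
# Formulas that are not a SUM call return nil; unknown cells raise KeyError.
def evaluate_sum(formula, cells)
  match = formula.match(/\A=SUM\((.*)\)\z/i)
  return nil unless match

  # Split the argument list and drop the leading dot from each reference.
  refs = match[1].split(',').map { |ref| ref.strip.delete_prefix('.') }
  refs.sum { |ref| cells.fetch(ref) }
end

total = evaluate_sum('=SUM(.A1,.B1)', cells) # => 5.5
```

Ranges, nested functions and other operators are deliberately not covered, which is where a hand-rolled evaluator starts to get difficult.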
writeexcel does it wonderfully!
I think you should create a blank Excel file that already contains the formulas and then fill it in from Rails, because you can't create formulas with Ruby.
There's a spreadsheet gem listed on RubyGems but having never used it I can't recommend it.

What is a good approach for extracting keywords from user-submitted text?

I'm building a site that allows users to make sense of a debate by graphically representing arguments for and against a particular issue. (Wrangl)
I'd like to categorise these debates so they are more easily found and connected. I don't want to irritate the person creating the debate by asking them to add tags and categories before they see any benefit, so I'm looking at a way of automatically extracting keywords.
What's a good approach for taking the debate's title and description (and possibly the content of the arguments themselves once there are some) to pull out, say, ten strong keywords that could be used as metadata to connect similar debates together, or even as the content of the "meta" keywords tag in the head of the HTML page where the debate is viewable. Eg. Datamapper vs ActiveRecord
The site is coded in Ruby with Sinatra, using DataMapper for data storage. I'm ideally looking for something which will work on Heroku (I don't have a way of writing files to disk dynamically), and I'd consider a web service, an API or ideally a Ruby gem.
Maybe you can use TextAnalyzer.
I understand that you want an easy way of achieving this. I've recently dived into the world of NLP (Natural Language Processing) and text mining, and it's a daunting field, most of which went far over my head.
I did manage to code some functionality that resembles what you're looking for, though I did it in PHP. If you want it tailored to your project (Wrangl), I would suggest doing it yourself,
using the Porter stemming algorithm, for which I'm sure there is Ruby code.
Ruby Porter stemmer
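For a sense of what a do-it-yourself approach might look like, here is a minimal sketch that ranks words by plain frequency after removing stop words. It does not implement Porter stemming; a stemmer could be applied to each word before counting. The stop-word list and the `extract_keywords` helper are assumptions made for this example, and it writes nothing to disk.

```ruby
# A deliberately small stop-word list; a real one would be much longer.
STOP_WORDS = %w[a an and are as for in is of or the to vs with].freeze

# Return up to `limit` of the most frequent words that are not stop words.
# Ties are broken alphabetically so the result is deterministic.
def extract_keywords(text, limit: 10)
  words = text.downcase.scan(/[a-z]+/)
  words = words.reject { |w| STOP_WORDS.include?(w) || w.length < 3 }
  counts = words.tally
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

title = 'Datamapper vs ActiveRecord'
description = 'Is DataMapper or ActiveRecord the better ORM for a Sinatra app?'
keywords = extract_keywords("#{title} #{description}", limit: 3)
# keywords => ["activerecord", "datamapper", "app"]
```

Plain frequency is a crude signal on short texts like a title and description, so the output is best treated as suggestions rather than final tags.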
You can try the salsaAPI to automatically extract keywords and categorize the debates!

Ruby XML Parsing with Nokogiri/XPath

I have a Shopify store, and I want to automatically update the inventory levels of its product variants using a live XML feed from the wholesaler I use.
I'm learning to program (Ruby) and this is my first project, but after researching here is how I think it should work.
Use Ruby/Nokogiri to parse the XML feed from the wholesaler, and then XPath to locate both the unique product variant SKU code and the stock level.
Somehow I need to use this SKU to refer back to my Shopify store's product XML list, and pull out the variant's unique ID using the SKU code.
Then use something like the builder gem to build the XML format that Shopify needs, and then use curl to PUT the changes. I'm guessing I loop this process for every product?
I know Shopify only has a 300 call limit, so I've got the article on putting a delay in the script, but I get the feeling the above method isn't the easiest way to go about this?
With Shopify you need to apply the variant stock level update against unique variant xml files, so I need to build the unique xml file/code and PUT it against /admin/variants/#[thevariantid].xml
I'm looking forward to trying to put this together and learning in the process, but am I on the right track with this? Are there simpler gems I should be looking at?
N.B. I've only recently started learning Ruby, and will head to Rails afterwards. I know a bit about XML and its structure, so I should be OK finding what I need with XPath.
You’re on the right track, but I’d use the shopify_api gem to do the talking to Shopify instead of having to form the XML and URIs yourself: https://github.com/Shopify/shopify_api
There’s an article on our wiki that might also help you out with regards to the API call limit but just let me know if you need more space – we’re pretty flexible and the limit is really just there to keep scripts from going wild and affecting service for everyone else.
Your proposed path seems good, except that there's no need to use the 'builder' gem, as Nokogiri has some very nice XML-building built into it.
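The first step from the question, reading each SKU and stock level out of the feed, might look roughly like the sketch below. The feed layout and element names (`products`, `product`, `sku`, `stock`) are invented for illustration, since the real wholesaler feed is not shown. REXML from Ruby's standard distribution is used so the example runs without extra gems; with Nokogiri the same XPath expression would go to `doc.xpath`.

```ruby
require 'rexml/document'

# Hypothetical feed layout; the real wholesaler feed will use its own element names.
feed = <<~XML
  <products>
    <product><sku>ABC-1</sku><stock>12</stock></product>
    <product><sku>ABC-2</sku><stock>0</stock></product>
  </products>
XML

doc = REXML::Document.new(feed)

# Build a SKU => stock level lookup from the feed.
stock_by_sku = {}
REXML::XPath.each(doc, '//product') do |product|
  sku = product.elements['sku'].text
  stock_by_sku[sku] = product.elements['stock'].text.to_i
end
```

Each entry in `stock_by_sku` would then be matched to a Shopify variant by its SKU and sent as one update per variant, with a pause between calls to stay under the API limit.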
