Nokogiri: IDs vs. hierarchy XPath performance - Ruby

I have to write the XML schema for a dataset that is hierarchically organized. It will be parsed with Nokogiri for information retrieval. My question is: from a performance point of view, is it better to respect the hierarchy or to flatten it?
E.g.
<item_1 id="id_1">
<item_2 id="id_2">value</item_2>
</item_1>
or
<item id_1="id_1" id_2="id_2">value</item>
I know that multiple attributes should be avoided as far as readability and maintainability are concerned, but performance is my priority.

If you want the absolute fastest performance and the documents are large, you probably don't want to use XPath at all. A SAX (or Reader) filter will be the fastest.
But if you are going to have Nokogiri parse the document and create a DOM for XPath, I don't think it will make much difference whether you query using:
doc.xpath('/item_1[@id="x"]/item_2[@id="y"]') # first case
or
doc.xpath('/item[@id_1="x" and @id_2="y"]') # second case
Of course, benchmarking these two solutions against your real data is the only way to know for sure.
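For reference, here is a minimal sketch of the SAX approach mentioned above, assuming the flattened layout from the question (the element and attribute names are just the ones from the example; adjust them to your schema):

require 'nokogiri'

# Collects the text of every <item> whose id_1/id_2 attributes match.
# In recent Nokogiri versions start_element receives the attributes as
# an array of [name, value] pairs.
class ItemFilter < Nokogiri::XML::SAX::Document
  attr_reader :values

  def initialize(id_1, id_2)
    @id_1, @id_2 = id_1, id_2
    @values  = []
    @capture = false
  end

  def start_element(name, attrs = [])
    attr_hash = Hash[attrs]
    @capture = name == 'item' &&
               attr_hash['id_1'] == @id_1 &&
               attr_hash['id_2'] == @id_2
  end

  def characters(text)
    @values << text if @capture
  end

  def end_element(name)
    @capture = false
  end
end

filter = ItemFilter.new('id_1', 'id_2')
Nokogiri::XML::SAX::Parser.new(filter).parse(File.open('data.xml'))
puts filter.values

Because no DOM is built, memory use stays flat and large files stream through quickly; the trade-off is that you write the matching logic yourself instead of expressing it as XPath.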


A simple filtering language that can be embedded in Ruby?

I have a Ruby project where part of the operation is to select entities given user-specified constraints. So far, I've been hacking together my own filter language, using regular expressions and specifying inclusion/exclusion based on the fields in the entities.
If you are interested in my current approach, here's an example. Given this list of entities:
[{"type":"dog", "name":"joe"}, {"type":"dog", "name":"fuzz"}, {"type":"cat", "name":"meow"}]
A user could specify a filter like so:
{"filter":{
"type":{"included":["dog"] },
"name":{"excluded":["^f.*"] }
}}
This would match all dogs but exclude fuzz.
This is sort of working now. However, I am starting to require more sophisticated selection parameters, and rather than continuing to hack on my own filter language, I wonder whether there is a more general-purpose filter language I can just embed in my application. For instance, is there a parser that can filter in-app using a SQL WHERE clause? Or are there other general, simple filter languages that I'm not aware of? I would especially like to move away from regexps, since I want to do range queries on numbers (like: is entity["size"] < 50?).
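For illustration, a rough sketch (not the asker's actual code) of how a filter in that format could be applied in plain Ruby; matches? is a hypothetical helper:

require 'json'

# Every field in the filter must pass: at least one "included" pattern
# matches (or none are given), and no "excluded" pattern matches.
def matches?(entity, filter)
  filter.all? do |field, rules|
    value    = entity[field].to_s
    includes = rules.fetch('included', [])
    excludes = rules.fetch('excluded', [])
    (includes.empty? || includes.any? { |p| value =~ Regexp.new(p) }) &&
      excludes.none? { |p| value =~ Regexp.new(p) }
  end
end

entities = JSON.parse('[{"type":"dog","name":"joe"},{"type":"dog","name":"fuzz"},{"type":"cat","name":"meow"}]')
filter   = JSON.parse('{"filter":{"type":{"included":["dog"]},"name":{"excluded":["^f.*"]}}}')['filter']

entities.select { |e| matches?(e, filter) }
# => [{"type"=>"dog", "name"=>"joe"}]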
It is a little bit of an extrapolation, but I think you may be looking for a search engine, or at least enough of one that you may as well use one just for the query language.
If so, you might want to look at Elasticsearch, which has Ruby client bindings and could be a good fit for what you are trying to do, especially if you want or need to express the data you want to search as JSON for use by client code, since that format is natively supported by the search engine.
The query language is quite expressive, and there are a variety of built-in and plugin tools available to explore and use it.
In the end, I ended up implementing a Ruby DSL. It's easy, fun, and powerful.
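(A hypothetical sketch of what such an embedded DSL might look like, not the actual implementation; the size field is invented for the example so a range check can be shown. Filters are plain Ruby blocks, so numeric comparisons come for free.)

# Wraps a predicate block and applies it to a list of entity hashes.
class Filter
  def initialize(&predicate)
    @predicate = predicate
  end

  def apply(entities)
    entities.select { |entity| @predicate.call(entity) }
  end
end

entities = [
  { 'type' => 'dog', 'name' => 'joe',  'size' => 30 },
  { 'type' => 'dog', 'name' => 'fuzz', 'size' => 70 },
  { 'type' => 'cat', 'name' => 'meow', 'size' => 10 }
]

small_dogs = Filter.new { |e| e['type'] == 'dog' && e['size'] < 50 }
small_dogs.apply(entities)  # => [{"type"=>"dog", "name"=>"joe", "size"=>30}]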

In XQuery, how do I avoid deteriorating performance in a simple paged query?

I have a MarkLogic database with a few tens of thousands of documents in it, and a query that returns some simple calculated values for either all or a subset of those documents. The document count has grown to the point that the "all documents" option no longer reliably runs without timing out, and it is only going to get worse as the document count grows. The obvious solution is for the client application to use the other form and paginate the results. It's an offline batch process, so overall speed isn't an issue - we'd just like to keep individual requests sane.
The paged version of the query is very simple:
declare namespace ns = "http://some.namespace/here";
declare variable $fromCount external;
declare variable $toCount external;
<response> {
  for $doc in fn:doc()/ns:entity[$fromCount to $toCount]
  return
    <doc> omitted for brevity </doc>
} </response>
The problem is that the query gets slower the further through the document set the requested page is; presumably because it has to load every document in order, check whether it's the right type, and iterate until it has found $fromCount ns:entity elements before it even begins building the response.
One wrinkle is that there are other types of document in the database, so just using fn:doc isn't a realistic option (although they are in different directories, so xdmp:directory() might be an option; something I'll look into).
There also isn't currently an index on the ns:entity element; would that help? It's always the root node of a document, and the documents are quite large, so I'm concerned about the size of the index. Also, (the slow part of) this query isn't interested in the value of the element, just that it exists.
I thought about using the search: API for its built-in paging, but it seems overkill for a query that is intended to match all documents; surely it's possible to manually construct the query that search:search() would build internally.
It seems like what I really need is an efficient list of all root nodes of a certain type in the database. Does MarkLogic maintain such a thing? If not, would an index solve the problem?
Edit: It turns out that the answer in my case is to use the xdmp:directory() option, since MarkLogic apparently keeps a fast, in-memory list of all documents. Still, if there is a more general solution to the problem, it's bound to be of interest, so I'll leave the question here.
Your analysis is correct:
presumably because it has to load every document in order, check whether it's the right type, and iterate until it has found $fromCount ns:entity elements before it even begins building the response
The usual answer is cts:search plus the unfiltered option. You found that xdmp:directory was faster, but you should still be able to measure pagination times as O(n) even if the scale is smaller. See http://docs.marklogic.com/guide/performance/unfiltered#chapter - basically the database is guarding against returning false positives, unless you tell it not to.
Another approach might be to use cts:uris and its limit option, but this might require managing pagination state in terms of start values rather than page counts. For example, if the last item on page 1 was "cat", you would pass "cat" as the start value when calling cts:uris for the next page. You could still use pagination start-stop values, too. That would still be O(n), but at a much smaller scale.

How do I extract an HTML topic heading from a web page?

Given a page like "What popular startup advice is plain wrong?", I'd like to be able to extract the first topic under the topic heading on the upper right hand side, in this case, "Common Misconceptions".
What's the best way for me to do this in Ruby? Is it with Nokogiri or a regex? Presumably I need to do some HTML parsing?
First, you almost never, ever, want to use regular expressions to parse/extract/fold/spindle/mutilate XML or HTML. There are too many ways it can go wrong. Regular expressions are great for some jobs, but XML/HTML extractions are not a good fit.
That said, here's what I'd do using Nokogiri:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))
topic = doc.at('span a.topic_name span').content
puts topic
Running that outputs:
Common Misconceptions
The code takes a couple of shortcuts that should work consistently:
Ruby's OpenURI makes it easy to access Internet resources; it's my go-to for most simple-to-average apps. There are more powerful tools, but none as convenient.
doc.at tells Nokogiri to traverse the document and find the first occurrence of the CSS selector 'span a.topic_name span', which should consistently be the first topic entry on that page.
Note that Nokogiri supports several variants of searching for a node: at vs. search. at, %, and methods like at_css find the first occurrence and return a Node, which is an individual tag, text, or comment. search, /, and their variants return a NodeSet, which is like an array of Nodes; you'll have to walk that list or extract the individual nodes you want using some sort of Array accessor. In the above code I could have written doc.search(...).first to get the node I wanted.
Nokogiri also supports using XPath accessors, but for most things I'll usually go with CSS. It's simpler, and easier to read, but your mileage might vary.
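A small illustration of the at/search distinction described above, using the same page and selector (URI.open is the open-uri entry point on Ruby 3+; plain open works on older Rubies as in the answer above):

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))

first_topic = doc.at('span a.topic_name span')      # a single Nokogiri::XML::Node
all_topics  = doc.search('span a.topic_name span')  # a Nokogiri::XML::NodeSet

puts first_topic.content                 # "Common Misconceptions"
puts all_topics.map(&:content).inspect   # all matching topics

# The same node via an XPath accessor instead of CSS:
puts doc.at_xpath('//span/a[contains(@class, "topic_name")]/span').content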

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOCs) that I want to classify so I can search over them according to certain tags. These tags could either be my own (I assign the tags to the documents) or extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even simpler.
Mahout might be overkill for your problem, but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the static DocumentCategorizerME.train() function to train on a collection of documents, each of which needs a category tag and the text block to train on. Then you can initialize the DocumentCategorizerME with the trained model and begin classifying the rest of your documents.
Once you do this, you can (I think) write the model to a file so you never have to do the training again.
This post on extracting keywords and classifying webpages is related and may be helpful. In your case it sounds like you can use tags in lieu of the keyword-extraction piece (although you may want to use both in combination). Weka is easy to use; I would definitely recommend giving it a look.

Is there an XPath equivalent for LINQ to XML?

I have been using LINQ to XML for a few hours, and while it seems lovely and powerful when it comes to loops and complex selections, it doesn't seem as good for situations where I just want to select a single node value, which is what XPath is good at.
I may be missing something obvious here, but is there a way to use XPath and LINQ to XML together without having to parse the document twice?
You can still use XPath, with the XPathEvaluate, XPathSelectElement and XPathSelectElements extension methods. You can also call CreateNavigator to create an XPathNavigator.
