Is there a way to interactively search for a nodes that matches a given xpath expression in emacs?
I would like something similar to re-forward-search but instead of using a regular expression I'd type an xpath expression.
I don't have an answer wrt XPath queries; sorry. But you might try Icicles search search keys M-s M-s x and M-s M-s X (commands icicle-search-xml-element and icicle-search-xml-element-text-node).
These let you search the contents and the text() nodes, respectively, of top-level XML elements whose names match a regexp that you provide.
For icicle-search-xml-element, can have any of these
forms:
<ELEMENTNAME>...</ELEMENTNAME>
<ELEMENTNAME ATTRIBUTE1="..."...>...</ELEMENTNAME>
<ELEMENTNAME/>
<ELEMENTNAME ATTRIBUTE1="...".../>
You can alternatively choose to search, not the search contexts as
defined by the element-name regexp, but the non-contexts, that is, the
buffer text that is outside such elements. To do this, use `C-M-~'
during completion. (This is a toggle, and it affects only future
search commands, not the current one.)
For icicle-search-xml-element-text-node, the top-level matching elements must not have attributes. Only top-level elements of the form <ELEMENTNAME>...</ELEMENTNAME> are
matched.
HTH.
I did something like that a long time ago. I can't give you any details now, but I'll provide an overview of the approach I took.
I created some Emacs functions to interact with (query) a native XML database. I did it with a MarkLogic server once and with a Berkley DB XML database another time. One of those functions simply queried the database. Another one of the functions would send an XQuery query that included an Emacs buffer or buffer selection.
The native XML database server would process the query, return the results, and my Emacs functions would render the result in a result buffer.
This approach allowed me to query the XML with XPath and XQuery, which is a much more powerful query language that includes XPath. (I wrote about XQuery a long time ago, here: https://www.ibm.com/developerworks/library/x-xqueryxpath/)
As difficult as all of this might sound, it turned out to be surprisingly easy.
Related
Normally, one would use an XPath query to obtain a certain value or node. In my case, I'm doing some web-scraping with google spreadsheets, using the importXML function to update automatically some values. Two examples are given below:
=importxml("http://www.creditagricoledtvm.com.br/";"(//td[#class='xl7825385'])[9]")
=importxml("http://www.bloomberg.com/quote/ELIPCAM:BZ";"(//span)[32]")
The problem is that the pages I'm scraping will change every now and then and I understand very little about XML/XPath, so it takes a lot of trial and error to get to a node. I was wondering if there is any tool I could use to point to an element (either in the page or in its code) that would provide an appropriate query.
For example, in the second case, I've noticed the info I wanted was in a span node (hence (//span)), so I printed all of them in a spreadsheet and used the line count to find the [32] index. This takes long to load, so it's pretty inconvenient. Also, I don't even remember how I've figured the //td[#class='xl7825385'] query. Thus why I'm wondering if there is more practical method of pointing to page elements.
Some clues :
Learning XPath basics is still useful. W3Schools is a good starting point.
https://www.w3schools.com/xml/xpath_intro.asp
Otherwise, built-in dev tools of your browser can help you to generate absolute XPath. Select an element, right-click on it then >Copy>Copy XPath.
https://developers.google.com/web/tools/chrome-devtools/open
Browser extensions like Chropath can generate absolute or relative XPath for you.
https://autonomiq.io/chropath/
I am using google docs for web scraping. More specifically, I am using the Google Sheets built in IMPORTXML function in which I use XPath to select nodes to scrape data from.
What I am trying to do is basically check if a particular node exists, if YES, select some other random node.
/*IF THIS NODE EXISTS*/
if(exists(//table/tr/td[2]/a/img[#class='special'])){
/*SELECT THIS NODE*/
//table/tr/td[2]/a
}
You don't have logic quite like that in XPath, but you might be able to do something like what you want.
If you want to select //table/tr/td[2]/a but only if it has a img[#class='special'] in it, then you can use //table/tr/td[2]/a[img[#class='special']].
If you want to select some other node in some other circumstance, you could union two paths (the | operator), and just make sure that each has a filter (within []) that is mutually exclusive, like having one be a path and the other be not() of that path. I'd give an example, but I'm not sure what "other random node" you'd want… Perhaps you could clarify?
The key thing is to think of XPath as a querying language, not a procedural one, so you need to be thinking of selectors and filters on them, which is a rather different way of thinking about problems than most programmers are used to. But the fact that the filters don't need to specifically be related to the selector (you can have a filter that starts looking at the root of the document, for instance) leads to some powerful (if hard-to-read) possibilities.
Use:
/self::node()[//table/tr/td[2]/a/img[#class='special']]
//table/tr/td[2]/a
I have some XML that is structured like this:
<whatson>
<productions>
<production>
<category>Film</category>
</production>
<production>
<category>Business</category>
</production>
<production>
<category>Business training</category>
</production>
</productions>
</whatson>
And I need to select every production with a category that doesn't contain "Business" (so just the first production in this example).
Is this possible with XPath? I tried working along these lines but got nowhere:
//production[not(contains(category,'business'))]
XPath queries are case sensitive. Having looked at your example (which, by the way, is awesome, nobody seems to provide examples anymore!), I can get the result you want just by changing "business", to "Business"
//production[not(contains(category,'Business'))]
I have tested this by opening the XML file in Chrome, and using the Developer tools to execute that XPath queries, and it gave me just the Film category back.
I need to select every production with a category that doesn't contain "Business"
Although I upvoted #Arran's answer as correct, I would also add this...
Strictly interpreted, the OP's specification would be implemented as
//production[category[not(contains(., 'Business'))]]
rather than
//production[not(contains(category, 'Business'))]
The latter selects every production whose first category child doesn't contain "Business". The two XPath expressions will behave differently when a production has no category children, or more than one.
It doesn't make any difference in practice as long as every <production> has exactly one <category> child, as in your short example XML. Whether you can always count on that being true or not, depends on various factors, such as whether you have a schema that enforces that constraint. Personally, I would go for the more robust option, since it doesn't "cost" much... assuming your requirement as stated in the question is really correct (as opposed to e.g. 'select every production that doesn't have a category that contains "Business"').
You can use not(expression) function.
not() is a function in xpath (as opposed to an operator)
Example:
//a[not(contains(#id, 'xx'))]
OR
expression != true()
Should be xpath with not contains() method, //production[not(contains(category,'business'))]
I'm currently working with JMeter in order to stress test one of our systems before release. Through this, I need to simulate users clicking links on the webpage presented to them. I've decided to extract theese links with an XPath Post-Processor.
Here's my problem:
I have an a XPath expression that looks something like this:
//div[#data-attrib="foo"]//a//#href
However I need to extract a specific child for each thread (user). I want to do something like this:
//div[#data-attrib="foo"]//a[position()=n]//#href
(n being the current index)
My question:
Is there a way to make this query work, so that I'm able to extract a new index of the expression for each thread?
Also, as I mentioned, I'm using JMeter. JMeter creates a variable for each of the resulting nodes, of an XPath query. However it names them as "VarName_n", and doesn't store them as a traditional array. Does anyone know how I can dynamicaly pick one of theese variables, if possible? This would also solve my problem.
Thanks in advance :)
EDIT:
Nested variables are apparently not supported, so in order to dynamically refer to variables that are named "VarName_1", VarName_2" and so forth, this can be used:
${__BeanShell(vars.get("VarName_${n}"))}
Where "n" is an integer. So if n == 1, this will get the value of the variable named "VarName_1".
If the "n" integer changes during a single thread, the ForEach controller is designed specifically for this purpose.
For the first question -- use:
(//div[#data-attrib="foo"]//a)[position()=$n]/#href
where $n must be substituted with a specific integer.
Here we also assume that //div[#data-attrib="foo"] selects a single div element.
Do note that the XPath pseudo-operator // typically result in very slow evaluation (a complete sub-tree is searched) and also in other confusing problems ( this is why the brackets are needed in the above expression).
It is recommended to avoid using // whenever the structure of the document is known and a complete, concrete path can be specified.
As for the second question, it is not clear. Please, provide an example.
I'm currently developing a website that allows a search on a PostgreSQL
database, the search works with to_tsquery() and I'm trying to find a way to validate the input before it's being sent as a query.
Other than that I'm also trying to add a phrasing capability, so that if someone searches for HELLO | "I LIKE CATS" it will only find results with "hello" or the entire phrase "i like cats" (as opposed to I & LIKE & CATS that will find you articles that have all 3 words,
regardless where they might appear).
Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the ts_query parsing algorithm in the client.
If the concern is that you don't want it to try running the whole query (which presumably will involve table access) each time it validates, you could use the input in a smaller query, just in pseudocode (which may look a bit like Python, but that's just coincidence):
is_valid_query(input):
try:
execute("SELECT ts_query($1)", input);
return True
except DatabaseError:
return False
With regard to phrasing, it's probably easiest to search by the non-phrased query first (using indexes), then filter those for having the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in ts_vectors. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example of it in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least.