Django-MPTT - ordering root nodes by count of immediate descendants - django-queryset

I'm using Django-MPTT to display a simple 2-level hierarchy (root => child(ren)). I'm looking for a way to structure my queryset so that root nodes are returned in order of how many children they have: the root node with the most children first and the one with the fewest (if any) last.

Take a look at your parent field and make note of the related_name. Suppose it is children. Then do the following:
from django.db.models import Count
MyMPTTModel.objects.root_nodes().annotate(
Count('children')).order_by('-children__count')
If you need access to the child instances themselves, you may also want to look at doing a qs.prefetch_related('children') as well.

Something like this should do it:
from mptt.templatetags.mptt_tags import cache_tree_children
qs = qs.filter(level__lt=2)
root_nodes = cache_tree_children(qs)
root_nodes.sort(key=lambda node: len(node.get_children()), reverse=True)


Depth first search on Neo4j with filtering on node properties

I would like to perform a depth-first search on my graph to get all the paths starting from a given node ('N1456' in my example), where every node on these paths must have the same property "PROPERTY_TO_FILTER".
Typically, my graph is composed of two types of nodes and two types of relations.
So far, I have tested the following query:
WITH "
MATCH (my_node{name : 'N1456'})
CALL apoc.path.expandConfig(my_node, {uniqueness:'NODE_GLOBAL', bfs : FALSE}) YIELD path
WITH path, my_node, last(nodes(path)) as subgraph
WHERE my_node<> subgraph and my_node.my_property CONTAINS 'PROPERTY_TO_FILTER'
RETURN nodes(path), length(path) AS len
ORDER BY len DESC" AS query
CALL apoc.export.json.query(query, "my_results.json", {})
YIELD properties, data
RETURN properties, data;
However, the results are not the ones I expected. I get a list of paths, but only the first node has the property "PROPERTY_TO_FILTER"; the filter is not applied to the other nodes.
I guess I should put a filter at the apoc.path.expandConfig level, but according to the documentation it can only filter on node labels, not on node properties.
Could someone help, please?
Maybe this can help:
MATCH(fromNode:LABEL{name : 'N1456'})-[r:REL_TO_TRAVERSE*1..2]->(toNode:LABEL)
WHERE toNode.my_property CONTAINS 'PROPERTY_TO_FILTER'
RETURN fromNode,r,toNode
It's called variable length pattern matching:
https://neo4j.com/docs/cypher-manual/current/syntax/patterns/#cypher-pattern-varlength

Problems with '._ElementUnicodeResult'

While trying to help another user out with some question, I ran into the following problem myself:
The object is to find the country of origin of a list of wines on the page. So we start with:
import requests
from lxml import etree
url = "https://www.winepeople.com.au/wines/Dry-Red/_/N-1z13zte"
res = requests.get(url)
content = res.content
tree = etree.fromstring(content, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)
Next, for reasons I'll get into in a separate question, I'm trying to compare the xpath of two elements with certain attributes. So:
wine = tree.xpath("//div[contains(@class, 'row wine-attributes')]")
country = tree.xpath("//div/text()[contains(., 'Australia')]")
So far, so good. What are we dealing with here?
type(wine),type(country)
>> (list, list)
They are both lists. Let's check the type of the first element in each list:
type(wine[0]),type(country[0])
>> (lxml.etree._Element, lxml.etree._ElementUnicodeResult)
And this is where the problem starts. Because, as mentioned, I need to find the xpath of the first elements of the wine and country lists. And when I run:
tree_struct.getpath(wine[0])
The output is, as expected:
'/html/body/div[13]/div/div/div[2]/div[6]/div[1]/div/div/div[2]/div[2]'
But with the other:
tree_struct.getpath(country[0])
The output is:
TypeError: Argument 'element' has incorrect type (expected
lxml.etree._Element, got lxml.etree._ElementUnicodeResult)
I couldn't find much information about _ElementUnicodeResult, so what is it? And, more importantly, how do I fix the code so that I get an xpath for that node?
You're selecting a text() node instead of an element node. This is why you end up with a lxml.etree._ElementUnicodeResult type instead of a lxml.etree._Element type.
Try changing your xpath to the following in order to select the div element instead of the text() child node of div...
country = tree.xpath("//div[contains(., 'Australia')]")
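If you would rather keep the original text() query, another option may be to step from the text result back to the element that contains it. As far as I know, the string results lxml returns from XPath carry a getparent() method for this. The sketch below uses a small made-up HTML snippet in place of the real page, so the markup and the printed path are assumptions, not what the actual site returns.

```python
from lxml import etree

# Made-up stand-in for the real page (assumption, not the actual markup).
html = (
    "<html><body>"
    "<div class='row wine-attributes'><div>Country: Australia</div></div>"
    "</body></html>"
)
tree = etree.fromstring(html, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)

# Same query as in the question: this returns text results, not elements.
country = tree.xpath("//div/text()[contains(., 'Australia')]")

# Step from the text result back to the element that contains it.
country_element = country[0].getparent()

# getpath() receives an element here, so it no longer raises TypeError.
country_path = tree_struct.getpath(country_element)
print(country_path)
```

This keeps the match as narrow as the text() query, whereas //div[contains(., 'Australia')] also matches every ancestor div of the one holding the text.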

How to do nokogiri attribute selection?

I have many statements like this in my test.xml file
<House name="bla"><Room id="bla" name="black" ></Room></House>
How do I print all Rooms with name="black"? I am using a CSS selector, but the selector only seems to accept element names such as House and Room.
I started by trying to print every element that has a name, no matter whether it is a House or a Room:
nodes = doc.css("name")
But it gives null as the output, so I am not able to proceed.
In CSS you have a syntax for matching elements by an attribute key-val pair:
nodes = doc.css("[name='black']")
For future reference you can also chain attribute selectors
nodes = doc.css(".my-class[name='black'][foo='bar']")
Or omit the val and match any element where the attribute is present:
nodes = doc.css("[name]")

XPath: Select Certain Child Nodes

I'm using XPath with Scrapy to scrape data off of a movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node all in one Xpath string.
Depending on the movie page I'm scraping, the data I need is sometimes located at different child nodes, for example depending on whether or not there is a link. I will be going through about 14000 movies, so this process needs to be automated.
Using this as an example. I will need actor/s, director/s and producer/s.
This is the XPath to the director. Note: the %s corresponds to a determined index where that information is found; in the Action Jackson example the director is found at [1] and the actors at [2].
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, if a link to a page on the director exists, this would be the XPath:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit more tricky, as there are <br> elements included for subsequent actors listed, which may be children of an /a or children of the parent /font, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
gets almost all of the actors (except those with font/br).
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"] elements: everything I have works EXCEPT that I also end up getting some digits from other mp_box_content blocks. I have also added numerous try:/except: statements in order to get everything (actors, directors and producers, both with and without links associated with them). For example, the following is my Scrapy code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover for the facts that: the first actor may not have a link associated with him/her and subsequent actors do, the first actor may have a link associated with him/her but the rest may not.
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
I am assuming you are only interested in textual content, not the links to actors' pages etc.
Here is a proposition using lxml.html (and a bit of lxml.etree) directly.
First, I recommend you select td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for both "Director" and "Directors".
Second, testing various expressions with or without <font>, with or without <a> etc. makes the code difficult to read and maintain. Since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.
As for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content to add a special formatting character, for example a \n, to the <br> elements' .text or .tail, using one of lxml's iter() functions. This can be useful for other HTML block elements too, like <hr>.
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html
MARKER = "|"
def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER
def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)
class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]
    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')
    def parse(self, response):
        root = lxml.html.fromstring(response.body)
        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')
        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:
            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Producer")
            # look up the "Writer" row (the original passed "Producer" here)
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Composer")
            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)
            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
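For readers on Python 3, here is a minimal sketch of the same two ideas (locating the cell by its label, and marking <br> elements before extracting the text) without Scrapy. The HTML fragment and the names_in_category helper are made up for illustration and only imitate the "The Players" table; the marker goes into each <br> element's .tail rather than its .text, which is a small variation on the spider code above.

```python
import lxml.html

MARKER = "|"

# Made-up fragment imitating the "The Players" table (assumption).
SAMPLE = """
<table>
  <tr><td>Director:</td><td><font><a href="#">Craig R. Baxley</a></font></td></tr>
  <tr><td>Actors:</td><td><font><a href="#">Carl Weathers</a><br>Craig T. Nelson<br><a href="#">Sharon Stone</a></font></td></tr>
</table>
"""


def names_in_category(table, category):
    """Return the names in the row whose first cell starts with `category`.

    Hypothetical helper; note that it modifies the tree in place.
    """
    cells = table.xpath(
        ".//tr[starts-with(td[1], $category)]/td[2]", category=category
    )
    if not cells:
        return []
    cell = cells[0]
    # Put the marker right after each <br> so adjacent names stay separable.
    for br in cell.iter("br"):
        br.tail = MARKER + (br.tail or "")
    # text_content() flattens the cell to text, ignoring <font> and <a>.
    text = cell.text_content()
    return [name.strip() for name in text.split(MARKER) if name.strip()]


table = lxml.html.fromstring(SAMPLE)
print(names_in_category(table, "Director"))
print(names_in_category(table, "Actor"))
```

Because the text is flattened before splitting, it does not matter whether a given name is wrapped in a link or sits directly inside <font>.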
You can do similar things with HtmlXPathSelector of course (with the string() XPath function for example), but without modifying the tree for <br> (how to do that with hxs?) it works only for non-multiple names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']

xpath - select a node value if another node value exists

I have several nodes (see below). I know how to select specific nodes which have a certain attribute. But in this case I would like to import the "file_url" value of the media objects that belong to the group "narrowImage".
<media_object>
<media_object>
<file_id>5175967</file_id>
<group>wideImage</group>
<file_url>http://www.mysite.com/image1.jpg</file_url>
</media_object>
<media_object>
<file_id>5175968</file_id>
<group>wideImage</group>
<file_url>http://www.mysite.com/image2.jpg</file_url>
</media_object>
<media_object>
<file_id>5175969</file_id>
<group>narrowImage</group>
<file_url>http://www.mysite.com/image3.jpg</file_url>
</media_object>
</media_object>
In the above case I would only need the value "http://www.mysite.com/image3.jpg".
Is there any XPath expert out there who can point me in the right direction?
Use:
/*/*[group = 'narrowImage']/file_url
This selects any file_url element that is a "grand-child" of the top element in the XML document, and whose parent has a group child-element whose string value is 'narrowImage'.
I think you should be able to use:
//media_object[group='narrowImage']/file_url
This should select every media_object in your file (regardless of the level) then filter them based on group='narrowImage' then give you the file_url child.
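To try the predicate quickly, here is a small sketch using Python's standard-library xml.etree.ElementTree. ElementTree only supports a subset of XPath, so the expression is written relative to the root element (.//media_object[...]) instead of using the absolute forms above. The XML is a shortened copy of the sample from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question.
XML = """
<media_object>
  <media_object>
    <file_id>5175967</file_id>
    <group>wideImage</group>
    <file_url>http://www.mysite.com/image1.jpg</file_url>
  </media_object>
  <media_object>
    <file_id>5175969</file_id>
    <group>narrowImage</group>
    <file_url>http://www.mysite.com/image3.jpg</file_url>
  </media_object>
</media_object>
"""

root = ET.fromstring(XML)

# Keep only media_object elements whose <group> child has the text
# 'narrowImage', then take their <file_url> child.
matches = root.findall(".//media_object[group='narrowImage']/file_url")
urls = [element.text for element in matches]
print(urls)
```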
