How can I select the element in JavaScript source? - xpath

I need to get the value of the "html" key in the JavaScript source code below, which was extracted by xpath('.//script[34]') and is embedded in an HTML source page.
<script>
FM.view({
"ns": "pl.content.homeFeed.index",
"domid": "Pl_Official_MyProfileFeed__24",
"css": ["style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e"],
"js": "page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334",
"html": " <div class=\"WB_feed WB_feed_v3\" pageNum=\"\" node-type='feed_list' module-type=\"feed\">\r\n...."
})
</script>
In particular, I don't know how to handle the "FM.view" text.

I would use .re() to extract the html key value from the script:
>>> response.xpath("//script[contains(., 'Pl_Official_MyProfileFeed__24')]/text()").re(r'"html": "(.*?)"\n')[0].strip()
u'<div class=\\"WB_feed WB_feed_v3\\" pageNum=\\"\\" node-type=\'feed_list\' module-type=\\"feed\\">\\r\\n..'
Or, you can extract the complete object from the script, load it with json and get the html value:
>>> import json
>>> data = response.xpath("//script[contains(., 'Pl_Official_MyProfileFeed__24')]/text()").re(r'(?ms)FM\.view\((\{.*?\})\)')[0]
>>> obj = json.loads(data)
>>> obj['html'].strip()
u'<div class="WB_feed WB_feed_v3" pageNum="" node-type=\'feed_list\' module-type="feed">\r\n....'
Note the (?ms) part in the regular expression - this is the way we set the flags - multiline and dotall - required for the pattern to work in this case.
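The same regular expression can be tried outside Scrapy. Below is a minimal sketch using only the standard library; the script_text sample is a shortened copy of the question's script and the helper name extract_view_object is made up for illustration. In this sample it is the dotall part of the flags that lets "." cross line breaks. The non-greedy match stops at the first "})", so it would presumably break if the "html" value itself contained that sequence.

```python
import json
import re

# Shortened copy of the <script> body from the question (assumed sample data).
script_text = r'''
FM.view({
"ns": "pl.content.homeFeed.index",
"domid": "Pl_Official_MyProfileFeed__24",
"html": " <div class=\"WB_feed\">\r\n...."
})
'''


def extract_view_object(js_source):
    """Return the object passed to FM.view() as a dict, or None if not found."""
    match = re.search(r'(?ms)FM\.view\((\{.*?\})\)', js_source)
    if match is None:
        return None
    return json.loads(match.group(1))


obj = extract_view_object(script_text)
html_value = obj["html"].strip()
print(repr(html_value))

# Without the flags, "." cannot cross the line breaks inside the object literal.
without_flags = re.search(r'FM\.view\((\{.*?\})\)', script_text)
print(without_flags)  # None
```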

Here's an alternative to regular expression + json, using the js2xml package.
First step is to get the JavaScript statements within <script> from HTML. You probably have that step already. Here I'm building a Scrapy selector from your input HTML. In your case you are probably working with a response within a callback:
>>> import scrapy
>>> import js2xml
>>> t = r''' <script>
... FM.view({
... "ns": "pl.content.homeFeed.index",
... "domid": "Pl_Official_MyProfileFeed__24",
... "css": ["style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e"],
... "js": "page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334",
... "html": " <div class=\"WB_feed WB_feed_v3\" pageNum=\"\" node-type='feed_list' module-type=\"feed\">\r\n...."
... })
... </script>'''
>>> selector = scrapy.Selector(text=t, type='html')
Second step is to build a tree representation of the JavaScript program using js2xml.parse(). You get an lxml tree back:
>>> js = selector.xpath('//script/text()').extract_first()
>>> jstree = js2xml.parse(js)
>>> jstree
<Element program at 0x7ff19ec94ea8>
>>> type(jstree)
<type 'lxml.etree._Element'>
>>> print(js2xml.pretty_print(jstree))
<program>
<functioncall>
<function>
<dotaccessor>
<object>
<identifier name="FM"/>
</object>
<property>
<identifier name="view"/>
</property>
</dotaccessor>
</function>
<arguments>
<object>
<property name="ns">
<string>pl.content.homeFeed.index</string>
</property>
<property name="domid">
<string>Pl_Official_MyProfileFeed__24</string>
</property>
<property name="css">
<array>
<string>style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e</string>
</array>
</property>
<property name="js">
<string>page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334</string>
</property>
<property name="html">
<string> <div class="WB_feed WB_feed_v3" pageNum="" node-type='feed_list' module-type="feed">
....</string>
</property>
</object>
</arguments>
</functioncall>
</program>
Third step is to select the object you want from the tree.
Here, it's the 1st argument of the FM.view() call. Calling .xpath() on the lxml tree gives you a list even if you selected 1 node (XPath returns node-sets).
# select the function call for "FM.view"
# and get first argument
>>> jstree.xpath('''
//functioncall[
function[.//identifier/@name="FM"]
[.//identifier/@name="view"]]
/arguments
/*[1]''')
[<Element object at 0x7ff19ec94ef0>]
>>> args = jstree.xpath('//functioncall[function[.//identifier/@name="FM"][.//identifier/@name="view"]]/arguments/*[1]')
Fourth, convert the <object> into a Python dict using js2xml.jsonlike.make_dict():
# use js2xml.jsonlike.make_dict() on that argument
>>> js2xml.jsonlike.make_dict(args[0])
{'ns': 'pl.content.homeFeed.index', 'html': ' <div class="WB_feed WB_feed_v3" pageNum="" node-type=\'feed_list\' module-type="feed">\r\n....', 'css': ['style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e'], 'domid': 'Pl_Official_MyProfileFeed__24', 'js': 'page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334'}
>>> from pprint import pprint
>>> pprint(js2xml.jsonlike.make_dict(args[0]))
{'css': ['style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e'],
'domid': 'Pl_Official_MyProfileFeed__24',
'html': ' <div class="WB_feed WB_feed_v3" pageNum="" node-type=\'feed_list\' module-type="feed">\r\n....',
'js': 'page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334',
'ns': 'pl.content.homeFeed.index'}
>>>
And finally, you simply use the "html" key from that dict:
>>> jsdata = js2xml.jsonlike.make_dict(args[0])
>>> jsdata['html']
' <div class="WB_feed WB_feed_v3" pageNum="" node-type=\'feed_list\' module-type="feed">\r\n....'
>>>

Related

Clarification of Nokogiri::NodeSet XML Content based on 'puts node' and 'puts node.inspect'

I rarely use xpath() but when I do I keep tripping myself up on interpreting content of Nokogiri::Nodesets and believe I now know where I have always gone wrong.
Simply put, when I do a 'puts NodeSet' I have always assumed that I could search the NodeSet based on the returned XML. But the first tag returned does not appear to actually be part of the node XML.
'puts n1' returns XML that has a SPAN as the first element of the XML, but if I then do a search with n1.xpath('SPAN') or n1.xpath('SPAN/DIV'), no nodes are found. n1.xpath('DIV') returns the output I expect, which shows there is no SPAN tag in the XML.
The only way I can logically explain this to myself is to assume that the first XML tag of a 'puts node' is the "Node Name" and not part of the node XML. This works for me going forward, but am I missing something that is going to bite me elsewhere?
CODE:
docxml = Nokogiri::XML(<<EOT)
<DIV><SPAN><DIV id='1'><H1>-H1-</H1><h1>-h1-</h1></DIV>
<DIV id='2'><H2>-H2-</H2> <h2>-h2-</h2></DIV>
<DIV id='3'><H3>-H3-</H3><h3>-h3-</h3></DIV>
</SPAN></DIV>
EOT
n0 = docxml.xpath('DIV')
n1 = n0.xpath('SPAN')
n2 = n1.xpath('DIV')
n3 = n2.xpath('*')
n4 = n3.xpath('*')
puts "n1:xpath('SPAN'): \n#{n1.xpath('SPAN')}\n#{'^'*80} \nn1 XML:\n#{n1}\n#{'^'*80}\
\nn1:inspect \n#{n1.inspect}\n#{'^'*80}\n"
OUTPUT:
=begin
n1:xpath('SPAN'):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1 XML:
<SPAN>
<DIV id="1"> <H1>-H1-</H1> <h1>-h1-</h1> </DIV>
<DIV id="2"> <H2>-H2-</H2> <h2>-h2-</h2> </DIV>
<DIV id="3"> <H3>-H3-</H3> <h3>-h3-</h3> </DIV>
</SPAN>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1:inspect
[#<Nokogiri::XML::Element:0x1c10964 name="SPAN"
children=[
#<Nokogiri::XML::Element:0x1c10820 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x18fff90 name="id" value="1">]
children=[#<Nokogiri::XML::Element:0x1c1064c name="H1" children=[#<Nokogiri::XML::Text:0x1c1ffe8 "-H1-">]>,
#<Nokogiri::XML::Element:0x1c10604 name="h1" children=[#<Nokogiri::XML::Text:0x1c1fdcc "-h1-">]>
]>,
#<Nokogiri::XML::Element:0x1c107d8 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1fc10 name="id" value="2">]
children=[#<Nokogiri::XML::Element:0x1c105bc name="H2" children=[#<Nokogiri::XML::Text:0x1c1f874 "-H2-">]>,
#<Nokogiri::XML::Text:0x1c1f778 " ">,
#<Nokogiri::XML::Element:0x1c10574 name="h2" children=[#<Nokogiri::XML::Text:0x1c1f5f8 "-h2-">]
>]>,
#<Nokogiri::XML::Element:0x1c10790 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1f43c name="id" value="3">]
children=[#<Nokogiri::XML::Element:0x1c1052c name="H3" children=[#<Nokogiri::XML::Text:0x1c1f0a0 "-H3-">]>,
#<Nokogiri::XML::Element:0x1c104e4 name="h3" children=[#<Nokogiri::XML::Text:0x1c1ee90 "-h3-">]
>]
>]
>]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
=end
Now that I have had some sleep this works for me.
'nodeset = xpath(tag1/tag2)' returns a 'nodeset' containing member node 'tag2'
'puts nodeset' displays the 'tag2' node member
'nodeset.xpath('*')' returns the content of 'tag2'
'nodeset.xpath('tag2')' finds nothing, as 'tag2' is not part of the content of 'tag2'

Unable to find element by attribute with lxml

I'm using a European Space Agency API to query for satellite image metadata to parse into Python objects.
Using the requests library I can successfully get the result in XML format and then read the content with lxml. I am able to find the elements and explore the tree as expected:
# loading the response; etree.fromstring() returns the root Element directly
root = etree.fromstring(response.content)
ns = root.nsmap
# get the first entry element and its summary
e = root.find('entry',ns)
summary = e.find('summary',ns).text
print summary
>> 'Date: 2018-11-28T09:10:56.879Z, Instrument: OLCI, Mode: , Satellite: Sentinel-3, Size: 713.99 MB'
The entry element has several date descendants with different values of the name attribute:
for d in e.findall('date',ns):
    print d.tag, d.attrib
>> {http://www.w3.org/2005/Atom}date {'name': 'creationdate'}
{http://www.w3.org/2005/Atom}date {'name': 'beginposition'}
{http://www.w3.org/2005/Atom}date {'name': 'endposition'}
{http://www.w3.org/2005/Atom}date {'name': 'ingestiondate'}
I want to grab the beginposition date element using XPath syntax [@attrib='value'] but it just returns None. Even just searching for a date element with the name attribute ([@attrib]) returns None:
dt_begin = e.find('date[@name="beginposition"]',ns) # dt_begin is None
dt_begin = e.find('date[@name]',ns) # dt_begin is None
The entry element includes other children that exhibit the same behaviour e.g. multiple str elements also with differing name attributes.
Has anyone encountered anything similar or is there something I'm missing? I'm using Python 2.7.14 with lxml 4.2.4
It looks like an explicit prefix is needed when a predicate ([@name="beginposition"]) is used. Here is a test program:
from lxml import etree
print etree.LXML_VERSION
tree = etree.parse("data.xml")
ns1 = tree.getroot().nsmap
print ns1
print tree.find('entry', ns1)
print tree.find('entry/date', ns1)
print tree.find('entry/date[@name="beginposition"]', ns1)
ns2 = {"atom": 'http://www.w3.org/2005/Atom'}
print tree.find('atom:entry', ns2)
print tree.find('atom:entry/atom:date', ns2)
print tree.find('atom:entry/atom:date[@name="beginposition"]', ns2)
Output:
(4, 2, 5, 0)
{None: 'http://www.w3.org/2005/Atom', 'opensearch': 'http://a9.com/-/spec/opensearch/1.1/'}
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750b90>
<Element {http://www.w3.org/2005/Atom}date at 0x7f89877503f8>
None
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750098>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a950>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a7a0>
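For comparison, here is a small self-contained sketch of the explicit-prefix approach. It uses the standard library's xml.etree.ElementTree rather than lxml, so it can run without extra packages, and the feed_xml document is invented sample data (only the Atom namespace URI and the attribute names come from the question). The "atom" prefix is an arbitrary label chosen for the mapping.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; only the namespace URI and attribute names
# mirror the question.
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <summary>sample entry</summary>
    <date name="creationdate">2018-11-28T10:00:00Z</date>
    <date name="beginposition">2018-11-28T09:10:56.879Z</date>
  </entry>
</feed>"""

root = ET.fromstring(feed_xml)

# Map an explicit prefix to the default namespace of the document.
ns = {"atom": "http://www.w3.org/2005/Atom"}

# Every step of the path carries the prefix, including the step
# that has the attribute predicate.
begin = root.find('atom:entry/atom:date[@name="beginposition"]', ns)
begin_text = begin.text
print(begin.get("name"), begin_text)
```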

Changing a time format of an xml file using a bash script

I'm trying to change the formatting of a time in a file using a bash script.
Current time format:
08:05:00
Goal time format:
8-05
There are other timestamps I don't want to change in the file and every instance that I want to change is wrapped in xml:
time="current time format"
Can anyone help?
You should use an XML parser to solve this problem. I would do it in a language that comes with an XML parsing library and a date-time parsing library. Python fits the bill:
import xml.etree.ElementTree as ET
from datetime import datetime
import sys
tree = ET.parse(sys.argv[1])
# for each element with a "time" attribute, alter the format of the attribute value
for elem in tree.findall('.//*[@time]'):
    time = datetime.strptime(elem.get('time'), '%H:%M:%S')
    elem.set('time', time.strftime('%k-%M').lstrip())
# print the new XML to stdout
print(bytes.decode(ET.tostring(tree.getroot())))
Testing:
$ cat file.xml
<root>
<a>
<b time="08:05:00">
<c>text contains time 08:05:00</c>
</b>
<d foo="bar" time="19:54:55"/>
</a>
</root>
$ python3 alter_time.py file.xml > new.file.xml
$ cat new.file.xml
<root>
<a>
<b time="8-05">
<c>text contains time 08:05:00</c>
</b>
<d foo="bar" time="19-54" />
</a>
</root>
Error handling is left as an exercise.
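One possible way to approach that exercise is sketched below. The helper name reformat_time is hypothetical, and leaving unparseable values unchanged is just one design choice among several. The sketch also builds the result from the parsed hour and minute instead of '%k', since '%k' is not among the strftime directives that Python documents as portable across platforms.

```python
from datetime import datetime


def reformat_time(value):
    """Turn 'HH:MM:SS' into 'H-MM'; return the input unchanged if it does not parse."""
    try:
        parsed = datetime.strptime(value, "%H:%M:%S")
    except ValueError:
        # Not a time in the expected format: leave the attribute as it was.
        return value
    # parsed.hour is an int, so no leading zero or padding is emitted.
    return f"{parsed.hour}-{parsed.minute:02d}"


print(reformat_time("08:05:00"))  # 8-05
print(reformat_time("19:54:55"))  # 19-54
print(reformat_time("not a time"))  # not a time
```

Inside the loop above, this would be used as elem.set('time', reformat_time(elem.get('time'))).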

None type returned / Tag not found

I have to parse a file to find the minimum and multiple (step) quantity of each SKU:
<product sku="13603">
<sku>13603</sku>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
<product sku="13713">
<sku>13713</sku>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
...
My program is very simple
from lxml import etree
tree = etree.parse('./file-above.xml')
for elem in tree.iterfind('product'):
    vSKU = elem.find('sku').text
    vMin = elem.find('quantity/min_order_quantity').text
When I run it, it creates an error:
AttributeError: 'NoneType' object has no attribute 'text'
When run interactively and changing the last line to...
print elem.find('sku').text
it works, but the line...
print elem.find('quantity/min_order_quantity').text
fails. What's wrong ?
You have a typo in your XPath. find() returns None when nothing matches, which is why accessing .text fails. You need vMin = elem.find('quantity/min-order-quantity').text instead of vMin = elem.find('quantity/min_order_quantity').text (i.e. hyphens instead of underscores).
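The behaviour can be reproduced with a short sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and it wraps a single product from the question in a <products> root element, which is an assumption since the question does not show the file's root.

```python
import xml.etree.ElementTree as ET

# One product from the question, wrapped in an assumed <products> root.
catalog_xml = """<products>
<product sku="13603">
<sku>13603</sku>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
</products>"""

root = ET.fromstring(catalog_xml)
product = root.find("product")

# Misspelled path (underscores): nothing matches, so find() returns None.
missing = product.find("quantity/min_order_quantity")
print(missing)  # None

# Correct path (hyphens): an element is returned and .text is safe to read.
found = product.find("quantity/min-order-quantity")
print(found.text)  # 1

# findtext() avoids the AttributeError by returning a default instead.
min_qty = product.findtext("quantity/min-order-quantity", default="n/a")
step_qty = product.findtext("quantity/step_quantity", default="n/a")
print(min_qty, step_qty)  # 1 n/a
```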

SLD Filter function if_then_else argument #2 - expected type Object

I'm facing a problem with the GeoServer SLD XML.
My XML code is as follows:
<Fill>
<CssParameter name="fill">
<ogc:Function name="if_then_else">
<ogc:Function name="isNull">
<ogc:PropertyName>LTE_RSRP</ogc:PropertyName>
</ogc:Function>
<ogc:Literal>#FF0000</ogc:Literal>
<ogc:Function name="Interpolate">
<ogc:PropertyName>LTE_RSRP</ogc:PropertyName>
<ogc:Literal>-80</ogc:Literal>
<ogc:Literal>#ff0000</ogc:Literal>
<ogc:Literal>-70</ogc:Literal>
<ogc:Literal>#00ff00</ogc:Literal>
<ogc:Literal>-60</ogc:Literal>
<ogc:Literal>#0000ff</ogc:Literal>
<ogc:Literal>color</ogc:Literal>
</ogc:Function>
</ogc:Function>
</CssParameter>
<CssParameter name="fill-opacity">0.3</CssParameter>
</Fill>
My intention is as follows:
If LTE_RSRP is null, fill with #FF0000.
Else, interpolate the color.
But when the above XML is applied, the following error occurs.
ERROR [geotools.rendering] - Filter Function problem for function if_then_else argument #2 - expected type Object
Here, argument #2 is the function Interpolate. (argument counting starts from 0, according to the geotools source code.) It seems that the return value of the function Interpolate is not an object.
Is this intentional? Or am I doing something wrong?
This is intentional - how could a function that interpolates a colour map return an object? What you want to do can be done using Rules and Filters so something (untested) like this should work:
<Rule>
<ogc:Filter>
<ogc:PropertyIsNull>
<ogc:PropertyName>LTE_RSRP</ogc:PropertyName>
</ogc:PropertyIsNull>
</ogc:Filter>
<PolygonSymbolizer>
....
<Fill>
....
</Rule>
<Rule>
<ElseFilter/>
<PolygonSymbolizer>
....
<Fill>
<ogc:Function name="Interpolate">
<ogc:PropertyName>LTE_RSRP</ogc:PropertyName>
<ogc:Literal>-80</ogc:Literal>
<ogc:Literal>#ff0000</ogc:Literal>
<ogc:Literal>-70</ogc:Literal>
<ogc:Literal>#00ff00</ogc:Literal>
<ogc:Literal>-60</ogc:Literal>
<ogc:Literal>#0000ff</ogc:Literal>
<ogc:Literal>color</ogc:Literal>
</ogc:Function>
....
