How to copy particular elements from a web page - XPath

My goal is to extract a particular text area from a web page. Imagine being able to draw a rectangle anywhere on a page and have everything inside that rectangle copied to your clipboard. I am using Firebug (feel free to suggest other solutions; I have searched for plugins or bookmarklets but did not find anything useful) with its console window and XPath for this purpose. The values which I want to obtain are in the following format (as observed in Firebug's HTML inspector):
<span class="number3_0" title="Numbers">3.00</span>
so I end up with the following query, which I issue from the Firebug console:
$x("//span[#title='Numbers']/text()")
After this I get something like this:
[<TextNode textContent="2.00">, <TextNode textContent="2.00">, <TextNode textContent="2.00">, <TextNode textContent="2.00">, <TextNode textContent="3.00">]
After this I right-click on [, select Inspect in DOM Panel, press Ctrl+A, and copy/paste the data in the following format:
0 <TextNode textContent="2.00">
1 <TextNode textContent="2.00">
2 <TextNode textContent="2.00">
3 <TextNode textContent="2.00">
4 <TextNode textContent="3.00">
As you can guess, the value of textContent is the information I am interested in. I have tried to modify the original XPath query to return only these numbers, but with no luck. I tried:
wrapping the whole query in string(), as suggested in Xpath - get only node content without other elements
trying to figure out how this one works: Extracting text in between nodes through XPath, and a lot more.
To obtain the desired values I resorted to some bash scripting plus XML formatting; after this tedious, error-prone task I get the following format:
<?xml version="1.0"?>
<head>
<TextNode textContent="2.00"/>
<TextNode textContent="2.00"/>
<TextNode textContent="2.00"/>
<TextNode textContent="2.00"/>
<TextNode textContent="3.00"/>
<TextNode textContent="3.00"/>
</head>
Now I use xmlstarlet to obtain those values (yes, I know I could have used a regexp in the previous step and gotten all the data I need, but I am interested in DOM/XPath parsing and trying to figure out how it works):
cat input | xmlstarlet sel -t -m "//TextNode" -v 'concat(@textContent,"
")'
This finally gives me the desired output:
2.00
2.00
2.00
2.00
3.00
My questions are a bit generic:
1. How can this terribly long process be automated?
2. How can I modify the original XPath query used in Firebug, $x("//span[@title='Numbers']/text()"), to immediately get only the numbers and save myself the rest of the steps?
3. I am still not very familiar with xmlstarlet; its selection (sel) mode especially drives me crazy. I have seen various combinations of the following options:
-c or --copy-of - print a copy of the XPath expression
-v or --value-of - print the value of the XPath expression
-o or --output - output a string literal
-m or --match - match an XPath expression
Can somebody please explain when to use which one? It would be great to see particular examples if possible; my best attempt at a worked example follows the links below. In case of interest, here are various combinations of the mentioned options that I do not understand well:
http://www.grahl.ch/blog/minutiae-return-content-element-xmlstarlet
Extracting and dumping elements using xmlstarlet
Testing for an XML attribute
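To make the question concrete, here is a toy document (toy.xml, made up purely for illustration) and my best understanding of how the options combine:
cat > toy.xml <<'EOF'
<root><item id="a">1.00</item><item id="b">2.00</item></root>
EOF
# -m iterates over every node matching the XPath; within each match,
# -v prints the string value of an expression and -o prints a literal;
# -n appends a newline after each match:
xmlstarlet sel -t -m "//item" -v "@id" -o ": " -v "." -n toy.xml
# a: 1.00
# b: 2.00
# -c copies the matched node itself, markup included:
xmlstarlet sel -t -c "//item[1]" toy.xml
# <item id="a">1.00</item>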
4. The last question regarding xmlstarlet is a bit of cosmetic syntactic sugar: how do I obtain nice newline-separated output? As you can see above, I 'cheat' by embedding a literal newline as the separator, but when I tried it with an escape character like this:
cat input | xmlstarlet sel -t -m "//TextNode" -v 'concat(@textContent,"\n")'
it did not work. The original reference from which I learned a lot also uses it in this 'ugly' way: http://www.ibm.com/developerworks/library/x-starlet/index.html
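Perhaps the -n (--nl) template option from the sketch above is the cleaner way, assuming the xmlstarlet build supports it, since it appends a newline after each match instead of embedding one in concat():
cat input | xmlstarlet sel -t -m "//TextNode" -v "@textContent" -n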
PS: Maybe all these steps could be simplified with curl + xmlstarlet, but it would be handy to also have a Firebug option for pages which require login or other such things.
Thanks for all ideas.

From what I gather, you want to collect the numbers from spans that have the title 'Numbers' and want the result as a string.
Try the following:
var numberNodes = document.querySelectorAll('span[title="Numbers"]')
function giveText(me) { return me.textContent; }
Array.prototype.map.call(numberNodes, giveText).join("\n");
The first line selects all matching nodes in the document using a CSS selector (meaning you do not need XPath).
The second line creates a function that returns the text content of a node.
The third line maps the elements from the numberNodes list using the giveText function, produces an array of numbers, and then finally joins them with a newline.
After this you might not need xmlstarlet at all.

$$("<CSS3 selector>") and $x("<XPATH>") in Firebug actually return a real Array (not like the results of document.querySelectorAll() or document.evaluate). So they are more convenient.
With Firefox + Firebug:
var numbersNode = $x("//span[@title='Numbers']/text()");
var numbersText = numbersNode.map(function(numberNode) {
    return numberNode.textContent;
}).join("\n");
// Special command of Firebug to copy text into clipboard:
copy(numbersText);
You can even do it in a more compact way using ECMAScript 6 arrow functions:
copy($x("//span[#title='Numbers']/text()").map(x => x.textContent).join("\n"));
The same works if you choose $$('span[title="Numbers"]') as suggested by William Narmontas.
Florent

Related

Finding elements: different syntax -> different results - Explanation needed

When I use:
cy.get('b').contains('xdz') // find 1 element
but when I use:
cy.get('b:contains("xdz")') // find 2 elements
Can someone explain to me what the difference is?
cy.get('b').contains('xdz') invokes a Cypress command that is designed to return only a single element, so that you can narrow a search by text content.
cy.get('b:contains("xdz")') uses the jQuery pseudo-selector :contains() to test the text inside the <b> elements, and it is designed to return all matching elements.
Pseudo-selectors are extensions to the CSS selector syntax that apply jQuery methods during the selection. In this case :contains('sometext') is roughly shorthand for $el.text().contains('sometext'). Because it is part of the selector, it returns all matching elements.
It's worthwhile understanding the jQuery selector variations, as this example illustrates: they can give you different results in different situations.
contains('xdz') is a Cypress command which always yields only the first element containing the text. You can read more about it in this GitHub thread.
:contains("xdz") is a jQuery selector and it returns all elements containing the text. You can read more about it in the jQuery docs.

How to properly scrape filtered content into a Google Sheet using an XPath query?

So, this is about content from a website which I want to get and put into my Google Sheet, but I'm having difficulty understanding the class of the content.
target link: https://www.cnbc.com/quotes/?symbol=XAU=
This number is what I want to get. Picture 1: the part which I want to scrape
And this is what the code looks like in the inspector. Picture 2: the code shown in the inspector
The target is inside a span element, but that span looks very complicated to me, so I tried to simplify it with this formula: =IMPORTXML("https://www.cnbc.com/quotes/?symbol=XAU=","//table[@class='quote-horizontal regular']//tr/td/span")
Picture 3: the list shown when using that formula
After some tries, I was able to get the right target, but it confuses me. I'm using this formula: =IMPORTXML("https://www.cnbc.com/quotes/?symbol=XAU=","//table[@class='quote-horizontal regular']//tr/td/span[@class='last original'][1]")
Picture 4: the right target shown when the XPath query is more specific
As you can see in the 2nd picture, 'last original' is not really the full name of the class; when I put in 'last original ng-binding' instead, it gave me an error saying the imported content is empty.
So, correct me if my code is wrong, or did it accidentally work out somehow because there's another correct way?
How about this answer?
Modified formula 1:
When the class name is last original or last original ng-binding, how about the following XPath and formula?
=IMPORTXML(A1,"//span[contains(@class,'last original')][1]")
In this case, the URL https://www.cnbc.com/quotes/?symbol=XAU= is put in cell "A1".
Here, //span[contains(@class,'last original')][1] is used as the XPath. It retrieves the value of a span whose class attribute includes last original, so both last original and last original ng-binding match.
Modified formula 2:
As another XPath, how about the following formula?
=IMPORTXML(A1,"//meta[@itemprop='price']/@content")
It seems that the value is also included in the page metadata, so this sample retrieves it from there.
Reference:
IMPORTXML
To complete @Tanaike's answer, two alternatives:
=IMPORTXML(B2;"//span[#class='year high']")
"Year high" seems always equal to the current stock index value.
Or, with the value retrieved from the script element:
=IMPORTXML(B2;"substring-before(substring-after(//script[contains(.,'modApi')],'""last\"":\""'),'\')")
Note: since I'm based in Europe, the formulas use ; as the argument separator; you may need to replace ; with , depending on your locale.

How to extract items inside a table using scrapy

I want to extract all the functions listed inside the table at the link below: python functions list
I have tried using the Chrome developer console to get the exact XPath to use in spider.py, as below:
$x('//*[@id="built-in-functions"]/table[1]/tbody//a/@href')
but this returns a list of all hrefs (which, I think, is what the XPath expression refers to).
I believe I need to extract the text from there, but appending /text() to the above XPath returns nothing. Can someone please help me extract the function names from the table?
I think this should do the trick
response.css('.docutils .reference .pre::text').extract()
A non-exact XPath equivalent (which also works in this case) would be:
response.xpath('//table[contains(@class, "docutils")]//*[contains(@class, "reference")]//*[contains(@class, "pre")]/text()').extract()
Try this:
for td in response.css("#built-in-functions > table:nth-child(4) td"):
    td.css("span.pre::text").extract_first()

Can't display ordered list in Markdown

I'm using Hexo to publish my blog, which I write in Markdown. I ran into a problem when I tried to use an ordered list: it isn't displayed correctly.
Here is my code:
1. first
2. second
+ inner first
+ inner second
However, only an unordered list was shown.
I would like it to be shown like this:
http://7xjj3m.com1.z0.glb.clouddn.com/20150622_0.jpg
but this is what it actually looked like:
http://7xjj3m.com1.z0.glb.clouddn.com/20150622_1.jpg
So, what's the problem?
Your syntax is correct, but you need to indent the nested list by four spaces. I would refrain from using tabs because your computer / program could default to 2 spaces instead of 4.
I use a Markdown editor called Mou; I input your syntax and got the proper result.
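Concretely, the list from the question with the nested items indented four spaces:
1. first
2. second
    + inner first
    + inner second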

View elements using JavaScript code in Vimperator?

I tried to use a command like this in Vimperator:
echo document.getElementsByTagName("p");
The intent was to view the nodes whose tag name is <p>. However, Vimperator's result is an empty collection, while the same command in Firebug returns a non-empty one. Does anyone know why Vimperator echoes a Collection whose length is zero?
See the related question: How do I getElementByID in Vimperator?. The short answer is to use echo window.content.window.document.getElementsByTagName("p") instead.
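The underlying reason, as far as I understand it, is that Vimperator commands run in the browser chrome context, where document refers to the browser UI's document rather than the page's; window.content gives you the content window. A quick sketch (assuming the current page actually contains <p> elements):
echo window.content.window.document.getElementsByTagName("p").length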
