Trying to scrape data off of dividendinvestor.com - xpath

I'm trying to import some stock data regarding dividend history using Google Sheets.
The data I'm trying to grab is from this page: https://www.dividendinvestor.com/dividend-quote/
(e.g. https://www.dividendinvestor.com/dividend-quote/ibm or https://www.dividendinvestor.com/dividend-quote/msft)
With other sites, I've been able to use a combination of INDEX and IMPORTHTML to get data from a table. For example, if I wanted to get the "Forward P/E" for IBM from finviz.com, I do this:
=index(IMPORTHTML("http://finviz.com/quote.ashx?t=IBM","table", 11),11,10)
That grabs table 11 and goes down 11 rows and over 10 columns to get the piece of data that I want.
However, I cannot seem to find any tables to import via IMPORTHTML from the www.dividendinvestor.com/dividend-quote/ibm site.
I'm trying to import the value to the right of the "Consecutive Dividend Increases" field.
In this case, the output I'm trying to achieve is "19 years".
I've also tried IMPORTXML, but everything I try with XPATH (using this path: "/html/body/div[3]/div/div/div[2]/div/div/div[2]/div[2]/div[2]/span[20]" ) fails too.
Any help out there? The desired end result will be that I will dynamically build the dividendinvestor.com URL by appending a different ticker symbol and have a result of how many years of consecutive increases in their dividend payout.

Nice solution proposed by @player0. If you don't want to use INDEX, you can use:
=IMPORTXML("https://www.dividendinvestor.com/dividend-quote/"&B3,"//a[.='Consecutive Dividend Increases']/following::span[1]")
Update (May 2022):
New working formula:
=REGEXEXTRACT(TEXTJOIN("|";TRUE;IMPORTXML("https://www.dividendinvestor.com/ajax/?action=quote_ajax&symbol="&B2;"//text()"));"\d+ Years")
Note: I'm based in Europe, so depending on your locale you may need to replace the semicolons with commas.
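To make the logic of that formula easier to follow, here is a minimal Python sketch of the same two steps: join every text node with "|" while skipping empty ones (what TEXTJOIN with TRUE does), then take the first match of the pattern (what REGEXEXTRACT does). The list of text nodes is hypothetical sample data, not the actual response from the site.

```python
import re

# Hypothetical text nodes, standing in for what IMPORTXML(..., "//text()") might return.
text_nodes = ["Dividend Yield", "4.5%", "", "Consecutive Dividend Increases", "19 Years"]

# TEXTJOIN("|", TRUE, ...) joins the values with "|" and skips empty cells.
joined = "|".join(node for node in text_nodes if node)

# REGEXEXTRACT returns the first substring that matches the pattern.
match = re.search(r"\d+ Years", joined)
years = match.group(0) if match else None
print(years)
```

Because the pattern only looks for "digits followed by Years", it would pick up the wrong value if another field with that shape appeared earlier in the joined text.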

try:
=INDEX(IMPORTXML("https://www.dividendinvestor.com/dividend-quote/ibm/",
"//span[@class = 'data']"), 9, 1)

Related

How to use a filter formula in Google Sheets when the data contains #N/A

For my example, I have two columns, A and B, in a Google Sheet.
Column A holds a list of stock symbols such as AAPL, IBM, etc.
Column B holds a simple formula: GOOGLEFINANCE(A2,"price")
Sometimes GOOGLEFINANCE returns an error and the cell displays #N/A, but that is not my issue.
I would like to apply a filter on column B that shows all symbols whose price is greater than 100 or is #N/A.
I'd prefer not to use an extra column to achieve that.
I'm struggling with it and still haven't found a way to get this result.
Note that my issue isn't GOOGLEFINANCE itself; it is just an example of how the #N/A values arise.
My thought was to use a filter with a custom formula like: =OR(ISNA(B:B), B:B>100)
But it seems to ignore the #N/A cells and doesn't show them.
Link for example

In my question I tried the formula "=OR(ISNA(B:B), B:B>100)".
It turns out that when Google Sheets evaluates this formula on a cell containing #N/A, the result is automatically #N/A, even though ISNA is the first condition.
To solve it I used a formula like this:
=IF(ISERROR(B:B), TRUE, B:B>100)
I updated the sheet in case someone wants to check it.

Get meaning for a word from dictionary Using XPath Google sheets importxml function

I'm trying to use the IMPORTXML function in Google Sheets to grab the meaning of, and information about, words on https://www.powerthesaurus.org/
I have more or less succeeded in getting data from another website, but as a newbie I'm having trouble getting any data from this one. This is the formula in cell D6 of my Google Sheet:
=ImportXML("https://www.powerthesaurus.org/"&A6,"//*[@id='link link--primary link--term']")
Could someone help me with the correct formula?

You're looking for synonyms. Note that Power Thesaurus can display up to 200 of them.
To get the first 50 synonyms in one cell (since you have one word per row), you can try this:
Create 50 numbered columns in your Google Sheet.
Apply this formula to the first cell and drag it to the right.
=IMPORTXML("https://www.powerthesaurus.org/abbreviation/synonyms";"(//div[@class='pt-thesaurus-card__term'])"&"["&B2&"]")
Then use the JOIN function to get all the words in one cell (XX:XX is the range of your columns, B3:F3 on the provided screenshot).
=JOIN("|";XX:XX)
Result :
Alternatively, we could have used this one-liner (and cleaned up the result afterwards), but Google Sheets returns a blank cell even though the XPath is perfectly valid:
=IMPORTXML("https://www.powerthesaurus.org/abbreviation/synonyms";"normalize-space(//div[@class='pt-list-terms__container'])")

How to capture the index number of a specific node in an absolute xpath

It's a little complicated - but necessary - to explain the backstory, so some patience is requested.
I'm trying to parse an SEC Edgar filing (this Form 10-K, as a random example), not for its financial data, but for the list of Exhibits contained in a table toward the end of the document. Each document has in that table 3 attributes I'm interested in (exhibit number, title and URL), but for this example I'll focus only on the URL.
Finding all the URLs in the document is easy enough to begin with:
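from lxml import etree
import lxml.html
# tree is assumed to be the already-parsed filing (an lxml ElementTree)
for element in tree.iter('a'):
    target = element.get('href')  # None when the link has no href attribute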
But since the document may contain hundreds of URLs, most of which are irrelevant, I have to filter the results for the presence of the word Archives which appears without exception in all Edgar URLs. So in the next stage, I get the xpath of each of them:
    if target is not None and 'Archives' in target:
        print(tree.getpath(element))
So far so good, but this is where I get stuck: it turns out that, for some really bizarre reason, each of the relevant URLs appears not in one but two (and in some documents - up to four!) tables and that these tables are not, unfortunately, the first or last tables in the document but randomly stuck somewhere in the middle. So, for example, Exhibit 10-5's xpaths are:
/html/body/document/type/sequence/filename/text/div[2]/table[9]/tr[17]/td[3]/p/a
/html/body/document/type/sequence/filename/text/div[2]/table[12]/tr[17]/td[3]/p/a
So the URL appears in exactly the same location in both table 9 and table 12. Obviously, I don't want this URL to appear twice in my final URL list, so in my final search I would like to run
for i in tree.xpath('//table[XXX]//*/a'):
    print(i.get('href'))
Where XXX is either 9 or 12, in this example.
So back to the title of the question: how do I extract the index number of the table so I can select the higher (or lower) index number for my tree.xpath() expression? Alternatively, is there a way to stop the getpath search at table 9?
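One possible way to get at the table index, sketched here under the assumption that the paths always contain a step of the form table[N], is to pull the number out of the getpath() string with a regular expression. The helper name table_index is hypothetical, and the two paths are the ones quoted above for Exhibit 10-5.

```python
import re

# The two paths reported for Exhibit 10-5 in the question.
paths = [
    "/html/body/document/type/sequence/filename/text/div[2]/table[9]/tr[17]/td[3]/p/a",
    "/html/body/document/type/sequence/filename/text/div[2]/table[12]/tr[17]/td[3]/p/a",
]

def table_index(path):
    """Return the index of the first table[N] step in a path string, or None."""
    match = re.search(r"/table\[(\d+)\]", path)
    return int(match.group(1)) if match else None

indices = [table_index(p) for p in paths]
lowest = min(indices)  # use max(indices) to keep the later table instead
print(indices, lowest)
```

This is only a sketch: a path whose table step carries no [N] predicate yields None, and with nested tables only the outermost table step is read, so both cases would need extra handling.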

Filtering the output for importhtml in Google Sheets

I am building a google sheet to do calculations based on information I found on different websites and stumbled upon the IMPORTHTML function in Google Sheets.
Terrific, I want to import tables and then use some of the values out of those tables to build my sheet and make further calculations.
However, the function retrieves the headers along with everything else in the table, which makes the result quite hard to work with. Instead, I would like to pull only some of the data, preferably specific cells of the imported table.
Is this possible?
For example:
=ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";3)
returns a huge list. What if I would like to pull only the values of B7 and D7? Is that possible? Even filtering out a single row would be useful, whichever is more feasible. The most important part is that I can get a single row and don't have the full table.

Found the INDEX function, doing exactly what I need it to do!
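For readers unfamiliar with INDEX, the idea is simply a 1-based row and column lookup into the imported table. A minimal Python sketch of that lookup follows; the function name index and the table contents are hypothetical illustration data, not values from the Wikipedia page.

```python
def index(table, row, column):
    """Mimic the row/column form of INDEX: both positions are 1-based."""
    return table[row - 1][column - 1]

# Hypothetical imported table: a header row followed by two data rows.
table = [
    ["Year", "Population", "Growth"],
    ["2001", "1028", "1.9"],
    ["2011", "1210", "1.6"],
]

# Third row, second column, like INDEX(range, 3, 2) in the sheet.
cell = index(table, 3, 2)
print(cell)
```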

How do I return multiple columns of data using ImportXML in Google Spreadsheets?

I'm using ImportXML in a Google Spreadsheet to access the user_timeline method in the Twitter API. I'd like to extract the created_at and text fields from the response and create a two-column display of the results.
Currently I'm doing this by calling the API twice, with
=ImportXML("http://twitter.com/status/user_timeline/matthewsim.xml?count=200","/statuses/status/created_at")
in the cell at the top of one column, and
=ImportXML("http://twitter.com/status/user_timeline/matthewsim.xml?count=200","/statuses/status/text")
in another.
Is there a way for me to create this display with a single call?

ImportXML supports the XPath | (union) operator, so you can combine as many queries as you like.
=ImportXML("http://url"; "//@author | //@catalogid | //@publisherid")
However it does not expand the results into multiple columns. You get a single column of repeating triplets (or however many attributes you've selected) as shown below in column A.
The following is deprecated
2015.06.16: continue is not available in "the new Google Sheets" (see: The Google Documentation for continue).
However you don't need to use the automatically inserted CONTINUE() function to place your results.
=CONTINUE($A$2, (ROW()-ROW($A$2)+1)*$A$1-B$1, 1)
Placed in B2 that should cleanly fill down and right to give you sane column data.
ImportXML is in A2.
A3 and below are how the CONTINUE() functions are automatically filled in.
A1 is the number of attributes.
B1:D1 are the attribute index for their columns.
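The reshaping described in this answer, a single column of repeating triplets turned into one row per record, can be sketched in Python as plain chunking. The function name to_rows and the sample values are hypothetical; this illustrates the idea and is not a translation of the CONTINUE formula.

```python
def to_rows(values, width):
    """Split a flat list into consecutive rows of `width` items each."""
    return [values[i:i + width] for i in range(0, len(values), width)]

# Hypothetical output of a three-attribute query, one value per cell in column A.
column = ["Smith", "c1", "p1", "Jones", "c2", "p2"]

# width plays the role of A1 (the number of attributes).
rows = to_rows(column, 3)
print(rows)
```

This assumes every record supplies all of its attributes; if one is missing, the values after it shift and the rows no longer line up.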

Another way to convert the rows of =CONTINUE() into columns is to use transpose():
=transpose(importxml("http://url","//a | //b | //c"))

Just concatenate your queries with "|"
=ImportXML("http://twitter.com/status/user_timeline/matthewsim.xml?count=200","/statuses/status/created_at | /statuses/status/text")

I posed this question to the Google Support Forum and this was a solution that worked for me:
=ArrayFormula(QUERY(QUERY(IFERROR(IF({1,1,0},IF({1,0,0},INT((ROW(A:A)-1)/2),MOD(ROW(A:A)-1,2)),IMPORTXML("http://example.com","//td/a | //td/a/@href"))),"select min(Col3) where Col3 <> '' group by Col1 pivot Col2",0),"offset 1",0))
Replace the contents of IMPORTXML with your data and query and see if that works for you.
Apparently, this attempts to invoke the IMPORTXML function only once. It's a solution for now, at least.
Here's the full thread.

This is the best solution (NOT MINE) posted in the comments below. To be honest, I'm not sure how it works. Perhaps @Pandora, the original poster, could provide an explanation.
=ArrayFormula(iferror(hlookup(1,{1;ARRAY},(row(A:A)+1)*2-transpose(sort(row(A1:A2)+0,1,0)))))
This is a very ugly solution and doesn't even explain how it works. At least I couldn't get it to work due to multiple errors, such as too many parameters for IF (because an array is used). A shorter solution can be found here =ArrayFormula(iferror(hlookup(1,{1;ARRAY},(row(A:A)+1)*2-transpose(sort(row(A1:A2)+0,1,0))))) "ARRAY" can be replaced with the IMPORTXML function. This function can be used for as many XPaths as one wants. – Pandora Mar 7 '19 at 15:51
In particular, it would be good to know how to modify the formula to accommodate more columns.
