So I'm not a professional programmer, but I'm trying to scrape data off the Reuters homepage and import it into Google Spreadsheets.
I know that questions about scraping from Reuters have already been answered; however, those didn't help me.
I want data from this page: http://www.reuters.com/finance/stocks/financialHighlights?symbol=9983.T
Specifically, if you scroll down, there's a lot of data on the company's financials, packed into tables. I need specific values out of the tables.
So naturally my question to you is, how can I get specific values out of the tables? For instance, I want the first value out of the line that's labelled "Net Profit Margin (TTM)". The value should be 7.30.
So I got the XPath by using the Google Chrome developer tools: right-click on the element and select "copy xpath". Since I'm not a programmer, I don't know any other way of arriving at a specific element in the tables.
I tried the following function in Google Spreadsheets:
=IMPORTXML(URL as written above,"//*[@id='content']/div[2]/div/div[2]/div[1]/div[13]/div[2]/table/tbody/tr[14]/td[2]")
but it returns
"#N/A - Error, imported content is empty"
What can I do to get the value?
The IMPORTXML() function of Google Sheets is known to be incredibly buggy and it is not surprising if people dig up real errors in it. Still, we don't know exactly why your original XPath expression does not work.
I want the first value out of the line that's labelled "Net Profit Margin (TTM)". The value should be 7.30.
The path expression you got from the developer tools heavily relies on positioning, and not at all on actual values.
If you can rely on the text content of the first cell in this row, use
=IMPORTXML("http://www.reuters.com/finance/stocks/financialHighlights?symbol=9983.T","//tr[contains(td[1],'Net Profit Margin (TTM)')]/td[2]")
which means
Select all tr elements where the text content of the first td child element contains "Net Profit Margin (TTM)" and select the second td of that tr.
and the result will be
7.3
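To make the difference between positional and label-based selection concrete, here is a minimal Python sketch of the same idea outside of Sheets. The table markup is a made-up, simplified stand-in for the Reuters page (only the "Net Profit Margin (TTM)" label and the 7.30 value come from the question), and the function name is hypothetical. The standard library's ElementTree does not support contains() in its path expressions, so the "first cell contains the label" test is done in plain Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the financials table.
# Only the "Net Profit Margin (TTM)" row reflects the question; the rest is made up.
SAMPLE = """
<table>
  <tr><td>Gross Margin (TTM)</td><td>48.90</td><td>--</td></tr>
  <tr><td>Net Profit Margin (TTM)</td><td>7.30</td><td>--</td></tr>
</table>
"""

def first_value_for_label(table_xml, label):
    """Return the second cell of the first row whose first cell contains `label`."""
    root = ET.fromstring(table_xml)
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Same test as //tr[contains(td[1], label)]/td[2], written out by hand.
        if len(cells) >= 2 and label in (cells[0].text or ""):
            return (cells[1].text or "").strip()
    return None  # no row carries that label

value = first_value_for_label(SAMPLE, "Net Profit Margin (TTM)")
print(value)
```

Because the row is found by its label rather than by counting rows, the lookup keeps working if rows are added or reordered above it.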
I'm developing a log viewer web program with Vue.js.
I receive log data via Ajax and display it with SlickGrid.
What I need to do is highlight the keyword after searching.
I found some examples that highlight whole cells or rows, but couldn't find one that highlights a specific keyword within a cell.
E.g., when I search for the word 'cat', SlickGrid shows the cells that include 'cat',
and I need to highlight the word 'cat' in each cell.
Does anyone know how to do this, or have any examples?
Thank you.
You'll need to write a custom formatter. See here for an example page. Make sure you're using the 6pac repo - it's up to date, the MLeibman repo is unmaintained now.
To highlight a word, you'll need to return HTML from the formatter, with a special span wrapped around the word to highlight, e.g.:
we will build a <span class="hilight">wall</span>
Matching only full words, that is, making sure the match is not part of another word, is a tricky business if that's what you want. For example, a naive match gives:
did you buy the <span class="hilight">wall</span>paper yet?
That's a whole 'nother Google search in itself.
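As a rough illustration of the string handling involved, here is a language-neutral sketch written in Python; a real SlickGrid formatter would be a JavaScript function, and the function name and the whole_word flag here are my own, not part of SlickGrid. It escapes the cell text, wraps each match in the hilight span, and optionally uses regex word boundaries so that "wall" does not match inside "wallpaper".

```python
import html
import re

def highlight(text, keyword, whole_word=False):
    """Escape `text` for HTML and wrap each match of `keyword` in a span."""
    if not keyword:
        return html.escape(text)
    pattern = re.escape(keyword)
    if whole_word:
        # \b only matches between a word character and a non-word character.
        pattern = r"\b" + pattern + r"\b"
    parts = []
    last = 0
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Escape the untouched text and the match separately so the
        # inserted span tags themselves are not escaped.
        parts.append(html.escape(text[last:match.start()]))
        parts.append('<span class="hilight">' + html.escape(match.group(0)) + "</span>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)

print(highlight("we will build a wall", "wall"))
print(highlight("did you buy the wallpaper yet?", "wall", whole_word=True))
```

Escaping matters because the formatter's return value is treated as HTML; log lines containing < or & would otherwise break the cell.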
I'm very new to Ruby, Selenium, and UI automation, and have a quick question on how to get a count of visible items in a drop down menu.
Example: I have a drop down menu of 10 currency values (USD, EUR, JPN, etc.). They're coded as:
<div class="list_item">Currency Symbol</div>
The drop down menu is searchable, and if I type in "USD", then the only visible item would be that particular currency value. All other divs of that class get a style="display: none;" attribute. How do I verify that USD is indeed the only item in the menu? An example of such a situation can be seen here: https://www.oanda.com/currency/converter/
Conceptually, I was thinking of doing this:
Iterate through each div tag with class=list_item, and if I find one that is displayed, count it. Then verify that the count equals 1.
I tried using find_elements but can't seem to find attribute within each element in the array (is it because they aren't webdriver objects?).
If there's another better approach, it'd be really good to know and learn more as well. Appreciate any responses.
So, in this particular example, you could do
@driver.find_elements(:xpath, "//div[@class='currency_dropdown']/div[@id='scroll-innerBox-1']/div[not(@style='display: none;')]").size
This should return 1 when USD is typed into the search field.
Edit: I recommend getting ChroPath for your preferred browser (FF or Chrome). It helps significantly when testing xpath.
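One caveat with comparing the style attribute as an exact string is that "display:none" and "display: none;" are different strings. Below is a small sketch of the counting logic that tolerates such spacing differences. It is written in Python against a static, made-up fragment rather than a live browser, so the markup and the helper names are assumptions for illustration only. In a live Selenium session, asking each element whether it is displayed (displayed? in the Ruby bindings) is another option.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating the dropdown after "USD" has been typed.
SAMPLE = """
<div class="currency_dropdown">
  <div id="scroll-innerBox-1">
    <div class="list_item">USD</div>
    <div class="list_item" style="display: none;">EUR</div>
    <div class="list_item" style="display:none">JPY</div>
  </div>
</div>
"""

def is_hidden(style):
    """True if an inline style string hides the element, ignoring spacing and case."""
    compact = "".join(style.split()).lower()
    return "display:none" in compact

def visible_items(markup):
    """Return the text of every list_item div that is not hidden inline."""
    root = ET.fromstring(markup)
    return [
        div.text
        for div in root.iter("div")
        if div.get("class") == "list_item" and not is_hidden(div.get("style", ""))
    ]

items = visible_items(SAMPLE)
print(len(items), items)
```

This only looks at inline styles; elements hidden through a stylesheet rule would need the browser's own visibility check.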
I'm 'scraping' a few product descriptions from a website and bringing them into a google spreadsheet using importXML.
It has gone fairly smoothly, but there is one major snag that I would love to correct, and I need your help!
The website in question prohibits those posting products from including contact information (email addresses usually) in the product description. Sometimes people ignore the rule, and include the contact information anyways. When this occurs, the website automatically hides the contact information in the product description, replacing it with [obscured], as in "...please feel free to contact me at [obscured]" or something close to that. The [obscured] appears in a different colour, and is obviously treated differently by the website.
When these product descriptions are imported into my spreadsheet, the [obscured] causes the scraping to kind of be 'bumped'-- the description text stops prior to [obscured], the word [obscured] appears in an adjacent cell all by itself, and the description text that follows [obscured] then continues in a third cell.
This separation ruins the alignment and logic in my spreadsheet, as product descriptions having an [obscured] word become broken up and misaligned from those that do not.
I would love to be able to have my importXML or XPath accommodate for this, and essentially 'ignore' the [obscured]. I don't mind it being included in the scraped description, but I want to stop the breaking-up into 3 separate adjacent cells.
The [obscured] is part of a 'span' that appears to occasionally lie within the description class 'desc' I am calling.
Is there a way to do this? Instruct importXML to import that 'desc' class BUT 'ignore/omit/exception' of the span which might sometimes appear within?
I've included the source code (inspect element in Safari) below:
<div class="desc descFull collapsed">
<span class="obscureText">[obscured]</span>
As mentioned, this span only occurs in some of the product descriptions, not all of them.
Does anyone know what kind of language I would use in the importXML to call the 'desc' but ignore the 'span', or prevent the splitting into 3 cells when the [obscured] is encountered??
My current call is
=ImportXML(A1,"//div[@class='desc']")
which works fine, unless the [obscured] span is encountered.
Thank you for any help you can give!
Unless Google Drive is breaking the definition of Xpath, Xpaths can't be used to query CSS classes, like CSS selectors can.
The XPath //div[@class='desc'] will only match a div element with a class attribute that is literally "desc". It won't match "desc descFull collapsed" as the string is different.
As for excluding the text of the obscured node, that would require finding the text nodes and excluding one, which would return a nodeset, not a string, and you wouldn't be able to concatenate these back together using XPath 1.0. If Google Drive uses XPath 2.0 it might be possible, using the techniques in that linked question.
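To illustrate both points outside of Sheets, here is a minimal Python sketch on made-up markup modelled on the snippet in the question. It matches "desc" as one token of the class attribute rather than as the whole string, and it joins all text inside the div, including the span, into a single string. In XPath 1.0 the usual idiom for the token test is contains(concat(' ', normalize-space(@class), ' '), ' desc '); I have not verified how IMPORTXML lays out the result of that expression, so treat the sketch as a demonstration of the logic only.

```python
import xml.etree.ElementTree as ET

# Made-up markup modelled on the snippet in the question.
SAMPLE = """
<body>
  <div class="desc descFull collapsed">Please contact me at <span class="obscureText">[obscured]</span> for details.</div>
  <div class="descFull">Not a description block.</div>
</body>
"""

def descriptions(markup):
    """Return one string per div whose class list contains the token 'desc'."""
    root = ET.fromstring(markup)
    results = []
    for div in root.iter("div"):
        # Split the attribute into tokens so "descFull" alone does not count.
        if "desc" in div.get("class", "").split():
            # itertext() walks the div's own text, the span's text and the
            # text after the span, in document order.
            text = "".join(div.itertext())
            results.append(" ".join(text.split()))
    return results

found = descriptions(SAMPLE)
print(found)
```

Joining the pieces before they reach the spreadsheet is what avoids the three-cell split; the [obscured] marker stays in the text.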
I have tried different solutions to get the table node which I can identify as next sibling to the text node in the existing dom.
I used the following code, but the nextsibling is always null.
var element = browser.Element(Find.ByText(t => t.Contains("Individual Notices")));
if (element != null)
{
var table = element.NextSibling as Table;
}
I would appreciate it if anyone can guide me on how to iterate through the rows in the table next to the node "Individual Notices".
Thanks
You're having trouble as the Contains ends up not finding the element you want. Put Console.WriteLine("START" + element.Text+ "END"); in there right after the variable declaration/assignment, and I bet you'll see a whole lot of text output besides "Individual Notices".
If the DOM element you need has ONLY the text "Individual Notices", simply remove the lambda call and use Find.ByText("Individual Notices"), and then table will have your table.
If this is not an option as the text isn't a known value, you might be able to search on a specific element type (eg: Div) so that parent nodes aren't being returned as the lambda contains result.
Edit:
Sometimes searching for an individual element by text is problematic due to browser oddities. At times the text shown to the user doesn't exactly equal the text seen through the DOM, because whitespace has been added or removed. Basically, you might think you have "Individual Notices" but WatiN might see "Individual Notices " (note the space at the end). When the easy, obvious methods fail to find a particular element, my approach is to iterate through the elements in WatiN code, searching with whatever I think should find it, and then flash the elements found and/or write them to the console. If it's not found, widen the search. Repeat as needed.
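The two pitfalls described here, stray whitespace and matching the wrong (ancestor) element, can be shown with a small sketch. It is written in Python against made-up markup, not WatiN, and the helper names are hypothetical. It compares each element's own text after collapsing whitespace, so a trailing space does not matter and an ancestor that merely contains the label somewhere below it is not mistaken for the label itself, and it then returns the element that follows the match.

```python
import xml.etree.ElementTree as ET

# Made-up markup: note the trailing space inside the heading.
SAMPLE = """
<div>
  <h3>Individual Notices </h3>
  <table><tr><td>Row 1</td></tr><tr><td>Row 2</td></tr></table>
</div>
"""

def normalize(text):
    """Collapse runs of whitespace and trim both ends; None becomes ''."""
    return " ".join((text or "").split())

def sibling_after_label(markup, label):
    """Return the element right after the one whose own text equals `label`."""
    root = ET.fromstring(markup)
    for parent in root.iter():
        children = list(parent)
        # The last child has no following sibling, so it is skipped.
        for index, child in enumerate(children[:-1]):
            if normalize(child.text) == label:
                return children[index + 1]
    return None

table = sibling_after_label(SAMPLE, "Individual Notices")
rows = [normalize(cell.text) for cell in table.iter("td")]
print(table.tag, rows)
```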
Basically there is a table with names, edit buttons, and a checkbox at the end column that I want to check on with selenium. But I want to make sure I click on the one I created with selenium and that's where my problems begin.
Using the selenium IDE, the names xpath is
//tr[5]/td[2]
The checkbox is
//tr[5]/td[4]/input
So the text is in column 2 and the box is in column 4, and my record would be the 5th one. But I cannot for the life of me get ANY text search to work. Even something basic like
<tr>
<td>storeText</td>
<td>//tr[contains(text(), 'McGowan')]/td[2]</td>
<td>text</td>
</tr>
Even if the text matches identically, it gives me the locator not found error. No matter what combination I use to find the XPath by text, it has never worked, and I've spent quite a few hours reading every combination out there.
We are using the IDE and the RC in html, so no java or any other exporting.
Thanks! (My first post!)
//td[text()='McGowan']/../td/input[@type='checkbox']
Let me know if this works for you!
This might be odd, but the comment below the answer, regarding a random click that led to the answer ---> //tr[contains(., 'text')]/td[3]/a <--- was just randomly verified as exactly what I needed.
Good job guys.
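For anyone wondering why contains(text(), ...) on the tr failed while contains(., ...) worked: text() only selects the text nodes that are direct children of the tr, which are normally just whitespace between the cells, whereas . stands for the row's full string value, including the text inside every td. A minimal Python sketch of that difference follows, on a made-up row (only the name McGowan and the column positions come from the question).

```python
import xml.etree.ElementTree as ET

# Made-up row: the name sits in column 2 and the checkbox in column 4.
ROW = ET.fromstring("""
<tr>
  <td>5</td>
  <td>McGowan</td>
  <td>Edit</td>
  <td><input type="checkbox" /></td>
</tr>
""")

# Roughly what contains(text(), 'McGowan') tests: the row's own first
# text node, which is only the whitespace before the first cell.
own_first_text = ROW.text or ""

# Roughly what contains(., 'McGowan') tests: all text below the row.
string_value = "".join(ROW.itertext())

print("McGowan" in own_first_text)
print("McGowan" in string_value)

# Once the row is identified, the checkbox is reached relative to it.
checkbox = ROW.find("td/input[@type='checkbox']")
print(checkbox is not None)
```

The same reasoning explains why //td[text()='McGowan'] works: there the text node really is a direct child of the element being tested.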