How to export a scrubyt extractor? - ruby

I've written a scrubyt extractor based on the 'learning' technique - that is, specifying the current text on the page and getting it to work out the XPath expressions itself. However, I now want to export the extractor so that it can be used even when the page has changed.
The documentation for scrubyt seems to be all over the place now, but from what I can find I should be able to add the line extractor.export(__FILE__) and it should work. It doesn't: I just get an error saying that export was given the wrong number of arguments (it should take 0). I've tried it without any arguments and it still fails.
I would ask on the scrubyt forum, but it seems like no one's been there for ages!
Any ideas what to do here?

I just had the same problem and tried "puts google_data.export()" (trying to get some data from Google).
This gave me the following:
=== Extractor tree ===
export() is not working at the moment, due to the removal of
ParseTree, ruby2ruby and RubyInline.
For now, in case you are using examples, you can replace them by hand
based on the output below.
So if your pattern in the learning extractor looks like
book "Ruby Cookbook"
and you see the following below:
[book] /table[1]/tr/td[2]
then replace "Ruby Cookbook" with "/table[1]/tr/td[2]" (and all the
other XPaths) and you are ready!
[link] /body/div/div/div/div/div/ol/li/h3/a
which gave me the XPath I was looking for.
The scrubyt version I used is 0.4.06.
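For reference, here is roughly what my extractor looked like after doing the replacement by hand. Treat it as a sketch only: the search URL is a placeholder, and to_xml is just how the scrubyt examples I have seen print results, so both are assumptions rather than confirmed API.

require 'rubygems'
require 'scrubyt'

# The learning extractor used an example string, e.g. link 'Ruby'.
# Here the example is replaced by the XPath that export() printed:
google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/search?q=ruby'
  link "/body/div/div/div/div/div/ol/li/h3/a"
end

puts google_data.to_xml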

Related

Confused about XPath Syntax

Problem Summary:
Hi, I'm trying to learn to use the Scrapy framework for Python (available at https://scrapy.org). I'm following along with a tutorial I found here: https://www.scrapehero.com/scrape-alibaba-using-scrapy/, but I wanted to use a different site for practice rather than just copying their Alibaba example. My goal is to get game data from https://www.mlb.com/scores.
So I need to use XPath to tell the spider which parts of the HTML to scrape (I'm about halfway down that tutorial page, at the "Construct Xpath selectors for the product list" section). The problem is I'm having a hell of a time figuring out what the syntax should actually be to get the pieces I want. I've been going over XPath examples all morning trying to work out the right syntax, but I haven't been able to get it.
Background info:
So what I want is: from https://www.mlb.com/scores, an xpath() call which will return an array with all the games displayed.
Following along with the tutorial, my understanding of how to do this is that I'd inspect the elements on the web page, determine their class/id, and specify that in the xpath() call.
I've tried a lot of variations to get the data but all are returning empty arrays.
I don't really have any training in XPath so I'm not sure if my syntax is just off somewhere or what, but I'd really appreciate any help on getting this command to return the objects I'm looking for. Thanks for taking the time to read this.
Code:
Here are some of the attempts that didn't work:
response.xpath("//div[#class='g5-component--mlb-scores__game-wrapper']")
response.xpath("//div[#class='g5-component]")
response.xpath("//li[#class='mlb-scores__list-item mlb-scores__list-item--game']")
response.xpath("//li[#class='mlb-scores__list-item']")
response.xpath("//div[#!data-game-pk-id > 0]")'
response.xpath("//div[contains(#class, 'g5-component')]")
Expected Results and Actual Results
I want an XPath command that returns an array containing a selector object for each game on the mlb.com/scores page.
So far I've been able to get generic returns that aren't actually what I want (I can get a selector that returns the whole page by just leaving out the predicates, but whenever I try to be more specific I end up with an empty array).
So for all my attempts I either get the wrong objects or an empty array.
You always need to check the HTML source code (Ctrl+U in a browser) for the data you need. For the MLB page you'll find that the content you want to parse is loaded dynamically using JavaScript.
You can try to use Scrapy-Splash to get the target content from your start_urls, or you can find the direct HTTP request used to get the information you want (using the Network tab of Chrome Developer Tools) and parse the JSON:
https://statsapi.mlb.com/api/v1/schedule?sportId=1,51&date=2019-06-26&gameTypes=E,S,R,A,F,D,L,W&hydrate=team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)&useLatestGames=false&language=en&leagueId=103,104,420
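For example, a rough sketch of the second approach (Ruby used here for illustration; in Scrapy you would request the same URL and parse the body with the json module). The field names below, dates, games, teams and status, are assumptions based on what that endpoint returned at the time and may need adjusting:

require 'json'
require 'open-uri'

# A trimmed-down version of the schedule URL above.
url = 'https://statsapi.mlb.com/api/v1/schedule?sportId=1&date=2019-06-26'
data = JSON.parse(URI.open(url).read)

# Assumed structure: dates -> games -> teams / status.
data['dates'].each do |day|
  day['games'].each do |game|
    away = game['teams']['away']['team']['name']
    home = game['teams']['home']['team']['name']
    puts "#{away} at #{home} (#{game['status']['detailedState']})"
  end
end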

Import Internal Error during ImportXML with Google Spreadsheet

I am trying to import some data (market capitalization) from the Bloomberg website into my Google spreadsheet, but Google gives me an "Import Internal Error".
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP","//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
I really do not know what causes this problem, but I used to overcome it by playing with the XPath query. This time I couldn't find an XPath query that works.
Does anybody know the reason for this error, or how I can make it work?
I am not familiar with Google Spreadsheet, but I think there is simply a superfluous closing parenthesis in your code.
Replace
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP"),"//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
with
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP","//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
Also, are you sure it's ImportXml and not ImportXML?
If this does not solve your problem, you have to explain what exactly you are looking for in the HTML.
Edit
Applying the XPath expression you show to the HTML source, I get the following result:
<td xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" class="company_stat">641,807.15</td>
Is this what you would have expected? If yes, then XPath is not at fault and the problem lies somewhere else. If not, then please describe what you are looking for and I'll try to find a suitable XPath expression.
Second Edit
The following formula works fine for me:
=ImportXML("http://www.bloomberg.com/quote/7731:JP","//table[#class='key_stat_data']//tr[7]/td")
Resulting cell value:
641,807.15
The XPath expression now looks for a particular table (there are only 3 tables in the HTML, and all of them have unique class attribute values).
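If you want to sanity-check an expression like this outside the spreadsheet first, a small Nokogiri script does the job (a sketch, assuming the page can be fetched without JavaScript):

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.bloomberg.com/quote/7731:JP'))
# The same expression as in the working formula above.
puts doc.at_xpath("//table[@class='key_stat_data']//tr[7]/td").text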
Third Edit
The reason why your initial path expression does not work is that it contains tbody; see this excellent answer for more information. Credit for this goes to @JensErat.
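You can see the tbody problem in isolation with a few lines; Nokogiri's libxml2-based HTML parser, unlike a browser, does not invent a tbody that is missing from the source (made-up markup below, just for illustration):

require 'nokogiri'

# Raw HTML as served: no <tbody> in the source.
doc = Nokogiri::HTML('<table><tr><td>641,807.15</td></tr></table>')

puts doc.xpath('//table/tbody/tr').size  # => 0, the browser-derived path fails
puts doc.xpath('//table/tr').size        # => 1, the source-based path matches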

Google spreadsheet ImportXML Error: "the XPath query did not return any data"

I continue to get this error when I try to run this XPath query
//div[@iti='0']
on this link (a flight search from Google):
https://www.google.com/flights/#search;f=LGW;t=JFK;d=2014-05-22;r=2014-05-26
I get something like this:
=ImportXML("https://www.google.fr/flights/#search;f=jfk;t=lgw;d=2014-02-22;r=2014-02-26";"//div[#iti='0']")
I verified the XPath and it is correct (I get the answer I want using XPath Helper; the data I want relates to the first flight selected).
I guess it is a syntax problem, but I have tried more or less every combination of lower/upper case and punctuation (replacing ; , ' "), and I also tried referencing the URL and the XPath query stored in cells, but nothing works.
Any help will be appreciated.
As a matter of fact, maybe it is a bug in the new Google Sheets, or they have changed how the function works. I've activated mine, and when I try to use ImportXML it simply won't work. I have some old sheets here (on the old mechanism) and they still work normally; if I copy and paste the formula from the old one to the new one, it simply doesn't get any data.
Here's an example:
=ImportXML("http://www.nytimes.com/pages/todayspaper/index.html";"//div[#class='columnGroup first']//h3")
If I run this on the old mechanism it works fine, but if I run the same on the new mechanism, it first exchanges my ";" for a "," and then brings back "#N/A" with the warning "Error: Imported XML content cannot be parsed".
Edit (05/05/2015):
I am happy to say that I tested this function again today on the new spreadsheets and they've fixed it. I had been checking every two months, and they have finally solved this issue. The example I added above is now returning information.
I'm sorry, but you won't be able to easily parse Google result pages. The reason your function throws an error is that the content of the page you see in your browser is generated by JavaScript, and Google Spreadsheets doesn't execute JS.
Your ImportXML has the right syntax; it doesn't return anything because the node you're looking for isn't there (importXML Parse Error).
You will have to find another source if you want these results in your spreadsheet. For info, some libraries already parse the usual result page (http://www.seerinteractive.com/blog/google-scraper-in-google-docs-update for example, if it still works), but I doubt finding one for your special case will be easy.
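You can verify yourself that the node is absent from the static HTML; a quick sketch (assuming the page is still served the same way; note that the part after # is a URL fragment and is not even sent to the server, so the server returns the generic page):

require 'nokogiri'
require 'open-uri'

url = 'https://www.google.com/flights/#search;f=LGW;t=JFK;d=2014-05-22;r=2014-05-26'
doc = Nokogiri::HTML(URI.open(url))

# The div is built by JavaScript in the browser, so the static HTML
# that ImportXML (and Nokogiri) sees does not contain it.
puts doc.xpath("//div[@iti='0']").size  # => 0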
This gives the answer (importXML Parse Error), but it's not entirely obvious.
ImportXML doesn't load JavaScript. When you're building ImportXML queries on Google results, make sure you're testing against a version of the page that has JavaScript turned off. You can do this using the Chrome DevTools.
(But I agree that ImportXML is fickle, idiosyncratic, and generally rage-inducing).

How to ignore errors about HTML tags in JMeter when using XPath Extractor

I successfully added an XPath Extractor to my JMeter test. Now, I am receiving errors in the JMeter log complaining about 2 of the HTML tags on one of our web pages. These tags are created by us and are tags that are OK for us to use in our code, but JMeter does not like them. Is there somewhere I can enter these tags to let JMeter know to exclude them from checking?
Let's say the tags were:
xxxxx
and
xxxxx
Here is the JMeter log info:
2014/01/29 14:27:18 WARN - jmeter.util.XPathUtil: Tidy errors: line 25 column 4 - Error: is not recognized!
line 255 column 18 - Error: is not recognized!
InputStream: Doctype given is ""
InputStream: Document content looks like HTML 4.01 Transitional
33 warnings, 2 errors were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
If I disable the XPath Extractor in my test, I no longer get these errors. So, I know the XPath Extractor brought this on. But, I need the XPath Extractor in order to get some other information necessary to run the test. So, I cannot remove that. Any ideas how I can ignore these 2 new errors?
I have used an HTML Assertion before and set the Error Threshold to 2 for a different project, but that did not seem to help here.
*Edit: Also, I checked "Use Tidy" for the "XML Parsing Options" on the XPath Extractor.
Going by the WARN level of your error, it looks like you have checked Show warnings, Report errors, or both.
If your page isn't XHTML/XML compliant, you'll need to have Use Tidy checked.
If your server response is "too broken" from Tidy's point of view, you can always consider the following post processors to get the required data:
Regular Expressions Extractor - which doesn't care about wrong or invalid markup
CSS/JQuery Extractor - which uses different selectors and doesn't require the page to be XML/XHTML compliant.
In general I would suggest checking the page with an HTML Assertion, as the situation described looks like a real issue to me. A page which is severely broken may be rendered incorrectly, not picked up by search engines, etc.
Dmitri's answer is correct; I just wanted to add what I did to solve my issue, as it may help someone else.
I ended up using the Regular Expression Extractor successfully (FINALLY :-) ). I stumbled upon this page, which is EXTREMELY helpful:
http://jmeter.apache.org/usermanual/regular_expressions.html
(section 20.2)
So, in JMeter, I added a Regular Expression Extractor as a child of the HTTP Request I was trying to pull the information from. Then, my new best friends were the XPath Tester and RegExp Tester under the View Results Tree; they make it much easier to quickly test whether your expressions are right or wrong. I ended up with this in the Regular Expression field of the Regular Expression Extractor:
name="token" value="(.+?)"
What else I realized, for those reading this in the future, is that you can build up to an expression if none of the expressions you find online work for you. Of course, I found mine on that page, but I also worked out how I could have built up to it if I had not been so lucky. What do I mean?
Before I found that section of the JMeter site showing the examples, I tried this:
1. Ran my test
2. Looked at the View Results Tree I added to the HTTP Request I was trying to pull the value from
3. In the View Results Tree, I clicked on the drop-down to change it to RegExp Tester
4. Started typing things into the RegExp Tester to see what would and wouldn't match. I tried:
4a. id="token" and this retrieved information
4b. id="token"/#value and this did not retrieve anything
4c. name="token" and this retrieved information
4d. name="token" value="(.+?)" and this retrieved the data I was after
Hope this helps someone!

Wikiquote API?

I want to get a structured version of a Wikiquote page via JSON (basically, I need all the quotes).
Example: http://en.wikiquote.org/wiki/Fight_Club_(film)
I tried with: http://en.wikiquote.org/w/api.php?format=xml&action=parse&page=Fight_Club_(film)&prop=text
but I get the whole HTML source. I need each quote as an element of an array.
How could I achieve that with DBpedia?
For one thing, I am not sure whether you can query Wikiquote using DBpedia; and secondly, DBpedia only gives you infobox data in a structured way, it does not give you the article content in any structured form. Instead, with a little bit of trouble, you can use the MediaWiki API to get the data.
EDIT
The URI you are trying already gives you text, so this makes things easier, but not completely.
Try this piece of code in your console:
require 'json'
require 'open-uri'
require 'nokogiri'

content = JSON.parse(URI.open("http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_%28film%29&prop=text").read)
# The rendered HTML of the page lives under parse -> text -> '*'.
data = content['parse']['text']['*']
xpath_data = Nokogiri::HTML(data)
xpath_data.xpath("//ul/li").map { |data_node| data_node.text }
This is the closest I have come to an answer; of course it is not completely right, because you will get a lot of unnecessary data. But if you dig into Nokogiri and XPath and work out how to pinpoint the nodes you need, you can get a solution which gives you the correct quotes at least 90% of the time.
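One refinement that tends to help on quote pages, where attribution lines are nested lists inside a quote's li, is to take only top-level list items, so each quote comes back once (still an approximation, not a guarantee):

# Top-level list items only: nested attribution <li>s are not returned
# separately; their text stays attached to the parent quote.
quotes = xpath_data.xpath("//ul/li[not(ancestor::li)]").map { |node| node.text.strip }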
Just change the format parameter to JSON. Look up the MediaWiki API for more details.
http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_(film)&prop=text
