Get <div> as DOM object using selenium - selenium-rc

I am using Selenium RC and executing
selenium.getText("//div/div[2][contains(#id,'gp-PACKAGE NAME-')]/div["+i+"]/table/tbody/tr/td["+1+"]/div")
I have to execute this command for 20-30 rows, and it takes 20-30 minutes. I would like to get a DOM object for the table and parse it in Java rather than executing selenium.getText for each row.
My expectation is that I can get a DOM object covering all the rows from Selenium and run the XPath queries outside Selenium using some DOM parser.

Provided you already know which XPath parsing engine you want to use, you could always pass it the HTML of //div/div[2][contains(@id,'gp-PACKAGE NAME-')] and then do whatever you want with it.
To do that, all you need is to grab the HTML using JavaScript and getEval:
selenium.getEval("var document = selenium.browserbot.getcurrentwindow().document;" +
"var xpathResults = document.evaluate('//div/div[2][contains(#id,\"gp-PACKAGE NAME-\")]', document, null, XPathResult.ANY_TYPE, null);" +
"var element = xpathResults.iterateNext();" +
"var innerHtml = element.innerHTML;"
);
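Once that innerHTML string is back on the Java side, you can feed it to whatever parser you prefer and read all the rows in a single pass instead of calling getText 20-30 times. As a rough sketch (not part of the original answer), here is how it might look with the jsoup library, using a CSS selector in place of the row XPath; the selector mirrors the .../table/tbody/tr/td[1]/div path from the question and would need to be adjusted to the real markup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// innerHtml is the string returned by selenium.getEval above
Document fragment = Jsoup.parseBodyFragment(innerHtml);
// roughly equivalent to div[i]/table/tbody/tr/td[1]/div in the original XPath
for (Element cell : fragment.select("table tr > td:first-child > div")) {
    System.out.println(cell.text());
}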

Related

how to pull data using xpath from script?

Just started learning scraping, and for my test project I'm trying to retrieve the quantity of a certain product in the scrapy shell by using
response.xpath('//script[contains("quantity")]/text()').extract()
This doesn't work.
Help me understand the right convention for retrieving data from attributes like quantity, category_path, etc.
<script>
window.dataLayer = window.dataLayer || [];
dataLayer.push({"event":"datalayer-initialized","region":"India","account_type":"ecom","customer":{"id":""},"page_type":"Product","product":{"ffr":"csddfas","name":"tote bag by singh","materials":"100% polyester","specs":"Dimensions: 18.5\" x 6.75\"; 24L","color":null,"size":null,"upc":null,"new":false,"brand":null,"season":"HOLIDAY 2017","on_sale":false,"quantity":"158","original_price":100,"price":100,"category_path":
["Mens","Accessories","Backpacks \/ Bags"],"created":"2016-09-07","modified":"2018-02-12",
"colors":["BLACK"],"sizes":["S","M","L","XS","XL","XXL"]}});
</script>
Scrapy selectors have built-in support for regular expressions, which can help in this case:
response.xpath('//script[contains(text(),"quantity")]/text()').re(r'"quantity":"(\d+)"')
(You need to update your XPath to match the script's text content; the original expression isn't valid because contains() was given only one argument.)
Another way: you can also use a regular expression to collect the JSON content of the script, parse it into a JSON object, and work with that directly, which is easier.
You are using the css method and giving it an XPath expression.
Try
response.xpath('//script[contains(text(), "quantity")]').extract()
or
response.css('script::contains(quantity)').extract()
And you will need a Regex to extract that JSON string
re.findall(r'(?<=dataLayer\.push\().*(?=\)\;)', your_script_tag_data, re.DOTALL)
import json
import re

javascript = response.xpath('//script[contains(text(), "quantity")]/text()').extract_first()
json_string = re.search(r'dataLayer\.push\((.+?)\);', javascript, re.DOTALL).group(1)
data = json.loads(json_string)
print("Quantity: {0}".format(data["product"]["quantity"]))
In my experience, there is no way to get quantity, category_path, etc. with XPath alone, because they are embedded in JSON; XPath only queries the XML/HTML structure.
Assuming you already have the XML data, use:
data = yourXML.xpath('//script//text()')
Now data holds all the information as text. Then you need to pull out the string passed to dataLayer.push and parse it as JSON. With JSON it's easy to get your information.

Is it possible to write these Protractor expectations using no continuations?

Using protractor and jasmine(wd), we want to check that a table on the web page contains expected values. We fetch the table from the page using a CSS selector:
var table = element(by.css('table#forderungenTable')).all(by.tagName('tr'));
We then set our expectations:
table.then(function(forderungen){
    ...
    forderungen[2].all(by.tagName('td')).then(function(columns){
        expect(columns[1].getText()).toEqual('1');
        expect(columns[5].getText()).toEqual('CHF 277.00');
    });
});
Is it possible to change this code so that we don't have to pass functions to then, in the same way that using jasminewd means that we don't have to do this? See this page, which states:
Protractor uses jasminewd, which wraps around jasmine's expect so that you can write:
expect(el.getText()).toBe('Hello, World!')
Instead of:
el.getText().then(function(text) {
    expect(text).toBe('Hello, World!');
});
I know that I could write my own functions in a way similar to how jasminewd does it, but I want to know if there is a better way to construct such expectations using constructs already available in protractor or jasminewd.
You can actually call getText() on an ElementArrayFinder:
var texts = element(by.css('table#forderungenTable')).all(by.tagName('tr')).get(2).all(by.tagName('td')).getText();
expect(texts).toEqual(["text1", "text2", "text3"]);

Extracting value from complex hash in Ruby

I am using an API (zillow) which returns a complex hash. A sample result is
{"xmlns:xsi"=>"http://www.w3.org/2001/XMLSchema-instance",
"xsi:schemaLocation"=>"http://www.zillow.com/static/xsd/SearchResults.xsd http://www.zillowstatic.com/vstatic/5985ee4/static/xsd/SearchResults.xsd",
"xmlns:SearchResults"=>"http://www.zillow.com/static/xsd/SearchResults.xsd", "request"=>[{"address"=>["305 Vinton St"], "citystatezip"=>["Melrose, MA 02176"]}],
"message"=>[{"text"=>["Request successfully processed"], "code"=>["0"]}],
"response"=>[{"results"=>[{"result"=>[{"zpid"=>["56291382"], "links"=>[{"homedetails"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/"],
"graphsanddata"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/#charts-and-data"], "mapthishome"=>["http://www.zillow.com/homes/56291382_zpid/"],
"comparables"=>["http://www.zillow.com/homes/comps/56291382_zpid/"]}], "address"=>[{"street"=>["305 Vinton St"], "zipcode"=>["02176"], "city"=>["Melrose"], "state"=>["MA"], "latitude"=>["42.466805"],
"longitude"=>["-71.072515"]}], "zestimate"=>[{"amount"=>[{"currency"=>"USD", "content"=>"562170"}], "last-updated"=>["06/01/2014"], "oneWeekChange"=>[{"deprecated"=>"true"}], "valueChange"=>[{"duration"=>"30", "currency"=>"USD", "content"=>"42749"}], "valuationRange"=>[{"low"=>[{"currency"=>"USD",
"content"=>"534062"}], "high"=>[{"currency"=>"USD", "content"=>"590278"}]}], "percentile"=>["0"]}], "localRealEstate"=>[{"region"=>[{"id"=>"23017", "type"=>"city",
"name"=>"Melrose", "links"=>[{"overview"=>["http://www.zillow.com/local-info/MA-Melrose/r_23017/"], "forSaleByOwner"=>["http://www.zillow.com/melrose-ma/fsbo/"],
"forSale"=>["http://www.zillow.com/melrose-ma/"]}]}]}]}]}]}]}
I can extract a specific value using the following:
result = result.to_hash
p result["response"][0]["results"][0]["result"][0]["zestimate"][0]["amount"][0]["content"]
It seems odd to have to specify the index of each element in this fashion. Is there a simpler way to obtain a named value?
It looks like this should be parsed into XML. According to the Zillow API Docs, it returns XML by default. Apparently, "to_hash" was able to turn this into a hash (albeit, a very ugly one), but you are really trying to swim upstream by using it this way. I would recommend using it as intended (xml) at the start, and then maybe parsing it into an easier to use format (like a JSON/Hash structure) later.
Nokogiri is GREAT at parsing XML! You can use XPath syntax for grabbing elements from the DOM, or even CSS selectors.
For example, to get an array of the "content" in every result:
response = #get xml response from zillow
results = Nokogiri::XML(response).remove_namespaces!
#using css
content_array = results.css("result content")
#same thing using xpath:
content_array = results.xpath("//result//content")
If you just want the content from the first result, you can do this as a shortcut:
content = results.at_css("result content").content
Since it is indeed XML dumped into a JSON-like hash, you could use JSONPath to query it.

how to extract data using jtidy and xpath

I have to extract the company name and face value from
http://money.rediff.com/companies/20-microns-ltd/15110088
I noticed that this task could be accomplished using the XPath API.
Since this is an HTML page, I am using the JTidy parser.
This is the XPath for the face value which I have to extract.
/html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2]
This is my code
URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088");
URLConnection yc = oracle.openConnection();
InputStream is = yc.getInputStream();
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
String expression = "/html";
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(tidyDOM,XPathConstants.NODESET);
System.out.println(result.toString());
Please guide me further, because I cannot find the right solution for the above.
Try not to use "full" xpaths.
//div[@id='leftcontainer']//div[9]//table//tr[4]/td[2]
is better than
/html/body/.../.../.../.../.../...
Most HTML pages are not valid or even well-formed. So the DOM structure may change when processed by "real-world HTML parsers". For example, a <tbody> may be inserted under <table> if there isn't one. Things are worse when different HTML parsers generate different DOM trees so one XPath may be valid for one parser, but not the other. I would rather use "wildcards" like table//tr[4] instead of table/tbody/tr[4] or table/tr[4] so that I can forget about <tbody>. Such expressions are more robust when used against the messy real-world HTML pages.
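To tie that back to the question's JTidy code, a minimal sketch (reusing the imports and the tidyDOM variable from the question; the 'leftcontainer' id is taken from the answer above and should be verified against the actual page) could look like this:
// Evaluate the relative XPath against the Tidy DOM and read the cell's string value
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//div[@id='leftcontainer']//div[9]//table//tr[4]/td[2]";
String faceValue = xPath.evaluate(expression, tidyDOM);
System.out.println("Face value: " + faceValue);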
You can use Firepath, a plugin for Firebug (which is in turn a plugin for Firefox), to debug XPath expressions.
p.s. You can try my JHQL (http://github.com/wks/jhql) project for exactly this task. You will like it if you have more pages to extract data from.

Most efficient way to parse and reformat data with Nokogiri & Sinatra

I'm working on reformatting HTML output from a search query for an inventory manager used by a number of car dealers. There's no direct DB access and no information available from the service creators, so I decided to attempt to parse and reformat the data with Nokogiri and generate new pages of results based on the search query.
On first load of the page, I'm just using a default search to generate the first results.
For the search to work, I'm sending the query to a URL like this:
post '/search/?:search_query' do
  url = "http://domain.com/v/?DealerId=" + settings.dealer_id + "&maxrows=10&#{params[:search_query]}"
  doc = Nokogiri::HTML(open(url))
  doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
    session["msrp"] = msrp.inner_html
  end
  doc.css("td:nth-child(4) .ForeColor4").each do |price|
    session["price"] = price.inner_html
  end
  erb :index
end
I know there's got to be a smarter way to do this.
Edit:
An example URL to request data:
http://domain.com/?DealerId=1234&object=list&lang=en&MAKE=&MODEL=&maxrows=50&MinYear=&MaxYear=2011&Type=N&MinPrice=&MaxPrice=&STYLE=&ExtColor=&MaxMiles=&StockNo=
A description of the HTML generated:
Unfortunately, it's old code that's almost entirely table-based, has inline-styles and lacks classes or ids in most areas.
An example of a CSS selector:
td:nth-child(5) .ForeColor4
An XPath selector:
//td[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "ForeColor4", " " ))]
I've also looked at mechanize or hpricot as possibilities but I'm not aware of the best tools for the job as I haven't attempted screen-scraping before.
Summary: I want to pull the data from the HTML, temporarily store it in a variable / session / cookie (data changes several times per day), and then be able to reformat the output into my own HTML/CSS styling.
Personally, I'd decouple the scraping from the user action. Have an independent process scrape and fill your database. This will improve performance drastically, as the fetching, creating a DOM, parsing, then rendering output on every action is going to be slow.
doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
session["msrp"] = msrp.inner_html
end
doc.css("td:nth-child(4) .ForeColor4").each do |price|
session["price"] = price.inner_html
end
You might want to use Nokogiri's at_css() method instead of the regular css(). at_css() finds the first occurrence of your target and only returns that one node, similar to doing a .first against the nodeset that .css() returns.
That would simplify your lookups to this form:
session["msrp"] = doc.at_css("td:nth-child(5) .ForeColor4").inner_html
I'd probably add something like rescue 'msrp lookup failed' to the end of the lookups while testing, just in case you've got bad accessors. Or you could let the code fail when inner_html() gets mad trying to read from a nil. It's just a friendlier way to debug.
Otherwise your lookups seem to be decent.
