i have to extract d company name and face value from
http://money.rediff.com/companies/20-microns-ltd/15110088
i noticed that this task could be accomplished using xpath api.
since this is an html page, i am using jtidy parser.
this is the xpath for the face value which i have to extract.
/html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2]
This is my code
URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088");
URLConnection yc = oracle.openConnection();
InputStream is = yc.getInputStream();
is = oracle.openStream();
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
String expression = "/html";
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(tidyDOM,XPathConstants.NODESET);
System.out.println(result.toString());
please guide me further, because, i cannot find a right solution for the above
Try not to use "full" xpaths.
//div[#id='leftcontainer']//div[9]//table//tr[4]/td[2]
is better than
/html/body/.../.../.../.../.../...
Most HTML pages are not valid or even well-formed. So the DOM structure may change when processed by "real-world HTML parsers". For example, a <tbody> may be inserted under <table> if there isn't one. Things are worse when different HTML parsers generate different DOM trees so one XPath may be valid for one parser, but not the other. I would rather use "wildcards" like table//tr[4] instead of table/tbody/tr[4] or table/tr[4] so that I can forget about <tbody>. Such expressions are more robust when used against the messy real-world HTML pages.
You can use Firepath, a plugin for Firebug which is then a plugin for Firefox, to debug XPath expressions.
p.s. You can try my JHQL (http://github.com/wks/jhql) project for exactly this task. You will like it if you have more pages to extract data from.
Related
Just started learning scraping and for my test project i'm trying to retreive the quantity of a certain project in scrapy shell by using
response.xpath('//script[contains("quantity")]/text()').extract()
This doesn't work.
help me understand what should be the right covention to retreive data from such attributes like quantity, category_path & etc
<script>
window.dataLayer = window.dataLayer || [];
dataLayer.push({"event":"datalayer-initialized","region":"India","account_type":"ecom","customer":{"id":""},"page_type":"Product","product":{"ffr":"csddfas","name":"tote bag by singh","materials":"100% polyester","specs":"Dimensions: 18.5\" x 6.75\"; 24L","color":null,"size":null,"upc":null,"new":false,"brand":null,"season":"HOLIDAY 2017","on_sale":false,"quantity":"158","original_price":100,"price":100,"category_path":
["Mens","Accessories","Backpacks \/ Bags"],"created":"2016-09-07","modified":"2018-02-12",
"colors":["BLACK"],"sizes":["S","M","L","XS","XL","XXL"]}});
</script>
Scrapy selectors have built-in support for regular expressions and it can help you on this case:
response.xpath('//script[contains(text(),"quantity")]/text()').re(r'"quantity":"(\d+)"')
(you need to update your xpath to collect script content since your script not good enough)
Another Way: You also can use regular expressions to collect the json content on script, parse them to json obj and work with it as easier!
You are using css method and giving it an Xpath
Try
response.xpath('//script[contains(text(),'quantity')]').extract()
or
response.css('script::contains(quantity)').extract()
And you will need a Regex to extract that JSON string
re.findall(r'(?<=dataLayer\.push\().*(?=\)\;)', your_script_tag_data, re.DOTALL)
javascript = response.xpath('//script[contains("quantity")]/text()').extract_first()
json_string = re.search( r'dataLayer\.push\((.+?)\);', javascript, re.DOTALL ).group(1)
data = json.loads(json_string)
print( "Quantity: {0}".format(data["product"]["quantity"]) )
In my experience, have no way to get quantity, category_path & etc by only Xpath, because they're in Json format. Xpath can get information in XML data.
I assume that you've already have xml data, use
python
data = yourXML.xpath('//script//text()')
Now data is a string that contain all information. Then, you need to get string in dataLayer.push and convert it to Json format. With Json, it' easy to get your information.
I've been able to set, via code, the xpaths for the placeholders found in the document.
for (Object o : finderSdtRun.results) {
if (o instanceof SdtRun){
SdtPr sdtPr=((SdtRun) o).getSdtPr();
Tag t = sdtPr.getTag();
CTDataBinding ctDataBinding = Context.getWmlObjectFactory().createCTDataBinding();
//JAXBElement jaxbDB = Context.getWmlObjectFactory().createSdtPrDataBinding(ctDataBinding);
sdtPr.setDataBinding(ctDataBinding);
ctDataBinding.setXpath("tuttappostaferragost");
ctDataBinding.setStoreItemID("something");
ObjectFactory factory = new org.opendope.xpaths.ObjectFactory();
DataBinding db = factory.createXpathsXpathDataBinding();
db.setXpath("tuttappostaferragost");
db.setStoreItemID("something");
Xpaths.Xpath xp = factory.createXpathsXpath();
xp.setDataBinding(db);
xp.setId("something");
try {
wordMLPackage.getMainDocumentPart().getXPathsPart().getContents().getXpath().add(xp);
} catch (Docx4JException e) {
e.printStackTrace();
}
;
The problem is that, once set, they are not recognized by word, so I thought to add the created Xpaths to a new XpathPart, and then add it to the main Document part.
But I failed because the method:
wordMLPackage.getMainDocumentPart().getXPathsPart()
returns null. This sounded reasonable, since only content control was set, without any Xpath.
Then I set the Xpaths via content control toolkit and the same line of code like above, returned me null, which added a lot of confusion in my yet confused ideas.
Is there any way to tell the document that new Xpath have been added to the document?
I mean, if there is a way to add Xpath via code (the w:databinding w:storedItemId tags), why it is not possible to make it work?
In general I want to add Xpath and all information necessary, via code, avoiding the use of any toolkit.
Thank you :D
First, you have to decide whether you want plain old Word databinding, or the additional OpenDoPE capabilities (which use the content control tag to support repeats, conditionals etc).
You only need an XPaths part if you are using the OpenDoPE extensions.
I'll assume for now that you are just looking to do basic Word content control databinding.
To set that up programmatically, you need to add a custom xml part, and a rel from it to its itemProps.xml part, which contains something like:
<ds:datastoreItem ds:itemID="{5448916C-134B-45E6-B8FE-88CC1FFC17C3}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml">
<ds:schemaRefs/>
</ds:datastoreItem>
(to add a part B to part A, use partA.addTargetPart)
You can see it is this part with gives the custom xml part its itemID; this corresponds with the value you set in:
DataBinding db = factory.createXpathsXpathDataBinding();
db.setStoreItemID("something");
Then, set the XPath via the method you were using.
I am using an API (zillow) which returns a complex hash. A sample result is
{"xmlns:xsi"=>"http://www.w3.org/2001/XMLSchema-instance",
"xsi:schemaLocation"=>"http://www.zillow.com/static/xsd/SearchResults.xsd http://www.zillowstatic.com/vstatic/5985ee4/static/xsd/SearchResults.xsd",
"xmlns:SearchResults"=>"http://www.zillow.com/static/xsd/SearchResults.xsd", "request"=>[{"address"=>["305 Vinton St"], "citystatezip"=>["Melrose, MA 02176"]}],
"message"=>[{"text"=>["Request successfully processed"], "code"=>["0"]}],
"response"=>[{"results"=>[{"result"=>[{"zpid"=>["56291382"], "links"=>[{"homedetails"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/"],
"graphsanddata"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/#charts-and-data"], "mapthishome"=>["http://www.zillow.com/homes/56291382_zpid/"],
"comparables"=>["http://www.zillow.com/homes/comps/56291382_zpid/"]}], "address"=>[{"street"=>["305 Vinton St"], "zipcode"=>["02176"], "city"=>["Melrose"], "state"=>["MA"], "latitude"=>["42.466805"],
"longitude"=>["-71.072515"]}], "zestimate"=>[{"amount"=>[{"currency"=>"USD", "content"=>"562170"}], "last-updated"=>["06/01/2014"], "oneWeekChange"=>[{"deprecated"=>"true"}], "valueChange"=>[{"duration"=>"30", "currency"=>"USD", "content"=>"42749"}], "valuationRange"=>[{"low"=>[{"currency"=>"USD",
"content"=>"534062"}], "high"=>[{"currency"=>"USD", "content"=>"590278"}]}], "percentile"=>["0"]}], "localRealEstate"=>[{"region"=>[{"id"=>"23017", "type"=>"city",
"name"=>"Melrose", "links"=>[{"overview"=>["http://www.zillow.com/local-info/MA-Melrose/r_23017/"], "forSaleByOwner"=>["http://www.zillow.com/melrose-ma/fsbo/"],
"forSale"=>["http://www.zillow.com/melrose-ma/"]}]}]}]}]}]}]}
I can extract a specific value using the following:
result = result.to_hash
p result["response"][0]["results"][0]["result"][0]["zestimate"][0]["amount"][0]["content"]
It seems odd to have to specify the index of each element in this fashion. Is there a simpler way to obtain a named value?
It looks like this should be parsed into XML. According to the Zillow API Docs, it returns XML by default. Apparently, "to_hash" was able to turn this into a hash (albeit, a very ugly one), but you are really trying to swim upstream by using it this way. I would recommend using it as intended (xml) at the start, and then maybe parsing it into an easier to use format (like a JSON/Hash structure) later.
Nokogiri is GREAT at parsing XML! You can use the xpath syntax for grabbing elements from the dom, or even css selectors.
For example, to get an array of the "content" in every result:
response = #get xml response from zillow
results = Nokogiri::XML(response).remove_namespaces!
#using css
content_array = results.css("result content")
#same thing using xpath:
content_array = results.xpath("//result//content")
If you just want the content from the first result, you can do this as a shortcut:
content = results.at_css("result content").content
Since it is indeed XML dumped into a JSON, you could use JSONPath to query the JSON
I am using selenium rc and exetuing
selenium.getText("//div/div[2][contains(#id,'gp-PACKAGE NAME-')]/div["+i+"]/table/tbody/tr/td["+1+"]/div")
I have to execute this command for 20-30 rows for which it takes 20-30 mins. I would like to get dom object for the table and parse it using java rather than executing selenium.getText for each row.
My expectation is that, I get dom object for all the rows from selenium and perform xpath query outside selenium using some dom parser.
Provided you already know what xpath parsing engine you want to use, you could always pass it the HTML of //div/div[2][contains(#id,'gp-PACKAGE NAME-')] and then do what ever you wanted to do to it.
To do that all you need is to grab the html using Javascript and getEval:
selenium.getEval("var document = selenium.browserbot.getcurrentwindow().document;" +
"var xpathResults = document.evaluate('//div/div[2][contains(#id,\"gp-PACKAGE NAME-\")]', document, null, XPathResult.ANY_TYPE, null);" +
"var element = xpathResults.iterateNext();" +
"var innerHtml = element.innerHTML;"
);
I have an element whose html is like :
<div class="gwt-Label textNoStyle textNoWrap titlePanelGrayDiagonal-Text">Announcements</div>
I want to check the presence of this element. So I am doing something like :
WebDriver driver = new FirefoxDriver(profile);
driver.findElement(By.cssSelector(".titlePanelGrayDiagonal-Text"));
But its not able to evaluate the CSSSelector.
Even I tried like :
By.cssSelector("gwt-Label.textNoStyle.textNoWrap.titlePanelGrayDiagonal-Text")
tried with this as well :
By.cssSelector("div.textNoWrap.titlePanelGrayDiagonal-Text")
Note : titlePanelGrayDiagonal-Text class is used by only this element in the whole page. So its unique.
Contains pseudo selector I can not use.
I want to identify only with css class.
Versions: Selenium 2.9 WebDriver
Firefox 5.0
When using Webdriver you want to use W3C standard css selectors not sizzle selectors like you may be used to using in jquery. In your example you would want to use:
driver.findElement(By.cssSelector("div[class='titlePanelGrayDiagonal-Text']"));
From reading over your post what you should do since that class is unique is just do a FindElement(By.ClassName("titlePanelGrayDiagonal-Text"));
Also the CssSelector doesn't handle the contains keyword it was something that the w3 talked about but never added.
I haven't used css selectors, but this is the xpath selector I would use:
"xpath=//div[#class='gwt-Label textNoStyle textNoWrap titlePanelGrayDiagonal-Text']"
The css selector should then probably be something like
"css=div[class='gwt-Label textNoStyle textNoWrap titlePanelGrayDiagonal-Text']"
Source: http://release.seleniumhq.org/selenium-remote-control/0.9.2/doc/dotnet/Selenium.html
Did you ever tried following code,
By.cssSelector("div#gwt-Label.textNoStyle.textNoWrap.titlePanelGrayDiagonal-Text");
I believe using a wildcard in CSS would be more helpful. Something as follows
driver.findElement(By.cssSelector("div[class$='titlePanelGrayDiagonal-Text']");
This will look into the class attribute and see what that attribute is ending with. Since your class attribute is ending with "titlePanelGrayDiagonal-Text" string, the added '$' in the css statement will find the element and then you can perform whatever action you're trying to perform.