I'm trying to scrape data from a website which does not seem to have too many classes in its tags. However, I'm still wondering whether it is possible to scrape just today's titles using XPath.
So that it only retrieves the titles that are from 09/4 - 2015?
url: http://www.hltv.org/?pageid=96
Since the date 10/4 - 2015 is unique, you might locate the b tag node using XPath's contains(); see the HTML here:
//b[contains(., '10/4 - 2015')]
Then, based on this node, you go to its parent and its siblings, something like this (not tested):
//b[contains(., '10/4 - 2015')]/parent::div/following-sibling::div
Update
Since the current date's items go at the bottom, according to the HTML all the following-sibling nodes here pertain to this date (google "xpath sibling after"):
//b[contains(., '10/4 - 2015')]/parent::div/following-sibling::div[@class='newsItem']
See the test here. If you want to fetch the divs in between, then explore this.
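For illustration only (this was not part of the original answer): a minimal Python sketch, assuming the requests and lxml libraries are available and that the page's 2015 markup is unchanged, that applies the XPath above:

import requests
from lxml import html

page = requests.get("http://www.hltv.org/?pageid=96")
tree = html.fromstring(page.content)

# every news div that follows the bold date header for 10/4 - 2015
items = tree.xpath("//b[contains(., '10/4 - 2015')]/parent::div/following-sibling::div[@class='newsItem']")
for item in items:
    print(item.text_content().strip())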
Related
I'm generating an XML report, using the JDF standard for PDFs going into a printing workflow.
There are 3 "DPart" sections, and I can use an xPath query to recognize them, but I want to grab the "Separation" attribute of each "cip4:Part". I can also get a query to find that, but it does not distinguish between the multiple "DPart"s.
<DPart End="0" ID="0003" ParentRef="0002" Start="0">
<DPM>
<cip4:Root>
<cip4:Intent cip4:ProductType="ProductPart"/>
<cip4:Production>
<cip4:Resource>
<cip4:Part Separation="K1"/>
<cip4:Color cip4:ActualColorName="Black" cip4:ColorType="Normal">
</cip4:Resource>
<cip4:Resource>
<cip4:Part Separation="S1"/>**
<cip4:Color cip4:ActualColorName="Dieline" cip4:ColorType="Normal">
</cip4:Resource>
<cip4:Resource>
<cip4:ColorantControl ColorantOrder="K1 S1" ColorantParams="K1 S1"/>
</cip4:Resource>
<cip4:Resource>
<eg:InkCoverage>
<eg:InkCov eg:Mm2="0.000000" eg:Pct="0.000000" eg:Separation="K1"/>
<eg:InkCov eg:Mm2="182.337538" eg:Pct="0.721209" eg:Separation="S1"/>
</eg:InkCoverage>
</cip4:Resource>
</cip4:Production>
</cip4:Root>
</DPM>
</DPart>
I want to do something like:
/DPM[2]/*[name()='cip4:Part'], but it's not working.
I'm in a low-code pre-press environment (Esko Automation Engine), but the system gives me tools to parse an XPath and throw some JavaScript at it.
There are at least three reasons your XPath selects nothing:
DPM is not an immediate child of the root node
There is only one DPM, so DPM[2] won't select anything
There is no child of a DPM whose name is cip4:Part.
You also say in the narrative that there are three DParts, which implies that DPart is not actually the outermost element, as it appears to be in your sample. This makes it difficult to provide the correct XPath. However, you might be able to make a start with
(//DPM)[2]//*[name()='cip4:Part']
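Purely as an illustration (not part of the original answer, and obviously not the Esko Automation Engine environment itself): a small Python/lxml sketch, with a hypothetical file name report.jdf, showing how an expression of that style pulls the Separation attributes grouped by DPart:

from lxml import etree

tree = etree.parse("report.jdf")   # hypothetical file name

# name() comparisons avoid having to register the cip4 namespace prefix
for i, dpart in enumerate(tree.xpath("//*[name()='DPart']"), start=1):
    separations = dpart.xpath(".//*[name()='cip4:Part']/@Separation")
    print(i, separations)   # e.g. 1 ['K1', 'S1'] for the sample above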
I have a report card written in Word that uses an XML file for its input. In the XML file, if a student remains in the same section all three trimesters there will be one node for that class; if they change sections at the trimester they'll have one node for each section. The nodes look something like this (greatly simplified):
<ReportCardSectionFB Abs1="2" Abs2="11" CourseID="ELMATH1" CourseTitle="Math" PeriodStart="3" TeacherName="Jones, Jennifer" TermCode="Year" SectionID="ELMATH1-4" />
<ReportCardSectionFB Abs1="1.50" Abs2="6" CourseID="ELMATH1" CourseTitle="Math" PeriodStart="3" TeacherName="Smith, Tina" TermCode="Year" SectionID="ELMATH1-3" />
There is no indicator within the XML as to which trimester the node belongs to.
In the Word document, we're pulling the absence data with the following mail merge command:
{MERGEFIELD "ReportCardSectionFB[@PeriodStart='3']/@Abs1" \# 0.# \* MERGEFORMAT }
That's not working in this situation: it only gets the absence data from the first node it comes across, i.e. 2.0. Is there a way to get the sum of @Abs1 for all period 3 classes, i.e. 3.5? If not, is there a way to get only the last @Abs1 for period 3, i.e. 1.5?
I recommend this third-party product, which can use XML as input and is capable of merging it with an MS Word template. It is also much more powerful than Word's built-in mail merge. You can see some examples here.
You could also try summing the absences in Synergy - there's a new checkbox under AttDef1, 2, etc. that adds up all the absences for the date range - "Include all day data for the entire date range regardless of section enrollment or section timeframe". That way the absences should be the same for each section, if that works for your district.
You can also try the SET function in Word to nest the MERGEFIELDs as bookmarks, and then use Word's operator functions to add the bookmarks together.
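Another option, not mentioned in the answers above: pre-compute the total outside Word before running the merge. A minimal Python/lxml sketch, under the assumption that the nodes shown earlier live in a file named report_card.xml (a hypothetical name):

from lxml import etree

tree = etree.parse("report_card.xml")   # hypothetical file name

# XPath 1.0 sum() adds @Abs1 across every period-3 section node
total = tree.xpath("sum(//ReportCardSectionFB[@PeriodStart='3']/@Abs1)")
print(total)   # 3.5 for the two sample nodes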
I want to extract only the body node/tag from an XML file using doc.xpath in Ruby
The node to extract from the XML file:
<wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals, such as YouTube, to get around advertising restrictions and market their products to young people.</p>
</wcm:element>
I have tried the following:
page_content = doc.xpath("/wcm:root/wcm:element").inner_text
But this extracts the text of every node, i.e. everything.
Then I tried this:
page_content = doc.xpath("/wcm:root/wcm:element/Body")
But that does not work.
Does anyone have suggestions on how to extract exactly the Body section of an XML file using doc.xpath in Ruby?
I'm not 100% certain I've understood what you mean but… let's not let that stop us. You want to get the content of a particular node from the input. Your first XPath statement:
/wcm:root/wcm:element
is extracting every element named wcm:element that is a child of the wcm:root element, which is the root element.
Your second:
/wcm:root/wcm:element/Body
is similar but looks for elements with name Body which are children of the wcm:element.
What you need to do is get the value of the wcm:element element whose name attribute is set to the value Body. You access attributes in XPath by prefixing them with an @ sign, and to express a where condition you use [...] - a predicate. Your XPath statement needs to be:
/wcm:root/wcm:element[@name = 'Body']
I'm assuming that your XPath execution environment is fine with the namespace prefixes (wcm), because you say that your first query returned content.
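For comparison only - the question is about Ruby's Nokogiri, so this is merely an illustrative sketch of the same predicate in Python/lxml, using name() so that no namespace URI has to be registered, and assuming a hypothetical file name page.xml:

from lxml import etree

doc = etree.parse("page.xml")   # hypothetical file name

# match the prefixed element by its qualified name, then filter on @name
body = doc.xpath("//*[name()='wcm:element'][@name='Body']")
print(body[0].xpath("string()") if body else None)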
I am trying to automate some tests using Selenium WebDriver. I am dealing with a third-party login provider (OAuth) who is using duplicate IDs in their HTML. As a result I cannot "find" the input fields correctly. When I just select on an id, I get the wrong one.
This question has already been answered for jQuery, but I would like an answer (I am presuming using XPath) that will work in Selenium WebDriver.
On other questions about this issue, answers typically say "you should not have duplicate IDs in HTML". Preaching to the choir there. I am not in control of the webpage in question; if I were, I would use class and id properly and just fix the problem that way.
Since I cannot do that, what options does XPath etc. give me?
You can do it with driver.find_element_by_id. For example, if your duplicate "duplicate_id" is inside "div_ID", which is unique:
driver.find_element_by_id("div_ID").find_element_by_id("duplicate_id")
For another duplicate id under a different div:
driver.find_element_by_id("div_ID2").find_element_by_id("duplicate_id")
This XPath expression:
//div[@id='something']
selects all div elements in the XML document, the string value of whose id attribute is the string "something".
This XPath expression:
count(//div[@id='something'])
produces the number of div elements selected by the first XPath expression.
And this XPath expression:
(//div[@id='something'])[3]
selects the third (in document order) div element that is selected by the first XPath expression above.
Generally:
(//div[@id='something'])[$k]
selects the $k-th such div element ($k must be substituted with a positive integer).
Equipped with this knowledge, one can get any specific div whose id attribute has string value "something".
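As an illustrative sketch only (the question does not name a binding, and the id value, URL, and index below are placeholders): using such a positional XPath from Selenium WebDriver's Python bindings:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")   # placeholder URL

# how many elements share the duplicated id
count = len(driver.find_elements(By.XPATH, "//div[@id='something']"))
print(count)

# grab the second one in document order
second = driver.find_element(By.XPATH, "(//div[@id='something'])[2]")
print(second.text)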
Which language are you working in? Duplicate IDs shouldn't be a problem, as you can grab virtually any attribute, not just the id, using XPath. The syntax will differ slightly in other languages (let me know if you want something other than Ruby), but this is how you do it:
driver.find_element(:xpath, "//input[#id='loginid']"
The way you go about constructing the XPath locator is the following:
From the HTML code you can pick any attribute:
<input id="gbqfq" class="gbqfif" type="text" value="" autocomplete="off" name="q">
Let's say, for example, that you want to construct your XPath for the HTML code above (Google's search box) using the name attribute. Your XPath will be:
driver.find_element(:xpath, "//input[@name='q']")
In other words, when the IDs are the same, just grab another available attribute!
Improvement:
To avoid fragile XPath locators that depend on order in the XML document (which can change easily), you can use something even more robust: two XPath locators instead of one. This can also be useful when dealing with HTML tags that are really similar. You can locate an element by two of its attributes, like this:
driver.find_element(:id, 'amount') and driver.find_element(xpath: "//input[@maxlength='50']")
or, in a pure XPath one-liner if you prefer:
//input[#id="amount" and #maxlength='50']
Alternatively (and provided your XPath will only return one unique element), you can move one more step up in abstraction level, completely omitting the attribute values:
//input[@id and @maxlength]
It's not listed at http://selenium-python.readthedocs.io/locating-elements.html, but I'm able to access a method find_elements_by_id.
This returns a list of all elements with the duplicate ID.
links = browser.find_elements_by_id("link")
for link in links:
    print(link.get_attribute("href"))
You should use driver.findElement(By.xpath(...)), but while locating the element with Firebug you should select the absolute path for the particular element instead of the relative path. This is how you will get the element even with duplicate IDs.
I am coding in Groovy; however, I don't believe these are language-specific questions.
I actually have two questions
First Question
I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.
The page I'm testing it on is:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4
My code:
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage(url)
//coming up as null
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")
println title
This simply prints out: []
Is this because the page uses onclick()? If so, how would I get around that? Enabling JavaScript creates a mess in my cmd prompt.
Second Question
I also want to get the image, but am having trouble because when I attempt to get the XPath (via Firebug) it shows up as: //*[@id="gmi-ResViewSizer_img"]
How do I handle that?
First Answer:
/html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a
Your XPath was off by one in the predicate filter for the 4th div of the body; it should be the 3rd div. It appears the HTML for the site can/does change from when you originally snagged the XPath using Firebug. You may need to adjust your XPath to accommodate potential changes and be less sensitive to some differences in document structure.
Maybe something like this:
/html/body//div/h1/a
Second Answer: The XPath that you listed will work. It may look odd/short (and may not be the most efficient), but // starts at the root node and looks through every node in the tree, * matches any element (including the img), and the [] predicate filter restricts the match to those that have an id attribute whose value equals "gmi-ResViewSizer_img".
There are many other XPaths that could work as well; it will also depend on how often the HTML structure changes. This is one that also works on the referenced page to select that img:
/html/body/div/div/div/div/img[1]
I had the same problem; I solved it when I noticed the iframe tags on the page. Try calling
((HtmlPage) current_page.getFrames()[n].getEnclosedPage()).getByXPath(...
where n is the position of the frame in the iframe collection. It works for me!
Thanks a lot.