I'm using Html Agility Pack to perform a basic web scraping of Google search results. As a newbie to XPath, I make sure my path expression is correct(with the help of FirePath). However, the returned HtmlNodeCollection is always NULL.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = web.Load("http://www.google.com/search?num=10&q=Hello+World");
// get search result URLs
var items = htmlDoc.DocumentNode.SelectNodes("//div[#id='ires']/ol[#id='rso']/li/div[#class='vsc']/h3/a/#href");
foreach (HtmlNode node in items)
{
Console.WriteLine(node.Attributes);
}
Am I missing something? Can anyone please enlighten me?
Thanks in advance,
HAP can only process the raw HTML that is returned from the url, it will not run any additional javascript that is on the page or whatnot. You need to adjust your query accordingly.
In the raw HTML, the ires div exists but the rso doesn't get inserted until the javascript is run hence you get no results. There are other transformations done here which you'll have to adjust for as well.
Here's a fragment of the HTML:
<div id="ires">
<ol>
<li class="g">
<h3 class="r">
...
A more appropriate xpath to use for this would be:
var xpath = "//li[contains(concat(' ',#class,' '),' g ')]" +
"/h3[contains(concat(' ',#class,' '),' r ')]" +
"/a/#href";
It'd be easier to find all li with the g class as those correspond to all the results. You'll want to filter all h3 with the r class otherwise you'd include other results (such as image results).
Related
I have the current HTML code:
<div class="group">
<ul class="smallList">
<li><strong>Date</strong>
13.06.2019
</li>
<li>...</li>
<li>...</li>
</ul>
</div>
and here is my "wrong" XPath:
//div[#class='group']/ul/li[1]
and I would like to extract the date with XPath without the text in the strong tag, but I'm not sure how NOT is used in XPath or could it even be used in here?
Keep in mind that the date is dynamic.
Use substring-after() to get the date value.
substring-after(//div[#class='group']/ul/li[1],'Date')
Output:
The easiest way to get the date is by using the XPath-1.0 expression
//div[#class='group']/ul/li[1]/text()[normalize-space(.)][1]
The result does include the spaces.
If you want to get rid of them, too, use the following expression:
normalize-space(//div[#class='group']/ul/li[1]/text()[normalize-space(.)][1])
Unfortunately this only works for one result in XPath-1.0.
If you'd have XPath-2.0 available, you could append the normalize-space() to the end of the expression which also enables the processing of multiple results:
//div[#class='group']/ul/li[1]/text()[normalize-space(.)][1]/normalize-space()
Here is the python method that will read the data directly from the parent in your case the data is associated with ul/li.
Python:
def get_text_exclude_children(element):
return driver.execute_script(
"""
var parent = arguments[0];
var child = parent.firstChild;
var textValue = "";
while(child) {
if (child.nodeType === Node.TEXT_NODE)
textValue += child.textContent;
child = child.nextSibling;
}
return textValue;""",
element).strip()
This is how to call this in your case.
ulEle = driver.find_element_by_xpath("//div[#class='group']/ul/li[1]")
datePart = get_text_exclude_children(ulEle)
print(datePart)
Please feel free to convert to the language that you are using, if it's not python.
CSS/xpath selector to get the link text excluding the text in .muted.
I have html like this:
<a href="link">
Text
<span class="muted"> –text</span>
</a>
When I do getText(), I get the complete text like, Text-text. Is it possible to exclude the muted subclass text ?
Tried cssSelector = "a:not([span='muted'])" doesn't work.
xpath = "//a/node()[not(name()='span')][1]"
ERROR: The result of the xpath expression "//a/node()[not(name()='span')][1]" is: [objectText]. It should be an element.
AFAIK this cannot be done with CSS selector only. You can try to use JavaScriptExecutor to get required text.
As you didn't mention programming language you use I show you example on Python:
link = driver.find_element_by_css_selector('a[href="link"]')
driver.execute_script('return arguments[0].childNodes[0].nodeValue', link)
This will return just "Text" without " -text"
You cannot do this using Selenium WebDriver's API. You have to handle it in your code as follows:
// Get the entire link text
String linkText = driver.findElement(By.xpath("//a[#href='link']")).getText();
// Get the span text only
String spanText = driver.findElement(By.xpath("//a[#href='link']/span[#class='muted']")).getText();
// Replace the span text from link text and trim any whitespace
linkText.replace(spanText, "").trim();
Hello I want to ask a question
I scrape a website with xpath ,and the result is like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
now I need to use xpath to analyze this results again.
I want to save the first to address,the second to telephone,and the last one to map
But I can't get it.
Please guide me.Thank you!
Here is code,it's wrong. it will catch another thing.
store = sel.xpath("")
for s in store:
address = s.xpath("//tr/td[1]/text()").extract()
tel = s.xpath("//tr/td[2]/text()").extract()
map = s.xpath("//tr/td[3]/text()").extract()
As you can see in scrappy documentation to work with relative XPaths you have to use .// notation to extract the elements relative to the previous XPath, if not you're getting again all elements from the whole document. You can see this sample in the scrappy documentation that I referenced above:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
So I think in your case you code must be something like:
for s in store:
address = s.xpath(".//tr/td[1]/text()").extract()
tel = s.xpath(".//tr/td[2]/text()").extract()
map = s.xpath(".//tr/td[3]/text()").extract()
Hope this helps,
I'm using the HtmlAgilityPack to parse href tags in an html file. The href tags look like this:
<h3 class="product-name">Super Cool Product</h3>
So far I can successfully pull out the url and the title together, and display it in a list. This is the main code I'm using to parse the html:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//h3[#class='product-name']//a")
where
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
};
The code above gives me a result that looks like this:
Super Cool Product - http://www.somewebsite.com/blahblah
I'm trying to figure out how to pull out the name and url separately, and put them into separate strings, instead of pulling them out together and putting them into one string. I'm guessing there is some sort of Xpath notation I can use to do this. I would be extremely thankful if someone could lead me in the right direction
Thanks,
Miles
<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
inner hmtl 1
</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>
I would like to parse the inner html between the tags in such a way that I can
* associate the inner html 1 with header 1 and date 1
* associate the inner html 2 with header 2 and date 2
In other words, at the time I parse the inner html 1 I would like to know that the html nodes containing "Date 1" and "Header 1" have been parsed (but the nodes containing "Date 2" and "Header 2" have not been parsed)
If I were doing this via regular text parsing, I would read one line at a time and record the last "Date" and "Header" than I had parsed. Then when it came time to parse the inner html 1, I could refer to the last parsed "Date" and "Header" object to associate them together.
Using the Html Agility Pack, you can leverage XPATH power - and forget about that verbose xlinq crap :-). The XPATH position() function is context sensitive. Here is a sample code:
HtmlDocument doc = new HtmlDocument();
doc.Load("your html file");
// select all DIV without a CLASS attribute defined
foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div[not(#class)]"))
{
Console.WriteLine("div=" + div.InnerText.Trim());
Console.WriteLine(" header=" + div.SelectSingleNode("preceding-sibling::div[position()=1]/b").InnerText);
Console.WriteLine(" date=" + div.SelectSingleNode("preceding-sibling::div[position()=2]/b").InnerText);
}
That will prrint this with your sample:
div=inner hmtl 1
header=Header 1
date=Date 1
div=inner html 2
header=Header 2
date=Date 2
Well, you can do this in several ways...
For example, if the HTML you want to parse is the one you wrote in your question, an easy way could be:
Store all dates in a HtmlNodeCollection
Store all headers in a HtmlNodeCollection
Store all inner texts in another HtmlNodeCollection
If everything is okay and the HTML has that layout, you will have the same number of elements in both 3 collections.
Then you can easily do:
for (int i = 0; i < innerTexts.Count; i++) {
//Get Date, Headers and Inner Texts at position i
}
The following should work:
var document = new HtmlWeb().Load("http://www.url.com"); //Or load it from a Stream, local file, etc.
var dateNodes = document.DocumentNode.SelectNodes("//div[#class='mvb']/b");
var headerNodes = document.DocumentNode.SelectNodes("//div[#class='mxb']/b");
var innerTextNodes = (from node in document.DocumentNode.SelectNodes("//div")
let previous = node.PreviousSibling
where previous.Name == "div" && previous.GetAttributeValue("class", "") == "mxb"
select node).ToList();
//Check here if the number of elements of the 3 collections are the same
for (int i = 0; i < dateNodes.Count; i++) {
var date = dateNodes[i].InnerText;
var header = headerNodes[i].InnerText;
var innerText = innerTextNodes[i].InnerText;
//Now you have the set you want: You have the Date, Header and Inner Text
}
This is a way of doing this.
Of course, you should check for exceptions (that .SelectNodes(..) method are not returning null), check for errors in the LINQ expression when storing innerTextNodes, and refactor the for (...), maybe into a method that receives a HtmlNode and returns the InnerText property of it.
Take in count that the only way you can know, in the HTML code you posted, what is the <div> tag that contains the Inner Text, is to assume it is the one that is next to the <div> tag that contains the Header. That's why I used the LINQ expression.
Another way of knowing it could be if the <div> has some particular attribute (like class="___") or similar, or if it contains some tags inside it and not just text. There is no magic when parsing HTMLs :)
Edit:
I have not tested this code. Test it by yourself and let me know if it worked.