Given the following HTML content:
<div>
<h3>Name :</h3>
<p>Person A</p>
<h3>Name :</h3>
<p>Person B</p>
<h3>Name :</h3>
<p>Person C</p>
</div>
I need to extract the name of every person under the p tag using XPath.
When I use the following expression:
name = container.xpath(".//h3[text()='Name :']/following-sibling::p/text()")
I get this output in the .csv file I am writing:
Person A Person B Person C
But I need to have line breaks after every person, like this:
Person A
Person B
Person C
The code I use to write the CSV file is below:
import csv

with open("person.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    for row in output:
        writer.writerow(row)
Is there a way I can structure my XPath in order to achieve that?
Try something like this:
name = container.xpath(".//h3[text()='Name :']/following-sibling::p/text()")
names = ''
for n in name:
    names += n + '\n'
and use names in your output before you save to csv.
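A minimal sketch of that idea, assuming the XPath call returned the three names from the question. The list literal stands in for the real `container.xpath(...)` result, and the `name` column and the `write_names` helper are hypothetical. It shows the joined string (like the loop above, but without the trailing newline) and, as an alternative, one CSV row per person:

```python
import csv
import io

# Stand-in for the list returned by container.xpath(...); values from the question.
name = ["Person A", "Person B", "Person C"]

# Option 1: a single newline-separated string.
names = "\n".join(name)


def write_names(names_list, f):
    """Write one CSV row per person under a hypothetical 'name' column."""
    writer = csv.DictWriter(f, fieldnames=["name"], lineterminator="\n")
    writer.writeheader()
    for n in names_list:
        writer.writerow({"name": n})


# Option 2: one row per person, written to an in-memory buffer for illustration.
buffer = io.StringIO()
write_names(name, buffer)
csv_text = buffer.getvalue()
```

Note that a newline inside a single CSV field gets quoted by the csv module, so writing one row per name is usually the simpler way to get one person per line.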
I have the following HTML code:
<div class="group">
<ul class="smallList">
<li><strong>Date</strong>
13.06.2019
</li>
<li>...</li>
<li>...</li>
</ul>
</div>
and here is my "wrong" XPath:
//div[@class='group']/ul/li[1]
I would like to extract the date with XPath, without the text in the strong tag, but I'm not sure how NOT is used in XPath, or whether it can even be used here.
Keep in mind that the date is dynamic.
Use substring-after() to get the date value.
substring-after(//div[@class='group']/ul/li[1],'Date')
The easiest way to get the date is by using the XPath 1.0 expression
//div[@class='group']/ul/li[1]/text()[normalize-space(.)][1]
The result still includes the surrounding whitespace.
If you want to get rid of that too, use the following expression:
normalize-space(//div[@class='group']/ul/li[1]/text()[normalize-space(.)][1])
Unfortunately, in XPath 1.0 this only works for a single result.
If you have XPath 2.0 available, you can append normalize-space() to the end of the expression, which also allows multiple results to be processed:
//div[@class='group']/ul/li[1]/text()[normalize-space(.)][1]/normalize-space()
Here is a Python method that reads the text directly from the parent element; in your case the data belongs to ul/li.
Python:
def get_text_exclude_children(element):
    # Concatenate only the element's own text nodes, skipping child elements
    return driver.execute_script(
        """
        var parent = arguments[0];
        var child = parent.firstChild;
        var textValue = "";
        while (child) {
            if (child.nodeType === Node.TEXT_NODE)
                textValue += child.textContent;
            child = child.nextSibling;
        }
        return textValue;""",
        element).strip()
This is how to call this in your case.
ulEle = driver.find_element_by_xpath("//div[@class='group']/ul/li[1]")
datePart = get_text_exclude_children(ulEle)
print(datePart)
Please feel free to convert to the language that you are using, if it's not python.
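For comparison, the same "own text only" idea can be sketched without a browser. This is an illustration under stated assumptions, not part of the answer above: it uses the standard library's xml.etree.ElementTree on the snippet from the question (wrapped in a hypothetical `<body>` element so the path can start with `.//div`), and the helper names `own_text` and `normalize_space` are made up for this example.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a <body> element for this sketch.
html = """<body>
<div class="group">
<ul class="smallList">
<li><strong>Date</strong>
13.06.2019
</li>
<li>...</li>
<li>...</li>
</ul>
</div>
</body>"""

root = ET.fromstring(html)
li = root.find(".//div[@class='group']/ul/li[1]")


def own_text(element):
    """Return the text directly inside element, skipping text inside its children."""
    parts = [element.text or ""]
    # Text that follows a child element is stored in that child's .tail.
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)


def normalize_space(value):
    """Rough equivalent of XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(value.split())


date = normalize_space(own_text(li))
print(date)
```

ElementTree only parses well-formed markup, so real-world HTML would need an HTML-aware parser instead.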
Hello, I want to ask a question.
I scrape a website with XPath, and the result looks like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
Now I need to use XPath to parse these results again.
I want to save the first cell as the address, the second as the telephone number, and the last one as the map,
but I can't get it to work.
Please guide me. Thank you!
Here is my code. It's wrong: it picks up something else.
store = sel.xpath("")
for s in store:
    address = s.xpath("//tr/td[1]/text()").extract()
    tel = s.xpath("//tr/td[2]/text()").extract()
    map = s.xpath("//tr/td[3]/text()").extract()
As you can see in the Scrapy documentation, to work with relative XPaths you have to use the .// notation to extract elements relative to the previous XPath; otherwise you get all matching elements from the whole document again. Here is the sample from the Scrapy documentation:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
So I think in your case your code must be something like:
for s in store:
    address = s.xpath(".//tr/td[1]/text()").extract()
    tel = s.xpath(".//tr/td[2]/text()").extract()
    map = s.xpath(".//tr/td[3]/text()").extract()
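As a rough, library-independent illustration of querying relative to each row, here is a sketch using the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The table markup is an assumption modelled on the output in the question, and the second row's values are hypothetical. Each `row` here is already a `<tr>`, so the paths inside the loop start at `td`; if the selectors in `store` are themselves `<tr>` elements, the equivalent relative path would likewise begin at the `td` step.

```python
import xml.etree.ElementTree as ET

# Hypothetical table modelled on the rows shown in the question.
table = ET.fromstring(
    "<table>"
    "<tr><td>address1</td><td>telephone1</td><td>map1</td></tr>"
    "<tr><td>address2</td><td>telephone2</td><td>map2</td></tr>"
    "</table>"
)

records = []
for row in table.findall(".//tr"):
    # These paths are evaluated relative to the current row only.
    address = row.findtext("td[1]")
    tel = row.findtext("td[2]")
    map_link = row.findtext("td[3]")
    records.append((address, tel, map_link))

print(records)
```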
Hope this helps,
I'm using the HtmlAgilityPack to parse href tags in an html file. The href tags look like this:
<h3 class="product-name"><a href="http://www.somewebsite.com/blahblah">Super Cool Product</a></h3>
So far I can successfully pull out the url and the title together, and display it in a list. This is the main code I'm using to parse the html:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//h3[@class='product-name']//a")
                  where
                      lnks.Attributes["href"] != null &&
                      lnks.InnerText.Trim().Length > 0
                  select new
                  {
                      Url = lnks.Attributes["href"].Value,
                      Text = lnks.InnerText
                  };
The code above gives me a result that looks like this:
Super Cool Product - http://www.somewebsite.com/blahblah
I'm trying to figure out how to pull out the name and URL separately and put them into separate strings, instead of pulling them out together and putting them into one string. I'm guessing there is some sort of XPath notation I can use to do this. I would be extremely thankful if someone could point me in the right direction.
Thanks,
Miles
<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
inner hmtl 1
</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>
I would like to parse the inner html between the tags in such a way that I can
* associate the inner html 1 with header 1 and date 1
* associate the inner html 2 with header 2 and date 2
In other words, at the time I parse the inner html 1 I would like to know that the html nodes containing "Date 1" and "Header 1" have been parsed (but the nodes containing "Date 2" and "Header 2" have not been parsed)
If I were doing this via regular text parsing, I would read one line at a time and record the last "Date" and "Header" that I had parsed. Then, when it came time to parse the inner html 1, I could refer to the last parsed "Date" and "Header" objects to associate them together.
Using the Html Agility Pack, you can leverage the power of XPath and forget about that verbose XLinq crap :-). The XPath position() function is context-sensitive. Here is some sample code:
HtmlDocument doc = new HtmlDocument();
doc.Load("your html file");
// select all DIVs without a CLASS attribute defined
foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div[not(@class)]"))
{
    Console.WriteLine("div=" + div.InnerText.Trim());
    // position() counts backwards from the current div along the preceding-sibling axis
    Console.WriteLine(" header=" + div.SelectSingleNode("preceding-sibling::div[position()=1]/b").InnerText);
    Console.WriteLine(" date=" + div.SelectSingleNode("preceding-sibling::div[position()=2]/b").InnerText);
}
That will print this with your sample:
div=inner hmtl 1
header=Header 1
date=Date 1
div=inner html 2
header=Header 2
date=Date 2
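The association can also be sketched in Python by walking the divs in document order and remembering the last date and header seen, which is the approach described in the question. This is an illustration only: it uses the standard library's xml.etree.ElementTree rather than the Html Agility Pack, and the wrapping `<root>` element is an assumption added so the snippet parses as one document.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <root> element.
html = """<root>
<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
inner hmtl 1
</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>
</root>"""

root = ET.fromstring(html)

records = []
date = header = None
for div in root.findall("div"):
    css_class = div.get("class")
    if css_class == "mvb":
        date = div.findtext("b")      # remember the most recent date
    elif css_class == "mxb":
        header = div.findtext("b")    # remember the most recent header
    else:
        # A div without a class holds the content for the last date and header.
        records.append(
            {"date": date, "header": header, "text": (div.text or "").strip()}
        )

for record in records:
    print(record)
```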
Well, you can do this in several ways...
For example, if the HTML you want to parse is the one you wrote in your question, an easy way could be:
Store all dates in a HtmlNodeCollection
Store all headers in a HtmlNodeCollection
Store all inner texts in another HtmlNodeCollection
If everything is okay and the HTML has that layout, you will have the same number of elements in all three collections.
Then you can easily do:
for (int i = 0; i < innerTexts.Count; i++) {
    // Get Date, Headers and Inner Texts at position i
}
The following should work:
var document = new HtmlWeb().Load("http://www.url.com"); // Or load it from a Stream, local file, etc.
var dateNodes = document.DocumentNode.SelectNodes("//div[@class='mvb']/b");
var headerNodes = document.DocumentNode.SelectNodes("//div[@class='mxb']/b");
var innerTextNodes = (from node in document.DocumentNode.SelectNodes("//div")
                      let previous = node.PreviousSibling
                      // PreviousSibling is null for a first child and may be a whitespace text node
                      where previous != null && previous.Name == "div" && previous.GetAttributeValue("class", "") == "mxb"
                      select node).ToList();
// Check here that all three collections have the same number of elements
for (int i = 0; i < dateNodes.Count; i++) {
    var date = dateNodes[i].InnerText;
    var header = headerNodes[i].InnerText;
    var innerText = innerTextNodes[i].InnerText;
    // Now you have the set you want: the Date, Header and Inner Text
}
This is a way of doing this.
Of course, you should check for exceptions (make sure the .SelectNodes(..) calls are not returning null), check for errors in the LINQ expression when storing innerTextNodes, and refactor the for (...) loop, maybe into a method that receives an HtmlNode and returns its InnerText property.
Take into account that, in the HTML code you posted, the only way to know which <div> tag contains the Inner Text is to assume it is the one right after the <div> tag that contains the Header. That's why I used the LINQ expression.
Another way of knowing it would be if the <div> had some particular attribute (like class="___") or similar, or if it contained some tags inside it and not just text. There is no magic when parsing HTML :)
Edit:
I have not tested this code. Test it yourself and let me know if it works.
I have an XML document which contains nodes like the following:
<a class="custom">test</a>
<a class="xyz"></a>
I was trying to get the nodes for which class is NOT "custom", and I wrote an expression like the following:
XmlNodeList nodeList = document.SelectNodes("//*[self::A[@class!='custom'] or self::a[@class!='custom']]");
Now I want to get IMG tags as well, so I want to add the following expression to the one above:
//*[self::IMG or self::img]
...so that I get all the IMG nodes as well as any A tag that does not have "custom" as the value of its class attribute.
Any help will be appreciated.
EDIT:
I tried the following, but it is invalid syntax, as it returns a boolean and not a node list:
XmlNodeList nodeList = document.SelectNodes("//*[self::A[@class!='custom'] or self::a[@class!='custom']] && [self::IMG or self::img]");
I'm not sure what you are asking, but have you tried something like the following?
"//A[@class!='custom'] | //a[@class!='custom'] | //IMG | //img"