XPath to exclude only a specific amount of html tags - xpath

<div class="page-main-content content-style">
<h1 class="title home-title">What’s On This Week <span class="">in the Best Sports Bar:</span></h1>
<div class="page-main-content-inner">
<p class="">Now open and serving food from 7.00am till 1.00am every day</p>
<p class="">Our kitchen stays open outside of these hours for special events.</p>
<h2>Here’s the sport showing this week at biggest family friendly sports bar & restaurant:</h2>
<h2>BOXING</h2><p class="">Vergil Ortiz Jr v Michael McKinson, Sunday, 7th August @ 8.00am</p>
<h2>UFC on ESPN: Santos vs. Hill</h2><p class="">Sunday, 7th August @ 9.00am (prelims @ 7.00am)<br class=""> Full main card replay @ 3.00pm on Sunday</p>
<h2>MOTO GP</h2><p class="">Practice: Friday, 5th August @ 3.00pm and Saturday, 6th August @ 3.00pm<br class=""> Qualifying: Saturday, 6th August @ 6.00pm<br class=""> Races: Sunday, 7th August @ 4.30pm</p>
<h2>CRICKET – WEST INDIES v INDIA</h2><p class="">2nd T20I: Monday, 1st August @ 9.30pm<br class=""> 3rd T20I: Tuesday, 2nd August @ 9.30pm<br class=""> 4th T20I: Saturday, 6th August @ 9.30pm<br class=""> 5th T20I: Sunday, 7th August @ 9.30pm</p>
</div>
</div>
I am trying to use XPath to scrape all this content, but I don't need the first two p tags or the first h2 tag (Here’s the sport showing this week..).
So effectively I need to start scraping at BOXING, which is the second h2 tag, and then grab ALL content from there.
I've tried dozens of variations to exclude these:
//div[@class='page-main-content-inner']/*[not(self::p)]
But I cannot seem to get this to work. If I exclude p tags it excludes them all. Tried to limit this using stuff like [position()>1] but still cannot do it.

Try this:
//div[@class='page-main-content-inner']/h2[1]/following-sibling::*
It finds your first h2 and returns every sibling element that follows it:
<h2>BOXING</h2>
<p class="">Vergil Ortiz Jr v Michael McKinson, Sunday, 7th August @ 8.00am</p>
<h2>UFC on ESPN: Santos vs. Hill</h2>
<p class="">Sunday, 7th August @ 9.00am (prelims @ 7.00am)<br class=""/> Full main card replay @ 3.00pm on Sunday</p>
<h2>MOTO GP</h2>
<p class="">Practice: Friday, 5th August @ 3.00pm and Saturday, 6th August @ 3.00pm<br class=""/> Qualifying: Saturday, 6th August @ 6.00pm<br class=""/> Races: Sunday, 7th August @ 4.30pm</p>
<h2>CRICKET – WEST INDIES v INDIA</h2>
<p class="">2nd T20I: Monday, 1st August @ 9.30pm<br class=""/> 3rd T20I: Tuesday, 2nd August @ 9.30pm<br class=""/> 4th T20I: Saturday, 6th August @ 9.30pm<br class=""/> 5th T20I: Sunday, 7th August @ 9.30pm</p>

Related

How to get this value with only one xpath?

I want to have an XPATH which is able to select the date and time (like june 19 2020 at 08:59 pm) in all cases:
<span class="post_date"><span title="June 21, 2020 at 08:18 AM" currentmouseover="12">1 hour ago</span> <span class="post_edit" id="edited_by_2462600"> </span></span>
<span class="post_date" currentmouseover="62">June 19, 2020 at 08:56 PM <span class="post_edit" id="edited_by_2454907"> </span></span>
<span class="post_date" currentmouseover="157"><span title="June 20, 2020" currentmouseover="168">Yesterday</span> at 10:41 AM <span class="post_edit" id="edited_by_2457722"> </span></span>
I can get the second one easily with //*[@class="post_date"]/text(), but is there any way to get the two others and have one XPath for all cases? Or am I better off writing a function for this?
Thank you
Working XPath expression to select all dates with one expression :
(//@title|//text())[contains(.,", ") or contains(.," at ")]
Output : 4 nodes
EDIT : If you need something stricter (assuming all messages were posted after the year 2000):
//span[@class='post_date']/span[contains(@title,', 20')]/@title|//span/text()[contains(.,' at ') and contains (.,':')][ancestor::*[1][self::span][@class='post_date']]
Or :
(//span[@class='post_date']/span[@title]/@title|//span/text()[ancestor::*[1][self::span][@class='post_date']])[contains(.,', 20') or contains(.,' at ')]
Output : 4 nodes
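The question also asks whether a function would be simpler. As a rough sketch of that route, the Python below handles the three cases explicitly and returns one string per post, joining the date and time that the XPath expressions return as separate nodes in the third case. It uses only the standard library; the `<root>` wrapper and the helper name `post_date` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The three sample spans from the question, wrapped in a root element so
# the fragment parses as XML.
SAMPLE = """
<root>
<span class="post_date"><span title="June 21, 2020 at 08:18 AM" currentmouseover="12">1 hour ago</span> <span class="post_edit" id="edited_by_2462600"> </span></span>
<span class="post_date" currentmouseover="62">June 19, 2020 at 08:56 PM <span class="post_edit" id="edited_by_2454907"> </span></span>
<span class="post_date" currentmouseover="157"><span title="June 20, 2020" currentmouseover="168">Yesterday</span> at 10:41 AM <span class="post_edit" id="edited_by_2457722"> </span></span>
</root>
"""

def post_date(span):
    """Return the full date-and-time string for one post_date span."""
    titled = span.find("span[@title]")
    if titled is None:
        # No titled child: the date is the span's own leading text.
        return (span.text or "").strip()
    title = titled.get("title")
    if " at " in title:
        # The title already holds both date and time.
        return title
    # The title holds the date; the text after the child holds the time.
    return f"{title} {(titled.tail or '').strip()}"

root = ET.fromstring(SAMPLE)
dates = [post_date(s) for s in root.findall("span[@class='post_date']")]
print(dates)
```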

Using Apple Foundation Calendar properly

I'm a little bit confused about the correct usage of Calendar from Apple's Foundation framework.
let calendar = Calendar(identifier: .iso8601)
let dayComponent = DateComponents(year: 2019, weekday: 1, weekOfYear: 6)
let date = calendar.date(from: dayComponent)
I need to get the first day of a given week of year. When using the code above the following dates are given depending on weekday:
//weekday:
//0 -> 08 FEB
//1 -> 09 FEB
//2 -> 03 FEB
//3 -> 04 FEB
//4 -> 05 FEB
Why does weekday start at 0 in the current week (6) and then switch to week 5 when increased?
Thanks for any help.
A couple of observations:
When you iterate through weekdays, you want to go from 1 through 7, because, “Day, week, weekday, month, and year numbers are generally 1-based...” (Date and Time Programming Guide: Date Components and Calendar Units). You can use range(of:in:for:), maximumRange(of:), etc., to find the range of possible values.
The weekday values from 1 through 7 do not mean “first day of the week”, “second day of the week”, etc. They refer to specific days of the week, e.g. for .iso8601, “Sun” is 1, “Mon” is 2, etc.
Make sure when you use weekOfYear, you use yearForWeekOfYear:
let calendar = Calendar(identifier: .iso8601)
let firstOfWeek = DateComponents(calendar: calendar, weekOfYear: 6, yearForWeekOfYear: 2019).date!
Your code is iterating through the weekdays. Consider this code that enumerates all of the days of the week for the sixth week of 2019 (i.e. the week starting Monday, February 4th and ending Sunday, February 10th):
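// Assumed setup: the original does not show how `formatter` was created.
let formatter = DateFormatter()
formatter.dateStyle = .full
let weekdays = calendar.range(of: .weekday, in: .weekOfYear, for: firstOfWeek)!
// Short symbols ("Sun", "Mon", ...) match the output shown below.
let daysOfTheWeek = Dictionary(uniqueKeysWithValues: zip(weekdays, calendar.shortWeekdaySymbols))
for weekday in weekdays {
    let date = DateComponents(calendar: calendar, weekday: weekday, weekOfYear: 6, yearForWeekOfYear: 2019).date!
    print("The", daysOfTheWeek[weekday]!, "in the 6th week of 2019 is", formatter.string(from: date))
}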
That results in:
The Sun in the 6th week of 2019 is Sunday, February 10, 2019
The Mon in the 6th week of 2019 is Monday, February 4, 2019
The Tue in the 6th week of 2019 is Tuesday, February 5, 2019
The Wed in the 6th week of 2019 is Wednesday, February 6, 2019
The Thu in the 6th week of 2019 is Thursday, February 7, 2019
The Fri in the 6th week of 2019 is Friday, February 8, 2019
The Sat in the 6th week of 2019 is Saturday, February 9, 2019
This is effectively what your code does and why you’re not seeing what you expected.
If you want to iterate through the seven days of the week in order, just get the start of the week and then offset it from there:
let calendar = Calendar(identifier: .iso8601)
let startOfWeek = DateComponents(calendar: calendar, weekOfYear: 6, yearForWeekOfYear: 2019).date!
for offset in 0 ..< 7 {
    let date = calendar.date(byAdding: .day, value: offset, to: startOfWeek)!
    print(offset, "->", formatter.string(from: date))
}
That results in:
0 -> Monday, February 4, 2019
1 -> Tuesday, February 5, 2019
2 -> Wednesday, February 6, 2019
3 -> Thursday, February 7, 2019
4 -> Friday, February 8, 2019
5 -> Saturday, February 9, 2019
6 -> Sunday, February 10, 2019
You asked:
I need to get the first day of a given week of year.
Probably needless to say at this point, but just omit the weekday:
let startOfWeek = DateComponents(calendar: calendar, weekOfYear: 6, yearForWeekOfYear: 2019).date!
Also see Date and Time Programming Guide: Week-Based Calendars.
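As an independent cross-check of the dates above, here is a small Python sketch (not Swift, and not part of the original answer) that builds the same ISO 8601 week with the standard library. It assumes Python 3.8 or later for `date.fromisocalendar`; the variable names are illustrative.

```python
from datetime import date, timedelta

# ISO 8601 weekdays run 1 (Monday) through 7 (Sunday), unlike Foundation's
# weekday component, where 1 is Sunday.
start_of_week = date.fromisocalendar(2019, 6, 1)

# Offset from the start of the week, as in the Swift loop above.
week = [start_of_week + timedelta(days=offset) for offset in range(7)]
for offset, day in enumerate(week):
    print(offset, "->", day.strftime("%A, %B %d, %Y"))
```

Week 6 of 2019 starts on Monday, February 4 and ends on Sunday, February 10, which agrees with the Swift output.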

Kibana Timelion: Subselect or Subquery to aggregate sum of max

Let's suppose I have the following data in Elasticsearch:
@timestamp; userId; currentPoints
August 7th 2017, 00:30:37.319; myUserName; 4
August 7th 2017, 00:43:22.121; myUserName; 10
August 7th 2017, 00:54:29.177; myUserName; 7
August 7th 2017, 01:10:29.352; myUserName; 4
August 7th 2017, 00:32:37.319; myOtherUserName; 12
August 7th 2017, 00:44:22.121; myOtherUserName; 17
August 7th 2017, 00:56:29.177; myOtherUserName; 8
August 7th 2017, 01:18:29.352; myOtherUserName; 11
I'm looking to draw a date histogram that shows the sum of max(currentPoints) per username per hour, which would generate the following data to plot:
August 7th 2017, 00; SumOfMaxCurrentPoints -> 27 (max from hour 00h from both users 10 + 17)
August 7th 2017, 01; SumOfMaxCurrentPoints -> 15 (max from hour 01h from both users 4 + 11)
This would usually be done with a subquery: extract max(currentPoints) for each hour and user, then sum those results per hour.
Is this possible with Kibana Timelion for instance? I can't find a way to achieve this using the documentation.
Thanks
Alex
While working on another project, I stumbled upon a way to do this in Kibana/Elasticsearch without using Timelion.
The feature is called Sibling Pipeline Aggregation, and in this case you use the Sum Bucket. You can use it with any recent Kibana/Elastic visualization (I'm using version 5.5).
For a dataset such as:
@timestamp; userId; currentPoints
August 7th 2017, 00:30:37.319; myUserName; 4
August 7th 2017, 00:43:22.121; myUserName; 10
August 7th 2017, 00:54:29.177; myUserName; 7
August 7th 2017, 01:10:29.352; myUserName; 4
August 7th 2017, 00:32:37.319; myOtherUserName; 12
August 7th 2017, 00:44:22.121; myOtherUserName; 17
August 7th 2017, 00:56:29.177; myOtherUserName; 8
August 7th 2017, 01:18:29.352; myOtherUserName; 11
Where you want an hourly sum of the per-userId max(currentPoints), resulting in:
August 7th 2017, 00; SumOfMaxCurrentPoints -> 27 (max from hour 00h from both users 10 + 17)
August 7th 2017, 01; SumOfMaxCurrentPoints -> 15 (max from hour 01h from both users 4 + 11)
You can do:
Metric
  Aggregation: Sibling Pipeline Aggregation (Sum Bucket)
  Bucket Aggregation Type: Terms
  Bucket Field: userId
  Bucket Size: a comfortable value above the number of users, if you want total precision
  Metric Aggregation: Max
  Metric Field: currentPoints
Bucket
  Buckets type: Split Rows
  Bucket Aggregation: Date Histogram
  Histogram Field: @timestamp
  Histogram Interval: Hourly
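To make the arithmetic behind that configuration concrete, here is a plain-Python sketch of what it computes on the sample data: a max per user inside each hourly bucket, then a sum of those maxima. It does not call Elasticsearch or Kibana; the function name `sum_of_max_per_hour` and the simplified timestamp strings (milliseconds dropped) are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (timestamp, userId, currentPoints).
ROWS = [
    ("2017-08-07 00:30:37", "myUserName", 4),
    ("2017-08-07 00:43:22", "myUserName", 10),
    ("2017-08-07 00:54:29", "myUserName", 7),
    ("2017-08-07 01:10:29", "myUserName", 4),
    ("2017-08-07 00:32:37", "myOtherUserName", 12),
    ("2017-08-07 00:44:22", "myOtherUserName", 17),
    ("2017-08-07 00:56:29", "myOtherUserName", 8),
    ("2017-08-07 01:18:29", "myOtherUserName", 11),
]

def sum_of_max_per_hour(rows):
    """Take the max per (hour, user), then sum those maxima per hour."""
    max_per_user = defaultdict(dict)
    for stamp, user, points in rows:
        parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        hour = parsed.replace(minute=0, second=0)  # hourly bucket key
        best = max_per_user[hour].get(user, points)
        max_per_user[hour][user] = max(best, points)
    return {hour: sum(users.values())
            for hour, users in sorted(max_per_user.items())}

totals = sum_of_max_per_hour(ROWS)
for hour, total in totals.items():
    print(hour.strftime("%Y-%m-%d %H:00"), "->", total)
```

The inner dictionary plays the role of the Terms bucket on userId with a Max metric, and the final `sum` corresponds to the Sum Bucket step.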

How to find out the date of second Monday of each month of given year?

My customer has an event on the second Monday of each month.
I need to mark those days in red in a calendar.
How do I "cleanly" find out the dates of those Mondays?
Here's my version.
If the eighth of the month is a Monday, then it is the second Monday. If it is not, step forward to the next Monday, which is (1 - wday) % 7 days away.
require 'date'
oct_2012 = Date.new 2012, 10, 8
oct_2012.wday # => 1, We're done!
nov_2012 = Date.new 2012, 11, 8
nov_2012.wday # => 4
nov_2012 + ((1 - nov_2012.wday) % 7) # => 2012-11-12
Does that help?
Edit
Easier version: start from the first of the month, step forward to the first Monday, then add a week. This works even if the month starts on a Monday.
oct_2012 = Date.new 2012, 10, 1
oct_2012 + ((1 - oct_2012.wday) % 7) + 7 # => 2012-10-08
nov_2012 = Date.new 2012, 11, 1
nov_2012 + ((1 - nov_2012.wday) % 7) + 7 # => 2012-11-12
One rule and done!
Your second Monday will always fall between the 8th and the 14th of the month.
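Since the question asks for every month of a given year, the same rule can be sketched as a loop. This version is in Python rather than Ruby and uses only the standard library; the function name `second_monday` is an illustrative assumption. Note that Python's `weekday()` counts Monday as 0, whereas Ruby's `wday` counts Sunday as 0.

```python
from datetime import date, timedelta

def second_monday(year, month):
    """Return the date of the second Monday of the given month."""
    first = date(year, month, 1)
    # Days from the 1st to the first Monday (0 if the 1st is a Monday).
    days_to_monday = (0 - first.weekday()) % 7
    # One more week lands on the second Monday.
    return first + timedelta(days=days_to_monday + 7)

second_mondays = [second_monday(2012, month) for month in range(1, 13)]
for day in second_mondays:
    print(day.isoformat())
```

Every result is a Monday between the 8th and the 14th, which gives a cheap sanity check when marking the calendar.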

Scrape data of a div after one or two html tags using xpath

Here is the code:
<div id="content">
<div class="datebar">
<span style="float:right">some text1</span>
<b>some text2</b>
Thursday, September 8, 2011 - 1:17 pm EDT
</div>
</div>
I just want to extract date and time Thursday, September 8, 2011 - 1:17 pm EDT.
Any suggestions? Thanks.
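div[@id = 'content']/div[@class = 'datebar']/text()
or
div[@id = 'content']/div[@class = 'datebar']/b/following-sibling::text()
Either way, trim the whitespace from the result afterwards. The first expression also returns the whitespace-only text nodes that sit between the child elements.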
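As a rough illustration of the second expression plus the normalization step, here is a standard-library Python sketch. ElementTree does not expose text nodes through XPath, so it reads the text that follows `<b>` from that element's `tail` attribute instead; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# The markup from the question.
SAMPLE = """
<div id="content">
<div class="datebar">
<span style="float:right">some text1</span>
<b>some text2</b>
Thursday, September 8, 2011 - 1:17 pm EDT
</div>
</div>
"""

root = ET.fromstring(SAMPLE)  # root is the div with id="content"
bold = root.find("div[@class='datebar']/b")

# `tail` is the text between </b> and the next tag; splitting and
# re-joining collapses the surrounding newlines and repeated spaces.
date_text = " ".join((bold.tail or "").split())
print(date_text)
```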
