Check if a row's time range overlaps with other rows using M in Power Query

Please, I need help with the table below. Each row has an Employee ID and an activity with a start/end timestamp. In some cases these timestamps overlap, e.g. the first two rows overlap each other, as do the next two. For the rows whose times overlap, how do I keep only one row, based on the condition "greatest duration on the same day for the same employee"?
| Employee ID | Work Type | Duration (h) | Start TimeStamp | End TimeStamp   | Date      |
|-------------|-----------|--------------|-----------------|-----------------|-----------|
| 2531        | (OJT)     | 4.97         | 12/8/2022 7:02  | 12/8/2022 12:00 | 12/8/2022 |
| 2531        | (OJT)     | 4.95         | 12/8/2022 7:03  | 12/8/2022 12:00 | 12/8/2022 |
| 2531        | (Idel)    | 2.50         | 12/8/2022 12:30 | 12/8/2022 15:00 | 12/8/2022 |
| 2531        | (Break)   | 0.50         | 12/8/2022 12:00 | 12/8/2022 12:30 | 12/8/2022 |
The expected result is a custom column with a Yes/No flag (Yes on the first and third rows) that I can then use to filter:
| Employee ID | Work Type | Duration (h) | Start TimeStamp | End TimeStamp   | Date      | Keep Row |
|-------------|-----------|--------------|-----------------|-----------------|-----------|----------|
| 2531        | (OJT)     | 4.97         | 12/8/2022 7:02  | 12/8/2022 12:00 | 12/8/2022 | Yes      |
| 2531        | (OJT)     | 4.95         | 12/8/2022 7:03  | 12/8/2022 12:00 | 12/8/2022 | No       |
| 2531        | (Idel)    | 2.50         | 12/8/2022 12:30 | 12/8/2022 15:00 | 12/8/2022 | Yes      |
| 2531        | (Break)   | 0.50         | 12/8/2022 12:00 | 12/8/2022 12:30 | 12/8/2022 | No       |

See if this works for you. It groups by [Employee ID, Work Type, Date, Starting Hour] and marks the row with the highest duration in each group. Start-time hours are recoded for rows that fall within another row's time period so that they end up in the same group. Using this sample data:
| Employee ID | Work Type | Duration | Start Time Stamp | End Time Stamp | Date     |
|-------------|-----------|----------|------------------|----------------|----------|
| 2531        | OJT       | 0.15     | 12/08/22 07:05   | 12/08/22 10:35 | 12/08/22 |
| 2531        | OJT       | 0.04     | 12/08/22 08:05   | 12/08/22 09:00 | 12/08/22 |
| 2531        | OJT       | 0.02     | 12/08/22 07:15   | 12/08/22 07:50 | 12/08/22 |
| 2531        | OJT       | 0.02     | 12/08/22 07:05   | 12/08/22 07:39 | 12/08/22 |
| 2531        | OJT       | 0.07     | 12/08/22 08:05   | 12/08/22 09:50 | 12/08/22 |
| 2531        | OJT       | 0.11     | 12/08/22 08:15   | 12/08/22 11:00 | 12/08/22 |
| 2531        | IDEL      | 0.00     | 12/08/22 06:05   | 12/08/22 06:10 | 12/08/22 |
| 2531        | IDEL      | 0.02     | 12/08/22 07:05   | 12/08/22 07:39 | 12/08/22 |
| 2531        | IDEL      | 0.07     | 12/08/22 08:05   | 12/08/22 09:50 | 12/08/22 |
| 2531        | IDEL      | 0.03     | 12/08/22 08:15   | 12/08/22 09:00 | 12/08/22 |
| 2531        | OJT       | 0.02     | 12/12/22 07:05   | 12/12/22 07:35 | 12/12/22 |
| 2531        | OJT       | 0.02     | 12/12/22 07:05   | 12/12/22 07:39 | 12/12/22 |
| 2531        | OJT       | 0.07     | 12/12/22 08:05   | 12/12/22 09:50 | 12/12/22 |
| 2531        | OJT       | 0.03     | 12/12/22 08:15   | 12/12/22 09:00 | 12/12/22 |
| 2531        | IDEL      | 0.00     | 12/12/22 06:05   | 12/12/22 06:10 | 12/12/22 |
| 2531        | IDEL      | 0.02     | 12/12/22 07:05   | 12/12/22 07:39 | 12/12/22 |
| 2531        | IDEL      | 0.07     | 12/12/22 08:05   | 12/12/22 09:50 | 12/12/22 |
| 2531        | IDEL      | 0.03     | 12/12/22 08:15   | 12/12/22 09:00 | 12/12/22 |
| 2792        | OJT       | 0.15     | 12/08/22 07:05   | 12/08/22 10:35 | 12/08/22 |
| 2792        | OJT       | 0.02     | 12/08/22 07:05   | 12/08/22 07:39 | 12/08/22 |
| 2792        | OJT       | 0.04     | 12/08/22 07:20   | 12/08/22 08:15 | 12/08/22 |
| 2792        | OJT       | 0.08     | 12/08/22 08:05   | 12/08/22 10:00 | 12/08/22 |
| 2792        | OJT       | 0.03     | 12/08/22 08:15   | 12/08/22 09:00 | 12/08/22 |
| 2792        | IDEL      | 0.00     | 12/08/22 06:05   | 12/08/22 06:10 | 12/08/22 |
| 2792        | IDEL      | 0.02     | 12/08/22 07:05   | 12/08/22 07:39 | 12/08/22 |
| 2792        | IDEL      | 0.07     | 12/08/22 08:05   | 12/08/22 09:50 | 12/08/22 |
| 2792        | IDEL      | 0.03     | 12/08/22 08:15   | 12/08/22 09:00 | 12/08/22 |
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{ {"Duration", type number}, {"Start Time Stamp", type datetime}, {"End Time Stamp", type datetime}, {"Date", type date}}),
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 0, 1, Int64.Type),
#"Added Custom1" = Table.AddColumn(#"Added Index", "StartHour2", each Time.Hour([Start Time Stamp])),
// recode overlapped items start time
#"Added Custom" = Table.AddColumn(#"Added Custom1","StartHour",(x)=>List.Min(Table.SelectRows(#"Added Custom1", each [Date]=x[Date] and [Employee ID]=x[Employee ID] and [Work Type]=x[Work Type] and [End Time Stamp]>=x[End Time Stamp] and [Start Time Stamp]<=x[Start Time Stamp])[StartHour2])),
#"Grouped Rows" = Table.Group(#"Added Custom", {"Employee ID", "Work Type", "Date", "StartHour"}, {{"data", each
let a=List.Max (_[Duration]),
b = Table.AddColumn(_,"Max", each if [Duration]=a then "max" else null)
in b, type table }}),
#"Expanded data" = Table.ExpandTableColumn(#"Grouped Rows", "data", {"Duration", "Start Time Stamp", "End Time Stamp", "Index","Max"}, {"Duration", "Start Time Stamp", "End Time Stamp", "Index","Max"}),
#"Sorted Rows1" = Table.Sort(#"Expanded data",{{"Index", Order.Ascending}}),
#"Removed Columns" = Table.RemoveColumns(#"Sorted Rows1",{"StartHour", "Index"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Removed Columns",{{"Start Time Stamp", type datetime}, {"End Time Stamp", type datetime}, {"Date", type date}})
in #"Changed Type1"

Related

Faceting a line plot with units

I have a data frame which houses data for a few individuals in my study. These individuals belong to one of four groups. I would like to plot each individual's curve and compare them to others in that group.
I was hoping to facet by group and then use the units argument to draw lines for each individual in a lineplot.
Here is what I have so far:
g = sns.FacetGrid(data = m, col='Sex', row = 'Group')
g.map(sns.lineplot, 'Time','residual')
The docs say that g.map accepts arguments in the order that they appear in lineplot. units is at the end of a very long list.
How can I facet a line plot and use the units argument?
Here is my data:
Subject Time predicted Concentration Group Sex residual
1 0.5 0.24 0.01 NAFLD Male -0.23
1 1.0 0.4 0.33 NAFLD Male -0.08
1 2.0 0.58 0.8 NAFLD Male 0.22
1 4.0 0.59 0.59 NAFLD Male -0.0
1 6.0 0.47 0.42 NAFLD Male -0.04
1 8.0 0.33 0.23 NAFLD Male -0.1
1 10.0 0.22 0.16 NAFLD Male -0.06
1 12.0 0.15 0.33 NAFLD Male 0.18
3 0.5 0.26 0.08 NAFLD Female -0.18
3 1.0 0.45 0.45 NAFLD Female 0.01
3 2.0 0.66 0.7 NAFLD Female 0.03
3 4.0 0.74 0.76 NAFLD Female 0.02
3 6.0 0.62 0.7 NAFLD Female 0.08
3 8.0 0.46 0.4 NAFLD Female -0.06
3 10.0 0.32 0.27 NAFLD Female -0.05
3 12.0 0.21 0.21 NAFLD Female -0.0
4 0.5 0.52 0.13 NAFLD Female -0.39
4 1.0 0.91 1.18 NAFLD Female 0.27
4 2.0 1.37 1.03 NAFLD Female -0.34
4 4.0 1.55 2.02 NAFLD Female 0.47
4 6.0 1.32 1.19 NAFLD Female -0.13
4 8.0 1.0 0.89 NAFLD Female -0.1
4 10.0 0.71 0.66 NAFLD Female -0.05
4 12.0 0.48 0.5 NAFLD Female 0.02
5 0.5 0.46 0.16 NAFLD Female -0.3
5 1.0 0.76 0.98 NAFLD Female 0.22
5 2.0 1.05 1.03 NAFLD Female -0.02
5 4.0 1.03 1.06 NAFLD Female 0.03
5 6.0 0.8 0.77 NAFLD Female -0.03
5 8.0 0.57 0.5 NAFLD Female -0.07
5 10.0 0.4 0.42 NAFLD Female 0.02
5 12.0 0.27 0.33 NAFLD Female 0.06
6 0.5 1.08 1.02 NAFLD Female -0.06
6 1.0 1.53 1.66 NAFLD Female 0.13
6 2.0 1.67 1.52 NAFLD Female -0.16
6 4.0 1.3 1.44 NAFLD Female 0.14
6 6.0 0.94 0.94 NAFLD Female -0.0
6 8.0 0.68 0.63 NAFLD Female -0.05
6 10.0 0.49 0.36 NAFLD Female -0.13
6 12.0 0.35 0.48 NAFLD Female 0.13
7 0.5 0.5 0.34 Control Female -0.16
7 1.0 0.81 0.84 Control Female 0.04
7 2.0 1.08 1.17 Control Female 0.1
7 4.0 1.0 0.99 Control Female -0.01
7 6.0 0.73 0.65 Control Female -0.08
7 8.0 0.5 0.49 Control Female -0.01
7 10.0 0.33 0.37 Control Female 0.04
7 12.0 0.22 0.25 Control Female 0.03
8 0.5 0.44 0.37 Control Male -0.06
8 1.0 0.67 0.74 Control Male 0.07
8 2.0 0.82 0.8 Control Male -0.03
8 4.0 0.72 0.72 Control Male 0.01
8 6.0 0.54 0.54 Control Male -0.0
8 8.0 0.4 0.38 Control Male -0.02
8 10.0 0.29 0.31 Control Male 0.02
8 12.0 0.21 0.21 Control Male 0.0
9 0.5 0.51 0.26 Control Female -0.25
9 1.0 0.86 0.66 Control Female -0.21
9 2.0 1.23 1.62 Control Female 0.39
9 4.0 1.3 1.26 Control Female -0.03
9 6.0 1.07 0.94 Control Female -0.13
9 8.0 0.81 0.74 Control Female -0.07
9 10.0 0.59 0.62 Control Female 0.03
9 12.0 0.43 0.54 Control Female 0.11
10 0.5 0.81 0.82 Control Female 0.01
10 1.0 1.05 1.03 Control Female -0.02
10 2.0 1.04 1.04 Control Female -0.0
10 4.0 0.77 0.81 Control Female 0.04
10 6.0 0.55 0.52 Control Female -0.03
10 8.0 0.39 0.35 Control Female -0.04
10 10.0 0.28 0.31 Control Female 0.03
10 12.0 0.2 0.21 Control Female 0.01
11 0.5 0.08 0.07 NAFLD Male -0.01
11 1.0 0.15 0.08 NAFLD Male -0.07
11 2.0 0.24 0.13 NAFLD Male -0.11
11 4.0 0.32 0.45 NAFLD Male 0.12
11 6.0 0.33 0.38 NAFLD Male 0.05
11 8.0 0.3 0.28 NAFLD Male -0.02
11 10.0 0.25 0.23 NAFLD Male -0.02
11 12.0 0.2 0.16 NAFLD Male -0.04
12 0.5 0.72 0.75 NAFLD Female 0.03
12 1.0 0.84 0.76 NAFLD Female -0.08
12 2.0 0.8 0.77 NAFLD Female -0.03
12 4.0 0.67 0.74 NAFLD Female 0.07
12 6.0 0.56 0.65 NAFLD Female 0.09
12 8.0 0.46 0.48 NAFLD Female 0.02
12 10.0 0.38 0.34 NAFLD Female -0.05
12 12.0 0.32 0.25 NAFLD Female -0.07
13 0.5 0.28 0.07 Control Female -0.21
13 1.0 0.49 0.38 Control Female -0.1
13 2.0 0.74 0.94 Control Female 0.2
13 4.0 0.88 0.84 Control Female -0.04
13 6.0 0.77 0.79 Control Female 0.02
13 8.0 0.61 0.57 Control Female -0.03
13 10.0 0.45 0.44 Control Female -0.01
13 12.0 0.32 0.32 Control Female 0.01
14 0.5 0.26 0.04 NAFLD Female -0.22
14 1.0 0.44 0.35 NAFLD Female -0.1
14 2.0 0.64 0.84 NAFLD Female 0.19
14 4.0 0.68 0.73 NAFLD Female 0.04
14 6.0 0.54 0.45 NAFLD Female -0.1
14 8.0 0.39 0.34 NAFLD Female -0.05
14 10.0 0.26 0.26 NAFLD Female 0.01
14 12.0 0.16 0.24 NAFLD Female 0.07
15 0.5 0.3 0.11 NAFLD Male -0.19
15 1.0 0.49 0.61 NAFLD Male 0.12
15 2.0 0.67 0.68 NAFLD Male 0.01
15 4.0 0.64 0.67 NAFLD Male 0.03
15 6.0 0.48 0.42 NAFLD Male -0.06
15 8.0 0.33 0.31 NAFLD Male -0.02
15 10.0 0.22 0.26 NAFLD Male 0.04
15 12.0 0.15 0.17 NAFLD Male 0.02
16 0.5 0.16 0.05 NAFLD Male -0.12
16 1.0 0.26 0.35 NAFLD Male 0.1
16 2.0 0.33 0.32 NAFLD Male -0.01
16 4.0 0.28 0.27 NAFLD Male -0.01
16 6.0 0.19 0.17 NAFLD Male -0.02
16 8.0 0.12 0.13 NAFLD Male 0.01
16 10.0 0.07 0.09 NAFLD Male 0.02
16 12.0 0.05 0.05 NAFLD Male 0.0
17 0.5 0.32 0.16 NAFLD Female -0.16
17 1.0 0.54 0.59 NAFLD Female 0.06
17 2.0 0.74 0.78 NAFLD Female 0.04
17 4.0 0.71 0.76 NAFLD Female 0.05
17 6.0 0.53 0.43 NAFLD Female -0.1
17 8.0 0.36 0.35 NAFLD Female -0.01
17 10.0 0.23 0.25 NAFLD Female 0.02
17 12.0 0.15 0.2 NAFLD Female 0.05
18 0.5 0.49 0.18 Control Female -0.31
18 1.0 0.81 0.82 Control Female 0.01
18 2.0 1.1 1.27 Control Female 0.16
18 4.0 1.03 1.06 Control Female 0.03
18 6.0 0.72 0.65 Control Female -0.07
18 8.0 0.45 0.38 Control Female -0.07
18 10.0 0.26 0.28 Control Female 0.02
18 12.0 0.14 0.19 Control Female 0.04
19 0.5 0.15 0.04 NAFLD Female -0.11
19 1.0 0.27 0.21 NAFLD Female -0.06
19 2.0 0.43 0.43 NAFLD Female -0.01
19 4.0 0.56 0.66 NAFLD Female 0.1
19 6.0 0.54 0.52 NAFLD Female -0.02
19 8.0 0.47 0.48 NAFLD Female 0.01
19 10.0 0.38 0.38 NAFLD Female 0.0
19 12.0 0.29 0.24 NAFLD Female -0.05
20 0.5 0.38 0.07 NAFLD Female -0.31
20 1.0 0.6 0.82 NAFLD Female 0.22
20 2.0 0.75 0.79 NAFLD Female 0.04
20 4.0 0.63 0.58 NAFLD Female -0.05
20 6.0 0.44 0.39 NAFLD Female -0.05
20 8.0 0.29 0.27 NAFLD Female -0.02
20 10.0 0.19 0.23 NAFLD Female 0.04
20 12.0 0.13 0.19 NAFLD Female 0.07
21 0.5 0.37 0.28 NAFLD Male -0.09
21 1.0 0.56 0.66 NAFLD Male 0.1
21 2.0 0.68 0.64 NAFLD Male -0.04
21 4.0 0.59 0.62 NAFLD Male 0.02
21 6.0 0.45 0.43 NAFLD Male -0.02
21 8.0 0.34 0.31 NAFLD Male -0.03
21 10.0 0.26 0.29 NAFLD Male 0.03
21 12.0 0.19 0.2 NAFLD Male 0.0
22 0.5 0.28 0.21 Control Male -0.07
22 1.0 0.42 0.5 Control Male 0.08
22 2.0 0.5 0.47 Control Male -0.03
22 4.0 0.42 0.42 Control Male 0.0
22 6.0 0.31 0.32 Control Male 0.01
22 8.0 0.23 0.22 Control Male -0.01
22 10.0 0.16 0.17 Control Male 0.01
22 12.0 0.12 0.11 Control Male -0.01
23 0.5 0.46 0.18 Control Female -0.28
23 1.0 0.75 0.65 Control Female -0.1
23 2.0 1.03 1.23 Control Female 0.2
23 4.0 0.96 1.05 Control Female 0.09
23 6.0 0.67 0.58 Control Female -0.1
23 8.0 0.42 0.36 Control Female -0.06
23 10.0 0.24 0.22 Control Female -0.02
23 12.0 0.14 0.14 Control Female 0.0
24 0.5 0.2 0.14 NAFLD Male -0.06
24 1.0 0.33 0.41 NAFLD Male 0.08
24 2.0 0.44 0.4 NAFLD Male -0.04
24 4.0 0.41 0.42 NAFLD Male 0.01
24 6.0 0.31 0.31 NAFLD Male 0.0
24 8.0 0.22 0.21 NAFLD Male -0.01
24 10.0 0.15 0.17 NAFLD Male 0.02
24 12.0 0.1 0.09 NAFLD Male -0.02
25 0.5 0.28 0.05 NAFLD Female -0.23
25 1.0 0.48 0.43 NAFLD Female -0.05
25 2.0 0.7 0.82 NAFLD Female 0.12
25 4.0 0.75 0.8 NAFLD Female 0.06
25 6.0 0.6 0.56 NAFLD Female -0.03
25 8.0 0.42 0.38 NAFLD Female -0.04
25 10.0 0.28 0.28 NAFLD Female -0.0
25 12.0 0.18 0.18 NAFLD Female -0.0
26 0.5 0.65 0.38 NAFLD Female -0.27
26 1.0 1.0 1.2 NAFLD Female 0.2
26 2.0 1.23 1.26 NAFLD Female 0.03
26 4.0 1.0 0.98 NAFLD Female -0.02
26 6.0 0.67 0.59 NAFLD Female -0.08
26 8.0 0.43 0.42 NAFLD Female -0.01
26 10.0 0.27 0.33 NAFLD Female 0.06
26 12.0 0.17 0.22 NAFLD Female 0.05
27 0.5 0.1 0.07 NAFLD Male -0.02
27 1.0 0.17 0.18 NAFLD Male 0.02
27 2.0 0.24 0.23 NAFLD Male -0.01
27 4.0 0.27 0.3 NAFLD Male 0.02
27 6.0 0.24 0.22 NAFLD Male -0.01
27 8.0 0.19 0.17 NAFLD Male -0.01
27 10.0 0.14 0.16 NAFLD Male 0.01
27 12.0 0.11 0.11 NAFLD Male 0.0
28 0.5 0.23 0.16 Control Female -0.08
28 1.0 0.4 0.39 Control Female -0.01
28 2.0 0.58 0.57 Control Female -0.01
28 4.0 0.62 0.69 Control Female 0.07
28 6.0 0.49 0.46 Control Female -0.04
28 8.0 0.35 0.39 Control Female 0.04
28 10.0 0.23 0.18 Control Female -0.05
28 12.0 0.15 0.12 Control Female -0.03
29 0.5 0.33 0.24 Control Female -0.09
29 1.0 0.55 0.5 Control Female -0.05
29 2.0 0.8 0.86 Control Female 0.06
29 4.0 0.84 0.91 Control Female 0.07
29 6.0 0.66 0.58 Control Female -0.08
29 8.0 0.46 0.43 Control Female -0.03
29 10.0 0.3 0.33 Control Female 0.03
29 12.0 0.19 0.2 Control Female 0.01
30 0.5 0.23 0.19 Control Female -0.04
30 1.0 0.4 0.41 Control Female 0.01
30 2.0 0.6 0.6 Control Female -0.0
30 4.0 0.68 0.71 Control Female 0.03
30 6.0 0.58 0.56 Control Female -0.03
30 8.0 0.45 0.43 Control Female -0.02
30 10.0 0.33 0.36 Control Female 0.02
30 12.0 0.24 0.24 Control Female 0.0
31 0.5 0.36 0.31 Control Female -0.05
31 1.0 0.61 0.66 Control Female 0.05
31 2.0 0.85 0.82 Control Female -0.03
31 4.0 0.86 0.9 Control Female 0.05
31 6.0 0.65 0.62 Control Female -0.03
31 8.0 0.45 0.43 Control Female -0.02
31 10.0 0.3 0.31 Control Female 0.01
31 12.0 0.19 0.21 Control Female 0.02
32 0.5 0.24 0.14 NAFLD Male -0.09
32 1.0 0.4 0.41 NAFLD Male 0.01
32 2.0 0.56 0.61 NAFLD Male 0.04
32 4.0 0.57 0.58 NAFLD Male 0.02
32 6.0 0.43 0.39 NAFLD Male -0.04
32 8.0 0.29 0.28 NAFLD Male -0.01
32 10.0 0.19 0.2 NAFLD Male 0.01
32 12.0 0.12 0.14 NAFLD Male 0.03
33 0.5 0.17 0.05 NAFLD Male -0.12
33 1.0 0.28 0.23 NAFLD Male -0.06
33 2.0 0.42 0.56 NAFLD Male 0.14
33 4.0 0.45 0.42 NAFLD Male -0.03
33 6.0 0.36 0.33 NAFLD Male -0.03
33 8.0 0.26 0.24 NAFLD Male -0.02
33 10.0 0.18 0.21 NAFLD Male 0.03
33 12.0 0.12 0.14 NAFLD Male 0.02
34 0.5 0.09 0.1 NAFLD Male 0.01
34 1.0 0.16 0.19 NAFLD Male 0.03
34 2.0 0.25 0.23 NAFLD Male -0.03
34 4.0 0.32 0.32 NAFLD Male -0.0
34 6.0 0.32 0.3 NAFLD Male -0.02
34 8.0 0.28 0.3 NAFLD Male 0.02
34 10.0 0.24 0.25 NAFLD Male 0.02
34 12.0 0.2 0.18 NAFLD Male -0.02
35 0.5 0.15 0.02 NAFLD Female -0.13
35 1.0 0.27 0.14 NAFLD Female -0.14
35 2.0 0.46 0.38 NAFLD Female -0.08
35 4.0 0.64 0.8 NAFLD Female 0.16
35 6.0 0.67 0.74 NAFLD Female 0.07
35 8.0 0.63 0.61 NAFLD Female -0.02
35 10.0 0.55 0.51 NAFLD Female -0.04
35 12.0 0.46 0.42 NAFLD Female -0.04
36 0.5 0.19 0.12 NAFLD Female -0.07
36 1.0 0.32 0.36 NAFLD Female 0.04
36 2.0 0.47 0.46 NAFLD Female -0.01
36 4.0 0.53 0.57 NAFLD Female 0.04
36 6.0 0.48 0.43 NAFLD Female -0.05
36 8.0 0.41 0.39 NAFLD Female -0.01
36 10.0 0.34 0.38 NAFLD Female 0.04
36 12.0 0.28 0.27 NAFLD Female -0.01
37 0.5 0.1 0.02 NAFLD Male -0.08
37 1.0 0.17 0.1 NAFLD Male -0.08
37 2.0 0.28 0.27 NAFLD Male -0.01
37 4.0 0.36 0.44 NAFLD Male 0.08
37 6.0 0.34 0.37 NAFLD Male 0.03
37 8.0 0.29 0.28 NAFLD Male -0.02
37 10.0 0.23 0.22 NAFLD Male -0.02
37 12.0 0.18 0.15 NAFLD Male -0.03
If you use FacetGrid.map_dataframe, you can pass the arguments almost as if you were calling lineplot directly:
g = sns.FacetGrid(data = m, col='Sex', row='Group')
g.map_dataframe(sns.lineplot, x='Time', y='residual', units='Subject', estimator=None)
A potential workaround is to define a wrapper function:
g = sns.FacetGrid(data=m, col='Sex', row='Group')
def f(x, y, z, *args, **kwargs):
    return sns.lineplot(x=x, y=y, units=z, estimator=None, *args, **kwargs)
g.map(f, 'Time', 'residual', 'Subject')
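Depending on your seaborn version, the figure-level relplot interface may be simpler still, since it builds the facet grid itself and forwards extra keyword arguments (such as units and estimator) to lineplot. An untested sketch along the same lines, not one of the answers above:

g = sns.relplot(data=m, x='Time', y='residual',
                col='Sex', row='Group', kind='line',
                units='Subject', estimator=None)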

SAS - select observation at t+5

I am attempting to calculate a formula based on the price at different times (t and t+5). More specifically, P(t+5) denotes the first price observed at least 5 minutes after the price measured at t.
The following code is used to create a variable (new_price) that represents P(t+5).
data WANT;
  set HAVE nobs=nobs;
  do _i = _n_ to nobs until(other_date > date_l_);
    set HAVE(
        rename=( _ric=other_ric
                 date_l_= other_date
                 price = other_price
                 new_time = other_time)
        keep=_ric date_l_ price int1min new_time)
        point=_i;
    if other_ric=_ric and new_time > new_time+300 and other_date = date_l_ then do;
      new_price = other_price;
      leave;
    end;
  end;
  drop other_: ;
run;
However, the code does not work correctly at all times: as shown in the screenshot (not reproduced here), new_price is correct for some rows but incorrect for others. Could anyone help me solve this problem?
The following is a sample of data.
_RIC Date_L_ Time_L_ Price new_price new_time time
BAG201310900.U 20130715 9:36:19.721 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:19.721 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:22.751 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:22.751 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:24.400 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:24.400 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:28.150 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:28.150 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:45.099 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:45.099 0.27 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:48.929 0.28 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:48.929 0.28 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:49.929 0.28 0.29 9:36 9:41
BAG201310900.U 20130715 9:36:50.899 0.28 0.29 9:36 9:41
BAG201310900.U 20130715 9:37:04.839 0.27 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:04.839 0.27 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:04.848 0.27 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:07.619 0.28 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:11.619 0.28 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:11.619 0.28 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:11.619 0.28 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:12.738 0.28 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:15.528 0.28 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:30.337 0.28 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:32.717 0.28 0.29 9:37 9:42
BAG201310900.U 20130715 9:37:58.636 0.29 0.29 9:37 9:42
BAG201310900.U 20130715 9:38:04.016 0.28 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:07.326 0.28 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:07.849 0.28 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:16.005 0.3 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:18.055 0.3 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:18.055 0.3 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:18.055 0.3 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:20.025 0.3 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:21.235 0.3 0.29 9:38 9:43
BAG201310900.U 20130715 9:38:25.585 0.3 0.29 9:38 9:43
BAG201310900.U 20130715 9:40:01.475 0.29 0.22 9:40 9:45
BAG201310900.U 20130715 9:45:04.335 0.22 0.27 9:45 9:50
BAG201310900.U 20130715 9:45:04.335 0.22 0.27 9:45 9:50
BAG201310900.U 20130715 9:45:04.335 0.22 0.27 9:45 9:50
BAG201310900.U 20130715 9:45:35.966 0.24 0.27 9:45 9:50
BAG201310900.U 20130715 9:51:13.808 0.27 0.19 9:51 9:56
BAG201310900.U 20130715 9:52:41.409 0.27 0.19 9:52 9:57
BAG201310900.U 20130715 9:53:32.730 0.28 0.19 9:53 9:58
BAG201310900.U 20130715 9:53:33.250 0.29 0.19 9:53 9:58
BAG201310900.U 20130715 9:53:36.580 0.26 0.19 9:53 9:58
BAG201310900.U 20130715 9:53:36.580 0.26 0.19 9:53 9:58
BAG201310900.U 20130715 9:53:36.580 0.26 0.19 9:53 9:58
BAG201310900.U 20130715 9:53:36.580 0.26 0.19 9:53 9:58
BAG201310900.U 20130715 9:53:36.580 0.26 0.19 9:53 9:58
BAG201310900.U 20130715 9:53:36.580 0.26 0.19 9:53 9:58
BAG201310900.U 20130715 9:54:00.601 0.25 0.19 9:54 9:59
BAG201310900.U 20130715 9:54:24.842 0.24 0.19 9:54 9:59
BAG201310900.U 20130715 9:57:42.068 0.19 0.24 9:57 10:02
BAG201310900.U 20130715 9:57:42.068 0.19 0.24 9:57 10:02
BAG201310900.U 20130715 9:57:42.068 0.19 0.24 9:57 10:02
BAG201310900.U 20130715 10:02:36.960 0.24 0.26 10:02 10:07
BAG201310900.U 20130715 10:06:46.735 0.26 0.24 10:06 10:11
BAG201310900.U 20130715 10:08:28.588 0.23 0.24 10:08 10:13
BAG201310900.U 20130715 10:09:13.008 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:13.008 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:13.008 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:13.008 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:13.008 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:13.018 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:22.508 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:22.508 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:22.528 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:09:34.628 0.24 0.24 10:09 10:14
BAG201310900.U 20130715 10:10:03.840 0.24 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:04.939 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:04.960 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:04.989 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:06.079 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:06.090 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:06.090 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:08.850 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:08.899 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:08.920 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:10:10.090 0.25 0.24 10:10 10:15
BAG201310900.U 20130715 10:46:08.210 0.24 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:22.842 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 10:46:25.331 0.23 0.22 10:46 10:51
BAG201310900.U 20130715 11:14:40.903 0.22 0.22 11:14 11:19
BAG201310900.U 20130715 11:26:52.196 0.22 0.25 11:26 11:31
BAG201310900.U 20130715 11:44:43.190 0.25 0.27 11:44 11:49
BAG201310900.U 20130715 11:44:43.211 0.25 0.27 11:44 11:49
BAG201310900.U 20130715 11:44:43.211 0.25 0.27 11:44 11:49
BAG201310900.U 20130715 11:44:43.211 0.25 0.27 11:44 11:49
BAG201310900.U 20130715 11:49:14.152 0.27 0.31 11:49 11:54
BAG201310900.U 20130715 12:09:12.418 0.31 0.3 12:09 12:14
BAG201310900.U 20130715 12:09:12.418 0.31 0.3 12:09 12:14
BAG201310900.U 20130715 12:09:12.418 0.31 0.3 12:09 12:14
BAG201310900.U 20130715 12:13:27.376 0.3 0.3 12:13 12:18
BAG201310900.U 20130715 12:14:48.365 0.3 0.3 12:14 12:19
BAG201310900.U 20130715 12:17:28.263 0.3 0.29 12:17 12:22
BAG201310900.U 20130715 12:17:43.893 0.3 0.29 12:17 12:22
BAG201310900.U 20130715 12:48:50.960 0.29 0.29 12:48 12:53
BAG201310900.U 20130715 12:49:59.878 0.29 0.29 12:49 12:54
BAG201310900.U 20130715 12:49:59.878 0.29 0.29 12:49 12:54
BAG201310900.U 20130715 12:49:59.898 0.29 0.29 12:49 12:54
BAG201310900.U 20130715 12:49:59.898 0.29 0.29 12:49 12:54
BAG201310900.U 20130715 12:49:59.898 0.29 0.29 12:49 12:54
BAG201310900.U 20130715 12:49:59.898 0.29 0.29 12:49 12:54
BAG201310900.U 20130715 12:49:59.898 0.29 0.29 12:49 12:54
I don't think using random access is going to be a good solution here, especially not repeated random access. A better solution is probably to load a hash table with your data for each day (as it looks like you have many rows for each day), then use a hash iterator to find the t+300 row. The question doesn't provide sample data in a form I can run against, so I can't really give you full code, but the pseudocode is something like:
data want;
  set have;
  by _ric date_l_;
  if _n_=1 then do;
    * declare a hash table that is empty but has the structure of your HAVE dataset;
    * declare a hash iterator for that table;
  end;
  if first.date_l_ then do;
    * load the hash table with that date's rows;
  end;
  * find the current row in the hash table;
  * now iterate over the hash table from that row until you get to the end or you get a t+300 row;
  * if you got a t+300 row, then you have what you want; otherwise you're too far into the day
    and can stop looking - and probably should tell the data step to just skip the rest of that day's records;
  if last.date_l_ then do;
    * empty/delete the hash table;
  end;
run;
"More specifically, P(t+5) [is] the first price observed at least 5 minutes after the price which is measured."
This example shows how a reflexive (self) SQL join can pick up and use the row at the earliest future timemark. The approach requires a distinct time/value price stream, which the sample data is not, so the example dedupes it first for demonstration purposes.
data have;
attrib
_RIC length=$20
Date_L_ informat=yymmdd10. format=yymmdd10.
Time_L_ informat=time15.3 format=time15.3
price length=8
;
infile datalines missover;
input _RIC Date_L_ Time_L_ Price;
timemark = dhms(date_l_, 0,0,0) + time_l_;
format timemark datetime21.3;
datalines;
BAG201310900.U 20130715 9:36:19.721 0.27
BAG201310900.U 20130715 9:36:19.721 0.27
BAG201310900.U 20130715 9:36:22.751 0.27
BAG201310900.U 20130715 9:36:22.751 0.27
BAG201310900.U 20130715 9:36:24.400 0.27
BAG201310900.U 20130715 9:36:24.400 0.27
BAG201310900.U 20130715 9:36:28.150 0.27
BAG201310900.U 20130715 9:36:28.150 0.27
BAG201310900.U 20130715 9:36:45.099 0.27
BAG201310900.U 20130715 9:36:45.099 0.27
BAG201310900.U 20130715 9:36:48.929 0.28
BAG201310900.U 20130715 9:36:48.929 0.28
BAG201310900.U 20130715 9:36:49.929 0.28
BAG201310900.U 20130715 9:36:50.899 0.28
BAG201310900.U 20130715 9:37:04.839 0.27
BAG201310900.U 20130715 9:37:04.839 0.27
BAG201310900.U 20130715 9:37:04.848 0.27
BAG201310900.U 20130715 9:37:07.619 0.28
BAG201310900.U 20130715 9:37:11.619 0.28
BAG201310900.U 20130715 9:37:11.619 0.28
BAG201310900.U 20130715 9:37:11.619 0.28
BAG201310900.U 20130715 9:37:12.738 0.28
BAG201310900.U 20130715 9:37:15.528 0.28
BAG201310900.U 20130715 9:37:30.337 0.28
BAG201310900.U 20130715 9:37:32.717 0.28
BAG201310900.U 20130715 9:37:58.636 0.29
BAG201310900.U 20130715 9:38:04.016 0.28
BAG201310900.U 20130715 9:38:07.326 0.28
BAG201310900.U 20130715 9:38:07.849 0.28
BAG201310900.U 20130715 9:38:16.005 0.3
BAG201310900.U 20130715 9:38:18.055 0.3
BAG201310900.U 20130715 9:38:18.055 0.3
BAG201310900.U 20130715 9:38:18.055 0.3
BAG201310900.U 20130715 9:38:20.025 0.3
run;
Dedupe
proc sort data=have nodupkey;
by _all_;
run;
Reflexive join (aka self-join)
proc sql;
create table want as
select
have._RIC
, have.timemark
, have.price
, future.timemark as timemark_at_5m_threshold
, future.price as price_at_5m_threshold
, future.timemark - have.timemark as interval_at_5m_threshold
from
have
left join
have as future
on
have._RIC = future._RIC
and future.timemark > have.timemark + 50 /* 50 seconds because sample data only covers 2 minutes */
group by
have._RIC, have.timemark
having
/* first of all future matches
* - this is why you want discrete timemarks
* when timemark has dups you would have multiple rows with same min
* and replication in result set
*/
future.timemark = min(future.timemark)
/* NOTE: an expression with a non-aggregate reference and an
* aggregate reference causes Proc SQL to automatically remerge.
* That is a good thing. Log will show
* NOTE: The query requires remerging summary statistics back with the original data.
*/
;
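Outside SAS entirely, the same "first price at least five minutes later" lookup can also be sketched with pandas.merge_asof. This is only an illustration of the target result on a few rows of the sample data above, not one of the answers:

import pandas as pd

# A few rows shaped like the sample data above (hypothetical frame).
quotes = pd.DataFrame({
    "_RIC": ["BAG201310900.U"] * 4,
    "timemark": pd.to_datetime([
        "2013-07-15 09:36:19.721",
        "2013-07-15 09:40:01.475",
        "2013-07-15 09:45:04.335",
        "2013-07-15 09:51:13.808",
    ]),
    "price": [0.27, 0.29, 0.22, 0.27],
}).sort_values("timemark")

# Shift each row's time forward by 5 minutes, then take the first quote
# observed at or after that shifted time (direction="forward").
lookup = quotes.rename(columns={"timemark": "future_timemark", "price": "future_price"})
quotes["t_plus_5"] = quotes["timemark"] + pd.Timedelta(minutes=5)
want = pd.merge_asof(
    quotes.sort_values("t_plus_5"),
    lookup.sort_values("future_timemark"),
    left_on="t_plus_5",
    right_on="future_timemark",
    by="_RIC",
    direction="forward",
)
print(want[["timemark", "price", "future_timemark", "future_price"]])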

Pandas: Solving for the threshold of highest values for a time-series dataset

Givens: I have a set of time-series data for a day, say 96 values. I have a cumulative value, say 101 units over a given period.
Problem: I need to find the threshold, X, where the values above that threshold sum up to the given cumulative value, 101. In the chart (not reproduced here):
- the black line is the desired threshold X
- the red area under the curve is the given cumulative value (101)
- the blue line is the time-series data
Constraints: I have to perform this calculation many times (for each day of the year) so avoiding iterations would be preferred, but not necessary.
Sample Data:
DateTime Usage_KWH
1/1/2015 0:15 10.32
1/1/2015 0:30 10.56
1/1/2015 0:45 9.84
1/1/2015 1:00 9.36
1/1/2015 1:15 10.32
1/1/2015 1:30 9.6
1/1/2015 1:45 9.6
1/1/2015 2:00 10.32
1/1/2015 2:15 9.84
1/1/2015 2:30 9.6
1/1/2015 2:45 10.08
1/1/2015 3:00 9.36
1/1/2015 3:15 9.84
1/1/2015 3:30 10.32
1/1/2015 3:45 9.84
1/1/2015 4:00 9.84
1/1/2015 4:15 10.08
1/1/2015 4:30 9.6
1/1/2015 4:45 9.6
1/1/2015 5:00 10.8
1/1/2015 5:15 9.6
1/1/2015 5:30 9.84
1/1/2015 5:45 14.76
1/1/2015 6:00 14.4
1/1/2015 6:15 14.76
1/1/2015 6:30 15.12
1/1/2015 6:45 14.4
1/1/2015 7:00 14.4
1/1/2015 7:15 14.04
1/1/2015 7:30 12.96
1/1/2015 7:45 14.04
1/1/2015 8:00 12.6
1/1/2015 8:15 12.96
1/1/2015 8:30 14.04
1/1/2015 8:45 12.96
1/1/2015 9:00 17.28
1/1/2015 9:15 17.28
1/1/2015 9:30 17.76
1/1/2015 9:45 17.28
1/1/2015 10:00 17.76
1/1/2015 10:15 16.8
1/1/2015 10:30 17.28
1/1/2015 10:45 19.68
1/1/2015 11:00 17.28
1/1/2015 11:15 16.8
1/1/2015 11:30 16.8
1/1/2015 11:45 17.28
1/1/2015 12:00 16.8
1/1/2015 12:15 17.28
1/1/2015 12:30 17.28
1/1/2015 12:45 16.8
1/1/2015 13:00 17.28
1/1/2015 13:15 16.8
1/1/2015 13:30 16.8
1/1/2015 13:45 17.28
1/1/2015 14:00 25.92
1/1/2015 14:15 25.2
1/1/2015 14:30 25.2
1/1/2015 14:45 25.2
1/1/2015 15:00 25.2
1/1/2015 15:15 25.92
1/1/2015 15:30 25.2
1/1/2015 15:45 25.92
1/1/2015 16:00 25.92
1/1/2015 16:15 23.76
1/1/2015 16:30 23.76
1/1/2015 16:45 23.76
1/1/2015 17:00 24.48
1/1/2015 17:15 25.92
1/1/2015 17:30 8.88
1/1/2015 17:45 9.12
1/1/2015 18:00 8.88
1/1/2015 18:15 9.6
1/1/2015 18:30 8.88
1/1/2015 18:45 9.12
1/1/2015 19:00 9.12
1/1/2015 19:15 9.6
1/1/2015 19:30 9.12
1/1/2015 19:45 8.88
1/1/2015 20:00 9.12
1/1/2015 20:15 9.36
1/1/2015 20:30 9.12
1/1/2015 20:45 8.88
1/1/2015 21:00 6
1/1/2015 21:15 6
1/1/2015 21:30 6
1/1/2015 21:45 4
1/1/2015 22:00 5
1/1/2015 22:15 6
1/1/2015 22:30 7
1/1/2015 22:45 5
1/1/2015 23:00 7
1/1/2015 23:15 4
1/1/2015 23:30 6
1/1/2015 23:45 5
My crappy iterative code:
time_series_df = pd.DataFrame(time_series_list)
#Iterative approach taking 10 steps
for x in (time_series_df.max, time_series_df.min, -(time_series_df.max)/10):
#Getting values above an arbitrary threshold
temp = time_series_df.query('Usage_KWH > #x')
#If the difference above threshold and aggregate sum for the day are less than given cumulative value then try again
if time_series_df.sum - temp < 101:
final_threshold = temp
#print the highest value that did not exceed 101
print('final answer', final_threshold)
Extra: I have tried using variations of clip_upper, rank, cumsum, quantile, and nlargest. I am using pandas 0.18
The trick here is to sort your data. This is one way to do it; it could likely be improved for speed:
import numpy as np

df2 = df.sort_values(['Usage_KWH'], ascending=[False]).reset_index()
df2['KWHcum'] = df2['Usage_KWH'].cumsum() / (df2.index + 1)                    # running mean of the top i+1 values
df2["dif"] = np.round(df2['KWHcum'] - df2['Usage_KWH'], 3) * (df2.index + 1)   # area above this row's value, over the rows at or above it
df2
# index DateTime Usage_KWH KWHcum dif
# 0 1/1/2015 14:00 25.92 25.920000 0.0000
# 1 1/1/2015 16:00 25.92 25.920000 0.0000
# 2 1/1/2015 15:45 25.92 25.920000 0.0000
# 3 1/1/2015 15:15 25.92 25.920000 0.0000
# 4 1/1/2015 17:15 25.92 25.920000 0.0000
# 5 1/1/2015 14:45 25.20 25.800000 3.6000
# 6 1/1/2015 14:15 25.20 25.714286 3.6001
# 7 1/1/2015 15:30 25.20 25.650000 3.6000
# 8 1/1/2015 14:30 25.20 25.600000 3.6000
# 9 1/1/2015 15:00 25.20 25.560000 3.6000
# 10 1/1/2015 17:00 24.48 25.461818 10.7998
# 11 1/1/2015 16:30 23.76 25.320000 18.7200
# 12 1/1/2015 16:45 23.76 25.200000 18.7200
# 13 1/1/2015 16:15 23.76 25.097143 18.7194
# 14 1/1/2015 10:45 19.68 24.736000 75.8400
# 15 1/1/2015 9:30 17.76 24.300000 104.6400
# 16 1/1/2015 10:00 17.76 23.915294 104.6401
# 17 1/1/2015 11:00 17.28 23.546667 112.8006
# 18 1/1/2015 9:45 17.28 23.216842 112.7992
# 19 1/1/2015 12:30 17.28 22.920000 112.8000
# 20 1/1/2015 10:30 17.28 22.651429 112.7994
# 21 1/1/2015 12:15 17.28 22.407273 112.8006
# 22 1/1/2015 13:00 17.28 22.184348 112.7989
# 23 1/1/2015 11:45 17.28 21.980000 112.8000
# 24 1/1/2015 13:45 17.28 21.792000 112.8000
# 25 1/1/2015 9:00 17.28 21.618462 112.8010
# 26 1/1/2015 9:15 17.28 21.457778 112.8006
# 27 1/1/2015 11:15 16.80 21.291429 125.7592
# 28 1/1/2015 11:30 16.80 21.136552 125.7614
# 29 1/1/2015 10:15 16.80 20.992000 125.7600
# filter the full sorted frame each time
df3 = df2[df2['dif'] < 101]
print(df3['Usage_KWH'].tail(1))
# 14    19.68
# Name: Usage_KWH, dtype: float64
df3 = df2[df2['dif'] < 141]
print(df3['Usage_KWH'].tail(1))
# 33    16.8
# Name: Usage_KWH, dtype: float64
I don't know what Pandas is, but here's a solution. Let the n numbers be in an array y[] (0-indexed), and the target area (e.g. 101) be A:
Sort y[] in decreasing order. (Note that for the purpose of choosing a threshold, it doesn't matter at all what order the individual values are in.)
Set the running area total t = 0. Also set old_t = 0.
Set i = 0. For now we'll assume that we will set the threshold to y[i]; since i = 0 initially, that means we're initially setting the threshold exactly equal to the highest element. As i increases, our tentative threshold y[i] will get lower and our running area total t will increase.
While t < A and i < n-1:
i = i + 1
old_t = t
t = t + i * (y[i-1] - y[i])
If t < A then report that the threshold cannot be made low enough to produce an area of A above it (even the area above the smallest value is still below A), and stop.
Otherwise, if t = A then report y[i] as the threshold, and stop.
Otherwise, it must be that t > A, meaning that we've gone too low -- we need to set the threshold somewhere between y[i-1] and y[i]:
We want to solve the equation A = old_t + i * (y[i-1] - x) for the desired threshold level x. That means:
Report y[i-1] - (A - old_t) / i as the threshold, and stop.
The running time of this algorithm is dominated by the sort in the first step, which is O(n log n), so it will take milliseconds even for n in the millions.
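A direct Python translation of these steps (a minimal sketch, lightly adapted from the description above; the names are mine):

import numpy as np

def area_threshold(values, area):
    # Return the threshold x such that the area of the data above x equals `area`,
    # or None if even the area above the smallest value falls short.
    y = np.sort(np.asarray(values, dtype=float))[::-1]   # decreasing order
    n = len(y)
    t = 0.0          # running area above the tentative threshold y[i]
    old_t = 0.0
    i = 0
    while t < area and i < n - 1:
        i += 1
        old_t = t
        # Lowering the threshold from y[i-1] to y[i] adds a strip of
        # width i (the values already above) and height y[i-1] - y[i].
        t += i * (y[i - 1] - y[i])
    if t < area:
        return None
    if t == area:
        return y[i]
    # Overshot: solve area = old_t + i * (y[i-1] - x) for x.
    return y[i - 1] - (area - old_t) / i

Using the sample day above with A = 101, the 'dif' column in the earlier answer suggests the exact threshold should land a little above 18 (between the 17.76 and 19.68 readings).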

Parse iostat output

I need to extract only certain lines from a log file generated with iostat. The command is iostat -x 1 -m > disk.log and it saves a file like this:
Linux 2.6.32-358.18.1.el6.x86_64 (parekosam) 11/26/2013 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.01 0.14 0.28 0.00 99.53
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 72.44 6.67 4.15 0.34 0.33 0.03 162.23 0.02 3.92 1.77 0.80
dm-0 0.00 0.00 1.30 6.96 0.03 0.03 15.11 0.65 78.37 0.69 0.57
dm-1 0.00 0.00 0.07 0.00 0.00 0.00 7.99 0.00 2.57 0.67 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 1.01 0.00 98.99
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 5.00 0.00 3.00 0.00 0.03 18.67 0.03 10.67 10.67 3.20
dm-0 0.00 0.00 0.00 7.00 0.00 0.03 8.00 0.04 5.29 4.57 3.20
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I'd like to show only the rMB/s and wMB/s columns so that I can calculate average speeds. I've tried a few things with sed and awk but with little success. The ideal output should look like this:
12.27 10.23
11.27 10.22
15.26 20.23
12.24 10.25
12.26 50.23
12.23 10.26
13.23 23.23
12.22 10.23
12.23 10.23
22.23 14.27
13.21 10.23
12.23 10.23
14.22 10.23
12.23 10.21
Please notice this is for 'sda' only.
iostat -x 1 -m | awk '/sda/ { print $6, $7}'
Does this do what you want?
/^$/ {a=""}
a {print $6,$7}
/^Device/ {a=1}
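Note that this second script prints the columns for every device block, not just sda. If the end goal is the average speeds, a small Python sketch over the saved disk.log (my own illustration, assuming the iostat -x -m column layout shown above) can both filter sda and do the averaging:

# Pull the sda rMB/s and wMB/s columns out of disk.log and average them.
rmb, wmb = [], []
with open("disk.log") as log:
    for line in log:
        fields = line.split()
        if fields and fields[0] == "sda":
            rmb.append(float(fields[5]))   # rMB/s is the 6th column
            wmb.append(float(fields[6]))   # wMB/s is the 7th column

if rmb:
    print("avg rMB/s: %.2f  avg wMB/s: %.2f"
          % (sum(rmb) / len(rmb), sum(wmb) / len(wmb)))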

Uneven CPU load distribution

I have a system with uneven CPU load in an odd pattern. It's serving up Apache, Elasticsearch, Redis, and email.
Here's the mpstat output. Notice how %usr for the last 12 cores is well below that of the first 12.
# mpstat -P ALL
Linux 3.5.0-17-generic (<server1>) 02/16/2013 _x86_64_ (24 CPU)
10:21:46 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10:21:46 PM all 17.15 0.00 2.20 0.33 0.00 0.09 0.00 0.00 80.23
10:21:46 PM 0 27.34 0.00 4.08 0.56 0.00 0.53 0.00 0.00 67.48
10:21:46 PM 1 24.51 0.00 3.25 0.53 0.00 0.34 0.00 0.00 71.38
10:21:46 PM 2 26.69 0.00 4.20 0.50 0.00 0.24 0.00 0.00 68.36
10:21:46 PM 3 24.38 0.00 3.04 0.70 0.00 0.23 0.00 0.00 71.65
10:21:46 PM 4 24.50 0.00 4.04 0.57 0.00 0.15 0.00 0.00 70.74
10:21:46 PM 5 21.75 0.00 2.80 0.74 0.00 0.15 0.00 0.00 74.55
10:21:46 PM 6 28.30 0.00 3.75 0.84 0.00 0.04 0.00 0.00 67.07
10:21:46 PM 7 30.20 0.00 3.94 0.16 0.00 0.03 0.00 0.00 65.67
10:21:46 PM 8 30.55 0.00 4.09 0.12 0.00 0.03 0.00 0.00 65.21
10:21:46 PM 9 32.66 0.00 3.40 0.09 0.00 0.03 0.00 0.00 63.81
10:21:46 PM 10 32.20 0.00 3.57 0.08 0.00 0.03 0.00 0.00 64.12
10:21:46 PM 11 32.08 0.00 3.92 0.08 0.00 0.03 0.00 0.00 63.88
10:21:46 PM 12 4.53 0.00 0.41 0.34 0.00 0.04 0.00 0.00 94.68
10:21:46 PM 13 9.14 0.00 1.42 0.32 0.00 0.04 0.00 0.00 89.08
10:21:46 PM 14 5.92 0.00 0.70 0.35 0.00 0.06 0.00 0.00 92.97
10:21:46 PM 15 6.14 0.00 0.66 0.35 0.00 0.04 0.00 0.00 92.81
10:21:46 PM 16 7.39 0.00 0.65 0.34 0.00 0.04 0.00 0.00 91.57
10:21:46 PM 17 6.60 0.00 0.83 0.39 0.00 0.05 0.00 0.00 92.13
10:21:46 PM 18 5.49 0.00 0.54 0.30 0.00 0.01 0.00 0.00 93.65
10:21:46 PM 19 6.78 0.00 0.88 0.21 0.00 0.01 0.00 0.00 92.12
10:21:46 PM 20 6.17 0.00 0.58 0.11 0.00 0.01 0.00 0.00 93.13
10:21:46 PM 21 5.78 0.00 0.82 0.10 0.00 0.01 0.00 0.00 93.29
10:21:46 PM 22 6.29 0.00 0.60 0.10 0.00 0.01 0.00 0.00 93.00
10:21:46 PM 23 6.18 0.00 0.61 0.10 0.00 0.01 0.00 0.00 93.10
I have another system, a database server running MySQL, which shows an even distribution.
# mpstat -P ALL
Linux 3.5.0-17-generic (<server2>) 02/16/2013 _x86_64_ (32 CPU)
10:27:57 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10:27:57 PM all 0.77 0.00 0.07 0.68 0.00 0.00 0.00 0.00 98.47
10:27:57 PM 0 2.31 0.00 0.19 1.86 0.00 0.01 0.00 0.00 95.63
10:27:57 PM 1 1.73 0.00 0.17 1.87 0.00 0.01 0.00 0.00 96.21
10:27:57 PM 2 2.62 0.00 0.25 2.51 0.00 0.01 0.00 0.00 94.62
10:27:57 PM 3 1.60 0.00 0.17 1.99 0.00 0.01 0.00 0.00 96.23
10:27:57 PM 4 1.86 0.00 0.16 1.84 0.00 0.01 0.00 0.00 96.13
10:27:57 PM 5 2.30 0.00 0.25 2.45 0.00 0.01 0.00 0.00 94.99
10:27:57 PM 6 2.05 0.00 0.20 1.89 0.00 0.01 0.00 0.00 95.86
10:27:57 PM 7 2.13 0.00 0.20 2.31 0.00 0.01 0.00 0.00 95.36
10:27:57 PM 8 0.82 0.00 0.11 4.05 0.00 0.03 0.00 0.00 94.99
10:27:57 PM 9 0.70 0.00 0.18 0.06 0.00 0.00 0.00 0.00 99.06
10:27:57 PM 10 0.18 0.00 0.04 0.01 0.00 0.00 0.00 0.00 99.77
10:27:57 PM 11 0.20 0.00 0.01 0.01 0.00 0.00 0.00 0.00 99.78
10:27:57 PM 12 0.13 0.00 0.01 0.01 0.00 0.00 0.00 0.00 99.86
10:27:57 PM 13 0.04 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.95
10:27:57 PM 14 0.03 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.97
10:27:57 PM 15 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.97
10:27:57 PM 16 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.94
10:27:57 PM 17 0.41 0.00 0.10 0.04 0.00 0.00 0.00 0.00 99.45
10:27:57 PM 18 2.78 0.00 0.06 0.14 0.00 0.00 0.00 0.00 97.01
10:27:57 PM 19 1.19 0.00 0.08 0.19 0.00 0.00 0.00 0.00 98.53
10:27:57 PM 20 0.48 0.00 0.04 0.30 0.00 0.00 0.00 0.00 99.17
10:27:57 PM 21 0.70 0.00 0.03 0.16 0.00 0.00 0.00 0.00 99.11
10:27:57 PM 22 0.08 0.00 0.01 0.02 0.00 0.00 0.00 0.00 99.90
10:27:57 PM 23 0.30 0.00 0.02 0.06 0.00 0.00 0.00 0.00 99.62
10:27:57 PM 24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 25 0.04 0.00 0.03 0.00 0.00 0.00 0.00 0.00 99.94
10:27:57 PM 26 0.06 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.93
10:27:57 PM 27 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.98
10:27:57 PM 28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
10:27:57 PM 29 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
Both are dedicated systems running Ubuntu 12.10 (not virtual).
I've thought and read up about setting nice, using taskset, or trying to tweak the scheduler, but I don't want to make any rash decisions. Also, this system isn't performing badly per se; I just want to ensure all cores are being utilized properly.
Let me know if I can provide additional information. Any suggestions to even the CPU load on "server1" are greatly appreciated.
This is not a problem until some cores hit 100% while others don't; in the statistics you've shown, nothing suggests that the uneven distribution is negatively affecting performance. In your case, you probably have quite a few processes that distribute evenly, producing a base load of 6-10% on each core, plus roughly 12 more threads that each require 10-20% of a core. You can't split a single process/thread between cores.
