tracing a linux kernel, function-by function (biggest only) with us timer

tracing a linux kernel, function-by function (biggest only) with us timer - linux-kernel

I want to know, how does the linux kernel do some stuff (receiving a tcp packet). In what order main tcp functions are called. I want to see both interrupt handler (top half), bottom half and even work done by kernel after user calls "read()".
How can I get a function trace from kernel with some linear time scale?
I want to get a trace from single packet, not the profile of kernel when receiving 1000th of packets.
Kernel is 2.6.18 or 2.6.23 (supported in my debian). I can add some patches to it.

I think that the closest tool which atleast partially can achieve what you want is kernel ftrace. Here is a sample usage:
root#ansis-xeon:/sys/kernel/debug/tracing# cat available_tracers
blk function_graph mmiotrace function sched_switch nop
root#ansis-xeon:/sys/kernel/debug/tracing# echo 1 > ./tracing_on
root#ansis-xeon:/sys/kernel/debug/tracing# echo function_graph > ./current_trace
root#ansis-xeon:/sys/kernel/debug/tracing# cat trace
3) 0.379 us | __dequeue_entity();
3) 1.552 us | }
3) 0.300 us | hrtick_start_fair();
3) 2.803 us | }
3) 0.304 us | perf_event_task_sched_out();
3) 0.287 us | __phys_addr();
3) 0.382 us | native_load_sp0();
3) 0.290 us | native_load_tls();
------------------------------------------
3) <idle>-0 => ubuntuo-2079
------------------------------------------
3) 0.509 us | __math_state_restore();
3) | finish_task_switch() {
3) 0.337 us | perf_event_task_sched_in();
3) 0.971 us | }
3) ! 100015.0 us | }
3) | hrtimer_cancel() {
3) | hrtimer_try_to_cancel() {
3) | lock_hrtimer_base() {
3) 0.327 us | _spin_lock_irqsave();
3) 0.897 us | }
3) 0.305 us | _spin_unlock_irqrestore();
3) 2.185 us | }
3) 2.749 us | }
3) ! 100022.5 us | }
3) ! 100023.2 us | }
3) 0.704 us | fget_light();
3) 0.522 us | pipe_poll();
3) 0.342 us | fput();
3) 0.476 us | fget_light();
3) 0.467 us | pipe_poll();
3) 0.292 us | fput();
3) 0.394 us | fget_light();
3) | inotify_poll() {
3) | mutex_lock() {
3) 0.285 us | _cond_resched();
3) 1.134 us | }
3) 0.289 us | fsnotify_notify_queue_is_empty();
3) | mutex_unlock() {
3) 2.987 us | }
3) 0.292 us | fput();
3) 0.517 us | fget_light();
3) 0.415 us | pipe_poll();
3) 0.292 us | fput();
3) 0.504 us | fget_light();
3) | sock_poll() {
3) 0.480 us | unix_poll();
3) 4.224 us | }
3) 0.183 us | fput();
3) 0.341 us | fget_light();
3) | sock_poll() {
3) 0.274 us | unix_poll();
3) 0.731 us | }
3) 0.182 us | fput();
3) 0.269 us | fget_light();
It is not perfect because it does not print function parameters and misses some static functions, but you can get the fell of who is calling who inside the kernel.
If this is not enough then use GDB. But as you might already know setting up GDB for kernel debugging is not as easy as it is for user space processes. I prefer to use GDB+qemu if ever needed.
Happy tracing!
Update: On later Linux distributions I suggest to use trace-cmd command line tool that is "wrapper" around /sys/kernel/debug/tracing. trace-cmd is so much more intuitive to use than the raw interface kernel provides.

You want oprofile. It can give you timings for (a selected subset of) your entire system, which means you can trace network activity from the device to the application and back again, through the kernel and all the libraries.

I can't immediately see a way to trace only a single packet at a time. In particular, it may be difficult to get such fine grained tracing since the upper-half of an interrupt handler shouldn't be doing anything blocking (this is dead-lock prone).
Maybe this is overly pedantic, but have you taken a look at the source code? I know from experience the TCP layer is very well commented, both in recording intent and referencing the RFCs.
I can highly recommend TCP/IP Illustrated, esp Volume 2, the Implementation (website). Its very good, and walks through the code line by line. From the description, "Combining 500 illustrations with 15,000 lines of real, working code...". The source code included is from the BSD kernel, but the stacks are extremely similar, and comparing the two is often instructive.

This question is old and probably not relevant to the original poster, but a nice trick I used lately which was helpful to me was to set the "mark" field of the sk_buf to some value and only "printk" if the value matches.
So for example, if you know where the top half IRQ handler (as the question suggests), then you can hard code some check (e.g. tcp port, source IP, source MAC address, well you get the point) and set the mark to some arbitrary value (e.g. skb->mark = 0x9999).
Then all the way up you only printk if the mark has the same value. As long as nobody else changes your mark (which as far as I could see is usually the case in typical settings) then you'll only see the packets you're interested in.
Since most interesting functions get the skb, it works for almost everything that could be interesting.

Related

Comparing Two Splunk Events To See Which One Is Larger

I'm using a transaction to see how long a device is RFM mode and the duration field increases with each table row. How I think it should work is that while the field is 'yes' it would calculate the duration that all events equal 'yes', but I have a lot of superfluous data that shouldn't be there IMO.
I only want to keep the largest duration event so I want to compare the current events duration to the next events duration and if its smaller than the current event, keep the current event.
index=crowdstrike sourcetype=crowdstrike:device:json
| transaction falcon_device.hostname startswith="falcon_device.reduced_functionality_mode=yes" endswith="falcon_device.reduced_functionality_mode=no"
| table _time duration
_time
duration
2022-10-28 06:07:45
888198
2022-10-28 05:33:44
892400
2022-10-28 04:57:44
896360
2022-08-22 18:25:53
3862
2022-08-22 18:01:53
7703
2022-08-22 17:35:53
11543
In the data above the duration goes from 896360 to 3862, and can happen on any date, and the duration runs in cycles like that where it increases until it starts over. So in the comparison I would keep the event at the 10-28 inflection point and so on at all other inflection points throughout the dataset.
How would I construct that multi event comparison?

By definition, the transaction command bundles together all events with the same hostname value, starting with the first "yes" and ending with the first "no". There is no option to include events by size, but there are options that govern the maximum time span of a transaction (maxspan), how many events can be in a transaction (maxevents), and how long the time between events can be (maxpause). That the duration value you want to keep (896360) is 10 days even though the previous transaction was only 36 minutes before it makes me wonder about the logic being used in this query. Consider using some of the options available to better define a "transaction".
What problem are you trying to solve with this query? It's possible there's another solution that doesn't use transaction (which is very non-performant).

Sans sample data, something like the following will probably work:
index=crowdstrike sourcetype=crowdstrike:device:json falcon_device.hostname=* falcon_device.reduced_functionality_mode=yes
| stats max(_time) as yestime by falcon_device.hostname
| append
[| search index=crowdstrike sourcetype=crowdstrike:device:json falcon_device.hostname=* falcon_device.reduced_functionality_mode=no
| stats max(_time) as notime by falcon_device.hostname ]
| stats values(*) as * by falcon_device.hostname
| eval elapsed_seconds=yestime-notime

Thanks for your answers but it wasn't working out. I ended up talking to some professional splunkers and got the below as a solution.
index=crowdstrike sourcetype=crowdstrike:device:json
| addinfo ```adds info_max_time```
| fields + _time, falcon_device.reduced_functionality_mode falcon_device.hostname info_max_time
| rename falcon_device.reduced_functionality_mode AS mode, falcon_device.hostname AS Hostname
| sort 0 + Hostname, -_time ``` events are not always returned in descending order per hostname, which would break streamstats```
| streamstats current=f last(mode) as new_mode last(_time) as time_change by Hostname ```compute potential time of state change```
| eval new_mode=coalesce(new_mode,mode."+++"), time_change=coalesce(time_change,info_max_time) ```take care of boundaries of search```
| table _time, Hostname, mode, new_mode, time_change
| where mode!=new_mode ```keep only state change events```
| streamstats current=f last(time_change) AS change_end by Hostname ```add start time of the next state as change_end time for the current state```
| fieldformat time_change=strftime(time_change, "%Y-%m-%d %T")
| fieldformat change_end=strftime(change_end, "%Y-%m-%d %T")
``` uncomment the following to sort by duration```
```| search change_end=* AND new_mode="yes"
| eval duration = round( (change_end-time_change)/(3600),1)
| table time_change, Hostname, new_mode, duration
| sort -duration```

What part does priority play in round robin scheduling?

I am trying to solve the following homework problem for an operating systems class:
The following processes are being scheduled using a preemptive, round robin scheduling algorithm. Each process is assigned a numerical priority, with a higher number indicating a higher relative priority.
In addition to the processes listed below, the system also has an idle task (which consumes no CPU resources and is identified as Pidle ). This task has priority 0 and is scheduled whenever the system has no other available processes to run.
The length of a time quantum is 10 units.
If a process is preempted by a higher-priority process, the preempted process is placed at the end of the queue.
+--+--------+----------+-------+---------+
| | Thread | Priority | Burst | Arrival |
+--+--------+----------+-------+---------+
| | P1 | 40 | 15 | 0 |
| | P2 | 30 | 25 | 25 |
| | P3 | 30 | 20 | 30 |
| | P4 | 35 | 15 | 50 |
| | P5 | 5 | 15 | 100 |
| | P6 | 10 | 10 | 105 |
+--+--------+----------+-------+---------+
a. Show the scheduling order of the processes using a Gantt chart.
b. What is the turnaround time for each process?
c. What is the waiting time for each process?
d. What is the CPU utilization rate?
My question is --- What role does priority play when we're considering that this uses the round robin algorithm? I have been thinking about it a lot what I have come up with is that it only makes sense if the priority is important at the time of its arrival in order to decide if it should preempt another process or not. The reason I have concluded this is because if it was checked every time there was a context switch then the process with the highest priority would always be run indefinitely and other processes would starve. This is against the idea of round robin making sure that no process executes longer than one time quantum and the idea that after a process executes it goes to the end of the queue.
Using this logic I have worked out the problem as such:
Could you please advise me if I'm on the right track of the role priority has in this situation and if I'm approaching it the right way?

I think you are on the wrong track. Round robin controls the run order within a priority. It is as if each priority has its own queue, and corresponding round robin scheduler. When a given priority’s queue is empty, the subsequent lower priority queues are considered. Eventually, it will hit idle.
If you didn’t process it this way, how would you prevent idle from eventually being scheduled, despite having actual work ready to go?
Most high priority processes are reactive, that is they execute for a short burst in response to an event, so for the most part on not on a run/ready queue.
In code:
void Next() {
for (int i = PRIO_HI; i >= PRIO_LO; i--) {
Proc *p;
if ((p = prioq[i].head) != NULL) {
Resume(p);
/*NOTREACHED*/
}
}
panic(“Idle not on runq!”);
}
void Stop() {
unlink(prioq + curp->prio, curp);
Next();
}
void Start(Proc *p) {
p->countdown = p->reload;
append(prioq + p->prio, p);
Next();
}
void Tick() {
if (--(curp->countdown) == 0) {
unlink(prioq + curp->prio, curp);
Start(curp);
}
}

Analyzing readdir() performance

It's bothering me that linux takes so long to list all files for huge directories, so I created a little test script that recursively lists all files of a directory:
#include <stdio.h>
#include <dirent.h>
int list(char *path) {
int i = 0;
DIR *dir = opendir(path);
struct dirent *entry;
char new_path[1024];
while(entry = readdir(dir)) {
if (entry->d_type == DT_DIR) {
if (entry->d_name[0] == '.')
continue;
strcpy(new_path, path);
strcat(new_path, "/");
strcat(new_path, entry->d_name);
i += list(new_path);
}
else
i++;
}
closedir(dir);
return i;
}
int main() {
char *path = "/home";
printf("%i\n", list(path));
return 0;
When compiling it with gcc -O3, the program runs about 15 sec (I ran the programm a few times and it's approximately constant, so the fs cache should not play a role here):
$ /usr/bin/time -f "%CC %DD %EE %FF %II %KK %MM %OO %PP %RR %SS %UU %WW %XX %ZZ %cc %ee %kk %pp %rr %ss %tt %ww %xx" ./a.out
./a.outC 0D 0:14.39E 0F 0I 0K 548M 0O 2%P 178R 0.30S 0.01U 0W 0X 4096Z 7c 14.39e 0k 0p 0r 0s 0t 1692w 0x
So it spends about S=0.3sec in kernelspace and U=0.01sec in userspace and has 7+1692 context switches.
A context switch takes about 2000nsec * (7+1692) = 3.398msec [1]
However, there are more than 10sec left and I would like to find out what the program is doing in this time.
Are there any other tools to investigate what the program is doing all the time?
gprof just tells me the time for the (userspace) call graph and gcov does not list time spent in each line but only how often a time is executed...
[1] http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

oprofile is a decent sampling profiler which can profile both user and kernel-mode code.
According to your numbers, however, approximately 14.5 seconds of the time is spent asleep, which is not really registered well by oprofile. Perhaps what may be more useful would be ftrace combined with a reading of the kernel code. ftrace provides trace points in the kernel which can log a message and stack trace when they occur. The event that would seem most useful for determining why your process is sleeping would be the sched_switch event. I would recommend that you enable kernel-mode stacks and the sched_switch event, set a buffer large enough to capture the entire lifetime of your process, then run your process and stop tracing immediately after. By reviewing the trace, you will be able to see every time your process went to sleep, whether it was runnable or non-runnable, a high resolution time stamp, and a call stack indicating what put it to sleep.
ftrace is controlled through debugfs. On my system, this is mounted in /sys/kernel/debug, but yours may be different. Here is an example of what I would do to capture this information:
# Enable stack traces
echo "1" > /sys/kernel/debug/tracing/options/stacktrace
# Enable the sched_switch event
echo "1" > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# Make sure tracing is enabled
echo "1" > /sys/kernel/debug/tracing/tracing_on
# Run the program and disable tracing as quickly as possible
./your_program; echo "0" > /sys/kernel/debug/tracing/tracing_on
# Examine the trace
vi /sys/kernel/debug/tracing/trace
The resulting output will have lines which look like this:
# tracer: nop
#
# entries-in-buffer/entries-written: 22248/3703779 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] d..3 2113.437500: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:0 next_pid=878 next_prio=120
<idle>-0 [000] d..3 2113.437531: <stack trace>
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> rest_init
=> start_kernel
kworker/0:0-878 [000] d..3 2113.437836: sched_switch: prev_comm=kworker/0:0 prev_pid=878 prev_prio=120 prev_state=S ==> next_comm=your_program next_pid=898 next_prio=120
kworker/0:0-878 [000] d..3 2113.437866: <stack trace>
=> __schedule
=> schedule
=> worker_thread
=> kthread
=> ret_from_fork
The lines you will care about will be when your program appears as the prev_comm task, meaning the scheduler is switching away from your program to run something else. prev_state will indicate that your program was still runnable (R) or was blocked (S, U or some other letter, see the ftrace source). If blocked, you can examine the stack trace and the kernel source to figure out why.

Apache Pig: FLATTEN and parallel execution of reducers

I have implemented an Apache Pig script. When I execute the script it results in many mappers for a specific step, but has only one reducer for that step. Because of this condition (many mappers, one reducer) the Hadoop cluster is almost idle while the single reducer executes. In order to better use the resources of the cluster I would like to also have many reducers running in parallel.
Even if I set the parallelism in the Pig script using the SET DEFAULT_PARALLEL command I still result in having only 1 reducer.
The code part issuing the problem is the following:
SET DEFAULT_PARALLEL 5;
inputData = LOAD 'input_data.txt' AS (group_name:chararray, item:int);
inputDataGrouped = GROUP inputData BY (group_name);
-- The GeneratePairsUDF generates a bag containing pairs of integers, e.g. {(1, 5), (1, 8), ..., (8, 5)}
pairs = FOREACH inputDataGrouped GENERATE GeneratePairsUDF(inputData.item) AS pairs_bag;
pairsFlat = FOREACH pairs GENERATE FLATTEN(pairs_bag) AS (item1:int, item2:int);
The 'inputData' and 'inputDataGrouped' aliases are computed in the mapper.
The 'pairs' and 'pairsFlat' in the reducer.
If I change the script by removing the line with the FLATTEN command (pairsFlat = FOREACH pairs GENERATE FLATTEN(pairs_bag) AS (item1:int, item2:int);) then the execution results in 5 reducers (and thus in a parallel execution).
It seems that the FLATTEN command is the problem and avoids that many reducers are created.
How could I reach the same result of FLATTEN but having the script being executed in parallel (with many reducers)?
Edit:
EXPLAIN plan when having two FOREACH (as above):
Map Plan
inputDataGrouped: Local Rearrange[tuple]{chararray}(false) - scope-32
| |
| Project[chararray][0] - scope-33
|
|---inputData: New For Each(false,false)[bag] - scope-29
| |
| Cast[chararray] - scope-24
| |
| |---Project[bytearray][0] - scope-23
| |
| Cast[int] - scope-27
| |
| |---Project[bytearray][1] - scope-26
|
|---inputData: Load(file:///input_data.txt:org.apache.pig.builtin.PigStorage) - scope-22--------
Reduce Plan
pairsFlat: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-42
|
|---pairsFlat: New For Each(true)[bag] - scope-41
| |
| Project[bag][0] - scope-39
|
|---pairs: New For Each(false)[bag] - scope-38
| |
| POUserFunc(GeneratePairsUDF)[bag] - scope-36
| |
| |---Project[bag][1] - scope-35
| |
| |---Project[bag][1] - scope-34
|
|---inputDataGrouped: Package[tuple]{chararray} - scope-31--------
Global sort: false
EXPLAIN plan when having only one FOREACH with FLATTEN wrapping the UDF:
Map Plan
inputDataGrouped: Local Rearrange[tuple]{chararray}(false) - scope-29
| |
| Project[chararray][0] - scope-30
|
|---inputData: New For Each(false,false)[bag] - scope-26
| |
| Cast[chararray] - scope-21
| |
| |---Project[bytearray][0] - scope-20
| |
| Cast[int] - scope-24
| |
| |---Project[bytearray][1] - scope-23
|
|---inputData: Load(file:///input_data.txt:org.apache.pig.builtin.PigStorage) - scope-19--------
Reduce Plan
pairs: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---pairs: New For Each(true)[bag] - scope-35
| |
| POUserFunc(GeneratePairsUDF)[bag] - scope-33
| |
| |---Project[bag][1] - scope-32
| |
| |---Project[bag][1] - scope-31
|
|---inputDataGrouped: Package[tuple]{chararray} - scope-28--------
Global sort: false

There is no surety if pig uses the configuration DEFAULT_PARALLEL value for every steps in the pig script. Try PARALLEL along with your specific join/group step which you feel taking time (In your case GROUP step).
inputDataGrouped = GROUP inputData BY (group_name) PARALLEL 67;
If still it is not working then you might have to see your data for skewness issue.

I think there is a skewness in the data. Only a small number of mappers are producing exponentially large output. Look at the distribution of keys in your data. Like data contains few Groups with large number of records.

I tried "set default parallel" and "PARALLEL 100" but no luck. Pig still uses 1 reducer.
It turned out I have to generate a random number from 1 to 100 for each record and group these records by that random number.
We are wasting time on grouping, but it is much faster for me because now I can use more reducers.
Here is the code (SUBMITTER is my own UDF):
tmpRecord = FOREACH record GENERATE (int)(RANDOM()*100.0) as rnd, data;
groupTmpRecord = GROUP tmpRecord BY rnd;
result = FOREACH groupTmpRecord GENERATE FLATTEN(SUBMITTER(tmpRecord));

To answer your question we must first know how many reducers pig enforces to accomplish the - Global Rearrange process. Because as per my understanding the Generate / Projection should not require a single reducer. I cannot say the same thing about Flatten. However we know from common-sense that during flatten the aim is to de-nestify the tuples from bags and vice versa. And to do that all the tuples belonging to a bag should definitely be available in the same reducer. I might be wrong. But can anyone add something here to get this user an answer please ?

What is the contents of the cache after loop?

A computer uses a small direct-mapped cache between the main memory and the
processor. The cache has four 16-bit words, and each word has an associated 13-bit tag,
as shown in Figure (a). When a miss occurs during a read operation, the requested
word is read from the main memory and sent to the processor. At the same time, it is
copied into the cache, and its block number is stored in the associated tag. Consider the
following loop in a program where all instructions and operands are 16 bits long:
LOOP: Add (R1)+,R0
Decrement R2
BNE LOOP
<-13 bits-> <--16bit->
0|TAG |DATA |
2| | |
4| | |
6|_______ | ______ |
(a)Cache
.
.
| A03C |<---ADDRESS 054E
| 05D9 |
| 10D7 |
.
.
(b)Main Memory
Assume that, before this loop is entered, registers R0, R1, and R2 contain 0, 054E,
and 3, respectively. Also assume that the main memory contains the data shown in
Figure (b), where all entries are given in hexadecimal notation. The loop starts at
location LOOP = 02EC.
(a) Show the contents of the cache at the end of each pass through the loop.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

tracing a linux kernel, function-by function (biggest only) with us timer - linux-kernel

You want oprofile. It can give you timings for (a selected subset of) your entire system, which means you can trace network activity from the device to the application and back again, through the kernel and all the libraries.

Related

Comparing Two Splunk Events To See Which One Is Larger

What part does priority play in round robin scheduling?

Analyzing readdir() performance

Apache Pig: FLATTEN and parallel execution of reducers

What is the contents of the cache after loop?

Categories

Resources