Boot hangs on serial8250: too much work for irq36 - linux-kernel

I am currently working on Poky on an embedded PPC system.
I am experiencing a "serial8250: too much work for irq36" problem on around 50-75% of my boot attempts.
What I was thinking of doing is:
Reduce printk lines
Check whether any fix for this error has landed in more recent Linux sources
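For the first item, a minimal sketch of how I could reduce console log volume, assuming the console loglevel is the knob that matters (values are illustrative):

# on the kernel command line: print only warnings and above to the console
loglevel=4

# or equivalently at runtime, through the printk sysctl
echo "4 4 1 7" > /proc/sys/kernel/printk

For the second item: as far as I know, the message is emitted by serial8250_interrupt() when the handler loops more than a hard-coded PASS_LIMIT number of times, so the 8250 driver sources are the place to watch for fixes.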
The error I am observing is this one:
serial8250_interrupt: 8544 callbacks suppressed
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
(several more lines identical to the ones above)
INFO: rcu_sched self-detected stall on CPU
3: (1 GPs behind) idle=f9f/140000000000001/0 softirq=0/0 fqs=1699
INFO: rcu_sched detected stalls on CPUs/tasks:
0: (1 GPs behind) idle=965/140000000000001/0 softirq=0/0 fqs=1699
3: (1 GPs behind) idle=f9f/140000000000001/0 softirq=0/0 fqs=1699
(detected by 1, t=52517 jiffies, g=91, c=90, q=161)
Task dump for CPU 0:
(agetty) R running 0 1915 1 0x00000084
Call Trace:
[ed0079a0] [c005b408] update_cfs_rq_blocked_load+0xb8/0x1c0 (unreliable)
[ed007a60] [00000211] 0x211
Task dump for CPU 3:
systemd-journal R running 0 957 1 0x00000004
Call Trace:
rcu_sched kthread starved for 50818 jiffies!
(t=52529 jiffies g=91 c=90 q=161)
rcu_sched kthread starved for 50831 jiffies!
Task dump for CPU 0:
(agetty) R running 0 1915 1 0x00000084
Call Trace:
[ed0079a0] [c005b408] update_cfs_rq_blocked_load+0xb8/0x1c0 (unreliable)
[ed007a60] [00000211] 0x211
Task dump for CPU 3:
systemd-journal R running 0 957 1 0x00000004
Call Trace:
[e9ee9b80] [c007f948] rcu_dump_cpu_stacks+0xa8/0x100 (unreliable)
[e9ee9ba0] [c008386c] rcu_check_callbacks+0x4fc/0x7b0
[e9ee9c10] [c008725c] update_process_times+0x3c/0x70
[e9ee9c20] [c009afb8] tick_sched_timer+0x68/0xe0
[e9ee9c50] [c00880b4] __run_hrtimer.isra.34+0x54/0xf0
[e9ee9c70] [c00889b8] hrtimer_interrupt+0x118/0x330
[e9ee9ce0] [c0009658] __timer_interrupt+0xa8/0x1a0
[e9ee9d00] [c0009980] timer_interrupt+0xb0/0xe0
[e9ee9d20] [c000f7e0] ret_from_except+0x0/0x18
--- interrupt: 901 at smp_call_function_many+0x2a0/0x300
LR = smp_call_function_many+0x270/0x300
[e9ee9de0] [c00a033c] smp_call_function_many+0x24c/0x300 (unreliable)
[e9ee9e20] [c00177cc] flush_tlb_mm+0x9c/0xa0
[e9ee9e40] [c00f8dd8] tlb_flush_mmu_tlbonly.part.86+0x18/0x90
[e9ee9e50] [c00f8f54] tlb_flush_mmu+0x24/0x40
[e9ee9e60] [c00f8f88] tlb_finish_mmu+0x18/0x70
[e9ee9e70] [c00fee00] unmap_region+0xc0/0x140
[e9ee9ef0] [c01013ec] do_munmap+0x26c/0x430
[e9ee9f20] [c01015e8] vm_munmap+0x38/0x60
[e9ee9f40] [c000f130] ret_from_syscall+0x0/0x3c
--- interrupt: c01 at 0x201842fc
LR = 0x20117090
serial8250_interrupt: 8544 callbacks suppressed
serial8250: too much work for irq36
serial8250: too much work for irq36
serial8250: too much work for irq36
(several more lines identical to the ones above)

Related

What can be done to lower UE4Editor startup time?

Status: the problem has improved, but compared to other users' reports it persists.
I have moved to UE 4.27.0 and the startup time dropped from 11 minutes (v4.26.2) to 6 minutes! (RAM usage dropped too!) But that doesn't compare to the "almost instant" startup other people report...
It is not compiling anything, not even shaders; this is about the 6th time I have run it for this one project.
Should I try disabling plugins? I'm new to UE and don't want to complicate my workflow. Though, for example, I have nothing VR-related to test, so that could safely be disabled at first.
HD READ SPEED? NO
I have tested moving the whole UE4Editor engine path (100 GB) to a 3x SSD stripe set, but the UE4Editor startup time remained the same. The HDD where it normally lives is also fast, though not as fast as the 3x SSD.
CPU USAGE? MAYBE. Could using all 4 cores solve it?
UE4Editor startup uses A SINGLE CORE ONLY. I can confirm with htop and the system monitor: only a single core is used at 100%, and the load hops between the 4 cores, so only one is at 100% at any given time.
I tested the command-line parameter -USEALLAVAILABLECORES after the project path for UE4Editor, but nothing changed. I read that the option is ignored on some machines, so maybe if I patch its handling it could work on mine?
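For reference, the invocation I tested had this shape (the editor binary location and project path are illustrative):

./UE4Editor "/path/to/MyProject.uproject" -USEALLAVAILABLECORES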
GPU? NO?
A report about a (weak) integrated graphics card says it does not affect the startup time.
LOG for UE4Editor v4.27.0 with the new biggest intervals ("..." means log lines omitted for readability; "!<interval in seconds>" is just my annotation to ease reading, and no lines are omitted around it):
[2021.09.15-23.38.20:677][ 0]LogHAL: Linux SourceCodeAccessSettings: NullSourceCodeAccessor
!22s
[2021.09.15-23.38.42:780][ 0]LogTcpMessaging: Initializing TcpMessaging bridge
[2021.09.15-23.38.42:782][ 0]LogUdpMessaging: Initializing bridge on interface 0.0.0.0:0 to multicast group 230.0.0.1:6666.
!16s
[2021.09.15-23.38.58:158][ 0]LogPython: Using Python 3.7.7
...
[2021.09.15-23.39.01:817][ 0]LogImageWrapper: Warning: PNG Warning: Duplicate iCCP chunk
!75s
[2021.09.15-23.40.16:951][ 0]SourceControl: Source control is disabled
...
[2021.09.15-23.40.26:867][ 0]LogAndroidPermission: UAndroidPermissionCallbackProxy::GetInstance
!16s
[2021.09.15-23.40.42:325][ 0]LogAudioCaptureCore: Display: No Audio Capture implementations found. Audio input will be silent.
...
[2021.09.15-23.41.08:207][ 0]LogInit: Transaction tracking system initialized
!9s
[2021.09.15-23.41.17:513][ 0]BlueprintLog: New page: Editor Load
!23s
[2021.09.15-23.41.40:396][ 0]LocalizationService: Localization service is disabled
...
[2021.09.15-23.41.45:457][ 0]MemoryProfiler: OnSessionChanged
!13s
[2021.09.15-23.41.58:497][ 0]LogCook: Display: CookSettings for Memory: MemoryMaxUsedVirtual 0MiB, MemoryMaxUsedPhysical 16384MiB, MemoryMinFreeVirtual 0MiB, MemoryMinFreePhysical 1024MiB
SPECS:
I'm using Ubuntu 20.04.
My CPU has 4 cores at 3.6 GHz.
GeForce GT 710 with 1 GB of VRAM.
Related question but for older UE4: https://answers.unrealengine.com/questions/987852/view.html
Unreal Engine needs a high-end PC: a lot of RAM, fast SSDs, a good CPU, and at least a mid-range graphics card. First of all, there are always some shaders that need to be compiled by the engine, and a lot of assets to be loaded at startup. Since you're on Linux, you are probably using a self-compiled Unreal Engine build... not the best thing to do as a newbie, because it can cause several problems with load times, startup, compiling, and a lot of other things. If this is among your first times using Unreal, try it on Windows; everything is easier there.

Julia package load extremely slow in first run

I'm using Julia 1.5.2 under Linux 5.4.0 and waited around 15 minutes for Pkg.add("DifferentialEquations"). Then I started the kernel in a Jupyter notebook and ran the following code. It took a terrible 1 minute to execute (the actual first time I did this it took 225 s).
t = time()
using Printf
using BenchmarkTools
using OrdinaryDiffEq
using Plots
tt = time() - t
#sprintf("It took %f seconds to import Printf, BenchmarkTools, OrdinaryDiffEq and Plots.", tt)
# It took 58.545894 seconds to import Printf, BenchmarkTools, OrdinaryDiffEq and Plots.
Finally, I did the same as above, but for each package separately. This is the summary:
Printf: 0.004755973815917969
BenchmarkTools: 0.06729602813720703
Plots: 19.99405598640442
OrdinaryDiffEq: 19.001102209091187
I know from here that Pkg was slow in the past, but I don't think 15 minutes is a normal install time at all. However, this is not my big problem.
I know that Julia needs to compile everything every time the kernel is started or some package is loaded. But this obviously is not a compilation time, it's a compilation eternity.
Can anyone figure out why this is so terribly slow? And, if it's normal, wouldn't it be better to provide precompiled packages to Pkg, the way numpy and friends come precompiled in Python? Or at least compile once and for all on the first using?
Thank you!
My complete Platform Info:
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i3-6100U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
This problem is generally called latency or time-to-first-plot (TTFP) in the julia-lang community; you can find several discussions using these keywords.
A nice recent analysis of this problem is given in the article "Analyzing sources of compiler latency in Julia: method invalidations".
At the time of writing (end of 2020, stable release v1.5.3), no general solution is available, but strategies of massive ahead-of-time precompilation of packages instead of JIT are being discussed, with marginal success so far; a sketch of one such strategy follows.
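One concrete form of that strategy is to bake the heavy packages into a custom system image with PackageCompiler.jl; a minimal sketch, using the package names from the question (the sysimage path is illustrative):

using PackageCompiler
# ahead-of-time compile the slow-loading packages into a custom sysimage
create_sysimage([:OrdinaryDiffEq, :Plots]; sysimage_path="sys_diffeq.so")

Starting Julia with julia --sysimage sys_diffeq.so then makes loading those packages near-instant, at the cost of rebuilding the image whenever they are updated.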

Why do my various user programs terminate abruptly without an error message?

I do a variety of different kinds of data analysis and numerical simulation on my custom-built Ubuntu machine using custom-written programs that sometimes must run for days or even weeks. Some of those programs have been in Fortran, some in Python, some in C; there is literally zero commonality between these programs except that they run a long time and do a lot of disk i/o. Most are single-thread.
The typical execution command line looks like
./myprog &> myprog.log &
If an ordinary runtime error occurs, any buffered program output and the error message both faithfully appear in myprog.log and the logfile is cleanly closed. But what's been happening instead in many cases is that the program simply quits in mid-stream -- usually after half a day to a day or so, without any further output to the log file. It's like the program had been randomly hit with a 'kill -9'.
I don't know why this is happening, and it seems to be specific to this particular machine (I have been doing similar work for 30 years and never experienced this before). The operating system itself seems rock-stable; it has been rebooted only rarely over the past couple years for specific reasons like updates. It's only my longer-running user processes that seem to die abruptly like this with no accompanying diagnostic.
Not being a system-level expert, I'm at a loss for how to diagnose what's going on. Right now, my only option is to regularly check whether my program is still running and restart it if necessary.
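The obvious first checks, as far as I can tell, are recording the exit status and looking for the OOM killer in the kernel log; a minimal sketch (myprog is the example program above):

# run as before, but also record how the process actually exited;
# an exit status of 137 would mean SIGKILL (128+9), e.g. from the OOM killer
( ./myprog &> myprog.log; echo "exit status: $?" >> myprog.log ) &

# scan the kernel log for OOM-killer activity
dmesg -T | grep -iE 'out of memory|oom|killed process'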
System details:
Ubuntu 18.04.4 LTS
Linux kernel: 4.15.0-39-generic
CPU: AMD Ryzen Threadripper 1950x
UPDATE: Since dmesg was mentioned, here are some representative messages, which I have no idea how to interpret. The UFW BLOCK messages are by far the most numerous, but there are also a fair number of the ata6 messages, which seem to have something to do with the SATA hard drive. Could this be relevant?
[5301325.692596] audit: type=1400 audit(1594876149.572:218): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/locale/" pid=19663 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[5352288.689739] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[5352288.689753] ata6.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 14 pio 16392 in
Get event status notification 4a 01 00 00 10 00 00 00 08 00
res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[5352288.689756] ata6.00: status: { DRDY }
[5352288.689760] ata6: hard resetting link
[5352289.161877] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[5352289.166076] ata6.00: configured for PIO0
[5352289.166635] ata6: EH complete
[5353558.066052] [UFW BLOCK] IN=enp5s0 OUT= MAC=10:7b:44:93:2f:58:b4:0c:25:e0:40:12:08:00 SRC=172.105.89.161 DST=144.92.130.162 LEN=40 TOS=0x00 PREC=0x00 TTL=243 ID=50780 PROTO=TCP SPT=58944 DPT=68 WINDOW=1024 RES=0x00 SYN URGP=0

Tensorflow, Keras and GPUS: logs show Resource Exhausted Error before simply loading up model weights

I am new to Ubuntu and I am setting up a new machine for deep learning using Keras and TensorFlow. I am fine-tuning VGG16 on a set of pretty complex medical images. My machine specifications are:-
i7-6900K CPU @ 3.20GHz × 16
GeForce GTX 1080 Ti x 4
62.8 GiB of RAM
My previous machine was an iMac with no GPU but an i7 quad core processor and 32GB of RAM. The iMac ran the following model although it took 32 hours to complete it.
Here is the code:-
# imports required by the snippets below (standalone Keras 2.x era)
import numpy as np
from keras import applications
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 512, 512
top_model_weights_path = '50435_train_uip_possible_inconsistent.h5'
train_dir = '../../MasterHRCT/50435/Three-Classes/train'
validation_dir = '../../MasterHRCT/50435/Three-Classes/validation'
nb_train_samples = 50435
nb_validation_samples = 12600
epochs = 200
batch_size = 16
datagen = ImageDataGenerator(rescale=1. / 255)
model = applications.VGG16(include_top=False, weights='imagenet')
Then:-
generator_train = datagen.flow_from_directory(
train_dir,
target_size=(img_width, img_height),
shuffle=False,
class_mode=None,
batch_size=batch_size
)
bottleneck_features_train = model.predict_generator(
generator=generator_train,
steps=nb_train_samples // batch_size,
verbose=1
)
np.save(file="50435_train_uip_possible_inconsistent.npy", arr=bottleneck_features_train)
print("Completed train data")
generator_validation = datagen.flow_from_directory(
validation_dir,
target_size=(img_width, img_height),
shuffle=False,
class_mode=None,
batch_size=batch_size
)
bottleneck_features_validation = model.predict_generator(
generator=generator_validation,
steps=nb_validation_samples // batch_size,
verbose=1
)
np.save(file="12600_validate_uip_possible_inconsistent.npy", arr=bottleneck_features_validation)
print("Completed validation data")
Yesterday, I ran this code and it was super fast (nvidia-smi suggested that only one GPU was being used, which I believe is expected for TF). The CPU hit 56% of maximum. Then it crashed with a CUDA_OUT_OF_MEMORY error. So I lowered the batch size to 4. Again, it started really fast, but then the CPU jumped to 100% and my system froze. I had to hard reboot.
I have tried again today, and the first time I get this error when simply trying to load the ImageNet weights...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,512,512]
[[Node: block4_conv2_2/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=5932420, _device="/job:localhost/replica:0/task:0/gpu:0"](block4_conv2_2/random_uniform/shape)]]
On the command line it says:-
2017-08-08 06:13:57.937723: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 71.99MiB
2017-08-08 06:13:57.937739: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 80150528
InUse: 75491072
MaxInUse: 80069120
NumAllocs: 177
MaxAllocSize: 11985920
Now clearly this is a memory issue - but why would it fail to even load the weights? My Mac can run this entire code, albeit intractably slowly. I should note that this morning I did get this code running once, but that time it was ridiculously slow - slower than my Mac. My ignorant view is that something is chewing up memory, but I can't debug this... and being new to Ubuntu, I am uncertain where to begin. Having seen the code run super fast yesterday (and then crash toward the end), I wonder whether the system has 'reset' or disabled something.
Help!
EDIT:
I cleared all the variables in the Jupyter notebook, dropped the batch size to 1, reloaded, and managed to load the weights, but on running the first generator I get:
ResourceExhaustedError: OOM when allocating tensor with shape[1,512,512,64]
[[Node: block1_conv1/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_input_1_0_0/_105, block1_conv1/kernel/read)]]
I am not clear why I can successfully run this on my Mac but not on a machine with more RAM, a better CPU, and 4 GPUs...
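One clue in the log above: the BFC allocator reports Limit: 80150528, i.e. roughly 76 MiB, a tiny fraction of a GTX 1080 Ti's 11 GB, so something else may already be holding the GPU memory (nvidia-smi lists the processes that do). Below is a minimal sketch of asking the TF 1.x / standalone-Keras stack of that era to allocate GPU memory on demand; treat the session-config calls as an assumption to verify against the installed versions:

import tensorflow as tf
from keras import backend as K

# grow the GPU allocation on demand instead of reserving it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))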

LITMUS^RT: How to handle linux kernel deadlock

I'm working on Linux kernel 3.10 patched with LITMUS^RT, a real-time extension focused on multiprocessor real-time scheduling and synchronization.
My aim is to write a scheduler that allows a task to migrate from one CPU to another when preempted, and only when particular conditions are met. My current implementation suffers from a deadlock between CPUs, as shown by the following error:
Setting up rt task parameters for process 1622.
[ INFO: inconsistent lock state ]
3.10.5-litmus2013.1 #105 Not tainted
---------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
rtspin/1620 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&rq->lock){?.-.-.}, at: [<ffffffff8155f0d5>] __schedule+0x175/0xa70
{IN-HARDIRQ-W} state was registered at:
[<ffffffff8107832a>] __lock_acquire+0x86a/0x1e90
[<ffffffff81079f65>] lock_acquire+0x95/0x140
[<ffffffff81560fc6>] _raw_spin_lock+0x36/0x50
[<ffffffff8105e231>] scheduler_tick+0x61/0x210
[<ffffffff8103f112>] update_process_times+0x62/0x80
[<ffffffff81071677>] tick_periodic+0x27/0x70
[<ffffffff8107174b>] tick_handle_periodic+0x1b/0x70
[<ffffffff810042d0>] timer_interrupt+0x10/0x20
[<ffffffff810849fd>] handle_irq_event_percpu+0x6d/0x260
[<ffffffff81084c33>] handle_irq_event+0x43/0x70
[<ffffffff8108778c>] handle_level_irq+0x6c/0xc0
[<ffffffff81003a89>] handle_irq+0x19/0x30
[<ffffffff81003925>] do_IRQ+0x55/0xd0
[<ffffffff81561cef>] ret_from_intr+0x0/0x13
[<ffffffff8108615a>] __setup_irq+0x20a/0x4e0
[<ffffffff81086473>] setup_irq+0x43/0x90
[<ffffffff8184fb5f>] setup_default_timer_irq+0x12/0x14
[<ffffffff8184fb78>] hpet_time_init+0x17/0x19
[<ffffffff8184fb46>] x86_late_time_init+0xa/0x11
[<ffffffff8184ecd1>] start_kernel+0x270/0x2e0
[<ffffffff8184e5a3>] x86_64_start_reservations+0x2a/0x2c
[<ffffffff8184e66c>] x86_64_start_kernel+0xc7/0xca
irq event stamp: 8886
hardirqs last enabled at (8885): [<ffffffff8108dd6b>] rcu_note_context_switch+0x8b/0x2d0
hardirqs last disabled at (8886): [<ffffffff81561052>] _raw_spin_lock_irq+0x12/0x50
softirqs last enabled at (8880): [<ffffffff81037125>] __do_softirq+0x195/0x2b0
softirqs last disabled at (8857): [<ffffffff8103738d>] irq_exit+0x7d/0x90
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&rq->lock);
<Interrupt>
lock(&rq->lock);
*** DEADLOCK ***
1 lock held by rtspin/1620:
#0: (&rq->lock){?.-.-.}, at: [<ffffffff8155f0d5>] __schedule+0x175/0xa70
stack backtrace:
CPU: 1 PID: 1620 Comm: rtspin Not tainted 3.10.5-litmus2013.1 #105
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
ffffffff81bc4cc0 ffff88001cdf3aa8 ffffffff8155ae1e ffff88001cdf3af8
ffffffff81557f39 0000000000000000 ffff880000000001 ffff880000000001
0000000000000000 ffff88001c5ec280 ffffffff810750d0 0000000000000002
Call Trace:
[<ffffffff8155ae1e>] dump_stack+0x19/0x1b
[<ffffffff81557f39>] print_usage_bug+0x1f7/0x208
[<ffffffff810750d0>] ? print_shortest_lock_dependencies+0x1c0/0x1c0
[<ffffffff81075ead>] mark_lock+0x2ad/0x320
[<ffffffff81075fd0>] mark_held_locks+0xb0/0x120
[<ffffffff8129bf71>] ? pfp_schedule+0x691/0xba0
[<ffffffff810760f2>] trace_hardirqs_on_caller+0xb2/0x210
[<ffffffff8107625d>] trace_hardirqs_on+0xd/0x10
[<ffffffff8129bf71>] pfp_schedule+0x691/0xba0
[<ffffffff81069e70>] pick_next_task_litmus+0x40/0x500
[<ffffffff8155f17a>] __schedule+0x21a/0xa70
[<ffffffff8155f9f4>] schedule+0x24/0x70
[<ffffffff8155d1bc>] schedule_timeout+0x14c/0x200
[<ffffffff8105e3ed>] ? get_parent_ip+0xd/0x50
[<ffffffff8105e589>] ? sub_preempt_count+0x69/0xf0
[<ffffffff8155ffab>] wait_for_completion_interruptible+0xcb/0x140
[<ffffffff81060e60>] ? try_to_wake_up+0x470/0x470
[<ffffffff8129266f>] do_wait_for_ts_release+0xef/0x190
[<ffffffff81292782>] sys_wait_for_ts_release+0x22/0x30
[<ffffffff81562552>] system_call_fastpath+0x16/0x1b
At this point I see two possible approaches I could follow to solve this problem:
Release the current CPU's lock before migrating to the target CPU, using the kernel's own functions. LITMUS^RT provides a callback in which I can decide which task will execute next:
static struct task_struct* pfp_schedule(struct task_struct *prev)
{
    [...]
    if (is_preempted(prev)) {
        /* release the lock on the current CPU, then migrate */
        migrate_to_another_cpu();
    }
    [...]
}
What I think I must do is release the current lock before the call to the migrate_to_another_cpu function, but I still haven't found any way of doing that.
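A hedged sketch of the pattern I have in mind (rq->lock and the raw spinlock calls follow the kernel 3.10 naming in the lockdep report above; migrate_to_another_cpu is my own hypothetical helper, and whether pfp_schedule() may legally drop the lock at this point is exactly what I don't know):

/* drop the local runqueue lock around the migration... */
raw_spin_unlock(&rq->lock);
migrate_to_another_cpu();    /* hypothetical migration helper */
/* ...and retake it before returning to the scheduler core */
raw_spin_lock(&rq->lock);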
The scheduler I want to implement allows only one task migration at a time, so it should be theoretically impossible to have a deadlock. For some reason, though, the execution of my task set fails and I get the error listed above, which I think is produced by some kind of preliminary analysis performed by the Linux kernel (lockdep). This, however, is only a potential deadlock, i.e. a kind of warning, and I would like to let my task set continue its normal execution, ignoring it.
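For that second approach: as far as I can tell the report comes from lockdep, so a kernel rebuilt without lock debugging would stop flagging it (while leaving the underlying risk in place). A sketch of the .config change, assuming this is the relevant knob:

# CONFIG_PROVE_LOCKING is not set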
Long story short, does anyone know if one of these solutions is possible and, if yes, how to implement it?
Thanks in advance!
P.S.: tips or better ideas are very welcome! ;-)
