Determining why kernel hangs on boot - linux-kernel

Hi,
I was building a kernel for my Gentoo Linux system. When I boot the new kernel, I get these messages and it goes no further:
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
non-volatile memory driver v1.3
I don't know how to solve this problem and need help. Thanks.

Why don't you try disabling PCI hotplug support in the kernel (if I recall correctly, it is in the main config menu under PCI support)? You probably don't need it.

I'm going to have to disagree with those voting to close, because I think there really is a question here, and the question is "How to debug this?"
I'm going to propose two approaches:
1) Studious approach: Learn about the mechanisms intended for handling boot problems. See if you can increase the kernel debug message level. Disable unneeded drivers as Quizzo suggested.
2) Cowboy approach: grep the kernel sources for strings seen in the final messages, and start shotgunning all possibly relevant bits of code with your own "still alive at" printk messages. Once you know where it's hanging, figure out why and either remove that mechanism or fix it.
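As a rough sketch of the grep step in the cowboy approach, the helper below searches a source tree for a fragment of the last message seen on the console. The function name and the /usr/src/linux path are assumptions for illustration. Boot messages are often assembled from several string constants, so a short, distinctive fragment tends to match better than the whole line.

```shell
# Hypothetical helper: list source lines containing a fragment of a boot message.
# $1 = fragment copied from the console (matched literally, case-insensitively)
# $2 = root of the kernel source tree (assumed location)
find_message_source() {
    grep -rniF -- "$1" "$2"
}

# Example, assuming the sources live in /usr/src/linux:
#   find_message_source 'non-volatile memory driver' /usr/src/linux
```

Each hit is printed as file:line:text, which gives a starting point for adding "still alive at" printk calls.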
At an extreme there's also a tool for debugging the kernel - kgdb - which you could set up if you have a second machine available.
If you already have Linux running on this box, see if there's a config.gz in /proc or a config file in the boot folder which you can extract and compare to the configuration you are trying to compile. It might not be a bad idea to first recompile and test exactly the same version and configuration as you have running, and then make the desired changes one by one.
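A minimal sketch of that comparison, assuming the running kernel exposes /proc/config.gz (this needs CONFIG_IKCONFIG_PROC) and that the new configuration is in /usr/src/linux/.config. The function name and both paths are illustrative, not something from the original post.

```shell
# Print the options that differ between two kernel .config files.
# Lines starting with "<" come from $1, lines starting with ">" from $2.
diff_kconfig() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Keep both "CONFIG_X=..." and "# CONFIG_X is not set" lines.
    grep -E '^(# )?CONFIG_' "$1" | sort > "$a"
    grep -E '^(# )?CONFIG_' "$2" | sort > "$b"
    diff "$a" "$b" | grep '^[<>]'
    rm -f "$a" "$b"
}

# Example (paths are assumptions):
#   zcat /proc/config.gz > /tmp/running.config
#   diff_kconfig /tmp/running.config /usr/src/linux/.config
```

The kernel tree also ships a scripts/diffconfig helper that serves the same purpose.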
Also, you might see if there's odd hardware in your system that you could temporarily remove. For example, an older PC I have has a BIOS that hangs during drive enumeration if a large external USB drive is plugged in during boot.

I have solved the problem by enabling all the PCI hotplug options in the kernel config file.
Thanks, all.

Crash when reading /proc/net/ip_conntrack with netfilter accounting enabled

Hoping someone might have an idea -
I've been exploring my home router which is running Linux 3.4.11-rt19. I enabled conntrack accounting using echo 1 > /proc/sys/net/netfilter/nf_conntrack_acct and have been looking at /proc/net/ip_conntrack as I browse to various websites.
My problem is that as I am loading a relatively heavy site with flash animations, graphics, ads etc., the cat /proc/net/ip_conntrack command will occasionally crash the router. This happens pretty frequently - I can easily crash the router within a minute or two.
The crash does not reproduce if I do not enable conntrack accounting.
I was looking at the netfilter code trying to spot potential race conditions or missing locks but came up empty. I also tried to examine the differences in netfilter code against later Linux versions without success.
Are there any ways for me to debug this? Or is this a known issue? By any chance is there a workaround that does not require me to re-implement this proc file in kernel mode?
One possible cause of the crash is the router running low on memory: with netfilter accounting enabled, it may not be able to keep up while you load a site heavy with Flash and similar content.
As a solution, you can build your own router with full Linux support on a low-end or scrap PC, or on one of the single-board computers available on the market.
Many router distributions are available. Some well-known ones are:
pfSense (FreeBSD-based rather than Linux)
Smoothwall
etc.
At first I'd try to disable reboot on panic:
echo 0 >/proc/sys/kernel/panic
Then I'd try to disable panic on oops and some other situations:
echo 0 >/proc/sys/kernel/panic_on_oops
echo 0 >/proc/sys/kernel/panic_on_stackoverflow
echo 0 >/proc/sys/kernel/panic_on_unrecovered_nmi
echo 0 >/proc/sys/kernel/panic_on_rcu_stall
echo 0 >/proc/sys/kernel/panic_on_warn
This could give you a chance to keep control of your router when the issue occurs.
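Not every kernel exposes all of these files, and an older kernel may lack some of the newer ones. A hedged sketch that writes 0 only to the knobs that actually exist is shown here. The function name is made up for illustration, and the directory argument exists so the loop can be tried on a scratch directory first.

```shell
# Write 0 to each panic-related sysctl that exists under the given directory.
# $1 defaults to /proc/sys/kernel; writing there requires root.
disable_panic_knobs() {
    dir=${1:-/proc/sys/kernel}
    for knob in panic panic_on_oops panic_on_stackoverflow \
                panic_on_unrecovered_nmi panic_on_rcu_stall panic_on_warn; do
        if [ -w "$dir/$knob" ]; then
            echo 0 > "$dir/$knob" && echo "cleared $knob"
        else
            echo "skipped $knob (not present or not writable)"
        fi
    done
}

# Example, as root on the router:
#   disable_panic_knobs
```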
Also, if your kernel has the netconsole driver built in or available as a module, you could set up logging of kernel messages to another machine. See the kernel netconsole documentation to find out how to configure it dynamically.
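For illustration, dynamic netconsole configuration goes through configfs: a target is a directory whose attribute files are filled in before it is enabled. The sketch below assumes the kernel was built with dynamic netconsole support and that configfs is mounted at /sys/kernel/config. The target name, interface, address, and port are placeholders.

```shell
# Create and enable one netconsole target through configfs.
# $1 = netconsole configfs directory (normally /sys/kernel/config/netconsole)
# $2 = target name (hypothetical), $3 = local interface
# $4 = remote IP address, $5 = remote UDP port
add_netconsole_target() {
    t="$1/$2"
    mkdir -p "$t" || return 1
    echo "$3" > "$t/dev_name"
    echo "$4" > "$t/remote_ip"
    echo "$5" > "$t/remote_port"
    echo 1 > "$t/enabled"    # parameters are set while the target is still disabled
}

# Example, as root (all values are placeholders):
#   add_netconsole_target /sys/kernel/config/netconsole target1 eth0 10.0.0.2 6666
```

On the receiving machine, any UDP listener on the chosen port will do, for example a netcat in UDP listen mode.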
In any case, the important step in understanding or solving an issue like this is to get the kernel log from the crash.

how to debug a pci device and linux driver

I am programming a PCI device in Verilog and also writing its driver.
I have probably introduced a bug in the hardware design, and when I load the driver with insmod the kernel just gets stuck and doesn't respond. Now I'm trying to figure out which is the last line of driver code that runs before my computer gets stuck. I have inserted printk calls in all relevant functions like probe and init, but none of them get printed.
What other code runs when I use insmod, before it gets to my init function? (I guess the kernel gets stuck over there.)
printks are often not useful debugging such a problem. They are buffered sufficiently that you won't see them in time if the system hangs shortly after printk is called.
It is far more productive to selectively comment out sections of your driver and by process of elimination determine which line is the (first) problem.
Begin by commenting out the entire module's init section leaving only return 0;. Build it and load it. Does it hang? Reboot system, reenable the next few lines (class_create()?) and repeat.
From what you describe, it looks like your driver is deadlocking the Linux scheduler. That means that interrupts from the system timer either don't arrive or never get a chance to be handled by the kernel. There are three possible reasons:
Your driver hangs somewhere in its interrupt handler (the handler starts its work but never finishes it).
Your device creates an interrupt storm (the device generates interrupts so frequently that the system does nothing but handle them).
Your driver explicitly disables all interrupts but doesn't re-enable them.
In all other cases the system will either crash, oops, or panic with the appropriate output, or tolerate the potential misbehavior of your device.
I guess that printk won't work for a scenario as extreme as a hang in kernel mode. It is quite heavyweight and therefore an unreliable diagnostic tool for scenarios like yours.
This trick works only in simpler environments, like bootloaders or simpler kernels, where the system runs in the default low-end video mode and there is no need to synchronize access to video memory. In such systems, tracing by writing debug output directly to video memory can be a great tool, and often the only one available for debugging. Linux is not such a case.
The techniques I can recommend from a software debugging point of view:
Review your driver code, paying special attention to the interrupt handler and to the places where you disable/enable interrupts for synchronization.
Commenting out all the driver logic and gradually uncommenting it can help a lot in localizing the issue.
You can try remote kernel debugging of your driver. I advise using a virtual machine for that purpose, but I don't know whether they allow passing the PCI device through to the virtual machine.
You can try in-memory tracing. The idea is to preallocate a chunk of memory with well-known virtual and physical addresses and zero it. Then modify your driver to write trace data into this chunk through its virtual address. (For example, assign a unique integer to each event you want to trace and write '1' at the corresponding index of a byte array in the preallocated chunk.) When the system hangs, you can force a full memory dump and then analyze it, using the physical address of the chunk to find the traces. I used this technique with a VMware Workstation VM on Windows: when the system hung, I paused the VM instance and looked at the corresponding .vmem file, which contains the raw layout of the VM's physical memory. I'm not sure this trick will work easily, or at all, on Linux, but I would try it.
Finally, you can try to trace the messages on the PCI bus, but I'm not an expert in this field and I'm not sure whether it can help in your case.
In general, kernel debugging is quite a tricky task: many tricks are in use, and each works only for a specific set of cases. :(
I would put a logic analyzer on the bus lines (on an FPGA you could use ChipScope or similar). You'll then be able to tell which access is the cause (and fix the hardware). It will be useful anyway for debugging or analyzing future issues.
Another way would be to use the kernel crash dump utility, which has saved me some headaches in the past. Depending on your Linux distribution, it may need to be installed (it is available by default in RH). See http://people.redhat.com/anderson/crash_whitepaper/
There isn't really anything that is run before your init. Bus enumeration is done at boot, if that goes by without a hitch the earliest cause for freezing should be something in your driver init AFAIK.
You should be able to see printks as they are printed, they aren't buffered and should not get lost. That's applicable only in situations where you can directly see kernel output, such as on the text console or over a serial line. If there is some other application in the way, like displaying the kernel logs in a terminal in X11 or over ssh, it may not have a chance to read and display the logs before the computer freezes.
If for some other reasons the printks still do not work for you, you can instead have your init function return early. Just test and move the return to later in the init until you find the point where it crashes.
It's hard to say what is causing your freezes, but interrupts is one of those things I would look at first. Make sure the device really doesn't signal interrupts until the driver enables them (that includes clearing interrupt enables on system reset) and enable them in the driver only after all handlers are registered (also, clear interrupt status before enabling interrupts).
Second thing to look at would be bus master transfers, same thing applies: Make sure the device doesn't do anything until it's asked to and let the driver make sure that no busmaster transfers are active before enabling busmastering at the device level.
The fact that the kernel gets stuck as soon as you install your driver module makes me wonder whether some other driver (built into the kernel?) is already driving the device. I made this mistake once, which is why I am asking. I'd look for the string "Kernel driver in use" in the output of 'lspci -k' before installing the module. In any case, your printk's should be visible in dmesg output.
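A small sketch of that check: the filter below reads lspci output on standard input and prints the name of whichever driver is already bound, or nothing if the device is free. The function name and the 01:00.0 slot address are placeholders.

```shell
# Read `lspci -k` output on stdin and print the bound driver, if any.
bound_driver() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Example (the slot address is a placeholder for your device):
#   lspci -k -s 01:00.0 | bound_driver
```

An empty result suggests no other driver has claimed the device, so the hang is more likely in your own module.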
In addition to Claudio's suggestion, a couple more debug ideas:
1. Try kgdb (https://www.kernel.org/doc/htmldocs/kgdb/EnableKGDB.html)
2. Use JTAG interfaces to connect debug tools (these, I think, vary between devices and vendors, so you'll have to figure out which debug tools you need for the particular hardware)

How to know that the kernel has panicked?

I want to be able to monitor kernel panics - know if and when they have happened.
Is there a way to know, after the machine has booted, that it went down due to a kernel panic (and not, for example, an ordered reboot or a power failure)?
The machine may be configured with KDUMP and/or KDB, but I prefer not to assume that either is or is not installed.
Patching the kernel is an option, though I prefer to avoid it. But even if I do, I'm not sure what the patch could do.
I'm using kernel 2.6.18 (ancient, I know). Solutions for newer kernels may be interesting too.
Thanks.
The kernel module 'netconsole' may help you to log kernel printk messages over UDP.
You can view the log messages on a remote syslog server, even if the machine is rebooted.
Introduction:
=============
This module logs kernel printk messages over UDP allowing debugging of
problem where disk logging fails and serial consoles are impractical.
It can be used either built-in or as a module. As a built-in,
netconsole initializes immediately after NIC cards and will bring up
the specified interface as soon as possible. While this doesn't allow
capture of early kernel panics, it does capture most of the boot
process.
Check kernel document for more information: https://www.kernel.org/doc/Documentation/networking/netconsole.txt
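One way to turn this into a "did it panic?" check, assuming the remote receiver saves everything it gets to a file: after the machine comes back up, search that file for the line the kernel prints when it panics. The function name and log path are hypothetical.

```shell
# Print panic lines from a saved kernel log; exit status 0 if any were found.
# $1 = log file written by the remote receiver (path is an assumption)
panic_in_log() {
    grep -F 'Kernel panic - not syncing' "$1"
}

# Example:
#   if panic_in_log /var/log/netconsole.log; then echo "last shutdown was a panic"; fi
```

An ordered reboot or a power failure leaves no such line, so its absence separates those cases from a panic, provided the panic happened after netconsole was up.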

Tool to Debug Guest OS in Virtual Box

I'm just cross posting the same question I did on virtualbox.org. http://forums.virtualbox.org/viewtopic.php?f=9&t=26702&p=119139#p119139
If this doesn't break any rule, I'd appreciate knowing more about it, since Stack Overflow promises to be more dynamic!
"Hi,
I did some searching and could not find any tool to debug a guest system from early boot in VirtualBox. However, I came across JCP, an x86 emulator in Java that is not so powerful or beautiful but has a debug mode where one can view physical memory and the CPU registers, among other things. It also makes it possible to execute CPU instructions step by step and to set breakpoints, watchpoints, and conditional ones. Is there such a thing in VirtualBox?
I think it would be amazing to have it and be able to inspect the system while it's running, either to learn about PC architecture or as a tool for developing a kernel.
In case you think it's a good idea (I think it is), how can it be achieved? I'm interested in developing this sort of thing and would like to know whether it is feasible, if it is not already implemented somewhere."
EDIT: Are modern x86 processors able to interrupt execution just after a CPU cycle and pass the execution address to other code to do just this? Yes, the trap flag can be set to put the processor in single-step mode: the x86 will execute one instruction and then raise the debug exception (INT 1).
Contrary to what is stated in another answer, VirtualBox now contains a (limited) debugger. Add --dbg to the command line when starting the VM. For more information, consult section 12.1.3, "The built-in VM debugger".
The OSDev wiki has some useful information on debugging a guest operating system, though according to this page VirtualBox doesn't have a debugger at present. I've been using QEMU with the GDB stub and it works quite nicely, so you might like to give that a go instead.
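For reference, a sketch of what a QEMU plus GDB stub session typically looks like. The helper only prints the two commands so they can be reviewed before running them in separate terminals. The function name, the i386 system emulator, and the disk image name are assumptions.

```shell
# Print the commands for a QEMU + gdb debugging session.
# $1 = disk image holding the guest OS (hypothetical file name)
qemu_gdb_commands() {
    # -s opens a gdb stub on TCP port 1234; -S freezes the CPU until gdb continues.
    echo "qemu-system-i386 -s -S -hda $1"
    echo "gdb -ex 'target remote localhost:1234'"
}

# Example:
#   qemu_gdb_commands disk.img
```

Because the guest starts frozen, breakpoints can be set before the first instruction of the boot code runs.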

Gnu Debugger & Linux Kernel

I have compiled my own Kernel module and now I would like to be able to load it
into the GNU Debugger GDB. I did this once, a year ago or so to have a look
at the memory layout. It worked fine then, but of course I was too silly to
write down the single steps I took to accomplish this... Can anyone enlighten
me or point me to a good tutorial?
Thank you so much
For kernels > 2.6.26 (i.e. after May 2008), the preferred way is probably to use "kgdb light" (not to be confused with its ancestor kgdb, available as a set of kernel patches).
"kgdb light" is now part of the kernel (in by default in current Ubuntu kernels, for instance), and its capabilities are improving fast (Jason Wessel is working on it - possible google key).
Drawback: You need two machines, the one you're debugging and the development machine (host) where gdb runs. Currently, those two machines can only be linked through a serial link.
kgdb runs on the target machine, where it handles the breakpoints, stepping, etc. and the remote debugging protocol used to talk with the development machine.
gdb runs on the development machine, where it handles the user interface.
A USB-to-serial adapter works OK on the development machine, but currently you need a real UART on the target machine, and that's not so frequent anymore on recent hardware.
The (terse) kgdb documentation is in the kernel sources, in
Documentation/DocBook
I suggest you google around for "kgdb light" for the complete story.
Again, don't confuse kgdb and kgdb light: they come up together in google searches but are mostly different animals. In particular, info from linsyssoft.com relates to the "ancestor" kgdb, so try queries like:
kgdb module debugging -"linsyssoft.com" -site:linsyssoft.com
and discard articles prior to May 2008 / 2.6.26 kernel.
Finally, for module debugging, you need to manually load the module symbols in the dev machine for all the code and sections you are interested in. That's a bit too long to address here, but some clues there, there and there.
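As a hedged illustration of the manual symbol loading: gdb's add-symbol-file command takes the module's .text load address plus optional "-s section address" pairs, and on the target those addresses can usually be read from /sys/module/<name>/sections (this may need root). The helper name and the module file name below are hypothetical.

```shell
# Print a gdb add-symbol-file command for a loaded module.
# $1 = path to the module's .ko on the development machine (built with debug info)
# $2 = directory holding the section addresses (normally /sys/module/<name>/sections)
module_symbol_cmd() {
    cmd="add-symbol-file $1 $(cat "$2/.text")"
    for s in .data .bss; do
        if [ -r "$2/$s" ]; then
            cmd="$cmd -s $s $(cat "$2/$s")"
        fi
    done
    echo "$cmd"
}

# Example, run on the target and pasted into gdb on the development machine:
#   module_symbol_cmd mymod.ko /sys/module/mymod/sections
```

The addresses change on every load, so the command has to be regenerated each time the module is reinserted.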
Bottom line is, kgdb is a very welcome improvement but don't expect this trip to be as easy as running gdb in user mode. Yet. :)
It has been a while since I was actively developing drivers for Linux, so maybe my answer is a bit out of date. I would say you cannot use GDB, or if at all, only to debug post mortem on dump files. To debug you should rather use a kernel debugger. Build the kernel with a kernel debugger enabled (there is an out-of-the-box debugger for 2.6, which was lacking at the time I was active). I used the KDB kernel patches from SGI, ftp://oss.sgi.com/www/projects/kdb/download/, which I was quite happy with. A user-space tool won't be of much use unless newer gdb versions can somehow communicate with the internal kernel debugger (which you would have to activate anyway).
I hope this gives you at least some hints, even though it is not a detailed answer. Better than no answer at all. Regards.
I suspect what you did was
gdb /boot/vmlinux /proc/kcore
Of course you can't actually do any debugging, but it's certainly good enough to have a poke around the kernel.
