As per the SDIO specification, the sequence of operations for a write transaction is:
Command53 -- CommandLatency -- Command53Response -- ResponseLatency -- startbit -- write-number-of-bytes -- CRC -- endbit -- WriteLatency -- startbit -- CRC -- endbit -- busybit.
While benchmarking the SDIO UART driver, the time values I measured were higher than expected. A lot of latency was found, especially during write transactions.
Possible reasons for the latency include the scheduler allocating processor time to other processes, delays in work queues, and so on.
I would like to analyze and understand this latency. Maybe understanding the mapping between the device driver code and the Logic Analyzer waveform can provide some clue.
Can somebody shed some light on this?
Thank you.
EDIT 1:
Sorry! I assumed a few things.
In sdio_uart_transmit_chars() there is a call to sdio_out(), which in turn calls sdio_writeb(); this writes one byte at a time to an SDIO UART device. I modified the driver to use sdio_writesb() instead, i.e. multi-byte mode. This noticeably reduced the time taken to write X bytes. Interestingly, as the size of the write data increased, the WriteLatency (as mentioned above) increased exponentially.
This latency could be due to many reasons. I would like to understand these reasons.
Setup: I am using a Linux (v2.6.32) laptop and a loadable kernel module (a modified sdio_uart.c).
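For reference, the change was essentially the following (a minimal sketch based on how I read the 2.6.32 sdio_uart.c; the variables come from the surrounding sdio_uart_transmit_chars() context, and chunk-length computation and error handling are omitted):

```c
/* Before: one CMD53 transaction per byte. */
while (count-- > 0) {
	sdio_out(port, UART_TX, xmit->buf[xmit->tail]);
	xmit->tail = (xmit->tail + 1) & (UART_XMIT_SIZE - 1);
}

/* After: a single multi-byte CMD53 for a contiguous run of the
 * circular buffer ('len' computation and error handling omitted). */
sdio_writesb(port->func, port->regs_offset + UART_TX,
	     xmit->buf + xmit->tail, len);
```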
EDIT 2:
Maybe adding 'SDIO' to this question is misleading (I am not sure at the moment). The reasons for the delay could be generic to any device driver interacting with hardware, and may be independent of the SDIO write process.
If somebody can point me to a related online resource, I would be happy to explore it and update the results here.
Hope I added more clarity this time. Please comment if the question is still not clear.
Thank you for your time.
EDIT 3:
Yes, I am looking at the signals on a Logic Analyzer (LA), and there are longer delays during and between writes than I expected.
To give an idea about time values:
For a 512-byte transfer: at the hardware level the write should theoretically take 50 microseconds (us); in reality, however, I measured 200 us.
This gap of 150 us is what I want to understand.
Note:
1) I am rounding off the time values to simplify the case.
2) All the time values are calculated at Kernel level and no user space issue is involved here.
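To show where the theoretical number comes from (my assumption: a 4-bit bus at a 25 MHz SD clock): 512 bytes = 4096 bits; at 4 bits per clock that is 1024 clocks, i.e. about 41 us, which rounds to roughly 50 us once CRC and start/end bits are included.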
One thing worth looking at is whether your SD interface works via DMA, such that the driver can program the state machine and the transfer then runs by itself, or whether getting the message out requires repeated servicing by the driver, which might be delayed by other kernel obligations.
You could also see if there is an I/O bottleneck: for example, is the SD interface, or whatever bus it hangs off, also used for something else?
Finally, you could search for ways to increase the driver's priority. At an extreme, you could switch to a real-time SD driver rather than a normal one.
No amount of experience seems to be sufficient for the strange issues that pop up on serial communication buses. We are trying to implement a data copy from an external flash into SRAM. Below are the details of how we have configured our system.
Controller: RH850 (D1M1), PLL speed at 60 MHz
External flash: IS25LP128
SPI speed: 5 MHz (clocks observed using an oscilloscope)
Data size: 4 MB
Now, in theory, if my SPI is operating at 5 MHz it should copy 5 Mbit/s. We are trying to copy 4 MB, which is 32 megabits, so in theory the transfer should take about 7 seconds. OK, we have some implementation overheads: my driver code can accept only up to 64 KB per read call, so we chose to copy 40 KB at a time, about 100 times, in a for loop. Let me add a whopping 5 seconds of overhead (sorry, RH850!), so 12 seconds in total; well, let's add some more buffer and call it a comfortable 15 seconds (max expected!). But when we run the code, it takes a whole 40 seconds to finish the copy. We have checked the clock, and it is 5 MHz as expected, and at least the clocks are continuous.
Has anyone here faced this? What should we look into? I know I have a flash driver provided by my vendor to dig into, but before I do that, I wanted to be sure. Any help will be really appreciated.
At first glance, I can think of at least 10 things which may be responsible for this. One thing I am sure of: this problem is complex, and there is no simple "one-line solution". The main suspect is what is not yours: the flash driver. So isolate the "pieces" one by one and verify them, starting from the bottom.
Is there an operating system? Is DMA in use? Is there an issue with memory or resource arbitration/sharing? Are interrupts in use, or polling? Are any higher-priority jobs running? Is data read from registers or memory-mapped? Does the driver use a generic SPI peripheral or a dedicated serial-flash controller (I don't know the RH850; some microcontrollers have one)?
Your post is not precise enough, so maybe these questions will help you. What would I do? Write my own driver!
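For example, to isolate the driver, time each chunk individually. A sketch, where flash_read() and timer_now_us() are placeholders for the vendor read call and whatever free-running timer the board has:

```c
#include <stdint.h>

#define CHUNK_BYTES  (40u * 1024u)   /* 40 KB per driver call */
#define NUM_CHUNKS   100u            /* ~4 MB total */

/* Placeholders: substitute the real vendor API and a free-running
 * microsecond timer. */
extern void flash_read(uint32_t src, uint8_t *dst, uint32_t len);
extern uint32_t timer_now_us(void);

void measure_copy(uint32_t flash_addr, uint8_t *sram_dst)
{
	uint32_t worst = 0, total = 0;

	for (uint32_t i = 0; i < NUM_CHUNKS; i++) {
		uint32_t t0 = timer_now_us();
		flash_read(flash_addr + i * CHUNK_BYTES,
			   sram_dst + i * CHUNK_BYTES, CHUNK_BYTES);
		uint32_t dt = timer_now_us() - t0;
		total += dt;
		if (dt > worst)
			worst = dt;
	}
	/* At 5 Mbit/s a 40 KB chunk should take ~65 ms on the wire.
	 * If 'worst' is far above that, the overhead is inside the
	 * driver call; if each call is ~65 ms but 'total' is still
	 * ~40 s, look at the gaps between calls instead. */
	(void)worst; (void)total;   /* report via your debug channel */
}
```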
I want to use the emulated EEPROM feature on the PIC24FJ128GB106 chip, since it does not have internal EEPROM.
However, although it is not clearly mentioned in its datasheet (the AN1095 document), I think data is temporarily stored in a holding latch before the pack operation. If so, data could be lost on a sudden power loss before the pack operation.
Is it right?
Yes, it is possible!
You must ensure that the controller has enough electrical power to finish the FLASH block write. A single FLASH block write takes about 3 ms. So you must use a low-voltage-detect circuit which interrupts the main program and puts the controller into a low-power state so that the FLASH write finishes 100% (use a big enough capacitor on Vdd).
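A rough sketch of the idea in C; none of these names come from the PIC24 headers (check your part's HLVD and NVM documentation for the real registers and ISR attributes):

```c
/* Hypothetical names throughout; the real HLVD/NVM interfaces are
 * device-specific. */
volatile int flash_write_in_progress;   /* set/cleared by the write code */

void __attribute__((interrupt)) hlvd_isr(void)
{
	/* Supply is falling: cut consumption so the Vdd capacitor's
	 * remaining charge can carry the ~3 ms block write through. */
	disable_unused_peripherals();        /* hypothetical helper */
	while (flash_write_in_progress)
		;                            /* let the NVM write finish */
	enter_lowest_power_mode();           /* hypothetical helper */
}
```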
I am designing a microcontroller in VHDL. I am at the point where I understand the role of each component (ALU/memory/...), and I have some ideas on how to realise them. I basically want to implement a Von Neumann architecture.
But here is what I don't get: how do the components communicate? I don't know how to design my bus (or buses?). I am therefore looking for a simple bus implementation and protocol.
My unresolved questions:
Is it simpler to have one bus for everything, or to separate the different kinds of data?
How does each component know when to "listen" and when to "write"?
The emphasis is on the simplicity of the design (and thus of the implementation). I do not care about speed. I want to do everything from scratch (i.e. no pre-made softcore).
I don't know if this is of importance at this stage, but it will not need to run "real" compiled code or have any kind of compatibility with anything existing. Also, at which point do I begin to think about my 'assembly' instructions? I think I will load them directly into the memory.
Thank you for your help.
EDIT:
I ended up drawing (a lot of) inspiration from the PicoBlaze, because it is:
simple to understand
under a BSD Licence
Specifically, I started by adding a few instructions to it.
Since your main concern seems to be learning about microcontroller design, a good approach could be to take a look at some of the earlier microprocessor models. Take, for instance, the Z80:
Source: http://landley.net/history/mirror/cpm/z80.html
Another good Z80 HW description: http://www.msxarchive.nl/pub/msx/mirrors/msx2.com/zaks/z80prg02.htm
To answer your first question (single vs. multiple buses): this chip uses a single bus for everything, and it has a very simple design. You could probably use something similar. To make the terminology clear, a single system bus may be composed of sub-buses (which are also called buses). The figure shows a system bus composed of a bidirectional data bus (8 bits wide) and an address bus (16 bits wide).
To answer your second question (how do components know when they are active): in the image above you see two distinct signals, memory request and I/O request. Only one will be active at a time, and when I/O request is active, that is when a peripheral could potentially be accessed.
If you don't have many peripherals, you don't need to use all 16 address lines (some Z80 systems have an 8-bit I/O space). Each peripheral would be accessed through some addresses in this space. For instance, in a very simple system:
a timer peripheral could use addresses from 00h to 03h
a UART could use addresses from 08h to 0Fh
In this simple example, you need to provide two circuits: one would detect when the address is within the range 00h-03h, and another would do the same for 08h-0Fh. If you do a logical AND between the output of each detector and the I/O request signal, you get two signals indicating when each of the peripherals is being accessed. Your peripheral hardware should primarily listen to these signals.
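In C-like terms (in VHDL this is just two address comparators ANDed with the I/O request signal), a sketch of the decode logic:

```c
#include <stdint.h>

/* C model of the two address decoders described above; each output
 * is true only when I/O request is active and the address falls in
 * that peripheral's window. */
int timer_selected(uint16_t addr, int io_request)
{
	uint8_t a = addr & 0xFF;             /* 8-bit I/O space */
	return io_request && (a <= 0x03);    /* 00h..03h */
}

int uart_selected(uint16_t addr, int io_request)
{
	uint8_t a = addr & 0xFF;
	return io_request && (a >= 0x08) && (a <= 0x0F);  /* 08h..0Fh */
}
```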
Finally, regarding your question about instructions: the dataflow inside your microprocessor will have several stages. This is usually called the processor's datapath. It is common to divide the stages into:
FETCH: read an instruction from program memory
DECODE: check specific bits within the instruction, and decide what type of instruction it is
EXECUTE: take the actions required by the instruction (e.g., ALU operations)
MEMORY: for some instructions, you need to do a data read or write
WRITE BACK: update your CPU registers with new values affected by the instruction
Source: https://www.cs.umd.edu/class/fall2001/cmsc411/projects/DLX/proj.html
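In software terms, the stages above form a simple loop; here is a C sketch with purely illustrative names (decode_opcode(), alu_execute() and so on are not from any real design):

```c
#include <stdint.h>

/* Illustrative types and helpers, not from any real design. In
 * hardware, the Control Unit state machine steps through these
 * stages instead of a loop. */
typedef struct { int is_load, is_store, dest; /* ... */ } op_t;
extern uint16_t program_mem[];
extern uint8_t  regs[16];
extern op_t    decode_opcode(uint16_t instr);
extern uint8_t alu_execute(op_t op, uint8_t *regs);
extern uint8_t data_mem_access(op_t op, uint8_t value);

void run(void)
{
	uint16_t pc = 0;
	for (;;) {
		uint16_t instr = program_mem[pc++];           /* FETCH      */
		op_t op        = decode_opcode(instr);        /* DECODE     */
		uint8_t result = alu_execute(op, regs);       /* EXECUTE    */
		if (op.is_load || op.is_store)
			result = data_mem_access(op, result); /* MEMORY     */
		regs[op.dest]  = result;                      /* WRITE BACK */
	}
}
```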
Most of your work in dealing with individual instructions will be done in the DECODE and EXECUTE stages. As for the datapath control, you will need a state machine that controls the sequence of operations through the five stages. This functional block is usually called a Control Unit. Here you have a few choices:
Your state machine could go through all stages sequentially, one at a time. An instruction would take several clock cycles to execute.
Similar to the choice above, but combining two or more stages in a single cycle if you want to make things simpler and faster.
Pipeline the execution of instructions. This can give a great speed boost, but maybe it's better left for later because things can get quite complex.
As for the implementation, I recommend keeping the functional blocks as separate entities, and making sure you write a testbench for each block. Your job will go faster if you write those testbenches.
As for the blocks, the Register File is pretty easy to code. The Instruction Decoder is also easy if you have a clear idea of your instruction layout and opcodes. And the ALU is also easy if you know the operations it needs to perform.
I would start by writing testbenches for the Instruction Decoder and the Register File. Then I would write a script that runs all the testbenches and checks their results automatically. Only then would I focus on the implementation of the functional blocks themselves.
Basically, on-chip buses use parallel buses for address and data input and output. Usually there will be some kind of arbiter which decides which component is allowed to write to the bus. So a common approach is:
The component that wants to write sets a request line connected to the arbiter high or low to signal that it wants to access the bus.
The arbiter decides who gets access to the bus
The arbiter sets the chip select of the component that is allowed to access the bus next.
Usually your on-chip bus will use a master/slave concept, so only masters can initiate access to the bus. The slaves only wait for requests from a master.
I for one like the AMBA AHB/APB design, but it might be a little over the top for your application. You can have a look at this book for ideas on how to implement your bus.
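As a toy C model of the request/grant idea (plain fixed priority, nothing AMBA-specific):

```c
#define N_MASTERS 3

/* Each master raises request[i]; the arbiter asserts exactly one
 * grant (chip select). Index 0 has the highest priority. */
void arbitrate(const int request[N_MASTERS], int grant[N_MASTERS])
{
	int granted = 0;
	for (int i = 0; i < N_MASTERS; i++) {
		grant[i] = !granted && request[i];
		if (grant[i])
			granted = 1;
	}
}
```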
I am on Linux kernel 2.6.32.
I am facing an issue in which one of two ISRs (serial and Ethernet) takes more time (hundreds of microseconds) on several occasions, under some scenarios which I cannot pin down. I would like to get the time difference every time the ISR executes.
What would be the best way to do this (least expensive in terms of overhead)? I don't see that the ARM architecture has a TSC register (a read_tsc API) which would give me direct access to the time, as some other architectures offer.
So the idea is:
1) The moment the ISR is invoked, measure the time.
2) The moment the ISR completes, measure the time.
3) Get the difference of 1 and 2 and store it in some variable.
4) Keep doing steps 1 to 3, and when the value obtained in step 3 is greater than the stored value, overwrite it (keep/preserve the value with maximum latency).
When the issue happens (some abrupt condition), print the value (or an array of the last 10 values).
I need to do this in a kernel driver, so let me know what the least expensive way would be; a sketch of what I mean is below.
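Something like this (a sketch of the idea using ktime_get(), which is available on 2.6.32; my_isr() stands in for the real handler, and note that each ktime_get() call costs a clocksource read):

```c
#include <linux/ktime.h>
#include <linux/interrupt.h>

static s64 isr_max_ns;                 /* worst-case duration seen */

static irqreturn_t my_isr(int irq, void *dev_id)
{
	ktime_t t0, t1;
	s64 dt;

	t0 = ktime_get();              /* step 1: entry timestamp */

	/* ... existing ISR body ... */

	t1 = ktime_get();              /* step 2: exit timestamp  */
	dt = ktime_to_ns(ktime_sub(t1, t0));
	if (dt > isr_max_ns)           /* steps 3-4: keep the max */
		isr_max_ns = dt;
	return IRQ_HANDLED;
}
```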
The OMAP3 has a Cortex-A8 core, which does have a Performance Monitoring Unit (PMU). Its Cycle Counter (CCNT) corresponds to the x86 TSC, except that you probably have to enable counting before you read it. There is good info in the BeagleBoard post.
In 2.6.32.55 I see that arch/arm/oprofile/op_model_v7.c gives full access and control. My need was bare-metal; I used ARM example code that was simple and worked for me.
It would also be possible to use an OMAP3 GPT, but that would be more work, e.g. to get its clock input set up from the PRCM.
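For completeness, enabling and reading CCNT boils down to a few coprocessor accesses; a sketch using the standard ARMv7 PMU encodings (must run in a privileged mode, which a kernel driver does):

```c
/* Standard ARMv7 PMU registers: PMCR (c9,c12,0), PMCNTENSET
 * (c9,c12,1), PMCCNTR (c9,c13,0). CCNT wraps every 2^32 cycles,
 * i.e. every few seconds at OMAP3 clock rates. */
static inline void ccnt_enable(void)
{
	u32 pmcr;
	asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r" (pmcr));
	pmcr |= 0x5;   /* bit 0: enable counters, bit 2: reset CCNT */
	asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r" (pmcr));
	/* bit 31 of PMCNTENSET starts the cycle counter */
	asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r" (1u << 31));
}

static inline u32 ccnt_read(void)
{
	u32 ccnt;
	asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (ccnt));
	return ccnt;
}
```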
I'm currently learning C for my next emulation project, a cycle-accurate 68000 core (my last project was a non-cycle-accurate Sega Master System emulator written in Java, which is now on its third release). My query concerns cycle-level accuracy, as taking things to this level is a new thing for me.
To break things down to a granularity of one CPU cycle, presumably I need to know how long memory accesses take and so on; but my question is, for instructions that take multiple cycles in their memory fetch/write stages, what is the CPU doing in each cycle? For example, are a certain number of bits copied per cycle?
With my SMS emulator I didn't have to worry too much about M1 stages etc., as it just used a cycle count for each instruction; in other words, it is only accurate to the instruction level, not the cycle level. I'm not looking for architecture-specific details, merely an idea of what sort of things I should look out for when going to this level of granularity.
68k details are welcome, however. Basically, I'm wondering what is supposed to happen if a video chip reads from an area of memory while the CPU is still writing data to it midway through that phase of an instruction, and other similar situations. I hope I've made it clear enough; thank you.
For a really cycle-accurate emulation, you first have to decide on a master clock to use as the reference. It should be the fastest clock at whose granularity the running software can detect differences in order of occurrence. This could be the CPU clock, but in most cases the bus cycle time decides at which granularity events can be discerned (and that is often only a fraction of the CPU clock).
Then you need to find out the precedence order of the different devices (ICs) connected to that bus (if there is more than one bus master). An example would be whether (and how) video DMA can delay the CPU.
Generally, there are no at-the-same-time events. Either the CPU writes before the DMA reads, or the other way around (that is still true for dual-ported devices; you just need to consider the device's inherent precedence mechanism).
Once you have a solid understanding of which clock effectively controls the granularity of discernible events, you can think about how to structure the emulator to reproduce that behaviour exactly.
This way you can create a 100% cycle-exact emulation, given that you have enough information about all the devices' behavior.
Sorry I can't give you more detailed info; I know nothing about the specifics of Sega's hardware.
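To make that concrete, the usual shape is a single loop stepped at the master clock, with devices ticked in precedence order; a sketch with illustrative names (nothing here is Sega- or 68000-specific):

```c
/* All device models advance one bus cycle per iteration, in a fixed
 * precedence order, so "simultaneous" events always resolve the
 * same way. Names are illustrative. */
extern int  dma_tick(void);     /* returns nonzero if it stalls the CPU */
extern void cpu_tick(void);     /* advances the CPU core by one cycle   */
extern void video_tick(void);   /* reads memory only when bus is free   */

void run_machine(void)
{
	unsigned long bus_cycle = 0;
	for (;;) {
		int cpu_stalled = dma_tick();  /* highest precedence */
		if (!cpu_stalled)
			cpu_tick();
		video_tick();
		bus_cycle++;
	}
}
```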
My guess is that you don't have to go into excruciating detail to get good enough timing results for this sort of thing, which you can't do anyway if you don't want to get into the specifics of the architecture.
Your main question seemed to be "what is supposed to happen if a video chip reads from an area of memory whilst a CPU is still writing the data to it". Generally, on these older chips the bus protocols are pretty simple (they're not packetized), and there is usually a pin that indicates that the bus is busy. So if the CPU is writing to memory, the video chip simply has to wait until the CPU is done. Because of these sorts of limitations, dual-ported RAM was popular for a while, so that the frame buffer could be simultaneously written by the CPU and read by the RAMDAC.