No amount of experience seems sufficient for the strange issues that pop up on serial communication buses. We are trying to copy data from an external flash into SRAM. Below are the details of how our system is configured.
Controller: RH850 (D1M1), PLL speed 60 MHz
External flash: IS25LP128
SPI speed: 5 MHz (clock verified with an oscilloscope)
Data size: 4 MB
Now, in theory, if the SPI is running at 5 MHz it should move 5 Mbit/s. We are copying 4 MB, i.e. 32 Mbit, so the raw transfer should take about 6.5 seconds. OK, there is some implementation overhead: my driver code can accept only up to 64 KB per read call, so we copy 40 KB at a time, about 100 times, in a for loop. Let me add a whopping 5 seconds of overhead (sorry, RH850!), for a total of 12 seconds; let's add some more margin and call it a comfortable 15 seconds (max expected!). But when we run the code, it takes a whole 40 seconds to finish the copy. We have checked the clock: it is 5 MHz as expected, and the clock bursts are at least continuous.
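For reference, the copy loop looks roughly like this; I have added per-chunk timing so we can see where the time goes. flash_read(), systick_ms() and record_chunk_time() are placeholders for our vendor API and board-support code, not real names:

```c
#include <stdint.h>

#define CHUNK_SIZE  (40u * 1024u)           /* 40 KB per driver call */
#define TOTAL_SIZE  (4u * 1024u * 1024u)    /* 4 MB in total         */

extern int      flash_read(uint32_t flash_addr, uint8_t *dst, uint32_t len);
extern uint32_t systick_ms(void);           /* hypothetical ms tick   */
extern void     record_chunk_time(uint32_t offset, uint32_t ms);

void copy_flash_to_sram(uint8_t *sram_dst)
{
    for (uint32_t offset = 0; offset < TOTAL_SIZE; offset += CHUNK_SIZE) {
        uint32_t len = (TOTAL_SIZE - offset < CHUNK_SIZE)
                     ? (TOTAL_SIZE - offset) : CHUNK_SIZE;
        uint32_t t0 = systick_ms();
        flash_read(offset, sram_dst + offset, len);
        uint32_t t1 = systick_ms();
        /* Per-chunk timing shows whether the time is lost inside the
           driver call or between calls (setup/teardown overhead). */
        record_chunk_time(offset, t1 - t0);
    }
}
```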
Has anyone here faced this? Where should we look? I know I have a vendor-supplied flash driver to dig into, but before I do that I wanted to be sure. Any help will be really appreciated.
At first glance, I can think of at least ten things that may be responsible for this. One thing I am sure of: this problem is complex, and there is no simple "one line solution". The main suspect is the part that is not yours: the flash driver. So isolate the pieces one by one and verify them, starting from the bottom.
- Is there an operating system?
- Is DMA in use?
- Is there an issue with memory or resource arbitration/sharing?
- Are interrupts in use, or polling?
- Are any higher-priority jobs running?
- Is data read from registers or memory-mapped?
- Does the driver use a generic SPI peripheral or a dedicated serial-flash interface (I don't know the RH850; some microcontrollers have one)?
Your post is not precise enough, so maybe these questions will help you. What would I do? Write my own driver!
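A sketch of what I mean: a bare-bones read path is only a few lines, so timing it against the vendor driver immediately shows how much overhead the driver itself adds. The spi_* hooks below are placeholders for whatever the RH850's serial peripheral provides; 0x03 is the standard SPI NOR READ command, which parts like the IS25LP128 support:

```c
#include <stdint.h>

/* Placeholder low-level hooks; the real ones depend on the
   RH850's SPI peripheral and the board wiring. */
extern void    spi_cs_low(void);
extern void    spi_cs_high(void);
extern uint8_t spi_xfer(uint8_t out);      /* full-duplex byte transfer */

void raw_flash_read(uint32_t addr, uint8_t *dst, uint32_t len)
{
    spi_cs_low();
    spi_xfer(0x03);                        /* READ command              */
    spi_xfer((uint8_t)(addr >> 16));       /* 24-bit address, MSB first */
    spi_xfer((uint8_t)(addr >> 8));
    spi_xfer((uint8_t)addr);
    while (len--)
        *dst++ = spi_xfer(0xFF);           /* clock out the data bytes  */
    spi_cs_high();
}
```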
Related
I am using an ESP32 module for BLE & WiFi functionality, and I am writing data to the ESP32's EEPROM every 2 seconds.
How many read/write cycles are allowed as per the ESP32's standard specifications? Based on that, I need to calculate the EEPROM lifetime and the number of readings (and the frequency) I can store.
The ESP32 doesn’t have an actual EEPROM; instead it uses some of its flash storage to mimic an EEPROM. The specs will depend on the specific SPI flash chip, but they’re likely to be closer to 10,000 cycles than 100,000. Writing to it every couple of seconds will likely wear it out pretty quickly - it’s not a good design choice, especially if you keep rewriting the same location.
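If you stay on internal flash anyway, the usual mitigation is to let ESP-IDF's NVS library spread the writes across a flash partition (wear levelling) instead of rewriting one emulated-EEPROM location. A minimal sketch, assuming a recent ESP-IDF (error handling omitted; nvs_flash_init() normally runs once at startup):

```c
#include "nvs_flash.h"
#include "nvs.h"

/* Store one reading under a key; NVS wear-levels across its partition. */
void save_reading(uint32_t value)
{
    nvs_handle_t handle;

    nvs_flash_init();                           /* once at startup, normally */
    nvs_open("storage", NVS_READWRITE, &handle);
    nvs_set_u32(handle, "reading", value);
    nvs_commit(handle);                         /* flush to flash */
    nvs_close(handle);
}
```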
I'm very late here, but an SD card seems like the ideal option for you. If you want to save just a few bytes, you can use FeRAM (also called FRAM). It's a cross between RAM and ROM: it's fast, and the data stays on it after power off. It is pretty expensive, though, so you might want to go with the SD card or web-server option. I just wanted to point out that this exists; I only learned about it myself a few months ago.
At that write rate, even an automotive-grade EEPROM like the 24LC001, which supports at least 1,000,000 writes, will only last about three weeks (10^6 writes × 2 s ≈ 23 days)!
I think Microchip has EERAM, which supports unlimited writes and will not lose its contents on power loss.
Check Microchip's 47L series.
I want to use the emulated EEPROM feature on the PIC24FJ128GB106 chip, since it does not have an internal EEPROM.
However, although it is not clearly stated in the documentation (application note AN1095), I think the data is temporarily stored in a holding latch before the pack operation. If so, the data could be lost on a sudden power loss before the pack operation completes.
Is that right?
Yes, it is possible!
You must ensure that the controller has enough electrical power to finish the FLASH block write. A single FLASH block write takes about 3 ms, so you must use a low-voltage-detect circuit that interrupts the main program and puts the controller into low power consumption so the FLASH write can finish 100% (use a big enough capacitor on Vdd).
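A rough sketch of that idea in code; the interrupt hook and the two helpers are placeholders, not real PIC24 registers or library calls:

```c
#include <stdint.h>

extern void disable_noncritical_peripherals(void);  /* hypothetical */
extern void flash_finish_pending_write(void);       /* hypothetical */

volatile uint8_t power_failing = 0;

/* Fires when the low-voltage-detect circuit sees Vdd dropping. */
void lvd_interrupt_handler(void)
{
    power_failing = 1;
    disable_noncritical_peripherals();  /* cut current draw so the Vdd   */
                                        /* capacitor can carry the ~3 ms */
    flash_finish_pending_write();       /* complete the in-flight block  */
                                        /* write before power is gone    */
}
```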
I am new to these NVIDIA APIs and some expressions are not so clear to me. I was wondering if somebody could help me understand, in a simple way, when and how to use these CUDA commands. To be more precise:
While studying how applications can be sped up by parallel execution of a kernel (with CUDA, for example), at some point I ran into the problem of speeding up the host-device interaction.
I have some information, gathered from the web, but I am a little bit confused.
It is clear that you can go faster when it is possible to use cudaHostRegister() and/or cudaHostAlloc(). Here it is explained that
"you can use the cudaHostRegister() command to take some data (already allocated) and pin it, avoiding an extra copy when moving it onto the GPU".
What is the meaning of "pin the memory"? Why is it so fast? How can I pin memory that has already been allocated? Later in the same linked video, they continue explaining that
"if you are transferring PINNED memory, you can use the asynchronous memory transfer, cudaMemcpyAsync(), which let's the CPU keep working during the memory transfer".
Are the PCIe transactions managed entirely by the CPU? Is there a bus manager that takes care of this?
Partial answers are also really appreciated; I can piece the puzzle back together at the end.
Links to the equivalent OpenCL APIs would also be appreciated.
What is the meaning of "pin the memory"?
It means making the memory page-locked: telling the operating system's virtual memory manager that those pages must stay in physical RAM so that the GPU can access them directly across the PCI Express bus.
Why is it so fast?
In one word: DMA. When the memory is page-locked, the GPU's DMA engine can run the transfer directly, without requiring the host CPU, which reduces overall latency and decreases net transfer times.
Are the PCIe transactions managed entirely by the CPU?
No. See above.
Is there a bus manager that takes care of this?
No. The GPU manages the transfers; in this context there is no separate bus manager.
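To make that concrete, here is a minimal sketch of the pinned + asynchronous pattern using standard CUDA runtime calls (the buffer size and single stream are arbitrary choices):

```c
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;
    float *h_buf, *d_buf;
    cudaStream_t stream;

    /* cudaHostAlloc() returns page-locked (pinned) host memory, so the
       GPU's DMA engine can transfer it directly. For memory already
       allocated with malloc(), cudaHostRegister() pins it after the fact. */
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, bytes);
    cudaStreamCreate(&stream);

    /* Asynchronous copy: the call returns immediately and the CPU is
       free to do other work while the DMA transfer proceeds. */
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    /* ... CPU work can overlap with the transfer here ... */

    cudaStreamSynchronize(stream);  /* wait for the copy to complete */

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```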
EDIT: It seems CUDA treats pinned and page-locked as the same, as per the Pinned Host Memory section in this blog post written by Mark Harris. This means my answer is moot and the best answer should be taken as is.
I bumped into this question while looking for something else. For all future users, I think talonmies answers the question perfectly, but I'd like to point out a slight difference between locking and pinning pages: the former ensures that the memory is not pageable, but the kernel is still free to move it around, while the latter ensures that it not only stays in memory (i.e. is non-pageable) but also remains mapped at the same address.
Here's a reference to the same.
I'm currently learning C for my next emulation project, a cycle-accurate 68000 core (my last project being a non-cycle-accurate Sega Master System emulator written in Java, now on its third release). My question concerns cycle-level accuracy, since taking things to this level is new to me.
To break things down to a granularity of one CPU cycle, presumably I need to know how long memory accesses take and so on. But my question is: for instructions that take multiple cycles in their memory fetch/write stages, what is the CPU doing in each cycle - for example, is a fixed number of bits copied per cycle?
With my SMS emulator I didn't have to worry too much about M1 stages etc., as it just used a cycle count for each instruction - in other words, it is only accurate to an instruction level, not a cycle level. I'm not looking for architecture-specific details, merely an idea of what sort of things I should look out for when going to this level of granularity.
68k details are welcome, however. Basically, I'm wondering what is supposed to happen if a video chip reads from an area of memory while the CPU is still midway through writing data to it during that phase of an instruction, and other similar situations. I hope I've made it clear enough; thank you.
For a really cycle-accurate emulation you first have to decide on a master clock to use as a reference. That should be the fastest clock at whose granularity the running software can detect differences in order of occurrence. This could be the CPU clock, but in most cases the bus cycle time determines the granularity at which events can be discerned (and that is often only a fraction of the CPU clock).
Then you need to find out the precedence order of the different devices (ICs) connected to that bus (if there is more than one bus master). An example would be whether (and how) video DMA can delay the CPU.
Generally there are no "at the same time" events. Either the CPU writes before the DMA reads, or the other way around (this is still true for dual-ported devices; you just need to consider the device's inherent precedence mechanism).
Once you have a solid understanding of which clock effectively controls the granularity of discernible events, you can think about how to structure the emulator to reproduce that behaviour exactly.
This way you can create a 100% cycle-exact emulation, provided you have enough information about all the devices' behaviour.
Sorry I can't give you more detailed info, I know nothing about the specifics of the Sega's hardware.
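But to illustrate the structure, here is a toy sketch of an emulator loop built around a master clock; every name is made up, and nothing here is Sega- or 68k-specific:

```c
#include <stdint.h>

/* Each device advances by one master-clock tick per call. */
typedef struct {
    void (*tick)(void *ctx);   /* advance the device by one tick */
    void *ctx;
} device_t;

/* One iteration per master-clock tick. Devices are ticked in a fixed
   order, which encodes the bus-precedence rule: here the video DMA
   "wins" by running before the CPU on any given tick, so a DMA access
   can stall the CPU but never the other way around. */
void run(device_t *video, device_t *cpu, uint64_t ticks)
{
    for (uint64_t t = 0; t < ticks; t++) {
        video->tick(video->ctx);
        cpu->tick(cpu->ctx);
    }
}
```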
My guess is that you don't have to go into excruciating detail to get good-enough timing results for this sort of thing - which you can't do anyway if you don't want to get into the specifics of the architecture.
Your main question seemed to be "what is supposed to happen if a video chip reads from an area of memory whilst a CPU is still writing the data to it". Generally, on these older chips the bus protocols are pretty simple (they are not packetized) and there is usually a pin that indicates that the bus is busy. So if the CPU is writing to memory, the video chip simply has to wait until the CPU is done. Because of these sorts of limitations, dual-ported RAM was popular for a while, so that the frame buffer could be simultaneously written by the CPU and read by the RAMDAC.
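That bus-busy rule is also easy to model in an emulator. A toy sketch (all names hypothetical): the video fetch does not happen "at the same time" as the CPU write; it is simply pushed to a later tick:

```c
#include <stdint.h>

typedef struct {
    uint8_t ram[0x10000];
    int     cpu_owns_bus;      /* set for the duration of a CPU bus cycle */
} machine_t;

extern void advance_one_tick(machine_t *m);   /* steps the master clock */

uint8_t video_fetch(machine_t *m, uint16_t addr)
{
    while (m->cpu_owns_bus)
        advance_one_tick(m);   /* the video chip waits; time still passes */
    return m->ram[addr];
}
```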
As per the SDIO specification, the sequence of operations for a write transaction is:
Command53 -- CommandLatency -- Command53Response -- ResponseLatency -- startbit -- write-number-of-bytes -- CRC -- endbit -- WriteLatency -- startbit -- CRC -- endbit -- busybit.
While benchmarking the SDIO UART driver, the time values I measured were higher than expected. A lot of latency was found, especially during write transactions.
Reasons for the latency could include the scheduler allocating processor time to other processes, delays in work queues, etc.
I would like to analyze and understand this latency. Perhaps understanding the mapping between the device-driver code and the logic-analyzer waveform can provide a clue.
Can somebody shed some light on this?
Thank you.
EDIT 1:
Sorry! I assumed a few things.
In sdio_uart_transmit_chars() there is a call to sdio_out(), which in turn calls sdio_writeb(), and this call writes byte-wise (one byte at a time) to the SDIO UART device. I modified the driver to use sdio_writesb() instead, i.e. multi-byte mode. This reduced the time taken to write X bytes considerably. Interestingly, as the size of the write data increased, the WriteLatency (as mentioned above) grew exponentially.
This latency could be because of many reasons. I would like to understand these reasons.
Setup: I am using a Linux (v2.6.32) laptop and a loadable kernel module (a modified sdio_uart.c).
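For reference, this is roughly how I time the write path inside the module. sdio_writesb() is the real kernel call; the wrapper and the log line are my own additions, simplified here:

```c
#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/mmc/sdio_func.h>

/* Time a single multi-byte write and log its duration. */
static void timed_write(struct sdio_func *func, unsigned int addr,
                        void *buf, int len)
{
    ktime_t start, end;
    int ret;

    start = ktime_get();
    ret = sdio_writesb(func, addr, buf, len);   /* multi-byte write */
    end = ktime_get();

    printk(KERN_INFO "sdio write of %d bytes: %lld us (ret=%d)\n",
           len, (long long)ktime_to_us(ktime_sub(end, start)), ret);
}
```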
EDIT 2:
Maybe adding 'SDIO' to this question is misleading (not sure at the moment). The reasons for the delay could be generic to any device driver interacting with hardware and may be independent of the SDIO write process.
If somebody can point me to a related online resource, I would be happy to explore it and update the results here.
I hope I have added more clarity this time. Please comment if the question is still not clear.
Thank you for your time.
EDIT 3:
Yes, I am looking at the signals on a logic analyzer (LA), and there are longer delays during and between writes than I expected.
To give an idea of the time values:
For a 512-byte transfer: at the hardware level the write should theoretically take 50 microseconds (us); in reality I measured 200 us.
This gap of 150 us is what I want to understand.
Note:
1) I am rounding off the time values to simplify the case.
2) All the time values are measured at kernel level; no user-space issue is involved here.
One thing worth looking at is whether your SD interface works via DMA, so that the driver can program the state machine and the transfer then runs by itself, or whether getting the message out requires repeated servicing by the driver, which might be delayed by other kernel obligations.
You could also check whether there is an I/O bottleneck: for example, is the SD interface, or whatever bus it hangs off, also used for something else?
Finally, you could look for ways to increase the priority. At the extreme, you could switch to a real-time SD driver rather than a normal one.