How can you prevent Kristen FPGA from implementing excess registers?

I am using Kristen to generate a Verilog FPGA host interface for a neuromorphic processor. I have implemented the basic host as follows:
<module name="nmp" duplicate="1000">
<register name="start0" type="rdconst" mask="0xFFFFFFFF" default="0x00000000" description="Lower 32 bits of the 64-bit start pointer of persistent NMP storage."></register>
<register name="start1" type="rdconst" mask="0xFFFFFFFF" default="0x00000020" description="Upper 32 bits of the 64-bit start pointer of persistent NMP storage."></register>
<register name="size" type="rdconst" mask="0xFFFFFFFF" default="0x10000000" description="Size of persistent NMP storage in Mbytes."></register>
<register name="c_start0" type="rdconst" mask="0xFFFFFFFF" default="0x10000000" description="Lower 32 bits of the 64-bit start pointer of cached shared storage."></register>
<register name="c_start1" type="rdconst" mask="0xFFFFFFFF" default="0x00000020" description="Upper 32 bits of the 64-bit start pointer of cached shared storage."></register>
<register name="c_size" type="rdconst" mask="0xFFFFFFFF" default="0x10000000" description="Size of cached shared storage in Mbytes."></register>
<register name="row" type="rdwr" mask="0xFFFFFFFF" default="0x00000000" description="Configurable row location for this NMP."></register>
<register name="col" type="rdwr" mask="0xFFFFFFFF" default="0x00000000" description="Configurable col location for this NMP."></register>
<register name="threshold" type="rdwr" mask="0xFFFFFFFF" default="0x00000000" description="Configurable synaptic sum threshold for this instance."></register>
<memory name="learn" memsize="0x00004000" type="mem_ack" description="Learning interface - Map input synapses to node intensity">
<field name="input_id" size="22b" description="Input ID this map data is intended for."></field>
<field name="scale" size="16b" description="The intensity scale for this input ID."></field>
</memory>
</module>
The end result is that I am seeing a ton of registers being generated, and I have to scale my NMP count down to fit within the constraints of my FPGA. Is there a way to control the number of registers being generated here? Obviously I need to store settings for these different fields. Am I missing something?
I should add that I am trying to get to a 2048 scale on my NMP, but the best I can do is just over 1000, and not quite 1024. If I implement without PCIe or host control, I can get to 2048 without issue.

If I understand correctly, each NMP instance has been coded with an internal register to store its data, and the configuration you have shown will result in Kristen creating Verilog registers as well. Effectively, there is double-buffered storage occurring.
Because of this, the number of registers is roughly double what it needs to be. One way of dealing with the situation you describe is to use another RAM interface, 32 bits wide, in place of the registers. I do note that your config calls for 9 x 32-bit words, which is an odd size for a memory, so there will be some wasted address space: Kristen creates RAMs on binary boundaries, so you get a 16 x 32-bit memory region that you can overlay on that interface. Then keep a second RAM, just as you already have, for the learn memory.
<module>
<memory name="regs" memsize="0x10" type="mem_ack" description="Register mapping per NMP instance">
<field name "start0" size="32b" description="Start0"></field>
<field name "start1" size="32b" description="Start1"></field>
....
<field name "threshold" size="32b" description="Threshold"></field>
</memory>
<memory name="learn" memsize="0x00004000" type="mem_ack" description="Learning interface - Map input synapsys to node intensity">
<field name="input_id" size="22b" description="Input ID this map data is intended for."></field>
<field name="scale" size="16b" description="The intensity scale for this input ID."></field>
</memory>
</module>
Generate this and take a look at the new interface. That should reduce the number of registers generated in your Verilog code and subsequent synthesis.
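On the host software side, the overlay then becomes simple pointer arithmetic. Here is a minimal, hypothetical C sketch; the base pointer, the 16-word stride, and the offset names are my own illustration, not Kristen's actual generated address map, so check the map Kristen emits:

#include <stdint.h>

/* Hypothetical word offsets inside each instance's 16 x 32-bit "regs"
 * RAM (only the first 9 words are used); the real offsets come from
 * the Kristen-generated address map. */
enum nmp_reg {
    NMP_START0 = 0, NMP_START1, NMP_SIZE,
    NMP_C_START0, NMP_C_START1, NMP_C_SIZE,
    NMP_ROW, NMP_COL, NMP_THRESHOLD,
};

#define NMP_REGS_STRIDE 16 /* words: the RAM rounded up to a binary boundary */

/* Write one register of one NMP instance through the memory overlay. */
static inline void nmp_write(volatile uint32_t *regs_base, unsigned instance,
                             enum nmp_reg reg, uint32_t value)
{
    regs_base[instance * NMP_REGS_STRIDE + reg] = value;
}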


What determines bv_len inside the BIO structure (for an I/O request)?

I built a RAM-based virtual block device driver with the blk-mq API that uses none for the I/O scheduler. I am running fio to perform random reads/writes on the device and noticed that bv_len in each bio request is always 1024 bytes. I am not aware of any place in the code that sets this value explicitly. The file system is ext4.
Is this a default config, or something I could change in code?
I am not aware of any place in the code that sets this [bv_len] value explicitly.
In a 5.7 kernel, isn't it set explicitly in __bio_add_pc_page() and __bio_add_page() (both within block/bio.c)? You'll have to trace back through the callers to see how the passed len was set, though.
(I found this by searching for the bv_len identifier in LXR and then going through the results.)
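For reference, the assignment is easy to spot. This is an abridged paraphrase of __bio_add_page() from block/bio.c in a 5.7-era kernel (warnings and flag handling stripped), so bv_len is simply whatever len the caller passed down:

/* Abridged paraphrase of block/bio.c:__bio_add_page() (v5.7-era). */
void __bio_add_page(struct bio *bio, struct page *page,
		unsigned int len, unsigned int off)
{
	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt];

	bv->bv_page = page;
	bv->bv_offset = off;
	bv->bv_len = len;	/* bv_len is set here, from the caller's len */

	bio->bi_iter.bi_size += len;
	bio->bi_vcnt++;
}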
However, #stark's comment about tune2fs is the key to any answer. You never told us the filesystem block size, and if your block device is "small", it's likely your filesystem is also small, and by default the choice of block size depends on that. If you read the mke2fs man page, you will see it says the following:
-b block-size
Specify the size of blocks in bytes. Valid block-size values are 1024, 2048 and 4096 bytes per block. If omitted, block-size is heuristically determined by the filesystem size and the expected usage of the filesystem (see the -T option).
[...]
-T usage-type[,...]
[...]
If this option is not specified, mke2fs will pick a single default usage type based on the size of the filesystem to be created. If the filesystem size is less than or equal to 3 megabytes, mke2fs will use the filesystem type floppy. [...]
And if you look in the default mke2fs.conf, the blocksize for floppy is 1024.
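To check what block size your filesystem actually ended up with, run tune2fs -l on the device (it prints a "Block size:" line straight from the superblock, which is what #stark was pointing at), or query it from userspace; a minimal sketch:

#include <stdio.h>
#include <sys/statvfs.h>

/* Print the block size of the filesystem backing the given path. */
int main(int argc, char **argv)
{
	struct statvfs vfs;

	if (argc < 2 || statvfs(argv[1], &vfs) != 0) {
		perror("statvfs");
		return 1;
	}
	printf("block size: %lu bytes\n", (unsigned long)vfs.f_bsize);
	return 0;
}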

Register usage during compilation

I found information that general-purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables, but do they have any other purpose? Also, which memory are these registers part of (cache/RAM)?
Finally, what does the global pointer gp in register r26 point to?
Also, which memory are these registers part of (cache/RAM)?
Registers are on-processor storage allowing fast data transfer (two reads and one write per cycle). They hold values that can represent memory addresses but, besides that, they are completely unrelated to memory or cache.
I found information that general purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables, but do they have any other purpose?
Registers are used according to hardware or software conventions. Hardware conventions are tied to the instruction set architecture: for instance, the call instruction transfers control to a subroutine and stores the return address in register r31 (ra). Very nasty things are likely to happen if you overwrite r31 by any means without precautions. Software conventions are meant to ensure proper behavior when followed consistently across software: they indicate which registers have special uses, which must be saved when context switching, etc. These conventions can be changed without hardware modifications, but doing so will probably require changes in several software tools (compiler, linker, loader, OS, ...).
general purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables
Actually, some registers are reserved.
r1 is used by the assembler for macro expansion. (sw)
r2-r7 are used by the compiler to pass arguments to functions and to get return values. (sw)
r24-r25 can only be used by exception handlers. (sw)
r26-r28 hold different pointers (global, stack, frame) that are set either by the runtime or the compiler and must not be modified by the programmer. (sw)
r29-r31 are hardware-coded return addresses for subprograms or interrupts/exceptions. (hw)
So only r8-r23 can be freely used by the compiler.
but do they have any other purpose?
No, and that's why they can be freely used by the compiler or programmer.
Finally, what does the global pointer in register r26 point to?
Loads and stores use base-plus-displacement memory addressing. The effective address for ldx or stx (where 'x' is b, bu, h, etc., depending on the data characteristics) is computed by adding a register and a 16-bit immediate. This only allows reaching an address within +/-32K of the register's content.
If the processor has the address of a var in a register (for instance, the value returned by a malloc), the immediate allows a displacement to access fields in a struct, the next array value, etc., as in the sketch below.
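A small illustration in C (the commented assembly is illustrative Nios II-style syntax, not verified compiler output):

struct node {
    int value;      /* offset 0 */
    int weight;     /* offset 4 */
};

int get_weight(const struct node *n)
{
    /* One load: effective address = register holding n + immediate 4,
     * roughly "ldw r2, 4(r4)" on a Nios II-style target. */
    return n->weight;
}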
If the address is that of a local or global var, it must be computed by the program. The pointer registers serve that purpose. Local var addresses are computed by adding an immediate to the stack pointer (r27, or sp).
Addresses of global or static vars are computed by adding an immediate to the global pointer (r26, or gp). The content of gp corresponds to the start of the memory data segment; it is initialized by the loader just before program execution and must not be modified. The immediate displacement with respect to the start of the data segment is computed by the linker when it defines the memory layout.
Note that this only allows access to 64K of memory because of the 16-bit immediate width. If the total size of global/static variables exceeds this value and a var falls outside this range, a couple of instructions are required to build the 32 bits of the var's address before the data transfer. With gp this is not required, which is how it provides faster access to global variables.
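Concretely (again with illustrative, unverified Nios II-style assembly in the comments):

int counter;            /* placed in the data segment, reachable from gp */

int read_counter(void)
{
    /* Within +/-32K of gp, one instruction suffices, roughly:
     *     ldw   r2, %gprel(counter)(gp)
     * Outside that window, the address must be built first:
     *     movhi r3, %hiadj(counter)
     *     addi  r3, r3, %lo(counter)
     *     ldw   r2, 0(r3)
     */
    return counter;
}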

Speeding up "Mapcache_seed" with Mapserver

I'm using mapcache_seed from the MapCache package to create a large image cache by calling my MapServer WMS, with vectors.
Currently, this is the command I'm using:
sudo -u www-data mapcache_seed -c mapcache.xml -g WGS84 -n 8 -t Test -e Foo,Bar,Baz,Fwee -M 8,8 -z 12,13 --thread-delay 0 --rate-limit 10000
Where www-data is my Nginx system user, mapcache.xml is my config, WGS84 is my SRS, -n 8 is my logical thread count (on an i7-6700HQ at 3200 MHz), -z 12,13 is the one zoom level that needs to be seeded, thread delay is off, and the tile creation rate is capped at 10000.
However, I get at most 50% total CPU utilization, and most of the time only a single core goes above 50%, with an average of 500 tiles per second, independent of how many threads or processes I specify. I've been trying to get all zoom levels (4 to 27) seeded for the last couple of days, but I've only managed to get through 4-12 before being severely bottlenecked, at a mere 3 GB for a couple million tiles.
Memory utilization is stable at 2.4% of 8 GB of PC4-2133 for mapcache_seed (0.5% for the WMS). Write speeds are 100 MB/s, no-buffer writes are also 100 MB/s, while buffered+cached writes are 6.7-8.7 GB/s, on a 1 TB SATA III HDD. I have another SSD in my machine that gets 6 GB/s writes and 8 GB/s reads, but it's too small for storage and I'm afraid of drive failure from too many writes.
The cached tiles are around 4 KB each, which means I get around 2 MB worth of tiles every second. The majority of them aren't even tiles, but symlinks to a catch-all blank tile for empty areas.
How would I go about speeding this process up? Messing with threads and limits through mapcache_seed does not make any discernible difference. This is also on a Debian Wheezy machine.
This is all run through FastCGI, using 256x256 px images and a disk cache with an extent restricted to a single country (otherwise mapcache starts generating nothing but symlinks to blank tiles, because more than 90% of the world is blank!).
Mapserver mapfile (redacted):
MAP
NAME "MAP"
SIZE 1200 800
EXTENT Foo Bar Baz Fwee
UNITS DD
SHAPEPATH "."
IMAGECOLOR 255 255 255
IMAGETYPE PNG
WEB
IMAGEPATH "/path/to/image"
IMAGEURL "/path/to/imageurl"
METADATA
"wms_title" "MAP"
"wms_onlineresource" "http://localhost/cgi-bin/mapserv?MAP=/path/to/map.map"
"wms_srs" "EPSG:4326"
"wms_feature_info_mime_type" "text/plain"
"wms_abstract" "Lorem ipsum"
"ows_enable_request" "*"
"wms_enable_request" "*"
END
END
PROJECTION
"init=epsg:4326"
END
LAYER
NAME base
TYPE POLYGON
STATUS OFF
DATA polygon.shp
CLASS
NAME "Polygon"
STYLE
COLOR 0 0 0
OUTLINECOLOR 255 255 255
END
END
END
LAYER
NAME outline
TYPE LINE
STATUS OFF
DATA line.shp
CLASS
NAME "Line"
STYLE
OUTLINECOLOR 255 255 255
END
END
END
END
mapcache.xml (redacted):
<?xml version="1.0" encoding="UTF-8"?>
<mapcache>
<source name="ms4wserver" type="wms">
<getmap>
<params>
<LAYERS>base</LAYERS>
<MAP>/path/to/map.map</MAP>
</params>
</getmap>
<http>
<url>http://localhost/wms/</url>
</http>
</source>
<cache name="disk" type="disk">
<base>/path/to/cache/</base>
<symlink_blank/>
</cache>
<tileset name="test">
<source>ms4wserver</source>
<cache>disk</cache>
<format>PNG</format>
<grid>WGS84</grid>
<metatile>5 5</metatile>
<metabuffer>10</metabuffer>
<expires>3600</expires>
</tileset>
<default_format>JPEG</default_format>
<service type="wms" enabled="true">
<full_wms>assemble</full_wms>
<resample_mode>bilinear</resample_mode>
<format>JPEG</format>
<maxsize>4096</maxsize>
</service>
<service type="wmts" enabled="false"/>
<service type="tms" enabled="false"/>
<service type="kml" enabled="false"/>
<service type="gmaps" enabled="false"/>
<service type="ve" enabled="false"/>
<service type="mapguide" enabled="false"/>
<service type="demo" enabled="false"/>
<errors>report</errors>
<locker type="disk">
<directory>/path/</directory>
<timeout>300</timeout>
</locker>
</mapcache>
So, for anyone coming across this ten years later, when what's left of the little documentation for these tools has rotted away: I messed with my mapcache and mapfile settings to get something better. It wasn't that my generation was too slow, it was that I was generating TOO MANY GODDAMN SYMLINKS to blank files. First, the mapfile extent was incorrect. Second, I was using the "WGS84" grid, which by default seeds ALL extents. That meant 90% of all my tiles were just symlinks to a blank.png, and it ate up ALL of my inodes. I recommend mkdir blank; rsync -a --delete blank/ /path/to/cache for a quick clearing of all that mess.
I fixed the above by taking the WGS84 grid specification and changing its extent to the one I specified in my mapfile. Now only my map's area gets seeded. Lastly, I amended the grid XML element like so:
<grid restricted_extent="MAP FILE EXTENT HERE">GRIDNAME</grid>
With restricted_extent, it's now certain that only my map gets seeded. I had over 100 million tiles, but they were all goddamn symlinks! Otherwise, I got a "ran out of space" error or some such. Even if df shows that the partition isn't full, that's misleading: symlinks take up inode space, not logical space. To see inode usage, run df -hi. I was at 100% inode usage but only 1% logical space on a 1 TB drive, filled with goddamn symlinks!
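For completeness, the custom-grid variant looks roughly like this; the grid name, the Foo Bar Baz Fwee extent, and the truncated resolutions list are placeholders to fill in from your own mapfile and the stock WGS84 grid definition:

<grid name="wgs84_country">
   <!-- stock WGS84 grid parameters, extent narrowed to the mapfile's -->
   <srs>EPSG:4326</srs>
   <size>256 256</size>
   <extent>Foo Bar Baz Fwee</extent>
   <units>dd</units>
   <resolutions>0.703125 0.3515625 0.17578125 ...</resolutions>
</grid>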

x86 segmentation, DOS, MZ file format, and disassembling

I'm disassembling "Test Drive III". It's a 1990 DOS game. The *.EXE has the MZ format.
I've never dealt with segmentation or DOS, so I would be grateful if you answered some of my questions.
1) The game's system requirements mention a 286 CPU, which has protected mode. As far as I know, DOS was 90% real-mode software, yet some applications could enter protected mode. Can I be sure that the app uses the CPU in real mode only? IOW, is it guaranteed that the segment registers contain the actual offset of the segment instead of an index to a segment descriptor?
2) Said system requirements mention 1 MB of RAM. How is this amount of RAM even meant to be accessed if the uppermost 384 KB of the address space are reserved for stuff like MMIO and ROM? I've heard about UMBs (using holes in the UMA to access RAM) and about the HMA, but that still doesn't allow access to the whole 1 MB of physical RAM. So, was precious RAM just wasted because its physical address happened to be reserved for the UMA? Or maybe the game uses some crutches like LIM EMS or XMS?
3) Is CS incremented automatically when the code crosses segment boundaries? Say the IP reaches 0xFFFF, and what then? Does CS switch to the next segment before the next instruction is executed? Same goes for SS: what happens when SP goes all the way down to 0x0000?
4) The MZ header of the executable looks like this:
signature 23117 "0x5a4d"
bytes_in_last_block 117
blocks_in_file 270
num_relocs 0
header_paragraphs 32
min_extra_paragraphs 3349
max_extra_paragraphs 65535
ss 11422
sp 128
checksum 0
ip 16
cs 8385
reloc_table_offset 30
overlay_number 0
Why does it have no relocation information? How is it even meant to run without address fixups? Or is it built as completely position-independent code consisting of program-counter-relative instructions? The game comes with a cheat utility which is also an MZ executable. Despite being much smaller (8448 bytes, so small that it fits into a single segment), it still has relocation information:
offset 1
segment 0
offset 222
segment 0
offset 272
segment 0
This allows IDA to properly disassemble the cheat's code. But the game EXE has nothing, even though it clearly has lots of far pointers.
5) Is there even such a thing as 'sections' in DOS? I mean, a data section, a code (text) section, etc.? The MZ header points to the stack section, but it has no information about a data section. Are data and code completely mixed in DOS programs?
6) Why even have a stack section in the EXE file at all? It contains nothing but zeroes. Why waste disk space instead of just saying "start the stack here", like it is done with a BSS section?
7) The MZ header contains the initial values of SS and CS. What about DS? What's its initial value?
8) What does an MZ executable have after the exe data? The cheat utility has a whole 3507 bytes at the end of the executable file which look like
__exitclean.__exit.__restorezero._abort.DGROUP#.__MMODEL._main._access.
_atexit._close._exit._fclose._fflush._flushall._fopen._freopen._fdopen
._fseek._ftell._printf.__fputc._fputc._fputchar.__FPUTN.__setupio._setvbuf
._tell.__MKNAME._tmpnam._write.__xfclose.__xfflush.___brk.___sbrk._brk._sbrk
.__chmod.__close._ioctl.__IOERROR._isatty._lseek.__LONGTOA._itoa._ultoa.
_ltoa._memcpy._open.__open._strcat._unlink.__VPRINTER.__write._free._malloc
._realloc.__REALCVT.DATASEG#.__Int0Vector.__Int4Vector.__Int5Vector.
__Int6Vector.__C0argc.__C0argv.__C0environ.__envLng.__envseg.__envSize
Is this some kind of debugging symbol information?
Thank you in advance for your help.
Re. 1. No, you can't be sure until you prove otherwise to yourself. One giveaway would be the presence of an LMSW (or, on a 386, MOV CR0, ...) instruction in the code.
Re. 2. While marketing materials aren't to be confused with an engineering specification, there's a technical reason for this. A 286 CPU could address more than 1 MB of physical address space. The RAM was only "wasted" in real mode, and only if an EMM (or EMS) driver wasn't used. On 286 systems, the RAM past 640 KB was usually "pushed up" to start at the 1088 KB mark. The ISA and on-board peripherals' memory address space was mapped 1:1 into the 640-1024 KB window. Using that RAM from real mode required an EMM or EMS driver. From protected mode, it was simply "there" as soon as you set up the segment descriptors correctly.
If the game actually needed the extra 384 KB of RAM over the 640 KB available in real mode, it's a strong indication that it either switched to protected mode or required the services of an EMM or EMS driver.
Re. 3. I wish I remembered that. On reflection, I wish not :) Someone else please edit or answer separately. Hah, I did know it at some point in time :)
Re. 4. You say "[the code] has lots of instructions like call far ptr 18DCh:78Ch". This implies one of three things:
Protected mode is used, and the segment part of the address is a selector into the segment descriptor table.
There is code there that relocates those instructions without DOS having to do it.
There is code there that forcibly relocates the game to a constant position in the address space. If the game doesn't use DOS to access on-disk files, it can remove DOS completely and take over, gaining lots of memory in the process. I don't recall whether you could exit from the game back to the command prompt. Some games were "play until you reboot".
Re. 5. The .EXE header does not "point" to any stack; there is no stack section as you imply, and the concept of sections doesn't exist as far as the .EXE file is concerned. The SS register value is obtained by adding the segment at which the executable was loaded to the SS value from the header, as sketched below.
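To make that concrete, here is the standard MZ header layout with the load-time fixup; the struct follows the well-known on-disk format, but the load segment value in the example is made up:

#include <stdint.h>

/* Standard MZ (.EXE) header, little-endian; field names follow the
 * dump in the question. */
#pragma pack(push, 1)
struct mz_header {
    uint16_t signature;            /* 0x5A4D, "MZ" */
    uint16_t bytes_in_last_block;
    uint16_t blocks_in_file;       /* 512-byte blocks */
    uint16_t num_relocs;
    uint16_t header_paragraphs;    /* header size, 16-byte paragraphs */
    uint16_t min_extra_paragraphs;
    uint16_t max_extra_paragraphs;
    uint16_t ss;                   /* relative to the load segment */
    uint16_t sp;                   /* used as-is */
    uint16_t checksum;
    uint16_t ip;                   /* used as-is */
    uint16_t cs;                   /* relative to the load segment */
    uint16_t reloc_table_offset;
    uint16_t overlay_number;
};
#pragma pack(pop)

/* DOS picks a load segment and adds it to the header's segment values;
 * 0x1000 here is just an example load address. */
static uint16_t initial_cs(const struct mz_header *h) { return 0x1000 + h->cs; }
static uint16_t initial_ss(const struct mz_header *h) { return 0x1000 + h->ss; }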
It's true that the linker can arrange sections contiguously in the .EXE file, but such sections' properties are not included in the .EXE header. They often can be reverse-engineered by inspecting the executable.
Re. 6. The SS and SP values in the .EXE header are not file pointers. The EXE file might have a part that maps to the stack, but that's entirely optional.
Re. 7. This has already been asked and answered elsewhere: in short, DOS initializes DS (and ES) to point at the PSP (Program Segment Prefix), not to a value from the header.
Re. 8. This looks like a debug symbol list. The cheat utility was linked with the debugging information left in. You can have completely arbitrary data there; often it'd be various resources (graphics, music, etc.).

How to split a bio into multiple bios?

I want to create a block device that gets a bio with a request for n sectors and splits it into n bios of 1 sector each. I used bio_split, but it doesn't work and hits a BUG_ON.
Is there any function to do such a thing?
If there's not, can anyone help me write a function to do that?
It's also fine to have a function that splits a bio into 4K bios.
The bio_split() function only works for bios with a single page (where the bi_vcnt field is exactly 1).
To deal with bios with multiple pages (and I suspect you deal with these most of the time) you have to create new bios and set them up so that each contains only a single sector.
Tip: if the sector size is the same as the page size (currently 4K), and your block driver tells the kernel to supply no less than this size, then you only have to put each page from the incoming bio into a new bio. If the sector size is less than the page size, then the logic will be a bit more complicated.
Use bio_kmalloc() to allocate the new bios and copy the data onto the memory pages in them manually, along these lines:
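This is an untested sketch, assuming a pre-5.18 kernel where bio_kmalloc(gfp, nr_vecs) returns a ready-to-use bio, and a kernel where struct bio has a bi_bdev member; error unwinding and completion chaining (see bio_chain()) are left out:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Untested sketch: clone a multi-page bio into single-page child bios.
 * Real code must unwind on allocation failure and chain the children's
 * completions back to 'src' before ending it. */
static void split_into_page_bios(struct bio *src)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	sector_t sector = src->bi_iter.bi_sector;

	bio_for_each_segment(bv, src, iter) {
		struct bio *child = bio_kmalloc(GFP_NOIO, 1);

		if (!child)
			break;                    /* unwind in real code */

		bio_set_dev(child, src->bi_bdev);
		child->bi_iter.bi_sector = sector;
		child->bi_opf = src->bi_opf;
		bio_add_page(child, bv.bv_page, bv.bv_len, bv.bv_offset);

		submit_bio(child);

		sector += bv.bv_len >> 9;         /* 512-byte sectors */
	}
}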
