How to enlarge the memory in Microblaze for software applications?

How to enlarge the memory in Microblaze for software applications? - fpga

I wrote a C program, which has a big size . However, it is known that the Microblaze by default uses only 64KB. So I change the amount of BRAM in the EDK to 512K but when I generate the bitsream I got this errors:
- C:\Users\slim\Desktop\hs\system.mhs line 74
IPNAME: plb_v46, INSTANCE: mb_plb - 2 master(s) : 1 slave(s)
IPNAME: lmb_v10, INSTANCE: ilmb - 1 master(s) : 1 slave(s)
IPNAME: lmb_v10, INSTANCE: dlmb - 1 master(s) : 1 slave(s)
ERROR:EDK:440 - platgen failed with errors!
Done!

On a Spartan-6, EDK's blockRAM memory block are limited to 64kB if they are 32 bits large, 128kB if they are 64 bits large and 256kB if they are 128 bits large. Besides, the spartan-6 LX45 has only 238kB of BlockRAM memory available, so this is impossible altogether on your platform.
The memory the microblaze use to store it's program is limited to 32 bits, as far as I know. To use the larger memories you would need to connect it to an AXI port. Otherwise, Xilinx has an answer record on how to use 2 differents 64kB memory to present 128kB to the microblaze.
Do you really need that much memory? If the answer is "yes", you should use a microblaze with external memory instead. You will have much more memory available (your board has 128MB DDR ready to use) with minimal performance impact if you have sufficient caches.

Related

Relation between computer architecture and cache block size

Suppose memory is byte addressable and cache block size is 4 bytes . So in one cache access 1 block is accessed. Does it means computer architecture is of 32 bit. My question is what derivation you can make about computer architecture if you are given about cache block size

No, usually cache block size is larger than the register width, to take advantage of spatial locality between nearby full-register-width loads / stores which is typical. Making cache as fine-grained a 4-byte chunks costs a large amount of overhead (tags and so on) compared to the amount of storage needed for the actual data. e.g. 20 tag bits, plus "dirty" and other MESI state per 32-bit cache line, might mean that a 32 kiB (usable space) cache needs more like 56 kiB of raw SRAM storage, and that's without considering ECC or parity.
If a CPU has a floating-point unit, it can often do 64-bit loads/stores, even if the integer register width is only 32-bit. (Or even wider with SIMD, or load-pair / store-pair instructions.)
Typical real-world cache sizes are 64 bytes on modern systems, and formerly 32 bytes on earlier CPUs like Pentium III. 64 bytes is the DDR SDRAM burst size, so it's a good choice for the size of off-chip memory accesses. (Recent Intel systems with AVX-512 SIMD can load/store a whole 64-byte (512-bit) cache line with a single instruction, though. SIMD vector width has caught up to cache line size. But integer accesses are still at most 8 bytes wide.)
There's no relationship between cache block size and architecture bitness. You definitely want the block size to be at least as wide as a normal load / store, but it would be possible to build a 64-bit machine with 32-bit cache blocks. That would mean 64-bit loads take two cache accesses to do it, so it would be a really bad idea unless your usual workload consisted of using 64-bit addresses in registers to access scattered 32-bit values, and you wanted to optimize for that without caring about efficiency of anything else.
Most 64-bit ISAs can work with 32 or 64-bit data equally efficiently. Some, notably x86-64, don't even have what you'd call a "word size". There's no one native access size that's most efficient on x86-64, and instructions are an unaligned byte stream, not like ISAs with aligned 32-bit instruction words like RISC-V or AArch64.
So if you knew that the cache block size was 32-bit, it would be a good guess that the register width was at most 32-bit, but could be 8 or 16-bit. (Or 4-bit or possibly even 6-bit or something? With sizes smaller than 32-bit, for historical CPUs it often becomes a question of what one means by bitness: ALU, register, bus, fixed-width instruction? Notice that in earlier parts of the answer, I just talked about register width, not "32-bit CPU".)
If this was a real commercial design instead of a computer science example, an 8-bit machine would be the most likely; a normal 32-bit machine would use larger cache blocks but you could plausibly imagine finer granularity on a machine that could only load 1 byte at a time. (Of course, being an 8-bit machine doesn't imply that restriction; you could have a load-pair instruction, or FP registers that allow 32-bit or 64-bit loads/stores.)

How is byte addressing implemented in modern computers?

I have trouble understanding how in say a 32-bit computer byte addressing is achieved:
Is the ram itself byte addressable meaning the first byte has address 0 and the second 1 etc? In this case, wouldn't is take 4 read cycles to read a 32-bit word and waste the width of the data bus?
Or does the ram consist of 32-bit words meaning address 0 points to the first 4 bytes and address 2 points to bytes 5 to 8? In this case I would expect the ram interface to make byte addressing possible (from the cpu's point of view)

Think of RAM as 8 bit wide structure with N entries. N is often the size quoted when referring to memory (256 MB - 256M entries, 2GB - 2G entries etc, B is for bytes). When you access this memory, the smallest unit you can address is one of these entries which is 8 bits (1 byte). Since you can only access it at byte level, we call it byte addressable memory.
Now coming to your question about accessing this memory, we do not just access a byte. Most of the time, memory accesses are sent through caches which are there to reduce memory access latency. Caches store data at a higher granularity than a byte or word, normally it is multiple of words. In doing so, caches explore a property called "locality". Locality means, there is a high chance that we either access this data item or a near by data item very soon. So fetching not just the byte, but all the adjacent bytes is not a waste. Think of it as an investment for future, saves you multiple data fetches that you would have done otherwise.

Memory addresses in RAM start with 0th address and they are accessed using the registers with capacity of 8 bit register or 32 bit registers. Based on these registers the value from specific address is accessed by the CPU. If you really need to understand how it works, you will need to run couple of programs using Assembly language to navigate in the physical memory by reading the values directly using registers and register move commands.

AXI4 (Lite) Narrow Burst vs. Unaligned Burst Clarification/Compatibility

I'm currently writing an AXI4 master that is supposed to support AXI4 Lite (AXI4L) as well.
My AXI4 master is receiving data from a 16-bit interface. This is on a Xilinx Spartan 6 FPGA and I plan on using the EDK AXI4 Interconnect IP, which has a minimum WDATA width of 32 bits.
At first I wanted to use narrow burst, i.e. AWSIZE = x"01" (2 bytes in transfer). However, I found that Xilinx' AXI Reference Guide UG761 states "narrow bursts [are] supported but [...] not recommended." Unaligned transactions are supposed to be supported.
This had me thinking. Say I start an unaligned burst:
AWLEN = x"01" (2 beats)
AWSIZE = x"02" (4 bytes in transfer")
And do the following:
AX (32-bit word #0: send hi16)
XB (32-bit word #1: send lo16)
Where A, B are my 16 bit words that start off at an unaligned (2-byte aligned) address. X means WSTRB is deasserted for the indicated 16 bit.
Is this supported or does this fall under the category "narrow burst" even through AWSIZE = x"02" (4 bytes in transfer) as opposed to AWSIZE = x"01" (2 bytes in transfer)?
Now, if this was just for AXI4, I would probably not care as much about this use case, because AXI4 peripherals are required to use the WSTRB signals. However, the AXI Reference Guide UG761 states "[AXI4L] Slaves interface can elect to ignore WSTRB (assume all bytes valid)."
I read here that many (but not all; and there is not list?) Xilinx AXI4L peripherals do elect to ignore WSTRB.
Does this mean that I'm essentially barred from doing narrow burst ("not recommended") as well as unaligned bursts ("WSTRB can be ignored") or is there an easy way to unload some of the implementation work from my master into the interconnect, guaranteeing proper system behavior when accessing AXI4L peripherals?

Your example is not a narrow burst, and should work.
The reason narrow burst is not recommended is that it gives sub-optimal performances. Both narrow-burst and data realignement cost in area and are not recommended IMHO. However, DRE has minimal bandwidth cost, while narrow burst does. If your AXI port is 100MHz 32 bits, you have 3.2GBits maximum throughput, if you use narrow burst of 16 bits 50% of the time, than your maximum throughput is reduced to 2.4GBits (32bits X 50MHz + 16bits X 50Mhz). Also, I'm not sure AXI-Lite support narrow burst or data realignement.
Your example has 2 major flaws. First, it requires 3 data-beats to transfer 32 bits, which is worst than narrow-burst (I don't think AXI is smart enough to cancel the last burst with WSTRB to 0). Second, you can't burst more than 2 16-bits at a time, which will hang your AXI infrastructure's performances if you have a lot of data to transfer.
The best way to deal with this is concatenate the 16 bits together to form a 32 bits in your block. Then you buffer these 32 bits and burst them when you have enough. This is the AXI high performance way to do this.
However, if you receive data as 16-bits, it seems you would be better using AXI-Stream, which support 16-bits but doesn't have the notion of addresses. You can map an AXI-Stream to AXI-4 using Xilinx's IP cores. Either AXI-Datamover or AXI-DMA can do that. Both do the same (in fact, AXI-DMA includes a datamover), but AXI-DMA is controlled trough an AXI-Lite interface while Datamover is controlled through additionals AXI-Streams.
As a final note, the Xilinx cores never requires narrow-burst or DRE. If you need DRE in AXI-DMA, it's done by the AXI-DMA core and not the AXI Interconnect. Also, these cores are clear-source, so you can checkout how they operate easily.

Why does the 8086 use an extra register to address 1MB of memory?

I heard that the 8086 has 16-bit registers which allow it to only address 64K of memory. Yet it is still able to address 1MB of memory which would require 20-bit registers. It does this by using another register to to hold another 16 bits, and then adds the value in the 16-bit registers to the value in this other register to be able to generate numbers which can address up to 1MB of memory. Is that right?
Why is it done this way? It seems that there are 32-bit registers, which is more than sufficient to address 1MB of memory.

Actually this has nothing to do with number of registers. Its the size of the register which matters. A 16 bit register can hold up to 2^16 values so it can address 64K bytes of memory.
To address 1M, you need 20 bits (2^20 = 1M), so you need to use another register for the the additional 4 bits.

The segment registers in an 8086 are also sixteen bits wide. However, the segment number is shifted left by four bits before being added to the base address. This gives you the 20 bits.

the 8088 (and by extension, 8086) is instruction compatible with its ancestor, the 8008, including the way it uses its registers and handles memory addressing. the 8008 was a purely 16 bit architecture, which really couldn't address more than 64K of ram. At the time the 8008 was created, that was adequate for most of its intended uses, but by the time the 8088 was being designed, it was clear that more was needed.
Instead of making a new way for addressing more ram, intel chose to keep the 8088 as similar as possible to the 8008, and that included using 16 bit addressing. To allow newer programs to take advantage of more ram, intel devised a scheme using some additional registers that were not present on the 8008 that would be combined with the normal registers. these "segment" registers would not affect programs that were targeted at the 8008; they just wouldn't use those extra registers, and would only 'see' 16 addres bits, the 64k of ram. Applications targeting the newer 8088 could instead 'see' 20 address bits, which gave them access to 1MB of ram

I heard that the 8086 has 16 registers which allow it to only address 64K of memory. Yet it is still able to address 1MB of memory which would require 20 registers.
You're misunderstanding the number of registers and the registers' width. 8086 has eight 16-bit "general purpose" registers (that can be used for addressing) along with four segment registers. 16-bit addressing means that it can only support 216 B = 64 KB of memory. By getting 4 more bits from the segment registers we'll have 20 bits that can be used to address a total of 24*64KB = 1MB of memory
Why is it done this way? It seems that there are 32 registers, which is more than sufficient to address 1MB of memory.
As said, the 8086 doesn't have 32 registers. Even x86-64 nowadays don't have 32 general purpose registers. And the number of registers isn't relevant to how much memory a machine can address. Only the address bus width determines the amount of addressable memory
At the time of 8086, memory is extremely expensive and 640 KB is an enormous amount that people didn't think that would be reached in the near future. Even with a lot of money one may not be able to get that large amount of RAM. So there's no need to use the full 32-bit address
Besides, it's not easy to produce a 32-bit CPU with the contemporary technology. Even 64-bit CPUs today aren't designed to use all 64-bit address lines
Why can't OS use entire 64-bits for addressing? Why only the 48-bits?
Why do x86-64 systems have only a 48 bit virtual address space?
It'll takes more wires, registers, silicons... and much more human effort to design, debug... a CPU with wider address space. With the limited transistor size of the technology in the 70s-80s that may not even come into reality.

8086 doesn't have any 32-bit integer registers; that came years later in 386 which had a much higher transistor budget.
8086's segmentation design made sense for a 16-bit-only CPU that wanted to be able to use 20-bit linear addresses.
Segment registers could have been only 8-bit or something with a larger shift, but apparently there are some advantages to fine-grained segmentation where a segment start address can be any 16-byte aligned linear address. (A linear address is computed from (seg << 4) + off.)

How do we determine if a processor is 8-bit; 16-bit or 32-bit

Is it determined by size of the address buss; if yes then was 8086 a 20-bit processor? If no what is criteria for assigning a bit number like 8-bit, 16-bit, 32-bit to processor?

It's not well defined. Broadly, as xtofl points out, it's the size of the atomic unit of computation (in early computers, this wasn't always synonymous with "register"). So the PDP-10 was a 36 bit machine, a 8080 was 8 bit, and a IBM 360 or Intel 80386 is "32 bits".
But there are exceptions. The Motorola 68000 and 68010 CPUs implemented a 32 bit register set, but did it via microcode on top of a mostly 16 bit internal architecture. They were usually marketed as "16 bit" CPUs at the time.
The size of the address bus is almost never the defining factor. All successful "8 bit" CPUs implemented 16 bit addressing, for example (often via odd hacks to make up for the lack of a single address register, c.f. 6502's indirect addressing modes or the Z80's H/L registers). And the 8086, as you mention, used its segment register addressing to get 20 address lines to work (the 80286 extended this trick to 24 bits of physical address). And in the other direction, many "32 bit" CPUs had smaller address buses to save logic that wouldn't be used on a machine that would never have more than a few megabytes of memory: the 68000 limited addressing to 24 bits, even though the pointers themselves were 32. Likewise modern 64 bit CPUs universally implement less than 64 bits of physical address.

As far as i know the bit width of the processor is determined by how many bits the internal data processing circuits accept at once. Like if the adders, multipliers etc in the ALU accept 16 bit operands then the CPU is 16 bit, and if it accepts 32 bits then it is 32 bit. It does not matter what is the width of the data bus or the address bus. In general the bit length of the Accumulator would determine the bit length of the processor.

I guess normally you label it by the size of it's accumulators/registers.

With respect to a CPU, I'd say that it's the width of a register. You can do an operation on only 8 bits, 16-bits, 32-bits, etc. at a time.

The bit size (8-bit, 16-bit, 32-bit) of a microprocecessor is determined by the hardware, specifically the width of the data bus. The Intel 8086 is a 16-bit processor because it can move 16 bits at a time over the data bus. The Intel 8088 is an 8-bit processor even though it has an identical instruction set. This is similar to the Motorola 68000 and 68008 processors. The bit size is not determined by the programmer's view (the register width and the address range).

I think the first number of Integrated chip refers to the type of the processor.
If it is IC 8085 means it is a 8 bit processor.

any processor can be designated by its' two attributes
instruction set architecture &
no. of bits it can handle in single clock cycle.
take for example Intel's IA-32 architecture, also called x86-32 , here x86 indicates the architecture and 32 indicates 32-bit processor
X-Architecture
there are a number of architectures
Pre-x86 x86
-Intel's IA-32 architecture, also called x86-32 -x86-64
- -with AMD's AMD64 and Intel's Intel 64 version of it
- Motorola's 6800 and 68000 a
rchitectures ARM7
Y-bit processor
: simply- its the data handling capability of cpu/processor in a single clock cycle.
suppose it is an 8 bit processor then in a single clock cycle, the ALU can perform operation on 8 bit data only.(note that this operation may be an internal operation like add/sub as well as transferring data to other IO device)
classification Based on Registers-
Processor in addition to ALU and CU contains some memory locations as well, called as registers. depending on the processor, a register may typically store 8, 16, 32 or 64 bits. The register size of a particular processor allows us to classify the processor. Processors with a register size of n-bits are called n-bit processors, so that processors with 8-bit registers are called 8-bit processors.
classification Based on databus width-
since the alu can handle only 8 bit data in a single clock cycle it won't make sense to have data bus width more than that & 8 bit processor will have 8 bit wide databus, hence databus width can also be an alternate way to find out the bit processing capability of processor.for a processor with n bit databus means that the CPU can transfer n-bits to another device in a single operation.
for the question:
"suppose we have a 32 bit ALU i.e. it can take 32 bits at a time and
our data bus size is 16 bit i.e. it can hold 16 bit of data at a time
thn wht will be the ans. In this case...?"
the example of such processor would be intel 8088 & Moto 68000
Using bus width classification, the Intel 8088 microprocessor is an 8-bit processor since it uses an 8-bit data bus, although its CPU registers are in fact 16-bit registers.
Similarly the Motorola 68000 is classified as a 16-bit processor, even though its CPU registers are 32-bit registers.
Sometimes a combination of the two classifications is used where the 8088 might be described as 8/16-bit processor and the Motorola 68000 as a 16/32-bit processor.

The word size(8-bits, 16-bits or 32-bits) of a microprocessor is the size of the data path in the execution unit. Typically, this is the size of the accumulator.
This is the execution unit size. An example where this matters is the 8088, which is a 16 bit computer running on an 8 bit bus. The 8085 is 8-bits. The 8086/8088 is 16-bits. The 80386 is 32-bits. Mordern Intel Processors are 64-bits.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio