How to implement a cache without clock signal in Verilog? - caching

I'm doing a homework problem for Computer Organization and Design. I need to use Verilog to implement a cache that connects a CPU and a main memory. The cache should be direct mapped. It should use a write-back scheme for writing, and LRU for replacement.
The interface of the cache is as follows:
module cache(
input wire rw, //from CPU, 1 for write, 0 for read
input wire[31:0] addr, //from CPU, memory address
input wire[31:0] wdata, //from CPU, data to write
input wire[31:0] mem_rdata, //from memory, data read
input wire mem_done, //from memory, r/w done
output reg[31:0] rdata, //to CPU, data read
output reg hit_miss, //to CPU, 1 for r/w hit, 0 for r/w miss
output reg mem_rw, //to memory, 1 for write, 0 for read
output reg[31:0] mem_addr, //to memory, memory address
output reg[31:0] mem_wdata //to memory, data to write
);
endmodule
Other interface signals can be added if necessary, but it is specially stated that clock signal and FSM are not needed in the cache implementation.
I find it difficult not to use a clock signal. The cache has internal states, so it is more than just combinational logic. But how can a sequential logic be implemented without clock signal? Thanks very much for any idea.
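Before writing any Verilog, it can help to pin down the expected behaviour in a software model. The following is a minimal sketch in C, not the assignment's solution. It assumes 64 lines of one 32-bit word each and a word-indexed backing array; `NUM_LINES`, `cache_line`, `cache` and `cache_access` are names made up for illustration. Note that in a direct-mapped cache each address maps to exactly one line, so there is never a choice of victim for LRU to make.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 64u            /* assumption: 64 one-word lines */

typedef struct {
    bool     valid, dirty;
    uint32_t tag, data;
} cache_line;

typedef struct {
    cache_line lines[NUM_LINES];
    uint32_t  *mem;              /* backing store, indexed by word */
    uint32_t   mem_words;        /* size of the backing store in words */
    unsigned   writebacks;       /* counts dirty lines written back */
} cache;

/* Returns true on a hit and false on a miss, mirroring hit_miss.
 * rw follows the interface above: true for write, false for read. */
bool cache_access(cache *c, bool rw, uint32_t addr,
                  uint32_t wdata, uint32_t *rdata)
{
    uint32_t word  = addr >> 2;          /* byte address to word address */
    uint32_t index = word % NUM_LINES;   /* which line */
    uint32_t tag   = word / NUM_LINES;   /* remaining upper bits */
    cache_line *l  = &c->lines[index];

    assert(word < c->mem_words);

    bool hit = l->valid && l->tag == tag;
    if (!hit) {
        /* Write back: memory is only updated when a dirty line is evicted. */
        if (l->valid && l->dirty) {
            c->mem[l->tag * NUM_LINES + index] = l->data;
            c->writebacks++;
        }
        /* Fill the line from memory. */
        l->data  = c->mem[word];
        l->tag   = tag;
        l->valid = true;
        l->dirty = false;
    }
    if (rw) {
        l->data  = wdata;
        l->dirty = true;
    }
    *rdata = l->data;
    return hit;
}
```

A write that hits only changes the cached copy; memory sees the new value when a later access with the same index and a different tag evicts the line.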

Related

Given the data path of the LC-3 (provided below), give a complete description of the STORE DIRECT instruction (ST SR, label), as follows

I've tried reading the book but I'm not sure exactly how to go about this. Anyone have an idea?
When doing these problems, it's best to have the datapath in front of you and the Register Transfer Language written down. You work through them step by step. It is a little daunting, but with the digital logic you have already learned, it is just a matter of you being a pirate and the datapath being your treasure map.
To do this you just follow the wires in the diagram. I'm using this one which I'm sure is in the Patt/Patel textbook https://i.ytimg.com/vi/PeWbyffnkZ4/maxresdefault.jpg
Mem[PC + SEXT(IR[8:0])] = SR
Clock Cycle 1
So the first thing you need to do is SEXT(IR[8:0]). Where in the datapath is a sign extender, and where is the IR? If you look at the ADDR2MUX, you see it has 4 inputs: three are sign-extended fields of the IR and one is the constant 0. So ADDR2MUX=IR[8:0]
Next we need to add the PC to it. The output of the ADDR2MUX is now SEXT(IR[8:0]), and it feeds into an adder, so we need to set up the adder's other input with the PC. The ADDR1MUX has one input from the register file and one from the PC. So ADDR1MUX=PC
Both of these inputs go into the adder and now the output of that adder has PC + SEXT(IR[8:0])
Next we need to store to memory. The address we want to store to is PC + SEXT(IR[8:0]), and what we want to store is SR. So how do we do that? To interface with memory we need to put the address in the MAR (Memory Address Register) and the data we want to store in the MDR (Memory Data Register). Let's do the MAR step first. We need to put the result of the ADDER into the MAR, and the only path we can take is the MARMUX. So MARMUX=ADDER. We need to gate the MARMUX to put it out on the bus as well. So GateMARMUX.
The value of the MARMUX is now out onto the bus so we want to latch that into the MAR so LDMAR.
We need a new clock cycle now because we need to wait for the value to latch into the register which happens at the beginning of a new clock cycle.
Clock Cycle 1 Signals - ADDR2MUX=IR[8:0], ADDR1MUX=PC, MARMUX=ADDER, GateMARMUX, LDMAR
Clock Cycle 2
Now let's get the source register into the MDR. Looking at the diagram, we need a path from the register file to the BUS to get it into the MDR. There are actually two ways of doing this: one going through ADDR1MUX and one going through the ALU.
I will take the ALU path as it's slightly more correct.
First we need to make SR1 be the source register from the instruction, so SR1MUX=IR[11:9]
The Source register from the instruction now comes out the register file from the SR1 output, this feeds into the ALU. So we must choose the operation the ALU does so, ALUK=PASSA. PASSA simply makes the output of the ALU the 'A' input.
We then need to put the ALU output on the bus so GateALU
Now the ALU output is on the bus and we want to store this in the MDR, however there is a MUX blocking that input. MIO.EN=0 to select the bus output to go into the MDR. Then we need to latch that value into the register so LDMDR
We just loaded a value into a register, so it will not be available at the output until the start of the next clock cycle.
Clock Cycle 2 Signals - SR1MUX=IR[11:9], ALUK=PASSA, GateALU, MIO.EN=0, LDMDR
Clock Cycle 3
All we need to do is give memory a good ol kick to store the value in the MDR at the address in the MAR so... MIO.EN=1, R/W=W
Clock Cycle 3 signals - MIO.EN=1, R/W=W
This concludes the ST instruction. It takes 3 clock cycles, and the signals to assert in each clock cycle are listed at the end of that cycle. All of the signals within one clock cycle can be asserted at the same time.
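The three cycles can be checked with a small software model of the register transfers. This is only a sketch in C: it models the MAR, MDR, register file and memory, not the muxes and gate signals themselves. The `lc3` struct and the `st_cycle*` function names are invented for illustration, and the PC is assumed to have been incremented by the fetch phase already.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t pc, ir, mar, mdr;
    uint16_t reg[8];
    uint16_t mem[65536];
} lc3;

/* Sign-extend the low 9 bits of the instruction (the PC offset). */
static int16_t sext9(uint16_t ir)
{
    return (int16_t)(((ir & 0x1FF) ^ 0x100) - 0x100);
}

/* Cycle 1: MAR <- PC + SEXT(IR[8:0])
 * (ADDR2MUX, ADDR1MUX, MARMUX, GateMARMUX, LDMAR) */
void st_cycle1(lc3 *m)
{
    m->mar = (uint16_t)(m->pc + sext9(m->ir));
}

/* Cycle 2: MDR <- SR, where SR is selected by IR[11:9]
 * (SR1MUX, ALUK=PASSA, GateALU, MIO.EN=0, LDMDR) */
void st_cycle2(lc3 *m)
{
    m->mdr = m->reg[(m->ir >> 9) & 0x7];
}

/* Cycle 3: Mem[MAR] <- MDR (MIO.EN=1, R/W=W) */
void st_cycle3(lc3 *m)
{
    m->mem[m->mar] = m->mdr;
}

void execute_st(lc3 *m)
{
    assert((m->ir >> 12) == 0x3);   /* ST has opcode 0011 */
    st_cycle1(m);
    st_cycle2(m);
    st_cycle3(m);
}
```

For example, with PC = 0x3001, R3 = 0x1234 and IR = 0x37FE (ST R3 with an offset of -2), the model ends with Mem[0x2FFF] = 0x1234.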

What does a multiplexer do in CPU?

I had designed a simple ALU, and I generated "operation codes" using a decoder. Now, I'm studying about Multiplexers, but I couldn't understand what they do in a CPU or ALU?
A really simple example: If you want to fetch a data bit from memory, a multiplexer allows you to specify an address (the input code), and the memory bit will be connected to another "pin".
So say you have 256 bits of memory and you want to connect one of them to an output pin: the multiplexer has 8 bits for input codes. You provide a code, say N, and bit N is connected through the logic gates to the output of the multiplexer. This multiplexer would have a total of 256 + 8 input lines.
I'm not sure how this would be implemented in more modern CPUs but you can probably see how several bit multiplexers could be stacked together and be used to fetch a byte from memory in parallel as well, and connected to say an arithmetic register to perform computations.
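Here is a rough software model of that idea in C, as a sketch only. It builds the 256-input multiplexer as a tree of 2:1 multiplexers (one common construction, not necessarily how a real memory is laid out), then stacks eight of them to fetch a byte in parallel. The names `mux2`, `mux256` and `fetch_byte`, and the "one bit plane per output bit" layout, are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 2:1 multiplexer on single bits: sel = 0 picks a, sel = 1 picks b. */
static uint8_t mux2(uint8_t a, uint8_t b, uint8_t sel)
{
    return (uint8_t)((a & (sel ^ 1u)) | (b & sel));
}

/* 256:1 multiplexer: 256 data inputs (each 0 or 1) and an 8-bit select.
 * Each select bit halves the number of candidates, so 8 stages suffice. */
uint8_t mux256(const uint8_t in[256], uint8_t sel)
{
    uint8_t level[256];
    unsigned n = 256;

    memcpy(level, in, sizeof level);
    for (unsigned stage = 0; stage < 8; stage++) {
        uint8_t s = (uint8_t)((sel >> stage) & 1u);
        for (unsigned i = 0; i < n / 2; i++)
            level[i] = mux2(level[2 * i], level[2 * i + 1], s);
        n /= 2;
    }
    return level[0];
}

/* Eight 256:1 multiplexers side by side, all driven by the same address:
 * planes[b][n] holds bit b of the byte stored at address n. */
uint8_t fetch_byte(uint8_t planes[8][256], uint8_t addr)
{
    uint8_t byte = 0;

    for (unsigned b = 0; b < 8; b++)
        byte |= (uint8_t)(mux256(planes[b], addr) << b);
    return byte;
}
```

Inside a CPU the same building block shows up wherever one of several sources has to be chosen, for example selecting which register feeds an ALU input.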
Fun right?!

Xilinx FPGA output to output timing constraints

I have a Spartan-6/ISE design where I'm generating 8-bit data @ 70MHz to feed the FIFO of a Cypress FX3 USB3 controller. I also generate a 70MHz o/p clock and /WR strobe that clock data into the USB controller. The 70MHz is derived by halving the 140MHz system clock in a process rather than using a DPLL, though the 140MHz system clock itself is produced using a DPLL.
I want to ensure the 8-bit data meets the set-up & hold time requirements of the USB controller and, although the data, o/p clock and /WR are derived from the 140MHz, I don't really care about their relationship to it. What I'm really concerned about is ensuring the set-up & hold times for data & /WR w.r.t the 70MHz o/p clock are within the USB controller limits.
My question is: how do I go about specifying timing constraints between FPGA outputs rather than w.r.t. to the internal system clock ?
Thanks
Dave

How to get a rgb picture into FPGA most efficiently, using verilog

I am trying to write a verilog code for FPGA programming where I will implement a VGA application. I use Quartus II and Altera DE2.
At the moment, my aim is to get a 640x480 rgb image during compilation (method doesn't matter as long as it works and is efficient). The best solution I came up with is to convert the picture into rgb hex files using matlab and to use $readmemh to get them into a register.
But as discussed here: verilog $readmemh takes too much time for 50x50 pixel rgb image
it takes too much time and apparently there is no way around it with this method. It would be fine if it was only the time but there is also the size problem, 640x480 pretty much costs most of the free space.
What I am hoping is some system function or variable type of verilog that will take and store the picture in a different way so that size won't be a problem anymore. I have checked solutions for verilog and quartus webpage but I believe there should be a faster way to do this general task, rather than writing something from scratch.
compilation report for 200x200 readmemh attempt:
Based on your compilation report, I'd recommend using a block ROM (or RAM) memory instead of registers to store your image.
At this moment you're using distributed RAM, i.e. the memory that is available inside each of the small logic blocks of the FPGA. This makes distributed RAM ideal for small memories. But when it comes to large memories, it may cause extra wiring delays and increase synthesis time (the synthesiser needs to wire all of these blocks together).
On the other hand, a block RAM is a dedicated two-port memory containing several kilobits of RAM (depending on your device and manufacturer). That's why you should use block RAM for large memories, and distributed RAM for FIFOs or small memories. Cyclone IV EP4CE115F29 (available in DE2-115) has 432 M9K memory blocks (3981312 memory bits).
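To see whether an image can fit at all, the raw bit count can be compared against the block RAM total quoted above. The sketch below, in C, does only that arithmetic. The helper names are made up, the block count is a lower bound (the synthesis tool may use more blocks depending on how it maps widths and depths), and the 8 and 12 bits-per-pixel cases are just illustrative colour depths, not something the question asked for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* From the figures above: 3981312 bits / 432 blocks = 9216 bits per M9K. */
#define M9K_BITS   9216u
#define M9K_BLOCKS 432u

/* Raw storage needed for a w x h image at bpp bits per pixel. */
uint64_t image_bits(uint32_t w, uint32_t h, uint32_t bpp)
{
    return (uint64_t)w * h * bpp;
}

/* Lower bound on the number of M9K blocks (rounded up). */
uint64_t blocks_needed(uint64_t bits)
{
    return (bits + M9K_BITS - 1) / M9K_BITS;
}

bool fits_in_block_ram(uint32_t w, uint32_t h, uint32_t bpp)
{
    assert(bpp > 0);
    return blocks_needed(image_bits(w, h, bpp)) <= M9K_BLOCKS;
}
```

By this count a 640x480 image at 24 bits per pixel needs 7372800 bits, which is more than the 3981312 available, while the 200x200 attempt at 24 bits per pixel (960000 bits) fits comfortably. At 640x480 something has to give, for instance fewer bits per pixel or a lower stored resolution.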
One important thing, the READ operation is asynchronous for distributed RAM (data is read from memory as soon as the address is given, doesn't wait for the clock edge), but synchronous for block RAM.
The example of single port ROM memory (Quartus II Verilog Template):
module single_port_rom
#(parameter DATA_WIDTH=8, parameter ADDR_WIDTH=8)
(
input [(ADDR_WIDTH-1):0] addr,
input clk,
output reg [(DATA_WIDTH-1):0] q
);
// Declare the ROM variable
reg [DATA_WIDTH-1:0] rom[2**ADDR_WIDTH-1:0];
initial
begin
$readmemh("single_port_rom_init.txt", rom);
end
always @ (posedge clk)
begin
q <= rom[addr];
end
endmodule
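The difference between the asynchronous and the synchronous read matters for a VGA pipeline, because the pixel arrives one clock after the address. A small behavioural model in C makes the latency visible. This is a sketch, not generated from the Verilog: `sync_rom`, `async_read` and `rom_posedge` are hypothetical names, and the sizes mirror the template's default parameters.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_WIDTH 8
#define DEPTH      (1u << ADDR_WIDTH)

typedef struct {
    uint8_t rom[DEPTH];
    uint8_t q;            /* registered output, like "output reg q" */
} sync_rom;

/* Distributed-RAM style read: data follows the address immediately. */
uint8_t async_read(const sync_rom *r, unsigned addr)
{
    assert(addr < DEPTH);
    return r->rom[addr];
}

/* Block-RAM style read: models one rising clock edge, q <= rom[addr].
 * The value is only visible on q after the edge. */
void rom_posedge(sync_rom *r, unsigned addr)
{
    assert(addr < DEPTH);
    r->q = r->rom[addr];
}
```

In the model, presenting an address changes nothing on q until `rom_posedge` is called, so the surrounding logic has to issue each pixel address one cycle ahead of when the pixel is needed.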

Linux block driver merge bio's

I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than use a request queue, as the device has no seek time. However, it still has transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (so 2 * 2k) and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. However on a write, the bios each have just one segment of 2 sectors and therefore the operations take a lot longer in total. What I would like to happen is to somehow cause the incoming bios to consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of:
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist* with sg_set_page
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
The request queue is set up like:
#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE 2048
blk_queue_make_request(mydev->queue, mydev_make_req);
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_flush(mydev->queue, 0);
blk_queue_segment_boundary(mydev->queue, -1UL);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_dma_alignment(mydev->queue, 0x7);
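If the merging ends up being done in the driver itself, the core decision is simply whether an incoming transfer continues where the previous one stopped. The sketch below shows that check in plain user-space C. It deliberately uses no kernel API: `xfer` and `merge_contiguous` are hypothetical names, sectors are counted in the device's own 2k units, and a real driver would additionally need queuing, locking and per-bio completion handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one pending transfer. */
typedef struct {
    uint64_t sector;   /* first device sector */
    uint32_t nsect;    /* length in device sectors */
    bool     write;    /* direction */
    uint32_t segs;     /* scatter-gather segments used so far */
} xfer;

/* Coalesce transfers, in arrival order, into as few transactions as
 * possible. Two neighbours are merged when they go in the same direction,
 * the second starts exactly where the first ends, and the combined
 * segment count stays within max_segs. out must have room for n entries.
 * Returns the number of transactions written to out. */
size_t merge_contiguous(const xfer *in, size_t n, xfer *out, uint32_t max_segs)
{
    size_t m = 0;

    assert(max_segs > 0);
    for (size_t i = 0; i < n; i++) {
        if (m > 0) {
            xfer *last = &out[m - 1];
            if (last->write == in[i].write &&
                last->sector + last->nsect == in[i].sector &&
                last->segs + in[i].segs <= max_segs) {
                last->nsect += in[i].nsect;
                last->segs  += in[i].segs;
                continue;
            }
        }
        out[m++] = in[i];
    }
    return m;
}
```

With a segment limit of 32, four single-segment writes at sectors 0, 2, 4 and 6 collapse into one four-segment transaction, which is the shape the reads already arrive in.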
