Of course, I can test on my boards, just prepare the bitstream.
Bitstreams here:
@emard @kost: I've added support for the f32c packet upload of a binary file to the RVSoc system. It now works at the low (115K2) and high (1M) speeds, but not at the very high (3M) speed. Did 3Mb/s work in other contexts?
Also, it only works if the terminal speed and the upload speed are set to the same value. I think that switching the speed on the FTDI UART may introduce noise characters on the line, and as far as I can see the upload protocol has no way to re-sync after such noise. Maybe I am missing something?
However, it seems that f32c also does not support split speeds (see https://github.com/f32c/f32c/blob/master/src/boot/sio/binboot.c). Maybe that never worked?
On the other hand, running both the terminal and the upload at 1M is only a minor nuisance; it just means a lot of options on the command line...
For the relevant code in fujprog, see here: https://github.com/kost/fujprog/blob/master/fujprog.c#L3427-L3434
@emard Thanks for the feedback! The Verilog loader that matches fujprog is here:
I'm currently using that with this command line:
fujprog -t -e tst.bin -x 1000000 -b 1000000 sys.bit1
and that is a practical workflow for me: the bit file gets loaded to the board, the test program uploads immediately, and it drops into a terminal for testing. At the moment it only seems to work when the upload and terminal speeds are the same, but maybe that is because there is an error in my Verilog for the upload receiver. It would be nice if I could drop the -x and -b options and just use the defaults.
In its current form it reliably uploads even large files (9MB) without error at an acceptable speed (at 1Mb/s that takes 90 s).
If 115K2 + 3M works for the f32c, it must be possible to get it to work for RVSoc as well.
@emard @lawrie I've completed my baseline work on the RVSoc system. Code is here:
It successfully runs xv6. I've also tried running Linux again using the binaries provided by the original project from the University of Tokyo, but it seems to get stuck after it switches from the Berkeley Boot Loader into Linux itself. Maybe the issue is a different memory layout; I'm not interested enough in Linux to dive into this and compile from source.
In a quick Dhrystone measurement I get about 2 DMips of performance, which is less than the 3 DMips that I was expecting based on the claims in the paper (~7 × 40MHz/104MHz ≈ 3). Not sure how that compares with VexRiscv (I assume that one will be well above 3 DMips on the ULX3S).
All in all I am quite impressed with this project. Who would have thought that one could run Linux on top of just 4000 lines of plain Verilog, consuming less than half of a 12F (i.e. 25F) chip?
I meant 2 DMips at 40MHz. This makes sense: an instruction takes 12 clocks (+ cache misses), so 40MHz / 12 = 3.3 million instructions per second. Taking into account cache misses and normalised (CISC) instructions, the reported 2 to 3 million instructions per second seems to be the right magnitude.
The f32c appears to be 30...35 times faster, doing 1.5 to 1.8 instructions per clock. Maybe I am totally misunderstanding Dhrystones and DMips?
I'm not feeling bad, also because it is not my design :^) I just ported it. I read up on Dhrystones and there are some interesting insights (for me, maybe everybody here already knows).
The f32c is a scalar design, so it should do 1 DMips/MHz if its instructions are about as efficient as those of the original VAX. In practical terms, I think that is true for both MIPS and RISC-V. Compilers have gotten better since then, and that accounts for some 30-40% of DMips 'inflation'. Also, running from zero-wait RAM instead of slow RAM plus a small cache accounts for some 30% of improvement. I now understand the 1.5-1.8 result for the f32c. No put-down intended; I can see that the f32c is fast.
I'm running my Dhrystone tests with the kencc compiler (which optimises less than -O3) and a simple library. This skews my measurement versus what is normally used. Compensating for that and using the 1.3 number that the f32c page reports for use with a cache and SDRAM, I get back to the 1 DMips/MHz number on a like-for-like basis. That makes the f32c some 20 times faster.
Now, comparing a scalar CPU (roughly one instruction per clock) with a CPU taking 12 clocks per instruction, I would have expected the difference to be 12x, not 20x. This discrepancy is the same order of magnitude as the 2 DMips versus the expected 3 DMips that I mentioned earlier. It is worth further investigation.
Maybe the choice to use a 256 byte cache line was not clever. Maybe I should reduce the cache line to 64 bytes, double the number of sets from 16 to 32 and split into I and D caches. This can be fitted into the cache design with relative ease.
As a final note, I'm interested in this SoC as a target for early-80s Unix, and 2 DMips is twice as fast as the original VAX, twice as fast (per core) as the 3B2 (which SysV was developed on), and as fast as a 68020 or 68030 running at 15MHz. The design is already at the mid-80s speed that I am looking for.