the [...] MULU/MULS instructions are very slow, taking about 70 cycles. The exact number of cycles is data-dependent because the microcode uses an iterative algorithm.)
So, (assuming that pndc is correct and the 68000 is microcoded). What "iterative algorithm" is this? I can imagine something naive like
(let (mulu_impl (lambda (ct x y) (if (eq? ct 0) y (mulu_impl (decrement ct) x (add x y)))))) (mulu (lambda (x y) (mulu_impl x x y)) (* untested; could be full of brainfarts *)
which basically counts x downward, adding the old x to y each time. But this would be naive and slow for large values of x. They probably had a better trick up their sleeves.
The chip is all hardwired - there is no microcode in it.
I find this quote rather hard to believe, since I think that if you implement something so complex without microcode you end up decreasing the clock speed to allow for signal propagation delays; I could be wrong though. But if it is the case, then my guess is that
MULS now need to be implemented differently from what I did in Lisp up there.
How was the multiplier(s) implemented in the 68000 and the 68060?
Although none of my databooks directly state how multiplication is done, we can infer several things from what they do say.
My copy of M68000 8-/16-/32-bit Microprocessor User's Manual shows the execution times of the MC68000 multiply instructions on page 8-4:
MULS 70(1/0)+* MULU 70(1/0)+*
where 70 is the number of processor cycles, 1 means one read cycle, 0 means no write cycles,
+ means additional cycles to fetch the effective address should be added, and
* means that the number 70 is the maximum number of cycles. Below this is exactly the text that @DroidW quotes in his answer.
On page 10-5 of the same databook, the execution times for the MC68010 are listed:
MULS 42(1/0)+* MULU 40(1/0)*
Note how fewer clock cycles are needed for this processor. I'm not sure why this
MULU instruction is missing the
+ for effective address calculation (perhaps a typo). Also, this table is missing the text that @DroidW quotes in his answer.
The MC68000 indeed uses microcode (and in some cases nanocode). The processor simply doesn't have enough combinational logic to perform a complete multiplication calculation in one cycle. Instead, it repeats some basic operations, with a loop controlled by the microcode. There are no barrel shifters, just single-bit shifts.
Let's call the multiplicand D, the multiplier R, and the product P.
To multiply unsigned numbers (MULU), we perform a long multiplication algorithm:
(1) Clear the product P = 0.
(2) Shift P one bit left.
(3) Shift D one bit left, and check the bit shifted out (the MSB). If it is zero, skip to step 5.
(4) Add R to the product P.
(5) Repeat steps 2 to 4 for each bit of D.
Because the branch in step 3 jumps ahead to step 5, it makes sense that the number of cycles will be based on 2x(number of ones in the multiplicand). This is confirmed in @DroidW's quote.
In later chips such as the 68010, we can add more logic to accomplish steps 2 to 4 in parallel. Also, instead of the branch in step 3, we can mask R with the bit from step 3, and then always add that result to P. These measures reduce the number of machine cycles.
In later processors, you can unroll the loop and use a large number of adders. It uses a lot more hardware and is harder to design, but is much faster. For example, to multiply binary numbers
1101 value of R 1101 value of R 0000 R is masked 0000 R is masked ======= 0100111
To multiply signed numbers (MULS), we use Booth's algorithm.
(1) Set P to D. An additional bit is added to the right (a new LSB), which is initialized to zero.
(2) Examine the two least significant bits of P. If 01, continue to step 3. If 10, skip to step 4. Otherwise, skip to step 5.
(3) Add R to the upper word of P. Jump to step 5.
(4) Subtract R from the upper word of P.
(5) Arithmetically shift P one bit right.
(6) Repeat steps 2 to 5 for all bits of D.
As you can see, there is more work to be done when bit patterns 01 and 10 are seen. This is confirmed by @DroidW's quote.
In advanced processors, you can also unroll this algorithm using hardware.
DIVS,DIVU - The divide algorithm used by the MC68000 provides less than 10% difference between the best and the worst case timings. MULS,MULU - The multiply algorithm requires 38+2n clocks where n is defined as: MULU: n = the number of ones in the <ea> MULS: n = concatanate the <ea> with a zero as the LSB; n is the resultant number of 10 or 01 patterns in the 17-bit source; i.e., worst case happens when the source is $5555