\ MM.FTH - Forth Floating point matrix multiply benchmark 0 [IF] ====================================================== This file is maintained by: Stephen Pelc MicroProcessor Engineering 133 Hill Lane Southampton SO15 5AF England tel: +44 (0)23 8063 1441 fax: +44 (0)23 8033 9691 email: stephen@mpeltd.demon.co.uk The Forth MM benchmark was contributed by Marcel Hendrix, derived from a C benchmark by Mark Smotherman. The code may be freely redistributed. If you modify this code, and want the changes to be incorporated in future releases, please send the changes to Stephen Pelc. The code is used to test MPE's VFX code generator and optimiser. Except for legacy reasons, no attempt is made to maintain this code for non-optimising and non-ANS systems. The code is maintained to test ANS Forth systems. It does not take advantage of system specific extensions, except for one (see below). However, it uses the utility file from the Forth Scientific Library, which is intended to be hacked for specific systems as much as you can. Consequently, when you contribute results, please also contribute the FSL utility file as well. ************************************* On 2.8GHz P4, 512Mb DDR266 RAM, XPpro ************************************* SwiftForth 2.00.3 19may2000 standard FPMATH.F and FSL-UTIL.F all-tests 500x500 mm - normal algorithm 3.282 secs. 500x500 mm - temporary variable in loop 9.781 secs. 500x500 mm - unrolled inner loop, factor of 4 9.969 secs. 500x500 mm - unrolled inner loop, factor of 8 9.922 secs. 500x500 mm - unrolled inner loop, factor of 16 9.906 secs. 500x500 mm - pointers used to access matrices 4.047 secs. 500x500 mm - pointers used, unrolled by 4 4.141 secs. 500x500 mm - transposed B matrix 8.891 secs. 500x500 mm - interchanged inner loops 8.797 secs. 500x500 mm - blocking, step size of 20 9.234 secs. 500x500 mm - Robert's algorithm 2.812 secs. 500x500 mm - T. Maeno's algorithm, subarray 20x20 3.766 secs. 500x500 mm - Generic Maeno, subarray 20x20 3.125 secs. 500x500 mm - D. Warner's algorithm, subarray 20x20 9.344 secs. ========================================================= Total 97.017 secs. ok iForth version 1.12.8722, generated 23:39:13, June 8, 2002. optimised FSL_UTIL.FRT 500x500 mm - normal algorithm 1.735 secs. 500x500 mm - temporary variable in loop 3.296 secs. 500x500 mm - unrolled inner loop, factor of 4 3.172 secs. 500x500 mm - unrolled inner loop, factor of 8 3.125 secs. 500x500 mm - unrolled inner loop, factor of 16 3.094 secs. 500x500 mm - pointers used to access matrices 1.734 secs. 500x500 mm - pointers used, unrolled by 4 1.578 secs. 500x500 mm - transposed B matrix 1.687 secs. 500x500 mm - interchanged inner loops 2.938 secs. 500x500 mm - blocking, step size of 20 3.125 secs. 500x500 mm - Robert's algorithm 0.593 secs. 500x500 mm - T. Maeno's algorithm, subarray 20x20 0.875 secs. 500x500 mm - Generic Maeno, subarray 20x20 0.922 secs. 500x500 mm - D. Warner's algorithm, subarray 20x20 1.765 secs. ========================================================= Total 29.639 secs. ok VFX Forth Version: 3.70 [build 1659] Build date: 5 July 2004 with optimising NDP387 and VFUTILS.FTH all-tests 500x500 mm - normal algorithm 1.641 secs. 500x500 mm - temporary variable in loop 2.766 secs. 500x500 mm - unrolled inner loop, factor of 4 2.563 secs. 500x500 mm - unrolled inner loop, factor of 8 2.562 secs. 500x500 mm - unrolled inner loop, factor of 16 2.563 secs. 500x500 mm - pointers used to access matrices 2.984 secs. 500x500 mm - pointers used, unrolled by 4 1.578 secs. 500x500 mm - transposed B matrix 1.187 secs. 500x500 mm - interchanged inner loops 2.172 secs. 500x500 mm - blocking, step size of 20 2.328 secs. 500x500 mm - Robert's algorithm 0.578 secs. 500x500 mm - T. Maeno's algorithm, subarray 20x20 0.578 secs. 500x500 mm - Generic Maeno, subarray 20x20 0.656 secs. 500x500 mm - D. Warner's algorithm, subarray 20x20 1.282 secs. ========================================================= Total 25.438 secs. ok =============================================================== [THEN] \ ************************************************ \ Select system to be tested, set FORTHSYSTEM \ to value of selected target. \ Set SPECIFICS false to avoid system dependencies. \ Set SPECIFICS true to show off implementation tricks. \ Set HACKING false to use the base source code. \ Set HACKING true to optimise the source code. \ ************************************************ 1 constant VfxForth3 \ MPE VFX Forth v3.x 2 constant Pfw22 \ MPE ProForth 2.2 3 constant SwiftForth20 \ FI SwiftForth 2.0 4 constant SwiftForth15 \ FI SwiftForth 1.5 5 constant Win32Forth \ Win32Forth 4.2 6 constant BigForth \ BigForth 11 July 1999 7 constant BigForth-Linux \ BigForth 11 July 1999 8 constant iForth \ iForth 1.12 5 Aug 2001 9 constant iForth20 \ iForth 2.0 8 June 2002 10 constant SwiftForth22 \ FI SwiftForth 2.2.2.9 VfxForth3 constant ForthSystem \ select system to test \ iForth20 constant ForthSystem \ SwiftForth22 constant ForthSystem \ Win32Forth constant ForthSystem false constant specifics \ true to use system dependent code false constant hacking \ true to use "guru" level code that \ makes assumptions of an optimising compiler. true constant ANSSystem \ Some Forth 83 systems cannot compile \ all the test examples without carnal \ knowledge, especially if the compiler \ checks control structures. : .specifics \ -- ; display trick state ." using" specifics 0= if ." no" then ." extensions" ; : .hacking \ -- ; display hack state ." using" hacking 0= if ." no" then ." hackery" ; : .testcond \ -- ; display test conditions .specifics ." and" .hacking ; \ ***************************** \ VFX Forth for Windows harness \ ***************************** VfxForth3 ForthSystem = [IF] [defined] +idata [if] +idata \ enable P4 data options variable zzz \ preallocate first IDATA buffer [then] true constant ndp? \ -- flag ; true if NDP stack version c" C:\Products\PfwVfx.dev\WindowsBox\Sources\Lib" setmacro LibDir c" C:\Products\PfwVfx.dev\WindowsBox\Sources\Lib\FSL\library" setmacro FslDir c" C:\Products\PfwVfx.dev\WindowsBox\Sources\Lib" setmacro NdpDir ndp? [if] S" %NdpDir%\Ndp387" INCLUDED [else] S" %NdpDir%\Hfp387" INCLUDED [then] char . dp-char ! \ select ANS number conversion char . fp-char ! -short-branches \ disable short forward branches S" %FslDir%\VfxUtil" INCLUDED \ FSL harness for VFX Forth 3.x : HTAB out @ - spaces ; \ n -- ; step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; extern: DWORD PASCAL GetTickCount( void ); : COUNTER \ -- ms GetTickCount ; : DFVARIABLE #16 buffer: ; [THEN] \ ****************************** \ iForth 2.0 for Windows harness \ ****************************** iForth20 ForthSystem = [IF] NEEDS -miscutil NEEDS -dynlink 0 VALUE 'counter S" kernel32.dll" LIBRARY-OPEN THROW ( dll) S" GetTickCount" ROT LIBRARY-FIND THROW TO 'counter : counter ( -- ms ) 0 'counter FOREIGN ; : defined \ caddr -- flag find nip ; \ include c:\dfwforth\examples\fsl\fsl_util.frt include c:\dfwforth\include\fsl_util.frt : & ; IMMEDIATE [THEN] \ ********************** \ SwiftForth 2.0 harness \ ********************** SwiftForth20 ForthSystem = [IF] include C:\MyApps\SwiftForth20\Lib\Options\fpmath.f : f<> F= 0= ; include C:\MyApps\SwiftForth20\Lib\FSLib\Library\fsl-util.f : HTAB get-xy drop - spaces ; \ n -- ; step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; [THEN] \ ********************** \ SwiftForth 2.2 harness \ ********************** SwiftForth22 ForthSystem = [IF] \ FPCONFIG.F should be in the BENCHMRK folder include C:\MyApps\SwiftForth2229\Lib\Options\fpmath.f : f<> F= 0= ; include C:\MyApps\SwiftForth2229\Unsupported\FSLib\Library\fsl-util.f : HTAB get-xy drop - spaces ; \ n -- ; step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; [THEN] \ ****************** \ Win32Forth harness \ ****************** Win32Forth ForthSystem = [IF] : COUNTER \ -- ms Call GetTickCount ; : >pos \ n -- ; step to position n getxy drop - spaces ; : HTAB #TAB ; \ n -- ; step to position n : M/ \ d n1 -- quot fm/mod nip ; : buffer: \ n -- ; -- addr create here over allot swap erase ; : 2- \ n -- n-2 2 - ; : [o/n] \ -- ; stop optimiser treating * DROP etc as no code ; immediate : SendMessage \ h m w l -- res swap 2swap swap \ Win32Forth uses reverse order Call SendMessage ; : GetTickCount \ -- ms Call GetTickCount ; [THEN] \ ******************** \ Start of common code \ ******************** 500 CONSTANT N \ -- n ; number of iterations to use \ ****************** \ Timing and display \ ****************** 50 CONSTANT TCOL 0 VALUE _time_ VARIABLE TotalTime : TIMER-RESET ( -- ) COUNTER TO _time_ ; : #? ( d -- d ) 2DUP OR 0= IF BL HOLD ELSE # THEN ; : .secs \ ms -- 0 <# BL HOLD # # # [char] . HOLD # #? #? #> TYPE ." secs." ; : .ELAPSED ( -- ) TCOL HTAB COUNTER _time_ - dup TotalTime +! .secs ; : .algo \ caddr u -- \ Display size and algorithm from string CR N 0 .R ." x" N 0 .R SPACE TYPE ; \ ***** \ TOOLS \ ***** C" DF@+" DEFINED 0= [IF] : DF@+ ( addr -- addr' ) ( F: -- r ) DUP DF@ DFLOAT+ ; [THEN] C" DF+!" DEFINED 0= [IF] : DF+! ( addr -- ) ( F: r -- ) DUP DF@ F+ DF! ; [THEN] C" DF!+" DEFINED 0= [IF] : DF!+ ( addr -- addr' ) ( F: r -- ) DUP DF! DFLOAT+ ; [THEN] C" DF+!+" DEFINED 0= [IF] : DF+!+ ( addr -- addr' ) ( F: r -- ) DUP DF@ F+ DF!+ ; [THEN] C" *DSUM" DEFINED 0= [IF] : *DSUM ( addr1 addr2 count -- addr1' addr2' ) ( F: -- n ) 0e 0 ?DO SWAP DF@+ SWAP DF@+ F* F+ LOOP ; : *DSUML ( addr1 addr2 count stride2 -- addr1' addr2' ) ( F: -- r ) LOCALS| stride2 | 0e 0 ?DO SWAP DF@+ SWAP DUP DF@ stride2 + F* F+ LOOP ; [THEN] CHAR x CONSTANT 'x' CHAR n CONSTANT 'n' CHAR v CONSTANT 'v' CHAR u CONSTANT 'u' CHAR p CONSTANT 'p' CHAR t CONSTANT 't' CHAR i CONSTANT 'i' CHAR b CONSTANT 'b' CHAR m CONSTANT 'm' CHAR r CONSTANT 'r' CHAR w CONSTANT 'w' CHAR z CONSTANT 'z' \ ===================================================================================== DOUBLE DMATRIX a{{ DOUBLE DMATRIX b{{ DOUBLE DMATRIX c{{ DOUBLE DMATRIX d{{ DOUBLE DMATRIX bt{{ 0 [IF] ================================================ \ Useful test bits : .a{{ \ -- ; display A{{ matrix cr ." A{{" N 0 ?DO cr N 0 ?DO a{{ J I }} DF@ F. LOOP LOOP ; : .b{{ \ -- ; display B{{ matrix cr ." B{{" N 0 ?DO cr N 0 ?DO b{{ J I }} DF@ F. LOOP LOOP ; : .c{{ \ -- ; display B{{ matrix cr ." C{{" N 0 ?DO cr N 0 ?DO c{{ J I }} DF@ F. LOOP LOOP ; : .bt{{ \ -- ; display B{{ matrix cr ." BT{{" N 0 ?DO cr N 0 ?DO bt{{ J I }} DF@ F. LOOP LOOP ; [THEN] : SET-COEFFICIENTS ( -- ) \ Set coefficients so that result matrix should have row entries \ equal to (1/2)*n*(n-1)*i in row i N 0 ?DO N 0 ?DO J S>F FDUP b{{ J I }} DF! a{{ J I }} DF! LOOP LOOP ; : FLUSH-CACHE ( -- ) N 0 ?DO N 0 ?DO 0e d{{ J I }} DF! LOOP LOOP ; FVARIABLE row_sum FVARIABLE sum : CHECK-RESULT ( -- ) 0e row_sum F! N N 1- * 2/ S>F sum F! N 0 ?DO I S>F sum F@ F* row_sum F! N 0 ?DO a{{ J I }} DF@ J S>F F<> IF CR ." error in result entry a{{ " J DEC. I DEC. ." }}: " a{{ J I }} DF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT THEN b{{ J I }} DF@ J S>F F<> IF CR ." error in result entry b{{ " J DEC. I DEC. ." }}: " b{{ J I }} DF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT THEN c{{ J I }} DF@ row_sum F@ F<> IF CR ." error in result entry c{{ " J DEC. I DEC. ." }}: " c{{ J I }} DF@ F. ." <> " row_sum F@ F. UNLOOP UNLOOP EXIT THEN LOOP LOOP ; : NORMAL() ( -- ) s" mm - normal algorithm" .algo TIMER-RESET N 0 ?DO N 0 ?DO a{{ J 0 }} b{{ 0 I }} N N DFLOATS *DSUML 2DROP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : TNSQ() ( -- ) 0 LOCALS| K | s" mm - temporary variable in loop" .algo TIMER-RESET N 0 ?DO I TO K N 0 ?DO a{{ J 0 }} DF@ b{{ 0 I }} DF@ F* N 1 ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : UNROLL4() ( -- ) 0 0 LOCALS| K S | s" mm - unrolled inner loop, factor of 4" .algo TIMER-RESET N 0 ?DO I TO K N 0 ?DO 0e 0 TO S N 3 - 0 ?DO I TO S a{{ K I }} DF@ b{{ I J }} DF@ F* F+ a{{ K I 1+ }} DF@ b{{ I 1+ J }} DF@ F* F+ a{{ K I 2+ }} DF@ b{{ I 2+ J }} DF@ F* F+ a{{ K I 3 + }} DF@ b{{ I 3 + J }} DF@ F* F+ 4 +LOOP N S 4 + ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : UNROLL8() ( -- ) 0 0 LOCALS| K S | s" mm - unrolled inner loop, factor of 8" .algo TIMER-RESET N 0 ?DO I TO K N 0 ?DO 0e 0 TO S N 7 - 0 ?DO I TO S a{{ K I }} DF@ b{{ I J }} DF@ F* F+ a{{ K I 1+ }} DF@ b{{ I 1+ J }} DF@ F* F+ a{{ K I 2+ }} DF@ b{{ I 2+ J }} DF@ F* F+ a{{ K I 3 + }} DF@ b{{ I 3 + J }} DF@ F* F+ a{{ K I 4 + }} DF@ b{{ I 4 + J }} DF@ F* F+ a{{ K I 5 + }} DF@ b{{ I 5 + J }} DF@ F* F+ a{{ K I 6 + }} DF@ b{{ I 6 + J }} DF@ F* F+ a{{ K I 7 + }} DF@ b{{ I 7 + J }} DF@ F* F+ 8 +LOOP N S 8 + ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : UNROLL16() ( -- ) 0 0 LOCALS| K S | s" mm - unrolled inner loop, factor of 16" .algo TIMER-RESET N 0 ?DO I TO K N 0 ?DO 0e 0 TO S N 15 - 0 ?DO I TO S a{{ K I }} DF@ b{{ I J }} DF@ F* F+ a{{ K I 1+ }} DF@ b{{ I 1+ J }} DF@ F* F+ a{{ K I 2+ }} DF@ b{{ I 2+ J }} DF@ F* F+ a{{ K I 3 + }} DF@ b{{ I 3 + J }} DF@ F* F+ a{{ K I 4 + }} DF@ b{{ I 4 + J }} DF@ F* F+ a{{ K I 5 + }} DF@ b{{ I 5 + J }} DF@ F* F+ a{{ K I 6 + }} DF@ b{{ I 6 + J }} DF@ F* F+ a{{ K I 7 + }} DF@ b{{ I 7 + J }} DF@ F* F+ a{{ K I 8 + }} DF@ b{{ I 8 + J }} DF@ F* F+ a{{ K I 9 + }} DF@ b{{ I 9 + J }} DF@ F* F+ a{{ K I 10 + }} DF@ b{{ I 10 + J }} DF@ F* F+ a{{ K I 11 + }} DF@ b{{ I 11 + J }} DF@ F* F+ a{{ K I 12 + }} DF@ b{{ I 12 + J }} DF@ F* F+ a{{ K I 13 + }} DF@ b{{ I 13 + J }} DF@ F* F+ a{{ K I 14 + }} DF@ b{{ I 14 + J }} DF@ F* F+ a{{ K I 15 + }} DF@ b{{ I 15 + J }} DF@ F* F+ 16 +LOOP N S 16 + ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : UNROLL ( n -- ) CASE 4 OF UNROLL4() ENDOF 8 OF UNROLL8() ENDOF 16 OF UNROLL16() ENDOF CR ." mm - unrolled inner loop, factor of " DUP DEC. ." not implemented" ENDCASE ; specifics [if] VfxForth3 ForthSystem = [if] -fasti [then] [then] : PNSQ4() ( -- ) 0 LOCALS| S | s" mm - pointers used, unrolled by 4" .algo TIMER-RESET N 0 ?DO N 0 ?DO 0e a{{ J 0 }} b{{ 0 I }} 0 TO S N 3 - 0 ?DO I TO S SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + 4 +LOOP N S 4 + ?DO SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + LOOP c{{ J I }} DF! 2DROP LOOP LOOP .ELAPSED ; : PNSQ() ( n -- ) DUP 4 = IF DROP PNSQ4() EXIT THEN s" mm - pointers used to access matrices" .algo ?DUP IF ." , unroll factor of " DEC. ." not allowed" EXIT THEN TIMER-RESET N 0 ?DO N 0 ?DO 0e a{{ J 0 }} b{{ 0 I }} N 0 ?DO SWAP DF@+ SWAP DUP DF@ N DFLOATS + F* F+ \ DUP DF@ \ F* F+ \ N DFLOATS + LOOP c{{ J I }} DF! 2DROP LOOP LOOP .ELAPSED ; specifics [if] VfxForth3 ForthSystem = [if] +fasti [then] [then] : TRANSPOSE() ( -- ) 0 LOCALS| K | s" mm - transposed B matrix" .algo TIMER-RESET N 0 ?DO N 0 ?DO b{{ J I }} DF@ bt{{ I J }} DF! LOOP LOOP N 0 ?DO I TO K N 0 ?DO a{{ J 0 }} DF@ bt{{ I 0 }} DF@ F* N 1 ?DO a{{ K I }} DF@ bt{{ J I }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; \ from Monica Lam ASPLOS-IV paper : REG_LOOPS() ( -- ) 0 LOCALS| K | s" mm - interchanged inner loops" .algo TIMER-RESET N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO K N 0 ?DO a{{ J I }} DF@ N 0 ?DO FDUP b{{ J I }} DF@ F* c{{ K I }} DF+! LOOP FDROP LOOP LOOP .ELAPSED ; : TILING() ( step -- ) s" mm - blocking, step size of " .algo DUP DEC. DUP 4 N 1+ WITHIN 0= IF drop ." is unreasonable" EXIT THEN TIMER-RESET 0 0 0 LOCALS| K kk jj step | N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk N 0 ?DO I TO jj N 0 ?DO I TO K kk step + N MIN kk ?DO a{{ J I }} DF@ jj step + N MIN jj ?DO FDUP b{{ J I }} DF@ F* c{{ K I }} DF+! LOOP FDROP LOOP LOOP step +LOOP step +LOOP .ELAPSED ; \ ********************************************/ \ * Contributed by Robert Debath 26 Nov 1995 */ \ * rdebath@cix.compulink.co.uk */ \ ********************************************/ : ROBERT() ( -- ) s" mm - Robert's algorithm" .algo TIMER-RESET N 0 ?DO N 0 ?DO b{{ J I }} DF@ bt{{ I J }} DF! LOOP LOOP N 0 ?DO N 0 ?DO a{{ J 0 }} bt{{ I 0 }} N *DSUM 2DROP c{{ J I }} DF! LOOP LOOP .ELAPSED ; 0 [IF] =========================================================================== * Matrix Multiply by Dan Warner, Dept. of Mathematics, Clemson University * * mmbu2.f multiplies matrices a and b * a and b are n by n matrices * nb is the blocking parameter. * the tuning guide indicates nb = 50 is reasonable for the * ibm model 530 hence 25 should be reasonable for the 320 * since the 320 has 32k rather than 64k of cache. * Inner loops unrolled to depth of 2 * The loop functions without clean up code at the end only * if the unrolling occurs to a depth k which divides into n * in this case n must be divisible by 2. * The blocking parameter nb must divide into n if the * multiply is to succeed without clean up code at the end. * * converted to c by Mark Smotherman * note that nb must also be divisible by 2 => cannot use 25, so use 20 =========================================================================== [THEN] DFVARIABLE s10 DFVARIABLE s00 DFVARIABLE s01 DFVARIABLE s11 : WARNER() ( nb -- ) 0 0 0 0 LOCALS| K ii jj kk nb | s" mm - D. Warner's algorithm, subarray " .algo nb 0 .R 'x' EMIT nb 0 .R N nb MOD N 2 MOD OR IF cr ." the matrix size " N DEC. ." must be divisible both by the block size " nb DEC. ." and 2." EXIT THEN nb 2 MOD IF ." block size must be evenly divisible by 2" EXIT THEN TIMER-RESET N 0 ?DO I TO ii N 0 ?DO I TO jj nb ii + ii ?DO nb jj + jj ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk nb ii + ii ?DO I TO K nb jj + jj ?DO c{{ J I }} DF@ s00 DF! c{{ J I 1+ }} DF@ s01 DF! c{{ J 1+ I }} DF@ s10 DF! c{{ J 1+ I 1+ }} DF@ s11 DF! nb kk + kk ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* s00 DF+! a{{ K I }} DF@ b{{ I J 1+ }} DF@ F* s01 DF+! a{{ K 1+ I }} DF@ b{{ I J }} DF@ F* s10 DF+! a{{ K 1+ I }} DF@ b{{ I J 1+ }} DF@ F* s11 DF+! LOOP s00 DF@ c{{ J I }} DF! s01 DF@ c{{ J I 1+ }} DF! s10 DF@ c{{ J 1+ I }} DF! s11 DF@ c{{ J 1+ I 1+ }} DF! 2 +LOOP 2 +LOOP nb +LOOP nb +LOOP nb +LOOP .ELAPSED ; 0 [IF] =========================================================================== Matrix Multiply tuned for SS-10/30; * Maeno Toshinori * Tokyo Institute of Technology * * Using gcc-2.4.1 (-O2), this program ends in 12 seconds on SS-10/30. * * in original algorithm - sub-area for cache tiling * #define L 20 * #define L2 20 * three 20x20 matrices reside in cache; two may be enough =========================================================================== [THEN] DFVARIABLE t0 DFVARIABLE t1 DFVARIABLE t2 DFVARIABLE t3 DFVARIABLE t4 DFVARIABLE t5 DFVARIABLE t6 DFVARIABLE t7 : MAENO() ( nb -- ) 0 0 0 0 0 LOCALS| K it kt i2 kk lparm | s" mm - T. Maeno's algorithm, subarray " .algo lparm 0 .R 'x' EMIT lparm 0 .R N lparm MOD N 4 MOD OR IF cr ." the matrix size " N DEC. ." must be divisible both by the block size " lparm DEC. ." and 4." EXIT THEN lparm 4 MOD IF cr ." block size must be evenly divisible by 4" EXIT THEN TIMER-RESET N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO i2 N 0 ?DO I TO kk i2 lparm + TO it kk lparm + TO kt N 0 ?DO I TO K it i2 ?DO 0e t0 DF! 0e t1 DF! 0e t2 DF! 0e t3 DF! 0e t4 DF! 0e t5 DF! 0e t6 DF! 0e t7 DF! kt kk ?DO a{{ J I }} DF@ FDUP b{{ I K }} DUP DF@+ F* t0 DF+! FDUP DF@+ F* t1 DF+! FDUP DF@+ F* t2 DF+! DF@ F* t3 DF+! a{{ J 1+ I }} DF@ FDUP DF@+ F* t4 DF+! FDUP DF@+ F* t5 DF+! FDUP DF@+ F* t6 DF+! DF@ F* t7 DF+! LOOP t0 DF@ c{{ I J }} DF+!+ t1 DF@ DF+!+ t2 DF@ DF+!+ t3 DF@ DF+! t4 DF@ c{{ I 1+ J }} DF+!+ t5 DF@ DF+!+ t6 DF@ DF+!+ t7 DF@ DF+! 2 +LOOP 4 +LOOP lparm +LOOP lparm +LOOP .ELAPSED ; : MM_MAENO \ pip1 pip2 pop3 nb -- \ Takes pointers to two FSL input arrays, an FSL output array and a \ block size nb. 0 0 0 0 0 LOCALS| K it kt i2 kk lparm pop3{{ pip2{{ pip1{{ | N 0 ?DO N 0 ?DO 0e pop3{{ J I }} DF! LOOP LOOP N 0 ?DO I TO i2 N 0 ?DO I TO kk i2 lparm + TO it kk lparm + TO kt N 0 ?DO I TO K it i2 ?DO 0e t0 DF! 0e t1 DF! 0e t2 DF! 0e t3 DF! 0e t4 DF! 0e t5 DF! 0e t6 DF! 0e t7 DF! kt kk ?DO pip1{{ J I }} DF@ FDUP pip2{{ I K }} DUP DF@+ F* t0 DF+! FDUP DF@+ F* t1 DF+! FDUP DF@+ F* t2 DF+! DF@ F* t3 DF+! pip1{{ J 1+ I }} DF@ FDUP DF@+ F* t4 DF+! FDUP DF@+ F* t5 DF+! FDUP DF@+ F* t6 DF+! DF@ F* t7 DF+! LOOP t0 DF@ pop3{{ I J }} DF+!+ t1 DF@ DF+!+ t2 DF@ DF+!+ t3 DF@ DF+! t4 DF@ pop3{{ I 1+ J }} DF+!+ t5 DF@ DF+!+ t6 DF@ DF+!+ t7 DF@ DF+! 2 +LOOP 4 +LOOP lparm +LOOP lparm +LOOP ; : MAENO2() \ nb -- s" mm - Generic Maeno, subarray " .algo dup 0 .R 'x' EMIT dup 0 .R N over MOD N 4 MOD OR IF cr ." the matrix size " N DEC. ." must be divisible both by the block size " DEC. ." and 4." EXIT THEN dup 4 MOD IF drop cr ." block size must be evenly divisible by 4" EXIT THEN TIMER-RESET >r a{{ b{{ c{{ r> MM_MAENO .ELAPSED ; : MM ( char n -- ) DEPTH 0= ABORT" no algorithm chosen" DEPTH 2 < IF 0 THEN LOCALS| ur | & a{{ N N }}malloc malloc-fail? & b{{ N N }}malloc malloc-fail? OR & bt{{ N N }}malloc malloc-fail? OR & c{{ N N }}malloc malloc-fail? OR & d{{ N N }}malloc malloc-fail? OR ABORT" MM :: out of core" SET-COEFFICIENTS FLUSH-CACHE CASE 'n' OF NORMAL() ENDOF 'v' OF TNSQ() ENDOF 'u' OF ur UNROLL ENDOF 'p' OF ur PNSQ() ENDOF 't' OF TRANSPOSE() ENDOF 'i' OF REG_LOOPS() ENDOF 'b' OF ur TILING() ENDOF 'm' OF ur MAENO() ENDOF 'z' OF ur MAENO2() ENDOF 'r' OF ROBERT() ENDOF 'w' OF ur WARNER() ENDOF CR ." `" DUP EMIT ." ' is an invalid algorithm" ENDCASE CHECK-RESULT & d{{ }}free & c{{ }}free & bt{{ }}free & b{{ }}free & a{{ }}free key? drop \ Permits o/p update on some systems ; : ALL-TESTS ( -- ) page 0 TotalTime ! 'n' mm 'v' mm 'u' 4 mm 'u' 8 mm 'u' 16 mm 'p' mm 'p' 4 mm 't' mm 'i' mm 'b' 20 mm 'r' mm 'm' 20 mm 'z' 20 mm 'w' 20 mm cr TCOL 7 + 0 ?DO ." =" LOOP cr ." Total" .testcond TCOL HTAB TotalTime @ .secs ; : .ABOUT CR ." Try: 'n' mm -- normal" CR ." 'v' mm -- with a temporary variable in the inner loop" CR ." 'u' n mm -- with unrolled (by n) inner loop, n = {4,8,16}" CR ." 'p' mm -- using pointers instead of array notation" CR ." 'p' 4 mm -- using pointers instead of array notation, unrolled by 4 [new]" CR ." 't' mm -- with transposed b matrix" CR ." 'i' mm -- with switched inner loops" CR ." 'b' n mm -- using blocking by n, 4 < n < " N DEC. CR ." 'r' mm -- using Robert's algorithm" CR ." 'r' 8 mm -- using Robert's algorithm unrolled by 8" CR ." 'm' n mm -- using Maeno's algorithm with blocking factor n" CR ." 'z' n mm -- using Maeno's algorithm with blocking factor n - generic form" CR ." 'w' n mm -- using Warner's algorithm with blocking factor n" CR CR ." ALL-TESTS -- test all algorithms" ; .ABOUT ( * End of Source * )