ACL 7.0 is smokin!
Tuesday, November 9, 2004
Several months ago, I ran Eric Marsden's cl-bench benchmarks against the 3 most popular Win32 CL implementations. Now that ACL 7.0 is available, I decided to see how it compares. Because I wanted to use the previous release as a basis for comparison against the new one, I have kept the ACL 6.2 figures in the table. As before, all tests were run on a Toshiba Tecra 9000 laptop with a 1200MHz Intel Pentium III processor and 1 MB of RAM under MS Windows 2000 SP4. The color coding is as follows:
- Dark green indicates the best result.
- Light green indicates second place.
- Orange indicates third place.
- Red indicates the worst result.
- "n/a" indicates that the test did not run successfully.
| Benchmark | CLISP-2.33 | LW-4.3.7 | ACL-6.2 | ACL-7.0 |
|---|---|---|---|---|
| COMPILER | 2.28 | 3.28 | 2.48 | 1.88 |
| LOAD-FASL | 1.20 | 4.35 | 0.65 | 0.51 |
| SUM-PERMUTATIONS | 7.41 | 15.45 | 5.02 | 1.84 |
| WALK-LIST/SEQ | 0.09 | 0.21 | 0.25 | 0.25 |
| WALK-LIST/MESS | 0.03 | n/a | n/a | n/a |
| BOYER | 91.50 | 88.49 | 23.00 | 41.14 |
| BROWSE | 2.60 | 7.63 | 0.79 | 0.86 |
| DDERIV | 3.31 | 1.75 | 0.91 | 0.93 |
| DERIV | 3.46 | 1.98 | 0.88 | 0.98 |
| DESTRUCTIVE | 4.02 | 1.18 | 0.84 | 0.86 |
| DIV2-TEST-1 | 4.73 | 2.70 | 1.09 | 1.17 |
| DIV2-TEST-2 | 7.04 | 3.07 | 1.43 | 1.43 |
| FFT | 7.70 | 7.59 | 6.70 | 7.37 |
| FRPOLY/FIXNUM | 9.70 | 2.00 | 1.61 | 1.51 |
| FRPOLY/BIGNUM | 3.24 | 2.07 | 1.86 | 1.42 |
| FRPOLY/FLOAT | 8.98 | 2.50 | 2.63 | 2.58 |
| PUZZLE | 21.41 | 5.36 | 10.03 | 9.03 |
| TAK | 9.63 | 0.77 | 0.73 | 0.71 |
| CTAK | 7.55 | 0.78 | 1.90 | 1.94 |
| TRTAK | 9.27 | 0.77 | 0.72 | 0.69 |
| TAKL | 12.35 | 1.90 | 2.08 | 1.99 |
| STAK | 7.86 | 1.07 | 5.25 | 2.94 |
| FPRINT/UGLY | 5.07 | 5.99 | 9.82 | 7.46 |
| FPRINT/PRETTY | 4.31 | 23.73 | 11.06 | 8.25 |
| TRAVERSE | 22.74 | 6.58 | 3.51 | 3.74 |
| TRIANGLE | 36.20 | 3.92 | 8.61 | 7.86 |
| RICHARDS | 24.58 | 5.09 | 9.21 | 7.98 |
| FACTORIAL | 1.30 | 2.48 | 2.22 | 0.99 |
| FIB | 4.73 | 0.39 | 0.29 | 0.26 |
| FIB-RATIO | 0.07 | 0.18 | 9.21 | 0.08 |
| MANDELBROT/COMPLEX | 33.54 | 25.32 | 31.09 | 23.28 |
| MANDELBROT/DFLOAT | 30.92 | 13.11 | 15.09 | 15.23 |
| MRG32K3A | 42.78 | 13.84 | 7.79 | 4.64 |
| CRC40 | 161.34 | 90.64 | 86.25 | 88.54 |
| BIGNUM/ELEM-100-1000 | 0.09 | 0.74 | 1.62 | 0.55 |
| BIGNUM/ELEM-1000-100 | 0.27 | 2.87 | 8.02 | 2.09 |
| BIGNUM/ELEM-10000-1 | 0.25 | 3.73 | 22.42 | 3.52 |
| BIGNUM/PARI-100-10 | 0.03 | 0.08 | 0.12 | 0.09 |
| BIGNUM/PARI-200-5 | 0.10 | 0.38 | 0.57 | 0.44 |
| PI-DECIMAL/SMALL | 2.56 | 37.61 | 757.83 | 3.55 |
| PI-DECIMAL/BIG | 1.88 | n/a | 1727.69 | 3.45 |
| PI-ATAN | 2.10 | n/a | 11.97 | 2.78 |
| PI-RATIOS | 1.19 | 11.15 | 59.17 | 4.99 |
| SLURP-LINES | 0.02 | 0.01 | 0.02 | 0.02 |
| HASH-STRINGS | 3.05 | 2.05 | 14.35 | 1.65 |
| HASH-INTEGERS | 3.20 | 5.04 | 6.73 | 1.60 |
| BOEHM-GC | 25.83 | 11.94 | 9.29 | 8.26 |
| DEFLATE-FILE | 7.83 | n/a | 3.57 | 2.61 |
| 1D-ARRAYS | 1.49 | 0.69 | 0.51 | 0.52 |
| 2D-ARRAYS | 39.52 | 24.04 | 23.33 | 20.85 |
| 3D-ARRAYS | 87.82 | n/a | 48.18 | 46.42 |
| BITVECTORS | 16.92 | 2.83 | 3.20 | 3.23 |
| BENCH-STRINGS | 3.55 | 25.44 | 27.99 | 2.89 |
| fill-strings/adjustable | 64.58 | 66.62 | 97.89 | 49.50 |
| STRING-CONCAT | 564.81 | n/a | n/a | 74.65 |
| SEARCH-SEQUENCE | 18.52 | 7.48 | 7.47 | 6.89 |
| CLOS/defclass | 0.89 | 6.38 | 0.31 | 4.81 |
| CLOS/defmethod | 0.20 | 1.40 | 0.35 | 0.49 |
| CLOS/instantiate | 6.53 | 5.92 | 5.11 | 4.78 |
| CLOS/simple-instantiate | 6.08 | 1.26 | 0.88 | 0.86 |
| CLOS/methodcalls | 9.76 | 8.13 | 7.34 | 5.77 |
| CLOS/method+after | 8.07 | 6.67 | 3.11 | 2.45 |
| CLOS/complex-methods | n/a | 0.86 | 0.76 | 0.67 |
| EQL-SPECIALIZED-FIB | 3.12 | 1.06 | 0.96 | 0.86 |
I was impressed by how much performance has improved in ACL 7.0 compared to ACL 6.2. ACL 7.0 now places 1st or 2nd in most of the tests and has no "worst" scores. Admittedly, this is not a scientific test (all the caveats I expressed in my earlier posting still apply, and results can vary from test to test); nevertheless, it represents a significant performance boost for ACL 7.0.
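To make the color coding concrete, here is a minimal Python sketch of how the places could be computed from the table. It treats the figures as run times (so lower is better), gives a test marked "n/a" no place, and lets tied results share a place ("1224"-style ranking). These are assumptions, since the post does not say exactly how ties and failures were colored, and all function names are hypothetical.

```python
from collections import Counter
from typing import Optional

# Column order as in the table header (assumed fixed).
IMPLS = ["CLISP-2.33", "LW-4.3.7", "ACL-6.2", "ACL-7.0"]


def parse_cell(cell: str) -> Optional[float]:
    """Return the time in a table cell, or None when the test did not run ("n/a")."""
    cell = cell.strip()
    return None if cell == "n/a" else float(cell)


def rank_row(times: list[Optional[float]]) -> list[Optional[int]]:
    """Competition ("1224") ranking: lower time is better, ties share a place,
    and a failed run (None) gets no place at all."""
    valid = [t for t in times if t is not None]
    return [None if t is None else 1 + sum(v < t for v in valid) for t in times]


def parse_table(text: str) -> dict[str, list[Optional[float]]]:
    """Parse pipe-delimited rows into {benchmark: [time per implementation]}."""
    rows: dict[str, list[Optional[float]]] = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header row, the |---| separator row, and blank lines.
        if cells[0] == "Benchmark" or set(cells[0]) <= set("-: "):
            continue
        rows[cells[0]] = [parse_cell(c) for c in cells[1:]]
    return rows


def tally_places(rows: dict[str, list[Optional[float]]]) -> dict[str, Counter]:
    """Count how often each implementation finished 1st, 2nd, 3rd, ..."""
    tally = {impl: Counter() for impl in IMPLS}
    for times in rows.values():
        for impl, place in zip(IMPLS, rank_row(times)):
            if place is not None:
                tally[impl][place] += 1
    return tally


# Three rows copied from the table, for illustration.
SAMPLE = """
| Benchmark | CLISP-2.33 | LW-4.3.7 | ACL-6.2 | ACL-7.0 |
|---|---|---|---|---|
| COMPILER | 2.28 | 3.28 | 2.48 | 1.88 |
| WALK-LIST/SEQ | 0.09 | 0.21 | 0.25 | 0.25 |
| PI-DECIMAL/BIG | 1.88 | n/a | 1727.69 | 3.45 |
"""

if __name__ == "__main__":
    for impl, places in tally_places(parse_table(SAMPLE)).items():
        print(impl, dict(sorted(places.items())))
```

Under this tie rule, the 0.25 that ACL 6.2 and ACL 7.0 share on WALK-LIST/SEQ counts as a shared third place rather than a worst result. Pasting the full table into `parse_table` gives per-implementation place counts for the whole run.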
I exchanged a number of emails with Duane Rettig of Franz to learn more about what had been done. His emails provide insight into some of the issues, so I decided to include them in this posting. Here are some of his comments (extracted from multiple emails and reproduced here with permission):
"I mostly try to stay away from commenting on benchmarking except for making generalizations, because it is such a YMMV thing, and also because it is so transient; one platform emphasizes one thing and another emphasizes another; when one vendor finds a great algorithm the others scramble to try to leapfrog the first. So it is constantly changing.
To my knowledge, there is only one major functional reason why any of our lisp functionality should be slower than any others, and that is our 'wide-binding' symbol-access style - I've actually sped that up from 6.2 to 7.0, and am still looking at better ways to access symbol-value faster - you'll see speedups on any benchmarks that access global-special variables, and although you won't see any benchmarks to prove it, wide-binding allows lisp thread switches to become much faster, with both our 'virtual threads' and our 'os-threads' designs.
We have improved string/symbol hashing in two steps; in 6.2, we issued the hash.004 patch which improved the distribution of hash codes dramatically, and which reduced the collisions in these benchmarks to nearly zero. In 7.0, we also bashed things out in the symbol representation and thus increased the width of the hash-code generation field from 16 bits to 24 bits, which removed the rest of the collisions and spread the hash codes even more evenly. Your perception of snappiness is probably due to the two decimal orders of magnitude increase in hashing efficiency.
One final observation: The boyer benchmark seems much worse on 7.0, but I suspect that if you do a (room) on each of your 6.2 and 7.0 lisps, you'd find a much smaller newspace on whichever 7.0 image you are using to do the benchmarks. The boyer benchmark is cons-intensive, and depending on how many iterations you do the gc will figure into the benchmark speed. I don't know what the current thinking is on the inclusion of gc times, but back in the gabriel/stanford-benchmark times in the late '80s and early '90s, I think that we had agreed that gc was undeterministic and thus should be factored out of benchmarks. This is borne out by the fact that even this set of benchmarks does a preliminary call to bench-gc, for each implementation to do with what they wish. I have not taken advantage of that, but in my own copy of sysdep/setup-acl.lisp, just before the (in-package :cl-bench), I placed this form:
(sys::resize-areas :new 15000000 :old 15000000)
which seems to help boyer out."
In addition to his technical comments, Duane summed up the influence of benchmarking tests very nicely in one of his emails. I'll close this posting with his summary:
"Two things (perhaps it sounds too much like marketing hype to you, but it is truly how I, as a developer, do feel):
- Results like this should serve as an answer to those who have guessed that we 'aren't interested' in speeding up particular kinds of benchmarks - obviously, as programmers explore newer or less-used areas of Lisp functionality, or as they come up with new benchmarks (especially those which showcase their favorite implementations' strong suits) there will always be some possibilities for temporary lags in performance as we weigh the requirements of our customers (especially paying ones). And, quite frankly, those areas might not always get immediate attention, since our highest priorities are in fact to satisfy requirements of our customers (or, more accurately, to do what we can to make them more successful). But putting these optimizations at lower priority doesn't necessarily mean that they won't get done; it just means that it won't get done as soon.
- This isn't the end of it, by any means. I'm sure that other CL developers will continue (as they have) to improve the speed of their implementations, as will we. And the more meaningful benchmarks (both quantity and quality together) we see, the better CL implementations are likely to get."
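Duane's description of the hashing work (a better-distributed hash function in the hash.004 patch, then a hash-code field widened from 16 to 24 bits) can be illustrated with a small Python experiment. This is not ACL's hash function or symbol layout: 32-bit FNV-1a stands in for a well-distributed hash, a sum-of-character-codes hash stands in for a poorly distributed one, and the `sym-N` key names are hypothetical. The sketch counts how many of 100,000 distinct names land on a code already used by an earlier name when only the low 16 or 24 bits of the hash are kept.

```python
def fnv1a_32(s: str) -> int:
    """32-bit FNV-1a: a simple, well-distributed string hash (stand-in, not ACL's)."""
    h = 0x811C9DC5  # FNV offset basis
    for b in s.encode("utf-8"):
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by FNV prime, keep 32 bits
    return h


def additive_hash(s: str) -> int:
    """Sum of character codes: a deliberately poor hash with a very narrow range."""
    return sum(map(ord, s))


def count_collisions(codes: list[int], bits: int) -> int:
    """Number of keys whose code, truncated to `bits` bits, was already taken."""
    mask = (1 << bits) - 1
    return len(codes) - len({c & mask for c in codes})


# Hypothetical symbol names; any large set of distinct strings would do.
KEYS = [f"sym-{i}" for i in range(100_000)]

if __name__ == "__main__":
    for name, fn in (("additive", additive_hash), ("FNV-1a", fnv1a_32)):
        codes = [fn(k) for k in KEYS]
        for bits in (16, 24):
            print(f"{name:8} {bits}-bit field: {count_collisions(codes, bits)} collisions")
```

The two knobs behave differently: a poorly distributed hash collides heavily at any field width, while even a good hash cannot escape a narrow field. With 100,000 keys, a 16-bit field has only 65,536 codes, so at least 34,464 collisions are guaranteed by the pigeonhole principle.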

