The Power of Optimization

•January 4, 2010 • Leave a Comment

People are pretty skeptical when I talk about getting big speedups from simple changes, so when optimizing/debugging the program I made recently to generate tough problem instances of Maximum Independent Set (MIS), I timed how long the generator took on one CPU core of my 1.66GHz Core 2 Duo laptop.

The program starts with a random graph that very likely has an independent set of size m, keeps the first independent set of size m that it finds, and adds edges as it searches so that the first one is the only independent set of size m.  There are many papers out there on hiding independent sets (or cliques) that are much larger than most in graphs, but this is the first I’ve heard of hiding an independent set that’s only one node larger than most of the maximal independent sets. I set out with the objective of getting 128-node graphs with a single maximum independent set of size 24.

The first version that worked correctly printed out every independent set it found, since I’d had some issues with correctly traversing the sets, so printing took the vast majority of the time.  Specifically:

nodes | initial edges | MIS size | build | printed | parameters passed | time (s)
32 | 64 | 12 | 32-bit debug | state, graph, and added edges | separately | ~1200
32 | 64 | 12 | 32-bit debug | graph and added edges | separately | 2.63
32 | 64 | 12 | 32-bit debug | added edges | separately | 0.119
32 | 64 | 12 | 32-bit debug | none | separately | 0.084
32 | 64 | 12 | 32-bit release | none | separately | 0.037

This isn’t even doing any “real” optimizations; the program was spending 99.997% of its time just printing out data.  Speedup from not printing anything and using a release build: 32345x.  In fact, the final run is so short that much of its time could be overhead, so the speedup on a larger instance would be even higher.  I consider this a design optimization, whereas the optimizations below are algorithmic optimizations.  I find that design optimizations usually have the largest impact and result in simpler code, though in this case the code wasn’t much simpler, since it wasn’t really a full-out design change.  Design changes are almost always of the type “I don’t need to do all this stuff… why am I doing it anyway?”  Moral of the story: printing text is ridiculously slow; avoid unnecessary logging.
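
The logging moral can be sketched in C (my own illustration, not the generator’s actual code): route every diagnostic print through one gate, so the normal path never touches stdio at all.

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical sketch: all diagnostic output goes through one gate,
 * so disabling it removes nearly all of the logging cost. */
static bool g_verbose = false;

static void logSet(int count) {
    if (g_verbose)                      /* skipped entirely in normal runs */
        printf("found independent set #%d\n", count);
}

/* Stand-in for the search loop: does its work, logs only if asked to */
static int countSetsDemo(int n) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        ++count;
        logSet(count);
    }
    return count;
}
```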

Now that it was so much faster, I needed a larger test case, so I started ramping up the graph size:

nodes | initial edges | MIS size | build | printed | parameters | time (s)
64 | 400 | 16 | 32-bit release | none | separately | 4.33
96 | 640 | 20 | 32-bit release | none | separately | ~2316
96 | 640 | 20 | 32-bit release | none | together | ~2598

This was the first “real” optimization I tried, and it didn’t work.  The search is highly recursive and has 10 parameters, most of which don’t change, so I thought that passing them together in a structure would be better, but it wasn’t.  The overhead of accessing them indirectly outweighed the savings from passing only 4 parameters.  Moral of the story: even for functions with a ton of parameters, function calling is fast; it’s the content of the function that matters… as long as it isn’t a tiny function.
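
To make the comparison concrete, here’s a sketch of the two calling styles; the fields and the “work” are placeholders of my own, not the actual MIS search:

```c
/* Illustrative only: hypothetical fields, not the real search state */
typedef struct {
    int numNodes, targetSize, depth;
    const int *adjacency;
} SearchContext;

/* Style 1: separate parameters -- what ended up faster here */
static int searchSeparate(int numNodes, int targetSize, int depth,
                          const int *adjacency) {
    (void)adjacency;
    return numNodes - targetSize + depth;   /* placeholder work */
}

/* Style 2: mostly-constant parameters bundled behind a pointer;
 * every access is now indirect, which is what outweighed the savings */
static int searchBundled(const SearchContext *ctx) {
    return ctx->numNodes - ctx->targetSize + ctx->depth;
}
```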

At this point, I went for a trick I’d done before for traversing independent sets, but I wasn’t sure how it’d work in this case, since it has to add edges too.  The trick: whenever the number of nodes that could still be added to the current set is 64 or fewer, I bit-pack the adjacency matrix for the subgraph induced by those nodes and operate on entire rows of the matrix at once, since each row fits into a qword (unsigned long long).  With a 64-bit build, this can be done quite efficiently.  It took a lot of debugging to get this to work properly, but it was worth it:
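
A minimal sketch of what I mean, under my own assumptions about the layout (the real generator’s bookkeeping is more involved): row i of the packed matrix has bit j set iff nodes i and j are adjacent, so excluding all neighbours of a chosen node, or counting the remaining candidates, takes only an instruction or two.

```c
#include <stdint.h>

/* eligible: bit j set iff node j can still be added to the current set */

/* After choosing node i, drop i and everything adjacent to it: one AND */
static uint64_t chooseNode(uint64_t eligible, const uint64_t *rows, int i) {
    return eligible & ~(rows[i] | ((uint64_t)1 << i));
}

/* How many candidates remain (useful for cut-off checks): one popcount */
static int countEligible(uint64_t eligible) {
    return __builtin_popcountll(eligible);   /* GCC/Clang builtin */
}
```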

nodes | initial edges | MIS size | build | printed | parameters | other | time (s)
96 | 640 | 20 | 64-bit debug | added edges | separately | 64-node trick |
96 | 640 | 20 | 64-bit release | none | separately | 64-node trick | 18.4
128 | 1200 | 24 | 64-bit release | none | separately | 64-node trick | ~666

That did the trick.  It was now possible to accomplish the objective of getting a 128-node graph with a unique MIS of size 24 in just over 11 minutes.  Speedup from bit-packing, to do many operations in parallel and reduce overhead: 126x.  Moral of the story: simple bit operations take a clock cycle or two; fiddling around with data structures usually takes much longer.  This still wasn’t quite where I wanted it, though: now that the objective seemed feasible, I wanted to ramp it up a bit more.  Luckily, I could easily improve the early cut-off check I had (i.e. when there aren’t enough nodes left to complete an independent set of size m, you can avoid looking there).

nodes | initial edges | MIS size | build | printed | parameters | other | time (s)
128 | 1200 | 24 | 64-bit release | none | separately | 64-node trick & early cut-off |
128 | 900 | 32 | 64-bit release | none | separately | 64-node trick & early cut-off |

That helped too. Speedup from early cut-off: 5x. Moral of the story: don’t compute things that you don’t need to, as long as it’s easy to check whether you need to compute them or not.
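
The cut-off itself is just a comparison; a hedged sketch (the exact bookkeeping in the real generator may differ):

```c
/* If the current partial set has `chosen` nodes and only `remaining`
 * eligible nodes are left, a set of size m is impossible from here,
 * so the whole branch can be skipped. */
static int canStillReachSize(int chosen, int remaining, int m) {
    return chosen + remaining >= m;
}
```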

Final speedup: 625x from just algorithmic optimizations, and about 20 million x total.  Generating 50 graphs of the last type will only take 11.3 CPU-hours now, but it’d have taken 293 CPU-days without any optimization other than removing printing and using a release build.  I haven’t done any detail-level optimization, what most people call “inner-loop optimization”, and I could do a bunch more: this is still single-threaded C, and I could do much better in assembly language and with 8 threads on the Intel Core i7 CPU at work.  I’ll probably generate 8 graphs at once by adding a simple OpenMP parallel for loop.  However, there’s not much use in generating such graphs with more than 128 nodes at the moment, so I’m back to working on a secret side project I hope you’ll like. 🙂
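
The OpenMP plan is about one pragma; a sketch with a hypothetical stand-in for the generator (each instance would need its own RNG state for this to be safe):

```c
/* generateGraph here is a placeholder that just records that instance
 * `seed` ran; the real one would build and save graph #seed. */
static void generateGraph(int seed, int *done) {
    done[seed] = 1;
}

void generateAll(int count, int *done) {
    /* Each iteration is independent, so this parallelizes trivially.
     * Compile with -fopenmp; without it the pragma is ignored and the
     * loop simply runs serially. */
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        generateGraph(i, done);
}
```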

Proof for a Post a While Back

•December 14, 2009 • 2 Comments

In a post a while back, I mentioned that I’d found that:

1-\frac{1}{\sqrt[k]{2}}\approx\frac{\ln 2}{k}

for sufficiently large k, which relates to finding the median of the minimum of a set of k independent, identically-distributed random variables.  I didn’t have a proof at the time, but I’ve thought of a very simple “proof”:

(1-\frac{1}{x})^x \approx \frac{1}{e} for large x.

1-\frac{1}{x} \approx \frac{1}{\sqrt[x]{e}}

1-\frac{1}{\sqrt[x]{e}} \approx \frac{1}{x}

1-\frac{1}{\sqrt[k]{e^{\ln 2}}} \approx \frac{\ln 2}{k}; x = \frac{k}{\ln 2}

1-\frac{1}{\sqrt[k]{2}} \approx \frac{\ln 2}{k}

It’s by no means rigorous, since quantities that are “approximately equal” can end up nowhere near each other after certain transformations, but it seems to check out in this case.

More Tutorial Videos!

•September 7, 2009 • Leave a Comment

Now that my honours project is over, I’ve been busy catching up on other things, including the Assembly Language Video Tutorial and a new Performance Optimization Tutorial.  The assembly tutorial now has two more videos, one showing how to do a 3×3 Gaussian blur in a very simple way, and one showing how to identify red and orange in the image.

Input Image: Map of the Carleton University Tunnels

Output of Episode 4: Map Photo Blurred to Reduce Noise

Output of Episode 5: Red and Orange Thresholds to Identify Tunnels and "You Are Here" Sign

In the next few videos, we’ll be finding which red blob is the “You are here” sign, finding where it’s pointing, finding the farthest emergency telephone symbol from it, and finding the shortest path there.  The premise is that you’re running from zombies and want to warn campus security from as safe a position as possible, but you’re out of minutes on your cell phone. 😀

The Performance Optimization Tutorial is something that I’ve been meaning to get underway for some time, and my co-workers encouraged me to start it after getting an unexpected 10x-20x speedup on the AQUA@Home simulation code.  It’s going to be a much larger tutorial, so it’ll take a while to get into stuff like vectorization, but it’s very important to get a solid grounding in the fundamentals, so after the first two videos, there’ll be 4-6 videos on how common operations on different data types work at the bit level.  There’s a ton of stuff that can be exploited about the nature of these operations, some of which can make the difference between something being vectorizable and not vectorizable.  The hope is that almost any dedicated programmer should be able to get big speedups where it counts. 🙂

Edit: There’s a second video of the optimization tutorial out now.  I’m hoping to make this a weekly thing, or at least releasing a video in one of the tutorials each week.

Roots of Two and the Median of the Minimum

•August 31, 2009 • Leave a Comment

Yet another weird math observation.  As k becomes large (for example, k=1000 gives about 0.035% relative error):

1-\frac{1}{\sqrt[k]{2}}\approx\frac{\ln 2}{k}

I haven’t actually looked into why this is, but it means that:

\sqrt[k]{2}\approx\frac{k}{k-\ln 2}

So why am I looking at this?  Well, I’m really looking at finding the median of the minimum of k independent, identically-distributed random variables.  With k such variables:

P(\min(X_1,...,X_k)<a) = 1-P(X_1\ge a \wedge X_2\ge a \wedge ... \wedge X_k\ge a)
= 1-P(X_1\ge a)^k
= 1-(1-P(X_1 < a))^k

The question is, what value a is the median of the minimum?  Let’s define p=P(X_1 < a), set the probability above to \frac{1}{2} (so (1-p)^k = \frac{1}{2}), and solve for p:

p = 1-\frac{1}{\sqrt[k]{2}}

If it’s the minimum of many independent variables, this is then \approx\frac{\ln 2}{k}.  So, as long as we can find the value a at percentile p of the original distribution, we’ve found the median of the distribution of the minimum.  Weird stuff.
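
As a sanity check of the whole argument (my own addition, not from the original post), a Monte Carlo estimate of the median of the minimum of k uniform(0,1) variables should land near 1-\frac{1}{\sqrt[k]{2}}:

```c
#include <stdlib.h>

static int cmpDouble(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sample median of min(X_1,...,X_k) over `trials` trials, X_i ~ U(0,1) */
double medianOfMin(int k, int trials, unsigned seed) {
    double *mins = malloc(sizeof(double) * (size_t)trials);
    srand(seed);
    for (int t = 0; t < trials; ++t) {
        double m = 1.0;
        for (int i = 0; i < k; ++i) {
            double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
            if (u < m) m = u;
        }
        mins[t] = m;
    }
    qsort(mins, (size_t)trials, sizeof(double), cmpDouble);
    double median = mins[trials / 2];
    free(mins);
    return median;
}
```

For k=10, the prediction is 1 - 2^{-1/10} ≈ 0.067, and the simulation lands close to it.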

Inventor IDE Alpha 5b Released, but…

•August 24, 2009 • Leave a Comment

So technically Inventor IDE Alpha 5b is now released; the problem is that nobody can access the site right now, even though the bill got paid in time thanks to my father.  Well, I’ll just hope that it gets resolved soon and say a bit about this release.

This release finally lets users create 64-bit applications and operating system images by selecting them in the new project options dialog:

Project Options Dialog Box

There are also many user interface improvements that probably nobody else will notice, but that will come in handy in the next videos of the tutorial, which I’ll be continuing again soon.  One that people may notice is something I’d wanted to avoid: there is now a progress bar for loading.  Personally, I wanted the loading performance to be improved to the point that a progress bar wasn’t needed, but that’ll have to wait for the Beta version.

Some encoding bugs are fixed as well, mostly in the support for 64-bit code, inline functions, and not-yet-released performance test functions.  As an example, it turns out that the general registers r12 and r13 need some undocumented encodings that I hadn’t supported until today.

The next things potentially on the agenda for Alpha 6 releases are:

  • Find/replace bar, a twist on the classic find bar
  • Find all references
  • Unit/performance testing system
  • Exports & relocations, to really support making libraries
  • Listing files (a request from Prof. Reichstein for COMP 2003, and something that’d help debugging operating system code)
  • Navigation side strip, for quick navigation within a file, especially for errors
  • Output formats: OBJ, LIB, ELF, and Mach-O, to support Windows, Linux, and Mac with the same source code
  • Input OBJs or LIBs

Things higher up are currently more likely to come sooner, but if people have any requests, please let me know.  Most likely, only a few will get implemented for Alpha 6, and the rest will be put off until Inventor IDE has been ported to C/C++.

Possible Downtime and More Random Math

•August 22, 2009 • 1 Comment

The sites may or may not be down on Sunday or Monday.  If they are, I blame Scotiabank for:

  • needing 6-8 business days to make a simple transaction that can be made in less than half a second because of two little things I like to call: the internet and the 21st century
  • sending two versions of a credit card within a week or two where activating the wrong one makes them both not work
  • needing over a week to make and send a new credit card instead of just creating one on the spot in about 30 seconds as they easily could (just need a punch for the numbers and a magnetic strip writer)
  • charging me an outrageous $54 annual fee for the privilege of carrying a piece of plastic around that didn’t work anyway
  • charging a record high 30% interest rate on the credit card even though I’ve never missed a payment
  • paying a record low 0.25% interest rate on my savings account, and even then only on amounts over $5,000; less than 1/3 of what RBC pays, and less than 1/4 of what HSBC pays
  • working only 9:30am-4:00pm to avoid dealing with customers who have jobs

Anyway, hopefully the sites don’t go down, and here’s some more random math-ish stuff.  Suppose you have a function in terms of its Taylor series about zero (its Maclaurin series):

f(x) = \displaystyle\sum_{i=0}^{\infty} a_i x^i

How can you find the square of the function?  In hindsight, it should have been obvious, but it’s:

(f(x))^2 = \displaystyle\sum_{i=0}^{\infty} \left(\displaystyle\sum_{j=0}^{i} a_j a_{i-j}\right) x^i

In fact, more generally, if we also have g(x) = \sum_{i=0}^{\infty} b_i x^i :

f(x)g(x) = \displaystyle\sum_{i=0}^{\infty} \left(\displaystyle\sum_{j=0}^{i} a_j b_{i-j}\right) x^i
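
A sketch of the product formula in C (my own illustration, truncating both series to n terms):

```c
/* out[i] = sum_{j=0..i} a[j]*b[i-j], the coefficient of x^i in f*g */
void seriesProduct(const double *a, const double *b, double *out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = 0.0;
        for (int j = 0; j <= i; ++j)
            out[i] += a[j] * b[i - j];
    }
}
```

For example, squaring 1 + x gives 1 + 2x + x^2.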

You can also calculate \displaystyle\frac{f(x)}{g(x)} fairly simply:

\displaystyle\frac{f(x)}{g(x)} = \frac{\displaystyle\sum_{i=0}^{\infty} a_i x^i}{\displaystyle\sum_{i=0}^{\infty} b_i x^i} = \displaystyle\sum_{i=0}^{\infty} c_i x^i
\displaystyle\sum_{i=0}^{\infty} a_i x^i = \left(\displaystyle\sum_{i=0}^{\infty} b_i x^i\right)\left(\displaystyle\sum_{i=0}^{\infty} c_i x^i\right)

a_0 = b_0 c_0
a_1 = b_1 c_0 + b_0 c_1
a_2 = b_2 c_0 + b_1 c_1 + b_0 c_2
a_3 = b_3 c_0 + b_2 c_1 + b_1 c_2 + b_0 c_3


\begin{bmatrix} b_0 & 0 & 0 & 0 & \cdots \\ b_1 & b_0 & 0 & 0 & \cdots \\ b_2 & b_1 & b_0 & 0 & \cdots \\ b_3 & b_2 & b_1 & b_0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ \vdots \end{bmatrix} = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \end{bmatrix}

Simply use forward substitution to solve for the values of c_i.
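
In C, the forward substitution looks like this (my own sketch; it assumes b_0 is nonzero):

```c
/* Solve the lower-triangular system above for the quotient
 * coefficients: c[i] = (a[i] - sum_{j<i} b[i-j]*c[j]) / b[0] */
void seriesDivide(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; ++i) {
        double s = a[i];
        for (int j = 0; j < i; ++j)
            s -= b[i - j] * c[j];
        c[i] = s / b[0];
    }
}
```

Dividing 1 by 1 - x recovers the geometric series 1 + x + x^2 + x^3 + …, which makes a handy test case.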

It’s much harder to work out what non-integer powers of the Taylor series are, though.  The pattern for \sqrt{f(x)} seems too complex to figure out from 5 terms.  What’d be really cool is to get the series for f(g(x)), since then most anything else would be pretty easy to figure out.

The Harmonic Series and Beyond

•August 19, 2009 • Leave a Comment

I was going to write a post about AQUA@Home, the cool quantum simulation system you can help with that I’ve been working on at D-Wave, but I’ll leave that for a couple of days and talk about one of the miscellaneous things that I’ve thought about at random.

I’ve known for quite some time that \sum_{i=1}^n \frac{1}{i} (the harmonic series) diverges as n approaches infinity, and more recently, that it scales as \Theta(\log n).  However, \sum_{i=1}^n \frac{1}{i^2} converges, along with the sum of any other positive function in O(\frac{1}{i^2}).  For a while I had assumed that any function in o(\frac{1}{i}) would converge, but I learned in my second calculus course at Carleton that \sum_{i=2}^n \frac{1}{i \log i} also diverges.

Several weeks ago, I was curious where the true bound between converging and diverging is, and I seem to have found it.

I started by finding a simple proof of the scaling of the harmonic series, and it goes something like the following.  Assume that n is a power of two, and group the terms into blocks i \in (2^{j-1}, 2^j]: each block has 2^{j-1} terms, each at least \frac{1}{2^j}, so each block contributes at least \frac{1}{2}:

\sum_{i=1}^n \frac{1}{i} \ge 1 + \sum_{j=1}^{\log n} \frac{1}{2} = 1 + \frac{1}{2}\log n

Likewise, each of the 2^j terms with i \in [2^j, 2^{j+1}) is at most \frac{1}{2^j}, so each such block contributes at most 1:

\sum_{i=1}^n \frac{1}{i} \le \frac{1}{n} + \sum_{j=0}^{\log n - 1} 1 = \log n + \frac{1}{n}

Therefore, \sum_{i=1}^n \frac{1}{i} \in \Theta(\log n).  Using a similar approach, one can show that \sum_{i=2}^n \frac{1}{i \log i} \in \Theta(\log \log n).  Doing this again can show that \sum_{i=4}^n \frac{1}{i \log i \log \log i} \in \Theta(\log \log \log n), and a pattern emerges allowing each one to be expressed in terms of the previous.
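
These scalings are easy to check numerically (my own addition): the harmonic partial sums track \ln n plus a constant (the Euler-Mascheroni constant, about 0.5772).

```c
/* Partial sum of the harmonic series, H(n) = sum_{i=1..n} 1/i */
double harmonic(long n) {
    double s = 0.0;
    for (long i = 1; i <= n; ++i)
        s += 1.0 / (double)i;
    return s;
}
```

For instance, H(10^6) is about 14.39, while ln(10^6) is about 13.82, a gap of roughly 0.577.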

The final conclusion:

\sum_{i=2^k}^n \frac{1}{i \log i \log \log i \log \log \log i ... \log...\log i} \in \Theta(\log...\log n)

where the number of logs in the last factor and in the scaling is k.  Note that the 2^k is just to prevent any of the logarithms from going to zero, so assume n >> 2^k.  If any one of the logarithmic factors in the sum has a power greater than one, the series converges, but with them all equal to or less than one, the series still diverges.  In effect, the scaling can get arbitrarily close to \Theta(1), so this appears to be the very brink of divergence.  Some may argue that having a finite k makes it insufficient, since there are theoretically slower-growing functions, but that’s really splitting hairs.

Biking Directions

•August 1, 2009 • 1 Comment

As much as my landlady is quite kind, she’s unintentionally given me humorously bad biking directions two out of two times now.  As an example, last night she suggested I take a shorter route to work today.  The route I normally take to work is:

  1. Bike west along Kingsway to Willingdon Avenue
  2. Bike north down the huge hill on Willingdon Avenue to Still Creek Drive

It’s about 6.8km, and it takes around 30 minutes (at least in this heat wave).  Her suggested route:

  1. Bike west along Imperial Street to Gilley Avenue
  2. Bike north down Gilley Avenue to Deer Lake Parkway
  3. Bike west along Deer Lake Parkway to Willingdon Avenue
  4. Bike north down Willingdon Avenue to Still Creek Drive

Unfortunately, the valid set of directions closest to that is:

  1. Bike west along Imperial Street to Gilley Avenue
  2. Bike north down Gilley Avenue to Deer Lake
  3. Swim across Deer Lake to Deer Lake Parkway (actually illegal, but that’s another matter)
  4. Bike west along Deer Lake Parkway to Willingdon Avenue
  5. Bike north down Willingdon Avenue to Still Creek Drive

What I actually had to do was:

  1. Bike west along Imperial Street to Gilley Avenue
  2. Bike north down the big hill on Gilley Avenue to Oakland Street, where Gilley Avenue ends.
  3. Bike back up the big hill on Oakland until Royal Oak Avenue
  4. Bike down the huge hill on Royal Oak Avenue to Deer Lake Parkway
  5. Bike west along Deer Lake Parkway to Willingdon Avenue
  6. Bike north down Willingdon Avenue to Still Creek Drive

Even if I cut out going back up the hill by going to Royal Oak first, it’s the same distance as my original route, and contains more uphill parts on Deer Lake Parkway because the hill on Royal Oak is even larger than the hill on Willingdon.  (I wish I could find topographic maps for the area to give an idea of the scale of these hills, but all the programs online I’ve tried failed or didn’t have data in sufficient detail.)  I’ll probably just stick to Kingsway and Willingdon, even though they’re very busy roads.

Inventor Performance Viewer

•July 9, 2009 • 3 Comments

I recently put off the release of Inventor IDE Alpha 5b (probably coming in August, since my honours project is highest priority now) to get something ready for DemoCamp Vancouver that I’ve been wanting for a very long time.  It’s a feature that probably won’t be released until Alpha 6, but you can be certain that I’ll try to make good use of it at D-Wave as soon as I can.  I consider it the first truly awesome feature of Inventor IDE.  It’s easiest to introduce with a picture:

Rock-solid sorting performance data literally in the blink of an eye

(In case you’re wondering, the times are in clock cycles on a 1.66GHz laptop, so 100,000 clocks = 60 microseconds.)

A big problem in doing performance analysis is that getting reliable results and useful scaling info usually means tests that take minutes, so that the impact of the OS or other applications levels off to a constant factor.  If you try to do short tests, you get issues like these:

Bad Performance Data for 2 Mersenne Twister Implementations

Bad Performance Data for 3 Sorting Implementations

These aren’t exactly reliable results; running a short test multiple times produces wildly different results because of just a few bad data points.  On Tuesday around 4am (Pacific Daylight Time), I finally managed to find a decent approach that eliminates almost all bad data without eliminating much good data.  It’s certainly not perfect, but here are the results of running the two Mersenne Twister tests three times each while AQUA@Home is trying to max out the CPU and the hard drive is thrashing like mad (for unknown reasons):

Much Better Mersenne Twister Performance Data

It’s worth pointing out that these times are in fact noticeably worse (about 1.5 times as long) than the results when the CPU and hard drive aren’t under heavy load elsewhere.  The important thing is that the results are still remarkably consistent, and Mersenne128 is still 3.8 times faster than Mersenne32.  You can even still roughly see that Mersenne32 regenerates its data array every 624 elements and Mersenne128 regenerates every 2496 elements.  In case you’re wondering why CPU times would be consistently longer instead of scattered, it has to do with process switches flushing the cache and TLBs: only the resulting cache misses show up in the good data, whereas the bad data still contain the process switches themselves and other interrupts (like the 1kHz tick and page faults).

I’m still shocked that I can effectively triple-click the test run button on the performance test functions and have the results show up as fast as Windows can show the windows.  It’s even more responsive than the regular run button in the ImageProgram sample app, and that had shocked me.  That said, I found the ironic-but-not-so-important problem that it names the output files with the time only down to the second, so one that ran in the same second as another wrote over the first’s data file.

This kind of capability finally allows developers to rapidly iterate through possible performance improvements, making optimization not only faster but also much easier, because many more optimizations can be tried in the same amount of time.  For example, the faster mergesort in the top graph is from adding a special case for n=2.  In that way, it can also be used as a learning tool for developers to determine what does and doesn’t improve performance.  Best of all, because of the performance viewer’s language-independence, its platform-independence, and the many useful analyses it could do on the data, it’s applicable to many different situations; probably many I won’t think of myself, so I leave that up to you.  I’ve just shown you the first glimpse into what I’ve got planned for it.  🙂

Bugs that never see the light of day

•June 21, 2009 • Leave a Comment

Today, I fixed a ton of bugs in Inventor IDE and added some simple-but-useful functionality I’d been putting off for a long time.  It occurred to me earlier today that for almost all of these bugs, I’m probably the only one who’s ever encountered them, and some of them even I never encountered; they were fixed without ever being experienced.  I thought I’d share some of these bugs with you to give an idea of what I’ve been up to and what sort of bugs I encounter in Inventor IDE.

Bug #1: Huge JComboBox

This is one of those bugs that I can’t actually fix, but I managed to find a workaround after a whole lot of banging my head against the wall.  There’s at least one bug in Java’s implementation of JComboBox with the Windows look-and-feel.  It may have been fixed in Java 1.6 or later, but I have to build with Java 1.5, or nothing will run when somebody tries to run it with Java 1.5, even though I’m not using any Java 1.6 features.

The bug: suddenly and irregularly, the size of a JComboBox will become larger than 16384 x 16384.  When Java tries to create a buffer into which to paint the JComboBox, it runs out of memory, because 16384 x 16384 x 4 bytes per pixel = 1GB.

It took a few hours to figure this out, because the stack trace for the OutOfMemoryError doesn’t contain any of my code (since I hadn’t overridden JComboBox.paintComponent()) and the bug rarely occurs.  The way I eventually figured it out was to put a breakpoint at one of the low levels in the stack trace, run the code until I got a debugger trace that looked similar, and look back up the chain at the object in the last pair of paint()/paintChildren() calls, which turned out to be an ObservedStringComboBox (one of my classes).  It took a few minutes to realize that its size was so huge, and I discovered that getPreferredSize() and getMinimumSize() return similarly huge sizes too, so I traced into them.  After an hour of crawling through the mess that is Java’s Swing, I found that some object was probably giving a dummy value to indicate that it didn’t really know how large something was supposed to be.

The solution: I’ve overridden JComponent.paintComponent() in both classes I have that extend JComboBox, and they check whether the width or height is over 16384, in which case, they do a rough estimation of the size based on the string widths of their options.  Then, they call setSize(), setMinimumSize(), setPreferredSize(), and setMaximumSize(), just to make sure, before calling the original paintComponent().  It successfully picked up the error when it occurred, and set the JComboBox to a reasonable size that didn’t look out of place.

Bug #2: Where’s the caret?

This bug has been around for quite some time, and although I was stumped on a good way to solve it (partly due to the next bug), its solution was a one-character change.

The bug:

The caret is currently at the beginning of the first “SegmentDescriptor”, but it’s impossible to see.

The border of the LineFields blocks out the one pixel-width of the caret that would be visible if it had no border.

The solution:

The exact same thing, but everything in a LineField is drawn 2 pixels to the right

This was literally solved by changing the insets from (0,0,0,0) to (0,2,0,0).  Now the full caret is visible.  (It didn’t look as good with just one pixel-width showing, so I have both pixels showing.)

Bug #3: The layout war

I hate layout issues; Java seems to have a neverending supply of them, which is why I removed LayoutManagers from the vast majority of the components that need good layout.  There were a few major layout bugs remaining though, and the most ridiculous-looking had to do with the layout of my ObservedStringTables (shown above working correctly, but I’ll use a new example).

The bug: the ObservedStringTable should control the layout of the LineFields it contains, since they should be kept looking tabular, but many LineFields elsewhere are supposed to resize automatically to fit the current size of their text, so that behaviour was built into LineField.

An enumeration isn’t supposed to look this bad (btw, here, it auto-increments by “sizeof SegmentDescriptor” if no value specified)

The solution: it’s not a great solution, but by the time its lack of greatness would matter, it’ll have changed or will need changing for other reasons anyway.  I just have LineField check whether it’s in an ObservedStringTable, and if it is, it doesn’t change its own size.


Finally, it looks like a table

In case you’re wondering, yes, the caret is now not visible if it’s at the end of the Increment field.  I’ll fix it now, ’cause I’ll forget something small like that… there we go; insets are now (0,2,0,2).  🙂

Bug #4: When the impossible is possible

Moving away from UI-related bugs, this bug was technically fixed today, though only because today started at midnight, not when I got up (11:30 AM).  A few days ago, I added support for compiling inline functions that have only text-replacement parameters (i.e. macros minus support for elaborate macro stuff).  That way, I can use it for having debug output functions in Own OS that don’t get called at all if “TEXT_TESTING_MODE” is false.  However, fairly often when compiling, though unpredictably, there’d be an ArrayIndexOutOfBoundsException when adding a reference to a line or when adding an error to a line.

It’s fairly common to add errors to lines during compilation, since the error detection doesn’t check for everything before compile-time, but the only code that would be adding references to lines would be the code for inline function compiling.  When I encounter a call to an inline function, I copy the function, replace the parameter references in the code with their given values, and remove the parameters.  Then I compile it as a regular function, just in the middle of another function.  The weird thing about this error is that upon adding a check to make sure that the array was large enough (which was even redundant, because the line before had been a call to ensureRefCapacity(), which expands the reference array if needed), it still had the same array index exception.  The only way I could think of that this could happen is if multiple threads were messing around with the same line at the same time, but the odds of that were pretty low, considering I’d explicitly designed the compilation not to modify anything concurrently.

The bug: when a parameter is removed from a function that is in global scope, things observing the global scope get notified about it.  A temporary copy of an inline function shouldn’t really be notifying the global scope about its parameters being removed, and updating code based on a notification is probably not something that should occur while compiling.  Sure enough, functions didn’t like getting told that a temporary function they can’t actually access has lost a parameter.

The solution: this is one bug where I haven’t actually determined exactly how/why it was occurring, but the stack traces and stepping through debugging clearly showed that the error was only occurring when a temporary function was broadcasting to everybody that it had lost a parameter, which it shouldn’t be doing.  I just stopped temporary functions from notifying the global scope of anything.  It’s usually not advisable to fix something without fully understanding it, but often, such a restriction would mean that few bugs would get fixed.  For now, I’m okay with this lack of understanding, because I’ll need to re-investigate and adjust this type of thing when I add support for function overloading when I add support for C, and until then, it’s unlikely that any oversight here will rear its head again.

Yesterday’s Bug: The complete round-off problem

So this wasn’t from today, but it’s a cool quirk of floating-point calculation with which programmers should familiarize themselves.  Suppose that you have an array of floats (4-byte real numbers) that’s 2 billion elements long, and every single element contains the value 1.  If you simply go through the array and sum up the values of the elements, what is the final sum?

You might guess that it’s 2 billion plus or minus some relatively small amount of error, but in fact, the sum is exactly 16,777,216.  This is the case whether you have 20 million elements, 2 billion elements, or 1 trillion elements.  The answer is still 16,777,216.  Why?

floats have a 24-bit significand (23 stored mantissa bits plus an implicit leading 1), and once the sum reaches 16,777,216 (i.e. 1.0000…0×2^24), adding the value 1 would make it 16,777,217, but this can’t be represented exactly in floating point, so it gets rounded down (following the round-to-even rule), back to 16,777,216.  This happens for every single 1 that is added after that too.  In general, when adding many values that are all positive (or all negative) of similar magnitude, the final sum will stall at around 16,777,216 times the average value.
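
A tiny experiment (my own addition) that finds where a float accumulator of 1s stalls:

```c
/* Add 1.0f n times naively; past the stalling point the sum stops moving */
float naiveOnesSum(long n) {
    float s = 0.0f;
    for (long i = 0; i < n; ++i)
        s += 1.0f;
    return s;
}
```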

This is exactly what I found in a piece of code I had changed to use floats instead of doubles at work (at D-Wave).  It was trying to add 14,400,000 elements with roughly the value 1, and it gave 8,388,608 every time.  Instead of reading the sum, I was actually reading the average, 0.582542 instead of about 1.0, so it wasn’t immediately apparent why it was wrong.  I realized that this was the error when I ran it for 144,000,000 elements and got 0.058254, which is 10 times less for 10 times as many elements, meaning that the sum hadn’t changed.  That brings up the question, how do you add up the elements in an array to avoid this?

The solution: Well, the way I fixed it at work was to just change it back to using a double (8-byte real number) to accumulate the values.  That doesn’t encounter the complete round-off problem until about 2^53 additions (doubles have a 53-bit significand); it can still result in bad round-off error before then, but it shouldn’t be an issue for the piece of code we were doing this in.  Is there a way that works for any number of additions?

One way that works in O(log n) extra space (or in place if you can destroy the data), while still being O(n) time, is to break the sum up by adding every pair, then every pair of those, and so on.  To do it in O(log n) space, you keep a value for every level, all starting at zero, and when you finish a level, you add its value into the next level up.  If that completes the next level up, the value keeps propagating upward; otherwise you go back to the bottom level.  This is equivalent to splitting the array up recursively (using O(log n) stack space), but without the disadvantages of recursion.  It also has the advantage that it can handle data coming in over time.  Note that there’s still the issue of round-off when adding positive and negative numbers that sum to near zero, but there’s not much that can be done about that.

Edit: Here’s code for this approach (untested):

#include <stddef.h>
#include <stdbool.h>

// Ceiling of log_2 of size (the helper function assumed below)
static size_t ceilLog2(size_t n) {
    size_t log = 0;
    while (((size_t)1 << log) < n)
        ++log;
    return log;
}

float sum(const float *const array, const size_t size) {
    if (size <= 2)
        return size == 0 ? 0.0f
             : size == 1 ? array[0]
             : array[0] + array[1];
    // The -1 is because the lowest level (pairs) gets added immediately
    // instead of waiting
    const size_t logSize = ceilLog2(size)-1;
    float accumulators[logSize];
    bool isSecondHalf[logSize];
    for (size_t level = 0; level < logSize; ++level) {
        accumulators[level] = 0.0f;
        isSecondHalf[level] = false;
    }
    float sum;
    for (size_t i = 0; i+1 < size; i += 2) {
        // Add together a pair of elements
        sum = array[i];
        sum += array[i+1];
        size_t level = 0;
        // Go up the tree until reaching the top
        // or a place where it's the first half
        while (level < logSize && isSecondHalf[level]) {
            sum += accumulators[level];
            accumulators[level] = 0.0f;
            isSecondHalf[level] = false;
            ++level;
        }
        // Handle case of size being an exact power of two:
        // the final pair merges everything on the way up
        if (level == logSize)
            return sum;
        // Save the sum at the highest level reached;
        // now on the second half of that level
        accumulators[level] = sum;
        isSecondHalf[level] = true;
    }
    // Make sure to get the last element if there were an odd number of elements
    sum = (size&1) ? array[size-1] : 0.0f;
    // Add up all values left in accumulators (order matters)
    for (size_t level = 0; level < logSize; ++level)
        sum += accumulators[level];
    return sum;
}