Tuesday, May 27, 2008

Debian OpenSSL Bug: Some Thoughts

Last week's big security news was the discovery that Debian has been shipping a busted version of OpenSSL for over a year. There's been tons written about it (e.g. coverage on Slashdot and Schneier's site). Everyone agrees it's a major screwup, and we'll probably never know how much data (if any) was compromised because of it. What can we learn from it, from a software engineering standpoint?

1) Modifying code just to appease a source analysis tool is a terrible idea -- it has to be one of the worst reasons there is to change working code. Tools like this highlight potential problems, but they also generate false positives, and a false positive should be recognized and suppressed, not "fixed" in the code.
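To make the failure mode concrete, here is a sketch in C of the pattern as I understand it -- a paraphrase, not the actual OpenSSL source. The tool (correctly) warns that a buffer may be read uninitialized; "fixing" the warning by removing the read also removes the entropy:

#include <stddef.h>

static unsigned char pool[16];   /* the PRNG's entropy pool */

void mix_into_pool(const unsigned char* buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        pool[i % sizeof pool] ^= buf[i];
}

void add_entropy(const unsigned char* buf, size_t n, int pid)
{
    /* The tool flags this call because buf may be uninitialized.
       Commenting it out silences the warning -- and deletes the
       only real entropy source: */
    /* mix_into_pool(buf, n); */
    (void)buf; (void)n;   /* now unused: itself a red flag */

    /* What's left varies only with the 15-bit process id. */
    pool[0] ^= (unsigned char)(pid & 0xff);
    pool[1] ^= (unsigned char)((pid >> 8) & 0x7f);
}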

2) Private source branches are a bad idea. Debian maintains private versions of several packages it ships. I don't know much about the benefits of this practice, but the downside is obvious: changes don't make it back to the upstream maintainers, who are the people most qualified to vet them. If these patches had been submitted to the OpenSSL project, they would have been rejected and the problem averted.
I've never maintained my own Linux distribution, but I have had to make custom software builds for customers. It's sometimes tempting to keep a change local and not push it upstream -- not a good idea.

3) Some problem domains are inherently difficult to write automated tests for. In this case, the broken PRNG would still have passed every statistical test -- it's just that it could generate only 32767 different number streams. That's an awfully hard failure to write a test for, and such a test would be useless for catching a bug that was even slightly different.
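About the only automated test that would have caught it is one that walks the entire seed space and counts distinct outputs -- exactly the kind of test you'd only think to write after finding the bug. A hypothetical sketch in C, with srand/rand standing in for the real generator:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* One stream per possible pid. Each stream individually can look
       statistically fine; the defect only shows up when you count how
       few distinct streams there are. */
    for (int pid = 1; pid <= 32767; ++pid)
    {
        srand((unsigned)pid);                 /* the only "entropy" */
        printf("%5d -> %d\n", pid, rand());   /* fingerprint of stream #pid */
    }
    return 0;   /* at most 32767 distinct fingerprints, ever */
}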

4) "Never ascribe to malice, that which can be explained by incompetence." [I don't want to call anyone incompetent, but that's how the saying goes.] In all I've read about this fiasco, I don't remember anyone suggesting that this was anything more than an accident - even though it's a result Mallory could only dream of! It must be because of the relative transparency of the Debian process, but can you imagine the response if a similar problem turned up in Windows? The reaction to the discovery of the NSAKEY debugging symbol was pretty heated, and as far as I recall nothing was ever proved - talk of "back doors" never got beyond speculation.

Monday, April 21, 2008

This Week in Security

I came across three interesting articles this past week:

1) The DailyWTF breaks the story of the State of Oklahoma's Department of Corrections sex offender list being an open proxy to their Oracle database. SELECT * FROM FAIL. As Les says in a comment on Bruce Schneier's site, this is a screwup that can't be blamed on a single person. Sure, one guy wrote bad code and then applied a half-assed fix, but there's no way the original passed through a QA process or was reviewed by anyone (if your QA team isn't looking out for SQL injection/XSS opportunities, they should be).

What I think is more revealing is the organization's response to the initial report. Inferring from the story, word of the data leak had to come down the chain of command from the top, and I would hope everyone along the way was taking it seriously. All of those people -- and their bosses -- aware of the problem, and the patch they push out is still so lame? Did they ever discuss the fundamental underlying problem? Or do you suppose somebody just checked that the original URL no longer worked?
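For what it's worth, the fundamental fix is no mystery: keep user input out of the SQL text entirely, which any database API supports. A minimal sketch in C, using SQLite as a stand-in (the table and column names here are invented; the real site reportedly ran Oracle):

#include <stdio.h>
#include <sqlite3.h>

/* BAD: pasting user input into the query string --
     sprintf(sql, "SELECT name FROM offenders WHERE zip = '%s'", zip); */

/* GOOD: bind the input as a parameter, so it can never become SQL: */
int print_offenders_by_zip(sqlite3* db, const char* zip)
{
    sqlite3_stmt* stmt = NULL;
    if (sqlite3_prepare_v2(db, "SELECT name FROM offenders WHERE zip = ?",
                           -1, &stmt, NULL) != SQLITE_OK)
        return -1;
    sqlite3_bind_text(stmt, 1, zip, -1, SQLITE_TRANSIENT);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%s\n", (const char*)sqlite3_column_text(stmt, 0));
    return sqlite3_finalize(stmt) == SQLITE_OK ? 0 : -1;
}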

2) Matasano Chargen has an accessible account of an epic hack against a bug in Adobe's Flash plugin. If I follow it right, Mark Dowd figured out how to use an unchecked malloc failure to convince the byte-code verifier to ignore machine code embedded in a Flash file -- machine code that gets run nonetheless. In his writeup, Thomas Ptacek weaves in Terminator references; the actual exploit (subverting the verifier into ignoring semantically significant content) puts me more in mind of A Fire Upon the Deep.
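Nothing like the sophistication of the real exploit, but here's a toy sketch in C of the underlying bug class, with invented names: an allocation failure goes unchecked, the checker silently gives up, and the caller runs the data anyway.

#include <stdlib.h>

void interpret(const unsigned char* code, size_t len);   /* defined elsewhere */

static int verify(const unsigned char* code, size_t len)
{
    unsigned char* scratch = malloc(len);   /* can return NULL under memory pressure... */
    if (scratch == NULL)
        return 0;   /* ...but the failure path reports success instead of an error */
    /* ... real validation of code would happen here ... */
    (void)code;
    free(scratch);
    return 0;
}

void run(const unsigned char* code, size_t len)
{
    if (verify(code, len) == 0)
        interpret(code, len);   /* on malloc failure, runs bytes nobody checked */
}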

3) The Six Dumbest Ideas in Computer Security: The ideas here are simple, but well worth reading. A recurring theme is how easy it is to approach a problem backwards, and how that sets you up to be on the losing side (no matter how hard you work). Fun game: which of the six ideas would have prevented the above failures?

Friday, April 18, 2008

When Interface Dictates Implementation

This whole topic is probably kind of obvious, but I recently ran into a great example of how a seemingly minor difference in an interface can have a large impact on possible implementations. I was evaluating two libraries for inclusion in a project at work. Both provided a C API that we would use, and the two interfaces were only slightly different. This difference would prove to be significant.

There are only a handful of ways to create and manipulate "objects" with a C library. The first library, "Toolkit S", had an interface like this:

int CreateObject(...);          /* returns a handle or a reserved value to indicate failure */
int Foo(int handle, ...);
int Bar(int handle, ...);
int DestroyObject(int handle);
That is, all required allocations are done internally, and clients interact via a numeric id. This is similar to how _open works. The second library, "Toolkit L", had an interface like:

typedef struct tag_Object { ... } Object;

int CreateObject(Object* obj);  /* returns an error code, initializes the struct */
int Foo(Object* obj, ...);
int Bar(Object* obj, ...);
int DestroyObject(Object* obj);
In this case, the memory for the object is allocated by the caller, and a pointer to that struct is passed in to every call (like the Win32 CriticalSection functions). In this particular case the Object struct contains pointers to memory that could be allocated by the library, but this isn't generally the case.

There are pros and cons to each design; I'm not going to declare one superior to the other in all cases. However, what I realized (this is kind of obvious if you've thought about it) is that the first library must define a similar struct internally, along with an internal data structure that maps int handles to those structs. That data structure needs some kind of thread-safe synchronization around it, which imposes an unavoidable performance penalty in a multi-threaded environment -- a penalty the other library avoids completely. No matter how access is controlled, there are use cases where the design requires a performance hit (above and beyond the extra level of indirection). The benchmarks showed what I knew they would: Toolkit L made much better use of my test machine's second core.
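Here's a minimal sketch of what Toolkit S more or less has to be doing internally -- my guess at the shape of it, not its actual source -- with a POSIX mutex guarding the handle table:

#include <pthread.h>
#include <stdlib.h>

typedef struct { int state; /* ... */ } Object;

#define MAX_OBJECTS 1024
static Object* g_table[MAX_OBJECTS];                        /* handle -> struct */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;  /* guards g_table */

int CreateObject(void)
{
    Object* obj = calloc(1, sizeof *obj);
    if (obj == NULL)
        return -1;
    pthread_mutex_lock(&g_lock);               /* every caller serializes here */
    for (int h = 0; h < MAX_OBJECTS; ++h)
    {
        if (g_table[h] == NULL)
        {
            g_table[h] = obj;
            pthread_mutex_unlock(&g_lock);
            return h;                           /* the handle the client sees */
        }
    }
    pthread_mutex_unlock(&g_lock);
    free(obj);
    return -1;                                  /* table full */
}

Object* Resolve(int handle)                     /* every Foo/Bar starts with this */
{
    Object* obj = NULL;
    pthread_mutex_lock(&g_lock);
    if (handle >= 0 && handle < MAX_OBJECTS)
        obj = g_table[handle];
    pthread_mutex_unlock(&g_lock);
    return obj;
}

Toolkit L's callers hand it the struct directly, so there is nothing shared to lock.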

They've been talking about the "multicore revolution" for years, and we're finally getting to the point where the odds are your code is running on a multi-core machine. You can't always take advantage of it -- especially as a library author -- but you have to do what you can to stay out of the way of those who can.

Monday, April 7, 2008

Final Four Probabilities

For the first time in history, the teams in the Final Four this year are all number 1 seeds. I've been hearing about this for a while, and began to wonder... Should my reaction be "it's about time" or "wow, that's remarkable"? So I went after the numbers to try to figure it out.

The first thing you realize is that the NCAA likes to change formats pretty regularly. The current format (dating back to 2001) starts with 65 teams; the previous 16 seasons featured a 64-team tournament with the same overall structure. These (1985 - 2007) are the tournaments whose results I used. Here's what I found:

Seed   Region Wins   Percent
  1        38         0.41
  2        21         0.23
  3        12         0.13
  4         9         0.10
  5         4         0.04
  6         3         0.03
  8         3         0.03
 11         2         0.02

So, if we take history as a guide, we would put the probability of all four #1 seeds winning their regional brackets at just under 3% [(38/92)^4], so it should happen once every 34 years, on average. This is the 24th season of the modern bracket, so it happened a little sooner than we would have expected.

An interesting question is, "what is the most likely distribution of seeds in a Final Four?" Obviously, a 1 seed is the favorite to win its regional group, but other outcomes are more likely than all of them winning. Three 1 seeds and a 2 seed would be twice as likely (about 6.4%, once every 15.5 years); it happened in 1993. Close behind that, we would expect two 1 seeds, a 2 seed and a 3 seed about every 16.4 years; this was the case in 1991 and 2001.
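For the record, here's the counting behind those figures, using the unrounded win rates from the table (p_1 = 38/92 for a #1 seed, p_2 = 21/92, p_3 = 12/92); the coefficients count which regions the lower seeds come from:

P(\text{three 1s, one 2}) = \binom{4}{1} p_1^3\, p_2 = 4\,(38/92)^3\,(21/92) \approx 0.064

P(\text{two 1s, a 2, a 3}) = \frac{4!}{2!\,1!\,1!}\, p_1^2\, p_2\, p_3 = 12\,(38/92)^2\,(21/92)\,(12/92) \approx 0.061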

Two years ago, no number one seeds made the Final Four (this was not quite a first - it happened in 1980 during a 48-team tournament). This is actually a relatively likely outcome [(54/92)^4 is about 0.12] - you would expect it once every 8 or so years.

To close out, let's look at outcomes that we would expect more frequently, and see how they compare with history:

Final Four Seeds   Probability   Expected Time (Years)   Occurrences   Observed Rate
3 #1's             0.1654        6.04                     3            0.13
2 #1's             0.3527        2.84                    10            0.43
1 #1               0.3341        2.99                     9            0.39

Note that the above outcomes are mutually exclusive (but not exhaustive - I mentioned the other cases above). In particular, I'm surprised by the frequency with which only one 1 seed makes it to the Final Four. I wonder what kind of odds I can get on that next year...

Thursday, March 27, 2008

Divisibility Testing and Pattern Matching, Part 2

Last time, I left myself the challenge of coming up with a regular expression to test for divisibility by 3, having left you with 3 relatively unsatisfying regular expressions for determining divisibility (by 2, 5 and 10), plus a partly satisfying one for 4. First, why were those tests easy to code as regular expressions? If we are testing for divisibility by some number d = 2^a * 5^b, then we only have to consider the last max(a, b) digits. And if our test depends only on the last few digits, then at worst we can enumerate the possibilities: the regular expression (00|04|08|12|...|92|96)$ will also correctly test for divisibility by 4 (again requiring at least two digits).

For determining divisibility by 3, such an enumeration isn't available to us (we have to consider every digit since there is no power of 10 evenly divisible by 3). How do we know that it's possible to write a regular expression that matches on multiples of 3? Because we can write a finite state automaton that tests for divisibility by 3.


#include <stdbool.h>

typedef enum { Accept, Rem1, Rem2 } State;

bool IsDivisibleBy3(const char* num)
{
    State state = Accept;
    for (; *num; ++num)
    {
        switch (*num)
        {
        case '0': case '3': case '6': case '9':
            break;
        case '1': case '4': case '7':
            if (state == Accept)
                state = Rem1;
            else if (state == Rem1)
                state = Rem2;
            else
                state = Accept;
            break;
        case '2': case '5': case '8':
            if (state == Accept)
                state = Rem2;
            else if (state == Rem1)
                state = Accept;
            else
                state = Rem1;
            break;
        }
    }
    return state == Accept;
}

I wrote it in this style to make explicit that it represents a finite state automaton. It's essentially an obfuscated version of what you might actually write to test for divisibility by 3:


bool IsDivisibleBy3(const char* num)
{
    int remainder = 0;
    for (; *num; ++num)
    {
        remainder = (10 * remainder + (*num - '0')) % 3;
    }
    return remainder == 0;
}

This itself is just a slight modification of the code to read in a number and see if 3 divides it, with the important change that this one won't overflow on large input. Notice how close this code comes to implementing the rule that 3 divides a number if 3 divides the sum of its digits: since 10 leaves remainder 1 when divided by 3, the multiplication by 10 changes nothing modulo 3, and the loop is just summing the digits mod 3.

Since there exists an FSA that checks for divisibility by 3, there must be a regular expression that recognizes the same language. To simplify the definition, we'll introduce some character classes:

Z = [0369], O = [147], T = [258]

These partition the digits into equivalence classes modulo 3 (and correspond to the different cases in the code - no coincidence). Eliminating states from the FSA then gives a regular expression that tests for divisibility by 3:

^(Z|O(Z|OZ*T)*(T|OZ*O)|T(Z|TZ*O)*(O|TZ*T))+$

(Reading the second alternative: an O moves the remainder to 1; Z and OZ*T loop back to 1; a final T - or OZ*O - brings it home to 0. The third alternative is the mirror image for strings that start with a T.)

If you're not into the whole brevity thing, that expands to:

^
([0369] | ([147]([0369]|[147][0369]*[258])*([258]|[147][0369]*[147])) | ([258]([0369]|[258][0369]*[147])*([147]|[258][0369]*[258])))+$

So now we have regular expressions for testing divisibility by 2, 3, 4, 5, and 10. We could make one for 8 (it would be similar to the one for 4). Testing for divisibility by 6 is equivalent to testing for divisibility by both 2 and 3. We can do each test separately, but combining them into one standard regular expression would not be easy (though I believe there are extensions that implement an And operator). Alternately, we could modify the expression above to also require that the last digit be even (ick - that one is left for the reader). What would a regular expression that tests for divisibility by 7 look like?

Sunday, March 16, 2008

Divisibility Testing and Pattern Matching

How do you tell if a number is divisible by 10? Everybody knows you just look to see whether the last digit is 0. There are similar rules for checking divisibility by 5 (ends in 0 or 5) and by 2 (ends in an even digit). What I think is interesting is that these rules are not mathematical in the usual sense. Rather, they are based on simple textual pattern matching: they don't rely on manipulating values, they just require inspecting the textual representation of the number in question.

If you are going to do pattern matching on text, the natural tool is regular expressions. Using the usual regular expression syntax, the above rules can be expressed as 0$, [05]$, and [02468]$, respectively. Those are the easy ones. Can we code more complicated divisibility tests as regular expressions? How about checking for divisibility by 4?

The mathematical shortcut for checking whether 4 divides a number is to see whether it divides just the last 2 digits. This works because if n = 100a + b, then 4 | n iff 4 | 100a + b iff 4 | b, since 4 | 100. If we choose a and b so that b < 100, b will be exactly the last 2 digits of n. So we know we can get away with looking at just the last 2 digits; is there a regular expression we can use? Yes: ([02468][048]|[13579][26])$ works! (Note the grouping: the $ has to apply to both alternatives.) Unsurprisingly, our regular expression considers only the last 2 digits.

So far, I've been considering the easy cases - testing for divisibility by combinations of 2 and 5 (which are easy since we're using the base 10 representation). You may know the shortcut for testing for divisibility by 3: add up the digits of the number, and if this sum is itself divisible by 3 then so was the original number. Is it possible to code this test as a regular expression?
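A peek at why the digit-sum trick works: 10 \equiv 1 \pmod{3}, so the powers of 10 vanish modulo 3 and a number is congruent to its digit sum:

n = \sum_i d_i \cdot 10^i \equiv \sum_i d_i \pmod{3}

Whether that observation can be turned into a regular expression is the question.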

To be continued...

[Erratum: the regular expression above for divisibility by 4 requires 2 digits, so it won't match 0, 4 and 8]

Saturday, March 15, 2008

C# Extension Methods - Syntactic Sugar That Won't Rot Your Teeth

C# 3.0 (which shipped with .NET 3.5) introduced extension methods to the language. These are non-member methods that can be called with member-method syntax. For example, if we have a class:

class Foo { /**/ }


we can declare an extension method for it like this:
static class Bar
{
    public static void Baz(this Foo f) { /**/ }
}
(The this keyword in the declaration is what marks it as an extension method.) The syntactic sugar is that you are allowed to call Baz on an instance of Foo like this:

new Foo().Baz();

The compiler rewrites the above call to the more traditional:

Bar.Baz(new Foo());

Now that the mechanics are out of the way, let's look at what they are good for. The main benefit of extension methods, in my mind, is that they remove the syntactic penalty of non-member functions. Take Scott Meyers's advice [ddj] for C++ class design: "Prefer non-member non-friend functions to member functions." There is little argument that it's a good idea in principle, but there are certainly mixed feelings about the syntactic trade-off [Google seems to bork that thread - the discussion tree it renders doesn't make much sense]. As an example of where you might do this, consider how String.StartsWith() might have been implemented:

public class MyString
{
    private Char[] m_chars;

    public Char this[int i]
    {
        get { return m_chars[i]; }
    }

    public MyString(String s)
    {
        m_chars = s.ToCharArray();
    }

    public int Length { get { return m_chars.Length; } }
}

public static class MyStringExt
{
    public static bool StartsWith(this MyString str, MyString find)
    {
        if (str.Length < find.Length) return false;

        for (int i = 0; i < find.Length; ++i)
        {
            if (str[i] != find[i])
                return false;
        }
        return true;
    }
}

The same functionality, but with improved encapsulation! There is a downside, of course. Object methods are much more "discoverable" than static functions that happen to accept particular parameters. Member functions have to be declared along with the rest of the class, so there is just one place to look for them [partial classes complicate this a bit]. Microsoft's answer to this is improved IntelliSense: extension methods are listed alongside regular methods. As long as you're using their IDE, I think that makes up for it.

Sunday, March 2, 2008

The Universe aligns to tell Joel what he wants to hear

Joel Spolsky doesn't think comments belong in blogs. "You don't have a right to post your thoughts at the bottom of someone else's thoughts. That's not freedom of expression, that's an infringement on their freedom of expression." To back this up, he cites some offensive comments that were written in response to a post on a real estate blog. There are so many things wrong with Joel's position that I'm not sure where to begin.

First of all, if the blog allows comments, then yes, you do have the right to post your thoughts at the bottom. If the writer didn't want feedback, they would turn that feature off. There is an implied social contract that commenters will be courteous and civil, which was obviously violated in Joel's real estate example. No big deal; that's what the delete key is for. Joel obviously feels that forum moderation is acceptable in other situations, and goes so far as to liken it to picking up trash in the park. The solution may not be perfect, but the fact that Joel doesn't even mention it is interesting.

So why doesn't he mention it? My guess is because of how he thinks about blogging. He doesn't see it as a dialog in the traditional sense; he sees it more like pamphleteering. You write your essay, distribute it, and anyone who cares to respond can write their own pamphlet. Indexing services (Bloglines, Technorati, etc.) make it possible to find others' essays that claim to respond to yours. Perfect.

Perfect, that is, if that's the model you wish to use. A more effective model (in my opinion) is the "presentation with question and answer" model that most blogs use. Short questions about an unclear point, along with the author's responses, make the post better when they appear with the original. Discussion is more focused when it takes place in a single thread (another thing Joel knows); the best place for that thread is alongside the original post.

In the end, I think it's partly a matter of scale. People like Joel can get away with the pamphleteering style of blogging. There are n+1 people responding to his posts, and he certainly doesn't have time even to read them all, so why should he host them? For most of the long tail, there are far fewer readers, and just a handful of replies. For them, accepting (civil!) comments and posting them at the bottom is the way to go.