Unfuddle

For a recent project, we were looking for an issue tracker. Our biggest criteria was something that was free — we weren’t planning on doing anything complicated, and so available issue trackers were essentially equal in our eyes. We ended up going with Unfuddle, but it wasn’t until after signing up that we discovered its biggest strength: an API.

There are other issue trackers with an API, but this one seemed reasonable enough so we used it. The big get for us in building a web-based app, is that it lowers the barrier for reporting bugs. We set up our development server to include a bug-reporting button that collected a little report that then sent it to our server, which uses the Unfuddle API to add a ticket. It also records and submits a little information about the user’s browser, current session, etc. that might help in debugging their problem.

Not the most complicated or profound system in the world, but it does put user-reported (and tester-reported) bugs in the same place we check our to-dos and other issues. No checking a support email account we remember once a month. Just puts bugs right were we’re looking anyway. I hope to share a little bit of the backend code soon!

Tagged with:
 

Until recently, I bore the burden of a dirty little secret: I didn’t use unit testing. I don’t know if this has been other students’ experience, but my software engineering course could have been a lot better. Perhaps it’s a topic that requires students to determine their own level of involvement, but particularly, I am irked that these issues were not as developed as they could have been:

  1. Version control – svn, git, something! Anything, really! Just use something!
  2. Unit testing – sure we talked about it, but its utility isn’t clear until you use it in earnest
  3. Documentation – always felt more like a chore than a tool
  4. Issue tracking – Joel on Software said it best: “I was born with only two bug-storing-slots in my brain.”
  5. More best practices – Effective C++ (it’s worth it — get it and read it) alone was worth more than the course was.

I concede that these are things whose utility might be hard to convey to college students who may or may not be working on projects they really believe in. Still, very important.

One of the things I like about documentation is that you can refer to the documentation when uncertain about side-effects, input, etc. of functions and you don’t really feel like reading through code and trying to keep it in your brain-RAM. Not only that, but I usually refer to my own documentation to get an idea of the guarantee that a function was offering, to try to make sure that it’s making those same guarantees after I make changes. Not only this, but articulating functionality crystalizes it’s implications in your mind. When describing how something works to a friend, it often reminds you of issues you hadn’t considered. Writing documentation is the same in this respect.

When working with other people, while it’s nice to be able to talk face-to-face, the issues we think of that need changing go in one ear and out the other for me. Not that it wasn’t a good idea, but my brain-RAM is not infinite. Bugs, features, ideas, issues all get paged out of my head. Write them down instead, which is where issue tracking comes in. Design questions and decisions come out of issues all the time, and the trail of comments and discussion make a nice reference for documentation.

I find that when I’m fixing a bug, the functionality of that chunk of code gets reduced to that particular feature in my mind. So, if when I try using it, if that bug is fixed, then my job is done. But we’ve all experienced unintended side effects, where in fixing one thing, we break another. Having a list of features to check is a step up from trying to remember the things that need to work, but it’s much less than the power of unit testing. I’ve recently been using UnitTest++ for C++ applications, and QUnit from the jQuery people, and have been pleased to thrilled. It allows you to check functionality exhaustively and quickly on a whim. Even the task of writing the tests forces you to define the functional requirements of your code, and I find it helps to clarify a lot.

Version control lets you try crazy new approaches and track changes, isn’t that great? I was helping a friend last semester with some of his code, and he was trying to make code changes with copy, paste, and the undo buffer. In the end, he ended up introducing more bugs into the code than he was fixing. It’s intractable to keep these kinds of code changes in your brain-RAM. Use something, use anything. I have a qualm with svn because it was something I used so rarely, I could never remember how to set it up. Even a barrier as small as that was enough to prevent me from using version control at all. I love git because it’s so easy to set up, I use it for almost anything: from a report (if using LaTeX), to a weekend project, to my thesis. There are tons of tools out there, but just use something.

Python Lesson Learned

I have finally learned a lesson after running into this fact repeatedly: no matter what you are trying to do, python has a module for it. I’m reminded of two xkcd comics in particular.

Tagged with:
 

numerical_limits

I’m working on a C++ class that encapsulates changes to a “steerable” variable. It’s part of a system that allows trusted commands from a socket library to change the value of registered variables. For example, a client might:

int sInt(0);
float sFloat(0.3);
registerParam("someSteerableInteger", &sInt);
registerParam("someSteerableFloat", &sFloat);
...
// Library makes these calls after receiving remote commands:
setParam("someSteerableInteger", 5);
setParam("someSteerableFloat", 3.4);

There’s some intelligence in the class to ensure integrity of the data. For example, it will cast the value past into setParam(…) correctly, so that a call to setParam(“someSteerableInteger”, 5.3) results in the 5.3 cast as an integer and then used to set the integer’s value.

Some of the other intelligence needed is that the user might specify the minimum and maximum valid values the variable might take on. Or he or she might not. If not, it would be nice to have a convenient way of specifying a reasonable default minimum and maximum.

Enter numerical_limits. It’s a static templated class that can give you some insight into the built-in types. For example:

#include <iostream>
#include <limits>
using namespace std;
...
cout << "Min unsigned int: " << numerical_limits<unsigned int>::min() << endl;
cout << "Max float: " << numerical_limits<float>::max() << endl;

Tagged with:
 

Sum of Dice

I recently had an interview in which I was asked, “how many ways can 1000 dice make the sum 3000?” I was caught a little off-guard, and the phrasing made me wonder about a closed-form solution which, after some reading, I don’t believe exists. Still, a workable solution isn’t as easy as that.

Let F(n, k) denote the number of ways n dice can make the sum k. Note that the first die can be 1, 2, 3, 4, 5 or 6, so F(1000, 3000) = F(999, 2999) + F(999, 2998) + … + F(999, 2994). While this recurrence relation leads to an easy implementation, it suffers from two big problems: performance and maximum recursion depth. Without any memoization or optimization, this would lead to 61000 function calls, which is about 10780. That’s a big number. If you were able to evaluate one such function call every clock cycle on every computer in the world (about 7 x 1018 cycles/second), you would need about 10750 years, or 10650 lifetimes of the universe (about 10100 years) to solve it. Get ready to wait.

One optimization is to check if the arguments n and k make sense at each level, and prune accordingly. For example, n dice can only sum to between n and 6n, so if k is outside of that range, you don’t need to branch down any farther. Even using this and instantaneous branch evaluation, your code still wouldn’t complete in the expected lifetime of the universe.

Obviously, this is still a solvable problem (there are about 10757 ways 1000 dice can sum to 3000). A python implementation figured that out in 3.5 seconds. One way to help crack this egg is to use memoization. Consider the calls to F(999, 2999) and F(999, 2998). Expand these out to get:

  • F(999, 2999) = sum[F(998, k = 2998, 2997, 2996, 2995, 2994, 2993)]
  • F(999, 2998) = sum[F(998, k = 2997, 2996, 2995, 2994, 2993, 2992)]

You’ll notice that both of these expansions call F(998, k = 2997, 2996, 2995, 2994, 2993), each of which is a lot of work to evaluate. So, if once you determined each of these values once, you could use them in other places and save huge amounts of time.

One thing you’d need in order to implement memoization is a map (or dictionary in python) that stores key-value pairs. So once you figure out F(998, 2997), you could store it like myMap[(998, 2997)] = …; but this can add some complication. Every time to call F, you’d have to check if you’ve already evaluated the function for those arguments. And when you’ve figured out and stored a lot of these values, each search will take some time. It will still work and will get you the answer you need, but probably not as fast as you’d like, and not as fast as it could be. Using memoization for F(700, 2100) took 8.2 seconds in a python implementation, but a better approach took 1.5 seconds. The maximum recursion depth prevented it from using bigger numbers.

Algorithmically, the best you’ll be able to do (as far as I can tell) is O(n2), which is not bad in the scheme of things, and certainly not the combinatoric time we saw first. Enter dynamic programming. Particularly, we’ll use a bottom-up approach. We’ll build two arrays, each associated with a number of dice. The i‘th index in that array is the number of ways n dice can sum up to (n + i). So for 1 die, myArray[0] is the number of ways you can roll a 1, and for 4 dice, myArray[10] would be the number of ways you can roll 4 dice to get 14.

As I said, we’ll keep track of two arrays, say previous and current. First, initialize previous to be the number of ways 1 die can be rolled: {1, 1, 1, 1, 1, 1}, because 1 die can be rolled each number in only one way. Then, we’ll use that to fill the array “current” with the number of ways 2 dice can be rolled, based on the array previous. For example, current[0] = previous[0] = 1, because the only way to get a sum of n with n dice is to roll all 1′s. Then, current[1] = previous[0] + previous[1], because if we roll a 2 for the second dice, there are previous[0] ways to sum up to 1 with the remaining die, or if we roll a 1 for the second dice, there are previous[1] ways to sum up to 2 with the remaining die.

Each entry current[i] should be the sum of previous[i-5, i-4, i-3, i-2, i-1, i], taking the bounds of the previous array into consideration (this can actually be generalized to any m-sided dice by using previous[i-m+1, ..., i]). Once you’ve filled the array current, then you assign current to previous, and move on to the next number of dice, 3. Continue this until you have filled the array for the desired number of dice. If performance is a big issue, this can be implemented as a moving frame. For example, since current[i] uses previous[i-5, ..., i] and current[i+1] uses previous[i-4, ..., i+1], you can alternatively write current[i+1] = current[i] – previous[i-5] + previous[i+1], and use only two add/subtract operations instead of 5.

A bonus to this bottom-approach is that once finished, you actually have an array with the number of ways you can make any sum with the n dice. This makes it easy if you wish to find the number of ways n dice can form a sum up to or greater than a number.

Tagged with:
 

Wrapping printf(1)

Working on an application that had become a little… verbose, I decided it was finally time to wrap my prints in a function that could easily be switched on or off depending on whether or not I wanted it to be verbose. One approach I had seen before (from my OS professor) that has a certain amount of merit is to wrap statements with a macro:

#ifdef DEBUG
printf(....);
#endif

The nice thing about this approach is that debugging can be turned on or off easily at compile time. However, my experience has been that it leads to a lot of typing, and seeing too many macros in the middle of code makes my brain explode in a fiery rage. So, I figured that if I wrapped my prints in another function (I called mine ‘log’ and ‘error’), I could avoid this whole mess and keep my sanity. I’ve done this with a number of other projects in other languages, but I had to learn some magic to do it in C.

Lesson Learned #1 : Variable arguments. It turns out you can define functions that take a variable number of arguments with va_list (from ). You define such a function:

void log(const char* fmt, ...) {
va_list args;
va_start(args, fmt);
...
va_end(args);
}

Lesson Learned #2 : However, from what I can gather, you can’t just inject printf directly into this. However, having anticipated this, there is a set of functions designed for cases like this: vfprintf, vprintf, vsnprintf, vsprintf. The ‘v’ stands for va_list (variable-argument list), and you can use them just like you’d expect:

void log(const char* fmt, ...) {
va_list args;
va_start(args, fmt);
fprintf(log_fd, "NOTE : ");
vfprintf(log_fd, fmt, args);
va_end(args);
}

The thing I like about this approach is that you have control over how log messages get printed in one place. So, for example, if I provided another function, setLogFD, then I could easily just set the file descriptor where all log messages get printed. So easy! Something I’ve used this for in other instances (especially servers) is to also inject additional information like a timestamp on every message. So, when I call:

log("Some event '%s' just happened.\n", event_name);

Then I automatically get “NOTE : ” and maybe a timestamp prefixed on that message. Which make code look clean, and adds a great deal of functionality. I actually added another function error(…) that prints to a different file descriptor in case I want to suppress debug messages, but no error messages. For additional layers of debugging, you might do something like this:

void log(int level, const char* fmt, ...) {
va_list args;
va_start(args, fmt);
FILE* fd = my_log_files[level];
fprintf(fd, "NOTE : ");
vfprintf(fd, fmt, args);
va_end(args);
}

This way, at startup, you could easily set some of the file descriptors in my_log_files to stderr and some to point to /dev/null or otherwise dissolve into the ether.

Tagged with:
 

MPI_Datatype

It has been a while since I’ve had to work with MPI, but recently I had to learn a new trick with it. MPI provides ways to convey data between processes in a number of ways, from broadcasts to scatters to all-gathers. But obviously you have to provide a certain amount of information about the structure of the data, not the least of which is the datatype.

MPI defines enumerated constants for the basic data types in C: MPI_CHAR, MPI_INT, etc., and for the most part these will suffice. But what if you want to scatter your own struct? This can be done through a number of utility functions, but the most versatile seems to be MPI_Type_struct().

You let MPI know how many different blocks, or chunks of memory there are, their lengths, their offsets and types, and then what variable to store the resulting integer handle in. So if we had a struct:


typedef struct {
int var;
char string[STRING_LENGTH];
double foo;
} bar;

We would first indicate that there are three blocks, of lengths 1, STRING_LENGTH and 1:

int count = 3;
int lengths[3] = {1, STRING_LENGTH, 1};

The offsets indicate the byte offsets from the base address of each of the types in the structure. For this example, the “var” variable is the first, and thus has offset 0. On the other hand, “string” will have an offset that is sizeof(int), and “foo” will appear after the int and the string of length STRING_LENGTH:

MPI_Aint offsets[3] = {0, sizeof(int), sizeof(int) + STRING_LENGTH};
MPI_Dataype types[3] = {MPI_INT, MPI_CHAR, MPI_DOUBLE};

To finish up, we ask MPI to fill an integer we declare with the handle that will hereafter refer to this struct for the purposes of MPI. Then, we commit it, and it’s ready for use!

int barDatatype;
MPI_Type_struct(count, lengths, offsets, types, &barDatatype);
MPI_Type_commit(&barDatatype);

Now if you have an array of bar structs, you can use scatterv with it and your new datatype, or bcast for that matter. It’s business as usual, as if it were any of the include base datatypes in MPI.

Tagged with:
 

Promises to my OS Prof

For me, operating systems was one of the most worth-while courses in my undergraduate career. Not only gaining insight into the black-box that can be the operating system, but learning a bunch of skills that I have found invaluable since then.

Even in graduate school, I encounter computer scientists that have never used the command line. While its all well and good to use your IDE, it is absolutely crippling to not be able to do all the same magics from the command line. Not only that, but there is a wealth of tools accessible from your favorite shell and bash scripting is a useful piece to keep in one’s toolbox. The command line was one thing with which I was better acquainted through operating systems; this was especially true in one project where we had to roll our own shell. This was particularly good because the shell gives access to a lot of system tools that I’ve since found I want to use in other programs: forking, dup’ing, interacting with environment variables, etc.

Other good experiences were working with threading libraries and even booting up EC2 instances and testing out code there. I am extremely grateful to then-professor Mike Colagrosso for making it such a worthwhile experience. Lately I’ve been working with code where I remember quality code promises he made us make:

ALWAYS save the return value of system calls. I try to avoid programming in C as much as possible (after all, C++ is usually an alternative, if not Python), but whenever I do I am constantly reminded of this one. The most common design for functions that I encounter in that kind of code is to accept the variable that is to be updated, and return instead, a code indicating the success/failure/warnings/etc. of the code. In some cases, the return value is the only way you’ll have access to the resource just requested (for example in the case of fork(2) ), but either way it’s the basic mechanism for getting feedback about how code has executed.

NEVER use strcpy(3). I’ve never used a buffer overflow in a clever way (though this is a someday project), or I should say on purpose. But I’ve dealt with enough instances of me being distracted and writing bad code that I’ve gotten to know gdb (the GNU DeBugger) better than I would like. The strcpy(3) function relies on null character termination to stop copying memory from one string into another. If, however, a string isn’t null terminated or isn’t terminated before the length of the string buffer being copied into, then the operation will overrun the buffer. Instead, ALWAYS use strncpy(3). It allows you to specify the maximum length that should be copied (like, for instance the length of the buffer being copied into) so as to avoid this embarrassing problem.

Despite the taking the class just over three years ago (it’s unsettling to realize its been so long), these few promises have stuck with me. So Mike, if you’re reading this, I’m doing my part!

Tagged with:
 

A Bad First Instinct

It’s an itch. A compulsion. Sometimes when working with a new library I get a very strong impulse to do a very bad thing — reinvent the wheel. Or, in this case, re-write the wheel.

It’s easy to object to the way code is organized, and can cause a certain amount of discomfort. Whether it’s the desire to go through and reformat code (braces belong on the same line as the if statement!) or the pain of hacking together relatively incompatible library designs, it can be unpleasant.

Something that I’ve been pushing myself to do, and slowly learning to do is to be comfortable with that discomfort. To resist the urge to rewrite code, as it’s never just a matter of rewriting it, but also testing it, an so forth. I’m slowly beginning to accept the painful truth: no code is perfect.

That said, there are times when I do rewrite libraries. Sometimes you’re handed code from ten years ago that’s no longer applicable, or antiquated, or just plain ugly. In these scenarios it’s perfectly justified to make large sweeping structural changes (with the protection of your favorite version control, of course). Just, hold off. Wait, and try to use the library, and the first-pass issues will either grow into systemic problems or wither and recede into the cracks.

Tagged with:
 

This last week saw SIGGRAPH 2010 in Los Angeles, sunny California. It was there that I gave my first talk at a real conference. One request that I had not anticipated was for the slides from the talk, and so I post them here now.

Pending some red tape resolution, I hope to post the live-working demos soon. Until then, I hope that this video from the original conference submission will whet your appetite!

For those particularly curious, feel free to contact me or refer to the abstract.

Tagged with: