The coolest part about my job is that I get to work on tasks that are cool, but typically are forever laid to rest in the would-be-cool-if-we-could-but-we-don’t-have-time-or-resources category. However as project code sizes increases, the usefulness/coolness ratio of static analysis grows and moves from would-be-cool, to nice-to-have to this-is-the-only-way. Now, I’m not sure where Mozilla lies on this scale, but I do know for sure that the giant codebase is an analysis treasure trove.
Dead Code Motivation
I have been talking about dead code detection in Mozilla for ages. As software changes, some pieces of code unintentially get left behind, but often there is no way to tell. It results in unnessary maintanance burden and increased footprint. In roc’s case randomly spotting dead methods may also involve a little IRC griping at me about not having any tools for it, even rudimentary ones. Between that and blizzard’s cheering over reduced code size, I had no choice, but to give it a try once I had enough outparamdelling being reviewed.
Dead Code Results
First of, the approach described below seems to work well: see bug with initial results. Now I get a few thousand reported methods to inspect and refine the results.
Dead Code Approach
I’ve been pondering dead code detection since I’ve started working on this stuff. There are lots of ways to do it, but I finally settled on the dead-simplest one: method-level granularity. Ingredients are:
- Dehydra/Treehydra extraction JavaScript: Dehydra makes it easy to enumerate methods and class hierarchies. Treehydra makes it possible to extract every mention of a method from the code(except for pointers to virtual functions that were casted…in that case we are left with vtable index and a type that the pointer was cast to). Additionally, Treehydra counts the number of AST nodes visited in every function body.
- Shell script to aggregate the result of processing Mozilla source. Gotta admit, perl’s hashtables give GNU sort -u a run for its money.
- Ocaml program to do the super-dumb algorithm. Classes with the same names, are assumed to be the same class. Method overloads are assumed to be a single method. All methods are assumed to be virtual. Every ClassType::FUNCTION_CALL call walks down to all children & up through all the parents and marks the derived methods as called. Then all derivatives of scriptable XPIDL are filtered out and the uncalled methods are printed out. In order to find most exciting functions first, methods in the results are sorted by their AST count =D. Language Nerd Trivia: Why OCaml - because ADTs are no fun in other languages. Why is my OCaml so bad - because I lost the hang out if due to not writing anything it for the past 2 years. I’m sure most people reading this will go: “Wait a minute, this is unsound if you don’t see the virtual function pointers?”, to which I reply: “Once I nuke the method and mozilla compiles successfully, it’s sound”. In reality someone could make a puny GCC patch to preserve more data in virtual function pointer assignments.
Since this is all in early stages there a bunch of things I don’t deal with: method overloads, constructors, destructors and overloaded operators….and templates. All function bodies are scanned, but some function names are a pain to deal with or they aren’t straightforward function calls in GCC so they don’t participate in the dead/alive contest.
Where To Go From Here
Well, once we run out of dead code detected by the primitive caveman approach above, we’ll have to investigate less conservative approaches possibly involving abstract interpretation and callgraphs.
How To Get Involved in Screwing With Software Cost Models By Contributing Negative Line Counts
This project has a lot of places to help out:
- work through the dead method list, filing bugs accordingly and deleting any related code
- Extend the machinery to work on all non-static function
- Try this on your favourite large codebase.
- Write the virtual function pointer annotation patch :)