I was curious to see how many WTFs are in programmers' code and how that compares across languages, so I wrote a script to figure it out using github as the source. I couldn't figure out a way to do it using github's API, so I had to screen scrape the search instead. Therefore, when the markup on that page changes, the script will break. But you can fork it and fix it later if you'd like.
Another caveat is that I used the search string 'a' to determine the total number of repositories for a language. If you have a better way to get the actual number, or maybe just a more common letter we could search for, feel free to share your ideas!
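To make the arithmetic concrete, here is a minimal Python sketch of the ratio being computed. The function name and the counts are made up for illustration; they are not taken from my script or from the real data. It assumes the number of search hits for 'a' is a rough stand-in for the total number of repositories in a language.

```python
def wtfs_per_repository(wtf_hits: int, total_repos: int) -> float:
    """Return WTF search hits divided by the estimated repository count."""
    if total_repos <= 0:
        raise ValueError("total_repos must be positive")
    return wtf_hits / total_repos


# Hypothetical counts, for illustration only (not real github data).
# "a_hits" is the hit count for the search string 'a', used as the
# estimate of how many repositories exist for that language.
example_counts = {
    "LanguageX": {"wtf_hits": 120, "a_hits": 40_000},
    "LanguageY": {"wtf_hits": 15, "a_hits": 30_000},
}

for language, counts in example_counts.items():
    ratio = wtfs_per_repository(counts["wtf_hits"], counts["a_hits"])
    print(f"{language}: {ratio:.6f} WTFs/repository")
```

If the 'a' search undercounts repositories, every ratio is inflated, so a better denominator would change the absolute numbers and possibly the ordering.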
Below, you'll find a graph of the WTFs per repository for the most popular languages on github. I used only the most popular languages because it was a PITA to try to size the graph in Google Docs to include them all. However, you can get the raw data for all languages if you want to play with it yourself.
Any thoughts on how we can improve this?
What's your analysis on how we can interpret the data?
In other words...does your language encourage you to fall into the pit of success or the other way around?
What a fantastic idea!! Thanks for this. Did you look through ColdFusion code too?
@Jeremy - I was thinking something along those lines! =)
@Tim - Yes, CF is on the full data list, but didn't make it into the graph, since I used github's definition of the most popular languages on github for that list. It would fall somewhere between CSS and XML, with an average of 0.000863293825651 WTFs/repository.
The full data list is available at:
I was thinking a more experience-oriented interpretation. Almost every developer learns C, so percentage-wise, the number of expert C developers is pretty low, and the number of "easily confused" C developers is pretty high.
Objective C is "the new hotness" and people are frantically trying to learn it to write iPhone apps, so again, the percentage of easily confused newbies is high. Lua is kind of an outlier, and I'm not sure how to interpret that one.
The rest of these languages are more special purpose, so the people who start writing in them quickly become experts, because they do project after project, living their entire life in that language. Thus, they're less likely to say "What the hell? I've never heard of that before".
If we could count the number of developers that exist for each language, then we could measure a ratio of WTF/programmers.
I doubt that most people using github are new to programming, so the high level of WTFs could mean greater interest in the language by programmers, together with potentially poor documentation for the language.
Emacs Lisp has more WTFs than VimL. Emacs wins again!
@Adam Ness: Do a lot of developers learn C? I think most who start a degree in CS probably do, but I'd guess (key word!) that most devs don't bother with C after that, if they have a formal education in programming at all.
For Objective C, I have some light to shed on that situation. If you look at the search results for WTF and Objective C (through github's search), you'll see there are a lot of references to Webkit, which has the Web Template Framework, which gets abbreviated to the WTF namespace in the code. (I learned that from http://stackoverflow.com/questions/834179/wtf-is-w... )
I like your reasoning on the others. In fact I like the reasoning on Objective C, but judging by the search results on the front page, I think it has a lot of noise from Webkit.
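One way to cut down that noise, if we had the matching lines of code to work with, would be to skip lines that look like Webkit's namespace rather than an exclamation. This is only a rough Python sketch: the patterns and the function name are my own guesses for illustration, they are not part of my script, and they certainly won't catch every case.

```python
import re

# Assumed patterns for Webkit-style uses of "WTF" (Web Template Framework).
# These are illustrative guesses, not an exhaustive or verified list.
WEBKIT_PATTERNS = [
    re.compile(r"\bWTF::"),                                   # e.g. WTF::String
    re.compile(r'#\s*(?:include|import)\s*[<"]wtf/', re.IGNORECASE),  # header imports
    re.compile(r"\bnamespace\s+WTF\b"),                       # namespace usage
]

# "wtf" as a standalone word, in any letter case.
WTF_WORD = re.compile(r"\bwtf\b", re.IGNORECASE)


def is_probable_exclamation(line: str) -> bool:
    """True if the line mentions 'wtf' and matches none of the Webkit patterns."""
    if not WTF_WORD.search(line):
        return False
    return not any(pattern.search(line) for pattern in WEBKIT_PATTERNS)


sample_lines = [
    "#import <wtf/Vector.h>",
    "WTF::String name;",
    "// wtf is this doing here?",
]
for line in sample_lines:
    print(is_probable_exclamation(line), line)
```

Filtering like this needs the matched lines themselves, not just the hit counts from the search page, so it wouldn't drop straight into the scraping approach.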
@Hugo - That would be pretty sweet. Maybe you can bug the github guys to give us more access to the data like that. It would be fun to play with.
@anonymous: Absolutely hilarious!
For those who didn't look into the full data, VimL didn't make the list of "popular languages" so it didn't make the chart, but the full data list shows them as very close, with Emacs Lisp slightly edging out VimL for WTFs per repository.
Interesting. I always thought that Ruby/Python would both be pretty high because most hobby projects on github seem to be written in one of them, and people are more likely to include WTF in hobby projects than large scale projects. And this sort of bears true for Python in the graph, but not at all for Ruby.
CSS is also surprisingly low. I would have expected it to dwarf any actual programming language, because IE6 exists, but it's actually lower than most.
I also find it interesting that functional languages (Haskell, Erlang) are pretty low. These have been thrust into a more mainstream light lately, so I'd have imagined they'd do worse, with less experienced programmers now using them. But that's not the case at all. If anything, Haskell has done pretty well.
As for the others:
C - It's low level. This frustrates programmers more used to high level languages. No surprises here, except maybe that it does worse than the more complex C++.
Obj-C - It seems to be a language you either love or hate. No surprises.
Lua - I really don't know how to explain this one. My one guess is that it may be related to the frequency of Lua usage in game mods.
I wonder if this comparison is fair. Shouldn't you also take into account the size of the repositories? I.e. either measure WTFs/code line/language, or make a graph of avg. code lines/repository/language, and see if there is a positive correlation with the diagram you already posted.
Ha. At first I thought you were doing some kind of code analysis to find bad code and I was wondering how you were doing that automatically across languages in a way that was fair and made sense. Then I took a peek at your code and realized you are looking for the string 'wtf'.
Nice work!
I wrote something similar that scrapes the search results page for whatever term you'd like. It lets you compare results, as well.
I can explain the Lua WTFs.
World of Warcraft mods go into a ./WTF/ directory.
http://www.wowwiki.com/WTF
So, virtually every WoW mod on github will make multiple references to this, I suppose.
What does this measure? It measures how often someone with write-access to a repository wrote "WTF" into it. Most likely, it is the developer(s) himself/themselves. Discounting the instances where "WTF" has a different meaning, what does this mean? Critique of the language? Critique of some libraries? Self-critique? Colloquial naming of conditions (signal 'WTF)?
I think that the main information gain from this is about the people commenting on it.
These metrics are nearly useless. You're searching for the string "wtf" in the source, but it doesn't actually correspond with what most programmers consider to be a "wtf" in code. A slightly better (though still noisy and fairly useless) metric might be occurrences of the string "wtf" in commit messages. That at least would signify something the developer happened to have a "wtf" reaction to, although it could just as easily be a reaction to some library behavior or user requirement as it could the actual code being committed. For evidence of why simply searching for "wtf" in the code is fairly useless look no further than the search through Objective-C where the vast majority of hits are for imports of the webkit framework.
What would be really useful, but much much harder to do, would be to get the various "lint" applications for different languages and feed the source of each project through that. Although still not a very accurate count of true "wtf"s which require actual analysis of the code, it could give some idea as to which languages tend to encourage lazy practices. Often a high number of issues spotted by the lint applications correspond with high prevalence of "wtf"s.
You need to normalize the distribution by the number of lines of code. If language A has 50 million lines of code in github and 1 million WTFs, and language B has 1 million lines of code in github and 500,000 WTFs... it says a lot more about language B than language A (i.e. roughly one WTF for every two lines of code).
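Plugging the hypothetical numbers from that example into a small Python sketch shows the gap. The function name is invented for illustration, and the counts are the made-up ones from the comment, not real github data.

```python
def wtfs_per_line(wtf_count: int, lines_of_code: int) -> float:
    """Return the number of WTFs per line of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return wtf_count / lines_of_code


# Hypothetical figures from the example above.
language_a = wtfs_per_line(1_000_000, 50_000_000)
language_b = wtfs_per_line(500_000, 1_000_000)

print(f"Language A: {language_a:.0%} WTFs per line")
print(f"Language B: {language_b:.0%} WTFs per line")
```

Language A has twice as many WTFs in absolute terms, yet per line of code language B comes out 25 times higher.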
@Ed: Lua's WTF hits aren't due to WoW Text Files. The SavedVariables files, which are stored in the WTF folder, use the lua extension.
Take a look at the search results[1] and you'll see that most of the hits are from the same piece of code, but different forks/repositories.
@fair? - I would love it if github provided the size of each repository (in lines of code, or even just KB), but I couldn't find that information, if they have it.
@Jess, @Kyle: I did consider how cool it would be to run some static analysis to find code WTFs, but I'd need a good way to choose repositories at random, because I don't think it's feasible to run through every repo that way.
Maybe we could build some tool to do it distributed among a ton of computers!
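If we somehow had a list of repository names to draw from (which is the hard part, and not something I have), the sampling step itself would be simple. Here's a rough Python sketch; the function name and the repository names are invented for illustration.

```python
import random


def sample_repositories(repo_names, sample_size, seed=None):
    """Pick up to sample_size distinct repositories uniformly at random."""
    names = list(repo_names)
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(names, min(sample_size, len(names)))


# Invented repository names, for illustration only.
all_repos = [f"user{i}/project{i}" for i in range(1000)]

chosen = sample_repositories(all_repos, 25, seed=42)
print(len(chosen), "repositories selected for static analysis")
```

Fixing the seed would let several machines agree on the same sample and split it between them.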
@Kyle: I'd like to run it through commits as you mentioned, but I don't know of a good way to get bulk data on that through github. The only way I could think of would be to iterate through all the commits, but I don't have the kind of processing power to do that either. Again, maybe we could do a distributed tool.
Aside from the technological infeasibility of going through each repo one at a time, github also imposes a rate limit, so even if I had that kind of time/power, I'd not be able to do it on my own.
Ideally, github could offer a way to get bulk data that would let us perform that kind of analysis with a few queries, instead of a few million.
@haste and @Ed: thanks for the insight on Lua's WTFs through WOW!
@Kris: Thanks for the link to your repository. I did make it easy enough to switch out the string you're looking for, and while I considered sticking up a website, I decided against it because of the rate limit.
What did you use to make the charts?
@fair? - When I said I couldn't find it, I meant I couldn't find it in a way that wouldn't require me to iterate over every repository, not that it's not available at all. I don't recall the specific availability, but I do recall there was no way for me to get it and still do what I wanted without going through a ton of repos one at a time.
The guys at GitHub are pretty cool, and Zach Holman, who works at GitHub, has been running a series of blog posts talking about how GitHub does things here:
http://zachholman.com/posts/how-github-works-async...
Might be worth it to try to contact him and see if GitHub might be willing to add something to the APIs that could facilitate some more interesting metrics. At the very least, a way to search commit messages, if not across all of GitHub then at least within an individual project, seems like a useful API to expose.
How close is your language towards a computer "program" or just a data presentation script?
This chart seems to sum it up pretty well.
@Kyle - I know it's been a while, but I wanted to say thanks for the advice. I think I'll do that. I hadn't considered trying to contact someone personally, even though they've always appeared approachable. Anyway, thanks again!
Great idea -- thanks for sharing, I love it!
This is a great measure of how well programmers in various languages comment their code :-).
Instead of scraping the data, you could also use Google BigQuery, which contains all the information you need.
Check it out.
Oh yes, the number of WTFs for Objective-C: I can really, really feel and confirm it.
The number of PHP-WTFs is also nice...
Thanks Manuel, I had no idea about it. Looks awesome!
Lelala, I've felt it for Objective-C as well, but I think it's probably just the fact that it is a common abbreviation for WebTemplateFramework used all over the place. =)
I would like to see WTFs per line of code, line of comment or even words of code / comment. At the moment it looks like these metrics reflect the size of the repositories and the length of the code / comments.
@Stuart: As would I. At the time I did this there wasn't much of a search available at github, and there was no archive of data to sort through (that I was aware of). Now some more of those metrics might be possible.
This has some serious flaws and wasn't intended to be real research, though I think the idea has some merit in identifying programmer states of mind on average.
I just don't have the computation power to go through the proper channels to normalize against lines of code, for example. But even then, we may need to consider further external variables to control for. For example: boilerplate, which we might suspect to cause WTFs, may show up as a negative correlation simply due to the increased number of lines.
