I was curious to see how many WTFs are in programmers' code and compare it across languages, so I wrote a script to figure it out using github as the source data.
I couldn't figure out a way to do it using github's API, so I had to screen scrape the search instead. That means the script will break whenever the markup on that page changes. But, you can fork it and fix it if you'd like.
Another caveat is that I used the search string 'a' to determine the total number of repositories for a language. If you have a better way to get the actual number, or maybe just a more common letter we could search for, feel free to share your ideas!
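As a rough sketch of the ratio described above (not the original script), the Python below divides a language's 'wtf' hit count by the repository estimate obtained from the 'a' search. The function name, the language labels, and all of the counts are hypothetical placeholders, not real data from the scrape.

```python
def wtfs_per_repository(wtf_hits: int, repo_estimate: int) -> float:
    """Return search hits for 'wtf' divided by the estimated repository count.

    repo_estimate stands in for the number of results from searching 'a',
    which is only a proxy for the true number of repositories.
    """
    if repo_estimate <= 0:
        raise ValueError("repo_estimate must be positive")
    return wtf_hits / repo_estimate


# Hypothetical (wtf_hits, repo_estimate) pairs, made up for illustration only.
counts = {
    "LangA": (120, 40_000),
    "LangB": (15, 60_000),
}

ratios = {lang: wtfs_per_repository(hits, repos)
          for lang, (hits, repos) in counts.items()}

# Sort from most to fewest WTFs per repository, as a chart would.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)

for lang, ratio in ranked:
    print(f"{lang}: {ratio:.6f} WTFs/repository")
```

Because the denominator is only an estimate, any language whose repositories are less likely to match 'a' would look worse than it really is.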
Below, you'll find a graph of the WTFs per repository for the most popular languages on github. I used only the most popular languages because it was a PITA to try and size the graph using Google Docs to include them all. However, you can get the raw data for all languages if you want to play with it yourself.
Any thoughts on how we can improve this?
What's your analysis on how we can interpret the data?
In other words...does your language encourage you to fall into the pit of success or the other way around?
Posted by Jeremy
on Aug 11, 2011 at 10:22 AM UTC - 6 hrs
What a fantastic idea!! Thanks for this. Did you look through ColdFusion code too?
Posted by Tim Cunningham
on Aug 11, 2011 at 10:24 AM UTC - 6 hrs
@Jeremy - I was thinking something along those lines! =)
@Tim - Yes, CF is on the full data list, but didn't make it into the graph, since I used github's definition of the most popular languages on github for that list. It would fall somewhere between CSS and XML, with an average of 0.000863293825651 WTFs/repository.
The full data list is available at:
http://bit.ly/r7XBvL
Posted by
Sammy Larbi
on Aug 11, 2011 at 11:04 AM UTC - 6 hrs
I was thinking a more experience-oriented interpretation. Almost every developer learns C, so percentage-wise, the number of expert C developers is pretty low, and the number of "easily confused" C developers is pretty high.
Objective C is "the new hotness" and people are frantically trying to learn it to write iPhone apps, so again, the percentage of easily confused newbies is high. Lua is kind of an outlier, and I'm not sure how to interpret that one.
The rest of these languages are more special purpose, so the people who start writing in them quickly become experts, because they do project after project, living their entire life in that language. Thus, they're less likely to say "What the hell? I've never heard of that before".
Posted by
Adam Ness
on Aug 12, 2011 at 10:18 AM UTC - 6 hrs
If we could count the number of developers that exist for each language, then we could measure a ratio of WTF/programmers.
I doubt that most people using github are new to programming, so the high level of WTFs could mean greater interest in the language by programmers together with potential poor documentation for the language.
Posted by Hugo Estrada
on Aug 14, 2011 at 07:22 AM UTC - 6 hrs
Emacs Lisp has more WTFs than VimL. Emacs wins again!
Posted by anonymous
on Aug 14, 2011 at 07:53 AM UTC - 6 hrs
@Adam Ness: Do a lot of developers learn C? I think most who start a degree in CS probably do, but I'd guess (key word!) that most devs don't bother with C after that, if they have a formal education in programming at all.
For Objective C, I have some light to shed on that situation. If you look at the search results for WTF and Objective C (through github's search), you'll see there are a lot of references to Webkit, which has the Web Template Framework, which gets abbreviated to WTF namespace in the code. ( I learned that from
http://stackoverflow.com/questions/834179/wtf-is-w... )
I like your reasoning on the others. In fact I like the reasoning on Objective C, but judging by the search results on the front page, I think it has a lot of noise from Webkit.
@Hugo - That would be pretty sweet. Maybe you can bug the github guys to give us more access to the data like that. It would be fun to play with.
@anonymous: Absolutely hilarious!
For those who didn't look into the full data, VimL didn't make the list of "popular languages" so it didn't make the chart, but the full data list shows them as very close, with Emacs Lisp slightly edging out VimL for WTFs per repository.
Posted by
Sammy Larbi
on Aug 14, 2011 at 06:47 PM UTC - 6 hrs
Interesting. I always thought that Ruby/Python would both be pretty high because most hobby projects on github seem to be written in one of them, and people are more likely to include WTF in hobby projects than large scale projects. And this sort of bears true for Python in the graph, but not at all for Ruby.
CSS is also surprisingly low. I would have expected it to dwarf any actual programming language, because IE6 exists, but it's actually lower than most.
I also find it interesting that functional languages (Haskell, Erlang) are pretty low. These have been thrust into a more mainstream light lately, so I'd imagine they'd be doing worse from programmers who are less experienced in them using them. But they're not like that at all. If anything, Haskell has done pretty good.
As for the others:
C - It's low level. This frustrates programmers more used to high level languages. No surprises here, except maybe that it does worse than the more complex C++.
Obj-C - It seems to be a language you either love or hate. No surprises.
Lua - I really don't know how to explain this one. My one guess is that it may be related to the frequency of Lua usage in game mods.
Posted by
Macha
on Aug 18, 2011 at 09:48 AM UTC - 6 hrs
I wonder if this comparison is fair. Shouldn't you also take into account the size of the repositories? I.e. either measure WTFs/code line/language, or make a graph of avg. code lines/repository/language, and see if there is a positive correlation with the diagram you already posted.
Posted by fair?
on Aug 18, 2011 at 10:36 AM UTC - 6 hrs
Ha. At first I thought you were doing some kind of code analysis to find bad code and I was wondering how you were doing that automatically across languages in a way that was fair and made sense. Then I took a peek at your code and realized you are looking for the string 'wtf'.
Posted by
Jess Johnson
on Aug 18, 2011 at 11:25 AM UTC - 6 hrs
Nice work!
I wrote something similar that scrapes the search results page for whatever term you'd like. It lets you compare results, as well.
https://github.com/krismolendyke/githubris
Posted by Kris
on Aug 18, 2011 at 11:26 AM UTC - 6 hrs
I can explain the Lua WTF's
World of Warcraft mods go into a ./WTF/ directory.
http://www.wowwiki.com/WTF
So, virtually every WoW mod on github will make multiple references to this. I spose.
Posted by Ed
on Aug 18, 2011 at 11:28 AM UTC - 6 hrs
What does this measure? It measures how often someone with write-access to a repository wrote "WTF" into it. Most likely, it is the developer(s) himself/themselves. Discounting the instances where "WTF" has a different meaning, what does this mean? Critique of the language? Critique of some libraries? Self-critique? Colloquial naming of conditions (signal 'WTF)?
I think that the main information gain from this is about the people commenting on it.
Posted by Harleqin
on Aug 18, 2011 at 12:18 PM UTC - 6 hrs
These metrics are nearly useless. You're searching for the string "wtf" in the source, but it doesn't actually correspond with what most programmers consider to be a "wtf" in code. A slightly better (though still noisy and fairly useless) metric might be occurrences of the string "wtf" in commit messages. That at least would signify something the developer happened to have a "wtf" reaction to, although it could just as easily be a reaction to some library behavior or user requirement as it could the actual code being committed. For evidence of why simply searching for "wtf" in the code is fairly useless look no further than the search through Objective-C where the vast majority of hits are for imports of the webkit framework.
What would be really useful, but much much harder to do, would be to get the various "lint" applications for different languages and feed the source of each project through that. Although still not a very accurate count of true "wtf"s which require actual analysis of the code, it could give some idea as to which languages tend to encourage lazy practices. Often a high number of issues spotted by the lint applications correspond with high prevalence of "wtf"s.
Posted by Kyle
on Aug 18, 2011 at 02:13 PM UTC - 6 hrs
you need to normalize the distribution by the number of lines of code. if language a has 50 million lines of code in github and 1 million WTFs, and language B has 1 million lines of code in github and 500,000 WTFs...it says a lot more about language B than language A (i.e. half the lines of code contain WTFs)
Posted by davis
on Aug 18, 2011 at 03:25 PM UTC - 6 hrs
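The arithmetic in davis's comment above can be worked through with a short Python sketch. The figures are the ones given in that comment; the function and variable names are made up for illustration, and real line counts were not available from the search at the time.

```python
def wtfs_per_line(wtf_count: int, lines_of_code: int) -> float:
    """Normalize a raw WTF count by the total lines of code for a language."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return wtf_count / lines_of_code


# (wtf_count, lines_of_code) taken from the example in the comment above.
examples = {
    "A": (1_000_000, 50_000_000),
    "B": (500_000, 1_000_000),
}

rates = {lang: wtfs_per_line(wtfs, lines)
         for lang, (wtfs, lines) in examples.items()}

for lang, rate in rates.items():
    print(f"language {lang}: {rate:.2%} of lines contain a WTF")
```

Language A has twice the raw count, but language B's rate per line is 25 times higher, which is the point of normalizing.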
@Ed: Lua's WTF hits aren't due to WoW Text Files. The SavedVariables which are stored in the WTF folder use the lua extension.
Take a look at the search results[1] and you'll see that most of the hits are from the same piece of code, but different forks/repositories.
[1]
https://github.com/search?type=Code&language=Lua&q...
Posted by haste
on Aug 18, 2011 at 04:18 PM UTC - 6 hrs
@fair? - I would love it if github provides size of repository (in lines of code, or even just kb) but I couldn't find that information if they have it.
@Jess, @Kyle: I did consider how cool it would be to run some static analysis to find code WTFs, but I'd need a good way to choose repositories at random, because I don't think it's feasible to run through every repo that way.
Maybe we could build some tool to do it distributed among a ton of computers!
@Kyle: I'd like to run it through commits as you mentioned, but I don't know of a good way to get bulk data on that through github. The only way I could think of would be to iterate through all the commits, but I don't have the kind of processing power to do that either. Again, maybe we could do a distributed tool.
Aside from the technological infeasibility of going through each repo one at a time, github also imposes a rate limit, so even if I had that kind of time/power, I'd not be able to do it on my own.
Ideally, github could offer a way to get bulk data that would let us perform that kind of analysis with a few queries, instead of a few million.
@haste and @Ed: thanks for the insight on Lua's WTFs through WOW!
@Kris: Thanks for the link to your repository. I did make it easy enough to switch out the string you're looking for, and while I considered sticking up a website, I decided against it because of the rate limit.
What did you use to make the charts?
Posted by
Sammy Larbi
on Aug 19, 2011 at 08:34 AM UTC - 6 hrs
@fair? - When I say I couldn't find it, I meant I couldn't find it in such a way that wouldn't require me iterating over every repository, not that it's not at all available. I don't recall the specific availability, but I do recall there was no way for me to get it and still do what I wanted without going through a ton of repos one at a time.
Posted by
Sammy Larbi
on Aug 19, 2011 at 08:36 AM UTC - 6 hrs
@Sammy
The guys at GitHub are pretty cool, and Zach Holman who works at GitHub has been running a series of blog posts talking about how GitHub does things here:
http://zachholman.com/posts/how-github-works-async...
Might be worth it to try to contact him and see if GitHub might be willing to add something to the APIs that could facilitate some more interesting metrics. At the very least it seems like a way to search, if not all of GitHub, then at least an individual project's commit messages would be a useful API to expose.
Posted by Kyle
on Aug 19, 2011 at 10:27 AM UTC - 6 hrs
How close is your language towards a computer "program" or just a data presentation script?
This chart seems to sum it up pretty good.
Posted by Nobody
on Aug 22, 2011 at 09:35 AM UTC - 6 hrs
@Kyle - I know it's been a while, but I wanted to say thanks for the advice. I think I'll do that. I hadn't considered trying to contact someone personally, even though they've always appeared approachable. Anyway, thanks again!
Posted by
Sammy Larbi
on Sep 20, 2011 at 06:22 AM UTC - 6 hrs
Great idea -- thanks for sharing, I love it!
Posted by
Sammy Larbi
on Jan 23, 2012 at 01:54 PM UTC - 6 hrs
This is a great measure of how well programmers in various languages comment their code :-).
Posted by
PO8
on Mar 10, 2013 at 07:58 PM UTC - 6 hrs
Instead of scraping the data you could also use Google BigQuery, which also contains all the information you needed
check it out
http://www.githubarchive.org/
Posted by
Manuel
on Sep 19, 2013 at 01:36 AM UTC - 6 hrs
Oh yes, the number of WTFs for objective-C - i can really really feel & confirm it.
The number of PHP-WTFs is also nice...
Posted by
Lelala
on Sep 19, 2013 at 04:45 AM UTC - 6 hrs
Thanks Manuel, I had no idea about it. Looks awesome!
Lelala, I've felt it for Objective-C as well, but I think it's probably just the fact that it is a common abbreviation for WebTemplateFramework used all over the place. =)
Posted by
Sammy Larbi
on Sep 19, 2013 at 05:34 AM UTC - 6 hrs
I would like to see WTFs per line of code, line of comment or even words of code / comment. At the moment it looks like these metrics reflect the size of the repositories and the length of the code / comments.
Posted by Stuart
on Sep 26, 2013 at 04:04 AM UTC - 6 hrs
@Stuart: As would I. At the time I did this there wasn't much of a search available at github, and there was no archive of data to sort through (that I was aware of). Now some more of those metrics might be possible.
This has some serious flaws and wasn't intended to be real research, though I think the idea has some merit in identifying programmer states of mind on average.
I just don't have the computation power to go through the proper channels to normalize against lines of code, for example. But even then, we may need to consider further external variables to control for. For example: boilerplate, which we might suspect to cause WTFs, may show up as a negative correlation due simply to the increased number of lines.
Posted by
Sammy Larbi
on Sep 26, 2013 at 07:56 AM UTC - 6 hrs