The Tiemann-Ghosh mail exchange on the FLOSS survey: is the FLOSS survey data representative?
We've received many mails and comments discussing the FLOSS developer survey and the results broken down by nationality or technical preference (such as distribution/operating system, editor, or desktop).
Michael Tiemann, Red Hat's CTO, and Rishab Aiyer Ghosh, FLOSS lead author, have exchanged the following mails on the topic. We recommend reading them, because they help clarify the goals, motivations, and working methods of the FLOSS survey.
We would like to thank Michael Tiemann for his time, interest and suggestions (and, of course, for giving permission to make this public).
I read the FLOSS report with interest. Several people have remarked that while the interpretation of the data is thoughtful, the data itself does not appear to be quite representative, possibly for two reasons. First, the criteria for selecting the sample may have induced a bias, and second, the overall scope of the sample appears to be biased.
Regarding the first point, if there are 1,000 packages shipped by Red Hat, 2,000 shipped by Mandrake, 3,000 shipped by SuSE, and 2,000+ not shipped in any commercial distribution, and you ask all package maintainers "what is your preferred distribution," you're going to get an answer that biases towards the fringe rather than the mainstream. One solution to this problem is to see how the responses change if you multiply the maintainer's vote by one plus the number of distributions that include their package. Alternatively, you could multiply the maintainer's answers by the estimates of usage provided by your user questions. Thus, maintainers who are more aligned with their communities of use are given a higher weight than those who don't seem to have as much usage.
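Tiemann's first weighting scheme can be sketched in a few lines. This is purely illustrative (the distribution names and counts below are hypothetical, not survey data): each maintainer's vote is multiplied by one plus the number of commercial distributions that ship their package, so maintainers of widely-shipped packages count for more.

```python
# Illustrative sketch of the weighting Tiemann proposes; the data is made up.
from collections import Counter

# (preferred_distro, number_of_commercial_distros_shipping_their_package)
responses = [
    ("Debian", 0),   # package not shipped by any commercial distribution
    ("Debian", 0),
    ("Red Hat", 3),  # package shipped by three commercial distributions
    ("SuSE", 2),
]

# Unweighted tally: one maintainer, one vote.
raw = Counter(pref for pref, _ in responses)

# Weighted tally: vote counts for (1 + distros shipping the package).
weighted = Counter()
for pref, n_distros in responses:
    weighted[pref] += 1 + n_distros

print("raw:", dict(raw))            # Debian leads on raw counts
print("weighted:", dict(weighted))  # Red Hat leads once shipping is weighted in
```

With this toy data the raw tally favours the distribution with the most (unshipped) maintainers, while the weighted tally shifts towards the mainstream, which is exactly the fringe-vs-mainstream effect described above.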
Regarding the second, let me first confess that I'm an American, and thus have a US bias. This survey was done in Europe, and shows a European bias in terms of open source hacking. That may be valid, but it can only be tested by partnering with a US-based research group and seeing if they can duplicate the data, or if their survey shows a US bias. If, as I suspect, the Internet is not free of national bias, then a study should be done from the US, and that data added to yours to compute a more representative picture. As CTO of Red Hat, I can say that I see *lots* of US-based open source development that I don't see reflected in your survey.
Thanks for posting the raw data on your website. I may have a chance to put it into postgres and come up with some additional thoughts or comments. If you'd like to work with a US-based group, I could certainly promote that to folks I know.
From: Rishab Aiyer Ghosh <rishab (AT) dxm.org>
we met at the NSF open source research agenda workshop in washington dc this january.
thanks for your comments on the FLOSS study, of which i was lead author. it's not clear to me whether you read the survey report itself or only the raw data, since you focus on the "preferred distribution" question. this was also the focus of the slashdot discussion due to the rather unfortunate wording of the posting there which implied that the study was designed to bring out developer preferences of platforms and software.
on the contrary, those were "incentive" questions meant to provide developers with a reason to fill out the questionnaire, the results presumed to be of interest to developers but not really considered important from the point of the study, as you can see from the report itself. most of the questions related to time spent on various development activities (e.g. proprietary development and f/os development hours/week), personal life and motivation/business related questions. the full report is at www.infonomics.nl/FLOSS/ (see final report, part IV) and a nice summary is on ZDNET UK at http://news.zdnet.co.uk/story/0,,t269-s2121232,00.html
my response to your comments on the survey is below.
At 04:58 PM 22/08/2002 +0200, you wrote:
this would make sense if we wanted to get an idea of the distribution preferred by _users_ rather than _developers_. otoh we don't really care about the preferred distribution/software/editor/interface issues, and they were only presented as results because they were questions that were answered, but were not the subject of our analysis. i can imagine debian using this to promote themselves, of course, but a survey to find preferred distributions would have to have a quite different methodology!
< Regarding the second, let me first confess that I'm an American, and thus have US bias.
again, the survey wasn't designed to provide an accurate geographical breakdown of developers (indeed i believe this can't be provided by any sampled survey but requires a census-type approach e.g. by analysing author addresses). however, we list the results because we believe they are indicative as the questionnaire was very widely distributed, and the results are similar to most other surveys as we point out in the report.
nevertheless, we limit the analysis of these data to immigration patterns and, for EU countries, correlations with national policy towards f/oss.
the one data point that may indicate geographic bias - or at least a difference from other surveys - was quite surprising: the high proportion of entries from france. unlike other surveys, this one was announced (in translation) in many languages, though the questionnaire itself was only in english. although this didn't greatly increase the proportion of german or spanish developers compared to other surveys, it did increase the french proportion, perhaps because many french developers pay more attention to announcements on french-language sites.
< Thanks for posting the raw data on your website. I may have a chance to put it into postgres.
the data in SPSS format will be publicly available in the near future.
karim lakhani's BCG survey came out just as we were about to launch the questionnaire for ours, so we had the opportunity to study and revise our methodology. we chose specifically to avoid pre-selection as done in the BCG survey, as theirs led to a bias on several attributes (experienced developers leading large active projects on an american developer platform) other than that of nationality. instead, we used a random sample through wide publicity of the questionnaire and various verification methods to ensure it was a sample of genuine developers (e.g. e-mail address fragments entered in the questionnaire were matched to addresses found in source code for over 20% of the respondents, and the verifiable responses were statistically compared with the non-verifiable responses, the results of which will be available shortly as we publish the report's appendices).
although still a self-selecting sample, we could ensure that any bias due to self-selection was largely towards more vocal developers, which doesn't really affect most of the data points.
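The verification step Ghosh mentions - matching e-mail address fragments from the questionnaire against addresses found in published source code - can be sketched roughly as follows. This is an assumption about the general approach, not the study's actual code; the addresses and helper names are invented for illustration.

```python
# Illustrative sketch (not the FLOSS study's code): check whether an e-mail
# fragment given by a respondent appears in addresses harvested from source.
import re

def addresses_in_source(source_text):
    """Extract e-mail addresses from source-code comments/headers."""
    return set(re.findall(r"[\w.+-]+@[\w.-]+\.\w+", source_text))

def verify(fragment, known_addresses):
    """A respondent counts as 'verified' if their fragment appears in
    any address harvested from published source code."""
    return any(fragment in addr for addr in known_addresses)

# Hypothetical source-code header with maintainer contact lines.
source = """
/* Maintainer: Jane Hacker <jane@example.org> */
// Contact: bob.dev@code.example.net
"""

addrs = addresses_in_source(source)
print(verify("jane@example.org", addrs))  # found in the harvested addresses
print(verify("nobody@nowhere", addrs))    # no match: respondent unverified
```

The point of the split into verified and unverified respondents, as the mail explains, is that the two groups can then be compared statistically to estimate how much self-selection distorts the sample.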
for any details regarding it, please use this contact e-mail address.