The Economic Organization and Efficiency of OS/FS Software Production:
An Agenda for Integrated Research
J.-M. Dalle [*], P. A. David [**], and W. E. Steinmueller [***]

[*] Université de Paris-VI – Jussieu
[**] Stanford University & Oxford Internet Institute
[***] SPRU, University of Sussex


This version: 18 October 2002

Conceptual Approach
The initial contributions to the social science literature on open source and free software (OS/FS hereinafter) movements have been directed primarily to identifying the motivations that account for the sustained, and in many instances intensive, engagement of (rational) agents in this non-contractual and unremunerated mode of activity. That focus reflects the view that widespread voluntary participation in the creation of economically valuable goods which will be made freely available for public use is an anomaly (at least from the viewpoint of mainstream microeconomic analysis). Anomalies are intrinsically intriguing, but it is also believed that this phenomenon may turn out to be of considerable economic and social importance, and that it warrants systematic study on the latter grounds alone. A second theme in the early literature has been the search to uncover the basis, in the OS/FS mode of production, of the apparent ability of its outputs to compete in the market against proprietary software not only on lower cost but also on reputedly superior quality. This theme resembles the first in reflecting a state of surprise and puzzlement about the apparently greater efficiency that these non-profit, distributed production organizations have been able to attain in relation to firms engaged in “closed” and centrally directed production of the same type of commodity.

We share the view that the OS/FS movements may carry broader economic and social significance, and so deserve to be the subject of systematic empirical and theoretical study. But, although much about this particular phenomenon remains far from understood, the same might well be said about other aspects of the workings of modern economies, which are no less likely to turn out to be important for human well-being. One therefore might be forgiven for remarking that if the research attention that OS/FS software production is attracting from economists is, primarily, to be rationalized on the grounds of novelty and mysteriousness, the extent of this attention is not well founded. The emergence of OS/FS activities at their present scale is hardly so puzzling or aberrant a development as to constitute a rationale for devoting substantial resources to studying it. The cooperative production of knowledge by members of distributed epistemic communities who did not expect to receive direct remuneration for their efforts is not a new departure; there are numerous historical precursors and precedents for OS/FS, notably in the “invisible colleges” that appeared among the practitioners of the new experimental and mathematical approaches to scientific inquiry in western Europe in the course of the 17th century. The “professionalization” of scientific research, as is well known, was a comparatively late development. Moreover, there is a substantial body of analysis by philosophers of science and epistemologists, as well as work on the economics of knowledge, that points to the superiority of cooperative knowledge sharing as a mode of generating additions to the stock of reliable empirical propositions. It is the scale on which OS/FS activities are being conducted, the rate at which their participants may interact, and the geographical dispersion of those participants, rather than their mere existence, that can properly be deemed historically unprecedented. The modularity and immateriality of software products, and the enabling effects of the advances in computer-mediated telecommunications during the past several decades, however, take us a long way towards accounting for those aspects of the story.

In our view OS/FS warrants systematic investigation in view of a particular historical conjuncture, indeed a portentous constellation of trends, in the modern economy. The first is that information-goods are moving increasingly to the center of the stage as drivers of economic growth. Secondly, the enabling of peer-to-peer organizations for information distribution and utilization is an increasingly obtrusive consequence of the direction in which digital technologies are advancing, and the “open” (and cooperative) mode of organizing the generation of new knowledge has long been recognized to have efficiency properties that are much superior to institutional solutions to the public goods problem which entail the restriction of access to information through secrecy or property rights enforcement. Thirdly, and of practical significance for those who seek to study it systematically, the OS/FS mode of production itself is generating a wealth of quantitative information about this instantiation of “open epistemic communities.” This last development makes OS/FS activities a valuable window through which to study the more generic and fundamental processes that are responsible for its power, as well as the factors that are likely to limit its domain of viability in competition with other modes of organizing economic activities.

Consequently, proceeding from this framing of the phenomenon, a rather different conceptual approach from the one that has hitherto dominated the recent economics literature concerning OS/FS, and a correspondingly distinctive research strategy, is being pursued by the members of the project on The Economic Organization and Viability of Open Source Software at Stanford University and its research partners at academic institutions in France, the Netherlands and Britain [1]. The three associated groups in Europe are led, respectively, by Jean-Michel Dalle (University of Paris-VI), Rishab Ghosh (University of Maastricht-MERIT/Infonomics Institute), and Edward Steinmueller (SPRU-University of Sussex).

Most of the researchers associated with this project come to this particular subject matter from the perspective formed by their previous and on-going work in “the new economics of science,” which has focused attention upon the organization of collaborative inquiry in the “open science” mode, the behavioral norms and reinforcing reward systems that structure the allocation of resources, and the relationships of these self-organizing and relatively autonomous epistemic communities with their patrons and sponsors in the public and private sectors. As economists looking at OS/FS communities, we begin with a central pair of interrelated questions that are both simple and predictably familiar. Firstly, how do OS/FS projects mobilize resources, allocate expertise, and retain the commitment of their members? Secondly, how fully do the products of these essentially self-directed efforts meet the long-term needs of software users in the larger society, and not simply provide satisfactions of various kinds for the developers? In other words, the tasks we set ourselves in regard to OS/FS address the classic economic questions of whether and how it is possible for a decentralized resource allocation process to achieve coherent and socially efficient outcomes. What makes the problem especially interesting in this case is the possibility that the institutions evolved by the OS/FS movements enable them to accomplish that outcome without help either from the “invisible hand” of the market mechanism driven by price signals, or the “visible hands” of centralized managerial hierarchies.
To respond to this challenge requires that the analysis be directed towards providing a basis for evaluating the social optimality properties of the way “open science”, “open source” and kindred cooperative communities organize the production and regulate the quality of the “information-tools” and “information-goods” that will be used not only for their own, internal purposes, but by others with quite different purposes in the society at large.

The parallels with the phenomenon of “open science,” to which allusion has been made here, suggest the adoption of a particular framework for inquiry into the logic of non-market allocation mechanisms that achieve substantial coordination of activity among distributed agents. In particular, the “open science” framework suggests the importance of recognition in motivating scientists’ discovery efforts, the institutions governing claims of ‘priority’ (being first) in discovery, and the practices that have evolved to assist in recognizing and rewarding effort. A similar framework is useful for integrating and assessing the significance of empirical findings from studies of the micro-level incentives and social norms that structure the allocation of developers’ efforts within particular projects, and that govern the “release” and promotion of software code. The systems analysis approach familiar in general equilibrium economics tells us that we should likewise be asking how the performance of those functions is related to the mechanisms that allocate the resources of the larger community among different concurrent projects, that direct the attention of individuals and groups to successive projects, and that support the development of particular capabilities and sub-specialties by members of those communities. Obviously, those capabilities provide “spill-overs” to other areas of endeavor – including the production of software goods and services by commercial suppliers. It follows that to fully understand the dynamics of the OS/FS mode and its interactions with the rest of the information-technology sector one cannot treat the expertise of the software development community as a given and exogenously determined resource.

Implementing the Approach
Four lines of complementary investigation are being undertaken, three of which are directed to expanding the empirical base for analysis of distinct aspects of the micro- and meso-level workings of OS/FS communities. The fourth is integrative and more speculative, as it is organized around the development and exercising of a stochastic simulation structure designed to show the connections between micro- and macro-level performance properties of the larger system of software production. The three areas of empirical study, along with findings from other such inquiries, are expected to provide distributions of observations supporting the development of a properly specified and parameterized simulation model that would provide insights into the processes that may be responsible for generating patterns of the kind that we can observe, as well as allow investigation of counterfactual conditions that policy decisions may create. Thus, although these lines of inquiry can advance in parallel, their interactions are iterative: the empirical studies will guide the specification of the simulation structure that is to be used to investigate their broader, systemic implications.

The initial thrusts of these four complementary research “salients” are briefly described, taking them in turn:

• Distribution of developer efforts within software projects:
This micro-level resource allocation process can be studied quantitatively by tracking the authorship distributions found in specific projects over time. A start is being made by examining an atypical yet very important and emblematic OS/FS product: the Linux Kernel, the successive releases of which constitute a very large database containing over 3 billion lines of code. The data production work – referred to by the acronym LICKS (Linux: Chronology of the Kernel Sources) – has been organized as a sub-contracted project carried out under the direction of Rishab Ghosh at MERIT/Infonomics. It will significantly advance the state of basic data: first, by identifying the principal components within the Linux kernel and the packages relating to each of those components, and secondly, by extracting the code for successive versions and linking the dated code (contributed by each identified author, along with the components to which it relates), so that time-series analyses become possible. The resulting dataset should be useable both for subsequent studies of the dynamics of the participation of the population of all contributors to the Linux kernel, and their clusters, as well as the chronology of the development of the major components of the code for the kernel. It will, in addition, identify the degree of dependence between packages of code for each of the major component parts of the software.
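
The extraction step can be illustrated in miniature. The actual LICKS tooling is not described here, so the sketch below simply scans comment headers for credit lines of the kind commonly found in kernel sources; the regular expression and the sample header are illustrative assumptions, not the project's algorithm.

```python
import re
from collections import Counter

# Illustrative credit-line patterns; real kernel files vary widely.
CREDIT_RE = re.compile(
    r"(?:Copyright\s+\(C\)\s+[\d,\s-]*|Author:\s*)"
    r"(?P<name>[A-Z][\w.'-]+(?:\s+[A-Z][\w.'-]+)+)"
)

def extract_authors(source: str) -> Counter:
    """Count credited names appearing in the comment headers of one source file."""
    return Counter(m.group("name").strip() for m in CREDIT_RE.finditer(source))

# A made-up header in the style of early kernel sources.
sample = """
/*
 *  linux/kernel/sched.c
 *  Copyright (C) 1991, 1992  Linus Torvalds
 *  Author: Ingo Molnar
 */
"""
print(extract_authors(sample))
```

Aggregating such per-file counters over dated releases is what makes the time-series analyses described above possible; the hard methodological work lies in the variability of self-identification conventions, as noted below.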

We anticipate soon being able to answer questions of the following sort, which give the flavor of the micro-level allocation issues that this class of data will enable us to address: Is the marked skew in the distribution of contributions to the Linux Kernel as a whole also a feature of the distributions found for its components, i.e., is the pattern of concentration of self-identified “authorship” fractal? Is the concentration significantly greater for some components than for others? Are these distributions stationary throughout the life of the project, or does concentration grow (or diminish) over time – growing concentration having been found to be the case for the distribution of scientific authorship in specific fields over the lives of cohorts of researchers publishing in those fields. In addition, we expect to be able to identify clusters of authors who work together within, and across, different components of the Kernel project; to learn whether these grow by accretion of individuals, or coalesce through mergers; and whether, if they do not grow, they contract or remain constant. Further, by correlating the clusters of authors with the dependency data, it may be possible to obtain characterizations of the nature of “knowledge flows” between identified groups. [2]
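
One simple way to operationalize the concentration questions posed above is a Gini coefficient computed over per-author contribution counts for each component. The sketch below, with made-up figures, is a minimal illustration of the measure rather than the project's chosen statistic.

```python
def gini(contributions):
    """Gini coefficient of per-author contribution counts (e.g. lines of code).
    0 = perfectly equal authorship; values approaching 1 = highly concentrated."""
    xs = sorted(contributions)
    n = len(xs)
    total = sum(xs)
    cum = sum(i * x for i, x in enumerate(xs, start=1))
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

# Hypothetical component where one author wrote most of the code:
print(gini([5000, 300, 200, 100, 50]))   # markedly unequal
print(gini([100, 100, 100, 100, 100]))   # equal shares
```

Tracking this statistic per component and per release would directly answer whether concentration is stationary, growing, or diminishing over the project's life.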

It will be an important methodological issue for subsequent work to ascertain whether or not there are significant biases in the ability of the extraction algorithm to identify the distribution of authorship within this, the project for which it was designed. Inasmuch as one cannot treat the Linux Kernel as a “representative” OS/FS project, other projects, which may differ in their conventions with regard to self-identification of contributions in the code itself, eventually will need to be studied.

• Allocation of developer communities’ efforts among projects:
The SourceForge site [3] contains data on a vast number of ongoing projects, including both succeeding and failing ones [4], which are being analysed by Edward Steinmueller and Juan Mateos Garcia at SPRU – Science and Technology Policy Research at the University of Sussex. Taking the project as the unit of observation, these data provide an evidentiary basis for establishing statistically the set of project characteristics that are particularly influential in determining whether or not a project achieves one or more of the criteria that may be taken to define “success.” Measures of success include the timing of delivery of versions of the software at various stages of completion, continued participation of actors, and citation of the project’s outputs in various OS/FS forums. Considering the hypothesis that software projects are likely to attain higher levels of several of these measures of success when they are able to attract a “critical mass” of adherents and develop a self-reinforcing momentum of growth, the empirical challenge is to identify the combinations of early characteristics that statistically predict attainment of “critical mass.” A supply-driven approach to the question would interpret the “community alignment” problem as one of recruiting individuals who share a belief in the efficacy of the OS/FS mode of developing software, their diverse specific interests and motives for joining the project notwithstanding, and who collectively possess the mix of differentiated skills needed for the work of rapidly designing, programming, debugging and upgrading the early releases of the code.

Both large- and small-scale analyses seem feasible as ways of pinpointing the characteristics that enable (or fail to enable) the creation of ‘burgeoning’ communities that propel the growth of open source projects towards critical mass and engagement with a self-catalyzing process by which user-producers (those who both use the software and contribute to its further development) sustain a high level of attention and incremental innovation in further developing projects. SourceForge itself provides sufficient information about the initial features of projects to permit analysing the influence of factors such as technical nature, intended users/audiences, internal project organization and release policies, and legal aspects (e.g., projected form of licensing).
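
The estimation idea behind "combinations of early characteristics that statistically predict" success can be sketched with a discrete-choice model. The code below fits a plain logistic regression by gradient descent to synthetic project records; the features (developers at launch, months to first release, OSI-style licence flag) and the data-generating rule are invented for illustration, not drawn from SourceForge.

```python
import math
import random

random.seed(0)

def synth_project():
    """Generate one synthetic project record: features and a success label.
    The scoring rule below exists only to produce plausible labels."""
    devs = random.randint(1, 10)       # developers at launch (invented feature)
    months = random.randint(1, 12)     # months to first release (invented feature)
    osi = random.randint(0, 1)         # 1 if OSI-approved licence (invented feature)
    score = 0.6 * devs - 0.4 * months + 1.0 * osi
    success = 1 if score + random.gauss(0, 1) > 0 else 0
    return [1.0, devs, months, osi], success   # leading 1.0 is the intercept term

data = [synth_project() for _ in range(300)]

# Logistic regression via batch gradient descent (no external libraries).
w = [0.0] * 4
for _ in range(1500):
    grad = [0.0] * 4
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))            # predicted success probability
        for j in range(4):
            grad[j] += (p - y) * x[j]
    for j in range(4):
        w[j] -= 0.02 * grad[j] / len(data)

print("fitted weights (intercept, devs, months, licence):",
      [round(v, 2) for v in w])
```

On real SourceForge records the same structure would yield interpretable coefficients for each early project characteristic; here the fitted signs simply recover the synthetic rule (more launch developers help, slower first releases hurt).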

Timing and path-dependencies may be hypothesized to affect the success or failure of OS/FS projects, and it may be important to recognize that success and failure are not determined in isolation from the characteristics of other projects that may be competing for developers’ adherence. A population ecology perspective (emphasising the processes of birth, growth in the context of rivalry, and demise) is potentially pertinent in this connection, and interactions between the characteristics of the project and the features of “the niche” into which it is launched should be empirically investigated. Given that "developer mind-share" is limited, we may suppose that older projects are entrenched through technological lock-in processes that make it more difficult to engage a critical mass of developers in similar competing projects. Developers will tend to increase their co-operative activities in older projects as they gain in experience and knowledge about them (these individuals are moving up project-specific learning curves, as well as acquiring generic and more transferable skills). Their attention and willingness to co-operate in other or new projects is therefore likely to decrease [5]. A key question about the governance of this process is whether the dominance of incumbent older projects serves to suppress innovative variety by creating a ‘standardised’ or ‘dominant design’ model of the attributes and features of the software. This kind of externality effect, through which accidents of timing and initial momentum may serve to “lock in” some projects, while locking out others that are technically or socially more promising when considered on their intrinsic merits, has been identified in studies of the (competitive) commercial development and diffusion of other technological artefacts.
It would therefore be of considerable interest to establish whether or not dynamics of this kind can be observed in the non-market allocation of developmental resources among software systems products. The fact that SourceForge makes it possible to filter projects according to the tools (such as programming languages and techniques) used in their development, and that the differences between these tools may be an important factor in lock-in, makes the analysis of such processes easier. The possibility of tracking the history of individuals' co-operative records may also make it feasible to study their patterns of involvement in, and entry to and exit from, different projects. Quantitative and qualitative methods will be used to identify the presence or absence of path dependency in the attainment of successive “states” of project growth.
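
The lock-in conjecture can be made concrete with a toy positive-feedback simulation in the spirit of the path-dependence literature alluded to above: two initially identical projects compete for arriving developers, and chance early leads snowball. The feedback exponent and step count are illustrative assumptions, not a calibrated model of SourceForge.

```python
import random

random.seed(1)

def mindshare_race(steps=2000, feedback=2.0):
    """Two rival projects compete for each arriving developer; the odds of
    joining a project grow superlinearly (exponent > 1) with its current
    community size, a stylised technological lock-in mechanism."""
    a, b = 1, 1
    for _ in range(steps):
        wa, wb = a ** feedback, b ** feedback
        if random.random() < wa / (wa + wb):
            a += 1
        else:
            b += 1
    return a / (a + b)   # final mind-share of project A

# Identical twins at launch: small early accidents decide which one wins.
shares = [mindshare_race() for _ in range(20)]
print([round(s, 2) for s in shares])
```

With superlinear feedback the runs end near 0 or 1 rather than near an even split: one project captures virtually all later arrivals regardless of intrinsic merit, which is precisely the "locking out" externality the text describes.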

• Sponsorship support and individual developer relations with commercial sponsors:
This module of the research program is concerned with understanding the formation of the variety of complementary software products and services that commercial ventures have developed around the software system code created by the larger “open source” projects. These activities are a source of direct support for some OS/FS projects, and a beacon that may affect the launching of new programs, as well as inducing individuals to enter the community and participate in other projects that have no collateral commercial orientation in view. The degree to which such related profit-oriented activities develop symbiotic connections with an open source project, rather than being essentially parasitic, can be investigated by seeking evidence of individuals’ overlapping participation in OS/FS projects and in commercial ventures based upon either proprietary or OS/FS code, and by examining the formal commitments that such ventures undertake with existing projects.

A two-pronged approach to studying the issues this raises is being pursued at Stanford (SIEPR) by Dr. Seema Arora (project research associate) and Andrew Waterman (graduate research assistant). A web-based survey protocol has been developed to elicit more detailed information from developers about their contacts with, and possible involvements in, complementary/collateral commercial enterprises. This survey replicates a number of questions that were asked in the FLOSS survey, in order to establish the relationship between the two respondent populations, as well as to increase the sample density on particular questions [6]. By asking respondents to identify the first OS/FS project on which they worked, and the two projects they deemed to be their most significant/important project participations (for identified reasons), the survey design seeks to link responses with the project characteristics information available from the SourceForge archives. A second line of inquiry also makes contact with the work on determinants of project “success” previously described: data available at SourceForge will be used to investigate whether there are significant statistical relationships between the specifics of the licensing arrangements adopted at the time that projects are launched, and the subsequent development of derivative commercial enterprises around those projects that eventually do release code.

• Using Agent Based Simulation Modelling as an Integrative Device:
The fourth strand in our project -- the development of an agent-based simulation model of the OS/FS production system -- was a substantive focal point of Dalle and David’s joint presentation at the Brussels (14 October 2002) Workshop on “Advancing the Research Agenda”. This model-building activity aims eventually to provide more specific insights into the workings of OS/FS communities, and also into their interaction with organizations engaged in proprietary and “closed mode” software production. It seeks to articulate the interdependences among distinct sub-components of the resource allocation system, and to absorb and integrate empirical findings about the micro-level mobilization and allocation of individual developer efforts both among projects and within projects. Stochastic process representations of these interactions offer a powerful tool for identifying critical structural relationships and parameters that affect the emergent properties of the macro system. Among the latter properties, the global performance of the OS/FS mode in matching the functional distribution and characteristics of the software systems produced to the evolving needs of users in the economy at large is obviously an issue of importance for our analysis to tackle.

In our original attempt to model open-source software development, developers/agents essentially choose how to dedicate their efforts – typically contributing new functionalities, correcting bugs, etc. – among alternative projects, each project corresponding to a different software module. The available alternatives at each moment include the launching of new projects. Agents’ actions are probabilistic and conditioned on comparisons of the expected non-pecuniary or other rewards associated with each project, given specifications of the distribution of their potential effort endowments [7]. The shape of the distribution of endowments, strictly speaking, cannot be inferred immediately from the (skewed) empirical distribution of the identified contributions measured in lines of code, but one can surmise that the former distribution is similarly skewed – on the basis of the relative sizes of the “high-activity” and “low-activity” segments of the developer population found by the FLOSS survey. To characterize the structure of the relative rewards associated with participation in various roles and in projects of different types, we begin with a rather coarse implementation of a set of “ethnographic” observations describing the norms of OS/FS hacker/developer communities – following Eric S. Raymond’s insights in the well-known essay “Homesteading the Noosphere.” [8] The core of this is a variant of the collegiate reputational reward system: the more significance attached to the project, the agent’s role, and the extent or criticality of the contribution, the greater the anticipated “reward”.
Caricaturing Raymond’s more nuanced discussion, we stipulate that launching a new project is, as a rule, more rewarding than contributing to an existing one, especially when several contributions have already been made; early releases typically are more rewarding than later versions of project code; and some “project-modules” are systematically accorded more “importance” than others, ordered in a way that reflects meso-level technical dependences. One way to express this last rule is to say that there is a hierarchy within a family of projects, such that contributing to the Linux Kernel is deemed a (potentially) more rewarding activity than providing a Linux implementation of an existing and widely used applications program, and the latter dominates writing an obscure driver for a newly-marketed printer. In other words, we postulate that there is a lexicographic ordering of rewards based upon a discrete, technically-based “ladder” of project-types. Lastly, for present purposes, new projects are created in relation to existing ones: we consider that it is always possible to add a new module in relation to an existing one, to which it adds a new functionality, and we assume that this new module will be located one level down the ladder.

As a consequence, all the modules, taken together, are organized as a tree which grows as new contributions are added, and which can grow in various ways depending on which part of it (upstream or downstream modules, notably) developers select. We further conjecture that this tree will be to some extent correlated both with the directory tree and with the technical interdependencies between the modules, although the correlation will probably be especially imperfect in the first case.
A typical example of a single tree is shown in Figure 1, where the numbers associated with each module denote versions, and versioning is taken further as a proxy for performance:

Figure 1: An OS/FS Agent-Generated Software System

But there are many trees that can draw “nourishment” from the resources of the developer community, take root and grow. Our model represents these simply as being arranged in two dimensions: one is a ring-configuration in which projects that are differentiated in ranges of functionality as perceived by final users are positioned farther apart; the other is an orthogonal projection from the point(s) on the ring. The latter allows for the emergence of substitute structures that arise from code-forking in the “root” projects, or in projects situated at any of the “derivative” (downstream) levels. Among the questions of interest in evaluating the overall performance of the OS/FS mode are those concerning the social welfare implications of code-forking at the various levels that the model distinguishes, and the mechanisms present in the organization of projects that control the frequency and circumstances of “code-forking.”
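
A minimal rendering of these model ingredients can make the mechanics concrete. The sketch below grows a single module tree: agents weigh launching a new downstream module against contributing a further version to an existing one, with a level-based reward ladder standing in for the lexicographic ordering, and a toy welfare score that values upstream versions and downstream diversity. All constants are illustrative assumptions, not the calibrated values the project would estimate, and the ring dimension is omitted.

```python
import random

random.seed(2)

class Module:
    """One node per module; level 0 is the most upstream ('kernel-like')
    layer, and children sit one level down the reward ladder."""
    def __init__(self, level):
        self.level = level
        self.versions = 1
        self.children = []

def flatten(node):
    out = [node]
    for c in node.children:
        out.extend(flatten(c))
    return out

def reward(module, launching):
    """Stylised Raymond-style rewards: upstream beats downstream (a discrete
    ladder), and launching a module beats the n-th contribution to an
    existing one.  The constants are illustrative only."""
    base = 10.0 ** (-module.level)          # steep ladder approximating lexicographic order
    if launching:
        return 2.0 * base                   # new modules pay more ...
    return base / module.versions           # ... later versions pay progressively less

def grow(root, agents=300):
    """Each agent picks one action, with probability proportional to reward."""
    for _ in range(agents):
        nodes = flatten(root)
        options = [(n, False) for n in nodes] + [(n, True) for n in nodes]
        weights = [reward(n, launching) for n, launching in options]
        node, launching = random.choices(options, weights=weights)[0]
        if launching:
            node.children.append(Module(node.level + 1))
        else:
            node.versions += 1

def social_utility(root):
    """Toy welfare score: upstream versions weigh more heavily, and each
    distinct downstream module adds a fixed diversity value."""
    return sum(2.0 ** (-n.level) * n.versions + 0.5 for n in flatten(root))

root = Module(0)
grow(root)
print("modules:", len(flatten(root)), "utility:", round(social_utility(root), 1))
```

Varying the reward constants (for instance, the premium on launching, or the steepness of the ladder) and re-scoring the resulting tree shapes is the kind of exercise that underlies the preliminary efficiency results reported below.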

Although our goal is eventually to account for the effects of the relationships that may develop with commercial sponsors of projects, and the direct and indirect market interactions between profit-seeking software firms and OS/FS development projects, our model-building effort thus far has been occupied with representations of the workings of OS/FS communities in isolation. Thus we focus mostly on social utility measurements according to the basic ideas that (1) upstream modules are more valuable than downstream ones simply because of the range of applications that eventually can be built upon them, and (2) a greater diversity of functionalities (breadth of the tree at the downstream layers) is more immediately valuable because it provides software solutions to fit a wider array of user needs.

In this regard, preliminary results tend to stress the social efficiency of developer community “norms” that accord significantly greater reputational rewards for adding upstream modules and for contributing to their releases. Further, these preliminary explorations of the model suggest that policies of releasing code “early” tend to generate tree-shapes that have higher social efficiency scores. The intuitively plausible interpretation of this last finding is that early releases are especially important (adding larger increments of social utility) in the case of upstream modules, because they create bases for further applications development, and the reputational reward structure posited in the model encourages this “roundabout” (generic infrastructure) process of development by inducing individuals to share in the recognition for contributing to upstream code. This is based upon a static ex post evaluation of the resulting tree form, and it is evident that the results may be altered by considering the dynamics and applying social time discount rates to applications that only become available to end users at considerably later dates. In other words, the social efficiency of the reward structure that allocates developers’ efforts will depend upon the temporal distribution of the benefits, as well as the relative extent to which OS/FS-generated code meets the needs of final users rather than the needs/goals of the agents who choose to work on these projects.

[1] This project is supported by an NSF Grant (IIS-0112962) to the Stanford Institute for Economic Policy Research’s “Knowledge Networks and Institutions for Innovation” Program, led by Paul David.

[2] A different use of this methodology would be to analyse the dependency of all the contributions to the Linux Kernel – the signed and the unsigned alike – and to investigate whether there are significant differences in the likelihood that contributions that have high dependency measures will be signed.

[3] The SourceForge.net website contains data on over 33,000 Open Source Projects, including their technical nature, intended audiences and stage of development. Records of their past history and the engagement of the OS/FS community in their improvement and evolution are also, in principle, available.

[4] The success/failure of a project can be characterised by its rate and speed of development, and by the engagement of the community of developers with its improvement/growth.

[5] Contraposed to this tendency could be that of individuals abandoning old projects as these reach their end, or abandoning projects with which they have become bored. If individuals derive utility from the excitement associated with "new hacks", the kind of attachment to projects previously described will not necessarily take place. If the exit/entry rates of developers in the OS community are fast enough, this problem may also be attenuated. Finally, individuals seeking to increase their status inside the community may have incentives to abandon their roles as collaborators on existing projects in order to start new ones (this possibility will be developed later).

[6] The instrument presently is undergoing the final stage of field-testing at a website to which volunteers willing to identify themselves are directed: http://www.stanford.edu/group/floss-us/survey.fft.

[7] In the simplest formulations of the model, agents’ endowments (measured in thousands of lines of code per period, KLOCs) are treated as “fixed effects” and are obtained as random draws from a stationary distribution. More complex schemes envisage endogenously determined and serially correlated coding capacities, with allowance for experience-based learning effects at the agent level.

[8] See The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, Sebastopol, CA: O’Reilly, 2001: 65-112.