Development Blog

Apr 15, 2016

Guest Post: Design Principles for Linking

This post is guest written by Brandon Kuczenski, a researcher at UCSB, and chair of a SETAC NA working group of Inventory Model Description and Revision. Brandon has engaged the Ocelot team in the comments section of the development blog, and we are grateful that he has accepted our offer to explain his thoughts in a better forum.

Inventory Linking Software

Chris asked me to write a guest post (and possibly a series of posts) since I am very opinionated about LCA software, and I happily agreed. As it happens, I've been working for some years in my own windowless room on an LCA computation suite that takes a radically different approach from what is out there now. For a number of reasons (mainly the choice of language and development environment), the products of that project are not easily extensible, and for reasons having to do with my own technical limitations, they would not scale up well to a project the size of Ecoinvent. So I am working on a re-implementation (in Python). But in part because of my radically different approach, I have a different notion of what a "linker" is good for and how it would work. In this post I will explain my motivations and outline a few design principles that stem from them. It may turn out that the software I have in mind fills a different niche from what the Ocelot team has in mind for Ocelot.

Where did this number come from?

First off, I am more interested in LCA studies than LCI databases. I am primarily concerned with the review and validation of LCA results. When I see a publication that reports the global warming potential (GWP) of a product, I want to know how the authors arrived at the reported value. I don't care so much about being able to perform uncertainty analysis or Monte Carlo simulation -- I want to understand the model. A few days ago, in a fairly defensive comment thread on the Thinkstep LinkedIn community, Christoph Koffler commented that "the Critical Review of LCA studies can include a model review, but it doesn't have to" (link restricted to group members). To me, that statement is patently absurd. If you are not reviewing the model, you are essentially reviewing a marketing document. Maybe what he meant, as a GaBi person, was that you don't have to see the GaBi model itself. With that I agree -- but you need to look at some representation of the model.

The problem is that LCA practitioners don't really have a precise way of describing models. We write reports, and these reports can run to hundreds of pages, but often the "model" is just shown as a box diagram -- literally, a picture -- and every author's picture is drawn differently and includes and excludes different things. The diagram is then explained with written text that may or may not be intelligible to a reader.

Pondering this led me to two realizations:

  1. A model's structure can be described without including any data. The model itself is a set of boxes and arrows, also known as a directed graph. We know how to describe directed graphs precisely.
  2. The model's structure makes reference to a lot of pre-existing data. Almost every model uses background data that comes from an externally-maintained database, be it Ecoinvent, GaBi, or whatever. The set of background data sets used is finite and typically very small. In describing a model, it is sufficient to list those datasets in an unambiguous way. A reviewer can then look up the data sets in your list, retrieve them, and check them and even replicate the background of your model if desired.
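The two realizations above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the names `DatasetRef`, `ModelGraph`, `link`, `bind`, and `background_list`, and the database and dataset identifiers, are all hypothetical and do not come from any existing tool. The point is that the structure (nodes, edges, and a list of background references) carries no inventory data at all.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRef:
    """Unambiguous pointer to a background dataset (hypothetical fields)."""
    database: str
    version: str
    dataset_id: str


@dataclass
class ModelGraph:
    """Model structure only: boxes, arrows, and background references."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)        # (source, target, flow name)
    background: dict = field(default_factory=dict)   # node -> DatasetRef

    def link(self, source, target, flow):
        """Add a directed edge: `source` supplies `flow` to `target`."""
        self.nodes.update((source, target))
        self.edges.append((source, target, flow))

    def bind(self, node, ref):
        """Declare that `node` is taken from a background dataset."""
        self.nodes.add(node)
        self.background[node] = ref

    def background_list(self):
        """The finite list of datasets a reviewer would need to retrieve."""
        return sorted(
            set(self.background.values()),
            key=lambda r: (r.database, r.version, r.dataset_id),
        )


# Illustrative foreground model; all identifiers are made up.
model = ModelGraph()
model.link("grid electricity", "bottle molding", "electricity")
model.link("PET resin", "bottle molding", "PET granulate")
model.bind("grid electricity", DatasetRef("example-db", "1.0", "elec-001"))
model.bind("PET resin", DatasetRef("example-db", "1.0", "pet-042"))
refs = model.background_list()
```

A reviewer handed `model.edges` and `refs` could redraw the box diagram and look up each background dataset, without receiving any of the underlying numbers.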

At this point, it is clear that there are two different concepts of "linking" involved. One, the one that Ocelot is ostensibly concerned with, is used in the preparation of background databases, like ecoinvent. The other, which my research is concerned with, is used in connecting background datasets with foreground models. The first kind is done by the database maintainer. The second kind is done by the study author. I don't believe they should be confused (and I confess I may have confused them in my critical comments on last week's post).

Design Principles for Linking

Taking now the perspective of "linking" as two distinct tasks, or more accurately as an iterative task over successively larger scopes, I can offer some principles that I feel should inform the design of a linking tool. Some of them are deliberately contrary to Ecoinvent's (and also Thinkstep's) way of doing things, since I like to be provocative. They are all grounded in the needs of the study author, and thus become mandates to the data designer.

  1. Flows are Things. A process is nothing without its flows. But flows themselves exist without processes! Flows are observable, measurable, and extensive, which processes are not. Balancing flows is the basis for LCA accounting. The linker data model should recognize flows as first-class entities. In fact, a flows-first data model may be more intuitive for the user of LCA software than the current, dominant process-first view.
  2. Multiple Databases. A study author is not going to want to be constrained to using one database, and the reason is that no single database can cover the Earth: there is too much knowledge required. It is hubris to think otherwise. Proof of this is in the fact that the 2014 GaBi professional database has 2,600-odd processes, and over 600 of them have "Steam (MJ)" as the reference output (another 230 output "Thermal Energy"). Granted, there are a lot of ways to make steam -- a lot of technologies, a lot of places -- but this is exactly why no single database provider should pretend it can cover everything. The linker should take this into account and should be prepared to use data from multiple providers.
  3. The accounts don't have to balance. The idea that they must goes all the way back to Heijungs, who in his epic thesis "Economic Drama and the Environmental Stage" formulated LCA as analogous to a system of national accounts, thereby fixing allocation as the fundamental problem in LCI database construction. But balance is a myth: it was a simplification for Leontief and it's flat-out false for LCA, because unused products routinely enter export markets, go to stockpiles, or find alternative uses, and those are typically outside the system boundary. Steam gets vented. Wouldn't it be more edifying if, rather than pretending a refinery's outputs are uniform and allocatable, one output of an LCA computation were the quantity of residual oil that has to be sloughed off to other markets in the process of making your product? Then it could be left to the user to decide how to allocate the burdens. Such a system is easily achievable.
  4. [Partially] Aggregate By Reference. The design of certain tools (OpenLCA in particular, but possibly others) requires a user to load the entire ecoinvent database, in unit processes, in order to perform computations. But most of the time this isn't necessary. First, the core database has presumably been reviewed by experts, and since processes are all interdependent, the internal ones should not be changed by the user AT ALL. Second, the amount of redundant computation is tremendous. Ecoinvent has already done all the aggregations for all the processes in all its system models. Those numbers shouldn't change, so why re-compute them? Finally, this approach precludes using multiple system models at once, or at least doing so causes OpenLCA to place a tremendous demand on my poor laptop.
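Principles 1 through 3 can be sketched together in a few lines of Python. This is a toy under loud assumptions, not a proposed implementation: the `Flow` and `Exchange` classes, the `residuals` function, the provider names, and every number are invented for illustration. Flows are standalone entities, processes from two different (hypothetical) providers refer to the same flow, and whatever does not balance is reported rather than allocated away.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A flow exists on its own, independent of any process."""
    name: str
    unit: str


@dataclass(frozen=True)
class Exchange:
    """One process touching one flow, per unit of process activity."""
    provider: str   # which database the process comes from (hypothetical)
    process: str
    flow: Flow
    amount: float   # positive = output, negative = input


def residuals(exchanges, activity):
    """Net quantity of each flow left over after scaling by activity levels.

    Nothing is forced to balance: nonzero entries are reported as-is.
    """
    balance = defaultdict(float)
    for ex in exchanges:
        level = activity.get((ex.provider, ex.process), 0.0)
        balance[ex.flow] += ex.amount * level
    return {flow: amt for flow, amt in balance.items() if abs(amt) > 1e-12}


diesel = Flow("diesel", "kg")
residual_oil = Flow("residual oil", "kg")
freight = Flow("freight", "t*km")

# Made-up coefficients; the two processes come from different providers.
exchanges = [
    Exchange("provider-A", "refinery", diesel, 1.0),
    Exchange("provider-A", "refinery", residual_oil, 0.5),
    Exchange("provider-B", "truck transport", diesel, -0.25),
    Exchange("provider-B", "truck transport", freight, 1.0),
]
activity = {("provider-A", "refinery"): 1.0,
            ("provider-B", "truck transport"): 4.0}

leftover = residuals(exchanges, activity)
```

With these invented numbers the diesel nets to zero, the freight appears as the delivered product, and 0.5 kg of residual oil shows up as an explicit output that the user can then decide how to treat.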

All of the above means there is essentially no reason to force users to load the whole database in bulk. Instead, users should build models by reference to upstream and downstream components that are computed at the database level. If desired, they can be free to adapt unit processes into their own foregrounds, in the spirit of Bourgault et al. (2012), and these they can modify freely. The unit processes already include their own references to fixed database content. Moreover, the reference data sets don't have to be fully aggregated -- they could be partially aggregated in accordance with principles #2 and #3. Obviously the nature of these components is yet to be determined.
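As a rough illustration of building "by reference", the Python sketch below scores a foreground process by looking up aggregated results that the database provider is assumed to have published already, instead of loading and re-solving the whole database. Everything here is hypothetical: the `AGGREGATED` table, the `foreground_score` function, the key layout (database, system model, dataset id), and the placeholder values.

```python
# Aggregated scores assumed to be precomputed at the database level.
# Keys: (database, system model, dataset id). Values are placeholders.
AGGREGATED = {
    ("example-db", "cutoff", "elec-001"): {"GWP": 0.5},
    ("example-db", "cutoff", "steam-007"): {"GWP": 0.25},
    ("example-db", "consequential", "elec-001"): {"GWP": 0.75},
}


def foreground_score(inputs, indicator, lookup=AGGREGATED):
    """Sum amount * precomputed score over the referenced background datasets.

    `inputs` is a list of (reference key, amount) pairs; no background
    unit processes are loaded or recomputed.
    """
    total = 0.0
    for ref, amount in inputs:
        total += amount * lookup[ref][indicator]
    return total


# A foreground process drawing on two background datasets by reference.
inputs_cutoff = [
    (("example-db", "cutoff", "elec-001"), 2.0),
    (("example-db", "cutoff", "steam-007"), 4.0),
]
score_cutoff = foreground_score(inputs_cutoff, "GWP")

# Because the system model is part of the key, two system models can be
# referenced side by side without loading either one in full.
inputs_mixed = [
    (("example-db", "consequential", "elec-001"), 2.0),
    (("example-db", "cutoff", "steam-007"), 4.0),
]
score_mixed = foreground_score(inputs_mixed, "GWP")
```

Partially aggregated components, as suggested by principles #2 and #3, would need a richer value than a single score per indicator (for example, a score plus a list of unresolved flows); that design is left open here, as it is in the text.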

What kind of tool could possibly satisfy these principles? Well, one thing is for sure: the tool could not be run until after an LCA study model has been constructed. The database linker would necessarily finish with incomplete pieces, and these pieces would later get connected when the linking tool is run by the data end-user. The system implied by these principles is necessarily distributed across multiple data providers. Thus it's wholly speculative, and also somewhat incompatible with currently available data resources. But this post is all about radical re-imagining.