Today we released version 0.1 of the Ocelot software (two days early!). 0.1 provides the foundation for Ocelot, and doesn't yet have any system model-specific code. You can see the features that were resolved in the 0.1 milestone here.
The foundation in code
The building blocks for Ocelot are explained in detail in the documentation. The basic workflow is:
- Extract data from ecospold2 files to the internal dataset format.
- Read user input to define a system model as a list of Python functions.
- Start a logger, and create an output directory.
- Apply the system model transformation functions to the extracted data, saving the database at each step, and logging changes.
- Save the final database.
- Parse the logfile and generate an HTML report.
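The workflow above can be sketched in a few lines of plain Python. This is only an illustration, not Ocelot's actual API: the function names (`run_system_model`, `drop_zero_exchanges`, `sort_by_name`), the dictionary layout of a dataset, and the file names are all assumptions made for the example, and the HTML report step is left out.

```python
import json
import logging
import tempfile
from pathlib import Path


def drop_zero_exchanges(data):
    """Hypothetical transformation: remove exchanges whose amount is zero."""
    return [
        {**ds, "exchanges": [e for e in ds["exchanges"] if e["amount"] != 0]}
        for ds in data
    ]


def sort_by_name(data):
    """Hypothetical transformation: order datasets alphabetically by name."""
    return sorted(data, key=lambda ds: ds["name"])


def run_system_model(data, functions, output_dir):
    """Apply each function in turn, saving and logging after every step."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Start a logger that writes to a file in the output directory.
    logger = logging.getLogger("system_model_sketch")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(output_dir / "report.log")
    logger.addHandler(handler)

    try:
        for index, function in enumerate(functions):
            data = function(data)
            # Save the intermediate database so each step can be inspected.
            step_file = output_dir / f"step-{index}-{function.__name__}.json"
            step_file.write_text(json.dumps(data, indent=2))
            logger.info("Applied %s; %d datasets", function.__name__, len(data))
        # Save the final database.
        (output_dir / "final.json").write_text(json.dumps(data, indent=2))
    finally:
        logger.removeHandler(handler)
        handler.close()
    return data


# Stand-in for data extracted from ecospold2 files (made-up values).
extracted = [
    {"name": "steel production",
     "exchanges": [{"name": "steel", "amount": 1.0},
                   {"name": "slag", "amount": 0}]},
    {"name": "clinker production",
     "exchanges": [{"name": "clinker", "amount": 1.0}]},
]

# The system model is just an ordered list of Python functions.
system_model = [drop_zero_exchanges, sort_by_name]
result = run_system_model(extracted, system_model, tempfile.mkdtemp())
```

Treating the system model as an ordered list of functions means a new system model is simply a different list, and every intermediate state is on disk for inspection.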
All these building blocks are included in 0.1 - system_model, generic_extractor, Logger, HTMLReport, etc. You can even see an example Ocelot run report. Bear in mind that we mostly have example transformation functions so far, so the example report is not so impressive. It is also not beautiful - if you have JS/CSS skills and want to contribute, please let us know!
Testing, and the documentation of test cases and expectations, are a core component of Ocelot. Our tests currently cover foundational code, such as the extraction of ecospold2 data and running transformation functions. We test locally during development, but also have continuous integration services testing on Windows and Linux every time a commit is published to Github. We also generate automated test coverage data.
Our development for the next few months is defined in the milestones. Each milestone is defined by a number of tasks (issues), and the milestone is reached when all relevant issues are closed. In this way, we are hoping to make the development, priorities, and future plans for Ocelot as transparent as possible. For all development going forward, we will try to follow the Github suggested workflow, with named branches and pull requests. In this way, each addition will come self-contained with documentation and tests, as well as a discussion of its flaws and merits. We have also created wiki pages for system model variations and ideas for Ocelot desired capabilities - please add your input here!
I have removed comments from the front page, because for silly technical reasons they got attached to whatever the latest blog post was, making for a disjointed conversation. However, I didn't want to lose what had already been said, so here are the comments so far between Brandon Kuczenski and the Ocelot project team.
It seems to me that if you are already starting with ecospold v2 and ending with ecospold v2, you have already foreclosed a great many design decisions / discussions.
On some level, you are completely correct - ecospold 2 builds in a certain set of assumptions, both explicit and implicit, and these assumptions impose limits. I don't think any of the actual code aside from IO will include anything ecospold 2-specific, but there will be an internal data format which will mirror ecospold 2 in many ways.
One of the explicit principles in the Ocelot grant was to "not let the best be the enemy of the good", and we think it is better to get at least some system model principles in place as tested, open source computer code, than to wait for the perfect data format. There are also some practical reasons for choosing ecospold 2 - you can add arbitrary properties, so it is quite flexible, and it supports parameterization. Most importantly, ecoinvent is the only LCI database that I know of that provides "raw" master data where no system models have been applied at all, and ecoinvent is provided in ecospold 2.
Certainly our work will only be the first step in a longer conversation with the community, and I would be quite curious to hear about specific examples of what you think would be ignored or overlooked.
In my opinion the lack of a "flow" as a standalone entity is a serious shortcoming in a linking algorithm, since flows are the things that link. But my bigger concern is the philosophical one that is implied in squeezing everything into a data format that is so very expensive to work with- and in doing so a priori, before the design principles of the task are even established (or at least known to others).
I consider "arbitrary extensibility" to be a weakness of ecospold, not a strength. I've long thought it a folly for ecoinvent to try to fold all their complex data (transformations! LCIA methods! markets! inheritance! uncertainty! parameters!) into a single data object that nobody but ecoinvent knows how to author or read. ILCD, verbose as it is, at least has some concept of separation of concerns. (ILCD also does not do many of the things ecoinvent requires-) I have a genuine fear, especially reading your response, that ocelot is just a trip further down a rabbit hole that is already (from the outside) too deep to enter.
It's also not true that ecoinvent provides "raw" master data to its users! The file list only has linked system models. They're also not on nexus.openlca.org. It's true that I can view the unlinked data on the website, one data set at a time, via search, but that's not the same thing.
n.b. US LCI, on the other hand, is made up of "raw" unallocated multioutput processes.
Anyway, forgetting the input format- why would the output format be ecospold 2?
There are very few people who would consider a set of 11,000 XML files to be a useful thing they would want to generate. If the object of the algorithm is to create a square technology matrix, why make the output so very, very far from that? Why not deliver, say, a technology matrix? It would be 0.1% the size, it would be easy to interpret, it would be software independent.
A bit later
All this is just to say, it sounds like ocelot is mainly going to be useful for internal ecoinvent personnel. Which is not a bad thing! I think it's great that ecoinvent will be able to do new and exciting things with system models. It's just- a smaller community than I thought you meant at first... and to say it in a nasty way, because I have a chip on my shoulder.
Brandon makes some interesting points, and I have invited him to make a guest blog post about what he imagines a project like Ocelot could become. So I will postpone discussion of broader theory questions for now. However, I do have to disagree with a few technical points.
First, ecospold 2 definitely has "flows" distinct from activities. For example, the complete list of flows in ecoinvent 3.2 is included in the files Content/MasterData/IntermediateExchanges.xml and Content/MasterData/ElementaryExchanges.xml, both available here.
Next, while it is true that unallocated data is not available on nexus.openlca.org (an unaffiliated website), you can get all the master data by just asking the ecoinvent team. They don't bite.
It is also a bit of an exaggeration to state that "nobody but ecoinvent knows how to author or read" ecospold 2. For example, you can find open source importers and converters for Java and Python. In my personal opinion, Ecospold 2 is sometimes a pain, but it is not such a dramatic pain. For what it's worth, I have given multiple presentations about how I really want the JSON linked data schema to take over the world.
It is worth briefly mentioning that the US LCI is not raw data - it is linked data, resolved in time and space. System models are much more than just allocation.
Finally, here is a kitten:
Chris has already said most of what I wanted to say, I'll just add a few words.
The schema for the master data files is available for these files, upon request. I'm happy to answer all questions about these files. The schema for the dataset ecospold2 format is also available.
About the output format: all existing software that deals with the ecospold2 format has the capacity to take those files and transform them into a matrix. We also have the internal tools to transform ecospold2 files to a matrix representation, both as Python scipy.sparse matrices and in a machine-readable txt format that can be reconstituted into a matrix easily.
We think it is important to use the ecospold2 format as an output because it carries with it all the relevant meta information necessary to understand the datasets. We plan to add comments along the calculation chain wherever they are judged necessary (for example, for allocation, loss, market group replacement, geography linking, etc.).
That being said, more than one output is far from impossible, and since we already have some internal tools, we could easily include them as some "end-of-pipe" scripts for user comfort. We also care about the accessibility of results. We are happy that you have suggestions about this aspect!
If I may, I would like to provide the perspective of somebody answering support questions from the entire user base of ecoinvent. Brandon is right to point out that the ocelot project is "directly useful" to a very small proportion of users. The vast majority of our users do not care much about what we are discussing here. And if I can speak candidly, I am sometimes appalled by questions from our users showing a lack of understanding of very basic LCA concepts. At the other end of the spectrum, there are only a handful of people in the world who have the resources, interest and skills to actively contribute to the project. However, the entirety of the users will benefit indirectly from the fruits of this project, for example by the reassurance that system model assumptions have been thoroughly tested.
The purpose of the development blog is to allow open discussion about the kind of concerns Brandon has. The project is young enough to be steered in many directions. As somebody who usually focuses on computational aspects that most members of the community don't even want to think about, I'm quite happy to see Brandon's interest in the project. We obviously have to find a balance between the ideals we hold and what we can reasonably achieve with the resources we have for this project. As Chris pointed out, not letting the best be the enemy of the good is something to keep in mind. The release of version 3.0 was delayed because this principle was forgotten, and we are dealing with many legacy problems caused by this lack of foresight. Ocelot is the best initiative we have had so far to fix these legacy problems.
Well, this is a bit embarrassing. It turns out that I wrote a blog post almost exactly two years ago proposing essentially the Ocelot project, and then forgot about it. Brandon Kuczenski has been having a discussion with me and others, in the comments of this development blog, over the limitations that the ecospold 2 mental model brings. I was looking for historical examples of when I had complained about ecospold XML, to show that I felt his pain, when I stumbled across this blog post, called "Some ideas on an open source version of the ecoinvent software."
I include below what I wrote in April 2014. It is exciting (and a bit scary) to be a part of the team trying to make "a brighter future"!
Ecoinvent version 3: A difficult elevator pitch
There are still a lot of people confused or doubtful about ecoinvent version 3. One of the big questions that people have, as seen repeatedly on the LCA mailing list, is exactly how ecoinvent 3 works. True, there is a document called the "data quality guidelines", which explains the concepts behind ecoinvent 3 in some detail. But even in the "Advanced LCA" PhD seminar that I led last fall, the data quality guidelines raised as many questions as it answered, and few have the days needed to read through that document thoroughly. The large number of changes from version two to version three, plus the fact that some LCIA results have changed a lot, leads many to doubt whether they should make the transition.
From my perspective, as someone who has played around with LCA software, the fact that the ecoinvent software is not publicly accessible is also a source of concern - especially ironic given the motto of the ecoinvent centre, to "trust in transparency." There were a lot of clarifications and modifications needed to turn the ideas of the data quality guidelines into working computer code, and these adaptations are just as important as the general data quality guidelines framework. It is my understanding that the ecoinvent centre is in the process of creating a new document that will more precisely give the rules for the application of the various system models, but this document is not yet finished.
One big step towards addressing these problems would be to have an open source version of the software that takes unit process master data sets and applies the different system models. If done properly, this software could provide a practical implementation of the abstract idea in the data quality guidelines, giving precise details on each step needed to get to the technosphere and biosphere matrices. In the rest of this post, I will give my thoughts on what such a software could look like.
The first guiding principle of any such software must be practicality. Even a simple piece of software is a lot of work, and ecoinvent 3 would not be a simple piece of software. Therefore, the software should have limited scope - not all of the ecoinvent software functionality is needed - and should not reinvent the wheel, but build on existing libraries as much as possible. One should start with the easiest problems, and build up a set of simplified components that can handle most unit processes. Practicality also means using a well-established, boring technology stack.
The second guiding principle should be accessibility. The inspiration behind literate programming should apply here as well - the point of such a software is not just to redo work already done, but to make the rules, algorithms, and special cases understandable to people from different backgrounds. The documentation for backbone is a beautiful example of annotated source code, but something built with a tool like Sphinx (the brightway2 documentation, for example) is probably a better model, as numerous diagrams and even animations may be necessary. Development should be open, and source code should be hosted on a service like github or bitbucket.
Of course, the cooperation of the ecoinvent centre would be critical for the success of any such effort. There are good arguments for having software development happen independent of ecoinvent, but probably the best approach would be to have a diverse team of people from inside and outside the ecoinvent centre.
The first major component is a data converter, to convert from the ecospold 2 XML format to something more easily accessed and manipulated. XML is a great format for inventory dataset exchange, as it has schema descriptions, validation, etc., but XML is not a great format for working with data. Just google for XML and awful or horrible or terrible or sucks. Anyway, the converter would transform the necessary parts of the unit process datasets to the native data structures of whatever programming language is chosen.
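To make the idea of a converter concrete, here is a minimal sketch in Python using only the standard library. The XML below is a deliberately simplified, made-up stand-in and does not follow the real ecospold 2 schema; the element and attribute names, and the `extract_dataset` function, are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML. Real ecospold 2 files are namespaced
# and far more detailed than this.
SAMPLE_XML = """
<dataset name="steel production" location="DE">
  <exchange name="steel" amount="1.0" unit="kg" type="output"/>
  <exchange name="iron ore" amount="1.5" unit="kg" type="input"/>
</dataset>
"""


def extract_dataset(xml_text):
    """Convert one XML dataset into plain dictionaries and lists."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.get("name"),
        "location": root.get("location"),
        "exchanges": [
            {
                "name": exchange.get("name"),
                # Amounts arrive as strings in XML; convert to numbers once.
                "amount": float(exchange.get("amount")),
                "unit": exchange.get("unit"),
                "type": exchange.get("type"),
            }
            for exchange in root.iter("exchange")
        ],
    }


dataset = extract_dataset(SAMPLE_XML)
```

Once the data is in dictionaries and lists, the later steps never need to touch XML again.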
The second major component, and the hardest one, is the allocator. This would take as an input an inheritance tree of unit process data sets, and resolve each into an allocated, single output unit process (A-SOUP, terminology by Gabor Doka), whose output is a product located in time and space. This can start simple - just work on economic or mass allocation, or just substitution. In theory, or in a world where no one had invented allocation at the point of substitution, this could be trivially parallelized, and so should be relatively quick. At the beginning, the software doesn't need to do everything, and difficult data sets like clinker production could just be skipped for now.
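As a rough illustration of the simplest case, the sketch below splits a multi-output process into single-output processes using economic allocation, where each product receives a share of the inputs proportional to its revenue. The data layout, the prices, and the `allocate_economic` function are all assumptions for this example. It ignores inheritance, substitution, and allocation at the point of substitution, and it does not rescale the results to one unit of output.

```python
def allocate_economic(process):
    """Split a multi-output process into one process per product.

    Each input is multiplied by the product's share of total revenue
    (amount * price), so the shares of all products sum to one.
    """
    outputs = process["outputs"]
    total_revenue = sum(o["amount"] * o["price"] for o in outputs)
    allocated = []
    for output in outputs:
        factor = output["amount"] * output["price"] / total_revenue
        allocated.append({
            "name": f'{process["name"]} ({output["name"]})',
            "reference_product": output["name"],
            "location": process["location"],
            "production_amount": output["amount"],
            "inputs": [
                {**inp, "amount": inp["amount"] * factor}
                for inp in process["inputs"]
            ],
        })
    return allocated


# Made-up example: a sawmill with two products.
sawmill = {
    "name": "sawing",
    "location": "CH",
    "outputs": [
        {"name": "sawnwood", "amount": 1.0, "price": 90.0},
        {"name": "sawdust", "amount": 0.5, "price": 20.0},
    ],
    "inputs": [{"name": "roundwood", "amount": 2.0}],
}

single_output_processes = allocate_economic(sawmill)
```

Here sawnwood earns 90 of the 100 units of revenue, so it carries 90% of the roundwood input, and sawdust carries the remaining 10%. Because each process is allocated independently of the others, this simple case is the one that could be run in parallel.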
The last big component is the linker, which matches demand for products to the A-SOUPs that produce those products in the correct time and space. I think that this should be relatively easy to do. One improvement over the procedure as illustrated in the data quality guidelines could be to define all geographic relations in advance, even in something as simple as a text file, to avoid having to integrate GIS functionality into the new software (see also Some thoughts on Ecoinvent geographies).
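The idea of defining geographic relations in advance can be sketched as follows. The text format (one line per location, listing the regions that contain it from most to least specific), the location codes, and the `parse_geography` and `link` functions are assumptions for illustration. The sketch ignores time entirely and simply picks the supplier in the most specific matching location.

```python
# Hypothetical text file content: "location: containing regions",
# ordered from most specific to least specific.
GEOGRAPHY_TEXT = """
CH: RER, GLO
DE: RER, GLO
RER: GLO
"""


def parse_geography(text):
    """Read the containment relations into a dictionary of lists."""
    relations = {}
    for line in text.strip().splitlines():
        location, _, parents = line.partition(":")
        relations[location.strip()] = [
            p.strip() for p in parents.split(",") if p.strip()
        ]
    return relations


def link(product, location, suppliers, relations):
    """Return the supplier of `product` in the most specific location.

    Tries the demanding location first, then each containing region in
    order. Returns None if no supplier matches.
    """
    for candidate in [location] + relations.get(location, []):
        for supplier in suppliers:
            if (supplier["reference_product"] == product
                    and supplier["location"] == candidate):
                return supplier
    return None


relations = parse_geography(GEOGRAPHY_TEXT)

# Made-up single-output processes to choose from.
suppliers = [
    {"reference_product": "electricity", "location": "GLO"},
    {"reference_product": "electricity", "location": "RER"},
    {"reference_product": "steel", "location": "DE"},
]

swiss_electricity = link("electricity", "CH", suppliers, relations)
```

With the relations precomputed, linking is a dictionary lookup and a loop, with no GIS library involved.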
I envision the final codebase to have more tests than actual code, and significantly more documentation than actual code.
A brighter future
My best guess is that such a software would take around one year of work. The hard stuff has already been done by ifu hamburg and the ecoinvent centre. However, translating the work that has already been done into open source code will depend on a detailed specification document explaining exactly what the current software does.
If such a software were to come into existence, it could significantly help the adoption of ecoinvent version 3. It could also provide a nice foundation for future work. One of the very nice properties of the master unit process datasets is that one can develop new system models. A well-documented and well-tested open source software would allow both ecoinvent and others to develop new system models, and realize the promise of ecoinvent version 3.