1 Contributing to BioCro
1.1 Making Changes
1.1.1 Discuss first
- Check the list of GitHub issues for a discussion of the issue. If there is not one, create an issue with a description of the problem and your proposed solution.
By making changes without discussing it with the group, you risk spending time working on a solution that others may not accept. The members of the group also have diverse backgrounds and likely can give valuable design insights.
1.1.2 Follow BioCro’s git branching structure
In general, BioCro development follows the git-flow branching model, where there are two permanent branches (
main
anddevelop
) and three types of temporary branches (hotfixes, features, and releases).For most contributors, it is only necessary to know that most changes should be accomplished through feature branches, which are branched off
develop
and merged back intodevelop
via a pull request that must be approved by one or more other developers. The remainder of this section discusses our system in more detail.Beyond the basic description of git-flow, we have a few additional rules and clarifications specific to BioCro development:
Any merge into
main
ordevelop
must be done via a pull request. This requirement is enforced with GitHub branch protection rules and cannot be bypassed.All pull requests into
main
anddevelop
require approval before merging. This requirement is not enforced using GitHub branch protection rules, but please do not unilaterally merge a pull request without any form of approval from another developer. (There are two exceptions to this rule related to hotfix and release branches, described below.)The BioCro R package and repository follow semantic versioning of the form
major.minor.patch
. Hotfix branches should typically only increment the third (patch
) number. Most release branches will increment the second (minor
) number, and very rarely the first (major
) number. Feature branches do not directly change the version number.Hotfix and release branch names should be formatted like
type-version
, wheretype
is eitherhotfix
orrelease
, and version is the new version number. For example,hotfix-v3.0.3
would be a hotfix branch that changes the version number to3.0.3
. Feature branch names should reflect their purpose, and can be anything other thanmaster
,develop
,hotfix-*
, orrelease-*
.Whenever the package version changes, a description of the related changes should be added to the changelog in
NEWS.md
as a new section titled with the new version number. Such version changes will happen via release and hotfix branches. Release branches will generally include changes made during the course of several feature branches, so they may require a significant amount of documentation inNEWS.md
. To make it easier to prepare a list of these changes, each feature branch should include a description of its own changes in anUNRELEASED
section ofNEWS.md
. For more information about updating the changelog, please see the comment at the top ofNEWS.md
.
The following is a short description of BioCro’s implementation of the git-flow branching model:
The
main
branch always contains the latest stable release of the BioCro GitHub repository. In fact, a new tag and release is made any time changes are merged intomain
, and such changes should always be accompanied by an increment to the BioCro R package version number.The
develop
branch contains bug fixes and new features that have been completed but may not have been released yet.Hotfix branches are for urgent changes that warrant immediate incorporation into
main
; typically these are bug fixes. A hotfix branch should be branched offmain
. When ready, it should first merged intomain
via an approved pull request. Then, a second pull request should be made to merge intodevelop
. If there are no merge conflicts or test failures, this second request can be merged without any additional approval. When both merges are complete, the hotfix branch should be deleted.Feature branches are for less urgent changes that do not require immediate release; for example, the addition of a new R function to the package namespace. A feature branch should be branched off
develop
and merged back intodevelop
via an approved pull request. Before merging,NEWS.md
should be updated with a description of the changes under the# UNRELEASED
section (seeNEWS.md
for more details). When the merge intodevelop
is complete, the feature branch should be deleted.When sufficiently many changes have accumulated in
develop
to justify a new version of the package, a release branch is used to move the changes fromdevelop
tomain
. Release branches should not include substantive changes; rather, a release branch is primarily used to increment the package version and to ensure an up-to-date changelog inNEWS.md
. A release branch should be branched offdevelop
. When it’s ready, it should first merged intomain
via an approved pull request. Then, a second pull request should be made to merge intodevelop
. If there are no merge conflicts or test failures, this second request can be merged without any additional approval. When both merges are complete, the release branch should be deleted.
1.1.3 Making large modifications to BioCro
From time to time, someone will propose making a large change to the organizational structure of BioCro or to one of its central components. Here we also consider any modification that influences the way users or developers interact with BioCro to be “large.” Large changes must be carefully considered and discussed before they are implemented. When considering such proposals, a number of key points should be kept in mind:
BioCro is designed for computational modelers who want to focus on biology rather than computer programming; it is essential for BioCro to be easy to install and use.
A friendly user experience makes BioCro accessible to a wide range of users who each have the potential to contribute to broader scientific understanding through modeling. Thus, it is important to minimize any barriers that may prevent some scientists from using it.
BioCro is developed and maintained by a small team, most of whom have numerous other responsibilities. Thus, it is important to carefully consider the implementation and maintenance costs that may be associated with any proposed change, carefully weighing these against the perceived benefits.
Changes to BioCro R packages should be consistent with CRAN policies.
Distributing BioCro via CRAN is a key part of keeping it accessible to all users with maximum ease.
At its core, BioCro is a C++ library; the BioCro R package merely provides a convenient R interface. Even though this R interface exists, we wish the C++ core to remain usable on its own.
This has two implications:
The core C++ library should maintain a consistent interface. Any changes to the public API should be carefully considered and carefully delineated.2
It should be and should remain relatively painless to obtain and use this library without having to have a full R installation.
With BioCro, a premium is put on keeping it relatively self-contained; needless dependencies should be avoided.
There are a number of reasons for this:
Limiting dependencies will make it less likely BioCro will break for reasons beyond our control.
Limiting dependencies reduces security risks. (On this point see, for example, Russ Cox’s discussion of the 2017 Equifax fiasco in his article Our Software Dependency Problem.)
Limiting dependencies makes reproducibilty easier.
If one wishes to replicate results of a BioCro simulation, the task is more difficult if one has to worry not only about what platform, what version of R, and what version of BioCro were used to obtain the original results, but if, on top of that, one has to worry about the versions of other R packages depended upon.3
Limiting dependencies means that fewer steps are required to install BioCro.
BioCro has few dependencies, and all things being equal, we would like to keep it that way.4
If you propose a large modification to BioCro, please be prepared to discuss the following questions:
How will time costs change for maintainers, developers, and users?
Will there be more or fewer opportunities for BioCro to break due to changes in its dependencies?
Which BioCro features will be added or lost?
1.2 Code style
(Most of what is discussed here pertains specifically to code for the BioCro C++ library.)
1.2.1 Scientific considerations
1.2.1.1 Document sources and justifications in the code
Include citations to sources for equations and parameters used in the code. The citation should be sufficient to locate the article and relevant information within it. Include a table or figure reference if appropriate.
Use Doxygen-style comment syntax for high-level documentation of functions and classes. We use the Javadoc style of comment block in our code. (See the documentation of the Solar Position module
solar_position_michalsky
for an example of a Doxygen-style comment, and then look at the way this is rendered in the generated documentation.)Include reasoning and justification for the model, including assumptions that determine when use of the model is appropriate. These descriptions should be succinct.
1.2.1.2 Document units in the code
After every physical quantity, include a comment with the units. The idea is that every quantity will roughly be read as if it were written in normal text: for example,
double yield = 10 // Mg / ha
should be read as meaning “the yield was 10 Mg / ha”. Using dimensions instead of units is acceptable if the code is written with the expectation that coherent units are used.The following example shows how to indicate units in a number of different contexts. Note that, as in LaTeX,
^
is used to indicate a superscript, so thatm^2
indicates square meters.// In function signatures double ball_berry(double assimilation, // mol / m^2 / s double atmospheric_co2_concentration, // mol / mol double atmospheric_relative_humidity, // Pa / Pa double beta0, // mol / m^2 / s double beta1) // dimensionless from [mol / m^2 / s] / [mol / m^2 / s] // In assignments double leaf_temperature = air_temperature - delta_t; // K. // In return statements return assimilation_rate; // micromoles / m^2 / s. // In tables const std::map<SoilType, soilText_str> soil_parameters = {// d = dimensionless // d d d J / kg d J s / m^3 d d d Mg / m^3 // silt clay sand air_entry b Ks satur fieldc wiltp bulk_density 0.05, 0.03, 0.92, -0.7, 1.7, 5.8e-3, 0.87, 0.09, 0.03, 1.60 } }, { SoilType::sand, { 0.12, 0.07, 0.81, -0.9, 2.1, 1.7e-3, 0.72, 0.13, 0.06, 1.55 } }, { SoilType::loamy_sand, { };
If you would like to include other details, include the units in the same way, and include details following the units so that the variables are still read like regular text. For example, write
return gswmol * 1000; // mmol / m^2 / s. Convert from mol to mmol. ✓
not
return gswmol * 1000; // Converting from mol / m^2 / s to mmol / m^2 / s. ✗
Note that in a case such as this, the units apply to the entire value (
gswmol * 1000
) and not merely to the variable (gswmol
).Use SI conventions for units and dimensions, including capitalization. Specifically, use “
degrees C
”, not “C
”, to indicate °C.Use full names when symbols are not available:
micromoles / m^2
, notumol / m^2
degrees C
, not*C
.
Use
dimensionless
for dimensionless quantities, and include how the dimensions have canceled if that is informative.Use
^
to indicate exponentiation:m^2
, notm2
.Prefer an asterisk to indicate multiplication; but indicating multiplication by juxtaposing units with exactly one space between them is acceptable. Prefer exactly one space on each side of the asterisk:
kg * m / s
orkg m / s
.Either a solidus (“/”) or negative exponent is acceptable to indicate division, but ensure that the solidus is used correctly if used multiple times. Prefer exactly one space on each side of the solidus.
1.2.1.3 Document parameters
When adding models that require new parameters, document the parameters in the parameter table in src/parameters.h. Please keep the table well formatted.
If you are working on a model with undocumented parameters, it would be nice if you added them to the table as you work through the issue.
1.2.2 General coding considerations
Do not use C-style arrays. Use an appropriate data type from the standard library instead.
Use cmath, not math.h, for common mathematical functions.
Be careful with using-directives (e.g.
using namespace std
) in a global scope; do not use them in global scope in a header file. Try to make using-declarations (e.g.using std::string
) as local as possible. Type aliases (e.g.using string_vector = std::vector<std::string>
) are perfectly acceptable in the global scope of a header file.Strongly prefer the coherent set of SI units. Doing so reduces code complexity remarkably as no conversions are necessary. Yes, no one publishes values with these units, but do the conversion in one place, the manuscript, instead of dozens of times in the code, constantly having to look up units for variables, and then spending hours debugging silly, difficult-to-find errors. The coherent set of SI units consists of all the units without prefixes, except that kg is the coherent unit of mass, not g.
Do not copy and paste code, changing only small parts. Choose a design that eliminates the duplication.
Make an effort to write unit tests. (See the vignette An Introduction to BioCro for Those Who Want to Add Models for information about writing unit tests—specifically, writing unit tests for new modules.)
Do not mix sweeping formatting changes with behavior changes. Large formatting changes should be a separate commit, containing only formatting changes, and the commit comment should indicate that only formatting was changed. This way, code that changes program behavior won’t be obscured by a mass of changes to the formatting of the code.
The Standard C++ Foundation’s C++ Core Guidelines have useful advice about aspects of coding and design.
1.2.3 Formatting code
(Again, except in a few instances, this pertains specifically to C++ code.)
The most important aspect of formatting is that the code is easy to understand. Below are unenforced preferences.
Prefer
underscores_in_identifiers
notCamelCaseInIdentifiers
and, in R, notdots.in.identifiers
. Prefer lowercase-only identifiers. An exception may be made for commonly-recognized names used in a small scope, for example,I = V / R; F = m * a; E = m * (c * c);
Avoid unnecessary parentheses. For example, use “
a * b / c
” instead of“ or(a * b) / c
”“ . But in cases where the order of operations affects the result, parentheses may be used to erase any doubt in the mind of the reader (or the programmer!) as to what that order is. Thus, writinga * (b / c)
”(a / b) * c
instead of (the equivalent)a / b * c
is acceptable. Parentheses may also be used to group portions of a formula that are commonly considered as a sub-unit, where they provide some semantic value (see the previous bullet point). Consider naming parts of a complicated expression in order to break it down into simpler ones. For example,4 * a * c)) / (2 * a); x = (-b + sqrt(b * b -
may be rewritten in three lines as
4 * a * c); num = -b + sqrt(b * b - 2 * a; denom = x = num / denom;
Note that in C++, unlike in R, return statements do not require parentheses around the returned expression.
Restrict the line length of paragraph-like comments to 80 characters, excepting a compelling reason to do otherwise. Lines in sections that are not paragraph-like could be somewhat longer if it facilitates presenting material in a more readable format. In the following snippet from the module library documentation, for example, we have allowed slighly-longer lines in order to be able to maintain one line per interval:
/* * However, this definition is flexible. For example, for our soybean model * (soybean_development_rate_calculator.h) we define the intervals as follows: * -1 <= DVI < 0 : Sowing to Emergence * 0 <= DVI < 1 : Emergence to R1 (Flowering) is broken into three stages. * 0 <= DVI < 0.333 : Emergence to V0 (Cotyledon stage) * 0.333 <= DVI < 0.667 : V0 (Cotyledon stage) to R0 (End of Floral Induction) * 0.667 <= DVI < 1 : R0 (End of Floral Induction) to R1 (Flowering) * 1 <= DVI < 2 : R1 (Flowering) to R7 (Maturity) */
As for the code lines themselves, we point to the following advice from the Linux kernel project:1
The preferred limit on the length of a single line is 80 columns.
Statements longer than 80 columns should be broken into sensible chunks, unless exceeding 80 columns significantly increases readability and does not hide information.
Do not include trailing whitespace, i.e., whitespace characters preceding newline characters.
Each file should end with a newline character (i.e. a terminal endline).
Use spaces rather than tab characters.
In general, formatting preferences should follow something similar to the Google C++ style guide, except in cases where the code has been formatted in a more readable way, such as when aligning parts in a table.
The Standard C++ Foundation guidelines offer some advice about formatting conventions that are informative, particularly regarding the use of code comments.
For tools to help with formatting code, see the section 1.3.
1.2.4 R-specific coding advice
Prefer to use the double-bracket operator (
list[['element']]
) rather than the dollar-sign operator (list$element
) when accessing the elements of a list. The$
operator uses partial matching, whereas[[
, by default, does not. (However, it can be specified:list[['element', exact = FALSE]]
.) Avoiding partial matching by using[[
gives us more confidence that errors won’t occur.While there is no inherent performance difference between a
for
loop and an apply-type function such asapply
orlapply
(the apply functions actually usefor
loops in their source code), it is nevertheless possible to write a “bad”for
loop that runs slowly. Common culprits include a failure to pre-allocate memory or a poor choice in assignment method. If afor
loop seems to run slowly, consider replacing it with an apply-type function or tweaking the assignment method (e.g. replacingappend
with[
). Many guides for optimizing loop performance are available online, such as Strategies to Speedup R Code and Why loops are slow in R.
1 The Linux kernel project recently changed the default length for code lines from 80 to 100 characters with the following commit comment:
Yes, staying withing 80 columns is certainly still preferred. But it’s not the hard limit that the checkpatch warnings imply, and other concerns can most certainly dominate.
Increase the default limit to 100 characters. Not because 100 characters is some hard limit either, but that’s certainly a “what are you doing” kind of value and less likely to be about the occasional slightly longer lines. ↩︎
1.3 Formatting Tools
1.3.1 Clang’s formatting tool
Many of these BioCro formatting preferences can be applied automatically using the program clang-format with the .clang-format file provided in the base directory of BioCro. Do not apply clang-format to all files indiscriminately, as that will ruin manually-aligned tables.
The best time to reformat a file is immediately before (or possibly immediately after) making substantive changes to the code in the file. But, as mentioned in Section 1.2.2, sweeping formatting changes should be made in a separate commit, separate from any substantive changes made to the code. This way, changes to the functioning of the code won’t be obscured by changes to formatting that have no effect on code semantics.
1.3.1.1 Installation
One can install clang-format on Ubuntu using sudo apt install clang-format
. On macOS, clang-format is available from the
Homebrew package manager.
1.3.1.2 Using the Clang formatting tool
Files can be formatted using
clang-format file_name > new_file
or edited in place using
clang-format -i file_name
If your editor has the ability to display differences between the original and revised versions of the file, it is a good idea to step through and inspect the proposed changes to ensure they are desirable.
1.3.1.3 Using the Clang formatting tool with the CodeLite IDE
On Windows, macOS, or Linux, the CodeLite IDE includes clang-format
and provides an easy way to use it. First go to Plugins -> Source
Code Formatter -> Options. In the C++ tab, select use .clang-format file
. Now press Ctrl-I
or click Plugins -> Source Code Formatter
-> Format Current Source to format a file.
1.3.2 EditorConfig
Another tool to help with formatting is EditorConfig. EditorConfig, when used in conjunction with the .editorconfig file provided in the base directory of BioCro, provides a method for standardizing settings across different text editors. While some editors have native support, others require a plugin. See the EditorConfig website for more details.
As of this writing, no C++ API for BioCro has been defined: there is no document that makes clear what publicly accessible portions of the framework and standard library are guaranteed to remain stable and available to be programmed against and which portions are subject to change.↩︎
Irrespective of what dependencies BioCro now has or are added to it, a researcher who is concerned about reproducability should consider making a containerized version of BioCro. See, for example, the chapter Docker and Reproducibility in the document for the workshop Reproducible analysis and Research Transparency. The BioCro maintainers don’t provide containerized versions of BioCro as we think this is a task better left to the individual researcher.↩︎
BioCro’s strong dependencies are the R framework, the C++ compiler used in installing the BioCro package, the C++ Standard Library, the and the Boost C++ Library.
As for R package dependencies, the BioCro R package depends only upon packages in the R standard library (stats and utils) for its basic installation and functioning. BioCro does use other, non-standard R packages for building the documentation and for testing, but these are not essential to a fully-functioning BioCro installation.
For further reading on the benefits and pitfalls of using dependencies, see, for example, Russ Cox’s article Our Software Dependency Problem and Bill Sourour’s article Code dependencies are the devil.↩︎