
Motivation

Many data workflows require human review before publication. Official statistics, digital humanities, archives, machine learning, and data engineering all rely on iterative assessment of candidate values before they become part of a released dataset.

The goal of the review package is to make semantic review explicit and reproducible. Rather than overwriting existing values, each review round creates a new version of one or more reviewable claims while preserving the previous versions and their provenance. The resulting review history can be inspected, reproduced, and extended with additional review rounds.

Creating reviewable claims

A review begins with a collection of claims.

claims <- claims_df(
  Orange,
  scope_var = "age",
  subject_var = "Tree"
)

head(claims, n = 6)
#>   claim_id  age Tree circumference_candidate
#> 1        1  118    1                      30
#> 2        2  484    1                      58
#> 3        3  664    1                      87
#> 4        4 1004    1                     115
#> 5        5 1231    1                     120
#> 6        6 1372    1                     142

A claims_df separates variables into three structural roles and one or more reviewable variables.

  • Identifier uniquely identifies each claim.
  • Scope defines the context in which the claim is made.
  • Subject identifies the entity being described.
  • Reviewable variables contain the values that may change during semantic review.

The structural variables remain stable throughout the review process, while reviewable variables may acquire successive reviewed versions.

Readers familiar with the tidy data principles will recognise this separation. Structural variables play a role analogous to identifiers, dimensions, and attributes in statistical data production, while the reviewable variables correspond to measured values that are iteratively reviewed and improved. The structural variables provide the context for interpretation and allow claims to be grouped, filtered, or compared, whereas the reviewable variables capture the semantic content that evolves during the review process.

names(claims)
#> [1] "claim_id"                "age"                    
#> [3] "Tree"                    "circumference_candidate"

Notice that circumference has become circumference_candidate. Candidate values are the current working version of each reviewable variable and form the starting point for the first review round.
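The resulting layout can be pictured with a base-R sketch. It only reproduces the column structure shown above and is not the package's implementation; the name layout_sketch is hypothetical, and the column order is copied from the printed output.

```r
# Illustration only: rebuild the layout printed above with base R.
# `layout_sketch` is a hypothetical name, not an object created by the package.
layout_sketch <- data.frame(
  claim_id = seq_len(nrow(Orange)),               # identifier: one per claim
  age = Orange$age,                               # scope: context of the claim
  Tree = Orange$Tree,                             # subject: entity described
  circumference_candidate = Orange$circumference  # reviewable: working value
)

names(layout_sketch)
```

The three structural columns are copied unchanged; only the reviewable column receives the _candidate suffix.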

First review round

A review round allocates one or more review columns. At this stage the package records that a review is about to take place, but it does not prescribe how that review should be performed.

reviewed <- claims |>
  review("circumference")

names(reviewed)
#> [1] "claim_id"                "age"                    
#> [3] "Tree"                    "circumference_candidate"
#> [5] "circumference_review_1"

The review algebra is independent of the review interface. Reviewers may use whatever environment best suits the task, provided that the reviewed values are written back into the allocated review columns.

Because the interface is left open, any of the following can serve as the review step:

  • a dplyr::mutate() pipeline,
  • values imported from a reviewed CSV file,
  • an interactive R console session,
  • a Shiny application,
  • an Excel or LibreOffice spreadsheet,
  • an OpenRefine workflow.

For illustration we edit one value directly.

reviewed$circumference_review_1[1] <- 31
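Editing by row position is convenient in a vignette, but values that come back from an external tool are usually safer to write back by claim_id. The following base-R sketch shows one way to do that. It uses a hypothetical three-row stand-in (review_sketch) rather than the package object, a hypothetical corrections table such as one read with read.csv(), and it assumes that the review column starts as a copy of the candidate values, as the printed output later in this vignette suggests.

```r
# Hypothetical stand-in for a table with an allocated review column.
review_sketch <- data.frame(
  claim_id = 1:3,
  circumference_candidate = c(30, 58, 87),
  circumference_review_1 = c(30, 58, 87)  # assumed to start as a copy
)

# Hypothetical corrections, e.g. the changed rows of a reviewed CSV file.
corrections <- data.frame(claim_id = 1L, circumference = 31)

# Locate each corrected claim by its identifier, not by row position.
rows <- match(corrections$claim_id, review_sketch$claim_id)
stopifnot(!anyNA(rows))  # every correction must refer to a known claim

review_sketch$circumference_review_1[rows] <- corrections$circumference
review_sketch
```

Matching on the identifier keeps the write-back correct even if the external tool sorted or filtered the rows.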

Once the review values have been entered, the review round can be documented.

reviewed <- reviewed |>
  explain(
    activity = "manual review",
    agent = person(
      "Jane",
      "Doe",
      role = "rev"
    ),
    used = "doi:10.5281/zenodo.1234567"
  )

explain() records how the review was performed, who performed it, and
what evidence or resources were used. Provenance is stored separately from
the reviewed values, allowing the same review data to be interpreted or
exported in different provenance models.

Readers familiar with FAIR data principles, reproducible research, or
statistical and digital heritage workflows may recognise a common problem:
as data are cleaned, harmonised, and reviewed, an increasing share of the
knowledge about why particular decisions were made remains only in the
analyst’s or curator’s head. The resulting datasets may be reusable, but
the review process itself is often difficult to inspect, reproduce, or
audit.

The review algebra aims to record as much of this review history as
possible while requiring as little additional documentation effort as
possible. By separating reviewed values from the provenance of the review,
it records which agents (people, software, or AI models) performed
which activities, and what entities (datasets, publications,
files, or other resources) informed those decisions.

This lightweight provenance layer follows the concepts of the PROV data
model without requiring users to work directly with RDF or ontologies. It
can later be serialised to PROV-O or aligned with standards such as SDMX,
DataCite, or archival provenance models. The objective is to improve the
reviewability, auditability, reproducibility, and ultimately
the trustworthiness and reusability of reviewed data while
remaining compatible with ordinary R data frames.

attr(reviewed, "prov_activity")
#>       candidate        review_1 
#>        "create" "manual review"

attr(reviewed, "prov_agent")
#>        candidate         review_1 
#>               NA "Jane Doe [rev]"

attr(reviewed, "prov_used")
#>                    candidate                     review_1 
#>                           NA "doi:10.5281/zenodo.1234567"
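Because each provenance attribute is a named character vector keyed by round, the three vectors can be lined up into one row per round. The sketch below rebuilds the vectors by hand from the output above so that it runs without the package; in a real session they would come from attr(reviewed, "prov_activity") and the other two attributes. The name prov_table is hypothetical.

```r
# Named vectors copied from the attribute output shown above.
prov_activity <- c(candidate = "create", review_1 = "manual review")
prov_agent    <- c(candidate = NA, review_1 = "Jane Doe [rev]")
prov_used     <- c(candidate = NA, review_1 = "doi:10.5281/zenodo.1234567")

# Align agent and evidence on the round names of the activity vector.
rounds <- names(prov_activity)
prov_table <- data.frame(
  round    = rounds,
  activity = unname(prov_activity),
  agent    = unname(prov_agent[rounds]),
  used     = unname(prov_used[rounds])
)

prov_table
```

A table of this shape is a convenient starting point for an audit report, since each row states what was done, by whom, and on the basis of which resource.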

Second review round

Additional review rounds continue from the current accepted review rather than from the original candidate values. Successive rounds therefore form a chain of semantic revisions.

reviewed <- reviewed |>
  review("circumference")

Again, the review itself may take place using any suitable workflow.

reviewed$circumference_review_2[1] <- 32

Each review round has its own provenance. Different reviewers, software, or evidence can therefore contribute to successive stages of the review.

reviewed <- reviewed |>
  explain(
    activity = "quality assurance",
    agent = "OpenRefine 3.10",
    used = "doi:10.2908/NAMA_10_GDP"
  )

After two rounds, the review history is:

circumference_candidate
            ↓
circumference_review_1
            ↓
circumference_review_2
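To see which claims were touched along this chain, the version columns can be compared row by row. A minimal base-R sketch follows; history_sketch is a hypothetical three-row stand-in that uses the values edited in this vignette, not an object produced by the package.

```r
# Hypothetical stand-in for the state after two review rounds.
history_sketch <- data.frame(
  claim_id = 1:3,
  circumference_candidate = c(30, 58, 87),
  circumference_review_1  = c(31, 58, 87),
  circumference_review_2  = c(32, 58, 87)
)

# All columns that hold a version of the reviewable variable.
version_cols <- grep("^circumference_", names(history_sketch), value = TRUE)

# A claim changed if its versions are not all identical.
changed <- apply(
  history_sketch[version_cols], 1,
  function(v) length(unique(v)) > 1
)

history_sketch[changed, ]
```

Only the first claim is returned, because it is the only one whose value differs between versions.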

Finalising a review

Once the review is complete, the current reviewed values become the new candidate values. The review history is retained, allowing future review rounds to continue from the released version while preserving the complete review trail.

reviewed <- reviewed |>
  finalise_review()

In the output below, circumference_candidate for the first claim now holds 32, the value from circumference_review_2, and both review columns remain in place.

names(reviewed)
#> [1] "claim_id"                "age"                    
#> [3] "Tree"                    "circumference_candidate"
#> [5] "circumference_review_1"  "circumference_review_2"
head(reviewed, 6)
#>   claim_id  age Tree circumference_candidate circumference_review_1
#> 1        1  118    1                      32                     31
#> 2        2  484    1                      58                     58
#> 3        3  664    1                      87                     87
#> 4        4 1004    1                     115                    115
#> 5        5 1231    1                     120                    120
#> 6        6 1372    1                     142                    142
#>   circumference_review_2
#> 1                     32
#> 2                     58
#> 3                     87
#> 4                    115
#> 5                    120
#> 6                    142
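The promotion step can be approximated in base R to make the logic concrete. This is a sketch under the assumption that review columns follow the <variable>_review_<n> naming seen above; promote_sketch is a hypothetical helper, not the package's finalise_review(), and it ignores provenance attributes entirely.

```r
# Hypothetical helper: copy the latest review column into the candidate column.
promote_sketch <- function(df, variable) {
  pattern <- paste0("^", variable, "_review_[0-9]+$")
  review_cols <- grep(pattern, names(df), value = TRUE)
  if (length(review_cols) == 0) return(df)  # nothing to promote

  # Pick the column with the highest round number.
  round_no <- as.integer(sub("^.*_review_", "", review_cols))
  latest <- review_cols[which.max(round_no)]

  df[[paste0(variable, "_candidate")]] <- df[[latest]]
  df  # earlier review columns are left untouched
}

before <- data.frame(
  claim_id = 1:3,
  circumference_candidate = c(30, 58, 87),
  circumference_review_1  = c(31, 58, 87),
  circumference_review_2  = c(32, 58, 87)
)

after <- promote_sketch(before, "circumference")
after$circumference_candidate
#> [1] 32 58 87
```

The earlier versions stay in the table, which is what allows later rounds to continue from the promoted values without losing the trail.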

The complete review algebra therefore consists of four operations: create reviewable claims, allocate review rounds, document (explain) each review, and promote the current review into the next candidate version.

Complete workflow

claims <- claims_df(
  Orange,
  scope_var = "age",
  subject_var = "Tree"
) |>
  review("circumference")

claims$circumference_review_1[1] <- 31

claims <- claims |>
  explain(
    activity = "manual review",
    agent = person("Jane", "Doe", role = "rev"),
    used = "doi:10.5281/zenodo.1234567"
  ) |>
  review("circumference")

claims$circumference_review_2[1] <- 32

claims <- claims |>
  explain(
    activity = "quality assurance",
    agent = "OpenRefine 3.10",
    used = "doi:10.2908/NAMA_10_GDP"
  ) |>
  finalise_review()

Garamantas example

The review algebra is independent of the application domain. The same workflow can be applied to statistical observations, archival metadata, cultural heritage collections, or machine-generated annotations.

garamantas <- data.frame(
  resources = c("https://garamantas.lv/en/file/475833",
                "https://garamantas.lv/en/file/471397",
                "https://garamantas.lv/en/file/475825"),
  instance_of = rep("photograph", 3),
  depicts  = rep("building", 3)
)

garamantas_claims <- claims_df(
  garamantas,
  scope_var = "instance_of",
  subject_var = "resources"
) |>
  review("depicts")

garamantas_claims$depicts_review_1[2] <- "group of people"

Although the domain differs, the review algebra remains unchanged:

claims
    ↓
review
    ↓
explain
    ↓
finalise_review

garamantas_reviewed <- garamantas_claims |>
  explain(
    activity = "expert review",
    agent = person("Alice", "Curator", role = "rev"),
    used = "doi:example001"
  ) |>
  finalise_review()

print(garamantas_reviewed)
#>   claim_id instance_of                            resources depicts_candidate
#> 1        1  photograph https://garamantas.lv/en/file/475833          building
#> 2        2  photograph https://garamantas.lv/en/file/471397   group of people
#> 3        3  photograph https://garamantas.lv/en/file/475825          building
#>   depicts_review_1
#> 1         building
#> 2  group of people
#> 3         building

attributes(garamantas_reviewed)
#> $names
#> [1] "claim_id"          "instance_of"       "resources"        
#> [4] "depicts_candidate" "depicts_review_1" 
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $id
#> [1] "claim_id"
#> 
#> $scope
#> [1] "instance_of"
#> 
#> $subject
#> [1] "resources"
#> 
#> $reviewable
#> [1] "depicts"
#> 
#> $review_column
#>             depicts 
#> "depicts_candidate" 
#> 
#> $prov_id
#> [1] "candidate" "review_1" 
#> 
#> $prov_activity
#>       candidate        review_1 
#>        "create" "expert review" 
#> 
#> $prov_agent
#>             candidate              review_1 
#>                    NA "Alice Curator [rev]" 
#> 
#> $prov_used
#>        candidate         review_1 
#>               NA "doi:example001" 
#> 
#> $class
#> [1] "claims_df"  "data.frame"

In both examples the same three functions do the work: review() creates
successive review versions, explain() records their provenance, and
finalise_review() promotes the current reviewed values into the next
candidate version.

The aim of the review package is to increase the trustworthiness,
reviewability, and reusability of tabular data before they are released or
exchanged with collaborators. Rather than replacing existing data
engineering workflows, it adds a lightweight semantic review layer that
preserves both the review history and its provenance.

Once a review has been completed, the resulting data can be published,
archived, or exchanged using existing metadata standards. The companion
dataset package provides an R-native representation for describing and
serialising released datasets, allowing reviewed data to be exchanged in a
reproducible and interoperable form.