Pyfc4 Intro

Now that some of the scaffolding and ground-leveling is done around a project I’ve been working on, it felt like a good time to step back and try to philosophize and narratize (that should be a word). I imagine waxing more on the topic in the future, perhaps on specific aspects, but this might serve as a general introduction and reckoning.

In short, the project is a python client for Fedora Commons 4 (FC4), boringly called “pyfc4”. In my defense, I’ve worked on projects with some pretty fun names: “Ouroboros”, a middleware for wrangling the chaos of a digital collections universe; “ichabod”, a headless browser that rendered pages and used fuzzy matching to determine if the content had changed much; to name a couple. But this felt like a project that deserved a sane, calm, simple name.

Why’s that? Because, at best, pyfc4 should get out of your way.

Background

Over the past few years, we’ve built a digital collections platform/infrastructure I’m quite proud of around Fedora Commons 3.x, Solr, and the other usual suspects. And we did this using primarily python. The decision to use python was not calculated, other than we had more collective experience with the language at the time, and were blissfully unaware how much work rolling our own middleware would be, as opposed to using other ecosystems like Hydra (now Samvera) or Islandora, both really excellent and vibrant codebases and communities.

We knew Fedora was going to be in the mix, but how to communicate with it? After some initial work making raw HTTP calls to Fedora’s REST API, we stumbled on “eulfedora” (https://github.com/emory-libraries/eulfedora), the phenomenal, intuitive, and joy-to-use python client for Fedora Commons 3.x. Eulfedora describes itself as:

eulfedora is a Python module that provides utilities, API wrappers, and classes for interacting with the Fedora-Commons Repository in a pythonic, object-oriented way, with optional Django integration.

eulfedora.api provides complete access to the Fedora API, primarily making use of Fedora’s REST API. This low-level interface is wrapped by eulfedora.server.Repository and eulfedora.models.DigitalObject, which provide a more abstract, object-oriented, and Pythonic way of interacting with a Fedora Repository or with individual objects and datastreams.

This description accurately and succinctly describes what eulfedora is, and does, but does not attempt to justify its existence or utility in the face of alternatives. It leaves out things like handling encoding, large file streaming, alerting users to identifiers (PIDs) that break Fedora’s PID rules, and the list goes on. Like other language-specific API wrappers out there, eulfedora wrangles the specification of its target, and that of the target’s API, into patterns and information reporting that are helpful to the user.

Eulfedora is integral to our digital collections platform, and has functioned almost without reproach for years now. Sure, we’ve encountered some strange edge cases, and even submitted an update or two, but 99.99% of the time, eulfedora does precisely what you expect, and how you expect it should happen.

And so, it’s unsurprising that eulfedora would be an inspiration for pyfc4: a python client to serve a very similar but, worth noting, not identical role for Fedora Commons 4.x and its complete overhaul of philosophy and architecture.

One area where pyfc4 distinctly deviates from eulfedora is that it currently has no integrations or configurations specific to Django. That time may come, but not yet.

Overview

pyfc4 is envisioned as a lightweight python client for interacting with an instance of Fedora Commons 4.x. The idea was to map the majority, if not all, of Fedora’s API endpoints and functions to methods and patterns within pyfc4.

Similar to eulfedora, it imagines interacting with Fedora via a Repository instance, and then working on and managing Resources. It can be broken down into the following high-level models:

  • Repository
    • Transaction
  • API
  • Resource
    • ResourceVersion
    • RDFSource
      • Container
        • BasicContainer
        • DirectContainer
        • IndirectContainer
    • NonRDFSource

Very generally, an instance of Repository (call it repo) is passed to each Resource instance. Resource, in turn, sits atop a hierarchy of classes that aligns with the Linked Data Platform (LDP), which Fedora Commons 4.x itself implements.
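To make that shape a little more concrete, here is a minimal, self-contained sketch of the LDP branch of the hierarchy. It is not pyfc4’s actual code: the constructor arguments, the root and uri attributes, and the URI-building logic are assumptions made purely for illustration, and the real classes carry far more behavior.

```python
class Repository:
    """Holds connection details for one Fedora instance (hypothetical fields)."""

    def __init__(self, root):
        # normalize the root so resource URIs can be built by concatenation
        self.root = root.rstrip('/') + '/'


class Resource:
    """Base class: every resource keeps a reference to its repository."""

    def __init__(self, repo, path):
        self.repo = repo
        self.uri = repo.root + path.lstrip('/')


class RDFSource(Resource):
    """LDP RDF Source: a resource whose state is an RDF graph."""


class NonRDFSource(Resource):
    """LDP Non-RDF Source: a binary resource."""


class Container(RDFSource):
    """LDP Container: an RDF source that can contain other resources."""


class BasicContainer(Container):
    """LDP Basic Container."""


class DirectContainer(Container):
    """LDP Direct Container."""


class IndirectContainer(Container):
    """LDP Indirect Container."""


# the repository instance is handed to each resource it manages
repo = Repository('http://localhost:8080/rest')
foo = BasicContainer(repo, 'foo')
print(foo.uri)
print(isinstance(foo, RDFSource), isinstance(foo, NonRDFSource))
```

The point of the sketch is only that a BasicContainer is also a Container, an RDFSource, and a Resource, and that each resource reaches the server through the repo it was given.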

pyfc4 supports transactions, versioning of resources, and the more mundane but important CRUD operations for resources. One example of the utility of pyfc4, to pick on the “U” from CRUD, would be updating resources.

Updates

Updates in FC4, if not overwriting the totality of a resource’s metadata (which is not preferred for a number of reasons), are performed by issuing PATCH requests containing a SPARQL update query. Now, as with any convenience pattern or method from a client library, this is quite doable by hand, step by step: determining which triples have been added, removed, or modified, then building the SPARQL query with the correct DELETE {}, INSERT {}, WHERE {} syntax. But you’re going to be doing this often. pyfc4 approaches this common task in the following way:

  1. when retrieving a resource, storing the current state of the RDF graph as self.rdf._orig_graph
  2. all local changes to the resource’s graph (adding triples, removing them, etc.) are made to a copy of that graph at self.rdf.graph
  3. when it comes time to update the server with local changes, a diff is calculated between the original graph and the modified one, and converted into a SPARQL query that is sent to the server via a PATCH request

It’s much more convenient to type foo.update() than to navigate those distinct phases each time, and a library like pyfc4 provides that convenience. It also provides a single point of interaction for updates that can be optimized and improved over time, instead of similar, but not identical, methods for updating resources. Just one example, but an example nonetheless of why a client is helpful as opposed to making requests to a raw API endpoint.
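The diff-then-PATCH idea from the three steps can be sketched in a few lines. This is a simplified illustration rather than pyfc4’s implementation: it assumes triples are plain tuples of strings already serialized in SPARQL syntax, and the function name build_sparql_update is made up for the example.

```python
def build_sparql_update(orig_triples, current_triples):
    """Return a SPARQL update that turns orig_triples into current_triples.

    Both arguments are sets of (subject, predicate, object) tuples whose
    terms are already serialized, e.g. '<http://example.org/s>' or '"text"'.
    """
    # triples only in the original graph were removed locally;
    # triples only in the current graph were added locally
    removed = sorted(orig_triples - current_triples)
    added = sorted(current_triples - orig_triples)

    def block(triples):
        return ' '.join('{} {} {} .'.format(*triple) for triple in triples)

    return 'DELETE {{ {} }} INSERT {{ {} }} WHERE {{}}'.format(
        block(removed), block(added))


subject = '<http://localhost:8080/rest/foo>'
orig = {
    (subject, '<http://purl.org/dc/elements/1.1/title>', '"Foo"'),
    (subject, '<http://licorice.org#best_color>', '"red"'),
}
current = {
    (subject, '<http://purl.org/dc/elements/1.1/title>', '"Foo"'),
    (subject, '<http://licorice.org#best_color>', '"black"'),
}

# only the changed triple shows up: the old value is deleted, the new one inserted
query = build_sparql_update(orig, current)
print(query)
```

A modified triple appears here as a delete of the old value plus an insert of the new one, and the untouched title triple never leaves the client. A real graph diff also has to deal with blank nodes, which a plain set difference like this one ignores.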

RDF / graph interactions

Another area where pyfc4 might make one’s life easier, if not speedier, is interacting with the RDF payloads that FC4 provides for a resource. pyfc4 parses this graph (handling a variety of different RDF serialization formats for input or output) and provides some convenient ways to access those triples.

One particularly handy way is an object-like approach. pyfc4 parses the graph from a resource and, based on the URIs of the predicates and the stored prefixes, groups the triples by prefix, with the objects accessible through dot notation.

For example, if foo is a retrieved BasicContainer resource:

# get ldp:contains 
In [6]: foo.rdf.triples.ldp.contains
Out[6]:
[rdflib.term.URIRef('http://localhost:8080/rest/foo/bar'),
 rdflib.term.URIRef('http://localhost:8080/rest/foo/baz')]

# get foaf:knows
In [8]: goober.rdf.triples.foaf.knows
Out[8]: [rdflib.term.URIRef('http://localhost:8080/rest/tronic/tronic2')]

# a more complete example of how this trickles through
# add namespace
In [9]: foo.add_namespace('licorice','http://licorice.org#')

# add triple
In [10]: foo.add_triple(foo.rdf.prefixes.licorice.best_color, 'black')

# update
In [11]: foo.update()
Out[11]: True

# now, object-like access to this triple parsed from graph
In [12]: foo.rdf.triples.licorice.best_color
Out[12]: [rdflib.term.Literal('black', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string'))]
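For the curious, the prefix-based dot access shown above could be built roughly like the following. This is a guess at the general technique, not a copy of pyfc4’s internals: the function name triples_by_prefix is hypothetical, and plain strings stand in for the rdflib terms pyfc4 actually returns.

```python
from collections import defaultdict
from types import SimpleNamespace


def triples_by_prefix(predicate_objects, prefixes):
    """Group objects as <prefix>.<local name> for dot-notation access.

    predicate_objects: iterable of (predicate URI, object) pairs
    prefixes: dict mapping prefix -> namespace URI
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for predicate, obj in predicate_objects:
        for prefix, namespace in prefixes.items():
            if predicate.startswith(namespace):
                # the part of the URI after the namespace becomes the attribute
                local_name = predicate[len(namespace):]
                grouped[prefix][local_name].append(obj)
                break  # first matching namespace wins
    return SimpleNamespace(**{
        prefix: SimpleNamespace(**names) for prefix, names in grouped.items()
    })


prefixes = {
    'ldp': 'http://www.w3.org/ns/ldp#',
    'licorice': 'http://licorice.org#',
}
pairs = [
    ('http://www.w3.org/ns/ldp#contains', 'http://localhost:8080/rest/foo/bar'),
    ('http://www.w3.org/ns/ldp#contains', 'http://localhost:8080/rest/foo/baz'),
    ('http://licorice.org#best_color', 'black'),
]

triples = triples_by_prefix(pairs, prefixes)
print(triples.ldp.contains)
print(triples.licorice.best_color)
```

Predicates whose namespace has no registered prefix are simply skipped in this sketch, which is one reason registering a namespace before adding a triple matters in the session above.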

Though pyfc4 is working nicely in the lovely clean space of testing and spec fulfillment, I’m hopeful that putting it to work for actual repository tasks will reveal other areas where convenience patterns and methods can be put in place. Perhaps even distilling common patterns into processes or approaches that would be faster than raw, and numerous, API calls.

Okay, enough for now. While I could go on, a dentist appointment looms, and that is perhaps the best excuse to keep this short and sweet for now. In the meantime, more documentation, examples, and information can be found on the github page: https://github.com/ghukill/pyfc4.

Drifting Gutters

We’ve all been there: you’ve just digitized a handful of gorgeous books, or equally likely, find yourself in possession of a thousand images, but the digitization resulted in opposing pages as a single image file. You’d like to ingest these books into your stunning repository and front-end with each page as a separate image file. Split each page 50/50 vertically? Piece of cake?

Think again. A sneaky side effect of many book scanners, when creating images of opposing pages, is a drifting gutter for the book where you’d like to split the page. This is a result of the aggregate page height on either side of the book shifting as pages are flipped. A 50/50 vertical split across all pages ends with early pages cropping too much from the right-hand page, nice crops around the middle, and left-hand pages overly cropped near the end.

To split a set of pages where the book gutter is drifting from left to right, one needs to move the centerline for the crop from left to right across the pages when splitting. To this end, I whisked together a small, simple python script to crop a set of images, moving the centerline point accordingly. It accepts just a couple of inputs: percentage from left-right to begin, and percentage from left-right to end on. I’ve shared this script as a gist here.

For example, in a recent book, after some trial and error, it worked to begin cropping at 51%, and end cropping at 52%. Said another way, the centerline was at 51% from the left-hand side in the beginning, and drifted to 52% by the end of the book. The script simply counts the number of pages inputted, takes the percentage change, and applies an incrementing crop across each page. Voila!
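The arithmetic is simple enough to sketch here. This is an illustration of the incrementing-crop idea and not the gist itself: the function names are made up, the image dimensions are invented, and the actual image handling is left out. The boxes are returned as (left, upper, right, lower) tuples, the form that Pillow’s Image.crop accepts, on the assumption that something like Pillow would do the cropping.

```python
def centerline_fractions(page_count, start_pct, end_pct):
    """Linearly interpolate the split position across page_count images."""
    if page_count == 1:
        return [start_pct / 100]
    # spread the total drift evenly over the gaps between images
    step = (end_pct - start_pct) / (page_count - 1)
    return [(start_pct + step * i) / 100 for i in range(page_count)]


def split_boxes(width, height, fraction):
    """Return (left, upper, right, lower) boxes for the two pages."""
    split_x = round(width * fraction)
    left_page = (0, 0, split_x, height)
    right_page = (split_x, 0, width, height)
    return left_page, right_page


# three scans, each 4000 x 3000 pixels, gutter drifting from 51% to 52%
fractions = centerline_fractions(3, 51, 52)
boxes = [split_boxes(4000, 3000, fraction) for fraction in fractions]
for left_page, right_page in boxes:
    print(left_page, right_page)
```

Linear interpolation assumes the gutter drifts at a steady rate from the first scan to the last; a book with a thick insert or uneven paper stock might need the start and end values tuned per section.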

The result: the properly cropped page on the left, the improperly cropped page on the right. In this example, no text is cut off by the improper crop, but one can imagine this problem could errantly crop text in pages if not accounted for, and it does get worse over the course of large books.

All kinds of improvements could be made, but it works in a pinch.

Syntax Highlighting

I learned today that Jekyll comes with some built-in syntax highlighting when firing through GitHub Pages. So, wanted to take ’er for a spin…

def print_reaction(reaction):
    print(reaction)
        
reaction = "This is amazing!"
print_reaction(reaction)

I like what I’m seeing.