I’m excited to say that work has commenced on a rewrite of Digital Collections’ primary API (pinning this link to a commit before the API disappears as we know it). I also use the term “API” a bit loosely here, as it has served almost exclusively for internal use, powering our decoupled front-end. Now, an API that is used wholly internally certainly qualifies under the myriad API definitions out there. Where I challenge that coveted title is in the lack of consistency and documentation it has exhibited until this point.
And that’s okay! Which, if one hasn’t noticed already, is a running theme around here.
The API grew piecemeal with the rest of the ecosystem. Where once it queried Solr directly for an object’s metadata, later it would retrieve that Solr doc via a method buried in an Ouroboros content-type object. Where once we would fire off multiple API calls – member of collections, related objects, comprehension of images, etc. – later they were grouped under a singleObjectPackage class that aggregated and returned all that information in a single, sprawling response. It’s come a long way, and has proved to be extremely versatile, reliable, and fun to build.
But as mentioned in a previous post, we are in the process of re-building / refreshing the front-end, and the opportunity presented itself to rework, refine, and wildly improve a meandering bit of code.
With this opportunity to completely restructure the API, it’s a great time to leverage a library that might help with building it out. After a bit of poking around, Flask-RESTful emerged as a very enticing option, and the route I think we’re going, for a variety of reasons:
ability to handle client content negotiation (with a bit of finagling)
One of our goals with the Digital Collections is to treat our collections as data in many ways (a quick Googling will reveal the blossoming ideas and literature around this idea, perhaps fodder for another typing). Mark Phillips from UNT has a neat post about hacking their resource URLs that left a lasting impression. I was excited by the thoughtful way in which the URL could be leveraged for different views and pieces of a resource. It percolated for a bit until this opportunity for API and front-end reworking, and the simultaneous emphasis on collections as data, presented itself.
Without losing the thread too quickly here, I would like our API, for routes such as /item/wayne:foobar/?, to return metadata in JSON form, but then have a route like /item/wayne:foobar/txt – if it’s a book – return raw text, with an appropriate Content-Type header. More to the point of content negotiation: let the client request different forms of the same information at the same route.
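The idea can be sketched without any framework at all: one resource, multiple serializations, with the requested form (a route suffix or an Accept header) selecting the representation and its Content-Type. Everything below – the sample pid, the stored fields, the function name – is hypothetical, just a minimal sketch of the routing concept:

```python
import json

# Hypothetical stored object: one underlying resource, two representations.
ITEMS = {
    "wayne:foobar": {
        "metadata": {"pid": "wayne:foobar", "title": "A Sample Book"},
        "fulltext": "It was a dark and stormy night...",
    },
}

def negotiate(pid, form="json"):
    """Return ((body, content_type), status) for one resource.

    form="json" mimics /item/<pid>/ and form="txt" mimics /item/<pid>/txt.
    """
    item = ITEMS.get(pid)
    if item is None:
        return ("Not Found", "text/plain"), 404
    if form == "json":
        return (json.dumps(item["metadata"]), "application/json"), 200
    if form == "txt":
        return (item["fulltext"], "text/plain"), 200
    # Representation we don't offer for this resource.
    return ("Unsupported representation", "text/plain"), 406

# Same resource, two forms, two Content-Types:
(body, ctype), status = negotiate("wayne:foobar", "json")
(body, ctype), status = negotiate("wayne:foobar", "txt")
```

In Flask-RESTful the equivalent wiring lives in the routing and representation layer rather than a hand-rolled dispatcher, but the shape of the decision is the same.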
parameter parsing built-in
Good grief, this is just a no-brainer. We can enforce parameter types (string, int, etc.), and automatically return responses specific to a particular parameter, with appropriate HTTP codes as well. Sign me up.
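The behavior described above can be illustrated with a tiny, framework-free stand-in: declare each parameter’s type once, and get back either coerced values or a 400-style error. This is a sketch of what the library does for us, not Flask-RESTful’s actual API:

```python
def parse_args(raw, spec):
    """Coerce raw string params against a {name: type} spec.

    Returns (args, None) on success, or (None, (message, 400)) on the
    first missing or mistyped parameter -- roughly the enforcement the
    library's parameter parsing provides automatically.
    """
    args = {}
    for name, typ in spec.items():
        if name not in raw:
            return None, (f"Missing required parameter: {name}", 400)
        try:
            args[name] = typ(raw[name])
        except (TypeError, ValueError):
            return None, (f"Parameter {name!r} must be of type {typ.__name__}", 400)
    return args, None

# e.g. a query string ?page=2&q=cars checked against an int/str spec:
args, err = parse_args({"page": "2", "q": "cars"}, {"page": int, "q": str})
```

The win is that the "page must be an integer" response, with its HTTP code, comes for free on every route instead of being hand-rolled each time.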
This is really just the tip of the iceberg. Instead of wiring and hand-rolling each response or error, we can pipe our data through this library in a coherent fashion each time. Moreover, there’s a pattern I keep encountering that I would like to put the kibosh on:
- dreaming up new thing, exploring options
- begin to build from scratch, stumble on other libraries
- great excitement
- realize that libraries have a learning curve
- sense of “meh”, continue building from scratch
- progress slows as better understanding of requisite bits and pieces emerges
- continue building…
- cobbled code meets minimum requirements
- begin to improve and refactor, realize that original libraries held functionality all along
- realize spending much more time on writing bits and pieces than would have spent learning library
- vow to never make this mistake again
And SO, mistake made not again! Flask-RESTful has been an utter delight thus far, and looking forward to pressing on.
Choosing a framework for our v2 Digital Collections front-end
We are currently in the midst of refreshing our Digital Collections front-end. It has been a workhorse for us, still functions, and still looks respectable, but the foundation is beginning to crack as we push the original design and corners of spaghetti code beyond their original visions. Over time, we have bolted on:
- user login and authentication
- iteratively improved search
- full-text vs. structured metadata refinement
- improvement of facets
- collection browsing
- serials browsing and interfaces
- inter-linked learning objects
- introduction of hierarchical content types such as archival materials and serials
- and the list goes on…
I’m proud of what we’ve built, something that is remarkably usable and cohesive given the breakneck pace of change and learning that ran parallel. It has survived entire re-imaginings of how digital objects are structured, a full-fledged API on the back-end that powers it, migration of servers, introduction of vagrant provisioning, you name it.
But its time has come. As we push into more digital objects, we’ve started to notice some performance hits that are a result of inefficient JS tangles and approaches. Our initial approach was a “lightweight” JS front-end that relied heavily on AJAX calls to retrieve information from our back-end API, which was then drawn on the page with jQuery-Mustache templating. We’ve made a handful of improvements that keep it humming along, but any substantial changes would require reworking a lot of ugly JS code. And I can say that, because we wrote it all.
The visual style is also feeling a bit dated, or perhaps just stale. It needs a refresh too.
And then there is the important issue of sustainability. We know the ins and outs of the JS code, but thar be dragons, and it feels near impossible to document in a lucid fashion.
So, the time is right. We have at our disposal someone who is going to put together front-end wireframes that we can use to wire and implement. The next big decision: what kind of organization and/or framework for the front-end?
We spent a bit of time going round and round, discussing the pros and cons of emerging JS, Python, and other frameworks, all while simultaneously cognizant that we may migrate to a more turn-key solution down the road, if projects like Hydra-in-a-Box provide a truly palatable, kit-and-caboodle option. Another goal, briefly alluded to above, is sustainability in a front-end; something that can be worked on, improved, fixed, and loved for some time.
I can’t believe I’m typing this, but we are starting to home in on using a PHP framework, considering Slim and Lumen at this point. Why a PHP framework? Why not Flask, to augment the other Python components in the stack?
For a combination of reasons.
First, PHP is a language commonly used here in the libraries. More people know it now, and though you could debate this a bit, it’s probable that anyone coming in later will at least be familiar with PHP. Perhaps you could say the same about Python, but as long as the website is PHP based, we’ll have people “in shop” who know PHP. Perhaps the same can’t be said for Python. And that’s important.
Second, we would like to keep front-end and back-end cleanly separated. At our initial wireframing meeting, the individual creating a working wireframe leveraged our quirky and undocumented API and created a working demo. That was amazing. It reinforced the idea of treating our collections as data, maybe even first and foremost. Front-ends will come and go, amaze and disgust, but our underlying, structured digital objects will remain. An organized API for access to those materials means a multitude of front-end interfaces are possible, and migration down the road is easier.
We also know ourselves well enough that, if we created a Python-based Flask front-end, we would inevitably start importing libraries and models directly from the back-end Ouroboros ecosystem. While this may be programmatically efficient in some ways, maybe faster in others, it would muddle the clean lines between our back and front ends that we would like to maintain.
And so, PHP is looking good. We still get the following from a PHP framework:
- URL routing: as easy to return machine-readable data as rendered pages
- built-in templating: likely with syntax nearly identical to Jinja in Flask
- models: make a nice connection between front-end models and API
- ORM: room to grow into user accounts, etc.
- conventions to organize our code
- and many more these tired fingers haven’t yet gotten to
Undoubtedly there will be updates and twists in this adventure toward a new front-end – no small one being a complete rewrite of our digital collections API – but it’s exciting to have a path to explore at this point.
Archival Material Ingest Workflow
It has taken quite some time, but we’ve honed in on a workflow for ingesting archival materials from the Reuther Library – aka, the University Archives here on campus – into our Fedora Commons based repository. Turns out, the following is all there is to it!
Rest assured, there is more to those arrows than immediately meets the eye. Let’s dive in.
Our goal was to take digitized archival materials from the Reuther Library and provide preservation and access via our Digital Collections infrastructure. At a very high level, we were imagining one digital object in Fedora for each digital file coming from the archives.
But we were realistic from the get-go, it was going to be a much larger enterprise than file-by-file. How would we manage ingest and description at scale? To answer that, we need to look at the systems in play: Archivematica and ArchivesSpace.
Archivematica is a series of tubes. To be more precise, micro-services. Archivematica takes groups of files, runs them through a host of preservation micro-services – virus scanning, file format normalization, checksumming, etc. – and ties them up together in a tidy bow with an over-arching METS file. Archivematica is inspired by the venerable OAIS model (Open Archival Information System), and as such, speaks in terms of AIPs, SIPs, and DIPs.
We are using Archivematica as a means to get actual, discrete digital files from the Reuther into a format in which we can batch process them for ingest. Additionally, we get all the preservation-friendly treatment from the micro-services, and begin a paper trail of metadata about each file’s journey. It’s quite possible we’ll dig deeper into the affordances and functionality of AM (it’s time to shorten), but for now, it’s primarily a virus checking, checksumming, METS writing, server/building spanning networked pipeline for us. And it’s going to be great!
The next dish in the cupboard is ArchivesSpace. Saving a passionate exploration of just what ASpace is and represents for another time, it’s safe to think of it as the next generation of archival software, used to handle description, management of information around materials, discovery, and much more. Our partner in crime, the Reuther Library, is slowly switching to ASpace to handle their description and information management. It is a database-driven application that still exports finding aids for archival collections in EAD. We’ll be using those, with plans to leverage the API once deployment has settled down a bit.
Our involvement with ArchivesSpace is limited primarily to our metadata librarian who takes a manifest of the files as processed by Archivematica, an EAD of descriptive and intellectual organization metadata about the collection as exported from ArchivesSpace, and creates a new METS file meant to enrich / augment the original Archivematica METS file.
I’ve perhaps nestled myself too deeply in the weeds here, so let’s zoom out. We…
- take files from the archives via Archivematica
- these come with an AM-generated METS file that represents the “physical” hierarchy and organization of the digital files on disk
- we then take an EAD from ArchivesSpace that contains “intellectual” hierarchy and description about the materials, and synthesize a new METS file that represents the intellectual organization of the files - something we refer to as “AEM METS”, for “Archival Enrichment Metadata (AEM)”
- with the original digital files, AM METS, and AEM METS, we create bags on disk
- finally, ingest!
Where, and how, does this happen?
This occurs in an increasingly substantial corner of our administrative middleware, Ouroboros, called the “Ingest Workspace”. This Ingest Workspace – the real intent of this blog post, which I’ve managed to bury pretty far down here – is where we take these collaborative bits of information and assimilate them into sensible digital objects we can ingest. It’s the green box in the diagram above.
This process has taken a considerable amount of time, as it’s a complex process! So much so that the wonderful folks at the Bentley Historical Library received a grant to fund research into wiring these platforms together to aid in these kinds of ingest workflows (as you dig down, the details are different, but the end goals share many similarities – moreover, their blog and presentations at conferences have been a huge help for thinking through these processes).
The difficulty and complexity come, in large part, from reconciling the physical and intellectual arrangement of archival materials, or any materials for that matter. A quick and dirty example: a postcard from a friend is in a shoebox, in a drawer, in my desk. That is the stellar physical arrangement I have chosen. However, I may have it intellectually organized under meaningful materials –> postcards –> international. And that’s an easy example where the hierarchical levels align. How, then, might we digitize this item and provide access to a user, while also trying to contextualize the item within its intellectual and physical place?
We decided to drop one: physical. I should point out here that even within the “physical” hierarchy, we are often actually referring to digital and analog versions of the resource. The Reuther Library has made the very wise choice of organizing their digital files, where possible, to mimic their physical hierarchy, which makes this considerably easier. But suffice it to say that we retain both, in one form or another, such that we can work backwards and figure out where the original digital or analog version lives.
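The decision above – display and contextualize by intellectual arrangement, but keep enough to work backwards to the physical location – can be sketched as a simple record per item. The postcard example from earlier is reused here, and the identifier and field names are hypothetical:

```python
# One record per digital object: the intellectual path is primary (it
# drives display and context), while the physical path is retained only
# so we can work backwards to where the original lives.
ITEM = {
    "id": "reuther:postcard-001",
    "intellectual_path": ["meaningful materials", "postcards", "international"],
    "physical_path": ["desk", "drawer", "shoebox"],
}

def breadcrumb(item):
    """Render the intellectual hierarchy shown to users."""
    return " > ".join(item["intellectual_path"])

def provenance(item):
    """Reconstruct where the original digital or analog version sits."""
    return "/".join(item["physical_path"])
```

Nothing is thrown away; the physical arrangement simply stops being the organizing principle for access.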
To wrap up, we are choosing to organize and contextualize the files primarily based on their intellectual arrangement as suggested by ArchivesSpace. Which, in fine fashion to finish, explains the need to intertwine information from Archivematica and ArchivesSpace! This post comes on the heels of an early successful ingest with this workflow; we’re expecting all kinds of interesting twists and turns as we proceed – newly digitized materials, updates to metadata, pointing back to ArchivesSpace (which we have in our back pocket with ASpace identifiers), etc.