Bag Validation


Something didn’t happen today, and it was incredibly validating. No validation pun intended here. Honestly.

We’ve been ingesting ebooks into our digital collections platform for quite some time now, and one perennial problem we face is that of malformed books. This can come in many forms, and it hinges on how we structure our digital ebooks. At a very high level, each physical page gets about five digital representations:

  • the page image, as a TIFF file
  • raw text from page, as a text file
  • coordinates of words on the page, as ALTOXML file
  • PDF of the page, with words overlaying the page image
  • HTML of the page, including some limited layout (this one is not great, may be deprecated soon)

So, for a 100 page book, you might – should – end up with 500 distinct files: 001.tif, 001.pdf, 001.xml, 001.txt, 001.html, 002.tif, you get the point.

Before we had any ebook specific checks in place, it was not uncommon for a book to enter the ingest workflow missing a singular PDF, XML, or other file relating to a page. Or even, an entire page (001,002,004, no 003). This would result in ebooks ingested, but missing key components that would rile things down the pipeline in highly annoying and unforeseen ways. From a preservation standpoint, it was also not ideal to allow missing derivatives to slip through.

But those days are mostly over. We have included some bag validation for each different kind of content type in our digital collections, that look for specific properties. For example, an Image object should roll through with an original image, a JP2 derivative, a thumbnail, etc. If one of those is missing, it fails the validation on that datastream. For ebooks, we’re looking for parity of derivative file counts for each page. If a page comes through missing something, we get notified in our Ingest Workspace (fodder for another post), and that bag (object) is prevented from getting ingested.

In this way, we can let 44/45 perfectly good books get ingested, then diagnose the rogue baddie. The wheels of digitization and access roll on, and we identify things that need fixing. The screenshot above shows this check firing on a batch today, long since forgetting about putting it in place. Great huzzahs!


Long ago, and far away, I had a thousand other blogs, all lost to the sands of time on the internet. Well, not lost per say, more abandoned as I realized I would not be able to faithfully shepard them along the winding roads of internet time. My goal was to consolidate here, hunkering down in the welcoming leaves of markdown and GitHub.

In the shuffle, however, I lost a blog I was most fond of, a simple “Word of the Day” or “For the Word”, or something along those lines. It was posts dedicated to a single word, a meditation on the wonderful acorns of knowledge and enshrined history that exist within a single word. And so, upon reading a word last night that fit the bill, I’d like to bring it back here. And so, without further ado…


The impetus for blogging about this word can be traced back to a recent trip to Denton, TX to visit an old friend. While there, we saw the legendary performer Paul Slavens perform. He takes money from the audience and makes up songs on the spot based on song titles or themes they scribble on bar napkins.

Some estimates have him at more than 2,000 improvised songs in the last 20 years. So, needless to say, there are plenty of examples on YouTube. But, despite what you might see in the following video, it’s hard to capture his charm and wit that exists between songs; the real appeal and virtuoso.

Wikipedia says this about him,

“In the mid 90s Slavens began creating improvisational songs based on audience suggestions, and has created an estimated 2000 songs over the last 2 decades, many recordings exist, although few have been released. Often these songs are humorous in nature and can be quite ribald.”

And such was my (re)acquaintance with ribald. Without any objective definition, I knew precisely what this word meant. How is that possible? Moreover, can a definition ever replace this initial correlation of ribald that I now hold?

In Philosophical Investigations, Wittgenstein opens up with a quote from St. Augustine,

“The individual words in language name objects—sentences are combinations of such names. In this picture of language we find the roots of the following idea: Every word has a meaning. This meaning is correlated with the word. It is the object for which the word stands.”

But, interestingly, immediately begins to push against this idea, suggesting it’s a far too simplistic understanding of language, and by proxy, words. I think it’s safe to assume that Wittgenstein would support the idea that a word does not have a single meaning, but is actually given meaning from context, learning, and much more.

And so, returning to ribald, this word was perfectly defined for me through a personal experience and the persona of Paul Slavens. Now, sure, of course, I realize there is an agreed upon definition of ribald. From the venerable OED, for ribald as noun:

“1. a. In the medieval period: a person of low social status, esp. regarded as worthless or good-for-nothing; a rascal, vagabond. Also as a form of address. Now arch. or hist.

“2. A foul-mouthed or blasphemous person; one who uses offensive, irreverent, or scurrilous language; one who jeers or jokes in a rude or lewd way. Now rare.”

“3. A promiscuous or loose woman; a wanton, a harlot. Obs.”

“4. A wicked, dissolute, or licentious person; a villain. Now arch. and regional (Sc.).”

and so on, and so forth. Also from the OED, for ribald as adjective:

“1. Of a person or persons: (in early use) lewd, coarse, or licentious in language or behaviour; deliberately and offensively abusive or impious; (now usually in weakened sense) given to bawdy, vulgar, or irreverent talk or behaviour; amusingly rude.”

“2. Of language, humour, etc.: coarse, vulgar, scurrilous, irreverent; (subsequently esp.) referring to sexual matters in an amusingly rude or irreverent way. Now the most common sense.”

of which appears to be much more common.

So we have these definitions, and they are, unsurprinsgly, expansive to say the least. We love words like this; forged in the streets of pre-industrial London, tumbled around during the bawdy – notice the similar ‘ald’, ‘awd’ sounds, coincidence? – early 1900’s. And yet, with all that history and lyrical verse dedicated to this word, I meet it halfway with my intuited, pop-culture, internal definition.

It reminds me of work we’re doing with objects in digital repositories. There is a tension between front-ends that extract disconnected information from disparate sources, reconstituting client-side for a conceptual whole, vs. opinionated server-side models that pull some suggestions for stylings from here and there, but for the most part “push” or impell themselves through a series of pasta makers (I’m consciously choosing to move away from meat-based metaphors, you know, for the planet). The net effect is often the same to the unaware user, but the mechinations that move the system are fundamentally different in their approach. Both have pros and cons. And relevant to this discussion, there is very rarely an ideal state that adheres entirely to one philosophy or another. These systems we build and work with are muddy, confused over time, and contorted to work in the real word.

Much like words; those great and wonderful puzzles.

ps. all typos and mis-spellings are my own, no editing has been performed.



I’m excited to say, work has commenced on a rewrite of Digital Collection’s primary API (pinning this link to a commit before the API disappears as we know it). I also use the term “API” a bit loosely here, as it has served almost exclusively for internal use, powering our decoupled front-end. Now an API that is used wholly internally certainly qualifies under the myriad of API definitions out there. Where I challenge that coveted title is the lack of consistency and documentation it has exhibited until this point.

And that’s okay! Which, if one hasn’t noticed already, is a running theme around here.

The API grew piecemeal with the rest of the ecosystem. Where once it queried Solr directly for an object’s metadata, later it would retrieve that Solr doc via a method buried in an Ouroboros content-type object. Where once we would fire off multiple API functions to fire – member of collections, related objects, comprehension of images, etc. – later they were grouped under a singleObjectPackage class that aggregated and returned all that information in single, sprawling response. It’s come a long way, and has proved to be extremely versatile, reliable, and fun to build.

But as mentioned in a previous post, we are in the process of re-building / refreshing the front-end, and the opportunity presented itself to rework, refine, and wildly improve a meandering bit of code.

With this opportunity to completely restructure the API, it’s a great time to leverage a library that might help with building out an API. After a bit of poking around, Flask-RESTful emerged as a very enticing option, and the route I think we’re going. For a variety of reasons:

ability to handle client content negotion (with a bit of finagling)

One of our goals with the Digital Collections is to treat our collections as data in many ways (a quick Googling will reveal the blossoming ideas and literature around this idea, perhaps fodder for another typing). Mark Phillips from UNT has a neat post about hacking their resource URLs, that left a lasting impression. Excited about the thoughtful way in which the URL could be leveraged for different views and pieces of a resource. It percolated for a bit until this opportunity for API and front-end reworking, and the simulataneous emphasis on collections as data, presented itself.

Without losing the thread too quickly here, I would like our API for routes such as /item/wayne:foobar/metadata or /item/wayne:foobar/? to return metadata in JSON form, but then have a route like /item/wayne:foobar/txt – if it’s a book – return raw text, with a text/plain Content-Type header. More to the point of content negotiation, let the client request different forms of the same information at the same route.

parameter parsing built-in

Good grief, this is just a no brainer. We can enforce parameters types (string, int, etc.), and automatically return responses specific to a particular parameter, with appropriate HTTP codes, as well. Sign me up.

and more

This is really just the tip of the iceburg. Instead of wiring and hand-rolling each response or error, we can pipe our data through this library in a coherent fashion each time. Moreover, this pattern I keep encountering, I would like to put the kibosh on:

  1. dreaming up new thing, exploring options
  2. begin to build from scratch, stumble on other libraries
  3. great excitement
  4. realize that libraries have a learning curve
  5. sense of “meh”, continue building from scratch
  6. progress slows as better understanding of requisite bits and pieces emerges
  7. continue building…
  8. cobbled code meets minimum requirements
  9. begin to improve and refactor, realize that original libraries held functionality all along
  10. realize spending much more time on writing bits and pieces than would have spent learning library
  11. vow to never make this mistake again

And SO, mistake made not again! Flask-RESTful has been an utter delight thus far, and looking forward to pressing on.