Deriving Tasks From System Metrics

Curious whether a persnickety OAI server had been behaving itself at the tail end of last week, I checked in over the weekend on some system monitoring I’d set up a while ago using the supremely handy nmon library and its visualizing companion, nmonchart. I encountered the following graph:

Despite all the activity, this confirmed for me that the OAI server did not have the CPU pegged. But what I found even more interesting was that it showed quite precisely when I ran a full index of objects from Fedora to Solr: 8:16am to 12:43pm.

While I didn’t need to know this information, per se, I was struck by how much of our activity as humans, and activity carried out on behalf of processes, can be easily divined from CPU metrics. This is an incredibly rough analysis, of course, but I always delight in the insights data can offer.
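For the curious, something like this rough Python sketch is what I have in mind: it assumes a standard nmon capture file (the filename and the 50% busy threshold are made up for illustration) and simply reports the first and last snapshots where user plus system CPU crossed that line.

    # Rough sketch: find the window of sustained CPU activity in an nmon capture,
    # to approximate when a long-running task (e.g. a full reindex) started and stopped.
    import csv

    NMON_FILE = "server_capture.nmon"  # placeholder path
    BUSY_THRESHOLD = 50.0              # percent CPU (user + sys) counted as "busy"

    timestamps = {}   # snapshot tag (e.g. "T0042") -> "HH:MM:SS"
    busy_tags = []    # snapshot tags where CPU exceeded the threshold

    with open(NMON_FILE, newline="") as fh:
        for row in csv.reader(fh):
            if not row:
                continue
            if row[0] == "ZZZZ" and len(row) >= 3:
                # ZZZZ,T0042,08:16:00,... maps each snapshot tag to a wall-clock time
                timestamps[row[1]] = row[2]
            elif row[0] == "CPU_ALL" and len(row) >= 4:
                # CPU_ALL,T0042,user%,sys%,wait%,idle%,...
                try:
                    user, system = float(row[2]), float(row[3])
                except ValueError:
                    continue  # skip the section's header row
                if user + system > BUSY_THRESHOLD:
                    busy_tags.append(row[1])

    if busy_tags:
        print("busy from", timestamps.get(busy_tags[0]), "to", timestamps.get(busy_tags[-1]))
    else:
        print("no sustained CPU activity above threshold")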

Perils Of Digital Preservation

What are the greatest perils of digital preservation?

  • the total collapse of our modern digital infrastructure, vanishing our digital artifacts and memories in one fell swoop?
  • small-scale hard drive failures and format obsolescence, surgically and quietly rendering our files inaccessible?
  • forgetting a single digital object during a software / hardware migration?

I had a near miss with the last one recently. It was a snowflake of an object. Without naming titles or identifiers, it was a book scanned as a one-off digital object. An important, interesting, and culturally valuable book. And this is precisely why it got lost in the shuffle.

During migration, or even general record-keeping, auditing, and intellectual control, we focus on the big collections. Or we cut our teeth on the small ones, working up to the big push (which makes sense when, perhaps, one collection is literally 1,000x larger than the small ones). We measure our success in achieving a 100% migration rate – both in quantity and fidelity – in groups: “got 2293/2293 for that collection, 422/422 for the other, and 16/16 for that little tyke over there,” and so on, and so forth.

But what about those other objects that have made it into our purview and custody? The objects that have no collection, that have no measurement of quantity outside of their self-reflexive parity? Those are the ones at risk.

I have likened it to “hitching your wagon” to a known entity. Or “safety in numbers.” The list goes on. The moment we tether an object to another, preferably a bunch, it benefits from the visibility of the herd.

The original files for the object were always safe, but all the work that went into ingest, creating derivatives, and modeling for shifting platforms would have been lost. Not to mention any additional content, metadata, or insight that might have accompanied the object as it matured as a digital object.

I never did write up our conversion from single-object ebooks to ebooks modeled as multiple objects, but it was quite an undertaking. Not only did this object in question not belong to a collection, but once it had missed the ebook migration, it had two strikes against it. It no longer registered in QA and auditing as an “ebook”; instead, it drifted into the tepid abyss of non-intellectually controlled items.

Do you “have” an object if not controlled?

Is every connection to a collection or another object a distinction in an otherwise entropic stew of files on a server?

There are all kinds of safeguards and practices against “misplacing” a digital object like this, but in some way, don’t they all involve tethering? Even if but a sliver of metadata that reads, “I am object, hear me roar”?
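In practice, even a crude audit can act as that sliver. As a rough sketch, assuming a Solr index where collection membership is recorded in an is_member_of_collection field (the core URL and field name below are placeholders, not our actual configuration), one could ask Solr which objects claim no collection at all:

    # Illustrative audit sketch: list indexed objects with no collection tether.
    import requests

    SOLR_SELECT = "http://localhost:8983/solr/collection1/select"  # assumed core URL

    params = {
        "q": "*:*",
        "fq": "-is_member_of_collection:[* TO *]",  # docs missing any collection relationship
        "fl": "id",
        "rows": 100,
        "wt": "json",
    }

    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    orphans = [doc["id"] for doc in resp.json()["response"]["docs"]]

    print(f"{len(orphans)} objects with no collection tether:")
    for pid in orphans:
        print("  -", pid)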

Caching

A strange thought today while working with a colleague on tuning caching for our digital collections with Varnish.

We have been working to cache thumbnails and single item pages, and in the process I just about physically tripped over the interesting difference between caching website resources and archiving a rendered version of a website.

To cache a single item page, we have been experimenting with using Python to make headless HTTP requests to our front-end PHP framework, Slim. I was delighted that a single request would put into motion the reconciling work that Slim does for a single item page, including a couple of API calls to our backend, and then leave that rendering cached as a static HTML response for future visits. Awesome.
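A minimal sketch of that kind of cache warming, assuming item pages live at /item/<identifier> behind Varnish (the hostname, route, and identifiers below are placeholders rather than our actual setup):

    # Warm the Varnish cache by requesting each item page once.
    import requests

    BASE_URL = "http://digital.example.edu"          # assumed front-end host
    item_identifiers = ["example:1", "example:2"]    # normally pulled from an API or index

    for identifier in item_identifiers:
        url = f"{BASE_URL}/item/{identifier}"
        resp = requests.get(url)
        # An X-Cache style header (depending on the Varnish VCL) hints whether
        # Varnish answered from cache or passed the request to the Slim backend.
        print(resp.status_code, resp.headers.get("X-Cache", "?"), url)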

But on testing today, we noticed that a “preview” image was not cached, and still loaded cold the first time. Actually, a handful of things were not cached. Anything the browser requested after our front-end framework had delivered its payload had not been cached by that earlier item page caching. Thinking it through, this is expected! But it was interesting, and got the wheels turning…

What if we were to use a headless browser to render the page? Something like Selenium, or Splash, one of my favorites from the wonderful people at ScrapingHub, or any of the myriad other headless browser options out there. What would happen then? It was in thinking this through that it became clear it would work for caching the entirety of the page, but not in the way I had originally anticipated.

When I think of headless browsers and the amazing things they do, one product is the HTML of the page, fully formed even after JavaScript Ajax calls (which are incredibly common now). However, I had not deeply considered what happens to other resources, like images pulled in via <img> tags. What do headless browsers do with these? Are they content to leave them as-is, or do they pull in the binary bits and drop those where the image would have landed? Interesting in its own right, but there was more!

Firing off a headless browser for a single item page – one that contains at least one additional image request via an <img> tag – should trigger the HTTP requests needed for Varnish to cache those URLs. So, if one were to load that single item page after a headless browser already had, one would not receive the entirety of the page pre-rendered the way headless browsers provide it, but would instead just be delighted with the virtually instant response of any HTTP requests the page needed.
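Something like the following sketch is what I am imagining, using Selenium with headless Chrome as one option among many (the URL is a placeholder):

    # Drive a headless browser through a single item page so that every
    # sub-resource it pulls in (thumbnails, preview images, Ajax calls)
    # passes through Varnish and gets cached.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")

    driver = webdriver.Chrome(options=options)
    try:
        # Loading the page triggers the <img>, CSS, JS, and Ajax requests;
        # each one is an HTTP request Varnish can now cache, even though we
        # discard the rendered HTML the browser assembles.
        driver.get("http://digital.example.edu/item/example:1")
    finally:
        driver.quit()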

Which introduces this interesting area between raw, completely unrendered pages and archived instances (like WARC files). If we cache each HTTP request the page requires, the only thing we leave to the browser is actually putting all the pieces together as a whole (including firing JavaScript).

I realize, as I type this out, that some of the nuance of the insight may be lost without more discussion, but suffice it to say, caching is an interesting and ceaselessly beguiling beast.