Caching

A strange thought today while working with a colleague on tuning caching for our digital collections with Varnish.

We have been working to cache thumbnails and single item pages, and in the process I just about physically tripped over the interesting difference between caching website resources and archiving a rendered version of the website.

To cache a single item page, we have been experimenting with using Python to make headless HTTP requests to our front-end PHP framework, Slim. I was delighted that a single request would put into motion the reconciling work that Slim does for a single item page, including a couple of API calls to our backend, and then save that rendering as a static HTML response for future visits. Awesome.
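
A minimal sketch of that cache-warming idea, assuming Varnish sits in front of the Slim app; the base URL, identifiers, and header names below are hypothetical:

```python
# Cache-warming sketch: request each item page once so Varnish caches
# the fully rendered HTML that Slim produces.
# The base URL and identifiers are placeholders for illustration.
import requests

BASE_URL = "https://digital.example.edu/item"
item_ids = ["item:0001", "item:0002"]

for item_id in item_ids:
    resp = requests.get(f"{BASE_URL}/{item_id}")
    # An X-Cache or Age header (if configured in the VCL) hints whether
    # the response came from cache or triggered a fresh render.
    print(item_id, resp.status_code, resp.headers.get("X-Cache", "n/a"))
```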

But on testing today, we noticed that a “preview” image was not cached, and had to load fresh the first time. Actually, a handful of things were not cached. Anything the browser requested after our front-end framework had delivered its payload had not been cached by that early item-page caching. Thinking it through, this is expected! But it was interesting, and got the wheels turning…

What if we were to use a headless browser to render the page, something like Selenium, or Splash, one of my favorites from the wonderful people at ScrapingHub? Or any of the myriad other headless browser options out there? What would happen then? It was in thinking this through that it became clear it would work for caching the entirety of the page, but not in the way I had originally anticipated.

When I think of headless browsers, and the amazing things they do, one product is the HTML of the page, fully formed even after JavaScript Ajax calls (which are incredibly common now). However, I had not deeply considered what happens to other resources like images, which are pulled in via <img> tags. What do headless browsers do with these? Are they content to leave them as-is, or do they pull in the binary bits and drop those where the image would have landed? Interesting in its own right, but there was more!

Firing off a headless browser for a single item page – one that contains at least one additional image request via an <img> tag – should trigger the HTTP request needed for Varnish to cache that URL. So, if one were to load that single item page after a headless browser already had, one would not receive the entirety of the page pre-rendered like headless browsers provide, but would instead just be delighted with the virtually instant response to any HTTP requests the page needed.
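
A sketch of how that warm-up might look with Selenium and headless Chrome; the URL is hypothetical, and the driver setup may differ depending on the Selenium and browser versions in play:

```python
# Sketch: load a single item page in a headless browser so that every
# sub-resource the page references (<img> tags, CSS, JS, Ajax calls)
# generates an HTTP request that Varnish can cache.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://digital.example.edu/item/item:0001")
    # At this point the browser has requested the page *and* its
    # sub-resources, so each of those URLs should now be warm in Varnish.
finally:
    driver.quit()
```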

Which introduces this interesting area between raw, completely unrendered pages and archived instances (like WARC files). If we cache each HTTP request the page requires, the only thing we leave to the browser is to actually put all the pieces together as a whole (including firing javascript).

I realize as I type this out, that some of the nuance of the insight may be lost without more discussion, but suffice it to say, caching is an interesting and ceaselessly beguiling beast.

Standards

I was reflecting today, while putting together some thoughts for class, that learning / memorizing a single standard is useful, but learning how to learn standards can be so much more valuable.

We have covered MARC and EAD, DACS, ISAD(G), and the list goes on. Obviously, each of these is critically important and uniquely interesting in its own right, but they do not lend themselves to a linear read. Standards encapsulate everything from history, to why, to how, to specific rules, to integration with other standards. Each standard is a complex network of information with multiple inroads, and cannot be treated as a linear text to be read once and understood in its entirety. Furthermore, standards may vary greatly from one to another. Some may explain tag libraries for EAD-based standards, others might attempt to codify norms of behavior or philosophies of a body into workflows and decision trees.

However different they might be, and however little they may lend themselves to a single-read-and-understand, standards also share a striking similarity: they are standards! They are attempting to make order out of chaos, to impose or suggest a way of doing things so that people and systems may be interoperable across space and time. And in this similarity, they open themselves up to those familiar with standards.

Just today I was reading the meeting notes from a Hydra / Sufia related working group that was interested in codifying the metadata principles and formats for Sufia. It was mentioned that a handful of well-made standards in other, related areas were using a fixed set of words to help standardize the standards! Words like MUST, SHOULD, ALLOW, that would help humans and machines parse the rules for this particular standard. The IIIF Image API is a nice example of a relatively new standard, where a considerable amount of work has been done to make sure it is expressive, succinct, and unambiguous. The discussions leading up to the standard, I’m sure, were quite lively and full of questioning. But the result is a standard with clear language and vision.
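
As a small taste of that clarity, the IIIF Image API pins every image request to a fixed URI pattern ({identifier}/{region}/{size}/{rotation}/{quality}.{format}). A quick sketch of assembling such a URL, using the 2.x syntax; the server and identifier here are made up:

```python
# Sketch: the IIIF Image API fixes the order and meaning of each URL
# segment, so a request can be assembled (or parsed) unambiguously.
# Server and identifier are placeholders for illustration.
base = "https://images.example.edu/iiif"
identifier = "item:0001"

# region / size / rotation / quality.format, per the Image API 2.x syntax
url = f"{base}/{identifier}/full/full/0/default.jpg"
print(url)  # full region, full size, no rotation, default quality, JPEG
```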

So, back to my notes, I got to thinking. The value of learning some of these standards is not to internalize their every twist and turn, their specific rules or exceptions, but instead to feel their radiating essence. What standards are similar? What standards are complementary? How much of the standard’s documentation is narrative, and how much is meant to be referenced? How much is distinctly machine-readable (thinking RDF ontologies, XML schemas, etc.)?

If it is anything like learning programming languages - and I believe that it is - learning the shape and confines of a single standard opens the door to picking up other standards quickly. The first time you see MUST and SHOULD in a document is jarring, but seeing them in a different standard’s documentation is comforting, like an old friend.

Indexing digital objects

That moment, when posited ideas and harebrained schemes begin to coalesce into a workable plan. That happened this morning, after a few days of thinking and conferring about ways to update how we index digital objects from Fedora into Solr.

The funny thing is, this is a return to a model we had in place at least 1-1.5 years ago now. At least in part. Currently, we have only one Solr core that powers the front-end, the same Solr core that we index objects to from Ouroboros (our internal manager), and the same core we search for objects to work on internally. This came about when we moved to a more distributed ecosystem, and major ingests or updates were not on the live, production machine anymore. A single core worked just dandy for some time.

However, we’ve started to revisit the indexing process, caching, and how we find and select objects to work on. One problem: until an object from Fedora was indexed in Solr, we didn’t have the ability to select it in Ouroboros and manage it. We had workarounds: index all recently ingested objects, ingest via command line, etc., but these were numerous and confusing to remember. Our goal was to have some kind of database / store that was up-to-date with currently held objects in Fedora, which we could search to select objects.

Another problem was made evident by growth. As the number of objects we manage increases, handling the searching and selecting of them client-side is becoming unfeasible. 10-12 seconds to return ~50k records will not scale as we continue to grow. We needed to move our searching of objects server-side. We are using DataTables to filter and browse, and the goal now is to write a Python (Flask) / Solr connector for server-side DataTables processing. I’ve done something similar for PeeWee, and am looking forward to mapping Solr into a similar endpoint from which our Flask app can return data for DataTables.
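
A rough sketch of what that connector might look like, assuming pysolr for the Solr client, and with the core URL, route, and field names invented for illustration; the parameter and response keys (draw, start, length, recordsTotal, recordsFiltered) come from DataTables’ server-side processing protocol:

```python
# Sketch of a Flask endpoint that answers DataTables server-side
# processing requests by querying Solr. Host, core, route, and field
# names are assumptions, not our actual Ouroboros code.
from flask import Flask, request, jsonify
import pysolr

app = Flask(__name__)
solr = pysolr.Solr("http://localhost:8983/solr/fedobjs")

@app.route("/datatables/objects")
def datatables_objects():
    draw = int(request.args.get("draw", 1))
    start = int(request.args.get("start", 0))
    length = int(request.args.get("length", 10))
    search = request.args.get("search[value]", "").strip()

    # Build a simple Solr query from the DataTables search box
    q = f"title:*{search}* OR id:*{search}*" if search else "*:*"
    results = solr.search(q, start=start, rows=length)
    total = solr.search("*:*", rows=0).hits  # all objects in the core

    return jsonify({
        "draw": draw,
        "recordsTotal": total,
        "recordsFiltered": results.hits,
        "data": [[doc.get("id"), doc.get("title")] for doc in results],
    })
```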

We’ve flipped back on our Stomp client that listens to Fedora’s messaging service, which will automatically index objects via our solrIndexer pipeline into the fedobjs Solr core. This core will be used for searching and selecting objects internally, and for pushing those PIDs to a user-specific database of PIDs to work on. From that workspace, users can then work on objects as per normal in Ouroboros.
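
A minimal sketch of that listener using the stomp.py library; the broker host/port, topic name (typical for a Fedora 3 / ActiveMQ setup), and the index_object() hook are assumptions, and note that older stomp.py versions pass headers and body to on_message separately rather than a single frame:

```python
# Sketch: listen to Fedora's messaging service over STOMP and hand each
# updated PID to the indexing pipeline. Host, port, topic, and the
# index_object() call are placeholders for our actual solrIndexer code.
import stomp

class FedoraListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # Fedora's update messages typically carry the affected PID
        # in the message headers
        pid = frame.headers.get("pid")
        print(f"received update for {pid}")
        # solrIndexer.index_object(pid, core="fedobjs")  # hypothetical hook

conn = stomp.Connection([("localhost", 61613)])
conn.set_listener("", FedoraListener())
conn.connect(wait=True)
conn.subscribe(destination="/topic/fedora.apim.update", id=1, ack="auto")
```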

The API will ping the more stable search Solr core for powering the front-end, as we did formerly. The finesse will be figuring out how, and how often, to replicate changes from fedobjs to search, or manually index objects straight to search. We are envisioning nightly replications, with the option to manually index objects to the search core if we need them there ASAP.
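
One way that nightly step might be triggered, using Solr’s built-in replication handler to have search pull the latest index from fedobjs; this assumes the replication handler is enabled on both cores, the hostnames are placeholders, and the actual scheduling (cron or otherwise) is left out:

```python
# Sketch: ask the "search" core to fetch the latest index from "fedobjs"
# via Solr's replication handler. Hostnames are placeholders; this would
# run nightly, with ad hoc runs when objects are needed ASAP.
import requests

SOLR = "http://localhost:8983/solr"

resp = requests.get(
    f"{SOLR}/search/replication",
    params={
        "command": "fetchindex",
        "masterUrl": f"{SOLR}/fedobjs/replication",
    },
)
print(resp.status_code, resp.text)
```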

Very excited to have a more rational and well-mapped approach. And you can tell, when the pencil hits the paper, and then the pen confirms (!), the model has been grokked, and we can finally dig into making it happen.

ALSO, that prefix “man” in “manual”, I’m not a fan. Looking up the etymology of “manual” in the Oxford English Dictionary (OED) begins to suggest that the word has origins in “manuālis” (roughly, held in the hand), which points back further to “manus” (relating to “the hand”, but also to Roman law, “A form of power or authority, principally involving control over property, held in some instances by a husband over his wife; a form of marriage contract giving a husband such authority.”).

Time to find a better word.