Standards

I was reflecting today, while putting together some thoughts for class, that learning or memorizing a single standard is useful, but learning how to learn standards can be so much more valuable.

We have covered MARC, EAD, DACS, ISAD(G), and the list goes on. Obviously, each of these is critically important and uniquely interesting in its own right, but they do not lend themselves to a linear read. Standards encapsulate everything from history, to why, to how, to specific rules, to integration with other standards. Each standard is a complex network of information with multiple inroads, and cannot be treated as a linear text to be read once and understood in its entirety. Furthermore, standards may vary greatly from one to another. Some may explain tag libraries for EAD-based standards; others might attempt to codify the norms of behavior or philosophies of a body into workflows and decision trees.

However different they might be, and however little they may lend themselves to a single-read-and-understand, standards also share a striking similarity: they are standards! They are attempting to make order out of chaos, to impose or suggest a way of doing things so that people and systems may be interoperable across space and time. And in this similarity, they open themselves up to those familiar with standards.

Just today I was reading the meeting notes from a Hydra / Sufia related working group that was interested in codifying the metadata principles and formats for Sufia. It was mentioned that a handful of well-made standards in other, related areas were using a fixed set of words to help standardize the standards! Words like: MUST, SHOULD, ALLOW, that would help humans and machines parse the rules for this particular standard. The IIIF Image API is a nice example of a relatively new standard, where a considerable amount of work has been done to make sure it is expressive, succinct, and unambiguous. The discussions leading up to the standard, I’m sure, were quite lively and full of questioning. But the result is a standard with clear language and vision.
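To make the "fixed set of words" idea concrete, here is a minimal sketch of how a machine might pick those words out of a standard's text. It assumes the vocabulary in question is the one defined in RFC 2119 (MUST, SHOULD, MAY, and friends), which may not be exactly the list the working group had in mind. The sample sentences are invented for illustration and are not quoted from any real standard.

```python
import re
from collections import Counter

# Requirement keywords as defined in RFC 2119. Longer phrases come first so
# that "MUST NOT" is matched as a unit rather than as a bare "MUST".
KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "SHALL", "SHOULD", "REQUIRED", "RECOMMENDED", "MAY", "OPTIONAL",
]
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")


def count_requirement_keywords(text):
    """Count uppercase requirement keywords in a block of text.

    Matching is case-sensitive on purpose: by convention only the
    capitalized forms carry the special meaning.
    """
    return Counter(PATTERN.findall(text))


# Invented sample text, for illustration only.
sample = (
    "A server MUST return an error for a malformed request. "
    "Clients SHOULD NOT assume a default size. "
    "A server MAY support additional formats, but it must document them."
)
counts = count_requirement_keywords(sample)
for keyword, n in sorted(counts.items()):
    print(keyword, n)
```

Note that the lowercase "must" in the last sentence is not counted, which is the whole point of the convention: the capitalized words are the rules, and everything else is narrative.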

So, back to my notes, I got to thinking. The value of learning some of these standards is not to internalize their every twist and turn, their specific rules or exceptions, but to instead feel their radiating essence. What standards are similar? What standards are complementary? How much of the standard documentation is narrative, and how much is meant to be referenced? How much is distinctly machine-readable (thinking RDF ontologies, XML schemas, etc.)?

If it is anything like learning programming languages - and I believe that it is - learning the shape and confines of a single standard opens the door to picking up other standards quickly. The first time you see MUST and SHOULD in a document is jarring, but seeing them in a different standard’s documentation is comforting, like an old friend.

Indexing Digital Objects

That moment when posited ideas and harebrained schemes begin to coalesce into a workable plan. That happened this morning, after a few days of thinking and conferring about ways to update how we index digital objects from Fedora into Solr.

The funny thing is, this is a return to a model we had in place 1-1.5 years ago, or at least parts of it. Currently, we have only one Solr core that powers the front-end. It is the same Solr core that we index objects to from Ouroboros (our internal manager), and the same core we search for objects to work on internally. This came about when we moved to a more distributed ecosystem, and major ingests or updates were not on the live, production machine anymore. A single core worked just dandy for some time.

However, we’ve started to revisit the indexing process, caching, and how we find and select objects to work on. One problem: until an object from Fedora was indexed in Solr, we couldn’t select it in Ouroboros and manage it. We had workarounds (index all recently ingested objects, ingest via command line, etc.), but these were numerous and confusing to remember. Our goal was to have some kind of database / store that was up-to-date with the objects currently held in Fedora, and that we could search to select objects.

Another problem was made evident via growth. As the number of objects we manage has increased, handling the searching and selecting of them client-side has become unfeasible. Taking 10-12 seconds to return ~50k records will not scale as we continue to grow. We needed to move our searching of objects server-side. We are using DataTables to filter and browse. The goal now is to write a Python (Flask) / Solr connector for server-side DataTables processing. I’ve done something similar for PeeWee, and am looking forward to mapping Solr into a similar endpoint, so that our Flask app can return data for DataTables.
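The heart of such a connector is a pair of translations: DataTables request parameters into Solr query parameters, and a Solr JSON response back into the shape DataTables expects. What follows is only a sketch of that mapping, not our actual connector. The function names and the id and dc_title fields are hypothetical, and the HTTP call to Solr and the Flask route are left out so the translation itself is easy to see.

```python
MAX_ROWS = 1000  # hypothetical cap for when DataTables asks for "all" rows


def datatables_to_solr(args):
    """Translate DataTables server-side request parameters into Solr params.

    `args` is any dict-like object of request parameters, for example
    Flask's `request.args`.
    """
    search = args.get("search[value]", "").strip()
    length = int(args.get("length", 10))
    return {
        # The search box text is passed through as-is, so Solr query
        # syntax applies; an empty search matches everything.
        "q": search if search else "*:*",
        "start": int(args.get("start", 0)),
        # DataTables sends length=-1 to mean "all rows"; cap it instead.
        "rows": length if length >= 0 else MAX_ROWS,
        "wt": "json",
    }


def solr_to_datatables(draw, solr_json, total_records, fields):
    """Shape a Solr JSON response into a DataTables server-side response.

    `total_records` is the unfiltered size of the core, which would need
    its own (cacheable) query.
    """
    response = solr_json["response"]
    return {
        "draw": int(draw),  # echoed back so DataTables can order replies
        "recordsTotal": total_records,
        "recordsFiltered": response["numFound"],
        "data": [[doc.get(f, "") for f in fields] for doc in response["docs"]],
    }


# Usage with a hand-built request and a canned Solr response.
params = datatables_to_solr(
    {"draw": "3", "start": "50", "length": "25", "search[value]": "reuther"}
)
canned = {"response": {"numFound": 120, "docs": [{"id": "wayne:demo1"}]}}
table = solr_to_datatables("3", canned, 50000, ["id", "dc_title"])
print(params)
print(table)
```

The nice part of this split is that paging happens in Solr: the browser only ever receives the 25 or so rows it is about to display, instead of all ~50k.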

We’ve flipped back on our Stomp client that listens to Fedora’s messaging service, and it will automatically index objects via our solrIndexer pipeline into the fedobjs Solr core. This core will be used for searching and selecting objects internally, and for pushing those PIDs to a user-specific database of PIDs to work on. From that workspace, users can then work on objects as per normal in Ouroboros.
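As a rough sketch of the listener's decision logic, the function below routes a single message to either an index or a remove action. It assumes each Fedora message carries pid and methodName headers and that purgeObject signals a deletion; treat both as assumptions to verify against your Fedora version. The callbacks are hypothetical stand-ins for the solrIndexer pipeline, and the Stomp connection itself is omitted.

```python
def route_fedora_message(headers, index_pid, remove_pid):
    """Decide what to do with one Fedora message and do it.

    `headers` is a dict of message headers (assumed to include "pid" and
    "methodName"). `index_pid` and `remove_pid` are callables that take a
    PID; here they stand in for the real indexing pipeline.
    Returns a short string describing the action taken.
    """
    pid = headers.get("pid")
    method = headers.get("methodName")

    if not pid:
        # Nothing to index without an object identifier.
        return "ignored"

    if method == "purgeObject":
        # The object is gone from Fedora, so drop it from the core too.
        remove_pid(pid)
        return "removed"

    # Any other change (ingest, datastream edits, etc.) triggers a reindex.
    index_pid(pid)
    return "indexed"


# Usage with in-memory stand-ins for the Solr core.
indexed, removed = [], []
results = [
    route_fedora_message(
        {"pid": "wayne:demo1", "methodName": "ingest"},
        indexed.append, removed.append,
    ),
    route_fedora_message(
        {"pid": "wayne:demo2", "methodName": "purgeObject"},
        indexed.append, removed.append,
    ),
    route_fedora_message({"methodName": "ingest"}, indexed.append, removed.append),
]
print(results, indexed, removed)
```

Keeping the routing in a plain function like this means it can be tested without a running message broker; the Stomp listener only has to hand over the headers.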

The API will ping the more stable search Solr core for powering the front-end, as we did formerly. The finesse will be figuring out how, and how often, to replicate changes from fedobjs to search, or to manually index objects straight to search. We are envisioning nightly replications, with the option to manually index objects to the search core if we need them there ASAP.

Very excited to have a more rational and well-mapped approach. And you can tell, when the pencil hits the paper, and then the pen confirms (!), the model has been grokked, and we can finally dig into making it happen.

ALSO, that prefix “man” in “manual”, I’m not a fan. Looking up the etymology of “manual” in the Oxford English Dictionary (OED) begins to suggest that the word has origins in “manuālis” (roughly, held in the hand), which points back further to “manus” (relating to “the hand”, but also to Roman law, “A form of power or authority, principally involving control over property, held in some instances by a husband over his wife; a form of marriage contract giving a husband such authority.”).

Time to find a better word.

Human Names And Opinionation

I’m going to go on the record as saying I don’t know if “opinionation” is a word, but I’d sure like it to be.

One of the most difficult, interesting, and complex things we deal with when building out a Digital Collections platform is representing information from digital objects in a meaningful way on the “front-end” access system. In end-to-end frameworks like Rails or Django, there is often tight coupling between models and views. If you give a model an attribute like title, it’s relatively easy when rendering a page to say something along the lines of object.title to place the title.

We do things a bit differently. One of our goals from the beginning of this long and wild ride has been a distinct and purposeful disconnect between our “back-end” and “front-end”. We use an API, built in-house, to communicate with our front-end, which renders the relevant information to the page. But in a situation like this, where coupling is a bit looser, where does one house opinions or translations from back-end database fields to front-end, human-readable information? Where is the Solr field rels_isMemberOfCollection translated to Collection?

Our solution in our v1 system was to have the front-end query the back-end on every page load, requesting a hash of values to help translate. It looked something like this:

[...] : "AllImage"
info:fedora/CM:Archive : "Archive"
info:fedora/CM:Audio : "Audio"
info:fedora/CM:Collection : "Collection"
info:fedora/CM:Container : "Container"
info:fedora/CM:ContentModel : "ContentModel"
info:fedora/CM:Document : "Document"
info:fedora/CM:HierarchicalFiles : "HierarchicalFiles"
info:fedora/CM:Image : "Image"
info:fedora/CM:Issue : "Issue"
info:fedora/CM:LearningObject : "Learning Object"
info:fedora/CM:Serial : "Serial"
info:fedora/CM:Video : "Video"
info:fedora/CM:Volume : "Volume"
info:fedora/CM:WSUebook : "WSUebook"
info:fedora/wayne:collectionAmericanPressman : "American Pressman"
info:fedora/wayne:collectionCFAI : "Changing Face of the Auto Industry"
info:fedora/wayne:collectionDFQ : "Detroit Focus Quarterly"
info:fedora/wayne:collectionDPLAOAI : "DPLA OAI-PMH"
info:fedora/wayne:collectionDSJ : "The Detroit Sunday Journal"
info:fedora/wayne:collectionDennisCooper : "Dennis Glen Cooper Collection"
info:fedora/wayne:collectionDigDressColl : "Digital Dress Collection"
info:fedora/wayne:collectionHeartTransplant : "First U.S. Human-to-Human Heart Transplant"
info:fedora/wayne:collectionHermanMiller : "Herman Miller Consortium Collection"
info:fedora/wayne:collectionLincolnLetters : "The Lincoln Letters"
info:fedora/wayne:collectionMIM : "Made in Michigan Writers Series"
info:fedora/wayne:collectionMOT : "Michigan Opera Theatre Performance Images"
info:fedora/wayne:collectionNightingale : "Florence Nightingale Collection"
info:fedora/wayne:collectionRENCEN : "Building the Detroit Renaissance Center"
info:fedora/wayne:collectionRamsey : "Eloise Ramsey Collection of Literature for Young People"
info:fedora/wayne:collectionReuther : "Walter P. Reuther Library Collection"
info:fedora/wayne:collectionReutherSwanger : "Toni Swanger Papers"
info:fedora/wayne:collectionUniversityBuildings : "Wayne State University Buildings Collection"
info:fedora/wayne:collectionVanRiperLetters : "Van Riper Family Correspondence"
info:fedora/wayne:collectionWPAscores : "WPA Music Manuscripts"
info:fedora/wayne:collectionWSUebooks : "Wayne State University eBooks"
info:fedora/wayne:collectionvmc : "Virtual Motor City"

No shame here, we were running a tight ship, and the overhead of that API call was small, as it was cached by Varnish on the back-end. But we wanted to improve this process. In addition to unnecessary API calls, it also required using that hash on the front-end to “translate” all values from the API response, in multiple places for a single page load.

There were two major things that needed translation:

  1. A human legible name for facets, such as info:fedora/CM:Image –> Image
  2. A human title for related objects, such as converting info:fedora/wayne:collectionvmc –> Virtual Motor City

For our v2 platform, we’re splitting up these concerns.

The facets are small and mostly unchanging, so for those we are creating a static hash that is embedded in our PHP front-end framework. That hash is used uniformly, and easily, across the system for translating facet names.

The more difficult concern was how to get human names from object identifiers when those come through in the facet results from Solr. The solution was to grab our spoons and dig backwards into the indexing process, and at the time of indexing, include a “human” form of the relationship. So where a Solr record formerly only had a rels_isMemberOfCollection:info:fedora/wayne:collectionvmc field/value, it now also contains a human_isMemberOfCollection:Virtual Motor City field/value. This means our native Solr response returns both rels_* and human_* facets, which are easily cherry-picked on the front-end. As a matter of efficiency, when records are indexed, a hash similar to the one outlined above is queried from Solr once, and is then reused across a whole batch-indexing job, sometimes thousands of records.
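A minimal sketch of that index-time step might look like the following. This is an illustration of the idea rather than our actual solrIndexer code: the function names and the sample record id are hypothetical, and the lookup hash is passed in already built, standing in for the one queried from Solr at the start of a batch.

```python
def add_human_fields(doc, human_names):
    """Return a copy of a Solr document with human_* fields added.

    For every rels_* field whose value appears in `human_names`, a
    matching human_* field is added holding the readable name. Values
    with no known name are skipped rather than guessed at.
    """
    enriched = dict(doc)
    for field, value in doc.items():
        if not field.startswith("rels_"):
            continue
        # Solr fields may be single- or multi-valued; handle both.
        values = value if isinstance(value, list) else [value]
        names = [human_names[v] for v in values if v in human_names]
        if names:
            human_field = "human_" + field[len("rels_"):]
            enriched[human_field] = names if isinstance(value, list) else names[0]
    return enriched


def index_batch(docs, human_names):
    """Enrich a whole batch using one shared lookup hash."""
    return [add_human_fields(doc, human_names) for doc in docs]


# The hash is built once per batch, then reused for every record.
human_names = {
    "info:fedora/wayne:collectionvmc": "Virtual Motor City",
    "info:fedora/wayne:collectionDSJ": "The Detroit Sunday Journal",
}
batch = index_batch(
    [
        {
            "id": "wayne:demo1",  # hypothetical record id
            "rels_isMemberOfCollection": "info:fedora/wayne:collectionvmc",
        }
    ],
    human_names,
)
print(batch[0])
```

Because the original rels_* value is kept alongside the new human_* one, the front-end can display the readable name while still using the identifier for links and filter queries.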

[Image: human_hash.jpg]

With lots of moving parts and control over those parts, it can be paralyzing sometimes to know what to change, and the ramifications downstream. But sometimes a piece of paper and pencil is the best bet for sketching out a new path forward.