Indexing Digital Objects

That moment, when posited ideas and harebrained schemes begin to coalesce into a workable plan. That happened this morning after a few days of thinking and conferring about ways to update how we index digital objects from Fedora into Solr.

The funny thing is, this is a return to a model we had in place 1-1.5 years ago. At least parts of it. Currently, we have only one Solr core: it powers the front-end, it is the core we index objects to from Ouroboros (our internal manager), and it is the core we search for objects to work on internally. This came about when we moved to a more distributed ecosystem, and major ingests or updates were not on the live, production machine anymore. A single core worked just dandy for some time.

However, we’ve started to revisit the indexing process, caching, and how we find and select objects to work on. One problem: until an object from Fedora was indexed in Solr, we couldn’t select it in Ouroboros and manage it. We had workarounds (index all recently ingested objects, ingest via command line, etc.), but these were numerous and confusing to remember. Our goal was to have some kind of database / store that stayed up-to-date with the objects currently held in Fedora, which we could search to select objects.

Another problem was made evident by growth. As the number of objects we manage increases, handling the searching and selecting of them client-side is becoming unfeasible. Taking 10-12 seconds to return ~50k records will not scale as we continue to grow. We need to move our searching of objects server-side. We are using DataTables to filter and browse. The goal now is to write a Python (Flask) / Solr connector for server-side DataTables processing. I’ve done something similar for PeeWee, and am looking forward to mapping Solr into a similar endpoint, so that our Flask app can return data for DataTables.
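To make the connector idea concrete, here is a minimal sketch of the translation layer, not our actual implementation. It assumes the standard DataTables server-side parameters (draw, start, length, search[value], order[0][column], order[0][dir]) and a standard Solr JSON response (response.numFound, response.docs). The function names and the column names (id, dc_title) are hypothetical, and the HTTP call to Solr is left out entirely.

```python
def datatables_to_solr(args, columns):
    """Translate DataTables server-side request args into Solr query params.

    `args` is a flat dict of request parameters (e.g. Flask's request.args).
    `columns` is the ordered list of Solr fields shown in the table
    (hypothetical names, chosen for illustration).
    """
    search = args.get("search[value]", "").strip()
    params = {
        # Raw search text is passed straight through here; a real connector
        # would need to escape Solr query syntax.
        "q": search if search else "*:*",
        "start": int(args.get("start", 0)),
        "rows": int(args.get("length", 10)),
        "wt": "json",
    }
    order_col = args.get("order[0][column]")
    if order_col is not None:
        field = columns[int(order_col)]
        direction = "desc" if args.get("order[0][dir]") == "desc" else "asc"
        params["sort"] = f"{field} {direction}"
    return params


def solr_to_datatables(solr_json, draw, total, columns):
    """Reshape a Solr JSON response into the structure DataTables expects."""
    response = solr_json["response"]
    return {
        "draw": int(draw),                        # echoed back to the client
        "recordsTotal": total,                    # count before filtering
        "recordsFiltered": response["numFound"],  # count after filtering
        "data": [[doc.get(c, "") for c in columns] for doc in response["docs"]],
    }
```

With something like this in place, a Flask route would only need to call the first function, send the params to Solr, and return the second function's result as JSON, so the browser never holds more than one page of records.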

We’ve flipped back on our Stomp client that listens to Fedora’s messaging service; it will automatically index objects via our solrIndexer pipeline into the fedobjs Solr core. This core will be used for searching and selecting objects internally, and pushing those PIDs to a user-specific database of PIDs to work on. From that workspace, users can then work on objects as per normal in Ouroboros.
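As a rough illustration of the listener's job, here is a sketch of the message-handling step only, with the Stomp connection itself left out. It assumes each message body is an Atom entry whose title holds the API method name and whose summary holds the object's PID; that shape, the purgeObject method name, and the index / remove callbacks are assumptions for illustration, not a description of our solrIndexer pipeline.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def route_message(body, index, remove):
    """Decide what to do with one repository message.

    `index` and `remove` are hypothetical callbacks that take a PID.
    Returns a (action, pid) tuple, or None if no PID could be found.
    """
    entry = ET.fromstring(body)
    # Assumed message shape: <title> = API method, <summary> = PID.
    method = (entry.findtext(ATOM + "title") or "").strip()
    pid = (entry.findtext(ATOM + "summary") or "").strip()
    if not pid:
        return None
    if method == "purgeObject":
        # The object is gone from the repository, so drop it from the core.
        remove(pid)
        return ("remove", pid)
    # Any other change (ingest, modify, etc.) triggers a re-index.
    index(pid)
    return ("index", pid)
```

Keeping the routing decision in a plain function like this means it can be exercised without a running message broker.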

The API will ping the more stable search Solr core for powering the front-end, as we did formerly. The finesse will be in figuring out how, and how often, to replicate changes from fedobjs to search, or manually index objects straight to search. We are envisioning nightly replications, with the option to manually index objects to the search core if we need them there ASAP.

Very excited to have a more rational and well-mapped approach. And you can tell, when the pencil hits the paper, and then the pen confirms (!), the model has been grokked, and we can finally dig into making it happen.

ALSO, that prefix “man” in “manual”, I’m not a fan. Looking up the etymology of “manual” in the Oxford English Dictionary (OED) begins to suggest that the word has origins in “manuālis” (roughly, held in the hand), which points back further to “manus” (relating to “the hand”, but also to Roman law, “A form of power or authority, principally involving control over property, held in some instances by a husband over his wife; a form of marriage contract giving a husband such authority.”).

Time to find a better word.

Human Names And Opinionation

I’m going to go on the record as saying I don’t know if “opinionation” is a word, but I’d sure like it to be.

One of the most difficult, interesting, and complex things we deal with when building out a Digital Collections platform is representing information from digital objects in a meaningful way on the “front-end” access system. In end-to-end frameworks like Rails or Django, there is often tight coupling between models and views. If you give a model an attribute like title, it’s relatively easy when rendering a page to say something along the lines of object.title to place the title.

We do things a bit differently. One of our goals from the beginning of this long and wild ride has been a distinct and purposeful disconnect between our “back-end” and “front-end”. We use an in-house built API to communicate with our front-end, which renders relevant information to the page. But in a situation like this, where coupling is a bit looser, where does one house opinions or translations from back-end database fields to front-end, human-readable information? Where is the Solr field rels_isMemberOfCollection translated to Collection?

Our solution in our v1 system was to have the front-end query the back-end on every page load, requesting a hash of values to help translate. It looked something like this:

"…": "AllImage"
"info:fedora/CM:Archive": "Archive"
"info:fedora/CM:Audio": "Audio"
"info:fedora/CM:Collection": "Collection"
"info:fedora/CM:Container": "Container"
"info:fedora/CM:ContentModel": "ContentModel"
"info:fedora/CM:Document": "Document"
"info:fedora/CM:HierarchicalFiles": "HierarchicalFiles"
"info:fedora/CM:Image": "Image"
"info:fedora/CM:Issue": "Issue"
"info:fedora/CM:LearningObject": "Learning Object"
"info:fedora/CM:Serial": "Serial"
"info:fedora/CM:Video": "Video"
"info:fedora/CM:Volume": "Volume"
"info:fedora/CM:WSUebook": "WSUebook"
"info:fedora/wayne:collectionAmericanPressman": "American Pressman"
"info:fedora/wayne:collectionCFAI": "Changing Face of the Auto Industry"
"info:fedora/wayne:collectionDFQ": "Detroit Focus Quarterly"
"info:fedora/wayne:collectionDPLAOAI": "DPLA OAI-PMH"
"info:fedora/wayne:collectionDSJ": "The Detroit Sunday Journal"
"info:fedora/wayne:collectionDennisCooper": "Dennis Glen Cooper Collection"
"info:fedora/wayne:collectionDigDressColl": "Digital Dress Collection"
"info:fedora/wayne:collectionHeartTransplant": "First U.S. Human-to-Human Heart Transplant"
"info:fedora/wayne:collectionHermanMiller": "Herman Miller Consortium Collection"
"info:fedora/wayne:collectionLincolnLetters": "The Lincoln Letters"
"info:fedora/wayne:collectionMIM": "Made in Michigan Writers Series"
"info:fedora/wayne:collectionMOT": "Michigan Opera Theatre Performance Images"
"info:fedora/wayne:collectionNightingale": "Florence Nightingale Collection"
"info:fedora/wayne:collectionRENCEN": "Building the Detroit Renaissance Center"
"info:fedora/wayne:collectionRamsey": "Eloise Ramsey Collection of Literature for Young People"
"info:fedora/wayne:collectionReuther": "Walter P. Reuther Library Collection"
"info:fedora/wayne:collectionReutherSwanger": "Toni Swanger Papers"
"info:fedora/wayne:collectionUniversityBuildings": "Wayne State University Buildings Collection"
"info:fedora/wayne:collectionVanRiperLetters": "Van Riper Family Correspondence"
"info:fedora/wayne:collectionWPAscores": "WPA Music Manuscripts"
"info:fedora/wayne:collectionWSUebooks": "Wayne State University eBooks"
"info:fedora/wayne:collectionvmc": "Virtual Motor City"

No shame here, we were running a tight ship, and the overhead of that API call was small, as it was cached by Varnish on the back-end. But we wanted to improve this process. In addition to unnecessary API calls, it also required using that hash on the front-end to “translate” all values from the API response, in multiple places for a single page load.

There were two major things that needed translation:

  1. A human legible name for facets, such as info:fedora/CM:Image –> Image
  2. A human title for related objects, such as converting info:fedora/wayne:collectionvmc –> Virtual Motor City

For our v2 platform, we’re splitting up these concerns.

The facets are small, and mostly unchanging, so for those we are creating a static hash that is embedded in our PHP front-end framework. That hash is used uniformly, and easily, across the system for translating facet names.

The more difficult concern was how to get human names from object identifiers when those come through in the facet results from Solr. The solution was to grab our spoons and dig backwards into the indexing process, and at the time of indexing, include a “human” form of the relationship. So where a Solr record formerly only had a rels_isMemberOfCollection:info:fedora/wayne:collectionvmc field/value, it now also contains a human_isMemberOfCollection:Virtual Motor City field/value. This means our native Solr response returns both rels_* and human_* facets, which are easily cherry-picked on the front-end. As a matter of efficiency, when records are indexed, a hash similar to the one outlined above is queried from Solr once, and is then reused across a whole batch-indexing job, sometimes thousands of records.
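In rough terms, the indexing-time step might look like the sketch below. This is an illustration under assumptions rather than our solrIndexer code: the function names are hypothetical, fetch_hash stands in for whatever query retrieves the identifier-to-name hash, and falling back to the raw identifier when no human name is known is a choice made here for the example.

```python
def add_human_fields(doc, human_hash):
    """Return a copy of a Solr document with a human_* twin for each rels_* field."""
    enriched = dict(doc)
    for field, value in doc.items():
        if not field.startswith("rels_"):
            continue
        human_field = "human_" + field[len("rels_"):]
        values = value if isinstance(value, list) else [value]
        # Assumption: unknown identifiers fall back to the raw value.
        names = [human_hash.get(v, v) for v in values]
        enriched[human_field] = names if isinstance(value, list) else names[0]
    return enriched


def index_batch(docs, fetch_hash):
    """Fetch the translation hash once, then reuse it for every record."""
    human_hash = fetch_hash()
    return [add_human_fields(doc, human_hash) for doc in docs]
```

The point of index_batch is the efficiency noted above: one lookup of the hash serves the whole job, however many records are in it.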

[Image: human_hash.jpg]

With lots of moving parts and control over those parts, it can be paralyzing sometimes to know what to change, and the ramifications downstream. But sometimes a piece of paper and pencil is the best bet for sketching out a new path forward.

Handling Repeating Get Params

Re: the last couple of posts about repeating GET parameters, and how PHP is slightly unconventional in how it parses them. We came up with a solution: a QueryBuilder class.

[Image: IMG_20170123_140040.jpg]

It was a particularly pernicious problem, and time will tell how well our solution scales and evolves. The problem came down to how the Slim PHP framework parsed GET parameters, and how the Guzzle PHP client encoded GET requests.

Slim used the built-in PHP function parse_str, which follows the PHP convention of only capturing repeating GET parameters when the GET parameter string contains square brackets [] after those repeating fields. For example:

?fq=foo&fq=bar would get truncated to 'fq'=>'bar'

However, if square brackets were used, repeating values would get picked up from ?fq[]=foo&fq[]=bar, and become 'fq'=>['foo','bar'].
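For contrast, Python's standard library keeps repeated parameters by default. The sketch below puts urllib.parse.parse_qs next to a rough emulation of the PHP convention described above. The php_style_parse function is hypothetical and only covers the two cases shown here (plain keys and keys ending in []); it is not a port of parse_str.

```python
from urllib.parse import parse_qs, parse_qsl


def php_style_parse(query):
    """Roughly emulate PHP's convention for repeated GET parameters.

    Plain keys: the last value wins. Keys ending in "[]": values are
    collected into a list under the key without the brackets.
    """
    parsed = {}
    for key, value in parse_qsl(query.lstrip("?"), keep_blank_values=True):
        if key.endswith("[]"):
            name = key[:-2]
            if not isinstance(parsed.get(name), list):
                parsed[name] = []
            parsed[name].append(value)
        else:
            parsed[key] = value  # overwrites any earlier value
    return parsed


# Python's own parser keeps both values without needing brackets.
python_result = parse_qs("fq=foo&fq=bar")

# The emulated PHP convention drops the first value...
truncated = php_style_parse("?fq=foo&fq=bar")

# ...unless the brackets are present.
collected = php_style_parse("?fq[]=foo&fq[]=bar")
```

Seeing the two side by side makes the mismatch plain: a URL that one side considers perfectly good loses data on the other.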

So, we needed to always send requests to our /search route with square brackets. But we did not want the indices that http_build_query includes when building a GET param string, as that would hurt our ability to manipulate the URL by cherry-picking known parameters to remove.

Speaking of http_build_query, this is what Guzzle uses to build GET parameters for an HTTP request. As alluded to above, an associative array like 'fq'=>['foo','bar'] would result in the following string: ?fq[0]=foo&fq[1]=bar.

This was also not ideal, as our API is not prepared to handle fq[n] fields of an unknown n quantity. The jury is still out on whether, and how, Python’s Flask-RESTful can handle that kind of regex parsing.
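For what it's worth, here is a plain-Python sketch of what that regex parsing could look like on the API side. It uses only the standard library and makes no claim about what Flask-RESTful supports; the collect_repeating name is hypothetical. It folds fq, fq[] and fq[n] into a single list per field.

```python
import re
from urllib.parse import parse_qsl

# Matches "name[]" or "name[3]" and captures the bare field name.
BRACKETED = re.compile(r"^(?P<name>[^\[\]]+)\[\d*\]$")


def collect_repeating(query):
    """Group values by field name, ignoring any [] or [n] suffix on the key."""
    collected = {}
    for key, value in parse_qsl(query.lstrip("?"), keep_blank_values=True):
        match = BRACKETED.match(key)
        name = match.group("name") if match else key
        collected.setdefault(name, []).append(value)
    return collected
```

For example, collect_repeating("?fq[0]=foo&fq[1]=bar&q=cars") groups both filter values under fq and keeps q as a one-item list, so the n in fq[n] never needs to be known ahead of time.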

So, we needed to fix HTTP requests on the way out too. The end result was two places in a typical advanced query that required GET parameter fixing. We created a QueryBuilder class that is invoked where and when needed, to prepare I/O GET parameters. The best part is, this class has become a logical place to house any complex behavior related to search and query parameter parsing and prepping.
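Our QueryBuilder is a PHP class, but the outbound half of the idea can be sketched in a few lines of Python. Everything here is illustrative: the method names, the repeating argument, and the choice to leave brackets unencoded are assumptions, not a transcription of our class. The sketch writes designated repeating fields with bare [] (no indices) and supports cherry-picking a single value to remove.

```python
from urllib.parse import quote_plus


class QueryBuilder:
    """Hypothetical sketch: build GET strings with index-free brackets."""

    def __init__(self, params=None, repeating=("fq",)):
        # Normalize every value to a list so removal works uniformly.
        self.params = {
            key: list(value) if isinstance(value, (list, tuple)) else [value]
            for key, value in (params or {}).items()
        }
        self.repeating = set(repeating)

    def remove(self, key, value):
        """Drop one value of a parameter; drop the key if nothing is left."""
        remaining = [v for v in self.params.get(key, []) if v != value]
        if remaining:
            self.params[key] = remaining
        else:
            self.params.pop(key, None)
        return self

    def to_query_string(self):
        """Serialize, using name[]=value for repeating fields (no indices)."""
        pairs = []
        for key, values in self.params.items():
            name = key + "[]" if key in self.repeating else key
            for value in values:
                pairs.append(f"{name}={quote_plus(str(value))}")
        return "&".join(pairs)
```

Because no indices are written, removing one facet from the URL is just a matter of dropping a name[]=value pair, which is the kind of manipulation the indexed form made awkward.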

So what is this thing? When does QueryBuilder become a thing unto itself? Why can’t Guzzle optionally leave out indices in an HTTP request when passed an associative array as parameters? Why can’t Slim parse a route with repeating GET parameters that don’t have square brackets?

These are the questions that make all of this occasionally frustrating, but always interesting. Observing that the libraries we use to parse and prepare HTTP requests were following conventions incompatible with components up and downstream prompted the creation of a class that is proving to be supremely helpful.