Human Names And Opinionation

I’m going to go on the record as saying I don’t know if “opinionation” is a word, but I’d sure like it to be.

One of the most difficult, interesting, and complex things we deal with when building out a Digital Collections platform is representing information from digital objects in a meangingful way on the “front-end”, access system. In end-to-end frameworks like Rails or Django, there is often tight coupling between models and views. If you give a model an attribute like title, it’s relatively easy when rendering a page to say something along the lines of, object.title to place the title.

We do things a bit differently. One of our goals from the beginning of this long and wild ride, has been a distinct and purposeful disconnect between our “back-end” and “front-end”. We use an in-house built API to communicate with our front-end, that renders relevant information to the page. But in a situation like this, where coupling is a bit looser, where does one house opinions or translations from back-end, database fields to front-end, human readable information? Where is the Solr field rels_isMemberOfCollection translated to Collection?

Our solution in our v1 system was to have the front-end query the back-end every page load, requesting a hash of values to help translate. It looked something like this:

:
"AllImage"
info:fedora/CM:Archive
:
"Archive"
info:fedora/CM:Audio
:
"Audio"
info:fedora/CM:Collection
:
"Collection"
info:fedora/CM:Container
:
"Container"
info:fedora/CM:ContentModel
:
"ContentModel"
info:fedora/CM:Document
:
"Document"
info:fedora/CM:HierarchicalFiles
:
"HierarchicalFiles"
info:fedora/CM:Image
:
"Image"
info:fedora/CM:Issue
:
"Issue"
info:fedora/CM:LearningObject
:
"Learning Object"
info:fedora/CM:Serial
:
"Serial"
info:fedora/CM:Video
:
"Video"
info:fedora/CM:Volume
:
"Volume"
info:fedora/CM:WSUebook
:
"WSUebook"
info:fedora/wayne:collectionAmericanPressman
:
"American Pressman"
info:fedora/wayne:collectionCFAI
:
"Changing Face of the Auto Industry"
info:fedora/wayne:collectionDFQ
:
"Detroit Focus Quarterly"
info:fedora/wayne:collectionDPLAOAI
:
"DPLA OAI-PMH"
info:fedora/wayne:collectionDSJ
:
"The Detroit Sunday Journal"
info:fedora/wayne:collectionDennisCooper
:
"Dennis Glen Cooper Collection"
info:fedora/wayne:collectionDigDressColl
:
"Digital Dress Collection"
info:fedora/wayne:collectionHeartTransplant
:
"First U.S. Human-to-Human Heart Transplant"
info:fedora/wayne:collectionHermanMiller
:
"Herman Miller Consortium Collection"
info:fedora/wayne:collectionLincolnLetters
:
"The Lincoln Letters"
info:fedora/wayne:collectionMIM
:
"Made in Michigan Writers Series"
info:fedora/wayne:collectionMOT
:
"Michigan Opera Theatre Performance Images"
info:fedora/wayne:collectionNightingale
:
"Florence Nightingale Collection"
info:fedora/wayne:collectionRENCEN
:
"Building the Detroit Renaissance Center"
info:fedora/wayne:collectionRamsey
:
"Eloise Ramsey Collection of Literature for Young People"
info:fedora/wayne:collectionReuther
:
"Walter P. Reuther Library Collection"
info:fedora/wayne:collectionReutherSwanger
:
"Toni Swanger Papers"
info:fedora/wayne:collectionUniversityBuildings
:
"Wayne State University Buildings Collection"
info:fedora/wayne:collectionVanRiperLetters
:
"Van Riper Family Correspondence"
info:fedora/wayne:collectionWPAscores
:
"WPA Music Manuscripts"
info:fedora/wayne:collectionWSUebooks
:
"Wayne State University eBooks"
info:fedora/wayne:collectionvmc
:
"Virtual Motor City"

No shame here, we were running a tight ship, and the overhead of that API call was small, as it was cached by Varnish on the back-end. But we wanted to improve this process. In addition to unnecessary API calls, it also required using that hash on the front-end to “translate” all values from the API response, in multiple places for a single page load.

There were two major things that needed translation:

  1. A human legible name for facets, such as info:fedora/CM:Image –> Image
  2. A human title for related objects, such as converting info:fedora/wayne:collectionvmc –> Virtual Motor City

For our v2 platform, we’re splitting up these concerns.

The facets are small, and mostly unchanging, so for those we are creating a static hash that is embedded in PHP, front-end framework. That hash is used uniformally, and easily, across the system for translating facet names.

The more difficult concern was how to get human names from object identifiers when those come through in the facet results from Solr. The solution was to grab our spoons and dig backwards into the indexing process, and at the time of indexing, include a “human” form of the relationship. So where a Solr record formerly only had a rels_isMemberOfCollection:info:fedora/wayne:collectionvmc field/value, it now also contains a human_isMemberOfCollection:Virtual Motor City field/value as well. This means, our native Solr response is returning both rels_* and human_* facets, which are easily cherry-picked on the front-end. As a matter of efficiency, when records are indexed, a hash similar to the one outlined above is queried from Solr, but is then used across a batch-indexing job, sometimes thousands of records.

human_hash.jpg

With lots of moving parts and control over those parts, it can be paralyzing sometimes to know what to change, and the ramifications downstream. But sometimes a piece of paper and pencil is the best bet for sketching out a new path forward.

Handling Repeating Get Params

Re: the last couple of posts about repeating GET parameters, and how PHP is slightly unconventional in how it parses. Came up with a solution: a QueryBuilder class.

IMG_20170123_140040.jpg

It was a particularly pernicious problem, and time will tell how well our solution scales and evolves. The problem came down to how the Slim PHP framework parsed GET parameters, and the Guzzle PHP client encoded GET requests.

Slim used the built-in PHP function, parse_str that followed the PHP convention to only capture repeating GET parameters when the GET parameter string contained square brackets [] around those repeating fields. For example:

?fq=foo&fq=bar would get truncated to 'fq'=>'bar'

However, if square brackets were used, repeating values would get picked up from ?fq[]=foo&fq[]=bar, and become 'fq'=['foo','bar'].

So, we needed to always send requests to our /search route with square brackets. But we did not want to the indices that http_build_query includes when building a GET param string, as that would hurt our ability to manipulate the URL by cherry-picking known parameters to remove.

Speaking of http_build_query, this is what Guzzle uses to build GET parameters for an HTTP request. As alluded to above, an associative array like 'fq'=['foo','bar'], would result in the following string, ?fq[0]=foo&fq[1]=bar.

This was also not ideal, as our API is not prepared to handle fq[n] fields of an unknown n quantity. The verdict is still out if/how python Flask-RESTful can handle that kind of regex parsing.

So, we needed to fix HTTP requests on the way out too. The end result was two places in a typical advanced query that required GET parameter fixing. We created a QueryBuilder class that is invoked where and when needed, to prepare I/O GET parameters. The best part is, this class has become a logical place to house any complex behavior related to search and query parameter parsing and prepping.

So what is this thing? When does QueryBuilder become a thing unto itself? Why can’t Guzzle optionally not include indices in HTTP request when passed an associative array as parameters? Why can’t Slim parse a route with repeating GET parameters that don’t have square brackets?

These are the questions that make all of this occassionaly frustrating, but always interesting. Observing that libraries we use to parse and prepare HTTP requests were following conventions incompatible with components up and downstream, it prompted the creation of a class that is proving to be supremely helpful.

Multiple Get Parameters

An interesting aside about GET parameters, particularly of the multiple variety.

Solr accepts, where appropriate expects, the same GET parameter multiple times. e.g. the fq parameter:

http://example.org/solr/search?q=goober&fq=foo:bar&fq=foo:baz

Pardon an oversimplification, but in this scenario Solr is using a custom parser to parse the multiple fq GET parameters. It is custom, in a sense, because RFC 3986, which serves as a specification for generic URLs and parsing parameters, doesn’t explicitly discuss how to handle multiple GET parameters.

But they exist. And Solr is a great example.

Further speculating under the hood in Solr, you can divine that it also allows nesting of values in GET parameters, as demonstrated with fields like facet.field which, in addition to being repeatable, also exists next to a frighteningly similar field facet. When solr parses a URL such as:

http://example.org/solr/search?q=goober&fq=foo:bar&fq=foo:baz&facet=true&facet.field=foo

we can assume that anything with a facet. prefix, like facet.field, is probably getting grouped as a nested structure Solr-side.

But how do other systems handle this?

There is a convention, not a specification, that I stumble on from time to time that can be a bit of a headache. Some libraries fallback on using square brackets [] affixed at the end of a field to tell future parsers that this field is repeating, and should be slotted into some kind of list or array, instead of overwriting a key / value pair previously seen in the URL.

This is great, and works for well for back and forths between systems, but can be complicated when those parameters eventually need to slung over to Solr. Python Flask, for example, out of the box, only handles repeating GET parameters when they come in with the [] suffix. e.g.

http://example.org/route?fq[]=bar&fq[]=baz

This means, before you can scuttle over to Solr, you’d need to rename these fq[] keys to fq, as Solr does not know what to do with fq[].

Just one of those things. But interesting, and perhaps telling, that the HTTP protocol and parameter parsing is getting pushed to it’s logical limits this day in age.