Handling Repeating GET Params

Re: the last couple of posts about repeating GET parameters, and how PHP is slightly unconventional in how it parses them. Came up with a solution: a QueryBuilder class.


It was a particularly pernicious problem, and time will tell how well our solution scales and evolves. The problem came down to how the Slim PHP framework parsed GET parameters, and the Guzzle PHP client encoded GET requests.

Slim used the built-in PHP function parse_str, which followed the PHP convention of only capturing repeating GET parameters when the GET parameter string contained square brackets [] on those repeating fields. For example:

?fq=foo&fq=bar would get truncated to 'fq'=>'bar'

However, if square brackets were used, repeating values would get picked up from ?fq[]=foo&fq[]=bar, and become 'fq'=>['foo','bar'].
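For contrast, Python's standard library keeps every occurrence of a bare repeated key, and treats bracketed keys as literal names. A quick sketch of the two conventions:

```python
from urllib.parse import parse_qs

# Bare repetition: every value survives (PHP's parse_str would
# keep only the last one here).
print(parse_qs("fq=foo&fq=bar"))      # {'fq': ['foo', 'bar']}

# PHP-style brackets: to parse_qs the brackets are just part of
# the key name, so the key comes back as 'fq[]'.
print(parse_qs("fq[]=foo&fq[]=bar"))  # {'fq[]': ['foo', 'bar']}
```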

So, we needed to always send requests to our /search route with square brackets. But we did not want the indices that http_build_query includes when building a GET param string, as those would hurt our ability to manipulate the URL by cherry-picking known parameters to remove.

Speaking of http_build_query, this is what Guzzle uses to build GET parameters for an HTTP request. As alluded to above, an associative array like 'fq'=>['foo','bar'] would result in the following string: ?fq[0]=foo&fq[1]=bar.
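Python's stdlib shows the same fork in the road: urlencode can emit bare repetition, but the bracket-without-index form we wanted takes a few lines of custom encoding. A sketch (this is not Guzzle's actual API, just an illustration of the formats):

```python
from urllib.parse import urlencode, quote

params = {"q": "goober", "fq": ["foo", "bar"]}

# Bare repetition of the key, Solr-style:
print(urlencode(params, doseq=True))  # q=goober&fq=foo&fq=bar

# Brackets without indices, the form our /search route needed:
def encode_with_brackets(params):
    pairs = []
    for key, value in params.items():
        if isinstance(value, list):
            pairs.extend(f"{quote(key)}[]={quote(str(v))}" for v in value)
        else:
            pairs.append(f"{quote(key)}={quote(str(value))}")
    return "&".join(pairs)

print(encode_with_brackets(params))  # q=goober&fq[]=foo&fq[]=bar
```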

This was also not ideal, as our API is not prepared to handle fq[n] fields of an unknown n quantity. The verdict is still out if/how python Flask-RESTful can handle that kind of regex parsing.

So, we needed to fix HTTP requests on the way out too. The end result was two places in a typical advanced query that required GET parameter fixing. We created a QueryBuilder class that is invoked where and when needed to prepare incoming and outgoing GET parameters. The best part is, this class has become a logical place to house any complex behavior related to search and query parameter parsing and prepping.
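The class itself is PHP and internal, but the shape of the idea — normalize brackets away on the way in, add them back on the way out — can be sketched in a few lines. A hypothetical Python sketch, not our actual implementation:

```python
from urllib.parse import parse_qs

class QueryBuilder:
    """Hypothetical sketch of two-way GET parameter prep."""

    @staticmethod
    def incoming(query_string):
        # Strip [] suffixes so downstream code sees plain keys.
        parsed = parse_qs(query_string)
        return {key.rstrip("[]"): values for key, values in parsed.items()}

    @staticmethod
    def outgoing(params):
        # Re-append [] to repeating keys so a parse_str-style
        # parser on the receiving end keeps every value.
        return {f"{key}[]" if isinstance(value, list) else key: value
                for key, value in params.items()}
```

Incoming ?fq[]=foo&fq[]=bar becomes {'fq': ['foo', 'bar']}, and outgoing lists get their brackets back before the request is built.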

So what is this thing? When does QueryBuilder become a thing unto itself? Why can’t Guzzle optionally omit indices in an HTTP request when passed an associative array as parameters? Why can’t Slim parse a route with repeating GET parameters that don’t have square brackets?

These are the questions that make all of this occasionally frustrating, but always interesting. Observing that the libraries we use to parse and prepare HTTP requests were following conventions incompatible with components up- and downstream prompted the creation of a class that is proving to be supremely helpful.

Multiple GET Parameters

An interesting aside about GET parameters, particularly of the multiple variety.

Solr accepts, and where appropriate expects, the same GET parameter multiple times, e.g. the fq parameter:

http://example.org/solr/search?q=goober&fq=foo:bar&fq=foo:baz

Pardon an oversimplification, but in this scenario Solr is using a custom parser to parse the multiple fq GET parameters. It is custom, in a sense, because RFC 3986, which serves as the specification for generic URIs, doesn’t explicitly discuss how to handle multiple GET parameters.

But they exist. And Solr is a great example.

Speculating further about what’s under the hood in Solr, you can divine that it also allows nesting of values in GET parameters, as demonstrated by fields like facet.field which, in addition to being repeatable, also exists next to a frighteningly similar field, facet. When Solr parses a URL such as:

http://example.org/solr/search?q=goober&fq=foo:bar&fq=foo:baz&facet=true&facet.field=foo

we can assume that anything with a facet. prefix, like facet.field, is probably getting grouped as a nested structure Solr-side.
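That grouping is easy to imagine. Purely an assumption about Solr's internals, sketched here for illustration:

```python
from urllib.parse import parse_qs

def group_facet_params(query_string):
    # Split params into facet-related (facet, facet.*) and the rest,
    # roughly how we imagine Solr groups them internally.
    parsed = parse_qs(query_string)
    facets = {k: v for k, v in parsed.items()
              if k == "facet" or k.startswith("facet.")}
    rest = {k: v for k, v in parsed.items() if k not in facets}
    return facets, rest

facets, rest = group_facet_params(
    "q=goober&fq=foo:bar&fq=foo:baz&facet=true&facet.field=foo")
print(facets)  # {'facet': ['true'], 'facet.field': ['foo']}
print(rest)    # {'q': ['goober'], 'fq': ['foo:bar', 'foo:baz']}
```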

But how do other systems handle this?

There is a convention, not a specification, that I stumble on from time to time and that can be a bit of a headache. Some libraries fall back on square brackets [] affixed to the end of a field to tell future parsers that this field is repeating, and should be slotted into some kind of list or array instead of overwriting a key/value pair previously seen in the URL.

This is great, and works well for back-and-forths between systems, but can be complicated when those parameters eventually need to be slung over to Solr. Python Flask, for example, out of the box, only handles repeating GET parameters when they come in with the [] suffix. e.g.

http://example.org/route?fq[]=bar&fq[]=baz

This means, before you can scuttle over to Solr, you’d need to rename these fq[] keys to fq, as Solr does not know what to do with fq[].
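The rename itself is trivial once the parameters are in hand. A sketch, assuming they have already been parsed into a dict of lists:

```python
def unbracket(params):
    # Rename 'fq[]'-style keys to plain 'fq' before forwarding to Solr.
    return {key[:-2] if key.endswith("[]") else key: value
            for key, value in params.items()}

print(unbracket({"fq[]": ["bar", "baz"], "q": "goober"}))
# {'fq': ['bar', 'baz'], 'q': 'goober'}
```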

Just one of those things. But interesting, and perhaps telling, that the HTTP protocol and parameter parsing are getting pushed to their logical limits in this day and age.

Advanced Search

We’re getting into advanced search. With topic modeling and other alt-metadata searching around the corner, we want to make sure we’ve got real estate established in the new front-end for the digital collections we’ve been working on, for unique / beta / interesting / experimental search options and interfaces.

So, decided to start wiring up an “advanced search” page to augment our powerful, but relatively simple, search page. Like anything interesting, this candy bar has split higher up than expected, and I’m left chewing much more nougat than anticipated. Great adventure.

A first step – er, middling step – was to investigate what “advanced search” looks like in other online search interfaces. Surprisingly, and also unsurprisingly for reasons hopefully touched on later, the world of “advanced search” has shrunk in the age of facets and intelligent, powerful, omnipotent back-end search engines such as Solr. The days of crafting the perfect boolean search are dwindling as other search strategies become more popular. This is not to discount a single, surgical search. Not at all. For literature reviews, or general prudent thoroughness, a well-crafted, complex search is still incredibly powerful. And it is actually still possible through many single-searchbox interfaces.

Take Google for example. With keywords, you can define quite particular searches.

Search only websites on digital.library.wayne.edu:

site:digital.library.wayne.edu winter czar

Or even limiting words:

seattle -space needle

But many users don’t know these little tricks. Since the dawn of online search interfaces, there have often been accompanying “advanced search” interfaces that provide a handful of boxes and inputs that users can use to refine their search. Our own library catalog is a nice example:

[Screenshot: our library catalog’s advanced search]

But as I started to dig into what other advanced search interfaces looked like for digital collections or archives, I started to notice a trend. Something I’m referring to as the mighty four in my internal running monologue. They are variations on the following four “advanced” search options:

  • all / any of these words
  • this exact word or phrase
  • none of these words

Some will differentiate between all and any, some do not (yet in my simple mind, they still get the mighty four moniker). But they share a likeness I hadn’t really stopped to consider until this undertaking today. Here are some examples:

[Screenshot]

Google Books

[Screenshot]

Google Advanced Search

[Screenshot]

Advanced Search from the lovely University of North Texas (UNT) Digital Library (whom I turn to often for inspiration and sanity checks, thanks UNT!).

I, and I’m sure anyone even tangentially involved in the library, archives, and information realm, could write a small novella on the evolution and state of online search interfaces. They are as important to us as the tensile strength of fishing line is to deep-sea fishers, the tin/lead ratio to stained glass makers, the fat/acid ratio to cooks, and so on and so forth. However, given the myriad insights and perspectives around search interfaces, it’s hard to ignore that facets have taken on a prominent role. Think of Amazon, eBay, and the list goes on. Facets are incredibly powerful, and have changed some thinking on search. Instead of high-powered, laser-focused boolean searches, we are looking more to iterative refinement in our searching.

For the most part, I think this is a good thing. I’m a big fan of doing 10-20 searches in what-interface-have-you, mentally sifting and collating what you learned, then rinsing and repeating, as compared to toiling and banging your head over a perfect set of logical operators that will reveal your images of booze running from Canada to Detroit in the 1920s.

But advanced search has a place, particularly when new and interesting search options that hinge on things like community-generated metadata, or machine-learned topics, are nigh at our disposal.