Re: the last couple of posts about repeating GET parameters, and how PHP is slightly unconventional in how it parses them. We came up with a solution: a `QueryBuilder` class. It was a particularly pernicious problem, and time will tell how well our solution scales and evolves. The problem came down to how the Slim PHP framework parsed GET parameters, and how the Guzzle PHP client encoded them.
Slim used the built-in PHP function `parse_str`, which follows the PHP convention of only capturing repeating GET parameters when the GET parameter string contains square brackets (`[]`) around those repeating fields. For example, `?fq=foo&fq=bar` would get truncated to just the last value, `['fq' => 'bar']`. However, if square brackets were used, repeating values would get picked up from `?fq[]=foo&fq[]=bar` and become `['fq' => ['foo', 'bar']]`.
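For contrast, here is a minimal sketch (Python, since it comes up later in the post) of how `urllib.parse.parse_qs` treats the same bare repeated key: every occurrence is kept, no brackets required, which is exactly the behavior `parse_str` reserves for bracketed fields.

```python
from urllib.parse import parse_qs

# Python's parse_qs collects every occurrence of a repeated key into a
# list, square brackets or not -- unlike PHP's parse_str, which keeps
# only the last value for a bare repeated key like fq=foo&fq=bar.
params = parse_qs("fq=foo&fq=bar")
print(params)  # {'fq': ['foo', 'bar']}
```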
So, we needed to always send requests to our `/search` route with square brackets. But we did not want the indices that `http_build_query` includes when building a GET param string, as that would hurt our ability to manipulate the URL by cherry-picking known parameters to remove.
`http_build_query` is what Guzzle uses to build GET parameters for an HTTP request. As alluded to above, an associative array like `'fq' => ['foo', 'bar']` would result in a string like `fq[0]=foo&fq[1]=bar` (with the brackets URL-encoded). This was also not ideal, as our API is not prepared to handle `fq[n]` fields of an unknown `n` quantity. The verdict is still out on if/how Python Flask-RESTful can handle that kind of regex parsing.
So, we needed to fix HTTP requests on the way out too. The end result was two places in a typical advanced query that required GET parameter fixing. We created a `QueryBuilder` class that is invoked where and when needed, to prepare I/O GET parameters. The best part is, this class has become a logical place to house any complex behavior related to search and query parameter parsing and prepping.
So what is this thing? When does `QueryBuilder` become a thing unto itself? Why can’t Guzzle optionally leave the indices out of an HTTP request when passed an associative array as parameters? Why can’t Slim parse a route with repeating GET parameters that don’t have square brackets?
These are the questions that make all of this occasionally frustrating, but always interesting. Observing that the libraries we use to parse and prepare HTTP requests were following conventions incompatible with components up- and downstream prompted the creation of a class that is proving to be supremely helpful.
An interesting aside about GET parameters, particularly of the multiple variety. Solr accepts, and where appropriate expects, the same GET parameter multiple times, e.g. the `fq` (filter query) parameter. Pardon an oversimplification, but in this scenario Solr is using a custom parser to parse the multiple GET parameters. It is custom, in a sense, because RFC 3986, which serves as a specification for generic URLs and parsing parameters, doesn’t explicitly discuss how to handle multiple instances of the same parameter. But they exist. And Solr is a great example.
Further speculating about what happens under the hood in Solr, you can divine that it also allows nesting of values in GET parameters, as demonstrated with fields like `facet.field`, which, in addition to being repeatable, also exists next to a frighteningly similar field, `facet`. When Solr parses a URL such as `?facet=true&facet.field=subject&facet.field=creator` (an illustrative example), we can assume that anything with a `facet.` prefix, like `facet.field`, is probably getting grouped as a nested structure Solr-side.
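As a back-of-the-napkin illustration of that guess (not Solr's actual parser, and the field names are made up), grouping dotted-prefix parameters after a permissive parse might look like:

```python
from urllib.parse import parse_qs

def group_prefix(query_string, prefix="facet."):
    # Illustrative only: split a parsed query string into params that
    # share a dotted prefix (grouped, prefix stripped) and the rest.
    flat = parse_qs(query_string)
    grouped = {k[len(prefix):]: v for k, v in flat.items() if k.startswith(prefix)}
    rest = {k: v for k, v in flat.items() if not k.startswith(prefix)}
    return grouped, rest

grouped, rest = group_prefix("facet=true&facet.field=subject&facet.field=creator")
print(grouped)  # {'field': ['subject', 'creator']}
print(rest)     # {'facet': ['true']}
```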
But how do other systems handle this?
There is a convention, not a specification, that I stumble on from time to time, and it can be a bit of a headache. Some libraries fall back on using square brackets (`[]`) affixed to the end of a field name to tell future parsers that this field is repeating, and should be slotted into some kind of list or array, instead of overwriting a key/value pair previously seen in the URL. This is great, and works well for back-and-forths between systems, but can be complicated when those parameters eventually need to be slung over to Solr. Python Flask, for example, out of the box, only handles repeating GET parameters when they come in with the `[]` suffix, e.g. `?fq[]=foo&fq[]=bar`.
This means, before you can scuttle over to Solr, you’d need to rename these `fq[]` keys to `fq`, as Solr does not know what to do with `fq[]`.
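That renaming step is trivial but easy to forget. A sketch, with hypothetical helper and key names:

```python
def strip_bracket_suffix(params):
    # Rename "fq[]"-style keys to bare "fq" before handing the
    # parameters off to Solr, leaving all other keys untouched.
    return {
        (key[:-2] if key.endswith("[]") else key): value
        for key, value in params.items()
    }

print(strip_bracket_suffix({"fq[]": ["foo", "bar"], "q": "winter"}))
# {'fq': ['foo', 'bar'], 'q': 'winter'}
```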
Just one of those things. But interesting, and perhaps telling, that the HTTP protocol and parameter parsing are getting pushed to their logical limits in this day and age.
We’re getting into advanced search. With topic modeling and other alt-metadata searching around the corner, we want to make sure we’ve got real estate established in the new front-end for the digital collections we’ve been working on, for unique / beta / interesting / experimental search options and interfaces.
So, I decided to start wiring up an “advanced search” page to augment our powerful, but relatively simple, search page. Like anything interesting, this candy bar has split higher up than expected, and I’m left chewing much more nougat than anticipated. Great adventure.
A first step – er, middling step – was to investigate what “advanced search” looks like in other online search interfaces. Both surprising and unsurprising, for reasons hopefully touched on later, the world of “advanced search” has shrunk in the age of facets and intelligent, powerful, omnipotent back-end search engines such as Solr. The days of crafting the perfect boolean search are dwindling as other search strategies become more popular. This is not to discount a single, surgical search. Not at all. For literature reviews, or general prudent thoroughness, a well-crafted, complex search is still incredibly powerful. And it is actually still possible through many single-searchbox interfaces.
Take Google, for example. With keywords, you can define quite particular searches. Search only websites on a particular domain:

`site:digital.library.wayne.edu winter czar`

Or even exclude words:

`seattle -space needle`
But many users don’t know these little tricks. Since the dawn of online search interfaces, there have often been accompanying “advanced search” interfaces that provide a handful of boxes and inputs that users can use to refine their search. Our own library catalog is a nice example:
But as I started to dig into what other advanced search interfaces looked like for digital collections or archives, I started to notice a trend. Something I’m referring to as the mighty four in my internal running monologue. They are variations on the following four “advanced” search options:
- all of these words
- any of these words
- this exact word or phrase
- none of these words
Some will differentiate between all and any, some do not (yet in my simple mind, they still get the mighty four moniker). But they share a likeness I hadn’t really stopped to consider until this undertaking today. Here are some examples:
Google Advanced Search
Advanced Search from the lovely University of North Texas (UNT) Digital Library (whom I turn to often for inspiration and sanity checks, thanks UNT!).
I, and I’m sure anyone even tangentially involved in the library, archives, and information realm, could write a small novella on the evolution and state of online search interfaces. They are as important to us as the tensile strength of fishing line is to deep sea fishers, the tin / lead ratio to stained glass makers, the fat / acid ratio to cooks, and so on and so forth. However, given the myriad insights and perspectives around search interfaces, it’s hard to ignore that facets have taken on a prominent role. Think of Amazon, eBay, and the list goes on. Facets are incredibly powerful, and have changed some thinking on search. Instead of high-powered, laser-focused boolean searches, we are looking more to iterative refinement in our searching.
For the most part, I think this is a good thing. I’m a big fan of doing 10-20 searches in what-interface-have-you, mentally sifting and collating what you learned, then rinsing and repeating. Compare that to toiling and banging your head over a perfect set of logical operators that will reveal your images of booze running from Canada to Detroit in the 1920s.
But advanced search has a place, particularly when new and interesting search options that hinge on things like community-generated metadata or machine-learned topics are nigh at our disposal.