We’re getting into advanced search. With topic modeling and other alt-metadata searching around the corner, we want to make sure we’ve got real estate established in the new front-end for the digital collections we’ve been working on, for unique / beta / interesting / experimental search options and interfaces.
So, I decided to start wiring up an “advanced search” page to augment our powerful, but relatively simple, search page. Like anything interesting, this candy bar has split higher up than expected, and I’m left chewing much more nougat than anticipated. Great adventure.
A first step – er, middling step – was to investigate what “advanced search” looks like in other online search interfaces. Both surprisingly and unsurprisingly, for reasons hopefully touched on later, the world of “advanced search” has shrunk in the age of facets and intelligent, powerful, omnipotent back-end search engines such as Solr. The days of crafting the perfect boolean search are dwindling as other search strategies become more popular. This is not to discount a single, surgical search. Not at all. For literature reviews, or general prudent thoroughness, a well-crafted, complex search is still incredibly powerful. And it is actually still possible through many single-searchbox interfaces.
Take Google for example. With keywords, you can define quite particular searches.
Search only pages on a particular site:
site:digital.library.wayne.edu winter czar
Or even limiting words:
seattle -space needle
But many users don’t know these little tricks. Since the dawn of online search interfaces, there have often been accompanying “advanced search” interfaces that provide a handful of boxes and inputs that users can use to refine their search. Our own library catalog is a nice example:
But as I started to dig into what other advanced search interfaces looked like for digital collections or archives, I started to notice a trend. Something I’m referring to as the mighty four in my internal running monologue. They are variations on the following four “advanced” search options:
- all / any of these words
- this exact word or phrase
- none of these words
Some will differentiate between all and any, some do not (yet in my simple mind, they still get the mighty four moniker). But they share a likeness I hadn’t really stopped to consider until this undertaking today. Here are some examples:
Google Advanced Search
Advanced Search from the lovely University of North Texas (UNT) Digital Library (whom I turn to often for inspiration and sanity checks, thanks UNT!).
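To make the mighty four a little more concrete, here is a minimal sketch of how the boxes on an advanced search form might collapse into a single query string for one search box. It assumes Lucene-style syntax (`+` for required, `-` for excluded, quotes for phrases, `OR` for any-of), which is what Solr's standard query parser understands. The function name `build_query` and its parameters are made up for illustration, and no escaping of special characters is attempted.

```python
def build_query(all_words="", any_words="", exact_phrase="", none_words=""):
    """Combine the "mighty four" form fields into one Lucene-style query string.

    Hypothetical helper: names and behavior are assumptions for illustration.
    """
    parts = []

    # "all of these words": every term is required
    parts += [f"+{word}" for word in all_words.split()]

    # "any of these words": at least one term from the group must match
    any_terms = any_words.split()
    if any_terms:
        parts.append("+(" + " OR ".join(any_terms) + ")")

    # "this exact word or phrase": quoted so the terms must appear together
    phrase = exact_phrase.strip()
    if phrase:
        parts.append(f'+"{phrase}"')

    # "none of these words": every term is excluded
    parts += [f"-{word}" for word in none_words.split()]

    return " ".join(parts)


query = build_query(
    all_words="detroit canada",
    any_words="boat ship",
    exact_phrase="rum running",
    none_words="hockey",
)
print(query)  # +detroit +canada +(boat OR ship) +"rum running" -hockey
```

Seen this way, the advanced search form is mostly a friendlier face on operators the single search box already accepts.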
I, and I’m sure anyone even tangentially involved in the library, archives, and information realm, could write a small novella on the evolution and state of online search interfaces. They are as important to us as the tensile strength of fishing line for deep sea fishers, the Tin / Lead ratio to stained glass makers, the fat / acid ratios for cooks, and so on and so forth. However, given the myriad of insight and perspectives around search interfaces, it’s hard to ignore that facets have taken on a prominent role in search interfaces. Think of Amazon, eBay, and the list goes on. Facets are incredibly powerful, and have changed some thinking on search. Instead of high-powered, laser-focused boolean searches, we are looking more to iterative refinement in our searching.
For the most part, I think this is a good thing. I’m a big fan of doing 10-20 searches in what-interface-have-you, mentally sifting and collating what you learned, rinse and repeat. As compared to toiling and banging your head over a perfect set of logical operators that will reveal your images of booze running from Canada to Detroit in the 1920s.
But advanced search has a place, particularly when new and interesting search options that hinge on things like community-generated metadata, or machine-learned topics, are nigh at our disposal.
In everything gloriously complex, it can be difficult to ascertain when you’ve reached a level of understanding that might merit debrief and reflection. In my forays into machine learning, this is probably as good a point as any.
I’ve been interested in Machine Learning for quite some time now. Since I first started stumbling on ImageNet and how it helped image processors classify images, to actual tinkering with an Alpha release of Google Vision, through the heady days of self-driving cars, and backing up even further, initial go’s at deriving meaning and topics from text with Python libraries like NLTK.
“Quite some time,” being relative. It’s a rapidly developing area of computer science, philosophy, mathematics, humanities, and their intersections.
Recently I’ve embarked on a project to try and derive topics from a corpus of academic articles in PDF form (around 700 of them), then, by re-submitting one to the model, asking what other articles are “similar”. After quite a bit of trial and error, researching these emerging areas, and homing in on workflows that I might actually whittle into working, I’m thrilled to have some results coming back that are – genuinely – spine-tingling.
The guts of this project fall on the Python library, gensim. A masterful library meant to humanize the multi-dimensional math that underpins machine learning (and I use that term loosely here).
The rough sketch is thus:
- Point this little tool (called atm, for “article topic modeling”, and the way it dispenses fun things to think about) at a Dropbox folder with API credentials
- Download the entire directory, in this case, ~700 articles
- Create a bag of words for each document
- Strip out stopwords and punctuation (poorly at this point, I might add)
- Then the fun part: create a Latent Dirichlet Allocation (LDA) model with gensim
- Index ~100 topics that are suggested by this model, for this given corpus
- Finally, query the model with a document (in this case, an article from the corpus) for similarity to other documents in the corpus
The results are other documents in the corpus, and a percentile on topical vectors that match the article. I should stop myself here: the details are still forming, and while I’m getting a good grasp of how this works at a relatively low level, that’s fodder for another post. At this point, the rough sketches.
Below are the results of a query, run through a Jupyter notebook of atm:
I submitted an article called Arnold_2003.pdf, and it suggested a handful that match topically. The magic, the interest, lies in how these topics are derived and how the similarities are ranked. Much of this can be attributed to the LDA model that gensim creates for me. While these results are fascinating in their own right, what really sends chills down my spine is how much this process / workflow shares with other domains like image processing, self-driving cars, speech recognition, etc.
Google’s TensorFlow has a wonderful tutorial, MNIST for ML, that helped with my understanding. When the inputs you’re dealing with are 28x28 pixel images, of nothing but handwritten numerals from 0-9, you can begin to wrap your head around the math that supports machine learning. When we quantify input – sound, visual, text – into vectors and matrices, we can look for patterns over moving windows of input. Well, computers can. They can see patterns in a machine-digestible version of media we entertain with our senses. And when, with great grit and finesse, we can bubble up these patterns to more high-level libraries, we can apply them to actual corpuses. It’s phenomenal.
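As a toy illustration of "vectors, matrices, and moving windows" (not taken from the tutorial, and far simpler than anything a real library does), the sketch below treats a 4x4 grid of numbers as an image, flattens it into a single vector, and slides a small 2x2 window across it to find where a vertical edge sits. The image, the window values, and the function name `slide_window` are all invented for the example.

```python
def slide_window(image, kernel):
    """Slide `kernel` over `image`; return the weighted sum at each position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1

    responses = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply the window by the patch of image beneath it and sum
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        responses.append(row)
    return responses


# A 4x4 "image" with a single vertical stroke in the second column
image = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

# The same image as one flat vector, the way a 28x28 digit becomes 784 numbers
flat = [pixel for row in image for pixel in row]

# A 2x2 window that reacts when the left column differs from the right column
kernel = [
    [1, -1],
    [1, -1],
]

for row in slide_window(image, kernel):
    print(row)  # each row is [-2, 2, 0]
```

The window responds strongly on either side of the stroke and not at all over the blank area, which is the "pattern over a moving window" idea in miniature.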
The results from atm are already encouraging, and it’s almost by-the-book tuning from tutorials. I want more corpuses, more targeted querying, adjusting of modeling parameters, the works. But for now, I’ve been thrilled to get something working with the tools I love and understand.
Much more to come on this front. An example: taking the thousands of PDFs that will soon flood our digital collections platform at Wayne State Digital Collections, running topic modeling on these, and providing new and interesting ways to find related documents.
Stumbled on an interesting access challenge for our Digital Collections today.
When searching for the keyword, library, all records in the Digital Collections are returned because library exists for each record somewhere in the metadata (probably embedded somewhere like Wayne State University Library).
Though Solr’s stellar ranking boosts relevant records to the top of the results – items that have library in the title, or prominently in the description – it’s still a little unnerving that it returns so many positive results.
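For a sense of how that kind of ranking can be steered, here is a hedged sketch of the request parameters for a Solr query using the eDisMax query parser, where the `qf` parameter weights some fields more heavily than others. The field names (`title`, `description`, `text`) and the boost values are assumptions for illustration, not our actual Solr configuration, and the helper name `build_solr_params` is made up.

```python
from urllib.parse import urlencode


def build_solr_params(keyword, rows=10):
    """Build query parameters for a field-boosted eDisMax search.

    Field names and boosts are hypothetical placeholders.
    """
    return {
        "q": keyword,
        "defType": "edismax",
        # Matches in the title count most, then description, then everything else
        "qf": "title^10 description^4 text",
        "rows": rows,
        "wt": "json",
    }


params = build_solr_params("library")
query_string = urlencode(params)
print(query_string)
```

Every record may still match a keyword like library, but boosts of this kind decide which matches surface first.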
An option would be to limit particular Solr fields from being indexed. BUT, should we start accepting records from other institutions, with varying metadata, that might become an important field to search and facet on. I’m sure there are workarounds for this scenario, but interesting all the same, and indicative of the iterative nature of tuning search and discovery systems.