False Conditions

Was reminded this morning of a lesson that drifts in and out of working on systems with lots of moving parts: all improvements are inextricably tied to the current condition of the supporting infrastructure.

Said another way: anything you do, anything you change, is probably based on information available to you at the time.

But this isn’t the lesson. The lesson is that that kind of decision making is often flawed. I’m sure this comes as no surprise to many, but I’m uncomfortable with how each iteration of an improvement to a particular part of the system brings this same lesson home. Fool me once, shame on you, fool me twice, yada yada.

A concrete example might help.

We have a series of pipes and routes in our server-side API that abstracts routes for images from our IIIF-based Loris image server. So, we can ask for http://foo.bar/item/goober:tronic/thumbnail and get back a thumbnail from the more complicated URL path, http://foo.bar/loris/fedora|goober:tronic/full/full/0/default.png. The latter is not semantically meaningful to many, and contains hardcoded infrastructure such as loris in the URL. Our image server may change, and our goal is to have Cool URIs for things like thumbnails, metadata, etc. As always on this blog, this is an over-simplification for the sake of exploring ideas.
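To make that a little more concrete, here is a minimal sketch of what such a proxy route could look like. Flask, the hostnames, and the identifier handling are assumptions for illustration, not our actual code:

```python
# Hypothetical sketch of the kind of proxy route described above.
from flask import Flask, Response
import requests

app = Flask(__name__)

LORIS_BASE = "http://foo.bar/loris"  # hardcoded infrastructure we want to hide behind Cool URIs


@app.route("/item/<path:pid>/thumbnail")
def thumbnail(pid):
    # Translate the simple, semantic URL into the Loris/IIIF URL behind the scenes
    loris_url = f"{LORIS_BASE}/fedora|{pid}/full/full/0/default.png"
    resp = requests.get(loris_url)
    return Response(resp.content, mimetype=resp.headers.get("Content-Type"))
```

The point of the abstraction is that if the image server changes, only the route body changes; the public URL stays cool.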

Recently, we had the rare and supremely delicious surprise of server kernel patching improving server-side rendering of images in our Python-based image server, Loris. Dramatically. We are still exploring precisely what explains the speed increase (perhaps fodder for another post), but suffice it to say, it’s great. However, thumbnails started to break. The reason: one of our image proxies was streaming the results with the requests Python library. When speeds / rendering / IO were slower on the server, this sped up the load time for thumbnails. But when the server speed increased, it revealed what I’m assuming was some kind of race condition as the bits jumped through these proxied hoops. Again, this is all speculation at this point, but the fact remains that removing the streaming flag from a particular request has fixed the problem, and the thumbnails load even faster.
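In requests terms, the change was roughly the following. This is a simplified sketch; the URL and the surrounding proxy code are illustrative, not the actual application:

```python
import requests

loris_url = "http://foo.bar/loris/fedora|goober:tronic/full/full/0/default.png"

# Before: stream the proxied response in chunks.
# Under the old, slower server conditions this improved perceived load time.
r = requests.get(loris_url, stream=True)
body = b"".join(r.iter_content(chunk_size=8192))

# After: drop the stream flag and read the full body in one go.
# With the faster server this fixed the broken thumbnails, and they load even faster.
r = requests.get(loris_url)
body = r.content
```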

Our original design to stream a response sped up the load time under a particular set of server conditions. Now that the conditions have changed, that decision is no longer correct. How interesting that a decision, once correct, becomes flawed with the passage of time. Such is a day in the life of managing a system with lots of moving parts.

Strange Attractors Part 1

A post, in two parts.

Part 1: Strange Attractors, and their strangely attractive backstory

I recently had the pleasure of driving to Texas and back, and while on the road, partook in some podcasts. One of them, “Stuff You Should Know”, dedicated an episode to Chaos Theory. They did an admirable job with a slippery idea, and while I took away some new insights, this topic was not new to me.

No, I’ve been a fan / advocate / evangelist / lunatic / devotee of Chaos Theory since the heady days of 2001. Early in my undergraduate forays into math and physics, I convinced a professor to let me explore the interesting ramifications of nonlinear and fractal geometry. Before going any further: if you’ve searched for “fractal geometry” or even just “fractals” on the internet, you’ve gotten some websites that look like they emerged from the 1990s. You see, the people that love fractals have a need to share their insights. I was firmly in this camp.

I graphed Hénon maps in QBASIC, porting code from the appendix of – gasp – print books. I drew Sierpinski triangles everywhere. I used “bifurcate” every chance I could. It was a wild high, and I chased that tempestuous beast into the darkest of nights.
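For the curious, the Hénon map is only a couple of lines. Here is a minimal Python port of the sort of thing I was graphing, assuming the classic parameters a = 1.4 and b = 0.3:

```python
# The Henon map: iterate a simple pair of equations and collect the points.
a, b = 1.4, 0.3
x, y = 0.0, 0.0
points = []
for _ in range(10_000):
    x, y = 1 - a * x * x + y, b * x
    points.append((x, y))
# Scatter-plotting `points` traces out the Henon strange attractor.
```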

I became fascinated by coastlines: did you know they are infinitely long? On a map, the circuitous circumference around an island nation like Iceland may look to be… ~1,000 miles? But in reality, we can dive into any fjord or inlet, crouch low and begin poking around in the rocks that roll gently in the lapping tides. Where is the coastline? Is it before or after that pebble? If that pebble is included, don’t we have to include it in the total distance of the island perimeter? Yes, you most certainly do, as a responsible member of society, a reasonable steward of information.
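This is the coastline paradox: the finer your ruler, the longer the measurement. A minimal sketch of the idea, using a Koch curve as a stand-in coastline (the construction and the numbers are purely illustrative):

```python
import math

def koch(points, depth):
    """Replace each segment of a polyline with the four-segment Koch motif."""
    if depth == 0:
        return points
    out = [points[0]]
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        dx, dy = (x2 - x1) / 3, (y2 - y1) / 3
        a = (x1 + dx, y1 + dy)            # one third of the way along
        c = (x1 + 2 * dx, y1 + 2 * dy)    # two thirds of the way along
        # Peak of the bump: the middle third rotated 60 degrees around `a`
        px = a[0] + dx * math.cos(math.pi / 3) - dy * math.sin(math.pi / 3)
        py = a[1] + dx * math.sin(math.pi / 3) + dy * math.cos(math.pi / 3)
        out.extend([a, (px, py), c, (x2, y2)])
    return koch(out, depth - 1)

def length(points):
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Each refinement is like measuring the "coast" with a smaller ruler:
# the measured length grows by a factor of 4/3 every time, without bound.
for depth in range(6):
    print(depth, round(length(koch([(0.0, 0.0), (1.0, 0.0)], depth)), 3))
```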

This leaves us with an uncomfortable truth: “edges” and “boundaries” are often much more complex than we anticipated. Also revealed by fractal geometry: the complex gravitational forces that create a valley of flour in your bowl when an egg is cracked into it are the same ones that shape the timeless Grand Canyon (at least, many are shared - in all likelihood, your flour valleys have not been whipped by wind and rain for millennia). But the point remains: broccoli looks like trees, and a bunch of geese flying look suspiciously like reflections of light on lazily undulating waves.

And yet there is so much more to this story, which again, I must commend the hosts of that podcast for nobly wading through, with patience and a singular sense of where they are in the discussion. Restating known things can be useful for honing intuition, and setting the stage. So to explore the strange attraction of Strange Attractors, we must go back to Sir Isaac Newton, King Oscar II of Sweden, and the “n-body problem”.

Newton could predict apples falling, and was having good success predicting comets and cannonballs. So what about predicting the location of a handful of planets 10, 100, 1000 years out? Not so much:

“Knowing three orbital positions of a planet’s orbit – positions obtained by Sir Isaac Newton (1643-1727) from astronomer John Flamsteed[6] – Newton was able to produce an equation by straightforward analytical geometry, to predict a planet’s motion; i.e., to give its orbital properties: position, orbital diameter, period and orbital velocity.[7] Having done so, he and others soon discovered over the course of a few years, those equations of motion did not predict some orbits very well or even correctly.[8] Newton realized it was because gravitational interactive forces amongst all the planets was affecting all their orbits.”

And here, the badlands of Chaos Theory and nonlinear dynamics peeked forth. Scientists and mathematicians started to understand that each planet’s gravity was simultaneously pulling on all the others. Each moment, or “iteration”, magnified any variances or inaccuracies in the initial measurements that were plugged into the equations. If a cue ball hitting three pool balls with speed x is turned into equations, we can predict where they will be after 1000 “moments”. However, if speed x is ever so slightly different, those pool balls will be in wildly different places just a few “moments” later. This is a gross over-simplification of the n-body problem, but it gets at the heart of it.
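Sensitive dependence is easy to see in a system even simpler than pool balls or planets. A minimal sketch, using the logistic map as a stand-in (the parameter and starting values are illustrative):

```python
# Iterate the logistic map from two almost-identical starting values and
# watch the tiny difference get magnified.
r = 3.9                    # a parameter value in the chaotic regime
x, y = 0.5, 0.5 + 1e-9     # initial conditions differing by one part in a billion
for step in range(1, 61):
    x = r * x * (1 - x)
    y = r * y * (1 - y)
    if step % 10 == 0:
        print(step, abs(x - y))
# After a few dozen iterations the separation is of order 1: the two
# trajectories are effectively unrelated.
```

The real n-body equations behave the same way, only with many more variables pulling on each other at once.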

King Oscar II held a contest for anyone who could solve this problem. As I don’t intend to delve too deeply into the history, but instead muse on what it might mean in the present, I will continue to muddy and gloss over the finer points. This article explores the contest and solution in much better detail. But for our purposes here, Poincaré pointed out that it was unsolvable, and won the prize. He proved that a perfect prediction relied on infinitely accurate measurements, and we know that’s not possible. And Chaos Theory was born.

Fast forward lots of years, and we’re finally getting back to Strange Attractors: the meteorologist Edward Lorenz stumbled on this very problem while trying to distill complex equations around thermal dynamics into a simpler form. This passage from Wikipedia sums it up nicely,

“Minute variations in the initial values of variables in his twelve-variable computer weather model (c. 1960, running on an LGP-30 desk computer) would result in grossly divergent weather patterns.[2] This sensitive dependence on initial conditions came to be known as the butterfly effect (it also meant that weather predictions from more than about a week out are generally fairly inaccurate).[13]”

And here is what he graphed:
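For anyone who wants to reproduce it, here is a minimal sketch using scipy and matplotlib, assuming Lorenz’s classic parameters (sigma = 10, rho = 28, beta = 8/3):

```python
# Integrate the Lorenz equations and plot the trajectory in 3D.
import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

sol = solve_ivp(lorenz, (0, 50), [1.0, 1.0, 1.0], dense_output=True)
t = np.linspace(0, 50, 20_000)
x, y, z = sol.sol(t)

ax = plt.figure().add_subplot(projection="3d")
ax.plot(x, y, z, linewidth=0.5)
ax.set_title("Lorenz attractor")
plt.show()
```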

The hands grow cold, the coffee cup has tipped past halfway. Where are we in this discussion? Those dark spots in the graph above, those are the strange attractors we’ve danced around. Why does this matter? Why is it interesting? How is this related to coastlines, pebbles, broccoli, and eggs? That is fodder for another post…

Bag Validation

[Screenshot: bag_validation.png]

Something didn’t happen today, and it was incredibly validating. No validation pun intended here. Honestly.

We’ve been ingesting ebooks into our digital collections platform for quite some time now, and one perennial problem we face is that of malformed books. This can come in many forms, and it hinges on how we structure our digital ebooks. At a very high level, each physical page gets about five digital representations:

  • the page image, as a TIFF file
  • raw text from the page, as a text file
  • coordinates of words on the page, as an ALTO XML file
  • PDF of the page, with words overlaying the page image
  • HTML of the page, including some limited layout (this one is not great, may be deprecated soon)

So, for a 100-page book, you might – should – end up with 500 distinct files: 001.tif, 001.pdf, 001.xml, 001.txt, 001.html, 002.tif, you get the point.
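That naming convention is simple enough to sketch in a few lines of Python (the extensions are as listed above; the helper itself is just illustrative):

```python
# Build the full set of derivative filenames we expect for a book.
EXTENSIONS = ("tif", "pdf", "xml", "txt", "html")

def expected_files(page_count):
    """Every derivative expected for a book with `page_count` pages."""
    return {f"{page:03d}.{ext}"
            for page in range(1, page_count + 1)
            for ext in EXTENSIONS}

print(len(expected_files(100)))   # 500, as described above
```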

Before we had any ebook-specific checks in place, it was not uncommon for a book to enter the ingest workflow missing a single PDF, XML, or other file relating to a page. Or even an entire page (001, 002, 004, no 003). This would result in ebooks ingested, but missing key components that would rile things up down the pipeline in highly annoying and unforeseen ways. From a preservation standpoint, it was also not ideal to allow missing derivatives to slip through.

But those days are mostly over. We have included some bag validation for each different kind of content type in our digital collections that looks for specific properties. For example, an Image object should roll through with an original image, a JP2 derivative, a thumbnail, etc. If one of those is missing, it fails the validation on that datastream. For ebooks, we’re looking for parity of derivative file counts for each page. If a page comes through missing something, we get notified in our Ingest Workspace (fodder for another post), and that bag (object) is prevented from getting ingested.
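Here is a minimal sketch of that ebook parity check. The directory layout, names, and return value are assumptions for illustration, not our actual ingest code:

```python
import os
from collections import defaultdict

EXPECTED = {"tif", "pdf", "xml", "txt", "html"}

def validate_ebook_bag(data_dir):
    """Return a list of problems; an empty list means the bag passes this check."""
    pages = defaultdict(set)
    for name in os.listdir(data_dir):
        stem, _, ext = name.partition(".")
        if stem.isdigit():
            pages[stem].add(ext.lower())

    problems = []
    # Pages present but missing one or more derivatives
    for page, exts in sorted(pages.items()):
        missing = EXPECTED - exts
        if missing:
            problems.append(f"page {page} missing: {', '.join(sorted(missing))}")
    # Whole pages skipped entirely (001, 002, 004, no 003)
    numbers = sorted(int(p) for p in pages)
    if numbers:
        for n in range(numbers[0], numbers[-1] + 1):
            if f"{n:03d}" not in pages:
                problems.append(f"page {n:03d} missing entirely")
    return problems
```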

In this way, we can let 44/45 perfectly good books get ingested, then diagnose the rogue baddie. The wheels of digitization and access roll on, and we identify things that need fixing. The screenshot above shows this check firing on a batch today, long after I had forgotten about putting it in place. Great huzzahs!