Image Processing and Reunification Workshop

“zooming in as de-familiarizing”, Laura Wexler

I recently had the pleasure of attending the Image Processing and Reunification Workshop at the University of Maryland, and wanted to pull together some thoughts and notes while it was fresh on the brain. I can let the workshop’s website speak to the motivation and actualization of the event, but my enthusiastic thanks and kudos to the primary organizers Ricky Punzalan and Trevor Muñoz for putting it together.

I hardly know where to begin. Maybe with the quote above, which occurred on the last day of the workshop, and work backwards.

The workshop sought to bring together people from computer vision, the humanities, and the Libraries / Archives / Museums (LAM) world to share understandings around image processing and image reunification.

  • What do those terms mean?
  • How are they used differently across domains?
  • How can computer scientists, humanities scholars, and what I’ll refer to here as “digital LAMs” collaborate, share and write grants, and otherwise explore these areas together?
  • In the words of the workshop programming, what are the great challenges in each area?

I came into the workshop thinking it was going to focus on standards like the International Image Interoperability Framework (IIIF), and on digital object repositories like Fedora Commons that serve said images.

In a nutshell, the fact that I assumed the discussions would take those technologies and standards as their basis is precisely why I found this workshop so helpful and constructive. Though I expected the diversity of attendees to steer the conversation in many directions, I hadn’t anticipated that IIIF and the technical mechanics of image sharing – which seem so central to image processing and reunification to me – would be but one part of the puzzle.

This is worth restating: I understood there is plenty in this space that does not involve the preservation of and access to images, but the majority of my focus and energy goes toward just that, and a certain tunnel vision creeps in. This workshop helped me remove those goggles for a moment and see the role of libraries, archives, etc. amidst the overlapping Venn diagrams of computer vision, research questions, and national and local research agendas.

So the quote? It paraphrases something that came up in the second day’s discussion, when Laura Wexler quipped that zooming in on an image can have de-familiarizing effects. This is “zooming in” literally and figuratively. As we dive into the pixels of an image, the writing on a building, the eyelashes of a face, we risk losing the context of that detail within the greater landscape of the image. Similarly, in focusing on just one image, we might fail to see how it relates to other images from the same roll, the same photographer, the same place, etc. Lauren Tilton, who works with Taylor Arnold and Laura Wexler on the excellent Photogrammar project, asked what “Distance Looking” might mean. What is happening when we take in 100,000 images in one interface, in one moment?

I wish I had my notebook nearby to tether these bullets to the thoughts I penned during the workshop, but alas I do not. Rather than lose these thoughts while chasing the lure of more, I will post this now. I would say I hope or plan to revisit and fill in the gaps, but if I’ve learned anything about the internet, it’s that “check back often” or “coming soon” never fares well. What I can say is that on the heels of this workshop I had the good fortune of listening to some of the same speakers via a live webcast of Collections as Data 2016, and I realized what had happened: my thinking had been irrevocably broadened by the workshop, and the connections abound. My thanks again to the organizers; it was a joyous romp into the intersections of computer science, archival principles, and fascinating research.

.0000789%

And then there are edge cases. Let’s start with the numbers. At current count, we’ve got:

  • 591 books
  • 76,010 images

As mentioned in a previous post, we have recently changed how our ebooks are modeled. As such, we have also had cause to re-create IIIF manifests for each book. In the process, I identified 6 books that weren’t cooperating with our current soup-to-nuts ingest workflow for an ebook. Though admittedly I’ve been wrangling some of these books into the new form for a good week or two, I couldn’t help but think that number is pretty good.

What is 6 books not cooperating out of 591?

.01%

That’s a number. Here’s another. Since it was only one page per book that didn’t cooperate, we can think of each book as actually an amalgamation of images.

What about 6 images out of 76,010?

.0000789%

What!?!?!

Either we are nearly flawless orchestrators of ingest workflow brilliance, or the human mind can hardly fathom how effortlessly computers crank through work.

I’m inclined to believe the latter.
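For anyone who wants to double-check those ratios, here is a quick back-of-the-envelope calculation in Python using the counts above (the variable names are mine, for illustration only):

# Quick check of the failure ratios quoted above.
books_total = 591
images_total = 76010
uncooperative = 6

print(f"{uncooperative / books_total:.4f}")   # 0.0102 -- roughly 1 book in 100
print(f"{uncooperative / images_total:.7f}")  # 0.0000789 -- roughly 8 images in 100,000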

Fedora Commons and Datastream Dis-Contents

Had a good headscratcher recently.

Ingest times for ebooks going into Fedora Commons were becoming suspiciously long. Specifically, they would slow down over the course of a single book’s ingest: the first 0-20 pages would fly along, then would become noticeably slower, until they were but crawling along.

At the time of these slowdowns, our ebooks were modeled as a single digital object in Fedora (more on this below), with the following approximate structure for pages in the ebook object:

- IMAGE_1
- IMAGE_1_JP2
- HTML_1
- ALTOXML_1
...
- IMAGE_n
- IMAGE_n_JP2
- HTML_n
- ALTOXML_n

So a 100-page book might have approximately 400 datastreams. This was v2 of our ebook modeling.
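To make that arithmetic concrete, here is a tiny sketch of how the v2 layout multiplies pages into datastreams (the helper function is hypothetical, for illustration only):

# Sketch of the v2 layout: four datastreams per page, all attached
# to a single book object. The helper name is hypothetical.
def v2_datastream_ids(page_count):
    ids = []
    for n in range(1, page_count + 1):
        ids.extend([f"IMAGE_{n}", f"IMAGE_{n}_JP2", f"HTML_{n}", f"ALTOXML_{n}"])
    return ids

print(len(v2_datastream_ids(100)))  # 400 datastreams for a 100-page book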

Our first, v1 model for ebooks was a bit more abstracted, with the different file types for each page broken out into their own objects. The rough model below shows these relationships, where circles represent objects in Fedora:

v1 Ebook Model

In that model, any one of the “child” objects would still have as many datastreams as there were pages in the book, which could be 100, 200, 1000+. It was, in retrospect, better than v2, but not optimal.
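Roughly, the v1 layout looked something like this (object identifiers and child-object names are made up, for illustration only):

# Rough sketch of the v1 layout: one "child" object per file type,
# each carrying one datastream per page. Object IDs are hypothetical.
pages = 100
v1_book = {
    "book:123": {  # parent book object
        "book:123-image": [f"IMAGE_{n}" for n in range(1, pages + 1)],
        "book:123-jp2": [f"IMAGE_{n}_JP2" for n in range(1, pages + 1)],
        "book:123-html": [f"HTML_{n}" for n in range(1, pages + 1)],
        "book:123-altoxml": [f"ALTOXML_{n}" for n in range(1, pages + 1)],
    }
}
# Each child object still carries `pages` datastreams -- 100, 200, 1000+.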

We eventually abandoned that model in favor of the v2, single-object approach for simplicity’s sake. We were pushing back a bit on over-engineering and over-abstracting the components of a complex digital object in Fedora. For management / preservation purposes, we had but one object to shepherd! It seemed great for the long term. This is, of course, a philosophical debate in and of itself, but suffice it to say we had talked through many of the pros and cons, and the single-object approach was working nicely for us. We had a few hundred large books already ingested under v1, and after switching to this v2, single-object model, we began ingesting a couple runs of serials. These serials, importantly, contained “books” of only 15-50 pages apiece.

And this is why we didn’t notice the many-datastreams-per-object problem for some time. Until, that is, we started ingesting books with 400, 500, 1000+ pages and began noticing the “Great Slowdown.”

The graph below represents the time each datastream took to add to a single digital object, for an example v2 book representative of the problem. The y-axis is time in seconds, the x-axis is each datastream (page), and the different lines are the different kinds of datastreams (IMAGE, ALTOXML, HTML). As you can pretty clearly see, particularly in the blue IMAGE line which amplifies the trend, each datastream took increasingly long to add to the object:

Ingest Times for Gulliver's Travels

While I could tell they were slowing down during ingest, it wasn’t until graphing it out that the uniformity of the slowdown became so evident. At first I suspected the large TIFF images we were converting to JP2, or the hundreds (sometimes thousands) of RELS-INT and RELS-EXT relationships we were creating; then I wondered about garbage collection in Java for Fedora, something I wasn’t terribly knowledgeable about. I even posted to StackOverflow, hoping someone might recognize that familiar curve and have insight or advice. Though I appreciated the suggestions and places to look next for tuning GC for Tomcat, that wasn’t the problem.

Curious whether it was related to the RELS-INT or JP2 conversion background tasks, I tried iteratively adding thousands of nearly empty, plain-text datastreams to an object – timing those too – as a test (a rough sketch of the test follows the graph below). And wouldn’t you know, there was that curve again:

Ingest Times for Tiny Datastreams
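For the curious, the test itself only takes a handful of lines. Below is a rough sketch of the kind of thing, assuming Fedora 3’s REST API and using plotly (which gets a nod below) for the graph; the host, object PID, credentials, and exact endpoint parameters are placeholders and would need adjusting for a real install:

# Rough sketch of the timing test: add thousands of tiny, plain-text
# datastreams to a single object, timing each addition, then plot the curve.
# Host, PID, credentials, and endpoint parameters are placeholders.
import time
import requests
import plotly.graph_objects as go

FEDORA = "http://localhost:8080/fedora"
PID = "test:timing"
AUTH = ("fedoraAdmin", "fedoraAdmin")

timings = []
for i in range(5000):
    start = time.time()
    requests.post(
        f"{FEDORA}/objects/{PID}/datastreams/DS_{i}",
        params={"controlGroup": "M", "dsLabel": f"DS_{i}", "mimeType": "text/plain"},
        data=b"x",  # nearly empty, plain-text content
        auth=AUTH,
    )
    timings.append(time.time() - start)

# Plot seconds-to-add against datastream number.
fig = go.Figure(go.Scatter(y=timings, mode="lines", name="tiny datastreams"))
fig.update_layout(
    title="Ingest Times for Tiny Datastreams",
    xaxis_title="datastream number",
    yaxis_title="seconds to add",
)
fig.show()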

That’s when I became convinced it was the sheer number of datastreams causing the slowdown, with other tasks and processes amplifying the trend to varying degrees.

With that hypothesis in mind, I did a bit more digging on the Fedora Commons listserv and struck gold. It turns out one should not add that many datastreams to an object in Fedora.

"...to have more than a few dozen datastreams in a content model is very unusual and implies the possibility of useful refactoring..."

"...I don't mean to imply any criticism, but I do wonder about any Fedora-based architecture featuring objects with thousands of datastreams. It can be objectively said that such an architecture is not at all idiomatic."

We have done quite a bit of work with Fedora Commons over the past 3-4 years, with lots of time spent thinking about the pros and cons of various modeling approaches, but we had not encountered good reasoning for why a large number of datastreams on a given object was a bad idea. Yes, you could make the case from an architectural, idiomatic point of view, with tendrils of reasoning rooted in good preservation practice (e.g. migrating the file formats of certain pages but not others). But it seemed you could also make a case for storing all the datastreams in a single object, leveraging RELS-INT for relationships between those datastreams, and counting the reduced complexity as a digital preservation gain. The reasoning against many datastreams per object seemed limited to modeling approach, not code efficiency and performance. At least, I had not stumbled on those factors.

All in all it was a most interesting quest, and we came out victorious. Our v3 model for ebooks aligns more closely with the community – and with PCDM, as we slowly move towards FC4 – and our ebooks have been migrated. Though the elements and clues of the mystery were fascinating (plotly is amazing, by the way), the entire episode has left a more enduring and interesting thought upon the brow: the best practices and approaches for doing something may have a diverse web of reasoning behind them. Too many datastreams per object were frowned upon, but most listservs and literature made it sound like that was for modeling and/or preservation reasons, not a technical limitation on Fedora’s ability to efficiently handle a large number of them. That said, I have to assume Fedora is inefficient with that many datastreams per object precisely because it is not an “idiomatic” use of Fedora.

There is a strange chicken-or-the-egg thing happening here, perhaps better left for another time… a high-level architectural decision has performance implications downstream, but because those performance effects are not the “root” of why that approach is favored, they are not mentioned much in the literature…