While the facade of State Library Victoria stands abiding and unchanging, back-of-house functions more like a busy seaport. Items come and go in a flurry of activity, sometimes by the thousand. In some cases, the items coming into the Library are a complete mystery – they may seem united under their collection name, but in practice, anything could be contained in the boxes that arrive on the Library’s doorstep.
Take for example the Theatre Programme Collection. This collection contains over 37,000 items – posters, programs, fliers, and ephemera from Victoria's rich theatrical history. It's one of the largest collections of its kind in the world, yet much of it remains undiscovered and inaccessible to the public.
Thankfully, cases like this are exactly Briette Bermingham’s specialty. In her role as Description Manager for Published Materials, she organises and oversees the creation of library records for incoming items so that patrons can find and use new material. Usually this process involves the work of many hands, including volunteers who sort through boxes of programs and summarise their contents. While these processes have been an important part of making collections accessible, they still leave question marks in the catalogue when it comes to details about the performances and performers documented in newly processed collections. As the years have passed and more donations have arrived, the backlog of programs needing work has grown, leaving staff and volunteers with the daunting task of manually recording details from each theatre program to make it discoverable online.
But after attending a recent Code Club meeting, Briette had a spark of inspiration that could change this.
Her idea was simple – what if they could capture low-quality images of every program with a phone camera, and use OCR (Optical Character Recognition) to digitally 'read' the text on them? This approach to 'minimum viable digitisation' would be a big departure from the exacting process of capturing archival-quality images for the catalogue, but could massively speed up the process of creating basic records of new collection items. Not only would this process be faster, but it could supply records with more detailed information about each item, making the new material more discoverable by catalogue searchers.
Take the 'Comedy Bites' poster, for example. In the standard descriptions process, it might only be labelled ‘Comedy Poster’, leaving out key details that would surface the item in a more specific search, like ‘Melbourne International Comedy Festival’ or even the name of the event, 'Comedy Bites'. If someone were searching the catalogue for the work of David Astle or Libbi Gorr, this item would be missed entirely – a lost connection in Victoria's cultural history.
Eager to test out her plan, Briette invited Sotirios Alpanis, co-organiser of Code Club, down into the stacks to explore technical options. They quickly agreed that a phone camera and OCR program would work well enough for turning images into text. Instead of using expensive proprietary OCR solutions that often come with usage restrictions, they landed on the open-source OCR engine Tesseract (https://github.com/tesseract-ocr/tesseract) to carry out this work.
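At its simplest, the Tesseract step can be driven from Python via the pytesseract wrapper. The sketch below is an assumption about how such a step might look, not the project's actual code; it presumes pytesseract, Pillow, and the Tesseract binary are installed, and 'program_photo.jpg' is a hypothetical phone capture:

```python
def ocr_program(path: str, lang: str = "eng") -> str:
    """Read text from a photo of a theatre program using Tesseract OCR."""
    # Imported lazily so the module still loads where the OCR
    # dependencies (Pillow, pytesseract, the Tesseract binary) are absent.
    from PIL import Image
    import pytesseract

    # Converting to greyscale often helps with low-quality phone shots.
    image = Image.open(path).convert("L")
    return pytesseract.image_to_string(image, lang=lang)


if __name__ == "__main__":
    # Hypothetical filename for a quick phone capture of one program.
    print(ocr_program("program_photo.jpg"))
```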
[Image: unboxing the theatre programmes]
In this image, you can see the OCR output from the ‘Comedy Bites’ program. While imperfect, it provides vastly more detail than the general ‘comedy poster’ label that might otherwise be affixed to this work. The text is full of valuable information about performers, dates, and the show itself – all data points that could help make this item discoverable to researchers and the public alike. With a reliable-enough text-reading system in hand, Briette and Sotirios turned their attention to a more intricate problem – finding a way to extract the most helpful information from this mixture of text.
Typically, this problem is addressed later on in the cataloguing process, where data, labels and categories are added that help users surface specific information within a collection. This process dates back to the time of the Ancient Sumerians, who used to carve catalogue indexes into clay tablets. While strategies for indexing information have improved since then, creating data that describes a collection remains a slow and delicate job that requires extensive human effort. With 37,000 items waiting for processing and limited staff resources, Sotirios and Briette were determined to devise a faster method.
Thankfully, around this time, Sotirios was beginning to explore Hugging Face – an online community for sharing machine learning models and datasets. For this task, he homed in on a technology called Named Entity Recognition (NER) – a type of natural language processing that sorts text into chosen categories, like names, dates, locations, and organisations, helping users to highlight and use the most important text in a dataset. With the goal of finding a model that could reliably create library-relevant categories and labels, Sotirios set about testing various NER models for accuracy and usefulness. Every model behaves differently, so Sotirios had to trial many systems before landing on Flair's 'ner-english-ontonotes-large' (https://huggingface.co/flair/ner-english-ontonotes-large).
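In Python, loading the Flair model and regrouping its output into label buckets might look like the sketch below – an assumption about the approach, assuming the flair package is installed (the model itself is large and downloads on first use). The `group_entities` helper is a hypothetical convenience, not part of Flair:

```python
from collections import defaultdict


def group_entities(spans):
    """Group (text, label) pairs into label -> unique names, preserving order."""
    grouped = defaultdict(list)
    for text, label in spans:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def tag_text(text: str) -> dict:
    """Run Flair's OntoNotes NER model over OCR text (requires `pip install flair`)."""
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")
    sentence = Sentence(text)
    tagger.predict(sentence)
    return group_entities(
        (span.text, span.get_label("ner").value)
        for span in sentence.get_spans("ner")
    )
```

OntoNotes labels such as PERSON, DATE, ORG, and EVENT map naturally onto the names, dates, and organisations a theatre program mentions.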
[Image: overhead view of the camera setup for photographing theatre programmes]
With accurate OCR and NER models in hand, Sotirios and Briette combined these techniques into a working prototype hosted on Hugging Face (https://huggingface.co/spaces/SLV-LAB/theatre-programmer). Armed with a relatively inexpensive desktop USB camera and a functional pipeline, it was time to share their work with the wider Library team. When demonstrated at a Collections department meeting, the prototype was met with such enthusiasm that some team members immediately adopted the system into their daily workflows.
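To give a sense of how the two stages might feed a draft record, here is a hypothetical sketch of turning grouped entities into catalogue-ready lines; the field names and label mapping are illustrative only, not the Library's actual record schema:

```python
# Illustrative mapping from OntoNotes NER labels to draft record fields.
LABEL_FIELDS = {
    "PERSON": "Performers/contributors",
    "DATE": "Dates",
    "ORG": "Organisations",
    "GPE": "Places",
    "EVENT": "Events",
}


def draft_record(entities: dict) -> str:
    """Render grouped entities as draft catalogue lines for human review."""
    lines = []
    for label, field in LABEL_FIELDS.items():
        values = entities.get(label)
        if values:
            lines.append(f"{field}: {', '.join(values)}")
    return "\n".join(lines)
```

A cataloguer would then verify and correct the draft before anything reaches the public catalogue, in keeping with the human-in-the-loop approach described below.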
While team members must verify and sometimes improve upon the outputs, Sotirios suggests that this is the system working as intended. 'Part of this is about demystifying this technology by showing what it is and isn’t good at. In this case, demonstrating that you would use it, but always verify the output,' he explains.
This human-in-the-loop approach is crucial - any application providing automatically generated content to the public needs human oversight. The project isn't about replacing skilled cataloguers, but rather about freeing them from tedious transcription work so they can apply their expertise where it matters most.
When it comes to the tens of thousands of items in this collection, any efficiency in the process goes a long way towards opening access to the collection. Even more importantly, keeping humans in the loop ensures that this access is guided by hard-won expertise in information management. The goal is to bridge the gap between collection problems and technological solutions, while respecting the irreplaceable value of human judgement.
So where does the project go from here? While the current prototype works well for individual images, the natural next step would be to scale the system to handle multiple images at once. Adopting a serverless architecture would allow the system to dynamically scale to process even very large batches of images into draft finding aids for review. With scale, however, often comes increased cost, and at present the project is waiting for the necessary funding to be able to continue this work.
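The fan-out pattern behind that serverless idea can be sketched locally, with a thread pool standing in for independent cloud workers. This is an assumed illustration of the scaling concept rather than a planned implementation; `process_item` is any per-image function (for instance, OCR followed by NER):

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(paths, process_item, max_workers=8):
    """Apply process_item to every image path concurrently, keeping input order.

    In a serverless deployment each path would instead trigger an
    independent function invocation, so capacity grows with the batch;
    the thread pool here is only a local stand-in for that fan-out.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_item, paths))
```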
Though the Theatre Programmes project has been temporarily paused while the Library assesses its strategic priorities, the prototype has sparked interest in applying similar techniques to other collections. One promising candidate is the Library's oral history collection, which has been digitised but remains largely inaccessible. The same pipeline could be adapted to convert audio to text, then apply named entity recognition to create more access points in catalogue records and finding aids, helping to make these valuable historical accounts discoverable. So while this particular project has been shelved for now, we may soon see this model up and running elsewhere in the Library, opening windows into the collection for all manner of curious minds.
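Adapted for audio, the pipeline keeps the same shape: a speech-to-text step replaces OCR, and the NER stage stays the same. The source does not name a speech model, so the composition below is purely a hypothetical sketch with both steps injected as functions:

```python
def audio_to_entities(path, transcribe, tag):
    """Compose a speech-to-text step with entity tagging.

    `transcribe` (audio path -> text) and `tag` (text -> entities) are
    passed in so any speech model or NER tagger could be slotted in;
    both names are hypothetical, e.g.:
        audio_to_entities("interview.wav", my_transcriber, my_tagger)
    """
    return tag(transcribe(path))
```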
Visit the Hugging Face space to try the prototype for yourself, or see the attached Resources to read more about the tech behind the project.
Resources
| Type | Author(s) | Tags |
|---|---|---|
| tutorial | | |
| codebase | | |