
How can AI assist in cataloguing historic children's books from the 19th century?

Markus Stauffiger

[Image: bielefeld-qualitaetsvergleich-1]

The Colibri Project

Funded by the German Research Foundation (DFG), the Colibri – Corpus Libri et Liberi project is digitising and making accessible 19th‑century children’s and youth literature. The collection contains around 15,000 bibliographic units of German‑language works from 1801 to 1914. Partners include Bielefeld University Library, the Berlin State Library – Prussian Cultural Heritage, the Technical University of Braunschweig Library and the International Youth Library Munich.

The challenge

Although the books have already been scanned, much of their content remains insufficiently catalogued due to staff shortages. Historical layouts, old typefaces and unpredictable chapter structures make manual processing demanding. Library Director Barbara Knorn therefore looked for an innovative approach to make these holdings available more quickly and economically. Together with Artur Nold, Head of Library Technology, she launched a pilot based on modern AI tools.

 

The pilot: AI meets children’s books

Goals of the pilot

  • Automatic recognition of book structures (chapters, sections, and advertisements if applicable)
  • Extraction of tables of contents
  • Identification and description of image elements
  • Generation of METS XML files that can later be integrated into Goobi
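
To make the last goal more concrete, here is a minimal sketch of what a METS logical structure map for one book could look like when built in Python. It is an illustration only, not the project's actual output: the function name, the `LOG_…` IDs and the `TYPE` values (`monograph`, `chapter`, `advertising`) are assumptions, and a real Goobi ruleset defines which values are allowed. The example book and chapter labels are made up.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"  # official METS namespace
ET.register_namespace("mets", METS_NS)


def build_logical_structmap(book_title, chapters):
    """Build a <mets:mets> element with a flat logical structMap.

    chapters: list of (label, div_type) tuples in reading order.
    Hypothetical helper; IDs and TYPE values are illustrative assumptions.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", TYPE="LOGICAL")
    # One top-level div for the book, one child div per recognised unit
    book = ET.SubElement(
        struct_map, f"{{{METS_NS}}}div",
        ID="LOG_0000", TYPE="monograph", LABEL=book_title,
    )
    for index, (label, div_type) in enumerate(chapters, start=1):
        ET.SubElement(
            book, f"{{{METS_NS}}}div",
            ID=f"LOG_{index:04d}", TYPE=div_type, LABEL=label,
        )
    return mets


# Made-up example data, single hierarchy level as in the pilot
example = build_logical_structmap(
    "Beispielbuch",
    [("Erstes Kapitel", "chapter"), ("Anzeigen", "advertising")],
)
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

A file that Goobi can import would also need descriptive metadata, a file section and a physical structure map linked to the page images; the sketch covers only the logical structure.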

Approach

  • Use of large language models (GPT‑4o here) without relying on existing OCR
  • Processing in small blocks (five pages at a time) because an LLM cannot yet handle an entire book in one go
  • Combining classical programming with AI methods to merge and prepare the data
  • Quality assurance by the library team, which still oversees the process and intervenes if necessary
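
The block-wise processing and the subsequent merge step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the pilot's code: `Chapter`, `page_blocks`, `merge_chapters` and `analyse_book` are hypothetical names, and `fake_analyse_block` with its made-up chapter titles merely stands in for the model call that would receive the page images of one block.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class Chapter:
    title: str
    start_page: int  # 1-based page number within the whole book


def page_blocks(num_pages: int, block_size: int = 5) -> Iterator[range]:
    """Yield consecutive 1-based page ranges of at most block_size pages."""
    for start in range(1, num_pages + 1, block_size):
        yield range(start, min(start + block_size, num_pages + 1))


def merge_chapters(per_block: list[list[Chapter]]) -> list[Chapter]:
    """Flatten per-block results, sort by page, drop adjacent repeats."""
    ordered = sorted(
        (chapter for block in per_block for chapter in block),
        key=lambda chapter: chapter.start_page,
    )
    merged: list[Chapter] = []
    for chapter in ordered:
        if merged and merged[-1].title == chapter.title:
            continue  # same heading reported again by a later block
        merged.append(chapter)
    return merged


def analyse_book(
    num_pages: int, analyse_block: Callable[[range], list[Chapter]]
) -> list[Chapter]:
    """Run the (hypothetical) per-block analysis and merge the results."""
    return merge_chapters([analyse_block(pages) for pages in page_blocks(num_pages)])


def fake_analyse_block(pages: range) -> list[Chapter]:
    """Stand-in for the LLM call; returns made-up chapters for the demo."""
    known = {1: "Vorwort", 4: "Erstes Kapitel", 12: "Zweites Kapitel"}
    return [Chapter(title, page) for page, title in known.items() if page in pages]


chapters = analyse_book(12, fake_analyse_block)
for chapter in chapters:
    print(chapter.start_page, chapter.title)
```

Keeping the model call behind a plain function makes the classical part (splitting, sorting, de-duplicating) testable without any AI service, and leaves room for the library team to review the merged result before it is written to METS.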

Special features

Unlike earlier AI approaches, this system requires no specific configuration or training for each book or layout. That is precisely the strength of the new models: they recognize chapters independently of layout and can capture the information semantically.

Challenges & initial findings

  • No continuous quantitative success measurement
    As this was a pilot, no detailed metrics were collected. Results were mostly very good but occasionally left room for improvement.
  • Practical quality comparison
    Instead of digging through complex METS XML files, the team used a dedicated comparison website that allowed direct checks of recognised chapters, images and advertisements.
  • Hierarchy levels
    The test setup focused on books with a single hierarchy level. More complex structures (e.g. multiple parts, subchapters or irregular chapters) still need work and were only touched on in the pilot.
  • Goobi integration
    Direct integration of the generated METS files into Goobi was examined and is technically feasible, but implementation lies beyond this short pilot.

Despite some limitations, the tests already showed high potential: AI can eliminate much manual labour, allowing specialists to focus on content curation and quality assurance.

Another example: because the pilot setup covered only a single hierarchy level, the division ‘First Part’ and its sub-chapters were not recognized as separate levels:

[Image: bielefeld-qualitaetsvergleich-2]

In this example, there had been no time for detailed manual cataloguing, whereas the AI produced the structure with very few errors:

[Image: bielefeld-qualitaetsvergleich-3]

Why this is exciting

  • Relief from routine tasks
    Instead of hours of typing and manual structuring, staff can focus on work that requires deeper expertise – for example quality assurance, historical classification and contextualization.

  • Scalability to thousands of books
    The pilot approach can be scaled to thousands of books with only minor adjustments. The models work patiently and quickly, without extra configuration for every new layout.

  • Building internal AI expertise
    Guided by the Archipanion team, Bielefeld University Library plans to continuously expand its knowledge and capacities around AI systems and to explore on‑premise solutions to maintain data protection and sovereignty.

Humans and AI – a strong team

In a field as multifaceted as historical book collections, specialized expertise remains essential: AI handles repetitive routine work, freeing librarians to spend more time on quality assurance and in‑depth content cataloguing. Those who embrace this opportunity can make old treasures accessible quickly – not by replacing professionals, but by empowering them.
