The project Detecting text reuse in H.C. Andersen’s work is part of a larger publication project, Hans Christian Andersen’s Fairy Tales and Stories – the digital manuscript edition, led by Associate Professor Ane Grum-Schwensen, which aims to digitize and publish all of Hans Christian Andersen’s preserved manuscripts in an online genetic edition. The digital edition is described in more detail at http://andersen.sdu.dk/ms. The updated description is in Danish, but an older English version is available at http://beta.auh.sdu.dk/en/.
In 2019, senior researcher Ejnar Stig Askgaard from Odense City Museums began comparing Hans Christian Andersen’s notes, written between approximately 1833 and 1875, with the 162 fairy tales, novels and autobiographies. This led to the discovery that Hans Christian Andersen used symbols such as cross marks or deletions in his notes to indicate that a note had been reused in his fairy tales.
For Detecting text reuse in H.C. Andersen’s work, Berg wanted to find out where each note had been reused. Earlier research had manually identified where 278 notes had been reused in Hans Christian Andersen’s published work, but this was a time-consuming effort that took many months.
As 861 of the notes had been digitized, along with Hans Christian Andersen’s published work, Berg was able to apply digital methods to the problem. He contacted Zhiru Sun, Assistant Professor at the Department of Design and Communication at SDU, who used natural language processing (NLP) to find similarities between the notes and Hans Christian Andersen’s work. Using the Python application on UCloud, Sun generated a number of tables that indicate how similar a specific note is to a specific fairy tale.
“It only took me around 8 hours to generate these tables and find a good indication of where all the 861 digitalized notes had been reused,” Sun explains.
In the tables above, each note has received a score from -1 to 1 for each fairy tale. The closer the score is to 1, the more similar the note is to the fairy tale; the lower the score, the less similar it is. Note_61, for example, is a shopping list, and its low scores indicate that it is very different from all the fairy tales.
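The article does not say which similarity measure produced these scores. A range from -1 to 1 is consistent with cosine similarity between vector representations of the texts, so the following is a minimal sketch under that assumption, not the method Sun actually used. The vectors, the note and tale names, and the helper functions are all made up for illustration; in practice the vectors would come from some text-embedding step that is not described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


def similarity_table(notes, tales):
    """Build {note_name: {tale_name: score}} for every note/tale pair."""
    return {
        note: {tale: cosine_similarity(nv, tv) for tale, tv in tales.items()}
        for note, nv in notes.items()
    }


def best_match(row):
    """Return the (tale_name, score) pair with the highest score in one row."""
    return max(row.items(), key=lambda item: item[1])


# Hypothetical 3-dimensional vectors standing in for real text representations.
notes = {"note_a": [0.9, 0.1, 0.0], "note_b": [-0.5, -0.4, 0.1]}
tales = {"tale_x": [1.0, 0.2, 0.0], "tale_y": [0.0, 0.3, 0.9]}

table = similarity_table(notes, tales)
for note, row in table.items():
    tale, score = best_match(row)
    print(f"{note}: closest to {tale} (score {score:.2f})")
```

Read this way, a note whose scores are low against every tale, like the shopping list mentioned above, would simply have no strong best match.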
You can read a more detailed interview with Zhiru Sun here.