A Tale of Two Transcriptions. Machine-Assisted Transcription of Historical Sources

This article explains how two projects implement semi-automated transcription routines: for census sheets in Norway and for marriage protocols from Barcelona. The Spanish system was created to transcribe the marriage license books from 1451 to 1905 for the Barcelona area, one of the world’s longest seri...

Bibliographic Details
Main Authors: Gunnar Thorvaldsen, Joana Maria Pujadas-Mora, Trygve Andersen, Line Eikvil, Josep Lladós, Alicia Fornés, Anna Cabré
Format: Article
Language: English
Published: International Institute of Social History 2015-01-01
Series: Historical Life Course Studies
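Subjects: Nominative Sources, Census, Vital Records, Computer Vision, Optical Character Recognition, Word Spotting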
Online Access: http://hdl.handle.net/10622/23526343-2015-0001?locatt=view:master
author Gunnar Thorvaldsen
Joana Maria Pujadas-Mora
Trygve Andersen
Line Eikvil
Josep Lladós
Alicia Fornés
Anna Cabré
collection DOAJ
description This article explains how two projects implement semi-automated transcription routines: for census sheets in Norway and for marriage protocols from Barcelona. The Spanish system was created to transcribe the marriage license books from 1451 to 1905 for the Barcelona area, one of the world’s longest series of preserved vital records. In the project “Five Centuries of Marriages” (5CofM) at the Autonomous University of Barcelona’s Center for Demographic Studies, the Barcelona Historical Marriage Database has thus been built. More than 600,000 records were transcribed by 150 transcribers working online. The Norwegian material is cross-sectional: it is the 1891 census, recorded on one sheet per person. This format, together with the underlining of keywords for several variables, made it more feasible to semi-automate data entry than when many persons are listed on the same page. While Optical Character Recognition (OCR) for printed text is scientifically mature, computer vision research is now focused on more difficult problems such as handwriting recognition. In the marriage project, document analysis methods have been proposed to automatically recognize the marriage licenses. Fully automatic recognition is still a challenge, but some promising results have been obtained. In Spain, Norway and elsewhere the source material is available as scanned images on the Internet, opening up the possibility of further international cooperation on automating the transcription of historical source materials. As in projects to digitize printed materials, the optimal solution for handwritten sources is likely to be a combination of manual transcription and machine-assisted recognition.
format Article
id doaj-art-73d8cd633a7f43aa8580b689ad5aa5cb
institution Kabale University
issn 2352-6343
language English
publishDate 2015-01-01
publisher International Institute of Social History
record_format Article
series Historical Life Course Studies
spelling doaj-art-73d8cd633a7f43aa8580b689ad5aa5cb
2025-02-02T21:33:59Z
eng
International Institute of Social History
Historical Life Course Studies
2352-6343
2015-01-01
2119
A Tale of Two Transcriptions. Machine-Assisted Transcription of Historical Sources
Gunnar Thorvaldsen (Norwegian Historical Data Centre, University of Tromsø)
Joana Maria Pujadas-Mora (Centre for Demographic Studies, Autonomous University of Barcelona)
Trygve Andersen (Norwegian Historical Data Centre, University of Tromsø)
Line Eikvil (Norwegian Computing Center, Oslo)
Josep Lladós (Autonomous University of Barcelona)
Alicia Fornés (Autonomous University of Barcelona)
Anna Cabré (Autonomous University of Barcelona)
http://hdl.handle.net/10622/23526343-2015-0001?locatt=view:master
Nominative Sources
Census
Vital Records
Computer Vision
Optical Character Recognition
Word Spotting
title A Tale of Two Transcriptions. Machine-Assisted Transcription of Historical Sources
topic Nominative Sources
Census
Vital Records
Computer Vision
Optical Character Recognition
Word Spotting
url http://hdl.handle.net/10622/23526343-2015-0001?locatt=view:master