The Arquivo Geral da Cidade do Rio de Janeiro, headquartered on Rua Amoroso Lima in the Cidade Nova neighbourhood, is sitting on an estimated 40,000 duplicate digital image files accumulated over roughly 15 years of stop-start digitisation efforts. The problem did not arrive overnight. It is the documented residue of at least four separate scanning drives, each launched under a different municipal administration, each using different file-naming conventions, and none talking cleanly to the others.
Why does this matter right now? The Prefeitura do Rio has been pushing hard since 2024 to complete a unified open-data portal — the Dados Abertos Rio platform — and the image-duplication backlog is one of the concrete obstacles preventing the archive from delivering a clean, searchable visual record of the city to the public. Every redundant file consumes server space, muddies search results, and adds labour costs every time a cataloguer has to manually verify whether two nearly identical .tif files are truly the same scan or represent different prints of the same photograph.
A Trail of Competing Projects
The roots of the problem go back to at least 2009, when the then-Secretaria Municipal de Cultura launched its first large-scale effort to digitise the archive's photographic collection, which spans everything from 19th-century portraits taken in the Santa Teresa neighbourhood to aerial survey photographs of the Baixada Fluminense from the 1970s. That initial project used a proprietary metadata schema that was never fully documented.
A second, larger digitisation drive began in 2014 ahead of the World Cup and Olympics, when federal and municipal money briefly flowed toward cultural heritage projects. Scanning teams worked across multiple sites, including the Biblioteca Nacional on Avenida Rio Branco and partner institutions in Madureira, and delivered files in a different format — largely .jpg rather than archival .tif — often without cross-referencing what had already been done five years earlier.
The decisive complication came in August 2019, when the archive's primary server infrastructure was migrated to a new data centre managed by the Empresa Municipal de Informática, known as Iplanrio. According to internal technical documentation cited in a 2023 audit report by the Controladoria Geral do Município, the migration process created automatic backup copies of approximately 28,000 image folders before the old directory structure was decommissioned. Many of those backup copies were never deleted. The audit estimated the total redundant storage load at roughly 1.2 terabytes — modest by commercial standards, but significant for a municipal archive operating on a constrained IT budget.
What the Replacement Process Actually Involves
Replacing or merging duplicate images is not as simple as running a hash-matching algorithm and deleting files. Archivists at the Arquivo Geral have argued internally — and correctly — that two scans of the same physical photograph can carry different metadata, different resolution levels, and different conservation timestamps, meaning the lower-quality duplicate sometimes holds contextual information the higher-quality version lost. That argument has slowed automated culling proposals at least twice since 2021.
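The archivists' concern can be illustrated with a minimal Python sketch. It assumes each scan has a metadata record stored as a plain dictionary; the field names (`resolution_dpi`, `scanned_on`, `conservation_note`) and file names are hypothetical and are not taken from the archive's actual schema. The helper lists what would be lost if the lower-quality file were simply deleted.

```python
def fields_lost_on_delete(keep, drop):
    """Return metadata present in `drop` but missing or empty in `keep`."""
    empty = (None, "")
    return {
        field: value
        for field, value in drop.items()
        if value not in empty and keep.get(field) in empty
    }


# Hypothetical records for two scans of the same physical photograph.
high_quality = {
    "file": "foto_0412.tif",
    "resolution_dpi": 600,
    "scanned_on": "2014-05-02",
    "conservation_note": "",
}
low_quality = {
    "file": "foto_0412_old.jpg",
    "resolution_dpi": 150,
    "scanned_on": "2009-08-17",
    "conservation_note": "Tear in lower left corner before restoration",
}

lost = fields_lost_on_delete(keep=high_quality, drop=low_quality)
print(lost)
```

In this invented example the older, lower-resolution scan is the only one carrying the conservation note, which is the kind of context a purely automated cull would discard.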
The current approach, approved by the archive's technical council in March 2026, is a hybrid: automated perceptual-hash comparison will flag probable duplicates, but a team of three trained archivists will review each flagged pair before any file is permanently removed or merged. The process is expected to run through the end of 2026 and to finish before the next major public release of the Dados Abertos Rio portal, scheduled for January 2027.
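The flag-then-review idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the archive's actual tooling: it uses a simple "average hash" as a stand-in for whichever perceptual hash the archive adopts, it works on tiny grayscale grids instead of decoded image files, and the file names and the `max_distance` threshold are hypothetical. Nothing is deleted; close pairs are only queued for human review.

```python
from itertools import combinations


def average_hash(pixels):
    """Hash a grayscale grid: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def flag_probable_duplicates(images, max_distance=2):
    """Return (name_a, name_b, distance) tuples for archivists to review."""
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    review_queue = []
    for name_a, name_b in combinations(sorted(hashes), 2):
        distance = hamming_distance(hashes[name_a], hashes[name_b])
        if distance <= max_distance:
            review_queue.append((name_a, name_b, distance))
    return review_queue


# Invented 4x4 grids standing in for downscaled scans.
images = {
    "scan_2009.tif": [[10, 10, 200, 200]] * 4,
    "scan_2014.jpg": [
        [12, 9, 198, 205],
        [11, 10, 201, 199],
        [9, 12, 202, 200],
        [10, 11, 199, 201],
    ],
    "other.tif": [[200] * 4, [200] * 4, [10] * 4, [10] * 4],
}

review_queue = flag_probable_duplicates(images)
print(review_queue)
```

The two scans differ slightly pixel by pixel, so a byte-level checksum would treat them as unrelated, but their perceptual hashes match and the pair lands in the review queue. The unrelated image sits well outside the threshold and is not flagged.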
For researchers, journalists, and community groups in neighbourhoods like Lapa, Saúde, and the Zona Norte who rely on the archive's photographic record to support urban heritage claims and planning disputes, the practical advice is straightforward: file any formal image requests through the archive's official Protocolo Geral system now, and note whether you need archival-quality .tif versions. The deduplication process could alter file identifiers and disrupt existing reference numbers. The archive's public reading room on Rua Amoroso Lima is open Tuesday through Friday, 9 a.m. to 5 p.m.