We hold a large collection of 19th- and 20th-century newspaper data in machine-readable formats, which we can provide to University of Edinburgh researchers for text and data mining. The collection focuses on English-language newspapers from around the world, notably the UK, US, Australia and New Zealand.

Please get in touch if you would like more details and to arrange access. 



CDCS DataShare

We use our DataShare space to share datasets produced by CDCS projects. It currently contains a number of interesting LiDAR datasets generated by ECA students and our colleagues at uCreate Studio, in the University of Edinburgh Library.

Go to CDCS DataShare

Further Afield


There is a wealth of further data available to our community, locally and online, including many collections made available by the University of Edinburgh. This page lists examples with particular relevance to the arts, humanities and social sciences. We can also support you in identifying and accessing other data appropriate for your research. Drop us a line if you need some help.




Edinburgh International Data Facility

The University of Edinburgh also hosts the Edinburgh International Data Facility, which provides support for large-scale data processing. It is a collection of computational, data management and safe haven services supported by the Data Driven Innovation Programme of the Edinburgh and South-East Scotland City Region Deal.

Edinburgh University Resources

CAHSS DataShare

Other communities across the College of Arts, Humanities and Social Sciences also make their data available within DataShare. Subjects include Economics, Law, Business and Psychology, as well as humanities data, for example the Annotated Reference Corpus of Scottish Gaelic (ARCOSG) and Scottish History (including the Survey of Scottish Witchcraft, 1563-1736).

Angus McIntosh Centre

The Angus McIntosh Centre for Historical Linguistics takes an empirical approach to linguistic variation across time and space and develops corpora and visualisation tools for historical linguistics (see their Projects Hub). Much of their work is centred around the history of English and Scots, but the Centre is also interested in historical linguistics and language change independent of specific temporal, geographical or genetic affiliations.

Collections as Data

The University of Edinburgh Centre for Research Collections is making full collections available as downloadable datasets.

These currently include the Statistical Accounts of Scotland and the Digitised Theses Collection.

Other datasets are freely available and include the University's Musical Instrument Collection and Art Collection.



Other Resources

British Library Data

British Library Data (including Books and Newspaper collections) offers numerous rich digital resources, featuring Books Collections (including Books from the 19th Century), Maps and the British Newspaper Archive.

Edinburgh University Library has negotiated a licence from Gale Cengage for access to British Library Newspapers Part IV: 1732-1950. Contact us for access.





CESSDA (Consortium of European Social Science Data Archives) provides large-scale, integrated and sustainable data services to the social sciences. It brings together social science data archives across Europe, with the aim of promoting the results of social science research and supporting national and international research and cooperation.


CLARIN (Common Language Resources and Technology Infrastructure) makes digital language resources available to scholars, researchers, students and citizen-scientists from all disciplines, especially in the humanities and social sciences, through single sign-on access.


CLOSER (Cohort and Longitudinal Studies Enhancement Resources) is the home of longitudinal research and brings together eight world-leading longitudinal studies with participants born throughout the 20th and 21st centuries.

Growing Up in Scotland is a longitudinal research study tracking the lives of thousands of children and their families from the early years, through childhood and beyond.

The Scottish Longitudinal Study is a large-scale linkage study created using data from administrative and statistical sources.


Since 2005, DANS (Data Archiving and Networked Services) has been supporting researchers, data professionals, other data archives, research institutions and research financiers with questions in the field of data management, certification and topics such as FAIR, open access and software sustainability.


DARIAH (Digital Research Infrastructure for Arts and Humanities) is a network of people, expertise, information, knowledge, content, methods, tools and technologies from its member countries. It develops, maintains and operates an infrastructure in support of ICT-based research practices and sustains researchers in using them to build, analyse and interpret digital resources.


The EEBO-TCP corpus consists of the works represented in the Early English Books Online collections known as Short Title Catalogues I and II (based on the Pollard & Redgrave and Wing short title catalogs respectively), as well as the Thomason Tracts and the Early English Books Tract Supplement collections. Together these trace the history of English thought from the first book printed in English in 1475 through to 1700. The books in these collections include works of literature, philosophy, politics, religion, geography, history, mathematics, music, the practical arts, natural science, and all other areas of human endeavor.


E-RIHS is the European Research Infrastructure for Heritage Science, which supports research on heritage interpretation, preservation, documentation and management. The E-RIHS mission is to deliver integrated access to expertise, data and technologies through a standardized approach, and to integrate world-leading European facilities into an organisation with a clear identity and a strong cohesive role within the global heritage science community.


The European Social Survey (ESS) is an academically driven cross-national survey that has been conducted across Europe since its establishment in 2001. The survey measures the attitudes, beliefs and behaviour patterns of diverse populations in more than thirty nations.


Europeana provides access to millions of books, music and artworks from European libraries, museums and archives. Collections include Archaeology, Manuscripts, Maps & Geography, and Newspapers.  




HathiTrust

The HathiTrust Research Center has developed a suite of tools and services for text data mining, including web-based algorithms, freely accessible datasets, and secure computing capsules.

They also provide extracted datasets, including Extracted Features 2.0, which offers volume- and page-level data for 17+ million volumes in the HathiTrust Digital Library. The data include bibliographic metadata, computationally inferred metadata about the page, and tokens (words), parts of speech, and their per-page counts. The dataset represents more than 6 billion pages of text from the digital library and includes nearly 3 trillion tokens from the corpus.
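As a rough illustration of what page-level token data makes possible, the sketch below sums per-page token and part-of-speech counts into volume-level word counts. The record is a small hand-made stand-in, not a real Extracted Features file, and the field names (`pages`, `seq`, `tokenPosCount`) are assumptions chosen for the example; consult the HathiTrust Research Center documentation for the actual schema before working with the dataset.

```python
from collections import Counter

# Hypothetical, simplified stand-in for one volume's extracted-features record.
# Field names ("pages", "seq", "tokenPosCount") are assumptions for illustration.
volume = {
    "metadata": {"title": "Example volume"},
    "pages": [
        {"seq": 1, "tokenPosCount": {"the": {"DT": 3}, "mine": {"NN": 1, "PRP$": 1}}},
        {"seq": 2, "tokenPosCount": {"The": {"DT": 2}, "coal": {"NN": 4}}},
    ],
}


def volume_token_counts(vol):
    """Sum per-page token counts into volume-level counts, ignoring POS tags."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            # Fold case so "The" and "the" are counted together.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(volume)
print(counts.most_common(3))
```

Because only counts are involved, this kind of analysis can be carried out without access to the full text of a volume.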


The International Social Survey Programme is a cross-national collaboration programme conducting annual surveys on diverse topics relevant to social sciences.

Jisc Medical Heritage

The Jisc Medical Heritage Library is a collection of over 66,000 digitised European medical publications from the nineteenth century.  Ten UK partner libraries have selected and digitised books and other content for the Medical Heritage Library project. All the digitised content can be accessed from the Historical Texts portal.

CDCS holds XML files of the whole library. If you would like to access these files, please contact us.

National Library of Scotland Data Foundry

The Data Foundry is home to the data collections of the National Library of Scotland, which holds the world’s largest collection of Scottish published material. It presents collections in machine-readable form: digitised collections (text and images); metadata collections; map data; and organisational data. The Library's Digital Scholarship Service updates and adds to the data collections on a regular basis.





OpenGLAM (Galleries, Libraries, Archives & Museums) is a network that supports exchange and collaboration between cultural institutions that support open access to their collections. It is an initiative and working group of the Open Knowledge Foundation (OKFN), currently known as Open Knowledge International, and was co-funded by the European Commission. 

PA-X Peace Agreements Database

The PA-X Peace Agreements Database is a database and repository of peace agreements from 1990 onwards, current up to 1 January 2020. It provides a comprehensive dataset of peace agreements from 1990 to the end of 2019, capable of underpinning both quantitative and qualitative research.


PARTHENOS (Pooling Activities, Resources and Tools for Heritage E-Research networking) aims to strengthen the cohesion of research in the broad sector of Linguistic Studies, Humanities, Cultural Heritage, History, Archaeology and related fields through a thematic cluster of European Research Infrastructures, integrating initiatives, e-infrastructures and other world-class infrastructures, and building bridges between different, although tightly interrelated, fields.


The Rijksmuseum makes available extensive descriptions of more than half a million art historical objects, hundreds of thousands of object photographs and the complete library catalogue to developers, researchers and enthusiasts.





Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) aims to equip Europe’s social science data infrastructures to play a major role in addressing the key societal challenges facing Europe today and to ensure that national and European policymaking is built on a solid base of the highest-quality socio-economic evidence.


The Survey of Health, Ageing and Retirement in Europe (SHARE) is a multidisciplinary and cross-national panel database of micro data on health, socio-economic status and social and family networks of about 140,000 individuals aged 50 or older (around 380,000 interviews). SHARE covers 27 European countries and Israel.


SSHOC (Social Sciences and Humanities Open Cloud) provides training, advice and educational resources which enable data producers, data users and data professionals to gain maximum benefit from the SSH area of the EOSC (European Open Science Cloud).

UK Data Service

The UK Data Service holds the UK’s largest collection of social, economic and population data resources.


Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. Wikidata also provides support to many other sites and services beyond just Wikimedia projects! The content of Wikidata is available under a free license, exported using standard formats, and can be interlinked to other open data sets on the linked data web.