Call for Text Mining Projects

Graphic mashup of medieval drawing

The Centre for Data, Culture & Society is looking for humanities and social science researchers who can ask complex questions of large-scale data sets.

This call is open to research active staff and PhD students within CAHSS at the University of Edinburgh. 

All project activities can be conducted remotely.

The deadline for expressions of interest closed on Friday 9th April, 2021.

If you would like to find out more information, please email cdcs@ed.ac.uk.

Introducing defoe

Over the past three decades, large-scale digitisation has been transforming the collections of libraries, archives, and museums. The volume and quality of available digitised text now make searching and linking these data feasible, and new High-Performance Computing (HPC) environments and analytic frameworks enable researchers to mine large-scale digital collections effectively. The Centre for Data, Culture & Society has a text and data mining facility at the University of Edinburgh called defoe, for interrogating large and heterogeneous text-based archives. defoe uses the power of analytic frameworks such as Apache Spark, together with Jupyter Notebooks and HPC environments, to manipulate and mine huge digitised archives in parallel, at great speed, from a simple command line.

We want to support researchers in using defoe. You don't need programming skills - we have an EPCC programmer on board who will help - you just need to use your own subject expertise to pose interesting questions and explore the data. If we don't have what you're looking for, you can also propose a new data set in line with your research interests.

We invited expressions of interest - the deadline closed on Friday 9th April, 2021.

We will then invite successful candidates to an online sandpit event, an opportunity to collaboratively explore ideas in more depth, where more information on defoe and the data available will be shared.

Text mining and CDCS

Introductory talk from Prof. Melissa Terras on defoe and the text mining sandpit.

Recorded on 7th May, 2020


Data Sets

New This Year

  • STATISTICAL ACCOUNTS OF SCOTLAND (1791 - 1845)

An unrivalled portrait of life in Scotland's parishes during the Agricultural and Industrial Revolutions.

  • UNIVERSITY OF EDINBURGH DIGITISED PHD THESES COLLECTION

All the theses in the Edinburgh Research Archive, dating back to the 17th century. 

  • BRITISH LIBRARY NEWSPAPERS

1TB of digitised British newspapers from the 18th Century to the early 20th Century.

  • TIMES DIGITAL ARCHIVE

All the articles in 69,699 volumes of The Times newspaper between 1785 and 2009.

  • PAPERS PAST: NEW ZEALAND AND PACIFIC NEWSPAPERS

Over 5 million pages of New Zealand and Pacific newspapers from the 19th and 20th Centuries.

  • GAZETTEERS OF SCOTLAND

20 volumes of the most popular 19th Century gazetteers of Scotland, including detailed historical and geographic information about each place.

  • ENCYCLOPAEDIA BRITANNICA 1768 - 1860

The first 8 editions of the Encyclopaedia Britannica, issued from 1768 to 1860, comprising a total of 143 volumes: 155,388 pages and 166m words.

  • BRITISH LIBRARY BOOKS

Over 68,000 books from the 16th to the 19th century, covering geography, philosophy, history, poetry and literature in a variety of languages.


If we don't have what you're looking for in terms of data sets, you can propose a new data set for use with defoe.

 


Key Dates

  • Expressions of Interest: 5pm, Friday 9th April
  • Invitations to Sandpit: Friday 16th April
  • Sandpit: 3-5pm, Tuesday 27th April
  • Project delivery: May - July

More information on defoe:

EPCC - Mining digital historical textual data

defoe: a Spark Based Toolbox (FULL PAPER)