CDCS Text Mining Lab

The Centre for Data, Culture & Society is looking for humanities and social science researchers who can ask complex questions of large-scale data sets.

Deadline for expressions of interest: April 20th - Applications are now closed.

This call is open to research-active staff and PhD students within CAHSS at the University of Edinburgh.

**Please note that we envisage being able to conduct this activity remotely.**

Introducing defoe

Over the past three decades, large-scale digitisation has been transforming the collections of libraries, archives, and museums. The volume and quality of available digitised text now make searching and linking these data feasible, where previous attempts were restricted by limited data availability, poor quality, and a lack of shared infrastructure. There is a hunger for large-scale text mining facilities in the humanities community, while commercial providers allow only limited access to their own digitised collections. However, there are barriers to querying, at scale, the wealth of newspapers and books that now exist in digitised, openly licensed form, which would allow humanists to be in control of their text mining research. One major barrier is that the humanities community has limited capacity and/or skills to use High-Performance Computing (HPC) environments and analytic frameworks to create applications to mine large-scale digital collections effectively.

The Centre for Data, Culture & Society is keen to remove some of these obstacles and help humanities researchers undertake complex analysis of datasets at scale.

We have built a text and data mining facility at the University of Edinburgh called defoe, for interrogating large and heterogeneous text-based archives (see below for a full list of datasets). defoe uses analytic frameworks and tools such as Apache Spark and Jupyter Notebooks, together with High-Performance Computing (HPC) environments, to manipulate and mine huge digitised archives in parallel, at great speed, from a simple command line. We can visualise text, and also assist with linguistic or semantic analysis.

We want to support researchers in using defoe, and we want their help in exploring how we might develop the system further. You don't need programming skills: we have an EPCC programmer on board who will help. You just need to use your own subject expertise to pose interesting questions and explore the data. You can also propose a new data set in line with your research interests if we don't have what you're looking for.

We will then invite successful candidates to a sandpit event, an opportunity to collaboratively explore ideas in more depth, where more information on defoe and the data available will be shared.

More about defoe:

EPCC - Mining digital historical textual data

defoe: a Spark Based Toolbox (FULL PAPER)

Data Sets

  • BRITISH LIBRARY BOOKS

Over 68,000 books from the 16th to the 19th century, covering geography, philosophy, history, poetry and literature in a variety of languages.

  • BRITISH LIBRARY NEWSPAPERS

1TB of digitised British newspapers from the 18th to the early 20th century.

  • TIMES DIGITAL ARCHIVE

All the articles in 69,699 volumes of The Times newspaper between 1785 and 2009.

  • PAPERS PAST: NEW ZEALAND AND PACIFIC NEWSPAPERS

Over 5 million pages of New Zealand and Pacific newspapers from the 19th and 20th centuries.

  • GAZETTEERS OF SCOTLAND

20 volumes of the most popular 19th-century gazetteers of Scotland, including detailed historical and geographic information about each place.

  • ENCYCLOPAEDIA BRITANNICA 1768 - 1860

The first 8 editions of the Encyclopaedia Britannica, issued from 1768 to 1860, comprising a total of 143 volumes, 155,388 pages and 166m words.

Coming soon:

  • JISC MEDICAL HERITAGE LIBRARY
  • HANSARD ARCHIVE
  • STATISTICAL ACCOUNTS OF SCOTLAND
  • DIGITISED THESES FROM EDINBURGH UNIVERSITY LIBRARY

If we don't have what you're looking for in terms of data sets, you can propose a new data set for use with defoe.

KEY DATES

  • Expressions of Interest: 20th April
  • Invitations to Sandpit: 29th April
  • Sandpit: 7th May
  • Finalists Announced: 18th May
  • Project Stage 1: Working with defoe: May - July
  • Project Stage 2: Analysing and visualising data: July - September
  • Projects Showcase: September/October

APPLICATIONS ARE NOW CLOSED

Do you have questions or need more information?

Get in touch at cdcs@ed.ac.uk