Over the past three decades, large-scale digitisation has been transforming the collections of libraries, archives, and museums. The volume and quality of available digitised text now make searching and linking these data feasible, where previous attempts were restricted by limited data availability and quality and by a lack of shared infrastructure. There is a hunger in the humanities community for large-scale text mining facilities, while commercial providers allow only limited access to their own digitised collections. However, there are barriers to querying, at scale, the wealth of newspapers and books that now exist in digitised, openly licensed form, which would put humanists in control of their own text mining research. One major barrier is that the humanities community has limited capacity and skills to use High-Performance Computing (HPC) environments and analytic frameworks to build applications that mine large-scale digital collections effectively.
The Centre for Data, Culture & Society is keen to remove some of these obstacles and to help humanities researchers undertake complex analysis of datasets at scale.
We have built a text and data mining facility at the University of Edinburgh called defoe, for interrogating large and heterogeneous text-based archives (see below for a full list of datasets). defoe uses analytic frameworks such as Apache Spark and Jupyter Notebooks, together with High-Performance Computing environments, to manipulate and mine huge digitised archives in parallel and at great speed using a simple command line. We can visualise text and also assist with linguistic or semantic analysis.
We want to support researchers in using defoe, and we would like their help in exploring how we might develop the system further. You don't need programming skills: we have an EPCC programmer on board who will help. You just need to use your own subject expertise to pose interesting questions and explore the data. If we don't have what you're looking for, you can also propose a new dataset in line with your research interests.
We will then invite successful candidates to a sandpit event, an opportunity to explore ideas collaboratively and in more depth, where we will share more information on defoe and the available data.
More about defoe: