It's common practice for trainers to use small, pre-cleaned and simplified datasets in courses and workshops. While working with clean data limits cognitive load, it can create a disconnect from real-world data, making it difficult for learners to apply what they have learned to actual research projects.
We're interested in how we can support the connection between research and technical skills, and in how we can equip our researchers with the skills they'll need to work with real-world data in solving problems. We also aim to reuse and showcase the University's rich data collections.
Over the past year, we collaborated with colleagues from the Library to gather, process and prepare two of these collections for use in training and research. We focused primarily on the Statistical Account of Scotland and the Digitised PhD Theses datasets.
The Statistical Account of Scotland dataset comprises 29,083 text files derived from transcriptions of the Old and New Statistical Accounts of Scotland. The Digitised PhD Theses dataset includes metadata for approximately 25,000 PhD theses defended at the University of Edinburgh. We transformed the two raw datasets into tabular formats, making them suitable for both teaching and research. This dataset was used to teach Text Analysis at our Summer School and during training events across the year, enabling trainers to showcase a range of methods, from data wrangling and data visualisation to text analysis and topic modelling.
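As an illustration of what turning a folder of text files into a tabular format can look like, here is a minimal Python sketch. It is not the pipeline we used: the function names, the column names (`file_name`, `word_count`, `text`) and the one-row-per-file layout are all assumptions made for the example.

```python
import csv
from pathlib import Path

# Hypothetical column layout: one row per text file.
COLUMNS = ["file_name", "word_count", "text"]


def text_files_to_rows(folder):
    """Return one dict per .txt file in `folder`, sorted by file name."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        rows.append({
            "file_name": path.name,
            # Whitespace-separated tokens, a rough word count only.
            "word_count": len(text.split()),
            "text": text,
        })
    return rows


def write_table(rows, csv_path):
    """Write the rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_table(text_files_to_rows("some_folder"), "table.csv")`, where both paths are placeholders, would produce a single CSV file that can be loaded into a spreadsheet or a data frame for wrangling, visualisation and text analysis.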