Text & Data Analysis in the Wild

10-14 June 2024 

This course will equip attendees with knowledge of all the main techniques required to conduct data and text analysis, providing them with the basics of web scraping, text analysis, topic modelling, sentiment analysis, data wrangling, data analysis, inferential statistics, cluster analysis, and data visualisation. 

Hands-on practical exercises are combined with theory, alongside presentations on the real-world applications of the techniques that will be taught. Participants will also have an opportunity to bring their own datasets and research questions to the course, to apply and tailor what they are learning to their own projects.

Overview

This course is designed to help researchers understand how data and text analysis projects are performed in a research environment. It starts with the identification of a series of research questions connected to this year’s core topic. Throughout the week we are going to see how life in Scotland changed across the centuries and how the history of Scotland is perceived. We are going to explore how computational methods can be used to obtain, clean, and analyse structured and unstructured datasets to answer those questions. We are going to look at a combination of the Statistical Accounts of Scotland (1791-1845), historical forums, and National Records of Scotland data.

Key Principles: 

  • General knowledge of the interface of RStudio and coding is required. Materials will be released in advance to support attendees in refreshing their R skills in preparation. 

  • Fosters interdisciplinary thinking by bringing social science and humanities researchers together to explore methods.

  • Covers both structured and unstructured data.

  • Illustrates the development process of data-led projects by moving through phases of the project lifecycle.

  • Challenge-led, helping researchers learn how to deal with real-world data.

RStudio Refresher

Although the Text and Data Analysis in the Wild stream will provide attendees with many new digital skills, a level of prior knowledge is necessary to get the most out of our training. Try our short quiz to test your knowledge of R, covering some of the basics you should be familiar with before the Summer School. 

Take the quiz

If you didn't do as well as you'd hoped, don't worry! In the video on the right, Training Manager Dr Lucia Michelin goes through a refresher of some of the core R tools and techniques. The Summer School also offers Stream 1: A Gentle Introduction to Coding, which provides a comprehensive guide through the basics of coding and programming for research. 

Daily Schedule

Monday

Registration and Welcome

Morning Seminar:

Dr Jessica Witte 

Title: "As an AI language model...": Reading & Writing with GPT-3.5

Generative AI tools like ChatGPT have expanded the possibilities for computational text analysis with their ability to quickly write content while interacting in a humanlike manner. Yet in many ways, generative AI chatbots--and the large language models that power them--function as digital "black boxes" developed through corporate-secret training protocols and pipelines. Informed by her work developing HAZEL, a pilot generative AI chatbot designed to assist authors of guidance literature published by Historic England, Jessica will discuss what AI chatbots' glitches and hallucinations can tell us about how they function and, more broadly, what this tells us about how computers "read" and "write." We will also consider how we, as researchers, can leverage these novel technologies for text analysis in our own work.

Morning Sessions:

An introduction to the datasets and research questions: The origin and types of data used in the summer school will be discussed, alongside the research questions we will be exploring and answering throughout the week.

Data wrangling: We are going to look at the raw data for the week and reflect on what kind of cleaning they will require and how these cleaning steps can be performed programmatically via coding to speed up the process.

Afternoon Sessions:

Data wrangling: We will begin to examine structured and unstructured data and the ways to enhance and clean a dataset. We will cover common pitfalls, such as dealing with NULL values, and common tasks, such as adding rows, columns, and data. We will also see how regular expressions (regex) can be used to programmatically extract information from unstructured data.
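As a rough illustration of the regex extraction and missing-value handling described above, here is a minimal sketch. The course itself works in R; Python is used here purely for illustration. The records, parish names, figures, and the `extract` helper are all invented and are not course material.

```python
import re

# Invented free-text records; None stands in for a missing (NULL) entry.
records = [
    "Parish of Northfield, population 1,250 in 1791",
    None,
    "Parish of Eastbrook, population 980 in 1793",
    "No figures were returned for this parish",
]

# Named groups capture the parish name, the population figure and the year.
PATTERN = re.compile(
    r"Parish of (?P<parish>[A-Z][a-z]+), "
    r"population (?P<population>[\d,]+) in (?P<year>\d{4})"
)


def extract(record):
    """Return a dict of extracted fields, or None if the record is missing or does not match."""
    if record is None:
        return None
    match = PATTERN.search(record)
    if match is None:
        return None
    return {
        "parish": match.group("parish"),
        # Strip the thousands separator before converting to an integer.
        "population": int(match.group("population").replace(",", "")),
        "year": int(match.group("year")),
    }


# Keep only the records that yielded structured data.
rows = [row for row in (extract(r) for r in records) if row is not None]
```

Missing and non-matching records are dropped here for simplicity; in a real project you may prefer to keep them and flag them for manual review.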

Keynote Lecture:

Prof. Melissa Terras

Title: How do the Humanities Keep Up with AI? Opportunities and Issues for Research

This talk explores the evolving dynamics between AI technologies and the humanities, asking how traditional fields like literature and history can integrate AI to enhance scholarly research and cultural understanding. It will discuss existing and potential methodologies, interdisciplinary collaborations, and the critical role humanistic inquiry plays in guiding the ethical development and application of artificial intelligence.

Melissa Terras is Professor of Digital Cultural Heritage within Design Informatics at the University of Edinburgh, UK. She is Director of Creative Informatics, the Edinburgh-based AHRC Creative Cluster (2018-2024) supporting innovation in creative and cultural contexts, and a founding Director of Transkribus, the AI-powered platform for text recognition of historical documents.

 

Tuesday

Morning Seminar:

Ash Charlton

Title: Expect the unexpected: text analysis as an exploratory tool to examine legacies of race and slavery in the early Encyclopaedia Britannica (1768-1860)

Text analysis enables new ways of viewing historical documents as data – but how do you navigate these texts and effectively explore them using digital methodologies? Ash will discuss findings from her research into legacies of slavery in the early Encyclopaedia Britannica, sharing what you can achieve with text analysis, as well as some of the challenges of working with digital historical texts and methods.

Morning Sessions:

An introduction to text analysis: What is text analysis? When and why might we use it? We will also look at where to find textual data, how to retrieve it efficiently and ethically, and how to structure it so that it can be used for quantitative analysis.

Text analysis tools: We will cover basic text analysis methods, from pre-processing (stemming, lemmatization, stop word removal) to analysis techniques such as word frequency, keywords in context, and word clouds.
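To give a flavour of the methods named above, the sketch below shows stop word removal, word frequency counts, and keywords in context. The course itself works in R; Python is used here purely for illustration. The stop word list, the sample sentence, and the function names are all invented, and stemming and lemmatization are left out.

```python
import re
from collections import Counter

# A tiny, invented stop word list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "in", "is", "a", "to", "are"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Count tokens after removing stop words."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


def keyword_in_context(text, keyword, window=2):
    """Return each occurrence of keyword with `window` tokens on either side."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [token.upper()] + right))
    return hits


# Invented sample sentence.
sample = "The soil of the parish is fertile, and the parish roads are in good repair."
frequencies = word_frequencies(sample)
contexts = keyword_in_context(sample, "parish")
```

Here `frequencies` maps each remaining word to its count, and `contexts` lists every occurrence of the keyword in upper case with its neighbouring tokens.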

Afternoon Sessions:

Text analysis: We will begin to look at text mining/NLP, covering methods such as topic modelling, named entity recognition, and text classification. We will look at how (and why) they have been developed, alongside the problems and limitations of these methods.

BYOD (bring your own data) session: Participants will work together on ongoing research datasets provided in advance by the summer school attendees, focusing on good practices and troubleshooting.

Wednesday

Morning Seminar:

Dr Andrea Kocsis

Title: "More than feeling?" - Emotion Detection in Dark Heritage Reviews

Can digital humanities rewrite concepts from non-digital heritage studies? With the help of distant reading, my talk aims to re-evaluate why some heritage sites do not evoke hot cognition in visitors. Hot cognition is a form of affect, a direct emotional way in which we can interpret heritage experiences before or without thinking them over. I argue that the exhibition's curation, the story-telling, and levels of immersion play a more critical role in the hot interpretation than the time that has passed since the atrocity. To test my hypothesis, I have analysed 6000 TripAdvisor reviews about sites commemorating temporally distant tragedies, such as Clifford's Tower in York, the Mary Rose Museum in Portsmouth, and the Medieval Massacre exhibition at the Swedish History Museum. Methodologically, I aimed to review the landscape of computational methods in heritage affect research (sentiment analysis, emotion detection, topic modelling) and compare them to close reading. My research proved that extant data, like online reviews, can be a valid alternative to onsite observation and interviews. 

Morning Sessions:

Introduction to web scraping and HTML tags: We will explore different techniques, different website structures, and how to solve common problems encountered in web scraping.

Web scraping data from a forum: We will apply the skills we learnt to look at how the history of Scotland is discussed on a history forum. 
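The sketch below illustrates how HTML tags can guide the extraction of forum posts. The course itself works in R; Python's standard library is used here purely for illustration. The markup, usernames, and class names are invented, and the page is parsed from a string rather than fetched from a live site.

```python
from html.parser import HTMLParser

# Invented forum markup; real sites use their own tag and class names.
PAGE = """
<html><body>
  <div class="post"><span class="author">user_a</span><p>The Jacobite risings fascinate me.</p></div>
  <div class="post"><span class="author">user_b</span><p>I prefer the Enlightenment period.</p></div>
</body></html>
"""


class PostParser(HTMLParser):
    """Collect the text of every <p> element in the page."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current post.
        if self.in_paragraph:
            self.posts[-1] += data


parser = PostParser()
parser.feed(PAGE)
posts = parser.posts
```

Before scraping a real website, check its terms of use and robots.txt, and keep the request rate low.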

Afternoon Sessions:

An introduction to sentiment analysis: We will cover the core concepts, tools, and techniques used to undertake sentiment analysis.  
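One of the simplest approaches to sentiment analysis is lexicon-based scoring, sketched below. The course itself works in R; Python is used here purely for illustration. The lexicon, the negation rule, and the function names are invented and are not the tools used in the sessions.

```python
import re

# Invented miniature lexicon: word -> polarity score.
LEXICON = {
    "love": 1, "fascinating": 1, "great": 1,
    "boring": -1, "terrible": -1, "hate": -1,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum lexicon scores, flipping the sign of a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label(sentiment_score("The lecture was not boring"))` gives `"positive"`, because the negation flips the score of "boring". Sarcasm, domain-specific vocabulary, and longer-range negation are beyond a sketch like this.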

Advanced sentiment analysis: Using the techniques learned earlier, we will use sentiment analysis to examine trends in popular feelings about the history of Scotland using the data we scraped in the morning. 

Thursday

Morning Seminar:

Dr Uğur Ozdemir

Title: Beyond Binary Divides: Affective Sorting and the Multidimensional Landscape of Polarization

This paper introduces the concept of affective sorting as a means to examine affective polarization (AP) in society, expanding beyond the conventional approach that primarily focuses on political party affiliations. Using The American National Election Studies spanning 2004 to 2020, we observe a significant realignment in the American political landscape, characterized by a deepening of partisan identities and the alignment of sentiments towards diverse social and political groups within this partisan dichotomy post-2012. This study reveals that affective sorting offers a nuanced understanding of AP, showing how political, ideological, and social identities interconnect within a singular affective dimension, reflecting a complex transformation in the electorate's affective orientations. Our findings highlight the importance of considering the multidimensional aspects of social identities and the intricate dynamics of polarization, underscoring the profound impact of partisan identity on the electorate's affective dispositions towards various groups.

Morning Sessions:

Data analysis: Regressions. We will introduce a variety of regression models that are used for understanding the relationship between variables in a dataset. Thereafter, we will deal with linear models and generalised linear models.

Data analysis: PCA. We will explore Principal Component Analysis, a method for finding the combinations of variables that capture the greatest variance in a dataset. 

Afternoon Sessions:

Data analysis: Clustering. The afternoon will see the introduction of k-means and hierarchical clustering as methods of finding patterns in your data. 

Data analysis: Null hypothesis testing. We will use this technique to test the statistical significance of observations in our dataset. 
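One way to make the idea of significance testing concrete is a permutation test, sketched below. The course itself works in R and may use different tests; Python is used here purely for illustration. The two groups of scores are invented, and `permutation_test` is a hypothetical helper.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Invented sentiment scores for two hypothetical groups of forum posts.
group_a = [0.9, 1.1, 1.3, 1.0, 1.2, 1.4]
group_b = [0.1, 0.3, 0.2, 0.4, 0.0, 0.5]
observed, p_value = permutation_test(group_a, group_b, n_permutations=2000)
```

A small p-value means that a difference at least as large as the observed one rarely arises when the group labels are shuffled at random.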

Friday

Morning Seminar:

Dr Alexis Pister

Title: Designing Network Visualisations for the Real World 

Networks are used in a wide range of domains to model real-world phenomena such as trophic networks, financial transactions, social relationships and more. Visualization is very useful and often required to show a network's general structure and reveal potential interesting patterns. 

However, designing usable and useful network visualisations is not trivial: the design space is vast, and many pitfalls exist, such as choosing the wrong layout or using an inefficient visual encoding, that can rapidly lead to unreadable “hairball” node-link diagrams. Choosing the correct visualisation technique, layout, and visual designs is especially difficult since networks can take many different forms: multivariate, bipartite, temporal, spatial, etc. 

In this talk, I will reflect on the field of network visualisation through the presentation and discussion of several network visualisation projects I worked on in collaboration with historians, social scientists, and epidemiologists using real data. I will discuss visualisation techniques and design choices that can lead to efficient network visualisations usable in the real world to answer research questions and reveal interesting insights.  

Morning Sessions:

Data visualisation: We will look at the basic principles of data visualisation, followed by an introduction to the "Grammar of Graphics" and the various plots that can be built with it. We will look at ways to visualise both structured and unstructured data. 

Afternoon Sessions:

Data visualisation: We will shift to advanced data visualisation, using spatial data. We will work with examples of real-world data, and cover the fundamentals of using spatial data and visualising it effectively.

Next Steps: In the final session of the summer school, together with the attendees of the other stream, we will discuss the results of the week and the next steps you can take to continue developing your computational skills. 

Time | Monday | Tuesday | Wednesday | Thursday | Friday
09:00-09:30 | Registration | - | - | - | -
09:30-09:40 | Welcome | Setting Up | Setting Up | Setting Up | Setting Up
09:40-10:40 | Seminar | Seminar | Seminar | Seminar | Seminar
10:40-11:00 | Coffee | Coffee | Coffee | Coffee | Coffee
11:00-12:30 | Introduction | Text Analysis | Web Scraping | Data Analysis | Data Visualisation
12:30-13:30 | Lunch | Lunch | Lunch | Lunch | Lunch
13:30-15:00 | Data Wrangling | Text Analysis | Sentiment Analysis | Data Analysis | Data Visualisation and Geographical Data
15:00-15:30 | Coffee | Coffee | Coffee | Coffee | Coffee
15:30-17:00 | Keynote | BYOD | BYOD | BYOD | Next Steps
Evening Reception Ceilidh Club Pub Quiz Drinks
Green: Room 2.55 in Wing A or Room 1.55 Wing A
Grey: Teaching Rooms, 1.50 and 1.52 in Wing B
Yellow: Small Events Space, Room 4.55 in Wing A
Blue: Social events happening outside the building

Our Instructors & Speakers

Alexis Pister

Alexis Pister is a data visualisation research engineer working at the Edinburgh Futures Institute and the Vishub. He specialises in visual analytics and network visualisation, and works on several cross-disciplinary projects on how to effectively design and visualise complex multidimensional datasets from the humanities and social sciences. He is also involved in the development of new visualisation toolkits and tools, such as Netpanorama and the Vistorian, to make network visualisation easier and more expressive.

He obtained a Ph.D. from Paris-Saclay University on network visualisation applied to historical records and a Master’s degree in Bioinformatics and modelling at INSA Lyon.

Ash Charlton

Ash is based in the School of History, Classics & Archaeology.

Ash’s research is focused on using text mining to identify legacies of race and slavery in the early Encyclopaedia Britannica (1768-1860). Her project uses keyword frequency counts and distributions across eight editions of the Encyclopaedia Britannica to explore how race and slavery were portrayed in the publication, and how this changed over time as anti-slavery sentiment increased in Great Britain. She is also using network analysis to map the explicit and implicit references to slavery across articles, and to identify where there are silences on the topic. She is working in collaboration with the National Library of Scotland, and their digitised Encyclopaedia Britannica dataset from their Data Foundry forms the basis of her research.

Jessica Witte

Dr Jessica Witte is a postdoctoral fellow at the University of Edinburgh.

Her research interests include creating and applying textual analysis tools for historical texts, with a particular focus on the medicalization of women’s bodies. She uses these tools to better understand the epistemic dimension of women’s experiences, with the aim of informing patient advocacy and better interventions based on individualized experience.

Ki Tong

Ki is a PhD candidate at the Advanced Care Research Centre studying ways to enhance greenspace accessibility for older adults. She is a landscape architect with professional experience delivering construction projects and landscape assessments. Besides her interest in using QGIS and ArcGIS for geospatial visualisation and analysis, she has expanded into aggregating geospatial data and analysing it further in R to study the correlation between environmental variables and urban density.

Melissa Terras

Melissa Terras is Professor of Digital Cultural Heritage at the University of Edinburgh’s College of Arts, Humanities, and Social Sciences.

Her research focuses on the digitisation of cultural heritage, including its technologies, procedures, and impact, and how this intersects with internet technologies. She was the founding director of the Centre for Data, Culture, & Society.

Rhys Davies

Rhys is based at the School of Health in Social Science.

Rhys is a psychologist researching adaptive behaviours and mental health in elite sports. His research makes use of statistical modelling with survey data, particularly investigating interactions to determine how context shapes the efficacy of “adaptive” behaviours. His preferred coding language is R, and he is passionate about using data visualisation techniques to communicate and simplify research findings.

Uğur Ozdemir

Dr Uğur Ozdemir is a Lecturer in Quantitative Political Science at the University of Edinburgh.

His research interests include comparative political behaviour, formal models of electoral politics, and quantitative methods. He is a dedicated advocate of bridging the gap between theoretical modelling and empirical analysis.

Elizabeth Pankratz

Elizabeth is a third-year PhD student at the University of Edinburgh's Centre for Language Evolution. In her research, she uses computational models and behavioural experiments to study how people learn grammatical structure and how those learning processes could shape language. She has also taught both Bayesian and frequentist statistics and loves to help people solve tricky stats problems.

Andrea Kocsis

Dr Andrea Kocsis comes from an interdisciplinary and international background. In her research, she combines heritage studies with data and network science. The research she is conducting as a Chancellor's Fellow at the University of Edinburgh focuses on investigating historical maritime trade as a complex system. As a National Librarian’s Research Fellow in Digital Scholarship (2024-25), she works on making web archives more accessible to wider audiences.

Jessica Teed

Jess is a PhD student in the School of Philosophy, Psychology, and Language Sciences. She is interested in exploring how the brain processes visual information and memories using behavioural paradigms combined with neuroimaging techniques such as functional MRI, transcranial magnetic stimulation, and electroencephalography. Jess's research focuses on the functional organisation of visual representations including object perception and visual imagery, using Python and R for experimental design and analysis.

Ozan Evkaya

Ozan Evkaya (FHEA) is a University Teacher in Statistics at the School of Mathematics and has been teaching mathematics students in higher education across different subjects. Outside of university teaching, Ozan is a co-organiser of the TEMSE seminars and the local GenAI group in the School of Mathematics, a co-organiser of the EdinbR group, and a member of the RSS Edinburgh local community. Previously, he held postdoc positions at Padova University (2021) and KU Leuven (2020), after completing his PhD in Statistics (2018) at Middle East Technical University.