Text & Data Analysis in the Wild

05-09 June 2023

Venue: LG.07, 40 George Square

This course will equip attendees with knowledge of all the main techniques required to conduct data and text analysis, providing them with the basics of web scraping, text analysis, topic modelling, sentiment analysis, data wrangling, data analysis, inferential statistics, cluster analysis, and data visualisation. Learning will be challenge-led, with data gathered from the Scottish and UK Government websites being used to analyse a series of research questions related to the ongoing cost of living crisis.

Hands-on practical exercises are combined with theory, alongside presentations on the real-world applications of the techniques that will be taught. Participants will also have an opportunity to bring their own datasets and research questions to the course, to apply and tailor what they are learning to their own projects.


Overview

This course is designed to help researchers understand how data and text analysis projects are performed in a research environment. It starts with the identification of a series of research questions connected to this year’s core topic (the cost of living in Scotland and the UK). It then explores how computational methods can be used to obtain, clean, and analyse structured and unstructured datasets to answer those questions.

Key principles:

General knowledge of the RStudio interface and of coding is required. Materials will be released in advance to help attendees refresh their R skills in preparation.
Fosters interdisciplinary thinking by bringing social science and humanities researchers together to explore methods.
Covers both structured and unstructured data.
Illustrates the development process of data-led projects by moving through phases of the project lifecycle.
Challenge-led, helping researchers learn how to deal with real-world data.

RStudio Refresher

Although the Text and Data Analysis in the Wild stream will provide attendees with many new digital skills, a level of prior knowledge is necessary to get the most out of our training. Try our short quiz to test your knowledge of R, covering some of the basics you should be familiar with before the Summer School.


If you didn't do as well as you'd hoped, don't worry! In the accompanying video, Training Manager Dr Lucia Michielin goes through a refresher of some of the core R tools and techniques. The Summer School also offers Stream 1: A Gentle Introduction to Coding, which provides a comprehensive guide through the basics of coding and programming for research.

Daily Schedule

Monday

Registration and Welcome

Morning Seminar:

Dr Jessica Witte: What’s in the Internet Archive? A big (meta)data analysis

Dr Witte will discuss her analysis of the language metadata of the ~35 million texts stored in the text repository of the Internet Archive. The challenges include problems with automating metadata using Tesseract OCR, potential ethical issues related to objects with violent, traumatic or colonialist histories, and general methodological issues, such as the need to manually annotate a percentage of the dataset, the difficulties involved in standardising messy data, and how to scrape large amounts of data.

Morning Sessions:

An introduction to the dataset and research questions: The origin and types of data used in the summer school will be discussed, alongside the research questions we will be exploring and answering throughout the week.
Introduction to web scraping: We will explore different techniques, different website structures, and how to solve common problems encountered in web scraping.

Afternoon Sessions:

Web scraping data from government websites: We will apply the skills we learnt in the morning to look at how the cost of living crisis is being presented by the Scottish and UK Governments.
BYOD (Bring Your Own Data) session: Participants will work together on ongoing research datasets provided in advance by the summer school attendees, focusing on good practices and troubleshooting.

Tuesday

Morning Seminar:

Dr Justin Chun-Ting Ho: Extracting Latent Moral Information from Text with ChatGPT

Moral intuitions play a vital role in a wide range of individual behaviours as well as personal and political choices. Moral Foundations Theory is the most prominent framework for understanding innate moral capacities and has been applied in a wide range of social and political studies. However, extracting moral information from text has been challenging due to the latent and highly contextual nature of moral intuitions. Against this background, this paper explores the potential of ChatGPT, a large language model developed by OpenAI, for extracting latent moral information from non-English text. Three existing approaches (off-the-shelf dictionaries, content analysis by trained annotators, and crowdsourcing) are compared with ChatGPT. This research highlights the potential of ChatGPT as a cost-effective tool for extracting latent moral information from text, as well as its potential pitfalls.

Morning Sessions:

An introduction to text analysis: What is text analysis? When and why might we use it? We will also look at where to find textual data and how to retrieve it efficiently and ethically.
Text analysis tools: We will cover basic text analysis methods and an introduction to word clouds. There will be a demonstration of word clouds for our dataset, before we engage in basic pre-processing (stemming, lemmatization, and stop word removal).
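
As a rough illustration of the pre-processing step mentioned above, the sketch below removes stop words from a short text and counts the remaining words, which is the kind of frequency table a word cloud is drawn from. The course itself is taught in R; this example uses Python purely for illustration. The stop-word list, the sample sentence, and the function name `word_frequencies` are all hypothetical and not taken from the course materials.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# real projects would normally use a fuller list from a text analysis library.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "with"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into words, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample sentence, not drawn from the summer school dataset.
sample = "The cost of living is rising, and the cost of energy is rising with it."
freqs = word_frequencies(sample)

# Print the most frequent remaining words and their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Stemming and lemmatization are left out here because they usually rely on a dedicated library; the same counting step would simply be applied to the stemmed or lemmatized tokens.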

Afternoon Sessions:

Text analysis: We will begin to look at text mining/NLP, covering methods such as topic modelling, named entity recognition, and text classification. We will look at how (and why) they have been developed, alongside problems and limitations of these methods.
BYOD session: Participants will work together on ongoing research datasets provided in advance by the summer school attendees, focusing on good practices and troubleshooting.

Wednesday

Morning Seminar:

Dr Ugur Ozdemir: Affective Partisan Sorting in the UK: Turbulent Times

Affective partisan polarisation - strong emotional attachments towards co-partisans and hostility towards opposing partisans - has been the focus of American politics in the last decade but has not attracted much attention within the UK context. In this paper, I use British Election Study Internet Panel data to study affective partisanship in the UK. The data covers the ‘turbulent times’, from six months before the independence referendum to the aftermath of the 2021 elections, giving us a unique opportunity to test how issue polarisation and affective polarisation are related. Using graded response item response models, I provide empirical evidence on how the former drives the latter.

Morning Sessions:

An introduction to sentiment analysis: We will cover the core concepts, tools, and techniques used to undertake sentiment analysis.
Advanced sentiment analysis: Using the techniques learned earlier, we will use sentiment analysis to examine trends in popular feelings about the cost of living crisis using data from social media platforms.

Afternoon Sessions:

Data wrangling: We will begin to examine structured data and the ways to enhance and clean a dataset. We will cover common pitfalls, such as dealing with NULL values, and techniques such as adding rows, columns, and other data.
BYOD session: Participants will work together on ongoing research datasets provided in advance by the summer school attendees, focusing on good practices and troubleshooting.

Thursday

Morning Seminar:

Dr Sam Leggett: Exploring "Fishy" Data - an Adventure in Multivariate Statistics and the Cult of J.W. Tukey

In this session we will explore large-scale archaeological isotope data to track major changes in diet and mobility in early medieval Europe. The session uses archaeological data to highlight the power of Exploratory Data Analysis over Confirmatory Data Analysis, giving an example of meta-analysis, data re-use and the importance of Open Research in the humanities and social sciences.

Morning Sessions:

Data analysis: Regressions. We will introduce a variety of regression models that are used for understanding the relationship between variables in a dataset. Thereafter, we will deal with linear models and generalised linear models.
Data analysis: PCA. We will explore Principal Component Analysis, a method for finding the combinations of variables (principal components) that capture the greatest variance in a dataset.

Afternoon Sessions:

Data analysis: Clustering. The afternoon will see the introduction of k-means and hierarchical clustering as methods of finding patterns in your data.
Data analysis: Null hypothesis testing. We will use this technique to test the statistical significance of observations in our dataset.

Keynote Lecture:

Prof. Melissa Terras: The Boundaries of Digitised Content: Designing Research Projects within Collection Constraints

Friday

Morning Seminar:

Dr Pedro Jacobetty: Combining Computational Research Methods for Text Analysis and Visualisation

In this talk, we will explore the power of visualization in text analysis using computational research methods. Our focus will be on combining natural language processing (NLP) and network analysis to analyse textual corpora. First, we will use keyphrase extraction to identify the most representative phrases within each text. These key phrases provide semantic information that helps reduce the complexity of the corpus. We will then leverage the co-occurrence of keyphrases in each text to create a network diagram that represents the entire corpus. This visualization allows us to explore the relationships between the keyphrases, revealing hidden semantic structures that might otherwise remain undiscovered in the complexity of the corpus. By visualising the relationships between keyphrases, we can quickly gain insights into the underlying meanings and themes within the textual data. Overall, our goal is to demonstrate how visualization can play a crucial role in the analysis of textual data using computational research methods. By combining NLP and network analysis, we can extract rich semantic information from the corpus and visualize it in a way that reveals the underlying structures of meaning within the text.
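
The co-occurrence step described in this abstract can be sketched roughly as follows. This is a minimal illustration in Python (the course itself is taught in R) and not the speaker's own method: it assumes keyphrases have already been extracted for each text, and the documents, keyphrases, and the function name `cooccurrence_edges` are all hypothetical. Each pair of keyphrases that appears in the same text becomes an edge, and the edge weight is the number of texts in which the pair co-occurs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: keyphrases already extracted for each text in a corpus.
# The extraction step itself is assumed to have happened and is not shown.
doc_keyphrases = {
    "doc1": ["cost of living", "energy bills", "inflation"],
    "doc2": ["energy bills", "inflation"],
    "doc3": ["cost of living", "food banks"],
}


def cooccurrence_edges(keyphrases_by_doc):
    """Count, for every pair of keyphrases, how many texts contain both."""
    edges = Counter()
    for phrases in keyphrases_by_doc.values():
        # Sort and de-duplicate so each unordered pair is counted once per text
        # and always stored with the same key order.
        for a, b in combinations(sorted(set(phrases)), 2):
            edges[(a, b)] += 1
    return edges


edges = cooccurrence_edges(doc_keyphrases)

# Print the weighted edge list, strongest connections first.
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b} (weight {weight})")
```

The resulting weighted edge list is the kind of input a network analysis or graph-drawing library would take to produce the diagram of the corpus.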

Morning Sessions:

Data visualisation: We will look at the basic principles of data visualisation, followed by an introduction to the "Grammar of Graphics" and the various plots that can be built with it. We will look at ways to visualise both structured and unstructured data.

Afternoon Sessions:

Data visualisation: We will shift to advanced data visualisation, using spatial data. We will work with examples of real-world data, and cover the fundamentals of using spatial data and visualising it effectively.
Next Steps: In the final session of the summer school, we will discuss the next steps you can take to continue developing your computational skills.

Time	Monday	Tuesday	Wednesday	Thursday	Friday
09:00-09:30	Registration	-	-	-	-
09:30-09:40	Welcome	Setting Up	Setting Up	Setting Up	Setting Up
09:40-10:40	Seminar	Seminar	Seminar	Seminar	Seminar
10:40-11:00	Coffee	Coffee	Coffee	Coffee	Coffee
11:00-12:30	Introduction	Text Analysis	Sentiment Analysis	Data Analysis	Data Visualisation
12:30-13:30	Lunch	Lunch	Lunch	Lunch	Lunch
13:30-15:00	Web Scraping	Text Analysis	Data Wrangling	Data Analysis	Data Visualisation
15:00-15:30	Coffee	Coffee	Coffee	Coffee	Coffee
15:30-17:00	BYOD	BYOD	BYOD	Keynote	Next Steps
Evening	Pub Quiz	Pub Crawl	Ceilidh	Drinks Reception	Dinner

Grey: Events taking place in the Teaching Room (LG.07, 40 George Square)
Yellow: Events taking place in the Project Room, 50 George Square
Pink: Refreshment breaks that will take place in the lounge area outside the teaching rooms
Teal: Events in the social programme of the summer school

Our Practical Workshop Instructors & Helpers

Andrew McLean

Andrew is an archaeologist based at the School of History, Classics and Archaeology. His research interests currently focus on the economy of the Roman Adriatic, while his methodological approaches include GIS and statistical analysis. He is expanding on traditional Least Cost Path (LCP) analysis by using circuit theory to model maritime movement. Through this, he is familiar with QGIS, R, Circuitscape, shell scripting and programming languages such as Julia and Python.

Fang Jackson-Yang

Fang is a PhD student at the School of Philosophy, Psychology, and Language Sciences. Her research investigates how speakers encode prominent information in simulated conversations and how listeners predict upcoming utterances in comprehension. She works with both laboratory and corpus data. She conducts data analyses in R using multivariate statistical tools such as mixed-effects models.

James Besse

James is a PhD student in Science, Technology and Innovation Studies. His research covers the implementation of e-ID systems, specifically looking at the EU Settlement Scheme (EUSS). He uses text mining alongside social surveys and interviews to understand user experience with the EUSS. More broadly, James is interested in the social impacts of new technologies and in how digital methods can help to understand them. His areas of expertise include statistics, web scraping, data visualization, research design and the use of R.

James Page

James is an archaeologist whose research focuses on the economy of Northern Italy in the Roman period. His work uses a combination of network modelling and statistical analyses to look at the dynamics behind inland trade and test the validity of prior modelling. He is familiar with QGIS, R, and OpenRefine.

Lara Dal Molin

Lara is a PhD student in Science, Technology and Innovation Studies and Sociology at the University of Edinburgh, part of the joint programme in Social Data Science with the University of Copenhagen. Her research interests concern the intersection between Artificial Intelligence, language and gender, which she investigates through the social study of GPT language models. Lara also co-coordinates the AI Ethics and Society group at the University of Edinburgh and is a tutor in the Schools of Social and Political Science and Informatics.

Our Speakers

Melissa Terras

Melissa Terras is Professor of Digital Cultural Heritage at the University of Edinburgh’s College of Arts, Humanities, and Social Sciences.

Her research focuses on the digitisation of cultural heritage, including its technologies, procedures, and impact, and how this intersects with internet technologies. She was the founding director of the Centre for Data, Culture, & Society.

Jessica C. Witte

Dr Jessica Witte is a postdoctoral fellow at the University of Edinburgh.

Her research interests include creating and applying textual analysis tools for historical texts, with a particular focus on the medicalization of women’s bodies. She uses these tools to better understand the epistemic dimension of women’s experiences, with the aim of informing better interventions based on individualized experience and patient advocacy.

Justin Chun-Ting Ho

Dr Justin Chun-Ting Ho is a postdoctoral fellow at Academia Sinica, the national academy of Taiwan. He formerly worked at Sciences Po.

His research focuses on nationalism and populism, in particular how they are communicated via social media. His research employs a range of computational methods, including computational text analysis, social network analysis, and machine learning.

Ugur Ozdemir

Dr Ugur Ozdemir is Lecturer in Quantitative Political Science at the University of Edinburgh.

His research interests include comparative political behaviour, formal models of electoral politics, and quantitative methods. He is a dedicated advocate of bridging the gap between theoretical modelling and empirical analysis.

Sam Leggett

Sam Leggett is an archaeologist whose work has used bioarchaeological and funerary evidence to investigate diet and mobility at multiple scales across western Europe in the first millennium AD, with a particular regional focus on early medieval England. Her current project, ArchaeoFINS, is centred around an old and unresolved archaeological question of when, where, how, and why people began to eat fish again after the introduction of farming in Europe, with a focus on fishy Vikings and Scotland's coastal communities.

Pedro Jacobetty

Pedro Jacobetty is a sociologist whose research interests include technology, digital culture, knowledge production and circulation, media and communication. He is also interested in innovative ways of using digital methods for social sciences and art.

Apply Now

Application Form