Tidying Data with OpenRefine
Whether your data is newly created or inherited, you will almost always need to clean it before creating graphs and charts. Data cleaning can be one of the most time-consuming and tedious tasks in research, but there are tools that can help. In this step, we focus on a powerful and fast tool that should help you save time while cleaning your dataset.
OpenRefine is a tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
Why Use OpenRefine?
- Data is often very messy. OpenRefine provides a set of tools that allow you to identify and amend the messy data.
- It is important to document what you do to your data. Publishers and funding bodies often require documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them as supplemental material.
- All actions are easily reversed in OpenRefine.
- When you save your work, it is written to a new file. OpenRefine always works on a copy of your data and does not modify your original dataset.
- Data cleaning steps often need repeating with multiple files. OpenRefine keeps track of all of your actions and allows them to be applied to different datasets.
- Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.
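To give a feel for what clustering means here, the following is a minimal Python sketch of a key-collision approach: each value is reduced to a normalised "fingerprint", and values that share a fingerprint are grouped as likely variants of the same thing. This is a simplified illustration and not OpenRefine's own implementation; the function names `fingerprint` and `cluster` and the sample values are assumptions made for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalised key (hypothetical helper for illustration)."""
    # Fold accented characters to plain ASCII where possible.
    ascii_value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Trim, lowercase, and drop punctuation.
    cleaned = re.sub(r"[^\w\s]", "", ascii_value.strip().lower())
    # Split on whitespace, remove duplicate tokens, and sort them.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Made-up sample values, for demonstration only.
names = ["Chicago Hope", "chicago  hope", "Hope, Chicago", "Boston"]
clusters = cluster(names)
print(clusters)
```

In this sketch the three spellings of the same name collapse into one group, while "Boston" has no variants and is left out. In OpenRefine you would review each suggested group and choose which spelling to keep.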
What Can You Do with OpenRefine?
- Import data in various formats
- Explore datasets in a matter of seconds
- Apply basic and advanced cell transformations
- Deal with cells that contain multiple values
- Create instantaneous links between datasets
- Filter and partition your data easily with regular expressions
- Use named-entity extraction on full-text fields to automatically identify topics
- Perform advanced data operations with the General Refine Expression Language (GREL)
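Two of the operations listed above, dealing with cells that contain multiple values and filtering with regular expressions, can be sketched in plain Python to show the idea behind them. This is only an illustration under stated assumptions: the column name `authors`, the `|` separator, the sample rows, and the helper names `split_multi_valued` and `regex_filter` are all invented for this example, and the code does not reflect how OpenRefine works internally.

```python
import re

# Made-up sample rows; the "authors" cell of the first row holds two values.
rows = [
    {"id": 1, "authors": "Smith, J.|Doe, A."},
    {"id": 2, "authors": "Lee, K."},
]


def split_multi_valued(rows, column, separator):
    """Return one row per value found in a multi-valued cell (hypothetical helper)."""
    result = []
    for row in rows:
        for part in row[column].split(separator):
            new_row = dict(row)  # copy the other columns unchanged
            new_row[column] = part.strip()
            result.append(new_row)
    return result


def regex_filter(rows, column, pattern):
    """Keep only rows whose cell in `column` matches the regular expression."""
    compiled = re.compile(pattern)
    return [row for row in rows if compiled.search(row[column])]


expanded = split_multi_valued(rows, "authors", "|")
matches = regex_filter(expanded, "authors", r"^(Smith|Lee),")

for row in matches:
    print(row["id"], row["authors"])
```

Here the first row is expanded into two rows, one per author, and the filter then keeps only the authors whose surname is Smith or Lee. OpenRefine offers comparable operations through its interface, so no code is needed to use them.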