The massive proliferation of machine-readable information available online opens up new possibilities for research and for the efficient collection of large amounts of data. As a result, the ability to gather such information automatically has become an important asset for researchers across fields. The possible applications are vast: one could automate the collection of coordinates for a set of cities from Wikipedia, collect a set of articles from a news website, or extract specific information from forms published by the public administration.
Scholars from different disciplines of the University of Kaiserslautern had the opportunity to learn the fundamentals of web scraping in a workshop organized by Georg Wenzelburger and supported by the TU-Nachwuchsring (network for the promotion of young scientists). Over one and a half course days on the 24th and 25th of May, Dr. Dominic Nyhuis gave insights into the theoretical foundations and the practical application of web scraping techniques. He is the co-author of a comprehensive book on web scraping and has given various workshops on the topic, among others at the ECPR Winter School.
The workshop started with an introduction to the basics needed to understand how websites are structured and how this structure makes an at least partially automated collection of information possible. Dominic Nyhuis then demonstrated how websites can be processed and how parts of these sites can be selectively retrieved using the software R. The course covered various exercises and examples, and participants had the chance to suggest applications of web scraping that they might need for their own research. As an outlook, the course concluded with a brief overview of text mining methods, i.e. techniques that can be applied after text has been scraped from the web.
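To give a flavor of what such selective retrieval looks like in practice, the sketch below extracts all hyperlinks from a small HTML document. It is written in Python (using only the standard library) rather than the R used in the course, and the HTML snippet is invented for illustration; a real application would first download the page before parsing it.

```python
# A minimal sketch of selective retrieval from a web page. The HTML
# string below stands in for a document fetched from the web; the goal
# is to pull out only one kind of element, here the link targets.
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Invented example page: a short list of cities with Wikipedia-style links
html = """
<html><body>
  <h1>City list</h1>
  <a href="/wiki/Berlin">Berlin</a>
  <a href="/wiki/Paris">Paris</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/wiki/Berlin', '/wiki/Paris']
```

The same idea scales up directly: instead of link targets, one can collect table cells, coordinates, or article text, which is exactly the kind of task the course exercises addressed in R.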