The ZEW-FDZ offers a novel panel of semi-structured webpage data at the company level, the Mannheimer Webpanel. It comprises textual webpage data retrieved from a broad range of German firm websites. A detailed description of the web scraping methods used to harvest the data, as well as an examination of the dataset (a corpus of German corporate websites), can be found in this discussion paper: Kinne, Jan and Janna Axenbeck (2018), Web Mining of Firm Websites: A Framework for Web Scraping and a Pilot Study for Germany, ZEW Discussion Paper No. 18-033, Mannheim.
The dataset provides, among others, the following variables:
- ID – unique company identifier.
- dl_rank – a company website usually consists of several individual webpages. dl_rank gives the order in which these webpages were downloaded: the main page of a website has rank 0, the first subpage processed after the main page has rank 1, and so on.
- dl_slot – the domain name of the website.
- title – the title of the company website as indicated in the website's metadata.
- keywords – the list of keywords of the company website as indicated in the website's metadata.
- description – the description of the company website as indicated in the website's metadata.
- text – the text/content that was downloaded from the webpage.
- timestamp – the exact time when the webpage was downloaded.
- url – the URL of the webpage.
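
As an illustration of how these variables relate to each other, the following minimal Python sketch groups page records by company and orders them by dl_rank. It is not part of the dataset documentation: the Page class, the helper functions, the field types (for example ID and timestamp stored as strings) and the sample rows are all assumptions made for this example, and the actual delivery format of the panel may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    """One downloaded webpage; field names follow the variable list above.

    The field types are assumptions made for this sketch.
    """
    ID: str
    dl_rank: int
    dl_slot: str
    title: str
    keywords: str
    description: str
    text: str
    timestamp: str
    url: str


def group_by_company(pages):
    """Map each company ID to its pages, sorted by download order (dl_rank)."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page.ID].append(page)
    for company_pages in grouped.values():
        company_pages.sort(key=lambda p: p.dl_rank)
    return dict(grouped)


def main_page(company_pages):
    """Return the page with dl_rank 0 (the main page), or None if it is missing."""
    for page in company_pages:
        if page.dl_rank == 0:
            return page
    return None


# Invented sample rows, for illustration only (not taken from the dataset).
sample = [
    Page("1001", 1, "example.de", "About", "", "", "About us text",
         "2018-01-01T10:00:05", "https://example.de/about"),
    Page("1001", 0, "example.de", "Home", "", "", "Welcome text",
         "2018-01-01T10:00:00", "https://example.de/"),
    Page("1002", 0, "example.org", "Start", "", "", "Start page text",
         "2018-01-02T09:30:00", "https://example.org/"),
]

websites = group_by_company(sample)
for company_id, company_pages in websites.items():
    home = main_page(company_pages)
    print(company_id, len(company_pages), home.url if home else None)
```

Grouping on ID and sorting on dl_rank reconstructs each website as an ordered sequence of pages, so the main page (rank 0) can be separated from its subpages before any text analysis.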