European Court of Human Rights Mapping Project
Principal Investigator(s): Jessica Greenberg, University of Illinois at Urbana-Champaign; Benjamin Krupp, University of Illinois at Urbana-Champaign; Stephanie Auer
Version: V1
| Name | File Type | Size | Last Modified |
|---|---|---|---|
| | text/csv | 28 MB | 11/24/2021 02:24 AM |
Project Description
Please see "collection notes" for complete methodology.
Scope of Project
Once we arrived at a master data frame containing all of the categories we wanted, the next task was to clean up inconsistencies in HUDOC's manual coding. We wrote a data cleaning script to prepare the dataset for exploratory analysis and visualization. The script targets four areas of inconsistency in the data: (1) it pulls the country name from the title of the case and resolves inconsistencies in country names, spellings, and languages used; (2) it formats dates to be consistent and machine-readable; (3) it separates cases that involve multiple countries; and (4) it removes the lower ruling when the Grand Chamber overturned a Chamber judgment, so that the dataset reflects the Court's final ruling of violation or non-violation.
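The four cleaning steps above might look roughly like the following. This is a minimal sketch, not the project's actual script: the alias table, the field names (`title`, `appno`, `body`), the title pattern ("v." or "c." before the respondent State), and the accepted date formats are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical alias table (small subset); the project's real mapping is not published here.
COUNTRY_ALIASES = {
    "turkey": "Turkey", "turquie": "Turkey",
    "united kingdom": "United Kingdom", "royaume-uni": "United Kingdom",
    "russia": "Russia", "russie": "Russia",
    "ukraine": "Ukraine", "france": "France",
    "bosnia and herzegovina": "Bosnia and Herzegovina",
}

# Assumed title shape: "... v. STATE" (English) or "... c. STATE" (French).
RESPONDENT_RE = re.compile(r"\s(?:v\.|c\.)\s+(.+)$", re.IGNORECASE)

# Assumed input formats; day-first for slash-separated dates.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d %B %Y")


def extract_countries(title):
    """Step 1: return normalized respondent countries named in a case title."""
    match = RESPONDENT_RE.search(title)
    if not match:
        return []
    respondents = match.group(1).lower()
    hits = []
    # Longest aliases first, so "bosnia and herzegovina" is matched as one name.
    for alias in sorted(COUNTRY_ALIASES, key=len, reverse=True):
        found = re.search(r"\b" + re.escape(alias) + r"\b", respondents)
        if found:
            hits.append((found.start(), COUNTRY_ALIASES[alias]))
            # Blank out the match so shorter aliases cannot re-match inside it.
            respondents = (respondents[:found.start()]
                           + " " * len(alias)
                           + respondents[found.end():])
    ordered = [name for _, name in sorted(hits)]
    return list(dict.fromkeys(ordered))  # de-duplicate, keep title order


def normalize_date(raw):
    """Step 2: return an ISO 8601 date string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_by_country(record):
    """Step 3: return one copy of the record per respondent country."""
    countries = extract_countries(record["title"])
    return [{**record, "country": c} for c in countries]


def keep_final_rulings(records):
    """Step 4 (simplified): drop a Chamber judgment whenever a Grand Chamber
    judgment exists for the same application number."""
    gc_apps = {r["appno"] for r in records if r["body"] == "Grand Chamber"}
    return [r for r in records
            if r["body"] == "Grand Chamber" or r["appno"] not in gc_apps]
```

Note that step 4 here drops every Chamber judgment that was followed by a Grand Chamber judgment, which is broader than "overturned"; a faithful version would need to compare the two outcomes.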
We tested the master set separately for comprehensiveness and for accuracy. In the comprehensiveness tests, our objective was to make sure that the dataset included all cases and mirrored both internal Court reporting and HUDOC. In the accuracy tests, our objective was to test the dataset at the highest resolution (specific article, paragraph, and subparagraph information) against both HUDOC and the original court documents (judgments), in order to identify possible bugs in our scraping script and/or inconsistencies in HUDOC's internal coding. For the comprehensiveness testing, we checked our cumulative dataset numbers against those reported in the ECtHR's annual yearbooks. Over the entirety of the dataset, there was a 6% discrepancy in total violations between the two sets: our master dataset included only 94% of the cases listed in the annual yearbooks. We attribute this discrepancy to cases that were not available in English (our scrape selected the English versions of cases), to cases of overturned violations, and to HUDOC's internal miscoding and human error.
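The comprehensiveness check described above amounts to comparing per-year counts in the master set with the counts reported in the yearbooks. A minimal sketch follows; the function name and the per-year dictionaries are hypothetical, and the counts in the usage lines are made-up toy values, not the project's data.

```python
def coverage_report(ours, yearbook):
    """Compare per-year counts in the scraped set against yearbook counts.

    Both arguments map year -> number of violation judgments.
    Returns (per-year coverage ratios, overall coverage ratio).
    """
    per_year = {}
    for year in sorted(yearbook):
        expected = yearbook[year]
        found = ours.get(year, 0)
        # Coverage is undefined for a year with no yearbook entries.
        per_year[year] = found / expected if expected else None
    total_expected = sum(yearbook.values())
    total_found = sum(ours.get(year, 0) for year in yearbook)
    overall = total_found / total_expected if total_expected else None
    return per_year, overall


# Toy counts for illustration only.
per_year, overall = coverage_report(
    ours={2001: 95, 2002: 93},
    yearbook={2001: 100, 2002: 100},
)
print(f"overall coverage: {overall:.0%}, discrepancy: {1 - overall:.0%}")
```

Reporting coverage per year as well as overall makes it easier to see whether missing cases cluster in particular periods, for example years with more judgments not issued in English.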
To test the master frame for accuracy, we manually coded, from the judgment documents, 193 cases selected at random in proportion to the total number of violation judgments in a given year. The ratio was 1 test judgment for every 100 cases, with a minimum of 1 test case per year in order to maintain the historical breadth of the tests. We manually coded these cases for article violations, application number, date, and country of origin. We compared this manually coded test frame against our master frame and found that our dataset was 99% accurate to HUDOC, but that HUDOC was only 92.8% accurate to the case documents. On analysis, this 7.2% error rate was due to HUDOC miscoding, specifically missing paragraph and subparagraph designations when the case was coded and entered into the HUDOC system. The vast majority (around 80%) of these mistakes were in Articles 5 and 6 and Protocol 1, areas where paragraph and subparagraph designations are most meaningful.
The HUDOC system records the article as a general number (Article 5, Article 6, etc.). Where a paragraph specifies a particular subcategory of violation, it also breaks that general number down by paragraph (e.g. Article 5-1). In our tests we found coding inconsistencies in how these sub-designations were used. For example, a violation of Article 5, paragraph 1, was at times coded simply as an Article 5 violation; in other cases it was coded as both a 5 and a 5-1 violation (the standard and proper way to code it, which maximizes search accuracy); and in still other cases it was coded only as 5-1. This means that a search for Article 5 violations would not return all Article 5 cases unless they had been comprehensively coded. We found these coding inconsistencies in as many as 14.6% of Article 6 cases and 13.6% of Article 5 cases. Roughly a third of these hand-coding errors were correctable in the script (for example, by ensuring that every 5-1 violation was also listed as a 5 violation).
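The sampling rule used for the accuracy test (1 test judgment per 100 cases, with at least 1 per year) can be sketched as below. The function names and the `cases_by_year` input are hypothetical, and the source does not say how fractional quotas were rounded, so rounding down is an assumption here.

```python
import random


def sample_size_for_year(n_judgments, per=100):
    """Number of test cases for one year: 1 per `per` judgments, minimum 1.

    Rounding down is an assumption; the source does not state the rounding rule.
    """
    if n_judgments <= 0:
        return 0
    return max(1, n_judgments // per)


def draw_test_cases(cases_by_year, seed=0):
    """Draw a random test sample per year from year -> list of case ids."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for year, case_ids in sorted(cases_by_year.items()):
        k = sample_size_for_year(len(case_ids))
        sample[year] = rng.sample(case_ids, k)
    return sample
```

The per-year minimum is what preserves historical breadth: early years with only a handful of judgments would otherwise receive no test cases at all.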
However, for the cases in which the paragraph was omitted in HUDOC, the only corrective path would be to hand-code the court documents ourselves. This means that our dataset returns more accurate results than HUDOC for queries relating to the articles with the most paragraph sub-designations (for example, Articles 5 and 6 and some protocols), but it is not 100% accurate to the case documents (judgments) and cannot achieve full accuracy to the case documents in those same areas.
We are confident that any errors in the dataset are due to the HUDOC source material. Our dataset is accurate to within 1% of HUDOC and corrects for HUDOC errors (likely due to human coding error) where possible, particularly with regard to paragraph and subparagraph designations as outlined above.
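The script-correctable case (making sure that every 5-1 violation is also listed as a 5 violation) can be illustrated as follows. This is a sketch under assumptions: the code format ("5", "5-1", "6-3-c", with parts joined by hyphens) and the function names are illustrative, and protocol codes may need separate handling. It cannot recover a paragraph that HUDOC omitted, which is the uncorrectable case described above.

```python
def with_parent_articles(codes):
    """Add the general article for every paragraph-level code.

    '5-1' contributes '5'; '6-3-c' contributes '6-3' and '6'.
    """
    expanded = set(codes)
    for code in codes:
        parts = code.split("-")
        # Add every proper prefix of the code.
        for i in range(1, len(parts)):
            expanded.add("-".join(parts[:i]))
    return expanded


def cases_violating(article, cases):
    """Return sorted case ids whose expanded codes include `article`.

    `cases` maps a case id to the set of violation codes as originally coded.
    """
    return sorted(
        case_id
        for case_id, codes in cases.items()
        if article in with_parent_articles(codes)
    )


# Toy example: three ways the same Article 5(1) violation may have been coded.
coded = {
    "A": {"5"},         # general article only
    "B": {"5-1"},       # paragraph only
    "C": {"5", "5-1"},  # both (the comprehensive coding)
    "D": {"6-1"},
}
print(cases_violating("5", coded))
```

Without the expansion, a search for "5" would miss case B; a search for "5-1" still misses case A, because the paragraph was never recorded.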
This material is distributed exactly as received from the data depositor. As of April 2026, depositors are required to submit study materials in accessible formats. ICPSR has not reviewed, checked, or processed this material. For additional information about the study, please contact the investigator(s) directly. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR's Accessibility Center.