Our recommendations to the European Commission on rules for researcher access to online platforms
At Cybersecurity for Democracy (C4D), one of our top goals is to increase transparency in the tech industry. Our research defining transparency mechanisms and experience building transparency tools have informed legislation such as the Digital Services Act (DSA) in Europe. In October, the European Commission launched a public consultation on the rules for researchers to access online platform data under the DSA, the goal of which was to solicit input ahead of adopting the rules in the first quarter of 2025.
We submitted comments with three key suggested areas for the Commission to focus on:
- Clearly defined modes of access for data (including research APIs, publicly accessible dashboards, and researcher accounts)
- Clearly defined data categories (including user-created content, algorithms and models, source code, and algorithm design documents)
- Mechanisms for validation & auditing of platform-provided data
Below is our full submission, which can also be found on the European Commission site:
European Commission Consultation, Digital Services Act: Delegated Regulation on Data Access
Cybersecurity for Democracy Response
About Cybersecurity for Democracy
Cybersecurity for Democracy (C4D) is a multi-university, research-based, nonpartisan and independent effort to expose online threats to our safety, democracy, and social fabric, and recommend how to counter those threats. We conduct research and analysis to better understand the systemic effects of algorithms and AI tools on large online networks and work to help the public and regulators understand the implications of our findings and develop solutions. We also help produce machine-learning tools for journalists and policymakers to study these products themselves. Our Principal Investigators (PIs) are Damon McCoy (Professor of Computer Science and Engineering at New York University) and Laura Edelson (Assistant Professor of Computer Science at Northeastern University). More about C4D at www.cybersecurityfordemocracy.org.
Introduction
In response to the European Commission’s draft Delegated Regulation on Article 40 of the Digital Services Act, we submit these views on behalf of the NYU / Northeastern Cybersecurity for Democracy Center.
Researcher access to platform data was included in the Digital Services Act for two purposes: first, so that research into the detection, identification, and understanding of systemic risk in the European Union can take place, and second, to allow assessment of the adequacy, efficiency, and impact of systemic risk mitigation measures. As computer science researchers, we found this goal exciting and motivating, because studying systemic risk from algorithmic systems and finding methods to reduce that risk are the very core of our research agenda. Examples of our work include research into child safety online [1], threats against election workers [2], empirical study of news engagement [3], audits of political ad archives [4], exploration of how Israel/Gaza conflict-related content was subject to disparate amplification by feed algorithms [5], and numerous other studies [6][7]. In the course of this work, we have also gained practical experience with what data are required for our research, and with the practical difficulties and intricacies of working with meaningfully public platform data. Our comments come from this practical, hands-on perspective, with three particular areas of focus:
- Clearly defined modes of access for data
- Clearly defined data categories
- Mechanisms for validation & auditing of platform provided data
Given the DSA’s overall goals of protecting consumers and their fundamental rights online by setting clear and proportionate rules, we expect that an expansive range of research will be conducted. To support this, we believe it is essential for the Commission to clarify a wide range of modes of access and data categories that will allow for research into numerous systemic risks to the European Union.
Given the disturbing trend toward more restrictive data access overall from Very Large Online Platforms (VLOPs) and the history of inaccurate and misleading data provided to researchers [8] and the Commission, we have two further suggestions for consideration. First, we encourage the Commission to consider validation and auditing mechanisms that will allow researchers and policymakers to establish whether data provided are correct and complete. Such tools will serve to bolster public trust in systemic risk research and in VLOPs. Second, we agree with commenters who suggest that an independent advisory mechanism would be essential to effective DSA implementation, to ensure shared learning across national regulators.
Lastly, we disagree with the suggestion of some industry commenters that the EU refrain from specifying data access types and instead decide data access on a case-by-case basis. We are concerned that this would become an additional surface for delay and would unnecessarily expand costs for all parties. We strongly encourage the Commission to streamline the procedures for vetted research requests with standardized data access types (in some, though not all, cases). We note that VLOPs participate in many standards bodies that successfully reconcile the tension between the need to standardize data access and the need to allow for ongoing technology updates. For data access for research into systemic risk, this can be accomplished by setting out an initial list of data access types and setting a schedule to regularly update that list.
Modes of Access
We encourage the Commission to consider a range of modes of access to data to enable a variety of research into systemic risk. This is necessary because there is no single mode of access that is well suited to all research objectives or to the privacy risk profiles of all data categories. We particularly encourage the Commission to consider three modes of data access that we believe are especially well suited to studying public, widely disseminated content:
- Researcher APIs: Application Programming Interfaces (APIs) are one of the main modes we use to get a glimpse of content being recommended by social media platforms (a brief illustrative sketch of this mode of access follows this list). Research APIs have a track record of enabling the kind of research on systemic risk that the Commission seeks to support, and social media companies have historically shown an ability to operate these kinds of APIs successfully, even though such APIs are largely no longer available. For example, our 2021 research showing that news from misinformation sources on Facebook, from both the far right and the far left, received significantly more attention than factual content was possible only because of data access via the CrowdTangle API [9]. This year, we published an analysis of hate speech on Twitter aimed at election workers that identified the networks that accelerate threats to these public employees. This work was likewise possible only because of access to data provided via the Twitter API [10]. We are far from the only researchers to use APIs to study systemic risk, and the list of important papers is too long to recount here. This track record demonstrates the utility of APIs for enabling the systemic risk research that the DSA envisions. Unfortunately, the last two years have brought significant steps backward in the availability of data via these tools, with the Reddit API and the Twitter API re-priced out of the reach of academics in 2023 [11][12], and CrowdTangle shut down entirely in 2024 [13]. Despite this, the long existence of these APIs clearly demonstrates that they are both technically feasible and not cost-prohibitive for platforms.
- Publicly Accessible Dashboards: In addition to its API, CrowdTangle also offered publicly available dashboards so that non-technical users could study the platform and see important civic and commercial content. These dashboards allowed a wide range of researchers, particularly social scientists, to study on-platform user behavior.
- Researcher accounts: A ‘researcher account’ is a user account typically created by a researcher for the purpose of studying a social media platform. Such accounts are sometimes directly controlled by researchers, and in other circumstances are used alongside automated tools to perform repeated actions in a controlled way. Researcher accounts allow researchers to study how platforms treat users, platform design, and algorithm design in ways that have no parallel in an API. They also serve a validation function when used in parallel with platform-provided data.
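To make the researcher API mode of access concrete, the following is a minimal, purely illustrative sketch of how a researcher might retrieve public posts over a date range. It is not tied to any real platform's API; the endpoint, parameters, credential, and response fields are all hypothetical assumptions for illustration only.

```python
# Purely illustrative sketch: the endpoint, parameters, and response fields
# below are hypothetical and do not correspond to any real platform's API.
import requests

API_BASE = "https://api.example-platform.eu/research/v1"  # hypothetical endpoint
API_KEY = "RESEARCHER_API_KEY_ISSUED_AFTER_VETTING"       # hypothetical credential


def fetch_public_posts(query, start_date, end_date, page_token=None):
    """Retrieve one page of public posts matching a keyword query."""
    params = {
        "q": query,
        "start": start_date,        # e.g. "2024-01-01"
        "end": end_date,            # e.g. "2024-03-31"
        "page_token": page_token,   # omitted by requests when None
    }
    resp = requests.get(
        f"{API_BASE}/posts/search",
        params=params,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # assumed to contain "posts" and "next_page_token"


def fetch_all(query, start_date, end_date):
    """Paginate through all matching posts for offline analysis."""
    token = None
    while True:
        page = fetch_public_posts(query, start_date, end_date, token)
        yield from page.get("posts", [])
        token = page.get("next_page_token")
        if not token:
            break
```

The point of the sketch is not the specific fields but the workflow it enables: keyword- and time-bounded retrieval of public content, at scale, without manual collection.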
Relevant Data Categories
We encourage the Commission to develop standardized formats for several commonly used data categories. We believe this will lessen on-ramp time for researchers and lower costs for everyone - platforms, Digital Services Coordinators, and researchers. As we have noted above, our focus in these comments is to convey learnings from our experience conducting research over highly public platform data. The data categories we detail here are not the only ones relevant to the study of systemic risk in the European Union; however, we believe that standardizing access to these categories of data would enable a great deal of important research from several different domains.
- User-created content - Social media content itself, including outbound links, images, videos, text, and metadata summarizing view counts, reaction counts, and temporal data, has commonly been included in platform-provided content data. In addition to these commonly included fields, we would also encourage a standard format for user-created content to include information about what share of views and engagements with a post were derived from paid promotion (i.e. advertising) and what share came from algorithmic feed recommendations (an illustrative sketch of such a record follows this list).
- Social media algorithms and models - Social media platforms use a wide range of algorithms and models, including recommender systems for content feeds, for classifying content, and for categorizing users. Algorithms and models are not typically provided to independent researchers; however, we have experience auditing these systems in the course of our work with law enforcement agencies in the United States. Additionally, we have audited open source feed recommendation algorithms.
- Source code, particularly for model training - Machine learning models and AI systems are trained to optimize for specific objective functions, or goals. Understanding how a model was trained, and the specific optimizations and tradeoffs that were made, is key to understanding why a model behaves the way it does and what other designs and optimizations may be possible.
- Algorithm design documents & documentation - ‘Human language’ design documents explaining why algorithms and AI systems are designed the way they are, and documentation about how they function, can be extremely useful aids in bringing outside researchers up to speed quickly on how a system works in practice.
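As an illustration of the kind of standardized record we have in mind for the user-created-content category, the sketch below enumerates the fields described above in one structure. The field names and layout are our own hypothetical example, not an existing platform schema or a proposed mandatory format.

```python
# Hypothetical sketch of a standardized record for user-created content;
# field names are illustrative, not an existing or proposed platform schema.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class PublicPostRecord:
    post_id: str
    author_id: str                 # pseudonymous account identifier
    created_at: datetime
    text: Optional[str] = None
    outbound_links: list[str] = field(default_factory=list)
    image_urls: list[str] = field(default_factory=list)
    video_urls: list[str] = field(default_factory=list)
    view_count: int = 0
    reaction_counts: dict[str, int] = field(default_factory=dict)  # e.g. {"like": 10}
    # Share of views attributable to paid promotion vs. algorithmic feed
    # recommendation, as proposed in these comments (values from 0.0 to 1.0).
    paid_promotion_view_share: Optional[float] = None
    algorithmic_recommendation_view_share: Optional[float] = None
```

A standard along these lines would let researchers reuse collection and analysis tooling across platforms instead of rebuilding it for each provider's bespoke export format.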
Validation & Auditing
Finally, it is our belief that at least some mechanisms to verify platform-provided data would be highly advantageous. This will not always be possible because of privacy concerns, but given the track record of erroneous or incomplete data provided to researchers by platforms, such mechanisms may prove a vital safeguard that increases confidence in research findings that rely on platform-provided data. We propose three mechanisms that we believe would be particularly useful.
- Interactive Data Discussions - Once a vetted research project has been approved by a DSC and a plan for a platform to provide data has been formulated, that data must not be handed over in a vacuum. Platforms must be required to participate in interactive sessions where researchers receiving data ask questions about what is or is not included. This is critical for the researchers to understand what conclusions can be drawn from the provided data.
- Research Accounts - As we discussed in ‘Modes of Access’, researcher accounts allow researchers to study platform behavior. For this reason, they also serve a validation function when used in parallel with platform-provided data.
- Public Summary Data - In the past, when major errors in platform-provided data have been caught, it has been because researchers identified inconsistencies between the granular, detailed data provided to them and the broad statistics platforms were publishing about their public data. Summary data has served, in effect, as a ‘checksum’ that tells researchers whether the numbers they are looking at are plausible. Requiring platforms to share (ideally publicly) high-level summary data would serve as one useful part of a validation mechanism (a brief illustrative sketch of such a check follows this list).
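To illustrate the ‘checksum’ role that public summary data can play, here is a minimal hypothetical sketch of how a researcher might compare aggregates computed from granular, platform-provided records against a platform-published summary total. The field name, tolerance, and structure are assumptions made for illustration.

```python
# Minimal hypothetical sketch of a plausibility check: compare an aggregate of
# granular platform-provided records against a platform-published summary total.

def check_against_summary(records, published_total_views, tolerance=0.05):
    """Return False and print a warning if aggregated views differ from the
    published total by more than `tolerance` (as a fraction).
    `records` is assumed to be a list of dicts with a "view_count" field."""
    aggregated_views = sum(r.get("view_count", 0) for r in records)
    if published_total_views == 0:
        return aggregated_views == 0
    relative_gap = abs(aggregated_views - published_total_views) / published_total_views
    if relative_gap > tolerance:
        print(f"Possible data problem: granular records sum to {aggregated_views} views "
              f"vs. published total of {published_total_views} ({relative_gap:.1%} gap)")
        return False
    return True
```

A check of this kind cannot prove that provided data are correct, but it can flag gaps large enough to warrant the interactive data discussions and corrections described in this section.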
We also believe there needs to be a provision to require platforms to provide corrected data if auditing demonstrates that previously provided data was incomplete or flawed. The process for doing this could be overseen by DSCs as part of the lifecycle of approved research projects.
Our Experiences Researching VLOPs
We are researchers who regularly interact with large data sets from social media companies using various modes of access, including Application Programming Interface (API) keys. We are leaders in the field of social media data donation, including a project we led in which more than 16,000 volunteers donated their data to the political ad tracking tool “Ad Observer” [14]. This tool, the data from which was freely accessible to journalists and the public, was used to discover political ads in our domestic elections placed by foreign government sources, among other problematic findings. These ads were not registered in Facebook’s ad archive, again clearly demonstrating the need for validation and auditing of the data VLOPs provide. However, instead of welcoming help to improve their product, Facebook threatened us with legal action, raised false privacy concerns about Ad Observer, and attempted to shift the blame for their actions to their settlement agreement with the FTC over their own prior user privacy violations. The FTC issued a statement noting the inaccuracy of Facebook’s claims [15], but the company nevertheless sent us a notice to cease and desist from further Ad Observer research and terminated our Facebook accounts. Our Facebook accounts have still not been restored, now several years later.
We think it is not in society’s interest for platforms to have a veto over who studies systemic harm. We believe this was one motivation for the Digital Services Act: to allow the public (through Digital Services Coordinators) to have some say in what research into systemic risk on VLOPs gets done, instead of leaving platforms as the gatekeeper. However, the fact that our personal accounts are banned from Facebook currently prevents us from even applying for access to other programs that offer research data to vetted academics, because they require validation of researchers’ Facebook accounts. This is an implementation detail, but it is one that excludes researchers like us, who have been targeted and removed from Meta’s platform.
Thank you
We are grateful for the Commission’s sustained focus on these important questions, and we hope to be a continued resource to your efforts. Please direct any questions to:
info@cybersecurityfordemocracy.org.
[1] https://x.com/LauraEdelson2/status/1803775373743648809; https://www.wsj.com/tech/instagram-recommends-sexual-videos-to-accounts-for-13-year-olds-tests-show-b6123c65
[2] https://www.sciencedirect.com/science/article/pii/S1566253524002379
[3] https://dl.acm.org/doi/pdf/10.1145/3487552.3487859
[4] https://www.politico.eu/wp-content/uploads/2021/12/08/Facebook-Political-Ad-Paper.pdf
[6] https://arxiv.org/pdf/2301.02737
[7] https://dl.acm.org/doi/pdf/10.1145/3485447.3512142
[8] https://www.washingtonpost.com/technology/2021/09/10/facebook-error-data-social-scientists/
[9] https://dl.acm.org/doi/abs/10.1145/3487552.3487859
[10] https://doi.org/10.1016/j.inffus.2024.102459
[11] https://www.theverge.com/2023/6/8/23754780/reddit-api-updates-changes-news-announcements
[12] https://www.poynter.org/ifcn/2023/twitter-is-removing-free-api-access-and-no-one-is-excited/
[13] https://www.wired.com/story/meta-kills-crucial-transparency-tool-worst-possible-time/