Researcher Releases AI Training Dataset Based on Posts to Leftist Echo Chamber Bluesky

A machine learning researcher has released a massive dataset containing one million public posts from leftist social media echo chamber Bluesky, raising questions about data privacy and consent. The data could potentially be used to train AI to be even more woke than notoriously left-leaning AI chatbots like ChatGPT.

404Media reports that Daniel van Strien, a machine learning librarian at AI community platform Hugging Face, released a dataset composed of one million Bluesky posts. The dataset, intended for machine learning research, includes the text of each post along with metadata such as the time of posting and the user’s decentralized identifier (DID).

Van Strien announced the dataset on Bluesky last week, stating, “This dataset contains 1 million public posts collected from Bluesky Social’s firehose API, intended for machine learning research and experimentation with social media data. Each post contains text content, metadata, and information about media attachments and reply relationships.”

While the data was collected from Bluesky’s public firehose API, which aggregates all public data updates on the platform in real time, the inclusion of user DIDs has raised privacy concerns. The dataset is not anonymous, and van Strien also created a search tool for finding users based on their DID, which he published on Hugging Face.
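
To illustrate why the DID matters, here is a minimal Python sketch that swaps each DID for a keyed hash before records are shared. It is not how the dataset was built. The records, the field names (`did`, `text`), the `pseudonymize` helper, and the demo key are all hypothetical and do not reflect the dataset's actual schema.

```python
import hashlib
import hmac

# Hypothetical records; field names and values are illustrative only,
# not the actual schema or contents of the released dataset.
posts = [
    {"did": "did:plc:exampleuser1", "text": "first post"},
    {"did": "did:plc:exampleuser2", "text": "a reply"},
    {"did": "did:plc:exampleuser1", "text": "second post"},
]


def pseudonymize(records, secret_key: bytes):
    """Return copies of the records with each DID replaced by a keyed hash.

    The same DID always maps to the same pseudonym, so per-author analysis
    still works, but the pseudonym cannot be looked up without the key.
    """
    masked = []
    for record in records:
        digest = hmac.new(
            secret_key, record["did"].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        # Copy the record so the original list is left untouched.
        masked.append({**record, "did": "anon-" + digest[:16]})
    return masked


# Demo key for illustration; a real key would need to be kept secret.
masked_posts = pseudonymize(posts, b"demo-key-not-for-real-use")
```

Even with this step, the text of a post can still identify its author, so a keyed hash reduces linkability rather than guaranteeing anonymity.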

A quick review of the dataset reveals that it contains a wide range of content, from political discussions and concert chatter to pornography. Notably, the dataset is a snapshot of Bluesky at a specific point in time, meaning it may include posts that have since been deleted by users.

According to the project page, the dataset could be used for various purposes, such as training language models, analyzing social media posting patterns, and studying conversation structures. However, the page also lists “out of scope” uses, including building automated posting systems, creating fake or impersonated content, and extracting personal information about users.

The dataset has quickly gained popularity on Hugging Face, becoming one of the top trending projects on the platform. This rapid adoption highlights the growing interest in using social media data for machine learning research.

Bluesky, which is built on the open AT Protocol, has previously stated that it does not use user content to train generative AI and has no intention of doing so. The platform uses AI internally for content moderation and its Discover algorithmic feed but does not train any generative AI systems on user content.

In response to the dataset’s release, a Bluesky spokesperson shared the following statement: “Bluesky is an open and public social network, much like websites on the Internet itself. Just as robots.txt files don’t always prevent outside companies from crawling those sites, the same applies here. We’d like to find a way for Bluesky users to communicate to outside orgs/developers whether they consent to this and that outside orgs respect user consent, and we’re actively discussing how to achieve this.”

Van Strien has since removed the dataset, saying in a post on Bluesky: “I’ve removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.”

Breitbart News previously reported that leftists have flooded Bluesky with complaints, requests for censorship, and even child pornography:

After days of explosive growth on the platform, the Bluesky Safety team posted Friday that it received 42,000 moderation reports in the preceding 24 hours, compared to 360,000 in all of 2023. Most troublingly, the company acknowledged that it is receiving reports of “CSAM” or child sexual abuse material, commonly known as child pornography.

On X/Twitter, users are noting that the new platform is quick to censor anyone engaged in wrongthink, including one user allegedly banned on the same day he signed up, echoing the previous Twitter moderation rules before Elon Musk bought the company.

Read more at 404Media.

Lucas Nolan is a reporter for Breitbart News covering issues of free speech and online censorship.
