To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.)
There's a search tool to see what sites are in this data set. Relevant screenshot attached...
Per the Washington Post:
Inside the secret list of websites that make AI like ChatGPT sound smart