@b it seems like anthropic, at least, has been including scraped pages marked with the BIG-bench canary string in its training data for over two years: https://www.lesswrong.com/posts/kSmHMoaLKGcGgyWzs/big-bench-canary-contamination-in-gpt-4 (apologies for the lesswrong link, but canary strings seem to be in the weeds enough that pretty much the only people discussing this are AI boosters)