LLMs' training data

This shit is (one of multiple reasons) why I despise LLMs and hold not insignificant disdain for the people who advocate for using them.

The way LLMs operate is fundamentally unethical, and most of their operators do even more unethical things on top of that: destructive, aggressive crawling, and stealing and plagiarizing on an industrial scale (see Meta torrenting hundreds of books, not for personal use like a normal individual would, but to feed a large-scale plagiarism machine).

This is not even getting into the unfathomable amounts of electrical power and computer hardware required to train them at scale.


Let me be clear: there is no way to use LLMs that is both:

  • Responsible and ethical
  • Worth the time, money, and energy invested

    Yes, you can technically run a model entirely locally, with ethically sourced training data that you have permission to use for that purpose.

    But you won’t, because doing so is impossible in practice, except maybe for small tech demos.

    You won’t have enough training data, and most people you ask for more will give you the same response they gave the likes of Microsoft, Meta, and Google: Fuck no.

    That’s why those corporations stopped asking, if they ever did to begin with.

    And if you do beg, borrow, and (mostly) steal enough data, you’ll still need an inordinate amount of computing power that is inherently both monetarily and ethically expensive.

“Mitigating SourceHut’s partial outage caused by aggressive crawlers” (Lobsters comments).

I have never thought of that before.

Artists’ disdain for how these models are trained on their works was obvious to me. Here, with text, the problem wasn’t so obvious.

Reconsider this, reconsider Cloudflare. Life isn’t that simple, huh?