To fight AI, we need 'personhood credentials,' say AI firms

31337@sh.itjust.works · 12 days ago

Production AI is highly tuned by training data selection and human feedback. Every model has its own style that many people helped tune. In the open model world there are thousands of different models targeting various styles. Waifu Diffusion and GPT-4chan, for example.

31337@sh.itjust.works · 13 days ago

I think you have your janitor example backwards. Spending my time revolutionizing energy production sounds much more enjoyable than sweeping floors. Same with designing an effective floor-sweeping robot.

31337@sh.itjust.works · 13 days ago

AI are people, my friend. /s

But, really, I think people should be able to run algorithms on whatever data they want. It’s whether the output is sufficiently different or “transformative” that matters (and other laws like using people’s likeness). Otherwise, I think the laws will get complex and nonsensical once you start adding special cases for “AI.” And I’d bet if new laws are written, they’d be written by lobbyists to further erode the threat of competition (from free software, for instance).

31337@sh.itjust.works · 14 days ago

The search engine LLMs suck. I’m guessing they use very small models to save compute. ChatGPT 4o and Claude 3.5 are much better.

31337@sh.itjust.works · 15 days ago

Donation, patronage, gift economy, mutual aid, or whatever you want to call it is fine by me. People can pirate a lot of proprietary software as well, yet people still pay.

31337@sh.itjust.works · 15 days ago

To fight AI, we need 'personhood credentials,' say AI firms

31337@sh.itjust.works · 15 days ago

Yet, people still pay for it.

31337@sh.itjust.works · 16 days ago

The problem is that HP writes drivers and software for those things for Windows, but not for Linux, so Linux depends on random people to write software for those things for free (which often involves complex reverse-engineering). With Linux you need to make sure you use widely-used hardware that someone has already written support for (this is mostly applicable to laptops and peripherals, which often use custom non-standard hardware). There may be a way to fix your problems, but you’ll have to search forums or issue trackers for the solutions, and they’re probably pretty involved to get working correctly. The router crashing thing is probably just a coincidence though, or the laptop is using a feature that’s broken on your router.

31337@sh.itjust.works · 23 days ago

In the Texas counties I’m most familiar with, if you’re arrested and they don’t have a good case, they just keep resetting court dates for years instead of going ahead with the process. If you can’t afford a bond, you’ll be in jail that whole time (which pressures people to take plea deals). If you can secure a bond, you’re out, but with limited rights and a whole lot of hassles to deal with.

31337@sh.itjust.works · 29 days ago

I thought the tuning procedures, such as RLHF, kind of mess up the probabilities, so you can’t really tell how confident the model is in the output (and I’m not sure how accurate these probabilities were in the first place)?

Also, it seems, at a certain point, the more context the models are given, the less accurate the output. A few times, I asked ChatGPT something, and it used its browsing functionality to look it up, and it was still wrong even though the sources were correct. But, when I disabled “browsing” so it would just use its internal model, it was correct.

It doesn’t seem there are too many expert services tied to ChatGPT (I’m just using this as an example, because that’s the one I use). There’s obviously some kind of guardrail system for “safety,” there’s a search/browsing system (it shows you when it uses this), and there’s a Python interpreter. Of course, OpenAI is now very closed, so they may be hiding that it’s using expert services (beyond the “experts” in the MoE model they’re speculated to be using).

31337@sh.itjust.works · 29 days ago

I find Kagi results a little bit better than Google’s (for most things). I like that certain categories of results are put in their own sections (listicles, forums) so they’re easy to ignore if you want. I like that I can prioritize, deprioritize, block, or pin results from certain domains. I like that I can quickly switch “lenses” to one of the predefined or custom lenses.

31337@sh.itjust.works · 29 days ago

Their line goes up when they show they’re investing in AI, and it goes down when it looks like they’re falling behind or not investing enough in it.

TBH, a lot of times I find myself interacting with ChatGPT instead of searching. It’s overhyped, but it’s useful.

31337@sh.itjust.works · 1 month ago

The EFF link I posted above provides evidence. Again, here’s a quote from part of it:

The process of machine learning for generative AI art is like how humans learn—studying other works—it is just done at a massive scale. Huge swaths of data (images, videos, and other copyrighted works) are analyzed and broken into their factual elements where billions of images, for example, could be distilled into billions of bytes, sometimes as small as less than one byte of information per image. In many instances, the process cannot be reversed because too little information is kept to faithfully recreate a copy of the original work.

As I mentioned before, Copilot, at least, helps people avoid copyright infringement by notifying you if your code is similar to public code. The solution I’m proposing is no new laws, and just enforcing the ones we have. Most of the laws being proposed look like attempts at regulatory capture to me.

31337@sh.itjust.works · 1 month ago

That we already have laws that protect against copyright infringement (which seem like they would still apply whether or not it was spit out by an LLM), and no more should be made. That training on public data is fine.

31337@sh.itjust.works · 1 month ago

I’m saying using code for training is a different issue than copyright infringement. I edited my post above to better lay out my position.

31337@sh.itjust.works · edit-2 1 month ago

I stated that they can do this, and asked if they could be sued if they used near-verbatim code generated from an LLM, just like they could be sued if they copy-pasted AGPL code.

Edit: Tools like Copilot tell you if your code is similar to publicly available code so you can avoid these issues.

Edit: Just looked up EFF’s position and I tend to agree with it:

Artificial Intelligence and Copyright Law

Artists are understandably concerned about the possibility that automatic image generators will undercut the market for their work. However, much of what is criticized is already considered fair use under copyright law, even if done at scale. Efforts to change copyright law to transform certain fair uses into infringement carry serious implications, are likely to interfere with the innovative potential of AI tools, and ultimately do not benefit artists. In fact, the use of these tools could expand the capacity of artists to create expressive works. Policymakers should emphasize the importance of human labor and investment in what receives copyright protection to maintain wages and dignity. Artists should be protected from efforts by large corporations to both substitute their labor with AI tools and create a new, unnecessary copyright regime around AI-generated art.

Machine Learning is a Fair Use

The process of machine learning for generative AI art is like how humans learn—studying other works—it is just done at a massive scale. Huge swaths of data (images, videos, and other copyrighted works) are analyzed and broken into their factual elements where billions of images, for example, could be distilled into billions of bytes, sometimes as small as less than one byte of information per image. In many instances, the process cannot be reversed because too little information is kept to faithfully recreate a copy of the original work.

The analysis work underlying the creation and use of training sets is like the process to create search engines. Where the search engine process is fair use, it is very likely that processes for machine learning are too. While the act of analysis may potentially implicate copyright, when that act is a necessary step to enabling a non-infringing use, it regularly qualifies as fair use. If the intermediate step were not permitted, fair use would be ineffective. As such, when factual elements of copyrighted works are studied and processed to create training sets—which, once again, is how we humans learn and are inspired by themes and styles in art and other works—that is likely to be found a fair use.

https://www.eff.org/document/eff-two-pager-ai

31337@sh.itjust.works · 1 month ago

After all, if an “AI” model, open source or not, is allowed to just “train” on my AGPL code and spit it back (with minor modifications at best) to an engineer in AWS that’s it for my project. Amazon will do the Amazon thing and steal the project. So say goodbye to any software freedom we have.

An engineer at AWS can already just copy your code, make minor modifications, and use it. I would think the same legal recourse would apply if it was outputted from an LLM or just a copy-paste? This seems like a tangential issue to whether the LLM was trained on your code or not (not training on your code obviously reduces the probability of the LLM spitting it back out near-verbatim though). Personally, I don’t see anything wrong with anyone using public code to build statistical models. And I think the pay-to-scrape models that Reddit, Xitter, and others are employing will help big tech build the “moat” they’re looking for. Big tech is asking for AI regulation for similar reasons.

31337@sh.itjust.works · 1 month ago

Information wants to be free.

31337@sh.itjust.works · 2 months ago

I wonder if such a system could be designed to be privacy-preserving.

31337@sh.itjust.works · 2 months ago

Doesn’t sound much more complicated than invitation-only services. Most people wouldn’t even really need to know the details of how it works.

31337@sh.itjust.works · 2 months ago

Same. I think I’ve read that a single GPT-4 instance runs on a 128 GPU cluster, and ChatGPT can still take something like 30s to finish a long response. An H100 GPU has a TDP of 700 W. Hard to believe that uses only 10x more energy than a search that takes milliseconds.
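
For what it's worth, here's the back-of-envelope math on those numbers as a rough Python sketch. It assumes the whole 128 GPU cluster is tied up by one response for the full 30 seconds and that every GPU draws its full TDP, which is almost certainly an overestimate since requests are batched and GPUs rarely sit at TDP. The function name is just made up for illustration, and the GPU count and response time are the half-remembered figures from above, not measurements.

```python
# Rough upper bound on energy per long response, using the figures quoted above.
# Assumptions (not measurements): one response occupies the whole cluster,
# and every GPU draws its full TDP for the entire response time.

GPUS = 128              # GPUs per model instance (recalled figure, unverified)
TDP_WATTS = 700         # H100 TDP in watts
RESPONSE_SECONDS = 30   # time to finish a long response


def response_energy_wh(gpus: int, tdp_watts: float, seconds: float) -> float:
    """Energy in watt-hours if `gpus` GPUs run at `tdp_watts` for `seconds`."""
    joules = gpus * tdp_watts * seconds  # watts * seconds = joules
    return joules / 3600                 # 3600 joules per watt-hour


energy_wh = response_energy_wh(GPUS, TDP_WATTS, RESPONSE_SECONDS)

# If the "only 10x a search" claim held, a search would have to use this much:
implied_search_wh = energy_wh / 10

print(f"{energy_wh:.1f} Wh per response (upper bound)")
print(f"{implied_search_wh:.1f} Wh per search if the 10x figure held")
```

Batching many users onto the same cluster would cut the per-response number a lot, so treat this as a ceiling, not an estimate.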

31337@sh.itjust.works · 8 months ago

Mark Zuckerberg indicates Meta is spending billions of dollars on Nvidia AI chips