
Questionable AI Training Tactics a Growing Concern

by ccadm


AI training tactics continue to come under scrutiny due to their lack of oversight. It’s common for contemporary writers to draw inspiration and even borrow aspects from earlier stories. While this practice is part of the evolution of writing, there are times when an author’s works and style are duplicated without consent.

When this situation occurs, modern copyright infringement laws allow the originating content creator to recoup losses. However, the same can’t be said about the growing number of AI systems found to have used illegally acquired works to develop their models. Now, the industry faces a crossroads in terms of training tactics and restitution for those who have experienced losses. Here’s what you need to know.

Questionable AI Training Tactics

A flurry of lawsuits now claims that OpenAI and META (META -0.05%) purposely sought out workarounds when acquiring library data for their model training. The plaintiffs claim that the companies knew, and didn’t care, that they were potentially depriving authors of millions in compensation, without so much as a mention.

Claims such as this aren’t a huge surprise to the many who believe the AI race has led to a basic disregard for copyright laws. As such, authors continue to push back against AI developers, requesting more transparency about how these systems acquire and process data.

While no clear paths have been shared with the public yet, the evidence has started to pile up against the AI firms. This evidence could result in sweeping changes to the training tactics used by AI developers in the future.

Training Tactics Used by Companies to Create Models

Training an AI system is a complex process that can involve gathering and processing huge swaths of data from various sources. This data is what the AI system references when attempting to answer questions or work through new scenarios. Consequently, most AI systems perform better when they have more data to reference.

Creating AI

The first step in creating an AI model is data collection. Because gathering data from scratch is time-consuming, engineers have historically sought out existing databases rather than building their own. For example, healthcare providers may develop an AI that leverages national health statistics to provide more relevant medical answers.

From there, the developers decide on a training approach. The main learning paradigms are supervised, unsupervised, semi-supervised, and reinforcement learning, while common algorithm families include linear regression, deep neural networks, random forests, and naïve Bayes. Each provides unique pros and cons, which make it better suited to particular tasks.

Lastly, the iterative training process begins. In this stage, the model is queried and graded on its accuracy and performance. This step allows engineers to fine-tune and validate the model, boosting its capabilities. This stage also helps engineers ensure the model is genuinely learning from the training data rather than just memorizing it.
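The iterative train-and-validate loop described above can be sketched in a few lines. This is a toy illustration, not any company's actual pipeline: a tiny linear model is fit with gradient descent on a training split, and a held-out validation split checks that it generalizes rather than memorizes.

```python
# Toy example: learn y = 2x + 1 by gradient descent.
# A held-out validation set checks the model generalizes
# instead of just memorizing the training points.
data = [(x, 2 * x + 1) for x in range(20)]
train, valid = data[:15], data[15:]   # simple train/validation split

w, b, lr = 0.0, 0.0, 0.005            # weights and learning rate

def mse(points, w, b):
    """Mean squared error of the linear model over a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)

for epoch in range(5000):
    # Gradient of the mean squared error over the training set
    gw = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
    gb = sum(2 * (w * x + b - y) for x, y in train) / len(train)
    w -= lr * gw
    b -= lr * gb

print(round(w, 2), round(b, 2))   # converges toward 2.0 and 1.0
print(mse(valid, w, b))           # low validation error = not memorizing
```

A real LLM run follows the same loop shape, just with billions of parameters and far larger datasets, which is where the costs discussed below come from.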

Source – Uptech.team

Current AI Training Tactics are Expensive

The AI model training process is time-consuming and expensive, and its costs can be split into two main categories: training and running. Training refers to the one-time cost of creating a particular model. For example, OpenAI spent around $100M on its GPT-4o model, according to the company’s CEO, Sam Altman.

Notably, these costs dwarf earlier model expenses. For example, GPT-3 cost approximately $4M to train. The rising cost of training AI is a direct result of greater computational requirements. The newest models run on the latest NVIDIA chips, adding to their costs.

Additionally, AI has driven cloud computing prices higher. The majority of AI applications don’t natively run on users’ PCs. Instead, they rely on state-of-the-art data centers and cloud computing algorithms to support the massive computational requirements. All of these factors have made programming AI expensive.

AI Marketplaces

A recent jump in the number of AI training marketplaces shows that more players in the space are seeking to save on costs. AI marketplaces let developers, content creators, and those seeking AI integration connect. Developers can find pre-built models that they can enhance or fine-tune for their needs, saving significant time and money in the process.

Runtime Cost

The runtime, or inference, costs of AI systems are another expense that developers must consider. Inference cost refers to how much money each interaction with the AI costs. Running many of today’s AI systems is expensive because a dense model must activate all of its parameters to provide an accurate and helpful response. That means the AI frequently needs large amounts of computing power from high-performance machines, which adds significant cost to the system.
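Inference costs are usually billed per token processed, so the per-interaction math is straightforward. The sketch below uses made-up placeholder prices, not any real vendor's rates, to show how per-call costs compound at scale.

```python
# Back-of-the-envelope inference cost estimate.
# Prices are illustrative placeholders, NOT real vendor rates.
PRICE_PER_1K_INPUT_TOKENS = 0.01   # USD, assumed
PRICE_PER_1K_OUTPUT_TOKENS = 0.03  # USD, assumed

def inference_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of a single model interaction."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS)

# One chat turn: a 500-token prompt and a 700-token answer
per_call = inference_cost(500, 700)
monthly = per_call * 1_000_000     # scaled to a million calls per month
print(f"${per_call:.3f} per call, ${monthly:,.0f} per month")
```

Even a few cents per call, multiplied across millions of daily interactions, is why inference spending can rival or exceed training spending over a model's lifetime.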

Are Today’s AI Training Tactics Ethical?

When you look at the training tactics and strategies employed by today’s massive AI firms, it’s easy to see that the industry faces ethically and morally challenging questions. Yes, to build the best AI systems, developers need to feed the model valuable and accurate data. However, some developers argue that the cost of securing copyright approval for all of the data in a training set would be astronomical, effectively stifling innovation.

International copyright law protects authors from unauthorized use of their works, style, and likeness. AI systems, however, seem to have found a legal loophole: they can produce near-exact replicas of people, places, information, and stories, with little legal pushback reported so far.

However, there’s growing sentiment among content creators that these systems were trained on illegally obtained works and then used to duplicate their format, tone, and style. Recent revelations have produced evidence that copyrighted books were used in OpenAI’s training data.

OpenAI lawsuit

In the OpenAI lawsuit, the plaintiffs allege that developers knowingly used shadow libraries to avoid paying for large collections of books. Shadow libraries are online platforms that provide access to copyrighted works for free. The ones listed in the OpenAI lawsuit include LibGen, Bok, Sci-Hub, and Bibliotik.

The lawsuit sets out to prove that OpenAI and META knew they were circumventing copyright laws. It demonstrates how the companies used shadow libraries and other free sources to lower their training costs significantly while robbing authors of their just payments.

In response to the allegations, META first acted as if it was unaware of any such actions. However, emails later surfaced that are believed to reveal the company’s full understanding of its actions, and to show that it torrented more than 81.7 terabytes of data from shadow libraries, equaling millions of works.

META Unredacted Emails

Ironically, it was internal emails that revealed that the company was well aware of the questionable nature of its decision to use shadow libraries. In the unredacted emails, a worried engineer named Nikolay Bashlykov questions the morality of the project, before going on to joke about the legality of the plan.

In later emails, the employee stated that he was worried about using META IP addresses to torrent pirated content. Recognizing that this could be a problem, META instructed engineers to download the data from outside servers not connected to Facebook or META.

Orders from the Top

When originally questioned about META’s involvement in the torrenting, Mark Zuckerberg stated he had no knowledge of the process. The unredacted emails suggested otherwise: they are believed to show that the decision to use non-Facebook servers came only after Zuckerberg’s direct approval.

Are AI Developers Using Stolen Content?

Given the evidence provided and the sudden boost in AI capabilities, it seems clear that many AI systems have turned to shadow libraries and other means to build more effective training datasets. These datasets contain copyrighted materials whose authors or publishers never consented to their use in training AI models.

Is it Illegal?

While it’s getting harder to deny the use of pirated material in today’s most advanced AI models, the legality of the practice remains in question. No AI company has yet been held liable for copyright infringement over its training data. Additionally, the AI race is in full swing, and many politicians may see limiting their local AI systems’ access to data as a hindrance to innovation. As such, they may not move to make fighting AI copyright infringement as straightforward as prosecuting traditional theft.

Lawsuits Pouring In

Regulators may not be ready to put the heat on AI firms, but the content creators have had enough. Lawsuits continue to pour in from disillusioned authors who state that their content has been illegally acquired, distributed, and duplicated without any compensation.

Recently, the Joseph Saveri Law Firm filed a US federal class-action lawsuit directly about this matter. The lawsuit, filed on behalf of Sarah Silverman and other authors against OpenAI and META, seeks damages for losses brought on by the models’ ability to duplicate the authors’ format and style.

The class-action lawsuit alleges multiple violations of the Digital Millennium Copyright Act, negligence, and unfair competition laws. The goal of the lawsuit is to obtain a permanent injunction against these training tactics until a fair compensation and protection strategy can be put in place for authors.

Is DeepSeek Trained by ChatGPT?

Ironically, OpenAI has alleged that it’s the victim of intellectual theft by another AI system after Chinese AI startup DeepSeek sent ripples through the market. DeepSeek caused a tidal wave of interest after the company revealed its impressive performance, low costs, and advanced capabilities to the public last month.

OpenAI developers accused DeepSeek of using ChatGPT data to train its model, which allowed it to create a system that outperforms the competition and costs far less. DeepSeek achieved performance on par with ChatGPT for a reported cost of $6M, versus the more than $100M spent on ChatGPT.

Additionally, DeepSeek manages to utilize far less computing power thanks to its unique setup. Its inference costs are far lower than ChatGPT’s because it uses several specialized models rather than a single massive one.

As such, DeepSeek only needs to activate the model relevant to the question, enabling it to run on much less expensive, less powerful NVIDIA chips. Specifically, DeepSeek reportedly costs about 1/50th as much to run as the latest Claude 3.5 Sonnet model, making it a more cost-effective solution for businesses in the long term.
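The "activate only the relevant model" idea is a mixture-of-experts design. The toy sketch below illustrates the routing concept only; the experts, keyword scoring, and names here are invented placeholders, not DeepSeek's actual architecture, where a learned router picks among neural sub-networks.

```python
# Toy mixture-of-experts routing: a router scores each expert
# and only the best-matching one runs, so the rest stay idle.
from typing import Callable, Dict

experts: Dict[str, Callable[[str], str]] = {
    "math":    lambda q: "math expert answers: " + q,
    "code":    lambda q: "code expert answers: " + q,
    "general": lambda q: "general expert answers: " + q,
}

# Stand-in for a learned router: crude keyword matching.
KEYWORDS = {"math": {"sum", "integral", "prime"},
            "code": {"python", "bug", "function"}}

def route(question: str) -> str:
    """Activate only the single expert that best matches the question."""
    words = set(question.lower().split())
    scores = {name: len(words & kws) for name, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:          # no specialist matches -> fall back
        best = "general"
    return experts[best](question)

print(route("is 7 a prime number"))   # only the math expert runs
```

Because only one expert's parameters are exercised per query, compute per response scales with the expert's size rather than the whole model's, which is the source of the inference savings described above.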

It Borrows

Interestingly, DeepSeek doesn’t deny using ChatGPT to develop its “thinking” scripts. It even describes the process in the original DeepSeek whitepaper. The engineers felt this approach would provide DeepSeek with more accurate information, which sped up its distillation process.

Additionally, it ensured that the data used to train competitors’ AI models wasn’t used to train DeepSeek. The result is a more efficient system that outperforms its predecessors and costs only a fraction to operate. Of course, many argue that ChatGPT’s costs should be counted in DeepSeek’s budget if DeepSeek leveraged that system to create its own.
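Distillation, mentioned above, generally means training a smaller "student" model to match a larger "teacher" model's output distribution instead of raw labeled data. The sketch below shows the common textbook form of the idea; the toy logit values are placeholders, and this is not necessarily DeepSeek's exact method.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student distributions.
    A higher temperature softens both, exposing the teacher's
    'dark knowledge' about how classes relate."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher  = [4.0, 1.0, 0.5]   # teacher is confident in class 0
aligned  = [3.8, 1.1, 0.4]   # student that mimics the teacher
diverged = [0.5, 4.0, 1.0]   # student that disagrees

# Training would minimize this loss; a well-aligned student scores lower.
print(distill_loss(aligned, teacher) < distill_loss(diverged, teacher))
```

In practice the student's weights are updated by gradient descent on this loss over many teacher outputs, which is why querying a strong existing model can cheaply transfer much of its capability.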

DeepSeek Identity Crisis

In a recent article, an AI researcher went to the source to see how much DeepSeek borrowed from ChatGPT. He started by asking the LLM whether DeepSeek was smarter than Gemini, Google’s competitor. Ironically, the LLM responded that it thought it “was ChatGPT.” Many saw this response as strong evidence of just how much data DeepSeek had gathered from ChatGPT.

Should Content Creators be Compensated for AI Use?

There’s growing concern among content creators in the market. As AI systems evolve, they’re sure to ingest even more copyrighted material. In the past, companies have been seen stripping copyright management information from works to lower the risk of getting caught. However, the tide is turning.

Back in July 2023, a group of more than 8,000 writers signed an open letter to META CEO Mark Zuckerberg, OpenAI CEO Sam Altman, Alphabet CEO Sundar Pichai, Stability AI CEO Emad Mostaque, IBM CEO Arvind Krishna, and Microsoft CEO Satya Nadella. The letter states that AI “mimics and regurgitates our language, stories, style, and ideas,” and it demands compensation and recognition.

The Writers Guild of America and the Screen Actors Guild have also been vocal about the use of their works within the AI sector. They seek to guarantee certain rights and compensation for writers whenever their works get used to create AI models.

Training Tactics Options Emerge

Recognizing the limitations of the current setup and its lack of a legitimate way forward, BookCorpus set out to offer a better solution. Founded in 2015 with the specific goal of supporting AI researchers in training LLMs, it includes thousands of works and models designed to enhance performance without crossing ethical lines.

Already, multiple AI-focused service providers are entering the market. These firms combine access to valuable data, models, and more. They are tailored to meet AI computational requirements and often come alongside some form of cloud computing option as a way to further reduce development costs.

Companies Leading the AI LLMs Revolution

The rise of LLMs has made it easier than ever for anyone to interact with these systems. From a simple chat prompt, you can conduct in-depth research, create images and stories, and much more. Consequently, LLMs are seen as one of the biggest breakthroughs in computer interaction tech in a lifetime. Here’s one company that continues to drive innovation in the LLM market.

Alphabet Inc. (GOOG +0.71%) is the parent company of Google and its many subsidiaries. It’s one of the most recognizable and successful firms in the AI sector. Interestingly, the company chose its research lab, Google DeepMind, to create Google’s Gemini LLM. Gemini is an advanced LLM that translates, understands content, answers questions, and much more.

Notably, Google DeepMind has been hard at work creating LLMs and new features for the firm. For example, the new SELF-DISCOVER framework composes task-specific reasoning structures within the models, reducing the overall time needed to answer questions accurately.

Alphabet Inc. (GOOG +0.71%)

Given Google’s dominance in the market, direct access to massive data, and continued expansion into purpose-built models, GOOG is a smart stock to hold. The company is one of the top-performing AI providers globally and has the network and finances to integrate its technology and expand it to the public effectively.

How Will Training Tactics Change in the Future?

You can expect AI training tactics to rely more on data generated by other refined AI systems as the industry expands. DeepSeek demonstrated that this approach significantly lowers costs. Additionally, it’s going to be more difficult to claim copyright infringement if a company simply uses data created by another AI rather than copyrighted works directly.

All of these factors, plus governments’ growing drive to lead the AI race, have put content creators in a precarious position. Hopefully, in the coming months, AI developers will create more effective training tactics that respect and compensate those whose data they leverage for success.
