From hallucinations to hardware: Lessons from a real-world computer vision project gone sideways


Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage – cracked screens, missing keys, broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.

Along the way, we ran into problems with hallucinations, junk image detection and photos that weren't even laptops. To solve them, we ended up applying an agentic framework in an atypical way – not for task automation, but to improve the model's performance.

In this article, we'll walk through what worked, what didn't, and how a combination of approaches ultimately helped us build something reliable.

Where we started: Monolithic prompting

Our initial approach was fairly standard for a multimodal model: we used a single, large prompt to pass an image to an image-capable LLM and asked it to identify any visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.

We ran into three major issues early on:

  • Hallucinations: The model would sometimes invent damage that wasn't there or mislabel what it was seeing.
  • Junk image detection: It had no reliable way to flag images that weren't laptops at all – photos of desks, walls or people occasionally slipped through and produced nonsensical damage reports.
  • Inconsistent accuracy: The combination of these problems made the model too unreliable for operational use.

This was the point where it became clear we would need to iterate.

First adjustment: Mixing image resolutions

One of the first things we noticed was how much image quality affected the model's output. Users uploaded all kinds of photos, ranging from sharp and high-resolution to blurry. This led us to research highlighting how image resolution affects deep learning models.

We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core problems of hallucinations and junk image handling persisted.
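For illustration, one common way to build such a mixed-resolution training set is to randomly downscale a fraction of the images before restoring their original size. This is only a minimal sketch – the probability, scale range and target size are illustrative choices, not values from our pipeline:

```python
import random
from torchvision import transforms

class RandomDownscale:
    """Degrade an image to a random lower resolution, then restore its size."""
    def __init__(self, p=0.5, min_scale=0.25):
        self.p, self.min_scale = p, min_scale  # illustrative defaults

    def __call__(self, img):
        if random.random() < self.p:
            scale = random.uniform(self.min_scale, 1.0)
            w, h = img.size
            img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
            img = img.resize((w, h))  # back to the original size, now blurrier
        return img

# Half of the training images get degraded, simulating low-quality uploads.
train_transform = transforms.Compose([
    RandomDownscale(p=0.5),
    transforms.Resize((336, 336)),  # assumed model input size
    transforms.ToTensor(),
])
```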

The multimodal detour: A text-only LLM goes multimodal

Encouraged by recent experiments that combine image captioning with text-only LLMs, we decided to give the approach a try: captions are generated for the image and then interpreted by a language model.

Here’s how it works:

  • The LLM begins by generating multiple possible captions for an image.
  • Another model, called a multimodal embedding model, checks how well each caption fits the image. In this case, we used SigLIP to score the similarity between the image and the text.
  • The system keeps the top captions based on these scores.
  • The LLM uses those best captions to write new ones, trying to get closer to what the image actually shows.
  • The process repeats until the captions stop improving or a set iteration limit is reached.
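In code, the loop looks roughly like the following minimal sketch. The SigLIP checkpoint name is illustrative, and generate_captions() is a hypothetical placeholder for whatever multimodal LLM call produces the candidate captions:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CHECKPOINT = "google/siglip-base-patch16-224"  # illustrative checkpoint
siglip = AutoModel.from_pretrained(CHECKPOINT)
processor = AutoProcessor.from_pretrained(CHECKPOINT)

def score_captions(image: Image.Image, captions: list[str]) -> list[float]:
    """Score how well each candidate caption matches the image."""
    inputs = processor(text=captions, images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = siglip(**inputs).logits_per_image  # shape: (1, n_captions)
    return torch.sigmoid(logits)[0].tolist()

def refine_captions(image, generate_captions, rounds=3, keep=3):
    """Generate, score and regenerate captions until they stop improving."""
    captions = generate_captions(image, seed=None)  # hypothetical LLM call
    best_score, top = 0.0, captions[:keep]
    for _ in range(rounds):
        scores = score_captions(image, captions)
        ranked = sorted(zip(scores, captions), reverse=True)
        if ranked[0][0] <= best_score:  # stop once scores plateau
            break
        best_score = ranked[0][0]
        top = [caption for _, caption in ranked[:keep]]
        # Feed the best captions back to the LLM to produce refined ones.
        captions = top + generate_captions(image, seed=top)
    return top
```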

As clever as it sounds in theory, this approach introduced new problems for our use case:

  • Persistent hallucinations: The captions themselves sometimes described imaginary damage, which the LLM then reported with confidence.
  • Incomplete coverage: Even with multiple captions, some issues were missed entirely.
  • Increased complexity, limited benefit: The added steps made the system more complex without reliably outperforming the previous setup.

It was an interesting experiment, but ultimately not a solution.

A creative use of agentic frameworks

This was the turning point. Agentic frameworks are usually used for orchestrating task flows (think agents that schedule calendar invites or coordinate customer service actions), but we wondered whether breaking the image interpretation task down into smaller, specialized agents would help.

We set up the agentic framework like this (a simplified sketch follows the list):

  • Orchestrator agent: Inspected the image and determined which laptop components were visible (screen, keyboard, chassis, ports).
  • Component agents: Dedicated agents inspected each component for specific damage types; one for cracked screens, for example, another for missing keys.
  • Junk detection agent: Flagged, first of all, whether the image was even a laptop.
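Here is a stripped-down sketch of that layout. Everything in it is illustrative: call_vlm() stands in for any vision-capable model call, and the prompts are placeholders rather than our production prompts:

```python
from dataclasses import dataclass

@dataclass
class AgentReport:
    component: str
    findings: list[str]

# One narrowly scoped prompt per component agent (placeholder wording).
COMPONENT_AGENTS = {
    "screen":   "Inspect only the screen. List cracks, dead pixels or scratches.",
    "keyboard": "Inspect only the keyboard. List missing or damaged keys.",
    "chassis":  "Inspect only the chassis. List dents, cracks or broken hinges.",
    "ports":    "Inspect only the ports. List bent or broken connectors.",
}

def assess_laptop(image_bytes: bytes, call_vlm) -> list[AgentReport]:
    # Junk detection agent runs first: bail out on non-laptop images.
    is_laptop = call_vlm(image_bytes, "Is this a photo of a laptop? Answer yes or no.")
    if is_laptop.strip().lower() != "yes":
        return [AgentReport("junk_filter", ["not a laptop image"])]

    # Orchestrator agent decides which components are actually visible.
    visible = call_vlm(image_bytes,
        "Which of these laptop components are visible: screen, keyboard, "
        "chassis, ports? Answer as a comma-separated list.")
    components = [c.strip() for c in visible.split(",") if c.strip() in COMPONENT_AGENTS]

    # Each component agent inspects one part for specific damage types.
    reports = []
    for component in components:
        answer = call_vlm(image_bytes, COMPONENT_AGENTS[component])
        reports.append(AgentReport(component, [l for l in answer.splitlines() if l]))
    return reports
```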

This modular, task-driven approach produced far more accurate and explainable results. Hallucinations dropped sharply, junk images were reliably flagged, and each agent's task was narrow and well-defined enough to control quality effectively.

Blind spots: The trade-offs of an agentic approach

As effective as it was, this approach was not perfect. Two main limitations emerged:

  • Increased latency: Running multiple specialized agents in sequence added to the total inference time.
  • Coverage gaps: The agents could only detect the issues they were explicitly programmed to look for. If an image showed something unexpected that no agent was assigned to identify, it went unnoticed.

We needed a way to balance precision with coverage.

The hybrid solution: Combining agentic and monolithic approaches

To bridge the gaps, we built a hybrid system:

  1. The agentic framework ran first, handling precise detection of known damage types and junk images. We limited the agents to the most essential ones to improve latency.
  2. Then a monolithic image LLM prompt scanned for anything else the agents might have missed.
  3. Finally, we fine-tuned the model on a curated set of images for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy and reliability.

This combination gave us the precision of the agentic structure, the broad coverage of monolithic prompting and the confidence boost of targeted fine-tuning.
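Put together, the hybrid flow can be sketched like this, reusing the assess_laptop() and call_vlm() placeholders from the earlier sketch; the catch-all prompt is again illustrative, not our actual wording:

```python
def hybrid_assessment(image_bytes: bytes, call_vlm) -> dict:
    # Step 1: agentic pass for precise, known damage types and junk filtering.
    reports = assess_laptop(image_bytes, call_vlm)
    if reports and reports[0].component == "junk_filter":
        return {"status": "rejected", "reason": "not a laptop image"}

    # Step 2: one monolithic catch-all prompt for anything the agents missed.
    known = [finding for report in reports for finding in report.findings]
    extra = call_vlm(image_bytes,
        "List any physical damage visible in this laptop photo that is NOT "
        "already covered by the following findings: " + "; ".join(known))

    # Step 3 (fine-tuning on high-priority cases) happens offline, so it
    # does not appear in the request path.
    return {"status": "ok", "agent_findings": known, "catch_all": extra}
```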

What we learned

Looking back on this project, a few lessons stand out:

  • Agentic frameworks are more versatile than they get credit for: While they are usually associated with workflow management, we found they can meaningfully boost model performance when applied in a structured, modular way.
  • Mixing approaches beats relying on just one: The combination of precise, agent-based detection, broad coverage from monolithic LLM prompting and a bit of fine-tuning where it mattered most delivered more reliable results than any single method on its own.
  • Visual models are prone to hallucination: Even the most advanced setups can invent results or describe things that aren't there. It takes thoughtful system design to keep those errors in check.
  • Image quality variation matters: Training and testing on both crisp, high-resolution images and everyday, lower-quality photos helped the model stay robust against the full range of real-world inputs.
  • You need a way to catch junk images: A dedicated check for irrelevant or junk images was one of the simplest changes we made, and it had an outsized impact on overall system reliability.

Final thoughts

What started as a simple idea – using an LLM prompt to detect physical damage in laptop images – turned into a much deeper lesson in combining different AI techniques to solve unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones not originally designed for this kind of work.

Agentic frameworks, typically reserved for workflow automation, proved surprisingly effective when repurposed for structured damage detection and image filtering. With a little creativity, they helped us build a system that was more accurate, easier to understand and simpler to manage.

Shruti Tiwari is an AI product manager at Dell Technologies.

Vadiraj Kulkarni is a data scientist at Dell Technologies.


