Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage, such as cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.
We ran into problems along the way: hallucinations, unreliable outputs and images that weren’t even laptops. To solve them, we ended up applying an agentic framework in an atypical way, not for task automation, but to improve the model’s performance.

In this article, we’ll walk through what we tried, what didn’t work and how the combination of approaches ultimately helped us build something reliable.
Our initial approach was fairly standard for a multimodal model: we used a single, large prompt to pass an image to an image-capable LLM and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
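For illustration, a monolithic prompt of this kind can look roughly like the sketch below. The OpenAI-style client, model name and prompt wording are assumptions made for the example, not the exact setup described here.

```python
# A minimal sketch of the monolithic approach: one image, one large prompt,
# one image-capable LLM call. All names here are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def detect_damage(image_path: str) -> str:
    """Send a single image plus one broad damage-detection prompt to the model."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any image-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("You are inspecting a laptop for physical damage. "
                          "List any visible damage such as cracked screens, "
                          "missing keys or broken hinges. If the image does "
                          "not show a laptop, say so.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```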
We ran into three major issues early on:

- Hallucinations: The model sometimes described damage that didn’t exist.
- Junk images: It had no reliable way to flag uploads that weren’t laptops at all.
- Inconsistent accuracy: Results varied widely with image quality.

This was the point where it became clear we needed to iterate.
One thing we noticed early on was how much image quality affected the model’s output. Users uploaded all kinds of pictures, from blurry snapshots (and the occasional dog photo) to crisp, high-resolution shots. This led us to research highlighting how image resolution impacts the performance of deep learning models.

We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would face in practice. This helped improve consistency, but the core problems of hallucination and junk-image handling persisted.
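One simple way to build such a mix is to degrade a portion of the training images on the fly. The sketch below shows the idea; the probability and scale bounds are illustrative assumptions, not the values used in the project.

```python
# Randomly downscale and re-upscale images so the model sees the same range
# of quality it will face in production. Parameters are illustrative.
import random
from PIL import Image

def degrade_resolution(img: Image.Image, p: float = 0.5) -> Image.Image:
    """With probability p, simulate a blurry, low-resolution upload."""
    if random.random() < p:
        w, h = img.size
        scale = random.uniform(0.2, 0.6)  # how aggressively to shrink
        small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))),
                           Image.BILINEAR)
        img = small.resize((w, h), Image.BILINEAR)  # back up to original size
    return img

# Usage: augmented = degrade_resolution(Image.open("laptop.jpg"))
```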
Inspired by recent practices that combine image captioning with text-only LLMs, as seen in tools like BLIP, we decided to try an approach in which a caption of the image is generated first and then interpreted by a language model.

Here’s how it works:

1. A captioning model processes the image and produces a textual description of what it sees.
2. That caption is passed to a text-only LLM, which is asked to assess physical damage from the description alone.
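A sketch of the two-step pipeline follows, assuming a BLIP captioning model from Hugging Face transformers for step 1 and an OpenAI-style text LLM for step 2. Model names and prompt wording are assumptions for the example.

```python
# Caption-then-interpret pipeline: image -> caption -> text-only LLM.
from PIL import Image
from openai import OpenAI
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")
client = OpenAI()

def caption_image(image_path: str) -> str:
    """Step 1: turn the image into a textual description."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=50)
    return processor.decode(out[0], skip_special_tokens=True)

def assess_damage_from_caption(caption: str) -> str:
    """Step 2: a text-only LLM judges damage from the caption alone."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"A laptop photo was described as: '{caption}'. "
                   "Based only on this description, list any physical damage."}],
    )
    return resp.choices[0].message.content
```

Note that the text model never sees the pixels: everything it knows about the laptop has to survive the captioning step.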
Although clever in theory, this approach introduced new problems for our use case: captions often dropped the subtle visual cues that damage assessment depends on, and any error in the captioning step compounded in the interpretation step.

It was an interesting experiment, but ultimately not a solution.
That was the turning point. Agentic frameworks are typically used to automate task flows (think scheduling calendar invites or coordinating customer service actions), but we wondered whether breaking the image interpretation into smaller, specialized agents could help.
We set up the agentic framework like this (a rough sketch follows the list):

- Orchestrator agent: It checked the image and identified which laptop components were visible, such as the screen, keyboard and hinges.
- Component agents: Dedicated agents inspected each component for specific damage types, for example a cracked screen or missing keys.
- Junk-detection agent: A separate agent flagged images that weren’t laptops in the first place.
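The sketch below shows one way such a framework could be wired together. The `ask_vlm` helper, model name, damage categories and prompt wording are illustrative assumptions, not the actual implementation.

```python
# Minimal agentic setup: a junk filter first, then narrow component agents.
import base64
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class Finding:
    component: str
    damage: str

def ask_vlm(image_path: str, question: str) -> str:
    """One narrow question to an image-capable LLM about one image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

# Each component agent owns one narrow, well-defined check.
COMPONENT_AGENTS = {
    "screen": "Is the laptop screen cracked, scratched or discolored? Reply 'none' if undamaged.",
    "keyboard": "Are any laptop keys missing or broken? Reply 'none' if undamaged.",
    "hinges": "Are the laptop hinges broken, bent or loose? Reply 'none' if undamaged.",
}

def inspect_laptop(image_path: str) -> list[Finding]:
    """Orchestrator: run the junk filter, then fan out to the component agents."""
    is_laptop = ask_vlm(image_path, "Does this image show a laptop? Reply yes or no.")
    if "yes" not in is_laptop.strip().lower():
        return [Finding("image", "not a laptop: flagged as junk")]
    findings = []
    for component, question in COMPONENT_AGENTS.items():
        answer = ask_vlm(image_path, question)
        if answer.strip().lower() != "none":
            findings.append(Finding(component, answer))
    return findings
```

Because each agent answers one small question, a wrong or vague answer is easy to trace back to a single, inspectable step.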
This modular, task-driven approach produced far more accurate and explainable results. Hallucinations dropped sharply, junk images were reliably flagged, and each agent’s task was narrow and clear enough that quality was much easier to manage.
As effective as it was, it wasn’t perfect. Two main limitations stood out:

- Latency: Running several agents in sequence added noticeably to total processing time.
- Coverage gaps: The agents could only detect the damage types they were explicitly assigned to look for; anything outside that checklist went unnoticed.

We needed a way to balance precision with coverage.
To bridge the gaps, we built a hybrid system (sketched below):

- The agentic framework ran first, handling junk detection and the known damage types where precision mattered most.
- A monolithic model then scanned the full image as a catch-all for anything the agents missed.
- Targeted fine-tuning on low-quality and edge-case images improved robustness across the board.
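The sketch below shows how the two passes could be combined, reusing `inspect_laptop` from the agentic sketch and `detect_damage` from the monolithic one. The structure is an illustrative assumption: agents first for precision, then a broad sweep for coverage.

```python
# Hybrid pass: precise agent checks first, then a monolithic catch-all sweep.
def hybrid_inspect(image_path: str) -> dict:
    findings = inspect_laptop(image_path)  # precise checks on known damage types
    if any(f.component == "image" for f in findings):  # junk-flagged upload
        return {"junk": True, "findings": [], "catch_all_notes": None}
    notes = detect_damage(image_path)  # broad, free-form sweep of the full image
    return {
        "junk": False,
        "findings": [vars(f) for f in findings],
        "catch_all_notes": notes,  # reviewed for damage outside the agents' checklist
    }
```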
This combination gave us the precision of the agentic structure, the broad coverage of the monolithic prompt and the added confidence of targeted fine-tuning.
As we wrapped up this project, a few things were clear.

What started as a simple idea, using an LLM prompt to detect physical damage in laptop images, turned into a much deeper experiment in combining different AI techniques to tackle unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones never originally designed for this kind of work.

Agentic frameworks, typically associated with workflow automation, proved surprisingly effective when repurposed for structured tasks like damage detection and junk-image filtering. With a little creativity, they helped us build a system that was easier to understand, manage and improve.
Shruti Tiwari is an AI product manager at Dell Technologies.

Vadiraj Kulkarni is a data scientist at Dell Technologies.