
DeepSeek’s success shows why motivation is key to AI innovation


January 2025 shook the AI landscape. The seemingly unstoppable OpenAI and the powerful American tech giants were shocked by what we can certainly call an underdog in the area of large language models (LLMs). DeepSeek, a Chinese company not on anybody's radar, suddenly challenged OpenAI. It is not that DeepSeek-R1 was better than the top models from the American giants; it was slightly behind in terms of the benchmarks, but it suddenly made everyone think about efficiency in terms of hardware and energy usage.

Given the lack of access to the best high-end hardware, it seems that DeepSeek was motivated to innovate in the area of efficiency, which was a lesser concern for the larger players. OpenAI has claimed it has evidence suggesting DeepSeek may have used their models for training, but we have no concrete proof to support this. So, whether it is true or OpenAI is simply trying to appease its investors is a matter of debate. However, DeepSeek has published its work, and people have verified that the results are reproducible, at least on a much smaller scale.

But how could DeepSeek achieve such cost savings while American companies could not? The short answer is simple: they had more motivation. The long answer requires a little more technical explanation.

DeepSeek used KV-cache optimization

One key cost saving for GPU memory was optimization of the key-value (KV) cache used in every attention layer of an LLM.

LLMs are made up of transformer blocks, each of which consists of an attention layer followed by a regular vanilla feed-forward network. The feed-forward network conceptually models arbitrary relationships, but in practice it is difficult for it to always determine patterns in the data on its own. The attention layer solves this problem for language modeling.

The model processes text using tokens, but for simplicity we will refer to them as words. In an LLM, each word gets assigned a vector in a high dimension (say, a thousand dimensions). Conceptually, each dimension represents a concept, like being hot or cold, being green, being soft, being a noun. A word's vector representation is its meaning and its values along each of these dimensions.
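To make this concrete, here is a toy word-vector table in Python; the three named dimensions are invented purely for illustration, since real models learn thousands of dimensions with no human-readable labels.

```python
# Toy word-vector table: each word maps to a vector whose dimensions stand for concepts.
# The dimension labels are invented for illustration; real embeddings are learned and unlabeled.
import numpy as np

DIMS = ["hotness", "green-ness", "noun-ness"]       # real models use ~thousands of learned dimensions
embeddings = {
    "apple": np.array([0.1, 0.3, 0.9]),             # a noun, somewhat green
    "green": np.array([0.0, 0.9, 0.1]),             # strongly about green-ness
    "fire":  np.array([0.9, 0.0, 0.8]),             # hot, a noun
}
print(dict(zip(DIMS, embeddings["apple"])))
```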

However, our language allows other words to modify the meaning of each word. For example, an apple has a meaning. But we can have a green apple as a modified version. A more extreme example of modification would be that an apple in an iPhone context differs very much from an apple in a meadow context. How do we let our system modify the vector meaning of a word based on another word? This is where attention comes in.

The attention model assigns two other vectors to each word: a key and a query. The query represents the qualities of a word's meaning that can be modified, and the key represents the types of modifications it can provide to other words. For example, the word 'green' can provide information about color and green-ness. So, the key of the word 'green' will have a high value on the 'green-ness' dimension. On the other hand, the word 'apple' can be green or not, so the query vector of 'apple' would also have a high value for the green-ness dimension. If we take the dot product of the key of 'green' with the query of 'apple', the product should be relatively large compared to the product of the key of 'table' and the query of 'apple'. The attention layer then adds a small fraction of the value of the word 'green' to the value of the word 'apple'. This way, the value of the word 'apple' is modified to be a little greener.
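To make the key/query/value picture concrete, here is a minimal sketch of single-head scaled dot-product attention in Python (NumPy). The shapes, names and toy data are illustrative assumptions, not DeepSeek's implementation.

```python
# Minimal single-head scaled dot-product attention sketch (illustrative only).
import numpy as np

def attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """queries, keys, values: arrays of shape (num_words, dim)."""
    dim = queries.shape[-1]
    # Dot product of each query with every key: how strongly one word can modify another.
    scores = queries @ keys.T / np.sqrt(dim)          # (num_words, num_words)
    # Causal mask: a word may only attend to itself and earlier words.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the context words
    # Each word's new value is a weighted mix of the values of the words it attends to,
    # e.g. 'apple' picks up a small fraction of the value of 'green'.
    return weights @ values                           # (num_words, dim)

# Toy usage: 3 "words", 8-dimensional vectors.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(3, 8)) for _ in range(3))
print(attention(q, k, v).shape)  # (3, 8)
```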

When the LLM generates text, it does so one word after another. When it generates a word, all the previously generated words become part of its context. However, the keys and values of those words have already been computed. When another word is added to the context, its value needs to be updated based on the keys and values of all the previous words. That is why all those keys and values are kept in GPU memory. This is the KV cache.
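A minimal sketch of why this cache helps, assuming a single attention head: keys and values of earlier words are computed once, stored, and reused for every new word. The class and method names below are made up for illustration and do not come from any real library.

```python
# Illustrative KV cache for autoregressive decoding (hypothetical names, not a real API).
import numpy as np

class KVCache:
    def __init__(self):
        self.keys = []    # one (dim,) key vector per previously generated word
        self.values = []  # one (dim,) value vector per previously generated word

    def append(self, key: np.ndarray, value: np.ndarray) -> None:
        # Keys/values of earlier words never change, so they are stored exactly once.
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query: np.ndarray) -> np.ndarray:
        # The newest word's value is updated from all cached keys/values, with no recomputation.
        K = np.stack(self.keys)                      # (context_len, dim)
        V = np.stack(self.values)                    # (context_len, dim)
        scores = K @ query / np.sqrt(len(query))
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

# Usage: as each new word is generated, its key/value are appended once and reused thereafter.
rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(5):                                   # 5 previously generated "words"
    cache.append(rng.normal(size=16), rng.normal(size=16))
new_word_value = cache.attend(rng.normal(size=16))
```

The cache grows with the context length (a key and a value per word, per attention layer), which is exactly the GPU-memory cost the next optimization targets.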

DeepSeek determined that the key and the value of a word are related. The meaning of the word green and its ability to affect greenness are obviously very closely related. So it is possible to compress both into a single (and maybe smaller) vector and decompress it very easily while processing. DeepSeek found that this does slightly affect performance on benchmarks, but it saves a lot of GPU memory.
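Here is a minimal sketch of the compression idea described above: project a word's key and value jointly into one smaller latent vector, cache only that, and decompress when attention needs them. The matrices and sizes are assumptions for illustration; DeepSeek's published technique (multi-head latent attention) is considerably more involved.

```python
# Illustrative joint key/value compression (a sketch of the idea, not DeepSeek's actual implementation).
import numpy as np

dim, latent_dim = 1024, 128                                   # assumed sizes: latent is ~8x smaller
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.02, size=(dim, latent_dim))       # hidden state -> shared latent
W_up_k = rng.normal(scale=0.02, size=(latent_dim, dim))       # latent -> approximate key
W_up_v = rng.normal(scale=0.02, size=(latent_dim, dim))       # latent -> approximate value

def compress(hidden: np.ndarray) -> np.ndarray:
    """Cache only this small latent instead of a full key and a full value."""
    return hidden @ W_down                                    # (latent_dim,) instead of 2 * (dim,)

def decompress(latent: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Recover an approximate key and value when the attention layer needs them."""
    return latent @ W_up_k, latent @ W_up_v

hidden = rng.normal(size=(dim,))
key, value = decompress(compress(hidden))
# Memory per cached word drops from 2 * dim floats to latent_dim floats.
print(2 * dim, "->", latent_dim)
```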

DeepSeek applied MoE

The nature of a neural network is that the entire network has to be evaluated (or computed) for every query. However, not all of this is useful computation. Knowledge of the world sits in the weights, or parameters, of a network. Knowledge about the Eiffel Tower is not used to answer questions about the history of South American tribes. Knowing that an apple is a fruit is not useful while answering questions about the general theory of relativity. However, when the network is computed, all parts of it are processed regardless. This incurs huge computation costs during text generation that should ideally be avoided. This is where the idea of the mixture of experts (MoE) comes in.

In an MoE model, the neural network is divided into multiple smaller networks called experts. Note that the 'expert' in a subject matter is not explicitly defined; the network figures it out during training. However, the network assigns a relevance score to each query and activates only the parts with the higher matching scores. This provides huge cost savings in computation. Note that some questions need expertise in multiple areas to be answered properly, and the performance of such queries will be degraded. However, because the areas of expertise are figured out from the data, the number of such questions is minimized.
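A minimal sketch of the routing idea: a small router scores every expert for the incoming token, and only the top-scoring experts are actually computed. The expert count, top-k value and layer shapes are illustrative assumptions, not DeepSeek's actual configuration.

```python
# Illustrative top-k mixture-of-experts routing (a sketch, not DeepSeek's architecture).
import numpy as np

dim, num_experts, top_k = 512, 8, 2
rng = np.random.default_rng(0)
router = rng.normal(scale=0.02, size=(dim, num_experts))                          # scores each expert
experts = [rng.normal(scale=0.02, size=(dim, dim)) for _ in range(num_experts)]   # tiny stand-in "experts"

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                             # relevance score per expert for this token
    chosen = np.argsort(scores)[-top_k:]            # only the best-matching experts run
    gates = np.exp(scores[chosen])
    gates /= gates.sum()
    # Only top_k of num_experts experts are evaluated, which is where the compute saving comes from.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

out = moe_forward(rng.normal(size=(dim,)))
print(out.shape)  # (512,)
```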

The importance of reinforcement learning

An LLM is typically taught to think via a chain-of-thought model, where the model is fine-tuned to imitate thinking before delivering the answer. The model is asked to verbalize its thought (generate the thought before generating the answer). The model is then evaluated on both the thought and the answer, and trained with reinforcement learning (rewarded for a correct match with the training data and penalized for an incorrect match).

This requires expensive training data with thought tokens. DeepSeek instead only asked the system to generate its thoughts between the tags <think> and </think> and to generate the answers between the tags <answer> and </answer>. The model is rewarded or penalized purely on form (the use of the tags) and on the answer. This required much less expensive training data. During the early phase of RL, the model generated very little thought, which resulted in incorrect answers. Eventually, the model learned to generate both long and coherent thoughts, which is what DeepSeek calls the 'a-ha' moment. After this point, the quality of the answers improved a lot.
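Here is a hedged sketch of the kind of rule-based reward this describes: check that the output wraps its reasoning and its answer in the expected tags, and compare the extracted answer against the reference. The exact reward values and regular expressions are illustrative assumptions, not DeepSeek's published reward function.

```python
# Illustrative rule-based reward: format (tags) plus answer correctness (a sketch only).
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def reward(output: str, reference_answer: str) -> float:
    score = 0.0
    # Format reward: the model must wrap its reasoning and its answer in the expected tags.
    if THINK_RE.search(output) and ANSWER_RE.search(output):
        score += 0.5
    # Accuracy reward: the extracted answer must match the reference from the training data.
    match = ANSWER_RE.search(output)
    if match and match.group(1).strip() == reference_answer.strip():
        score += 1.0
    return score

sample = "<think>2 + 2 is 4 because ...</think><answer>4</answer>"
print(reward(sample, "4"))  # 1.5
```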

DeepSeek employs several additional optimization tricks as well. However, they are highly technical, so I will not go into them here.

Final thoughts about DeepSeek and the larger market

In any technology research, we first need to explore what is possible before improving efficiency. This is a natural progression. DeepSeek's contribution to the LLM landscape is phenomenal. The academic contribution cannot be ignored, whether or not their models were trained using OpenAI output. It can also transform the way startups operate. But there is no reason for OpenAI or the other American giants to despair. This is how research works: one group benefits from the research of the other groups. DeepSeek certainly benefited from the earlier research carried out by Google, OpenAI and numerous other researchers.

However, the idea that OpenAI will dominate the LLM world indefinitely is now very unlikely. No amount of regulatory lobbying or finger-pointing will preserve their monopoly. The technology is already in the hands of many and out in the open, making its progress unstoppable. Although this may be a bit of a headache for OpenAI's investors, it is ultimately a win for the rest of us. While the future belongs to many, we will always be grateful to early contributors like Google and OpenAI.

Debasish Ray Chaudhuri is senior principal engineer at Talentica Software.


