LLM Lecture Notes

Posted Feb 24, 2024

By syshin


When further pretraining is needed

Additional training on Korean

RLHF (Reinforcement Learning from Human Feedback)

Responses are scored by a reward model

e.g., the probability that this response is better than the other response is 0.2
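As a rough sketch of how a preference probability like the one in the example above can come out of a reward model: one common formulation (the Bradley-Terry model, which is an assumption here and not stated in the lecture) applies a sigmoid to the difference between the two responses' reward scores. The function name and the reward values below are hypothetical, picked only so that the result matches 0.2.

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Hypothetical reward scores: response B scores log(4) higher than response A,
# so the odds that A is the better response are 1:4.
reward_a = 0.0
reward_b = math.log(4.0)

p = preference_probability(reward_a, reward_b)
print(round(p, 3))  # 0.2
```

Only the difference between the two scores matters in this formulation, so equal scores always give 0.5.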

Two problems

Training the reward model is difficult
Using reinforcement learning introduces training instability

DPO (Direct Preference Optimization)

A method that improves on RLHF

Removes the reward model
Mitigates the instability of reinforcement learning
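To make the two points above a bit more concrete, here is a minimal sketch of the DPO loss for a single preference pair, following the formulation in the DPO paper as I understand it. No separate reward model appears: the loss is computed directly from the log-probabilities that the policy being trained and a frozen reference model assign to the chosen and rejected responses. The function name, argument names, and the default `beta=0.1` are assumptions made for illustration, and the log-probabilities are passed in as plain floats where a real setup would get them from the models.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # How much more likely each response is under the policy than under the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # The scaled gap between the two log-ratios plays the role of a reward difference.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = softplus(-margin), written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical log-probabilities: relative to the reference, the policy has moved
# toward the chosen response and away from the rejected one.
loss = dpo_loss(policy_logp_chosen=-1.0, policy_logp_rejected=-3.0,
                ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
print(loss)
```

Because this is an ordinary supervised loss minimized by gradient descent, training avoids the sampling and policy-gradient loop used in RLHF, which is where the stability benefit noted above comes from. When the policy is identical to the reference, the margin is zero and the loss equals log 2.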

Prompt engineering

Deep-Learning, NLP

This post is licensed under CC BY 4.0 by the author.
