Based on https://llm-course.github.io.

Basics

Language Model

A language model assigns a probability to an \(n\)-gram: \(f:V^n \rightarrow R^+\).

A conditional language model assigns probability to a word given some conditioning context:

\[g:(V^{n-1},V)\rightarrow R^{+}. \]

\[p(w_n|w_1,\ldots,w_{n-1}) = g(w_1,\ldots,w_{n-1},w_n) = \frac{f(w_1,\ldots,w_{n})}{f(w_1,\ldots,w_{n-1})}. \]

A language model is thus a probabilistic model that assigns a probability to every finite word sequence (grammatical or not), which the chain rule decomposes into conditional probabilities:

\[P(\text{I am noob})=\underbrace{P(\text{I})\cdot P(\text{am}|\text{I})}_{P(\text{I am})}\cdot P(\text{noob}|\text{I am}). \]
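
As a sanity check on the chain rule above, here is a tiny Python sketch; the conditional probability values are made up purely for illustration.

```python
# Chain rule: P(w_1, ..., w_N) = prod_i P(w_i | w_1, ..., w_{i-1}).
# The conditional probabilities below are made-up numbers for illustration only.
cond_probs = {
    ("I",): 0.05,                 # P(I)
    ("I", "am"): 0.30,            # P(am | I)
    ("I", "am", "noob"): 0.01,    # P(noob | I am)
}

def sequence_probability(words):
    """Multiply the conditional probability of each word given its full history."""
    p = 1.0
    for i in range(len(words)):
        p *= cond_probs[tuple(words[: i + 1])]
    return p

print(sequence_probability(["I", "am", "noob"]))  # 0.05 * 0.30 * 0.01 = 1.5e-4
```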

  • Decoder-only models (GPT-x models).
  • Encoder-only models (BERT, RoBERTa, ELECTRA).
  • Encoder-decoder models (T5, BART).
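
To poke at concrete instances of the three families, one option (an assumption here, not part of the original notes) is the Hugging Face transformers library; the checkpoint names below are just common examples.

```python
# Representative checkpoints for the three architecture families,
# loaded with the Hugging Face `transformers` Auto classes.
from transformers import AutoModelForCausalLM, AutoModel, AutoModelForSeq2SeqLM

decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")           # GPT-style
encoder_only = AutoModel.from_pretrained("bert-base-uncased")         # BERT-style
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # T5-style
```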

Language Modeling with \(n\)-grams

Definition of an \(n\)-gram language model:

An \(n\)-gram language model assumes each word depends only on the previous \(n-1\) words (a Markov assumption of order \(n-1\)):

\[\begin{align} P_{ngram}(w_1,\ldots,w_N)&=P(w_1)P(w_2|w_1)\cdots P(w_i|\underbrace{w_{i-1},\ldots,w_{i-(n-1)}}_{n-1 \text{ words}})\cdots P(w_N|w_{N-1},\ldots,w_{N-(n-1)})\\ &=\prod_{i=1}^N P(w_i|w_{i-1},\ldots,w_{i-(n-1)}). \end{align} \]

  • Use an end-of-sentence (EOS) token </s> to mark sentence boundaries and bound sentence length.
  • Add \(n-1\) beginning-of-sentence (BOS) tokens <s> to each sentence so that \(P(w_i|w_{i-1},\ldots,w_{i-(n-1)})\) is well defined for the first \(n-1\) words (see the bigram sketch after the lists below).

  • Unigram model (\(1\)-gram​): \(P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k)\).
  • Bigram model (\(2\)-gram): \(P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k|w_{k-1})\).
  • Trigram model (\(3\)-gram): \(P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k|w_{k-1},w_{k-2})\)​.
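
A minimal bigram model estimated from raw counts ties the pieces above together (BOS/EOS padding and the product of conditional probabilities); the toy corpus and helper names are illustrative, not from the course.

```python
from collections import defaultdict

BOS, EOS = "<s>", "</s>"

def train_bigram(corpus):
    """Count bigrams and contexts; return the MLE estimate P(w_k | w_{k-1})."""
    context_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sentence in corpus:
        tokens = [BOS] + sentence.split() + [EOS]   # n-1 = 1 BOS token, one EOS token
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return lambda curr, prev: bigram_counts[(prev, curr)] / context_counts[prev]

def sentence_probability(sentence, p):
    """P(w_1, ..., w_N) = prod_k P(w_k | w_{k-1}), including the EOS transition."""
    tokens = [BOS] + sentence.split() + [EOS]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= p(curr, prev)
    return prob

corpus = ["I am noob", "I am here", "you are noob"]
p = train_bigram(corpus)
print(sentence_probability("I am noob", p))  # (2/3) * 1 * (1/2) * 1 = 1/3
```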

Evaluation

(Intrinsic Evaluation). Perplexity:

The inverse (\(1\over P(\ldots)\)) of the probability of the test set, normalized (\(\sqrt[N]{\ldots}\)) by the number of tokens (\(N\)) in the test set.

If an LM assigns probability \(P(w_1,\ldots,w_N)\) to a test corpus \(w_1,\ldots,w_N\), the perplexity of an \(n\)-gram language model can be written as:

\[PP(w_1,\ldots,w_N)=\sqrt[N]{1 \over P(w_1,\ldots,w_N)}=\sqrt[N]{1 \over \prod_{i=1}^N P(w_i|w_{i-1},\ldots,w_{i-(n-1)})}. \]

Rewriting it in log form, perplexity is the exponential of the negative mean log-likelihood of all the words in the input sequence:

\[PPL(w_1,\ldots,w_N)=\exp\left(-\frac{1}{N}\sum_{i=1}^N\log\left(P(w_i|w_{i-1},\ldots,w_{i-(n-1)})\right)\right). \]

  • Lower perplexity \(\rightarrow\) the model assigns a higher probability to the unseen test corpus (a small computation sketch follows below).
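
The log-form expression maps directly to code; the sketch below averages hypothetical per-token log-probabilities (the numbers are made up).

```python
import math

def perplexity(log_probs):
    """Exponential of the negative mean log-likelihood, matching the PPL formula above."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Hypothetical per-token probabilities P(w_i | history) on a test corpus.
log_probs = [math.log(p) for p in (0.2, 0.1, 0.05, 0.4)]
print(perplexity(log_probs))  # lower is better
```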

(Extrinsic Evaluation). Word error rate (WER):

\[\text{WER} = \frac{\text{Insertions}+\text{Deletions}+\text{Substitutions}}{\text{Actual words in transcript}}. \]
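
WER is the word-level edit (Levenshtein) distance divided by the number of words in the reference transcript; here is a small dynamic-programming sketch with illustrative names.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (insertions + deletions + substitutions)
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6
```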

ChatGPT

Resource

ChatGPT API key: http://eylink.cn

Course: https://github.com/mlabonne/llm-course, https://llm-course.github.io

Researchers:

  • (Communications) 黄川, Hongyang Du
  • OpenAI and other well-known researchers: Lilian Weng, Yao Fu, Jianlin Su

Image \(\rightarrow\) text: LLaVA

LLMs for network scheduling: NetLLM

LLM fine-tuning: LoRA