Day 16-1. Intro to NLP, Bag-of-Words


호저미 2021. 2. 15. 11:33

[../주재걸교수님_자연어처리/01_자연어처리_upstage_1일차 (1).pdf]

0. Intro

1. Natural language processing

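sentence level
- sentiment analysis: deciding whether a sentence is positive or negative
- machine translation: using a computer to translate the natural language people use into another language
multi-sentence and paragraph level
- Entailment prediction: predicting the logical entailment or contradiction relationship between two sentences
  - e.g., given 'John got married yesterday.' and 'At least one person got married yesterday.', if the first sentence is true, the second is automatically true
- question answering: when you ask Google a question, it returns the answer directly
- dialog systems: holding a conversation, like a chatbot
- summarization: condensing a given document into a one-line summary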

2. Text mining

3. Information retrieval

4. Trends of NLP

word embedding: representing words as vectors
- e.g., Word2Vec or GloVe
Core NLP models: RNN-family models (LSTM, GRU), Transformer models (the most widely used these days)
Before the Transformer, there were separate models specialized for each NLP task. Recently, however, huge models have been released that stack the Transformer's basic module, self-attention, and these models are trained on large datasets through language modeling tasks, a self-supervised training setting that does not require additional labels for a particular task
- e.g., self-supervised learning: BERT, GPT-3
- However, models trained with self-supervised learning require massive amounts of data and GPU resources. e.g., training a single Tesla model is said to run into the billions in electricity costs alone

5. Bag-of-Words Representation

Before deep learning was introduced to text mining, this was the most widely used way to represent words and characters as numbers
Step 1. Constructing the vocabulary containing unique words
- Example sentences: “John really really loves this movie”, “Jane really likes this song”
- Vocabulary: {“John”, “really”, “loves”, “this”, “movie”, “Jane”, “likes”, “song”}
Step 2. Encoding unique words to one-hot vectors
- John: [1 0 0 0 0 0 0 0]
- really: [0 1 0 0 0 0 0 0]
- loves: [0 0 1 0 0 0 0 0]
- this: [0 0 0 1 0 0 0 0]
- movie: [0 0 0 0 1 0 0 0]
- Jane: [0 0 0 0 0 1 0 0]
- likes: [0 0 0 0 0 0 1 0]
- song: [0 0 0 0 0 0 0 1]
- For any pair of distinct words, the Euclidean distance is sqrt(2) and the cosine similarity is 0
  - In other words, every word has the same relationship to every other word, regardless of its meaning
Step 3. A sentence/document can be represented as the sum of its one-hot vectors (add the word vectors above element-wise)
- Sentence 1: “John really really loves this movie”
  - John + really + really + loves + this + movie: [1 2 1 1 1 0 0 0]
- Sentence 2: “Jane really likes this song”
  - Jane + really + likes + this + song: [0 1 0 1 0 1 1 1]
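
The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming whitespace tokenization and a vocabulary ordered by first appearance; the helper names (`one_hot`, `bag_of_words`, and so on) are made up for this example and do not come from the lecture.

```python
import math

sentences = [
    "John really really loves this movie",
    "Jane really likes this song",
]

# Step 1: build the vocabulary in order of first appearance.
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Step 2: a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec


def bag_of_words(sentence):
    """Step 3: element-wise sum of the one-hot vectors of all words."""
    total = [0] * len(vocab)
    for word in sentence.split():
        total = [t + o for t, o in zip(total, one_hot(word))]
    return total


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(vocab)
print(bag_of_words(sentences[0]))  # [1, 2, 1, 1, 1, 0, 0, 0]
print(bag_of_words(sentences[1]))  # [0, 1, 0, 1, 0, 1, 1, 1]

# "loves" and "likes" are close in meaning, yet they are exactly as far
# apart as any other pair of distinct one-hot vectors.
print(euclidean_distance(one_hot("loves"), one_hot("likes")))  # sqrt(2)
print(cosine_similarity(one_hot("loves"), one_hot("likes")))   # 0.0
```

Note that the bag-of-words vector keeps word counts but discards word order, which is where the name comes from.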

6. How to classify after step 5: Naive Bayes Classifier for Document Classification

[Note] argmin/argmax
- Short for "arguments of the minimum" and "arguments of the maximum": the points of the domain (elements or parameters) at which a function attains its minimum/maximum value
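
As a small illustration of the difference between max and argmax, the snippet below picks the key with the highest (or lowest) value rather than the value itself. The class names and scores are hypothetical, invented only for this example.

```python
# Hypothetical class scores, made up for illustration.
scores = {"positive": 0.7, "negative": 0.2, "neutral": 0.1}

# max(...) over values returns the largest score; argmax returns the
# argument (here, the class name) at which that score is attained.
largest_score = max(scores.values())
best_class = max(scores, key=scores.get)   # argmax
worst_class = min(scores, key=scores.get)  # argmin

print(largest_score)  # 0.7
print(best_class)     # positive
print(worst_class)    # neutral
```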
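A drawback: if a particular word never appears in the training data for a given class, its estimated probability for that class is 0, so any test document containing that word is assigned probability 0 for that class no matter what other words it contains → to address this, various regularization (smoothing) techniques are often used alongside the classifier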
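
A minimal sketch of a Naive Bayes document classifier shows how the zero-probability problem can be avoided. It assumes add-one (Laplace) smoothing as the remedy, which is one common choice and not necessarily the one used in the lecture. The toy training set, the labels `pos`/`neg`, and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set, made up for illustration.
train = [
    ("really loves this movie", "pos"),
    ("loves this song", "pos"),
    ("really hates this movie", "neg"),
]


def train_naive_bayes(docs, alpha=1.0):
    """Estimate log P(c) and log P(w | c) with add-alpha smoothing.

    alpha must be > 0 in this sketch; with alpha = 0 an unseen word
    would get probability 0 and math.log would fail.
    """
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.split()
        word_counts[label].update(words)
        vocab.update(words)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(text, log_prior, log_likelihood, vocab):
    """Return argmax over classes of log P(c) + sum of log P(w | c)."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are simply skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in vocab
        )
    return max(scores, key=scores.get)


log_prior, log_likelihood, vocab = train_naive_bayes(train)

# "hates" never occurs in a "pos" document, but smoothing keeps
# P("hates" | pos) above zero instead of wiping out the whole class.
print(math.exp(log_likelihood["pos"]["hates"]))
print(predict("hates this movie", log_prior, log_likelihood, vocab))  # neg
print(predict("loves this song", log_prior, log_likelihood, vocab))   # pos
```

Working in log space turns the product of word probabilities into a sum, which avoids numerical underflow on longer documents.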
