Mysterious Emergent Abilities of Large Language Models
Dive into the fascinating world of large language models and their emergent abilities in this insightful video. We discuss the unpredictable phenomenon of emergent abilities, which are present in larger models but not in smaller ones. Learn about the relationship between scaling up language models and the qualitative changes in their behavior. This video covers various aspects of emergent abilities, including few-shot prompting, augmented prompting strategies, and the potential for further scaling to expand the range of language model capabilities. Join us as we explore this exciting research area at the intersection of artificial intelligence and natural language processing.
This video explains an algorithm for meta-learning that is model-agnostic: it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems.
0:00 - Intro
2:29 - Human Intelligence
4:07 - The goal of meta-learning
5:56 - Model-agnostic meta-learning
10:17 - Step 1 - standard learning
12:04 - Step 2 - meta learning
15:59 - Algorithm
18:25 - Experiment setup
19:54 - Omniglot data
22:17 - MiniImagenet data
23:08 - Recap
Related Video:
Can Machines Learn Like Humans - In-context Learning / Meta-Learning
https://youtu.be/no5P_0ZYoOw
Paper:
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
https://arxiv.org/abs/1703.03400
Code:
https://github.com/cbfinn/maml
Abstract:
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
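To make the meta-objective concrete, here is a minimal PyTorch sketch of the MAML inner/outer loop on a toy sine-regression family. The task distribution, network sizes, and learning rates are illustrative assumptions, not the paper's exact configuration.
```python
import torch

def sample_task():
    # Toy few-shot regression family: y = a*sin(x + b) with random a, b.
    a = torch.rand(1) * 4 + 1
    b = torch.rand(1) * 3
    def sample(n):
        x = torch.rand(n, 1) * 10 - 5
        return x, a * torch.sin(x + b)
    return sample

model = torch.nn.Sequential(
    torch.nn.Linear(1, 40), torch.nn.ReLU(), torch.nn.Linear(40, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr, loss_fn = 0.01, torch.nn.MSELoss()

for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):                      # meta-batch of tasks
        task = sample_task()
        x_tr, y_tr = task(10)               # support set
        x_te, y_te = task(10)               # query set
        # Inner step: one gradient step on the support set, keeping the
        # graph so the meta-gradient can flow through the adaptation.
        params = dict(model.named_parameters())
        grads = torch.autograd.grad(
            loss_fn(torch.func.functional_call(model, params, x_tr), y_tr),
            params.values(), create_graph=True)
        adapted = {name: p - inner_lr * g
                   for (name, p), g in zip(params.items(), grads)}
        # Outer objective: post-adaptation loss on the query set.
        loss_fn(torch.func.functional_call(model, adapted, x_te),
                y_te).backward()
    meta_opt.step()
```
The key point, matching the abstract, is that the outer update optimizes the initialization itself so that a single inner gradient step already generalizes well on a new task.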
...
https://www.youtube.com/watch?v=tGTNplKgt6Q
This video explains a legendary paper, BERT. It leverages the Transformer encoder and comes up with an innovative way to pre-train language models (masked language modeling). BERT has had a significant influence on how people approach NLP problems and has inspired many follow-up studies and BERT variants.
0:00 - Intro
1:32 - Transformer vs. LSTMs
3:34 - Pre-BERT times
8:22 - Model architecture
9:46 - WordPiece embeddings
14:25 - Special tokens
16:42 - Input representations
18:15 - Masked language modeling
20:03 - Mismatch between pre-training and fine-tuning
23:21 - Next sentence prediction
26:28 - Pre-training data
30:57 - End-to-end fine-tuning
34:45 - SQuAD
36:57 - Ablation over pre-training tasks
41:37 - Ablation over model size
43:17 - Feature-based approach with BERT
Related Videos:
Transformer explained
https://youtu.be/ELTGIye424E
Introduction of GPT-3: The Most Powerful Language Model Ever
https://youtu.be/Rv5SeM7LxLQ
Paper
https://arxiv.org/abs/1810.04805
Code
https://github.com/google-research/bert (TensorFlow)
https://github.com/huggingface/transformers (PyTorch)
Connect
Twitter https://twitter.com/home
email edwindeeplearning@gmail.com
Abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
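As a concrete illustration of masked language modeling, here is a small sketch of BERT's 80/10/10 masking rule (the rates come from the paper; the toy vocabulary constants are placeholders, not real tokenizer values).
```python
import random

# Toy constants for illustration; real BERT uses a ~30k WordPiece vocab.
MASK_ID, VOCAB_SIZE = 103, 30522

def mask_tokens(token_ids, mask_prob=0.15):
    """BERT-style masking: pick ~15% of tokens as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% unchanged."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)           # -100 = ignore in the loss
    for i in range(len(inputs)):
        if random.random() < mask_prob:
            labels[i] = inputs[i]           # model must recover the original
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # random token
            # else: keep the original token unchanged
    return inputs, labels
```
Keeping 10% of targets unchanged and corrupting 10% randomly is exactly the paper's mitigation for the pre-training/fine-tuning mismatch discussed at 20:03.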
...
https://www.youtube.com/watch?v=j9toSIRf4RI
Can zero-shot generalization instead be directly induced by explicit multitask learning? Watch the video to find out!
0:00 - Intro
2:14 - Prompted training format
5:52 - Measuring generalization to unseen tasks
8:45 - Held-out tasks
10:45 - The future of NLP
11:48 - Model
12:17 - Experiment results
Connect
Linkedin https://www.linkedin.com/in/xue-yong-fu-955723a6/
Twitter https://twitter.com/home
email edwindeeplearning@gmail.com
Paper
https://arxiv.org/abs/2110.08207
Code
https://github.com/bigscience-workshop/promptsource/
Abstract
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks. It has been hypothesized that this is a consequence of implicit multitask learning in language model training. Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping general natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts using varying natural language. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks specified in natural language. We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-Bench benchmark, outperforming models 6x its size.
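To show what mapping a task into a "human-readable prompted form" looks like, here is a tiny sketch in the spirit of promptsource; the NLI example and template wording are invented for illustration, not templates from the paper.
```python
# One supervised NLI example rendered with two hypothetical prompt templates.
example = {"premise": "A cat sits on the mat.",
           "hypothesis": "An animal is resting.",
           "label": "entailment"}

templates = [
    ("Given that \"{premise}\", is it true that \"{hypothesis}\"? "
     "Answer entailment, neutral, or contradiction.",
     "{label}"),
    ("{premise}\nQuestion: Does this imply \"{hypothesis}\"?",
     "{label}"),
]

# Each template turns the same labeled example into a text-to-text pair,
# so one dataset yields many differently worded training prompts.
for input_tpl, target_tpl in templates:
    print("INPUT: ", input_tpl.format(**example))
    print("TARGET:", target_tpl.format(**example))
```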
...
https://www.youtube.com/watch?v=YToXXfrIu6w
It's a super cool paper that invents "vokenization" to generate large amounts of visually-grounded language data and train visually-grounded models on it.
Most language models are trained on pure text data. Although this has achieved significant success in recent years, it is not how humans acquire language. That raises an interesting question: can language models achieve a high level of language understanding by reading text input alone? The answer is probably "no".
To push the boundary of language models, adding other learning signals to the training process is key to success, and the first that comes to mind is vision (visual cues). However, existing visually-grounded datasets are an order of magnitude smaller than pure-text ones. This paper proposes the "vokenization" method to overcome this problem and uses the newly generated data to train visually-supervised language models.
More importantly, the visually-supervised models show significant improvements over text-only models.
0:00 - How did you learn your first language
1:00 - What's special about this paper
2:56 - How humans learn a language
5:23 - Visual pointing
6:07 - Challenge to visually-grounded supervision
9:58 - Token-image matching
11:53 - Vokenization
18:40 - Vokenizer training
23:39 - Visually-supervised language models
25:48 - Voken classification tasks
27:24 - Loss function
28:37 - Implication of voken classification
31:56 - Fine-tuning results
35:22 - Conventional visually-grounded corpora are very different
37:51 - Sentence-level vs. token-level
41:45 - Summary
Paper
https://arxiv.org/abs/2010.06775
Code
https://github.com/airsplay/vokenization
Abstract
Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG.
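As a rough sketch of the two pieces: a vokenizer that assigns each contextualized token its most relevant image ("voken"), here scored by cosine similarity as a stand-in for the paper's learned relevance score, and the extra voken-classification head trained alongside the language model. All sizes are placeholders.
```python
import torch
import torch.nn.functional as F

def vokenize(token_embs, image_embs):
    """Assign each contextualized token the index of its most relevant
    image, scoring relevance by cosine similarity (a simplification)."""
    t = F.normalize(token_embs, dim=-1)      # (seq_len, d)
    v = F.normalize(image_embs, dim=-1)      # (num_images, d)
    return (t @ v.T).argmax(dim=-1)          # (seq_len,) voken ids

# Voken classification: predict each token's voken id from the LM's
# hidden states, as an extra loss next to masked language modeling.
seq_len, d, num_images = 12, 64, 1000
hidden = torch.randn(seq_len, d)             # stand-in for BERT hidden states
voken_head = torch.nn.Linear(d, num_images)  # extra classification head
voken_ids = vokenize(torch.randn(seq_len, d), torch.randn(num_images, d))
voken_loss = F.cross_entropy(voken_head(hidden), voken_ids)
```
Because the vokens are generated automatically for any text corpus, the visual supervision scales far beyond the small captioning datasets the vokenizer was trained on.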
Connect
Twitter https://twitter.com/home
...
https://www.youtube.com/watch?v=4T1u3Z2DaZA
Asymptomatic people infected with COVID-19, by definition, don't have any symptoms, so we shouldn't be able to tell them apart from healthy people.
Yet the AI system built by an MIT team can detect COVID-19 from cough recordings with 97% accuracy, and, more interestingly, it detects asymptomatic cases with 100% sensitivity. The proposed model comprises four biomarkers (three ResNet models and a Poisson mask), each representing a hypothesis about the respiratory disease.
Caveat: more replication is needed. Clinical trials are ongoing at Mount Sinai and White Plains Hospital in the US, the Catalan Health Institute in Catalonia, Hospitales Civiles de Guadalajara in Mexico, and Ospedale Luigi Sacco in Italy.
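For orientation, here is a heavily simplified structural sketch of the described ensemble: parallel ResNet branches over MFCC "images" of the cough audio, fused by a final classifier. The resnet18 branches, feature sizes, and input shape are assumptions, and the Poisson biomarker mask preprocessing is omitted.
```python
import torch
import torchvision.models as models

class CoughNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Three parallel biomarker branches, each a ResNet over MFCCs.
        self.branches = torch.nn.ModuleList(
            [models.resnet18(num_classes=128) for _ in range(3)])
        self.head = torch.nn.Linear(3 * 128, 2)  # COVID vs. non-COVID

    def forward(self, mfcc):                 # mfcc: (batch, 1, freq, time)
        x = mfcc.repeat(1, 3, 1, 1)          # ResNets expect 3 channels
        feats = [branch(x) for branch in self.branches]
        return self.head(torch.cat(feats, dim=-1))

logits = CoughNet()(torch.randn(2, 1, 128, 128))  # toy MFCC batch
```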
0:00 - Intro
4:30 - Are the asymptomatics free of change
6:04 - COVID-19 cough dataset
7:41 - Model architecture
11:43 - Muscular degradation
13:13 - Vocal cords
14:46 - Sentiment
15:47 - Lungs and Respiratory Tract
19:48 - Results
22:18 - How many layers to fine-tune
25:33 - Explainable deep learning
28:12 - Summary
Paper:
COVID-19 Artificial Intelligence Diagnosis using only Cough Recordings
https://ieeexplore.ieee.org/document/9208795
Connect
Twitter https://twitter.com/home
email edwindeeplearning@gmail.com
...
https://www.youtube.com/watch?v=J_OmBva8_RA
Full end-to-end entity linking has long been a challenging problem in NLP. The typical approach is to use one model to detect entities and then employ another model to perform entity disambiguation. This paper beautifully formulates these two steps as a single neural network model.
0:00 - Ya ya ya
0:56 - What's special about this paper
2:10 - System overview
3:29 - Question & entities
6:19 - Mention detection
9:06 - Entity disambiguation
11:46 - Mention detection loss
14:09 - Entity disambiguation loss
15:45 - Datasets
16:24 - Results & discussion
22:55 - Runtime comparison
23:10 - Proof of concept
25:10 - Summary
Connect
Twitter https://twitter.com/home
Email edwindeeplearning@gmail.com
Related videos:
REALM: Retrieval-Augmented Language Model
https://youtu.be/JQ-bxQT5Qsw
Question and Answer Test-Train Overlap in Open Domain QA
https://youtu.be/Cb5sj4_Ztfo
Paper
Efficient One Pass End to End Entity Linking for Questions
https://arxiv.org/abs/2010.02413
Code
https://github.com/facebookresearch/BLINK/tree/master/elq
Abstract
We present ELQ, a fast end-to-end entity linking model for questions, which uses a biencoder to jointly perform mention detection and linking in one pass. Evaluated on WebQSP and GraphQuestions with extended annotations that cover multiple entities per question, ELQ outperforms the previous state of the art by a large margin of +12.7% and +19.6% F1, respectively. With a very fast inference time (1.57 examples/s on a single CPU), ELQ can be useful for downstream question answering systems. In a proof-of-concept experiment, we demonstrate that using ELQ significantly improves the downstream QA performance of GraphRetriever.
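To make the one-pass idea concrete, here is a heavily simplified sketch of the biencoder: a single encoding of the question yields token embeddings, mention spans are scored from their boundary tokens, and a detected span is linked to the nearest precomputed entity embedding by inner product. All shapes and scoring heads are illustrative, not ELQ's exact parameterization.
```python
import torch

seq_len, d, num_entities = 10, 64, 500
q = torch.randn(seq_len, d)              # question token embeddings (one pass)
entities = torch.randn(num_entities, d)  # precomputed entity embeddings

# Mention detection: score every (start, end) span from boundary tokens.
w_start, w_end = torch.randn(d), torch.randn(d)
span_scores = (q @ w_start)[:, None] + (q @ w_end)[None, :]   # (seq, seq)
valid = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))
span_scores = span_scores.masked_fill(~valid, float("-inf"))  # start <= end

# Entity disambiguation: mean-pool the best span, link by inner product.
flat = span_scores.argmax()
s, e = flat // seq_len, flat % seq_len
mention_emb = q[s:e + 1].mean(dim=0)
entity_id = (entities @ mention_emb).argmax()
print(int(s), int(e), int(entity_id))
```
Because the entity embeddings are precomputed, both detection and linking reuse the same single pass over the question, which is where the speed comes from.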
...
https://www.youtube.com/watch?v=eXN7Bu06RjI
This video walks you through the paper "Quantifying Attention Flow In Transformers", which proposes a simple yet effective method to better analyze transformer-based models' attention weights.
Link to the paper: https://arxiv.org/abs/2005.00928
(Quantifying Attention Flow In Transformers)
The official code implementation of the paper:
https://github.com/samiraabnar/attention_flow
Relevant video:
Revealing Dark Secrets of BERT (Analysis of BERT's Attention Heads) - Paper Explained
https://youtu.be/mnU9ILoDH68
Abstract of the paper:
In the Transformer model, "self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer. Thus, across layers of the Transformer, information originating from different tokens gets increasingly mixed. This makes attention weights unreliable as explanations probes. In this paper, we consider the problem of quantifying this flow of information through self-attention. We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow, as post hoc methods when we use attention weights as the relative relevance of the input tokens. We show that these methods give complementary views on the flow of information, and compared to raw attention, both yield higher correlations with importance scores of input tokens obtained using an ablation method and input gradients.
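For reference, here is a compact numpy sketch of attention rollout as it is commonly implemented: average attention over heads, approximate residual connections by mixing in the identity, renormalize rows, then multiply the per-layer matrices together. The 0.5/0.5 residual mixing is the usual simplifying assumption.
```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer attention arrays, each of shape
    (num_heads, seq_len, seq_len). Returns a (seq_len, seq_len) rollout."""
    rollout = None
    for layer_att in attentions:
        a = layer_att.mean(axis=0)                  # average over heads
        a = 0.5 * a + 0.5 * np.eye(a.shape[-1])     # add residual connection
        a = a / a.sum(axis=-1, keepdims=True)       # renormalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout  # rollout[i, j]: attention of token i to input token j

# Toy usage: random attention maps for 4 layers, 8 heads, 12 tokens.
rng = np.random.default_rng(0)
atts = [rng.dirichlet(np.ones(12), size=(8, 12)) for _ in range(4)]
print(attention_rollout(atts).shape)  # (12, 12)
```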
...
https://www.youtube.com/watch?v=3Q0ZXqVaQPo
A groundbreaking way to do self-supervision on videos and text. I would say it's the BERT moment for video-text understanding.
#videoclip #contrastivelearning #videotransformer
0:00 - Intro
3:31 - Retrieval augmented training
5:07 - Video and text encoding
8:48 - Contrastive loss
12:09 - Zero-shot transfer to end tasks
14:05 - Experiment results
18:09 - What did we learn
Paper
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
https://arxiv.org/abs/2109.14084
Connect
Twitter https://twitter.com/home
Linkedin https://www.linkedin.com/in/xue-yong-fu-955723a6/
email edwindeeplearning@gmail.com
Abstract
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches.
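Here is a minimal sketch of the symmetric contrastive (InfoNCE) objective over paired video and text embeddings; the temperature, batch size, and embedding size are placeholders, and the paper's overlapping-clip sampling and retrieval-based hard negatives are omitted.
```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (video, text) pairs are positives,
    and every other pairing in the batch serves as a negative."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (batch, batch) similarities
    labels = torch.arange(len(v))             # positives on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = clip_style_loss(torch.randn(8, 256), torch.randn(8, 256))
```
The paper's contribution is mainly in how the positive pairs (temporally overlapping clips) and negatives (nearest-neighbor retrieval) are chosen, not in the loss itself.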
...
https://www.youtube.com/watch?v=vqMZjsIKUoQ
It proposes a data sampling technique and a two-stage fine-tuning approach that let us sample more training data similar to our in-domain ASR transcripts and improve model performance.
0:00 - How to make a model more accurate
1:02 - I published a paper
3:05 - Punctuation restoration
5:32 - In-domain data
7:29 - Annotated data is expensive
8:47 - Opensubtitles
10:04 - Data sampling via LM
11:34 - Two-stage fine-tuning
14:55 - Layer reduction
16:49 - Takeaway
18:10 - EMNLP 2021
Connect
Linkedin https://www.linkedin.com/in/xue-yong-fu-955723a6/
Twitter https://twitter.com/home
email edwindeeplearning@gmail.com
Paper
Improving Punctuation Restoration for Speech Transcripts via External Data
https://arxiv.org/abs/2110.00560?context=cs
Abstract
Automatic Speech Recognition (ASR) systems generally do not produce punctuated transcripts. To make transcripts more readable and follow the expected input format for downstream language models, it is necessary to add punctuation marks. In this paper, we tackle the punctuation restoration problem specifically for the noisy text (e.g., phone conversation scenarios). To leverage the available written text datasets, we introduce a data sampling technique based on an n-gram language model to sample more training data that are similar to our in-domain data. Moreover, we propose a two-stage fine-tuning approach that utilizes the sampled external data as well as our in-domain dataset for models based on BERT. Extensive experiments show that the proposed approach outperforms the baseline with an improvement of 1.12% F1 score.
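As a toy sketch of the data-selection idea: fit a small n-gram language model on in-domain transcripts and keep the external sentences it scores as most probable. The bigram model with add-one smoothing below is a simplification of the paper's n-gram setup.
```python
import math
from collections import Counter

def bigram_logprob(sentence, unigrams, bigrams, vocab_size):
    """Average per-token log-probability under an add-one-smoothed bigram LM."""
    toks = ["<s>"] + sentence.split()
    lp = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
             for a, b in zip(toks, toks[1:]))
    return lp / (len(toks) - 1)

# Fit bigram counts on a few toy "in-domain" conversational transcripts.
in_domain = ["hello how are you", "i am fine thanks", "how is it going"]
unigrams, bigrams = Counter(), Counter()
for s in in_domain:
    toks = ["<s>"] + s.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

# Rank external sentences; keep the ones that look most like in-domain speech.
external = ["how are you doing", "the quarterly report is attached"]
V = len(unigrams)
ranked = sorted(external,
                key=lambda s: -bigram_logprob(s, unigrams, bigrams, V))
print(ranked)
```
The selected sentences then feed the first fine-tuning stage, with the true in-domain data reserved for the second stage.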
...
https://www.youtube.com/watch?v=jxOpu4hXPJY