DLHW2_Q4.pdf
Transformers in Computer Vision

Transformer architectures owe their origins to natural language processing (NLP), and indeed form the core of the current state-of-the-art models for most NLP applications. We will now see how to develop transformers for processing image data (and in fact, this line of deep learning research has been gaining a lot of attention since 2021).

The Vision Transformer (ViT) introduced in this paper shows how standard transformer architectures can perform very well on images. The high-level idea is to extract patches from images, treat them as tokens, and pass them through a sequence of transformer blocks before throwing on a couple of dense classification layers at the very end.

Some caveats to keep in mind:

- ViT models are very cumbersome to train (since they involve a ton of parameters), so budget accordingly.
- ViT models are a bit hard to interpret (even more so than regular convnets).
- Finally, while in this notebook we will train a transformer from scratch, ViT models in practice are almost always pre-trained on some large dataset (such as ImageNet) before being transferred onto specific training datasets.

Setup

As usual, we start with basic data loading and preprocessing.

```python
!pip install einops
```

```
Requirement already satisfied: einops in /opt/conda/lib/python3.10/site-packages (0.7.0)
```

```python
import torch
from torch import nn, einsum
import torch.nn.functional as F
from torch import optim

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

import numpy as np
import torchvision
import time

torch.manual_seed(42)

DOWNLOAD_PATH = '/data/fashionmnist'
BATCH_SIZE_TRAIN = 100
BATCH_SIZE_TEST = 1000

transform_fashionmnist = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5,), (0.5,))
])

train_set = torchvision.datasets.FashionMNIST(DOWNLOAD_PATH, train=True, download=True,
                                              transform=transform_fashionmnist)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=BATCH_SIZE_TRAIN, shuffle=True)

test_set = torchvision.datasets.FashionMNIST(DOWNLOAD_PATH, train=False, download=True,
                                             transform=transform_fashionmnist)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=BATCH_SIZE_TEST, shuffle=True)
```

The ViT Model

We will now set up the ViT model. There will be 3 parts to this model:

- A "patch embedding" layer that takes an image and tokenizes it. There is some amount of tensor algebra involved here (since we have to slice and dice the input appropriately), and the einops package is helpful. We will also add learnable positional encodings as parameters. (A minimal sketch of this tokenization step is shown right after this outline.)
- A sequence of transformer blocks. This will be a smaller-scale replica of the original ViT, except that we will only use 6 blocks in our model (instead of 32 in the actual ViT).
- A (dense) classification layer at the end.

Further, each transformer block consists of the following components:

- A self-attention layer with H heads.
- A one-hidden-layer (dense) network to collapse the various heads. For the hidden neurons, the original ViT used something called a GELU activation function, which is a smooth approximation to the ReLU. For our example, regular ReLUs seem to be working just fine. The original ViT also used Dropout, but we won't need it here.
- Layer normalization preceding each of the above operations.

Some care needs to be taken in making sure the various dimensions of the tensors are matched.
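Before the full model, here is a quick illustration of the patch-embedding idea (this cell is a sketch added for this write-up, not part of the original notebook): einops' Rearrange carves each 28x28, single-channel image into non-overlapping 4x4 patches, giving 49 tokens of dimension 16 per image, which a linear layer then embeds. The dummy batch size below is arbitrary.

```python
# Illustrative sketch (not a cell from the original notebook): tokenizing a
# dummy batch of FashionMNIST-sized images into 4x4 patches with einops.
import torch
from einops.layers.torch import Rearrange

imgs = torch.randn(8, 1, 28, 28)   # (batch, channels, height, width)
to_patches = Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=4, p2=4)
tokens = to_patches(imgs)
print(tokens.shape)                # torch.Size([8, 49, 16]): 49 patch tokens of dimension 16
```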
```python
def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads = 4, dim_head = 64, dropout = 0.1):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.attend = nn.Softmax(dim = -1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        b, n, _, h = *x.shape, self.heads
        # project to queries, keys, values and split out the heads
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)

        # scaled dot-product attention
        dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = self.attend(dots)

        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x
```
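Since the dimensions need to be matched carefully, a quick shape check can confirm that the attention and transformer modules map a (batch, tokens, dim) tensor back to the same shape. This cell is added here for illustration and is not part of the original notebook; the dummy tensor sizes are arbitrary.

```python
# Illustrative sanity check (not from the original notebook): the Attention and
# Transformer modules should preserve the (batch, tokens, dim) shape.
x = torch.randn(2, 50, 64)                                # 1 cls token + 49 patch tokens, dim=64
attn = Attention(dim=64, heads=4, dim_head=64)
print(attn(x).shape)                                      # torch.Size([2, 50, 64])

blocks = Transformer(dim=64, depth=6, heads=4, dim_head=64, mlp_dim=256)
print(blocks(x).shape)                                    # torch.Size([2, 50, 64]); shape preserved
```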
```python
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim,
                 pool = 'cls', channels = 3, dim_head = 64, dropout = 0.1, emb_dropout = 0.1):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, \
            'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape

        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)

        x = self.transformer(x)

        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]

        x = self.to_latent(x)
        return self.mlp_head(x)

model = ViT(image_size=28, patch_size=4, num_classes=10, channels=1,
            dim=64, depth=6, heads=4, mlp_dim=256)
optimizer = optim.Adam(model.parameters(), lr=0.002)
```

Let's see what the model looks like.

```python
model
```

```
ViT(
  (to_patch_embedding): Sequential(
    (0): Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=4, p2=4)
    (1): Linear(in_features=16, out_features=64, bias=True)
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (transformer): Transformer(
    (layers): ModuleList(
      (0-5): 6 x ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (to_qkv): Linear(in_features=64, out_features=768, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=256, out_features=64, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=64, out_features=256, bias=True)
              (1): GELU(approximate='none')
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=256, out_features=64, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
    )
  )
  (to_latent): Identity()
  (mlp_head): Sequential(
    (0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (1): Linear(in_features=64, out_features=10, bias=True)
  )
)
```

This is it: 6 transformer blocks, followed by a linear classification layer. Let us quickly see how many trainable parameters are present in this model.

```python
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(model))
```
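The notebook goes on to train this model from scratch. As a rough guide, a minimal training and evaluation loop consistent with the objects defined above (the Adam optimizer, the FashionMNIST loaders, and a cross-entropy loss) might look like the sketch below; the device handling and the choice of 5 epochs are illustrative assumptions, not taken from the original notebook.

```python
# Illustrative training/evaluation loop (a sketch, not the notebook's own cells).
# It reuses model, optimizer, train_loader, and test_loader defined above;
# the device handling and epoch count are assumptions for illustration.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def train_epoch(model, loader, optimizer):
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()

def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

for epoch in range(5):                        # epoch count chosen arbitrarily for the sketch
    start = time.time()
    train_epoch(model, train_loader, optimizer)
    acc = evaluate(model, test_loader)
    print(f'epoch {epoch + 1}: test accuracy {acc:.4f} ({time.time() - start:.1f}s)')
```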