How can one reproduce the original papers' CIFAR10 results with a fully-MLP architecture?
I found at least three MLP-only architectures for computer vision (they avoid the attention mechanism) that report very high results on ImageNet (70%+) and on other benchmarks such as CIFAR10 and CIFAR100: “g-mlp”, “MLP-Mixer”, and “do-you-even-need-attention” (paper 1 code, paper 2 code, paper 3 code).
MultiScale Vision Transformer tensor shape mismatch issue
There seems to be a tensor shape mismatch in the MultiScale Vision Transformer. Does anyone know how to resolve it?
Is there a method to improve the performance of ViT?
I want to train a vision transformer on CIFAR10. I tried tuning the hyperparameters to improve the accuracy, but the accuracy is still poor. Are there any suggestions for improving my model? I also tried loading weights from a ViT pretrained on ImageNet, but it doesn’t work. Thank you in advance.
What is the best way to recover an image from its CLIP features?
Suppose we have an image of size torch.Size([1, 3, 336, 336]) and encode it with CLIP into features of size torch.Size([1, 577, 1024]). How can we recover the original image from this latent feature map?
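For anyone wondering where the 577 comes from, here is a minimal sketch of the token arithmetic. It assumes a ViT-style encoder with 14×14 patches and one extra class token, which is consistent with the shapes quoted in the question but is not stated there. The function name `vit_token_count` is just illustrative.

```python
def vit_token_count(image_size: int, patch_size: int, has_class_token: bool = True) -> int:
    """Number of tokens a ViT-style encoder emits for a square image.

    Assumes non-overlapping square patches and, optionally, one class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 336 // 14 = 24
    num_patches = patches_per_side ** 2           # 24 * 24 = 576
    return num_patches + (1 if has_class_token else 0)


# Assumed patch size of 14 for a 336x336 input.
print(vit_token_count(336, 14))
```

Under that assumption, the first token is the class token and the remaining 576 tokens form a 24×24 grid of patch features, each of width 1024. Any decoder trained to reconstruct the image would most likely work from that grid.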
Fine-tuning LayoutLMv2 for document question answering using custom data
I want to fine-tune LayoutLMv2 for document question answering on custom data. Can somebody help me with how to prepare the data for this task?
How do I train my image recognition model to work like a reward-punishment system, where I can tell it who the person it couldn’t recognize is?
I am researching methods to build an attendance system in which the professor takes a few photos (2 to 3) and uploads them to an app, which automatically marks attendance for about 80 students. I have limited training data, which is the biggest drawback and the main issue we need to overcome. I have built a basic model that trains and marks attendance.