News
MaxVit Encoder: We leverage the MaxVit (Vision Transformer) model from the TIMM (PyTorch Image Models) library as the encoder.It is pre-trained on the ImageNet dataset and serves as a robust feature ...
This project aims to develop an Image Caption Generator using deep learning techniques. The model utilizes a pre-trained EfficientNet for feature extraction and a Transformer-based encoder-decoder ...
Image captioning refers to the process of creating a natural language description for one or more images. This task has several practical applications, from aiding in medical diagnoses through image ...
The model comprises a Swin Transformer encoder and a SegFormer decoder. The input image is initially divided into non-overlapping patches by the Patch Partition module, and then it proceeds through ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results