News

MaxVit Encoder: We leverage the MaxVit (Vision Transformer) model from the TIMM (PyTorch Image Models) library as the encoder.It is pre-trained on the ImageNet dataset and serves as a robust feature ...
This project aims to develop an Image Caption Generator using deep learning techniques. The model utilizes a pre-trained EfficientNet for feature extraction and a Transformer-based encoder-decoder ...
Image captioning refers to the process of creating a natural language description for one or more images. This task has several practical applications, from aiding in medical diagnoses through image ...