A Lightweight Visual Understanding System for Enhanced Assistance to the Visually Impaired Using an Embedded Platform
Keywords:
Visually Impaired, Assistive Technologies, Deep Learning, Transformers, CNN, Video Description, Jetson Nano
Abstract
Visually impaired individuals often face significant challenges in navigating their environments due to limited access to visual information. To address this issue, a portable, cost-effective assistive tool is proposed that operates on a low-power embedded system such as the Jetson Nano. The novelty of this research lies in developing an efficient, lightweight video captioning model under constrained resources to ensure compatibility with embedded platforms. The work aims to enhance the autonomy and accessibility of visually impaired people by providing audio descriptions of their surroundings through the processing of live-streamed video. The proposed system comprises two lightweight deep learning modules: an object detection module based on the state-of-the-art YOLOv7 model, and a video captioning module that combines the Video Swin Transformer and a 2D CNN for feature extraction with a Transformer network for caption generation. The object detection module provides real-time identification of multiple objects in the user's surroundings, while the video captioning module produces detailed descriptions of entire visual scenes and activities, including objects, actions, and the relationships between them. The user interacts with the system through a headset, issuing a specific audio command to trigger the desired module, either object detection or video captioning, and receives an audio description of the visual content in return. The system demonstrates satisfactory results, achieving inference times of 0.11 to 1.1 seconds for object detection and 0.91 to 1.85 seconds for video captioning, evaluated through both quantitative metrics and subjective assessments.
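As a rough illustration of the interaction flow described above, the sketch below shows how such a voice-triggered pipeline might be organized on the device. All component names (record_audio, recognize_command, speak, ObjectDetector, VideoCaptioner) are hypothetical placeholders standing in for the modules described in the abstract, not the authors' implementation.

```python
"""Minimal sketch of the voice-triggered assistance loop, under the assumption
that OpenCV handles camera capture. Every component below is a placeholder."""

import cv2  # camera capture (assumed available on the embedded platform)


def record_audio():
    """Placeholder: capture a short audio chunk from the headset microphone."""
    return b""


def recognize_command(audio_chunk):
    """Placeholder: map the audio chunk to 'detect', 'describe', or None."""
    return None


def speak(text):
    """Placeholder: synthesize speech and play it through the headphones."""
    print(text)


class ObjectDetector:
    """Stands in for a lightweight YOLOv7 variant running per-frame inference."""

    def detect(self, frame):
        # Would return detected class labels, e.g. ["person", "chair", "door"].
        return []


class VideoCaptioner:
    """Stands in for the Video Swin Transformer + 2D-CNN encoder feeding a
    Transformer decoder that generates a one-sentence scene description."""

    def caption(self, clip_frames):
        # Would return a sentence covering objects, actions, and relations.
        return ""


def assistance_loop(camera_index=0, clip_len=16):
    cap = cv2.VideoCapture(camera_index)
    detector, captioner = ObjectDetector(), VideoCaptioner()
    while cap.isOpened():
        command = recognize_command(record_audio())
        if command == "detect":        # fast path: single-frame object detection
            ok, frame = cap.read()
            if ok:
                speak(", ".join(detector.detect(frame)) or "nothing detected")
        elif command == "describe":    # slower path: clip-level video captioning
            frames = [cap.read()[1] for _ in range(clip_len)]
            speak(captioner.caption(frames) or "no description available")


if __name__ == "__main__":
    assistance_loop()
```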
Copyright (c) 2024 Adel Jalal Yousif, Mohammed H. Al-Jammas
This work is licensed under a Creative Commons Attribution 4.0 International License.