Cross-Terminal Intelligent Diagnosis and Treatment System Based on Multimodal Large Language Models

Yuyuan Li; Jingang Shi; Xiaolei Li; Xinyu Liu; Shouhe Lang

Published: May 15, 2026

Keywords:

general medical auxiliary diagnosis; image segmentation; multimodal large language models; retrieval-augmented generation; domestic adaptation; cross-terminal collaboration

Yuyuan Li

Shenyang Jianzhu University

Jingang Shi

Shenyang Jianzhu University

Xiaolei Li

Shenyang Jianzhu University

Xinyu Liu

Shenyang Jianzhu University

Shouhe Lang

Shenyang Jianzhu University

Abstract

Current medical auxiliary diagnosis faces several prominent challenges: inaccurate tongue image feature extraction, poor disease adaptability, a lack of cross-terminal collaboration, and heavy dependence on foreign core technologies. To address them, a cross-terminal intelligent diagnosis and treatment system based on multimodal large language models was designed and implemented, with traditional Chinese medicine (TCM) tongue diagnosis as the core application scenario. SAM-2, fine-tuned with lightweight LoRA adapters, performs pixel-level segmentation of tongue images with 97.2% accuracy. A heterogeneous feature extraction architecture that fuses ResNet and Vision Transformer features is proposed; it combines three layers of information (tongue body, tongue coating, and tongue texture) and raises disease prediction accuracy to 84%. The Qwen3-VL multimodal large language model, combined with Retrieval-Augmented Generation (RAG), forms an interpretable disease prediction engine with a retrieval precision of 61%. The full stack is deployed on a domestic hardware platform built on Kunpeng CPUs and Ascend NPUs, where inference runs at 20 tokens/s. Experimental results show that the system performs well in accuracy, interpretability, and domestic adaptation, supporting the feasibility and efficiency of domestic hardware and software stacks for complex multimodal large-model tasks.
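As a rough illustration of the LoRA idea named in the abstract, the sketch below shows the low-rank update in plain Python: a frozen weight matrix W is augmented with a trainable product B·A scaled by alpha/r, so only A and B need to be trained. The abstract does not state which SAM-2 layers are adapted, the rank, or the scaling factor, so every size and value here (`d_in`, `d_out`, `r`, `alpha`) is a hypothetical placeholder, and the helper names are illustrative rather than taken from the system.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the adapter rank."""
    r = len(a)                       # A has shape (r, d_in)
    base = matvec(w, x)              # frozen pretrained path
    delta = matvec(b, matvec(a, x))  # low-rank trainable path
    return [y + (alpha / r) * d for y, d in zip(base, delta)]


def trainable_params(d_out, d_in, r):
    """Return (adapter parameter count, dense weight parameter count)."""
    return r * (d_in + d_out), d_out * d_in


# Hypothetical toy sizes; real layers are far larger.
rng = random.Random(0)
d_in, d_out, r, alpha = 8, 6, 2, 4.0

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero-initialised
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero, the adapted layer starts out identical
# to the frozen layer, which is the standard LoRA initialisation.
y0 = lora_forward(w, a, b, x, alpha)

adapter_count, dense_count = trainable_params(d_out, d_in, r)
print(adapter_count, dense_count)
```

The parameter saving grows with layer size: the adapter costs r·(d_in + d_out) values against d_out·d_in for the dense weight, which is why the approach is described as lightweight.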

How to Cite

Li, Y., Shi, J., Li, X., Liu, X., & Lang, S. (2026). Cross-Terminal Intelligent Diagnosis and Treatment System Based on Multimodal Large Language Models. Journal of Research in Multidisciplinary Methods and Applications, 5(5), 01260505005. Retrieved from http://www.satursonpublishing.com/jrmma/article/view/a01260505005

Issue

Vol. 5 No. 5 (2026)

Section

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
