Cross-Terminal Intelligent Diagnosis and Treatment System Based on Multimodal Large Language Models

Yuyuan Li
Jingang Shi
Xiaolei Li
Xinyu Liu
Shouhe Lang

Abstract

Current medical auxiliary diagnosis faces several prominent challenges: inaccurate tongue image feature extraction, poor disease adaptability, lack of cross-terminal collaboration, and high dependence on foreign core technologies. To address them, a cross-terminal intelligent diagnosis and treatment system based on multimodal large language models was designed and implemented, with traditional Chinese medicine (TCM) tongue diagnosis as the core application scenario. SAM-2 with lightweight LoRA fine-tuning performs pixel-level segmentation of tongue images with 97.2% accuracy. A heterogeneous feature extraction architecture combining ResNet and Vision Transformer is proposed; it fuses three layers of information (tongue body, tongue coating, and tongue texture) and raises disease prediction accuracy to 84%. The Qwen3-VL multimodal large language model, integrated with Retrieval-Augmented Generation (RAG), forms an interpretable disease prediction engine with a retrieval precision of 61%. Full-stack deployment is completed on a domestic hardware platform of Kunpeng CPUs and Ascend NPUs, reaching an inference speed of 20 tokens/s. Experimental results indicate that the system performs well in accuracy, interpretability, and adaptation to domestic hardware, supporting the feasibility and efficiency of domestic hardware and software stacks for complex multimodal large-model tasks.
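The abstract reports a retrieval precision of 61% for the RAG component but does not state how the figure is computed. The sketch below shows the conventional definition only, assuming that precision is the fraction of retrieved passages judged relevant, averaged over queries. The function names and the passage identifiers are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Sequence, Tuple


def retrieval_precision(retrieved: Sequence[str], relevant: Iterable[str]) -> float:
    """Fraction of retrieved passage ids that appear in the relevant set.

    Assumption: this is the standard precision definition; the paper may
    use a different variant (for example precision at a fixed cutoff k).
    """
    if not retrieved:
        return 0.0  # nothing retrieved: define precision as 0 to avoid 0/0
    relevant_set = set(relevant)
    hits = sum(1 for passage_id in retrieved if passage_id in relevant_set)
    return hits / len(retrieved)


def mean_retrieval_precision(
    runs: Sequence[Tuple[Sequence[str], Iterable[str]]]
) -> float:
    """Average per-query precision over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    scores: List[float] = [retrieval_precision(r, g) for r, g in runs]
    return sum(scores) / len(scores)


# Hypothetical example: two queries against a TCM knowledge base.
example_runs = [
    (["p1", "p2", "p3", "p4"], {"p1", "p3", "p9"}),  # 2 of 4 retrieved are relevant
    (["p5", "p6"], {"p5", "p6"}),                    # 2 of 2 retrieved are relevant
]
example_score = mean_retrieval_precision(example_runs)
```

Under this definition, a reported value of 61% would mean that, on average, about six in ten passages handed to the language model were judged relevant to the query.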

Article Details

How to Cite
Li, Y., Shi, J., Li, X., Liu, X., & Lang, S. (2026). Cross-Terminal Intelligent Diagnosis and Treatment System Based on Multimodal Large Language Models. Journal of Research in Multidisciplinary Methods and Applications, 5(5), 01260505005. Retrieved from http://www.satursonpublishing.com/jrmma/article/view/a01260505005

References

Ravi N, Gabeur V, Hu Y T, et al. SAM 2: Segment Anything in Images and Videos[EB/OL]. arXiv:2408.00714, 2024.

Qwen Team. Qwen3-VL Technical Report[EB/OL]. arXiv:2511.21631, 2025.

He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.

Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale[C]//International Conference on Learning Representations. [S.l.]: OpenReview, 2021.

Wu X, Xu H, Lin Z S, et al. A Survey of Deep Learning in Tongue Image Classification[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(2): 303-323.

Hu E J, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models[EB/OL]. arXiv:2106.09685, 2021.

Edge D, Trinh H, Cheng N, et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization[EB/OL]. arXiv:2404.16130, 2024.

Huang S Q, Zhang Y L, Zhou J, et al. A Brief Discussion on Objectification, Quantification, and Standardization of TCM Tongue Diagnosis[J]. China Journal of Traditional Chinese Medicine and Pharmacy, 2017, 32(4): 1625-1627.

Jiang Y C, Fan C L, Ming X, et al. Design of Integrated TCM Tongue Image Acquisition and Analysis System[J]. Computer Measurement & Control, 2018, 26(1): 222-225.

Dettmers T, Pagnoni A, Holtzman A, et al. QLoRA: Efficient Finetuning of Quantized LLMs[C]//Advances in Neural Information Processing Systems. Red Hook: Curran Associates, 2023, 36: 10088-10115.