Medical Conversation and Diagnosis Chatbot: Conversation (A)

This blog post shows you how to build your own medical chatbot. It has two parts:
A. Medical Conversation with Patients
B. Symptom Extraction and Disease Diagnosis

You can find the code in the following sections:

Web Crawler
NLP SOTA Models: Conversation
Train Your Medical Conversational Model
Test Your Model

Web Crawler

This first half builds a medical dialogue system that responds to patients automatically, generating responses that are consistent with the medical context and medically meaningful, conditioned on the dialogue history¹,².

According to Gururangan et al.³, when tailoring a pretrained model to the domain of a target task, a second phase of in-domain pretraining leads to performance gains. MedDialog⁴ provides 3.4 million annotation-free conversations between patients and doctors, which can be used for this kind of domain-adaptive pretraining.

There are two available medical dialogue datasets that you can use directly to train your own NLP model: the MedDialog dataset (Chinese) and the MedDialog dataset (English). The authors developed their own web crawlers, but you will probably need to update that code, because the source websites change frequently.

If you want to develop your own web crawler, you can try our framework. There is an online medical diagnosis and treatment suggestion system in China on which doctors chat with patients in real time. We developed a web crawler to extract the medical dialogues between patients and doctors from this website because, as far as we know, no other suitable corpus exists for training a chatbot that responds instantly. The extracted dialogues are finally saved as JSON files.

Step 1:

China has 34 province-level regions, and the list provs in Line 34 enumerates them.
Extract the following fields from this website and save the result as ‘step-1-result.csv’:
医生姓名 (Doctor’s name),
医院科室 (Faculties),
信息中心页 (Information Center Links),
地区 (Regions).

In step-1-doctorReferLinks.py, Lines 147 to 150, you can choose which categories to crawl, for example mental or psychological illnesses (精神, 心理):

includedDepartmentKeywords = ['精神', '心理']
excludedDepartmentKeywords = []
includedHospitalKeywords   = ['精神', '心理']
excludedHospitalKeywords   = []
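
How such include/exclude lists might be applied is sketched below. This is a minimal illustration, not the code from step-1-doctorReferLinks.py: the helper name matches_keywords, the sample department names, and the matching rule (an empty include list accepts everything, and a single excluded keyword rejects a name) are all assumptions.

```python
def matches_keywords(name, included, excluded):
    """Return True if `name` passes the include/exclude keyword lists.

    Assumed rule: an empty `included` list accepts every name, and one
    excluded keyword found in the name is enough to reject it.
    """
    if any(keyword in name for keyword in excluded):
        return False
    if not included:
        return True
    return any(keyword in name for keyword in included)


includedDepartmentKeywords = ['精神', '心理']
excludedDepartmentKeywords = []

# Hypothetical department names, for illustration only.
departments = ['精神科', '心理咨询科', '骨科']
selected = [
    department for department in departments
    if matches_keywords(department, includedDepartmentKeywords, excludedDepartmentKeywords)
]
print(selected)
```

The same helper could be reused for the hospital-level lists, so that a row is kept only when both the department and the hospital pass their filters.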

Step 2:

You can extract further information with step-2-doctorMoreLinks.py. It takes the file ‘step-1-result.csv’ saved in Step 1 as input and writes ‘step-2-result.csv’, which includes:
医生姓名 (Doctor’s name),
医生ID (Doctor’s ID),
个人网站 (Personal Webpage),
评价分享链接 (Evaluation Links),
医院科室 (Faculties),
信息中心页 (Information Center Links),
地区 (Regions).

Step 3:

Now we can build the crawler itself. First, you need an account on the website:

config = {
    "url": {
      "login": "https://passport.haodf.com/user/showlogin",
      "target": "https://www.haodf.com/bingcheng/8822504854.html"  # https://www.haodf.com/kanbing/6478777497.html
    },
    "account": {
      "username": "userID",
      "password": "password"
    }
}

And then, according to the structure of this web’s HTML, you can extract the dialogues.
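
As a rough illustration of that extraction step, the sketch below parses dialogue turns out of a page with Python's standard html.parser module and serializes them to JSON. The markup is entirely hypothetical (one p element per turn, with class "patient" or "doctor"); the real site's HTML differs and changes over time, so the tag and class names must be adapted after inspecting the actual pages. Logging in and downloading the page are not shown.

```python
import json
from html.parser import HTMLParser


class DialogueParser(HTMLParser):
    """Collect dialogue turns from a HYPOTHETICAL layout in which each turn
    is <p class="patient">...</p> or <p class="doctor">...</p>."""

    def __init__(self):
        super().__init__()
        self.turns = []
        self._speaker = None  # speaker of the turn currently being read
        self._chunks = []     # text fragments of that turn

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "p" and css_class in ("patient", "doctor"):
            self._speaker = css_class
            self._chunks = []

    def handle_data(self, data):
        if self._speaker is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._speaker is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.turns.append({"speaker": self._speaker, "text": text})
            self._speaker = None


# Made-up HTML standing in for a downloaded consultation page.
sample_html = """
<div>
  <p class="patient">clubfoot inversion.</p>
  <p class="doctor">Use plaster to correct and treat as soon as possible.</p>
  <p class="note">This paragraph is not a dialogue turn.</p>
</div>
"""

parser = DialogueParser()
parser.feed(sample_html)
dialogue_json = json.dumps(parser.turns, ensure_ascii=False, indent=2)
print(dialogue_json)
```

Passing ensure_ascii=False keeps Chinese characters readable in the saved JSON instead of escaping them.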

Step 4:

Translate the Chinese text into English. An efficient way is to use pygtrans, because it supports higher throughput than other packages. In line 19, you can raise the batch threshold according to the translation speed (keep it below 100000).

from pygtrans import Translate
import json
from time import sleep
client = Translate()
if __name__ == "__main__":

    translated_data = []
    raw_data = []
    flag_num = 0
    flag_num_IN = 0
    data_dumped = []
    print('START ########################################################## START')
    with open('train_data.json', 'rb') as json_file:
        student_loaded = json.load(json_file)
        for dialogue_seg in student_loaded:
            for dialogue_data in dialogue_seg:
                flag_num += 1
                raw_data.append(dialogue_data)
                if (flag_num >= 10000) or (dialogue_seg == student_loaded[-1] and dialogue_data == dialogue_seg[-1]):
                    print('IN ########################################################## IN')
                    flag_num_IN += 1
                    print(flag_num_IN)
                    data = client.translate(raw_data, target='en')
                    for i_sentence in range(flag_num):
                        translated_data.append(data[i_sentence].translatedText)

                    flag_num = 0
                    raw_data = []
                    sleep(3)

    print('END ########################################################## END')

    with open('train_data_translated.json', 'w', encoding='utf-8') as data_dumped:
        json.dump(translated_data, data_dumped)
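
The batching logic in the script above can also be separated from the translation client, which makes it easier to test and to change the batch size. The sketch below is one way to do that, not the author's implementation: translate_in_batches and fake_translate are hypothetical names, and the stand-in translator only upper-cases its input so the example runs without network access.

```python
def translate_in_batches(sentences, translate_batch, batch_size=10000):
    """Translate `sentences` in chunks of at most `batch_size` items.

    `translate_batch` is any callable that maps a list of strings to a
    list of translated strings of the same length.
    """
    translated = []
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        result = translate_batch(chunk)
        if len(result) != len(chunk):
            raise ValueError("translator returned a different number of items")
        translated.extend(result)
    return translated


def fake_translate(batch):
    """Stand-in for a real translator; it only upper-cases the text."""
    return [text.upper() for text in batch]


# Flatten nested dialogues into one list of sentences, as the script above does.
dialogues = [["a", "b"], ["c"], ["d", "e"]]
flat = [sentence for dialogue in dialogues for sentence in dialogue]
print(translate_in_batches(flat, fake_translate, batch_size=2))
```

With pygtrans, translate_batch could wrap the same call the script uses, for example lambda batch: [item.translatedText for item in client.translate(batch, target='en')], with a sleep between batches if the service throttles requests.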

NLP SOTA Models: Conversation

You can find more text generation models on Hugging Face. In this blog, we used GPT-2 and T5 to train a medical conversational NLP model and a diagnosis model, respectively.

Transformers:

bertGPT:

GPT-2:

T5:

Train Your Medical Conversational Model

Once collected and translated, the medical dialogues look like this:

Patient: clubfoot inversion.
Doctor: Use plaster to correct and treat as soon as possible, the sooner the better the effect.

Patient: Skin beauty, there is a mole on the lip. In 2007, I had a laser in the old home town hospital, and then it became bigger. I had an electric cautery in a beauty salon 5 months ago, and it became bigger again. I want to consult Dr. Shen and I Is it still possible to remove moles in this case? Can it be cured in your hospital? At first it was the size of a needle tip, but now it is the size of a mung bean!
Doctor: It can be surgery or laser.

Patient: Seborrheic dermatitis, there are many small spots in the T-shaped area, which are not easy to see in sunlight, but are particularly conspicuous in a brighter environment. Especially under the eyes and the bridge of the nose, it feels like a lot of pockmarks, and it is not smooth to the touch.
Doctor: It is recommended to go to the hospital for a diagnosis and then to treat it.
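
Before training, text in this form has to be turned into model inputs. The sketch below shows one plausible preprocessing step, under stated assumptions: the function names are hypothetical, and the format of one string per exchange terminated by GPT-2's end-of-text token is a common convention rather than something confirmed by the training script in the repository.

```python
def parse_dialogues(raw_text):
    """Split blank-line-separated 'Patient: ... / Doctor: ...' blocks into dicts."""
    pairs = []
    for block in raw_text.strip().split("\n\n"):
        turn = {}
        for line in block.splitlines():
            # Split on the first colon only, so colons inside the utterance survive.
            speaker, separator, utterance = line.partition(":")
            if separator and speaker.strip() in ("Patient", "Doctor"):
                turn[speaker.strip().lower()] = utterance.strip()
        if "patient" in turn and "doctor" in turn:
            pairs.append(turn)
    return pairs


def to_training_text(pair, eos="<|endoftext|>"):
    """Join one exchange into a single training string (assumed format)."""
    return f"Patient: {pair['patient']}\nDoctor: {pair['doctor']}{eos}"


raw = (
    "Patient: clubfoot inversion.\n"
    "Doctor: Use plaster to correct and treat as soon as possible, "
    "the sooner the better the effect.\n"
    "\n"
    "Patient: Skin beauty, there is a mole on the lip.\n"
    "Doctor: It can be surgery or laser."
)

pairs = parse_dialogues(raw)
for pair in pairs:
    print(to_training_text(pair))
```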

You can find the training source code in my GitHub repository, Medical-Dialogue-Chatbot.

This is the Slurm batch file, which sets up the environment and starts training:

#!/bin/sh
#SBATCH --partition=gpu
#SBATCH --time=24:00:00
#SBATCH --gres=gpu:4
#SBATCH --mem-per-gpu=32000
#SBATCH --job-name=GPT2-EN
#SBATCH --err=GPT2_EN.err 
#SBATCH --out=GPT2_EN.out 
#SBATCH --mail-user=kangkangsome@gmail.com    # where to send info about the job
#SBATCH --mail-type=ALL              # what to send, valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd
ml load Python/3.7.2-fosscuda-2019a
cd /Medical-Dialogue-Chatbot/
source TestEnvMDS/bin/activate
cd gpt2/
python gpt2_train_EN.py

Test Your Model

We crawled medical dialogues from two websites and merged them with MedDialog and MedDG. Although the chatbot trained with GPT-2 can generate reasonable responses, its trustworthiness and the practical usability of the generated answers are still being tested.

Example 1:

USER: Seborrheic dermatitis, there are many small spots in the T-shaped area, which are not easy to see in sunlight, but are particularly conspicuous in a brighter environment. Especially under the eyes and the bridge of the nose, it feels like a lot of pockmarks, and it is not smooth to the touch.
CHATBOT: Sometimes the bumps are a little more painful, but it is also quite normal as well. It’s normal with acne because you’re not expecting. It’s normal at that point. If you are like me and were like me, it never occurred to us that all this scarring was inevitable. In all likelihood, we are pretty normal when we’re acne prone. I don’t think it was easy to cut all the bumps. In fact, I did it for a few years before having
BENCHMARK: It is recommended to go to the hospital for a diagnosis and then to treat it.

Example 2:

USER: I have undergone surgery for congenital heart disease. The child has suffered from syncope seven times. There is no cause for the onset. He just said that he felt uncomfortable before the attack, and then slowly fell down and lost consciousness.
CHATBOT: This was my youngest child. It is one of the more serious types of congenital heart disease, caused by a bacterial infection. This is extremely painful for him. It can lead to sudden death; he is unable to see and cannot keep up with the movements of his limbs. Tamil Nadu in 1992. Tamil Nadu, 2004 Indian Institute of Medical Sciences, Chennai Tamil Nadu Hospital, Chennai Tamil Nadu: Tamil Nadu. Tamil Nadu: The child who was born to me has also been seriously
BENCHMARK: It is recommended to come to our hospital for further examination such as upright tilt test and flat panel test.
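
One rough way to put a number on the gap between a generated answer and the doctor's reference answer is a token-level F1 score. The sketch below is an illustration only: this metric is not the evaluation used in this post, the function name token_f1 is hypothetical, and surface word overlap says nothing about whether an answer is medically correct.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-overlap F1 between two strings (lower-cased, whitespace-split)."""
    candidate_tokens = candidate.lower().split()
    reference_tokens = reference.lower().split()
    if not candidate_tokens or not reference_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


reference_answer = "It is recommended to go to the hospital for a diagnosis and then to treat it."
generated_answer = "Sometimes the bumps are a little more painful, but it is also quite normal as well."
score = token_f1(generated_answer, reference_answer)
print(f"token F1: {score:.3f}")
```

A low score flags answers that drift away from the reference, as in the two examples above, but a human review is still needed to judge medical soundness.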

References:

[1] Wenge Liu, Jianheng Tang, Jinghui Qin, Lin Xu, Zhen Li, Xiaodan Liang. MedDG: A Large-scale Medical Consultation Dataset for Building Medical Dialogue System. arXiv, 2020.

[2] Shuai Lin, Pan Zhou, Xiaodan Liang, Jianheng Tang, Ruihui Zhao, Ziliang Chen, Liang Lin. Graph-Evolving Meta-Learning for Low-Resource Medical Dialogue Generation. In AAAI, 2021.

[3] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In ACL, 2020.

[4] Guangtao Zeng, Wenmian Yang, Zeqian Ju, Yue Yang, Sicheng Wang, Ruisi Zhang, Meng Zhou, Jiaqi Zeng, Xiangyu Dong, Ruoyu Zhang, Hongchao Fang, Penghui Zhu, Shu Chen, Pengtao Xie. MedDialog: Large-scale Medical Dialogue Datasets. In EMNLP, 2020.