RESEARCH ARTICLE
Understanding traditional Chinese medicine via statistical learning of expert-specific Electronic Medical Records
Yang Yang1,2,†, Qi Li1,†, Zhaoyang Liu1, Fang Ye3, Ke Deng1,*
1 Center for Statistical Science & Department of Industry Engineering, Tsinghua University, Beijing 100084, China
2 Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
3 Zhou Zhongying’s Studio, Nanjing University of Chinese Medicine, Nanjing 210046, China
* Correspondence: [email protected]
Received August 28, 2018; Revised January 16, 2019; Accepted March 26, 2019
Background: Traditional Chinese medicine (TCM) has recently attracted considerable attention from various disciplines. However, TCM remains mysterious because of its unique philosophy and theoretical thinking. Owing to the lack of high-quality data, a thorough understanding of TCM faces critical challenges. In this study, we introduce the Zhou Archive, a large-scale database of expert-specific Electronic Medical Records (EMRs) containing information about 73,000+ visits to one TCM doctor over 35 years. Covering the full spectrum of the diagnosis-treatment model behind TCM practice, the archive provides an opportunity to understand TCM from a data-driven perspective.
Methods: Through a series of data processing steps applied to the text data in the archive, we transformed the semi-structured EMRs into a well-structured feature table. Based on this structured feature table, a series of statistical analyses are implemented to learn principles of TCM clinical practice from the archive, including correlation analysis, enrichment analysis, embedding analysis and association pattern discovery.
Results: A structured feature table of 14,000+ features is generated at the end of the proposed data processing procedure, with a feature codebook, a term dictionary and a term-feature map as byproducts. Statistical analysis of the feature table reveals underlying principles of the diagnosis-treatment model of TCM, helping us better understand TCM practice from a data-driven perspective.
Conclusion: Expert-specific EMRs provide opportunities to understand TCM from a data-driven perspective. Taking advantage of recent progress in NLP for Chinese, we can process a large number of TCM EMRs efficiently to gain insights via statistical analysis.
Keywords: TCM; EMRs; data-driven perspective; Chinese text mining; statistical analysis
Author summary: Traditional Chinese medicine (TCM) is attracting more and more attention from various disciplines. But TCM is still mysterious due to its unique philosophy, model and theoretical thinking. In this paper, we introduce the Zhou Archive, a large-scale database of expert-specific Electronic Medical Records (EMRs) containing visits to one TCM doctor. We transform the original EMRs into a well-structured feature table using multiple data processing tools. Based on this structured feature table, a series of statistical analyses are implemented to learn principles of TCM clinical practice, which reveal insights for understanding TCM from a data-driven perspective.
INTRODUCTION
Traditional Chinese medicine (TCM) has a long history of over 2,000 years and once played an important role in healthcare in pre-modern East Asia. As an important branch of alternative medicine, it has become increasingly popular worldwide in recent years and has attracted much attention from scientists of various disciplines. For example, Refs. [1–3] confirmed the unique treatment effects of acupuncture; Refs. [4–6] provided insights on how TCM prescriptions work via systematic interactions with biological regulation networks; and the 2015 Nobel Prize awarded to Prof. Youyou Tu for her contribution to the discovery of artemisinin in 1977 cast light on the great impact of TCM on human beings.
On the other hand, however, TCM is still mysterious to
many people because of the unique philosophy, model and theoretical thinking behind it. Like any other healthcare system, TCM contains three basic components: (i) a toolbox of therapeutic technologies to treat patients, (ii) biomedical measurement instruments to observe and measure the physical status of patients, and (iii) a diagnosis-treatment model (DTM) to map the biomedical observations and measurements of a patient to a “proper” therapy in the toolbox. However, due to the philosophical environment of ancient China and historical technical constraints, TCM developed these components in a unique way.
First, TCM therapies typically have complex internal structures. TCM prescriptions and acupuncture are the two primary therapies of TCM (although there are records of surgeries in the long history of TCM). A TCM prescription typically contains multiple ingredients, which may generate a mixture of hundreds of chemical compounds. An acupuncture therapy is usually composed of a series of acupunctures at different locations (called acupoints) on the patient’s body. The combinatorial or sequential nature of TCM therapies provides flexibility to tune treatment adaptively based on the status of patients, but also poses great challenges for quality control and efficacy evaluation of TCM therapies. Second, due to historical technology constraints, biomedical measurements in TCM depend heavily on doctors’ subjective observations and rely on natural language to convey experience. The combination of subjective observations and flexible natural language may introduce multiple levels of bias and noise into the measurements, leading to critical technical barriers in data analysis. Third, built on top of Chinese philosophy, the diagnosis-treatment model of TCM is described in a unique language involving many philosophical concepts from ancient China whose concrete meanings may change over time and be interpreted in different ways. This phenomenon makes it challenging to decode and understand the diagnosis-treatment model of TCM from a positive perspective.
All these features shaped TCM into a healthcare system
with a unique knowledge representation style and deduction logic, which is very different from the modern healthcare system developed in the western world on top of anatomy and cell/molecular biology. In the past decades, many efforts have been made to build connections between TCM and modern sciences, trying to evaluate, understand and reinterpret TCM in a modern way. These efforts can be roughly classified into two categories: (i) drug-discovery oriented research, which aims to identify potential drug candidates and validate them via randomized experiments [7–11]; and (ii) theory-understanding oriented research, which focuses on revealing causal mechanisms or association patterns of the diagnosis-treatment model behind TCM via data-driven approaches [4,12–18]. Although there are many difficult issues in practical implementation, drug-discovery oriented research enjoys a relatively straightforward logic. Theory-understanding oriented research, however, often faces critical challenges at both the methodology level and the data level.
At the methodology level, it is very challenging to
design data models that can precisely reflect TCM thinking and/or appropriately approximate the generating procedure of TCM data. At the data level, a major problem is the lack of high-quality data carrying stable signals about the full spectrum of TCM clinical practice. It is not difficult to find a small-scale dataset with hundreds or thousands of patients from one TCM doctor, but such a dataset is often biased toward a small patient population with a certain disease. It is also possible to assemble many small-scale datasets into a large-scale dataset, but a dataset generated in this way is often a mixture of many inconsistent components, leaving many uncontrollable risks in downstream data analysis.
In this study, we introduce the Zhou Archive, a large-scale database of expert-specific Electronic Medical Records (EMRs), which contains comprehensive information about 73,000+ visits to one TCM doctor by 26,000+ distinct patients over 35 years from 1980 to 2015. From many perspectives, the archive provides an ideal opportunity to understand TCM in a data-driven way. First, the scale of the archive is large enough to support many data-driven approaches. Second, the 73,000+ visits by 26,000+ patients cover 1,300+ diseases of 16 major disease categories, including cancers, digestive diseases, infectious diseases, neurological diseases, respiratory diseases, cardiovascular diseases, urinary diseases, rheumatism and so on, and are rich enough to reflect all aspects of TCM practice. Third, with data fields for symptoms of patients, TCM diagnosis and TCM treatment, the archive records all key components of the diagnosis-treatment model behind TCM, making it possible to decode the model in a data-driven way. Moreover, as all EMRs in the archive come from one TCM doctor alone, the underlying logic of the diagnosis-treatment model is more likely to be self-consistent,
Understanding TCM via statistical learning of expert-specific EMRs
which is extremely important to the success of data-driven approaches. Finally, in addition to classic TCM features, the archive also contains information about lab tests and diagnoses from the western medicine perspective, allowing us to connect TCM concepts with western medicine.
With the rise of medical big data and the popularity of precision medicine in recent years, real-world study based on large-scale EMRs has become an important paradigm in healthcare research [19–25]. We hope this study can open a door to this paradigm for TCM-related studies. Like most EMR data in practice, the data in the archive is a mixture of structured data fields, which encode information with a well-designed feature table, and semi-structured/unstructured data fields, which deliver information via semi-structured or free texts. To transform the original EMR data into a well-structured feature table on which statistical analysis can be implemented, we need to discover many TCM-specific and archive-specific technical terms from the archive, map them to their standard feature codes, and properly process the semi-structured and free Chinese texts in the archive to decode information effectively. In this paper, we propose a systematic data processing framework to achieve this goal.
Based on the structured feature table obtained, a series
of statistical analyses are implemented to learn principles of TCM clinical practice from the archive. Cross-category association patterns are discovered using various technical tools, and embedding analysis is applied to prescriptions and symptoms. Results from these analyses reveal insights for understanding TCM from a data-driven perspective.
The remainder of this paper is organized as follows. “Description Of The Data” briefly introduces the data structure of the Zhou Archive. “Transferring Semi-Structured EMRs Into A Structured Feature Table” proposes a data processing framework to transform the original semi-structured and unstructured data from the archive into a well-structured feature table. In “Statistical Learning Of The Structured Feature Table”, we analyze the structured feature table with a series of statistical methods and extract some hidden patterns from this database. Finally, we summarize and discuss this study in the last section.
DESCRIPTION OF THE DATA
The archived EMRs contain 14 distinct data fields of 6 categories, including: (i) Patient ID and Demographics (ID, Gender, Age), (ii) Visit Date, (iii) Clinical Features (Symptoms, Tongue Picture, Pulse Type, Lab Tests), (iv) Western Medicine Diagnosis (Disease, Disease Category), (v) TCM Diagnosis (TCM Disease, TCM Pathogenesis) and (vi) TCM Treatments (TCM Therapy, TCM Prescription).
The 14 data fields can be classified into three types: 7 structured fields encoding information with well-designed codes (Patient ID, Gender, Age, Visit Date, Disease, Disease Category, TCM Disease), 6 semi-structured fields encoding information with semi-structured texts (Tongue Picture, Pulse Type, Lab Tests, TCM Pathogenesis, TCM Therapy, TCM Prescription), and 1 unstructured field that delivers information with free texts (i.e., Symptoms). All these data fields contain missing values.
In the database, the column “Western Medicine Diagnosis” comes from the records of visits to western medicine doctors before coming to Prof. Zhou. These western medicine diagnoses were recorded by Prof. Zhou in the archive, one for each visit. In total, 1,339 distinct diseases appear in the archive, which can be further classified into 16 disease categories: Cancers, Digestive Diseases (DD), Infectious Diseases (InD), Neurological Diseases (ND), Respiratory Diseases (RD), Cardiovascular Diseases (CD), Urinary Diseases (UD), Rheumatism, Gynopathy, Skin Diseases (SD), Hematopathy, Endocrine Diseases (ED), Orthopedic Diseases (OD), Ophthalmological and Otorhinolaryngological Diseases (OOD), Men Diseases (MD), and Miscellaneous Diseases (MiD). In terms of TCM diseases, however, only 394 distinct TCM diseases appear, partially due to the higher missing rate of the TCM Disease field (88.4%) compared with the Disease field (17.5%). More detailed information about the patients covered by the archive is provided in Supplementary Figure S1A–S1E.
One third of the patients in the archive visited Prof. Zhou multiple times. These patients with longitudinal records paid 4.7 visits on average within an average time span of 242 days, and the average time gap between two adjacent visits is 65 days. Supplementary Figure S1F–S1H give the detailed distributions of visit frequency, overall time span and time gap between two adjacent visits for these patients. Researchers who are interested in this archive can check the website of Zhou Archive for TCM Study for detailed information on the data structure and data access.
TRANSFERRING SEMI-STRUCTURED EMRs INTO A STRUCTURED FEATURE TABLE
With both structured data fields and semi-structured/unstructured data fields, the original archive is difficult to analyze. In this section, we transform the semi-structured and unstructured EMRs of the archive into a well-structured feature table, on which statistical analysis can be conveniently implemented. To achieve this goal, we process the structured fields, the semi-structured fields and the unstructured Symptoms field separately with different data processing strategies.
Figure 1 shows the route map of the data processing procedure, which takes the original archive as the input and returns the following outputs: (i) a feature codebook $F$ which encodes all features generated from the archive, (ii) a term dictionary $D$ which fully covers the vocabulary specific to the archive (including all background words, common TCM terms and special terms used by Prof. Zhou), (iii) a term-feature map $M$ which links terms in $D$ to the standard feature codes they correspond to, and, most importantly, (iv) a well-organized structured feature table $T$ with columns for different features and rows for different records. Unlike the raw data in the archive, which delivers information via semi-structured and unstructured texts, the transformed two-dimensional feature table $T$ encodes information with a well-designed data format and coding system.
There are a few critical challenges in this data
processing procedure due to the semi-structured and unstructured texts in the archive. The first is text segmentation and term discovery. As there are no visible word boundaries such as spaces in Chinese texts, the unstructured Chinese texts in the Symptoms field must be segmented into sequences of meaningful terms to decode information. However, because these texts contain many domain-specific words, phrases and technical terms that are previously unknown, text segmentation is entangled with term discovery in this study. The combination of these two critical problems poses great challenges in processing the free texts in the archive. The second is standardization of technical terms. Due to the flexibility of free texts, many technical terms in the archive have multiple variants. To sufficiently extract information from the data, we need to map the different variants of a technical term to its standard code. Third, we also need to understand the semantic meaning of semi-structured and free texts in the archive to precisely decode information.
Although many tools have been invented to process Chinese texts in the past decades, it is still not trivial to overcome the above challenges in this study. Here, we
Figure 1. Flowchart to transfer the original Zhou Archive to a structured feature table.
propose an integrative data processing framework as a preliminary solution to this important but challenging problem. As the same problem will be encountered in many similar studies in the future, we hope that the framework we suggest can serve as a baseline solution for researchers in this field.
Processing the structured and semi-structured data fields
First, we process the structured and semi-structured data fields, transforming them into a feature table. Because the structured data fields already encode information with a well-designed feature codebook, it is straightforward to decode these fields to get the feature codebook $F_a$ and a feature table $T_a$.
For the semi-structured data fields, however, we need to make extra efforts to collect technical terms in these fields and transform them into their standard feature codes. Taking advantage of the existing data structure in these semi-structured fields, many technical terms can be conveniently extracted. For example, tongue/pulse-related terms and lab tests in the Clinical Features fields, terms in the TCM Pathogenesis and TCM Therapy fields, as well as herb names in the Prescription field, can be obtained in a straightforward way by enumerating Chinese strings delimited by commas or numbers in the corresponding data fields. In total, 5,000+ distinct terms are extracted in this way, forming a dictionary of terms denoted as $D_b$. Table 1A shows the most frequent terms extracted from each of these semi-structured fields.
These extracted terms need to be transformed to their
standard codes before downstream analysis can proceed. This can be achieved via two typical operations: splitting and mapping. Many terms extracted from these semi-structured fields abbreviate multiple concepts into a single term. For example, the term “taihouhuang” from the Tongue Picture field is the abbreviation of two terms, “thick tongue fur” and “yellow tongue fur”; the term “maixianhua” from the Pulse Type field is the abbreviation of “stringy pulse” and “slippery pulse”; and the term “ganshenkuixu” from the TCM Pathogenesis field is the abbreviation of “deficiency of liver” and “deficiency of kidney”. Standardization of these terms can be achieved by identifying the multiple concepts compressed in one term and listing the standard feature codes of these concepts in parallel (e.g., “taihouhuang” → “thick tongue fur, yellow tongue fur”). We call this operation “splitting”, as it divides one technical term into multiple features. On the other hand, many extracted terms refer to the same concept. For example, the term “manyigan” is an abbreviation of “chronic hepatitis B”; the terms “dashengdi” and “xishengdi” refer to the same herb, “dried rehmannia root”. Standardization of these terms can be achieved by “mapping”, i.e., building a mapping table from these terms to their standard feature codes (e.g., “dashengdi” → “dried rehmannia root”, and “xishengdi” → “dried rehmannia root”). Please note that a combination of splitting and mapping is sometimes needed to standardize a term with a complex structure.
In total, 4,000+ features are generated for the 5,000+
extracted terms, resulting in a feature codebook $F_b$. The transformation rules from terms in $D_b$ to features in $F_b$ are summarized in a term-feature map $M_b$, based on which a structured feature table $T_b$ can be established from the semi-structured fields.
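The splitting and mapping operations can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the dictionary `TERM_FEATURE_MAP` and the helpers `standardize` and `to_feature_rows` are hypothetical names, and the entries simply reuse the examples quoted in the text.

```python
# Hypothetical term-feature map: each raw term points to one or more
# standard feature codes. Entries reuse the examples from the text.
TERM_FEATURE_MAP = {
    "taihouhuang": ["thick tongue fur", "yellow tongue fur"],  # splitting
    "maixianhua": ["stringy pulse", "slippery pulse"],         # splitting
    "dashengdi": ["dried rehmannia root"],                     # mapping
    "xishengdi": ["dried rehmannia root"],                     # mapping
}


def standardize(terms, term_feature_map):
    """Return the sorted standard features for a list of raw terms.

    Terms missing from the map are skipped here; a real pipeline would
    probably collect them for manual review instead.
    """
    features = set()
    for term in terms:
        features.update(term_feature_map.get(term, []))
    return sorted(features)


def to_feature_rows(records, term_feature_map):
    """Build binary rows (one dict per record) over all observed features."""
    per_record = [standardize(terms, term_feature_map) for terms in records]
    columns = sorted({f for feats in per_record for f in feats})
    rows = [{c: int(c in feats) for c in columns} for feats in per_record]
    return columns, rows


# Two toy records, each a list of terms extracted from semi-structured fields.
records = [["taihouhuang", "dashengdi"], ["xishengdi", "maixianhua"]]
columns, rows = to_feature_rows(records, TERM_FEATURE_MAP)
```

Representing both operations with a single map from a term to a list of features keeps splitting (several features per term) and mapping (one shared feature for several terms) in one structure, which also covers terms that need both.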
Unique properties of the free texts in the Archive
Next, we process the free texts in the unstructured Symptoms field. These free texts contain 1,177,007 Chinese character tokens, of which 2,678 are unique. From the text analysis perspective, these texts are unique in multiple dimensions. First, these texts contain many TCM-specific technical terms rarely used elsewhere and many special terms invented by Prof. Zhou that are specific to this archive only. Second, some segments of these texts are highly repetitive. Cutting these free texts into small pieces separated by natural boundaries (such as punctuation marks, ends of lines and so on), we obtained ~241,000 segments, of which ~126,000 are unique. Most of these unique segments are short strings with ≤ 10 Chinese characters, and many of them repeat heavily in the free texts: 3 segments appear ≥ 1,000 times, and the 2,100+ segments that appear more than 10 times contribute ~90,000 repeats together, which is equivalent to about 1/3 of the total number of segments generated from the free texts. We summarize these unique segments into a segment list $S_c$. Table 1B shows the top 100 segments with the highest repeat frequency in $S_c$. Third, these texts are written in a unique style that is very different from the classic training corpora for Chinese text mining, which are typically based on news articles.
These facts mean that we need to capture the special technical terms in the archive to establish an archive-specific vocabulary, and that we need a style-robust tool to process the free Chinese texts in the archive. Moreover, as most segments (especially the highly repeated ones) in the segment list $S_c$ deliver one intact piece of information about patients, it is more efficient to achieve semantic understanding with these segments, instead of words or terms, as the basic language units.
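Cutting free texts at natural boundaries and counting repeated segments can be sketched as follows. This is a minimal sketch under stated assumptions: the delimiter set in `BOUNDARY` is a guess (the exact boundaries used for the archive are not specified), the helper names are hypothetical, and the two toy notes are invented for illustration.

```python
import re
from collections import Counter

# Assumed natural boundaries: common Chinese/ASCII punctuation and whitespace
# (which includes line ends). The actual delimiter set is not given in the text.
BOUNDARY = re.compile(r"[，。；：、！？,.;:!?\s]+")


def cut_segments(texts):
    """Split each free text at natural boundaries, dropping empty pieces."""
    segments = []
    for text in texts:
        segments.extend(s for s in BOUNDARY.split(text) if s)
    return segments


def repeat_summary(segments, min_count):
    """Count segments and report those repeated at least `min_count` times.

    Also returns the share of all segments covered by the frequent ones.
    """
    counts = Counter(segments)
    frequent = {s: n for s, n in counts.items() if n >= min_count}
    covered = sum(frequent.values()) / len(segments)
    return counts, frequent, covered


# Toy symptom notes (made up for illustration).
notes = ["口干，大便正常。苔薄黄", "口干，寐差；大便正常"]
segments = cut_segments(notes)
counts, frequent, covered = repeat_summary(segments, min_count=2)
```

The set of keys of `counts` plays the role of the segment list $S_c$, and `covered` corresponds to the kind of ratio quoted above for heavily repeated segments.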
Processing the unstructured symptoms field
In the past decades, many tools for processing Chinese texts have been proposed. In this study, we tried four
popular “supervised” methods, namely Jieba, Stanford Parser [26–27], Language Technology Platform (LTP) [28] and THU Lexical Analyzer for Chinese (THULAC) [29–30], and a recently proposed “unsupervised” method called TopWORDS [31], to process the free texts in the unstructured Symptoms field.
The four supervised methods emphasize precise text segmentation under the guidance of a preloaded vocabulary and a high-quality training corpus. They typically match the target texts with words from a preloaded vocabulary, and resolve ambiguous words by statistical inference based on a statistical model trained on a manually segmented and labelled training corpus. When the actual vocabulary is covered by the preloaded vocabulary and the writing style of the target texts is close to the training corpus, these supervised methods usually perform well. However, a previous study [31] also showed that when the actual vocabulary of the target texts contains many words beyond the preloaded vocabulary, they often fail to recognize many of these unregistered words, especially when the writing style of the target texts is very different from the training corpus.
The unsupervised method TopWORDS, however, pays more attention to efficient new word discovery, although it can also be used as a tool for text segmentation. It can effectively discover previously unknown words and phrases when no preloaded vocabulary and proper training corpus are available, or when the preloaded vocabulary and the training corpus do not fit the target texts well. Detailed information about the five NLP tools for processing Chinese texts can be found in the Appendix.
As shown in Figure 1, each of these methods returns a
term dictionary $D$ and a term boundary vector $V$ as the outputs for term discovery and text segmentation, respectively. The term dictionary $D$ is the set of terms identified by the method, and the term boundary vector $V$ is a vector with the same length as the target texts, whose element $V_i$ can take three values: $V_i = 2$ if there is a natural boundary (e.g., punctuation mark, end of line and so on) behind the $i$-th position of the target texts, $V_i = 1$ if the method puts a term boundary there, and $V_i = 0$ otherwise. The detailed term boundary vectors of the different segmentation tools can be found on the website of “Zhou Archive for TCM Study”. Table 2 summarizes and compares their performance from multiple angles, from which we can see that both the reported term dictionary and the predicted term boundary vector vary significantly across methods, indicating the critical challenges in processing and understanding domain-specific Chinese texts.
Table 2A summarizes the number of nontrivial terms
(i.e., terms with more than one Chinese character) discovered by different methods. The number varies from the smallest, 23,989 terms reported by Jieba, to the largest, 47,248 terms reported by TopWORDS. Ignoring rare terms that appear only once in the target texts, Table 2B recounts the number of frequent nontrivial terms. The number drops by half to 10,000+ for the four supervised methods, while it drops by only 8% for the unsupervised TopWORDS. We note that the term dictionaries reported by the five methods do not match well with each other: $D_c^{\mathrm{LTP}}$ and $D_c^{\mathrm{THU}}$ achieve the largest overlap ratio of 60%–70% for nontrivial terms and 70%–80% for frequent nontrivial terms, and the overlap ratio of all the other pairs varies from 20% to 70%. These facts reflect the critical challenges in term discovery from domain-specific Chinese texts.
Combining the five term dictionaries, we obtain a joint
term dictionary of 80,000+ distinct terms:
$$D_c = D_c^{\mathrm{JB}} \cup D_c^{\mathrm{SP}} \cup D_c^{\mathrm{LTP}} \cup D_c^{\mathrm{THU}} \cup D_c^{\mathrm{TW}}.$$
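The set operations behind this comparison can be sketched as below. The sketch rests on assumptions: the text does not define the overlap ratio, so the Jaccard index is used here as one plausible choice, and the three toy dictionaries and the helper names `nontrivial` and `overlap_ratio` are invented for illustration.

```python
from itertools import combinations


def nontrivial(terms):
    """Keep terms with more than one character."""
    return {t for t in terms if len(t) > 1}


def overlap_ratio(a, b):
    """Jaccard overlap |a & b| / |a | b| (an assumed definition)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy dictionaries keyed by method abbreviation; the contents are invented.
dictionaries = {
    "JB": {"口干", "大便", "正常", "苔"},
    "TW": {"口干", "大便正常", "苔薄黄"},
    "THU": {"口干", "大便", "苔薄黄"},
}
cleaned = {m: nontrivial(d) for m, d in dictionaries.items()}

# Joint dictionary: the union of all per-method dictionaries.
joint = set().union(*cleaned.values())

# Pairwise overlap ratios between methods.
pairwise = {
    (a, b): overlap_ratio(cleaned[a], cleaned[b])
    for a, b in combinations(cleaned, 2)
}
```

Other definitions, such as dividing the intersection by the size of the smaller dictionary, would change the numbers but not the structure of the computation.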
We identified 20,000+ technical terms with clear medical meanings (including 14,300+ symptoms,
Figure 2. Statistical properties of segments cut from the free texts in the Symptom field. (A) Segment length. (B) Repeatfrequency. (C) Repeat frequency by length.
¹ For the technical terms: (1) we asked two TCM experts to label the discovered terms independently; if both of them agreed that a discovered term was technical, we labeled it as a technical term; (2) for most technical terms, the two experts gave the same type label, and for the few technical terms that received different type labels, we asked the two experts to discuss them a second time and reported their consensus.
3,600+ body parts, 2,000+ disease names, 900+ lab tests and 400+ medical treatments¹), 60,600+ background terms (i.e., correct words and phrases with no medical meanings), and 5,700+ suspicious terms whose semantic meanings cannot be easily determined. Table 1C lists the top 30 terms for each of the 6 term categories in $D_c$. Table 2C shows the contribution of different methods to the discovery of technical terms, background terms and suspicious terms, respectively. The results suggest that the supervised methods indeed missed many meaningful technical terms in this study, while the unsupervised TopWORDS discovers 13,047 (61%) technical terms missed by the other methods. Figure 3 shows the length distribution and type distribution of the technical terms in $D_c$ discovered by different methods. From these figures, we can see that TopWORDS tends to report more long words than the supervised methods, and contributes most to the discovery of technical terms. A term-feature map $M_c$ for these discovered technical terms is established to achieve term standardization, in a way similar to the construction of $M_b$.
The variation on term discovery naturally leads to
variation on text segmentation. Tables 2D and 2E compare the performance of different methods on text segmentation based on the term boundary profile $P_c = (V_c^{\mathrm{JB}}, V_c^{\mathrm{SP}}, V_c^{\mathrm{LTP}}, V_c^{\mathrm{THU}}, V_c^{\mathrm{TW}})$. We propose two criteria for comparing two methods: a less rigorous criterion based on segmentation sites, and a more rigorous criterion based on segmented terms. Let $V$ and $V'$ be the term boundary vectors of the two methods to be compared. The segmentation site criterion simply counts the number of common sites segmented by both methods, i.e.,
$$\#\{i : V_i = V'_i = 1\};$$
the segmented term criterion, however, counts the number of common terms segmented by both methods, i.e.,
$$\#\{i : V_i = V'_i \in \{1,2\},\ \text{and}\ \exists\, t > 0\ \text{s.t.}\ V_{i+t} = V'_{i+t} \in \{1,2\}\ \text{and}\ V_{i+s} = V'_{i+s} = 0\ \text{for}\ 0 < s < t\}.$$
The degree of agreement of the five tested methods varies between 30% and 90% under the segmentation site criterion, and drops to 20%–85% under the more rigorous segmented term criterion. The supervised methods tend to segment the target texts into smaller pieces, while the unsupervised TopWORDS tends to cut the target texts with a larger granularity.
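The term boundary vector and the two agreement criteria can be sketched as follows. This is an illustrative sketch with hypothetical helper names. It assumes that punctuation has already been stripped, so each chunk between natural boundaries is given as the list of terms one method produced, and it follows the set definitions above literally, so the very first term of the text (which has no boundary before it) is not counted by `common_terms`.

```python
def boundary_vector(chunks):
    """Encode a segmentation as a term boundary vector.

    `chunks` is a list of chunks (text between natural boundaries), each a
    list of terms. The entry is 2 after the last character of a chunk,
    1 after the last character of a term inside a chunk, and 0 elsewhere.
    """
    v = []
    for terms in chunks:
        for k, term in enumerate(terms):
            v.extend([0] * (len(term) - 1))
            v.append(2 if k == len(terms) - 1 else 1)
    return v


def common_sites(v, w):
    """Segmentation-site criterion: positions where both vectors equal 1."""
    return sum(1 for a, b in zip(v, w) if a == b == 1)


def common_terms(v, w):
    """Segmented-term criterion: terms delimited identically by both methods.

    A term is counted when two consecutive shared boundaries enclose a
    stretch in which neither method places any boundary.
    """
    count = 0
    open_shared = False  # a shared boundary precedes, with only zeros since
    for a, b in zip(v, w):
        if a == b and a in (1, 2):
            if open_shared:
                count += 1
            open_shared = True
        elif a != 0 or b != 0:
            open_shared = False  # one method cut here, the other did not
    return count


# Two toy segmentations of the same text (invented example).
method_a = [["口干"], ["大便", "正常"], ["苔", "薄黄"]]
method_b = [["口干"], ["大便", "正常"], ["苔薄黄"]]
v_a, v_b = boundary_vector(method_a), boundary_vector(method_b)
sites = common_sites(v_a, v_b)
terms = common_terms(v_a, v_b)
```

In this toy case the two methods share one segmentation site and two terms; adding a virtual boundary before the first character would be a natural variant if the first term should also be counted.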
Table 2. Comparison of term discovery and text segmentation of free texts by different methods. (A) Nontrivial terms discovered by different methods from the unstructured Symptoms field.
Because many technical terms are missed by each of these tested methods, it is risky to base the downstream analysis on the texts segmented by any one of these methods alone. To resolve this dilemma, we feed the joint term dictionary $D_c$ to TopWORDS as the preloaded vocabulary, and refit the model on the free texts from the segment list $S_c$ to do text segmentation. We choose TopWORDS as the segmentation tool because its segmentation results have a proper granularity for semantic understanding of the free texts in the archive.
In total, 8,000+ features are generated for the 20,000+ technical terms discovered, resulting in a feature codebook $F_c$. The transformation rules from terms in $D_c$ to features in $F_c$ are summarized in a term-feature map $M_c$. Mapping the technical terms in the segmented texts to their standard feature codes based on $M_c$, with all the background terms ignored, we can transform the free texts in the unstructured Symptoms field into a structured feature table $T_c$ of binary features (with 1 for presence of a feature, and 0 for absence).
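The step from a dictionary-guided segmentation to one binary row of the feature table can be sketched as follows. Note the substitution: TopWORDS is a likelihood-based method, and this sketch replaces it with greedy forward maximum matching purely as a simple stand-in. The vocabulary, the term-feature map, the English feature codes and the helper names are all hypothetical.

```python
def forward_max_match(text, vocabulary, max_len=6):
    """Greedy longest-match segmentation (a stand-in, not TopWORDS).

    Characters that start no known term become single-character terms.
    """
    terms, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                terms.append(piece)
                i += size
                break
    return terms


def binary_features(terms, term_feature_map, feature_codes):
    """Return a 0/1 row over `feature_codes`; unmapped terms are ignored."""
    present = {f for t in terms for f in term_feature_map.get(t, [])}
    return [int(code in present) for code in feature_codes]


# Hypothetical joint vocabulary: technical terms plus a background term.
vocabulary = {"口干", "大便", "正常", "大便正常", "近来"}
# Hypothetical term-feature map covering technical terms only.
term_feature_map = {"口干": ["dry mouth"], "大便正常": ["normal stool"]}
feature_codes = ["dry mouth", "insomnia", "normal stool"]

terms = forward_max_match("近来口干大便正常", vocabulary)
row = binary_features(terms, term_feature_map, feature_codes)
```

Background terms such as the first one in this toy segment simply have no entry in the map, which is how they are ignored when the row is built.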
Generating a united feature table via data integration
The structured feature tables $T_a$, $T_b$ and $T_c$ generated from the structured, semi-structured and unstructured data fields of the archive can be further integrated into a united feature table $T = (T_a, T_b, T_c)$ as the final output of the data processing procedure. Some features may be shared by $T_a$, $T_b$ and $T_c$. We combined information about these shared features via data integration.
In total, 14,000+ distinct features are involved in the united feature table for the 26,000+ visits, and 26,000+ transformation rules for technical terms are created to establish the table. Detailed contents of the united feature codebook $F = F_a \cup F_b \cup F_c$, the united term dictionary $D = D_b \cup D_c$, the united term-feature map $M = M_b \cup M_c$, and the united structured feature table $T$ can be found on the website of “Zhou Archive for TCM Study”.
STATISTICAL LEARNING OF THE STRUCTURED FEATURE TABLE
Based on the structured feature table obtained, a series of statistical analyses can be implemented to learn potential principles of TCM clinical practice from a data-driven perspective. Because the missing rate of some data fields is very high, and to avoid the potential influence of these missing values, in this study we only select the technical terms from the following 7 data fields whose missing rates in first-visit records are less than 30%: Symptoms, Tongue Picture, Pulse Type, Disease, Disease Category, TCM Pathogenesis and Herbs in TCM Prescription. In total, 7,743 features are involved in these selected data fields, among which 1,926 are rare features whose frequency in the first-visit records is ≤ 1. We ignored these rare features in the downstream analysis and focused only on the remaining 5,817 frequent features.
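The rare-feature filter can be sketched as follows; the counts quoted above are consistent with it, since 7,743 − 1,926 = 5,817. The function name, the threshold parameter and the toy records are hypothetical and serve only to illustrate the rule "drop features whose frequency in first-visit records is at most 1".

```python
from collections import Counter


def frequent_features(first_visit_records, max_rare_count=1):
    """Split features into frequent and rare by first-visit frequency.

    Each record is the collection of features present in one first-visit
    record; a feature is rare if it occurs in at most `max_rare_count`
    records.
    """
    counts = Counter(f for record in first_visit_records for f in set(record))
    rare = {f for f, n in counts.items() if n <= max_rare_count}
    frequent = set(counts) - rare
    return frequent, rare


# Toy first-visit records (invented for illustration).
first_visits = [
    {"dry mouth", "stringy pulse"},
    {"dry mouth", "yellow tongue fur"},
    {"stringy pulse"},
]
frequent, rare = frequent_features(first_visits)
```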
Correlation analysis
Our first effort is a correlation analysis to capture the overall correlation structure of all the 5,817 selected cross-category features. A $5{,}817 \times 5{,}817$ correlation matrix is obtained, and most correlation coefficients in the matrix fall into $[-0.1, 0.1]$, indicating relatively weak correlation. Figure 4 shows the correlation heat map of a few highly correlated features, each of which correlates with some other feature with a correlation coefficient outside $(-0.5, 0.5)$.
[Table panel (D): Comparison of segmentation sites by different methods. Columns: Segmentation sites; Overlap with Jieba; Overlap with SP; Overlap with LTP; Overlap with THULAC; Overlap with TopWORDS.]
From the heat map, we can observe a few blocks of highly correlated features. For example, the largest feature block, highlighted in a black box, reveals that the disease category “cancers” is closely related to the TCM pathogenesis features “toxic head” and “phlegm stasis”, and to a group of herbs: “andeophorae radix”, “glehniae radix”, “buttercup root”, “pseudostellariae radix”, “ophiopogonis radix”, “appendiculate cremastra pseudobulb”, “herba euphoribiae helioscopiae”, “agrimony”, “hedyotis”, “barbated skullcup herb”, “herba celiptae” and “ligustri lucidi fructus”. The smaller feature block next to the largest one shows that a few Pulse Type features are closely related. The other feature block, located at the bottom-left corner, reveals a group of herbs (i.e., “lignum phetimiae”, “orientvine stem”, “largeleaf gentian root”, “preparemonkshd moterroot”, “aconiti preparata”, “asarum sieboldii”) closely related to the disease category “rheumatism”. These discoveries reveal meaningful TCM knowledge.
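The screening of highly correlated feature pairs can be illustrated with a small sketch. It assumes that the correlation coefficient is the ordinary Pearson correlation between binary feature columns (the text does not name the coefficient), and the toy columns and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)

def highly_correlated_pairs(columns, threshold=0.5):
    """Return (feature_i, feature_j, r) for all pairs with |r| > threshold."""
    pairs = []
    for (fi, xi), (fj, xj) in combinations(columns.items(), 2):
        r = pearson(xi, xj)
        if abs(r) > threshold:
            pairs.append((fi, fj, r))
    return pairs

# Toy binary columns (hypothetical): one entry per first-visit record.
columns = {
    "cancers":   [1, 1, 1, 0, 0, 0],
    "hedyotis":  [1, 1, 0, 0, 0, 0],
    "dry mouth": [1, 0, 1, 0, 1, 0],
}
pairs = highly_correlated_pairs(columns)
```

Only the pair ("cancers", "hedyotis") passes the 0.5 threshold in this toy example. For thousands of features a vectorized implementation would be preferable to this pairwise loop.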
Enrichment analysis
To further investigate how TCM concepts such as TCM diseases, TCM pathogenesis and TCM therapies connect to each other and to other features such as diseases, symptoms and herbs, we carried out the enrichment analysis below. For simplicity, let us take “stuffiness of stomach”, the most frequent TCM disease in the archive, as an example. First, we selected from the archive all first-visit records in which the feature “stuffiness of stomach” takes the value 1. We denote the subpopulation of selected records as $P_1$ and the subpopulation of the other first-visit records as $P_0$. Next, we identified the top 5 diseases, symptoms, TCM pathogenesis and herbs with the highest relative frequency in the selected records in $P_1$. Third, for each of the selected features, we calculated its odds ratio between $P_1$ and $P_0$ as its enrichment measure with respect to the feature “stuffiness of stomach”. Finally, we plotted the relative frequency and the enrichment measure of all features selected for “stuffiness of stomach” in a bar plot, as shown in Figure 5A. Such an enrichment plot conveys rich information about the TCM disease “stuffiness of stomach”: (i) it is associated with chronic gastritis, chronic superficial dermatitis, chronic atrophic antral gastritis, headache and astriction; (ii) the symptom gastric distention is a major signature of it; (iii) “liver-stomach disharmony”, “dampness and heat resistance” and “stomach weak energy stagnation” are the major TCM pathogenesis behind it; and (iv) “processed pinellia preparata”, “cyperi rhizoma”, “perillae caulis”, “coptidis root” and “magnolia bark” are the primary herbs to treat it. These messages help us understand the basic properties of the feature efficiently from multiple angles.
Figure 5 displays the enrichment plots for a few of the most frequent TCM diseases, TCM pathogenesis and TCM therapies in the archive. We can read many insightful messages from these figures. For example, the TCM pathogenesis “dampness-heat” is highly associated with liver-related diseases, and takes “impairment of liver and spleen” and “phellodendri chinensis cortex” as the signature symptom and herb, respectively; the TCM therapy “regulating and harmonizing the liver and spleen” is a regular treatment for liver-related diseases and TCM pathogenesis, and takes “barbary wolfberry fruit”, “pseudostellariae radix”, “deep-fried atractylodis macrocephalae rhizoma”, “salviae miltiorrhizae” and “paeoniae radix rubra” as the primary components of the prescription.
Figure 3. Length and type distribution of technical terms in $D_c$ discovered by different methods. (A) Length distribution. (B) Type distribution.
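The enrichment measure used in this section can be sketched in a few lines. The example assumes each first-visit record is a set of feature names and computes the relative frequency of a feature in $P_1$ together with its odds ratio between $P_1$ and $P_0$ from the usual 2×2 table. The toy records, the function name `enrichment` and the handling of empty cells are assumptions made for illustration only.

```python
def enrichment(records, target, feature):
    """Relative frequency of `feature` in P1 and its odds ratio P1 vs P0.

    P1: records containing `target`; P0: all other records.
    """
    p1 = [r for r in records if target in r]
    p0 = [r for r in records if target not in r]
    a = sum(feature in r for r in p1)   # P1 records with the feature
    b = len(p1) - a                     # P1 records without the feature
    c = sum(feature in r for r in p0)   # P0 records with the feature
    d = len(p0) - c                     # P0 records without the feature
    rel_freq = a / len(p1) if p1 else 0.0
    num, den = a * d, b * c
    if den == 0:
        # Assumed convention: the ratio is unbounded or undefined
        # when a cell of the 2x2 table is empty.
        odds_ratio = float("inf") if num > 0 else float("nan")
    else:
        odds_ratio = num / den
    return rel_freq, odds_ratio

# Toy data (hypothetical): 4 records in P1 and 4 records in P0.
target = "stuffiness of stomach"
records = [
    {target, "gastric distention"},
    {target, "gastric distention"},
    {target, "gastric distention"},
    {target},
    {"gastric distention"},
    {"headache"},
    {"headache"},
    set(),
]
rel_freq, odds_ratio = enrichment(records, target, "gastric distention")
```

Here the feature appears in 3 of the 4 records in $P_1$ and in 1 of the 4 records in $P_0$, giving a relative frequency of 0.75 and an odds ratio of (3·3)/(1·1) = 9.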
Embedding analysis
Because correlation and enrichment analyses essentially use the data in a pairwise fashion, they may miss signals reflecting high-order structures in the data. In this section, we analyze the structured feature tables obtained from the Zhou Archive from an alternative perspective via embedding methods [32–34]. Unlike the previous correlation and enrichment analyses, embedding analysis considers co-occurrence patterns of different features globally, and embeds features with no geometric meaning (e.g., symptoms and herbs) into a linear space with a geometric interpretation.
Treating each feature as a “word” and each record as a “document”, we can naturally apply approaches designed for word embedding to the TCM data. Here, we selected the matrix factorization approach [32] as the primary tool to achieve feature-level embedding, and applied it to the $4{,}776 \times 4{,}776$ co-occurrence matrix $C$ of the frequent features, where $C_{ij}$ counts the number of first-visit records that contain both feature $i$ and feature $j$, to embed the frequent features into a 300-dimensional linear space. The detailed embedding vectors of the features can be found on the website of “Zhou Archive for TCM Study”. Features that stay close to each other in the embedding space tend to be closely associated or to share similar functions. The geometric structure of these embedding vectors can be visualized in a 2-dimensional space by techniques such as multi-dimensional scaling (MDS) [34]. Figure 6 shows the MDS plot of the 50 feature pairs with the shortest within-pair distance in the embedding space, most of which precisely reflect TCM knowledge. For example, the two symptoms in the pair {painful forehead, dizzy forehead} are complications that often occur concurrently, the two herbs in the pair {processed herba pogostemonis, processed folium perillae} have similar functions in expelling cold and vomiting, and the symptom-herb pair {chapped, sweet almond} corresponds to a well-known TCM treatment for chapped and irritated skin.
Figure 4. Correlation heat map of highly correlated features based on the first-visit records.
Beyond the feature-level embedding, we can also embed records into a linear space in a similar way. Representing each record by a vector of 4,776 binary variables, with 1s and 0s standing for the presence and absence of features in the record, we used t-Distributed Stochastic Neighbor Embedding (t-SNE) [33] to embed the high-dimensional records into a 2-dimensional representation space. Principal Component Analysis (PCA), a linear dimension reduction technique, maximizes variance and thus mainly preserves large pairwise distances, which fails when the structure is non-linear. In contrast, t-SNE tries to retain local structure while preserving almost the same topology, using a Student t-distribution to model similarities between points in the low-dimensional embedding. Figure 7 shows the results from t-SNE, where each point corresponds to a record and its color stands for the disease category of the record. Among the 16 distinct disease categories in the archive, the 5 major categories, i.e., Cancers, Digestive Diseases (DD), Infectious Diseases (InD), Neurological Diseases (ND) and Respiratory Diseases (RD), together with Miscellaneous Diseases (MiD), contribute ~75% of the records. Interestingly, points associated with the 5 major categories cluster well in the embedding space, whereas the MiD-related points spread out everywhere. These phenomena reflect the heterogeneity of TCM practice among different disease categories and are consistent with the definition of MiD.
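The two inputs used in this section, the feature co-occurrence matrix $C$ and the binary record vectors, can be built as in the sketch below. It assumes records are sets of feature names; the function names and toy data are hypothetical. The factorization step of [32] and the t-SNE step of [33] are deliberately not reproduced here, since they would normally be delegated to a numerical library.

```python
from itertools import combinations

def cooccurrence_matrix(records, features):
    """Build C where C[i][j] counts records containing both feature i and j."""
    index = {f: k for k, f in enumerate(features)}
    n = len(features)
    C = [[0] * n for _ in range(n)]
    for rec in records:
        present = sorted(index[f] for f in set(rec) if f in index)
        for i in present:
            C[i][i] += 1              # diagonal: records containing feature i
        for i, j in combinations(present, 2):
            C[i][j] += 1
            C[j][i] += 1              # keep C symmetric
    return C

def binary_vector(record, features):
    """Encode a record as 1s and 0s for presence and absence of each feature."""
    return [1 if f in record else 0 for f in features]

# Toy data (hypothetical).
features = ["painful forehead", "dizzy forehead", "sweet almond"]
records = [
    {"painful forehead", "dizzy forehead"},
    {"painful forehead", "dizzy forehead", "sweet almond"},
    {"sweet almond"},
]
C = cooccurrence_matrix(records, features)
vec = binary_vector(records[2], features)
```

In the toy example, "painful forehead" and "dizzy forehead" co-occur in two records, so the corresponding off-diagonal entries of `C` equal 2.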
Association pattern discovery
Next, we try to discover association patterns of the selected features from the structured feature table. As all features in the feature table are binary, the data structure perfectly fits the classic Market Basket Analysis (MBA) problem [35] in machine learning, which aims to discover items that tend to be purchased together from a collection of baskets purchased by customers of a supermarket.
Association Rule Mining (ARM) [35,36] is the classic solution to the MBA problem. It enumerates all frequent item sets whose frequency (sometimes called support) is $\ge \tau_F$ and generates association rules whose confidence is $\ge \tau_C$ based on these frequent item sets. Although computationally efficient for large-scale datasets and logically straightforward to understand, ARM often generates too many redundant association rules and tends to miss high-order association patterns that are important to many practical problems.
Recently, Refs. [37,38] reformulated the MBA problem into a statistical model selection problem and proposed a novel solution to this classic problem from the statistical point of view. Assuming that each basket is composed of a collection of item modules (called themes) randomly selected by the customer with different selection probabilities, and that different baskets are generated independently from the same mechanism, Refs. [37,38] approximated the data generation procedure of the baskets via a Theme Dictionary Model (TDM). Starting with an over-complete initial theme dictionary composed of the frequent item sets generated by ARM, and pruning it based on statistical inference and model selection principles, the TDM-based method can effectively discover themes (especially the high-order ones) in the true theme dictionary. Applying TDM to a collection of classic prescriptions in the history of TCM, Ref. [37] discovered hundreds of herb modules which tend to be used together in TCM practice. Many of these herb modules match well with TCM knowledge and successfully reveal the internal structure of TCM prescriptions from a data-driven perspective.
With the support of EMRs in the Zhou Archive which
contains both symptoms and prescriptions, in this study, we generalize this idea to learn association patterns between a module of symptoms and a module of herbs. By treating symptom-related features and herbs in the prescriptions as “items” and each EMR as a “basket” of these items, we obtained 23,000+ effective “baskets” from the first-visit records in the archive. The original TDM can discover themes of all items in the baskets from one single category. In this study, however, we are more interested in cross-category themes containing both symptoms and herbs, which connect a module of symptoms to a module of herbs and provide information on how TCM treatment is determined based on the observed symptoms. To meet this special requirement, we modified the original TDM approach by adding filters that label items from different categories and rule out all single-category themes via a pre-screening of themes in the initial theme dictionary. Removing those redundant single-category association rules from the initial theme dictionary largely reduces the number of partitions for all baskets. We refer to this variant of the original TDM approach as the Cross-category TDM approach, abbreviated as CTDM. Compared to the original TDM approach, the CTDM approach enjoys better computational efficiency, as many single-category themes are excluded from the model a priori.
Please note that we meant to include all diagnosis-related features here to link symptoms to herbs directly. We kept only the first-visit records here, because the longitudinal records of the same patient are often highly correlated with each other (the physical condition of a patient typically does not change dramatically within a couple of months, leading to similar symptoms, diagnoses and treatments), and may seriously violate the assumption of independent samples behind TDM. We also removed baskets containing more than 30 items, which may largely slow down the procedure. In total, 5,175 effective items survived this item-basket screening procedure, resulting in a collection of baskets with 20 items on average.
Same as the TDM approach, the CTDM approach has two control parameters: the minimum theme frequency parameter $\tau_P$ and the maximum theme length parameter $\tau_L$. In this study, we set $\tau_L = 6$ and $\tau_P = 0.001$, and discovered ~1,000 cross-category themes from the archive. Table 3 shows the top 60 cross-category themes discovered by the CTDM approach, each of which connects a module of symptoms to a module of herbs, revealing important insights of TCM treatment. For example, the connections between herb modules {asparagus cochinchinensis, lilium davidii} and symptom modules {poor sleep} and {feel agitated}, the connection between symptom module {dry mouth} and herb modules {figwort root, flos chrysanthemi indici}, the connections between herb modules {tribulus terrestris, gastrodiae, ligusticum wallichii} and symptom modules {dizziness} and {headache}, and the connection between symptom modules {oppression in chest, palpitation} and herb module {salvia miltiorrhiza} all precisely reflect important principles in TCM practice.
Please note that we clustered these top themes based on the symptom module and rearranged their location in the table to deliver information more efficiently. The complete list of discovered themes can be found on the website of “Zhou Archive for TCM Study”.
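The pre-screening idea behind CTDM, keeping only frequent item sets that mix symptoms and herbs as candidates for the initial theme dictionary, can be illustrated as follows. This is a naive brute-force sketch under stated assumptions: baskets are sets of item names, a hypothetical `category` mapping labels each item as "symptom" or "herb", and the subsequent statistical inference and model selection of the TDM are not shown. The function name and toy data are illustrative only.

```python
from collections import Counter
from itertools import combinations

def cross_category_candidates(baskets, category, min_freq=0.001, max_len=6):
    """Frequent item sets containing at least one symptom and one herb.

    Returns {item_set (sorted tuple): relative frequency}. The defaults
    mirror tau_P = 0.001 and tau_L = 6 from the text. Enumeration is
    brute force and only suitable for small examples.
    """
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(2, max_len + 1):
            for subset in combinations(items, k):
                cats = {category[i] for i in subset}
                if {"symptom", "herb"} <= cats:   # cross-category filter
                    counts[subset] += 1
    n = len(baskets)
    return {s: c / n for s, c in counts.items() if c / n >= min_freq}

# Toy data (hypothetical).
category = {
    "dry mouth": "symptom", "dizziness": "symptom",
    "figwort root": "herb", "gastrodiae": "herb",
}
baskets = [
    {"dry mouth", "figwort root"},
    {"dry mouth", "figwort root", "dizziness"},
    {"dizziness", "gastrodiae"},
    {"dry mouth", "dizziness"},
]
themes = cross_category_candidates(baskets, category, min_freq=0.5, max_len=3)
```

With the toy threshold of 0.5, the only surviving candidate is ("dry mouth", "figwort root"), which occurs in two of the four baskets; the symptom-only set ("dizziness", "dry mouth") is excluded by the filter regardless of its frequency.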
CONCLUSION AND DISCUSSIONS
In this study, we introduce the Zhou Archive, a large-scale database of expert-specific EMRs containing comprehensive information about 73,000+ visits to one TCM doctor by 26,000+ distinct patients over 35 years from 1980 to 2015. Processing the text data in the archive via a series of data processing steps with the help of multiple popular NLP tools for Chinese texts, we transformed the semi-structured EMRs in the archive into a well-structured feature table. A series of statistical analyses were applied to the structured feature table to learn principles of TCM clinical practice from the archive. Results from these analyses provide insights for understanding TCM from a data-driven perspective. Besides the statistical analyses demonstrated in this paper, many other methods and new tools can be applied or developed to dig deeper into this archive. We hope the data processing and analysis framework proposed in this paper can motivate other studies on understanding TCM based on large-scale EMR datasets.
SUPPLEMENTARY MATERIALS
The supplementary materials can be found online with this article at https://doi.org/10.1007/s40484-019-0173-x.
ACKNOWLEDGEMENT
We thank the Zhou Zhongying’s Studio at Nanjing University of Chinese
Medicine for the great efforts on collecting, managing and sharing this
valuable archive. We also thank Miss Bing Liang, Mr. Qiuyu Liang and
Miss Che Wang for their efforts on data preparation and preprocessing.
This work was partially supported by the National Natural Science
Foundation of China (Nos. 11771242 & 11401338), the Tsinghua
University Initiative Scientific Research Program and Supporting Grant to
the Zhou Zhongying’s Studio 201159 by the State Administration of TCM
of China.
COMPLIANCE WITH ETHICS GUIDELINES
The authors Yang Yang, Qi Li, Zhaoyang Liu, Fang Ye and Ke Deng declare
that they have no conflict of interests.
All procedures were in accordance with the ethical standards of the
institution or practice at which the studies were conducted, and with the
1964 Helsinki declaration and its later amendments or comparable ethical
standards.
APPENDIX
Introduction to the five NLP tools involved
Jieba is an open-source software package developed by Sun Junyi in 2012. The software uses a variant of the maximum matching algorithm and dynamic programming to achieve word segmentation, and uses a Hidden Markov Model to achieve named entity recognition. The method is equipped with a preloaded vocabulary of more than 20,000 words, and is trained with manually segmented and labelled news articles from People’s Daily and some Chinese novels segmented by ICTCLAS. Here, we use the “accurate mode” of Jieba version 0.38 for all the analysis, and denote the reported dictionary as $D_{JB}$.
Stanford Parser is a tool developed by the Stanford Natural Language Processing Group in 2003. It is a multi-language parser that can be used for English, Chinese, German, etc. Trained with the Penn Chinese Treebank, Stanford Parser can work out the grammatical structure of Chinese sentences on top of word segmentation. Here, we use Stanford Parser version 3.9.1 for all the analysis, and denote the reported dictionary as $D_{SP}$.
LTP is another open-source platform, developed by the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, in 2007. It uses forward maximum matching to merge the information of a preloaded vocabulary into the statistical model, and is equipped with an online learning technique for faster computing. Here, we use LTP version 3.4.0 for all the analysis, and denote the reported dictionary as $D_{LTP}$.
THULAC is a tool developed by the Natural Language Processing Group at the Department of Computer Science and Technology, Tsinghua University, in 2016. It achieves word segmentation based on the maximum entropy approach [12]. The statistical model behind it is trained with manually segmented and labelled news articles from People’s Daily and other sources, which contain a total of 58 million Chinese characters. Here, we use THULAC Python version v1_2 for all the analysis, and denote the reported dictionary as $D_{THU}$.
TopWORDS is a tool developed by Ke Deng and Jun S. Liu in 2016. Different from the above supervised methods, which emphasize precise word segmentation under the guidance of a preloaded vocabulary and a high-quality training corpus, TopWORDS pays more attention to efficient new word discovery when the preloaded vocabulary and the training corpus do not fit the target texts well. Starting with an over-complete initial dictionary generated by enumerating all frequent strings in the target texts, and pruning it into a much smaller final dictionary via statistical model selection, TopWORDS can effectively discover previously unknown words and phrases that appear in the target texts more than 3 times when no preloaded vocabulary and proper training corpus are available (an available preloaded vocabulary and training corpus will improve the performance of TopWORDS). TopWORDS has two control parameters: the minimal word frequency $\tau_P$ and the maximum word length $\tau_L$. We specify $\tau_P = 3$ and $\tau_L = 8$ in this study.
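The first step attributed to TopWORDS above, enumerating all frequent strings to form an over-complete initial dictionary, can be sketched as follows. This is only an illustration of the enumeration step under stated assumptions: the threshold is applied as "count >= min_freq", the function name and toy texts are hypothetical, and the subsequent pruning by statistical model selection is not shown. It is not the TopWORDS implementation itself.

```python
from collections import Counter

def frequent_strings(texts, min_freq=3, max_len=8):
    """Enumerate substrings of length <= max_len occurring >= min_freq times.

    The defaults mirror tau_P = 3 and tau_L = 8 from the text.
    """
    counts = Counter()
    for text in texts:
        for start in range(len(text)):
            for length in range(1, max_len + 1):
                if start + length > len(text):
                    break
                counts[text[start:start + length]] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}

# Toy texts (hypothetical): "口干" = dry mouth, "口苦" = bitter taste,
# "头痛" = headache.
texts = ["口干口苦", "口干", "头痛口干"]
initial_dictionary = frequent_strings(texts)
```

In the toy example the two-character string "口干" occurs three times and is retained as a candidate term, together with its frequent single characters.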
REFERENCES
1. Liu, W. H. (2017) TCM acupuncture-moxibustion: contributing to
human health. World J. Acupunct. Moxibustion, 27, 1
2. Ahn, A. C., Bennani, T., Freeman, R., Hamdy, O. and Kaptchuk, T.
J. (2007) Two styles of acupuncture for treating painful diabetic
neuropathy–a pilot randomised control trial. Acupunct. Med., 25,
11–17
3. Liu, Z., Sun, F., Zhu, M. and Wang, X. (2004) Effect of
acupuncture on insulin resistance in non-insulin dependent
diabetes mellitus. J. Acupunct. Tuina Sci., 2, 8–11
4. Li, S. and Zhang, B. (2013) Traditional Chinese medicine network
pharmacology: theory, methodology and application. Chin. J. Nat.
Med., 11, 110–120
5. Zhang, B., Wang, X. and Li, S. (2013) An integrative platform of
TCM network pharmacology and its application on a herbal
formula, Qing-Luo-Yin. Evid. Based Complement. Alternat. Med.,
2013, 456747
6. Li, S., Zhang, B. and Zhang, N. (2011) Network target for
screening synergistic drug combinations with application to
traditional Chinese medicine. BMC Syst. Biol., 5, S10
7. Lam, W., Bussom, S., Guan, F., Jiang, Z., Zhang, W., Gullen, E. A.,
Liu, S. H. and Cheng, Y. C. (2010) The four-herb Chinese
medicine PHY906 reduces chemotherapy-induced gastrointestinal
toxicity. Sci. Transl. Med., 2, 45ra59
8. Xiang, Y. Z., Shang, H. C., Gao, X. M. and Zhang, B. L. (2008) A
comparison of the ancient use of ginseng in traditional Chinese
medicine with modern pharmacological experiments and clinical
trials. Phytother. Res., 22, 851–858
9. Jian, J. and Wu, Z. (2004) Influences of traditional Chinese
medicine on non-specific immunity of Jian Carp (Cyprinus carpio
var. Jian). Fish Shellfish Immunol., 16, 185–191
10. Bick, R. J., Poindexter, B. J., Sweney, R. R. and Dasgupta, A.
(2002) Effects of Chan Su, a traditional Chinese medicine, on the
calcium transients of isolated cardiomyocytes: cardiotoxicity due
to more than Na, K-ATPase blocking. Life Sci., 72, 699–709
11. Iwasaki, K., Satoh-Nakagawa, T., Maruyama, M., Monma, Y.,
Nemoto, M., Tomita, N., Tanji, H., Fujiwara, H., Seki, T., Fujii, M.,
et al. (2005) A randomized, observer-blind, controlled trial of the
traditional Chinese medicine Yi-Gan San for improvement of
behavioral and psychological symptoms and activities of daily
living in dementia patients. J. Clin. Psychiatry, 66, 248–252
12. Deng, K., Liu, D., Gao, S. and Geng, Z. (2005) Structural learning
of graphical models and its applications to traditional Chinese
medicine. Lect. Notes Comput. Sci., 3614, 362–367
13. Feng, Y., Wu, Z., Zhou, X., Zhou, Z. and Fan, W. (2006)
Knowledge discovery in traditional Chinese medicine: state of the
art and perspectives. Artif. Intell. Med., 38, 219–236
14. Yang, H., Chen, J., Tang, S., Li, Z., Zhen, Y., Huang, L. and Yi, J.
(2009) New drug R&D of traditional Chinese medicine: role of
data mining approaches. J. Biol. Syst., 17, 329–347
15. Wang, Q. and Zhu, Y. (2009) Epidemiological investigation of
constitutional types of Chinese medicine in general population:
based on 21,948 epidemiological investigation data of nine
provinces in China. Zhonghua Zhongyiyao Zazhi (in Chinese),
24, 7–12
16. Xue, R., Fang, Z., Zhang, M., Yi, Z., Wen, C. and Shi, T. (2013)
TCMID: traditional Chinese Medicine integrative database for
herb molecular mechanism analysis. Nucleic Acids Res., 41,
D1089–D1095
17. Liu, B., Zhou, X., Wang, Y., Hu, J., He, L., Zhang, R., Chen, S. and
Guo, Y. (2012) Data processing and analysis in real-world
traditional Chinese medicine clinical data: challenges and
approaches. Stat. Med., 31, 653–660
18. Wang, X., Qu, H., Liu, P. and Cheng, Y. (2004) A self-learning