
Behavior Research Methods, Instruments, & Computers
1985, 17, 209-216

Simple, applied text parsing

MICHAEL M. GRANAAS
University of Kansas, Lawrence, Kansas

Text parsing for use in linguistic or artificial intelligence research can often be ill-suited for use in other lines of investigation in which simple, problem-specific parsing techniques may be used. This paper describes the development of a simple text parser for use in a specific research setting. The resulting parsing programs are compact enough for easy use on most available mini- and microcomputers, while still providing adequate results for some applications such as chunking text into small units for later presentation.

The specific case examined here involves parsing text for use in the Rapid Serial Visual Presentation (RSVP) format. The RSVP format involves the serial presentation of single words or short text segments to a single fixed location on a computer screen. This type of presentation allows for precise control over various aspects of the text presentation, such as presentation duration. Prior work in our laboratory suggested that multiword text units were superior to single-word text units for RSVP presentation up to a point of about 20 character spaces. Beyond that point, little or no useful information is extracted from the text (Rayner, 1978). Furthermore, eye movement studies have shown that eye fixations are longer at points where phrase structure indicates that a heavier processing load should exist (Just & Carpenter, 1980). These results, when taken in combination, suggest that the RSVP format might be improved by providing subjects with text that has been divided into appropriate linguistic units.

In a recent study, Cocklin (1983) compared the readability of text divided into linguistic "idea units" with that of text divided into arbitrary "random" units presented in the RSVP format. The idea units were obtained by having independent raters divide the text into "natural" units. The random units were obtained by having a computer divide the same texts into unconstrained segments of the same average length (i.e., 13 character spaces). Cocklin's subjects were able to use the information preserved in the structured text to improve their comprehension scores as measured by a multiple-choice question-answering task.

Cocklin's (1983) study was confounded, however, by the difference in the variability of window sizes in the two conditions, with the structured text being more variable in length. It is possible that the readability of text presented in the random condition benefited from reduced perceptual variability, whereas the readability of text in the structured condition benefited from linguistic information. Consequently, reduction of window-size variability in the structured-text condition might further have improved comprehension in that condition. Despite this confound of unit variability acting against the linguistic units, Cocklin still obtained results favoring the linguistic units.

This research was partially supported by NIMH Training Grant MH15134-05. The author would like to thank James F. Juola and Timothy D. McKay for their comments on earlier drafts of this work. Reprint requests should be mailed to Michael M. Granaas, Department of Psychology, University of Kansas, Lawrence, KS 66045.

The major problem in presenting text as linguistic units in the RSVP condition is not, however, one of unit variability, but one of time. In Cocklin's (1983) study, it was necessary to have four independent raters divide the text into linguistic units. These results were then tabulated, and decisions were made about the final placements of boundaries. After this had been done, the computer text files still required editing so that unit boundaries could be marked for later recognition by the computer. Despite the fact that Cocklin used relatively short passages, each required a great deal of time to be readied for presentation in the RSVP condition. Such a process clearly limits the potential usefulness of RSVP reading by making it clumsy and expensive to prepare text for presentation in the RSVP format. Additionally, due to their periodic nature, text sources such as newspapers and magazines are effectively eliminated from consideration for everyday use in RSVP presentation. It would simply prove to be too clumsy and expensive to prepare these materials for RSVP presentation on a daily, weekly, or monthly basis.

The obvious solution to this problem seemed to be the development of a computerized text parser. Computerized text parsers have been developed in artificial intelligence approaches to natural language comprehension devices (e.g., Winograd, 1972). These differ in their complexity and ability to assign correct syntactic functions to constituents of a sentence. A simple parser would save a great amount of time in preparing texts for RSVP display. Such a parser would need only to divide text into units of near optimal size and structure for RSVP presentation. Since such a parser divides the text into linguistic units only for use by human readers, its linguistic analysis of incoming text would not need to be nearly as extensive as that required by a language comprehension device. Such a system would certainly provide a speed advantage over human parsers and could possibly be modified to run on-line so that texts could be parsed during presentation rather than prior to presentation. This would allow computer users the flexibility of selecting the presentation format they prefer for any given text. In addition, the computer could be instructed to use upper and/or lower size restrictions on parsed text to help reduce variability in the length of linguistic units.

Copyright 1985 Psychonomic Society, Inc.

The purpose of the research described here was to develop a text-parsing program that would reduce variability in text units presented to subjects and, at the same time, maintain useful linguistic structure.

METHOD

Apparatus

The original version of the parser was written and implemented on a PDP-11/03 minicomputer with 12K of available RAM, using a FORTRAN IV compiler (see Appendix A for a listing). More recently, the parser has been translated to Pascal and implemented on a Sage II minicomputer under the P-System operating system (see Appendix B for a listing).

Text Parsing Program

Both versions of the text parser were created to provide parsed text for later use in other research. Text for these experiments was parsed prior to the experiment by means of the computer text parser. Several criteria were established for the development of the parser. The first of these was that text, regardless of subject matter, had to be parsed into linguistic units that corresponded to Cocklin's (1983) "idea units." Second, the text had to meet a maximum size restriction of no more than 20 character spaces per unit. This restriction was placed on unit length because earlier research had indicated that the eye could perceive and extract useful information from a maximum of about 20 characters per fixation (Rayner, 1978). Additional research has shown that comprehension in the RSVP format tends to decrease for windows averaging more than about 15 characters in length (Cocklin, Ward, Chen, & Juola, 1984). Finally, the parsing program and text up to 9,000 characters in length had to fit within the memory limitations of the computer being used. [Note: This size-limitation restriction did not apply to the Pascal version developed on the Sage II, due to a large amount of available random access memory (RAM).] Because of possible future use of this program on other machines, the size restrictions were still considered important.

After several possible parsing models were considered, a model similar to one outlined in Psychology and Language (Clark & Clark, 1977, pp. 59-61) was adopted. This model was selected because of its compatibility with the criteria outlined above. The parser was based on a limited number of function words: determiners, quantifiers, prepositions, pronouns, auxiliary verbs, relative pronouns, complementizers, subordinating conjunctions, and coordinating conjunctions (see Appendix C for a listing of function words used).

The original version of the parser was implemented as a three-pass subroutine in FORTRAN IV. Since punctuation is a reliable indicator of linguistic structure in a text, the first pass of the parser simply looked for and marked punctuation boundaries.
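The first pass can be sketched in a few lines of Python. This is a minimal illustration, not the original FORTRAN or Pascal: the function name `mark_punctuation` is hypothetical, and the rules are taken from the comments in the Appendix A listing (a blank after punctuation becomes a star, a left parenthesis is marked on its left, and a period is skipped when either of the two preceding characters is a capital letter, which protects initials and abbreviations).

```python
BOUNDARY = "*"  # the appendix listings use a star as the boundary marker


def mark_punctuation(text: str) -> str:
    """Replace the blank next to a punctuation mark with a boundary star."""
    chars = list(text)
    n = len(chars)
    for i, ch in enumerate(chars):
        next_is_blank = i + 1 < n and chars[i + 1] == " "
        prev_is_blank = i > 0 and chars[i - 1] == " "
        if ch in ",;:!?)":
            if next_is_blank:
                chars[i + 1] = BOUNDARY
        elif ch == ".":
            # Skip periods preceded (within two characters) by a capital,
            # e.g. the period in "J. Smith".
            before = text[max(0, i - 2):i]
            if next_is_blank and not any(c.isupper() for c in before):
                chars[i + 1] = BOUNDARY
        elif ch == "(":
            if prev_is_blank:
                chars[i - 1] = BOUNDARY
        elif ch == '"':
            # Quotation marks are marked on whichever side has a blank.
            if prev_is_blank:
                chars[i - 1] = BOUNDARY
            if next_is_blank:
                chars[i + 1] = BOUNDARY
    return "".join(chars)


print(mark_punctuation("When it rained, J. Smith left. He ran."))
```

Overwriting the blank in place, rather than inserting a character, keeps the text the same length, which matters when the whole text must fit in a fixed-size array.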

The second pass of the parser used a function-word algorithm to mark additional phrase boundaries. The algorithm made use of a conditional variable that was set to false when a non-function word was encountered and to true when a function word was encountered. The parser made a left-to-right, word-by-word scan beginning at a previously defined boundary. The first word in a given text segment was scanned to determine whether or not it was a function word. If it was not, a conditional variable was set to false. If it was, the scan continued until a non-function word was found, and then the conditional variable was set to false. When the conditional variable changed from false to true, the computer was instructed to mark the intervening space as a phrase boundary. This procedure was then repeated until a previously defined boundary was encountered.
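The false-to-true transition rule can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it works on whitespace-separated words within one punctuation-delimited segment rather than on a character array, the name `split_on_function_words` is hypothetical, and `FUNCTION_WORDS` holds only a small subset of the Appendix C list.

```python
# A small subset of the Appendix C function words, for illustration only.
FUNCTION_WORDS = frozenset(
    {"a", "the", "of", "in", "to", "and", "that", "was", "with", "by", "on", "for"}
)


def split_on_function_words(segment: str) -> list[str]:
    """Start a new unit wherever a non-function word is followed by a function word."""
    units: list[str] = []
    current: list[str] = []
    prev_was_function = True  # no boundary is placed before the first word
    for word in segment.split():
        is_function = word.strip('.,;:!?"()').lower() in FUNCTION_WORDS
        # The flag going from false (content word) to true (function word)
        # marks the space between the two words as a phrase boundary.
        if is_function and not prev_was_function and current:
            units.append(" ".join(current))
            current = []
        current.append(word)
        prev_was_function = is_function
    if current:
        units.append(" ".join(current))
    return units


print(split_on_function_words("the old man walked to the store with a cane"))
```

Runs of consecutive function words ("to the", "with a") stay together at the head of a unit, because the flag is already true when the second one is read.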

The third and final pass of the parser ensured that phrase units did not exceed the size restrictions set for RSVP presentation. This pass of the parser counted the number of characters in each phrase unit of text. When that count exceeded the 20-character maximum, the phrase was divided at a point as near to the center of the unit as possible, without splitting a word. If the center of the phrase unit was the exact middle of a word (e.g., the letter "a" in the word "plant"), the phrase was broken prior to the word.
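A minimal Python sketch of the size-restriction pass follows. The name `split_long_unit` is hypothetical, and the recursion on the two halves is an assumption about how units longer than twice the limit would be handled; the text above describes only a single division near the center.

```python
MAX_UNIT = 20  # maximum characters per unit, as stated in the text


def split_long_unit(unit: str, max_len: int = MAX_UNIT) -> list[str]:
    """Divide an over-long unit at the blank nearest its center."""
    if len(unit) <= max_len:
        return [unit]
    mid = len(unit) // 2
    # Search outward from the center. The left position is tested first,
    # so when the center falls in the exact middle of a word the break
    # is made before that word.
    for offset in range(len(unit)):
        for pos in (mid - offset, mid + offset):
            if 0 < pos < len(unit) - 1 and unit[pos] == " ":
                left, right = unit[:pos], unit[pos + 1:]
                return split_long_unit(left, max_len) + split_long_unit(right, max_len)
    return [unit]  # a single word longer than max_len is left whole


print(split_long_unit("the quick brown fox jumped over"))
print(split_long_unit("aaaaaaaaaa plant bbbbbbbbbb"))
```

In the second call the center of the 27-character unit is the "a" in "plant", and the two nearest blanks are equally far away, so the break goes before "plant".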

RESULTS AND DISCUSSION

The resulting parser was flexible enough to parse text on virtually any subject matter. This parser should also be compact enough to be implemented on virtually any of the micro- or minicomputers available today, provided that they have at least 16K of 16-bit RAM or 32K of 8-bit RAM, as well as a higher level computer language, such as FORTRAN or Pascal. It also provided phrases that closely approximated (87% agreement) Cocklin's (1983) idea units, while still resulting in text units that were less variable than those used by Cocklin. (Note: the overall interrater agreement for four raters in Cocklin's study was 90%.) Mean standard deviations were 4.7 and 4.1 characters for Cocklin's idea units and the text-parsing program's units, respectively. A pairwise t test of the standard deviations from common materials in the two parsing conditions showed a significant reduction in variability for text units produced by the text-parsing program [t(47) = 6.52, p = .0001].
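For readers unfamiliar with the pairwise t test, the statistic is the mean of the paired differences divided by its standard error, with n - 1 degrees of freedom; the reported t(47) therefore implies 48 paired passages. The Python sketch below shows the arithmetic only. The function name `paired_t` is hypothetical, and the four pairs of values are invented for illustration; they are not data from the study.

```python
import math
import statistics


def paired_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Invented per-passage standard deviations (characters), NOT study data.
rater_sd = [5.0, 4.5, 5.2, 4.8]
parser_sd = [4.0, 4.1, 4.4, 4.3]
t, df = paired_t(rater_sd, parser_sd)
print(f"t({df}) = {t:.2f}")
```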

An important result of this work is that it shows that a simple, easily implemented parsing routine can be developed to parse text into linguistic units with a reasonable level of agreement (87%) with human parsers. Since the parser can easily be modified to run on-line with incoming text from different sources, it is ideally suited for implementation on personal computers with limited memory for RSVP reading of texts via telephone or satellite news and information services. In addition, several personal computer manufacturers are now selling in increasing numbers small, portable computers that have only a few (1 to 5) display lines. Page presentation of text on these computers is quite limited, and RSVP would provide an ideal text-presentation mode for these mini-displays.

REFERENCES

CLARK, H. H., & CLARK, E. V. (1977). Psychology and language: An introduction to psycholinguistics. New York: Harcourt Brace Jovanovich.

COCKLIN, T. G. (1983). Sensory and linguistic determinants of optimal text segments in RSVP reading. Unpublished master's thesis, University of Kansas, Lawrence.

COCKLIN, T. G., WARD, N. J., CHEN, H.-C., & JUOLA, J. F. (1984). Factors influencing readability of rapidly presented text segments. Manuscript submitted for publication.

JUST, M. A., & CARPENTER, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87, 329-354.

RAYNER, K. (1978). Eye movements in reading and information processing. Psychological Bulletin, 85, 618-660.

WINOGRAD, T. (1972). Understanding natural language. Cognitive Psychology, 3, 1-191.

Appendix A
Parsing Subroutine-FORTRAN

      SUBROUTINE PARSE
C
C     THIS ROUTINE IS DESIGNED TO PARSE TEXT
C     INTO A PHRASE STRUCTURE USING PUNCTUATION
C     AND FUNCTION WORDS. THE PARSER ALSO EXAMINES
C     THE PARSED TEXT AND ADJUSTS THE PHRASES FOR LENGTH
C     SO THAT THEY MAY BE BETTER UTILIZED IN THE RSVP
C     READING FORMAT
C     THIS ROUTINE WAS WRITTEN IN 1983 BY
C     MICHAEL M. GRANAAS FOR USE ON A DEC
C     PDP 11/03 MINI-COMPUTER.
C
C     INPUT TO THIS ROUTINE COME BY WAY OF THE
C     TWO COMMON BLOCKS: WORDS AND TEXT.
C
C     WORDS-
C       WRD- ARRAY CONTAINING THE LIST OF FUNCTION WORDS
C            WHICH WERE READ IN AND STORED IN A TREE STRUCTURE
C            BY ANOTHER SUBROUTINE
C       RT,LT- RIGHT AND LEFT POINTERS FOR THE TREE STRUCTURE
C
C     TEXT-
C       TXT- ARRAY CONTAINING TEXT TO BE PARSED
C       DVEC- ARRAY NOT UTILIZED IN THIS SUBROUTINE
C       TSIZE- INTEGER NUMBER OF CHARACTERS IN TXT
C
      COMMON/WORDS/ WRD,RT,LT
      COMMON/TEXT/ TXT,DVEC,TSIZE
      LOGICAL*1 DVEC(60),TXT(9500),Z,WRD(110,10)
      BYTE PER,QMARK,EMARK,CON,SEMCOL,STAR
      BYTE BLANK,APOS,PARN,COL,QT
      INTEGER FLG,PT,PT2,TSIZE,RT(110),LT(110)
      INTEGER I,J,CNT,FT,CNT2,BK,FLG2,FLG3
      DATA BLANK/"040/,APOS/"047/,QT/"042/
      DATA PARN/"051/,COL/"072/
      DATA PER/"056/,QMARK/"077/,EMARK/"041/
      DATA CON/"054/,SEMCOL/"073/,STAR/'*'/
C
C     PASS ONE OF THE PARSER FIND AND MARK PUNCTUATION
C     BOUNDARIES WITH A STAR
C     PUNCTUATION OTHER THAN PERIODS AND LEFT PARENS
C     ARE MARKED TO THE IMMEDIATE RIGHT IF A BLANK SPACE
C     IS AVAILABLE, OTHERWISE NO MARK IS MADE.
C     LEFT PARENS ARE MARKED TO THEIR IMMEDIATE LEFT IF A
C     BLANK SPACE IS AVAILABLE. PERIODS ARE MARKED TO
C     THEIR IMMEDIATE RIGHT IF A BLANK SPACE IS AVAILABLE,
C     AND IF NEITHER OF THE TWO CHARACTERS PRECEDING THE
C     PERIOD ARE CAPITALIZED.
C
      DO 5,I=1,TSIZE
      Z=TXT(I)
      IF (Z.GE."101) GOTO 5
      IF (Z.EQ.PER.AND.TXT(I-1).LT."133.AND.TXT(I-1).GT."100) GOTO 5
      IF (Z.EQ.PER.AND.TXT(I-2).LT."133.AND.TXT(I-2).GT."100) GOTO 5
      IF (Z.EQ.PER.OR.Z.EQ.QMARK.OR.Z.EQ.EMARK) GOTO 7
      IF (Z.EQ.CON.OR.Z.EQ.SEMCOL.OR.Z.EQ.PARN.OR.Z.EQ.COL) GOTO 7
      IF (Z.EQ.QT.AND.TXT(I-1).EQ.BLANK) TXT(I-1)=STAR
      IF (Z.EQ.QT.AND.TXT(I+1).EQ.BLANK) TXT(I+1)=STAR
      GOTO 5
7     I=I+1
      IF (TXT(I).NE.BLANK) GOTO 5
      TXT(I)=STAR
5     CONTINUE
C
C     NOW WE START THE FUNCTION WORD SEARCH.
C     WHEN WE FIND A NON-FUNCTION WORD FOLLOWED BY
C     A FUNCTION WORD WE PARSE BETWEEN THEM
C
      FLG2=0
      FLG=0
      PT=1
      BK=1
      GOTO 13
12    FLG=1
      IF (FLG2.EQ.0) GOTO 13
      TXT(FT)=TXT(FT)-"40
      FLG2=0
13    CNT=0
      FT=BK
      IF (TXT(FT).EQ.STAR) FLG=0
      Z=TXT(FT)
      IF (Z.GE."101) GOTO 14
      IF (Z.LE."071.AND.Z.GE."060) GOTO 14
      BK=BK+1
      GOTO 13
14    PT2=1
15    CNT=CNT+1
      BK=BK+1
      IF (BK.GE.TSIZE) GOTO 302
      Z=TXT(BK)
      IF (Z.LT."101.AND.Z.NE.APOS) GOTO 25
      GOTO 15
C
25    IF (CNT.GE.11) GOTO 12
      IF (TXT(FT).GE."141) GOTO 30
      FLG2=1
      TXT(FT)=TXT(FT)+"040
C
30    PT=FT
      J=1
      CNT=1
31    IF (CNT.EQ.12) GOTO 300
      IF (TXT(PT).NE.WRD(PT2,J)) GOTO 110
      PT=PT+1
      IF (PT.EQ.BK.AND.WRD(PT2,J+1).LT."141) GOTO 300
      J=J+1
      CNT=CNT+1
      GOTO 31
C
110   IF (TXT(PT).LT.WRD(PT2,J)) GOTO 115
      PT2=RT(PT2)
      IF (PT2.EQ.0) GOTO 12
      GOTO 30
C
115   PT2=LT(PT2)
      IF (PT2.EQ.0) GOTO 12
      GOTO 30
C
300   IF (FLG.EQ.0) GOTO 303
309   IF (TXT(FT-1).EQ.BLANK.OR.TXT(FT-1).EQ.STAR) GOTO 310
      FT=FT-1
      GOTO 309
310   TXT(FT-1)=STAR
      FLG=0
303   IF (FLG2.EQ.0) GOTO 301
      TXT(FT)=TXT(FT)-"040
      FLG2=0
C
301   GOTO 13
C
C     THE THIRD AND FINAL PASS OF THE PARSER CHECKS
C     THE PREVIOUSLY PARSED TEXT FOR UNIT SIZE.
C     ANY UNIT LARGER THAN 20 CHARACTERS IN LENGTH IS
C     PARSED AS NEAR THE MIDDLE AS POSSIBLE WITH OUT
C     SPLITTING UP A WORD
C
302   FT=2
      PT=3
      CNT=1
305   IF (TXT(PT).EQ.STAR) GOTO 350
      PT=PT+1
      CNT=CNT+1
      IF (PT.GE.TSIZE) GOTO 999
      GOTO 305
C
350   IF (CNT.GT.20) GOTO 355
351   FT=PT+1
352   IF (TXT(FT).NE.BLANK) GOTO 353
      FT=FT+1
      GOTO 352
353   PT=FT+1
      CNT=1
      GOTO 305
C
355   BK=CNT/2
      FLG=FT+BK
      FLG2=FLG
356   IF (TXT(FLG).NE."040) GOTO 360
      TXT(FLG)=STAR
      PT=FT+1
      CNT=1
      GOTO 305
360   IF (TXT(FLG2).NE."040) GOTO 365
      TXT(FLG2)=STAR
      PT=FT+1
      CNT=1
      GOTO 305
365   FLG=FLG-1
      FLG2=FLG2+1
      IF (FLG.LT.FT.AND.FLG2.GE.PT) GOTO 351
      GOTO 356
C
999   CONTINUE
      RETURN
      END

Appendix B
Parsing Program-Pascal

{**** this program is designed to parse text into a
      phrase like structure using punctuation and
      function words. The parser also examines the
      parsed text and adjusts the units for length
      so that they may be better utilized in the RSVP
      reading format ****}

{**** This version of the text parser was written in
      1984 by Michael M. Granaas for use on a Sage II
      mini-computer. ****}

var
  ch : char;
  chin, wrds, temp, pl, pf : integer;
  wrd, data, prnt, outpt : text;
  words : array [1..115, 1..12] of char;
  text : array [1..3000] of char;  {this could be expanded to any size
                                    within bounds of available memory
                                    to meet requirements of larger text
                                    files}
  fnt, cap, flg, stp, i, j, k : integer;
  mid, mid1, mid2, numchar, numparse : integer;
  avch, avwrd : real;

begin
  chin := 12;
  k := 1;

{**** this current version requires editing the name
      of each new input/output file. Minor modifications
      would allow for prompting of the user thus eliminating
      the need for continued editing and recompiling ****}

  reset(data, 'exp2:s1');
  rewrite(outpt, 'exp2:os01.text');
  rewrite(prnt, 'PRINTER:');
  numparse := 0;

  for i := 1 to 3000 do
    text[i] := ' ';

  for i := 1 to 115 do
    for j := 1 to 12 do
    begin
      words[i,j] := ' '
    end;

  while not EOF(data) do
  begin
    while not EOLN(data) do
    begin
      read(data, ch);
      text[k] := ch;
      k := k+1;
    end;
    if EOLN(data) then
    begin


      read(data, ch);
    end;
  end;
  close(data);

  chin := k-1;
  while not (text[chin] in ['.']) do
  begin
    chin := chin-1;
    writeln(chin);
  end;

{**** read in function word list ****}

  reset(wrd, 'exp1:words.text');
  k := 1;
  j := 1;
  while not EOF(wrd) do
  begin
    while not EOLN(wrd) do
    begin
      read(wrd, ch);
      words[j,k] := ch;
      k := k+1;
    end;
    if EOLN(wrd) then
    begin
      read(wrd, ch);
      words[j,k] := '*';
      k := 1;
      j := j+1;
    end;
  end;
  close(wrd);

{**** this ends our input of the function word list ****}
{**** now we begin the actual parsing by marking appropriate
      punctuation boundaries ****}

~~d. '.J-1,i:~;

k.-lIl;while not (k-1) 00

b.gini:-i+1,if tewttij-'.· tnen il-i+1;if t eM t [ i J in ['!',',',')',',',' I' • ' ?' j then

b.ginif t.xtCi+1J - • , th.n t.xt(i+1l.··.';

.nd;if t.xt til - '.' th.n

tl.ll ini1' not <t.xtCi-ll in ':'A' •• 'Z·l) tn.n

beginif not (t.xtCi-aJ in ('A' •• 'Z'J) th.r,begin

if t.)(t Ci+1J-' , then text (i+lj :_t.' ;.nd;end;end;

if t.xtCiJ •• u, tn.nb.ginif te>etCi"-lJ-' , then te>et Ci-1J :-' *';if t.xtti+ll.· , th.n t."tCi+U:·'.·:

.nd;if tewtCi]-'.' thenb.gink:-l;t.xtti+1l.····

end;.nd;

{**** we now begin the second pass of the parser which performs a
      function word search--whenever a non-function word is followed
      by a function word a parse is made between them ****}

fnt :=-1;c,ao:-0;flg,·0;sto.-0,pf,-l,while not (t."ttp'J in ('A' .. 'Z'.'.' •• 'z'Jl do

01"·01'+1,

r.o•• tDl.·pf;

""hil. (t."tCpl+1J in C'A' •• 'Z·.'.· •• ·z']) do01:·pl+1;


    if (text[pf] in ['A'..'Z']) then
    begin
      cap := 1;
      text[pf] := chr(ord(text[pf])+32)
    end;

    for i := 1 to wrds do
    begin
      j := 1;
      if text[pf] < words[i,j] then i := wrds;
      if text[pf] = words[i,j] then
      begin
        temp := pf+1;
        j := j+1;
        while ((temp<=pl) and not (words[i,j] in ['*'])) do
        begin
          if text[temp] = words[i,j] then
          begin
            temp := temp+1;
            j := j+1
          end
          else
          begin
            temp := pl+1;
          end
        end;
        if (temp=pl+1) and (words[i,j]='*') then
        begin
          flg := 1;
          i := wrds
        end
      end {if}
    end; {for}

    if (flg=1) and (fnt=0) then
    begin
      fnt := 1;
      if text[pf-1]=' ' then text[pf-1] := '*';
      flg := 0
    end
    else if (flg=1) and (fnt=1) then
      flg := 0
    else if (flg=0) and (fnt=1) then
      fnt := 0;

    if cap=1 then
    begin
      cap := 0;
      text[pf] := chr(ord(text[pf])-32);
    end;

pf.'"01+1;whil. not <t.)(t (pt'] in L' At .... ' Z',' A' .... ' z' J) co

beginif t.xttpf].'*· then tnt:-l;if t."tCpfj.·.· then stp.-5jof.·pf+l

..nduntil .tp ) 1;

{**** end function word search ****}
{**** we now begin the final pass of the parser, which checks the
      length of the units, and adjusts them so that they are no
      greater than 20 characters in length ****}

  pf := 1;
  stp := 0;
  flg := 0;

  while not (text[pf] in ['A'..'Z']) do
    pf := pf+1;

  pl := pf+1;

  while not (text[pl] in ['*']) do
    pl := pl+1;

  mid := 4;


  repeat
    numchar := pl-pf;

    if numchar <= 20 then flg := 3;
    if numchar > 20 then mid := (numchar div 2)+pf;

    if (text[mid]=' ') and (flg<1) then
    begin
      text[mid] := '*';
      flg := 2
    end;

    mid1 := mid-1;
    mid2 := mid+1;

    while (flg<1) do
    begin
      if text[mid1]=' ' then
      begin
        text[mid1] := '*';
        flg := 2
      end
      else if text[mid2]=' ' then
      begin
        text[mid2] := '*';
        flg := 2
      end;
      mid1 := mid1-1;
      mid2 := mid2+1
    end; {while}

if flg-3 th.n of.=01+1;flg:04;

while not <t."tipfl in C·A· •• ·Z·.·.· .. ·z·J) cobeginif te"tCDfl-'*' then stD'~;

pf:-p1'+lend;

ol:-pf+l;while not (t."tCDIJ in C·.·ll do

beginif te"tCplJ-·.· tnen stD.-3;01'-Dl+1

end;

until (stoHl;

{**** we now end the size adjustment pass of the parser ****}
{**** final part of program outputs the text in an appropriate
      format for later use ****}

  for i := 1 to chin do
    if text[i]='*' then numparse := numparse+1;

  writeln('there are ', numparse, ' windows in this text');

  pl := 78;
  while not (text[pl] in [' ','*']) do
  begin
    pl := pl-1;
  end;

  for j := 1 to pl do
  begin
    write(outpt, text[j]);
  end;

  pf := pl+1;
  k := 0;

  while (k<1) do
  begin
    writeln(outpt);
    pl := pf+78;

    if pl >= chin then
    begin
      pl := chin;
      k := 1
    end;
    while not (text[pl] in [' ','*']) do
    begin
      pl := pl-1;
    end;


    for i := pf to pl do
    begin
      write(outpt, text[i]);
    end;

    pf := pl+1;

  end;
  writeln(outpt);
  writeln(outpt, 'EOF');
  close(outpt, lock);

end.

Appendix C
Function Words

a           can           nor          what
about       concerning    of           whatever
above       could         off          when
across      dare          on           where
after       despite       once         which
against     do            onto         whichever
all         during        or           while
although    except        ought        who
amid        following     our          whoever
amidst      for           over         whom
among       from          past         whose
an          had           regarding    why
and         have          shall        will
are         he            she          with
around      how           should       within
as          I             since        without
at          if            so           would
be          in            than         yet
because     including     that         you
before      into          the          under
behind      is            though       underneath
below       it            through      unless
beneath     like          till         until
besides     may           to           upon
between     might         toward       versus
beyond      must          was          via
but         near          we
by          need          were