A computational linguistic approach to natural language processing with applications to garden path sentences analysis

This paper discusses the computational parsing of garden path (GP) sentences. By combining computational linguistic methods, e.g. CFG, ATN and BNF, we analyze the syntactic structures of pre-grammatical, common, ambiguous and GP sentences. The evidence shows that both ambiguous and GP sentences have lexical or syntactic crossings. In ambiguous sentences, any choice at the crossing yields a full-parsed structure. In GP sentences, the probability-based choice is the cognitive prototype of parsing. Once the part-parsed priority structure is replaced by the full-parsed structure of low probability, the distinctive feature of backtracking appears. The computational analysis supports Pritchett’s idea of processing breakdown in GP sentences.


I. INTRODUCTION
The advent of the World Wide Web has greatly increased demand for natural language processing (NLP). NLP relates to human-computer interaction, discusses linguistic coverage issues, and explores the development of natural language widgets and their integration into multi-user interfaces [1]. The development of language technology has been facilitated by two technical breakthroughs: the first emphasizes empirical approaches and the second highlights networked machines [2]. Natural language and databases are core components of information systems, and NLP techniques may substantially enhance most phases of query processing, natural language understanding and the information system [3][4][5].
Through the methods, metrics and measures it has developed or adopted, NLP has accelerated scientific advances in the study of human language, such as machine translation [6][7], automated extraction systems for free text [8], the semantics-originated Generalized Upper Model of a linguistic ontology [9], the artificial grammar learning (AGL) system [10], NIMFA [11], etc. Understanding natural language involves context-sensitive discrimination among word senses, and there is a growing awareness of the need to develop an indexed, domain-independent knowledge base that contains linguistic knowledge [12][13][14][15][16][17].
There are many helpful NLP models for linguistic research focusing on various application areas, e.g. Zhou & Hripcsak's medical NLP model and Plant & Murrell's dialogue system. Zhou & Hripcsak's medical NLP model comprises three parts, i.e. "structure", "analysis" and "challenges". "Analysis" consists of morphological, lexical, syntactic, semantic and pragmatic parts. Morphological and lexical analysis determine the sequences of morphemes used to create words. Syntax emphasizes the structure of phrases and sentences that combine multiple words. Semantics highlights the formation of the meaning or interpretation of the words. Pragmatics concerns how context affects the interpretation of sentences and how sentences combine to form discourse. [18] Plant & Murrell's Dialogue NLP System discusses the importance of Backus-Naur Form (BNF). This system makes it possible for any user who understands formal grammars to replace or upgrade the system, or to produce all possible parses of the input query, without requiring any programming. In the model, BNF is extended with simple semantic tags. The matching agent searches through a knowledge base of scripts and selects the most closely matching one. In this model, BNF is very helpful for the system to analyze natural language. [19]
www.ijacsa.thesai.org
The computational analysis of Garden Path (GP) sentences is one of the important branches of NLP because these sentences are hard for a machine to translate without supporting linguistic knowledge.
GP sentences are grammatically correct, and their interpretation consists of two procedures: the prototype understanding and the backtracking parsing. At first, readers most likely interpret GP sentences incorrectly by means of the cognitive prototype. As understanding advances, readers are lured into a parse that turns out to be a dead end. With the help of a special word or phrase, they find that the syntactic structure being built up is different from the structure that has been created, namely that they have been led down a wrong path. Thus they have to return and reinterpret, which is called backtracking. "Garden path" here means "to be led down the garden path", meaning "to be misled". Originally, this phenomenon was analyzed by psycholinguists to illustrate the fact that human beings process language one word at a time when reading. Now, the GP phenomenon attracts a lot of interest from scholars from the perspectives of syntax [20][21][22][23][24], semantics [25][26][27][28], pragmatics [29][30], psychology [31][32][33][34], and computer and cognitive science [35][36][37][38].
In this paper, Context Free Grammar (CFG) and BNF will be used to discuss the automatic parsing of GP sentences. Meanwhile, pre-grammatical sentences, common sentences and ambiguous sentences will be analyzed from the perspective of computational linguistics for comparison and contrast with GP sentences.

II. THE NLP-BASED ANALYSES OF NON-GP SENTENCES
Non-GP sentences in this paper include pre-grammatical sentences, common sentences and ambiguous sentences, each of which is analyzed to show how it differs from GP sentences.

A. Analysis of Pre-Grammatical Sentences
A pre-grammatical sentence is grammatically incorrect even though we can guess the meaning from the separate words or phrases. According to CFG, this kind of sentence fails to be parsed successfully. In a pre-grammatical sentence, the syntactic structure is not correct and the parts are isolated from one another, even though the possible meaning of the sentence can sometimes be inferred from the evidence. For example, in the programming rules of (8), we can enter a number of related verbs to rewrite example 1, e.g. V → {hear/play/write/sing/record}. Thus the pre-grammatical sentence can be turned into a common one.
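The contrast between a pre-grammatical sentence and its repaired version can be sketched with a small CFG recognizer. The rule set below is an assumption made for illustration (the paper's complete rules are not reproduced here); only the verb list V → {hear/play/write/sing/record} is taken from the text, and the names `RULES`, `LEXICON`, `derives`, `matches` and `is_sentence` are hypothetical.

```python
from functools import lru_cache

# Toy CFG assumed for illustration; not the paper's complete rule set.
RULES = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("Det", "Adj", "N")],
    "VP": [("V", "NP")],
}
LEXICON = {
    "Det": {"the"},
    "Adj": {"new"},
    "N": {"singers", "song"},
    "V": {"hear", "play", "write", "sing", "record"},  # verb list from the text
}


@lru_cache(maxsize=None)
def derives(symbol, words):
    """Return True if `symbol` can be rewritten into exactly the tuple `words`."""
    if symbol in LEXICON:
        return len(words) == 1 and words[0] in LEXICON[symbol]
    return any(matches(rhs, words) for rhs in RULES.get(symbol, []))


def matches(rhs, words):
    """Return True if the symbol sequence `rhs` covers `words` with nothing left over."""
    if not rhs:
        return not words
    head, rest = rhs[0], rhs[1:]
    # Every symbol covers at least one word, so keep one word per remaining symbol.
    for cut in range(1, len(words) - len(rest) + 1):
        if derives(head, words[:cut]) and matches(rest, words[cut:]):
            return True
    return False


def is_sentence(text):
    """Check whether the whole word string can be derived from S."""
    return derives("S", tuple(text.lower().split()))


print(is_sentence("The new singers the song"))         # no V, so S cannot be built
print(is_sentence("The new singers record the song"))  # V added, S is derivable
```

Under these assumed rules the first string is rejected because no VP can be formed from <the song>, while the second is accepted once a verb is present.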

B. Analysis of Common Sentences
A common sentence is grammatically acceptable, and both CFG and BNF can parse it smoothly and successfully. If "record(verb)" is added to example 1, the resulting sentence is a common one. In the semantic network, some nodes are associated with lexicon entries. In order to analyze example 2 clearly and concisely, a detailed description of the lexicon is necessary besides the grammatical analysis. "CTGY" means category; "PRES", present; "NUM", number; "SING", singular. The ATN (augmented transition network) in Fig. 3 shows the details of the parsing of example 2, which belongs to the category of common sentences. There is no backtracking or ambiguity in the procedure shown below.
3. In arc 7, Adj <new> is analyzed and the result is set in register.
5. In arc 6, the result of parsing in NP subnet is popped to general net in arc 1.
6. Again in arc 1, the popped result is set in register.
7. In arc 2, system starts to seek VP<record the song> and PUSH to VP subnet.
8. VP subnet begins to parse VP<record the song>. In arc 8, V<record> is set in register.
9. In arc 9, VP subnet begins to interpret NP <the song>. There is no related rule to support the procedure in this VP subnet and, as a result, the sub-sub-net of NP is activated again. NP <the song> is pushed to NP subnet.
10. NP sub-sub-net begins to parse NP<the song>. In arc 4, Det <the> is set in register again.
12. In arc 6, the result of parsing in NP sub-sub-net is popped to VP subnet.
13. In arc 9, NP<the song> is set in register.
15. In arc 2, the parsing result of VP subnet is set.
All the parsing results of subnets and sub-sub-nets show that S<the new singers record the song> is grammatically and semantically acceptable and reasonable. The information is set in register. System returns "SUCCESS" and parsing is over.
The algorithm of parsing discussed above can be found in Table 1, in which "Number" means the steps of parsing; "Complexity", the hierarchical levels of net; "Arc" or "A-?", the respective numbers shown in Fig. 3; "Programming", the BNF description.
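The PUSH, register-setting and POP steps described for example 2 can be sketched as a small transition-network walker. This is a minimal sketch under stated assumptions, not the network of Fig. 3: the lexicon models only the CTGY feature, the arc numbers are not reproduced, and the class `Parser`, its methods and the trace labels are hypothetical names. Like the procedure for a common sentence, it never backtracks.

```python
# Hypothetical lexicon: only the CTGY (category) feature is modelled here.
LEXICON = {"the": "Det", "new": "Adj", "singers": "N", "song": "N", "record": "V"}


class Parser:
    """A tiny ATN-style walker that records PUSH / SET / POP events in a trace."""

    def __init__(self, words):
        self.words = words
        self.pos = 0
        self.trace = []

    def cat(self, category):
        """CAT arc: consume the next word if its category matches, else return None."""
        if self.pos < len(self.words) and LEXICON.get(self.words[self.pos]) == category:
            word = self.words[self.pos]
            self.pos += 1
            self.trace.append(f"SET {category}<{word}>")
            return (category, word)
        return None

    def noun_phrase(self):
        """NP subnet: Det, an optional Adj, then N."""
        self.trace.append("PUSH NP")
        det = self.cat("Det")
        if det is None:
            return None
        registers = [det]
        adj = self.cat("Adj")  # optional arc, skipped when the next word is not Adj
        if adj is not None:
            registers.append(adj)
        noun = self.cat("N")
        if noun is None:
            return None
        registers.append(noun)
        self.trace.append("POP NP")
        return ("NP", registers)

    def verb_phrase(self):
        """VP subnet: V, then a PUSH to the NP subnet for the object."""
        self.trace.append("PUSH VP")
        verb = self.cat("V")
        if verb is None:
            return None
        obj = self.noun_phrase()
        if obj is None:
            return None
        self.trace.append("POP VP")
        return ("VP", [verb, obj])

    def sentence(self):
        """S net: NP followed by VP, with no words left over."""
        subject = self.noun_phrase()
        if subject is None:
            return None
        predicate = self.verb_phrase()
        if predicate is None or self.pos != len(self.words):
            return None
        self.trace.append("SUCCESS")
        return ("S", [subject, predicate])


def parse(text):
    parser = Parser(text.lower().split())
    return parser.sentence(), parser.trace


tree, trace = parse("The new singers record the song")
for event in trace:
    print(event)
```

The trace shows the NP subnet being entered twice, once for the subject and once from inside the VP subnet, which mirrors the re-activation of the NP sub-sub-net in the steps above.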

C. Analysis of Ambiguous Sentences
An ambiguous sentence has more than one possible meaning, and these meanings may convey similar, different or even opposite information.
Example 3: The detective hit the criminal with an umbrella.
The example above shows syntactic ambiguity, because different syntactic structures convey different meanings. Example 3 carries two meanings. The first is that the detective, using an umbrella, hit the criminal, while the other is that the detective hit the criminal who was carrying an umbrella.
In the ATN created for example 3, three subnets are involved, i.e. NP subnet, VP subnet and PP subnet. S net is the general net. The reason why the different meanings of example 3 can be expressed lies in the attachment of the PP subnet. When the PP subnet is attached to the VP subnet, namely when VP→VP PP is activated, the parsing result is "The detective using an umbrella hit the criminal". When the PP subnet serves the NP subnet, i.e. NP→NP PP, the interpretation is "The detective hit the criminal who is carrying an umbrella". The parsing algorithm of example 3 in "NP→NP PP" also has 24 steps and the highest level of syntactic structure is "V", which means this parsing places a heavier cognitive or system burden on the parser.
From Step 1 to Step 7, system parses example 3 along the same path in which both NP<the detective> and V<hit> are interpreted successfully without the existence of ambiguity.The same algorithm can be seen in both Table 2 and Table 3.
From Step 8, the difference appears. For the sake of clear and concise explanation, we start the algorithm used in Table 3 from step 8.
The difference between Table 2 and Table 3 shows that "VP→VP PP" parsing is easier than "NP→NP PP" parsing since the first is less complex than the second. This provides the evidence that there is a default parsing even though more than one interpretation is involved in an ambiguous sentence.
In example 3, the "VP → VP PP" algorithm, in which the sentence is parsed into "The detective using an umbrella hit the criminal", is the default interpretation.
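The two attachments of the PP in example 3 can be enumerated with a small chart-style parser. The grammar below is an assumed binary CFG containing the two competing rules named in the text (VP→VP PP and NP→NP PP); the names `BINARY`, `LEXICON`, `parses` and `attachment` are hypothetical, and the sketch does not model the step counts or hierarchical levels of Tables 2 and 3.

```python
from functools import lru_cache

# Assumed binary grammar holding the two competing attachment rules.
BINARY = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("NP", "PP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("Prep", "NP")],
}
LEXICON = {
    "the": "Det", "an": "Det",
    "detective": "N", "criminal": "N", "umbrella": "N",
    "hit": "V", "with": "Prep",
}


def parses(text, goal="S"):
    """Return every tree rooted in `goal` that spans the whole word string."""
    words = tuple(text.lower().split())

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        """All trees rooted in `symbol` that cover words[i:j]."""
        if j - i == 1 and LEXICON.get(words[i]) == symbol:
            return ((symbol, words[i]),)
        found = []
        for left, right in BINARY.get(symbol, []):
            for k in range(i + 1, j):  # split point; both halves are shorter spans
                for left_tree in trees(left, i, k):
                    for right_tree in trees(right, k, j):
                        found.append((symbol, left_tree, right_tree))
        return tuple(found)

    return trees(goal, 0, len(words))


def attachment(tree):
    """Name the rule that attaches the PP in a tree of the form (S, NP, VP)."""
    vp = tree[2]
    return "VP→VP PP" if vp[1][0] == "VP" else "NP→NP PP"


results = parses("The detective hit the criminal with an umbrella")
print(len(results))
print(sorted(attachment(tree) for tree in results))
```

Both trees are complete parses of S, which is the property that separates an ambiguous sentence from a GP sentence; which of the two is treated as the default depends on a preference (complexity or probability) that this sketch does not encode.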
Besides the syntactic ambiguity shown in example 3, the existence of homographs is another important source of multiple meanings. The whole BNF-based algorithm of example 4 is shown in Table 4, by which the four interpretations discussed above can be parsed.
From the discussion above, we can see that a pre-grammatical sentence (e.g. example 1) does not meet the requirements of syntax because it lacks the necessary components. A common sentence (e.g. example 2) is the essential part of natural language, and exact expression is the core of the sentence. An ambiguous sentence comprises ambiguous structures (e.g. example 3) or ambiguous words (e.g. example 4), and any ambiguous interpretation is acceptable and understandable even though the parsing sometimes differs in complexity.

III. THE NLP-FOCUSED ANALYSES OF GP SENTENCES
The parsing of a GP sentence includes two procedures, i.e. the prototype understanding and the backtracking parsing. The prototype understanding refers to the default parsing of cognition according to the decoder's knowledge database. The backtracking parsing means that the original processing breaks down and the decoder has to re-understand the GP sentence when new information used to decode the sentence is provided linearly. Therefore, processing breakdown is the distinctive feature of the parsing of a GP sentence.
From the lexicon analysis of example 5, we can notice the significant difference between "number (noun)" and "number (linking verb)".
According to the interpretation in LDOCE, "number (noun)" can mean "a word or sign that represents an amount or a quantity", as in the sentence "Five was her lucky number"; or "a set of numbers used to name or recognize someone or something", as in the sentence "He refused to swap it with opposite number Willie Carne after the game because he had promised it to the Mirror." Besides the noun function, "number" can be parsed as a linking verb. For example, in the sentence "The men on strike now number 5% of the workforce", "number" is interpreted as "if people or things number a particular amount, that is how many there are." Based on the discussion above, the ATN of example 5 can be created. In Fig. 6, the core of the parsing lies in the NP subnet, in which both "NP→Det Adj" and "NP→Det Adj N" are accepted. In the cognitive system, "number (noun)" takes priority while "number (linking verb)" has a notably low probability. The difference in cognition can be shown in ERP experiments, and the psychological results develop the prototype ideas. [39][40][41] The BNF-based algorithm of example 5 includes 22 steps during the parsing, which are shown in Table 5.
1. In arc 1, S net first seeks NP. System pushes down to NP subnet. According to the cognitive knowledge of the decoder, "number(noun)" in <the opposite number> is parsed first.
2. In arc 5, Det<the> is set in register.
4. In arc 6, N<number> is set in register.
5. In arc 7, parsing result of NP<the opposite number> is popped up to arc 1 in S network where it is pushed down.
6. In arc 1, NP<the opposite number> is set in register.
7. In arc 2, S network seeks VP and tries to push down to VP subnet. But the remaining components <about 5000> fail to provide V according to lexicon analysis. System returns "FAIL" and backtracks to the original path in arc 1, where another parsing can be chosen besides the original one. In example 5, the cognitive crossing lies in the difference between "number(noun)" and "number(linking verb)".
8. In arc 1, system seeks NP and <the opposite> instead of the original <the opposite number> is pushed down to NP subnet.
11. In arc 7, NP<the opposite> is parsed successfully and sent back to arc 1.
12. In arc 1, the parsing result of NP<the opposite> is set in register.
13. In arc 2, VP<number about 5000> is pushed down to VP subnet.
14. In arc 9, <number> is interpreted as a linking verb according to (Number((CTGY.LINKV))), and the result of parsing is set in register.
16. In arc 12, the interpretation of Adv<about> is set in register.
19. In arc 10, the result of parsing NumP<about 5000> is set in register.
20. In arc 11, after parsing VP<number about 5000> successfully and smoothly, system returns to arc 2.
21. In arc 2, VP<number about 5000> is set in register.
22. In arc 3, both NP<the opposite> and VP<number about 5000> are set in register and the whole parsing of S<The opposite number about 5000> is completed. System returns "SUCCESS" and parsing is over.
From the algorithm in Table 5, we can see that the distinctive feature of the parsing is the existence of "backtracking", at which breakdown happens and the system has to return to the original crossing to find another way out. This optional procedure needs the help of lexical, semantic, grammatical and cognitive knowledge.
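The fail-and-backtrack behaviour described for example 5 can be sketched with a depth-first parser that tries rules in priority order. Everything here is an assumption made for illustration: the rule order stands in for the cognitive priority of "NP→Det Adj N" over "NP→Det Adj", the grammar covers only example 5, and the names `RULES`, `LEXICON`, `expand`, `sequence` and `parse` are hypothetical. The log has four entries rather than the 22 steps of Table 5.

```python
# Assumed rules in priority order: NP -> Det Adj N is tried before NP -> Det Adj.
RULES = {
    "NP": [["Det", "Adj", "N"], ["Det", "Adj"]],
    "VP": [["LinkV", "NumP"]],
    "NumP": [["Adv", "Num"]],
}
# Assumed lexicon: "number" carries both the noun and the linking-verb category.
LEXICON = {
    "the": {"Det"},
    "opposite": {"Adj"},
    "number": {"N", "LinkV"},
    "about": {"Adv"},
    "5000": {"Num"},
}


def expand(symbol, words, start):
    """Yield (end, tree) for each way `symbol` covers words[start:end], best first."""
    if symbol not in RULES:  # lexical category
        if start < len(words) and symbol in LEXICON.get(words[start], set()):
            yield start + 1, (symbol, words[start])
        return
    for rhs in RULES[symbol]:
        for end, children in sequence(rhs, words, start):
            yield end, (symbol, *children)


def sequence(rhs, words, start):
    """Yield (end, [trees]) for each way the symbols in `rhs` match from `start`."""
    if not rhs:
        yield start, []
        return
    for mid, tree in expand(rhs[0], words, start):
        for end, rest in sequence(rhs[1:], words, mid):
            yield end, [tree] + rest


def parse(text):
    """Parse S -> NP VP, logging each NP attempt and each backtrack."""
    words = text.lower().split()
    log = []
    for np_end, np_tree in expand("NP", words, 0):
        log.append(f"TRY NP<{' '.join(words[:np_end])}>")
        for end, vp_tree in expand("VP", words, np_end):
            if end == len(words):
                log.append("SUCCESS")
                return ("S", np_tree, vp_tree), log
        log.append("FAIL: no VP found, backtracking")
    return None, log


tree, log = parse("The opposite number about 5000")
for line in log:
    print(line)
```

The log shows the preferred NP<the opposite number> being tried first, the failure to find a VP in <about 5000>, and the return to the crossing where NP<the opposite> is chosen instead.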
From the parsing above, we can see that example 6 is another GP sentence since there is a breakdown in the processing. In example 6, either "record(verb)" or "record(noun)" can in principle be chosen. However, NP<the new record> has a high probability of parsing. This is the reason why the priority parsing selects "record(noun)" rather than "record(verb)". The process of choosing can be shown in ATN networks. In Fig. 7, the structure of the NP subnet is the obvious reason why the GP phenomenon appears. Both NP→Det Adj and NP→Det Adj N are reasonable and acceptable when "the new record" is parsed. Generally speaking, Adj is used to modify the noun, so the model of NP→Det Adj N is the prototype of parsing, and the system interprets example 6 by means of this programming rule rather than NP→Det Adj. After completing the NP subnet parsing of <the new record>, the system returns to S network to seek VP. However, the remaining phrase <the song> has no VP component according to the lexicon knowledge, so the system stops, backtracks and transfers to another programming rule, i.e. NP→Det Adj. Cognitive breakdown happens. The whole processing algorithm of example 6 is shown in Table 6.
22. In arc 3, system finishes the parsing of NP<the new> and VP<record the song>. S<the new record the song> is saved. System returns "SUCCESS" and parsing is over.
From the discussion of example 5 and example 6, we can find that both of them have the distinctive feature of "backtracking". The fact that the high-probability parsing in GP sentences has to be replaced by the low-probability interpretation is the fundamental distinction from pre-grammatical sentences, common sentences and ambiguous sentences. Processing breaks down when the system backtracks to find a new way out.
Based on the computational linguistic analyses shown above, we can see that both similarities and differences exist between ambiguous sentences and GP sentences. An effective and systematic attempt at comparison and contrast may contribute to our understanding of this special phenomenon.

IV. THE COMPARISON AND CONTRAST OF AMBIGUOUS SENTENCES AND GP SENTENCES
Ambiguous sentences and GP sentences have close similarities and significant differences in many aspects, e.g. lexicon knowledge, syntactic structures and decoding procedures.

A. The Similarity and Difference in Lexicon Knowledge
Lexicon knowledge is the basic information the system needs for parsing, and a detailed analysis of the related categories is essential. We first compare the similarities and contrast the differences among example 3, example 4, example 5 and example 6, which are shown as follows.
In example 3, the lexicon analysis includes Det<the, an>, N<detective, criminal>, Prep<with> and V<hit>. Since the singular noun N<detective> needs the present verb <hits> or the past verb <hit> to agree with it, example 3 must be in the past tense rather than the present tense, for there is no <hits> in the sentence. Example 3 is a structure-based ambiguous sentence, and lexicon knowledge helps little in reducing its ambiguity. In example 5, the lexical database comprises Det<the>, Adj<opposite>, LinkV<number>, N<number>, Adv<about>, and Number<5000>. The homonym <number> has two grammatical functions, i.e. linking verb and noun. The different choices result in different sentences. According to the probability, NP<the opposite number> is the prototype parsing, and correspondingly, N<number> is adopted first even though this path finally turns out to be a dead end. Generally speaking, the lexical crossing leads to the processing breakdown of a GP sentence. From the discussion above, we can see that the existence of homonyms is an obvious reason for both the ambiguous phenomenon and the GP effect, just as in example 4, example 5 and example 6. However, this is not the only reason for the appearance of ambiguity or the GP phenomenon. Sometimes, the divergence of syntactic structures also leads to ambiguity or the GP effect.

B. The Similarity and Difference in Syntactic Structures
The Stanford parser is a very useful parser that combines highly optimized PCFG (probabilistic context-free grammar) parsers, lexicalized dependency parsers and a lexicalized PCFG parser. "Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences." The Stanford parser can be used to parse example 3, example 4, example 5 and example 6 online. The resulting syntactic structures are provided as follows.
For example 3, the Stanford parser provides one of the two interpretations, namely the model of "VP→VP PP" rather than the model of "NP→NP PP", since the former has a higher statistical probability than the latter. In other words, "VP→VP PP" is the prototype parsing because of its simpler syntactic structure.
In example 4, the tags are <Failing/NN>, <student/NN>, <looked/VBD> and <hard/JJ>. This is another full-parsed structure in which all the components are interpreted successfully. The word <failing> is tagged as a noun (i.e. Grd) and <hard> as JJ (i.e. Adj). The parsed syntactic structure is similar to "Grd+Adj", which has the highest probability among the four ambiguous models in the statistics of the parsing database. The hierarchical level is II, as shown in Table 4.
In example 5, the tags are <the/DT>, <opposite/JJ>, <number/NN>, <about/RB> and <5000/CD>. According to the Stanford parser, this is a part-parsed sentence since the final result is NP rather than S, which shows that the prototype NP<the opposite number> has a higher probability than NP<the opposite>. In other words, the Stanford parser only finishes the first part of the parsing, before the backtracking in Table 5.
In example 6, the tags comprise <the/DT>, <new/JJ>, <record/NN> and <song/NN>. This is another example of a part-parsed structure, in which only the programming rule N→{record} is adopted while V→{record} fails to be used. That means NP<the new record> has a stronger statistical probability than NP<the new>. The Stanford parser only parses steps 1-7 in Table 6 and then gives NP instead of S as the final result, which ignores the remaining parsing steps after the backtracking.
From the discussion of syntactic structures, we can see that both ambiguous sentences and GP sentences can have more than one syntactic structure. According to PCFG, the parsing with the highest probability is the final result in the Stanford parser. If another, more complex structure is adopted, the cognitive burden on decoders increases. Once this happens, another interpretation of the ambiguous sentence is provided by the alternative syntactic structure besides the original one. In contrast, if probability-based parsing returns the final result of a GP sentence as a part-parsed structure, the rule-based programming will be activated, and a full-parsed new structure can be obtained only if the processing breakdown can be overcome.
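The idea that a probabilistic parser prefers the analysis with the highest probability can be sketched by scoring derivations as products of rule probabilities, which is how a PCFG assigns a probability to a tree. The numbers below are invented purely for illustration and are not estimates from any treebank or from the Stanford parser; `RULE_PROB` and `tree_prob` are hypothetical names, and multiplying the two partial analyses of <the opposite number> is only a rough stand-in for a full comparison of parses.

```python
import math

# Illustrative probabilities only: assumptions, not values from any real model.
RULE_PROB = {
    ("NP", ("Det", "Adj", "N")): 0.30,
    ("NP", ("Det", "Adj")): 0.02,
    ("Det", ("the",)): 0.50,
    ("Adj", ("opposite",)): 0.01,
    ("N", ("number",)): 0.010,
    ("LinkV", ("number",)): 0.001,
}


def tree_prob(tree):
    """Probability of a derivation: the product of the probabilities of its rules."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):  # lexical rule
        return RULE_PROB[(label, tuple(children))]
    rhs = tuple(child[0] for child in children)
    return RULE_PROB[(label, rhs)] * math.prod(tree_prob(child) for child in children)


# Reading 1: <the opposite number> is a single NP (prototype reading).
noun_reading = ("NP", ("Det", "the"), ("Adj", "opposite"), ("N", "number"))
# Reading 2: <the opposite> is the NP and <number> starts the VP as a linking verb.
short_np = ("NP", ("Det", "the"), ("Adj", "opposite"))
linking_verb = ("LinkV", "number")

preferred = tree_prob(noun_reading)
alternative = tree_prob(short_np) * tree_prob(linking_verb)
print(preferred > alternative)
```

With these assumed values the noun reading scores higher, so a purely probability-driven choice commits to NP<the opposite number>; reaching the full parse then requires giving up that choice, which is the shift from the probability-based to the rule-based analysis described above.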
During the re-parsing procedures, an ambiguous structure can bring different full-parsed results, while a GP sentence first breaks down because of its part-parsed structure and then moves on to another full-parsed path. An ambiguous structure leads to multiple results, all of which are reasonable and acceptable, while a GP sentence structure brings only one fully interpreted result besides the processing breakdown.

V. CONCLUSION
By comparing programming procedures, lexicon knowledge, parsing algorithms and syntactic structures among pre-grammatical sentences, common sentences, ambiguous sentences and GP sentences, we conclude that the formal methods of computational linguistics, e.g. CFG, BNF, and ATN, are useful for computational parsing.
Pre-grammatical sentences have a part-parsed structure, and the system returns phrases rather than S as the final result. Common sentences are normal in grammar and semantics, and there is no lexical or syntactic crossing in parsing. Ambiguous sentences have ambiguity created by ambiguous structures or lexicons, both of which can bring full-parsed results. GP sentences comprise a part-parsed structure built by the high statistical probability method and full-parsed structures created by the rule-based method. When the parsing shifts from the part-parsed structure to the full-parsed one, processing breakdown of GP sentences occurs. This paper supports the idea raised by Pritchett [42] that processing breakdown is a distinctive feature in the parsing of a GP sentence.

Example 1: *The new singers the song.
G={Vn, Vt, S, P}
Vn={Det, Adj, N, NP, S, VP, V}
Vt={the, new, singers,
In Example 1, we can see the whole structure of the sentence is [The new singers]NP+[the song]NP, namely the absence of V is the reason why it fails to be parsed successfully.

Figure 3 ATN of Example 2

1. System tries to seek NP in arc 1 and then PUSH NP <The new singers> to NP subnet.
2. NP subnet begins to parse NP <The new singers>. In arc 4, Det <the> is set in register.

Figure 4 ATN of Example 3
From Fig. 4, we can notice the difference of the PP subnet, which can be attached to NP subnet in arc 4 or to VP subnet in arc 8. The parsing algorithm of example 3 in "VP → VP PP" includes 24 steps and the highest level of syntactic structure is "IV".

Figure 5 ATN of Example 4
In Fig. 5, we can see that both NP subnet and VP subnet have bi-arcs which perform the same grammatical function. For example, arc 4 and arc 5 before NP1 exist in the same syntactic position and have the same function. Meanwhile, arc 8 and arc 10 before VP2 perform a similar grammatical function in VP subnet. The BNF of example 4 is provided as follows.

Figure 6 ATN of Example 5

Figure 7 ATN of Example 6

Table 1 Parsing Algorithm of Example 2

Table 2 Parsing Algorithm of Example 3 in "VP→VP PP"

Table 4 Parsing Algorithm of Example 4

Table 5 Parsing Algorithm of Example 5