The ideas are inspired and illustrated by a study undertaken of the EPIC system, an ILE for learning proof construction in propositional logic (Twidale 1992a), which revealed various problems with summative evaluation. The study revealed the unexpected ways in which students may interact with an ILE. These are difficult to predict and can have substantial effects on the overall effectiveness of the system. It is therefore vital to undertake studies with prototypes as early as possible in systems development. This enables developers to avoid putting excessive labour into elements that ultimately have little effect and to redirect effort to other aspects of the design that appear to have a major effect. The results of our study indicated that the interface had a much more important role in overall effectiveness than had been expected. This is fortunate, as an interface is relatively straightforward to prototype and improvements are easy to try out. However, it can be a rather neglected area of ILE development, and this neglect can have a deleterious effect on the performance of the system as a whole.
The paper raises problems rather than presenting solutions, and aims to inform future evaluators of the difficulties they may face. Mark and Greer (1993) have analysed and described a large variety of methods. Our aim is to highlight problems that may arise from using some of these methods, in particular the danger of the results being misleading.
Controlled evaluation is the technique that provides the most useful information to interested parties partly outside the research domain, such as educationalists and funding bodies. Such groups frequently want some external, objective measurement of research quality, productivity or effectiveness.
When bidding for research funds, ILE projects often claim that the completed
system will offer various improvements in the speed, quality and enjoyment
of learning. Included in the bid there is frequently a commitment to evaluate
whether and to what extent these claims have been met. There are various
techniques (Shute & Regian 1993) to increase the rigour and hence the
validity of the experiment. For the purposes of external justification,
a controlled experiment may be the best method (Legree & Gillis 1991).
Besides illustrating the negative effect an interface can have on overall performance, this case also illustrates a common problem that occurs when evaluating the effect of newer versions of a system. Frequently a number of fairly minor changes are made to the system, and it is then hard to attribute any improvement to an individual change. This naturally occurs due to the cost of experimentation; it is simply not feasible to run an experiment for every individual change.
In some circumstances the changes are inextricably linked anyway. For example, a new explanation generation facility may necessitate a new interface feature in order to operate. The improvement in performance may be due more to the greater explanatory clarity of the interface than to the explanation generation facility itself.
The culture clash can lead to different expectations about the necessity of experimental evaluation. There has been some debate on the role of science in HCI (Newell & Card 1985, Carroll & Campbell 1986, Newell & Card 1986) which applies equally to ILEs. Within the engineering paradigm, one regularly has to make decisions between design options, involving trade-offs on a number of dimensions. Evaluations can be used to provide the information for such trade-offs between alternatives. Although experimental evaluations may be used, their expense will restrict them to a very few particularly significant decisions out of the very many that any design includes. Experiments whose results are widely applicable across designs are of course most useful. In the main, however, decisions are made using more informal evaluation techniques, sometimes just the intuition of an individual designer. These many intuitive design decisions can nonetheless be informed by the experience of informal evaluations, such as watching students use prototypes.
The tension between the two viewpoints, manifested in the preference for either formal or informal evaluation, is directly comparable to the debate in Sociology between positivism and naturalism (Wilson 1971, Hammersley & Atkinson 1983). Positivism promotes the use of quantitative methods whereas naturalism emphasises ethnography.
The rapid advance in hardware and software development can lead to additional problems for experimental evaluation of computer systems. It takes time to undertake a thorough experiment. Also, it is desirable that the experiment should build upon earlier ones in order to permit comparisons. Unfortunately the hardware and software may improve so rapidly during the course of a sequence of experiments as to greatly reduce the value of the information gained. For example, there have been numerous thorough and elegant experiments undertaken to evaluate the effectiveness of various features of line-based editors which coincided with the rise to dominance of screen-based editors.
In a similar manner, as hardware and software improve, the results of earlier evaluations that had negative conclusions can be overruled on the grounds that any newly developed system could have features incorporated to avoid the problems that arose in the earlier system. With any experiment involving human input, including those relating to education, it is always possible to criticise a result that seems counter to one's expectations, either by attacking the methodology or by claiming that the effect is not generalisable across groups, domains, cultures etc. When the experiment involves computer systems as well, it becomes even easier to criticise because of the rapidly increasing sophistication of computer systems. Therefore an unexpected or unwanted result can be blamed on some technical flaw such as a primitive interface feature.
For example within HCI design there is some debate about the advantage of adaptive user interfaces (Browne et al. 1990); whether the improved adaptivity to the user's needs outweighs the potential bewilderment to the user of an interface that is continually varying. We can envisage an experiment to compare a conventional set of menus with one that altered its ordering according to the user's observed usage pattern of the menu options. Let us assume that the experiment revealed that the new system had a negative effect on performance. A computer scientist who believed in adaptive user interfaces could dismiss the general applicability of this result by claiming that it was due to the poor way the adaptive interface was designed, and propose an improved design that would indeed improve upon the conventional menu system.
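As a rough illustration (a sketch of my own, not drawn from the work cited above), the adaptive condition in such an experiment might simply reorder a fixed set of options by observed selection frequency, while the control condition presents the same options in a static order:

from collections import Counter

class AdaptiveMenu:
    def __init__(self, options):
        self.options = list(options)   # the fixed set of menu options
        self.usage = Counter()         # observed selection counts

    def select(self, option):
        # Record a selection made by the user.
        if option not in self.options:
            raise ValueError("unknown option: " + option)
        self.usage[option] += 1
        return option

    def ordering(self):
        # Most frequently used options first; ties keep the original order.
        return sorted(self.options,
                      key=lambda o: (-self.usage[o], self.options.index(o)))

menu = AdaptiveMenu(["Open", "Save", "Print", "Close"])
for choice in ["Save", "Print", "Save"]:
    menu.select(choice)
print(menu.ordering())   # ['Save', 'Print', 'Open', 'Close']

The experiment would then compare users' performance with the reordered menu against performance with the unchanging options list.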
The scientific paradigm (Toulmin 1972) appears to operate in its purest
form in the domains of Physics and Medicine. In particular, teams reproduce
the published experiments of others in order to test them further. An example
would be the intense worldwide activity in cold fusion (Close 1991). It
is interesting to speculate why this happens so rarely in the field of
ILE or even general computing, even by those who are most in favour of
summative evaluation. One might argue that this was due to the rapid change in the field, which would make the reproduction of experiments secondary to the development of improved systems. But Physics and Medicine are also
rapidly changing. Another argument is that funding bodies only want to fund new research, and that reproducing others' work would not attract any money.
By a similar argument, researchers may only gain academic respect by pursuing
novel research. In both cases it can be that the greater the novelty, the
better, making the comparison with the observations of earlier work more
problematic. If this is indeed the case, why does it not appear to apply
so much in Physics and Medicine? One reason may be that systems were often
developed on particular hardware and using particular combinations of software.
These configurations might be essential for the system to operate and also
be prohibitively expensive to reproduce. It might be expected that, with the trend towards more open systems and greater software portability, this reason would decline in effect over time. Issues of intellectual property
and other commercial pressures (Thimbleby 1990) may also make researchers
disinclined to share their software to enable experiments to be reproduced.
Finally it may boil down to the interests of the researchers. Those from
a computing background may just prefer to devote resources to building
and improving systems rather than running experiments and so only do the
latter when they absolutely have to. The consequence is a preference against controlled evaluation amongst Computer Scientists and a preference in favour of it amongst Educationalists and Psychologists.
The rapid prototyping development methodology (Lantz 1986) is the most extreme case of the use of formative evaluation to drive a computing project. It is most useful when dealing with interface issues and others where the user is closely involved, and it is thus appropriate for developing ILEs: it is frequently difficult for the developer to acquire sufficient objectivity to determine the likely effect that the interface being developed will have on the user.
Informal evaluation is also useful when supporting incremental improvement of an ILE. In this case, small focussed studies of the ILE in use can be undertaken. Cases are collected where the ILE's performance on a particular learning episode was less than ideal. These cases are listed in order of importance; a second list orders them by ease of re-implementation; and a third list, derived from the other two, gives a priority order for the changes to be implemented given the available resources.
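The combination of the two orderings can be sketched as follows. The case data, the way the two rankings are merged and the effort budget are all hypothetical; the point is only to show how an importance list and an ease-of-re-implementation list might be combined into a priority order within the available resources:

# Each problem case observed in use: (description, importance 1-10, effort in person-days)
cases = [
    ("plan menu mislabels goals",    9, 2),
    ("slow response on long proofs", 6, 8),
    ("unhelpful error message",      7, 1),
    ("minor font inconsistency",     2, 1),
]

# List 1: by importance (descending). List 2: by ease of re-implementation (ascending effort).
by_importance = sorted(cases, key=lambda c: -c[1])
by_ease = sorted(cases, key=lambda c: c[2])

# List 3: combine the two ranks; a lower combined rank means a higher priority.
rank = {c[0]: by_importance.index(c) + by_ease.index(c) for c in cases}
priority = sorted(cases, key=lambda c: rank[c[0]])

# Implement changes in priority order until the available resources are exhausted.
budget, plan = 5, []
for description, importance, effort in priority:
    if effort <= budget:
        plan.append(description)
        budget -= effort
print(plan)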
It is possible to undertake experiments for formative evaluation. These experiments need not be as rigorous as those used in controlled evaluation. One very popular technique is the 'Wizard of Oz' method (Mackay, 1988; Sandberg, Winkels & Breuker, 1988; McKevitt, 1990). In this, the user interacts with a computer interface but the information is passed to a human 'processor' in another location, who does the processing and passes back the reply via the interface. Often this method is used to test the efficacy of the interface in advance of the development of the internal components. It can also be used to test individual internal components of an incomplete system. In this case the human processor would select and prepare the data that the component under test would use, pass the data to that component, take the result and process it (mentally simulating the activity of all the incomplete components), and pass the result back to the user.
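The division of labour in a Wizard of Oz study can be sketched very simply (this is an illustration of the general arrangement, not the set-up of any particular study): the interface is a thin relay that forwards the student's input to the hidden human processor and returns whatever reply the wizard composes.

def wizard_reply(student_input):
    # In a real study this would be a link to the hidden human operator,
    # who mentally simulates the missing components and types the reply.
    return input("[wizard] reply to '" + student_input + "': ")

def session():
    while True:
        student_input = input("student> ")
        if student_input in ("quit", "exit"):
            break
        # The student sees only this reply, apparently produced by the system.
        print("system> " + wizard_reply(student_input))

session()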
Such a method was used in the BELLOC project (Twidale et al. 1992) to determine the efficacy of an ILE to support the learning of a foreign language. The component tested was the diagnostic description of students' applicable rules (consistent but non-standard grammar rules) and a technique for verifying the diagnosis. This could be tested for feasibility without the need to adapt and attach a robust parsing module. The experiment was valuable in revealing the problems that arose when using the verification technique. Once a potential misconception arose, the completed system was to give the student a sentence and ask whether or not it was correct. The sentence was constructed so that the reply would discriminate between students knowing the correct grammatical rule and those possessing the particular misconception that had been diagnosed. The study revealed both false positives and false negatives. There were cases where the students said the sentence was incorrect, but not for the expected reason; an additional grammatical misconception led them to diagnose an error in another part of the sentence. There were also cases where the students said a sentence was correct when they were expected to say that it was incorrect. In such cases, part of the sentence contained a construction that the students knew they frequently made errors on. They focussed on this construct, verified that it was correct and hence decided that the whole sentence was correct, ignoring the rest of the sentence containing the 'real' error that we were concerned about. The study reveals some of the advantages of formative evaluation. It was very small and thus cheap in terms of time and effort to undertake. Nevertheless it revealed important complicating factors that a completed ILE would have to cope with. If these factors had not been discovered early on, not only would they have reduced the effectiveness of a completed ILE but they would also have made the interpretation of the results of any summative controlled evaluation very problematic.
Although experimental in nature, Wizard of Oz techniques lack the objectivity of conventional experiments and so offer less convincing proof of any observed effect. Also, they usually involve fewer subjects than controlled experiments, since they are often more labour-intensive to undertake. They are, though, quite appropriate for formative evaluation purposes.
Informal studies of the system should lay far greater stress on negative
evidence of effectiveness than positive evidence. If the system is incomplete,
its coverage is limited. Therefore problems that arise with a component
that are not solely due to the absence of a future component can be guaranteed
to recur in the scaled-up, completed system. On the other hand, successes
with one component may not necessarily scale up as other components are
added. We might summarise the difference between informal and controlled
evaluation by saying that one should use controlled evaluation when one
wants to show the advantages of a system, and informal evaluation when
one wants to reveal difficulties.
Such activity can lead to the system substantially misinterpreting the student's actions, particularly when the activity takes place as a side issue in the middle of solving a problem. This misinterpretation may lead to inappropriate remediation that can be useless, irritating or even misleading for the student. Of course once a case such as this is cited it becomes obvious that ILEs should allow for spontaneous student experiments. The point is whether other such activities can be predicted and allowed for. Certain activities may only occur when an ILE is sufficiently sophisticated and supportive that it is worth the student's effort to undertake them. Consequently they will be rarely if ever observed before the system is built. Also, it may be that only an in-depth study will reveal what is occurring; in this case it was only by sitting alongside the students and encouraging articulation that it was possible to discover the motivation of the action, which quite changes both its meaning and the appropriate response of an improved system.
Another unexpected result that the study of EPIC revealed was that a sophisticated interface can have a supportive effect in addition to its main intended purpose of acquiring information about the student. The interface had been designed to enable the student to describe to the system the plans and goals she was using to find a logic proof. Figure 1 gives an example of the interface in use.
Figure 1. The EPIC Interface.
For others, although they had some understanding of the use of plans and goals, this was so hazy that in the pilot study they were quite unable to describe how they used them to tackle a problem. In both cases the menu and form based approach for the interface brought immediate effects in clarifying their understanding of the domain before the ILE could take advantage of the information provided for its own pedagogic purposes. It provides the student with a vocabulary for articulating learning issues that she may not even be aware of.
In addition to the effect on students unaware of higher level planning issues, the interface had an effect on other students in a less dramatic and more supportive manner. By making planning explicit, it externalised the process. This made it much easier for the student to analyse her own thought processes. It also allowed the system to catch slips at the planning level, and by providing an external memory, to reduce the likelihood of slips caused by working memory overload. This has a further advantage in that it reduces the chance of a slip leading to a long and unproductive exploration of a dead end within the search space.
The study revealed this issue when it was observed that students were correcting buggy plans before the plan checking elements of EPIC had a chance to criticise them; once a student had partially instantiated certain plans she would see for herself that they were inappropriate and abandon them before completing them and declaring that this was to be her plan.
The increased attention now paid to situated learning also leads to
the desirability of increased use of rapid prototyping and formative evaluation
(Clancey 1992) in order to assess the impact of the system in the context
of its normal use. This approach can be aided by more interdisciplinary
work and participatory design.
The ethnographic approach (Hammersley & Atkinson 1983) tackles the issue stated earlier of the need to focus on individual learners and separate learning episodes if we are to learn more about the individualisation abilities of a system. It comes from a quite different paradigm from that of computing, psychology or education research, namely the Chicago school of sociological observation, with strong links to anthropology. Within this paradigm the provision of anecdotal evidence and the judicious selection of telling and representative quotations or episodes are considered to be acceptable data. (NB. There are some educationalists whose work is in this style, such as Piaget).
There is a danger that those from a computing background will regard ethnographers as members, alongside psychologists and educational researchers, of a fairly uniform group of arts/humanities people. This can lead to misconceptions about the viewpoints and activities of the different research traditions, which can impair collaborative work. Indeed there are wide-ranging practices even within ethnography, and these are continually developing. The use of experiments (including rigorous ones) clearly separates the psychologists and education researchers from the ethnographers. Indeed, on a scale of formality of evaluation techniques used, computer scientists may find themselves between the two groups.
Ethnographic study is particularly concerned with the observation of behaviour in a natural setting, known as ecological validity. In the context of computer systems it thus lays stress on the interactions not just between the user and the system as they occur in the workplace, but also interactions between the user and other users, co-workers and other systems, computational, social, organizational etc. Due to this, strict ethnographic analysis may not be appropriate for formative evaluation as the prototype under investigation may not be robust enough to operate in its anticipated natural setting, particularly if that is to be the classroom. Where the intended users are not schoolchildren but adult learners, and the intended context of use is self study in a computer lab such as at a University, formative studies in a computing research lab may, with a little care, be undertaken without doing too much violence to the principles of ethnography, and still reap many of the benefits.
Although the in-depth study approach seems to have more of the flavour
of sociology about it than might be expected in the development of computer
systems, it does have similarities with certain existing activities within
computing and engineering. The study may be regarded as a technique of requirements capture from the perspective of Software Engineering (Sommerville 1992) and as being similar to the style of user-centred system design pioneered in Scandinavia (Ehn 1989). From the perspective of AI and Expert Systems development, the study can be seen to be a form of Knowledge Acquisition.
The choice of a domain will be constrained by the domain expertise available. It is surely no coincidence that the majority of ITSs cover domains such as arithmetic, algebra, calculus, basic electronics and elementary programming: those domains where the developer probably does not need to consult another domain expert. There is a tension in domain choice between keeping to a well-known domain, so that effectiveness can be directly compared with earlier systems, and choosing a novel domain in order to demonstrate originality, extend the domain coverage of the field of ILE and so improve the chances of acceptance of submitted papers.
Another crucial issue to consider in domain choice is the degree to which it facilitates evaluation. If the domain chosen will only be of relevance to a small number of students (within the environment where the research is undertaken, such as a particular university) then it may be difficult to get adequate numbers of students for a rigorous experiment. Furthermore, one should also allow for the unexpected need for a repeat experiment. Despite careful pilot studies, problems may arise either with the ILE, the course which it is supporting or the subjects themselves (such as a high drop-out rate). Where the ILE is supporting a course, this may mean that one has to wait for the course to be re-run, which could be in a year's time.
There are two alternatives:
2) To choose a domain that a group of volunteers might be expected to find interesting to learn about but which does not fit within their actual learning requirements. This can cover a range from 'toy' problems or puzzles such as those frequently used in cognitive science experiments (such as Hobbits and Orcs; McDaniel & Schlager 1990) to areas of general public interest such as electoral reform (Dillenbourg & Self 1992).
An example of a study undertaken in a 'toy' domain involved the discovery
of the equation determining the result of perfectly inelastic linear collisions
(Twidale 1991). The aim of the study was to investigate the difficulties
faced by students when learning using simulations, in order to determine
the kinds of support that an Intelligent Simulation Learning Environment
(ISLE) should provide (Hijne & Berkum 1992). Therefore an extremely
simple simulation domain and environment were chosen. The argument was that
problems that arose in this simple case would also arise in the case of
more sophisticated simulations, even if the latter also had additional
problems. In a similar manner, even though the undergraduate volunteer
subjects all appeared to be quite sophisticated learners, the results could
also be of use for ISLEs intended for less sophisticated learners, in that
the difficulties that those subjects faced could be expected to be faced
by others even if they also had additional difficulties. The study revealed
the unexpectedly wide range of problem-solving activities undertaken by
students in a simulation environment. This has major implications for systems
attempting to model such activity. Thus a simplified domain and sophisticated
learners can provide a first pass for the study of learning issues while
avoiding the complexities of a more realistic setting.
It should also reveal those elements that appear to be successful even in stripped-down prototypical form, providing evidence that in the improved version of the system they would be all the more effective. For example, the information gained from the novel interface components in the EPIC system was used to provide very simple diagnosis and remediation. There was neither a sophisticated user model nor a pedagogical expert. Since the study showed that the use of explicitness can achieve pedagogical results even with a minimal tutoring system, we can also argue that a system with a more sophisticated user model and pedagogical expert would have been yet more effective (Mark & Greer 1991).
The acquisition of anecdotal evidence can also be used to answer the
criticism of many studies, both formally experimental and not, that they
are unique to their circumstances. If an anecdotal finding of a certain
kind of student behaviour (e.g. the deliberate making of mistakes in order
to get the system to explain a poorly understood issue) is described for
different systems in different domains by different researchers in different
institutions then we can be more confident about its generality, than if
it was the result of a single controlled experiment (assuming that an experimenter
even set out to observe it). This method of verification of findings is
analogous to the use of the technique of triangulation in ethnography (Hammersley
& Atkinson 1983), where a variety of observational methods are employed
in the same study to determine the similarity of their findings.
2) It is feasible to create a simple expert system to mimic proof generation.
3) It is a small and self contained subject area.
4) There was an available pool of students wanting to learn the domain: it was taught as part of a first year introductory Logic course in the Philosophy department.
5) The course lecturer was willing to offer guidance and encourage participation by the students.
6) There was a genuine need for remedial support; some students found the domain very difficult and fell behind, particularly those with an arts/humanities background.
2) Evidence that the intermediate representations support the student's problem-solving attempts by clarifying planning and supporting error recovery from slips and planning and goalstack errors.
3) Evidence that a supportive environment encourages exploration, experimentation and more adventurous problem-solving.
4) Evidence that explicit intermediate representations have the potential for aiding the discussion of planning actions by system and by student.
5) Information about the acceptability to the student of articulating intermediate levels of problem solving.
6) Information about the acceptability and effectiveness (in supporting solving, discussion of techniques and in diagnosing errors) of explicit heuristics made available to the student.
2) Students of Propositional Calculus were known to have a variety of misconceptions and levels of ability. Their competency could also vary within a session, either improving as a result of learning or degrading as a result of cognitive overload. Given a limited supply of volunteers and very idiosyncratic performance, statistical evidence would be of dubious value.
3) The interface was known to have a substantial influence on the ILE's effectiveness which could swamp other variables.
4) The information to be obtained was to inform development and improvement of intermediate representations. The question the evaluation was to answer was not an overall performance question, such as "Do more students get more right answers if an instantiator is provided than if not?". Rather it is at the level of individual actions within a problem: "Under which circumstances did the instantiator do well and under which did it do badly?"
5) There were not enough resources available for a large-scale controlled experiment.
For example, by encouraging articulation of beliefs and focussing on the understanding of a few students, the EPIC study revealed that in certain cases, students may be able to competently perform simple exercises in a domain while still having major misconceptions about that domain, that only become apparent when they tackle a harder problem (one that cannot be solved by their non-standard techniques). This was observed at three levels of problem solving analysis:
1) Misconceptions about individual rules of inference.
This was unexpected. From conversations with the lecturer and observations of students, it had been assumed that students understood these. In fact they often had great difficulties (which they tried to hide from human tutors) with the less frequently used rules.
2) Misconceptions about plans and goals.
There could be substantially buggy techniques here that still managed to yield the optimal proof rapidly for very simple problems. On a slightly more difficult problem, however, the buggy technique is quite useless, leading to bewilderment on the part of the student and aimless searching of the solution space.
3) Misconceptions about the whole nature and purpose of proof.
These could lead to students happily breaking fundamental rules in circumstances where they could think of no simple solution method.
Conventionally, the straightforward questions that may be liable to non-standard and non-generalisable solution methods are presented first in an exercise. This is only to be expected and serves the purpose of boosting a student's confidence before progressing to more difficult problems. As a consequence the more challenging problems that can reveal such misconceptions are posed last in an exercise, and are generally few in number. They are frequently used to serve an additional purpose to that of testing deep understanding: that of keeping more able students occupied and interested while teacher time can be devoted to less able students. Although a laudable aim in terms of classroom management, this can have the unfortunate consequence of lowering the significance ascribed to such questions, so that in the extreme they are regarded as optional. A student who got 7 out of 10 in an exercise might be deemed to be quite competent in the domain, perhaps just making a couple of trivial slips, but if she got the first 7 questions right and the last 3 hard questions wrong, she may still have substantial misunderstandings. A situation such as this has been observed in elementary algebra (Kuchemann 1983). The general lesson from this observation is that student models need to be conservative in inferring understanding.
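The point about conservative inference can be made concrete with a small, hypothetical sketch: a student model that infers mastery from the overall percentage alone would pass the student described above, whereas one that also requires success on the challenging questions would not.

def infer_mastery(results):
    # results: one (correct, challenging) pair of booleans per question.
    total = sum(1 for correct, _ in results if correct) / len(results)
    hard = [correct for correct, challenging in results if challenging]
    hard_rate = sum(hard) / len(hard) if hard else 0.0
    # A naive model would return total >= 0.7; a conservative model also
    # demands that most of the challenging questions were answered correctly.
    return total >= 0.7 and hard_rate >= 0.7

# The student described above: the first 7 (easy) questions right, the last 3 (hard) wrong.
student = [(True, False)] * 7 + [(False, True)] * 3
print(infer_mastery(student))   # False, despite scoring 70% overall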
This issue is significant in determining the meaning of the numerical
results of summative experimental evaluations. Overall percentage results
are liable to obscure this effect. It is a problem that pervades much of
our educational system. For example, in British Universities a First Class
overall degree result requires getting a mark at or around 70%. In scientific
disciplines (the conventions for the marking of arts/humanities essays
may mean the argument does not apply there) the widespread assumption that
70% implies mastery can mean that students could have substantial misconceptions
about parts of the course and still be considered outstanding. This is
particularly unfortunate where one course is a prerequisite to another.
The tutor of the more advanced course may end up spending a substantial
proportion of the teaching time re-teaching the prerequisite knowledge.
The main aim of this paper has been to advocate the frequent use of
informal evaluation during development of any ILE. The field of ILE research
is still quite new and there is much to be learned by simple observation.
Also the provision of more sophisticated ILEs can lead to significant changes
in how learners behave that may be impossible to predict. As work proceeds
on the development of more sophisticated user interfaces, there is a trend
to allow the user to articulate more of her beliefs than was possible in earlier systems (and sometimes more than is usual in human-based tutoring).
Much needs to be learnt about this and its effect on the learning process
as a whole. We can even hope that an ILE can serve as a tool in its own
right to support the study of human learning. As such an ILE can serve
as a 'cognitive microscope', making visible activities occurring during
learning that are normally invisible because they occur in the student's
mind and usually leave a minimal external trace. The informal observations
advocated here would then be analogous to the observations and discoveries
of the first users of optical microscopes.
The research for this paper is supported by a Science and Engineering Research Council Junior Research Fellowship. The author would like to thank the following people who commented on earlier drafts of this paper: Dave Nichols, Michael Pengelly, John Self and Gary Spiers.
Browne, D., Totterdell, P., Norman, M. (Eds) (1990). Adaptive User Interfaces. Academic Press, London.
Carroll, J.M. & Campbell, R.L. (1986). Softening up hard science: reply to Newell and Card, Human Computer Interaction, 2 (2) 227-49.
Clancey, W.J. (1992). Guidon-Manage revisited: a socio-technical systems approach. In Frasson, C., Gauthier, G. & McCalla, G.I. (Eds.) Intelligent Tutoring Systems. Lecture Notes in Computer Science No. 608. 21-36.
Close, F.E. (1991). Too hot to handle: the race for cold fusion. Princeton U.P.
Corbett, A.T., Anderson, J.R. & Fincham, J.M. (1991). Menu selection v typing: effects on learning in an intelligent programming tutor. In Birnbaum, L. (Ed.) Proceedings, The International Conference on the Learning Sciences, Evanston, Illinois. 107-112. AACE, Charlottesville, VA.
Dillenbourg, P & Self, J. (1992). PeoplePower: a human-computer collaborative learning system. In Frasson, C., Gauthier, G. & McCalla, G.I. (Eds.) Intelligent Tutoring Systems. Lecture Notes In Computer Science, 608. Springer-Verlag, Berlin 651-660.
Ehn, P. (1989). Work-oriented design of computer artifacts. Hillsdale, NJ: Lawrence Erlbaum Associates.
Hammersley, M. & Atkinson, P. (1983). Ethnography principles in practice. Routledge, London.
Hijne, H., & Berkum, J.V. (1992). A functional architecture for intelligent simulation learning environments. In: Cerri, S.A. & Whiting, J. (Eds.) Learning Technology in the European Communities. Kluwer, Dordrecht. 91-108.
Kuchemann, D. (1983). Quantitative and formal methods for solving equations. Mathematics in School, 12(5) 17-19.
Lantz, K.E. (1986). The Prototyping Methodology. Prentice Hall, Englewood Cliffs New Jersey.
Legree, P.J. & Gillis, P.D. (1991). Product effectiveness evaluation criteria for intelligent tutoring systems. Journal of Computer-Based Instruction 18, 57-62.
Lemmon, E.J. (1965). Beginning Logic, London: Nelson.
Mackay, W. E. (1988). Tutoring information databases and iterative design. In Jonassen, D. H. (Ed) Instructional Design for Microcomputer Courseware Hillsdale, NJ Lawrence Erlbaum 327-44.
Mark, M.A. & Greer, J.E. (1991). The VCR tutor: evaluating instructional effectiveness. Proceedings of the 13th Annual Conference of the Cognitive Science Society. 564-569.
Mark, M.A. & Greer, J.E. (1993). Evaluation methodologies for intelligent tutoring systems. Journal of AI and Education (this volume).
McDaniel, M.A. & Schlager M.S. (1990). Discovery learning and transfer of problem-solving skills. Cognition and Instruction 7 (2), 129-159.
McKevitt, P. (1990). Acquiring User Models for Natural Language Dialogue Systems through Wizard-of-Oz Techniques. Proc. of the Second International Workshop on User Modeling, Honolulu, HI, 1-13.
Newell, A. & Card, S.K. (1985). The prospects for psychological science in human-computer interaction. Human-Computer Interaction, 1 (3) 209-42.
Newell, A. & Card, S.K. (1986). Straightening out softening up: response to Carroll and Campbell. Human-Computer Interaction, 2 (3) 251-67.
Sandberg, J., Winkels, R. & Breuker, J. (1988). Knowledge Acquisition for Intelligent Tutoring Systems. Procs. 2nd European Knowledge Acquisition for Knowledge Based Systems Workshop, Bonn.
Shute, V.J. & Regian, J.W. (1993). Principles for evaluating intelligent tutoring systems. Journal of AI and Education (this volume).
Sommerville, I. (1992). Software Engineering. Fourth Edition. Addison Wesley, Wokingham.
Suchman, L.A. (1987). Plans and situated actions. Cambridge University Press.
Thimbleby, H. (1990). User Interface Design. ACM Press, New York.
Toulmin, S. (1972). Human Understanding. Clarendon Press, Oxford.
Twidale, M. B. (1989). The use of explicit intermediate representations in intelligent tutoring systems. Unpublished PhD. thesis. Lancaster University.
Twidale, M. B. (1990). Knowledge Acquisition for Intelligent Tutoring Systems. Proceedings, Cognitive modelling and interactive environments NATO workshop, Eindhoven.
Twidale, M.B. (1991). Cognitive agoraphobia and dilettantism: issues for reactive learning environments. In Birnbaum, L. (Ed.) Proceedings, The International Conference on the Learning Sciences, Evanston, Illinois. 406-413. AACE, Charlottesville, VA.
Twidale, M., Pengelly, M., Chanier, T., and Self, J. (1992). Experiments on knowledge acquisition for learner modelling. In: Cerri, S.A. & Whiting, J. (Eds.) Learning Technology in the European Communities. Kluwer, Dordrecht. 355-368.
Twidale, M. B. (1992a). Improving error diagnosis using intermediate representations. Instructional Science (in press).
Twidale, M. B. (1992b). Student activity in an Intelligent Learning Environment, in Nwana, H. S. (Ed.) Mathematical Intelligent Learning Environments. Intellect (in press).
Wilson, T.P. (1971). Normative and interpretive paradigms in sociology. In Douglas, J.D. (ed.) Understanding Everyday Life. Routledge, London.