Redressing the balance: the advantages of informal evaluation techniques for Intelligent Learning Environments

Michael Twidale

Computing Department, Lancaster University, Lancaster LA1 4YR, UK.
 

Abstract

The paper discusses issues to be considered when evaluating an Intelligent Learning Environment. In particular it considers problems that may arise when using rigorous experimental methods and the usefulness of informal techniques. It advocates the frequent use of informal in-depth studies on prototypes during the development of a system to reveal problems with the ILE in use and to raise general issues applicable across ILEs. In particular it is claimed that student interaction with a novel ILE is very likely to include unexpected and unpredictable aspects that can only be revealed and thus accommodated by such studies.
 

Introduction

Whenever one describes work in Intelligent Learning Environments (ILEs) one can rightly expect to be asked how the work has been evaluated. Although there are many methods of evaluation available, implicit within such requests there is often an expectation of a rigorous, formal, experimental, controlled, summative evaluation (hereafter referred to as controlled evaluation). This method has the great advantage of objectivity and is seen as a proper mechanism for assessment of a research project and any resultant ILE. There is a danger that if any other method has been used this will be regarded as a poor alternative, implying that the researchers are afraid to submit their research to a 'proper' evaluation. The aim of this paper is to propose arguments to redress the implicit preference in favour of controlled evaluation. We consider some of the problems that can arise when using controlled evaluation, and how informal techniques can be powerful in obtaining certain kinds of information. We claim that the academic background of ILE research and its researchers and the discipline's relationship to other disciplines also have an effect on the choice of evaluation methodology.

The ideas are inspired and illustrated by a study undertaken of the EPIC system, an ILE for learning proof construction in propositional logic (Twidale 1992a), which revealed various problems with summative evaluation. The study revealed the unexpected ways in which students may interact with an ILE. These are difficult to predict and can have substantial effects on the overall effectiveness of the system. Therefore it is vital to undertake studies with prototypes as early as possible in systems development. This will enable the developers to avoid putting excessive labour into elements that ultimately have little effect and to redirect effort to other aspects of the design that appear to have a major effect. The results of our study indicated that the interface had a much more important role in overall effectiveness than had been expected. This is fortunate as it is relatively straightforward to prototype and to try out improvements. However, the interface can be a rather neglected area of ILE development, and this neglect can have a deleterious effect on the performance of the system as a whole.

The paper raises problems rather than presents solutions and aims to inform future evaluators of problems they may face. Mark and Greer (1993) have analysed and described a large variety of methods. Our aim is to highlight problems that may arise from using some of these methods, in particular the danger of the results being misleading.
 

Controlled evaluation

Formal experiments are mainly (but not exclusively) used for summative evaluation, where the aim is to assess the overall effectiveness of a completed system. Likewise, the informal techniques described later are most often associated with formative evaluation. The advantages of formal experimental evaluation are self-evident: chiefly, it provides an objective measure of overall effectiveness averaged over an (ideally) large number of students. It fits within the scientific paradigm of objectivity and reproducibility, an issue discussed later. We have already noted the higher status that can be accorded to controlled evaluation by fellow researchers compared to other techniques. This can contribute to the decision of which technique to employ for evaluation. Another contributory factor can be the requirements or expectations of external groups.

Controlled evaluation is the technique which provides the most useful information to interested parties partly outside the research domain such as educationalists and funding bodies. Such groups frequently want some external objective measurement of research quality, productivity or effectiveness. When bidding for research funds, ILE projects often claim that the completed system will offer various improvements in the speed, quality and enjoyment of learning. Included in the bid there is frequently a commitment to evaluate whether and to what extent these claims have been met. There are various techniques (Shute & Regian 1993) to increase the rigour and hence the validity of the experiment. For the purposes of external justification, a controlled experiment may be the best method (Legree & Gillis 1991).
 

Limitations of controlled evaluation

Rigorous experiments are large, slow and costly

As Shute and Regian (1993) note, a crucial activity is to define the goals of any evaluation study. Unfortunately there can be an unlimited number of things one would like to know about. It is important to note that selection of evaluation techniques is not only determined by the goals of the study but also by the resources available. All evaluation techniques consume resources (time, labour and materials) and a rigorous, controlled evaluation is one of the most expensive. Indeed evaluation can be a research project in its own right, quite separate from the development of the system being evaluated. Project managers must decide how to distribute resources between development and evaluation, aware that more of the latter means less of the former. A further complicating factor is that it may be difficult to get sufficient numbers of appropriate subjects, with the right sort of prerequisite knowledge.
 

A controlled experiment only really measures one thing

A controlled experiment can only properly measure the effect of a single variable. By clever design one may combine several sub-experiments into one, enabling the measurement of several variables. However the number is still quite small. With a novel project there may be many variables about which one wishes to obtain information and for which at the early stages of either development of that project or of complete systems development in that field, the rigorous experimental method is unnecessary and constraining.
 

A controlled experiment produces averaged out figures of overall performance.

The purpose of ILE research is surely to offer individualised instruction, and our concern should therefore be as much with how individuals fare with the system as with the mass of students. It is also of interest to assess how the system has coped with an individual's learning issue at a given time. Besides the overall effect we may also wish to know which features of the system are contributing to this effect. Sometimes this can be done by further controlled evaluation. For example, if one wishes to assess what the effect of, say, the video explanation unit is, one can run an experiment with two conditions, one with and one without the unit. Problems arise when the features are closely interleaved. It may be impossible to detach one feature and still have a meaningful system, or the effect may be synergistic, only occurring when all the features are present. It may also turn out that one feature unexpectedly is having a disproportionate effect on the output. In a research area, systems are never completed. Indeed there may be major educational 'bugs' in the system (in addition to the inevitable software bugs) that were never eradicated. These can have a substantial negative effect on overall performance. This may swamp any effect of the feature that one is attempting to measure.
 

Unexpected interactions may lead to misleading results

The combination of different system components can lead to results that are quite different from when the components are used individually. The provision of a novel learning environment may precipitate novel learning behaviours by the student. Even though an overall measure of effect may be desired, these complex interactions may mean that a longer period of familiarisation with the ILE should be provided than is usually possible in projects if definitive results are to be obtained. Furthermore, developers of future ILEs will need to know in more detail about such effects (whether positive or negative for learning) than the averaged information that a summative evaluation can provide.
 

The effect of the interface

The interface to an ILE can have a crucial effect on its learning outcomes. As has been frequently stated, for the user the interface is the computer. The effect of the interface can be so great as to overwhelm that of other features of the ILE, including those that are providing the claimed 'intelligence'. A poorly designed interface can have a negative effect on the overall learning process. Unfortunately this negative effect can have a substantial impact on any measure of the learning outcome. For example, in EPIC, the study revealed a frequent error made by students in applying one of the rules of derivation in a certain context. This error was only rarely seen when students attempted problems using pen and paper. It turned out that the students making the error were misinterpreting one of EPIC's prompts for rule application. In the context where the error occurred, the prompt was poorly worded. If EPIC's effectiveness were to be assessed by summative evaluation in terms of errors committed by the student, then an improved system EPIC* could claim a significant improvement in performance just by rephrasing the offending prompt and so eliminating a whole class of errors!

Besides illustrating the negative effect an interface can have on overall performance, this case also illustrates a common problem that occurs when evaluating the effect of newer versions of a system. It is frequently the case that a number of fairly minor changes are made to the system and then it is hard to attribute the improvement to the individual change. This naturally occurs due to the cost of experimentation; it just is not feasible to do an experiment for every individual change.

In some circumstances the changes are inextricably linked anyway. For example, a new explanation generation facility may necessitate a new interface feature in order to operate. The improvement in performance may be due more to the greater explanatory clarity of the interface than to the explanation generator itself.
 

Learning to learn with an ILE takes time

When a student begins interacting with an ILE she has to learn two different subjects; the domain to be taught and how to use the interface provided. In addition she may also have to cope with higher level issues such as the concept of learning in a novel learning environment and discovering how to go about learning in this environment. This is particularly the case when the ILE is attempting to provide a more open-ended, student-centred ethos, which the student may be unfamiliar with. For all these reasons, the student may take some considerable time to get up to speed in terms of learning effectiveness. Unfortunately, once the code of the ILE has stabilised, there can be considerable pressure to evaluate immediately, in order to justify the programming effort.
 

Issues of Paradigms

Research in Intelligent Learning Environments may be undertaken by researchers originating from very different academic backgrounds. The three most common are Computing, Psychology and Education Research. Although related and interlinked, problems can arise due to these disciplines belonging to different paradigms. Psychology and Education Research belong to the scientific paradigm which lays great stress on the formal objective summative experiment as a means of justifying theories (although there are groups in both disciplines advocating more informal techniques). By contrast, parts of computing research are more closely associated with the engineering paradigm, which also employs proof by construction; if the program works in the manner expected then the theory has been justified. The amount of time that each discipline devotes to teaching various experimental techniques to their undergraduates illustrates the point.

The culture clash can lead to different expectations about the necessity of experimental evaluation. There has been some debate on the role of science in HCI (Newell & Card 1985, Carroll & Campbell 1986, Newell & Card 1986) which applies equally to ILEs. Within the engineering paradigm, one regularly has to make decisions between design options, involving trade-offs on a number of dimensions. Evaluations can be used to provide the information for such trade-offs between alternatives. Although experimental evaluations may be used, their expense will restrict them to a very few particularly significant decisions out of the very many that any design includes. Experiments whose results are widely applicable across designs are of course most useful. In the main however, decisions are made using more informal evaluation techniques, sometimes just the intuition of an individual designer. These many, intuitive design decisions can though be informed by the experience of informal evaluations such as watching the use by students of prototypes.

The tension between the two viewpoints, which is manifested in the preferences for either formal or informal evaluation, is directly comparable to the debate in Sociology between positivism and naturalism (Wilson 1971, Hammersley & Atkinson 1983). Positivism promotes the use of quantitative methods whereas naturalism emphasises ethnography.

The rapid advance in hardware and software development can lead to additional problems for experimental evaluation of computer systems. It takes time to undertake a thorough experiment. Also, it is desirable that the experiment should build upon earlier ones in order to permit comparisons. Unfortunately the hardware and software may improve so rapidly during the course of a sequence of experiments as to greatly reduce the value of the information gained. For example, there have been numerous thorough and elegant experiments undertaken to evaluate the effectiveness of various features of line based editors which coincided with the rise to dominance of screen based editors.

In a similar manner, as hardware and software improve, the results of earlier evaluations that had negative conclusions can be overruled on the grounds that any newly developed system could have features incorporated to avoid the problems that arose in the earlier system. With any experiment involving human input, including those relating to education, it is always possible to criticise the result if it seems counter to one's expectations by either criticising the methodology or by claiming that the effect is not generalisable across groups, domains, cultures etc. When the experiment involves computer systems as well, it becomes even easier to criticise because of the rapidly increasing sophistication of computer systems. Therefore an unexpected or unwanted result can be blamed on some technical flaw such as a primitive interface feature.

For example within HCI design there is some debate about the advantage of adaptive user interfaces (Browne et al. 1990); whether the improved adaptivity to the user's needs outweighs the potential bewilderment to the user of an interface that is continually varying. We can envisage an experiment to compare a conventional set of menus with one that altered its ordering according to the user's observed usage pattern of the menu options. Let us assume that the experiment revealed that the new system had a negative effect on performance. A computer scientist who believed in adaptive user interfaces could dismiss the general applicability of this result by claiming that it was due to the poor way the adaptive interface was designed, and propose an improved design that would indeed improve upon the conventional menu system.
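The adaptive condition in such an experiment might look something like the following minimal sketch. It is purely illustrative: the class name `AdaptiveMenu`, its methods, and the rule of ordering by raw selection counts are assumptions made here, not a description of any system in the cited literature. The conventional condition would simply present the options in their fixed initial order.

```python
from collections import Counter


class AdaptiveMenu:
    """Hypothetical menu that reorders its options by observed usage."""

    def __init__(self, options):
        self._options = list(options)   # fixed initial (conventional) order
        self._usage = Counter()         # how often each option was chosen

    def select(self, option):
        """Record that the user chose `option`."""
        if option not in self._options:
            raise ValueError(f"unknown menu option: {option!r}")
        self._usage[option] += 1

    def ordering(self):
        """Return options with the most frequently used first.

        sorted() is stable, so options with equal counts keep their
        initial relative order.
        """
        return sorted(self._options, key=lambda o: -self._usage[o])
```

Even this small sketch shows how many design decisions (when to reorder, how to break ties, whether counts decay over time) are open to the objection that a negative result reflects the particular design rather than adaptivity as such.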

The scientific paradigm (Toulmin 1972) appears to operate in its purest form in the domains of Physics and Medicine. In particular, teams reproduce the published experiments of others in order to test them further. An example would be the intense worldwide activity in cold fusion (Close 1991). It is interesting to speculate why this happens so rarely in the field of ILE or even general computing, even by those who are most in favour of summative evaluation. One might argue that this was due to the rapid change in the field which would make the reproducing of experiments secondary to the development of improved systems. But Physics and Medicine are also rapidly changing. Another argument is that funding bodies only want to fund new research and reproducing others' work would not attract any money. By a similar argument, researchers may only gain academic respect by pursuing novel research. In both cases it can be that the greater the novelty, the better, making the comparison with the observations of earlier work more problematic. If this is indeed the case, why does it not appear to apply so much in Physics and Medicine? One reason may be that systems were often developed on particular hardware and using particular combinations of software. These configurations might be essential for the system to operate and also be prohibitively expensive to reproduce. It might be expected that with the trend towards more open systems and greater software portability this reason would decline in effect over time. Issues of intellectual property and other commercial pressures (Thimbleby 1990) may also make researchers disinclined to share their software to enable experiments to be reproduced. Finally it may boil down to the interests of the researchers. Those from a computing background may just prefer to devote resources to building and improving systems rather than running experiments and so only do the latter when they absolutely have to. The consequence is a preference against controlled evaluation amongst Computer Scientists and a preference in favour amongst Educationalists and Psychologists.
 

Informal Evaluation

Informal techniques are frequently used in formative evaluation, where the ILE is known to be incomplete and so a formal assessment of overall performance is inappropriate. Assessment of those components that have been completed is however appropriate. The techniques can also be usefully applied to a completed system. However, the results are often more prescriptive than those of summative evaluations (whose descriptive results are nevertheless likely to be more precise). That is, the informal evaluation will describe the shortcomings of aspects of the system, with implications for future development.

The rapid prototyping development methodology (Lantz 1986) is the most extreme case of the use of formative evaluation to drive a computing project. It is most useful when dealing with interface issues and others where the user is closely involved. Thus it is appropriate for developing ILEs. It is frequently difficult for the developer to acquire sufficient objectivity to determine the likely effect that the interface being developed will have on the user.

Informal evaluation is also useful when supporting incremental improvement of an ILE. In this case, small focussed studies of the ILE in use can be undertaken. Cases are collected where the ILE's performance on a particular learning episode was less than ideal. These cases are listed by order of importance. Another list is created by order of ease of re-implementation. A third list is created using the other two to create a priority order for the changes to be implemented given the available resources.
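The three lists described above can be sketched as follows. This is only one possible reading: the text does not say how the first two lists are combined, so the rule used here (summing each case's position in the two lists, then selecting greedily within an effort budget) is an assumption, as are the names `Case` and `prioritise` and the idea of expressing resources as a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """A learning episode where the ILE's performance was less than ideal."""
    name: str
    importance: int  # hypothetical rank: 1 = most important
    effort: int      # hypothetical cost of re-implementation (arbitrary units)


def prioritise(cases, budget):
    """Return (by_importance, by_ease, chosen) for the given effort budget."""
    # List 1: order of importance.
    by_importance = sorted(cases, key=lambda c: c.importance)
    # List 2: order of ease of re-implementation (cheapest first).
    by_ease = sorted(cases, key=lambda c: c.effort)

    # List 3 (assumed rule): combine by summing positions in the two lists,
    # breaking ties in favour of the more important case.
    score = {c.name: by_importance.index(c) + by_ease.index(c) for c in cases}
    combined = sorted(cases, key=lambda c: (score[c.name], c.importance))

    # Keep only the changes that fit within the available resources.
    chosen, spent = [], 0
    for case in combined:
        if spent + case.effort <= budget:
            chosen.append(case)
            spent += case.effort
    return by_importance, by_ease, chosen
```

In practice the combination is likely to be a matter of judgement rather than arithmetic; the sketch is meant only to make the structure of the three lists concrete.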

It is possible to undertake experiments for formative evaluation. These experiments need not be as rigorous as those used in controlled evaluation. One very popular technique is the 'Wizard of Oz' method (Mackay, 1988; Sandberg, Winkels & Breuker, 1988; McKevitt, 1990). In this the user interacts with a computer interface but the information is passed to a human 'processor' in another location who does the processing and passes back the reply via the interface. Often this method is used to test the efficacy of the interface in advance of the development of the internal components. It can also be used to test individual internal components of an incomplete system. In this case the human processor would select and prepare the data that the component under test would use, pass the data to that component, take the result and process it, mentally simulating the activity of all the incomplete components and pass the result back to the user.

Such a method was used in the BELLOC project (Twidale et al. 1992) to determine the efficacy of an ILE to support the learning of a foreign language. The component tested was the diagnostic description of students' applicable rules (consistent but non-standard grammar rules) and a technique for verifying the diagnosis. This could be tested for feasibility without the need to adapt and attach a robust parsing module. The experiment was valuable in revealing the problems that arose when using the verification technique. Once a potential misconception arose, the completed system was to give the student a sentence and ask whether or not it was correct. The sentence was constructed so that the reply would discriminate between students knowing the correct grammatical rule and those possessing the particular misconception that had been diagnosed. The study revealed both false positives and false negatives. There were cases where the students said the sentence was incorrect, but not for the expected reason; an additional grammatical misconception led them to diagnose an error in another part of the sentence. There were also cases where the students said a sentence was correct when they were expected to say that it was incorrect. In such cases, part of the sentence contained a construction that the students knew that they frequently made errors on. They focussed on this construct, verified that it was correct and hence decided that the whole sentence was correct, ignoring the rest of the sentence containing the 'real' error that we were concerned about. The study reveals some of the advantages of formative evaluation. It was very small and thus cheap in terms of time and effort to undertake. Nevertheless it revealed important complicating factors that a completed ILE would have to cope with. If these factors had not been discovered early on, not only would they have reduced the effectiveness of a completed ILE but they would also have made the interpretation of the results of any summative controlled evaluation very problematic.

Although experimental in nature, Wizard of Oz techniques lack the objectivity of conventional experiments and so offer less convincing proof of any observed effect. Also, they usually involve fewer subjects than controlled experiments, since they are often more labour intensive to undertake. They are though quite appropriate for formative evaluation purposes.

Informal studies of the system should lay far greater stress on negative evidence of effectiveness than positive evidence. If the system is incomplete, its coverage is limited. Therefore problems that arise with a component that are not solely due to the absence of a future component can be guaranteed to recur in the scaled up completed system. On the other hand, successes with one component may not necessarily scale up as other components are added. We might summarise the difference between informal and controlled evaluation by saying that one should use controlled evaluation when one wants to show the advantages of a system, and informal evaluation when one wants to reveal difficulties.
 

Discovery of the unexpected

Sometimes the student undertakes activities with the ILE that were not predicted when it was being developed. For example, it was observed in the study of EPIC in use that some students would make deliberate mistakes. These were not done merely to get a more interesting response from the system as noted for other systems (Burton & Brown 1982), nor in order to test the system. Rather it appeared that the student was quite aware that the action was erroneous (they stressed to the experimenter that they did know that the step was probably illegal), but was unsure about the reason why. They knew the system would give a response, which would provide them with the information they were curious about.

Such activity can lead to the system substantially misinterpreting the student's actions, particularly when the activity takes place as a side issue in the middle of solving a problem. This misinterpretation may lead to inappropriate remediation that can be useless, irritating or even misleading for the student. Of course once a case such as this is cited it becomes obvious that ILEs should allow for spontaneous student experiments. The point is whether other such activities can be predicted and allowed for. Certain activities may only occur when an ILE is sufficiently sophisticated and supportive that it is worth the student's effort to undertake them. Consequently they will be rarely if ever observed before the system is built. Also, it may be that only an in-depth study will reveal what is occurring; in this case it was only by sitting alongside the students and encouraging articulation that it was possible to discover the motivation of the action, which quite changes both its meaning and the appropriate response of an improved system.

Another unexpected result that the study of EPIC revealed was that a sophisticated interface can have a supportive effect in addition to its main intended purpose of acquiring information about the student. The interface had been designed to enable the student to describe to the system the plans and goals she was using to find a logic proof. Figure 1 gives an example of the interface in use.

 
 

Figure 1. The EPIC Interface.

Although implemented as a knowledge acquisition tool for the immediate benefit of the system, the explicit planning interface also became a useful pedagogical tool. Frequently the techniques of plan and goal manipulation are left for the student to learn implicitly. For some students the provision of this interface feature had a very dramatic effect; beforehand they were not even aware that they were meant to be thinking in terms of plans and goals. For them logic proof was a matter of blind search. The presence of the interface had the effect of directly explaining to them a new way to regard the problem and a means of undertaking plan and goal-based reasoning.

For others, although they had some understanding of the use of plans and goals, this was so hazy that in the pilot study they were quite unable to describe how they used them to tackle a problem. In both cases the menu and form based approach for the interface brought immediate effects in clarifying their understanding of the domain before the ILE could take advantage of the information provided for its own pedagogic purposes. It provides the student with a vocabulary for articulating learning issues that she may not even be aware of.

In addition to the effect on students unaware of higher level planning issues, the interface had an effect on other students in a less dramatic and more supportive manner. By making planning explicit, it externalised the process. This made it much easier for the student to analyse her own thought processes. It also allowed the system to catch slips at the planning level, and by providing an external memory, to reduce the likelihood of slips caused by working memory overload. This has a further advantage in that it reduces the chance of a slip leading to a long and unproductive exploration of a dead end within the search space.

The study revealed this issue when it was observed that students were correcting buggy plans before the plan checking elements of EPIC had a chance to criticise them; once a student had partially instantiated certain plans she would see for herself that they were inappropriate and abandon them before completing them and declaring that this was to be her plan.

The increased attention now paid to situated learning also leads to the desirability of increased use of rapid prototyping and formative evaluation (Clancey 1992) in order to assess the impact of the system in the context of its normal use. This approach can be aided by more interdisciplinary work and participatory design.
 

Ethnography as an alternative methodology

The use of ethnographic methods in the formative evaluation of ILEs was pioneered by Suchman (1987). Her detailed observations of a small number of users working with the intelligent help system of a photocopying machine revealed various episodes where the help system either was of no use or even managed to mislead the users due to their misinterpretation of its prompts and its misinterpretation of their consequent actions. Of interest here is that a conventional experimental evaluation would have failed to reveal these micro-problems if the overall performance of the ILE was positive.

The ethnographic approach (Hammersley & Atkinson 1983) tackles the issue stated earlier of the need to focus on individual learners and separate learning episodes if we are to learn more about the individualisation abilities of a system. It comes from a quite different paradigm from that of computing, psychology or education research, namely that of the Chicago school of sociological observation, with strong links to anthropology. Within this paradigm the provision of anecdotal evidence and judicious selection of telling and representative quotations or episodes is considered to be acceptable data. (NB. There are some educationalists whose work is in this style, such as Piaget).

There is a danger that those from a computing background will regard ethnographers as members, alongside psychologists and educational researchers, of a fairly uniform group of arts/humanities people. This can lead to misconceptions of the viewpoints and activities of the different research traditions which can impair collaborative work. Indeed there are wide ranging practices even within ethnography that are continually developing. The use of experiments (including rigorous ones) clearly separates the psychologists and education researchers from the ethnographers. Indeed on a scale of formality of evaluation techniques used, computer scientists may find themselves between the two groups.

Ethnographic study is particularly concerned with the observation of behaviour in a natural setting, known as ecological validity. In the context of computer systems it thus lays stress on the interactions not just between the user and the system as they occur in the workplace, but also interactions between the user and other users, co-workers and other systems, computational, social, organizational etc. Due to this, strict ethnographic analysis may not be appropriate for formative evaluation as the prototype under investigation may not be robust enough to operate in its anticipated natural setting, particularly if that is to be the classroom. Where the intended users are not schoolchildren but adult learners, and the intended context of use is self study in a computer lab such as at a University, formative studies in a computing research lab may, with a little care, be undertaken without doing too much violence to the principles of ethnography, and still reap many of the benefits.

Although the in-depth study approach seems to have more of the flavour of sociology about it than might be expected in the development of computer systems, it does have similarities with certain existing activities within computing and engineering. From the perspective of Software Engineering the study may be regarded as a technique of requirements capture (Sommerville 1992), and it is similar in style to the user-centred system design pioneered in Scandinavia (Ehn 1989). From the perspective of AI and Expert Systems development the study can be seen to be a form of Knowledge Acquisition (Twidale 1990).
 

The effect of the subject domain on evaluation

In order to test a theory about ILEs one needs to build a system which can then be evaluated. In order to do this one must choose a subject domain. This may be predetermined if one has an existing ILE and one is modifying a component (eg Corbett et al. 1991). Frequently one has to build a system from scratch and so can choose the domain to teach. An interesting question is the effect that domain choice has on the result of the evaluation. It may be that certain domains are more amenable to the technique under test. Ideally one would build more than one system in order to demonstrate the generality of the technique and test both. However the cost both of systems development and of undertaking experiments on two or more systems will usually preclude this.

The choice of a domain will be constrained by the domain expertise available. It is surely no coincidence that the majority of ITS cover domains such as arithmetic, algebra, calculus, basic electronics and elementary programming; those domains where the developer probably does not need to consult another domain expert. There is a tension in domain choice between keeping to a well known domain so that effectiveness can be directly compared with earlier systems and choosing a novel domain in order to demonstrate originality, extend the domain coverage of the field of ILE and so improve the chances of acceptance of submitted papers.

Another crucial issue to consider in domain choice is the degree to which it facilitates evaluation. If the domain chosen will only be of relevance to a small number of students (within the environment where the research is undertaken, such as a particular university) then it may be difficult to get adequate numbers of students for a rigorous experiment. Furthermore, one should also allow for the unexpected need for a repeat experiment. Despite careful pilot studies, problems may arise either with the ILE, the course which it is supporting or the subjects themselves (such as a high drop-out rate). Where the ILE is supporting a course, this may mean that one has to wait for the course to be re-run, which could be in a year's time.

There are two alternatives:

1) To choose a domain where there is a continuous stream of people wanting to learn and circumstances permit them to be taught on numerous occasions throughout the year. This can include such things as introductions to word-processing or programming for people wanting to learn about them but who are not enrolled on formal programs, or where the introductory course is offered several times in an academic year. Training courses often fit this remit.

2) To choose a domain that a group of volunteers might be expected to find interesting to learn about but which does not fit within their actual learning requirements. This can cover a range from 'toy' problems or puzzles such as those frequently used in cognitive science experiments (such as Hobbits and Orcs; McDaniel & Schlager 1990) to areas of general public interest such as electoral reform (Dillenbourg & Self 1992).

The advantage of toy domains is that they can be easier to develop, can have fewer problems with assumptions of prerequisite knowledge possessed by the student and can be used by a larger group of potential subjects. The disadvantages include problems of scaling up; whether the results could equally apply to a larger ILE teaching a 'proper' subject. There may also be motivation problems in that the student is less committed to learn about the domain than can be expected for a course she has opted to take.

An example of a study undertaken in a 'toy' domain involved the discovery of the equation determining the result of perfectly inelastic linear collisions (Twidale 1991). The aim of the study was to investigate the difficulties faced by students when learning using simulations, in order to determine the kinds of support that an Intelligent Simulation Learning Environment (ISLE) should provide (Hijne & Berkum 1992). Therefore an extremely simple simulation domain and environment was chosen. The argument was that problems that arose in this simple case would also arise in the case of more sophisticated simulations, even if the latter also had additional problems. In a similar manner, even though the undergraduate volunteer subjects all appeared to be quite sophisticated learners, the results could also be of use for ISLEs intended for less sophisticated learners, in that the difficulties that those subjects faced could be expected to be faced by others even if they also had additional difficulties. The study revealed the unexpectedly wide range of problem-solving activities undertaken by students in a simulation environment. This has major implications for systems attempting to model such activity. Thus a simplified domain and sophisticated learners can provide a first pass for the studying of learning issues while avoiding the complexities of a more realistic setting.
 

Towards greater rigour for formative evaluation

The informal techniques described for formative evaluation, when perceived from the scientific paradigm, can appear hopelessly vague and seem to open the floodgates to papers consisting of endless discursive anecdotes. There are ways to keep these very real dangers in check without losing the speed and simplicity that are major advantages of the informal methods. Firstly it must be remembered that the information acquired is to be used in formative evaluation. The test for the effectiveness of the evaluation technique is whether it leads to the building of better systems. (NB Establishing whether they are indeed better may well require a controlled summative technique.) The justification is the engineering-style proof by construction. A formative study should reveal problems in the system that can be corrected in subsequent versions. Also, in a restricted domain it should reveal those problems that are also likely to occur in more realistic, complex domains. If only the latter were investigated, these fundamental problems might initially be swamped by more superficial domain-specific problems. Just because a study of a system has identified problems in a simple case and led to their subsequent remediation, there is no guarantee that new problems will not arise in the more complex case. However, we can guarantee that if the simple case problems had not been identified, they would have appeared in the complex case, perhaps in a form less easy to identify.

It should also reveal those elements that appear to be successful even in stripped down prototypical form, providing evidence that in the improved version of the system, they would be all the more effective. For example, the information gained from the novel interface components in the EPIC system was used to provide very simple diagnosis and remediation. There was neither a sophisticated user model nor pedagogical expert. Since the results showed that the use of explicitness can achieve pedagogical results even with a minimal tutoring system, we can also argue that a system with a more sophisticated user model and pedagogical expert would have been yet more effective (Mark & Greer 1991).

The acquisition of anecdotal evidence can also be used to answer the criticism made of many studies, both formally experimental and not, that they are unique to their circumstances. If an anecdotal finding of a certain kind of student behaviour (e.g. the deliberate making of mistakes in order to get the system to explain a poorly understood issue) is described for different systems in different domains by different researchers in different institutions, then we can be more confident about its generality than if it were the result of a single controlled experiment (assuming that an experimenter even set out to observe it). This method of verification of findings is analogous to the use of the technique of triangulation in ethnography (Hammersley & Atkinson 1983), where a variety of observational methods are employed in the same study to determine the similarity of their findings.
 

The study of EPIC in use: an example of informal evaluation

The EPIC system (Twidale 1992a&b) was developed in order to test the ideas of making explicit the plans and goals of the student when problem solving by use of menus, form filling, annotation and explicit instantiation. The description here illustrates the use of the informal techniques advocated.
 
Choice of domain
As an example domain to test the theory, proof generation in Propositional Calculus in the style of Lemmon (1965) was chosen. The domain was chosen because it had certain desirable features:

1) The process of checking a proof is computationally trivial.

2) It is feasible to create a simple expert system to mimic proof generation.

3) It is a small and self contained subject area.

4) There was an available pool of students wanting to learn the domain: it was taught as part of a first year introductory Logic course in the Philosophy department.

5) The course lecturer was willing to offer guidance and encourage participation by the students.

6) There was a genuine need for remedial support; some students found the domain very difficult and fell behind, particularly those with an arts/humanities background.

The ease of checking the steps of a proof enabled greater effort to be devoted to supporting the underlying planning of proof construction.
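To illustrate why checking a proof step is computationally cheap, the following is a minimal sketch in Python, not the EPIC implementation. It assumes a Lemmon-style layout in which each line records the assumptions it rests on, a formula, a rule name and the lines cited, and it covers only two rules: assumption ("A") and modus ponendo ponens ("MPP"). The names Line, check_line and check_proof, and the tuple encoding of formulas, are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# Hypothetical encoding: an atom is a string such as "P";
# an implication A -> B is the tuple ("->", A, B).


@dataclass(frozen=True)
class Line:
    assumptions: FrozenSet[int]   # numbers of the assumptions this line rests on
    formula: object               # the formula asserted on this line
    rule: str                     # "A" (assumption) or "MPP"
    cites: Tuple[int, ...] = ()   # earlier lines the rule is applied to


def check_line(number: int, line: Line, proof: Dict[int, Line]) -> bool:
    """Return True if `line` is a legal step given the lines it cites."""
    if line.rule == "A":
        # An assumption rests only on itself and cites nothing.
        return line.assumptions == frozenset({number}) and not line.cites
    if line.rule == "MPP":
        # MPP must cite exactly two earlier lines that exist in the proof.
        if len(line.cites) != 2:
            return False
        if any(c >= number or c not in proof for c in line.cites):
            return False
        first, second = (proof[c] for c in line.cites)
        # Accept the conditional and its antecedent in either order.
        for cond, ante in ((first, second), (second, first)):
            f = cond.formula
            if (isinstance(f, tuple) and len(f) == 3 and f[0] == "->"
                    and f[1] == ante.formula and f[2] == line.formula):
                # The conclusion rests on the pooled assumptions.
                return line.assumptions == cond.assumptions | ante.assumptions
        return False
    return False  # rules outside this sketch are rejected


def check_proof(proof: Dict[int, Line]) -> bool:
    """Check every line in order; each check is a few comparisons."""
    return all(check_line(n, proof[n], proof) for n in sorted(proof))


# Example: from P -> Q and P, derive Q.
example_proof = {
    1: Line(frozenset({1}), ("->", "P", "Q"), "A"),
    2: Line(frozenset({2}), "P", "A"),
    3: Line(frozenset({1, 2}), "Q", "MPP", (1, 2)),
}
print(check_proof(example_proof))  # prints True
```

Each step is validated by a handful of comparisons against the cited lines, which is the sense in which checking is trivial. Generating a proof, by contrast, requires planning, and that is where the support effort was directed.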
 
Purpose of the study
The purpose of developing and evaluating EPIC was to investigate the feasibility of explicit instantiation and planning both in terms of ease of use by the student and potential diagnostic and remedial capability by the system. The study of EPIC in use was intended to obtain the following information:

1) Evidence that explicit planning and instantiation can more accurately and more easily pinpoint errors and misconceptions than conventional techniques such as plan recognition.

2) Evidence that the intermediate representations support the student's problem-solving attempts by clarifying planning and supporting error recovery from slips and planning and goalstack errors.

3) Evidence that a supportive environment encourages exploration, experimentation and more adventurous problem-solving.

4) Evidence that explicit intermediate representations have the potential for aiding the discussion of planning actions by system and by student.

5) Information about the acceptability to the student of articulating intermediate levels of problem solving.

6) Information about the acceptability and effectiveness (in supporting solving, discussion of techniques and in diagnosing errors) of explicit heuristics made available to the student.

Reason for using small scale in-depth studies
It was decided to attempt to obtain the above information using a formative in-depth method of observing the use of the system by a small number of paid volunteers. This was for the following reasons:

1) EPIC is not a complete ILE but rather embodies some techniques that were to be investigated. Consequently, using the system in a real learning environment leads to problems with students making errors that are beyond EPIC's ability to cope with, but that a larger system employing the same techniques would be able to deal with. We were interested in the potential for such techniques as much as their effectiveness as they happened to have been actually implemented.

2) Students of Propositional Calculus were known to have a variety of misconceptions and levels of ability. Their competency could also vary within a session, either improving as a result of learning or degrading as a result of cognitive overload. Given a limited supply of volunteers and very idiosyncratic performance, statistical evidence would be of dubious value.

3) The interface was known to have a substantial influence on the ILE's effectiveness which could swamp other variables.

4) The information to be obtained was to inform development and improvement of intermediate representations. The question the evaluation was to answer was not an overall performance question, such as "Do more students get more right answers if an instantiator is provided than if not?". Rather it was at the level of individual actions within a problem: "Under which circumstances did the instantiator do well and under which did it do badly?"

5) There were not enough resources available for a large-scale controlled experiment.

The methodology used was somewhat like Suchman's (1987). Like Suchman we were interested in diagnosing the intentions and especially the understanding of the students. This was in order to compare diagnoses made by a human observer using natural language and other channels available in human-human communication (body language, tone of voice etc) with those available to a system - both those actually implemented in EPIC and those which could feasibly be added to EPIC. One of the claims for explicit intermediate representations is their ability to improve diagnosis without much complexity. The observations were to assess the limits of this diagnosis, both actual and potential. Note that this is a form of engineering-style evaluation; assessing whether the system produces (or could produce) the expected output for a certain unpredictable input. It is a separate issue from whether the output produced has the desired effect on the student.
 
Issues in running a session
Each session with EPIC lasted approximately an hour. Considerable care was taken in preparing students for the session. (Note that though these studies are described as informal, they do require care in execution. Preparation of subjects is important. It is possible, indeed essential, to take time over this as only a limited number of subjects are used.) It was explained that the purpose of the study was to discover the kinds of mistakes that people made and that the chief interest of the experimenter was in finding out how people coped with the system and why mistakes were made. The aim of this explanation was to engender a positive and non-judgemental attitude to errors to facilitate subsequent questioning about errors as they arose. The students were encouraged to articulate their thinking. They were warned that if they were silent for long periods they would be interrupted and asked to state their thoughts. It was stressed that just because the experimenter may ask why a student was doing a particular step, this did not necessarily mean that it was wrong. This was important because a frequent pedagogic technique in one-on-one tutoring is to ask just such a question to imply that a student has indeed made a mistake. It is often used when it is suspected that a slip has been made and that merely directing the student to reconsider her reasoning will bring the error to light.
 
Unexpected Discoveries: Masked Misconceptions
One of the advantages of an in-depth study is that in addition to permitting detailed observation of the features that one set out to investigate, it can also reveal quite unexpected aspects of the learning process. It is important to discover these aspects not merely in order to add to our understanding of human learning, but also because they can have a direct impact on students' interactions and learning activities with the ILE being investigated. These aspects can be quite subtle and so only an in-depth study is likely to reveal them.

For example, by encouraging articulation of beliefs and focussing on the understanding of a few students, the EPIC study revealed that in certain cases, students may be able to competently perform simple exercises in a domain while still having major misconceptions about that domain, that only become apparent when they tackle a harder problem (one that cannot be solved by their non-standard techniques). This was observed at three levels of problem solving analysis:

1) Misconceptions about how to apply the rules of derivation.

This was unexpected. From conversations with the lecturer and observations of students, it had been assumed that students understood these. In fact they often had great difficulties (which they tried to hide from human tutors) with the less frequently used rules.

2) Misconceptions about plans and goals.

There could be substantially buggy techniques here that still managed to rapidly yield the optimal proof for very simple problems. On a slightly more difficult problem, however, the buggy technique was quite useless, leading to bewilderment on the part of the student and aimless searching of the solution space.

3) Misconceptions about the whole nature and purpose of proof.

These could lead to students happily breaking fundamental rules in circumstances where they could think of no simple solution method.

The masking of the misconceptions was due to some students having rules that provided correct answers for simple problems but not for more complex ones. As a result of the preparation with the students, they were more candid about their lack of knowledge when using EPIC than when observed in tutorials. Their performance on the earlier, simpler problems would, in the absence of other information, lead an observer (including any student modelling system) to attribute considerable competence to such students. Their action steps in these simple problems were identical to those of an expert. In fact they would admit that they did not know why they were doing the steps, other than following a hunch or a memorised 'recipe'.

Conventionally, the straightforward questions that may be liable to non-standard and non-generalisable solution methods are presented first in an exercise. This is only to be expected and serves the purpose of boosting a student's confidence before progressing to more difficult problems. As a consequence the more challenging problems that can reveal such misconceptions are posed last in an exercise, and are generally few in number. They are frequently used to serve an additional purpose to that of testing deep understanding: that of keeping more able students occupied and interested while teacher time can be devoted to less able students. Although this is a laudable aim in terms of classroom management, it can have the unfortunate consequence of lowering the significance ascribed to such questions so that in the extreme they are regarded as optional. A student who got 7 out of 10 in an exercise might be deemed to be quite competent in the domain, perhaps just making a couple of trivial slips, but if she got the first 7 questions right and the last 3 hard questions wrong, she may still have substantial misunderstandings. A situation such as this has been observed in elementary algebra (Kuchemann 1983). The general lesson from this observation is that student models need to be conservative in inferring understanding.
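The 7 out of 10 example can be made concrete with a minimal sketch in Python. It is an illustration only, not a description of any student model in EPIC. It assumes each question is tagged as easy or hard, and it uses a hypothetical mastery threshold of 0.7; the function names are likewise invented for the example.

```python
from typing import Dict, List, Tuple

# Each result is (is_hard, correct). The tagging of questions as hard
# or easy is an assumption made for this illustration.
Result = Tuple[bool, bool]


def overall_score(results: List[Result]) -> float:
    """Fraction of all questions answered correctly."""
    if not results:
        raise ValueError("no results to score")
    return sum(1 for _, correct in results if correct) / len(results)


def score_by_difficulty(results: List[Result]) -> Dict[str, Tuple[int, int]]:
    """Map each difficulty band to (number right, number attempted)."""
    summary: Dict[str, Tuple[int, int]] = {}
    for is_hard, correct in results:
        band = "hard" if is_hard else "easy"
        right, total = summary.get(band, (0, 0))
        summary[band] = (right + int(correct), total + 1)
    return summary


def conservative_mastery(results: List[Result], threshold: float = 0.7) -> bool:
    """Infer mastery only if every band reaches the (hypothetical) threshold."""
    if not results:
        return False  # no evidence, so do not infer understanding
    bands = score_by_difficulty(results)
    return all(right / total >= threshold for right, total in bands.values())


# First 7 (easy) questions right, last 3 (hard) questions wrong.
results = [(False, True)] * 7 + [(True, False)] * 3
print(overall_score(results))         # 7 of 10 correct overall
print(score_by_difficulty(results))   # {'easy': (7, 7), 'hard': (0, 3)}
print(conservative_mastery(results))  # False
```

The overall fraction alone would suggest competence, whereas the breakdown by band shows that every hard question was failed. A model that is conservative in this sense withholds the inference of understanding until the harder questions support it.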

This issue is significant in determining the meaning of the numerical results of summative experimental evaluations. Overall percentage results are liable to obscure this effect. It is a problem that pervades much of our educational system. For example, in British Universities a First Class overall degree result requires getting a mark at or around 70%. In scientific disciplines (the conventions for the marking of arts/humanities essays may mean the argument does not apply there) the widespread assumption that 70% implies mastery can mean that students could have substantial misconceptions about parts of the course and still be considered outstanding. This is particularly unfortunate where one course is a prerequisite to another. The tutor of the more advanced course may end up spending a substantial proportion of the teaching time re-teaching the prerequisite knowledge.
 

Conclusion

Although there may be considerable pressure to evaluate an ILE using controlled experiments, researchers need to be aware of the problems that can arise with such techniques. Their frequent use as a measure of overall effectiveness can obscure significant variation in the ILE's effectiveness for individual learning events. A particular problem is that if a result is obtained that runs counter to the intuitions or expectations of the reader, she may all too easily discount it by criticising the method of the experiment or features of the ILE. There are many situations when an informal small scale ethnographic in-depth study of the system may be more appropriate either as an alternative to controlled experiments or as a supplement to them. The challenge is to make such studies acceptable as a way of providing meaningful evidence of the features described while retaining the technique's flexibility and low cost. Nevertheless there will remain situations when controlled summative experiments are the best means of evaluation. In such circumstances, the potential problems that this paper has outlined should be taken account of in designing the experiment.

The main aim of this paper has been to advocate the frequent use of informal evaluation during development of any ILE. The field of ILE research is still quite new and there is much to be learned by simple observation. Also the provision of more sophisticated ILEs can lead to significant changes in how learners behave that may be impossible to predict. As work proceeds on the development of more sophisticated user interfaces, there is a trend to allow the user to articulate more of her beliefs than was possible in earlier systems (and sometimes more than is usual in human-based tutoring). Much needs to be learnt about this and its effect on the learning process as a whole. We can even hope that an ILE can serve as a tool in its own right to support the study of human learning. As such, an ILE can serve as a 'cognitive microscope', making visible activities occurring during learning that are normally invisible because they occur in the student's mind and usually leave a minimal external trace. The informal observations advocated here would then be analogous to the observations and discoveries of the first users of optical microscopes.
 

Acknowledgements

The systems developed and studied that are described in this paper were funded by the Science and Engineering Research Council and Logica Cambridge as part of a CASE award and the Commission of the European Community as part of the DELTA initiative, projects SIMULATE & NAT*LAB.

The research for this paper is supported by a Science and Engineering Research Council Junior Research Fellowship. The author would like to thank the following people who commented on earlier drafts of this paper: Dave Nichols, Michael Pengelly, John Self and Gary Spiers.

References

Burton, R.R. & Brown, J.S. (1982). An investigation of computer coaching for informal learning activities. In Sleeman, D.H. and Brown, J.S. (Eds.) Intelligent Tutoring Systems. Academic Press, London. 79-98.

Browne, D., Totterdell, P., Norman, M. (Eds) (1990). Adaptive User Interfaces. Academic Press, London.

Carroll, J.M. & Campbell, R.L. (1986). Softening up hard science: reply to Newell and Card. Human-Computer Interaction, 2 (2) 227-49.

Clancey, W.J. (1992). Guidon-Manage revisited: a socio-technical systems approach. In Frasson, C., Gauthier, G. & McCalla, G.I. (Eds.) Intelligent Tutoring Systems. Lecture Notes in Computer Science No. 608. 21-36.

Close, F.E. (1991). Too hot to handle: the race for cold fusion. Princeton U.P.

Corbett, A.T., Anderson, J.R. & Fincham, J.M. (1991). Menu selection v typing: effects on learning in an intelligent programming tutor. In Birnbaum, L. (Ed.) Proceedings, The International Conference on the Learning Sciences, Evanston, Illinois. 107-112. AACE, Charlottesville, VA.

Dillenbourg, P. & Self, J. (1992). PeoplePower: a human-computer collaborative learning system. In Frasson, C., Gauthier, G. & McCalla, G.I. (Eds.) Intelligent Tutoring Systems. Lecture Notes in Computer Science No. 608. Springer-Verlag, Berlin. 651-660.

Ehn, P. (1989). Work-oriented design of computer artifacts. Hillsdale, NJ: Lawrence Erlbaum Associates.

Hammersley, M. & Atkinson, P. (1983). Ethnography: principles in practice. Routledge, London.

Hijne, H., & Berkum, J.V. (1992). A functional architecture for intelligent simulation learning environments. In: Cerri, S.A. & Whiting, J. (Eds.) Learning Technology in the European Communities. Kluwer, Dordrecht. 91-108.

Kuchemann, D. (1983). Quantitative and formal methods for solving equations. Mathematics in School, 12(5) 17-19.

Lantz, K.E. (1986). The Prototyping Methodology. Prentice Hall, Englewood Cliffs New Jersey.

Legree, P.J. & Gillis, P.D. (1991). Product effectiveness evaluation criteria for intelligent tutoring systems. Journal of Computer-Based Instruction 18, 57-62.

Lemmon, E.J. (1965). Beginning Logic, London: Nelson.

Mackay, W.E. (1988). Tutoring information databases and iterative design. In Jonassen, D.H. (Ed.) Instructional Design for Microcomputer Courseware. Hillsdale, NJ: Lawrence Erlbaum. 327-44.

Mark, M.A. & Greer, J.E. (1991). The VCR tutor: evaluating instructional effectiveness. Proceedings of the 13th Annual Conference of the Cognitive Science Society. pp564-569.

Mark, M.A. & Greer, J.E. (1993). Evaluation methodologies for intelligent tutoring systems. Journal of AI and Education (this volume).

McDaniel, M.A. & Schlager M.S. (1990). Discovery learning and transfer of problem-solving skills. Cognition and Instruction 7 (2), 129-159.

McKevitt, P. (1990). Acquiring User Models for Natural Language Dialogue Systems through Wizard-of-Oz Techniques. Proc. of the Second International Workshop on User Modeling, Honolulu, HI, 1-13.

Newell, A. & Card, S.K. (1985). The prospects for psychological science in human-computer interaction. Human-Computer Interaction, 1 (3) 209-42.

Newell, A. & Card, S.K. (1986). Straightening out softening up: response to Carroll and Campbell. Human-Computer Interaction, 2 (3) 251-67.

Sandberg, J., Winkels, R. & Breuker, J. (1988). Knowledge Acquisition for Intelligent Tutoring Systems. Procs. 2nd European Knowledge Acquisition for Knowledge Based Systems Workshop, Bonn.

Shute, V.J. & Regian, J.W. (1993). Principles for evaluating intelligent tutoring systems. Journal of AI and Education (this volume).

Sommerville, I. (1992). Software Engineering. Fourth Edition. Addison Wesley, Wokingham.

Suchman, L.A. (1987). Plans and situated actions. Cambridge University Press.

Thimbleby, H. (1990). User Interface Design. ACM Press, New York.

Toulmin, S. (1972). Human Understanding. Clarendon Press, Oxford.

Twidale, M. B. (1989). The use of explicit intermediate representations in intelligent tutoring systems. Unpublished PhD. thesis. Lancaster University.

Twidale, M. B. (1990). Knowledge Acquisition for Intelligent Tutoring Systems. Proceedings, Cognitive modelling and interactive environments NATO workshop, Eindhoven.

Twidale, M.B. (1991). Cognitive agoraphobia and dilettantism: issues for reactive learning environments. In Birnbaum, L. (Ed.) Proceedings, The International Conference on the Learning Sciences, Evanston, Illinois. 406-413. AACE, Charlottesville, VA.

Twidale, M., Pengelly, M., Chanier, T., and Self, J. (1992). Experiments on knowledge acquisition for learner modelling. In: Cerri, S.A. & Whiting, J. (Eds.) Learning Technology in the European Communities. Kluwer, Dordrecht. 355-368.

Twidale, M. B. (1992a). Improving error diagnosis using intermediate representations. Instructional Science (in press).

Twidale, M. B. (1992b). Student activity in an Intelligent Learning Environment, in Nwana, H. S. (Ed.) Mathematical Intelligent Learning Environments. Intellect (in press).

Wilson, T.P. (1971). Normative and interpretive paradigms in sociology. In Douglas, J.D. (ed.) Understanding Everyday Life. Routledge, London.