Tuesday, April 15, 2008

Cautions about web-based testing and grading

1. Outline

This note describes hazards and errors in the evaluation of web- and computer-based educational materials. Brief summaries follow:

Section 2: Preconceptions

The only certainty about the future of education is that it will be different from the past. Preconceptions are more likely to mislead than help, and in particular traditional classroom procedures should not be taken uncritically as models for web materials.

This is illustrated with a discussion of homework and class attendance.

Section 3: Data and statistics

Sophisticated statistical methods lull us into believing that what we want to know can be extracted from the data we collect, or at least that we can test the extent to which this is true. However, this seriously underestimates the complexity of real students and the range of motivations and behaviors invisible to our data.

This is illustrated by a discussion of sub-populations with different characteristics, and assessment of practice work.

Section 4: Resource constraints

Most educational systems are highly resource-constrained. Methods that require more resources cannot be widely adopted no matter what their virtues, and consequently assessments that ignore resource requirements have little relation to real-life potential.

This is illustrated with computer-based precalculus courses that are no better in educational outcomes but are more efficient and release resources that profoundly benefit the rest of the program.


2. Preconceptions

It is widely believed that homework is beneficial. The study Webwork effectiveness in Rutgers calculus by Chuck Weibel and Lew Hirsch (2002) is cited as showing this is true for web-based assignments. In this study a number of sections were involved, some using the system and others not, with a common final exam. Careful statistics showed a correlation between doing well on the homework and doing well on the final, and this correlation was strongest for freshmen.

The WebWork homework system provides a static set of exercises that are generally not challenging, and students can keep working and checking scores until they get a perfect score. These are nearly free points; why would any student not take advantage of this, particularly if there is no mechanism to prevent them from getting help? Actually it is only free if you count student time as worthless. Most students value their time and the fact that most won't do the work suggests they see the benefit and credit as not worth the time. Freshmen may be more tolerant because boring homework and point accumulation are facts of life in high school.

Does good homework cause good exam performance? Or do they correlate (roughly) because students with high grade goals both learn the material and are willing to do the busywork? The study design and analysis do not distinguish these possibilities. The study's conclusion that WebWork is effective holds only if causality is assumed.
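The second possibility (a common cause behind both homework and exam scores) can be illustrated with a small simulation. This is only a sketch under invented assumptions: the hidden variable `motivation`, the noise levels, and the sample size are hypothetical and are not taken from the Rutgers study. In this model homework has no effect at all on the exam, yet the two scores still correlate.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_students=2000, seed=0):
    """Hypothetical model: one hidden trait drives both scores separately."""
    rng = random.Random(seed)
    homework, exam = [], []
    for _ in range(n_students):
        motivation = rng.gauss(0.0, 1.0)  # hidden: never recorded in the data
        homework.append(motivation + rng.gauss(0.0, 1.0))
        exam.append(motivation + rng.gauss(0.0, 1.0))  # no homework term here
    return homework, exam


homework, exam = simulate()
r = pearson(homework, exam)
print(f"correlation between homework and exam: {r:.2f}")
```

With these assumed variances the correlation comes out near 0.5, even though by construction doing more homework changes nothing about the exam score. Correlation data alone cannot tell this model apart from one in which homework really helps.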

Another possible conclusion: most students do not seem to feel WebWork exercises - even with guaranteed credit - are worth their time. Is there some other approach to practice work that they would see as more beneficial? Or in other words, could we make better use of their time?

Data from a course with opportunities for non-credit practice but no homework at all in the standard sense leads me to believe this second interpretation is more valid. In this case more emphasis on WebWork would probably be counterproductive. In any case the study depends on an unexamined preconception.

Another preconception is that class attendance is beneficial. Non-attendance penalties are often used to generate a correlation with grades but actual benefit is largely untested. I had several hundred students go through a computer-tested course with supporting web materials. They were told they should come to class if they found it useful but attendance was optional. For analysis they were put in three groups, with the first group attending fewer than ten percent of class meetings. Grade distributions were virtually identical. Note that comparing distributions is much more sensitive than comparing averages, which may hide balancing increases in high and low grades.
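The point about averages can be seen in a toy example. The grade lists below are invented purely for illustration and are not the course data: the two groups have the same average, but one has more high and more low grades, which only the distribution reveals.

```python
from collections import Counter

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def mean_points(grades):
    """Average grade points for a list of letter grades."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)


def distribution(grades):
    """Fraction of students receiving each letter grade."""
    counts = Counter(grades)
    return {g: counts.get(g, 0) / len(grades) for g in GRADE_POINTS}


# Hypothetical groups, made up for this sketch only.
group_1 = ["B"] * 2 + ["C"] * 6 + ["D"] * 2
group_2 = ["A"] * 3 + ["C"] * 4 + ["F"] * 3

print("averages:", mean_points(group_1), mean_points(group_2))
print("group 1:", distribution(group_1))
print("group 2:", distribution(group_2))
```

Both averages are 2.0, so a comparison of averages would report no difference, while the distributions differ sharply. Finding nearly identical distributions, as in the attendance data, is therefore a much stronger statement than finding nearly identical averages.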

Cautious conclusions: in this college calculus course some students found classroom contact beneficial, some didn't need it and didn't need to be forced to come. In this course, at least, the preconception that everyone should come to class was not justified.

A final preconception: students entering courses should be filtered (via placement tests) to ensure they have the necessary background and skills. In fact student skills are somewhat plastic, and an alternative approach is a "gateway" test. Students who do not pass immediately have a week or so in which to get up to speed and retry the test. Gateways are tougher than placement tests and the objectives are positive (ensure good skills) rather than negative (weed out likely failures).


3. Data and Statistics

Statistical analysis destroys most of the information in a data set. This is a good thing if all important parameters are measured and the information destroyed is irrelevant "noise". Unfortunately this is rarely the case: parameters not measured are presumed to be unimportant because they are invisible, and information destroyed is defined to be noise. It is hard to show that something important is missing but there are some illustrations.

In the homework study referenced above they report:

"It quickly became apparent that three sub-populations of students had different profiles. First-year students made up the largest sub-population ..... 'non-repeaters,' consisted of upper-class students who were taking calculus for the first time. The third sub-population consisted of students who were repeating the course.. "

Many investigators, particularly with smaller sample sizes, would not have made these distinctions and to the extent that different sub-populations need different treatment they would have reached faulty conclusions. This was a non-science calculus course. In a science and engineering calculus course I found an additional distinct sub-population: freshmen out of sequence because they got advanced placement. In this example sub-populations that would usually be overlooked could be identified using the data at hand and we can compare the results of using or discarding the information.
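A toy calculation shows how pooling can mislead. The records below are hypothetical, invented for this sketch and not drawn from either study; only the sub-population labels echo the ones quoted above. The apparent effect of using the system is positive in one group and negative in the other, and the two cancel when the groups are pooled.

```python
from statistics import mean

# Hypothetical records: (sub_population, used_system, exam_score).
records = [
    ("first-year", True, 80), ("first-year", True, 84),
    ("first-year", False, 70), ("first-year", False, 74),
    ("repeater", True, 58), ("repeater", True, 62),
    ("repeater", False, 68), ("repeater", False, 72),
]


def effect(rows):
    """Mean score of users minus mean score of non-users."""
    users = [score for _, used, score in rows if used]
    others = [score for _, used, score in rows if not used]
    return mean(users) - mean(others)


pooled = effect(records)
by_group = {
    group: effect([r for r in records if r[0] == group])
    for group in sorted({r[0] for r in records})
}

print("pooled difference:", pooled)
print("by sub-population:", by_group)
```

In this made-up data the pooled difference is zero while the per-group differences are +10 and -10. An analysis that never split the population would conclude that nothing is happening.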

In other cases there is little or no data. For example in a course with tests that could be taken multiple times for credit the strongest correlate with performance seemed to be procrastination! There seemed to be a sub-population (about ten percent) who could wait until the last minute and do fine. Either they knew the material very well or they worked best under pressure. Outside this population there seemed to be an additive negative consequence for putting off a test to the last minute. This makes sense but was unexpected because it is far outside the things usually measured and it was found by chance. At any rate the conclusion was that developing psychological incentives to take tests early had a very high priority.

Student time is another example of an invisible but vitally important factor. Students instinctively do cost/benefit analysis and are reluctant to do tasks that do not use their time effectively. Sub-populations with time constraints (jobs or important dating) are particularly intolerant. This seems to be impossible to measure even by proxy. I believe that non-credit practice tests with solution hints and links to web reference materials are more efficient and effective than for-credit homework such as WebWork. I believe this because I have watched students use the materials in our computer lab, and talked with them. On the other hand I have records of when and how many practice tests were used, and the scores, and see no signal at all in this data. Scores for instance are typically zero: practice tests are used mainly as study guides and since there is no credit or penalty they don't bother to fill them out. Students who get fifty or more tests are probably looking for patterns or systematic omissions that would let them skip something. I believe that traditional homework is relatively ineffective but - because there is credit attached - there is at least a signal in the data.

The conclusion is that developing new ways of learning is going to be delicate and complicated, and current uses of data and statistics are not likely to be good guides.



4. Resource constraints

At one time it was the policy in my department to have some computer involvement in every class. The involvement ranged from demonstrations to labs. Computer consoles and projectors were put in nearly all classrooms and a few had computers for each student. It quickly became clear that this took a lot of faculty time, and not just learning curves and development of materials, but additional time every time the facilities were used. There was no possibility of compensating by reducing teaching loads or class size. We did not get a sense of whether computer work was effective but this didn't matter: the additional workload was unsustainable.

I have yet to see an educational study that even acknowledged resource constraints. The words "... this study was supported by a grant ..." in most cases mean "... these results were obtained using unrealistic and unsustainable resources. Even if correct they are probably not relevant to real life."

The urgent questions are:

- Can we get better outcomes with the same resources?

- Can we get similar outcomes with fewer resources?

It is essential that "resources" include faculty time. Ideally it would also include student time. Add-ons for instance are never cost-effective unless something is omitted to balance time requirements.

In the current system the focus is on outcomes and resources are disregarded. An answer to the second question would be evaluated negatively. Ironically this is probably where the greatest short-term potential is.

In my department precalculus courses are now computer-based and more than 10,000 students a year take these courses. There are no teachers in the traditional sense but there is a large and vitally important help staff so students can get more - and more timely - personal help than is possible in a traditional course. I don't believe educational outcomes are better but the system is much more efficient. The cost per student is significantly less than traditional classes with adjuncts or graduate students, and well under half the cost of professorial-rank faculty.

During the same period math enrollments have risen and budget cuts have forced a significant reduction in math faculty size. Simple arithmetic shows that if we were still teaching traditional classes then class sizes would be higher, low-enrollment graduate courses would have to be curtailed, and most of the faculty would have higher teaching loads. Instead - thanks to the computer-based course efficiency - advanced undergraduate classes and the graduate program have not been affected and teaching loads for some research faculty have actually been reduced.
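The kind of arithmetic involved can be sketched as follows. Every number here is a hypothetical placeholder chosen to make the calculation easy to follow, not departmental data, and the function `sections_per_faculty` is an invented helper for illustration.

```python
def sections_per_faculty(students, class_size, faculty):
    """Average number of sections each faculty member must teach."""
    sections = -(-students // class_size)  # ceiling division
    return sections / faculty


# Hypothetical starting point.
before = sections_per_faculty(students=12000, class_size=40, faculty=100)

# Enrollment up and faculty down, with everything still taught traditionally.
traditional = sections_per_faculty(students=14000, class_size=40, faculty=80)

# Same change, but 5000 of the students take a computer-based course
# supported by a help staff instead of by faculty-taught sections.
computer_based = sections_per_faculty(
    students=14000 - 5000, class_size=40, faculty=80
)

print(before, traditional, computer_based)
```

With these placeholder figures the load per faculty member rises from 3.0 to 4.375 sections if everything stays traditional, and falls to 2.8125 if the computer-based course absorbs part of the enrollment. The sketch ignores the cost of the help staff, which would need to be counted in a full comparison.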

The conclusion is that the computer-education effort has been enormously beneficial, but this can only be seen in the context of the entire departmental mission.



Please see the essays and reports at Frank Quinn's web page for details and additional information.

Friday, October 5, 2007

Online Grading and Testing

It is often said that you don't learn mathematics by watching or listening -- you learn by doing! Most mathematicians agree. For that reason, generations of mathematics teachers have assigned work for students to do outside class, and that work has been a fundamental part of student learning.

Doing that outside work without any feedback, however, can be a frustrating experience for many students, especially those who are already struggling with the material. But in many large universities, with classes of hundreds of students, grading homework or even giving regular tests has become nearly impossible.

In recent years, some mathematicians have promoted a technology-based solution to this problem that could make a huge difference in the way mathematics is taught in universities everywhere, but especially when it is taught in large classes. Software that generates, administers, and grades homework, quizzes and exams is now being used at many institutions (although still only a minority). Packages such as WebWork, WebAssign, and Maple T.A. all provide an environment in which one can generate assignments and grade student work. This enables the instructor to give feedback to large numbers of students and opens up the possibility of switching to mastery exams. For example, the instructor might give a differentiation exam that a student can repeat several times but is required to achieve a desired level of mastery.

Many of those using these packages rave about the results; some complain that the results aren't worth the extra effort needed to set up and administer the system. But whatever the merits of individual packages may be, the maxim that one learns mathematics by doing should surely lead us to find out how best to use this technology to improve the experience of students in mathematics courses.

We want to hear of YOUR experiences with this technology. Does it make a difference? What is the evidence? Also, does the use of such technology result in an increase or decrease in the time commitment required from instructors? What is the best way to use these packages to improve the first-year experience? Please add your comments below.

-AMS First-year Task Force