Saturday, August 15, 2015

How to Fix Common Core: Test Results

I had hoped that by now, the SBAC results for my home state of California would have been released. But the latest I've heard is that the results won't be released until after the first day of school. Of course, this is one thing that many Common Core opponents dislike about the tests -- the fact that students don't get their scores until they've moved on to the next grade.

And of course, I agree wholeheartedly. It defeats the purpose of having a computerized test if the results can't be given in a timely manner. This is why, in my proposed computerized test, the results are given instantly. Any question -- at least in math -- that can't be scored instantaneously (such as a performance task) isn't worth having on the test.

Instead, let me comment on the results for Washington -- another SBAC state. The following page gives a link to the preliminary results for the state of Washington:

http://www.k12.wa.us/Communications/pressreleases2015/PrelimSmarterBalancedResults.aspx

I present the results from the above link as a table so that I can compare the ELA to the math results (given as percent proficient) within each grade:

Third Grade: ELA 53%, Math 57%
Fourth Grade: ELA 55%, Math 54%
Fifth Grade: ELA 58%, Math 48%
Sixth Grade: ELA 55%, Math 46%
Seventh Grade: ELA 58%, Math 49%
Eighth Grade: ELA 58%, Math 48%
High School: ELA 62%, Math 29%

We notice that only in third grade are more students proficient in math than in ELA -- in all higher grades, more students are proficient in ELA than in math. Indeed, we observe that as the grade level increases, so does the gap between the ELA and math percentages -- from one percentage point in fourth grade, to 9-10 points through middle school, to a whopping 33 points in high school.
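As a quick check, the gaps can be computed directly from the table above (a minimal sketch; the percentages are copied from the Washington results as reported):

```python
# ELA and math percent-proficient, from the Washington SBAC table above.
results = {
    "Grade 3":     (53, 57),
    "Grade 4":     (55, 54),
    "Grade 5":     (58, 48),
    "Grade 6":     (55, 46),
    "Grade 7":     (58, 49),
    "Grade 8":     (58, 48),
    "High School": (62, 29),
}

# Positive gap = more students proficient in ELA than in math.
for grade, (ela, math_pct) in results.items():
    print(f"{grade}: ELA - Math = {ela - math_pct:+d} points")
```

Running this shows the gap is negative only in third grade, then widens with grade level, ending at +33 points in high school.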

The reason for this is obvious -- in third grade, the math is easy arithmetic while many students are still learning to read well. As the students get older, they learn to read and their ELA scores rise, while the math becomes more difficult. That ten-point gap begins in fifth grade -- the grade most associated with the study of fractions -- and the gap becomes a chasm in high school, when algebra appears on the test. It's easy to predict that once the California scores are finally released, they will follow the same pattern.

Indeed, back in November, the SBAC released predictions of how many students would score proficient in each grade level. It's uncertain whether this prediction refers to all SBAC states or to California only (but the source is a California website):

http://edsource.org/2014/under-half-of-students-projected-to-test-well/70227

Here are the percentages of students predicted to score proficient -- that is, a 3 or 4 (since although there is a four-point scale, only two scores actually matter, proficient and deficient):

Third Grade: ELA 38%, Math 39%
Fourth Grade: ELA 41%, Math 37%
Fifth Grade: ELA 44%, Math 33%
Sixth Grade: ELA 41%, Math 33%
Seventh Grade: ELA 38%, Math 33%
Eighth Grade: ELA 41%, Math 32%
High School: ELA 41%, Math 33%

Some people don't trust these predictions. The problem is in determining how many questions a student needs to answer correctly to earn a reported score of 1, 2, 3, or 4. Some believe that the whole scoring system is a conspiracy -- for example, that the high school math cut scores are assigned so that the top 11% of students get a 4, the next 30% get a 3, and so on. Then, by definition, the actual scores match the predicted scores! And, the conspiracy theory continues, in subsequent years the cut scores will be adjusted so that slightly higher percentages of students earn scores of 3 or 4 -- so that the story can be told that Common Core is responsible for the increase. The conspiracy theorists believe that SBAC wants to make Common Core appear more effective than it actually is.

Here's my problem with this theory -- such conspiracy theories are often made by people who don't understand the predictive power of mathematics, especially statistics. For example, the statistician Nate Silver made a famous prediction in 2012 -- he accurately predicted the winner of the presidential election in all 50 states. He accomplished this by using recent polling data and statistics to make an educated guess in the ten or so swing states, as the winner in the other 40 safe states was already a foregone conclusion. Still, a pure guesser would have only a 1 in 1024 chance of calling all ten swing states correctly, which makes Silver's prediction all the more impressive.
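That 1-in-1024 figure is just the chance of ten independent 50-50 guesses all coming up right, which we can verify with exact rational arithmetic:

```python
from fractions import Fraction

# Ten swing states, each a coin flip for a pure guesser:
# the chance of getting all ten right is (1/2)^10.
p_all_ten = Fraction(1, 2) ** 10
print(p_all_ten)  # 1/1024
```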

But here's the thing -- suppose that, instead of working as a media correspondent, Silver were actually the sitting president in 2012, trying to predict his own chances of reelection. Now suppose that President Silver gave the correct election results for all 50 states. Of course, people would quickly suspect that the president cheated -- he didn't actually count the votes in any states, but simply had the media report his predictions as being the actual results. Clearly, our democracy would be in peril if a sitting president projected his own reelection chances -- no matter how mathematically accurate said predictions might be.

And so it is with the Common Core scores. Just as it is a conflict of interest for a sitting president to calculate his own reelection probabilities, the conspiracy theorists consider it a conflict of interest for an agency that has much to gain from the success of Common Core -- such as SBAC -- to predict the scores of its own test. They feel that the purpose of the projections is to hide the possibility that Common Core may be failing -- since if enough people concluded that Common Core is bad, it would obviously hurt the SBAC.

It's difficult to come up with a good scoring system that people will see as trustworthy. Based on what the traditionalists wrote in my last "How to Fix Common Core" post, they would be more likely to trust the scores if proficiency corresponded to a raw score of 70% -- that is, 70% of the questions answered correctly, since this corresponds to a grade of C. We've seen that it doesn't matter how difficult the questions are -- if students with raw scores below 70% are being told that they are proficient, the traditionalists immediately question the accuracy of the scoring.

On the SBAC, we see that the raw score isn't converted directly to a score of 1, 2, 3, or 4. Instead, the raw score is converted to a scale score from 2000 to 3000, and then that scale score is converted to one of the four bins listed above. At the EdSource link that I gave earlier in this post, a commenter wondered why there is a 2000-to-3000 scale, and a poster going by the username "FloydThursby1941" sharply criticized it.

Now the name "Floyd Thursby" is most likely a pseudonym because it is the name of a character in Dashiell Hammett's The Maltese Falcon. But based on his comments in this thread, Thursby has views similar to those of the traditionalists (like Dr. Katharine Beals, for example) -- enough for me to consider Thursby to be a traditionalist himself.

But as we see in the following comment, Thursby doesn't even like the 200-to-800 scale of the SAT, a test that most traditionalists hold in high regard:

It’s to appease extreme liberals. It’s like grade inflation. Tests tell more truth than grades due to this. It’s designed to make people doing terribly feel they are almost doing as well as those doing well. Let’s give everyone a trophy. My kid’s graduating, yay, they read at a 7th grade level and won’t hold up in college and they watched 40 hours a week TV and played video games and rarely opened a book, but they’re graduating, I feel good, let’s celebrate, yay me! It’s like the free 400 SAT Points. We might as well concoct a test that goes up to 100,000 and has a minimum score of 99,000. Total goofball loser kids will get 99,050 and kids who work super hard will get 99,980. Almost the same right? Then when one makes 200k or more and another makes minimum wage, we can say both are middle class. Yay me! Let’s all feel good in the face of disturbing realities.

Of course, that 99,000-to-100,000 scale is an exaggeration. But the point being made is that Thursby won't consider a test score to be reliable unless the lowest possible score on the test is zero. Any test whose lowest score is not zero is, in Thursby's eyes, unreliable and akin to "grade inflation."

Back in January, when I first came up with my own proposed scoring system, I explained why the lowest score on the SAT is 200 -- 500 is the mean and 100 is the standard deviation, so a score of 200 represents three standard deviations below the mean. And since the number of students who would score more than three standard deviations below (or above) the mean is statistically indistinguishable from zero, the SAT simply uses a scale of 200-to-800. If we had used a senary (or base 6) system rather than decimal (or base 10), the SAT would have been on a scale of 0 to 1000 -- 300 would be the mean and 100 the standard deviation, and so zero would be three sigmas below the mean.
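Assuming SAT scores are roughly normal with mean 500 and standard deviation 100, we can check how few test-takers would fall below 200 (three sigmas down) using only the standard library:

```python
from math import erf, sqrt

def normal_cdf(x, mu=500.0, sigma=100.0):
    """P(X <= x) for a normal distribution with the given mean and SD."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Three standard deviations below the mean:
p_below_200 = normal_cdf(200)
print(f"{p_below_200:.5f}")  # about 0.00135, i.e. roughly 0.1% of test-takers
```

In other words, the fraction of students excluded by cutting the scale off at 200 is about a tenth of a percent -- statistically indistinguishable from zero, as stated above.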

Thursby mentions "grade inflation" in this comment. Of course, "grade inflation" can mean many different things, but in the context of this comment, I assume that Thursby is referring to the practice at some schools of making the lowest possible grade 50% rather than zero. I mentioned this on the blog back in October -- the purpose of making 50% the lowest grade is to protect students from being mathematically eliminated from passing the class with weeks left in the semester. A student with a grade of 18% at the quarter could study 40 hours a week -- even 100 hours a week -- and earn 100% on every test and assignment left in the semester, yet finish with a semester grade of only 59%. In that case, why should the student bother to study 40 hours a week for those last nine weeks if the only letter that can possibly appear on the report card is F? On the other hand, if we change that 18% to 50%, suddenly those 40 hours of studying are meaningful. Now the student can work to earn a D, or possibly even a C, because those grades are now mathematically possible.
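The arithmetic behind that 50% floor, assuming the first quarter and the rest of the semester weigh equally in the semester grade, is a small sketch:

```python
def semester_grade(first_quarter, rest_of_semester, floor=0):
    """Average the two halves of the semester, applying a minimum-grade floor
    to the first-quarter grade."""
    return (max(first_quarter, floor) + rest_of_semester) / 2

# A student at 18% at the quarter who then earns 100% on everything remaining:
print(semester_grade(18, 100))            # 59.0 -- still an F, effort wasted
print(semester_grade(18, 100, floor=50))  # 75.0 -- a C is now reachable
```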

On my proposed test, students start at a base score of 300 in the third grade. A score of zero is unlikely, since a student would have to get 300 questions wrong in 30 minutes to reach zero, and it's unlikely that a student would even answer that many. A score of zero would correspond to the achievement of a kindergartner on the first day of school. If we really wanted a score of zero to be possible, we would either have to have kindergartners take the test (or first graders, since a first grader could conceivably get 100 questions wrong in 30 minutes), or set up the below grade-level questions so that students lose more than one point for incorrect answers. Actually, this might not be a bad idea.

Looking at the rest of the comment thread, I notice a few other things that Thursby writes that are typical of many other traditionalists. Thursby writes:

And by the way, if my kid gets 99,001 and yours get’s 99,999 on the test, it is probably just proof that my kid has more spatio-balance and emotional intelligence and is more well-rounded. Your kid still deserves no praise. He’s being shallow to even mention it.

Here "spatio-balance" and "emotional intelligence" are parodies of Howard Gardner's theory of multiple intelligences. (The proper names as defined by Gardner are "visual-spatial" and, most likely, "interpersonal" intelligence.) As Thursby is being sarcastic here, his implication is that only "linguistic" and "logical-mathematical" intelligences are valid intelligences, and that those who lack those intelligences are properly labeled unintelligent, not having some alternate intelligence. Other traditionalists would agree with Thursby here.

Earlier in the thread, the commenter Alan Cook argues that students will perform better in secondary math classes if they can be engaged with more real-life projects. Thursby -- along with most traditionalists -- disagrees here:

Kids, and parents, are at fault some, significantly. Kids don’t study much and immigrant groups prove when you do, you do just fine. Not all subjects are interesting. Kids of strong moral character study hard in every class because they want to be the best. We’re not going to be the best in education if kids should only study if teaching is perfect and the subject is interesting. It’s great when it is, but how many hours you study is a test of moral character and diligence. Asians prove it can be done, but it takes a sacrifice most American children, and parents, aren’t willing to give. Expecting it to be all fun and games is part of the problem, not part of the solution. A child has poor moral character who studies only if they have Jaime Escalante as their teacher. A diligent and good kid studies hard no matter what, day in, day out, summers, weekends, whatever it takes it’s the priority. These are the facts, and they are undisputed. If they are disputed, they are disputed by those who are failing in our current system. And those who succeed need to be the ones we lionize and tell our kids to emulate, to strive to be like.

Here's the problem -- Cook isn't saying that students "should only study if teaching is perfect and the subject is interesting." Thursby accuses Cook of prescribing when students should study (i.e., only when the subject is made interesting), when in reality Cook is describing when they actually study. For many teenagers, entertainment is more important than almost anything else, and so if a subject isn't entertaining, they won't study. That the students may be wrong to believe so (or, in Thursby's words, that they have "poor moral character") doesn't change the fact that they do believe so. Simply telling students that they have poor moral character won't change their character -- indeed, it may put them on the defensive and make them less likely to change.

So the solution is not the prescription that students should study more even if the math is boring. The solution is to reach the students who fit the description -- those who won't study if they consider math boring -- by making math less boring. This is what Cook suggests. This is what the "three-act" activities posted all over the Internet are for. This is what I do on my blog -- I post activities, especially at times when students are the least receptive to direct instruction (such as first period on Monday, sixth period on Friday, the last school day before Thanksgiving, and so on).

Finally, Thursby's last statement above is the dream of math teachers and high-achieving math students alike -- that those who excel at math are viewed as heroes, not nerds. But that is all it is -- a pipe dream. On the blog, I propose that the word "dren" -- nerd spelled backwards -- be used to describe anyone who can't perform third-grade math. But that small proposal won't change the fact that those who succeed in math -- especially algebra and above -- will be viewed as nerds.

(By the way, in an actual classroom, I'd probably avoid calling a student a "dren" directly -- that would be as useless as telling students that they have poor moral character. Instead, I'd indirectly hint that someone who doesn't master a third-grade concept would be a "dren," and that a "dren" is something that a student should avoid being.)

Returning to the Common Core standards, I still wish to find a way to design my proposed tests so that they aren't "one-size-fits-all" -- a frequent complaint about the Common Core tests. I suggested that a computer-adaptive test would ask different students different questions, so that such a test would not be "one-size-fits-all." But this isn't necessarily what the traditionalists mean -- that is, simply making a test computer-adaptive wouldn't avoid that particular criticism.

Many years ago (before Common Core), I once read a proposal that would completely change the way standardized tests are given. Once again, it's on a website that I'd like to link to now, except that I can't find the page today.

Actually, I know exactly the name of the person who made the proposal -- Timothy F. Travis. He once had a webpage called "Six Revolutionary Ideas for the Fifth Millennium." His webpage still exists, but only two of those six ideas are still posted there -- a dozenal (base-12) numbering system and a new calendar system. (Yes, he is a calendar reformer. In fact, "fifth millennium" is what he calls the year 2000, since he comes up with a different year numbering scheme.)

http://www.fivecandles.net/RaenboProject/

But one of the other four ideas that is no longer a part of Travis's Raenbo Project is a "Certificate of Academic Proficiency." If I remember correctly, any student can take a test in any of a number of different subjects. Students who pass receive a Certificate of Academic Proficiency, or CAP. As a CAP shows exactly what a student can do, colleges and employers will be interested in seeing how many CAPs an applicant has.

Some traditionalists make suggestions that sound similar to Travis's CAP proposal. For example, here's a link to an old post from the traditionalist Dr. Katharine Beals:

http://oilf.blogspot.com/2014/03/moving-beyond-grades.html

Grades, let’s face it, are problematic. They are non-transparent, refracting a multitude of factors beyond mere subject-specific mastery. Worse, the more a teacher’s subjective impressions of student grit, creativity, cooperative learning skills, presentation skills, and so-called “higher-level thinking” skills figure into grades, the more those grades distort what many people still assume grades are mostly about—namely, academic achievement. A “B” in biology may mean that you don’t have a complete mastery of biological processes, or it may mean that your poster was sloppy, you didn’t make enough eye contact during your presentation, and that you didn’t get along with your lab partners. 

So why not dispense with grades altogether and replace them with something a little more indicative of what people can actually do? Why not have a list of concrete skills for each school subject, with mastery tests for each one? Following the latest cognitive science research, deliver these mastery tests over time (to both assess and enhance long term recall), set a high bar for mastery (say 95% correct), and allow students to retake these tests as many times as needed (and whenever they choose to). Also allow them to move through the material at their own pace. The upshot, instead of a report card or transcript, should be a list of the skills currently mastered. 

Not all subjects, of course, naturally break down into lists of discrete sub-skills. Product oriented faculties like writing might better be demonstrated with actual products—i.e., student work samples—though, to ensure that they are purely the student’s own work, proctored, in-class samples would be best. 

Another caveat: the skills measured should always reflect what the student can do independently, with any “supports” limited to things like enlarged print or sign language interpretation or keyboards, which enhance access, as opposed to things like simplified texts, movie versions of novels, and word-prediction software, which end up doing for the student a significant part of what’s being measured. 

We see here that Beals was actually complaining about grades, not standardized tests. But the mastery tests that she proposes here sound a lot like Travis's CAP tests. Students can take the CAP tests as often as needed and whenever they choose.

How exactly would the CAPs be scored? Notice that Travis uses base-12 numbering (another one of his six "revolutionary ideas"), and like base 6, base 12 is actually convenient here. We can make the average score 600 and the standard deviation 200. This gives us a scale from 0 to 1000, so that the lowest score is indeed zero, just as Thursby desires.
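The base-12 arithmetic works out neatly: in decimal, a mean of 600 (base 12) is 864 and a standard deviation of 200 (base 12) is 288, so three sigmas below the mean is exactly 0 and three sigmas above is exactly 1728, which is 1000 in base 12. A sketch of the check, writing A for dek and B for brad:

```python
DIGITS = "0123456789AB"  # A = dek (ten), B = brad (eleven)

def to_base12(n):
    """Convert a non-negative decimal integer to a base-12 string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, 12)
        digits.append(DIGITS[r])
    return "".join(reversed(digits))

mean, sd = 864, 288              # 600 and 200 in base 12
print(to_base12(mean - 3 * sd))  # 0    -- the lowest possible score
print(to_base12(mean + 3 * sd))  # 1000 -- the highest possible score
```

So the 0-to-1000 base-12 scale covers exactly three standard deviations on each side of the mean, just as the SAT's 200-to-800 scale does in decimal.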

But we notice that, in some ways, our CAPs would be more like the SAT II than the SAT I. While the average score on each section of the SAT I is indeed near 500, the average score on any section of the SAT II is greater than 500 -- in some cases much greater. This is because not everyone who takes the SAT I takes every SAT II test -- in general, only students who are good at a subject take the corresponding SAT II test.

Because of this, we can include more complex topics on the test -- the example given in the Beals post mentions simplifying rational expressions in Algebra II -- without worrying about the low scores of students who can't do this. And we can make the score required to pass much higher -- we could make the passing score 900, or even use some of Travis's new base-12 digits: (dek)00 or (brad)00 (where dek = ten and brad = eleven). Beals mentions a passing score of 95% -- on a linear scale from 0 to 1000 in base 12, this would be a score of around (brad)50.

Actually, Beals wants neither letters nor numbers on a transcript. Instead, she wants to see "a list of the skills currently mastered" -- a list of the things students can do. This is exactly what Travis accomplishes with his CAP proposal. In any case, students should get their CAP results much faster than they currently receive their Common Core test results.

This proposal also avoids the "one-size-fits-all" problem. Students aren't forced to take CAPs in Calculus if they have no intention of going into a STEM career. Indeed, there could be CAPs in subjects that have nothing to do with ELA or math. Even certain vocational training could be covered by the CAP program. On the other hand, students who want a STEM career aren't hindered in their ability to take CAPs in Calculus.

The idea suggested by Beals -- that the CAPs replace grades as well as test scores -- also solves another problem. Some traditionalists lament the fact that in some classes (especially in math), upwards of 80-90% of the students lack the basic skills needed to pass. In an accurate grading system, the only grade those 80-90% deserve is an F. In theory, the response to low grades is that it's the students' fault, and that if the students would just study harder (as much as 40 hours a week, according to Thursby), suddenly they'd be getting correct answers and passing grades. But in reality, the natural response of students -- and parents, and administrators -- to such a grading distribution is that the teacher's standards are impossibly high. Parents and administrators ask the teacher to raise the grades.

Now, CAP avoids this problem because there are no grades, only standardized tests scored by someone other than the teacher. It's not that we teachers can't be trusted to grade our students -- it's that having the grader be someone other than the teacher removes the conflict of interest. The natural response of parents, seeing their children not pass the CAP in a subject, is that the teacher should work harder to teach the subject matter of the CAP -- not lower the standards -- because they know that the teacher isn't the grader.

To me, Travis's CAP proposal sounds like an interesting idea. But I'm not sure whether I can fully endorse the CAP program. For one thing, I wonder what an actual school would look like if the Common Core tests were replaced with CAPs. Beals writes, "allow them to move through the material at their own pace." But at a public school, class sizes smaller than 20 aren't economically feasible, so no matter how homogeneous the classes are, the needs of any one student must be weighed against those of the other 19. If only a single student in a class wants to take a certain CAP on a certain day, what do the other 19 students do while the teacher prepares that student for the CAP?

Some readers may ask: a student may choose to avoid Calculus CAPs, but what if a student chooses to avoid a CAP in simple arithmetic? One could require students to take certain basic CAPs in order to graduate. A CAP program truer to Travis's vision would allow complete freedom in which CAPs a student takes -- but then a student who doesn't pass basic CAPs in subjects like reading, writing, and arithmetic will end up not getting many college or job offers.

Another problem I have with CAPs is that they differ too radically from what other nations have, including the nations whose school systems we wish to emulate. I also worry that once students earn a CAP showing that they have mastered a skill, they'll immediately forget it. But I suppose that this is a problem with any grading or testing system.

Thus concludes this post. My next post, on spherical geometry, will be in just a few days.
