In January, Florida governor Rick Scott released rankings of school districts based on scores on Florida's statewide test, the FCAT (McGrory and Isensee, 2012, Miami Herald). Teachers' unions criticized the rankings on the grounds that they do not reflect the relationship between test scores and poverty. Governor Scott was "willing to re-jigger the rankings to reflect those factors," according to Miami Herald reporters Bousquet and McGrory (2012), but, citing the importance of "transparency" (the public's right to know), he released them as they stood.
St. John's County (Saint Augustine), with a per capita income of around $36,000 and with at least 92 percent of its residents holding high school diplomas or better, according to U.S. Census Bureau data, ranked highest. Largely rural Madison County ranked last.
Income and Scores
Zwick and Green (2007) report on the disparity between average SAT reasoning scores for children from families with incomes over $100,000 (professional, wealthy) and those from families with incomes under $10,000 (poverty). The difference, say Zwick and Green, was about one "standard deviation." (A standard deviation is a calculated measure of how widely scores are spread around the average; differences greater than one standard deviation are more likely to be judged "statistically significant.") On the SAT, this is equivalent to the difference between scoring 400 and 500, or between scoring 500 and 600.
According to Zwick and Green, it did not matter whether the test compared students to "norms" (that is, to peers' scores), like the SAT, or was supposedly based on school content, like the FCAT. The same discrepancy was observed regardless.
It's been argued that standardized tests, whether the FCAT or the SAT, merely reflect family income. Yet Zwick and Green note that student grades, when compared within a school (though not across schools), were equally correlated with income.
High Stakes Testing and "No Child Left Behind"
High-stakes testing is the legacy of the 2001 "No Child Left Behind" (NCLB) Act. The idea is that parents should have standardized measures of how their children are faring in school. Schools are required to bring all students to grade level and keep them there, no matter what the students' backgrounds, and are held accountable for student progress; serving students from lower socioeconomic backgrounds is no excuse for failing to meet standards. All children, says NCLB, must be exposed to challenging content.
The FCAT is Florida's test to assess children's progress toward grade-level standards. It combines "criterion-referenced" assessment (the criteria are the "Sunshine State Standards"; the FCAT's aim is to assess student competence in these) with norm-referenced assessment, the latter comparing Florida's students with other students nationwide.
In theory the FCAT is objective, except of course that its writing component relies on "holistic scoring": student writing is assessed as a whole by a rater, rather than having points assigned to different sections or deducted based on the percentage of errors. Prior to 2010, two raters read each student essay, and the average of the two ratings was the student's score. Since 2010, only one rater reads each piece (and relatively rapidly, too).
According to Geisinger and Sireci (2011), twenty percent of essays are nevertheless rescored by a second rater to check score reliability, with raters "blind" as to which essays are rescored. Pearson, which oversees scoring, requires "60% inter-rater reliability." Actual inter-rater reliability for the 2011 FCAT, however, hovered around this threshold (62 percent for fourth grade, but only 57 percent for eighth grade and 54 percent for tenth), say Geisinger and Sireci, who report somewhat better agreement between rater scores and supervisor-set scores than between the raters themselves. Supervisors check approximately five to ten percent of the essays. (Research studies in education, as opposed to student tests, may require higher inter-rater reliability, as high as eighty percent when holistic scoring is used.)
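The reliability figures above are percent-agreement statistics: the share of essays on which two raters assign the same holistic score. A minimal sketch of that calculation, using made-up scores on a hypothetical 1-6 scale (not FCAT data):

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of essays on which two raters assign the same score."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must score the same essays")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)

# Ten essays scored on a 1-6 holistic scale by two raters (illustrative data).
scores_a = [4, 3, 5, 2, 4, 6, 3, 4, 5, 2]
scores_b = [4, 3, 4, 2, 4, 5, 3, 4, 5, 3]
print(percent_agreement(scores_a, scores_b))  # 70.0
```

By this measure, the pair above would clear Pearson's reported 60 percent requirement; the 2011 eighth- and tenth-grade figures would not.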
Norm- versus Criterion-Referencing
Criterion-referenced tests measure how well students do on specific criteria: either you can tie a knot or you can't, make sense of a particular text or not, solve a particular math problem or not. Criteria may be specified for various grade levels. For example, first graders may be expected to do simple addition, recognize the sounds of English consonants, or identify the parts of a book or a story; third graders may be expected to do addition with carrying, subtraction with borrowing, and simple or complex multiplication, recognize misspellings of age-appropriate words, and read a paragraph or part of a story, predicting what follows or what a character might be thinking.
Norm-based tests compare students to other students in the same grade. One measure commonly used is "percentile rank," where a student's performance on a particular test may be said to be in, say, the top thirty percent (at the 70th percentile) or the bottom twenty percent (at the 20th percentile). To rank at the 70th percentile means scoring as well as or better than 70 percent of test takers, with 30 percent scoring higher.
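A percentile rank can be sketched in a few lines. The definition below (percentage of a norm group scoring at or below a given score) is one common convention; testing programs vary in the exact formula, and the norm-group scores here are invented for illustration:

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring at or below the given score."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)

norm = [45, 50, 55, 60, 62, 65, 70, 75, 80, 90]  # made-up norm group
print(percentile_rank(70, norm))  # 70.0
```

A raw score of 70 against this (tiny, hypothetical) norm group lands at the 70th percentile: seven of the ten norm-group scores are at or below it.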
Nevertheless, a norm-referenced test still measures how examinees perform on particular subject matter. Many tests actually combine subject matter for two grade levels (first and second, or third and fourth). Test takers are also generally assessed on content one grade below and one grade above grade level (to better pinpoint the grade at which they are working). Thus test takers in both the first and second grades might be tested on a mixture of kindergarten, first, second, and third grade content.
Typical norm-referenced tests include the Scholastic Aptitude Test (SAT); the ACT, which is also criterion-referenced to some degree; and Harcourt/Psychological Corporation's Stanford Achievement Test (which assesses nationwide grade-level standards while also comparing students to norms). The General Education Diploma (GED) exam is also norm-referenced: although it assesses test takers' performance on secondary curricula, scores are "normed" against the scores of high school students to whom the test has also been administered.
According to Martin Kehe of the GED Testing Service (2011), however, the GED may be transitioning from norm- to criterion-referencing. Until then, it is normed using a representative sample of high school seniors; to pass, an examinee must score better than forty percent of that norm group.
Norming Tests and the "Question Bank"
To "norm" a test, you need a "representative sample": a group demographically similar to those who will actually take the test. Test questions are tried out on the sample first. Typically, questions that forty to sixty percent of the sample answers correctly are used on the actual test. This is considered the best way to make fine distinctions in rankings for the bulk of test takers.
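The 40-to-60-percent filter described above can be sketched as follows. The question IDs and response data are hypothetical, and real test development weighs many more factors than raw difficulty:

```python
def keep_items(item_results, low=0.40, high=0.60):
    """Keep trial questions answered correctly by 40-60% of the sample.

    item_results maps an item ID to a list of 0/1 responses (1 = correct).
    """
    kept = []
    for item, responses in item_results.items():
        p_correct = sum(responses) / len(responses)
        if low <= p_correct <= high:
            kept.append(item)
    return kept

trial = {
    "Q1": [1, 1, 1, 1, 0],  # 80% correct: too easy, dropped
    "Q2": [1, 1, 0, 0, 0],  # 40% correct: kept
    "Q3": [1, 0, 0, 0, 0],  # 20% correct: too hard, dropped
    "Q4": [1, 1, 1, 0, 0],  # 60% correct: kept
}
print(keep_items(trial))  # ['Q2', 'Q4']
```

Items that nearly everyone gets right (or wrong) tell you little about who ranks above whom, which is why the middle-difficulty band is preferred.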
Questions on which, say, males and females, or different ethnic groups, perform differently may be reviewed but are not necessarily removed from the question bank. In addition, up to five percent of the questions may contain errors (perhaps with no right answer, or several, among the choices provided).
More Scores: Linear Scores
The Scholastic Aptitude Test (SAT) actually uses not percentile rankings but linear scores. What are linear scores? First, let's define the "bell curve," or "normal" curve. Given a large enough population taking a test, statistics predict that most test takers (about 68 percent) should score in the middle range. Only a small percentage should score well above the middle, and an equally small percentage well below. The graph of these scores should thus, in an ideal statistical world, be perfectly symmetrical, with its bulk in the middle. Persons scoring within that middle bulk on this "normal" curve are within one "standard deviation" of the mean score.
On the SAT, 500 is the linear score assigned to the 50th percentile rank of a "representative sample" of test takers. Scores are "forced" onto a "normal curve," with a score difference of 100 used to represent a difference of a "standard deviation." Thus a score of 400 is given to an examinee whose raw score is the same as the raw score of someone in the "representative sample" who has scored a "standard deviation" below the mean score for the sample, while an examinee whose raw score is a standard deviation above the mean for the sample receives a score of 600.
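The conversion just described is, at its core, a linear function of how many standard deviations a raw score lies from the norming sample's mean: 500 plus 100 points per standard deviation. A minimal sketch with invented sample data follows (the real SAT conversion also involves equating across test forms, which this ignores):

```python
import statistics

def linear_score(raw, sample_mean, sample_sd):
    """Map a raw score to a 500-centered scale, 100 points per standard deviation."""
    z = (raw - sample_mean) / sample_sd  # standard deviations from the mean
    return round(500 + 100 * z)

sample = [38, 42, 45, 50, 50, 55, 58, 62]  # made-up norming-sample raw scores
mean = statistics.mean(sample)
sd = statistics.pstdev(sample)

print(linear_score(mean, mean, sd))       # 500: at the sample mean
print(linear_score(mean + sd, mean, sd))  # 600: one standard deviation above
print(linear_score(mean - sd, mean, sd))  # 400: one standard deviation below
```

This makes concrete the earlier point that a 400-to-500 gap and a 500-to-600 gap each represent one standard deviation on the norming sample.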
Can tests like the FCAT, and the drilling done to boost student scores on it, prepare students for the SAT or the ACT? A bit of practice may help scores, according to Kulik, Kulik, and Bangert (1984), at least when the practice form is nearly identical to the test actually taken. Gains from practice decrease, however, as the practice form (and no doubt format) diverges from the test to be taken.
Another word of caution: Ball State University's Greg Marchant (2004) conducted a study suggesting that exit examinations might not after all be linked to higher SAT scores. Practice on the FCAT may likewise do little for SAT or ACT scores.
If practice has its drawbacks, what then? If poor students do poorly compared to wealthier classmates, as recorded in test scores or report cards, why is this? Better teaching and more spending in wealthier schools? More literate parents (whose literacy and language or dialect is closer to the standard)? Or simply more parental involvement (many poorer homes are headed by a single working parent)?
Research links standardized test scores as much to parental education as to income, and it's possible that what standardized tests test, above all, is literacy. Finally, there is student attitude toward testing to consider, which in turn may be linked to perceived job options (schooling and school achievement in rural Morocco may be linked to perceived job options, according to Khandker, Lavy, and Filmer's 1994 World Bank study no. 264, "Schooling and Cognitive Achievements of Children in Morocco: Can the Government Improve Outcomes?").
In any case, drilling on any test other than a test similar or almost identical to the one to be taken, as noted above, may be linked to little or no score improvement.
Flexibility in No Child Left Behind?
It is of course possible for states to request flexibility in how they implement NCLB. In 2011, Colorado, Florida, Georgia, Kentucky, Massachusetts, Minnesota, New Jersey, New Mexico, Oklahoma, and Tennessee requested more flexibility in implementing "No Child Left Behind." Additional states, plus Washington, D.C., and Puerto Rico, have since done the same. The next deadline is February 28, 2012.