Validity and Reliability

One of the goals of the SAILS project is to create a test that is a good measure of information literacy. To determine how well the SAILS instrument measures information literacy as defined by the ACRL Information Literacy Competency Standards for Higher Education, we conducted a series of validity and reliability tests on the SAILS test items and on the test as a whole. Below is a brief description of our reliability and validity testing program.

External Validation

To determine whether the SAILS test correlates with other, similar measures, we compared SAILS results with two external instruments.

We gathered SAT/ACT scores for participating students and determined that groups of students who scored higher on the SAT/ACT also scored higher on SAILS, which is what we would expect.

We also compared student performance on the SAILS test with performance on the Information Literacy Test (ILT) developed at James Madison University. The ILT has established reliability and validity and is very similar to the SAILS test.

To compare student performance on the two tests, we conducted a correlation study. The observed correlation between scores on the ILT and the cohort SAILS test was 0.67, a moderate positive correlation. When the correlation was disattenuated, an adjustment that accounts for the reliability of the two tests, it rose to 0.72, a strong positive correlation. Both the observed and disattenuated correlations offer evidence that the two tests measure the same construct, information literacy.
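
To make the adjustment concrete, the sketch below applies Spearman's correction for attenuation, which divides an observed correlation by the square root of the product of the two tests' reliability estimates. It is a minimal illustration, not our analysis code, and the reliability values in it are hypothetical placeholders chosen only so the arithmetic mirrors the magnitude of the adjustment reported above.

    import math

    def disattenuate(observed_r, reliability_x, reliability_y):
        # Spearman's correction for attenuation: estimate the correlation
        # between the underlying constructs after removing measurement error.
        return observed_r / math.sqrt(reliability_x * reliability_y)

    # 0.67 is the observed SAILS/ILT correlation reported above; the two
    # reliability values are hypothetical, not the actual test reliabilities.
    print(round(disattenuate(0.67, 0.93, 0.93), 2))  # prints 0.72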

We also conducted performance testing in which students completed tasks based on the ACRL objectives and then answered the SAILS items for those objectives. We were looking for consistency between task performance and item performance, and we found it; however, the tasks and SAILS items proved too easy for our sample, so this process was inconclusive.

Item Reliability and Difficulty Level

Using item responses gathered from students over the three-year development phase of the project, we established that item reliability is high, using the Rasch measurement software Winsteps. To ensure that the results were not inflated by our large sample size, we repeated the reliability analysis with a smaller, representative sample and found the results satisfactory. All reliability estimates were greater than .80.
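
As background on the Rasch approach, the sketch below shows the Rasch model's response probability and one common way separation reliability is computed from item difficulty estimates and their standard errors: the proportion of observed variance in the estimates that is not attributable to estimation error. This is a simplified illustration with hypothetical values, not the Winsteps output or algorithm itself.

    import math

    def rasch_probability(ability, difficulty):
        # Rasch model: probability of a correct response given a person's
        # ability and an item's difficulty, both expressed in logits.
        return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

    def item_separation_reliability(difficulties, standard_errors):
        # Reliability of the item difficulty estimates:
        # (observed variance - mean error variance) / observed variance.
        n = len(difficulties)
        mean = sum(difficulties) / n
        observed_var = sum((d - mean) ** 2 for d in difficulties) / n
        error_var = sum(se ** 2 for se in standard_errors) / n
        return (observed_var - error_var) / observed_var

    # Hypothetical item difficulties (logits) and standard errors.
    difficulties = [-1.8, -0.9, -0.2, 0.4, 1.1, 2.0]
    standard_errors = [0.12, 0.10, 0.09, 0.09, 0.11, 0.14]
    print(round(item_separation_reliability(difficulties, standard_errors), 2))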

We are also interested in how difficult each item is. Our goal is to have items that span a wide range from very easy to very difficult so that the test can adequately measure a wide range of student abilities. We worked with three librarians who rated the difficulty of each item in a skill set on a three-point scale, and we compared those ratings with the difficulty estimates produced by our data analysis. Inter-rater reliability analyses yielded satisfactory scores (ranging from .65 to .80) for most of the skill sets. We have taken steps to increase inter-rater reliability for the skill sets that were not satisfactory, including identifying and eliminating items that do not contribute meaningful information to the test and reconfiguring the skill sets from the original twelve into the current eight. Although items are routinely reviewed to ensure their accuracy and timeliness, all eight skill sets are now stable, with appropriate content coverage and a suitable range of item difficulty.
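
The paragraph above does not specify the statistic behind the inter-rater reliability scores, so the sketch below illustrates one simple way such agreement can be estimated: averaging the pairwise correlations among the three raters' difficulty ratings. The ratings shown are hypothetical, and an intraclass correlation or a kappa statistic could serve the same purpose.

    from itertools import combinations

    def pearson(x, y):
        # Pearson correlation between two raters' rating vectors.
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
        sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
        return cov / (sd_x * sd_y)

    def mean_pairwise_agreement(ratings):
        # Average the correlation over every pair of raters.
        pairs = list(combinations(ratings, 2))
        return sum(pearson(a, b) for a, b in pairs) / len(pairs)

    # Hypothetical 1-3 difficulty ratings from three raters on eight items.
    rater_a = [1, 2, 2, 3, 1, 3, 2, 1]
    rater_b = [1, 2, 3, 3, 1, 2, 2, 1]
    rater_c = [1, 1, 2, 3, 2, 3, 2, 1]
    print(round(mean_pairwise_agreement([rater_a, rater_b, rater_c]), 2))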

Validity and Reliability of the Individually Scored Tests

Because the individually scored tests use the same item bank as the cohort test, they benefit from the same item development, validation, and review processes. Validity and reliability testing of the individually scored tests focused on the constructs being tested. First, we sought to establish the reliability of each test. Both tests were found to be reliable, with Cronbach's alpha values above 0.80.
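
For readers unfamiliar with the statistic, the sketch below shows how Cronbach's alpha is computed from a matrix of scored responses, using the number of items, the variance of each item's scores, and the variance of the total scores. The response data are hypothetical, not SAILS data.

    def cronbach_alpha(item_scores):
        # Cronbach's alpha from scored responses: one row per examinee,
        # one column per item (1 = correct, 0 = incorrect).
        def variance(values):
            m = sum(values) / len(values)
            return sum((v - m) ** 2 for v in values) / (len(values) - 1)

        n_items = len(item_scores[0])
        item_vars = [variance([row[i] for row in item_scores])
                     for i in range(n_items)]
        total_var = variance([sum(row) for row in item_scores])
        return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

    # Hypothetical scored responses: five examinees on four items.
    responses = [
        [1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]
    print(round(cronbach_alpha(responses), 2))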

We then sought to establish that the two versions of the individually scored test are parallel so that they can be used interchangeably. To do so, we conducted a correlation study. The observed correlation between scores on the two versions was 0.76, a strong positive correlation. When the correlation was disattenuated, using the same adjustment for test reliability described above, it rose to 0.98, a nearly perfect positive correlation. Both the observed and disattenuated correlations offer strong evidence that the two versions are indeed parallel and can be used interchangeably, and that both measure the same construct, information literacy.