Cheat-proof testing: Why all assessments are not born equal

A common question our consultants encounter in their work with clients is some variant of: “Can test results be faked?” or “How do you prevent candidates from cheating at assessments?”

This is a legitimate concern. Companies invest considerably in assessment results, using them to make better talent decisions, inform their succession strategies, and understand the future shape of their talent pools. The possibility that such results may not be accurate is extremely worrying and threatens to undermine not only the use of assessments but also large parts of our professional value-add as IOPs.

But what is the actual situation when it comes to cheating and faking test behavior, and, perhaps more importantly, how can responsible professionals mitigate such risks?

In this article, we will examine current (and past) best practices in test construction and administration with an eye to understanding the risks of cheating and faking in psychometric testing.

Cheating versus item leakage: The real threat

Inherent in the questions raised by clients mentioned above is the belief that participants in assessments can somehow cheat or provide dishonest answers in the measures used to test their capacities, behavior and aptitudes.

Of course, participants can (and sometimes do) provide random answers to questions, but the result will often be a very poor capacity score in the case of aptitude assessments or a highly inconsistent one in the case of typical behavioral measures.

Such cases are not what assessment end-users worry about. The real concern is whether test-takers can find ways to artificially inflate or improve their results through some form of outside assistance or cheating.

Especially in the case of non-proctored, online assessments, the worry often revolves around the possibility of candidates using search engines, calculators, or help from another individual when responding to test items.

At face value, this seems like a potential problem for all assessments that are delivered online and unsupervised. But, as we’ve argued in past articles, the research on this phenomenon shows little to no difference in test performance when supervised assessment results are compared to unsupervised counterparts.

What are some of the potential reasons for this lack of significant difference between these modes of assessment delivery? For one, it is worth noting that results of this kind cannot prove that candidates are not trying to cheat. What they do illustrate is that if cheating occurs when candidates are unsupervised, presumably by using calculators or “phoning a friend”, it does not benefit them.

And once one understands the structure and format of typical modern aptitude assessments, which are almost always timed, it becomes obvious that any form of outside assistance, whether a search engine, a calculator, or another person, only makes it more likely that the candidate will run out of time and thus perform poorly.

With behavioral measures, which are commonly self-report instruments, outside help and similar dishonest behavior are likely to present a candidate in a way that contradicts other assessments, such as interviews, and may well trigger the inconsistency and social desirability warnings that these measures build in precisely to detect such threats.

In fact, research on test behavior suggests that personality and behavioral styles assessments are far less susceptible to cheating attempts, whether delivered supervised or unsupervised, online or pen-and-paper (Joubert & Kriek, 2009). This is no doubt explained by the nature of these assessments. Since they have no right or wrong answers (unlike ability tests), they are considered typical performance tests, in that they attempt to assess how a candidate typically behaves during a day of work.

On maximum performance tests such as aptitude and ability assessments, the motivation to cheat may be higher because there are, in fact, right and wrong answers. This may motivate dishonest candidates to have a “clever friend” complete such assessments on their behalf.

But, even though assessment providers must take steps to ensure that candidates are who they say they are when completing supervised or unsupervised assessments, there is no guarantee that a “fake candidate” will perform any better at assessments than the person they are trying to assist. The reason is simple: if people could accurately judge typically assessed aptitudes like verbal, numerical and abstract reasoning simply through observation, there would never have been a need to develop aptitude tests. And in the rare instances where candidates have better-performing doppelgangers, verification checks and additional assessments can reveal such subterfuge and mitigate it.

So, what is the real threat to test integrity that employers, testing professionals, and assessment end-users should be concerned about?

Beyond the simplistic (and almost guaranteed-to-fail) strategies employed by some cheating test-takers, a far more sinister and potentially destructive form of cheating exists: the public distribution of test items and answers.

Unfortunately, it is not difficult to find answers to a variety of aptitude and other psychometric measures online. One need not even be familiar with the so-called “dark web” to find sites that offer past answers and questions associated with talent assessments (for a price, of course).

A casual Google search will reveal a multitude of sites that offer to sell full question and answer guides to everything from numerical aptitude assessments to competency-based interviews conducted by well-known international companies.

While it may be interesting to speculate on how such items come to be on sites like these, what is more important is to consider the implications for test results gathered using such compromised measures.

In the case of aptitude assessments, the consequences can be disastrous. Once test items are harvested from various test-takers and testing administrations, proprietors of test question and answer sites have all the time in the world to compare notes, figure out answers, and test those answers in actual assessment opportunities. Once the items (and their answers) have been established as fully compromised, such sites can sell them to candidates who wish to gain an unfair advantage.

And while receiving in-session help will almost always disadvantage candidates, studying and memorizing the correct answers based on a compromised test memorandum will inflate a candidate’s performance.

To fully appreciate the nature of this problem, as well as strategies to mitigate the risks, we need to briefly review the nature of psychometric test design and publication.

The problem of test item leakage and test design

When test developers publish their assessments in the psychometric marketplace, it is usually the result of a series of research studies and tests of validity and reliability that is meant to assure potential end-users that the measures they are using have been subjected to the rigors of scientific scrutiny, and that, among other things, they:

  • Measure what they purport to measure,
  • Show negligible bias toward any one group of people or demographic variable,
  • Remain stable across time,
  • Are appropriate for local use.

While any respected test developer will be quite capable of demonstrating the above-mentioned attributes, what is less clear is how vulnerable their instruments may be to test item leakage.

The potential release of test items into the public domain is a function of (a) whether the assessment is used for selection and high-stakes decision-making, (b) the design of the test, and (c) the technologies used to deliver test items to participants.

Once an assessment gains traction in the marketplace and clients start to use it to inform their employment decisions, it is that much more likely that the assessment will attract the attention of those who wish to gain (or offer) unfair advantages to test-takers.

Whether attempts at compromising test items will succeed largely depends on whether the test developer has applied modern methods of analysis, such as Item Response Theory (IRT), to their test design, and whether they have used such insights to inform robust, cutting-edge technologies that deliver test items to test-takers in novel ways.

To better illustrate the issues at stake, consider the following example:

Imagine that you create a test that measures a person’s knowledge of a particular discipline. For the sake of this example, let’s imagine that your test measures a person’s understanding of accounting principles.

In developing the test, you eventually settle on a test length of 20 items, each one measuring a different domain of accounting know-how. In making sure that the test is accurate, you might have accounting experts evaluate the items, and you may also have people with known knowledge of accounting (e.g. chartered accountants) complete the test and compare their results with people who have less knowledge of accounting (e.g. high-school accounting students). If your test is a good one, it should distinguish clearly between these two groups of test-takers.
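As a simple illustration of this kind of known-groups check, the sketch below compares the mean total scores of the two groups and computes an effect size; a test that measures accounting knowledge well should separate the groups by a wide margin. The scores, group sizes, and effect-size calculation are hypothetical and purely illustrative, not drawn from any published test manual.

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical total scores (out of 20) for the two known groups
expert_scores = [18, 17, 19, 16, 18, 20, 17, 19]  # chartered accountants
novice_scores = [9, 11, 8, 12, 10, 7, 11, 9]      # high-school accounting students

def cohens_d(group_a, group_b):
    """Standardised mean difference between two independent groups."""
    pooled_sd = sqrt((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd

print(f"Expert mean: {mean(expert_scores):.1f}, novice mean: {mean(novice_scores):.1f}")
print(f"Cohen's d: {cohens_d(expert_scores, novice_scores):.2f}")  # a sound knowledge test should show a large effect
```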

If you then put this test to use for a specific application, perhaps as an admissions test to a college program, you could be forgiven for thinking that your work as a test developer is now done. And this is exactly the problem with many psychometric tests that are currently available to assessment end-users.

The problem is this:

  • If your test is only composed of the 20 items you constructed, and these items remain the same (as they must), any compromise of the items will also compromise your test’s validity from that point onward.

All a dishonest candidate will have to do to cheat at such assessments and perform brilliantly in your accounting test is to pay the site that compromised it for the questions and answers. Then, the dishonest candidate simply memorizes the correct answers (or has them at hand) and when they complete your test, their performance (and therefore their predicted knowledge of accounting) will be grossly inflated.

Once items are publicly available to paying customers, your test’s validity is essentially zero and its use would be ill-advised for the purposes it was originally designed for (i.e. testing people’s knowledge of accounting).

An important note is that the inflated performance of dishonest test-takers who had access to compromised test items would have occurred whether the assessment was supervised or not. That is because the cheating behavior happened well before the actual testing event, and thus all that supervision would achieve is to falsely assure the user of the assessment results of the test’s (non-existent) accuracy.

How can the risk of test item leakage be mitigated? As mentioned previously, the answer lies in the use of modern statistical techniques and the application of robust assessment delivery technologies.

Not all assessments are created equally: Item banking and dynamic item generation

To reduce the risks of item leakage, a sensible strategy is to ensure that an assessment measure contains far more items than are shown to any one test-taker. So, to return to our example of the compromised accounting test, a way that you as a test developer could mitigate the problem of item leakage is to construct hundreds of items that all measure the same constructs, at the same level of difficulty as your original 20 items.

Having such an item bank available will help you mitigate the risk of items being released into the public. Why? Because you could present a different set of 20 items to each new test-taker, and therefore, even if some of these test-takers succeed at stealing the items they were presented with, it would not compromise your entire assessment. And if you construct many thousands of items, the likelihood that any one set of 20 items will be repeated (and thus be susceptible to item compromise) will be very low indeed.
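A minimal sketch of what such per-candidate form assembly might look like is shown below, assuming a hypothetical bank of items that have each been tagged with a difficulty band; the bank size, field names, and banding scheme are invented for illustration and do not describe any specific provider's implementation.

```python
import random
from math import comb

# Hypothetical item bank of 2,000 calibrated items spread across five difficulty bands
item_bank = [
    {"item_id": i, "difficulty_band": i % 5}  # band 0 = easiest ... band 4 = hardest
    for i in range(2000)
]

def assemble_form(bank, items_per_band=4, bands=range(5), seed=None):
    """Draw a fresh 20-item form by sampling a few items from each difficulty band."""
    rng = random.Random(seed)
    form = []
    for band in bands:
        candidates = [item for item in bank if item["difficulty_band"] == band]
        form.extend(rng.sample(candidates, items_per_band))
    rng.shuffle(form)
    return form

print([item["item_id"] for item in assemble_form(item_bank)])

# With 400 items in each band, the number of distinct 20-item forms is astronomically
# large, so two candidates are very unlikely to ever see the same set of items.
print(comb(400, 4) ** 5)
```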

If you were particularly sophisticated in your approach, you may even decide to periodically review the performance of your test items, and when you notice an item that should be reasonably difficult suddenly becoming very easy, you could remove that item from the item bank under the assumption that it may have been compromised. By periodically checking item performance and replacing potentially compromised items with new items, you can almost entirely eliminate the threat of item leakage.
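A rough sketch of that kind of item-performance monitoring appears below: it flags items whose recent proportion of correct responses has drifted well above what the original calibration predicts. The calibration values, response logs, and drift threshold are all hypothetical.

```python
# Hypothetical calibration data (expected proportion correct) and recent response logs
calibrated_p = {"item_101": 0.35, "item_102": 0.60, "item_103": 0.30}
recent_responses = {
    "item_101": [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # suddenly almost everyone answers correctly
    "item_102": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "item_103": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
}

DRIFT_THRESHOLD = 0.25  # illustrative cut-off for "suspiciously easier than expected"

def flag_compromised_items(calibrated, responses, threshold=DRIFT_THRESHOLD):
    """Return items whose observed proportion correct exceeds calibration by the threshold."""
    flagged = []
    for item_id, answers in responses.items():
        observed_p = sum(answers) / len(answers)
        if observed_p - calibrated[item_id] > threshold:
            flagged.append((item_id, calibrated[item_id], observed_p))
    return flagged

for item_id, expected, observed in flag_compromised_items(calibrated_p, recent_responses):
    print(f"{item_id}: expected p = {expected:.2f}, observed p = {observed:.2f} -> review and retire")
```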

Returning to psychometric assessments, best-of-breed test developers employ precisely the techniques described above to mitigate the risks of item leakage.

For instance, assessment providers like Saville and Aon create thousands of equivalent items for their aptitude assessments. These items are calibrated for difficulty using modern techniques of item analysis such as IRT, which allow any item to be seamlessly replaced by another of similar difficulty.
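To give a feel for how IRT calibration makes items interchangeable, the sketch below uses the standard two-parameter logistic (2PL) model, in which each item carries a discrimination (a) and a difficulty (b) parameter, and then selects the bank item whose parameters lie closest to those of a retired item. The item parameters and the simple distance rule are hypothetical simplifications of what a real test engine would do.

```python
from math import exp

def p_correct_2pl(theta, a, b):
    """2PL model: probability that a person of ability theta answers an item correctly."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

# Hypothetical calibrated bank: (item_id, discrimination a, difficulty b)
bank = [
    ("item_201", 1.2, -0.50),
    ("item_202", 0.7,  0.35),
    ("item_203", 1.1,  0.15),
    ("item_204", 1.3,  1.20),
]

def closest_replacement(retired_a, retired_b, candidates):
    """Pick the candidate whose difficulty (and, secondarily, discrimination) best matches the retired item."""
    return min(candidates, key=lambda item: abs(item[2] - retired_b) + 0.5 * abs(item[1] - retired_a))

# Suppose an item with a = 1.0 and b = 0.12 has been flagged as compromised:
print(closest_replacement(retired_a=1.0, retired_b=0.12, candidates=bank))  # item_203 is the nearest match

# A person of average ability (theta = 0) has a comparable chance of success on the replacement item
print(f"{p_correct_2pl(0.0, 1.1, 0.15):.2f}")
```

Because replacement items are matched on their calibrated parameters, candidates continue to receive tests of equivalent difficulty, which is what allows individual items to be retired without affecting the comparability of scores.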

When combined with technologies that can dynamically compile items on the fly as each new test-taker starts their assessment, the likelihood that any one candidate will be tested using the exact same array of items becomes extremely unlikely. And with algorithms that constantly analyze item performance for potential removal, even if specific items are compromised, they will not be part of future assessment administrations.

But not all test developers have the knowledge or technological resources that are required for item banking, dynamic item generation, or the subsequent analyses of item performance.

Indeed, at TTS we ensure that new assessment products are measured against their developers’ capacity to demonstrate the use of item banking, dynamic item generation, and persistent monitoring of item integrity.

Other test integrity measures: Beyond item banking and item generation

The use of item banking and dynamic item generation goes a long way toward ensuring that item leakage problems are reduced or even eliminated. But there are other threats to test integrity that need to be considered.

For instance, the risk of internal data compromise by hackers and ransomware operators cannot be discounted as a potential challenge to test integrity.

Here it behooves the prudent user of assessment products to ensure that product providers have robust and best-of-class data security protocols in place and periodically review their defenses using techniques like data forensics, white-hat hacking, and penetration tests.

Unfortunately, such strategies can be costly, and therefore it is a good idea to use test providers who devote considerable budget to data security or who have access to extensive technical resources. In this regard, we feel comforted that our assessment partners like Aon assessment services spend millions of dollars each year on data security and integrity.

As with item banking technology, extensive data security practices often fall beyond the reach of smaller, more localized test developers. This has, in part, informed TTS’s preference for internationally recognized, globally applied assessment products and providers since our inception.

Final thoughts and caveats

As we’ve argued in previous articles, the reliance on supervision as a deterrent to cheating at assessments is unfounded. In fact, it should be clear from the above discussion that the true threat to test integrity is not so much during-administration dishonesty, but rather before-the-fact test item leakage and compromise.

As long as there are high-stakes assessments, there will always be unscrupulous entities that will seek to profit from the compromise and re-selling of test items and answers. The only practical methods that can mitigate against such risks are the ones discussed in this article, namely item banking, dynamic item generation (by virtue of using IRT methods of item development) and persistent item performance monitoring.

When paired with robust data protection strategies on the part of the test developer, clients need not be concerned about test integrity or cheating behavior. The caveat is of course to ensure that the assessments used have been developed by internationally-benchmarked and recognized test providers who have sufficient resources at their disposal to implement the correct analyses, technologies, and protection measures needed.

At TTS, we have been at the forefront of bringing just such products to the local talent assessment market, and our assessment partners are internationally recognized developers who have been employing technologies such as item banking and item generation for many years now.

As a result, our clients benefit from accurate and high-integrity assessment data that avoids the pitfalls of test item compromise while also ensuring future-proof protection against data security threats.

If you would like to take this conversation further, or if you have questions regarding the integrity of the tests you are using currently, why not connect with us at info@tts-talent.com?

Sources and Further Reading

Brown, A., Bartram, D., Holtzhausen, G., Mylonas, G., & Carstairs, J. (2005). Online personality and motivation testing: Is unsupervised administration an issue? Paper presented at the 20th annual SIOP conference, Los Angeles, CA.

Joubert, T., & Kriek, H.J. (2009). Psychometric comparison of paper-and-pencil and online personality assessments in a selection setting. SA Journal of Industrial Psychology, 35(1).

Tippins, N.T., Beaty, J., Drasgow, F., Gibson, W.M., Pearlman, K., Segall, D.O., & Shepherd, W. (2006). Unproctored Internet testing. Personnel Psychology, 59, 189-225.