Large language models like ChatGPT hold the promise of greater efficiency, helping us with tasks we mostly find onerous, such as summarizing texts, and assisting in creative endeavours. But what about talent assessments?
In this field, concerns have been raised about how technologies like ChatGPT may be used to circumvent test security or, at the very least, undermine the trust test users place in the accuracy of results, especially where tests such as SJTs and simulations rely (in part) on language-based answers.
In a recent review of this question, TTS’s best-of-breed partner, Saville Assessments, provided cogent answers that we discuss below.
Before we do so let us revisit exactly what large language models are and how they work, because an accurate understanding of this technology is critical when evaluating concerns around its impact on test integrity and talent assessments in general.
An overview of ChatGPT
Although we have discussed the basic definitions of large language models like ChatGPT in previous articles, it may be useful to revisit the basic concepts.
One useful way of understanding the technology is by using the metaphor of a library:
- Imagine a vast library, larger than any in the world, filled with billions of books. Each book contains information on countless topics, conversations, stories, questions, and answers.
- When ChatGPT was trained, it acted like a diligent librarian who read through every single page of every book in this vast library. The librarian did not have to memorize everything word for word, but instead became skilled at understanding patterns, themes, styles, structures, and meanings.
- Therefore, when a user asks ChatGPT a question or requests it to perform a task, it is like asking that diligent librarian for information on a topic.
- The librarian does not have to recite the exact page and book for every answer, but instead can synthesize what has been read and provide an answer based on the patterns and knowledge seen across many books.
For most queries, the above metaphor of a library and librarian holds true.
For tasks that involve more adaptive responses, such as asking ChatGPT to reproduce text in a particular style (e.g. Shakespearean English), the librarian in our metaphor above simply has to connect patterns in one set or series of books (i.e. the works of Shakespeare), with a pattern in another, in this case the required text entered by the user.
Much like a real librarian, large language models like ChatGPT are not infallible. For instance, if the question posed references highly obscure or proprietary information, or if there simply are not many accurate sources of information, the librarian (and ChatGPT) will not have adequate answers to provide.
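The pattern-learning the library metaphor describes can be illustrated with a deliberately simplified toy. The sketch below is not how ChatGPT actually works (real models use neural networks trained on billions of documents, not word counts), but it shows the core idea: learning from text which words tend to follow which, then reusing those patterns to produce an answer.

```python
from collections import Counter, defaultdict

# A tiny stand-in for the "library" of training text.
corpus = (
    "the librarian reads the books the librarian reads "
    "the patterns the librarian learns"
).split()

# "Training": count which word tends to follow each word (bigram counts).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the continuation seen most often after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))        # "librarian" — the most frequent follower
print(predict_next("librarian"))  # "reads"
```

A model like this can only reproduce patterns present in its training text, which is why, as noted above, questions about obscure or proprietary material leave it without an adequate answer.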
Talent assessments and ChatGPT
For the most part, concerns about the influence of Artificial Intelligence technologies like ChatGPT on assessments are easily answered when considering standard psychometric measures that test aptitude. In such assessments, candidates are timed, and any outside help they may receive from any source, ChatGPT included, is unlikely to help them; given the time restrictions, it is more likely to harm their performance.
As with all talent assessments, the real threat to test integrity is not individual dishonesty, but rather the compromise of test items as a whole.
In other articles, we have argued that the best way to overcome such concerns is to rely on best-of-breed, global test providers who use item banking, dynamic test item creation, and employ state-of-the-art database security and monitoring (for the full article, click here).
Where concerns regarding ChatGPT’s influence on talent assessments may be more valid, is where a given test requires a candidate to provide written information that is then evaluated.
As a recent survey by ResumeBuilder.com suggests, CVs and cover letters are the most obvious examples where candidates may use ChatGPT to help produce better quality writing, leading to a better first impression. Online application forms that require candidates to submit written evidence of competencies are another example of a method that may be vulnerable to assistance by ChatGPT.
However, IO Psychologists and professionals in the talent field have known for many years that CVs, resumes, and self-reported experience are all very poor predictors of job performance. And while technologies like ChatGPT may well help someone appear more desirable, this only becomes problematic when organizations rely too heavily on such inaccurate, poorly predictive methods to make talent decisions.
In this regard, the danger is not so much in candidates using ChatGPT, but in hiring decision-makers not employing accurate prediction methods to begin with.
Perhaps a more important area to watch is where ChatGPT could assist candidates in preparing interview answers or “scripts”. By using such AI-enhanced scripts in asynchronous or live interviews, a candidate may well be able to gain some advantage, provided that they (a) know the interview questions beforehand, and (b) have sufficient skill and time available to rehearse such scripts so that their delivery appears natural.
Concerns about talent assessments
While CV reviews and interviews clearly rely on a candidate’s linguistic responses, the vast majority of talent assessments used by our clients to help them make better talent decisions do not. The question therefore still stands: can ChatGPT help candidates score better in talent assessments?
To answer that question, Saville Assessments conducted a series of trials and experiments to test how vulnerable typical talent assessments like aptitude tests, personality measures, and SJTs are to AI-enhanced responses.
While verbal aptitude tests are text-based, most others are based on non-verbal information or a mix of verbal and non-verbal information, such as diagrams and symbols.
ChatGPT and similar models cannot understand such non-verbal inputs, but even when exposed to verbal or text-based questions, Saville’s experiments found that ChatGPT fared poorly.
Why? The researchers often observed ChatGPT making simple logical errors, failing to understand the context of a question, and not being able to adequately process arguments.
This highlights one of the key limitations of all large language models like ChatGPT: although they may at times appear almost intelligent, models like ChatGPT cannot reason or apply logic in ways that tests like Saville’s aptitude assessments are designed to detect and measure.
As previously mentioned, an additional barrier to receiving assistance from tools like ChatGPT while completing aptitude tests is that the tests are strictly timed. Given the time needed to input questions and to clarify and refine responses, which is common to large language models, enhancing aptitude test results using ChatGPT is highly unlikely.
Even if ChatGPT were to perform some of these operations faster, it is worth noting that the scoring keys to such tests are not known to the language model. Any response can therefore only draw on the inputs and open-source references used in the model’s initial training.
When turning to measures of workplace behaviors and styles, such as Saville’s Wave measure, the researchers found that the Wave’s “rate and rank” format made it difficult for ChatGPT not only to provide a precise rating on an item, but also to rank two or more items sensibly while remaining consistent across different constructs.
More importantly, given the limitations of large language models discussed earlier, ChatGPT does not “know” how to replicate a personality profile that is appropriate for a particular job, or one that reflects the candidate’s actual personality, making any effort to sustain that persona at interview or in a feedback session highly problematic.
Situational Judgement Tests (SJTs)
For the same reasons that behavioral measures resist AI-enhanced responses, sophisticated SJTs that use ranked item responses will have the same confounding effect on technologies like ChatGPT.
However, less sophisticated SJT formats, such as those with clear right and wrong responses, may be vulnerable to ChatGPT assistance. This vulnerability is easily remedied by creating items that are not overly obvious, and by presenting items in a way that requires candidates to respond to each item independently.
In their experiments with such formats, Saville found that ChatGPT, much like was the case with aptitude tests and behavioral measures, failed to accurately understand or master the required judgments.
Technologies like ChatGPT, while being fantastic tools for efficiency and eliminating drudgery, may also sometimes threaten established processes and methods that organizations still rely on when making talent decisions. Certainly, an over-reliance on CV screening, for instance, could expose hiring companies to the influence of ChatGPT-assisted responses and, possibly, less-than-ideal decisions about candidates.
To this point, the greatest concern ChatGPT poses to talent assessments involves written materials prepared by candidates in their own time, such as CVs, cover letters, and competency-based application forms.
Given the ever-increasing popularity and sophistication of technologies like ChatGPT, organizations may be well advised to reduce the emphasis they place on such methods and to ensure the use of more objective assessments that have demonstrable predictive power for measuring candidates’ suitability to specific roles.
When it comes to using psychometric assessments, the use of multiple assessments is always warranted, even if ChatGPT did not exist. Such multi-instrument batteries help talent professionals to make better decisions because of the breadth of constructs measured, while also offering multiple layers of security and verification.
But when considering the nature and features of best-of-breed aptitude, behavioral, and personality measures, end-users should be assured that ChatGPT and similar models do not pose a material threat to the integrity of talent assessment results.
This is not only because of the rigor and technology that test providers like Saville employ in designing such measures, but also because of the inherent limitation of all large language and AI technologies to date: their inability to truly reason, think, and understand in the ways the human mind can.
If you are interested in how Saville and other talent assessments can benefit you in making better talent decisions, why not reach out to us at firstname.lastname@example.org and we will be in contact with you.
Source: Chan, S., & Smith, J. (2023). What Does the Emergence of ChatGPT Mean for the World of Assessment? Saville Assessments Report.