Not all employee and talent assessments are created equal. As we have argued in previous articles, best practices such as item banking, dynamic item creation and database monitoring are all essential to ensuring that end-users of assessments get the best value for their investment.
However, with the advent and widespread use of generative AI (Large Language Models, or LLMs), a new risk has emerged for test integrity. Recently, we have fielded inquiries from clients about generating test questions or personality quizzes on the fly using LLMs.
There are rigorous scientific standards for developing reliable and valid assessments, and bypassing these standards can lead to serious risks. And while convenient, AI-generated assessments that lack scientific vetting may produce misleading or biased results.
In this article, we examine best practices for sound test development and compare them to the uncontrolled nature of AI-generated tests.
Best practices for test development
Developing a psychometric assessment is a scientific, multi-stage process designed to ensure the test measures the right construct in a consistent, fair way. To achieve this kind of accurate, fair measurement, test publishers follow a thorough, well-documented development process.
First, test developers define the construct to be measured (the target construct) and the purpose of the assessment: a clear statement of what the test intends to measure and why.
Next, test developers generate a pool of candidate items or stimuli (questions or tasks). This step benefits from subject-matter experts and psychologists who carefully craft items, simulations or tasks to represent aspects of the construct.
Each item or task undergoes scrutiny for clarity and relevance. This collaborative, theory-driven writing process is critical: an employment test is only considered “good” if it, at the very least, makes sense to candidates and does not generate ambiguous results arising from confused responses or from irrelevant tasks or simulations.
Before full deployment, a draft test is administered to a pilot sample from the target population. Pilot testing ensures items are understandable and acceptable. Test-takers and additional experts provide feedback on any confusing or inappropriate items or tasks, and statistical analyses determine how effectively each item differentiates between high- and low-scoring candidates on the construct in question.
This is an iterative process, involving multiple pilot tests and samples with the aim of refining the assessment before full deployment.
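To make the statistical side of piloting concrete, the short Python sketch below computes the item difficulty and item-total discrimination indices typically examined at this stage. The response data and the flagging cut-offs are hypothetical and purely illustrative, not taken from any real assessment.

```python
import numpy as np
import pandas as pd

# Simulated pilot data for a 10-item, 0/1-scored ability test:
# rows = pilot candidates, columns = items (purely illustrative).
rng = np.random.default_rng(42)
responses = pd.DataFrame(rng.integers(0, 2, size=(200, 10)),
                         columns=[f"item_{i + 1}" for i in range(10)])

total_score = responses.sum(axis=1)

item_stats = pd.DataFrame({
    # Difficulty: proportion of pilot candidates answering the item correctly.
    "difficulty": responses.mean(),
    # Discrimination: corrected item-total correlation
    # (item score vs. total score with that item removed).
    "discrimination": [
        responses[col].corr(total_score - responses[col])
        for col in responses.columns
    ],
})

# Flag items that are too easy, too hard, or that fail to differentiate
# candidates (the cut-offs shown are illustrative, not prescriptive).
item_stats["flagged"] = (
    (item_stats["difficulty"] < 0.20)
    | (item_stats["difficulty"] > 0.90)
    | (item_stats["discrimination"] < 0.20)
)
print(item_stats.round(2))
```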
A cornerstone of test quality is reliability, meaning the test yields consistent results. And while some AI-based assessments claim high internal consistency (e.g. Cronbach’s alpha), test developers never rely solely on such methods. Indeed, additional data-driven evidence in the form of test-retest and inter-rater reliability estimates is considered the gold standard.
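For readers who want to see what these estimates involve, the following minimal Python sketch computes Cronbach’s alpha and a test-retest correlation on simulated response data; the eight-item scale and the response values are hypothetical and exist only to show the calculations.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal consistency: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Simulated 5-point Likert responses to an 8-item scale (rows = respondents).
rng = np.random.default_rng(0)
time_1 = pd.DataFrame(rng.integers(1, 6, size=(150, 8)))
# Simulated retest: the same respondents a few weeks later, with small changes.
time_2 = time_1 + rng.integers(-1, 2, size=time_1.shape)

print("Cronbach's alpha:", round(cronbach_alpha(time_1), 2))
# Test-retest reliability: correlation of total scores across administrations.
print("Test-retest r:", round(time_1.sum(axis=1).corr(time_2.sum(axis=1)), 2))
```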
Validity ensures that the test measures what it purports to measure (construct validity) and is effective in practice (criterion-related validity). This involves validation studies, including factor analysis and correlation with established measures.
It is worth noting that validity studies are relatively straightforward to conduct, relying on well-established techniques for demonstrating construct accuracy. A critical question they answer is whether an AI-based assessment that claims to measure a construct of interest can in fact demonstrate a strong relationship with other assessments of known, demonstrated validity.
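As a simple illustration of this kind of convergent evidence, the Python sketch below correlates simulated scores on a new scale with simulated scores on an established measure of the same construct. The numbers are invented purely to show the computation.

```python
import numpy as np

# Simulated scores for the same candidates on an established, validated measure
# and on a new scale claiming to tap the same construct (values are invented).
rng = np.random.default_rng(7)
established = rng.normal(50, 10, size=120)
new_scale = 0.6 * established + rng.normal(0, 10, size=120)

# Convergent validity: a strong correlation supports the claim that the new
# scale measures the intended construct; a weak one undermines it.
r = np.corrcoef(new_scale, established)[0, 1]
print(f"Convergent validity r = {r:.2f}")
```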
Before deployment, well-developed assessments undergo bias detection analysis. This tests for adverse impact: a term referring to test outcomes that differ substantially across protected demographic groups (e.g., hiring rates for different races or genders) in ways not justified by job-related factors (see, for instance, Biddle, 2017).
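For illustration, the sketch below applies the widely used “four-fifths” screening heuristic for adverse impact to hypothetical applicant numbers. A flag here does not establish discrimination; it simply signals the need for closer statistical and job-relatedness review.

```python
import pandas as pd

# Hypothetical selection outcomes by demographic group (illustrative numbers only).
outcomes = pd.DataFrame({
    "group": ["A", "B"],
    "applicants": [200, 150],
    "selected": [80, 42],
})

outcomes["selection_rate"] = outcomes["selected"] / outcomes["applicants"]
highest_rate = outcomes["selection_rate"].max()

# Four-fifths heuristic: a group's selection rate below 80% of the highest
# group's rate signals potential adverse impact and warrants closer
# statistical and job-relatedness review (it is not proof of discrimination).
outcomes["impact_ratio"] = outcomes["selection_rate"] / highest_rate
outcomes["potential_adverse_impact"] = outcomes["impact_ratio"] < 0.8
print(outcomes)
```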
Considering the above, it is not surprising that a well-developed, scientifically robust and fair test may take years to progress from the pilot stage to final deployment.
While this may have negative commercial implications, especially in domains where urgent testing of unique constructs is required, test developers (and their expert users) have professional and ethical responsibilities to ensure that talent decisions based on test results are defensible and accurately predict outcome criteria such as job performance.
Comparison with AI-generated assessments
LLMs like ChatGPT can generate interview questions, personality test items, or scenario-based questions. However, AI-generated assessments fundamentally lack the controls and vetting that human-developed tests undergo, leading to multiple risks:
Lack of empirical validation: AI-generated test items lack job analysis, theoretical grounding, and validation studies.
Uncontrolled content quality: LLMs can produce inconsistent or misleading items without formal statistical screening (see, for instance, DeVellis, 2016). When AI models generate test items, the content can be unpredictable or subtly flawed. LLMs are known to occasionally “hallucinate” false information or use inconsistent logic. Human experts typically catch such problems during pilot testing, but an AI-generated test deployed without review could contain invalid or misleading items.
Even if blatant errors are avoided, the difficulty and discrimination levels of AI-created items are unknown. Some questions might be so easy (in the case of ability testing) or so socially desirable (in the case of personality testing) that responses are skewed, while others may be nonsensical, too hard, or socially repellent, yielding no useful differentiation among candidates.
In short, while LLMs can generate text, they cannot ensure that items function as components of a proper psychometric instrument. That assurance only comes from research and testing, which AI-generated assessments lack unless backfilled by extensive validation after generation.
Biases and stereotypes: One of the gravest concerns is that AI-generated test content may reflect biases present in the model’s training data. Large language models learn from vast amounts of internet text, which inevitably includes societal biases related to gender, race, ethnicity, and so on.
Studies have found that AI-generated content can contain substantial gender and racial biases, even when prompted with neutral inputs (Mehrabi et al., 2021).
Such biases could manifest in subtle ways in assessment items. For instance, an AI might generate scenarios or examples that consistently cast one gender in leadership roles, or use language that is more familiar to one cultural group, thus subtly advantaging those candidates.
Limited accountability or transparency: Without proper documentation and transparency, AI-generated tests may fail to meet legal and professional standards such as those proposed by professional societies (e.g. SIOP, 2018).
Validated assessments usually come with technical documentation, such as manuals reporting their validity, reliability, and norms, along with known limitations.
In contrast, an AI-generated assessment offers no transparency about its development. The rationale for each item and the scoring key are essentially a by-product of a statistical model, not a transparent logic aligned to constructs or competencies.
This poses challenges if a test-taker or end-user asks: “On what basis is this assessment scoring the candidate?”
With a professionally developed test, one can point to validation studies (e.g. “this test’s scores correlate with sales performance, predicting who will excel in a sales role” or “these interview questions were derived from a job analysis of critical leadership behaviors”).
With an AI-generated test, the employer might have no defensible explanation beyond “the LLM came up with these questions.”
This lack of clear explanation not only erodes trust but could also be a legal liability if decisions are challenged (courts, for instance, tend to expect employers to articulate job-related reasons for any selection practice). An additional hurdle to ethical implementation of AI-generated tests is that AI models are proprietary and ever-changing. If an external vendor’s AI generates the items, the organization may not even know the exact algorithm or data used. This black box scenario is at odds with ethical guidelines that emphasize transparency and understanding of assessment tools.
Consequences of poorly designed assessments
Using assessments that are not developed according to best practices, including AI-generated tests, can lead to the following risks:
Inaccurate measurement and poor talent decisions: A test that does not reliably measure job-relevant traits yields faulty data, and hiring or promotion decisions based on that data are essentially compromised: the wrong candidates may be selected and strong candidates overlooked, leading to reduced job performance and increased turnover.
Empirical research has demonstrated the value of well-validated tests – for instance, cognitive ability assessments (when properly validated) often show correlations around 0.5 with job performance, making them among the most powerful predictors of future success.
Conversely, an unvalidated test likely has unknown or negligible correlation with performance. Relying on it could be as ineffective as hiring at random, squandering the opportunity to identify top talent.
Legal and compliance risks: Assessing candidates with unproven tests, or with assessments that do not meet the minimum requirements of reliability, validity, and fairness, exposes the companies employing them to considerable legal risk.
Given the overwhelming volume of scientific data that argue in favour of well-developed assessments, combined with their widespread availability and accessibility, employers who resort to homegrown AI-created assessments will be on extremely shaky ground when trying to defend such practices.
Reputational damage: Candidates react negatively to selection processes perceived as unscientific or unfair, affecting employer branding (see, for instance, research by Hausknecht, Day, & Thomas, 2004).
Job candidates are often well-informed. If candidates perceive a selection process as unfair, unprofessional, or invasive, it can lead to negative reactions. An assessment that seems nonsensical or unrelated to the job (a common risk with unvalidated or AI-generated tests) may frustrate or alienate applicants. They may withdraw from the process or, even if they continue, their view of the company will be negatively affected.
This has implications for employer branding. Word can spread quickly via social media or sites like Glassdoor if a company uses dubious hiring practices or homebrewed tests.
Top candidates might avoid applying in the future, not wanting to “roll the dice” with an unfair assessment. Furthermore, existing employees might question leadership’s competence if they see unqualified colleagues hired or promoted due to flawed tests.
Conclusion and recommendations
The allure of fast, AI-generated assessments should not blind organizations to the fundamental importance of scientific rigor in testing. Unvalidated assessments create legal liabilities, reputational risks, and decision-making errors. Instead, organizations should:
- Use only validated assessments that meet professional standards (as outlined by the APA, BPS, and SIOP).
- Inform themselves about the minimum requirements of well-developed tests and gain insight into practices such as validation studies before deploying new assessments.
- Perform regular bias audits to ensure fairness and compliance.
- Engage industrial-organizational psychologists to oversee test selection and implementation.
- Maintain documentation and transparency to defend against legal challenges.
- Foster ethical use of AI: If your organization is keen on leveraging AI in HR and talent assessments, create an internal ethics guideline or policy for its use. Ensure it aligns with principles of fairness, accountability, and transparency. For example, set a policy that no AI-based hiring tool will be used without a bias audit and review by a qualified professional. Stay informed about evolving ethical recommendations from professional societies (such as EWOP and SIOP) and about legal requirements for AI. By embedding these values into your AI adoption strategy, you reduce the chance of ethical lapses. The goal is to benefit from AI’s efficiencies without outsourcing your ethical responsibilities.
By adhering to established best practices and professional guidelines, organizations can use psychometric assessments responsibly and ethically, improving workforce quality while minimizing risks.
Although using LLMs to create unvalidated assessments or assessment items or tasks is not recommended, AIs can help with related assignments, such as deriving development plans based on assessment results. In such tasks LLMs can save time by quickly collating test results and matching these to pre-determined development opportunities.
In this way, the LLM is used for a task that it is not only well suited to, but one that carries lower risk (provided an IO psychology expert checks the output) and is based on validated, well-designed assessment devices.
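As an illustrative sketch of this lower-risk workflow, the Python example below first matches validated competency scores to a pre-approved library of development suggestions; an LLM would only be asked to phrase the result as a narrative, with a qualified professional reviewing the final plan. The competency names, score bands, and development library are hypothetical.

```python
# Hypothetical library mapping (competency, score band) to pre-approved
# development suggestions signed off by assessment experts.
development_library = {
    ("planning_and_organising", "lower"): "Shadow a senior project lead; adopt a weekly planning template.",
    ("planning_and_organising", "higher"): "Mentor a junior colleague on project planning.",
    ("resilience", "lower"): "Structured coaching on coping strategies and workload management.",
    ("resilience", "higher"): "Take on a stretch assignment with scheduled check-ins.",
}

def band(percentile: float) -> str:
    """Convert a validated, normed score (here a percentile) to a coarse band."""
    return "lower" if percentile < 50 else "higher"

def draft_development_points(scores: dict) -> list:
    """Match each validated competency score to a pre-approved suggestion."""
    points = []
    for competency, percentile in scores.items():
        suggestion = development_library.get((competency, band(percentile)))
        if suggestion:
            points.append(f"{competency} ({band(percentile)}): {suggestion}")
    return points

# Example: percentile scores from a validated assessment battery.
candidate_scores = {"planning_and_organising": 35, "resilience": 72}
for point in draft_development_points(candidate_scores):
    print(point)
# An LLM could then be asked to phrase these points as a development narrative,
# with the final plan reviewed by a qualified IO psychology professional.
```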
For other suggestions on how to use LLMs both ethically and productively, you can read our IO Psychologist’s guide to ChatGPT by clicking here.
Let us know how we can help you enhance your assessment practices by reaching out to info@tts-talent.com.
References and suggested reading
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: AERA.
Biddle, D. A. (2017). Adverse impact and test validation: A practitioner’s guide to valid and defensible employment testing. Routledge.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74(3), 478-494.
Cohen, R. J., Swerdlik, M. E., & Sturman, E. D. (2017). Psychological testing and assessment: An introduction to tests and measurement. McGraw-Hill Education.
DeVellis, R. F. (2016). Scale development: Theory and applications (4th ed.). Sage.
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated model and meta‐analysis. Personnel Psychology, 57(3), 639-683.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6), 1-35.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures. SIOP.
Society for Human Resource Management. (2017). Talent acquisition: Selection and assessment methods. SHRM.