Can an algorithm judge a future leader? A large-scale test of AI scoring in hiring simulations

Hiring the right leader is expensive work. Large companies routinely put candidates through assessment centers, which are daylong exercises where applicants respond to simulated business problems while trained psychologists watch, read, and rate their performance across competencies like strategic thinking, people management, and financial judgment. The method is considered one of the more accurate predictors of managerial success, but it is slow, pricey, and limited by the stamina and consistency of human raters.

A study published in the Consulting Psychology Journal asks whether an artificial intelligence system can step into that assessor’s chair. The research tests whether a machine learning model, trained on millions of words of written responses from real job applicants, can score those responses in a way that matches human experts and predicts career outcomes.

The question behind the research

Pieter Viljoen Bronkhorst, an industrial psychologist at the University of the Western Cape who also works at the assessment firm Evalex Talent Solutions, together with colleague Jurgen Becker, wanted to push past a gap in the existing literature. Prior studies had shown that AI models could approximate human ratings on a small, predefined set of competencies (say, four or five), usually after being trained to look for the same specific behaviors humans were told to look for. What nobody had really tested was whether an AI could work the problem from the ground up, discovering on its own what competencies matter when leaders tackle business challenges, and then scoring candidates against that homegrown framework.

The authors also wanted to know something more practical: does the AI actually predict anything useful about a person’s career? And does it avoid a well-known flaw in human raters, who tend to bunch their scores around the middle of any scale rather than use the full range?

Building a machine that reads leadership

The project started with a very large pile of text. The researchers pulled the written responses of 15,411 job applicants who had completed an online assessment center between 2012 and 2021. These applicants had been competing for leadership roles across 283 companies in 38 countries. On average, each person wrote about 2,200 words responding to roughly 20 simulated business scenarios, covering operations, customer issues, finances, projects, and people problems. Altogether, the corpus came to roughly 33 million words.

The team used a natural language processing technique called topic modeling to let the algorithm sift through the text and identify clusters of words and ideas that tended to appear together. This produced 38 distinct themes. Industrial-organizational psychologists then reviewed those clusters and attached meaning to each one, identifying them as competencies such as “strategic thinking,” “initiating action,” or “understanding emotions.” The 38 competencies were organized into 13 broader constructs and four domains (thought, problem solving, delivery, and people).

From there, the team moved to a supervised machine learning phase, in which human experts labeled examples of each competency so the system could learn to spot them in future responses. The model reached 0.92 on precision and 0.91 on recall, meaning that about 92% of the behavioral signals it flagged were correct, and that it caught about 91% of all the relevant signals present in the text.
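For readers unfamiliar with the two metrics, here is a minimal sketch of how precision and recall are computed in general. It is not the study's code: the `precision_recall` helper and the toy labels are hypothetical and exist only to illustrate the definitions.

```python
def precision_recall(predicted, actual):
    """Compare two sets of (response_id, competency) labels.

    precision = share of the model's labels that the humans also assigned
    recall    = share of the humans' labels that the model also found
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the study.
human_labels = {
    (1, "strategic thinking"),
    (1, "initiating action"),
    (2, "understanding emotions"),
    (3, "strategic thinking"),
}
model_labels = {
    (1, "strategic thinking"),
    (2, "understanding emotions"),
    (3, "strategic thinking"),
    (3, "initiating action"),
}

precision, recall = precision_recall(model_labels, human_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the model assigns one label the humans did not (which lowers precision) and misses one label the humans did assign (which lowers recall).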

Three tests of the virtual assessor

With the model trained, the researchers ran it through three separate checks.

In the first test, a fresh sample of 1,168 job applicants from 52 organizations was scored both by the AI and by three experienced human assessors (each with a graduate degree in industrial-organizational psychology and more than 30 years of experience). The overall correlation between human and AI scores was 0.63. That number reflects meaningful agreement without being so high that the two could be used interchangeably.
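As a rough illustration of what a human-AI correlation measures, the sketch below computes a Pearson correlation coefficient between two lists of scores. The article does not say which correlation statistic the study used, so Pearson's r is an assumption here, and the `pearson_r` helper and the scores are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical scores for six candidates, not taken from the study.
human_scores = [3, 3, 4, 2, 3, 4]
ai_scores = [2, 3, 5, 1, 4, 4]

r = pearson_r(human_scores, ai_scores)
print(f"human-AI correlation: {r:.2f}")
```

A value of 1.0 would mean the two raters rank and space candidates identically, 0 would mean no linear relationship, and values in between indicate partial agreement.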

The second test added a real-world outcome to the picture. Eighty leaders from an international manufacturing company were assessed during a talent audit. Their results were compared not only to human ratings but also to an external yardstick the researchers called “career velocity,” a measure that combined a person’s current job grade with their age. Someone holding a senior role at a young age had high career velocity; someone in a junior role at an older age had low career velocity. Here, agreement between the AI and human raters rose to 0.71. The AI’s overall scores correlated 0.51 with career velocity, matching the human assessors’ 0.51 correlation.

The third test repeated this setup with 70 managers at a South African bank, where reliable job-grading data were available. Human-AI agreement reached 0.73. The AI’s correlation with career velocity was 0.54, slightly higher than the human assessors’ 0.47.

The authors present these results as evidence that the AI can score written responses about as well as seasoned human experts, and can predict a career-related outcome at a similar level.

The scale-usage difference

One of the more pointed findings concerned how the two kinds of raters used the scoring scale. Human raters in assessment centers are known to gravitate toward the middle of any scale, a pattern called central tendency. They avoid extreme ratings, partly out of caution and partly to shield themselves from challenges by unhappy applicants.

When the researchers compared the distributions of scores, the AI’s scores showed a larger standard deviation across most exercises, meaning it spread candidates out over a wider range. The authors interpret this as evidence that the algorithm, lacking the emotional hesitations of human reviewers, was willing to use the full scale. They suggest this wider spread may partly explain why the AI’s correlation with career velocity was slightly stronger than the humans’ in one sample.
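To make the spread comparison concrete, here is a small sketch that contrasts the standard deviation of two sets of ratings. The 1-to-5 scale and both score lists are hypothetical, chosen only to show what central tendency looks like in numbers; they are not the study's data.

```python
import statistics

# Hypothetical ratings on an assumed 1-to-5 scale for the same eight candidates.
human_scores = [3, 3, 3, 2, 4, 3, 3, 4]  # bunched around the midpoint
ai_scores = [1, 3, 5, 2, 4, 2, 5, 4]     # uses the full scale

# Sample standard deviation: larger values mean scores are more spread out.
human_spread = statistics.stdev(human_scores)
ai_spread = statistics.stdev(ai_scores)

print(f"human spread: {human_spread:.2f}")
print(f"AI spread:    {ai_spread:.2f}")
```

When most ratings sit near the midpoint, candidates become harder to tell apart, which is one reason a rater with a wider spread might track an outside measure more closely.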

What this could mean for hiring

If the findings hold up in larger samples, the practical appeal is straightforward. Assessment centers are currently so expensive that most companies reserve them for middle and senior management. An AI scoring layer could lower the cost of running the exercises, speed up feedback, and extend the tool to a wider range of roles. The authors point out that the system could augment human assessors rather than replace them, handling the text-based exercises while humans focus on interactive formats like role plays, where nonverbal cues matter.

Caveats the authors raise

The research comes with a stack of qualifications that the authors themselves flag. The two criterion-validity samples were small (80 and 70 leaders). Career velocity, while a reasonable proxy for success, is not the same as direct measures of job performance, team performance, or counterproductive work behavior. Future studies, the authors note, should use stronger performance metrics.

There is also the question of bias. If the training data reflect historical patterns of discrimination, the algorithm can absorb and amplify those patterns. The authors acknowledge this risk and point to ongoing work on algorithmic fairness as an area that needs more investigation. They also raise a related concern about transparency: as deep learning models become more sophisticated, it can become hard to explain exactly why a given candidate received a given score, which could become a serious problem in legal disputes over hiring decisions.

Finally, the authors note that their AI was trained on written responses. Assessment centers also rely on role plays, group discussions, and presentations, where tone, body language, and real-time interaction carry information. Whether current AI systems can interpret those modalities as well as a skilled human observer remains, in the authors’ view, an open question.
