Most retail investors know the feeling: there’s a flood of earnings reports, analyst opinions, macroeconomic updates, and breaking news, and somewhere in all that noise is the information that might actually move a stock. Professional investment teams have the staff and software to sort through it. Everyone else mostly relies on index funds and hopes for the best.
A team of researchers has been testing whether a large language model can close some of that gap. In a paper published in Neural Computing and Applications, they introduce MarketSenseAI, a framework built around OpenAI’s GPT-4 that reads news, financial statements, price data, and macroeconomic reports, and then issues buy, hold, or sell signals for individual stocks. Over a 15-month test on the S&P 100, portfolios built from those signals returned as much as 72%, outperforming the index benchmark by a wide margin.
The question behind the project
The research was led by George Fatouros of the University of Piraeus and Alpha Tensor Technologies, working with Kostas Metaxas (Alpha Tensor), John Soldatos (Innov-Acts), and Dimosthenis Kyriazis (University of Piraeus). Their starting point was a practical one: existing AI tools in finance tend to specialize. Some models score the sentiment of news headlines. Others forecast prices from historical data. Few try to replicate the broader reasoning process that a human analyst uses when weighing a company’s fundamentals against its news flow, its peers’ performance, and the wider economic backdrop.
The authors wanted to see whether a general-purpose language model, prompted in the right way, could combine all of those inputs and produce investment recommendations that are both profitable and explainable. Explainability matters here because a black-box signal is hard to trust; if the model can articulate why it’s recommending a stock, users can judge whether the reasoning holds up.
How the system is built
MarketSenseAI is organized as five connected components, each handling a different slice of information. A news summarizer pulls daily articles about a given stock and condenses them, then rolls those daily summaries into a running monthly narrative that keeps older but still-relevant stories (like a pending merger or lawsuit) in view. A fundamentals summarizer processes quarterly financial statements, comparing recent quarters to flag changes in profitability, revenue, debt, and cash flow.
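The rolling daily-to-monthly summarization loop can be sketched as below. This is an illustrative assumption about the structure, not the authors' implementation: `summarize` is a placeholder for an LLM summarization call (here it just truncates), and the folding of each day's summary into a re-summarized running narrative is what keeps older stories in view.

```python
def summarize(text, max_words=40):
    """Placeholder for an LLM summarization call: keep the first max_words words."""
    return " ".join(text.split()[:max_words])

def monthly_narrative(daily_articles):
    """Fold each day's article summary into a running monthly narrative."""
    running = ""
    for day, articles in daily_articles:
        daily = summarize(" ".join(articles))
        # Re-summarize old narrative + today's summary so still-relevant
        # older items (a pending merger, a lawsuit) survive the compression.
        running = summarize(running + " " + daily, max_words=120)
    return running.strip()

days = [
    ("2023-06-01", ["Merger talks with rival reported.", "Q2 guidance raised."]),
    ("2023-06-02", ["Regulator opens review of the merger."]),
]
print(monthly_narrative(days))
```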
A price dynamics component compares the target stock’s returns, volatility, Sharpe Ratio (a measure of return relative to risk), and maximum drawdown against its five most similar peers and the S&P 500. Peer similarity is determined by encoding each company’s business description with a language model and measuring how close the resulting embeddings are. A macroeconomic component, called MarketDigest, reads published research and outlook reports from institutions like Goldman Sachs, Morgan Stanley, UBS, and BlackRock, then synthesizes them into a consensus view of the economic environment.
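A minimal sketch of the two pieces of this component: ranking peers by cosine similarity of (here, randomly generated stand-in) business-description embeddings, and computing the annualized Sharpe ratio and maximum drawdown from daily returns. The tickers, embedding size, and return series are all made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sharpe_ratio(returns, rf=0.0, periods=252):
    """Annualized Sharpe ratio from a series of daily returns."""
    excess = returns - rf / periods
    return float(np.sqrt(periods) * excess.mean() / excess.std(ddof=1))

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative return curve (<= 0)."""
    wealth = np.cumprod(1 + returns)
    peak = np.maximum.accumulate(wealth)
    return float(((wealth - peak) / peak).min())

# Toy example: pick the 5 most similar peers by embedding similarity.
rng = np.random.default_rng(0)
target = rng.normal(size=8)  # stand-in for the target's description embedding
peers = {name: rng.normal(size=8)
         for name in ["AAPL", "MSFT", "GOOG", "NVDA", "AMZN", "META"]}
top5 = sorted(peers, key=lambda n: cosine_similarity(target, peers[n]),
              reverse=True)[:5]

daily = rng.normal(0.0005, 0.01, size=252)  # one year of synthetic daily returns
print(top5, round(sharpe_ratio(daily), 2), round(max_drawdown(daily), 3))
```

In the real pipeline the comparison metrics would be computed for the target, each peer, and the S&P 500, and the resulting table summarized in text for the signal-generation step.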
The fifth component ties everything together. It feeds all four summaries into GPT-4, prompts the model to reason step by step in the style of a financial analyst, and asks for a buy, hold, or sell recommendation along with an explanation. The authors use what’s known as “chain of thought” prompting, which asks the model to work through its reasoning in stages rather than jumping straight to an answer.
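The assembly step might look like the sketch below. The prompt wording, the `build_signal_prompt` helper, and the example summaries are all assumptions for illustration, not the authors' actual prompt; the point is the shape: four summaries plus an instruction to reason step by step before emitting a signal.

```python
def build_signal_prompt(ticker, news, fundamentals, price_dynamics, macro):
    """Assemble the four component summaries into one chain-of-thought prompt."""
    return (
        f"You are a financial analyst evaluating {ticker}.\n\n"
        f"News summary:\n{news}\n\n"
        f"Fundamentals summary:\n{fundamentals}\n\n"
        f"Price dynamics vs. peers and S&P 500:\n{price_dynamics}\n\n"
        f"Macroeconomic outlook:\n{macro}\n\n"
        "Reason step by step through each input before deciding, then answer "
        "with exactly one of: buy, hold, sell, followed by your explanation."
    )

prompt = build_signal_prompt(
    "NVDA",
    news="Strong data-center demand reported this month.",
    fundamentals="Revenue and margins up quarter over quarter.",
    price_dynamics="Outperformed all five peers; higher Sharpe ratio.",
    macro="Consensus view: soft landing, rates near peak.",
)
print(prompt)
```

The resulting string would be sent to the GPT-4 API; the response is then parsed into the signal label and its explanation.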
The test
The researchers ran the system on the 100 stocks in the S&P 100 index from December 2022 through March 2024, generating fresh signals at the end of each month. The data pipeline consumed 163,483 news articles, 612 quarterly earnings reports, daily price histories, and 187 investment reports from major banks. In total, MarketSenseAI produced 1,500 monthly signals: 338 buys, 1,150 holds, and 12 sells.
To evaluate the signals, the authors built several test portfolios. Some were equally weighted across every buy recommendation; others were weighted by market capitalization or limited to the top 10 picks by Sharpe Ratio. They also tried something different: feeding the buy-signal explanations back into GPT-4 and asking it to rank them on a 0-to-10 scale, then building portfolios from the top-scoring names. All results were calculated after transaction costs.
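The two weighting schemes for the ranked-explanation portfolios can be sketched as follows. The scores here are hard-coded for illustration (in the paper GPT-4 assigns them on a 0-to-10 scale), and the tickers and market caps are made up.

```python
def top_k_portfolio(scores, market_caps=None, k=10):
    """Return {ticker: weight} for the k highest-scored buy signals.

    Equal weight when market_caps is None; otherwise weight by capitalization.
    """
    picks = sorted(scores, key=scores.get, reverse=True)[:k]
    if market_caps is None:
        return {t: 1.0 / len(picks) for t in picks}
    total = sum(market_caps[t] for t in picks)
    return {t: market_caps[t] / total for t in picks}

# Hypothetical explanation scores (0-10) and market caps in $bn.
scores = {"NVDA": 9, "MSFT": 8, "AMZN": 8, "XOM": 5, "KO": 4, "T": 2}
caps = {"NVDA": 2200, "MSFT": 3100, "AMZN": 1900, "XOM": 450, "KO": 260, "T": 120}
print(top_k_portfolio(scores, k=3))        # equal-weight top 3
print(top_k_portfolio(scores, caps, k=3))  # cap-weight top 3
```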
What the numbers showed
An equally weighted portfolio following every buy signal returned roughly 35% over the 15-month period, compared to about 25% for an equally weighted S&P 100 benchmark. A capitalization-weighted version of the same strategy returned 66%, versus 43% for the S&P 100 ETF.
The strategies that used GPT-4 to rank the buy-signal explanations performed best. A portfolio of the top 10 ranked stocks returned 49%, with a win rate of 74% and a maximum drawdown of about 8%. A capitalization-weighted version of the top-ranked picks returned roughly 73%. Strategies built from high-ranked explanations consistently outperformed those built from low-ranked ones, which the authors interpret as evidence that the quality of the model’s reasoning, not just its classification, carries useful information.
To check whether these results were better than luck, the researchers generated 10,000 randomized portfolios of the same S&P 100 stocks over the same period. MarketSenseAI’s returns landed in roughly the 99th percentile of that distribution. The finding held even after “detrending” returns to remove the general upward drift of the market, which suggests the signals were picking up stock-specific information rather than just riding a rising tide.
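The randomized-portfolio check amounts to a bootstrap: draw many random subsets of the same universe, compute each subset's period return, and see where the strategy's return falls in that distribution. The sketch below uses a synthetic universe (returns drawn from a normal distribution) purely to show the mechanics; the paper used the actual S&P 100 stocks and 10,000 draws.

```python
import numpy as np

def random_portfolio_percentile(universe_returns, strategy_return,
                                picks_per_portfolio, n_draws=10_000, seed=0):
    """Percentile rank of strategy_return among equal-weight random portfolios."""
    rng = np.random.default_rng(seed)
    tickers = list(universe_returns)
    draws = np.array([
        np.mean([universe_returns[t]
                 for t in rng.choice(tickers, picks_per_portfolio, replace=False)])
        for _ in range(n_draws)
    ])
    return float((draws < strategy_return).mean() * 100)

# Synthetic universe: 100 stocks with period returns ~ N(25%, 15%).
universe = {f"S{i}": r
            for i, r in enumerate(np.random.default_rng(1).normal(0.25, 0.15, 100))}
pct = random_portfolio_percentile(universe, strategy_return=0.49,
                                  picks_per_portfolio=10)
print(f"strategy beats {pct:.1f}% of random portfolios")
```

Detrending, as described in the article, would subtract the universe's average drift from every return before running the same comparison.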
Which inputs mattered most
The authors also measured how closely the final signal explanations tracked each of the four input summaries. The news and price dynamics components showed the highest similarity to the signals, while fundamentals and macroeconomic summaries had less direct influence on monthly decisions. The authors interpret this as sensible given the test’s one-month horizon: fundamentals are updated quarterly and change slowly, and broad macro conditions affect all stocks roughly the same way, so news and recent price action carry more weight when deciding which specific stock to pick in a given month.
The model also appeared to reflect market narratives. Technology and AI-related stocks like Nvidia, Microsoft, and Amazon received both frequent buy signals and the highest-quality explanations, mirroring the sector enthusiasm that characterized much of the test period.
Caveats worth keeping in mind
The authors are upfront about the study’s limitations. Fifteen months is a short window, and it happened to be a generally favorable one for U.S. equities, especially large-cap tech. The results may not generalize to bear markets, other asset classes, or smaller, less-covered stocks where the underlying data is thinner. GPT-4’s training cutoff also means the model may have prior exposure to information about these well-known companies, which complicates claims about pure predictive ability.
There’s also a disclosure consideration: the authors note that MarketSenseAI is a commercial product of Alpha Tensor Technologies, where several of them work. And because the system relies on OpenAI’s API, anyone building on it inherits whatever biases, errors, or changes OpenAI introduces to the underlying model. The authors flag model fine-tuning and comparisons across different LLMs as areas for future work.
For readers thinking about what this means practically, the headline finding isn’t that a chatbot can replace an investment team. It’s that a language model, given structured access to the same kinds of inputs a human analyst uses, can produce reasoned, explainable stock recommendations that performed well in one backtest on one index over a particular stretch of time. Whether that holds up in other conditions is the next question.



