A scientific article by the lecturer (Dr. Karim Abis Handoul) entitled "The Continued Failure of Artificial Intelligence to Match Humans in Programming"

13/04/2025

The use of AI models from companies like OpenAI and Anthropic to assist in programming tasks is increasing. However, a new study by Microsoft has revealed a shocking fact about this trend.

The new study, conducted by Microsoft Research, the research and development division of Microsoft, found that even some of today’s best AI models still struggle to correct coding errors that pose no challenge to experienced developers.

The study concluded that models including “Claude 3.7 Sonnet” from Anthropic and “o3-mini” from OpenAI fail to fix many issues in the software development benchmark known as “SWE-bench Lite,” according to a report by TechCrunch, reviewed by Al Arabiya Business.

The findings are a stark reminder that despite bold claims from companies like OpenAI, AI still doesn’t match human experts in fields like programming.

The researchers tested nine different models as the core of an AI agent that had access to various debugging tools, including a Python debugger. They tasked this agent with solving a selected set of 300 debugging tasks from SWE-bench Lite.

According to the researchers, even when using the most advanced models, the agent rarely completed more than half of the tasks successfully.

The highest average success rate was achieved by the “Claude 3.7 Sonnet” model, with 48.4%, followed by OpenAI’s “o1” model at 30.2% and “o3-mini” at 22.1%.

Why the disappointing performance?

Some models struggled to use the debugging tools provided and to understand how different tools help address different issues.

But the biggest problem, according to the researchers, is the lack of data.
They believe there isn’t enough data representing “sequential decision-making processes” — i.e., tracking how humans debug code — in the current training datasets.

The researchers suggested that training or fine-tuning the models with specialized data could make them more effective debugging tools, but this would require targeted training data.

It’s worth noting that these results aren’t entirely surprising. Many studies have shown that AI-generated code tends to contain security vulnerabilities and errors due to the AI's weakness in understanding programming logic.

A recent evaluation of the popular AI programming tool “Devin” found that it could only complete 3 out of 20 programming tests.