Experiments show AI could help to audit smart contracts, but not yet

Artificial intelligence has proven effective at identifying security vulnerabilities, but early tests indicate it won’t be able to replace humans for a while.

While artificial intelligence (AI) has already transformed a myriad of industries, from healthcare and automotive to marketing and finance, its potential is now being put to the test in one of the blockchain industry’s most crucial areas — smart contract security.

Numerous tests have shown great potential for AI-based blockchain audits, but this nascent tech still lacks some important qualities inherent to human professionals — intuition, nuanced judgment and subject expertise.

My own organization, OpenZeppelin, recently conducted a series of experiments highlighting the value of AI in detecting vulnerabilities. This was done using OpenAI’s latest GPT-4 model to identify security issues in Solidity smart contracts. The code being tested comes from the Ethernaut smart contract hacking web game — designed to help auditors learn how to look for exploits. During the experiments, GPT-4 successfully identified vulnerabilities in 20 out of 28 challenges.

In some cases, simply providing the code and asking if the contract contained a vulnerability would produce accurate results, such as with the following naming issue with the constructor function:

*ChatGPT analyzes a smart contract. Source: OpenZeppelin*

At other times, the results were more mixed or outright poor. Sometimes the AI would need to be prompted with the correct response by providing a somewhat leading question, such as, “Can you change the library address in the previous contract?” At its worst, GPT-4 would fail to come up with a vulnerability, even when things were pretty clearly spelled out, as in, “Gate one and Gate two can be passed if you call the function from inside a constructor, how can you enter the GatekeeperTwo smart contract now?” At one point, the AI even invented a vulnerability that wasn’t actually present.

This highlights the current limitations of this technology. Still, GPT-4 has made notable strides over its predecessor, GPT-3.5, the large language model (LLM) utilized within OpenAI’s initial launch of ChatGPT. In December 2022, experiments with ChatGPT showed that the model could only successfully solve five out of 26 levels. Both GPT-4 and GPT-3.5 were trained on data up until September 2021 using reinforcement learning from human feedback, a technique that involves a human feedback loop to enhance a language model during training.

Coinbase carried out similar experiments, yielding a comparative result. This experiment leveraged ChatGPT to review token security. While the AI was able to mirror manual reviews for a big chunk of smart contracts, it had a hard time providing results for others. Additionally, Coinbase also cited a few instances of ChatGPT labeling high-risk assets as low-risk ones.

It’s important to note that ChatGPT and GPT-4 are LLMs developed for natural language processing, human-like conversations and text generation rather than vulnerability detection. With enough examples of smart contract vulnerabilities, it’s possible for an LLM to acquire the knowledge and patterns necessary to recognize vulnerabilities.

If we want more targeted and reliable solutions for vulnerability detection, however, a machine learning model trained exclusively on high-quality vulnerability data sets would most likely produce superior results. Training data and models customized for specific objectives lead to faster improvements and more accurate results.

For example, the AI team at OpenZeppelin recently built a custom machine learning model to detect reentrancy attacks — a common form of exploit that can occur when smart contracts make external calls to other contracts. Early evaluation results show superior performance compared to industry-leading security tools, with a false positive rate below 1%.

Striking a balance of AI and human expertise

Experiments so far show that while current AI models can be a helpful tool to identify security vulnerabilities, it is unlikely to replace the human security professionals’ nuanced judgment and subject expertise. GPT-4 mainly draws on publicly available data up until 2021 and thus cannot identify complex or unique vulnerabilities beyond the scope of its training data. Given the rapid evolution of blockchain, it’s critical for developers to continue learning about the latest advancements and potential vulnerabilities within the industry.

Looking ahead, the future of smart contract security will likely involve collaboration between human expertise and constantly improving AI tools. The most effective defense against AI-armed cybercriminals will be using AI to identify the most common and well-known vulnerabilities while human experts keep up with the latest advances and update AI solutions accordingly. Beyond the cybersecurity realm, the combined efforts of AI and blockchain will have many more positive and groundbreaking solutions.

AI alone won’t replace humans. However, human auditors who learn to leverage AI tools will be much more effective than auditors turning a blind eye to this emerging technology.

Mariko Wakabayashi is the machine learning lead at OpenZeppelin. She is responsible for applied AI/ML and data initiatives at OpenZeppelin and the Forta Network. Mariko created Forta Network’’s public API and led data-sharing and open-source projects. Her AI system at Forta has detected over $300 million in blockchain hacks in real time before they occurred.

This article is for general information purposes and is not intended to be and should not be taken as legal or investment advice. The views, thoughts and opinions expressed here are the author’s alone and do not necessarily reflect or represent the views and opinions of Cointelegraph.