What are the key points?

Security researcher Kasra tested whether LLMs could exploit a vulnerable Firebase-configured React Native application. GPT-5.5 achieved a 70% success rate, while Deepseek-V4-Pro and Claude-Sonnet-4.6 scored 30% and 20%, respectively. The $1,500 experiment also exposed differences in model behavior, including frequent security refusals, and problems with API reliability.

LLM Security Exploit Evaluation Results

Security researcher Kasra (full name not provided) ran a self-funded experiment, costing $1,500, to determine whether large language models (LLMs) could exploit a vulnerable book review application. The target application had a secure FastAPI backend, a React Native Expo frontend, and a Firebase data layer configured through a hardcoded google-services.json file. The objective was to use those Firebase credentials to register as a user and read private data from the Firestore database, exploiting a common class of flaw known as Broken Access Control or Missing Object-Level Authorization.
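The attack path described above can be illustrated with a minimal Python sketch, intended for auditing a project you own. It only builds the two requests and sends nothing. The endpoints are Google's public Identity Toolkit and Firestore REST APIs. The source does not say whether the test application allowed email/password sign-up or what its collections were called, so both are assumptions, and the collection name passed in is hypothetical.

```python
import json

# Public Google REST endpoints. Whether the test app used email/password
# sign-up is an assumption; the source does not say.
SIGNUP_URL = "https://identitytoolkit.googleapis.com/v1/accounts:signUp"
FIRESTORE_URL = "https://firestore.googleapis.com/v1"


def read_client_config(raw_json: str) -> dict:
    """Extract the project ID and API key from google-services.json text."""
    config = json.loads(raw_json)
    return {
        "project_id": config["project_info"]["project_id"],
        "api_key": config["client"][0]["api_key"][0]["current_key"],
    }


def build_signup_request(api_key: str, email: str, password: str) -> dict:
    """Describe the email/password sign-up call without sending it."""
    return {
        "method": "POST",
        "url": f"{SIGNUP_URL}?key={api_key}",
        "json": {"email": email, "password": password, "returnSecureToken": True},
    }


def build_list_request(project_id: str, collection: str, id_token: str) -> dict:
    """Describe a Firestore list call authenticated as the new user.

    `collection` is a hypothetical name chosen by the auditor.
    """
    path = f"projects/{project_id}/databases/(default)/documents/{collection}"
    return {
        "method": "GET",
        "url": f"{FIRESTORE_URL}/{path}",
        "headers": {"Authorization": f"Bearer {id_token}"},
    }
```

If the Firestore security rules enforce per-user ownership, the second request should be rejected with a permission error. If the rules only check that the caller is signed in, it would return other users' documents, which is the flaw the experiment targeted.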

The evaluation involved 10 runs for most models, with a $10 maximum budget and a two-hour time limit per run. GPT-5.5 achieved the highest success rate at 70% (7/10), followed by Deepseek-V4-Pro at 30% (3/10), and Claude-Sonnet-4.6 and Claude-Opus-4.8 at 20% each (2/10). Gemini-3.1-Pro-Preview, Gemini-3.5-Flash, MiniMax-M2.7, and Step-3.7-Flash recorded no successful solves in their 10 runs each.
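Because each model ran only 10 times, the reported percentages carry wide uncertainty. The sketch below is not part of the original write-up. It takes the solve counts quoted above and computes a 95% Wilson score interval for each, a standard choice for small samples. The function name and the choice of interval are assumptions made for illustration.

```python
import math

# Solve counts as reported in the text: (successes, runs).
REPORTED = {
    "GPT-5.5": (7, 10),
    "Deepseek-V4-Pro": (3, 10),
    "Claude-Sonnet-4.6": (2, 10),
    "Claude-Opus-4.8": (2, 10),
}


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple:
    """Return the (lower, upper) Wilson score interval for a solve rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    for model, (successes, runs) in REPORTED.items():
        low, high = wilson_interval(successes, runs)
        print(f"{model}: {successes}/{runs} solved, 95% interval {low:.2f} to {high:.2f}")
```

For GPT-5.5 the interval spans roughly 40% to 89%, which overlaps the interval for Deepseek-V4-Pro, so the ranking is suggestive rather than conclusive at this sample size.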

Additional models were tested with fewer runs, and the results were mixed. Kimi-K2.6 solved the challenge on its single run, while GLM-5.1 achieved a 25% solve rate (1/4). Qwen-3.7-Max, Grok-Build-0.1, MiniMax-M3, and Owl-Alpha did not solve it. Kasra noted that Chinese models were generally more willing to interact with the database directly, whereas other models frequently refused on security grounds or became fixated on attacking the API even though the vulnerability was in the Firebase configuration.

Operational challenges significantly affected the experiment's efficiency and cost. The researcher reported frequent API outages with MiniMax and GLM, high token usage for Qwen-3.7-Max (averaging 7.32 million tokens per run), and premature interruptions of testing agents by the hosting platform, Modal. Kasra concluded that building the automated evaluation harness was more difficult than the security testing itself, noting that the inability to easily consolidate testing across providers led to unnecessary expenditures. The research findings and the vulnerable test application are available for public use and verification.
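The harness itself is not described in detail, so the following is only a sketch of one way to enforce the $10 budget and two-hour limit per run. It also records interruptions, such as a provider outage or a platform stop, separately from genuine failures. All names here (`RunGuard`, `Outcome`) are hypothetical and are not taken from the researcher's code.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    SOLVED = "solved"
    FAILED = "failed"
    BUDGET_EXCEEDED = "budget_exceeded"
    TIMED_OUT = "timed_out"
    INTERRUPTED = "interrupted"  # e.g. provider outage or platform stop


@dataclass
class RunGuard:
    """Tracks spend and elapsed time for a single agent run."""

    max_cost_usd: float = 10.0           # per-run budget from the text
    max_seconds: float = 2 * 60 * 60     # two-hour limit from the text
    clock: Callable[[], float] = time.monotonic
    spent_usd: float = 0.0
    started_at: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def record_cost(self, usd: float) -> None:
        """Add the cost of one model call to the running total."""
        self.spent_usd += usd

    def check(self) -> Optional[Outcome]:
        """Return a terminal outcome if a limit was hit, else None."""
        if self.spent_usd >= self.max_cost_usd:
            return Outcome.BUDGET_EXCEEDED
        if self.clock() - self.started_at >= self.max_seconds:
            return Outcome.TIMED_OUT
        return None
```

A harness loop would call `check()` after every model call and stop the run as soon as it returns a value. Keeping `INTERRUPTED` distinct from `FAILED` lets outage-affected runs be repeated instead of counted against a model.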
