Author : Arkin Dharawat

Expert Speak Digital Frontiers
Published on Jan 30, 2024

While LLMs possess the potential for immense value creation, realising this potential hinges on prioritising safety as much as innovation

Vulnerabilities in Large Language Models

This essay is part of the series: AI F4: Facts, Fiction, Fears and Fantasies.


Although first introduced in 2018, Large Language Model (LLM)— a deep learning algorithm that can perform a variety of natural language processing (NLP) tasks—have experienced growing popularity in recent years with the release of ChatsGPT by OpenAI. Several tech companies like Facebook and Google are releasing their own LLMs (LLama & PaLM respectively), which are being integrated into various applications. With this newfound ubiquitousness, there has been a rise in the fictionalisation of these models. They are often depicted as all-knowing entities, capable of vast intelligence with plans of taking over the world. While it is true that LLMs have their fair share of dangers that need to be addressed, these are far from their exaggerated portrayal. The following piece aims to show vulnerabilities that plague LLMs. Although not an exhaustive list, the goal here is to cover ones that are likely to directly affect users and also discuss the current ways in which they are being addressed.

Researchers are working on several solutions, including introducing a supervisor/moderator model that will detect attacks beyond the mere filtering of harmful outputs.

Prompt Injection: All LLMs have an underlying prompt that gives them instructions on the task and the output to produce. Prompt Injection attacks involve an attacker changing the initial prompt to have the LLM behave maliciously. Prompt injections can be done directly i.e.: the user modifies the prompt themselves, or indirectly where an attacker modifies the text in a webpage or file that the LLM takes as its input. The results of this can range from completely harmless, e.g. having the LLM pretend to be a pirate, or extremely harmful such as spreading fake news, gathering personal or financial information, and generating convincing phishing messages. Researchers are working on several solutions, including introducing a supervisor/moderator model that will detect attacks beyond the mere filtering of harmful outputs. However, many of these techniques are still being investigated and developed, which makes it important for users to report any LLM outputs that seem outside expectations.

Information leakage: These models are trained on extremely large datasets such as Common Crawl, which range from hundreds of terabytes and are essentially a text dump of all the web pages on the internet. Several web pages can include sensitive information such as social security numbers, bank accounts, phone numbers, etc. Due to the generative nature of LLMs, they can leak proprietary, copyrighted, or personally identifiable information(PII). This issue is further compounded by the fact that LLMs are trained on conversations where a user might have shared a variety of sensitive data. A recent example of this is when Samsung employees shared the source code of software responsible for measuring semiconductor equipment with ChatGPT. To address this issue companies are taking steps to filter and sanitise the data they use and limiting the usage of external unverified data.

The malicious prompt instructs ChatGPT to retrieve the user's email, summarise and URL encode it, and send the data to the attacker-controlled URL through the browsing plugin.

Agency and overreliance: LLMs are seldom used in a stand-alone way, it is quite possible that they are being used as a part of a larger pipeline. The output generated can be either directly consumed by a human e.g.: summarisation of a long document or by another application e.g: an LLM generating code later executed on other machines. However, giving LLMs more capabilities and relying too much on their generated outputs can lead to even the most benign and badly written prompts causing serious harm. It was shown that it is possible to create a plugin, that can be used to send a user’s PII to the attacker. The exploit involves an attacker hosting malicious instructions on a website. When accessed by a victim using ChatGPT through a browsing plugin(WebPilot), it allows the attacker to take control. The malicious prompt instructs ChatGPT to retrieve the user's email, summarise and URL encode it, and send the data to the attacker-controlled URL through the browsing plugin. To combat this vulnerability, developers should limit the plugins/tools that LLM agents are allowed to use and avoid tools that provide open-ended functions such as sending an email. Additionally, a user-in-the-loop process can be used to approve all actions before the LLM executes them.

Hallucinations: LLMs are known to suffer hallucinations which can include inappropriate, incorrect, or even unsafe information. No text generated by these models should be blindly trusted since otherwise, these can lead to myriad issues including legal ones. Such is the case of lawyers of Levidow, Levidow & Oberman who used ChatGPT to search for cases supporting their client’s aviation injury claim. The generated cases weren’t real, misidentified judges, or involved airlines that didn’t exist and as a consequence of this, the firm was fined $5,000 by a federal judge. Researchers are creating more robust watermarking methods to make it easier to detect text generated by LLMs, it is still up to the user to not overly rely too much on the generated outputs and to treat them carefully when sending them to another downstream application.

The generated cases weren’t real, misidentified judges, or involved airlines that didn’t exist and as a consequence of this, the firm was fined $5,000 by a federal judge.

Some interesting but important vulnerabilities not discussed include model theft, where an attacker can “steal” the model weights, or model denial of service, where an attacker can overwhelm the model with bogus inputs to prevent real users from accessing it. In fact, an example of leaking of model weights happened with Meta’s LLama due to the developer forgetting to remove a torrent link, however, the weights were not misused and were instead used to create a smaller open-sourced model, called Alpaca.

The current landscape of LLMs mirrors the early days of the Internet, where swift advancements took precedence over establishing safety measures. This prioritisation of innovation without adequate security protocols birthed enduring issues like malware and trojans that persist today. While LLMs possess the potential for immense value creation, realising this potential hinges on prioritising safety as much as innovation. It is evident, as highlighted in the preceding paragraph, that researchers and developers are actively engaged in devising various measures to mitigate the vulnerabilities in these models. However, achieving genuine safety for LLMs necessitates an ongoing, concerted effort involving not just researchers and developers but also governments, corporations, and users. This collaborative endeavour, spanning multiple stakeholders, is imperative to fortify the safety framework surrounding LLMs and ensure their responsible and beneficial deployment.


Arkin Dharawat is an ML Engineer at Tiktok.

The views expressed above belong to the author(s). ORF research and analyses now available on Telegram! Click here to access our curated content — blogs, longforms and interviews.