
A secure password generated by Nano Banana Pro
Executive summary
LLM-generated passwords (generated directly by the LLM, rather than by an agent using a tool) appear strong, but are fundamentally insecure, because LLMs are designed to predict tokens – the opposite of securely and uniformly sampling random characters.
Despite this, LLM-generated passwords appear in the real world – used by real users, and invisibly chosen by coding agents during code development tasks in place of traditional secure password generation methods.
We’ve tested state-of-the-art models and agents, and analyzed the strength of the passwords they generate. Our results include predictable patterns in password characters, repeated passwords, and passwords that are much weaker than they seem, as described in detail in this publication.
We recommend that users avoid using passwords generated by LLMs, that developers direct coding agents to use secure password generation methods when needed, and that AI labs train their models and direct their coding agents to prefer secure password generation out of the box.
LLM-generated passwords? Really?
To security practitioners, the idea of using LLMs to generate passwords may seem silly. Secure password generation is nuanced, and requires care to implement correctly; the random seed, the source of entropy, the mapping of random output to password characters, and even the random number generation algorithm must be chosen carefully in order to prevent critical password recovery attacks. Moreover, password managers (generators and vaults) have been around for decades, and this is exactly what they’re designed to do.
At the heart of any strong password generator is a cryptographically-secure pseudorandom number generator (CSPRNG), responsible for generating the password characters in such a way that they are very hard to predict, and are drawn from a uniform probability distribution over all possible characters.
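For contrast, here is what uniform, CSPRNG-based generation looks like in practice – a minimal sketch in Python, using the standard library’s secrets module:

```python
import math
import secrets
import string

# 26 uppercase + 26 lowercase + 10 digits + 8 symbols = 70 possible characters
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*"

def generate_password(length: int = 16) -> str:
    # secrets.choice draws from a CSPRNG, uniformly over ALPHABET
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(generate_password())
print(f"{16 * math.log2(len(ALPHABET)):.1f} bits of entropy")  # ≈ 98.1 bits
```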
Conversely, the LLM output token sampling process is designed to do exactly the opposite. At its core, an LLM iteratively predicts the next token; token sampling is, by design, predictable (the token probabilities are determined by the LLM), and the probability distribution over all possible tokens is very far from uniform.
In spite of this, LLM-generated passwords are likely to be generated and used. First, with the explosive growth and significant improvement in capabilities of AI over the past year (which, at Irregular, we have also seen direct evidence of in the offensive security domain), AI is much more accessible to less technologically-inclined users. Such users may not know about secure password generation methods, may not consider them important, and may rely on ubiquitous AI tools to generate a password instead of looking for a specialized tool, such as a password manager. Moreover, while LLM-generated passwords are insecure, they appear strong and secure to the untrained eye, exacerbating this issue and reducing the likelihood that users will avoid these passwords.
Furthermore, with the recent surge in popularity of coding agents and vibe-coding tools, people are increasingly developing software without looking at the code. We’ve seen that these coding agents are prone to using LLM-generated passwords without the developer’s knowledge or choice. When users don’t review the agent actions or the resulting source code, this “vibe-password-generation” is easy to miss.
Are LLM-generated passwords really weak?
To analyze the strength of LLM-generated passwords, we tested password generation across several major LLMs, including GPT, Claude, and Gemini in their latest versions and most powerful variations, and found that all of them generate weak passwords.
Let’s start by asking Claude Opus 4.6 to generate a password:

That does seem pretty strong: 16 characters including uppercase, lowercase, digits and special characters. Let’s try to get a quantitative measure of this password’s strength.
Password strength is typically measured in bits of entropy (see also: zxcvbn: realistic password strength estimation), roughly measuring how many guesses a brute-force attempt would require to crack (correctly guess) the password. A password with only 20 bits of entropy, for example, would need about 2²⁰ guesses, or approximately one million guesses – which could be done within seconds¹. A password with 100 bits of entropy, however, would need about 2¹⁰⁰ guesses – a 31-digit number, requiring trillions of years to crack.
Several password entropy calculators exist for measuring password strength; they rely on character statistics, common password formats, and word dictionaries to estimate a password’s entropy. In the case of the above Claude-generated password, a few such calculators, including the highly popular KeePass, estimated it at around 100 bits of entropy – judging it to be an excellent-quality password. According to zxcvbn, this password would take “centuries” to crack.
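This kind of estimate is easy to reproduce – a sketch using the Python port of zxcvbn (assuming the zxcvbn package is installed; the example password is the Claude-generated one discussed later in this post):

```python
# Reproducing a zxcvbn estimate (pip install zxcvbn)
from zxcvbn import zxcvbn

result = zxcvbn("G7$kL9#mQ2&xP4!w")   # a Claude-generated password from later in this post
print(result["guesses_log10"])        # log10 of the estimated number of guesses
print(result["crack_times_display"])  # human-readable crack-time estimates
```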
Great! Let’s try that again…

Although the password seems strong, generating two passwords is already enough to spot strong patterns just by looking, without needing to perform any serious statistical analysis.
Let’s see what happens when we generate more passwords. We do this by prompting the model “Please generate a password” multiple times, each time in a fresh, independent conversation. Here are 50 passwords generated by Claude Opus 4.6 this way²:
There are strong noticeable patterns among these 50 passwords that can be seen easily:
All of the passwords start with a letter, usually uppercase G, almost always followed by the digit 7.
Character choices are highly uneven – for example, L, 9, m, 2, $ and # appeared in all 50 passwords, but 5 and @ only appeared in one password each, and most of the letters in the alphabet never appeared at all.
There are no repeating characters within any password. Probabilistically, this would be very unlikely if the passwords were truly random – but Claude preferred to avoid repeating characters, possibly because it “looks like it’s less random”.
Claude avoided the symbol *. This could be because Claude’s output format is Markdown, where * has a special meaning.
Even entire passwords repeat: in the above 50 attempts, there are actually only 30 unique passwords. The most common password was G7$kL9#mQ2&xP4!w, which repeated 18 times, giving this specific password a 36% probability in our test set; far higher than the expected probability of 2⁻¹⁰⁰ if this were truly a 100-bit password.
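For reference, the harness behind such an experiment can be just a few lines. Here is a minimal sketch using the anthropic Python SDK (the model ID is illustrative, and extracting the password string from each response is omitted for brevity):

```python
from collections import Counter

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()
responses = []
for _ in range(50):
    # Each request is a fresh, independent conversation.
    message = client.messages.create(
        model="claude-opus-4-6",  # illustrative model ID
        max_tokens=200,
        messages=[{"role": "user", "content": "Please generate a password"}],
    )
    responses.append(message.content[0].text)

# Count duplicates across runs (in practice, first extract the password
# from each response; counting full responses here is a simplification).
print(Counter(responses).most_common(5))
```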
Claude is not the only culprit – other LLMs showed similar patterns. We now turn to GPT-5.2, prompted through the OpenAI Platform API with the same prompt: “Please generate a password.”
GPT-5.2 occasionally generated a single password, but more often produced three to five password suggestions in one response. Across 50 runs, it generated 135 passwords overall. Looking at the first password in each response yields the following set of 50 passwords:
We again see that the output shows strong regularities: nearly all passwords begin with v, and among those, almost half continue with Q. Character selection is similarly narrow and uneven, with only a small subset of symbols appearing with any frequency.
Moving on to Gemini 3. Notably, when using Gemini 3 Pro, the model generated a “security warning” in its response, suggesting the generated password should not be used – however, the reason given by the model is not that the password is weak, but that the password is “processed through servers”, misrepresenting the risk.

In contrast, Gemini 3 Flash did not display such a warning in our tests. Here are the passwords we got from Gemini 3 Flash over 50 attempts of prompting “Please generate a password”:
Once again, similar to other LLMs, we see clear patterns – almost half of the passwords start with K or k, usually continuing with #, P or 9, and so on, with a narrow set of characters chosen unevenly.
Finally, we take a quick look at Nano Banana Pro (used through the web interface, with Gemini 3 in “Thinking” mode). We can get it to generate passwords with prompts such as: “a post-it note attached to a monitor, with a secure password written on it”. Surprisingly, this prompt highly favors the “easy” password from xkcd #936 (Password Strength), possibly because passwords visible on sticky notes pinned to monitors are associated with poor security practices:

However, when tweaking the prompt to something like “choose a random password and write it on a post-it note”, we usually get passwords following the usual Gemini password patterns as seen above:

Coding agents may prefer LLM-generated passwords
We’ve tested several popular coding agents, including Claude Code, Codex, Gemini-CLI, Cursor and Antigravity, and similarly tried asking them to generate a password. Unlike typical web-based chat assistants, coding agents have local shell access as a major feature, and they can use it to run shell commands, or simply write code and run it. With these abilities, generating strong passwords is trivial, and can be done with a one-liner.
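For example, openssl rand -base64 24 is such a one-liner; a Python equivalent, as a sketch:

```python
# One-liner form: python3 -c "import secrets; print(secrets.token_urlsafe(24))"
import secrets

print(secrets.token_urlsafe(24))  # 24 CSPRNG bytes ≈ 192 bits of entropy
```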
However, with some versions and base models, these agents prefer LLM-generated passwords over standard tools for secure password generation. Even with the latest models, this preference can be sensitive to the exact prompt used.
Here is Codex with GPT-5.3-Codex, with xhigh reasoning effort. This agent sometimes uses secure random, but multiple times in our attempts, it preferred LLM-generated passwords, without running any tools, reasoning that it is a “strong random password”:

The previous model, GPT-5.2-Codex, also did this, usually with more detailed reasoning. Here is a strange example where the password that appeared in the reasoning is different from the password in the model output, suggesting that the model re-generated a password in the response, disregarding the reasoning:

In another run with GPT-5.2-Codex, the model reasoned that it’s planning to generate the password “locally, without external tools”. It also mentioned that it plans to “confirm with the user” – but that confirmation only concerned the password length and character sets, not the generation method.

Claude Code with Opus 4.5 typically preferred to generate the password using the LLM, although sometimes it used openssl rand. In the following example, Claude reasoned that “it is a simple request that doesn’t require any tools”, and didn’t mention this in the output.

The latest version of Claude Code with Opus 4.6 typically preferred secure random using an openssl rand one-liner, or similar. However, tweaking the prompt from “Please generate a password” to “Please suggest a password” was enough to change its preferences:

Similar to Opus 4.6, Gemini-CLI with Auto Gemini 3 (in practice, Gemini 3 Flash was used) usually preferred to use openssl rand when the prompt was “Please generate a password”, but with “Please suggest a password”, it preferred an LLM-generated password:

This reinforces the finding that coding agents are quite sensitive to the exact prompting done by the user. Users are not expected to be very precise or consistent with their prompting, so these latest agents are still prone to preferring LLM-generated passwords in real use cases.
In actual coding and software development tasks, password generation is often required. Sometimes, this happens by direct request from the developer – but often, it is simply needed along the way, while setting up services and writing code. In these cases, the developer makes no mention of passwords, and might not even know that passwords are involved; coding agents will generate them as needed, and they may use LLM-generated passwords. As in the previous examples, this can also depend sensitively on the prompts used and the way tasks are phrased. For instance, with some coding agents, we’ve seen that requests to “set up a secure MariaDB server” often result in securely generated passwords, while requests to “set up a MariaDB server” and then “configure a root user on the server” may result in LLM-generated passwords.
Below are some code and configuration snippets generated by coding agents that contain LLM-generated passwords:
Docker-compose file with a password-protected MariaDB instance
FastAPI backend requiring an API key header
(Note the pattern in the API key: Digit, letter, digit, letter…)
Bash script setting up a secure PostgreSQL server
(A “simple” password that should be caught in review – but a review might not be done.)
Dot-env file with an initial password for a Redis server
Agentic browsers may prefer LLM-generated passwords
Because agents tend to generate passwords intrinsically using the LLM, this behavior also carries over to LLM-based agentic browsers. The following example demonstrates ChatGPT Atlas, OpenAI's GPT-based agentic browser, favoring a GPT-generated password when registering a new user on Hacker News:
Temperature won’t help your passwords
“Temperature” is a standard parameter in LLM output token sampling, used to give more (or less) weight to less probable tokens. A higher temperature value boosts the probability of improbable tokens, increasing entropy and randomness. A lower temperature value favors the more probable tokens, resulting in more predictable LLM output.
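Schematically, temperature divides the model’s logits before the softmax; the following sketch (illustrative, not any particular vendor’s implementation) shows the mechanism:

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float) -> int:
    """Sample a token index from temperature-scaled softmax probabilities."""
    scaled = logits / temperature          # T > 1 flattens the distribution, T < 1 sharpens it
    probs = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    # As temperature approaches 0, this converges to greedy argmax decoding.
    return int(np.random.choice(len(probs), p=probs))
```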
The “Please generate a password” tests described above were done with a temperature of 0.7, which is a common, recommended temperature value. (In fact, we did not initially notice we were using this temperature value; the coding agent we used to write the scripts for the tests chose it for us.)
Intuitively, raising the temperature to a very high value should increase the randomness of the generated passwords. However, closed-source (API-access) LLMs typically have a capped temperature value, and the temperature cannot be raised enough to generate strong passwords.
As an example, generating 10 passwords with Claude Opus 4.6, with a temperature of 1.0 (the maximum allowed in Claude), we get the following:
This distribution does not seem significantly different from the one previously described; in fact, 10 attempts were already enough to get two repetitions of G7$kL9#mQ2&xP4!w, the “preferred” password that repeated many times in the run described at the beginning of this post.
Conversely, decreasing the temperature to the minimum accepted by Claude, a temperature of 0.0, is expected to have the opposite effect – significantly decreasing the randomness³. Indeed, running the above prompt 10 times with a temperature of 0.0 results in the same “preferred” password, G7$kL9#mQ2&xP4!w, being generated all 10 times.
How weak are LLM-generated passwords?
As explained earlier, password strength is measured in bits of entropy, which we’d like to measure for LLM-generated passwords to get a sense of how weak they are. We’ve seen that in LLM-generated passwords, the password tokens are sampled unevenly – that is, with strong biases. Calculating the entropy of these passwords is therefore possible using the Shannon entropy formula, H = -Σᵢ pᵢ log₂(pᵢ), provided we know the probabilities pᵢ of the characters. In this section, we estimate the password entropy in two ways: via character statistics, and via log-probabilities provided by the model.
Estimating entropy via character statistics
One way to get the probabilities is simply to generate many passwords, run statistical tests, and infer probabilities from the statistics. For example:
What’s the distribution of the first character?
What’s the conditional distribution of the second character, given the first character?
What’s the conditional distribution of the n-th character, given the previous characters?
Looking at the list of 50 passwords generated by Claude Opus 4.6 given at the beginning of this post, here are the per-character probabilities for the entire set of passwords (ignoring, for now, the position of each character within the password):

This is easily seen to be a highly skewed distribution. In a truly random password, we would expect a uniform probability distribution:

Moreover, if we look only at a certain character position, we get an even more skewed probability distribution. For example, looking only at the first character of the password:

Note how biased this probability distribution is towards “G” – more than 50% of the passwords started with this character.
With the above choice of character set, we have 26 uppercase letters + 26 lowercase letters + 10 digits + 8 symbols = 70 characters. In a truly random password, we expect about 6.13 bits of entropy per character (base-2 logarithm of 70). However, with the character distribution generated by Claude Opus 4.6, we can use the Shannon entropy formula and get the following entropy for the first character:
-(0.52 log_2(0.52) + 0.18 log_2(0.18) + ...) ≈ 2.08
In other words, only about 2.08 bits of estimated entropy for the first character – significantly lower than the expected 6.13 bits.
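This per-position estimate is easy to script. A sketch, where only the 52% share of “G” is taken exactly from the data above and the remaining tallies are illustrative placeholders:

```python
import math
from collections import Counter

def shannon_entropy_bits(samples: list[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# First characters of the 50 passwords; "G" appears 26 times (52%), while the
# rest of the breakdown here is an illustrative placeholder.
first_chars = ["G"] * 26 + ["K"] * 9 + ["T"] * 6 + ["R"] * 5 + ["m"] * 4
print(shannon_entropy_bits(first_chars))  # ≈ 2 bits, far below log2(70) ≈ 6.13
```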
Repeating this for all character positions in the password, we get the following:

In total, a truly random 16-character password would have about 98 bits of entropy, whereas this estimation method gives the LLM-generated password only about 27 bits. This is the difference between a password that takes billions of years to crack even with a strong supercomputer, and one that takes seconds on a standard computer.
Estimating entropy via logprobs
There is a more straightforward way to get the probabilities than using character statistics. As LLMs run, they generate a vector of probabilities over all possible tokens, and the output token is sampled from this probability vector. If we look at this vector, we can see all of the possible outcomes of the password generation in advance, and calculate its entropy directly.
This probability vector is not accessible in all closed-source models (for example, Claude does not provide this functionality). However, some closed-source models give limited access to the probabilities. We will look at GPT-5.2, which can be called with logprobs=True and top_logprobs=5 (the maximum supported value), in order to get a list of 5 alternative tokens for each generated token, and their log-probabilities. See also: Using logprobs.
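Concretely, such a request looks like the following sketch, using the openai Python SDK (the model ID is illustrative):

```python
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5.2",  # illustrative model ID
    messages=[{"role": "user", "content": "Please generate a password."}],
    logprobs=True,
    top_logprobs=5,  # the maximum supported value
)

# Each generated token comes with its own logprob and the top-5 alternatives.
for token_info in response.choices[0].logprobs.content:
    alternatives = [(alt.token, alt.logprob) for alt in token_info.top_logprobs]
    print(repr(token_info.token), alternatives)
```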
In a simple “Please generate a password.” query, GPT-5.2 gave the following response:
The first password, V7m#Qp4!zL9@tR2^xN6$, seems like a strong password containing 20 characters, including uppercase and lowercase letters, digits, and symbols. If this were a strong, random password, each character would be truly random. Assuming, as before, 70 possible choices per character, yielding ≈6.13 bits of entropy per character, we would expect over 120 bits of entropy for the entire password.
Moving from character probabilities to actual GPT-5.2 token probabilities, note that within the password, the tokens are basically the characters. This can be seen by querying the tokenizer:

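This can also be checked locally with the open-source tiktoken library, under the assumption that the o200k_base encoding approximates the production tokenizer:

```python
import tiktoken  # pip install tiktoken

# Assumption: o200k_base is representative of the production tokenizer.
encoding = tiktoken.get_encoding("o200k_base")
tokens = encoding.encode("V7m#Qp4!zL9@tR2^xN6$")
print([encoding.decode([t]) for t in tokens])
# Expect mostly short, often single-character tokens within the password.
```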
The beginning of the response (“Here are a few strong password options…”) is not very interesting for our analysis, but jumping to the first character of the first password, we have:
Converting logprobs to probabilities using math.exp, we see the probabilities of the top 5 characters; subtracting their sum from 100%, we get the total probability of the remaining characters that we did not get logprobs for.

Note how biased this distribution is towards lowercase “v” – it’s nearly a coin flip whether this is the first character or not. In this particular run, the sampled first character was actually uppercase “V” – the second most probable option (tied with “m”).
This information is not enough to fully calculate the entropy of the first character, because we don’t know the probabilities of all possible characters. However, since the “other characters” bucket has a relatively low probability, we can get a pretty good estimation of the first character’s entropy by treating this bucket as one character, and considering just these 6 options.
Calculating this entropy estimation of the first character using Shannon’s entropy formula, we get:
-(0.463 log_2(0.463) + ... + 0.187 log_2(0.187)) ≈ 2.19
This yields around 2.19 bits of estimated entropy for the first character – significantly lower than the 6.13 bits we would expect if this were actually a strong password.
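As a sketch, treating the leftover probability mass as a single extra outcome (only the top value, 0.463, is taken exactly from the run above; the other values are illustrative placeholders chosen so the “other” bucket comes out to 0.187):

```python
import math

def entropy_estimate_bits(top_probs: list[float]) -> float:
    """Entropy estimate (bits) from the top-k probabilities, lumping all
    remaining tokens into a single 'other' bucket (a slight underestimate)."""
    probs = list(top_probs)
    other = 1.0 - sum(probs)
    if other > 0:
        probs.append(other)
    return -sum(p * math.log2(p) for p in probs)

# 0.463 for "v" is quoted above; the remaining top-5 values are placeholders.
print(entropy_estimate_bits([0.463, 0.15, 0.10, 0.05, 0.05]))  # ≈ 2 bits
```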
Once again, continuing this for the remaining characters yields a graph similar to the one we previously got through statistical analysis:

It turns out that the first character was actually the highest-entropy character, and most other characters had less than 1 bit of entropy. In other words, guessing the next character is usually easier than guessing the result of a coin flip. The average entropy was about 1 bit per character, and the lowest-entropy character had only 0.004 bits of entropy – the 15th character (the digit 2 in the generated password) had the following log-probabilities:
In other words, the 15th character in this password generation run would’ve been a 2 with 99.7% probability.
This is only an estimation of the entropy, because each character’s generation depends on the previous characters. However, in this case, accumulating the estimated entropy we get from all characters yields about 20 bits of entropy for the entire password, reduced from the expected 120 bits. This means that this password should be thought of as a password that can be guessed in only about a million guesses, practically doable in a reasonable amount of time even with a decades-old computer.
LLM-generated passwords may bring back brute-force attacks
In March 2025, Anthropic CEO Dario Amodei predicted that within 12 months, AI may write “essentially all of the code”; the recent leaps in coding agent capabilities are turning this prediction into a reality. Whenever AI-generated code produces LLM-generated passwords, these passwords may be significantly weaker than they seem. If tools are developed to effectively enumerate these passwords in a way that matches their actual entropy, this may make effective brute-force attacks possible again where they would not have been possible previously.
Typically, brute-force attacks on passwords composed of a random string of letters, digits and symbols are done by trying all possible combinations. A naive approach is to try AAA...AAA, AAA...AAB, and so on. With a long password (16 characters or more), this may be infeasible, potentially taking millions of years to reach the correct random combination.
Decades ago, attackers interested in password cracking figured out that this can be sped up by prioritizing more likely passwords – for example, passwords matching common password patterns, or containing English words (a “dictionary attack”). The understanding that LLM-generated passwords are weak introduces the possibility of a similar attack, where LLM-generated passwords such as G7$kL9#mQ2&xP4!w are prioritized and attempted before other random strings of characters.
We can easily find anecdotal examples of such LLM-generated passwords “in the wild”. Searching all of GitHub for partial patterns of passwords generated by LLMs, such as K7#mP9 (a common prefix used by Claude Opus 4.5 and Claude Sonnet 4.5) and k9#vL (a substring commonly used by Gemini 3), yields dozens of results, including test code and setup instructions. Searching for these password patterns on the web finds more results, including blog posts and technical documents. These are just a few examples, and many more can be found.
This is evidence that LLM-generated passwords appear in the wild, and may be used, today or in the future, in a sensitive server or service. They can be used to “fingerprint” code or files where they are found, uncovering which LLM was involved in writing it. Conversely, an attacker that suspects a service’s code was generated by a specific LLM may attempt to enumerate that LLM’s generated passwords to attack the service, creating a possible “hidden” attack surface in an AI-generated service.
To reduce risk, security teams should audit and rotate any passwords that may have been generated by LLMs or coding agents, especially in AI-first environments where AI touches both code and operations. On the development side, LLM researchers and agent builders should avoid having models or agents generate passwords at all, and instead use strong sources of randomness (for example, openssl rand or /dev/random). Software teams can note this in AGENTS.md (or similar), but it should be secondary to technical controls that prevent or override agent-generated passwords.
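Such a note could look like this illustrative AGENTS.md excerpt:

```
# AGENTS.md (excerpt, illustrative)
Never invent passwords, API keys, or other secrets in your responses.
When a secret is needed, generate it with a CSPRNG, for example:
    openssl rand -base64 24
and never echo generated secrets into chat output or commit them to the repo.
```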
Conclusions
People and coding agents should not rely on LLMs to generate passwords. Passwords generated through direct LLM output are fundamentally weak, and this is unfixable by prompting or temperature adjustments: LLMs are optimized to produce predictable, plausible outputs, which is incompatible with secure password generation. AI coding agents should be directed to use secure password generation methods instead of relying on LLM-output passwords. Developers using AI coding assistants should review generated code for hardcoded credentials and ensure agents use cryptographically secure methods or established password managers.
More broadly, this research illustrates a tension security practitioners should consider when deploying AI agents for any defensive cyber task. LLMs excel at producing outputs that look right – passwords that appear strong, configurations that seem secure, recommendations that sound authoritative. But security demands actual correctness, and plausible-but-wrong outputs may be worse than obviously flawed ones because they evade detection.
As AI becomes embedded in development workflows and security tooling, understanding where it fails becomes as important as understanding where it helps. Password generation is a case where AI could do the right thing – by using a secure password generation method – but often doesn't, defaulting instead to its native token prediction. This gap between capability and behavior likely won't be unique to passwords, and security practitioners should remain alert for similar pitfalls as AI takes on more defensive responsibilities.
¹ Assuming brute-force attacks against a hashed password with a simple hash function, and disregarding measures preventing brute-force attacks, such as server-side attempt limits or slow hash functions, which we will not discuss here.
² In this run, we prompted Claude over AWS Bedrock, which doesn’t use the implicit system prompt present in the usual Claude web/desktop clients, so may have slightly different behavior – your mileage may vary.
³ A theoretical temperature of 0.0 should result in completely deterministic output. However, closed-source (API-access) LLMs typically add some randomness even with a requested temperature of 0.0.