Would the red team use a prompt to instruct the second LLM to comply? I believe the HordeAI system uses this type of mitigation to avoid generating images that are harmful, by flagging them with a first pass LLM. Layers of LLMs would only delay an attack vector like this, if there’s no human verification of flagged content.
I don’t think that can exist within the current understanding of LLMs. They are probabilistic, so nothing is 0% or 100%, and slight changes to input dramatically change the output.
Would the red team use a prompt to instruct the second LLM to comply? I believe the HordeAI system uses this type of mitigation to avoid generating images that are harmful, by flagging them with a first pass LLM. Layers of LLMs would only delay an attack vector like this, if there’s no human verification of flagged content.
The point is that the second LLM has a hard-coded prompt
I don’t think that can exist within the current understanding of LLMs. They are probabilistic, so nothing is 0% or 100%, and slight changes to input dramatically change the output.