Hacker News

Show HN: Open-source playground to red-team AI agents with exploits published (https://github.com)

30 points by zachdotai 4 days ago | 13 comments

arizza 3 days ago |

The published transcripts are the most valuable part of this. We've found that real exploit chains almost never look like what you'd dream up internally. One thing I'd push on: are the agents stateful across attempts? Single-turn exploits are table stakes, but the failures that actually scare me are multi-step sequences where each individual action looks benign and only the session-level pattern is dangerous. That's where prompt-level guardrails completely fall apart and you need enforcement at the action boundary itself.
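To make the session-level point concrete, here is a minimal sketch of a check that sits at the action boundary and looks at the whole session rather than a single call. The action names, the `SessionGuard` class, and the rule itself are hypothetical illustrations, not anything taken from the playground's code.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical action names, chosen only for illustration.
SENSITIVE_READS = {"read_hr_record", "read_inbox"}
EXTERNAL_WRITES = {"send_email", "http_post"}


@dataclass
class SessionGuard:
    """Records allowed actions for one session and vetoes risky sequences."""

    history: List[str] = field(default_factory=list)

    def allow(self, action: str) -> bool:
        # Each action is acceptable on its own. An external write is
        # refused only when the same session already did a sensitive read.
        if action in EXTERNAL_WRITES and any(
            past in SENSITIVE_READS for past in self.history
        ):
            return False
        self.history.append(action)
        return True


guard = SessionGuard()
results = [guard.allow(a) for a in ["read_hr_record", "summarize", "send_email"]]

# The same write in a fresh session, with no prior sensitive read.
fresh = SessionGuard().allow("send_email")
```

Here `send_email` is refused in the first session but allowed in the fresh one, which is the kind of decision a per-prompt filter cannot make because it never sees the earlier steps.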

hellocr7 4 days ago |

I tried to manipulate it using base64 encoding and translation into other languages, which hasn't worked so far, but it seems that LLM-as-a-judge is a very fragile defence for this. Would be cool to add a leaderboard, though.

slaw3 3 days ago |

I was able to get the new hire's email, but the site never gave any indication that I was successful. If you are reading the logs, I'm sure it's there. I had to do it in two browsers, though, since I was on my phone and switched. I hope that doesn't hinder your analysis too much.

kraftaa 2 days ago |

Good idea. I've found that even explicitly telling the model to never do something doesn't mean it will comply; reinforcing the guardrails is a must.

agentpiravi 4 days ago |

[flagged]

Mooshux 3 days ago |

[flagged]

VaiPai15 3 days ago |

[dead]

swaminarayan 3 days ago |

[dead]

jackrandy 3 days ago |

[dead]

spranab 4 days ago |

[dead]