Publications
Authors marked with * contributed equally.
2025
- [IEEE S&P 2025] Preference Poisoning Attacks on Reward Model Learning. IEEE S&P, 2025
2024
- [Preprint] Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors. arXiv preprint arXiv:2405.10529, 2024
- [NeurIPS 2024] Mitigating Fine-tuning Based Jailbreak Attack with Backdoor Enhanced Safety Alignment. Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024
- [NeurIPS 2024] Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness. Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024
- [ACL 2024] RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
- [ICLR 2024] Conversational Drug Editing Using Retrieval and Domain Feedback. In The Twelfth International Conference on Learning Representations, 2024
2023
- [Preprint] Test-time Backdoor Mitigation for Black-box Large Language Models with Defensive Demonstrations. arXiv preprint arXiv:2311.09763, 2023
- [Preprint] Adversarial Demonstration Attacks on Large Language Models. arXiv preprint arXiv:2305.14950, 2023
- [NeurIPS 2023] On the Exploitability of Instruction Tuning. Advances in Neural Information Processing Systems, 2023
- [ICML 2023] A Critical Revisit of Adversarial Robustness in 3D Point Cloud Recognition with Diffusion-Driven Purification. In Proceedings of the 40th International Conference on Machine Learning, 2023
- [ICLR 2023] DensePure: Understanding Diffusion Models for Adversarial Robustness. In The Eleventh International Conference on Learning Representations, 2023
- [ICLR 2023] Defending against Adversarial Audio via Diffusion Model. In The Eleventh International Conference on Learning Representations, 2023
2022
- [ICML 2022] Fast and Reliable Evaluation of Adversarial Robustness with Minimum-Margin Attack. In International Conference on Machine Learning, 2022