Representation Bending for Large Language Model Safety
Ashkan Yousefpour Taeheon Kim Ryan S. Kwon Seungbeen Lee Wonje Jeung Seungju Han Alvin Wan Harrison Ngan Youngjae Yu Jonghyun Choi
Ashkan Yousefpour Taeheon Kim Ryan S. Kwon Seungbeen Lee Wonje Jeung Seungju Han Alvin Wan Harrison Ngan Youngjae Yu Jonghyun Choi
评论