Post
Neel Nanda (at ICLR) @NeelNanda5 · May 10, 2023

Great paper and elegant setup! This is another nice illustration of how it is so, so easy to trick yourself when interpreting LLMs. I would love an interpretability project distinguishing faithful from unfaithful chain of thought! Anyone know what the smallest open source model…