The goal is to evaluate AI21’s Jurassic-1 Jumbo language model and compare it with OpenAI’s GPT-3 davinci model in its ability to answer trivia.
Details about experiment setup:
- This experiment builds on WCT’s GPT-3 evaluation study. We take the same set of questions and run it with AI21’s Jurassic-1 Jumbo.
- The two models are comparable in the number of parameters: GPT-3 davinci has 175B parameters, while AI21 Jurassic-1 Jumbo has 178B parameters.
- The prompt given to AI21’s model had the following format, with temperature=0 and top_p=1:
Q: (question text)\nA:
- No example prompt was given to AI21’s model, which effectively makes this a zero-shot setting. This is in contrast to what WCT had done. Performance actually degraded when we provided the AI21 model with OpenAI’s Q&A prompt. There is no “official” Q&A prompt available for AI21 as of yet.
- However, despite the zero-shot setting, the model had no issues detecting the format of the questions and attempted to answer almost all of them. In fact, the model backed off from answering only 4 times out of 157.
- The zero-shot setting made the model behave as if it were on a game show, leading to some hilarious game-show-style answers like the following:
Q: Singing Siblings: If you met five African American singers from the same family in Indiana whose first names were Jackie, Jermaine, Marlon, Michael and Tito, what would their last name most likely be?
A: (Drum roll, please) Jackson
Q: Counting Cards: What "shedding-type" card game has been a Mattel brand since 1992 and features cards with titles such as "Wild" and "Reverse"?
A: UNO! UNO! UNO! UNO! UNO! UNO! UNO!
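The zero-shot setup described in the list above can be sketched roughly as follows. This is only an illustration: the function name `build_zero_shot_prompt`, the `SAMPLING_PARAMS` dictionary, and the sample question are made up for this example, and no actual AI21 API call is shown. Only the prompt format and the temperature and top_p values come from the experiment.

```python
def build_zero_shot_prompt(question: str) -> str:
    """Format a single trivia question as 'Q: <question>', newline, 'A:'.

    No worked examples are prepended, which is what makes this zero-shot.
    """
    return f"Q: {question.strip()}\nA:"


# Sampling settings reported for the experiment. The dictionary layout is
# an assumption; the real request format depends on the client being used.
SAMPLING_PARAMS = {"temperature": 0, "top_p": 1}

# Hypothetical sample question, not taken from the WCT set.
prompt = build_zero_shot_prompt("What is the capital of France?")
print(repr(prompt))
```

With temperature=0 the model always picks its most likely next token, so repeated runs on the same prompt should give the same answer.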
Running through the 157 Jeopardy-style WCT questions, the Jurassic-1 178B model answered with a 55.4% accuracy rate. This compares favorably to the 52% average accuracy of humans who had taken the WCT. However, it is considerably lower than the 73% accuracy rate of GPT-3.
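For reference, the accuracy figure is simply the share of questions graded as correct. The sketch below assumes each answer has already been graded by hand as right or wrong. The count of 87 correct answers is inferred from the reported 55.4% over 157 questions and is not stated directly in the results, and the `accuracy` helper is a name chosen for this example.

```python
def accuracy(graded: list[bool]) -> float:
    """Return the percentage of answers marked correct."""
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(graded) / len(graded)


# Assumed breakdown: 87 correct and 70 incorrect out of 157 questions.
graded = [True] * 87 + [False] * 70
print(f"{accuracy(graded):.1f}%")  # prints 55.4%
```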
Given below is the raw data for all 157 questions. The questions, correct answers, and human answers are provided courtesy of WCT. AI21’s answers have been added inline.
Q&A Prompt Results
The AI21 model actually performed worse with the prompt. This is perhaps because the prompt is not specially tuned for AI21’s models.
With the prompt, the AI21 model backs off more aggressively and gets about 46.15% accuracy on the WCT quiz, which is considerably lower than the 55.4% it reached in the zero-shot setting.
The following was the prompt given:
This prompt can be accessed in AI21 studio using this link
Another interesting observation: minor changes in the prompt (like an extra empty line at the beginning of the prompt, or an extra space at the end of a sentence) can drastically change the answers the model gives.
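One way to probe this sensitivity is to generate whitespace-only variants of a prompt and compare the answers. The sketch below is a hedged illustration: `whitespace_variants` and `answers_by_variant` are hypothetical helpers, and `complete` stands for whatever function sends a prompt to the model and returns its text. A stand-in is used here so the example runs without network access.

```python
from typing import Callable


def whitespace_variants(prompt: str) -> dict[str, str]:
    """Return the prompt plus two variants that differ only in whitespace."""
    return {
        "original": prompt,
        # Extra empty line at the beginning of the prompt.
        "leading_blank_line": "\n" + prompt,
        # Extra space at the end of the question, just before the answer cue.
        "trailing_space_on_question": prompt.replace("\nA:", " \nA:", 1),
    }


def answers_by_variant(
    prompt: str, complete: Callable[[str], str]
) -> dict[str, str]:
    """Run every variant through `complete` and collect the answers."""
    return {
        name: complete(text)
        for name, text in whitespace_variants(prompt).items()
    }


# Stand-in for a real model call: it only reports the prompt length.
def fake_complete(text: str) -> str:
    return f"{len(text)} chars"


results = answers_by_variant("Q: Who wrote Hamlet?\nA:", fake_complete)
for name, answer in results.items():
    print(name, "->", answer)
```

Swapping `fake_complete` for a real completion call would show whether the model's answers diverge across variants that a human reader would consider identical.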
Raw data follows: