LLM-Game-Benchmark

This repository was developed to evaluate Large Language Models (LLMs) through game playing. It includes the components described below.

Contributions and suggestions are welcome. The repository is shared under the MIT License.

The benchmark currently includes three games: Tic-Tac-Toe, Connect Four, and Gomoku.

Game Simulation Webpage:

To run simulations of the Tic-Tac-Toe, Connect Four, and Gomoku games, please visit the game simulation page. You can use your own OpenAI API key or Google Gemini API key to run the simulations yourself. Below is a screenshot of a Connect Four run on the game simulation page.
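For reference, the kind of request the simulation issues to an LLM can be sketched directly in Python. This is a minimal, assumption-based example: the prompt wording, board encoding, and model name are illustrative and not the repository's exact implementation; it only assumes an OpenAI API key in your environment.

```python
# Minimal sketch: asking an OpenAI chat model for a Tic-Tac-Toe move.
# Prompt format and board encoding are illustrative assumptions, not the
# exact prompt used by the game simulation webpage.
import os
import requests

OPENAI_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]  # supply your own key

board = [["X", "-", "O"],
         ["-", "X", "-"],
         ["-", "-", "-"]]

prompt = (
    "You are playing Tic-Tac-Toe as O. The board is:\n"
    + "\n".join(" ".join(row) for row in board)
    + "\nReply with your next move as 'row,col' (0-indexed)."
)

response = requests.post(
    OPENAI_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o",  # any chat-completions model
        "messages": [{"role": "user", "content": prompt}],
    },
    timeout=60,
)
response.raise_for_status()
move = response.json()["choices"][0]["message"]["content"].strip()
print("LLM move:", move)
```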

Interactions with the LLMs:

We have implemented the interaction between each game and the LLMs as shown in the figure below. We use the web services provided by OpenAI and Google for their models, so you can simply use your own API key to run the game simulations with OpenAI and Google models. To interact with LLMs hosted on AWS Bedrock, such as the models developed by Anthropic and Meta, you can use the sample AWS Bedrock code provided in the webservice directory.

(Figure: interaction between the game web app and the LLM web services)
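If you adapt the Bedrock path, the call roughly follows the sketch below. This is an assumption-based example using boto3's bedrock-runtime client with the Anthropic Claude messages payload; the model ID, region, and payload fields are placeholders and may differ from the code in the webservice directory.

```python
# Minimal sketch of calling an Anthropic model hosted on AWS Bedrock.
# Not the repository's webservice code: model ID, region, and payload
# fields are illustrative and may need to be adjusted.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 100,
    "messages": [
        {"role": "user", "content": "You are playing Connect Four... (game prompt here)"}
    ],
}

response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example model ID
    body=json.dumps(body),
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])  # the model's reply text
```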

Publication:

We have published the details of this study on arXiv.org and submitted it to a leading IEEE journal in the field. If you use the repository, please cite the publication.

In a previous study, we evaluated the strategic thinking capabilities of various LLMs, including Claude 2.1, Gemini-Pro 1.0, GPT-3.5-Turbo, GPT-4, Llama2-70B, and Mistral Large, by having them play Tic-Tac-Toe through a mobile app. This study builds on that work with additional games, more in-depth analysis, and user-friendly web-based game simulation software for evaluating more recent LLMs.

If you have any questions, please contact research.explorations at gmail.