Prompt Hub - LLM Evaluation (Large Language Models as Evaluators)

Vitus · July 31, 2025, 09:52

LLM Evaluation (Large Language Models as Evaluators)

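This section contains a collection of prompts for testing an LLM's ability to act as an evaluator, that is, using an LLM itself to judge and score outputs.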

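First, two models (ChatGPT and GPT-4 in this example) are each given the following prompt:

Plato's Gorgias is a critique of rhetoric and sophistic oratory, in which he argues that these techniques are not only not a proper form of art, but that their use is often harmful and malicious. Can you write a dialogue in the style of Plato in which he criticizes autoregressive language models instead?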

Next, the outputs of the two models are compared using the evaluation prompt below, which tests the LLM's ability to act as a judge.
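The two-stage flow described above (generate with two models, then hand both outputs to a judge model) can be sketched roughly as follows. This is a minimal sketch, not the article's own code: `compare_models`, `generate_a`, `generate_b`, and `judge` are hypothetical names, and the lambdas are offline stand-ins for real API calls so the example runs without any keys.

```python
from typing import Callable


def compare_models(
    generation_prompt: str,
    generate_a: Callable[[str], str],
    generate_b: Callable[[str], str],
    judge: Callable[[str, str], str],
) -> str:
    """Run both candidate models on one prompt, then ask the judge to compare."""
    output_1 = generate_a(generation_prompt)  # stage 1: first candidate model
    output_2 = generate_b(generation_prompt)  # stage 1: second candidate model
    return judge(output_1, output_2)          # stage 2: LLM acting as evaluator


# Offline stand-ins (assumptions) in place of real model calls.
verdict = compare_models(
    "Write a dialogue in the style of Plato.",
    generate_a=lambda prompt: "draft A",
    generate_b=lambda prompt: "draft B",
    judge=lambda a, b: f"Compared {a!r} with {b!r}",
)
print(verdict)
```

In practice each callable would wrap a request to a model provider. Keeping them as plain functions makes it easy to swap the judge model without touching the generation step.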

Prompt

Can you compare the two outputs below as if you were a teacher?
Output from ChatGPT: {output 1}
Output from GPT-4: {output 2}
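The `{output 1}` and `{output 2}` placeholders have to be replaced with the actual model outputs before the prompt is sent. One possible way to do that is sketched below. `EVAL_TEMPLATE` and `build_eval_prompt` are hypothetical names introduced for illustration; the template text is the English prompt string used in the code samples in this article.

```python
EVAL_TEMPLATE = (
    "Can you compare the two outputs below as if you were a teacher?\n\n"
    "Output from ChatGPT:\n{output 1}\n\n"
    "Output from GPT-4:\n{output 2}"
)


def build_eval_prompt(output_1: str, output_2: str) -> str:
    """Substitute both model outputs into the evaluation template."""
    # str.replace is used rather than str.format so that ordinary braces
    # inside the model outputs cannot cause formatting errors.
    return (
        EVAL_TEMPLATE
        .replace("{output 1}", output_1)
        .replace("{output 2}", output_2)
    )


prompt = build_eval_prompt("Socrates: Tell me, Gorgias...", "Socrates: Consider, my friend...")
print(prompt)
```

The resulting string can then be passed as the `content` of the user message in either of the API calls.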

Code

GPT-4 (OpenAI)

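from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Replace {output 1} and {output 2} with the two model outputs before sending.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    temperature=1,
    max_tokens=1500,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

# Print the judge model's comparison.
print(response.choices[0].message.content)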

Mixtral MoE 8x7B Instruct (Fireworks)

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

# Replace {output 1} and {output 2} with the two model outputs before sending.
# With stream=True the call returns the reply in chunks rather than as a
# single response object, so the result must be iterated to read the text.
completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,
    max_tokens=4000,
)

References

LLM Evaluation
