Applications - Tackling Generated Datasets Diversity


Preface

When a generated dataset lacks diversity, models trained on it tend to overfit or lose generality. This article shows how to build high-quality datasets with broad variation through prompt design and random combinations of entities.


Goals of This Article


Prompt design and data-diversity techniques

In the previous chapter, we discussed using large language models (LLMs) to generate synthetic datasets for further fine-tuning a local retriever model. That approach works because a large corpus of unlabeled documents is available: each document is used to generate one or more synthetic queries, forming query-document pairs.

But what if information retrieval is not your task? Suppose you are working on a legal-document classification problem but are not allowed to send any data to external APIs. In that case you need to train a local model, and collecting data can become a major obstacle that delays product development.

For simplicity, let's assume the goal is to generate children's stories. This task was the starting point of the research by Eldan et al. (2023). Each story consists of 2-3 paragraphs that follow a simple plot and theme, while the dataset as a whole covers a child's vocabulary and factual knowledge.

Language is more than a system of rules and symbols; it conveys and interprets meaning. The main challenge in using LLMs to produce training data is ensuring dataset diversity: even with a high generation temperature, the model may produce repetitive samples that lack the necessary variety (even for children's language). Coherence and relevance are further natural-language-generation challenges.

To address the diversity problem, the authors prepared a vocabulary of roughly 1,500 basic words, modeled on a typical child's vocabulary and split into nouns, verbs, and adjectives. For each generation, one verb, one noun, and one adjective are chosen at random, and the model then produces a story that weaves these random words in.

This technique dramatically broadens the diversity of the dataset, ensuring that the stories cover a child's entire vocabulary and many combinations of concepts. The authors also compiled a list of potential story features (such as dialogue, a plot twist, a bad ending, or a moral lesson); a random subset of these features is chosen for each story, and the model is prompted to include them.

Prompt:

Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb "{random.choice(verbs_list)}", the noun "{random.choice(nouns_list)}" and the adjective "{random.choice(adjectives_list)}". The story should have the following features: {random.choice(features_list)}, {random.choice(features_list)}. Remember to only use simple words!
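In Python, filling this template can be sketched as follows. The word and feature lists below are short illustrative placeholders, not the authors' actual ~1,500-word vocabulary, and `random.sample` is used for the two features so the same feature is not drawn twice:

```python
import random

# Illustrative placeholder lists -- Eldan et al. (2023) used a vocabulary of
# roughly 1,500 basic words split into nouns, verbs, and adjectives.
verbs_list = ["decorate", "run", "sing"]
nouns_list = ["thunder", "tree", "river"]
adjectives_list = ["ancient", "tiny", "happy"]
features_list = [
    "the story should contain at least one dialogue",
    "the story has a bad ending",
    "the story has a moral lesson",
]

def build_prompt() -> str:
    """Fill the template with randomly chosen words and story features."""
    # random.sample avoids picking the same feature twice, a small tweak
    # over calling random.choice twice as in the template above.
    f1, f2 = random.sample(features_list, 2)
    return (
        "Write a short story (3-5 paragraphs) which only uses very simple "
        "words that a 3 year old child would likely understand. The story "
        f'should use the verb "{random.choice(verbs_list)}", the noun '
        f'"{random.choice(nouns_list)}" and the adjective '
        f'"{random.choice(adjectives_list)}". The story should have the '
        f"following features: {f1}, {f2}. Remember to only use simple words!"
    )

print(build_prompt())
```

Each call produces a differently parameterized prompt, which is what drives the diversity of the resulting stories.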

Assume the blanks have been filled as follows:

Prompt:

Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb "decorate", the noun "thunder" and the adjective "ancient". The story should have the following features: the story should contain at least one dialogue, the story has a bad ending. Remember to only use simple words!

Below are the responses generated by:

  1. the ChatGPT-4o model

  2. the Gemini 2.5 Flash model


The Big Tree and the Storm

Lily and Max found a big, ancient tree in the park. It had wide arms and strong roots. “Let’s decorate the tree!” said Lily. Max clapped. “Yes! We can make it pretty.”

They put red ribbons and yellow stars on the tree. They sang songs and danced. “This is the best tree,” said Max. “The best!” said Lily.

Then the sky got dark. Thunder went BOOM! The wind was strong. The tree shook. “Oh no!” said Max. “Hold my hand!” said Lily.

The rain came fast. The wind pulled the ribbons. The stars flew away. A loud thunder hit the tree. CRACK! The big tree fell down.

Lily and Max looked at the broken tree. No more songs. No more stars. “The tree is gone,” said Lily. Max was quiet. They walked home in the rain.


Pip the bear was a very small bear. He wanted to make his home pretty. He found red leaves and yellow flowers. He found shiny stones, too. Pip wanted to decorate his cave with all the pretty things.

Pip put the leaves on the walls. He put the flowers near the door. He put the shiny stones on a big, ancient rock outside his cave. The rock was very, very old. It was big like a house! Pip smiled. His home looked so nice.

Pip saw his friend, Chirp the bird. "Look, Chirp!" said Pip. "I decorate my home!" Chirp flew around and sang a happy song. Pip felt happy.

Then, a big, loud sound went BOOM! It was thunder! The sky turned dark. Rain came down, plink, plonk, plink! The rain washed away all of Pip's pretty leaves and flowers. The shiny stones rolled away. Pip was very, very sad. His home was not pretty anymore.

Here we rely on recent text-generation models (GPT-3.5 and GPT-4), which can produce large amounts of synthetic content by following instructions. Because the prompt is composed randomly each time and the model follows it precisely, the stories come out highly diverse: even when a word or two stays the same, the narratives differ completely. In essence, the technique injects randomness into the prompt to produce a diverse dataset.

The idea is as follows:

  1. Identify the parameters or entities that may vary across the samples of your synthetic dataset;

  2. Generate, or manually curate, collections of these entities to fill the blanks;

  3. Produce the dataset by randomly selecting the entities to insert. It is best to set the generation temperature higher than the default but below the maximum;

  4. Train a local model on the ChatGPT/GPT-4 generations.

Notably, one of the entities used to steer generation can be the class label itself. For example, in a sentiment-classification task you can mention "positive" or "negative" directly in the prompt to receive generated text with the corresponding label, then train a local classifier on that data.
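A minimal sketch of this label-as-entity idea, assuming an illustrative topic list (the `call_llm` name in the comment is a placeholder for whatever LLM client you use, not a real API):

```python
import random

# The class label becomes one of the randomized entities, so every
# generated sample is labelled before the LLM even runs.
LABELS = ["positive", "negative"]
TOPICS = ["a restaurant visit", "a phone purchase", "a hotel stay"]  # illustrative

def make_labeled_request():
    """Return a prompt plus the label it encodes."""
    label = random.choice(LABELS)
    prompt = (
        f"Write a short {label} review about {random.choice(TOPICS)}. "
        "Keep it under 50 words."
    )
    return prompt, label

# Training pairs would then be collected as:
#   dataset = [(call_llm(p), y) for p, y in (make_labeled_request() for _ in range(1000))]
```

Because the label is fixed before generation, no separate annotation step is needed.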


Iterative (Hierarchical) Synthetic Data Generation

The format can be made more elaborate by using additional entities, some of which are pre-generated by the LLM itself. For example, the model can first be asked to generate a story summary and to draft one sentence that must appear in the story; those intermediate outputs are then used in the final request.

Prompt:


Summary: {a short summary generated by LLM, using the approach above}

Features: {copy the features from the initial prompt}

Sentence: {a sentence generated by LLM, which should be present in the story}

Words: {copy the words from the initial prompt}

Story:

Example of a filled prompt:


Summary: Lily and Timmy build a sandcastle together and learn to compromise, but it gets knocked over by a gust of wind. They find beauty in the broken sandcastle and play happily with a butterfly.

Features: Dialogue, Foreshadowing, Twist

Sentence: One day, she went to the park and saw a beautiful butterfly.

Words: disagree, network, beautiful

Story:

Below are the responses generated by:

  1. the ChatGPT-4o model

  2. the Gemini 2.5 Flash model


One day, Lily and Timmy went to the beach. Lily wanted to build a round sandcastle. Timmy said, "No, I want a tall one!" They started to disagree. But then Lily smiled and said, "Let’s make it round and tall!" Timmy nodded. "Okay!" They worked together and had fun.

The clouds in the sky looked like a big soft network. “Do you think it will rain?” Lily asked. Timmy shook his head. “No, it’s just cloudy,” he said. But the wind was picking up, and the waves got louder.

When they finished, the sandcastle was beautiful. It had towers and little seashell doors. “This is the best castle ever,” said Lily. Then—whoosh! A big gust of wind came. The castle fell down. Sand flew everywhere.

Lily and Timmy looked at the broken castle. Timmy was quiet. Lily said, “It looks like a mountain now. A beautiful one.” Just then, a butterfly flew down and landed on the sand. “One day,” Lily whispered, “she went to the park and saw a beautiful butterfly… just like this one.”

They laughed and chased the butterfly. Even though the castle was gone, they were still smiling. Sometimes, something broken can still be fun.


Lily skipped to the park, her bucket and shovel clutched tight. The sun warmed her face, and she hummed a happy tune. Her plan for the day was grand: the biggest, most beautiful sandcastle the park had ever seen.

She found the perfect spot by the old oak tree, where the sand was soft and damp. Just as she began to dig, a voice piped up beside her.

"Hey! Can I help?" It was Timmy from next door, his own shovel dragging behind him.

Lily frowned. She liked building alone. "I guess so," she mumbled, not looking at him.

They started to build, but it wasn't long before they began to disagree.

"Let's make it super tall!" Lily declared, piling on more sand.

"No, a wide base is better for stability," Timmy argued, flattening the sides.

Their initial enthusiasm began to network into a series of small arguments. Lily wanted flags, Timmy wanted moats. Lily envisioned turrets, Timmy preferred a drawbridge. After a few minutes of bickering, they both sat back, arms crossed, staring at their half-finished, lopsided creation.

"It's not working," Lily sighed.

"No, it's not," Timmy agreed. "Maybe we can combine our ideas?"

They started fresh, taking turns, listening to each other. Lily designed the elegant spires, and Timmy reinforced the sturdy walls. They laughed as they decorated it with seashells and pebbles, creating a truly magnificent sandcastle, a testament to their newfound teamwork. It was, indeed, the most beautiful sandcastle in the park.

"It's perfect!" Lily exclaimed, clapping her hands.

Just then, a sudden gust of wind swept through the park. In an instant, their masterpiece crumbled, sand scattering in all directions. Lily and Timmy stared, mouths agape, at the flattened mound where their castle had stood. Lily felt a tear well up. All that hard work... gone.

But then, a vibrant blue butterfly fluttered down and landed on a broken turret of sand. It fanned its wings, as if admiring the ruins. Lily looked at Timmy, and Timmy looked at Lily. A small smile touched Lily's lips. The castle was gone, but the intricate patterns the wind had made in the scattered sand were surprisingly lovely, and the butterfly seemed to think so too. They spent the rest of the afternoon chasing the butterfly, their earlier disappointment forgotten in the joy of the moment.


In this way, hundreds of thousands of highly diverse examples can be produced to train a model. Suppose you need to train a classifier that determines whether a text contains dialogue or a plot twist: because the initial prompt contains the labels, the target value for every generated sample is known.
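The two-stage flow, with the feature labels kept alongside each story, might be sketched as follows. Here `call_llm` is a stub standing in for any real chat-completion client, and the feature list and prompts are illustrative:

```python
import random

# Candidate story features; the chosen subset doubles as the label set
# for a downstream classifier (e.g. dialogue / twist detection).
FEATURES = ["Dialogue", "Foreshadowing", "Twist", "Bad Ending", "Moral Lesson"]

def call_llm(prompt: str, temperature: float = 0.9) -> str:
    # Stub: replace with a real LLM call. It just echoes a placeholder.
    return f"<generated for: {prompt.splitlines()[0][:40]}>"

def generate_sample(words):
    """Stage 1: summary and mandatory sentence; stage 2: the full story."""
    features = random.sample(FEATURES, 2)
    summary = call_llm(
        f"Summarize a children's story that uses the words {', '.join(words)}."
    )
    sentence = call_llm("Write one simple sentence that must appear in the story.")
    story = call_llm(
        f"Summary: {summary}\n"
        f"Features: {', '.join(features)}\n"
        f"Sentence: {sentence}\n"
        f"Words: {', '.join(words)}\n"
        "Story:"
    )
    # (story, features) pairs come out pre-labelled for classifier training.
    return story, features
```

The intermediate summary and sentence add a second layer of randomness on top of the word lists, which is what makes the hierarchical variant more diverse than the single-prompt one.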


Textbooks Are All You Need


A key question this approach raises is whether dataset synthesis actually delivers benefits when training networks for real-world applications. Fortunately, the authors address it by investigating and validating the efficacy of training smaller language models on synthetic data derived from state-of-the-art LLMs.

In their study, Gunasekar et al. (2023) stress the importance of high-quality training data for their model. They argue that language models are more effective when trained on material with the characteristics of a well-regarded "textbook": clear, comprehensive, informative, and unbiased.

These principles formed the basis for creating a semi-synthetic dataset used to train an LLM called phi-1. The main evaluation task is generating Python functions that follow a given text description or docstring, assessed with the HumanEval benchmark (Chen et al., 2021).

The authors emphasize the importance of diversity in this approach for several reasons: it exposes the language model to a wide range of coding expressions and problem-solving approaches, reduces the risk of overfitting or relying on specific patterns, and improves the model's ability to handle unfamiliar or novel tasks.

To tackle the coding challenge, the authors created textbook-like documents focused on topics that promote reasoning and basic algorithmic skills. They achieved diversity by imposing constraints on two dimensions:

  • the topic

  • the target audience

Unfortunately, the authors did not share the exact prompt template used to generate the synthetic data, but they did show the resulting outputs. Notably, they chose ChatGPT (GPT-3.5) over GPT-4, and even this strategy produced excellent results after they fine-tuned the model on the synthetic data.

Figure description: an example of LLM-generated textbook content from Gunasekar et al. (2023), shown in English and Chinese, discussing the importance of high-quality training data and the semi-synthetic dataset used to train the model; the bottom of the figure shows definitions of singular and non-singular matrices with Python code examples.

Image source: Gunasekar et al. (2023)

The target-audience constraint deserves attention. Specifying the target audience is an excellent way to increase the diversity of synthetic data: a high-school student, an undergraduate, and a PhD student will certainly phrase explanations and process information differently, making the generated data more varied. Let's try to recreate the prompt.

Prompt:


Write an extract from a Computer Science textbook for a 1st-year bachelor. The coding language is Python 3.6.

This is an extract from the middle of the following topic: Singular matrices.

The extract starts with a high-level overview of the topic. Then, it presents an example and describes the solution in natural language. After that, it provides 1-2 code snippets, following the example. Each snippet has no more than 10 rows. There should be no text after code snippets.

Keep in mind that the extract should be concise and explain only one concept at a time. The code should represent a Python function & its call. All the common ML/DS libraries are available.

Don't be too verbose. The expected length of the extract is 2 paragraphs at most.
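A template like the one above can be parameterized over topic and audience in the same spirit as the story prompts. Both lists below are invented for illustration; the actual lists used by Gunasekar et al. (2023) were not published:

```python
import random

# Illustrative topic/audience pools -- not the paper's actual constraint lists.
TOPICS = ["Singular matrices", "Binary search", "List comprehensions"]
AUDIENCES = ["a high school student", "a 1st-year bachelor", "a PhD student"]

def textbook_prompt() -> str:
    """Draw a random (topic, audience) pair and fill the prompt template."""
    topic = random.choice(TOPICS)
    audience = random.choice(AUDIENCES)
    return (
        f"Write an extract from a Computer Science textbook for {audience}. "
        "The coding language is Python 3.6.\n"
        f"This is an extract from the middle of the following topic: {topic}.\n"
        "Keep in mind that the extract should be concise and explain only "
        "one concept at a time."
    )

print(textbook_prompt())
```

Crossing even small topic and audience lists yields many distinct prompt configurations, which is exactly the diversity lever the paper describes.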

Below are the responses generated by:

  1. the ChatGPT-4o model

  2. the Gemini 2.5 Flash model


Singular Matrices in Linear Algebra
In linear algebra, a singular matrix is a square matrix that does not have an inverse. This occurs when the matrix’s determinant is zero, often implying that its rows (or columns) are linearly dependent. Singular matrices pose problems in numerical computing because operations like matrix inversion or solving linear systems require a non-singular matrix.

For example, consider solving the system Ax = b where A is a singular matrix. Attempting to invert A using np.linalg.inv() will result in a LinAlgError. Instead, we can check for singularity using the determinant and avoid operations that require inversion.


import numpy as np

def is_singular(matrix):
    det = np.linalg.det(matrix)
    return np.isclose(det, 0)

A = np.array([[2, 4], [1, 2]])
print("Is A singular?", is_singular(A))


def solve_system(A, b):
    if is_singular(A):
        return "Cannot solve: matrix is singular."
    return np.linalg.solve(A, b)

b = np.array([6, 3])
print(solve_system(A, b))

A square matrix is called singular if it does not have a multiplicative inverse. The most important characteristic of a singular matrix is that its determinant is zero. This occurs when the matrix’s rows or columns are not linearly independent; in other words, at least one row or column can be created by a linear combination of the others. This linear dependency implies that the matrix transformation collapses space onto a lower dimension, a process that cannot be reversed.

For example, consider the matrix A shown below. Notice that the second row, [2,4,6], is exactly two times the first row, [1,2,3]. This linear dependency between the rows ensures that the determinant of A is 0, thus classifying it as a singular matrix. We can programmatically verify this by creating a function that calculates the determinant. If the determinant is close to zero (accounting for potential floating-point inaccuracies), the matrix is singular.

$$ A = \begin{pmatrix}
1 & 2 & 3 \\
2 & 4 & 6 \\
7 & 8 & 9
\end{pmatrix} $$


import numpy as np

def is_singular(matrix):
    """Checks if a matrix is singular."""
    determinant = np.linalg.det(matrix)
    return np.isclose(determinant, 0)

A = np.array([[1, 2, 3], [2, 4, 6], [7, 8, 9]])
is_singular(A)


Pretty close!

In total, the authors generated 1B tokens to augment their model's training set, allowing a smaller model (only 1.5B parameters) to compete with models ten times its size (see Gunasekar et al. (2023) for details).



Image source: Gunasekar et al. (2023).


Figure description:


The figure makes three points:

  1. Bigger is not always better: a small model like phi-1 can perform remarkably well.

  2. Choosing the right model matters: for writing textbook material or practice exercises, a small model like phi-1 is fast, cheap, and good enough.

  3. Models like these can help with:

  • writing exercises

  • explaining code

  • generating teaching material


For your own task you probably won't need such a large amount of synthetic data (the authors studied pretraining, which demands substantial resources). Still, even as a rough estimate: at $0.002 per 1k tokens (standard ChatGPT pricing), generating 1B tokens would cost about $2000, with roughly the same cost again for the prompt tokens.
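As a quick sanity check on these figures, using the $0.002-per-1k-tokens price assumed above:

```python
# Back-of-the-envelope check of the cost figures quoted above.
price_per_1k_tokens = 0.002        # USD, the ChatGPT pricing assumed in the text
generated_tokens = 1_000_000_000   # 1B tokens, as in Gunasekar et al. (2023)

generation_cost = generated_tokens / 1000 * price_per_1k_tokens
print(generation_cost)  # 2000.0 (USD); prompt tokens add roughly the same again
```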

Keep in mind that fine-tuning on synthetic data becomes more valuable as the domain grows more niche, especially when the language departs from English (among other factors). The method also combines well with chain-of-thought (CoT) prompting, helping a local model improve its reasoning, and other prompting techniques work too. And don't forget that models like Alpaca (Taori et al., 2023, Stanford CRFM) and Vicuna (Zheng et al., 2023) perform impressively thanks to fine-tuning on synthetic data.


References

Tackling Generated Datasets Diversity

