LLMs / Vicuna: Translation and Commentary on "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality"

Editor's note: The authors present Vicuna-13B, an open-source chatbot produced by fine-tuning the LLaMA base model on user-shared conversations collected from ShareGPT. According to a preliminary GPT-4 evaluation, Vicuna-13B reaches 90% of the quality of ChatGPT and Bard, surpassing other open-source models such as LLaMA and Alpaca. The authors also propose GPT-4 as an evaluation tool, judging different chatbots through the answers and scores it produces; despite its limitations, this demonstrates the potential of automated evaluation. Vicuna-13B is cheap to train, roughly $300, thanks to memory optimizations, improved handling of multi-round conversations, and cost reduction via spot instances. The model's code, weights, and an online demo are open to the public. Finally, the authors stress Vicuna's remaining limitations, such as trouble with tasks involving reasoning or mathematics and a lack of safety optimization, while noting that it can serve as a starting point for future research on these problems.

Contents

"Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality" (Translation and Commentary)

How Good is Vicuna?

Online Demo

Overview

Table 1. Comparison between several notable models

Training: recipe built on Alpaca + memory optimizations + cost reduction via spot instances

Serving: distributed workers + flexible addition of GPU nodes

How To Evaluate a Chatbot? A GPT-4-based framework to automate performance assessment

Table 2. Total Scores Assessed by GPT-4.

Limitations: weak at tasks involving reasoning or mathematics

Release

License

The Team

Students (alphabetical order):

Acknowledgment

Citation


Date

March 30, 2023

URL

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality | LMSYS Org

Authors

The Vicuna Team

A joint effort with collaborators from multiple institutions, including UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI.

"Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality" (Translation and Commentary)

Vicuna (generated by stable diffusion 2.1)

We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. The cost of training Vicuna-13B is around $300. The code and weights, along with an online demo, are publicly available for non-commercial use.

*According to a fun and non-scientific evaluation with GPT-4. Further rigorous evaluation is needed.

How Good is Vicuna?

 Figure 1. Relative Response Quality Assessed by GPT-4*

After fine-tuning Vicuna with 70K user-shared ChatGPT conversations, we discover that Vicuna becomes capable of generating more detailed and well-structured answers compared to Alpaca (see examples below), with the quality on par with ChatGPT.

However, evaluating chatbots is never a simple task. With recent advancements in GPT-4, we are curious whether its capabilities have reached a human-like level that could enable an automated evaluation framework for benchmark generation and performance assessments. Our initial finding indicates that GPT-4 can produce highly consistent ranks and detailed assessment when comparing chatbots’ answers (see above example of GPT-4 judgment). Preliminary evaluations based on GPT-4, summarized in Figure 1, show that Vicuna achieves 90%* capability of Bard/ChatGPT. While this proposed framework shows a potential to automate chatbot assessment, it is not yet a rigorous approach. Building an evaluation system for chatbots remains an open question requiring further research. More details are provided in the evaluation section.

Online Demo

Overview

 Figure 2. Workflow Overview

The rapid advancement of large language models (LLMs) has revolutionized chatbot systems, resulting in unprecedented levels of intelligence as seen in OpenAI's ChatGPT. However, despite its impressive performance, the training and architecture details of ChatGPT remain unclear, hindering research and open-source innovation in this field. Inspired by the Meta LLaMA and Stanford Alpaca project, we introduce Vicuna-13B, an open-source chatbot backed by an enhanced dataset and an easy-to-use, scalable infrastructure. By fine-tuning a LLaMA base model on user-shared conversations collected from ShareGPT.com, Vicuna-13B has demonstrated competitive performance compared to other open-source models like Stanford Alpaca. This blog post provides a preliminary evaluation of Vicuna-13B's performance and describes its training and serving infrastructure. We also invite the community to interact with our online demo to test the capabilities of this chatbot.

Figure 2 provides an overview of our work. To begin, we collected around 70K conversations from ShareGPT.com, a website where users can share their ChatGPT conversations. Next, we enhanced the training scripts provided by Alpaca to better handle multi-round conversations and long sequences. The training was done with PyTorch FSDP on 8 A100 GPUs in one day. For serving the demo, we implemented a lightweight distributed serving system. We conducted a preliminary evaluation of the model quality by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. To compare two different models, we combine the outputs from each model into a single prompt for each question. The prompts are then sent to GPT-4, which assesses which model provides better responses. A detailed comparison of LLaMA, Alpaca, ChatGPT, and Vicuna is shown in Table 1 below.

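As a rough illustration of the pairwise judging step described here: the exact judge prompt is in the authors' FastChat repository, so the wording below is an assumption, and the snippet uses the legacy openai Python client (v0.27) that was current in March 2023.

    import openai  # legacy client, pip install openai==0.27.*

    def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
        """Ask GPT-4 to compare two chatbots' answers to one question."""
        prompt = (
            f"Question:\n{question}\n\n"
            f"Assistant A:\n{answer_a}\n\n"
            f"Assistant B:\n{answer_b}\n\n"
            "Rate the helpfulness, relevance, accuracy, and level of detail "
            "of each answer on a scale of 1 to 10, then explain your ratings."
        )
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        return response["choices"][0]["message"]["content"]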

Table 1. Comparison between several notable models

Model Name           LLaMA                         Alpaca                               Vicuna                      Bard/ChatGPT
Dataset              Publicly available datasets   Self-instruct from davinci-003 API   User-shared conversations   N/A
                     (1T token)                    (52K samples)                        (70K samples)
Training code        N/A                           Available                            Available                   N/A
Evaluation metrics   Academic benchmark            Author evaluation                    GPT-4 assessment            Mixed
Training cost (7B)   82K GPU-hours                 $500 (data) + $100 (training)        $140 (training)             N/A
Training cost (13B)  135K GPU-hours                N/A                                  $300 (training)             N/A

Training: recipe built on Alpaca + memory optimizations + cost reduction via spot instances

Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, we convert the HTML back to markdown and filter out some inappropriate or low-quality samples. Additionally, we divide lengthy conversations into smaller segments that fit the model's maximum context length.

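A minimal sketch of this preprocessing, assuming the third-party markdownify package for the HTML-to-markdown step and a crude whitespace count standing in for the model's real tokenizer:

    from markdownify import markdownify as html_to_md  # pip install markdownify

    MAX_CONTEXT = 2048  # model's maximum context length, in tokens

    def clean_turn(html: str) -> str:
        """Convert one scraped HTML turn back to markdown."""
        return html_to_md(html).strip()

    def approx_tokens(text: str) -> int:
        # Stand-in for the real tokenizer: whitespace splitting only.
        return len(text.split())

    def split_conversation(turns: list[str]) -> list[list[str]]:
        """Divide a long conversation into segments that fit the context window."""
        segments, current, size = [], [], 0
        for turn in turns:
            n = approx_tokens(turn)
            if current and size + n > MAX_CONTEXT:
                segments.append(current)
                current, size = [], 0
            current.append(turn)
            size += n
        if current:
            segments.append(current)
        return segments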

Our training recipe builds on top of Stanford’s alpaca with the following improvements.

Memory Optimizations: To enable Vicuna's understanding of long context, we expand the max context length from 512 in alpaca to 2048, which substantially increases GPU memory requirements. We tackle the memory pressure by utilizing gradient checkpointing and flash attention.

Multi-round conversations: We adjust the training loss to account for multi-round conversations and compute the fine-tuning loss solely on the chatbot's output (a minimal sketch of this masking follows the list).

Cost Reduction via Spot Instance: The 40x larger dataset and 4x longer sequences pose a considerable challenge for training expenses. We employ SkyPilot managed spot to reduce the cost by leveraging cheaper spot instances with auto-recovery for preemptions and automatic zone switching. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.

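The multi-round loss adjustment can be pictured as label masking. A minimal sketch, assuming a Hugging Face-style model where label positions set to -100 are ignored by the cross-entropy loss (the authors' actual implementation lives in the FastChat repository):

    import torch

    IGNORE_INDEX = -100  # Hugging Face loss functions skip these positions

    def mask_non_assistant_tokens(input_ids: torch.Tensor,
                                  assistant_spans: list[tuple[int, int]]) -> torch.Tensor:
        """Build labels so the loss covers only the chatbot's replies.

        assistant_spans lists [start, end) token ranges generated by the
        assistant across every round of the conversation.
        """
        labels = torch.full_like(input_ids, IGNORE_INDEX)
        for start, end in assistant_spans:
            labels[start:end] = input_ids[start:end]
        return labels

    # On the memory side, gradient checkpointing is a one-liner on most
    # Hugging Face models: model.gradient_checkpointing_enable()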

Serving: distributed workers + flexible addition of GPU nodes

We build a serving system that is capable of serving multiple models with distributed workers. It supports flexible plug-in of GPU workers from both on-premise clusters and the cloud. By utilizing a fault-tolerant controller and managed spot feature in SkyPilot, this serving system can work well with cheaper spot instances from multiple clouds to reduce the serving costs. It is currently a lightweight implementation and we are working on integrating more of our latest research into it.

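The controller/worker pattern described here can be sketched as follows. This is purely illustrative; every name in it is hypothetical rather than the actual FastChat serving API.

    import time

    class Controller:
        """Tracks GPU workers (on-premise or cloud) through heartbeats."""

        def __init__(self, heartbeat_timeout: float = 60.0):
            self.workers: dict[str, float] = {}  # address -> last heartbeat
            self.timeout = heartbeat_timeout

        def register(self, address: str) -> None:
            # Workers, including spot instances restarted after preemption,
            # announce themselves on startup.
            self.workers[address] = time.time()

        def heartbeat(self, address: str) -> None:
            self.workers[address] = time.time()

        def alive_workers(self) -> list[str]:
            # A preempted spot worker simply stops heartbeating and drops out,
            # which is what makes the controller tolerant of worker loss.
            now = time.time()
            return [a for a, t in self.workers.items() if now - t < self.timeout]

        def pick_worker(self) -> str:
            alive = self.alive_workers()
            if not alive:
                raise RuntimeError("no live workers available")
            return alive[int(time.time()) % len(alive)]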

How To Evaluate a Chatbot? A GPT-4-based framework to automate performance assessment

 Figure 3. Response Comparison Assessed by GPT-4

Evaluating AI chatbots is a challenging task, as it requires examining language understanding, reasoning, and context awareness. With AI chatbots becoming more advanced, current open benchmarks may no longer suffice. For instance, the evaluation dataset used in Stanford’s Alpaca, self-instruct, can be effectively answered by SOTA chatbots, making it difficult for humans to discern differences in performance. More limitations include training/test data contamination and the potentially high cost of creating new benchmarks. To tackle these issues, we propose an evaluation framework based on GPT-4 to automate chatbot performance assessment.

First, we devised eight question categories, such as Fermi problems, roleplay scenarios, and coding/math tasks, to test various aspects of a chatbot's performance. Through careful prompt engineering, GPT-4 is able to generate diverse, challenging questions that baseline models struggle with. We select ten questions per category and collect answers from five chatbots: LLaMA, Alpaca, ChatGPT, Bard, and Vicuna. We then ask GPT-4 to rate the quality of their answers based on helpfulness, relevance, accuracy, and detail. We discover that GPT-4 can produce not only relatively consistent scores but also detailed explanations on why such scores are given (detailed examples link). However, we also notice that GPT-4 is not very good at judging coding/math tasks.

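If the judge is asked to put the two numeric scores on the first line of its review (an assumed output format, not one confirmed by the post), extracting them is one regular expression:

    import re

    def parse_scores(review: str) -> tuple[float, float]:
        """Pull the two 1-10 scores from the first line of GPT-4's review."""
        first_line = review.strip().splitlines()[0]
        numbers = re.findall(r"\d+(?:\.\d+)?", first_line)
        if len(numbers) < 2:
            raise ValueError(f"could not parse scores from: {first_line!r}")
        return float(numbers[0]), float(numbers[1])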

Figure 3 displays the comparison results between all baselines and Vicuna. GPT-4 prefers Vicuna over state-of-the-art open-source models (LLaMA, Alpaca) in more than 90% of the questions, and it achieves competitive performance against proprietary models (ChatGPT, Bard). In 45% of the questions, GPT-4 rates Vicuna's response as better or equal to ChatGPT's. As GPT-4 assigns a quantitative score to each response on a scale of 10, we calculate the total score for each (baseline, Vicuna) comparison pair by adding up the scores obtained by each model on 80 questions. As shown in Table 2, Vicuna’s total score is 92% of ChatGPT’s. Despite recent advancements, these chatbots still face limitations, such as struggling with basic math problems or having limited coding ability.

While this proposed evaluation framework demonstrates the potential for assessing chatbots, it is not yet a rigorous or mature approach, as large language models are prone to hallucinate. Developing a comprehensive, standardized evaluation system for chatbots remains an open question requiring further research.

Table 2. Total Scores Assessed by GPT-4.

Baseline     Baseline Score   Vicuna Score
LLaMA-13B    513.0            694.0
Alpaca-13B   583.0            704.0
Bard         664.0            655.5
ChatGPT      693.0            638.0
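The 92% figure follows directly from the last row of Table 2; a quick check over all four pairs:

    # Total scores over the 80 questions, copied from Table 2.
    totals = {
        "LLaMA-13B": (513.0, 694.0),
        "Alpaca-13B": (583.0, 704.0),
        "Bard": (664.0, 655.5),
        "ChatGPT": (693.0, 638.0),
    }

    for baseline, (baseline_score, vicuna_score) in totals.items():
        print(f"Vicuna vs {baseline}: {vicuna_score / baseline_score:.0%}")
    # Vicuna vs ChatGPT: 638.0 / 693.0 is roughly 92%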

Limitations: weak at tasks involving reasoning or mathematics

We have noticed that, similar to other large language models, Vicuna has certain limitations. For instance, it is not good at tasks involving reasoning or mathematics, and it may have limitations in accurately identifying itself or ensuring the factual accuracy of its outputs. Additionally, it has not been sufficiently optimized to guarantee safety or mitigate potential toxicity or bias. To address the safety concerns, we use the OpenAI moderation API to filter out inappropriate user inputs in our online demo. Nonetheless, we anticipate that Vicuna can serve as an open starting point for future research to tackle these limitations.

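Screening demo inputs with the OpenAI moderation endpoint is a short call; a minimal sketch using the legacy openai client:

    import openai  # legacy client, pip install openai==0.27.*

    def is_flagged(user_input: str) -> bool:
        """Return True if the moderation endpoint flags the input."""
        response = openai.Moderation.create(input=user_input)
        return response["results"][0]["flagged"]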

Release

In our first release, we will share the training, serving, and evaluation code on a GitHub repo: https://github.com/lm-sys/FastChat. We also released the Vicuna-13B model weights, please find the instructions here. There is no plan to release the dataset. Join our Discord server and follow our Twitter to get the latest updates.

License

The online demo is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation. The code is released under the Apache License 2.0.

The Team

This is a joint effort with collaborators from multiple institutions, including UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI.

Students (alphabetical order):

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang

Advisors (alphabetical order):

Joseph E. Gonzalez, Ion Stoica, Eric P. Xing

Acknowledgment

We would like to thank Xinyang Geng, Hao Liu, and Eric Wallace from BAIR and Xuecheng Li and Tianyi Zhang from the Stanford Alpaca team for their insightful discussion and feedback, and Qirong Ho from MBZUAI for providing support on the serving cluster. Please check out a blog post from BAIR about a concurrent effort on their chatbot, Koala.

Citation

@misc{vicuna2023,
    title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
    url = {https://lmsys.org/blog/2023-03-30-vicuna/},
    author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
    month = {March},
    year = {2023}
}
