All posts (110)

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Link preview (arxiv.org): In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a large ...
Post excerpt: 1. Methods. Using a portion of the data from the full dataset, perplexity ..
2025. 3. 5.

Data Selection for Language Models via Importance Resampling
Link preview (arxiv.org): Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target di ...
Post excerpt: 1. Method. DSIR Framework. From the large raw dataset, matching the distribution of the target data ..
2025. 3. 5.

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
https://arxiv.org/abs/2407.08223
Link preview (arxiv.org): Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval outcomes through ...
Post excerpt: 0. Abstract. RAG [combines] the generative capability of LLMs with ..
2025. 1. 22.

Retrieval-Augmented Generation for Large Language Models: A Survey (2)
2024. 11. 20.