Text Splitter
Split text into chunks for RAG, embeddings, and AI processing. Supports multiple splitting strategies.
What is Text Chunking?
Text chunking (or text splitting) is the process of breaking large documents into smaller, manageable pieces. This is essential for RAG (Retrieval-Augmented Generation) systems, vector databases, and LLM applications that have token limits.
Splitting Methods
- By Tokens: Split based on estimated token count. Best for staying within LLM context limits.
- By Characters: Split at exact character counts. Simple but may break mid-word.
- By Words: Split at word boundaries. Preserves whole words.
- By Sentences: Split at sentence boundaries. Preserves semantic units.
- By Paragraphs: Split at paragraph breaks. Best for structured documents.
- Custom Delimiter: Split on any custom string (e.g., "---" for markdown sections).
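Two of the simpler strategies above can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation; the function names are hypothetical:

```python
def split_by_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size words,
    always breaking at word boundaries."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def split_by_delimiter(text: str, delimiter: str = "---") -> list[str]:
    """Split text on a custom delimiter (e.g. markdown section
    breaks), dropping empty chunks."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]
```

Sentence and paragraph splitting follow the same pattern, just with a smarter boundary test (e.g. splitting on `". "` or blank lines) instead of whitespace.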
What is Overlap?
Overlap creates redundancy between consecutive chunks, ensuring context isn't lost at chunk boundaries. For example, with 500-token chunks and a 50-token overlap, the last 50 tokens of chunk 1 also appear at the start of chunk 2. This improves retrieval quality in RAG systems.
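The sliding-window logic behind overlap can be sketched as follows (a simplified illustration operating on a pre-tokenized list; the function name is hypothetical):

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int,
                       overlap: int) -> list[list[str]]:
    """Return chunks of up to chunk_size tokens, where each chunk
    begins with the last `overlap` tokens of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by chunk_size minus overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens), step)
            if tokens[i:i + chunk_size]]
```

Because each window advances by `chunk_size - overlap` tokens, the repeated region keeps sentences that straddle a boundary retrievable from both chunks.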
Use Cases
- RAG Pipelines: Prepare documents for retrieval-augmented generation
- Vector Embeddings: Create properly sized chunks for embedding models
- LLM Context: Split documents to fit within context windows
- Document Processing: Break large files into manageable pieces
- Semantic Search: Optimize chunk sizes for search relevance
Recommended Chunk Sizes
- OpenAI Embeddings: 500-1000 tokens
- Semantic Search: 200-500 tokens
- Q&A Systems: 500-1500 tokens
- Summarization: 1000-2000 tokens
Best Practices
- Use 10-20% overlap to maintain context
- Prefer sentence or paragraph splitting to preserve meaning
- Smaller chunks improve retrieval precision but may lose context
- Larger chunks provide more context but may reduce relevance
- Test different chunk sizes for your specific use case