Text Splitter
Split text into chunks for RAG, embeddings, and AI processing. Supports multiple splitting strategies.
What is Text Chunking?
Text chunking (or text splitting) is the process of breaking large documents into smaller, manageable pieces. This is essential for RAG (Retrieval-Augmented Generation) systems, vector databases, and LLM applications that have token limits.
Splitting Methods
- By Tokens: Split based on estimated token count. Best for staying within LLM context limits.
- By Characters: Split at exact character counts. Simple but may break mid-word.
- By Words: Split at word boundaries. Preserves whole words.
- By Sentences: Split at sentence boundaries. Preserves semantic units.
- By Paragraphs: Split at paragraph breaks. Best for structured documents.
- Custom Delimiter: Split on any custom string (e.g., "---" for markdown sections).
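Two of the simpler strategies above can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation; the function names are hypothetical:

```python
def split_by_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size words,
    always breaking at word boundaries."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def split_by_delimiter(text: str, delimiter: str = "---") -> list[str]:
    """Split text on a custom delimiter (e.g. markdown section
    breaks), dropping empty chunks."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]
```

Sentence and paragraph splitting follow the same pattern, just with a smarter boundary test (e.g. splitting on `". "` or blank lines) instead of whitespace.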
What is Overlap?
Overlap creates redundancy between consecutive chunks, ensuring context isn't lost at chunk boundaries. For example, with 500-token chunks and a 50-token overlap, the last 50 tokens of chunk 1 also appear at the start of chunk 2. This improves retrieval quality in RAG systems.
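The sliding-window logic behind overlap can be sketched as follows (a simplified illustration operating on a pre-tokenized list; the function name is hypothetical):

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int,
                       overlap: int) -> list[list[str]]:
    """Return chunks of up to chunk_size tokens, where each chunk
    begins with the last `overlap` tokens of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by chunk_size minus overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens), step)
            if tokens[i:i + chunk_size]]
```

Because each window advances by `chunk_size - overlap` tokens, the repeated region keeps sentences that straddle a boundary retrievable from both chunks.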
Use Cases
- RAG Pipelines: Prepare documents for retrieval-augmented generation
- Vector Embeddings: Create properly sized chunks for embedding models
- LLM Context: Split documents to fit within context windows
- Document Processing: Break large files into manageable pieces
- Semantic Search: Optimize chunk sizes for search relevance
Recommended Chunk Sizes
- OpenAI Embeddings: 500-1000 tokens
- Semantic Search: 200-500 tokens
- Q&A Systems: 500-1500 tokens
- Summarization: 1000-2000 tokens
Best Practices
- Use 10-20% overlap to maintain context
- Prefer sentence or paragraph splitting to preserve meaning
- Smaller chunks improve retrieval precision but may lose context
- Larger chunks provide more context but may reduce relevance
- Test different chunk sizes for your specific use case