ORC Tools
Work with Optimized Row Columnar format for big data processing
ORC File Viewer
View ORC file schema, metadata, and statistics
To see ORC file information here, first generate a schema from CSV data in the Schema Generator tab.
About ORC Format
Optimized Row Columnar (ORC) is a free and open-source column-oriented data storage format from the Apache Hadoop ecosystem. It is similar to other columnar formats such as Parquet and RCFile, but is optimized for Hadoop workloads.
Key Features:
- Type-aware: Knows the data type of each column for better encoding
- Self-describing: Includes metadata about the data structure
- Compression: Built-in support for ZLIB, Snappy, LZO, LZ4, and ZSTD
- Indexes: Lightweight indexes including min/max values and bloom filters
- Streaming: Large files can be read without loading the entire file into memory
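
As a rough illustration of why type awareness helps encoding, the sketch below applies a simplified run-length encoding to an integer column. It is not ORC's actual integer encoding, which is more elaborate; the function names and the sample column are hypothetical and exist only for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    decoded = []
    for value, count in pairs:
        decoded.extend([value] * count)
    return decoded


# Hypothetical integer column with long runs, such as a status code.
status_column = [200] * 5 + [404] * 2 + [200] * 3
encoded = rle_encode(status_column)
```

Because the writer knows the column holds integers with repeated values, ten entries shrink to three pairs here, and `rle_decode(encoded)` restores the original column.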
File Structure:
- Stripes: Large units of data (64 MB by default) for efficient reads
- Row Groups: Sets of 10,000 rows (the default) grouped together within each stripe
- Streams: Column data stored in separate streams
- File Footer: Contains file metadata, the schema, stripe locations, and column statistics
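
To show how row groups and min/max indexes might work together, here is a minimal conceptual sketch in plain Python. It does not read real ORC files or use any ORC library; `build_row_group_index`, `groups_to_read`, and the sample column are hypothetical names chosen for illustration.

```python
ROWS_PER_GROUP = 10_000  # row group size listed above


def build_row_group_index(column):
    """Record (start_row, min, max) for each row group of a column."""
    index = []
    for start in range(0, len(column), ROWS_PER_GROUP):
        group = column[start:start + ROWS_PER_GROUP]
        index.append((start, min(group), max(group)))
    return index


def groups_to_read(index, lower, upper):
    """Return start rows of groups whose [min, max] overlaps [lower, upper]."""
    return [start for start, lo, hi in index if hi >= lower and lo <= upper]


# Hypothetical sorted column of 35,000 order IDs (0 through 34,999).
order_ids = list(range(35_000))
index = build_row_group_index(order_ids)
selected = groups_to_read(index, 12_000, 12_500)
```

For a filter on IDs between 12,000 and 12,500, only the group starting at row 10,000 overlaps the range, so a reader following this approach could skip the other three groups without decoding them. The benefit depends on how the data is ordered: on an unsorted column the ranges overlap more and fewer groups can be skipped.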
When to Use ORC:
- Hive data warehouses requiring ACID transactions
- Analytics workloads with selective column reads
- ETL pipelines needing high compression
- Spark applications processing large datasets
- Long-term data archival with query capability