Sequence Alignment/Map (SAM) Format
The Sequence Alignment/Map (SAM) format is a generic, widely used, TAB-delimited text format for storing nucleotide sequence alignments. It is most often used to represent the alignment of sequencing reads (DNA or RNA fragments) to a reference genome. It was developed to replace older, less scalable formats, and it is not tied to a particular aligner or sequencing platform, which makes it a flexible way to manage and analyze high-throughput sequencing data.
Structure and Components
A SAM file consists of a header section and an alignment section.
Header Section
The header section contains meta-information about the alignment. Each header line begins with the '@' character followed by a two-letter tag specifying the record type. Common tags include:
- @HD: Header record indicating SAM format version and sort order.
- @SQ: Sequence dictionary defining the reference sequences (chromosomes/contigs) used in the alignment. Contains information such as sequence name and length.
- @RG: Read group, used to group reads that originate from the same source (e.g., same sequencing library).
- @PG: Program record, describing the program(s) used to generate the alignment.
- @CO: Comment lines for additional information.
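The header layout described above can be sketched in Python as follows. This is a minimal illustration, not a validating parser: the function name parse_sam_header and the example header values are assumptions made for the sketch, and it relies only on the fact that header fields after the record type are TAB-separated TAG:VALUE pairs, except for @CO lines, which are free text.

```python
def parse_sam_header(lines):
    """Group SAM header lines by record type (@HD, @SQ, @RG, @PG, @CO)."""
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("@"):
            break  # header lines come before the alignment section
        record_type, _, rest = line.partition("\t")
        if record_type == "@CO":
            # Comment lines carry free text rather than TAG:VALUE pairs.
            header.setdefault("@CO", []).append(rest)
            continue
        fields = dict(field.split(":", 1) for field in rest.split("\t") if field)
        header.setdefault(record_type, []).append(fields)
    return header


# Hypothetical header used only for illustration.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chr2\tLN:242193529",
    "@RG\tID:rg1\tSM:sample1",
    "@CO\tfree-text comment",
]
header = parse_sam_header(example)

# The @SQ records together form the sequence dictionary (name -> length).
reference_lengths = {sq["SN"]: int(sq["LN"]) for sq in header["@SQ"]}
```

Keeping each record type as a list preserves the order of @SQ lines, which matters because that order defines how coordinate-sorted alignments are arranged.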
Alignment Section
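The alignment section contains the alignment information for the reads. Each line is one alignment record. A read usually has a single record, but it can have several (for example, secondary or supplementary alignments), and unmapped reads can be stored as well. Each record consists of 11 mandatory TAB-separated fields in a fixed order, followed by any number of optional fields. The mandatory fields are: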
- QNAME: Query name (read identifier).
- FLAG: Bitwise flag indicating alignment properties (e.g., whether the read is paired, mapped, reverse complemented).
- RNAME: Reference sequence name (chromosome/contig) where the read is aligned.
- POS: 1-based leftmost mapping position on the reference sequence.
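- MAPQ: Mapping quality (Phred-scaled), defined as -10 log10(P) rounded to the nearest integer, where P is the probability that the mapping position is wrong. A value of 255 means the mapping quality is not available.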
- CIGAR: CIGAR string (Concise Idiosyncratic Gapped Alignment Report), describing the alignment of the read to the reference as a series of length/operation pairs. Operations are M (alignment match, which may be a sequence match or a mismatch), I (insertion), D (deletion), N (skipped region), S (soft clipping), H (hard clipping), P (padding), = (sequence match), and X (sequence mismatch).
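- RNEXT (MRNM in early versions of the specification): Reference name (chromosome/contig) of the mate for paired-end reads; "=" means the same reference as RNAME.
- PNEXT (formerly MPOS): 1-based mapping position of the mate for paired-end reads.
- TLEN (formerly ISIZE): Signed observed template length, often described as the inferred insert size, for paired-end reads.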
- SEQ: Read sequence.
- QUAL: Per-base read quality scores (Phred-scaled), stored as ASCII characters with an offset of 33.
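As a rough illustration of how the mandatory fields fit together, the Python sketch below splits one alignment line into its 11 columns, decodes the FLAG bits, and parses the CIGAR string. The names Alignment, parse_alignment, decode_flag, parse_cigar and reference_span, as well as the example record, are made up for this sketch and do not belong to any library. Columns 7 to 9 use the names from the current specification (RNEXT, PNEXT, TLEN), which older documents call MRNM, MPOS and ISIZE.

```python
import re
from typing import NamedTuple

# Flag bits as defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: list


def parse_alignment(line):
    """Split one alignment line into typed mandatory fields plus raw optional fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    return Alignment(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], cols[11:],
    )


def decode_flag(flag):
    """Return the names of all bits set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Turn a CIGAR string into (length, operation) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Reference bases covered: only M, D, N, = and X consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


# Hypothetical record; SEQ and QUAL are '*' (not stored) to keep it short.
line = "read1\t99\tchr1\t100\t60\t5S40M2D10M\t=\t300\t255\t*\t*\tNM:i:2"
aln = parse_alignment(line)
flags = decode_flag(aln.flag)
end_pos = aln.pos + reference_span(aln.cigar) - 1  # 1-based, inclusive
```

For this record, FLAG 99 (1 + 2 + 32 + 64) decodes to paired, proper_pair, mate_reverse and first_in_pair, and the alignment covers 52 reference bases, so it ends at position 151. Soft-clipped and inserted bases are not counted because they do not consume the reference.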
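Optional fields follow the mandatory fields and take the form TAG:TYPE:VALUE, where TAG is a two-character name and TYPE is a single character giving the value type (e.g., "AS:i:90", "NM:i:0", "MD:Z:100"). They can carry the alignment score (AS), the edit distance to the reference (NM), a string describing mismatching positions (MD), and other relevant information.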
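A small Python sketch can show how such an optional field might be decoded. It assumes the TAG:TYPE:VALUE layout with the type codes A (character), i (integer), f (float), Z (string), H (hex string) and B (typed numeric array); the function name parse_optional_field and the sample values are illustrative only.

```python
def parse_optional_field(field):
    """Convert one TAG:TYPE:VALUE optional field into a (tag, value) pair."""
    # Split at most twice, because a Z (string) value may itself contain ':'.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    if type_code == "B":
        # Array: the first element is the subtype (f for float, otherwise integer).
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(item) for item in items]
    return tag, value  # A, Z and H are kept as strings


# Hypothetical optional fields from one alignment line.
fields = ["AS:i:90", "NM:i:0", "MD:Z:100", "XX:B:c,1,-2,3"]
tags = dict(parse_optional_field(field) for field in fields)
```

Collecting the fields into a dictionary works because a tag may appear only once per alignment line.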
Binary Alignment/Map (BAM) Format
The Binary Alignment/Map (BAM) format is the compressed binary equivalent of the SAM format and holds the same information. It is compressed in blocks using BGZF, a gzip-compatible scheme, which gives a more compact representation and allows for faster access, especially when working with large datasets. BAM files sorted by coordinate are typically indexed to allow efficient retrieval of alignments within specific genomic regions.
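Because the block compression used by BAM (BGZF) is compatible with gzip, a quick sanity check on a file is possible with the Python standard library alone. The sketch below decompresses the start of a stream and compares it with the 4-byte BAM magic string "BAM\1". The function name looks_like_bam is an assumption of this sketch, and the test data is a synthetic stand-in rather than a valid BAM file.

```python
import gzip
import io

BAM_MAGIC = b"BAM\x01"  # first four bytes of the decompressed BAM stream


def looks_like_bam(stream):
    """Return True if a binary stream decompresses to data starting with the BAM magic."""
    try:
        with gzip.GzipFile(fileobj=stream, mode="rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compatible (for example plain-text SAM) or truncated.
        return False


# Synthetic stand-ins for illustration: only the magic bytes are meaningful.
fake_bam = io.BytesIO(gzip.compress(BAM_MAGIC + b"\x00" * 8))
plain_sam = io.BytesIO(b"read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n")

is_bam = looks_like_bam(fake_bam)
is_sam_mistaken_for_bam = looks_like_bam(plain_sam)
```

For a file on disk, the same check would be applied to a handle opened in binary mode, for example open("sample.bam", "rb"), where the path is hypothetical. The check only inspects the magic bytes and does not validate the rest of the file.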
CRAM Format
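The CRAM format (usually expanded as Compressed Reference-oriented Alignment Map) is a more compact alternative to BAM. It uses reference-based compression: instead of storing every base of every read, it stores mainly the differences between each read and the reference sequence it is aligned to, which reduces file size considerably. CRAM can preserve the original data losslessly, or it can discard or reduce some information (such as quality scores) for further savings. Because decoding depends on the reference, the same reference sequence must be available when the file is read, unless it has been embedded in the file.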
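The idea behind reference-based compression can be illustrated with a toy Python example. This is not the CRAM codec or its file layout. It is only a sketch of the principle, restricted to ungapped reads with substitutions, and the function names and sequences are invented for the illustration.

```python
def encode_against_reference(reference, pos, read):
    """Store an ungapped read as (pos, length, substitutions); pos is 1-based."""
    window = reference[pos - 1 : pos - 1 + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    substitutions = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return pos, len(read), substitutions


def decode_against_reference(reference, record):
    """Rebuild the read from the reference plus the stored differences."""
    pos, length, substitutions = record
    bases = list(reference[pos - 1 : pos - 1 + length])
    for offset, base in substitutions:
        bases[offset] = base
    return "".join(bases)


# Toy data: the read matches the reference at position 3 except for one base.
reference = "ACGTACGTACGTACGT"
read = "GTACCTAC"
record = encode_against_reference(reference, 3, read)
restored = decode_against_reference(reference, record)
```

Here the eight-base read is reduced to a position, a length and a single substitution, and it can be restored exactly as long as the same reference is available. That dependency is why a CRAM file needs its reference sequence at decoding time.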
Tools and Applications
Numerous bioinformatics tools are available for working with files in the SAM/BAM/CRAM formats. These tools support tasks such as:
- Alignment of reads to a reference genome.
- Sorting and indexing of alignment files.
- Variant calling.
- Visualization of alignments.
- Quality control and filtering.