Keep uniquely mapped reads
Jul 02, 2013

So you've sequenced a bunch of reads, mapped them to a reference with a short-read aligner, and now you find that some reads map to multiple locations. You hate multi-mappers, so you'd like to discard them and keep only uniquely mapped reads.
Say you have your read alignment in .sam format, like this:
1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>>>>>4::>>:>>>>>>9>>> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
You can download the example.sam file for testing purposes.
Here is how to do it on the pure Linux command line:
# Before you proceed, make sure your alignment file is sorted by the read name column (or whichever column holds a unique identifier for each read); "uniq" only collapses adjacent duplicate lines, so unsorted input will give wrong counts.
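# For example (a minimal sketch, assuming a headerless file; set any @ header lines aside first):
sort -k1,1 example.sam -o example.sam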
# Extract the read name column with "cut". The -f parameter is the column number of the read name: 1 for .sam, 4 for .bed.
cut -f1 example.sam > reads.txt
# Count the number of occurrences (i.e. mapped locations) of each read name. Note that "uniq -c" prepends the count, so count.txt has the count in column 1 and the read name in column 2.
uniq -c reads.txt > count.txt
# Append the count as the last column of each line. "join" merges the two files by matching column 1 of the first file against column 2 of the second (both must be sorted on their join fields, which they are here). If the original file is in .bed format, use "-1 4" instead.
join -1 1 -2 2 example.sam count.txt > example_with_count.txt
# Keep only lines with a count of 1
awk '$NF == 1' example_with_count.txt > example_unique.txt
# Remove the last column (the count). Since "join" collapses the delimiters to single spaces, rebuild each line with tabs so the output is valid .sam:
awk '{delim = ""; for (i = 1; i < NF; i++) {printf "%s%s", delim, $i; delim = "\t"}; printf "\n"}' example_unique.txt > example_unique.sam
You are done!
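Incidentally, the whole cut/uniq/join dance can be collapsed into a single two-pass awk command: the first pass over the file counts read names, the second prints only lines whose name occurs exactly once. This is a sketch of my own rather than part of the original recipe (use $4 instead of $1 for .bed); it also has the nice side effect of not requiring sorted input or re-tabbing:
awk 'NR == FNR {count[$1]++; next} count[$1] == 1' example.sam example.sam > example_unique.sam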
Here is a shell script keepunique.sh that you can download and use.
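If the link ever goes stale, a script with the same effect might look roughly like this. This is a sketch of my own, not the original file; the temp-file names are placeholders, and it assumes a headerless input file:
#!/bin/bash
# keepunique.sh: keep only uniquely mapped reads
# Usage: ./keepunique.sh input.sam output.sam [column]
#   column is the read name column: 1 for .sam (default), 4 for .bed
set -e
input=$1
output=$2
col=${3:-1}
sort -k"$col","$col" "$input" > sorted.tmp
cut -f"$col" sorted.tmp | uniq -c > count.tmp
# note: for .bed, join moves the name column to the front of each line
join -1 "$col" -2 2 sorted.tmp count.tmp > joined.tmp
# keep count == 1, then drop the count column and restore tabs
awk '$NF == 1' joined.tmp | awk '{delim = ""; for (i = 1; i < NF; i++) {printf "%s%s", delim, $i; delim = "\t"}; printf "\n"}' > "$output"
rm -f sorted.tmp count.tmp joined.tmp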
By the way, using uniqueness to filter reads is officially discouraged. See this and this. So use at your own risk.
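The commonly recommended alternative is to filter on mapping quality instead. A minimal sketch with samtools (the threshold of 30 and the file names are arbitrary placeholders, not from the original post):
# keep only alignments with MAPQ >= 30, writing BAM output
samtools view -b -q 30 input.bam > filtered.bam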