Walkthrough
After running the IsoSeq protocol, you should have three output files:
- A FASTA file containing your transcript sequences
- A GFF file summarizing the exonic structure of your transcripts
- An abundance.txt file, listing the number of reads counted for each transcript
It is recommended to run a genome correction tool on your sequences and replace your original FASTA file with the result of that. If you are interested in looking at ORF predictions for your transcripts, you can also run any ORF finder on your sequences and generate a FASTA/.faa file of those as well (but ORF file input is optional). The software SQANTI can accomplish both goals in one step.
Once you have the files needed, the first step is to compile those files into a Raw Database object. Replacing filenames appropriately, run the following from your R console or within Rstudio:
library(IsoPops) transcript_filename <- "~/IsoPops_Data/all_data.fasta" abundance_filename <- "~/IsoPops_Data/all_data.abundance.txt" ORF_filename <- "~/IsoPops_Data/all_data.ORFs.fasta" GFF_filename <- "~/IsoPops_Data/all_data.gff" rawDB <- compile_raw_db(transcript_filename, abundance_filename, GFF_filename, ORF_filename)
The rawDB
variable now contains a table of transcript information, as well as your GFF and ORF data. You can look at either with View(rawDB$TranscriptDB)
or View(rawDB$GffDB)
in Rstudio. If you're old school and would rather view data in Excel or another program, you can save any data frame into a tsv file with:
write.table(rawDB$TranscriptDB, "path/to/file_name.tsv", quote = F, row.names = F, sep = "\t")
To view the set of transcripts in any Database object in IGV or other genome browsers, you can create a GFF file from the Database:
write_gff_data(rawDB, "path/to/file/name.gff")
To finish processing your dataset, you'll need to provide IsoPops with a way to associate PacBio transcript IDs with the genes you're interested in. Because IsoPops is completely annotation and genome free, this gene information just takes the form of a two-column tab-delimited file. List gene names in the first column and PacBio gene ID prefixes in the second, like so:
Gene ID # include header
Gene1 PB.1
Gene2 PB.2
Gene3 PB.3
... ...
You can also make a data frame in R directly with the same information and format. Then finish processing the Database by passing in the gene ID table:
gene_ID_table <- read.table("~/IsoPops_Data/gene_IDs_and_names.tsv", header = T) DB_all <- process_db(rawDB, gene_ID_table)
The variable DB_all
now contains the same table of transcript information (DB_all$TranscriptDB)
), with added gene info, plus a table of gene-level info, a table of info for each unique ORF, and GFF data. These new tables can be inspected by View(DB_all$GeneDB)
, View(DB_all$OrfDB)
, and
View(DB_all$GffDB)
.
If you'd like to filter your dataset before analysis, there are multiple functions available for doing so. Each filtering operation creates a new Database object, so you can chain filters by using the output of one filtering step as the input for your next filtering step:
DB_filter_4ex <- filter_db(DB_all, exon_min_count = 4) DB_filter_notrunc <- filter_truncations(DB_filter_4ex) DB_filter_95perc <- filter_db(DB_filter_notrunc, abund_cutoff = 0.95, recalc_abundances = F)
Once you're satisfied with your filtering, you can start playing around with the rest of the functions in IsoPops. Check out the Standard Plots and PCA + t-SNE pages to see what's possible. If you'd rather go straight to examples to run yourself, you can try out code from the Vignette on Github.
Any Database object can be saved and loaded back into memory at a later time. This is recommended once you've processed or filtered a Database (so you don't need to rebuild the Database every time).
saveRDS(DB_final, "path/to/file_name.db") # saves the DB_final Database to a specific file name DB_loaded <- readRDS("path/to/file_name.db") # loads saved Database into the variable DB_loaded