The following .html report details the analyses performed to detect and classify Phaeocystis cordata genes that were up and downregulated when cellls are acantharian endosymbionts.
The raw R markdown file used to build this document (Pcordata_gene_expression_analysis.Rmd
), as well as all the intermediate data files, are available in the GitHub repository: https://github.com/maggimars/PcordataSymbiosisDGE
The raw sequences from this study are available from the NCBI sequence read archive (SRA) with the accession numbers listed below. Sample names beginning with “PC” are Phaeocystis cordata culture replicates and samples beginning with “A” are individual acantharian holobionts.
Sample | SRA Run Number | SRA Experiment Number |
---|---|---|
A7 | SRR10997202 | SRX7658295 |
A12 | SRR10997200 | SRX7658297 |
A14 | SRR10997199 | SRX7658298 |
A20 | SRR10997198 | SRX7658299 |
A21 | SRR10997197 | SRX7658300 |
A22 | SRR10997196 | SRX7658301 |
A25 | SRR10997215 | SRX7658282 |
A27 | SRR10997214 | SRX7658283 |
A28 | SRR10997213 | SRX7658284 |
A29 | SRR10997212 | SRX7658285 |
A35 | SRR10997211 | SRX7658286 |
A36 | SRR10997210 | SRX7658287 |
PC2 | SRR10997209 | SRX7658288 |
PC3 | SRR10997208 | SRX7658289 |
PC4 | SRR10997207 | SRX7658290 |
The following packages are used throughout the analyses:
library(readr)
library(tximport)
library(ggplot2)
library(DESeq2)
library(wesanderson)
library(pheatmap)
library(stringi)
library(tidyverse)
library(edgeR)
library(GOstats)
library(GSEABase)
library(scales)
library(ggrepel)
library(shiny)
library(plotly)
library(reshape2)
library(data.table)
library(jsonlite)
library(knitr)
library(pathview)
The External RNA Controls Consortium (ERCC) produces spike-in controls for RNA-seq experiments. The ERCC spike-in controls were added to RNA extracted from culture replicates before poly-A selection and cDNA synthesis with the SMART-seq v4 kit. By comparing the number of reads mapping to the ERCC sequences with the known concentrations of RNA molecules in the spike-in mix, bias introduced through PCR amplification or other steps of library-preparation and sequencing can be detected. This can be accomplished by plotting the concentration by the FPKM of mapped reads and performing a regression analysis.
The quality trimmed P. cordata culture reads were mapped to the ERCC sequences and counted with RSEM.
To plot results and perform regression analyses:
Import the counts of reads mapping to ERCC spike-in standard sequences in each sample:
`PC2.ercc`<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC2.genes.results")
`PC3.ercc` <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC3.genes.results")
`PC4.ercc` <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC4.genes.results")
FPKM<- data.frame(PC2.ercc$gene_id,PC2.ercc$FPKM,PC3.ercc$FPKM,PC4.ercc$FPKM)
#add column names
names(FPKM)<-c("gene_id","PC2","PC3","PC4")
Import ERCC spike-in standard concentrations for each ERCC sequence:
ercc_conc <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/cms_095046.txt", stringsAsFactors=FALSE)
ercc_conc<-ercc_conc[,2:7]
ercc_conc<-ercc_conc[order(ercc_conc$'ERCC ID'),]
Plot ERCC concentrations (log2Conc.) by their counts (Log2FPKM) in each sample and apply a linear regression. The slope of the linear regression line and the R-squared of the regression model should both be close to 1. If so, we can be fairly confident that PCR and other handling did not cause substantial or sytematic bias in the results.
FPKM$conc <- ercc_conc$'concentration in Mix 1 (attomoles/ul)'
row.names(FPKM) <- FPKM$gene_id
FPKM <- FPKM[,-1]
FPKM[FPKM==0] <- NA
FPKM <- FPKM[complete.cases(FPKM),]
for (i in c(1:4)) {
FPKM[[i]] = log2(FPKM[[i]])
}
lm_eqn <- function(df){
y<-df[[1]]
x<-df[[2]]
m <- lm(y ~ x, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.character(as.expression(eq));
}
EqDF <- data.frame( c(lm_eqn(data.frame(FPKM$PC2, FPKM$conc)),lm_eqn(data.frame(FPKM$PC3, FPKM$conc)), lm_eqn(data.frame(FPKM$PC4, FPKM$conc))) , c("PC2", "PC3", "PC4" ))
names(EqDF) <- c("eq", "variable")
mERCC <- melt(FPKM, id.vars = c("conc"))
spikeinPlot<-ggplot(mERCC, aes(x=conc,y=value)) +geom_point()+ geom_smooth(method=lm, se=FALSE, color="black") +
theme(panel.background =element_blank(), panel.grid.major = element_blank(), panel.grid.minor =element_blank()) +
theme(axis.line = element_line(color = 'black'))+
labs(x=expression(paste("Log"[2]," Concentration", sep=""))) +
labs(y=expression(paste("Log"[2]," FPKM", sep=""))) + facet_wrap(~variable)
spikeinPlotWeq <- spikeinPlot + geom_text(data=EqDF,aes(x = 0, y = 18,label=eq), parse = TRUE, inherit.aes=FALSE) + facet_grid(variable~.)
spikeinPlotWeq
## `geom_smooth()` using formula 'y ~ x'
In this case, the slopes of the fitted lines and the R-squared for the regression models are both close to 1, so we can be relatively confident in our data.
The Phaeocystis cordata CCMP3104 transcriptome was assembled with RNA reads from 3 biological culture replicates of P. cordata grown in standard culture conditions with L1 media. RNA was extracted in late exponential phase and libraries were prepared following the same protocols as the individual acantharian holobiont transcriptomes (i.e. with SMART-seq v4 and Nextera XT). Following RNA extraction, but before library preparation, diluted ERCC spike-in mix 1 was added to each sample. The RNA libraries were pooled and sequenced with an Illumina HiSeq 4000. Following sequencing, the reads were trimmed and quality filtered with Trimmomatic software. The reads mapping to ERCC sequences (after trimming) in each sample were filtered out to keep them from being assembled into the transcriptome. Remaining reads from the three replicates were then assembled using Trinity de novo transcriptome assembler. The assembly was dereplicated with CD-HIT EST (95%) and then the entire transcriptome was BLASTed against the nr/nt reference database. Assembled sequences that aligned best to a bacterial sequence from the database were removed from the assembly.
Trimmomatic command for quality trimming raw reads from all samples:
java -jar trimmomatic-0.36.jar PE -phred33 sampleID_R1_001.fastq.gz \
sampleID_R2_001.fastq.gz \
sampleID_1_paired.fq sampleID_1_unpaired.fq \
sampleID_2_paired.fq sampleID_2_unpaired.fq \
ILLUMINACLIP:adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Number of sequencing reads per sample before and after quality trimming with Trimmomatic:
sample | number of sequencing reads | remaining after trimming |
---|---|---|
A35 | 85,124,248 | 72,558,691 |
A7 | 96,255,554 | 83,430,343 |
A12 | 83,855,671 | 72,850,246 |
A14 | 103,587,165 | 88,442,308 |
A22 | 96,529,442 | 79,528,859 |
A25 | 91,208,346 | 76,733,652 |
A27 | 90,585,006 | 78,488,845 |
A20 | 348,088,442 | 304,901,030 |
A21 | 303,661,266 | 268,735,500 |
A28 | 299,236,369 | 254,086,670 |
A29 | 137,747,279 | 121,041,843 |
A36 | 142,694,491 | 124,725,007 |
PC4 | 83,711,987 | 68,530,568 |
PC2 | 81,369,009 | 67,254,517 |
PC3 | 77,926,482 | 62,200,308 |
total | 2,121,580,757 | 1,823,508,387 |
Reads from the P. cordata culture samples that mapped to ERCC standards were removed by first mapping reads to a bowtie2 reference index prepared from the ERCC sequences and then filtering out the mapped reads:
bowtie2 -t -x ERCC_reference_bt2_index \
-1 sampleID_1_paired.fq \
-2 sampleID_2_paired.fq \
-S sampleID_ercc.sam
samtools view -bS sampleID_ercc.sam > sampleID.bam
samtools sort sampleID.bam sampleID_sorted
samtools view -b -f 13 sampleID_sorted.bam > sampleID_unmapped.bam
samtools sort -n sampleID_unmapped.bam sampleID.qsort
bedtools bamtofastq -i sampleID.qsort.bam -fq sampleID_1_paired.fq -fq2 sampleID_2_paired.fq
The filtered reads from the P. cordata culture replicates were then assembled with Trinity:
Trinity --seqType fq --max_memory 475G \
--left PC2_1_paired.fq,PC3_1_paired.fq,PC4_1_paired.fq \
--right PC2_2_paired.fq,PC3_1_paired.fq,PC4_2_paired.fq \
--CPU 12
The assembly was dereplicated with CD-HIT_EST:
cd-hit-est -i Trinity.fasta -o Trinity_Pc_clustered_95 -c 0.95 -n 8 -p 1 -g 1 -M 200000 -T 8 -d 40
The Trinity assembly was filtered to remove bacterial contamination by first running a blastn (v2.6.0+) against the nr/nt NCBI database:
blastn -query $DATA/Trinity_Pc_clustered_95.fasta -task blastn -db nt -num_threads 12 -max_target_seqs 1 -outfmt 5 > pcTrinity_PC_clustered_95Blast.xml
The blast results were then parsed with the following script: TrinityBlastXML.ipynb
: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/TrinityBlastXML.ipynb
and reads that had their highest quality blast result corresponding to a bacterial sequence were removed from the transcriptome with the following script: FilterTrinityEukNotEuk.ipynb
: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/FilterTrinityEukNotEuk.ipynb
This script produced the filtered transcriptome file pc_euk_seqs.fasta
(https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta) that was used in read mapping and counting for differential gene expression testing.
The “completeness” of the transcriptome was evaluated by determining how many Benchmarking Universal Single-Copy Orthologs (BUSCOs) were complete, fragmented, or missing in the transcriptome. This was accomplished with the BUSCO software (v3) and the translated amino acid sequences for the P. cordata transcriptome. The BUSCO software was also used to assess the completeness of the P. cordata transcriptome previously sequenced as part of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) as a comparison point. BUSCO scores were calculated for Eukaryote BUSCOs and protist BUSCOs.
The P. cordata assembled for this project is more complete than the MMETSP transcriptome, which is likely due to much deeper sequencing performed for this study.
The P. cordata transcriptome was annotated with the dammit de novo transcriptome annotation software: https://github.com/dib-lab/dammit
Pfam annotation results (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.x.pfam-A.gff3) were parsed with the following script: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/Pfam_gffParsing.ipynb
The final pfam annotation file is pfam.csv
(https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pfam.csv)
The assembly statistics and annotation results are summarized in the below table (including GO and KEGG annotations, which are described in later sections):
P. cordata CCMP3104 transcriptome assembly & annotation statistics | |
---|---|
Number of contigs | 32,621 |
Sum bp | 37,640,601 |
min contig length | 201 |
max contig length | 13,685 |
med contig length | 955 |
mean contig length | 1153 |
N50 | 1866 |
GC% | 67.4 |
With Pfam annotation | 13,548 (41.5%) |
With GO annotation | 7,119 (21.8%) |
With KO annotation | 7,719 (23.7%) |
With Kegg Pathway | 4,618 (14.2%) |
Quality filitered reads from the P. cordata culture replicates and individual acantharians were mapped to the P. cordata transcriptome (pc_euk_seqs.fasta
) using the Salmon read mapping software.
salmon quant -i PC_index -l A \
-1 $DATA2/sampleID_1_paired.fq \
-2 $DATA2/sampleID_2_paired.fq \
-p 10 --validateMappings -o quants/sampleID_quant
The package tximport
was used for importing Salmon read count results files to R.
All Salmon results are available here: https://github.com/maggimars/PcordataSymbiosisDGE/tree/master/quants
Salmon mapping rates (percent of the reads from each sample that mapped to the P. cordata transcriptome assembly):
sample_names<- c("A7", "A12", "A14", "A20", "A21", "A22","A25", "A27", "A28", "A29", "A35", "A36", "PC2", "PC3", "PC4")
mapping_rates<- c(fromJSON("quants/A7_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A12_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A14_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A20_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A21_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A22_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A25_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A27_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A28_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A29_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A35_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A36_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC2_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC3_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC4_quant/aux_info/meta_info.json")$percent_mapped
)
mappingrate_df<- data.frame("Sample" = sample_names, "Mapping Rate" = round(mapping_rates, digits=2))
kable(mappingrate_df, align = "l")
Sample | Mapping.Rate |
---|---|
A7 | 3.73 |
A12 | 1.73 |
A14 | 2.56 |
A20 | 1.12 |
A21 | 3.04 |
A22 | 3.29 |
A25 | 2.56 |
A27 | 4.86 |
A28 | 1.72 |
A29 | 1.62 |
A35 | 1.53 |
A36 | 3.87 |
PC2 | 73.66 |
PC3 | 75.30 |
PC4 | 71.68 |
Read in results files for each sample into R with tximport
:
samples<-read.table("quants/samples.txt", header = TRUE)
files <- file.path("quants", samples$run, "quant.sf")
names(files) <- samples$run
tx2gene <- read_csv("quants/tx2gene.csv")
txi <- tximport(files, type="salmon", tx2gene= tx2gene)
create DESeq data object:
ddsTxi <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ condition)
## using counts and average transcript lengths from tximport
Pre-filtering:
“Note that more strict filtering to increase power is automatically applied via independent filtering on the mean of normalized counts within the results function. The adjusted p values for the genes which do not pass the filter threshold are set to NA” – from DESeq2 manual
keep <- rowSums(counts(ddsTxi )) >= 10
dds <- ddsTxi[keep,]
Re-level so that log-fold change is in relation to free-living (e.g. if logfold change is 3, then that gene is expressed 3x more in symbiotic samples than free-living replicates):
dds$condition <- relevel(dds$condition, ref = "F")
dds <- DESeq(ddsTxi)
res <- results(dds, alpha=0.05)
head(res, n =3)
## log2 fold change (MLE): condition S vs F
## Wald test p-value: condition S vs F
## DataFrame with 3 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## TRINITY_DN1_c0_g1_i1 0.0178204337170013 1.98354048709337 3.7365328592521
## TRINITY_DN10_c0_g1_i1 14.8187038538356 -7.72809144623968 2.37954484998336
## TRINITY_DN10_c1_g1_i1 7.55170822351109 -6.75555987522243 2.38815198139567
## stat pvalue padj
## <numeric> <numeric> <numeric>
## TRINITY_DN1_c0_g1_i1 0.530850540276097 0.595522354038567 NA
## TRINITY_DN10_c0_g1_i1 -3.24771833835942 0.0011633436943023 0.00916218836691006
## TRINITY_DN10_c1_g1_i1 -2.82878138738657 0.00467256046288919 0.0210043806317098
“condition S vs F” confirms that we are testing symbiotic samples against free-living samples.
summary(res)
##
## out of 32210 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 589, 1.8%
## LFC < 0 (down) : 5211, 16%
## outliers [1] : 423, 1.3%
## low counts [2] : 17088, 53%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
589 genes are significantly upregulated in symbiosis and 5211 genes are significantly downregulated in symbiosis
convert DESeq results to a data frame:
resultsall<- as.data.frame(res)
resultsall$Trinity <- row.names(resultsall)
Volcano Plot (plots “genes” by log2fold change and their adjusted p-values, with significantly differentially expressed genes indicated in red)
res_table<- data.frame(res)
# Obtain logical vector regarding whether padj values are less than 0.05 (to color significant points)
threshold <- res_table$padj < 0.05
res_table$threshold <- threshold
ggplot(res_table) +
geom_point(aes(x=log2FoldChange, y=-log10(padj), colour=threshold)) +
ggtitle("") + theme_bw()+ scale_colour_manual(values = c("#bbbbbb", "red")) +
xlab("log2 fold change") +
ylab("-log10 adjusted p-value") +
scale_x_continuous(limits = c(-15,15)) +
scale_y_continuous(limits = c(0,30)) +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.5), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) + theme(text = element_text(size=20))
Count Data Transformations for ranking and visualizations (e.g. PCA plots and heatmaps)
Variance stabilizing transformation (vst): “This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size.”" – from DESeq documentation
#Heat Map of the variance stabilizing transformed count matrix (vsd):
vsd<-vst(dds, blind = TRUE)
select <- order(rowMeans(counts(dds,normalized=TRUE)),
decreasing=TRUE)[1:1000]
pheatmap(assay(vsd)[select,], cluster_rows=TRUE, show_rownames=FALSE,
cluster_cols=TRUE)
#PCA by condition (Free-living or Symbiotic):
data1 <- plotPCA(vsd, returnData=TRUE)
data1$group<-gsub(" : ","_",as.character(data1$group))
percentVar1 <- round(100 * attr(data1, "percentVar"))
PCA<-ggplot(data1, aes(PC1, PC2, color = condition))+ theme_bw()+
geom_point(size=9, alpha = 0.8) + scale_colour_manual(values = c("#44aabb","#bbbbbb"), labels = c("Free-living", "Symbiotic"), name = "")+
xlab(paste0("PC1: ",percentVar1[1],"% variance")) +
ylab(paste0("PC2: ",percentVar1[2],"% variance")) +
theme(text = element_text(size=20))
PCA
The majority of variance in gene expression (63%) is explained by sample type. There is much more variation between symbiotic samples than between culture replicates.
The Pfam annotations for the P. cordata transcriptome were assigned using HMMER and HMM profiles for the Pfam database. Annotations with scores < 1e-05 were retained and included in the following analyses. The GO term to Pfam mapping comes from: http://current.geneontology.org/ontology/external2go/pfam2go.
Get dataframe of significantly differentially expressed genes:
sigDEgenes <- subset(resultsall, padj < 0.05)
sigDEgenes$trinity <- row.names(sigDEgenes)
Import Pfam annotation results from dammit and fix pfam annotation formatting to match that used in the pfam2go mapping file:
pfamannot<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pfam.csv", header = TRUE)
pfamannot$Dbxref <- stri_sub(pfamannot$Dbxref , from = 2, to = -2)
pfamannot$Pfam <- gsub("\\..*","", pfamannot$Dbxref)
dim(pfamannot)
## [1] 13548 6
13,548 transcripts have good (score<1e-05) pfam annotations
Import and format trinity ID to dammit transcript ID mapping file:
trinity2dammit <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pc_euk_seqs.fasta.dammit.namemap.csv", header = TRUE)
trinity2dammit1 <- trinity2dammit %>%
separate(original, c("trinity"), sep = " ")
## Warning: Expected 1 pieces. Additional pieces discarded in 32621 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
names(trinity2dammit1) <- c("trinity", "seqid")
Import and format pfam2go mapping file:
pfam2go <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pfam2go4R.txt", header = FALSE)
pfam2go1 <- pfam2go %>%
separate(V1, c('V1_1', 'V1_2'), sep = '>') %>%
separate(V1_1, c("Pfam", "name"), sep = " ") %>%
separate(V1_2, c("GO_desc", "GO"), sep = ";")
## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [3361, 6204].
## Warning: Expected 2 pieces. Additional pieces discarded in 10503 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 10503 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
pfam2go1$GO <-stri_sub(pfam2go1$GO, from = 2)
Create transcript to GO reference:
pfam1 <- merge(pfamannot, pfam2go1, by = "Pfam")
pfam2<- merge(pfam1, trinity2dammit1, by ="seqid" )
pfam3 <- merge(pfam2, sigDEgenes, by = "trinity")
pfamUP <- pfam3[pfam3$log2FoldChange >0 & pfam3$padj <0.05,]
pfam4<-merge(pfamannot, trinity2dammit1, by = "seqid")
pfam5<-merge(pfam4, sigDEgenes, by = "trinity")
length(unique(pfam1$seqid))
## [1] 7119
7119 annotated transcripts also have a GO term assigned.
Create list of all upregulated GO terms:
length(pfamUP$GO_desc)
## [1] 411
length(unique(substring(pfamUP$GO_desc,5,nchar(pfamUP$GO_desc))))
## [1] 134
unique_upGOlist <- unique(substring(pfamUP$GO_desc,5,nchar(pfamUP$GO_desc)))
unique_upGOlist[135] <- ""
kable(data.frame(unique_upGOlist[1:45],unique_upGOlist[46:90], unique_upGOlist[91:135] ))
unique_upGOlist.1.45. | unique_upGOlist.46.90. | unique_upGOlist.91.135. |
---|---|---|
transmembrane transport | biosynthetic process | inorganic diphosphatase activity |
membrane | proton-transporting ATP synthase activity, rotational mechanism | hydrogen-translocating pyrophosphatase activity |
ion channel activity | ATP synthesis coupled proton transport | fructose-bisphosphate aldolase activity |
ion transport | proton-transporting ATP synthase complex, catalytic core F(1) | inositol-3-phosphate synthase activity |
ATP binding | phosphoglycerate kinase activity | phospholipid biosynthetic process |
nucleic acid binding | kinase activity | inositol biosynthetic process |
structural constituent of ribosome | COPII vesicle coat | GTPase activity |
ribosome | intracellular protein transport | iron-sulfur cluster binding |
translation | ER to Golgi vesicle-mediated transport | nitrogen compound metabolic process |
purine nucleotide biosynthetic process | eukaryotic translation initiation factor 3 complex | glutamate synthase activity |
phosphoribosylaminoimidazolecarboxamide formyltransferase activity | translational initiation | carbohydrate metabolic process |
IMP cyclohydrolase activity | translation initiation factor activity | hydrolase activity, hydrolyzing O-glycosyl compounds |
protein binding | translation initiation factor binding | primary metabolic process |
photosystem II stabilization | acetyl-CoA carboxylase activity | zinc ion binding |
photosystem II assembly | fatty acid biosynthetic process | transcription coregulator activity |
oxygen evolving activity | response to stress | nucleus |
photosystem II oxygen evolving complex | unfolded protein binding | histone acetyltransferase activity |
intracellular | protein folding | cytoskeleton organization |
oxidation-reduction process | rRNA binding | heme binding |
oxidoreductase activity | hydrolase activity | pyridoxal phosphate binding |
photosynthetic electron transport in photosystem II | coenzyme binding | peroxidase activity |
electron transporter, transferring electrons within the cyclic electron transport pathway of photosynthesis activity | cell redox homeostasis | response to oxidative stress |
photosynthesis, light reaction | violaxanthin de-epoxidase activity | cellular amino acid biosynthetic process |
photosystem | chloroplast | protein dimerization activity |
chlorophyll binding | RNA binding | cytoplasm |
photosynthetic electron transport chain | base-excision repair | ATP hydrolysis coupled proton transport |
ATPase activity | ubiquitin-dependent protein catabolic process | proton transmembrane transporter activity |
protein phosphorylation | oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor | proton-transporting V-type ATPase, V0 domain |
protein kinase activity | motor activity | cellular metabolic process |
GTP binding | myosin complex | transcription, DNA-templated |
proton transmembrane transport | proteasome core complex | DNA-directed 5’-3’ RNA polymerase activity |
ATP metabolic process | threonine-type endopeptidase activity | DNA binding |
glucose-6-phosphate isomerase activity | proteolysis involved in cellular protein catabolic process | microtubule-based movement |
gluconeogenesis | oxidoreductase activity, acting on the CH-CH group of donors | microtubule motor activity |
glycolytic process | mitochondrial outer membrane | dynein complex |
phosphoenolpyruvate carboxykinase (ATP) activity | peptidyl-prolyl cis-trans isomerase activity | aldehyde-lyase activity |
aminoacyl-tRNA ligase activity | protein peptidyl-prolyl isomerization | U5 snRNA binding |
tRNA aminoacylation for protein translation | vesicle-mediated transport | inositol catabolic process |
catalytic activity | membrane coat | inositol oxygenase activity |
trehalose biosynthetic process | actin binding | iron ion binding |
protein transport | regulation of transcription, DNA-templated | integral component of membrane |
metabolic process | ribosome binding | ATP:ADP antiporter activity |
electron transfer activity | mature ribosome assembly | nucleotide transport |
copper ion binding | transmembrane transporter activity | proton-transporting two-sector ATPase complex, proton-transporting domain |
nucleotide binding | protein import |
Create list of all downregulated GO terms:
pfamDOWN <- pfam3[pfam3$log2FoldChange <0 & pfam3$padj <0.05,]
length(pfamDOWN$GO_desc)
## [1] 3200
length(unique(pfamDOWN$GO_desc))
## [1] 552
(too long to print)
Format transcript to GO mapping for GOstats:
GOdf <- data.frame(pfam2$trinity, pfam2$GO)
GOdf$evidence <- "ISS"
names(GOdf) <- c("isoform", "GO", "evidence")
#reorder columns
GOdf <- GOdf[,c("GO","evidence","isoform")]
GOdf$GO <- as.character(GOdf$GO)
GOdf$isoform<- as.character(GOdf$isoform)
goframe=GOFrame(GOdf)
goAllFrame=GOAllFrame(goframe)
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
Prepare universe gene set, up-regulated gene set, and down-regulated gene set:
#make list of all genes
universe <-trinity2dammit1$trinity
#make list of upregulated genes
sigDEup <- sigDEgenes[sigDEgenes$log2FoldChange >0,]
uplist <- sigDEup$trinity
#make list of downregulated genes
sigDEdown<- sigDEgenes[sigDEgenes$log2FoldChange <0,]
downlist <- sigDEdown$trinity
GOterm enrichment in up-regulated genes:
upregulated = hyperGTest(
GSEAGOHyperGParams(name = "Symbiosis Upregged",
geneSetCollection=gsc,geneIds = uplist,
universeGeneIds=universe,ontology = "BP",pvalueCutoff = 0.05,conditional = TRUE,testDirection = "over"))
upregulated
## Gene to GO BP Conditional test for over-representation
## 289 GO BP ids tested (68 have p < 0.05)
## Selected gene set size: 117
## Gene universe size: 4341
## Annotation package: Based on a GeneSetCollection Object
htmlReport(upregulated, file="enriched_up_go.html")
upgoterms <-data.frame(summary(upregulated))
write.csv(upgoterms,"upgo.csv")
Full Results (conditional = FALSE): https://maggimars.github.io/AcanthSymbiontDGE/Pcordata/enriched_up_GO.html
List of significantly enriched GO terms in UP-regulated genes (when conditional = TRUE , less redundancy):
kable(data.frame(GOterm =upgoterms$Term))
GOterm |
---|
amide biosynthetic process |
peptide metabolic process |
translation |
organonitrogen compound biosynthetic process |
cellular biosynthetic process |
macromolecule biosynthetic process |
cellular nitrogen compound metabolic process |
photosynthetic electron transport chain |
proton transmembrane transport |
generation of precursor metabolites and energy |
gluconeogenesis |
monosaccharide biosynthetic process |
inorganic ion transmembrane transport |
cation transmembrane transport |
ATP metabolic process |
ribonucleoside monophosphate metabolic process |
purine nucleoside monophosphate metabolic process |
purine nucleotide metabolic process |
establishment of protein localization |
ribonucleoside triphosphate metabolic process |
purine nucleoside triphosphate metabolic process |
peptide transport |
photosynthesis |
purine-containing compound biosynthetic process |
small molecule metabolic process |
nucleoside triphosphate biosynthetic process |
purine ribonucleoside triphosphate biosynthetic process |
ribonucleotide metabolic process |
purine ribonucleoside monophosphate biosynthetic process |
nucleoside monophosphate biosynthetic process |
purine ribonucleotide biosynthetic process |
ribose phosphate biosynthetic process |
nitrogen compound transport |
monocarboxylic acid biosynthetic process |
organic substance catabolic process |
intracellular protein transport |
translational initiation |
organic acid biosynthetic process |
inositol metabolic process |
cellular macromolecule localization |
establishment of localization in cell |
vesicle-mediated transport |
protein import |
ribonucleoside diphosphate metabolic process |
nucleoside diphosphate phosphorylation |
purine nucleoside diphosphate metabolic process |
glycolytic process |
ADP metabolic process |
hexose metabolic process |
pyruvate metabolic process |
proteolysis involved in cellular protein catabolic process |
pyridine nucleotide biosynthetic process |
nucleotide transport |
nucleoside phosphate metabolic process |
ATP synthesis coupled proton transport |
response to oxidative stress |
protein catabolic process |
cellular macromolecule catabolic process |
nicotinamide nucleotide metabolic process |
cellular process |
pyridine-containing compound metabolic process |
nucleoside phosphate catabolic process |
cellular macromolecule metabolic process |
protein folding |
carboxylic acid metabolic process |
oxidoreduction coenzyme metabolic process |
nucleotide biosynthetic process |
cellular catabolic process |
GOterm enrichment in significantly down-reguated genes:
downregulated = hyperGTest(
GSEAGOHyperGParams(name = "Symbiosis DownRegged",
geneSetCollection=gsc,geneIds = downlist,
universeGeneIds=universe,ontology = "BP",pvalueCutoff = 0.05,conditional = FALSE,testDirection = "over"))
downregulated
## Gene to GO BP test for over-representation
## 902 GO BP ids tested (31 have p < 0.05)
## Selected gene set size: 1001
## Gene universe size: 4341
## Annotation package: Based on a GeneSetCollection Object
htmlReport(downregulated, file = "enrichedDOWNgostats.html")
downgoterms <- data.frame(summary(downregulated))
write.csv(downgoterms,"downgo.csv")
Full Results: https://maggimars.github.io/AcanthSymbiontDGE/Pcordata/enriched_down_GO.html
List of significantly enriched GO terms in DOWN-regulated genes:
kable(data.frame("GOterm" = downgoterms[["Term"]]))
GOterm |
---|
carbohydrate derivative metabolic process |
DNA replication |
glycosylation |
protein glycosylation |
macromolecule glycosylation |
glycoprotein metabolic process |
glycoprotein biosynthetic process |
transport |
establishment of localization |
localization |
cation transport |
carbohydrate derivative biosynthetic process |
carbohydrate derivative transport |
fructose metabolic process |
nucleobase-containing compound transport |
respiratory electron transport chain |
nucleotide-sugar transmembrane transport |
pyrimidine nucleotide-sugar transmembrane transport |
cellular respiration |
metal ion transport |
transmembrane transport |
DNA-dependent DNA replication |
macromolecule catabolic process |
protein peptidyl-prolyl isomerization |
peptidyl-proline modification |
protein homooligomerization |
protein complex oligomerization |
pathogenesis |
copper ion transport |
molybdopterin cofactor metabolic process |
prosthetic group metabolic process |
Make results file to import into the ReviGO online tool:
Down4revi <-as.data.frame(downgoterms$GOBPID)
names(Down4revi) <- c("id")
Down4revi$updown <- 0
up4revi <-as.data.frame(upgoterms$GOBPID)
names(up4revi)<-c("id")
up4revi$updown <- 1
revi<-rbind(up4revi, Down4revi)
write.csv(revi, "revi.csv")
ReviGO summarizes long lists of GO terms by removing redundant terms and clustering terms with similar meanings. The results can be plotted as an MDS plot:
source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_nmds.R")
reviGOplot
This plot has several of the GO terms labeled, but not all of them. Below is an interactive version (hover mouse over points to see the name of the GO term):
key = one.data$description
p1 <- ggplot( data = one.data, aes(key= key, x= plot_X, y= plot_Y, color = value) ) +
geom_point(alpha = I(0.6), size =7) +
scale_colour_manual(values =c("#3B9AB2", "red"), labels= c("Down", "Up")) +
geom_point(shape = 21, fill = "transparent", colour = I (alpha ("black", 0.6) ), size = 7) + scale_size_area() + theme_bw() + labs (y = "Axis 2", x = "Axis 1") +
theme(legend.key = element_blank()) + theme(text = element_text(size=16)) + theme(legend.title=element_blank())
ggplotly(p1, tooltip="key")
TreeMap for GO terms enriched in significantly upregulated gene set:
source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_treemap_UP.R")
TreeMap for GO terms enriched in significantly downregulated gene set:
source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_treemap_DOWN.R")
KEGG annotations (K numbers) were attained by submitting the transdecoder peptide sequences for the P. cordata transcriptome assemble as a GhostKoala job through https://www.kegg.jp/ghostkoala/
The protein fasta submitted to GhostKOALA can be found here: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.transdecoder.pep
Results Summary:
Results were downloaded as user_ko.csv
and are available here: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/user_ko.csv
In order to get the KEGG pathway that each K number belongs to, the KEGG API was accessed with each annotated K number:
# read in results with pathway assigned to K numbers
keggs <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/user_ko.csv", header =TRUE)
## format data frame
kegg <- keggs %>% separate(transcript, c("transcript", "protein"), "\\.")
trinity2dammit <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pc_euk_seqs.fasta.dammit.namemap.csv", header = TRUE)
trinity2dammit_k <- trinity2dammit %>%
separate(original, c("trinity"), sep = " ")
names(trinity2dammit_k) <- c("trinity", "transcript")
trinity2kegg <- merge(kegg, trinity2dammit_k, by = "transcript")
trinity2kegg<- trinity2kegg[,c(3,4)]
trinity2kegg<- trinity2kegg[trinity2kegg$KO!="",]
## Get pathway ids
koslist<- trinity2kegg$KO
paths <- character(length=0)
for(i in 1:length(koslist)) {
pathway <- system(paste0("curl http://rest.kegg.jp/link/pathway/", koslist[[i]]), intern=TRUE)
if (pathway[[1]] == "") {
paths[[i]]<- "NA"
} else if (length(pathway) == 1) {
paths[[i]] <- stri_sub(pathway[[1]], 16)
} else {
l <- length(pathway)
paths[[i]] <- stri_sub(pathway[[2]], 16)
}
}
trinity2kegg$path<-paths
## pathway ids
write.csv(trinity2kegg, "trinity_KO_pathway.csv")
Then a new data frame was made with KEGG annotations (K number AND pathway) alongside DESeq results:
trinity2kegg<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/trinity_KO_pathway.csv", header = TRUE)
trinity2kegg<-trinity2kegg[,-1]
trinity2kegg_noNA <- trinity2kegg[trinity2kegg$path != "NA",]
trinity2kegg_noNA <- trinity2kegg_noNA[trinity2kegg_noNA$path !='PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"',]
resultsframe<- as.data.frame(res)
resultsframe$trinity <- row.names(resultsframe)
resALL<-merge(trinity2kegg, resultsframe, by = 'trinity')
dim(resALL)[1]
## [1] 7720
KEGG pathway enrichment was performed with the kegga
function from the edgeR
package. The inputs for this function include a list of pathways (either corresponding to up or downregulated genes) and a ‘universe’ list including all pathways annotated in the reference transcriptome. The function was called twice, once for significantly upregulated genes and once for significantly downregulated genes.
To prepare the input lists, the dataframe with KEGG annotations and DE results was subset to contain only significantly UPregulated genes:
KOs_up<- resALL[resALL$log2FoldChange > 0 & resALL$padj <= 0.05 ,]
KOs_up_list <- KOs_up$KO
KOs_up_list <- (unique(KOs_up_list))
and only significantly DOWNregulated genes:
KOs_down<- resALL[resALL$log2FoldChange < 0 & resALL$padj <= 0.05 ,]
KOs_down_list <- KOs_down$KO
KOs_down_list <- (unique(KOs_down_list))
To get the list of all pathways with genes annotated in the P. cordata transcriptome (the “universe”):
resALLnoNA <- resALL[resALL$path != "NA",] #kegga requires no NA values
resALLnoNA <- resALLnoNA[resALLnoNA$path !='PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"',] #remove this message from accessing the API
genepathways <- data.frame(resALLnoNA$trinity, resALLnoNA$path)
names(genepathways)<-c("geneID", "pathwayID")
universelist<- genepathways$geneID
universelist<-as.vector(universelist)
length(universelist)
## [1] 4618
Perform enrichment test for upregulated genes and access the KEGG API again to acquire more information about the enriched pathways (from pathway id, get pathway name and class/description):
ko_up<- resALLnoNA[resALLnoNA$log2FoldChange > 0 & resALLnoNA$padj <= 0.05 ,]
de<-ko_up$trinity
de<- as.vector(de)
keggUP<-kegga(de, universe = universelist, species = NULL, species.KEGG = NULL, convert = FALSE,
gene.pathway = genepathways, pathway.names = NULL,
prior.prob = NULL, covariate=NULL, plot=FALSE)
names(keggUP) <- c("Pathway", "N", "DE" , "p")
keggUP$Pathway<-row.names(keggUP)
KeggUpsig<- keggUP[keggUP$p < 0.05,]
kosup<- row.names(KeggUpsig)
name <- character(length=0)
class <- character(length=0)
for(i in 1:length(kosup)) {
pathway <- system(paste0("curl http://rest.kegg.jp/get/", kosup[[i]]), intern=TRUE)
name[[i]] <- pathway[[2]]
class[[i]] <-pathway[[3]]
}
KeggUpsig$name <- name
KeggUpsig$class <- class
kable(KeggUpsig[,c(1:5)], row.names=FALSE)
Pathway | N | DE | p | name |
---|---|---|---|---|
ko03010 | 193 | 40 | 0.0000000 | NAME Ribosome |
ko00010 | 98 | 13 | 0.0002064 | NAME Glycolysis / Gluconeogenesis |
ko00860 | 43 | 6 | 0.0086868 | NAME Porphyrin and chlorophyll metabolism |
ko00061 | 35 | 7 | 0.0005283 | NAME Fatty acid biosynthesis |
ko00760 | 17 | 3 | 0.0325781 | NAME Nicotinate and nicotinamide metabolism |
ko00195 | 28 | 9 | 0.0000012 | NAME Photosynthesis |
ko00190 | 75 | 7 | 0.0377923 | NAME Oxidative phosphorylation |
ko05166 | 8 | 2 | 0.0420081 | NAME Human T-cell leukemia virus 1 infection |
ko00053 | 7 | 2 | 0.0323935 | NAME Ascorbate and aldarate metabolism |
ko05162 | 1 | 1 | 0.0422352 | NAME Measles |
Perform enrichment test for downregulated genes and access the KEGG API again to acquire more information about the enriched pathways (from pathway id, get pathway name and class/description):
ko_down<- resALLnoNA[resALLnoNA$log2FoldChange < 0 & resALLnoNA$padj <= 0.05 ,]
de<-ko_down$trinity
de<- as.vector(de)
keggDOWN<-kegga(de, universe = universelist, species = NULL, species.KEGG = NULL, convert = FALSE,
gene.pathway = genepathways, pathway.names = NULL,
prior.prob = NULL, covariate=NULL, plot=FALSE)
names(keggDOWN) <- c("Pathway", "N", "DE" , "p")
keggDOWN$Pathway<-row.names(keggDOWN)
KeggDOWNsig<- keggDOWN[keggDOWN$p < 0.05,]
kosup<- row.names(KeggDOWNsig)
name <- character(length=0)
class <- character(length=0)
for(i in 1:length(kosup)) {
pathway <- system(paste0("curl http://rest.kegg.jp/get/", kosup[[i]]), intern=TRUE)
name[[i]] <- pathway[[2]]
class[[i]] <-pathway[[3]]
}
KeggDOWNsig$name <- name
KeggDOWNsig$class <- class
kable(KeggDOWNsig[,c(2:5)], row.names=FALSE)
N | DE | p | name |
---|---|---|---|
45 | 14 | 0.0308998 | NAME Galactose metabolism |
51 | 15 | 0.0421142 | NAME Ras signaling pathway |
45 | 15 | 0.0136572 | NAME Glycosaminoglycan degradation |
28 | 11 | 0.0088506 | NAME Pyrimidine metabolism |
33 | 12 | 0.0127166 | NAME Peroxisome |
18 | 7 | 0.0368175 | NAME Folate biosynthesis |
34 | 13 | 0.0059705 | NAME Glutathione metabolism |
42 | 17 | 0.0008279 | NAME DNA replication |
5 | 3 | 0.0484208 | NAME Biotin metabolism |
5 | 3 | 0.0484208 | NAME Chlorocyclohexane and chlorobenzene degradation |
22 | 9 | 0.0129969 | NAME Fructose and mannose metabolism |
15 | 6 | 0.0456244 | NAME Antifolate resistance |
75 | 24 | 0.0038249 | NAME Oxidative phosphorylation |
7 | 4 | 0.0263902 | NAME Various types of N-glycan biosynthesis |
8 | 4 | 0.0450832 | NAME Taste transduction |
8 | 4 | 0.0450832 | NAME Glycosaminoglycan biosynthesis - heparan sulfate / heparin |
Add logFold change values to KEGG pathway maps to visualize results.
First, prepare data frames with all K number annotations and log fold change (ko_all
) and all significantly differentially expressed K numbers and their log fold changes (ko_sig
):
ko_all <-resALL[,c(2,5)]
#7,720
ko_all<-as.data.frame(ko_all[!duplicated(ko_all[,c('KO')]),])
#3,631
row.names(ko_all) <- ko_all[[1]]
ko_all<-ko_all[,2,drop=FALSE]
ko_sig<- resALL[resALL$padj <= 0.05 ,]
ko_sig<- ko_sig[complete.cases(ko_sig), c(2,5)]
ko_sig<-as.data.frame(ko_sig[!duplicated(ko_sig[,c('KO')]),])
#1,243
row.names(ko_sig) <- ko_sig$KO
ko_sig<-ko_sig[,2,drop=FALSE]
with only significantly differentially expressed genes colored:
pathvect<- c("ko03030")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
And with all KO annotations (not just sig diff expressed):
pathvect<- c("ko03030")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)
Cell Cycle (significant only)
pathvect<- c("ko04110")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
Cell Cycle with all KO annotations (not just sig diff expressed):
pathvect<- c("ko04110")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)
Nitrogen Metabolism (significant):
pathvect<- c("ko00910")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
Ribosome (significant only):
pathvect<- c("ko03010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
Photosynthesis (significant only):
pathvect<- c("ko00195")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
Chlorophyll / porphyrin metabolism (significant only):
pathvect<- c("ko00860")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
Carbon Fixation in Photosynthetic Organisms (significant only):
pathvect<- c("ko00710")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
Calvin-Benson cycle is upregulated
pathvect<- c("ko00010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
gluconeogenesis = pathway that makes new glucose
glycolysis = glycolysis produces two molecules of pyruvate, two atp, 2 NADH, 2 H20
Pentose Phosphate Pathway (significant only):
pathvect<- c("ko00030")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
TCA cycle (significant only):
pathvect<- c("ko00020")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
Respiratory oxidative phosphorylation (significant only):
pathvect<- c("ko00190")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
with all KO annotations (not just significant):
pathvect<- c("ko00190")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)
Ras Signaling Pathway (significant only):
pathvect<- c("04014")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
ERK is downregulated
MAPK Signaling Pathway (significant only):
pathvect<- c("ko04010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)
phoA and ehap1
Alkaline phosphatase genes were annotated in the P. cordata transcriptome by blasting the transcriptome against a local database of Emiliania huxleyi phosphorus metabolism genes (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/PhosMetab.pep) using dammit.
Results file: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.x.Pmetab.pep.crbl.csv
Two alkaline phosphates were found that were also expressed:
goi<- c("TRINITY_DN16018_c0_g1_i1", "TRINITY_DN19611_c0_g1_i1")
stopifnot(all(goi %in% names(dds)))
tcounts <- t(log2((counts(dds[goi, ], normalized=TRUE, replaced=FALSE)+.5))) %>%
merge(colData(dds), ., by="row.names") %>%
gather(gene, expression, (ncol(.)-length(goi)+1):ncol(.))
annots<-c(rep("ehap1", 15), rep("phoA", 15))
tcounts$annots <- annots
tcounts$annots_f = factor(tcounts$annots, levels=c("phoA", "ehap1"))
pal<-wes_palette(name = "Zissou1", type="discrete")
colors<- c(pal[2],pal[3])
goiplot<- ggplot(tcounts, aes(condition, expression, fill=condition)) +geom_dotplot(binaxis='y', stackdir='center', dotsize=1)+
facet_wrap(~annots_f, scales="fixed") +
labs(x="",
y="Expression (log normalized counts)",
fill="Phenotype",
title="") + theme_bw() + scale_fill_manual(values= colors) + theme(text = element_text(size=19)) + theme(legend.position="none") + theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +scale_x_discrete(labels=c("F" = "Free-living", "S" = "Symbiotic"))
goiplot
Chloroplast division genes wer annotated by creating a local database of gene sequences and then blasting the P. cordata transcriptome against the database using dammit software.
local database: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/chloroplast_division_proteins.fasta
results: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/chlordiv_annotations.csv
DRP5B, FtsZ, PDR1
chlordiv <- read.csv("./pc_euk_seqs.fasta.dammit/chlordiv_annotations1.csv")
goi<- as.character(chlordiv$trinity)
stopifnot(all(goi %in% names(dds)))
tcounts <- t(log2((counts(dds[goi, ], normalized=TRUE, replaced=FALSE)+.5))) %>%
merge(colData(dds), ., by="row.names") %>%
gather(gene, expression, (ncol(.)-length(goi)+1):ncol(.))
chlordiv$gene <- chlordiv$trinity
tcounts<- merge(tcounts, chlordiv, by= "gene")
pal<-wes_palette(name = "Zissou1", type="discrete")
colors<- c(pal[2],pal[3])
tcounts$identifier = paste(tcounts$geneName ,tcounts$seqid )
tcounts <- tcounts %>% filter(gene %in% c("TRINITY_DN10430_c0_g4_i2", "TRINITY_DN10705_c0_g1_i1", "TRINITY_DN1346_c1_g2_i1"))
goiplot<- ggplot(tcounts, aes(condition, expression, fill=condition)) +geom_dotplot(binaxis='y', stackdir='center', dotsize=1)+
facet_wrap(~geneName, scales="fixed") +
labs(x="",
y="Expression (log normalized counts)",
fill="Phenotype",
title="") + theme_bw() + scale_fill_manual(values= colors) + theme(text = element_text(size=11)) + theme(legend.position="none") + theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +scale_x_discrete(labels=c("F" = "Free-living", "S" = "Symbiotic"))
goiplot
TRINITY_DN10430_c0_g4_i2 - not differentially expressed (padj = 0.115076, lfc = -1.733) → PDR1
TRINITY_DN10705_c0_g1_i1 - not differentially expressed (padj = 0.1150529, lfc = 1.9) → DRP5B
TRINITY_DN1346_c1_g2_i1 - not differentially expressed (padj = 0.732, lfc = -0.6) → FtsZ
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] pathview_1.22.3 org.Hs.eg.db_3.7.0
## [3] treemap_2.4-2 GO.db_3.7.0
## [5] knitr_1.29 jsonlite_1.7.0
## [7] data.table_1.13.0 reshape2_1.4.3
## [9] plotly_4.9.2.1 shiny_1.5.0
## [11] ggrepel_0.8.1 scales_1.1.1
## [13] GSEABase_1.44.0 annotate_1.60.1
## [15] XML_3.98-1.20 GOstats_2.48.0
## [17] graph_1.60.0 Category_2.48.1
## [19] Matrix_1.2-18 AnnotationDbi_1.44.0
## [21] edgeR_3.24.3 limma_3.38.3
## [23] forcats_0.4.0 stringr_1.4.0
## [25] dplyr_1.0.2 purrr_0.3.4
## [27] tidyr_1.1.2 tibble_3.0.3
## [29] tidyverse_1.3.0 stringi_1.4.3
## [31] pheatmap_1.0.12 wesanderson_0.3.6.9000
## [33] DESeq2_1.22.2 SummarizedExperiment_1.12.0
## [35] DelayedArray_0.8.0 BiocParallel_1.16.6
## [37] matrixStats_0.56.0 Biobase_2.42.0
## [39] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
## [41] IRanges_2.16.0 S4Vectors_0.20.1
## [43] BiocGenerics_0.28.0 ggplot2_3.3.2
## [45] tximport_1.10.1 readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.9 Hmisc_4.3-0
## [4] plyr_1.8.6 igraph_1.2.5 lazyeval_0.2.2
## [7] splines_3.5.0 gridBase_0.4-7 digest_0.6.25
## [10] htmltools_0.5.0 fansi_0.4.0 magrittr_1.5
## [13] checkmate_1.9.4 memoise_1.1.0 cluster_2.1.0
## [16] Biostrings_2.50.2 modelr_0.1.8 colorspace_1.4-1
## [19] blob_1.2.0 rvest_0.3.6 haven_2.3.1
## [22] xfun_0.16 crayon_1.3.4 RCurl_1.95-4.12
## [25] genefilter_1.64.0 survival_3.2-3 glue_1.4.2
## [28] gtable_0.3.0 zlibbioc_1.28.0 XVector_0.22.0
## [31] Rgraphviz_2.26.0 DBI_1.1.0 Rcpp_1.0.5
## [34] viridisLite_0.3.0 xtable_1.8-4 htmlTable_2.0.1
## [37] foreign_0.8-72 bit_1.1-14 Formula_1.2-3
## [40] AnnotationForge_1.24.0 htmlwidgets_1.5.1 httr_1.4.2
## [43] RColorBrewer_1.1-2 acepack_1.4.1 ellipsis_0.3.1
## [46] pkgconfig_2.0.3 nnet_7.3-12 dbplyr_1.4.2
## [49] locfit_1.5-9.1 tidyselect_1.1.0 rlang_0.4.7
## [52] later_1.1.0.1 munsell_0.5.0 cellranger_1.1.0
## [55] tools_3.5.0 cli_2.0.2 generics_0.0.2
## [58] RSQLite_2.2.0 broom_0.7.0 evaluate_0.14
## [61] fastmap_1.0.1 yaml_2.2.0 bit64_0.9-7
## [64] fs_1.3.1 KEGGREST_1.22.0 RBGL_1.58.2
## [67] mime_0.7 KEGGgraph_1.42.0 xml2_1.2.2
## [70] compiler_3.5.0 rstudioapi_0.11 png_0.1-7
## [73] reprex_0.3.0 geneplotter_1.60.0 lattice_0.20-38
## [76] vctrs_0.3.3 pillar_1.4.6 lifecycle_0.2.0
## [79] bitops_1.0-6 httpuv_1.5.4 R6_2.4.1
## [82] latticeExtra_0.6-28 promises_1.1.1 gridExtra_2.3
## [85] assertthat_0.2.1 withr_2.1.2 GenomeInfoDbData_1.2.0
## [88] hms_0.5.3 grid_3.5.0 rpart_4.1-15
## [91] rmarkdown_2.3 lubridate_1.7.4 base64enc_0.1-3