The following .html report details the analyses performed to detect and classify Phaeocystis cordata genes that were up and downregulated when cellls are acantharian endosymbionts.

The raw R markdown file used to build this document (Pcordata_gene_expression_analysis.Rmd), as well as all the intermediate data files, are available in the GitHub repository: https://github.com/maggimars/PcordataSymbiosisDGE

1 Sequence data

The raw sequences from this study are available from the NCBI sequence read archive (SRA) with the accession numbers listed below. Sample names beginning with “PC” are Phaeocystis cordata culture replicates and samples beginning with “A” are individual acantharian holobionts.

Sample SRA Run Number SRA Experiment Number
A7 SRR10997202 SRX7658295
A12 SRR10997200 SRX7658297
A14 SRR10997199 SRX7658298
A20 SRR10997198 SRX7658299
A21 SRR10997197 SRX7658300
A22 SRR10997196 SRX7658301
A25 SRR10997215 SRX7658282
A27 SRR10997214 SRX7658283
A28 SRR10997213 SRX7658284
A29 SRR10997212 SRX7658285
A35 SRR10997211 SRX7658286
A36 SRR10997210 SRX7658287
PC2 SRR10997209 SRX7658288
PC3 SRR10997208 SRX7658289
PC4 SRR10997207 SRX7658290

2 Packages

The following packages are used throughout the analyses:

library(readr)
library(tximport)
library(ggplot2)
library(DESeq2)
library(wesanderson)
library(pheatmap)
library(stringi)
library(tidyverse)
library(edgeR)
library(GOstats)
library(GSEABase)
library(scales)
library(ggrepel)
library(shiny)
library(plotly)
library(reshape2)
library(data.table)
library(jsonlite)
library(knitr)
library(pathview)

3 Internal quality control with ERCC spike-in

The External RNA Controls Consortium (ERCC) produces spike-in controls for RNA-seq experiments. The ERCC spike-in controls were added to RNA extracted from culture replicates before poly-A selection and cDNA synthesis with the SMART-seq v4 kit. By comparing the number of reads mapping to the ERCC sequences with the known concentrations of RNA molecules in the spike-in mix, bias introduced through PCR amplification or other steps of library-preparation and sequencing can be detected. This can be accomplished by plotting the concentration by the FPKM of mapped reads and performing a regression analysis.

The quality trimmed P. cordata culture reads were mapped to the ERCC sequences and counted with RSEM.

To plot results and perform regression analyses:

Import the counts of reads mapping to ERCC spike-in standard sequences in each sample:

`PC2.ercc`<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC2.genes.results")
`PC3.ercc` <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC3.genes.results")
`PC4.ercc` <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC4.genes.results")
FPKM<- data.frame(PC2.ercc$gene_id,PC2.ercc$FPKM,PC3.ercc$FPKM,PC4.ercc$FPKM)
#add column names
names(FPKM)<-c("gene_id","PC2","PC3","PC4") 

Import ERCC spike-in standard concentrations for each ERCC sequence:

ercc_conc <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/cms_095046.txt", stringsAsFactors=FALSE)
ercc_conc<-ercc_conc[,2:7]
ercc_conc<-ercc_conc[order(ercc_conc$'ERCC ID'),]

Plot ERCC concentrations (log2Conc.) by their counts (Log2FPKM) in each sample and apply a linear regression. The slope of the linear regression line and the R-squared of the regression model should both be close to 1. If so, we can be fairly confident that PCR and other handling did not cause substantial or sytematic bias in the results.

FPKM$conc <- ercc_conc$'concentration in Mix 1 (attomoles/ul)'
row.names(FPKM) <- FPKM$gene_id
FPKM <- FPKM[,-1]

FPKM[FPKM==0] <- NA
FPKM <- FPKM[complete.cases(FPKM),]

for (i in c(1:4)) {
  FPKM[[i]] = log2(FPKM[[i]])
}

lm_eqn <- function(df){
  y<-df[[1]]
  x<-df[[2]]
  m <- lm(y ~ x, df);
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, 
                   list(a = format(coef(m)[1], digits = 2), 
                        b = format(coef(m)[2], digits = 2), 
                        r2 = format(summary(m)$r.squared, digits = 3)))
  as.character(as.expression(eq));                 
}

EqDF <- data.frame( c(lm_eqn(data.frame(FPKM$PC2, FPKM$conc)),lm_eqn(data.frame(FPKM$PC3, FPKM$conc)), lm_eqn(data.frame(FPKM$PC4, FPKM$conc)))  , c("PC2", "PC3", "PC4" ))
names(EqDF) <- c("eq", "variable")

mERCC <- melt(FPKM, id.vars = c("conc"))

spikeinPlot<-ggplot(mERCC, aes(x=conc,y=value)) +geom_point()+ geom_smooth(method=lm, se=FALSE, color="black") +
  theme(panel.background =element_blank(), panel.grid.major = element_blank(), panel.grid.minor =element_blank()) +
  theme(axis.line = element_line(color = 'black'))+
  labs(x=expression(paste("Log"[2]," Concentration", sep=""))) +
  labs(y=expression(paste("Log"[2]," FPKM", sep=""))) + facet_wrap(~variable)

spikeinPlotWeq <- spikeinPlot +  geom_text(data=EqDF,aes(x = 0, y = 18,label=eq), parse = TRUE, inherit.aes=FALSE) + facet_grid(variable~.)
spikeinPlotWeq
## `geom_smooth()` using formula 'y ~ x'

In this case, the slopes of the fitted lines and the R-squared for the regression models are both close to 1, so we can be relatively confident in our data.

4 Transcriptome assembly and annotation

The Phaeocystis cordata CCMP3104 transcriptome was assembled with RNA reads from 3 biological culture replicates of P. cordata grown in standard culture conditions with L1 media. RNA was extracted in late exponential phase and libraries were prepared following the same protocols as the individual acantharian holobiont transcriptomes (i.e. with SMART-seq v4 and Nextera XT). Following RNA extraction, but before library preparation, diluted ERCC spike-in mix 1 was added to each sample. The RNA libraries were pooled and sequenced with an Illumina HiSeq 4000. Following sequencing, the reads were trimmed and quality filtered with Trimmomatic software. The reads mapping to ERCC sequences (after trimming) in each sample were filtered out to keep them from being assembled into the transcriptome. Remaining reads from the three replicates were then assembled using Trinity de novo transcriptome assembler. The assembly was dereplicated with CD-HIT EST (95%) and then the entire transcriptome was BLASTed against the nr/nt reference database. Assembled sequences that aligned best to a bacterial sequence from the database were removed from the assembly.

4.1 Assembly

Trimmomatic command for quality trimming raw reads from all samples:

java -jar trimmomatic-0.36.jar PE -phred33 sampleID_R1_001.fastq.gz \
sampleID_R2_001.fastq.gz \
sampleID_1_paired.fq sampleID_1_unpaired.fq \
sampleID_2_paired.fq sampleID_2_unpaired.fq \
ILLUMINACLIP:adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Number of sequencing reads per sample before and after quality trimming with Trimmomatic:

sample number of sequencing reads remaining after trimming
A35 85,124,248 72,558,691
A7 96,255,554 83,430,343
A12 83,855,671 72,850,246
A14 103,587,165 88,442,308
A22 96,529,442 79,528,859
A25 91,208,346 76,733,652
A27 90,585,006 78,488,845
A20 348,088,442 304,901,030
A21 303,661,266 268,735,500
A28 299,236,369 254,086,670
A29 137,747,279 121,041,843
A36 142,694,491 124,725,007
PC4 83,711,987 68,530,568
PC2 81,369,009 67,254,517
PC3 77,926,482 62,200,308
total 2,121,580,757 1,823,508,387

Reads from the P. cordata culture samples that mapped to ERCC standards were removed by first mapping reads to a bowtie2 reference index prepared from the ERCC sequences and then filtering out the mapped reads:

bowtie2 -t -x ERCC_reference_bt2_index \
-1 sampleID_1_paired.fq \
-2 sampleID_2_paired.fq \
-S sampleID_ercc.sam

samtools view -bS sampleID_ercc.sam > sampleID.bam

samtools sort sampleID.bam sampleID_sorted

samtools view -b -f 13 sampleID_sorted.bam > sampleID_unmapped.bam

samtools sort -n sampleID_unmapped.bam sampleID.qsort

bedtools bamtofastq -i sampleID.qsort.bam -fq sampleID_1_paired.fq -fq2 sampleID_2_paired.fq

The filtered reads from the P. cordata culture replicates were then assembled with Trinity:

Trinity --seqType fq --max_memory 475G \
--left  PC2_1_paired.fq,PC3_1_paired.fq,PC4_1_paired.fq \
--right PC2_2_paired.fq,PC3_1_paired.fq,PC4_2_paired.fq \
--CPU 12

The assembly was dereplicated with CD-HIT_EST:

cd-hit-est -i Trinity.fasta -o Trinity_Pc_clustered_95 -c 0.95 -n 8 -p 1 -g 1 -M 200000 -T 8 -d 40

The Trinity assembly was filtered to remove bacterial contamination by first running a blastn (v2.6.0+) against the nr/nt NCBI database:

blastn -query $DATA/Trinity_Pc_clustered_95.fasta -task blastn -db nt -num_threads 12 -max_target_seqs 1 -outfmt 5 > pcTrinity_PC_clustered_95Blast.xml

The blast results were then parsed with the following script: TrinityBlastXML.ipynb: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/TrinityBlastXML.ipynb

and reads that had their highest quality blast result corresponding to a bacterial sequence were removed from the transcriptome with the following script: FilterTrinityEukNotEuk.ipynb: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/FilterTrinityEukNotEuk.ipynb

This script produced the filtered transcriptome file pc_euk_seqs.fasta (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta) that was used in read mapping and counting for differential gene expression testing.

The “completeness” of the transcriptome was evaluated by determining how many Benchmarking Universal Single-Copy Orthologs (BUSCOs) were complete, fragmented, or missing in the transcriptome. This was accomplished with the BUSCO software (v3) and the translated amino acid sequences for the P. cordata transcriptome. The BUSCO software was also used to assess the completeness of the P. cordata transcriptome previously sequenced as part of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) as a comparison point. BUSCO scores were calculated for Eukaryote BUSCOs and protist BUSCOs.

The P. cordata assembled for this project is more complete than the MMETSP transcriptome, which is likely due to much deeper sequencing performed for this study.

4.2 Annotation

The P. cordata transcriptome was annotated with the dammit de novo transcriptome annotation software: https://github.com/dib-lab/dammit

Pfam annotation results (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.x.pfam-A.gff3) were parsed with the following script: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/Pfam_gffParsing.ipynb

The final pfam annotation file is pfam.csv (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pfam.csv)

The assembly statistics and annotation results are summarized in the below table (including GO and KEGG annotations, which are described in later sections):

P. cordata CCMP3104 transcriptome assembly & annotation statistics
Number of contigs 32,621
Sum bp 37,640,601
min contig length 201
max contig length 13,685
med contig length 955
mean contig length 1153
N50 1866
GC% 67.4
With Pfam annotation 13,548 (41.5%)
With GO annotation 7,119 (21.8%)
With KO annotation 7,719 (23.7%)
With Kegg Pathway 4,618 (14.2%)

5 Read mapping and Counting

Quality filitered reads from the P. cordata culture replicates and individual acantharians were mapped to the P. cordata transcriptome (pc_euk_seqs.fasta) using the Salmon read mapping software.

salmon quant -i PC_index -l A \
-1 $DATA2/sampleID_1_paired.fq \
-2 $DATA2/sampleID_2_paired.fq \
-p 10 --validateMappings -o quants/sampleID_quant

The package tximport was used for importing Salmon read count results files to R.

All Salmon results are available here: https://github.com/maggimars/PcordataSymbiosisDGE/tree/master/quants

Salmon mapping rates (percent of the reads from each sample that mapped to the P. cordata transcriptome assembly):

sample_names<- c("A7", "A12", "A14", "A20", "A21", "A22","A25", "A27", "A28", "A29", "A35", "A36", "PC2", "PC3", "PC4")

mapping_rates<- c(fromJSON("quants/A7_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A12_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A14_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A20_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A21_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A22_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A25_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A27_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A28_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A29_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A35_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A36_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC2_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC3_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC4_quant/aux_info/meta_info.json")$percent_mapped
)

mappingrate_df<- data.frame("Sample" = sample_names, "Mapping Rate" = round(mapping_rates, digits=2))

kable(mappingrate_df, align = "l")
Sample Mapping.Rate
A7 3.73
A12 1.73
A14 2.56
A20 1.12
A21 3.04
A22 3.29
A25 2.56
A27 4.86
A28 1.72
A29 1.62
A35 1.53
A36 3.87
PC2 73.66
PC3 75.30
PC4 71.68

Read in results files for each sample into R with tximport:

samples<-read.table("quants/samples.txt", header = TRUE)
files <- file.path("quants", samples$run, "quant.sf")
names(files) <- samples$run
tx2gene <- read_csv("quants/tx2gene.csv")
txi <- tximport(files, type="salmon", tx2gene= tx2gene)

6 DESeq2 differential expression testing

create DESeq data object:

ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = samples,
                                   design = ~ condition)
## using counts and average transcript lengths from tximport

Pre-filtering:

“Note that more strict filtering to increase power is automatically applied via independent filtering on the mean of normalized counts within the results function. The adjusted p values for the genes which do not pass the filter threshold are set to NA” – from DESeq2 manual

keep <- rowSums(counts(ddsTxi )) >= 10
dds <- ddsTxi[keep,]

Re-level so that log-fold change is in relation to free-living (e.g. if logfold change is 3, then that gene is expressed 3x more in symbiotic samples than free-living replicates):

dds$condition <- relevel(dds$condition, ref = "F")
dds <- DESeq(ddsTxi)
res <- results(dds, alpha=0.05)
head(res, n =3)
## log2 fold change (MLE): condition S vs F 
## Wald test p-value: condition S vs F 
## DataFrame with 3 rows and 6 columns
##                                 baseMean    log2FoldChange            lfcSE
##                                <numeric>         <numeric>        <numeric>
## TRINITY_DN1_c0_g1_i1  0.0178204337170013  1.98354048709337  3.7365328592521
## TRINITY_DN10_c0_g1_i1   14.8187038538356 -7.72809144623968 2.37954484998336
## TRINITY_DN10_c1_g1_i1   7.55170822351109 -6.75555987522243 2.38815198139567
##                                    stat              pvalue                padj
##                               <numeric>           <numeric>           <numeric>
## TRINITY_DN1_c0_g1_i1  0.530850540276097   0.595522354038567                  NA
## TRINITY_DN10_c0_g1_i1 -3.24771833835942  0.0011633436943023 0.00916218836691006
## TRINITY_DN10_c1_g1_i1 -2.82878138738657 0.00467256046288919  0.0210043806317098

“condition S vs F” confirms that we are testing symbiotic samples against free-living samples.

summary(res)
## 
## out of 32210 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 589, 1.8%
## LFC < 0 (down)     : 5211, 16%
## outliers [1]       : 423, 1.3%
## low counts [2]     : 17088, 53%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

589 genes are significantly upregulated in symbiosis and 5211 genes are significantly downregulated in symbiosis

convert DESeq results to a data frame:

resultsall<- as.data.frame(res)
resultsall$Trinity <- row.names(resultsall)

Volcano Plot (plots “genes” by log2fold change and their adjusted p-values, with significantly differentially expressed genes indicated in red)

res_table<- data.frame(res)
# Obtain logical vector regarding whether padj values are less than 0.05 (to color significant points)
threshold <- res_table$padj < 0.05 
res_table$threshold <- threshold

ggplot(res_table) +
        geom_point(aes(x=log2FoldChange, y=-log10(padj), colour=threshold)) +
        ggtitle("") + theme_bw()+ scale_colour_manual(values = c("#bbbbbb", "red")) +
        xlab("log2 fold change") + 
        ylab("-log10 adjusted p-value") +
        scale_x_continuous(limits = c(-15,15)) +
        scale_y_continuous(limits = c(0,30)) +
        theme(legend.position = "none",
              plot.title = element_text(size = rel(1.5), hjust = 0.5),
              axis.title = element_text(size = rel(1.25)))  + theme(text = element_text(size=20)) 

Count Data Transformations for ranking and visualizations (e.g. PCA plots and heatmaps)

Variance stabilizing transformation (vst): “This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size.”" – from DESeq documentation

#Heat Map of the variance stabilizing transformed count matrix (vsd):
vsd<-vst(dds, blind = TRUE)
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:1000]
pheatmap(assay(vsd)[select,], cluster_rows=TRUE, show_rownames=FALSE,
         cluster_cols=TRUE)

#PCA by condition (Free-living or Symbiotic):
data1 <- plotPCA(vsd, returnData=TRUE)
data1$group<-gsub(" : ","_",as.character(data1$group))
percentVar1 <- round(100 * attr(data1, "percentVar"))

PCA<-ggplot(data1, aes(PC1, PC2, color = condition))+ theme_bw()+
  geom_point(size=9, alpha = 0.8) + scale_colour_manual(values = c("#44aabb","#bbbbbb"), labels = c("Free-living", "Symbiotic"), name = "")+
  xlab(paste0("PC1: ",percentVar1[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar1[2],"% variance")) +
  theme(text = element_text(size=20)) 
PCA

The majority of variance in gene expression (63%) is explained by sample type. There is much more variation between symbiotic samples than between culture replicates.

7 GO term enrichment

7.1 Annotation

The Pfam annotations for the P. cordata transcriptome were assigned using HMMER and HMM profiles for the Pfam database. Annotations with scores < 1e-05 were retained and included in the following analyses. The GO term to Pfam mapping comes from: http://current.geneontology.org/ontology/external2go/pfam2go.

Get dataframe of significantly differentially expressed genes:

sigDEgenes <- subset(resultsall, padj < 0.05)
sigDEgenes$trinity <- row.names(sigDEgenes)

Import Pfam annotation results from dammit and fix pfam annotation formatting to match that used in the pfam2go mapping file:

pfamannot<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pfam.csv", header = TRUE)
pfamannot$Dbxref <- stri_sub(pfamannot$Dbxref , from = 2, to = -2)
pfamannot$Pfam <- gsub("\\..*","", pfamannot$Dbxref)
dim(pfamannot)
## [1] 13548     6

13,548 transcripts have good (score<1e-05) pfam annotations

Import and format trinity ID to dammit transcript ID mapping file:

trinity2dammit <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pc_euk_seqs.fasta.dammit.namemap.csv", header = TRUE)
trinity2dammit1 <- trinity2dammit %>%
  separate(original, c("trinity"), sep = " ") 
## Warning: Expected 1 pieces. Additional pieces discarded in 32621 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
names(trinity2dammit1) <- c("trinity", "seqid")

Import and format pfam2go mapping file:

pfam2go <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pfam2go4R.txt", header = FALSE)
pfam2go1 <- pfam2go %>%
  separate(V1, c('V1_1', 'V1_2'), sep = '>') %>%
  separate(V1_1, c("Pfam", "name"), sep = " ") %>%
  separate(V1_2, c("GO_desc", "GO"), sep = ";")
## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [3361, 6204].
## Warning: Expected 2 pieces. Additional pieces discarded in 10503 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 10503 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
pfam2go1$GO <-stri_sub(pfam2go1$GO, from = 2)

Create transcript to GO reference:

pfam1 <- merge(pfamannot, pfam2go1, by = "Pfam")
pfam2<- merge(pfam1, trinity2dammit1, by ="seqid" )
pfam3 <- merge(pfam2, sigDEgenes, by = "trinity")
pfamUP <- pfam3[pfam3$log2FoldChange >0 & pfam3$padj <0.05,]
pfam4<-merge(pfamannot, trinity2dammit1, by = "seqid")
pfam5<-merge(pfam4, sigDEgenes, by = "trinity")
length(unique(pfam1$seqid))
## [1] 7119

7119 annotated transcripts also have a GO term assigned.

Create list of all upregulated GO terms:

length(pfamUP$GO_desc)
## [1] 411
length(unique(substring(pfamUP$GO_desc,5,nchar(pfamUP$GO_desc))))
## [1] 134
unique_upGOlist <- unique(substring(pfamUP$GO_desc,5,nchar(pfamUP$GO_desc))) 
unique_upGOlist[135] <- ""
kable(data.frame(unique_upGOlist[1:45],unique_upGOlist[46:90], unique_upGOlist[91:135] ))
unique_upGOlist.1.45. unique_upGOlist.46.90. unique_upGOlist.91.135.
transmembrane transport biosynthetic process inorganic diphosphatase activity
membrane proton-transporting ATP synthase activity, rotational mechanism hydrogen-translocating pyrophosphatase activity
ion channel activity ATP synthesis coupled proton transport fructose-bisphosphate aldolase activity
ion transport proton-transporting ATP synthase complex, catalytic core F(1) inositol-3-phosphate synthase activity
ATP binding phosphoglycerate kinase activity phospholipid biosynthetic process
nucleic acid binding kinase activity inositol biosynthetic process
structural constituent of ribosome COPII vesicle coat GTPase activity
ribosome intracellular protein transport iron-sulfur cluster binding
translation ER to Golgi vesicle-mediated transport nitrogen compound metabolic process
purine nucleotide biosynthetic process eukaryotic translation initiation factor 3 complex glutamate synthase activity
phosphoribosylaminoimidazolecarboxamide formyltransferase activity translational initiation carbohydrate metabolic process
IMP cyclohydrolase activity translation initiation factor activity hydrolase activity, hydrolyzing O-glycosyl compounds
protein binding translation initiation factor binding primary metabolic process
photosystem II stabilization acetyl-CoA carboxylase activity zinc ion binding
photosystem II assembly fatty acid biosynthetic process transcription coregulator activity
oxygen evolving activity response to stress nucleus
photosystem II oxygen evolving complex unfolded protein binding histone acetyltransferase activity
intracellular protein folding cytoskeleton organization
oxidation-reduction process rRNA binding heme binding
oxidoreductase activity hydrolase activity pyridoxal phosphate binding
photosynthetic electron transport in photosystem II coenzyme binding peroxidase activity
electron transporter, transferring electrons within the cyclic electron transport pathway of photosynthesis activity cell redox homeostasis response to oxidative stress
photosynthesis, light reaction violaxanthin de-epoxidase activity cellular amino acid biosynthetic process
photosystem chloroplast protein dimerization activity
chlorophyll binding RNA binding cytoplasm
photosynthetic electron transport chain base-excision repair ATP hydrolysis coupled proton transport
ATPase activity ubiquitin-dependent protein catabolic process proton transmembrane transporter activity
protein phosphorylation oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor proton-transporting V-type ATPase, V0 domain
protein kinase activity motor activity cellular metabolic process
GTP binding myosin complex transcription, DNA-templated
proton transmembrane transport proteasome core complex DNA-directed 5’-3’ RNA polymerase activity
ATP metabolic process threonine-type endopeptidase activity DNA binding
glucose-6-phosphate isomerase activity proteolysis involved in cellular protein catabolic process microtubule-based movement
gluconeogenesis oxidoreductase activity, acting on the CH-CH group of donors microtubule motor activity
glycolytic process mitochondrial outer membrane dynein complex
phosphoenolpyruvate carboxykinase (ATP) activity peptidyl-prolyl cis-trans isomerase activity aldehyde-lyase activity
aminoacyl-tRNA ligase activity protein peptidyl-prolyl isomerization U5 snRNA binding
tRNA aminoacylation for protein translation vesicle-mediated transport inositol catabolic process
catalytic activity membrane coat inositol oxygenase activity
trehalose biosynthetic process actin binding iron ion binding
protein transport regulation of transcription, DNA-templated integral component of membrane
metabolic process ribosome binding ATP:ADP antiporter activity
electron transfer activity mature ribosome assembly nucleotide transport
copper ion binding transmembrane transporter activity proton-transporting two-sector ATPase complex, proton-transporting domain
nucleotide binding protein import

Create list of all downregulated GO terms:

pfamDOWN <- pfam3[pfam3$log2FoldChange <0 & pfam3$padj <0.05,]
length(pfamDOWN$GO_desc)
## [1] 3200
length(unique(pfamDOWN$GO_desc))
## [1] 552

(too long to print)

7.2 GO term enrichment testing with GOstats

Format transcript to GO mapping for GOstats:

GOdf <- data.frame(pfam2$trinity, pfam2$GO)
GOdf$evidence <- "ISS"
names(GOdf) <- c("isoform", "GO", "evidence")
#reorder columns
GOdf <- GOdf[,c("GO","evidence","isoform")]  
GOdf$GO <- as.character(GOdf$GO)
GOdf$isoform<- as.character(GOdf$isoform)
goframe=GOFrame(GOdf)
goAllFrame=GOAllFrame(goframe)
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())

Prepare universe gene set, up-regulated gene set, and down-regulated gene set:

#make list of all genes
universe <-trinity2dammit1$trinity 
#make list of upregulated genes
sigDEup <- sigDEgenes[sigDEgenes$log2FoldChange >0,] 
uplist <- sigDEup$trinity
#make list of downregulated genes
sigDEdown<- sigDEgenes[sigDEgenes$log2FoldChange <0,]
downlist <- sigDEdown$trinity

GOterm enrichment in up-regulated genes:

upregulated = hyperGTest(
  GSEAGOHyperGParams(name = "Symbiosis Upregged",
                     geneSetCollection=gsc,geneIds = uplist,
                     universeGeneIds=universe,ontology = "BP",pvalueCutoff = 0.05,conditional = TRUE,testDirection = "over"))

upregulated
## Gene to GO BP Conditional test for over-representation 
## 289 GO BP ids tested (68 have p < 0.05)
## Selected gene set size: 117 
##     Gene universe size: 4341 
##     Annotation package: Based on a GeneSetCollection Object
htmlReport(upregulated, file="enriched_up_go.html")
upgoterms <-data.frame(summary(upregulated))
write.csv(upgoterms,"upgo.csv")

Full Results (conditional = FALSE): https://maggimars.github.io/AcanthSymbiontDGE/Pcordata/enriched_up_GO.html

List of significantly enriched GO terms in UP-regulated genes (when conditional = TRUE , less redundancy):

kable(data.frame(GOterm =upgoterms$Term))
GOterm
amide biosynthetic process
peptide metabolic process
translation
organonitrogen compound biosynthetic process
cellular biosynthetic process
macromolecule biosynthetic process
cellular nitrogen compound metabolic process
photosynthetic electron transport chain
proton transmembrane transport
generation of precursor metabolites and energy
gluconeogenesis
monosaccharide biosynthetic process
inorganic ion transmembrane transport
cation transmembrane transport
ATP metabolic process
ribonucleoside monophosphate metabolic process
purine nucleoside monophosphate metabolic process
purine nucleotide metabolic process
establishment of protein localization
ribonucleoside triphosphate metabolic process
purine nucleoside triphosphate metabolic process
peptide transport
photosynthesis
purine-containing compound biosynthetic process
small molecule metabolic process
nucleoside triphosphate biosynthetic process
purine ribonucleoside triphosphate biosynthetic process
ribonucleotide metabolic process
purine ribonucleoside monophosphate biosynthetic process
nucleoside monophosphate biosynthetic process
purine ribonucleotide biosynthetic process
ribose phosphate biosynthetic process
nitrogen compound transport
monocarboxylic acid biosynthetic process
organic substance catabolic process
intracellular protein transport
translational initiation
organic acid biosynthetic process
inositol metabolic process
cellular macromolecule localization
establishment of localization in cell
vesicle-mediated transport
protein import
ribonucleoside diphosphate metabolic process
nucleoside diphosphate phosphorylation
purine nucleoside diphosphate metabolic process
glycolytic process
ADP metabolic process
hexose metabolic process
pyruvate metabolic process
proteolysis involved in cellular protein catabolic process
pyridine nucleotide biosynthetic process
nucleotide transport
nucleoside phosphate metabolic process
ATP synthesis coupled proton transport
response to oxidative stress
protein catabolic process
cellular macromolecule catabolic process
nicotinamide nucleotide metabolic process
cellular process
pyridine-containing compound metabolic process
nucleoside phosphate catabolic process
cellular macromolecule metabolic process
protein folding
carboxylic acid metabolic process
oxidoreduction coenzyme metabolic process
nucleotide biosynthetic process
cellular catabolic process

GOterm enrichment in significantly down-reguated genes:

downregulated = hyperGTest(
  GSEAGOHyperGParams(name = "Symbiosis DownRegged",
                     geneSetCollection=gsc,geneIds = downlist,
                     universeGeneIds=universe,ontology = "BP",pvalueCutoff = 0.05,conditional = FALSE,testDirection = "over"))

downregulated
## Gene to GO BP  test for over-representation 
## 902 GO BP ids tested (31 have p < 0.05)
## Selected gene set size: 1001 
##     Gene universe size: 4341 
##     Annotation package: Based on a GeneSetCollection Object
htmlReport(downregulated, file = "enrichedDOWNgostats.html")
downgoterms <- data.frame(summary(downregulated))
write.csv(downgoterms,"downgo.csv")

Full Results: https://maggimars.github.io/AcanthSymbiontDGE/Pcordata/enriched_down_GO.html

List of significantly enriched GO terms in DOWN-regulated genes:

kable(data.frame("GOterm" = downgoterms[["Term"]]))
GOterm
carbohydrate derivative metabolic process
DNA replication
glycosylation
protein glycosylation
macromolecule glycosylation
glycoprotein metabolic process
glycoprotein biosynthetic process
transport
establishment of localization
localization
cation transport
carbohydrate derivative biosynthetic process
carbohydrate derivative transport
fructose metabolic process
nucleobase-containing compound transport
respiratory electron transport chain
nucleotide-sugar transmembrane transport
pyrimidine nucleotide-sugar transmembrane transport
cellular respiration
metal ion transport
transmembrane transport
DNA-dependent DNA replication
macromolecule catabolic process
protein peptidyl-prolyl isomerization
peptidyl-proline modification
protein homooligomerization
protein complex oligomerization
pathogenesis
copper ion transport
molybdopterin cofactor metabolic process
prosthetic group metabolic process

7.3 ReviGO Plots

Make results file to import into the ReviGO online tool:

Down4revi <-as.data.frame(downgoterms$GOBPID) 
names(Down4revi) <- c("id")
Down4revi$updown <- 0
up4revi <-as.data.frame(upgoterms$GOBPID)
names(up4revi)<-c("id")
up4revi$updown <- 1
revi<-rbind(up4revi, Down4revi)
write.csv(revi, "revi.csv")

ReviGO summarizes long lists of GO terms by removing redundant terms and clustering terms with similar meanings. The results can be plotted as an MDS plot:

source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_nmds.R")
reviGOplot

This plot has several of the GO terms labeled, but not all of them. Below is an interactive version (hover mouse over points to see the name of the GO term):

key = one.data$description

p1 <- ggplot( data = one.data, aes(key= key, x= plot_X, y= plot_Y, color = value) ) +
  geom_point(alpha = I(0.6), size =7) +
  scale_colour_manual(values =c("#3B9AB2", "red"), labels= c("Down", "Up")) +
  geom_point(shape = 21, fill = "transparent", colour = I (alpha ("black", 0.6) ), size = 7) +   scale_size_area() + theme_bw() + labs (y = "Axis 2", x = "Axis 1") +
  theme(legend.key = element_blank()) + theme(text = element_text(size=16)) + theme(legend.title=element_blank())

ggplotly(p1, tooltip="key")

TreeMap for GO terms enriched in significantly upregulated gene set:

source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_treemap_UP.R")

TreeMap for GO terms enriched in significantly downregulated gene set:

source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_treemap_DOWN.R")

8 KEGG pathway enrichment

8.1 Annotation

KEGG annotations (K numbers) were attained by submitting the transdecoder peptide sequences for the P. cordata transcriptome assemble as a GhostKoala job through https://www.kegg.jp/ghostkoala/

The protein fasta submitted to GhostKOALA can be found here: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.transdecoder.pep

Results Summary:

Results were downloaded as user_ko.csv and are available here: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/user_ko.csv

In order to get the KEGG pathway that each K number belongs to, the KEGG API was accessed with each annotated K number:

# read in results with pathway assigned to K numbers
keggs <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/user_ko.csv", header =TRUE)

## format data frame

kegg <- keggs %>% separate(transcript, c("transcript", "protein"), "\\.")

trinity2dammit <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pc_euk_seqs.fasta.dammit.namemap.csv", header = TRUE)
trinity2dammit_k <- trinity2dammit %>%
  separate(original, c("trinity"), sep = " ") 
names(trinity2dammit_k) <- c("trinity", "transcript")

trinity2kegg <- merge(kegg, trinity2dammit_k, by = "transcript")
trinity2kegg<- trinity2kegg[,c(3,4)]
trinity2kegg<- trinity2kegg[trinity2kegg$KO!="",]

## Get pathway ids

koslist<- trinity2kegg$KO
paths <- character(length=0)
for(i in 1:length(koslist)) {
  pathway <- system(paste0("curl http://rest.kegg.jp/link/pathway/", koslist[[i]]), intern=TRUE)
   if (pathway[[1]] == "") {
     paths[[i]]<- "NA"
   } else if (length(pathway) == 1) {
     paths[[i]] <- stri_sub(pathway[[1]], 16)
   } else {
     l <- length(pathway)
     paths[[i]] <- stri_sub(pathway[[2]], 16)
   }
 }
trinity2kegg$path<-paths

## pathway ids 

write.csv(trinity2kegg, "trinity_KO_pathway.csv")

Then a new data frame was made with KEGG annotations (K number AND pathway) alongside DESeq results:

trinity2kegg<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/trinity_KO_pathway.csv", header = TRUE)
trinity2kegg<-trinity2kegg[,-1]
trinity2kegg_noNA <- trinity2kegg[trinity2kegg$path != "NA",]
trinity2kegg_noNA <- trinity2kegg_noNA[trinity2kegg_noNA$path !='PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"',]
resultsframe<- as.data.frame(res)
resultsframe$trinity <- row.names(resultsframe)
resALL<-merge(trinity2kegg, resultsframe, by = 'trinity')
dim(resALL)[1]
## [1] 7720

8.2 Enrichment

KEGG pathway enrichment was performed with the kegga function from the edgeR package. The inputs for this function include a list of pathways (either corresponding to up or downregulated genes) and a ‘universe’ list including all pathways annotated in the reference transcriptome. The function was called twice, once for significantly upregulated genes and once for significantly downregulated genes.

To prepare the input lists, the dataframe with KEGG annotations and DE results was subset to contain only significantly UPregulated genes:

KOs_up<- resALL[resALL$log2FoldChange > 0 & resALL$padj <= 0.05 ,]
KOs_up_list <- KOs_up$KO
KOs_up_list <- (unique(KOs_up_list))

and only significantly DOWNregulated genes:

KOs_down<- resALL[resALL$log2FoldChange < 0 & resALL$padj <= 0.05 ,]
KOs_down_list <- KOs_down$KO
KOs_down_list <- (unique(KOs_down_list))

To get the list of all pathways with genes annotated in the P. cordata transcriptome (the “universe”):

resALLnoNA <- resALL[resALL$path != "NA",] #kegga requires no NA values
resALLnoNA <- resALLnoNA[resALLnoNA$path !='PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"',] #remove this message from accessing the API
genepathways <- data.frame(resALLnoNA$trinity, resALLnoNA$path)
names(genepathways)<-c("geneID", "pathwayID")
universelist<- genepathways$geneID
universelist<-as.vector(universelist)
length(universelist)
## [1] 4618

8.2.1 Pathways engriched among upregulated genes

Perform enrichment test for upregulated genes and access the KEGG API again to acquire more information about the enriched pathways (from pathway id, get pathway name and class/description):

ko_up<- resALLnoNA[resALLnoNA$log2FoldChange > 0 & resALLnoNA$padj <= 0.05 ,]
de<-ko_up$trinity
de<- as.vector(de)

keggUP<-kegga(de, universe = universelist, species = NULL, species.KEGG = NULL, convert = FALSE,
              gene.pathway = genepathways, pathway.names = NULL,
              prior.prob = NULL, covariate=NULL, plot=FALSE)

names(keggUP) <- c("Pathway", "N", "DE" , "p")
keggUP$Pathway<-row.names(keggUP)

KeggUpsig<- keggUP[keggUP$p < 0.05,]

kosup<- row.names(KeggUpsig)
name <- character(length=0)
class <- character(length=0) 

for(i in 1:length(kosup)) {
  pathway <- system(paste0("curl http://rest.kegg.jp/get/", kosup[[i]]), intern=TRUE)
  name[[i]] <- pathway[[2]]
  class[[i]] <-pathway[[3]]
}

KeggUpsig$name <- name
KeggUpsig$class <- class

kable(KeggUpsig[,c(1:5)], row.names=FALSE)
Pathway N DE p name
ko03010 193 40 0.0000000 NAME Ribosome
ko00010 98 13 0.0002064 NAME Glycolysis / Gluconeogenesis
ko00860 43 6 0.0086868 NAME Porphyrin and chlorophyll metabolism
ko00061 35 7 0.0005283 NAME Fatty acid biosynthesis
ko00760 17 3 0.0325781 NAME Nicotinate and nicotinamide metabolism
ko00195 28 9 0.0000012 NAME Photosynthesis
ko00190 75 7 0.0377923 NAME Oxidative phosphorylation
ko05166 8 2 0.0420081 NAME Human T-cell leukemia virus 1 infection
ko00053 7 2 0.0323935 NAME Ascorbate and aldarate metabolism
ko05162 1 1 0.0422352 NAME Measles

8.2.2 Pathways engriched among downregulated genes

Perform enrichment test for downregulated genes and access the KEGG API again to acquire more information about the enriched pathways (from pathway id, get pathway name and class/description):

ko_down<- resALLnoNA[resALLnoNA$log2FoldChange < 0 & resALLnoNA$padj <= 0.05 ,]
de<-ko_down$trinity
de<- as.vector(de)

keggDOWN<-kegga(de, universe = universelist, species = NULL, species.KEGG = NULL, convert = FALSE,
              gene.pathway = genepathways, pathway.names = NULL,
              prior.prob = NULL, covariate=NULL, plot=FALSE)

names(keggDOWN) <- c("Pathway", "N", "DE" , "p")
keggDOWN$Pathway<-row.names(keggDOWN)
KeggDOWNsig<- keggDOWN[keggDOWN$p < 0.05,]

kosup<- row.names(KeggDOWNsig)
name <- character(length=0)
class <- character(length=0) 

for(i in 1:length(kosup)) {
  pathway <- system(paste0("curl http://rest.kegg.jp/get/", kosup[[i]]), intern=TRUE)
  name[[i]] <- pathway[[2]]
  class[[i]] <-pathway[[3]]
}

KeggDOWNsig$name <- name
KeggDOWNsig$class <- class

kable(KeggDOWNsig[,c(2:5)], row.names=FALSE)
N DE p name
45 14 0.0308998 NAME Galactose metabolism
51 15 0.0421142 NAME Ras signaling pathway
45 15 0.0136572 NAME Glycosaminoglycan degradation
28 11 0.0088506 NAME Pyrimidine metabolism
33 12 0.0127166 NAME Peroxisome
18 7 0.0368175 NAME Folate biosynthesis
34 13 0.0059705 NAME Glutathione metabolism
42 17 0.0008279 NAME DNA replication
5 3 0.0484208 NAME Biotin metabolism
5 3 0.0484208 NAME Chlorocyclohexane and chlorobenzene degradation
22 9 0.0129969 NAME Fructose and mannose metabolism
15 6 0.0456244 NAME Antifolate resistance
75 24 0.0038249 NAME Oxidative phosphorylation
7 4 0.0263902 NAME Various types of N-glycan biosynthesis
8 4 0.0450832 NAME Taste transduction
8 4 0.0450832 NAME Glycosaminoglycan biosynthesis - heparan sulfate / heparin

8.3 KEGG Pathway Maps

Add logFold change values to KEGG pathway maps to visualize results.

First, prepare data frames with all K number annotations and log fold change (ko_all) and all significantly differentially expressed K numbers and their log fold changes (ko_sig):

ko_all <-resALL[,c(2,5)]

#7,720
ko_all<-as.data.frame(ko_all[!duplicated(ko_all[,c('KO')]),])
#3,631
row.names(ko_all) <- ko_all[[1]]
ko_all<-ko_all[,2,drop=FALSE]

ko_sig<- resALL[resALL$padj <= 0.05 ,]
ko_sig<- ko_sig[complete.cases(ko_sig), c(2,5)]
ko_sig<-as.data.frame(ko_sig[!duplicated(ko_sig[,c('KO')]),])
#1,243

row.names(ko_sig) <- ko_sig$KO
ko_sig<-ko_sig[,2,drop=FALSE]

8.3.1 DNA replication

with only significantly differentially expressed genes colored:

pathvect<- c("ko03030")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

And with all KO annotations (not just sig diff expressed):

pathvect<- c("ko03030")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)

8.3.2 Cell Cycle Progression

Cell Cycle (significant only)

pathvect<- c("ko04110")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Cell Cycle with all KO annotations (not just sig diff expressed):

pathvect<- c("ko04110")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)

8.3.3 Nitrogen Metabolism / Limitation

Nitrogen Metabolism (significant):

pathvect<- c("ko00910")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Ribosome (significant only):

pathvect<- c("ko03010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

8.3.4 Photosynthesis

Photosynthesis (significant only):

pathvect<- c("ko00195")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Chlorophyll / porphyrin metabolism (significant only):

pathvect<- c("ko00860")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

8.3.5 Carbon Fixation

Carbon Fixation in Photosynthetic Organisms (significant only):

pathvect<- c("ko00710")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Calvin-Benson cycle is upregulated

pathvect<- c("ko00010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

gluconeogenesis = pathway that makes new glucose

glycolysis = glycolysis produces two molecules of pyruvate, two atp, 2 NADH, 2 H20

Pentose Phosphate Pathway (significant only):

pathvect<- c("ko00030")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

8.3.6 Cellular respiration

TCA cycle (significant only):

pathvect<- c("ko00020")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Respiratory oxidative phosphorylation (significant only):

pathvect<- c("ko00190")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

with all KO annotations (not just significant):

pathvect<- c("ko00190")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)

8.3.7 Cell Signaling

Ras Signaling Pathway (significant only):

pathvect<- c("04014")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

ERK is downregulated

MAPK Signaling Pathway (significant only):

pathvect<- c("ko04010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

9 Genes of specific interest

9.1 Phosphorus Limitation

phoA and ehap1

Alkaline phosphatase genes were annotated in the P. cordata transcriptome by blasting the transcriptome against a local database of Emiliania huxleyi phosphorus metabolism genes (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/PhosMetab.pep) using dammit.

Results file: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.x.Pmetab.pep.crbl.csv

Two alkaline phosphates were found that were also expressed:

goi<- c("TRINITY_DN16018_c0_g1_i1", "TRINITY_DN19611_c0_g1_i1")
stopifnot(all(goi %in% names(dds)))
tcounts <- t(log2((counts(dds[goi, ], normalized=TRUE, replaced=FALSE)+.5))) %>%
  merge(colData(dds), ., by="row.names") %>%
  gather(gene, expression, (ncol(.)-length(goi)+1):ncol(.))

annots<-c(rep("ehap1", 15), rep("phoA", 15))

tcounts$annots <- annots
tcounts$annots_f = factor(tcounts$annots, levels=c("phoA", "ehap1"))
pal<-wes_palette(name = "Zissou1", type="discrete")
colors<- c(pal[2],pal[3])

goiplot<- ggplot(tcounts, aes(condition, expression, fill=condition)) +geom_dotplot(binaxis='y', stackdir='center', dotsize=1)+
  facet_wrap(~annots_f, scales="fixed") + 
  labs(x="", 
       y="Expression (log normalized counts)", 
       fill="Phenotype", 
       title="") + theme_bw() + scale_fill_manual(values= colors) + theme(text = element_text(size=19)) + theme(legend.position="none") +  theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +scale_x_discrete(labels=c("F" = "Free-living", "S" = "Symbiotic"))

goiplot

9.2 Chloroplast Division

Chloroplast division genes wer annotated by creating a local database of gene sequences and then blasting the P. cordata transcriptome against the database using dammit software.

local database: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/chloroplast_division_proteins.fasta

results: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/chlordiv_annotations.csv

DRP5B, FtsZ, PDR1

chlordiv <- read.csv("./pc_euk_seqs.fasta.dammit/chlordiv_annotations1.csv")

goi<- as.character(chlordiv$trinity)
stopifnot(all(goi %in% names(dds)))

tcounts <- t(log2((counts(dds[goi, ], normalized=TRUE, replaced=FALSE)+.5))) %>%
  merge(colData(dds), ., by="row.names") %>%
  gather(gene, expression, (ncol(.)-length(goi)+1):ncol(.))

chlordiv$gene <- chlordiv$trinity
tcounts<- merge(tcounts, chlordiv, by= "gene")

pal<-wes_palette(name = "Zissou1", type="discrete")
colors<- c(pal[2],pal[3])

tcounts$identifier = paste(tcounts$geneName ,tcounts$seqid )

tcounts <- tcounts %>% filter(gene %in% c("TRINITY_DN10430_c0_g4_i2", "TRINITY_DN10705_c0_g1_i1", "TRINITY_DN1346_c1_g2_i1"))

goiplot<- ggplot(tcounts, aes(condition, expression, fill=condition)) +geom_dotplot(binaxis='y', stackdir='center', dotsize=1)+
  facet_wrap(~geneName, scales="fixed") + 
  labs(x="", 
       y="Expression (log normalized counts)", 
       fill="Phenotype", 
       title="") + theme_bw() + scale_fill_manual(values= colors) + theme(text = element_text(size=11)) + theme(legend.position="none") + theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +scale_x_discrete(labels=c("F" = "Free-living", "S" = "Symbiotic"))

goiplot

TRINITY_DN10430_c0_g4_i2 - not differentially expressed (padj = 0.115076, lfc = -1.733) → PDR1

TRINITY_DN10705_c0_g1_i1 - not differentially expressed (padj = 0.1150529, lfc = 1.9) → DRP5B

TRINITY_DN1346_c1_g2_i1 - not differentially expressed (padj = 0.732, lfc = -0.6) → FtsZ

10 Session info

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pathview_1.22.3             org.Hs.eg.db_3.7.0         
##  [3] treemap_2.4-2               GO.db_3.7.0                
##  [5] knitr_1.29                  jsonlite_1.7.0             
##  [7] data.table_1.13.0           reshape2_1.4.3             
##  [9] plotly_4.9.2.1              shiny_1.5.0                
## [11] ggrepel_0.8.1               scales_1.1.1               
## [13] GSEABase_1.44.0             annotate_1.60.1            
## [15] XML_3.98-1.20               GOstats_2.48.0             
## [17] graph_1.60.0                Category_2.48.1            
## [19] Matrix_1.2-18               AnnotationDbi_1.44.0       
## [21] edgeR_3.24.3                limma_3.38.3               
## [23] forcats_0.4.0               stringr_1.4.0              
## [25] dplyr_1.0.2                 purrr_0.3.4                
## [27] tidyr_1.1.2                 tibble_3.0.3               
## [29] tidyverse_1.3.0             stringi_1.4.3              
## [31] pheatmap_1.0.12             wesanderson_0.3.6.9000     
## [33] DESeq2_1.22.2               SummarizedExperiment_1.12.0
## [35] DelayedArray_0.8.0          BiocParallel_1.16.6        
## [37] matrixStats_0.56.0          Biobase_2.42.0             
## [39] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
## [41] IRanges_2.16.0              S4Vectors_0.20.1           
## [43] BiocGenerics_0.28.0         ggplot2_3.3.2              
## [45] tximport_1.10.1             readr_1.3.1                
## 
## loaded via a namespace (and not attached):
##  [1] readxl_1.3.1           backports_1.1.9        Hmisc_4.3-0           
##  [4] plyr_1.8.6             igraph_1.2.5           lazyeval_0.2.2        
##  [7] splines_3.5.0          gridBase_0.4-7         digest_0.6.25         
## [10] htmltools_0.5.0        fansi_0.4.0            magrittr_1.5          
## [13] checkmate_1.9.4        memoise_1.1.0          cluster_2.1.0         
## [16] Biostrings_2.50.2      modelr_0.1.8           colorspace_1.4-1      
## [19] blob_1.2.0             rvest_0.3.6            haven_2.3.1           
## [22] xfun_0.16              crayon_1.3.4           RCurl_1.95-4.12       
## [25] genefilter_1.64.0      survival_3.2-3         glue_1.4.2            
## [28] gtable_0.3.0           zlibbioc_1.28.0        XVector_0.22.0        
## [31] Rgraphviz_2.26.0       DBI_1.1.0              Rcpp_1.0.5            
## [34] viridisLite_0.3.0      xtable_1.8-4           htmlTable_2.0.1       
## [37] foreign_0.8-72         bit_1.1-14             Formula_1.2-3         
## [40] AnnotationForge_1.24.0 htmlwidgets_1.5.1      httr_1.4.2            
## [43] RColorBrewer_1.1-2     acepack_1.4.1          ellipsis_0.3.1        
## [46] pkgconfig_2.0.3        nnet_7.3-12            dbplyr_1.4.2          
## [49] locfit_1.5-9.1         tidyselect_1.1.0       rlang_0.4.7           
## [52] later_1.1.0.1          munsell_0.5.0          cellranger_1.1.0      
## [55] tools_3.5.0            cli_2.0.2              generics_0.0.2        
## [58] RSQLite_2.2.0          broom_0.7.0            evaluate_0.14         
## [61] fastmap_1.0.1          yaml_2.2.0             bit64_0.9-7           
## [64] fs_1.3.1               KEGGREST_1.22.0        RBGL_1.58.2           
## [67] mime_0.7               KEGGgraph_1.42.0       xml2_1.2.2            
## [70] compiler_3.5.0         rstudioapi_0.11        png_0.1-7             
## [73] reprex_0.3.0           geneplotter_1.60.0     lattice_0.20-38       
## [76] vctrs_0.3.3            pillar_1.4.6           lifecycle_0.2.0       
## [79] bitops_1.0-6           httpuv_1.5.4           R6_2.4.1              
## [82] latticeExtra_0.6-28    promises_1.1.1         gridExtra_2.3         
## [85] assertthat_0.2.1       withr_2.1.2            GenomeInfoDbData_1.2.0
## [88] hms_0.5.3              grid_3.5.0             rpart_4.1-15          
## [91] rmarkdown_2.3          lubridate_1.7.4        base64enc_0.1-3