The following .html report details the analyses performed to detect and classify Phaeocystis cordata genes that were up and downregulated when cellls are acantharian endosymbionts.

The raw R markdown file used to build this document (Pcordata_gene_expression_analysis.Rmd), as well as all the intermediate data files, are available in the GitHub repository: https://github.com/maggimars/PcordataSymbiosisDGE

1 Sequence data

The raw sequences from this study are available from the NCBI sequence read archive (SRA) with the accession numbers listed below. Sample names beginning with “PC” are Phaeocystis cordata culture replicates and samples beginning with “A” are individual acantharian holobionts.

Sample	SRA Run Number	SRA Experiment Number
A7	SRR10997202	SRX7658295
A12	SRR10997200	SRX7658297
A14	SRR10997199	SRX7658298
A20	SRR10997198	SRX7658299
A21	SRR10997197	SRX7658300
A22	SRR10997196	SRX7658301
A25	SRR10997215	SRX7658282
A27	SRR10997214	SRX7658283
A28	SRR10997213	SRX7658284
A29	SRR10997212	SRX7658285
A35	SRR10997211	SRX7658286
A36	SRR10997210	SRX7658287
PC2	SRR10997209	SRX7658288
PC3	SRR10997208	SRX7658289
PC4	SRR10997207	SRX7658290

2 Packages

The following packages are used throughout the analyses:

library(readr)
library(tximport)
library(ggplot2)
library(DESeq2)
library(wesanderson)
library(pheatmap)
library(stringi)
library(tidyverse)
library(edgeR)
library(GOstats)
library(GSEABase)
library(scales)
library(ggrepel)
library(shiny)
library(plotly)
library(reshape2)
library(data.table)
library(jsonlite)
library(knitr)
library(pathview)

3 Internal quality control with ERCC spike-in

The External RNA Controls Consortium (ERCC) produces spike-in controls for RNA-seq experiments. The ERCC spike-in controls were added to RNA extracted from culture replicates before poly-A selection and cDNA synthesis with the SMART-seq v4 kit. By comparing the number of reads mapping to the ERCC sequences with the known concentrations of RNA molecules in the spike-in mix, bias introduced through PCR amplification or other steps of library-preparation and sequencing can be detected. This can be accomplished by plotting the concentration by the FPKM of mapped reads and performing a regression analysis.

The quality trimmed P. cordata culture reads were mapped to the ERCC sequences and counted with RSEM.

To plot results and perform regression analyses:

Import the counts of reads mapping to ERCC spike-in standard sequences in each sample:

`PC2.ercc`<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC2.genes.results")
`PC3.ercc` <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC3.genes.results")
`PC4.ercc` <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/PC4.genes.results")
FPKM<- data.frame(PC2.ercc$gene_id,PC2.ercc$FPKM,PC3.ercc$FPKM,PC4.ercc$FPKM)
#add column names
names(FPKM)<-c("gene_id","PC2","PC3","PC4")

Import ERCC spike-in standard concentrations for each ERCC sequence:

ercc_conc <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ERCC/cms_095046.txt", stringsAsFactors=FALSE)
ercc_conc<-ercc_conc[,2:7]
ercc_conc<-ercc_conc[order(ercc_conc$'ERCC ID'),]

Plot ERCC concentrations (log2Conc.) by their counts (Log2FPKM) in each sample and apply a linear regression. The slope of the linear regression line and the R-squared of the regression model should both be close to 1. If so, we can be fairly confident that PCR and other handling did not cause substantial or sytematic bias in the results.

FPKM$conc <- ercc_conc$'concentration in Mix 1 (attomoles/ul)'
row.names(FPKM) <- FPKM$gene_id
FPKM <- FPKM[,-1]

FPKM[FPKM==0] <- NA
FPKM <- FPKM[complete.cases(FPKM),]

for (i in c(1:4)) {
  FPKM[[i]] = log2(FPKM[[i]])
}

lm_eqn <- function(df){
  y<-df[[1]]
  x<-df[[2]]
  m <- lm(y ~ x, df);
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, 
                   list(a = format(coef(m)[1], digits = 2), 
                        b = format(coef(m)[2], digits = 2), 
                        r2 = format(summary(m)$r.squared, digits = 3)))
  as.character(as.expression(eq));                 
}

EqDF <- data.frame( c(lm_eqn(data.frame(FPKM$PC2, FPKM$conc)),lm_eqn(data.frame(FPKM$PC3, FPKM$conc)), lm_eqn(data.frame(FPKM$PC4, FPKM$conc)))  , c("PC2", "PC3", "PC4" ))
names(EqDF) <- c("eq", "variable")

mERCC <- melt(FPKM, id.vars = c("conc"))

spikeinPlot<-ggplot(mERCC, aes(x=conc,y=value)) +geom_point()+ geom_smooth(method=lm, se=FALSE, color="black") +
  theme(panel.background =element_blank(), panel.grid.major = element_blank(), panel.grid.minor =element_blank()) +
  theme(axis.line = element_line(color = 'black'))+
  labs(x=expression(paste("Log"[2]," Concentration", sep=""))) +
  labs(y=expression(paste("Log"[2]," FPKM", sep=""))) + facet_wrap(~variable)

spikeinPlotWeq <- spikeinPlot +  geom_text(data=EqDF,aes(x = 0, y = 18,label=eq), parse = TRUE, inherit.aes=FALSE) + facet_grid(variable~.)
spikeinPlotWeq

## `geom_smooth()` using formula 'y ~ x'

In this case, the slopes of the fitted lines and the R-squared for the regression models are both close to 1, so we can be relatively confident in our data.

4 Transcriptome assembly and annotation

The Phaeocystis cordata CCMP3104 transcriptome was assembled with RNA reads from 3 biological culture replicates of P. cordata grown in standard culture conditions with L1 media. RNA was extracted in late exponential phase and libraries were prepared following the same protocols as the individual acantharian holobiont transcriptomes (i.e. with SMART-seq v4 and Nextera XT). Following RNA extraction, but before library preparation, diluted ERCC spike-in mix 1 was added to each sample. The RNA libraries were pooled and sequenced with an Illumina HiSeq 4000. Following sequencing, the reads were trimmed and quality filtered with Trimmomatic software. The reads mapping to ERCC sequences (after trimming) in each sample were filtered out to keep them from being assembled into the transcriptome. Remaining reads from the three replicates were then assembled using Trinity de novo transcriptome assembler. The assembly was dereplicated with CD-HIT EST (95%) and then the entire transcriptome was BLASTed against the nr/nt reference database. Assembled sequences that aligned best to a bacterial sequence from the database were removed from the assembly.

4.1 Assembly

Trimmomatic command for quality trimming raw reads from all samples:

java -jar trimmomatic-0.36.jar PE -phred33 sampleID_R1_001.fastq.gz \
sampleID_R2_001.fastq.gz \
sampleID_1_paired.fq sampleID_1_unpaired.fq \
sampleID_2_paired.fq sampleID_2_unpaired.fq \
ILLUMINACLIP:adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Number of sequencing reads per sample before and after quality trimming with Trimmomatic:

sample	number of sequencing reads	remaining after trimming
A35	85,124,248	72,558,691
A7	96,255,554	83,430,343
A12	83,855,671	72,850,246
A14	103,587,165	88,442,308
A22	96,529,442	79,528,859
A25	91,208,346	76,733,652
A27	90,585,006	78,488,845
A20	348,088,442	304,901,030
A21	303,661,266	268,735,500
A28	299,236,369	254,086,670
A29	137,747,279	121,041,843
A36	142,694,491	124,725,007
PC4	83,711,987	68,530,568
PC2	81,369,009	67,254,517
PC3	77,926,482	62,200,308
total	2,121,580,757	1,823,508,387

Reads from the P. cordata culture samples that mapped to ERCC standards were removed by first mapping reads to a bowtie2 reference index prepared from the ERCC sequences and then filtering out the mapped reads:

bowtie2 -t -x ERCC_reference_bt2_index \
-1 sampleID_1_paired.fq \
-2 sampleID_2_paired.fq \
-S sampleID_ercc.sam

samtools view -bS sampleID_ercc.sam > sampleID.bam

samtools sort sampleID.bam sampleID_sorted

samtools view -b -f 13 sampleID_sorted.bam > sampleID_unmapped.bam

samtools sort -n sampleID_unmapped.bam sampleID.qsort

bedtools bamtofastq -i sampleID.qsort.bam -fq sampleID_1_paired.fq -fq2 sampleID_2_paired.fq

The filtered reads from the P. cordata culture replicates were then assembled with Trinity:

Trinity --seqType fq --max_memory 475G \
--left  PC2_1_paired.fq,PC3_1_paired.fq,PC4_1_paired.fq \
--right PC2_2_paired.fq,PC3_1_paired.fq,PC4_2_paired.fq \
--CPU 12

The assembly was dereplicated with CD-HIT_EST:

cd-hit-est -i Trinity.fasta -o Trinity_Pc_clustered_95 -c 0.95 -n 8 -p 1 -g 1 -M 200000 -T 8 -d 40

The Trinity assembly was filtered to remove bacterial contamination by first running a blastn (v2.6.0+) against the nr/nt NCBI database:

blastn -query $DATA/Trinity_Pc_clustered_95.fasta -task blastn -db nt -num_threads 12 -max_target_seqs 1 -outfmt 5 > pcTrinity_PC_clustered_95Blast.xml

The blast results were then parsed with the following script: TrinityBlastXML.ipynb: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/TrinityBlastXML.ipynb

and reads that had their highest quality blast result corresponding to a bacterial sequence were removed from the transcriptome with the following script: FilterTrinityEukNotEuk.ipynb: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/FilterTrinityEukNotEuk.ipynb

This script produced the filtered transcriptome file pc_euk_seqs.fasta (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta) that was used in read mapping and counting for differential gene expression testing.

The “completeness” of the transcriptome was evaluated by determining how many Benchmarking Universal Single-Copy Orthologs (BUSCOs) were complete, fragmented, or missing in the transcriptome. This was accomplished with the BUSCO software (v3) and the translated amino acid sequences for the P. cordata transcriptome. The BUSCO software was also used to assess the completeness of the P. cordata transcriptome previously sequenced as part of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) as a comparison point. BUSCO scores were calculated for Eukaryote BUSCOs and protist BUSCOs.

The P. cordata assembled for this project is more complete than the MMETSP transcriptome, which is likely due to much deeper sequencing performed for this study.

4.2 Annotation

The P. cordata transcriptome was annotated with the dammit de novo transcriptome annotation software: https://github.com/dib-lab/dammit

Pfam annotation results (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.x.pfam-A.gff3) were parsed with the following script: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/Pfam_gffParsing.ipynb

The final pfam annotation file is pfam.csv (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pfam.csv)

The assembly statistics and annotation results are summarized in the below table (including GO and KEGG annotations, which are described in later sections):

P. cordata CCMP3104 transcriptome assembly & annotation statistics
Number of contigs	32,621
Sum bp	37,640,601
min contig length	201
max contig length	13,685
med contig length	955
mean contig length	1153
N50	1866
GC%	67.4
With Pfam annotation	13,548 (41.5%)
With GO annotation	7,119 (21.8%)
With KO annotation	7,719 (23.7%)
With Kegg Pathway	4,618 (14.2%)

5 Read mapping and Counting

Quality filitered reads from the P. cordata culture replicates and individual acantharians were mapped to the P. cordata transcriptome (pc_euk_seqs.fasta) using the Salmon read mapping software.

salmon quant -i PC_index -l A \
-1 $DATA2/sampleID_1_paired.fq \
-2 $DATA2/sampleID_2_paired.fq \
-p 10 --validateMappings -o quants/sampleID_quant

The package tximport was used for importing Salmon read count results files to R.

All Salmon results are available here: https://github.com/maggimars/PcordataSymbiosisDGE/tree/master/quants

Salmon mapping rates (percent of the reads from each sample that mapped to the P. cordata transcriptome assembly):

sample_names<- c("A7", "A12", "A14", "A20", "A21", "A22","A25", "A27", "A28", "A29", "A35", "A36", "PC2", "PC3", "PC4")

mapping_rates<- c(fromJSON("quants/A7_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A12_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A14_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A20_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A21_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A22_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A25_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A27_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A28_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A29_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A35_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/A36_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC2_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC3_quant/aux_info/meta_info.json")$percent_mapped,
fromJSON("quants/PC4_quant/aux_info/meta_info.json")$percent_mapped
)

mappingrate_df<- data.frame("Sample" = sample_names, "Mapping Rate" = round(mapping_rates, digits=2))

kable(mappingrate_df, align = "l")

Sample	Mapping.Rate
A7	3.73
A12	1.73
A14	2.56
A20	1.12
A21	3.04
A22	3.29
A25	2.56
A27	4.86
A28	1.72
A29	1.62
A35	1.53
A36	3.87
PC2	73.66
PC3	75.30
PC4	71.68

Read in results files for each sample into R with tximport:

samples<-read.table("quants/samples.txt", header = TRUE)
files <- file.path("quants", samples$run, "quant.sf")
names(files) <- samples$run
tx2gene <- read_csv("quants/tx2gene.csv")
txi <- tximport(files, type="salmon", tx2gene= tx2gene)

6 DESeq2 differential expression testing

create DESeq data object:

ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = samples,
                                   design = ~ condition)

## using counts and average transcript lengths from tximport

Pre-filtering:

“Note that more strict filtering to increase power is automatically applied via independent filtering on the mean of normalized counts within the results function. The adjusted p values for the genes which do not pass the filter threshold are set to NA” – from DESeq2 manual

keep <- rowSums(counts(ddsTxi )) >= 10
dds <- ddsTxi[keep,]

Re-level so that log-fold change is in relation to free-living (e.g. if logfold change is 3, then that gene is expressed 3x more in symbiotic samples than free-living replicates):

dds$condition <- relevel(dds$condition, ref = "F")

dds <- DESeq(ddsTxi)

res <- results(dds, alpha=0.05)
head(res, n =3)

## log2 fold change (MLE): condition S vs F 
## Wald test p-value: condition S vs F 
## DataFrame with 3 rows and 6 columns
##                                 baseMean    log2FoldChange            lfcSE
##                                <numeric>         <numeric>        <numeric>
## TRINITY_DN1_c0_g1_i1  0.0178204337170013  1.98354048709337  3.7365328592521
## TRINITY_DN10_c0_g1_i1   14.8187038538356 -7.72809144623968 2.37954484998336
## TRINITY_DN10_c1_g1_i1   7.55170822351109 -6.75555987522243 2.38815198139567
##                                    stat              pvalue                padj
##                               <numeric>           <numeric>           <numeric>
## TRINITY_DN1_c0_g1_i1  0.530850540276097   0.595522354038567                  NA
## TRINITY_DN10_c0_g1_i1 -3.24771833835942  0.0011633436943023 0.00916218836691006
## TRINITY_DN10_c1_g1_i1 -2.82878138738657 0.00467256046288919  0.0210043806317098

“condition S vs F” confirms that we are testing symbiotic samples against free-living samples.

summary(res)

## 
## out of 32210 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 589, 1.8%
## LFC < 0 (down)     : 5211, 16%
## outliers [1]       : 423, 1.3%
## low counts [2]     : 17088, 53%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

589 genes are significantly upregulated in symbiosis and 5211 genes are significantly downregulated in symbiosis

convert DESeq results to a data frame:

resultsall<- as.data.frame(res)
resultsall$Trinity <- row.names(resultsall)

Volcano Plot (plots “genes” by log2fold change and their adjusted p-values, with significantly differentially expressed genes indicated in red)

res_table<- data.frame(res)
# Obtain logical vector regarding whether padj values are less than 0.05 (to color significant points)
threshold <- res_table$padj < 0.05 
res_table$threshold <- threshold

ggplot(res_table) +
        geom_point(aes(x=log2FoldChange, y=-log10(padj), colour=threshold)) +
        ggtitle("") + theme_bw()+ scale_colour_manual(values = c("#bbbbbb", "red")) +
        xlab("log2 fold change") + 
        ylab("-log10 adjusted p-value") +
        scale_x_continuous(limits = c(-15,15)) +
        scale_y_continuous(limits = c(0,30)) +
        theme(legend.position = "none",
              plot.title = element_text(size = rel(1.5), hjust = 0.5),
              axis.title = element_text(size = rel(1.25)))  + theme(text = element_text(size=20))

Count Data Transformations for ranking and visualizations (e.g. PCA plots and heatmaps)

Variance stabilizing transformation (vst): “This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size.”" – from DESeq documentation

#Heat Map of the variance stabilizing transformed count matrix (vsd):
vsd<-vst(dds, blind = TRUE)
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:1000]
pheatmap(assay(vsd)[select,], cluster_rows=TRUE, show_rownames=FALSE,
         cluster_cols=TRUE)

#PCA by condition (Free-living or Symbiotic):
data1 <- plotPCA(vsd, returnData=TRUE)
data1$group<-gsub(" : ","_",as.character(data1$group))
percentVar1 <- round(100 * attr(data1, "percentVar"))

PCA<-ggplot(data1, aes(PC1, PC2, color = condition))+ theme_bw()+
  geom_point(size=9, alpha = 0.8) + scale_colour_manual(values = c("#44aabb","#bbbbbb"), labels = c("Free-living", "Symbiotic"), name = "")+
  xlab(paste0("PC1: ",percentVar1[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar1[2],"% variance")) +
  theme(text = element_text(size=20)) 
PCA

The majority of variance in gene expression (63%) is explained by sample type. There is much more variation between symbiotic samples than between culture replicates.

7 GO term enrichment

7.1 Annotation

The Pfam annotations for the P. cordata transcriptome were assigned using HMMER and HMM profiles for the Pfam database. Annotations with scores < 1e-05 were retained and included in the following analyses. The GO term to Pfam mapping comes from: http://current.geneontology.org/ontology/external2go/pfam2go.

Get dataframe of significantly differentially expressed genes:

sigDEgenes <- subset(resultsall, padj < 0.05)
sigDEgenes$trinity <- row.names(sigDEgenes)

Import Pfam annotation results from dammit and fix pfam annotation formatting to match that used in the pfam2go mapping file:

pfamannot<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pfam.csv", header = TRUE)
pfamannot$Dbxref <- stri_sub(pfamannot$Dbxref , from = 2, to = -2)
pfamannot$Pfam <- gsub("\\..*","", pfamannot$Dbxref)
dim(pfamannot)

## [1] 13548     6

13,548 transcripts have good (score<1e-05) pfam annotations

Import and format trinity ID to dammit transcript ID mapping file:

trinity2dammit <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pc_euk_seqs.fasta.dammit.namemap.csv", header = TRUE)
trinity2dammit1 <- trinity2dammit %>%
  separate(original, c("trinity"), sep = " ")

## Warning: Expected 1 pieces. Additional pieces discarded in 32621 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

names(trinity2dammit1) <- c("trinity", "seqid")

Import and format pfam2go mapping file:

pfam2go <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pfam2go4R.txt", header = FALSE)
pfam2go1 <- pfam2go %>%
  separate(V1, c('V1_1', 'V1_2'), sep = '>') %>%
  separate(V1_1, c("Pfam", "name"), sep = " ") %>%
  separate(V1_2, c("GO_desc", "GO"), sep = ";")

## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [3361, 6204].

## Warning: Expected 2 pieces. Additional pieces discarded in 10503 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 10503 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

pfam2go1$GO <-stri_sub(pfam2go1$GO, from = 2)

Create transcript to GO reference:

pfam1 <- merge(pfamannot, pfam2go1, by = "Pfam")
pfam2<- merge(pfam1, trinity2dammit1, by ="seqid" )
pfam3 <- merge(pfam2, sigDEgenes, by = "trinity")
pfamUP <- pfam3[pfam3$log2FoldChange >0 & pfam3$padj <0.05,]
pfam4<-merge(pfamannot, trinity2dammit1, by = "seqid")
pfam5<-merge(pfam4, sigDEgenes, by = "trinity")
length(unique(pfam1$seqid))

## [1] 7119

7119 annotated transcripts also have a GO term assigned.

Create list of all upregulated GO terms:

length(pfamUP$GO_desc)

## [1] 411

length(unique(substring(pfamUP$GO_desc,5,nchar(pfamUP$GO_desc))))

## [1] 134

unique_upGOlist <- unique(substring(pfamUP$GO_desc,5,nchar(pfamUP$GO_desc))) 
unique_upGOlist[135] <- ""
kable(data.frame(unique_upGOlist[1:45],unique_upGOlist[46:90], unique_upGOlist[91:135] ))

unique_upGOlist.1.45.	unique_upGOlist.46.90.	unique_upGOlist.91.135.
transmembrane transport	biosynthetic process	inorganic diphosphatase activity
membrane	proton-transporting ATP synthase activity, rotational mechanism	hydrogen-translocating pyrophosphatase activity
ion channel activity	ATP synthesis coupled proton transport	fructose-bisphosphate aldolase activity
ion transport	proton-transporting ATP synthase complex, catalytic core F(1)	inositol-3-phosphate synthase activity
ATP binding	phosphoglycerate kinase activity	phospholipid biosynthetic process
nucleic acid binding	kinase activity	inositol biosynthetic process
structural constituent of ribosome	COPII vesicle coat	GTPase activity
ribosome	intracellular protein transport	iron-sulfur cluster binding
translation	ER to Golgi vesicle-mediated transport	nitrogen compound metabolic process
purine nucleotide biosynthetic process	eukaryotic translation initiation factor 3 complex	glutamate synthase activity
phosphoribosylaminoimidazolecarboxamide formyltransferase activity	translational initiation	carbohydrate metabolic process
IMP cyclohydrolase activity	translation initiation factor activity	hydrolase activity, hydrolyzing O-glycosyl compounds
protein binding	translation initiation factor binding	primary metabolic process
photosystem II stabilization	acetyl-CoA carboxylase activity	zinc ion binding
photosystem II assembly	fatty acid biosynthetic process	transcription coregulator activity
oxygen evolving activity	response to stress	nucleus
photosystem II oxygen evolving complex	unfolded protein binding	histone acetyltransferase activity
intracellular	protein folding	cytoskeleton organization
oxidation-reduction process	rRNA binding	heme binding
oxidoreductase activity	hydrolase activity	pyridoxal phosphate binding
photosynthetic electron transport in photosystem II	coenzyme binding	peroxidase activity
electron transporter, transferring electrons within the cyclic electron transport pathway of photosynthesis activity	cell redox homeostasis	response to oxidative stress
photosynthesis, light reaction	violaxanthin de-epoxidase activity	cellular amino acid biosynthetic process
photosystem	chloroplast	protein dimerization activity
chlorophyll binding	RNA binding	cytoplasm
photosynthetic electron transport chain	base-excision repair	ATP hydrolysis coupled proton transport
ATPase activity	ubiquitin-dependent protein catabolic process	proton transmembrane transporter activity
protein phosphorylation	oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor	proton-transporting V-type ATPase, V0 domain
protein kinase activity	motor activity	cellular metabolic process
GTP binding	myosin complex	transcription, DNA-templated
proton transmembrane transport	proteasome core complex	DNA-directed 5’-3’ RNA polymerase activity
ATP metabolic process	threonine-type endopeptidase activity	DNA binding
glucose-6-phosphate isomerase activity	proteolysis involved in cellular protein catabolic process	microtubule-based movement
gluconeogenesis	oxidoreductase activity, acting on the CH-CH group of donors	microtubule motor activity
glycolytic process	mitochondrial outer membrane	dynein complex
phosphoenolpyruvate carboxykinase (ATP) activity	peptidyl-prolyl cis-trans isomerase activity	aldehyde-lyase activity
aminoacyl-tRNA ligase activity	protein peptidyl-prolyl isomerization	U5 snRNA binding
tRNA aminoacylation for protein translation	vesicle-mediated transport	inositol catabolic process
catalytic activity	membrane coat	inositol oxygenase activity
trehalose biosynthetic process	actin binding	iron ion binding
protein transport	regulation of transcription, DNA-templated	integral component of membrane
metabolic process	ribosome binding	ATP:ADP antiporter activity
electron transfer activity	mature ribosome assembly	nucleotide transport
copper ion binding	transmembrane transporter activity	proton-transporting two-sector ATPase complex, proton-transporting domain
nucleotide binding	protein import

Create list of all downregulated GO terms:

pfamDOWN <- pfam3[pfam3$log2FoldChange <0 & pfam3$padj <0.05,]
length(pfamDOWN$GO_desc)

## [1] 3200

length(unique(pfamDOWN$GO_desc))

## [1] 552

(too long to print)

7.2 GO term enrichment testing with GOstats

Format transcript to GO mapping for GOstats:

GOdf <- data.frame(pfam2$trinity, pfam2$GO)
GOdf$evidence <- "ISS"
names(GOdf) <- c("isoform", "GO", "evidence")
#reorder columns
GOdf <- GOdf[,c("GO","evidence","isoform")]  
GOdf$GO <- as.character(GOdf$GO)
GOdf$isoform<- as.character(GOdf$isoform)
goframe=GOFrame(GOdf)
goAllFrame=GOAllFrame(goframe)
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())

Prepare universe gene set, up-regulated gene set, and down-regulated gene set:

#make list of all genes
universe <-trinity2dammit1$trinity 
#make list of upregulated genes
sigDEup <- sigDEgenes[sigDEgenes$log2FoldChange >0,] 
uplist <- sigDEup$trinity
#make list of downregulated genes
sigDEdown<- sigDEgenes[sigDEgenes$log2FoldChange <0,]
downlist <- sigDEdown$trinity

GOterm enrichment in up-regulated genes:

upregulated = hyperGTest(
  GSEAGOHyperGParams(name = "Symbiosis Upregged",
                     geneSetCollection=gsc,geneIds = uplist,
                     universeGeneIds=universe,ontology = "BP",pvalueCutoff = 0.05,conditional = TRUE,testDirection = "over"))

upregulated

## Gene to GO BP Conditional test for over-representation 
## 289 GO BP ids tested (68 have p < 0.05)
## Selected gene set size: 117 
##     Gene universe size: 4341 
##     Annotation package: Based on a GeneSetCollection Object

htmlReport(upregulated, file="enriched_up_go.html")
upgoterms <-data.frame(summary(upregulated))
write.csv(upgoterms,"upgo.csv")

Full Results (conditional = FALSE): https://maggimars.github.io/AcanthSymbiontDGE/Pcordata/enriched_up_GO.html

List of significantly enriched GO terms in UP-regulated genes (when conditional = TRUE , less redundancy):

kable(data.frame(GOterm =upgoterms$Term))

GOterm
amide biosynthetic process
peptide metabolic process
translation
organonitrogen compound biosynthetic process
cellular biosynthetic process
macromolecule biosynthetic process
cellular nitrogen compound metabolic process
photosynthetic electron transport chain
proton transmembrane transport
generation of precursor metabolites and energy
gluconeogenesis
monosaccharide biosynthetic process
inorganic ion transmembrane transport
cation transmembrane transport
ATP metabolic process
ribonucleoside monophosphate metabolic process
purine nucleoside monophosphate metabolic process
purine nucleotide metabolic process
establishment of protein localization
ribonucleoside triphosphate metabolic process
purine nucleoside triphosphate metabolic process
peptide transport
photosynthesis
purine-containing compound biosynthetic process
small molecule metabolic process
nucleoside triphosphate biosynthetic process
purine ribonucleoside triphosphate biosynthetic process
ribonucleotide metabolic process
purine ribonucleoside monophosphate biosynthetic process
nucleoside monophosphate biosynthetic process
purine ribonucleotide biosynthetic process
ribose phosphate biosynthetic process
nitrogen compound transport
monocarboxylic acid biosynthetic process
organic substance catabolic process
intracellular protein transport
translational initiation
organic acid biosynthetic process
inositol metabolic process
cellular macromolecule localization
establishment of localization in cell
vesicle-mediated transport
protein import
ribonucleoside diphosphate metabolic process
nucleoside diphosphate phosphorylation
purine nucleoside diphosphate metabolic process
glycolytic process
ADP metabolic process
hexose metabolic process
pyruvate metabolic process
proteolysis involved in cellular protein catabolic process
pyridine nucleotide biosynthetic process
nucleotide transport
nucleoside phosphate metabolic process
ATP synthesis coupled proton transport
response to oxidative stress
protein catabolic process
cellular macromolecule catabolic process
nicotinamide nucleotide metabolic process
cellular process
pyridine-containing compound metabolic process
nucleoside phosphate catabolic process
cellular macromolecule metabolic process
protein folding
carboxylic acid metabolic process
oxidoreduction coenzyme metabolic process
nucleotide biosynthetic process
cellular catabolic process

GOterm enrichment in significantly down-reguated genes:

downregulated = hyperGTest(
  GSEAGOHyperGParams(name = "Symbiosis DownRegged",
                     geneSetCollection=gsc,geneIds = downlist,
                     universeGeneIds=universe,ontology = "BP",pvalueCutoff = 0.05,conditional = FALSE,testDirection = "over"))

downregulated

## Gene to GO BP  test for over-representation 
## 902 GO BP ids tested (31 have p < 0.05)
## Selected gene set size: 1001 
##     Gene universe size: 4341 
##     Annotation package: Based on a GeneSetCollection Object

htmlReport(downregulated, file = "enrichedDOWNgostats.html")
downgoterms <- data.frame(summary(downregulated))
write.csv(downgoterms,"downgo.csv")

Full Results: https://maggimars.github.io/AcanthSymbiontDGE/Pcordata/enriched_down_GO.html

List of significantly enriched GO terms in DOWN-regulated genes:

kable(data.frame("GOterm" = downgoterms[["Term"]]))

GOterm
carbohydrate derivative metabolic process
DNA replication
glycosylation
protein glycosylation
macromolecule glycosylation
glycoprotein metabolic process
glycoprotein biosynthetic process
transport
establishment of localization
localization
cation transport
carbohydrate derivative biosynthetic process
carbohydrate derivative transport
fructose metabolic process
nucleobase-containing compound transport
respiratory electron transport chain
nucleotide-sugar transmembrane transport
pyrimidine nucleotide-sugar transmembrane transport
cellular respiration
metal ion transport
transmembrane transport
DNA-dependent DNA replication
macromolecule catabolic process
protein peptidyl-prolyl isomerization
peptidyl-proline modification
protein homooligomerization
protein complex oligomerization
pathogenesis
copper ion transport
molybdopterin cofactor metabolic process
prosthetic group metabolic process

7.3 ReviGO Plots

Make results file to import into the ReviGO online tool:

Down4revi <-as.data.frame(downgoterms$GOBPID) 
names(Down4revi) <- c("id")
Down4revi$updown <- 0
up4revi <-as.data.frame(upgoterms$GOBPID)
names(up4revi)<-c("id")
up4revi$updown <- 1
revi<-rbind(up4revi, Down4revi)
write.csv(revi, "revi.csv")

ReviGO summarizes long lists of GO terms by removing redundant terms and clustering terms with similar meanings. The results can be plotted as an MDS plot:

source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_nmds.R")
reviGOplot

This plot has several of the GO terms labeled, but not all of them. Below is an interactive version (hover mouse over points to see the name of the GO term):

key = one.data$description

p1 <- ggplot( data = one.data, aes(key= key, x= plot_X, y= plot_Y, color = value) ) +
  geom_point(alpha = I(0.6), size =7) +
  scale_colour_manual(values =c("#3B9AB2", "red"), labels= c("Down", "Up")) +
  geom_point(shape = 21, fill = "transparent", colour = I (alpha ("black", 0.6) ), size = 7) +   scale_size_area() + theme_bw() + labs (y = "Axis 2", x = "Axis 1") +
  theme(legend.key = element_blank()) + theme(text = element_text(size=16)) + theme(legend.title=element_blank())

ggplotly(p1, tooltip="key")

TreeMap for GO terms enriched in significantly upregulated gene set:

source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_treemap_UP.R")

TreeMap for GO terms enriched in significantly downregulated gene set:

source("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/ReviGO_treemap_DOWN.R")

8 KEGG pathway enrichment

8.1 Annotation

KEGG annotations (K numbers) were attained by submitting the transdecoder peptide sequences for the P. cordata transcriptome assemble as a GhostKoala job through https://www.kegg.jp/ghostkoala/

The protein fasta submitted to GhostKOALA can be found here: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.transdecoder.pep

Results Summary:

Results were downloaded as user_ko.csv and are available here: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/user_ko.csv

In order to get the KEGG pathway that each K number belongs to, the KEGG API was accessed with each annotated K number:

# read in results with pathway assigned to K numbers
keggs <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/user_ko.csv", header =TRUE)

## format data frame

kegg <- keggs %>% separate(transcript, c("transcript", "protein"), "\\.")

trinity2dammit <- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/pc_euk_seqs.fasta.dammit.namemap.csv", header = TRUE)
trinity2dammit_k <- trinity2dammit %>%
  separate(original, c("trinity"), sep = " ") 
names(trinity2dammit_k) <- c("trinity", "transcript")

trinity2kegg <- merge(kegg, trinity2dammit_k, by = "transcript")
trinity2kegg<- trinity2kegg[,c(3,4)]
trinity2kegg<- trinity2kegg[trinity2kegg$KO!="",]

## Get pathway ids

koslist<- trinity2kegg$KO
paths <- character(length=0)
for(i in 1:length(koslist)) {
  pathway <- system(paste0("curl http://rest.kegg.jp/link/pathway/", koslist[[i]]), intern=TRUE)
   if (pathway[[1]] == "") {
     paths[[i]]<- "NA"
   } else if (length(pathway) == 1) {
     paths[[i]] <- stri_sub(pathway[[1]], 16)
   } else {
     l <- length(pathway)
     paths[[i]] <- stri_sub(pathway[[2]], 16)
   }
 }
trinity2kegg$path<-paths

## pathway ids 

write.csv(trinity2kegg, "trinity_KO_pathway.csv")

Then a new data frame was made with KEGG annotations (K number AND pathway) alongside DESeq results:

trinity2kegg<- fread("https://raw.githubusercontent.com/maggimars/PcordataSymbiosisDGE/master/trinity_KO_pathway.csv", header = TRUE)
trinity2kegg<-trinity2kegg[,-1]
trinity2kegg_noNA <- trinity2kegg[trinity2kegg$path != "NA",]
trinity2kegg_noNA <- trinity2kegg_noNA[trinity2kegg_noNA$path !='PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"',]
resultsframe<- as.data.frame(res)
resultsframe$trinity <- row.names(resultsframe)
resALL<-merge(trinity2kegg, resultsframe, by = 'trinity')
dim(resALL)[1]

## [1] 7720

8.2 Enrichment

KEGG pathway enrichment was performed with the kegga function from the edgeR package. The inputs for this function include a list of pathways (either corresponding to up or downregulated genes) and a ‘universe’ list including all pathways annotated in the reference transcriptome. The function was called twice, once for significantly upregulated genes and once for significantly downregulated genes.

To prepare the input lists, the dataframe with KEGG annotations and DE results was subset to contain only significantly UPregulated genes:

KOs_up<- resALL[resALL$log2FoldChange > 0 & resALL$padj <= 0.05 ,]
KOs_up_list <- KOs_up$KO
KOs_up_list <- (unique(KOs_up_list))

and only significantly DOWNregulated genes:

KOs_down<- resALL[resALL$log2FoldChange < 0 & resALL$padj <= 0.05 ,]
KOs_down_list <- KOs_down$KO
KOs_down_list <- (unique(KOs_down_list))

To get the list of all pathways with genes annotated in the P. cordata transcriptome (the “universe”):

resALLnoNA <- resALL[resALL$path != "NA",] #kegga requires no NA values
resALLnoNA <- resALLnoNA[resALLnoNA$path !='PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"',] #remove this message from accessing the API
genepathways <- data.frame(resALLnoNA$trinity, resALLnoNA$path)
names(genepathways)<-c("geneID", "pathwayID")
universelist<- genepathways$geneID
universelist<-as.vector(universelist)
length(universelist)

## [1] 4618

8.2.1 Pathways engriched among upregulated genes

Perform enrichment test for upregulated genes and access the KEGG API again to acquire more information about the enriched pathways (from pathway id, get pathway name and class/description):

ko_up<- resALLnoNA[resALLnoNA$log2FoldChange > 0 & resALLnoNA$padj <= 0.05 ,]
de<-ko_up$trinity
de<- as.vector(de)

keggUP<-kegga(de, universe = universelist, species = NULL, species.KEGG = NULL, convert = FALSE,
              gene.pathway = genepathways, pathway.names = NULL,
              prior.prob = NULL, covariate=NULL, plot=FALSE)

names(keggUP) <- c("Pathway", "N", "DE" , "p")
keggUP$Pathway<-row.names(keggUP)

KeggUpsig<- keggUP[keggUP$p < 0.05,]

kosup<- row.names(KeggUpsig)
name <- character(length=0)
class <- character(length=0) 

for(i in 1:length(kosup)) {
  pathway <- system(paste0("curl http://rest.kegg.jp/get/", kosup[[i]]), intern=TRUE)
  name[[i]] <- pathway[[2]]
  class[[i]] <-pathway[[3]]
}

KeggUpsig$name <- name
KeggUpsig$class <- class

kable(KeggUpsig[,c(1:5)], row.names=FALSE)

Pathway	N	DE	p	name
ko03010	193	40	0.0000000	NAME Ribosome
ko00010	98	13	0.0002064	NAME Glycolysis / Gluconeogenesis
ko00860	43	6	0.0086868	NAME Porphyrin and chlorophyll metabolism
ko00061	35	7	0.0005283	NAME Fatty acid biosynthesis
ko00760	17	3	0.0325781	NAME Nicotinate and nicotinamide metabolism
ko00195	28	9	0.0000012	NAME Photosynthesis
ko00190	75	7	0.0377923	NAME Oxidative phosphorylation
ko05166	8	2	0.0420081	NAME Human T-cell leukemia virus 1 infection
ko00053	7	2	0.0323935	NAME Ascorbate and aldarate metabolism
ko05162	1	1	0.0422352	NAME Measles

8.2.2 Pathways engriched among downregulated genes

Perform enrichment test for downregulated genes and access the KEGG API again to acquire more information about the enriched pathways (from pathway id, get pathway name and class/description):

ko_down<- resALLnoNA[resALLnoNA$log2FoldChange < 0 & resALLnoNA$padj <= 0.05 ,]
de<-ko_down$trinity
de<- as.vector(de)

keggDOWN<-kegga(de, universe = universelist, species = NULL, species.KEGG = NULL, convert = FALSE,
              gene.pathway = genepathways, pathway.names = NULL,
              prior.prob = NULL, covariate=NULL, plot=FALSE)

names(keggDOWN) <- c("Pathway", "N", "DE" , "p")
keggDOWN$Pathway<-row.names(keggDOWN)
KeggDOWNsig<- keggDOWN[keggDOWN$p < 0.05,]

kosup<- row.names(KeggDOWNsig)
name <- character(length=0)
class <- character(length=0) 

for(i in 1:length(kosup)) {
  pathway <- system(paste0("curl http://rest.kegg.jp/get/", kosup[[i]]), intern=TRUE)
  name[[i]] <- pathway[[2]]
  class[[i]] <-pathway[[3]]
}

KeggDOWNsig$name <- name
KeggDOWNsig$class <- class

kable(KeggDOWNsig[,c(2:5)], row.names=FALSE)

N	DE	p	name
45	14	0.0308998	NAME Galactose metabolism
51	15	0.0421142	NAME Ras signaling pathway
45	15	0.0136572	NAME Glycosaminoglycan degradation
28	11	0.0088506	NAME Pyrimidine metabolism
33	12	0.0127166	NAME Peroxisome
18	7	0.0368175	NAME Folate biosynthesis
34	13	0.0059705	NAME Glutathione metabolism
42	17	0.0008279	NAME DNA replication
5	3	0.0484208	NAME Biotin metabolism
5	3	0.0484208	NAME Chlorocyclohexane and chlorobenzene degradation
22	9	0.0129969	NAME Fructose and mannose metabolism
15	6	0.0456244	NAME Antifolate resistance
75	24	0.0038249	NAME Oxidative phosphorylation
7	4	0.0263902	NAME Various types of N-glycan biosynthesis
8	4	0.0450832	NAME Taste transduction
8	4	0.0450832	NAME Glycosaminoglycan biosynthesis - heparan sulfate / heparin

8.3 KEGG Pathway Maps

Add logFold change values to KEGG pathway maps to visualize results.

First, prepare data frames with all K number annotations and log fold change (ko_all) and all significantly differentially expressed K numbers and their log fold changes (ko_sig):

ko_all <-resALL[,c(2,5)]

#7,720
ko_all<-as.data.frame(ko_all[!duplicated(ko_all[,c('KO')]),])
#3,631
row.names(ko_all) <- ko_all[[1]]
ko_all<-ko_all[,2,drop=FALSE]

ko_sig<- resALL[resALL$padj <= 0.05 ,]
ko_sig<- ko_sig[complete.cases(ko_sig), c(2,5)]
ko_sig<-as.data.frame(ko_sig[!duplicated(ko_sig[,c('KO')]),])
#1,243

row.names(ko_sig) <- ko_sig$KO
ko_sig<-ko_sig[,2,drop=FALSE]

8.3.1 DNA replication

with only significantly differentially expressed genes colored:

pathvect<- c("ko03030")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

And with all KO annotations (not just sig diff expressed):

pathvect<- c("ko03030")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)

8.3.2 Cell Cycle Progression

Cell Cycle (significant only)

pathvect<- c("ko04110")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Cell Cycle with all KO annotations (not just sig diff expressed):

pathvect<- c("ko04110")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)

8.3.3 Nitrogen Metabolism / Limitation

Nitrogen Metabolism (significant):

pathvect<- c("ko00910")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Ribosome (significant only):

pathvect<- c("ko03010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

8.3.4 Photosynthesis

Photosynthesis (significant only):

pathvect<- c("ko00195")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Chlorophyll / porphyrin metabolism (significant only):

pathvect<- c("ko00860")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

8.3.5 Carbon Fixation

Carbon Fixation in Photosynthetic Organisms (significant only):

pathvect<- c("ko00710")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Calvin-Benson cycle is upregulated

pathvect<- c("ko00010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

gluconeogenesis = pathway that makes new glucose

glycolysis = glycolysis produces two molecules of pyruvate, two atp, 2 NADH, 2 H20

Pentose Phosphate Pathway (significant only):

pathvect<- c("ko00030")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

8.3.6 Cellular respiration

TCA cycle (significant only):

pathvect<- c("ko00020")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

Respiratory oxidative phosphorylation (significant only):

pathvect<- c("ko00190")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

with all KO annotations (not just significant):

pathvect<- c("ko00190")
pv.out<- pathview(gene.data = ko_all, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "all", kegg.native = TRUE)

8.3.7 Cell Signaling

Ras Signaling Pathway (significant only):

pathvect<- c("04014")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

ERK is downregulated

MAPK Signaling Pathway (significant only):

pathvect<- c("ko04010")
pv.out<- pathview(gene.data = ko_sig, pathway.id = pathvect, species = "ko", gene.idtype = "KEGG", out.suffix = "sig", kegg.native = TRUE)

9 Genes of specific interest

9.1 Phosphorus Limitation

phoA and ehap1

Alkaline phosphatase genes were annotated in the P. cordata transcriptome by blasting the transcriptome against a local database of Emiliania huxleyi phosphorus metabolism genes (https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/PhosMetab.pep) using dammit.

Results file: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/pc_euk_seqs.fasta.x.Pmetab.pep.crbl.csv

Two alkaline phosphates were found that were also expressed:

goi<- c("TRINITY_DN16018_c0_g1_i1", "TRINITY_DN19611_c0_g1_i1")
stopifnot(all(goi %in% names(dds)))
tcounts <- t(log2((counts(dds[goi, ], normalized=TRUE, replaced=FALSE)+.5))) %>%
  merge(colData(dds), ., by="row.names") %>%
  gather(gene, expression, (ncol(.)-length(goi)+1):ncol(.))

annots<-c(rep("ehap1", 15), rep("phoA", 15))

tcounts$annots <- annots
tcounts$annots_f = factor(tcounts$annots, levels=c("phoA", "ehap1"))
pal<-wes_palette(name = "Zissou1", type="discrete")
colors<- c(pal[2],pal[3])

goiplot<- ggplot(tcounts, aes(condition, expression, fill=condition)) +geom_dotplot(binaxis='y', stackdir='center', dotsize=1)+
  facet_wrap(~annots_f, scales="fixed") + 
  labs(x="", 
       y="Expression (log normalized counts)", 
       fill="Phenotype", 
       title="") + theme_bw() + scale_fill_manual(values= colors) + theme(text = element_text(size=19)) + theme(legend.position="none") +  theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +scale_x_discrete(labels=c("F" = "Free-living", "S" = "Symbiotic"))

goiplot

9.2 Chloroplast Division

Chloroplast division genes wer annotated by creating a local database of gene sequences and then blasting the P. cordata transcriptome against the database using dammit software.

local database: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/chloroplast_division_proteins.fasta

results: https://github.com/maggimars/PcordataSymbiosisDGE/blob/master/chlordiv_annotations.csv

DRP5B, FtsZ, PDR1

chlordiv <- read.csv("./pc_euk_seqs.fasta.dammit/chlordiv_annotations1.csv")

goi<- as.character(chlordiv$trinity)
stopifnot(all(goi %in% names(dds)))

tcounts <- t(log2((counts(dds[goi, ], normalized=TRUE, replaced=FALSE)+.5))) %>%
  merge(colData(dds), ., by="row.names") %>%
  gather(gene, expression, (ncol(.)-length(goi)+1):ncol(.))

chlordiv$gene <- chlordiv$trinity
tcounts<- merge(tcounts, chlordiv, by= "gene")

pal<-wes_palette(name = "Zissou1", type="discrete")
colors<- c(pal[2],pal[3])

tcounts$identifier = paste(tcounts$geneName ,tcounts$seqid )

tcounts <- tcounts %>% filter(gene %in% c("TRINITY_DN10430_c0_g4_i2", "TRINITY_DN10705_c0_g1_i1", "TRINITY_DN1346_c1_g2_i1"))

goiplot<- ggplot(tcounts, aes(condition, expression, fill=condition)) +geom_dotplot(binaxis='y', stackdir='center', dotsize=1)+
  facet_wrap(~geneName, scales="fixed") + 
  labs(x="", 
       y="Expression (log normalized counts)", 
       fill="Phenotype", 
       title="") + theme_bw() + scale_fill_manual(values= colors) + theme(text = element_text(size=11)) + theme(legend.position="none") + theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +scale_x_discrete(labels=c("F" = "Free-living", "S" = "Symbiotic"))

goiplot

TRINITY_DN10430_c0_g4_i2 - not differentially expressed (padj = 0.115076, lfc = -1.733) → PDR1

TRINITY_DN10705_c0_g1_i1 - not differentially expressed (padj = 0.1150529, lfc = 1.9) → DRP5B

TRINITY_DN1346_c1_g2_i1 - not differentially expressed (padj = 0.732, lfc = -0.6) → FtsZ

10 Session info

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pathview_1.22.3             org.Hs.eg.db_3.7.0         
##  [3] treemap_2.4-2               GO.db_3.7.0                
##  [5] knitr_1.29                  jsonlite_1.7.0             
##  [7] data.table_1.13.0           reshape2_1.4.3             
##  [9] plotly_4.9.2.1              shiny_1.5.0                
## [11] ggrepel_0.8.1               scales_1.1.1               
## [13] GSEABase_1.44.0             annotate_1.60.1            
## [15] XML_3.98-1.20               GOstats_2.48.0             
## [17] graph_1.60.0                Category_2.48.1            
## [19] Matrix_1.2-18               AnnotationDbi_1.44.0       
## [21] edgeR_3.24.3                limma_3.38.3               
## [23] forcats_0.4.0               stringr_1.4.0              
## [25] dplyr_1.0.2                 purrr_0.3.4                
## [27] tidyr_1.1.2                 tibble_3.0.3               
## [29] tidyverse_1.3.0             stringi_1.4.3              
## [31] pheatmap_1.0.12             wesanderson_0.3.6.9000     
## [33] DESeq2_1.22.2               SummarizedExperiment_1.12.0
## [35] DelayedArray_0.8.0          BiocParallel_1.16.6        
## [37] matrixStats_0.56.0          Biobase_2.42.0             
## [39] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
## [41] IRanges_2.16.0              S4Vectors_0.20.1           
## [43] BiocGenerics_0.28.0         ggplot2_3.3.2              
## [45] tximport_1.10.1             readr_1.3.1                
## 
## loaded via a namespace (and not attached):
##  [1] readxl_1.3.1           backports_1.1.9        Hmisc_4.3-0           
##  [4] plyr_1.8.6             igraph_1.2.5           lazyeval_0.2.2        
##  [7] splines_3.5.0          gridBase_0.4-7         digest_0.6.25         
## [10] htmltools_0.5.0        fansi_0.4.0            magrittr_1.5          
## [13] checkmate_1.9.4        memoise_1.1.0          cluster_2.1.0         
## [16] Biostrings_2.50.2      modelr_0.1.8           colorspace_1.4-1      
## [19] blob_1.2.0             rvest_0.3.6            haven_2.3.1           
## [22] xfun_0.16              crayon_1.3.4           RCurl_1.95-4.12       
## [25] genefilter_1.64.0      survival_3.2-3         glue_1.4.2            
## [28] gtable_0.3.0           zlibbioc_1.28.0        XVector_0.22.0        
## [31] Rgraphviz_2.26.0       DBI_1.1.0              Rcpp_1.0.5            
## [34] viridisLite_0.3.0      xtable_1.8-4           htmlTable_2.0.1       
## [37] foreign_0.8-72         bit_1.1-14             Formula_1.2-3         
## [40] AnnotationForge_1.24.0 htmlwidgets_1.5.1      httr_1.4.2            
## [43] RColorBrewer_1.1-2     acepack_1.4.1          ellipsis_0.3.1        
## [46] pkgconfig_2.0.3        nnet_7.3-12            dbplyr_1.4.2          
## [49] locfit_1.5-9.1         tidyselect_1.1.0       rlang_0.4.7           
## [52] later_1.1.0.1          munsell_0.5.0          cellranger_1.1.0      
## [55] tools_3.5.0            cli_2.0.2              generics_0.0.2        
## [58] RSQLite_2.2.0          broom_0.7.0            evaluate_0.14         
## [61] fastmap_1.0.1          yaml_2.2.0             bit64_0.9-7           
## [64] fs_1.3.1               KEGGREST_1.22.0        RBGL_1.58.2           
## [67] mime_0.7               KEGGgraph_1.42.0       xml2_1.2.2            
## [70] compiler_3.5.0         rstudioapi_0.11        png_0.1-7             
## [73] reprex_0.3.0           geneplotter_1.60.0     lattice_0.20-38       
## [76] vctrs_0.3.3            pillar_1.4.6           lifecycle_0.2.0       
## [79] bitops_1.0-6           httpuv_1.5.4           R6_2.4.1              
## [82] latticeExtra_0.6-28    promises_1.1.1         gridExtra_2.3         
## [85] assertthat_0.2.1       withr_2.1.2            GenomeInfoDbData_1.2.0
## [88] hms_0.5.3              grid_3.5.0             rpart_4.1-15          
## [91] rmarkdown_2.3          lubridate_1.7.4        base64enc_0.1-3

RNAseq analysis for symbiotic and free-living Phaeocystis cordata

Maggi Brisbin

September 17, 2020