EukGeoBlast: Using the PR2 database to interactively plot geographic locations for 18S blast hits

by: Maggi Brisbin

Who are you and where do you come from?

How often do you find yourself wondering, "where in the world has this protist sequence been found before?"

I do all the time. Most recently, I wanted to know if sequences we found to be enriched in samples from hydrothermal vents near Okinawa had previously been found in other marine ecosystems and especially at other vent sites (read more: https://www.biorxiv.org/content/10.1101/714816v2).

With about 40 sequences to look into, we set about blasting the sequences and manually looking at results and the publications linked to hits. It was painfully clear, however, that this approach was not sustainable, reproducible, or anywhere close to 'best practices'.

Hence, the idea for this post was born.

The Protist Ribosomal Reference (PR2) database (https://github.com/pr2database/pr2database) is already an amazing resource for those of us excited about protists. PR2 includes 176,813 protist ribosomal RNA gene sequences that can be downloaded for use in classifiying sequences denoised with DADA2 or processed with other software (https://github.com/pr2database/pr2database/releases). In addition, metadata for sequences in the database, including taxonomy, publications, and collection location, are available. The PR2 tutorial even gives an example of plotting sequence collection location on a map in R (https://github.com/pr2database/pr2database/blob/master/PR2_R_database.md).

Excellent! Only ... a very small fraction of the sequences in the database have a latitude and longitude assigned to them. So, if you plot sequences by location, you only see a small subset of the available data. 18,459 of the 176,813 sequences in the database have a latitude and longitude assigned.

There are other metadata fields that include location information, however. For example, gb_country holds, at the very least, the country of collection, and sometimes more detailed information. 71,230 sequences have a country assigned. That is almost 4x more than the number of sequences with lat/lon. Furthermore, 82,468 have an isolation source provided. This potentially provides location information for much more of the database, as not all entries with country have entries for isolation source and vice versa.

But while the GenBank country column is standardized, researchers can really write whatever they want in the isolation source column. This makes the column valuable, since it often gives water depth and other descriptions of the sampling location, but it is difficult to extract the actual location information.

This post describes how to use ALL available location information in the PR2 database to make an interactive geographic map of Blast hits colored by % identity.

Fuzzy matching of isolation sources to locations with known lat/lon allows sequences without an assigned country or lat/lon to still be plotted by location. Interactive mapping allows zooming to look at metadata and %ID for individual Blast hits.

For more on Fuzzy matching see: https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)).

Note: fuzzy matched location information is now included in the PR2 database version 4.12.0 https://github.com/pr2database/pr2database/wiki/PR2-R-database-package

Example Output:

In [16]:
show(p) #If running in a notebook, this cell has to be run last.

To scroll in and out, select the zoom tool at the bottom of the map:zoom tool icon

By zooming in, the points will separate, allowing you to see the color of each individual point.

You can also reset the plot by clicking the reset button on the bottom of the map:reset icon

... A box zoom would be really great, but google API does not allow aspect ratio changes, so bokeh box zoom will not work with this type of plot.

How to get there:

Step 1: Blast some sequences of interest against the PR2 database.

You will need to have BLAST+ installed if you don't already: https://www.ncbi.nlm.nih.gov/books/NBK279671/

A .fasta file of the PR2 database can be downloaded with:
!wget https://github.com/pr2database/pr2database/releases/download/4.11.1/pr2_version_4.11.1_mothur.fasta.gz

and unzipped with a quick !gunzip pr2_version_4.11.1_mothur.fasta.gz

Next, make a blast database from the PR2 sequences:
!makeblastdb -in pr2_version_4.11.1_mothur.fasta -dbtype 'nucl' -out pr2

Note: the ! at the beginning of the command indicates that the command should be run in bash. If you are running these commands in a separate terminal, you can remove the !.

Note 2: If you cloned this whole repository, you do not need to download the PR2 sequences or make the blast database, it is included in the repo.

Blast your sequences against the PR2 database:
!blastn -query sigMAST.fasta -db pr2 -outfmt 6 > MASTsigBlast.txt

For this example, I am blasting MAST (Marine Stramenopile) sequences that were enriched at a vent site in the Okinawa Trough.

Step 2: Some data wrangling with Pandas

Import pandas and hush warnings:

In [1]:
import pandas as pd
import warnings
warnings.simplefilter('ignore')

Read in the blast results as a pandas data frame and add column names:

In [2]:
blast = pd.read_table("MASTsigBlast.txt", names = ["query_id", "pr2_accession", "pct_identity", "aln_length", "n_of_mismatches", "gap_openings", "q_start", "q_end", "s_start", "s_end", "e_value", "bit_score"])
blast = blast[blast['e_value']<=0.00001]

Make a list of the query sequence names:

In [3]:
queries= blast.query_id.unique()
queries
Out[3]:
array(['04f3b8953d543c25971a081f83642a14',
       '87e0cc57f3745b994536844cb334472c',
       'b0c7701d42954b2c0294b8cbb69bba64',
       'bd57dc3cebccdb81641adf03c8a610c3',
       'c80c006a131afe0d5457ffd185747821'], dtype=object)

Select which query to make a map for:

In [4]:
blast1= blast[blast.query_id==queries[2]] # here I am selecting query sequence #2

Filter the blast results based on your preferences:

In [5]:
blast1=blast1[blast1.pct_identity >= 97][blast1.aln_length >=400][["query_id","pr2_accession", "pct_identity"]]
# here I am selecting results with % identity greater than or equal to 97 and with alignment length greater than 400bp

Read in the PR2 database metadata as pandas dataframe and select the columns of interest:
Get the database metadate with:

!wget https://github.com/pr2database/pr2database/releases/download/4.11.1/pr2_version_4.11.1_merged.tsv.gz

and unzip with !gunzip pr2_version_4.11.1_merged.tsv.gz

In [6]:
pr2=pd.read_table("pr2_version_4.11.1_merged.tsv") #this takes a second because the file is quite large
pr22=pr2[['pr2_accession', 'genus',  'species', 'gb_organism', 'gb_taxonomy',  'gb_isolation_source', 'gb_country', 'gb_publication', 'pr2_ocean', 'pr2_latitude',
       'pr2_longitude', 'pr2_sequence_origin', 'pr2_size_fraction']]

Merge the blast results with the PR2 metadata dataframe:

In [7]:
pr2df = pd.merge(blast1, pr22, on='pr2_accession')
pr2df1 = pr2df
#manually jitter points so that each point has a unique lat/lon - important to be able to see % identity of each hit
pr2df1['index1'] = [i for i in range(pr2df1.shape[0])]
pr2df1['index1'] = pr2df1['index1'].mul(.00001)
pr2df1['lat1']= pr2df1['pr2_latitude'].add(pr2df1['index1'])
pr2df1['lon1']= pr2df1['pr2_longitude'].add(pr2df1['index1'])

Split out rows that do not have lat/lon but do have gb_country:

In [8]:
#get data frame of entries that don't have a pr2_lat or pr2_lon, but DO have gb_country assigned
noLats = pr2df[pr2df.pr2_latitude.isnull()]
countryLats = noLats[~noLats.gb_country.isnull()]
countryLats['country']= countryLats["gb_country"].str.split(":", n = 1, expand = True)[0] #split country label and keep the 1st part
#here we are using country, but not any other geographic information in this column --> room for improvement!
countries = pd.read_csv("countries.csv", names = ["country", "lon", "lat"], encoding='latin-1')
result = pd.merge(countryLats, countries, on='country')
#manual manually jitter points
result['index1'] = [i for i in range(result.shape[0])]
result['index1'] = result['index1'].mul(.00001)
result['lat1']= result['lat'].add(result['index1'])
result['lon1']= result['lon'].astype('float').add(result['index1'])

Split out rows that do not have a gb_country:

In [9]:
#get data frame of entries with no gb_country or pr2_lat pr2_lon
country_noLats = noLats[noLats.gb_country.isnull()]
country_noLats_cp = country_noLats[~country_noLats.gb_isolation_source.isnull()]
country_noLats_cp.reset_index(inplace=True)

Import fuzzywuzzy:
(to install: conda install -c conda-forge fuzzywuzzy)
https://anaconda.org/conda-forge/fuzzywuzzy

In [10]:
# fuzz is used to compare TWO strings
from fuzzywuzzy import fuzz
# process is used to compare a string to MULTIPLE other strings
from fuzzywuzzy import process
In [11]:
#function to fuzzy match isolation source with countries/oceans in order to assign lat/lon
names_array=[]
ratio_array=[]
def match_names(wrong_names,correct_names):
    for row in wrong_names:
        x=process.extractOne(row, correct_names)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array,ratio_array

Use fuzzy matching to get locations from gb_isolation_source:

In [12]:
wrong_names = country_noLats_cp['gb_isolation_source']
correct_names= countries[~countries.country.isnull()]['country']

name_match,ratio_match=match_names(wrong_names,correct_names)
 
country_noLats_cp['country']=pd.Series(name_match)
country_noLats_cp['country_names_ratio']=pd.Series(ratio_match)

country_noLats=country_noLats_cp[country_noLats_cp.country_names_ratio >89] #only keep good fuzzy matches
country_noLatsF = pd.merge(country_noLats, countries, on='country') #merge with countries df to get lat/lon

#manual point jitter so you can see individual points when you zoom in
country_noLatsF['index1'] = [i for i in range(country_noLatsF.shape[0])]
country_noLatsF['index1'] = country_noLatsF['index1'].mul(.00001)
country_noLatsF['lat1']= country_noLatsF['lat'].add(country_noLatsF['index1'])

Step 3: Make the map!

Import bokeh to build the interactive map:
(to install: conda install bokeh)
https://bokeh.pydata.org/en/latest/docs/installation.html

In [13]:
from bokeh.plotting import show
from bokeh.io import output_notebook
from bokeh.tile_providers import get_provider, Vendors
from bokeh.models import ColumnDataSource, GMapOptions, LinearColorMapper, BasicTicker, PrintfTickFormatter, ColorBar
from bokeh.plotting import gmap
from bokeh.plotting import figure
from bokeh.models import HoverTool
from bokeh.transform import jitter

To use google maps, you need a google API key.
So if you are going to use this code, you'll have to request your own API key and then insert it into the plotting code cell.
The folllowing video makes it very clear how to acquire the right type of google API key:
https://www.youtube.com/watch?v=9ImLCQBj9SE

And plot!

In [14]:
hover1 = HoverTool()
output_notebook()
map_options = GMapOptions(map_type="satellite", zoom=1)

# Replace the value below with your personal API key:
API_key = 'AIzaSyDRonCqOTohU3Y7XGV5NJ36mKynWgtNGjI'

p = gmap(API_key, map_options, title=pr2df1["query_id"][0], toolbar_location='below')

source = ColumnDataSource(data=pr2df1)
source2 = ColumnDataSource(data = result)
source3 = ColumnDataSource(data = country_noLatsF)

#Add the desired hover values. Any columns from the results data frame can be added. 

hover1.tooltips = """
    <font face="Arial" size="0">
    <strong>Accession:</strong> @pr2_accession <br>
    <strong>Name:</strong> @species <br>
    <strong>Location:</strong> @gb_country <br>
    <strong>Source:</strong> @gb_isolation_source
    </font>
"""

colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]
mapper = LinearColorMapper(palette=colors, low=pr2df.pct_identity.min(), high=pr2df.pct_identity.max())

p.circle(x=jitter("lon1",0.1), y=jitter("lat1",0.1), size=11, fill_color={'field': 'pct_identity', 'transform': mapper}, line_color='black', fill_alpha=0.8, source=source, legend="pr2 lat/lon")
p.square(x=jitter("lon1",0.1), y=jitter("lat1", 0.1), size=11, fill_color={'field': 'pct_identity', 'transform': mapper}, line_color='black', fill_alpha=0.8, source=source2, legend="gb country")
p.triangle(x="lon", y=jitter("lat1",0.1), size=15, fill_color={'field': 'pct_identity', 'transform': mapper}, line_color='black', fill_alpha=0.8, source=source3, legend="fuzzy matched")


color_bar = ColorBar(color_mapper=mapper, major_label_text_font_size="8pt",
                     ticker=BasicTicker(desired_num_ticks=len(colors)),
                     formatter=PrintfTickFormatter(format="%2.2f%%"),
                     label_standoff=10, border_line_color=None, location=(0, 0))
p.add_layout(color_bar, 'right')
p.legend.title = 'location source'

p.add_tools(hover1)

p.legend.click_policy="hide" # if you click on the legend, the corresponding points will be removed from the plot
show(p)
Loading BokehJS ...

That's it!
I hope this is useful to others, as well!
Any questions, comments, or feedback can be sent to margaret.marsbrisbin@oist.jp or of course tweeted.

Note: Additional files used in this tutorial, including the interactive Jupyter notebook, are available from the github repository: https://github.com/maggimars/eukGeoBlast

In [15]:
import IPython
print(IPython.sys_info())
{'commit_hash': 'd774f565b',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': '/Users/brisbin/miniconda2/envs/Python4Data/lib/python3.6/site-packages/IPython',
 'ipython_version': '7.4.0',
 'os_name': 'posix',
 'platform': 'Darwin-17.7.0-x86_64-i386-64bit',
 'sys_executable': '/Users/brisbin/miniconda2/envs/Python4Data/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.6.7 | packaged by conda-forge | (default, Feb 28 2019, '
                '02:16:08) \n'
                '[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'}