Epiviz File Server - Query & Transform Data from Indexed Genomic Files in Python¶
Epiviz File Server is a scalable data query and compute system for indexed genomic files. In addition to querying data, users can compute transformations, summarizations, and aggregations with NumPy functions applied directly to data queried from files.
Because the genomic files are indexed, the library requests and parses only the bytes needed to answer a request, without loading the entire file into memory. A cache system efficiently manages already accessed bytes of a file, and dask is used to parallelize query and transformation computations. Together, these allow the system to scale to large data repositories.
This blog post (Jupyter notebook) describes various features of the file server library using genomic files hosted from the NIH Roadmap Epigenomics project.
- The library provides the following modules:
- Parser: Read various genomic file formats,
- Query: Access only the necessary bytes of a file for a given genomic location,
- Compute: Apply transformations to the data,
- Server: Instantly convert the datasets into a REST API,
- Visualization: Interactively explore the data using Epiviz (uses the Server module above).
Note
- The Epiviz File Server is an open-source project on GitHub
- Let us know what you think, and send any feedback or feature requests to improve the library!
Contents¶
Installation¶
Development Version¶
To install the development version from GitHub, install using pip:
pip install git+https://github.com/epiviz/epivizFileParser.git
You can also clone the repository and install it from the local directory using pip.
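For example, roughly (using the same repository as above; the directory name simply follows from the clone):
git clone https://github.com/epiviz/epivizFileParser.git
cd epivizFileParser
pip install .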
Note
If you don’t have sudo rights to install the package, you can install it to the user directory using
pip install --user epivizfileserver
Tutorial¶
This blog post (Jupyter notebook) describes various features of the file server library using genomic files hosted from the NIH Roadmap Epigenomics project.
Note
This post is a general walkthrough of the features of the file server. More use cases will be posted soon!
Import Measurements from File¶
Since large data repositories contain hundreds of files, manually adding each file would be cumbersome. To make this process easier, we create a configuration file that lists all files and their locations. An example configuration file is described below.
Configuration file¶
The following is a configuration file for data hosted on the Roadmap Epigenomics data server. It describes ChIP-seq experiments for the H3K27me3 marker in Esophagus and Sigmoid Colon tissues. Most fields in the configuration file are self-explanatory.
[
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E079-H3K27me3",
    "id": "E079-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Esophagus",
      "marker": "H3K27me3"
    }
  },
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E106-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E106-H3K27me3",
    "id": "E106-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Sigmoid Colon",
      "marker": "H3K27me3"
    }
  }
]
Once the configuration file is generated, we can import these measurements into the file server. We first create a MeasurementManager object, which manages measurements from files and databases, and a file handler (via create_fileHandler) that enables parallel processing of requests. We then use the helper function import_files to import all measurements from this configuration file.
import os
from epivizfileserver import MeasurementManager, create_fileHandler

mMgr = MeasurementManager()      # manages measurements from files and databases
mHandler = create_fileHandler()  # enables parallel processing of requests
fmeasurements = mMgr.import_files(os.getcwd() + "/roadmap.json", mHandler)
fmeasurements
Query for a genomic location¶
After loading the measurements, we can query the object for data in a particular genomic region using the get_data function.
result, err = await fmeasurements[1].get_data("chr11", 10550488, 11554489)
result.head()
The response is a tuple: a DataFrame containing the results, and an error if any occurred.
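Note that get_data is a coroutine: the bare await above works inside a Jupyter notebook, which already runs an event loop. In a standalone script you would drive it with asyncio, roughly like this sketch:
import asyncio

async def fetch_region():
    # query the second loaded measurement for a region on chr11
    result, err = await fmeasurements[1].get_data("chr11", 10550488, 11554489)
    if err:
        raise RuntimeError(err)
    return result

df = asyncio.run(fetch_region())
print(df.head())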
Compute a Function over files¶
We can also define new measurements that are computed by applying a NumPy function over the files loaded in the previous step.
Note
You can also write a custom statistical function that is applied to every row of the DataFrame. It must follow the same syntax as any NumPy row-apply function; a sketch is shown at the end of this section.
As an example, we can compute the average ChIP-seq signal for the H3K27me3 marker across the two tissues.
import numpy

computed_measurement = mMgr.add_computed_measurement("computed", "avg_ChIP_seq", "Average ChIP seq expression",
                            measurements=fmeasurements, computeFunc=numpy.mean)
After defining a computed measurement, we can query this measurement for a genomic location.
result, err = await computed_measurement.get_data("chr11", 10550488, 11554489)
result.head()
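The computeFunc argument does not have to be a NumPy function. As noted above, any callable that follows the same row-apply convention works; here is a minimal sketch (signal_range and its labels are our own names for illustration, not part of the library):
import numpy

def signal_range(values, axis=None):
    # spread between the strongest and weakest signal across measurements;
    # mirrors the calling convention of a NumPy reduction like numpy.mean
    return numpy.max(values, axis=axis) - numpy.min(values, axis=axis)

range_measurement = mMgr.add_computed_measurement("computed", "range_ChIP_seq", "ChIP-seq signal range",
                            measurements=fmeasurements, computeFunc=signal_range)
result, err = await range_measurement.get_data("chr11", 10550488, 11554489)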
Setup a REST API¶
Developers often want to include data from genomic files in a web application for visualization, or in their analysis workflows. We can quickly set up a REST API web server from the measurements we loaded:
from epivizfileserver import setup_app
app = setup_app(mMgr)
app.run(port=8000)
The REST API is an asynchronous web server built on top of Sanic.
Query Files from AnnotationHub¶
We can also use Bioconductor's AnnotationHub to search for files and set up the file server. We are working on simplifying this process.
The AnnotationHub API is hosted at https://annotationhub.bioconductor.org/.
We first download the AnnotationHub SQLite database, which catalogs the available data resources.
wget http://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3
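If you prefer to stay in Python rather than use wget, the same download can be done with the standard library (a minimal sketch, same URL as above):
import urllib.request

# download the AnnotationHub metadata database into the working directory
urllib.request.urlretrieve(
    "http://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3",
    "annotationhub.sqlite3")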
After downloading the resource database from AnnotationHub, we can load the SQLite database into Python and query it for datasets.
import pandas
import os
import sqlite3
conn = sqlite3.connect("annotationhub.sqlite3")
cur = conn.cursor()
cur.execute("select * from resources r JOIN input_sources inp_src ON r.id = inp_src.resource_id;")
results = cur.fetchall()
pd = pandas.DataFrame(results, columns = ["id", "ah_id", "title", "dataprovider", "species", "taxonomyid", "genome",
"description", "coordinate_1_based", "maintainer", "status_id",
"location_prefix_id", "recipe_id", "rdatadateadded", "rdatadateremoved",
"record_id", "preparerclass", "id", "sourcesize", "sourceurl", "sourceversion",
"sourcemd5", "sourcelastmodifieddate", "resource_id", "source_type"])
pd.head()
For the purpose of this tutorial, we will filter for the Sigmoid Colon ("E106") and Esophagus ("E079") tissues and for the ChIP-seq data files of the "H3K27me3" histone marker from the Roadmap Epigenomics project.
roadmap = pd.query('dataprovider=="BroadInstitute" and genome=="hg19"')
roadmap = roadmap.query('title.str.contains("H3K27me3") and (title.str.contains("E106") or title.str.contains("E079"))')
# only use the fold-change (fc) signal files
roadmap = roadmap.query('title.str.contains("fc")')
roadmap
After filtering for the resources we are interested in, we can load them into the file server using the import_ahub helper function.
mMgr = MeasurementManager()
ahub_measurements = mMgr.import_ahub(roadmap)
ahub_measurements
The rest of the process is the same as described at the beginning of this tutorial; for example, we can query these measurements for a genomic region.
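A quick sketch, reusing the region from earlier (the index 0 simply picks the first imported measurement):
result, err = await ahub_measurements[0].get_data("chr11", 10550488, 11554489)
result.head()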
Workspaces and Use Cases Set Up Using the File Server¶
The following workspaces have been set up using the Epiviz File Server.
Deployment¶
Using the built-in Sanic server (for development)¶
Sanic provides a default asynchronous web server to run the API. For a working example, check out the Roadmap project in the use cases section.
app = setup_app(mMgr)
app.run("0.0.0.0", port=8000)
Deploy using gunicorn + supervisor (for production)¶
Setup virtualenv and API¶
This process assumes the root API directory is /var/www/epiviz-api
Set up a virtualenv, either through pip or conda
cd /var/www/epiviz-api
virtualenv env
source env/bin/activate
pip install epivizfileserver
A generic version of the API script would look something like this (add it to /var/www/epiviz-api/epiviz.py):
from epivizfileserver import setup_app, create_fileHandler, MeasurementManager
from epivizfileserver.trackhub import TrackHub
# create measurements to load multiple trackhubs or configuration files
mMgr = MeasurementManager()
# create file handler, enables parallel processing of multiple requests
mHandler = create_fileHandler()
# add genomes - for the list of supported genomes,
# check https://obj.umiacs.umd.edu/genomes/index.html
genome = mMgr.add_genome("mm10")
genome = mMgr.add_genome("hg19")
# load measurements/files through config or TrackHub
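# for example, the configuration file from earlier in this tutorial could be
# loaded here (the path below is an assumption about this deployment's layout):
# fmeasurements = mMgr.import_files("/var/www/epiviz-api/roadmap.json", mHandler)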
# setup the app from the measurements manager
# and run the app
app = setup_app(mMgr)
# only if this file is run directly!
if __name__ == "__main__":
app.run(host="127.0.0.1", port=8000)
Install dependencies¶
- Supervisor (system wide) - http://supervisord.org/
- Gunicorn (to the virtual environment) - https://gunicorn.org/
# if using ubuntu
sudo apt install supervisor
# activate virtualenv that runs the API
source /var/www/epiviz-api/env/bin/activate
pip install gunicorn
Configure supervisor¶
Add this configuration to /etc/supervisor/conf.d/epiviz.conf
This snippet also assumes the epiviz-api code is in /var/www/epiviz-api
[program:gunicorn]
directory=/var/www/epiviz-api
environment=PYTHONPATH=/var/www/epiviz-api
command=/var/www/epiviz-api/env/bin/gunicorn epiviz:app --log-level debug --bind 0.0.0.0:8000 --worker-class sanic.worker.GunicornWorker
autostart=true
autorestart=true
stderr_logfile=/var/log/gunicorn/gunicorn.err.log
stdout_logfile=/var/log/gunicorn/gunicorn.out.log
Enable the supervisor configuration:
sudo supervisorctl reread
sudo supervisorctl update
service supervisor restart
Note
Check the status of supervisor (sudo supervisorctl status) to make sure there are no errors.
Add Proxypass to nginx/Apache¶
The port number here should match the bind port from the supervisor configuration.
For Apache:
sudo a2enmod proxy
sudo a2enmod proxy_http
# add this to the apache site config
ProxyPreserveHost On
<Location "/api">
ProxyPass "http://127.0.0.1:8000/"
ProxyPassReverse "http://127.0.0.1:8000/"
</Location>
For nginx:
# add this to nginx site config
upstream epiviz_api_server {
server 127.0.0.1:8000 fail_timeout=0;
}
location /api/ {
proxy_pass http://epiviz_api_server/;
proxy_set_header Host $host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_redirect off;
}
License¶
The MIT License (MIT)
Copyright (c) 2019 Jayaram Kancherla
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contributors¶
- Jayaram Kancherla <jayaram.kancherla@gmail.com>
- Yifan Yang <yang7832@umd.edu>
- Hector Corrada Bravo <hcorrada@gmail.com>