Developer's Guide

Top-Level Overview


SemNExT is built on top of Docker and Docker Compose, so it must be run on either OS X or Linux. When built, SemNExT and each of its constituent services run as Docker containers, linked to one another through the configuration in the .yml file that Docker Compose is executed against. Three core services underlie every SemNExT-based application: the SemNExT service, a SPARQL endpoint, and Redis. The SPARQL endpoint handles internal knowledge representation, especially the application's ontology. Redis is the underlying implementation of the Cache object (see below) and holds all intermediate results passing through the SemNExT pipeline. The SemNExT service itself composes the user-defined components and manages the API calls that use them. The API is built on CherryPy, which makes the system's functionality easy to extend.
Vagrant is configured as part of the source release to provide a consistent development environment. For production use, or where performance is a concern, SemNExT's Docker containers can be started directly on a host machine without Vagrant.
We are currently integrating SemNExT with Docker Machine so that it can run as a distributed system across a network. This will let users spread heavy numerical computations over additional machines and significantly speed up their SemNExT applications.

Starting Up


To get started, see the Quickstart section of README.md, which describes several ways to initialize a local instance of the service, including one with a local instance of the UMLS database. SemNExT can also be started by running docker-compose up against docker-compose.yml directly. This is a more lightweight option for development when combined with scripts/refresh_semnext.sh, which recompiles the local SemNExT repository with any recent changes without tearing down the other Dockerized services (StringDB, UMLS, etc.).

Ontology


To provide a consistent way of referencing information from various datasources, SemNExT uses an internal OWL ontology to reconcile and map schemas from datasets that are semantically aware and to provide a model for those that aren't. The accuracy of an application therefore depends on the ontology being expressive and consistent. This is particularly important in applications that are highly sensitive to anomalies in their datasets, such as our bioinformatics application.
An example usage is in defining class equivalencies between different schemas. The ontology provides a way of asserting that genes found in Uniprot, Bio2RDF, and other datasets are all different representations of the same entity, promoting datasource interoperability and allowing the user to decorate Python object representations with a mix of facts derived from different datasets while remaining consistent.
Because SemNExT's datamodel and provenance class hierarchies are designed to mirror the structure of the PROV Ontology, a W3C recommendation, extensions to the core model should follow the same design pattern in both the OWL and the Python.

SemNExT Classes


SemNExT relies on a collection of nine classes that form the basis for applications using the framework. Each class has a representation in the core ontology, which allows its instances to be reasoned about in the triplestore before being implemented in Python. Unless otherwise noted, all of the following are subclassed from the prov.Entity object in order to keep the design consistent going forward.

Analysis


The Analysis object is a high-level representation of a mathematical procedure implemented as a script. The constructor accepts a URI referencing this script, which is then downloaded and housed in the Python representation for future use. This class is not meant to be instantiated directly but should be subclassed according to the desired implementation. The RAnalysis class is such an implementation designed to handle references to R scripts.
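The subclassing pattern described above might look roughly like the following sketch. Every name in it apart from Analysis and RAnalysis is an assumption made for illustration (the script_uri parameter, the loader hook, the command method); it is not SemNExT's actual API.

```python
from urllib.request import urlopen


def fetch_text(uri):
    """Download the resource at `uri` and return it as text."""
    with urlopen(uri) as response:
        return response.read().decode("utf-8")


class Analysis:
    """Hypothetical base class: holds a script fetched from a URI."""

    def __init__(self, script_uri, loader=fetch_text):
        self.script_uri = script_uri
        # The script is fetched once and kept on the instance for later use.
        self.script = loader(script_uri)

    def command(self, script_path):
        # Subclasses decide how a stored script would be executed.
        raise NotImplementedError("subclass Analysis for a concrete language")


class RAnalysis(Analysis):
    """Hypothetical subclass for R scripts."""

    def command(self, script_path):
        # Command line that would run the script once saved to `script_path`.
        return ["Rscript", script_path]
```

The loader hook is only there so the download step can be swapped out, for example in tests or when scripts are stored locally.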

Annotator


The Annotator object is a lightweight class defined with two methods, processes_entity and annotate_entity. If processes_entity returns True when called on a prov.Entity instance (or an instance of a subclass), the entity can be passed to annotate_entity in order to append useful information to the Python instance.
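As an illustration of this two-method contract, here is a minimal sketch. The Entity stand-in, the Gene class, LabelAnnotator, and annotate_all are all hypothetical names introduced for the example; only processes_entity and annotate_entity come from the description above.

```python
class Entity:
    """Stand-in for prov.Entity, used only to keep this sketch self-contained."""

    def __init__(self, uri):
        self.uri = uri
        self.annotations = {}


class Gene(Entity):
    """Hypothetical Entity subclass."""


class LabelAnnotator:
    """Hypothetical annotator that attaches a label to Gene instances."""

    def __init__(self, labels):
        self.labels = labels  # mapping of URI -> label

    def processes_entity(self, entity):
        # Only Gene instances with a known label are handled.
        return isinstance(entity, Gene) and entity.uri in self.labels

    def annotate_entity(self, entity):
        entity.annotations["label"] = self.labels[entity.uri]
        return entity


def annotate_all(entities, annotators):
    """Apply each annotator to every entity it declares it can process."""
    for entity in entities:
        for annotator in annotators:
            if annotator.processes_entity(entity):
                annotator.annotate_entity(entity)
    return entities
```

Keeping the check and the annotation in separate methods lets a pipeline skip entities an annotator does not understand without raising errors.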

Cache


The Cache object is a top-level interface for interacting with a user-defined memory management system. For our purposes, we have defined a RedisCache object, which allows the user to interact with the Redis instance created as part of the SemNExT service. It stores and returns intermediary results in memory as JSON-LD and manages cache eviction operations.
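The shape of such an interface can be sketched as follows. The method names (get, set, evict) are assumptions, and a dictionary-backed class stands in for RedisCache so that the example runs without a Redis server; it is not the actual SemNExT implementation.

```python
import json


class Cache:
    """Hypothetical top-level interface; method names are assumptions."""

    def get(self, key):
        raise NotImplementedError

    def set(self, key, value):
        raise NotImplementedError

    def evict(self, key):
        raise NotImplementedError


class InMemoryCache(Cache):
    """Dictionary-backed stand-in for a Redis-backed implementation."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        # Values are serialized to a JSON string, as a JSON-LD document would be.
        self._store[key] = json.dumps(value, sort_keys=True)

    def get(self, key):
        raw = self._store.get(key)
        return None if raw is None else json.loads(raw)

    def evict(self, key):
        # Returns True if an entry was removed, False if the key was absent.
        return self._store.pop(key, None) is not None
```

A Redis-backed subclass would keep the same three methods and replace the dictionary with calls to a Redis client.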

Crawler


The Crawler module serves as a container for Scrapy spiders aimed at sites of interest that do not have datasets available in a format appropriate for use with the Datasource module. This allows SemNExT to gather unstructured data through text scraping as well.

Datamodel


Datamodel objects mirror the structure of the concepts defined in the application's ontology and are used for representation of instances in Python. Class attributes capture the values of the instance's roles, creating an object-oriented reflection of the semantic model. The core class hierarchy defines an Annotatable class that subclasses the Python dict, allowing it to accept arbitrary key-value pairs. It is recommended that Datamodel classes inherit from this class in order to allow for flexibility in the application's representation while maintaining a consistent mapping between the ontology and Python representation in the form of class attributes.
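A minimal sketch of this recommendation, assuming a simplified stand-in for Annotatable (the real class may take different constructor arguments) and a hypothetical Gene concept with a single symbol role:

```python
class Annotatable(dict):
    """Stand-in for the core Annotatable class: a dict that also has a URI."""

    def __init__(self, uri, **annotations):
        super().__init__(**annotations)
        self.uri = uri


class Gene(Annotatable):
    """Hypothetical Datamodel class for a Gene concept in an ontology."""

    def __init__(self, uri, symbol, **annotations):
        super().__init__(uri, **annotations)
        # Attribute mirroring a role defined for the concept in the ontology.
        self.symbol = symbol


gene = Gene("http://example.org/gene/1", symbol="BRCA1")
# Arbitrary key-value pairs can still be attached through the dict interface.
gene["expression_level"] = 0.42
```

Attributes carry the values that map back to the ontology, while the dict entries hold application-specific extras that have no counterpart in the semantic model.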

Datasource


Datasource objects provide an abstract interface for interaction with external datasets. The base class must be subclassed to provide concrete functionality according to the type of datasource (SPARQL, SQL, etc.). Additional methods can be defined on a subclass to add functionality and generate new results from the source.
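One way to picture this arrangement is the sketch below, which uses Python's abc module and an in-process SQLite database. The query method name, the SQLiteDatasource class, and the gene table are assumptions chosen for the example rather than parts of SemNExT.

```python
import sqlite3
from abc import ABC, abstractmethod


class Datasource(ABC):
    """Hypothetical abstract interface; the `query` method name is an assumption."""

    @abstractmethod
    def query(self, statement, parameters=()):
        """Run a query and return a list of result rows."""


class SQLiteDatasource(Datasource):
    """Concrete subclass backed by a SQLite database."""

    def __init__(self, path):
        self.connection = sqlite3.connect(path)

    def query(self, statement, parameters=()):
        return self.connection.execute(statement, parameters).fetchall()

    def genes_on_chromosome(self, chromosome):
        # An additional method built on top of the generic query method.
        rows = self.query(
            "SELECT symbol FROM gene WHERE chromosome = ? ORDER BY symbol",
            (chromosome,),
        )
        return [symbol for (symbol,) in rows]
```

Because query is abstract, instantiating Datasource directly raises a TypeError, which enforces the requirement that the interface be subclassed.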

Provenance


The Provenance class hierarchy mirrors a small segment of the top-level classes contained in PROV-O, in addition to owl:Thing. These definitions are used as the basis for many of the objects defined in a SemNExT application. The top-level Thing class mirrors owl:Thing and allows a URI and a set of types (which automatically includes owl:Thing) to be attached to the instance. Entity is a representation of prov:Entity and is the direct parent class of the AnnotatableEntity, Source, Annotator and Analysis classes. Entity doesn't add any extra functionality but is included for the sake of consistency between the Python and OWL representations. Any further concepts desired from PROV-O should be included in the Provenance module's __init__.py, while concepts from SemNExT's internal ontology and other external ontologies should be defined in the Datamodel's __init__.py.
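The hierarchy described above can be sketched roughly as follows. The constructor signature and the types attribute are assumptions; only the class names and the rule that owl:Thing is always among the types come from the description.

```python
OWL_THING = "http://www.w3.org/2002/07/owl#Thing"
PROV_ENTITY = "http://www.w3.org/ns/prov#Entity"


class Thing:
    """Mirrors owl:Thing: carries a URI and a set of type URIs."""

    def __init__(self, uri, types=()):
        self.uri = uri
        # owl:Thing is always included, whatever types the caller passes.
        self.types = {OWL_THING} | set(types)


class Entity(Thing):
    """Mirrors prov:Entity; adds no behaviour of its own."""


class Source(Entity):
    """One of the classes listed as a direct child of Entity."""


entity = Entity("http://example.org/dataset/1", types=[PROV_ENTITY])
```

Even though Entity is empty, keeping it in the Python hierarchy means an isinstance check lines up with the subclass relationships asserted in the OWL.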

Serializer


The Serializer object has yet to be implemented but will serve as a component mapping SemNExT objects into JSON-LD. In the meantime, this functionality is handled primarily by the RedisCache implementation.

Webservice


The Webservice __init__.py file houses the API calls of SemNExT and is used to tie together the various components described above. New API methods are written as Python functions that are then exposed to CherryPy (using @cherrypy.expose()), allowing a user to invoke them directly using a URI of the following form:

https://semnext.tw.rpi.edu/api/v1/method?param1=value1&param2=value2...
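From the client side, a URI of this form can be assembled with the standard library. The sketch below assumes a hypothetical helper named build_api_url; the method and parameter names are placeholders, as in the form shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://semnext.tw.rpi.edu/api/v1"


def build_api_url(method, **params):
    """Return the URL for an API method with its query-string parameters."""
    # urlencode percent-escapes values so they are safe inside a query string.
    query = urlencode(params)
    return f"{BASE_URL}/{method}?{query}" if query else f"{BASE_URL}/{method}"


url = build_api_url("method", param1="value1", param2="value2")
```

Each query-string parameter corresponds to a keyword argument of the exposed Python function on the server side.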

 


Coding Standards


As mentioned on our GitHub page, we recommend adhering to a strict style standard when developing for SemNExT. This both enhances the reusability of your code (as suggested under the Architecture section of this page) and preserves your own sanity, as SemNExT projects can become rather large.