Monday, December 11, 2006

How to access NCBI BLAST using Java?

Part I.
NCBI BLAST can be searched in two ways: through the web form available on the NCBI home page, or through a custom interface of our own.
In this tutorial we concentrate on the second approach and implement HTTP access using NCBI's URL API.
“The URL API is a standardized application program interface (API) for accessing the NCBI QBlast system. It uses direct HTTP-encoded requests to NCBI web server.”
In order to implement this access we must have the following prerequisite libraries available for use:
Jakarta Commons HttpClient
The quick outline of the steps required is as follows:
First we submit our search to NCBI BLAST. The reply contains either the estimated number of seconds needed to process the request, or an error message if the query was malformed.
Once that waiting period has expired, we query the server again for the status of our request. If the request is ready we proceed to fetch the final results; otherwise we keep waiting until it completes, and then proceed.
Basically, our program emulates a browser accessing BLAST through the ordinary web form.
More precisely, our program builds an HTTP GET query string and sends it to the NCBI BLAST server.
Every query must specify a CMD value, which indicates whether we are submitting a search (CMD=Put) or requesting its results (CMD=Get).
There are also other CMD values (INFO and WEB), but we won't look at them now, as they are not essential for our access interface.
Along with CMD=Put, we also need to select a database using the DATABASE variable (e.g. DATABASE=nr), supply the sequence using the QUERY variable, and choose the search program using the PROGRAM variable (e.g. PROGRAM=blastp).
Other values are not mandatory; if they are not specified, the defaults are used.
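The query is built with a UrlQuery helper class. Note that UrlQuery is not shipped with Jakarta Commons HttpClient 3.x (which offers NameValuePair and HttpMethod.setQueryString for this purpose), so treat it as a small helper that collects name/value pairs: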
UrlQuery putQuery = new UrlQuery();
putQuery.setNameValue("CMD", "Put");
putQuery.setNameValue("DATABASE", "nr");
putQuery.setNameValue("PROGRAM", "blastp");
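Since UrlQuery may not be available on your classpath, here is a minimal sketch of an equivalent helper built only on the Java standard library. The class name BlastQueryString and its methods are hypothetical names chosen for this example, and the sequence used in the usage note is a made-up placeholder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical stand-in for UrlQuery: collects name/value pairs for a GET request. */
final class BlastQueryString {
    // LinkedHashMap keeps the parameters in insertion order, so the output is predictable.
    private final Map<String, String> params = new LinkedHashMap<>();

    /** Adds or overwrites one parameter and returns this object for chaining. */
    BlastQueryString set(String name, String value) {
        params.put(name, value);
        return this;
    }

    /** Returns the URL-encoded query string, e.g. "CMD=Put&DATABASE=nr". */
    String build() {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }
        return joiner.toString();
    }

    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, new BlastQueryString().set("CMD", "Put").set("DATABASE", "nr").set("PROGRAM", "blastp").set("QUERY", "MKTAYIAK").build() produces a string that can be appended after the "?" of the request URL.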
After the query has been specified, we continue with:
HttpClient http = new HttpClient();
GetMethod getMethod = new GetMethod("");
And finally, we send the request:
int statusCode = http.executeMethod(getMethod);
Then we check for the results using:
if (statusCode != HttpStatus.SC_OK)
If all goes well, we read the response body into a String:
InputStreamReader inputStream = new InputStreamReader(getMethod.getResponseBodyAsStream());
BufferedReader buffer = new BufferedReader(inputStream);
StringBuilder resultBuffer = new StringBuilder(2048);
String line;
while ((line = buffer.readLine()) != null) {
    // keep the line breaks so the HTML comment block stays line-oriented
    resultBuffer.append(line).append('\n');
}
String result = resultBuffer.toString();
It is also important to close all opened connections:
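One way to make sure the stream is always closed is try-with-resources. The sketch below uses only the standard library; the class and method names (ResponseReader, readAndClose) are hypothetical. With Commons HttpClient 3.x you would additionally call getMethod.releaseConnection() in a finally block once you are done with the response.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class ResponseReader {
    /** Reads the whole stream as UTF-8 text and closes it, even if reading fails. */
    static String readAndClose(InputStream body) throws IOException {
        StringBuilder text = new StringBuilder(2048);
        // try-with-resources closes the reader, which in turn closes the wrapped stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }
}
```

The UTF-8 charset is an assumption of this sketch; if the server declares a different encoding, pass that one instead.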
The result is an HTML page, not yet the final data set we are trying to retrieve.
NCBI returns two values inside an HTML comment; the rest of the page we ignore:
<!-- QBlastInfoBegin
RID = 954517067-8610-1647
RTOE = 207
QBlastInfoEnd
-->
RID (Request Identifier) is something similar to a session ID, and it remains valid for 24h. It is the value we send back to NCBI when we request the result.
RTOE (Request Time of Execution) is the number of seconds to wait until NCBI has finished processing our request.
Now we need to extract the RID and RTOE values.
We will do that using regular expressions:
public static final String RTOE_RID_PATTERN =
"<!--\\s*QBlastInfoBegin\\s*RID = (.*)\\s*RTOE = (.*)\\s*QBlastInfoEnd\\s*-->";
Pattern pattern = Pattern.compile(RTOE_RID_PATTERN);
Matcher matcher = pattern.matcher(result);
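String resultRID = null;
String resultRTOE = null;
if (matcher.find()) {
    if (matcher.groupCount() != 2) {
        return false;
    }
    // group 1 is the RID, group 2 is the RTOE; trim stray trailing spaces
    resultRID = matcher.group(1).trim();
    resultRTOE = matcher.group(2).trim();
}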
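As a self-contained variant of the extraction step, the sketch below wraps the parsing in one small class. The class name QBlastInfo and the slightly stricter pattern (non-blank RID, numeric RTOE) are choices made for this example, not something prescribed by NCBI.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the two values found in the QBlastInfo comment block. */
final class QBlastInfo {
    // \S+ captures the RID up to the first whitespace, \d+ captures the seconds.
    private static final Pattern RID_RTOE = Pattern.compile(
            "QBlastInfoBegin\\s*RID = (\\S+)\\s*RTOE = (\\d+)\\s*QBlastInfoEnd");

    final String rid;
    final long rtoeSeconds;

    private QBlastInfo(String rid, long rtoeSeconds) {
        this.rid = rid;
        this.rtoeSeconds = rtoeSeconds;
    }

    /** Returns the parsed values, or null when the page contains no QBlastInfo block. */
    static QBlastInfo parse(String html) {
        Matcher matcher = RID_RTOE.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        return new QBlastInfo(matcher.group(1), Long.parseLong(matcher.group(2)));
    }
}
```

Returning null for a missing block keeps the sketch short; in a larger program an exception or an Optional may be a better fit.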
Now that we know how much longer we need to wait, we call:
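// resultRTOE is a String, so it must be parsed rather than cast
long sleepMillis = TimeUnit.SECONDS.toMillis(Long.parseLong(resultRTOE));
Thread.sleep(sleepMillis);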
After waiting, we make another query to check whether our request has been processed:
UrlQuery getQuery = new UrlQuery();
getQuery.setNameValue("RID", resultRID);
getQuery.setNameValue("CMD", "Get");
The reply is another HTML page. From it we need to retrieve the Status value, which can be WAITING, READY, UNKNOWN or ERROR:
WAITING - we need to wait a few more seconds
READY - request has been processed
UNKNOWN - request has been lost
ERROR - our request was malformed; we made a mistake while issuing it
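The four states map naturally onto a Java enum. The sketch below assumes the page contains a line of the form Status=READY; the enum name BlastStatus and the decision to treat a missing or unrecognized value as ERROR are assumptions of this example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum BlastStatus {
    WAITING, READY, UNKNOWN, ERROR;

    // Matches "Status=READY" and tolerates spaces around the equals sign.
    private static final Pattern STATUS = Pattern.compile("Status\\s*=\\s*(\\w+)");

    /** Extracts the status from the reply; falls back to ERROR when none is found. */
    static BlastStatus fromHtml(String html) {
        Matcher matcher = STATUS.matcher(html);
        if (!matcher.find()) {
            return ERROR;
        }
        try {
            return valueOf(matcher.group(1).toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // The page carried a status word that is not one of the four known values.
            return ERROR;
        }
    }
}
```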
If we get Status=WAITING, we wait a few more seconds and resend the query. If we get Status=READY, we can create a query that retrieves the final result data in the form of an XML document.
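The wait-and-retry logic can be sketched as a small polling loop. Here the actual HTTP call is passed in as a Supplier, so the loop itself stays independent of any HTTP library; the names BlastPoller, waitForResult, pauseMillis and maxAttempts are hypothetical, and the attempt limit is an addition of this sketch to avoid looping forever.

```java
import java.util.function.Supplier;

final class BlastPoller {
    /**
     * Calls checkStatus until it returns something other than "WAITING"
     * or until maxAttempts calls have been made, sleeping between calls.
     * Returns the last status seen.
     */
    static String waitForResult(Supplier<String> checkStatus, long pauseMillis, int maxAttempts)
            throws InterruptedException {
        String status = "WAITING";
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            status = checkStatus.get();
            if (!"WAITING".equals(status)) {
                return status; // READY, UNKNOWN or ERROR: nothing more to wait for
            }
            Thread.sleep(pauseMillis);
        }
        return status; // still WAITING after maxAttempts calls
    }
}
```

A caller would pass a lambda that sends the CMD=Get request and extracts the Status value, then fetch the results only when the returned status is READY.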
To be continued in part II of this tutorial.

1 comment:

Brave Baboo said...

The UrlQuery and GetMethod classes are not found in Jakarta Commons HttpClient or Castor (objects of these classes can't be declared or initialized).
So this method can't be used!