Quantifying the notion of ‘clumpiness’ within alignments obtained from BLAST similarity searches

  • Jacky Birrell

Student thesis: Doctoral Thesis


Numerous methods are used to determine the function of newly sequenced DNA or proteins. One such method is the sequence similarity search, for example with BLAST. However, due to the speed at which sequences can be produced and the ever-increasing size of the databases against which they are searched, it is becoming progressively more difficult for the scientist to carry out the necessary data analysis manually. Automating the analysis of BLAST results should therefore greatly reduce the amount of labour for the scientist, and so help to accelerate research progress or point to new fields of investigation.

An in-depth study of how the BLAST algorithm works was conducted. Interviews were also used to determine which features of a BLAST result matter to the scientist when deciding whether a particular similarity hit is of importance to their field of research and to function determination. Based on this study, the clumpiness of a match’s alignment was chosen as the focus of this research. With this decided, techniques for quantifying clumpiness were studied and several possible clumpiness measures were proposed.

These measures were then tested against specified criteria in order to assess their suitability as a clumpiness measure. This analysis was first conducted on synthetic data, where the CUSUM measure proved to be the best according to the criteria; it was therefore chosen as the clumpiness measure for the subsequent testing. That testing applied the measure to real BLAST sequence analysis via a prototype, which was used by scientists in their research. In conjunction with this, benchmark datasets containing families with distant relatives were used to assess the clumpiness measure’s ability to identify these distant relatives. Additional testing of the clumpiness measure was performed on a more abstract dataset of events and non-events in a one-dimensional field.
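
The abstract does not define the CUSUM measure itself, so the following is only an illustrative sketch of how a cumulative-sum statistic can score the clustering of events in a one-dimensional field. Everything in it is an assumption made for illustration rather than the thesis's own formulation: the function name `cusum_clumpiness`, the encoding of the field as a sequence of 1 (event, e.g. a matched alignment position) and 0 (non-event), and the normalisation by the largest spread attainable for the same number of events.

```python
from itertools import accumulate
from typing import Sequence


def cusum_clumpiness(events: Sequence[int]) -> float:
    """Hypothetical CUSUM-style clumpiness score in [0, 1].

    `events` is a sequence of 1 (event) and 0 (non-event).
    This is an illustrative statistic, not the measure defined in the thesis.
    """
    n = len(events)
    k = sum(events)
    # With no events, or only events, there is nothing to cluster.
    if k == 0 or k == n:
        return 0.0

    p = k / n  # overall event density
    # Cumulative sum of deviations from the overall density, starting at 0.
    path = [0.0, *accumulate(x - p for x in events)]
    spread = max(path) - min(path)

    # The largest possible spread, k * (n - k) / n, occurs when all k events
    # are contiguous, so dividing by it maps the score into [0, 1].
    return spread / (k * (n - k) / n)


clumped = [1, 1, 1, 1, 0, 0, 0, 0]
spread_out = [1, 0, 1, 0, 1, 0, 1, 0]
print(cusum_clumpiness(clumped), cusum_clumpiness(spread_out))
```

Under these assumptions, a field whose events form a single block scores 1, while evenly spaced events score close to 0, because the cumulative sum never drifts far from zero.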

For both the prototype and the abstract testing, the results showed that the CUSUM clumpiness measure gives a good approximation of the degree of clustering of events within a one-dimensional field. In addition, there is an indication that the measure will be of use in the identification of distant relatives. However, further testing is required to widen the subject base and further validate the measure’s suitability for assisting in the function determination of novel sequences.
Date of Award: Nov 2007
Original language: English
