Annotation has historically been developed for human consumption rather than for computational interpretation.

Essentially, this means textual annotations are largely composed of free-text English. Although studies of textual annotation are limited, several explore ways to model the propagation of structural annotation errors. Structural annotation sits between nucleotide sequences and textual annotations; it identifies genomic elements, such as open reading frames, for a given sequence. This is similar to textual annotation, in that structural annotation often makes use of sequence data and can be manually or automatically curated. These studies highlight a number of causes of structural annotation error, such as mis-identification of homology, omissions or typographical mistakes, concluding that annotation accuracy declines as database size increases. Further studies attempt to estimate the error rates in structural annotation, including an estimated error rate of between 28% and 30% in GOSeqLite and between 33% and 43% in UniProtKB/Swiss-Prot. It is therefore highly plausible that these errors affect textual annotation, as acknowledged by Gilks et al.

We hypothesise that sentence reuse is prominent within textual annotations and that a lack of formal provenance has led to inaccuracies in the annotation space. Within this paper we aim: to quantify sentence reuse; to investigate patterns of reuse and provenance through a novel visualisation technique; and to investigate whether patterns of propagation can be used to identify erroneous, inconsistent or low-confidence textual annotations.

A typical sentence within our work is one which contains a group of words and is terminated with a full stop. However, there are a number of exceptions to this basic rule, such as abbreviations, which are especially commonplace in the biomedical domain. The vast majority of these are handled correctly by LingPipe.

We have shown that sentence reuse in UniProtKB is both common and increasing.
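The abbreviation problem described above can be illustrated with a minimal sketch. This is not the LingPipe model, which uses a trained sentence-boundary detector; the fixed abbreviation list and the `split_sentences` helper below are hypothetical, chosen only to show why naively splitting on full stops fails on biomedical text.

```python
import re

# Naive rule: break on a full stop followed by whitespace and a capital letter.
NAIVE = re.compile(r'(?<=\.)\s+(?=[A-Z])')

# Hypothetical abbreviation list for illustration only; a real detector
# (e.g. LingPipe's) learns boundaries rather than using a fixed set.
ABBREVIATIONS = {"E.", "sp.", "et al.", "approx."}

def split_sentences(text):
    """Split on full stops, re-joining fragments whose predecessor
    ends in a known abbreviation (a spurious boundary)."""
    sentences = []
    for frag in NAIVE.split(text):
        if sentences and any(sentences[-1].endswith(a) for a in ABBREVIATIONS):
            sentences[-1] += " " + frag  # glue back onto the previous fragment
        else:
            sentences.append(frag)
    return sentences

print(split_sentences("See Smith et al. The protein binds DNA."))
# → ['See Smith et al. The protein binds DNA.']
```

The naive regex alone would split this example into two sentences at "et al."; the abbreviation check recovers the intended single sentence.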
Therefore, given the scale of this data, we explore the usage of visualisation. By visualising sentence reuse across entries and over time, we may be able to better understand annotation propagation and infer provenance. From this, it is possible that patterns demonstrating interesting traits in the underlying data may emerge and be identified. Therefore, we wish to explore how we can visualise this data and ask: how can we clearly represent the flow of annotation through the database?

A number of approaches to visualising large datasets were considered. One such approach is to model the relationship between sentences and entries as a graph. Using a tool, such as Cytoscape, we can easily model sentences occurring within entries. However, our experience with this approach suggests that it is troublesome to model change over time, and manual intervention is often required to ensure nodes are organised in a correct and meaningful manner. Other similar approaches, such as Sankey diagrams, were not utilised as we cannot determine the exact source and flow of an annotation between each individual entry. One approach which produces a visualisation similar to our requirements is the history flow tool. This tool was developed to allow visualisation of relationships between multiple versions of a wiki. It aims to clearly depict the change in sentences, and their order, in a document over time, with the ability to attribute each change to a given author. The authors demonstrated this visualisation with an exploratory analysis of Wikipedia, revealing complex patterns of cooperation and conflict between Wikipedia authors. However, using the history flow tool to visualise the flow of individual sentences in UniProtKB is not ideal.
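The sentence–entry relationship described above can be sketched as a bipartite mapping per release. This is a toy illustration under stated assumptions: the `releases` data and the `reuse_counts` helper are hypothetical, standing in for data that would in practice be parsed from UniProtKB release files.

```python
from collections import defaultdict

# Hypothetical toy data: for each release, which entries (accessions)
# carry which annotation sentences. Real data would be parsed from
# UniProtKB flat files across releases.
releases = {
    "2010_01": {"P12345": {"Binds DNA."},
                "P67890": {"Binds DNA."}},
    "2011_01": {"P12345": {"Binds DNA."},
                "P67890": {"Binds DNA."},
                "Q11111": {"Binds DNA.", "Located in the nucleus."}},
}

def reuse_counts(releases):
    """For each release, count how many entries carry each sentence.
    Sentences shared by more than one entry are reuse candidates;
    rising counts across releases indicate propagation."""
    counts = {}
    for release in sorted(releases):
        by_sentence = defaultdict(set)
        for accession, sentences in releases[release].items():
            for s in sentences:
                by_sentence[s].add(accession)
        counts[release] = {s: len(accs) for s, accs in by_sentence.items()}
    return counts

print(reuse_counts(releases))
```

A graph tool such as Cytoscape renders one such sentence–entry snapshot well; the difficulty noted in the text is that each release yields a separate graph, so change over time is not naturally captured in a single layout.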
