In light of new knowledge, a previous annotation may now be erroneous

Given that annotations propagate, any update to an original annotation should also be propagated. However, we identify over 8,000 sentences that may remain, or may previously have remained, in the database incorrectly. While these first three patterns are of interest with regard to annotation quality, we next investigate whether the missing root origin pattern can be used as an indication of erroneous annotations.

Current methods for detecting textual annotation provenance and tracking its propagation are somewhat limited. Within this paper we have presented a technique that uses sentences to identify and visualise annotation provenance and propagation. Further, we have provided an analysis of sentence reuse levels and identified a number of annotation patterns that provide an indication of an annotation's quality and correctness.

This work depended upon sentence reuse in UniProtKB. Our analysis shows that reuse is highly prevalent in both Swiss-Prot and TrEMBL. This is because of the curation process employed by UniProt, which consists of six key stages, one of which involves identifying similar entries and standardising annotations between these entries. If two entries from the same gene and species are identified then they are merged. Therefore, sentences are effectively copied between entries as a matter of protocol. This process can see sections, or sometimes whole annotations, from one entry being copied to other entries without change. Whilst the levels of reuse are generally increasing over time, we note a slight decline in sentence reuse for later versions of Swiss-Prot. Although this decline coincides with the change of the UniProtKB release cycle, it appears to be related to a change in annotation policy for Swiss-Prot. After 2010 only sequences with experimental annotation were added to Swiss-Prot; previously, automatically annotated orthologue sequences from complete genomes were often included in Swiss-Prot.

In the face of ever-increasing raw biological data, this reuse is not unexpected. Whilst manual curation is often regarded as the ‘gold standard’, it is a significant bottleneck. For example, in the FlyBase database it can take between two and four months for an article to be manually curated, with consideration recently being given to incorporating sections of automated processing into the curation process. It was for this same reason that UniProtKB introduced TrEMBL in 1996. Whilst reuse is understandably higher within automated methods, it is inevitably going to remain commonplace throughout both automated and manual databases while the quantity of raw biological data being generated continues to increase. Indeed, sentence reuse is an important feature of annotation curation. In addition to the propagation of knowledge, it also allows annotations to become standardised and can be used to enforce levels of quality control.

Whilst these results further highlight the importance of being able to identify the origin of an annotation, the analysis was only achievable because UniProtKB makes available all major historical versions of Swiss-Prot and TrEMBL. Users are typically only interested in the most recent and up-to-date biological data available, but this work highlights the added value and importance of being able to scour archival data; database features such as UniSave should be a requirement rather than a luxury. It was this archival data that allowed provenance and propagation to be analysed, allowing the development of a visualisation technique. These visualisations appear to be useful, as their usage allowed a number of propagation patterns to be identified.
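
The sentence-based idea discussed above can be illustrated with a small sketch. It is not the implementation used in this work. It assumes that each archived release is available as a mapping from entry identifier to annotation text, that a naive punctuation-based splitter is good enough, and that a "missing origin" means a sentence has disappeared from the entry in which it first appeared while still remaining in other entries. The function names, the release labels and the toy data are all hypothetical.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def trace_sentences(releases):
    """Map each sentence to {release: set of entry ids containing it}.

    `releases` is an ordered list of (release_name, {entry_id: annotation_text}).
    """
    history = defaultdict(dict)
    for release, entries in releases:
        for entry_id, text in entries.items():
            for sentence in split_sentences(text):
                history[sentence].setdefault(release, set()).add(entry_id)
    return history


def origin(history, sentence, release_order):
    """Return (first release, entries) where the sentence is first seen, or None."""
    by_release = history.get(sentence, {})
    for release in release_order:
        if release in by_release:
            return release, by_release[release]
    return None


def reuse_level(entries):
    """Fraction of distinct sentences in one release found in more than one entry."""
    seen = defaultdict(set)
    for entry_id, text in entries.items():
        for sentence in split_sentences(text):
            seen[sentence].add(entry_id)
    if not seen:
        return 0.0
    reused = sum(1 for ids in seen.values() if len(ids) > 1)
    return reused / len(seen)


def missing_origin(history, release_order):
    """Sentences still present in the latest release, but in none of the
    entries where they first appeared (our assumed reading of the pattern)."""
    latest = release_order[-1]
    flagged = []
    for sentence, by_release in history.items():
        first = next(r for r in release_order if r in by_release)
        current = by_release.get(latest, set())
        if current and not (by_release[first] & current):
            flagged.append(sentence)
    return sorted(flagged)


# Toy data (hypothetical): three releases, two entries.
releases = [
    ("r1", {"P1": "Binds DNA. Involved in repair."}),
    ("r2", {"P1": "Binds DNA. Involved in repair.", "P2": "Binds DNA."}),
    ("r3", {"P1": "Involved in repair.", "P2": "Binds DNA."}),
]
order = [name for name, _ in releases]
history = trace_sentences(releases)

print(origin(history, "Binds DNA.", order))
print(reuse_level(releases[1][1]))
print(missing_origin(history, order))
```

In the toy data, "Binds DNA." first appears in entry P1, is copied to P2 in the second release, and is then removed from P1 while remaining in P2, so it is the one sentence flagged for review. Exact string matching is the simplest possible choice here; a real analysis would need to decide how to treat near-identical sentences.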
