universal_grass_peps

Data and Resources

Data is not yet available. Contact the author for more information.

This dataset is currently private and is not accessible to anyone outside the organization. To publish this dataset, please send an email to data.stewards@rothamsted.ac.uk.

Cite this as

Retrieved: 03:57 27 Nov 2024 (UTC)
Authors
Name ORCID Affiliation

Abstract

This dataset is the output of a bioinformatics pipeline developed by Rowan Mitchell during 2018-2024 that seeks to identify all universal protein-coding genes in grasses and to estimate how specific they are to grasses. The dataset has 5 components: (1) universal_grass_peps.xlsx contains summary information on all the universal groups of peptides (peps) identified. (2) The files in genBlastG/* are genome annotation files, one for each novel gene model generated by the genBlastG step of the pipeline. (3) The files hmms/*.msa.fa are multiple sequence alignments in FASTA format, one for each group. (4) The files hmms/final_db.hmms* form the profile database, which can be searched with query sequences using the HMMER package. (5) The files in lookup/* allow users to find which groups a grass query peptide ID is a member of, or associated with, for 16 different grass species.
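As an illustration of how component (4) might be queried, the sketch below builds an `hmmscan` command line and parses the tabular output that HMMER writes with `--tblout`. The database path `hmms/final_db.hmms` is inferred from the file pattern above and is an assumption, as are the query file name, the group names in the usage note, and the helper function names; none of these are confirmed by the dataset record.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    group: str     # name of the matching profile (a universal grass group)
    query: str     # ID of the query peptide sequence
    evalue: float  # full-sequence E-value
    score: float   # full-sequence bit score


def hmmscan_command(hmm_db: str, query_fasta: str, tblout: str) -> List[str]:
    """Build the argument list for: hmmscan --tblout <file> <hmmdb> <seqfile>."""
    return ["hmmscan", "--tblout", tblout, hmm_db, query_fasta]


def parse_tblout(text: str) -> List[Hit]:
    """Parse HMMER --tblout text; columns are whitespace-separated."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Columns: 0 target name, 2 query name, 4 E-value, 5 score (full sequence)
        hits.append(Hit(fields[0], fields[2], float(fields[4]), float(fields[5])))
    return hits


def best_hit_per_query(hits: List[Hit]) -> Dict[str, Hit]:
    """Keep the highest-scoring group for each query sequence."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.query not in best or hit.score > best[hit.query].score:
            best[hit.query] = hit
    return best
```

The command list from `hmmscan_command("hmms/final_db.hmms", "query.fa", "hits.tbl")` could be passed to `subprocess.run`, provided HMMER is installed and the database has been indexed with `hmmpress`. The contents of the resulting table file can then be given to `parse_tblout`.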

Methods

A bioinformatics pipeline was developed to identify highly conserved universal grass genes using 16 full grass genomes in Ensembl Plants release 56. The first steps used existing gene models to generate groups of grass orthologs of rice and maize genes that are present in most grass species, and refined the membership of these groups so as to optimise the Hidden Markov Model (HMM) profile score from the HMMER package. These groups were then supplemented with new gene models found in grass genomes using the genBlastG tool; this step more than doubled the number of universal groups, giving 12,609 highly conserved, universal groups. Specificity for these groups was assessed using the closest matching gene models from non-monocot species. Possible cut-off values were tested with sets of genes expected to be either of common function for all plants or of commelinid- or grass-specific function. A specificity metric based on the HMM score from grass group profiles performed better than percent identity at discriminating between the specific and common function sets. Using an appropriate cut-off for this metric, 5,973 of the groups were identified as specific to monocots, of which 66% appeared to be grass specific.
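The cut-off testing described above can be sketched in general terms. The record does not define the specificity metric or say how candidate cut-offs were compared, so the example below rests on two assumptions: higher metric values mean greater specificity, and each cut-off is scored by balanced accuracy on the two reference gene sets. The function names are hypothetical and this is not the pipeline's actual code.

```python
from typing import Iterable, Sequence, Tuple


def balanced_accuracy(specific: Sequence[float],
                      common: Sequence[float],
                      cutoff: float) -> float:
    """Mean of the fractions of each reference set classified correctly.

    Assumption: a value >= cutoff is called 'specific', otherwise 'common'.
    """
    true_specific = sum(v >= cutoff for v in specific) / len(specific)
    true_common = sum(v < cutoff for v in common) / len(common)
    return (true_specific + true_common) / 2


def best_cutoff(specific: Sequence[float],
                common: Sequence[float],
                candidates: Iterable[float]) -> Tuple[float, float]:
    """Return (cutoff, balanced accuracy) for the best-scoring candidate."""
    scored = ((c, balanced_accuracy(specific, common, c)) for c in candidates)
    return max(scored, key=lambda pair: pair[1])
```

Running `best_cutoff` once with metric values derived from HMM scores and once with values derived from percent identity would give a like-for-like comparison of how well each separates the two reference sets.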

Technical Information

Funder Information
Award Number Award Title Funder Name



Private Information
Responsible Person Rowan Mitchell
Research Infrastructure Used
Data Locations Rothamsted Research shared drives
Associated Notebooks

      
Experiment Code Type
Experiment Code
Withdrawal Reason