universal_grass_peps
Data and Resources
Data is not yet available. Contact the author for more information.
Cite this as
Name | ORCID | Affiliation |
---|
This dataset is the output of a bioinformatics pipeline developed by Rowan Mitchell between 2018 and 2024 that seeks to identify all universal protein-coding genes in grasses and to estimate how specific they are to grasses. The dataset has 5 components: (1) universal_grass_peps.xlsx contains summary information on all the universal groups of peps identified. (2) Files in genBlastG/* are genome annotation files, one for each novel gene model generated by the genBlastG step of the pipeline. (3) Files hmms/*.msa.fa are the multiple sequence alignments in FASTA format, one for each group. (4) Files hmms/final_db.hmms* allow the database to be searched with query sequences using the HMMER package. (5) Files in lookup/* allow users to find which groups a grass query pep ID is a member of, or associated with, for 16 different grass species.
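As an illustration of how component (4) might be used, the sketch below parses the tabular output of a HMMER search and reports the best-scoring group for each query sequence. It assumes the search was run with something like `hmmscan --tblout hits.tbl hmms/final_db.hmms query.fa`, that the `final_db.hmms*` files form a pressed HMMER profile database, and that profile names correspond to group identifiers. None of this is confirmed by the record. The file name `hits.tbl` and the helper names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    group: str     # target profile name (column 1 of --tblout output)
    query: str     # query sequence name (column 3)
    evalue: float  # full-sequence E-value (column 5)
    score: float   # full-sequence bit score (column 6)


def parse_tblout(text: str) -> List[Hit]:
    """Parse HMMER --tblout text, skipping comment and blank lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.append(
            Hit(
                group=fields[0],
                query=fields[2],
                evalue=float(fields[4]),
                score=float(fields[5]),
            )
        )
    return hits


def best_hit_per_query(hits: List[Hit]) -> Dict[str, Hit]:
    """Keep the highest-scoring hit for each query sequence."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.query not in best or hit.score > best[hit.query].score:
            best[hit.query] = hit
    return best
```

A typical use would be to read `hits.tbl` with `open(...).read()`, pass the text to `parse_tblout`, and inspect the result of `best_hit_per_query` to see which group each query pep most closely matches.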
A bioinformatics pipeline was developed to identify highly conserved universal grass genes using 16 full grass genomes in Ensembl Plants release 56. The first steps used existing gene models to generate groups of grass orthologs of rice and maize genes that are present in most grass species, and refined the membership of these groups so as to optimise the Hidden Markov Model (HMM) profile score from the HMMER package. These groups were then supplemented with new gene models found in grass genomes using the genBlastG tool; this step more than doubled the number of universal groups, giving 12,609 highly conserved, universal groups. Specificity of these groups was assessed using the closest matching gene models from non-monocot species. Possible cut-off values were tested with sets of genes expected to have either a function common to all plants or a commelinid- or grass-specific function. A specificity metric based on the HMM score from grass group profiles performed better than % identity at discriminating between the specific and common function sets. Using an appropriate cut-off for this metric, 5,973 of the groups were identified as specific to monocots, of which 66% appeared to be grass-specific.
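The record does not give the formula for the specificity metric or the cut-off value used. Purely to illustrate the idea of an HMM-score-based cut-off, the sketch below uses a hypothetical metric: one minus the ratio of the best non-monocot score to a grass reference score, both scored against the same group profile. The function names, the formula, and the default cut-off of 0.5 are assumptions, not the author's method.

```python
def specificity(grass_score: float, outgroup_score: float) -> float:
    """Hypothetical specificity metric in [0, 1].

    grass_score: HMM bit score of a grass member against its group profile.
    outgroup_score: HMM bit score of the closest non-monocot gene model
        against the same profile.
    A value near 1 means the outgroup sequence matches the profile poorly.
    """
    if grass_score <= 0:
        raise ValueError("grass_score must be positive")
    value = 1.0 - outgroup_score / grass_score
    # Clip so that outgroup scores above the grass score give 0, not a negative.
    return min(1.0, max(0.0, value))


def is_specific(grass_score: float, outgroup_score: float,
                cutoff: float = 0.5) -> bool:
    """Classify a group as specific if the metric reaches the cut-off.

    The default cut-off is an arbitrary placeholder for illustration.
    """
    return specificity(grass_score, outgroup_score) >= cutoff
```

In this sketch, a group whose closest non-monocot match scores a quarter of the grass score would have a specificity of 0.75 and would be classed as specific at a cut-off of 0.5.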
Award Number | Award Title | Funder Name |
---|
Private Information | |
---|---|
Responsible Person | Rowan Mitchell |
Research Infrastructure Used | |
Data Locations | Rothamsted Research shared drives |
Associated Notebooks | |
Experiment Code Type | |
Experiment Code | |
Withdrawal Reason |