ATTED-II

ver.11.1

Last update: Jun. 18, 2020

Supportability

(1) Purpose

Although the strength of coexpression is represented by the MR value, coexpression may be an artifact of a particular platform. A measure called supportability is introduced to quantify the reproducibility of a coexpressed gene list of interest.


History on ATTED-II

Supportability was previously called reliability. With the refinement in September 2015, we renamed reliability to supportability for the following two reasons.

  • We are trying to evaluate the quality of coexpression data from various aspects. The term reliability is not suitable for naming just one of these measures.
  • Coexpression supported by another platform is reliable, but the converse is not true. Namely, coexpression without any support can occur when an appropriate reference is not available.

Period | Term | Calculation of coex list similarity | Null distribution | Threshold ☆ | Threshold ☆☆ | Threshold ☆☆☆
2013-08 ~ 2015-08 | Reliability | COXSIM(100) | Common for all platforms, virtually including 10000 genes | E-04 | E-12 | E-20
2015-09 ~ now | Supportability | COXSIM(1%) | Empirical COXSIM distributions between an Arabidopsis (Ath-r) gene and a rice (Osa-r) gene, irrespective of gene homology | E-01 | E-02 | E-03



(2) Basic idea

When a coexpressed gene list is repeatedly observed in independent platforms, it can be regarded as reliable.

There are two possible ways to compare coexpression for reliability assessment. One is comparison of gene pairs (A), and the other is comparison of gene lists (B).



We employ the B-type (gene list) comparison because pseudo-coexpression is mainly caused by inappropriate probes with weak hybridization or cross-hybridization, and thus it appears not in a single gene pair but in all gene pairs involving the problematic guide gene.




(3) Degree of coincidence of two coexpressed gene lists

Basic idea

We introduced a similarity measure, COXSIM, which is the weighted concordance rate between the coexpressed gene list from a guide gene g of interest (listg) and that from a reference guide gene r (listr). COXSIM is a function of guide gene g, reference guide gene r and threshold k.

In the definition of COXSIM(g, r, k), n(i, listg, listr) is the number of common genes (orthologous genes when the platforms are from different species) found in the top i genes of the two coexpressed gene lists.
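The formula itself is not reproduced in this text. As a rough illustration only, the Python sketch below assumes one plausible reading of "weighted concordance rate": the average over i = 1..k of n(i, listg, listr) / i. The function name coxsim, this exact weighting, and the toy gene IDs are assumptions made for illustration. For a cross-species comparison, the reference list is assumed to be already translated into orthologous gene IDs.

```python
def coxsim(list_g, list_r, k):
    """Assumed form (not confirmed by the text): mean over i = 1..k of n(i) / i."""
    total = 0.0
    for i in range(1, k + 1):
        # n(i): number of genes shared by the top i entries of both ranked lists
        n_i = len(set(list_g[:i]) & set(list_r[:i]))
        total += n_i / i
    return total / k


# Toy ranked lists (hypothetical gene IDs)
list_g = ["A", "B", "C", "D"]
list_r = ["B", "A", "E", "C"]

score = coxsim(list_g, list_r, k=4)
identical = coxsim(list_g, list_g, k=4)
disjoint = coxsim(list_g, ["W", "X", "Y", "Z"], k=4)
```

Under this assumed form, identical lists score 1 and disjoint lists score 0. Agreement near the top of the lists counts more, because a gene matched at a high rank contributes to every later term of the sum.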



Excluding orphan genes for the gene list comparison

Some genes in the platform for listg do not have corresponding genes in the platform for listr. When such genes appear at high ranks in listg, the coincidence of the two lists decreases. To avoid this effect, genes that lack corresponding genes in the reference platform are excluded from listg, leaving listg→r. In the same way, genes in listr that lack corresponding genes in the platform for listg are excluded, resulting in listr→g. Subsequently, we compare the top k coexpressed genes in listg→r with the reference gene list, listr→g.
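The filtering step can be sketched in Python as follows. This is a minimal illustration, not the site's implementation: the helper name exclude_orphans, the one-to-one ortholog map, and the gene IDs are all hypothetical.

```python
def exclude_orphans(ranked_list, genes_with_counterpart):
    """Drop genes without a counterpart in the other platform, keeping rank order."""
    return [gene for gene in ranked_list if gene in genes_with_counterpart]


# Hypothetical ranked coexpression lists for guide gene g and reference gene r
list_g = ["g1", "g2", "g3", "g4", "g5"]
list_r = ["r3", "r9", "r1", "r5"]

# Hypothetical one-to-one correspondence between the two platforms
orthologs = {"g1": "r1", "g3": "r3", "g5": "r5"}

list_g_to_r = exclude_orphans(list_g, set(orthologs))           # listg→r
list_r_to_g = exclude_orphans(list_r, set(orthologs.values()))  # listr→g
```

Here g2 and g4 are removed from listg and r9 from listr, so the remaining genes move up in rank before the top k are compared.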


Selection of k

We set k to 1% of the number of genes in listg→r.

We previously used k = 100, meaning that we checked the gene correspondence of the top 100 coexpressed genes, in accordance with the default representation of a coexpressed gene list on COXPRESdb. However, a common threshold for all platforms results in different stringencies across platforms. For example, the Sce platform for S. cerevisiae has 4,461 genes for coexpression analysis, whereas the Hsa platform for human has 19,803 genes. In the former, a particular gene is four to five times more likely to fall within the top k ranks by chance, and thus the significance of the coincidence of the gene lists is overestimated. Therefore, we changed the number of genes examined from a fixed top 100 to the top 1% of all genes in listg→r.
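The arithmetic behind this argument can be checked with a short Python sketch. The platform sizes come from the text; using them in place of the size of listg→r, and rounding 1% to the nearest integer, are simplifying assumptions made for illustration.

```python
n_sce = 4461   # genes in the Sce platform (from the text)
n_hsa = 19803  # genes in the Hsa platform (from the text)

# Fixed threshold: chance that a particular gene lands in the top 100
k_fixed = 100
p_sce_fixed = k_fixed / n_sce
p_hsa_fixed = k_fixed / n_hsa
ratio_fixed = p_sce_fixed / p_hsa_fixed  # about 4.4


def k_one_percent(n_genes):
    # Rounding to the nearest integer is an assumption; the text only says "1%"
    return max(1, round(0.01 * n_genes))


# Proportional threshold: the chance is close to 1% on both platforms
k_sce = k_one_percent(n_sce)
k_hsa = k_one_percent(n_hsa)
ratio_scaled = (k_sce / n_sce) / (k_hsa / n_hsa)  # close to 1
```

With a fixed k the two platforms differ by a factor of about 4.4, whereas with k scaled to 1% the chance level is nearly the same for both.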


(4) Selection of the most appropriate reference guide gene / gene list

Since the best reference guide gene is unknown, we check all possible reference guide genes. The reference guide gene set R is composed of all available orthologous genes in other species. When multiple platforms are available for the species of the guide gene g, the same gene in the other platforms is also included in R. The COXSIM values are calculated between the target guide gene g and every reference gene r in R. The reference gene rmax that gives the maximum COXSIM value is regarded as the best reference guide gene.
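Selecting rmax amounts to an argmax over the reference set R, which the following Python sketch illustrates. The coxsim function uses an assumed form of the measure (mean of n(i)/i over the top k ranks), and best_reference, the reference IDs and the toy lists are hypothetical names introduced here.

```python
def coxsim(list_g, list_r, k):
    # Assumed form of COXSIM: mean over i = 1..k of n(i) / i
    return sum(
        len(set(list_g[:i]) & set(list_r[:i])) / i for i in range(1, k + 1)
    ) / k


def best_reference(list_g, reference_lists, k):
    """Return (r_max, maxCOXSIM) over all reference guide genes in R."""
    scores = {r: coxsim(list_g, ranked, k) for r, ranked in reference_lists.items()}
    r_max = max(scores, key=scores.get)
    return r_max, scores[r_max]


# Toy data: reference lists are assumed to be already mapped to g's gene IDs
list_g = ["A", "B", "C", "D"]
reference_lists = {
    "r1": ["X", "Y", "A", "Z"],
    "r2": ["A", "B", "D", "C"],
    "r3": ["B", "X", "C", "Y"],
}
r_max, max_coxsim = best_reference(list_g, reference_lists, k=4)
```

In this toy example r2 is selected, since its list agrees with list_g at the top ranks.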


(5) Calculation of p-value

To assess statistical significance, the maxCOXSIM value (the maximum COXSIM value, obtained with rmax) is compared with the null distribution generated for the same number of genes in listg→r.
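One common way to turn such a comparison into a p-value is an empirical tail fraction, sketched below in Python. The text does not state the exact estimator, so the function name empirical_p_value and the add-one correction (which keeps the p-value above zero) are assumptions. The null values are taken as given rather than generated here.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null values at least as large as the observed one.

    The +1 in numerator and denominator is a conventional correction,
    assumed here rather than taken from the text.
    """
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Toy null distribution of maxCOXSIM values (hypothetical numbers)
null_values = [0.1, 0.2, 0.3, 0.4]

p_mid = empirical_p_value(0.35, null_values)
p_top = empirical_p_value(0.5, null_values)
```

With only four null values the smallest reachable p-value is 0.2, so a real null distribution has to be large enough to resolve the thresholds of interest.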



(6) Discretization of maxCOXSIM significance

On ATTED-II, the significance level, which we call the supportability, is shown as a number of stars according to the following p-value thresholds.

p-value threshold | Representation
1E-01 | ☆
1E-02 | ☆☆
1E-03 | ☆☆☆
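The discretization can be written as a small lookup, sketched here in Python. The function name stars is hypothetical, and treating each threshold as inclusive (p ≤ threshold) is an assumption, since the text does not say how boundary values are handled.

```python
def stars(p_value):
    """Map a p-value to the star representation (boundaries assumed inclusive)."""
    if p_value <= 1e-3:
        return "☆☆☆"
    if p_value <= 1e-2:
        return "☆☆"
    if p_value <= 1e-1:
        return "☆"
    return ""  # no star


examples = {p: stars(p) for p in (0.5, 0.05, 0.005, 0.0005)}
```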



(7) Result

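Platform | ☆☆☆ | ☆☆ | ☆ | No star | No reference | Total
Ath-m | 4080 | 8014 | 5619 | 2063 | 1060 | 20836
Ath-r | 3977 | 8377 | 6834 | 5309 | 799 | 25296
Bra-r | 545 | 4737 | 9697 | 17620 | 2832 | 35431
Gma-m | 1107 | 3815 | 6863 | 3710 | 407 | 15902
Gma-r | 1675 | 6803 | 12919 | 18211 | 3179 | 42787
Osa-m | 1709 | 5239 | 6793 | 5391 | 1493 | 20625
Osa-r | 1470 | 4674 | 6478 | 4526 | 400 | 17548
Ppo-m | 465 | 2267 | 4844 | 8202 | 6131 | 21909
Sly-m | 257 | 1229 | 2323 | 1880 | 97 | 5786
Sly-r | 654 | 2824 | 5657 | 12358 | 1702 | 23195
Vvi-m | 247 | 1118 | 2756 | 5194 | 249 | 9564
Zma-m | 1052 | 3455 | 3617 | 2354 | 591 | 11069
Zma-r | 1408 | 5112 | 7420 | 7298 | 1354 | 22592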

Limitation

To use supportability as a measure of reliability, the comparison should be made between species that share the same genetic system for the function of interest. Namely, a comparison between two platforms in the same species or in closely related species is required. The table shows the distribution of supportability levels in each platform, indicating that Arabidopsis, soybean, tomato, rice and maize have higher ratios of three stars. This does not directly mean that the absolute quality of these coexpression data is high; these species simply have ideal reference platforms.
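As a quick check of the ratios mentioned above, the Python sketch below computes the three-star fraction for a few platforms. The counts are taken from the result table in (7), reading its first star column as ☆☆☆, which is an interpretation of the table layout rather than something the text states explicitly.

```python
# (three-star genes, total genes) per platform, taken from the result table
counts = {
    "Ath-m": (4080, 20836),
    "Osa-r": (1470, 17548),
    "Bra-r": (545, 35431),
    "Ppo-m": (465, 21909),
}

# Fraction of genes with three stars in each platform
ratios = {platform: three / total for platform, (three, total) in counts.items()}

for platform, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{platform}: {ratio:.1%}")
```

Under this reading, Arabidopsis and rice show a clearly higher three-star fraction than Bra-r or Ppo-m, which matches the pattern described in the paragraph.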