Manual for GO-Cluster

Software and manual were written by Boris Adryan.
Release date for external use: v2.2; January 2004.
Software copyright Max-Planck-Institute for Biophysical Chemistry, Goettingen, Germany, 2003,2004.

Purpose of the software

The software aims at the visual interpretation of microarray data. GO-Cluster uses the tree structure of the Gene Ontology-database as a framework for numerical clustering, thus allowing a simple visualization of gene expression data at various levels of the ontology tree.

Working with GO-Tree

The main screen is divided into two treeviews, the GO-tree (left) and the Cluster-tree (right). After setup of the analysis, the GO-tree will contain the hierarchical structure of biological terms known from the Gene Ontology-Web site, whereas the Cluster-tree on the right will show the clustering result for a selected GO-term after you have clicked the cluster-button in the tool box. An exemplary view is shown in the following screen shot (Fig. 1).


Figure 1: A typical view of the GO-Cluster screen.

GO-Cluster allows you to browse through the GO-tree, see the genes assigned to each biological term, and view gene expression data in tabular form. Specific terms, genes or microarray probes can be searched. Additional information about a selected term or a selected gene can be retrieved; the GO-ACC (the term's "catalogue number") or the gene identifier will automatically be passed to a Web browser. A GO-term of interest can be selected for hierarchical average distance cluster analysis using Pearson's correlation coefficient. Alternatively, single genes rather than GO-terms can be ticked and selected for clustering. Genes of interest can be selected in the Cluster-tree and will automatically be focussed in the GO-tree. The color-coded panel on the left side of the Cluster-tree represents the regulation of the gene in comparison to the other values in its gene vector ("how that gene is regulated in the different experimental conditions..."). The panel on the right side of the Cluster-tree represents the regulation of the gene in comparison to the other genes in this GO-term for the given experimental condition (e.g., "how abundant that gene is in comparison to that other gene that plays a role in the same pathway..."). In Fig. 1, the gene "bam" is induced in the third experimental condition (it is red; this is the information we learn from the gene vector), but in all three experimental conditions it is expressed less than the average of the other genes in that GO-term (it is green). The selection of cluster only will enable you to save the cluster result as a Windows Bitmap-file or copy it to the clipboard.
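The clustering method named above (hierarchical, average distance, Pearson's correlation coefficient) can be illustrated with a small sketch. This is not GO-Cluster's own code (which is written in Object Pascal); the function names, the use of 1 - r as distance and the gene names in the test data are assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a flat profile counts as uncorrelated
    return cov / (sx * sy)


def average_linkage(profiles):
    """Agglomerative clustering with distance = 1 - Pearson r.

    profiles maps a gene name to its list of expression values.
    Returns the merge tree as nested tuples of gene names.
    """
    # Each cluster is a pair: (tree so far, list of member names).
    clusters = [(name, [name]) for name in profiles]
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }

    def cluster_distance(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [merged]
    return clusters[0][0]


# Made-up expression values for three hypothetical genes.
example = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.5],
    "geneC": [3.0, 2.0, 1.0],
}
tree = average_linkage(example)
```

In this toy data set geneA and geneB rise together and are merged first, while geneC, which falls across the conditions, joins last.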

How to get there?

The theory: In order to set up an analysis, the internal data organisation of GO-Cluster should be roughly understood (Fig. 2). An object in the GO-tree consists of a GO-term, which is usually a biological phrase (e.g. lipid binding), and the GO-ACC, which is a kind of catalogue number of the GO (e.g. 8289). With the help of these catalogue numbers, the relationship between GO-terms is established. A comprehensive explanation of the data representation in the Gene Ontology-database itself can be found on their Web site. GO-Cluster requires a table (Gene Directory-file) that assigns genes to certain GO-terms. This table holds the gene's trivial name (Gene Symbol, e.g. RhoGEF2), the GO-ACCs where the gene should appear, and a unique identifier (Gene Identifier, e.g. FlyBase FBgn0023081). To add more flexibility, a second table (uniqID reference-file) is required to link this Gene Identifier to a unique ID, which refers to a probe on your microarray. The first two tables are delivered in the form of MySQL-exports (see Table 1 for examples); the chip data must be provided in Mike Eisen's Cluster-file format.


Figure 2: Internal data organisation in GO-Cluster.
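The chain described above (GO-ACC to Gene Identifier to uniqID to chip values) can be sketched as a simple join. The function name `build_go_index` and all identifiers and values in the example are hypothetical placeholders, and the sketch says nothing about how GO-Cluster stores this information internally.

```python
from collections import defaultdict


def build_go_index(gene_directory, uniqid_reference, chip_data):
    """Link chip values to GO-ACCs via Gene Identifier and uniqID.

    gene_directory:   iterable of (GeneSymbol, GOAcc, GeneIdentifier)
    uniqid_reference: iterable of (uniqID, GeneIdentifier)
    chip_data:        dict mapping uniqID -> list of expression values
    """
    probes_by_gene = defaultdict(list)
    for uniq_id, gene_id in uniqid_reference:
        probes_by_gene[gene_id].append(uniq_id)

    index = defaultdict(list)
    for symbol, go_acc, gene_id in gene_directory:
        for uniq_id in probes_by_gene.get(gene_id, []):
            # A probe without chip data cannot be shown and is skipped.
            if uniq_id in chip_data:
                index[go_acc].append((symbol, uniq_id, chip_data[uniq_id]))
    return dict(index)


# Hypothetical miniature inputs; none of these IDs are real.
directory = [("geneA", 1234, "ID0001"), ("geneA", 5678, "ID0001"),
             ("geneB", 1234, "ID0002")]
reference = [("probe_1", "ID0001"), ("probe_2", "ID0002")]
chip = {"probe_1": [1.0, 2.0, 3.0]}  # probe_2 is missing on purpose

go_index = build_go_index(directory, reference, chip)
```

Keeping the uniqID reference as a separate table means the same Gene Directory can be reused with a different microarray platform by swapping only the probe mapping.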

Example for the structure of a Gene Directory-file

# Query Results
# Connection: MySQL server
# Host: localhost
# Saved: 2003-09-07 20:31:25
#
# Query:
# SELECT DISTINCT GeneSymbol, GOAcc, GeneIdentifier FROM 'flymar03'
#
'GeneSymbol','GOAcc','GeneIdentifier',
'&agr;-Adaptin','5886','FBgn0015567',
'&agr;-Adaptin','7269','FBgn0015567',
'&agr;-Adaptin','16192','FBgn0015567',
'&agr;-Adaptin','30122','FBgn0015567',
'&agr;-Adaptin','6901','FBgn0015567',
'&agr;-Adaptin','8021','FBgn0015567',
'&agr;-Adaptin','16183','FBgn0015567',
'&agr;-Adaptin','30135','FBgn0015567',
'&agr;-Cat','3779','FBgn0010215',
'&agr;-Cat','7016','FBgn0010215',
'&agr;-Cat','8092','FBgn0010215',

and so on...

Example for the structure of a uniqID-reference file

# Query Results
# Connection: MySQL server
# Host: localhost
# Saved: 2003-09-17 10:38:31
#
# Query:
# SELECT uniqID, GeneIdentifier FROM `probe2gene`
#
'uniqID','GeneIdentifier',
'145501_at','FBgn0031208',
'153565_at','FBgn0002121',
'145502_at','FBgn0031209',
'154367_at','FBgn0028472',
'145507_at','FBgn0031213',
'145508_at','FBgn0031214',
'153716_at','FBgn0031216',
'145510_at','FBgn0031217',
'141332_at','FBgn0026787',
'154909_at','FBgn0005278',

and so on...
Please note that the comment lines (starting with #) are not required. However, the first line of each file, containing the column names (e.g. 'GeneSymbol','GOAcc','GeneIdentifier',), is important. The GOAcc is the GO-specific identifier for each term; in GO-Cluster it is used without the "GO:"-prefix and without the leading zeros. It is also important that each item is quoted in " ' " and that the field values are separated with " , ". Note that there must be no blanks between the last " , " and the line break. Data lines of any other structure than stated above will be ignored by GO-Cluster! If you don't know how to set up the files, please talk to your favorite computer person in your department.
Table 1: Structure of Gene Directory- and uniqID reference-files.
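To pre-check your own files against the rules stated for Table 1, a small script along the following lines may help. It is a sketch under assumptions: the regular expression, the function names and the rule that a field must not itself contain a " ' " are ours, and GO-Cluster's actual parser may accept or reject lines differently.

```python
import re

# One quoted field followed by a comma, e.g. 'FBgn0015567',
FIELD = re.compile(r"'([^']*)',")
HEADER = ["GeneSymbol", "GOAcc", "GeneIdentifier"]


def parse_line(line, n_fields):
    """Return the fields of one data line, or None if it is invalid."""
    line = line.rstrip("\r\n")
    if line.startswith("#"):
        return None  # comment line
    fields = FIELD.findall(line)
    # The quoted fields must account for the whole line, so a blank
    # after the last comma or between fields makes the line invalid.
    rebuilt = "".join("'" + f + "'," for f in fields)
    if len(fields) != n_fields or rebuilt != line:
        return None
    return fields


def normalize_go_acc(go_id):
    """Turn 'GO:0008289' into 8289 (no prefix, no leading zeros)."""
    if go_id.startswith("GO:"):
        go_id = go_id[3:]
    return int(go_id)


def parse_gene_directory(lines):
    """Collect (GeneSymbol, GOAcc, GeneIdentifier) from valid lines."""
    rows = []
    for line in lines:
        fields = parse_line(line, 3)
        if fields is None or fields == HEADER:
            continue
        symbol, go_acc, gene_id = fields
        if not go_acc.isdigit():
            continue  # GOAcc must be a plain integer
        rows.append((symbol, int(go_acc), gene_id))
    return rows


sample = [
    "# Query Results\n",
    "'GeneSymbol','GOAcc','GeneIdentifier',\n",
    "'&agr;-Cat','3779','FBgn0010215',\n",
    "'&agr;-Cat','GO:0007016','FBgn0010215',\n",  # prefix not removed
    "'&agr;-Cat','8092','FBgn0010215', \n",       # trailing blank
]
rows = parse_gene_directory(sample)
```

Only the third sample line survives; the fourth still carries the "GO:"-prefix and the fifth has a blank before the line break.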

The exemplary data shown in the table is from our userfiles for Drosophila melanogaster. The GO-ACC numbers and the GeneIdentifiers were extracted from a CSV-file obtained from Affymetrix' NetAffx, the GeneSymbol was added from the model system-specific file from the Gene Ontology-Web site. Mapping information from uniqID (here: Affymetrix probe set) to GeneIdentifier was obtained from file AffymetrixDrosGenome1Release3_1_genes.cdf available at FlyBase. If you are not familiar with Perl or any other software for convenient extraction of such data for your model system, you may want to have a look at ChipInfo from the Wong laboratory at Harvard University.

The practice: Once you have all of the above files ready, start GO-Cluster and select Setup (F3) from the Program menu. The following dialog (Fig. 3) will guide you 1, 2, 3, ... through the setup process.


Figure 3: The Setup dialog.

First select a Gene Ontology-tree file (*.GOT, as provided with the software) and load it. Then select a valid Gene Directory-file and a valid uniqID-reference file, then click on load. Loading may take between ten seconds and two minutes, depending on your system. You will then be asked whether all terms with no gene assignments should be deleted from the tree or not. Last, select a data file with your microarray results and load it. Again, this may take a while depending on how many datasets you want to import. If everything is fine, the OK-button will be enabled and you can start your analysis. If at any stage of the setup process the software is unhappy with your files, you will not be allowed to continue with the next step and will hopefully get a somewhat meaningful error message.

Note for experienced users with their own MySQL-infrastructure: You can build the GO-tree, the Gene Directory- and the UniqID-reference tables using your own server. Specify a host and the databases holding the information. The GO-tree is built from the standard table names as provided by the Gene Ontology-Web site. The Gene Directory can be stored in a table with any name, but it must contain the following mandatory fields: GeneSymbol (string), GOAcc (int), GeneIdentifier (string). The uniqID-reference table requires the fields uniqID (string) and GeneIdentifier (string).

Troubleshooting: We understand that setting up GO-Cluster is quite a challenging task for the first-time user. The files provided by the user (Gene Directory-file, uniqID-reference file and microarray data) appear to be the main source of problems. As an example, minimal differences such as "145501_at" as UniqID in the reference file and "145501_at " (note the invisible trailing character!) as UniqID for the feature on the chip within the microarray data will cause GO-Cluster to drop this piece of data. We are aware of this and offer a "hidden function" to debug your user files. After starting GO-Cluster, press CTRL+ALT+D to make the Debug panel visible. Then start the Setup process as usual and move the dialog so that you can see the Debug panel (Fig. 4).


Figure 4: The Debug panel (CTRL+ALT+D) next to the Setup dialog.

Continue with the setup process as usual. After you have specified the Gene Directory-file and the UniqID-reference file, click on load. First, GO-Cluster will read all data lines from your files and parse through this information. As soon as this step is finished, you will see information in the upper grid of the Debug panel.
  • Lines from the Gene Directory-file that are invalid are ignored by the internal database. They are displayed in a text field between the upper and the middle grid. After closing the Setup dialog, check in which structural detail they differ from 'GeneSymbol','1234','GeneIdentifier', . The debug output for the UniqID-reference file will be analogous, but between the middle and the lower grid.
  • A good hint that you have created a reasonable data file is when the program quickly browses through the grid, thus letting the scrollbar on the right jump vigorously. This is the moment when your gene and gene identifier information is linked to the appropriate GO-terms. If this doesn't occur, have a look at the GOAcc within your data file. Most likely, your GOAcc cannot be interpreted as a valid integer number. In that case, the upper and the middle grid will most likely be empty, although you will be allowed to continue with the setup process.
  • If the Gene Reference-file is okay, at this point you should see something like Fig. 4, but no information is visible in the lower grid of the Debug panel. GO-Cluster will also display a message with the number of times that a gene could be assigned to the GO tree. If that number is 0, it is most likely that the parser couldn't recognize your Gene Reference-file. Check that all necessary " ' " and " , " are present, including the " , " in front of each line break, and that there are no blanks between fields!
Now load the microarray data file. Once the file is in RAM, you will see the scrollbar of the middle grid jump as the upper one did before. That is the moment when GO-Cluster recognizes a UniqID in your chip result and assigns it to a GeneIdentifier that is already in the GO-tree. Finally, you will see the last feature that was assigned to the GO-tree and its corresponding values (denoted as 0 to number_of_experiments-1).
  • The program will display a message with the number of genes that could be recognized from your microarray results and linked to a gene in the GO tree. If that number is 0, there is most likely a difference between the UniqIDs in the UniqID-reference file and the microarray result file.
  • Click the OK-button and close the Setup dialog. You may now click on any of the items in the middle grid. If there is a valid chip result in the microarray data file, it will be displayed simultaneously in the lowest grid. Check whether the items in the grid are marked alone or whether the selection also includes a potentially invisible character at the beginning or the end of the item! This is especially important for the GeneIdentifier in the upper and the middle grid, and for the UniqID in the middle and the lower grid.
After you have finished your debug session, we recommend quitting the application and restarting it. If you still can't get your data files to work with GO-Cluster, or don't even see any of the information described in the Troubleshooting section, please drop us a message and maybe we can work something out.
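The invisible-character problem described in this section can also be checked outside GO-Cluster before running Setup. The following is a minimal sketch; the function name `find_suspicious_ids` is hypothetical, and the IDs in the example are taken from the illustration above plus one made-up ID.

```python
def find_suspicious_ids(reference_ids, chip_ids):
    """List chip IDs that match a reference ID only after stripping.

    The IDs are returned via repr() so that leading or trailing
    blanks and tabs become visible in the output.
    """
    reference = set(reference_ids)
    stripped_reference = {r.strip() for r in reference}
    report = []
    for chip_id in chip_ids:
        if chip_id in reference:
            continue  # exact match, nothing to report
        if chip_id.strip() in stripped_reference:
            report.append(repr(chip_id))
    return report


reference_ids = ["145501_at", "153565_at"]
chip_ids = ["145501_at ", "153565_at", "999999_at"]  # last ID is made up
suspicious = find_suspicious_ids(reference_ids, chip_ids)
```

IDs that do not match even after stripping (such as the made-up last one) are not reported here, since they are simply absent from the reference file rather than damaged.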

Other functions in GO-Cluster

GO-Tree main window functions
Use the mouse or the arrow-/PgUp-/PgDown-keys to move through the trees. The whole tree can be expanded by pressing ctrl+shift++ and collapsed by pressing ctrl+shift+-. In the GO-treeview, genes are selected either by space or by mouse-click. In the Cluster-treeview, selected genes are automatically searched and focussed in the GO-treeview.

Tool box functions
Cluster: Performs cluster analysis on a selected term or a selected set of genes. Clustering can be rather fast when a term is selected; for a set of genes the software must browse through all of the GO-tree, which may take a while.
Get info: Opens an external Web browser and loads information on the focussed term or gene (see Options for additional details).
Find first/next: Finds the phrase that you have entered in the text field. The search is either exact or not, but always case-sensitive. Select the table column in which you would like to search.

Menu functions

Program
Setup: Setup the software for an analysis.
Exit: The usual.

Options
GO-base URL: sets the base URL, so that GO-Tree can load term-specific information from the resource in the following structure: base-URL+GO_acc (e.g. http://godatabase.org/cgi-bin/go.cgi?query=GO:0019992)
Gene-base URL: sets the base URL, so that GO-tree can load gene-specific information from the resource in the following structure: base-URL+gene identifier (e.g. http://www.flybase.org/.bin/fbidq.html?FBgn0003079)
Table color: sets a factor StdFactor for marking apparently differentially regulated genes. A gene value is marked as potentially differentially regulated if |mean(gene vector) - (gene value)| > |SD(gene vector) x StdFactor|.
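The three options above can be written out as a short sketch. Several points are assumptions on our part: that the base URL ends right before the "GO:"-prefix, that the GO-ACC is padded back to seven digits as in the example URL, and that SD means the sample standard deviation. The function names and the expression values are hypothetical.

```python
from statistics import mean, stdev


def go_url(base_url, go_acc):
    """base-URL + GO_acc, with prefix and leading zeros restored."""
    return f"{base_url}GO:{go_acc:07d}"


def gene_url(base_url, gene_identifier):
    """base-URL + gene identifier."""
    return f"{base_url}{gene_identifier}"


def is_marked(gene_vector, gene_value, std_factor):
    """|mean(gene vector) - gene value| > |SD(gene vector) x StdFactor|."""
    deviation = abs(mean(gene_vector) - gene_value)
    threshold = abs(stdev(gene_vector) * std_factor)  # sample SD assumed
    return deviation > threshold


term_link = go_url("http://godatabase.org/cgi-bin/go.cgi?query=", 19992)
gene_link = gene_url("http://www.flybase.org/.bin/fbidq.html?", "FBgn0003079")

# Made-up gene vector: only the last condition stands out.
vector = [10.0, 12.0, 11.0, 30.0]
flags = [is_marked(vector, value, 1.0) for value in vector]
```

A larger StdFactor raises the threshold, so fewer values are marked.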
 
About
Information on software and copyright.

Appendix

A) Mike Eisen's Cluster&Treeview-file format
This file format is a de facto-standard for the work with microarray data. It was first used by the very well-known (and maybe best) freely available clustering tools for microarray data, namely Cluster&Treeview from Mike Eisen at Stanford University. Both programs come with a very informative manual, from which I will freely quote here. For additional information, go to the Eisen lab Web site and read the manual.

Basically, Cluster&Treeview-files are "tab-delimited text files in a particular format. Such tab-delimited text files can be created and exported in any standard spreadsheet program, such as Microsoft Excel. [..] By convention, in Cluster input tables rows represent genes and columns represent samples or observations (e.g. a single microarray hybridization)." For a simple set of experiments, a "minimal" Cluster-input file would look like this (Table 2):

UniqID       wild-type    condition 1    condition 2
142126_at    32432,3      5644,4         6546,4
142127_at    342,5        345,5          865,4
142128_at    45366,7      74544,5        45355,6
Table 2: A "minimal" Cluster-file.

"Each row (gene) has an identifier (in green) that always goes in the first column." Here, we are using Affymetrix probe set identifiers. "Each column (sample) has a label (in blue) that is always in the first row; here the labels describe the experiment." The first column of the first row contains a special keyword (in red) that tells the program what kind of objects are in each row. It is mandatory.
Note for experienced users: "Maximal" Cluster&Treeview-files can also be read by GO-Cluster. However, the fields Name, GWeight, GOrder and EOrder will be ignored, since GO-Cluster only offers a very simple clustering approach.
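Reading such a file can be sketched as follows. The handling shown here is an assumption, not a description of GO-Cluster's importer: the sketch treats decimal commas as in Table 2, skips the columns NAME, GWEIGHT and GORDER and the rows EWEIGHT and EORDER, and stores empty cells as None. The function name `read_cluster_file` is hypothetical.

```python
import csv
import io

IGNORED_COLUMNS = {"NAME", "GWEIGHT", "GORDER"}
IGNORED_ROWS = {"EWEIGHT", "EORDER"}


def read_cluster_file(handle):
    """Read a tab-delimited Cluster-file into (conditions, data).

    data maps each row identifier to a list of floats, one per
    condition; empty cells become None.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    keep = [
        i for i, label in enumerate(header)
        if i > 0 and label.strip().upper() not in IGNORED_COLUMNS
    ]
    conditions = [header[i] for i in keep]
    data = {}
    for row in reader:
        if not row or row[0].strip().upper() in IGNORED_ROWS:
            continue
        values = []
        for i in keep:
            cell = row[i].strip() if i < len(row) else ""
            # Assumption: a decimal comma is converted to a decimal point.
            values.append(float(cell.replace(",", ".")) if cell else None)
        # The identifier is kept exactly as written, blanks included.
        data[row[0]] = values
    return conditions, data


text = (
    "UniqID\tNAME\twild-type\tcondition 1\tcondition 2\n"
    "EWEIGHT\t\t1\t1\t1\n"
    "142126_at\tsome gene\t32432,3\t5644,4\t6546,4\n"
    "142127_at\tother gene\t342,5\t345,5\t\n"
)
conditions, data = read_cluster_file(io.StringIO(text))
```

The row identifier is deliberately not stripped, so a trailing blank in a UniqID stays visible as a mismatch instead of being silently repaired.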

Further important differences between Cluster&Treeview and GO-Cluster in the interpretation of the input file:

B) Bibliography
If you just found this manual by accident and don't know what microarrays and clustering are about: go to Mike Eisen's Web site, which provides lots of good links and information on both fields.

C) Interesting stuff
GO-Cluster was developed because I found traditional numerical clustering somewhat counterintuitive and not always easy to interpret as biological information. The approach itself had already been discussed, but no free software was available. The program was developed in parallel to my bench work during my PhD-studies. It is written in Borland Object Pascal using Delphi 6 and has about 25 pages of source code. I use external components for visualisation (TVirtualTreeView by Mike Lischke) and data import with regular expressions (TRegExpr by Andrey V. Sorokin). GO-Cluster is connected to a local MySQL-server using the WinZeos library. If you would like to adapt GO-Cluster to your personal needs and want the source code (only fairly commented), please contact my group leader Dr. Reinhard Schuh.