
Importing Tab-delimited Genome Files

You can import a genome file into GeneSpring if it is a tab-delimited text file. During import, you specify the type of information contained in the key columns of your data file. A minimal data file can contain just one column, which holds the unique systematic name of each gene in the genome.
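As an illustration, a tab-delimited genome file can be produced with any scripting tool. The following Python sketch writes a small table with one line of column titles. The gene entries and the function name are hypothetical placeholders, and the column titles are only examples, since you assign the annotation type for each column during import.

```python
import csv
import io

# Hypothetical example rows: systematic name, common name, description.
ROWS = [
    ("GENE0001", "abcA", "Example description A"),
    ("GENE0002", "abcB", "Example description B"),
]


def write_genome_table(handle, rows,
                       titles=("Systematic Name", "Common Name", "Description")):
    """Write rows as tab-delimited text, preceded by one line of column titles."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(titles)
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open text file instead to save to disk.
buffer = io.StringIO()
write_genome_table(buffer, ROWS)
genome_text = buffer.getvalue()
print(genome_text)
```

A single-column file is also valid under the description above: pass one-element rows and a single title.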

Selecting Annotation Files

To select an annotation file:

  1. Select File > Import Genome.
  2. The Import Genome window opens.



  3. Select the Create a Custom Genome option.
  4. Select the There is a tab-delimited file containing all of my genes and annotations option.
  5. The Import a Tab Delimited File window opens.



  6. In the Look in menu, click the drive, folder, or Internet location that contains the file you want to open.
  7. In the Folder menu, locate and open the folder that contains the file.
  8. Select the file and click the Open button.
  9. The Import Genome: Annotations File window opens. This window has two views:

    • When the Use column titles as annotation names option is selected, the Line of column titles spinner box appears (as shown below).


    • When the Use column titles as annotation names option is not selected, the First line of data spinner box appears (as shown below).


    • This window lets you choose the annotation type for each column. You must choose one column and label it as "Systematic Name". The available annotation types are:

    • Systematic Name-The unique identifier for the gene in this genome or array. It is recommended that the gene's systematic name be used to label the gene's expression values in your experiment data files.
    • Common Name-An alternative way of referring to this gene. Genes are not required to have a common name, and common names do not have to be unique. Using duplicate common names, however, will cause data not to be imported. Usually, the common name annotation column can be used to store the HUGO gene symbol or some other official gene identifier.
    • GenBank Accession Number-The GenBank or EMBL identifier for this gene, if known. If the GenBank identifiers for your genes were not used as either their systematic or common names, then including the GenBank Accession Number in this field allows you to update the information about this particular gene directly from GenBank. See Updating your Master Gene Table with GeneSpider for more information.
    • Synonym-This column allows for other names to be entered for the genes. Multiple names should be separated by semicolons (;).
    • Description-A description of this gene, if known. This information can be accessed when you use the Find Gene command.
    • Map-Mapping information for this gene. This should be a chromosome, a nucleotide position (such as 1:228836..229309), or a cytogenetic map position (such as 16q12.1).
    • Use column titles as annotation names-Apart from the standard annotation columns that are described above, each gene can have many more annotation columns. Each annotation column needs to have a name and this name can be extracted from the tab-delimited file.
    • Line of column titles-This option appears only if Use column titles as annotation names is selected. In most cases, the first line of the file contains the titles for each of the columns. GeneSpring reads the titles from the line indicated by the Line of column titles value. If the titles are not in the first row, enter the number of the row in which they appear.
    • First line of data-This option appears only if Use column titles as annotation names is not selected. Use this setting if the data file does not contain a header line or if the headers are not appropriate.
    • Reset-Resets the menus to Click to Set.
  10. To set a Systematic Name column, do the following:
    1. Locate the column you want to label "Systematic Name."
    2. From the menu, select the Systematic Name option.
  11. If a column in the annotations file is blank or if you do not want to import the annotation column, leave the menu on the Click to Set option.
  12. If you want an annotation that isn't in the menu, do the following:
    1. Select the Custom option from the menu. The Custom Annotation window opens.
    2. Enter the name you want and click the OK button. The name appears as the column header.
  13. Click the Next button.
  14. Do one of the following:
    • If the Non-Unique Identifiers window opens, GeneSpring has determined that the systematic names of the genes within the genome are not unique.

GeneSpring requires that the systematic names of genes are unique within a genome. When names in the gbk or embl files are not unique, GeneSpring will attempt to find a unique identifier within the entry to use. The results are presented in the Non-Unique Identifiers window.
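Duplicate systematic names can be caught before import. The following Python sketch lists every name that occurs more than once in a column of systematic names. It assumes exact, case-sensitive string comparison, which may differ from how GeneSpring compares names, and the function name and sample names are hypothetical.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in order of first appearance."""
    counts = Counter(names)
    seen = set()
    duplicates = []
    for name in names:
        if counts[name] > 1 and name not in seen:
            seen.add(name)
            duplicates.append(name)
    return duplicates


# Hypothetical systematic names read from the first column of a data file.
names = ["GENE0001", "GENE0002", "GENE0001", "GENE0003"]
duplicates = find_duplicate_names(names)
print(duplicates)
```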

Fixing Long Gene Name Problems

The Name Problems window (Figure 4-1) indicates that GeneSpring has detected a problem with long gene names. The length limit for names in the Systematic Name, Common Name, GenBank Accession Number, and Synonym columns of the Import Genome: Annotations File window is 256 characters. Names longer than this limit are not allowed and must be resolved before you can proceed.

Figure 4-1 Name Problems Window

This window contains the following elements:

To fix long names or invalid characters:

  1. Do one of the following:
    • Manually fix all of the problem entries by editing them.
    • Automatically fix all of the problem entries by clicking the Truncate/Fix button.
  2. Click the Continue button to fix the problem.
  3. The Import Genome Sequence window opens.

  4. Go to "Importing Genome Sequences" for instructions.
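Long names can also be found and shortened in the data file before import. The following Python sketch flags names longer than the 256-character limit stated above and cuts them at that limit. It is not a description of what the Truncate/Fix button does. It assumes that a name of exactly 256 characters is acceptable, and the function names are hypothetical. Truncation can make two names identical, so check for duplicates again afterwards.

```python
MAX_NAME_LENGTH = 256  # limit stated above for the name columns


def find_long_names(names, limit=MAX_NAME_LENGTH):
    """Return the names whose length exceeds the limit."""
    return [name for name in names if len(name) > limit]


def truncate_names(names, limit=MAX_NAME_LENGTH):
    """Return a copy of names with each entry cut to at most `limit` characters."""
    return [name[:limit] for name in names]


# Hypothetical input: one ordinary name and one 300-character name.
names = ["GENE0001", "X" * 300]
long_names = find_long_names(names)
fixed = truncate_names(names)
print(len(long_names), [len(name) for name in fixed])
```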

Importing Genome Sequences

This section explains how to import custom genome sequence files in GeneSpring. The files that you import will be added to a Sequence Files table. The data must be in seq file format. GeneSpring will attempt to parse the chromosome names in the sequence files automatically. If the chromosome names cannot be parsed, GeneSpring will use the default chromosome names (numbers).

Sequence File Format

GeneSpring loads sequence data from GenBank or EMBL files automatically. If you have sequence data that is not in a GenBank/EMBL file, place it in a separate file using the seq format.

The Silicon Genetics seq format is similar to the FASTA format, with some differences. A FASTA-formatted file can nevertheless be converted to a seq file easily: in most cases you only need to change the FASTA identifier to the chromosome number.

The seq format consists of one line of identifiers followed by lines of sequence. The identifier line consists of the "Greater than" sign (>) followed by the chromosome identifier, followed by a space which is followed by an optional description. An example is given here.

>CHR1 This is the description of Chromosome 1
GCTGACGGACTTTCTAGCGGTCTAGCAACTGAGCGGCGCGCGGGCATCGTA
CAGCAGCGAGCTACTATCTACGCGCGGCGGATATAAAACTACAAAAAAAAA

Chromosomes in GeneSpring are given a number (1, 2, 3, and so on), and the number should be part of the chromosome identifier. The chromosome identifier can optionally contain the letters 'CHR', but they are not required. The number used for the chromosome in the seq file must correspond to the number used in the Map position in the Master Table of Genes.
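The identifier-line rules above can be checked with a short script. The following Python sketch splits an identifier line into a chromosome number and an optional description, then collects the sequence lines for each chromosome. It is an illustration of the format as described here, not GeneSpring's own parser. The function names are hypothetical, and taking the first run of digits in the identifier as the chromosome number is an assumption.

```python
import re


def parse_seq_header(line):
    """Split a seq identifier line into (chromosome number, description)."""
    if not line.startswith(">"):
        raise ValueError("identifier line must start with '>'")
    identifier, _, description = line[1:].strip().partition(" ")
    match = re.search(r"\d+", identifier)  # assumption: first run of digits
    if match is None:
        raise ValueError(f"no chromosome number in {identifier!r}")
    return int(match.group()), description.strip()


def read_seq(text):
    """Return {chromosome number: sequence} from seq-formatted text."""
    chromosomes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current, _ = parse_seq_header(line)
            chromosomes[current] = []
        elif current is not None:
            chromosomes[current].append(line)
        # sequence lines that precede any identifier line are ignored
    return {number: "".join(parts) for number, parts in chromosomes.items()}


LINE_1 = "GCTGACGGACTTTCTAGCGGTCTAGCAACTGAGCGGCGCGCGGGCATCGTA"
LINE_2 = "CAGCAGCGAGCTACTATCTACGCGCGGCGGATATAAAACTACAAAAAAAAA"
example = f">CHR1 This is the description of Chromosome 1\n{LINE_1}\n{LINE_2}\n"

header = parse_seq_header(example.splitlines()[0])
sequences = read_seq(example)
print(header, len(sequences[1]))
```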

The seq format is not the same as the FASTA format. There is an example of the FASTA format at http://www.ncbi.nlm.nih.gov/BLAST/fasta.html.

An abridged example of the yeast.seq file might look like this:

>CHR1 Chromosome I data:
CCACACCACACCCACACACCCACACACCACCACCACACCACACCCACACACACA . . .
GTGGGTGTGGTGTGGTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
>CHR2 Complete DNA sequence of yeast chromosome II.
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGA . . .
AGAATAGGGTACTGTTAGGATTGTGTTAGGGTGTGGGTGTGGTGTGTGTGGG
TGTGGTGTGTGGGTGTGT
>CHR3 LOCUS SCCHRIII 315341 bp DNA PLN 25-NOV-1996
CCCACACACCACACCCACACCACACCCACACACCACACACACCACACCCA . . .
AGTGTGTGGGTGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGTGTGTGTGTGGTGT
GTGGGTGTGGGTGTGTGGGTGTGGTGGGTGTGGTGTGTGTG

Name multiple chromosomes sequentially, for example, CHR1, CHR2 and so on. If there is only one chromosome, name it CHR1.
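A FASTA file can be renamed along these lines with a small script. The following Python sketch rewrites each FASTA header as CHR1, CHR2, and so on, and keeps the original header text as the optional description. It assumes that each FASTA record holds one chromosome and that the records appear in chromosome order. The function name and the sample records are hypothetical.

```python
def fasta_to_seq(fasta_text):
    """Rewrite FASTA headers as >CHR1, >CHR2, ... in order of appearance.

    The original header text is kept after a space as the description.
    """
    out = []
    count = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            count += 1
            original = line[1:].strip()
            out.append(f">CHR{count} {original}".rstrip())
        elif line.strip():
            out.append(line.strip())  # sequence line; blank lines are dropped
    return "\n".join(out) + "\n" if out else ""


# Hypothetical two-record FASTA input.
fasta = ">seqA first record\nACGT\nTTGA\n>seqB\nGGCC\n"
seq_text = fasta_to_seq(fasta)
print(seq_text)
```

Check that the resulting chromosome numbers match the numbers used in the Map position of your genes before importing the file.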

Importing Genome Sequences

This section explains how to import genome sequence files in seq format.

To import a genome sequence:

  1. In the Import Genome Sequence window, do the following:
    1. In the Drive menu, select the drive you want.
    2. In the Directories Navigator, select the directory you want.


  2. In the Files Navigator, do the following:
    • To sort by columns, click the column title.
    • To select one file, single-click the file.
    • To select several files at once, press Shift+click or Ctrl+click.
    • To deselect a file, press Ctrl+click.
  3. Click the Add>> button to add the selected files, or click the Add All>> button to add all the files.
  4. The selected files appear in the Sequence Files box.

  5. If there is a single chromosome in the genome, and it is circular, select the The chromosome is circular option.
  6. Click the Next button.
  7. The Loading Genome window opens.



  8. Do one of the following:
    • If the Non-Unique Identifiers window opens, GeneSpring has determined that the systematic names of the genes within the genome are not unique.

GeneSpring requires that the systematic names of genes are unique within a genome. When names in the gbk or embl files are not unique, GeneSpring will attempt to find a unique identifier within the entry to use. The results are presented in the Non-Unique Identifiers window.

Go to "Managing Web Links" for the next step.
