change-history.org
1 Database schema
Currently we use the Ensembl database schema as a template. A full-featured Ensembl database consists of over 70 tables. For a gene prediction task using Augustus as the annotation engine, we only need 3 of them.
1.1 table 'dna'
Contains DNA sequence. This table has a 1:1 relationship with the contig table. Each record in this table maps one-to-one to a row in the plain-text file 'dna.txt', in which sequences are stored in the format 'int-id\tsequence'.
| Column | Type | Default value | Description | Index |
|---|---|---|---|---|
| seq_region_id | INT(10) | | Primary key, internal identifier. Foreign key referencing the seq_region table. | primary key |
| sequence | LONGTEXT | | DNA sequence. | |
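The 'int-id\tsequence' layout of 'dna.txt' can be read back with a few lines of standard C++. The following is only a minimal sketch: the names `DnaRecord` and `parseDnaLine` are hypothetical and not part of Augustus, and the sketch assumes the id is a short, non-negative decimal integer.

```cpp
#include <cstddef>
#include <string>

// One row of the 'dna' table as it appears on a single line of dna.txt.
// (Hypothetical helper type, for illustration only.)
struct DnaRecord {
    int seq_region_id;     // integer id shared with the seq_region table
    std::string sequence;  // DNA sequence
};

// Parses one "int-id\tsequence" line into 'out'.
// Returns false if the line has no tab, the id is empty,
// or the id contains anything other than decimal digits.
bool parseDnaLine(const std::string& line, DnaRecord& out) {
    const std::size_t tab = line.find('\t');
    if (tab == std::string::npos || tab == 0) return false;

    int id = 0;
    for (std::size_t i = 0; i < tab; ++i) {
        if (line[i] < '0' || line[i] > '9') return false;
        id = id * 10 + (line[i] - '0');  // assumes the id fits in an int
    }

    out.seq_region_id = id;
    out.sequence = line.substr(tab + 1);
    return true;
}
```

Calling `parseDnaLine` on each line read with `std::getline` would yield one `DnaRecord` per row of the 'dna' table.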
1.2 table 'seq_region'
Stores information about sequence regions. The primary key is used as a pointer into the dna table so that the actual sequence can be obtained, and the coord_system_id allows sequence regions of multiple types to be stored. Contigs are stored with 'coord_system_id=2'. Chromosomes have 'coord_system_id=1' and have no corresponding record in the 'dna' table. The relationship between contigs and chromosomes is stored in the assembly table.
| Column | Type | Default value | Description | Index |
|---|---|---|---|---|
| seq_region_id | INT(10) | | Primary key, internal identifier. | primary key |
| name | VARCHAR(40) | | Sequence region name. | unique key: name_cs_idx |
| coord_system_id | INT(10) | | Foreign key referencing the coord_system table. | unique key: name_cs_idx; key: cs_idx |
| length | INT(10) | | Sequence length. | |
1.3 table 'assembly'
This is the assembly table structure.
| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| asm_seq_region_id | int(10) unsigned | NO | PRI | NULL | |
| cmp_seq_region_id | int(10) unsigned | NO | PRI | NULL | |
| asm_start | int(10) | NO | PRI | NULL | |
| asm_end | int(10) | NO | PRI | NULL | |
| cmp_start | int(10) | NO | PRI | NULL | |
| cmp_end | int(10) | NO | PRI | NULL | |
| ori | tinyint(4) | NO | PRI | NULL | |
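Because chromosomes have no record in 'dna', a range on a chromosome has to be translated through the assembly rows into ranges on contigs. The sketch below shows one way this could be done. It assumes the usual Ensembl conventions (1-based inclusive coordinates, 'ori' of 1 for forward and -1 for reverse), which this document does not state explicitly. `AssemblyRow`, `ComponentSlice` and `mapToComponents` are hypothetical names, not Augustus code.

```cpp
#include <algorithm>
#include <vector>

// Mirrors the columns of the 'assembly' table.
struct AssemblyRow {
    int asm_seq_region_id;  // assembled region, e.g. a chromosome
    int cmp_seq_region_id;  // component region, e.g. a contig
    int asm_start;          // range on the assembled region (assumed 1-based, inclusive)
    int asm_end;
    int cmp_start;          // matching range on the component
    int cmp_end;
    int ori;                // assumed: 1 = forward, -1 = reverse
};

// A piece of a component that covers part of the requested range.
struct ComponentSlice {
    int cmp_seq_region_id;
    int start;
    int end;
    int ori;
};

// Translates [start, end] on assembled region 'asm_id' into component ranges.
// Rows are visited in the order given; no sorting or gap handling is done.
std::vector<ComponentSlice> mapToComponents(const std::vector<AssemblyRow>& rows,
                                            int asm_id, int start, int end) {
    std::vector<ComponentSlice> out;
    for (const AssemblyRow& r : rows) {
        if (r.asm_seq_region_id != asm_id) continue;
        const int lo = std::max(start, r.asm_start);
        const int hi = std::min(end, r.asm_end);
        if (lo > hi) continue;  // this row does not overlap the request

        ComponentSlice s;
        s.cmp_seq_region_id = r.cmp_seq_region_id;
        s.ori = r.ori;
        if (r.ori >= 0) {
            // Forward: offsets grow in the same direction on both regions.
            s.start = r.cmp_start + (lo - r.asm_start);
            s.end = r.cmp_start + (hi - r.asm_start);
        } else {
            // Reverse: asm_start lines up with cmp_end.
            s.start = r.cmp_end - (hi - r.asm_start);
            s.end = r.cmp_end - (lo - r.asm_start);
        }
        out.push_back(s);
    }
    return out;
}
```

Each returned slice identifies a contig whose sequence can then be looked up in 'dna' by its seq_region_id; slices with a reverse orientation would additionally need to be reverse-complemented.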
2 mysql++ API
Currently we use a third-party MySQL API, mysql++, to fetch sequences from the database. I chose it because it is lightweight and works well with the STL.
2.1 install
Install the package:
# apt-get install libmysql++-dev
or see docs/INSTALL.md for a complete overview.
2.2 use SSQLS
mysql++ lets the user define a 'Specialized SQL Structure' (SSQLS). At the most superficial level, an SSQLS has a member variable corresponding to each field in the SQL table. 'include/table_structure.h' defines the SSQLS 'dna', 'seq_region' and 'assembly'.
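#include <string>
#include <mysql++.h>
#include <ssqls.h>

// sql_create_N(name, compare count, set count, type1, field1, ...)
// compare count: number of leading fields used when comparing two rows
// set count: number of fields taken by the generated constructor and set()
sql_create_2(dna,
    1, 2,
    int, seq_region_id,
    std::string, sequence)

sql_create_4(seq_region,
    1, 4,
    int, seq_region_id,
    std::string, name,
    std::string, coord_system_id,  // INT(10) in the schema, read here as text
    int, length)

// The 'ori' column of the assembly table is not part of this structure.
sql_create_6(assembly,
    1, 6,
    int, asm_seq_region_id,
    int, cmp_seq_region_id,
    int, asm_start,
    int, asm_end,
    int, cmp_start,
    int, cmp_end)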
3 cmdline parameters
- --dbaccess accepts a comma-separated string "database name,host name,user,passwd,table name".
- The only parameter without a leading '--' is the query. If '--dbaccess' is given, the query corresponds to a name in the 'seq_region' table, so file type detection is skipped in this case.
- --predictionStart and --predictionEnd still work the same way as when the input file is in FASTA or GenBank format.
augustus --dbaccess="fly,localhost,henry,123456,," 3L --predictionStart=100 --predictionEnd=30000000 --species=fly
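To illustrate how the comma-separated '--dbaccess' value breaks down into its parts, here is a minimal sketch in standard C++. It is not the parsing code in types.cc: `DbAccess`, `splitCommas` and `parseDbAccess` are hypothetical names, and the choices to require the first four fields, treat the table name as optional and ignore any extra trailing fields are assumptions made for this example.

```cpp
#include <string>
#include <vector>

// The five documented parts of the --dbaccess value.
struct DbAccess {
    std::string database;
    std::string host;
    std::string user;
    std::string passwd;
    std::string table;  // may be empty
};

// Splits on ',' and keeps empty fields, so "a,,b" gives {"a", "", "b"}.
std::vector<std::string> splitCommas(const std::string& s) {
    std::vector<std::string> fields;
    std::string::size_type begin = 0;
    while (true) {
        const std::string::size_type comma = s.find(',', begin);
        if (comma == std::string::npos) {
            fields.push_back(s.substr(begin));
            break;
        }
        fields.push_back(s.substr(begin, comma - begin));
        begin = comma + 1;
    }
    return fields;
}

// Fills 'out' from "database name,host name,user,passwd,table name".
// Assumption of this sketch: the first four fields must be present,
// the table name is optional, and further fields are ignored.
bool parseDbAccess(const std::string& value, DbAccess& out) {
    const std::vector<std::string> f = splitCommas(value);
    if (f.size() < 4) return false;
    out.database = f[0];
    out.host = f[1];
    out.user = f[2];
    out.passwd = f[3];
    out.table = f.size() > 4 ? f[4] : std::string();
    return true;
}
```

With the value from the command above, "fly,localhost,henry,123456,,", this sketch yields database 'fly', host 'localhost', user 'henry', password '123456' and an empty table name.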
4 modification
| file | desc |
|---|---|
| Makefile | Add 2 header paths and 2 library paths; add -Wl,-rpath=/your/run-timelib/path. |
| types.cc | Lines 322-324: comment out an exception throw to allow 'dbaccess' in single mode. I did not want to change this behavior at the system level, so I only commented it out. |
| types.cc | Reorder --dbaccess to "database name,host name,user,passwd,table name". |
| randaccess.{hh,cc} | Implement the AnnoSequence* DbSeqAccess::getSeq method. Give class DbSeqAccess a mysqlpp::Connection object. |
| genbank.cc | GBSplitter(string fname), line 526: if the input fname is a name in the 'seq_region' table of the database, skip file type detection. |
| table_structure.h | In 'trunks/include/mysqlppheader', add 3 SSQLS: 'dna', 'seq_region', 'assembly'. |
Date: 2012-06-09 Sat