Data format rules

 

 

GENOTYPIC DATA

 


NEWS:

Starting Jan 2015, all data formats are accepted. Partners are only asked to make sure that TOP allele information is available. If it is not, please state the format of the data in the data file.


The RECOMMENDED format (no longer compulsory!) is described below.

 

 

For Illumina GOAT SNP60 beadchip data: 

All Illumina output files (DNAReport, FinalReport, LocusSummary, LocusxDNA, SampleMap, Snp_Map) should be uploaded compressed [.tar.gz / .gz / .zip]. Please make sure that the files you upload have clear names that report the researcher or institute name, the project acronym and the dataset type. If needed, include a README file in the tar archive to describe the content of each file or to give other relevant information.
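The packaging step can be sketched in Python as follows. This is only an illustration: the helper name build_archive, the README file name and the archive name in the usage note are assumptions, not names required by the project.

```python
import io
import tarfile
from pathlib import Path


def build_archive(files, archive_path, readme_text=None):
    """Bundle the given files into a .tar.gz archive.

    If readme_text is given, it is stored inside the archive as README.txt
    (hypothetical name; any clear name would do).
    """
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in files:
            f = Path(f)
            # Store each file under its base name, without local directories.
            tar.add(f, arcname=f.name)
        if readme_text is not None:
            data = readme_text.encode("utf-8")
            info = tarfile.TarInfo(name="README.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return archive_path
```

A call such as build_archive(["Snp_Map.txt", "FinalReport.txt"], "InstituteX_PROJ_goat60k.tar.gz", readme_text="...") would then produce one archive whose name carries the institute, project acronym and dataset type (all placeholder values here).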

Note that, although all files will be stored in their original form, only 2 files (Snp_Map and Final_report) will be used in the ADAPTmap SNP-DB.

The preferred "SNP map" format is: Standard Snp_Map file.



The preferred "Final report" format is: one SAMPLE_ID and SNP combination per row. Alleles are coded as A/B. The only fields that will be used in this file are: SNP_NAME, SAMPLE_ID, ALLELE1(TOP), ALLELE2(TOP). Columns can be ordered as needed, but a header MUST be present. Space, comma, semicolon and tab are accepted separators. A simple example (A/B format):

Note: 

  • In the above file, header information was skipped for clarity.
  • Only columns 1 to 4 are considered for the SNP-DB (but other variables – e.g. Gcscore, X Y – can be stored, if requested).   
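As a rough sketch of how such a file could be read, the Python function below picks the separator from the header line, locates the four fields by name (so column order does not matter) and ignores any extra columns. It assumes that the first non-empty line is the column header and that the column names are spelled exactly as listed above; the function names are hypothetical.

```python
REQUIRED = ("SNP_NAME", "SAMPLE_ID", "ALLELE1(TOP)", "ALLELE2(TOP)")


def detect_separator(header_line):
    """Pick the accepted separator that occurs most often in the header."""
    candidates = ["\t", ",", ";", " "]
    return max(candidates, key=header_line.count)


def read_final_report(lines):
    """Return (snp, sample, allele1, allele2) tuples from row-format lines."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    sep = detect_separator(rows[0])
    header = [h.strip() for h in rows[0].split(sep)]
    missing = [c for c in REQUIRED if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    idx = [header.index(c) for c in REQUIRED]
    records = []
    for ln in rows[1:]:
        fields = [f.strip() for f in ln.split(sep)]
        # Keep only the four fields of interest, in a fixed order.
        records.append(tuple(fields[i] for i in idx))
    return records
```

Looking columns up by name rather than by position is what allows the columns to be reordered and extra variables to be present in the same file.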

 

Other SNP genotype allele formats and input flexibility: If alleles are coded in A/B or FORWARD format, the file names should state the format type (e.g. goatdata_Final_report_FORWARD.txt).

Files may contain more than one allele coding and much more information than needed (so if you have a file like the one that follows, it’s just fine!). Note that the information actually used in this file is in bold.

 

Because transferring genomic information in row format is memory inefficient, matrix format is also allowed (see the example below).
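For partners who hold row-format data and want to send a matrix instead, the conversion can be sketched in Python as below. The layout used here (one row per sample, one column per SNP, each cell holding the two alleles joined together, "--" for a missing call) is an assumption made for illustration only; the layout actually expected is the one shown in the project's own example.

```python
def to_matrix(records, missing="--"):
    """Pivot a list of (snp, sample, allele1, allele2) tuples into a matrix.

    Returns a list of rows; the first row is the header.
    Assumed layout: samples as rows, SNPs as columns.
    """
    # dict.fromkeys keeps first-seen order while removing duplicates.
    snps = list(dict.fromkeys(r[0] for r in records))
    samples = list(dict.fromkeys(r[1] for r in records))
    calls = {(sample, snp): a1 + a2 for snp, sample, a1, a2 in records}
    header = ["SAMPLE_ID"] + snps
    body = [[s] + [calls.get((s, snp), missing) for snp in snps] for s in samples]
    return [header] + body
```

Each sample ID and SNP name is then written once instead of once per genotype, which is where the saving over row format comes from.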


If your data are in another format and you cannot send them in any of the formats specified above, please email adaptmap[at]tecnoparco[dot]org and we’ll agree on a suitable format for sending the data.

 

 RE-SEQUENCING DATA

Alignment data will be stored in separate .bam files. For variation data, VCF is the only file format accepted (https://www.1000genomes.org/node/101). Sequence SNP data will be considered and dealt with separately. Before SNPs coming from SNP chips and SNPs coming from sequence are integrated, procedures need to be discussed within the group. Illumina and sequence SNP allelic coding need to be normalized before the information is integrated in the database; once normalized, it will be easily integrated. In the meantime, sequence SNP data will be stored in single files. Other variations (e.g. structural variation) will simply be identified in each VCF file and stored in separate single files.
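How SNPs might be told apart from other variations in a VCF file can be sketched in Python as below. This is not the project's procedure: it assumes tab-separated VCF records with REF in column 4 and ALT in column 5 (as in the VCF specification) and treats a record as a SNP only when REF and every ALT allele are single A/C/G/T bases. The function names are hypothetical.

```python
def is_snp(ref, alt):
    """True if REF and all comma-separated ALT alleles are single bases."""
    bases = set("ACGT")
    alts = alt.split(",")
    return ref.upper() in bases and all(a.upper() in bases for a in alts)


def split_vcf_lines(lines):
    """Split VCF lines into (header, snp_records, other_records)."""
    header, snps, others = [], [], []
    for ln in lines:
        if ln.startswith("#"):
            # Meta-information and column header lines are kept apart.
            header.append(ln)
            continue
        if not ln.strip():
            continue
        fields = ln.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]
        (snps if is_snp(ref, alt) else others).append(ln)
    return header, snps, others
```

Under this rule, indels, symbolic alleles such as <DEL> and records with no alternate allele (".") all end up in the second group.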

Please remember to upload an accompanying README file describing the tool/technique and parameter(s) used to produce the files.

 

 OTHER DATA

Other kinds of data may be included as files in a folder named “OTHER”. Sample breed, alternative sample_ID(s), sire and dam information (sex, birth_date, etc.), phenotypes, etc. can be stored in the database.

Multiple files can be included in this folder, but please use only the genotype Sample_ID as the PRIMARY RELATIONAL KEY. All files must have a header briefly explaining the file content, and fields must be comma separated (alternatively, a short README file explaining fields and files can be included).
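Before uploading, it may help to check that every file in the OTHER folder can actually be linked to the genotypes through Sample_ID. The Python sketch below assumes that the first line of each comma-separated file holds the column names and that the key column is literally named Sample_ID; the function name and the example values in the usage note are hypothetical.

```python
import csv
import io


def unknown_sample_ids(csv_text, genotype_ids, key="Sample_ID"):
    """Return the key values in csv_text that are absent from genotype_ids."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if key not in (reader.fieldnames or []):
        raise ValueError("column %r not found in header" % key)
    # Any ID listed here could not be joined to a genotyped sample.
    return [row[key] for row in reader if row[key] not in genotype_ids]
```

For instance, with genotype IDs {"g1", "g2"} and a file listing g1 and g9, the function returns ["g9"], which flags a record that would have no matching genotype.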

 

 

 UPLOAD THE DATA

 Once your files are ready in the required formats, you can CLICK HERE for information on how data transfer works in this project.