CIBA – TM Series – 2018 – No. 12
Hands on Training
AQUACULTURE GENOMICS AND
BIOINFORMATICS
Organized By
GENETICS AND BIOTECHNOLOGY UNIT
Prepared by
K. VINAYA KUMAR J. ASHOK KUMAR
MISHA SOMAN RAYMOND J ANGEL B. SIVAMANI
P. MAHALAKSHMI SHERLY TOMY
M. S. SHEKHAR
G. GOPIKRISHNA
ICAR – CENTRAL INSTITUTE OF BRACKISHWATER AQUACULTURE
75, SANTHOME HIGH ROAD, RA PURAM
MRC NAGAR, CHENNAI - 600 028
Published by
Dr. K. K. Vijayan
Director, ICAR-CIBA
Hands on Training Aquaculture Genomics and Bioinformatics
TABLE OF CONTENTS
Sr. No.  Chapter Title
1   Introduction to Linux Environment
2   Introduction to programming in R
3   Python for Bioinformatics
4   Understanding the Illumina datasets
5   Checking quality of Illumina paired-end sequence data
6   Quality control of RNAseq datasets – NGS QC Toolkit
7   Quality control of RNAseq datasets – Trimmomatic
8   Assembling bacterial genomes
9   RNAseq data analysis in Trinity
10  Phylogenomic analysis using MrBayes
11  Microsatellites genotypes generation by Fragment analysis method
12  Genepop : Population Genetics analysis
13  Population genetic analysis of microsatellite data in Arlequin
14  Soft Computing techniques in Bioinformatics
15  RNAseq data analysis – Genome-guided
16  Application of "OMICS" research in aquaculture with special reference to penaeids
17  Shrimp Genomics : Current status and Challenges
18  Application of Biotechnology in animal reproduction
19  Use of molecular techniques in growth enhancement
20  Gene Editing Tools and their application in Aquaculture
    Glossary
1. Introduction to Linux Environment
J. Ashok Kumar and K. Vinaya Kumar
The open-source operating system (OS) Linux, modelled on Unix, has become the OS of choice worldwide for servers as well as desktops in academic circles. There are different variants of Linux, which include Red Hat, Ubuntu, Fedora, CentOS, Knoppix etc. Many bioinformatics software packages and individual programs are native to Linux, so it is important for a bioinformatician to have exposure to Linux commands. Here we give a list of the most commonly used Linux commands and the procedure to execute Perl/Python programs. As advanced programming is beyond the scope of this training, we provide here the basic constructs of Perl/Python programs which can be used for writing scripts for simple bioinformatics tasks.
Linux commands
Accessing the Linux environment: You can access a Linux server using any Windows-based SSH client from your system. This can be achieved by installing WinSCP or PuTTY (both are free software) on your system. Once installed, open WinSCP, fill in the host name, user name and password fields with the details provided by the system administrator, and click on the Login button, which will prompt for the password. After successful login, select PuTTY from the menu bar; a console window pops up and you will see a dollar prompt, where you can submit commands for all the operations you wish to perform on the Linux server.
Figure 1. WinSCP login window
Figure 2. Selecting PuTTY from WinSCP
ICAR – Central Institute of Brackishwater Aquaculture, Chennai
Figure 3. Linux console
The dollar prompt ($) shown in Fig. 3 is for ordinary users, and the hash (#) prompt is displayed for administrators. Only users who have administrative privileges on the server can work at the hash (#) prompt.
File system in Linux: All the folders and files of the Linux system are under the root (/) directory. Users have access to their home directories, for which the path is /home/user_name
Once you log in to the Linux system, by default you are taken to your home directory. For example, if the user name is "david", the current directory after logging in to Linux is /home/david. Users can enter their commands after the dollar ($) prompt. Some of the most commonly used Linux commands are given in the table below.
Function                                          Command
Listing the file names                            $ls
Listing the file names along with other details   $ls -l
Change to preexisting directory by name 'test'    $cd test
Make a new directory by name 'trial'              $mkdir trial
Viewing a preexisting file                        $vi mydata.txt
                                                  $nano mydata.txt
                                                  $more mydata.txt
                                                  $cat mydata.txt
Creating a new file                               $touch myfile.txt
                                                  $vi myfile.txt
                                                  $nano myfile.txt
Renaming or moving the file                       $mv file1.txt file2.txt
                                                  $mv /home/ram/file1.txt /home/ram/test/
Making duplicate of file                          $cp file1.txt file2.txt
                                                  $cat file1.txt > file2.txt
Joining two text files into a third               $cat file1.txt file2.txt > file3.txt
To display date                                   $date
To find number of lines in a file                 $wc -l xyz.txt
To display first (top) 100 lines of a file        $head -n 100 xyz.txt
To display last (bottom) 100 lines of a file      $tail -n 100 xyz.txt
Search for a pattern in a file                    $grep "pattern" file.txt
Search for pattern at beginning of line           $grep '^pattern' file.txt
Search for pattern at the end of a line           $grep 'pattern$' file.txt
Search for lines containing only the pattern      $grep '^pattern$' file.txt
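The anchors used with grep in the table (^ for the start of a line, $ for the end) behave the same way in most regular-expression tools. As a minimal sketch, the Python 3 snippet below mimics wc -l and grep on a small in-memory text. The function names and the sample text are illustrative assumptions and are not part of any Linux tool.

```python
import re


def count_lines(text):
    """Count lines the way `wc -l` does: one per newline character."""
    return text.count("\n")


def grep(pattern, text):
    """Return the lines of `text` that match the regular expression `pattern`."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


# Hypothetical sample text with four lines.
sample = "ATGC\nGGATGC\nATGCAA\nTTTT\n"

print(count_lines(sample))      # number of lines
print(grep("^ATGC", sample))    # pattern at the beginning of a line
print(grep("ATGC$", sample))    # pattern at the end of a line
print(grep("^ATGC$", sample))   # lines containing only the pattern
```

On real files the shell commands remain the simpler choice; the sketch only shows what the anchors select.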
Running perl /python programs
Perl program files have the extension ".pl". The command to execute such a program is
$ ./test_programme.pl
(the file must have execute permission, e.g. after chmod +x test_programme.pl)
Or
$perl test_programme.pl
Options of the program may be checked from the help files of the software/programs.
In the same way, Python program files have the ".py" extension and can be executed by giving the following command.
$python test_programme.py
Standalone blast
NCBI BLAST is used for comparing nucleotide and protein sequences with sequence databases to find significant matches. Alignment of sequences using BLAST can be done either by using the web tool available on the NCBI site or by installing BLAST on local servers.
BLAST can be installed on local servers along with the databases available in the public domain. In addition, users can make their own databases on local servers. If you have your own protein dataset, a local database can be created by
$makeblastdb -in xyz.fasta -dbtype prot -out xyzdb
Now you can run BLAST using your own database
$blastp -db xyzdb -query abc.fasta -out out.fasta
More general BLAST command (for blastn the database must be a nucleotide database, i.e. one built with -dbtype nucl)
$blastn -query nucl.fasta -db xyzdb -outfmt 6 -evalue 1e-05 -out output.txt
To fetch the hit sequences in FASTA format, make a file with the IDs of the hits from the output and run the following command
$fastacmd -d database_name -i blast_output > hits.fasta
fastacmd belongs to the legacy BLAST package. In BLAST+ the corresponding tool is blastdbcmd, used as
$blastdbcmd -db database_name -entry_batch blast_output > hits.fasta
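The file with IDs of hits mentioned above can be prepared from tabular BLAST output (-outfmt 6), whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The following Python 3 snippet is a minimal sketch of that step. The function name, the e-value cut-off and the two example lines are assumptions made for illustration.

```python
def hit_ids(blast_lines, max_evalue=1e-5):
    """Collect unique subject IDs (column 2) from -outfmt 6 lines.

    Only hits with an e-value (column 11) at or below `max_evalue` are kept,
    and the order of first appearance is preserved.
    """
    ids = []
    for line in blast_lines:
        if not line.strip():
            continue  # skip empty lines
        fields = line.rstrip("\n").split("\t")
        subject_id = fields[1]
        evalue = float(fields[10])
        if evalue <= max_evalue and subject_id not in ids:
            ids.append(subject_id)
    return ids


# Hypothetical tabular output with two hits for one query.
example = [
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t350\n",
    "query1\tsubjB\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n",
]
print(hit_ids(example))
```

With a real file, the lines could be read with open("output.txt") and the returned IDs written one per line to the ID file.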
2. Introduction to programming in R
J. Ashok Kumar, K. Vinaya Kumar and B. Sivamani
R is a programming environment for data analysis and graphics. The language was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics at the University of Auckland. Since its birth, a number of people have contributed to the package. It is open-source statistical software which can be downloaded free of cost. The base package and all the contributed packages can be downloaded from http://www.r-project.org/
R is available for all operating systems like Windows, Linux and Mac OS. This training material is based on the R stats package installed on the Windows operating system.
Invoking R stats
Start → All programmes → R → R i386 3.2.0 (for 32 bit installation)
Start → All programmes → R → R x64 3.2.0 (for 64 bit installation)
R Stats graphical user interface in Windows
Procedure to install additional packages
We need to add additional libraries to the base installation to utilize the full potential of R. This can be achieved by the following command (R is case sensitive, so the command must be typed in lower case).
> install.packages('name of the package')
Once the above command is executed, R asks the user to select a CRAN mirror out of several listed mirrors. The user can select a mirror at any location.
There is a package/library called 'Rcmdr' which can be used for carrying out the most commonly used statistical procedures through a graphical user interface. The command to install 'Rcmdr' is
> install.packages('Rcmdr')
Command to invoke Rcmdr
> library('Rcmdr')
RStudio
RStudio is an integrated development environment (IDE) for R. This IDE features an R notebook for writing scripts, a console for command input, a graphics viewer, a package window and an environment window, all in a single framework.
R file input and output
First set the working directory.
Command to know the location of the present working directory is
> getwd()
Command to set the working directory to any other folder
> setwd("E:/data/")
Basic command to read files is
> read.table()
and the command to create data files is
> write.table()
Importing data
Data in different file formats, i.e. text files, Excel files, SPSS data files, SAS data files etc., can be input into R stats for data analysis. It is advised that Excel files first be converted to comma-separated files for easy input into R stats.
Command to read a comma-separated text file with variable names in the first row
> data <- read.table('filename', header=TRUE, sep=",")
Here filename is the name of the text file with its extension, the header argument specifies whether variable names are included in the first row of the data file, and the 'sep' parameter gives the separator present between variables (columns), like comma, space, tab etc., in the file.
If the specified text file is not in the present working directory and you wish to select it through a graphical interface, use the following command
> data <- read.table(file.choose(), header=TRUE, sep=",")
Upon entering the above command a file selector window will pop up, and one can select a file located in any drive/directory/folder other than the present working directory.
Popup window for selecting files
For other text files, such as space-separated and tab-separated files, one needs to change only the 'sep' parameter of the above command to either " " or "\t".
In the previous command 'data' is a data frame which contains all the variable names and data. The data in the data frame can be edited and the changed contents assigned to another data frame
> data1 <- edit(data)
Upon entering the above command a popup window appears for editing the data, and all the edits will be saved in a data frame called 'data1'
Data editor window
Exporting data
Data in the data frame can be exported as a text file with the following command
> write.table(data, file="xyz.csv", col.names=TRUE, sep=",")
Creating data files manually within R stats
Data files can be created within R stats by giving simple commands.
Here we explain creating an example table with variable names in R stats
S.No  Bodyweight  Length  Species
1     25          15      aa
2     35          14      ab
3     65          27      ac
4     27          18      bb
5     45          22      cc
The above table can be created as a data frame by giving the following commands (data.frame keeps the numeric columns numeric, whereas cbind would convert every column to text)
> bodyweight <- c(25,35,65,27,45)
> length <- c(15,14,27,18,22)
> species <- c("aa","ab","ac","bb","cc")
> lengthweight <- data.frame(bodyweight,length,species)
Descriptive statistics
Suppose we have a variable by name 'x' and our task is to calculate all the descriptive statistical parameters like mean, median, standard deviation, variance etc. for the variable x in R stats. First create the variable x by giving the following command
> x <- c(20,15,19,22,26,24,23,17,18,22)
Another way of creating the variable 'x' is
> x <- scan()
1: 20 15 19 22 26 24 23 17 18 22
11:
Read 10 items
Basic commands for descriptive statistics
> mean(x) # mean
> median(x) # median
> var(x) # sample variance
> sd(x) # sample std. deviation
> quantile(x,p) # sample quantile, p could be 0.25, 0.5, 0.75
> min(x) # minimum of x
> max(x) # maximum of x
> range(x) # range of x
> library(e1071)
> skewness(x) # skewness
> kurtosis(x) # kurtosis
Commands for statistical tests
Single sample t-test
> t.test(y,mu=10)
here y is a variable; mu is the population mean
Two sample t-test
> t.test(y1,y2,var.equal=TRUE)
y1 and y2 are the two independent samples
Paired t-test
> t.test(y1,y2,paired=TRUE)
y1 and y2 are the two paired samples
Chi-square test on a contingency table
> n <- cbind(y1,y2)
> chisq.test(n)
n is a data matrix / contingency table; for a matrix, chisq.test performs a test of independence, whereas a goodness-of-fit test takes a single vector of counts, e.g. chisq.test(y1)
Correlation
> n <- cbind(y1,y2) # create matrix n
> cor(n)
where y1 and y2 are two variables and n is the matrix of y1 and y2
Regression
> fit <- lm(y~x)
for multiple regression
> fit <- lm(y~x1+x2+x3)
Completely randomised design
> tr <- c(1,1,1,2,2,2,3,3,3) # create treatment variable
> yield <- c(25,41,54,65,45,65,25,12,35) # create dependent variable
> fit <- aov(yield ~ factor(tr)) # model statement
> summary(fit)
Randomised Block Design
> tr <- c(1,1,1,2,2,2,3,3,3) # create treatment variable
> rep <- c(1,2,3,1,2,3,1,2,3) # create replication variable
> yield <- c(25,41,54,65,45,65,25,12,35) # create dependent variable
> fit <- aov(yield ~ factor(tr) + factor(rep))
> summary(fit)
Two way factorial Design
> fit <- aov(yield ~ factor(A) + factor(B) + factor(A):factor(B) + factor(rep))
> summary(fit)
Installing Bioconductor in R
Enter the following commands in the R console to install Bioconductor packages.
> source("http://bioconductor.org/biocLite.R")
> biocLite()
In current Bioconductor releases biocLite has been replaced by the BiocManager package, which is installed with install.packages("BiocManager") and used as BiocManager::install().
Steps in manipulating fasta files
First load the library
> library(seqinr)
Set the working directory in R to the folder where the fasta files are stored
> setwd("c:/path/to/directory")
> seq1 <- read.fasta("sequence.fasta")
> seq1.seq <- seq1[[1]] # to take the first sequence from the fasta file
> length(seq1.seq) # to find the length of the sequence (bases)
> table(seq1.seq) # to find the frequency of each base
> GC(seq1.seq) # to find the GC content of the sequence
Several advanced options are available in R, ranging from simple sequence analysis to microarray data analysis. The purpose of this chapter is to introduce the R environment and to provide hands-on experience in exploring the functionalities available in R.
3. Python for Bioinformatics
J. Ashok Kumar and K. Vinaya Kumar
Python is one of the most popular high-level general-purpose programming languages. It was developed in the year 1991 by Guido van Rossum, a Dutch programmer. It is an open-source programming language available for download at www.python.org. In recent years it has gained a lot of importance due to the development of several libraries applicable to various fields of research and development. One such library widely used in bioinformatics is Biopython. Here we introduce the Python environment for writing scripts and provide a glimpse of Biopython functionalities.
Installation of Python
Python is available for both Windows and Linux platforms. Windows / Linux binaries can be obtained from www.python.org. In Windows you may double click on the exe file and accept the default installation settings to get it installed in the system. Once installed, go to Edit environment variables → Advanced → Environment Variables and add the new Python path as shown in the figure.
Now you can open the command line interface in Windows by typing 'cmd' in the search box on the taskbar and pressing Enter.
On most Linux installations Python comes with the default installation. If it is not available, it can be installed on Debian/Ubuntu systems by keying in the following command
$sudo apt-get install python
Installing pip
Pip is the package manager for Python. To install pip, download get-pip.py from https://pip.pypa.io/en/stable/installing/ and enter the following command.
$python get-pip.py
Once pip is installed, any Python package can be installed by the following command
$pip install 'package-name'
Installing Jupyter
Jupyter is a notebook application for Python wherein one can write scripts, execute them and save the notebooks in different formats like pdf, doc for future use. Run the following commands for installing and opening the Jupyter notebook
$pip install jupyter # install the jupyter package
$python -m IPython notebook ## opening notebook in Windows
$jupyter notebook ## opening notebook in Linux
One can install the required additional packages like matplotlib for plotting graphs, numpy for numerical calculations, pandas for data structures and data analysis tools, statsmodels for statistical analysis, and scipy for mathematical and scientific applications. All these can be installed using pip.
Introduction to Python programming
The outputs shown below are those of Python 2; where Python 3 behaves differently, this is noted in a comment.
>>> print("hello world") # printing a text
hello world
>>> text1 = "CIBA" # text1 is a string variable
>>> a = 20 # a is a numeric variable having value 20
>>> b = 30
>>> a+b
50
>>> b-a
10
>>> a*b
600
>>> a/b # integer division in Python 2; Python 3 returns 0.666...
0
>>> a/float(b)
0.666
>>> a**b # a to the power of b, about 1.07e+39
1073741824000000000000000000000000000000
For mathematical functions
>>> import math
>>> math.log(a)
2.995
>>> math.cos(a)
0.408
Array (list) in Python
>>> a = []
>>> a = ["hi","this","is","python"]
>>> a[2]
'is'
Declaring a dictionary
>>> dict1 = {"apple": 250,"banana": 100,"cherry": 300}
>>> dict1.keys() # Python 3 returns a view object in insertion order
['cherry', 'apple', 'banana']
>>> dict1.values()
[300, 250, 100]
>>> dict1["cherry"]
300
Programming loops
>>> for i in range(0,10):
...     print(i)
>>> j = 1
>>> while (j < 10):
...     print(j)
...     j = j+1
Functions
>>> def f2c(x): # Fahrenheit to Celsius
...     return (x-32)*5/9.0
Read and write files
>>> inp = open("input.txt",'r')
>>> out = open("output.txt",'w')
>>> for line in inp: # copy only the lines starting with ">"
...     if line[0] == ">":
...         out.write(line)
>>> inp.close()
>>> out.close()
Biopython
Biopython is a set of computational tools for bioinformatics analysis. Biopython can be used to parse different files like fasta, blast output, genbank and expasy; to execute online tools like NCBI blast, entrez etc.; and for sequence alignment, multiple sequence alignment, phylogeny and even machine learning classification methods like naïve Bayes, k-nearest neighbours, support vector machines etc. The Biopython library can be installed through the pip installation method.
$pip install biopython (or python -m pip install biopython in Windows)
>>> import Bio
>>> from Bio.Seq import Seq
>>> seq1 = Seq("ATGCGGATC")
>>> seq1
Seq('ATGCGGATC', Alphabet())
>>> seq1.complement()
Seq('TACGCCTAG', Alphabet())
>>> seq1.reverse_complement()
Seq('GATCCGCAT', Alphabet())
Parsing a fasta file
>>> from Bio import SeqIO
>>> for seq_record in SeqIO.parse("sequence.fasta", "fasta"):
...     print(seq_record.id)
...     print(repr(seq_record.seq))
...     print(len(seq_record))
System commands can also be executed from Python using the following commands (blastn needs a nucleotide database such as nt)
>>> import os
>>> com = "blastn -query seq.fasta -db nt -out out.txt"
>>> os.system(com)
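An alternative to os.system is the subprocess module of the Python 3 standard library, which takes the command as a list of arguments and returns the exit status and the captured output. The snippet below is a minimal sketch of this approach. The helper name run_command is an assumption, and the demonstration runs the Python interpreter itself so that it works on machines where BLAST is not installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments.

    Returns a tuple (exit status, standard output as text).
    """
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


# A BLAST call would be written as a list in the same way, for example:
# ["blastn", "-query", "seq.fasta", "-db", "nt", "-out", "out.txt"]
# Here a harmless command is run instead, so the sketch needs no BLAST setup.
status, output = run_command([sys.executable, "-c", "print('ok')"])
print(status, output.strip())
```

Passing the arguments as a list avoids problems with spaces in file names, and an exit status other than 0 indicates that the command failed.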
4. Understanding the Illumina datasets
K. Vinaya Kumar and J. Ashok Kumar
With continual improvements being made over the past few years, Next Generation Sequencing (NGS) platforms have come a long way in generating enormous sequence data at low cost and in less time. Illumina, PacBio, Nanopore, Ion Torrent etc. are well-known NGS platforms, with several published manuscripts quoting their usage. A feature common to all these platforms is massively parallel sequencing of single or clonally amplified DNA molecules. Of the different platforms available till date, the one offered by Illumina stands apart in terms of the amount of sequence data generated and the cost involved. In the case of Illumina, right from the Genome Analyzer IIx, the HiSeq XXXX series, the MiSeq and the NextSeq XXX series to the latest NovaSeq 6000, there has been an improvement in data output along with a reduction in sequencing time.
There are two popular sequencing chemistries on the Illumina platform, namely paired-end (PE) and mate-pair (MP), that are commonly used by researchers. PE sequencing is used for RNAseq studies, where we find differentially expressed transcripts in experimental samples compared to the control sample. MP sequence reads are mostly used in the assembly of whole genomes, where they play an important role in scaffolding the contigs. In this chapter we examine the structure of paired-end sequence datasets generated on the Illumina platform. The raw sequence data files generated on the Illumina platform are delivered as '.fastq' files. For every sample two files are provided, one with the read_1 or forward sequence reads and the other with the read_2 or reverse sequence reads. The order of reads in the forward and reverse files should not be altered, as the two files are linked read by read.
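Because the two files are linked by read order, it can be useful to confirm that the n-th read of the forward file and the n-th read of the reverse file carry the same name. The Python 3 snippet below is a minimal sketch of such a check. It assumes that the read name is the first whitespace-delimited token of the header line, optionally followed by /1 or /2 as in older Illumina headers; the function names and the example headers are hypothetical.

```python
def read_name(header):
    """Return the read name from a FASTQ header line (assumed layout)."""
    name = header.lstrip("@").split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]  # drop the mate suffix of older header styles
    return name


def pairs_in_sync(forward_headers, reverse_headers):
    """True if both files list the same read names in the same order."""
    fwd = [read_name(h) for h in forward_headers]
    rev = [read_name(h) for h in reverse_headers]
    return fwd == rev


# Hypothetical header lines taken from a forward and a reverse file.
fwd_headers = ["@read1 1:N:0:ATCACG", "@read2 1:N:0:ATCACG"]
rev_headers = ["@read1 2:N:0:ATCACG", "@read2 2:N:0:ATCACG"]
print(pairs_in_sync(fwd_headers, rev_headers))
```

If the check fails after a processing step, that step has probably dropped or reordered reads in only one of the two files.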
Open the WinSCP tool. The following window appears. Enter the host name as told by the tutor. Enter the 'user name' and 'password' to log in to your account.
After logging in, the main WinSCP window appears. The window has two panels. The left panel is the file system of your computer. The right panel is the file system of your account on the server.
Click on the icon displaying 'two connected computers' in the top toolbar to open the PuTTY window. In this window you run your jobs on the server. Enter the login credentials at the prompt. Then browse to the folder where a file with the extension '.fastq' is present. Then type in the command 'head file.fastq' to see the first few lines of the file.
You will find that the information about each sequence read is represented in four lines.
Line 1: has information about instrument ID, run ID, flow cell ID, lane ID, tile ID, X and Y coordinates of clusters, read number, whether the read is filtered or not, control sample status etc.
Line 2: the sequence of the read, which is the familiar A, T, G and C
Line 3: a plus (+) sign
Line 4: the quality scores of the sequence bases
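The four-line layout described above can be read in a few lines of code. The following Python 3 snippet is a minimal sketch that groups lines in fours and returns the header, the sequence and the quality string of each read. The function name and the example record are assumptions made for illustration; for real work a tested parser such as SeqIO from Biopython is preferable.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) for every four-line FASTQ record."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if len(record) < 4:
            return  # end of input (or an incomplete last record)
        header, sequence, _plus, quality = record
        yield header[1:], sequence, quality  # drop the leading "@"


# Hypothetical record in the four-line layout.
example_lines = [
    "@read1 1:N:0:ATCACG\n",
    "ATGCGGATC\n",
    "+\n",
    "IIIIIIII#\n",
]
for name, seq, qual in read_fastq(example_lines):
    print(name, seq, qual)
```

The same function works on an open file object, because a file is iterated line by line.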
You may visit the following page to understand more about the quality scores.
https://www.illumina.com/documents/products/technotes/technote_understanding_quality_scores.pdf
The symbols in line 4 represent the quality scores of the bases. The quality scores range from 0 to 40. A score of 40 indicates that the base call is of high quality: the corresponding error probability means that one base call in 10,000 would be incorrect. The following table illustrates the relation between the symbols and the corresponding quality scores.
Table. List of symbols corresponding to quality scores of bases in Illumina sequence datasets.
Symbol  Quality Score    Symbol  Quality Score
!       0                6       21
"       1                7       22
#       2                8       23
$       3                9       24
%       4                :       25
&       5                ;       26
'       6                <       27
(       7                =       28
)       8                >       29
*       9                ?       30
+       10               @       31
,       11               A       32
-       12               B       33
.       13               C       34
/       14               D       35
0       15               E       36
1       16               F       37
2       17               G       38
3       18               H       39
4       19               I       40
5       20
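The table follows a simple rule: the quality score of a symbol is its ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). The Python 3 snippet below is a minimal sketch of both conversions; the function names are assumptions made for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of quality scores."""
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(score):
    """Probability that a base call with this quality score is wrong."""
    return 10 ** (-score / 10)


# Three symbols from the table: "!" (score 0), "5" (score 20), "I" (score 40).
print(phred_scores("!5I"))
print(error_probability(40))  # about one error in 10,000 base calls
print(error_probability(20))  # about one error in 100 base calls
```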
5. Checking quality of Illumina paired-end sequence data
K. Vinaya Kumar and J. Ashok Kumar
Illumina paired-end (PE) sequencing reads are commonly used for RNAseq studies and the assembly of genomes. For each sample, the sequencing machine writes its output data to two paired .fastq files. In this chapter we discuss the quality issues pertaining to PE reads. A better understanding of these helps in better planning of read processing to extract quality data for further studies.
One of the basic software tools useful for understanding the quality of a PE reads file is 'FastQC'. Visit the following site to download the latest version of the software.
https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find a file named a1F.fastq. We shall check the quality of this file using the FastQC tool. To do this, run the following command at your prompt.
$ fastqc a1F.fastq
In less than two minutes the analysis will be completed and two output files are written, a1F_fastqc.html and a1F_fastqc.zip. Save these files to your computer and open the .html file in any browser. Check all the images and understand their meaning. Look carefully at the following aspects in the file.
Box plot of quality scores along the sequence read length.
5.1. The reads are contaminated with adapter sequences used during sequencing.
The quality report shows that some data processing is warranted, which includes:
1. Removal of poor quality reads that are pulling down the average of quality scores.
2. Removal of poor quality bases at the start and end of sequence reads.
3. Removal of adapter sequences contaminating the reads.
6. Quality control of RNAseq datasets – NGS QC Toolkit
K. Vinaya Kumar and J. Ashok Kumar
Several freeware tools are available for processing paired-end sequence reads. In this chapter we shall use NGS QC Toolkit for quality control of PE reads. First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find two files named a1F.fastq and a1R.fastq. Check the quality of both the paired files using the FastQC tool. Practice the following quality control steps and observe the changes in quality of the trimmed files.
6.1 Discarding low quality reads
perl IlluQC_PRLL.pl -pe a1F.fastq a1R.fastq 2 A -l 70 -s 20 -c 50
This command removes all those reads where the proportion of bases having a quality of > 20 is less than 70%. After the run, you will find that a folder 'IlluQC_Filtered_files' has been created. The filtered files are present in this folder. Do a quality check of these two files with FastQC. Observe the changes in the reads file after running this command.
After discarding about 3 million reads completely, the average quality of bases improved. The improvement in quality therefore came at the expense of losing about 30% of the sequence reads.
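The filtering rule described in 6.1 can be illustrated with a short calculation. The Python 3 snippet below is a minimal sketch of the criterion only and is not the code of NGS QC Toolkit. It assumes Phred+33 quality strings and counts a base as good when its score is at or above the cut-off; the exact treatment of the boundary value and the function name are assumptions.

```python
def is_high_quality(quality_string, min_score=20, min_percent=70, offset=33):
    """True if at least `min_percent` of the bases reach `min_score`."""
    scores = [ord(symbol) - offset for symbol in quality_string]
    if not scores:
        return False  # an empty read cannot pass the filter
    good = sum(1 for score in scores if score >= min_score)
    return 100.0 * good / len(scores) >= min_percent


# "I" encodes a score of 40 and "#" a score of 2.
print(is_high_quality("IIIIIII###"))  # 7 of 10 bases are good (70%)
print(is_high_quality("IIIIII####"))  # 6 of 10 bases are good (60%)
```

For paired-end data the same test would be applied to both mates of a pair before deciding whether the pair is kept.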
6.2 Discarding poor quality bases at both ends based on length
perl TrimmingReads.pl -i a1F.fastq -irev a1R.fastq -l 3 -r 30
This command removes 3 bases at the 5' end and 30 bases at the 3' end from all reads.
6.3 Discarding poor quality bases at 3' end of reads based on quality score
perl TrimmingReads.pl -i a1F.fastq -irev a1R.fastq -q 30
This command removes bases at the 3' ends where the base quality is < 30. This improvement in quality at the ends comes at the expense of some reads getting shorter.
6.4 Discarding reads based on read length
perl TrimmingReads.pl -i a1F_7020.fastq -irev a1R_7020.fastq -n 25
This command removes reads shorter than 25 bases in length.
A combination of these steps can be chosen and applied based on the initial base quality of the sequence datasets. Extract only the good quality data for downstream processing of reads.
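The effect of the length-based steps (6.2 and 6.4) on a single read can be sketched as follows. This Python 3 snippet is only an illustration of the arithmetic and is not the code of TrimmingReads.pl; the function names and the example read are assumptions.

```python
def trim_fixed(sequence, quality, left=3, right=30):
    """Remove `left` bases from the 5' end and `right` bases from the 3' end."""
    end = max(left, len(sequence) - right)  # guards against very short reads
    return sequence[left:end], quality[left:end]


def long_enough(sequence, min_length=25):
    """True if the read has at least `min_length` bases."""
    return len(sequence) >= min_length


# Hypothetical read of 100 bases with uniform quality.
read = "ACGT" * 25
qual = "I" * 100
trimmed_seq, trimmed_qual = trim_fixed(read, qual)
print(len(trimmed_seq), long_enough(trimmed_seq))
```

The sequence and its quality string are always trimmed together, so that every base keeps its own quality score.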
7. Quality control of RNAseq datasets – Trimmomatic
K. Vinaya Kumar and J. Ashok Kumar
Several freeware tools are available for processing paired-end sequence reads. In this chapter we shall use 'Trimmomatic' for quality control of PE reads. First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find two files named a1F.fastq and a1R.fastq. Check the quality of both the paired files using the FastQC tool. Run the following command (typed as a single line) and observe the changes in quality of the trimmed files.
The command
java -jar trimmomatic-0.36.jar PE -threads 70 -trimlog a1.txt a1F.fastq a1R.fastq a1F_P.fastq a1F_S.fastq a1R_P.fastq a1R_S.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:13 SLIDINGWINDOW:4:15 MINLEN:100
De-coding the command
Each argument in the command serves the purpose of improving the quality of the trimmed files. It is important to check the initial quality of the sequence data and then apply the relevant arguments to improve the quality.
Argument                            Meaning
PE                                  Paired-end mode. Use this for processing of PE reads data.
-threads                            The argument to specify the number of threads. Trimmomatic
                                    supports running with multiple threads.
-trimlog                            To specify a file name that stores the log of the run.
a1F.fastq                           Input file name of forward or R1 reads.
a1R.fastq                           Input file name of reverse or R2 reads.
a1F_P.fastq                         Output file name of trimmed forward or R1 reads. This file is
                                    used for subsequent analysis.
a1F_S.fastq                         Output file containing surviving forward reads of good quality
                                    whose paired sequences in the R2 file were discarded.
a1R_P.fastq                         Output file name of trimmed reverse or R2 reads. This file is
                                    used for subsequent analysis.
a1R_S.fastq                         Output file containing surviving reverse reads of good quality
                                    whose paired sequences in the R1 file were discarded.
ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10  ILLUMINACLIP is used to remove adapter sequences from reads.
                                    TruSeq3-PE-2.fa is the file containing the adapter sequences.
LEADING:3                           To remove bases at the start of the read if quality is below 3.
TRAILING:13                         To remove bases at the end of the read if quality is below 13.
SLIDINGWINDOW:4:15                  This argument trims reads based on base quality. Each read is
                                    scanned from the 5' end. Four continuous bases are taken as a
                                    window. The average quality of every window in a read should
                                    be higher than 15. Otherwise, the read gets trimmed from the
                                    poor quality window to the 3' end of the read.
MINLEN:100                          To discard reads shorter than 100 bases after performing all
                                    the steps.
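The sliding-window rule described in the table can be illustrated on a single read. The Python 3 snippet below is a simplified sketch of the idea and not the actual Trimmomatic implementation, which handles further details. It assumes Phred+33 quality strings; the function name and the example read are hypothetical.

```python
def sliding_window_trim(sequence, quality, window=4, min_average=15, offset=33):
    """Cut the read at the first window whose average quality is too low."""
    scores = [ord(symbol) - offset for symbol in quality]
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_average:
            # Keep everything before the poor window, drop the rest.
            return sequence[:start], quality[:start]
    return sequence, quality  # no poor window was found


# "I" encodes a score of 40 and "#" a score of 2.
seq, qual = sliding_window_trim("ACGTACGTAC", "IIIIII####")
print(seq, qual)
```

In this example the first window with an average below 15 starts at the sixth base, so only the first five bases are kept.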
Run FastQC on the trimmed files
Below is the quality of forward sequence reads before (left) and after (right) trimming.
Below is the quality of reverse sequence reads before (left) and after (right) trimming.
Even the reads containing the adapters are trimmed. These trimmed files would be taken up for finding differentially expressed transcripts. The single-end good quality reads are also used in the case of assembling genomes.
8. Assembling bacterial genomes
J. Ashok Kumar and K. Vinaya Kumar
Genome sequencing forms the basis for understanding the biology and functional characterization of microorganisms. Recent advances in shotgun sequencing have paved the way to generate genome sequences with time and cost advantages. Here we discuss whole genome assembly with paired-end sequence reads generated on the Illumina platform. First we describe the steps involved in de novo assembly of a bacterial genome using the MaSuRCA assembler, and later we look into the steps involved in reference-based assembly using Bowtie2.
Download MaSuRCA (Maryland Super Read Cabog Assembler)
The MaSuRCA assembler can be downloaded from http://www.genome.umd.edu/masurca.html. Once it is downloaded, keep the file in your directory and extract the tarball using the following command.
user@server$ tar -zxvf MaSuRCA-3.2.6.tar.gz
This will extract the files into the folder MaSuRCA-3.2.6. You will find all the executable programs in the bin subfolder of the MaSuRCA-3.2.6 folder.
Preparing Illumina sequence reads
Copy and paste the Illumina paired-end sequence reads into a folder. There will be two files, one for the forward strand and the other for the reverse strand, for example vibgenome_R1.fastq and vibgenome_R2.fastq. These fastq files need to be quality checked and corrected using tools like FastQC, cutadapt, Trimmomatic etc.
Preparing MaSuRCA configuration file
You will find a sample configuration file (sr_config_example.txt) in the installation directory, which needs to be edited with the assembly parameters. There are two sections in the configuration file. One is the DATA section and the other is the PARAMETERS section.
In the DATA section, options are available to specify paired-end (PE), mate-pair (JUMP), PACBIO and OTHER (Celera assembler reads). Multiple libraries of the same read type can be given on multiple lines.
For paired-end reads the following line of the DATA section needs to be edited.
PE= aa 180 20 /FULL_PATH/frag_1.fastq /FULL_PATH/frag_2.fastq
PE: paired-end; aa: two-letter prefix; 180: average insert length; 20: standard deviation of insert length.
In the PARAMETERS section the mandatory parameters that need to be edited are NUM_THREADS and JF_SIZE.
NUM_THREADS is the number of threads allotted for the assembly task. Example: NUM_THREADS=16
JF_SIZE is the jellyfish hash size. Set this to about 10x the genome size; it can also be set to the genome size multiplied by its coverage.
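The rule of thumb for JF_SIZE is a simple multiplication, as the following Python 3 sketch shows. The genome size of 5 million bases and the coverage of 100x are assumed values for illustration, and the function name is hypothetical.

```python
def jf_size_estimate(genome_size_bp, coverage=None):
    """Estimate JF_SIZE as 10x the genome size, or genome size x coverage."""
    factor = 10 if coverage is None else coverage
    return int(genome_size_bp * factor)


genome_size = 5000000  # assumed bacterial genome size in bases

print("JF_SIZE =", jf_size_estimate(genome_size))                # 10x rule
print("JF_SIZE =", jf_size_estimate(genome_size, coverage=100))  # size x coverage
```

The printed value would then be entered on the JF_SIZE line of the PARAMETERS section.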
De novo assembly using MaSuRCA
The command to run the MaSuRCA assembly is
user@server$ /path/to/bin/masurca /path/to/config.txt
This will generate the file 'assemble.sh' in the current location. Now we need to run this shell script to complete the assembly
user@server$ sh assemble.sh
Successful completion of the assembly will create several files. Look for the directory named CA; within that folder you will see the 10-gapclose subfolder, wherein you will find the final assembled output. The output files are 'genome.ctg.fasta' for the contig sequences and 'genome.scf.fasta' for the scaffold sequences.
Reference based assembly using Bowtie2
In reference-based assembly, reads are mapped to a reference genome to identify variations like single nucleotide polymorphisms (SNPs), indels, insertions and copy number variants, which can then be used in applications such as genome wide association studies (GWAS).
The steps involved in reference-based assembly are listed below with the commands for running each step
1. Indexing a reference genome
$bowtie2-build V_para_GCA_000328405.1.fna vibindex
2. Aligning reads
$bowtie2 -x vibindex -1 V-Para-DNA_R1.fastq -2 V-Para-DNA_R2.fastq -S align1.sam
3. Convert sam to bam file
$samtools view -bS align1.sam > align1.bam
(-b: output BAM; -S: input is SAM)
4. Sort the bam file
$samtools sort align1.bam align1.sorted.bam
(this two-argument form belongs to legacy samtools, which appends '.bam' to the given output name, hence the file name align1.sorted.bam.bam in the next step; newer versions use samtools sort -o align1.sorted.bam align1.bam)
5. Create the BCF file
$samtools mpileup -uf V_para_GCA_000328405.1.fna align1.sorted.bam.bam | bcftools view -Ov - > align.raw.bcf
(-u: generate uncompressed BCF output; -f: faidx-indexed reference sequence file; -Ov: output potential variant sites only. Note that in bcftools 1.x, -O v instead selects uncompressed VCF as the output format.)
9. RNAseq data analysis in Trinity
K. Vinaya Kumar, J. Ashok Kumar and M.S. Shekhar
Many commercially relevant aquaculture species, including shrimp, do not have a publicly available reference genome. Therefore, the analysis of RNAseq data for such species requires building a de novo transcriptome assembly. For every experiment, a de novo assembly has to be made using the RNAseq reads of all the samples in the study. In this chapter, we shall practice building a de novo transcriptome assembly and conducting differential transcript analysis in the Trinity software.
9.1 The datasets
Let us assume an experiment involving two treatments, a and b. Each treatment has three replicate individuals. At the end of the experiment, tissue samples are collected from all replicate individuals and RNAseq is performed on the Illumina platform. The following datasets have been generated.
Table. Datasets to be used for RNAseq data analysis
              Treatment A                     Treatment B
              Forward reads   Reverse reads   Forward reads   Reverse reads
replicate 1   a1F.fastq       a1R.fastq       b1F.fastq       b1R.fastq
replicate 2   a2F.fastq       a2R.fastq       b2F.fastq       b2R.fastq
replicate 3   a3F.fastq       a3R.fastq       b3F.fastq       b3R.fastq
9.2 Quality control of datasets
Process the raw reads using the Trimmomatic tool to obtain quality-filtered reads. Use the following arguments while running Trimmomatic.
ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10
LEADING:3
TRAILING:13
SLIDINGWINDOW:4:15
MINLEN:100
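To see what SLIDINGWINDOW:4:15 and MINLEN:100 do to a read, the idea can be sketched in Python. This is a simplified illustration of the principle (cut the read where the average quality in a 4-base window first drops below 15, then discard reads shorter than 100 bases), not Trimmomatic's actual implementation; the function names and the example quality values are hypothetical.

```python
def sliding_window_keep(quals, window=4, min_avg=15):
    """Return how many leading bases are kept.

    The read is scanned from the 5' end; it is cut at the start of the
    first window whose average quality is below min_avg.
    """
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return len(quals)


def passes_minlen(kept_length, minlen=100):
    """MINLEN step: keep the read only if enough bases survive trimming."""
    return kept_length >= minlen


# Hypothetical read: 120 good bases (Q30) followed by 30 poor bases (Q10).
quals = [30] * 120 + [10] * 30
kept = sliding_window_keep(quals)
print(kept, passes_minlen(kept))
```

A read that loses its tail but still has at least 100 bases is retained; a read trimmed below 100 bases is dropped, which is why the processed files contain fewer reads than the raw files.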
The numbers of reads retained for downstream analysis are given below.
Sample name   Reads in raw file (million)   Reads in processed file (million)
a1            10                            4.954252
a2            10                            5.577112
a3            10                            6.412094
b1            10                            5.257203
b2            10                            4.160784
b3            10                            3.607086
9.3 Building a de novo assembly
As the experiment involves triplicate samples, prepare a text file showing the triplicate samples under each treatment and their file names as shown below.
Then proceed to build the assembly using the following command,
Trinity --seqType fq --samples_file ab_samples.txt --CPU 70 --max_memory 300G --SS_lib_type FR --output trinity_ab
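The details of the command arguments are,
--seqType fq: input files are in fastq format
--samples_file ab_samples.txt: sample file names are given in ab_samples.txt
--CPU 70: use 70 threads
--max_memory 300G: limit maximum memory to 300 GB
--SS_lib_type FR: data obtained from a strand-specific library as forward and reverse reads
--output trinity_ab: store output in the folder trinity_ab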
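The samples file passed to --samples_file can be generated with a short script. The sketch below assumes the usual Trinity layout of one tab-separated line per sample (condition, replicate name, forward reads file, reverse reads file) and assumes that the quality-filtered files carry a "_P" suffix, as in the file names used later in this chapter; adjust both to your own data.

```python
def samples_file_text(treatments=("a", "b"), replicates=(1, 2, 3), suffix="_P"):
    """Return the text of a Trinity samples file (one line per replicate)."""
    rows = []
    for treatment in treatments:
        for rep in replicates:
            name = f"{treatment}{rep}"
            # Columns: condition, replicate name, forward reads, reverse reads.
            rows.append("\t".join([
                treatment,
                name,
                f"{name}F{suffix}.fastq",
                f"{name}R{suffix}.fastq",
            ]))
    return "\n".join(rows) + "\n"


text = samples_file_text()
print(text)
```

Saving the printed text as ab_samples.txt gives six lines, three for treatment a and three for treatment b.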
The assembly is completed when you see the messages printed as shown below.
Browse to the output folder and find the assembled transcripts file, Trinity.fasta. Rename the file as ‘Trinity_ab.fasta’ for easy identification.
9.4 Assessing quality of assembly
9.4.1. N50: Compute the N50 statistic by running the following command,
TrinityStats.pl Trinity_ab.fasta > Trinity_ab_stats.txt
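For readers unfamiliar with the N50 statistic, the following Python sketch shows how it is derived from sequence lengths: the lengths are sorted from longest to shortest and summed until at least half of the total assembled bases is reached; the length at that point is the N50. This is a minimal illustration with made-up sequences, not a replacement for TrinityStats.pl.

```python
def fasta_lengths(lines):
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the sequence at which half of the total bases is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical miniature FASTA content.
example = [">t1", "ACGT", "AC", ">t2", "ACG"]
print(fasta_lengths(example), n50([2, 3, 4, 5, 6]))
```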
9.4.2. ExN50: The E90N50 is considered more appropriate than N50 for RNAseq studies. Get the ExN50 statistics with the following command.
contig_ExN50_statistic.pl matrix.TMM.EXPR.matrix Trinity_ab.fasta | tee ExN50.stats
E      Minimum expression   ExN50   Number of transcripts
E90    2.28                 1611    45381
E91    1.952                1511    53150
E92    1.916                1409    61794
E93    1.55                 1314    71403
E94    1.5                  1212    82217
E95    1.262                1102    94509
E96    1.122                1005    108691
E97    0.95                 927     125654
E98    0.746                858     146688
E99    0.566                791     175457
E100   0                    605     281008
The N50 calculated from the most highly expressed transcripts that together represent 90% of the total normalized expression data is 1611 bases and includes 45381 transcripts.
9.4.3. Read representation: The proportion of paired reads represented in the assembled transcripts is another parameter that helps in evaluating the assembly. We shall use the bowtie2 tool for this. First an index is made and then the reads are aligned to the transcripts. Run the following two commands.
bowtie2-build Trinity_ab.fasta Trinity_ab.fasta
and
bowtie2 -x Trinity_ab.fasta -q --fr -1 a1F_P.fastq,a2F_P.fastq,a3F_P.fastq,b1F_P.fastq,b2F_P.fastq,b3F_P.fastq -2 a1R_P.fastq,a2R_P.fastq,a3R_P.fastq,b1R_P.fastq,b2R_P.fastq,b3R_P.fastq -S samfile --no-unal -p 50
As per the statistics shown above, the overall alignment rate is 90%, which is good.
9.5 Transcript quantification
9.5.1. Estimate abundance
The first step in transcript quantification is to estimate the abundance of all transcripts in every sample. We shall practice estimating transcript abundance using an alignment-based method, RSEM, though alignment-free methods such as kallisto and salmon exist. Run the following command to get abundance estimates by aligning the sequence reads to the transcripts and counting the number of reads aligned to each transcript.
align_and_estimate_abundance.pl --transcripts Trinity_ab.fasta --seqType fq --samples_file ab_samples.txt --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference --SS_lib_type FR --output_dir ab_rsem_outdir --thread_count 20
Argument                          Meaning
align_and_estimate_abundance.pl   Script to align reads to transcripts and get abundance estimates
--transcripts                     To define the assembled transcripts file name
--seqType                         To define the input file format
--samples_file                    To define the file name that contains treatments, replicates and reads file names
--est_method                      To define the abundance estimation method (options are RSEM/eXpress/kallisto/salmon)
--aln_method                      To define the alignment method (bowtie/bowtie2)
--trinity_mode                    To automatically generate the gene_trans_map
--prep_reference                  To build the target index
--SS_lib_type                     To specify if the library is strand-specific (FR/RF)
--output_dir                      Name of the directory to store output files
--thread_count                    Number of threads to use for running the command
At the end of the run, six folders are created, corresponding to the six samples. In each folder, look for a file named RSEM.isoforms.results. These files are used for further processing. The abundance estimates are built into a matrix with the following command,
abundance_estimates_to_matrix.pl --est_method RSEM RSEM.isoforms.results
Mention all six RSEM.isoforms.results file names corresponding to the six samples.
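Conceptually, building the matrix means reading one column from each per-sample results file and placing the values side by side, one row per transcript. The Python sketch below illustrates this for the TPM column, assuming the standard tab-separated RSEM.isoforms.results header; the transcript names and values are invented, and the Trinity script additionally produces count and normalized matrices that are not reproduced here.

```python
import csv
import io


def read_tpm(handle):
    """Map transcript_id -> TPM from one RSEM.isoforms.results table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["transcript_id"]: float(row["TPM"]) for row in reader}


def build_matrix(per_sample):
    """Combine per-sample dictionaries into rows holding one value per sample."""
    samples = list(per_sample)
    transcripts = sorted({tid for values in per_sample.values() for tid in values})
    rows = {tid: [per_sample[s].get(tid, 0.0) for s in samples] for tid in transcripts}
    return samples, rows


# Invented example content standing in for two RSEM.isoforms.results files.
header = "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
a1 = io.StringIO(header
                 + "t1\tg1\t900\t750.0\t30.00\t12.50\t8.10\t100.00\n"
                 + "t2\tg2\t500\t350.0\t0.00\t0.00\t0.00\t0.00\n")
b1 = io.StringIO(header
                 + "t1\tg1\t900\t750.0\t10.00\t4.00\t2.60\t100.00\n"
                 + "t2\tg2\t500\t350.0\t5.00\t6.25\t4.00\t100.00\n")

samples, matrix = build_matrix({"a1": read_tpm(a1), "b1": read_tpm(b1)})
print(samples)
for transcript, values in matrix.items():
    print(transcript, values)
```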
9.5.2. Count the numbers of expressed transcripts
Count the number of transcripts that are expressed at different TPM thresholds by running the following command,
count_matrix_features_given_MIN_TPM_threshold.pl matrix.TPM.not_cross_norm | tee counts_by_min_TPM
The output looks like the table depicted below.
Neg_min_tpm Number of features
-10 24978
-9 29296
-8 35850
-7 45677
-6 62228
-5 84308
-4 111966
-3 151586
-2 202147
-1 228356
0 281008
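The logic behind the table above can be sketched in Python: for each transcript, take its highest TPM across all samples, then count how many transcripts reach each threshold. This is a simplified illustration with invented values; the Trinity script reports the thresholds as negative numbers (Neg_min_tpm) and its rounding details are not reproduced here.

```python
def count_by_min_tpm(matrix, thresholds):
    """Count transcripts whose maximum TPM across samples reaches each threshold."""
    max_tpm = [max(values) for values in matrix.values()]
    return {t: sum(1 for value in max_tpm if value >= t) for t in thresholds}


# Invented TPM matrix: transcript -> TPM in each sample.
tpm_matrix = {
    "t1": [0.5, 12.0],
    "t2": [3.0, 2.0],
    "t3": [0.0, 0.0],
}
counts = count_by_min_tpm(tpm_matrix, thresholds=[0, 1, 10])
print(counts)
```

The count at threshold 0 equals the total number of transcripts in the assembly, which matches the last row of the table (281008).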
9.6 Differential expression analysis
At present, Trinity supports four R packages for performing differential expression analysis. These are edgeR, DESeq2, limma/voom and ROTS. We shall use edgeR in this tutorial to understand differential expression analysis. Run the following commands.
run_DE_analysis.pl --matrix matrix.counts.matrix --method edgeR --samples_file ab_samples_DE.txt --output ab_edgeRresult
and
analyze_diff_expr.pl --matrix matrix.TMM.EXPR.matrix --output aVSb --samples ab_samples_analyzeDE.txt
In this particular example, we got five transcripts that are differentially expressed in sample b compared to sample a. Now proceed to functional annotation of these transcripts to understand their role in the given treatment in the study.
9.7 Quality check of samples and replicates: You may compare the samples as well as the replicates in each sample with the following commands.
9.7.1. /PtR --matrix matrix.counts.matrix --samples ab_replicatesTest.txt --CPM --log2 --min_rowSums 10 --compare_replicates
9.7.2. /PtR --matrix matrix.counts.matrix --min_rowSums 10 -s ab_replicatesTest.txt --log2 --CPM --sample_cor_matrix
9.7.3. /PtR --matrix matrix.counts.matrix -s ab_replicatesTest.txt --min_rowSums 10 --log2 --CPM --center_rows --prin_comp 3
For example, in the picture below, it is evident that the replicates in treatment b are clustered closely. This indicates that all the replicates behaved similarly.
10. Phylogenomic analysis using MrBayes
K. Vinaya Kumar, J. Ashok Kumar and G. Gopikrishna
Researchers perform phylogenetic analysis to understand the evolutionary relations among taxa. Such analyses require information on best-fit partitioning schemes and best-fit models for the sequence data in hand. PartitionFinder is a suitable tool to find that information for building a phylogenetic tree. In this chapter we conduct analyses using the PartitionFinder tool to find the best-fit partitioning scheme and evolutionary models. Then, using these partitioning schemes and models, we build a Bayesian tree in the MrBayes tool.
10.1 PartitionFinder
For this exercise, a sequence file containing sequence data of 5 genes for 10 taxa is provided in your work folder. Open the folder and check for the file named ‘sequence_10_5.phy’.
Taxa labels       10 taxa: taxaA, taxaB, ... taxaJ
Gene partitions   5 genes
                  Gene1: 1-675 bp
                  Gene2: 676-834 bp
                  Gene3: 835-2373 bp
                  Gene4: 2374-3060 bp
                  Gene5: 3061-3849 bp
The arguments for running PartitionFinder are to be provided in a configuration file. Find the file ‘partition_finder.cfg’ in the work folder. Keep the settings as per the table given below.
Argument          Option               Meaning
alignment         sequence_10_5.phy    File containing sequences in phylip format
branchlengths     linked               Linked branch lengths are supported by almost all phylogeny programs
models            mrbayes              Includes for testing all the evolutionary models that are compatible with the MrBayes tool
model_selection   aicc                 Criterion to decide the best model
data_blocks       (listed below)       Defining data partitions. For each gene, three data partitions are defined based on the three base positions of the triplet code. We defined 15 data blocks for 5 genes.
                  Gene1_pos1 = 1-675\3;
                  Gene1_pos2 = 2-675\3;
                  Gene1_pos3 = 3-675\3;
                  Gene2_pos1 = 676-834\3;
                  Gene2_pos2 = 677-834\3;
                  Gene2_pos3 = 678-834\3;
                  Gene3_pos1 = 835-2373\3;
                  Gene3_pos2 = 836-2373\3;
                  Gene3_pos3 = 837-2373\3;
                  Gene4_pos1 = 2374-3060\3;
                  Gene4_pos2 = 2375-3060\3;
                  Gene4_pos3 = 2376-3060\3;
                  Gene5_pos1 = 3061-3849\3;
                  Gene5_pos2 = 3062-3849\3;
                  Gene5_pos3 = 3063-3849\3;
search            greedy               Defining the method to use for finding a good partitioning scheme
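Typing 15 data blocks by hand is error-prone, so the list can be generated from the gene boundaries. The Python sketch below is an illustration using the gene coordinates given above; the function name is hypothetical, and the output lines are meant to be pasted into the data_blocks section of the configuration file.

```python
# Gene boundaries (first base, last base) as given for sequence_10_5.phy.
genes = {
    "Gene1": (1, 675),
    "Gene2": (676, 834),
    "Gene3": (835, 2373),
    "Gene4": (2374, 3060),
    "Gene5": (3061, 3849),
}


def codon_data_blocks(gene_ranges):
    """Return one data block per codon position for every gene."""
    blocks = []
    for name, (start, end) in gene_ranges.items():
        for position in (1, 2, 3):
            # "\3" means every third site, starting from the given first site.
            blocks.append(f"{name}_pos{position} = {start + position - 1}-{end}\\3;")
    return blocks


blocks = codon_data_blocks(genes)
for block in blocks:
    print(block)
```

Each gene length here is a multiple of three, so the three codon positions divide every gene evenly.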
Then use the following command to run PartitionFinder. Here, mention the name of the folder containing the sequence file and the configuration file in place of ‘folder_name’.
python PartitionFinder.py folder_name/ --no-ml-tree
The output files are stored in the folder ‘analysis’. Find the file ‘best_scheme.txt’, which contains the arguments for running the best-fit models on the best partitions in the MrBayes tool.
10.2 MrBayes
The Bayesian analysis requires the input sequence file in nexus format. Find the file sequence_10_5.nxs, which contains the same sequence data used for the analysis in PartitionFinder. Download the Windows version of the MrBayes tool and unzip the file. Start the tool by clicking on the executable. Then run the following commands.
execute sequence_10_5.nxs;
outgroup taxaJ;
(type the arguments given in the PartitionFinder output file, ‘best_scheme.txt’)
showmodel   # to check the model defined
mcmc ngen=10000000 nruns=2 nchains=4 samplefreq=100 printfreq=100 diagnstat=maxstddev diagnfreq=1000 savebr=yes filename=PartitionFinder
After running for 10 million generations, you would see the following screen.
You could continue with more generations if required by opting for ‘yes’ at the prompt.
Then obtain a summary of the parameters with the following command. Here, by default, the first 25% of the observations are discarded.
sump filename=PartitionFinder
Look for parameters such as the estimated sample size and the potential scale reduction factor. Then summarize the trees with the following command. This prints a cladogram and a phylogram.
sumt filename=PartitionFinder
Check for the .tre file and open it in FigTree to view the tree.
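The potential scale reduction factor (PSRF) compares the variation within each run with the variation between runs; values close to 1 suggest that the independent runs have converged. The Python sketch below shows the textbook Gelman-Rubin form of the statistic together with a 25% burn-in discard. It is an illustration on invented numbers, and the exact calculation used by MrBayes may differ in detail.

```python
from math import sqrt
from statistics import mean, variance


def discard_burnin(samples, fraction=0.25):
    """Drop the first fraction of the sampled values (burn-in)."""
    return samples[int(len(samples) * fraction):]


def psrf(chains):
    """Potential scale reduction factor for equally long chains."""
    n = len(chains[0])
    within = mean(variance(chain) for chain in chains)          # W
    between = n * variance([mean(chain) for chain in chains])   # B
    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)


# Invented parameter samples from two runs.
agreeing = [[1, 2, 3, 4], [1, 2, 3, 4]]
disagreeing = [[1, 2, 3, 4], [11, 12, 13, 14]]
print(round(psrf(agreeing), 3), round(psrf(disagreeing), 3))
print(discard_burnin(list(range(8))))
```

Runs that sample the same region give a value near 1, whereas runs stuck in different regions give a value well above 1, signalling that more generations are needed.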
11. Microsatellites genotypes generation by Fragment analysis method
B. Sivamani
Fragment analysis (genotyping) can be performed on DNA fragments that carry fluorescent labels. Using a labeled primer in PCR amplification is a common method of incorporating these labels. The Molecular Biology Core lab is already set up to run multiple fluorescent dye sets.
Steps
1. Microsatellite loci selection
2. Primer designing (fluorescent labelled)
3. PCR
4. Fragment analysis – ABI sequencer
5. Generating genotypes
1. Microsatellite loci selection
The loci are selected through literature search or from a database. For fisheries, the NBFGR FishMicrosat database provides updated microsatellite loci and their primers for PCR amplification (http://mail.nbfgr.res.in/fishmicrosat/).
Steps to find the microsatellite loci in Penaeus (Fenneropenaeus indicus)
• Visit the site (http://mail.nbfgr.res.in/fishmicrosat/)
• Under Analysis and Primer, select your species of interest and you will find all the microsatellite loci related to the specific search. The following details of the loci are also presented:
1. Accession Number: the link leads to the NCBI site and gives all the details of the nucleotide sequence
2. SSR type: di, tri, tetra or compound
3. Microsatellite span in the sequence
4. Primers to amplify the locus
2. Designing primer
One can use the specified primer, or a primer may be designed as per the user's requirement by using the Accession Number option. One of the primers needs to be fluorescently labeled.
3. PCR
Isolate total DNA from the biological material (blood, fin clips, muscle, etc.) of the species. Verify the DNA for quality and quantity. Carry out the PCR with labelled primers. Verify the amplicon by agarose gel electrophoresis. Specific amplification of the product is preferred; otherwise, the presence of some less intense non-specific bands is also acceptable.
4. Fragment analysis – by ABI sequencer
This step is normally outsourced because the cost of the equipment is too high. We receive the results generated by the GeneMapper software (private firms use the inbuilt GeneMapper software) as the following files.
1. FSA file
2. PDF of the electropherogram
3. Genotypes in an Excel sheet
Fig. 1: Electropherogram
Fig. 2: Genotype data generated by GeneMapper software
5. Generating genotypes from FSA file using R
# 5.1 Install R from the site https://www.r-project.org/
# 5.2 Install the package from the R site
install.packages("Fragman")
# 5.3 To activate, the package has to be loaded
library(Fragman)
# 5.4 To specify the folder containing the input FSA files
FIM03 <- storing.inds("C:/Users/Admin/Desktop/training writeup/FAS file-ciba")
# 5.5 To specify the ladder used in the analysis
my.ladder <- c(35,50,75,100,139,150,160,200,250,300,340,350,400,450,490,500)
# 5.6 To merge both the earlier specified information (FSA files and ladder)
ladder.info.attach(stored=FIM03, ladder=my.ladder)
# 5.7 To create friendly plots for any number of individuals specified; these can be used to
# design panels for posterior automatic scoring
overview2(my.inds=FIM03, channel = 2, ladder=my.ladder)
# 5.8 To view the vector with expected DNA sizes to be used in the next step for scoring
my.panel2 <- overview2(my.inds=FIM03, channel = 2, ladder=my.ladder, init.thresh=3000, xlim=c(90,130))
my.panel2
# 5.9 To score our samples for channel 2 with our panel created previously
res2 <- score.markers(my.inds=FIM03, channel = 2, panel=my.panel2$channel_2, ladder=my.ladder, electro=FALSE)
# 5.10 To extract your peaks in a data.frame
final.results <- get.scores(res2)
final.results
# 5.11 To get the results in text file format
write.table(final.results, "C:/Users/Admin/Desktop/training writeup/FIM03-18-2.txt", sep="\t")
******
Note
install.packages = to install the specific package
library = to load an add-on package
storing.inds = the function in charge of reading the FSA files and storing them with a list structure
ladder.info.attach = uses the information read from the FSA files and a vector containing the ladder information (DNA size of the fragments) and matches the peaks from the channel where the ladder was run with the DNA sizes for all samples. It then loads this information into the R environment for use by subsequent functions
stored = list of data frames obtained by using the storing.inds function
overview2 = creates friendly plots for any number of individuals specified; can be used to design panels for posterior automatic scoring (as licensed software does)
my.inds = list with the channel information from the individuals specified, usually coming from the storing.inds function output
channel = the channel you wish to analyze; usually 1 is blue, 2 is green, 3 is yellow, 4 is red and so on
init.thresh = an initial value of intensity to detect peaks. We recommend not changing it much unless you have highly controlled DNA concentrations in your experiment
score.markers = scores the alleles by finding the peaks provided in the panel
panel = different DNA sizes, usually obtained by using the overview and locator functions
get.scores = once the calls have been obtained, a data frame can be extracted with the get.scores function
******
xlim=c(a1,b1) = the approximate amplicon size range to be mentioned in overview2
Dye sets used in Applied Biosystems DNA analysers
Blue: 5-FAM and 6-FAM
Green: HEX, VIC, TET and JOE
Yellow: TAMRA and NED
Red: ROX and PET
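The role of the ladder described in the note on ladder.info.attach can be illustrated with a small Python sketch: once the ladder peaks have been matched to their known sizes, the size of a sample peak is estimated from its position between two neighbouring ladder peaks. The scan positions below are invented, plain linear interpolation is used for simplicity, and the fitting method actually used by Fragman may differ.

```python
def size_from_ladder(scan_pos, ladder_scans, ladder_bp):
    """Estimate fragment size (bp) by linear interpolation between ladder peaks.

    ladder_scans: scan positions of the ladder peaks, in increasing order.
    ladder_bp: known fragment sizes of those ladder peaks.
    Returns None when the peak lies outside the ladder range.
    """
    pairs = list(zip(ladder_scans, ladder_bp))
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if x0 <= scan_pos <= x1:
            return y0 + (y1 - y0) * (scan_pos - x0) / (x1 - x0)
    return None


# Invented scan positions for three ladder fragments (100, 139 and 150 bp).
ladder_scans = [1000, 1500, 2000]
ladder_bp = [100, 139, 150]
print(size_from_ladder(1250, ladder_scans, ladder_bp))
print(size_from_ladder(500, ladder_scans, ladder_bp))
```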
12. Genepop : Population Genetics analysis
B. Sivamani
GENEPOP is a population genetics software package originally developed by Michel Raymond (Raymond@isem.univ-montp2.fr) and Francois Rousset (Rousset@isem.univ-montp2.fr), at the Laboratoire de Genetique et Environment, Montpellier, France.
Access: The web version is easy to use but requires an active internet connection. The software can also be downloaded and run under Windows and Linux without internet.
Genepop on the web
It can be accessed at the link: http://genepop.curtin.edu.au/
It has seven options. Once the input file is prepared, all the options can be run.
Option 1 Hardy Weinberg Exact Tests
Option 2 Genotypic linkage disequilibrium
Option 3 Population differentiation
Option 4 Nm estimates - private allele method
Option 5 Basic Information
Option 6 Fst and other correlations
Option 7 File conversion
Input file (e.g.)
Title:"P.indicus microsatellites based population diversity"
FIM03
FIM06
FIM20
FIM21
FIM17
FIM19
FIM23
POP
C051 , 113115 161161 128130 225225 308289 110118 201226
C054 , 109115 161167 123123 222231 299307 110116 222226
C055 , 113117 161164 123123 231231 307307 108110 191201
C056 , 109113 161164 123123 245248 307307 110118 191201
C057 , 117117 161161 123125 230230 293307 110118 191201
POP
K15 , 115115 161161 128130 230230 308308 110118 201222
K20 , 115117 160160 123125 000000 294307 110118 191201
K23 , 107113 149159 123123 000000 309308 110116 191201
K36 , 111115 160186 123123 000000 307308 108118 201222
K42 , 109115 138149 123125 000000 298300 110116 191201
Instructions for the input file
• The input file should be prepared in Notepad, Notepad++ or Excel
• The input file should have the txt extension, e.g. filename.txt
• The first line, the title, is written within inverted commas
• There is no constraint on blanks separating the various fields; tabs or spaces are allowed.
• Loci names can appear on separate lines, or on one line if separated by commas
• The individual identifier may have blanks but must end with a comma
• Alleles are numbered from 01 to 99 (or 001 to 999). Consecutive numbers to designate alleles are not required.
• Populations are defined by the position of the “Pop” separator. To group various populations, just remove the relevant “Pop” separators.
• Individual genotypes for the web version must be on one line. This differs from the PC version.
• Missing data should be indicated as 00 (or 000) rather than blanks. There are three possibilities for missing data:
  - no information (0000) or (000000),
  - partial information for the first allele (1000) or (010000),
  - partial information for the second allele (0010) or (000010).
• The number of locus names should correspond to the number of genotypes in each row. If you remove one or several loci from your input file, you should remove both their names and the corresponding genotypes.
• No empty lines should be found within the file.
• No more than one empty line should be present at the end of the file.
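The format rules above can be checked before uploading a file. The Python sketch below reads Genepop-style text with one locus name per line or several per line separated by commas, splits each genotype into its two alleles, treats 00 or 000 as missing, and verifies that every individual has one genotype per locus. It is an illustrative reader written for this note, not part of Genepop, and the example data are shortened from the input file shown earlier.

```python
def parse_genepop(text):
    """Return (title, loci, populations) from Genepop-formatted text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0]
    loci = []
    index = 1
    # Locus names continue until the first POP separator.
    while index < len(lines) and lines[index].upper() != "POP":
        loci.extend(name.strip() for name in lines[index].split(",") if name.strip())
        index += 1
    populations = []
    for line in lines[index:]:
        if line.upper() == "POP":
            populations.append({})
            continue
        # The identifier ends at the last comma; genotypes follow it.
        identifier, _, genotype_text = line.rpartition(",")
        calls = []
        for genotype in genotype_text.split():
            half = len(genotype) // 2
            first, second = int(genotype[:half]), int(genotype[half:])
            calls.append((first or None, second or None))  # 0 means missing
        if len(calls) != len(loci):
            raise ValueError(f"{identifier.strip()}: expected {len(loci)} genotypes")
        populations[-1][identifier.strip()] = calls
    return title, loci, populations


example = """Title:"P.indicus microsatellites based population diversity"
FIM03
FIM06
POP
C051 , 113115 161161
POP
K20 , 115117 000000
"""
title, loci, populations = parse_genepop(example)
print(loci, populations)
```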
To run on a PC
Download Genepop from the link http://kimura.univ-montp2.fr/~rousset/Genepop.htm
Depending on the OS, the 32- or 64-bit version can be run on the PC without installation. The input file format is the same as for Genepop on the web. The input file should be in the same folder as the software. After specifying the input file, type the number of the option (for analysis); the output file gets stored in the same folder.
13. Population genetic analysis of microsatellite data in Arlequin
B. Sivamani
Arlequin is a software tool specially designed to extract information on genetic and demographic features of a collection of population samples. Arlequin can handle several types of data, in either haplotypic or genotypic form. The data types include
• DNA sequences
• RFLP data
• Microsatellite data
• Standard data
• Allele frequency data
Arlequin can analyse various population parameters, such as standard indices, molecular diversity, linkage disequilibrium, Hardy-Weinberg equilibrium, AMOVA and exact population differentiation.
Installation and uninstallation
1. Download WinArl35.zip to any temporary directory.
2. Extract all files contained in WinArl35.zip to the directory of your choice.
3. Start Arlequin by double-clicking on the file WinArl35.exe, which is the main executable file.
4. To uninstall, simply delete the directory where you installed Arlequin. The registry is not modified by the installation of Arlequin.
Configuration
Download the TextPad text editor from www.textpad.com and install it. It is required to create and edit the project files and to view the log files.
Download R from www.r-project.org and install it.
Running the software Arlequin
Open Arlequin by double-clicking “WinArl35.exe”, which leads to the home page.
Step 1. Configuration of Arlequin
1.1 Click on the ‘Arlequin Configuration’ box and select the options ‘Append results’, ‘XML output’ and ‘Use 64bit external’. ‘Append results’ is selected to collect the results of several runs of a specific input file in a single output file. The ‘XML output’ option gives the results in XML format.
1.2 Under ‘Helper Programs’, the paths of the text editor and R have to be specified. Click the ‘Browse’ box of the ‘Text editor’ and browse to where the Textpad.exe file is located.
1.3 Click the ‘Browse’ button of ‘Rcmd’ and indicate the path by selecting the Rcmd application from the specific folder.
Step 2. Project file preparation
Arlequin requires a project file (input) which has the extension “.arp”. Once the analysis is over, the output (results) will be stored in the same (WinArl35) folder, in a subfolder with the extension “.res”.
2.1 Open the Arlequin software by double-clicking “WinArl35.exe”
2.2 Click on the “project wizard” option. An example project file will be created in the Arlequin format.
2.3 Click the dropdown menu of ‘Datatype’. Select the option ‘MICROSAT’
2.4 Choose ‘Genotype data’
2.5 We have data on five populations for analysis. Therefore mention ‘No of samples’ as 5.
2.6 Choose ‘whitespace’ against the ‘Locus separator’
2.7 Type ‘?’ against ‘Missing data’
2.8 Select the option ‘Include genetic structure’
2.9 Click the ‘Browse’ option
2.10 It opens a pop-up window where the project file needs to be stored as ‘ciba-1’ in the ‘WinArl35’ folder
2.11 Click on ‘CREATE PROJECT’, which creates a project file named ‘ciba-1’
2.12 Convert the Genepop format of the input file into the Arlequin project format
2.12.1 Open ‘Genepop on the web’ (http://genepop.curtin.edu.au/)
2.12.2 Click on option 7 (File Conversion), which opens the window for data format conversions.
2.12.3 Select (option 5) Genepop format to Arlequin project
2.12.4 Select ‘Datatype’ as ‘microsatellite’
2.12.5 Select ‘Genotypic data’ as ‘diploid’
2.12.6 For ‘Recessive (null) allele present’, select yes or no based on the data. Here our data contain some null alleles. Therefore we select the ‘Yes’ option.
2.12.7 For ‘Gametic phase’, select the ‘unknown’ option (as these are diploid data, gametic phase details are not necessary; the same results will be obtained with either option)
2.12.8 For ‘Output format & Delivery’, select either of the options ‘Email the results’ or ‘HTML - Plain Text’. Under ‘Email the results’, enter your email id; the results will be sent to your mail id. The plain text option displays the results in the same window.
2.12.9 Under the ‘Choose File’ option, browse to your Genepop file (ciba_genpop_1.txt) and click the ‘Submit data’ box. We get the results in the Arlequin project format.
2.13 Copy the results from ‘[Data]’ till the end.
2.14 Go to the Arlequin software page and click ‘Edit project’
2.15 It will open the ciba-1.arp file in TextPad
2.16 Paste the copied content in the ciba-1.arp file, replacing the [Data] content
2.17 Edit the ‘Structure’ content
2.17.1 # (Enter the title between inverted commas) (e.g.)
StructureName = "Fish-India"
2.17.2 # Number of groups (enter 1, 2, 3, etc., as per the number of groups one has to make) (e.g.)
NbGroups = 1
2.17.3 # Define hereafter the structure of the first group: mention the names of all the populations that belong to the specific group. Every population name should be within inverted commas. (e.g.)
Group = {
"C051"
"K15"
"MNI01"
"P094"
"Q02"
}
2.17.4 After editing, save and close the file.
3 Analyzing the data
3.1 Using the ‘Open project’ box, open the file ‘ciba-1.arp’
3.2 Choose the required analysis from the ‘Settings’
3.3 Click on the ‘Start’ button to start the analysis
3.4 View the results generated in the folder ‘ciba-1.res’ (the project file name with the .res extension).
14. Soft Computing techniques in Bioinformatics
P. Mahalakshmi
INTRODUCTION
The exponential growth in the amount of available biological data raises two problems: on one hand, efficient information storage and management and, on the other hand, the extraction of useful information from these data. The second problem is one of the main challenges in computational biology, which requires the development of tools and methods capable of transforming all these heterogeneous data into biological knowledge about the underlying mechanism. These tools and methods should allow us to go beyond a mere description of the data and provide knowledge in the form of testable models. Through the simplifying abstraction that constitutes a model, we will be able to obtain predictions of the system. There are several biological domains where soft computing techniques are applied for knowledge extraction from data.
The application of soft computing is relevant for solving several bioinformatics and molecular biology problems. Developments in soft computing methods have contributed technologies, algorithms and tools to bioinformatics for purposes such as dependable and parallel genome sequencing, fast sequence comparison, database search, automated gene identification, efficient modeling and storage of heterogeneous data, etc. Protein classification leads to the identification and proper functional assignment of uncharacterized proteins, with the final goal of finding homologies and discovering drugs. Again, structure-based ligand design is one of the crucial steps in rational drug discovery, where a small molecule is designed by targeting the structure and biochemical properties of the target.
The application of soft computing offers a promising approach to achieving efficient and reliable heuristic solutions. On the other hand, the continuous development of high-quality biotechnology, e.g. micro-array techniques and mass spectrometry, which provide complex patterns for the direct characterization of cell processes, offers further promising opportunities for advanced research in bioinformatics. One important sub-discipline within bioinformatics therefore involves the development of new algorithms and models to extract new and potentially useful information from various types of biological data, including DNA (nucleotide sequences) and proteins (amino acid sequences). Analysis of these macromolecules is performed both structurally and functionally using the major components of soft computing, such as Fuzzy Sets (FS), Artificial Neural Networks (ANN), Evolutionary Algorithms (EAs) (including genetic algorithms (GAs)), Rough Sets (RS), Swarm Optimization (SO), etc. These lecture notes attempt to describe fuzzy logic, Artificial Neural Networks and genetic algorithms and their applications in bioinformatics.
NEED OF SOFT COMPUTING IN BIOINFORMATICS
The different tasks involved in the analysis of biological data include sequence alignment, genomics, proteomics, DNA and protein structure prediction, gene/promoter identification, phylogenetic analysis, analysis of gene expression data, protein folding, docking, and molecule and drug design. Data analysis tools used earlier in bioinformatics were mainly based on statistical techniques like regression and estimation. Soft computing can be used in bioinformatics to handle large, complex and inherently uncertain biological data sets in a robust and computationally efficient manner; fuzzy sets (a soft computing technique), for example, can be used as a natural framework for analysing them. Most bioinformatics tasks involve search and optimization of different criteria (like energy, alignment score, overlap strength), while requiring robust, fast and close approximate solutions.
Missing and noisy data are characteristic of biological data. Conventional computing techniques fail to handle this, whereas soft computing based techniques are able to deal with missing and noisy data. As soft computing is designed to handle vagueness, uncertainty and near optimality in large and complex search spaces, the use of soft computing tools for solving bioinformatics problems has gained the attention of researchers. Most of the research is woven around the tasks of pattern recognition and data mining, such as clustering, classification, feature selection and rule generation. While classification pertains to supervised or unsupervised learning, clustering corresponds to unsupervised self-organization into homologous partitions.
In molecular biology research, new data and concepts are generated every day, and those new data and concepts update or replace the old ones. Soft computing can be easily adapted to a changing environment. This benefits system designers, as they do not need to re-design systems whenever the environment changes. Moreover, since many of the problems involve multiple conflicting objectives, the application of soft computing multi-objective optimization algorithms, like multi-objective genetic algorithms, appears to be natural and appropriate. Soft computing techniques, either individually or in a hybridized manner, can be used for analyzing biological data in order to extract more meaningful information and insights from them.
With advances in biotechnology, huge volumes of biological data are generated. In addition, it is possible that important hidden relationships and correlations exist in the data. Soft computing methods are designed to handle very large data sets, and can be used to extract such relationships.
FUZZY LOGIC AND ITS APPLICATION IN BIOINFORMATICS
Fuzzy Sets and Linguistic Variables
A fuzzy set is an extension of a crisp set. Crisp sets allow only full membership or no membership
at all, whereas fuzzy sets allow partial membership. In a crisp set, membership or non-membership of
element x in set A is described by a characteristic function $\mu_A(x)$, where $\mu_A(x) = 1$ if $x \in A$
and $\mu_A(x) = 0$ if $x \notin A$. Fuzzy set theory extends this concept by defining partial membership,
where $\mu_A(x) = 1$ if x fully belongs to A; $\mu_A(x) = 0$ if x does not belong to A; and
$0 < \mu_A(x) < 1$ if x partially belongs to A.
Mathematically, a fuzzy set A on a universe of discourse U is characterized by a membership function
$\mu_A$ that takes values in the interval [0, 1] and can be defined as $\mu_A : U \to [0, 1]$. Fuzzy sets
represent commonsense linguistic labels viz., suitable, moderate, unsuitable, slow, very slow, fast etc.
A given element can be a member of more than one fuzzy set at a time. A fuzzy set A in U may be
represented as a set of ordered pairs. Each pair consists of a generic element x and its grade of
membership; that is, $A = \{(x, \mu_A(x)) \mid x \in U\}$, where x is called a support value if
$\mu_A(x) > 0$ (Zadeh, 1965). The concept of a linguistic variable plays an important
role, particularly in fuzzy logic. A linguistic variable is a variable whose values are expressed in words
or sentences in natural language. For each input and output variable, fuzzy sets are created by
dividing its universe of discourse into a number of sub-regions, which are named as linguistic variables
(Zimmermann, 1996).
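The ordered-pair representation above can be sketched in a few lines of Python. This is only an illustration: the universe of discourse (water temperatures), the labels "suitable" and "too warm" and all membership grades are made-up values, not taken from any dataset.

```python
# Hypothetical universe of discourse U: water temperature in degrees Celsius.
U = [22, 26, 30, 34, 38]

# Each fuzzy set is stored as ordered pairs x -> mu_A(x), with grades in [0, 1].
# The grades below are invented for illustration only.
suitable = {22: 0.0, 26: 0.5, 30: 1.0, 34: 0.5, 38: 0.0}
too_warm = {22: 0.0, 26: 0.0, 30: 0.0, 34: 0.5, 38: 1.0}
FUZZY_SETS = {"suitable": suitable, "too warm": too_warm}


def support(fuzzy_set):
    """Return the support values: elements whose membership grade is > 0."""
    return sorted(x for x, mu in fuzzy_set.items() if mu > 0)


def memberships(x):
    """Return every linguistic label that x belongs to, with its grade."""
    return {label: fs[x] for label, fs in FUZZY_SETS.items() if fs[x] > 0}


print(support(suitable))   # elements that belong to "suitable" at least partially
print(memberships(34))     # 34 is a member of two fuzzy sets at the same time
```

Note that the element 34 belongs partially to both sets, which a crisp set could not express.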
Membership Functions
Although both classical and fuzzy subsets are defined by membership functions, the degree to
which an element belongs to a classical subset is limited to being either zero or one. This means that
the membership function may only be a step function (Figure 6.1a). On the other hand, in fuzzy logic, a
membership function (MF) is essentially a curve that defines how each point in the input space is
mapped to a membership value (or degree of membership) between 0 and 1.
Membership function for (a) crisp set and (b) fuzzy set
The membership functions are usually defined for inputs and outputs in terms of linguistic
variables. Various types of membership functions are used, such as triangular, trapezoidal, bell,
Gaussian and sigmoid functions. In designing a fuzzy inference system, membership functions are
associated with term sets that appear in the antecedent or consequent of rules. Many researchers
have used different techniques for determining membership functions, such as fuzzy clustering, neural
networks and genetic algorithms.
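As a minimal sketch of the membership function shapes named above, the following Python functions implement a crisp step function and triangular, trapezoidal and Gaussian MFs. The parameter names (a, b, c, d, mean, sigma) and the example temperature range are assumptions chosen for illustration.

```python
import math


def crisp(x, low, high):
    """Step (characteristic) function of a crisp set: 1 inside [low, high], else 0."""
    return 1.0 if low <= x <= high else 0.0


def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (assumes a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF rising on [a, b], flat on [b, c], falling on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def gaussian(x, mean, sigma):
    """Gaussian MF centred at mean with spread sigma."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


# Hypothetical linguistic label "suitable temperature" evaluated at 28 degrees.
print(crisp(28, 26, 34))            # full membership in the crisp set
print(triangular(28, 26, 30, 34))   # partial membership in the fuzzy set
```

The crisp function can only return 0 or 1, whereas the three fuzzy MFs return any degree of membership between 0 and 1.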
Fuzzy Inference System
Fuzzy Inference Systems (FIS) incorporate an expert's experience into the system design and
are composed of four blocks. A FIS comprises a fuzzifier that transforms the 'crisp' inputs into fuzzy
inputs by membership functions that represent fuzzy sets of input vectors, a knowledge base that
includes the information given by the expert in the form of linguistic fuzzy rules, an inference engine
that uses the fuzzy inputs together with the knowledge base for inference by a method of implication
and aggregation, and a defuzzifier that transforms the fuzzy results of the inference into a crisp output
using a defuzzification method.
Fuzzy Inference System
The knowledge base comprises two components: a database, which defines the membership
functions of the fuzzy sets used in the fuzzy rules, and a rule base comprising a collection of linguistic
rules that are joined by a specific operator. Based on the consequent type of fuzzy rules, there are
two common types of FIS, which vary according to differences between the specifications of the
consequent part (Equations 1 and 2). The first fuzzy system uses the inference method proposed by
Mamdani, in which the rule consequence is defined by fuzzy sets and has the following structure:
IF x is A and y is B THEN z is C (1)
The second fuzzy system, proposed by Takagi, Sugeno and Kang (TSK), contains an inference
engine in which the conclusion of a fuzzy rule comprises a constant (equation 2 a) or a weighted linear
combination of the crisp inputs (equation 2 b) rather than a fuzzy set. A fuzzy rule for the zero-order
Sugeno method is of the form
IF x is A and y is B THEN z = C (2 a)
where A and B are fuzzy sets in the antecedent and C is a constant. The first-order Sugeno model
has rules of the form
IF x is A and y is B THEN z = px+qy+r (2 b)
where A and B are fuzzy sets in the antecedent and p, q, and r are constants.
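The Sugeno rule forms (2 a) and (2 b) can be illustrated with a small Python sketch. It assumes the usual TSK convention, which the text above does not spell out, that the crisp output is the average of the rule consequents weighted by each rule's firing strength (here the minimum of the antecedent memberships). The ramp-shaped sets `low` and `high` and all constants are hypothetical.

```python
def low(v):
    """Hypothetical fuzzy set 'low' on [0, 10]: 1 at v = 0, falling to 0 at v = 10."""
    return max(0.0, min(1.0, (10 - v) / 10))


def high(v):
    """Hypothetical fuzzy set 'high' on [0, 10]: 0 at v = 0, rising to 1 at v = 10."""
    return max(0.0, min(1.0, v / 10))


# Each rule: (MF for x, MF for y, consequent function z = f(x, y)).
RULES = [
    # Zero-order rule (2 a): IF x is low and y is low THEN z = 1
    (low, low, lambda x, y: 1.0),
    # First-order rule (2 b): IF x is high and y is high THEN z = 2x + y + 1
    (high, high, lambda x, y: 2 * x + 1 * y + 1),
]


def sugeno_output(rules, x, y):
    """Weighted average of rule consequents (assumed TSK defuzzification)."""
    numerator = 0.0
    denominator = 0.0
    for mf_x, mf_y, consequent in rules:
        weight = min(mf_x(x), mf_y(y))      # firing strength of the rule
        numerator += weight * consequent(x, y)
        denominator += weight
    return numerator / denominator if denominator > 0 else None


print(sugeno_output(RULES, 5, 5))
```

Because the consequents are crisp numbers rather than fuzzy sets, no separate defuzzification of an output fuzzy set is needed in this type of system.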
Fuzzy Inference Process
The inference process for evaluating the system needs five steps.
Fuzzy Inference Process
The first step in evaluating the output of a FIS is to apply the inputs and determine the degree
to which they belong to each of the fuzzy sets via membership functions (Figure 6.5). This is required
in order to activate rules that are in terms of linguistic variables. Once membership functions are
defined, fuzzification takes a real-time input value and compares it with the stored membership
function to produce fuzzy input values. In order to perform this mapping, we can use fuzzy sets of any
shape, such as triangular, Gaussian, π-shaped, etc.
A fuzzy rule base contains a set of fuzzy rules R. For a multi-input, single-output system, it is
represented by
$$R = (R_1, R_2, \ldots, R_n)$$
where $R_i$ can be represented as
$$R_i = \text{if } (x_1 \text{ is } T_{x_1}, \ldots, \text{ and } x_m \text{ is } T_{x_m}) \text{ then } (y \text{ is } T_y)$$
In this rule, the m preconditions of $R_i$ form a fuzzy set
$(T_{x_1} \times T_{x_2} \times \cdots \times T_{x_m})$, and the consequent is a
single output. Generally, an if-then rule can be interpreted by the following three steps:
Resolve all fuzzy statements in the antecedent to a degree of membership between 0 and 1.
If the rule has more than one antecedent, the fuzzy operator is applied to obtain one number
that represents the result of applying that rule. This is called the firing strength or weight factor of
that rule. For example, consider an ith rule that has two parts in the antecedent:
$$R_i = \text{if } (x_1 \text{ is } T^{i}_{x_1} \text{ and } x_2 \text{ is } T^{i}_{x_2}) \text{ then } (y \text{ is } T^{i}_{y})$$
Then, the weight factor can be defined using either intersection operators or product operators:
$$\alpha_i = \min\left(\mu^{i}_{x_1}(x_1),\ \mu^{i}_{x_2}(x_2)\right)$$
$$\alpha_i = \mu^{i}_{x_1}(x_1)\,\mu^{i}_{x_2}(x_2)$$
The weight factor is used to shape the output fuzzy set that represents the consequent part of
the rule.
The implication method is defined as the shaping of the consequent, which is the output fuzzy
set, based on the antecedent. The input for the implication process is a single number given by the
antecedent, and the output is a fuzzy set. Minimum and product are two commonly used methods,
which are represented by the following equations respectively:
$$\mu^{i\prime}_{y}(o) = \min\left(\alpha_i,\ \mu^{i}_{y}(o)\right)$$
$$\mu^{i\prime}_{y}(o) = \alpha_i\,\mu^{i}_{y}(o)$$
where o is the variable that represents the support value of the membership function and
$\mu^{i\prime}_{y}$ denotes the shaped output fuzzy set of rule i.
Aggregation takes all truncated or modified output fuzzy sets obtained as the output of the
implication process and combines them into a single fuzzy set. The output of the aggregation process
is a single fuzzy set that represents the output variable. The aggregated output is used as the input
to the defuzzification process. Aggregation occurs only once for each output variable. Since the
aggregation method is commutative, the order in which the rules are executed is not important. The
commonly used aggregation method is the max method, which can be defined as follows:
$$\mu_{y}(o) = \max\left(\mu^{1\prime}_{y}(o),\ \ldots,\ \mu^{n\prime}_{y}(o)\right)$$
where $\mu^{i\prime}_{y}$ is the output fuzzy set of rule i after implication and n is the number of rules.
The defuzzifier maps output fuzzy sets into a crisp number. Defuzzification can be performed
by several methods such as: center of gravity, center of sums, center of the largest area, first of the
maxima, middle of the maxima, maximum criterion and height defuzzification. Of these, center of
gravity (centroid method) and height defuzzification are the methods commonly used. The centroid
defuzzification method finds the center point of the solution fuzzy region by calculating the weighted
mean of the output fuzzy region. It is the most widely used technique because the defuzzified values
tend to move smoothly around the output fuzzy region.
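The steps described above (fuzzification, firing strength, implication, aggregation and centroid defuzzification) can be put together in a short Python sketch of a Mamdani-type system. It is a simplified illustration under stated assumptions: two inputs, two rules, minimum for both the fuzzy operator and implication, maximum for aggregation, and a centroid computed on a coarse discretised output universe. The sets `low` and `high` are hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def low(v):
    """Hypothetical set 'low': peak at 0, reaching 0 at 10."""
    return tri(v, -10, 0, 10)


def high(v):
    """Hypothetical set 'high': peak at 10, reaching 0 at 0."""
    return tri(v, 0, 10, 20)


# Each rule: (MF for x1, MF for x2, MF for the output y).
RULES = [
    (low, low, low),     # IF x1 is low  and x2 is low  THEN y is low
    (high, high, high),  # IF x1 is high and x2 is high THEN y is high
]
OUT_UNIVERSE = list(range(0, 11))  # discretised output values o = 0, 1, ..., 10


def mamdani(x1, x2, rules=RULES, out_universe=OUT_UNIVERSE):
    aggregated = [0.0] * len(out_universe)
    for mf1, mf2, mf_out in rules:
        alpha = min(mf1(x1), mf2(x2))                    # firing strength (min operator)
        for k, o in enumerate(out_universe):
            implied = min(alpha, mf_out(o))              # implication (min method)
            aggregated[k] = max(aggregated[k], implied)  # aggregation (max method)
    total = sum(aggregated)
    if total == 0:
        return None                                      # no rule fired
    # Centroid: membership-weighted mean of the output universe.
    return sum(o * m for o, m in zip(out_universe, aggregated)) / total


print(mamdani(5, 5))     # symmetric inputs give the centre of the universe
print(mamdani(2.5, 5))   # inputs leaning 'low' pull the crisp output downwards
```

A finer output grid would give a closer approximation of the continuous centroid; the coarse grid is used here only to keep the arithmetic easy to follow.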
Fuzzy Logic in Bioinformatics
Fuzzy systems have been successfully applied to several areas in practice, for example for building
knowledge-based systems and fuzzy logic-based and fuzzy rule-based models. They can control and
analyze processes, diagnose and make decisions in biomedical sciences. There are many
application areas in biomedical science and bioinformatics where fuzzy logic techniques [10] can
be applied successfully. Some of the important uses of fuzzy logic are listed below:
• Increasing flexibility of protein motifs.
• Studying differences between various polynucleotides.
• Analyzing experimental expression data using fuzzy adaptive resonance theory.
• Aligning sequences based on a fuzzy dynamic programming algorithm.
• Mathematical modeling of complex traits influenced by genes with fuzzy-valued effects in pedigreed
populations.
• Assigning cluster membership values to genes by applying a fuzzy partitioning method using fuzzy
c-means and fuzzy c-hard means algorithms.
• Generating DNA sequencing using genetic fuzzy and neuro-fuzzy systems by anticipating
disturbances due to intangible parameters.
• Identifying gene clusters from microarray data.
• Predicting proteins' sub-cellular locations using the fuzzy k-nearest neighbours algorithm.
APPLICATION OF ARTIFICIAL NEURAL NETWORK
An Artificial Neural Network (ANN) is an information processing model that is able to capture
and represent complex input-output relationships. The motivation for the development of the ANN
technique came from a desire for an intelligent artificial system that could process information in
the same way as the human brain. Its novel structure is represented as multiple layers of simple
processing elements, operating in parallel to solve specific problems. ANNs resemble the human brain
in two respects: the learning process and the storing of experiential knowledge. An artificial neural
network learns and classifies problems through repeated adjustments of the connecting weights between
the elements. There are several learning strategies used in bioinformatics: Supervised Learning,
Unsupervised Learning and Reinforcement Learning.
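The idea of learning through repeated adjustment of connecting weights can be shown with the simplest possible network, a single perceptron trained by supervised learning. The sketch below is only an illustration, not a method used elsewhere in this manual: the logical AND task, the learning rate and the number of epochs are assumptions chosen so that the arithmetic stays easy to follow.

```python
# Training examples for the logical AND function: ((input1, input2), target).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, bias, inputs):
    """Step activation: output 1 if the weighted sum exceeds 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1):
    """Supervised learning: nudge weights by the error on each example."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Repeated adjustment of the connecting weights.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train_perceptron(AND_SAMPLES)
print(weights, bias)
print([predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES])
```

Real applications such as protein secondary structure prediction use networks with many layers and units, but the weight-update principle is the same.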
An ANN learns from examples and generalizes the learning beyond the examples supplied. The
methodology of modeling or estimation is somewhat comparable to statistical modeling. Neural
networks should not, however, be heralded as a substitute for statistical modeling but rather as a
complementary effort (without the restrictive assumption of a particular statistical model) or an
alternative approach to fitting non-linear data. Neural networks have been widely used in biology
since the early 1990s. Some of the important applications of ANNs are listed below:
• Predict translation initiation sites in DNA sequences and proteins.
• Explain the theory of artificial neural networks using applications in biology.
• Predict immunologically interesting peptides by combining with an evolutionary algorithm.
• Carry out pattern classification and signal processing successfully in bioinformatics.
• Perform protein sequence classification.
• Predict protein secondary structure.
GENETIC ALGORITHMS IN BIOINFORMATICS
The genetic algorithm (GA) is a method for solving both constrained and unconstrained optimization
problems that is based on natural selection, the process that drives biological evolution. GAs
are applied to certain multi-objective problems of bioinformatics, where they help to optimize
computation requirements and yield robust, fast and close approximate solutions.
GAs are executed iteratively on a population of coded solutions using three basic biologically
inspired operators: selection/reproduction, crossover and mutation. They use objective function
information and probabilistic transition rules for moving to the next iteration. GAs are generally
based on manipulating populations of bit-strings using both crossover and point-wise mutation.
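A minimal sketch of a bit-string GA with the three operators described above is given below. The objective function (counting the 1s in the string, often called OneMax), tournament selection, one-point crossover, the elitism step and all parameter values are assumptions made for illustration; real bioinformatics problems would replace the fitness function with, for example, an alignment score.

```python
import random


def fitness(bits):
    """Toy objective function: number of 1s in the bit-string."""
    return sum(bits)


def tournament(population, rng, k=3):
    """Selection: pick the fittest of k randomly chosen individuals."""
    return max(rng.sample(population, k), key=fitness)


def crossover(parent1, parent2, rng):
    """One-point crossover between two parents."""
    point = rng.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:]


def mutate(bits, rng, rate):
    """Point-wise mutation: flip each bit with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def run_ga(n_bits=20, pop_size=30, generations=40, rate=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []                      # best fitness seen in each generation
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        children = [best[:]]          # elitism: keep the best solution unchanged
        while len(children) < pop_size:
            child = crossover(tournament(population, rng),
                              tournament(population, rng), rng)
            children.append(mutate(child, rng, rate))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = run_ga()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always carried over, the best fitness can never decrease from one generation to the next, although reaching the global optimum is not guaranteed.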
Some of the important applications of GAs are listed below:
• Alignment and comparison of DNA, RNA, and protein sequences.
• Gene mapping in chromosomes.
• RNA structure prediction.
• Protein structure prediction and clustering.
• Molecular design and molecular docking.
• Gene finding and promoter identification from DNA sequences.
• Interpretation of gene expression and microarray data.
• Gene regulatory network identification.
• Construction of phylogenetic trees for studying evolutionary relationships.
• DNA structure prediction.
15. RNAseq data analysis – Genome-guided
K. Karthic
1. Introduction
The transcriptomic profile of an organism at any given time or condition gives the set of all its
transcripts and their quantities present at that specific time point or condition. The transcriptome
reveals a great deal about the functional aspects of the genome as well as the different kinds of
biomolecules present within the cell or tissue. It is also very useful for studying the genetics behind
growth, development and disease.
This tutorial describes how to analyse RNA-seq data when a reference genome is available and
the steps involved in identifying differentially expressed genes between two groups. For the
purpose of demonstration, we have chosen an experiment conducted on Arabidopsis thaliana.
1.1. Input files
1. Reference genome in fasta format
2. RNA-seq raw data for two groups in replicates in fastq format
1.2. Software requirements
1. Bowtie2
2. Tophat
3. Cufflinks (and associated cuffdiff and cuffmerge)
4. cummeRbund (an R package for visualizing the results)
2. Methodology
2.1. Fetching Raw data
(To save time, the raw data has already been downloaded and kept in the respective folders, so
steps 1 to 15 are to be skipped here.)
1. Open terminal and create new directories in your account
mkdir Athaliana
cd Athaliana
mkdir Ref_genome_raw
mkdir Transcriptome_raw
2. Go to the Assembly database in NCBI [https://www.ncbi.nlm.nih.gov/assembly/], type
TAIR10 in the search bar and click search.
3. The summary of the Arabidopsis thaliana assembly is displayed. Click on the Download
Assemblies button, select Genomic fasta in the drop-down menu and click download.
4. The genome downloads as a .tar file; copy the file to the Ref_genome_raw folder.
5. Go to terminal and, inside the Athaliana folder, type the following commands
cd Ref_genome_raw
tar xvf genome_assemblies.tar
6. A new folder is created with a name similar to ncbi-genomes-2018-08-22. Go to terminal
again and type the following commands.
cd ncbi-genomes-2018-xx-xx
gunzip GCF_000001735.4_TAIR10.1_genomic.fna.gz
ls -l
7. Now you can see the listing of files, and in that you will notice the fasta file of our genome
and its corresponding file size.
8. Go to terminal again and type the following command to copy and save our genome file
under a different, shorter name
cat GCF_000001735.4_TAIR10.1_genomic.fna > AraTha.fa
9. Now you can see our reference genome saved as AraTha.fa
10. To download RNA-seq data, go to the Sequence Read Archive (SRA) database of NCBI
[http://www.ncbi.nlm.nih.gov/sra] and type the experiment accession numbers SRR671946,
SRR671947, SRR671948 and SRR671949 one after the other in the search bar and click
search.
11. A summary of the experiment is displayed; scroll down and click on the link displayed
below the run.
12. A summary of the experiment of A. thaliana root treated with KCl, replicate data, is displayed.
Go to the Downloads tab and click on the FASTA/FASTQ link.
13. In the displayed page, type the experiment number and click show runs.
14. Select FASTQ and click download.
15. Repeat steps 12, 13 and 14 for all four experiment runs (SRR671946, SRR671947,
SRR671948 and SRR671949).
16. Copy the downloaded fastq files to the folder Transcriptome_raw.
17. Go to terminal and change the directory to Transcriptome_raw.
18. Inside the Transcriptome_raw directory, you should have fastq files in zipped format for
all the four experiment runs. Go to terminal and type the below command to unzip the
files.
for i in *.gz; do gunzip "$i"; done
19. With this we have downloaded all the raw data required for our analysis.
2.2. Data analysis
This section describes how to run bowtie2, tophat, cufflinks, cuffmerge and cuffdiff for analyzing
the transcriptome data. The software installations are not described. Please refer to the respective
manuals for the same.
2.2.1. Indexing genome using bowtie2
1. Go to terminal and change to the directory Ref_genome_raw/ncbi-genomes-2018-xx-xx and
type the following command:
bowtie2-build AraTha.fa AraTha
2. The above command will create bowtie2 index files with the .bt2 extension.
2.2.2. Running tophat
1. tophat will align the RNA-seq data to our bowtie2-indexed genome. To do so, type the
following commands in terminal
cd /home/user/Athaliana
mkdir analysis
cd analysis
tophat -o SRR671946_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671946.fastq
tophat -o SRR671947_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671947.fastq
tophat -o SRR671948_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671948.fastq
tophat -o SRR671949_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671949.fastq
2. The -o option (for example, -o SRR671949_topout) names the output folder. The genome
argument is the index base name AraTha, in the directory where bowtie2-build was run. For each
run, a folder is created with the following files: accepted_hits.bam, align_summary.txt,
deletions.bed, insertions.bed, junctions.bed, prep_reads.info, unmapped.bam and a logs folder.
3. The accepted_hits.bam is the main result file containing the mapped results in binary format.
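Since the four tophat commands differ only in the run accession, they can be generated rather than typed by hand. The Python sketch below only builds and prints the command lines; it does not run tophat. The function name `tophat_command` is hypothetical, the only tophat option used is -o (as in the commands above), and `index_base` should point at the directory where bowtie2-build wrote the .bt2 files.

```python
import shlex

RUNS = ["SRR671946", "SRR671947", "SRR671948", "SRR671949"]


def tophat_command(run, index_base, fastq_dir):
    """Return the tophat argument list for one single-end fastq file."""
    return [
        "tophat",
        "-o", f"{run}_topout",          # output folder for this run
        index_base,                     # bowtie2 index base name (without .bt2)
        f"{fastq_dir}/{run}.fastq",     # raw reads for this run
    ]


# Paths follow the layout used in this tutorial; adjust them to your own account.
index_base = "/home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha"
fastq_dir = "/home/user/Athaliana/Transcriptome_raw"

for run in RUNS:
    print(shlex.join(tophat_command(run, index_base, fastq_dir)))
```

The printed lines can be pasted into the terminal, or each argument list could be passed to `subprocess.run` once the paths have been checked.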
2.2.3. Running cufflinks
1. From the alignment files generated by tophat, we can assemble the transcripts using
cufflinks.
2. In terminal, type the following commands one after another.
cufflinks -o SRR671946_cufflinksout /home/user/Athaliana/analysis/SRR671946_topout/accepted_hits.bam
cufflinks -o SRR671947_cufflinksout /home/user/Athaliana/analysis/SRR671947_topout/accepted_hits.bam
cufflinks -o SRR671948_cufflinksout /home/user/Athaliana/analysis/SRR671948_topout/accepted_hits.bam
cufflinks -o SRR671949_cufflinksout /home/user/Athaliana/analysis/SRR671949_topout/accepted_hits.bam
3. For each run, the designated output directory will contain the following files:
genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf. The assembled
transcripts are contained in transcripts.gtf.
2.2.4. Running cuffmerge
1. cuffmerge will merge the transcripts into a comprehensive transcriptome.
2. Open a text editor, and type the paths of the transcripts as below:
/home/user/Athaliana/analysis/SRR671946_cufflinksout/transcripts.gtf
/home/user/Athaliana/analysis/SRR671947_cufflinksout/transcripts.gtf
/home/user/Athaliana/analysis/SRR671948_cufflinksout/transcripts.gtf
/home/user/Athaliana/analysis/SRR671949_cufflinksout/transcripts.gtf
and save the file as assembled_transcripts.txt
3. In terminal, type the following command in a single line
cuffmerge -s /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha.fa assembled_transcripts.txt
4. The successful run creates a merged_asm directory, which contains a logs directory and a
file containing the information of the merged transcripts called merged.gtf.
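The list of transcript files in step 2 can also be written by a short script instead of a text editor, which avoids typing mistakes in the paths. This is a minimal sketch; the function names `manifest_lines` and `write_manifest` are hypothetical, and the directory layout is the one assumed in this tutorial.

```python
from pathlib import Path

RUNS = ["SRR671946", "SRR671947", "SRR671948", "SRR671949"]


def manifest_lines(runs, analysis_dir):
    """Return one transcripts.gtf path per run, in the order given."""
    return [f"{analysis_dir}/{run}_cufflinksout/transcripts.gtf" for run in runs]


def write_manifest(runs, analysis_dir, manifest_path="assembled_transcripts.txt"):
    """Write the paths, one per line, to the file that cuffmerge will read."""
    lines = manifest_lines(runs, analysis_dir)
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines


for line in manifest_lines(RUNS, "/home/user/Athaliana/analysis"):
    print(line)
```

Calling `write_manifest(RUNS, "/home/user/Athaliana/analysis")` from inside the analysis folder would create assembled_transcripts.txt ready for the cuffmerge command.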
2.2.5. Running cuffdiff
1. cuffdiff is used to test for differential gene expression between conditions. Replicates of the
same condition are separated by commas and the two conditions are separated by a space. Go to
terminal and type the following command in a single line.
cuffdiff -o diff_result -b /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha.fa -L Root_Kcl_control,Root_KNO3_treatment -u merged_asm/merged.gtf /home/user/Athaliana/analysis/SRR671946_topout/accepted_hits.bam,/home/user/Athaliana/analysis/SRR671947_topout/accepted_hits.bam /home/user/Athaliana/analysis/SRR671948_topout/accepted_hits.bam,/home/user/Athaliana/analysis/SRR671949_topout/accepted_hits.bam
2. The successful run creates a directory diff_result in the working directory. The directory
contains a number of different files and databases, listed as follows:
bias_params.info                  cds_exp.diff
genes.fpkm_tracking               isoforms.count_tracking
promoters.diff                    splicing.diff
tss_groups.fpkm_tracking          cds.count_tracking
cds.fpkm_tracking                 gene_exp.diff
genes.read_group_tracking         isoforms.fpkm_tracking
read_groups.info                  tss_group_exp.diff
tss_groups.read_group_tracking    cds.diff
cds.read_group_tracking           genes.count_tracking
isoform_exp.diff                  isoforms.read_group_tracking
run.info                          tss_groups.count_tracking
var_model.info
3. The fpkm tracking files give FPKM values of primary transcripts (tss_groups.fpkm_tracking), genes
(genes.fpkm_tracking), coding sequences (cds.fpkm_tracking), and transcripts
(isoforms.fpkm_tracking).
4. The count tracking files give the number of fragments for each gene (genes.count_tracking),
transcript (isoforms.count_tracking), primary transcript (tss_groups.count_tracking) and
coding sequence (cds.count_tracking).
5. The read group tracking files contain information on the counts of genes, transcripts and
primary transcripts, grouped by replicates.
6. The diff files ending with 'exp.diff' contain information on the differential expression tests
performed on the genes (gene_exp.diff), primary transcripts (tss_group_exp.diff), transcripts
(isoform_exp.diff), and coding sequences (cds_exp.diff).
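The cuffdiff command line is long and easy to mistype, especially the grouping of the BAM files: replicates of one condition are joined by commas, and conditions are separated by spaces. The Python sketch below assembles the argument list from a dictionary of conditions. The function name `cuffdiff_command` is hypothetical, and the assignment of SRR671946/SRR671947 to the control and SRR671948/SRR671949 to the treatment is an assumption that should be checked against the SRA records.

```python
import shlex

ANALYSIS = "/home/user/Athaliana/analysis"

# Assumed grouping of runs into conditions; verify against the SRA metadata.
CONDITIONS = {
    "Root_Kcl_control": ["SRR671946", "SRR671947"],
    "Root_KNO3_treatment": ["SRR671948", "SRR671949"],
}


def cuffdiff_command(genome_fasta, merged_gtf, conditions, out_dir="diff_result"):
    """Build the cuffdiff argument list using only the options shown in this chapter."""
    labels = ",".join(conditions)                      # -L label1,label2
    bam_groups = [
        ",".join(f"{ANALYSIS}/{run}_topout/accepted_hits.bam" for run in runs)
        for runs in conditions.values()                # one comma-joined group per condition
    ]
    return ["cuffdiff", "-o", out_dir, "-b", genome_fasta,
            "-L", labels, "-u", merged_gtf] + bam_groups


genome = "/home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha.fa"
print(shlex.join(cuffdiff_command(genome, "merged_asm/merged.gtf", CONDITIONS)))
```

The order of the labels given to -L must match the order of the BAM groups, which the dictionary guarantees here because both are built from the same iteration order.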
3. Results
3.1. Running cummeRbund
1. cummeRbund is an R package used to visualise the results in different plots.
2. Start an R session. In R, go to your working directory and copy the diff_result folder to it.
3. Type the following commands in R
> library('cummeRbund')
> cuffdata <- readCufflinks('diff_result')
> cuffdata
4. The above commands will print a result similar to the below
CuffSet instance with:
2 samples
33318 genes
42109 isoforms
34957 TSS
32921 CDS
33318 promoters
34957 splicing
27174 relCDS
5. To obtain a density plot showing the expression levels for each sample, type the below
command:
> csDensity(genes(cuffdata))
6. To obtain a volcano plot showing the differentially expressed genes across the two samples,
type the below command:
> csVolcano(genes(cuffdata), 'Root_Kcl_control', 'Root_KNO3_treatment')
7. To obtain a scatter plot showing the differentially expressed genes across the two samples,
type the below command:
> csScatter(genes(cuffdata), 'Root_Kcl_control', 'Root_KNO3_treatment')
8. To print a table displaying the details of all the differentially expressed genes, type out the
following commands.
> gene_diff_data <- diffData(genes(cuffdata))
> sig_gene_data <- subset(gene_diff_data, (significant == 'yes'))
> nrow(sig_gene_data)
> write.table(sig_gene_data, 'diff_genes.txt', sep = '\t', row.names = F, col.names = T, quote = F)
> sig_gene_data
The last command prints out a table containing the details of all the differentially expressed
genes.
[Screenshot: sample output of sig_gene_data]
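The same filtering can be done outside R by reading gene_exp.diff directly, since it is a tab-separated text file whose last column, significant, holds yes or no. The Python sketch below is an alternative to the cummeRbund commands, not a replacement for them. The two data rows in `SAMPLE` are invented values used only to show the column layout; they are not results from this experiment.

```python
import csv
import io

HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
          "q_value", "significant"]

# Two made-up rows in the gene_exp.diff layout (illustration only).
ROWS = [
    ["XLOC_000001", "XLOC_000001", "geneA", "chr1:100-900", "Root_Kcl_control",
     "Root_KNO3_treatment", "OK", "10.0", "40.0", "2.0", "4.1", "0.0001", "0.002", "yes"],
    ["XLOC_000002", "XLOC_000002", "geneB", "chr1:2000-3500", "Root_Kcl_control",
     "Root_KNO3_treatment", "OK", "12.0", "13.0", "0.115", "0.3", "0.6", "0.8", "no"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def significant_genes(handle):
    """Return the rows of a gene_exp.diff file flagged as significant."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["significant"] == "yes"]


for row in significant_genes(io.StringIO(SAMPLE)):
    print(row["gene_id"], row["log2(fold_change)"], row["q_value"])
```

To use it on real output, replace `io.StringIO(SAMPLE)` with `open('diff_result/gene_exp.diff', newline='')`.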
In this chapter, we described how to download whole genome and transcriptome raw data from
NCBI databases. The software used in this tutorial was briefly introduced, and the same tools were
then used to demonstrate how to index a whole genome, align reads to a reference genome,
estimate transcript abundance and identify differentially expressed genes. In the end, the
results were interpreted visually using plots.
16. Application of "OMICS" research in aquaculture with special
reference to penaeids
Gopikrishna, G., Vinaya Kumar, K., Shashi Shekhar, M. and Vijayan, K.K.
Introduction
The term OMICS refers informally to fields of study in biology such as genomics, transcriptomics,
proteomics and metabolomics. Genomics is the study of genomes of organisms, transcriptomics is
the study of transcriptomes and so on. Conventional genetic improvement programmes rely mostly
on phenotypic values, which are then converted to breeding values on which the selection is
carried out. In plants as well as livestock, application of 'omics' has revealed interesting insights into
genetic and functional biology. When these are integrated within selective breeding programmes,
significant improvements in productivity have been obtained (Dekkers, 2012; Perez-de-Castro et al
2012). Omics approaches have been applied widely to elucidate the molecular basis of performance
traits (e.g. growth) and to overcome poorly understood biological impediments to efficient
production (disease, reproductive failure etc.) (Rothschild and Plastow 2008; Taylor et al 2016).
As far as livestock and plants are concerned, omics has had a transformational effect, as observed
by Agrawal and Narayan (2015), Van Emon (2015) and Taylor et al (2016). Coming to the aquaculture
sector, the application of selective breeding programmes has been at a snail's pace, and it has been
suggested that world aquaculture production could be doubled in a period of 13 years if breeding
programmes were supplying stocks for the farmed species (Gjedrem and Rye, 2016). Less than 10%
of aquaculture production is derived from improved lines (Gjedrem et al 2012). Looking into the
above facts, it is quite clear that omics resources in aquatic species need to be developed at a faster
pace so that these can be used in selective breeding programmes to hasten genetic response.
Crustaceans form a substantial aquaculture commodity globally. The global penaeid aquaculture
industry has exhibited remarkable growth and in 2015, the production stood at 4.8 million tons (FAO,
2017). Penaeids are an important aquaculture resource the world over and it is necessary to have
selective breeding programmes so that improved stocks can be generated and farmed. It is well
known that the Pacific white shrimp, owing to its ease of reproduction, has been subjected to
selective breeding, and genetically improved stocks are very much in demand. Information generated
from the genomes of shrimp can go a long way in aiding genetic improvement programmes so that
the gains are realised at a much faster rate.
Information on whole genome of aquaculture species
Several aquaculture species like Oncorhynchus mykiss (Berthelot et al 2014), Oreochromis
niloticus (Conte et al 2017), Lates calcarifer (Vij et al 2016), Ictalurus punctatus (Liu et al 2016) and
Salmo salar (Lien et al 2016) have had their whole genomes deciphered. In India, work on whole
genome sequencing in Labeo rohita (Rohu) and Clarias batrachus has been carried out at ICAR-
NBFGR, Lucknow. Shrimp are unique in that the genome size is comparatively large: ~2.2 Gbp in tiger
shrimp and ~1.8 Gbp in Pacific white shrimp (Guppy et al 2018). The highly repetitive nature of the
genome in shrimp is a major challenge to the assembly (Huang et al 2011; Baranski et al 2014). In
addition to this, penaeids have a large number of micro-chromosomes and higher levels of genomic
heterozygosity (Abdelrahman et al 2017) compared to genome assemblies derived from terrestrial
farm species. Till date, no comprehensive genome assembly is available for a penaeid shrimp (Guppy
et al 2018). Although there has been a lot of improvement in sequencing, especially through the
development of high-throughput sequencing, resolving and assembling the many repetitive regions of
the penaeid genome (~80%; Abdelrahman et al 2017) remains a major challenge.
Transcriptomics
For this, we require the sequence data of the transcriptome. The idea here is to get the mRNA
in individuals at a given point in time, thereafter obtain the cDNA and then go in for sequencing. The
primary focus of transcriptomics has been immunology, disease resistance and reproductive biology
(Guppy et al 2018). Generating transcriptome profiles is much easier than generating the whole genome.
In P. vannamei, while investigating the effect of ammonia exposure, many genes and pathways linked
to immune response (e.g. chitinase, peritrophin, thrombospondin and penaeidin) and growth (linoleic
acid metabolism) were found by Lu et al (2016a) to be suppressed. Reproductive dysfunction is a
common feature in captive broodstock of tiger shrimp. Through differential gene expression
studies of whole transcriptome data, genes related to fatty acid and steroid metabolism were found
to have altered expression patterns when comparing wild-sourced and domesticated stock (Rotllant
et al 2015).
Linkage mapping of genetic markers in shrimp
One of the genomic resources is the linkage map, which provides a wealth of genomic information
and also helps unravel the underlying genetic architecture of commercially and biologically important
traits. In penaeids, there have been substantial efforts to generate linkage maps. Linkage maps are
constructed using data from family groups viz. parents as well as progeny. Earlier, Amplified Fragment
Length Polymorphism was used for construction of linkage maps in tiger shrimp (Wilson et al 2002).
Later, Baranski et al (2014) constructed the first linkage map in tiger shrimp using SNPs. Presently,
linkage maps are available that include between 3959 and 9298 markers and cover all 44 chromosomes of
the penaeid genome (Baranski et al 2014, Yu et al 2015, Lu et al 2016b, Jones et al 2017a). Such
maps have increased the applicability of these resources in assisting genome assembly, examining
architecture of traits and also for comparative mapping (Guppy et al 2018). It is interesting to note
that construction of linkage maps has unravelled some hitherto unknown facts. Baranski et al (2014)
reported in tiger shrimp that the female-specific map was substantially shorter than the male-specific
map (2917 vs 4059 cM), whereas in P. vannamei, Perez et al (2004) and Zhang et al (2007) reported
longer maps for females than males (4134 vs. 3221 cM and 2771 vs. 2116 cM respectively), indicating
that the sex with the higher recombination rate may differ between species. There is still ambiguity in
the karyotype due to the micro-chromosomes in penaeids, as a consequence of which it appears that the
difference in map length between species exists and sex-based recombination might occur (Baranski et
al 2014). Maps available for tiger shrimp (Baranski et al 2014) and Pacific white shrimp (Yu et al 2015)
have average inter-marker distances of 0.9 and 0.7 cM respectively across different map iterations.
This is definitely a significant achievement; however, 1 cM equates to an estimated physical genome
distance of ~400-600 Kb for penaeids (P. monodon 395 Kb/cM (Baranski et al 2014), P. vannamei
598.89 Kb/cM (Yu et al 2015), P. japonicus 657.89 Kb/cM (Lu et al 2016b)) and presents a significant
challenge when we look to characterise potential useful genes or genomic regions underlying findings
of trait-association studies (Guppy et al 2018). Future work is required to obtain denser maps that
decrease the interval between markers. This could be accomplished by genotyping more families and
also more individuals per family, which would provide additional observations of informative meiotic
recombination events or integrate orphaned (unplaced) markers into existing maps (Fierst 2015).
Utilising enhanced cost-effective genotyping strategies (e.g. the genotype by sequencing method) could
result in genotyping of more families and also more individuals per family, consequent to which
fine-grain marker placement could be achieved (Guppy et al 2018).
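To see why sub-centimorgan marker spacing is still coarse in physical terms, the average inter-marker distances and the Kb/cM estimates quoted above can be multiplied together. The short Python sketch below does this arithmetic for the two species for which both figures are given; it assumes a uniform recombination rate along the genome, which is a simplification.

```python
# Figures quoted in the text above.
KB_PER_CM = {"P. monodon": 395.0, "P. vannamei": 598.89}
AVG_SPACING_CM = {"P. monodon": 0.9, "P. vannamei": 0.7}


def physical_spacing_kb(species):
    """Approximate physical distance between adjacent markers, in Kb."""
    return AVG_SPACING_CM[species] * KB_PER_CM[species]


for species in KB_PER_CM:
    print(f"{species}: about {physical_spacing_kb(species):.0f} Kb between markers")
```

Under this assumption, adjacent markers are roughly 350 to 420 Kb apart on average, a stretch of sequence that can contain many genes.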
Developing and applying polymorphic markers
There have been considerable efforts in the past to develop a wide range of traditional
genomic markers (e.g. allozymes, RFLP, AFLP and microsatellites) in several penaeid species. Most
of these markers have been used for assessing wild populations and managing family lines. These
markers have caveats, which have been reviewed by Benzie (1998, 2009). Due to the high cost of
developing them and the failure to unravel the complexity of production traits, they have not found
favour in the penaeid industry (Guppy et al 2018). Today, the traditional markers are being replaced
by powerful and cost-effective markers like Single Nucleotide Polymorphisms (SNPs). SNPs are
very abundant in the genome and can help substantially in genome studies. About 9 million SNPs in
the Bos taurus genome (Xu et al 2017), 7 million SNPs in chickens (Rubin et al 2010), 9.7 million SNPs in
Atlantic salmon (Yanez et al 2016), 8.6 million SNPs in channel catfish (Zeng et al 2017) and 5.6 million
SNPs in Lates calcarifer (Vij et al 2016) have been identified. SNP discovery has further led to the
manufacture of SNP arrays in several species like cattle and sheep, crops like wheat, and aquaculture
species like catfish and Atlantic salmon. In P. monodon, at ICAR-CIBA, Baranski et al (2014) developed
a chip containing 6000 SNPs which were majorly identified using the transcriptomic approach. It would
be pertinent to point out that till date, only two studies have produced validated SNP genotyping
arrays: Baranski et al (2014) in tiger shrimp (6000 SNPs) and Jones et al (2017b) in Pacific white
shrimp (6400 SNPs). The latter has been sold commercially as the Infinium ShrimpLD-24 v1.0
BeadChip. An interesting feature of these arrays is that they are based on type-I SNPs (genic
rather than inter-genic) and many of these SNPs have been annotated with putative genes (62% and
47% respectively), thereby providing a strong foundation for further trait mapping studies (Robinson
et al 2014 and Khatkar et al 2017b). An additional feature that needs to be factored in is the cost of
the SNP arrays. The approximate cost of genotyping per individual has drastically fallen to about Rs.
5000/-. This needs to be further reduced to make it cost-effective. Selection of a genotyping method
for commercial applications would hinge on the time required for sample processing, genotyping
and data analysis, as the window between pre-selection of candidate broodstock at harvest and final
breeding selection and spawning is quite short (less than 3-6 months) (Guppy et al 2018).
Genotype by sequencing
A unique advantage of the genotype by sequencing (GBS) method is the ability to discover and
genotype markers (de novo marker discovery) without requiring reference to existing genomic
information like genomic sequences and transcriptomes. In penaeids, a number of GBS approaches
have been utilised, with 25140 and 23049 markers obtained in Pacific white shrimp (Yu et al 2015,
Wang et al 2017) and 28981 markers obtained in Kuruma shrimp (Lu et al 2016b). Most of these
markers have been utilised to generate linkage maps, undertake Quantitative Trait Loci (QTL) mapping
(Yu et al 2015, Lu et al 2016b) and estimate genomic prediction accuracy (Wang et al 2017), but they
have yet to be utilised in the industry for genotyping.
Markers for breeding population management
Crustaceans molt frequently, and this places them at a disadvantage for
identification. However, tagging with visible implant elastomer tags (for family identification) and
eye-ring tags (for individual identification) has been found to address this issue to a certain extent.
The number of individuals available per family is rather large in shrimp and they need to be reared in
a common environment so that there is no confounding of environmental effects. Each family needs
to be reared till tagging and this poses a significant challenge on infrastructure. Tracking of pedigree
is of paramount importance to keep inbreeding low. Use of genomic markers could enhance the
identification of individual shrimp, but here again the cost of genotyping (high density solid state
arrays), lack of genotyping power (microsatellites) or a combination of both these factors is a major
stumbling block (Vandeputte and Haffray, 2014).
Exploiting genetic variation underlying phenotypes
It is important to comprehend the relationship between genetic variation and the phenotypes
of economically important traits. The information so obtained could prove useful for integrating
genomics research into food production industries (Abdelrahman et al 2017). Through QTL mapping
and Genome-Wide Association Studies (GWAS), it may be possible to identify the number, location and
effect size of genetic elements (i.e. genes, loci and regions) that are linked to the observed phenotypic
variation of a trait (Mackay et al 2009). For this to be applied at the field level, we need to identify
markers that are highly predictive of a superior or inferior phenotype in order to improve the selection
of elite individuals for breeding programmes (Thorgaard et al 2006). Genomic breeding values have
recently been utilised in agricultural breeding programmes in an effort to improve simple
and complex traits (Meuwissen et al 2001, 2016). Such a procedure could also be applied to shrimp
breeding programmes to elicit substantial genetic response.
QTL mapping
A Quantitative Trait Locus (QTL) is a region in the genome containing one or several genes that
affect variation in a quantitative trait, and which is identified by its linkage to polymorphic marker loci.
Mapping of QTLs involves two components: detection and localisation. Once the QTLs are detected,
they need to be localised and the gene(s) unravelled. QTLs can be localised through their genetic
linkage to visible marker loci with genotypes that we can readily classify. If a QTL is linked to
a marker locus, then individuals with different marker locus genotypes will exhibit different mean
values of the quantitative trait. QTLs can be mapped in families or in segregating progeny of crosses
between genetically divergent strains (linkage mapping) or in unrelated individuals from the same
population (association mapping). Later, these QTLs need to be validated in a population of individuals.
If the validation yields encouraging results, the QTLs can be utilised to improve the concerned trait
in a breeding programme. Two studies in aquaculture species related to QTL mapping have been
reported. One is by Li et al (2006) for growth in Kuruma shrimp P. japonicus and another by Robinson
et al (2014) for resistance to White Spot Syndrome Virus in tiger shrimp. In the former case, AFLP
markers were used, whereas in the case of tiger shrimp SNP markers were used.
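The detection step described above (different marker genotypes showing different trait means) can be sketched as a single-marker analysis. The Python example below groups trait values by genotype and computes a one-way ANOVA F statistic by hand. It is a minimal illustration, not the method used in the cited studies: the function names are hypothetical and the body weights are invented toy values, not real data.

```python
from statistics import mean


def group_by_genotype(records):
    """Collect trait values under each marker genotype."""
    groups = {}
    for genotype, value in records:
        groups.setdefault(genotype, []).append(value)
    return groups


def anova_f(groups):
    """One-way ANOVA F: between-genotype variance over within-genotype variance."""
    values = [x for vals in groups.values() for x in vals]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical body weights (g) at one SNP marker; invented for illustration only.
toy = [("AA", 20.0), ("AA", 22.0), ("AB", 24.0),
       ("AB", 26.0), ("BB", 28.0), ("BB", 30.0)]

groups = group_by_genotype(toy)
means = {g: mean(v) for g, v in groups.items()}
print("Mean trait value per genotype:", means)
print("F statistic:", anova_f(groups))
```

A large F value at a marker suggests that a QTL may be linked to it; in practice the test is repeated across all markers, with correction for multiple testing and for family structure.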
Genome-Wide Association Studies
These are studies aimed at associating particular QTLs with a trait. Till date there have been
only two studies reported in aquaculture species. The first is in tiger shrimp, the work for which
was carried out at ICAR-CIBA and NOFIMA, Norway. Seven families of tiger shrimp were exposed to
the White Spot Syndrome Virus. The number of shrimp genotyped was 1024. About 9 QTLs in tiger
shrimp were found to be significantly associated with hours of survival. In addition, 3 SNPs were found
to be associated with sex in tiger shrimp (Robinson et al 2014). The second study was for growth in
P. vannamei. The authors could not find any significant association of markers with growth. Earlier, Yu
et al (2015), while working on P. vannamei, had reported a large QTL for growth explaining 17.9% of
the phenotypic variation.
Conclusion
Omics research in aquaculture has generated a lot of information during the past three
decades. Compared to plant and livestock breeding programmes, aquatic species have a long way
to go. The information flowing from various resources like linkage maps, physical maps, annotated
transcriptomes, characterised proteome data and genome sequences needs to be incorporated onto
a single platform for use by other scientists working in this field. Wide publicity needs to be given
to high-density linkage maps to comprehend genome architecture so as to help in future genetic
improvement programmes. In-depth studies on economically important traits in aquatic species are
also required urgently so as to help farmers reap profits from the culture of fish/shrimp.
References cited
Abdelrahman, H., ElHady, M., Alcivar-Warren, A., Allen, S., Al-Tobasei, R., Bao, L., et al. (2017).
Aquaculture genomics, genetics and breeding in the United States: current status, challenges,
and priorities for future research. BMC Genomics 18:191. doi: 10.1186/s12864-017-3557-1
Agrawal, R., and Narayan, J. (2015). Unravelling the impact of bioinformatics and omics in agriculture.
Int. J. Plant Biol. Res. 3:1039.
Baranski, M., Gopikrishna, G., Robinson, N. A., Katneni, V. K., Shekhar, M. S., Shanmugakarthik, J.,
et al. (2014). The development of a high density linkage map for black tiger shrimp (Penaeus
monodon) based on cSNPs. PLoS ONE 9:e85413. doi: 10.1371/journal.pone.0085413
Benzie, J. A. (1998). Penaeid genetics and biotechnology. Aquaculture 164, 23–47.
doi: 10.1016/S0044-8486(98)00175-6
Benzie, J. A. (2009). Use and exchange of genetic resources of penaeid shrimps for food and aquaculture.
Rev. Aquacult. 1, 232–250. doi: 10.1111/j.1753-5131.2009.01018.x
Berthelot, C., Brunet, F., Chalopin, D., Juanchich, A., Bernard, M., Noël, B., et al. (2014). The rainbow
trout genome provides novel insights into evolution after whole-genome duplication in
vertebrates. Nat. Commun. 5:3657. doi: 10.1038/ncomms4657
Conte, M. A., Gammerdinger, W. J., Bartie, K. L., Penman, D. J., and Kocher, T. D. (2017).
A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of
two sex determination regions. BMC Genomics 18:341. doi: 10.1186/s12864-017-3723-5
Dekkers, J. C. (2004). Commercial application of marker- and gene-assisted selection in livestock:
strategies and lessons. J. Anim. Sci. 82(13 Suppl.), E313–E328. doi: 10.2527/2004.8213_supplE313x
FAO (2017). FishStat Plus - Universal Software for Fishery Statistical Time Series.
FAO Fisheries and Aquaculture Department. Rome.
Fierst, J. L. (2015). Using linkage maps to correct and scaffold de novo genome assemblies: methods,
challenges, and computational tools. Front. Genet. 6:220. doi: 10.3389/fgene.2015.00220
Gjedrem, T., and Rye, M. (2016). Selection response in fish and shellfish: a review.
Rev. Aquacult. 10, 168–179. doi: 10.1111/raq.12154
Gjedrem, T., Robinson, N., and Rye, M. (2012). The importance of selective breeding in aquaculture to
meet future demands for animal protein: a review.
Aquaculture 350, 117–129. doi: 10.1016/j.aquaculture.2012.04.008
Guppy, J. L., Jones, D. B., Jerry, D. R., Wade, N. M., Raadsma, H. W., Huerlimann, R., and Zenger, K. R. (2018).
The State of “Omics” Research for farmed penaeids: Advances in research and impediments to
industry utilisation.
Front. Genet. 9:282. doi: 10.3389/fgene.2018.00282
Huang, S.-W., Lin, Y.-Y., You, E.-M., Liu, T.-T., Shu, H.-Y., Wu, K.-M., et al. (2011). Fosmid library end
sequencing reveals a rarely known genome structure of marine shrimp Penaeus monodon. BMC
Genomics 12:242. doi: 10.1186/1471-2164-12-242
Jones, D. B., Jerry, D. R., Khatkar, M. S., Raadsma, H. W., Steen, H. V. D., Prochaska, J., et al. (2017a).
A comparative integrated gene-based linkage and locus ordering by linkage disequilibrium map
for the Pacific white shrimp, Litopenaeus vannamei. Sci. Rep. 7:10360.
doi: 10.1038/s41598-017-10515-7
Jones, D. B., Zenger, K. R., Khatkar, M. S., Raadsma, H. W., Steen, H. A. M. V. D., Prochaska, J., et
al. (2017b). “Development of a low-density commercial genotyping array for the white legged
shrimp, Litopenaeus vannamei,” in AAABG, Edited by Genetics AAoABa (Townsville, QLD).
Khatkar, M., Coman, G., Thomson, P., and Raadsma, H. (2017a). “Comparison of different breeding
design options for long term genetic gain and diversity in aquaculture species,” in Proc Assoc
Advmt Anim Breed Genet (Townsville, QLD), 449–452.
Li, Y., Dierens, L., Byrne, K., Miggiano, E., Lehnert, S., Preston, N., et al. (2006). QTL detection of
production traits for the Kuruma prawn Penaeus japonicus (Bate) using AFLP markers. Aquaculture
258, 198–210. doi: 10.1016/j.aquaculture.2006.04.027
Lien, S., Koop, B. F., Sandve, S. R., Miller, J. R., Kent, M. P., Nome, T., et al. (2016).
The Atlantic salmon genome provides insights into rediploidization. Nature 533, 500–505.
doi: 10.1038/nature17164
Liu, Z., Liu, S., Yao, J., Bao, L., Zhang, J., Li, Y., et al. (2016). The channel catfish genome sequence
provides insights into the evolution of scale formation in teleosts. Nat. Commun. 7:11757.
doi: 10.1038/ncomms11757
Lu, X., Kong, J., Luan, S., Dai, P., Meng, X., Cao, B., et al. (2016a).
Transcriptome analysis of the hepatopancreas in the Pacific White Shrimp (Litopenaeus vannamei)
under acute ammonia stress. PLoS ONE 11:e0164396.
doi: 10.1371/journal.pone.0164396
Lu, X., Luan, S., Hu, L. Y., Mao, Y., Tao, Y., Zhong, S. P., et al. (2016b). High-resolution
genetic linkage mapping, high-temperature tolerance and growth-related quantitative trait locus (QTL)
identification in Marsupenaeus japonicus. Mol. Genet. Genomics 291, 1391–1405.
doi: 10.1007/s00438-016-1192-1
Mackay, T. F., Stone, E. A., and Ayroles, J. F. (2009). The genetics of quantitative traits: challenges and
prospects. Nat. Rev. Genet. 10, 565–577. doi: 10.1038/nrg2612
Meuwissen, T., Hayes, B., and Goddard, M. (2001). Prediction of total genetic value using genome-wide
dense marker maps. Genetics 157, 1819.
Meuwissen, T., Hayes, B., and Goddard, M. (2016). Genomic selection: a paradigm shift in animal
breeding. Anim. Front. 6, 6–14. doi: 10.2527/af.2016-0002
Pérez, F., Erazo, C., Zhinaula, M., Volckaert, F., and Calderón, J. (2004).
A sex-specific linkage map of the white shrimp Penaeus (Litopenaeus) vannamei based on AFLP
markers. Aquaculture 242, 105–118. doi: 10.1016/j.aquaculture.2004.09.002
Pérez-de-Castro, A. M., Vilanova, S., Cañizares, J., Pascual, L., Blanca, J. M., Diez, M. J., et al. (2012).
Application of genomic tools in plant breeding. Curr. Genomics 13, 179–195.
doi: 10.2174/138920212800543084
Robinson, N. A., Gopikrishna, G., Baranski, M., Katneni, V. K., Shekhar, M. S., Shanmugakarthik, J., et al.
(2014). QTL for white spot syndrome virus resistance and the sex-determining locus in the Indian
black tiger shrimp (Penaeus monodon).
BMC Genomics 15:731. doi: 10.1186/1471-2164-15-731
Rothschild, M. F., and Plastow, G. S. (2008). Impact of genomics on animal agriculture and opportunities
for animal health. Trends Biotechnol. 26, 21–25.
doi: 10.1016/j.tibtech.2007.10.001
Rotllant, G., Wade, N. M., Arnold, S. J., Coman, G. J., Preston, N. P., and Glencross, B. D. (2015).
Identification of genes involved in reproduction and lipid pathway metabolism in wild and
domesticated shrimps. Mar. Genomics 22, 55–61.
doi: 10.1016/j.margen.2015.04.001
Rubin, C.-J., Zody, M. C., Eriksson, J., Meadows, J. R. S., Sherwood, E., Webster, M. T., et al. (2010).
Whole-genome resequencing reveals loci under selection during chicken domestication. Nature
464:587. doi: 10.1038/nature08832
Taylor, J. F., Taylor, K. H., and Decker, J. E. (2016). Holsteins are the genomic selection poster cows.
Proc. Natl. Acad. Sci. U.S.A. 113, 7690–7692. doi: 10.1073/pnas.1608144113
Thorgaard, G. H., Nichols, K. M., and Phillips, R. B. (2006). Comparative gene and QTL mapping in
aquaculture species. Israeli J. Aquacult. Bamidgeh 58, 4.
Van Emon, J. M. (2015). The omics revolution in agricultural research.
J. Agric. Food Chem. 64, 36–44. doi: 10.1021/acs.jafc.5b04515
Vandeputte, M., and Haffray, P. (2014). Parentage assignment with genomic markers: a major advance
for understanding and exploiting genetic variation of quantitative traits in farmed aquatic animals.
Front. Genet. 5:432. doi: 10.3389/fgene.2014.0043
Vij, S., Kuhl, H., Kuznetsova, I. S., Komissarov, A., Yurchenko, A. A., Van Heusden, P., et al. (2016).
Chromosomal-level assembly of the Asian seabass genome using long sequence reads and
multi-layered scaffolding. PLoS Genet. 12:e1005954.
doi: 10.1371/journal.pgen.1005954
Wang, Q., Yu, Y., Yuan, J., Zhang, X., Huang, H., Li, F., et al. (2017). Effects of marker density and
population structure on the genomic prediction accuracy for growth trait in Pacific white shrimp
Litopenaeus vannamei.
BMC Genet. 18:45. doi: 10.1186/s12863-017-0507-5
Wilson, K., Li, Y. T., Whan, V., Lehnert, S., Byrne, K., Moore, S., et al. (2002). Genetic mapping of the
black tiger shrimp Penaeus monodon with amplified fragment length polymorphism. Aquaculture
204, 297–309. doi: 10.1016/S0044-8486(01)00842-0
Xu, C., Li, E., Liu, Y., Wang, X., Qin, J. G., and Chen, L. (2017). Comparative proteome analysis of the
hepatopancreas from the Pacific white shrimp Litopenaeus vannamei under long-term low
salinity stress. J. Proteomics 162, 1–10. doi: 10.1016/j.jprot.2017.04.013
Yáñez, J. M., Naswa, S., López, M., Bassini, L., Correa, K., Gilbey, J., et al. (2016). Genomewide single
nucleotide polymorphism discovery in Atlantic salmon (Salmo salar): validation in wild and
farmed American and European populations.
Mol. Ecol. Resour. 16, 1002–1011. doi: 10.1111/1755-0998.12503
Yu, Y., Zhang, X., Yuan, J., Li, F., Chen, X., Zhao, Y., et al. (2015). Genome survey and high-density
genetic map construction provide genomic and genetic resources for the Pacific White Shrimp
Litopenaeus vannamei. Sci. Rep. 5:15612. doi: 10.1038/srep15612
Zeng, Q., Fu, Q., Li, Y., Waldbieser, G., Bosworth, B., Liu, S., et al. (2017).
Development of a 690K SNP array in catfish and its application for genetic mapping and validation of
the reference genome sequence. Sci. Rep. 7:40347. doi: 10.1038/srep40347
Zhang, L., Yang, C., Zhang, Y., Li, L., Zhang, X., Zhang, Q., et al. (2007). A genetic linkage map of Pacific
white shrimp (Litopenaeus vannamei): sex-linked microsatellite markers and high recombination
rates. Genetica 131, 37–49. doi: 10.1007/s10709-006-9111-8
17. Shrimp Genomics : Current status and Challenges
M.S. Shekhar, K. Vinaya Kumar and K.K. Vijayan
Shrimp genomics has made considerable research progress over the last few decades.
Recent advances in “omics”, in particular the advancement in NGS techniques, have
provided the aquaculture industry with opportunities as well as challenges in understanding
the complexity of the whole genome of shrimp. However, the currently available molecular biology
resources and bioinformatics techniques require further development to meet these challenges
and provide the most informative results in deciphering the shrimp genome.
1. Introduction
The consumption of food fishes globally is projected to increase tremendously. However, with
exploitation and the decrease in wild catch fisheries worldwide, much importance is now being given to
increasing production from aquaculture. In aquaculture and fisheries management, studies relating
to population structure, genetic diversity, environmental adaptation and molecular response to biotic
and abiotic stress are very important for effective genetic improvement breeding programmes.
“Biotechnology” integrated with “Omics” is a term that has now come to encompass many of the
exciting new developments in aquaculture during recent years. Hence, for sustainable aquaculture,
genetic improvement for desired traits etc. through biotechnological means has gained importance in
recent years. Aquaculture biotechnology deals with the use of knowledge and techniques in the field
of molecular, cellular and genetic processes to develop improved aquaculture products and varieties.
Therefore, the broad term ‘omics’, which includes the methods and techniques required for analyzing
all different types of molecules and the pathways associated with them, is used in aquaculture as well.
This encompasses the four major “omics”, namely transcriptomics, proteomics, metabolomics and
epigenomics. Viral infections are one of the major reasons for the huge economic losses in shrimp
farming. The control of viral diseases in shrimp remains a serious challenge for the shrimp aquaculture
industry. White spot syndrome virus (WSSV) is a major pathogen which is geographically widespread
and continues to be a serious threat affecting shrimp farms the world over. In the absence of a true
adaptive immune response system in invertebrates, shrimps respond by non-specific innate immune
mechanisms. Shrimp genome annotation and transcriptome generation as “omics” tools would help
to unravel the molecular mechanisms of the immune defence network that operate in shrimp
in response to WSSV infection, in addition to supporting the development of genetically improved
varieties of shrimp with desirable traits through genetic improvement breeding programmes.
2. Transcriptomics
Next-generation high-throughput RNA sequencing technology (RNA-seq) is a modern, high-throughput
method which is not restricted by the unavailability of a genome reference sequence and
has tremendous potential for identification, profiling and quantification of RNA transcripts with increased
sensitivity. The transcriptome is the complete set of transcripts in a cell, together with their quantity,
at a specific developmental stage or physiological condition. The transcriptome helps in identifying the
functional elements of the genome, revealing the molecular constituents of cells and tissues in response to
environmental stress with an accurate quantification of gene expression levels. Because of these
several advantages over other expression techniques, this approach is now widely used in
decoding the functional role of genes and cell responses against environmental stress. Significant
progress has recently been achieved in understanding the transcript expression of marine crustaceans
such as Litopenaeus vannamei, Fenneropenaeus chinensis, Eriocheir sinensis and Macrobrachium
nipponense in response to biotic and abiotic stress factors. Transcriptome data aid in the identification of
novel genes in the absence of a shrimp genome database, as shown in Table 1. Next-generation sequencing
technologies have therefore influenced the analysis of gene regulation.
Table 1. Transcriptomes generated from shrimp species
Species | Tissues | Transcriptome generated for
L. vannamei | Hemocytes | WSSV
L. vannamei | Hepatopancreas | WSSV
L. vannamei | Hepatopancreas and muscle | WSSV and growth
L. vannamei | Hemolymph and hemocytes | TSV
L. vannamei | Hepatopancreas | TSV
L. vannamei | Testis and ovaries | Gonadal development
L. vannamei | Hepatopancreas | Acute ammonia stress
L. vannamei | Hepatopancreas | Osmoregulatory stress
L. vannamei | Gills | Osmoregulatory stress
L. vannamei | Hepatopancreas and hemocytes | Nitrite
L. vannamei | Whole larvae | Embryo development
L. vannamei | Embryo, nauplius, zoea, mysis, post larvae | Larval development
L. vannamei | Whole shrimp | Molting
L. vannamei | Muscle | Feed efficiency
L. vannamei | Heart, muscle, hepatopancreas and eyestalk | Growth
P. monodon | Hepatopancreas and ovary | Reproduction and development
P. monodon | Eyestalk, stomach, female gonad, male gonad, gill, haemolymph, hepatopancreas, lymphoid organ, tail muscle, embryos, nauplii, zoea, and mysis, whole larvae | Gene discovery
M. japonicus | Fertilized eggs, embryos and vegetal halves | Embryo development
F. chinensis | Cephalothorax | WSSV
F. merguiensis | Cuticle, muscle, androgenic gland, hepatopancreas, stomach, nervous system, eyestalk, male gonads, female gonads | Color
F. merguiensis | Hepatopancreas, stomach, eye stalk, nerve cord, male gonad, female gonad, androgenic gland region, muscle and cuticle | Reproduction and development
3. Complexity of shrimp genome
Shrimp genomes are large, with highly repetitive sequences, which poses significant challenges in
deciphering the whole genome and in other genetic studies. In our study, genome size estimated
by flow cytometry showed shrimp genomes to be very large. The genome sizes of the
four major species of the genus Penaeus (Penaeus monodon, Penaeus indicus, Penaeus vannamei and
Penaeus japonicus) were found to be in a similar range. The genome size of female shrimps ranged from
2.91 ± 0.03 pg (P. monodon) to 2.14 ± 0.02 pg (P. japonicus). In male shrimps, the genome size ranged
from 2.86 ± 0.06 pg (P. monodon) to 2.19 ± 0.02 pg (P. indicus). A significant difference in genome size
was observed between male and female shrimp of all species except P. monodon. The highest
relative difference between the sexes, 12.78%, was observed in P. indicus.
The interspecific relative difference in genome size was highest between the male shrimps
of P. monodon and P. indicus (30.59%) and between the female shrimps of P. monodon and P. japonicus (35.98%).
This study was undertaken to estimate genome size in shrimps, which will help guide research
aimed at generating sequence data for the whole genome of these species in future. The
penaeid genome (80% repetitive) remains a challenge even today for sequencing and assembly.
Short-read second-generation sequencing methods, for example Illumina sequencing technology,
are preferred for non-complex genomes; they work by identifying and overlaying sequences and building the
resulting contigs and scaffolds. However, when short-read sequencing methods are applied to highly
repetitive regions within the genome, it becomes difficult to build contiguous sequences.
Shrimp genomes also have high levels of heterozygosity. Previous short-read assemblies in shrimps
have been highly fragmented, with a high number of scaffolds. There are reports that polysaccharide
contamination and high DNase activity in shrimp can interfere with long-read sequencing
methodologies; these are major challenges to overcome, and methods to isolate intact, pure shrimp
genomic DNA need more standardization.
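The genome-size figures quoted in this section can be checked with a few lines of Python. The sketch below takes the relative difference as (larger - smaller) / smaller x 100; this definition is an inference, chosen because it reproduces the 35.98% and 30.59% figures from the reported means. The conversion of 1 pg of DNA to about 978 Mb is a standard value that does not come from this chapter, and the function names are hypothetical.

```python
MB_PER_PG = 978.0  # standard conversion, 1 pg of DNA ~ 978 Mb (not from this chapter)


def relative_difference(a, b):
    """Percentage by which the larger value exceeds the smaller one."""
    small, large = sorted((a, b))
    return (large - small) / small * 100.0


def pg_to_mb(picograms):
    """Convert a DNA mass in picograms to megabases."""
    return picograms * MB_PER_PG


# Females: P. monodon (2.91 pg) vs P. japonicus (2.14 pg)
print(round(relative_difference(2.91, 2.14), 2))
# Males: P. monodon (2.86 pg) vs P. indicus (2.19 pg)
print(round(relative_difference(2.86, 2.19), 2))
# Female P. monodon value expressed in Mb
print(round(pg_to_mb(2.91)))
```

Whether the reported picogram values are haploid or diploid amounts is not stated in the text, so the Mb figure should be read only as an order-of-magnitude guide.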
4. NGS platform for shrimp genome sequencing
Several NGS platforms are currently in use, such as Illumina MiSeq, Ion Torrent PGM, PacBio RS,
Illumina GAIIx, Illumina HiSeq 2000, etc. The key features which determine the optimal platform
are speed of sequencing and a low error rate. Sequencing has been
dominated by Illumina. However, this technology is not adequate for dealing with complex
shrimp genomes, which require longer read lengths. One recent platform which
yields longer read lengths is PacBio. PacBio is based on single molecule real time (SMRT) sequencing.
DNA polymerase molecules, each bound to a DNA template, are present at the base of 50 nm-wide wells
called zero-mode waveguides (ZMWs). Second-strand DNA synthesis in the presence of γ-phosphate
fluorescently labeled nucleotides is carried out by each polymerase. With each base incorporation, a
distinctive pulse of fluorescence is detected in real time. The PacBio platform, by virtue of its long read
lengths, has potential application in de novo sequencing of the shrimp genome. Mean read
lengths of approximately 1500 bp were generated using the PacBio RS system with the first generation of chemistry
(C1 chemistry); the advanced PacBio RS II system with the C4 chemistry yields average read lengths
over 10 kb, with an N50 of more than 20 kb and maximum read lengths over 60 kb. The latest PacBio
Sequel System is an advanced version with higher throughput, more scalability, a reduced footprint
and lower sequencing project costs compared to the PacBio® RS II System. The key advance of
the Sequel System is the capacity of its redesigned SMRT Cells, which contain one million zero-mode
waveguides (ZMWs), compared to 150,000 ZMWs in the PacBio RS II. Active individual polymerases
are immobilized within the ZMWs, providing windows to observe and record DNA sequencing in real
time. In future, successful assemblies of the shrimp genome will depend upon a “hybrid assembly”
approach, utilizing short-read sequencing to correct the high error rate observed in long-read PacBio
sequencing.
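Read-length statistics such as the mean and the N50 quoted above are easy to confuse, so a small Python sketch may help. N50 is the length such that reads of that length or longer contain at least half of all sequenced bases. The function name n50 and the read lengths below are hypothetical and chosen only for illustration.

```python
from statistics import mean


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no read lengths given")


# Hypothetical long-read lengths in bp; invented for illustration only.
reads = [60000, 25000, 20000, 10000, 5000, 3000, 2000]
print("mean read length:", round(mean(reads)))
print("N50:", n50(reads))
```

Because N50 is weighted by bases rather than by read count, a few very long reads raise it well above the mean, which is why the two figures are reported separately for long-read platforms.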
5. The Challenges
In comparison to other livestock industries, very few improved lines are used in aquaculture
production (Gjedrem et al., 2012). Aquaculture production has also not fully utilized the
existing natural genetic potential and resources for increased productivity. In the case of shrimp, there
have been numerous molecular studies on the expression and function of selected genes involved
in metabolic pathways; however, little attention has been given to the metabolic differences which exist
between shrimp or between their developmental stages. The differences among shrimp resulting from
particular metabolic adaptations to varied environmental conditions need to be studied in detail.
These types of studies have direct relevance to better management practices and the formulation of
optimal diets for the domestication of shrimp in aquaculture. In the immediate future, the main
challenge is to integrate the available genomic data with physiological studies on shrimp. The
outcomes will elucidate species-specific adaptations to environmental conditions, and have the
potential to inform and stimulate research in many biological disciplines. For any genomic study
or analysis, a reference genome is essential; however, except for the branchiopod Daphnia pulex, no
information on complete genome assembly is available from other crustaceans. The genome of
D. pulex is comparatively small, about 200 Mb, containing 30,970 genes and only 9.4%
repetitive sequences; in shrimp, however, the genomes are too big and complex for sequencing and
assembly. Bioinformatics, data mining and sequence annotation methods need to be defined and developed
for complex genomes, which would aid in complete genome assembly.
6. Future potential
Introducing improved bioinformatics approaches for error-correction of long sequencing
reads and the use of optical mapping would help in completing the large genome assemblies of shrimps
and other aquatic species. There is also an urgent need to construct linkage and physical maps, and
to develop databases for annotated transcriptome, proteomics and metabolomics data, which would help
in generating highly informative shrimp “omics” to understand genome structure, genome evolution,
phylogeny and natural selection of aquaculture species. Functional genomics with an annotated
genome, and validation of candidate genes by experimental CRISPR or RNAi knockdown studies,
would be significant progress towards the identification of target genes for traits of commercial importance
such as growth and disease resistance. Understanding the genome and genetic makeup of shrimp
would help in deciphering complex traits, which would eventually accelerate breeding programmes
in shrimp. A high-density linkage map is essential for shrimp genomics and genetic studies. Creation
of a high-density linkage map would help in mapping QTLs for traits of interest such as body
weight, body length, disease resistance and other traits which have high commercial significance in
aquaculture.
18. Application of Biotechnology in animal reproduction
Sherly Tomy
Reproductive efficiency is a major factor determining the economic success of any livestock
enterprise. The majority of animal breeding programs have aimed at enhancing the genetic worth
of animals using conventional selection methods primarily based on phenotype. Revolutionary
tools in reproductive biotechnology, like recombinant DNA procedures, genome engineering,
transgenic technology, somatic cell nuclear transplantation etc., have added new dimensions to animal
breeding. Application of biotechnology in animal breeding has resulted in several remarkable
achievements like the sheep Dolly, created by the somatic cloning technique, transgenic pigs that can
be organ donors for humans, and animal bioreactors producing human therapeutic proteins in milk.
Compared to terrestrial animals, development in aquatic animals has been comparatively less. Only a
small percentage of farmed aquatic species have been subject to genetic improvement programmes.
However, biotechnology has great potential to increase fish production, mainly due to the availability
of large numbers of gametes, the use of external fertilization, and the ease of hormone treatments during
development to induce sterility or functional sex reversal. Some of the important reproductive
biotechnological tools used in farm animals are:
Artificial insemination: Using this technology, new breeds of animals are produced through the
introduction of sperm from a superior male into the female reproductive tract without
mating. The advantages of AI include reduced transmission of venereal disease, a reduced need for
farms to maintain breeding males, more accurate recording of pedigrees, and a lower
cost of introducing improved genetics. However, the success of AI depends on accurate heat detection,
proper handling of frozen semen and timely insemination by a trained inseminator.
Sex Determination of Sperm: Sexing of sperm could help to pre-determine the sex of the progeny. This
technique works on the principle of flow cytometric separation of fluorescent-labelled
X-chromosome-bearing spermatozoa from fluorescent-labelled Y-chromosome-bearing spermatozoa. The accuracy of
this technique is high; however, the laser light used reduces the viability of the sexed sperm and
the throughput is low. New-generation flow cytometers with high sorting rates have opened
avenues for increasing sorted sperm output with minimal or no damage to sperm. Sex
chromosome-specific proteins (SCSPs) identified on the surface of sperm are also currently used for sperm sexing;
these methods are less invasive and less damaging to sperm.
Sperm Encapsulation: This involves encapsulating sperm for longer preservation in
vivo and to allow progressive release of viable spermatozoa over several days in various domestic
species, including humans. The technique also prevents cryocapacitation and is reported to
increase conception rate. The technique has been developed in cattle and swine, but it still needs more
sophisticated instruments for encapsulation, and further standardization, before it can be used under
field conditions in other livestock species.
Ovum Pick Up (OPU): This is a non-invasive and repeatable technique used for recovering large numbers
of competent oocytes from antral follicles of live animals. Embryo production from ovum pick-up
oocytes is affected by age, season and follicle stimulating hormone (FSH) stimulation. It is also evident that
repeated OPU can be performed without side effects in both cattle and buffaloes, with minimal
stress to the animal. In India, the first buffalo calf (Saubhagya) was produced through this technique
by Prasad et al. (2013), and subsequently the first bovine calf (Holi) was produced at ICAR-National Dairy
Research Institute. OPU has the advantage of collecting oocytes with little invasiveness and of allowing
superior animals to be used as oocyte donors in embryo transfer. The limitations of this technique
are the low oocyte yield per ovary and the need for sophisticated instruments.
In Vitro Maturation, Fertilization and Culture (IVMFC): This involves oocyte collection from
slaughterhouse ovaries or from live animals followed by maturation and fertilization in vitro for the
production of viable embryos. Since the birth of the first rabbit conceived through in vitro
fertilization (IVF) in 1959, IVF has been practised in several animals. Various methods for in vitro
maturation, IVF, and in vitro culture have been standardized in animals. In addition, IVMFC has provided
an excellent source of embryos for embryo transfer, cloning, transgenesis, and other advanced in vitro
techniques. It has also allowed the analysis of the developmental potential of embryos, patterns of gene
expression, epigenetic modifications and cytogenetic disorders in various domestic species, and has been
used as a model for human embryogenesis studies. The low success rate and the costs make the technique
less feasible for application in livestock under field conditions.
Intracytoplasmic Sperm Injection (ICSI): ICSI is a micromanipulation technique used for treating male
infertility. It involves mechanical insertion of a selected sperm into the cytoplasm of an oocyte to
produce the desired embryo. Since the first report of ICSI success, ICSI has been performed in other species
such as rabbits, mice, sheep, humans, horses, cattle, pigs and buffaloes. This technique is
also used as a sperm vector system for animal transgenesis.
Multiple Ovulation and Embryo Transfer: In this technique, selected genetically superior (elite) females
are hormonally induced to superovulate and are inseminated with high-quality semen from a superior male
at an appropriate time relative to ovulation, depending on the species. Week-old embryos are flushed
out of the donor’s uterus, isolated, examined microscopically for number and quality, and inserted
non-surgically into the lining of the uterus of surrogate mothers. Embryo transfer (ET) increases the
reproductive rate of selected females, reduces disease transfer, and facilitates the development of rare
and economically important genetic stocks. The main limiting factors for ET are the costly
hormones, labour-intensive protocols and expertise it requires, in addition to poor superovulatory
response and pregnancy outcomes.
Somatic Cell Nuclear Transfer or Cloning: Somatic cell nuclear transfer (SCNT) is a major technique
for delivering nuclease-mediated genetic alterations in livestock. In this technique, the nucleus of a
somatic cell is transferred into an egg cell (oocyte) from which the nucleus has been removed, to
generate a new individual genetically identical to the somatic cell donor. The advantage of SCNT is
that the gene-edited cell line can be genotyped and/or screened before transfer into the enucleated
oocyte to ensure that the desired edits, and no off-target edits, have occurred. A number of gene-
edited animals have been produced through the SCNT cloning technique. This technique was used to
generate Dolly from a differentiated adult mammary epithelial cell. Further research is needed to
improve the efficiency of cloning. SCNT is a procedure for cloning within the same species, whereas
interspecies cloning (interspecies Somatic Cell Nuclear Transfer, iSCNT) is also feasible. Cloned
animals have already been produced between closely related species, e.g. domestic cattle (Bos taurus)
and wild ox (Bos grunniens). The cloning procedure using embryonic stem cells (ESCs), referred to as Nuclear
Transfer-derived Embryonic Stem Cell (NTESC), is still unsuccessful. Despite the achievements made
through the SCNT-editing method, certain drawbacks associated with cloning, such as early embryonic
losses, postnatal death, and birth defects, cannot be ignored.
Cryopreservation: Cryopreservation is a process in which cells, whole tissues, or any other substances
susceptible to damage caused by chemical reactivity or time are preserved by cooling to sub-zero
temperatures. Cryopreservation is a complex multistage process incorporating cryoprotectants or
antifreeze agents. The ability to cryopreserve germplasm indefinitely allows genetic diversity to
be preserved. Unlike semen, cryopreservation of embryos helps in the preservation of complete
genotypes. Freezing of embryos is an established commercial practice, especially in cattle. In contrast
to embryos, oocytes are extremely sensitive to chilling and are difficult to cryopreserve without
losing their viability. However, research is in progress on the vernalisation of oocytes, where very low
temperature storage, without freezing, could preserve the oocytes for several months. This technique
is advantageous as it reduces the risk and expense of transporting valuable animals, reduces
disease transmission and supports conservation of endangered species germplasm.
Embryo Sexing: Embryo sexing is a technique in reproductive biotechnology with practical
applications. Sex determination is performed by Y-chromosome-specific DNA probe technology
coupled with polymerase chain reaction (PCR) amplification of a specific Y-chromosome region. Other
methods involve detection of embryonic H-Y antigen in the embryos and the use of loop-mediated
isothermal amplification and duplex PCR-based assays, which show more than 95% accuracy but involve
high cost, time and expertise.
Transgenesis: Transgenic animals have a foreign gene deliberately inserted into their genome by
micro-injection of DNA into the pronuclei of a fertilised egg, which is subsequently implanted into
the oviduct of a surrogate mother. Transgenesis has great potential in molecular breeding of farm
animals, such as the development of animals with high fecundity, higher fertility and disease resistance.
Transgenic technologies in fishes can improve growth rates and market size, feed conversion ratios,
resistance to disease, sterility and tolerance of extreme environmental conditions. In the
shrimp aquaculture sector, transgenic shrimp have been reported (Mialhe et al., 1995), but there has
been no successful development to date for commercial culture. The cost of making transgenic farm
animals is high and the efficiency is low.
Stem Cells: Stem cells are unspecialized cells that renew themselves for long periods through cell
division, and later become specialized on receiving specific signals. Based on their source, stem cells
have been classified into three types, viz., embryonic, adult and fetal stem cells. Embryonic stem (ES)
cells are derived from embryos at the blastocyst stage (32-cell stage) and can give rise to cells from
all three embryonic germ layers. The ES cells are advantageous as they do not form tumours when
transferred into the body, which potentiates their use in transplantation. On the other hand, adult stem
cells are undifferentiated cells found throughout the body that are needed to replenish and regenerate
cells in any damaged tissue. The spermatogonial stem cells are the only adult stem cells responsible for
transferring genes to the next generation through fertilization of the ovum. Some of the potential
applications of this technology are surrogate production of spermatozoa, reduced time for progeny testing,
production of transgenic animals and conservation of endangered species.
Gene editing: Gene editing is a powerful tool for manipulating the genome, with applications in
animal breeding programs. Gene editing allows specific deletions, additions, or allele alterations at
unambiguous locations in a genome. The development of designer nucleases (zinc finger nucleases
[ZFNs], transcription activator-like effector nucleases [TALENs], and clustered regularly interspaced
short palindromic repeats [CRISPR/Cas9]) has enabled extremely efficient and more facile genome
editing in different animal species. These tools could be employed to enhance productivity, disease
resistance and breeding efficiency, and to generate novel animal models. Such alterations, if made
in zygotes or germ line cells, can be permanent and heritable. Recently, genome editing has been
reported in many livestock species, such as myostatin (MSTN) gene editing for “double muscling”
in pigs, cattle, and sheep, polled gene introduction in dairy cattle, and edits to confer resistance to
porcine reproductive and respiratory syndrome virus and African swine fever virus in pigs.
Endocrine regulation of reproduction in fish
Biotechnology can be applied to enhance the reproductive performance of cultured aquatic
species exhibiting reproductive dysfunction in captivity. In the past, fish gonadotropins, a group of
hormones that stimulate reproduction, were produced in small amounts by extraction and purification
from crude preparations of thousands of pituitary glands. At present, large quantities of highly purified
gonadotropin can be produced in the laboratory through recombinant DNA technology. The use of
synthetic Gonadotropin Releasing Hormone (GnRH), the key regulator of the reproductive cascade in all
vertebrates, triggers the secretion of the fish’s own gonadotropin. The GnRH analogue (GnRHa) is
synthesized chemically and does not carry the risk of transmitting diseases to the broodstock. However,
injection of GnRHa does not always result in 100% ovulation, and multiple injections are often necessary
to induce ovulation. Development of controlled-release delivery systems for synthetic GnRHa has contributed
to captive breeding of many commercially important fish species. Hormone implants mixed with
cholesterol or ethylene-vinyl acetate, or delivered in biodegradable microspheres, have been efficient in
inducing maturation and spawning in many cultured fish.
Sex control: The control of fish sex could be useful where one sex displays advantageous
characteristics, such as larger adult size, production of high-value caviar (sturgeon), faster growth
rate, or higher age at first sexual maturation. Monosex populations of the most advantageous sex can
be produced by direct sex control via steroid treatment (masculinisation by administration of
androgens; feminisation by administration of estrogens); by genetic control and steroid treatment
of broodstock (indirect hormonal treatment, gynogenesis, androgenesis); or by control of external
factors (temperature, density etc.). In the case of tilapia, males are preferred for culture as they grow
faster than females. The YY male technology (in tilapia) involves a genetic breeding programme that
combines hormonal feminization of normal males (producing XY females) with mating of these XY females
to normal XY males, which yields YY males among the progeny. YY males mated with normal XX females
then give all-male (XY) offspring.
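The logic of the YY male crosses can be sketched with a small Punnett-square calculation. This is a minimal illustration only: it assumes a simple one-locus XX/XY sex-determination system, equally likely gametes and equal viability of all genotypes (including YY), none of which is stated in the text above. The function name `cross` is hypothetical.

```python
from collections import Counter
from itertools import product


def cross(mother: str, father: str) -> Counter:
    """Count offspring genotypes from a sex-chromosome cross.

    Each parent is a two-letter genotype such as "XX", "XY" or "YY".
    Assumption: every gamete combination is equally likely and viable.
    """
    offspring = Counter()
    for egg, sperm in product(mother, father):
        # Sort the letters so that "YX" and "XY" count as one genotype.
        offspring["".join(sorted(egg + sperm))] += 1
    return offspring


# Step 1: hormonally feminized male (XY female) mated with a normal XY male.
step1 = cross("XY", "XY")

# Step 2: a YY male recovered from step 1 mated with a normal XX female.
step2 = cross("XX", "YY")

print("XY female x XY male:", dict(step1))
print("XX female x YY male:", dict(step2))
```

Under these assumptions one quarter of the step 1 progeny are YY, and every offspring of a YY male is XY. In practice the YY individuals still have to be identified, for example by progeny testing, which this sketch does not model.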
Sterility: Inducing sterility in fish by manipulating reproduction would help to increase growth by
reducing the energy spent on reproduction. Sterility can be achieved by ploidy manipulation to produce
sterile triploids or by transgenic approaches such as gene “knock-out” or “knock-down”.
Conclusion
Reproductive biotechnology has revolutionized animal breeding and genetic progress in the livestock
industry. The application of biotechnology in aquaculture, including the use of synthetic hormones
in induced breeding and the production of monosex populations, surrogate broodstock and transgenic fish,
has played a major role in ensuring the continued expansion and intensification of aquaculture to meet
the growing demand for fish. The emerging techniques should be judiciously implemented for manipulation
and improvement of the reproductive performance of livestock species.
Source of information
K. K. Choudhary, K. M. Kavya, A. Jerome, R. K. Sharma (2016). Advances in reproductive biotechnologies.
Vet World 9(4): 388–395.
Role of Biotechnology in Assisted Reproduction, Science, 14 May – 2014.
W. S. Lakra and S. Ayyappan (2002). Recent Advances in Biotechnology Applications to Aquaculture.
International Symposium on Recent Advances in Animal Nutrition, 22nd September, New Delhi,
Pg-455-461
19. Use of molecular techniques in growth enhancement
Raymond J Angel
Introduction
The use of molecular techniques in aquaculture has the potential to alleviate
the predicted fish shortages and price increases by enhancing production efficiency, minimizing costs
and reducing disease. Growth enhancement of fish using molecular techniques is equally beneficial to
aquaculture and can be more effective than traditional breeding techniques for developing new fish
strains. In principle, the technology can be used to improve the growth rate of fish, control sexual
maturation, sterility and sex differentiation, improve survival by increasing resistance to pathogens,
adapt fish to extreme environments (for example through cold resistance) and alter the biochemical
characteristics of the flesh to enhance its nutritional qualities. Since fish can be readily improved by
application of molecular techniques, it is timely to consider what genetically modified (GM) fish are
likely to offer in the future, in terms of both benefits and disadvantages (Maclean and Norman, 2003).
Growth Hormone (GH) has also been utilised extensively in recent years for the construction of transgenic
fishes with enhanced growth. Genetic engineering is an important tool to develop and improve traits of
fish for aquaculture. Species showing high growth rates are widely used as a source of the Growth Hormone
gene for the production of transgenic fish.
An overview of various target species used in growth enhancement using molecular techniques
Transgenic fish have been produced for numerous species of fish, including non-commercial
model species such as the Loach, Misgurnus anguillicaudatus (Maclean et al. 1987a), Medaka, Oryzias
latipes (Ozato et al. 1986), Topminnows and Zebrafish; Gong et al. (2002) have also developed
transgenic Rainbow zebrafish for the ornamental fish industry. Several experiments have evaluated
transgenic farmed fish species including Goldfish (Zhu et al. 1985), Common carp, Silver carp, Mud
loach, Rainbow trout (Chourrout, 1986), Atlantic salmon, Coho salmon, Chinook salmon, Channel
catfish (Dunham et al. 1987) and Nile tilapia (Brem et al. 1988). Additionally, gene transfer has been
accomplished in a game fish, Northern pike (Gross et al. 1992).
Many species of fish have been used in studies for standardizing GH-based transgenesis. Although
many studies reported growth enhancement in the target species, some were
unsuccessful for reasons that remain unknown. Some of these studies are listed for reference
(Table 1).
Techniques for growth enhancement
There are many ways to enhance growth, including inbreeding, gynogenesis, androgenesis,
selection, intraspecific crossbreeding, interspecific hybridization, polyploidy, sex reversal and
breeding, nuclear transplantation and transgenesis. Cloned populations have been produced via
gynogenesis and androgenesis (Dunham, 2004), but direct cloning of an individual fish of interest
has not yet been accomplished. Gene transfer technology has had a great impact on modern
biology and biotechnology (Powers et al. 1998). A number of fish species are in focus for gene transfer
experiments and can be divided into two main groups: animals used in aquaculture (Fletcher and
Davies, 1991; Hew et al. 1995; Chen and Lu, 1998) and model fish used in basic research (Chen and
Lu, 1998). Among the major food fish species are Carp (Cyprinus sp.), Tilapia (Oreochromis sp.),
Salmon (Salmo sp., Oncorhynchus sp.) and Channel catfish (Ictalurus punctatus), while Zebrafish
(Danio rerio), Medaka (Oryzias latipes) and Goldfish (Carassius auratus) are used in basic research.
Genetic engineering of farm animals offers great potential for improvement of selected genetic traits
of agricultural significance. Several species of fish have also been used to exploit this technology for
commercial purposes; examples include attempted induction of freeze resistance in transgenic
salmon using an Anti-Freeze Protein gene (Fletcher et al., 1988) and production of growth-enhanced
fish using novel Growth Hormone (GH) genes (Dunham et al., 1987; Brem et al., 1988; Penman et al.,
1990) or an Insulin-like Growth Factor (IGF) gene (Chen et al., 1995). Although several species of fish
have been used to produce lines of transgenic fish, in only a few cases have germline transmission and
stable long-term transgene expression been satisfactorily demonstrated.
Techniques for gene transfer
Microinjection
Microinjection is the most successful and widely used technique for gene transfer in fish. Gene
transfer research with fish began in the mid-1980s using microinjection (Zhu et al. 1985; Dunham
et al. 1987). Zhu et al. (1985) published the first report of transgenes microinjected into the fertilized
eggs of goldfish. In almost all fish gene transfer research, the foreign gene was microinjected into the
cytoplasm of one- to four-cell embryos (Hayat, 1989), as pronuclei are extremely difficult to visualize
in live one-cell fish embryos.
To ensure integration of the DNA, it should be injected into intact cells close to the cut site. The
injection apparatus consists of a dissecting stereomicroscope and two micro-manipulators, one with
a glass micro-needle for delivering the transgene and the other with a micropipette for holding the fish
embryo in place (Fig. 1). The success of the microinjection technique depends on the nature of the egg
chorion. A soft chorion facilitates microinjection, while a thick chorion limits the ability to visualize
the target for injection of DNA. In many fishes (e.g. Atlantic salmon and rainbow trout) the egg chorion
becomes tough and hard just after fertilization or on contact with water, which makes injecting
the DNA difficult.
Steps of Microinjection Technique
(1) Desired eggs and sperm are stored separately under optimum conditions.
(2) Water and sperm are added to the eggs to initiate fertilization.
(3) Ten minutes after fertilization, eggs are dechorionated by trypsinization.
(4) Fertilized eggs are microinjected with the desired DNA within a few hours of fertilization. In
dechorionated eggs, DNA is released into the centre of the germinal disc before the first cleavage.
The time available for microinjection is the first 25 minutes, between fertilization
and first cleavage.
(5) After microinjection, the embryos are incubated in water until hatching takes place.
Survival rates of microinjected fish embryos are about 30-80%, depending on the fish
species.
Fig. 1. Microinjection technique.
Other methods
Microinjection is a tedious and slow procedure (Powers et al. 1992) and can result in high egg
mortality (Dunham et al. 1987). After the initial development of microinjection, new techniques such
as electroporation, retroviral integration, liposomal-reverse-phase-evaporation, sperm-mediated
transfer and high-velocity micro-projectile bombardment were developed (Chen and Powers, 1990)
that can sometimes produce large quantities of transgenic individuals more efficiently and in a shorter
time. The first successful gene transfer using electroporation produced integration rates and
survival similar to those for microinjection (Inoue et al. 1990). Powers et al. (1992) demonstrated that
electroporation can be more efficient than microinjection, with integration rates sometimes as high as
30-100%. Walker (1993) found that hatching rates were higher for electroporated than for
microinjected channel catfish embryos, and that post-fertilization electroporation treatments had higher
hatching rates than electroporation of sperm and then eggs prior to fertilization.
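The survival and integration figures quoted above can be combined into a rough planning estimate of how many transgenic individuals a batch of eggs might yield. The sketch below is illustrative arithmetic, not a published model: it assumes survival and integration are independent, and it applies the 30-80% survival range reported for microinjected embryos to electroporation only because the first electroporation report described survival as similar. The function name and the batch size of 1000 eggs are hypothetical.

```python
def expected_transgenic(eggs_treated: int,
                        survival_rate: float,
                        integration_rate: float) -> float:
    """Expected number of surviving embryos that carry the transgene.

    Assumption: survival and integration are independent, so the
    expected count is eggs x survival x integration.
    """
    for rate in (survival_rate, integration_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be fractions between 0 and 1")
    return eggs_treated * survival_rate * integration_rate


# Hypothetical batch of 1000 eggs, using the lower and upper ends of the
# ranges quoted in the text (survival 30-80%, integration 30-100%).
low = expected_transgenic(1000, survival_rate=0.30, integration_rate=0.30)
high = expected_transgenic(1000, survival_rate=0.80, integration_rate=1.00)

print(f"Expected transgenic embryos: {low:.0f} to {high:.0f}")
```

The wide spread between the two bounds reflects the variability between species and methods noted above; real integration events are also often mosaic, which this sketch ignores.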
Environmental Concerns about Transgenic Fish and risk mitigation
The primary environmental concerns about releases of transgenic fish include
competition with wild populations, movement of the transgene into the wild gene pool, and ecological
disruptions due to changes in prey and other niche requirements in the transgenic variety versus the
wild populations.
It is important to note that developers of transgenic fish are attempting to reduce or eliminate
both gene flow and invasive species risks by sterilizing transgenic fish. Sterilization is relatively easy
and inexpensive, but success rates are highly variable. In addition, sterilization does not necessarily
neutralize environmental risks. Academic scientists note that an escaped, sterile fish might still engage
in courtship and spawning behaviour, disrupting breeding in wild populations. Waves of escaped
sterile fish could also create ecological disruptions as each group is replaced by another equally strong
group of transgenic sterile fish.
Conclusion and future prospects
Transgenic fish technology has great potential in the aquaculture industry. By introducing
desirable genetic traits into fishes, mollusks, and crustaceans, superior transgenic strains can be
produced for aquaculture. These traits include faster growth rates, improved food conversion
efficiency, resistance to some known diseases, tolerance to low oxygen concentrations, and tolerance
to extreme temperatures. Our laboratory and those of others have shown that transfer, expression
and inheritance of fish growth hormone transgenes can be achieved in several fish species and that
the resulting transgenics grow substantially faster than their non-transgenic siblings. This is a vivid
example of the potential application of gene transfer technology to aquaculture.
However, to realize the full potential of transgenic fish technology in aquaculture or other
biotechnological applications, several important scientific breakthroughs are required. These include:
(1) more efficient technologies for mass gene transfer,
(2) targeted gene transfer technologies such as embryonic stem cell gene transfer or ribozyme
gene inactivation,
(3) suitable promoters to direct the expression of transgenes at optimal levels during the desired
developmental stages,
(4) identified genes for desirable traits for aquaculture and other applications,
(5) information on the physiological, nutritional, immunological and environmental factors that
maximize the performance of the transgenics, and
(6) assessment of the safety and environmental impacts of transgenic fish.
Once these problems are resolved, the commercial application of transgenic fish technology will be
readily attained.
Table 1. Studies showing enhancement of growth achieved in different target organisms worldwide, with
supporting citations
FAMILY AND SPECIES                    CONSTRUCT           GROWTH         COUNTRY            SUPPORTING CITATION
Salmonidae
Atlantic salmon, Salmo salar          opAFP-csGH          2–6-fold       Canada             Du et al. (1992) and Fletcher et al. (2004)
Coho salmon, Oncorhynchus kisutch     ssMT-ssGH           Up to 11-fold  Canada             Devlin et al. (1994a,b)
Coho salmon, Oncorhynchus kisutch     opAFP-csGH          3–10-fold      Canada             Devlin et al. (1995a)
Chinook salmon, O. tshawytscha        opAFP-csGH          6-fold         Canada             Devlin et al. (1995a)
Rainbow trout, O. mykiss              opAFP-csGH          3.2-fold       Canada             Devlin et al. (1995a)
Cutthroat trout, O. clarki            opAFP-csGH          6-fold         Canada             Devlin et al. (1995a)
Arctic charr, Salvelinus alpinus      Various constructs  Up to 14-fold  Finland            Pitkanen et al. (1999)
Rainbow trout, O. mykiss              ssGH-ssGH           None           Finland            Pitkanen et al. (1999)
Cichlidae
Nile tilapia, Oreochromis niloticus   opAFP-csGH          2–4-fold       UK                 Rahman et al. (1998; 2001) and Rahman and Maclean (1999)
Nile tilapia, Oreochromis niloticus   ssMT-ssGH           None           UK                 Rahman et al. (1998; 2001) and Rahman and Maclean (1999)
Tilapia, O. hornorum hybrid           hCMV-GH             82%            Cuba               Martinez et al. (1996)
Ictaluridae
Channel catfish, Ictalurus punctatus  RSVLTR-rtGH         Up to 26%      USA                Dunham et al. (1992)
Channel catfish, Ictalurus punctatus  mMT-hGH             None           USA                Dunham et al. (1987)
Heteropneustidae
Heteropneustes fossilis               Zpb-ypGH            30–60%         India              Sheela et al. (1999)
Cyprinidae
Goldfish, Carassius auratus           mMT-hGH             None           PR China           Zhu et al. (1985)
Common carp, Cyprinus carpio          mMT-hGH             None           PR China           Zhu et al. (1989)
Common carp, Cyprinus carpio          cbA-gcGH            42–80%         PR China           Zhu (1992) and Wang et al. (2001)
Catla, Catla catla                    RSVLTR-rtGH         None           India              Sarangi et al. (1999)
Common carp, Cyprinus carpio          ccbA-ccGH           4-fold         Israel             Hinits and Moav (1999)
Rohu, Labeo rohita                    CMV-roGH            4-fold         India              Venugopal et al. (2004)
Rohu, Labeo rohita                    gcbA-roGH           4.5–5.8-fold   India              Venugopal et al. (2004)
Esocidae
Northern pike                         RSVLTR-bGH          30%            USA                Gross et al. (1992)
Cobitidae
Mud loach, Misgurnus mizolepis        mlb-actin-mlGH      Up to 35-fold  Republic of Korea  Nam et al. (2001; 2002)
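Table 1 reports growth enhancement in two different units, percentages for some studies and fold values for others, which makes rows hard to compare at a glance. The short sketch below converts between the two. It rests on an interpretation that the table itself does not state: that a percentage is the gain over non-transgenic controls and that a fold value is the ratio of transgenic to control size. Both function names are hypothetical.

```python
def percent_gain_to_fold(percent_gain: float) -> float:
    """Convert a percentage gain over controls into a fold value.

    Assumption: "82%" means transgenics are 82% larger than controls,
    i.e. 1.82 times the control value.
    """
    return 1.0 + percent_gain / 100.0


def fold_to_percent_gain(fold: float) -> float:
    """Convert a fold value (transgenic / control) into a percentage gain."""
    return (fold - 1.0) * 100.0


# Two entries from Table 1 placed on a common scale.
tilapia_hybrid_fold = percent_gain_to_fold(82)   # reported as 82%
rohu_percent_gain = fold_to_percent_gain(4)      # reported as 4-fold

print(f"82% gain is about {tilapia_hybrid_fold:.2f}-fold")
print(f"4-fold is a {rohu_percent_gain:.0f}% gain")
```

Even on a common scale the rows are not strictly comparable, since the original studies differ in species, age at measurement and rearing conditions.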
REFERENCES
Brem, G., Brenig, B., Horstgen-Schwark, G. and Winnacker, E. L., 1988. Gene transfer in tilapia
(Oreochromis niloticus). Aquaculture, 68: 209-219.
Chen, T. T. and Lu, J. K., 1998. Transgenic fish technology: Basic principles and its application in basic
and applied research. In: De la Fuente, J. and Castro, F. O. (Eds.). Gene transfer in aquatic organisms.
RG Landes Company, Austin, Texas, USA and Springer-Verlag, Germany, pp. 45-73.
Chen, T. T. and Powers, D. A., 1990. Transgenic fish. Trends in Biotechnology, 8: 209-214.
Chen, T. T., Lu, J. K., Shamblott, M. J., Cheng, C. M., Lin, C. M., Burns, J. C., Reimschuessel, R., Chatakondi,
N. and Dunham, R. A., 1995. Transgenic fish: ideal models for basic research and biotechnological
applications. Zool. Studies, 344: 215–234.
Chourrout, D., 1986. Techniques of chromosome manipulation in rainbow trout: a new evaluation
with karyology. Theoretical and Applied Genetics, 72: 627-632.
Devlin, R. H., Byatt, J. C., McLean, E., Yesaki, T. Y., Krivi, G. G., Jaworski, E. G. and Donaldson, E. M.,
1994b. Bovine placental lactogen is a potent stimulator of growth and displays strong binding to
hepatic receptor sites of coho salmon. Gen. Comp. Endocrinol., 95: 31–41.
Devlin, R. H., Yesaki, T. Y., Biagi, C. A., Donaldson, E. M., Swanson, E. M. P. and Chan, W. K., 1994a.
Extraordinary salmon growth. Nature, 371: 209–210.
Devlin, R. H., Yesaki, T. Y., Donaldson, E. M., Du, S. J. and Hew, C. L., 1995a. Production of germline
transgenic Pacific salmonids with dramatically increased growth performance. Can. J. Fish. Aquat.
Sci., 52: 1376–1384.
Du, S. J., Gong, Z. Y., Fletcher, G. L., Shears, M. A., King, M. J., Idler, D. R. and Hew, C. L., 1992. Growth
enhancement in transgenic Atlantic salmon by the use of an all-fish chimeric growth hormone
gene construct. BioTechnology, 10: 176–181.
Dunham, R. A., Ramboux, A. C., Duncan, P. L., Hayat, M., Chen, T. T., Lin, C. M., Kight, K., Gonzalez-
Villasenor, I. and Powers, D. A., 1992. Transfer, expression and inheritance of salmonid growth
hormone genes in channel catfish, Ictalurus punctatus, and effects on performance traits. Mol.
Mar. Biol. Biotechnol., 1: 380–389.
Dunham, R. A., 2004. Aquaculture and Fisheries Biotechnology: Genetic Approaches. CABI Publishing,
Wallingford, UK, 17: p. 400.
Dunham, R. A., Eash, J., Askins, J. and Townes, T. M., 1987. Transfer of the metallothionein-human
growth hormone fusion gene into channel catfish. Transactions of the American Fisheries
Society, 116: 87-91.
Fletcher, G. L., Shears, M. A., King, M. J., Davies, P. L. and Hew, C. L., 1988. Evidence for antifreeze
protein gene transfer in Atlantic salmon (Salmo salar). Can. J. Fish. Aquat. Sci., 45: 352–357.
Fletcher, G. L., Shears, M. A., Yaskowiak, E. S., King, M. J. and Goddard, S. V., 2004. Gene transfer:
potential to enhance the genome of Atlantic salmon for aquaculture. Aust. J. Exp. Agric., 44:
1095–1100.
Fletcher, G. L. and Davies, P. L., 1991. Transgenic fish for aquaculture. Gen. Eng., 13: 331-369.
Gong, Z., Wan, H., Ju, B., He, J., Wang, X. and Yan, T., 2002. Generation of living color transgenic
zebrafish. In: Shimizu, N., Aoki, T., Hirono, I. and Takashima, F. (Eds.). Aquatic Genomics: Steps
Toward a Great Future, Springer-Verlag, New York, NY, pp. 329-339.
Gross, M. L., Schneider, J. F., Moav, N., Moav, B., Alvarez, C., Myster, S. H., Liu, Z., Hallerman, E.
M., Hackett, P. B., Guise, K. S., Faras, A. J. and Kapuscinski, A. R., 1992. Molecular analysis and
growth evaluation of northern pike (Esox lucius) microinjected with growth hormone genes.
Aquaculture, 103: 253-273.
Hayat, M., 1989. Transfer, expression and inheritance of growth hormone genes in channel catfish
(Ictalurus punctatus) and common carp (Cyprinus carpio). Doctoral Dissertation. Auburn
University, AL, USA.
Hew, C. L., Fletcher, G. L. and Davies, P. L., 1995. Transgenic salmon: tailoring the genome for food
production. Journal of Fish Biology, 47: 1-19.
Hinits, Y. and Moav, B., 1999. Growth performance studies in transgenic Cyprinus carpio. Aquaculture,
173: 285–296.
Inoue, K., Yamashita, S., Hata, J. I., Kabeno, S., Asada, S., Nagahisa, E. and Fujita, T., 1990.
Electroporation as a new technique for producing transgenic fish. Cell Differentiation and
Development, 29: 123-128.
Maclean, N. and Norman, 2003. Genetically modified fish and their effects on food quality and human
health and nutrition. Trends in Food Science & Technology, 14(5-8): 242-252.
Maclean, N., Penman, D. and Talwar, S., 1987a. Introduction of novel genes into fish. Biotechnology,
5: 257-261.
Martinez, R., Estrada, M. P., Berlanga, J., Guillen, I., Hernandez, O., Cabrera, E., Pimentel, R., Morales,
R., Herrera, F., Morales, A., Pina, J. C., Abad, Z., Sanchez, V., Melamed, P., Lleonart, R. and de
la Fuente, J., 1996. Growth enhancement in transgenic tilapia by ectopic expression of tilapia
growth hormone. Mol. Mar. Biol. Biotechnol., 5: 62–70.
Nam, Y. K., Cho, Y. S., Cho, H. and Kim, D. S., 2002. Accelerated growth performance and stable
germ-line transmission in androgenetically derived homozygous transgenic mud loach, Misgurnus
mizolepis. Aquaculture, 209: 257–270.
Nam, Y. K., Noh, J. K., Cho, Y. S., Cho, H. J., Cho, K. N., Kim, C. G. and Kim, D. S., 2001. Dramatically
accelerated growth and extraordinary gigantism of transgenic mud loach Misgurnus mizolepis.
Transgenic Res., 10: 353–362.
Ozato, K., Kondoh, H., Inohara, H., Iwamatsu, T., Wakamatsu, Y. and Okada, T. S., 1986. Production of
transgenic fish: introduction and expression of chicken delta-crystallin gene in medaka embryos.
Cell Differ. Dev., 19: 237-244.
Penman, D. J., Beeching, A. J., Penn, S. and Maclean, N., 1990. Factors affecting survival and integration
following microinjection of novel DNA into rainbow trout eggs. Aquaculture, 85: 35-50.
Pitkanen, T. I., Krasnov, A., Teerijoki, H. and Molsa, H., 1999. Transfer of growth hormone (GH) genes
into Arctic charr (Salvelinus alpinus L.). I. Growth response to various GH constructs. Genet. Anal.:
Biomol. Eng., 15: 91–98.
Powers, D. A., Cole, T., Creech, K., Chen, T. T., Lin, C. M., Kight, K. and Dunham, R., 1992. Electroporation:
a method for transferring genes into the gametes of zebrafish, Brachydanio rerio, channel catfish,
Ictalurus punctatus, and common carp, Cyprinus carpio. Mol. Mar. Biol. Biotech., 1: 301-309.
Powers, D. A., Gómez-Chiarri, M., Chen, T. T. and Dunham, R., 1998. Genetic Engineering of Finfish and
shellfish. In: De la Fuente, J. and Castro, F. O. (Eds.). Gene transfer in aquatic organisms. RG Landes
Company, Austin, Texas, USA and Springer-Verlag, Germany, pp. 17-34.
Rahman, M. A. and Maclean, N., 1999. Growth performance of transgenic tilapia containing an
exogenous piscine growth hormone gene. Aquaculture, 173: 333–346.
Rahman, M. A., Mak, R., Ayad, H., Smith, A. and Maclean, N., 1998. Expression of a novel piscine
growth hormone gene results in growth enhancement in transgenic tilapia (Oreochromis
niloticus). Transgenic Res., 7: 357–369.
Rahman, M. A., Ronyai, A., Engidaw, B. Z., Jauncey, K., Hwang, G. L., Smith, A., Roderick, E., Penman,
D., Varadi, L. and Maclean, N., 2001. Growth and nutritional trials on transgenic Nile tilapia
containing an exogenous fish growth hormone gene. J. Fish Biol., 59: 62–78.
Sarangi, N., Mandall, A. B., Bandyopadhyay, A. K., Venugopal, T., Mathavan, S. and Pandian, T. J., 1999.
Electroporated sperm-mediated gene transfer in Indian major carps. Asia-Pacific J. Mol. Biol.
Biotechnol., 7: 151–158.
Sheela, S. G., Pandian, T. J. and Mathavan, S., 1999. Electroporatic transfer, stable integration, and
transmission of pZp beta ypGH and pZp beta rtGH in Indian catfish, Heteropneustes fossilis
(Bloch). Aquac. Res., 30: 233–248.
Venugopal, T., Anathy, V., Kirankumar, S. and Pandian, T. J., 2004. Growth enhancement and food
conversion efficiency of transgenic fish, Labeo rohita. J. Exp. Biol., 301A: 477–490.
Walker, D.S., 1993. Eect of electroporaon and microinjecon on survival of ictalurid caish embryos.
Master of Science Thesis. Auburn University, AL.
Wang, Y., Hu, W., Wu, G., Sun, Y., Chen, S., Zhang, F., Zhu, Z., Feng, J. and Zhang, X., 2001. Genec
analysis of ‘‘all-sh’’ growth hormone gene transferred carp (Cyprinus carpio L.) and its F1
generaon. Chin. Sci. Bull., 46: 1174–1177.
Zhu, Z., 1992. Generaon of fast-growing transgenic sh: methods and mechanisms. In: Hew, C.L.,
Fletcher, G.L. (Eds.), Transgenic Fish. World Publishing, Singapore, pp. 92–119.
Zhu, Z., Li, G., He, L. and Chen, S., 1985.Novel gene transfer into the ferlized eggs of goldsh (Carassius
auratus, 1758).Journal of Applied Ichthyology, 1: 31-33.
Zhu, Z., Xu, K., Xie, Y., Li, G. and He, L., 1989.A model of transgenic sh.Sci. Sin., B 2: 147–155.
20. Gene Editing Tools and their application in Aquaculture
Misha Soman
Introduction
Genome editing is a kind of genetic engineering in which a gene of interest is inserted, deleted or
modified in the genome of an organism or cell using engineered nucleases, often called "molecular scissors."
These nucleases create site-specific double-strand breaks (DSBs) at desired locations in the genome.
The induced double-strand breaks are repaired through non-homologous end joining (NHEJ) or
homologous recombination (HR), resulting in targeted mutations ('edits'). By editing the genome, the
characteristics of a cell or an organism can be changed.
Genome editing uses an 'engineered nuclease' that cuts the DNA at its targeted site. Engineered
nucleases have two parts: a nuclease part and a DNA-targeting part that is designed to guide the
nuclease to cut a specific DNA sequence. When a cut forms at a particular place in the DNA, the cell
starts to repair the cut naturally. Gene editing technologies have wide applications in
different fish species for basic as well as applied research in disease modeling and aquaculture.
Genome editing can be used:
For research: It can be used to alter the DNA of organisms to study the impact of gene modification.
To treat disease: Genome editing is being evaluated in medical research as a way to treat deadly
human diseases such as leukemia, AIDS and cancer (Ophinni et al., 2018; Tebas et al., 2014).
For biotechnology: Genome editing has been used in agriculture to produce genetically modified crops
with improved yield and disease resistance, as well as genetically modified pigs (Wang et al., 2015;
Kang et al., 2017), sheep (Crispo et al., 2015) and fishes (Khalil et al., 2017).
TYPES OF GENE EDITING
• Zinc finger nucleases (ZFNs)
• TALENs
• CRISPR-Cas9
ZINC FINGER NUCLEASES (ZFNs)
Zinc finger nucleases (ZFNs) are engineered restriction nucleases produced by joining a
zinc finger DNA-binding domain to a DNA-cleavage domain (FokI). They promote targeted editing of
the genome by generating double-strand breaks in DNA at targeted locations. A ZFN is thus a
site-specific endonuclease designed to bind and cleave DNA at particular locations. Each ZFN is composed
of three to six zinc finger motifs, and each motif recognizes three nucleotides in a DNA
sequence. Hence, each ZFN can recognize a target of 9 to 18 base pairs. Cleavage of the target DNA requires
dimerization of two ZFNs, because the FokI cleavage domain is active only as a dimer; this results in a
double-strand break (DSB) at the target locus (Durai et al., 2005). Double-strand breaks are important
for site-specific mutagenesis because they stimulate the cell's natural DNA-repair processes,
homology-directed repair (HDR) and non-homologous end joining (NHEJ); these reagents can therefore
be used to modify genomes precisely.
Fig 1: The DNA-binding domain (zinc finger module) and the DNA-cleavage domain (FokI catalytic
module) are fused together to form a highly specific pair of 'genomic scissors'.
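The arithmetic behind the 9 to 18 base pair figure can be sketched in a few lines of Python. This is only an illustration of the rule stated in the text (three nucleotides per zinc finger motif, three to six motifs per ZFN); the function names are hypothetical and the spacer between the two half-sites of a ZFN pair is not modelled.

```python
# Each zinc finger motif recognizes 3 nucleotides (as stated in the text).
BASES_PER_FINGER = 3


def zfn_site_length(n_fingers: int) -> int:
    """Length in bp of the site recognized by a single ZFN."""
    if not 3 <= n_fingers <= 6:
        raise ValueError("the text describes ZFNs with three to six fingers")
    return n_fingers * BASES_PER_FINGER


def zfn_pair_recognition(n_left: int, n_right: int) -> int:
    """Total bp recognized by a dimerizing ZFN pair (spacer not counted)."""
    return zfn_site_length(n_left) + zfn_site_length(n_right)


if __name__ == "__main__":
    for n in range(3, 7):
        print(n, "fingers ->", zfn_site_length(n), "bp")
```

Because two ZFNs must dimerize for FokI to cut, the pair as a whole reads twice as many bases as a single ZFN, which is one reason a ZFN pair can address a unique site in a large genome.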
TALENS
Transcription activator-like effector nuclease (TALEN) technology uses engineered restriction
enzymes generated by fusing a TAL effector DNA-binding domain to a DNA-cleavage domain (FokI).
Such restriction enzymes can be designed to cut precisely at any desired DNA sequence. When they
are introduced into cells, they make double-strand breaks in the gene of interest.
The nucleases consist of programmable, sequence-specific DNA-binding modules coupled to a
non-specific DNA-cleavage domain. This allows accurate and efficient genetic alterations: the
targeted DNA double-strand breaks stimulate cellular DNA repair by error-prone NHEJ or by
HDR.
The DNA-binding domain contains a repeated, highly conserved sequence of 33–34 amino acids
in which the 12th and 13th amino acids diverge. These two positions, referred to as the Repeat Variable
Diresidue (RVD), are highly variable and show a strong correlation with specific nucleotide recognition.
Different RVDs allow each module to specifically recognize one individual nucleotide, instead of three
nucleotides as in ZFNs (Moscou and Bogdanove, 2009). The dimerized FokI cleaves the DNA, without
sequence preference, in the region between the left and right TALEN target sites.
Fig 2: TALENS mechanism. Each TALEN consists of a TALE module (DNA-binding domain) fused to a
FokI catalytic module (DNA-cleavage domain).
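The one-repeat-per-nucleotide rule can be illustrated with a small Python sketch. The RVD-to-base table below is the commonly cited code (NI for A, HD for C, NG for T, NN for G); it is not given in this chapter and is included here as an assumption, and real RVDs are less strict than a lookup table suggests (NN, for example, can also bind A). The function names are hypothetical.

```python
# Commonly cited RVD code (assumption: not stated in this chapter).
RVD_TO_BASE = {"NI": "A", "HD": "C", "NG": "T", "NN": "G"}


def rvds_to_target(rvds):
    """Translate a list of RVDs into the DNA sequence the TALE array should bind."""
    try:
        return "".join(RVD_TO_BASE[rvd] for rvd in rvds)
    except KeyError as exc:
        raise ValueError(f"unknown RVD: {exc.args[0]}") from None


def target_to_rvds(seq):
    """Choose one RVD per nucleotide for a desired target sequence."""
    base_to_rvd = {base: rvd for rvd, base in RVD_TO_BASE.items()}
    return [base_to_rvd[base] for base in seq.upper()]


if __name__ == "__main__":
    print(rvds_to_target(["NI", "HD", "NG", "NN"]))
    print(target_to_rvds("ACTG"))
```

Designing a TALEN for a new site then amounts to choosing one repeat per base, in contrast to ZFNs, where each module covers a triplet.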
CRISPR-Cas9
The clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated protein 9
(Cas9) system has emerged as a faster, cheaper and more precise gene editing tool in a wide range of
organisms. It derives from an adaptive immunity mechanism that prokaryotes use to eliminate invading
genetic material: the foreign genetic material is cut into fragments, and short pieces (about 20 bp) are
integrated into the CRISPR locus between the repeats. The loci are transcribed and processed into small
RNAs, called guide RNAs, which guide nucleases to cleave the target DNA based on sequence complementarity.
This technology enables geneticists and medical researchers to edit the genome by adding, removing or
altering the DNA sequence. The CRISPR-Cas9 system consists of two key players that introduce a mutation
into the targeted DNA: the enzyme Cas9 and a piece of RNA called the guide RNA. Cas9 acts
as a pair of molecular scissors that cuts double-stranded DNA at a specific targeted site, so that bits of
sequence can be added or removed. The guide RNA (gRNA) contains a pre-designed sequence of about 20 bases
located within a longer RNA scaffold. The scaffold part binds to the Cas9 protein, and the pre-designed
guide sequence pairs with the complementary target DNA; this 'guides' the Cas9 nuclease to the targeted
region and ensures that the enzyme cuts at the right point in the genome.
Fig 3: CRISPR-Cas9 mechanism (labelled components: Cas9, sgRNA, target sequence and PAM sequence).
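Target selection can be sketched as a simple sequence scan. The Python example below assumes the widely used Cas9 from Streptococcus pyogenes, which requires a 5'-NGG-3' PAM immediately after the 20-base target and cuts about 3 bp upstream of the PAM; neither detail is stated in this chapter, so both are assumptions. The function name is hypothetical, only the given strand is scanned, and off-target effects are not considered.

```python
import re


def find_cas9_targets(seq, guide_len=20):
    """List candidate Cas9 sites on the given strand.

    Returns tuples (start, protospacer, pam, cut_index), where cut_index is the
    0-based position of the first base after the expected blunt cut.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    hits = []
    for match in pattern.finditer(seq):
        start = match.start()
        protospacer, pam = match.group(1), match.group(2)
        # Assumption: the cut falls 3 bp upstream of the PAM.
        cut_index = start + guide_len - 3
        hits.append((start, protospacer, pam, cut_index))
    return hits


if __name__ == "__main__":
    demo = "TT" + "ACGTACGTACGTACGTACGT" + "AGG" + "TT"
    for hit in find_cas9_targets(demo):
        print(hit)
```

The 20-base protospacer returned for each hit is the sequence that would be written into the guide RNA; the PAM itself is part of the genomic target but not of the guide.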
DNA double-strand break (DSB) repair mechanisms
Most DSBs are repaired by either the non-homologous end joining (NHEJ) pathway or the
homology-directed repair (HDR) pathway. The NHEJ repair pathway introduces nucleotide insertions or
deletions (indels) at the cleavage site. In most cases, NHEJ gives rise to small indels in the targeted
DNA; these often cause frameshift mutations that lead to the formation of premature
stop codons inside the open reading frame (ORF) of the targeted gene. The gene is thereby disrupted,
which results in the loss of function of the targeted gene.
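The link between indel length, reading frame and premature stop codons can be checked with a short Python sketch. The sequences are invented for illustration, the function names are hypothetical, and the standard stop codons (TAA, TAG, TGA) are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_frameshift(indel_len):
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_len % 3 != 0


def first_stop_codon(orf):
    """Return the codon index of the first in-frame stop codon, or None."""
    orf = orf.upper()
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def delete_bases(orf, pos, n):
    """Simulate an NHEJ deletion of n bases starting at 0-based position pos."""
    return orf[:pos] + orf[pos + n:]


if __name__ == "__main__":
    wild_type = "ATGGCTGACCGGTAA"            # ATG GCT GAC CGG TAA
    mutant = delete_bases(wild_type, 3, 2)   # 2-bp deletion -> ATG TGA ...
    print(first_stop_codon(wild_type), first_stop_codon(mutant))
```

In this toy ORF the 2-bp deletion brings a TGA into frame directly after the start codon, so the stop moves from codon 4 to codon 1, whereas a 3-bp deletion would remove one codon without shifting the frame.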
Homology-directed repair (HDR) is a process of homologous recombination in which a DNA
template is used for precise repair of a double-strand break (DSB). The template can come from
the cell itself, where a homologous copy is available during the late S and G2 phases of the cell cycle
(before the completion of mitosis), or it can be an exogenous repair template delivered into the cell,
mostly in the form of a synthetic single-stranded DNA donor oligo or a DNA donor plasmid, to generate
a specific change in the genome.
Advantages of the CRISPR-Cas9 system over ZFNs and TALENs
• Highly efficient mutagenesis
• Effective introduction of targeted indels at the required genomic location
• Target efficiency >80%
• In the CRISPR-Cas9 system only one customized sgRNA is required to target a specific sequence;
the same Cas9 can be used for all targeted sequences.
• ZFNs and TALENs, in contrast, require the design and assembly of two nucleases for each target site.
• sgRNAs are short sequences (<100 bases), which reduces complications.
Applications of gene editing tools in fishes
Fish species, especially model species such as the zebrafish, have played important roles
in testing new genome editing protocols because of the biological advantages of fish models.
A large number of genes have been disrupted or modified in fish species for functional studies,
especially genes involved in reproduction. These gene editing technologies can be utilized to modify
the genomes of a variety of industrially relevant organisms and standard research animals, including
zebrafish, rats, pigs and catfish. Genome editing tools can be used to investigate cis-regulatory
mechanisms and to generate gene knockdowns or knockouts, revealing unexplored processes of animal
development and gene function for use in basic and applied sciences. Genome editing can also be
utilized to study early embryogenesis, to induce mutations, to produce knockout lines and to unravel
ancestral features of chordate development. It allows systematic functional analysis of reproductive
performance in fishes, disease resistance, tolerance to environmental stressors, sex determination and
sex differentiation, as well as of genes with non-reproductive functions such as pigmentation,
growth and development, and it can be applied to disease modeling and drug screening.
CRISPR is one of the most useful and powerful tools for gene manipulation in fish, although off-target
mutation is a serious concern. It has been reported that the off-target mutation rate can be reduced by
lowering the concentration of gRNAs in the injection. In zebrafish, genome editing tools have been
applied mainly to induce mutations that give valuable insights for medical science. Disruption of the
myostatin (MSTN) gene, a suppressor of muscle growth, by CRISPR/Cas9 was successfully carried out in
channel catfish, Ictalurus punctatus, with mutagenesis rates of 88–100% in the protein-coding sites of
myostatin. The MSTN-edited fry had more muscle cells, and their mean body weight increased
by 29.7%. Alignment of the mutated sequences against the wild type showed multiple insertions and
deletions (Khalil et al., 2017). In India, the Central Institute of Freshwater Aquaculture successfully
used CRISPR/Cas9 to disrupt the Toll-like receptor 22 (TLR22) gene of Labeo rohita (rohu), which is
involved in innate immunity and is present only in teleost fishes and amphibians; the mutants
lacked TLR22 mRNA expression (Chakrapani et al., 2016). These results confirm that CRISPR/Cas9
is a highly efficient tool for editing the fish genome and opens avenues for fish genetic
enhancement and functional genomics.
Conclusion
Gene editing tools are widely used to study the impact of gene manipulation in humans, animals,
vegetables and fish for various purposes. With high-efficiency gene editing in fishes, we are
entering a new era of powerful technologies for studying multiple gene functions and
improving traits. These technologies will give insights into gene function
and the evolution of vertebrates, and they offer the possibility of treating deadly human diseases in
medical research and of creating improved varieties in agriculture, livestock and aquaculture. In the
aquaculture industry, this approach may pave the way for growth-enhanced fishes that increase productivity.
References
www.Yourgenome.org
www.Genetherapynet.com
Khalil, K., Elayat, M., Khalifa, E., Daghash, S., Elaswad, A., Miller, M., Abdelrahman, H., Ye, Z., Odin,
R., Drescher, D., Vo, K., Gosh, K., Bugg, W., Robinson, D. and Dunham, R., 2017. Generation
of Myostatin Gene-Edited Channel Catfish (Ictalurus punctatus) via Zygote Injection of CRISPR/
Cas9 System. Scientific Reports, 7: 7301.
Zhu, B. and Ge, W., 2018. Genome editing in fishes and their applications. General and Comparative
Endocrinology, 257: 3-12.
Chakrapani, V., Patra, S. K., Panda, R. P., Rasal, K. D., Jayasankar, P. and Barman, H. K., 2016. Establishing
targeted carp TLR22 gene disruption via homologous recombination using CRISPR/Cas9.
Developmental and Comparative Immunology, 61: 242-247.
Wang, K., Ouyang, H., Xie, Z., Yao, C., Guo, N., Li, M., Jiao, H. and Pang, D., 2015. Efficient Generation
of Myostatin Mutations in Pigs Using the CRISPR/Cas9 System. Scientific Reports, 5: 16623.
Crispo, M., Mulet, A. P., Tesson, L., Barrera, N., Cuadro, F., dos Santos-Neto, P. C., Nguyen, T. H.,
Crénéguy, A., Brusselle, L., Anegón, I. and Menchaca, A., 2015. Efficient Generation of Myostatin
Knock-Out Sheep Using CRISPR/Cas9 Technology and Microinjection into Zygotes. PLoS ONE,
10(8): e0136690.
Ophinni, Y., Inoue, M., Kotaki, T. and Kameoka, M., 2018. CRISPR/Cas9 system targeting regulatory genes
of HIV-1 inhibits viral replication in infected T-cell cultures. Scientific Reports, 8: 7784.
Tebas, P., Stein, D., Tang, W. W., Frank, I., Wang, S. Q., Lee, G., Spratt, S. K., Surosky, R. T.,
Giedlin, M. A., Nichol, G., Holmes, M. C., Gregory, P. D., et al., 2014. Gene Editing of CCR5 in Autologous
CD4 T Cells of Persons Infected with HIV. The New England Journal of Medicine, 370: 901-910.
GLOSSARY
• Read – Base pair information of a given length from a DNA or cDNA fragment contained in a
sequencing library. Different sequencing platforms are capable of generating different read
lengths.
• Single End Read – The sequence of the DNA is obtained from the 5’ end of only one strand of
the insert. These reads are typically expressed as 1x “y”, where “y” is the length of the read
in base pairs (e.g. 1x50bp, 1x75bp).
• Paired End Read – The sequence of the DNA is obtained from the 5’ ends of both strands of
the insert. These reads are typically expressed as 2x “y”, where “y” is the length of the read
in base pairs (e.g. 2x100bp, 2x150bp).
• Mate Pair Read – The sequence of the DNA is obtained as for paired-end reads; however,
the DNA insert is often much larger (2-10 kb in length) and the paired
reads originate from a single strand of the DNA insert.
• Depth of Coverage – The number of reads that span a given DNA sequence of interest. This
is commonly expressed as “Yx”, where “Y” is the number of reads and “x” is the unit
of the depth-of-coverage metric (e.g. 5x, 10x, 20x, 100x).
• Sequencing Depth – The amount of sequencing a given sample requires to achieve a certain
depth of coverage. This is frequently expressed as the number of reads a sample requires (e.g.
40 million reads, 80 million reads) or the number of bases of sequencing a sample requires
(e.g. 4 gigabases, 100 megabases).
• SNP/SNV – Refers to a Single Nucleotide Polymorphism or Single Nucleotide Variant
detected in a sample.
• InDels – One or more insertion or deletion events detected in a sample.
• Annotation – Adding biological information to a genome sequence. This is a very complex task,
and the process for doing this is rapidly evolving. Features that are added to the genome
often include gene models, SNPs and STSs.
• Copy Number Variation (CNV) – Large-scale structural changes in DNA that vary from individual
to individual. These include insertions, deletions, duplications and complex multi-site
variants that range from kilobases to megabases in size. CNV can influence gene expression and
phenotypic variation and alter gene dosage, and in certain instances may be associated with
developmental disorders, cause disease or confer susceptibility to complex disease traits.
• EST (Expressed Sequence Tag) – These are single-pass sequences of cDNA clones. Databases of
EST sequences are highly redundant but quite useful for gene identification. There are many
efforts to cluster EST sequences to remove the redundancy and low-quality sequences.
• Haplotype (haploid genotype) – A set of closely linked genetic markers present on one
chromosome that tend to be inherited together. A haplotype may also refer to a set of single
nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated with
one another.
• Reference sequence/genome – A fully assembled version of a genome that can be used for
mapping short DNA sequence reads, for comparisons of genomes from various individuals.
• Contig – A contig (from contiguous) is a set of overlapping DNA segments that together
represent a consensus region of DNA. In sequencing projects, a contig refers to overlapping
sequence data (reads).
• Scaffold – A scaffold is a portion of the genome sequence reconstructed from end-sequenced
whole-genome shotgun clones. Scaffolds are composed of contigs and gaps.
• Specificity – The percentage of sequences that map to the intended targets out of the total bases
per run.
• Homopolymer – An uninterrupted stretch of a single nucleotide type (e.g., TTT or GGGGGG).
• Base Call – Base calling is the process of assigning bases (nucleobases) to chromatogram peaks.
• Homology
  ◦ Ortholog – Orthologous sequences are homologous sequences in different species that
    have a common origin. Orthologs diverge through gradual evolutionary
    modification from the common ancestor and typically perform the same function in different species.
  ◦ Paralog – Paralogous sequences are homologous sequences that exist within a species.
    They have a common origin but arise through gene duplication events, and they often perform
    different functions in the same species.
  ◦ BLAST E-values – The BLAST programs (Basic Local Alignment Search Tool) are a set of
    sequence comparison algorithms introduced in 1990 that are used to search sequence
    databases for optimal local alignments to a query.
  ◦ The E-value represents the number of alignments you would expect to find by chance
    with a score equal to or better than that of the alignment you are looking at. The E-value is
    calculated with the formula E = (query length) * (length of database) * 2^(-S), where S is the
    bit score of the alignment. An E-value of 0.05 or less is generally taken as biologically significant.
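The E-value formula given above can be evaluated directly. The Python sketch below takes S to be the bit score reported by BLAST and uses the 0.05 cut-off quoted in the glossary; the function names and the example numbers are hypothetical.

```python
def blast_e_value(query_len, db_len, bit_score):
    """E = (query length) * (length of database) * 2^(-S), with S the bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def is_significant(e_value, threshold=0.05):
    """Apply the cut-off quoted in the glossary (0.05 or less)."""
    return e_value <= threshold


if __name__ == "__main__":
    # Hypothetical search: 100-residue query against a 1,000,000-residue database.
    for score in (20, 30, 40, 50):
        e = blast_e_value(100, 1_000_000, score)
        print(score, e, is_significant(e))
```

Each additional bit of score halves the E-value, while doubling the database size doubles it, so the same alignment can be significant in a small database and not in a large one.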
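N50: The length of the shortest contig in the smallest set of largest contigs whose combined length
is equal to or greater than half the genome (assembly) size.
L50: The smallest number of contigs whose combined length reaches that half, i.e. the number of
contigs in the set that defines N50.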
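Both statistics come from one pass over the sorted contig lengths. Under the commonly used definition, N50 is the length of the contig at which the running total of the largest contigs first reaches half of the total, and L50 is how many contigs were needed to get there. The Python sketch below assumes the total assembly length stands in for the genome size; the function name and the example lengths are hypothetical.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Assumption: half of the total assembly length is used as the threshold,
    since the true genome size is usually not known exactly.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Total = 290, half = 145; 80 + 70 = 150 reaches the threshold.
    print(n50_l50([80, 70, 50, 40, 30, 20]))
```

A higher N50 together with a lower L50 indicates that more of the assembly sits in a few long contigs, which is why the two numbers are usually reported side by side.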
BLAST type - query and subject
blastn  - query is DNA; subject is DNA
blastp  - query is protein; subject is protein
blastx  - query is nucleic acid that is translated by the program into protein sequences (all
          6 reading frames); subject database is protein
tblastn - query is protein; subject database is DNA translated into protein sequences in all 6
          reading frames
tblastx - query is DNA translated into protein; subject is nucleotide translated into protein.
          Both are translated in all 6 frames. It is very slow relative to the other BLAST
          types.