
Hands on Training AQUACULTURE GENOMICS AND BIOINFORMATICS GENETICS AND BIOTECHNOLOGY UNIT Prepared by ICAR -CENTRAL INSTITUTE OF BRACKISHWATER AQUACULTURE 75, SANTHOME HIGH ROAD, RA PURAM MRC NAGAR, CHENNAI -600 028

Authors:
  • ICAR - Central Institute of Brackishwater Aquaculture
CIBA – TM Series – 2018 – No. 12
Hands on Training
AQUACULTURE GENOMICS AND
BIOINFORMATICS
Organized By
GENETICS AND BIOTECHNOLOGY UNIT
Prepared by
K. VINAYA KUMAR J. ASHOK KUMAR
MISHA SOMAN RAYMOND J ANGEL B. SIVAMANI
P. MAHALAKSHMI SHERLY TOMY
M. S. SHEKHAR
G. GOPIKRISHNA
ICAR – CENTRAL INSTITUTE OF BRACKISHWATER AQUACULTURE
75, SANTHOME HIGH ROAD, RA PURAM
MRC NAGAR, CHENNAI - 600 028
Published by
Dr. K. K. Vijayan
Director, ICAR-CIBA
TABLE OF CONTENTS
Sr. No.  Chapter Title
1   Introduction to Linux Environment
2   Introduction to programming in R
3   Python for Bioinformatics
4   Understanding the Illumina datasets
5   Checking quality of Illumina paired-end sequence data
6   Quality control of RNAseq datasets – NGS QC Toolkit
7   Quality control of RNAseq datasets – Trimmomatic
8   Assembling bacterial genomes
9   RNAseq data analysis in Trinity
10  Phylogenomic analysis using MrBayes
11  Microsatellites genotypes generation by Fragment analysis method
12  Genepop : Population Genetics analysis
13  Population genetic analysis of microsatellite data in Arlequin
14  Soft Computing techniques in Bioinformatics
15  RNAseq data analysis – Genome-guided
16  Application of "OMICS" research in aquaculture with special reference to penaeids
17  Shrimp Genomics : Current status and Challenges
18  Application of Biotechnology in animal reproduction
19  Use of molecular techniques in growth enhancement
20  Gene Editing Tools and their application in Aquaculture
    Glossary
1. Introduction to Linux Environment
J. Ashok Kumar and K. Vinaya Kumar
Linux, an open-source operating system (OS) modelled on Unix, has become the OS of choice worldwide for servers as well as for desktops in academic circles. There are different variants of Linux, which include Red Hat, Ubuntu, Fedora, CentOS, Knoppix etc. Much of the bioinformatics software and many individual programs are native to the Linux OS, so it is important for a bioinformatician to have exposure to Linux commands. Here we give a list of the most commonly used Linux commands and the procedure to execute Perl/Python programs. As advanced programming is beyond the scope of this training, we provide here the basic constructs of Perl/Python programs which could be used for writing scripts for simple bioinformatics tasks.
Linux commands
Accessing the Linux environment: You can access a Linux server using any Windows-based SSH client from your system. This can be achieved by installing WinSCP or PuTTY (both are free software) on your system. Once installed, open WinSCP, fill in the host name, user name and password fields with the details provided by the system administrator and click on the Login button, which will prompt for the password. After successful login, select PuTTY from the menu bar; a console window pops up and you will see a dollar prompt at which you can submit commands for all the operations you wish to perform on the Linux server.
Figure 1. WinSCP login window
Figure 2. Selecting PuTTY from WinSCP
Figure 3. Linux console
The dollar prompt ($) shown in Fig. 3 is for ordinary users, and the hash (#) prompt is displayed for administrators. Only users who have administrative privileges on the server can work at the hash (#) prompt.
File system in Linux: All the folders and files of the Linux system are under the root (/) directory. Users have access to their home directories, for which the path is /home/user_name
Once you log in to the Linux system, by default you are taken to your home directory. For example, if the user name is david, the current directory after login is /home/david. Users can enter their commands after the dollar ($) prompt. Some of the most commonly used Linux commands are given in the table below.
Function                                          Command
Listing the file names                            $ ls
Listing file names along with other details       $ ls -l
Change to preexisting directory named 'test'      $ cd test
Make a new directory named 'trial'                $ mkdir trial
Viewing a preexisting file                        $ vi mydata.txt
                                                  $ nano mydata.txt
                                                  $ more mydata.txt
                                                  $ cat mydata.txt
Creating a new file                               $ touch myfile.txt
                                                  $ vi myfile.txt
                                                  $ nano myfile.txt
Renaming or moving the file                       $ mv file1.txt file2.txt
                                                  $ mv /home/ram/file1.txt /home/ram/test/
Making duplicate of file                          $ cp file1.txt file2.txt
                                                  $ cat file1.txt > file2.txt
Appending two text files                          $ cat file1.txt file2.txt > file3.txt
To display date                                   $ date
To find number of lines in a file                 $ wc -l xyz.txt
To display first (top) 100 lines of a file        $ head -100 xyz.txt
To display last (bottom) 100 lines of a file      $ tail -100 xyz.txt
Search for a pattern in a file                    $ grep "pattern" file.txt
Search for pattern at beginning of line           $ grep '^pattern' file.txt
Search for pattern at the end of a line           $ grep 'pattern$' file.txt
Search for only pattern in the line               $ grep '^pattern$' file.txt
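The four grep searches in the table differ only in the anchors ^ (start of line) and $ (end of line). As a minimal sketch in Python 3, the same behaviour can be reproduced with the standard re module; the helper name grep and the example lines are made up for illustration and are not part of any tool described here.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical file contents, one string per line.
lines = ["ATGCCC", "CCCATG", "ATG", "GGATGG"]

anywhere = grep("ATG", lines)       # like: grep "ATG" file.txt
at_start = grep("^ATG", lines)      # like: grep '^ATG' file.txt
at_end = grep("ATG$", lines)        # like: grep 'ATG$' file.txt
whole_line = grep("^ATG$", lines)   # like: grep '^ATG$' file.txt

print(anywhere, at_start, at_end, whole_line)
```

Only the line "ATG" satisfies all four searches, because it both starts and ends with the pattern.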
Running Perl/Python programs
Perl program files have the extension ".pl". The command to execute a program is
$ ./test_programme.pl
or
$ perl test_programme.pl
The first form works only if the file has execute permission and starts with a '#!' interpreter line. The options of a program may be checked from the help files of the software/program.
In the same way, Python program files have the ".py" extension and can be executed by giving the following command.
$ python test_programmes.py
Standalone BLAST
NCBI BLAST is used for comparing nucleotide and protein sequences with sequence databases to find significant matches. Alignment of sequences using BLAST can be done either by using the web tool available on the NCBI site or by installing BLAST on local servers.
BLAST can be installed on local servers along with the databases available in the public domain. In addition, users can make their own databases on local servers. If you have your own protein dataset, a local database can be created by
$ makeblastdb -in xyz.fasta -dbtype prot -out xyzdb
Now you can run BLAST using your own database
$ blastp -db xyzdb -query abc.fasta -out out.fasta
A more general BLAST command (for blastn the database must be a nucleotide database, i.e. one built with -dbtype nucl)
$ blastn -query nucl.fasta -db xyzdb -outfmt 6 -evalue 1e-05 -out output.txt
For fetching the sequences of the hits in fasta format, make a file with the IDs of the hits and run the following command
$ fastacmd -d database_name -i blast_output > hits.fasta
(fastacmd belongs to the legacy BLAST package; in BLAST+ the corresponding tool is blastdbcmd, used with the -db and -entry_batch options.)
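The file of hit IDs mentioned above can be prepared from the tabular output written by -outfmt 6. The sketch below (Python 3) assumes the default 12-column layout of that format, in which the second column is the subject ID and the eleventh is the e-value; the function name hit_ids and the example rows are hypothetical.

```python
import csv
import io

def hit_ids(tabular_text, max_evalue=1e-5):
    """Collect unique subject IDs from BLAST tabular (-outfmt 6) text."""
    ids = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or incomplete lines
        subject_id, evalue = row[1], float(row[10])
        if evalue <= max_evalue and subject_id not in ids:
            ids.append(subject_id)
    return ids

# Made-up example rows: query, subject, % identity, length, mismatches,
# gap opens, q.start, q.end, s.start, s.end, e-value, bit score.
example = (
    "q1\tsubjA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t350\n"
    "q1\tsubjB\t80.0\t150\t30\t0\t1\t150\t9\t158\t0.002\t60\n"
    "q2\tsubjA\t95.0\t180\t9\t0\t1\t180\t3\t182\t2e-40\t300\n"
)
print(hit_ids(example))
```

In practice the text would be read from output.txt and the resulting IDs written one per line to the ID file.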
2. Introduction to programming in R
J. Ashok Kumar, K. Vinaya Kumar and B. Sivamani
R is a programming environment for data analysis and graphics. The language was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics at the University of Auckland. Since its birth, a number of people have contributed to the package. It is open-source statistical software which can be downloaded free of cost. The base package and all the contributed packages can be downloaded from http://www.r-project.org/
R is available for all major operating systems such as Windows, Linux and Mac OS. This training material is based on the R stats package installed on the Windows operating system.
Invoking R stats
Start → All programmes → R → R i386 3.2.0 (for 32-bit installation)
Start → All programmes → R → R x64 3.2.0 (for 64-bit installation)
R stats graphical user interface in Windows
Procedure to install additional packages
We need to add additional libraries to the base installation to utilize the full potential of R. This can be achieved by the following command (R is case sensitive, so the command must be typed in lower case).
install.packages('name of the package')
Once the above command is executed, R asks the user to select a CRAN mirror out of several listed mirrors. The user can select a mirror at any location.
There is a package/library called 'Rcmdr' which can be used for carrying out the most commonly used statistical procedures with a graphical user interface. The command to install 'Rcmdr' is
install.packages('Rcmdr')
Command to invoke the Rcmdr
library('Rcmdr')
RStudio
RStudio is an integrated development environment (IDE) for R. This IDE features an R notebook for writing scripts, a console for command input, a graphics viewer, a package window and an environment window, all in a single framework.
R les input and output.
First set the working directory
Command to know the locaon of present working directory is
Ø getwd()
Command to set the working directory to any other folder
Ø setwd(“E:/data/”)
Basic command to read the les is
Ø read.table()
and command to create the data les is
Ø write.table()
Imporng data
Data with dierent le formats i.e., text les, excel les, SPSS data les, SAS data les etc., can
be input into R stats for data analysis. It is advised that excel les may rst be converted to comma
separated les for easy input into R stats.
Command to read a comma separated text le with variable names in the rst row
Ø Data <- read.table(‘lename’, header=TRUE, sep=”,”)
Here lename is name of the text le with extension, header statement is to specify whether
variable names are included in the rst row of the data le and ‘sep’ parameter tells the separator
present between variables (columns) like comma, space, tab etc., in the le.
If the specied text le is not in present working directory and you wish to select it though
graphical interface use the following command
Ø Data <- read.table( le.choose(), header=TRUE, sep=”,”)
Upon entering the above command a le selector window will pop up and one can select the le
located at any drive/directory/folder other than the present working directory.
Popup window for selecting files
For other text files, such as space-separated and tab-separated files, one needs to change only the 'sep' parameter of the above command to either " " or "\t".
In the previous command 'data' is a dataframe which contains all the variable names and data.
Data in the dataframe can be edited and the changed contents assigned to another dataframe
> data1 <- edit(data)
Upon entering the above command a popup window appears for editing the data, and all the edits are saved in a dataframe called 'data1'
Data editor window
Exporting data
Data in the dataframe can be exported as a text file with the following command
> write.table(data, file="xyz.csv", col.names=TRUE, sep=",")
Creating data files manually within R stats
Data files can be created within R stats by giving simple commands.
Here we explain creating an example table with variable names in R stats
S.No  Bodyweight  Length  Species
1     25          15      aa
2     35          14      ab
3     65          27      ac
4     27          18      bb
5     45          22      cc
The above table can be created as a dataframe by giving the following commands
> bodyweight <- c(25,35,65,27,45)
> length <- c(15,14,27,18,22)
> species <- c("aa","ab","ac","bb","cc")
> lengthweight <- data.frame(bodyweight, length, species) # cbind() would give a character matrix instead of a dataframe
Descriptive statistics
Suppose we have a variable named 'x' and our task is to calculate the descriptive statistical parameters such as mean, median, standard deviation, variance etc. for the variable x in R stats. First create the variable x by giving the following command
> x <- c(20,15,19,22,26,24,23,17,18,22)
Another way of creating the variable 'x' is
> x <- scan()
1: 20 15 19 22 26 24 23 17 18 22
11:
Read 10 items
Basic commands for descriptive statistics
> mean(x) # mean
> median(x) # median
> var(x) # sample variance
> sd(x) # sample std. deviation
> quantile(x, p) # sample quantile; p could be 0.25, 0.5, 0.75
> min(x) # minimum of x
> max(x) # maximum of x
> range(x) # range of x
> library(e1071)
> skewness(x) # skewness
> kurtosis(x) # kurtosis
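For readers who also work in Python (introduced in the next chapter), the first few descriptive statistics for the same values of x can be cross-checked with the standard statistics module. This is only a sketch in Python 3 for comparison with the R commands above; the variable names are chosen for illustration.

```python
import statistics

# The same ten observations used for the variable x above.
x = [20, 15, 19, 22, 26, 24, 23, 17, 18, 22]

mean_x = statistics.mean(x)        # arithmetic mean
median_x = statistics.median(x)    # middle value of the sorted data
var_x = statistics.variance(x)     # sample variance (divides by n - 1)
sd_x = statistics.stdev(x)         # sample standard deviation

print(mean_x, median_x, var_x, sd_x)
print(min(x), max(x))              # minimum and maximum, as range(x) gives in R
```

Like var() and sd() in R, statistics.variance and statistics.stdev use the sample (n - 1) denominator.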
Commands for statistical tests
Single sample t-test
> t.test(y, mu=10)
here y is a variable; mu is the population mean
Two sample t-test
> t.test(y1, y2, var.equal=TRUE)
y1 and y2 are the two independent samples
Paired t-test
> t.test(y1, y2, paired=TRUE)
y1 and y2 are the two paired samples
Chi-square test
> n <- cbind(y1, y2)
> chisq.test(n)
n is a data matrix / contingency table; with a matrix, chisq.test() tests the independence of rows and columns, whereas a goodness-of-fit test is run on a single vector of counts, e.g. chisq.test(y1)
Correlation
> n <- cbind(y1, y2) # create matrix n
> cor(n)
where y1 and y2 are two variables and n is the matrix of y1 and y2
Regression
> fit <- lm(y~x)
for multiple regression
> fit <- lm(y~x1+x2+x3)
Completely randomised design
> tr <- c(1,1,1,2,2,2,3,3,3) # create treatment variable
> yield <- c(25,41,54,65,45,65,25,12,35) # create dependent variable
> fit <- aov(yield ~ factor(tr)) # model statement
> summary(fit)
Randomised Block Design
> tr <- c(1,1,1,2,2,2,3,3,3) # create treatment variable
> rep <- c(1,2,3,1,2,3,1,2,3) # create replication variable
> yield <- c(25,41,54,65,45,65,25,12,35) # create dependent variable
> fit <- aov(yield ~ factor(tr) + factor(rep))
> summary(fit)
Two way factorial Design
> fit <- aov(yield ~ factor(A) + factor(B) + factor(A):factor(B) + factor(rep))
> summary(fit)
Installing Bioconductor in R
Enter the following commands in the R console to install Bioconductor packages.
source("http://bioconductor.org/biocLite.R")
biocLite()
(From R 3.5 onwards biocLite has been replaced by the BiocManager package: install.packages("BiocManager") followed by BiocManager::install().)
Steps in manipulating fasta files
First load the library
> library(seqinr)
Set the working directory in R to the folder where the fasta files are stored
> setwd("c:/path/to/directory")
> seq1 <- read.fasta("sequence.fasta")
> seq1.seq <- seq1[[1]] # to take the first sequence from the fasta file
> length(seq1.seq) # to find the length of the sequence (bases)
> table(seq1.seq) # to find the frequency of each base
> GC(seq1.seq) # to find the GC content of the sequence
Several advanced options are available in R, ranging from simple sequence analysis to microarray data analysis. The purpose of this chapter is to introduce the R environment and to provide hands-on experience for exploring the functionalities available in R.
3. Python for Bioinformatics
J. Ashok Kumar and K. Vinaya Kumar
Python is one of the most popular high-level general-purpose programming languages. It was developed by Guido van Rossum, a Dutch programmer, and first released in 1991. It is an open-source programming language available for download at www.python.org. In recent years it has gained a lot of importance due to the development of several libraries applicable to various fields of research and development. One such library widely used in bioinformatics is Biopython. Here we introduce the Python environment for writing scripts and provide a glimpse of Biopython functionalities.
Installation of Python
Python is available for both Windows and Linux platforms. Windows/Linux binaries can be obtained from www.python.org. In Windows you may double-click on the exe file and accept the default installation settings to get it installed in the system. Once installed, go to edit environment variable → Advanced → environment variables and add the new Python path as shown in the figure.
Now you can open the command line interface in Windows by typing 'cmd' in the search box on the taskbar and pressing Enter.
On most Linux installations Python comes with the default installation. If not available, it can be installed on Debian/Ubuntu systems by keying in the following command
$ sudo apt-get install python
Installing pip
Pip is the package manager for Python. To install pip, download get-pip.py from https://pip.pypa.io/en/stable/installing/ and enter the following command.
$ python get-pip.py
Once pip is installed, any Python package can be installed by the following command
$ pip install package-name
Installing Jupyter
Jupyter is a notebook application for Python wherein one can write scripts, execute them and save the notebooks in different formats like pdf, doc for future use. Run the following commands for installing and opening the Jupyter notebook
$ pip install jupyter # install the jupyter package
$ python -m IPython notebook ## opening notebook in Windows
$ jupyter notebook ## opening notebook in Linux
One can install required additional packages like matplotlib for plotting graphs, numpy for numerical calculations, pandas for data structures and data analysis tools, statsmodels for statistical analysis, and scipy for mathematical and scientific applications. All these can be installed using pip.
Introduction to Python programming
> print("hello world") ## printing a text
hello world
> text1 = "CIBA" # text1 is a string variable
> a = 20 # a is a numeric variable having value 20
> b = 30
> a+b
50
> b-a
10
> a*b
600
> a/b # integer division in Python 2; Python 3 returns 0.666...
0
> a/float(b)
0.666
> a**b # a to the power of b (shown here in scientific notation)
1.073741824e+39
For mathematical functions
> import math
> math.log(a)
2.995
> math.cos(a)
0.408
Arrays (lists) in Python
> a = []
> a = ["hi","this","is","python"]
> a[2] # indexing starts at 0, so this is the third element
'is'
Declaring a dictionary
> dict1 = {"apple": 250, "banana": 100, "cherry": 300}
> dict1.keys() # Python 2 output; the key order is arbitrary there
['cherry', 'apple', 'banana']
> dict1.values()
[300, 250, 100]
> dict1["cherry"]
300
Programming loops
> for i in range(0,10):
      print(i)
> j = 1
> while (j < 10):
      print(j)
      j = j+1
Functions
> def f2c(x): # convert a temperature from Fahrenheit to Celsius
      return (x-32)*5/9.0
Read and write files
> inp = open("input.txt", 'r')
  out = open("output.txt", 'w')
  for line in inp:
      if line[0] == ">":
          out.write(line)
  inp.close()
  out.close()
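The same read-and-write pattern can be written with Python's with statement, which closes both files automatically even if an error occurs part-way. The following is a minimal sketch in Python 3; the function name copy_fasta_headers and the small temporary test file are made up for illustration.

```python
import os
import tempfile

def copy_fasta_headers(in_path, out_path):
    """Copy lines starting with '>' from in_path to out_path; return their count."""
    count = 0
    with open(in_path) as inp, open(out_path, "w") as out:
        for line in inp:
            if line.startswith(">"):
                out.write(line)
                count += 1
    return count

# Demonstration on a small temporary file with two hypothetical sequences.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")
out_path = os.path.join(workdir, "output.txt")
with open(in_path, "w") as handle:
    handle.write(">seq1\nATGC\n>seq2\nGGCC\n")

n_headers = copy_fasta_headers(in_path, out_path)
with open(out_path) as handle:
    headers = handle.read()
print(n_headers)
print(headers)
```

Using line.startswith(">") instead of line[0] also avoids an error on completely empty lines.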
Biopython
Biopython is a set of computational tools used for bioinformatics analysis. Biopython can be used to parse different files like fasta, BLAST output, GenBank and ExPASy; to execute online tools like NCBI BLAST, Entrez etc.; and it provides code for sequence alignment, multiple sequence alignment, phylogeny and even machine learning classification methods like naïve Bayes, k-nearest neighbours, support vector machines etc. The Biopython library can be installed through the pip installation method.
$ pip install biopython (or python -m pip install biopython in Windows)
> import Bio
> from Bio.Seq import Seq
> seq1 = Seq("ATGCGGATC")
> seq1
Seq('ATGCGGATC', Alphabet())
> seq1.complement()
Seq('TACGCCTAG', Alphabet())
> seq1.reverse_complement()
Seq('GATCCGCAT', Alphabet())
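To see what complement and reverse complement do to a sequence, the operations can be sketched in plain Python 3 without Biopython. The function names below are chosen for illustration and handle only the four unambiguous DNA letters; gc_fraction is an extra helper, not a Biopython call.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Return the complementary strand, read in the same direction."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return complement(seq)[::-1]

def gc_fraction(seq):
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

dna = "ATGCGGATC"
print(complement(dna))
print(reverse_complement(dna))
print(gc_fraction(dna))
```

For the example sequence ATGCGGATC these functions give the same complement and reverse complement strings as the Seq methods shown above.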
Parsing a fasta file
> from Bio import SeqIO
> for seq_record in SeqIO.parse("sequence.fasta", "fasta"):
      print(seq_record.id)
      print(repr(seq_record.seq))
      print(len(seq_record))
Dierent system commands can also be executed from python using following commands
Øimport os
Øcom = “blastn – query seq.fasta –db nr –out out.txt”
Øos.system(com)
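An alternative to os.system is the standard subprocess module, which takes the command as a list of arguments, so file names containing spaces need no quoting. The sketch below is written for Python 3; the helper names blastn_command and run_command, and the file and database names passed to them, are placeholders for illustration.

```python
import shutil
import subprocess

def blastn_command(query, database, output):
    """Build the blastn argument list (names are supplied by the caller)."""
    return ["blastn", "-query", query, "-db", database, "-out", output]

def run_command(cmd):
    """Run cmd if its program is on PATH; return the exit code, else None."""
    if shutil.which(cmd[0]) is None:
        return None  # program not installed on this machine
    return subprocess.run(cmd).returncode

cmd = blastn_command("seq.fasta", "nt", "out.txt")
print(" ".join(cmd))
# To actually launch the search: exit_code = run_command(cmd)
```

An exit code of 0 conventionally means that the program finished without error.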
4. Understanding the Illumina datasets
K. Vinaya Kumar and J. Ashok Kumar
With continual improvements made over the past few years, Next Generation Sequencing (NGS) platforms have come a long way in generating enormous amounts of sequence data at low cost and in less time. NGS platforms such as Illumina, PacBio, Nanopore and Ion Torrent are well known, with several published manuscripts reporting their use. A feature common to all these platforms is massively parallel sequencing of single or clonally amplified DNA molecules. Of the different platforms available to date, the one offered by Illumina stands apart in terms of the amount of sequence data generated and the cost involved. In the case of Illumina, right from the Genome Analyzer IIx, the HiSeq XXXX series, the MiSeq and the NextSeq XXX series to the latest NovaSeq 6000, there has been an improvement in data output along with a reduction in sequencing time.
There are two popular sequencing chemistries of the Illumina platform, namely paired-end (PE) and mate-pair (MP), that are commonly used by researchers. PE sequencing is used for RNAseq studies, where we find differentially expressed transcripts in experimental samples compared to a control sample. MP sequence reads are mostly used in the assembly of whole genomes, where they play an important role in scaffolding the contigs. In this chapter we look at the structure of paired-end sequence datasets generated on the Illumina platform. The raw sequence data files generated on the Illumina platform are delivered as '.fastq' files. For every sample, two files are provided, one with read_1 or forward sequence reads and the other with read_2 or reverse sequence reads. The order of reads in the forward and reverse files should not be altered, as the reads are linked by their position in the files.
Open the WinSCP tool. The following window appears. Enter the host name as told by the tutor. Enter the 'user name' and 'password' to log in to your account.
After logging in, the window of the WinSCP tool appears. The window has two panels. The left panel is the file system of your computer. The right panel is the file system of your account on the server.
Click on the icon displaying 'two connected computers' in the top toolbar to open the PuTTY window. In this window you run your jobs on the server. Enter the login credentials at the prompt. Then browse to the folder where a file with the extension '.fastq' is present. Then type in the command 'head file.fastq' to see the first few lines of the file.
You will find that the information about each sequence read is represented in four lines.
Line 1: has information about instrument ID, run ID, flow cell ID, lane ID, tile ID, X and Y coordinates of the cluster, read number, whether the read is filtered or not, control sample status etc.
Line 2: the sequence of the read, which is the familiar A, T, G and C
Line 3: a plus (+) sign
Line 4: the quality scores of the sequence bases
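The four-line structure described above can be read programmatically. The following minimal sketch in Python 3 assumes plain four-line records in which line 1 begins with '@' (as in standard FASTQ files); the function name read_fastq and the two example reads are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality

# Two hypothetical reads; real headers carry the instrument and run details.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(read_fastq(io.StringIO(example)))
for name, seq, qual in records:
    print(name, seq, qual)
```

With a real file, the handle would come from open("file.fastq") instead of io.StringIO.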
You may visit the following page to understand more about the quality scores.
https://www.illumina.com/documents/products/technotes/technote_understanding_quality_scores.pdf
The symbols in line 4 represent the quality scores of bases. The quality scores range from 0 to 40. A score of 40 indicates that the base called is of high quality: the corresponding error probability implies that one base call in 10,000 would be incorrect. The following table illustrates the relation between the symbols and the corresponding quality scores.
Table. List of symbols corresponding to quality scores of bases in Illumina sequence datasets.
Symbol  Quality Score  Symbol  Quality Score
!       0              6       21
"       1              7       22
#       2              8       23
$       3              9       24
%       4              :       25
&       5              ;       26
'       6              <       27
(       7              =       28
)       8              >       29
*       9              ?       30
+       10             @       31
,       11             A       32
-       12             B       33
.       13             C       34
/       14             D       35
0       15             E       36
1       16             F       37
2       17             G       38
3       18             H       39
4       19             I       40
5       20
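The table follows a simple rule: each symbol's quality score is its ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). A short sketch in Python 3 makes the rule explicit; the function names are chosen for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Convert a line of quality symbols to a list of quality scores."""
    return [ord(symbol) - offset for symbol in quality_line]

def error_probability(score):
    """Probability that a base with the given quality score is wrong."""
    return 10 ** (-score / 10)

print(phred_scores("!5?I"))        # symbols for scores 0, 20, 30 and 40
print(error_probability(40))       # about 1 error in 10,000 base calls
print(error_probability(20))       # about 1 error in 100 base calls
```

This reproduces the table row by row, for example '5' for a score of 20 and 'I' for a score of 40.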
5. Checking quality of Illumina paired-end sequence data
K. Vinaya Kumar and J. Ashok Kumar
Illumina paired-end (PE) sequencing reads are commonly used for RNAseq studies and for assembling genomes. For each sample, the sequencing machine writes the output data to two paired .fastq files. In this chapter, we discuss the quality issues pertaining to PE reads. A better understanding of these helps in better planning of read processing to extract quality data for further studies.
One of the basic software tools useful for understanding the quality of a PE reads file is 'FastQC'. Visit the following site to download the latest version of the software.
https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find a file named a1F.fastq. We shall check the quality of this file using the FastQC tool. To do this, run the following command at your prompt.
$ fastqc a1F.fastq
In less than two minutes, the analysis will be completed and two output files are written, a1F_fastqc.html and a1F_fastqc.zip. Save these files to your computer and open the .html file in any browser. Check all the images and understand their meaning. Look carefully at the following aspects in the file.
Box plot of quality scores along the sequence read length.
5.1. The reads are contaminated with adapter sequences used during sequencing.
The quality report calls for some data processing, which includes:
1. Removal of poor quality reads that are pulling down the average of quality scores.
2. Removal of poor quality bases at the start and end of sequence reads.
3. Removal of adapter sequences contaminating the reads.
6. Quality control of RNAseq datasets – NGS QC Toolkit
K. Vinaya Kumar and J. Ashok Kumar
There are several freeware tools available for processing paired-end sequence reads. In this chapter we shall use NGS QC Toolkit for quality control of PE reads. First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find two files named a1F.fastq and a1R.fastq. Check the quality of both the paired files using the FastQC tool. Practice the following quality control steps and observe the changes in quality of the trimmed files.
6.1 Discarding low quality reads
perl IlluQC_PRLL.pl -pe a1F.fastq a1R.fastq 2 A -l 70 -s 20 -c 50
This command removes all those reads where the proportion of bases having a quality of > 20 is less than 70%. After the run, you will find that a folder 'IlluQC_Filtered_files' has been created. The filtered files are present in this folder. Do a quality check of these two files with FastQC. Observe the changes in the reads files after running this command.
After discarding about 3 million reads completely, the average quality of bases improved. Thus the improvement in quality came at the expense of losing about 30 % of the sequence reads.
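The filtering rule of step 6.1 can be expressed in a few lines. The sketch below (Python 3) only illustrates the rule as stated above, a read is kept when at least 70% of its bases have a quality above 20; it is not the toolkit's own code. The function names, the Phred+33 encoding and the rule that a pair is kept only when both mates pass are assumptions made for illustration.

```python
def passes_filter(quality_line, min_quality=20, min_percent=70, offset=33):
    """True if at least min_percent of bases have quality above min_quality."""
    scores = [ord(symbol) - offset for symbol in quality_line]
    if not scores:
        return False
    high = sum(1 for score in scores if score > min_quality)
    return 100.0 * high / len(scores) >= min_percent

def keep_pair(forward_quality, reverse_quality):
    """Assumed pairing rule: keep a read pair only if both mates pass."""
    return passes_filter(forward_quality) and passes_filter(reverse_quality)

print(passes_filter("IIIIIII!!!"))   # 7 of 10 bases are high quality
print(passes_filter("IIIIII!!!!"))   # only 6 of 10 bases are high quality
print(keep_pair("IIII", "!!!!"))     # the reverse mate fails
```

Raising min_quality or min_percent discards more reads, which is the trade-off between quality and data loss noted above.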
6.2 Discarding poor quality bases at both ends based on length
perl TrimmingReads.pl -i a1F.fastq -irev a1R.fastq -l 3 -r 30
This command removes 3 bases at the 5' end and 30 bases at the 3' end from all reads.
6.3 Discarding poor quality bases at 3' end of reads based on quality score
perl TrimmingReads.pl -i a1F.fastq -irev a1R.fastq -q 30
This command removes bases at the 3' ends where the base quality is <30. This improvement in quality at the ends comes at the expense of some reads getting shorter.
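Quality-based trimming at the 3' end can be pictured as walking backwards from the last base and dropping bases until one with adequate quality is met. The sketch below (Python 3) illustrates that idea only; the function name trim_3prime, the Phred+33 encoding and the exact stopping rule are assumptions, and the toolkit's own implementation may differ in detail.

```python
def trim_3prime(sequence, quality_line, min_quality=30, offset=33):
    """Drop 3' end bases whose quality is below min_quality."""
    end = len(sequence)
    # Move the cut point leftwards while the last kept base is of poor quality.
    while end > 0 and ord(quality_line[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]

print(trim_3prime("ACGTAC", "IIII5!"))   # the last two bases are removed
print(trim_3prime("ACGT", "I!II"))       # a poor base in the middle is kept
```

Because only the 3' end is examined, low-quality bases inside the read are left in place, which is why reads get shorter rather than being split.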
6.4 Discarding reads based on read length
perl TrimmingReads.pl -i a1F_7020.fastq -irev a1R_7020.fastq -n 25
This command removes reads shorter than 25 bases in length.
A combination of these steps could be chosen and applied based on the initial base quality of the sequence datasets. Extract only the good quality data for downstream processing of reads.
7. Quality control of RNAseq datasets – Trimmomatic
K. Vinaya Kumar and J. Ashok Kumar
There are several freeware tools available for processing paired-end sequence reads. In this chapter we shall use 'Trimmomatic' for quality control of PE reads. First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find two files named a1F.fastq and a1R.fastq. Check the quality of both the paired files using the FastQC tool. Run the following command (typed as a single line, with the arguments separated by spaces) and observe the changes in quality of the trimmed files.
The command
java -jar trimmomatic-0.36.jar PE -threads 70 -trimlog a1.txt a1F.fastq a1R.fastq a1F_P.fastq a1F_S.fastq a1R_P.fastq a1R_S.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:13 SLIDINGWINDOW:4:15 MINLEN:100
De-coding the command
Each argument in the command serves to improve the quality of the trimmed files. It is important to check the initial quality of the sequence data and then apply the relevant arguments to improve the quality.
Argument                              Meaning
PE                                    Paired-end mode. Use this for processing PE reads data.
-threads                              The argument to specify the number of threads. Trimmomatic supports
                                      running with multiple threads.
-trimlog                              To specify a file name that stores the log of the run.
a1F.fastq                             Input file name of forward or R1 reads.
a1R.fastq                             Input file name of reverse or R2 reads.
a1F_P.fastq                           Output file name of trimmed forward or R1 reads whose mates also
                                      survived. This file is used for subsequent analysis.
a1F_S.fastq                           Output file containing surviving forward reads of good quality whose
                                      paired sequences in the R2 file were discarded.
a1R_P.fastq                           Output file name of trimmed reverse or R2 reads whose mates also
                                      survived. This file is used for subsequent analysis.
a1R_S.fastq                           Output file containing surviving reverse reads of good quality whose
                                      paired sequences in the R1 file were discarded.
ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10  ILLUMINACLIP is used to remove adapter sequences from reads.
                                      TruSeq3-PE-2.fa is the file containing the adapter sequences.
LEADING:3                             To remove bases at the start of the read, if quality is below 3.
TRAILING:13                           To remove bases at the end of the read, if quality is below 13.
SLIDINGWINDOW:4:15                    This argument trims reads based on base quality. Each read is
                                      scanned from the 5' end. Four contiguous bases are taken as a
                                      window. The average quality of every window in a read should be at
                                      least 15. Otherwise, the read is trimmed from the poor quality
                                      window to the 3' end of the read.
MINLEN:100                            To discard reads shorter than 100 bases after performing all the
                                      steps.
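The sliding-window idea described for SLIDINGWINDOW:4:15 and the length check of MINLEN can be sketched as follows in Python 3. This is a simplified illustration of the description above, not Trimmomatic's actual implementation, and the function names and example scores are hypothetical.

```python
def sliding_window_trim(sequence, scores, window=4, min_average=15):
    """Cut the read at the first window whose average quality is too low."""
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_average:
            # Keep only the bases before the poor-quality window.
            return sequence[:start], scores[:start]
    return sequence, scores

def survives_minlen(sequence, min_length=100):
    """True if the trimmed read is long enough to be kept."""
    return len(sequence) >= min_length

# A hypothetical 9-base read whose quality collapses after the fifth base.
read = "ACGTACGTA"
quals = [40, 40, 40, 40, 40, 10, 10, 10, 10]
trimmed_read, trimmed_quals = sliding_window_trim(read, quals)
print(trimmed_read, trimmed_quals)
print(survives_minlen(trimmed_read, min_length=5))
```

Averaging over a window makes the trimming tolerant of a single poor base inside an otherwise good read, unlike a strict per-base cut-off.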
Run FastQC on the trimmed files
Below is the quality of the forward sequence reads before (left) and after (right) trimming.
Below is the quality of the reverse sequence reads before (left) and after (right) trimming.
The adapters contained in the reads are also trimmed. These trimmed files would be taken up for finding differentially expressed transcripts. The single-end good quality reads are also used in the case of assembling genomes.
8. Assembling bacterial genomes
J. Ashok Kumar and K. Vinaya Kumar
Genome sequencing forms the basis for understanding the biology and functional characterization of microorganisms. Recent advances in shotgun sequencing have paved the way to generate genome sequences with time and cost advantages. Here we discuss whole genome assembly with paired-end sequence reads generated on the Illumina platform. First we describe the steps involved in de novo assembly of a bacterial genome using the MaSuRCA assembler, and later we look into the steps involved in reference-based assembly using Bowtie2.
Download MaSuRCA (Maryland Super Read Cabog Assembler)
The MaSuRCA assembler can be downloaded from http://www.genome.umd.edu/masurca.html. Once it is downloaded, keep the archive in your directory and extract the tarball using the following command.
user@server$ tar -zxvf MaSuRCA-3.2.6.tar.gz
This will extract the files into the folder MaSuRCA-3.2.6. You will find all the executable programs in the bin subfolder of the MaSuRCA-3.2.6 folder.
Preparing Illumina sequence reads
Copy the Illumina paired-end sequence reads into a folder. There will be two files, one for the forward reads and the other for the reverse reads, for example vibgenome_R1.fastq and vibgenome_R2.fastq. These fastq files need to be quality checked and corrected using tools like FastQC, cutadapt and Trimmomatic.
Preparing the MaSuRCA configuration file
You will find a sample configuration file (sr_config_example.txt) in the installation directory, which needs to be edited with the assembly parameters. There are two sections in the configuration file: the DATA section and the PARAMETERS section.
In the DATA section, options are available to specify paired-end (PE), mate-pair (JUMP), PACBIO and OTHER (Celera assembler) reads. Data from multiple libraries of the same read type can be given on multiple lines.
For paired-end reads the following line of the DATA section needs to be edited.
PE= aa 180 20 /FULL_PATH/frag_1.fastq /FULL_PATH/frag_2.fastq
PE: paired-end; aa: two-letter prefix; 180: average insert length; 20: standard deviation of insert length.
In the PARAMETERS section the mandatory parameters that need to be edited are NUM_THREADS and JF_SIZE.
NUM_THREADS is the number of threads allotted for the assembly task. Example: NUM_THREADS=16
JF_SIZE is the jellyfish hash size. Set this to about 10x the genome size; alternatively it can be set to the genome size multiplied by the coverage.
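As a small worked example of the two rules for JF_SIZE, the sketch below (Python 3) computes both values for a hypothetical 5 Mb bacterial genome sequenced at 100x coverage; the genome size, the coverage and the function names are assumptions chosen only for illustration.

```python
def jf_size_from_genome(genome_size_bp, factor=10):
    """Rule of thumb: about 10 times the genome size."""
    return genome_size_bp * factor

def jf_size_from_coverage(genome_size_bp, coverage):
    """Alternative rule: genome size multiplied by the coverage."""
    return genome_size_bp * coverage

genome_size = 5000000   # hypothetical genome of 5 million bases
coverage = 100          # hypothetical sequencing depth

print("JF_SIZE=%d" % jf_size_from_genome(genome_size))
print("JF_SIZE=%d" % jf_size_from_coverage(genome_size, coverage))
```

The first rule gives 50,000,000 and the second 500,000,000 for these inputs; either value would be typed into the PARAMETERS section in place of the one in the sample file.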
De novo assembly using MaSuRCA
Command to run the MaSuRCA assembly is
user@server$ /path/to/bin/masurca /path/to/config.txt
This will generate the 'assemble.sh' file in the current location. Now we need to run this shell script to complete the assembly
user@server$ sh assemble.sh
Successful completion of the assembly will create several files. Look for the directory named CA; within that folder you will see the 10-gapclose subfolder, wherein you will find the final assembled output. The output files are 'genome.ctg.fasta' for the contig sequences and 'genome.scf.fasta' for the scaffold sequences.
Reference based assembly using bowtie2
In reference based assembly, reads are mapped to a reference genome to identify variations such as single nucleotide polymorphisms (SNPs), insertions and deletions (indels) and copy number variants, which can then be used in analyses such as genome wide association studies (GWAS).
The steps involved in reference based assembly are listed below with the command for running each step.
- Indexing a reference genome
$ bowtie2-build V_para_GCA_000328405.1.fna vibindex
- Aligning reads
$ bowtie2 -x vibindex -1 V-Para-DNA_R1.fastq -2 V-Para-DNA_R2.fastq -S align1.sam
- Convert the SAM file to a BAM file
$ samtools view -bS align1.sam > align1.bam
(-bS: input is SAM and output is BAM)
- Sort the BAM file
$ samtools sort align1.bam -o align1.sorted.bam
(current samtools syntax; older releases take an output prefix in place of -o align1.sorted.bam)
- Create the BCF file
$ samtools mpileup -uf V_para_GCA_000328405.1.fna align1.sorted.bam | bcftools view -Ov - > align.raw.bcf
(-u: generate uncompressed BCF output; -f: faidx indexed reference sequence file; -Ov: write the output as uncompressed VCF text)
9. RNAseq data analysis in Trinity
K. Vinaya Kumar, J. Ashok Kumar and M.S. Shekhar
Many of the commercially relevant aquaculture species, including shrimp, do not have a publicly available reference genome. Therefore the analysis of RNAseq data for such species requires building a de novo transcriptome assembly. For every experiment, a de novo assembly has to be made utilizing the RNAseq reads of all the samples in the study. In this chapter, we shall practice building a de novo transcriptome assembly and conducting differential transcript analysis in the Trinity software.
9.1 The datasets
Let us assume an experiment involving two treatments, a and b. Each treatment has three replicate individuals. At the end of the experiment, tissue samples were collected from all replicate individuals and RNAseq was performed on the Illumina platform. The following datasets have been generated.
Table. Datasets to be used for RNAseq data analysis
             Treatment A                     Treatment B
             Forward reads   Reverse reads   Forward reads   Reverse reads
replicate 1  a1F.fastq       a1R.fastq       b1F.fastq       b1R.fastq
replicate 2  a2F.fastq       a2R.fastq       b2F.fastq       b2R.fastq
replicate 3  a3F.fastq       a3R.fastq       b3F.fastq       b3R.fastq
9.2 Quality control of datasets
Process the raw reads using the Trimmomatic tool to obtain quality reads. Use the following arguments while running Trimmomatic.
ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10
LEADING:3
TRAILING:13
SLIDINGWINDOW:4:15
MINLEN:100
The numbers of reads retained for downstream analysis are given below.
Sample name   Reads in raw file (million)   Reads in processed file (million)
a1            10                            4.954252
a2            10                            5.577112
a3            10                            6.412094
b1            10                            5.257203
b2            10                            4.160784
b3            10                            3.607086
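The share of reads that survived trimming can be worked out from the table above with a few lines of Python. The numbers are those listed in the table; the variable names are only illustrative.

```python
RAW_READS_MILLION = 10.0

# Reads in the processed files (million), copied from the table above.
processed_million = {
    "a1": 4.954252, "a2": 5.577112, "a3": 6.412094,
    "b1": 5.257203, "b2": 4.160784, "b3": 3.607086,
}

# Percentage of raw reads retained per sample, rounded to two decimals.
retained_percent = {
    sample: round(100 * reads / RAW_READS_MILLION, 2)
    for sample, reads in processed_million.items()
}

for sample, percent in retained_percent.items():
    print(f"{sample}\t{percent:.2f}%")

lowest = min(retained_percent, key=retained_percent.get)
print("lowest retention:", lowest)
```

A sample with much lower retention than the others may deserve a closer look at its raw read quality before assembly.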
9.3 Building a de novo assembly
As the experiment involves triplicate samples, prepare a text file listing the triplicate samples under each treatment and their file names, as shown below.
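A samples file of this kind can be written with a short Python script. The sketch below assumes the usual Trinity layout of four tab-separated columns (treatment, replicate name, forward reads file, reverse reads file) and assumes that the quality-filtered paired files are named a1F_P.fastq, a1R_P.fastq and so on; adjust the names to match your own Trimmomatic output.

```python
import csv
import io


def samples_rows(treatments=("a", "b"), replicates=(1, 2, 3)):
    """One row per replicate: treatment, replicate name, forward file, reverse file."""
    rows = []
    for treatment in treatments:
        for replicate in replicates:
            name = f"{treatment}{replicate}"
            rows.append((treatment, name, f"{name}F_P.fastq", f"{name}R_P.fastq"))
    return rows


def write_samples(handle, rows):
    """Write the rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


rows = samples_rows()
buffer = io.StringIO()
write_samples(buffer, rows)
print(buffer.getvalue(), end="")

# To create the file itself:
# with open("ab_samples.txt", "w", newline="") as handle:
#     write_samples(handle, rows)
```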
Then proceed to build the assembly using the following command.
Trinity --seqType fq --samples_file ab_samples.txt --CPU 70 --max_memory 300G --SS_lib_type FR --output trinity_ab
The command arguments are:
--seqType fq: input files are in fastq format
--samples_file ab_samples.txt: sample file names are given in ab_samples.txt
--CPU 70: use 70 threads
--max_memory 300G: limit maximum memory to 300 GB
--SS_lib_type FR: data obtained from a strand-specific library as forward and reverse reads
--output trinity_ab: store output in the folder trinity_ab
The assembly is completed when you see the messages printed as shown below.
Browse to the folder and find the assembled transcripts file, Trinity.fasta. Rename the file as ‘Trinity_ab.fasta’ for easy identification.
9.4 Assessing quality of assembly
9.4.1. N50: Compute the N50 statistic by running the following command,
TrinityStats.pl Trinity_ab.fasta > Trinity_ab_stats.txt
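To see what the N50 statistic measures, the calculation can be sketched in Python. This is a teaching sketch on a made-up three-sequence FASTA, not a replacement for TrinityStats.pl, and the function names are illustrative.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record found in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the shortest sequence in the smallest set of longest
    sequences that together hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


toy_fasta = [">t1", "ACGTACGTAC", ">t2", "ACGTAC", "GT", ">t3", "ACGT"]
lengths = fasta_lengths(toy_fasta)
print(lengths, n50(lengths))
```

For a real assembly, pass an open file handle, for example `fasta_lengths(open("Trinity_ab.fasta"))`.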
9.4.2. ExN50: The E90N50 is considered more appropriate than N50 for RNAseq studies. Get the ExN50 statistics with the following command.
contig_ExN50_statistic.pl matrix.TMM.EXPR.matrix Trinity_ab.fasta | tee ExN50.stats
E      Minimum expression   ExN50   Number of transcripts
E90    2.28                 1611    45381
E91    1.952                1511    53150
E92    1.916                1409    61794
E93    1.55                 1314    71403
E94    1.5                  1212    82217
E95    1.262                1102    94509
E96    1.122                1005    108691
E97    0.95                 927     125654
E98    0.746                858     146688
E99    0.566                791     175457
E100   0                    605     281008
The N50 calculated from the most highly expressed transcripts that together represent 90% of the total normalized expression data is 1611 bases and includes 45381 transcripts.
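The idea behind ExN50 can be sketched in Python as follows. The transcript lengths and expression values are made up, and the Trinity script may treat ties and thresholds differently; the sketch only shows the principle of ranking transcripts by expression, keeping the top ones until they hold x% of total expression, and computing N50 on that subset.

```python
def n50(lengths):
    """N50 of a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


def ex_n50(transcripts, percent):
    """N50 of the most highly expressed transcripts that hold `percent` of total expression.

    transcripts: list of (length, expression) pairs.
    Returns (n50, number_of_transcripts_used).
    """
    ranked = sorted(transcripts, key=lambda item: item[1], reverse=True)
    target = sum(expression for _, expression in ranked) * percent / 100
    chosen = []
    running = 0.0
    for length, expression in ranked:
        if running >= target:
            break
        chosen.append(length)
        running += expression
    return n50(chosen), len(chosen)


# Made-up (length, expression) pairs for four transcripts.
toy = [(400, 50), (1200, 30), (900, 15), (300, 5)]
for percent in (50, 90, 100):
    print(f"E{percent}", ex_n50(toy, percent))
```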
9.4.3. Read representation: The proportion of paired reads represented in the assembled transcripts is another parameter that helps in evaluating the assembly. We shall use the bowtie2 tool for this. First an index is built and then the reads are aligned to the transcripts. Run the following two commands.
bowtie2-build Trinity_ab.fasta Trinity_ab.fasta
AND
bowtie2 -x Trinity_ab.fasta -q --fr -1 a1F_P.fastq,a2F_P.fastq,a3F_P.fastq,b1F_P.fastq,b2F_P.fastq,b3F_P.fastq -2 a1R_P.fastq,a2R_P.fastq,a3R_P.fastq,b1R_P.fastq,b2R_P.fastq,b3R_P.fastq -S samfile --no-unal -p 50
As per the statistics shown above, the overall alignment rate is 90%, which is good.
9.5 Transcript quantification
9.5.1. Estimate abundance
The first step in transcript quantification is to estimate the abundance of all transcripts in every sample. We shall practice estimating transcript abundance using the alignment-based method RSEM, though alignment-free methods such as kallisto and salmon exist. Run the following command to get abundance estimates by aligning the sequence reads to the transcripts and counting the number of reads aligned to each transcript.
align_and_estimate_abundance.pl --transcripts Trinity_ab.fasta --seqType fq --samples_file ab_samples.txt --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference --SS_lib_type FR --output_dir ab_rsem_outdir --thread_count 20
Argument                          Meaning
align_and_estimate_abundance.pl   Script to align reads on to transcripts and get abundance estimates
--transcripts                     To define the assembled transcripts file name
--seqType                         To define the input file format
--samples_file                    To define the file name that contains treatments, replicates and reads file names
--est_method                      To define the abundance estimation method (options are RSEM/eXpress/kallisto/salmon)
--aln_method                      To define the alignment method (bowtie/bowtie2)
--trinity_mode                    To automatically generate gene_trans_map
--prep_reference                  To build the target index
--SS_lib_type                     To specify if the library is strand-specific (FR/RF)
--output_dir                      Name of the directory to store output files
--thread_count                    Number of threads to use for running the command
At the end of the run, six folders are created corresponding to the six samples. In each folder, look for a file named RSEM.isoforms.results. These files are used for further processing. The abundance estimates are built into a matrix with the following command,
abundance_estimates_to_matrix.pl --est_method RSEM RSEM.isoforms.results
Mention all six RSEM.isoforms.results file names, one for each of the six samples.
9.5.2. Count the numbers of expressed transcripts
Get the number of transcripts that are expressed at different TPM thresholds, for plotting, by running the following command,
count_matrix_features_given_MIN_TPM_threshold.pl matrix.TPM.not_cross_norm | tee counts_by_min_TPM
The output looks like the table depicted below.
Neg_min_tpm Number of features
-10 24978
-9 29296
-8 35850
-7 45677
-6 62228
-5 84308
-4 111966
-3 151586
-2 202147
-1 228356
0 281008
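The table can be read as a cumulative count: the row -10, for example, gives the number of features expressed at 10 TPM or more. A minimal Python sketch of that idea is given below, assuming that a feature is counted when its highest TPM in any sample reaches the threshold; the Trinity script may bin values differently, and the matrix here is made up.

```python
def counts_by_min_tpm(tpm_rows, thresholds):
    """For each threshold, count the features whose highest TPM in any sample reaches it.

    Returns (negated threshold, count) pairs, highest threshold first,
    mirroring the two columns of the table above.
    """
    peaks = [max(values) for values in tpm_rows.values()]
    return [(-t, sum(1 for peak in peaks if peak >= t))
            for t in sorted(thresholds, reverse=True)]


# Made-up TPM values for four transcripts in two samples.
toy_matrix = {"t1": [12.0, 3.0], "t2": [0.5, 2.2], "t3": [0.0, 0.0], "t4": [5.0, 9.9]}
table = counts_by_min_tpm(toy_matrix, [0, 1, 5, 10])
for neg_min_tpm, n_features in table:
    print(neg_min_tpm, n_features)
```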
9.6 Differential expression analysis
At present, Trinity supports four R packages for performing differential expression analysis. These are edgeR, DESeq2, limma/voom and ROTS. We shall use edgeR in this tutorial to understand differential expression analysis. Run the following commands.
run_DE_analysis.pl --matrix matrix.counts.matrix --method edgeR --samples_file ab_samples_DE.txt --output ab_edgeRresult
AND
analyze_diff_expr.pl --matrix matrix.TMM.EXPR.matrix --output aVSb --samples ab_samples_analyzeDE.txt
In this particular example, we got five transcripts that are differentially expressed in sample b compared to sample a. Now proceed to functional annotation of these transcripts to understand their role for the given treatment in the study.
9.7 Quality check of samples and replicates: You may compare the samples, as well as the replicates within each sample, with the following commands.
9.7.1. /PtR --matrix matrix.counts.matrix --samples ab_replicatesTest.txt --CPM --log2 --min_rowSums 10 --compare_replicates
9.7.2. /PtR --matrix matrix.counts.matrix --min_rowSums 10 -s ab_replicatesTest.txt --log2 --CPM --sample_cor_matrix
9.7.3. /PtR --matrix matrix.counts.matrix -s ab_replicatesTest.txt --min_rowSums 10 --log2 --CPM --center_rows --prin_comp 3
For example, in the picture below, it is evident that the replicates in treatment b are clustered closely. This indicates that all the replicates behaved similarly.
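The transformation and comparison behind a sample correlation matrix can be sketched in Python. The sketch assumes that counts are scaled to counts per million (CPM), transformed as log2(CPM + 1) and compared by Pearson correlation; the count values are made up, and the row filtering done by --min_rowSums is left out for brevity.

```python
import math


def log2_cpm(counts):
    """Scale one sample's counts to counts per million, then take log2(CPM + 1)."""
    total = sum(counts)
    return [math.log2(count / total * 1_000_000 + 1) for count in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Made-up counts for four transcripts in three samples.
counts = {"a1": [100, 200, 50, 650], "a2": [110, 190, 60, 640], "b1": [600, 60, 200, 140]}
logged = {name: log2_cpm(values) for name, values in counts.items()}
names = sorted(logged)
for first in names:
    print(first, [round(pearson(logged[first], logged[second]), 3) for second in names])
```

Replicates of the same treatment are expected to correlate more strongly with each other than with samples of the other treatment, as a1 and a2 do here.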
10. Phylogenomic analysis using MrBayes
K. Vinaya Kumar, J. Ashok Kumar and G. Gopikrishna
Researchers perform phylogenetic analysis to understand the evolutionary relations among taxa. Such analyses require information on best-fit partitioning schemes and best-fit models for the sequence data in hand. PartitionFinder is a suitable tool to find that information for building a phylogenetic tree. In this chapter we conduct analyses using the PartitionFinder tool to find the best-fit partitioning scheme and evolutionary models. Then, using these partitioning schemes and models, we build a Bayesian tree in the MrBayes tool.
10.1 PartitionFinder
For this exercise, a sequence file containing sequence data of 5 genes on 10 taxa is provided in your work folder. Open the folder and check for the file named ‘sequence_10_5.phy’.
Taxa labels       10 taxa: taxaA, taxaB, ..., taxaJ
Gene partitions   5 genes
                  Gene1: 1-675 bp
                  Gene2: 676-834 bp
                  Gene3: 835-2373 bp
                  Gene4: 2374-3060 bp
                  Gene5: 3061-3849 bp
The arguments for running PartitionFinder are to be provided in a configuration file. Find the file ‘partition_finder.cfg’ in the work folder. Keep the settings as per the table given below.
Argument          Option              Meaning
alignment         sequence_10_5.phy   File containing sequences in phylip format
branchlengths     linked              Linked branch lengths are supported by almost all phylogeny programs
models            mrbayes             Includes, for testing, all the evolutionary models that are compatible with the MrBayes tool
model_selection   aicc                Criterion to decide the best model
data_blocks       (listed below)      Defining data partitions. For each gene, three data partitions are defined based on the three base positions of the triplet code. We defined 15 data blocks for 5 genes.
                  Gene1_pos1 = 1-675\3;
                  Gene1_pos2 = 2-675\3;
                  Gene1_pos3 = 3-675\3;
                  Gene2_pos1 = 676-834\3;
                  Gene2_pos2 = 677-834\3;
                  Gene2_pos3 = 678-834\3;
                  Gene3_pos1 = 835-2373\3;
                  Gene3_pos2 = 836-2373\3;
                  Gene3_pos3 = 837-2373\3;
                  Gene4_pos1 = 2374-3060\3;
                  Gene4_pos2 = 2375-3060\3;
                  Gene4_pos3 = 2376-3060\3;
                  Gene5_pos1 = 3061-3849\3;
                  Gene5_pos2 = 3062-3849\3;
                  Gene5_pos3 = 3063-3849\3;
search            greedy              Defining the method to use for finding a good partitioning scheme
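Typing fifteen data blocks by hand is error prone, so the lines can be generated from the gene coordinates instead. The Python sketch below uses the gene ranges given for this exercise; the function name is illustrative, and the check that each gene length is a multiple of three is an added safeguard, not a PartitionFinder requirement.

```python
# Gene name, first base and last base, as listed for the exercise alignment.
GENES = [
    ("Gene1", 1, 675),
    ("Gene2", 676, 834),
    ("Gene3", 835, 2373),
    ("Gene4", 2374, 3060),
    ("Gene5", 3061, 3849),
]


def codon_blocks(genes):
    """Return one data block line per codon position of every gene."""
    blocks = []
    for name, start, end in genes:
        if (end - start + 1) % 3 != 0:
            raise ValueError(f"{name}: length is not a multiple of three")
        for position in range(3):
            # "\3" means every third site, starting from the given first base.
            blocks.append(f"{name}_pos{position + 1} = {start + position}-{end}\\3;")
    return blocks


blocks = codon_blocks(GENES)
print("\n".join(blocks))
```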
Then use the following command to run PartitionFinder. Here, mention the name of the folder containing the sequence file and the configuration file in place of ‘folder_name’.
python PartitionFinder.py folder_name/ --no-ml-tree
The output files are stored in the folder ‘analysis’. Find the file ‘best_scheme.txt’, which contains the commands for running the best-fit models on the best partitions in the MrBayes tool.
10.2 MrBayes
The Bayesian analysis requires the input sequence file in nexus format. Find the file ‘sequence_10_5.nxs’, which holds the sequence data that was analysed in PartitionFinder. Download the Windows version of the MrBayes tool and unzip the file. Start the tool by clicking on the executable. Then run the following commands.
execute sequence_10_5.nxs;
outgroup taxaJ;
type the commands given in the output file of PartitionFinder, ‘best_scheme.txt’
showmodel # to check for the model defined
mcmc ngen=10000000 nruns=2 nchains=4 samplefreq=100 printfreq=100 diagnstat=maxstddev diagnfreq=1000 savebr=yes filename=PartitionFinder
After running for 10 million generations, you would see the following screen.
You could continue with more generations if required by opting for ‘yes’ at the prompt.
Then obtain a summary of parameters with the following command. Here, by default, the first 25% of observations are discarded.
sump filename=PartitionFinder
Look for parameters like the estimated sample size and the potential scale reduction factor. Then summarize the trees with the following command. This prints a cladogram and a phylogram.
sumt filename=PartitionFinder
Check for the .tre file and open it in FigTree to view the tree.
11. Microsatellites genotypes generation by Fragment analysis method
B. Sivamani
Fragment analysis (genotyping) can be performed on DNA fragments that have fluorescent labels. Using a labeled primer with PCR amplification is a common method used to incorporate these labels. The Molecular Biology Core lab is already set to run multiple fluorescent dye sets.
Steps
1. Microsatellite loci selection
2. Primer designing (fluorescent labelled)
3. PCR
4. Fragment analysis – ABI sequencer
5. Generating genotypes
1. Microsatellite loci selection
The loci are selected through literature search or from any database. For fisheries, the NBFGR FishMicrosat database provides updated microsatellite loci and their primers for PCR amplification (http://mail.nbfgr.res.in/fishmicrosat/).
Steps to find the microsatellite loci in Penaeus (Fenneropenaeus indicus)
- Visit the site (http://mail.nbfgr.res.in/fishmicrosat/)
- Under Analysis and Primer, select your species of search and you will find all the microsatellite loci related to the specific search. The following details of the loci are also presented:
1. Accession Number: the link leads to the NCBI site and gives all the details of the nucleotide sequence
2. SSR type: di, tri, tetra or compound
3. Microsatellite span in the sequence
4. Primers to amplify the locus
2. Designing primer
One can use the specified primer, or a primer may be designed as per the user requirement by utilizing the Accession Number option. One of the primers needs to be fluorescent labeled.
3. PCR
Isolate total DNA from the biological material (blood, fin clips, muscle, etc.) of the species. Verify the DNA for quality and quantity. Carry out the PCR with labelled primers. Verify the amplicon by agarose gel electrophoresis. Specific amplification of the product is preferred; otherwise, the presence of some less intense non-specific bands is also acceptable.
4. Fragment analysis – by ABI sequencer
This step is normally outsourced because the cost of the equipment is too high. We receive the results generated by the GeneMapper software (private firms use the inbuilt GeneMapper software) as the following files.
1. FSA file
2. PDF for electropherogram
3. Genotypes in excel sheet
Fig. 1 Electropherogram
Fig. 2 Genotypes data generated by GeneMapper software
5. Generating genotypes from FSA file using R
# 5.1 Install R from the site https://www.r-project.org/
# 5.2 Install the package from the R site
install.packages("Fragman")
# 5.3 To activate, the package has to be loaded
library(Fragman)
# 5.4 To specify the folder containing the input FSA files
FIM03 <- storing.inds("C:/Users/Admin/Desktop/training writeup/FAS file-ciba")
# 5.5 To specify the ladder used in the analysis
my.ladder <- c(35,50,75,100,139,150,160,200,250,300,340,350,400,450,490,500)
# 5.6 To merge both the earlier specified information (FSA files and ladder)
ladder.info.attach(stored=FIM03, ladder=my.ladder)
# 5.7 To create friendly plots for any number of individuals specified; these can be used to
# design panels for posterior automatic scoring
overview2(my.inds=FIM03, channel = 2, ladder=my.ladder)
# 5.8 To view the vector with expected DNA sizes to be used in the next step for scoring
my.panel2 <- overview2(my.inds=FIM03, channel = 2, ladder=my.ladder, init.thresh=3000, xlim=c(90,130))
my.panel2
# 5.9 To score our samples for channel 2 with our panel created previously
res2 <- score.markers(my.inds=FIM03, channel = 2, panel=my.panel2$channel_2, ladder=my.ladder, electro=FALSE)
# 5.10 To extract your peaks in a data.frame
final.results <- get.scores(res2)
final.results
# 5.11 To get the results in text file format
write.table(final.results, "C:/Users/Admin/Desktop/training writeup/FIM03-18-2.txt", sep="\t")
******
Note
install.packages = to install the specific package
library = to load an add-on package
storing.inds = the function in charge of reading the FSA files and storing them with a list structure
ladder.info.attach = uses the information read from the FSA files and a vector containing the ladder information (DNA size of the fragments) and matches the peaks from the channel where the ladder was run with the DNA sizes for all samples. It then loads this information into the R environment for use by subsequent functions
stored = list of data frames obtained by using the storing.inds function
overview2 = creates friendly plots for any number of individuals specified; it can be used to design panels for posterior automatic scoring (as licensed software does)
my.inds = list with the channel information from the individuals specified, usually coming from the storing.inds function output
channel = the channel you wish to analyze; usually 1 is blue, 2 is green, 3 is yellow, 4 is red and so on
init.thresh = an initial value of intensity to detect peaks. We recommend not changing it too much unless you have highly controlled DNA concentrations in your experiment
score.markers = scores the alleles by finding the peaks provided in the panel
panel = the different DNA sizes, usually obtained by using the overview and locator functions
get.scores = once the calls have been obtained, extracts a data frame of the scores
xlim=c(a1,b1) = the approximate amplicon size range to be mentioned in overview2
******
Dye sets used in Applied Biosystems DNA analysers
Blue: 5-FAM and 6-FAM
Green: HEX, VIC, TET and JOE
Yellow: TAMRA and NED
Red: ROX and PET
12. Genepop: Population Genetics analysis
B. Sivamani
GENEPOP is a population genetics software package originally developed by Michel Raymond (Raymond@isem.univ-montp2.fr) and Francois Rousset (Rousset@isem.univ-montp2.fr), at the Laboratoire de Génétique et Environnement, Montpellier, France.
Access: The web version is easy to use but requires an active internet connection. The program can also be downloaded and run under Windows and Linux without internet.
Genepop on the web
It can be accessed at the link: http://genepop.curtin.edu.au/
It has seven options. Once the input file is prepared, all the options can be run.
Option 1 Hardy Weinberg Exact Tests
Option 2 Genotypic linkage disequilibrium
Option 3 Population differentiation
Option 4 Nm estimates - private allele method
Option 5 Basic Information
Option 6 Fst and other correlations
Option 7 File conversion
Input file (e.g.)
Title:"P.indicus microsatellites based population diversity"
FIM03
FIM06
FIM20
FIM21
FIM17
FIM19
FIM23
POP
C051 , 113115 161161 128130 225225 308289 110118 201226
C054 , 109115 161167 123123 222231 299307 110116 222226
C055 , 113117 161164 123123 231231 307307 108110 191201
C056 , 109113 161164 123123 245248 307307 110118 191201
C057 , 117117 161161 123125 230230 293307 110118 191201
POP
K15 , 115115 161161 128130 230230 308308 110118 201222
K20 , 115117 160160 123125 000000 294307 110118 191201
K23 , 107113 149159 123123 000000 309308 110116 191201
K36 , 111115 160186 123123 000000 307308 108118 201222
K42 , 109115 138149 123125 000000 298300 110116 191201
Instructions for the input file
- The input file should be prepared in Notepad, Notepad++ or Excel.
- The input file should have the txt extension, e.g. filename.txt
- The first line, the title, is written within inverted commas.
- There is no constraint on the blanks separating the various fields; tabs or spaces are allowed.
- Loci names can appear on separate lines, or on one line if separated by commas.
- The individual identifier may have blanks but must end with a comma.
- Alleles are numbered from 01 to 99 (or 001 to 999). Consecutive numbers to designate alleles are not required.
- Populations are defined by the position of the “Pop” separator. To group various populations, just remove the relevant “Pop” separators.
- Individual genotypes for the web version must be on one line. This differs from the PC version.
- Missing data should be indicated as 00 (or 000) rather than blanks. There are three possibilities for missing data:
  - no information (0000) or (000000),
  - partial information for first allele (1000) or (010000),
  - partial information for second allele (0010) or (000010).
- The number of locus names should correspond to the number of genotypes in each row. If you remove one or several loci from your input file, you should remove both their names and the corresponding genotypes.
- No empty lines should be found within the file.
- No more than one empty line should be present at the end of the file.
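The rules above can be applied programmatically when the genotypes are available as allele sizes. The Python sketch below is a minimal illustration using three-digit allele codes; the function names are hypothetical, and the two individuals and two loci are taken from the example input file.

```python
def genotype_code(allele_1, allele_2, digits=3):
    """Join two allele sizes into one Genepop genotype; None becomes the missing code (zeros)."""
    def code(allele):
        if allele is None:
            return "0" * digits
        if not 0 < allele < 10 ** digits:
            raise ValueError(f"allele {allele} does not fit in {digits} digits")
        return f"{allele:0{digits}d}"
    return code(allele_1) + code(allele_2)


def genepop_lines(title, loci, populations):
    """populations: list of populations, each a list of (individual, [(allele_1, allele_2), ...])."""
    lines = [title] + list(loci)
    for individuals in populations:
        lines.append("POP")
        for name, genotypes in individuals:
            # The number of genotypes must match the number of locus names.
            if len(genotypes) != len(loci):
                raise ValueError(f"{name}: {len(genotypes)} genotypes for {len(loci)} loci")
            lines.append(f"{name} , " + " ".join(genotype_code(a, b) for a, b in genotypes))
    return lines


title = 'Title:"P.indicus microsatellites based population diversity"'
populations = [
    [("C051", [(113, 115), (225, 225)])],
    [("K20", [(115, 117), (None, None)])],   # second locus missing
]
lines = genepop_lines(title, ["FIM03", "FIM21"], populations)
print("\n".join(lines))
```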
To run in PC
Download Genepop from the link http://kimura.univ-montp2.fr/~rousset/Genepop.htm
Depending on the OS, the 32 or 64 bit version can be run from the PC without installing. The input file format is the same as for Genepop on the web. The input file should be in the same folder as the software. After specifying the input file, type the number of the option (for analysis); the output file gets stored in the same folder.
13. Population genetic analysis of microsatellite data in Arlequin
B. Sivamani
Arlequin is a software tool specially designed to extract information on genetic and demographic features of a collection of population samples. Arlequin can handle several types of data, either in haplotypic or genotypic form. The data types include
- DNA sequences
- RFLP data
- Microsatellite data
- Standard data
- Allele frequency data
Arlequin can analyse various population parameters, including standard indices, molecular diversity, linkage disequilibrium, Hardy-Weinberg equilibrium, AMOVA and exact population differentiation.
Installation and uninstallation
1. Download WinArl35.zip to any temporary directory.
2. Extract all files contained in WinArl35.zip into the directory of your choice.
3. Start Arlequin by double-clicking on the file WinArl35.exe, which is the main executable file.
4. To uninstall, simply delete the directory where you installed Arlequin. The registry is not modified by the installation of Arlequin.
Configuration
Download the TextPad text editor from www.textpad.com and install it. It is required to create and edit the project files and to view the log files.
Download R from www.r-project.org and install it.
Running the software Arlequin
Open Arlequin by double-clicking “WinArl35.exe”, which leads to the home page.
Step 1. Configuration of Arlequin
1.1 Click on the ‘Arlequin Configuration’ box and select the options Append results, XML output and use 64bit external. Append results is selected to get the results of several runs of a specific input file in a single output file. The XML output option is to get the results in XML format.
1.2 Under ‘Helper Programs’, the paths of the text editor and R have to be specified. Click the ‘Browse’ box of the ‘Text editor’ and browse to where the Textpad.exe file is located.
1.3 Click the ‘Browse’ button of ‘Rcmd’ and indicate the path by selecting the Rcmd application from the specific folder.
Step 2: Project file preparation
Arlequin requires a project file (input) which has the extension “.arp”. Once the analysis is over, the output (results) will be stored in the same (WinArl35) folder as a subfolder with the extension “.res”.
2.1 Open the Arlequin software by double-clicking “WinArl35.exe”
2.2 Click on the “project wizard” option. An example project file will be created in the Arlequin format.
2.3 Click the dropdown menu of ‘Datatype’. Select the option ‘MICROSAT’
2.4 Choose ‘Genotype data’
2.5 We have data on five populations for analysis. Therefore mention ‘No of samples’ as 5.
2.6 Choose ‘whitespace’ against the ‘Locus separator’
2.7 Type ‘?’ against ‘Missing data’
2.8 Select the option ‘Include genetic structure’
2.9 Click the ‘Browse’ option
2.10 It opens a pop-up window where the project file needs to be stored as ‘ciba-1’ in the ‘WinArl35’ folder
2.11 Click on ‘CREATE PROJECT’, which creates a project file named ‘ciba-1’
2.12 Convert the Genepop format of the input file into the Arlequin project format
2.12.1 Open ‘Genepop on the web’ (http://genepop.curtin.edu.au/)
2.12.2 Clicking on option 7 (File Conversion) will open the window for data format conversions.
2.12.3 Select (option 5) Genepop format to Arlequin project
2.12.4 Select ‘Datatype’ as ‘microsatellite’
2.12.5 Select ‘Genotypic data’ as ‘diploid’
2.12.6 For ‘Recessive (null) allele present’, select yes or no based on the data. Here our data contains some null alleles. Therefore we select the ‘Yes’ option.
2.12.7 For ‘Gametic phase’, select the ‘unknown’ option (for diploid data, gametic phase details are not necessary; the same results will be obtained for either option)
2.12.8 For ‘Output format & Delivery’, select either of the options ‘Email the results’ or ‘HTML - Plain Text’. Under ‘Email the results’, enter your email id; the results will be sent to your mail id. The plain text option will display the results in the same window.
2.12.9 Under the ‘Choose File’ option, browse to your Genepop file (ciba_genpop_1.txt) and click the ‘Submit data’ box. We get the results in the Arlequin project format.
2.13 Copy the results from ‘[Data]’ till the end.
2.14 Go to the Arlequin software page and click ‘Edit project’
2.15 It will open the ciba-1.arp file in TextPad
2.16 Paste the copied content in the ciba-1.arp file, replacing the [Data] content
2.17 Edit the ‘Structure’ content
2.17.1 # (Enter the title between inverted commas) (e.g.)
StructureName = "Fish-India"
2.17.2 # Number of groups + {1,2,3...} (enter 1, 2, 3, etc., as per the number of groups to be made) (e.g.)
NbGroups = 1
2.17.3 # Define hereafter the structure of the first group: mention the names of all the populations that belong to the specific group. Every population name should be within inverted commas. (e.g.)
Group ={ "C051"
"K15"
"MNI01"
"P094"
"Q02"
}
2.17.4 After editing, save and close the file.
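When several groupings have to be tried, the structure entries can be generated rather than typed. The Python sketch below renders the three entries shown above from a list of groups; the function name is hypothetical, the population names are those of the example, and the exact spacing and line layout accepted by Arlequin should be checked against a project file created by the wizard.

```python
def structure_entries(name, groups):
    """Render the StructureName, NbGroups and Group entries for a list of groups."""
    lines = [f'StructureName = "{name}"', f"NbGroups = {len(groups)}"]
    for populations in groups:
        if not populations:
            raise ValueError("each group needs at least one population")
        lines.append("Group = {")
        # One quoted population name per line inside the braces.
        lines.extend(f'"{population}"' for population in populations)
        lines.append("}")
    return lines


entries = structure_entries("Fish-India", [["C051", "K15", "MNI01", "P094", "Q02"]])
print("\n".join(entries))
```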
3 Analyzing the data
3.1 Using the ‘Open project’ box, open the file ‘ciba-1.arp’
3.2 Choose the required analysis from the ‘Settings’
3.3 Click on the ‘Start’ button to start the analysis
3.4 View the results generated in the folder (project file name with the .res extension) ‘ciba-1.res’.
14. Soft Computing techniques in Bioinformatics
P. Mahalakshmi
INTRODUCTION
The exponential growth of the amount of biological data available raises two problems: on one hand, efficient information storage and management and, on the other hand, the extraction of useful information from these data. The second problem is one of the main challenges in computational biology, which requires the development of tools and methods capable of transforming all these heterogeneous data into biological knowledge about the underlying mechanism. These tools and methods should allow us to go beyond a mere description of the data and provide knowledge in the form of testable models. By this simplifying abstraction that constitutes a model, we will be able to obtain predictions of the system. There are several biological domains where soft computing techniques are applied for knowledge extraction from data.
Application of soft computing becomes relevant for solving some bioinformatics and molecular biology problems. Developments in soft computing methods have yielded technologies, algorithms and tools in bioinformatics for purposes such as reliable and parallel genome sequencing, fast sequence comparison, database search, automated gene identification, and efficient modeling and storage of heterogeneous data. Protein classification leads to identification and proper functional assignment of uncharacterized proteins, with a final goal towards finding homologies and drug discovery. Again, structure based ligand design is one of the crucial steps in rational drug discovery, where a small molecule is designed by targeting the structure and biochemical properties of the target.
The application of soft computing offers a promising approach to achieve efficient and reliable heuristic solutions. On the other side, the continuous development of high quality biotechnology, e.g. micro-array techniques and mass spectrometry, which provide complex patterns for the direct characterization of cell processes, offers further promising opportunities for advanced research in bioinformatics. So one important sub-discipline within bioinformatics involves the development of new algorithms and models to extract new and potentially useful information from various types of biological data, including DNA (nucleotide sequences) and proteins (amino acid sequences). Analysis of these macromolecules is performed both structurally and functionally using the major components of soft computing such as Fuzzy Sets (FS), Artificial Neural Networks (ANN), Evolutionary Algorithms (EAs) (including genetic algorithms (GAs)), Rough Sets (RS) and Swarm Optimization (SO). These lecture notes attempt to describe fuzzy logic, Artificial Neural Networks and genetic algorithms and their applications in bioinformatics.
NEED OF SOFT COMPUTING IN BIOINFORMATICS
The different tasks involved in the analysis of biological data include sequence alignment, genomics, proteomics, DNA and protein structure prediction, gene/promoter identification, phylogenetic analysis, analysis of gene expression data, protein folding, docking, and molecule and drug design. Data analysis tools used earlier in bioinformatics were mainly based on statistical techniques like regression and estimation. Soft computing in bioinformatics can be used in handling
large, complex, inherently uncertain, data sets in biology in a robust and computaonally ecient
manner thus fuzzy sets (so compung technique) can be used as a natural framework for analysing
them. Most of the bioinformac tasks involve search and opmizaon of dierent criteria (like energy,
alignment score, overlap strength), while requiring robust, fast and close approximate soluons.
Missing and noisy data are characteristic of biological data. Conventional computing
techniques fail to handle this, whereas soft computing based techniques are able to deal with missing
and noisy data. As soft computing is designed to handle vagueness, uncertainty and near optimality in
large and complex search spaces, the use of soft computing tools for solving bioinformatics problems has
gained the attention of researchers. Most of the research is woven around the tasks of pattern
recognition and data mining, such as clustering, classification, feature selection and rule generation.
While classification pertains to supervised learning, clustering corresponds to
unsupervised self-organization into homologous partitions.
In molecular biology research, new data and concepts are generated every day, and those new
data and concepts update or replace the old ones. Soft computing can be easily adapted to a changing
environment. This benefits system designers, as they do not need to re-design systems whenever the
environment changes. Moreover, since many of the problems involve multiple conflicting objectives,
application of soft computing multi-objective optimization algorithms like multi-objective genetic
algorithms appears to be natural and appropriate. Soft computing techniques, either individually or
in a hybridized manner, can be used for analyzing biological data in order to extract more
meaningful information and insights from them.
With advances in biotechnology, huge volumes of biological data are generated. In addition, it
is possible that important hidden relationships and correlations exist in the data. Soft computing
methods are designed to handle very large data sets, and can be used to extract such relationships.
FUZZY LOGIC AND ITS APPLICATION IN BIOINFORMATICS
Fuzzy Sets and Linguistic Variables
A fuzzy set is an extension of a crisp set. Crisp sets allow only full membership or no membership
at all, whereas fuzzy sets allow partial membership. In a crisp set, membership or non-membership of
element x in set A is described by a characteristic function $\mu_A(x)$, where $\mu_A(x) = 1$ if
$x \in A$ and $\mu_A(x) = 0$ if $x \notin A$. Fuzzy set theory extends this concept by defining partial
membership, where $\mu_A(x) = 1$ if x fully belongs to A; $\mu_A(x) = 0$ if x does not belong to A; and
$0 < \mu_A(x) < 1$ if x partially belongs to A.
Mathematically, a fuzzy set A on a universe of discourse U is characterized by a membership function
that takes values in the interval [0, 1] and can be defined as $\mu_A : U \to [0, 1]$. Fuzzy sets represent
commonsense linguistic labels, viz. suitable, moderate, unsuitable, slow, very slow, fast etc. A given
element can be a member of more than one fuzzy set at a time. A fuzzy set A in U may be represented as
a set of ordered pairs, each consisting of a generic element x and its grade of membership; that
is, $A = \{(x, \mu_A(x)) \mid x \in U\}$, where x is called a support value if $\mu_A(x) > 0$ (Zadeh, 1965).
The concept of a linguistic variable plays an important role, particularly in fuzzy logic. A linguistic
variable is a variable whose values are expressed in words or sentences in natural language. For each
input and output variable, fuzzy sets are created by dividing its universe of discourse into a number
of sub-regions, which are named as linguistic variables (Zimmermann, 1996).
Membership Functions
Although both classical and fuzzy subsets are defined by membership functions, the degree to
which an element belongs to a classical subset is limited to being either zero or one. This means that
the membership function may only be a step function (Figure 6.1a). On the other hand, in fuzzy logic, a
membership function (MF) is essentially a curve that defines how each point in the input space is
mapped to a membership value (or degree of membership) between 0 and 1.
Figure: Membership function for (a) crisp set and (b) fuzzy set
The membership functions are usually defined for inputs and outputs in terms of linguistic
variables. Various types of membership functions are used, such as triangular, trapezoidal, bell,
Gaussian and sigmoid functions. In designing a fuzzy inference system, membership functions are
associated with term sets that appear in the antecedent or consequent of rules. Many researchers
have used different techniques for determining membership functions, such as fuzzy clustering, neural
networks, and genetic algorithms.
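The shapes named above can be illustrated with a minimal Python sketch. The function names and the parameter conventions (a, b, c, d as break points; mean and sigma for the Gaussian) are illustrative assumptions, not part of any particular fuzzy logic library.

```python
import math


def crisp(x, low, high):
    """Crisp (step) membership: 1 inside [low, high], 0 outside."""
    return 1.0 if low <= x <= high else 0.0


def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF: rises from a to b, equals 1 on [b, c], falls from c to d."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def gaussian(x, mean, sigma):
    """Gaussian MF centred at mean with spread sigma."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


if __name__ == "__main__":
    # Hypothetical linguistic label "moderate" on a 0-10 scale.
    for value in (0, 2.5, 5, 7.5, 10):
        print(value, crisp(value, 4, 6), triangular(value, 0, 5, 10))
```

Unlike the crisp function, the triangular function returns intermediate grades (for example 0.5 at 2.5), which is the partial membership described in the text.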
Fuzzy Inference System
A Fuzzy Inference System (FIS) incorporates an expert's experience into the system design and
is composed of four blocks: a fuzzifier that transforms the 'crisp' inputs into fuzzy
inputs by membership functions that represent fuzzy sets of input vectors; a knowledge base that
includes the information given by the expert in the form of linguistic fuzzy rules; an inference engine
that uses the fuzzy inputs together with the knowledge base for inference by a method of implication
and aggregation; and a defuzzifier that transforms the fuzzy results of the inference into a crisp
output using a defuzzification method.
Figure: Fuzzy Inference System
The knowledge base comprises two components: a database, which defines the membership
functions of the fuzzy sets used in the fuzzy rules, and a rule base comprising a collection of linguistic
rules that are joined by a specific operator. Based on the consequent type of fuzzy rules, there are
two common types of FIS, which vary according to differences between the specifications of the
consequent part (Equations 1 and 2). The first fuzzy system uses the inference method proposed by
Mamdani, in which the rule consequence is defined by fuzzy sets and has the following structure:
IF x is A and y is B THEN z is C (1)
The second fuzzy system, proposed by Takagi, Sugeno and Kang (TSK), contains an inference
engine in which the conclusion of a fuzzy rule comprises a constant (equation 2 a) or a weighted linear
combination of the crisp inputs (equation 2 b) rather than a fuzzy set. A fuzzy rule for the zero-order
Sugeno method is of the form
IF x is A and y is B THEN z = C (2 a)
where A and B are fuzzy sets in the antecedent and C is a constant. The first-order Sugeno model
has rules of the form
IF x is A and y is B THEN z = px+qy+r (2 b)
where A and B are fuzzy sets in the antecedent and p, q, and r are constants.
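As an illustration of rules (2 a) and (2 b), the following Python sketch evaluates a small Sugeno-type rule base. Two things are assumptions that the text above does not state: the "and" in each antecedent is taken as the minimum of the two memberships, and the crisp output is taken as the firing-strength-weighted average of the rule outputs, which is the usual convention for Sugeno systems. The fuzzy sets "low" and "high" are hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership function with peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def sugeno(x, y, rules):
    """Evaluate rules of the form IF x is A and y is B THEN z = p*x + q*y + r.

    Each rule is (mf_A, mf_B, (p, q, r)). A zero-order rule (2 a) is the
    special case p = q = 0, so that z = r is a constant.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for mf_a, mf_b, (p, q, r) in rules:
        weight = min(mf_a(x), mf_b(y))      # assumed AND operator: minimum
        z = p * x + q * y + r               # crisp consequent of this rule
        weighted_sum += weight * z
        total_weight += weight
    if total_weight == 0:
        return None                         # no rule fired
    return weighted_sum / total_weight      # assumed: weighted average


def low(v):
    return tri(v, -5, 0, 5)                 # hypothetical fuzzy set "low"


def high(v):
    return tri(v, 0, 5, 10)                 # hypothetical fuzzy set "high"


RULES = [
    (low, low, (0.0, 0.0, 1.0)),            # zero-order rule: z = 1
    (high, high, (1.0, 1.0, 0.0)),          # first-order rule: z = x + y
]

if __name__ == "__main__":
    print(sugeno(2.5, 2.5, RULES))
```

At x = y = 2.5 both rules fire with strength 0.5, so the output lies midway between the two consequents (1 and 5).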
Fuzzy Inference Process
The inference process for evaluating the system needs five steps: fuzzification of the inputs,
application of the fuzzy operator in the antecedent, implication, aggregation and defuzzification.
Figure: Fuzzy Inference Process
The first step in evaluating the output of a FIS is to apply the inputs and determine the degree
to which they belong to each of the fuzzy sets via membership functions (Figure 6.5). This is required
in order to activate rules that are expressed in terms of linguistic variables. Once membership functions
are defined, fuzzification takes a real-time input value and compares it with the stored membership
function to produce fuzzy input values. In order to perform this mapping, we can use fuzzy sets of any
shape, such as triangular, Gaussian, π-shaped, etc.
A fuzzy rule base contains a set of fuzzy rules R. For a multi-input, single-output system, it is
represented by

$$R = (R_1, R_2, \ldots, R_n)$$

where $R_i$ can be represented as

$$R_i = \text{if } (x_1 \text{ is } T_{x_1}, \ldots, \text{ and } x_m \text{ is } T_{x_m}) \text{ then } (y \text{ is } T_y)$$

In this rule, the m preconditions of $R_i$ form a fuzzy set $(T_{x_1} \times T_{x_2} \times \cdots \times T_{x_m})$,
and the consequent is a single output. Generally, an if-then rule can be interpreted by the following three steps:
Resolve all fuzzy statements in the antecedent to a degree of membership between 0 and 1.
If the rule has more than one antecedent, the fuzzy operator is applied to obtain one number
that represents the result of applying that rule. This is called the firing strength or weight factor of
that rule. For example, consider an ith rule that has two parts in the antecedent:

$$R_i = \text{if } (x_1 \text{ is } T^i_{x_1} \text{ and } x_2 \text{ is } T^i_{x_2}) \text{ then } (y \text{ is } T^i_y)$$
Then, the weight factor can be defined using either the intersection (minimum) operator or the product operator:

$$\alpha_i = \min\left(\mu^i_{x_1}(x_1),\ \mu^i_{x_2}(x_2)\right)$$

$$\alpha_i = \mu^i_{x_1}(x_1)\,\mu^i_{x_2}(x_2)$$
The weight factor is used to shape the output fuzzy set that represents the consequent part of
the rule.
The implication method is defined as the shaping of the consequent, which is the output fuzzy
set, based on the antecedent. The input for the implication process is a single number given by the
antecedent, and the output is a fuzzy set. Minimum and product are two commonly used methods,
which are represented respectively by

$$\mu'^{\,i}_{y}(o) = \min\left(\alpha_i,\ \mu^i_y(o)\right)$$

$$\mu'^{\,i}_{y}(o) = \alpha_i\,\mu^i_y(o)$$

where o is the variable that represents the support value of the membership function, and the primed
function denotes the shaped (truncated or scaled) output set of rule i.
Aggregation takes all truncated or modified output fuzzy sets obtained as the output of the
implication process and combines them into a single fuzzy set. The output of the aggregation process
is a single fuzzy set that represents the output variable. The aggregated output is used as the input
to the defuzzification process. Aggregation occurs only once for each output variable. Since the
aggregation method is commutative, the order in which the rules are executed is not important. The
commonly used aggregation method is the max method, which can be defined as follows:

$$\mu_y(o) = \max_{i}\ \mu'^{\,i}_{y}(o)$$
The defuzzifier maps output fuzzy sets into a crisp number. Defuzzification can be performed
by several methods, such as center of gravity, center of sums, center of the largest area, first of the
maxima, middle of the maxima, maximum criterion and height defuzzification. Of these, center of
gravity (centroid method) and height defuzzification are the methods commonly used. The centroid
defuzzification method finds the center point of the solution fuzzy region by calculating the weighted
mean of the output fuzzy region. It is the most widely used technique because the defuzzified values
tend to move smoothly around the output fuzzy region.
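The whole inference process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the output universe is discretized into a list of sample points, the minimum operator is used for both the firing strength and the implication, the maximum is used for aggregation, and the centroid is approximated by a weighted mean over the sample points. The rule base and fuzzy sets are hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership function with peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def mamdani(x1, x2, rules, universe):
    """Mamdani inference for rules (mf_x1, mf_x2, mf_out) on a sampled universe."""
    aggregated = [0.0] * len(universe)
    for mf_x1, mf_x2, mf_out in rules:
        # Steps 1 and 2: fuzzify the inputs and combine them (firing strength).
        alpha = min(mf_x1(x1), mf_x2(x2))
        for k, o in enumerate(universe):
            # Step 3: implication (truncate the consequent at alpha).
            shaped = min(alpha, mf_out(o))
            # Step 4: aggregation (pointwise maximum over all rules).
            aggregated[k] = max(aggregated[k], shaped)
    # Step 5: centroid defuzzification (weighted mean of the output region).
    area = sum(aggregated)
    if area == 0:
        return None                      # no rule fired
    return sum(o * m for o, m in zip(universe, aggregated)) / area


UNIVERSE = [k * 0.1 for k in range(101)]     # output values 0.0 ... 10.0

RULES = [
    # IF x1 is low AND x2 is low THEN y is small
    (lambda v: tri(v, -5, 0, 5), lambda v: tri(v, -5, 0, 5),
     lambda o: tri(o, 0, 2.5, 5)),
    # IF x1 is high AND x2 is high THEN y is large
    (lambda v: tri(v, 0, 5, 10), lambda v: tri(v, 0, 5, 10),
     lambda o: tri(o, 5, 7.5, 10)),
]

if __name__ == "__main__":
    print(mamdani(2.5, 2.5, RULES, UNIVERSE))
```

A finer sampling of the universe gives a closer approximation to the continuous centroid, at the cost of more evaluations per rule.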
Fuzzy Logic in Bioinformatics
Fuzzy systems have been successfully applied to several areas in practice, for example in building
knowledge-based systems and fuzzy logic-based and fuzzy rule-based models. They can control and
analyze processes, and diagnose and make decisions in biomedical sciences. There are many
application areas in biomedical science and bioinformatics where fuzzy logic techniques [10] can
be applied successfully. Some of the important uses of fuzzy logic are listed below:
• Increasing the flexibility of protein motifs.
• Studying differences between various polynucleotides.
• Analyzing experimental expression data using fuzzy adaptive resonance theory.
• Aligning sequences based on a fuzzy dynamic programming algorithm.
• Mathematical modeling of complex traits influenced by genes with fuzzy-valued effects in pedigreed
populations.
• Finding cluster membership values of genes by applying a fuzzy partitioning method using fuzzy
C-Means and fuzzy c-hard mean algorithms.
• DNA sequencing using genetic fuzzy and neuro-fuzzy systems by anticipating
disturbances due to intangible parameters.
• Identifying gene clusters from microarray data.
• Predicting proteins' sub-cellular locations using the fuzzy k-nearest neighbours algorithm.
APPLICATION OF ARTIFICIAL NEURAL NETWORK
An Artificial Neural Network (ANN) is an information processing model that is able to capture
and represent complex input-output relationships. The motivation for the development of the ANN
technique came from a desire for an intelligent artificial system that could process information in
the same way as the human brain. Its novel structure is represented as multiple layers of simple
processing elements, operating in parallel to solve specific problems. ANNs resemble the human brain
in two respects: the learning process and the storing of experiential knowledge. An artificial neural
network learns and classifies problems through repeated adjustments of the connecting weights between
the elements. There are several learning strategies used in bioinformatics: Supervised Learning,
Unsupervised Learning and Reinforcement Learning.
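The idea of learning "through repeated adjustments of the connecting weights" can be illustrated with the simplest possible network, a single perceptron trained by supervised learning. The sketch below is an assumption chosen for illustration (the text does not name a specific network or learning rule): it uses the classic perceptron update, in which each weight is changed in proportion to the prediction error, and a toy data set (the logical AND function) rather than biological data.

```python
def predict(weights, bias, inputs):
    """Step activation: output 1 if the weighted sum exceeds 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Train a single perceptron on (inputs, target) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Adjust each connecting weight in proportion to the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy supervised data set: the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

if __name__ == "__main__":
    w, b = train_perceptron(AND_SAMPLES)
    for inputs, target in AND_SAMPLES:
        print(inputs, target, predict(w, b, inputs))
```

A single perceptron can only separate classes with a straight line; the multi-layer networks mentioned in the text stack many such units to represent non-linear relationships.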
An ANN learns from examples and generalizes the learning beyond the examples supplied. The
methodology of modeling or estimation is somewhat comparable to statistical modeling. Neural
networks should not, however, be heralded as a substitute for statistical modeling but rather as a
complementary effort (without the restrictive assumption of a particular statistical model) or an
alternative approach to fitting non-linear data. Neural networks have been widely used in biology
since the early 1990s. Some of the important applications of ANNs are listed below:
• Prediction of translation initiation sites in DNA sequences and proteins.
• Explaining the theory of artificial neural networks using applications in biology.
• Predicting immunologically interesting peptides in combination with an evolutionary algorithm.
• Carrying out pattern classification and signal processing in bioinformatics.
• Performing protein sequence classification.
• Predicting protein secondary structure.
GENETIC ALGORITHMS IN BIOINFORMATICS
The genetic algorithm is a method for solving both constrained and unconstrained optimization
problems that is based on natural selection, the process that drives biological evolution. GAs are
applied to certain multi-objective problems of bioinformatics, where they reduce
computation requirements and yield robust, fast and close approximate solutions.
GAs are executed iteratively on a population of coded solutions using three basic operators inspired
by biology: selection/reproduction, crossover, and mutation. They use objective function information
and probabilistic transition rules for moving to the next iteration. GAs are generally based on
manipulating populations of bit-strings using both crossover and point-wise mutation.
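The loop described above (selection, crossover and point-wise mutation on a population of bit-strings) can be sketched as follows. Several details are assumptions made only to keep the example small and are not prescribed by the text: binary tournament selection, single-point crossover, keeping the best individual unchanged in each generation (elitism), and the toy "OneMax" objective that simply counts the 1 bits. All parameter values are illustrative.

```python
import random


def one_max(bits):
    """Toy objective function: the number of 1s in the bit-string."""
    return sum(bits)


def run_ga(fitness, n_bits=20, pop_size=30, generations=40,
           crossover_rate=0.9, mutation_rate=0.02, seed=1):
    """Run a simple bit-string GA; return the best individual and the
    best fitness recorded in each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        next_population = [best[:]]              # elitism (an assumption)
        while len(next_population) < pop_size:
            # Selection: binary tournament, the fitter of two random picks.
            parent1 = max(rng.sample(population, 2), key=fitness)
            parent2 = max(rng.sample(population, 2), key=fitness)
            child = parent1[:]
            # Crossover: swap tails at a random cut point.
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)
                child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = run_ga(one_max)
    print(best, history[0], history[-1])
```

In a bioinformatics setting the bit-string would encode a candidate solution (for example an alignment or a set of selected features) and the objective function would score it; only `fitness` and the encoding need to change.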
Some of the important applications of GAs are listed below:
• Alignment and comparison of DNA, RNA, and protein sequences.
• Gene mapping in chromosomes.
• RNA structure prediction.
• Protein structure prediction and clustering.
• Molecular design and molecular docking.
• Gene finding and promoter identification from DNA sequences.
• Interpretation of gene expression and microarray data.
• Gene regulatory network identification.
• Construction of phylogenetic trees for studying evolutionary relationships.
• DNA structure prediction.
15. RNAseq data analysis – Genome-guided
K. Karthic
1. Introduction
The transcriptomic profile of an organism at any given time or condition gives the set of all its
transcripts and their quantities present at that specific time point or condition. The transcriptome reveals
a great deal about the functional aspects of the genome as well as the different kinds of biomolecules
present within the cell or tissue. It is also very useful for studying the genetics behind growth,
development and disease.
This tutorial describes how to analyse RNA-seq data when a reference genome is available and
the steps involved in identifying differentially expressed genes between two groups. For the
purpose of demonstration, we have chosen an experiment conducted on Arabidopsis thaliana.
1.1. Input files
1. Reference genome in fasta format
2. RNA-seq raw data for two groups in replicates in fastq format
1.2. Software requirements
1. Bowtie2
2. Tophat
3. Cufflinks (and associated cuffdiff and cuffmerge)
4. cummeRbund (an R package for visualizing the results)
2. Methodology
2.1. Fetching Raw data
(To save time, the raw data has already been downloaded and kept in the respective folders, so
steps 1 to 15 are to be skipped here.)
1. Open terminal and create new directories in your account
mkdir Athaliana
cd Athaliana
mkdir Ref_genome_raw
mkdir Transcriptome_raw
2. Go to the Assembly database in NCBI [https://www.ncbi.nlm.nih.gov/assembly/], type
TAIR10 in the search bar and click Search.
3. The summary of the Arabidopsis thaliana assembly is displayed. Click on the Download
Assemblies button, select Genomic FASTA in the drop-down menu and click Download.
4. The genome downloads as a .tar file; copy the file to the Ref_genome_raw folder.
5. Go to the terminal and, inside the Athaliana folder, type the following commands
cd Ref_genome_raw
tar xvf genome_assemblies.tar
6. A new folder is created with a name similar to ncbi-genomes-2018-08-22. Go to the
terminal again and type the following commands.
cd ncbi-genomes-2018-xx-xx
gunzip GCF_000001735.4_TAIR10.1_genomic.fna.gz
ls -l
7. Now you can see the listing of files, in which you will notice the fasta file of our genome
and its corresponding file size.
8. Go to the terminal again and type the following command to copy and save our genome file
under a different name and extension
cat GCF_000001735.4_TAIR10.1_genomic.fna > AraTha.fa
9. Now you can see our reference genome saved as AraTha.fa
10. To download the RNA-seq data, go to the Sequence Read Archive (SRA) database of NCBI
[http://www.ncbi.nlm.nih.gov/sra] and type the experiment accession numbers SRR671946,
SRR671947, SRR671948 and SRR671949 one after the other in the search bar and click
Search.
11. A summary of the experiment is displayed; scroll down and click on the link displayed
below the run.
12. A summary of the experiment on A. thaliana root treated with KCl (replicate data) is displayed.
Go to the Downloads tab and click on the FASTA/FASTQ link.
13. In the displayed page, type the experiment number and click Show runs.
14. Select FASTQ and click Download.
15. Repeat steps 12, 13 and 14 for all four experiment runs (SRR671946, SRR671947,
SRR671948 and SRR671949).
16. Copy the downloaded fastq files to the folder Transcriptome_raw.
17. Go to the terminal and change the directory to Transcriptome_raw.
18. Inside the Transcriptome_raw directory, you should have fastq files in zipped format for
all four experiment runs. Go to the terminal and type the command below to unzip the
files.
for i in *.gz; do gunzip "$i"; done
19. With this we have downloaded all our raw data required for our analysis.
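Before moving on to the analysis, it can be useful to confirm that each unzipped file is a complete FASTQ file. The following Python sketch counts the reads in each file. It assumes the common four-lines-per-read FASTQ layout (header, sequence, "+" line, qualities) with no wrapped sequence lines, and the function name and folder path are illustrative.

```python
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming 4 lines per read."""
    n_lines = 0
    with open(path) as handle:
        for n_lines, _ in enumerate(handle, start=1):
            pass
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4; "
                         "the file may be truncated")
    return n_lines // 4


if __name__ == "__main__":
    # Run from inside the Athaliana folder created in step 1.
    for fastq in sorted(Path("Transcriptome_raw").glob("*.fastq")):
        print(fastq.name, count_fastq_reads(fastq))
```

A file whose line count is not a multiple of four usually indicates an interrupted download or an incomplete gunzip.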
2.2. Data analysis
This section describes how to run bowtie2, tophat, cufflinks, cuffmerge and cuffdiff for analyzing
the transcriptome data. The software installations are not described; please refer to the respective
manuals for the same.
2.2.1. Indexing genome using bowtie2
1. Go to the terminal, change to the directory Ref_genome_raw/ncbi-genomes-2018-xx-xx and
type the following command:
bowtie2-build AraTha.fa AraTha
2. The above command will create bowtie index files with the .bt2 extension.
2.2.2. Running tophat
1. tophat will align the RNA-seq data to our bowtie-indexed genome. To do so, type the
following commands in the terminal
cd /home/user/Athaliana
mkdir analysis
cd analysis
tophat -o SRR671946_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671946.fastq
tophat -o SRR671947_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671947.fastq
tophat -o SRR671948_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671948.fastq
tophat -o SRR671949_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671949.fastq
2. The -o option (for example -o SRR671949_topout) specifies the output folder. For each run, a folder
is created with the following files: accepted_hits.bam, align_summary.txt, deletions.bed, insertions.bed,
junctions.bed, prep_reads.info, unmapped.bam and a logs folder.
3. The accepted_hits.bam file is the main result file, containing the mapped results in binary format.
2.2.3. Running cufflinks
1. From the alignment files generated by tophat, we can assemble the transcripts using
cufflinks.
2. In the terminal, type the following commands one after another.
cufflinks -o SRR671946_cufflinksout /home/user/Athaliana/analysis/SRR671946_topout/accepted_hits.bam
cufflinks -o SRR671947_cufflinksout /home/user/Athaliana/analysis/SRR671947_topout/accepted_hits.bam
cufflinks -o SRR671948_cufflinksout /home/user/Athaliana/analysis/SRR671948_topout/accepted_hits.bam
cufflinks -o SRR671949_cufflinksout /home/user/Athaliana/analysis/SRR671949_topout/accepted_hits.bam
3. For each run, the designated output directory will contain the following files:
genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf. The assembled transcripts are
contained in transcripts.gtf.
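Because the tophat and cufflinks commands differ only in the run accession, they can be generated rather than typed by hand. The Python sketch below only builds and prints the command lines, using the same options (-o) and file names as in the steps above; it does not run the tools. The function name is illustrative, the paths follow the folder layout assumed in this tutorial (including the unresolved ncbi-genomes-2018-xx-xx folder name, which must be replaced with the actual one), and `shlex.join` requires Python 3.8 or later.

```python
import shlex

RUNS = ["SRR671946", "SRR671947", "SRR671948", "SRR671949"]

# Paths as used in this tutorial; adjust to your own account.
INDEX_PREFIX = ("/home/user/Athaliana/Ref_genome_raw/"
                "ncbi-genomes-2018-xx-xx/AraTha")
FASTQ_DIR = "/home/user/Athaliana/Transcriptome_raw"


def build_commands(run, index_prefix, fastq_dir):
    """Return the tophat and cufflinks argument lists for one run."""
    tophat_out = f"{run}_topout"
    tophat = ["tophat", "-o", tophat_out,
              index_prefix, f"{fastq_dir}/{run}.fastq"]
    cufflinks = ["cufflinks", "-o", f"{run}_cufflinksout",
                 f"{tophat_out}/accepted_hits.bam"]
    return tophat, cufflinks


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run
    # from inside the analysis directory.
    for run in RUNS:
        for command in build_commands(run, INDEX_PREFIX, FASTQ_DIR):
            print(shlex.join(command))
```

Printing the commands first makes it easy to check the paths; the output can be saved to a file and executed with the shell once it looks correct.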
2.2.4. Running cuffmerge
1. cuffmerge will merge the transcripts into a comprehensive transcriptome.
2. Open a text editor, and type the paths of the transcript files as below:
/home/user/Athaliana/analysis/SRR671946_cufflinksout/transcripts.gtf
/home/user/Athaliana/analysis/SRR671947_cufflinksout/transcripts.gtf
/home/user/Athaliana/analysis/SRR671948_cufflinksout/transcripts.gtf
/home/user/Athaliana/analysis/SRR671949_cufflinksout/transcripts.gtf
and save the file as assembled_transcripts.txt
3. In the terminal, type the following command
cuffmerge -s /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha.fa assembled_transcripts.txt
4. A successful run creates a merged_asm directory, which contains a logs directory and a
file containing the information on the merged transcripts, called merged.gtf.
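The list of transcript files in step 2 can also be written by a short script instead of a text editor, which avoids typing mistakes in the paths. This is a minimal sketch; the function name and its arguments are illustrative, and the directory layout is the one assumed in this tutorial.

```python
from pathlib import Path

RUNS = ["SRR671946", "SRR671947", "SRR671948", "SRR671949"]


def write_manifest(analysis_dir, runs, manifest_path):
    """Write one transcripts.gtf path per line, as cuffmerge expects."""
    lines = [f"{analysis_dir}/{run}_cufflinksout/transcripts.gtf"
             for run in runs]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    for line in write_manifest("/home/user/Athaliana/analysis", RUNS,
                               "assembled_transcripts.txt"):
        print(line)
```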
2.2.5. Running cuffdiff
1. cuffdiff is used to test for differential gene expression between conditions. Go to the terminal and
type the following as a single command (each trailing backslash continues the line). The replicates
of one condition are separated by commas, and the two conditions are separated by a space.
cuffdiff -o diff_result \
-b /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha.fa \
-L Root_Kcl_control,Root_KNO3_treatment -u merged_asm/merged.gtf \
/home/user/Athaliana/analysis/SRR671946_topout/accepted_hits.bam,/home/user/Athaliana/analysis/SRR671947_topout/accepted_hits.bam \
/home/user/Athaliana/analysis/SRR671948_topout/accepted_hits.bam,/home/user/Athaliana/analysis/SRR671949_topout/accepted_hits.bam
2. A successful run creates a directory diff_result in the working directory. The directory
contains a number of different files and databases, listed as follows:
bias_params.info                  cds_exp.diff
genes.fpkm_tracking               isoforms.count_tracking
promoters.diff                    splicing.diff
tss_groups.fpkm_tracking          cds.count_tracking
cds.fpkm_tracking                 gene_exp.diff
genes.read_group_tracking         isoforms.fpkm_tracking
read_groups.info                  tss_group_exp.diff
tss_groups.read_group_tracking    cds.diff
cds.read_group_tracking           genes.count_tracking
isoform_exp.diff                  isoforms.read_group_tracking
run.info                          tss_groups.count_tracking
var_model.info
3. The fpkm tracking files give FPKM values of primary transcripts (tss_groups.fpkm_tracking), genes
(genes.fpkm_tracking), coding sequences (cds.fpkm_tracking), and transcripts
(isoforms.fpkm_tracking).
4. The count tracking files give the number of fragments for each gene (genes.count_tracking),
transcript (isoforms.count_tracking), primary transcript (tss_groups.count_tracking) and
coding sequence (cds.count_tracking).
5. The read group tracking files contain information on the counts of genes, transcripts and
primary transcripts, grouped by replicates.
6. The diff files ending with 'exp.diff' contain information on the differential expression tests
performed on the genes (gene_exp.diff), primary transcripts (tss_group_exp.diff), transcripts
(isoform_exp.diff), and coding sequences (cds_exp.diff).
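FPKM stands for fragments per kilobase of transcript per million mapped fragments. The basic normalization behind the unit can be illustrated with a short calculation. This is a simplified sketch of the definition only: Cufflinks itself estimates abundances with a statistical model and an effective transcript length, so the values in the tracking files will not match this naive calculation exactly. The function name and the example numbers are illustrative.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Naive FPKM: fragments per kilobase per million mapped fragments.

    FPKM = fragments * 10^9 / (transcript length in bp * total mapped fragments)
    """
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Hypothetical gene: 500 fragments on a 2,000 bp transcript
    # in a library of 20 million mapped fragments.
    print(fpkm(500, 2000, 20_000_000))
```

Dividing by transcript length makes long and short transcripts comparable, and dividing by library size makes samples of different sequencing depth comparable.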
3. Results
3.1. Running cummeRbund
1. cummeRbund is an R package used to visualise the results in different plots.
2. Start an R session. In R, go to your working directory and copy the diff_result folder into it.
3. Type the following commands in R
> library('cummeRbund')
> cuffdata <- readCufflinks('diff_result')
> cuffdata
4. The above commands will print a result similar to the one below
CuffSet instance with:
2 samples
33318 genes
42109 isoforms
34957 TSS
32921 CDS
33318 promoters
34957 splicing
27174 relCDS
5. To obtain a density plot showing the expression levels for each sample, type the command
below:
> csDensity(genes(cuffdata))
6. To obtain a volcano plot showing the differentially expressed genes across the two samples,
type the command below:
> csVolcano(genes(cuffdata), 'Root_Kcl_control', 'Root_KNO3_treatment')
7. To obtain a scatter plot showing the differentially expressed genes across the two samples,
type the command below:
> csScatter(genes(cuffdata), 'Root_Kcl_control', 'Root_KNO3_treatment')
8. To print a table displaying the details of all the differentially expressed genes, type the
following commands.
> gene_diff_data <- diffData(genes(cuffdata))
> sig_gene_data <- subset(gene_diff_data, (significant == 'yes'))
> nrow(sig_gene_data)
> write.table(sig_gene_data, 'diff_genes.txt', sep = '\t', row.names = F, col.names = T, quote = F)
> sig_gene_data
The last command prints out a table containing the details of all the differentially expressed
genes. The screenshot of the sample output is below:
In this chapter, we described how to download whole genome and transcriptome raw data from
NCBI databases. A very brief introduction to the software used in this tutorial was presented, and
the same tools were then used to demonstrate how to index a whole genome, align reads to a
reference genome, estimate transcript abundance and identify differentially expressed
genes. Finally, the results were interpreted visually.
16. Application of ‘OMICS’ research in aquaculture with special reference to penaeids
Gopikrishna, G., Vinaya Kumar, K., Shashi Shekhar, M. and Vijayan, K.K.
Introduction
The term OMICS refers informally to fields of study in biology such as genomics, transcriptomics,
proteomics and metabolomics. Genomics is the study of genomes of organisms, transcriptomics is
the study of transcriptomes and so on. Conventional genetic improvement programmes rely mostly
on phenotypic values, which are then converted to breeding values on which selection is
carried out. In plants as well as livestock, application of ‘omics’ has revealed interesting insights into
genetic and functional biology. When these are integrated within selective breeding programmes,
significant improvements in productivity have been obtained (Dekkers, 2012; Perez-de-Castro et al
2012). Omics approaches have been applied widely to elucidate the molecular basis of performance
traits (e.g. growth) and to overcome poorly understood biological impediments to efficient
production (disease, reproductive failure etc.) (Rothschild and Plastow 2008; Taylor et al 2016).
As far as livestock and plants are concerned, omics has had a transformational effect, as observed
by Agrawal and Narayan (2015), Van Emon (2015) and Taylor et al (2016). Coming to the aquaculture
sector, the application of selective breeding programmes has been at a snail's pace, and it has been
suggested that world aquaculture production could be doubled in a period of 13 years if breeding
programmes were supplying stocks for the farmed species (Gjedrem and Rye, 2016). Less than 10%
of aquaculture production is derived from improved lines (Gjedrem et al 2012). Looking at the
above facts, it is quite clear that omics resources in aquatic species need to be developed at a faster
pace so that these can be used in selective breeding programmes to hasten genetic response.
Crustaceans form a substantial aquaculture commodity globally. The global penaeid aquaculture
industry has exhibited remarkable growth, and in 2015 production stood at 4.8 million tons (FAO,
2017). Penaeids are an important aquaculture resource the world over, and it is necessary to have
selective breeding programmes so that improved stocks can be generated and farmed. It is well
known that the Pacific white shrimp, due to its ease of reproduction, has been subjected to
selective breeding, and genetically improved stocks are very much in demand. Information generated
from the genomes of shrimp can go a long way in aiding genetic improvement programmes so that
the gains are realised at a much faster rate.
Information on whole genome of aquaculture species
Several aquaculture species like Oncorhynchus mykiss (Berthelot et al 2014), Oreochromis
niloticus (Conte et al 2017), Lates calcarifer (Vij et al 2016), Ictalurus punctatus (Liu et al 2016) and
Salmo salar (Lien et al 2016) have had their whole genomes deciphered. In India, work on whole
genome sequencing in Labeo rohita (Rohu) and Clarias batrachus has been carried out at ICAR-
NBFGR, Lucknow. Shrimp are unique in that the genome size is comparatively large: ~2.2 Gbp in tiger
shrimp and ~1.8 Gbp in Pacific white shrimp (Guppy et al 2018). The highly repetitive nature of the
genome in shrimp is a major challenge to assembly (Huang et al 2011; Baranski et al 2014). In
addition to this, penaeids have a large number of micro-chromosomes and higher levels of genomic
heterozygosity (Abdelrahman et al 2017) compared to genome assemblies derived from terrestrial
farm species. Till date, no comprehensive genome assembly is available for a penaeid shrimp (Guppy
et al 2018). Although there has been a lot of improvement in sequencing, especially through the
development of high-throughput sequencing, resolving and assembling the many repetitive regions of
the penaeid genome (~80%; Abdelrahman et al 2017) remains a major challenge.
Transcriptomics
For this, we require the sequence data of the transcriptome. The idea here is to get the mRNA
in individuals at a given point in time, thereafter obtain the cDNA and then go in for sequencing. The
primary focus of transcriptomics has been immunology, disease resistance and reproductive biology
(Guppy et al 2018). Generating transcriptome profiles is much easier than generating the whole genome.
In P. vannamei, while investigating the effect of ammonia exposure, many genes and pathways linked
to immune response (e.g. chitinase, peritrophin, thrombospondin and penaeidin) and growth (linoleic
acid metabolism) were identified by Lu et al (2016a) to be suppressed. Reproductive dysfunction is a
common feature found in captive broodstock of tiger shrimp. Through differential gene expression
studies of whole transcriptome data, genes related to fatty acid and steroid metabolism were found
to have altered expression patterns when comparing wild-sourced and domesticated stock (Rotllant
et al 2015).
Linkage mapping of genec markers in shrimp
One of the genomic resources is the linkage map which provides a wealth of genomic informaon
and also unravel the underlying genetic architecture of commercially and biologically important
traits. In penaeids, there have been substantial efforts to generate linkage maps. Linkage maps are
constructed using data from family groups, viz. parents as well as progeny. Earlier, Amplified Fragment
Length Polymorphism (AFLP) markers were used for construction of linkage maps in tiger shrimp (Wilson et al 2002).
Later, Baranski et al (2014) constructed the first linkage map in tiger shrimp using SNPs. Presently, linkage
maps are available that include between 3959 and 9298 markers and cover all 44 chromosomes of
the penaeid genome (Baranski et al 2014, Yu et al 2015, Lu et al 2016b, Jones et al 2017a). Such
maps have increased the applicability of these resources in assisting genome assembly, examining the
architecture of traits and comparative mapping (Guppy et al 2018). It is interesting to note
that construction of linkage maps has revealed some hitherto unknown facts. Baranski et al (2014)
reported in tiger shrimp that the female-specific map was substantially shorter than the male-specific
map (2917 vs 4059 cM), whereas in P. vannamei, Perez et al (2004) and Zhang et al (2007) reported
longer maps for females than males (4134 vs. 3221 cM and 2771 vs. 2116 cM respectively), indicating
that recombination rates may differ between the sexes, and in opposite directions in the two species.
There is still ambiguity in the karyotype due to the micro-chromosomes in penaeids, as a consequence
of which it appears that differences in map length exist between species and that sex-based
recombination might occur (Baranski et al 2014).
Maps available for tiger shrimp (Baranski et al 2014) and Pacific white shrimp (Yu et al 2015) have
average inter-marker distances of 0.9 and 0.7 cM respectively across different map iterations.
This is definitely a significant achievement; however, 1 cM equates to an estimated physical genome
distance of approximately 400-600 kb for penaeids (P. monodon 395 kb/cM (Baranski et al 2014), P. vannamei
598.89 kb/cM (Yu et al 2015), P. japonicus 657.89 kb/cM (Lu et al 2016b)), which presents a significant
challenge when we look to characterise potentially useful genes or genomic regions underlying findings
of trait-association studies (Guppy et al 2018). Future work is required to obtain denser maps that
decrease the interval between markers. This could be accomplished by genotyping more families and
more individuals per family, which would provide additional observations of informative meiotic
recombination events, or by integrating orphaned (unplaced) markers into existing maps (Fierst 2015).
Utilising enhanced cost-effective genotyping strategies (e.g. the genotype by sequencing method) could
result in genotyping of more families and more individuals per family, consequent to which
fine-grained marker placement could be achieved (Guppy et al 2018).
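To see why sub-centimorgan marker spacing still leaves large physical gaps, the figures quoted above can be combined in a short calculation. The following is only a rough sketch: it assumes the kb/cM ratio is uniform along the genome, which is a simplification, and the function and variable names are illustrative.

```python
# Physical distance per centimorgan (kb/cM), as quoted in the text.
KB_PER_CM = {
    "P. monodon": 395.0,
    "P. vannamei": 598.89,
}

# Average inter-marker distance (cM), as quoted in the text.
AVG_SPACING_CM = {
    "P. monodon": 0.9,
    "P. vannamei": 0.7,
}


def interval_kb(spacing_cm, kb_per_cm):
    """Approximate physical gap between adjacent markers in kb.

    Assumes a constant kb/cM ratio, which real genomes do not have.
    """
    return spacing_cm * kb_per_cm


intervals = {
    species: interval_kb(AVG_SPACING_CM[species], KB_PER_CM[species])
    for species in KB_PER_CM
}

for species, kb in intervals.items():
    print(f"{species}: roughly {kb:.0f} kb between adjacent markers")
```

Under these assumptions the average gap between neighbouring markers is several hundred kilobases in both species, a region that could contain many genes.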
Developing and applying polymorphic markers
There have been considerable efforts in the past to develop a wide range of traditional
genomic markers (e.g. allozymes, RFLP, AFLP and microsatellites) in several penaeid species. Most
of these markers have been used for assessing wild populations and managing family lines. These
markers have limitations, which have been reviewed by Benzie (1998, 2009). Due to the high cost of
developing them and their failure to unravel the complexity of production traits, they have not found
favour in the penaeid industry (Guppy et al 2018). Today, the traditional markers are being replaced
by powerful and cost-effective markers like Single Nucleotide Polymorphisms (SNPs). SNPs are
very abundant in the genome and can help substantially in genome studies. About 9 million SNPs in the
Bos taurus genome (Xu et al 2017), 7 million SNPs in chickens (Rubin et al 2010), 9.7 million SNPs in
Atlantic salmon (Yanez et al 2016), 8.6 million SNPs in channel catfish (Zeng et al 2017) and 5.6 million
SNPs in Lates calcarifer (Vij et al 2016) have been identified. SNP discovery has further led to the
manufacture of SNP arrays in several species like cattle and sheep, crops like wheat, and aquaculture
species like catfish and Atlantic salmon. In P. monodon, at ICAR-CIBA, Baranski et al (2014) developed
a chip containing 6000 SNPs which were mainly identified using the transcriptomic approach. It would
be pertinent to point out that till date, only two studies have produced validated SNP genotyping
arrays: Baranski et al (2014) in tiger shrimp (6000 SNPs) and Jones et al (2017b) in Pacific white
shrimp (6400 SNPs). The latter has been sold commercially as the Infinium ShrimpLD-24 v1.0
BeadChip. An interesting feature of these arrays is that they are based on type-I SNPs (genic
rather than inter-genic) and many of these SNPs have been annotated with putative genes (62 and 47%
respectively), thereby providing a strong foundation for further trait mapping studies (Robinson et
al 2014 and Khatkar et al 2017b). An additional feature that needs to be factored in is the cost of
the SNP arrays. The approximate cost of genotyping per individual has drastically fallen to about Rs.
5000/-. This needs to be further reduced to make it cost-effective. Selection of a genotyping method
for commercial applications would hinge on the time required for sample processing, genotyping
and data analysis, as the window between pre-selection of candidate broodstock at harvest and final
breeding selection and spawning is quite short (less than 3-6 months) (Guppy et al 2018).
Genotype by sequencing
A unique advantage of the genotype by sequencing (GBS) method is the ability to discover and
genotype markers (de novo marker discovery) without requiring reference to existing genomic
information like genomic sequences and transcriptomes. In penaeids, a number of GBS approaches
have been utilised, with 25140 and 23049 markers obtained in Pacific white shrimp (Yu et al 2015,
Wang et al 2017) and 28981 markers obtained in Kuruma shrimp (Lu et al 2016b). Most of these
markers have been utilised to generate linkage maps, undertake Quantitative Trait Loci (QTL) mapping
(Yu et al 2015, Lu et al 2016b) and estimate genomic prediction accuracy (Wang et al 2017), but they
have yet to be utilised in the industry for genotyping.
Markers for breeding population management
Crustaceans molt frequently, and this places them at a disadvantage for
identification. However, tagging with visible implant elastomer tags (for family identification) and
eye-ring tags (for individual identification) has been found to address this issue to a certain extent.
The number of individuals available per family is rather large in shrimp, and they need to be reared in
a common environment so that there is no confounding of environmental effects. Each family needs
to be reared separately till tagging, and this poses a significant challenge for infrastructure. Tracking of pedigree
is of paramount importance to keep inbreeding low. Use of genomic markers could enhance the
identification of individual shrimp, but here again the cost of genotyping (high-density solid-state
arrays), lack of genotyping power (microsatellites) or a combination of both these factors is a major
stumbling block (Vandeputte and Haffray, 2014).
Exploiting genetic variation underlying phenotypes
It is important to comprehend the relationship between genetic variation and the phenotypes
of economically important traits. The information so obtained could prove useful for integrating
genomics research into food production industries (Abdelrahman et al 2017). Through QTL mapping
and Genome-Wide Association Studies (GWAS), it may be possible to identify the number, location and
effect size of genetic elements (i.e. genes, loci and regions) that are linked to the observed phenotypic
variation of a trait (Mackay et al 2009). For this to be applied at the field level, we need to identify
markers that are highly predictive of a superior or inferior phenotype in order to improve the selection
of elite individuals for breeding programmes (Thorgaard et al 2006). Genomic breeding values have
recently been utilised in agricultural breeding programmes in an effort to improve simple
and complex traits (Meuwissen et al 2001, 2016). Such a procedure could also be applied to shrimp
breeding programmes to elicit substantial genetic response.
QTL mapping
A Quantitative Trait Locus (QTL) is a region of the genome containing one or several genes that
affect variation in a quantitative trait, and it is identified by its linkage to polymorphic marker loci.
Mapping of QTLs involves two components: detection and localisation. Once the QTLs are detected,
they need to be localised and the gene(s) identified. QTLs can be localised through their genetic
linkage to visible marker loci with genotypes that we can readily classify. If a QTL is linked to
a marker locus, then individuals with different marker locus genotypes will exhibit different mean
values of the quantitative trait. QTLs can be mapped in families or in segregating progeny of crosses
between genetically divergent strains (linkage mapping) or in unrelated individuals from the same
population (association mapping). Later, these QTLs need to be validated in a population of individuals.
If the validation yields encouraging results, the QTLs can be utilised to improve the concerned trait
in a breeding programme. Two studies in aquaculture species related to QTL mapping have been
reported. One is by Li et al (2006) for growth in Kuruma shrimp P. japonicus and the other by Robinson
et al (2014) for resistance to White Spot Syndrome Virus in tiger shrimp. In the former, AFLP
markers were used, whereas in tiger shrimp SNP markers were used.
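The idea that a marker linked to a QTL shows different trait means across its genotype classes can be illustrated with a single-marker test. The sketch below is not the method used in the studies cited above; it computes a one-way ANOVA F statistic in plain Python on made-up body weights, and all names and values are hypothetical.

```python
from statistics import mean


def marker_f_statistic(groups):
    """One-way ANOVA F statistic for trait values grouped by marker genotype.

    `groups` maps a genotype label to a list of trait values.
    A large F suggests the genotype classes differ in mean trait value.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k = len(groups)          # number of genotype classes
    n = len(all_values)      # total number of individuals
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical body weights (g) of shrimp grouped by genotype at one SNP.
body_weight = {
    "AA": [20.0, 21.0, 22.0],
    "AB": [23.0, 24.0, 25.0],
    "BB": [26.0, 27.0, 28.0],
}

genotype_means = {genotype: mean(values) for genotype, values in body_weight.items()}
f_stat = marker_f_statistic(body_weight)

print(genotype_means)
print(f"F = {f_stat:.2f}")
```

In a real analysis the test would be repeated for every marker on the map, with correction for multiple testing and for family structure, which this sketch does not attempt.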
Genome-Wide Association Studies
These studies aim to associate markers distributed across the genome with a trait. Till date, only
two such studies have been reported in aquaculture species. The first is in tiger shrimp, carried out
at ICAR-CIBA and NOFIMA, Norway. Seven families of tiger shrimp were exposed to
the White Spot Syndrome Virus, and 1024 shrimp were genotyped. About 9 QTLs
were found to be significantly associated with hours of survival. In addition, 3 SNPs were found
to be associated with sex in tiger shrimp (Robinson et al 2014). The second study was for growth in
P. vannamei; the authors could not find any significant association of markers with growth. Earlier, Yu
et al (2015), working in P. vannamei, had reported a large QTL for growth explaining 17.9% of
the phenotypic variation.
Conclusion
Omics research in aquaculture has generated a lot of information during the past three
decades. Compared to plant and livestock breeding programmes, aquatic species have a long way
to go. The information flowing from various resources like linkage maps, physical maps, annotated
transcriptomes, characterised proteome data and genome sequences needs to be incorporated into
a single platform for use by other scientists working in this field. Wide publicity needs to be given
to high-density linkage maps to comprehend genome architecture so as to help in future genetic
improvement programmes. In-depth studies on economically important traits in aquatic species are
also required urgently so as to help farmers reap profits from culture of fish/shrimp.
References cited
Abdelrahman, H., ElHady, M., Alcivar-Warren, A., Allen, S., Al-Tobasei, R., Bao, L., et al. (2017). Aquaculture genomics, genetics and breeding in the United States: current status, challenges, and priorities for future research. BMC Genomics 18:191. doi: 10.1186/s12864-017-3557-1
Agrawal, R., and Narayan, J. (2015). Unravelling the impact of bioinformatics and omics in agriculture. Int. J. Plant Biol. Res. 3:1039.
Baranski, M., Gopikrishna, G., Robinson, N. A., Katneni, V. K., Shekhar, M. S., Shanmugakarthik, J., et al. (2014). The development of a high density linkage map for black tiger shrimp (Penaeus monodon) based on cSNPs. PLoS ONE 9:e85413. doi: 10.1371/journal.pone.0085413
Benzie, J. A. (1998). Penaeid genetics and biotechnology. Aquaculture 164, 23–47. doi: 10.1016/S0044-8486(98)00175-6
Benzie, J. A. (2009). Use and exchange of genetic resources of penaeid shrimps for food and aquaculture. Rev. Aquacult. 1, 232–250. doi: 10.1111/j.1753-5131.2009.01018.x
Berthelot, C., Brunet, F., Chalopin, D., Juanchich, A., Bernard, M., Noël, B., et al. (2014). The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat. Commun. 5:3657. doi: 10.1038/ncomms4657
Conte, M. A., Gammerdinger, W. J., Bartie, K. L., Penman, D. J., and Kocher, T. D. (2017). A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions. BMC Genomics 18:341. doi: 10.1186/s12864-017-3723-5
Dekkers, J. C. (2004). Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons. J. Anim. Sci. 82(13 Suppl.), E313–E328. doi: 10.2527/2004.8213_supplE313x
FAO (2017). FishStat Plus - Universal Software for Fishery Statistical Time Series. FAO Fisheries and Aquaculture Department. Rome.
Fierst, J. L. (2015). Using linkage maps to correct and scaffold de novo genome assemblies: methods, challenges, and computational tools. Front. Genet. 6:220. doi: 10.3389/fgene.2015.00220
Gjedrem, T., and Rye, M. (2016). Selection response in fish and shellfish: a review. Rev. Aquacult. 10, 168–179. doi: 10.1111/raq.12154
Gjedrem, T., Robinson, N., and Rye, M. (2012). The importance of selective breeding in aquaculture to meet future demands for animal protein: a review. Aquaculture 350, 117–129. doi: 10.1016/j.aquaculture.2012.04.008
Guppy, J. L., Jones, D. B., Jerry, D. R., Wade, N. M., Raadsma, H. W., Huerlimann, R., and Zenger, K. R. (2018). The State of “Omics” Research for farmed penaeids: Advances in research and impediments to industry utilisation. Front. Genet. 9:282. doi: 10.3389/fgene.2018.00282
Huang, S.-W., Lin, Y.-Y., You, E.-M., Liu, T.-T., Shu, H.-Y., Wu, K.-M., et al. (2011). Fosmid library end sequencing reveals a rarely known genome structure of marine shrimp Penaeus monodon. BMC Genomics 12:242. doi: 10.1186/1471-2164-12-242
Jones, D. B., Jerry, D. R., Khatkar, M. S., Raadsma, H. W., Steen, H. V. D., Prochaska, J., et al. (2017a). A comparative integrated gene-based linkage and locus ordering by linkage disequilibrium map for the Pacific white shrimp, Litopenaeus vannamei. Sci. Rep. 7:10360. doi: 10.1038/s41598-017-10515-7
Jones, D. B., Zenger, K. R., Khatkar, M. S., Raadsma, H. W., Steen, H. A. M. V. D., Prochaska, J., et al. (2017b). “Development of a low-density commercial genotyping array for the white legged shrimp, Litopenaeus vannamei,” in AAABG, Edited by Genetics AAoABa (Townsville, QLD).
Khatkar, M., Coman, G., Thomson, P., and Raadsma, H. (2017a). “Comparison of different breeding design options for long term genetic gain and diversity in aquaculture species,” in Proc Assoc Advmt Anim Breed Genet (Townsville, QLD), 449–452.
Li, Y., Dierens, L., Byrne, K., Miggiano, E., Lehnert, S., Preston, N., et al. (2006). QTL detection of production traits for the Kuruma prawn Penaeus japonicus (Bate) using AFLP markers. Aquaculture 258, 198–210. doi: 10.1016/j.aquaculture.2006.04.027
Lien, S., Koop, B. F., Sandve, S. R., Miller, J. R., Kent, M. P., Nome, T., et al. (2016). The Atlantic salmon genome provides insights into rediploidization. Nature 533, 500–505. doi: 10.1038/nature17164
Liu, Z., Liu, S., Yao, J., Bao, L., Zhang, J., Li, Y., et al. (2016). The channel catfish genome sequence provides insights into the evolution of scale formation in teleosts. Nat. Commun. 7:11757. doi: 10.1038/ncomms11757
Lu, X., Kong, J., Luan, S., Dai, P., Meng, X., Cao, B., et al. (2016a). Transcriptome analysis of the hepatopancreas in the Pacific White Shrimp (Litopenaeus vannamei) under acute ammonia stress. PLoS ONE 11:e0164396. doi: 10.1371/journal.pone.0164396
Lu, X., Luan, S., Hu, L. Y., Mao, Y., Tao, Y., Zhong, S. P., et al. (2016b). High-resolution genetic linkage mapping, high-temperature tolerance and growth-related quantitative trait locus (QTL) identification in Marsupenaeus japonicus. Mol. Genet. Genomics 291, 1391–1405. doi: 10.1007/s00438-016-1192-1
Mackay, T. F., Stone, E. A., and Ayroles, J. F. (2009). The genetics of quantitative traits: challenges and prospects. Nat. Rev. Genet. 10, 565–577. doi: 10.1038/nrg2612
Meuwissen, T., Hayes, B., and Goddard, M. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819.
Meuwissen, T., Hayes, B., and Goddard, M. (2016). Genomic selection: a paradigm shift in animal breeding. Anim. Front. 6, 6–14. doi: 10.2527/af.2016-0002
Pérez, F., Erazo, C., Zhinaula, M., Volckaert, F., and Calderón, J. (2004). A sex-specific linkage map of the white shrimp Penaeus (Litopenaeus) vannamei based on AFLP markers. Aquaculture 242, 105–118. doi: 10.1016/j.aquaculture.2004.09.002
Pérez-de-Castro, A. M., Vilanova, S., Cañizares, J., Pascual, L., Blanca, J. M., Diez, M. J., et al. (2012). Application of genomic tools in plant breeding. Curr. Genomics 13, 179–195. doi: 10.2174/138920212800543084
Robinson, N. A., Gopikrishna, G., Baranski, M., Katneni, V. K., Shekhar, M. S., Shanmugakarthik, J., et al. (2014). QTL for white spot syndrome virus resistance and the sex-determining locus in the Indian black tiger shrimp (Penaeus monodon). BMC Genomics 15:731. doi: 10.1186/1471-2164-15-731
Rothschild, M. F., and Plastow, G. S. (2008). Impact of genomics on animal agriculture and opportunities for animal health. Trends Biotechnol. 26, 21–25. doi: 10.1016/j.tibtech.2007.10.001
Rotllant, G., Wade, N. M., Arnold, S. J., Coman, G. J., Preston, N. P., and Glencross, B. D. (2015). Identification of genes involved in reproduction and lipid pathway metabolism in wild and domesticated shrimps. Mar. Genomics 22, 55–61. doi: 10.1016/j.margen.2015.04.001
Rubin, C.-J., Zody, M. C., Eriksson, J., Meadows, J. R. S., Sherwood, E., Webster, M. T., et al. (2010). Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464:587. doi: 10.1038/nature08832
Taylor, J. F., Taylor, K. H., and Decker, J. E. (2016). Holsteins are the genomic selection poster cows. Proc. Natl. Acad. Sci. U.S.A. 113, 7690–7692. doi: 10.1073/pnas.1608144113
Thorgaard, G. H., Nichols, K. M., and Phillips, R. B. (2006). Comparative gene and QTL mapping in aquaculture species. Israeli J. Aquacult. Bamidgeh 58, 4.
Van Emon, J. M. (2015). The omics revolution in agricultural research. J. Agric. Food Chem. 64, 36–44. doi: 10.1021/acs.jafc.5b04515
Vandeputte, M., and Haffray, P. (2014). Parentage assignment with genomic markers: a major advance for understanding and exploiting genetic variation of quantitative traits in farmed aquatic animals. Front. Genet. 5:432. doi: 10.3389/fgene.2014.0043
Vij, S., Kuhl, H., Kuznetsova, I. S., Komissarov, A., Yurchenko, A. A., Van Heusden, P., et al. (2016). Chromosomal-level assembly of the Asian seabass genome using long sequence reads and multi-layered scaffolding. PLoS Genet. 12:e1005954. doi: 10.1371/journal.pgen.1005954
Wang, Q., Yu, Y., Yuan, J., Zhang, X., Huang, H., Li, F., et al. (2017). Effects of marker density and population structure on the genomic prediction accuracy for growth trait in Pacific white shrimp Litopenaeus vannamei. BMC Genet. 18:45. doi: 10.1186/s12863-017-0507-5
Wilson, K., Li, Y. T., Whan, V., Lehnert, S., Byrne, K., Moore, S., et al. (2002). Genetic mapping of the black tiger shrimp Penaeus monodon with amplified fragment length polymorphism. Aquaculture 204, 297–309. doi: 10.1016/S0044-8486(01)00842-0
Xu, C., Li, E., Liu, Y., Wang, X., Qin, J. G., and Chen, L. (2017). Comparative proteome analysis of the hepatopancreas from the Pacific white shrimp Litopenaeus vannamei under long-term low salinity stress. J. Proteomics 162, 1–10. doi: 10.1016/j.jprot.2017.04.013
Yáñez, J. M., Naswa, S., López, M., Bassini, L., Correa, K., Gilbey, J., et al. (2016). Genomewide single nucleotide polymorphism discovery in Atlantic salmon (Salmo salar): validation in wild and farmed American and European populations. Mol. Ecol. Resour. 16, 1002–1011. doi: 10.1111/1755-0998.12503
Yu, Y., Zhang, X., Yuan, J., Li, F., Chen, X., Zhao, Y., et al. (2015). Genome survey and high-density genetic map construction provide genomic and genetic resources for the Pacific White Shrimp Litopenaeus vannamei. Sci. Rep. 5:15612. doi: 10.1038/srep15612
Zeng, Q., Fu, Q., Li, Y., Waldbieser, G., Bosworth, B., Liu, S., et al. (2017). Development of a 690K SNP array in catfish and its application for genetic mapping and validation of the reference genome sequence. Sci. Rep. 7:40347. doi: 10.1038/srep40347
Zhang, L., Yang, C., Zhang, Y., Li, L., Zhang, X., Zhang, Q., et al. (2007). A genetic linkage map of Pacific white shrimp (Litopenaeus vannamei): sex-linked microsatellite markers and high recombination rates. Genetica 131, 37–49. doi: 10.1007/s10709-006-9111-8
17. Shrimp Genomics : Current status and Challenges
M.S. Shekhar, K. Vinaya Kumar and K.K. Vijayan
Shrimp genomics has made considerable research progress over the last few decades.
Recent advances in “omics”, in particular the advancement of NGS techniques, have
provided the aquaculture industry with opportunities as well as challenges in understanding
the complexity of the whole genome of shrimp. However, the currently available molecular biology
resources and bioinformatics techniques require further development to meet these challenges
and provide the most informative results in deciphering the shrimp genome.
1. Introduction
The global consumption of food fishes is projected to increase tremendously. However, with
exploitation and the decrease in wild catch fisheries worldwide, much importance is now being given to
increasing production from aquaculture. For effective genetic improvement breeding programs in
aquaculture and fisheries management, studies relating to population structure, genetic diversity,
environmental adaptation and molecular response to biotic and abiotic stress are very important.
“Biotechnology” integrated with “Omics” is a term that has now come to encompass many of the
exciting new developments in aquaculture during recent years. Hence, for sustainable aquaculture,
genetic improvement for desired traits through biotechnological means has gained importance in
recent years. Aquaculture biotechnology deals with the use of knowledge and techniques in the field
of molecular, cellular and genetic processes to develop improved aquaculture products and varieties.
Therefore, the broad term ‘omics’, which covers the methods and techniques required for analyzing
all the different types of molecules and the pathways associated with them, is used in aquaculture as well.
This encompasses the four major “omics”, namely transcriptomics, proteomics, metabolomics and
epigenomics. Viral infections are one of the major reasons for the huge economic losses in shrimp
farming. The control of viral diseases remains a serious challenge for the shrimp aquaculture
industry. White spot syndrome virus (WSSV) is a major pathogen which is geographically widespread
and continues to be a serious threat affecting shrimp farms the world over. In the absence of a true
adaptive immune response system in invertebrates, shrimps respond by non-specific innate immune
mechanisms. Shrimp genome annotation and transcriptome generation as “omics” tools would help
to unravel the molecular mechanisms of the immune defence network that operate in shrimp
in response to WSSV infection, in addition to supporting development of genetically improved varieties
of shrimp with desirable traits through genetic improvement breeding programmes.
2. Transcriptomics
Next-generation RNA sequencing (RNA-seq) is a modern, high-throughput method that is not
restricted by the unavailability of a genome reference sequence and has tremendous potential
for identifying, profiling and quantifying RNA transcripts with increased
sensitivity. The transcriptome is the complete set of transcripts in a cell, together with their quantity,
for a specific developmental stage or physiological condition. Transcriptome analysis helps in identifying the
functional elements of the genome and revealing the molecular constituents of cells and tissues in response to
environmental stress, with accurate quantification of gene expression levels. Because of these
advantages over other expression techniques, this approach is now widely used in
decoding the functional role of genes and cell responses against environmental stress. Significant
progress has recently been achieved in understanding the transcript expression of marine crustaceans
such as Litopenaeus vannamei, Fenneropenaeus chinensis, Eriocheir sinensis and Macrobrachium
nipponense in response to biotic and abiotic stress factors. Transcriptome data aid in identification of
novel genes in the absence of a shrimp genome database, as shown in Table 1. Next-generation sequencing
technologies have therefore influenced the analysis of gene regulation.
Table 1. Transcriptomes generated from shrimp species
Species | Tissues | Transcriptome generation
L. vannamei | Hemocytes | WSSV
L. vannamei | Hepatopancreas | WSSV
L. vannamei | Hepatopancreas and muscle | WSSV and growth
L. vannamei | Hemolymph and hemocytes | TSV
L. vannamei | Hepatopancreas | TSV
L. vannamei | Testis and ovaries | Gonadal development
L. vannamei | Hepatopancreas | Acute ammonia stress
L. vannamei | Hepatopancreas | Osmoregulatory stress
L. vannamei | Gills | Osmoregulatory stress
L. vannamei | Hepatopancreas and hemocytes | Nitrite
L. vannamei | Whole larvae | Embryo development
L. vannamei | Embryo, nauplius, zoea, mysis, post larvae | Larval development
L. vannamei | Whole shrimp | Molting
L. vannamei | Muscle | Feed efficiency
L. vannamei | Heart, muscle, hepatopancreas and eyestalk | Growth
P. monodon | Hepatopancreas and ovary | Reproduction and development
P. monodon | Eyestalk, stomach, female gonad, male gonad, gill, haemolymph, hepatopancreas, lymphoid organ, tail muscle, embryos, nauplii, zoea, and mysis, whole larvae | Gene discovery
M. japonicus | Fertilized eggs, embryos and vegetal halves | Embryo development
F. chinensis | Cephalothorax | WSSV
F. merguiensis | Cuticle, muscle, androgenic gland, hepatopancreas, stomach, nervous system, eyestalk, male gonads, female gonads | Color
F. merguiensis | Hepatopancreas, stomach, eye stalk, nerve cord, male gonad, female gonad, androgenic gland region, muscle and cuticle | Reproduction and development
3. Complexity of shrimp genome
Shrimp genomes are large, with highly repetitive sequences, which poses significant challenges in
deciphering the whole genome and in other genetic studies. In our study, genome size estimated
by flow cytometry showed the shrimp genome to be very large. The genome sizes of the
four major species of the genus Penaeus (Penaeus monodon, Penaeus indicus, Penaeus vannamei and
Penaeus japonicus) were found to be in a similar range. The genome size of female shrimps ranged from
2.14 ± 0.02 pg (P. japonicus) to 2.91 ± 0.03 pg (P. monodon). In male shrimps, the genome size ranged
from 2.19 ± 0.02 pg (P. indicus) to 2.86 ± 0.06 pg (P. monodon). A significant difference in genome
size was observed between male and female shrimp of all species except P. monodon. The highest
relative difference between the sexes, 12.78%, was observed in P. indicus.
The interspecific relative difference in genome size was highest between the male shrimps
of P. monodon and P. indicus (30.59%) and between the female shrimps of P. monodon and P. japonicus (35.98%).
This study was undertaken to estimate genome size in shrimps, which will help guide research
aimed at generating sequence data for the whole genome of these species in future. The
penaeid genome (80% repetitive) remains a challenge even today for sequencing and assembly.
Short-read second-generation sequencing methods, for example Illumina sequencing technology,
are preferred for non-complex genomes, where assembly proceeds by identifying and overlaying
overlapping sequences and building the resulting contigs and scaffolds. However, when short-read
sequencing methods are applied to highly repetitive regions within the genome, building contiguous
sequences becomes difficult. Shrimp genomes also have high levels of heterozygosity. Previous
short-read assemblies in shrimps have been highly fragmented, with a high number of scaffolds.
There are reports that polysaccharide contamination and high DNase activity in shrimp samples can
interfere with long-read sequencing methodologies; these are major challenges to overcome, and
methods to isolate intact, pure shrimp genomic DNA need more standardization.
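The interspecific percentages quoted above can be reproduced from the reported mean genome sizes if the relative difference is taken with respect to the smaller value. That definition is an inference from the numbers rather than something stated in the text. The sketch below also converts picograms to megabases using the widely used factor 1 pg ≈ 978 Mb, which is an outside assumption and not a figure from this chapter.

```python
# Widely used conversion factor (1 pg of DNA is about 978 Mb).
# This value is an assumption brought in from outside the text.
MB_PER_PG = 978.0


def relative_difference_pct(larger, smaller):
    """Relative difference in percent, expressed against the smaller value."""
    return (larger - smaller) / smaller * 100.0


def pg_to_mb(picograms):
    """Approximate genome size in megabases from DNA content in picograms."""
    return picograms * MB_PER_PG


# Mean genome sizes (pg) reported in the text.
female_diff = relative_difference_pct(2.91, 2.14)  # P. monodon vs P. japonicus
male_diff = relative_difference_pct(2.86, 2.19)    # P. monodon vs P. indicus

print(f"Females: {female_diff:.2f}%")
print(f"Males:   {male_diff:.2f}%")
print(f"P. monodon female genome: about {pg_to_mb(2.91):.0f} Mb")
```

Under this conversion the estimates correspond to genomes of roughly 2 to 3 Gb, which is consistent with the description of shrimp genomes as large.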
4. NGS platform for shrimp genome sequencing
Several NGS platforms are currently in use, such as Illumina MiSeq, Ion Torrent PGM, PacBio RS,
Illumina GAIIx, Illumina HiSeq 2000, etc. The key features which determine the optimal platform
are speed of sequencing and low error rate. Sequencing has been
dominated by Illumina. However, this technology is not adequate for dealing with complex
shrimp genomes, which require generation of longer read lengths. One platform which
yields longer read lengths is PacBio, which is based on single molecule real time (SMRT) sequencing.
DNA polymerase molecules, each bound to a DNA template, are present at the base of 50 nm-wide wells
called zero-mode waveguides (ZMWs). Each polymerase carries out second-strand DNA synthesis in the
presence of γ-phosphate fluorescently labeled nucleotides. With each base incorporation, a
distinctive pulse of fluorescence is detected in real time. The PacBio platform, by virtue of its long read
lengths, has potential application in de novo sequencing of the shrimp genome. Mean read
lengths of approximately 1500 bp were generated using the PacBio RS system with the first generation of chemistry
(C1 chemistry), whereas the advanced PacBio RS II system with the C4 chemistry yields average read lengths
over 10 kb, with an N50 of more than 20 kb and maximum read lengths over 60 kb. The latest PacBio
Sequel System is an advanced version with higher throughput, more scalability, a reduced footprint
and lower sequencing project costs compared to the PacBio RS II System. The key advance of
the Sequel System is the capacity of its redesigned SMRT Cells, which contain one million zero-mode
waveguides (ZMWs), compared to 150,000 ZMWs in the PacBio RS II. Active individual polymerases
are immobilized within the ZMWs, providing windows to observe and record DNA sequencing in real
time. In future, successful assemblies of the shrimp genome will depend upon a “hybrid assembly”
approach, utilizing short-read sequencing to correct the high error rate observed in long-read PacBio
sequencing.
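The N50 statistic mentioned above is the length L such that reads (or contigs) of length L or longer together contain at least half of all sequenced bases. A minimal sketch of the calculation follows; the read lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of read or contig lengths.

    Lengths are sorted from longest to shortest and accumulated until
    the running total reaches half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical read lengths in base pairs.
read_lengths = [30000, 25000, 20000, 15000, 10000]
print(n50(read_lengths))
```

Because N50 is weighted by the number of bases, a few very long reads raise it far more than many short reads do, which is why it is reported alongside the mean read length.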
5. The Challenges
In comparison to other livestock industries, very few improved lines are used in aquaculture
production (Gjedrem et al., 2012). Aquaculture production has also not completely utilized the
existing natural genetic potential and resources for increased productivity. In the case of shrimp, there
have been numerous molecular studies on the expression and function of selected genes involved
in metabolic pathways; however, little attention has been given to the metabolic differences which exist
among shrimp or between their developmental stages. The differences among shrimp resulting from
particular metabolic adaptations to varied environmental conditions need to be studied in detail.
These types of studies have direct relevance to better management practices and formulation of
optimal diets for the domestication of shrimp in aquaculture. In the immediate future, the main
challenge is to integrate the available genomic data with physiological studies on shrimp. The
outcomes will elucidate species-specific adaptations to environmental conditions, and have the
potential to inform and stimulate research in many biological disciplines. For any genomic study
and analysis, a reference genome is essential; however, except for the branchiopod Daphnia pulex, no
information on a complete genome assembly is available from other crustaceans. The genome of
D. pulex is comparatively small, at about 200 Mb, containing 30,970 genes and only 9.4%
repetitive sequences, whereas shrimp genomes are large and complex, which makes sequencing and
assembly difficult. Bioinformatics, data mining and sequence annotation methods need to be defined and developed
for complex genomes, which would aid in complete genome assembly.
6. Future potential
Introducing improved bioinformatics approaches for error correction of long sequencing
reads, and the use of optical mapping, would help in completing the large genome assemblies of shrimps
and other aquatic species. There is also an urgent need to construct linkage and physical maps, and
to develop databases for annotated transcriptome, proteomics and metabolomics data, which would help
in generating highly informative shrimp “omics” resources to understand genome structure, genome evolution,
phylogeny and natural selection of aquaculture species. Functional genomics with an annotated
genome, and validation of candidate genes by experimental CRISPR or RNAi knockdown studies,
would be significant progress towards identification of target genes for traits of commercial importance
such as growth and disease resistance. Understanding the genome and genetic makeup of shrimp
would help in deciphering complex traits, which would eventually accelerate shrimp breeding
programmes. A high-density linkage map is essential for shrimp genomics and genetic studies. Creation
of a high-density linkage map would help in mapping of QTLs for traits of interest such as body
weight, body length, disease resistance and other traits which have high commercial significance in
aquaculture.
ICAR – Central Instute of Brackishwater Aquaculture, Chennai
76
18. Application of Biotechnology in animal reproduction
Sherly Tomy
Reproductive efficiency is a major factor determining the economic success of any livestock enterprise. The majority of animal breeding programs have aimed at enhancing the genetic worth of animals using conventional selection methods based primarily on phenotype. Revolutionary tools in reproductive biotechnology, such as recombinant DNA procedures, genome engineering, transgenic technology and somatic cell nuclear transplantation, have added new dimensions to animal breeding. Application of biotechnology in animal breeding has resulted in several remarkable achievements, like the sheep Dolly, created by somatic cloning; transgenic pigs that can be organ donors for humans; and animal bioreactors producing human therapeutic proteins in milk. Compared with terrestrial animals, progress in aquatic animals has been slower. Only a small percentage of farmed aquatic species have been subject to genetic improvement programmes. However, biotechnology has great potential to increase fish production, mainly because of the availability of large numbers of gametes, external fertilization, and the ease of hormone treatments during development to induce sterility or functional sex reversal. Some of the important reproductive biotechnological tools used in farm animals are:
Arcial inseminaon: Using this technology new breeds of animals are produced through the
introducon of the male sperm from one superior male to the female reproducve tract without
mang. The advantage of AI includes reduced transmission of venereal disease, lessens the need of
farms to maintain breeding males, facilitates more accurate recording of pedigrees, and minimizes the
cost of introducing improved genecs. However, success of AI depends on accurate heat detecon,
proper frozen semen handling and mely inseminaon by a trained inseminator.
Sex Determination of Sperm: Sexing of sperm helps to pre-determine the sex of the progeny. The technique works on the principle of flow cytometric separation of fluorescently labelled X-chromosome-bearing spermatozoa from Y-chromosome-bearing spermatozoa. The accuracy of this technique is high; however, the laser light used reduces the viability of the sexed sperm and the throughput is low. New-generation flow cytometers with high sorting rates have opened avenues for increasing sorted sperm output with minimal or no damage to sperm. Sex chromosome-specific proteins (SCSPs) identified on the surface of sperm are also currently used for sperm sexing; this approach is less invasive and less damaging to sperm.
Sperm Encapsulation: This involves encapsulation of sperm for longer preservation in vivo and to allow progressive release of viable spermatozoa over several days, in various domestic species and in humans. The technique also prevents cryocapacitation and is reported to increase the conception rate. It has been developed in cattle and swine, but it still needs more sophisticated encapsulation instruments and further standardization before it can be used under field conditions in other livestock species.
Ovum Pick Up (OPU): This is a non-invasive and repeatable technique used for recovering large numbers of competent oocytes from antral follicles of live animals. Embryo production from ovum pick-up oocytes is affected by age, season and follicle stimulating hormone (FSH) stimulation. It is also evident that repeated OPU can be performed without side effects in both cattle and buffaloes, with minimal stress to the animal. In India, the first buffalo calf (Saubhagya) was produced through this technique by Prasad et al. (2013), and subsequently the first bovine calf (Holi) was produced at ICAR-National Dairy Research Institute. The advantages of OPU are that oocytes are collected with little invasiveness and that superior animals can be used as oocyte donors in embryo transfer. The limitations of the technique are the low oocyte yield per ovary and the need for sophisticated instruments.
In Vitro Maturation, Fertilization and Culture (IVMFC): This involves oocyte collection from slaughterhouse ovaries or from live animals, followed by maturation and fertilization in vitro for the production of viable embryos. Since the birth of the first rabbit conceived through in vitro fertilization (IVF) in 1959, IVF has been practised in several animals. Various methods for in vitro maturation, IVF and in vitro culture have been standardized in animals. In addition, IVMFC has provided an excellent source of embryos for embryo transfer, cloning, transgenesis and other advanced in vitro techniques. It has also allowed the analysis of the developmental potential of embryos, patterns of gene expression, epigenetic modifications and cytogenetic disorders in various domestic species, and has been used as a model for human embryogenesis studies. The low success rate and the costs make the technique less feasible for application in livestock under field conditions.
Intracytoplasmic Sperm Injection (ICSI): ICSI is a micromanipulation technique used for treating male infertility. It involves mechanical insertion of a selected sperm into the cytoplasm of an oocyte to produce the desired embryo. Since the first report of ICSI success, it has been performed in species such as rabbits, mice, sheep, humans, horses, cattle, pigs and buffaloes. The technique is also used as a sperm-vector system for animal transgenesis.
Mulple Ovulaon andEmbryo Transfer: In this technique selected genecally superior (elite) females
are induced to superovulate hormonally and inseminated with high quality semen of a superior male
at an appropriate me relave to ovulaon depending on the species. Week-old embryos are ushed
out of the donor’s uterus, isolated, examined microscopically for number and quality, and inserted
into the lining of the uterus of surrogate mothers non-surgically. ET increases reproducve rate of
selected females, reduces disease transfer, and facilitates the development of rare and economically
important genec stocks. The main liming factor for the ET is that this technique involves costly
hormones, labour intensive protocols and experse in addion to the poor super ovulatory response
and pregnancy outcomes.
Somac cell Nuclear Transfer or Cloning: Somac cell nuclear transfer (SCNT) is a major technique
for delivering nuclease-mediated genec alteraons in livestock. In this technique, the nucleus of a
somac cell is transferred into a female egg cell or oocyte in which the nucleus has been removed to
generate a new individual, genecally idencal to the somac cell donor. The advantage of SCNT is
that the gene-edited cell line can be genotyped and/or screened before transfer into the enucleated
oocyte to ensure that the desired edits, and no o-target edits, have occurred. A number of gene-
edited animals have been produced through SCNT cloning technique. This technique was used to
generate Dolly from a dierenated adult mammary epithelial cell. Further research is needed to
improve the eciency of the cloning. SCNT is a procedure of cloning within the same species whereas
ICAR – Central Instute of Brackishwater Aquaculture, Chennai
78
interspecies cloning (interspecies Somac Cell Nuclear Transfer -iSCNT) are also feasible. The cloned
animals have already been produced between closely related species. Eg- domesc cale (Bos taurus)
and wild ox (Bos grunniens). Cloning procedure using embryonic stem cells (ESCs) referred as Nuclear
Transfer-derived Embryonic Stem Cell (NTESC) is sll unsuccessful. Despite the achievements made
through SCNT-eding method, certain drawbacks associated with cloning such as early embryonic
losses, postnatal death, and birth defects cannot be ignored
Cryopreservation: Cryopreservation is a process in which cells, whole tissues, or any other substances susceptible to damage caused by chemical reactivity or time are preserved by cooling to sub-zero temperatures. It is a complex multistage process incorporating cryoprotectants or antifreeze agents. The ability to cryopreserve germplasm indefinitely allows genetic diversity to be preserved. Unlike semen, cryopreservation of embryos preserves complete genotypes. Freezing of embryos is an established commercial practice, especially in cattle. In contrast to embryos, oocytes are extremely sensitive to chilling and are difficult to cryopreserve without losing their viability. However, research is in progress on the vernalisation of oocytes, in which storage at very low temperature without freezing could preserve the oocytes for several months. Cryopreservation is advantageous because it reduces the risk and expense of transporting valuable animals, reduces disease transmission and supports conservation of the germplasm of endangered species.
Embryo Sexing: Embryo sexing is a technique in reproductive biotechnology with practical applications. Sex determination is performed using Y-chromosome-specific DNA probes coupled with polymerase chain reaction (PCR) amplification of a specific Y-chromosome region. Other methods include detection of the embryonic H-Y antigen, loop-mediated isothermal amplification, and duplex PCR-based assays. These show more than 95% accuracy but involve high cost, time and expertise.
Transgenesis: Transgenic animals have a foreign gene deliberately inserted into their genome by micro-injection of DNA into the pronuclei of a fertilised egg, which is subsequently implanted into the oviduct of a surrogate mother. Transgenesis has great potential in molecular breeding of farm animals, such as the development of animals with high fecundity, higher fertility and disease resistance. Transgenic technologies in fishes can improve growth rates and market size, feed conversion ratios, resistance to disease, sterility and tolerance of extreme environmental conditions. In the shrimp aquaculture sector, transgenic shrimp have been reported (Mialhe et al., 1995), but there has been no successful development to date for commercial culture. The cost of making transgenic farm animals is high and the efficiency is low.
Stem Cells: Stem cells are unspecialized cells that renew themselves for long periods through cell division and later become specialized on receiving specific signals. Based on their source, stem cells are classified into three types, viz. embryonic, adult and fetal stem cells. Embryonic stem (ES) cells are derived from embryos at the blastocyst stage (32-cell stage) and can give rise to cells of all three embryonic germ layers. ES cells are reported to be advantageous as they do not form tumours when transferred into the body, which potentiates their use in transplantation. Adult stem cells, on the other hand, are undifferentiated cells found throughout the body that are needed to replenish and regenerate cells in any damaged tissue. Spermatogonial stem cells are the only adult stem cells responsible for transferring genes to the next generation through fertilization of the ovum. Some of the potential applications of this technology are surrogate production of spermatozoa, reduced time for progeny testing, production of transgenic animals and conservation of endangered species.
Gene eding: Gene eding is a powerful tool to manipulate genome, bearingapplicaons in
animal breeding programs. Gene eding allows specic deleons, addions, or allele alteraon at
unambiguous locaons in a genome. The development of designer nucleases (zinc nger nucleases
[ZFNs], transcripon acvator-like eector nuclease [TALENs], and clustered regularly interspaced
short palindromic repeats [CRISPR/Cas9]) has enabled extremely ecient and more facile genome
eding in dierent animal species. These tools could be employed to enhance producvity, disease
resistance, breeding eciency, and for generaon of novel animal models. Such alteraons, if made
in zygotes or germ line cells, can be permanent and heritable. Recently, genome eding in many
livestock species has been reported such as myostan (MSTN) gene eding for “double muscling
in pigs, cale, and sheep, polled gene introducon in dairy cale, and edits to confer resistance to
porcine reproducve and respiratory syndrome virus and African swine fever virus in pigs.
Endocrine regulation of reproduction in fish
Biotechnology can be applied to enhance the reproductive performance of cultured aquatic species that exhibit reproductive dysfunction in captivity. In the past, fish gonadotropins, a group of hormones that stimulate reproduction, were produced in small amounts by extraction and purification from crude preparations of thousands of pituitary glands. At present, large quantities of highly purified gonadotropin can be produced in the laboratory through recombinant DNA technology. Synthetic Gonadotropin Releasing Hormone (GnRH), the key regulator of the reproductive cascade in all vertebrates, triggers the secretion of the fish’s own gonadotropin. The GnRH analogue (GnRHa) is synthesized chemically and does not carry the risk of transmitting diseases to the broodstock. However, injection of GnRHa does not always result in 100% ovulation, and multiple injections are often necessary to induce ovulation. Development of controlled-release delivery systems for synthetic GnRHa has contributed to captive breeding of many commercially important fish species. Hormone implants based on cholesterol or ethylene-vinyl, and biodegradable microspheres, have been efficient in inducing maturation and spawning in many cultured fish.
Sex control: Control of fish sex is useful where one sex displays advantageous characteristics, such as larger adult size, production of high-value caviar (sturgeon), faster growth rate, or higher age at first sexual maturation. Monosex populations of the more advantageous sex can be produced by direct sex control via steroid treatment (masculinisation by administration of androgens; feminisation by administration of estrogens); by genetic control combined with steroid treatment of broodstock (indirect hormonal treatment, gynogenesis, androgenesis); or by control of external factors (temperature, density etc.). In tilapia, males are preferred for culture as they grow faster than females. The YY male technology is a genetic breeding programme in which normal males are hormonally feminized (giving XY females) and then mated with normal XY males; the YY males among the progeny sire all-male (XY) offspring when mated with normal XX females.
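The crossing logic behind the YY male scheme can be illustrated with a short Python sketch. It assumes a simple single-locus XX/XY sex-determination model in which every gamete is equally likely and equally viable, which is a simplification; the function name `cross` is illustrative and does not come from any library.

```python
from collections import Counter
from itertools import product


def cross(mother, father):
    """Return offspring genotype proportions for a single-locus XX/XY cross.

    Each parent is a two-letter string such as "XY". Every gamete is assumed
    to be equally likely and equally viable (a simplifying assumption).
    """
    counts = Counter(
        "".join(sorted(egg + sperm)) for egg, sperm in product(mother, father)
    )
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}


# Step 1: a hormonally feminised XY "female" mated with a normal XY male.
step1 = cross("XY", "XY")
print(step1)

# Step 2: a YY male from step 1 mated with a normal XX female.
step2 = cross("XX", "YY")
print(step2)
```

Under these assumptions a quarter of the first-cross progeny are YY, and the second cross yields only XY offspring, which is why YY males are valued as broodstock for all-male production.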
Sterility: Inducing sterility in fish helps to increase growth by reducing the energy spent on reproduction. Sterility can be achieved by ploidy manipulation to produce sterile triploids, or through transgenic approaches using gene “knock-out” or gene “knock-down”.
ICAR – Central Instute of Brackishwater Aquaculture, Chennai
80
Conclusion
Reproductive biotechnology has revolutionized animal breeding and genetic progress in the livestock industry. The application of biotechnology in aquaculture, including the use of synthetic hormones in induced breeding and the production of monosex populations, surrogate broodstock and transgenic fish, has played a major role in ensuring the continued expansion and intensification of aquaculture to meet the growing demand for fish. The emerging techniques should be judiciously implemented for manipulation and improvement of the reproductive performance of livestock species.
Source of information
K. K. Choudhary, K. M. Kavya, A. Jerome, R. K. Sharma (2016). Advances in reproductive biotechnologies. Vet World 9(4): 388–395.
Role of Biotechnology in Assisted Reproduction, Science, 14 May 2014.
W. S. Lakra and S. Ayyappan (2002). Recent Advances in Biotechnology Applications to Aquaculture. International Symposium on Recent Advances in Animal Nutrition, 22nd September, New Delhi, pp. 455-461.
19. Use of molecular techniques in growth enhancement
Raymond J Angel
Introduction
Molecular techniques in aquaculture have the potential to alleviate predicted fish shortages and price increases by enhancing production efficiency, minimizing costs and reducing disease. Growth enhancement of fish using molecular techniques is beneficial to aquaculture and can be more effective than traditional breeding techniques for developing new fish strains. In principle, the technology can be used to improve the growth rate of fish; control sexual maturation, sterility and sex differentiation; improve survival by increasing resistance to pathogens; adapt fish to extreme environments, for example through cold resistance; and alter the biochemical characteristics of the flesh to enhance its nutritional qualities. Since fish can be readily improved by application of molecular techniques, it is timely to consider what genetically modified (GM) fish are likely to offer in the future, in terms of both benefits and disadvantages (Maclean and Norman, 2003). Genetic engineering is an important tool for developing and improving traits of fish for aquaculture. Growth Hormone (GH) genes have been used extensively in recent years for the construction of transgenic fishes with enhanced growth, and species showing high growth rates are widely used as sources of the GH gene for the production of transgenic fish.
An overview of various target species used in growth enhancement using molecular techniques
Transgenic fish have been produced in numerous species. These include model species that are mostly non-commercial, such as the Loach, Misgurnus anguillicaudatus (Maclean et al. 1987a), Medaka, Oryzias latipes (Ozato et al. 1986), Topminnows and Zebrafish, although Gong et al. (2002) have developed transgenic Rainbow zebrafish for the ornamental fish industry. Several experiments have evaluated transgenic farmed fish species including Goldfish (Zhu et al. 1985), Common carp, Silver carp, Mud loach, Rainbow trout (Chourrout, 1986), Atlantic salmon, Coho salmon, Chinook salmon, Channel catfish (Dunham et al. 1987) and Nile tilapia (Brem et al. 1988). Additionally, gene transfer has been accomplished in a game fish, Northern pike (Gross et al. 1992).
Many species of fish have been used in studies for standardizing GH-based transgenesis. Although many studies reported enhanced growth in the target species, some were unsuccessful for reasons that remain unknown. Some of these studies are listed for reference in Table 1.
Techniques for growth enhancement
There are many ways to enhance growth, including inbreeding, gynogenesis, androgenesis, selection, intraspecific crossbreeding, interspecific hybridization, polyploidy, sex reversal and breeding, nuclear transplantation and transgenesis. Cloned populations have been produced via gynogenesis and androgenesis (Dunham, 2004), but direct cloning of an individual fish of interest has not yet been accomplished. Gene transfer technology has had a great impact on modern biology and biotechnology (Powers et al. 1998). The fish species in focus for gene transfer experiments can be divided into two main groups: animals used in aquaculture (Fletcher and Davies, 1991; Hew et al. 1995; Chen and Lu, 1998) and model fish used in basic research (Chen and Lu, 1998). The major food fish species include Carp (Cyprinus sp.), Tilapia (Oreochromis sp.), Salmon (Salmo sp., Oncorhynchus sp.) and Channel catfish (Ictalurus punctatus), while Zebrafish (Danio rerio), Medaka (Oryzias latipes) and Goldfish (Carassius auratus) are used in basic research.
Genetic engineering of farm animals offers great potential for improvement of selected genetic traits of agricultural significance. Several species of fish have also been used to exploit this technology for commercial purposes. Examples include attempted induction of freeze resistance in transgenic salmon using an Anti-Freeze Protein gene (Fletcher et al., 1988) and production of growth-enhanced fish using novel Growth Hormone (GH) genes (Dunham et al., 1987; Brem et al., 1988; Penman et al., 1990) or an Insulin-like Growth Factor (IGF) gene (Chen et al., 1995). Although several species of fish have been used to produce lines of transgenic fish, in only a few cases have germline transmission and stable long-term transgene expression been satisfactorily demonstrated.
Techniques for gene transfer
Microinjection
Microinjection is the most successful and widely used technique for gene transfer in fish. Gene transfer research with fish began in the mid-1980s using microinjection (Zhu et al. 1985; Dunham et al. 1987). Zhu et al. (1985) published the first report of transgenes microinjected into the fertilized eggs of goldfish. In almost all fish gene transfer research, the foreign gene has been microinjected into the cytoplasm of one- to four-cell embryos (Hayat, 1989), as pronuclei are extremely difficult to visualize in live one-cell fish embryos.
To ensure integration of the DNA, it should be injected into intact cells close to the cut site. The injection apparatus consists of a dissecting stereomicroscope and two micro-manipulators, one with a glass micro-needle for delivering the transgene and the other with a micropipette for holding the fish embryo in place (Fig. 1). The success of microinjection depends on the nature of the egg chorion. A soft chorion facilitates microinjection, while a thick chorion limits the ability to visualize the target for injection of DNA. In many fishes (e.g. Atlantic salmon and rainbow trout) the egg chorion becomes tough and hard just after fertilization or on contact with water, which makes injecting the DNA difficult.
Steps of Microinjection Technique
(1) Eggs and sperm are stored separately under optimum conditions.
(2) Water and sperm are added to the eggs to initiate fertilization.
(3) Ten minutes after fertilization, the eggs are dechorionated by trypsinization.
(4) The fertilized eggs are microinjected with the desired DNA within a few hours of fertilization. In dechorionated eggs, the DNA is released into the centre of the germinal disc before the first cleavage. The time available for microinjection is the first 25 minutes, that is, between fertilization and first cleavage.
(5) After microinjection, the embryos are incubated in water until hatching takes place. Survival rates of microinjected fish embryos are about 30-80%, depending on the fish species.
Fig. 1. Microinjection technique.
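As a rough planning aid, the timing and survival figures quoted in the microinjection steps can be turned into a small Python calculation. This is only a sketch: the 25-minute window and the 30-80% survival range come from the text, while the integration rate and the batch size of 500 eggs are hypothetical placeholders chosen for illustration, and survival and integration are assumed to be independent.

```python
def within_injection_window(minutes_after_fertilization, window_minutes=25):
    """True if the egg is still inside the injection window quoted in the text."""
    return 0 <= minutes_after_fertilization <= window_minutes


def expected_transgenic_hatchlings(n_injected, survival_rate, integration_rate):
    """Expected number of surviving embryos that carry the transgene.

    Assumes survival and integration are independent, which is a simplification.
    """
    if not (0 <= survival_rate <= 1 and 0 <= integration_rate <= 1):
        raise ValueError("rates must be fractions between 0 and 1")
    return n_injected * survival_rate * integration_rate


# Illustrative placeholder only; the text gives no integration rate for microinjection.
HYPOTHETICAL_INTEGRATION_RATE = 0.10
BATCH_SIZE = 500  # hypothetical number of eggs injected

# Survival bounds (30% and 80%) are the range quoted in step 5.
low = expected_transgenic_hatchlings(BATCH_SIZE, 0.30, HYPOTHETICAL_INTEGRATION_RATE)
high = expected_transgenic_hatchlings(BATCH_SIZE, 0.80, HYPOTHETICAL_INTEGRATION_RATE)
print(f"Expected transgenic hatchlings from {BATCH_SIZE} eggs: {low:.0f} to {high:.0f}")
print("Egg at 10 min still injectable:", within_injection_window(10))
```

Replacing the placeholder rate with a value measured for the species and construct in hand gives a first estimate of how many eggs need to be injected to obtain a target number of founders.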
Other methods
Microinjection is a tedious and slow procedure (Powers et al. 1992) and can result in high egg mortality (Dunham et al. 1987). After the initial development of microinjection, new techniques such as electroporation, retroviral integration, liposomal reverse-phase evaporation, sperm-mediated transfer and high-velocity micro-projectile bombardment were developed (Chen and Powers, 1990); these can sometimes produce large numbers of transgenic individuals more efficiently and in a shorter time. The first successful gene transfer using electroporation produced integration rates and survival similar to those for microinjection (Inoue et al. 1990). Powers et al. (1992) demonstrated that electroporation can be more efficient than microinjection, with integration rates sometimes as high as 30-100%. Walker (1993) found that hatching rates were higher for electroporated than for microinjected channel catfish embryos, and that post-fertilization electroporation gave higher hatching rates than electroporation of sperm and then eggs prior to fertilization.
Environmental Concerns about Transgenic Fish and risk mitigation
The primary environmental concerns about releases of transgenic fish include competition with wild populations, movement of the transgene into the wild gene pool, and ecological disruptions due to changes in prey and other niche requirements in the transgenic variety versus the wild populations.
Developers of transgenic fish are attempting to reduce or eliminate both gene flow and invasive species risks by sterilizing transgenic fish. Sterilization is relatively easy and inexpensive, but success rates are highly variable. In addition, sterilization does not necessarily neutralize environmental risks. Academic scientists note that an escaped, sterile fish might still engage in courtship and spawning behaviour, disrupting breeding in wild populations. Waves of escaped sterile fish could also create ecological disruptions as each group is replaced by another equally strong group of transgenic sterile fish.
ICAR – Central Instute of Brackishwater Aquaculture, Chennai
84
Conclusion and future prospects
Transgenic fish technology has great potential in the aquaculture industry. By introducing desirable genetic traits into fishes, mollusks and crustaceans, superior transgenic strains can be produced for aquaculture. These traits include faster growth rates, improved food conversion efficiency, resistance to some known diseases, tolerance to low oxygen concentrations, and tolerance to extreme temperatures. Our laboratory and those of others have shown that transfer, expression and inheritance of fish growth hormone transgenes can be achieved in several fish species and that the resulting transgenics grow substantially faster than their non-transgenic siblings. This is a vivid example of the potential application of gene transfer technology to aquaculture.
However, to realize the full potential of transgenic fish technology in aquaculture or other biotechnological applications, several important scientific breakthroughs are required. These include:
(1) more efficient technologies for mass gene transfer,
(2) targeted gene transfer technologies such as embryonic stem cell gene transfer or ribozyme gene inactivation,
(3) suitable promoters to direct the expression of transgenes at optimal levels during the desired developmental stages,
(4) identified genes of desirable traits for aquaculture and other applications,
(5) information on the physiological, nutritional, immunological and environmental factors that maximize the performance of the transgenics, and
(6) assessment of the safety and environmental impacts of transgenic fish.
Once these problems are resolved, the commercial application of transgenic fish technology will be readily attained.
Table 1. Studies showing enhancement of growth achieved in different target organisms worldwide, with citations
FAMILY AND SPECIES | CONSTRUCT | GROWTH | COUNTRY | SUPPORTING CITATION
Salmonidae
Atlantic salmon, Salmo salar | opAFP-csGH | 2–6-fold | Canada | Du et al. (1992) and Fletcher et al. (2004)
Coho salmon, Oncorhynchus kisutch | ssMT-ssGH | Up to 11-fold | Canada | Devlin et al. (1994a,b)
Coho salmon, Oncorhynchus kisutch | opAFP-csGH | 3–10-fold | Canada | Devlin et al. (1995a)
Chinook salmon, O. tshawytscha | opAFP-csGH | 6-fold | Canada | Devlin et al. (1995a)
Rainbow trout, O. mykiss | opAFP-csGH | 3.2-fold | Canada | Devlin et al. (1995a)
Cutthroat trout, O. clarki | opAFP-csGH | 6-fold | Canada | Devlin et al. (1995a)
Arctic charr, Salvelinus alpinus | Various constructs | Up to 14-fold | Finland | Pitkanen et al. (1999)
Rainbow trout, O. mykiss | ssGH-ssGH | None | Finland | Pitkanen et al. (1999)
Cichlidae
Nile tilapia, Oreochromis niloticus | opAFP-csGH | 2–4-fold | UK | Rahman et al. (1998; 2001) and Rahman and Maclean (1999)
Nile tilapia, Oreochromis niloticus | ssMT-ssGH | None | UK | Rahman et al. (1998; 2001) and Rahman and Maclean (1999)
Tilapia, O. hornorum hybrid | hCMV-GH | 82% | Cuba | Martinez et al. (1996)
Ictaluridae
Channel catfish, Ictalurus punctatus | RSVLTR-rtGH | Up to 26% | USA | Dunham et al. (1992)
Channel catfish, Ictalurus punctatus | mMT-hGH | None | USA | Dunham et al. (1987)
Heteropneustidae
Heteropneustes fossilis | Zpb-ypGH | 30–60% | India | Sheela et al. (1999)
Cyprinidae
Goldfish, Carassius auratus | mMT-hGH | None | PR China | Zhu et al. (1985)
Common carp, Cyprinus carpio | mMT-hGH | None | PR China | Zhu et al. (1989)
Common carp, Cyprinus carpio | cbA-gcGH | 42–80% | PR China | Zhu (1992) and Wang et al. (2001)
Catla, Catla catla | RSVLTR-rtGH | None | India | Sarangi et al. (1999)
Common carp, Cyprinus carpio | ccbA-ccGH | 4-fold | Israel | Hinits and Moav (1999)
Rohu, Labeo rohita | CMV-roGH | 4-fold | India | Venugopal et al. (2004)
Rohu, Labeo rohita | gcbA-roGH | 4.5–5.8-fold | India | Venugopal et al. (2004)
Esocidae
Northern pike | RSVLTR-bGH | 30% | USA | Gross et al. (1992)
Cobitidae
Mud loach, Misgurnus mizolepis | mlb-actin-mlGH | Up to 35-fold | Republic of Korea | Nam et al. (2001; 2002)
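Table 1 reports growth either as a fold change or as a percentage, which makes rows hard to compare directly. The Python sketch below converts both forms to a single fold value, taking the upper figure of any range. It assumes that a percentage denotes a gain over the non-transgenic control (so 82% is read as 1.82-fold); the table itself does not state this, and the function name `growth_to_fold` is hypothetical.

```python
import re


def growth_to_fold(entry):
    """Convert a Table 1 growth entry to an upper-bound fold change.

    Returns None for "None". Percentages are read as gains over the control,
    so 82% becomes 1.82-fold; that reading is an assumption, not a fact
    stated in the table.
    """
    text = entry.strip().lower().replace("–", "-")
    if text == "none":
        return None
    numbers = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        raise ValueError(f"unrecognised growth entry: {entry!r}")
    upper = max(numbers)
    if "%" in text:
        return 1 + upper / 100
    if "fold" in text:
        return upper
    raise ValueError(f"no unit found in growth entry: {entry!r}")


# A few rows copied from Table 1: (species, construct, growth as printed).
rows = [
    ("Atlantic salmon", "opAFP-csGH", "2–6-fold"),
    ("Tilapia hybrid", "hCMV-GH", "82%"),
    ("Channel catfish", "mMT-hGH", "None"),
    ("Mud loach", "mlb-actin-mlGH", "Up to 35-fold"),
]
for species, construct, growth in rows:
    print(species, construct, growth_to_fold(growth))
```

Taking the upper bound of a range overstates typical performance, so the lower bound or the midpoint may be a better choice when the aim is a conservative comparison.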
REFERENCES
Brem, G., Brenig, B., Horstgen-Schwark, G. and Winnacker, E. L., 1988. Gene transfer in lapia
(Oreochromis nilocus). Aquaculture., 68: 209-219.
Chen, T. T. and Lu, J. K., 1998. Transgenic sh technology: Basic principles and its applicaon in basic
and applied research. In: De la Fuente J. and Castro F.O. eds. Gene transfer in aquac organisms.
RG Landes Company and Germany: Springer-Verlag, Ausn, Texas, USA., pp. 45-73.
Chen, T. T. and Powers, D. A. (1990) Transgenic sh. Trends in Biotechnology., 8: 209-214.
Chen, T. T., Lu, J. K., Shamblo, M.J., Cheng, C. M., Lin, C. M., Burns, J. C., Reimschuessel, R., Chatakondi,
N. and Dunham, R. A., 1995.Transgenic sh: ideal models for basic research and biotechnological
applicaons. Zool. Studies.,344: pp. 215–234.
Chourrout, D., 1986. Techniques of chromosome manipulaon in rainbow trout: a new evaluaon
with karyology. Theorecal and Applied Genecs., 72: 627-632.
Devlin, R. H., Bya, J. C., McLean, E., Yesaki, T. Y., Krivi, G.G., Jaworski, E.G. and Donaldson, E.M.,
1994b. Bovine placental lactogen is a potent smulator of growth and displays strong binding to
hepac receptor sites of coho salmon. Gen. Comp. Endocrinol., 95: 31–41.
Devlin, R. H., Yesaki, T. Y., Biagi, C. A., Donaldson, E. M., Swanson, E. M. P. and Chan, W. K., 1994a.
Extraordinary salmon growth.Nature, 371, 209–210.
Devlin, R. H., Yesaki, T. Y., Donaldson, E. M., Du, S. J. and Hew, C. L., 1995a. Producon of germline
transgenic Pacic salmonids with dramacally increased growth performance. Can. J. Fish.Aquat.
Sci., 52: 1376–1384.
Du, S. J., Gong, Z. Y., Fletcher, G. L., Shears, M. A., King, M. J., Idler, D. R. and Hew, C. L., 1992. Growth
enhancement in transgenic Atlanc salmon by the use of an all-sh chimeric growth hormone
gene construct. BioTechnology., 10: 176–181.
Dunham, R. A., Ramboux, A. C., Duncan, P. L., Hayat, M., Chen, T. T., Lin, C. M., Kight, K., Gonzalez-
Villasenor, I. and Powers, D. A., 1992.Transfer, expression and inheritance of salmonid growth
hormone genes in channel caish, Ictalurus punctatus, and eects on performance traits.Mar.
Mol. Biol. Biotechnol., 1: 380–389.
Dunham, R. A. 2004.,Aquaculture and Fisheries Biotechnology Genec Approaches.CABI publishing,
Wallingford ,UK., 17: P. 400.
Hands on Training Aquaculture Genomics and Bioinformacs 87
Dunham, R. A., Eash, J., Askins, J. and Townes, T. M., 1987. Transfer of the metallothionein-human
growth hormone fusion gene into channel catfish. Transactions of the American Fisheries
Society, 116: 87-91.
Fletcher, G. L., Shears, M. A., King, M. J., Davies, P. L. and Hew, C. L., 1988. Evidence for antifreeze
protein gene transfer in Atlantic salmon (Salmo salar). Can. J. Fish. Aquat. Sci., 45: 352–357.
Fletcher, G. L., Shears, M. A., Yaskowiak, E. S., King, M. J. and Goddard, S. V., 2004. Gene transfer:
potential to enhance the genome of Atlantic salmon for aquaculture. Aust. J. Exp. Agric., 44:
1095–1100.
Fletcher, G. L. and Davies, P. L., 1991. Transgenic fish for aquaculture. Gen. Eng., 13: 331-369.
Gong, Z., Wan, H., Ju, B., He, J., Wang, X. and Yan, T., 2002. Generation of living color transgenic
zebrafish. In: Shimizu, N., Aoki, T., Hirono, I. and Takashima, F. (Eds.), Aquatic Genomics: Steps
Toward a Great Future. Springer-Verlag, New York, NY, pp. 329-339.
Gross, M. L., Schneider, J. F., Moav, N., Moav, B., Alvarez, C., Myster, S. H., Liu, Z., Hallerman, E.
M., Hackett, P. B., Guise, K. S., Faras, A. J. and Kapuscinski, A. R., 1992. Molecular analysis and
growth evaluation of northern pike (Esox lucius) microinjected with growth hormone genes.
Aquaculture, 103: 253-273.
Hayat, M., 1989. Transfer, expression and inheritance of growth hormone genes in channel catfish
(Ictalurus punctatus) and common carp (Cyprinus carpio). Doctoral Dissertation. Auburn
University, AL, USA.
Hew, C. L., Fletcher, G. L. and Davies, P. L., 1995. Transgenic salmon: tailoring the genome for food
production. Journal of Fish Biology, 47: 1-19.
Hinits, Y. and Moav, B., 1999. Growth performance studies in transgenic Cyprinus carpio. Aquaculture,
173: 285–296.
Inoue, K., Yamashita, S., Hata, J. I., Kabeno, S., Asada, S., Nagahisa, E. and Fujita, T., 1990.
Electroporation as a new technique for producing transgenic fish. Cell Differentiation and
Development, 29: 123-128.
Maclean, N. and Norman, 2003. Genetically modified fish and their effects on food quality and human
health and nutrition. Trends in Food Science & Technology, 14(5-8): 242-252.
Maclean, N., Penman, D. and Talwar, S., 1987a. Introduction of novel genes into fish. Bio/Technology,
5: 257-261.
Martinez, R., Estrada, M. P., Berlanga, J., Guillen, I., Hernandez, O., Cabrera, E., Pimentel, R., Morales,
R., Herrera, F., Morales, A., Pina, J. C., Abad, Z., Sanchez, V., Melamed, P., Lleonart, R. and de
la Fuente, J., 1996. Growth enhancement in transgenic tilapia by ectopic expression of tilapia
growth hormone. Mol. Mar. Biol. Biotechnol., 5: 62–70.
Nam, Y. K., Cho, Y. S., Cho, H. and Kim, D. S., 2002. Accelerated growth performance and stable
germ-line transmission in androgenetically derived homozygous transgenic mud loach, Misgurnus
mizolepis. Aquaculture, 209: 257–270.
Nam, Y. K., Noh, J. K., Cho, Y. S., Cho, H. J., Cho, K. N., Kim, C. G. and Kim, D. S., 2001. Dramatically
accelerated growth and extraordinary gigantism of transgenic mud loach Misgurnus mizolepis.
Transgenic Res., 10: 353–362.
Ozato, K., Kondoh, H., Inohara, H., Iwamatsu, T., Wakamatsu, Y. and Okada, T. S., 1986. Production of
transgenic fish: introduction and expression of chicken delta-crystallin gene in medaka embryos.
Cell Differ. Dev., 19: 237-244.
Penman, D. J., Beeching, A. J., Penn, S. and Maclean, N., 1990. Factors affecting survival and integration
following microinjection of novel DNA into rainbow trout eggs. Aquaculture, 85: 35-50.
Pitkanen, T. I., Krasnov, A., Teerijoki, H. and Molsa, H., 1999. Transfer of growth hormone (GH) genes
into Arctic charr (Salvelinus alpinus L.). I. Growth response to various GH constructs. Genet. Anal.:
Biomol. Eng., 15: 91–98.
Powers, D. A., Cole, T., Creech, K., Chen, T. T., Lin, C. M., Kight, K. and Dunham, R., 1992. Electroporation:
a method for transferring genes into the gametes of zebrafish, Brachydanio rerio, channel catfish,
Ictalurus punctatus, and common carp, Cyprinus carpio. Mol. Mar. Biol. Biotechnol., 1: 301-309.
Powers, D. A., Gómez-Chiarri, M., Chen, T. T. and Dunham, R., 1998. Genetic engineering of finfish and
shellfish. In: De la Fuente, J. and Castro, F. O. (Eds.), Gene Transfer in Aquatic Organisms. RG Landes
Company, Austin, Texas, USA and Springer-Verlag, Germany, pp. 17-34.
Rahman, M. A. and Maclean, N., 1999. Growth performance of transgenic tilapia containing an
exogenous piscine growth hormone gene. Aquaculture, 173: 333–346.
Rahman, M. A., Mak, R., Ayad, H., Smith, A. and Maclean, N., 1998. Expression of a novel piscine
growth hormone gene results in growth enhancement in transgenic tilapia (Oreochromis
niloticus). Transgenic Res., 7: 357–369.
Rahman, M. A., Ronyai, A., Engidaw, B. Z., Jauncey, K., Hwang, G. L., Smith, A., Roderick, E., Penman,
D., Varadi, L. and Maclean, N., 2001. Growth and nutritional trials on transgenic Nile tilapia
containing an exogenous fish growth hormone gene. J. Fish Biol., 59: 62–78.
Sarangi, N., Mandall, A. B., Bandyopadhyay, A. K., Venugopal, T., Mathavan, S. and Pandian, T. J., 1999.
Electroporated sperm-mediated gene transfer in Indian major carps. Asia-Pacific J. Mol. Biol.
Biotechnol., 7: 151–158.
Sheela, S. G., Pandian, T. J. and Mathavan, S., 1999. Electroporatic transfer, stable integration, and
transmission of pZp beta ypGH and pZp beta rtGH in Indian catfish, Heteropneustes fossilis
(Bloch). Aquac. Res., 30: 233–248.
Venugopal, T., Anathy, V., Kirankumar, S. and Pandian, T. J., 2004. Growth enhancement and food
conversion efficiency of transgenic fish, Labeo rohita. J. Exp. Zool., 301A: 477–490.
Walker, D. S., 1993. Effect of electroporation and microinjection on survival of ictalurid catfish embryos.
Master of Science Thesis. Auburn University, AL.
Wang, Y., Hu, W., Wu, G., Sun, Y., Chen, S., Zhang, F., Zhu, Z., Feng, J. and Zhang, X., 2001. Genetic
analysis of ‘‘all-fish’’ growth hormone gene transferred carp (Cyprinus carpio L.) and its F1
generation. Chin. Sci. Bull., 46: 1174–1177.
Zhu, Z., 1992. Generation of fast-growing transgenic fish: methods and mechanisms. In: Hew, C. L.,
Fletcher, G. L. (Eds.), Transgenic Fish. World Publishing, Singapore, pp. 92–119.
Zhu, Z., Li, G., He, L. and Chen, S., 1985. Novel gene transfer into the fertilized eggs of goldfish (Carassius
auratus L. 1758). Journal of Applied Ichthyology, 1: 31-33.
Zhu, Z., Xu, K., Xie, Y., Li, G. and He, L., 1989. A model of transgenic fish. Sci. Sin., B 2: 147–155.
20. Gene Editing Tools and their Application in Aquaculture
Misha Soman
Introduction
Genome editing is a kind of genetic engineering in which a gene of interest is inserted, altered or
deleted in the genome of an organism or cell using engineered nucleases, often called “molecular
scissors”. These nucleases create site-specific double-strand breaks (DSBs) at desired locations in the
genome. The induced double-strand breaks are repaired through non-homologous end joining (NHEJ) or
homologous recombination (HR), resulting in targeted mutations (‘edits’). By editing the genome, the
characteristics of a cell or an organism can be changed.
Genome editing uses an ‘engineered nuclease’ that cuts the DNA at its targeted site. Engineered
nucleases have two parts: a nuclease part and a DNA-targeting part, which is designed to guide the
nuclease to cut a specific DNA sequence. When a cut forms at a particular place in the DNA, the cell
starts to repair the cut naturally. Gene editing technologies have wide applications in different fish
species for basic as well as applied research in disease modeling and aquaculture.
Genome editing can be used:
For research: It can be used to alter the DNA in organisms to study the impact of gene modification.
To treat disease: Genome editing is being used in medical research to study the viability of the
technology for treating deadly human diseases such as leukemia and other cancers, and AIDS
(Ophinni et al., 2018; Tebas et al., 2014).
For biotechnology: Genome editing has been used in agriculture to produce genetically modified crops
with improved yield and disease resistance, as well as to make genetically modified pigs (Wang et al.,
2015; Kang et al., 2017), sheep (Crispo et al., 2015) and fishes (Khalil et al., 2017).
TYPES OF GENE EDITING
➢ Zinc finger nucleases (ZFNs)
➢ TALENs
➢ CRISPR-Cas9
ZINC FINGER NUCLEASES (ZFNs)
Zinc finger nucleases (ZFNs) are engineered restriction nucleases produced by joining a zinc finger
DNA-binding domain to a DNA-cleavage domain (FokI). They promote targeted editing of the genome
by generating double-strand breaks in DNA at targeted locations. A ZFN is a site-specific
endonuclease designed to bind and cleave DNA at particular locations. It is composed of three to six
zinc finger motifs, and each motif specifically recognizes three nucleotides in a DNA sequence. Hence,
each ZFN can recognize a target of 9 to 18 base pairs. Cleavage of the target DNA requires
dimerization of two ZFNs, because the FokI domain cuts only as a dimer; this results in a double-strand
break (DSB) at the target locus (Durai et al., 2005). Double-strand breaks are important for
site-specific mutagenesis because they stimulate the cell’s natural DNA-repair processes,
homology-directed repair (HDR) and non-homologous end joining (NHEJ); these reagents can
therefore be used to modify genomes precisely.
Fig 1: When the zinc finger module (DNA-binding domain) and the FokI catalytic module (DNA-cleavage
domain) are fused together, a highly specific pair of ‘genomic scissors’ is formed.
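The arithmetic behind the 9 to 18 base-pair target can be sketched in a few lines of Python. This is only an illustration of the three-nucleotides-per-finger rule stated above; the function name and the constant are hypothetical and not part of any library.

```python
# Illustrative sketch only: names are hypothetical, not from any library.
NT_PER_FINGER = 3  # each zinc finger motif recognizes three nucleotides


def zfn_site_length(n_fingers):
    """Length in bp recognized by one ZFN with n_fingers zinc finger motifs."""
    if not 3 <= n_fingers <= 6:
        raise ValueError("the text describes ZFNs with three to six fingers")
    return n_fingers * NT_PER_FINGER


for n in range(3, 7):
    single = zfn_site_length(n)
    # Two ZFNs must dimerize at the target, so a pair reads twice as much DNA.
    print(n, "fingers:", single, "bp per ZFN,", 2 * single, "bp per pair")
```

Because cleavage needs two ZFNs, the combined recognition site of a pair is twice the single-ZFN length, which is part of why the approach can be made specific.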
TALENs
Transcription activator-like effector nuclease (TALEN) technology uses engineered restriction
enzymes generated by fusing a TAL effector DNA-binding domain to a DNA-cleavage domain (FokI).
Such enzymes can be designed to cut precisely at almost any desired DNA sequence. When they are
introduced into cells, they make double-strand breaks in the gene of interest.
The nucleases consist of programmable, sequence-specific DNA-binding modules coupled with a
non-specific DNA-cleavage domain. This allows accurate and efficient genetic alterations, because the
targeted DNA double-strand breaks stimulate cellular DNA repair, either by error-prone NHEJ or by
HDR.
The DNA-binding domain contains a repeated, highly conserved 33–34 amino acid sequence
with divergent 12th and 13th amino acids. These two positions, referred to as the Repeat Variable
Diresidue (RVD), are highly variable and show a strong correlation with specific nucleotide recognition.
Different RVDs allow each module to specifically recognize one individual nucleotide, instead of three
nucleotides as in ZFNs (Moscou and Bogdanove, 2009). The dimerized FokI cleaves the DNA
non-specifically within the spacer sequence between the left and right TALEN target sites.
Fig 2: TALENs mechanism. Labels: TALE module (DNA-binding domain); FokI catalytic module
(DNA-cleavage domain).
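The one-repeat-per-nucleotide logic of the RVD can be illustrated with a small lookup table. The four RVD-to-base pairings used below (NI for A, HD for C, NG for T, NN for G) are the commonly cited associations and are an assumption added here, not something stated in this chapter; NN is also known to tolerate A. The function names are hypothetical.

```python
# Commonly cited RVD-to-nucleotide associations (assumption added for
# illustration; NN can also bind A, which this sketch ignores).
RVD_CODE = {"NI": "A", "HD": "C", "NG": "T", "NN": "G"}


def rvds_for_target(target):
    """Return one RVD per nucleotide of the target, read 5' to 3'."""
    base_to_rvd = {base: rvd for rvd, base in RVD_CODE.items()}
    return [base_to_rvd[base] for base in target.upper()]


def target_for_rvds(rvds):
    """Return the nucleotide sequence a series of RVD repeats would read."""
    return "".join(RVD_CODE[rvd] for rvd in rvds)


print(rvds_for_target("TACG"))
print(target_for_rvds(["NG", "NI", "HD", "NN"]))
```

A real TALEN design would use a pair of such arrays, one for each target site flanking the spacer where FokI cuts.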
CRISPR-Cas9
The clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated
protein 9 (Cas9) system has emerged as a faster, cheaper and more precise gene editing tool in a wide
range of organisms. It is derived from an adaptive immunity mechanism in prokaryotes that eliminates
invading genetic material: the foreign genetic material is cut into fragments and integrated into the
CRISPR locus as a series of short spacers (about 20 bp) between the repeats. The loci are transcribed
and processed into small RNAs, called guide RNAs, which guide nucleases to cleave the target DNA
based on sequence complementarity. This technology enables geneticists and medical researchers to
edit the genome by adding, removing or altering DNA sequence. The CRISPR-Cas9 system consists of
two key components that introduce a mutation into the targeted DNA: the enzyme Cas9 and a piece of
RNA called the guide RNA. Cas9 acts as a pair of molecular scissors that cuts double-stranded DNA at a
specific targeted site, so that bits of sequence can be added or removed. The guide RNA (gRNA)
contains a pre-designed sequence about 20 bases long, located within a longer RNA scaffold. The
scaffold part binds to Cas9, and the pre-designed guide sequence base-pairs with the target DNA and
so ‘guides’ the Cas9 nuclease to the targeted region of the genome, ensuring that the enzyme cuts at
the right point in the genome.
Fig 3: CRISPR-Cas9 mechanism. Labels: Cas9; sgRNA; target sequence; PAM sequence.
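Choosing a guide sequence amounts to finding a stretch of about 20 bases that sits next to a PAM. The sketch below scans one strand of a DNA string for such sites. It assumes the NGG PAM of the widely used Streptococcus pyogenes Cas9, which is not stated in this chapter, and it ignores the reverse strand and any off-target scoring; the function name and the test sequence are made up for illustration.

```python
import re


def find_cas9_targets(seq, guide_len=20):
    """List (position, protospacer, PAM) for sites followed by an NGG PAM.

    Assumes an NGG PAM (SpCas9) and scans the given strand only.
    """
    seq = seq.upper()
    # The lookahead lets overlapping candidate sites all be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Toy sequence, invented for this example.
toy = "ACGTACGTACGTACGTACGTTGGA"
for pos, protospacer, pam in find_cas9_targets(toy):
    print(pos, protospacer, pam)
```

In practice, candidate guides found this way would still need to be checked against the whole genome for similar sequences before use.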
DNA double-strand break (DSB) repair mechanisms
Most DSBs are repaired by either the non-homologous end joining (NHEJ) pathway or the
homology-directed repair (HDR) pathway. The NHEJ repair pathway causes nucleotide insertions or
deletions (indels) at the cleavage site. In most cases these indels are small. When their length is not a
multiple of three, they cause frameshift mutations that lead to premature stop codons inside the open
reading frame (ORF) of the targeted gene. This disrupts the gene and results in the loss of function of
the targeted gene.
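The effect of an indel on the reading frame can be made concrete with a toy example. The sketch below builds the standard genetic code, translates a short invented ORF, and compares a 1-bp deletion (frameshift) with a 3-bp deletion (in-frame). The sequence, positions and function names are hypothetical and chosen only so that the frameshift exposes a premature stop codon.

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(orf):
    """Translate from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def delete_bases(seq, start, length):
    """Return seq with `length` bases removed beginning at index `start`."""
    return seq[:start] + seq[start + length:]


wild_type = "ATGGCATTGACCGGAAAATAA"  # toy ORF, invented for this example
print(translate(wild_type))                      # MALTGK
print(translate(delete_bases(wild_type, 6, 1)))  # MA (frameshift, early stop)
print(translate(delete_bases(wild_type, 6, 3)))  # MATGK (one residue lost)
```

The 1-bp deletion truncates the protein after two residues, whereas the 3-bp deletion only removes one amino acid, which is why frameshifting indels are the ones that usually knock a gene out.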
Homology-directed repair (HDR) is a process of homologous recombination in which a DNA
template is used for precise repair of a double-strand break (DSB). The template can come from
the cell itself (the sister chromatid, available during the late S and G2 phases of the cell cycle, before
the completion of mitosis), or it can be an exogenous repair template delivered into the cell, mostly in
the form of a synthetic single-stranded DNA donor oligo or a DNA donor plasmid, to generate a specific
change in the genome.
Advantages of the CRISPR-Cas9 system over ZFNs and TALENs
➢ Highly efficient mutagenesis
➢ Effective introduction of targeted indels at the required genomic location
➢ Targeting efficiency can exceed 80%
➢ In the CRISPR-Cas9 system only one customized sgRNA is required to target a specific sequence;
the same Cas9 can be used for all targeted sequences.
➢ ZFNs and TALENs require the design and assembly of two nucleases for each target site.
➢ sgRNAs are short (<100 nt), which reduces complications
Applications of gene editing tools in fishes
Fish species, especially model species such as the zebrafish, have played important roles
in testing new protocols of genome editing because of the biological advantages of fish models.
A large number of genes have been disrupted or modified in fish species for functional studies,
especially those involved in reproduction. These gene editing technologies can be utilized to modify
the genomes of a variety of industrially relevant organisms and standard research animals, including
zebrafish, rats, pigs and catfish. Genome editing tools can be used to investigate cis-regulatory
mechanisms and to generate gene knockdowns or knockouts, revealing unexplored processes of animal
development and gene function for use in basic and applied sciences. Genome editing can be utilized
to study early embryogenesis, to induce mutations, to produce knockout lines and to unravel ancestral
features of chordate development. It can be used for the systematic functional analysis of reproductive
performance in fishes, disease resistance, tolerance to environmental stressors, sex determination and
sex differentiation; for the functional analysis of genes with non-reproductive functions such as
pigmentation, growth and development; and for disease modeling and drug screening.
CRISPR is one of the most useful and powerful tools for gene manipulation in fish, even though the
occurrence of off-target mutations is a serious concern. It has been reported that the frequency of
off-target mutations can be reduced by lowering the concentration of gRNA in the injection. Genome
editing tools have been applied in zebrafish mainly to induce mutations that give valuable insights for
medical science.
Disruption of the myostatin (MSTN) gene (a muscle suppressor gene) by CRISPR/Cas9 was successfully
carried out in channel catfish, Ictalurus punctatus, and resulted in 88–100% rates of mutagenesis in the
protein-coding sites of myostatin. The MSTN-edited fry had more muscle cells, and their mean body
weight increased by 29.7%. The alignment of the mutated sequences against the wild type showed
multiple insertions and deletions (Khalil et al., 2017). In India, the Central Institute of Freshwater
Aquaculture successfully disrupted the Toll-like receptor 22 (TLR22) gene of Labeo rohita (rohu), which
is involved in innate immunity and present solely in teleost fishes and amphibians, using CRISPR/Cas9
technology; the mutants lacked TLR22 mRNA expression (Chakrapani et al., 2016). These results confirm
that CRISPR/Cas9 is a highly efficient tool for editing the fish genome and opens ways for promoting
fish genetic enhancement and functional genomics.
Conclusion
Gene editing tools are widely used to study the impact of gene manipulation in humans, animals,
plants and fish for various purposes. With high-efficiency gene editing in fishes, we are
entering a new era of powerful technologies to study multiple gene functions and
to improve traits. These technologies will give insights into gene functions
and the evolution of vertebrates, and they also offer the possibility of treating deadly human diseases
in medical research and of creating improved varieties in agriculture, livestock and aquaculture. In the
aquaculture industry, this approach may pave the way for growth-enhanced fishes to increase productivity.
References
www.Yourgenome.org
www.Genetherapynet.com
Khalil, K., Elayat, M., Khalifa, E., Daghash, S., Elaswad, A., Miller, M., Abdelrahman, H., Ye, Z., Odin,
R., Drescher, D., Vo, K., Gosh, K., Bugg, W., Robinson, D. and Dunham, R., 2017. Generation
of Myostatin Gene-Edited Channel Catfish (Ictalurus punctatus) via Zygote Injection of CRISPR/
Cas9 System. Scientific Reports, 7: 7301.
Zhu, B. and Ge, W., 2018. Genome editing in fishes and their applications. General and Comparative
Endocrinology, 257: 3-12.
Chakrapani, V., Patra, S. K., Panda, R. P., Rasal, K. D., Jayasankar, P. and Barman, H. K., 2016. Establishing
targeted carp TLR22 gene disruption via homologous recombination using CRISPR/Cas9.
Developmental and Comparative Immunology, 61: 242-247.
Wang, K., Ouyang, H., Xie, Z., Yao, C., Guo, N., Li, M., Jiao, H. and Pang, D., 2015. Efficient Generation
of Myostatin Mutations in Pigs Using the CRISPR/Cas9 System. Scientific Reports, 5: 16623.
Crispo, M., Mulet, A. P., Tesson, L., Barrera, N., Cuadro, F., dos Santos-Neto, P. C., Nguyen, T. H.,
Crénéguy, A., Brusselle, L., Anegón, I. and Menchaca, A., 2015. Efficient Generation of Myostatin
Knock-Out Sheep Using CRISPR/Cas9 Technology and Microinjection into Zygotes. PLoS ONE,
10(8): e0136690.
Ophinni, Y., Inoue, M., Kotaki, T. and Kameoka, M., 2018. CRISPR/Cas9 system targeting regulatory genes
of HIV-1 inhibits viral replication in infected T-cell cultures. Scientific Reports, 8: 7784.
Tebas, P., Stein, D., Tang, W. W., Frank, I., Wang, S. Q., Lee, G., Spratt, S. K., Surosky, R. T.,
Giedlin, M. A., Nichol, G., Holmes, M. C., Gregory, P. D., et al., 2014. Gene Editing of CCR5 in Autologous
CD4 T Cells of Persons Infected with HIV. The New England Journal of Medicine, 370: 901-910.
GLOSSARY
➢ Read – Base pair information of a given length from a DNA or cDNA fragment contained in a
sequencing library. Different sequencing platforms are capable of generating different read
lengths.
➢ Single End Read – The sequence of the DNA is obtained from the 5’ end of only one strand of
the insert. These reads are typically expressed as 1x “y”, where “y” is the length of the read
in base pairs (e.g. 1x50bp, 1x75bp).
➢ Paired End Read – The sequence of the DNA is obtained from the 5’ ends of both strands of
the insert. These reads are typically expressed as 2x “y”, where “y” is the length of the read
in base pairs (e.g. 2x100bp, 2x150bp).
➢ Mate Pair Read – The sequence of the DNA is obtained as for paired-end reads; however,
the DNA insert is often much larger (2-10 kb in length) and the paired
reads originate from a single strand of the DNA insert.
➢ Depth of Coverage – The number of reads that span a given DNA sequence of interest. This
is commonly expressed as “Yx”, where “Y” is the number of reads and “x” is the unit
reflecting the depth of coverage metric (e.g. 5x, 10x, 20x, 100x).
➢ Sequencing Depth – The amount of sequencing a given sample requires to achieve a certain
depth of coverage. This is frequently expressed as the number of reads a sample requires (e.g.
40 million reads, 80 million reads) or the number of bases of sequencing a sample requires
(e.g. 4 gigabases, 100 megabases).
➢ SNP/SNV – A Single Nucleotide Polymorphism or Single Nucleotide Variant
detected in a sample.
➢ InDels – One or more Insertion or Deletion events detected in a sample.
➢ Annotation - Adding biological information to a genome sequence. This is a very complex task,
and the process for doing this is rapidly evolving. Features that are added to the genome
often include gene models, SNPs, and STSs.
➢ Copy Number Variation (CNV) - Large-scale structural changes in DNA that vary from individual
to individual. These include insertions, deletions, duplications and complex multi-site
variants that range from kilobases to megabases in size. CNVs can influence gene expression
and phenotypic variation and alter gene dosage, and in certain instances may be associated with
developmental disorders, cause disease or confer susceptibility to complex disease traits.
➢ EST (Expressed Sequence Tag) - Single-pass sequences of cDNA clones. Databases of
EST sequences are highly redundant but quite useful for gene identification. There are many
efforts to cluster EST sequences to remove the redundancy and low-quality sequences.
➢ Haplotype (haploid genotype) - A set of closely linked genetic markers present on one
chromosome that tend to be inherited together. A haplotype may also refer to a set of single
nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated with
one another.
➢ Reference sequence/genome - A fully assembled version of a genome that can be used for
mapping short DNA sequence reads for comparisons of genomes from various individuals.
➢ Contig - A contig (from contiguous) is a set of overlapping DNA segments that together
represent a consensus region of DNA. In sequencing projects, a contig refers to overlapping
sequence data (reads).
➢ Scaffold - A scaffold is a portion of the genome sequence reconstructed from end-sequenced
whole-genome shotgun clones. Scaffolds are composed of contigs and gaps.
➢ Specificity - The percentage of sequences that map to the intended targets out of total bases
per run.
➢ Homopolymer - Uninterrupted stretch of a single nucleotide type (e.g., TTT or GGGGGG).
➢ Base Call - Base calling is the process of assigning bases (nucleobases) to chromatogram peaks.
➢ Homology
 ❖ Ortholog - Orthologous sequences are homologous sequences in different species that
have a common origin. Orthologs diverge through gradual evolutionary
modification from the common ancestor. They typically perform the same function in
different species.
 ❖ Paralog - Paralogous sequences are homologous sequences that exist within a species.
They have a common origin but arise through gene duplication events. They often perform
different functions in the same species.
 ❖ BLAST E-values - The BLAST programs (Basic Local Alignment Search Tools) are a set of
sequence comparison algorithms introduced in 1990 that are used to search sequence
databases for optimal local alignments to a query.
 ❖ The E-value represents the number of alignments you would expect to find by chance
with a score at least as good as that of the alignment you are looking at. The E-value is calculated
with the formula E = (query length) * (length of database) * 2^-(S), where S is the bit score of
the alignment. As a rule of thumb, a biologically significant E-value would be 0.05 or less.
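The E-value formula above can be tried out directly. The sketch below is a plain restatement of E = (query length) * (length of database) * 2^-(S), with S taken as the bit score; the function name and the example numbers are hypothetical. Real BLAST programs apply corrections (effective query and database lengths), so their reported values will differ somewhat from this simple calculation.

```python
def blast_evalue(query_len, db_len, bit_score):
    """Expected number of chance hits: E = m * n * 2**(-S).

    query_len (m) and db_len (n) are in residues; bit_score (S) is in bits.
    Uses raw lengths, not the effective lengths a BLAST program would use.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical example: 100-residue query, 1,000,000-residue database.
print(blast_evalue(100, 10**6, 40))  # roughly 9.1e-05
# Each extra bit of score halves the E-value.
print(blast_evalue(100, 10**6, 41) / blast_evalue(100, 10**6, 40))
```

The formula also shows why the same alignment score becomes less significant as the database grows: E scales linearly with database length.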
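N50: The length of the shortest contig in the smallest set of largest contigs whose summed length is
equal to or greater than half the genome (assembly) size.
L50: The smallest number of contigs whose summed length reaches that half, i.e. the number of contigs counted to obtain N50.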
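A short calculation may help to keep N50 and L50 apart. The sketch below follows the common convention in which N50 is a contig length and L50 is a contig count, and it uses the total assembly length as the reference size (using an estimated genome size instead gives the related NG50 statistic). The function name and the contig lengths are invented for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Contigs are added from longest to shortest until at least half of the
    total assembly length is covered. N50 is the length of the last contig
    added, and L50 is how many contigs were needed.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Toy assembly: total 290, half 145; 80 + 70 = 150 reaches it.
print(n50_l50([20, 80, 40, 70, 30, 50]))  # (70, 2)
```

A higher N50 together with a lower L50 generally indicates a more contiguous assembly, although neither statistic says anything about correctness.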
BLAST type   Query and subject
blastn       query is DNA; subject is DNA
blastp       query is protein; subject is protein
blastx       query is nucleic acid that is translated by the program into protein sequences (all
             6 reading frames); subject database is protein
tblastn      query is protein; database is DNA translated into protein sequences in all 6
             reading frames
tblastx      query is DNA translated into protein; subject is nucleotide translated into protein.
             Both are translated in all 6 frames. It is very slow relative to the other BLAST
             types.
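The table above can be read as a simple lookup from the type of query and the type of subject database to a BLAST program. The following sketch encodes that lookup; the function and argument names are hypothetical and the code does not call BLAST itself.

```python
# Lookup built from the table above; keys are (query type, subject type).
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast(query_type, subject_type, translate_both=False):
    """Suggest a BLAST program for the given sequence types.

    translate_both=True selects tblastx, in which a nucleotide query and a
    nucleotide database are both translated in all six reading frames.
    """
    if translate_both:
        if (query_type, subject_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs nucleotide query and subject")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, subject_type)]


print(choose_blast("nucleotide", "protein"))  # blastx
print(choose_blast("nucleotide", "nucleotide", translate_both=True))  # tblastx
```

For example, an assembled transcript searched against a protein database would use blastx, whereas a known protein searched against an unannotated genome would use tblastn.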
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
The CRISPR/Cas9 system provides a novel and promising tool for editing the HIV-1 proviral genome. We designed RNA-guided CRISPR/Cas9 targeting the HIV-1 regulatory genes tat and rev with guide RNAs (gRNA) selected from each gene based on CRISPR specificity and sequence conservation across six major HIV-1 subtypes. Each gRNA was cloned into lentiCRISPRv2 before co-transfection to create a lentiviral vector and transduction into target cells. CRISPR/Cas9 transduction into 293 T and HeLa cells stably expressing Tat and Rev proteins successfully abolished the expression of each protein relative to that in non-transduced and gRNA-absent vector-transduced cells. Tat functional assays showed significantly reduced HIV-1 promoter-driven luciferase expression after tat-CRISPR transduction, while Rev functional assays revealed abolished gp120 expression after rev-CRISPR transduction. The target gene was mutated at the Cas9 cleavage site with high frequency and various indel mutations. Conversely, no mutations were detected at off-target sites and Cas9 expression had no effect on cell viability. CRISPR/Cas9 was further tested in persistently and latently HIV-1-infected T-cell lines, in which p24 levels were significantly suppressed even after cytokine reactivation, and multiplexing all six gRNAs further increased efficiency. Thus, the CRISPR/Cas9 system targeting HIV-1 regulatory genes may serve as a favorable means to achieve functional cures.
Article
Full-text available
The rainbow trout growth hormone (rtGH) gene under the transcription control of long terminal repeat of Rous sarcoma virus was successfully introduced into three Indian major carps (IMC) viz. rohu (Labeo rohita) catla (Catla catla) and mrigal (Cirrihinus mrigala), en route electroporated sperms. The present communication enumerates the results of an array of electroporation experiments aimed at standardising diverse variables like voltage, capacitance, resistance and pulse constant to achieve maximum transformation efficiency. Stable genomic integration of the intruded alien gene was demonstrated through slot blot hybridization in all the three species. Per cent transgenic individuals were largely varied in the different species, maximum of 25% being observed in rohu followed by 23 and 13% in mrigal and catla, respectively. This indicates that electroporated sperm mediated gene transfer may emerge as a convenient means in fish transgenic development specially in IMC. Optimisation of electroporation condition for rtGH transformation to produce transgenic IMC are discussed hereunder in detail.
Article
Full-text available
The myostatin (MSTN) gene is important because of its role in regulation of skeletal muscle growth in all vertebrates. In this study, CRISPR/Cas9 was utilized to successfully target the channel catfish, Ictalurus punctatus, muscle suppressor gene MSTN. CRISPR/Cas9 induced high rates (88–100%) of mutagenesis in the target protein-encoding sites of MSTN. MSTN-edited fry had more muscle cells (p < 0.001) than controls, and the mean body weight of gene-edited fry increased by 29.7%. The nucleic acid alignment of the mutated sequences against the wild-type sequence revealed multiple insertions and deletions. These results demonstrate that CRISPR/Cas9 is a highly efficient tool for editing the channel catfish genome, and opens ways for facilitating channel catfish genetic enhancement and functional genomics. This approach may produce growth-enhanced channel catfish and increase productivity.
Article
Full-text available
Background Due to the great advantages in selection accuracy and efficiency, genomic selection (GS) has been widely studied in livestock, crop and aquatic animals. Our previous study based on one full-sib family of Litopenaeus vannamei (L. vannamei) showed that GS was feasible in penaeid shrimp. However, the applicability of GS might be influenced by many factors including heritability, marker density and population structure etc. Therefore it is necessary to evaluate the major factors affecting the prediction ability of GS in shrimp. The aim of this study was to evaluate the factors influencing the GS accuracy for growth traits in L. vannamei. Genotype and phenotype data of 200 individuals from 13 full-sib families were used for this analysis. ResultsIn the present study, the heritability of growth traits in L. vannamei was estimated firstly based on the full set of markers (23 K). It was 0.321 for body weight and 0.452 for body length. The estimated heritability increased rapidly with the increase of the marker density from 0.05 K to 3.2 K, and then it tended to be stable for both traits. For genomic prediction on the growth traits in L. vannamei, three statistic models (RR-BLUP, BayesA and Bayesian LASSO) showed similar performance for the prediction accuracy of genomic estimated breeding value (GEBV). The prediction accuracy was improved with the increasing of marker density. However, the marker density would bring a weak effect on the prediction accuracy after the marker number reached 3.2 K. In addition, the genetic relationship between reference and validation population could influence the GS accuracy significantly. A distant genetic relationship between reference and validation population resulted in a poor performance of genomic prediction for growth traits in L. vannamei. 
Conclusions For the growth traits with moderate or high heritability, such as body weight and body length, the number of about 3.2 K SNPs distributed evenly along the genome was able to satisfy the need for accurate GS prediction in the investigated L.vannamei population. The genetic relationship between the reference population and the validation population showed significant effects on the accuracy for genomic prediction. Therefore it is very important to optimize the design of the reference population when applying GS to shrimp breeding.
Article
Full-text available
Background Tilapias are the second most farmed fishes in the world and a sustainable source of food. Like many other fish, tilapias are sexually dimorphic and sex is a commercially important trait in these fish. In this study, we developed a significantly improved assembly of the tilapia genome using the latest genome sequencing methods and show how it improves the characterization of two sex determination regions in two tilapia species. ResultsA homozygous clonal XX female Nile tilapia (Oreochromis niloticus) was sequenced to 44X coverage using Pacific Biosciences (PacBio) SMRT sequencing. Dozens of candidate de novo assemblies were generated and an optimal assembly (contig NG50 of 3.3Mbp) was selected using principal component analysis of likelihood scores calculated from several paired-end sequencing libraries. Comparison of the new assembly to the previous O. niloticus genome assembly reveals that recently duplicated portions of the genome are now well represented. The overall number of genes in the new assembly increased by 27.3%, including a 67% increase in pseudogenes. The new tilapia genome assembly correctly represents two recent vasa gene duplication events that have been verified with BAC sequencing. At total of 146Mbp of additional transposable element sequence are now assembled, a large proportion of which are recent insertions. Large centromeric satellite repeats are assembled and annotated in cichlid fish for the first time. Finally, the new assembly identifies the long-range structure of both a ~9Mbp XY sex determination region on LG1 in O. niloticus, and a ~50Mbp WZ sex determination region on LG3 in the related species O. aureus. Conclusions This study highlights the use of long read sequencing to correctly assemble recent duplications and to characterize repeat-filled regions of the genome. 
The study serves as an example of the need for high quality genome assemblies and provides a framework for identifying sex determining genes in tilapia and related fish species.
Advancing the production efficiency and profitability of aquaculture is dependent upon the ability to utilize a diverse array of genetic resources. The ultimate goals of aquaculture genomics, genetics and breeding research are to enhance aquaculture production efficiency, sustainability, product quality, and profitability in support of the commercial sector and for the benefit of consumers. In order to achieve these goals, it is important to understand the genomic structure and organization of aquaculture species, and their genomic and phenomic variations, as well as the genetic basis of traits and their interrelationships. In addition, it is also important to understand the mechanisms of regulation and evolutionary conservation at the levels of genome, transcriptome, proteome, epigenome, and systems biology. With genomic information and information between the genomes and phenomes, technologies for marker/causal mutation-assisted selection, genome selection, and genome editing can be developed for applications in aquaculture. A set of genomic tools and resources must be made available including reference genome sequences and their annotations (including coding and non-coding regulatory elements), genome-wide polymorphic markers, efficient genotyping platforms, high-density and high-resolution linkage maps, and transcriptome resources including non-coding transcripts. Genomic and genetic control of important performance and production traits, such as disease resistance, feed conversion efficiency, growth rate, processing yield, behaviour, reproductive characteristics, and tolerance to environmental stressors like low dissolved oxygen, high or low water temperature and salinity, must be understood. QTL need to be identified, validated across strains, lines and populations, and their mechanisms of control understood. Causal gene(s) need to be identified. 
Genetic and epigenetic regulation of important aquaculture traits needs to be determined, and technologies for marker-assisted selection, causal gene/mutation-assisted selection, genome selection, and genome editing using CRISPR and other technologies must be developed, demonstrated, and applied in aquaculture industries. Major progress has been made in aquaculture genomics for dozens of fish and shellfish species including the development of genetic linkage maps, physical maps, microarrays, single nucleotide polymorphism (SNP) arrays, transcriptome databases and various stages of genome reference sequences. This paper provides a general review of the current status, challenges and future research needs of aquaculture genomics, genetics, and breeding, with a focus on major aquaculture species in the United States: catfish, rainbow trout, Atlantic salmon, tilapia, striped bass, oysters, and shrimp. While the overall research priorities and the practical goals are similar across various aquaculture species, the current status in each species should dictate the next priority areas within the species. This paper is an output of the USDA Workshop for Aquaculture Genomics, Genetics, and Breeding held in late March 2016 in Auburn, Alabama, with participants from all parts of the United States.
Single nucleotide polymorphisms (SNPs) are capable of providing the highest level of genome coverage for genomic and genetic analysis because of their abundance and relatively even distribution in the genome. Such a capacity, however, cannot be achieved without an efficient genotyping platform such as SNP arrays. In this work, we developed a high-density SNP array with 690,662 unique SNPs (herein 690 K array) that were relatively evenly distributed across the entire genome, and covered 98.6% of the reference genome sequence. Here we also report linkage mapping using the 690 K array, which allowed mapping of over 250,000 SNPs on the linkage map, the highest marker density among all the constructed linkage maps. These markers were mapped to 29 linkage groups (LGs) with 30,591 unique marker positions. This linkage map anchored 1,602 scaffolds of the reference genome sequence to LGs, accounting for over 97% of the total genome assembly. A total of 1,007 previously unmapped scaffolds were placed on LGs, allowing validation and, in a few instances, correction of the reference genome sequence assembly. This linkage map should serve as a valuable resource for various genetic and genomic analyses, especially for GWAS and QTL mapping for genes associated with economically important traits.
Litopenaeus vannamei is a typical euryhaline decapod model to study the osmoregulation mechanism in crustaceans. The proteomic analysis was undertaken using isobaric tags for relative and absolute quantification together with reverse-phase high-performance liquid chromatography mass spectrometry to quantitatively identify the proteins differentially expressed in the hepatopancreas under low salinity stress (3 psu) compared with the control salinity (25 psu). A total of 533 proteins and 84 differentially expressed proteins were identified, including 58 proteins meeting the 1.2-fold cut-off value under chronically low salinity stress. Among these proteins, 26 were up-regulated while 32 were down-regulated. Of the 58 differentially expressed proteins, 48 were annotated in the Uniprot database and were mapped into 38 pathways by KEGG analysis. These proteins were categorized into the pathways for energy metabolism, signaling, immunization and detoxification, lipid and protein metabolism. A more active glycometabolism, positive response detoxification pathway, immunosuppression and positive osmoregulation were identified in L. vannamei under low salinity stress. This study suggests that under chronically low salinity stress, L. vannamei showed low immunity and high demand for energy, especially from glycometabolism. Signal transduction related pathways, especially the Wnt signaling pathways, were involved in the process of salinity adaptation, but the in-depth mechanism warrants further investigation. Significance: In this study, a comprehensive physiological response was studied using proteomics to reveal the underlying mechanism of adaptation to low salinity in L. vannamei, which was the first report on the proteomic response of a crustacean to salinity stress. The extensive proteomic investigation on hepatopancreas under low salinity stress provides a new insight into the adaptive mechanism of this euryhaline crustacean species to low salinity.
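The 1.2-fold cut-off mentioned above can be illustrated with a short sketch. It assumes, as is common practice but not stated in the abstract, that a protein counts as up-regulated when its treatment/control abundance ratio is at least 1.2 and as down-regulated when the ratio is at most 1/1.2. The function name `classify_fold_change` and the example ratios are hypothetical, and the statistical significance filtering that normally accompanies such a cut-off is left out.

```python
def classify_fold_change(ratio, cutoff=1.2):
    """Classify a treatment/control abundance ratio.

    Assumption: down-regulation uses the reciprocal of the cut-off,
    so a 1.2-fold cut-off means ratio >= 1.2 or ratio <= 1/1.2.
    """
    if ratio <= 0:
        raise ValueError("abundance ratio must be positive")
    if ratio >= cutoff:
        return "up"
    if ratio <= 1 / cutoff:
        return "down"
    return "unchanged"


# Hypothetical ratios (low salinity / control), for illustration only.
example_ratios = {"protein_A": 1.5, "protein_B": 0.5, "protein_C": 1.05}

for name, ratio in example_ratios.items():
    print(name, classify_fold_change(ratio))
```

A symmetric reciprocal threshold treats a 1.2-fold increase and a 1.2-fold decrease as equally large changes, which is why the lower bound is 1/1.2 rather than 0.8.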