Hands on Training AQUACULTURE GENOMICS AND BIOINFORMATICS GENETICS AND BIOTECHNOLOGY UNIT Prepared by ICAR -CENTRAL INSTITUTE OF BRACKISHWATER AQUACULTURE 75, SANTHOME HIGH ROAD, RA PURAM MRC NAGAR, CHENNAI -600 028

February 2022

Authors:

Vinaya Kumar Katneni

ICAR - Central Institute of Brackishwater Aquaculture

Misha Soman

Indian Council of Agricultural Research

CIBA – TM Series – 2018 – No. 12

Hands on Training

AQUACULTURE GENOMICS AND

BIOINFORMATICS

Organized By

GENETICS AND BIOTECHNOLOGY UNIT

Prepared by

K. VINAYA KUMAR J. ASHOK KUMAR

MISHA SOMAN RAYMOND J ANGEL B. SIVAMANI

P. MAHALAKSHMI SHERLY TOMY

M. S. SHEKHAR

G. GOPIKRISHNA

ICAR – CENTRAL INSTITUTE OF BRACKISHWATER AQUACULTURE

75, SANTHOME HIGH ROAD, RA PURAM

MRC NAGAR, CHENNAI - 600 028

Published by

Dr. K. K. Vijayan

Director, ICAR-CIBA

Hands on Training Aquaculture Genomics and Bioinformatics

TABLE OF CONTENTS

Sr. No.  Chapter Title

1   Introduction to Linux Environment
2   Introduction to programming in R
3   Python for Bioinformatics
4   Understanding the Illumina datasets
5   Checking quality of Illumina paired-end sequence data
6   Quality control of RNAseq datasets – NGS QC Toolkit
7   Quality control of RNAseq datasets – Trimmomatic
8   Assembling bacterial genomes
9   RNAseq data analysis in Trinity
10  Phylogenomic analysis using MrBayes
11  Microsatellites genotypes generation by Fragment analysis method
12  Genepop : Population Genetics analysis
13  Population genetic analysis of microsatellite data in Arlequin
14  Soft Computing techniques in Bioinformatics
15  RNAseq data analysis – Genome-guided
16  Application of "OMICS" research in aquaculture with special reference to penaeids
17  Shrimp Genomics : Current status and Challenges
18  Application of Biotechnology in animal reproduction
19  Use of molecular techniques in growth enhancement
20  Gene Editing Tools and their application in Aquaculture
    Glossary

1. Introduction to Linux Environment

J. Ashok Kumar and K. Vinaya Kumar

Linux, an open-source operating system (OS) modelled on Unix, has become the OS of choice worldwide for servers as well as desktops in academic circles. Its different variants include Red Hat, Ubuntu, Fedora, CentOS, Knoppix etc. Much of the bioinformatics software, including many individual programs, is native to Linux, so it is important for a bioinformatician to have exposure to Linux commands. Here we give a list of the most commonly used Linux commands and the procedure to execute Perl/Python programs. As advanced programming is beyond the scope of this training, we provide only the basic constructs of Perl/Python programs that can be used for writing scripts for simple bioinformatics tasks.

Linux commands

Accessing Linux environment: You can access a Linux server from your system using any Windows-based SSH client. This can be achieved by installing WinSCP or PuTTY (both are free software) on your system. Once installed, open WinSCP, fill in the host name, user name and password fields with the details provided by the system administrator, and click on the Login button, which will prompt for the password. After a successful login, select PuTTY from the menu bar; a console window pops up and you will see a dollar prompt, where you can submit commands for all the operations you wish to perform on the Linux server.

Figure 1. WinSCP login window

Figure 2. Selecting PuTTY from WinSCP

ICAR – Central Institute of Brackishwater Aquaculture, Chennai

Figure 3. Linux console

The dollar prompt ($) shown in Fig. 3 is for ordinary users; the hash (#) prompt is displayed for administrators. Only users who have administrative privileges on the server can work at the hash (#) prompt.

File system in Linux: All the folders and files of a Linux system are under the root (/) directory. Users have access to their home directories, for which the path is /home/user_name

Once you log in to the Linux system you are taken to your home directory by default. For example, if the user name is “david”, the current directory after login is /home/david. Users can enter their commands after the dollar ($) prompt. Some of the most commonly used Linux commands are given in the table below.

Function                                          Command

Listing the file names                            $ ls
Listing file names along with other details       $ ls -l
Change to a pre-existing directory named 'test'   $ cd test
Make a new directory named 'trial'                $ mkdir trial
Viewing a pre-existing file                       $ vi mydata.txt
                                                  $ nano mydata.txt
                                                  $ more mydata.txt
                                                  $ cat mydata.txt
Creating a new file                               $ touch myfile.txt
                                                  $ vi myfile.txt
                                                  $ nano myfile.txt
Renaming or moving a file                         $ mv file1.txt file2.txt
                                                  $ mv /home/ram/file1.txt /home/ram/test/
Making a duplicate of a file                      $ cp file1.txt file2.txt
                                                  $ cat file1.txt > file2.txt
Joining two text files into a third               $ cat file1.txt file2.txt > file3.txt
To display the date                               $ date
To find the number of lines in a file             $ wc -l xyz.txt
To display the first (top) 100 lines of a file    $ head -n 100 xyz.txt
To display the last (bottom) 100 lines of a file  $ tail -n 100 xyz.txt
Search for a pattern in a file                    $ grep "pattern" file.txt
Search for a pattern at the beginning of a line   $ grep '^pattern' file.txt
Search for a pattern at the end of a line         $ grep 'pattern$' file.txt
Search for lines containing only the pattern      $ grep '^pattern$' file.txt
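The four grep searches at the end of the table differ only in the anchors ^ (start of line) and $ (end of line). The same anchors work in Python's re module, so a rough equivalent can be sketched as follows. The list of example lines and the helper name grep are made up for illustration.

```python
import re

# Hypothetical example lines standing in for the contents of file.txt
lines = ["pattern", "pattern at start", "ends with pattern", "no match here"]


def grep(regex, text_lines):
    """Return the lines containing a match for regex, like `grep regex file`."""
    compiled = re.compile(regex)
    return [line for line in text_lines if compiled.search(line)]


anywhere = grep(r"pattern", lines)      # pattern anywhere in the line
at_start = grep(r"^pattern", lines)     # pattern at the beginning of the line
at_end = grep(r"pattern$", lines)       # pattern at the end of the line
whole_line = grep(r"^pattern$", lines)  # line consists of the pattern only

print(anywhere, at_start, at_end, whole_line, sep="\n")
```

Each result list shrinks as the pattern is anchored more tightly, which is a quick way to check a regular expression before running it on a large file.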

Running Perl/Python programs

Perl program files have the extension “.pl”. A program can be executed in either of the following ways (the first form requires the file to have execute permission and to begin with a #! line naming the Perl interpreter).

$ ./test_programme.pl

$ perl test_programme.pl

Options of the program may be checked from the help files of the software/programs.

In the same way, Python program files have the “.py” extension and can be executed by giving the following command.

$ python test_programmes.py

Standalone BLAST

NCBI BLAST is used for comparing nucleotide and protein sequences with sequence databases to find significant matches. Alignment of sequences using BLAST can be done either with the web tool available on the NCBI site or by installing BLAST on local servers.

BLAST can be installed on local servers along with the databases available in the public domain. In addition, users can make their own databases on local servers. If you have your own protein dataset, a local database can be created by

$ makeblastdb -in xyz.fasta -dbtype prot -out xyzdb

Now you can run BLAST against your own database

$ blastp -db xyzdb -query abc.fasta -out out.fasta

More general BLAST command (blastn needs a nucleotide database, i.e. one built with -dbtype nucl)

$ blastn -query nucl.fasta -db xyzdb -outfmt 6 -evalue 1e-05 -out output.txt

To fetch the hit sequences in FASTA format, make a file with the IDs of the hits and run the following command

$ fastacmd -d database_name -i blast_output > hits.fasta

fastacmd belongs to the legacy BLAST package. In BLAST+ (the package that provides makeblastdb and blastp) the equivalent tool is blastdbcmd, used as blastdbcmd -db database_name -entry_batch id_file > hits.fasta, with a database built using the -parse_seqids option.
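The file of hit IDs needed for the fetching step can be prepared from the tabular output produced with -outfmt 6. The following is a minimal sketch, assuming the default twelve-column layout in which column 2 is the subject (hit) ID and column 11 is the e-value. The function name hit_ids and the example rows are invented for illustration.

```python
def hit_ids(tabular_lines, max_evalue=1e-5):
    """Collect unique subject IDs (column 2) from BLAST tabular (-outfmt 6) lines."""
    seen = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        evalue = float(fields[10])
        if evalue <= max_evalue and subject_id not in seen:
            seen.append(subject_id)
    return seen


# Invented rows in the default -outfmt 6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
example = [
    "q1\tsubjA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t360",
    "q1\tsubjB\t80.0\t150\t30\t0\t1\t150\t9\t158\t2e-20\t120",
    "q2\tsubjA\t97.0\t180\t5\t0\t1\t180\t1\t180\t3e-90\t320",
    "q2\tsubjC\t70.0\t50\t15\t0\t1\t50\t1\t50\t0.5\t30",
]
ids = hit_ids(example)
print("\n".join(ids))
```

Writing the printed IDs to a text file, one per line, gives the ID list expected by the sequence-fetching command.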

2. Introduction to programming in R

J. Ashok Kumar, K. Vinaya Kumar and B. Sivamani

R is a programming environment for data analysis and graphics. The language was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland. Since its birth, a number of people have contributed to the package. It is open-source statistical software which can be downloaded free of cost. The base package and all the contributed packages can be downloaded from http://www.r-project.org/

R is available for all major operating systems, including Windows, Linux and Mac OS. This training material is based on R stats installed on the Windows operating system.

Invoking R stats

Start → All Programs → R → R i386 3.2.0 (for 32 bit installation)

Start → All Programs → R → R x64 3.2.0 (for 64 bit installation)

R stats graphical user interface in Windows

Procedure to install additional packages

We need to add additional libraries to the base installation to utilize the full potential of R. This can be achieved with the following command. R is case sensitive, so function names must be typed exactly as shown.

install.packages('name of the package')

Once the above command is executed, R asks the user to select a CRAN mirror out of several listed mirrors. The user can select a mirror at any location.

There is a package/library called 'Rcmdr' which can be used for carrying out the most commonly used statistical procedures through a graphical user interface. The command to install 'Rcmdr' is

install.packages('Rcmdr')

Command to invoke Rcmdr

library('Rcmdr')

R studio

R studio is an integrated development environment (IDE) for R. This IDE features an R notebook for writing scripts, a console for command input, a graphics viewer, a package window and an environment window, all in a single framework.

R file input and output

First set the working directory.

Command to know the location of the present working directory

> getwd()

Command to set the working directory to any other folder

> setwd("E:/data/")

Basic command to read files

> read.table()

and command to create data files

> write.table()

Importing data

Data in different file formats, i.e. text files, Excel files, SPSS data files, SAS data files etc., can be input into R stats for data analysis. It is advised that Excel files first be converted to comma-separated files for easy input into R stats.

Command to read a comma-separated text file with variable names in the first row

> data <- read.table('filename', header=TRUE, sep=",")

Here filename is the name of the text file with its extension, the header argument specifies whether variable names are included in the first row of the data file, and the 'sep' parameter gives the separator present between variables (columns) in the file, such as comma, space or tab.

If the specified text file is not in the present working directory and you wish to select it through a graphical interface, use the following command

> data <- read.table(file.choose(), header=TRUE, sep=",")

Upon entering the above command a file selector window pops up, and one can select a file located in any drive/directory/folder other than the present working directory.

Popup window for selecting files

For other text files, such as space-separated and tab-separated files, one needs to change only the 'sep' parameter of the above command to " " or "\t" respectively.

In the previous command 'data' is a dataframe which contains all the variable names and data.

Data in the dataframe can be edited and the changed contents assigned to another dataframe

> data1 <- edit(data)

Upon entering the above command a popup window appears for editing the data, and all the edits are saved in a dataframe called 'data1'.

Data editor window

Exporting data

Data in the dataframe can be exported as a text file with the following command

> write.table(data, file="xyz.csv", col.names=TRUE, sep=",")

Creating data files manually within R stats

Data files can be created within R stats by giving simple commands.

Here we explain how to create an example table with variable names in R stats.

S.No  Bodyweight  Length  Species
1     25          15      aa
2     35          14      ab
3     65          27      ac
4     27          18      bb
5     45          22      cc

The above table can be created as a dataframe by giving the following commands. data.frame() is used rather than cbind(), because cbind() would return a character matrix in which the numbers are stored as text.

> bodyweight <- c(25,35,65,27,45)

> length <- c(15,14,27,18,22)

> species <- c("aa","ab","ac","bb","cc")

> lengthweight <- data.frame(bodyweight, length, species)

Descriptive statistics

Suppose we have a variable named 'x' and our task is to calculate descriptive statistics such as mean, median, standard deviation and variance for x in R stats. First create the variable x by giving the following command

> x <- c(20,15,19,22,26,24,23,17,18,22)

Another way of creating the variable 'x' is

> x <- scan()

1: 20 15 19 22 26 24 23 17 18 22

11:

Read 10 items

Basic commands for descriptive statistics

> mean(x) # mean

> median(x) # median

> var(x) # sample variance

> sd(x) # sample std. deviation

> quantile(x, p) # sample quantile, p could be 0.25, 0.5, 0.75

> min(x) # minimum of x

> max(x) # maximum of x

> range(x) # range of x

> library(e1071) # package providing skewness() and kurtosis()

> skewness(x) # skewness

> kurtosis(x) # kurtosis

Commands for statistical tests

Single sample t-test

> t.test(y, mu=10)

here y is a variable; mu is the population mean

Two sample t-test

> t.test(y1, y2, var.equal=TRUE)

y1 and y2 are the two independent samples

Paired t-test

> t.test(y1, y2, paired=TRUE)

y1 and y2 are the two paired samples

Chi-square test on a contingency table

> n <- cbind(y1, y2)

> chisq.test(n)

n is a data matrix / contingency table. When chisq.test() is given a matrix it tests the independence of rows and columns; for a goodness-of-fit test, pass a single vector of counts together with the expected proportions in the argument p.

Correlation

> n <- cbind(y1, y2) # create matrix n

> cor(n)

where y1 and y2 are two variables and n is the matrix of y1 and y2

Regression

> fit <- lm(y~x)

for multiple regression

> fit <- lm(y~x1+x2+x3)

Completely randomised design

> tr <- c(1,1,1,2,2,2,3,3,3) # create treatment variable

> yield <- c(25,41,54,65,45,65,25,12,35) # create dependent variable

> fit <- aov(yield ~ factor(tr)) # model statement

> summary(fit)

Randomised Block Design

> tr <- c(1,1,1,2,2,2,3,3,3) # create treatment variable

> rep <- c(1,2,3,1,2,3,1,2,3) # create replication variable

> yield <- c(25,41,54,65,45,65,25,12,35) # create dependent variable

> fit <- aov(yield ~ factor(tr) + factor(rep))

> summary(fit)

Two way factorial design

> fit <- aov(yield ~ factor(A) + factor(B) + factor(A):factor(B) + factor(rep))

> summary(fit)

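Installing Bioconductor in R

Enter the following commands in the R console to install Bioconductor packages.

> source("http://bioconductor.org/biocLite.R")

> biocLite()

On R 3.5 and later, biocLite() has been replaced by the BiocManager package: run install.packages("BiocManager") followed by BiocManager::install().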

Steps in manipulating fasta files

First load the library

> library(seqinr)

Set the working directory in R to the folder where the fasta files are stored

> setwd("c:/path/to/directory")

> seq1 <- read.fasta("sequence.fasta")

> seq1.seq <- seq1[[1]] # to take the first sequence from the fasta file

> length(seq1.seq) # to find the length of the sequence (bases)

> table(seq1.seq) # to find the frequency of each base

> GC(seq1.seq) # to find the GC content of the sequence

Several advanced options are available in R, ranging from simple sequence analysis to microarray data analysis. The purpose of this chapter is to introduce the R environment and to provide hands-on experience in exploring the functionalities available in R.

3. Python for Bioinformatics

J. Ashok Kumar and K. Vinaya Kumar

Python is one of the most popular high-level, general-purpose programming languages. It was first released in 1991 by Guido van Rossum, a Dutch programmer. It is an open-source programming language available for download at www.python.org. In recent years it has gained a lot of importance due to the development of several libraries applicable to various fields of research and development. One such library widely used in bioinformatics is Biopython. Here we introduce the Python environment for writing scripts and provide a glimpse of Biopython functionalities.

Installation of Python

Python is available for both Windows and Linux platforms. Windows/Linux binaries can be obtained from www.python.org. In Windows you may double-click the exe file and accept the default installation settings to install it on the system. Once it is installed, go to 'Edit environment variables' (Advanced, Environment Variables) and add the new Python path as shown in the figure.

Now you can open the command line interface in Windows by typing 'cmd' in the search box on the taskbar and pressing Enter.

On most Linux installations Python comes with the default installation. If it is not available, it can be installed on Debian/Ubuntu systems by keying in the following command

$ sudo apt-get install python3

Installing pip

Pip is the package manager for Python. To install pip, download get-pip.py from https://pip.pypa.io/en/stable/installing/ and enter the following command.

$ python get-pip.py

Once pip is installed, any Python package can be installed with the following command

$ pip install package-name

Installing Jupyter

Jupyter is a notebook application for Python in which one can write scripts, execute them and save the notebooks in different formats such as pdf for future use. Run the following commands to install and open the Jupyter notebook

$ pip install jupyter # install the jupyter package

$ python -m notebook ## opening notebook in Windows

$ jupyter notebook ## opening notebook in Linux

One can install additional packages as required, such as matplotlib for plotting graphs, numpy for numerical calculations, pandas for data structures and data analysis tools, statsmodels for statistical analysis and scipy for mathematical and scientific applications. All of these can be installed using pip.

Introduction to Python programming

The examples below use Python 3 syntax.

>>> print("hello world") ## printing a text

hello world

>>> text1 = "CIBA" # text1 is a string variable

>>> a = 20 # a is a numeric variable having value 20

>>> b = 30

>>> a + b

50

>>> b - a

10

>>> a * b

600

>>> a // b # integer (floor) division

0

>>> a / b # true division

0.6666666666666666

>>> a ** b # a to the power of b, about 1.07e+39

1073741824000000000000000000000000000000

For mathematical functions

>>> import math

>>> math.log(a) # natural logarithm of 20, about 2.996

>>> math.cos(a) # cosine of 20 radians, about 0.408

Arrays (lists) in Python

>>> a = []

>>> a = ["hi", "this", "is", "python"]

>>> a[2] # indexing starts at 0, so this is the third element

'is'

Declaring a dictionary

>>> dict1 = {"apple": 250, "banana": 100, "cherry": 300}

>>> list(dict1.keys())

['apple', 'banana', 'cherry']

>>> list(dict1.values())

[250, 100, 300]

>>> dict1["cherry"]

300

Programming loops

>>> for i in range(0, 10):
...     print(i)

>>> j = 1

>>> while j < 10:
...     print(j)
...     j = j + 1

Functions

>>> def f2c(x): # convert Fahrenheit to Celsius
...     return (x - 32) * 5 / 9.0

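Read and write files

The following script copies the header lines (those starting with ">") of a fasta file into a new file.

inp = open("input.txt", "r")   # file to read
out = open("output.txt", "w")  # file to write
for line in inp:
    if line[0] == ">":         # fasta header line
        out.write(line)
inp.close()
out.close()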
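The same line-by-line reading can be extended to collect whole sequences. The sketch below is one possible way to do it in plain Python, without any extra library. The function names read_fasta and gc_fraction and the in-memory example are made up for illustration, and a real file would be opened with open() instead of io.StringIO.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Invented two-record example held in memory instead of a file on disk
example = io.StringIO(">seq1 demo\nATGC\nGGCC\n>seq2\nATAT\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq), gc_fraction(seq))
```

Because read_fasta is a generator, only one record is held in memory at a time, which matters for large sequence files.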

Biopython

Biopython is a set of computational tools for bioinformatics analysis. It can be used to parse different file formats such as fasta, blast output, genbank and expasy; to run online tools such as NCBI blast and entrez; and to write code for sequence alignment, multiple sequence alignment, phylogeny and even machine learning classification methods such as naïve Bayes, k-nearest neighbours and support vector machines. The Biopython library can be installed with pip.

$ pip install biopython (or python -m pip install biopython in Windows)

>>> import Bio

>>> from Bio.Seq import Seq

>>> seq1 = Seq("ATGCGGATC")

>>> seq1

Seq('ATGCGGATC')

>>> seq1.complement()

Seq('TACGCCTAG')

>>> seq1.reverse_complement()

Seq('GATCCGCAT')

Biopython versions older than 1.78 also print an Alphabet() argument in these outputs, for example Seq('ATGCGGATC', Alphabet()).

Parsing fasta file

>>> from Bio import SeqIO

>>> for seq_record in SeqIO.parse("sequence.fasta", "fasta"):
...     print(seq_record.id)
...     print(repr(seq_record.seq))
...     print(len(seq_record))

System commands can also be executed from Python using the following commands. The database here is nt, because blastn searches a nucleotide database.

>>> import os

>>> com = "blastn -query seq.fasta -db nt -out out.txt"

>>> os.system(com)
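The standard library's subprocess module is a commonly recommended alternative for running external commands, because the arguments are passed as a list and the exit code and output can be inspected. The following is a minimal sketch. The helper name run_command is hypothetical, and the example runs the Python interpreter itself so that it works without BLAST installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments; return (exit_code, stdout)."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


# Harmless demonstration command: ask Python to print a line
code, output = run_command(
    [sys.executable, "-c", "print('hello from a child process')"]
)
print(code, output.strip())

# A BLAST call would be written the same way, for example:
# run_command(["blastn", "-query", "seq.fasta", "-db", "nt", "-out", "out.txt"])
```

An exit code of 0 conventionally means the program finished without error, so a script can stop early when a step in a pipeline fails.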

4. Understanding the Illumina datasets

K. Vinaya Kumar and J. Ashok Kumar

With continual improvements over the past few years, Next Generation Sequencing (NGS) platforms have come a long way in generating enormous amounts of sequence data at low cost and in less time. NGS platforms such as Illumina, PacBio, Nanopore and Ion Torrent are well known, with several published manuscripts reporting their use. A feature common to all these platforms is massively parallel sequencing of single or clonally amplified DNA molecules. Of the different platforms available to date, the one offered by Illumina stands apart in terms of the amount of sequence data generated and the cost involved. In the case of Illumina, from the Genome Analyzer IIx, the HiSeq XXXX series, the MiSeq and the NextSeq XXX series to the latest NovaSeq 6000, data output has improved while sequencing time has reduced.

Two sequencing approaches on the Illumina platform, paired-end (PE) and mate-pair (MP), are commonly used by researchers. PE sequencing is used for RNAseq studies, where we find differentially expressed transcripts in experimental samples compared to a control sample. MP sequence reads are mostly used in the assembly of whole genomes, where they play an important role in scaffolding the contigs. In this chapter we look at the structure of paired-end sequence datasets generated on the Illumina platform. The raw sequence data files generated on the Illumina platform are delivered as '.fastq' files. For every sample two files are provided, one with the read_1 or forward sequence reads and the other with the read_2 or reverse sequence reads. The order of reads in the forward and reverse files should not be altered, as the two files are linked: the n-th read in one file is the mate of the n-th read in the other.
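Whether two paired files are still in the same order can be checked by comparing the read names at matching positions. The sketch below assumes four lines per read and that mates share the same name up to the first space, optionally followed by /1 or /2. The function names and the example records are invented for illustration.

```python
def read_ids(fastq_lines):
    """Return read names from FASTQ lines (every 4th line, without '@', /1 or /2)."""
    ids = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # header line of each 4-line record
            name = line.strip().split()[0].lstrip("@")
            if name.endswith("/1") or name.endswith("/2"):
                name = name[:-2]
            ids.append(name)
    return ids


def pairs_in_sync(forward_lines, reverse_lines):
    """True if both files list the same reads in the same order."""
    return read_ids(forward_lines) == read_ids(reverse_lines)


# Invented two-read example for the forward and reverse files
forward = ["@r1 1:N:0:1", "ACGT", "+", "IIII", "@r2 1:N:0:1", "GGCA", "+", "IIII"]
reverse = ["@r1 2:N:0:1", "TTGA", "+", "IIII", "@r2 2:N:0:1", "CCAT", "+", "IIII"]
shuffled = reverse[4:] + reverse[:4]  # same reads, wrong order

print(pairs_in_sync(forward, reverse))   # files agree
print(pairs_in_sync(forward, shuffled))  # order was altered
```

Running such a check after any filtering step is a cheap way to confirm that the forward and reverse files have not drifted apart.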

Open the WinSCP tool. The following window appears. Enter the host name as told by the tutor. Enter the 'user name' and 'password' to log in to your account.

After logging in, the main window of the WinSCP tool appears. The window has two panels. The left panel is the file system of your computer. The right panel is the file system of your account on the server.

Click on the icon displaying 'two connected computers' in the top toolbar to open the PuTTY window. In this window you run your jobs on the server. Enter the login credentials at the prompt. Then browse to the folder where a file with the extension '.fastq' is present, and type the command 'head file.fastq' to see the first few lines of the file.

You will find that the information about each sequence read is represented in four lines.

Line 1: has information about the instrument ID, run ID, flow cell ID, lane ID, tile ID, X and Y coordinates of the cluster, read number, whether the read is filtered or not, control sample status etc.

Line 2: the sequence of the read, made up of the familiar A, T, G and C

Line 3: a plus (+) sign

Line 4: the quality scores of the sequence bases
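The four-line structure described above can be read in Python one record at a time. This is a minimal sketch that assumes exactly four lines per read (no wrapped sequences). The function name read_fastq and the two example reads are invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a FASTQ handle with 4 lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


# Invented example held in memory; a real file would be opened with open()
example = io.StringIO("@read1 1:N:0:1\nACGT\n+\nIIII\n@read2 1:N:0:1\nGGCA\n+\n!!5I\n")
reads = list(read_fastq(example))
for name, seq, qual in reads:
    print(name, seq, qual)
```

The length check reflects the fact that line 4 carries exactly one quality symbol for each base on line 2.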

You may visit the following page to understand more about the quality scores.

https://www.illumina.com/documents/products/technotes/technote_understanding_quality_scores.pdf

The symbols in line 4 represent the quality scores of the bases. The quality scores range from 0 to 40. A score of 40 indicates that the base called is of high quality: the corresponding error probability is that one base call in 10,000 would be incorrect. The following table illustrates the relation between the symbols and the corresponding quality scores.

Table. List of symbols corresponding to quality scores of bases in Illumina sequence datasets.

Symbol  Quality Score    Symbol  Quality Score
!       0                6       21
"       1                7       22
#       2                8       23
$       3                9       24
%       4                :       25
&       5                ;       26
'       6                <       27
(       7                =       28
)       8                >       29
*       9                ?       30
+       10               @       31
,       11               A       32
-       12               B       33
.       13               C       34
/       14               D       35
0       15               E       36
1       16               F       37
2       17               G       38
3       18               H       39
4       19               I       40
5       20
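The table follows a simple rule: the quality score of a symbol is its ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). A short sketch of both conversions is given below. The function names are made up for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Convert a line of quality symbols into a list of quality scores."""
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(score):
    """Probability that a base call with this quality score is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("!5?I")
print(scores)                 # scores for the symbols !, 5, ? and I
print(error_probability(40))  # one wrong call in 10,000
print(error_probability(20))  # one wrong call in 100
```

With these two functions any line 4 of a fastq record can be turned into numbers for further filtering or plotting.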

5. Checking quality of Illumina paired-end sequence data

K. Vinaya Kumar and J. Ashok Kumar

Illumina paired-end (PE) sequencing reads are commonly used for RNAseq studies and for assembling genomes. For each sample, the sequencing machine writes the output data to two paired .fastq files. In this chapter we discuss the quality issues pertaining to PE reads. A better understanding of these helps in better planning of read processing to extract quality data for further studies.

One of the basic software tools for understanding the quality of a PE reads file is 'FastQC'. Visit the following site to download the latest version of the software.

https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc

First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find a file named a1F.fastq. We shall check the quality of this file using the FastQC tool. To do this, run the following command at your prompt.

$ fastqc a1F.fastq

In less than two minutes the analysis will be completed and two output files written, a1F_fastqc.html and a1F_fastqc.zip. Save these files to your computer and open the .html file in any browser. Check all the images and understand their meaning. Look carefully for the following aspects in the file.

Box plot of quality scores along the sequence read length.

5.1. The reads are contaminated with adapter sequences used during sequencing.

The quality report indicates that some data processing is needed, which includes:

1. Removal of poor quality reads that are pulling down the average of quality scores.

2. Removal of poor quality bases at the start and end of sequence reads.

3. Removal of adapter sequences contaminating the reads.

6. Quality control of RNAseq datasets – NGS QC Toolkit

K. Vinaya Kumar and J. Ashok Kumar

Several free software tools are available for processing paired-end sequence reads. In this chapter we shall use NGS QC Toolkit for quality control of PE reads. First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find two files named a1F.fastq and a1R.fastq. Check the quality of both paired files using the FastQC tool. Practice the following quality control steps and observe the changes in quality of the trimmed files.

6.1 Discarding low quality reads

perl IlluQC_PRLL.pl -pe a1F.fastq a1R.fastq 2 A -l 70 -s 20 -c 50

This command removes all those reads where the proportion of bases having a quality of > 20 is less than 70%. After the run, you will find that a folder 'IlluQC_Filtered_files' has been created. The trimmed files are present in this folder. Do a quality check of these two files with FastQC. Observe the changes in the reads file after running this command.

After discarding about 3 million reads completely, the average quality of bases improved. The improvement in quality therefore came at the expense of losing about 30 % of the sequence reads.
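The filtering rule described in 6.1 can be written out in a few lines. The sketch below is not the toolkit's own code. It follows the description given here (a read is kept when at least 70% of its bases have a quality above 20) and assumes quality symbols encoded with an offset of 33. The function name passes_filter is hypothetical, and whether the toolkit treats the cutoff of 20 as inclusive is not verified here.

```python
def passes_filter(quality_string, min_quality=20, min_percent=70.0, offset=33):
    """True if at least min_percent of the bases have a quality above min_quality."""
    if not quality_string:
        return False
    good = sum(1 for symbol in quality_string if ord(symbol) - offset > min_quality)
    return 100.0 * good / len(quality_string) >= min_percent


# 'I' is quality 40 and '!' is quality 0
kept = passes_filter("IIIIIII!!!")     # 7 of 10 bases are good: 70 %
dropped = passes_filter("IIIIII!!!!")  # 6 of 10 bases are good: 60 %
print(kept, dropped)
```

For paired-end data a pair is normally kept only when both mates pass, so that the forward and reverse files stay in step.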

6.2 Discarding poor quality bases at both ends based on length

perl TrimmingReads.pl -i a1F.fastq -irev a1R.fastq -l 3 -r 30

This command removes 3 bases at the 5’ end and 30 bases at the 3’ end from all reads.

6.3 Discarding poor quality bases at 3’ end of reads based on quality score

perl TrimmingReads.pl -i a1F.fastq -irev a1R.fastq -q 30

This command removes bases at the 3’ ends where the base quality is < 30. This improvement in quality at the ends comes at the expense of some reads getting shorter.
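The two trimming rules in 6.2 and 6.3 can be illustrated with short functions. This is only a sketch of the idea, not the toolkit's implementation. The function names are made up, and quality symbols are assumed to be encoded with an offset of 33.

```python
def trim_fixed(sequence, quality_string, left=3, right=30):
    """Remove a fixed number of bases from the 5' (left) and 3' (right) ends."""
    stop = len(sequence) - right
    if stop <= left:
        return "", ""  # nothing is left of the read
    return sequence[left:stop], quality_string[left:stop]


def trim_3prime(sequence, quality_string, min_quality=30, offset=33):
    """Remove bases from the 3' end while their quality is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality_string[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Invented reads: 'I' = quality 40, '5' = 20, '#' = 2, '!' = 0
fixed = trim_fixed("ACGTACGTAC", "IIIIIIIIII", left=3, right=2)
by_quality = trim_3prime("ACGTACGT", "IIIII5!#")
print(fixed, by_quality)
```

Trimming by quality stops at the first good base met from the 3' end, so low quality bases in the middle of a read are left untouched.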

6.4 Discarding reads based on read length

perl TrimmingReads.pl -i a1F_7020.fastq -irev a1R_7020.fastq -n 25

This command removes reads shorter than 25 bases in length.

A combination of these steps can be chosen and applied based on the initial base quality of the sequence datasets. Extract only the good quality data for downstream processing of reads.

7. Quality control of RNAseq datasets – Trimmomatic

K. Vinaya Kumar and J. Ashok Kumar

Several free software tools are available for processing paired-end sequence reads. In this chapter we shall use 'Trimmomatic' for quality control of PE reads. First, log in to your account using the WinSCP tool. Open the PuTTY SSH terminal. In your account, find two files named a1F.fastq and a1R.fastq. Check the quality of both paired files using the FastQC tool. Run the following command and observe the changes in quality of the trimmed files. The command is typed on a single line, with the arguments separated by spaces.

The command

java -jar trimmomatic-0.36.jar PE -threads 70 -trimlog a1.txt a1F.fastq a1R.fastq a1F_P.fastq a1F_S.fastq a1R_P.fastq a1R_S.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:13 SLIDINGWINDOW:4:15 MINLEN:100

De-coding the command

Each argument in the command serves to improve the quality of the trimmed files. It is important to check the initial quality of the sequence data and then apply the relevant arguments to improve the quality.

Argument                               Meaning

PE                                     Paired-end mode. Use this for processing PE read data.
-threads                               Number of threads. Trimmomatic supports running with multiple threads.
-trimlog                               Name of the file that stores the log of the run.
a1F.fastq                              Input file of forward (R1) reads.
a1R.fastq                              Input file of reverse (R2) reads.
a1F_P.fastq                            Output file of trimmed forward (R1) reads. This file is used for subsequent analysis.
a1F_S.fastq                            Output file of surviving good quality forward reads whose paired reads in the R2 file were discarded.
a1R_P.fastq                            Output file of trimmed reverse (R2) reads. This file is used for subsequent analysis.
a1R_S.fastq                            Output file of surviving good quality reverse reads whose paired reads in the R1 file were discarded.
ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10   Removes adapter sequences from reads. TruSeq3-PE-2.fa is the file containing the adapter sequences.
LEADING:3                              Removes bases at the start of the read if their quality is below 3.
TRAILING:13                            Removes bases at the end of the read if their quality is below 13.
SLIDINGWINDOW:4:15                     Trims reads based on base quality. Each read is scanned from the 5’ end with a window of four consecutive bases. The average quality of every window in a read should be at least 15; otherwise the read is trimmed from the poor quality window to the 3’ end of the read.
MINLEN:100                             Discards reads shorter than 100 bases after performing all the steps.
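The sliding window and minimum length rules in the table can be sketched as follows. This is a simplified illustration of the idea and not Trimmomatic's actual implementation, which differs in detail. The function names are made up, and quality symbols are assumed to be encoded with an offset of 33.

```python
def sliding_window_trim(sequence, quality_string, window=4, min_mean=15, offset=33):
    """Cut the read at the start of the first window whose mean quality is too low."""
    scores = [ord(symbol) - offset for symbol in quality_string]
    cut = len(scores)
    for start in range(0, len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_mean:
            cut = start  # discard from this window to the 3' end
            break
    return sequence[:cut], quality_string[:cut]


def long_enough(sequence, min_length=100):
    """MINLEN-style check applied after all trimming steps."""
    return len(sequence) >= min_length


# Invented read: six bases of quality 40 ('I') followed by four of quality 2 ('#')
trimmed = sliding_window_trim("ACGTACGTAC", "IIIIII####")
print(trimmed, long_enough(trimmed[0], min_length=5))
```

In this example the window starting at the sixth base is the first whose average (11.5) falls below 15, so the read is cut there and the first five bases are kept.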

Run FastQC on the trimmed files

Below is the quality of the forward sequence reads before (left) and after (right) trimming.

Below is the quality of the reverse sequence reads before (left) and after (right) trimming.

Even the reads containing adapters are trimmed. These trimmed files are taken up for finding differentially expressed transcripts. The single-end good quality reads are also used when assembling genomes.

8. Assembling bacterial genomes

J. Ashok Kumar and K. Vinaya Kumar

Genome sequencing forms the basis for understanding the biology and functional characterization of microorganisms. Recent advances in shotgun sequencing have paved the way to generate genome sequences with time and cost advantages. Here we discuss whole genome assembly of paired-end sequence reads generated on the Illumina platform. First we describe the steps involved in de novo assembly of a bacterial genome using the MaSuRCA assembler, and later we look into the steps involved in reference-based assembly using Bowtie2.

Download MaSuRCA (Maryland Super Read Cabog Assembler)

The MaSuRCA assembler can be downloaded from http://www.genome.umd.edu/masurca.html. Once it is downloaded, keep the archive in your directory and extract the tarball using the following command.

user@server$ tar -zxvf MaSuRCA-3.2.6.tar.gz

This will extract the files into the folder MaSuRCA-3.2.6. You will find all the executable programs in the bin subfolder of the MaSuRCA-3.2.6 folder.

Preparing Illumina sequence reads

Copy the Illumina paired-end sequence reads into a folder. There will be two files, one for the forward reads and the other for the reverse reads, for example vibgenome_R1.fastq and vibgenome_R2.fastq. These fastq files need to be quality checked and corrected using tools such as FastQC, cutadapt and Trimmomatic.

Preparing Masurca conguraon le

You will nd sample conguraon (sr_cong_example.txt) le in the installaon directory which

needs to be edited with the assembly parameters. There are two secons in conguraon le. One is

DATA secon and Other one is PARAMETERS secon.

In the data secon Opons are available to specify paired-end (PE), mate-pair (JUMP), PACBIO

and Other (Celera assembler reads). Mulple libraries data can be menoned in mulple lines of the

same read type.

For paired-end reads, the following line of the DATA section needs to be edited.

PE= aa 180 20 /FULL_PATH/frag_1.fastq /FULL_PATH/frag_2.fastq

PE: paired-end; aa: two-letter prefix; 180: average insert length; 20: standard deviation of insert length.

In the PARAMETERS section, the mandatory parameters that need to be edited are NUM_THREADS and JF_SIZE.

NUM_THREADS is the number of threads allotted for the assembly task. Example: NUM_THREADS=16

JF_SIZE is the jellyfish hash size. Set this to about 10x the genome size; it can also be set to the genome size multiplied by its coverage.

Denovo assembly using MaSuRCA

The command to run the MaSuRCA assembly is

user@server$ /path/to/bin/masurca /path/to/config.txt

This will generate the file ‘assemble.sh’ in the current location. Now we need to run this shell script to complete the assembly.

user@server$ sh assemble.sh

Successful completion of the assembly will create several files. Look for the directory named CA; within that folder you will see the 10-gapclose subfolder, wherein you will find the final assembled output. The output files are ‘genome.ctg.fasta’ for the contig sequences and ‘genome.scf.fasta’ for the scaffold sequences.

Reference based assembly using bowe2

In reference based assembly reads are mapped to reference genome to idenfy variaons like

single nucleode polymorphisms(SNPs), indels, inserons, copy number variants, genome wide

associaon studies (GWAS).

Steps involved in reference based assembly are listed below with the commands for running

each step

ØIndexing a reference genome

$bowe2-build V_para_GCA_000328405.1.fna vibindex

ØAligning reads

$bowe2 -x vibindex -1 V-Para-DNA_R1.fastq -2 V-Para-DNA_R2.fastq -S align1.sam

ØCovert sam to bam le

$samtools view -bS align1.sam > align1.bam

(-bs: input sam and output bam)

ØSort the bam le

$samtools sort align1.bamalign1.sorted.bam

ØCreate the BCF le

$samtools mpileup -uf V_para_GCA_000328405.1.fna align1.sorted.bam.bam | bcools view

-Ov - > align.raw.bcf

(-u generate uncompress BCF output; -f faidx indexed reference sequence le; -Ov output

potenal variant sites only)


9. RNAseq data analysis in Trinity

K. Vinaya Kumar, J. Ashok Kumar and M.S. Shekhar

Many commercially relevant aquaculture species, including shrimp, do not have a publicly available reference genome. Therefore, the analysis of RNAseq data for such species requires building a de novo transcriptome assembly. For every experiment, a de novo assembly has to be made utilizing the RNAseq reads of all the samples in the study. In this chapter, we shall practice building a de novo transcriptome assembly and conducting differential transcript analysis in the Trinity software.

9.1 The datasets

Let us assume an experiment involving two treatments, a and b. Each treatment has three replicate individuals. At the end of the experiment, tissue samples were collected from all replicate individuals and RNAseq was performed on the Illumina platform. The following datasets have been generated.

Table. Datasets to be used for RNAseq data analysis

              Treatment A                     Treatment B
              Forward reads   Reverse reads   Forward reads   Reverse reads
replicate 1   a1F.fastq       a1R.fastq       b1F.fastq       b1R.fastq
replicate 2   a2F.fastq       a2R.fastq       b2F.fastq       b2R.fastq
replicate 3   a3F.fastq       a3R.fastq       b3F.fastq       b3R.fastq

9.2 Quality control of datasets

Process the raw reads using the Trimmomatic tool to obtain quality reads. Use the following arguments while running Trimmomatic.

ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10

LEADING:3

TRAILING:13

SLIDINGWINDOW:4:15

MINLEN:100

The numbers of reads retained for downstream analysis are given below.

Sample name   Reads in raw file (million)   Reads in processed file (million)
a1            10                            4.954252
a2            10                            5.577112
a3            10                            6.412094
b1            10                            5.257203
b2            10                            4.160784
b3            10                            3.607086

9.3 Building a de novo assembly

As the experiment involves triplicate samples, prepare a text file showing the triplicate samples under each treatment and their file names as shown below.
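A minimal sketch of how such a samples file could be generated is given here. It assumes the tab-delimited layout expected by Trinity (treatment, replicate name, forward reads file, reverse reads file) and assumes that the quality-filtered paired reads carry the suffixes `F_P.fastq` and `R_P.fastq`, as in the file names used later in this chapter. The function name is hypothetical.

```python
def trinity_samples_table(treatments, replicates,
                          fwd_suffix="F_P.fastq", rev_suffix="R_P.fastq"):
    """Return tab-delimited rows: treatment, replicate, forward reads, reverse reads."""
    rows = []
    for treatment in treatments:
        for rep in replicates:
            sample = f"{treatment}{rep}"
            rows.append("\t".join(
                [treatment, sample, sample + fwd_suffix, sample + rev_suffix]
            ))
    return "\n".join(rows) + "\n"


samples_text = trinity_samples_table(["a", "b"], [1, 2, 3])
print(samples_text)
```

Saving the printed text as ab_samples.txt gives one line per replicate, six lines in all for this experiment.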

Then proceed to build the assembly using the following command,

Trinity --seqType fq --samples_file ab_samples.txt --CPU 70 --max_memory 300G --SS_lib_type FR --output trinity_ab

The details of the command arguments are,

--seqType fq: input files are in fastq format
--samples_file ab_samples.txt: sample file names are given in ab_samples.txt
--CPU 70: use 70 threads
--max_memory 300G: limit maximum memory to 300 GB
--SS_lib_type FR: data obtained from strand-specific library as forward and reverse reads
--output trinity_ab: store output in the folder trinity_ab

The assembly is completed when you see the messages printed as shown below.

Browse to the folder and find the assembled transcripts file, Trinity.fasta. Rename the file as ‘Trinity_ab.fasta’ for easy identification.


9.4 Assessing quality of assembly

9.4.1. N50: Compute the N50 statistic by running the following command,

TrinityStats.pl Trinity_ab.fasta > Trinity_ab_stats.txt
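To make the statistic concrete, the following Python sketch computes N50 from a list of contig lengths: the contigs are sorted from longest to shortest and the length at which the running total first reaches half of the assembled bases is reported. The toy lengths are invented for illustration and this is not the code used by TrinityStats.pl.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum reaches half of the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


toy_lengths = [100, 200, 300, 400, 500]   # total 1500 bases, half is 750
print(n50(toy_lengths))                   # 500 + 400 = 900 >= 750, so N50 is 400
```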

9.4.2. ExN50: The E90N50 is considered more appropriate than N50 for RNAseq studies. Get the ExN50 statistics with the following command.

contig_ExN50_statistic.pl matrix.TMM.EXPR.matrix Trinity_ab.fasta | tee ExN50.stats

E      Minimum expression   ExN50   Number of transcripts
E90    2.28                 1611    45381
E91    1.952                1511    53150
E92    1.916                1409    61794
E93    1.55                 1314    71403
E94    1.5                  1212    82217
E95    1.262                1102    94509
E96    1.122                1005    108691
E97    0.95                 927     125654
E98    0.746                858     146688
E99    0.566                791     175457
E100   0                    605     281008

The N50 calculated based on the top most expressed transcripts that represent 90% of the total

normalized expression data is 1611 bases and includes 45381 transcripts.
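The idea behind ExN50 can be sketched in Python as follows: rank the transcripts by expression, keep the most highly expressed ones until they account for x% of the total expression, and compute N50 on that subset only. This is a simplified illustration on invented toy values; the Trinity script may differ in details such as how isoforms of a gene are handled.

```python
def n50(lengths):
    """Length at which the cumulative sum of sorted lengths reaches half of the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


def exn50(transcripts, x):
    """transcripts: list of (length, expression); x: percentage of total expression."""
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    target = sum(expr for _, expr in ranked) * x / 100
    chosen, running = [], 0.0
    for length, expr in ranked:
        if running >= target:
            break
        chosen.append(length)
        running += expr
    return n50(chosen), len(chosen)


# Toy data: (transcript length in bases, normalized expression)
toy = [(1000, 50), (2000, 30), (500, 15), (300, 5)]
print(exn50(toy, 90))
print(exn50(toy, 100))
```

With x set to 100 every transcript is included, so the value equals the ordinary N50 of the whole assembly, which matches the E100 row of the table.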

9.4.3. Read representaon: The proporon of paired-reads represented in the assembled transcripts

is another parameter that helps in evaluang the assembly. We shall use bowe2 tool for this. First

an index is to be made and then reads are to be aligned on to transcripts. Run the following two

commands.

bowe2-build<>Trinity_ab.fasta<> Trinity_ab.fasta

AND

bowe2<> -x<> Trinity_ab.fasta<> -q<> --fr<> -1<> a1F_P.fastq,a2F_P.fastq,a3F_P.fastq,b1F_P.

fastq,b2F_P.fastq,b3F_P.fastq<> -2<> a1R_P.fastq,a2R_P.fastq,a3R_P.fastq,b1R_P.fastq,b2R_P.

fastq,b3R_P.fastq<> -S<> samle<> --no-unal<> -p<>50

ICAR – Central Instute of Brackishwater Aquaculture, Chennai

As per the stascs shown above, the overall alignment rate is 90% which is good.

9.5 Transcript quancaon

9.5.1. Esmate abundance

The rst step in transcript quancaon is to esmate the abundance of all transcripts in every

sample. We shall pracce esmang transcript abundance using alignment-based method, RSEM

though alignment-free methods such as kallisto and salmon exist. Run the following command to get

abundance esmates by aligning the sequence reads to transcripts and counng the number of reads

aligned for each transcript.

align_and_esmate_abundance.pl<> --transcripts<> Trinity_ab.fasta<> --seqType<> fq

<>--samples_le<> ab_samples.txt <>--est_method<> RSEM <>--aln_method<> bowe<> --trinity_

mode<> --prep_reference<> --SS_lib_type <>FR <>--output_dir<> ab_rsem_outdir <>--thread_

count<>20

Argument                          Meaning
align_and_estimate_abundance.pl   Script to align reads to transcripts and get abundance estimates
--transcripts                     To define the assembled transcripts file name
--seqType                         To define input file format
--samples_file                    To define the name of the file that contains treatments, replicates and reads file names
--est_method                      To define abundance estimation method (options are RSEM/eXpress/kallisto/salmon)
--aln_method                      To define alignment method (bowtie/bowtie2)
--trinity_mode                    To automatically generate gene_trans_map
--prep_reference                  To build target index
--SS_lib_type                     Specify if the library is strand-specific (FR/RF)
--output_dir                      Name of the directory to store output files
--thread_count                    Number of threads to use for running the command

At the end of the run, six folders are created corresponding to the six samples. In each folder, look for a file named RSEM.isoforms.results. These files are used for further processing. The abundance estimates are built into a matrix with the following command,

abundance_estimates_to_matrix.pl --est_method RSEM RSEM.isoforms.results

Mention all six RSEM.isoforms.results file names corresponding to the six samples.

9.5.2. Count the numbers of expressed transcripts

Count the number of transcripts that are expressed at different TPM thresholds by running the following command,

count_matrix_features_given_MIN_TPM_threshold.pl matrix.TPM.not_cross_norm | tee counts_by_min_TPM

The output looks like the table depicted below.

Neg_min_tpm Number of features

-10 24978

-9 29296

-8 35850

-7 45677

-6 62228

-5 84308

-4 111966

-3 151586

-2 202147

-1 228356

0 281008
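The logic of this table can be illustrated with a few lines of Python: for every threshold, count the transcripts whose highest TPM across the samples is at least that threshold. This is a sketch on invented values and does not reproduce the exact rounding rules of the Trinity script, which reports the thresholds as negative numbers.

```python
def counts_by_min_tpm(max_tpm_per_transcript, thresholds):
    """Number of transcripts whose maximum TPM across samples is >= each threshold."""
    return {
        t: sum(1 for value in max_tpm_per_transcript if value >= t)
        for t in thresholds
    }


# Toy values: highest TPM observed for each of five transcripts
toy_max_tpm = [0.0, 0.5, 1.2, 3.7, 12.0]
toy_counts = counts_by_min_tpm(toy_max_tpm, [0, 1, 2, 10])
print(toy_counts)
```

As in the table, the count at threshold 0 equals the total number of transcripts and the counts fall as the threshold is raised.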

9.6 Dierenal expression analysis

At present, Trinity supports four R packages for performing dierenal expression analysis.

These are edgeR, DEseq2, limma/voom, and ROTS. We shall use edgeR in this tutorial to understand

dierenal expression analysis. Run the following commands.

run_DE_analysis.pl<> --matrix<> matrix.counts.matrix<> --method<> edgeR<> --samples_le<>

ab_samples_DE.txt<> --output <>ab_edgeRresult

AND

analyze_di_expr.pl<> --matrix<> matrix.TMM.EXPR.matrix<> --output<> aVSb <>--samples<>

ab_samples_analyzeDE.txt

In this parcular example, we got ve transcripts that are dierenally expressed in sample b

compared to sample a. Now proceed to funconal annotaon of these transcripts and understand its

role for the given treatment in the study.

9.7 Quality check of samples and replicates: You may compare the samples as well as the replicates in each sample with the following commands.

9.7.1. /PtR --matrix matrix.counts.matrix --samples ab_replicatesTest.txt --CPM --log2 --min_rowSums 10 --compare_replicates

9.7.2. /PtR --matrix matrix.counts.matrix --min_rowSums 10 -s ab_replicatesTest.txt --log2 --CPM --sample_cor_matrix

9.7.3. /PtR --matrix matrix.counts.matrix -s ab_replicatesTest.txt --min_rowSums 10 --log2 --CPM --center_rows --prin_comp 3

For example, in the picture below, it is evident that the replicates in treatment b are clustered closely. This indicates that all the replicates behaved similarly.

10. Phylogenomic analysis using MrBayes

K. Vinaya Kumar, J. Ashok Kumar and G. Gopikrishna

Researchers perform phylogenec analysis to understand the evoluonary relaons among

taxa. Such analyses require informaon on best-t paroning schemes and best-t models for

the sequence data in hand. The ParonFinder is a suitable tool to nd that informaon to build

phylogenec tree. In this chapter we conduct analyses using ParonFinder tool for nding the best-

t paroning scheme and evoluonary models. Then using these paroning schemes and models,

we would build a Bayesian tree in MrBayes tool.

10.1 ParonFinder

For this exercise, a sequence le containing sequence data of 5 genes on 10 taxa is provided in

your work folder. Open the folder and check for the le named, ‘sequence_10_5.phy’.

Taxa labels       10 taxa: taxaA, taxaB, ... taxaJ
Gene partitions   5 genes
                  Gene1: 1-675 bp
                  Gene2: 676-834 bp
                  Gene3: 835-2373 bp
                  Gene4: 2374-3060 bp
                  Gene5: 3061-3849 bp

The arguments for running PartitionFinder are to be provided in a configuration file. Find the file ‘partition_finder.cfg’ in the work folder. Keep the settings as per the table given below.

Argument Opon Meaning

alignment sequence_10_5.phy File containing sequences in phylip

format

branchlengths linked

Linked branch lengths are

supported by almost all phylogeny

programs

models mrbayes

Includes all the evoluonary

models that are compable for

MrBayes tool for tesng

model_selecon aicc Criterion to decide the best model

data_blocks

Gene1_pos1 = 1-675\3;

Gene1_pos2 = 2-675\3;

Gene1_pos3 = 3-675\3;

Gene2_pos1 = 676-834\3;

Gene2_pos2 = 677-834\3;

Gene2_pos3 = 678-834\3;

Gene3_pos1 = 835-2373\3;

Gene3_pos2 = 836-2373\3;

Dening data parons. For each

gene, three data parons are

dened based on the three base

posions of triplet code. We

dened 15 data blocks for 5 genes.

ICAR – Central Instute of Brackishwater Aquaculture, Chennai

Gene3_pos3 = 837-2373\3;

Gene4_pos1 = 2374-3060\3;

Gene4_pos2 = 2375-3060\3;

Gene4_pos3 = 2376-3060\3;

Gene5_pos1 = 3061-3849\3;

Gene5_pos1 = 3062-3849\3;

Gene5_pos1 = 3063-3849\3;

search greedy Dening the method to use for

nding good paroning scheme

Then use the following command to run PartitionFinder. Here, mention the name of the folder containing the sequence file and the configuration file in place of ‘folder_name’.

python PartitionFinder.py folder_name/ --no-ml-tree

The output files are stored in the folder ‘analysis’. Find the file ‘best_scheme.txt’, which contains the arguments for running the best-fit models on the best partitions in the MrBayes tool.

10.2 MrBayes

The Bayesian analysis requires the input sequence file in nexus format. Find the file ‘sequence_10_5.nxs’, the nexus version of the sequence file that was used for analysis in PartitionFinder. Download the Windows version of the MrBayes tool and unzip the file. Start the tool by clicking on the executable. Then run the following commands.

execute sequence_10_5.nxs;

outgroup taxaJ;

type the arguments given in the PartitionFinder output file, ‘best_scheme.txt’

showmodel # to check the model defined

mcmc ngen=10000000 nruns=2 nchains=4 samplefreq=100 printfreq=100 diagnstat=maxstddev diagnfreq=1000 savebr=yes filename=PartitionFinder

After running for 10 million generations, you would see the following screen.

You could continue with more generations if required by opting for ‘yes’ at the prompt.

Then obtain a summary of parameters with the following command. Here, by default, the first 25% of observations are discarded.

sump filename=PartitionFinder

Look for parameters like the estimated sample size and the potential scale reduction factor. Then summarize the trees with the following command. This prints a cladogram and a phylogram.

sumt filename=PartitionFinder

Check for the .tre file and open it in FigTree to view the tree.

11. Microsatellites genotypes generation by Fragment analysis method

B. Sivamani

Fragment analysis (genotyping) can be performed on DNA fragments that have fluorescent labels. Using a labeled primer with PCR amplification is a common method used to incorporate these labels. The Molecular Biology Core lab is already set to run multiple fluorescent dye sets.

Steps

1. Microsatellite loci selection
2. Primer designing (fluorescent labelled)
3. PCR
4. Fragment analysis – ABI sequencer
5. Generating genotypes

1. Microsatellite loci selecon

The loci are selected loci through literature search or from any database. For sheries, the NBFGR

FishMicrosat database provides updated microsatellite loci and their primers for pcr amplicaon.

(hp://mail.nbfgr.res.in/shmicrosat/).

Steps to nd the microsatellite loci in Penaeus (Fenneropeaneus indicus)

ØVisit the site (hp://mail.nbfgr.res.in/shmicrosat/)

ØUnder Analysis and Primer, select your species of search and you will nd all the microsatellite

loci related to the specic search. The following details of the loci also present

1. Accession Number: link will lead to the NCBI site and will give all the details of the

nucleode sequence

2. SSR type: di, tri, tetra or compound

3. Microsatellite span in the sequence

4. Primers to amplify the locus

2. Designing primer

One can use the specified primers, or primers may be designed as per the user requirement by utilizing the accession number option. One of the primers needs to be fluorescent labeled.

3. PCR

Isolate total DNA from the biological material (blood/fin clips/muscle, etc.) of the species. Verify the DNA for quality and quantity. Carry out the PCR with labelled primers. Verify the amplicon by agarose gel electrophoresis. Specific amplification of the product is preferred; otherwise, the presence of a few less intense non-specific bands is also acceptable.

4. Fragment analysis – by ABI sequencer

This step is normally outsourced as the cost of the equipment is too high. We receive the results generated by the GeneMapper software (private firms use the inbuilt GeneMapper software) with the following files.

1. FSA file
2. PDF of the electropherogram
3. Genotypes in Excel sheet

Fig. 1 Electropherogram

Fig. 2 Genotype data generated by GeneMapper software

5. Generang genotypes from FSA le using R

# 5.1 Install R from the site hps://www.r-project.org/

# 5.2 installing the package from R site##

Install.packages(“Fragman”)

# 5.3 To acvate, the package has to be loaded###

>Library(Fragman)

# 5.4 To specify the input fas le

ICAR – Central Instute of Brackishwater Aquaculture, Chennai

FIM03<-storing.inds(“C:/Users/Admin/Desktop/training writeup/FAS le-ciba”)

#5.5 To specify the ladder used in the analysis

my.ladder <- c(35,50,75,100,139,150,160,200,250,300,340,350,400,450,490,500)

# 5.6 To merge both the earlier specied informaon (FAs le and ladder)

ladder.info.aach(stored=FIM03, ladder=my.ladder)

# 5.7 Tocreate friendly plots for any number of individuals specied and can be used to

#design panels for posterior automac scoring

overview2(my.inds=FIM03, channel = 2, ladder=my.ladder)

# 5.8 to view the vector with expected DNA sizes to be used in the next step for scoring

my.panel2<-overview2(my.inds=FIM03, channel = 2, ladder=my.ladder, init.thresh=3000,

xlim=c(90,130))

my.panel2

# 5.9 To score our samples for channel 2 with our panel created previously

res2 <- score.markers(my.inds=FIM03, channel = 2, panel=my.panel2$channel_2, ladder=my.

ladder, electro=FALSE)

# 5.10 To extract your peaks in a data.frame

nal.results <- get.scores(res2)

nal.results

# 5.11To get the results in text le format

write.table(nal.results, “ C:/Users/Admin/Desktop/training writeup/FIM03-18-2.txt”, sep=”\t”)

******

Note

install.packages = to install the specific package
library = to load an add-on package
storing.inds = the function in charge of reading the FSA files and storing them with a list structure
ladder.info.attach = uses the information read from the FSA files and a vector containing the ladder information (DNA size of the fragments) and matches the peaks from the channel where the ladder was run with the DNA sizes for all samples. It then loads this information in the R environment for the use of posterior functions
stored = list of data frames obtained by using the storing.inds function
overview2 = creates friendly plots for any number of individuals specified and can be used to design panels for posterior automatic scoring (like licensed software does)
my.inds = list with the channels information from the individuals specified, usually coming from the storing.inds function output
channel = the channel you wish to analyze; usually 1 is blue, 2 is green, 3 is yellow, 4 is red and so on
init.thresh = an initial value of intensity to detect peaks. We recommend not to deal too much with it unless you have highly controlled DNA concentrations in your experiment
score.markers = scores the alleles by finding the peaks provided in the panel
panel = different DNA sizes, usually obtained by using the overview and locator functions
get.scores = once the calls have been obtained, we can extract a data frame with the get.scores function

******

xlim=c(a1,b1) = the approximate amplicon size range to be mentioned in overview2

Dye sets used in Applied Biosystems DNA analysers

Blue: 5-FAM and 6-FAM
Green: HEX, VIC, TET and JOE
Yellow: TAMRA and NED
Red: ROX and PET

12. Genepop : Populaon Genecs analysis

B. Sivamani

GENEPOP is a populaon genecs soware package originally developed by Michel Raymond

(Raymond@isem.univ-montp2.fr) and Francois Rousset (Rousset@isem.univ-montp2.fr), at the

Laboraore de Geneque et Environment, Montpellier, France.

Access: Web version is easy to use but acve internet is required, also can be downloaded and

run under windows and linux without internet.

Genepop on the web

Can be accessed in the link: hp://genepop.curn.edu.au/

It has seven opons. Once the input le is prepared, all the opons can be run.

Opon1 Hardy Weinberg Exact Tests

Opon 2 Genotypic linkage disequilibrium

Opon 3 Populaon dierenaon

Opon 4 Nm esmates - private allele method

Opon 5 Basic Informaon

Opon 6 Fst and other correlaons

Opon 7 le conversion

Input le (e.g.)

Title:"P.indicus microsatellites based population diversity"

FIM03

FIM06

FIM20

FIM21

FIM17

FIM19

FIM23

POP

C051 , 113115 161161 128130 225225 308289 110118 201226

C054 , 109115 161167 123123 222231 299307 110116 222226

C055 , 113117 161164 123123 231231 307307 108110 191201

C056 , 109113 161164 123123 245248 307307 110118 191201

C057 , 117117 161161 123125 230230 293307 110118 191201

POP

K15 , 115115 161161 128130 230230 308308 110118 201222

K20 , 115117 160160 123125 000000 294307 110118 191201

K23 , 107113 149159 123123 000000 309308 110116 191201

K36 , 111115 160186 123123 000000 307308 108118 201222

K42 , 109115 138149 123125 000000 298300 110116 191201

Instructions for the input file

• The input file should be prepared in Notepad, Notepad++ or Excel
• The input file should have the txt extension, e.g. filename.txt
• The first line, the title, is written within inverted commas
• There is no constraint on the blanks separating the various fields; tabs or spaces are allowed.
• Loci names can appear on separate lines, or on one line if separated by commas
• The individual identifier may have blanks but must end with a comma
• Alleles are numbered from 01 to 99 (or 001 to 999). Consecutive numbers to designate alleles are not required.
• Populations are defined by the position of the “Pop” separator. To group various populations, just remove the relevant “Pop” separators.
• Individual genotypes for the web version must be on one line. This differs from the PC version.
• Missing data should be indicated as 00 (or 000) rather than blanks. There are three possibilities for missing data:
  - no information (0000) or (000000),
  - partial information for first allele (1000) or (010000),
  - partial information for second allele (0010) or (000010).
• The number of locus names should correspond to the number of genotypes in each row. If you remove one or several loci from your input file, you should remove both their names and the corresponding genotypes.
• No empty lines should be found within the file.
• No more than one empty line should be present at the end of the file.
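The genotype coding described above can be checked with a short Python sketch before the file is submitted. It splits each four- or six-digit code into its two alleles and reports 00 or 000 as missing (`None`). The function names are hypothetical, and only a single individual line from the example input is parsed here.

```python
def split_genotype(code):
    """Split a 4- or 6-digit genotype code into two alleles; zero means missing."""
    if len(code) not in (4, 6) or not code.isdigit():
        raise ValueError(f"unexpected genotype code: {code!r}")
    half = len(code) // 2
    alleles = (int(code[:half]), int(code[half:]))
    return tuple(allele if allele != 0 else None for allele in alleles)


def parse_individual(line):
    """Return the individual identifier and its list of genotypes."""
    # The identifier ends at the first comma; genotypes are separated by blanks.
    name, _, genotypes = line.partition(",")
    return name.strip(), [split_genotype(g) for g in genotypes.split()]


name, calls = parse_individual("K20 , 115117 160160 123125 000000 294307 110118 191201")
print(name, calls)
```

Comparing the number of parsed genotypes with the number of locus names is a quick way to catch rows with a missing or extra entry.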

To run in PC

Download Genepop from the link http://kimura.univ-montp2.fr/~rousset/Genepop.htm

Depending on the OS, the 32 or 64 bit version can be run from the PC without installing. The input file format is the same as for Genepop on the web. The input file should be in the same folder as the software. After specifying the input file, type the number of the option (for analysis); the output file gets stored in the same folder.

13. Populaon genec analysis of microsatellite data in Arlequin

B. Sivamani

Arlequin is a soware tool specially designed to extract informaon on genec and demographic

features of a collecon of populaon samples. Arlequin can handle several types of data either in

haplotypic or genotypicform. The data types include

ØDNA sequences

ØRFLP data

ØMicrosatellite data

ØStandard data

ØAllele frequency data

Arlequin can analyse various populaon parameters. They are standard indices, molecular diversity,

Linkage disequilibrium, Hardy-Weinberg equilibrium, Amova, Exact populaon dierenaon etc.,

Installaon and uninstallaon

1. Download WinArl35.zip to any temporary directory.

2. Extract all les contained in Arlequin35.zip in the directory of your choice.

3. Start Arlequin by double-clicking on the le WinArl35.exe, which is the main executable le.

4. To uninstall simply delete the directory where you installed Arlequin. The registries were not

modied by the installaon of Arlequin.

Conguraon

Download text editor tool from www.textpad.com and install. It is required to create, edit the

project les and to view the log les.

Download R from www.rproject.org and install it.

Running the soware Arlequin

Open the arlequin by double clicking “WinArl35.exe” which leads to the home page.

Hands on Training Aquaculture Genomics and Bioinformacs 41

Step1. Conguraon of Arlequin

1.1 Click on ‘Arlequin Conguraon’ box, select the opon Append results, XML output and use

64bit external . Append Results is selected to get the results of several runs of a specic

input le into a single output le. The XML output opon is to get the results in XML format.

1.2 Under ‘Helper Programs’, the path of the Text editor and R has to be specied for the

ulizaon. Click the ‘Browse’ box of the ‘Text editor’ and browse where the Textpad.exe le

is located.

ICAR – Central Instute of Brackishwater Aquaculture, Chennai

1.3 Click the ‘Browse’ buon of ‘Rcmd’ and indicate the path by selecng the Rcmd Applicaon

form the specic folder.

Step 2: Project le preparaon

Arlequin requires project le (input) which has the extension “.arp”. Once the analysis over, the

output (results) will be stored in the same (WinArl35) folder as subfolder with the extension “.res”.

2.1 Open the arlequin soware by double cicking the “ WinArl35.exe”

2.2 Click on “project wizard” opon. An example project le will be created with the Arlequin

format.

2.3 Click the dropdown menu of ‘Datatype’. Select the opon ‘MICROSAT’

Hands on Training Aquaculture Genomics and Bioinformacs 43

2.4 Choose ‘Genotype data’

2.5 We have data on ve populaons for analysis. Therefore menon ‘No of samples’ as 5.

2.6 Choose ‘whitespace’ against the ‘Locus separator’

2.7 Type ‘?’ against ‘Missing data’

ICAR – Central Instute of Brackishwater Aquaculture, Chennai

2.8 Select the opon ‘Include genec structure’

2.9 Click the ‘Browse’ opon

2.10 It opens a pop-up window where the project le need to be stored as ‘ciba-1’ in ‘WinArl35’

folder

2.11 Click on ‘CREATE PROJECT’ project which creates a project le named ‘ciba-1

2.12 Convert the Genepop format of input le into Arlequin project format

2.12.1 Open the ‘Genepop on the web’ (hp://genepop.curn.edu.au/)

2.12.2 Click on opon 7 (File Conversion) will open the window for Data format conversions.

Hands on Training Aquaculture Genomics and Bioinformacs 45

2.12.3 Select (opon 5) Genepop format to Arlequin project

2.12.4 Select ‘Datatype’ as ‘microsatellite’

2.12.5 Select ‘Genotypic data’ as ‘diploid’

2.12.6 For ‘Recessive (null) allele present’, select yes or no based on the data. Here our data

contains some null alleles. Therefore we select ‘Yes’ opon.

2.12.7 For ‘Gamec phase’, select ‘unknown’ opon (being a diploid data, gamec phase details

are not necessary; the same results will be obtained for either opon)

2.12.8 For ‘Output format & Delivery’ select any of the opons; ‘Email the results’ or ‘HTML -

Plain Text’. Under ‘Email the results’, enter your email id. The results will be sent to your mail

id. Plain text opon, will display in the same window.

2.12.9 Under ‘Choose File’ opon, browse your Genepop le (ciba_genpop_1.txt) and click

‘Submit data’ box. We get the Results in Arlequin project format.

ICAR – Central Instute of Brackishwater Aquaculture, Chennai

2.13 Copy the results from ‘[Data]’ to ll the end.

2.14 Goto Arlequin soware page and click ‘Edit project’

2.15 It will open ciba1.arp le in text pad

Hands on Training Aquaculture Genomics and Bioinformacs 47

2.16 Paste the copied content in the ciba-1.arp file, replacing the [Data] content
2.17 Edit the ‘Structure’ content

2.17.1 # (Enter the title between inverted commas) (e.g.)

StructureName = "Fish-India"

2.17.2 # Number of groups (enter 1, 2, 3, etc., as per the number of groups one has to make) (e.g.)

NbGroups = 1

2.17.3 # Define hereafter the structure of the first group; mention the names of all the populations that belong to the specific group. Every population name should be within inverted commas. (e.g.)

Group ={ "C051"
"K15"
"MNI01"
"P094"
"Q02"
}

2.17.4 After editing, save and close the file.

3 Analyzing the data

3.1 Using the ‘Open project’ box, open the file ‘ciba-1.arp’
3.2 Choose the required analysis from the ‘Settings’
3.3 Click on the ‘Start’ button to start the analysis
3.4 View the results generated in the folder (project file name with the .res extension) ‘ciba-1.res’.

14. So Compung techniques in Bioinformacs

P. Mahalakshmi

INTRODUCTION

The exponenal growth of the amount of biological data available raises two problems: on one

hand, ecient informaon storage and management and, on the other hand, the extracon of

useful informaon from these data. The second problem is one of the main challenges in computaonal

biology, which requires the development of tools and methods capable of transforming all these

heterogenerous data into biological knowledge about the underlying mechanism. These tools

and methods should allow us to go beyond a mere descripon of the data and provide knowledge

in the form of testable models. By this simplifying abstracon that constutes a model, we will be

able to obtain predicons of the system. There are several biological domains where so compung

techniques are applied for knowledge extracon from data.

Applicaon of so compung becomes relevant for solving some Bioinformacs and molecular

biology problems. Development in so compung method reveal the high principles of technology,

algorithms, and tools in bioinformacs for enthusiasc reason such as dependable and parallel

genome sequencing, fast sequence comparison, search in databases, mechanical gene idencaon,

ecient modeling and storage of mixed data, etc. Protein classicaon leads to idencaon and

proper funconal assignment of uncharacterized proteins with a nal goal towards nding homologies

and drug discovery. Again, structure based ligand design is one of the crucial steps in raonal drug

discovery, where a small molecule is designed by targeng the structure and biochemical properes

of the target.

The applicaon of so compung oers an on promising approach to achieve ecient and reliable

heurisc soluon. On the other side the incessant development of high quality biotechnology,

e.g. micro-array techniques and mass spectrometry, which provide complex paerns for the direct

characterizaon of cell processes, oers further promising opportunies for advanced research in

bioinformacs. So one important sub-discipline within bioinformacs involves the development of

new algorithms and models to extract new, and potenally useful informaon from various types of

biological data including DNA(nucleode sequences) and proteins (amino acid sequences). Analysis

of these macromolecules is performed both structurally and funconally using the major components

of so compung like Fuzzy Sets (FS), Arcial Neural Networks (ANN), Evoluonary Algorithms

(EAs) (including genec algorithms (GAs), Rough Sets (RS), Swarm Opmizaon (SO) etc. This lecture

notes aempts to describe the fuzzy logic, Arcial Neural Networks and genec algorithm and its

applicaons in bioinformacs.

NEED OF SOFT COMPUTING IN BIOINFORMATICS

The different tasks involved in the analysis of biological data include sequence alignment, genomics, proteomics, DNA and protein structure prediction, gene/promoter identification, phylogenetic analysis, analysis of gene expression data, protein folding, docking, and molecule and drug design. Data analysis tools used earlier in bioinformatics were mainly based on statistical techniques like regression and estimation. Soft computing can be used to handle large, complex and inherently uncertain biological data sets in a robust and computationally efficient manner; fuzzy sets (a soft computing technique), for example, can be used as a natural framework for analysing them. Most bioinformatics tasks involve search and optimization of different criteria (like energy, alignment score, overlap strength), while requiring robust, fast and close approximate solutions.

Missing and noisy data is one characteristic of biological data. Conventional computing techniques fail to handle this, whereas soft computing based techniques are able to deal with missing and noisy data. Because soft computing is designed to handle vagueness, uncertainty and near-optimality in large and complex search spaces, the use of soft computing tools for solving bioinformatics problems has gained the attention of researchers. Most of the research is woven around the tasks of pattern recognition and data mining, such as clustering, classification, feature selection and rule generation. While classification pertains to supervised or unsupervised learning, clustering corresponds to unsupervised self-organization into homologous partitions.

In molecular biology research, new data and concepts are generated every day, and those new data and concepts update or replace the old ones. Soft computing can be easily adapted to a changing environment. This benefits system designers, as they do not need to re-design systems whenever the environment changes. Moreover, since many of the problems involve multiple conflicting objectives, application of soft computing multi-objective optimization algorithms like multi-objective genetic algorithms appears to be natural and appropriate. Soft computing techniques, either individually or in a hybridized manner, can be used for analyzing biological data in order to extract more meaningful information and insights from them.

With advances in biotechnology, huge volumes of biological data are generated. In addition, it is possible that important hidden relationships and correlations exist in the data. Soft computing methods are designed to handle very large data sets, and can be used to extract such relationships.

FUZZY LOGIC AND ITS APPLICATION IN BIOINFORMATICS

Fuzzy Sets and Linguistic Variables

A fuzzy set is an extension of a crisp set. Crisp sets allow only full membership or no membership at all, whereas fuzzy sets allow partial membership. In a crisp set, membership or non-membership of element $x$ in set $A$ is described by a characteristic function $\mu_A(x)$, where $\mu_A(x) = 1$ if $x \in A$ and $\mu_A(x) = 0$ if $x \notin A$. Fuzzy set theory extends this concept by defining partial membership, where $\mu_A(x) = 1$ if $x$ fully belongs to $A$; $\mu_A(x) = 0$ if $x$ does not belong to $A$; and $0 < \mu_A(x) < 1$ if $x$ partially belongs to $A$. Mathematically, a fuzzy set $A$ on a universe of discourse $U$ is characterized by a membership function $\mu_A(x)$ that takes values in the interval $[0, 1]$ and can be defined as $\mu_A : U \to [0, 1]$. Fuzzy sets represent commonsense linguistic labels such as suitable, moderate, unsuitable, slow, very slow, fast, etc. A given element can be a member of more than one fuzzy set at a time. A fuzzy set $A$ in $U$ may be represented as a set of ordered pairs, each consisting of a generic element $x$ and its grade of membership; that is, $A = \{(x, \mu_A(x)) \mid x \in U\}$. $x$ is called a support value if $\mu_A(x) > 0$ (Zadeh, 1965). The concept of a linguistic variable plays an important role in fuzzy logic. A linguistic variable is a variable whose values are expressed in words or sentences in natural language. For each input and output variable, fuzzy sets are created by dividing its universe of discourse into a number of sub-regions, which are named as linguistic variables (Zimmermann, 1996).


Membership Functions

Although both classical and fuzzy subsets are defined by membership functions, the degree to which an element belongs to a classical subset is limited to being either zero or one. This means that the membership function may only be a step function (Figure 6.1a). On the other hand, in fuzzy logic, a membership function (MF) is essentially a curve that defines how each point in the input space is mapped to a membership value (or degree of membership) between 0 and 1.

Membership function for (a) crisp set and (b) fuzzy set

The membership functions are usually defined for inputs and outputs in terms of linguistic variables. Various types of membership functions are used, such as triangular, trapezoidal, bell, Gaussian and sigmoid functions. In designing a fuzzy inference system, membership functions are associated with term sets that appear in the antecedent or consequent of rules. Many researchers have used different techniques for determining membership functions, such as fuzzy clustering, neural networks, and genetic algorithms.
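The difference between a crisp characteristic function and the fuzzy membership functions named above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the parameter conventions and the "salinity" example values are assumptions made here, not part of the original notes.

```python
import math


def crisp(x, lo, hi):
    """Characteristic function of the crisp interval [lo, hi]: 0 or 1 only."""
    return 1.0 if lo <= x <= hi else 0.0


def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def gaussian(x, mean, sigma):
    """Gaussian MF centred on mean with spread sigma (sigma > 0)."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


# Hypothetical term "moderate" of a linguistic variable "salinity" (ppt).
for s in (10, 15, 20, 25, 30):
    print(s, crisp(s, 15, 25), triangular(s, 10, 20, 30),
          round(gaussian(s, 20, 5), 3))
```

The crisp function jumps between 0 and 1 at the interval boundaries, while the fuzzy functions change gradually, which is what allows partial membership.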

Fuzzy Inference System

Fuzzy Inference Systems (FIS) incorporate an expert's experience into the system design and are composed of four blocks. A FIS comprises a fuzzifier that transforms the 'crisp' inputs into fuzzy inputs by membership functions that represent fuzzy sets of input vectors, a knowledge base that includes the information given by the expert in the form of linguistic fuzzy rules, an inference engine that uses the fuzzy inputs together with the knowledge base for inference by a method of implication and aggregation, and a defuzzifier that transforms the fuzzy results of the inference into a crisp output using a defuzzification method.

Fuzzy Inference System


The knowledge base comprises two components: a database, which defines the membership functions of the fuzzy sets used in the fuzzy rules, and a rule base comprising a collection of linguistic rules that are joined by a specific operator. Based on the consequent type of fuzzy rules, there are two common types of FIS, which vary according to differences between the specifications of the consequent part (Equations 1 and 2). The first fuzzy system uses the inference method proposed by Mamdani, in which the rule consequence is defined by fuzzy sets and has the following structure

IF x is A and y is B THEN z is C (1)

The second fuzzy system, proposed by Takagi, Sugeno and Kang (TSK), contains an inference engine in which the conclusion of a fuzzy rule comprises a constant (equation 2 a) or a weighted linear combination of the crisp inputs (equation 2 b) rather than a fuzzy set. A fuzzy rule for the zero-order Sugeno method is of the form

IF x is A and y is B THEN z = C (2 a)

where A and B are fuzzy sets in the antecedent and C is a constant. The first-order Sugeno model has rules of the form

IF x is A and y is B THEN z = px+qy+r (2 b)

where A and B are fuzzy sets in the antecedent and p, q, and r are constants.
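A minimal Python sketch of Sugeno-type rules of the forms (2 a) and (2 b) is given below. It assumes that the AND in the antecedent is evaluated with the minimum operator and that the rule outputs are combined as a firing-strength-weighted average, which is the usual convention for TSK systems but is not spelled out in these notes. The membership functions and rule constants are invented for illustration.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def sugeno(x, y, rules):
    """Evaluate rules of the form IF x is A and y is B THEN z = p*x + q*y + r.

    Each rule is (mf_A, mf_B, (p, q, r)); a zero-order rule has p = q = 0.
    Returns the weighted average of the rule outputs, or None if no rule fires.
    """
    numerator = 0.0
    denominator = 0.0
    for mf_a, mf_b, (p, q, r) in rules:
        w = min(mf_a(x), mf_b(y))      # firing strength (AND as minimum)
        z = p * x + q * y + r          # crisp consequent of this rule
        numerator += w * z
        denominator += w
    return numerator / denominator if denominator > 0 else None


# Hypothetical fuzzy sets "low" and "high" on inputs ranging from 0 to 10.
def low(v):
    return tri(v, -10, 0, 10)


def high(v):
    return tri(v, 0, 10, 20)


RULES = [
    (low, low, (0.0, 0.0, 1.0)),     # zero-order rule: z = 1
    (high, high, (0.5, 0.5, 0.0)),   # first-order rule: z = 0.5x + 0.5y
]

print(sugeno(5, 5, RULES))
```

Because the consequents are already crisp numbers, no separate defuzzification of an output fuzzy set is needed in this type of system.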

Fuzzy Inference Process

The inference process for evaluating the system needs five steps

Fuzzy Inference Process

The rst step in evaluang the output of a FIS is to apply the inputs and determine the degree

to which they belong to each of the fuzzy sets via membership funcon (Figure 6.5). This is required

in order to acvate rules that are in terms of linguisc variables. Once membership funcons are

dened, fuzzicaon takes a real me input value and compares it with the stored membership

funcon to produce fuzzy input values. In order to perform this mapping, we can use fuzzy sets of any

shape, such as triangular, Gaussian, π-shaped, etc.

A fuzzy rule base contains a set of fuzzy rules $R$. For a multi-input, single-output system it is represented as

$$R = (R_1, R_2, \ldots, R_n)$$

where $R_i$ can be represented as

$$R_i:\ \text{if } (x_1 \text{ is } T_{x_1} \text{ and } \ldots \text{ and } x_m \text{ is } T_{x_m}) \text{ then } (y \text{ is } T_y)$$

In this rule, the $m$ preconditions of $R_i$ form a fuzzy set $(T_{x_1} \times T_{x_2} \times \cdots \times T_{x_m})$, and the consequent is a single output. Generally, an if-then rule can be interpreted by the following three steps:

Resolve all fuzzy statements in the antecedent to a degree of membership between 0 and 1.

If the rule has more than one antecedent, the fuzzy operator is applied to obtain one number that represents the result of applying that rule. This is called the firing strength or weight factor of that rule. For example, consider an $i$th rule that has two parts in the antecedent:

$$R_i:\ \text{if } (x_1 \text{ is } T_{x_1} \text{ and } x_2 \text{ is } T_{x_2}) \text{ then } (y \text{ is } T_y)$$

Then the weight factor can be defined using either intersection operators or product operators:

$$\alpha_i = \min\bigl(\mu_{T_{x_1}}(x_1),\ \mu_{T_{x_2}}(x_2)\bigr)$$

$$\alpha_i = \mu_{T_{x_1}}(x_1)\,\mu_{T_{x_2}}(x_2)$$

The weight factor is used to shape the output fuzzy set that represents the consequent part of the rule.

The implication method is defined as the shaping of the consequent, which is the output fuzzy set, based on the antecedent. The input for the implication process is a single number given by the antecedent, and the output is a fuzzy set. Minimum and product are two commonly used methods, which are represented respectively by

$$\mu_{T'_y}(o) = \min\bigl(\alpha_i,\ \mu_{T_y}(o)\bigr)$$

$$\mu_{T'_y}(o) = \alpha_i\,\mu_{T_y}(o)$$

where $\mu_{T_y}$ is the membership function of the consequent, $\mu_{T'_y}$ is that of the shaped output fuzzy set, and $o$ is the variable that represents the support value of the membership function.

Aggregation takes all truncated or modified output fuzzy sets obtained as the output of the implication process and combines them into a single fuzzy set. The output of the aggregation process is a single fuzzy set that represents the output variable. The aggregated output is used as the input to the defuzzification process. Aggregation occurs only once for each output variable. Since the aggregation method is commutative, the order in which the rules are executed is not important. The commonly used aggregation method is the max method, which for two shaped output sets $\mu_1(o)$ and $\mu_2(o)$ can be defined as follows:

$$\mu(o) = \max\bigl(\mu_1(o),\ \mu_2(o)\bigr)$$

The defuzzifier maps output fuzzy sets into a crisp number. Defuzzification can be performed by several methods, such as center of gravity, center of sums, center of the largest area, first of the maxima, middle of the maxima, maximum criterion and height defuzzification. Of these, center of gravity (centroid method) and height defuzzification are the methods commonly used. The centroid defuzzification method finds the center point of the solution fuzzy region by calculating the weighted mean of the output fuzzy region. It is the most widely used technique because the defuzzified values tend to move smoothly around the output fuzzy region.
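The inference steps described in this section (fuzzification, firing strength, implication, aggregation and centroid defuzzification) can be put together in a short Mamdani-style sketch. The code below assumes minimum for AND and for implication, maximum for aggregation, and a discretized output universe for the centroid; the fuzzy sets, rule base and variable names are hypothetical and chosen only to keep the example small.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def mamdani(x1, x2, rules, universe):
    """Evaluate two-input Mamdani rules and return the centroid output.

    Each rule is (mf_x1, mf_x2, mf_out). Returns None if no rule fires.
    """
    aggregated = [0.0] * len(universe)
    for mf1, mf2, mf_out in rules:
        alpha = min(mf1(x1), mf2(x2))                    # firing strength
        for k, o in enumerate(universe):
            clipped = min(alpha, mf_out(o))              # min implication
            aggregated[k] = max(aggregated[k], clipped)  # max aggregation
    total = sum(aggregated)
    if total == 0:
        return None
    # Centroid: membership-weighted mean of the output universe.
    return sum(o * m for o, m in zip(universe, aggregated)) / total


# Hypothetical input sets on 0..10 and output sets "poor"/"good" on 0..10.
def low(v):
    return tri(v, -10, 0, 10)


def high(v):
    return tri(v, 0, 10, 20)


def poor(o):
    return tri(o, -5, 0, 5)


def good(o):
    return tri(o, 5, 10, 15)


RULES = [(low, low, poor), (high, high, good)]
UNIVERSE = [k * 0.1 for k in range(101)]   # output values 0.0, 0.1, ..., 10.0

print(mamdani(5, 5, RULES, UNIVERSE))
```

Swapping `min(alpha, mf_out(o))` for `alpha * mf_out(o)` gives product implication instead of minimum implication.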


Fuzzy Logic in Bioinformatics

Fuzzy systems have been successfully applied in several practical areas, for example in building knowledge-based systems and fuzzy logic-based and fuzzy rule-based models. They can control and analyze processes, diagnose, and make decisions in biomedical sciences. There are many application areas in biomedical science and bioinformatics where fuzzy logic techniques [10] can be applied successfully. Some of the important uses of fuzzy logic are listed below:

Ø Increasing flexibility of protein motifs.

Ø Studying differences between various polynucleotides.

Ø Analyzing experimental expression data using fuzzy adaptive resonance theory.

Ø Aligning sequences based on a fuzzy dynamic programming algorithm.

Ø Mathematical modeling of complex traits influenced by genes with fuzzy-valued effects in pedigreed populations.

Ø Finding cluster membership values of genes by applying a fuzzy partitioning method using fuzzy C-Means and fuzzy c-hard mean algorithms.

Ø Generating DNA sequencing using genetic fuzzy and neuro-fuzzy systems by anticipating disturbances due to intangible parameters.

Ø Identifying clusters of genes from microarray data.

Ø Predicting proteins' sub-cellular locations using the fuzzy k-nearest neighbours algorithm.

APPLICATION OF ARTIFICIAL NEURAL NETWORK

An Artificial Neural Network (ANN) is an information processing model that is able to capture and represent complex input-output relationships. The motivation for the development of the ANN technique came from a desire for an intelligent artificial system that could process information in the same way as the human brain. Its structure is represented as multiple layers of simple processing elements, operating in parallel to solve specific problems. ANNs resemble the human brain in two respects: the learning process and the storing of experiential knowledge. An artificial neural network learns and classifies problems through repeated adjustments of the connecting weights between the elements. Several learning strategies are used in bioinformatics: supervised learning, unsupervised learning and reinforcement learning.
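The idea of learning through repeated adjustment of connecting weights can be illustrated with the simplest possible network, a single perceptron trained in a supervised manner. The sketch below is a toy example under stated assumptions: the task (labelling short DNA strings as GC-rich or not), the two count features, the training sequences and the learning rate are all invented here for illustration, and sequences are assumed to contain only A, C, G and T.

```python
def features(seq):
    """Return [number of G/C bases, number of other bases] for a DNA string."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    return [gc, len(seq) - gc]


def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0


def train(samples, epochs=20, lr=0.1):
    """Perceptron rule: nudge weights by lr * error * input after each mistake."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for seq, label in samples:
            x = features(seq)
            err = label - predict(weights, bias, x)
            if err:
                errors += 1
                weights = [w + lr * err * xi for w, xi in zip(weights, x)]
                bias += lr * err
        if errors == 0:        # every training example is classified correctly
            break
    return weights, bias


# Hypothetical labelled examples: 1 = GC-rich, 0 = AT-rich.
SAMPLES = [("GGCC", 1), ("GCGA", 1), ("ATAG", 0), ("ATTA", 0)]

WEIGHTS, BIAS = train(SAMPLES)
for seq in ("CCGT", "TTAC"):
    print(seq, predict(WEIGHTS, BIAS, features(seq)))
```

Practical ANNs stack many such units in layers and use gradient-based training, but the principle of adjusting weights in response to errors on labelled examples is the same.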

An ANN learns from examples and generalizes the learning beyond the examples supplied. The methodology of modeling or estimation is somewhat comparable to statistical modeling. Neural networks should not, however, be heralded as a substitute for statistical modeling, but rather as a complementary effort (without the restrictive assumption of a particular statistical model) or an alternative approach to fitting non-linear data. Neural networks have been widely used in biology since the early 1990s. Some of the important applications of ANNs are listed below:

Ø Prediction of translation initiation sites in DNA sequences and proteins.

Ø Explain the theory of artificial neural networks using applications in biology.

Ø Predict immunologically interesting peptides by combining with an evolutionary algorithm.

Ø Carry out pattern classification and signal processing successfully in bioinformatics.

Ø Perform protein sequence classification.

Ø Predict protein secondary structure.

GENETIC ALGORITHMS IN BIOINFORMATICS

The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. GAs are applied to certain multi-objective problems in bioinformatics, where they reduce computational requirements and yield robust, fast and close approximate solutions.

GAs are executed iteratively on a population of coded solutions using three basic biologically inspired operators: selection/reproduction, crossover, and mutation. They use objective function information and probabilistic transition rules for moving to the next iteration. GAs are generally based on manipulating populations of bit-strings using both crossover and point-wise mutation.
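The three operators can be seen at work in a small bit-string GA. The following is a minimal sketch, not a reference implementation: the objective (matching a fixed target pattern), tournament selection, one-point crossover, the mutation rate, and keeping the best individual unchanged in each generation (elitism) are all choices made here for illustration.

```python
import random


def fitness(bits, target):
    """Objective function: number of positions that agree with the target."""
    return sum(b == t for b, t in zip(bits, target))


def tournament(pop, target, rng, k=3):
    """Selection: pick k random individuals and return the fittest."""
    return max(rng.sample(pop, k), key=lambda ind: fitness(ind, target))


def crossover(a, b, rng):
    """One-point crossover: head of parent a joined to tail of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rng, rate):
    """Point-wise mutation: flip each bit independently with probability rate."""
    return [1 - b if rng.random() < rate else b for b in bits]


def run_ga(target, pop_size=30, generations=40, rate=0.02, seed=0):
    """Run the GA and return (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    n = len(target)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(pop, key=lambda ind: fitness(ind, target))
        history.append(fitness(best, target))
        next_pop = [best[:]]                     # elitism: keep the best as is
        while len(next_pop) < pop_size:
            child = crossover(tournament(pop, target, rng),
                              tournament(pop, target, rng), rng)
            next_pop.append(mutate(child, rng, rate))
        pop = next_pop
    best = max(pop, key=lambda ind: fitness(ind, target))
    history.append(fitness(best, target))
    return best, history


# Hypothetical 16-bit target pattern standing in for an encoded solution.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]

if __name__ == "__main__":
    best, history = run_ga(TARGET)
    print(best, history[0], history[-1])
```

In a real bioinformatics problem the bit-string would encode, for example, an alignment or a set of selected features, and the fitness function would score that candidate solution.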

Some of the important applications of GAs are listed below:

Ø Alignment and comparison of DNA, RNA, and protein sequences.

Ø Gene mapping on chromosomes.

Ø RNA structure prediction.

Ø Protein structure prediction and clustering.

Ø Molecular design and molecular docking.

Ø Gene finding and promoter identification from DNA sequences.

Ø Interpretation of gene expression and microarray data.

Ø Gene regulatory network identification.

Ø Construction of phylogenetic trees for studying evolutionary relationships.

Ø DNA structure prediction.


15. RNAseq data analysis – Genome-guided

K. Karthic

1. Introduction

The transcriptomic profile of an organism at any given time or condition gives the set of all its transcripts and their quantities present at that specific time point or condition. The transcriptome reveals a great deal about the functional aspects of the genome as well as the different kinds of biomolecules present within the cell or tissue. It is also very useful for studying the genetics behind growth, development and disease.

This tutorial describes how to analyse RNA-seq data when a reference genome is available and the steps involved in identifying differentially expressed genes between two groups. For the purpose of demonstration, we have chosen an experiment conducted on Arabidopsis thaliana.

1.1. Input files

1. Reference genome in fasta format

2. RNA-seq raw data for two groups in replicates in fastq format

1.2. Software requirements

1. Bowtie2

2. Tophat

3. Cufflinks (and associated cuffdiff and cuffmerge)

4. cummeRbund (an R package for visualizing the results)

2. Methodology

2.1. Fetching Raw data

(To save time, the raw data has already been downloaded and kept in the respective folders. So steps 1 to 15 are to be skipped here)

1. Open terminal and create new directories in your account

mkdir Athaliana

cd Athaliana

mkdir Ref_genome_raw

mkdir Transcriptome_raw

2. Go to the Assembly database in NCBI [https://www.ncbi.nlm.nih.gov/assembly/], type TAIR10 in the search bar and click search.

3. The summary of the Arabidopsis thaliana assembly is displayed. Click on the Download Assemblies button, select Genomic fasta in the drop-down menu and click download.

4. The genome downloads as a .tar file; copy the file to the Ref_genome_raw folder.

5. Go to terminal and inside the Athaliana folder, type the following commands

cd Ref_genome_raw

tar xvf genome_assemblies.tar

6. A new folder is created with a name similar to ncbi-genomes-2018-08-22. Go to the terminal again and type the following commands.

cd ncbi-genomes-2018-xx-xx

gunzip GCF_000001735.4_TAIR10.1_genomic.fna.gz

ls -l

7. Now you can see the listing of files, including the fasta file of our genome and its corresponding file size.

8. Go to the terminal again and type the following command to copy and save our genome file under a different name and extension

cat GCF_000001735.4_TAIR10.1_genomic.fna > AraTha.fa

9. Now you can see our reference genome saved as AraTha.fa

10. To download RNA-seq data, go to the Sequence Read Archive (SRA) database of NCBI [http://www.ncbi.nlm.nih.gov/sra] and type the experiment accession numbers SRR671946, SRR671947, SRR671948 and SRR671949 one after the other in the search bar and click search.

11. A summary of the experiment is displayed; scroll down and click on the link displayed below the run.

12. A summary of the experiment on A. thaliana root treated with KCl (replicate data) is displayed. Go to the downloads tab and click on the FASTA/FASTQ link.


13. In the displayed page, type the experiment number and click show runs

14. Select FASTQ and click download.

15. Repeat steps 12, 13 and 14 for all four experiment runs (SRR671946, SRR671947,

SRR671948 and SRR671949).

16. Copy the downloaded fastq files to the folder Transcriptome_raw.

17. Go to the terminal and change the directory to Transcriptome_raw

18. Inside the Transcriptome_raw directory, you should have fastq files in zipped format for all four experiment runs. Go to the terminal and type the command below to unzip the files.

for i in *.gz; do gunzip "$i"; done

19. With this we have downloaded all our raw data required for our analysis.

2.2. Data analysis

In this section, how to run bowtie2, tophat, cufflinks, cuffmerge and cuffdiff for analyzing the transcriptome data is described. The software installations are not described; please refer to the respective manuals for the same.

2.2.1. Indexing genome using bowtie2

1. Go to the terminal, change to the directory Ref_genome_raw/ncbi-genomes-2018-xx-xx and type the following command:

bowtie2-build AraTha.fa AraTha

2. The above command will create bowtie2 index files with the .bt2 extension

2.2.2. Running tophat

1. tophat will align the RNA-seq data to our bowtie2-indexed genome. To do so, type the following commands in the terminal, each tophat command on a single line. The index prefix must point to the directory in which bowtie2-build was run.

cd /home/user/Athaliana

mkdir analysis

cd analysis

tophat -o SRR671946_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671946.fastq

tophat -o SRR671947_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671947.fastq

tophat -o SRR671948_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671948.fastq

tophat -o SRR671949_topout /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha /home/user/Athaliana/Transcriptome_raw/SRR671949.fastq

2. The -o option (for example -o SRR671949_topout) names the output folder. For each run, a folder is created with the following files: accepted_hits.bam, align_summary.txt, deletions.bed, insertions.bed, junctions.bed, prep_reads.info, unmapped.bam and a logs folder.

3. The accepted_hits.bam is the main result file containing the mapped results in binary format.

2.2.3. Running cufflinks

1. From the alignment files generated by tophat, we can assemble the transcripts using cufflinks.

2. In the terminal, type the following commands one after another, each on a single line.

cufflinks -o SRR671946_cufflinksout /home/user/Athaliana/analysis/SRR671946_topout/accepted_hits.bam

cufflinks -o SRR671947_cufflinksout /home/user/Athaliana/analysis/SRR671947_topout/accepted_hits.bam

cufflinks -o SRR671948_cufflinksout /home/user/Athaliana/analysis/SRR671948_topout/accepted_hits.bam

cufflinks -o SRR671949_cufflinksout /home/user/Athaliana/analysis/SRR671949_topout/accepted_hits.bam

3. For each run, the designated output directory will contain the following files: genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf. The assembled transcripts are contained in transcripts.gtf.
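Because the tophat and cufflinks commands differ only in the run accession, they can be generated programmatically instead of being typed four times. The sketch below only builds and prints the command lines; it does not run the tools. The helper names are hypothetical, and the index prefix and directory paths are passed in as arguments because they depend on where the index and fastq files actually sit on your system.

```python
import shlex
from pathlib import PurePosixPath

RUNS = ["SRR671946", "SRR671947", "SRR671948", "SRR671949"]


def tophat_cmd(run, index_prefix, fastq_dir):
    """Argument list for aligning one run with tophat (output in <run>_topout)."""
    fastq = PurePosixPath(fastq_dir) / f"{run}.fastq"
    return ["tophat", "-o", f"{run}_topout", str(index_prefix), str(fastq)]


def cufflinks_cmd(run, analysis_dir):
    """Argument list for assembling transcripts from one tophat alignment."""
    bam = PurePosixPath(analysis_dir) / f"{run}_topout" / "accepted_hits.bam"
    return ["cufflinks", "-o", f"{run}_cufflinksout", str(bam)]


if __name__ == "__main__":
    base = PurePosixPath("/home/user/Athaliana")
    # Assumed location of the bowtie2 index; adjust to your own folder name.
    index_prefix = base / "Ref_genome_raw" / "ncbi-genomes-2018-xx-xx" / "AraTha"
    for run in RUNS:
        print(shlex.join(tophat_cmd(run, index_prefix, base / "Transcriptome_raw")))
        print(shlex.join(cufflinks_cmd(run, base / "analysis")))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` from inside the analysis directory if you prefer to launch the tools from Python rather than from the shell.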

2.2.4. Running cuffmerge

1. cuffmerge will merge the transcripts into a comprehensive transcriptome.

2. Open a text editor, and type the paths of the transcript files as below, one per line:

/home/user/Athaliana/analysis/SRR671946_cufflinksout/transcripts.gtf

/home/user/Athaliana/analysis/SRR671947_cufflinksout/transcripts.gtf

/home/user/Athaliana/analysis/SRR671948_cufflinksout/transcripts.gtf

/home/user/Athaliana/analysis/SRR671949_cufflinksout/transcripts.gtf

and save the file as assembled_transcripts.txt

3. In the terminal, type the following command on a single line

cuffmerge -s /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha.fa assembled_transcripts.txt

4. The successful run creates a merged_asm directory, which contains a logs directory and a file containing the information of the merged transcripts called merged.gtf.
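The list of transcript files that cuffmerge reads in step 2 can also be written by a short script, which avoids typing errors in the paths. This is a small sketch under the assumption that the cufflinks output folders follow the naming used in this tutorial; the function names are made up for the example.

```python
from pathlib import Path

RUNS = ["SRR671946", "SRR671947", "SRR671948", "SRR671949"]


def manifest_lines(analysis_dir, runs):
    """One transcripts.gtf path per run, in the order the runs are given."""
    return [f"{analysis_dir}/{run}_cufflinksout/transcripts.gtf" for run in runs]


def write_manifest(analysis_dir, runs, manifest_path):
    """Write the path list to manifest_path and return the lines written."""
    lines = manifest_lines(analysis_dir, runs)
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Print the lines only; call write_manifest(...) to create the file.
    for line in manifest_lines("/home/user/Athaliana/analysis", RUNS):
        print(line)
```

Calling `write_manifest("/home/user/Athaliana/analysis", RUNS, "assembled_transcripts.txt")` from the analysis directory would produce the same file as the text-editor step.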

2.2.5. Running cuffdiff

1. cuffdiff is used to test for differential gene expression between conditions. Go to the terminal and type the following command on a single line. Replicates of the same condition are separated by commas, and the two conditions are separated by a space.

cuffdiff -o diff_result -b /home/user/Athaliana/Ref_genome_raw/ncbi-genomes-2018-xx-xx/AraTha.fa -L Root_Kcl_control,Root_KNO3_treatment -u merged_asm/merged.gtf /home/user/Athaliana/analysis/SRR671946_topout/accepted_hits.bam,/home/user/Athaliana/analysis/SRR671947_topout/accepted_hits.bam /home/user/Athaliana/analysis/SRR671948_topout/accepted_hits.bam,/home/user/Athaliana/analysis/SRR671949_topout/accepted_hits.bam

2. The successful run creates a directory diff_result in the working directory. The directory contains a number of different files and databases, listed as follows:

bias_params.info cds_exp.diff

genes.fpkm_tracking isoforms.count_tracking

promoters.diff splicing.diff

tss_groups.fpkm_tracking cds.count_tracking

cds.fpkm_tracking gene_exp.diff

genes.read_group_tracking isoforms.fpkm_tracking

read_groups.info tss_group_exp.diff

tss_groups.read_group_tracking cds.diff

cds.read_group_tracking genes.count_tracking

isoform_exp.diff isoforms.read_group_tracking

run.info tss_groups.count_tracking

var_model.info

3. The fpkm tracking files give FPKM values of primary transcripts (tss_groups.fpkm_tracking), genes (genes.fpkm_tracking), coding sequences (cds.fpkm_tracking), and transcripts (isoforms.fpkm_tracking).

4. The count tracking files give the number of fragments for each gene (genes.count_tracking), transcript (isoforms.count_tracking), primary transcript (tss_groups.count_tracking) and coding sequence (cds.count_tracking).

5. The read group tracking files contain information on the counts of genes, transcripts and primary transcripts, grouped by replicates.

6. The diff files ending with 'exp.diff' contain information on the differential expression tests performed on the genes (gene_exp.diff), primary transcripts (tss_group_exp.diff), transcripts (isoform_exp.diff), and coding sequences (cds_exp.diff).
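The gene_exp.diff file is a tab-separated table, so the significant genes can also be pulled out with a few lines of Python. The sketch below assumes the file has a header row that includes the columns gene_id and significant, with significant set to yes or no, as in the cuffdiff output this tutorial later filters in R. The two data rows are invented toy values used only to show the filtering, not real results.

```python
import csv
import io


def significant_genes(handle):
    """Return the rows of a cuffdiff *_exp.diff table whose test is significant."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row.get("significant") == "yes"]


# Toy stand-in for gene_exp.diff (invented values, assumed column layout).
SAMPLE = (
    "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\t"
    "value_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\t"
    "significant\n"
    "XLOC_000001\tXLOC_000001\t-\tchr1:100-900\tRoot_Kcl_control\t"
    "Root_KNO3_treatment\tOK\t10.0\t11.0\t0.14\t0.3\t0.60\t0.90\tno\n"
    "XLOC_000002\tXLOC_000002\t-\tchr1:2000-3500\tRoot_Kcl_control\t"
    "Root_KNO3_treatment\tOK\t5.0\t40.0\t3.0\t6.1\t0.00005\t0.001\tyes\n"
)

hits = significant_genes(io.StringIO(SAMPLE))
for row in hits:
    print(row["gene_id"], row["log2(fold_change)"], row["q_value"])
```

With a real run, the same function can be applied to an open file handle, for example `open("diff_result/gene_exp.diff", newline="")`.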


3. Results

3.1. Running cummeRbund

1. cummeRbund is an R package used to visualise the results in different plots.

2. Start an R session. In R, go to your working directory and copy the diff_result folder to it.

3. Type the following commands in R

> library('cummeRbund')

> cuffdata <- readCufflinks('diff_result')

> cuffdata

4. The above commands will print a result similar to the one below

CuffSet instance with:

2 samples

33318 genes

42109 isoforms

34957 TSS

32921 CDS

33318 promoters

34957 splicing

27174 relCDS

5. To obtain a density plot showing the expression levels for each sample, type the command below:

> csDensity(genes(cuffdata))

6. To obtain a volcano plot showing the differentially expressed genes across the two samples, type the command below:

> csVolcano(genes(cuffdata), 'Root_Kcl_control', 'Root_KNO3_treatment')

7. To obtain a scatter plot showing the differentially expressed genes across the two samples, type the command below:

> csScatter(genes(cuffdata), 'Root_Kcl_control', 'Root_KNO3_treatment')

8. To print a table displaying the details of all the differentially expressed genes, type the following commands.

> gene_diff_data <- diffData(genes(cuffdata))

> sig_gene_data <- subset(gene_diff_data, (significant == 'yes'))

> nrow(sig_gene_data)

> write.table(sig_gene_data, 'diff_genes.txt', sep = '\t', row.names = F, col.names = T, quote = F)

> sig_gene_data


The last command prints out a table containing the details of all the differentially expressed genes. The screenshot of the sample output is below:

In this chapter, we described how to download whole genome and transcriptome raw data from NCBI databases. The software used in this tutorial was briefly introduced, and the same tools were then used to demonstrate how to index a whole genome, align reads to a reference genome, estimate transcript abundance and identify differentially expressed genes. Finally, the results were interpreted visually.


16. Application of "OMICS" research in aquaculture with special reference to penaeids

Gopikrishna, G., Vinaya Kumar, K., Shashi Shekhar, M. and Vijayan, K.K.

Introduction

The term OMICS refers informally to fields of study in biology such as genomics, transcriptomics, proteomics and metabolomics. Genomics is the study of the genomes of organisms, transcriptomics is the study of transcriptomes, and so on. Conventional genetic improvement programmes rely mostly on phenotypic values, which are converted to breeding values on which selection is carried out. In plants as well as livestock, application of 'omics' has revealed interesting insights into genetic and functional biology. When these are integrated within selective breeding programmes, significant improvements in productivity have been obtained (Dekkers, 2012; Perez-de-Castro et al 2012). Omics approaches have been applied widely to elucidate the molecular basis of performance traits (e.g. growth) and to overcome poorly understood biological constraints that impede efficient production (disease, reproductive failure, etc.) (Rothschild and Plastow 2008, Taylor et al 2016).

As far as livestock and plants are concerned, omics has had a transformational effect, as observed by Agrawal and Narayan (2015), Van Emon (2015) and Taylor et al (2016). In the aquaculture sector, the application of selective breeding programmes has proceeded at a snail's pace, and it has been suggested that world aquaculture production could be doubled in a period of 13 years if breeding programmes were supplying stocks for the farmed species (Gjedrem and Rye, 2016). Less than 10% of aquaculture production is derived from improved lines (Gjedrem et al 2012). From these facts it is clear that omics resources in aquatic species need to be developed at a faster pace so that they can be used in selective breeding programmes to hasten genetic response.

Crustaceans form a substantial aquaculture commodity globally. The global penaeid aquaculture industry has exhibited remarkable growth, and in 2015 production stood at 4.8 million tons (FAO, 2017). Penaeids are an important aquaculture resource the world over, and selective breeding programmes are necessary so that improved stocks can be generated and farmed. It is well known that the Pacific white shrimp, owing to the ease with which it reproduces, has been subjected to selective breeding, and genetically improved stocks are very much in demand. Information generated from the genomes of shrimp can go a long way in aiding genetic improvement programmes so that the gains are realised at a much faster rate.

Information on whole genome of aquaculture species

Several aquaculture species like Oncorhynchus mykiss (Berthelot et al 2014), Oreochromis niloticus (Conte et al 2017), Lates calcarifer (Vij et al 2016), Ictalurus punctatus (Liu et al 2016) and Salmo salar (Lien et al 2016) have had their whole genomes deciphered. In India, work on whole genome sequencing in Labeo rohita (Rohu) and Clarias batrachus has been carried out at ICAR-NBFGR, Lucknow. Shrimp are unique in that the genome size is comparatively large: ~2.2 Gbp in tiger shrimp and ~1.8 Gbp in Pacific white shrimp (Guppy et al 2018). The highly repetitive nature of the shrimp genome is a major challenge to assembly (Huang et al 2011; Baranski et al 2014). In addition, penaeids have a large number of micro-chromosomes and higher levels of genomic heterozygosity (Abdelrahman et al 2017) compared to genome assemblies derived from terrestrial farm species. To date, no comprehensive genome assembly is available for a penaeid shrimp (Guppy et al 2018). Although sequencing has improved considerably, especially through the development of high-throughput sequencing, resolving and assembling the many repetitive regions of the penaeid genome (~80%; Abdelrahman et al 2017) remains a major challenge.

Transcriptomics

Transcriptomics requires sequence data of the transcriptome. The idea is to isolate the mRNA from individuals at a given point in time, synthesise cDNA from it and then sequence the cDNA. The primary focus of transcriptomics has been immunology, disease resistance and reproductive biology (Guppy et al 2018). Generating transcriptome profiles is much easier than generating the whole genome. In P. vannamei, while investigating the effect of ammonia exposure, Lu et al (2016a) identified many genes and pathways linked to immune response (e.g. chitinase, peritrophin, thrombospondin and penaeidin) and growth (linoleic acid metabolism) that were suppressed. Reproductive dysfunction is a common feature of captive broodstock of tiger shrimp. Through differential gene expression studies of whole transcriptome data, genes related to fatty acid and steroid metabolism were found to have altered expression patterns when comparing wild-sourced and domesticated stock (Rotllant et al 2015).

Linkage mapping of genec markers in shrimp

The linkage map is a genomic resource that provides a wealth of genomic information and also helps unravel the underlying genetic architecture of commercially and biologically important traits. In penaeids, there have been substantial efforts to generate linkage maps. Linkage maps are constructed using data from family groups, viz. parents as well as progeny. Earlier, Amplified Fragment Length Polymorphism (AFLP) markers were used for construction of linkage maps in tiger shrimp (Wilson et al 2002). Later, Baranski et al (2014) constructed the first linkage map in tiger shrimp using SNPs. Presently, linkage maps are available that include between 3959 and 9298 markers and cover all 44 chromosomes of the penaeid genome (Baranski et al 2014, Yu et al 2015, Lu et al 2016b, Jones et al 2017a). Such maps have increased the applicability of these resources in assisting genome assembly, examining the architecture of traits and comparative mapping (Guppy et al 2018). It is interesting to note that construction of linkage maps has unravelled some hitherto unknown facts. Baranski et al (2014) reported in tiger shrimp that the female-specific map was substantially shorter than the male-specific map (2917 vs 4059 cM), indicating that there may be higher recombination in males, whereas in P. vannamei, Perez et al (2004) and Zhang et al (2007) reported longer maps for females than males (4134 vs 3221 cM and 2771 vs 2116 cM respectively). There is still ambiguity in the karyotype due to the micro-chromosomes in penaeids, as a consequence of which it appears that differences in map length between species exist and sex-specific recombination might occur (Baranski et al 2014). Maps available for tiger shrimp (Baranski et al 2014) and Pacific white shrimp (Yu et al 2015) have average inter-marker distances of 0.9 and 0.7 cM respectively across different map iterations. This is definitely a significant achievement; however, 1 cM equates to an estimated physical genome distance of ~400-600 Kb for penaeids (P. monodon 395 Kb/cM (Baranski et al 2014), P. vannamei 598.89 Kb/cM (Yu et al 2015), P. japonicus 657.89 Kb/cM (Lu et al 2016b)), which presents a significant challenge when we look to characterise potentially useful genes or genomic regions underlying findings of trait-association studies (Guppy et al 2018). Future work is required to obtain denser maps that decrease the interval between markers. This could be accomplished by genotyping more families and more individuals per family, which would provide additional observations of informative meiotic recombination events, or by integrating orphaned (unplaced) markers into existing maps (Fierst 2015). Utilising enhanced cost-effective genotyping strategies (e.g. the genotype by sequencing method) could result in genotyping of more families and more individuals per family, as a consequence of which fine-grained marker placement could be achieved (Guppy et al 2018).
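To make the scale of the problem concrete, the sketch below converts the average inter-marker distances quoted above into approximate physical distances. It assumes that the Kb/cM ratio is uniform along the genome, which is a simplification because recombination rates vary between regions. The function names and the 100 Kb target are illustrative only.

```python
# Kb per cM estimates quoted in the text for three penaeid species.
KB_PER_CM = {
    "P. monodon": 395.0,
    "P. vannamei": 598.89,
    "P. japonicus": 657.89,
}


def physical_interval_kb(interval_cm, kb_per_cm):
    """Approximate physical distance (Kb) spanned by a genetic interval (cM)."""
    return interval_cm * kb_per_cm


def spacing_for_target_kb(target_kb, kb_per_cm):
    """Genetic spacing (cM) needed so that markers are about target_kb apart."""
    return target_kb / kb_per_cm


# Average inter-marker distances quoted for the two published maps.
for species, spacing_cm in (("P. monodon", 0.9), ("P. vannamei", 0.7)):
    kb = physical_interval_kb(spacing_cm, KB_PER_CM[species])
    print(f"{species}: {spacing_cm} cM between markers is about {kb:.0f} Kb")

# Illustrative target: one marker roughly every 100 Kb in P. monodon.
needed_cm = spacing_for_target_kb(100.0, KB_PER_CM["P. monodon"])
print(f"P. monodon: about {needed_cm:.2f} cM spacing gives one marker per 100 Kb")
```

Under these assumptions, neighbouring markers on the current maps are still several hundred kilobases apart. An interval of that size can contain many genes, which is why denser maps are needed before candidate genes can be pinpointed.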

Developing and applying polymorphic markers

There have been considerable efforts in the past to develop a wide range of traditional genomic markers (e.g. allozymes, RFLP, AFLP and microsatellites) in several penaeid species. Most of these markers have been used for assessing wild populations and managing family lines. These markers have limitations, which have been reviewed by Benzie (1998, 2009). Due to the high cost of developing them and their failure to unravel the complexity of production traits, they have not found favour in the penaeid industry (Guppy et al 2018). Today, the traditional markers are being replaced by powerful and cost-effective markers like Single Nucleotide Polymorphisms (SNPs). SNPs are very abundant in the genome and can help substantially in genome studies. About 9 million SNPs in the Bos taurus genome (Xu et al 2017), 7 million SNPs in chickens (Rubin et al 2010), 9.7 million SNPs in Atlantic salmon (Yanez et al 2016), 8.6 million SNPs in channel catfish (Zeng et al 2017) and 5.6 million SNPs in Lates calcarifer (Vij et al 2016) have been identified. SNP discovery has further led to the manufacture of SNP arrays in several species, such as cattle and sheep, crops like wheat, and aquaculture species like catfish and Atlantic salmon. In P. monodon, at ICAR-CIBA, Baranski et al (2014) developed a chip containing 6000 SNPs, which were mostly identified using the transcriptomic approach. It is pertinent to point out that, to date, only two studies have produced validated SNP genotyping arrays: Baranski et al (2014) in tiger shrimp (6000 SNPs) and Jones et al (2017b) in Pacific white shrimp (6400 SNPs). The latter has been sold commercially as the Infinium ShrimpLD-24 v1.0 BeadChip. An interesting feature of these arrays is that they are based on type-I SNPs (genic rather than inter-genic), and many of these SNPs have been annotated with putative genes (62 and 47% respectively), thereby providing a strong foundation for further trait mapping studies (Robinson et al 2014 and Khatkar et al 2017b). An additional factor that needs to be considered is the cost of the SNP arrays. The approximate cost of genotyping per individual has fallen drastically to about Rs. 5000/-. This needs to be reduced further to make it cost-effective. Selection of a genotyping method for commercial applications would hinge on the time required for sample processing, genotyping and data analysis, as the window between pre-selection of candidate broodstock at harvest and final breeding selection and spawning is quite short (less than 3-6 months) (Guppy et al 2018).

Genotype by sequencing

A unique advantage of the genotype by sequencing (GBS) method is the ability to discover and genotype markers (de novo marker discovery) without requiring reference to existing genomic information like genomic sequences and transcriptomes. In penaeids, a number of GBS approaches have been utilised, with 25140 and 23049 markers obtained in Pacific white shrimp (Yu et al 2015, Wang et al 2017) and 28981 markers obtained in Kuruma shrimp (Lu et al 2016b). Most of these markers have been utilised to generate linkage maps, undertake Quantitative Trait Loci (QTL) mapping (Yu et al 2015, Lu et al 2016b) and estimate genomic prediction accuracy (Wang et al 2017), but they have yet to be utilised in the industry for genotyping.

Markers for breeding population management

Crustaceans molt frequently, which places them at a disadvantage for identification. However, tagging with visible implant elastomer tags (for family identification) and eye-ring tags (for individual identification) has been found to address this issue to a certain extent. The number of individuals available per family is rather large in shrimp, and they need to be reared in a common environment so that there is no confounding of environmental effects. Each family needs to be reared separately until tagging, and this poses a significant challenge on infrastructure. Tracking of pedigree is of paramount importance to keep inbreeding low. Use of genomic markers could enhance the identification of individual shrimp, but here again the cost of genotyping (high-density solid-state arrays), lack of genotyping power (microsatellites) or a combination of both these factors is a major stumbling block (Vandeputte and Haffray, 2014).

Exploiting genetic variation underlying phenotypes

It is important to comprehend the relationship between genetic variation and the phenotypes of economically important traits. The information so obtained could prove useful for integrating genomics research into food production industries (Abdelrahman et al 2017). Through QTL mapping and Genome-Wide Association Studies (GWAS), it may be possible to identify the number, location and effect size of genetic elements (i.e. genes, loci and regions) that are linked to the observed phenotypic variation of a trait (Mackay et al 2009). For this to be applied at the field level, we need to identify markers that are highly predictive of a superior or inferior phenotype in order to improve the selection of elite individuals for breeding programmes (Thorgaard et al 2006). Genomic breeding values have recently been utilised in agricultural breeding programmes in an effort to improve simple and complex traits (Meuwissen et al 2001, 2016). Such a procedure could also be applied to shrimp breeding programmes to elicit substantial genetic response.

QTL mapping

A Quantitative Trait Locus (QTL) is a region in the genome containing one or several genes that affect variation in a quantitative trait, and it is identified by its linkage to polymorphic marker loci. Mapping of QTLs involves two components: detection and localisation. Once the QTLs are detected, they need to be localised and the gene(s) unravelled. QTLs can be localised through their genetic linkage to visible marker loci with genotypes that we can readily classify. If a QTL is linked to a marker locus, then individuals with different marker locus genotypes will exhibit different mean values of the quantitative trait. QTLs can be mapped in families or in segregating progeny of crosses between genetically divergent strains (linkage mapping) or in unrelated individuals from the same population (association mapping). Later, these QTLs need to be validated in a population of individuals. If the validation yields encouraging results, the QTLs can be utilised to improve the concerned trait in a breeding programme. Two studies related to QTL mapping in aquaculture species have been reported. One is by Li et al (2006) for growth in Kuruma shrimp P. japonicus and the other by Robinson et al (2014) for resistance to White Spot Syndrome Virus in tiger shrimp. In the former case, AFLP markers were used, whereas in the case of tiger shrimp SNP markers were used.
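The idea that a linked marker shows up as a difference in mean trait value between genotype classes can be sketched as a single-marker analysis. The example below groups trait values by marker genotype and computes a one-way ANOVA F statistic. The genotypes and trait values are hypothetical, and this is a minimal illustration rather than the interval-mapping methods used in the studies cited above.

```python
from collections import defaultdict


def group_by_genotype(genotypes, phenotypes):
    """Collect trait values for each marker genotype class."""
    groups = defaultdict(list)
    for genotype, value in zip(genotypes, phenotypes):
        groups[genotype].append(value)
    return dict(groups)


def genotype_means(genotypes, phenotypes):
    """Mean trait value per marker genotype class."""
    groups = group_by_genotype(genotypes, phenotypes)
    return {g: sum(v) / len(v) for g, v in groups.items()}


def anova_f(genotypes, phenotypes):
    """One-way ANOVA F statistic for trait value against marker genotype."""
    groups = group_by_genotype(genotypes, phenotypes)
    n, k = len(phenotypes), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two genotype classes and n > classes")
    grand_mean = sum(phenotypes) / n
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((y - group_mean) ** 2 for y in values)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical data: one SNP marker and body weight (g) in nine animals.
marker = ["AA", "AA", "AA", "AB", "AB", "AB", "BB", "BB", "BB"]
weight = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]

print(genotype_means(marker, weight))
print(f"F statistic = {anova_f(marker, weight):.1f}")
```

A large F statistic suggests that the marker genotype classes differ in mean trait value, which is what a marker linked to a QTL would show. In practice the test is repeated across all markers and corrected for multiple testing.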

Genome-Wide Association Studies

These are studies aimed at associating a particular QTL with a trait. To date, only two such studies have been reported in aquaculture species. The first is in tiger shrimp, the work for which was carried out at ICAR-CIBA and NOFIMA, Norway. Seven families of tiger shrimp were exposed to the White Spot Syndrome Virus. The number of shrimp genotyped was 1024. About 9 QTLs in tiger shrimp were found to be significantly associated with hours of survival. In addition, 3 SNPs were found to be associated with sex in tiger shrimp (Robinson et al 2014). The second study was for growth in P. vannamei. The authors could not find any significant association of markers with growth. Earlier, Yu et al (2015), while working in P. vannamei, had reported a large QTL for growth explaining 17.9% of the phenotypic variation.

Conclusion

Omics research in aquaculture has generated a lot of information during the past three decades. Compared to plant and livestock breeding programmes, aquatic species have a long way to go. The information flowing from various resources like linkage maps, physical maps, annotated transcriptomes, characterised proteome data and genome sequences needs to be incorporated onto a single platform for use by other scientists working in this field. Wide publicity needs to be given to high-density linkage maps to comprehend genome architecture so as to help in future genetic improvement programmes. In-depth studies on economically important traits in aquatic species are also required urgently so as to help farmers reap profits from culture of fish/shrimp.

References cited

Abdelrahman, H., ElHady, M., Alcivar-Warren, A., Allen, S., Al-Tobasei, R., Bao, L., et al. (2017). Aquaculture genomics, genetics and breeding in the United States: current status, challenges, and priorities for future research. BMC Genomics 18:191. doi: 10.1186/s12864-017-3557-1

Agrawal, R., and Narayan, J. (2015). Unravelling the impact of bioinformatics and omics in agriculture. Int. J. Plant Biol. Res. 3:1039.

Baranski, M., Gopikrishna, G., Robinson, N. A., Katneni, V. K., Shekhar, M. S., Shanmugakarthik, J., et al. (2014). The development of a high density linkage map for black tiger shrimp (Penaeus monodon) based on cSNPs. PLoS ONE 9:e85413. doi: 10.1371/journal.pone.0085413

Benzie, J. A. (1998). Penaeid genetics and biotechnology. Aquaculture 164, 23–47. doi: 10.1016/S0044-8486(98)00175-6

Benzie, J. A. (2009). Use and exchange of genetic resources of penaeid shrimps for food and aquaculture. Rev. Aquacult. 1, 232–250. doi: 10.1111/j.1753-5131.2009.01018.x

Berthelot, C., Brunet, F., Chalopin, D., Juanchich, A., Bernard, M., Noël, B., et al. (2014). The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat. Commun. 5:3657. doi: 10.1038/ncomms4657

Conte, M. A., Gammerdinger, W. J., Bartie, K. L., Penman, D. J., and Kocher, T. D. (2017). A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions. BMC Genomics 18:341. doi: 10.1186/s12864-017-3723-5

Dekkers, J. C. (2004). Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons. J. Anim. Sci. 82(13 Suppl.), E313–E328. doi: 10.2527/2004.8213_supplE313x

FAO (2017). FishStat Plus - Universal Software for Fishery Statistical Time Series. FAO Fisheries and Aquaculture Department. Rome.

Fierst, J. L. (2015). Using linkage maps to correct and scaffold de novo genome assemblies: methods, challenges, and computational tools. Front. Genet. 6:220. doi: 10.3389/fgene.2015.00220

Gjedrem, T., and Rye, M. (2016). Selection response in fish and shellfish: a review. Rev. Aquacult. 10, 168–179. doi: 10.1111/raq.12154

Gjedrem, T., Robinson, N., and Rye, M. (2012). The importance of selective breeding in aquaculture to meet future demands for animal protein: a review. Aquaculture 350, 117–129. doi: 10.1016/j.aquaculture.2012.04.008

Guppy, J. L., Jones, D. B., Jerry, D. R., Wade, N. M., Raadsma, H. W., Huerlimann, R., and Zenger, K. R. (2018). The State of “Omics” Research for farmed penaeids: Advances in research and impediments to industry utilisation. Front. Genet. 9:282. doi: 10.3389/fgene.2018.00282

Huang, S.-W., Lin, Y.-Y., You, E.-M., Liu, T.-T., Shu, H.-Y., Wu, K.-M., et al. (2011). Fosmid library end sequencing reveals a rarely known genome structure of marine shrimp Penaeus monodon. BMC Genomics 12:242. doi: 10.1186/1471-2164-12-242

Jones, D. B., Jerry, D. R., Khatkar, M. S., Raadsma, H. W., Steen, H. V. D., Prochaska, J., et al. (2017a). A comparative integrated gene-based linkage and locus ordering by linkage disequilibrium map for the Pacific white shrimp, Litopenaeus vannamei. Sci. Rep. 7:10360. doi: 10.1038/s41598-017-10515-7

Jones, D. B., Zenger, K. R., Khatkar, M. S., Raadsma, H. W., Steen, H. A. M. V. D., Prochaska, J., et al. (2017b). “Development of a low-density commercial genotyping array for the white legged shrimp, Litopenaeus vannamei,” in AAABG, Edited by Genetics AAoABa (Townsville, QLD).

Khatkar, M., Coman, G., Thomson, P., and Raadsma, H. (2017a). “Comparison of different breeding design options for long term genetic gain and diversity in aquaculture species,” in Proc Assoc Advmt Anim Breed Genet (Townsville, QLD), 449–452.

Li, Y., Dierens, L., Byrne, K., Miggiano, E., Lehnert, S., Preston, N., et al. (2006). QTL detection of production traits for the Kuruma prawn Penaeus japonicus (Bate) using AFLP markers. Aquaculture 258, 198–210. doi: 10.1016/j.aquaculture.2006.04.027

Lien, S., Koop, B. F., Sandve, S. R., Miller, J. R., Kent, M. P., Nome, T., et al. (2016). The Atlantic salmon genome provides insights into rediploidization. Nature 533, 500–505. doi: 10.1038/nature17164

Liu, Z., Liu, S., Yao, J., Bao, L., Zhang, J., Li, Y., et al. (2016). The channel catfish genome sequence provides insights into the evolution of scale formation in teleosts. Nat. Commun. 7:11757. doi: 10.1038/ncomms11757

Lu, X., Kong, J., Luan, S., Dai, P., Meng, X., Cao, B., et al. (2016a). Transcriptome analysis of the hepatopancreas in the Pacific White Shrimp (Litopenaeus vannamei) under acute ammonia stress. PLoS ONE 11:e0164396. doi: 10.1371/journal.pone.0164396

Lu, X., Luan, S., Hu, L. Y., Mao, Y., Tao, Y., Zhong, S. P., et al. (2016b). High-resolution genetic linkage mapping, high-temperature tolerance and growth-related quantitative trait locus (QTL) identification in Marsupenaeus japonicus. Mol. Genet. Genomics 291, 1391–1405. doi: 10.1007/s00438-016-1192-1

Mackay, T. F., Stone, E. A., and Ayroles, J. F. (2009). The genetics of quantitative traits: challenges and prospects. Nat. Rev. Genet. 10, 565–577. doi: 10.1038/nrg2612

Meuwissen, T., Hayes, B., and Goddard, M. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819.

Meuwissen, T., Hayes, B., and Goddard, M. (2016). Genomic selection: a paradigm shift in animal breeding. Anim. Front. 6, 6–14. doi: 10.2527/af.2016-0002

Pérez, F., Erazo, C., Zhinaula, M., Volckaert, F., and Calderón, J. (2004). A sex-specific linkage map of the white shrimp Penaeus (Litopenaeus) vannamei based on AFLP markers. Aquaculture 242, 105–118. doi: 10.1016/j.aquaculture.2004.09.002

Pérez-de-Castro, A. M., Vilanova, S., Cañizares, J., Pascual, L., Blanca, J. M., Diez, M. J., et al. (2012). Application of genomic tools in plant breeding. Curr. Genomics 13, 179–195. doi: 10.2174/138920212800543084

Robinson, N. A., Gopikrishna, G., Baranski, M., Katneni, V. K., Shekhar, M. S., Shanmugakarthik, J., et al. (2014). QTL for white spot syndrome virus resistance and the sex-determining locus in the Indian black tiger shrimp (Penaeus monodon). BMC Genomics 15:731. doi: 10.1186/1471-2164-15-731

Rothschild, M. F., and Plastow, G. S. (2008). Impact of genomics on animal agriculture and opportunities for animal health. Trends Biotechnol. 26, 21–25. doi: 10.1016/j.tibtech.2007.10.001

Rotllant, G., Wade, N. M., Arnold, S. J., Coman, G. J., Preston, N. P., and Glencross, B. D. (2015). Identification of genes involved in reproduction and lipid pathway metabolism in wild and domesticated shrimps. Mar. Genomics 22, 55–61. doi: 10.1016/j.margen.2015.04.001

Rubin, C.-J., Zody, M. C., Eriksson, J., Meadows, J. R. S., Sherwood, E., Webster, M. T., et al. (2010). Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464:587. doi: 10.1038/nature08832

Taylor, J. F., Taylor, K. H., and Decker, J. E. (2016). Holsteins are the genomic selection poster cows. Proc. Natl. Acad. Sci. U.S.A. 113, 7690–7692. doi: 10.1073/pnas.1608144113

Thorgaard, G. H., Nichols, K. M., and Phillips, R. B. (2006). Comparative gene and QTL mapping in aquaculture species. Israeli J. Aquacult. Bamidgeh 58, 4.

Van Emon, J. M. (2015). The omics revolution in agricultural research. J. Agric. Food Chem. 64, 36–44. doi: 10.1021/acs.jafc.5b04515

Vandeputte, M., and Haffray, P. (2014). Parentage assignment with genomic markers: a major advance for understanding and exploiting genetic variation of quantitative traits in farmed aquatic animals. Front. Genet. 5:432. doi: 10.3389/fgene.2014.0043

Vij, S., Kuhl, H., Kuznetsova, I. S., Komissarov, A., Yurchenko, A. A., Van Heusden, P., et al. (2016). Chromosomal-level assembly of the Asian seabass genome using long sequence reads and multi-layered scaffolding. PLoS Genet. 12:e1005954. doi: 10.1371/journal.pgen.1005954

Wang, Q., Yu, Y., Yuan, J., Zhang, X., Huang, H., Li, F., et al. (2017). Effects of marker density and population structure on the genomic prediction accuracy for growth trait in Pacific white shrimp Litopenaeus vannamei. BMC Genet. 18:45. doi: 10.1186/s12863-017-0507-5

Wilson, K., Li, Y. T., Whan, V., Lehnert, S., Byrne, K., Moore, S., et al. (2002). Genetic mapping of the black tiger shrimp Penaeus monodon with amplified fragment length polymorphism. Aquaculture 204, 297–309. doi: 10.1016/S0044-8486(01)00842-0

Xu, C., Li, E., Liu, Y., Wang, X., Qin, J. G., and Chen, L. (2017). Comparative proteome analysis of the hepatopancreas from the Pacific white shrimp Litopenaeus vannamei under long-term low salinity stress. J. Proteomics 162, 1–10. doi: 10.1016/j.jprot.2017.04.013

Yáñez, J. M., Naswa, S., López, M., Bassini, L., Correa, K., Gilbey, J., et al. (2016). Genomewide single nucleotide polymorphism discovery in Atlantic salmon (Salmo salar): validation in wild and farmed American and European populations. Mol. Ecol. Resour. 16, 1002–1011. doi: 10.1111/1755-0998.12503

Yu, Y., Zhang, X., Yuan, J., Li, F., Chen, X., Zhao, Y., et al. (2015). Genome survey and high-density genetic map construction provide genomic and genetic resources for the Pacific White Shrimp Litopenaeus vannamei. Sci. Rep. 5:15612. doi: 10.1038/srep15612

Zeng, Q., Fu, Q., Li, Y., Waldbieser, G., Bosworth, B., Liu, S., et al. (2017). Development of a 690K SNP array in catfish and its application for genetic mapping and validation of the reference genome sequence. Sci. Rep. 7:40347. doi: 10.1038/srep40347

Zhang, L., Yang, C., Zhang, Y., Li, L., Zhang, X., Zhang, Q., et al. (2007). A genetic linkage map of Pacific white shrimp (Litopenaeus vannamei): sex-linked microsatellite markers and high recombination rates. Genetica 131, 37–49. doi: 10.1007/s10709-006-9111-8

17. Shrimp Genomics: Current status and Challenges

M.S. Shekhar, K. Vinaya Kumar and K.K. Vijayan

Shrimp genomics has made considerable research progress over the last few decades. Recent advances in “omics”, in particular the advancement of NGS techniques, have presented the aquaculture industry with both opportunities and challenges in understanding the complexity of the whole shrimp genome. However, the currently available molecular biology resources and bioinformatics techniques require further development to meet these challenges and provide the most informative results in deciphering the shrimp genome.

1. Introduction

The global consumption of food fish is projected to increase tremendously. However, with exploitation and the decrease in wild catch fisheries worldwide, much importance is now being given to increasing production from aquaculture. In aquaculture and fisheries management, studies relating to population structure, genetic diversity, environmental adaptation and molecular response to biotic and abiotic stress are very important for effective genetic improvement breeding programs. “Biotechnology” integrated with “Omics” is a term that has now come to encompass many of the exciting new developments in aquaculture during recent years. Hence, for sustainable aquaculture, genetic improvement for desired traits through biotechnological means has gained importance in recent years. Aquaculture biotechnology deals with the use of knowledge and techniques in the field of molecular, cellular and genetic processes to develop improved aquaculture products and varieties. Therefore, the broad term ‘omics’, which includes the methods and techniques required for analyzing all the different types of molecules and the pathways associated with them, is used in aquaculture as well. This encompasses the four major “omics”, namely transcriptomics, proteomics, metabolomics and epigenomics. Viral infections are one of the major reasons for the huge economic losses in shrimp farming. The control of viral diseases in shrimp remains a serious challenge for the shrimp aquaculture industry. White spot syndrome virus (WSSV) is a major pathogen which is geographically widespread and continues to be a serious threat affecting shrimp farms the world over. In the absence of a true adaptive immune response system in invertebrates, shrimps respond by non-specific innate immune mechanisms. Shrimp genome annotation and transcriptome generation as “omics” tools would help to unravel the molecular mechanisms involved in the immune defence network that operate in shrimp in response to WSSV infection, in addition to aiding the development of genetically improved varieties of shrimp with desirable traits through genetic improvement breeding programmes.

2. Transcriptomics

Next-generation high-throughput RNA sequencing technology (RNA-seq) is a modern, high-throughput method which is not restricted by the unavailability of a genome reference sequence and has tremendous potential for identification, profiling and quantification of RNA transcripts with increased sensitivity. The transcriptome is the complete set of transcripts in a cell, together with their quantity, for a specific developmental stage or physiological condition. The transcriptome helps in identifying the functional elements of the genome and reveals the molecular constituents of cells and tissues in response to environmental stress, with accurate quantification of gene expression levels. Because of these advantages over other expression techniques, this approach is now widely used in decoding the functional role of genes and cell responses against environmental stress. Significant progress has recently been achieved in understanding the transcript expression of marine crustaceans such as Litopenaeus vannamei, Fenneropenaeus chinensis, Eriocheir sinensis and Macrobrachium nipponense in response to biotic and abiotic stress factors. Transcriptome data aid in identification of novel genes in the absence of a shrimp genome database, as shown in Table 1. Next-generation sequencing technologies have therefore influenced the analysis of gene regulation.

Table 1. Transcriptomes generated from shrimp species

Species | Tissues | Transcriptome generation (condition or trait studied)

L. vannamei | Hemocytes | WSSV

L. vannamei | Hepatopancreas | WSSV

L. vannamei | Hepatopancreas and muscle | WSSV and growth

L. vannamei | Hemolymph and hemocytes | TSV

L. vannamei | Hepatopancreas | TSV

L. vannamei | Testis and ovaries | Gonadal development

L. vannamei | Hepatopancreas | Acute ammonia stress

L. vannamei | Hepatopancreas | Osmoregulatory stress

L. vannamei | Gills | Osmoregulatory stress

L. vannamei | Hepatopancreas and hemocytes | Nitrite

L. vannamei | Whole larvae | Embryo development

L. vannamei | Embryo, nauplius, zoea, mysis, post larvae | Larval development

L. vannamei | Whole shrimp | Molting

L. vannamei | Muscle | Feed efficiency

L. vannamei | Heart, muscle, hepatopancreas and eyestalk | Growth

P. monodon | Hepatopancreas and ovary | Reproduction and development

P. monodon | Eyestalk, stomach, female gonad, male gonad, gill, haemolymph, hepatopancreas, lymphoid organ, tail muscle, embryos, nauplii, zoea, and mysis, whole larvae | Gene discovery

M. japonicus | Fertilized eggs, embryos and vegetal halves | Embryo development

F. chinensis | Cephalothorax | WSSV

F. merguiensis | Cuticle, muscle, androgenic gland, hepatopancreas, stomach, nervous system, eyestalk, male gonads, female gonads | Color

F. merguiensis | Hepatopancreas, stomach, eye stalk, nerve cord, male gonad, female gonad, androgenic gland region, muscle and cuticle | Reproduction and development

3. Complexity of shrimp genome

Shrimp genomes are large, with highly repetitive sequences, which pose significant challenges in deciphering the whole genome and in other genetic studies. In our study, genome size estimated by flow cytometry showed the shrimp genome to be very large. The genome sizes of four major species of the genus Penaeus (Penaeus monodon, Penaeus indicus, Penaeus vannamei and Penaeus japonicus) were found to be in a similar range. The genome size of female shrimps ranged from 2.91 ± 0.03 pg (P. monodon) to 2.14 ± 0.02 pg (P. japonicus). In male shrimps, the genome size ranged from 2.86 ± 0.06 pg (P. monodon) to 2.19 ± 0.02 pg (P. indicus). A significant difference in genome size was observed between male and female shrimp of all species except P. monodon. The highest relative difference between the sexes, 12.78%, was observed in P. indicus. The interspecific relative difference in genome size was highest between the male shrimps of P. monodon and P. indicus (30.59%) and between the female shrimps of P. monodon and P. japonicus (35.98%). This study was undertaken to estimate genome size in shrimps, which will help guide research aimed at generating whole-genome sequence data for these species in future. The penaeid genome (80% repetitive) remains a challenge even today for sequencing and assembly. Short-read second-generation sequencing methods, for example Illumina sequencing technology, are preferred for non-complex genomes, where assembly proceeds by identifying and overlaying overlapping sequences and building the resulting contigs and scaffolds. However, when short-read sequencing methods are applied to highly repetitive regions within the genome, it is difficult to build contiguous sequences. The shrimp genomes also have high levels of heterozygosity. Previous short-read assemblies in shrimps have yielded highly fragmented assemblies with a high number of scaffolds. There are reports that polysaccharide contamination and high DNase activity in shrimp samples can interfere with long-read sequencing methodologies. These are major challenges to overcome, and methods to isolate intact, pure shrimp genomic DNA need further standardization.
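The interspecific differences in genome size quoted in this section can be reproduced with a short calculation. The sketch below assumes that the relative difference was computed against the smaller of the two values, which is an inference that matches the published percentages but is not stated in the text. It also converts picograms to base pairs using the commonly used approximation of 978 Mb per pg of DNA, which is likewise not taken from this chapter.

```python
# Assumption: 1 pg of DNA is approximately 978 Mb (a widely used conversion).
MB_PER_PG = 978.0

# Mean genome sizes (pg) quoted in the text.
FEMALE_PG = {"P. monodon": 2.91, "P. japonicus": 2.14}
MALE_PG = {"P. monodon": 2.86, "P. indicus": 2.19}


def relative_difference_pct(larger, smaller):
    """Difference between two genome sizes as a percentage of the smaller one."""
    return (larger - smaller) / smaller * 100.0


def pg_to_gb(picograms):
    """Convert a genome size in picograms to gigabases."""
    return picograms * MB_PER_PG / 1000.0


male_diff = relative_difference_pct(MALE_PG["P. monodon"], MALE_PG["P. indicus"])
female_diff = relative_difference_pct(
    FEMALE_PG["P. monodon"], FEMALE_PG["P. japonicus"]
)
print(f"Males, P. monodon vs P. indicus: {male_diff:.2f}%")
print(f"Females, P. monodon vs P. japonicus: {female_diff:.2f}%")
print(f"Female P. monodon genome: about {pg_to_gb(FEMALE_PG['P. monodon']):.2f} Gb")
```

Expressing the flow cytometry estimate in gigabases gives a target size against which the total length of a draft assembly can be compared.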

4. NGS platform for shrimp genome sequencing

Several NGS platforms are currently in use, such as Illumina MiSeq, Ion Torrent PGM, PacBio RS, Illumina GAIIx and Illumina HiSeq 2000. The key features that determine the optimal platform are speed of sequencing and a low error rate. Sequencing has been dominated by Illumina. However, this technology is not adequate for dealing with complex shrimp genomes, which require longer read lengths. One platform which yields longer read lengths is PacBio. PacBio is based on single molecule real time (SMRT) sequencing. DNA polymerase molecules, each bound to a DNA template, are present at the base of 50 nm-wide wells called zero-mode waveguides (ZMWs). Each polymerase carries out second-strand DNA synthesis in the presence of γ-phosphate fluorescently labeled nucleotides. With each base incorporation, a distinctive pulse of fluorescence is detected in real time. The PacBio platform, by virtue of its long read lengths, has potential application in de novo sequencing of the shrimp genome. Mean read lengths of approximately 1500 bp were generated using the PacBio RS system with the first generation of chemistry (C1 chemistry), whereas the advanced PacBio RS II system with the C4 chemistry yields average read lengths over 10 kb, with an N50 of more than 20 kb and maximum read lengths over 60 kb. The latest PacBio Sequel System is an advanced version with higher throughput, more scalability, a reduced footprint and lower sequencing project costs compared to the PacBio RS II System. The key advance of the Sequel System is the capacity of its redesigned SMRT Cells, which contain one million zero-mode waveguides (ZMWs), compared to 150,000 ZMWs in the PacBio RS II. Active individual polymerases are immobilized within the ZMWs, providing windows to observe and record DNA sequencing in real time. In future, successful assemblies of the shrimp genome will depend upon a “hybrid assembly” approach, utilizing short-read sequencing to correct the high error rate observed in long-read PacBio sequencing.
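The N50 statistic mentioned above for read lengths is also widely used to describe contigs and scaffolds. It is the length L such that reads (or contigs) of length L or longer together contain at least half of all sequenced bases. A minimal sketch of the calculation follows; the read lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of read or contig lengths.

    Lengths are sorted from longest to shortest and accumulated until the
    running total reaches at least half of the overall total.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical long-read lengths in base pairs.
reads = [60000, 25000, 20000, 12000, 8000, 5000, 1500]
print(f"Total bases: {sum(reads)}")
print(f"N50: {n50(reads)} bp")
```

Because N50 is weighted by length, a few very long reads raise it much more than they raise the mean, which is why both figures are usually reported together.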

5. The Challenges

In comparison to other livestock industries, very few improved lines are used in aquaculture production (Gjedrem et al., 2012). Aquaculture production has also not completely utilized the existing natural genetic potential and resources for increased productivity. In the case of shrimp, there have been numerous molecular studies on the expression and function of selected genes involved in metabolic pathways; however, little attention has been given to the metabolic differences which exist between shrimp or between their developmental stages. The differences among shrimp that result from particular metabolic adaptations to varied environmental conditions need to be studied in detail. These types of studies have direct relevance to better management practices and the formulation of optimal diets for the domestication of shrimp in aquaculture. In the immediate future, the main challenge is to integrate the available genomic data with physiological studies on shrimp. These outcomes will elucidate species-specific adaptations to environmental conditions, and have the potential to inform and stimulate research in many biological disciplines. For any genomic study and analysis, a reference genome is essential; however, except for the branchiopod Daphnia pulex, no information on a complete genome assembly is available from other crustaceans. The genome of D. pulex is comparatively small, about 200 Mb, containing 30,970 genes and only 9.4% repetitive sequences; in shrimp, however, the genomes are too big and complex for sequencing and assembly. Bioinformatics, data mining and sequence annotation approaches need to be defined and developed for complex genomes, which would aid in complete genome assembly.

6. Future potential

Introducing improved bioinformatics approaches for error correction of long sequencing reads, together with the use of optical mapping, would help in completing the large genome assemblies of shrimp and other aquatic species. There is also an urgent need to construct linkage and physical maps and to develop databases of annotated transcriptome, proteomics and metabolomics data, which would help in generating highly informative shrimp “omics” resources for understanding genome structure, genome evolution, phylogeny and natural selection of aquaculture species. Functional genomics with an annotated genome, and validation of candidate genes by experimental CRISPR or RNAi knockdown studies, would be significant progress towards identification of target genes for commercially important traits such as growth and disease resistance. Understanding the genome and genetic makeup of shrimp would help in deciphering complex traits, which would eventually accelerate shrimp breeding programmes. A high-density linkage map is essential for shrimp genomics and genetic studies. Creation of such a map would help in mapping QTLs for traits of interest such as body weight, body length, disease resistance and other traits of high commercial significance in aquaculture.


18. Application of Biotechnology in animal reproduction

Sherly Tomy

Reproductive efficiency is a major factor determining the economic success of any livestock enterprise. The majority of animal breeding programs have aimed at enhancing the genetic worth of animals using conventional selection methods based primarily on phenotype. Revolutionary tools in reproductive biotechnology, such as recombinant DNA procedures, genome engineering, transgenic technology and somatic cell nuclear transplantation, have added new dimensions to animal breeding. Application of biotechnology in animal breeding has resulted in several remarkable achievements, such as the sheep Dolly, created by somatic cloning; transgenic pigs that can be organ donors for humans; and animal bioreactors producing human therapeutic proteins in milk. Compared to terrestrial animals, progress in aquatic animals has been limited. Only a small percentage of farmed aquatic species have been subject to genetic improvement programmes. However, biotechnology has great potential to increase fish production, mainly because of the availability of large numbers of gametes, external fertilization, and the ease of hormone treatments during development to induce sterility or functional sex reversal. Some of the important reproductive biotechnological tools used in farm animals are:

Artificial insemination: Using this technology, new breeds of animals are produced by introducing sperm from a superior male into the female reproductive tract without mating. The advantages of AI are that it reduces transmission of venereal disease, lessens the need for farms to maintain breeding males, facilitates more accurate recording of pedigrees, and minimizes the cost of introducing improved genetics. However, the success of AI depends on accurate heat detection, proper handling of frozen semen and timely insemination by a trained inseminator.

Sex Determination of Sperm: Sexing of sperm helps to pre-determine the sex of the progeny. The technique works on the principle of flow cytometric separation of fluorescent-labelled X-chromosome-bearing spermatozoa from fluorescent-labelled Y-chromosome-bearing spermatozoa. The accuracy of this technique is high; however, the laser light used reduces the viability of the sexed sperm, and the throughput is low. New-generation flow cytometers with high sorting rates have opened avenues for increasing sorted sperm output with minimal or no damage to sperm. Sex chromosome-specific proteins (SCSPs) identified on the surface of sperm are also currently used for sperm sexing; these methods are less invasive and less damaging to sperm.

Sperm Encapsulation: This involves encapsulating sperm for longer preservation in vivo and to allow progressive release of viable spermatozoa over several days; it has been applied in various domestic species and in humans. The technique also prevents cryocapacitation and is reported to increase conception rate. It has been developed in cattle and swine, but it still needs more sophisticated instruments for encapsulation, and further standardization, before it can be used under field conditions in other livestock species.

Ovum Pick Up (OPU): This is a non-invasive and repeatable technique used for recovering large numbers of competent oocytes from antral follicles of live animals. Embryo production from ovum pick-up oocytes is affected by age, season and follicle stimulating hormone (FSH) stimulation. It is also evident that repeated OPU can be performed without side effects in both cattle and buffaloes, with minimal stress to the animal. In India, the first buffalo calf (Saubhagya) was produced through this technique by Prasad et al. (2013), and subsequently the first bovine calf (Holi) was produced at the ICAR-National Dairy Research Institute. OPU has the advantages of collecting oocytes with little invasiveness and of allowing superior animals to be used as oocyte donors in embryo transfer. The limitations of this technique are the low oocyte yield per ovary and the need for sophisticated instruments.

In Vitro Maturation, Fertilization and Culture (IVMFC): This involves oocyte collection from slaughterhouse ovaries or from live animals, followed by maturation and fertilization in vitro for the production of viable embryos. Since the birth of the first rabbit conceived through IVF in 1959, IVF has been practised in several animals. Various methods for in vitro maturation, IVF and in vitro culture have been standardized in animals. In addition, IVMFC has provided an excellent source of embryos for embryo transfer, cloning, transgenesis and other advanced in vitro techniques. It has also allowed the analysis of the developmental potential of embryos, patterns of gene expression, epigenetic modifications and cytogenetic disorders in various domestic species, and has been used as a model for human embryogenesis studies. The low success rate and the costs make the technique less feasible for application in livestock under field conditions.

Intracytoplasmic Sperm Injection (ICSI): ICSI is a micromanipulation technique used for treating male infertility. It involves mechanical insertion of a selected sperm into the cytoplasm of an oocyte to produce the desired embryo. Since the first report of ICSI success, it has been carried out in species such as rabbits, mice, sheep, humans, horses, cattle, pigs and buffaloes. The technique is also used as a sperm vector system for animal transgenesis.

Multiple Ovulation and Embryo Transfer: In this technique, selected genetically superior (elite) females are hormonally induced to superovulate and are inseminated with high-quality semen of a superior male at an appropriate time relative to ovulation, depending on the species. Week-old embryos are flushed out of the donor’s uterus, isolated, examined microscopically for number and quality, and transferred non-surgically into the uterus of surrogate mothers. Embryo transfer (ET) increases the reproductive rate of selected females, reduces disease transfer, and facilitates the development of rare and economically important genetic stocks. The main limiting factors for ET are the costly hormones, labour-intensive protocols and expertise it requires, in addition to poor superovulatory response and pregnancy outcomes.

Somatic Cell Nuclear Transfer or Cloning: Somatic cell nuclear transfer (SCNT) is a major technique for delivering nuclease-mediated genetic alterations in livestock. In this technique, the nucleus of a somatic cell is transferred into an oocyte from which the nucleus has been removed, to generate a new individual genetically identical to the somatic cell donor. The advantage of SCNT is that the gene-edited cell line can be genotyped and/or screened before transfer into the enucleated oocyte to ensure that the desired edits, and no off-target edits, have occurred. A number of gene-edited animals have been produced through SCNT cloning. This technique was used to generate Dolly from a differentiated adult mammary epithelial cell. Further research is needed to improve the efficiency of cloning. SCNT is normally a procedure of cloning within the same species, but interspecies cloning (interspecies Somatic Cell Nuclear Transfer, iSCNT) is also feasible. Cloned animals have already been produced between closely related species, e.g. domestic cattle (Bos taurus) and wild ox (Bos grunniens). Cloning using embryonic stem cells (ESCs), referred to as Nuclear Transfer-derived Embryonic Stem Cell (NTESC) cloning, is still unsuccessful. Despite the achievements made through the SCNT-editing method, drawbacks associated with cloning such as early embryonic losses, postnatal death and birth defects cannot be ignored.

Cryopreservation: Cryopreservation is a process in which cells, whole tissues, or any other substances susceptible to damage caused by chemical reactivity or time are preserved by cooling to sub-zero temperatures. It is a complex multistage process incorporating cryoprotectants or antifreeze agents. The ability to cryopreserve germplasm indefinitely allows genetic diversity to be preserved. Unlike semen, cryopreservation of embryos preserves complete genotypes. Freezing of embryos is an established commercial practice, especially in cattle. In contrast to embryos, oocytes are extremely sensitive to chilling and are difficult to cryopreserve without losing their viability. However, research is in progress on the vernalisation of oocytes, in which storage at very low temperature, without freezing, could preserve the oocytes for several months. Cryopreservation is advantageous as it reduces the risk and expense of transporting expensive animals, reduces disease transmission and supports conservation of endangered species germplasm.

Embryo Sexing: Embryo sexing is a technique in reproductive biotechnology with practical applications. Sex determination is performed by Y-chromosome-specific DNA probe technology coupled with polymerase chain reaction (PCR) amplification of a specific Y-chromosome region. Other methods involve detection of embryonic H-Y antigen in the embryos and the use of loop-mediated isothermal amplification and duplex PCR-based assays, which show more than 95% accuracy but involve high cost, time and expertise.

Transgenesis: Transgenic animals have a foreign gene deliberately inserted into their genome by micro-injection of DNA into the pronuclei of a fertilised egg, which is subsequently implanted into the oviduct of a surrogate mother. Transgenesis has great potential in molecular breeding of farm animals, such as the development of animals with high fecundity, higher fertility and disease resistance. Transgenic technologies in fishes can improve growth rates and market size, feed conversion ratios, resistance to disease, sterility and tolerance of extreme environmental conditions. In the shrimp aquaculture sector, transgenic shrimp have been reported (Mialhe et al., 1995), but there has been no successful development to date for commercial culture. The cost of making transgenic farm animals is high and the efficiency is low.

Stem Cells: Stem cells are unspecialized cells that renew themselves for long periods through cell division and later become specialized on receiving specific signals. Based on their source, stem cells are classified into three types: embryonic, adult and fetal stem cells. Embryonic stem (ES) cells are derived from embryos at the blastocyst stage (32-cell stage) and can give rise to cells of all three embryonic germ layers. ES cells are considered advantageous as they do not form tumours when transferred into the body, which strengthens their potential for use in transplantation. Adult stem cells, on the other hand, are undifferentiated cells found throughout the body that replenish and regenerate cells in damaged tissue. Spermatogonial stem cells are the only adult stem cells responsible for transferring genes to the next generation, via fertilization of the ovum. Some potential applications of this technology are surrogate production of spermatozoa, reduced time for progeny testing, production of transgenic animals and conservation of endangered species.

Gene editing: Gene editing is a powerful tool for manipulating the genome, with applications in animal breeding programs. It allows specific deletions, additions or allele alterations at unambiguous locations in a genome. The development of designer nucleases (zinc finger nucleases [ZFNs], transcription activator-like effector nucleases [TALENs], and clustered regularly interspaced short palindromic repeats [CRISPR/Cas9]) has enabled highly efficient and more facile genome editing in different animal species. These tools could be employed to enhance productivity, disease resistance and breeding efficiency, and to generate novel animal models. Such alterations, if made in zygotes or germ line cells, can be permanent and heritable. Recently, genome editing has been reported in many livestock species, including myostatin (MSTN) gene editing for “double muscling” in pigs, cattle and sheep, introduction of the polled gene in dairy cattle, and edits to confer resistance to porcine reproductive and respiratory syndrome virus and African swine fever virus in pigs.
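As an illustration of how candidate locations for CRISPR/Cas9 editing might be listed, the sketch below scans a DNA sequence for 20-nucleotide stretches followed by an NGG motif, the protospacer-adjacent motif (PAM) commonly associated with Streptococcus pyogenes Cas9. It is a simplified sketch under stated assumptions: it searches the forward strand only and does no off-target scoring, and the function name `find_candidate_targets` and the example sequence are hypothetical.

```python
import re


def find_candidate_targets(sequence, protospacer_len=20):
    """Return (start, protospacer, pam) tuples for forward-strand NGG PAM sites."""
    sequence = sequence.upper()
    # A zero-width lookahead lets overlapping candidates all be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(sequence)]


# Hypothetical sequence for illustration only; not a real MSTN fragment.
example = "ATGCAAAAACTGCAACTCTGTGTTTATATTTACCTGTTTATGG"
for start, protospacer, pam in find_candidate_targets(example):
    print(start, protospacer, pam)
```

A practical design workflow would also scan the reverse complement and screen each candidate against the whole genome, which this sketch deliberately leaves out.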

Endocrine regulation of reproduction in fish

Biotechnology can be applied to enhance the reproductive performance of cultured aquatic species exhibiting reproductive dysfunction in captivity. In the past, fish gonadotropins, a group of hormones that stimulate reproduction, were produced in small amounts by extraction and purification from crude preparations of thousands of pituitary glands. At present, large quantities of highly purified gonadotropin can be produced in the laboratory through recombinant DNA technology. The use of synthetic Gonadotropin Releasing Hormone (GnRH), the key regulator of the reproductive cascade in all vertebrates, triggers the secretion of the fish’s own gonadotropin. GnRHa, a synthetic GnRH analogue, is synthesized chemically and does not carry the risk of transmitting diseases to the broodstock. However, injection of GnRHa does not always result in 100% ovulation, and multiple injections are often necessary to induce ovulation. Development of controlled-release delivery systems for synthetic GnRHa has contributed to captive breeding of many commercially important fish species. Hormone implants formulated with cholesterol or ethylene-vinyl, or as biodegradable microspheres, have been efficient in inducing maturation and spawning in many cultured fish.

Sex control: Control of fish sex is useful where one sex displays advantageous characteristics, such as larger adult size, production of high-value caviar (sturgeon), faster growth rate, or higher age at first sexual maturation. Monosex populations of the more advantageous sex can be produced by direct sex control via steroid treatment (masculinisation by administration of androgens; feminisation by administration of estrogens); by genetic control combined with steroid treatment of broodstock (indirect hormonal treatment, gynogenesis, androgenesis); or by control of external factors (temperature, density, etc.). In tilapia, males are preferred for culture as they grow faster than females. The YY male technology in tilapia involves a genetic breeding programme that combines hormonal feminization of normal males (producing XY females) with subsequent mating of these XY females to normal males.
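The expected offspring ratios behind the YY male scheme can be worked out with a simple Punnett-square enumeration. The sketch below assumes an idealized XX/XY sex-determination system in which each parent passes on either sex chromosome with equal probability and all genotypes are equally viable, which real tilapia stocks may not satisfy. The function name `cross` is hypothetical, and the second cross (YY male with a normal XX female) is not described in the text; it is included only as an assumed downstream use of YY males.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(mother, father):
    """Return expected genotype frequencies from crossing two sex genotypes.

    Each genotype is a two-letter string such as "XX", "XY" or "YY".
    Assumes each parent passes on either chromosome with probability 1/2.
    """
    counts = Counter(
        "".join(sorted(egg + sperm)) for egg, sperm in product(mother, father)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Step 1: a hormonally feminized male (XY female) mated with a normal XY male.
step1 = cross("XY", "XY")
print(step1)

# Step 2 (assumed use): a YY male from step 1 mated with a normal XX female.
step2 = cross("XX", "YY")
print(step2)
```

Under these assumptions, one quarter of the step 1 progeny are YY, and every offspring of a YY male and an XX female is XY.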

Sterility: Inducing sterility in fish by manipulating reproduction would help to increase growth by reducing the energy spent on reproduction. Sterility can be achieved by ploidy manipulation to produce sterile triploids, or by transgenic approaches such as gene “knock-out” or gene “knock-down”.


Conclusion

Reproductive biotechnology has revolutionized animal breeding and genetic progress in the livestock industry. The application of biotechnology in aquaculture, including the use of synthetic hormones in induced breeding and the production of monosex populations, surrogate broodstock and transgenic fish, has played a major role in ensuring the continued expansion and intensification of aquaculture to meet the growing fish demand. The emerging techniques should be judiciously implemented for manipulation and improvement of the reproductive performance of livestock species.

Source of information

K. K. Choudhary, K. M. Kavya, A. Jerome, R. K. Sharma (2016). Advances in reproductive biotechnologies. Vet World 9(4): 388–395.

Role of Biotechnology in Assisted Reproduction, Science, 14 May 2014.

W. S. Lakra and S. Ayyappan (2002). Recent Advances in Biotechnology Applications to Aquaculture. International Symposium on Recent Advances in Animal Nutrition, 22nd September, New Delhi, pp. 455-461.


19. Use of molecular techniques in growth enhancement

Raymond J Angel

Introduction

The use of molecular techniques in aquaculture has the potential to alleviate the predicted fish shortages and price increases by enhancing production efficiency, minimizing costs and reducing disease. Growth enhancement of fish using molecular techniques can benefit aquaculture and can be more effective than traditional breeding techniques for developing new fish strains. In principle, the technology can be used to improve the growth rate of fish; control sexual maturation, sterility and sex differentiation; improve survival by increasing disease resistance against pathogens; adapt fish to extreme environments, for example through cold resistance; and alter the biochemical characteristics of the flesh to enhance its nutritional qualities. Since fish can be readily improved by application of molecular techniques, it is timely to consider what genetically modified (GM) fish are likely to offer in the future, in terms of both benefits and disadvantages (Maclean and Norman, 2003). Growth Hormone has been utilised extensively in recent years for construction of transgenic fishes with enhanced growth. Genetic engineering is an important tool for developing and improving traits of fish for aquaculture. Species showing high growth rates are widely used to isolate the Growth Hormone gene for the production of transgenic fish.

An overview of various target species used in growth enhancement using molecular techniques

Transgenic fish have been produced for numerous species, including non-commercial model species such as the Loach, Misgurnus anguillicaudatus (Maclean et al. 1987a), Medaka, Oryzias latipes (Ozato et al. 1986), Topminnows and Zebrafish, although Gong et al. (2002) have developed transgenic Rainbow zebrafish for the ornamental fish industry. Several experiments have evaluated transgenic farmed fish species, including Goldfish (Zhu et al. 1985), Common carp, Silver carp, Mud loach, Rainbow trout (Chourrout, 1986), Atlantic salmon, Coho salmon, Chinook salmon, Channel catfish (Dunham et al. 1987) and Nile tilapia (Brem et al. 1988). Additionally, gene transfer has been accomplished in a game fish, Northern pike (Gross et al. 1992).

Many species of fish have been used in studies for standardizing GH-based transgenesis. Even though many studies reported enhanced growth in the target species, some were unsuccessful for reasons that remain unknown. Some of these studies are listed for reference (Table 1).

Techniques for growth enhancement

There are many ways to enhance growth, including inbreeding, gynogenesis, androgenesis, selection, intraspecific crossbreeding, interspecific hybridization, polyploidy, sex reversal and breeding, nuclear transplantation and transgenesis. Cloned populations have been produced via gynogenesis and androgenesis (Dunham, 2004), but direct cloning of an individual fish of interest has not yet been accomplished. Gene transfer technology has had a great impact on modern biology and biotechnology (Powers et al. 1998). A number of fish species are in focus for gene transfer experiments and can be divided into two main groups: animals used in aquaculture (Fletcher and Davies, 1991; Hew et al. 1995; Chen and Lu, 1998) and model fish used in basic research (Chen and Lu, 1998). The major food fish species include Carp (Cyprinus sp.), Tilapia (Oreochromis sp.), Salmon (Salmo sp., Oncorhynchus sp.) and Channel catfish (Ictalurus punctatus), while Zebrafish (Danio rerio), Medaka (Oryzias latipes) and Goldfish (Carassius auratus) are used in basic research. Genetic engineering of farm animals offers great potential for improvement of selected genetic traits of agricultural significance. Several species of fish have also been used to exploit this technology for commercial purposes; examples include attempted induction of freeze resistance in transgenic salmon using an Anti-Freeze Protein gene (Fletcher et al., 1988) and production of growth-enhanced fish using novel Growth Hormone (GH) genes (Dunham et al., 1987; Brem et al., 1988; Penman et al., 1990) or an Insulin-like Growth Factor (IGF) gene (Chen et al., 1995). Although several species of fish have been used to produce lines of transgenic fish, in only a few cases have germline transmission and stable long-term transgene expression been satisfactorily demonstrated.

Techniques for gene transfer

Microinjection

Microinjection is the most successful and widely used technique for gene transfer in fish. Gene transfer research with fish began in the mid-1980s utilizing microinjection (Zhu et al. 1985; Dunham et al. 1987). Zhu et al. (1985) published the first report of transgenes microinjected into the fertilized eggs of goldfish. In almost all fish gene transfer research, the foreign gene was microinjected into the cytoplasm of one- to four-cell embryos (Hayat, 1989), as pronuclei are extremely difficult to visualize in live one-cell fish embryos.

To favour integration of the DNA, it should be injected into intact cells close to the cut site. The injection apparatus consists of a dissecting stereomicroscope and two micro-manipulators, one with a glass micro-needle for delivering the transgene and the other with a micropipette for holding the fish embryo in place (Fig. 1). The success of the microinjection technique depends on the nature of the egg chorion. A soft chorion facilitates microinjection, while a thick chorion limits the ability to visualize the target for injection of DNA. In many fishes (e.g. Atlantic salmon and rainbow trout), the egg chorion becomes tough and hard just after fertilization or on contact with water, which makes injecting the DNA difficult.

Steps of Microinjection Technique

(1) Desired eggs and sperm are stored separately under optimum conditions.

(2) Water and sperm are added to the eggs to initiate fertilization.

(3) Ten minutes after fertilization, the eggs are dechorionated by trypsinization.

(4) Fertilized eggs are microinjected with the desired DNA within a few hours of fertilization. In dechorionated eggs, the DNA is released into the centre of the germinal disc before the first cleavage. The time available for microinjection is the first 25 minutes, between fertilization and the first cleavage.

(5) After microinjection, the embryos are incubated in water until hatching takes place. Survival rates of microinjected fish embryos are about 30-80%, depending on the fish species.


Fig. 1. Microinjection technique.

Other methods

Microinjection is a tedious and slow procedure (Powers et al. 1992) and can result in high egg mortality (Dunham et al. 1987). After the initial development of microinjection, new techniques such as electroporation, retroviral integration, liposomal reverse-phase evaporation, sperm-mediated transfer and high-velocity micro-projectile bombardment were developed (Chen and Powers, 1990); these can sometimes produce large quantities of transgenic individuals more efficiently and in a shorter time. The first successful gene transfer utilizing electroporation produced integration rates and survival similar to those for microinjection (Inoue et al. 1990). Powers et al. (1992) demonstrated that electroporation can be more efficient than microinjection, with integration rates sometimes as high as 30-100%. Walker (1993) found that hatching rates were higher for electroporated than for microinjected channel catfish embryos, and that post-fertilization electroporation gave higher hatching rates than electroporation of sperm and then eggs prior to fertilization.

Environmental Concerns about Transgenic Fish and risk mitigation

The primary environmental concerns about releases of transgenic fish include competition with wild populations, movement of the transgene into the wild gene pool, and ecological disruptions due to changes in prey and other niche requirements in the transgenic variety versus the wild populations.

It is important to note that developers of transgenic fish are attempting to reduce or eliminate both gene flow and invasive species risks by sterilizing transgenic fish. Sterilization is relatively easy and inexpensive, but success rates are highly variable. In addition, sterilization does not necessarily neutralize environmental risks. Academic scientists note that an escaped, sterile fish might still engage in courtship and spawning behaviour, disrupting breeding in wild populations. Waves of escaped sterile fish could also create ecological disruptions as each group is replaced by another equally strong group of transgenic sterile fish.


Conclusion and future prospects

Transgenic fish technology has great potential in the aquaculture industry. By introducing desirable genetic traits into fishes, mollusks and crustaceans, superior transgenic strains can be produced for aquaculture. These traits include faster growth rates, improved food conversion efficiency, resistance to some known diseases, tolerance to low oxygen concentrations, and tolerance to extreme temperatures. Our laboratory and those of others have shown that transfer, expression and inheritance of fish growth hormone transgenes can be achieved in several fish species and that the resulting transgenics grow substantially faster than their non-transgenic siblings. This is a vivid example of the potential application of gene transfer technology to aquaculture.

However, to realize the full potential of transgenic fish technology in aquaculture or other biotechnological applications, several important scientific breakthroughs are required. These include:

(1) more efficient technologies for mass gene transfer,

(2) targeted gene transfer technologies such as embryonic stem cell gene transfer or ribozyme gene inactivation,

(3) suitable promoters to direct the expression of transgenes at optimal levels during the desired developmental stages,

(4) identified genes of desirable traits for aquaculture and other applications,

(5) information on the physiological, nutritional, immunological and environmental factors that maximize the performance of the transgenics, and

(6) assessment of the safety and environmental impacts of transgenic fish.

Once these problems are resolved, the commercial application of transgenic fish technology will be readily attained.

Table 1. Studies showing enhancement of growth achieved in different target organisms worldwide, with supporting citations

FAMILY AND SPECIES | CONSTRUCT | GROWTH | COUNTRY | SUPPORTING CITATION

Salmonidae
Atlantic salmon, Salmo salar | opAFP-csGH | 2–6-fold | Canada | Du et al. (1992) and Fletcher et al. (2004)
Coho salmon, Oncorhynchus kisutch | ssMT-ssGH | Up to 11-fold | Canada | Devlin et al. (1994a,b)
Coho salmon, Oncorhynchus kisutch | opAFP-csGH | 3–10-fold | Canada | Devlin et al. (1995a)
Chinook salmon, O. tshawytscha | opAFP-csGH | 6-fold | Canada | Devlin et al. (1995a)
Rainbow trout, O. mykiss | opAFP-csGH | 3.2-fold | Canada | Devlin et al. (1995a)
Cutthroat trout, O. clarki | opAFP-csGH | 6-fold | Canada | Devlin et al. (1995a)
Arctic charr, Salvelinus alpinus | Various constructs | Up to 14-fold | Finland | Pitkanen et al. (1999)
Rainbow trout, O. mykiss | ssGH-ssGH | None | Finland | Pitkanen et al. (1999)

Cichlidae
Nile tilapia, Oreochromis niloticus | opAFP-csGH | 2–4-fold | UK | Rahman et al. (1998; 2001) and Rahman and Maclean (1999)
Nile tilapia, Oreochromis niloticus | ssMT-ssGH | None | UK | Rahman et al. (1998; 2001) and Rahman and Maclean (1999)
Tilapia, O. hornorum hybrid | hCMV-GH | 82% | Cuba | Martinez et al. (1996)

Ictaluridae
Channel catfish, Ictalurus punctatus | RSVLTR-rtGH | Up to 26% | USA | Dunham et al. (1992)
Channel catfish, Ictalurus punctatus | mMT-hGH | None | USA | Dunham et al. (1987)

Heteropneustidae
Heteropneustes fossilis | Zpb-ypGH | 30–60% | India | Sheela et al. (1999)

Cyprinidae
Goldfish, Carassius auratus | mMT-hGH | None | PR China | Zhu et al. (1985)
Common carp, Cyprinus carpio | mMT-hGH | None | PR China | Zhu et al. (1989)
Common carp, Cyprinus carpio | cbA-gcGH | 42–80% | PR China | Zhu (1992) and Wang et al. (2001)
Catla, Catla catla | RSVLTR-rtGH | None | India | Sarangi et al. (1999)
Common carp, Cyprinus carpio | ccbA-ccGH | 4-fold | Israel | Hinits and Moav (1999)
Rohu, Labeo rohita | CMV-roGH | 4-fold | India | Venugopal et al. (2004)
Rohu, Labeo rohita | gcbA-roGH | 4.5–5.8-fold | India | Venugopal et al. (2004)

Esocidae
Northern pike | RSVLTR-bGH | 30% | USA | Gross et al. (1992)

Cobitidae
Mud loach, Misgurnus mizolepis | mlb-actin-mlGH | Up to 35-fold | Republic of Korea | Nam et al. (2001; 2002)

REFERENCES

Brem, G., Brenig, B., Horstgen-Schwark, G. and Winnacker, E. L., 1988. Gene transfer in tilapia (Oreochromis niloticus). Aquaculture, 68: 209-219.

Chen, T. T. and Lu, J. K., 1998. Transgenic fish technology: Basic principles and its application in basic and applied research. In: De la Fuente J. and Castro F.O. eds. Gene transfer in aquatic organisms. RG Landes Company and Germany: Springer-Verlag, Austin, Texas, USA., pp. 45-73.

Chen, T. T. and Powers, D. A., 1990. Transgenic fish. Trends in Biotechnology, 8: 209-214.

Chen, T. T., Lu, J. K., Shamblott, M. J., Cheng, C. M., Lin, C. M., Burns, J. C., Reimschuessel, R., Chatakondi, N. and Dunham, R. A., 1995. Transgenic fish: ideal models for basic research and biotechnological applications. Zool. Studies, 344: 215–234.

Chourrout, D., 1986. Techniques of chromosome manipulation in rainbow trout: a new evaluation with karyology. Theoretical and Applied Genetics, 72: 627-632.

Devlin, R. H., Byatt, J. C., McLean, E., Yesaki, T. Y., Krivi, G. G., Jaworski, E. G. and Donaldson, E. M., 1994b. Bovine placental lactogen is a potent stimulator of growth and displays strong binding to hepatic receptor sites of coho salmon. Gen. Comp. Endocrinol., 95: 31–41.

Devlin, R. H., Yesaki, T. Y., Biagi, C. A., Donaldson, E. M., Swanson, E. M. P. and Chan, W. K., 1994a. Extraordinary salmon growth. Nature, 371: 209–210.

Devlin, R. H., Yesaki, T. Y., Donaldson, E. M., Du, S. J. and Hew, C. L., 1995a. Production of germline transgenic Pacific salmonids with dramatically increased growth performance. Can. J. Fish. Aquat. Sci., 52: 1376–1384.

Du, S. J., Gong, Z. Y., Fletcher, G. L., Shears, M. A., King, M. J., Idler, D. R. and Hew, C. L., 1992. Growth enhancement in transgenic Atlantic salmon by the use of an all-fish chimeric growth hormone gene construct. BioTechnology, 10: 176–181.

Dunham, R. A., Ramboux, A. C., Duncan, P. L., Hayat, M., Chen, T. T., Lin, C. M., Kight, K., Gonzalez-Villasenor, I. and Powers, D. A., 1992. Transfer, expression and inheritance of salmonid growth hormone genes in channel catfish, Ictalurus punctatus, and effects on performance traits. Mol. Mar. Biol. Biotechnol., 1: 380–389.

Dunham, R. A., 2004. Aquaculture and Fisheries Biotechnology: Genetic Approaches. CABI Publishing, Wallingford, UK., 17: p. 400.


Dunham, R. A., Eash, J., Askins, J. and Townes, T. M., 1987. Transfer of the metallothionein-human growth hormone fusion gene into channel catfish. Transactions of the American Fisheries Society, 116: 87-91.

Fletcher, G. L., Shears, M. A., King, M. J., Davies, P. L. and Hew, C. L., 1988. Evidence for antifreeze protein gene transfer in Atlantic salmon (Salmo salar). Can. J. Fish. Aquat. Sci., 45: 352–357.

Fletcher, G. L., Shears, M. A., Yaskowiak, E. S., King, M. J. and Goddard, S. V., 2004. Gene transfer: potential to enhance the genome of Atlantic salmon for aquaculture. Aust. J. Exp. Agric., 44: 1095–1100.

Fletcher, G. L. and Davies, P. L., 1991. Transgenic fish for aquaculture. Gen. Eng., 13: 331-369.

Gong, Z., Wan, H., Ju, B., He, J., Wang, X. and Yan, T., 2002. Generation of living color transgenic zebrafish. In: Shimizu, N., Aoki, T., Hirono, I. and Takashima, F. (Eds.), Aquatic Genomics: Steps Toward a Great Future. Springer-Verlag, New York, NY., pp. 329-339.

Gross, M. L., Schneider, J. F., Moav, N., Moav, B., Alvarez, C., Myster, S. H., Liu, Z., Hallerman, E. M., Hackett, P. B., Guise, K. S., Faras, A. J. and Kapuscinski, A. R., 1992. Molecular analysis and growth evaluation of northern pike (Esox lucius) microinjected with growth hormone genes. Aquaculture, 103: 253-273.

Hayat, M., 1989. Transfer, expression and inheritance of growth hormone genes in channel catfish (Ictalurus punctatus) and common carp (Cyprinus carpio). Doctoral Dissertation. Auburn University, AL, USA.

Hew, C. L., Fletcher, G. L. and Davies, P. L., 1995. Transgenic salmon: tailoring the genome for food production. Journal of Fish Biology, 47: 1-19.

Hinits, Y. and Moav, B., 1999. Growth performance studies in transgenic Cyprinus carpio. Aquaculture, 173: 285–296.

Inoue, K., Yamashita, S., Hata, J. I., Kabeno, S., Asada, S., Nagahisa, E. and Fujita, T., 1990. Electroporation as a new technique for producing transgenic fish. Cell Differentiation and Development, 29: 123-128.

Maclean, N. and Norman., 2003. Genetically modified fish and their effects on food quality and human health and nutrition. Trends in Food Science & Technology, 14(5-8): 242-252.

Maclean, N., Penman, D. and Talwar, S., 1987a. Introduction of novel genes into fish. Biotechnology, 5: 257-261.

Martinez, R., Estrada, M. P., Berlanga, J., Guillen, I., Hernandez, O., Cabrera, E., Pimentel, R., Morales, R., Herrera, F., Morales, A., Pina, J. C., Abad, Z., Sanchez, V., Melamed, P., Lleonart, R. and de la Fuente, J., 1996. Growth enhancement in transgenic tilapia by ectopic expression of tilapia growth hormone. Mol. Mar. Biol. Biotechnol., 5: 62–70.


Nam, Y. K., Cho, Y. S., Cho, H. and Kim, D. S., 2002. Accelerated growth performance and stable germ-line transmission in androgenetically derived homozygous transgenic mud loach, Misgurnus mizolepis. Aquaculture, 209: 257–270.

Nam, Y. K., Noh, J. K., Cho, Y. S., Cho, H. J., Cho, K. N., Kim, C. G. and Kim, D. S., 2001. Dramatically accelerated growth and extraordinary gigantism of transgenic mud loach Misgurnus mizolepis. Transgenic Res., 10: 353–362.

Ozato, K., Kondoh, H., Inohara, H., Iwamatsu, T., Wakamatsu, Y. and Okada, T. S., 1986. Production of transgenic fish: introduction and expression of chicken delta-crystallin gene in medaka embryos. Cell Differ. Dev., 19: 237-244.

Penman, D. J., Beeching, A. J., Penn, S. and Maclean, N., 1990. Factors affecting survival and integration following microinjection of novel DNA into rainbow trout eggs. Aquaculture, 85: 35-50.

Pitkanen, T. I., Krasnov, A., Teerijoki, H. and Molsa, H., 1999. Transfer of growth hormone (GH) genes into Arctic charr (Salvelinus alpinus L.). I. Growth response to various GH constructs. Genet. Anal.: Biomol. Eng., 15: 91–98.

Powers, D. A., Cole, T., Creech, K., Chen, T. T., Lin, C. M., Kight, K. and Dunham, R., 1992. Electroporation: a method for transferring genes into the gametes of zebrafish, Brachydanio rerio, channel catfish, Ictalurus punctatus, and common carp, Cyprinus carpio. Mol. Mar. Biol. Biotech., 1: 301-309.

Powers, D. A., Gómez-Chiarri, M., Chen, T. T. and Dunham, R., 1998. Genetic engineering of finfish and shellfish. In: De la Fuente J. and Castro F.O. eds. Gene transfer in aquatic organisms. RG Landes Company and Germany, Springer-Verlag, Austin, Texas, USA. pp. 17-34.

Rahman, M. A. and Maclean, N., 1999. Growth performance of transgenic tilapia containing an exogenous piscine growth hormone gene. Aquaculture, 173: 333–346.

Rahman, M. A., Mak, R., Ayad, H., Smith, A. and Maclean, N., 1998. Expression of a novel piscine growth hormone gene results in growth enhancement in transgenic tilapia (Oreochromis niloticus). Transgenic Res., 7: 357–369.

Rahman, M. A., Ronyai, A., Engidaw, B. Z., Jauncey, K., Hwang, G. L., Smith, A., Roderick, E., Penman, D., Varadi, L. and Maclean, N., 2001. Growth and nutritional trials on transgenic Nile tilapia containing an exogenous fish growth hormone gene. J. Fish Biol., 59: 62–78.

Sarangi, N., Mandall, A. B., Bandyopadhyay, A. K., Venugopal, T., Mathavan, S. and Pandian, T. J., 1999. Electroporated sperm-mediated gene transfer in Indian major carps. Asia-Pacific J. Mol. Biol. Biotechnol., 7: 151–158.

Sheela, S. G., Pandian, T. J. and Mathavan, S., 1999. Electroporatic transfer, stable integration, and transmission of pZp beta ypGH and pZp beta rtGH in Indian catfish, Heteropneustes fossilis (Bloch). Aquac. Res., 30: 233–248.

Venugopal, T., Anathy, V., Kirankumar, S. and Pandian, T. J., 2004. Growth enhancement and food conversion efficiency of transgenic fish, Labeo rohita. J. Exp. Zool., 301A: 477–490.


Walker, D. S., 1993. Effect of electroporation and microinjection on survival of ictalurid catfish embryos. Master of Science Thesis. Auburn University, AL.

Wang, Y., Hu, W., Wu, G., Sun, Y., Chen, S., Zhang, F., Zhu, Z., Feng, J. and Zhang, X., 2001. Genetic analysis of "all-fish" growth hormone gene transferred carp (Cyprinus carpio L.) and its F1 generation. Chin. Sci. Bull., 46: 1174–1177.

Zhu, Z., 1992. Generation of fast-growing transgenic fish: methods and mechanisms. In: Hew, C. L., Fletcher, G. L. (Eds.), Transgenic Fish. World Publishing, Singapore, pp. 92–119.

Zhu, Z., Li, G., He, L. and Chen, S., 1985. Novel gene transfer into the fertilized eggs of goldfish (Carassius auratus, 1758). Journal of Applied Ichthyology, 1: 31-33.

Zhu, Z., Xu, K., Xie, Y., Li, G. and He, L., 1989. A model of transgenic fish. Sci. Sin., B 2: 147–155.


20. Gene Editing Tools and their Application in Aquaculture

Misha Soman

Introduction

Genome editing is a type of genetic engineering in which a gene of interest is inserted, deleted or modified in the genome of an organism or cell using engineered nucleases, often called "molecular scissors." These nucleases create site-specific double-strand breaks (DSBs) at desired locations in the genome. The induced breaks are repaired through non-homologous end-joining (NHEJ) or homologous recombination (HR), resulting in targeted mutations ('edits'). Editing the genome in this way changes the characteristics of a cell or an organism.

An engineered nuclease has two parts: a nuclease part that cuts the DNA, and a DNA-targeting part designed to guide the nuclease to a specific DNA sequence. Once a cut forms at a particular place in the DNA, the cell starts to repair it naturally. Gene editing technologies have wide applications in different fish species for basic as well as applied research, including disease modeling and aquaculture.

Genome editing can be used:

For research: to alter the DNA of organisms and study the impact of gene modification.

To treat disease: genome editing is being evaluated in medical research as a possible treatment for deadly human diseases such as leukemia and other cancers, and AIDS (Ophinni et al., 2018; Tebas et al., 2014).

For biotechnology: genome editing has been used in agriculture to produce genetically modified crops with improved yield and disease resistance, as well as genetically modified pigs (Wang et al., 2015; Kang et al., 2017), sheep (Crispo et al., 2015) and fishes (Khalil et al., 2017).

TYPES OF GENE EDITING

➢ Zinc finger nucleases (ZFNs)

➢ TALENs

➢ CRISPR-Cas9

ZINC FINGER NUCLEASES (ZFNs)

Zinc finger nucleases (ZFNs) are engineered restriction nucleases produced by joining a zinc finger DNA-binding domain to the DNA-cleavage domain of FokI. They promote targeted editing of the genome by generating double-strand breaks in DNA at chosen locations. A ZFN is composed of three to six zinc finger motifs, and each motif recognizes three nucleotides in a DNA sequence; hence each ZFN recognizes a target of 9 to 18 base pairs. Cleavage of the target DNA requires two ZFNs to bind at the site so that their FokI domains dimerize, which produces a double-strand break (DSB) at the target locus (Durai et al., 2005). Double-strand breaks are important for site-specific mutagenesis because they stimulate the cell's natural DNA-repair processes, homology-directed repair (HDR) and non-homologous end joining (NHEJ); these reagents can therefore be used to modify the genome precisely.
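The arithmetic behind the 9 to 18 base pair figure can be sketched in a few lines of Python. This is an illustration only: the function names `zfn_site_length` and `zfn_pair_length` are hypothetical, and the only rule used is the one stated above, three nucleotides recognized per zinc finger motif.

```python
BP_PER_FINGER = 3  # each zinc finger motif recognizes three nucleotides


def zfn_site_length(n_fingers):
    """Base pairs recognized by one ZFN built from n_fingers motifs."""
    if not 3 <= n_fingers <= 6:
        raise ValueError("the text describes ZFNs with three to six fingers")
    return n_fingers * BP_PER_FINGER


def zfn_pair_length(left_fingers, right_fingers):
    """Base pairs recognized by a dimerizing ZFN pair (spacer not counted)."""
    return zfn_site_length(left_fingers) + zfn_site_length(right_fingers)


if __name__ == "__main__":
    for n in range(3, 7):
        print(n, "fingers recognize", zfn_site_length(n), "bp")
    print("two 3-finger ZFNs together recognize", zfn_pair_length(3, 3), "bp")
```

Because cleavage needs two ZFNs, the pair as a whole reads twice as many bases as a single ZFN, which is one reason a dimerizing design is more specific than a single binding module.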

Fig 1: Structure of a zinc finger nuclease. The zinc finger module (DNA-binding domain) and the FokI catalytic module (DNA-cleavage domain) are fused together, forming a highly specific pair of 'genomic scissors'.

TALENs

Transcription activator-like effector nuclease (TALEN) technology uses engineered restriction enzymes generated by fusing a TAL effector DNA-binding domain to a DNA-cleavage domain (FokI). Such enzymes can be designed to cut any desired DNA sequence precisely. When they are introduced into cells, they make double-strand breaks in the gene of interest. The nucleases consist of programmable, sequence-specific DNA-binding modules coupled to a non-specific DNA-cleavage domain. This allows accurate and efficient genetic alterations: the targeted double-strand breaks stimulate cellular DNA repair, including error-prone NHEJ and HDR.

The DNA-binding domain contains a repeated, highly conserved sequence of 33–34 amino acids in which the 12th and 13th amino acids diverge. These two positions, referred to as the Repeat Variable Diresidue (RVD), are highly variable and show a strong correlation with specific nucleotide recognition. Different RVDs allow each module to recognize one individual nucleotide, instead of three nucleotides as in a ZFN (Moscou and Bogdanove, 2009). The dimerized FokI cleaves the DNA non-specifically within the spacer between the left and right TALEN target sites.
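The one-repeat-per-nucleotide rule can be illustrated with a short Python sketch. The RVD table used here (NI for A, HD for C, NG for T, NN for G) is the commonly reported simplified code and is not given in this chapter; in practice NN also recognizes A, so treat the table as an assumption. The function names are hypothetical.

```python
# Simplified RVD-to-nucleotide code (an assumption, not taken from this chapter).
RVD_CODE = {"NI": "A", "HD": "C", "NG": "T", "NN": "G"}


def decode_rvds(rvds):
    """Return the DNA sequence read by an ordered list of TALE repeat RVDs."""
    try:
        return "".join(RVD_CODE[rvd] for rvd in rvds)
    except KeyError as err:
        raise ValueError("unknown RVD: %s" % err.args[0])


def rvds_for_target(target):
    """Return one RVD per nucleotide for a target written in A, C, G, T."""
    base_to_rvd = {base: rvd for rvd, base in RVD_CODE.items()}
    return [base_to_rvd[base] for base in target.upper()]


if __name__ == "__main__":
    repeats = rvds_for_target("GATTACA")
    print(len(repeats), "repeats:", "-".join(repeats))
    print("decoded back:", decode_rvds(repeats))
```

A target of n bases therefore needs n TALE repeats, whereas a ZFN needs one finger for every three bases.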

Fig 2: TALENs mechanism. The TALE module (DNA-binding domain) is fused to the FokI catalytic module (DNA-cleavage domain).

CRISPR-Cas9

The clustered regularly interspaced short palindromic repeat (CRISPR) and CRISPR-associated protein 9 (Cas9) system has emerged as a faster, cheaper and more precise gene editing tool in a wide range of organisms. It derives from an adaptive immunity mechanism that prokaryotes use to eliminate invading genetic material: the foreign genetic material is cut into fragments and integrated into the CRISPR locus as a series of short sequences (about 20 bp) between the repeats. The locus is transcribed and processed into small RNAs, called guide RNAs, which guide nucleases to cleave the target DNA on the basis of sequence complementarity. This technology enables geneticists and medical researchers to edit the genome by adding, removing or altering the DNA sequence.

The CRISPR-Cas9 system has two key components that introduce a mutation into the targeted DNA: the enzyme Cas9 and a piece of RNA called the guide RNA. Cas9 acts as a molecular scissor that cuts double-stranded DNA at a specific targeted site, so that bits of sequence can be added or removed. The guide RNA (gRNA) contains a pre-designed sequence about 20 bases long located within a longer RNA scaffold. The scaffold binds to Cas9, and the pre-designed sequence pairs with the complementary DNA and 'guides' the Cas9 nuclease to the targeted region of the genome, ensuring that the enzyme cuts at the right point.
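A common first step when designing a guide RNA is to list candidate target sites in a sequence. The sketch below assumes the widely used Streptococcus pyogenes Cas9, which needs an NGG PAM immediately 3' of the roughly 20-base target; the PAM motif and the function name `find_cas9_targets` are assumptions that do not come from the text above, and only the forward strand is scanned.

```python
import re


def find_cas9_targets(sequence, guide_length=20):
    """Return (start, target, pam) for each forward-strand site where a
    guide_length-base target is followed directly by an NGG PAM."""
    seq = sequence.upper()
    # The lookahead reports overlapping candidate sites as well.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_length)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    demo = "TTTT" + "ACGT" * 5 + "AGG" + "TT"  # made-up sequence for illustration
    for start, target, pam in find_cas9_targets(demo):
        print("position", start, "target", target, "PAM", pam)
```

A real design workflow would also scan the reverse complement and score each candidate for off-target matches elsewhere in the genome; both steps are left out to keep the sketch short.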

Fig 3: CRISPR-Cas9 mechanism. The sgRNA directs Cas9 to the target sequence, which lies next to the PAM sequence.

DNA double-strand break (DSB) repair mechanisms

Most DSBs are repaired by either the non-homologous end joining (NHEJ) pathway or the homology-directed repair (HDR) pathway. The NHEJ pathway is error-prone and usually introduces small nucleotide insertions or deletions (indels) at the cleavage site. Indels whose length is not a multiple of three cause frameshift mutations, which often create premature stop codons inside the open reading frame (ORF) of the targeted gene. The result is gene disruption and loss of function of the targeted gene.
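The effect of an NHEJ indel on the reading frame can be shown with a toy example. The sketch below uses the standard genetic code and a made-up 24-base ORF; the sequence and the function names `translate` and `delete_bases` are hypothetical and serve only to contrast a 1-base deletion (frameshift) with a 3-base deletion (in-frame).

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {}
for i, first in enumerate(BASES):
    for j, second in enumerate(BASES):
        for k, third in enumerate(BASES):
            CODON_TABLE[first + second + third] = AMINO_ACIDS[16 * i + 4 * j + k]


def translate(orf):
    """Translate an ORF codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(orf) - 2, 3):
        residue = CODON_TABLE[orf[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def delete_bases(seq, start, length):
    """Return seq with `length` bases removed beginning at index `start`."""
    return seq[:start] + seq[start + length:]


if __name__ == "__main__":
    wild_type = "ATGGCCTTGACCGGTAAACATTAA"  # hypothetical ORF
    print("wild type      :", translate(wild_type))
    print("1-base deletion:", translate(delete_bases(wild_type, 3, 1)))
    print("3-base deletion:", translate(delete_bases(wild_type, 3, 3)))
```

In this toy case the single-base deletion shifts the frame so that a stop codon appears after two residues, while the three-base deletion only removes one amino acid and leaves the rest of the protein intact.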

Homology-directed repair (HDR) is a process of homologous recombination in which a DNA template is used for precise repair of a double-strand break (DSB). The template can come from the cell itself, where it is available during the late S and G2 phases of the cell cycle before the completion of mitosis, or it can be an exogenous repair template delivered into the cell, usually as a synthetic single-stranded DNA donor oligo or a DNA donor plasmid, to generate a specific change in the genome.


Advantages of the CRISPR-Cas9 system over ZFNs and TALENs

➢ Highly efficient mutagenesis

➢ Effective introduction of targeted indels at the required genomic location

➢ Targeting efficiency >80%

➢ In the CRISPR-Cas9 system only one customized sgRNA is required to target a specific sequence; the same Cas9 can be used for all target sequences.

➢ ZFNs and TALENs require design and assembly of two nucleases for each target site.

➢ sgRNAs are short sequences (<100 bp), which reduces complications.

Applications of gene editing tools in fishes

Fish species, especially model species such as the zebrafish, have played important roles in testing new genome editing protocols because of the biological advantages of fish models. A large number of genes have been disrupted or modified in fish species for functional studies, especially genes involved in reproduction. These gene editing technologies can be used to modify the genomes of a variety of industrially relevant organisms and standard research animals, including zebrafish, rats, pigs and catfish. Genome editing tools allow cis-regulatory mechanisms to be investigated and gene knockdowns or knockouts to be produced, revealing unexplored processes of animal development and gene function for use in basic and applied sciences. Genome editing can be used to study early embryogenesis, to induce mutations, to produce knockout lines and to unravel ancestral features of chordate development. It also supports systematic functional analysis of reproductive performance in fishes, disease resistance, tolerance to environmental stressors, sex determination and sex differentiation, functional analysis of genes in non-reproductive functions such as pigmentation, growth and development, and disease modeling and drug screening.

CRISPR is one of the most useful and powerful tools for gene manipulation in fish, although off-target mutation is a serious concern. The frequency of off-target mutations has been reported to fall when the concentration of gRNA in the injection is lowered. In zebrafish, genome editing tools have been applied mainly to induce mutations that give valuable insights for medical science.

Disruption of the myostatin (MSTN) gene, a suppressor of muscle growth, by CRISPR/Cas9 was carried out successfully in channel catfish, Ictalurus punctatus, with 88–100% rates of mutagenesis in the protein-coding sites of MSTN. The MSTN-edited fry had more muscle cells, and their mean body weight increased by 29.7%. Alignment of the mutated sequences against the wild type showed multiple insertions and deletions (Khalil et al., 2017). In India, the Central Institute of Freshwater Aquaculture successfully disrupted the Toll-like receptor 22 (TLR22) gene of Labeo rohita (rohu), which is involved in innate immunity and is present only in teleost fishes and amphibians, using CRISPR/Cas9 technology; the mutants lacked TLR22 mRNA expression (Chakrapani et al., 2016). These results confirm that CRISPR/Cas9 is a highly efficient tool for editing fish genomes and opens ways to promote genetic enhancement and functional genomics in fish.

Conclusion

Gene editing tools are widely used to study the impact of gene manipulation in humans, animals, vegetables and fish for various purposes. With high-efficiency gene editing in fishes, we are entering a new era of powerful technologies for studying gene functions and improving traits. These technologies will give insights into gene function and the evolution of vertebrates, and offer the possibility of treating deadly human diseases in medical research and of creating improved varieties in agriculture, livestock and aquaculture. In the aquaculture industry, this approach may pave the way for growth-enhanced fishes that increase productivity.

References

www.Yourgenome.org

www.Genetherapynet.com

Khalil, K., Elayat, M., Khalifa, E., Daghash, S., Elaswad, A., Miller, M., Abdelrahman, H., Ye, Z., Odin, R., Drescher, D., Vo, K., Gosh, K., Bugg, W., Robinson, D. and Dunham, R., 2017. Generation of Myostatin Gene-Edited Channel Catfish (Ictalurus punctatus) via Zygote Injection of CRISPR/Cas9 System. Scientific Reports, 7: 7301.

Zhu, B. and Ge, W., 2018. Genome editing in fishes and their applications. General and Comparative Endocrinology, 257: 3-12.

Chakrapani, V., Patra, S. K., Panda, R. P., Rasal, K. D., Jayasankar, P. and Barman, H. K., 2016. Establishing targeted carp TLR22 gene disruption via homologous recombination using CRISPR/Cas9. Developmental and Comparative Immunology, 61: 242-247.

Wang, K., Ouyang, H., Xie, Z., Yao, C., Guo, N., Li, M., Jiao, H. and Pang, D., 2015. Efficient Generation of Myostatin Mutations in Pigs Using the CRISPR/Cas9 System. Scientific Reports, 5: 16623.

Crispo, M., Mulet, A. P., Tesson, L., Barrera, N., Cuadro, F., dos Santos-Neto, P. C., Nguyen, T. H., Crénéguy, A., Brusselle, L., Anegón, I. and Menchaca, A., 2015. Efficient Generation of Myostatin Knock-Out Sheep Using CRISPR/Cas9 Technology and Microinjection into Zygotes. PLoS ONE, 10(8): e0136690.

Ophinni, Y., Inoue, M., Kotaki, T. and Kameoka, M., 2018. CRISPR/Cas9 system targeting regulatory genes of HIV-1 inhibits viral replication in infected T-cell cultures. Scientific Reports, 8: 7784.

Tebas, P., Stein, D., Tang, W. W., Frank, I., Wang, S. Q., Lee, G., Spratt, S. K., Surosky, R. T., Giedlin, M. A., Nichol, G., Holmes, M. C., Gregory, P. D., et al., 2014. Gene Editing of CCR5 in Autologous CD4 T Cells of Persons Infected with HIV. The New England Journal of Medicine, 370: 901-910.


GLOSSARY

➢ Read – Base pair information of a given length from a DNA or cDNA fragment contained in a sequencing library. Different sequencing platforms are capable of generating different read lengths.

➢ Single End Read – The sequence of the DNA is obtained from the 5' end of only one strand of the insert. These reads are typically expressed as 1x "y", where "y" is the length of the read in base pairs (e.g. 1x50bp, 1x75bp).

➢ Paired End Read – The sequence of the DNA is obtained from the 5' ends of both strands of the insert. These reads are typically expressed as 2x "y", where "y" is the length of the read in base pairs (e.g. 2x100bp, 2x150bp).

➢ Mate Pair Read – The sequence of the DNA is obtained in a similar way to paired-end reads; however, the DNA insert is often much larger (2-10 kb in length) and the paired reads originate from a single strand of the DNA insert.

➢ Depth of Coverage – The number of reads that span a given DNA sequence of interest. This is commonly expressed as "Yx", where "Y" is the number of reads and "x" is the unit reflecting the depth of coverage metric (e.g. 5x, 10x, 20x, 100x).

➢ Sequencing Depth – The amount of sequencing a given sample requires to achieve a certain depth of coverage. This is frequently expressed as the number of reads a sample requires (e.g. 40 million reads, 80 million reads) or the number of bases of sequencing a sample requires (e.g. 4 gigabases, 100 megabases).

➢ SNP/SNV – A Single Nucleotide Polymorphism or Single Nucleotide Variant detected in a sample.

➢ InDels – One or more insertion or deletion events detected in a sample.

➢ Annotation - Adding biological information to a genome sequence. This is a very complex task, and the process for doing it is rapidly evolving. Features that are added to the genome often include gene models, SNPs and STSs.

➢ Copy Number Variation (CNV) - Large-scale structural changes in DNA that vary from individual to individual. These include insertions, deletions, duplications and complex multi-site variants that range from kilobases to megabases in size. CNV can influence gene expression and phenotypic variation and alter gene dosage, and in certain instances may be associated with developmental disorders, cause disease or confer susceptibility to complex disease traits.

➢ EST (Expressed sequence tag) - Single-pass sequences of cDNA clones. Databases of EST sequences are highly redundant but quite useful for gene identification. There are many efforts to cluster EST sequences to remove redundancy and low-quality sequences.

➢ Haplotype (haploid genotype) - A set of closely linked genetic markers present on one chromosome that tend to be inherited together. A haplotype may also refer to a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated with one another.

➢ Reference sequence/genome - A fully assembled version of a genome that can be used for mapping short DNA sequence reads for comparisons of genomes from various individuals.


➢ Contig - A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA. In sequencing projects, a contig refers to overlapping sequence data (reads).

➢ Scaffold - A scaffold is a portion of the genome sequence reconstructed from end-sequenced whole-genome shotgun clones. Scaffolds are composed of contigs and gaps.

➢ Specificity - The percentage of sequences that map to the intended targets out of total bases per run.

➢ Homopolymer - Uninterrupted stretch of a single nucleotide type (e.g., TTT or GGGGGG).

➢ Base Call - Base calling is the process of assigning bases (nucleobases) to chromatogram peaks.

➢ Homology

❖ Ortholog - Orthologous sequences are homologous sequences in different species that have a common origin. Orthologs diverge through gradual evolutionary modification from the common ancestor and generally perform the same function in different species.

❖ Paralog - Paralogous sequences are homologous sequences that exist within a species. They have a common origin but arise through gene duplication events and generally perform different functions in the same species.

❖ BLAST E-values - The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms, introduced in 1990, that are used to search sequence databases for optimal local alignments to a query.

❖ The E-value is the number of alignments expected to be found by chance with a score at least as high as that of the alignment under consideration. It is calculated with the formula E = (query length) * (length of database) * 2^(-S), where S is the bit score of the alignment. A good, biologically significant E-value would be 0.05 or less.
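The E-value formula in the entry above can be evaluated directly. The following Python sketch is illustrative only: the function names and the example numbers are hypothetical, and S is assumed to be the bit score reported by BLAST.

```python
def blast_evalue(query_length, database_length, bit_score):
    """E = (query length) * (database length) * 2^(-S), with S the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def is_significant(evalue, threshold=0.05):
    """Apply the 0.05 cut-off quoted in the glossary entry."""
    return evalue <= threshold


if __name__ == "__main__":
    # Hypothetical search: 100-residue query against a 1,000,000-residue database.
    for score in (20, 30, 40):
        e = blast_evalue(100, 1_000_000, score)
        print("bit score", score, "E-value %.3g" % e, "significant:", is_significant(e))
```

Two properties follow from the formula: each additional bit of score halves the E-value, and the same alignment becomes less significant as the database grows.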

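N50: The length of the shortest contig in the smallest set of largest contigs whose combined length is equal to or greater than half the total assembly size.

L50: The smallest number of contigs whose combined length reaches half the total assembly size; the shortest of these contigs has length N50.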
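N50 and L50 are easiest to understand by computing them once. The sketch below follows the usual convention of measuring against the total assembled length (summed contig lengths); the function name `n50_l50` and the example contig lengths are hypothetical.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # `length` is the shortest contig needed to reach half the assembly.
            return length, count
    raise ValueError("at least one contig is required")


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20, 10]  # made-up lengths, in kb
    n50, l50 = n50_l50(contigs)
    print("N50 =", n50, "kb; L50 =", l50, "contigs")
```

For the example list the total is 300 kb, so the two largest contigs (80 + 70 = 150 kb) already reach half of the assembly; N50 is therefore the length of the second contig and L50 is two.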

BLAST type    Query and subject

blastn    query is DNA, subject is DNA

blastp    query is protein, subject is protein

blastx    query is nucleic acid that is translated by the program into protein sequences (all 6 reading frames); subject database is protein

tblastn    query is protein; database is DNA translated into protein sequences in all 6 reading frames

tblastx    query is DNA translated into protein, subject is nucleotide translated into protein. Both are translated in all 6 frames. It is very slow relative to the other BLAST types.
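The Depth of Coverage and Sequencing Depth entries in this glossary are linked by a simple planning relationship: mean coverage = (number of reads × read length) / genome size. The Python sketch below applies it in both directions. The function names and the example genome sizes are hypothetical, and for paired-end data each pair counts as two reads.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average depth of coverage obtained from n_reads of a given length."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical 2 Gb genome sequenced with 40 million 100-base reads.
    print("coverage: %.1fx" % mean_coverage(40_000_000, 100, 2_000_000_000))
    # Hypothetical 5 Mb bacterial genome, 150-base reads, 30x target.
    print("reads needed:", reads_needed(30, 150, 5_000_000))
```

This gives only the average depth; actual coverage varies along the genome, so sequencing projects usually aim above the minimum depth an analysis requires.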

ResearchGate has not been able to resolve any citations for this publication.

CRISPR/Cas9 system targeting regulatory genes of HIV-1 inhibits viral replication in infected T-cell cultures

Article

Full-text available

May 2018

The CRISPR/Cas9 system provides a novel and promising tool for editing the HIV-1 proviral genome. We designed RNA-guided CRISPR/Cas9 targeting the HIV-1 regulatory genes tat and rev with guide RNAs (gRNA) selected from each gene based on CRISPR specificity and sequence conservation across six major HIV-1 subtypes. Each gRNA was cloned into lentiCRISPRv2 before co-transfection to create a lentiviral vector and transduction into target cells. CRISPR/Cas9 transduction into 293 T and HeLa cells stably expressing Tat and Rev proteins successfully abolished the expression of each protein relative to that in non-transduced and gRNA-absent vector-transduced cells. Tat functional assays showed significantly reduced HIV-1 promoter-driven luciferase expression after tat-CRISPR transduction, while Rev functional assays revealed abolished gp120 expression after rev-CRISPR transduction. The target gene was mutated at the Cas9 cleavage site with high frequency and various indel mutations. Conversely, no mutations were detected at off-target sites and Cas9 expression had no effect on cell viability. CRISPR/Cas9 was further tested in persistently and latently HIV-1-infected T-cell lines, in which p24 levels were significantly suppressed even after cytokine reactivation, and multiplexing all six gRNAs further increased efficiency. Thus, the CRISPR/Cas9 system targeting HIV-1 regulatory genes may serve as a favorable means to achieve functional cures.

Electroporated sperm mediated gene transfer in Indian major carps

Article

Full-text available

Jan 1999

The rainbow trout growth hormone (rtGH) gene under the transcription control of long terminal repeat of Rous sarcoma virus was successfully introduced into three Indian major carps (IMC) viz. rohu (Labeo rohita) catla (Catla catla) and mrigal (Cirrihinus mrigala), en route electroporated sperms. The present communication enumerates the results of an array of electroporation experiments aimed at standardising diverse variables like voltage, capacitance, resistance and pulse constant to achieve maximum transformation efficiency. Stable genomic integration of the intruded alien gene was demonstrated through slot blot hybridization in all the three species. Per cent transgenic individuals were largely varied in the different species, maximum of 25% being observed in rohu followed by 23 and 13% in mrigal and catla, respectively. This indicates that electroporated sperm mediated gene transfer may emerge as a convenient means in fish transgenic development specially in IMC. Optimisation of electroporation condition for rtGH transformation to produce transgenic IMC are discussed hereunder in detail.

Generation of Myostatin Gene-Edited Channel Catfish (Ictalurus punctatus) via Zygote Injection of CRISPR/Cas9 System

Article

Full-text available

Dec 2017

The myostatin (MSTN) gene is important because of its role in regulation of skeletal muscle growth in all vertebrates. In this study, CRISPR/Cas9 was utilized to successfully target the channel catfish, Ictalurus punctatus, muscle suppressor gene MSTN. CRISPR/Cas9 induced high rates (88–100%) of mutagenesis in the target protein-encoding sites of MSTN. MSTN-edited fry had more muscle cells (p < 0.001) than controls, and the mean body weight of gene-edited fry increased by 29.7%. The nucleic acid alignment of the mutated sequences against the wild-type sequence revealed multiple insertions and deletions. These results demonstrate that CRISPR/Cas9 is a highly efficient tool for editing the channel catfish genome, and opens ways for facilitating channel catfish genetic enhancement and functional genomics. This approach may produce growth-enhanced channel catfish and increase productivity.

Effects of marker density and population structure on the genomic prediction accuracy for growth trait in Pacific white shrimp Litopenaeus vannamei

Article

Full-text available

May 2017
BMC GENET

Background: Due to the great advantages in selection accuracy and efficiency, genomic selection (GS) has been widely studied in livestock, crop and aquatic animals. Our previous study based on one full-sib family of Litopenaeus vannamei (L. vannamei) showed that GS was feasible in penaeid shrimp. However, the applicability of GS might be influenced by many factors, including heritability, marker density and population structure. Therefore it is necessary to evaluate the major factors affecting the prediction ability of GS in shrimp. The aim of this study was to evaluate the factors influencing the GS accuracy for growth traits in L. vannamei. Genotype and phenotype data of 200 individuals from 13 full-sib families were used for this analysis. Results: In the present study, the heritability of growth traits in L. vannamei was first estimated based on the full set of markers (23 K). It was 0.321 for body weight and 0.452 for body length. The estimated heritability increased rapidly as the marker density increased from 0.05 K to 3.2 K, and then it tended to be stable for both traits. For genomic prediction of the growth traits in L. vannamei, three statistical models (RR-BLUP, BayesA and Bayesian LASSO) showed similar performance for the prediction accuracy of genomic estimated breeding value (GEBV). The prediction accuracy improved with increasing marker density. However, marker density had only a weak effect on the prediction accuracy after the marker number reached 3.2 K. In addition, the genetic relationship between reference and validation population could influence the GS accuracy significantly. A distant genetic relationship between reference and validation population resulted in a poor performance of genomic prediction for growth traits in L. vannamei.
Conclusions: For growth traits with moderate or high heritability, such as body weight and body length, about 3.2 K SNPs distributed evenly along the genome were able to satisfy the need for accurate GS prediction in the investigated L. vannamei population. The genetic relationship between the reference population and the validation population showed significant effects on the accuracy of genomic prediction. Therefore it is very important to optimize the design of the reference population when applying GS to shrimp breeding.

A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions

Article

May 2017
BMC GENOMICS

Background: Tilapias are the second most farmed fishes in the world and a sustainable source of food. Like many other fish, tilapias are sexually dimorphic and sex is a commercially important trait in these fish. In this study, we developed a significantly improved assembly of the tilapia genome using the latest genome sequencing methods and show how it improves the characterization of two sex determination regions in two tilapia species. Results: A homozygous clonal XX female Nile tilapia (Oreochromis niloticus) was sequenced to 44X coverage using Pacific Biosciences (PacBio) SMRT sequencing. Dozens of candidate de novo assemblies were generated and an optimal assembly (contig NG50 of 3.3 Mbp) was selected using principal component analysis of likelihood scores calculated from several paired-end sequencing libraries. Comparison of the new assembly to the previous O. niloticus genome assembly reveals that recently duplicated portions of the genome are now well represented. The overall number of genes in the new assembly increased by 27.3%, including a 67% increase in pseudogenes. The new tilapia genome assembly correctly represents two recent vasa gene duplication events that have been verified with BAC sequencing. A total of 146 Mbp of additional transposable element sequence is now assembled, a large proportion of which are recent insertions. Large centromeric satellite repeats are assembled and annotated in cichlid fish for the first time. Finally, the new assembly identifies the long-range structure of both a ~9 Mbp XY sex determination region on LG1 in O. niloticus, and a ~50 Mbp WZ sex determination region on LG3 in the related species O. aureus. Conclusions: This study highlights the use of long read sequencing to correctly assemble recent duplications and to characterize repeat-filled regions of the genome. The study serves as an example of the need for high quality genome assemblies and provides a framework for identifying sex determining genes in tilapia and related fish species.

Aquaculture genomics, genetics and breeding in the United States: current status, challenges, and priorities for future research The Aquaculture Genomics, Genetics and Breeding Workshop

Article

Dec 2017
BMC GENOMICS

Advancing the production efficiency and profitability of aquaculture is dependent upon the ability to utilize a diverse array of genetic resources. The ultimate goals of aquaculture genomics, genetics and breeding research are to enhance aquaculture production efficiency, sustainability, product quality, and profitability in support of the commercial sector and for the benefit of consumers. In order to achieve these goals, it is important to understand the genomic structure and organization of aquaculture species, and their genomic and phenomic variations, as well as the genetic basis of traits and their interrelationships. In addition, it is also important to understand the mechanisms of regulation and evolutionary conservation at the levels of genome, transcriptome, proteome, epigenome, and systems biology. With genomic information and information between the genomes and phenomes, technologies for marker/causal mutation-assisted selection, genome selection, and genome editing can be developed for applications in aquaculture. A set of genomic tools and resources must be made available including reference genome sequences and their annotations (including coding and non-coding regulatory elements), genome-wide polymorphic markers, efficient genotyping platforms, high-density and high-resolution linkage maps, and transcriptome resources including non-coding transcripts. Genomic and genetic control of important performance and production traits, such as disease resistance, feed conversion efficiency, growth rate, processing yield, behaviour, reproductive characteristics, and tolerance to environmental stressors like low dissolved oxygen, high or low water temperature and salinity, must be understood. QTL need to be identified, validated across strains, lines and populations, and their mechanisms of control understood. Causal gene(s) need to be identified. 
Genetic and epigenetic regulation of important aquaculture traits needs to be determined, and technologies for marker-assisted selection, causal gene/mutation-assisted selection, genome selection, and genome editing using CRISPR and other technologies must be developed, demonstrated to be applicable, and applied in aquaculture industries. Major progress has been made in aquaculture genomics for dozens of fish and shellfish species including the development of genetic linkage maps, physical maps, microarrays, single nucleotide polymorphism (SNP) arrays, transcriptome databases and various stages of genome reference sequences. This paper provides a general review of the current status, challenges and future research needs of aquaculture genomics, genetics, and breeding, with a focus on major aquaculture species in the United States: catfish, rainbow trout, Atlantic salmon, tilapia, striped bass, oysters, and shrimp. While the overall research priorities and the practical goals are similar across various aquaculture species, the current status in each species should dictate the next priority areas within the species. This paper is an output of the USDA Workshop for Aquaculture Genomics, Genetics, and Breeding held in late March 2016 in Auburn, Alabama, with participants from all parts of the United States.

Development of a 690 K SNP array in catfish and its application for genetic mapping and validation of the reference genome sequence

Article

Jan 2017

Single nucleotide polymorphisms (SNPs) are capable of providing the highest level of genome coverage for genomic and genetic analysis because of their abundance and relatively even distribution in the genome. Such a capacity, however, cannot be achieved without an efficient genotyping platform such as SNP arrays. In this work, we developed a high-density SNP array with 690,662 unique SNPs (herein 690 K array) that were relatively evenly distributed across the entire genome, and covered 98.6% of the reference genome sequence. Here we also report linkage mapping using the 690 K array, which allowed mapping of over 250,000 SNPs on the linkage map, the highest marker density among all the constructed linkage maps. These markers were mapped to 29 linkage groups (LGs) with 30,591 unique marker positions. This linkage map anchored 1,602 scaffolds of the reference genome sequence to LGs, accounting for over 97% of the total genome assembly. A total of 1,007 previously unmapped scaffolds were placed on LGs, allowing validation and, in a few instances, correction of the reference genome sequence assembly. This linkage map should serve as a valuable resource for various genetic and genomic analyses, especially for GWAS and QTL mapping for genes associated with economically important traits.

Genome Editing in Fishes and Their Applications

Article

Sep 2017

Comparative proteome analysis of the hepatopancreas from the Pacific white shrimp Litopenaeus vannamei under long-term low salinity stress

Article

Apr 2017

Litopenaeus vannamei is a typical euryhaline decapod model for studying the osmoregulation mechanism in crustaceans. Proteomic analysis was undertaken using isobaric tags for relative and absolute quantification together with reverse-phase high-performance liquid chromatography mass spectrometry to quantitatively identify the proteins differentially expressed in the hepatopancreas under low salinity stress (3 psu) compared with the control salinity (25 psu). A total of 533 proteins and 84 differentially expressed proteins were identified, including 58 proteins with the 1.2-fold cut-off value under chronically low salinity stress. Among these proteins, 26 were up-regulated while 32 were down-regulated. Of the 58 differentially expressed proteins, 48 were annotated in the UniProt database and were mapped into 38 pathways by KEGG analysis. These proteins were categorized into the pathways for energy metabolism, signaling, immunization and detoxification, and lipid and protein metabolism. A more active glycometabolism, positive response detoxification pathway, immunosuppression and positive osmoregulation were identified in L. vannamei under low salinity stress. This study suggests that under chronically low salinity stress, L. vannamei showed low immunity and high demand for energy, especially from glycometabolism. Signaling transfer related pathways, especially the Wnt signaling pathways, were involved in the process of salinity adaptation, but the in-depth mechanism warrants further investigation. Significance: In this study, a comprehensive physiological response was studied using proteomics to reveal the underlying mechanism of adaptation to low salinity in L. vannamei, which was the first report on the proteomic response of a crustacean to salinity stress. The extensive proteomic investigation of the hepatopancreas under low salinity stress provides new insight into the adaptive mechanism of this euryhaline crustacean species to low salinity.

Transgenic fish technology: Basic principles and their application in basic and applied research

Article

Jan 1998
