Why might the final few bases of a Sanger sequencing read be unreliable?

Practical skills in Biochemistry A
Cloning Workshop
Aim
To design and verify plasmid DNA constructs for the expression of the fluorescent Light-Oxygen-Voltage (LOV) domain from the Arabidopsis thaliana phototropin 2 protein in a standard E. coli protein expression vector and a vector modified for Golden Gate cloning.
Before attending this workshop we recommend you acquaint yourself with the Polymerase Chain Reaction (PCR) and restriction endonucleases. Chapter 7 of Molecular Cell Biology by Lodish et al. has a good overview of molecular cloning techniques:
http://www.ncbi.nlm.nih.gov/books/NBK21712/
Learning outcomes
At the end of this workshop you should be able to:
1. Search DNA sequence databases for genomic DNA sequences encoding protein targets.
2. Translate nucleotide sequences for open reading frames into protein sequences using online tools.
3. Determine domain boundaries in a polypeptide chain.
4. Design protein expression constructs based on predicted domain boundaries:
a. Select suitable expression vectors for protein targets.
b. Identify restriction endonuclease sites for insertion into multiple cloning sites of common expression vectors.
c. Understand the difference between Type II and Type IIs restriction endonucleases and how they recognise and cut DNA.
d. Design primers to amplify genomic/synthetic DNA with appropriate restriction sites and overhangs for classical restriction/ligation and Golden Gate cloning strategies.
5. Design a protocol for cloning targets of interest using both classical and Golden Gate strategies.
6. Analyse Sanger sequencing data to determine successful insertion of cloned fragments into expression vectors.
Introduction
Before the advent of mass genome sequencing projects, proteins of biochemical interest had to be purified from their source organism or tissue in a series of time-consuming and usually messy/cold experiments. Now, we have genome sequences for more than 10,000 bacterial species and thousands of eukaryotes. We can download and analyse the DNA sequences of almost any gene we could conceive of and design protein expression constructs for these in silico. Native genomic DNA, or cDNA, is available from a number of sources, and can often be bought online with a credit card. It is also possible to take native protein-encoding DNA, optimise the codon usage for a specific organism, and purchase the result as synthetic DNA, or even as an entire construct. The Edinburgh Genome Foundry, run by Professor Rosser and Dr Cai, will eventually make it possible to design and produce complete synthetic eukaryotic chromosomes.
In this workshop we are going to design protein expression constructs for the fluorescent LOV domain from the Arabidopsis thaliana phototropin 2 protein. This protein is a useful alternative fluorescent marker to Green Fluorescent Protein (GFP) as it uses a flavin mononucleotide cofactor, rather than the protein-derived cofactor of GFP, and unlike GFP it is fluorescent in anaerobic conditions. We wish to produce this protein for biochemical and biophysical characterisation. To do this we must first generate the plasmids to clone and produce the protein fragments for our study, as is common practice in structural and biochemical studies of proteins.
We will explore the design rationale for protein expression constructs using both a standard restriction/ligation and a Golden Gate cloning method (Engler, Kandzia, & Marillonnet, 2008).

Important notes:
Save intermediate DNA/mRNA/protein sequence files as plain text fasta (https://en.wikipedia.org/wiki/FASTA_format) files using notepad/wordpad.
Keep tabs for each different webserver you use open. You’ll need to switch between them a few times.
Part 1
DNA sequence databases
The Kyoto Encyclopedia of Genes and Genomes: www.kegg.jp
NCBI gene and genome database: www.ncbi.nlm.nih.gov
Department of Energy genome database: http://img.jgi.doe.gov/
Multiple sequence alignment:
Clustal omega: http://www.ebi.ac.uk/Tools/msa/clustalo/
Task:
• Find the mRNA sequence of A. thaliana PHOT2 and save the sequence as a fasta file on your computer.
What is the NCBI reference sequence accession number for this mRNA?
NM_180880.1/ http://www.ncbi.nlm.nih.gov/nuccore/5391441
How does a Eukaryotic mRNA differ from the genomic DNA sequence?
Genomic DNA contains introns, which are spliced out of the mature mRNA. The mRNA also has a 5’ UTR and a 3’ poly(A) tail.
What are the important features of a fasta file?

Starts with ‘>’ and single line descriptor of file
Then sequence on next line.
Can append multiple sequences in one file as long as each one starts with a ‘>’
Saved as plain text, usually ‘.fas’ or ‘.fasta’ extension.
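
The features listed above can be sketched as a small parser. This is a minimal illustration in Python, not part of the workshop software; the function name parse_fasta and the two example records (built from the LOV domain fragments used later in this handout) are chosen here for illustration only.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; store the previous one first.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence may be wrapped over several lines; join them.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>LOV_start first 30 nt of the domain
cctagagtatcccag
gagctcaaaactgct
>LOV_end final 30 nt of the domain
agcaaatacacagaaggggtaaatgataaa
"""

records = parse_fasta(example)
for header, seq in records:
    print(header, len(seq))
```

Each record is one descriptor line plus its sequence, and wrapping the sequence over several lines does not change its content.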

Part 2
We are primarily interested in the protein-coding region of this mRNA, so we need to translate this from nucleotide sequence to amino acid sequence and determine the correct reading frame for translation.
Task:
• Translate the nucleotide sequence into amino acids using the Expasy translation server.
o http://web.expasy.org/translate/
• Paste the mRNA sequence into the box and select ‘includes nucleotide sequence’ from the Output format drop down list.
• Click ‘translate sequence’

Which reading frame gives the correct sequence?
5’ frame 1

• If you click on this reading frame you will see the translated sequence for this reading frame. It has a few possible start/stop sites. Select the longest open reading frame by clicking on the initiating methionine.
Paste the full protein sequence in Fasta format here:
>LOV
MERPRAPPSP LNDAESLSER RSLEIFNPSS GKETHGSTSS SSKPPLDGNN KGSSSKWMEF QDSAKITERT AEWGLSAVKP DSGDDGISFK LSSEVERSKN MSRRSSEEST SSESGAFPRV SQELKTALST LQQTFVVSDA TQPHCPIVYA SSGFFTMTGY SSKEIVGRNC RFLQGPDTDK NEVAKIRDCV KNGKSYCGRL LNYKKDGTPF WNLLTVTPIK DDQGNTIKFI GMQVEVSKYT EGVNDKALRP NGLSKSLIRY DARQKEKALD SITEVVQTIR HRKSQVQESV SNDTMVKPDS STTPTPGRQT RQSDEASKSF RTPGRVSTPT GSKLKSSNNR HEDLLRMEPE ELMLSTEVIG QRDSWDLSDR ERDIRQGIDL ATTLERIEKN FVISDPRLPD NPIIFASDSF LELTEYSREE ILGRNCRFLQ GPETDQATVQ KIRDAIRDQR EITVQLINYT KSGKKFWNLF HLQPMRDQKG ELQYFIGVQL DGSDHVEPLQ NRLSERTEMQ SSKLVKATAT NVDEAVRELP DANTRPEDLW AAHSKPVYPL PHNKESTSWK AIKKIQASGE TVGLHHFKPI KPLGSGDTGS VHLVELKGTG ELYAMKAMEK TMMLNRNKAH RACIEREIIS LLDHPFLPTL YASFQTSTHV CLITDFCPGG ELFALLDRQP MKILTEDSAR FYAAEVVIGL EYLHCLGIVY RDLKPENILL KKDGHIVLAD FDLSFMTTCT PQLIIPAAPS KRRRSKSQPL PTFVAEPSTQ SNSFVGTEEY IAPEIITGAG HTSAIDWWAL GILLYEMLYG RTPFRGKNRQ KTFANILHKD LTFPSSIPVS LVGRQLINTL LNRDPSSRLG SKGGANEIKQ HAFFRGINWP LIRGMSPPPL DAPLSIIEKD PNAKDIKWED DGVLVNSTDLDIDLF
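
What the translation server does when you pick the longest open reading frame can be sketched roughly as follows. This is an illustrative Python sketch using the standard genetic code, not the Expasy implementation; the function names and the short demo sequence are invented for the example.

```python
# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate a nucleotide sequence in forward frame 0, 1 or 2 ('*' = stop)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_orf(seq):
    """Return (frame number 1-3, protein) for the longest Met-initiated stretch.

    A stretch that runs off the end of the sequence without a stop codon
    is still reported, so check the result against the full translation.
    """
    best = (0, "")
    for frame in range(3):
        for segment in translate(seq, frame).split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame + 1, segment[start:])
    return best


demo = "GGATGCCTAGAGTATCCTAAGG"  # invented test sequence, not PHOT2
print(longest_orf(demo))
```

Only the three forward frames are searched here; the reverse frames could be covered by running the same function on the reverse complement.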

Part 3
We want to clone the LOV domain from this protein, so we need to determine what domains the protein sequence encodes and the domain boundaries.
Task:
• Paste your protein sequence into the ‘protein sequence’ box on the SMART page (http://smart.embl-heidelberg.de) and check the boxes for ‘outlier homologues’, ‘PFAM domains’, ‘signal peptides’ and ‘internal repeats’.
The server will return a graphical view of the different domains in your protein sequence.

Complete the table below with details of the predicted domains
Name            Start   End
Low complexity  102     114
PAS             122     191
PAC             197     239
PAS             378     447
PAC             453     495
S_TKc           577     864

• Look at the box with outlier homologues and homologues of known structure. This displays known structures with sequence homology to the predicted domains in the protein.
• Click on the names of the domains to get further information.

Which of the domains are related to LOV proteins of known structure?
PAS1 – 2z6d
PAS2 – 2z6d

The domain boundary predictions given by SMART cover only highly conserved regions of predicted domains. To design robust expression constructs you usually need to include a few amino acids either side of the predicted domain to ensure you have the full protein domain, or you may wish to use structural homologues to guide the design of your construct.

Part 4
Construct design
We wish to clone the protein sequence corresponding to the first LOV domain in the protein.
Task:
• Using the homologues as a reference, identify the first LOV domain and click on the link in the homologue view to see the domain in the protein that matches this. Look at the table and identify the start and end residues for this domain and locate these in your translated nucleotide sequence

What are the start and end residues for this domain?
Start: 117
End: 246
Paste the protein sequence for this domain here:

PRVSQELKTALSTLQQTFVVSDATQPHCPIVYASSGFFTMTGYSSKEIVGRNCRFLQGPD
TDKNEVAKIRDCVKNGKSYCGRLLNYKKDGTPFWNLLTVTPIKDDQGNTIKFIGMQVEVS
KYTEGVNDK

• Now that you have the protein sequence for this domain, you need to go back to the mRNA sequence and translation to determine the mRNA sequence encoding this domain.
• Go back to your translated sequence and find the correct reading frame.

Paste the first 30 nucleotides corresponding to the sequence of the LOV domain here:
cct aga gta tcc cag gag ctc aaa act gct

Paste the final 30 nucleotides of the LOV domain here:
agc aaa tac aca gaa ggg gta aat gat aaa

Paste the full nucleotide sequence of the domain here:
cctagagtatcccaggagctcaaaactgctctatccacgttgcagcagacttttgttgtctctgacgctacacagcctcactgtcccatagtctatgccagcagtggattctttaccatgactggttattcttccaaggaaattgttggaagaaactgccggtttctgcaagggccagacaccgacaagaatgaggttgccaaaatcagagattgtgtcaagaatggaaaaagttactgtggaaggctgttaaactacaaaaaggacggaactcccttctggaatcttctcacagtcactcctatcaaggacgaccagggcaataccatcaagttcattgggatgcaggttgaagttagcaaatacacagaaggggtaaatgataaa
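
A quick way to check that the nucleotide fragments really encode the ends of the domain is to translate them back and compare with the protein sequence. The sketch below does this in Python with the standard genetic code; the helper name translate is chosen for this example, and the two 30-nucleotide fragments are the ones given above.

```python
# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate a nucleotide sequence read from its first base."""
    seq = seq.upper().replace(" ", "").replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


first_30 = "cct aga gta tcc cag gag ctc aaa act gct"
final_30 = "agc aaa tac aca gaa ggg gta aat gat aaa"

# These should reproduce the first and last ten residues of the domain.
print(translate(first_30))
print(translate(final_30))
```

The same function can be applied to the full nucleotide sequence of the domain to confirm that it translates to the complete protein sequence with no stop codons.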

The amplification of DNA fragments by PCR requires two primers that are complementary to the 5’ and 3’ ends of the region you wish to amplify. You can read more about PCR in Lodish. The sequences you identified above will be used to design primers to amplify the LOV domain from cDNA.

What is cDNA?
Complementary DNA synthesized from mRNA
Why do we use this and not mRNA/genomic DNA from the plant in a PCR reaction?
No introns in cDNA

Part 5 Vectors for recombinant protein over-expression in E. coli
To produce sufficient recombinant material for biochemical and biophysical analysis, open reading frames encoding proteins of interest are usually cloned into a plasmid vector that allows the controlled over-expression of the protein. The pET series of vectors from Novagen (www.novagen.com) are some of the most commonly used for recombinant protein production. These vectors have the minimal features for propagation in E. coli: an origin of replication and a selectable antibiotic resistance marker. In addition to these features they possess a promoter for controlling expression of the recombinant protein, a ribosome binding site, a transcription terminator and a multiple cloning site (MCS). The multiple cloning site allows the insertion of PCR-amplified, or synthetic, DNA into the vector through digestion with specific restriction endonucleases. Protein expression vectors may also have additional open reading frames for proteins that control expression levels of your inserted gene, and additional purification or antibody affinity tags for enhanced protein purification and identification.

We will clone our LOV domain into the MCS of pET28a and a version of this vector that has been modified to allow Golden Gate cloning. The MCS of pET28a is shown below.

Part 6 restriction cloning
You will clone the LOV domain into the MCS of pET28a between the NcoI and XhoI restriction sites. These two restriction enzymes are Type II restriction enzymes that cleave DNA at defined sequence regions:
https://www.neb.com/products/restriction-endonucleases/restriction-endonucleases/types-of-restriction-endonucleases
Task:
• Look up details of the NcoI and XhoI restriction enzymes on the NEB website
Write the recognition/cut sites of the two restriction enzymes
(Use an ^ to indicate the cut sites)
NcoI: 5’-C^CATGG-3’
3’-GGTAC^C-5’
XhoI: 5’-C^TCGAG-3’
3’-GAGCT^C-5’

Addition of the recognition sequence for these restriction enzymes to your PCR primers allows these enzymes to cut the PCR product. The cut PCR product can then be inserted into a vector that has been cut with the same enzymes. We use two different enzymes to ensure that our PCR product can only be inserted into the vector in one direction.
Why is directional insertion of DNA fragments into vectors important?
To ensure insert goes in the correct way to give proper reading frame.
All open reading frames must begin with an ‘ATG’ (or in some cases ‘TTG’) start codon after the ribosome binding site.
Write the NcoI restriction sequence followed by the nucleotide sequence for the first five codons of your coding sequence. Below this paste the translation in one letter amino acid sequence.
CCA TGG cct aga gta tcc cag
P W P R V S Q
Does this result in an in-frame translation past the restriction site?
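No. The grouping above reads the site as two codons, but translation starts at the ATG inside the NcoI site (CC ATG G). Read from that ATG, the codons are ATG Gcc tag: Met, Ala, then a stop codon, so the insert is out of frame.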
How might you ensure an in-frame translation when using this site?
Add two nucleotides after the restriction site, such as GC, to put the insert in frame with the ATG: CC ATG GGC cct aga gta (M G P R V).
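
The frame arithmetic behind this answer can be written out as a short Python sketch. The function name spacer_needed is hypothetical and the logic assumes the first ATG in the upstream sequence is the intended start codon.

```python
def spacer_needed(upstream, start_codon="ATG"):
    """Number of nucleotides to add after `upstream` so the insert starts a codon.

    `upstream` is everything from the restriction site up to (but not
    including) the first base of the insert. The first occurrence of the
    start codon is assumed to set the reading frame.
    """
    upstream = upstream.upper()
    start = upstream.find(start_codon)
    if start == -1:
        raise ValueError("no start codon found in upstream sequence")
    used = len(upstream) - start  # bases from the A of ATG to the end
    return (-used) % 3


print(spacer_needed("CCATGG"))    # NcoI site alone
print(spacer_needed("CCATGGGC"))  # NcoI site plus a GC spacer

# At the 3' end the XhoI site (CTCGAG) is 6 nt long, a multiple of 3,
# so it keeps the insert in frame with whatever follows it in the vector.
print(len("CTCGAG") % 3 == 0)
```

For the NcoI site the function reports that two extra nucleotides are needed, and none once the GC spacer is present.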
We want to produce the LOV protein with a C-terminal His-tag, which makes protein purification easier by appending six histidine residues to the protein. This tag has a high affinity for nickel and cobalt; therefore, most proteins with this tag can be purified using metal-ion affinity chromatography.
Write the nucleotide sequence of the final five codons of your coding sequence followed by the XhoI restriction sequence. Below this paste the translation in one letter amino acid sequence.
ggg gta aat gat aaa CTC GAG
G V N D K L E

Does this result in an in-frame translation past the coding sequence to produce a His-tag?
Yes

Part 7 Golden Gate cloning
Standard restriction cloning methods have been used for many years, but they can be time-consuming as both the vector and PCR product must be digested and purified separately before they can be ligated with a DNA ligase to produce the desired vector. To address this productivity bottleneck a number of different cloning methods have been introduced, including topoisomerase- and recombinase-based methods that use proteins that can cut and splice DNA together in a single reaction. Much like standard restriction cloning, these techniques require the addition of specific sequences to the 5’ and 3’ ends of your amplified DNA. One of the major barriers to their widespread adoption is the cost of the reagents needed.
Golden Gate cloning was introduced in 2008 and uses Type IIs restriction endonucleases, which cut DNA outside of their recognition site and allow a variety of 3-5 bp overhangs to be created to allow directional cloning with the use of a single restriction enzyme. Golden Gate vectors are usually designed with a selectable marker such as GFP/RFP flanked by inverted Type IIs restriction sites. This allows one-way cloning as suitably designed PCR products and vectors that have been cut by Type IIs enzymes no longer possess the recognition sequence. Once cut, ligation of these fragments results in a product that can no longer be cut by the enzyme.
The most commonly used Type IIs restriction enzyme is BsaI. Write out the recognition sequence and cut site of this enzyme below
(use a ^ to show the cut site and ‘n’ for unspecified nucleotides):
5’-GGTCTCN^-3’
3’-CCAGAGNNNNN^-5’
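
The cut pattern written above (top strand 1 nt past the site, bottom strand 5 nt past it) leaves a 4-nt 5’ overhang. A minimal Python sketch of this geometry is given below; the function name bsai_overhang is invented for the example, and only a site in the forward orientation on the given strand is considered. The test sequence is the BsaI forward primer from the worked answer in Part 9.

```python
BSAI_SITE = "GGTCTC"


def bsai_overhang(seq):
    """Return the 4-nt 5' overhang left downstream of a forward BsaI site."""
    seq = seq.upper()
    site = seq.find(BSAI_SITE)
    if site == -1:
        raise ValueError("no forward-orientation BsaI site found")
    top_cut = site + len(BSAI_SITE) + 1  # top strand: 1 nt past the site
    bottom_cut = top_cut + 4             # bottom strand: 5 nt past the site
    if bottom_cut > len(seq):
        raise ValueError("sequence too short to contain the cut site")
    # The bases between the two cuts stay single-stranded on the
    # downstream fragment and form its overhang.
    return seq[top_cut:bottom_cut]


fwd_primer = "GACGGTCTCAAATGcctagagtatcccaggagc"
print(bsai_overhang(fwd_primer))
```

Because the overhang lies outside the recognition sequence, its four bases can be chosen freely in the primer, and the recognition site itself is lost from the insert after cutting.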

We have modified the pET28 vector to produce a Golden Gate vector by replacing the MCS with an RFP marker and BsaI restriction sites as shown below:

Part 8 Checking your nucleotide sequence for restriction sites
Before you can design any primers for PCR you must check that the natural DNA, mRNA, or cDNA sequence you are using as your template does not contain the recognition sites for the restriction enzymes you are using to amplify your PCR product. If it does, you will find that your enzymes cut these sites too and you won’t be able to insert your DNA into your vector as it will be cut in extra places.
You can use a number of programs to check for the presence of restriction sites in your DNA. They all work by searching for the restriction sites of common restriction endonucleases in the DNA sequence provided and will report on which enzymes cut and which don’t.
You can use the NEB cutter server to check for your restriction sites in the LOV mRNA.
Go to the NEB cutter website (http://nc2.neb.com/NEBcutter2/) and paste the nucleotide sequence for the LOV domain into the box. Push ‘submit’ and wait until it has cut your DNA. It displays any predicted open reading frames and a few enzymes that cut your DNA. Click on the ‘0 cutters’ link and search for your three enzymes in this list.

Can you find any NcoI, XhoI, or BsaI sites in your mRNA?
NcoI No
XhoI No
BsaI No
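
The search that a tool like NEBcutter performs for these three enzymes can be approximated in a few lines of Python. This is a simplified sketch, not the NEBcutter algorithm: it looks only for exact matches to the three recognition sequences, on both strands, and the function names are chosen for this example.

```python
SITES = {"NcoI": "CCATGG", "XhoI": "CTCGAG", "BsaI": "GGTCTC"}


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq, sites=SITES):
    """Return {enzyme: sorted 1-based start positions} on either strand."""
    seq = seq.upper().replace("U", "T")
    hits = {}
    for name, site in sites.items():
        # Non-palindromic sites such as BsaI must be searched in both
        # orientations; for palindromes the two patterns are identical.
        positions = set()
        for pattern in {site, reverse_complement(site)}:
            start = seq.find(pattern)
            while start != -1:
                positions.add(start + 1)
                start = seq.find(pattern, start + 1)
        hits[name] = sorted(positions)
    return hits


# First 30 nt of the LOV domain: no sites expected.
print(find_sites("cctagagtatcccaggagctcaaaactgct"))
```

Pasting the full domain sequence into find_sites in place of the 30-nt fragment gives the check needed for this task.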

If your DNA sequence does have recognition sites for these enzymes, you can consider using different sites from the MCS of pET28b, or you can remove these sites by a two-step PCR in which the sites are edited out by mismatched primers.
It is becoming very common to just purchase synthetic DNA fragments, where the DNA sequence can be specified based on the protein sequence and optimal codon-usage for the final host organism. This simplifies things immensely as you can avoid the PCR step and hopefully produce more protein from your optimised DNA sequence once it is cloned. There are a variety of companies offering synthetic genes and fragments of DNA with tools for sequence optimisation.

Part 9 PCR primer design
Only once you have determined the sequence you wish to amplify, decided on a vector and restriction sites required for cloning and have checked the DNA sequence for these sites can you design your PCR primers. You should be familiar with how PCR works, so I won’t go into detail here.
Restriction cloning
The key elements of a PCR primer for standard restriction cloning are as follows:
1. Random clamp-DNA (3-10 bp)
This helps the primer to anneal to your template DNA and helps the restriction enzyme cut the DNA by providing a buffer between the restriction site and the end of the DNA.
It should usually start with a G nucleotide and shouldn’t contain restriction sites, or triplet repeats.
It can be modified to adjust the Tm of the primer/template.
2. Restriction site (6 – 10 bp)
To allow specific cleavage of the DNA backbone to produce sticky ends for ligation into a vector.
3. Spacer sequence
To ensure the correct reading frame if necessary (important for NcoI sites)
4. Complementary sequence 20 – 30 bp
20 – 30 bp of sequence that is complementary to the DNA you are amplifying.
Ideally ends with a G nucleotide
Length can be adjusted to modify Tm of primer
5’ primer must have ATG start codon if there isn’t one in the restriction site/vector.
3’ primer must be reverse complement and have TAA stop codon if you need one!
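
The four elements above can be joined programmatically. The Python sketch below is one illustrative way to do it; the function names are hypothetical, and the clamp, spacer and annealing sequences are taken from the worked NcoI/XhoI primers given in the answer table later in this handout.

```python
def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_primer(clamp, site, spacer, anneal):
    """Join clamp + restriction site + spacer + template-matching region.

    The added 5' elements are written in upper case and the annealing
    region in lower case so the parts are easy to tell apart.
    """
    return clamp.upper() + site.upper() + spacer.upper() + anneal.lower()


# Forward primer: anneals to the start of the coding sequence.
fwd = build_primer("GAC", "CCATGG", "GC", "cctagagtatcccaggagc")

# Reverse primer: the annealing region is the reverse complement of the
# end of the coding sequence. No stop codon is added because the
# C-terminal His-tag in the vector must be read through.
rev = build_primer(
    "GAC", "CTCGAG", "", reverse_complement("aatacacagaaggggtaaatgataaa")
)

print(fwd)
print(rev)
```

Changing the clamp or the length of the annealing region is then a one-line edit, which is convenient when adjusting the Tm of a primer pair.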
Golden Gate cloning
The key difference between a primer for Golden Gate cloning and standard restriction cloning is the need to specify an overhang for the cut site. This is the great strength of Golden Gate cloning, but you must be careful to ensure your overhangs match the vector and other parts you plan to ligate together.
1. Random clamp-DNA (3 – 10 bp)
As before, this helps the enzyme to bind the DNA.
2. Recognition site (6 – 10 bp)
Unlike type II restriction enzymes, type IIs enzymes cut outside of their recognition site. Type IIs sites are not palindromic, and have directionality, so take care in choosing the orientation of the site.
3. Cut site (4 – 6 bp)
Different enzymes cut at different distances from their recognition site; make sure to choose the correct number of nucleotides for your cut site and that the overhang created matches your vector, or other parts you are ligating together.
4. Complementary sequence (20 – 30 bp)
As before!
Note: The melting temperatures of primer pairs should be within 5°C of each other to help ensure accurate annealing and amplification during PCR. You can calculate the Tm of primers by hand, or use an online tool or program to calculate this, such as the NCBI primer design tool: http://www.ncbi.nlm.nih.gov/tools/primer-blast/
The Tm of primer pairs can be adjusted by the addition/removal of clamp and complementary nucleotides.
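
For a by-hand estimate, one commonly quoted GC-content approximation is Tm = 64.9 + 41 × (G + C − 16.4) / N, where N is the primer length. The Python sketch below applies it to the template-matching regions of the worked primers. It is a rough guide only: online tools use more detailed models and salt corrections, so their values (including those in the answer table) will generally differ from these.

```python
def estimate_tm(primer):
    """Rough Tm in degrees C from GC content (approximation only)."""
    seq = primer.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def tm_compatible(primer_a, primer_b, max_difference=5.0):
    """True if the two estimated Tm values are within max_difference."""
    return abs(estimate_tm(primer_a) - estimate_tm(primer_b)) <= max_difference


fwd_anneal = "cctagagtatcccaggagc"
rev_anneal = "tttatcatttaccccttctgtgtatt"

print(round(estimate_tm(fwd_anneal), 1))
print(round(estimate_tm(rev_anneal), 1))
print(tm_compatible(fwd_anneal, rev_anneal))
```

What matters for the pair is the difference between the two estimates rather than the absolute values, provided both are calculated with the same method.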

Standard PCR primers are usually between 30 – 50 bp long.
Task:
• Design and write down PCR primers for the following constructs (mark the four different regions of your primers in different colours and note their Tm):
o LOV domain in pET28a cloned between NcoI/XhoI sites.
o LOV domain in pET28GG with BsaI overhangs as follows:
 Forward, 5’-AATG-3’
 Reverse, 5’-TAAGC-3’
 Note BsaI sites should be 5’ to overhangs.

Primer name   Sequence (5’ to 3’)                          Tm (°C)
NcoI_Fwd      GACCCATGGGC cctagagtatcccaggagc              59.5
XhoI_Rev      GACCTCGAG tttatcatttaccccttctgtgtatt         60.0
BsaI_Fwd      GACGGTCTCAAATG cctagagtatcccaggagc           59.5
BsaI_Rev      GGCGGTCTCTAAGC tttatcatttaccccttctgtgtatt    60.0

You will go through the stages of PCR and cloning over the following weeks in the wet lab!

Part 10 Sequence validation
After cloning an insert into an expression vector its sequence and correct insertion must be verified. To do this we send a small proportion of our construct for Sanger sequencing (https://en.wikipedia.org/wiki/Sanger_sequencing). The results of Sanger sequencing are returned as a file with the chromatogram from the sequencing run and a consensus sequence derived from this data.
Sanger sequencing is only reliable for around 800 – 1000 bp, so for DNA longer than this we usually perform multiple sequencing runs to ensure we can check the entire sequence. Most vectors have sequences outside of the MCS for recognition by specific sequencing primers; these are usually at the 5’ and 3’ ends of the MCS and a sufficient distance away from the MCS to ensure an accurate sequencing read.
The genomic DNA sequence, and chromatograms for both Forward and Reverse sequencing reads for a recent expression construct generated in the Marles-Wright Laboratory are available on Learn.
Task
• Inspect the chromatogram traces for the sequencing runs using Finch.tv and output the consensus sequences as fasta files.
Why might the first few bases of a Sanger sequencing read be unreliable?
Primer binding effects

Why might the final few bases of a Sanger sequencing read be unreliable?
Resolution of sequencing gel /capillary becomes poor after 700 – 800 nt.

• To confirm the insert sequence, input both sequences separately into the NCBI BLASTX server. This translates the nucleotide sequences into protein and aligns them against the non-redundant protein databases.

What is the top hit for each of the sequencing runs?
Forward: Bacteriocin R. rubrum
Reverse: No
Are there any differences between our sequences and the top hits from BLAST?
Forward: Bacteriocin R. rubrum
Reverse: No

To verify that this is what we were expecting from the genomic DNA sequence, run BLAST with the genomic sequence given in LEARN.
Is our construct the sequence we were expecting?
Yes

To check the DNA sequences of our construct against our genomic DNA we can run Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/) with the sequencing data and genomic DNA.
• Paste the Genomic DNA sequence and Forward read into Clustal Omega and align the two DNA sequences.
Does the sequencing read match the genomic DNA sequence?
Yes.

For the reverse read you need to produce the reverse complement of this strand so that it matches the genomic DNA before you can align them.
• Go to a reverse complement server (http://www.bioinformatics.org/sms/rev_comp.html) and paste in your reverse read.
• Align this reverse complement against your genomic DNA in Clustal Omega.
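
If you prefer to produce the reverse complement locally rather than on a web server, a few lines of Python will do it. This sketch is not the code behind the server linked above; it handles A, C, G, T and the ambiguity code N (which often appears in sequencing reads) and preserves case.

```python
# Map each base to its complement; N (unknown base) stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(read):
    """Complement every base, then reverse the order."""
    return read.translate(COMPLEMENT)[::-1]


print(reverse_complement("tttatcatttaccccttctgtgtatt"))
print(reverse_complement("ACGTN"))
```

Applying the function twice returns the original sequence, which is a quick sanity check.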
Does the sequencing read match the genomic DNA sequence?
Yes

Are there any significant mismatched regions in either of the sequence reads?
No
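
Once two sequences have been aligned, mismatches can also be listed programmatically instead of by eye. The Python sketch below assumes you have copied the two aligned rows (including any '-' gap characters) out of the alignment output as plain strings of equal length; the function name and the short example strings are invented for illustration.

```python
def mismatches(aligned_reference, aligned_read):
    """Return (1-based column, reference base, read base) for each difference."""
    if len(aligned_reference) != len(aligned_read):
        raise ValueError("aligned sequences must have the same length")
    pairs = zip(aligned_reference.upper(), aligned_read.upper())
    return [
        (column, ref, read)
        for column, (ref, read) in enumerate(pairs, start=1)
        if ref != read  # gaps ('-') count as differences too
    ]


# Invented example: one substitution and one inserted base in the read.
print(mismatches("ACGT-A", "ACCTTA"))
```

Differences clustered at the very start or end of a read are more likely to reflect poor sequencing quality than real mutations, so check them against the chromatogram.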

Most laboratories have their own unique workflow for the design of protein expression constructs and favoured cloning methods. The ones presented here highlight some important general concepts, but there are many different ways to achieve the same ends. There are also a number of free and commercially available programs for cloning and sequence analysis that simplify this workflow and allow you to arrange all your primers, constructs and sequence data together. Programs such as Benchling, Serial Cloner and SnapGene all have different levels of functionality and sharing features.

Next week you will begin the wet work for this practical class with a PCR reaction to produce expression constructs for proteins of interest to our laboratory.

References
Engler, C., Kandzia, R., & Marillonnet, S. (2008). A one pot, one step, precision cloning method with high throughput capability. PloS One, 3(11), e3647. doi:10.1371/journal.pone.0003647