Proteins are polypeptide structures consisting of one or more long chains of amino acid residues. They carry out a wide variety of functions in organisms, including DNA replication, transporting molecules, catalyzing metabolic reactions, and providing structural support to cells. A protein can be identified based on each level of its structure. Every protein contains at least a primary, secondary, and tertiary structure. Only some proteins have a quaternary structure as well. The primary structure consists of a linear chain of amino acids.
The secondary structure contains regions of amino acid chains that are stabilized by hydrogen bonds from the polypeptide backbone. These hydrogen bonds create the alpha-helices and beta-pleated sheets of the secondary structure. The three-dimensional shape of a protein, its tertiary structure, is determined by the interactions of side chains from the polypeptide backbone. The quaternary structure also influences the three-dimensional shape of the protein and is formed through the side-chain interactions between two or more polypeptides.
To reiterate, the primary structure of a protein is defined as the sequence of amino acids linked together to form a polypeptide chain. Each amino acid is linked to the next amino acid through peptide bonds created during the protein biosynthesis process. The two ends of each polypeptide chain are known as the amino terminus (N-terminus) and the carboxyl terminus (C-terminus). Twenty different amino acids can be used multiple times in the same polypeptide to create a specific primary protein structure sequence.
In any cell, the DNA preserves the code used to synthesize proteins. The nucleotide sequence of a protein-encoding gene is transcribed into another nucleotide sequence (RNA transcript or mRNA), which is then used to synthesize a sequence of amino acids to form a protein. This is historically known as the Central Dogma. RNA polymerases transcribe genes.
In eukaryotes, there are three polymerases (I, II, III):
- RNA polymerase I transcribes genes for 18S, 5.8S, and 28S ribosomal RNA; it is located in the nucleolus.
- RNA polymerase II transcribes protein-encoding genes into messenger RNA (mRNA) and also transcribes small RNAs.
- RNA polymerase III synthesizes 5S RNA and all transfer RNA (tRNA); it is located in the nucleoplasm.
The RNA polymerases are aided in transcription by a series of proteins called transcription factors, whose main function is to facilitate the binding of RNA polymerase II to the promoter sequence, forming the transcription initiation complex. The initiation complex recognizes the promoter region in the DNA as a signal for the transcription starting point. The consensus sequence TATAAA (called the TATA box) is located 25–35 bases upstream of the initiation site, and a G/C-rich consensus sequence is located 32–38 bases upstream.
The transcribed RNA contains multiple segments called exons and introns. Exons are segments containing the encoding sequence for protein synthesis, while introns are intervening segments of nucleotides that will be excised from the primary RNA transcript (pre-mRNA) in a process called splicing. Splice sites are found at the 5′ and 3′ ends of each intron, which is marked by the dinucleotide GU at its 5′ end and AG at its 3′ end. Alternative splicing is a mechanism used to create different isoforms of the same protein so that several protein isoforms can be created from a limited repertoire of genes. Alternative splicing is now known to deeply influence biological behavior (sex determination, cell differentiation, and programmed cell death). The mechanism underlying alternative splicing is not fully understood yet, but histone modification signatures correlate with splice site switching via chromatin-binding proteins.
After a mature mRNA transcript leaves the nucleus, a ribosome attaches to the transcript to begin translation. The initiation site in both prokaryotes and eukaryotes is not composed of a unique initiation sequence but is more probably a multisubstrate enzyme system based on substrate specificity. A postulated mechanism (scanning mechanism) predicts that the ribosomes initiate protein synthesis by recognizing and binding the 5′ end of the mRNA, which carries a 7-methylguanosine referred to as the "cap." The first methionine codon located in the 5′ end of the mRNA is the initiator codon. Several factors bind to this initiation sequence (eIF4F initiation factors), forming the ribosome-eIF2-GTP-Met-tRNA preinitiation complex (PIC).
A favorable initiation sequence in mammals is known as “Kozak consensus”: 5’ (A/G)CCAUGG 3’. The PIC scans the mRNA in the 5’ region until it reaches the AUG start codon that is complementary with the anticodon of Met-tRNA. The first AUG codon can be skipped (a process called “leaky scanning”) to use a downstream AUG. This creates protein isoforms when the downstream AUG is located in the same open reading frame as the first one, or a completely different protein if it is not in-frame.
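As a rough illustration of the scanning idea above, the following Python sketch lists every AUG in a transcript and flags those that sit inside the Kozak consensus quoted in the text. The function names and the toy mRNA string are hypothetical, and real start-site selection depends on far more context than a seven-base pattern.

```python
import re

# Kozak consensus as quoted in the text: (A/G)CCAUGG.
# A lookahead is used so that overlapping matches are not missed.
KOZAK = re.compile(r"(?=([AG]CCAUGG))")


def all_aug_positions(mrna):
    """Return the 0-based index of the A of every AUG in the mRNA."""
    return [m.start() for m in re.finditer(r"(?=AUG)", mrna.upper())]


def kozak_aug_positions(mrna):
    """Return the index of each AUG that lies inside a Kozak consensus match."""
    # The AUG begins three bases into the seven-base consensus.
    return [m.start() + 3 for m in KOZAK.finditer(mrna.upper())]


def same_frame(first_aug, second_aug):
    """True when two start codons share a reading frame (offset divisible by 3)."""
    return (second_aug - first_aug) % 3 == 0


# Hypothetical toy transcript: the first AUG has a weak context,
# the second sits in a Kozak consensus and is in-frame with the first.
toy_mrna = "GGCAUGCUUACCAUGGCUUAA"
augs = all_aug_positions(toy_mrna)
strong = kozak_aug_positions(toy_mrna)
```

Here a scanning complex that skipped the first AUG and initiated at the second would, because the two codons are in-frame, yield a shorter isoform of the same protein rather than an unrelated product.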
Each ribosome is made up of a larger and a smaller subunit, which differ between eukaryotes (60S and 40S) and prokaryotes (50S and 30S). Both ribosomal subunits join to form the binding sites for tRNA to translate the mRNA transcript. Each amino acid is specified by a triplet of nucleotides called a codon. More than one codon encodes each amino acid except methionine (Met) and tryptophan (Trp). The three binding sites are the aminoacyl site (A site), peptidyl site (P site), and exit site (E site).
The A site is the first binding site where the initial incoming aminoacyl tRNA pairs with its codon. As the second aminoacyl tRNA binds to the A site, the initial tRNA molecule shifts to the adjacent P site. As the shift happens, the amino acid attached to the tRNA molecule in the P site forms a peptide bond to the amino acid in the A site. This leaves the tRNA at the P site deacylated while tRNA at the A site carries the peptide chain. The deacylated tRNA moves into the E site and gets released as the tRNA in the A site moves over to the P site. The P site holds the tRNAs linked to the growing polypeptide chain. Then a new tRNA molecule attaches to the codon in the unoccupied A site. The codons UAG, UGA, and UAA are used to end the protein translation (termination codons). The process of adding amino acids to the chain is repeated until a stop codon is reached.
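The codon redundancy and the three termination codons mentioned above can be checked directly against the standard genetic code. The sketch below builds the standard codon table from its conventional compact layout (bases ordered U, C, A, G) and counts how many codons map to each amino acid; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional compact layout:
# codons enumerated with bases in the order U, C, A, G; '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Number of codons that specify each amino acid (one-letter codes).
degeneracy = Counter(CODON_TABLE.values())

# Amino acids encoded by exactly one codon.
single_codon = sorted(aa for aa, n in degeneracy.items() if n == 1 and aa != "*")

# Codons that terminate translation.
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
```

With this table, `single_codon` contains only methionine (M) and tryptophan (W), and `stop_codons` lists UAA, UAG, and UGA, in agreement with the text.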
Each amino acid consists of a carboxyl group, an amino group, and a side chain. Amino acids are linked together by joining the amino group of one amino acid with the carboxyl group of the adjacent amino acid. Each amino acid side chain has differing properties. Some side chains can be either acidic or basic, while others can be polar uncharged or just non-polar. These characteristics provide insight into whether the protein generally functions better in acidic or basic environments, its solubility in water or lipids, the temperature range for optimal protein function, and which parts of the protein are found in the protein interior rather than in contact with the external aqueous environment. Some amino acids contained within the polypeptide chain can even create ionic bonds and disulfide bridges. The location of certain amino acids in the primary structure dictates how the secondary, tertiary, and quaternary structure will look.
Nonpolar, Aliphatic Amino Acids - backbone molecules of the amino acid are used to form hydrogen bonds.
- Glycine - can cause a bend when used in an alpha helix chain (secondary structure)
Nonpolar, Aromatic Amino Acids - backbone molecules of the amino acid are used to form hydrogen bonds.
Polar, Uncharged Amino Acids - backbone/ side chain molecules of the amino acid can be used to form hydrogen bonds (besides proline and cysteine)
- Proline - causes a bend when used in an alpha helix chain (secondary structure)
- Cysteine - the sulfur atoms from two cysteine side chains covalently bond together to form a disulfide bridge
Acidic Amino Acids - can be used to form hydrogen bonds (backbone/ side chain molecules) and salt bridges (side chain molecules only)
- Aspartic Acid/Aspartate
- Glutamic Acid/Glutamate
Basic Amino Acids - can be used to form hydrogen bonds (backbone/ side chain molecules) and salt bridges (side chain molecules only)
The process of transcription and translation has some similarities and differences between eukaryotes and prokaryotes. This section will only focus on eukaryotic molecular mechanisms.
Transcription begins in the nucleus with the formation of the pre-initiation complex. First, transcription factor II D (TFIID) binds to the TATA box through the TATA-binding protein (TBP). Then five other transcription factors (TFIIA, TFIIB, TFIIE, TFIIF, and TFIIH), along with RNA polymerase II, combine through a series of stages to form the pre-initiation complex. Specifically, TFIIH has a role in nucleotide excision repair and in separating the opposing strands of double-stranded DNA. The template strand is read by RNA polymerase in the 3’-5’ direction, and the RNA is synthesized in the 5’-3’ direction. During this process, RNA Pol II adds ribonucleotides complementary to the template strand: thymine on the DNA strand is paired with adenine, guanine is paired with cytosine, and cytosine is paired with guanine. However, RNA contains uracil in place of thymine, so whenever the enzyme reaches an adenine on the DNA strand, it adds a uracil instead of a thymine.
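The base-pairing rules used during transcription can be written out as a small lookup. This is a minimal sketch, assuming the template strand is supplied as a string written in the 3’-5’ direction so that the output reads 5’-3’; the function name and the example sequence are hypothetical.

```python
# DNA template base -> complementary RNA base.
# Adenine on the template pairs with uracil, because RNA has no thymine.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the RNA (5'->3') complementary to a DNA template given 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


# Hypothetical template fragment, written 3'->5'.
rna = transcribe("TACGGATTC")
```

Because the template is read 3’-5’ while the transcript grows 5’-3’, the two strings line up position by position and no reversal is needed under this convention.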
For protein-encoding genes, elongation continues until the polymerase transcribes the AAUAAA and GU-rich sequences. Once both sequences have been transcribed, cleavage-polyadenylation specificity factor (CPSF) and cleavage stimulatory factor (CstF) bind to the AAUAAA and GU-rich sequences, respectively. Poly(A) polymerase forms a complex with CPSF, CstF, nuclear poly(A) binding protein (PABP), and a few other proteins to catalyze the addition of the 3’ poly-A tail of the pre-mRNA strand.
This newly synthesized pre-mRNA strand undergoes post-transcriptional modifications to prevent degradation while exiting the nucleus to enter the cytoplasm. At the 5’ end of the pre-mRNA strand, the 7-methylguanosine cap is added by guanylyltransferase.
Splicing and Pre-mRNA Modifications
Splicing occurs within the nucleus. Several steps are catalyzed by large (60S) complexes called spliceosomes, composed of small nuclear ribonucleoproteins (snRNPs) and splicing factors. Spliceosomes excise the introns from the pre-mRNA transcript while leaving the exons in place. Through alternative splicing, different combinations of exons can be retained, depending on the type of protein that needs to be synthesized.
After splicing, the 5′ and 3′ tails are modified to help the translation process. At the 5′ end of the pre-mRNA molecule, GTP reacts with the triphosphate group on the 5′ carbon of the first nucleotide's ribose and forms a 5′–5′ triphosphate linkage; the nitrogen-7 of guanine is then methylated to form the 5′ cap of the mRNA. At the 3′ end, the transcript is cleaved by an endonuclease downstream of the AAUAAA sequence so that poly(A) polymerase can add the adenylate residues.
After all the necessary modifications are made, the mRNA strand is considered mature and is then translocated to the cytoplasm to begin translation.
Translation of mRNA normally occurs in the cytoplasm or the rough endoplasmic reticulum. However, it can happen in any compartment of the cell that has ribosomes. The process of initiating eukaryotic translation begins with the formation of the 80S initiation complex. The eukaryotic initiation factor 4F (eIF4F) protein complex initially recognizes the 5’-cap structure of the mRNA molecule.
Sometimes, eIF4F can recognize mRNA at an internal ribosome entry site (IRES), which drives translation independently of the 5’-cap structure. Then eIF4F recruits the pre-initiation complex, including the 40S subunit and another complex comprised of GTP, eukaryotic initiation factor 2 (eIF2), and the initiator Met-tRNA. GTP-bound eIF2 binds to the initiator Met-tRNA to create the final initiation complex. The initiation complex scans the mRNA until the initiator Met-tRNA recognizes the start codon AUG and binds to the P site. Hydrolysis of GTP to GDP gives the energy needed to release eIF2 from the complex, subsequently allowing the 60S and 40S ribosomal subunits to assemble into a functional 80S ribosome.
Elongation occurs as the ribosome reads the mRNA strand in the 5’-3’ direction through an open reading frame (ORF). The ORF is a continuous stretch of codons that begins with the AUG start codon and ends at one of three termination codons (UAG, UAA, or UGA), which are not associated with any amino acid. The ribosome moves one triplet at a time along the mRNA. An aminoacyl-tRNA arrives complexed with eukaryotic elongation factor 1 (eEF1) and GTP; hydrolysis of GTP releases eEF1 and provides the energy for the aminoacyl-tRNA (with the appropriate anticodon to match the mRNA codon) to bind to the A site.
Elongation of the polypeptide occurs through the stepwise addition of amino acids, bound by peptide bonds, between the amino acids (attached to the tRNA) bound to the A site and P site. The tRNA molecule bound on the P site is called the peptidyl-tRNA since it bears the polypeptide chain. For clarification, the aminoacyl-tRNA only holds a new single amino acid (at the A site) to be added to the growing polypeptide chain. Translocation of tRNA molecules from the A site to the P site occurs through GTP hydrolysis catalyzed by eukaryotic elongation factor 2 (eEF2). As this shift moves the tRNA from the A site to the P site, the tRNA already on the P site moves to the E site, where it is now called the unloaded tRNA. From here, the unloaded tRNA is released from the E site.
Termination of translation happens when eukaryotic translation termination factor 1 (eRF1) recognizes one of the stop codons in the mRNA transcript. Eukaryotic translation termination factor 3 (eRF3) is recruited to hydrolyze (using GTP) the polypeptide chain from the tRNA occupying the P site. The newly formed polypeptide chain is released from the ribosome, and the ribosomal subunits dissociate.
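The reading of an open reading frame, from the AUG start codon to the first in-frame termination codon, can be sketched in a few lines of Python. This is a simplified illustration: it starts at the first AUG it finds and ignores initiation context, leaky scanning, and IRES-driven initiation. The function name and the example transcript are hypothetical.

```python
from itertools import product

# Standard genetic code, bases ordered U, C, A, G; '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_first_orf(mrna):
    """Translate from the first AUG, one triplet at a time, until a stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    residues = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # termination codon: the polypeptide is released
        residues.append(amino_acid)
    return "".join(residues)


# Hypothetical transcript: AUG GCU AAA UGG UGA, preceded by two untranslated bases.
peptide = translate_first_orf("CCAUGGCUAAAUGGUGACC")
```

The result is given in one-letter amino acid codes, N-terminus first, which mirrors the direction in which the ribosome builds the chain.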
Post-translational modifications can occur during any step of the protein’s life. Some modifications occur right after translation, while other modifications occur after protein folding by chaperones. Below is a list of some modifications that can be added to the side chains of amino acids in the polypeptide chain:
- Glycosylation- attachment of a carbohydrate group through N-glycosidic or O-glycosidic bonds
- Lipid Anchors- includes acylation (linking with long-chain fatty acids), isoprenylation (linkage of polyisoprene), and GPI anchor (linkage with glycosylphosphatidylinositol, GPI)
- Acetylation- attachment of acetyl groups (-COCH3)
- Methylation- attachment of methyl groups (-CH3)
- Carboxylation- attachment of carboxylic acid groups (-COOH)
- Hydroxylation- attachment of hydroxy groups (-OH)
- Phosphorylation- attachment of phosphate groups (-PO3 with a 2- charge)
There are numerous methods to perform qualitative and quantitative analysis of a polypeptide chain. Here are two major procedures currently used:
Edman degradation is used to find the sequence of amino acids of a protein. Essentially, the amino-terminal residue is labeled and cleaved from the polypeptide chain without disturbing the peptide bonds that hold together other amino acid residues. This process repeats by cleaving off one amino acid at a time until the entire chain is sequenced. Due to this method's tedious nature, a protein sequenator can be used to perform the Edman degradation in an automated way.
Amino acid sequencing of peptides can also be performed through mass spectrometry. Although this technique yields limited data when analyzing entire proteins at a time, it can effectively analyze peptides. One method of using mass spectrometry is called peptide mass fingerprinting (PMF). PMF is widely used to identify single purified proteins but is not feasible for heterogeneous protein mixtures. A generalized procedure of this method is as follows:
- Break up the protein sample into smaller peptide fragments by the use of proteolytic enzymes.
- Extract the fragments using acetonitrile.
- Dry the fragments by vacuum.
- Insert the peptides into the vacuum chamber of a mass spectrometer such as MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight) or ESI-TOF (electrospray ionization time-of-flight). MALDI is an ionization technique that requires a laser absorbing matrix to create ions from organic molecules with low amounts of fragmentation. ESI produces ions by applying a high voltage to a liquid to form an aerosol. TOF is a measurement of time used to measure the velocity of the ions through the vacuum chamber.
- The mass spectrometer creates a list of molecular weights (called the peak list) from the sample, which is then compared against databases (e.g., GenBank or SwissProt) for relevant matches.
- Appropriate software performs a simulated chemical cleavage reaction with the relevant protein sequences found on a database. The mass of these simulated peptide fragments is calculated and then compared to the peak list of the experimental peptide masses. These results are statistically analyzed, and possible matches are shown.
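The simulated cleavage and mass comparison in the last step can be sketched as follows. Several things here are assumptions rather than part of the procedure above: trypsin is chosen as the proteolytic enzyme (cleaving after lysine or arginine unless the next residue is proline), masses are neutral monoisotopic values rather than instrument m/z readings, and the function names, tolerance, and example sequence are hypothetical. Real search software also handles missed cleavages, modifications, and statistical scoring.

```python
import re

# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(protein):
    """Simulated trypsin cleavage: cut after K or R, but not before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_peaks(peak_list, protein, tolerance=0.01):
    """Pair each experimental mass with simulated peptides within the tolerance."""
    theoretical = {p: peptide_mass(p) for p in tryptic_digest(protein)}
    return [
        (peak, peptide)
        for peak in peak_list
        for peptide, mass in theoretical.items()
        if abs(peak - mass) <= tolerance
    ]


# Hypothetical database sequence and peak list.
fragments = tryptic_digest("MAKWRPGKLLR")
hits = match_peaks([348.183, 500.0], "MAKWRPGKLLR")
```

A candidate protein whose simulated fragments account for many peaks in the list is a more plausible match than one that explains only a few, which is the intuition behind the statistical ranking described above.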
Numerous diseases happen due to improper amino acid composition within a protein. Here are a few examples:
Huntington disease (HD) is caused by a CAG trinucleotide repeat expansion in the HTT gene on chromosome 4. Normally, the CAG segment is repeated 10-35 times in the gene. Individuals with 36 to 39 CAG repeats might develop HD, while those with 40 or more repeats almost always develop the disorder. This progressive brain disease is caused by gradual degradation of both the caudate nucleus and the putamen of the basal ganglia. This can lead to behavioral/psychiatric disturbances, unwanted choreatic movements, and dementia.
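The repeat-count thresholds above lend themselves to a short illustration. The sketch below counts the longest uninterrupted run of CAG triplets in a DNA string and maps the count onto the three ranges given in the text. The function names and category labels are hypothetical, and this is a teaching example, not a diagnostic tool.

```python
import re


def longest_cag_run(dna):
    """Return the number of repeats in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_repeat_count(repeats):
    """Map a CAG repeat count onto the ranges described in the text."""
    if repeats >= 40:
        return "almost always develops HD"
    if repeats >= 36:
        return "may develop HD"
    return "normal range"


# Hypothetical sequence fragment containing 37 consecutive CAG triplets.
fragment = "AA" + "CAG" * 37 + "CC"
count = longest_cag_run(fragment)
category = classify_repeat_count(count)
```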
Sickle Cell Anemia
The genetic defect of SCA is found within the beta-globin gene. Both the normal and mutated protein products consist of 147 amino acids. However, the mutated beta-globin gene has a single-base substitution (point mutation) in the codon for the sixth amino acid in the chain. In the sickle cell gene, GTG (the codon for valine) replaces the normal GAG codon (for glutamic acid). This defect deforms the normal biconcave disc shape of red blood cells into a crescent or sickle shape instead. Symptoms include anemia, episodes of pain, organ damage, delayed growth, and swelling of the hands and feet.
In cystic fibrosis (CF), there is a mutation in the CFTR (cystic fibrosis transmembrane conductance regulator) gene on chromosome 7. The CFTR gene is used to create the cystic fibrosis transmembrane conductance regulator protein. Two mutations account for the majority of mutated alleles: F508del and G551D. However, more than 1000 mutations in the CFTR gene have been identified as causes of CF. F508del represents a deletion of the three base pairs that encode phenylalanine at position 508 in CFTR. G551D is caused by the substitution of adenine for guanine at the second position in the codon of the mRNA transcript. This substitution replaces glycine with aspartate at position 551.
Disease-causing mutations within the CFTR gene alter the protein structure, thus impairing chloride ion transport. Normally, this protein transports chloride ions in and out of cells that produce saliva, sweat, tears, mucus, and digestive enzymes. Due to the impaired chloride ion transport, cells that line the passageways of the pancreas, lungs, and other organs start producing abnormally thick and sticky mucus. In turn, the abnormal mucus production obstructs the airways and glands. CF symptoms include salty sweat/salty-tasting skin, exocrine pancreatic insufficiency, chronic pneumonia, lung fibrosis, obstructive azoospermia, and accumulation of thick, sticky mucus.
Identification of the specific base-pair mutations within the gene can help healthcare professionals further understand the disease phenotype of the patient. The severity of each disease is dependent on a few factors, such as the function of the protein, how many amino acids are affected by the base-pair mutation, and the type of mutation. For example, a base-pair substitution that leads to a silent mutation will not change any part of the amino acid sequence. On the other hand, an insertion or deletion of 1 or 2 base pairs leads to a frameshift mutation. Downstream of a frameshift, every codon is read in a shifted reading frame, which changes the amino acid composition of the rest of the polypeptide chain and normally leads to a nonfunctional protein. A mutated protein with an insertion or deletion of exactly 3 base pairs (or multiples of 3 base pairs) can have varied functionality.
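The distinction between these mutation types can be made concrete with a small sketch. It classifies a single-codon substitution by comparing the amino acids encoded before and after the change, and classifies an insertion or deletion by whether its length is a multiple of three. The function names are hypothetical; the "missense" and "nonsense" labels are standard genetics terms that go slightly beyond the wording above.

```python
from itertools import product

# Standard genetic code for DNA codons, bases ordered T, C, A, G; '*' is a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"      # same amino acid, sequence unchanged
    if alt_aa == "*":
        return "nonsense"    # premature stop codon
    return "missense"        # a different amino acid is incorporated


def classify_indel(n_base_pairs):
    """An insertion or deletion shifts the reading frame unless its length is a multiple of 3."""
    return "in-frame" if n_base_pairs % 3 == 0 else "frameshift"


# The sickle cell substitution from the text: GAG (glutamic acid) -> GTG (valine).
sickle = classify_substitution("GAG", "GTG")
# The F508del deletion from the text removes three base pairs.
f508del = classify_indel(3)
```

Under this scheme the sickle cell change is a missense substitution and F508del is an in-frame deletion, which is consistent with both producing a full-length but altered protein rather than a frameshifted one.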