Homework 10: Genomics

Assigned
Wednesday October 19, 2016
Due
12 noon, Friday October 21, 2016
Summary
The goal of this assignment is to gain further experience with string processing.
Collaboration
Do this assignment with your assigned partner.
Submitting
Submit a Python program using the online turnin form. See below.
Scoring
12 points

Your Task: Finding Genes

Do project 6.2 from our textbook (pp. 316-320).

Before you begin, right-click to download the skeleton code for this project:

You will write your code in genefinder.py.

Also download the following data files, containing various size prefixes of the genome of a particular strain of E. coli:

You will begin with the smaller files.

Follow the instructions in the textbook to do both part 1 and part 2. In part 1, you will complete the definition of the function orf1(dna, rf, tortoise). In part 2, you will complete the definition of the function gcFreq(dna, window, tortoise).

Above & Beyond: Microsatellites

Hungry for more? Try some of the more challenging exercises from section 6.7, which concern the identification of microsatellites or simple sequence repeats (SSRs).

Create a file named ssr.py and do exercises 6.7.8, 6.7.9, and 6.7.10 (p. 310). Rather than naming all the functions ssr(dna, repeat), as the textbook suggests, name your functions as follows:

6.7.8
firstSSR(dna, repeat)
6.7.9
longestSSR(dna, repeat)
6.7.10
longestDinucleotideSSR(dna)

Here is my test code:

if __name__ == "__main__":
    inputFile = open('eco536-500.txt', 'r')
    testSequence = inputFile.read()
    print(testSequence)
    print("First 'caga' SSR:", firstSSR(testSequence, "caga"))
    print("Longest 't' SSR:", longestSSR(testSequence, "t"))
    print("The most repeated dinucleotide is", longestDinucleotideSSR(testSequence))
And here is the sample output:
agcttttcattctgactgcaacgggcaatatgtctctgtgtggattaaaaaaagagtgtctgatagcagcttctgaactggttacctgccgtgagtaaattaaaattttattgacttaggtcactaaatactttaaccaatataggcatagcgcacagacagataaaaattacagagtacacaacatccatgaaacgcattagcaccaccattaccaccaccatcaccattaccacaggtaacggtgcgggctgacgcgtacaggaaacacagaaaaaagcccgcacctgacagtgcgggcttttttttcgaccaaaggtaacgaggtaacaaccatgcgagtgttgaagttcggcggtacatcagtggcaaatgcagaacgttttctgcgggttgccgatattctggaaagcaatgccaggcaggggcaggtggccaccgtcctctctgcccccgccaaaatcaccaaccatctggtagcgatgattgaaaaaaccat
First 'caga' SSR: 2
Longest 't' SSR: 8
The most repeated dinucleotide is tt
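
If you are unsure what the three functions should return, here is one possible reading of the sample output above, written as a rough sketch rather than a reference solution. It assumes that firstSSR returns the number of back-to-back copies of repeat starting at its first occurrence (0 if repeat never occurs), that longestSSR returns the largest such count anywhere in the sequence, and that longestDinucleotideSSR returns the two-letter repeat with the largest count. The helper countRepeats is a hypothetical name of my own, and the tie-breaking rule (keep the first dinucleotide tried) is an assumption; check the textbook's wording before relying on either.

```python
def countRepeats(dna, repeat, start):
    """Count back-to-back copies of repeat in dna beginning at index start.

    Hypothetical helper, not named in the textbook.
    """
    step = len(repeat)
    if step == 0:
        return 0    # an empty repeat would otherwise loop forever
    count = 0
    while dna[start:start + step] == repeat:
        count = count + 1
        start = start + step
    return count

def firstSSR(dna, repeat):
    """Return the number of repeats at the first occurrence of repeat, or 0."""
    start = dna.find(repeat)
    if start == -1:
        return 0
    return countRepeats(dna, repeat, start)

def longestSSR(dna, repeat):
    """Return the largest number of back-to-back copies of repeat in dna."""
    longest = 0
    for start in range(len(dna) - len(repeat) + 1):
        count = countRepeats(dna, repeat, start)
        if count > longest:
            longest = count
    return longest

def longestDinucleotideSSR(dna):
    """Return the dinucleotide with the longest SSR in dna ('' if none).

    Assumption: ties go to the first dinucleotide tried, in a, c, g, t order.
    """
    best = ''
    bestCount = 0
    for first in 'acgt':
        for second in 'acgt':
            count = longestSSR(dna, first + second)
            if count > bestCount:
                best = first + second
                bestCount = count
    return best
```

On the short sequence 'ttacacacgttttt', this sketch gives firstSSR(dna, 'ac') == 3, longestSSR(dna, 't') == 5, and longestDinucleotideSSR(dna) == 'ac'.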

Grading and Submission

Please style your code per section 3.4 of the textbook.

Submit one file, genefinder.py, through the online turnin form. You may optionally submit a second file, ssr.py, for the Above & Beyond problems.



Janet Davis (davisj@whitman.edu).
This assignment is adapted from our textbook.

Created October 17, 2016
Last revised October 17, 2016, 03:52:33 PM PDT
CC-BY-NC-SA This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.