Code Tutorial #1: Python Code for reading a FASTA file (Bioinformatics)

A simple exercise with code

When I first started my bioinformatics project in grad school, one of the biggest challenges I faced was understanding and accessing the information stored in various file types. So I decided to start the first code tutorial here with an introduction to reading and extracting information from a FASTA file.

This is going to be useful if you are a biologist planning on taking up research in bioinformatics or computational genomics. You will be dealing with Big Data almost all of the time. A FASTA file holds a biological sequence as plain text, in this case the nucleotide (DNA) sequence of an organism. The sequence differs between species, and there are also differences at the individual level that make each individual unique.

You will need to read a FASTA file when you are analyzing the genetic sequence of an organism. You may need it for analyzing specific areas of the sequence or calculating the GC content for the species and so on.
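As a rough illustration of those two tasks, here is a minimal sketch that computes GC content and pulls out a region of a sequence. The function name `gc_content` and the toy sequence are made up for illustration, and the sketch assumes the sequence is already in memory as a plain string of A, C, G and T.

```python
def gc_content(sequence):
    """Return the fraction of bases in sequence that are G or C (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count('G') + sequence.count('C')
    return gc_count / len(sequence)


# Hypothetical toy sequence, not taken from a real genome.
example_sequence = "ATGCGCTA"

# A specific region is just a slice of the string
# (zero-based positions 2 through 5 here).
region = example_sequence[2:6]

print(gc_content(example_sequence))
print(region)
```

Because the sequence is an ordinary Python string, slicing and counting are enough for simple analyses like these.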

A FASTA file looks like this: a first line that starts with a ‘>’, followed by the sequence itself on the lines after it.

The ‘>’ line is the FASTA header. It carries the sequence identifier and is recognized by many online bioinformatics tools that you may use on a regular basis.
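In plain text, a FASTA record is a header line followed by one or more lines of sequence. The sketch below uses a made-up record (the identifier `example_seq_1` and the bases are invented for illustration) and separates the header from the sequence lines.

```python
# A made-up FASTA record: one header line, then the sequence wrapped over lines.
example_fasta = """>example_seq_1 hypothetical description
ATGCGTACGT
TTAGC
"""

lines = example_fasta.splitlines()
header = lines[0]
sequence_lines = lines[1:]

# Drop the leading '>' and take the first word as the sequence identifier.
identifier = header[1:].split()[0]
sequence = ''.join(sequence_lines)

print(identifier)
print(sequence)
```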

Here is the Python script that helps you read it:

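def readGenomeFromFasta(file_genome):
    """Return the sequence in a FASTA file as one string, skipping header lines."""
    genome = ''
    with open(file_genome, 'r') as new_file_1:
        for line in new_file_1:
            # Header lines start with '>'; every other line is sequence.
            if not line.startswith('>'):
                genome += line.rstrip()
    return genome  # len(genome) gives the number of nucleotides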

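This skips every header line that starts with a ‘>’ and also strips the trailing newline and any trailing blank spaces from each sequence line as the file is read. The function returns the whole sequence as a single string, so calling len() on the result gives the number of nucleotides in the file.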
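One consequence of skipping every ‘>’ line is that a file holding several records comes back as one concatenated string. If the records need to stay separate, a minimal sketch along these lines could work. `read_fasta_records` is a hypothetical helper, not part of the original script, and it assumes the first word of each header is a unique identifier.

```python
import os
import tempfile


def read_fasta_records(path):
    """Return a dict mapping each record's identifier to its sequence."""
    records = {}
    current_id = None
    with open(path, 'r') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith('>'):
                # Assumption: the first word after '>' identifies the record.
                words = line[1:].split()
                current_id = words[0] if words else ''
                records[current_id] = []
            elif current_id is not None:
                records[current_id].append(line)
    # Join the collected lines once per record instead of growing a string.
    return {name: ''.join(parts) for name, parts in records.items()}


# Demo on a throwaway file with two made-up records.
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write(">seq_a first toy record\nATGC\nGGTA\n>seq_b second toy record\nTTTT\n")
    demo_path = tmp.name

demo_records = read_fasta_records(demo_path)
os.remove(demo_path)

for name, seq in demo_records.items():
    print(name, len(seq))
```

Collecting lines in a list and joining them at the end also avoids repeatedly rebuilding a long string, which can matter for genome-sized files.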

The following piece of code is used along with the function definition. It asks for the input and output file names, calls the function to process the FASTA file, and writes the resulting sequence to a new file:

print("Type filename")
file_genome = input("> ")
print("Type output filename for genome")
output_file = input("> ")
# readGenomeFromFasta opens the input file itself, so only the name is passed.
final_file = readGenomeFromFasta(file_genome)
# The with block closes the output file so the sequence is flushed to disk.
with open(output_file, 'w') as out_file:
    out_file.write(final_file)

Beyond just reading the FASTA file, here is a bit of code that gives us a glimpse of what we can do with the result. Here I am simply creating an artificial nested list of “1”s, one for every nucleotide position. This is not helpful by itself (it was part of a bigger problem I was solving in my project), but it serves as a demo of how to use the sequence once you can read it. For a sequence of seven nucleotides, the list looks like this:

[[1],[1],[1],[1],[1],[1],[1]]

def new_list_for_index(genome):
    """Return a nested list with one [1] marker for every position in genome."""
    new_list = []
    for idx in genome:
        new_index = [1]
        new_list.append(new_index)
    return new_list

print("Type name of marker file you want to output index data into")
marker_index = input("> ")
markerIdx = new_list_for_index(readGenomeFromFasta(file_genome))
# Write the list as text; the with block closes the file afterwards.
with open(marker_index, 'w') as markerIndex:
    markerIndex.write(str(markerIdx))
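
The same marker list can also be built with a list comprehension. The sketch below is an equivalent rewrite rather than the original function, and `markers_for` is a hypothetical name. For a seven-base sequence it produces the seven-element list shown earlier.

```python
def markers_for(genome):
    """Return one independent [1] marker per character in genome."""
    # A fresh [1] is created on every iteration, so the inner lists are not shared.
    return [[1] for _ in genome]


toy_markers = markers_for("ATGCATG")  # made-up seven-base sequence
print(toy_markers)
```

Writing `[[1]] * len(genome)` would look similar, but every position would then point to the same inner list, so changing one marker would change them all.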

Stay tuned for more, and check out ‘Prediktr’: https://thecoderchick.com/portfolio/prediktr/

Published by Anya

Founder at The TechGirl Journal & The IDEA Bucket ; Anya lives in California while working in the field of Computational Genomics. TechGirl Journal is focussed on the lifestyle of a girl in STEM and tips on how to build a business and a career in tech with a focus on skill-development, interviews, internship, personal projects, and pet-peeves! The IDEA Bucket is focused on small business ventures and practical, urban lifestyles. For specific inquiries, you can e-mail: hello@techgirljournal.com
