Searching for small DNA segments
In bioinformatics, it is often necessary to run our programs on the command line. Fortunately Python’s argparse module- part of the standard library makes this easy.
Python has a collection of useful libraries which make it easy to build user interfaces. Ordinarily, when we think about user interfaces, a graphical point and click interface comes to mind. However, a user interface is anything that allows a user to interact with your program, including command line interfaces- the focus of this article.
To introduce argparse, I will build a small program which searches for specific small segments of DNA specified as input by the user. These small segments of DNA that the program will search for are called restriction enzymes sites, i.e. small Python strings with a length between 4–6 characters. This program will search the genome of bacteria to see how often they are present!
The program has two clear intentions:
- Be informative to the user, for example by providing helpful messages when the program crashes.
- Be Flexible. The program should be able to search any DNA of the user’s choosing by changing command line arguments.
To begin, import the argparse module and create an ArgumentParser object:
It is not necessary, but I find it helpful to provide a descriptive string using the description keyword argument. When help is called on the command line later in the program, a descriptive overview of the python program is provided. This may be useful, if there are multiple users, and time elapse between program runs.
Following this, we then tell the argument parser object what arguments to expect by calling the add_argument() method. Once all the arguments has been added, we can tell it to go ahead and parse the command line arguments by calling the parse_args ()method.
Three arguments have been added by the add_argument() method. In addition, various other optional parameters have been added, such as help=, type=, choice=, default= ect.
This is where the argparse module really comes into its own.
The short and long arguments are specified with either a single hyphen(-), or a double hyphen ( — ) respectively. This means that on the command line in the terminal, when the program is run, the argument can be expressed in a shorthand way, -i for the input file name, or as the longhand way, — input_filename for the input file name. A further benefit of short and long arguments is that they can be expressed in any order on the command line!
Short and long arguments in a mix and matched arrangement are shown below.
We can make the help text even better by calling add_argument() with the help keyword argument. When the program is now run with no command line arguments, useful help text appears for the three arguments (highlighted in yellow):
The type= parameter in the parser.add_argument method accepts the name of a Python function
The type parameter refers to a function, for example the — input_filename is checked by the function called file_check. This function shown below, checks that the file actually exists.
Finally the choice and nargs arguments means the user can choose any of the elements in the choice list. As the user can choose multiple arguments, the nargs parameters is need and an asterisks is specified.
Finally, now all the arguments have been added and functions added for input validation, a small script can be written. I first write a small dictionary, where the restriction enzymes names are the keys are their corresponding sequences are their values.
For completeness, I have included the modules required for this program above the restriciton_enzyme_dict dictionary.
The parameters specified by the user are prefixed with the args followed by their input name.
The DNA file is first opened, and read using the .read() method. Following this, the pattern(s) specified by the user are iterated over, and for each restriction enzyme I find it value using the .get() method. The regular expression re.findall finds all occurrences of the sequences in the DNA, and append the sequences to a results list.
The results list is converted to a special dictionary using the collection.Counter method () which counts how many times a unique sequence is found.
To finish, this new dictionary created by collections.Counter called restriction_sites_dict can be iterated over and if the sequence is above a count specified by the user a message is printed telling the user how many times it was found and its sequence.
To see the program in action, lets run a few test examples!
Example output is shown:
Here, I searched for the three restriction enzymes, -p HindIII, EcoRI and KpnI using the p flag (the p flag corresponds to the pattern specified by the user). In the first run, I did not specify a count. Earlier, in the add_argument for count_no I set a useful default to 0.
When I run the program again, with a count threshold of 500, the program informs me that the HindIII site is below the 500 threshold I specified.
This program is working nicely. The results are returned quickly and the restriction enzymes sites within the whole genome of E.coli have been searched! Pretty cool (this Fasta file was obtained from NCBI), but on our command line we could easily choose to search other bacterial genomes, by specifying their respective filename.
To finish, it would be useful to address an earlier intention. Specifically, the program was designed to be flexible. What happens if the users adds an incorrect restriction enzyme called for example, FakeEnzyme?
Lets run the program to find out.
I have highlighted the useful error message that appears if the user enters an incorrect choice which prompts the user to choose from the selection list provided.
This brief tutorial has shown how a command line interface can be added to a program. Thankfully Python provides useful libraries for building interfaces, of which argparse is one.