VECT: an automatic visual Perl programming tool for nonprogrammers

Hui-Hsien Chou
Iowa State University, Ames, IA, USA

BioTechniques 38:615-621 (April 2005)

Modern high-throughput biological research produces enormous amounts of data that must be processed by computers, but many biologists dealing with these data are not professional programmers. Despite increased awareness of interdisciplinary training in bioinformatics, many biologists still find it difficult to create their own computational solutions. VECT, the Visual Extraction and Conversion Tool, has been developed to assist nonprogrammers to create simple bioinformatics programs without having to master a programming language. VECT provides a unified graphical user interface for data extraction, data conversion, output composition, and Perl code generation. Programming using VECT is achieved by visually performing the desired data extraction, conversion, and output composition tasks using some sample user data. These tasks are then compiled by VECT into an executable Perl program, which can be saved for later use and can carry out the same computation independently of VECT. VECT is released under the GNU General Public License and is freely available for all major computing platforms including Macintosh OS X, Linux, and Microsoft Windows at www.complex.iastate.edu.

INTRODUCTION

In the genomics and postgenomics eras, biologists frequently need to process a lot of biological data. Usually, biologists know how their data can be manually handled, but only a few of them are well versed enough in computer science to turn that knowledge into executable code. Powerful bioinformatic tools have been created to solve truly difficult and well-defined problems in computational biology. However, not all needs of biologists are as generic. Actually, most of the time biologists need some bridging programs to connect existing bioinformatic tools together to form their data processing pipeline. These generally involve data extraction, conversion, and reporting tasks that are very specific to their ongoing research. Creating these bridging programs is easy for experienced programmers, but for nonprogrammers this work can be detrimental and slow.

The author believes this limiting factor of modern biological research can be resolved in a creative manner. In this paper, a visual programming tool, Vect (the Visual Extraction and Conversion Tool), is introduced. It allows users to manipulate their sample data inside its user interface, and then generates Perl programs to replicate these tasks (1). Anyone who needs to process textual format data can potentially benefit from using Vect.

MATERIALS AND METHODS

Vect employs a data flow programming paradigm that is different from the control flow programming paradigm more familiar to programmers. An example problem of extracting the translated protein sequences of predicted open reading frames from a GenBank® report (www.ncbi.nlm.nih.gov) is used to illustrate the difference between the two programming paradigms. Suppose both the names and sequences of the proteins must be extracted. These data are delimited by the /protein_id and /translation= tags embedded inside the CDS regions in the report. To extract them, a programmer might have followed the control flow logic shown in Figure 1. A main loop scans through all input lines. Each input line is then checked against the name and protein delimiter tags. For the name of a protein, its quoted string name must be extracted. For each protein sequence, things are a bit more complex since it can span several input lines, so an inner loop is run to collect all its parts until the end of the sequence is seen. Subsequently, all quotation marks must be removed, and only when both the name and the sequence of a protein have been collected will output be produced in the FASTA format (www.ncbi.nlm.nih.gov/blast/html/search.html). The scanning continues until all input lines have been seen.

There is nothing wrong with the control flow programming paradigm. In fact, most programmers take it for granted. However, the data flow programming paradigm shown in Figure 1 seems to be an easier approach for nonprogrammers to follow. In this paradigm, focus is placed on how input data can be extracted and processed, disregarding the order of their arrival. For example, obtaining protein names and obtaining protein sequences are considered two unrelated processes. A user simply needs to define the steps to extract and process them separately (e.g., protein names have to be taken out of quotes, and protein sequences have to be concatenated and then also taken out of quotes). Output is produced using an output template. Therefore a user does not need to worry whether the name or the sequence of a protein reaches the output template first; the user only needs to know that when they have both arrived, they will be output together using the template as defined.

Although data flow programming used to refer to specialized hardware and software that have never been in widespread use (2), in Vect, this is simply the programming method adopted to facilitate user programming effort. The data flow programming paradigm naturally leads to an example-driven programming style. In Vect, programming is achieved by letting users handle some sample data in its interface. This is similar to using an editor or a spreadsheet program. Vect then translates user actions into executable Perl code expressed in the control flow programming paradigm.

Example-driven programming started in the 1960s and 1970s with the RPG (3) and COBOL (4) programming languages that allowed programmers to format reports using output templates. In the 1980s, spreadsheet programs (5,6) were invented that allowed users to program number-crunching jobs by inserting formulas in some cells and copying them to the others. In the 1990s, the popularity of GUI programs ignited the development of several object-oriented interface libraries and their associated rapid application development (RAD) tools (7,8). These tools allow a user interface to be designed graphically, almost like drawing in a graphic program. The design is then converted into program code that can recreate the interface at runtime. To sum up, programming by examples is not new, but to apply this concept to the data extraction, conversion, and formatting needs of biologists is new to the best of our knowledge. In the following example, the author demonstrates how this form of programming in Vect can help biologists create Perl programs.

RESULTS

Vect Programming Tutorial

It is helpful to demonstrate how Vect works from a user's perspective before each part of it is explained in detail. The protein extraction problem mentioned earlier is used as an example again. To begin with, a GenBank file is loaded directly into the Input Data panel of Vect. The first thing to grab is the protein sequence, so we use the right mouse button to click and drag over /translation=" to set it as an opening block tag. We also set the ending double quote " as the closing block tag. This defines regions in the input file where data can be selected. Since the ending double quote is not in the same position for each protein sequence, we change both the opening and closing tags to position-independent. This allows all protein sequences in the input to be identified and selected. The result is shown in Figure 2A, where pink regions are not selectable, green- and red-colored texts are the opening and closing block tags, respectively, and grey regions are the text actually selected.

The selected data are sent to the Convert Data panel by clicking the Move button. The initial data set is named Protein Parts in the Convert Data panel. Here, additional rules can be added to convert the data. Specifically, we need to add a concatenation rule to connect the broken protein parts into complete protein sequences, and then a quoted data extraction rule is needed to remove the /translation=" and " tags that are not part of the proteins. The results are shown in Figure 2B. The resulting data set Pure Proteins can be copied to the Output Data panel by clicking the Copy button. In Figure 2C, similar extraction and tag removal steps are defined for protein names. The resulting data set is named Pure Names. Note that yellow-colored texts in this figure indicate line-based selection tags.

To produce the desired output, both Pure Names and Pure Proteins are copied to the Output Data panel. This panel has both Template and Output views. To compose a correct FASTA output, we need to add a greater-than symbol > in front of the Pure Names and separate it from Pure Proteins by a new line (see Figure 2D). This FASTA-required symbol is not to be confused with the pair of arrows enclosing data set names. The former is a static text to output, but the latter is a placeholder for data sets. This formatting is taken by Vect as a template to group each pair of data from the two sets to produce the output. The results can be checked in the Output view shown in Figure 2E. If they are correct, we can finally go to the Perl Program panel shown in Figure 2F and click the Compile button to obtain a Perl program that can reproduce the same operations. This Perl program can be saved for later use and can work on other GenBank files with similar contents.

Data Extraction

The Input Data panel of Vect allows users to define the extraction of useful data from input files. It is designed to handle semi-structured text files commonly produced by online databases. Selection can be based on fields, in which each field is a sequence of non-whitespace characters separated by characters such as tabs or spaces. To select an entire field, just click on the field. Selected data are always highlighted with a grey background color. Selection can also be based on positions relative to either a field or an entire line. By clicking and dragging over a range of characters, a position-based selection is made. If the selected characters are completely contained within a field, then the position selection is relative to the same field on each line. Otherwise, or if the shift key is pressed while dragging, the position selection is relative to the entire line.

Selections can be restricted by designated tags in the input. Tags do not select data per se, but they help define the desired data that are to be selected. Tags can be block opening, block closing, or simply line tags. A line tag allows only lines containing it to be selected. The opening and closing tags enclose a region for selection, but they do not have to be paired. If an opening tag is followed by another opening tag, the second tag defines a new selection region (i.e., it functions both as a closing tag for the previous region and as an opening tag for its own region). All tags are defined by using the right mouse button to select text, but otherwise they are selected exactly the same way as regular text (i.e., tags can also be field- or position-based).

[Figure 1: flowchart with two panels, "The Control Flow Paradigm" and "The Data Flow Paradigm", both starting from the sample input /protein_id="AAD16616.1" and /translation="MQLLRTL...VSLIK" and ending with the FASTA output > AAD16616.1 followed by MQLLRTL...VSLIK.]

Figure 1. Comparison of the control flow and data flow programming paradigms. Vect (the Visual Extraction and Conversion Tool) presents to its users the data flow programming paradigm shown on the right. Users can separately define how data sources can be extracted, converted, and composed to produce the output. Vect then compiles the design into a Perl program that is expressed in the control flow paradigm shown on the left, which actually implements the computation.

[Figure 2: six screenshots, panels A to F.]

Figure 2. Vect (the Visual Extraction and Conversion Tool) programming: a tutorial. (A) Each protein block is defined by a green opening tag and a red closing tag. Pink regions are not selectable. The actual selected text is shown in grey. (B) Selected protein fragments are sent to the Convert Data panel and named Protein Parts. Two rules are added to concatenate the fragments (Quoted Proteins) and remove the quotes (Pure Proteins). (C) Similar selection is conducted for protein names. Here the yellow line selection tags are used. (D) The Output Data panel provides a template view to compose user output. (E) The Output Data panel also provides an output view to show the actual output. (F) Finally, a working Perl program can be obtained by clicking the Compile button in the Perl Program panel.
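
As an illustration of the control flow logic described in MATERIALS AND METHODS (the left panel of Figure 1), the following is a minimal Python sketch. It is not the Perl code that Vect generates. The function name genbank_to_fasta and the sample record are assumptions made for this example; only the accession AAD16616.1 and the sequence ends MQLLRTL and VSLIK come from Figure 1, and the middle residues are invented filler.

```python
import re

OPEN_TAG = '/translation="'

def genbank_to_fasta(lines):
    """Scan the lines once; use an inner loop for multi-line sequences."""
    records = []
    name = None
    it = iter(lines)
    for line in it:                      # main loop over all input lines
        m = re.search(r'/protein_id="([^"]*)"', line)
        if m:                            # name tag seen: keep the quoted string
            name = m.group(1)
            continue
        idx = line.find(OPEN_TAG)
        if idx == -1:
            continue
        part = line[idx + len(OPEN_TAG):].strip()
        parts = []
        while '"' not in part:           # inner loop: collect until the closing quote
            parts.append(part)
            part = next(it, '"').strip() # a lone quote ends the loop if input is truncated
        parts.append(part.split('"', 1)[0])
        protein = "".join(parts)
        if name is not None:             # output only when both pieces are collected
            records.append(f">{name}\n{protein}")
            name = None
    return "\n".join(records)

# Hypothetical sample record (middle residues are filler).
SAMPLE = '''\
     CDS             1..54
                     /protein_id="AAD16616.1"
                     /translation="MQLLRTL
                     GGSTPA
                     VSLIK"
'''

print(genbank_to_fasta(SAMPLE.splitlines()))
```

Note how the order of arrival matters here: the name must be remembered in a variable until the sequence has been fully collected, which is the bookkeeping the data flow view hides from the user.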
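
The data flow view of the same task, as walked through in the Vect Programming Tutorial, can be sketched as two independent pipelines that meet in an output template. This Python sketch is only an analogy for what the user does visually. All function names are hypothetical, the {pure_names} and {pure_proteins} placeholders stand in for Vect's arrow-enclosed data set names, each protein block is treated as one multi-line string for simplicity, and the second record (EXAMPLE2) is invented.

```python
import re

def select_protein_parts(text):
    # Block selection: from the opening tag /translation=" to the closing quote.
    return re.findall(r'/translation="[^"]*"', text)

def select_name_lines(text):
    # Line-tag selection: keep only lines that contain /protein_id.
    return [line.strip() for line in text.splitlines() if "/protein_id" in line]

def concatenate(items):
    # Concatenation rule: join fragments that were broken across lines.
    return ["".join(item.split()) for item in items]

def extract_quoted(items):
    # Quoted data extraction rule: keep only the text between double quotes.
    return [re.search(r'"([^"]*)"', item).group(1) for item in items]

# Output template: a static ">" followed by two data set placeholders.
TEMPLATE = ">{pure_names}\n{pure_proteins}"

def compose(template, pure_names, pure_proteins):
    # Pair the two data sets item by item and fill the template for each pair.
    return "\n".join(
        template.format(pure_names=n, pure_proteins=p)
        for n, p in zip(pure_names, pure_proteins)
    )

# Hypothetical sample input (filler residues, invented second record).
SAMPLE = '''\
     CDS             1..54
                     /protein_id="AAD16616.1"
                     /translation="MQLLRTL
                     GGSTPA
                     VSLIK"
     CDS             60..71
                     /protein_id="EXAMPLE2"
                     /translation="MKV"
'''

# The two pipelines are defined separately and never refer to each other.
pure_names = extract_quoted(select_name_lines(SAMPLE))
pure_proteins = extract_quoted(concatenate(select_protein_parts(SAMPLE)))
print(compose(TEMPLATE, pure_names, pure_proteins))
```

Neither pipeline needs to know when the other one produces its data; the pairing happens only in compose, which mirrors the role of the output template in the text.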
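
The rule in Data Extraction that opening and closing tags need not be paired can be made concrete with a small sketch. This is one possible reading of that paragraph, written in Python for illustration. The function select_regions is hypothetical, it returns the text between tags rather than including the tags as Vect's display appears to, and the choice to let an unclosed final region run to the end of the input is an assumption that the source does not confirm.

```python
def select_regions(text, open_tag, close_tag=None):
    """Return the text of each region started by open_tag.

    A region ends at the closing tag or at the next opening tag,
    whichever comes first, so a second opening tag also closes
    the region before it.
    """
    regions = []
    pos = text.find(open_tag)
    while pos != -1:
        start = pos + len(open_tag)
        next_open = text.find(open_tag, start)
        close = text.find(close_tag, start) if close_tag else -1
        ends = [i for i in (close, next_open) if i != -1]
        # Assumption: with no tag left, the region runs to the end of the input.
        end = min(ends) if ends else len(text)
        regions.append(text[start:end])
        pos = next_open
    return regions

# Hypothetical input: the first region is closed by the next opening tag,
# the second by an explicit closing tag, the third by the end of the input.
TEXT = "ID a1 x y ID b2 z END junk ID c3"
print(select_regions(TEXT, "ID ", " END"))
```

Text that falls after a closing tag and before the next opening tag ("junk" in the sample) belongs to no region, which corresponds to the non-selectable areas described for Figure 2A.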