VECT: an automatic visual Perl programming tool for nonprogrammers
Hui-Hsien Chou
BioTechniques 38:615-621 (April 2005)

Modern high-throughput biological research produces enormous amounts of data that must be processed by computers, but many biologists dealing with these data are not professional programmers. Despite increased awareness of interdisciplinary training in bioinformatics, many biologists still find it difficult to create their own computational solutions. VECT, the Visual Extraction and Conversion Tool, has been developed to help nonprogrammers create simple bioinformatics programs without having to master a programming language. VECT provides a unified graphical user interface for data extraction, data conversion, output composition, and Perl code generation. Programming using VECT is achieved by visually performing the desired data extraction, conversion, and output composition tasks using some sample user data. These tasks are then compiled by VECT into an executable Perl program, which can be saved for later use and can carry out the same computation independently of VECT. VECT is released under the GNU General Public License and is freely available for all major computing platforms including Macintosh® OS X, Linux, and Microsoft® Windows® at www.complex.iastate.edu.

Author affiliation: Iowa State University, Ames, IA, USA
Research Report. BioTechniques, Vol. 38, No. 4 (2005), page 615

INTRODUCTION

In the genomics and postgenomics eras, biologists frequently need to process a lot of biological data. Usually, biologists know how their data can be handled manually, but only a few of them are well enough versed in computer science to turn that knowledge into executable code. Powerful bioinformatic tools have been created to solve truly difficult and well-defined problems in computational biology. However, not all needs of biologists are as generic. Most of the time, biologists need bridging programs that connect existing bioinformatic tools to form their data processing pipeline. These generally involve data extraction, conversion, and reporting tasks that are very specific to their ongoing research. Creating these bridging programs is easy for experienced programmers, but for nonprogrammers this work can be detrimental and slow.

The author believes this limiting factor of modern biological research can be resolved in a creative manner. In this paper, a visual programming tool, Vect (the Visual Extraction and Conversion Tool), is introduced. It allows users to manipulate their sample data inside its user interface, and then generates Perl programs to replicate these tasks (1). Anyone who needs to process textual format data can potentially benefit from using Vect.

MATERIALS AND METHODS

Vect employs a data flow programming paradigm that is different from the control flow programming paradigm more familiar to programmers. An example problem of extracting the translated protein sequences of predicted open reading frames from a GenBank® report (www.ncbi.nlm.nih.gov) is used to illustrate the difference between the two programming paradigms. Suppose both the names and sequences of the proteins must be extracted. These data are delimited by the /protein_id and /translation= tags embedded inside the CDS regions in the report. To extract them, a programmer might follow the control flow logic shown in Figure 1. A main loop scans through all input lines. Each input line is checked against the name and protein delimiter tags. For the name of a protein, its quoted string name must be extracted. For each protein sequence, things are a bit more complex since it can span several input lines, so an inner loop is run to collect all its parts until the end of the sequence is seen. Subsequently, all quotation marks must be removed, and only when both the name and the sequence of a protein have been collected will output be produced in the FASTA format (www.ncbi.nlm.nih.gov/blast/html/search.html). The scanning continues until all input lines have been seen.

There is nothing wrong with the control flow programming paradigm. In fact, most programmers take it for granted. However, the data flow programming paradigm shown in Figure 1 seems to be an easier approach for nonprogrammers to follow. In this paradigm, focus is placed on how input data can be extracted and processed, disregarding the order of their arrival. For example, obtaining protein names and obtaining protein sequences are considered two unrelated processes. A user simply needs to define the steps to extract and process them separately (e.g., protein names have to be taken out of quotes, and protein sequences have to be concatenated and then also taken out of quotes). Output is produced using an
output template. Therefore a user does not need to worry whether the name or the sequence of a protein reaches the output template first; the user only needs to know that when both have arrived, they will be output together using the template as defined.

Although data flow programming used to refer to specialized hardware and software that have never been in widespread use (2), in Vect, this is simply the programming method adopted to facilitate user programming effort. The data flow programming paradigm naturally leads to an example-driven programming style. In Vect, programming is achieved by letting users handle some sample data in its interface. This is similar to using an editor or a spreadsheet program. Vect then translates user actions into executable Perl code expressed in the control flow programming paradigm.

Example-driven programming started in the 1960s and 1970s with the RPG (3) and COBOL (4) programming languages that allowed programmers to format reports using output templates. In the 1980s, spreadsheet programs (5,6) were invented that allowed users to program number-crunching jobs by inserting formulas in some cells and copying them to the others. In the 1990s, the popularity of GUI programs ignited the development of several object-oriented interface libraries and their associated rapid application development (RAD) tools (7,8). These tools allow a user interface to be designed graphically, almost like drawing in a graphic program. The design is then converted into program code that can recreate the interface at runtime. To sum up, programming by examples is not new, but to the best of our knowledge, applying this concept to the data extraction, conversion, and formatting needs of biologists is new. In the following example, the author demonstrates how this form of programming in Vect can help biologists create Perl programs.

RESULTS

Vect Programming Tutorial

It is helpful to demonstrate how Vect works from a user's perspective before each part of it is explained in detail. The protein extraction problem mentioned earlier is used as an example again. To begin with, a GenBank file is loaded directly into the Input Data panel of Vect. The first thing to grab is the protein sequence, so we use the right mouse button to click and drag over /translation=" to set it as an opening block tag. We also set the ending double quote " as the closing block tag. This defines regions in the input file where data can be selected. Since the ending double quote is not in the same position for each protein sequence, we change both the opening and closing tags to position-independent. This allows all protein sequences in the input to be identified and selected. The result is shown in Figure 2A, where pink regions are not selectable, green- and red-colored texts are the opening and closing block tags, respectively, and grey regions are the text actually selected.

The selected data are sent to the Convert Data panel by clicking the Move button. The initial data set is named Protein Parts in the Convert Data panel. Here, additional rules can be added to convert the data. Specifically, we need to add a concatenation rule to connect the broken protein parts into complete protein sequences, and then a quoted data extraction rule to remove the /translation=" and " tags that are not part of the proteins. The results are shown in Figure 2B. The resulting data set, Pure Proteins, can be copied to the Output Data panel by clicking the Copy button. In Figure 2C, similar extraction and tag removal steps are defined for protein names. The resulting data set is named Pure Names. Note that yellow-colored texts in this figure indicate line-based selection tags.

To produce the desired output, both Pure Names and Pure Proteins are copied to the Output Data panel. This panel has both Template and Output views. To compose a correct FASTA output, we need to add a greater-than symbol > in front of Pure Names and separate it from Pure Proteins by a new line (see Figure 2D). This FASTA-required symbol is not to be confused with the pair of arrows enclosing data set names. The former is static text to output, but the latter is a placeholder for data sets. This formatting is taken by Vect as a template to group each pair of data from the two sets to produce the output. The results can be checked in the Output view shown in Figure 2E. If they are correct, we can finally go to the Perl Program panel shown in Figure 2F and click the Compile button to obtain a Perl program that can reproduce the same operations. This Perl program can be saved for later use and can work on other GenBank files with similar contents.

Data Extraction

The Input Data panel of Vect allows users to define the extraction of useful data from input files. It is designed to handle semi-structured text files commonly produced by online databases. Selection can be based on fields, in which each field is a sequence of non-whitespace characters separated by characters such as tabs or spaces. To select an entire field, just click on the field. Selected data are always highlighted with a grey background color. Selection can also be based on positions relative to either a field or an entire line. By clicking and dragging over a range of characters, a position-based selection is made. If the selected characters are completely contained within a field, then the position selection is relative to the same field on each line. Otherwise, or if the shift key is pressed while dragging, the position selection is relative to the entire line.

Selections can be restricted by designated tags in the input. Tags do not select data per se, but they help define the desired data that are to be selected. Tags can be block opening, block closing, or simply line tags. A line tag allows only lines containing it to be selected. The opening and closing tags enclose a region for selection, but they do not have to be paired. If an opening tag is followed by another opening tag, the second tag defines a new selection region (i.e., it functions both as a closing tag for the previous region and as an opening tag for its own region). All tags are defined by using the right mouse button to select text, but otherwise they are selected exactly the same way as regular text (i.e., tags can also be field- or position-based).
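
The selection rules described in this section can be approximated in a few lines of Python. This is only an illustrative sketch, not Vect's implementation: the function names are hypothetical, the returned regions exclude the tags themselves, and an opening tag with no later closing or opening tag is assumed to select through the end of the text, a case the description above does not cover.

```python
def select_blocks(text, open_tag, close_tag):
    """Return regions enclosed by position-independent block tags.

    A region ends at the next closing tag, or at the next opening tag
    if that comes first (the second opening tag then starts a new region).
    """
    regions = []
    pos = text.find(open_tag)
    while pos != -1:
        start = pos + len(open_tag)
        next_close = text.find(close_tag, start)
        next_open = text.find(open_tag, start)
        candidates = [i for i in (next_close, next_open) if i != -1]
        # Assumption: an unterminated region runs to the end of the text.
        end = min(candidates) if candidates else len(text)
        regions.append(text[start:end])
        pos = next_open
    return regions


def select_lines(lines, line_tag):
    """Line tag: only lines containing the tag remain selectable."""
    return [line for line in lines if line_tag in line]


def select_field(line, index):
    """Field-based selection: fields are runs of non-whitespace characters."""
    fields = line.split()
    return fields[index] if index < len(fields) else None


def select_span(line, start, end, field_index=None):
    """Position-based selection, relative to one field or to the whole line."""
    if field_index is None:
        return line[start:end]
    field = select_field(line, field_index)
    return None if field is None else field[start:end]
```

For example, `select_blocks(report, '/translation="', '"')` would return each protein sequence with its embedded line breaks intact, which is why a separate concatenation step is still needed afterwards.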
[Figure 1 flowchart, not reproduced. Left panel, "The Control Flow Paradigm": sequentially scan each input line; on seeing /protein_id, save the line, get the quoted string, and store the name; on seeing /translation, concatenate lines until the ending " is seen, get the quoted string, and store the protein; once both name and protein are collected, output >, name, and protein, then clean name and protein. Right panel, "The Data Flow Paradigm": raw names and raw proteins emerge from the input; names pass through "get quoted string", proteins pass through "concatenate" and then "get quoted string", and both fill in the output template. Example data: /protein_id="AAD16616.1" and /translation="MQLLRTL...VSLIK" produce the output "> AAD16616.1" followed by "MQLLRTL...VSLIK".]
Figure 1. Comparison of the control flow and data flow programming paradigms. Vect (the Visual Extraction and Conversion Tool) presents to its users the data flow programming paradigm shown on the right. Users can separately define how data sources are extracted, converted, and composed to produce the output. Vect then compiles the design into a Perl program expressed in the control flow paradigm shown on the left, which actually implements the computation.
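
To make the contrast in Figure 1 concrete, the following Python sketch solves the same extraction problem in both styles. It is an illustration under stated assumptions, not the Perl code that Vect generates: the function names and the toy record in `SAMPLE` are invented, and the data flow version assumes every record has both a /protein_id and a /translation entry so that the two lists can be paired by position.

```python
import re

# Toy GenBank-like excerpt; identifiers and sequences are made up.
SAMPLE = """\
     CDS             1..27
                     /protein_id="TOY00001.1"
                     /translation="MKT
                     AYI
                     AKQ"
     CDS             40..57
                     /protein_id="TOY00002.1"
                     /translation="MGAVLK"
"""


def extract_control_flow(lines):
    """Control flow style: one pass over the lines with explicit state."""
    records, parts = [], []
    name = protein = None
    collecting = False
    for line in lines:
        text = line.strip()
        if collecting:  # inside a multi-line sequence
            parts.append(text)
            if text.endswith('"'):
                collecting = False
                protein = "".join(parts).strip('"')
        elif text.startswith("/protein_id="):
            name = text.split("=", 1)[1].strip('"')
        elif text.startswith("/translation="):
            parts = [text.split("=", 1)[1]]
            if parts[0].count('"') == 2:  # sequence ends on the same line
                protein = parts[0].strip('"')
            else:
                collecting = True
        if name is not None and protein is not None:
            records.append((name, protein))
            name = protein = None  # reset for the next record
    return records


def extract_data_flow(text):
    """Data flow style: names and sequences are extracted independently."""
    names = re.findall(r'/protein_id="([^"]*)"', text)
    raw = re.findall(r'/translation="([^"]*)"', text)
    proteins = ["".join(chunk.split()) for chunk in raw]  # concatenate parts
    return list(zip(names, proteins))


def to_fasta(records):
    """Compose FASTA output: a '>' header line, then the sequence."""
    return "".join(f">{name}\n{protein}\n" for name, protein in records)
```

Both functions return the same list of (name, sequence) pairs for `SAMPLE`; the difference lies in whether the order of arrival has to be tracked by hand.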
Figure 2. Vect (the Visual Extraction and Conversion Tool) programming: a tutorial. (A) Each protein block is defined by a green opening tag and a red closing tag. Pink regions are not selectable. The text actually selected is shown in grey. (B) Selected protein fragments are sent to the Convert Data panel and named Protein Parts. Two rules are added to concatenate the fragments (Quoted Proteins) and remove the quotes (Pure Proteins). (C) A similar selection is conducted for protein names. Here the yellow line selection tags are used. (D) The Output Data panel provides a template view to compose user output. (E) The Output Data panel also provides an output view to show the actual output. (F) Finally, a working Perl program can be obtained by clicking the Compile button in the Perl Program panel.
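
The template step in panels D and E can be sketched as follows. This is a minimal illustration, not Vect's own mechanism: the `{Data Set Name}` placeholder syntax and the `fill_template` function are hypothetical stand-ins for the arrow-enclosed data set names described in the tutorial, and the sample values are invented.

```python
import re

# Hypothetical placeholder syntax: a data set name wrapped in braces.
PLACEHOLDER = re.compile(r"\{(.+?)\}")


def fill_template(template, data_sets):
    """Emit the template once per group of items taken from each data set.

    Items are grouped by position: the first name with the first protein,
    the second name with the second protein, and so on.
    """
    names = PLACEHOLDER.findall(template)
    columns = [data_sets[name] for name in names]
    output = []
    for row in zip(*columns):
        values = dict(zip(names, row))
        # Static text such as '>' is copied through unchanged.
        output.append(PLACEHOLDER.sub(lambda m: values[m.group(1)], template))
    return "".join(output)


fasta_template = ">{Pure Names}\n{Pure Proteins}\n"
data = {
    "Pure Names": ["TOY00001.1", "TOY00002.1"],
    "Pure Proteins": ["MKTAYIAKQ", "MGAVLK"],
}
print(fill_template(fasta_template, data), end="")
```

Here the leading `>` is static text belonging to the FASTA format, while the braces mark where items from each data set are substituted, mirroring the distinction drawn in the tutorial.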