Tuto4: Awk command.md

October 1, 2023

Practical for the Awk command

  • By the end of this tutorial you will be able to:
    • Improve text processing skills.
    • Extract & Transform your data.
    • Clean & validate your data.


  1. Download the example GFF file from the link above into your working directory Dir3.
  2. To print the content of your file, type this command:
cat GFFfile
  1. Print the content of GFFfile with awk, the command named after its authors Aho, Weinberger, and Kernighan:
awk '{print}' GFFfile
  1. Repeat the awk '{print}' command, replacing '{print}' with '//'. Is there any difference?
awk '//' GFFfile
  1. Repeat the awk '{print}' command, replacing "print" with "print $0". Is there any difference?
awk '{print $0}' GFFfile
  1. To print formatted output, use printf with an explicit format string:
awk '{printf "%s\n", $0}' GFFfile
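One caveat on printf, sketched here on made-up input rather than the tutorial's GFFfile: printf treats its first argument as a format string and does not append a newline. Passing $0 directly as the format therefore runs all lines together and would misread any % character in the data, while an explicit "%s\n" format avoids both problems.

```shell
# Made-up two-line input (an assumption for illustration, not the real GFF file).
# With $0 as the format string, no newline is added, so the lines run together.
joined=$(printf 'a\nb\n' | awk '{printf $0}')
echo "$joined"

# With an explicit format, each record is printed verbatim on its own line,
# even when it contains a literal % character.
safe=$(printf 'chr1 gene 100%%\n' | awk '{printf "%s\n", $0}')
echo "$safe"
```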
  1. How many lines does GFFfile contain? (Use the wc command with the -l option.)
cat GFFfile | wc -l
  1. To print the current line number with awk, type:
awk '{print NR}' GFFfile
  • To double-check, pipe the output of the previous command to wc -l; the count should match the last line number printed.
  1. Note that awk is a programming language, so a single command can do many things. First, let's print how many fields each line contains:
awk '{print NF}' GFFfile
  1. Now try to combine NF and NR in one command.
Hint

You can include a string in your command, e.g. awk '{print ..., "your string", ...}' input

Answer:
awk '{print "The number of fields in line", NR, "is:", NF}' GFFfile
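To see NR and NF at work without depending on the downloaded file, here is a minimal sketch on a hypothetical two-line sample (the file contents and variable names are assumptions for illustration). It also shows that NR, read in an END block, gives the total line count.

```shell
# Hypothetical sample file with two lines of different widths (illustration only).
sample=$(mktemp)
printf 'chr1\tsrc\tgene\t100\t900\n' > "$sample"
printf 'chr1\tsrc\tCDS\n' >> "$sample"

# NR is the current line number, NF the number of fields on that line.
report=$(awk '{print "Line", NR, "has", NF, "fields"}' "$sample")
echo "$report"

# In the END block, NR holds the number of the last line read, i.e. the line count.
total=$(awk 'END {print NR}' "$sample")
echo "$total"
rm -f "$sample"
```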
  1. awk can also print specific fields. Print fields 1, 3, 4, 5, 7, and 9, then redirect the output to a new file named mGFFfile:
awk '{print $1,$3,$4,$5,$7,$9}' GFFfile > mGFFfile
  1. Inspect your new file with the "more" command.
  • Notice that mGFFfile is no longer tab-separated: print joins fields with a single space by default. To get tab-separated output, change awk's default behaviour by setting the Output Field Separator (OFS) to "\t":
awk 'BEGIN {OFS="\t"} {print $1,$3,$4,$5,$7,$9}' GFFfile > mGFFfile
  • Inspect your new file again.
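OFS controls only the output. On the input side, awk splits on any run of spaces or tabs by default, which can shift the column numbers if a field itself contains a space. The sketch below uses a hypothetical line (not taken from the tutorial's file) to show the effect and how -F'\t' restricts splitting to tabs.

```shell
# Hypothetical GFF-like line whose second column contains a space (illustration only).
sample=$(mktemp)
printf 'chr1\tmy source\tgene\t100\t900\n' > "$sample"

# Default splitting breaks on any run of spaces or tabs, so field 3 is "source".
by_blank=$(awk '{print $3}' "$sample")
echo "$by_blank"

# With -F'\t' the input splits on tabs only, and OFS="\t" keeps the output tab-separated.
tabbed=$(awk -F'\t' 'BEGIN {OFS="\t"} {print $1,$3,$4}' "$sample")
echo "$tabbed"
rm -f "$sample"
```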

+++ Filter lines based on conditions:


  • In this part of the tutorial we will work only on mGFFfile.
  1. Extract only lines which contain CDS;
 awk '/CDS/' mGFFfile 
  • To be more accurate, you can restrict the match to a specific field (column):
awk '$2=="CDS" {print}' mGFFfile
  1. Extract lines that are CDS features or have an end position (column 4) greater than 5000:
awk '$2=="CDS" || $4>5000 {print}' mGFFfile
  1. Repeat the previous step, but this time print only the CDS lines whose end position is greater than 5000:
awk '$2=="CDS" && $4>5000 {print}' mGFFfile
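The difference between || and && is easiest to see on a small input. The sketch below builds a hypothetical four-line file in the same six-column layout as mGFFfile (all values are invented) and prints the attribute column of the lines each condition selects.

```shell
# Hypothetical file in the mGFFfile layout: seqid, type, start, end, strand, attributes.
sample=$(mktemp)
printf 'chr1\tgene\t100\t6000\t+\tID=g1\n'  > "$sample"
printf 'chr1\tCDS\t200\t900\t+\tID=c1\n'   >> "$sample"
printf 'chr1\tCDS\t4000\t5500\t+\tID=c2\n' >> "$sample"
printf 'chr1\texon\t300\t400\t+\tID=e1\n'  >> "$sample"

# || keeps a line if either test is true; && only if both are true.
either=$(awk -F'\t' '$2=="CDS" || $4>5000 {print $6}' "$sample")
both=$(awk -F'\t' '$2=="CDS" && $4>5000 {print $6}' "$sample")
echo "$either"
echo "$both"
rm -f "$sample"
```

Here the || version selects g1, c1, and c2, while the && version selects only c2.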
  1. Print only the unique feature types ($2). Note that uniq only collapses adjacent duplicates, so sort the values first:
awk -F'\t' '{print $2}' mGFFfile | sort | uniq
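Because uniq compares only neighbouring lines, it leaves duplicates in place when equal values are not adjacent. The following sketch, on a hypothetical type column, compares plain uniq with sort -u and with a pure awk alternative that keeps the order of first appearance (the array name seen is arbitrary).

```shell
# Hypothetical type column in which duplicates are not adjacent (illustration only).
sample=$(mktemp)
printf 'chr1\tgene\nchr1\tCDS\nchr1\tgene\nchr1\tCDS\n' > "$sample"

# uniq alone removes nothing here, because no two equal lines are adjacent.
naive=$(awk -F'\t' '{print $2}' "$sample" | uniq | awk 'END {print NR}')

# sort -u sorts first, then keeps one copy of each value.
sorted=$(awk -F'\t' '{print $2}' "$sample" | sort -u)

# Pure awk alternative: print a value only the first time it is seen.
first_seen=$(awk -F'\t' '!seen[$2]++ {print $2}' "$sample")
echo "$naive"; echo "$sorted"; echo "$first_seen"
rm -f "$sample"
```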
  1. Suppose you want to print lines where a field is longer than a certain length (e.g. lines whose second field is longer than 4 characters):
awk 'length($2) > 4' mGFFfile
  1. Repeat the previous step, selecting only lines whose second field is exactly 4 characters long.
Answer:
awk 'length($2) == 4' mGFFfile
  1. Try to print a specific line (Hint: use NR to print the 10th line):
awk 'NR==10' mGFFfile
  1. Calculate and print the average of the values in the third column:
awk '{sum += $3} END {print "Average:", sum/NR}' mGFFfile
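Dividing by NR counts every line, including any comment or header lines, and fails with a division by zero on an empty file. Whether the tutorial's file contains such lines is not stated, so the sketch below simply assumes a hypothetical file with a "##" header and keeps its own counter n for the data lines.

```shell
# Hypothetical file with a comment header, as some GFF files have (an assumption).
sample=$(mktemp)
printf '##gff-version 3\n' > "$sample"
printf 'chr1\tgene\t100\t900\n'  >> "$sample"
printf 'chr1\tCDS\t200\t900\n'   >> "$sample"
printf 'chr1\tCDS\t600\t5500\n'  >> "$sample"

# Count only non-comment lines in n, and divide by n rather than NR.
avg=$(awk -F'\t' '!/^#/ {sum += $3; n++} END {if (n > 0) print "Average:", sum/n}' "$sample")
echo "$avg"
rm -f "$sample"
```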
  1. Calculate the difference between columns 4 and 3:
awk '{diff = $4 - $3; print diff}' mGFFfile
  1. Extend the previous command to calculate the sum of the differences between column 4 and column 3 over all lines in the file, and print the result at the end:
awk '{diff = $4 - $3; sum += diff} END {print "Sum of differences:", sum}' mGFFfile
  • You may also initialize variables explicitly in a BEGIN block, as below (BEGIN{variable1=value1; variable2=value2; ...; variableN=valueN}):
awk 'BEGIN{sum=0;diff=0} {diff = $4 - $3; sum += diff} END {print "Sum:", sum}' mGFFfile
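Explicit initialization is a matter of style here: an awk variable that has never been assigned behaves as 0 in a numeric context. A small sketch on hypothetical start and end values shows that both forms give the same total.

```shell
# Hypothetical start/end pairs in columns 3 and 4 (illustration only).
sample=$(mktemp)
printf 'chr1\tgene\t100\t900\n'   > "$sample"
printf 'chr1\tCDS\t200\t900\n'   >> "$sample"
printf 'chr1\tCDS\t4000\t5500\n' >> "$sample"

# Uninitialized awk variables start as 0 in numeric context, so BEGIN is optional.
implicit=$(awk -F'\t' '{sum += $4 - $3} END {print sum}' "$sample")
explicit=$(awk -F'\t' 'BEGIN {sum = 0} {sum += $4 - $3} END {print sum}' "$sample")
echo "$implicit" "$explicit"
rm -f "$sample"
```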
  1. Furthermore, awk lets you change and append specific fields or patterns. Let us:
  • Assign "Human" to field 7. mGFFfile has only six columns, so awk appends the value as a new field on every line:
awk '{$7 = "Human"} 1' mGFFfile
  • Change only an existing field:
awk '{$2 = "Human"} 1' GFFfile
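A side effect worth knowing: as soon as any field is assigned, awk rebuilds the whole line using OFS, so a tab-separated line comes out space-separated unless OFS is set. The sketch below demonstrates this on a hypothetical line in the mGFFfile layout.

```shell
# Hypothetical tab-separated line in the mGFFfile layout (illustration only).
line=$(printf 'chr1\tCDS\t200\t900\t+\tID=c1')

# Assigning to any field rebuilds the line with OFS, a single space by default.
spaced=$(printf '%s\n' "$line" | awk -F'\t' '{$7 = "Human"} 1')
echo "$spaced"

# Setting OFS to a tab keeps the rebuilt line tab-separated.
tabbed=$(printf '%s\n' "$line" | awk -F'\t' 'BEGIN {OFS="\t"} {$7 = "Human"} 1')
echo "$tabbed"
```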
  • Substitute "." with "Human" in column 11 of GFFfile:
awk '{gsub(".", "Human", $11)} 1' GFFfile
Note:

=> Note 1: gsub is a built-in awk function that stands for "global substitution". It searches the target string for a pattern and replaces every occurrence of that pattern with the replacement text:
  • "." is the regexp, the regular expression pattern to search for within the target string. An unescaped dot matches any single character, not only a literal ".".
  • "Human" is the replacement, the text that replaces each match.
  • $11 is the target, the variable or field in which the substitution is performed.
  • 1 is a common awk idiom that means "print the (possibly modified) line".

=> Note 2: Nothing changes here, because GFFfile has no field 11.

  • Repeat the previous step with field 2 as the target. What do you notice?
awk '{gsub(".", "Human", $2)} 1' GFFfile
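The behaviour comes from regular expression syntax: an unescaped dot matches any character. As a minimal sketch on a hypothetical space-separated line, the first command below replaces every character of field 2, while the second escapes the dot so that only a literal "." is replaced.

```shell
# Hypothetical space-separated line with a "." placeholder in field 3 (illustration only).
line='chr1 CDS . +'

# As a regex, an unescaped "." matches any character, so every character of field 2 is replaced.
any_char=$(echo "$line" | awk '{gsub(".", "X", $2)} 1')
echo "$any_char"

# Escaping the dot (/\./) matches only a literal ".".
literal=$(echo "$line" | awk '{gsub(/\./, "Human", $3)} 1')
echo "$literal"
```

When a field must equal "." exactly, a plain comparison such as $3 == "." {$3 = "Human"} is another option that avoids regular expressions altogether.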
  • Try to substitute "CDS" with "test" only in the second field of mGFFfile. Do not forget to set OFS.
Answer:
awk 'BEGIN{OFS="\t"} {gsub("CDS", "test", $2)} 1' mGFFfile
Congratulations! You've covered most of the fundamentals of awk. Get ready for more advanced topics in the upcoming tutorials.