Tuto4: Awk command.md

October 1, 2023 · View on GitHub

Pracatical for Awk command

By the end of this tutorial you will be able to :
- Improve text processing skills.
- Extract & Transform your data.
- Clean & validate your data.

+++ Print-Command examples :

Download the gff file exmaple from this link above in your working directory Dir3.
- Check your path, use pwd command.
- Hint : use wget command to download the file.
- link : https://github.com/Zemzemfiras1/Mastering_Linux_Tutorials/blob/master/Tutorials/GFFfile :: source
To print the content of your file please type this command;

 cat GFFgile

Print out the content of the GFFfile by "Aho, Weinberger, and Kernighan" command;

awk '{print}' GFFfile

Repeat question 3 and substitute "{print}" by " '//' ", is there any difference ?

awk '//' mGFFfile

Repeat question 3 and substitute "print" by "print $0", is there any difference ?

awk '{print \$0}' GFFfile

In case you wanted to print a ormatted output, you may type;

awk '{printf \$0}' GFFfile

How many lines contain the GFFfile. ( use wc command with the option -l );

 cat GFFgile | wc -l

To print the current line number using awk just type;

awk '{print NR}' GFFfile

If you are confused and you wanted to check again, you could pipe the previous command's output to wc -l command.

Note that awk is a programming language, so you can use implement many options using a single command, First lets print how many field contains each line;

awk '{print NF}' GFFfile

Now Try to append NF and NR in one command ;

Hint

You could write a string in your command exp awk '{ ... ,"your string", ...}' input

Answer:

 awk '{print "The number of field in line",NR," is : ", NF}' GFFfile

Supposing that You wanted to print out specific fields, awk offer you this ability, try to print field number 1,3,4,5,7,9 then redirect the output to a modified newfile named mGFFfile;

awk  '{print \$1,\$3,\$4,\$5,\$7,\$9}' GFFfile > mGFFfile

Inspect your new file use " more " command.

See that mFFfile bacame non a tab separated file so to get your output in "\t" format, change awk’s default behaviour by using the "Output Field Separator(OFS)"

awk  'BEGIN {OFS="\t"} {print \$1,\$3,\$4,\$5,\$7,\$9}' GFFfile > mGFFfile

Inspect again your new file.

+++ Filter lines based on conditions:

In this part of the TUTO we will work only on mGFFfile.

Extract only lines which contain CDS;

 awk '/CDS/' mGFFfile

To be more acurate you can specify the field(column);

awk '\$2=="CDS" {print}' mGFFfile

Extract lines which contain CDS or data with End position greater 5000 ( column 4 ) ;

for more information about gff files : please visit ; http://www.ensembl.org/info/website/upload/gff3.html

awk '\$2=="CDS" || \$3>5000 {print}' mGFFfile

Repeat question 13 then print out both lines with CDS which contain end position > 5000;

awk '\$2=="CDS" && \$3>5000 {print}' mGFFfile

Print only unique type of feature ( $2 );

awk -F'\t' '{print \$2}' mGFFfile | uniq

Assuming that you wanted to print lines longer than a certain length (exp : print out words longer than 4 in the second field);

 awk 'length(\$2) > 4' mGFFfile

Repeat Q16, try to select only words with length equal to 4;

Answer:

awk 'length(\$2)== 4' mGFFfile

Try Print a specific line (Hint: Use NR to print the 10th line);

awk 'NR==10' mGFFfile

Calculate and print the average of values in the third column;

awk '{sum += \$3} END {print "Average:", sum/NR}' mGFFfile

Calculate the diference between column 4 and 3 ;

awk '{diff = \$4 - \$3; print diff}' mGFFfile

Perform the command in Q21 to calculate the sum of the differences between column 4 and column 3 for all lines in the file and print the result at the end, you can modify the Awk command as follows;

 awk ' {diff = \$4 - \$3; sum += diff} END {print "Sum of difference: ", sum}' mGFFfile

You may also define initial variables as bellow ( add BEGIN{variable1;variable2;...;VariableN};

awk 'BEGIN{sum=0;diff=0} {diff = \$4 - \$3; sum += diff} END {print "Sum:", sum}' mGFFfile

Furthemore, Awk command offer you the ability to change and append specific field or patterns, so Let us;

change the non-existing field 7 from "." to "human" and append the substitution to lines which do not have this field :

awk '{\$7 = "Human"} 1' mGFFfile

Only change existing field;

awk '{\$2 = "HumaN"} 1' GFFfile

Substitute "." by "Human" in column 11) in GFFfile;

awk '{gsub(".", "Human", \$11)} 1 ' GFFfile

Note :

=> Note1 : that gsub is a built-in function in Awk that stands for "global substitution." It is used to search for a pattern within each line of input text and replace all occurrences of that pattern with a specified replacement text: * "." : Is the :regexp: regular expression pattern you want to search for within the target string. * "Human" : IS the :replacement: the text that you want to replace the matched pattern with. * \$1 : IS the :target: the variable or field where you want to perform the substitution. * 1 :IS a common Awk idiom that means to print the modified line.

=> Note2 : Nothing here to be changer as the reason there is no field 11 to change.

Repeat the previous step with a target field equal to 2 ; what do you notice ?

awk '{gsub(".", "Human", \$2)} 1 ' GFFfile

Try to substitute "CDS" with "test" only in the second field in GFFfile; Do not forget to use OFS.

Answer :

awk 'BEGIN{OFS="\t"} {gsub("CDS", "test", \$2)} 1' mGFFfile

Congratulations! You've covered almost the fundamentals basics of Awk commands. Get ready for more advanced topics in the upcoming tutorials.