Bash Commands for NLP Engineers

Since using Bash commands is inevitable if you work on NLP and MT tasks, I thought it would be useful to list the commands I have learnt to use on a daily basis, thanks to practice, searching, and helpful colleagues I have met over the years. Obviously, this is not an exhaustive list; however, I hope it includes most of the one-line Bash commands you would need. Please note that most of these commands have been tested mainly on Linux.

Table of Contents

File Management
Reading Files
Nano Editor Commands
Finding
Downloading
Compressing and Extracting
Server-related Bash Commands
Other Useful Packages

File Management

Open a directory:

cd <path/dir_name>

List the files and sub-directories in the current directory:

ls

Create a new directory:

mkdir <dir_name>

Rename or move a file or directory:

mv <old_filename> <new_filename>

Move a file to a directory:

mv <old_filename> <dir_name>

Move all files whose names start with a given string, using *:

mv <old_filename>* <folder_name>

Rename multiple files:

rename 's/<original_string>/<new_string>/g' *

Delete a file:

rm <file_name>

To delete multiple files, just add them after the rm command separated by spaces:

rm <file_name1> <file_name2> <file_name3>

Delete any file that starts with “wow”, using *:

rm wow*

Delete a directory and its contents:

rm -r <dir_name>

Avoid deleting files by mistake by using trash instead of rm. First install trash-cli:

sudo apt-get install trash-cli
• Delete:
trash <file_name>
• List trashed items:
trash-list
• Restore a file (first cd to the root folder or a specific folder):
trash-restore (restore-trash in older versions of trash-cli) and then type a number.
• Empty the trash list:
trash-empty

Copy a file:

cp <original_filename> <new_filename>

Copy a directory and its contained files (at least -r is required):

cp -avr <original_dirname> <new_dirname>

Copy and show a progress bar (good for large files)

rsync -ah --progress <source> <destination>

Complete a command or file name (e.g. my_file_name.txt):

Type my and then press Tab once to complete the name, if no other file starts with “my”.
OR
Type my and then press Tab twice to list all the files that start with “my”.

Move to a location in a command or text: Move the cursor to the location, press Alt or Option, and click.

Clear the current window:

Type clear
OR
Press Ctrl+l

End the current command (before it finishes):

Press Ctrl+c

Move to the last accessed path:

cd -

List your previous commands:

history

Search your command history:

Ctrl+r

List the *.txt files in the current directory (or path):

ls *.txt

Show the files in all folders whose names start with “aaa”:

ls aaa*

Show files and subdirectories in all directories in the current directory:

ls *

List all the files with details:

ls -l

Display file details:

ls -l <file_name>

List all the files with details, with sizes in human-readable units (KB/MB/GB):

ls -lh

List all the files with details and human-readable sizes, sorted by modification time, newest first:

ls -lht
ls -lht <dir_name1>/*/<dir_name2>

List all the files with details and human-readable sizes, sorted by modification time, oldest first:

ls -lhtr

List file sizes only for all files in the current directory:

ls -hs
OR
du

Display the file size only:

ls -hs <file_name>
OR
du -h <file_name>
OR, for a single total covering a directory and everything in it:
du -hs <dir_name>

Display the last modified file:

ls -t | head -1

Display the sizes of the subdirectories of the current directory:

du -d 1 -h .

Sort the results in ascending order:

du -d 1 -h . | sort -h

Sort the results in descending order:

du -d 1 -h . | sort -h -r

Find files that are bigger than 200MB:

find /home/$USER/ -type f -size +200000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

Display file size with stat (Linux):

stat --printf="%s" <file_name>

Display file last edited time (Linux):

stat -c %y <file_name>

Display file last edited time (Mac):

stat -x <file_name>

Get the current path (print working directory):

pwd

Create a symbolic link, i.e. a shortcut to a file or directory:

ln -s <file_name> <shortcut_name>
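As a small illustration of the symbolic-link command above, the sketch below creates a link in a throwaway directory and reads through it. The names data.txt and latest.txt are made up for the example.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

echo "hello" > data.txt          # hypothetical target file
ln -s data.txt latest.txt        # latest.txt now points to data.txt

ls -l latest.txt                 # the listing ends with: latest.txt -> data.txt
readlink latest.txt              # prints the stored link target: data.txt
cat latest.txt                   # reads through the link: hello
```

Deleting latest.txt with rm removes only the link; data.txt stays in place.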

Get the path of a file:

readlink -f <file_name>
OR
echo "$(pwd)/<file_name>"
OR
realpath <file_name>

Get word count in a file:

wc <file_name>

Get the number of lines in a file:

wc -l <file_name>
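For parallel corpora, one common use of wc -l is checking that the source and target files have the same number of lines. A minimal sketch, assuming two made-up files named train.en and train.ar:

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny hypothetical files standing in for a parallel corpus.
printf 'Hello\nThank you\nGoodbye\n' > train.en
printf 'مرحبا\nشكرا\nوداعا\n'        > train.ar

# Reading from stdin (<) makes wc print the number only, without the file name.
src_lines=$(wc -l < train.en)
tgt_lines=$(wc -l < train.ar)

if [ "$src_lines" -eq "$tgt_lines" ]; then
    echo "OK: both files have $src_lines lines"
else
    echo "Mismatch: $src_lines source lines vs $tgt_lines target lines"
fi
```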

Count lines of all files in subdirectories; use * if the file name is partial:

find ./ -type f -name "<file_name>" -exec wc -l {} +

Count lines in a *.gz file; use -c to avoid writing the uncompressed file to disk:

gunzip -c <file_name.gz> | wc -l
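To see that -c really leaves the archive alone, here is a short sketch that builds a small compressed file and counts its lines. The file name corpus.txt.gz is an assumption for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a small compressed file with 100 lines (hypothetical name).
seq 1 100 | gzip > corpus.txt.gz

# -c streams the decompressed text to stdout; corpus.txt.gz is left untouched
# and no uncompressed copy is written to disk.
line_count=$(gunzip -c corpus.txt.gz | wc -l)
echo "$line_count"

# Peek at the first lines the same way.
gunzip -c corpus.txt.gz | head -n 3
```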

Split a file into multiple files, 3000 lines each, with numeric-suffixes:

split -a 4 -d -l 3000 --additional-suffix=<extension> <file_name> <prefix>
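A worked example may make the split options easier to read. This sketch assumes GNU split (as shipped on most Linux distributions) and uses made-up names: corpus.txt for the input and part. for the prefix.

```shell
workdir=$(mktemp -d)
cd "$workdir"

seq 1 7000 > corpus.txt          # 7,000-line stand-in for a real corpus

# -l 3000: lines per chunk; -d: numeric suffixes; -a 4: four-digit suffixes.
# --additional-suffix is a GNU coreutils option and may be missing elsewhere.
split -a 4 -d -l 3000 --additional-suffix=.txt corpus.txt part.

wc -l part.*.txt
# part.0000.txt and part.0001.txt hold 3000 lines each;
# part.0002.txt holds the remaining 1000.
```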

Find out if two files are identical:

cmp --silent <file_name1> <file_name2> || echo "Files are different."
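cmp exits with status 0 when the two files are identical and with a non-zero status otherwise, so it can drive an if statement. A minimal sketch with throwaway files; same_or_different is a hypothetical helper name, and -s is the short form of --silent:

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf 'a\nb\n' > first.txt
printf 'a\nb\n' > second.txt
printf 'a\nc\n' > third.txt

# Hypothetical helper: report whether two files have identical contents.
same_or_different() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

same_or_different first.txt second.txt   # identical
same_or_different first.txt third.txt    # different
```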

Find out the difference between two files:

diff <file_name1> <file_name2>

Find lines that are in file1.txt but not in file2.txt:

comm -23 <(sort file1.txt) <(sort file2.txt) > different.txt

Find common lines in both file1.txt and file2.txt:

comm -12 <(sort file1.txt) <(sort file2.txt) > common.txt
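The comm column flags are easier to remember with a tiny example. comm prints three columns (lines only in the first file, lines only in the second, lines in both), and each of -1, -2, -3 suppresses one of them. The sketch below sorts into temporary files instead of using <(sort ...), which has the same effect; the file contents are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf 'banana\napple\ncherry\n' > file1.txt
printf 'cherry\ndate\napple\n'   > file2.txt

# comm expects sorted input, so sort both files first.
sort file1.txt > file1.sorted
sort file2.txt > file2.sorted

comm -23 file1.sorted file2.sorted > only_in_file1.txt   # banana
comm -13 file1.sorted file2.sorted > only_in_file2.txt   # date
comm -12 file1.sorted file2.sorted > common.txt          # apple, cherry
```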

Continue a long command on a new line by ending the current line with a backslash:

\
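For instance, a pipeline can be spread over several lines like this. The words passed to printf are arbitrary sample data.

```shell
# The backslash must be the very last character on the line (no trailing spaces).
result=$(printf '%s\n' one two three \
    | sort -r \
    | head -n 2)

echo "$result"
```

Bash reads the three lines as one command, exactly as if it had been typed on a single line.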

Reading Files

Read the whole file:

cat <file_name>

Read the whole file; display line numbers:

cat -n <file_name>

Read the first 10 lines of a file:

head <file_name>

Read the first 4 lines of a file:

head -4 <file_name> OR
head -n 4 <file_name>

Read the first 3 lines of two files:

head -q -n 3 <file_name1> <file_name2>

Read the last 10 lines of a file:

tail <file_name>

Read the last 3 lines of a file:

tail -3 <file_name> OR
tail -n 3 <file_name>

Read a specific line of a file, e.g. line #10:

sed -n 10p <file_name>
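The same sed pattern extends to line ranges. A short sketch on a generated file (numbers.txt is a made-up name) where line N simply contains the number N:

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 100 > numbers.txt              # line N contains the number N

sed -n '10p' numbers.txt             # line 10 only
sed -n '10,12p' numbers.txt          # lines 10 to 12
sed -n '10{p;q;}' numbers.txt        # line 10, then stop reading the file
```

The last form quits right after printing, which can save time on very large files.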

Read the end of the file and use -f to update the output:

tail -f <file_name>
Use Ctrl+c to exit.

Read a file in chunks:

less <file_name>

Press Space to move to the next chunk of the file (Enter moves one line at a time), and “q” to quit.

Read a file in chunks, display line numbers:

less -N <file_name>

Disable sending to stdout (i.e. printing in Terminal) by adding 1> /dev/null

cat <file_name1> <file_name2> | tee <output_file_name> 1> /dev/null

Processing

Merge two files, use > to create the output file:

cat <file_name1> <file_name2> > <output_file>

Merge all the files that end with (say “.en”) into one file (e.g. “all.en”):

cat *.en > all.en

Merge all the files in the current folder:

cat * > <output_file_name>

Merge the source text and target translation into one tab-delimited file

paste -d "\t" all.en all.ar > all.enar

Remove duplicates from a file (uniq keeps one copy of each line; uniq -u would instead drop every line that occurs more than once):

sort -S 95% --parallel=8 all.enar | uniq > all.unique.enar

Shuffle

shuf all.unique.enar > all.unique.shuf.enar

Split a tab-delimited file back into separate source and target files:

cut -f 1 all.unique.shuf.enar > all.unique.en
cut -f 2 all.unique.shuf.enar > all.unique.ar
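The merge, deduplicate, shuffle, and split steps above can be tried end to end on a toy corpus. This is only a sketch: the sentences are made up, sort -u is used here as a shorthand that keeps one copy of each line, and shuf assumes GNU coreutils.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Tiny hypothetical corpus; the first and third segment pairs are duplicates.
printf 'Hello\nThank you\nHello\nGoodbye\n' > all.en
printf 'مرحبا\nشكرا\nمرحبا\nوداعا\n'        > all.ar

paste all.en all.ar > all.enar                  # tab is the default delimiter of paste
sort -u all.enar > all.unique.enar              # keep one copy of each line pair
shuf all.unique.enar > all.unique.shuf.enar     # random order, pairs stay on one line
cut -f 1 all.unique.shuf.enar > all.unique.en
cut -f 2 all.unique.shuf.enar > all.unique.ar

wc -l all.unique.en all.unique.ar               # 3 lines each
```

Because shuffling happens while source and target share a line, the two output files stay aligned.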

Replace “abc” with “XYZ” in a file

sed -i -e 's/abc/XYZ/g' /tmp/file.txt
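Since -i overwrites the file, it can be worth keeping a backup while testing a pattern. A minimal sketch, assuming a throwaway file.txt; attaching a suffix to -i makes sed save the original under that suffix.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'abc def\nabc abc\n' > file.txt

# -i.bak edits file.txt in place and keeps the original as file.txt.bak.
sed -i.bak -e 's/abc/XYZ/g' file.txt

cat file.txt        # XYZ def / XYZ XYZ
cat file.txt.bak    # abc def / abc abc (the unchanged original)
```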

Nano Editor Commands

Create a new file:

nano <new_file_name>

Open an existing file:

nano <file_name>

Open multiple files:

nano <file_name1> <file_name2>

Search the current file:

Ctrl+w

Move to the end of the file:

Ctrl+w and then Ctrl+v

Move to the end of the line:

Ctrl+e

Move to the start of the line:

Ctrl+a

Delete the current line:

Ctrl+k

Move a page down:

Ctrl+v

Move a page up:

Ctrl+y

Cut the current line:

Ctrl+k

Mark text:

Ctrl+Shift+6 (i.e. it is Ctrl+^) and then move in the direction you need.

Cut the marked text:

Ctrl+k

Paste the cut text:

Ctrl+u

Note that to be able to paste across multiple files, both files must be open first: open the two files, copy/cut from the first file, close it, and then paste into the second file.

Close the current file:

Ctrl+x

You will be prompted if you want to save; type “y” for yes and “n” for no. If you select to save, just press Enter to keep the current file name. You can also move between two open files as in the next command.

Move between two open files:

Alt+. to move forward one file.
Alt+, to move backward one file.

Note that if you are on Mac, Option+. and Option+, are used to insert ≥≤ symbols, so you need to first press Alt+Command+O to change the behaviour of Option in Terminal.

Finding

Find files that include a phrase (e.g. “really great” in *.txt files):

grep "really great" *.txt

Search sub-directories recursively using grep:

grep -r <word_to_search> * OR
grep -R <word_to_search> *

Use regular expressions with grep, e.g. find lines whose only content is ‘nan’:

grep '^nan$' <file_name>
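A few variations on the anchored pattern, shown on a made-up scores.txt file. Quoting the pattern keeps the shell from interpreting characters such as $ and *.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'nan\nnanometre\n0.5\nnot nan\nnan\n' > scores.txt

grep -n '^nan$' scores.txt     # matching lines with line numbers: 1:nan and 5:nan
grep -c '^nan$' scores.txt     # number of matching lines: 2
grep -x 'nan' scores.txt       # -x matches whole lines, so ^ and $ are not needed
grep -v '^nan$' scores.txt     # every line except the matching ones
```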

Find a file on the machine by name:

sudo find / -name <file_name>

Find all files in the directory and subdirectories that end with .en:

find "$PWD" -type f | grep '\.en$'

Find all files in the directory and subdirectories whose path includes ‘aaa’ (grep takes a regular expression, not a wildcard, so no * is needed):

find "$PWD" -type f | grep "aaa"

Find files in the current directory whose name includes “wonderful”:

ls | grep "wonderful"

If you have a very long list generated by ls and want to display it page by page:

ls | less

List files whose names include a range of numbers:

ls model.0{1..3}*

List files whose names include different letters:

ls model.{a,b,c,d}

Move multiple files (or run any command on multiple files):

Put the parts of the names that differ inside { }, separated by commas.
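Brace expansion is a Bash feature (plain sh may not support it), and Bash expands the braces before the command runs, so echo is a safe way to preview the result. The sketch below uses made-up checkpoint names (model.a.pt and so on) and a hypothetical checkpoints directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Preview what the braces expand to before running a real command.
echo model.{a,b,c}.pt            # model.a.pt model.b.pt model.c.pt
echo model.0{1..3}.pt            # model.01.pt model.02.pt model.03.pt

# Create three hypothetical checkpoint files, then move them in one command.
touch model.{a,b,c}.pt
mkdir checkpoints
mv model.{a,b,c}.pt checkpoints/
ls checkpoints
```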

Find installed Python3 packages:

pip3 freeze

Find installed Python3 packages that start with “tensor”, use -i to ignore case:

pip3 freeze | grep -i tensor

Find the location of a command (e.g. python3):

which python3

Downloading

Download a file using curl:

curl <http://some.url> --output <file_name>

If this is the first time you use curl, you might get a message like “Command ‘curl’ not found, but can be installed with:”

sudo apt install curl

Download a file that requires cookies:

curl --cookie <cookies.txt> <http://some.url> --output <file_name>

To get the “cookies.txt” file, you can use a Chrome extension like “cookies.txt” to export cookies into a TXT file.

Copy GitHub repository to the machine:

git clone https://github.com/USERNAME/REPOSITORYNAME

Update a downloaded GitHub repository:

cd <repository_dir_name>
git pull
git checkout master

Stage, commit, and push changes to a GitHub repository:

git add <file_name>
git commit -m "Message, e.g. Update file"
git push origin main

The default branch is usually called “master” or “main” – if it is not, replace it with the right name.

Compressing and Extracting

Extract a *.zip file:

unzip <file_name>

Create a zip archive from file(s):

zip <archive_filename> <file_list>

Create a zip archive from a directory with high level of compression:

zip -r -9 <archive_filename.zip> <dir_name>

Extract a *.gz file:

gunzip <file_name.gz>

Compress all the files separately as file_name.gz

cd <dir_name>
gzip *

Compress all the files in the same directory even if there are subdirectories:

cd <dir_name>
gzip -r .

Extract a *.tar.gz file:

tar xzvf <file_name.tar.gz>

Extract a *.tgz file:

tar xzvf <file_name.tgz>

Extract in a different directory:

tar xzvf <file_name.tgz> -C </path/dir_name>
OR
gunzip -c <file_name.tgz> | tar xvf - -C </path/dir_name>

Create a *.tar.gz archive from a directory:

tar -czvf archive.tar.gz <dir_name>

Create a *.tar.gz archive from multiple files/directories:

tar -czvf <archive_file_name.tar.gz> <file_name1> <file_name2>
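A quick round trip shows how the create, list, and extract flags fit together. The directory and file names below (corpus, train.en, train.fr, restored) are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

mkdir corpus
echo "Hello"   > corpus/train.en
echo "Bonjour" > corpus/train.fr

tar -czf corpus.tar.gz corpus          # c: create, z: gzip, f: archive file name
tar -tzf corpus.tar.gz                 # t: list the contents without extracting

mkdir restored
tar -xzf corpus.tar.gz -C restored     # x: extract, -C: into this directory
cat restored/corpus/train.en           # Hello
```

Adding v to any of these prints each file name as it is processed.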

Compress as *.tar.bz2 (higher compression):

tar -jcvf <archive_name.tar.bz2> <file_dir_name>

Extract a *.tar.bz2 archive:

tar -jxvf <archive_name.tar.bz2>

Compress all the files separately as file_name.bz2

bzip2 *

Extract file_name.bz2 (without tar)

bzip2 -d <file_name.bz2>

Server-related Bash Commands

Obviously, many of these commands can be used locally, but they are most useful while working on servers.

Find out the server date and time:

date

Measure time taken to run a script or command:

time <python3 script.py>

Find out the free space on the disk:

df -h

Create an alias for a command:

alias <alias_name>="<command>"

To save aliases, put this in ~/.bash_aliases

nano ~/.bash_aliases

For example, you can add this command to the ~/.bash_aliases file, use quotes for multi-word commands:

alias frz="pip3 freeze"

For the alias change to take effect:

source ~/.bash_aliases
OR
exec bash

The next time you type frz in the Terminal, it will run the command pip3 freeze.

Repeat the same command at regular intervals (every 2 seconds by default):

watch <command>

Avoid ending a command if the local Terminal is closed:

screen

Create a new screen with a name:

screen -S <name>

Create a new screen with logging enabled; screenlog.0 is created:

screen -L -S <name>

Detach the current screen:

Press Ctrl+a, then d

Resume a single screen:

screen -r

Resume a screen when multiple screens are running:

screen -r <name>
OR
screen -r <id>

List the currently running screens:

screen -list
OR
screen -ls

End a screen:

screen -X -S <id> quit
or resume the screen and then
Ctrl+a then k then y

Shut down the machine after a command finishes (separate the two commands with ;):

python3 file.py; sudo shutdown
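With ; the second command runs whether or not the first one succeeded, while && runs it only after a success. The sketch below illustrates the difference, using false and true as stand-ins for a failing and a succeeding job, and echo in place of the shutdown command so that it is safe to run.

```shell
# false stands in for a job that fails, true for one that succeeds.
after_semicolon=$(false ; echo "ran")        # "ran": ; ignores the failure
after_and_failure=$(false && echo "ran")     # empty: && stops after a failure
after_and_success=$(true && echo "ran")      # "ran": && continues after a success

echo "after ';' following a failure:  '$after_semicolon'"
echo "after '&&' following a failure: '$after_and_failure'"
echo "after '&&' following a success: '$after_and_success'"
```

So python3 file.py; sudo shutdown powers the machine off even if the script crashes, which is often what you want on a paid server.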

Adjust File permissions, access by the current user only:

chmod 700 <file_name>

For example, this is required before using the *.pem key file provided by AWS EC2.
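Each digit of the mode is a sum of 4 (read), 2 (write), and 1 (execute), for the owner, the group, and others in that order, so 700 gives the owner everything and everyone else nothing. A small sketch on a throwaway file; key.pem is only a stand-in name, and stat -c assumes GNU stat on Linux.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch key.pem                 # hypothetical stand-in for a real key file

chmod 700 key.pem             # owner: read, write, execute; group and others: nothing
ls -l key.pem                 # the permissions column shows -rwx------
stat -c %a key.pem            # prints the octal mode: 700
```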

Display RAM used:

free -m

Display GPU memory used:

nvidia-smi

Find the CUDA version:

nvcc --version

Run a command continuously (optionally use -n for the interval in seconds, and -d to highlight changes):

watch -n <seconds> -d <command>

Check kernel termination errors (use one of these commands)

dmesg
OR
nano /var/log/kern.log

Check currently running processes - use grep if you are looking for a specific type of processes:

ps -ef | grep python3

Copy a file from the local machine to a server (run it from the local machine; add -P <port> if SSH uses a non-default port):

scp <file_name> <user>@<server_ip>:/<dir_name>

Copy a directory from the local machine to a server; use -r (run it from the local machine):

scp -r <dir_name> <user>@<server_ip>:/<dir_name>

Copy a file from the local machine to an AWS EC2 server (run it from the local machine):

scp -i <key.pem> <file_name> ubuntu@ec2[…].compute.amazonaws.com:~/<dir_name>

Copy a file from a server to the local machine (run it from the local machine):

scp <user>@<server_ip>:/<dir_name>/<file_name> </path/on/the/local/machine>

Copy a file from Google Cloud to the local machine:

gcloud compute scp --project <project_name> --recurse <user_name>@machine_name:~/<dir_name>/<file_name> </path/on/the/local/machine>

Log out of the current connection (and similar scenarios):

Ctrl+d

Other Useful Packages

Among useful packages that you might want to install yourself are:

curl or wget for downloading files, or aria2c for faster download
trash-cli for trashing unwanted files into a folder instead of using the rm command
tree for displaying the directory structure
htop for monitoring CPU resources
locate for quickly finding files by name after updatedb
ack for searching files like grep, but faster
parallel for running jobs in parallel from Bash
s3cmd for uploading and downloading files between AWS S3 buckets and non-AWS servers. For AWS EC2 servers, use the aws s3 command instead.

Yasmin Moslem