
Quick text file merging and data preparation

In natural language processing, you often have to re-format your data before feeding it into different systems. In many such cases, these simple Linux commands let you do the job much quicker without having to write a script.

1. Merging two files into one file with two columns

Input f1 looks like this:
1
2
3
4
Input f2 looks like this:
a
b
c
d
Output f3 will look like this:
1  a
2  b
3  c
4  d
Command: paste f1 f2 > f3 
By default, the delimiter is a tab. You can also specify it (for example, a comma) as follows:
paste -d ',' f1 f2 > f3
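paste can also take more than two input files (one column per file), and with -s it merges all lines of a single file into one line. Two variations on the example above (g is a hypothetical third input file; the rest are the same files):
paste f1 f2 g > out        # any number of input files, one column each
paste -s -d ',' f1         # turn the lines of f1 into one comma-separated line: 1,2,3,4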

2. Adding a line number to each line of a text file

Assume that you want to add an index to each line of a text file, i.e., insert a line number followed by a tab before the content of each line:
Input f1:
a
b
c
d
Output f2:
1  a
2  b
3  c
4  d
Command: nl f1 > f2
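Note that by default nl skips empty lines and pads the line numbers (right-justified) to width 6. To number every line, including blank ones, with a minimal-width number followed by a tab, you can use the standard nl options below, or do the same thing with awk (a sketch; f1 and f2 are the example files above):
nl -ba -w1 f1 > f2                  # -ba numbers all lines, -w1 uses minimal width; the default separator is already a tab
awk '{print NR "\t" $0}' f1 > f2    # awk equivalent: NR is the current line number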

3. Joining two files on a common field

Input f1:
1  aaa
2  bbb
3  ccc
4  ddd
Input f2:
1  a
2  b
3  c
4  d
Output f3 (joining on the first field):
1 aaa a
2 bbb b
3 ccc c
4 ddd d
Command: join f1 f2 > f3
We can also use join on other fields/columns via the -1 and -2 options (the inputs must be sorted on the join field first). See the join man page (man join) for further details.
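For example, to join on the second field of f1 against the first field of f2, sorting each input on its join field first (a sketch assuming bash, for its <(...) process substitution):
join -1 2 -2 1 <(sort -k2,2 f1) <(sort -k1,1 f2) > f3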

Comments

  1. To get certain columns (e.g., columns 3 and 5) from a text file data.txt, one can use the "cut" command as follows:
    cut -d' ' -f3,5 < data.txt
    This is usually much faster than looping over the file in a shell script.
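    One thing cut cannot do is reorder fields (cut -f5,3 still prints column 3 first). When you need a different field order, awk is a common alternative (a sketch; data.txt and the column numbers are just the example above):
    awk '{print $5 "\t" $3}' data.txt    # print column 5, then column 3, tab-separated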

