
Quick text file merging and data preparation

In natural language processing, you often have to reformat your data so that it can be used as input to different systems. In these cases, a few simple Linux commands can do the job much faster than writing a script.

1. Merging two files into one file with two columns

Input f1 looks like this:
1
2
3
4
Input f2 looks like this:
a
b
c
d
Output f3 will look like this:
1  a
2  b
3  c
4  d
Command: paste f1 f2 > f3 
The delimiter by default is a tab. You can also define it (for example, separated by a comma) as follows:
paste -d ',' f1 f2 > f3
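
To try both variants without touching existing files, the sketch below recreates the sample inputs in a scratch directory. The output names f3_tab and f3_comma are only illustrative and are not part of the original commands.

```shell
# Work in a scratch directory so no existing f1/f2/f3 is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the two sample inputs shown above.
printf '1\n2\n3\n4\n' > f1
printf 'a\nb\nc\nd\n' > f2

# Default delimiter: a tab between the two columns.
paste f1 f2 > f3_tab

# Explicit delimiter: a comma between the two columns.
paste -d ',' f1 f2 > f3_comma

cat f3_comma
```

With the inputs above, f3_comma contains the lines 1,a through 4,d.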

2. Adding a line number to each line of a text file

Assume that you want to index each line of a text file, i.e. insert a line number and a tab before the content of each line:
Input f1:
a
b
c
d
Output f2:
1  a
2  b
3  c
4  d
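Command: nl f1 > f2
Note that nl pads each number with leading spaces and, by default, does not number empty lines. To number every line, use: nl -b a f1 > f2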
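
If a downstream tool expects exactly "number, tab, content" with no padding, awk is one possible alternative to nl. The sketch below is only an illustration on the sample input; the output names f2_nl and f2_awk are made up for the comparison.

```shell
# Work in a scratch directory with the sample input from above.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\nd\n' > f1

# nl right-aligns each number in a space-padded field before the tab.
nl f1 > f2_nl

# awk prints the bare line number (NR), a tab, then the line itself ($0).
# Unlike nl's default mode, it also numbers empty lines.
awk '{ print NR "\t" $0 }' f1 > f2_awk

cat f2_awk
```

Apart from the leading spaces in the nl output, the two files are identical for this input.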

3. Joining two files with a common field

Input f1:
1  aaa
2  bbb
3  ccc
4  ddd
Input f2:
1  a
2  b
3  c
4  d
Output f3 (joining on the first field):
1 aaa a
2 bbb b
3 ccc c
4 ddd d
Command: join f1 f2 > f3
The join command can also match on other fields (the -1 and -2 options select the join field in the first and second file), but both files must be sorted on the join field first. Further options are described in the join manual page (man join).
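
As a sketch of joining on different columns, assume hypothetical inputs in which the key is the second field of f1 and the first field of f2, and neither file is sorted. The file contents and the .sorted names below are invented for this example.

```shell
# Use one locale for both sort and join so they agree on the ordering.
export LC_ALL=C

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical unsorted inputs: key is field 2 in f1 and field 1 in f2.
printf 'ccc 3\naaa 1\nbbb 2\n' > f1
printf '2 b\n3 c\n1 a\n' > f2

# Sort each file on its own join field.
sort -k2,2 f1 > f1.sorted
sort -k1,1 f2 > f2.sorted

# -1 2: join field of the first file is field 2.
# -2 1: join field of the second file is field 1.
join -1 2 -2 1 f1.sorted f2.sorted > f3

cat f3
```

join prints the key first, then the remaining fields of each file, so f3 contains "1 aaa a", "2 bbb b" and "3 ccc c".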

Comments

  1. To get columns (e.g., columns 3 and 5) from a text file data.txt, one can use the "cut" command as follows:
    cut -d' ' -f3,5 < data.txt
    This command is usually much faster than using shell scripts.
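
    One thing to watch for: cut treats every single space as a delimiter, so runs of spaces produce empty fields. The sketch below uses made-up sample files (data.txt and spaced.txt) to show the difference, with awk as one possible alternative when columns are separated by a variable amount of whitespace.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up sample data: five columns separated by single spaces.
printf 'a b c d e\nf g h i j\n' > data.txt
cut -d' ' -f3,5 < data.txt > cols.txt
cat cols.txt

# Same first row, but with two spaces after the first column.
printf 'a  b c d e\n' > spaced.txt

# cut counts the empty field between the two spaces, so fields 3 and 5 shift.
cut -d' ' -f3,5 < spaced.txt > cut_spaced.txt

# awk splits on runs of whitespace, so $3 and $5 are still c and e.
awk '{ print $3, $5 }' spaced.txt > awk_spaced.txt
```

    Here cols.txt holds "c e" and "h j", cut_spaced.txt holds "b d", and awk_spaced.txt holds "c e".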

