
Quick text file merging and data preparation

In natural language processing, you often have to re-format your data before feeding it into different systems. In many such cases, these simple Linux commands let you do the job much quicker without having to write a script.

1. Merging two files into one file with two columns

Input f1 looks like this:
1
2
3
4
Input f2 looks like this:
a
b
c
d
Output f3 will look like this:
1  a
2  b
3  c
4  d
Command: paste f1 f2 > f3 
By default, the delimiter is a tab. You can also specify it (for example, a comma) as follows:
paste -d ',' f1 f2 > f3
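paste can also take more than two input files (one column per file), and with -s it merges all lines of a single file into one line. Two variations on the example above (g is a hypothetical third input file; the rest are the same files):
paste f1 f2 g > out        # any number of input files, one column each
paste -s -d ',' f1         # turn the lines of f1 into one comma-separated line: 1,2,3,4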

2. Adding a line number to each line of a text file

Assume that you want to add an index to each line of a text file, i.e., insert a line number followed by a tab before the content of each line:
Input f1:
a
b
c
d
Output f2:
1  a
2  b
3  c
4  d
Command: nl f1 > f2
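Note that by default nl skips empty lines and pads the line numbers (right-justified) to width 6. To number every line, including blank ones, with a minimal-width number followed by a tab, you can use the standard nl options below, or do the same thing with awk (a sketch; f1 and f2 are the example files above):
nl -ba -w1 f1 > f2                  # -ba numbers all lines, -w1 uses minimal width; the default separator is already a tab
awk '{print NR "\t" $0}' f1 > f2    # awk equivalent: NR is the current line number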

3. Joining two files on a common field

Input f1:
1  aaa
2  bbb
3  ccc
4  ddd
Input f2:
1  a
2  b
3  c
4  d
Output f3 (joining on the first field):
1 aaa a
2 bbb b
3 ccc c
4 ddd d
Command: join f1 f2 > f3
We can also use join on other fields/columns via the -1 and -2 options (the inputs must be sorted on the join field first). See the join man page (man join) for further details.
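For example, to join on the second field of f1 against the first field of f2, sorting each input on its join field first (a sketch assuming bash, for its <(...) process substitution):
join -1 2 -2 1 <(sort -k2,2 f1) <(sort -k1,1 f2) > f3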

Comments

  1. To get certain columns (e.g., columns 3 and 5) from a text file data.txt, one can use the "cut" command as follows:
    cut -d' ' -f3,5 < data.txt
    This is usually much faster than looping over the file in a shell script.
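    One thing cut cannot do is reorder fields (cut -f5,3 still prints column 3 first). When you need a different field order, awk is a common alternative (a sketch; data.txt and the column numbers are just the example above):
    awk '{print $5 "\t" $3}' data.txt    # print column 5, then column 3, tab-separated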

