SAXParser: too many exceptions for invalid XML characters

I'm working on my Similarity Search project, in which I have to implement the Tree Edit Distance and Traversal String Edit Distance.

Trees are all represented in XML format, and I'm using SAXParser to parse those XML files in Java. I've used it many times before, but I still don't quite like it.

So my first step is to create a valid XML database. However, what counts as "valid" for SAXParser turns out to be complicated!

Here is what I get again and again:

File Read Error: org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.
org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.

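The reasons can be different, like:
- Tag names cannot start with a digit (e.g., <1> is an invalid tag); digits are allowed after the first character
- Tag names cannot contain symbols such as {, ?, etc. ("_", "-" and "." are fine, although "-" and "." cannot come first)
- Tag names cannot be empty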
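A quick way to check these naming rules is to feed tiny documents to the parser and see which ones it rejects. The following is only a minimal sketch: the class name `WellFormedCheck` and the helper `isWellFormed` are made up for illustration, and it assumes the default JAXP SAX parser that ships with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original project.
class WellFormedCheck {

    // Returns true if the SAX parser accepts the string as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal (well-formedness) errors as SAXParseException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what we are testing here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {"<a1/>", "<a.b/>", "<_x-y/>", "<1/>", "<a{b/>", "<></>"};
        for (String s : samples) {
            System.out.println(s + " -> " + isWellFormed(s));
        }
    }
}
```

With a standard parser, the first three samples should be accepted and the last three rejected, which suggests that a digit is only a problem when it is the first character of the name.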

However, in my database, all of the tags are numbers.
To make it a valid XML file, I have defined a mapping from each digit to a letter of the alphabet using ASCII codes. For example: 0 is mapped to a, 1 to b, 2 to c, etc., using the following code:


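public String toXMLTagString(String instr) {
    // Map each digit to a letter: '0' -> 'a', '1' -> 'b', ..., '9' -> 'j'.
    StringBuilder outstr = new StringBuilder();
    for (int i = 0; i < instr.length(); i++) {
        // getNumericValue gives 0..9 for digits; adding 'a' (97) shifts it into the alphabet.
        outstr.append((char) (Character.getNumericValue(instr.charAt(i)) + 'a'));
    }
    return outstr.toString();
}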



So with instr = "01234", outstr = "abcde".
Note: "a" corresponds to 97 in the ASCII code.
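Since the trees still have to be compared by their original numeric labels, it may help to be able to reverse the mapping as well. Below is a minimal sketch of a round trip; `DigitTagCodec`, `encode` and `decode` are hypothetical names, not part of the original code. It also rejects non-digit input, because `Character.getNumericValue` returns -1 for characters without a numeric value and 10 to 35 for Latin letters, either of which would silently produce an unintended tag.

```java
// Hypothetical helper class illustrating a reversible digit <-> letter mapping.
class DigitTagCodec {

    // Maps each digit '0'..'9' to a letter 'a'..'j'.
    static String encode(String digits) {
        StringBuilder out = new StringBuilder(digits.length());
        for (char c : digits.toCharArray()) {
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not a digit: " + c);
            }
            out.append((char) ('a' + (c - '0')));
        }
        return out.toString();
    }

    // Inverse of encode: maps each letter 'a'..'j' back to a digit '0'..'9'.
    static String decode(String tag) {
        StringBuilder out = new StringBuilder(tag.length());
        for (char c : tag.toCharArray()) {
            if (c < 'a' || c > 'j') {
                throw new IllegalArgumentException("not an encoded digit: " + c);
            }
            out.append((char) ('0' + (c - 'a')));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String tag = encode("01234");
        System.out.println(tag);          // abcde
        System.out.println(decode(tag));  // 01234
    }
}
```

A simpler alternative, if readable tags matter more than short ones, would be to put a fixed letter in front of each number (for example "n" followed by the digits), because digits are allowed in a tag name after the first character.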
