
SAXParser: too many exceptions for invalid XML characters

I'm working on my Similarity Search project, in which I have to implement the Tree Edit Distance and Traversal String Edit Distance.

Trees are all represented in XML format and I'm using SAXParser to parse those XML files in Java. I've used it many times before, but I still don't quite like it.

So my first step is to create a valid XML database. However, making the files "valid" enough for SAXParser to parse them (that is, well-formed) is complicated!

Here is what I get again and again:

File Read Error: org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.
org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.

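The reasons can be different, like:
- Tag names cannot start with a digit (e.g., <1> is an invalid tag), although digits are allowed after the first character
- Tag names cannot contain symbols such as {, ?, etc. ("_" is fine anywhere; "-" and "." are fine except as the first character)
- Tag names cannot be empty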
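
A quick way to test a candidate tag name before generating a whole file is to let the parser judge a one-element document. The sketch below assumes the default JAXP SAX parser shipped with the JDK; the class and method names (TagNameCheck, isParsableTagName) are made up for illustration, and it only makes sense for simple names without spaces or markup characters.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TagNameCheck {
    // Hypothetical helper: returns true if "<name/>" parses as a document.
    public static boolean isParsableTagName(String name) {
        String xml = "<" + name + "/>";
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal errors as SAXParseException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser rejected the document, so the name is not usable.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not a problem with the name itself; surface it to the caller.
            throw new IllegalStateException("could not run the SAX parser", e);
        }
    }

    public static void main(String[] args) {
        String[] candidates = {"1", "a1", "_a", "a.b", "-a", "{a", ""};
        for (String name : candidates) {
            System.out.println("\"" + name + "\" -> " + isParsableTagName(name));
        }
    }
}
```

Running main prints one line per candidate, which makes it easy to see which first characters the parser refuses.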

However, in my database, all of the tags are numbers.
To make it a valid XML file, I have defined a mapping from each digit to a letter of the alphabet using ASCII codes. For example, 0 is mapped to a, 1 to b, 2 to c, etc., using the following code:


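// Maps each digit of instr to a letter: '0' -> 'a', '1' -> 'b', ..., '9' -> 'j'.
public String toXMLTagString(String instr) {
    StringBuilder outstr = new StringBuilder(instr.length());
    for (int i = 0; i < instr.length(); i++) {
        // getNumericValue gives 0..9 for the digits '0'..'9'; adding 97 ('a') shifts it into the alphabet
        outstr.append((char) (Character.getNumericValue(instr.charAt(i)) + 97));
    }
    return outstr.toString();
}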



So with instr = "01234", outstr = "abcde".
Note: "a" corresponds to 97 in the ASCII code.
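
Since the tree algorithms eventually need the original numeric labels back, the mapping has to be reversible. Here is a minimal sketch of a round trip, assuming every tag is made only of the ASCII digits 0-9. The names DigitTagCodec, fromXMLTagString and preorderLabels are hypothetical and are not part of the code above; the encoder is rewritten with explicit range checks, so it rejects non-digits instead of silently mapping them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DigitTagCodec {
    // Encode ASCII digits as letters: '0' -> 'a', ..., '9' -> 'j'.
    public static String toXMLTagString(String digits) {
        StringBuilder out = new StringBuilder(digits.length());
        for (char c : digits.toCharArray()) {
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not an ASCII digit: " + c);
            }
            out.append((char) ('a' + (c - '0')));
        }
        return out.toString();
    }

    // Hypothetical inverse: 'a' -> '0', ..., 'j' -> '9'.
    public static String fromXMLTagString(String tag) {
        StringBuilder out = new StringBuilder(tag.length());
        for (char c : tag.toCharArray()) {
            if (c < 'a' || c > 'j') {
                throw new IllegalArgumentException("not an encoded digit: " + c);
            }
            out.append((char) ('0' + (c - 'a')));
        }
        return out.toString();
    }

    // Parse XML with encoded tags and return the decoded labels
    // in document order (the order in which elements are opened).
    public static List<String> preorderLabels(String xml) {
        List<String> labels = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                labels.add(fromXMLTagString(qName));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalArgumentException("could not parse encoded XML", e);
        }
        return labels;
    }

    public static void main(String[] args) {
        String root = toXMLTagString("12");   // "bc"
        String child = toXMLTagString("7");   // "h"
        String xml = "<" + root + "><" + child + "/></" + root + ">";
        System.out.println(xml);
        System.out.println(preorderLabels(xml));
    }
}
```

Decoding inside startElement means the rest of the program only ever sees the numeric labels, so the letter encoding stays a detail of the file format.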
