Non-Extractive Parsing for XML

25 May 2004

Text processing is one of the most common tasks in application development. Whether it is a Java Servlet or a VOIP application, the conversion from a raw text-based input message to a machine-readable internal representation almost always requires parsing (or tokenization), which, in its current form, refers to a process of extracting tokens (the smallest unit of relevent textual information) and storing them in null-terminated memory blocks, also known as strings. Over the years, people have invented various automation techniques and tools, e.g. regular expression and Lex, to reduce the complexity of manual parsing. Proven both useful and stable, those techniques and tools have stood the test of time. As a result, the current framework of text processing is generally considered to be fairly well-established. ead the article