Task refinement, blog organization, etc.

It’s been about a full month since the last update. Although I posted nothing in writing during that time (apart from the occasional tweet), a lot has happened that will determine my activities for the following month: my part-time job as a student assistant, a project adding some major functionality to my university’s website, and another rather time-consuming course assignment spanning the whole semester. As a consequence, I will first tackle those urgent topics as quickly as possible and then shift my focus back to my thesis.

Speaking of which, today I met my academic advisor to settle on a more refined task description, replacing the one I presented in my very first blog entry:

Wikipedia summary and overview generation (current draft)
The amount of semi-structured information in sources such as Wikipedia presents a problem similar to information overload, where the quantity of information may confuse or overwhelm the user.

To address this problem we propose an intermediate stage between a document index and the detailed documents: a dynamic graph animation of semantically related document summaries.

This would provide a conceptual map of structured information where important or central concepts can be found with little difficulty. A dynamic visual representation could alleviate the cognitive load of the user.

The approach for generating results focuses on the use of natural language techniques and semantic web information to add to the available structured information, although structured information can and should be used to bootstrap the process.

A special case of the results is the generation of a causal-temporal graph to represent the progression of events or activities, such as a visually animated timeline or flowchart.

  • Subtasks:
    1. Define semantic relations that are not explicit or complete within the structured data, i.e. Wikipedia.
    2. Define mechanisms to enhance the available semi-structured information using natural language techniques and semantic web information.
    3. Implement a visualization that updates with streaming information.
    4. Present the generated information in the visualization.
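
To make subtasks 3 and 4 a bit more concrete, here is a minimal sketch of a graph that accepts relations one at a time, as a stream would deliver them, and reports the most connected concept. The class name `SummaryGraph`, its methods, and the use of plain neighbor counts as the measure of "central" are my own assumptions for illustration, not part of the task description.

```python
from collections import defaultdict


class SummaryGraph:
    """Hypothetical undirected graph of related document summaries."""

    def __init__(self):
        # Maps each node to the set of nodes it is related to.
        self.neighbors = defaultdict(set)

    def add_relation(self, source, target):
        """Add one relation; can be called repeatedly as new data streams in."""
        self.neighbors[source].add(target)
        self.neighbors[target].add(source)

    def most_central(self):
        """Return the node with the most neighbors (ties broken by name)."""
        if not self.neighbors:
            return None
        return max(self.neighbors, key=lambda n: (len(self.neighbors[n]), n))


graph = SummaryGraph()
for source, target in [("Event A", "Event B"), ("Event A", "Event C")]:
    graph.add_relation(source, target)
print(graph.most_central())
```

A real visualization would redraw after each call to `add_relation`; the sketch only covers the data structure underneath.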

This description now conveys a more fine-grained view of what I am doing. As you might see, the semantic relations have shifted from the original strong ones (logical relations like “consequence”, etc.) to the weaker, yet very beneficial, causal-temporal ones. As a side remark: although Discourse Analysis differentiates a variety of potential relations, their formalization seems to be vague. The visualization aspect was personally important to me, since I would like to end up with something that helps the reader get a clear and intuitive overview.

*Important*  For reasons of readability and general blog organisation, I will try to distinguish between thesis-related entries and off-topic, personal ones. I also want to state that everything written in this blog reflects my personal opinion and thinking and not that of any other party (Wikipedia, TU Dresden, etc.), just to prevent any potential confusion.

Literature & Preparations

Cover of "Natural Language Processing with Python"

I have spent some days investigating and diving deep into the ocean of NLP. My current strategy is to get a glimpse of the overall approaches in certain sub-areas. Techniques from Text Mining and Computational Linguistics are especially important when dealing with Topic Detection and Tracking (TDT).

For the moment, my main sources of information are Wikipedia articles and some of their linked sources, but I will soon shift to specific books and papers. I am planning to read the book “Natural Language Processing with Python” intensively; so far I have found it easily readable and comprehensive, with tons of examples. [As a little side note: compared to the standard Python IDE (IDLE), IPython turns out to be very handy when dealing with huge text output and offers nice features like the object? command, which introspects the object and shows excerpts from its documentation.]  Additionally, “The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data” will provide deeper insight into the formal aspects.

The most recent article I got my hands on is “Themenentdeckung und -verfolgung und ihr Einsatz bei Informationsdiensten für Nachrichten” (“Topic detection and tracking and its use in news information services”) [PDF, German] by Wolfgang G. Stock, who illustrates TDT using Google News, which automatically aggregates interesting news by recency and topic and, on request, presents it according to the user’s preferences.  Recently (in fact, today) Google News underwent a relaunch improving its content and presentation. It added a new section for news written manually by authorized people from newspapers, and these items are not part of the automated process. The reasons for this are still open. Probably editors pick better topics that people want to read, and those picks serve as a basis for improving the generated ones, but that’s just speculation on my part.

Twitter seems to be fun. Hopefully I won’t get overwhelmed by hundreds of tweets.

Hello world!

This is yet another blog on the big internet, published for purely selfish reasons. The purpose of this site is to record the progress of my student thesis, which addresses problems both in the field of Natural Language Processing (NLP) and in graphs used to illustrate the results. My particular task involves the “Generation of Event Chains from Wikipedia Articles”. The general idea behind this is that events have a so-called cause-effect relationship, e.g. in news this would be the cause “Country X attacks Country Y” and the effect “Country Y announces retaliation against Country X”, and so on. Those relationships will be extracted using certain keywords like “causes”, “background” or “trigger”. The overall result will be presented in a summary graph which readily presents the sources of events.
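
To illustrate the keyword idea, here is a very rough sketch that splits a sentence at the first cue word into a cause part and an effect part. The cue list, the function name `extract_cause_effect` and the assumption that the cause always comes before the cue word are mine and only serve as an example; the actual thesis approach may end up looking quite different.

```python
import re

# Assumed cue words for illustration; "causes" and "trigger" come from the
# text above, the other forms are additions of mine.
CUE_PATTERN = re.compile(
    r"\b(causes|caused|triggers|triggered|leads to|led to)\b",
    re.IGNORECASE,
)


def extract_cause_effect(sentence):
    """Return (cause, cue, effect) split at the first cue word, or None."""
    match = CUE_PATTERN.search(sentence)
    if match is None:
        return None
    # Naive assumption: text before the cue is the cause, text after is the effect.
    cause = sentence[:match.start()].strip(" ,.")
    effect = sentence[match.end():].strip(" ,.")
    if not cause or not effect:
        return None
    return cause, match.group(1).lower(), effect


print(extract_cause_effect("The attack on Country Y triggered a retaliation announcement."))
```

Each extracted triple could become one edge of the event chain, with the cause and effect as nodes. Passive constructions ("was caused by") would reverse the direction and are not handled here.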

Since NLP offers a huge collection of algorithms and I have not worked in that field so far, I have to learn a lot from scratch, building up a basic vocabulary and getting familiar with the most fundamental approaches. On the practical side, I will make intensive use of the Natural Language Toolkit (written in Python), which I have been told is a great introduction to the field.

About me: I am a computer science student at TU Dresden, Germany, majoring in “Intelligent Systems”. My other interests are writing little tools to improve daily life, using mnemonics whenever possible, indie games, canoeing, and Japanese language and culture.

PS: There will also be off-topic stuff here that I find interesting or funny. ;-)