Main Web>TWikiUsers>SotirisFragkiskos>EMMOsint>CombiningCSAndEMMOSint (2013-05-30, SotirisFragkiskos)

Combining Collaboration Spotting and EMM OSINT (Correspondence with Gerhard from the JRC)

Introduction

As we understand it, the three main concepts of EMM are entity recognition, categorisation and clustering. So far we are looking into structured sources, and we already extract the entities of interest to us with Collaboration Spotting. Our main interest is therefore in the latter two.

The powerful categorisation and clustering mechanisms from the EMM Newsbrief seem to be missing from the OSINT desktop application. We would like to know if we could use these mechanisms to handle publications, patents, technologies etc. This would allow us to categorise technologies, find formal collaborations, and other things explained in the conclusion at the end of this document.

Collaboration Spotting’s needs

Entity recognition

We have already implemented mechanisms for parsing organisations, but we would be interested in improving our algorithms. EMM faces a similar type of challenge, and maybe we could learn something from how you address these issues. In addition, we would like to feed our already identified organisations (and potentially people in the future) into the entity database, so that these can be recognised by the EMM OSINT Suite (EOS). Is this feasible?

Comment from Gerhard Wagner: Yes, we can import a list of entities as a tab-separated values (TSV) file.
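
A minimal sketch of how such an entity list could be written as a tab-separated file. The column names (`entity_id`, `type`, `name`) and the sample entities are assumptions made for illustration; the layout actually expected by the EMM importer is not specified in this correspondence.

```python
import csv
import io

# Hypothetical column layout: the format expected by the EMM OSINT Suite
# importer is not specified in this correspondence.
FIELDS = ["entity_id", "type", "name"]


def write_entities_tsv(entities, stream):
    """Write entity dicts as tab-separated rows, preceded by a header line."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for entity in entities:
        writer.writerow(entity)


# Illustrative entities only.
entities = [
    {"entity_id": "1", "type": "organisation", "name": "CERN"},
    {"entity_id": "2", "type": "organisation", "name": "Joint Research Centre"},
]
buffer = io.StringIO()
write_entities_tsv(entities, buffer)
tsv_text = buffer.getvalue()
```

Writing to a stream rather than a fixed path keeps the sketch independent of any particular file location; passing an open file object instead of the `StringIO` buffer would produce the file on disk.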

Categorisation

We believe that the algorithms of the EMM Newsbrief have the potential to help us discover formal collaborations. This could be done by extracting and analysing news articles from both scientific and non-scientific sources, as well as from abstracts and full-text publications.

Another issue is to recognise whether a document is about the use of a technology or about its development. The EMM Newsbrief could potentially solve this issue, and also determine whether a given technology is a fundamental technology or part of a composite technology.

We are already looking into ways of determining ownership of organisations. For now, we rely on organisation trees, either manually created or gathered from the web. It would be a useful addition if we could use your algorithms to recognise when a company is bought by another, when two companies merge, or when a company changes its name.
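
To illustrate what an organisation tree could look like in practice, here is a minimal sketch, assuming ownership is stored as a child-to-parent mapping. The organisation names and the helper functions `ultimate_owner` and `record_acquisition` are hypothetical and do not describe the actual Collaboration Spotting implementation.

```python
# Hypothetical organisation tree: each key is owned by its value.
PARENT = {
    "Subsidiary A": "Holding B",
    "Holding B": "Group C",
}


def ultimate_owner(org, parent=PARENT):
    """Follow ownership links upwards until an organisation has no parent."""
    seen = {org}
    while org in parent:
        org = parent[org]
        if org in seen:  # guard against cycles in hand-made data
            raise ValueError(f"ownership cycle at {org!r}")
        seen.add(org)
    return org


def record_acquisition(child, new_parent, parent=PARENT):
    """Update the tree when one organisation is bought by another."""
    parent[child] = new_parent
```

An acquisition detected in a news article would then amount to a single call such as `record_acquisition("Group C", "Conglomerate D")`, after which every subsidiary resolves to the new owner.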

Clustering

EMM performs “Automated information linking” to link articles together if they correspond to each other. We understand that this mechanism uses some sophisticated ways of deciding which stories belong together. This could possibly be used to “cluster” technologies in any way we want. We would be particularly interested in clustering technologies inside a technology family, or deciding whether a paper is on the use or development of a technology.

Comment from Gerhard Wagner:

Clustering is currently not part of the EMM OSINT Suite; however, I think we can add it relatively simply so that it works on a set of documents on a local disk. (I have opened an issue in our support system to explore this further.)
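
For illustration only, a minimal sketch of clustering a small set of documents. It uses plain bag-of-words cosine similarity with a greedy single-pass grouping, which is a deliberately simple stand-in and not the EMM clustering algorithm. The threshold of 0.5 and the sample texts are assumptions; in practice the texts would be read from files on disk.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Lower-cased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering: a document joins the first cluster
    whose seed document is similar enough, otherwise it starts a new one."""
    clusters = []  # list of (seed_vector, [document indices])
    for i, text in enumerate(docs):
        vec = vectorise(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


# Illustrative texts; real input would come from local files.
docs = [
    "superconducting magnet design for particle accelerators",
    "design of superconducting magnet for accelerators",
    "silicon pixel detector readout electronics",
]
groups = cluster(docs)
```

Here the first two texts share most of their terms and end up together, while the third starts its own cluster.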

Questions for EMM

Question:

It seems connections between entities are defined by whether they appear in the same file. Can we teach the program to make a connection between, for example, two companies that appear in the same abstract (and not the whole file)?

Reply from Gerhard Wagner:

This is true; currently we take the complete document into account. In a previous version of our extraction engine we already had a function that was based on ranges within a document. This could be added again to make sure that only the abstract is taken into account. (Again, I am opening an issue in our tracking system to include this in the next major version of the software.)
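
The range-based idea can be sketched as follows, assuming the extractor reports each entity mention together with its character offset. The function `cooccurring_pairs`, the sample text and the company names are hypothetical and only illustrate restricting co-occurrence to the abstract.

```python
from itertools import combinations


def cooccurring_pairs(mentions, start, end):
    """Return sorted entity pairs whose mentions fall inside [start, end).

    `mentions` is a list of (entity_name, character_offset) tuples, a
    hypothetical stand-in for the output of an entity extractor.
    """
    in_range = sorted({name for name, offset in mentions
                       if start <= offset < end})
    return list(combinations(in_range, 2))


# Illustrative document with an abstract followed by a body.
text = "Abstract: Acme and Globex built a sensor. Body: Initech sells it."
mentions = [(name, text.index(name)) for name in ("Acme", "Globex", "Initech")]
abstract_end = text.index("Body:")

abstract_pairs = cooccurring_pairs(mentions, 0, abstract_end)
document_pairs = cooccurring_pairs(mentions, 0, len(text))
```

With the range limited to the abstract, only Acme and Globex are linked; over the whole document, Initech is linked to both as well.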

Question:

Our structures are in XML format. Do you think that OSINT could support XML input in the future?

Reply from Gerhard Wagner: Yes, this should be no problem. The component we use to extract text from documents (Apache Tika) already provides a filter for XML. We have to discuss which structured information of the input XML needs to be retained.
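
As a starting point for that discussion, here is a minimal sketch of retaining structured fields from an XML record instead of flattening it to plain text. It uses Python's standard `xml.etree.ElementTree` rather than Apache Tika, and the record layout (`record`, `title`, `organisation`, `abstract`) is an assumption; the real Collaboration Spotting schema is not shown in this correspondence.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real Collaboration Spotting schema is not
# shown in this correspondence.
SAMPLE = """<record>
  <title>Sensor readout</title>
  <organisation>Acme</organisation>
  <organisation>Globex</organisation>
  <abstract>Acme and Globex built a sensor.</abstract>
</record>"""


def extract_fields(xml_text):
    """Keep the structured fields separately instead of flattening to text."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title", default=""),
        "organisations": [e.text for e in root.findall("organisation")],
        "abstract": root.findtext("abstract", default=""),
    }


fields = extract_fields(SAMPLE)
```

Keeping the abstract as its own field would also make it straightforward to restrict entity connections to the abstract only.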

Question:

Do you use other databases for the Newsbrief to complete the clustering and categorisation? On the same note, (how) can we build our own database of entities?

Reply from Gerhard Wagner:

Currently all our data is derived from EMM Newsbrief and its components. A database of entities could be built by feeding a large set of documents to the EMM OSINT Suite and then sanitising the name variant database manually (importing it into MS Access for editing, for example). Also, the entity guessing rules of the EMM OSINT Suite could be amended to find more organisations for your domain.
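
Manual sanitising could be preceded by a simple automatic pass that groups obvious name variants. The following is a minimal sketch, assuming variants differ only in case, punctuation and a legal-form suffix; the suffix list and the helper names `normalise` and `group_variants` are illustrative and are not part of the EMM OSINT Suite.

```python
import re
from collections import defaultdict

# Illustrative, far from exhaustive list of legal-form suffixes.
LEGAL_SUFFIXES = {"ltd", "inc", "gmbh", "sa", "ag"}


def normalise(name):
    """Lower-case, drop punctuation and strip legal-form suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def group_variants(names):
    """Group raw names that share the same normalised key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return dict(groups)


variants = group_variants(["Acme Ltd.", "ACME", "Globex GmbH"])
```

Each resulting group would still need a manual check, since unrelated organisations can share a normalised name.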

Question:

The Newsbrief can perform categorisation of news articles, for example natural disasters, crimes, conflicts etc. based on keyword proximity/patterns. Can we build on this so that the engine can recognise formal collaborations, joint ventures, and maybe even distinguish between use and development?

Reply from Gerhard Wagner:

I think this can be done, but we may need to come up with some new formal rules for our system to detect these kinds of connections between entities. As a first step, we need to integrate the category matching module into EOS, then find a set of keywords that works roughly, and improve it further from there.
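
A first rough keyword rule of this kind could be sketched as below, assuming a category matches when a term from one set occurs within a few tokens of a term from another set. The term sets, the distance of five tokens and the function `matches_rule` are assumptions for illustration and do not reproduce the EMM category definition syntax.

```python
import re


def matches_rule(text, first_terms, second_terms, max_distance=5):
    """True if a term from each set occurs within `max_distance` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    first = [i for i, t in enumerate(tokens) if t in first_terms]
    second = [i for i, t in enumerate(tokens) if t in second_terms]
    return any(abs(i - j) <= max_distance for i in first for j in second)


# Hypothetical keyword sets for a "formal collaboration" category.
COLLAB_A = {"joint", "signed", "sign"}
COLLAB_B = {"venture", "agreement", "partnership"}

near = matches_rule("Acme and Globex signed a cooperation agreement.",
                    COLLAB_A, COLLAB_B)
far = matches_rule("Acme signed. " + "filler " * 10 + "agreement",
                   COLLAB_A, COLLAB_B)
```

The second text contains both keywords but too far apart, so it does not match; tuning the term sets and the distance is where the iterative improvement would take place.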

Topic revision: r1 - 2013-05-30 - SotirisFragkiskos
