Working on text mining – this is what we do


Our participation in the OpenMinTeD Horizon 2020 project allowed us to learn more about text mining, the communities around it, the various types of stakeholders and the issues that they face.

Our job covered everything around user requirements elicitation: we were responsible for defining and applying a methodology for profiling the various types of text mining stakeholders, understanding their content-related needs, identifying their issues and proposing solutions to address them, with a focus on the text mining-related ones. Pretty demanding, right?

The project encompasses four (4) different communities:

  1. Scholarly Communication
  2. Agriculture / Biodiversity
  3. Life Sciences
  4. Social Sciences

Our methodology consisted of two rounds of requirements elicitation:

i) A general online questionnaire was prepared and project partners were asked to adapt it for their communities. These adapted questionnaires aimed at identifying different types of stakeholders and collecting general requirements and were circulated among stakeholders identified by each one of the partners involved in this process. So in the end we had questionnaires for the agricultural, the scholarly communication, the life sciences and the social sciences communities. These questionnaires were pretty generic, but they allowed us to quickly identify the different types of stakeholders in each community and their most challenging issues.

ii) After the first round of feedback was collected, it was taken into consideration and four more refined questionnaires were prepared to serve the needs of the four (4) most prominent types of stakeholders identified:

  1. Data/content providers, such as institutional repository managers, publishers, journal editors etc.
  2. Operators of e-infrastructure and aggregators, such as OpenAIRE and META-SHARE (as well as our domain-specific AGINFRA)
  3. Text mining researchers, both individuals and the ones working for large organizations
  4. Researcher application developers, referring to those who develop applications making use of text mining tools and services.

The questions of the online questionnaires may also be used in face-to-face interviews and focus groups. In this case, additional feedback can be requested from participants: descriptions of existing usage scenarios/workflows (which will help us identify issues and the potential contributions of the project’s outcomes), envisaged ones (which provide direct feedback on what the stakeholders expect), and envisaged user interfaces, in the form of draft sketches, that express how stakeholders would like to see the new TDM-powered services integrated in existing websites, portals and other web pages. In general, these are things that can be better described using pen and paper. The feedback will have to be analyzed, organized and validated (e.g. by additional stakeholders or domain experts for each domain); the requirements will then be used to drive the design and implementation of the OpenMinTeD platform and services.

Part of an OpenMinTeD online questionnaire for content providers

The work is still ongoing, so there will be updates, including some solid results, within the next few months.

A related article that we recently came across stated that the three top issues faced by commercial text miners are:

  1. Limited information in abstracts: No matter how well-written they are, abstracts are still abstracts, so they contain only limited information compared to the full text. However, they are available more often than the full text, so they can be used to make up for missing full text.
  2. Restricted access to XML content: Almost all research publications are available as PDF files; however, even elaborate text mining tools have a hard time mining information from proprietary PDF files. Converting PDF files to XML is an alternative, but it is still time-consuming and error-prone (see all those strange characters after a PDF is converted), and it leads to loss of formatted text and objects, such as tables and figures.
  3. Lengthy negotiations in the absence of consistent Terms and Conditions: One of the early findings of the OpenMinTeD project is the lack of a well-defined legal framework covering text mining applications over existing text. Since there is no provision for text mining applications in existing licensing schemes, negotiations need to take place from scratch on an individual basis with the various actors in the process, such as authors, repository managers, aggregator operators, publishers etc.

We expect that the analysis of the responses received through the questionnaires and the other means mentioned earlier will allow us to identify additional text mining-related challenges and a number of solutions expected by the various stakeholders. This will allow the project to build a platform, and services on top of it, that will actually be meaningful and useful for their expected users: meeting their requirements, addressing their TDM-related issues and facilitating their work or enhancing the services that they provide to their end users.
