Search User Interfaces for Digital Library Content

Lecturer: Christin Seifert (Chair of Media Informatics, University of Passau)

Dr. Christin Seifert is a post-doctoral researcher at the University of Passau, Germany. She is currently leading a research group in the EU project EEXCESS. In her doctoral thesis at the University of Graz, Austria, she investigated the use of interactive visualisations for machine learning. She received her diploma from the University of Chemnitz, Germany, in the field of artificial intelligence.
She worked at the Know-Center and Joanneum Research in Graz from 2004 to 2012 in many nationally and internationally funded projects (MACS, Dyonipos, APOSDLE, MOBVIS).
Christin Seifert has published more than 70 peer-reviewed publications in the fields of machine learning, information visualisation, digital libraries and object recognition.

Duration: 3 hours

Aims and Objectives:
Provide an overview of human search models
Outline potential scenarios for visual search interfaces
Show case studies of visual search interfaces

Brief Description:
Digital library content is nowadays mostly accessible via Web interfaces. This has also brought a larger, more diverse user base that accesses the content. Traditional user interfaces are targeted at content experts, i.e. librarians, and are not necessarily suitable for all users.
This tutorial introduces alternative search user interfaces for digital library content.

List of Topics to be covered:
  1. Human Search Models
  2. Query Specification in Visual Search Interfaces
  3. Visualisation of Search Results
  4. Integrating User Feedback

Click Models for Web Search and their Applications

Lecturer: Ilya Markov (Postdoctoral Researcher at the University of Amsterdam)

Ilya Markov is a postdoctoral researcher at the University of Amsterdam. His research agenda builds around information retrieval methods for heterogeneous search environments. Ilya has experience in federated search, user behavior analysis, click models and effectiveness metrics. He is a PC member of leading IR conferences such as SIGIR, WWW, ECIR, and ICTIR, a PC chair of the RuSSIR 2015 summer school, and a co-organizer of the IMine-2 task at NTCIR-12. Ilya is currently teaching an MSc course on web search, has previously taught information retrieval courses at the BSc and MSc levels, and has given tutorials at conferences and summer schools in IR (SIGIR, ECIR, RuSSIR).

Aims and Objectives:
Click models, probabilistic models of the behavior of search engine users, have been studied extensively by the information retrieval community during the last five years. We now have a handful of basic click models, inference methods, evaluation principles and applications for click models that form the building blocks of ongoing research efforts in the area. The goal of this tutorial is to bring together current efforts in the area, summarize the research performed so far and give a holistic view of existing click models for web search.

List of Topics to be covered:
The aims of this tutorial are the following:
  1. Describe existing click models in a unified way, i.e., using common notation and terminology, so that different models can easily be related to each other.
  2. Compare commonly used click models, discuss their advantages and limitations and provide a set of recommendations on their usage.
  3. Provide ready-to-use formulas and implementations of existing click models and detail general inference procedures to facilitate the development of new ones.
  4. Give an overview of existing datasets and tools for working with click models and for developing new ones.
  5. Provide an overview of click model applications and directions for future development of click models.

The tutorial will be accompanied by coding sessions (in Python). For these sessions, we will provide code examples and data samples. The participants will be able to either follow the examples on the slides or run them live along with the presentation. To follow the examples live, the participants need to have a laptop with IPython and PyClick installed.
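As a taste of what the coding sessions cover, here is a minimal sketch (an illustration of the idea only, not the PyClick API) of a position-based CTR click model: the probability of a click at rank r is estimated by maximum likelihood as clicks(r) / impressions(r) over a click log.

```python
# Position-based CTR click model estimated from a synthetic click log.
from collections import defaultdict

def estimate_rank_ctr(sessions):
    """sessions: list of search sessions; each session is a list of 0/1
    click indicators, one per result rank."""
    clicks = defaultdict(int)
    shows = defaultdict(int)
    for session in sessions:
        for rank, clicked in enumerate(session):
            shows[rank] += 1
            clicks[rank] += clicked
    # Maximum-likelihood estimate per rank: clicks / impressions.
    return {rank: clicks[rank] / shows[rank] for rank in shows}

# Synthetic click log: three sessions, clicks biased toward top ranks.
log = [[1, 0, 0], [1, 1, 0], [0, 0, 0]]
ctr = estimate_rank_ctr(log)
# ctr[0] == 2/3, ctr[1] == 1/3, ctr[2] == 0.0
```

More expressive models covered in the tutorial (cascade, DBN, and so on) replace this single rank parameter with latent attractiveness and examination variables, inferred with EM.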


Part I

Lecturer: Chris Biemann (Assistant professor and head of the Language Technology group at TU Darmstadt, Germany)

Chris is assistant professor and head of the Language Technology group at TU Darmstadt in Germany. He received his Ph.D. from the University of Leipzig, and subsequently spent three years in industrial search engine research at Powerset and Microsoft Bing in San Francisco, California. He regularly publishes in journals and top conferences in the field of Computational Linguistics.
His research is targeted towards self-learning structure from natural language text, specifically regarding semantic representations. Using big-data techniques, his group has built an open-source, scalable, language-independent framework for symbolic distributional semantics. To connect induced structures to tasks, Chris frequently uses crowdsourcing techniques for the acquisition of natural language semantics data.

Tutorial on Crowdsourcing Linguistic Datasets

Aims and Objectives:
Introduce crowdsourcing as an instrument to quickly acquire linguistic datasets for training and evaluation purposes.
Provide step-by-step instructions on how to realize simple and complex crowdsourcing projects.
Discuss general recommendations on designing and evolving successful crowdsourcing tasks.

Brief Description:
Since the introduction of Amazon’s Mechanical Turk crowdsourcing platform in 2005, crowdsourcing has become a means to quickly scale the collection of data from humans via paid work. Since then, there have been an ever-increasing number of publications that use crowdsourcing techniques for the collection of labeled datasets for computational linguistics and many other fields. The tutorial gives a general introduction to crowdsourcing as an instrument to quickly acquire linguistic datasets for training and evaluation purposes.

List of Topics to be covered:
Crowdsourcing for language tasks
Crowdsourcing platforms
Successful design patterns

Part II

Lecturer: Gianluca Demartini (Information School of the University of Sheffield, UK)

Dr. Gianluca Demartini is a Lecturer in Data Science at the Information School of the University of Sheffield, UK. Previously, he was post-doctoral researcher at the eXascale Infolab at the University of Fribourg, visiting researcher at UC Berkeley, junior researcher at the L3S Research Center, and intern at Yahoo! Research. He obtained a Ph.D. in Computer Science at the Leibniz University of Hannover in Germany focusing on entity-oriented search.

His research interests include Crowdsourcing for Human Computation, Information Retrieval, and Semantic Web. He has published more than 60 peer-reviewed scientific publications and given tutorials about Entity Retrieval and Crowdsourcing at international research conferences. He is editorial board member for the Journal of Web Semantics and has been program committee member for a number of international research conferences including SIGIR, ISWC, WWW, and CIKM.

Improving the quality of crowdsourced annotations

Aims and Objectives:
Provide an overview of quality assurance techniques in crowdsourcing platforms.
Describe crowd answer aggregation techniques.
Look at the human dimension of crowdsourcing by understanding worker behaviours and by modelling their skills and knowledge.

Brief Description:
Crowdsourcing platforms allow people to obtain manually annotated data at scale. However, several challenges exist, such as the need for quality control and the lack of predictable completion times. This tutorial focuses on recent techniques to improve the quality of crowdsourced annotations by aggregating crowd answers as well as by modelling worker skills and behaviours.

List of Topics to be covered:
Crowd Answer Aggregation Techniques
Worker profiling, selection, and trust
Malicious worker behaviours
Worker types
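The simplest of the crowd answer aggregation techniques listed above is majority voting, sketched below (the data layout is a hypothetical example, not tied to any particular platform):

```python
# Aggregate crowd labels by majority voting: each item receives the label
# chosen by the largest number of workers.
from collections import Counter

def majority_vote(answers):
    """answers: dict mapping item_id -> list of labels from different workers."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in answers.items()}

# Hypothetical annotation task: three workers label each image.
crowd = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
}
labels = majority_vote(crowd)
# labels == {"img_1": "cat", "img_2": "dog"}
```

More advanced aggregation methods discussed in the tutorial weight each worker's vote by an estimate of their reliability instead of counting all votes equally.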

Distributional Semantics

Part I

Lecturer: Alexander Panchenko (Postdoctoral researcher at Technische Universität Darmstadt, Language Technology Group, Darmstadt, Germany)

I am a postdoctoral researcher in natural language processing (NLP) at the Language Technology Group of Technische Universität Darmstadt. My main research interest is computational lexical semantics, including semantic similarity, word sense induction and disambiguation. In the past I also worked on other NLP-related topics, including short text classification for social media analysis (sentiment analysis, topic categorization, gender categorization, age detection) and skill extraction from text. More generally, I am interested in statistical natural language processing, information retrieval, semantic web, machine learning and any interactions of these fields. Previously, I worked as an associated researcher in NLP at Université catholique de Louvain and as a senior researcher in social media analysis at Digital Society Laboratory LLC. I completed a PhD program in Natural Language Processing at Université catholique de Louvain under a co-tutelle agreement with Moscow State Technical University. I was involved in the European research project iCOP and the Belgian research project ELIS-IT. Wallonie-Bruxelles International supported my research. I hold a degree in information technologies from Moscow State Technical University.

Aims and Objectives:
Learn the basics of computational lexical semantics.
Get an understanding of resource-based approaches.
Get an understanding of data-driven approaches (distributional semantics).
Learn about evaluation techniques and datasets in the field.
Understand how lexico-semantic knowledge can be used in NLP applications.

Brief Description:
Computational lexical semantics is a subfield of Natural Language Processing that studies computational models of lexical items, such as words, noun phrases and multiword expressions. Modeling semantic relations between words (e.g. synonyms) and word senses (e.g. python as a programming language vs. python as a snake) is of practical interest in the context of various language processing and information retrieval applications. We are going to discover how this kind of linguistic knowledge can be bootstrapped from text in a data-driven way with the help of distributional semantics methods. The lectures will be complemented with a hands-on session on tools such as JoBimText, Serelex and word2vec.

List of Topics to be covered:
Distributional semantics: vector-based, structure-based and neural approaches.
Semantic relations and resources
Semantic similarity and relatedness
Word sense induction and disambiguation
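As a minimal illustration of the vector-based approach listed above (a toy sketch of my own, not the JoBimText or word2vec tooling): words are represented by their co-occurrence counts with context words, and semantic relatedness is the cosine of the angle between those count vectors.

```python
# Build distributional word vectors from co-occurrence counts and
# compare them with cosine similarity.
import math
from collections import Counter

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of context words within the window."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(ctx)
    return vectors

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

corpus = [["the", "cat", "drinks", "milk"],
          ["the", "dog", "drinks", "water"]]
vecs = cooccurrence_vectors(corpus)
# "cat" and "dog" share the contexts "the" and "drinks", so their
# similarity is high even though they never co-occur with each other.
```

Neural approaches such as word2vec replace these sparse count vectors with dense, learned embeddings, but the underlying distributional hypothesis is the same.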

Part II

Lecturer: Andrey Kutuzov (University of Oslo, Language Technology Group, Ph.D. Candidate)

Andrey Kutuzov is currently doing his Ph.D. at the University of Oslo, Norway. His research interests include distributional semantics, neural network language models and, more generally, ways to infer knowledge from raw natural-language texts.
For several years, Andrey has been working on computational linguistics tasks in search. He also taught at the School of Linguistics of the National Research University Higher School of Economics (Moscow).

Brief Description:
Word2vec and other buzzwords: an unsupervised machine learning approach to distributional semantics.
The recent boost of interest in distributional semantics is related to employing simple and fast artificial neural networks to directly learn high-quality vector word representations (embeddings) from unannotated text data. They can be used in almost any NLP task you can think of.
I will explain how this approach differs from more traditional vector space models of semantics, and demonstrate the results of applying it to Russian language modeling.

Social Media Computing

Lecturer: Aleksandr Farseev (The Lab for Media Search (LMS), National University of Singapore(NUS))

Mr. Aleksandr Farseev is a Ph.D. candidate in the School of Computing of the National University of Singapore. His research interests mainly include multi-source user profiling, data gathering, and user mobility analysis.

Aims and Objectives:
This module introduces the background and present state of social networks and their analysis in terms of contents, users, social relations and applications. The social networks to be covered include microblogging sites like Twitter, social communication sites like Facebook, location-sharing sites like Foursquare, and photo-sharing sites like Instagram and Flickr. At the end of this tutorial, participants are expected to have an initial understanding of the background, design, analysis and implementation of social media analysis systems.

Brief Description:
The emergence of WWW, smart mobile devices and social networks has revolutionized the way we communicate, create, disseminate, and consume information. This has ushered in a new era of communications that involves complex information exchanges and user relationships. This module aims to study the social network phenomena by analyzing the complex social relation networks between users, the contents they shared, and the ways contents and events are perceived and propagated through the social networks. The analysis will provide better understanding of the concerns and interests of users.

List of Topics to be covered:
  1. Introduction to social networks: types of social networks, the targeted users, social impacts, privacy and trust issues.
  2. Overview of social media analysis framework; information gathering, storage and analysis; applications; truth and reliability.
  3. The social media contents: text, social text, images, videos, location check-ins; their crawling, feature extraction, analysis, fusion, indexing and search.
  4. Analytics: location, people and organization analytics; trend analysis, user communities, detection of live and emerging events.
  5. Applications: social assistance and recommendations.
  6. Future trends.

Mathematical models of social processes

Lecturer: Olga Proncheva (Keldysh Institute of Applied Mathematics of the Russian Academy of Sciences; Moscow Institute of Physics and Technology (State University))

Olga Proncheva graduated from Moscow Institute of Physics and Technology and The Russian Presidential Academy of National Economy and Public Administration in 2014. Now she is a postgraduate at Keldysh Institute of Applied Mathematics and an assistant at MIPT. Her research interests include mathematical modeling of social processes and information processes.

Duration: 2 hours

Aims and Objectives:
Introduce students to some models of social processes.

Brief Description:
Mathematical modeling is one of the main instruments of sociological research. This tutorial will present some examples of such models. The main attention will be given to models of information attack and information warfare.

List of Topics to be covered:
  1. The biparental model (taking gender structure into account),
  2. The Malthus model and the logistic model,
  3. The Leslie model,
  4. The model of relationship dynamics in a small group,
  5. The model of imitative behavior,
  6. The model of information attacks,
  7. The model of information warfare.
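As a small illustration of item 2 above (my own sketch of the standard discrete-time formulations, not the lecturer's material), the Malthus and logistic models can be simulated in a few lines: Malthusian growth x(t+1) = x(t) + r*x(t) is unbounded, while the logistic correction x(t+1) = x(t) + r*x(t)*(1 - x(t)/K) saturates at the carrying capacity K.

```python
# Discrete-time simulation of Malthusian (unbounded) and logistic
# (saturating) population growth.

def simulate(x0, r, steps, capacity=None):
    """Return the trajectory [x(0), ..., x(steps)]."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        # Malthus if no capacity is given; otherwise logistic damping.
        growth = r * x if capacity is None else r * x * (1 - x / capacity)
        xs.append(x + growth)
    return xs

malthus = simulate(10.0, 0.1, 50)
logistic = simulate(10.0, 0.1, 50, capacity=100.0)
# The Malthusian trajectory grows without bound; the logistic one
# levels off just below the capacity K = 100.
```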

Plagiarism detection

Lecturer: AntiPlagiat

List of Topics to be covered:
  1. Intrinsic plagiarism.
    Lecturer: Alexey Romanov, Daria Beresneva
  2. Extrinsic plagiarism.
    Lecturer: Alberto Barron Cedeno
  3. How to detect deception through stylometric analyses
    Lecturer: Tommaso Fornaciari
  4. Machine-generated documents and translated plagiarism.
    Lecturer: Rita Kuznetsova
  5. Systems for revealing plagiarism.
    Lecturer: Andrey Ivahnenko, Alexey Romanov
  1. Intrinsic plagiarism detection: methods and approaches

    Lecturer: Alexey Romanov Antiplagiat Research

    Lecturer: Daria Beresneva Antiplagiat Research

    Intrinsic plagiarism detection is the problem of finding reused text when no reference corpus is given. As a result, you cannot compare text from the document being checked with other texts to find matches. In this talk, we will present current machine learning methods for intrinsic plagiarism detection, and discuss their performance and results to date.

  2. Detection of Text Re-Use and Plagiarism: External Approach

    Lecturer: Alberto Barron Cedeno Qatar Computing Research Institute

    Alberto Barrón-Cedeño holds a PhD in Computer Science from the Universitat Politècnica de València. He is currently a Scientist at Qatar Computing Research Institute, HBKU. His current research is focused on the design of Natural Language Processing models to enhance community question answering fora, the analysis and exploitation of multilingual text resources, and the analysis of text re-use. He has 45+ publications in international journals and conferences on Natural Language Processing and Information Retrieval.

    The best evidence to support a case of text re-use (plagiarism, if no proper citation is provided) is to show a chunk of text together with the source it was borrowed from. Given a suspicious document, external plagiarism detection consists precisely of retrieving and spotting potential cases of re-used text together with their claimed sources.

    This talk kicks off with some basic information retrieval and natural language processing concepts. Later on, the most representative models for external plagiarism detection are discussed. PAN, probably the most important initiative in research on plagiarism detection, is then overviewed. Finally, directions for starting to work on this topic are provided and proposals are made towards pushing the state of the art in the field.

  3. How to detect deception through stylometric analyses

    Lecturer: Tommaso Fornaciari Italian National Police

    The tutorial will follow, step by step, a research activity in the field of deception detection. The dataset employed is DeCour - DEception in COURts - a corpus constituted by transcripts of hearings held in four Italian courts. The case which will be shown is a typical example of text classification carried out through stylometric techniques. Here the task is to train models to distinguish false from truthful statements; however, the methodological approach is quite similar to those applied in computational linguistics to other forensic tasks, such as author profiling, author attribution and plagiarism analysis.
    The process will be examined from the data collection, through the preprocessing and the feature selection. In the end, the data analysis and the results will be discussed, taking into account possible perspectives for future research.

  4. Computer-generated text detection

    Lecturer: Rita Kuznetsova Antiplagiat Research

    Researchers have made computers smart enough to generate plausible-looking texts and even scientific papers. Now the problem is that these generated papers get published or defended as parts of diplomas or theses. In this lecture we will discuss what these texts are, how they are created, and contemporary approaches to detecting them. Several methods will be covered in detail, with results presented.

  5. Systems for revealing plagiarism

    Lecturer: Andrey Ivahnenko Antiplagiat Research

    Lecturer: Alexey Romanov Antiplagiat Research

    In this lecture we will discuss existing systems and tools that detect text reuse and machine-generated texts. An overall introduction will be given to the Antiplagiat software as a state-of-the-art exact extrinsic plagiarism engine. We will show how it can be applied to detect text reuse in separate documents as well as in a whole text corpus, a feature used for deep reuse analysis of texts. Several tools for producing and detecting machine-generated scientific papers will be discussed, including SciGen and SciDetect, and other tools that use methods discussed earlier in the tutorial.
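As a minimal illustration of the external approach described in talk 2 above (a common baseline of my own sketching, not the Antiplagiat engine): documents are represented as sets of word n-grams, and a high set overlap between a suspicious document and a candidate source flags potential re-use.

```python
# External text re-use detection via word n-gram fingerprint overlap,
# scored with the Jaccard coefficient.

def ngrams(text, n=3):
    """Return the set of word n-grams of the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

suspicious = "the quick brown fox jumps over the lazy dog"
source = "he saw the quick brown fox jumps over a fence"
unrelated = "completely different text about click models"

score_src = jaccard(ngrams(suspicious), ngrams(source))
score_unrel = jaccard(ngrams(suspicious), ngrams(unrelated))
# score_src > score_unrel: the source shares the re-used chunk
# "the quick brown fox jumps over" with the suspicious document.
```

Production systems add an efficient retrieval step (inverted indexes over fingerprints) so that every document does not have to be compared with every candidate source.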

Inductive modeling


Lecturer: Volodymyr Stepashko (head of the Department for Information Technologies of Inductive Modeling at the International Research and Training Center (IRTC) for Information Technologies and Systems of the National Academy of Sciences of Ukraine, Kyiv, Professor, Dr.Sc.)

Prof. Stepashko received his M.Sc. degree from Lviv State University in ‘Radio-physics’ (1970). He received his Cand.Sc. (PhD) degree in ‘Technical Cybernetics and Information Theory’ (1976) and his Dr.Sc. degree in ‘System Analysis and Automatic Control’ (1994) at the National Academy of Sciences of Ukraine (NASU). Since 1998 he has headed the department for information technologies of inductive modeling at the IRTC of NASU and has been a professor at the National Technical University of Ukraine ‘Kyiv Polytechnic Institute’. Prof. Stepashko is the organizer and chairman of annual conferences, workshops and summer schools on inductive modeling. He is the editor-in-chief of the journal ‘Inductive Modeling of Complex Systems’. His areas of interest are: mathematical modeling, inductive modeling, the Group Method of Data Handling, computational intelligence, machine learning, data/knowledge mining, and information technologies.


Lecturer: Vadim Strijov (leading researcher at the Computing Center of the Russian Academy of Sciences; associate professor at the Moscow Institute of Physics and Technology)

Vadim Strijov received his Ph.D. degree in Physics and Mathematics (2002) and his Dr.Sc. degree in Theoretical Foundations of Computer Science (2014) at the Computing Centre of the Russian Academy of Sciences. Dr. Strijov is an associate professor at the Moscow Institute of Physics and Technology, the editor-in-chief of the Journal of Machine Learning and Data Analysis, and one of the administrators of a wiki resource. His areas of interest are: machine learning, data analysis, model selection, structure learning, and preference learning.


Lecturer: Mikhail Alexandrov (associate professor of the Russian Presidential Academy of national economy and public administration, invited researcher of the fLexSem research group at the Autonomous University of Barcelona)

Mikhail Alexandrov received his M.Sc. degree in Radiotechnics at the Moscow State Aviation Institute (1973), his M.Sc. degree in Applied Mathematics at Moscow State Lomonosov University (1977) and his Ph.D. degree in Physics and Mathematics (1982) at the Moscow State Geologo-Prospecting Institute and the Problem Control Institute of the Russian Academy of Sciences. Since 2009 he has worked at the department of system analysis and informatics of the Russian Presidential Academy of National Economy and Public Administration. His areas of interest are: mathematical modeling, data mining, and text mining.


Lecturer: Oleksiy Koshulko (senior researcher at Glushkov Institute of Cybernetics of National Academy of Sciences of Ukraine; co-founder at GMDH LLC)

Oleksiy Koshulko received his M.Sc. degree in Computer Sciences (2003) at the Faculty of Cybernetics of Taras Shevchenko Kyiv National University. He received his Ph.D. in Mathematical Modelling and Numerical Methods (2007) at the Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine. During his Ph.D. studies, Oleksiy Koshulko was granted the Presidential Award for Most Talented Young Scientists. Dr. Koshulko is a co-founder of GMDH LLC, which develops a forecasting toolbox based on algorithms of the Group Method of Data Handling. His areas of interest include inductive modeling and high-performance computing.

Duration: 4 hours

Aims and Objectives:
Inductive modeling (IM) is one of the most promising directions of machine learning. The principal approach of inductive modeling is the generation and selection of optimal models from a given class (set) to describe experimental data when we have (almost) no a priori information on model structure. In our mini-course we intend: to acquaint young conference participants with the basic algorithms of IM; to demonstrate several applications of IM techniques, primarily to natural language processing; and to give initial practical skills for working with the IM toolbox GMDH Shell.

Brief Description:
The course consists of 2 lectures and 1 practice work.
  1. Machine learning model generation and selection (Strijov V., lecture = 2 hours).
    To extract knowledge from big data, we have to investigate a large set of various machine learning models with the intention of finding an adequate one. So we inductively generate this set of competitive models automatically and select an optimal one according to some criteria. The talk is devoted to model generation techniques. Application problems of document analysis, time series and image classification illustrate the topic.
  2. Group Method of Data Handling (GMDH) and its applications in Natural Language Processing (Stepashko V., Alexandrov M., Koshulko O.; lecture = 1 hour)
    GMDH is one of the IM techniques, originated by the Ukrainian academician Oleksiy Ivakhnenko in the 1960s and still actively developed. GMDH provides automatic self-organizing construction of noise-immune models of optimal complexity from short samples of noisy experimental/statistical data. In the lecture we give a brief description of four typical GMDH algorithms and demonstrate their application to some problems of NLP (vocabularies, sentiment analysis, text-based Forex forecasting, etc.).
  3. GMDH Shell: a platform for GMDH applications (Koshulko O., 1-hour training)
    We prepared our training for students having experience in solving regression, forecasting and classification problems. Attendees will learn the two most popular GMDH algorithms: the combinatorial algorithm and GMDH artificial neural networks. We will demonstrate examples of automated multi-parametric time series forecasting, regression and classification. During the hands-on part, we encourage attendees to have a laptop with the latest trial version of GMDH Shell for Data Science installed.
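To give a flavour of the combinatorial GMDH idea described above, here is a minimal sketch (an illustration of the principle only, not GMDH Shell code): every candidate model, here simply y = a + b*x for each input feature, is fitted on a training split, and the winner is selected by an external criterion: the error on a separate validation split.

```python
# Combinatorial model selection with an external (validation) criterion.

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def val_error(a, b, xs, ys):
    """Sum of squared errors on held-out data: the external criterion."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Two candidate features; only x1 actually explains y = 2 * x1 + 1.
train = {"x1": [1, 2, 3, 4], "x2": [5, 1, 4, 2], "y": [3, 5, 7, 9]}
valid = {"x1": [5, 6], "x2": [3, 7], "y": [11, 13]}

best = min(("x1", "x2"),
           key=lambda f: val_error(*fit_line(train[f], train["y"]),
                                   valid[f], valid["y"]))
# best == "x1": the model built on x1 generalizes to the validation split.
```

Full GMDH enumerates far richer model structures (polynomial terms, feature subsets, network layers), but the core self-organizing principle is the same: generate candidates, judge them on data they were not fitted to.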

Introduction to deep neural networks

Lecturer: Andrey Filchenkov (Head of machine learning research group at Computer Technologies Lab in ITMO University, St. Petersburg)

Andrey Filchenkov is an Associate Professor at the Computer Technology Chair and Head of the machine learning research group at the Computer Technologies Lab in ITMO University, St. Petersburg.
He received his PhD in 2013 for a thesis devoted to structural learning of probabilistic graphical models.
His research topics include meta-learning, feature selection, graphical models and ensemble learning.

Lecturer: Evgeny Putin (Researcher at Computer Technologies Lab in ITMO University, St. Petersburg)

Evgeny Putin is a PhD student at the Computer Technology Chair and a Researcher at the Computer Technologies Lab in ITMO University, St. Petersburg.
He graduated from the Faculty of Mathematics and Mechanics of St. Petersburg State University in 2014.
He is interested in deep neural networks and their application to various data mining problems.

Aims and Objectives:
Learn the basics of neural networks.
Learn state-of-the-art deep learning models.
Understand how all this deep learning magic happens and what the requirements and restrictions for applying deep networks are.
Get practice in applying neural networks with Theano.

Brief Description:
Three lectures, ranging from very basic concepts such as gradient descent and one-layer neural networks to state-of-the-art deep network models, are suited both for people who have never looked inside artificial neural networks and know nothing about them, and for those who are familiar with the basic theory but want to learn more about the modern, widely discussed paradigm of deep learning.
The first lecture is an introduction to neural networks, so it is not obligatory if you are familiar with them (but it may still be useful to revisit the fundamentals). The second and third lectures describe modern deep-learning network models and their applications.
The lectures will be accompanied by coding sessions in Python 3 with the Theano library. For these sessions, we will provide code examples and data samples. The participants need to have a laptop with Python 3 and the Theano library installed.

List of Topics to be covered:
  1. Gradient descent
  2. Linear perceptron
  3. Multi-layer neural network
  4. Deep architecture
  5. Convolutional neural network
  6. Recurrent neural network
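As a preview of topics 1 and 2 above (my own plain-Python illustration; the tutorial sessions use Theano), a linear model can be trained by gradient descent on the squared loss for a toy 1-D regression problem:

```python
# Train a linear model y = w * x by stochastic gradient descent
# on the squared loss (true relation: y = 3 * x).

def train(data, lr=0.05, epochs=200):
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = w * x
            # Gradient of (pred - y)^2 / 2 with respect to w is (pred - y) * x.
            w -= lr * (pred - y) * x
    return w

data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
w = train(data)
# w converges close to the true slope 3.0.
```

Deep networks stack many such trainable units with nonlinearities in between; frameworks like Theano compute the gradients automatically instead of by hand as in the comment above.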

Crowdsourcing technologies

Altruistic Crowd Work
There are two primary approaches to crowdsourcing. The first is altruism-driven, where people contribute voluntarily, as in the cases of Wikipedia or OpenStreetMap. The second is pay-driven, where workers receive micropayments for their activity on platforms such as Amazon MTurk or Yandex.Toloka. In Russia, for several reasons, including legal ones, both researchers and practitioners still have to invite volunteers, even for tasks that are usually paid in other countries.

At the panel discussion we will consider the means and opportunities for motivating workers and rewarding them for their help.