The goal of information systems is to satisfy users'
information needs. However, these needs are hard to quantify, so we often get to know users through their
interaction with the system. There are several access modes:
- Active seeking. Search. Enter keywords. Issue queries.
- Passive reading. Feeds, news. Stay for a while (e.g. 30s).
- Active browsing. See what is interesting. Keep clicking.
- Transactional. Perform tasks: buy things, send e-mails, comment on blogs, author content, add friends.
Much has been studied over the last 60 years on how to satisfy users' needs.
However, only very recently could part of those needs be met at
large scale. Here we outline some aspects of current information
systems:
- Semantic Web. The W3C initiative is ambitious: to build a representation and connectivity
that machines can confidently process. This is a huge step from the
current Web, which is for humans. Much has been said about the Semantic
Web, but the main problems are how to create ontologies, integrate them,
and process them in the way we want the machines to. I guess this
remains a far-reaching goal that mainly keeps people motivated. Along the
way, other things get improved as a result.
- User interface.
This is where humans meet machines. Its importance cannot be overestimated. Is a list
of results enough? Why have so many attempts at visualisation never caught on?
- Deep analysis of content/context. Perhaps the NLP stuff goes here, as many people expected. But none has proved successful.
- Serve targeted ads.
The main goal of any commercial system is to monetise. Advertising
is what most people aim for, given the great fortune made by Google. So
most machines are constantly collecting data for just one purpose:
serving better ads.
- Mobile.
This is not a trend anymore. It is a fact. The main difficulties are
the small screen and poor navigation (even with touch-screens).
- Localisation. Everyone loves talking about things physically nearby. The same goes for local languages.
- Personalisation.
Much has been tried, but it is not clear when it works. From a standard
usability viewpoint, providing a proven, standardised interface is much better
for mental load.
- Change in user access modes.
People actively seek something new when they are triggered by
something. They turn back to the passive mode when they are tired and just
want something to read, like news.
- Large-scale architecture. Methods that do not scale never catch on on the Web.
- Investment. Oops, after all, we need money to pay for everything.
- Attract and keep talent.
On the engineering side, it is about research, system engineering, user
interface, etc. Everything needs to fit together, at a level
never seen before. We need talent to move beyond the norm.
- Enterprise search.
This is meaty because no single technology fits everything. A lot of
customisation is needed. Often there are no hyperlinks, some content is poorly
organised, and there are many privacy and internal issues. One also has to
deal with many types of documents, insights, leveraging resources (e.g.
documentation, code, expertise), and a variety of platforms, protocols,
legacy systems, databases, XML, free text, jobs, resumes and Web access logs.
- Social networks. People trust people within their circle of reach. Some messages are specific to a pair of people, others are intended for a public audience.
- Query parsing, intention discovery.
Queries are currently the most effective way of expressing an information
need. When we search, we have a specific intention. However, people are
often not good at forming a query that describes their intention. Does a
query mean "give me a set of documents that contain these words", or
"give me some further understanding of these topics"? Trained users with
a clear understanding of Boolean formulation may mean the first. It is
likely that word order has some effect, and the collocation patterns of
these keywords in the document may indicate the level of relevance.
- Ranking.
Ranking is the core of the search art. Users need specific answers.
Even a comprehensive description of some concept may appear in only a few
pages. So it is not about recall but precision.
- Learning to rank. This is currently a hot topic in machine learning. It was popularised by the work of Cohen, Schapire and Singer 10
years ago. Ranking in traditional IR is heuristics-based, and it does
not consistently use feedback to improve the ranking. Given that implicit
preferences can be obtained from search engines, learning to rank holds
good promise. The hard part is that searching for the best ordering is
NP-hard, since it is a search over permutations.
- Hyperlink analysis.
Google has proved that hyperlinks are the most important property of
the Web when it comes to assessing the quality of Web pages. There is much
more that can be explored, e.g. ranking at multiple levels (sites,
clusters, objects, passages) and challenging the stationary assumption
of PageRank.
- Semantics emerge through users' interaction with systems and with other users. Interaction contexts count (time, location, keywords, keyword order, browsing/search history).
- Clustering. Clustering and topic models likely increase recall due to their capacity to group similar words together.
- Insights. People like Ramesh Jain
believe that the next generation of search should provide a comprehensive
picture of the world with respect to what users need, and that insights
are drawn from this picture. Topic models, automatic annotation and visualisation
may play a role here, but it is likely that most people will still need
a very simple form of keyword search.
- Evaluation. No progress will be made if evaluation is not properly carried out. This is an area of research in its own right.
- Common-sense. Common-sense assertions plus rough weighting may be important. For example, we may assert: Color(Roses,Red), HasProperty(Table,Leg,4), Hate(Cat,Dog). Top AI people like Marvin Minsky believe common-sense is what current AI lacks to advance further. The most notable effort is the creation of CYC,
a common-sense database, which was started around 25 years ago and
is now claimed to have reached the level at which commercial applications can be
developed.
- Wiki as a common-sense database expressed in English.
Wiki has become an important source of knowledge for both humans and
machines. Numerous papers aim just at making sense of Wiki
texts. On the commercial side, the company Powerset does just that: it makes use of knowledge from Wiki only. It is now part of Microsoft.
- Information extraction.
Unlike the Semantic Web, which assumes formally structured concepts are
put in place right from the beginning, information extraction starts from
unstructured text to build up concepts. Some believe that this is
perhaps how ontologies, and thus the Semantic Web, will materialise.
However, the main challenge is to build a Web-scale extraction system
that can extract thousands of concepts, making use of common-sense
knowledge and collaborative effort. Current supervised learning,
although it promises high performance, cannot scale to this level yet.
- Vertical search.
Generic search is not enough for specific groups of users. We are
interested in high-quality search systems in various domains: people,
history, inventions, weather, health, music, films, arts, images, wars,
politics, mechanical parts, physics, chemistry, biology, family lines,
schools, cafes, food, alcohol, writing, novels, poetry, IT, books,
events.
- User search experience.
Besides making money, improving the user search experience is perhaps the
most important goal of search engines. Fancy interfaces may look
interesting but simple text boxes are often enough.
- Universal knowledge versus local knowledge.
Roughly, facts described in the (English) Wiki can be considered
universal. These facts, when well collected, can be translated without
loss to local languages (e.g. Vietnamese) by humans. Other types of
local knowledge must be appropriately learned for each language.
- Business intelligence.
This is a popular term. For the well-structured database community,
this is about making sense of the available data to provide
a competitive edge for business. However, the current massive amount of free
unstructured data calls for constant monitoring of brands, opinions and
sentiments about products/services.
- Question answering.
It is sometimes believed that question answering is the next step of
information retrieval. In fact, Google has gradually incorporated some
degree of QA into its system. Shallow QA can exploit the great
redundancy of expressions of facts on the Web to answer questions about
simple facts like "Who", "Where", "What", "When". However, deep QA
should be employed to deal with "How" and "Why" questions, because it is
about reasoning.
- Knowledge representation.
First-order logic was once believed to represent most knowledge
available. However, one has to deal with a whole lot of things other
than pure knowledge: scale, uncertainty/noise, and emerging/shifting
concepts.
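
As a rough illustration of the "common-sense assertions plus rough weighting" idea from the list above, and of how uncertainty might sit next to plain facts, here is a minimal sketch. Everything in it is an assumption made for illustration: the AssertionStore class, its add and query methods, and the weights themselves are hypothetical and are not taken from CYC or any other system.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    predicate: str
    args: tuple
    weight: float  # rough confidence in [0, 1]; the values used below are made up


class AssertionStore:
    """Hypothetical in-memory store of weighted common-sense assertions."""

    def __init__(self):
        self._by_predicate = defaultdict(list)

    def add(self, predicate, *args, weight=1.0):
        self._by_predicate[predicate].append(Assertion(predicate, args, weight))

    def query(self, predicate, *pattern, min_weight=0.0):
        """Return matching assertions, strongest first; None acts as a wildcard."""
        matches = [
            a for a in self._by_predicate.get(predicate, [])
            if len(a.args) == len(pattern)
            and all(p is None or p == x for p, x in zip(pattern, a.args))
            and a.weight >= min_weight
        ]
        return sorted(matches, key=lambda a: a.weight, reverse=True)


store = AssertionStore()
store.add("Color", "Roses", "Red", weight=0.7)      # roses are usually red...
store.add("Color", "Roses", "White", weight=0.2)    # ...but not always
store.add("HasProperty", "Table", "Leg", 4, weight=0.8)
store.add("Hate", "Cat", "Dog", weight=0.6)

# What colours do we believe roses have, most likely first?
rose_colors = store.query("Color", "Roses", None)
for assertion in rose_colors:
    print(assertion.predicate, assertion.args, assertion.weight)
```

The weights let conflicting assertions coexist, which pure first-order logic does not handle gracefully; a real system would also need to cope with scale and shifting concepts, which this sketch ignores.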