Socia: Linked Open Data of Context behind Local Concerns for Supporting Public Participation

To address public concerns that threaten the sustainability of local societies, it is essential to support public participation by sharing the background context behind these concerns. We designed the SOCIA ontology, a linked data model for sharing the context behind local concerns, with two approaches: (1) structuring Web news articles and microblogs about local concerns on the basis of the geographical regions and events referred to by the content, and (2) structuring public issues and their solutions as public goals. We also built the SOCIA dataset, a linked open dataset based on the SOCIA ontology. Web news articles and microblogs related to local concerns were semi-automatically gathered and structured. Public issues and goals were manually extracted from Web content related to revitalization from the Great East Japan Earthquake. Towards more accurate extraction of public concerns, we investigated feature expressions for extracting public concerns from microblogs written in Japanese. To address sample selection bias in our microblog corpus, we formulated a metric for mining feature expressions, i.e., bias-penalized information gain (BPIG). Furthermore, we developed a prototype of a public debate support system that utilizes the SOCIA dataset, and formulated the similarity between public goals for a goal matching service to facilitate collaboration.


I. INTRODUCTION
Japanese regional societies currently face complicated and ongoing social issues or concerns that threaten their sustainability, e.g., dwindling birth rates, an aging population, public finance problems, disaster risks, dilapidated infrastructure, and radiation pollution. The coverage of government services is expected to decrease as these concerns escalate. Some Japanese researchers regard such troubling situations as "a front-runner of emerging issues" [1]. To address these concerns, it is essential to support public participation by sharing the background context behind them.
We have aimed to develop a Web platform to support public participation, which provides a function for sharing background context behind local concerns [2], [3], [4]. Since citizens who have beneficial awareness or knowledge are not always experts on the relevant social concerns, background context needs to be shared to reduce barriers to public participation. It is difficult to participate in addressing concerns without background context. Linked open data (LOD) [5], which are semantically connected data based on uniform resource identifiers (URIs) and the resource description framework (RDF), play an important role in fostering open government [6]. To increase transparency and participation in regional communities, it is important for citizens, government officials, and experts to share public concerns. Background context should be structured and open to facilitate the assessment and sharing of public concerns. The LOD framework is suitable for structuring such background contexts and concerns. The structure of public concerns is an important context when building consensus. We call the process of structuring public concerns "concern assessment".
We designed a linked data model and built an LOD dataset, called Social Opinions and Concerns for Ideal Argumentation (SOCIA), to share the context behind local concerns. The data model, the SOCIA ontology, was designed with two approaches. The first structures Web news articles and microblogs about local concerns on the basis of the geographical regions and events referred to by the content. The second structures public issues and their solutions as public goals. We also built the SOCIA dataset, a linked open dataset based on the SOCIA ontology. Japanese local news articles, microblog posts, and minutes of city council meetings are semi-automatically structured on the basis of geographical regions and events. The SOCIA dataset also includes public issues and goals that were manually extracted from news articles. Furthermore, we conducted a preliminary investigation of feature expressions for extracting public concerns from microblogs written in Japanese. The feature expressions were mined from a corpus consisting of microblogs about public concerns (positive examples) and microblogs irrelevant to public concerns (negative examples). We addressed a technical issue of sample selection bias in the positive examples, i.e., some unsuitable feature expressions were frequently used by only one specific person.
The rest of the paper is organized as follows. Section II presents conventional works related to e-Participation. The SOCIA ontology is described in Section III. Section IV describes the SOCIA dataset built by semi-automatically structuring Web content related to local concerns and manually structuring public issues and goals extracted from Web content. Section V explains how Japanese feature expressions for extracting public concerns from microblogs were mined with a corpus-based approach. Section VI describes applications of the SOCIA dataset and Section VII concludes the paper.

A. Public Participation and Open Data
The International Association for Public Participation (IAP2) and the Obama administration's Open Government Initiative (OGI) have presented similar stages for public participation, i.e., the Spectrum of Public Participation [7] and the Principles of Open Government [8], shown in Figure 1.

B. Modeling Public Debate and Participation
Providing background information related to public debate is important in order to support concern assessment. In view of this, argument visualization is an effective approach for supporting eParticipation [9]. Jeong et al. visualized the difference in cognition for several topics among participants in public debates using the co-occurrence of terms [10]. Visualizing an overview of public debate is also effective for grasping the background. Several argument visualization tools currently exist [11]: Compendium [12], Cohere [13], MIT Deliberatorium [14], Araucaria [15], Discourse Semantic Authoring [16], [17], etc. Typically, these tools produce "box and arrow" diagrams in which premises and conclusions are formulated as statements [18].
Within the context of LOD and the semantic Web, the Talk of Europe project proposed a linked data model to structure

III. DESIGNING SOCIA ONTOLOGY
This section describes the design of the SOCIA ontology to structure Web news articles and microblogs about local concerns on the basis of geographical regions and events that are referred to by content, and to structure public issues and their solutions as public goals.

A. Structuring Web Content about Local Concerns
To design a data model for sharing background context behind local concerns, we consider applications of the dataset. O2, an abbreviation for Open Opinion, is our Web platform for citizen participation in debates about regional issues. As shown in Fig. 2, the O2 platform has three stages. In stage (1), the mining and pre-processing system crawls the Web and gathers information from news articles, microblogs, and meeting minutes that can be used for debates. In stage (2), the system geographically classifies the gathered contents and clusters them by event. Relevant information is then structured and stored in the SOCIA dataset in accordance with the SOCIA ontology as openly published Linked Open Data. In stage (3), the structured information is used for public participation, i.e., debate support, concern assessment, etc.
Fig. 3: Cycle of utilizing regional information for e-Participation
The cycle of utilizing regional information in SOCIA for eParticipation is illustrated in Fig. 3. To help citizens understand public concerns and express their opinions, background information needs to be provided because most citizens are not experts on the diverse range of public concerns. The opinions expressed can also be utilized as background information after being structured in the SOCIA dataset. For Web contents (e.g., news articles, blogs, and tweets) to be used as background information, they need to be classified by region and then presented to citizens in an understandable way. Our platform and ontology can be used to structure the URLs of Web contents and then link them with regional issues.
The SOCIA dataset is openly published on the Web using the SOCIA ontology, which is designed using the Web Ontology Language (OWL) as shown in Fig. 4. Through this process, text mined from the Web for eParticipation is structured in the form of events by region, which are then used as discussion seeds to further build the SOCIA dataset. Citizens then create discussion topics out of each seed, e.g., a cluster of news articles related to the same event, and input their opinions by using the system, among other functionalities.
To improve the structuring accuracy, the history of how the LOD properties were annotated (e.g., which algorithm, which parameter, by whom) is needed because the automatic structuring by Sophia has an inherent error of a few percent. To maintain the annotation history, we defined the AnnotationInfo class, as shown in Fig. 5. Such meta-context information is also necessary when the dataset is used as a corpus for research on natural language processing.
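The kind of annotation history described above can be sketched as a small record type. This is only an illustration: the field names (`algorithm`, `parameters`, `annotator`) and the example identifiers are hypothetical and are not taken from the AnnotationInfo class in Fig. 5.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AnnotationRecord:
    """Hypothetical provenance record for one annotated triple."""
    subject: str
    predicate: str
    obj: str
    algorithm: Optional[str] = None          # which algorithm produced the link
    parameters: Dict[str, float] = field(default_factory=dict)  # which parameters
    annotator: Optional[str] = None          # by whom (for manual annotation)

    def is_automatic(self) -> bool:
        # An annotation counts as automatic when an algorithm is recorded
        # and no human annotator is recorded.
        return self.algorithm is not None and self.annotator is None


# Example identifiers below are placeholders, not real SOCIA resources.
auto = AnnotationRecord("ex:article1", "ex:relatedRegion", "ex:Aichi",
                        algorithm="TWCNB", parameters={"threshold": 3.0})
manual = AnnotationRecord("ex:article2", "ex:relatedRegion", "ex:Gifu",
                          annotator="annotator-1")
```

Keeping the algorithm and its parameters next to each link makes it possible to re-run or filter automatically produced annotations later.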

B. Structuring Public Issues and Goals
Public collaboration and consensus building between stakeholders are essential to enable revitalization from disasters, e.g., the Great East Japan Earthquake. Collaboration between multiple agents generally requires the following conditions:
• Similarity of the agents' goals or objectives
• Complementarity of the agents' skills, abilities, or resources
As the first step, this study focuses on the similarity of the goals. Sharing a dataset of public goals can help citizens who have similar goals build consensus and collaborate with one another.
We focus on the following three problems related to public collaboration.
1) Citizens cannot easily find others whose goals are similar to their own.
2) Stakeholders who have similar goals occasionally conflict with one another when building consensus, because subgoals are sometimes difficult to agree on even if the final goal is generally agreed on.
3) A goal that is too abstract and general is hard to contribute to collaboratively.
We presume that the hierarchies of goals and subgoals play important roles in addressing these problems. First, the hierarchical structure can make methods of calculating the similarity between public goals more sophisticated. The hierarchy provides rich context to improve retrieval of similar goals. If the dataset of public goals had only short textual descriptions without hierarchical structures, calculating the similarity between goals would be difficult and the recall in retrieving similar goals would be lower. Second, visualizing the hierarchies is expected to help people in conflict reach compromises. Third, dividing goals into fine-grained subgoals reduces barriers to participation and collaboration because small contributions to fine-grained subgoals are easier to provide.

IV. BUILDING SOCIA DATASET
This section describes semi-automatic structuring of Web content on local concerns and manual structuring of public issues and goals.

A. Gathering Web Content about Local Concerns
The system first collects news articles, microblog posts (in this work, tweets), and minutes of city council meetings from the Web along with the necessary metadata (dates, sources, etc.). It then classifies the crawled Web contents by region and filters out contents unrelated to the interests of regional communities or to current events. Next, the system extracts target events from the news articles and microblogs, and links them using the ontology.
Citizens can then add further links to events, news articles, and microblogs by creating relevant topics, and can debate them by inputting their opinions, polling, or sharing further resources. Those resources and new links are also incorporated in the dataset, as are the opinions and the discussion. This creates a virtuous cycle in which the intelligent platform, by creating understandable and relevant discussion seeds, involves citizens in eParticipation. The citizens add further data to the dataset, making it grow over time, and this data can be used as input again (e.g., for training better learning models and developing better ontologies).

1) Classification by Geographic Region:
After the mining, the gathered news articles and tweets are classified geographically (by the 47 prefectures of Japan). To this end, we use the Transformed Weight-normalized Complement Naive Bayes (TWCNB) algorithm [21]. In the classification, the feature vector of each document consists of the TF-IDF values of morpheme bi-grams. To decide whether contents should be filtered out, we use a confidence threshold, where the confidence value is defined as the difference between the log score of the highest-ranked class and that of the second-ranked class. We conducted a classification experiment with varying confidence thresholds, using 8,811 news articles related to Japanese prefectures crawled from Yahoo! Japan News from Jun. 13 to Jul. 12, 2011, and 1,133 articles not related to any prefecture. The experimental result showed a precision of 98.2% and a recall of 98.0% for the optimal threshold [22], [23].
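The confidence-based filtering described above can be sketched as follows. This is a minimal illustration that assumes the per-class log scores have already been produced by some classifier; the function name, the example scores, and the threshold values are hypothetical and not taken from the experiment.

```python
from typing import Dict, Optional


def classify_with_confidence(log_scores: Dict[str, float],
                             threshold: float) -> Optional[str]:
    """Return the top-ranked class, or None when the content is filtered out.

    The confidence value is the difference between the log score of the
    highest-ranked class and that of the second-ranked class.
    """
    if len(log_scores) < 2:
        raise ValueError("at least two classes are required")
    ranked = sorted(log_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]
    confidence = best_score - second_score
    return best_label if confidence >= threshold else None


# Hypothetical log scores for one article over three prefectures.
scores = {"Aichi": -10.0, "Gifu": -14.5, "Mie": -20.0}
```

With these scores the margin is 4.5, so the article is kept for a threshold of 3.0 and filtered out for a threshold of 5.0.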

2) Clustering by Events:
The SOCIA dataset stores 54,854 news articles, of which about 13,000 are classified as related to a prefecture. Events are extracted as clusters of similar news articles [23]. The similarity between news articles is calculated as a cosine similarity weighted by a window function that takes into account the dates and times at which the articles were published. As shown in Fig. 7, about 35,000 events were extracted through the clustering of these articles.
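A time-weighted cosine similarity of this kind can be sketched as below. The shape of the window function is not specified in this section, so the Gaussian window and its 48-hour width are assumptions made only for illustration, as are the function names.

```python
import math
from collections import Counter
from datetime import datetime


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gaussian_window(t1: datetime, t2: datetime, width_hours: float = 48.0) -> float:
    """Assumed window: decays with the gap between publication times."""
    gap_hours = abs((t1 - t2).total_seconds()) / 3600.0
    return math.exp(-0.5 * (gap_hours / width_hours) ** 2)


def windowed_similarity(vec1: Counter, time1: datetime,
                        vec2: Counter, time2: datetime,
                        width_hours: float = 48.0) -> float:
    """Cosine similarity weighted by the publication-time window."""
    return cosine(vec1, vec2) * gaussian_window(time1, time2, width_hours)
```

Articles with identical wording published a week apart then score far lower than the same pair published on the same day, which keeps recurring topics from being merged into a single event.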

B. Manual Extraction of Public Goals from Web News Articles
We built an LOD set by manually extracting public goals from news articles and related documents. The 657 public goals and 4,349 RDF triples were manually extracted from 96 news articles and two related documents by one human annotator. The most abstract goal, which is the root node of the goal-subgoal hierarchy, is "revitalization from the earthquake".⁷ The subgoals are linked from this goal with the socia:subgoal property.
Fig. 8: Instance of public goal: "Developing new package tour product"
Fig. 9: Processing flow for mining features to extract public concerns
The manually built LOD set can be used for developing a method of calculating the similarities between public goals. It can also be used as example seed data when citizen users input their own goals for revitalization. Fig. 8 shows an instance of a public goal to revitalize the Tohoku region from the Great East Japan Earthquake. This goal, "developing a new package tour product", has a title in Japanese, a description in Japanese, and two subgoal data resources. This dataset about public goals for revitalization won the 2nd Prize of the Dataset Track of the Linked Open Data Challenge Japan 2013.⁸
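The goal-subgoal hierarchy can be pictured as a small set of triples. The sketch below uses plain Python tuples rather than an RDF library; only `socia:Goal` and `socia:subgoal` come from the text, while every `ex:` identifier and the intermediate goals are hypothetical placeholders.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical goal resources; the real dataset uses its own URIs.
triples: List[Triple] = [
    ("ex:revitalization", "rdf:type", "socia:Goal"),
    ("ex:revitalization", "socia:subgoal", "ex:tourism"),
    ("ex:tourism", "socia:subgoal", "ex:new-package-tour"),
    ("ex:tourism", "socia:subgoal", "ex:promote-local-food"),
]


def subgoals(goal: str, graph: List[Triple]) -> List[str]:
    """Direct subgoals of `goal`, in the order they appear in the graph."""
    return [o for s, p, o in graph if s == goal and p == "socia:subgoal"]


def descendants(goal: str, graph: List[Triple]) -> List[str]:
    """Depth-first listing of every goal below `goal` in the hierarchy."""
    seen: Set[str] = set()
    order: List[str] = []
    stack = list(reversed(subgoals(goal, graph)))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against a goal reachable by two paths
        seen.add(current)
        order.append(current)
        stack.extend(reversed(subgoals(current, graph)))
    return order
```

Walking the hierarchy this way is what allows a retrieval method to use the subgoals of a goal as additional context.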

V. MINING FEATURE EXPRESSION TO EXTRACT CONCERNS
Automatic extraction of public concerns needs to become more accurate, with a filter for noisy text, to support concern assessment, because consumer-generated Web content (e.g., microblogs) frequently contains noisy information about the target regions. We aimed to construct a binary classifier that separates tweets including public concerns from other tweets. To define the boundary between the positive class $c^{+}$ (corresponding to public concerns) and the negative class $c^{-}$ (corresponding to tweets other than public concerns), we investigate approximate examples collected through hashtag search. Figure 9 shows the processing flow for investigating the approximate examples. First, we manually prepare a list of hashtags that may frequently co-occur with public concerns in Japanese tweets: #政治 (politics), #社会 (society), #環境 (environment), and so on. The tweets collected by searching for these hashtags with Topsy's Otter API are regarded as candidate positive examples. These examples are labeled as class $c_0^{+}$, an approximate positive class. Note, however, that the $c_0^{+}$ examples also include noise tweets that are not suitable for concern assessment. Second, we gather general tweets from the Twitter Streaming API. The ratio of public concerns in this set is much lower than that in the $c_0^{+}$ set. Therefore, these general tweets are regarded as candidate negative examples and labeled as class $c_0^{-}$, an approximate negative class. In this section, we empirically analyze features for classifying tweets into $C_0 = \{c_0^{+}, c_0^{-}\}$ towards building a corpus annotated with $C = \{c^{+}, c^{-}\}$, a more sophisticated concern definition.
⁷ http://data.open-opinion.org/socia/data/Goal/%E9%9C%87%E7%81%BD%E5%BE%A9%E8%88%88 (in Japanese)
⁸ http://lod.sfc.keio.ac.jp/blog/?p=2074 (in Japanese)
Here, we denote the components of a tweet's feature vector by $F_i \in \{f_i^{+}, f_i^{-}\}$, where $f_i^{+}$ denotes a label representing that the feature $f_i$ appears in a tweet, and $f_i^{-}$ denotes a label representing that $f_i$ does not. The significance of a feature $f_i$ for extracting $c_0^{+}$ tweets can be estimated by the information gain $\mathrm{IG}(C_0 \mid F_i) = H(C_0) - H(C_0 \mid F_i)$, where $H$ denotes entropy. The features $f_i$ extracted from $c_0^{+}$ tweets with the information gain, however, are biased because of sample selection bias dependent on the input hashtags. To address the sample selection bias, we formulate bias-penalized information gain (BPIG), which discounts $\mathrm{IG}(C_0 \mid F_i)$ by a penalty for biased occurrence of feature $f_i$. The penalty is based on $\mathrm{IG}(M_k \mid F_i)$ with $M_k \in \{m_k^{+}, m_k^{-}\}$, where $m_k^{+}$ is a label representing that $m_k$, a hashtag or a user, appears in a tweet or is the author of the tweet, $m_k^{-}$ is a label representing that $m_k$ does not, and $\alpha \in [0, 1]$ is the weight of the penalty term. Here, $\max_{k \in K_i} \mathrm{IG}(M_k \mid F_i)$ can be regarded as a penalty for $f_i$ that co-occurs only with a particular hashtag or user $m_k$.
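The idea of penalizing hashtag-dependent or user-dependent features can be sketched in code. The exact BPIG formula is not reproduced in this text, so the sketch assumes the form IG(C_0 | F_i) minus α times the maximum of IG(M_k | F_i), and it assumes that K_i contains the hashtags and users that co-occur with f_i at least once. Both assumptions, and all function names, are illustrative only.

```python
import math
from typing import List, Sequence


def entropy(flags: Sequence[bool]) -> float:
    """Entropy (in bits) of a binary variable given its observed values."""
    n = len(flags)
    if n == 0:
        return 0.0
    p = sum(flags) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)


def information_gain(target: Sequence[bool], feature: Sequence[bool]) -> float:
    """IG(T | F) = H(T) - H(T | F) for two binary variables over the same tweets."""
    n = len(target)
    if n == 0:
        return 0.0
    with_f = [t for t, f in zip(target, feature) if f]
    without_f = [t for t, f in zip(target, feature) if not f]
    conditional = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(target) - conditional


def bpig(labels: Sequence[bool], feature: Sequence[bool],
         metadata: List[Sequence[bool]], alpha: float = 0.5) -> float:
    """Assumed form: IG(C0 | Fi) - alpha * max_k IG(Mk | Fi)."""
    gain = information_gain(labels, feature)
    # Assumed K_i: hashtags or users that co-occur with the feature at least once.
    candidates = [m for m in metadata
                  if any(f and x for f, x in zip(feature, m))]
    penalty = max((information_gain(m, feature) for m in candidates), default=0.0)
    return gain - alpha * penalty
```

In a toy corpus where two features have the same plain information gain, the one used by a single author receives a much larger penalty than the one shared by several authors, so it is ranked lower.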
Table I shows the hashtags used for gathering $c_0^{+}$ tweets from Topsy's Otter API. We specified Japanese as the language of gathered tweets in the query URLs for the API. The temporal distribution of the 32,844 tweets collected as $c_0^{+}$ is shown in Figure 10. The $c_0^{+}$ tweets consist mostly of tweets from the latest months because of the time window of the Topsy search. The $c_0^{-}$ tweets are gathered from the Twitter Streaming API. The ratio of public concerns in $c_0^{-}$ is predicted to be much lower than that in $c_0^{+}$. The temporal distribution of the 149,984 tweets collected as $c_0^{-}$ is shown in Table II. Since we presume that the ratio of $c^{-}$ is greater than that of $c^{+}$, the ratio of $c_0^{-}$ is also set greater than that of $c_0^{+}$. We conducted a feature extraction experiment using these 182,828 tweets consisting of $c_0^{+}$ and $c_0^{-}$. Features representing $c_0^{+}$ and $c_0^{-}$ are extracted with the following procedure:
1) … respectively.
2) As features for $c_0^{+}$, extract high-ranked features
3) As features for $c_0^{-}$, extract high-ranked features $f_i$ such that $\mathrm{PMI}(c_0^{+}, f_i) < 0$.
In this experiment, we regard morpheme N-grams as features of each tweet. Tables III and IV show the results of feature extraction for N = 3, i.e., for morpheme tri-grams. Several preprocessing steps are applied before extracting morpheme N-grams: URL strings and user names (starting with @) in tweets are replaced by "[URL]" and "[USER]", respectively; hashtags in tweets are omitted; and "[B]" and "[E]" are inserted at the beginning and end of a tweet, respectively.
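The preprocessing and N-gram extraction described above can be sketched as follows. Morphological analysis is outside the scope of the sketch, so `ngrams` assumes that a list of morphemes is already available from some analyzer; the regular expressions, the function names, and the count-based `pmi` helper are illustrative assumptions.

```python
import math
import re
from typing import List, Tuple

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@[A-Za-z0-9_]+")
HASHTAG_RE = re.compile(r"#\w+")


def normalize(text: str) -> str:
    """Replace URLs and user names with placeholders and omit hashtags."""
    text = URL_RE.sub("[URL]", text)
    text = USER_RE.sub("[USER]", text)
    text = HASHTAG_RE.sub("", text)
    return " ".join(text.split())


def ngrams(morphemes: List[str], n: int = 3) -> List[Tuple[str, ...]]:
    """Morpheme N-grams with [B] and [E] markers at the tweet boundaries."""
    padded = ["[B]"] + morphemes + ["[E]"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def pmi(n_class_and_feature: int, n_class: int, n_feature: int, n_total: int) -> float:
    """Pointwise mutual information (base 2) estimated from tweet counts."""
    if n_class_and_feature == 0:
        return float("-inf")
    return math.log2(n_class_and_feature * n_total / (n_class * n_feature))
```

The sign of the PMI with the approximate positive class indicates which class a feature is associated with: a positive value points to the positive class and a negative value to the negative class.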
The features for the positive examples, $c_0^{+}$, are shown in Table III. The features extracted by information gain, which are ranked on the left side of the table, are greatly biased by the input hashtags. For example, both "NEWS WEB 24" (the name of a TV news program) and " " (introducing it in our program) are dependent on the hashtag #nhk24. In contrast, the features extracted by BPIG, on the right side of the table, are not specific to a particular hashtag or user. These N-gram features are commonly used for describing public concerns, e.g., expressions for stating facts or questions. Table V shows features $f_i$ that have higher penalties for bias, that is, higher $\max_{k \in K_i} \mathrm{IG}(M_k \mid F_i)$. The result shows that BPIG can appropriately filter out features that co-occur only with a particular hashtag or user.
The features for the negative examples, $c_0^{-}$, are shown in Table IV. Both the $c_0^{+}$ features and the $c_0^{-}$ features are needed for classifying the positive and negative examples. The $c_0^{-}$ features can be used for filtering out the negative examples as noise tweets. Although expressions for greeting or communication are ranked highly by both information gain and BPIG, features with higher $p(c_0^{+} \mid f_i)$, such as "！ ！ [E] " and "！ ！ ！", are ranked lower by BPIG than by information gain.
Morpheme N-grams (N = 2, 3, 4, 5) extracted as features for $c_0^{+}$ can be classified by modality types, as shown in Table VI. According to this analysis, suggestions, questions, and fact statements with some references (quotations) can be extracted as public concerns from Japanese tweets. We suppose that these analyses can be used to define the boundary between positive examples $c^{+}$ and negative examples $c^{-}$ towards drafting an annotation manual and building a concern corpus.
273 | P a g e www.ijacsa.thesai.org

A. Public Debate Using SOCIA Dataset
Citispe@k (pronounced "citi-speak") is a prototype Web application that supports public debate by utilizing the SOCIA dataset. It provides mobility by supporting Web browsers running on smartphones and tablets. The term citispe@k is based on the idea that citizens speak about social issues and current events of the regions in which they live. By using citispe@k, users can discuss and sort out regional issues by referencing news articles, tweets, or other relevant resources on the Web. When users create discussion topics or input opinions into the system, those topics and opinions are also stored in the SOCIA dataset. Figure 7 shows a screenshot of citispe@k. The screenshot has lists of events and related information. Recently updated events are listed on the left of the screenshot. The system initially shows all events. The user can then limit the list to show only events related to a region. When the user selects an event from the list, information about the event is shown on the right side of the screenshot. This information consists of news articles, tweets, and events related to the event. Those resources can be easily shown and visualized in an iFrame without leaving the system. Users can append comments, e.g., ideas, questions, and answers, by selecting specific content provided by citispe@k. A comment can also be posted to Twitter (via the @citispeak account) to further its reach and be stored in SOCIA. Users can create discussion topics related to events, news articles, and tweets. The "View related topics" button (Figure 11) is used to see topics related to the event being viewed. Users can create a new discussion topic about the event by clicking the "Make a new topic" button. The cycle of discussions in citispe@k is that users browse events, get topics related to an event, and add their opinions.
TABLE VI: Modality types of morpheme N-grams extracted as features representing $c_0^{+}$ (excerpted)
Fig. 11: Creating a new discussion topic
Fig. 12: Annotating selected event with tags representing criteria
Citispe@k also has a function supporting concern assessment. The system aims to support the analysis of trends in citizens' awareness, its background information, and anxiety about social issues. For example, a committee for scientific verification of road construction in Aioiyama-Ryokuchi Park in Nagoya City analyzes road construction. A report on their analysis was made on the basis of several criteria: "economic chance", "life, educational or cultural chance", "safety, security", etc. Thus, classifying opinions on the basis of criteria is effective for concern adjustment. Citispe@k provides tags for such criteria. Users can add tags composed of a criterion and a polarity, such as "Environment +" or "Environment -". Citispe@k also provides tags that can be used to express the intention of an utterance, such as "Question", "Idea", and "Refutation". If events or news articles have many such tags, the tags can be used to support the analysis of concerns. Fig. 12 shows an example of tagging an event. We designed the tags by referencing the QOC model [24] and the Deliberatorium [14] to support concern assessment through public debates using citispe@k and the contents in SOCIA.
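Tags composed of a criterion and a polarity lend themselves to simple aggregation. The sketch below shows one way such tags could be counted per criterion; the parsing rule (polarity as the last space-separated token) and the function names are assumptions, not a description of how citispe@k stores tags.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def parse_tag(tag: str) -> Tuple[str, str]:
    """Split a tag such as 'Environment +' into ('Environment', '+')."""
    criterion, _, polarity = tag.strip().rpartition(" ")
    if polarity not in {"+", "-"} or not criterion:
        raise ValueError(f"not a criterion/polarity tag: {tag!r}")
    return criterion.strip(), polarity


def summarize(tags: Iterable[str]) -> Dict[str, Counter]:
    """Count positive and negative tags for each criterion."""
    summary: Dict[str, Counter] = {}
    for tag in tags:
        criterion, polarity = parse_tag(tag)
        summary.setdefault(criterion, Counter())[polarity] += 1
    return summary


# Hypothetical tags attached to one event.
event_tags = ["Environment +", "Environment -", "Environment +", "safety, security +"]
```

A per-criterion tally like this gives a quick view of which criteria attract the most agreement or disagreement for an event.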

B. Goal Matching Service Using SOCIA Dataset
We are planning to develop a Web service that matches citizens and agents who are aiming at similar goals in order to facilitate collaboration. Toward this end, we expanded the SOCIA ontology to describe the public goals in Fig. 6. The property socia:subgoal enables us to describe the hierarchical structure of goals and subgoals. The public goal matching service that we aim to develop requires high-recall retrieval of similar goals to facilitate inter-domain, inter-area, and inter-organizational collaboration.
Pairs of similar goals are connected by the schema:isSimilarTo property. The similarity between public goals can be calculated on the basis of a recursive definition of a bag-of-features vector, where $g$ denotes a public goal, $\mathrm{bof}(g)$ denotes the bag-of-features vector of $g$, and $\mathrm{sub}(g)$ denotes the set of subgoals of $g$. Here, $w \in W$ denotes a term, $z \in Z$ denotes a latent topic derived by a latent topic model [25], and $\mathrm{tfidf}(w, g)$ denotes the TF-IDF, i.e., the product of term frequency and inverse document frequency, of $w$ in the title and description of $g$. $p(z \mid g)$ denotes the probability of $z$ given $g$, with $0 \le \alpha, \beta, \gamma \le 1$ and $\alpha + \beta + \gamma = 1$. The definition incorporates a latent topic model so that short descriptions of goals can be handled, because TF-IDF alone is insufficient for calculating similarities between short texts. The parameters $\alpha$, $\beta$, and $\gamma$ are empirically determined on the basis of actual data.
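A recursive bag-of-features similarity of this kind can be sketched as below. The exact definition is not reproduced in this text, so the sketch assumes that bof(g) is the sum of a normalized TF-IDF part weighted by α, a normalized topic part weighted by β, and the average of the normalized subgoal vectors weighted by γ, and that goals are compared by cosine similarity. The weights, the `Goal` class, and the assumption that the hierarchy is a tree are all illustrative.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

Vector = Dict[str, float]


@dataclass
class Goal:
    tfidf: Vector                     # term -> tfidf(w, g)
    topics: Vector                    # topic id -> p(z | g)
    subgoals: List["Goal"] = field(default_factory=list)


def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v.values()))


def unit(v: Vector) -> Vector:
    n = norm(v)
    return {k: x / n for k, x in v.items()} if n else {}


def add_scaled(acc: Vector, v: Vector, scale: float) -> None:
    for k, x in v.items():
        acc[k] = acc.get(k, 0.0) + scale * x


def bof(g: Goal, alpha: float = 0.4, beta: float = 0.3, gamma: float = 0.3) -> Vector:
    """Assumed recursive bag-of-features vector (the hierarchy must be a tree)."""
    vec: Vector = {}
    add_scaled(vec, unit({f"w:{w}": x for w, x in g.tfidf.items()}), alpha)
    add_scaled(vec, unit({f"z:{z}": p for z, p in g.topics.items()}), beta)
    if g.subgoals:
        share = gamma / len(g.subgoals)
        for sg in g.subgoals:
            add_scaled(vec, unit(bof(sg, alpha, beta, gamma)), share)
    return vec


def similarity(g1: Goal, g2: Goal) -> float:
    """Cosine similarity between the bag-of-features vectors of two goals."""
    a, b = bof(g1), bof(g2)
    na, nb = norm(a), norm(b)
    if not na or not nb:
        return 0.0
    return sum(x * b.get(k, 0.0) for k, x in a.items()) / (na * nb)
```

Under these assumptions, two goals whose titles share no terms or topics can still be retrieved as similar when they share a subgoal, which is the kind of context the hierarchy is expected to provide.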
This prototype method of calculating similarities should be tested, verified, and refined through experiments in future work using the LOD set of public goals that we present.

VII. CONCLUSION
We designed the SOCIA ontology, a linked data model for sharing the context behind local concerns, with two approaches: (1) structuring Web news articles and microblogs about local concerns on the basis of the geographical regions and events referred to by the content, and (2) structuring public issues and their solutions as public goals. We also built the SOCIA dataset, a linked open dataset based on the SOCIA ontology. Web news articles and microblogs related to local concerns were semi-automatically gathered and structured. In total, 54,854 news articles were stored in the SOCIA dataset and automatically linked with prefectures and events. Moreover, 657 public goals were manually extracted from Web content related to revitalization from the Great East Japan Earthquake.
We investigated feature expressions for extracting public concerns from microblogs written in Japanese, towards more accurate extraction of public concerns. To address sample selection bias in our microblog corpus, we formulated a metric for mining feature expressions, i.e., bias-penalized information gain (BPIG). We conducted an experiment on extracting features representing positive and negative examples. The experimental results showed that BPIG is more suitable than conventional information gain for dealing with training data that has hashtag-dependent sample selection bias. Furthermore, we presented applications of the SOCIA dataset, i.e., a public debate support system and a goal matching service. These applications utilize the SOCIA dataset to share the context behind local concerns. We are planning to refine the SOCIA ontology and dataset towards facilitating public collaboration in the real world.

Fig. 1: Expected coverage of Linked Open Data on the spectrum of public participation
The gradation in the figure represents the public impact of each stage. The figure also indicates the expected coverage of the use of LOD. Open data generally contributes to transparency, i.e., to the first stage. However, non-linked open data (e.g., CSV table data) generally lack interoperability. LOD is expected to also contribute to the higher, collaborative stages because semantic links compliant with RDF increase the interoperability of data and help us to reuse data for inter-organizational collaboration. Contextual information provided by the semantic links offers the potential for developing social Web services to facilitate public collaboration. Over 40 countries currently provide open data portals. The number of open data portals has been increasing since 2009. An open data portal by the Japanese government, data.go.jp, was also launched in 2014. One hundred local governments (14 prefectures and 86 municipalities) in Japan also provide their open government data as of Feb. 2015.

Fig. 4: Core classes for structuring regional information in SOCIA ontology

Fig. 6 shows an extension of the SOCIA ontology to represent public issues and goals. The classes socia:Issue and socia:Goal are connected with the socia:solution property. These classes are linked with foaf:Agent corresponding to participants or stakeholders and with

Fig. 7: Distribution of news article counts per event

TABLE II: Temporal distribution of $c_0^{-}$

TABLE III: Morpheme tri-grams extracted as features representing $c_0^{+}$

TABLE IV: Morpheme tri-grams extracted as features representing $c_0^{-}$

TABLE V: N-grams that frequently co-occur only with a specific hashtag or user in $c_0^{+}$ (excerpted)