Comparative Analysis of Collaborative Filtering on GraphLab, MLlib and Mahout |
Author : Abdul Samad, Dr. Saif-ur-Rahman |
Abstract :Recommendation systems recommend items or products to users based on their previous purchases, visits, interests, ratings, wish-lists or reviews, in order to develop interest and to display accurate and suitable items. Recommendation systems are used in various online shops (e-commerce applications) and decision-making systems. Recommendation is a particular form of information filtering and falls under data mining and machine learning. Collaborative filtering is the key technique used in such systems. This study discusses the data loading, model generation, recommendation implementation and accuracy of the same algorithm on several major tools and libraries (GraphLab, Mahout-Hadoop, Mahout-Spark and MLlib). For this purpose, a well-known collaborative filtering algorithm, Alternating Least Squares (ALS), was used. The Netflix Prize (training) data set was used with the listed tools and libraries. At the end of this research, a factual comparative analysis of the tools was carried out. |
|
Extracting patterns from Global Terrorist Dataset (GTD) Using Co-Clustering approach |
Author : Muhammad Adnan, Muhammad Rafi |
Abstract :The Global Terrorist Dataset (GTD) is a vast collection of terrorist activities reported around the globe. The database incorporates more than 27,000 terrorism incidents from 1968 to 2014. Every record has spatial data, a time stamp, and several other fields (e.g. tactics, weapon types, targets and injuries). Few earlier studies have attempted to find interesting patterns in this large body of textual data. The author believes that the GTD still hides numerous interesting patterns and that the full potential of this resource is yet to be revealed. In this independent study, the author investigates the GTD using the co-clustering method for pattern discovery. The author extracted textual data from the GTD in order to cluster the data in space and time simultaneously through co-clustering. Co-clustering has become an important and powerful tool for data mining. With co-clustering, bilateral data can be analysed by describing the connections between two different entities. Many real-world applications can benefit extensively from this approach, such as market basket analysis and recommendation systems. In this study, the effectiveness of the co-clustering model is described through experiments on a database of global terrorist events. |
|
Probabilistic Vs Soft Computing for Classifying Credit Card Transactions. A Case Study of Pakistanis Credit Card Data |
Author : Amjad Ali, Muhammad Rafi |
Abstract :Credit cards are now widely used by consumers to purchase goods and services, owing to the widespread use of the internet and the consequent growth of e-commerce over the past few decades. This increased use of credit cards has raised the associated risks, such as fraudulent use that can cause financial loss to card holders as well as to financial institutions. It is an ethical issue and has legal implications in various countries, where laws and regulations force financial institutions and credit card companies to employ various techniques to detect and prevent credit card fraud. Although changes in technological systems also change the nature of fraud, data mining techniques such as classification, regression and clustering are very useful and are widely used to prevent and detect fraud associated with credit cards. Credit card fraud prevention and detection is a type of classification problem, for new customers as well as for existing customers. Multiple data mining techniques can be employed for the classification of customers, and each has its own pros and cons. This study compares four classification techniques, namely Naïve Bayes, Bayesian Network, Artificial Neural Network and Artificial Immune Systems, for classifying credit card transactions on a dataset obtained from a commercial bank in Pakistan. The major contribution of this study is the use of real data, on which extensive experiments have been performed and the results analysed to identify the best technique. |
|
Graph Visualization Tools: A Comparative Analysis |
Author : Fariha Majeed, Dr. Saif-ur-Rahman |
Abstract :Data visualization is becoming a necessity for big organizations as social networking data grows rapidly. It is becoming difficult to visualize data and perform complex comparisons. Large databases exist to store huge volumes of data, but studying the behavior of that data is becoming time consuming and sometimes impossible. One can analyze small sets of relations, but as the number of relationships grows, it becomes difficult to make decisions. Data visualization tools are used to overcome this issue; however, the algorithms used to perform the analysis require high-performance processors, otherwise the data size degrades performance. This research provides a comparative study of popular visualization tools that could be used in the analysis of large datasets. The comparison comprises statistics on their common features, identified on the basis of market research and a literature survey. |
|
Analysis of SSD Utilization by Graph Processing Systems |
Author : Haider Qutbuddin, Dr. Saif-ur-Rahman |
Abstract :Graph processing systems are highly productive when it comes to graph data. A data-parallel approach, however, cannot exploit the common characteristics of a graph computation workload. To address these challenges, distributed graph processing frameworks were introduced, which inherit the properties of both graph-parallel and data-parallel systems. The standard operators used by data-parallel systems are filter, join, reduce, etc., while graph-parallel systems introduced operators such as sub-graph, mrTriplets, etc. In comparison with the graph framework operators, the standard relational operators were too slow. Traditionally, all the frameworks and their benchmarks were executed on hard disk drives, but storage technology has evolved, which has led to the use of solid state drives (SSDs). Solid state drives are known for their speed, as they retrieve and populate data using pulses. This paper presents an analysis of SSD utilization by graph processing systems. It also discusses the pros and cons faced by the graph processing frameworks and how the issue of wear leveling can be resolved by using TRIM support. |
|
Performance Analysis of Table Driven and Event Driven Protocols for Voice and Video Services in MANET |
Author : Haque Nawaz, Dr. Hasnain Mansoor Ali |
Abstract :This research paper presents a performance analysis of table-driven and event-driven routing protocols using voice and video traffic in a mobile ad hoc network (MANET). In particular, the OLSR (table-driven) and DSR (event-driven) protocols are considered. The nodes of a MANET establish connections with each other dynamically and can move freely in any direction. In the mobile ad hoc network environment, event-driven and table-driven protocols are a significant subject of study. Mobility affects service performance because links between mobile nodes break and are renewed. Protocol performance has a significant effect on the overall performance of a MANET. The aim of this study is to present a performance analysis of the selected routing protocols by varying the node densities and WLAN physical characteristics. The voice and video traffic applications are configured separately using OLSR and DSR in the scenarios. The parameters observed for performance are jitter, traffic received, traffic sent, end-to-end delay, traffic load, and throughput. The simulations have been carried out with the OPNET 14.5 modeller tool and the results have been analysed. |
|
An Investigation on Topic Maps Based Document Classification with Unbalance Classes |
Author : Maher Baloch, Muhammad Rafi |
Abstract :Classification of imbalanced data has become a widespread problem because most real-world datasets are imbalanced. In a classification task, one of the challenges is to learn the feature space of the classification under a class-imbalance setting. The majority classes generally have a good representation of features in the learned classification function, while the minority classes lack this representation; consequently, classification for these classes fails more often. In this paper, the authors investigate the task of document classification with a topic-map-based representation of documents under a class-imbalance setting. In order to evaluate the topic-map-based representation for classification under imbalanced data, the authors compare three representations (Bag-of-Words, Phrases and Topic terms) for three approaches: (i) under-sampling, (ii) cost-adjusting, and (iii) cluster-based sampling. A series of experiments is carried out and the results are reported. |
|
Standard Framework for Comparison of Graph Partitioning Techniques |
Author : Mudasser Iqbal, Dr. Saif-ur-Rahman |
Abstract :Graph partitioning is used to distribute graph partitions across nodes for processing. It is a very important pre-processing step for distributed graph processing. In mathematics and computer science, different distributed graph processing solutions use different partitioning approaches. This research identifies the issues associated with the different graph partitioning approaches. The paper compares different graph partitioning solutions (GraphLab, ParMetis, PT-Scotch) by applying them to different real-world datasets and measures the I/O and partitioning variation between them using different techniques. It describes the procedure for configuring GraphLab on Ubuntu and applying partitioning and PageRank techniques on it. Pmetis and Kmetis are two graph partitioning algorithms used in ParMetis. These algorithms were applied to the same graph for different numbers of partitions to obtain an I/O and partitioning comparison between Pmetis and Kmetis. Different vertex-cut strategies are also discussed. Finally, the behavior of PowerGraph and PT-Scotch was explored on very large datasets. |
|
A Semi-supervised approach to Document Clustering with Sequence Constraints |
Author : Murtaza Munawar Fazal, Muhammad Rafi |
Abstract :Document clustering is usually performed as an unsupervised task. It attempts to separate different groups of documents (clusters) from a document collection by implicitly identifying the common patterns present in these documents. Semi-supervised approaches to this problem have recently reported promising results. In a semi-supervised approach, explicit background knowledge (for example, Must-link or Cannot-link information for a pair of documents) is used in the form of constraints to drive the clustering process in the right direction. In this paper, a semi-supervised approach to document clustering is proposed. The paper makes three main contributions: (i) a document is first transformed into a graph representation based on the Graph-of-Word approach, and word sequences of size 3 are extracted from this graph and used as features for the semi-supervised clustering; (ii) a similarity function based on common word sequences is proposed; and (iii) a constraint-based algorithm is designed to perform the actual clustering process through active learning. The proposed algorithm is implemented and extensively tested on three standard text mining datasets. The method clearly outperforms recently proposed algorithms for document clustering in terms of standard evaluation measures for the document clustering task. |
|
Enhancing Data Quality using Human Computation and Crowd Sourcing |
Author : Vikram Kumar Kirpalani, Muhammad Ejaz Tayab |
Abstract :This paper addresses the issues present in the data dumps available at DBpedia by using the concept of associations, i.e. a concept hierarchy, to enhance the quality of those data dumps. The data dumps are extracted from Wikipedia, and the issues that prevail in them are caused either by the data extraction frameworks or by human error during crowd-sourcing efforts on Wikipedia. Using human computation techniques and crowd sourcing together with query morphing makes it easier to explore this subject in greater depth. One of the key issues with the datasets is the presence of multiple values in a single attribute and vice versa, especially in the “Place of Birth” field of important personalities. This paper describes the implementation process used to solve these issues and includes a survey conducted on crowd sourcing to highlight its impact. |
|
Urdu Optical Character Recognition Technique for Jameel Noori Nastaleeq Script |
Author : Engr. Reema Qaiser Khan, Engr. Wafa Qaiser Khan |
Abstract :Urdu OCRs have been an object of interest for many developers in recent years. Active research is being done on Urdu OCRs, but because of the complexity associated with Urdu fonts, the technology remains imperfect, which has kept it from reaching practical use. The main objective was to create a technique that could be applied to any of the existing Urdu fonts/scripts. In this paper, the authors have developed a technique that is capable of extracting text in the Urdu font Jameel Noori Nastaleeq from images and converting it into editable Unicode text. The approach comprises pre-processing techniques, connected-component labelling, feature extraction, and image comparison. The identified objects are saved as templates, which are then compared against the white pixel position length database created by the authors in order to identify the templates, which are then converted into Unicode. |
|