Distributed computing is one of the advanced technologies in data processing. It allows users to process data on multiple computers that are physically separated or distributed. One technology that uses the concept of distributed computing is Hadoop, an open-source, Java-based software framework for processing large data sets in a distributed manner. Hadoop provides an application programming framework called MapReduce. Six scenarios were implemented to analyze the speed performance of Hadoop MapReduce. The study found that increasing the number of physical machines from one to two, with a suitable specification design, speeds up the average MapReduce time: on files of 512 MB, 1 GB, 1.5 GB, and 2 GB, the average completion times were 161.34, 328.00, 460.20, and 525.80 seconds. In contrast, increasing the number of virtual machines from one to two slowed the average MapReduce time: on the same file sizes, the average completion times were 164.00, 504.34, 781.27, and 1070.46 seconds.

Hadoop, an open-source computing platform combining MapReduce and a distributed file system, has become a popular infrastructure for massive data analytics because it facilitates scalable data processing and storage services on a distributed computing system built from commodity hardware. This paper presents a Hadoop-based traffic monitoring system that performs IP, TCP, HTTP, and NetFlow analysis of multiple terabytes of Internet traffic in a scalable manner, and explains the performance issues related to traffic-analysis MapReduce jobs.
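The MapReduce programming model mentioned above can be illustrated with a minimal sketch in pure Python (this is an illustration of the model, not the Hadoop API): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The word-count job and sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the values for one key (here: a word count).
    return (key, sum(values))

documents = ["hadoop processes big data", "hadoop uses mapreduce"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["hadoop"])  # → 2
```

In a real cluster the map and reduce calls run in parallel on different nodes, which is why adding physical machines shortens the completion times reported above, while virtual machines sharing one host contend for the same hardware.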
In computer networks, network traffic measurement is the process of measuring the amount and type of traffic on a particular network. Today, Internet traffic measurement and analysis are mostly used to characterize network usage and user behavior, but they face a scalability problem under the explosive growth of Internet traffic and high-speed access. As the number of network elements, such as routers, switches, and user devices, has increased and their performance has improved rapidly, it has become more and more difficult for Internet Service Providers (ISPs) to collect and efficiently analyze large data sets of raw packet dumps, flow records, and activity logs for accounting, management, and security. To satisfy demands for the deep analysis of ever-growing Internet traffic data, ISPs need a traffic measurement and analysis system whose computing and storage resources can be scaled out, which is difficult because a large data set requires matching computing and storage resources. This study attempts to benchmark and profile the currently known methods for performing network analysis on Hadoop. After comparing three data storage formats for use in Hadoop (plain text, Parquet, and raw PCAP files), the study determined that the Parquet and text formats greatly outperform raw PCAP files read through the hadoop-pcap library, which fails to complete tests with high volumes of data. Moreover, Parquet still outperforms the text format by an average of approximately 30% in scan and aggregate queries, and by 70% and 40% respectively in join and aggregate-join queries, while showing an 8%-10% performance increase in aggregate-join queries over 60 minutes' worth of PCAP data. This comes, however, at the expense of large data loss, due to the need to create a well-defined schema for processing and the conversion time necessary to shift to a different format.
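The aggregate query pattern benchmarked above can be sketched as follows. The flow-record fields and sample values are hypothetical, chosen only to show what an aggregate query (total bytes per source IP) computes over NetFlow-like records, whatever storage format they are read from.

```python
from collections import defaultdict

# Hypothetical NetFlow-like records: (src_ip, dst_ip, protocol, bytes).
flows = [
    ("10.0.0.1", "10.0.0.9", "TCP", 1500),
    ("10.0.0.1", "10.0.0.8", "UDP", 300),
    ("10.0.0.2", "10.0.0.9", "TCP", 700),
]

def aggregate_bytes_by_src(records):
    # Aggregate query: sum the bytes sent per source IP -- the kind
    # of query the Parquet/text/PCAP benchmark runs at terabyte scale.
    totals = defaultdict(int)
    for src, _dst, _proto, nbytes in records:
        totals[src] += nbytes
    return dict(totals)

print(aggregate_bytes_by_src(flows))  # {'10.0.0.1': 1800, '10.0.0.2': 700}
```

Columnar formats such as Parquet help precisely here: an aggregate like this only needs the `src_ip` and `bytes` columns, so the other fields never have to be read from disk, whereas a raw PCAP must be fully parsed per packet.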
While most network data are currently analyzed on vertically scaled machines, Hadoop provides an alternative method of analysis, allowing large datasets to be analyzed on one horizontally scaled cluster. As a fault-tolerant and horizontally scalable ecosystem, it is a suitable platform for the analysis of big network data. Hadoop's popularity as a distributed computing platform continues to grow as more and more data is generated each year.