Apache Log Analysis with Hadoop

장발의 개발러 2012. 5. 22. 17:42



Source: http://mimul.com/pebble/default/2011/11/05/1320482173560.html


Below is a sample I put together that counts visitors per IP address from an Apache log. I'm sharing the test run in the hope that it helps with understanding the concepts behind log analysis. ^^
With a little more work, you could break the traffic down by hour, day, week, or month.
And if you feed the results into R and chart them, the graphs can help you spot trends.

1. Preparing the Apache Log Data
Download Apache common-format logs into the /database/samples/data/apache directory, then copy them into HDFS in the following order to get ready for analysis.

- Copy the local data into HDFS

[mimul]/hadoop-0.20.204.0> bin/hadoop dfs -copyFromLocal /database/samples/data/apache apache
[mimul]/hadoop-0.20.204.0> bin/hadoop dfs -ls
Found 5 items
drwxr-xr-x   - k2 KPCT        0 2011-11-02 18:07 /user/k2/apache
drwxr-xr-x   - k2 KPCT        0 2011-10-21 12:52 /user/k2/gutenberg
drwxr-xr-x   - k2 KPCT        0 2011-10-21 12:57 /user/k2/gutenberg-output
drwxr-xr-x   - k2 KPCT        0 2011-09-28 20:47 /user/k2/input
drwxr-xr-x   - k2 KPCT        0 2011-09-28 20:49 /user/k2/output
[mimul]/hadoop-0.20.204.0> bin/hadoop dfs -ls apache
Found 4 items
-rw-r--r--   1 k2 KPCT 15380331 2011-11-02 18:07 /user/k2/apache/access_log.20110701
-rw-r--r--   1 k2 KPCT 11754087 2011-11-02 18:07 /user/k2/apache/access_log.20110702
-rw-r--r--   1 k2 KPCT 12220413 2011-11-02 18:07 /user/k2/apache/access_log.20110703
-rw-r--r--   1 k2 KPCT 14435475 2011-11-02 18:07 /user/k2/apache/access_log.20110704

2. MapReduce Sample Source

- LogMapper.java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class LogMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static Logger logger = Logger.getLogger(LogMapper.class);

    @Override
    protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
        String logEntryLine = value.toString();
        String logEntryPattern = Constants.APACHE_ACCESS_LOG;
        int position = 1; // the client IP is the first capture group

        if (isValidLine(logEntryLine, logEntryPattern)) {
            String ipAddress = retrieveIPAddress(logEntryLine, logEntryPattern,
              position);
            logger.warn("Ip address in map : " + ipAddress);
            context.write(new Text(ipAddress), Constants.ONE);
        }
    }

    public boolean isValidLine(String logEntryLine, String logEntryPattern) {
        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(logEntryLine);
        if (!matcher.matches()) {
            logger.warn("Not a valid log format");
            return false;
        }
        return true;
    }

    public String retrieveIPAddress(String logEntryLine, String logEntryPattern,
      int position) {
        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(logEntryLine);
        if (!matcher.matches()
            || Constants.PARSE_CNT != matcher.groupCount()) {
            logger.warn("Invalid log entry");
            return Constants.INVALID_IPADDRESS;
        }
        return matcher.group(position);
    }
}
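The mapper references a Constants class that isn't shown in the post. A minimal sketch of what it might contain is below; the regular expression for the Apache common log format and the INVALID_IPADDRESS placeholder are assumptions on my part, not the original author's values:

import org.apache.hadoop.io.IntWritable;

public class Constants {
    // Assumed regex for the Apache common log format:
    //   host ident authuser [date] "request" status bytes
    public static final String APACHE_ACCESS_LOG =
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)";

    // Capture-group count of the pattern above; group 1 is the client IP.
    public static final int PARSE_CNT = 7;

    // Shared count of 1, written by the mapper for every valid line.
    public static final IntWritable ONE = new IntWritable(1);

    // Returned when a line can't be parsed (assumed placeholder value).
    public static final String INVALID_IPADDRESS = "unknown";
}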
- LogReducer.java
public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
     protected void reduce(Text key,Iterable<IntWritable> values,Context context)
       throws IOException ,InterruptedException {
             
             // Sum the 1s emitted by the mapper: the total number of
             // requests seen from this IP address.
             int sum = 0;
             for (IntWritable value : values) {
                     sum += value.get();
             }
             context.write(key, new IntWritable(sum));
     }
}
- LogMapReduceJob.java
public class LogMapReduceJob extends Configured implements Tool
{
     private static void initJob(String jobName, Configuration config,
       String inputPath, String outputPath) throws IOException,
       InterruptedException, ClassNotFoundException{
             Job job=new Job(config, jobName);
             job.setJarByClass(LogMapReduceJob.class);
             job.setMapperClass(LogMapper.class);
             job.setReducerClass(LogReducer.class);
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             
             FileInputFormat.addInputPath(job, new Path(inputPath));
             FileOutputFormat.setOutputPath(job, new Path(outputPath));
             
             job.waitForCompletion(true);
     }
     
     public static void main(String[] args) throws Exception {
             ToolRunner.run(new LogMapReduceJob(), args);
     }
     @Override
     public int run(String[] args) throws Exception {
             // Use the configuration prepared by ToolRunner rather than a
             // fresh one, so generic options (-D, -fs, ...) are honored;
             // creating new Configuration() here is what triggers the
             // GenericOptionsParser warning seen in the run log below.
             Configuration config = getConf();
             System.out.println("Args [0] :"+args[0]);
             System.out.println("Args [1] :"+args[1]);
             System.out.println("Args [2] :"+args[2]);
             initJob(args[0], config, args[1], args[2]);
             return 0;
     }
}
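Since the reduce step is a plain associative sum, the same LogReducer class could also be registered as a combiner to pre-aggregate counts on the map side and shrink the shuffle (the Combine input records=0 counter in the run below shows no combiner was used). A suggested one-line addition to initJob(), not part of the original code:

// Pre-aggregate per-IP counts on the map side before the shuffle.
job.setCombinerClass(LogReducer.class);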
- Build file
. LogAnalizerMapReduce.jar
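The post doesn't include the build script itself. A minimal command-line build might look like the following; the hadoop-core jar name matches a stock 0.20.204.0 install, but the log4j jar version and the Main-Class manifest (needed so that LogAnalyze, apache, and apache-output all arrive as program arguments) are my assumptions:

[mimul]/hadoop-0.20.204.0> javac -classpath hadoop-core-0.20.204.0.jar:lib/log4j-1.2.15.jar -d classes LogMapper.java LogReducer.java LogMapReduceJob.java Constants.java
[mimul]/hadoop-0.20.204.0> echo "Main-Class: LogMapReduceJob" > manifest.txt
[mimul]/hadoop-0.20.204.0> jar cvfm LogAnalizerMapReduce.jar manifest.txt -C classes .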

3. Analyzing the Log Data
Run the LogAnalizerMapReduce.jar file to analyze the Apache logs.
[mimul]/hadoop-0.20.204.0> bin/hadoop jar LogAnalizerMapReduce.jar LogAnalyze apache apache-output
Args [0] :LogAnalyze
Args [1] :apache
Args [2] :apache-output
11/11/03 15:39:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/11/03 15:39:38 INFO input.FileInputFormat: Total input paths to process : 1
11/11/03 15:39:38 INFO mapred.JobClient: Running job: job_201111031454_0003
11/11/03 15:39:39 INFO mapred.JobClient:  map 0% reduce 0%
11/11/03 15:39:56 INFO mapred.JobClient:  map 46% reduce 0%
11/11/03 15:39:59 INFO mapred.JobClient:  map 78% reduce 0%
11/11/03 15:40:02 INFO mapred.JobClient:  map 100% reduce 0%
11/11/03 15:40:17 INFO mapred.JobClient:  map 100% reduce 100%
11/11/03 15:40:22 INFO mapred.JobClient: Job complete: job_201111031454_0003
11/11/03 15:40:22 INFO mapred.JobClient: Counters: 25
11/11/03 15:40:22 INFO mapred.JobClient:   Job Counters 
11/11/03 15:40:22 INFO mapred.JobClient:     Launched reduce tasks=1
11/11/03 15:40:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22317
11/11/03 15:40:22 INFO mapred.JobClient:     Total time spent by all reduces 
waiting after reserving slots (ms)=0
11/11/03 15:40:22 INFO mapred.JobClient:     Total time spent by all maps 
waiting after reserving slots (ms)=0
11/11/03 15:40:22 INFO mapred.JobClient:     Launched map tasks=1
11/11/03 15:40:22 INFO mapred.JobClient:     Data-local map tasks=1
11/11/03 15:40:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13390
11/11/03 15:40:22 INFO mapred.JobClient:   File Output Format Counters
11/11/03 15:40:22 INFO mapred.JobClient:     Bytes Written=105178
11/11/03 15:40:22 INFO mapred.JobClient:   FileSystemCounters
11/11/03 15:40:22 INFO mapred.JobClient:     FILE_BYTES_READ=1806396
11/11/03 15:40:22 INFO mapred.JobClient:     HDFS_BYTES_READ=15380456
11/11/03 15:40:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3655077
11/11/03 15:40:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=105178
11/11/03 15:40:22 INFO mapred.JobClient:   File Input Format Counters
11/11/03 15:40:22 INFO mapred.JobClient:     Bytes Read=15380331
11/11/03 15:40:22 INFO mapred.JobClient:   Map-Reduce Framework
11/11/03 15:40:22 INFO mapred.JobClient:     Reduce input groups=6327
11/11/03 15:40:22 INFO mapred.JobClient:     Map output materialized bytes=1806396
11/11/03 15:40:22 INFO mapred.JobClient:     Combine output records=0
11/11/03 15:40:22 INFO mapred.JobClient:     Map input records=86763
11/11/03 15:40:22 INFO mapred.JobClient:     Reduce shuffle bytes=1806396
11/11/03 15:40:22 INFO mapred.JobClient:     Reduce output records=6327
11/11/03 15:40:22 INFO mapred.JobClient:     Spilled Records=172974
11/11/03 15:40:22 INFO mapred.JobClient:     Map output bytes=1633416
11/11/03 15:40:22 INFO mapred.JobClient:     Combine input records=0
11/11/03 15:40:22 INFO mapred.JobClient:     Map output records=86487
11/11/03 15:40:22 INFO mapred.JobClient:     SPLIT_RAW_BYTES=125
11/11/03 15:40:22 INFO mapred.JobClient:     Reduce input records=86487
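A quick sanity check on these counters: Map input records (86,763) minus Map output records (86,487) leaves 276 lines that failed the log-format regex and were skipped by isValidLine(), and Reduce input groups equals Reduce output records at 6,327, i.e. the job counted 6,327 distinct IP addresses.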

[mimul]/hadoop-0.20.204.0> bin/hadoop dfs -ls apache-output
Found 3 items
-rw-r--r--   1 k2 KPCT      0 2011-11-03 15:22 /user/k2/apache-output/_SUCCESS
drwxr-xr-x   - k2 KPCT      0 2011-11-03 15:22 /user/k2/apache-output/_logs
-rw-r--r--   1 k2 KPCT 105178 2011-11-03 15:22 /user/k2/apache-output/part-r-00000
[mimul]/hadoop-0.20.204.0> bin/hadoop fs -cat apache-output/part-r-00000
58.225.20.33    1
58.225.23.88    6
58.226.140.46   2
58.227.139.198  7
58.227.156.67   1
58.227.19.60    2
58.227.204.143  3
58.227.31.86    3
58.228.13.105   1
58.228.3.131    1
58.228.60.70    1
58.228.89.39    2
58.229.111.26   1
58.229.146.205  10
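If you want the heaviest visitors from this output, the part file can be piped through standard Unix tools (the key and count are tab-separated, as TextOutputFormat writes by default); for example:

[mimul]/hadoop-0.20.204.0> bin/hadoop fs -cat apache-output/part-r-00000 | sort -k2 -rn | head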
4. JobTracker
[Screenshot of the JobTracker web UI for this job]
5. Troubleshooting
The NameNode went into safe mode, blocking access to the Apache log files under analysis, so I took it out of safe mode and re-ran the job.
SafeModeException: Cannot delete /tmp/hadoop-k2/mapred/system. Name node is in safe mode.
[mimul]/hadoop-0.20.204.0> bin/hadoop dfsadmin -safemode leave
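You can confirm the current state with bin/hadoop dfsadmin -safemode get. The NameNode normally leaves safe mode on its own once enough blocks have been reported by the DataNodes, so forcing it off like this is best reserved for a test cluster.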