智能文章系统实战-Hive数据仓库(15)
admin 发布于:2018-7-12 16:07 有 2239 人浏览,获得评论 0 条
1. hive 安装
http://www.wangfeilong.cn/server/118.html
2.查看上节用Hadoop清洗过的数据pvlog.txt[root@localhost hive]# cat /tmp/pvlog.txt 192.168.100.1 1530460833 http://news.demo.com/h5.php?action=show&id=89 192.168.100.1 1530460803 http://news.demo.com/h5.php?action=show&id=128
3启动HIVE并且创建数据库
[root@localhost hive]# hive which: no hbase in (/usr/local/soft/hive/bin:/usr/local/soft/Hadoop/hadoop/bin:/usr/local/soft/Hadoop/hadoop/sbin:/usr/local/soft/jdk1.8.0_17/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin) SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/local/soft/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/soft/Hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] Logging initialized using configuration in jar:file:/usr/local/soft/hive/lib/hive-common-2.3.3.jar!/hive-log4j2.properties Async: true Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. hive> CREATE DATABASE IF NOT EXISTS article; OK Time taken: 7.758 seconds hive> show databases; OK article default demo wordcount Time taken: 0.229 seconds, Fetched: 4 row(s) hive> use article; OK Time taken: 0.068 seconds
4.hive详细日志表
4.1 创建明细表
hive> create table pvlog( > ip string, > times string, > url string) > PARTITIONED BY (stat_date string) > row format delimited fields terminated by '\t' stored as textfile; OK Time taken: 0.582 seconds
4.2 加载文本数据到HIVE数据库表
hive> load data local inpath '/tmp/pvlog.txt' overwrite into table pvlog partition(stat_date='2018-07-01'); Loading data to table article.pvlog partition (stat_date=2018-07-01) OK Time taken: 2.383 seconds
4.3 查询明细表数据
hive> select * from pvlog where stat_date = '2018-07-01'; OK 192.168.100.1 1530460833 http://news.demo.com/h5.php?action=show&id=89 2018-07-01 192.168.100.1 1530460803 http://news.demo.com/h5.php?action=show&id=128 2018-07-01 Time taken: 4.96 seconds, Fetched: 2 row(s)
5.统计数据
5.1 创建统计表
hive> create table stat( > stat_date string, > pv int, > ip int > ) > row format delimited fields terminated by '\t' stored as textfile; OK Time taken: 0.26 seconds
5.2 统计数据
hive> insert into stat > select stat_date,count(*) as pv,count(distinct(ip)) as ip from pvlog where stat_date = '2018-07-01' group by stat_date; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20180710175116_136d9e36-a8fc-4d0d-9f91-93dd71aba321 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1531202649478_0010, Tracking URL = http://localhost:8088/proxy/application_1531202649478_0010/ Kill Command = /usr/local/soft/Hadoop/hadoop/bin/hadoop job -kill job_1531202649478_0010 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2018-07-10 17:51:36,560 Stage-1 map = 0%, reduce = 0% 2018-07-10 17:51:52,289 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.23 sec 2018-07-10 17:52:07,262 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.86 sec MapReduce Total cumulative CPU time: 5 seconds 860 msec Ended Job = job_1531202649478_0010 Loading data to table article.stat MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.86 sec HDFS Read: 9792 HDFS Write: 83 SUCCESS Total MapReduce CPU Time Spent: 5 seconds 860 msec OK Time taken: 53.74 seconds
5.3 查询统计结果
hive> select * from stat; OK 2018-07-01 2 1 Time taken: 0.36 seconds, Fetched: 1 row(s)
6.查看统计的数据文件
7.把统计导入MYSQL数据库