A Spark newbie

Scala:

lazy val, implicit
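These two show up everywhere in Spark code; a minimal sketch of what each does (all names made up for illustration):

// A made-up sketch of the two Scala features noted above.
object ScalaFeaturesSketch {
  // lazy val: the body runs on first access, not at construction time.
  lazy val expensive: Int = {
    println("computing...")
    42
  }

  // implicit: the compiler fills in a marked parameter from values in scope.
  implicit val greeting: String = "hello"
  def greet(name: String)(implicit g: String): String = s"$g, $name"

  def main(args: Array[String]): Unit = {
    println(greet("spark")) // "hello, spark" -- greeting supplied implicitly
    println(expensive)      // prints "computing..." then 42, only now
  }
}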

With MapPartitionsRDD, a map that changes the key gets expensive: it forces a long shuffle afterwards, sometimes costing even more than sortByKey.
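A minimal local-mode sketch of why (data and names made up): mapValues keeps the parent RDD's partitioner, while a map that rewrites the key drops it, so the following reduceByKey has to shuffle everything again.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object KeyChangeShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("key-change").setMaster("local[*]"))

    val pairs = sc.parallelize(1 to 1000)
      .map(i => (i % 10, i))
      .partitionBy(new HashPartitioner(4))

    // mapValues leaves the keys untouched, so the partitioner survives
    // and the following reduceByKey needs no new shuffle.
    val noShuffle = pairs.mapValues(_ * 2).reduceByKey(_ + _)

    // map may rewrite the key, so Spark discards the partitioner and
    // reduceByKey has to shuffle the whole dataset again.
    val reShuffle = pairs.map { case (k, v) => (k + 1, v * 2) }.reduceByKey(_ + _)

    println(noShuffle.toDebugString) // no extra ShuffledRDD stage
    println(reShuffle.toDebugString) // shows a new shuffle stage
    sc.stop()
  }
}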


Using Hadoop + Spark on Windows:

https://github.com/karthikj1/Hadoop-2.7.1-Windows-64-binaries

http://blog.csdn.net/u013226462/article/details/48848689

16/09/23 23:09:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/23 23:09:09 INFO security.UserGroupInformation: Can't login from keytab, try to login from ticket cache
16/09/23 23:09:09 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
16/09/23 23:09:10 INFO yarn.Client: Requesting a new application from cluster with 523 NodeManagers
16/09/23 23:09:10 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/09/23 23:09:10 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.IllegalArgumentException: Required AM memory (8192+2000 MB) is above the max threshold (8192 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.
at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:292)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:141)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1085)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1145)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:749)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
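What the message is computing: the application master needs driver memory plus overhead, and 8192 + 2000 = 10192 MB does not fit under the cluster's 8192 MB per-container cap. A rough sketch of the failing check, with the numbers copied from the log above (the fixes in the comments are options, not steps the cluster admin confirmed):

// Back-of-the-envelope version of Client.verifyClusterResources.
object AmMemoryCheckSketch {
  def main(args: Array[String]): Unit = {
    val amMemoryMb = 8192 // spark.driver.memory as submitted
    val overheadMb = 2000 // AM memory overhead (apparently set explicitly here)
    val maxAllocMb = 8192 // yarn.scheduler.maximum-allocation-mb on this cluster

    val requiredMb = amMemoryMb + overheadMb
    require(requiredMb <= maxAllocMb,
      s"Required AM memory ($amMemoryMb+$overheadMb MB) is above the max threshold ($maxAllocMb MB)")
    // Fails: 8192 + 2000 = 10192 > 8192.
    // Fix: either raise yarn.scheduler.maximum-allocation-mb on the cluster,
    // or request a smaller driver, e.g. spark-submit --driver-memory 6g,
    // so that 6144 + 2000 <= 8192.
  }
}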



Job aborted due to stage failure: Serialized task 1073:0 was 205781866 bytes, which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.
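Here a serialized task was about 196 MB against a 128 MB frame size, which usually means a large value was captured in the task closure. The error itself suggests broadcast variables; a minimal sketch, with a made-up lookup table standing in for the real data:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastInsteadOfClosureSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-demo").setMaster("local[*]"))

    // A large lookup structure; capturing this directly in a closure
    // serializes it into every single task and can blow past the frame size.
    val bigTable: Map[Int, Double] = (0 until 1000000).map(i => i -> i * 0.5).toMap

    // Broadcast it once; each executor fetches a single read-only copy,
    // and the tasks themselves stay tiny.
    val bigTableB = sc.broadcast(bigTable)

    val result = sc.parallelize(1 to 100)
      .map(i => bigTableB.value.getOrElse(i, 0.0))
      .sum()

    println(result)
    bigTableB.unpersist()
    sc.stop()
  }
}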


1. spark.driver.maxResultSize 8g: the maximum total size of results the driver will accept. Since I was working with large matrices, I had no choice but to raise this.

2. spark.yarn.executor.memoryOverhead 2048: after the job had been running for a while, many executors were using far too much off-heap memory; this setting helped somewhat.

3. spark.shuffle.blockTransferService nio: since Spark 1.2.0 the shuffle service defaults to netty, which is ridiculous; after switching back to nio, off-heap memory usage dropped a lot and processing got roughly twice as fast.
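For reference, a minimal sketch of applying all three settings through a SparkConf (Spark 1.x property names; spark.yarn.executor.memoryOverhead is safer passed via spark-submit --conf in yarn-cluster mode, since it must be set before containers are requested):

import org.apache.spark.{SparkConf, SparkContext}

object TuningConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tuning-sketch")
      // 1. Let the driver accept up to 8g of collected results (large matrices).
      .set("spark.driver.maxResultSize", "8g")
      // 2. Extra off-heap headroom per executor container, in MB.
      .set("spark.yarn.executor.memoryOverhead", "2048")
      // 3. Fall back from netty to nio for shuffle transfers. Pre-Spark-2.0
      //    only; the nio implementation was removed in Spark 2.0.
      .set("spark.shuffle.blockTransferService", "nio")

    val sc = new SparkContext(conf)
    // ... job ...
    sc.stop()
  }
}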