Just a minute, it's not 23 billion records (parsed via Jena), it's 23 trillion records. Wow. I generated 23 trillion records in 7.5 hours, with the notebook running on a 50-node cluster for the whole job :)

 

nice :) 


(It was a challenge to set the cluster up so that every node stays fully utilized while processing this much data :) but that challenge got solved this past week :) )


I will share the Zeppelin and AWS EMR configuration settings later, the ones that keep all nodes busy under the FAIR Scheduler and make it possible to process this much data in so little time (23 trillion records in 7.5 hours :) ), so it can be helpful to anyone who needs such a config.
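
Until then, here is a minimal sketch of the kind of Spark properties involved, only an assumption of what such a setup looks like, not my actual EMR/Zeppelin values. In Zeppelin on EMR these normally go into the Spark interpreter settings or spark-defaults.conf rather than into code, and the pool name and allocation file path below are just placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: in Zeppelin the SparkSession is created for you, and these
// keys would be set in the interpreter settings / spark-defaults.conf instead.
val spark = SparkSession.builder()
  .appName("ttl-to-spo")
  .config("spark.scheduler.mode", "FAIR")            // FAIR instead of the default FIFO scheduling
  .config("spark.scheduler.allocation.file",
          "/etc/spark/conf/fairscheduler.xml")       // optional file defining the scheduler pools
  .config("spark.dynamicAllocation.enabled", "true") // let executor count grow to cover all nodes
  .getOrCreate()

// Jobs submitted from a given thread run in whatever pool that thread selects:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fair_pool")
```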

But of course I won't share my notebook or the code that keeps up to 400 Spark jobs running at the same time :) I won't share my silly notebook, which is coded with silly fors and whiles, some silly C-like code, and also some Scala Futures logic to keep the cluster highly utilized at all times (e.g. 400 Spark jobs running constantly).


That is easy to code in Scala :) I mean the logic that always keeps 400 Spark jobs running and submits a new job only when one of them finishes, so the cluster stays fully utilized, with the FAIR Scheduler making sure the work is spread across all of it.
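
Just to illustrate the idea (this is not my notebook code; the path pattern, pool name, and processOneFile helper are made up for the example): a fixed pool of 400 threads does exactly that, because a queued task only starts when one of the 400 running ones finishes.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical list of inputs, one Spark job per entry.
val inputPaths: Seq[String] = (0 until 100000).map(i => f"s3://my-bucket/ttl/part-$i%05d.ttl")

// 400 worker threads = at most 400 Spark jobs in flight; a new job is submitted
// only when one of the 400 slots frees up.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(400))

// Hypothetical per-file work: parse one TTL file and write its triples somewhere.
def processOneFile(path: String): Unit = { /* ... */ }

val jobs = inputPaths.map { path =>
  Future {
    // the scheduler pool is a thread-local property, so set it inside the Future
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fair_pool")
    processOneFile(path)
  }
}

Await.result(Future.sequence(jobs), Duration.Inf)
```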

At the beginning of the task, when the special configuration was not applied yet, only 10 nodes or so were being utilized. But after the config changes, all 50 nodes were utilized the whole time during the conversion of TTL with Jena into subject/predicate/object table rows.
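
The conversion step itself looks roughly like this (a sketch only, not my actual code: the S3 path is a placeholder, and it assumes the Jena jars are on the interpreter classpath and that `spark` is the Zeppelin-provided SparkSession). Each Turtle document is parsed into a Jena model and every statement becomes one (subject, predicate, object) row:

```scala
import java.io.StringReader
import org.apache.jena.rdf.model.ModelFactory
import scala.collection.mutable.ListBuffer

// wholeTextFiles yields (path, content) pairs; the location is a placeholder.
val triples = spark.sparkContext
  .wholeTextFiles("s3://my-bucket/ttl/")
  .flatMap { case (_, content) =>
    // Parse one Turtle document with Jena and emit one row per RDF statement.
    val model = ModelFactory.createDefaultModel()
    model.read(new StringReader(content), null, "TTL")
    val rows = ListBuffer.empty[(String, String, String)]
    val it = model.listStatements()
    while (it.hasNext) {
      val st = it.nextStatement()
      rows += ((st.getSubject.toString, st.getPredicate.toString, st.getObject.toString))
    }
    rows
  }

import spark.implicits._
val triplesDF = triples.toDF("subject", "predicate", "object")
```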





