hmm, yesterday I finally got the Spark app into a stable condition, but I had to switch to a method of listing the S3 files to get the partition folders page by page. With this, at most 400 Spark jobs are submitted at any one instant to the 50-node cluster. It seems to be working ok; only 4 hours have passed, and it looks like about 20 more hours will be necessary. (A sketch of the listing idea is below.)
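For reference, a minimal sketch of that paged listing, assuming boto3 and a hypothetical bucket/prefix layout like s3://my-bucket/triples/partitionId=.../ (the real bucket and prefix names differ):

```python
import boto3

def list_partition_folders(bucket: str, prefix: str, page_size: int = 400):
    """Yield pages of partition folder prefixes instead of iterating over ids."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter="/",  # group keys into folder-like CommonPrefixes
        PaginationConfig={"PageSize": page_size},
    )
    for page in pages:
        # Each page lists at most `page_size` folders under `prefix`.
        yield [cp["Prefix"] for cp in page.get("CommonPrefixes", [])]

# Usage: each yielded batch holds at most 400 partition folders.
for folder_batch in list_partition_folders("my-bucket", "triples/"):
    print(len(folder_batch), folder_batch[:2])
```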
Then later some of the row count will be reduced. hmm, currently it seems the data will initially be around 300k partitions of 2.5 megabytes or so, I think. But those will then also be reduced to about 1/10th. After that the Neo4j db, along with the ontology hierarchy parsing, will be crafted.
ayy, yesterday passed entirely with reconfiguring the Spark job and finding out where it got stuck. It turned out the initial partitionId values were very sparse, and just iterating over them is not possible, because the original monotonically increasing row id mechanism does not increase in a standard contiguous way. So I moved to listing the partition directories and feeding them into Spark jobs, with at most 400 jobs submitted at once (so as not to throttle the cluster). It seems to be going well, and as mentioned it should take about 24 hours in total to complete.
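To illustrate the sparseness (a small PySpark sketch, not the actual job): Spark's monotonically_increasing_id packs the partition index into the upper bits of the id, so ids jump by roughly 2^33 between partitions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.master("local[4]").getOrCreate()

# 8 rows spread over 4 partitions, each row tagged with a "monotonic" id
df = spark.range(8).repartition(4).withColumn("row_id", monotonically_increasing_id())
df.orderBy("row_id").show(truncate=False)
# row_id comes out like 0, 1, 8589934592, 8589934593, 17179869184, ...
# i.e. (partition index << 33) + row index within the partition, so the id
# space is extremely sparse and a plain range scan over it mostly hits gaps.
```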
I figured out that many of the problems I had in recent days happened due to this originally sparse partition id condition. But finally the ontology db will be stored in the cloud file system within the next few days.
ayyy, I can finally start studying the dependency graphs notebook today. I just didn't manage to study it yesterday due to this effort of altering the Spark job's settings/code to make it work stably and fast while utilizing the entire cluster.
I am so happy this Spark task will finally resolve soon. I mean the initial ontology db setup, which is not converted to Neo4j partitions yet, but it's still nice that all the data will be moved to table form for an expense of around 250 euros. Then it will be filtered down to less data in a new table, since not all predicates are relevant; maybe a 1/10 reduction of the data will happen (roughly like the filtering sketch below). Then the storing of the Neo4j dbs will happen too. hmm, but these can now go in tandem with the other topology revisiting / dependency graph studies. yupp.
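A hedged sketch of that filtering step (the table paths, column names, and predicate whitelist here are assumptions for illustration, not the actual pipeline's identifiers):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical whitelist of the predicates worth keeping
RELEVANT_PREDICATES = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
]

triples = spark.read.parquet("s3://my-bucket/triples/")  # subject/predicate/object
filtered = triples.filter(col("predicate").isin(RELEVANT_PREDICATES))
filtered.write.mode("overwrite").parquet("s3://my-bucket/triples_filtered/")
```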
It has not been easy to handle this data silo's processing application, which uses Jena to parse the TTLs and create the dbs in a stable Spark job, because there was really too much data. And I was unaware that Spark created the monotonic row ids very, very sparsely. This last matter actually took a lot of time until I figured it out and quickly converted the job to the paged S3 file listing method, moving through pages instead.
hey, this initial part of the ontology task has been a challenging Spark/Jena rewriting task. But finally the subject/object/predicate table will be ready about 20 hours from now in total. Though I might pause the job so it stretches out over 2 days or so, like running such a cluster 10 hours each day.
During this task I have built up expertise in methods and code patterns to utilize a cluster fully without throttling it with too many Spark jobs, for the case where the partition info somehow can't be read, so I have to provide the partition folders to Spark myself, with a limit I set of at most 400 partitions at once.
If Spark could detect the partitions itself, this wouldn't be necessary, but somehow it does not detect the partitionId column, so the original partitioning mechanism by that column does not work. But this workaround works just as well, maybe even better. (The pattern is roughly the sketch below.)
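A minimal sketch of that batching pattern, assuming the folder list comes from the paged S3 listing above; process_folder is a hypothetical stand-in for the per-folder parsing job, and each call triggers one Spark action (one job) from its own driver-side thread:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 400  # cap on Spark jobs in flight at any one instant

def run_in_batches(folders, process_folder, max_concurrent=MAX_CONCURRENT):
    """Process folders in batches so the cluster's scheduler never sees
    more than `max_concurrent` jobs at once."""
    for start in range(0, len(folders), max_concurrent):
        batch = folders[start:start + max_concurrent]
        # Submitting Spark jobs from driver-side threads is safe; each
        # thread runs one action (one Spark job) for its folder.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            list(pool.map(process_folder, batch))  # block until the batch finishes
```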
-------------------------------------------------------------------
ayy, finally I could move over to the machine learning task side today with the dependency graph extractors analysis study, since this Spark task no longer needs much provisioning; it will just finish in some days (since I won't run the remaining 20 hours consecutively but will spread them over 2 days or so).
This part of the data silo tasks turned out to be a challenging task, while it was not anticipated to be this challenging, but the challenges were resolved with workarounds.
Now I can continue focusing more on the topology revisiting/study, the RDF design of synonymous word clusters for WordNet, and the analysis and selection studies of machine learning dependency extractor algorithms.
hmm, so I think the study is really going well, yayy :)
hmm, for the June 1 release date of the 0.1 AI version: it might or might not be feasible. I will continue studying every evening and all weekend. But if the 0.1 version does not get finished by June 1, it would slip to at most around July 1. I mean, I think the project is going well and its initial release date would not be postponed a lot; I guess it might be postponed to mid-June or at most July 1 (for the 0.1 version). And then the 1.0 version, as mentioned, would be released in November. Nice. I am happy that the project schedule is going nicely. Most (but not all) design decisions are already known/figured out/resolved. Of course it's not fully designed yet, but there are design principles, and every methodology is first checked/analyzed, and the design methods get built iteratively like that.
It's so fun to study the project every day and resolve challenges.
It's like every day at 7pm or so, my AI project day starts :) (And on weekends it's at 13:00 or so, I guess, that I start studying this AI project.)
---------------------------------------------------------------------------------------------------------------------------