Yay, the study task starts now. I created the initial game DB table as:


create table gamedb_v05 (
            year INT,   -- year of the game start time
            month INT,  -- month of the game start time
            day INT,    -- day of the game start time
            is_joinable BOOLEAN,  -- true while the game still has free capacity and has not started yet (a player cannot join a game that has already started unless they joined before it started and the rejoin timeout has not passed)
            game_version STRING,  -- version of the game project (each game bundle has its own rules, e.g. for a battle pass)
            capacity INT,  -- e.g. 6 for a 3x3 game
            allocated_capacity INT,  -- number of players currently joined
            -- capacity and allocated_capacity are partition columns so that, even without an
            -- ORDER BY clause, clients searching for a game first see the games that need the
            -- fewest additional players. E.g. with two 3x3 games of capacity 6, one with 5
            -- players joined and one with 4, the client should see the first one with higher
            -- priority; otherwise it would be hard for any game to collect enough players to
            -- start. Without ORDER BY this needs multiple client-side queries (HTTP calls):
            -- first capacity - allocated_capacity < 2, then capacity - allocated_capacity < 3,
            -- and so on, so that nearly ready games are offered first.
            -- game_version matters here too: for a battle pass tied to a specific bundle
            -- version, a client can only list the game if it meets the requirements to
            -- attend (e.g. bought a ticket, or has a high enough score).
            -- is_joinable is in the partition columns as well, because a game is joinable only
            -- until it starts; once started it is set to false. A client that joined before
            -- the start and lost its connection (e.g. STUN/WebRTC network issues) can still
            -- rejoin regardless of this flag, as long as the game is still running and the
            -- timeout period has not passed.
            hour INT,
            minute INT,
            second INT,
            userid_shortened STRING,  -- meant to reduce the partition count of the folder system when querying
            userid STRING,
            gameid STRING,
            game_capacity INT,  -- to be removed, duplicates capacity
            current_joined_gamers ARRAY<STRING>,
            current_joined_count INT,  -- to be removed, duplicates allocated_capacity
            creatoruid STRING,
            createdat timestamp,
            startedat timestamp,
            finishedat timestamp,
            is_started BOOLEAN,
            is_finished BOOLEAN,
            websocket_url string,  -- websocket URL of the game creator; the creator may sit on a specific pod when Node.js scales out, so clients have to be routed to this URL through nginx proxy_pass to reach the right Node.js pod
            won_team string ) using delta partitioned by (year, month, day,
                is_joinable, game_version, capacity, allocated_capacity,
                hour, minute, second,
                userid_shortened,
                userid,
                gameid)  LOCATION 'gs://*****'
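The join and rejoin rule described in the is_joinable comments above can be sketched roughly like this. It is only a minimal sketch: can_join, the dict layout, the disconnected_at argument and the 60-second REJOIN_TIMEOUT are all my own assumptions for illustration, the real service may track this differently.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed value for illustration; the notes above do not fix a timeout.
REJOIN_TIMEOUT = timedelta(seconds=60)


def can_join(game: dict, userid: str, now: datetime,
             disconnected_at: Optional[datetime] = None) -> bool:
    """Return True if userid may enter the game right now (hypothetical helper)."""
    if game["is_finished"]:
        return False
    if userid in game["current_joined_gamers"]:
        # Rejoin path: allowed even when is_joinable is false,
        # as long as the disconnect is recent enough.
        if disconnected_at is None:
            return True
        return now - disconnected_at <= REJOIN_TIMEOUT
    # New player: only the is_joinable flag decides.
    return game["is_joinable"]


game = {
    "is_joinable": False,  # the game has already started
    "is_finished": False,
    "current_joined_gamers": ["u1", "u2"],
}
now = datetime(2024, 1, 1, 12, 0, 30)
# u1 dropped 30 seconds ago and had joined before the start.
print(can_join(game, "u1", now, disconnected_at=datetime(2024, 1, 1, 12, 0, 0)))
# u9 never joined, and the game is no longer joinable.
print(can_join(game, "u9", now))
```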



Hmm, thinking about it: since this is not Spark, it is not logical to express this as capacity - allocated_capacity at query time. Yep, not logical. Let's store it as a remaining_capacity column instead.


Since I don't know the internals of how Delta Lake works without Spark, it is better to keep query conditions as simple direct-value (equality) checks rather than conditions on the result of a unary or binary expression. This matters most for the higher-order partition columns such as remaining_capacity or game_version. For second, I think a non-value condition such as second <= cur_second is acceptable. The reason is that the query is not distributed: at most it may be spread across threads, but not across VMs the way Spark does it. So the query clause should do as little work as possible, imho, otherwise it would be too slow.
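To make the direct-value idea concrete, here is a minimal sketch of how a client could list games in priority order using only equality conditions on remaining_capacity, one probe per value, instead of a range condition or an ORDER BY. The in-memory GAMES list just stands in for the table, and probe and find_games are hypothetical names of mine, not part of any library.

```python
from typing import Dict, List

# Stand-in rows for the table; only the columns used by the filter are shown.
GAMES: List[Dict] = [
    {"gameid": "g1", "game_version": "v1", "is_joinable": True, "remaining_capacity": 2},
    {"gameid": "g2", "game_version": "v1", "is_joinable": True, "remaining_capacity": 1},
    {"gameid": "g3", "game_version": "v2", "is_joinable": True, "remaining_capacity": 1},
    {"gameid": "g4", "game_version": "v1", "is_joinable": False, "remaining_capacity": 0},
]


def probe(game_version: str, remaining: int) -> List[Dict]:
    """One equality-only lookup: every condition is column == value."""
    return [g for g in GAMES
            if g["is_joinable"] is True
            and g["game_version"] == game_version
            and g["remaining_capacity"] == remaining]


def find_games(game_version: str, max_remaining: int, limit: int) -> List[str]:
    """Probe remaining_capacity = 1, 2, ... so nearly full games come first."""
    found: List[str] = []
    for remaining in range(1, max_remaining + 1):
        for g in probe(game_version, remaining):
            found.append(g["gameid"])
            if len(found) == limit:
                return found
    return found


print(find_games("v1", max_remaining=5, limit=10))  # ['g2', 'g1']
```

Each probe pins every partition column it touches to a single value, so the cost grows with the number of probes rather than with the number of partitions.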


Second, it is also important to keep partition cardinality low, for the same reason. Even with Spark a high partition cardinality is not a good idea, but without Spark it matters even more: the number of partitions any query touches should be much less than a million, much less than ten thousand, and imho even much less than a thousand, given that the work is not distributed.

I just started to think that if a query clause matches more than 1000 partitions, the query would still be very slow. There would of course be filtering based on Delta Lake's internal table versions folder, but imho it needs close to one-to-one access to a partition, without a partition having many possible sub-partition folders inside it.

Aha, I am wrong about that: the transaction log already records the partition values of each data file, which gives index-like functionality, and as long as the lookup is not very complex it should work about as fast as a MySQL index, imho.


The exception is this case: you filter on a partition column without a direct-value clause, and that partition level has 1000, or even just 100, entries.

If that case does not apply, you are right to go without Spark, no issue, imho.

But if it does apply, i.e. queries usually span that many partition folders, then I guess the query might be slow without Spark, even with all the optimizations in the table definition and data folder layout of Delta Lake. That is my opinion and it might not be accurate; I don't know.


If the above is the case, i.e. you need non-value predicate clauses on partition columns and those partitions number around 1000, or even 100, then Spark is the way to go. Otherwise, if your partition predicates are simple value-based ones, I think the no-Spark version is feasible and even the more logical approach, since Spark takes a long time for session setup. Spark is not meant for real-time or near-real-time queries, even on a horizontally scaled cluster; it is usually for offline-style processing. Even when the data itself is live, it is still an offline type of processing, not online real-time or near-real-time querying.


So the most important thing is that query clauses do not span many partitions. Otherwise, with a single Python process, even if the underlying Delta Lake Python or Rust library is multithreaded, it would still be slow, since the work is not really distributed.

So this is the aspect I will review the tables for most: a query clause should usually have at most about 5 partitions to scan in a single query, since otherwise it would be super slow. Even at 100 partitions, I speculate the Python process could take at least 1 minute, maybe even 3, to handle the query. Is a 1-minute lag OK in a real-time service? Definitely not. So we have to be very careful with the table design to avoid this situation.
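A rough way to check a table design against this rule is to multiply the cardinalities of the partition columns that a query does not pin to a single value. The sketch below does only that arithmetic. The cardinalities are made-up numbers for illustration, and partitions_to_scan is a hypothetical helper, not a Delta Lake API.

```python
from math import prod
from typing import Dict, Set

# Assumed distinct-value counts per partition column within one day.
CARDINALITY: Dict[str, int] = {
    "is_joinable": 2,
    "game_version": 3,
    "remaining_capacity": 6,
    "hour": 24,
    "minute": 60,
}


def partitions_to_scan(pinned: Set[str]) -> int:
    """Upper bound on the partition folders a query can touch.

    Columns in `pinned` are filtered with column == value, so they
    contribute a factor of 1; every other column contributes all its values.
    """
    return prod(n for col, n in CARDINALITY.items() if col not in pinned)


# Everything pinned except remaining_capacity: at most 6 folders.
print(partitions_to_scan({"is_joinable", "game_version", "hour", "minute"}))  # 6
# Only is_joinable and game_version pinned: 6 * 24 * 60 folders.
print(partitions_to_scan({"is_joinable", "game_version"}))  # 8640
```

Under these assumed numbers, leaving hour and minute unpinned already blows far past the "about 5 partitions" target.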

These are my speculations; I don't know the implementation of the Delta Lake library without Spark. Since the no-Spark version seems to be non-distributed, I speculated like this, but maybe I am wrong.

Still, I will design the DB table with this idea in mind, since I think it is probably like this. Without Spark it is non-distributed, right?



So with the right table design you can get a fast, MySQL-like DB that scales horizontally (which MySQL does not do by itself), without even needing Spark. But it requires the right table design, and it only holds if you don't do ORDER BYs, joins between tables, and so on.

For big data that needs joins, you either have to use Spark or, I think, some other DB technology that enables them.

But we don't need any of that in this Python API right now, so I am going with the no-Spark option, since this serves online web requests. Better to keep it as fast as if it were connected to a MySQL DB. (I think Delta Lake can work as fast as MySQL without Spark too, in such a carefully designed setup.)

But of course for joins, ORDER BY and the like, Spark is surely required with Delta Lake.

 

So in this respect, I think the table above is slightly problematic in these columns:

            userid_shortened STRING,
            userid STRING,


Consider the case of 1000 users online.

Then there are also about 1000 shortened user IDs (since the hash that generates the uid can produce that many distinct values).

What happens when you query is that the DB API retrieves the folders from the folder file system in batches, and you have to scan through them page by page.


But for the creator of a game this won't be an issue, since they can go directly to the folder with their specific user ID.

For a client that connects to a game, it does not matter which user ID is retrieved, so it fetches only a limited number of rows rather than the full dataset, and there won't be an issue there either.

Since we have no query that filters on user ID with a non-value (unary or binary operator) clause, no slowness would happen.


So this shortened user ID is not logical: for 1000 users, the first 5 characters of their hashes would still give about 1000 folders. It was illogical to add that column, imho.
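A quick check of this point: the first 5 hex characters of a hash allow about a million different prefixes, so 1000 users almost never collide and you still end up with roughly 1000 folders. The sketch assumes the uid is a SHA-256 hex digest of a user name, which is my assumption only, and shortened_uid is a hypothetical helper.

```python
import hashlib


def shortened_uid(username: str, length: int = 5) -> str:
    """First `length` hex characters of an assumed SHA-256 based uid."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()[:length]


prefixes = {shortened_uid(f"user-{i}") for i in range(1000)}
# 16**5 = 1,048,576 possible prefixes, so nearly every user gets its own.
print(len(prefixes))
```

Shortening to 1 or 2 characters would cap the folder count at 16 or 256, but a 5-character prefix does not reduce it in practice.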


It would be more logical to create an int64-style uid, where each user also gets an int64 value when registering, and partition the table based on that, so that this level of the partition hierarchy does not end up with so many folders.
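One way this could look, as a minimal sketch: derive a small bucket number from the integer uid and partition on the bucket, which caps the folder count no matter how many users register. NUM_BUCKETS and uid_bucket are hypothetical names, and 16 is an arbitrary choice of mine.

```python
NUM_BUCKETS = 16  # arbitrary cap on the folders at this partition level


def uid_bucket(uid: int) -> int:
    """Map an int64-style user id to one of NUM_BUCKETS partition values."""
    return uid % NUM_BUCKETS


buckets = {uid_bucket(uid) for uid in range(1, 1001)}
print(len(buckets))  # 16
```

A lookup for one user can still use a direct-value filter, because the bucket is computable from the uid before querying.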


Yep.

I think I will either change the user ID like that or remove the userid_shortened column altogether. Yep, I will remove it; it is not required.


So the latest game DB table looks like this:


"""create table gamedb_v1000 (year INT,
            month INT,
            day INT,
            is_joinable BOOLEAN,
            game_version STRING,
            remaining_capacity INT,
            hour INT,
            minute INT,
            second INT,
            userid STRING,
            gameid STRING,
            current_joined_gamers ARRAY<STRING>,
            creatoruid STRING,
            createdat timestamp,
            startedat timestamp,
            finishedat timestamp,
            is_started BOOLEAN,
            is_finished BOOLEAN,
            websocket_url string,
            allocated_capacity INT,
            capacity INT,
            won_team string ) using delta partitioned by (year, month, day,
                is_joinable, game_version, remaining_capacity,
                hour, minute, second,
                userid,
                gameid)  LOCATION 'gs://***'"""
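With this layout every row lands in a folder path built from the partition columns in the declared order, as Hive-style key=value segments. Here is a minimal sketch of that path. partition_path is a hypothetical helper and the sample values are invented; the exact value formatting used by the real writer may differ.

```python
PARTITION_COLUMNS = [
    "year", "month", "day",
    "is_joinable", "game_version", "remaining_capacity",
    "hour", "minute", "second",
    "userid", "gameid",
]


def partition_path(row: dict) -> str:
    """Join key=value segments in partition order (booleans lowercased)."""
    parts = []
    for col in PARTITION_COLUMNS:
        value = row[col]
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append(f"{col}={value}")
    return "/".join(parts)


row = {
    "year": 2024, "month": 5, "day": 17,
    "is_joinable": True, "game_version": "v1", "remaining_capacity": 2,
    "hour": 20, "minute": 15, "second": 3,
    "userid": "u42", "gameid": "g7",
}
print(partition_path(row))
# year=2024/month=5/day=17/is_joinable=true/game_version=v1/remaining_capacity=2/hour=20/minute=15/second=3/userid=u42/gameid=g7
```

A lookup that pins every one of these columns to a single value maps to exactly one such path, which is the cheap case the design above aims for.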





 

