sql - Using Google Big Query to Organise an Edgelist for Spreadsheet Use -
i'm having trouble trying reorganise large edgelist have exported .csv gephi.
the structure of data is:
- two columns, "source" , "target".
- "source" contains user ids , target contains demographic information such city, country, , university.
- due structure of data, user may duplicated in "source" if have multiple demographic records.
for example:
source target user142 citya user532 countrya user352 cityc user532 citya user143 countryc user532 university1
my desired output each unique user positioned on own row, separate columns "city", "country", , "university" - this:
unique user city country university user142 citya user532 citya countrya university1 user352 cityc user143 countryc
i've been able separate edgelist desired output in excel using number of steps using following index, match, , array formulae. lengthy, , excel crashes when large edgelists involved.
excel process:
step 1:
=index($a$2:$a$5819, match(0, countif($d$1:$d1, $a$2:$a$5819), 0))
this filters users unique records (cola) , places them in new column (cold).
step 2:
=arrayformula(iferror(index($b$2:$b$174, match(0, countif($d2:d2,$b$2:$b$174)+if($a$2:$a$174<>$d2, 1, 0), 0)), 0))
this looks down "target" column (b), against unique users in column created in step 1 (cold). not order output across columns, however. left right may read city, country, university, or country, university, city, etc each unique user.
step 3:
=transpose(sort(transpose(e2:h2)))
this step orders output alphabetically across columns, left-right columns read city, country, university.
my main problem slow , manual process, i'm looking utilise sql in big query make flow efficient possible.
any information on how begin structuring query sort data desired output highly appreciated.
i have experience sql, has been in terms of text extraction , regex.
thanks!
in bigquery, conditional aggregation. this:
select source, max(case when type = 'city' target end) city, max(case when type = 'country' target end) country, max(case when type = 'university' target end) university demographics d group source;
your question doesn't specify column type
. seems need 1 if have different types of information.
wiki
Comments
Post a Comment